Observability¶

The zeroclaw-bridge exposes a Prometheus exposition endpoint at /metrics covering first-audio latency, request rate / errors per endpoint, ACP session state, perception events, calendar health, and Kid Mode state. A starter Grafana dashboard lives at monitoring/grafana-dashboard.json.

These metrics are the measurement prerequisite for the first-audio latency reduction follow-up work. Numbers come first; you can't tune what you can't see.

LAN-only — never expose /metrics to the internet

The bridge listener should live on your home LAN (or behind a reverse proxy with auth). /metrics is unauthenticated by design — Prometheus expects to scrape it directly. Do not publish the bridge port to the public internet.

Enable¶

Metrics are on by default once the bridge has its dependency installed:

pip install -r bridge/requirements.txt   # picks up prometheus-client
systemctl restart zeroclaw-bridge        # or `docker compose restart bridge`
curl -s http://<BRIDGE_HOST>:8080/metrics | head -20

If prometheus-client is missing the bridge still serves traffic — it just returns a 503 from /metrics so you (and your alerting) can notice the degraded state instead of waiting on a timeout.

Prometheus scrape config¶

Add to your prometheus.yml:

scrape_configs:
  - job_name: dotty-bridge
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["<BRIDGE_HOST>:8080"]
        labels:
          service: zeroclaw-bridge
          env: home

Replace <BRIDGE_HOST> with the LAN address of the box running the bridge. Reload Prometheus (SIGHUP or /-/reload) and confirm the target shows UP under Status → Targets.

Import the Grafana dashboard¶

Open Grafana → Dashboards → New → Import.
Click Upload JSON file and pick monitoring/grafana-dashboard.json.
When prompted for the DS_PROMETHEUS datasource, choose your Prometheus instance. Save.

The dashboard ships with eight panels: first-audio latency (P50/P95/P99), request rate by endpoint, error rate by endpoint+kind, active ACP sessions, Smart-Mode invocation rate, perception events per minute (stacked by type), calendar fetch failure rate, and a Kid Mode single-stat toggle.

What each metric means¶

Metric	Type	What it tells you
`dotty_first_audio_latency_seconds`	Histogram	Bridge-side seconds from request received to first content chunk emitted. Tightly correlated with perceived robot responsiveness.
`dotty_request_duration_seconds{endpoint}`	Histogram	End-to-end duration per endpoint (`message`, `message_stream`, `vision_explain`, `calendar_today`, `perception_event`).
`dotty_request_errors_total{endpoint,kind}`	Counter	Errors partitioned by endpoint and `kind` (`timeout`, `binary_missing`, `exception`).
`dotty_llm_tokens_total{kind,model}`	Counter	LLM token volume; reserved for future per-call accounting.
`dotty_active_acp_sessions`	Gauge	Live ACP child sessions. The bridge is single-child so this is normally 0 (idle) or 1 (in flight).
`dotty_calendar_fetch_failures_total{kind}`	Counter	Google Calendar fetch errors partitioned by `kind` (`timeout`, `parse`, `other`, `orchestrator`). The cache backs off automatically; sustained failures mean look at the bridge log. A spike of `timeout` reads as a network/quota issue; `parse` usually means the upstream `gws` CLI changed shape.
`dotty_smart_mode_invocations_total`	Counter	Smart-Mode requests (the `metadata.smart_mode` flag opted into the larger LLM).
`dotty_kid_mode_active`	Gauge	`1` if Kid Mode guardrails are active, `0` otherwise. Flipped live by the portal admin endpoint.
`dotty_perception_events_total{type}`	Counter	Ambient-perception events ingested, partitioned by `face_detected` / `face_lost` / `sound_event`.

Suggested alerts¶

Start small — these are the four signals worth paging on for a home-deployed robot:

First-audio latency P95 > 3 s for 10 minutes. histogram_quantile(0.95, sum by (le) (rate(dotty_first_audio_latency_seconds_bucket[5m]))) > 3
Sustained error rate. sum by (endpoint, kind) (rate(dotty_request_errors_total[5m])) > 0.05
Calendar fetch flatlined failing. sum(rate(dotty_calendar_fetch_failures_total[15m])) > 0.005 for 30 m.
Bridge target down. up{job="dotty-bridge"} == 0 for 5 m. Catches the case where systemd / Docker hasn't restarted the bridge.

Adding new metrics¶

bridge/metrics.py is the single source of truth. New metrics belong in that file with a dotty_ prefix and bounded label cardinality — never label on user input, device IDs, or session IDs (each unique value adds a permanent time series). When you wire the metric into bridge.py, wrap the call in _safe_metric(...) so a typo or label mismatch can't break the request path.

Cross-references¶

Architecture — where the bridge sits in the pipeline.
Voice Pipeline — context for the first-audio latency budget; pair this dashboard with the latency-reduction work.
Troubleshooting — symptom-to-fix when the dashboard shows red.