Observability¶
The zeroclaw-bridge exposes a Prometheus exposition endpoint at /metrics
covering first-audio latency, request rate / errors per endpoint, ACP
session state, perception events, calendar health, and Kid Mode state.
A starter Grafana dashboard lives at
monitoring/grafana-dashboard.json.
These metrics are the measurement prerequisite for the first-audio latency reduction follow-up work. Numbers come first; you can't tune what you can't see.
LAN-only — never expose /metrics to the internet
The bridge listener should live on your home LAN (or behind a
reverse proxy with auth). /metrics is unauthenticated by design
— Prometheus expects to scrape it directly. Do not publish
the bridge port to the public internet.
Enable¶
Metrics are on by default once the bridge has its dependency installed:
pip install -r bridge/requirements.txt # picks up prometheus-client
systemctl restart zeroclaw-bridge # or `docker compose restart bridge`
curl -s http://<BRIDGE_HOST>:8080/metrics | head -20
If prometheus-client is missing the bridge still serves traffic — it
just returns a 503 from /metrics so you (and your alerting) can
notice the degraded state instead of waiting on a timeout.
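The difference between "degraded" and "down" can be checked with a small probe. The following is a minimal sketch, not part of the bridge: the function names `classify_metrics_status` and `probe_metrics` and the state labels are hypothetical, and only the 200 / 503 behaviour is taken from this page.

```python
import urllib.error
import urllib.request
from typing import Optional


def classify_metrics_status(status: Optional[int]) -> str:
    """Map an HTTP status from /metrics to a coarse health label.

    The label strings are hypothetical names chosen for this sketch.
    """
    if status is None:
        return "unreachable"   # no HTTP response at all (timeout, refused)
    if status == 200:
        return "ok"            # metrics are being served
    if status == 503:
        return "degraded"      # bridge answers, prometheus-client likely missing
    return "unexpected"        # anything else deserves a look at the bridge log


def probe_metrics(url: str, timeout: float = 3.0) -> str:
    """Fetch the /metrics URL once and classify the outcome."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify_metrics_status(resp.status)
    except urllib.error.HTTPError as exc:
        # urllib raises on 4xx/5xx; the status code lives on the exception.
        return classify_metrics_status(exc.code)
    except (urllib.error.URLError, OSError):
        return classify_metrics_status(None)
```

Calling `probe_metrics("http://<BRIDGE_HOST>:8080/metrics")` from a machine on the same LAN would then return one of the four labels.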
Prometheus scrape config¶
Add to your prometheus.yml:
scrape_configs:
  - job_name: dotty-bridge
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ["<BRIDGE_HOST>:8080"]
        labels:
          service: zeroclaw-bridge
          env: home
Replace <BRIDGE_HOST> with the LAN address of the box running the
bridge. Reload Prometheus (send SIGHUP, or POST to /-/reload if it runs
with --web.enable-lifecycle) and confirm the target shows UP under
Status → Targets.
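The target check can also be scripted against the Prometheus HTTP API, which lists scrape targets at /api/v1/targets. The sketch below only parses a response body; the helper name `target_health`, the hostnames, and the sample payload are made up for illustration and trimmed to the fields the function reads.

```python
import json
from typing import Dict


def target_health(targets_json: str, job: str) -> Dict[str, str]:
    """Return {instance: health} for every active target of `job`."""
    payload = json.loads(targets_json)
    result = {}
    for target in payload["data"]["activeTargets"]:
        labels = target.get("labels", {})
        if labels.get("job") == job:
            result[labels.get("instance", "?")] = target.get("health", "unknown")
    return result


# Trimmed, made-up response body; a real one carries many more fields.
SAMPLE = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "dotty-bridge", "instance": "bridge.lan:8080"},
         "health": "up"},
        {"labels": {"job": "node", "instance": "nas.lan:9100"},
         "health": "down"},
    ]},
})

bridge_targets = target_health(SAMPLE, "dotty-bridge")
```

An empty result means Prometheus has no active target for the job at all, which usually points at the scrape config rather than at the bridge.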
Import the Grafana dashboard¶
- Open Grafana → Dashboards → New → Import.
- Click Upload JSON file and pick monitoring/grafana-dashboard.json.
- When prompted for the DS_PROMETHEUS datasource, choose your Prometheus instance. Save.
The dashboard ships with eight panels: first-audio latency (P50/P95/P99), request rate by endpoint, error rate by endpoint+kind, active ACP sessions, Smart-Mode invocation rate, perception events per minute (stacked by type), calendar fetch failure rate, and a Kid Mode single-stat toggle.
What each metric means¶
| Metric | Type | What it tells you |
|---|---|---|
| dotty_first_audio_latency_seconds | Histogram | Bridge-side seconds from request received to first content chunk emitted. Tightly correlated with perceived robot responsiveness. |
| dotty_request_duration_seconds{endpoint} | Histogram | End-to-end duration per endpoint (message, message_stream, vision_explain, calendar_today, perception_event). |
| dotty_request_errors_total{endpoint,kind} | Counter | Errors partitioned by endpoint and kind (timeout, binary_missing, exception). |
| dotty_llm_tokens_total{kind,model} | Counter | LLM token volume; reserved for future per-call accounting. |
| dotty_active_acp_sessions | Gauge | Live ACP child sessions. The bridge is single-child so this is normally 0 (idle) or 1 (in flight). |
| dotty_calendar_fetch_failures_total{kind} | Counter | Google Calendar fetch errors partitioned by kind (timeout, parse, other, orchestrator). The cache backs off automatically; sustained failures mean look at the bridge log. A spike of timeout reads as a network/quota issue; parse usually means the upstream gws CLI changed shape. |
| dotty_smart_mode_invocations_total | Counter | Smart-Mode requests (the metadata.smart_mode flag opted into the larger LLM). |
| dotty_kid_mode_active | Gauge | 1 if Kid Mode guardrails are active, 0 otherwise. Flipped live by the portal admin endpoint. |
| dotty_perception_events_total{type} | Counter | Ambient-perception events ingested, partitioned by face_detected / face_lost / sound_event. |
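The two gauges in the table are plain `name value` lines in the exposition text, so they can be read without a Prometheus server. This is a rough sketch under that assumption: the helper `read_unlabelled_sample` is hypothetical, and the sample text (including its HELP wording) is illustrative rather than captured from a real bridge.

```python
from typing import Optional


def read_unlabelled_sample(exposition: str, metric: str) -> Optional[float]:
    """Return the value of an unlabelled sample, or None if absent.

    Handles only the simple `name value` form; labelled series such as
    histogram buckets are deliberately ignored.
    """
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if parts[0] == metric and len(parts) >= 2:
            return float(parts[1])
    return None


# Illustrative text in the exposition format, not real bridge output.
SAMPLE = """\
# HELP dotty_kid_mode_active 1 if Kid Mode guardrails are active
# TYPE dotty_kid_mode_active gauge
dotty_kid_mode_active 1.0
# TYPE dotty_active_acp_sessions gauge
dotty_active_acp_sessions 0.0
"""

kid_mode_on = read_unlabelled_sample(SAMPLE, "dotty_kid_mode_active") == 1.0
```

For anything with labels (the counters and histograms above), querying Prometheus is the more robust route than parsing text by hand.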
Suggested alerts¶
Start small — these are the four signals worth paging on for a home-deployed robot:
- First-audio latency P95 > 3 s for 10 minutes.
  histogram_quantile(0.95, sum by (le) (rate(dotty_first_audio_latency_seconds_bucket[5m]))) > 3
- Sustained error rate.
  sum by (endpoint, kind) (rate(dotty_request_errors_total[5m])) > 0.05
- Calendar fetch flatlined failing.
  sum(rate(dotty_calendar_fetch_failures_total[15m])) > 0.005 for 30 m.
- Bridge target down.
  up{job="dotty-bridge"} == 0 for 5 m. Catches the case where systemd / Docker hasn't restarted the bridge.
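To see what the P95 alert actually measures, it helps to work through how a quantile is estimated from histogram buckets. The sketch below reimplements the linear interpolation that PromQL's histogram_quantile applies within a bucket. The bucket boundaries and counts are invented for the example; the bridge's real bucket layout is not stated on this page.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank.
    The last bucket must be the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for 100 requests.
BUCKETS = [(0.5, 10), (1.0, 60), (2.0, 90), (5.0, 99), (math.inf, 100)]

p95 = histogram_quantile(0.95, BUCKETS)  # lands in the (2.0, 5.0] bucket
```

With these made-up counts the estimate is 2 + 3 × 5/9 ≈ 3.67 s, which is above the 3 s threshold. The alert expression feeds per-second bucket rates rather than raw counts into the same calculation, and the result can never be more precise than the bucket boundaries allow.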
Adding new metrics¶
bridge/metrics.py is the single source of truth. New metrics belong
in that file with a dotty_ prefix and bounded label cardinality —
never label on user input, device IDs, or session IDs (each unique
value adds a permanent time series). When you wire the metric into
bridge.py, wrap the call in _safe_metric(...) so a typo or label
mismatch can't break the request path.
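The two rules above (bounded labels, failures that never reach the request path) can be sketched as follows. This is an illustration only, not the code in bridge/metrics.py: `safe_metric_sketch`, `bounded_kind`, and the allow-list are hypothetical names, the real `_safe_metric` signature is not shown on this page, and collapsing unknown values to "exception" is an assumption of this sketch.

```python
import logging
from typing import Any, Callable

log = logging.getLogger("bridge.metrics")

# Hypothetical allow-list, using the error kinds named in the metrics table.
ALLOWED_ERROR_KINDS = {"timeout", "binary_missing", "exception"}


def bounded_kind(kind: str) -> str:
    """Collapse unknown values so the label set cannot grow unbounded.

    Falling back to "exception" is an assumption made for this sketch.
    """
    return kind if kind in ALLOWED_ERROR_KINDS else "exception"


def safe_metric_sketch(update: Callable[..., Any], *args: Any, **kwargs: Any) -> None:
    """Run a metric update, logging instead of raising on any failure."""
    try:
        update(*args, **kwargs)
    except Exception:
        # A typo or label mismatch is logged; the caller carries on.
        log.warning("metric update failed", exc_info=True)
```

Assuming `errors` is a prometheus_client Counter with endpoint and kind labels, a call site might look like `safe_metric_sketch(lambda: errors.labels(endpoint="message", kind=bounded_kind(kind)).inc())`.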
Cross-references¶
- Architecture — where the bridge sits in the pipeline.
- Voice Pipeline — context for the first-audio latency budget; pair this dashboard with the latency-reduction work.
- Troubleshooting — symptom-to-fix when the dashboard shows red.