Roadmap¶

This is a living document. See CONTRIBUTING.md to get involved.

Shipping now (v0.1)¶

v0.1 is the first tagged release — early-feedback alpha. Everything in this list runs end-to-end on the maintainer's hardware. v1.0 is gated on real-world feedback from external users; see Known issues below.

Kid Mode -- opt-in child-safety guardrails: topic blocklist, self-harm redirect, content filter, age-appropriate vocabulary (on by default, disable with DOTTY_KID_MODE=false)
Local ASR -- FunASR SenseVoiceSmall, English-pinned, runs on your Docker host
Local TTS -- Piper voice synthesis, no cloud dependency
Streaming LLM responses -- NDJSON token-level streaming with first-token latency ~1.2s
Emoji-driven expressions -- LLM output prefixed with emoji; firmware maps to face animations
Persona system -- swappable persona files (personas/*.md), customizable via make setup
MCP tool integration -- 11 firmware-advertised tools (head servos, LEDs, camera, reminders, volume, brightness, screen theme)
Photo-based vision -- "What do you see?" triggers camera capture + vision model description
Calendar context injection -- Google Calendar events surfaced to the LLM for contextual reminders
Length-aware brevity -- default 1-2 short sentences, up to 6 for open-ended asks (story, explanation, list); cap enforced in code via MAX_SENTENCES
ASR noise filtering -- rejects punctuation-only / sub-threshold utterances
ACP session caching -- long-lived sessions with idle/turn-count/wall-clock rotation
Single-host deployment -- compose.all-in-one.yml runs everything on one machine
Multi-host deployment -- documented split across Docker host + ZeroClaw host
make setup wizard -- interactive first-run: name your robot, fetch models, validate config
MkDocs Material docs site -- architecture, protocols, quickstart, troubleshooting, FAQ
Kid Mode channel routing -- voice channels are kid-safe by default; the bridge's kid-mode sandwich (English-pin, emoji prefix, topic blocklist, jailbreak resistance) only applies when the inbound channel is in VOICE_CHANNELS, so messaging-platform channels (Discord, Telegram, etc.) skip it automatically. Pair with a separate ZeroClaw daemon on a more capable model for an unrestricted chat surface
Bridge /admin/* endpoints -- localhost-only HTTP API for runtime config mutation: toggle kid-mode (/admin/kid-mode), flip smart-mode (/admin/smart-mode), overwrite persona files (/admin/persona), swap a daemon's default_model in its config.toml (/admin/model), and amend the MCP tool allowlist (/admin/safety, py_compile-validated). Paths and systemd unit names are env-configurable
/xiaozhi/admin/* endpoints -- live-session control surface on xiaozhi-server: set-state, set-toggle, set-tier1slim-model, set-face-identified, set-head-angles, inject-text, abort, take-photo, play-asset, songs, say, devices. See architecture.md
Two-tier voice LLM (Tier1Slim) -- qwen3.5:4b on local llama-swap handles plain conversational turns directly; tool calls (memory_lookup, think_hard, take_photo, play_song) escalate to the bridge via /api/voice/escalate. Default LLM since commit b73f583. See tier1slim.md
Smart-mode hot-swap -- when DOTTY_VOICE_PROVIDER=tier1slim, smart-mode flips call /xiaozhi/admin/set-tier1slim-model to mutate the live provider's model / url / api_key in place — no docker restart, no daemon restart, instant. Legacy =zeroclaw path still rewrites config.toml and restarts the daemon
llama-swap voice/coding matrix -- qwen3.5:4b (voice inner loop) + qwen3.6:27b-think (think_hard target) co-resident under the voice matrix set; qwen3.6:27b for pi CLI runs alone under coding. See cookbook/llama-swap-concurrent-models.md
Perception event bus -- firmware face_detected / face_lost / sound_event / state_changed frames relay through xiaozhi's EventTextMessageHandler to the bridge's /api/perception/event, fanned out to six consumer tasks (face_greeter, sound_turner, face_lost_aborter, wake_word_turner, face_identified_refresher, purr_player)
Fully-local backend support -- compose.local.override.yml for Ollama (single binary, simple) plus llama-swap recipe for concurrent multi-model serving. Both shipped; choose based on whether you need multiple models resident at once
Voice catalog + install helper -- docs/voice-catalog.md (12 Piper + 6 EdgeTTS) + make voice-install -- shipped
Versioned docs via mike -- /latest/, /v0.1/, /dev/ URL structure shipped
Observability hooks -- Prometheus /metrics + Grafana dashboard at monitoring/grafana-dashboard.json -- shipped
Head-pet hold-to-listen wake -- firmware fires WakeWordInvoke("head_pet_hold") after 2 s touch; works in the dark. Also emits head_pet_started/_ended perception events for purr consumer

Known issues (as of v0.1)¶

The 30+ planning docs accumulated during the v0.1 prep sprint surfaced these. None are blockers for trying Dotty out, but you should know about them:

Face emoji rendering — only 5 of 9 enforced emotions render distinctly on the LCD. Sad clamps to a one-eye wink (rotation -400 clamps to 0 on left eye), Surprise is byte-identical to Neutral (weight 120 clamps to 100), Loving is a copy-paste of Happy, Laughing is an alias of Happy by design. Fix is queued (~25-40 LoC firmware patch).
Sound-direction localizer always reads left. I2S channel 1 on the M5Stack CoreS3 is the AEC speaker-loopback reference, not the right mic. Energy detection works; direction does not. Sound-driven head-turn behaves accordingly.
Kid-voice ASR accuracy — SenseVoiceSmall mangles short kid utterances ("macarena" → "maarna"). Post-ASR corrections + phrase boost help but have hit their ceiling. whisper.cpp / faster-whisper swap planned (Phase 1 CPU-only ships immediately, Phase 2 GPU once dual RTX 3060s arrive).
Privacy-indicator LEDs not yet hardwired. The camera streams DMA buffers permanently after init; mic + camera enable are software-controlled with no hardware-guaranteed indicator. Hard prereq for face recognition / continuous vision; do not ship those features without it.
Smart Mode regression (fixed in v0.1 itself) — between 434988d and the v0.1 fix, every voice "smart mode" trigger silently fell back to the default model. If you're forking from before the v0.1 tag, pull the fix.

In progress¶

Actively being worked on or partially complete. Big push 2026-04-25 evening: ~26 commits scaffolding much of what was previously "Planned" — see CHANGELOG.md [Unreleased] for the full inventory. Most items below have code on main but are not yet deployed live or fully wired.

Phase 4 firmware StateManager bench checks -- the on-device six-state mutex (idle / talk / story_time / security / sleep / dance) and 12-pixel LED contract shipped to the active firmware fork (commit d78118b, 2026-04-27) and end-to-end-verified autonomously. Visual / interactive bench checks on the live device pending in #38 (Phase 4 foundation), #39 (Phase 5 sleep behaviour), #40 (Phase 6 security behaviour). The firmware/firmware/ submodule pin in this repo lags the active fork; bump (or build from the active fork) to flash a Phase 4+ build.
CI pipeline -- YAML lint, compose validation, config parse check, firmware dry-build, docs link check
Firmware release workflow -- GitHub Actions building .bin artifacts on tag push
Quickstart improvements -- linear "flash, clone, configure, talk" path assuming published firmware releases
First-audio latency reduction -- two-tier path lands inner-loop turns under 1 s warm; further improvements queued (escalation parallelism, llama.cpp MTP PR #22673 for ~1.5-2× on think_hard)
ASR accuracy for children's speech -- post-ASR corrections live; Whisper Phase 1 scaffold landed at v0.1; A/B verification pending
Face detection + tracking -- shipped firmware-side; smoother+faster tuning queued (EMA 0.5, speed 500, deadband, MSR thr 0.40). Flash + bench-test pending
Layer 4 identity (description-based) -- shipped + deployed. VLM (Gemini 2.0 Flash) returns a free-form description plus a roster name match against ~/.zeroclaw/household.yaml's appearance: field. No biometrics, no persistent identifiers. The earlier dlib biometric scaffold (bridge/face_db.py + face_recognizer.py + on-device FaceRecognizer + ParentalGate + 4 MCP tools) was removed — description-based covers the use case and biometrics conflicted with the no-storage identity posture
Layer 6 proactive greetings -- bridge/proactive_greeter.py + lifespan wiring shipped. Cooldown + time-of-day windowing + kid-safe sandwich + calendar-aware prompt + template fallback. Depends on Layer 4 for named greetings; works today with face_detected (unknown identity) for generic
Layer 1 privacy-indicator LEDs -- firmware scaffold drives mic/camera state via RAII peripheral guards. Camera VIDIOC_STREAMOFF wiring deferred (closes the always-streaming hole; queued)
Wake word "Hey Dotty" -- interim shipped: firmware default switched Chinese → English "Hi, ESP". Custom "Hey Dotty" microWakeWord roadmap documented (docs/wake-word.md); needs sample collection + Colab training (~2 weeks calendar)
Purr-on-head-pet -- server consumer shipped (_perception_purr_player); fires on head_pet_started. Asset path bridge/assets/purr.opus is a drop-in (asset itself not committed)
Dancing mode -- shipped at v0.1; karaoke + LLM-initiated dance + Phase 2 vocal singing remain
Reproducible + signed firmware builds -- SBOM + signed-releases scaffolds shipped. Maintainer GPG key + IDF Dockerfile SHA256 pin pending

Planned¶

Designed but not yet started. Roughly in priority order.

Improve Security Mode -- expand beyond the current LED-flash + alert posture: configurable triggers, escalation rules, and richer notification surfaces
Improve Story Mode -- longer-form narrative pacing, character voices, save/resume, and child-led branching
Easily configurable model profiles -- first-class config surface for swapping the local / kid / smart models (and adding new ones) without hand-editing daemon config.toml files
Improve Kid Mode -- configurable age band -- per-child age setting that tunes vocabulary, topic blocklist strictness, and response length; today Kid Mode is one-size-fits-all
Improve Dance Mode -- user song library -- let users drop their own audio files into a song folder and have Dotty discover, list, and dance to them (current dance set is built-in only)
Speech bubble sync -- tie on-screen text bubble visibility to actual audio playback state (deferred at v0.1 — Brett says timing looks fine in practice)
Singing mode -- vocal synthesis or pitch-shifted TTS over backing tracks (Phase 2 of dance work)
Runtime OTA provisioning -- captive-portal WiFi + OTA URL setup on first boot (no rebuild to retarget)
Layer 2.5 stereo mic + camera person tracking -- sound-source localization + camera fusion for 360° awareness in idle mode
Phase 3 continuous vision classifier -- EfficientDet/YOLOX at 1Hz on the Docker host GPU once dual RTX 3060s land
Sleep-mode "dream" memory compaction -- while Dotty is in sleep state (idle, overnight), a background pass feeds the day's ZeroClaw memory writes (perception events, conversation turns, declared facts, scene snapshots) to the smart model for compaction + summarisation. Two outputs: rewrite/prune the raw memory store (drop duplicates and low-signal perception spam, keep durable facts and notable events), and emit a separate human-readable daily summary that next-day turns can pull as "yesterday's context". Sleep-state-gated so the heavy LLM call never runs during interactive states. Pairs with the per-person memory and ambient scene memory work
Variant board port guide -- walkthrough for adding support for other ESP32-S3 boards

Community wishlist¶

Ideas we would welcome help with. None are blockers.

ESP Web Tools web flasher -- one-click browser flash via esptool.js on GitHub Pages
Voice catalog + install helper -- curated Piper/EdgeTTS voices with a download script
Versioned docs via mike -- /latest/ + /v1.0/ so older firmware users see matching docs
Observability hooks -- Prometheus metrics on the bridge (latency, token counts, error rates) + starter Grafana dashboard
Variant board port guide -- walkthrough for adding support for other ESP32-S3 boards
Face/emoji asset catalog -- document the expression-id-to-emoji mapping; show how to add a new face
Firmware/server compatibility matrix -- pin which server versions work with which firmware versions
make audit network verifier -- user-runnable tool to confirm "local except LLM" claim against their own install
Reproducible + signed firmware builds -- toolchain-pinned .bin with GPG-signed release artifacts