Roadmap¶
This is a living document. See CONTRIBUTING.md to get involved.
Shipping now (v0.1)¶
v0.1 is the first tagged release — early-feedback alpha. Everything in this list runs end-to-end on the maintainer's hardware. v1.0 is gated on real-world feedback from external users; see Known issues below.
- Kid Mode -- opt-in child-safety guardrails: topic blocklist, self-harm redirect, content filter, age-appropriate vocabulary (on by default, disable with
DOTTY_KID_MODE=false) - Local ASR -- FunASR SenseVoiceSmall, English-pinned, runs on your Docker host
- Local TTS -- Piper voice synthesis, no cloud dependency
- Streaming LLM responses -- NDJSON token-level streaming with first-token latency ~1.2s
- Emoji-driven expressions -- LLM output prefixed with emoji; firmware maps to face animations
- Persona system -- swappable persona files (
personas/*.md), customizable viamake setup - MCP tool integration -- 11 firmware-advertised tools (head servos, LEDs, camera, reminders, volume, brightness, screen theme)
- Photo-based vision -- "What do you see?" triggers camera capture + vision model description
- Calendar context injection -- Google Calendar events surfaced to the LLM for contextual reminders
- Length-aware brevity -- default 1-2 short sentences, up to 6 for open-ended asks (story, explanation, list); cap enforced in code via
MAX_SENTENCES - ASR noise filtering -- rejects punctuation-only / sub-threshold utterances
- ACP session caching -- long-lived sessions with idle/turn-count/wall-clock rotation
- Single-host deployment --
compose.all-in-one.ymlruns everything on one machine - Multi-host deployment -- documented split across Docker host + ZeroClaw host
make setupwizard -- interactive first-run: name your robot, fetch models, validate config- MkDocs Material docs site -- architecture, protocols, quickstart, troubleshooting, FAQ
- Kid Mode channel routing -- voice channels are kid-safe by default; the bridge's kid-mode sandwich (English-pin, emoji prefix, topic blocklist, jailbreak resistance) only applies when the inbound
channelis inVOICE_CHANNELS, so messaging-platform channels (Discord, Telegram, etc.) skip it automatically. Pair with a separate ZeroClaw daemon on a more capable model for an unrestricted chat surface - Bridge
/admin/*endpoints -- localhost-only HTTP API for runtime config mutation: toggle kid-mode (/admin/kid-mode), flip smart-mode (/admin/smart-mode), overwrite persona files (/admin/persona), swap a daemon'sdefault_modelin itsconfig.toml(/admin/model), and amend the MCP tool allowlist (/admin/safety, py_compile-validated). Paths and systemd unit names are env-configurable /xiaozhi/admin/*endpoints -- live-session control surface on xiaozhi-server:set-state,set-toggle,set-tier1slim-model,set-face-identified,set-head-angles,inject-text,abort,take-photo,play-asset,songs,say,devices. Seearchitecture.md- Two-tier voice LLM (Tier1Slim) --
qwen3.5:4bon local llama-swap handles plain conversational turns directly; tool calls (memory_lookup,think_hard,take_photo,play_song) escalate to the bridge via/api/voice/escalate. Default LLM since commitb73f583. Seetier1slim.md - Smart-mode hot-swap -- when
DOTTY_VOICE_PROVIDER=tier1slim, smart-mode flips call/xiaozhi/admin/set-tier1slim-modelto mutate the live provider'smodel/url/api_keyin place — no docker restart, no daemon restart, instant. Legacy=zeroclawpath still rewritesconfig.tomland restarts the daemon - llama-swap voice/coding matrix --
qwen3.5:4b(voice inner loop) +qwen3.6:27b-think(think_hard target) co-resident under thevoicematrix set;qwen3.6:27bforpiCLI runs alone undercoding. Seecookbook/llama-swap-concurrent-models.md - Perception event bus -- firmware
face_detected/face_lost/sound_event/state_changedframes relay through xiaozhi'sEventTextMessageHandlerto the bridge's/api/perception/event, fanned out to six consumer tasks (face_greeter, sound_turner, face_lost_aborter, wake_word_turner, face_identified_refresher, purr_player) - Fully-local backend support --
compose.local.override.ymlfor Ollama (single binary, simple) plus llama-swap recipe for concurrent multi-model serving. Both shipped; choose based on whether you need multiple models resident at once - Voice catalog + install helper --
docs/voice-catalog.md(12 Piper + 6 EdgeTTS) +make voice-install-- shipped - Versioned docs via
mike--/latest/,/v0.1/,/dev/URL structure shipped - Observability hooks -- Prometheus
/metrics+ Grafana dashboard atmonitoring/grafana-dashboard.json-- shipped - Head-pet hold-to-listen wake -- firmware fires
WakeWordInvoke("head_pet_hold")after 2 s touch; works in the dark. Also emitshead_pet_started/_endedperception events for purr consumer
Known issues (as of v0.1)¶
The 30+ planning docs accumulated during the v0.1 prep sprint surfaced these. None are blockers for trying Dotty out, but you should know about them:
- Face emoji rendering — only 5 of 9 enforced emotions render distinctly on the LCD. Sad clamps to a one-eye wink (rotation
-400clamps to 0 on left eye), Surprise is byte-identical to Neutral (weight120clamps to100), Loving is a copy-paste of Happy, Laughing is an alias of Happy by design. Fix is queued (~25-40 LoC firmware patch). - Sound-direction localizer always reads left. I2S channel 1 on the M5Stack CoreS3 is the AEC speaker-loopback reference, not the right mic. Energy detection works; direction does not. Sound-driven head-turn behaves accordingly.
- Kid-voice ASR accuracy — SenseVoiceSmall mangles short kid utterances ("macarena" → "maarna"). Post-ASR corrections + phrase boost help but have hit their ceiling. whisper.cpp / faster-whisper swap planned (Phase 1 CPU-only ships immediately, Phase 2 GPU once dual RTX 3060s arrive).
- Privacy-indicator LEDs not yet hardwired. The camera streams DMA buffers permanently after init; mic + camera enable are software-controlled with no hardware-guaranteed indicator. Hard prereq for face recognition / continuous vision; do not ship those features without it.
- Smart Mode regression (fixed in v0.1 itself) — between
434988dand the v0.1 fix, every voice "smart mode" trigger silently fell back to the default model. If you're forking from before the v0.1 tag, pull the fix.
In progress¶
Actively being worked on or partially complete. Big push 2026-04-25 evening: ~26 commits scaffolding much of what was previously "Planned" — see CHANGELOG.md [Unreleased] for the full inventory. Most items below have code on main but are not yet deployed live or fully wired.
- Phase 4 firmware StateManager bench checks -- the on-device six-state mutex (
idle / talk / story_time / security / sleep / dance) and 12-pixel LED contract shipped to the active firmware fork (commitd78118b, 2026-04-27) and end-to-end-verified autonomously. Visual / interactive bench checks on the live device pending in #38 (Phase 4 foundation), #39 (Phase 5 sleep behaviour), #40 (Phase 6 security behaviour). Thefirmware/firmware/submodule pin in this repo lags the active fork; bump (or build from the active fork) to flash a Phase 4+ build. - CI pipeline -- YAML lint, compose validation, config parse check, firmware dry-build, docs link check
- Firmware release workflow -- GitHub Actions building
.binartifacts on tag push - Quickstart improvements -- linear "flash, clone, configure, talk" path assuming published firmware releases
- First-audio latency reduction -- two-tier path lands inner-loop turns under 1 s warm; further improvements queued (escalation parallelism, llama.cpp MTP PR #22673 for ~1.5-2× on think_hard)
- ASR accuracy for children's speech -- post-ASR corrections live; Whisper Phase 1 scaffold landed at v0.1; A/B verification pending
- Face detection + tracking -- shipped firmware-side; smoother+faster tuning queued (EMA 0.5, speed 500, deadband, MSR thr 0.40). Flash + bench-test pending
- Layer 4 identity (description-based) -- shipped + deployed. VLM (Gemini 2.0 Flash) returns a free-form description plus a roster name match against
~/.zeroclaw/household.yaml'sappearance:field. No biometrics, no persistent identifiers. The earlier dlib biometric scaffold (bridge/face_db.py+face_recognizer.py+ on-deviceFaceRecognizer+ParentalGate+ 4 MCP tools) was removed — description-based covers the use case and biometrics conflicted with the no-storage identity posture - Layer 6 proactive greetings --
bridge/proactive_greeter.py+ lifespan wiring shipped. Cooldown + time-of-day windowing + kid-safe sandwich + calendar-aware prompt + template fallback. Depends on Layer 4 for named greetings; works today withface_detected(unknown identity) for generic - Layer 1 privacy-indicator LEDs -- firmware scaffold drives mic/camera state via RAII peripheral guards. Camera
VIDIOC_STREAMOFFwiring deferred (closes the always-streaming hole; queued) - Wake word "Hey Dotty" -- interim shipped: firmware default switched Chinese → English "Hi, ESP". Custom "Hey Dotty" microWakeWord roadmap documented (
docs/wake-word.md); needs sample collection + Colab training (~2 weeks calendar) - Purr-on-head-pet -- server consumer shipped (
_perception_purr_player); fires onhead_pet_started. Asset pathbridge/assets/purr.opusis a drop-in (asset itself not committed) - Dancing mode -- shipped at v0.1; karaoke + LLM-initiated dance + Phase 2 vocal singing remain
- Reproducible + signed firmware builds -- SBOM + signed-releases scaffolds shipped. Maintainer GPG key + IDF Dockerfile SHA256 pin pending
Planned¶
Designed but not yet started. Roughly in priority order.
- Improve Security Mode -- expand beyond the current LED-flash + alert posture: configurable triggers, escalation rules, and richer notification surfaces
- Improve Story Mode -- longer-form narrative pacing, character voices, save/resume, and child-led branching
- Easily configurable model profiles -- first-class config surface for swapping the local / kid / smart models (and adding new ones) without hand-editing daemon
config.tomlfiles - Improve Kid Mode -- configurable age band -- per-child age setting that tunes vocabulary, topic blocklist strictness, and response length; today Kid Mode is one-size-fits-all
- Improve Dance Mode -- user song library -- let users drop their own audio files into a song folder and have Dotty discover, list, and dance to them (current dance set is built-in only)
- Speech bubble sync -- tie on-screen text bubble visibility to actual audio playback state (deferred at v0.1 — Brett says timing looks fine in practice)
- Singing mode -- vocal synthesis or pitch-shifted TTS over backing tracks (Phase 2 of dance work)
- Runtime OTA provisioning -- captive-portal WiFi + OTA URL setup on first boot (no rebuild to retarget)
- Layer 2.5 stereo mic + camera person tracking -- sound-source localization + camera fusion for 360° awareness in idle mode
- Phase 3 continuous vision classifier -- EfficientDet/YOLOX at 1Hz on the Docker host GPU once dual RTX 3060s land
- Sleep-mode "dream" memory compaction -- while Dotty is in
sleepstate (idle, overnight), a background pass feeds the day's ZeroClaw memory writes (perception events, conversation turns, declared facts, scene snapshots) to the smart model for compaction + summarisation. Two outputs: rewrite/prune the raw memory store (drop duplicates and low-signal perception spam, keep durable facts and notable events), and emit a separate human-readable daily summary that next-day turns can pull as "yesterday's context". Sleep-state-gated so the heavy LLM call never runs during interactive states. Pairs with the per-person memory and ambient scene memory work - Variant board port guide -- walkthrough for adding support for other ESP32-S3 boards
Community wishlist¶
Ideas we would welcome help with. None are blockers.
- ESP Web Tools web flasher -- one-click browser flash via
esptool.json GitHub Pages - Voice catalog + install helper -- curated Piper/EdgeTTS voices with a download script
- Versioned docs via
mike--/latest/+/v1.0/so older firmware users see matching docs - Observability hooks -- Prometheus metrics on the bridge (latency, token counts, error rates) + starter Grafana dashboard
- Variant board port guide -- walkthrough for adding support for other ESP32-S3 boards
- Face/emoji asset catalog -- document the expression-id-to-emoji mapping; show how to add a new face
- Firmware/server compatibility matrix -- pin which server versions work with which firmware versions
make auditnetwork verifier -- user-runnable tool to confirm "local except LLM" claim against their own install- Reproducible + signed firmware builds -- toolchain-pinned
.binwith GPG-signed release artifacts