Architecture¶
TL;DR¶
- Three hosts: robot (StackChan on your desk) → server (runs xiaozhi-esp32-server) → ZeroClaw host (runs ZeroClaw + a FastAPI bridge).
- Audio goes robot → server → (text) → ZeroClaw host → (response text) → server → (audio) → robot. The ZeroClaw host never touches audio.
- Two voice paths coexist, selected by
selected_module.LLMin.config.yaml: Tier1Slim(current default) — a small/fast LLM (qwen3.5:4b) handles plain turns directly from xiaozhi-server against a local llama-swap, no bridge round-trip. Tool calls escalate to the bridge via/api/voice/escalate.ZeroClawLLM(legacy single-tier) — every turn goes through the bridge → ZeroClaw → OpenRouter (Qwen3-30B).- Everything is LAN-local except cloud-routed LLM calls (smart-mode, ZeroClawLLM, VLM, audio caption). EdgeTTS is cloud too when selected; Piper is fully local.
- The "xiaozhi ↔ brain" seam is HTTP + ACP-over-stdio, not a library call — either side can be swapped independently.
- The robot speaks the Xiaozhi WebSocket protocol (see protocols.md). It has no hardcoded knowledge of ZeroClaw.
Topology¶
flowchart LR
subgraph Desk["Desk (LAN WiFi)"]
SC["M5Stack StackChan<br/>ESP32-S3"]
end
subgraph DockerHost["Server - <XIAOZHI_HOST>"]
XZ["xiaozhi-esp32-server<br/>(Docker)<br/>:8000 WS + :8003 HTTP"]
subgraph XZMods["voice pipeline"]
VAD["SileroVAD"]
ASR["FunASR<br/>SenseVoiceSmall"]
T1S["Tier1Slim<br/>(default LLM)"]
ZCL["ZeroClawLLM<br/>(legacy LLM)"]
TTS["TTS<br/>(LocalPiper default,<br/>EdgeTTS available)"]
end
end
subgraph Llama["llama-swap (Unraid)"]
SLIM["qwen3.5:4b<br/>(Tier1 inner loop)"]
THINK["qwen3.6:27b-think<br/>(think_hard target)"]
end
subgraph ZCHost["ZeroClaw host - <ZEROCLAW_HOST>"]
Bridge["zeroclaw-bridge<br/>FastAPI :8080<br/>systemd unit"]
ZC["ZeroClaw daemon<br/>(agent persona)"]
end
Cloud["OpenRouter<br/>(smart_mode, VLM,<br/>audio caption, legacy)"]
SC -->|"WebSocket<br/>Xiaozhi protocol"| XZ
XZ --- VAD & ASR & T1S & ZCL & TTS
T1S -->|"plain turn (default)"| SLIM
T1S -->|"tool_call escalation<br/>POST /api/voice/escalate"| Bridge
ZCL -->|"HTTP POST<br/>/api/message"| Bridge
Bridge -->|"think_hard"| THINK
Bridge -->|"JSON-RPC 2.0<br/>over stdio (ACP)"| ZC
ZC -.->|"LLM API call"| Cloud
Bridge -.->|"VLM / audio caption"| Cloud
T1S -.->|"smart_mode ON"| Cloud
XZ -->|"perception event<br/>POST /api/perception/event"| Bridge
TTS -->|"audio stream"| XZ
XZ -->|"WebSocket"| SC
Solid arrows are per-turn data flow; dotted arrows are cloud / conditional. The two voice paths share the same physical xiaozhi container — only one is active at a time via selected_module.LLM.
Actors¶
| Actor | Host | Role | Process |
|---|---|---|---|
| StackChan | Desk | Captures audio, plays audio, renders face, runs MCP tools for head/LED/camera | ESP32-S3 firmware built from m5stack/StackChan |
| xiaozhi-esp32-server | Server | VAD → ASR → LLM (proxy) → TTS pipeline, emotion dispatch, OTA, admin surface | Docker container |
| Tier1Slim custom provider | Server (inside container) | Default LLM provider — talks directly to llama-swap for plain turns, posts to /api/voice/escalate for tool calls |
Python, mounted via volume |
| ZeroClawLLM custom provider | Server (inside container) | Legacy single-tier LLM provider — translates xiaozhi's LLM-provider interface to an HTTP POST to the bridge | Python, mounted via volume |
| llama-swap | Unraid (or any GPU host) | Routes OpenAI-compatible requests to per-model llama-server children; co-loads qwen3.5:4b (voice inner loop) and qwen3.6:27b-think (think_hard target) |
Docker container (ghcr.io/mostlygeek/llama-swap:cuda) |
| zeroclaw-bridge | ZeroClaw host | Accepts HTTP POSTs from both voice paths + xiaozhi-server perception relay; spawns/holds a zeroclaw acp child; speaks ACP JSON-RPC to it |
FastAPI + uvicorn under systemd |
| ZeroClaw daemon | ZeroClaw host | The configured persona — runs the agent loop, calls the LLM, consults SOUL.md/IDENTITY.md/MEMORY.md |
Rust binary (zeroclaw acp) |
| OpenRouter | Cloud | Routes cloud LLM calls (smart_mode claude-sonnet-4-6, VLM gemini-2.0-flash, audio caption gemini-2.5-flash, and the legacy ZeroClaw path's Qwen3-30B) |
External |
Data flow (single utterance, Tier1Slim path — plain turn)¶
sequenceDiagram
autonumber
participant User
participant SC as StackChan
participant XZ as xiaozhi-server<br/>(server)
participant LS as llama-swap<br/>qwen3.5:4b
User->>SC: speaks
SC->>XZ: Opus audio frames (WebSocket)
XZ->>XZ: SileroVAD → speech end
XZ->>XZ: FunASR / Whisper → text
XZ->>LS: chat/completions (tools=auto)
LS-->>XZ: "😊 The sky is blue!" (no tool_calls)
XZ->>XZ: strip leading emoji → emotion frame
XZ->>XZ: TTS (Piper / EdgeTTS) → Opus frames
XZ-->>SC: audio + emotion
SC-->>User: speaks + face animation
Data flow (Tier1Slim path — escalated tool call)¶
sequenceDiagram
autonumber
participant XZ as xiaozhi-server
participant LS as llama-swap<br/>qwen3.5:4b
participant Br as zeroclaw-bridge
participant ZC as ZeroClaw
participant THK as llama-swap<br/>qwen3.6:27b-think
XZ->>LS: chat/completions (tools=auto)
LS-->>XZ: tool_calls=[think_hard(...)]
XZ-->>XZ: yield filler ("hmm, let me think…") to TTS
XZ->>Br: POST /api/voice/escalate<br/>{"tool":"think_hard","args":...}
Br->>THK: chat completion
THK-->>Br: reasoned answer
Br-->>XZ: {"result":"..."}
XZ->>LS: chat/completions (tool result in context)
LS-->>XZ: streamed final answer
XZ->>XZ: TTS → Opus frames
Data flow (legacy ZeroClawLLM path)¶
sequenceDiagram
autonumber
participant XZ as xiaozhi-server
participant Br as zeroclaw-bridge
participant ZC as ZeroClaw
participant LLM as OpenRouter<br/>→ Qwen3-30B
XZ->>Br: POST /api/message<br/>{"content":"...","channel":"stackchan"}
Br->>ZC: session/new + session/prompt<br/>(JSON-RPC over stdio)
ZC->>LLM: chat completion
LLM-->>ZC: response
ZC-->>Br: stdout JSON-RPC result
Br->>Br: ensure emoji prefix<br/>enforce English suffix
Br-->>XZ: {"response":"😊 Sure, the weather is..."}
See protocols.md for every wire format referenced above, and tier1slim.md for the four-tool escalation catalogue.
Why this shape¶
- Audio lives with the server because the StackChan firmware already speaks the Xiaozhi WS protocol; xiaozhi-esp32-server is the matching server. Putting the brain next to the mic would require us to reimplement that protocol, and we'd still need a voice server.
- The brain lives with ZeroClaw because ZeroClaw already has memory, tools, persona, channels, LLM routing wired up. The robot is just another channel into the same agent.
- A bridge lives between them because ZeroClaw's HTTP API is observational; to prompt a running agent you have to use ACP (JSON-RPC 2.0 over stdio) against a
zeroclaw acpchild. The bridge is a tiny FastAPI adapter that does exactly that. - The seam is a custom xiaozhi LLM provider (
zeroclaw.py, mounted into the container). xiaozhi-server thinks it's calling a local Python LLM class; the class just does an HTTP POST to<ZEROCLAW_HOST>:8080/api/message. That means we could swap ZeroClaw for anything HTTP-serviceable without touching xiaozhi.
What each host sees¶
Robot knows only:
- An OTA HTTP URL (http://<XIAOZHI_HOST>:8003/xiaozhi/ota/)
- A WS URL (provided by OTA response, typically ws://<XIAOZHI_HOST>:8000/xiaozhi/v1/)
It does not know about the ZeroClaw host, ZeroClaw itself, or any LLM.
Server / xiaozhi-server knows:
- Its own device-facing WS + OTA ports
- A handful of pluggable providers selected via data/.config.yaml selected_module:
- The LLM provider's base_url: http://<ZEROCLAW_HOST>:8080 to reach the bridge
It does not know about ACP, ZeroClaw workspace files, or the OpenRouter key.
ZeroClaw host / bridge knows:
- How to subprocess.Popen("zeroclaw acp") and speak JSON-RPC 2.0 on stdin/stdout
- How to wrap turns that arrive with channel="stackchan" in the English+emoji sandwich
- The OpenRouter API key lives in ZeroClaw's config, not in the bridge
ZeroClaw knows everything agent-side — provider keys, memory, tools, persona files. It does not know what host/channel the request came from beyond the channel string passed in.
Admin surface (two services, two prefixes)¶
Admin routes are split across the two services and reached at different prefixes. Don't conflate them.
Bridge /admin/* (ZeroClaw host, 127.0.0.1:8080 only)¶
Runtime mutations to ZeroClaw / the bridge process itself. Bound to localhost only — LAN callers get 403.
| Endpoint | Effect | Restart |
|---|---|---|
POST /admin/kid-mode {enabled: bool} |
Persists + hot-reloads kid-mode via _apply_kid_mode() (no daemon restart needed; module globals re-bind atomically). Pushes the kid pip via the xiaozhi admin relay. |
none |
POST /admin/smart-mode {enabled: bool, device_id?} |
Persists + pushes the smart pip. When DOTTY_VOICE_PROVIDER=tier1slim: hot-swaps Tier1Slim's model/url/api_key via /xiaozhi/admin/set-tier1slim-model (in-process; no restart). When DOTTY_VOICE_PROVIDER=zeroclaw: rewrites ZeroClaw's config.toml and restarts the voice daemon. |
conditional (see above) |
POST /admin/persona {file, content} |
Overwrites a workspace persona file (SOUL.md, IDENTITY.md, USER.md, AGENTS.md, TOOLS.md, BOOTSTRAP.md, HEARTBEAT.md, MEMORY.md). Atomic via .new + rename. |
none |
POST /admin/model {daemon, model} |
TOML-edits default_model in the chosen daemon's config.toml. |
named daemon |
POST /admin/safety {action, tool} |
Adds/removes a tool in MCP_TOOL_ALLOWLIST via the # === ADMIN_ALLOWLIST_START/END === marker block. py_compile-validated; on syntax error the bridge is left untouched. |
bridge (self-restart) |
Paths and systemd unit names are env-configurable (ZEROCLAW_VOICE_CFG, ZEROCLAW_VOICE_UNIT, ZEROCLAW_DISCORD_CFG, ZEROCLAW_DISCORD_UNIT, ZEROCLAW_WORKSPACE); defaults match the documented ZeroClaw-host layout.
xiaozhi-server /xiaozhi/admin/* (server, port 8003)¶
Operations that need to touch a live device session — head servos, MCP dispatch, TTS injection, the live LLM provider's runtime config. Exposed by custom-providers/xiaozhi-patches/http_server.py. Bound to the xiaozhi container's listen address (not localhost-only); protect with network ACLs if the server is reachable from untrusted networks.
| Endpoint | Purpose |
|---|---|
POST /xiaozhi/admin/inject-text |
Speak arbitrary text through TTS as if Dotty originated it. Used by face-greeter and proactive prompts. |
GET /xiaozhi/admin/devices |
List connected device-ids. |
POST /xiaozhi/admin/abort |
Abort an in-flight TTS turn (used by the face-lost aborter). |
POST /xiaozhi/admin/set-head-angles |
Move the head servos (used by sound-direction turn). |
POST /xiaozhi/admin/set-state |
Dispatch a set_state MCP call to firmware. |
POST /xiaozhi/admin/set-toggle |
Dispatch a set_toggle MCP call (kid/smart pip on the firmware ring). |
POST /xiaozhi/admin/set-face-identified |
Light the face-identified pixel for ~4 s. |
POST /xiaozhi/admin/take-photo |
Trigger a camera capture from the bridge (separate from the firmware MCP path). |
POST /xiaozhi/admin/set-tier1slim-model |
Hot-swap the live Tier1Slim provider's model / url / api_key. Driven by smart-mode flips. |
GET /xiaozhi/admin/songs |
List audio assets available to play_song. |
POST /xiaozhi/admin/play-asset |
Play a named audio asset through the speaker. |
POST /xiaozhi/admin/say |
Synthesise + play arbitrary text (lower-level than inject-text). |
Perception event bus¶
Firmware-resident producers emit JSON event frames over the xiaozhi WebSocket:
{"type":"event","name":"face_detected","data":{}}
{"type":"event","name":"face_lost","data":{}}
{"type":"event","name":"sound_event","data":{"direction":"left","balance":0.997,"energy":1807933247}}
The xiaozhi-server's EventTextMessageHandler (custom-providers/xiaozhi-patches/textMessageHandlerRegistry.py) POSTs each frame to the bridge's POST /api/perception/event. The bridge maintains a pub/sub bus (_perception_listeners) and per-device state (_perception_state[device_id]), and runs six consumer tasks against it:
| Consumer | What it does |
|---|---|
_perception_face_greeter |
"Hi!" greeting (via /xiaozhi/admin/inject-text) on first face detection after a cooldown window. |
_perception_sound_turner |
Head-turn (via /xiaozhi/admin/set-head-angles) toward sound direction. |
_perception_face_lost_aborter |
Aborts an in-flight TTS turn (via /xiaozhi/admin/abort) when the audience walks away. |
_perception_wake_word_turner |
Head-turn toward the speaker on wake-word event. |
_perception_face_identified_refresher |
Re-asserts the face-identified pixel every ~3 s so the firmware's 4 s timeout doesn't drop it. |
_perception_purr_player |
Plays an idle purr asset when conditions match. |
WebSocket lifecycle gotcha: xiaozhi only opens the WS during a conversation. Firmware-side perception producers must call OpenAudioChannel() first, or events from idle silently drop.
Threat-model implications¶
- Device compromise gives an attacker a WS session to xiaozhi-server and the ability to invoke any server-exposed MCP tool. It does not give them the LLM key, ZeroClaw's memory, or network access to OpenRouter beyond what proxied prompts can achieve.
- Server compromise gives them access to the bridge over HTTP (no auth currently). Anything the bridge can ask ZeroClaw, the attacker can ask it. The
/admin/*mutation endpoints are unreachable (they're127.0.0.1-only). - ZeroClaw host compromise gives them everything — LLM keys, memory DB, workspace persona files.
- OpenRouter compromise gives them log access to every prompt sent. Treat prompts as non-confidential.
See ROADMAP.md for related backlog items (privacy-indicator LEDs, child-safety hardening).
Deployment files (this repo)¶
The canonical working copies live in this repo. The deployed copies on each host should match — if they drift, redeploy from here.
| File | Deployed to | Purpose |
|---|---|---|
bridge.py |
ZeroClaw host <BRIDGE_PATH>/bridge.py |
FastAPI HTTP→ZeroClaw translator (ACP over stdio) |
bridge/requirements.txt |
bare-metal venv | Pinned Python deps for the bridge (fastapi, uvicorn, pydantic) |
zeroclaw-bridge.service |
ZeroClaw host /etc/systemd/system/ |
systemd unit for bridge |
custom-providers/zeroclaw/zeroclaw.py |
Server core/providers/llm/zeroclaw/zeroclaw.py |
xiaozhi LLM provider, proxies to the ZeroClaw bridge |
custom-providers/zeroclaw/__init__.py |
Server core/providers/llm/zeroclaw/__init__.py |
Python package marker |
custom-providers/edge_stream/edge_stream.py |
Server core/providers/tts/edge_stream.py |
Streaming EdgeTTS provider |
custom-providers/piper_local/piper_local.py |
Server core/providers/tts/piper_local.py |
Local Piper TTS provider (offline alternative to EdgeTTS) |
custom-providers/asr/fun_local.py |
Server core/providers/asr/fun_local.py |
Patched FunASR provider (adds language config key) |
.config.yaml |
Server data/.config.yaml |
xiaozhi-server config override |
.env.example |
reference only | Documented environment variables with defaults |
docker-compose.yml |
Server <XIAOZHI_PATH> |
Container definition |
Volume mounts (server) are listed in quickstart.md.
See also¶
- hardware.md — what the robot actually is.
- voice-pipeline.md — what the server runs.
- tier1slim.md — the default voice LLM and its escalation contract.
- brain.md — model matrix + what the ZeroClaw host runs.
- protocols.md — what's on the wire (including
/api/voice/escalateand/api/perception/event). - quickstart.md — deployment placeholders, volume mounts, common ops.
Last verified: 2026-05-17.