Architecture¶

TL;DR¶

Three hosts: robot (StackChan on your desk) → server (runs xiaozhi-esp32-server) → ZeroClaw host (runs ZeroClaw + a FastAPI bridge).
Audio goes robot → server → (text) → ZeroClaw host → (response text) → server → (audio) → robot. The ZeroClaw host never touches audio.
Two voice paths coexist, selected by selected_module.LLM in .config.yaml:
Tier1Slim (current default) — a small/fast LLM (qwen3.5:4b) handles plain turns directly from xiaozhi-server against a local llama-swap, no bridge round-trip. Tool calls escalate to the bridge via /api/voice/escalate.
ZeroClawLLM (legacy single-tier) — every turn goes through the bridge → ZeroClaw → OpenRouter (Qwen3-30B).
Everything is LAN-local except cloud-routed LLM calls (smart-mode, ZeroClawLLM, VLM, audio caption). EdgeTTS is cloud too when selected; Piper is fully local.
The "xiaozhi ↔ brain" seam is HTTP + ACP-over-stdio, not a library call — either side can be swapped independently.
The robot speaks the Xiaozhi WebSocket protocol (see protocols.md). It has no hardcoded knowledge of ZeroClaw.

Topology¶

flowchart LR
    subgraph Desk["Desk (LAN WiFi)"]
        SC["M5Stack StackChan<br/>ESP32-S3"]
    end

    subgraph DockerHost["Server - &lt;XIAOZHI_HOST&gt;"]
        XZ["xiaozhi-esp32-server<br/>(Docker)<br/>:8000 WS + :8003 HTTP"]
        subgraph XZMods["voice pipeline"]
            VAD["SileroVAD"]
            ASR["FunASR<br/>SenseVoiceSmall"]
            T1S["Tier1Slim<br/>(default LLM)"]
            ZCL["ZeroClawLLM<br/>(legacy LLM)"]
            TTS["TTS<br/>(LocalPiper default,<br/>EdgeTTS available)"]
        end
    end

    subgraph Llama["llama-swap (Unraid)"]
        SLIM["qwen3.5:4b<br/>(Tier1 inner loop)"]
        THINK["qwen3.6:27b-think<br/>(think_hard target)"]
    end

    subgraph ZCHost["ZeroClaw host - &lt;ZEROCLAW_HOST&gt;"]
        Bridge["zeroclaw-bridge<br/>FastAPI :8080<br/>systemd unit"]
        ZC["ZeroClaw daemon<br/>(agent persona)"]
    end

    Cloud["OpenRouter<br/>(smart_mode, VLM,<br/>audio caption, legacy)"]

    SC -->|"WebSocket<br/>Xiaozhi protocol"| XZ
    XZ --- VAD & ASR & T1S & ZCL & TTS
    T1S -->|"plain turn (default)"| SLIM
    T1S -->|"tool_call escalation<br/>POST /api/voice/escalate"| Bridge
    ZCL -->|"HTTP POST<br/>/api/message"| Bridge
    Bridge -->|"think_hard"| THINK
    Bridge -->|"JSON-RPC 2.0<br/>over stdio (ACP)"| ZC
    ZC -.->|"LLM API call"| Cloud
    Bridge -.->|"VLM / audio caption"| Cloud
    T1S -.->|"smart_mode ON"| Cloud
    XZ -->|"perception event<br/>POST /api/perception/event"| Bridge
    TTS -->|"audio stream"| XZ
    XZ -->|"WebSocket"| SC

Solid arrows are per-turn data flow; dotted arrows are cloud / conditional. The two voice paths share the same physical xiaozhi container — only one is active at a time via selected_module.LLM.

Actors¶

Actor	Host	Role	Process
StackChan	Desk	Captures audio, plays audio, renders face, runs MCP tools for head/LED/camera	ESP32-S3 firmware built from `m5stack/StackChan`
xiaozhi-esp32-server	Server	VAD → ASR → LLM (proxy) → TTS pipeline, emotion dispatch, OTA, admin surface	Docker container
Tier1Slim custom provider	Server (inside container)	Default LLM provider — talks directly to llama-swap for plain turns, posts to `/api/voice/escalate` for tool calls	Python, mounted via volume
ZeroClawLLM custom provider	Server (inside container)	Legacy single-tier LLM provider — translates xiaozhi's LLM-provider interface to an HTTP POST to the bridge	Python, mounted via volume
llama-swap	Unraid (or any GPU host)	Routes OpenAI-compatible requests to per-model llama-server children; co-loads `qwen3.5:4b` (voice inner loop) and `qwen3.6:27b-think` (`think_hard` target)	Docker container (`ghcr.io/mostlygeek/llama-swap:cuda`)
zeroclaw-bridge	ZeroClaw host	Accepts HTTP POSTs from both voice paths + xiaozhi-server perception relay; spawns/holds a `zeroclaw acp` child; speaks ACP JSON-RPC to it	FastAPI + uvicorn under systemd
ZeroClaw daemon	ZeroClaw host	The configured persona — runs the agent loop, calls the LLM, consults `SOUL.md`/`IDENTITY.md`/`MEMORY.md`	Rust binary (`zeroclaw acp`)
OpenRouter	Cloud	Routes cloud LLM calls (smart_mode `claude-sonnet-4-6`, VLM `gemini-2.0-flash`, audio caption `gemini-2.5-flash`, and the legacy ZeroClaw path's Qwen3-30B)	External

Data flow (single utterance, Tier1Slim path — plain turn)¶

sequenceDiagram
    autonumber
    participant User
    participant SC as StackChan
    participant XZ as xiaozhi-server<br/>(server)
    participant LS as llama-swap<br/>qwen3.5:4b

    User->>SC: speaks
    SC->>XZ: Opus audio frames (WebSocket)
    XZ->>XZ: SileroVAD → speech end
    XZ->>XZ: FunASR / Whisper → text
    XZ->>LS: chat/completions (tools=auto)
    LS-->>XZ: "😊 The sky is blue!" (no tool_calls)
    XZ->>XZ: strip leading emoji → emotion frame
    XZ->>XZ: TTS (Piper / EdgeTTS) → Opus frames
    XZ-->>SC: audio + emotion
    SC-->>User: speaks + face animation

Data flow (Tier1Slim path — escalated tool call)¶

sequenceDiagram
    autonumber
    participant XZ as xiaozhi-server
    participant LS as llama-swap<br/>qwen3.5:4b
    participant Br as zeroclaw-bridge
    participant ZC as ZeroClaw
    participant THK as llama-swap<br/>qwen3.6:27b-think

    XZ->>LS: chat/completions (tools=auto)
    LS-->>XZ: tool_calls=[think_hard(...)]
    XZ-->>XZ: yield filler ("hmm, let me think…") to TTS
    XZ->>Br: POST /api/voice/escalate<br/>{"tool":"think_hard","args":...}
    Br->>THK: chat completion
    THK-->>Br: reasoned answer
    Br-->>XZ: {"result":"..."}
    XZ->>LS: chat/completions (tool result in context)
    LS-->>XZ: streamed final answer
    XZ->>XZ: TTS → Opus frames

Data flow (legacy ZeroClawLLM path)¶

sequenceDiagram
    autonumber
    participant XZ as xiaozhi-server
    participant Br as zeroclaw-bridge
    participant ZC as ZeroClaw
    participant LLM as OpenRouter<br/>→ Qwen3-30B

    XZ->>Br: POST /api/message<br/>{"content":"...","channel":"stackchan"}
    Br->>ZC: session/new + session/prompt<br/>(JSON-RPC over stdio)
    ZC->>LLM: chat completion
    LLM-->>ZC: response
    ZC-->>Br: stdout JSON-RPC result
    Br->>Br: ensure emoji prefix<br/>enforce English suffix
    Br-->>XZ: {"response":"😊 Sure, the weather is..."}

See protocols.md for every wire format referenced above, and tier1slim.md for the four-tool escalation catalogue.

Why this shape¶

Audio lives with the server because the StackChan firmware already speaks the Xiaozhi WS protocol; xiaozhi-esp32-server is the matching server. Putting the brain next to the mic would require us to reimplement that protocol, and we'd still need a voice server.
The brain lives with ZeroClaw because ZeroClaw already has memory, tools, persona, channels, LLM routing wired up. The robot is just another channel into the same agent.
A bridge lives between them because ZeroClaw's HTTP API is observational; to prompt a running agent you have to use ACP (JSON-RPC 2.0 over stdio) against a zeroclaw acp child. The bridge is a tiny FastAPI adapter that does exactly that.
The seam is a custom xiaozhi LLM provider (zeroclaw.py, mounted into the container). xiaozhi-server thinks it's calling a local Python LLM class; the class just does an HTTP POST to <ZEROCLAW_HOST>:8080/api/message. That means we could swap ZeroClaw for anything HTTP-serviceable without touching xiaozhi.

What each host sees¶

Robot knows only: - An OTA HTTP URL (http://<XIAOZHI_HOST>:8003/xiaozhi/ota/) - A WS URL (provided by OTA response, typically ws://<XIAOZHI_HOST>:8000/xiaozhi/v1/)

It does not know about the ZeroClaw host, ZeroClaw itself, or any LLM.

Server / xiaozhi-server knows: - Its own device-facing WS + OTA ports - A handful of pluggable providers selected via data/.config.yaml selected_module: - The LLM provider's base_url: http://<ZEROCLAW_HOST>:8080 to reach the bridge

It does not know about ACP, ZeroClaw workspace files, or the OpenRouter key.

ZeroClaw host / bridge knows: - How to subprocess.Popen("zeroclaw acp") and speak JSON-RPC 2.0 on stdin/stdout - How to wrap turns that arrive with channel="stackchan" in the English+emoji sandwich - The OpenRouter API key lives in ZeroClaw's config, not in the bridge

ZeroClaw knows everything agent-side — provider keys, memory, tools, persona files. It does not know what host/channel the request came from beyond the channel string passed in.

Admin surface (two services, two prefixes)¶

Admin routes are split across the two services and reached at different prefixes. Don't conflate them.

Bridge `/admin/*` (ZeroClaw host, `127.0.0.1:8080` only)¶

Runtime mutations to ZeroClaw / the bridge process itself. Bound to localhost only — LAN callers get 403.

Endpoint	Effect	Restart
`POST /admin/kid-mode` `{enabled: bool}`	Persists + hot-reloads kid-mode via `_apply_kid_mode()` (no daemon restart needed; module globals re-bind atomically). Pushes the kid pip via the xiaozhi admin relay.	none
`POST /admin/smart-mode` `{enabled: bool, device_id?}`	Persists + pushes the smart pip. When `DOTTY_VOICE_PROVIDER=tier1slim`: hot-swaps Tier1Slim's model/url/api_key via `/xiaozhi/admin/set-tier1slim-model` (in-process; no restart). When `DOTTY_VOICE_PROVIDER=zeroclaw`: rewrites ZeroClaw's `config.toml` and restarts the voice daemon.	conditional (see above)
`POST /admin/persona` `{file, content}`	Overwrites a workspace persona file (`SOUL.md`, `IDENTITY.md`, `USER.md`, `AGENTS.md`, `TOOLS.md`, `BOOTSTRAP.md`, `HEARTBEAT.md`, `MEMORY.md`). Atomic via `.new` + rename.	none
`POST /admin/model` `{daemon, model}`	TOML-edits `default_model` in the chosen daemon's `config.toml`.	named daemon
`POST /admin/safety` `{action, tool}`	Adds/removes a tool in `MCP_TOOL_ALLOWLIST` via the `# === ADMIN_ALLOWLIST_START/END ===` marker block. py_compile-validated; on syntax error the bridge is left untouched.	bridge (self-restart)

Paths and systemd unit names are env-configurable (ZEROCLAW_VOICE_CFG, ZEROCLAW_VOICE_UNIT, ZEROCLAW_DISCORD_CFG, ZEROCLAW_DISCORD_UNIT, ZEROCLAW_WORKSPACE); defaults match the documented ZeroClaw-host layout.

xiaozhi-server `/xiaozhi/admin/*` (server, port 8003)¶

Operations that need to touch a live device session — head servos, MCP dispatch, TTS injection, the live LLM provider's runtime config. Exposed by custom-providers/xiaozhi-patches/http_server.py. Bound to the xiaozhi container's listen address (not localhost-only); protect with network ACLs if the server is reachable from untrusted networks.

Endpoint	Purpose
`POST /xiaozhi/admin/inject-text`	Speak arbitrary text through TTS as if Dotty originated it. Used by face-greeter and proactive prompts.
`GET /xiaozhi/admin/devices`	List connected device-ids.
`POST /xiaozhi/admin/abort`	Abort an in-flight TTS turn (used by the face-lost aborter).
`POST /xiaozhi/admin/set-head-angles`	Move the head servos (used by sound-direction turn).
`POST /xiaozhi/admin/set-state`	Dispatch a `set_state` MCP call to firmware.
`POST /xiaozhi/admin/set-toggle`	Dispatch a `set_toggle` MCP call (kid/smart pip on the firmware ring).
`POST /xiaozhi/admin/set-face-identified`	Light the face-identified pixel for ~4 s.
`POST /xiaozhi/admin/take-photo`	Trigger a camera capture from the bridge (separate from the firmware MCP path).
`POST /xiaozhi/admin/set-tier1slim-model`	Hot-swap the live Tier1Slim provider's `model` / `url` / `api_key`. Driven by smart-mode flips.
`GET /xiaozhi/admin/songs`	List audio assets available to `play_song`.
`POST /xiaozhi/admin/play-asset`	Play a named audio asset through the speaker.
`POST /xiaozhi/admin/say`	Synthesise + play arbitrary text (lower-level than `inject-text`).

Perception event bus¶

Firmware-resident producers emit JSON event frames over the xiaozhi WebSocket:

{"type":"event","name":"face_detected","data":{}}
{"type":"event","name":"face_lost","data":{}}
{"type":"event","name":"sound_event","data":{"direction":"left","balance":0.997,"energy":1807933247}}

The xiaozhi-server's EventTextMessageHandler (custom-providers/xiaozhi-patches/textMessageHandlerRegistry.py) POSTs each frame to the bridge's POST /api/perception/event. The bridge maintains a pub/sub bus (_perception_listeners) and per-device state (_perception_state[device_id]), and runs six consumer tasks against it:

Consumer	What it does
`_perception_face_greeter`	"Hi!" greeting (via `/xiaozhi/admin/inject-text`) on first face detection after a cooldown window.
`_perception_sound_turner`	Head-turn (via `/xiaozhi/admin/set-head-angles`) toward sound direction.
`_perception_face_lost_aborter`	Aborts an in-flight TTS turn (via `/xiaozhi/admin/abort`) when the audience walks away.
`_perception_wake_word_turner`	Head-turn toward the speaker on wake-word event.
`_perception_face_identified_refresher`	Re-asserts the face-identified pixel every ~3 s so the firmware's 4 s timeout doesn't drop it.
`_perception_purr_player`	Plays an idle purr asset when conditions match.

WebSocket lifecycle gotcha: xiaozhi only opens the WS during a conversation. Firmware-side perception producers must call OpenAudioChannel() first, or events from idle silently drop.

Threat-model implications¶

Device compromise gives an attacker a WS session to xiaozhi-server and the ability to invoke any server-exposed MCP tool. It does not give them the LLM key, ZeroClaw's memory, or network access to OpenRouter beyond what proxied prompts can achieve.
Server compromise gives them access to the bridge over HTTP (no auth currently). Anything the bridge can ask ZeroClaw, the attacker can ask it. The /admin/* mutation endpoints are unreachable (they're 127.0.0.1-only).
ZeroClaw host compromise gives them everything — LLM keys, memory DB, workspace persona files.
OpenRouter compromise gives them log access to every prompt sent. Treat prompts as non-confidential.

See ROADMAP.md for related backlog items (privacy-indicator LEDs, child-safety hardening).

Deployment files (this repo)¶

The canonical working copies live in this repo. The deployed copies on each host should match — if they drift, redeploy from here.

File	Deployed to	Purpose
`bridge.py`	ZeroClaw host `<BRIDGE_PATH>/bridge.py`	FastAPI HTTP→ZeroClaw translator (ACP over stdio)
`bridge/requirements.txt`	bare-metal venv	Pinned Python deps for the bridge (fastapi, uvicorn, pydantic)
`zeroclaw-bridge.service`	ZeroClaw host `/etc/systemd/system/`	systemd unit for bridge
`custom-providers/zeroclaw/zeroclaw.py`	Server `core/providers/llm/zeroclaw/zeroclaw.py`	xiaozhi LLM provider, proxies to the ZeroClaw bridge
`custom-providers/zeroclaw/__init__.py`	Server `core/providers/llm/zeroclaw/__init__.py`	Python package marker
`custom-providers/edge_stream/edge_stream.py`	Server `core/providers/tts/edge_stream.py`	Streaming EdgeTTS provider
`custom-providers/piper_local/piper_local.py`	Server `core/providers/tts/piper_local.py`	Local Piper TTS provider (offline alternative to EdgeTTS)
`custom-providers/asr/fun_local.py`	Server `core/providers/asr/fun_local.py`	Patched FunASR provider (adds `language` config key)
`.config.yaml`	Server `data/.config.yaml`	xiaozhi-server config override
`.env.example`	reference only	Documented environment variables with defaults
`docker-compose.yml`	Server `<XIAOZHI_PATH>`	Container definition

Volume mounts (server) are listed in quickstart.md.