Cross-Layer Interaction Map¶

One-page reference for every cross-layer signal in the Dotty stack.

Layers:

StackChan firmware -- ESP32-S3 (m5stack/StackChan). The physical robot.
xiaozhi-esp32-server -- Docker on a Linux host. Voice I/O pipeline (ASR, TTS, VAD, emotion parsing).
zeroclaw-bridge -- FastAPI on ZeroClaw host. LLM brain interface (HTTP to ZeroClaw ACP-over-stdio).

Audio & Speech¶

Signal	Source	Destination	Protocol	Notes
Audio frames	StackChan	xiaozhi	WebSocket, Opus 60 ms frames	16 kHz mono; sent while `listen` state is `start`
TTS audio	xiaozhi	StackChan	WebSocket, Opus frames	24 kHz; streamed sentence-by-sentence as TTS completes
VAD state	xiaozhi (local)	xiaozhi (internal)	SileroVAD in-process	Detects speech-end silence; triggers ASR on the buffered audio
ASR text	FunASR (in xiaozhi)	LLM provider (internal)	In-process call	SenseVoiceSmall; `language` config key patched in `fun_local.py`
STT frame	xiaozhi	StackChan	WebSocket JSON `{"type":"stt","text":"..."}`	Sent as soon as ASR finishes; firmware shows thinking face

LLM & Responses¶

Signal	Source	Destination	Protocol	Notes
LLM request	xiaozhi (ZeroClawLLM provider)	bridge	HTTP POST `/api/message/stream`	Carries the user text; bridge wraps it in ACP JSON-RPC to ZeroClaw
LLM response	bridge	xiaozhi	NDJSON streaming (HTTP chunked)	Each chunk is one partial sentence; bridge enforces emoji prefix
Sentence chunks	xiaozhi	TTS then StackChan	Internal then WebSocket Opus	xiaozhi splits response into sentences, synthesizes each, streams audio back

Emotion & Expression¶

Signal	Source	Destination	Protocol	Notes
Emoji in LLM text	bridge (LLM output)	xiaozhi	First char of NDJSON response text	Three-layer enforcement: ZeroClaw prompt, xiaozhi system prompt, `_ensure_emoji_prefix` fallback
Emotion frame	xiaozhi	StackChan	WebSocket JSON `{"type":"llm","text":"emoji","emotion":"name"}`	Mapped from leading emoji (e.g. `😊`=smile, `🤔`=thinking); 9-emoji subset used
Thinking emotion	bridge	xiaozhi (forwarded to StackChan)	Emitted before LLM call starts	Shows thinking face while waiting for first token
Face animation	StackChan firmware (local)	Avatar renderer (local)	Internal	Firmware maps emotion string to animated face expression

MCP Tools¶

Signal	Source	Destination	Protocol	Notes
tools/list	StackChan	xiaozhi	JSON-RPC 2.0 over WebSocket	Sent during WS handshake; 11 tools registered (camera, LED, head, audio, etc.)
tools/call	xiaozhi	StackChan	JSON-RPC 2.0 over WebSocket	e.g. `self.camera.take_photo`, `self.robot.set_led_color`, `self.robot.set_head_angles`
tool result	StackChan	xiaozhi	JSON-RPC 2.0 over WebSocket	Result forwarded to LLM provider so the model can use the output

Session & Control¶

Signal	Source	Destination	Protocol	Notes
hello	StackChan	xiaozhi	WebSocket JSON `{"type":"hello"}`	Includes `features:{mcp:true}`, audio params; must get reply within 10 s
hello response	xiaozhi	StackChan	WebSocket JSON `{"type":"hello"}`	Returns `session_id` and server audio params (24 kHz Opus)
listen	StackChan	xiaozhi	WebSocket JSON `{"type":"listen"}`	`state:"start"/"stop"`, `mode:"auto"/"manual"`; controls when audio is processed
abort	StackChan	xiaozhi	WebSocket JSON	Sent when user speaks during TTS playback; cancels current response
OTA check	StackChan	xiaozhi :8003	HTTP GET `/xiaozhi/ota/`	Returns WebSocket URL and config on boot; firmware connects to the returned URL

Modes & LED¶

For the behavioural layer that consumes these signals -- what mode the robot is in, what LED plays, and how modes hand off to one another -- see modes.md. That doc is the canonical taxonomy (ambient / conversation / performance / maintenance), the per-mode trigger reference, and the LED contract table. This file remains the wire-level signal reference; modes.md is the state-machine view on top of it.