Brain — ZeroClaw + the model matrix + the bridge¶

TL;DR¶

The "brain" is two cooperating pieces: ZeroClaw (a Rust AI-agent runtime, MIT/Apache-2.0 dual-licensed) on the ZeroClaw host, plus a FastAPI bridge (bridge.py) that fronts it over HTTP.
The bridge accepts POSTs from the voice path and translates them into ACP (Agent Client Protocol) JSON-RPC 2.0 over stdio against a long-running zeroclaw acp child.
Which LLM runs which turn depends on the active voice provider. Two paths coexist:
Tier1Slim path (default in .config.yaml) — small/fast model (qwen3.5:4b on a local llama-swap by default) handles every conversational turn; only tool calls escalate to the bridge. Smart-mode flips the inner-loop model to a cloud model in-process.
ZeroClawLLM path (legacy) — every turn runs through ZeroClaw with Qwen3-30B-A3B-Instruct-2507 (a 30.5 B-param MoE, 3.3 B active per token) via OpenRouter.
The bridge picks the smart-mode dispatch path based on the DOTTY_VOICE_PROVIDER env var: "tier1slim" → hot-swap via /xiaozhi/admin/set-tier1slim-model; "zeroclaw" → rewrite ZeroClaw's config.toml and restart the daemon.
Persona lives in two different places depending on path: Tier1Slim reads personas/dotty_voice.md; ZeroClawLLM reads ~/.zeroclaw/workspace/{SOUL,IDENTITY,MEMORY,AGENTS}.md. See change-persona.md.
Known weak spot of Qwen3: it leaks Chinese on long-context English-only prompts. Both paths compensate — the bridge wraps channel="stackchan" turns in an English+emoji sandwich; Tier1Slim appends an English-only suffix per turn.

Model matrix¶

Path	Model	Where	When called
Tier1Slim inner loop (smart_mode OFF)	`qwen3.5:4b`	local llama-swap (`TIER1SLIM_LOCAL_URL`, defaults to `http://localhost:8080/v1`)	Every plain conversational turn. ~500 ms warm.
Tier1Slim inner loop (smart_mode ON)	`anthropic/claude-sonnet-4-6` (`SMART_MODEL`)	OpenRouter	Every conversational turn while smart_mode is on.
Tier1Slim escalation: `think_hard`	`qwen3.6:27b-think`	local llama-swap	Multi-step reasoning, 3+ digit arithmetic, anything the small model would have to guess at.
Tier1Slim escalation: `memory_lookup`	(no LLM call — FTS)	ZeroClaw memory	`"do you remember…"` queries.
Tier1Slim escalation: `take_photo`	`google/gemini-2.0-flash-001` (`VLM_MODEL`)	OpenRouter (or local Ollama if `VLM_API_URL` is repointed)	Camera describe.
Tier1Slim escalation: `play_song`	(no LLM call)	Firmware via xiaozhi `/xiaozhi/admin/play-asset`	Song request.
ZeroClawLLM (legacy single tier)	`Qwen3-30B-A3B-Instruct-2507`	OpenRouter	Every turn when `selected_module.LLM = ZeroClawLLM`.
Vision narrative LLM (security/scene synthesis)	`VISION_MODEL` (`google/gemini-2.0-flash-001` by default)	OpenRouter	Bridge-internal — describes the camera frame for journaling and security mode.
Audio captioning (security mode)	`AUDIO_CAPTION_MODEL` (`google/gemini-2.5-flash` by default)	OpenRouter	Bridge-internal — `what does Dotty hear` describer.

The full Tier1Slim wire format, escalation payload, and set_runtime() hot-swap are documented in tier1slim.md.

ZeroClaw architecture¶

From github.com/zeroclaw-labs/zeroclaw (see references.md):

Component	Role
Gateway	Control plane: HTTP / WS / SSE, sessions, config, cron, webhooks, web dashboard (localhost:42617 by default)
Runtime	Agent execution. Two modes: Native (direct process, default, fastest) or Docker (sandboxed)
Channels	Pluggable inputs. Supports WhatsApp, Telegram, Slack, Discord, Signal, iMessage, Matrix, IRC, Email, Bluesky, Nostr, Mattermost, Nextcloud Talk, DingTalk, Lark, QQ, Reddit, LinkedIn, Twitter, MQTT, WeChat Work, and others. This bridge uses `channel="stackchan"` as an arbitrary string identifier — ZeroClaw treats it as just another channel.
Providers	LLM backends. OpenAI, Anthropic (API or OAuth), Gemini, and 17+ OpenAI-compatible endpoints (including OpenRouter, Ollama, GLM). Failover + multi-account auth profiles supported.
Memory	Pluggable backends. SQLite is the default; PostgreSQL and Markdown are options. Hybrid keyword+vector search per upstream wiki.
Tools / MCP	70+ built-in tools plus bidirectional MCP — ZeroClaw is both an MCP client (consumes external servers) and an MCP server (can expose its internals to other agents).

Resource claims from upstream: ~8.8 MB static binary, <5 MB runtime RAM on release builds. The Pi can comfortably run it.

Workspace files — the persona surface¶

Under /root/.zeroclaw/workspace/:

File	Upstream description	What we use it for
`SOUL.md`	"Core identity and operating principles"	The configured persona's voice, values, role in the household. The dedicated `channel="stackchan"` section was removed after bridge-level English-only enforcement subsumed it.
`IDENTITY.md`	Agent personality and role definition	Name, backstory, household/family context specific to your deployment.
`USER.md`	User context and preferences	Per-user context; optional. Not heavily populated in our deploy.
`MEMORY.md`	"Long-term facts and lessons learned"	Git-visible core memories. Complemented by an SQLite `brain.db`.
`AGENTS.md`	"Session conventions and initialization rules"	Cross-agent behavioral guardrails. Self-modifying per upstream — the agent writes to it.

Important: the bridge wraps voice turns in a hard English+emoji sandwich outside of ZeroClaw's persona files. The enforcement lives in bridge.py, not in SOUL.md, because a persona-level constraint wasn't strong enough to keep Qwen3 from leaking Chinese mid-session. See Qwen3 caveat.

The confusing "ACP" terminology in ZeroClaw¶

ZeroClaw uses the acronym ACP in two different contexts. Don't conflate them:

Autonomy Control (ACP Mode) — ZeroClaw's README describes autonomy levels (ReadOnly, Supervised, Full) under a heading that calls them "ACP Mode". This is about how much the agent is allowed to do without asking.
Agent Client Protocol — the zeroclaw acp CLI subcommand launches ZeroClaw as an ACP server speaking JSON-RPC 2.0 over stdio. This is the Zed-originated Agent Client Protocol — unrelated to autonomy levels, despite sharing an acronym.

Our bridge uses the second one. The robot's autonomy mode is whatever ZeroClaw's config.toml sets (likely Supervised), and is separate from the stdio protocol.

See protocols.md for the ACP wire format.

The bridge — `bridge.py`¶

Lives at <BRIDGE_PATH>/bridge.py, runs under systemd (zeroclaw-bridge.service).

HTTP surface:

Endpoint	Caller	Purpose
`POST /api/message`	`ZeroClawLLM` (legacy single-tier)	Accept a message + channel, return a response string.
`POST /api/voice/escalate`	`Tier1Slim`	Dispatch a tool call (`memory_lookup` / `think_hard` / `take_photo` / `play_song`) and return the result.
`POST /api/voice/remember`	`Tier1Slim`	Fire-and-forget: persist a `[REMEMBER: ...]` fact extracted from the model's reply.
`POST /api/voice/memory_log`	`Tier1Slim`	Fire-and-forget: log the completed turn so ZeroClaw indexes it for future recall.
`POST /api/perception/event`	xiaozhi-server perception relay	Receive `face_detected` / `face_lost` / `sound_event` / `state_changed` events from the firmware via xiaozhi.
`GET /health`	health checks	Liveness probe.
`POST /admin/*` (localhost-only)	operator scripts	Runtime mutations (kid-mode, persona, model, safety).

See protocols.md for exact wire formats.

Per-turn responsibilities on the ZeroClawLLM path:

Spawn (or reuse) a zeroclaw acp child, holding stdin/stdout open.
Send JSON-RPC session/new to get a fresh session_id. (Note: currently one session/new per turn — see session reuse in latent-capabilities.md.)
Wrap the user content in the English+emoji sandwich when channel == "stackchan".
Send JSON-RPC session/prompt with the wrapped content.
Wait for the terminal result (not streamed — see latent-capabilities.md).
Run _ensure_emoji_prefix() — if the first non-whitespace char isn't in the 9-emoji allowlist (😊😆😢😮🤔😠😐😍😴), prepend 😐.
Return JSON {"response": "<cleaned text>"}.

Per-turn responsibilities on the Tier1Slim path:

The bridge doesn't see plain conversational turns at all — they happen entirely inside Tier1Slim against llama-swap. The bridge is invoked only when the small model emits a tool_call, at which point it:

Looks up the tool name and dispatches to ZeroClaw (memory_lookup), llama-swap's qwen3.6:27b-think (think_hard), the VLM (take_photo), or xiaozhi's /xiaozhi/admin/play-asset (play_song).
Returns the result string truncated to 1000 chars.
Separately, accepts /api/voice/remember and /api/voice/memory_log posts so future memory_lookup calls find the new content.

The LLMs¶

Qwen3-30B-A3B-Instruct-2507 (legacy `ZeroClawLLM` path)¶

From the HuggingFace card:

Fact	Value
Total parameters	30.5 B
Activated per token	3.3 B
Non-embedding params	29.9 B
Layers	48
Attention heads	32 Q / 4 KV (GQA)
Experts	128 total, 8 active per token
Native context	262 144 tokens (256 K)
Extended context	up to 1 M tokens with Dual Chunk Attention + MInference sparse attention
Mode	Non-thinking only (no `<think>` blocks)
Recommended sampling	T=0.7, top_p=0.8, top_k=20, min_p=0, presence_penalty 0–2
Tool calling	Supported (OpenAI-compatible; Qwen-Agent framework recommended for full features)
Suggested output length	16 384 tokens

The "2507" suffix indicates the 2025 revision (the HF card calls it the May 2025 update; the YYMM reading would be July 2025 — these conflict in upstream copy, treat the revision semantically rather than calendar-pinning it).

Qwen3 caveat — Chinese leak and long-context drift¶

Qwen3 is multilingual by training and occasionally leaks Chinese mid-response when: - Context is long (persona + memory + history = lots of tokens before the user turn). - System-prompt adherence is weakened by MoE expert routing.

Observed symptom in our deploy: the model would start a response in English and drop a Chinese character or phrase partway through. en-* EdgeTTS voices return silent audio on non-English input, so the whole response sounded like a dead mic — it was a language bug, not a TTS bug.

Our mitigation, layered:

ZeroClaw's own system prompt has English hard rules.
xiaozhi-server's top-level prompt: is also English-only.
The bridge adds both a prefix and a suffix around every voice turn — the suffix sits at the end-of-prompt position (max attention) and reiterates the constraint. 8/8 adversarial prompts (broken English, embedded Japanese, short utterances) passed cleanly after this change.

qwen3.5:4b (Tier1Slim inner loop)¶

Local on llama-swap (Unraid, dual RTX 3060). Fast: ~500 ms warm round-trip including TTS dispatch. Trained for tool calling, which is what lets the four-tool catalogue work reliably at 4 B. See tier1slim.md for the wire format.

qwen3.6:27b-think (Tier1Slim `think_hard` target)¶

Local on the same llama-swap, separate alias. ~18 tok/s generation, ~30 s cold-load. Co-resident with qwen3.5:4b under the voice matrix set in llama-swap/config.yaml so an escalation doesn't evict the inner loop. See cookbook/llama-swap-concurrent-models.md.

Cloud models (smart_mode + visual + audio)¶

Smart-mode inner loop: anthropic/claude-sonnet-4-6 (SMART_MODEL env var). Used by Tier1Slim when smart_mode is on; flipped in-process via set_runtime().
VLM (take_photo, security camera frames): google/gemini-2.0-flash-001 (VLM_MODEL).
Audio captioning (security mode): google/gemini-2.5-flash (AUDIO_CAPTION_MODEL).

OpenRouter¶

Routing: OpenRouter fronts cloud models (SMART_MODEL, VLM_MODEL, AUDIO_CAPTION_MODEL, and the legacy ZeroClaw path's Qwen3-30B). It handles multiple upstream providers and exposes an OpenAI-compatible API. ZeroClaw's config references it via an openrouter provider section with an encrypted API key; the bridge reads OPENROUTER_API_KEY from the systemd unit's environment for the VLM and audio-caption calls.

Observability OpenRouter itself offers (not currently surfaced in this stack): - Per-request latency + cost dashboards. - Multi-model A/B routing. - Per-provider failover for the same model.

If per-turn latency ever needs deeper analysis than ZeroClaw's state/costs.jsonl and state/runtime-trace.jsonl, OpenRouter's dashboard is where to look next.