FAQ¶
What hardware do I need?¶
The verified setup is an M5Stack CoreS3 mounted in the M5Stack StackChan servo kit (2x SG90 servos for pan/tilt, 12 RGB LEDs, 3D-printed chassis). You also need a Docker-capable host on your LAN (a spare PC or any Linux box with Docker) and a ZeroClaw host for the agent brain.
See hardware-support.md for the full spec table and support tiers.
Can I use a different LLM?¶
Yes. The LLM is pluggable at two levels:
- At the ZeroClaw level (the brain): change the
default_modeland provider in ZeroClaw'sconfig.tomlon the ZeroClaw host. ZeroClaw supports OpenAI-compatible APIs, Anthropic, Gemini, Ollama, and 17+ other backends. Restart the bridge after changing. - At the xiaozhi-server level: if you want to bypass ZeroClaw entirely, swap the
selected_modulefor LLM indata/.config.yamlto any of the built-in providers (OpenAI-compatible, Ollama, Dify, FastGPT, etc.). You lose ZeroClaw's agent features (memory, tools, persona files) but gain a simpler stack.
The reference config uses Qwen3-30B-A3B via OpenRouter. Any model that handles English well and can follow emoji-prefix instructions will work. Larger models give better persona adherence; smaller models respond faster.
Is it fully offline?¶
Almost, and it can be with two swaps:
- ASR (speech recognition): already fully local. FunASR runs on your server.
- TTS (speech synthesis): local if you use Piper TTS. EdgeTTS requires internet (it hits Microsoft's servers).
- LLM: cloud by default (OpenRouter). Swap in Ollama pointing at a local model for fully offline inference.
With Piper TTS and a local Ollama model, nothing leaves your LAN. The trade-off is that local LLMs need a GPU or beefy CPU to run at conversational speed.
How much does it cost to run?¶
Hardware (one-time): - M5Stack StackChan kit: check current pricing on the M5Stack store. Expect roughly $60-80 USD for the CoreS3 + servo kit. - Docker host: whatever you already have. Any machine that can run Docker and has a few GB of RAM. - ZeroClaw host: any Pi 3 or later. The ZeroClaw binary is ~9 MB and uses <5 MB RAM.
Recurring: - Electricity for the hosts (negligible for most home setups). - LLM API costs if using a cloud provider. OpenRouter pricing for Qwen3-30B-A3B is on the order of $0.10-0.30 per million tokens — casual household use is pennies per day. Check OpenRouter's pricing page for current rates. - $0 if you run a local model via Ollama.
Is it safe for kids?¶
Kid Mode is ON by default (DOTTY_KID_MODE=true). It enforces child-safe guardrails. You can disable it with DOTTY_KID_MODE=false for general-purpose use.
What Kid Mode enforces:
- Per-turn sandwich enforcement in the bridge forces the LLM to respond in English with an emoji prefix, which limits the scope of unexpected output.
- The ZeroClaw persona files (SOUL.md, IDENTITY.md) define the robot's personality and boundaries with kid-safe defaults.
- Content and tone are constrained to be age-appropriate.
What Kid Mode does not do: - Content-filter the LLM's output at a network level. If the LLM says something inappropriate, the bridge passes it through. - Prevent a determined child from asking adversarial questions. - Guarantee the LLM won't hallucinate inappropriate content (no model can).
This is a self-hosted system — you control the prompt, the model, and every log. That's more control than any cloud voice assistant gives you, but it's not a substitute for parental judgment.
Can I change the robot's personality?¶
Yes. The persona is defined in Markdown files on the ZeroClaw host:
~/.zeroclaw/workspace/SOUL.md— core identity and values.~/.zeroclaw/workspace/IDENTITY.md— name, backstory, role.
These are hot-read by ZeroClaw — edit them and the next conversation turn picks up the changes, no restart needed.
There's also a secondary prompt: key in data/.config.yaml on the server that gets injected as a system message. ZeroClaw's own persona files take precedence, but this is a useful place for voice-pipeline-level hints.
What voices are available?¶
Depends on which TTS provider you use:
- Piper TTS (local): browse the piper-voices catalog on HuggingFace. Dozens of languages and speakers. The reference config uses
en_GB-cori-medium. - EdgeTTS (cloud): any Microsoft Edge Neural voice. Hundreds of voices across 70+ languages. Set the voice ID in
data/.config.yamlunderTTS.EdgeTTS.voice(e.g.,en-US-AvaNeural,en-AU-WilliamNeural,ja-JP-NanamiNeural). Restart the container after changing.
To switch between Piper and EdgeTTS, change the selected_module for TTS in data/.config.yaml and restart the container.
Does it work with other StackChan variants?¶
It depends on what you mean by "StackChan variant":
- M5Stack CoreS3 + StackChan servo kit (the official M5Stack product): This is what we test on. It works.
- Original Stack-chan by Shinya Ishikawa (
meganetaaan/stack-chan): That's a different firmware (TypeScript on Moddable SDK). The server-side infrastructure doesn't care what firmware the device runs as long as it speaks the Xiaozhi WebSocket protocol. But the original Stack-chan firmware doesn't speak Xiaozhi — so no, not without porting the firmware. - Other ESP32-S3 boards running
78/xiaozhi-esp32: The server side will work (same WebSocket protocol), but you won't get StackChan-specific features (servos, avatar, LEDs) without board-specific firmware work.
See hardware-support.md for the full support matrix.
See also¶
- about.md — what the project is and who it's for.
- hardware-support.md — hardware requirements and support tiers.
- troubleshooting.md — symptom-based debugging guide.
- SETUP.md — deployment guide.
Last verified: 2026-05-17.