Latent capabilities — features available but not wired up¶
TL;DR¶
- Every row below is something the hardware, voice pipeline, or brain already supports but that the current deployment doesn't use. It's raw material for the backlog.
- Organised by where the capability lives: hardware, voice pipeline, brain.
- Each row ends with a cross-reference: a related
ROADMAP.mditem, or flagged as a new-task candidate if no backlog entry exists yet. - Treat this as a menu, not a plan — some are cheap wins, others are complex.
Hardware — unused¶
Underlying peripherals on the M5Stack CoreS3 / StackChan kit that the firmware doesn't currently expose through MCP.
| Capability | What it unlocks | Priority | Cross-ref |
|---|---|---|---|
| 9-axis IMU shake/gesture (BMI270 + BMM150) | Tap-to-activate, shake-to-dismiss, head-tilt-aware responses | Medium | New-task candidate |
| Proximity sensor (LTR-553ALS) | Wake-on-approach; auto-dim face at idle | Low | New-task candidate |
| Ambient-light (same sensor) | Match face brightness to room lighting | Low | New-task candidate |
| NFC module | Tap an NFC-tagged toy/card to trigger a scripted interaction | Medium | New-task candidate |
| IR tx/rx | Universal remote mode (learn + replay legacy appliance codes) | Low | New-task candidate |
| microSD slot | Offline sound packs, local fallback voices, recorded memories | Medium | Partially overlaps ROADMAP.md → "Create backup script" |
| 3-zone touch panel | Multi-zone gesture controls (head-pat as a discrete event) | Low | New-task candidate |
Camera beyond take_photo |
On-device VLM preprocessing; streaming to a local vision server | Medium | Cross-refs ROADMAP.md → "Lock down for child-safe operation" (camera exposure) |
| Hardware-enforced privacy LEDs | LED state wired to the peripheral-enable signal, not a software hint | High / safety | ROADMAP.md → "Hardwire privacy-indicator LEDs in firmware" |
| Servo velocity/acceleration caps | Calmer, safer, less-startling head motion | High / safety | ROADMAP.md → "Tame violent servo motion" |
Voice pipeline — unused¶
Features xiaozhi-esp32-server supports upstream that aren't turned on or surfaced.
| Capability | What it unlocks | Priority | Cross-ref |
|---|---|---|---|
| SenseVoice Speech Emotion Recognition (SER) | Use the user's vocal emotion as LLM context (not just the LLM's own emoji output) | High | New-task candidate |
| SenseVoice Audio Event Detection (AED) | Detect bgm, applause, laughter, crying, coughing, sneezing — useful context for a kids' robot | Medium | New-task candidate |
| SenseVoice language-ID output | Detect when the user actually spoke a non-English language; respond in kind or request clarification | Low | Cross-refs the English-pin patch fun_local.py |
| Sherpa-ONNX ASR | Alternative to FunASR; fully offline, supports different languages | Low | New-task candidate |
| Custom wake word | Replace/add to the stock wake word via ESP-SR MultiNet | Low | New-task candidate |
| Voiceprint speaker ID | Distinguish family members; apply per-user persona/context | Medium | Cross-refs child-safety task (different guardrails for kids vs adults) |
| xiaozhi-server VLLM module | Server-side "What's in this photo?" pipeline | Medium | Already covered by the bridge-side take_photo + VLM long-poll path described in modes.md; this row tracks the upstream xiaozhi-server VLLM module, which we don't enable. |
| PowerMem | Dual-layer short-term + summarized memory (currently ZeroClaw owns memory) | Low | Would overlap with ZeroClaw's memory — probably don't |
Intent router (function_call mode) |
Route simple commands (turn off lights, set timer) without round-tripping to the LLM | Medium | New-task candidate |
| RagFlow knowledge base | Retrieval-augmented responses against a household doc store | Low | New-task candidate |
| Multi-device routing | Run the StackChan as one of several voice surfaces on the same ZeroClaw brain | Low | Needs the full-module deployment (DB-backed) |
| Piper streaming synthesis | Lower first-audio latency than the current batch synthesis | Medium | ROADMAP.md → "Reduce first-audio latency" |
| ffmpeg post-processing on TTS | Robot-voice character via ring modulator / bitcrush / vocoder | Medium | ROADMAP.md → "TTS provider swap — robot-sounding voice" |
Brain — unused¶
ZeroClaw + Qwen3 + OpenRouter features that could be wired into the bridge.
| Capability | What it unlocks | Priority | Cross-ref |
|---|---|---|---|
ACP session/update streaming |
First-token TTS instead of waiting for the full response (perceived-latency win) | High | ROADMAP.md → "Reduce first-audio latency" |
| Long-lived ZeroClaw sessions | Skip session/new per turn — carry context across turns within a conversation |
Medium | ROADMAP.md → "Reduce first-audio latency" (ACP session overhead) |
session/request_permission |
Bridge confirms tool calls before they execute — useful for child-safety. Bridge now auto-approves (2026-04-25); tool allowlist for child-safety is a follow-up. | Medium | ROADMAP.md → "Lock down for child-safe operation" |
| ~~Qwen3 function-calling / tool-use~~ | Wired up (2026-04-25). ZeroClaw auto-approves tools in auto_approve list and sends tool execution as session/event notifications. Bridge logs tool calls at INFO level. Works for weather, web_search_tool, calculator, etc. |
~~Medium~~ Done | — |
| ZeroClaw MCP-server mode | Expose ZeroClaw's tools/memory to other MCP clients | Low | New-task candidate |
Qwen3 role: "system" injection |
Move the English+emoji constraints into a proper system message instead of a prompt prefix/suffix; better MoE adherence | Medium | Rework of bridge's wrapping logic |
| Qwen3 extended context (256K native) | Keep long conversation history / memory verbatim instead of summarising | Low | Costs more tokens per turn — probably not worth it yet |
| OpenRouter latency/cost dashboard | Observability beyond the local state/costs.jsonl |
Low | Already available — just point a browser at it |
| OpenRouter failover / multi-model | A/B a smaller faster model for voice turns specifically | Medium | ROADMAP.md → "Reduce first-audio latency" (smaller model for voice) |
| ZeroClaw cost/trace surfacing | Expose state/costs.jsonl + runtime-trace.jsonl via the bridge /health or a new /stats endpoint |
Low | New-task candidate |
| ZeroClaw cron scheduler | The robot could say "good morning" on a schedule, not just on demand | Low | New-task candidate |
Cross-cutting — observability¶
None of these are feature requests — they're gaps in what we can see about the running system.
| Gap | What it'd unlock |
|---|---|
Capture a real tools/list response |
Ground-truth for the MCP tool table in hardware.md |
| Per-turn latency breakdown | Which of ASR / LLM / TTS / network is the dominant cost |
| Per-turn cost breakdown | Whether Qwen3 via OpenRouter is cheaper than a smaller local model |
| Per-session trace diff | Whether English-sandwich is still needed after a hypothetical model upgrade |
These are all feeders for the ROADMAP.md "Map the ZeroClaw ↔ xiaozhi-server ↔ StackChan firmware interaction" backlog item.
Prioritisation rule of thumb¶
| Signal | Do it sooner |
|---|---|
| Child-safety or privacy | Always |
| Reduces perceived latency | Usually |
| Uses hardware we already paid for | Often |
| Requires an external service | Often skip |
| Needs the full-module DB deployment | Bundle for a future migration |
See also¶
- ROADMAP.md — live backlog; this file is a source for it, not a replacement.
- hardware.md — what the hardware features actually are.
- voice-pipeline.md — what the server supports upstream.
- brain.md — what ZeroClaw/Qwen/OpenRouter expose.
- references.md — upstream source for every capability claim.
Last verified: 2026-05-17.