- Symptom: `401 unauthorized`
  - Check `Authorization: Bearer <api-key-or-legacy-token>` or the `mc_session` cookie.
  - Verify key scopes via `GET /api/auth/api_keys`.
- Symptom: unsure whether the deployment config is safe
  - Run `GET /api/config/self_check` and inspect `risk_level` + `warnings`.
  - Fix high-severity items first (`severity: high`).
- Symptom: login throttled
  - The login endpoint rate-limits repeated attempts per client key.
  - Wait for the cooldown window and retry.
- List hooks: `microclaw hooks list`
- Inspect a hook: `microclaw hooks info <name>`
- Quickly disable a bad hook: `microclaw hooks disable <name>`

If a hook times out or crashes, the runtime skips the hook and continues.
- Inspect the session tree: `GET /api/sessions/tree`
- Create a branch: `POST /api/sessions/fork`
- Deleting a parent session does not cascade to its children.
- Symptom: `sessions_spawn(runtime="acp")` returns `ACP runtime is disabled`
  - Set `subagents.acp.enabled: true`.
  - Set `subagents.acp.command` to an ACP-compatible agent executable, or add at least one enabled target under `subagents.acp.targets`.
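A minimal config sketch combining these settings (the per-target keys nested under `subagents.acp.targets` are assumptions about the schema, not confirmed, and the paths are placeholders):

```yaml
subagents:
  acp:
    enabled: true
    # single-command mode:
    command: /usr/local/bin/acp-agent      # placeholder path
    # or named workers (per-target keys below are assumed):
    default_target: worker-a
    targets:
      worker-a:
        enabled: true
        command: /usr/local/bin/acp-agent  # placeholder path
```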
- Symptom: an ACP run is accepted but fails quickly afterward
  - Verify the selected ACP command starts and speaks ACP over stdio.
  - Check `subagents_log` for the run and inspect the stderr included in the failure.
  - Confirm the chat working directory contains the repo/context the external agent expects.
- Symptom: `sessions_spawn(runtime="acp")` returns unknown-target or multiple-target errors
  - Pass `runtime_target` when using named ACP workers.
  - Or set `subagents.acp.default_target` so plain `runtime="acp"` has a stable default.
  - Verify the target names under `subagents.acp.targets`.
Operational notes:
- ACP-backed subagents reuse the normal `subagent_runs` lifecycle and announce path.
- ACP file and terminal operations are scoped to the chat working directory.
- `subagents_log` now records ACP permission, plan, tool-call, file, and terminal events.
- ACP permission requests are still operator-configurable and may be auto-approved, so keep the feature operator-controlled.
- Symptom: a Mission Control / OpenClaw operator cannot connect
  - The compatibility bridge is served from `GET /` with WebSocket upgrade, not `/ws`.
  - Verify the web listener first: `GET /api/health`.
  - Check service state: `microclaw gateway status --json --deep`.
- Symptom: local gateway RPC returns unauthorized
  - Export one of `MICROCLAW_GATEWAY_TOKEN`, `OPENCLAW_GATEWAY_TOKEN`, `GATEWAY_TOKEN`, or `MICROCLAW_API_KEY`.
  - WebSocket `connect` requires a valid operator token; mutating RPC methods require `operator.write`.
Quick smoke checks:

```bash
MICROCLAW_GATEWAY_TOKEN=... microclaw gateway call health
MICROCLAW_GATEWAY_TOKEN=... microclaw gateway call status
MICROCLAW_GATEWAY_TOKEN=... microclaw gateway call sessions.send \
  --params '{"sessionKey":"main","message":"status summary"}'
```
Supported bridge methods:
- `health`, `status`, `chat.send`, `chat.history`
- `sessions.delete`, `sessions.send`, `sessions.kill`, `sessions.spawn`, `sessions.list`
- `sessions.setThinking`, `sessions.setVerbose`, `sessions.setReasoning`, `sessions.setLabel`
- `agents.list`, `models.list`, `config.get`, `node.list`
Session-control notes:
- `sessions.send` emits live `chat` events and a terminal `final` state for normal messages.
- `sessions.spawn` can create a new session and persist an initial label.
- `sessions.set*` writes only the provided field and preserves the rest of the stored session settings.
- Session labels and settings are persisted in the `sessions` table and flow back into session listings / bridge payloads.
- `sessions.send` control payloads are acknowledged today, but not yet enforced as runtime controls.
- Symptom: logs show send failures after an operator disconnects
  - Messages such as `sending after closing is not allowed`, `already closed`, or `connection closed normally` are downgraded as expected closed-socket noise.
- Check the snapshot: `GET /api/metrics`
- Check history: `GET /api/metrics/history?minutes=60`
- If OTLP is enabled, verify `channels.observability.otlp_endpoint` is reachable.
- If points are missing under burst traffic, raise `otlp_queue_capacity` and review retry settings.

If history is empty, generate traffic first and re-check.
MCP reliability counters (snapshot/summary):
- `mcp_rate_limited_rejections`
- `mcp_bulkhead_rejections`
- `mcp_circuit_open_rejections`

These counters are also persisted to `metrics_history` and available in `GET /api/metrics/history`.
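As a quick way to eyeball these counters without extra tooling, a saved snapshot can be filtered with standard utilities; the JSON below is an illustrative sample, not real `/api/metrics` output:

```shell
# Sample snapshot payload (illustrative values, not real output).
snapshot='{"mcp_rate_limited_rejections":3,"mcp_bulkhead_rejections":0,"mcp_circuit_open_rejections":1}'

# Pull each MCP reliability counter out of the JSON with sed.
for c in mcp_rate_limited_rejections mcp_bulkhead_rejections mcp_circuit_open_rejections; do
  val=$(printf '%s' "$snapshot" | sed -n "s/.*\"$c\":\([0-9]*\).*/\1/p")
  echo "$c=$val"
done
```

In practice you would feed the body of `GET /api/metrics` through the same pipeline (or use `jq` if it is available).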
- Run the stability smoke suite locally: `scripts/ci/stability_smoke.sh`
- CI gate: the `Stability Smoke` job in `.github/workflows/ci.yml`
- Scope:
  - cross-chat permissions
  - scheduler restart persistence
  - sandbox fallback and require-runtime fail-closed behavior
  - web inflight and rate-limit regression
- Query the SLO summary: `GET /api/metrics/summary`
- Request success burn alert: `slo.request_success_rate.value < 0.99`
- Latency burn alert: `slo.e2e_latency_p95_ms.value > 10000`
- Tool reliability burn alert: `slo.tool_reliability.value < 0.97`
- Scheduler recoverability alert: `slo.scheduler_recoverability_7d.value < 0.999`
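Assuming you extract the four `slo.*.value` numbers from the summary yourself, the alert thresholds can be sketched as a small shell helper (`check_slo` is a hypothetical name and the observed values are sample inputs, not real summary output):

```shell
# check_slo NAME VALUE lt|gt THRESHOLD
# Prints a BURN line when the alert condition holds; silent otherwise.
check_slo() {
  if [ "$3" = "lt" ]; then cond='v < t'; else cond='v > t'; fi
  if awk -v v="$2" -v t="$4" "BEGIN { exit !($cond) }"; then
    echo "BURN: $1=$2 ($3 $4)"
  fi
}

check_slo request_success_rate 0.985 lt 0.99           # fires
check_slo e2e_latency_p95_ms 8200 gt 10000             # quiet
check_slo tool_reliability 0.96 lt 0.97                # fires
check_slo scheduler_recoverability_7d 0.9995 lt 0.999  # quiet
```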
When any burn alert is active:
- freeze non-critical feature merges
- triage and assign incident owner
- if user-facing impact continues, prepare rollback/hotfix path per stability plan
- Inspect failed scheduler runs: use the `list_scheduled_task_dlq` tool with `chat_id`.
- Replay pending failures: use the `replay_scheduled_task_dlq` tool with `chat_id`.
- Optional filters:
  - `task_id` to target one task
  - `limit` to bound the replay batch size
- Replay behavior:
  - re-queues the task with an immediate `next_run`
  - marks the DLQ entry as replayed with a replay note (`queued` or `skipped` reason)
- Global tool timeout default: `default_tool_timeout_secs`
- Per-tool timeout overrides: `tool_timeout_overrides.<tool_name>`
- Global MCP request timeout default: `default_mcp_request_timeout_secs`
- MCP per-server override remains supported in `mcp.json` and `mcp.d/*.json`: `mcpServers.<name>.request_timeout_secs`

Precedence:
- Tools: input `timeout_secs` > `tool_timeout_overrides` > `default_tool_timeout_secs`
- MCP: server `request_timeout_secs` > `default_mcp_request_timeout_secs`
Example config:

```yaml
default_tool_timeout_secs: 30
tool_timeout_overrides:
  bash: 90
  browser: 45
  web_fetch: 20
  web_search: 20
default_mcp_request_timeout_secs: 120
```

`web_fetch` now has two independent validation layers:
- content validation (`web_fetch_validation`) to detect prompt-injection patterns in fetched text
- URL policy validation (`web_fetch_url_validation`) to enforce scheme/host rules before fetch
Recommended baseline:

```yaml
web_fetch_validation:
  enabled: true
  strict_mode: true
  max_scan_bytes: 100000
web_fetch_url_validation:
  enabled: true
  allowed_schemes: ["https", "http"]
  allowlist_hosts: []
  denylist_hosts:
    - "localhost"
    - "127.0.0.1"
    - "169.254.169.254"
```

`web_fetch_url_validation.feed_sync` can pull host entries from remote feeds and merge them into the runtime allowlist/denylist sets.
Example:

```yaml
web_fetch_url_validation:
  enabled: true
  allowed_schemes: ["https", "http"]
  allowlist_hosts: []
  denylist_hosts: []
  feed_sync:
    enabled: true
    fail_open: true
    max_entries_per_source: 10000
    sources:
      - enabled: true
        mode: denylist
        url: "https://example.com/blocked-hosts.txt"
        format: lines
        refresh_interval_secs: 3600
        timeout_secs: 10
      - enabled: true
        mode: allowlist
        url: "https://example.com/approved-hosts.csv"
        format: csv_first_column
        refresh_interval_secs: 3600
        timeout_secs: 10
```

Operational notes:
- URL checks run before network fetch; denylist takes precedence over allowlist.
- If `allowlist_hosts` is non-empty, the target host must match the allowlist.
- `fail_open: true` skips an unavailable feed; `fail_open: false` blocks requests when the feed fetch fails.
- Feed data is cached in memory and refreshed per source every `refresh_interval_secs`.
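The decision order described above (denylist wins, allowlist enforced only when non-empty) can be sketched as a shell function; this is an illustrative model of the documented rules, not MicroClaw's actual implementation:

```shell
# host_allowed HOST ALLOWLIST DENYLIST
# ALLOWLIST and DENYLIST are space-separated host lists.
host_allowed() {
  host="$1"; allow="$2"; deny="$3"
  for h in $deny; do
    if [ "$h" = "$host" ]; then return 1; fi   # denylist always wins
  done
  if [ -z "$allow" ]; then return 0; fi        # empty allowlist: pass
  for h in $allow; do
    if [ "$h" = "$host" ]; then return 0; fi
  done
  return 1                                     # allowlist set, no match
}

host_allowed example.com "" "localhost 127.0.0.1" && echo allow || echo deny   # allow
host_allowed localhost  "" "localhost 127.0.0.1" && echo allow || echo deny   # deny
host_allowed example.org "example.com" ""        && echo allow || echo deny   # deny
```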
`mcp.json` / `mcp.d/*.json` support per-server circuit-breaker knobs:
- `circuit_breaker_failure_threshold` (default `5`)
- `circuit_breaker_cooldown_secs` (default `30`)

`request_timeout_secs` remains the per-server timeout budget.

`mcp.json` / `mcp.d/*.json` also support server-level isolation controls:
- `max_concurrent_requests` (default `4`)
- `queue_wait_ms` (default `200`)
- `rate_limit_per_minute` (default `120`)
Example:

```json
{
  "mcpServers": {
    "remote": {
      "transport": "streamable_http",
      "endpoint": "http://127.0.0.1:8080/mcp",
      "request_timeout_secs": 120,
      "circuit_breaker_failure_threshold": 5,
      "circuit_breaker_cooldown_secs": 30,
      "max_concurrent_requests": 4,
      "queue_wait_ms": 200,
      "rate_limit_per_minute": 120
    }
  }
}
```

Behavior:
- consecutive MCP request failures trip the breaker and short-circuit calls during cooldown.
- after cooldown, requests are attempted again automatically.
- requests fail fast when the queue-wait budget is exceeded.
- the per-server rate limit enforces a fixed 60-second window budget.
Use this pattern when you want remote terminal/session capability via an external bridge, without embedding terminal runtime logic into MicroClaw core.
- Keep the base MCP config in `<data_dir>/mcp.json` (optional).
- Add the sidecar config as a fragment in `<data_dir>/mcp.d/hapi-bridge.json`.
- Start the bridge service separately, then start MicroClaw.
Example fragment:

```json
{
  "mcpServers": {
    "hapi_bridge": {
      "transport": "streamable_http",
      "endpoint": "http://127.0.0.1:3010/mcp",
      "request_timeout_secs": 180
    }
  }
}
```

Reference template: repository file `mcp.hapi-bridge.example.json`.
Step-by-step guide: `docs/operations/hapi-bridge.md`.
If any configured MCP server exposes both `memory_query` and `memory_upsert`, MicroClaw enables MCP-first structured-memory operations:
- prompt-time structured-memory reads/searches
- explicit `remember ...` fast-path writes
- reflector insert/update/supersede/touch flows
- structured-memory tool operations (`structured_memory_search`/`update`/`delete`)
Fallback policy:
- on MCP call failure, timeout, or invalid/unknown response shape, MicroClaw falls back to built-in SQLite memory automatically
- fallback is per operation (best-effort), so one MCP failure does not disable memory for the whole runtime
- fallback reasons are classified and surfaced as stable categories in health state, for example `timeout`, `transport`, `invalid_payload`, or `unsupported_operation`
Operational checks:
- confirm startup logs include `Memory MCP backend enabled via server '<name>'`
- confirm the startup health probe does not log `memory MCP startup probe failed`; if it does, MicroClaw keeps the external provider attached but serves operations via SQLite fallback as needed
- verify the server tool list includes the exact names `memory_query` and `memory_upsert`
- inspect `/api/health` or config self-check output for `memory_backend` health:
  - `startup_probe_ok=false` means the provider was discovered but failed the initial `memory_query` (list) probe
  - high `consecutive_primary_failures` or `total_fallbacks` means the runtime is degrading to SQLite frequently
  - `last_fallback_reason` records the most recent classified fallback reason
- if semantic retrieval quality drops while MCP is enabled, note that local `sqlite-vec` KNN ranking is skipped for MCP-backed rows
Consistency model:
- fallback is evaluated per operation, not per runtime: a failed MCP read/write can fall back to SQLite even while later operations retry the external provider
- this prioritizes availability over strict cross-store consistency; short-lived divergence between SQLite and the external memory store is therefore possible
- explicit memory reads/writes and tool calls continue through SQLite fallback when the external provider is unhealthy
- background reflector writes are paused when startup probe fails or repeated primary-provider failures indicate the external provider is unhealthy; this limits silent divergence from bulk background writes
- after the external provider recovers, new operations go back to MCP-first behavior, but data written only through SQLite fallback is not automatically backfilled into the external store
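The per-operation fallback behavior can be modeled with a toy sketch; `mcp_call`, `sqlite_query`, and `memory_query_op` below are stand-ins for illustration, not real MicroClaw functions:

```shell
MCP_HEALTHY=0   # simulate an unhealthy external provider

# Stand-in for a memory_query call against the external MCP provider.
mcp_call() {
  if [ "$MCP_HEALTHY" = "1" ]; then echo "mcp:$1"; else return 1; fi
}

# Stand-in for the built-in SQLite memory path.
sqlite_query() {
  echo "sqlite:$1"
}

# Fallback is decided per operation: every call tries MCP first.
memory_query_op() {
  mcp_call "$1" || sqlite_query "$1"
}

memory_query_op "recent notes"   # served by SQLite while the provider is down
MCP_HEALTHY=1
memory_query_op "recent notes"   # the next operation is MCP-first again
```

Note how the sketch mirrors the availability-over-consistency tradeoff: anything written only via the SQLite path while the provider was down is not backfilled once MCP recovers.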
Minimal MCP config example (memory MCP server + local filesystem):

```json
{
  "defaultProtocolVersion": "2024-11-05",
  "mcpServers": {
    "filesystem": {
      "transport": "stdio",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "."]
    },
    "memory": {
      "transport": "streamable_http",
      "endpoint": "http://127.0.0.1:8090/mcp",
      "headers": {
        "Authorization": "Bearer REPLACE_ME"
      },
      "request_timeout_secs": 60
    }
  }
}
```

The memory server must expose MCP tools named exactly `memory_query` and `memory_upsert`.