Planner troubleshooting¶
What it is / when to use it¶
This page is a symptom-driven checklist for common ReactPlanner issues.
Use it when you are debugging:
- schema/JSON failures and repair loops,
- tool selection/visibility problems,
- parallel join behavior,
- memory scoping and pause/resume issues.
Non-goals / boundaries¶
- This page is not a replacement for planner contracts; it links to canonical contract pages.
- This page does not cover ToolNode configuration in depth (see tools runbooks).
Contract surface (things to check first)¶
Before deep-diving, confirm these knobs and contracts:
- production configuration: Configuration
- tool discovery and per-tenant filtering: Tool discovery & filtering
- action contract: Actions & schema
- tool design: Tool design
- tool discovery/integration: Tooling
- rich UI outputs: Rich output
- safety/policy: Guardrails
- pause/resume contract: Pause/resume (HITL)
- memory isolation: Memory
- runtime control: Steering
- concurrent work: Background tasks
- playbook retrieval: Skills
- provider integration: Native LLM layer / LLM clients
Operationally important planner settings:
json_schema_mode,repair_attempts,arg_fill_enabled,max_consecutive_arg_failuresmax_iters,deadline_s,hop_budgetabsolute_max_parallelandplanning_hints.max_parallelllm_timeout_s,llm_max_retries
Operational defaults (fast path)¶
If you’re in production and need a safe “baseline”:
- use
temperature=0.0,json_schema_mode=True - keep tool schemas small and add
@tool(examples=...) - enforce tool visibility/policy (reduce catalog surface)
- set timeouts/budgets to avoid infinite waits
Failure modes & recovery¶
The sections below are organized by symptom. For each, prefer:
- confirm the contract (what the planner expects)
- reduce surface area (visibility, schemas, concurrency)
- instrument (events + structured logs)
Observability¶
If you can’t explain a failure quickly, you usually need an event stream:
- configure
event_callbackand recordPlannerEvent.to_payload() - alert on
budget_exhausted, high repair counts, high tool error rates, and frequent pauses
Security / multi-tenancy notes¶
- Treat
llm_contextas LLM-visible andtool_contextas privileged. - Don’t “fix” tool visibility bugs by adding secrets to
llm_context. - Treat
resume_tokenas an authorization capability.
Invalid JSON loops¶
Symptoms
- Many repeated “repair” attempts
- Planner hits
budget_exhausted
Likely causes
- model cannot reliably emit strict JSON
- too-large tool catalog or ambiguous tool names
Fix
- use a stronger model or enable
json_schema_mode=True - reduce tool surface (policy/visibility)
- add tool examples
- keep args schemas small
“Tool not found” / unknown next_node¶
Symptoms
- planner reports invalid node/tool name
Fix
- ensure the tool is present in the catalog
- ensure the tool is visible to the model (tool policy / visibility)
- avoid multiple tools with near-identical names
Arg-fill stuck / repeated invalid args¶
Symptoms
- repeated tool selection with invalid args
Fix
- keep args model minimal and well-typed
- add
exampleson the tool spec - keep
arg_fill_enabled=True - adjust
max_consecutive_arg_failuresto force followup instead of looping
Parallel join skipped¶
Symptoms
- join shows
reason="branch_failures"
Fix
- inspect branch failures
- add retries/timeouts to flaky tools
- lower ToolNode concurrency to reduce rate-limiting
Memory disabled unexpectedly¶
Symptoms
- memory strategy configured but no memory appears in
llm_context
Likely causes
MemoryIsolation.require_explicit_key=Trueand no key could be derived
Fix
- pass
memory_key=MemoryKey(...)explicitly, or - configure isolation paths and ensure
tool_contextcontains those keys
Pause token invalid/expired¶
Symptoms
resume(...)raises KeyError for token
Fix
- ensure pause records are persisted (StateStore) in distributed deployments
- ensure tokens are not re-used (tokens are consumed on load in some stores)
- ensure TTL policies match your UX
Guardrail stops / redaction surprises¶
Symptoms
- tool call fails with
guardrail_stop:* - output is unexpectedly redacted
- planner pauses with
reason="approval_required"even though your tools didn’t callctx.pause(...)
Fix
- confirm whether the gateway is in
shadoworenforcemode - inspect the
failure.guardrailpayload in tool call results (it includesrule_idandreason) - tune rule configs (allowlists, redaction patterns, timeouts)
See Guardrails.
Steering cancellation / redirect confusion¶
Symptoms
SteeringCancelledraised unexpectedly- “redirect” doesn’t take effect until later
Fix
- ensure you pass the inbox into
run(..., steering=inbox)/resume(..., steering=inbox) - treat
SteeringCancelledas a normal control-path outcome (surface it to callers/UI) - validate/sanitize steering payloads at ingress
See Steering.
Background tasks don’t spawn¶
Symptoms
tasks.spawnfails withtask_service_unavailableorsession_id_missing- tools marked as background-enabled still run inline
Fix
- ensure tasks.* tools are in the catalog (foreground)
- ensure
tool_context["task_service"]is present andtool_context["session_id"]is a string - confirm
BackgroundTasksConfig.enabled=Trueand (for tool-initiated)allow_tool_background=True
See Background tasks.
Troubleshooting checklist¶
- Nothing happens: check that
catalogis non-empty and tool visibility isn’t filtering everything out. - Planner is slow: check LLM latency, ToolNode concurrency/rate limits, and whether parallelism is being throttled.
- Planner is unsafe: confirm allowlists/HITL for write/external tools and verify secrets never reach
llm_context.