Skip to content

Planner observability

What this is / when to use it

This page documents how to observe and operate ReactPlanner in production:

  • structured event streams (planner-level)
  • what to log and what to redact
  • what to monitor and alert on

Non-goals / boundaries

  • This page does not mandate a specific telemetry backend (Datadog, Prometheus, OpenTelemetry).
  • This page does not cover runtime-level FlowEvent in detail (see docs/observability/*).

Contract surface

event_callback

ReactPlanner accepts an event_callback that receives PlannerEvent objects:

  • penguiflow.planner.models.PlannerEvent

Common event types include:

  • step_start, step_complete
  • llm_call
  • tool_call_start, tool_call_end, tool_call_result
  • pause, resume, finish
  • stream_chunk, artifact_chunk, llm_stream_chunk
  • observation_clamped (planner-level safety net for oversized observations)
  • steering_received (when steering inbox events are drained)
  • guardrail_retry (when a guardrail requests an LLM retry)

Runnable example: logging PlannerEvent

import logging

from penguiflow.planner.models import PlannerEvent


log = logging.getLogger("penguiflow.planner.events")


def on_event(ev: PlannerEvent) -> None:
    # Avoid reserved logging keys; PlannerEvent.to_payload() filters them.
    log.info(ev.event_type, extra=ev.to_payload())

Operational defaults (what to record)

At minimum record:

  • finish reason (answer_complete / no_path / budget_exhausted)
  • planner step count and hop budget usage
  • LLM call latency and retries
  • tool call latency and error rates
  • pause frequency and resume latency
  • observation clamping/truncation events (if enabled)

Failure modes & recovery

“Works locally, fails in production”

Common causes:

  • missing env vars for ToolNode (${VAR} substitution is fail-fast)
  • tool visibility differs by tenant/user
  • state store not configured for distributed pause/resume

Silent data leaks in logs

Do not log:

  • raw llm_context (may contain user text or derived sensitive content)
  • raw tool outputs for external APIs

Prefer:

  • artifact references (ids + metadata)
  • redacted summaries

Security / multi-tenancy notes

  • Treat tool_context as privileged: it may contain secrets and clients.
  • Keep per-tenant isolation in:
  • memory keys
  • tool visibility policies
  • artifact scopes and access checks

Troubleshooting checklist

  • No planner events: confirm event_callback is passed and not overwritten by per-session dispatch.
  • Missing stream chunks: verify stream_final_response and tool streaming usage; confirm UI is wired to the stream sink.
  • High tool error rate: tighten retries/timeouts and reduce concurrency to respect rate limits.