Errors, retries, and timeouts¶

What it is / when to use it¶

PenguiFlow emphasizes predictable runtime behavior:

timeouts and retries are configured via NodePolicy,
errors are wrapped with trace and node metadata (FlowError),
failures are emitted as structured FlowEvent signals (and optionally persisted).

Use these knobs when you need “production semantics”:

bound latency with timeouts,
retry transient failures,
surface terminal failures to a caller or UI.

Non-goals / boundaries¶

Retries cannot make non-idempotent side effects safe. Design nodes to tolerate retries (or gate commits).
The runtime does not automatically classify “transient vs permanent” for you beyond timeouts and basic exception semantics.

Contract surface¶

NodePolicy knobs¶

NodePolicy controls validation and reliability behaviors:

timeout_s: hard timeout for a node invocation
max_retries: number of retries allowed after the first failed attempt
backoff_base, backoff_mult, max_backoff: exponential backoff parameters
validate: "both" | "in" | "out" | "none"

Example:

from penguiflow import NodePolicy

policy = NodePolicy(
    validate="both",
    timeout_s=10.0,
    max_retries=3,
    backoff_base=0.5,
    backoff_mult=2.0,
    max_backoff=10.0,
)

Retry semantics:

attempts start at 0
on failure, the runtime retries while attempt < max_retries
total attempts = max_retries + 1
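The counting rules above can be sketched as a plain loop. This is an illustrative model, not PenguiFlow's implementation: `run_with_retries` and `backoff_delay` are hypothetical helpers, and the delay formula (`backoff_base * backoff_mult ** attempt`, capped at `max_backoff`) is an assumption about how the three backoff parameters combine.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float, mult: float, cap: float) -> float:
    """Assumed formula: exponential growth from `base`, capped at `cap`."""
    return min(base * (mult ** attempt), cap)


def run_with_retries(
    fn: Callable[[], T],
    max_retries: int,
    base: float = 0.5,
    mult: float = 2.0,
    cap: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn` up to max_retries + 1 times, sleeping between attempts."""
    attempt = 0  # attempts start at 0
    while True:
        try:
            return fn()
        except Exception:
            # Retry only while attempt < max_retries; otherwise give up.
            if attempt >= max_retries:
                raise
            sleep(backoff_delay(attempt, base, mult, cap))
            attempt += 1


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky() -> str:
        # Fails three times, then succeeds on the fourth call.
        calls["n"] += 1
        if calls["n"] <= 3:
            raise RuntimeError("transient failure")
        return "ok"

    delays: list = []
    print(run_with_retries(flaky, max_retries=3, sleep=delays.append), delays)
```

With `max_retries=3` the function is invoked at most four times, which matches "total attempts = max_retries + 1". Passing `sleep` as a parameter keeps the sketch testable without real waiting.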

Operational defaults¶

Prefer small timeouts on network-bound nodes.
Keep retries bounded (max_retries 1–3) and backoff reasonable.
Emit errors to the sink only when you want caller-visible failures:
create(..., emit_errors_to_rookery=True)

Recommended defaults¶

Use short timeouts on network-bound nodes.
Make tool calls idempotent where possible.
Prefer retrying on transient errors only (timeouts, 429/5xx).

What happens on failure¶

When a node raises:

PenguiFlow emits a node_error (or node_timeout) event with exception metadata.
If retries remain (attempt < max_retries), it emits node_retry, sleeps for an exponential backoff delay, and re-invokes the node.
Once retries are exhausted, PenguiFlow creates a FlowError with:
trace_id, node_name, node_id
an error code (FlowErrorCode.NODE_EXCEPTION / FlowErrorCode.NODE_TIMEOUT)
attempt/latency metadata

By default, FlowError is logged and observable via events. If you want errors to be treated as “normal outputs”, enable:

create(..., emit_errors_to_rookery=True)

This routes FlowError to the Rookery sink so fetch() can return it.
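Once errors can come back from fetch(), the caller has to tell a failure apart from a normal result. The sketch below shows that branching with a stand-in class so that it runs without PenguiFlow installed. `FlowErrorStandIn`, its `message` field, and `describe` are hypothetical names for illustration; the real FlowError type and its attribute names should be taken from the library.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class FlowErrorStandIn:
    """Hypothetical stand-in for FlowError; not the library's class."""

    trace_id: str
    node_name: str
    code: str  # e.g. "NODE_EXCEPTION" or "NODE_TIMEOUT"
    message: str  # assumed field, for illustration only


def describe(result: Union[str, FlowErrorStandIn]) -> str:
    """Branch on the fetched value: error object versus normal output."""
    if isinstance(result, FlowErrorStandIn):
        return (
            f"failed [{result.code}] at {result.node_name} "
            f"(trace {result.trace_id}): {result.message}"
        )
    return f"ok: {result}"


if __name__ == "__main__":
    print(describe("hello"))
    print(describe(FlowErrorStandIn("t-1", "fetch_page", "NODE_TIMEOUT", "timed out")))
```

In real code the same check would sit directly after the fetch() call, with the terminal failure surfaced to the caller or UI in the error branch.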

Retry-safe node design¶

Retries are only safe if your node is idempotent (or can tolerate duplicates).

Common strategies:

Use request ids / trace ids as idempotency keys when calling external services.
For side-effecting operations, separate “plan” and “commit” steps and gate the commit step (HITL or explicit checks).
Keep timeouts low and bound I/O so cancellation and deadlines can interrupt work.
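The first strategy can be illustrated with a small model. `PaymentService` and `charge_node` are hypothetical; the point is only that reusing the trace id as the idempotency key lets the downstream service collapse a retried call into a single side effect. This assumes the external service actually deduplicates by key.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PaymentService:
    """Hypothetical external service that deduplicates by idempotency key."""

    charges: Dict[str, int] = field(default_factory=dict)
    calls: int = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        self.calls += 1
        # Only the first request for a given key records a charge.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = amount_cents
        return f"charge-{idempotency_key}"


def charge_node(service: PaymentService, trace_id: str, amount_cents: int) -> str:
    """Node body: pass the trace id through as the idempotency key."""
    return service.charge(idempotency_key=trace_id, amount_cents=amount_cents)


if __name__ == "__main__":
    service = PaymentService()
    first = charge_node(service, "trace-123", 500)
    retry = charge_node(service, "trace-123", 500)  # simulated retry
    print(first == retry, service.calls, len(service.charges))
```

The service sees two calls but records one charge, so a retry after a timeout or transient error does not bill twice.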

Failure modes & recovery¶

Retries cause duplicate side effects¶

Fix

add idempotency keys (use trace_id as the request id where appropriate)
split “plan” and “commit” nodes and gate commit behind HITL/policy

Timeouts fire but work continues¶

Timeouts cancel the node invocation task, but external systems may continue work if you triggered a non-cancellable request.
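The effect is easy to reproduce with plain asyncio, independent of PenguiFlow. In this sketch a blocking call running in a worker thread stands in for a non-cancellable request: `asyncio.wait_for` raises on timeout and cancels the awaiting task, but the thread keeps running to completion. `blocking_call` and `demo` are illustrative names.

```python
import asyncio
import threading
import time
from typing import Tuple

finished = threading.Event()


def blocking_call() -> None:
    """Stand-in for a request that cannot be cancelled once started."""
    time.sleep(0.2)
    finished.set()  # the "external" work completes regardless of the timeout


async def demo() -> Tuple[bool, bool]:
    timed_out = False
    try:
        # The timeout cancels the awaiting task, not the worker thread.
        await asyncio.wait_for(asyncio.to_thread(blocking_call), timeout=0.05)
    except asyncio.TimeoutError:
        timed_out = True
    # Wait (up to 1 s) to observe whether the background work still finishes.
    completed_anyway = await asyncio.to_thread(finished.wait, 1.0)
    return timed_out, completed_anyway


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Both values come back true: the caller saw a timeout, and the work finished anyway. This is why side-effecting calls behind a timeout still need idempotency keys or a gated commit step.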