Harness events

Every agent run involves harness-side activity that isn't part of the model's response — retries, rate-limit backoffs, model fallbacks, context trims, cache hits, guardrail triggers, budget cuts, stream interrupts, tool lifecycle events. These shape the production behavior of the agent and are usually invisible to standard chat traces.

The harness_event record kind captures them in a single line per event.

Taxonomy

The taxonomy is closed: every event falls into exactly one of nine categories, so a diff renderer can compare like with like across runs.

Category          Meaning
retry             The harness retried a failed call
rate_limit        A 429 or upstream rate-limit signal
model_switch      Routed to a different model (cost, fallback, A/B)
context_trim      Tokens dropped to fit the window
cache             Prompt-cache hit or fill
guardrail         A guardrail (Bedrock / Lakera / Llama Guard / NeMo) fired
budget            A budget cap (cost, tokens, time) was hit
stream_interrupt  The streaming response was cut short
tool_lifecycle    Tool registered, deregistered, or hot-swapped

Each event also carries a name (a sub-event identifier within the category, e.g. retry.attempted, cache.hit), a severity (info / warning / error / fatal), and a free-form attributes dict.
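
To make the event shape concrete, here is a minimal sketch of the four fields as a validated Python dataclass. HarnessEvent, CATEGORIES, and SEVERITIES are hypothetical names used only for illustration, not part of the shadow API, and the tokens_saved attribute key is made up. The checks mirror the rules stated on this page (closed category set, non-empty name, four severities, free-form attributes).

```python
from dataclasses import dataclass, field
from typing import Any

# The nine categories and four severities listed on this page.
CATEGORIES = frozenset({
    "retry", "rate_limit", "model_switch", "context_trim", "cache",
    "guardrail", "budget", "stream_interrupt", "tool_lifecycle",
})
SEVERITIES = ("info", "warning", "error", "fatal")


@dataclass(frozen=True)
class HarnessEvent:
    """Hypothetical in-memory shape of one event (not the library's own type)."""

    category: str
    name: str
    severity: str
    attributes: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if not isinstance(self.name, str) or not self.name:
            raise ValueError("name must be a non-empty string")
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        # attributes is deliberately left unvalidated: it is free-form.


# "tokens_saved" is an illustrative attribute key, not a documented one.
event = HarnessEvent("cache", "cache.hit", "info", {"tokens_saved": 1200})
```

Keeping the category set closed while leaving attributes open is what lets two traces be compared key by key without every new event type needing a schema change.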

Recording

from shadow.sdk import Session
from shadow.v02_records import record_harness_event

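# Open a session that writes records to trace.agentlog.
with Session(output_path="trace.agentlog") as s:
    # Log one retry event; attributes carries free-form context.
    record_harness_event(
        s,
        category="retry",
        name="retry.attempted",
        severity="warning",
        attributes={"reason": "anthropic 503, retry 1/3"},
    )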

name is required and must be a non-empty string. Use it to subdivide a category (e.g. category="cache", name="cache.hit"). attributes is intentionally schemaless — typed-attribute validation isn't enforced at the record layer so adding new event types doesn't require code changes.

Diff

shadow diff --harness-diff renders a per-(category, name) diff, separating regressions (candidate has more) from fixes (candidate has fewer). Within each group, entries are sorted by severity descending then by absolute count delta descending — so the most severe new event appears first.

harness events: 2 regression(s), 1 fix(es), 0 unchanged

regressions (candidate has more):
  🔴 rate_limit.: 1 → 3 (+2) first at pair 0
  🟠 retry.: 2 → 4 (+2) first at pair 0

fixes (candidate has fewer):
  ✓ context_trim.: 2 → 0 (-2)

A markdown variant is emitted for PR comments (two tables — regressions and fixes — with severity emoji).
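
The grouping and ordering rules above can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not shadow's actual code: diff_harness_events is a hypothetical helper, events are modelled as plain (category, name, severity) tuples, and a key's severity is assumed to be the worst severity seen for it in either trace. The sub-event names rate_limit.hit and context_trim.applied are made up for the example.

```python
from collections import Counter

# Severity rank, least to most severe (order taken from the severity list).
SEVERITY_RANK = {"info": 0, "warning": 1, "error": 2, "fatal": 3}


def diff_harness_events(baseline, candidate):
    """Split per-(category, name) count changes into regressions and fixes.

    Each input is an iterable of (category, name, severity) tuples.
    Returns (regressions, fixes, unchanged_count); each row is
    ((category, name), baseline_count, candidate_count, severity_rank).
    """
    baseline, candidate = list(baseline), list(candidate)
    base_counts = Counter((c, n) for c, n, _ in baseline)
    cand_counts = Counter((c, n) for c, n, _ in candidate)

    # Assumption: a key's severity is the worst one seen on either side.
    worst = {}
    for c, n, sev in baseline + candidate:
        worst[(c, n)] = max(worst.get((c, n), 0), SEVERITY_RANK[sev])

    regressions, fixes, unchanged = [], [], 0
    for key, rank in worst.items():
        before, after = base_counts[key], cand_counts[key]
        row = (key, before, after, rank)
        if after > before:
            regressions.append(row)  # candidate has more
        elif after < before:
            fixes.append(row)  # candidate has fewer
        else:
            unchanged += 1

    def order(row):
        # Severity descending, then absolute count delta descending.
        _, before, after, rank = row
        return (-rank, -abs(after - before))

    return sorted(regressions, key=order), sorted(fixes, key=order), unchanged


baseline = (
    [("retry", "retry.attempted", "warning")] * 2
    + [("context_trim", "context_trim.applied", "info")] * 2
    + [("rate_limit", "rate_limit.hit", "error")]
)
candidate = (
    [("retry", "retry.attempted", "warning")] * 4
    + [("rate_limit", "rate_limit.hit", "error")] * 3
)
regressions, fixes, unchanged = diff_harness_events(baseline, candidate)
```

With these inputs the rate_limit row sorts ahead of the retry row because both have a delta of +2 and error outranks warning, which matches the ordering in the sample output.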