Hierarchical diff¶

Shadow's reports sit at six levels of granularity. All six ship today:

Level	What it covers	Shipped in
trace	Whole `.agentlog`, nine-axis table	v1.0
session	One user-facing conversation (between metadata records)	v1.0
turn	One chat request/response pair	v1.0 (drill-down)
span	One content block (text, tool_use, tool_result) within a turn	v1.0
token	Per-pair input / output / thinking token distribution	v1.2
policy	Declarative YAML rules checked against the trace	v1.2

Session-level¶

shadow diff baseline.agentlog candidate.agentlog --hierarchical

Prints a worst-severity-per-session rollup after the nine-axis table:

Hierarchical diff, 3 session(s):
  ✓  session #0: 5 pair(s), worst severity none
  ✗  session #1: 4 pair(s), worst severity severe
  !  session #2: 3 pair(s), worst severity moderate

Useful when your trace contains 10 conversations and only 2 regressed - the flat trace-level table would dilute the signal.

Span-level¶

Programmatic access:

from shadow.hierarchical import span_diff

spans = span_diff(baseline_response_payload, candidate_response_payload)
for s in spans:
    print(f"{s.kind}: {s.summary}")

Span kinds:

text_block_changed, text drifted (reports char-shingle similarity)
tool_use_added / tool_use_removed
tool_use_args_changed, same tool, different argument keys/values
tool_result_changed, is_error flip or content differ
stop_reason_changed
block_type_changed, type swap at same index

Alignment¶

≤ 5 blocks per side, greedy index alignment (optimal and cheap)
> 5 blocks, Needleman-Wunsch alignment with cost model that prefers add + remove over block_type_changed when types mismatch. Catches the real case where an agent has 20+ tool calls per turn and the candidate inserts one in the middle, greedy would misreport every downstream block as block_type_changed.

Verified with a dedicated test that drops block #10 of a 20-block response; NW alignment correctly reports a single tool_use_removed, zero cascaded changes.

Token-level (v1.2)¶

shadow diff baseline.agentlog candidate.agentlog --token-diff

Per-dimension distribution summaries (median, p25, p75, p95, max, total) for input_tokens / output_tokens / thinking_tokens, plus a ranked list of the top-k worst per-pair deltas:

Token-level diff, 10 pair(s):
  input_tokens       baseline median 100.0  p95 118.0  →  candidate median 100.0  p95 118.0  shift +0.00%
  output_tokens      baseline median  40.0  p95  48.0  →  candidate median  60.0  p95  72.0  shift +50.00%
  thinking_tokens    baseline median   0.0  p95   0.0  →  candidate median   5.0  p95   5.0  shift +inf
  worst pairs (by absolute token delta):
    · turn #3: input +0, output +400, thinking +5
    · turn #7: input +0, output +150, thinking +5

Programmatic access: shadow.hierarchical.token_diff(baseline, candidate) returns a TokenDiff dataclass with per-dimension TokenDistSummary + per-pair TokenPairDelta entries.

Scale-tested to 10k pair traces in under 500ms.

Policy-level (v1.2)¶

A declarative YAML overlay of rules checked against both traces:

# policy.yaml
rules:
  - id: backup-before-migrate
    kind: must_call_before
    params: {first: backup_database, then: run_migration}
    severity: error
  - id: no-ssn-leak
    kind: forbidden_text
    params: {text: "SSN:"}
    severity: error
  - id: finish-cleanly
    kind: required_stop_reason
    params: {allowed: [end_turn, tool_use]}
    severity: error

shadow diff baseline.agentlog candidate.agentlog --policy policy.yaml

Output classifies every rule's violations as regressions (new in candidate) vs fixes (cleared in candidate):

Policy diff, 0 baseline violation(s), 2 candidate violation(s)
  regressions (2):
    ✗ [critical] no-ssn-leak @ turn #2: forbidden text 'SSN:' present in response
    ✗ [critical] backup-before-migrate @ turn #0: `backup_database` must be called before `run_migration`

All 12 supported rule kinds¶

Kind	Params	Checks
`must_call_before`	`first`, `then` (tool names)	`first` must be invoked before `then` when `then` appears
`must_call_once`	`tool`	exactly one invocation of `tool`
`no_call`	`tool`	tool must never be invoked
`max_turns`	`limit` (int)	trace has ≤ `limit` chat_response records
`required_stop_reason`	`allowed` (list of strings)	final stop_reason must be in the allowed set
`max_total_tokens`	`limit` (int)	total input+output+thinking tokens ≤ `limit`
`must_include_text`	`text`	at least one response body contains `text`
`forbidden_text`	`text`	no response body contains `text`
`must_match_json_schema`	`schema` (inline) or `schema_path`	every response's text content parses as JSON and validates against the schema
`must_remain_consistent`	`path` (dotted)	once a value is observed at `path`, every later pair must equal it
`must_followup`	`trigger` (conditions), `must` (`tool_call` or `text_includes`)	when `trigger` holds in pair N, pair N+1 must satisfy `must`
`must_be_grounded`	`retrieval_path`, `min_unigram_precision`	response must overlap meaningfully with retrieved chunks at `retrieval_path`

See Behavior policy for the full reference, including conditional when: gating, scope (trace / session), and severity → --fail-on mapping.

Programmatic: shadow.hierarchical.policy_diff(baseline, candidate, rules).