# Nine-axis diff
Shadow's core report is a nine-row table with a statistically rigorous delta and 95% CI on each axis. The axes cover what actually matters about an agent's behaviour, not what's easy to measure.
## The nine axes
| Axis | What it measures | Unit |
|---|---|---|
| semantic | Final-response text similarity (TF-IDF cosine by default; sentence-transformer embeddings with the `[embeddings]` extra) | 0–1 (1 = identical) |
| trajectory | Tool-call sequence edit distance | 0–1 (0 = identical) |
| safety | Refusal rate | 0–1 |
| verbosity | Response length in output tokens | tokens |
| latency | End-to-end wall-clock | ms |
| cost | Per-response USD spend | $ |
| reasoning | Reasoning / thinking token depth | tokens |
| judge | LLM-as-judge score (empty unless `--judge` is set) | 0–1 |
| conformance | Schema / JSON parseability rate | 0–1 |
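As an illustration of the default semantic metric, here is a minimal pure-Python TF-IDF cosine over a two-document corpus. This is a sketch only; Shadow's actual tokenisation and weighting may differ.

```python
import math
from collections import Counter

def tfidf_cosine(a: str, b: str) -> float:
    """Illustrative TF-IDF cosine similarity between two responses."""
    docs = [a.lower().split(), b.lower().split()]
    vocab = set(docs[0]) | set(docs[1])
    # Smoothed IDF over the two-document corpus.
    idf = {t: math.log(2 / sum(t in d for d in docs)) + 1 for t in vocab}
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: tf[t] * idf[t] for t in vocab})
    dot = sum(vecs[0][t] * vecs[1][t] for t in vocab)
    na = math.sqrt(sum(v * v for v in vecs[0].values()))
    nb = math.sqrt(sum(v * v for v in vecs[1].values()))
    return dot / (na * nb) if na and nb else 0.0

print(round(tfidf_cosine("the cat sat", "the cat sat"), 3))  # identical → 1.0
```

Identical responses score 1.0; responses sharing no terms score 0.0, matching the axis's 0–1 range.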
## Statistical guarantees

- Bootstrap 95% CIs: 1000 paired resamples per axis, percentile method. CI bounds are emitted even on small samples, and the `low_power` flag fires automatically when n < 5.
- Severity tiers: `none` / `minor` / `moderate` / `severe`, computed from both the effect size and the CI bracket. A delta whose CI crosses zero is capped at `minor`, regardless of point estimate.
- No hidden coercion: every axis reports in its raw units. No "normalised score" that hides magnitude.
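The paired percentile bootstrap above can be sketched in a few lines. This is a minimal stand-alone version; Shadow's exact resampling details may differ.

```python
import random
import statistics

def bootstrap_ci(baseline, candidate, n_boot=1000, seed=0):
    """Paired bootstrap 95% CI on the median delta, percentile method."""
    rng = random.Random(seed)
    # Pairing: resample (baseline, candidate) pairs jointly via their deltas.
    deltas = [c - b for b, c in zip(baseline, candidate)]
    n = len(deltas)
    medians = []
    for _ in range(n_boot):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        medians.append(statistics.median(sample))
    medians.sort()
    lo = medians[int(0.025 * n_boot)]   # 2.5th percentile
    hi = medians[int(0.975 * n_boot)]   # 97.5th percentile
    flags = ["low_power"] if n < 5 else []
    return lo, hi, flags

lo, hi, flags = bootstrap_ci([1.0, 1.0, 1.0], [0.1, 0.2, 0.05])
print(flags)  # n = 3 < 5, so the low_power flag fires
```

Note how the CI is still emitted at n = 3; the `low_power` flag is the only signal that it should not be trusted.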
## Output format
### Terminal

```text
Shadow diff, 5 response pair(s)
baseline : sha256:8fc9f133…
candidate: sha256:11a5b3a2…
┏━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━┓
┃ axis        ┃ baseline ┃ candidate ┃  delta ┃ 95% CI          ┃ severity ┃ flags ┃ n ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━╇━━━┩
│ semantic    │    1.000 │     0.061 │ -0.939 │ [-0.95, -0.85]  │ severe   │       │ 5 │
│ trajectory  │    0.000 │     1.000 │ +1.000 │ [+1.00, +1.00]  │ severe   │       │ 5 │
│ conformance │    1.000 │     0.000 │ -1.000 │ [-1.00, -1.00]  │ severe   │       │ 5 │
│ …           │          │           │        │                 │          │       │   │
└─────────────┴──────────┴───────────┴────────┴─────────────────┴──────────┴───────┴───┘
worst severity: severe
```
### Markdown (PR comment)
Same data as a GitHub-flavoured markdown table, with emoji severity indicators:
| axis | baseline | candidate | delta | 95% CI | severity | n |
|------|---------:|----------:|------:|:-------|:---------|---:|
| semantic | 1.000 | 0.061 | -0.939 | [-0.95, -0.85] | 🔴 severe | 5 |
| trajectory | 0.000 | 1.000 | +1.000 | [+1.00, +1.00] | 🔴 severe | 5 |
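The emoji indicators imply a simple tier-to-emoji mapping. A hypothetical sketch of how one row of the PR comment could be emitted (`md_row` and the `EMOJI` table are illustrative, not Shadow's API):

```python
# Hypothetical severity → emoji mapping for the PR-comment renderer.
EMOJI = {"none": "⚪", "minor": "🟡", "moderate": "🟠", "severe": "🔴"}

def md_row(axis, baseline, candidate, delta, ci, severity, n):
    """Render one axis as a GitHub-flavoured markdown table row."""
    ci_str = f"[{ci[0]:+.2f}, {ci[1]:+.2f}]"
    return (f"| {axis} | {baseline:.3f} | {candidate:.3f} | {delta:+.3f} "
            f"| {ci_str} | {EMOJI[severity]} {severity} | {n} |")

print(md_row("semantic", 1.0, 0.061, -0.939, (-0.95, -0.85), "severe", 5))
```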
### JSON (machine-readable)

```json
{
  "rows": [
    {"axis": "semantic", "baseline_median": 1.0, "candidate_median": 0.061,
     "delta": -0.939, "ci95_low": -0.95, "ci95_high": -0.85,
     "severity": "severe", "flags": [], "n": 5}
  ],
  "drill_down": [...],
  "first_divergence": {...},
  "divergences": [...],
  "recommendations": [...]
}
```
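The JSON form is convenient for gating CI. A hypothetical gate that fails when any axis is `severe` and its CI excludes zero (field names follow the schema above; the gating policy itself is an assumption, not a Shadow feature):

```python
import json

report = json.loads("""{
  "rows": [
    {"axis": "semantic", "baseline_median": 1.0, "candidate_median": 0.061,
     "delta": -0.939, "ci95_low": -0.95, "ci95_high": -0.85,
     "severity": "severe", "flags": [], "n": 5}
  ]
}""")

# Flag axes where severity is severe AND the CI excludes zero
# (a CI crossing zero is already capped at minor, but check defensively).
severe = [
    r["axis"] for r in report["rows"]
    if r["severity"] == "severe" and not (r["ci95_low"] <= 0 <= r["ci95_high"])
]
if severe:
    print(f"severe regressions on: {', '.join(severe)}")  # → semantic
```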
## Reading the report
- Worst severity appears at the top. If it's `severe`, stop and read the "What this means" paragraph first.
- Low n warning: at n < 5, bootstrap CIs are unreliable. Record more pairs (10+ is a conservative floor).
- Top divergences lists the specific turn(s) where the candidate diverged from the baseline. Structural > decision > style drift in priority order.
- Recommendations: prescriptive one-line fixes, with severity tier and action kind (restore / remove / revert / review / verify).
- Drill-down: ranks the most regressive pair(s) with per-axis normalised scores. Use this to click into the worst specific turn.
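The drill-down ranking can be pictured as sorting pairs by their mean normalised per-axis score, worst first. A toy sketch under that assumption, not Shadow's exact scoring:

```python
# Toy drill-down: rank response pairs by the mean of their per-axis
# normalised regression scores (1.0 = worst), most regressive first.
pairs = [
    {"pair": 0, "scores": {"semantic": 0.9, "trajectory": 1.0, "conformance": 1.0}},
    {"pair": 1, "scores": {"semantic": 0.2, "trajectory": 0.1, "conformance": 0.0}},
]

def mean_regression(pair: dict) -> float:
    scores = pair["scores"]
    return sum(scores.values()) / len(scores)

worst_first = sorted(pairs, key=mean_regression, reverse=True)
print([p["pair"] for p in worst_first])  # most regressive pair first
```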
## See also

- Judges: populating the `judge` axis
- Causal bisection: attribute regressions to specific config deltas