Shadow diff, 3 response pairs
Baseline: sha256:abc12…ff09
Candidate: sha256:def34…aa11
| axis | baseline | candidate | delta | 95% CI | severity | n |
|---|---|---|---|---|---|---|
| semantic similarity | 1.000 | 0.942 | -0.058 | [-0.11, -0.01] | 🟡 minor | 3 |
| tool-call trajectory | 0.000 | 0.333 | +0.333 | [+0.20, +0.50] | 🔴 severe | 3 |
| refusal / safety | 0.000 | 0.000 | +0.000 | [-0.00, +0.00] | 🟢 none | 3 |
| verbosity | 45.000 | 48.000 | +3.000 | [-2.00, +8.00] | 🟢 none | 3 |
| latency | 3421.000 | 4108.000 | +687.000 | [+450.00, +920.00] | 🟠 moderate | 3 |
| cost | 0.000 | 0.000 | +0.000 | [+0.00, +0.00] | 🟢 none | 3 |
| reasoning depth | 0.000 | 0.000 | +0.000 | [+0.00, +0.00] | 🟢 none | 3 |
| llm-judge score | 0.000 | 0.000 | +0.000 | [+0.00, +0.00] | 🟢 none | 0 |
| format conformance | 1.000 | 1.000 | +0.000 | [+0.00, +0.00] | 🟢 none | 3 |
Worst severity: 🔴 severe
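
The worst-severity line can be read as the maximum of the per-axis severity labels. The sketch below illustrates that reading only: the ordering `none < minor < moderate < severe` and the helper name `worst_severity` are assumptions made here, not Shadow's documented behavior or API.

```python
# Assumed ranking of severity labels, lowest to highest.
# This ordering is an illustration, not taken from Shadow itself.
SEVERITY_ORDER = ["none", "minor", "moderate", "severe"]


def worst_severity(severities):
    """Return the highest-ranked label among the given severity labels."""
    return max(severities, key=SEVERITY_ORDER.index)


# Severity column copied from the table above, keyed by axis.
axis_severity = {
    "semantic similarity": "minor",
    "tool-call trajectory": "severe",
    "refusal / safety": "none",
    "verbosity": "none",
    "latency": "moderate",
    "cost": "none",
    "reasoning depth": "none",
    "llm-judge score": "none",
    "format conformance": "none",
}

print(worst_severity(axis_severity.values()))
```

Under this assumed ordering, the tool-call trajectory axis alone determines the overall result.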
Per-axis sample counts
| axis | n |
|------|---:|
| semantic | 3 |
| trajectory | 3 |
| safety | 3 |
| verbosity | 3 |
| latency | 3 |
| cost | 3 |
| reasoning | 3 |
| judge | 0 |
| conformance | 3 |

Generated by Shadow.