Shadow diff, 3 response pairs

Baseline: sha256:abc12…ff09
Candidate: sha256:def34…aa11

| axis | baseline | candidate | delta | 95% CI | severity | n |
|------|---:|---:|---:|:---:|------|---:|
| semantic similarity | 1.000 | 0.942 | -0.058 | [-0.11, -0.01] | 🟡 minor | 3 |
| tool-call trajectory | 0.000 | 0.333 | +0.333 | [+0.20, +0.50] | 🔴 severe | 3 |
| refusal / safety | 0.000 | 0.000 | +0.000 | [-0.00, +0.00] | 🟢 none | 3 |
| verbosity | 45.000 | 48.000 | +3.000 | [-2.00, +8.00] | 🟢 none | 3 |
| latency | 3421.000 | 4108.000 | +687.000 | [+450.00, +920.00] | 🟠 moderate | 3 |
| cost | 0.000 | 0.000 | +0.000 | [+0.00, +0.00] | 🟢 none | 3 |
| reasoning depth | 0.000 | 0.000 | +0.000 | [+0.00, +0.00] | 🟢 none | 3 |
| llm-judge score | 0.000 | 0.000 | +0.000 | [+0.00, +0.00] | 🟢 none | 0 |
| format conformance | 1.000 | 1.000 | +0.000 | [+0.00, +0.00] | 🟢 none | 3 |

Worst severity: 🔴 severe
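
The worst-severity line reads as the maximum over the per-axis severities. A minimal sketch of that reduction follows, assuming the ordering none < minor < moderate < severe. The names `SEVERITY_RANK`, `worst_severity`, and `worst_axes` are hypothetical and are not taken from Shadow's API; the severity values are copied from the per-axis rows of this report.

```python
# Assumed ordering of severity labels, lowest to highest.
SEVERITY_RANK = {"none": 0, "minor": 1, "moderate": 2, "severe": 3}

# Per-axis severities as reported in this diff.
axis_severity = {
    "semantic similarity": "minor",
    "tool-call trajectory": "severe",
    "refusal / safety": "none",
    "verbosity": "none",
    "latency": "moderate",
    "cost": "none",
    "reasoning depth": "none",
    "llm-judge score": "none",
    "format conformance": "none",
}


def worst_severity(severities):
    """Return the highest-ranked severity label across all axes."""
    return max(severities.values(), key=lambda label: SEVERITY_RANK[label])


def worst_axes(severities):
    """Return the axes whose severity equals the worst one."""
    worst = worst_severity(severities)
    return [axis for axis, label in severities.items() if label == worst]


print(worst_severity(axis_severity))
print(worst_axes(axis_severity))
```

Listing the axes that reach the worst level makes it clear which row drives the overall verdict; here that is the tool-call trajectory axis.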

Per-axis sample counts

| axis | n |
|------|---:|
| semantic | 3 |
| trajectory | 3 |
| safety | 3 |
| verbosity | 3 |
| latency | 3 |
| cost | 3 |
| reasoning | 3 |
| judge | 0 |
| conformance | 3 |

Generated by Shadow.