Skip to content

Shadow vs the rest of the AI-agent tooling lane

Honest matrix. Where Shadow overlaps with adjacent tools, where the overlap is real, and where each tool has a unique strength Shadow doesn't. No vaporware claims; if a Shadow column says "✓" it's verified by a committed test.

Last updated: 2026-05-03 against the v0.1 release.


TL;DR

Tool What it does best Where Shadow doesn't compete
EvalView Golden-baseline regression testing for AI agents — tool-call/parameter diffs, production traffic capture, GitHub Actions, framework adapters Generic eval framework
Microsoft AGT Runtime governance with sub-millisecond policy enforcement, multiple language SDKs, OPA/Rego/Cedar support, marketplace signing Runtime guardrails / control plane
Preloop Open-source AI-agent control plane with MCP firewall, model gateway, policy-as-code, approvals, audit trails MCP firewall / approvals / runtime observability
AgentEvals Behavior scoring from OpenTelemetry traces without rerunning expensive LLM calls Generic OTel trace scoring
Speedscale API/code production-traffic replay before merge, before/after payload diffs Generic API/payload replay

Shadow's lane: Causal Regression Forensics for AI Agents. Names the exact change that broke the agent — proven against production-like traces before merge, with bootstrap CI + E-value, plus a verified fix loop.


Where Shadow is differentiated

What Shadow does that none of the above does in one tool:

Capability Shadow EvalView MS AGT Preloop AgentEvals Speedscale
Causal cause attribution with Pearl-style ATE + bootstrap CI + E-value sensitivity
Single command that names the exact prompt/model/tool/config change that caused the regression
Verify-fix loop closing diagnose → fix → verify
9-axis structured behavior diff (semantic, trajectory, safety, verbosity, latency, cost, reasoning, judge, conformance) partial partial
Bootstrap CI on the per-axis severities
Reusable trace-alignment library exposed as a category primitive ✓ (Python + TS)

Where each adjacent tool wins

Capability Shadow EvalView MS AGT Preloop AgentEvals Speedscale
Hosted SaaS dashboard
Auto-suggested test cases from production traffic partial
MCP-server firewall
Sub-millisecond runtime policy enforcement partial
OPA/Rego/Cedar policy languages
Multi-language runtime SDKs (Python + TS + Java + .NET + Go) partial (Py + TS) partial partial partial
OTLP-collector ingestion as a first-class input partial (file-based) partial partial
Marketplace + signing (AIUC-1 / Schellman) partial (sigstore)
API/payload replay (non-LLM HTTP traffic)

Choose Shadow when

  • A PR-time CI gate that names the exact change that broke the agent matters more than dashboards.
  • You need bootstrap CI + E-value on causal claims; "did behavior change" alone isn't enough.
  • You want the diagnose → fix → verify loop closed in one tool, not three.
  • You've already instrumented agents with OTel and want a causal-diagnosis layer that consumes those traces.

Choose [EvalView / AGT / Preloop / AgentEvals / Speedscale] when

  • The bullet under their column above is your primary requirement.
  • You want a hosted dashboard as the main UX (Shadow is CLI-first; the GitHub Action is the only first-class UI).
  • You need runtime policy enforcement with sub-millisecond latency in production (Shadow's policy_runtime exists but isn't the headline; AGT and Preloop are purpose-built here).

What Shadow doesn't try to be

(Repeating the design spec §1.3 explicit non-goals so this comparison is truthful, not aspirational.)

  • ABOM expansion beyond the existing certificate format
  • Generic runtime governance suite
  • A control plane competing with Microsoft AGT or Preloop
  • Generic agent-eval framework competing with EvalView
  • Certification marketplace
  • Broad MCP firewall

Shadow's lane is causal regression forensics. If an adjacent tool already nails one of the above, Shadow integrates with it (shadow import --format otel-genai) rather than replicating it.


Sources

This page is a living comparison; PRs welcome to update tool capabilities as they ship features. Shadow capabilities marked "✓" are pinned by committed tests; if a claim is wrong, file an issue and link the failing test.