CLI reference¶
shadow quickstart [PATH]¶
Scaffold a working Shadow scenario in PATH (default
shadow-quickstart). No API keys required. See
Install and first diff.
shadow init [PATH]¶
Scaffold .shadow/ in PATH. --github-action also drops
.github/workflows/shadow-diff.yml. Path-traversal hardened -
refuses system directories (/etc, /usr, etc.).
shadow record -- <cmd>¶
Run <cmd> with zero-config auto-instrumentation. Writes to
-o path.agentlog. Flags:
--tags KEY=V,K=V, metadata tags--no-auto-instrument, skip the sitecustomize shim- Fail-fast writability preflight on the output path
shadow replay <config> --baseline <trace>¶
Replay <trace> through <config> via --backend {mock,positional}.
mock returns the baseline response verbatim; positional uses a
recorded reference trace (--reference <path>) and replays the
candidate against it. Live LLM backends (anthropic / openai) live
on the diff path through --judge-backend, not on replay.
Partial replay (v1.2)¶
Lock a baseline prefix verbatim and replay only the suffix through the backend:
--partial, enable partial-replay mode--branch-at N, 0-based turn index where live replay begins (0= fully live, same as default;>= len(turns)= full-baseline copy)
Useful for "what would have happened from turn 3 onward if the model had stayed on the baseline path through turn 2?" experiments.
shadow diff <baseline> <candidate>¶
Nine-axis behavioural diff. Key flags:
--judge {none,auto,sanity,pairwise,llm,procedure,schema,factuality,refusal,tone}--judge-config <file.yaml>for rubric-based judges--judge-backend {mock,anthropic,openai}for live judges--explainfor LLM-sourced paragraph summary--hierarchicalfor session-level breakdown--pricing <file.json>for cost attribution--output-json <file>to write the full report
v1.2 additions¶
--token-diff, per-dimension token distribution (input / output / thinking) with median + p25 + p75 + p95 + max + total; plus the top-k worst per-pair deltas. See Hierarchical diff, token-level.--policy path/to/rules.yaml, check a declarative YAML policy overlay against both traces and classify rule violations as regressions vs fixes. Supports 12 rule kinds:must_call_before,must_call_once,no_call,max_turns,required_stop_reason,max_total_tokens,must_include_text,forbidden_text,must_match_json_schema,must_remain_consistent,must_followup,must_be_grounded. Each rule can carry awhen:clause that gates it on field-path conditions (operators:==,!=,>,>=,<,<=,in,not_in,contains,not_contains). See Behavior policy.--fail-on {minor,moderate,severe}, exit non-zero when the worst axis severity or policy regression hits the threshold. Default isnever(post the report, exit 0). Use--fail-on severeto gate a PR merge on agent regressions.--suggest-fixes, layer an LLM pass on top of the deterministic recommendation engine to produce concrete code-level fix proposals. Each suggestion is grounded on a deterministic anchor (ungrounded model output is rejected). Requires a live backend (--judge-backend anthropic|openaior--judge autowith the corresponding env var set). Retry/backoff on 429/5xx/timeout.
v2.4 additions¶
--harness-diff, render a per-(category, name) diff overharness_eventrecords (retry, rate_limit, model_switch, context_trim, cache, guardrail, budget, stream_interrupt, tool_lifecycle). Regressions appear before fixes, ordered by severity then absolute count delta.--multimodal-diff, render a per-blob diff overblob_refrecords. Cheap tier uses 64-bit dHash Hamming distance; semantic tier uses cosine similarity over recorded embeddings when both sides have them. Identicalblob_idshort-circuits.
shadow gate <report.json>¶
Apply --fail-on to a saved report.json (produced by shadow diff
--output-json) without re-running the diff. Designed for CI flows
that already produced the report for the PR comment and want to
gate the merge as a separate, cheap step:
shadow diff base.agentlog cand.agentlog --output-json report.json
shadow gate report.json --fail-on severe
With --policy <yaml>, the gate also recomputes policy regressions
from the original traces (passed via --baseline / --candidate)
and counts them toward the threshold. Without --policy, it gates
purely on axis severity and is fast.
shadow bisect <config_a> <config_b> --traces <trace>¶
LASSO-over-corners causal attribution. --backend anthropic|openai
enables live-replay mode; default (none) uses the heuristic
allocator. --candidate-traces <trace> supplies a candidate trace
when the backend is none.
shadow schema-watch <config_a> <config_b>¶
Tool-schema change detection. --format {terminal,markdown,json}.
--fail-on {breaking,risky,additive,neutral,none}.
shadow report <report.json>¶
Re-render a saved JSON report. --format {terminal,markdown,github-pr}.
shadow import <source> --format <fmt>¶
Import foreign traces to .agentlog. Supported formats (v1.2):
langfuse, Langfusetracesexportbraintrust, Braintrust experiment row export (JSONL or array)langsmith, LangSmith runs export (top-level array)openai-evals, OpenAI Evals JSONLotel, OpenTelemetry OTLP/JSON with GenAI semconv attributesmcp, Model Context Protocol session log (JSONL, JSON array, or wrapped{messages: [...]})vercel-ai(new in v1.2), Vercel AI SDK telemetry export (OTLP-style{spans: [...]}or dashboard-style{events: [...]})pydantic-ai(new in v1.2), PydanticAIall_messages_json()output or Logfire span export
shadow export <trace>¶
Export to otel (OTLP/JSON) for OpenTelemetry collectors.
shadow join <logs...>¶
Merge multiple .agentlog files into one logical trace via
meta.trace_id.
shadow mine <traces...>¶
Cluster a corpus of production traces by tool sequence, stop reason,
response length, and latency, then surface representative cases as a
regression suite. Output is a list of (trace_id, cluster_id, why)
triples that you can commit alongside the agent as your golden test
set.
shadow mcp-serve¶
Run Shadow as a Model Context Protocol server over stdio. Any MCP-aware agentic CLI (Claude Code, Cursor, Zed, Claude Desktop, Windsurf) can invoke Shadow as a tool. Tools exposed:
shadow_diffshadow_check_policyshadow_token_diffshadow_schema_watchshadow_summariseshadow_certify(v1.7.2+)shadow_verify_cert(v1.7.2+)
Install the extra first: pip install 'shadow-diff[mcp]'. See
MCP importer for the reverse direction
(importing MCP traces into Shadow).
shadow certify <trace>¶
Generate an Agent Behavior Certificate (ABOM) for a release. The certificate is a content-addressed JSON release artefact capturing the trace's content-id, all distinct models, content-ids of system prompts, content-ids of tool schemas, optional policy hash, and an optional baseline-vs-candidate nine-axis regression-suite rollup.
Required: --agent-id <id> and --output <path>. Optional:
--policy <file> (records its hash), --baseline <trace> (folds
in a regression-suite rollup), --pricing <file> (for the
regression-suite cost axis), --seed <int>.
The certificate is self-verifying via shadow verify-cert. See
Release certificate.
shadow verify-cert <cert>¶
Verify a certificate's content-addressed cert_id matches the
body. Exits 0 when consistent, 1 on tamper, malformed payload, or
unsupported cert_version. Designed to run as a release-pipeline
gate.
shadow serve¶
Start the live diff dashboard (requires the serve extra:
pip install 'shadow-diff[serve]').
shadow version¶
Prints the installed Shadow version + .agentlog spec version.