CLI reference¶

`shadow quickstart [PATH]`¶

Scaffold a working Shadow scenario in PATH (default shadow-quickstart). No API keys required. See Install and first diff.

`shadow init [PATH]`¶

Scaffold .shadow/ in PATH. --github-action also drops .github/workflows/shadow-diff.yml. Path-traversal hardened - refuses system directories (/etc, /usr, etc.).

`shadow record -- <cmd>`¶

Run <cmd> with zero-config auto-instrumentation. Writes to -o path.agentlog. Flags:

--tags KEY=V,K=V, metadata tags
--no-auto-instrument, skip the sitecustomize shim
Fail-fast writability preflight on the output path

`shadow replay <config> --baseline <trace>`¶

Replay <trace> through <config> via --backend {mock,positional}. mock returns the baseline response verbatim; positional uses a recorded reference trace (--reference <path>) and replays the candidate against it. Live LLM backends (anthropic / openai) live on the diff path through --judge-backend, not on replay.

Partial replay (v1.2)¶

Lock a baseline prefix verbatim and replay only the suffix through the backend:

--partial, enable partial-replay mode
--branch-at N, 0-based turn index where live replay begins (0 = fully live, same as default; >= len(turns) = full-baseline copy)

Useful for "what would have happened from turn 3 onward if the model had stayed on the baseline path through turn 2?" experiments.

`shadow diff <baseline> <candidate>`¶

Nine-axis behavioural diff. Key flags:

--judge {none,auto,sanity,pairwise,llm,procedure,schema,factuality,refusal,tone}
--judge-config <file.yaml> for rubric-based judges
--judge-backend {mock,anthropic,openai} for live judges
--explain for LLM-sourced paragraph summary
--hierarchical for session-level breakdown
--pricing <file.json> for cost attribution
--output-json <file> to write the full report

v1.2 additions¶

--token-diff, per-dimension token distribution (input / output / thinking) with median + p25 + p75 + p95 + max + total; plus the top-k worst per-pair deltas. See Hierarchical diff, token-level.
--policy path/to/rules.yaml, check a declarative YAML policy overlay against both traces and classify rule violations as regressions vs fixes. Supports 12 rule kinds: must_call_before, must_call_once, no_call, max_turns, required_stop_reason, max_total_tokens, must_include_text, forbidden_text, must_match_json_schema, must_remain_consistent, must_followup, must_be_grounded. Each rule can carry a when: clause that gates it on field-path conditions (operators: ==, !=, >, >=, <, <=, in, not_in, contains, not_contains). See Behavior policy.
--fail-on {minor,moderate,severe}, exit non-zero when the worst axis severity or policy regression hits the threshold. Default is never (post the report, exit 0). Use --fail-on severe to gate a PR merge on agent regressions.
--suggest-fixes, layer an LLM pass on top of the deterministic recommendation engine to produce concrete code-level fix proposals. Each suggestion is grounded on a deterministic anchor (ungrounded model output is rejected). Requires a live backend (--judge-backend anthropic|openai or --judge auto with the corresponding env var set). Retry/backoff on 429/5xx/timeout.

v2.4 additions¶

--harness-diff, render a per-(category, name) diff over harness_event records (retry, rate_limit, model_switch, context_trim, cache, guardrail, budget, stream_interrupt, tool_lifecycle). Regressions appear before fixes, ordered by severity then absolute count delta.
--multimodal-diff, render a per-blob diff over blob_ref records. Cheap tier uses 64-bit dHash Hamming distance; semantic tier uses cosine similarity over recorded embeddings when both sides have them. Identical blob_id short-circuits.

`shadow gate <report.json>`¶

Apply --fail-on to a saved report.json (produced by shadow diff --output-json) without re-running the diff. Designed for CI flows that already produced the report for the PR comment and want to gate the merge as a separate, cheap step:

shadow diff base.agentlog cand.agentlog --output-json report.json
shadow gate report.json --fail-on severe

With --policy <yaml>, the gate also recomputes policy regressions from the original traces (passed via --baseline / --candidate) and counts them toward the threshold. Without --policy, it gates purely on axis severity and is fast.

`shadow bisect <config_a> <config_b> --traces <trace>`¶

LASSO-over-corners causal attribution. --backend anthropic|openai enables live-replay mode; default (none) uses the heuristic allocator. --candidate-traces <trace> supplies a candidate trace when the backend is none.

`shadow schema-watch <config_a> <config_b>`¶

Tool-schema change detection. --format {terminal,markdown,json}. --fail-on {breaking,risky,additive,neutral,none}.

`shadow report <report.json>`¶

Re-render a saved JSON report. --format {terminal,markdown,github-pr}.

`shadow import <source> --format <fmt>`¶

Import foreign traces to .agentlog. Supported formats (v1.2):

langfuse, Langfuse traces export
braintrust, Braintrust experiment row export (JSONL or array)
langsmith, LangSmith runs export (top-level array)
openai-evals, OpenAI Evals JSONL
otel, OpenTelemetry OTLP/JSON with GenAI semconv attributes
mcp, Model Context Protocol session log (JSONL, JSON array, or wrapped {messages: [...]})
vercel-ai (new in v1.2), Vercel AI SDK telemetry export (OTLP-style {spans: [...]} or dashboard-style {events: [...]})
pydantic-ai (new in v1.2), PydanticAI all_messages_json() output or Logfire span export

`shadow export <trace>`¶

Export to otel (OTLP/JSON) for OpenTelemetry collectors.

`shadow join <logs...>`¶

Merge multiple .agentlog files into one logical trace via meta.trace_id.

`shadow mine <traces...>`¶

Cluster a corpus of production traces by tool sequence, stop reason, response length, and latency, then surface representative cases as a regression suite. Output is a list of (trace_id, cluster_id, why) triples that you can commit alongside the agent as your golden test set.

`shadow mcp-serve`¶

Run Shadow as a Model Context Protocol server over stdio. Any MCP-aware agentic CLI (Claude Code, Cursor, Zed, Claude Desktop, Windsurf) can invoke Shadow as a tool. Tools exposed:

shadow_diff
shadow_check_policy
shadow_token_diff
shadow_schema_watch
shadow_summarise
shadow_certify (v1.7.2+)
shadow_verify_cert (v1.7.2+)

Install the extra first: pip install 'shadow-diff[mcp]'. See MCP importer for the reverse direction (importing MCP traces into Shadow).

`shadow certify <trace>`¶

Generate an Agent Behavior Certificate (ABOM) for a release. The certificate is a content-addressed JSON release artefact capturing the trace's content-id, all distinct models, content-ids of system prompts, content-ids of tool schemas, optional policy hash, and an optional baseline-vs-candidate nine-axis regression-suite rollup.

Required: --agent-id <id> and --output <path>. Optional: --policy <file> (records its hash), --baseline <trace> (folds in a regression-suite rollup), --pricing <file> (for the regression-suite cost axis), --seed <int>.

The certificate is self-verifying via shadow verify-cert. See Release certificate.

`shadow verify-cert <cert>`¶

Verify a certificate's content-addressed cert_id matches the body. Exits 0 when consistent, 1 on tamper, malformed payload, or unsupported cert_version. Designed to run as a release-pipeline gate.

`shadow serve`¶

Start the live diff dashboard (requires the serve extra: pip install 'shadow-diff[serve]').

`shadow version`¶

Prints the installed Shadow version + .agentlog spec version.

CLI reference¶

shadow quickstart [PATH]¶

shadow init [PATH]¶

shadow record -- <cmd>¶

shadow replay <config> --baseline <trace>¶

Partial replay (v1.2)¶

shadow diff <baseline> <candidate>¶

v1.2 additions¶

v2.4 additions¶

shadow gate <report.json>¶

shadow bisect <config_a> <config_b> --traces <trace>¶

shadow schema-watch <config_a> <config_b>¶

shadow report <report.json>¶

shadow import <source> --format <fmt>¶

shadow export <trace>¶

shadow join <logs...>¶

shadow mine <traces...>¶

shadow mcp-serve¶

shadow certify <trace>¶

shadow verify-cert <cert>¶

shadow serve¶

shadow version¶