# Changelog
All notable changes to Shadow are documented here. Format follows Keep a Changelog and Conventional Commits.
## [Unreleased]

## [2.4.0] - 2026-04-25

The two final roadmap items shipped — every entry in ROADMAP's "What's next" is now in "Shipping today." ROADMAP.md is deleted; remaining work is tracked as GitHub issues against this repo.

### Added
- Cross-modal semantic diff axis in `shadow.multimodal_diff` — compares `blob_ref` records across two traces. Two-tier comparison aligned with RAGAS / TruLens / DeepEval / LangSmith / Langfuse conventions:
    - Cheap tier: 64-bit dHash Hamming distance (always available when `phash` is on the records). Threshold ≤ 10/64 = "near-duplicate" (severity none), 10–16 = "minor visual drift," > 16 = "moderate." The cheap tier alone never escalates to severe — there isn't enough signal.
    - Semantic tier: cosine similarity over `embedding.vec` when both sides have an embedding of the same model. Threshold ≥ 0.85 = "same content" (none), ≥ 0.75 = "same subject" (minor), ≥ 0.5 = "moderate," < 0.5 = "severe." Per LangSmith / Langfuse defaults.
    - Severity decision: semantic wins when both tiers are present (embeddings are higher signal). Identical `blob_id` short-circuits to none (content-addressing means same id = same bytes). Unmatched blobs (one side has more than the other) are flagged severe — the candidate either lost or introduced a blob the baseline didn't have.
    - Renderers: `render_terminal()` for CLI output, `render_markdown()` for PR comments. Both render unchanged blobs silently — only show what changed.
- Harness-event diff renderer in `shadow.harness_diff_render` — surfaces `harness_event_diff` output (regressions, fixes, count deltas, first-occurrence pair indices) as reviewer-friendly text:
    - `render_terminal()`: separates regressions from fixes, sorts regressions by severity descending then absolute delta descending, emits severity-coloured glyphs (🔴 error, 🟠 warning, 🟡 info).
    - `render_markdown()`: two-table PR-comment layout — a regressions table with a severity column + first-occurrence pair index, and a simpler fixes table. Empty input returns a one-line notice so callers can pipe unconditionally.
- `shadow diff` gains two new flags: `--harness-diff` surfaces the harness-event diff inline in the report, `--multimodal-diff` runs the cross-modal axis. Both default off; cost is zero when the trace has no relevant records.
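The two-tier severity decision described above can be sketched roughly as follows. This is a minimal illustration, not `shadow.multimodal_diff`'s actual API — the function names and the no-signal fallback value are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity; None on length mismatch or zero-norm vectors."""
    if len(a) != len(b):
        return None
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return None
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def blob_pair_severity(blob_id_a, blob_id_b, hamming=None, cos=None):
    # Identical content-address short-circuits: same id = same bytes.
    if blob_id_a == blob_id_b:
        return "none"
    # Semantic tier wins whenever an embedding comparison is available.
    if cos is not None:
        if cos >= 0.85:
            return "none"      # same content
        if cos >= 0.75:
            return "minor"     # same subject
        if cos >= 0.5:
            return "moderate"
        return "severe"
    # Cheap tier (dHash Hamming distance) never escalates past moderate.
    if hamming is not None:
        if hamming <= 10:
            return "none"      # near-duplicate
        if hamming <= 16:
            return "minor"     # minor visual drift
        return "moderate"
    return "moderate"  # no signal at all (fallback value is an assumption)
```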
### Tests

- 24 new tests at `python/tests/test_v24_renderers.py` — cosine identity / orthogonality / opposite / zero-norm / length-mismatch, dHash near-dup / far / no-signal severity classification, semantic-takes-precedence-over-phash, unmatched-blob severity, worst-severity aggregation, terminal + markdown rendering shape, severity ordering in the harness renderer (errors before warnings before info), CLI integration with the `--harness-diff` flag.
### Roadmap

- ROADMAP.md is deleted. Every entry that was in "What's next" has shipped: streaming replay (v2.3 `chunk` records), multimodal traces (v2.3 `blob_ref` + v2.4 cross-modal diff axis), harness-diff instrumentation (v2.3 `harness_event` records + v2.4 renderer), MCP-native replay (v2.3), TypeScript streaming parity (v2.2), auto-instrument-layer pre-dispatch (v2.2). Future work is tracked as GitHub issues against the repo.
810 pytest, 205 cargo, 34 vitest, ci-local green, mkdocs --strict green.
## [2.3.0] - 2026-04-25

`.agentlog` v0.2 + MCP-native replay. Four roadmap items shipped together. Each design choice was researched against canonical conventions (OpenTelemetry GenAI semconv, stable Jan 2026; OpenInference; Langfuse v3 media API; RAGAS / TruLens / DeepEval multimodal baselines; the MCP SEP-1287 draft) before implementation.

### Added
- `.agentlog` v0.2 spec (SPEC §4.8 / §4.9 / §4.10) adds three new record kinds. Backwards-compatible: every v0.1 record still validates. v0.2 readers MUST treat unknown kinds as passthrough so future spec additions don't break old tools.
    - `chunk` (§4.8) — a single streaming-LLM chunk with `chunk_index`, absolute `time_unix_nano` (per OTel convention; relative offsets drift on long streams), provider-shaped `delta` (Anthropic `text_delta` / `input_json_delta` / `thinking_delta`, OpenAI `{content?, tool_calls?[]}`), optional `is_final`. Logical-response identity remains the assembled `chat_response`'s content-id.
    - `harness_event` (§4.9) — a single record kind with a `category` discriminator over the closed taxonomy `{retry, rate_limit, model_switch, context_trim, cache, guardrail, budget, stream_interrupt, tool_lifecycle}` (matches OTel `gen_ai.cache.*`, `gen_ai.guardrail.*`, etc.). Each event carries `name`, `severity ∈ {info, warning, error, fatal}`, and free-form `attributes`. Single-kind-with-discriminator beats kinds-per-event because new event types don't require code changes — the same lesson Langfuse / Helicone / Phoenix all hit.
    - `blob_ref` (§4.10) — content-addressed binary reference: sha256 `blob_id`, `mime`, `size_bytes`, optional `agentlog-blob://` URI (mirrors OTel's `otel-blob://`), optional 64-bit dHash `phash` (the RAGAS / TruLens / DeepEval no-LLM-judge baseline; Hamming ≤ 10/64 = near-dup, ≥ 16 = different), optional `embedding` for the semantic-tier diff. Inline base64 stays permitted under a 4 KiB cap; anything larger is a `blob_ref`, keeping records parseable in line-buffered tools.
- `shadow.v02_records` Python module with full recording + diff support:
    - `record_harness_event(session, *, category, name, severity, attributes)` — validates category + severity at record time so typos surface up front instead of as silent diff misses.
    - `record_chunk(session, *, chunk_index, delta, is_final, time_unix_nano)` — `time_unix_nano` defaults to `time.time_ns()` at the call site.
    - `replay_chunks_async(chunks, yielder, speed=1.0)` — monotonic-deadline replay loop, NOT cumulative `sleep(delta)` (cumulative drifts on long streams; deadline-relative stays accurate). Handles non-monotonic timestamps without deadlocking. The `speed` multiplier accepts `1e9` for effectively-instant replay.
    - `BlobStore(root)` — git-objects-style sharded sha256 blob store with atomic temp-file + rename for crash safety. Identical content collapses to one file across repeated puts.
    - `compute_phash_dhash64(image_bytes)` — optional `imagehash` dep; returns the SPEC-shaped `{algo: dhash64, hex: ...}` or None when the lib is missing.
    - `phash_distance(a, b)` — Hamming distance over hex; returns None on algo mismatch so callers can branch.
    - `record_blob_ref(session, *, blob, mime, store)` — content-addresses + writes the blob, computes dHash for `image/*` mime types, appends a `blob_ref` record.
- `harness_event_diff(baseline, candidate)` — returns `[HarnessEventDelta]` keyed on `(category, name)`, with count delta and first-occurrence pair index for both sides, sorted by absolute count delta descending.
- `shadow.mcp_replay` Python module — protocol-level MCP replay via the transport-stream shim pattern (the research-recommended path; survives SDK upgrades, aligns with SEP-1287's `replay://` URI scheme):
    - `canonicalize_params(params)` — sorted-keys, no-whitespace, `ensure_ascii=False` JSON encoding so non-ASCII URIs in `resources/read` round-trip cleanly.
    - `RecordingIndex(calls)` — indexes `MCPCall` objects by `(method, canonicalize(params))`. Repeated calls return responses in recorded order, then fall back to the last recorded response (preserves "the second `tools/list` returned one fewer tool" behaviour). `unconsumed_keys()` surfaces calls the candidate skipped — drift detection at the protocol layer.
    - `ReplayClientSession(index, strict=False)` — drop-in replacement for `mcp.ClientSession`. Implements `call_tool`, `read_resource`, `list_tools` / `list_resources` / `list_prompts`, `get_prompt`, and `initialize` (with a synthetic capability stub when not recorded). Sync + async variants. `strict=True` raises `MCPCallNotRecorded` on misses; non-strict returns None for null-check paths. Errors in recordings raise `MCPServerError`.
    - `index_from_imported_mcp_records(records)` — builds an index from an MCP-imported `.agentlog` (Shadow's existing `shadow import --format mcp` output). Recognises `tool_call` + paired `tool_result` records, plus `metadata.payload.mcp.calls` for non-tool methods.
- Rust core: `Kind::Chunk`, `Kind::HarnessEvent`, `Kind::BlobRef` added to the `record::Kind` enum so the parser accepts v0.2 records and the replay engine copies them through.
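The monotonic-deadline loop that `replay_chunks_async` is described as using can be sketched as below. A rough sketch under stated assumptions: the `(time_unix_nano, payload)` chunk shape, the clamping behaviour, and all names are illustrative, not the shipped implementation.

```python
import asyncio
import time

async def replay_chunks(chunks, yielder, speed=1.0):
    """Replay chunks against deadlines computed from a single start point,
    so per-chunk scheduling error does not accumulate the way cumulative
    sleep(delta) would on a long stream."""
    if not chunks:
        return
    t0_rec = chunks[0][0]        # recorded start time, nanoseconds
    t0_wall = time.monotonic()   # wall-clock start
    for t_rec, payload in chunks:
        # Clamp: a backward (non-monotonic) timestamp replays immediately
        # instead of deadlocking on a negative sleep.
        offset_ns = max(0, t_rec - t0_rec)
        deadline = t0_wall + offset_ns / 1e9 / speed
        delay = deadline - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)
        await yielder(payload)
```

A `speed` of `1e9` collapses every deadline to "now," giving effectively-instant replay for tests.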
### Tests

- 41 new tests across `python/tests/test_v02_records.py` (22) and `python/tests/test_mcp_replay.py` (19) — chunk record/replay round-trip, replay timing fidelity (deadline loop, non-monotonic timestamps, speed multiplier), harness-event recording + diff at scale, BlobStore dedup + atomic replace + URI scheme, dHash distance correctness, MCP canonicalization (key-order independence, unicode, integer-vs-float distinction), 1000-call lookup performance, repeated-call ordering, error propagation, strict vs non-strict miss handling, drift detection via `unconsumed_keys`.
- Real-world adverse stress harness at `examples/stress_v23x/run_stress.py` — 20 assertions covering a 10K-chunk session round-trip, 5 concurrent replays without state leakage, backward-timestamp non-deadlock, harness diff at scale (sub-100 ms over thousands of records), 1000-put dedup, atomic-replace crash simulation, a 16 MiB blob round-trip, real PNG dHash, 1000-call MCP recording lookup in < 2 ms, and a canonicalize collision matrix (int vs float, key order, unicode). 20/20 passes in 0.32 s wall-clock.
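The canonicalization properties those tests name (key-order independence, unicode round-trip, int-vs-float distinction) fall out of a single `json.dumps` call. A minimal sketch, assuming the encoding described in the `shadow.mcp_replay` entry:

```python
import json

def canonicalize_params(params):
    """Deterministic JSON encoding for index keys: sorted keys and no
    whitespace make key order irrelevant; ensure_ascii=False lets non-ASCII
    URIs round-trip byte-for-byte; json keeps int and float distinct
    ("1" vs "1.0"), so those calls never collide."""
    return json.dumps(params, sort_keys=True,
                      separators=(",", ":"), ensure_ascii=False)
```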
### Roadmap

- "What's next" loses streaming replay, multimodal traces, harness-diff instrumentation, and MCP-native replay (all shipped). Remaining: the cross-modal semantic diff axis (CLIP / Whisper-embed on top of the v0.2 `blob_ref.embedding` slot) and the harness-diff renderer for PR comments.
786 pytest, 205 cargo, 34 vitest, ci-local green, mkdocs --strict green.
## [2.2.0] - 2026-04-25

Two roadmap items shipped together. Both researched against canonical guardrail / auto-instrumentation patterns (NeMo Guardrails, Bedrock Guardrails, OpenTelemetry openai-instrumentation) before implementation — buffer-to-completion + replace-whole-response is the production norm, not strip-individual-blocks.

### Added
- Auto-instrument-layer pre-dispatch enforcement. When an `EnforcedSession` is active, the OpenAI / Anthropic `.create` wrapper now probes the enforcer with every `tool_use` block in the non-streaming response BEFORE returning to user code. Violating tool calls raise `PolicyViolationError` at the wrapped `.create` site — the user's tool dispatcher never sees the violating response, so dangerous tools (`issue_refund`, `send_email`, `execute_sql`, `delete_user`, `deploy_service`) can't fire. No code changes to user tool functions; works for any OpenAI / Anthropic-driven agent. Replace mode at this layer is approximated by raise (modifying SDK response objects across versions is fragile — use `wrap_tools` for finer control). Plain `Session` (no enforcer) is a complete no-op; users who never opted into runtime enforcement see zero behaviour change. 9 new tests at `python/tests/test_instrumentation_predispatch.py` cover the no-op, raise / replace / warn modes, allowed pass-through, repeated-block probe-state cleanliness, and graceful handling of translator errors, plus an end-to-end test driving a fake OpenAI Completions class through the full Instrumentor pipeline.
- TypeScript SDK streaming aggregation. The TS SDK's auto-instrument wrapper now intercepts `stream: true` calls via an async-iterator proxy that yields each chunk through to the caller AND feeds it to a per-provider aggregator. On stream end (or a caller-side break), a single `chat_response` record lands with the assembled content. Two production aggregators:
    - OpenAI: rebuilds text from `choice.delta.content` deltas, reconstructs interleaved `tool_calls` by index (each tool's id / name / arguments string assembled across chunks), captures `finish_reason` from the final chunk, and folds in `usage` if `stream_options: {include_usage: true}` was set.
    - Anthropic: tracks content blocks by index across `content_block_start` / `content_block_delta` / `content_block_stop` events, accumulates text deltas / `input_json_delta` partial JSON / thinking deltas, captures `stop_reason` from `message_delta`, and finalises into the same Message shape `anthropicTranslators.resp()` consumes for non-streaming responses.
- 4 new tests at `typescript/test/instrumentation_streaming.test.ts` covering OpenAI text aggregation, OpenAI tool-call argument-delta reassembly, Anthropic mixed text + `tool_use` block reassembly, and caller-side-break early termination.
- New exports from `typescript/src/instrumentation.ts`: `Translators`, `StreamAggregator`, `openaiTranslators`, `anthropicTranslators`. Lets integration code drive the production aggregators directly.
### Changed

- README TypeScript / Python parity matrix updated — the streaming aggregation row is now ✅ on both.
- ROADMAP "What's next" loses two entries (TS streaming, auto-instrument pre-dispatch). Remaining items: streaming replay (`.agentlog` v0.2 chunk records), multimodal traces, harness-diff instrumentation, MCP-native replay.
## [2.1.0] - 2026-04-25

### Added
- Pre-tool-call (pre-dispatch) policy enforcement. New public API in `shadow.policy_runtime`:
    - `wrap_tools(tools, enforcer, *, session=None, records_provider=None, blocked_replacement=None)` — wraps a `{name: callable}` tool registry. Each entry returns a `GuardedTool` that probes the enforcer with a synthesised candidate `tool_call` record BEFORE invoking the underlying function. On allow, the function runs. On deny: `raise` mode throws `PolicyViolationError`, `replace` mode returns a placeholder (configurable per-tool via `blocked_replacement=`), `warn` mode logs and runs anyway. Catches `no_call`, `must_call_before`, `must_call_once` at the dispatch site for dangerous tools (`issue_refund`, `send_email`, `execute_sql`, `delete_user`, `deploy_service`).
    - `Session.wrap_tools(tools)` — convenience method on `EnforcedSession` that auto-binds the session.
    - `PolicyEnforcer.probe(records)` — non-mutating evaluation. The probe asks "if these records were the trace, would any rule fire?" without remembering the violation in `_known` — repeatedly blocked tool calls don't pollute enforcer state, and a denied probe followed by a real dispatch correctly fires once on the next `evaluate`.
    - `GuardedTool` — the per-tool wrapper. Exposes `.name`, `.fn`, and a `__call__` that performs the probe + dispatch.
- `_extract_tool_call_sequence` now reads standalone `tool_call` records, not only `tool_use` blocks inside `chat_response` content. This is what makes pre-dispatch enforcement work — a synthesised candidate `tool_call` record is now visible to `no_call` / `must_call_before` / `must_call_once` rules. Side benefit: `Session.record_tool_call` calls are now first-class to the policy engine; previously they were invisible to those rules unless paired with an Anthropic-style `tool_use` content block.
- 11 new tests at `python/tests/test_policy_runtime_predispatch.py` covering: probe non-mutation, allowed dispatch passing through, blocked dispatch in all three modes (raise / replace / warn), `must_call_before` ordering enforcement, `wrap_tools` with an explicit `records_provider`, `wrap_tools` requiring either a session or a records_provider, custom `blocked_replacement`, and repeated-block probe-state cleanliness.
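The probe-then-dispatch pattern described above can be sketched like this. The enforcer, record shapes, and class internals here are illustrative stand-ins, not Shadow's real classes.

```python
class PolicyViolationError(Exception):
    pass

class GuardedTool:
    """Per-tool wrapper: probe the policy with a synthesised candidate
    tool_call record BEFORE running the underlying function."""

    def __init__(self, name, fn, probe, records_provider, mode="raise"):
        self.name = name
        self.fn = fn
        self._probe = probe                      # non-mutating: returns violations
        self._records = records_provider         # callable -> trace-so-far
        self._mode = mode                        # "raise" | "replace" | "warn"

    def __call__(self, **kwargs):
        candidate = {"kind": "tool_call", "name": self.name, "args": kwargs}
        violations = self._probe(self._records() + [candidate])
        if violations:
            if self._mode == "raise":
                raise PolicyViolationError(violations[0])
            if self._mode == "replace":
                return {"blocked": True, "tool": self.name}
            # "warn" mode: log (omitted here) and run anyway
        return self.fn(**kwargs)
```

Because the probe never mutates enforcer state, a repeatedly blocked call cannot pollute later evaluations.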
### Docs

- `docs/features/runtime-enforcement.md` adds a "Pre-tool-call enforcement (v2.1)" section covering the new surface, the probe-vs-evaluate distinction, which rule kinds fire pre-dispatch vs response-side, and the `records_provider=` integration point for framework adapters.
- README runtime-enforcement section gains a runnable `s.wrap_tools(...)` example with the `delete_user` blocked case.
- ROADMAP moves "Pre-tool-call interception" out of "What's next." The remaining roadmap entry is auto-instrument-layer pre-dispatch (so OpenAI / Anthropic-driven agents get pre-dispatch enforcement automatically without wrapping their tool registry).
## [2.0.5] - 2026-04-25

All six items the reviewer raised were verified real and fixed.

### Fixed
- `SPEC.md` §3.3 said "A trace MUST NOT contain more than one `metadata` record" — directly contradicting shipping code. `Session.record_metadata()` has been writing additional metadata records to mark session boundaries since v1.4 (a docstring explicitly says "Shadow's session detector treats multiple metadata records in a trace as the canonical session boundary signal"). The spec rule is removed, replaced with an explicit clause documenting that non-root `metadata` records are valid as session-boundary markers, MUST have a non-null `parent`, and that consumers without session-boundary semantics MAY treat them as no-ops.
- `SECURITY.md` "Supported versions" still listed `1.x` and `0.x`. Updated to `2.x` (active) + `1.x` (security fixes only on the latest 1.7.x line) + `0.x` (unsupported).
- `SECURITY.md` overclaimed "end-to-end" private advisories — softened. GitHub's private advisory channel is access-restricted but not cryptographically E2E (GitHub holds the data at rest). The doc now calls it the "preferred private reporting transport" and offers a separate cryptographic channel on request.
- `ROADMAP.md` duplicated runtime enforcement and richer behavior contracts in BOTH "Shipping today" and "What's next" — these shipped in v2.0.0. Removed from "What's next," leaving only the truly outstanding items (streaming replay, multimodal, harness-diff, MCP-native replay, TypeScript streaming parity, tool-call pre-dispatch interception).
- `ROADMAP.md` said "Eight importers" but listed nine (Langfuse, Braintrust, LangSmith, OpenAI Evals, OTLP, MCP, A2A, Vercel AI SDK, PydanticAI). Off-by-one fixed.
- `ROADMAP.md`'s MCP server bullet listed five tools (diff, policy check, token diff, schema watch, summary). v1.7.2 added `shadow_certify` and `shadow_verify_cert`. Now lists all seven.
- `ROADMAP.md` claimed Python and TypeScript auto-instrumentation "including the OpenAI Responses API and streaming" — TypeScript explicitly passes streaming through unrecorded (`typescript/src/instrumentation.ts:10`). The bullet now states the gap honestly: Python covers streaming aggregation; TypeScript currently passes streaming through unrecorded. A new roadmap entry, "TypeScript SDK parity for streaming," tracks closing it.
- CI `python-full-extras` job was installing only six extras (`dev`, `anthropic`, `openai`, `otel`, `serve`, `embeddings`) — it missed `mcp`, `sign`, and the three framework adapters (`langgraph`, `crewai`, `ag2`). Local `ci-local-extras` was already more complete. The CI job now installs every optional extra, including `[sign]` under `--prerelease=allow`. This closes the "local parity stronger than GitHub CI" inversion.
- README claimed the TypeScript SDK "works the same way" as Python. Replaced with an explicit feature-parity matrix that names every gap: TS streaming passes through unrecorded; runtime enforcement / certify / sign / replay / diff / bisect / mine / MCP server are Python-CLI-only. The `.agentlog` format itself is the contract — TS-recorded traces feed into Python's tooling without translation.
## [2.0.4] - 2026-04-25

### Fixed
- `shadow.certify_sign` was breaking mypy `--strict` on CI. The module's lazy `sigstore` imports raise `import-not-found` when sigstore isn't installed (which is the default — sigstore is gated behind the optional `[sign]` extra and additionally requires `--prerelease=allow` at install time because its dependency tree pulls pre-release wheels). CI doesn't install the `[sign]` extra; my local venv had sigstore from a manual install during v1.8 development, which is why ci-local was green locally while v1.8.0–v2.0.3 silently failed mypy on every CI run.
- Fix: add `shadow.certify_sign` to the existing `ignore_errors` mypy override block in `pyproject.toml`, alongside the other optional-extra-only modules (`shadow.serve.*`, `shadow.mcp_server`, `shadow.adapters.*`, `shadow.tools.sandbox`). Verified: with sigstore uninstalled locally, `mypy --strict` now passes; with sigstore installed, the imports type-check normally.
- This is the same local/CI parity-drift class that v1.6.5's ci-local recipe was meant to prevent. The recipe's `python-full-extras` job installs every extra EXCEPT `[sign]` (because the `--prerelease=allow` flag complicates the install command), so CI exposed a mismatch the local recipe didn't. Worth a follow-up to extend `ci-local-extras` with a sigstore-install step under `--prerelease=allow`.
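Assuming a standard mypy per-module override table in `pyproject.toml`, the fix described above would look roughly like this — the module names come from the entry, but the shape of the project's actual override block may differ:

```toml
# Optional-extra-only modules: their imports don't resolve on the
# default (extras-free) install, so mypy --strict skips them.
[[tool.mypy.overrides]]
module = [
  "shadow.certify_sign",
  "shadow.serve.*",
  "shadow.mcp_server",
  "shadow.adapters.*",
  "shadow.tools.sandbox",
]
ignore_errors = true
```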
## [2.0.3] - 2026-04-25

### Fixed

- `docs/features/runtime-enforcement.md` described the enforcer's dedup key as `(rule_id, pair_index, detail)`. The v2.0.1 fix already changed the code to `(rule_id, pair_index)` only, but the docs still showed the old shape. Now corrected with the same explanation as the inline code comment — whole-trace rules embed running counts in the detail string, so detail-keyed dedup let them respam.
### Docs

- README hedges causal bisection from "isolates which specific change caused which specific regression" to "estimates which specific change most likely explains each regression, then points you at the replay / counterfactual primitives to confirm it." Matches the hedged terminal renderer already in the bisect command (the `est.` prefix and `(stable, CI excludes 0)` qualifiers shipped in v1.5).
- README runtime-enforcement headline rewords "block a violating response as it happens" to "block or replace a violating model response at record time." More precise — `EnforcedSession` evaluates after the model returned, not before tool dispatch.
- README CLI reference table for `shadow certify` and `shadow verify-cert` now mentions the v1.8 signing flags (`--sign`, `--verify-signature`, `--cert-identity`) so a reader scanning the table doesn't miss that signing is shipped.
- The README `must_be_grounded` mention in the rule list now flags it as a "cheap lexical grounding gate, not NLI-backed faithfulness," with a pointer to the docs page that documents what it catches and what it doesn't. Same hedging that v2.0.2 added to `docs/features/policy.md`, surfaced inline in the README.
## [2.0.2] - 2026-04-25

### Fixed

- Stale rule-count strings on four docs surfaces still said "9 rule kinds" or "Nine kinds ship today" after v2.0 added three new kinds (`must_remain_consistent`, `must_followup`, `must_be_grounded`):
    - `docs/features/policy.md` header
    - `docs/quickstart/ci.md` next-section link
    - `docs/reference/cli.md` `shadow diff --policy` description
    - `shadow.mcp_server` `shadow_check_policy` tool description

  All four now say "twelve" / "12" and list every kind.
### Docs

- `must_be_grounded` honest scope added to `docs/features/policy.md`. The rule is lexical-overlap, not semantic faithfulness or NLI-backed grounding. It now explicitly documents what it catches (off-topic responses) and what it doesn't (semantically equivalent paraphrase with different vocabulary, citations with unsupported conclusions, claims a chunk contradicts). For deeper grounding, pair with the `Judge` axis or an external faithfulness evaluator. Treat the rule as a cheap CI gate, not a hallucination guarantee.
- Runtime enforcement scope now explicit in the README: `EnforcedSession.record_chat` evaluates AFTER the model response, not before tool dispatch. The README points users at the `enforcer.evaluate(records_so_far)` pattern between model response and tool dispatch when pre-tool blocking matters. Pre-dispatch interception via the auto-instrument layer is documented as roadmap.
- README comparison table softened. Cells for Langfuse / Braintrust / LangSmith on policy rules and merge-blocking moved from "no" to "partial via evals" / "partial via webhooks" with a one-line clarification under the table: those platforms support evals + webhooks + custom CI a team can wire into a PR-comment / gate workflow. Shadow's claim is that it ships the workflow as a single command and ships the trace format / policy language / release certificate as primitives, not that competitors can't be made to work. Braintrust's self-hostable cell softened to "partial."
## [2.0.1] - 2026-04-25

### Fixed

- `PolicyEnforcer` was respamming whole-trace rules every turn after they crossed. The dedup key was `(rule_id, pair_index, detail)`. Whole-trace rules like `max_turns` and `must_call_once` embed a running count in their detail string ("trace has 5 turns; max is 4", then "trace has 6 turns; max is 4", etc.), so each subsequent turn produced a new detail and the enforcer reported it as new. Now keyed on `(rule_id, pair_index)` only — detail is human output, not identity. Caught by the v2.0 real-LLM stress harness; the existing 15 runtime tests still pass and a new regression test (`test_enforcer_whole_trace_rule_with_growing_count_does_not_respam`) locks the fix.
- New committed real-LLM stress harness at `examples/stress_v20x/run_stress.py` — 13 assertions against real OpenAI gpt-4o-mini covering `must_remain_consistent` against live agent behavior, `must_be_grounded` against real RAG context (both grounded and off-topic prompts), all three `EnforcedSession` modes (replace / raise / warn) verifying the on-disk trace shape, incremental violation detection across a 6-turn live trace, the certify + verify-cert pipeline against an `EnforcedSession` output, and three concurrent `EnforcedSession`s. 13/13 passes against real OpenAI in ~17 seconds at well under $0.05.
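The detail-free dedup described above can be sketched in a few lines. The violation shape is illustrative, not Shadow's real dataclass.

```python
class Enforcer:
    """Reports only violations not seen before, keyed on (rule_id, pair_index).
    The human-readable detail is deliberately excluded from the key:
    whole-trace rules embed running counts there, so detail-keyed dedup
    would report the same crossed rule as new on every turn."""

    def __init__(self):
        self._known = set()

    def new_violations(self, violations):
        fresh = []
        for v in violations:
            key = (v["rule_id"], v["pair_index"])
            if key not in self._known:
                self._known.add(key)
                fresh.append(v)
        return fresh
```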
## [2.0.0] - 2026-04-25

Major version bump because v2.0 grows the SDK's public surface (new `shadow.policy_runtime` module with `EnforcedSession` / `PolicyEnforcer`). All v1.x APIs remain backwards-compatible — the existing `Session`, `policy_diff`, `shadow diff --policy`, and certificate workflow are unchanged. The major bump reflects the new public module, not a breaking change to existing code.

### Added
- Three new policy rule kinds for stateful and RAG-aware contracts (twelve rule kinds total now):
    - `must_remain_consistent` — once a value at `path` is observed, every later pair where the path resolves must equal it. Useful for "the agent must not change the refund amount after confirming it." Pairs where the path is absent are skipped (absence ≠ change).
    - `must_followup` — when `trigger` conditions hold in pair N, pair N+1 must satisfy `must` (a tool call by name, or a text-includes substring). A trigger on the final pair is itself a violation. Captures patterns like "after a quote, the next turn must call `confirm_with_user`."
    - `must_be_grounded` — every response must overlap meaningfully with retrieved chunks at `retrieval_path`. The default `min_unigram_precision: 0.5` matches the no-LLM-judge fallback baseline used by RAGAS, TruLens, and DeepEval. Tokenisation drops punctuation and length-1 tokens so an attacker can't satisfy the rule by emitting `the , .`.
- Runtime policy enforcement in new module `shadow.policy_runtime`:
    - `PolicyEnforcer(rules, on_violation=...)` evaluates rules incrementally on a growing record list and reports only NEW violations since the last call. Three modes: `replace` (default — swap the offending response for a refusal payload while preserving structural fields), `raise` (throw `PolicyViolationError`), `warn` (log only).
    - `EnforcedSession(enforcer=..., output_path=...)` extends `Session` and runs the enforcer on every `record_chat`. The flushed `.agentlog` is structurally valid even when responses were replaced — every existing Shadow command (`diff`, `verify-cert`, `mine`, `mcp-serve`) reads it without modification.
    - `Verdict` dataclass carries `(allow, replacement, reason, violations)`.
    - `default_replacement_response` builds a refusal payload that preserves `model`, `usage`, `latency_ms` so downstream renderers don't break. Custom builders are accepted via `replacement_builder=`.
    - The API shape mirrors NeMo Guardrails / Bedrock Guardrails / Guardrails AI conventions: callback/verdict pattern, return-replacement default, raising opt-in. Researched against canonical guardrails-API patterns to pick the most canonical shape.
- 36 new tests across `python/tests/test_policy_stateful_rag.py` (rule semantics) and `python/tests/test_policy_runtime.py` (enforcer + EnforcedSession). Covers happy paths, anchor pinning, absence-isn't-change, final-pair triggers, replace / raise / warn modes, custom replacement builders, incremental violation detection, and round-trips via disk.
- New docs page `docs/features/runtime-enforcement.md` covering the three modes, custom replacements, the programmatic API for callers not using `EnforcedSession`, and what the surface explicitly does NOT do (no tool-call interception, no network-level guardrails, no cross-process state). Wired into the mkdocs `Features` nav.
- `docs/features/policy.md` updated with a "Stateful and RAG-aware rules" section covering the three new kinds with runnable examples.
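The lexical check behind `must_be_grounded` can be sketched as a unigram-precision gate. A hedged sketch: the exact tokenisation and path handling in Shadow may differ, but the described properties (punctuation and length-1 tokens dropped, 0.5 default threshold) are preserved here.

```python
import re

def unigram_precision(response, chunks):
    """Fraction of response tokens that also appear in the retrieved chunks.
    Punctuation and length-1 tokens are dropped, so filler like 'the , .'
    cannot satisfy the gate."""
    tokenise = lambda s: [t for t in re.findall(r"[a-z0-9]+", s.lower())
                          if len(t) > 1]
    resp_tokens = tokenise(response)
    if not resp_tokens:
        return 0.0
    context = {t for chunk in chunks for t in tokenise(chunk)}
    hits = sum(1 for t in resp_tokens if t in context)
    return hits / len(resp_tokens)

def is_grounded(response, chunks, min_unigram_precision=0.5):
    return unigram_precision(response, chunks) >= min_unigram_precision
```

Note the limits the docs call out: a semantically faithful paraphrase with different vocabulary fails this gate, and an on-vocabulary hallucination passes it. It is a cheap CI gate, not a faithfulness metric.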
## [1.8.0] - 2026-04-25

### Added
- Cosign / sigstore keyless signing for Agent Behavior Certificates.
    - `shadow certify --sign` writes a sidecar `<output>.sigstore` Bundle containing the signature, the Fulcio-issued signing certificate, and a Rekor transparency-log entry. The signed payload is the canonical certificate body bytes — the same bytes `cert_id` hashes — so tampering breaks both content-id and signature. Optional via the `[sign]` extra (`pip install 'shadow-diff[sign]'`); unsigned certificates from prior versions still verify content-addressing as before.
    - `shadow verify-cert --verify-signature --cert-identity <email-or-workflow-url>` binds verification to a specific signer identity. A leaked Bundle signed by another identity fails this check even if its cryptography is otherwise valid — the keyless flow's whole value is identity binding. Defaults to the GitHub Actions OIDC issuer (https://token.actions.githubusercontent.com); override with `--cert-oidc-issuer` for other providers.
- New `shadow.certify_sign` module wraps the sigstore-python `Signer` / `Verifier` API and handles canonicalisation + identity policy. Eight new tests at `python/tests/test_certify_sign.py` cover canonical-body determinism (dataclass and dict forms produce identical bytes), `cert_id` exclusion (the body fingerprint must equal the body part of `cert_id`), the sidecar path convention, sign-writes-bundle (with the sigstore boundary mocked), the no-OIDC-token error path, missing/corrupt bundles, and the verify boundary's input-bytes contract. `pytest.importorskip("sigstore")` gates the file so the default install path stays sigstore-free.
- README + `docs/features/certificate.md` document the signing flow with both CI (GitHub Actions OIDC) and local (interactive browser) examples. The comparison-table row in the README is now "Cosign-signed release certificate."
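The "tampering breaks both content-id and signature" property follows from hashing and signing the same canonical bytes. A hedged sketch of that idea — the encoding, field names, and `cert_id` format here are assumptions, not Shadow's actual certificate layout:

```python
import hashlib
import json

def canonical_body_bytes(body: dict) -> bytes:
    """Deterministic encoding of the certificate body: sorted keys, no
    whitespace. Dataclass and dict forms that carry the same fields
    collapse to identical bytes under this scheme."""
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()

def cert_id(body: dict) -> str:
    """Content-address over the canonical body. Signing these same bytes
    means any edit invalidates the id and the signature together."""
    return "sha256:" + hashlib.sha256(canonical_body_bytes(body)).hexdigest()
```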
### Changed

- ROADMAP entry for the v1.8 signing layer is now in "Shipping today (v1.8.x)." Next-up sections cover runtime policy enforcement (a major-version surface change tracked for v2.0.0) and stateful / RAG-aware contracts (v1.9).
## [1.7.6] - 2026-04-25

### Fixed

- README + `docs/features/hierarchical.md` policy examples used `severity: critical` — not a valid severity. The loader silently stored unknown values as the raw string, and the `shadow diff --fail-on` gate's rank lookup fell through to a default, so a rule the user wrote as "block hard" never tripped a severe gate. Examples updated to `severity: error` (the documented value for a hard-block rule).
- `load_policy` now validates `severity` against `{info, warning, error}` and raises `ShadowConfigError("policy rule #N has invalid severity 'X'")` at load time. Three new tests cover the rejection, the three valid values, and the default. The validation makes the v1.6.5 `--fail-on` gate actually trip on the rule the user wrote, instead of silently downgrading to info.
- README comparison table said "Signed release certificate" — an overclaim. The certificate is content-addressed and self-verifying; PKI/cosign signing is on the roadmap, not shipped. Row corrected to "Content-addressed release certificate."
## [1.7.5] - 2026-04-25

### Fixed

- `docs/features/mcp.md` (the MCP importer page) now cross-links to `mcp-server.md`. v1.7.4 added the server→importer link but missed the symmetric one — readers landing on the importer page wouldn't discover the server page.
- `docs/quickstart/ci.md` was anchored to the v1.6-era CI workflow and never picked up the v1.6.5 `--fail-on` flag. Quickstart readers got the same non-gating workflow that v1.7.2 fixed in `shadow init --github-action`'s template. It now includes a "Gating the merge on regressions" section with the recommended `--fail-on severe` step.
- `docs/quickstart/ci.md`'s "Next" section now links to the policy and certificate feature pages added in v1.7.2.
## [1.7.4] - 2026-04-25

### Added

- `docs/features/mcp-server.md` — dedicated docs page for running Shadow as an MCP server (`shadow mcp-serve`). The existing `features/mcp.md` covers the MCP importer (ingesting MCP traces into `.agentlog`); the new page covers the reverse direction (Shadow exposing its analyses as MCP tools to clients like Claude Desktop, Cursor, Zed). Documents all seven tools with per-tool purposes and a typical agentic-CLI session. Wired into the `Features` mkdocs nav.
### Fixed

- README's `## Use Shadow from an agentic CLI (MCP server)` section listed only the original five tools — missing `shadow_certify` and `shadow_verify_cert` from v1.7.2. Same drift class as the `docs/reference/cli.md` fix in v1.7.3.
- README's policy and certificate sections now deep-link to `docs/features/policy.md` and `docs/features/certificate.md`. Without those links the new feature pages were unreachable from the README.
## [1.7.3] - 2026-04-25

### Fixed

- `docs/reference/cli.md` had drifted four releases behind the CLI: missing entries for `shadow mine`, `shadow mcp-serve`, `shadow certify`, `shadow verify-cert`. Added all four. Also corrected a stale "8 rule kinds" line under `shadow diff --policy` (now nine, including `must_match_json_schema`) and added the `--fail-on` flag with explanation.
- The `shadow mcp-serve` reference section now enumerates all seven MCP tools, including `shadow_certify` and `shadow_verify_cert` (added in v1.7.2).
## [1.7.2] - 2026-04-25

### Added

- MCP server gains `shadow_certify` and `shadow_verify_cert` so agentic CLIs (Claude Desktop, Claude Code, Cursor, Zed, Windsurf, any MCP-aware client) can generate and verify Agent Behavior Certificates over the protocol — same arguments and contract as the CLI commands. Tool-handler registry, tool descriptors, and module docstring all updated. Five new tests cover the round-trip.
- `shadow init --github-action` template includes a commented-out merge-gate step so freshly scaffolded workflows can opt into `--fail-on severe` without rewriting the YAML. Default behaviour is still non-blocking; uncommenting one step turns Shadow into a required check.
- mkdocs site adds two new feature pages — `docs/features/policy.md` (the nine rule kinds, conditional `when:` operators, structured-output assertions, severity → `--fail-on` mapping, scope) and `docs/features/certificate.md` (ABOM format, generate/verify workflow, what it proves vs. what it doesn't, MCP integration, format stability). Both wired into the `Features` nav.
### Fixed

- MCP server's `shadow_check_policy` description listed eight rule kinds; now lists nine, including `must_match_json_schema` (the rule landed in v1.7.0 but the MCP description didn't).
## [1.7.1] - 2026-04-25

### Fixed

- `examples/stress_v17x/run_stress.py` had 5 mypy `--strict` errors that ci-local missed because the harness was outside mypy scope. Same drift pattern caught for `stress_v16x` in v1.6.4 — applied here too. Type signatures fixed; the harness still passes 26/26 at runtime.
- `just ci-local` and `.github/workflows/ci.yml` mypy scope now cover `examples/stress_v17x/run_stress.py`. Future stress-harness changes are caught by both local and CI mypy.
### Docs

- README now documents `must_match_json_schema` (with example), `--fail-on` for `shadow diff`, and the `shadow certify` / `shadow verify-cert` workflow. Comparison table includes "Merge-blocking CI gate" and "Signed release certificate" rows. CLI reference table covers the new commands. The previous 1.7.0 README still listed only eight policy rule kinds — fixed to nine.
## [1.7.0] - 2026-04-25

### Added

- `must_match_json_schema` policy rule kind. Asserts that every response's text content parses as JSON and validates against a supplied JSON Schema. Accepts either an inline `schema:` dict or a `schema_path:` to a JSON Schema file. Mismatches surface with the offending dotted path (e.g. `json schema mismatch at properties.amount: ...`). This closes the most common gap in v1.6.x policies for agents that produce structured output. Uses `jsonschema>=4.0` (now a runtime dependency).
- Agent Behavior Certificate (ABOM) via `shadow certify` and `shadow verify-cert`. Generates a content-addressed JSON release artefact that captures `agent_id`, `released_at`, the trace's content-id, all distinct models observed, content-ids of all distinct system prompts, content-ids of every tool schema, optional `policy_hash` (sha256 of the policy file), and optional `regression_suite` (the nine-axis severity rollup vs a baseline trace). The certificate is self-verifying: `shadow verify-cert release.cert.json` recomputes the body's hash and exits 1 on mismatch, so it can run as a release gate. PKI / cosign signing lands in v1.8 — the format is stable today, signing layers on top.
### Fixed

- `must_match_json_schema` was accepting `NaN`, `Infinity`, and `-Infinity` because Python's `json.loads` accepts them as a CPython extension. Those literals are not valid JSON per RFC 8259, and downstream consumers (browsers, other-language parsers, strict JSON consumers) will choke on them. The rule now rejects them with a clear "non-standard JSON literal" violation. Caught by the new adverse-stress harness in `examples/stress_v17x/run_stress.py`.
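The stdlib makes this rejection cheap: `json.loads` invokes `parse_constant` only for the three non-standard literals, so raising there rejects exactly the RFC 8259 violations. A minimal sketch (the function name is illustrative, not Shadow's internal helper):

```python
import json

def strict_loads(text: str):
    # parse_constant is called only for 'NaN', 'Infinity', '-Infinity',
    # so raising here rejects exactly the non-RFC-8259 literals while
    # leaving all valid JSON untouched.
    def _reject(literal: str):
        raise ValueError(f"non-standard JSON literal: {literal}")
    return json.loads(text, parse_constant=_reject)
```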
### Tests

- 12 new tests for `must_match_json_schema` (valid JSON passes, malformed JSON / schema mismatch / empty text / both-or-neither schema params / external schema_path / policy_diff regressions / NaN-Infinity rejection).
- 13 new tests for the certificate module: build extracts models/prompts/tools, self-verifies, tampering breaks verification, unsupported `cert_version` rejected, optional policy hash + baseline regression suite, CLI `certify` writes JSON, CLI `verify-cert` exits 0 on valid / 1 on tampered.
- New `examples/stress_v17x/run_stress.py` adverse harness — 26 assertions covering 12 malformed-JSON variants, unicode/RTL/emoji payloads, 62 KB / 2000-item payload scaling, `oneOf`/`$ref` schemas, invalid-schema short-circuiting, pathological `schema_path` cases, 50-thread concurrent validation, deterministic certificate builds with fixed timestamp, 20-thread concurrent builds producing identical `cert_id`, all 9 per-field tamper detections, `cert_id`-only swap detection, round-trip via disk, forward-compat with unknown fields, version + format rejection, 100-turn trace certification scaling.
## [1.6.5] - 2026-04-25

### Added

- `shadow diff --fail-on {minor,moderate,severe}` — exits non-zero when the worst axis severity or a policy regression reaches the threshold. The diff report and policy summary are still printed (and the JSON output is still written) before the gate fires, so blocked PRs always see the explanation. Default remains `never` (post the report, exit 0). Use `--fail-on severe` in CI to convert Shadow from "shows you a diff" to "blocks the merge."
- GitHub Action gains a `fail-on` input plumbed through to `shadow diff`. The PR comment is posted first, then the gate runs as a separate step, so blocked PRs always have the comment that explains why. New optional `policy` and `shadow-version` inputs too. Action defaults remain non-blocking, so existing consumers don't suddenly fail.
### Fixed

- GitHub Action install was broken. The composite action attempted `pip install shadow==0.1.0` — wrong package name (Shadow ships as `shadow-diff` on PyPI) and a version that was never published. External consumers always silently fell through to the in-tree fallback, which only works when running the action from this repo. The install line now uses `shadow-diff` (current latest) with optional pinning via the new `shadow-version` input.
- `ROADMAP.md` was anchored to v1.2.x and listed sandboxed deterministic agent-loop replay under "What's next" even though it shipped in v1.6.0. Section header is now "Shipping today (v1.6.x)"; sandboxed replay, tool backends, novel-call policies, counterfactual primitives, conditional `when:` policies, framework adapters, importers, MCP server, trace mining, and PyPI Trusted Publisher are all in the shipping list. The "What's next" section now reflects the real outstanding work (streaming replay, multimodal traces, harness-diff instrumentation, MCP-native replay, runtime policy enforcement, richer behaviour contracts, ABOM). Added a "Not on the roadmap" entry making the sandbox's "best-effort isolation, not a security boundary" framing explicit.
## [1.6.4] - 2026-04-25

### Fixed

- `examples/stress_v16x/run_stress.py` — corrected `record_baseline`'s declared return type (was `list[dict[str, Any]]`, actually returned a `(records, summary)` tuple), removed two stale `# type: ignore[arg-type]` comments, and added explicit type annotations on baseline-construction dicts. The harness ran correctly at runtime but the type signatures lied; mypy can now actually help.
### Changed

- `just ci-local` and `.github/workflows/ci.yml` mypy scope now include `examples/stress_v16x/run_stress.py`. Previously only `examples/demo/agent.py` and `examples/demo/generate_fixtures.py` were type-checked, so the committed stress harness drifted unchecked. Future stress-harness changes are now caught by both local and remote CI.
## [1.6.3] - 2026-04-25

### Fixed

- `OpenAILLM` was dropping `tool_calls` and `tool_call_id` from messages on follow-up requests. The agent-loop engine emits assistant messages of the form `{role: "assistant", content: "", tool_calls: […]}` (the OpenAI wire shape). The converter's early-return path for string `content` was returning before forwarding `tool_calls`. The very next request — carrying the `role: "tool"` follow-up — was rejected by the API with HTTP 400 "messages with role 'tool' must be a response to a preceding message with 'tool_calls'". This blocked every real-world OpenAI agent-loop replay past the first tool round-trip. The converter now forwards both fields regardless of `content` shape.
- Found by an end-to-end stress test against real `gpt-4o-mini` (`examples/stress_v16x/run_stress.py`) — 25 adverse-condition assertions covering branch_at_turn mid-trajectory, replace_tool_result re-drive with hostile output, replace_tool_args under sandbox redispatch, hostile-tool sandbox (socket / subprocess / write), max_turns truncation under runaway, four novel-call policies, five concurrent branches, long-trace truncation, empty seed, and past-end branch. The harness went from 20/24 (broken at the OpenAI handoff) to 25/25 once the converter was fixed.
### Added

- `python/tests/test_openai_backend.py` — five focused unit tests covering the converter's tool-calls forwarding, including the exact shape the agent-loop engine produces. Locks the regression.
- `examples/stress_v16x/run_stress.py` — runnable real-LLM adverse stress harness. Gated behind `SHADOW_RUN_NETWORK_TESTS=1` and `OPENAI_API_KEY`; skips otherwise. Costs well under $0.05 per run against gpt-4o-mini.
## [1.6.2] - 2026-04-25

### Fixed

- `drive_loop_forward` now returns `AgentLoopSummary` (a public type) instead of the private `_SessionStats`. The function was added to `__all__` in 1.6.1 but leaked an internal struct, which made it impossible to type-annotate user code that consumed it. The internal `_accumulate(summary, stats)` helper in `shadow.counterfactual_loop` is replaced by `_merge(summary, addend)`, which sums two public summaries.
- `CHANGELOG.md` v1.6.1 was incorrectly dated 2026-04-24 (predating v1.6.0). The 1.6.1 section is unchanged in content.
### Added

- Direct contract tests for `drive_loop_forward`: `test_drive_loop_forward_returns_public_summary_type` (verifies the public-type return, parent chaining, and content-addressing) and `test_drive_loop_forward_truncation_surfaces_in_summary` (verifies `sessions_truncated` propagates from inner stats to the public summary).
- `just ci-local` now also runs the `python-full-extras` job locally — installs every optional extra (`anthropic`, `openai`, `otel`, `serve`, `mcp`, `embeddings`) and re-runs pytest with no `--ignore` filter, so optional-extras gating bugs (the kind that bit v1.4.1 with `mcp`) get caught before pushing.
## [1.6.1] - 2026-04-25

### Fixed

- `branch_at_turn(turn=N)` for N ≥ 1 now preserves the baseline prefix verbatim (with content-addressed ids intact through the end of turn N, including any trailing `tool_call`/`tool_result` records) and then drives the agent loop forward from turn N+1's seed messages. Earlier behaviour stopped after the prefix and never produced the candidate continuation, so the docstring's promise of "branch and replay forward" was a no-op.
- `replace_tool_result` re-drive mode (when `llm_backend` is supplied) preserves the prefix through the patched `tool_result`, then continues forward from that point. Previously it re-drove from turn 0, which broke `MockLLM` lookups (the patched tool message changed the next request's content-id) and produced a trace whose prefix did not match the baseline.
- `branch_at_turn(turn=K)` where K exceeds the baseline's turn count now raises `ShadowConfigError("baseline has fewer than K …")` instead of silently emitting a stub trace.
- New `drive_loop_forward` primitive in `shadow.replay_loop` is the shared driver these counterfactual helpers now use; the existing `run_agent_loop_replay` is unchanged.
### Added

- Local CI parity script. `just ci-local` runs the exact command set from `.github/workflows/ci.yml` in the same order — including the `python/ examples/` ruff/mypy scope and the demo step — so drift between local and CI lint scope is caught before pushing. Catches the three classes of failure that bit prior releases: ruff/mypy scope, optional-extras gating, and demo wall-clock regressions. The recipe is portable across macOS (uses `gtimeout` if installed, falls back to plain bash) and Linux.
- `just lint-python` was widened to match CI scope (`python/ examples/`, plus the demo entry points for mypy). It was previously narrower than CI, so `just ci` could pass locally and still fail on push.
- New tests: `test_branch_at_turn_one_replays_prefix_then_drives_forward` — verifies the prefix is preserved with content-addressed ids and that forward-drive emits at least one additional chat pair. `test_branch_at_turn_past_end_raises` — verifies the new bounds-check error. `test_replace_tool_result_redrive_preserves_prefix_then_drives_forward` — verifies prefix `chat_request` ids carry through verbatim. `test_delegate_policy_can_bridge_to_sandboxed_backend` — pinpoints the documented composition pattern: novel calls flow through `DelegatePolicy` into `SandboxedToolBackend.execute`. `test_engine_handles_multi_session_baseline` — multi-session baselines (two `metadata` records) now have an explicit assertion that both sessions replay end-to-end. `test_replay_loop_live.py` — real-LLM end-to-end test against `gpt-4o-mini`, gated behind `SHADOW_RUN_NETWORK_TESTS=1` so CI never opts in. Asserts the agent-loop engine produces a structurally valid, content-addressed trace from a live API call.
637 pytest tests pass, 205 cargo tests pass, and the full CI parity script (`just ci-local`) is green on macOS.
## [1.6.0] - 2026-04-25

### Added

Sandboxed deterministic agent-loop replay. A new replay mode that drives the candidate's agent loop forward against a baseline — same shape as a real agent run, but no real network calls, no real database writes, no real charges. The output is an ordinary `.agentlog`, so every existing Shadow command (diff, check-policy, mine, mcp-serve, bisect) reads it without modification.

- New `shadow.tools` package mirrors `shadow.llm`: a `ToolBackend` Protocol with three implementations.
  - `ReplayToolBackend` indexes baseline `tool_result` records by `(tool_name, canonical_args_hash)` and serves them back. Default for `shadow replay --agent-loop`.
  - `SandboxedToolBackend` wraps user tool functions; blocks `socket.connect`, `subprocess.run`/`Popen`/`os.system`/`os.execvp`, and write-mode `open()` calls (redirected to a tempdir). Optional `freeze_time` pins `time.time` and `datetime.utcnow`. Best-effort isolation for replay determinism, not a security boundary.
  - `StubToolBackend` returns deterministic placeholders. For tests and the `stub` novel-call policy.
- New `shadow.replay_loop` module: `run_agent_loop_replay(baseline, llm_backend, tool_backend)` drives the loop forward with a `max_turns` safety cap and a structured `AgentLoopSummary` of stats. Errors from a tool backend become `is_error=True` `tool_result` records by default; runaway loops emit an `error` record with `code=loop_max_exceeded`.
- New `shadow.tools.novel` module: four configurable policies for tool calls the baseline never recorded — `StrictPolicy` (raise), `StubPolicy` (placeholder), `FuzzyMatchPolicy` (Jaccard distance over arg keys), `DelegatePolicy` (defer to a user-supplied async callable).
- New `shadow.counterfactual_loop` module: `replace_tool_result`, `replace_tool_args`, `branch_at_turn`. Each produces a `CounterfactualLoopResult` carrying the new trace, summary, and an `override` dict describing the substitution. These are the rails the bisect renderer's "confirm with `shadow replay`" caveat references.
- `shadow replay` gains `--agent-loop`, `--tool-backend {replay|stub|sandbox}`, `--novel-tool-policy {strict|stub|fuzzy}`, and `--max-turns` flags.
- New docs page (`docs/features/sandboxed-replay.md`) and a runnable worked example at `examples/sandboxed-replay/run.py` (no API keys needed).
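The replay backend's `(tool_name, canonical_args_hash)` key can be sketched as follows. The hashing scheme here (sorted-key JSON, sha256) is an assumed implementation shown only to illustrate why argument ordering doesn't affect the match; it is not guaranteed to be Shadow's actual canonicalisation.

```python
import hashlib
import json

def canonical_args_hash(args: dict) -> str:
    # Sorted keys + compact separators make the hash insensitive to
    # dict ordering, so {"a": 1, "b": 2} and {"b": 2, "a": 1} collide
    # on purpose.
    blob = json.dumps(args, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()

def index_baseline(tool_results: list[dict]) -> dict:
    # Build the (tool_name, canonical_args_hash) -> result index a
    # replay backend would serve cached answers from.
    return {
        (r["tool_name"], canonical_args_hash(r["args"])): r["result"]
        for r in tool_results
    }
```

A lookup miss in such an index is exactly a "novel call", which is where the four `shadow.tools.novel` policies take over.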
48 new tests across the new modules. 632 pytest, 205 cargo, `mypy --strict` / ruff / clippy / fmt all clean. Coverage 85.56%.
## [1.5.0] - 2026-04-25

### Added

- Conditional policy rules. Every rule can now carry a `when:` clause that gates it on a list of field-path conditions; the rule only fires on pairs whose request/response context satisfies every condition. The path is a dotted expression rooted at a per-pair context (`request.*`, `response.*`, plus `model` and `stop_reason` aliases). Operators: `==`, `!=`, `>`, `>=`, `<`, `<=`, `in`, `not_in`, `contains`, `not_contains`. Multiple conditions AND together. Missing paths silently don't match instead of crashing. This unlocks rules like "high-value refunds must call confirm first" without forking the policy file by amount bucket.
- `shadow bisect` terminal renderer with hedged language. The CLI now renders attribution output through a dedicated formatter that prefixes percentages with `est.`, replaces the bare `✓` with explicit `(stable, CI excludes 0)` / `(screening only)` / `(weak signal)` qualifiers, labels brackets as 95% bootstrap CIs, and leads every output with a one-line caveat noting attribution is correlational, not causally proven (confirm with `shadow replay`). New `--format terminal|markdown|json` flag; the previous JSON shape is preserved under `--format json`.
### Changed

- README: framework users are now pointed at the matching adapter as a more stable instrumentation surface than auto-instrument's monkey-patching of provider SDK `.create` methods.
- `docs/features/bisect.md`: example output updated to match the new hedged renderer; the reading-the-signal section explains each row's qualifier instead of claiming `✓` means "proven."
### Notes

- The provider SDK upper bounds (`anthropic<1`, `openai<3`) are intentionally aligned with the next major version, because that's where `.create` class paths can move and our auto-instrument patches can silently break. Users on a major above the cap can lift the pin in their own pyproject and report breakage.
## [1.4.1] - 2026-04-24

### Changed

- `shadow-diff[langgraph]` now pulls `langchain-openai>=0.2,<2` alongside `langchain-core` and `langgraph`. Most LangGraph users pick ChatOpenAI as their chat provider and were hitting `ModuleNotFoundError` on first run. The adapter is still provider-neutral; users on Anthropic or Bedrock add `langchain-anthropic` / `langchain-aws` alongside without conflicts.
### Fixed

- CI coverage gate was failing at 83.97% on the default extras-less matrix because `shadow.mcp_server` counted against the denominator while its tests skipped. Added it to the `[tool.coverage.run]` `omit` list (same pattern as adapters, enterprise, serve, embeddings). Local run now reports 85.74%.
- `test_mcp_server.py` now skips cleanly via `pytest.importorskip("mcp")` when the `[mcp]` extra is absent, instead of raising `ModuleNotFoundError` at import time and failing the whole job.
- `ShadowCrewAIListener(quiet_internal_listeners=True)` detaches CrewAI's built-in `TraceCollectionListener.on_crew_started` and sibling handlers that raise `'str' object has no attribute 'id'` when synthetic events drive the bus. Production `Crew.kickoff()` paths are untouched.
### Docs

- New `docs/features/adapters.md` covering the LangGraph, CrewAI, and AG2 adapters in depth. Wired into the mkdocs nav.
- README gains a "Record from agent frameworks" section with runnable snippets for each adapter and an "Import traces from any OpenTelemetry backend" section noting GenAI semconv v1.40 support.
## [1.4.0] - 2026-04-24

### Added

OTel GenAI semconv v1.40 compliance. The `shadow import --format otel` importer now parses the full v1.40 attribute surface: structured `gen_ai.input.messages` / `gen_ai.output.messages` (whether carried as span attributes or inside the `gen_ai.client.inference.operation.details` event), `gen_ai.provider.name`, cache-token attributes (`gen_ai.usage.cache_read.input_tokens` and `cache_creation.input_tokens`), `gen_ai.response.id`, `gen_ai.output.type`, `gen_ai.conversation.id`, `gen_ai.tool.definitions`, agent spans (`create_agent`/`invoke_agent` land in metadata with id/name/description/version), and evaluation events. The deprecated v1.28-v1.36 flat-indexed `gen_ai.prompt.N.*` / `gen_ai.completion.N.*` shape still parses, so traces from OpenLLMetry and other implementers that haven't tracked the v1.37 restructure round-trip cleanly. Spans are sorted by `startTimeUnixNano` so content IDs are deterministic.

Framework adapters: LangGraph, CrewAI, AG2. A new `shadow.adapters` package exposes first-class tracing hooks for the three dominant agent frameworks as of April 2026. Each routes its framework's native instrumentation surface through `Session.record_chat` / `record_tool_call` / `record_tool_result`.

- `shadow.adapters.langgraph.ShadowLangChainHandler` — an `AsyncCallbackHandler` subclass. Hooks: `on_chat_model_start/end/error`, `on_tool_start/end/error`. Pair-buffers by LangChain's `run_id` so concurrent graph branches don't cross-contaminate. Session-groups by the config's `configurable.thread_id`.
- `shadow.adapters.crewai.ShadowCrewAIListener` — a `BaseEventListener` subclass wired to `LLMCall{Started,Completed,Failed}` and `ToolUsage{Started,Finished,Error}`. Pairs via CrewAI's `call_id` (LLM) and `started_event_id` (tools) conventions.
- `shadow.adapters.ag2.ShadowAG2Adapter` — uses AG2's `register_hook` surface on `safeguard_llm_inputs` / `safeguard_llm_outputs`. Captures message bodies that the default `autogen.opentelemetry` spans redact. Supports per-agent and `install_all(agents)` registration.

New extras: `pip install 'shadow-diff[langgraph]'`, `[crewai]`, `[ag2]`.
### Fixed

Three correctness gaps surfaced by a real-world MCP adversarial test on 10 support-triage tickets.

- Session-scoped policy rules. `must_call_before` and the other rule kinds now accept `scope: session` in the policy YAML. Under session scope the rule is evaluated independently within each user-initiated session — inferred from `messages[-1].role == "user"` on the request — so a correct ordering in one ticket can no longer mask violations in later tickets of the same multi-ticket trace. Default stays `scope: trace` for back-compat; the MCP tool description and policy loader both document the new field. Real-world test: the adversarial candidate trace went from 0 reported violations (bug) to 6 (correct — one per offending ticket).
- Cost axis `no_pricing` flag. When no pricing table is supplied, or the traced models aren't in the table, the cost axis previously reported `delta=0, severity=none` — indistinguishable from "both sides priced equally." The axis now emits `flags: ["no_pricing"]` when fewer than half the pairs can be priced. The summariser surfaces this as "per-call cost not comparable (no pricing table supplied)" rather than omitting cost silently.
- Mining metadata field name. `MiningResult.to_agentlog()` now writes `cases_selected` (was `selected_cases`), aligning with the sibling keys `total_input_pairs` and `clusters_found`.
- Explicit session markers for adapter traces. `Session.record_metadata(payload)` appends an authoritative session-boundary marker. When a trace contains two or more metadata records, Shadow's session detector uses them exclusively instead of falling back to the stop-reason heuristic. The CrewAI adapter now emits a marker on every `CrewKickoffStartedEvent`, so one kickoff equals one session even when every `LLMCallCompleted` ends with `end_turn`. Verified end-to-end on real GPT-4o-mini traces: 3 topics run through CrewAI produce 3 sessions (previously fragmented).
## [1.3.0] - 2026-04-24

### Added

Three features aimed at the April 2026 agent ecosystem.
#### `shadow mcp-serve`: run Shadow as a Model Context Protocol server

Shadow now speaks MCP. Any MCP-aware client (Claude Desktop, Claude Code, Cursor, Zed, Windsurf, and others) can connect over stdio and invoke Shadow's capabilities as tools. Five tools are exposed:

- `shadow_diff`: nine-axis diff between two `.agentlog` files, with optional policy enforcement
- `shadow_check_policy`: run a YAML/JSON policy against two traces
- `shadow_token_diff`: per-dimension token distribution summary
- `shadow_schema_watch`: classify tool-schema changes before replay
- `shadow_summarise`: plain-English summary from a saved DiffReport
Wire it into a client's config:
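The original config snippet isn't reproduced here. As a hedged illustration, stdio-based MCP clients are conventionally wired up with an entry like the following (the `mcpServers` key and command/args shape follow the common MCP client convention, e.g. Claude Desktop's config file; the exact file name and keys depend on the client):

```json
{
  "mcpServers": {
    "shadow": {
      "command": "shadow",
      "args": ["mcp-serve"]
    }
  }
}
```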
Install the extra: `pip install 'shadow-diff[mcp]'`.
#### `shadow import --format a2a`: Agent-to-Agent session logs

The A2A protocol (Linux Foundation, used in production at Microsoft, AWS, Salesforce, SAP, ServiceNow) is the agent-to-agent companion to MCP. Shadow's A2A importer:

- Reads JSONL, JSON-array, and wrapped session shapes
- Maps `tasks/send` and `tasks/result` pairs into `chat_request` and `chat_response`
- Extracts Signed Agent Cards (A2A v1.0) into the metadata payload
- Captures error responses as `stop_reason=error`

The resulting `.agentlog` plugs into every existing Shadow command (diff, bisect, policy, suggest-fixes).
#### `shadow mine`: turn production traces into regression suites

Clusters turn-pairs by tool sequence, stop reason, verbosity, and latency. Picks the most interesting representative from each cluster (errors, refusals, high cost, heavy reasoning, empty or very long responses). Writes a new `.agentlog` suitable as a committed baseline for CI. Optional `--pricing pricing.json` biases the selection toward expensive pairs.
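The cluster-then-pick-a-representative idea can be sketched as follows. The cluster key fields and the scoring weights are assumptions for illustration, not the mining heuristics Shadow actually ships.

```python
from collections import defaultdict

def mine(pairs: list[dict]) -> list[dict]:
    # Group turn-pairs by a coarse behaviour signature (here: tool
    # sequence + stop reason; the real miner also uses verbosity and
    # latency).
    clusters: dict[tuple, list[dict]] = defaultdict(list)
    for p in pairs:
        key = (tuple(p.get("tools", [])), p.get("stop_reason"))
        clusters[key].append(p)

    def interest(p: dict) -> float:
        # Errors and refusals first, then cost; weights are invented.
        return (
            10.0 * (p.get("stop_reason") == "error")
            + 5.0 * bool(p.get("refusal", False))
            + p.get("cost", 0.0)
        )

    # One representative per cluster: the most "interesting" member.
    return [max(members, key=interest) for members in clusters.values()]
```

A pricing table plugged into the cost term is all it takes to bias selection toward expensive pairs, which is what `--pricing pricing.json` does conceptually.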
### Tests

- 31 new tests covering the MCP server, A2A importer, and mine
- Full pytest suite: 519 passed (up from 488)
- MCP server verified live over stdio: `initialize`, `notifications/initialized`, `tools/list` return 5 tools cleanly
- `cargo test`, `mypy --strict`, `ruff`, `ruff format --check`, `clippy -D warnings`, `cargo fmt --check` all clean
## [1.2.4] - 2026-04-24

### Fixed, fourth discrepancy-sweep pass (deepest)

A last deep audit turned up real user-facing drift in docs and metadata.

#### Docs site was missing every v1.2 feature

- `docs/reference/cli.md` didn't document any v1.2 flag: no `--token-diff`, `--policy`, `--suggest-fixes`, `--partial`, `--branch-at`, `vercel-ai`, `pydantic-ai`. Users reading the published CLI reference couldn't find half the v1.2 release's features. Rewrote with full v1.2 coverage.
- `docs/features/hierarchical.md` claimed "token-level: deferred to v1.1+", but token-level shipped in v1.2.0. The table now documents all six real levels (trace/session/turn/span/token/policy) with actual shipped-in-version labels, plus dedicated sections for the two v1.2 levels covering CLI flags, rule kinds, and scale.
- `docs/index.md` Highlights was pre-v1.2: no mention of token/policy diff, partial replay, Vercel AI + PydanticAI importers, LLM-assisted fixes, or Python 3.13 support. Updated.
#### SPEC.md drift from code

- `replay_summary` §4.7 documented a `candidate_config_hash` field that the implementation has never emitted; the actual field is `backend_id`. Also missing the v1.2 `branch_at` + `prefix_turn_count` optional fields for partial replay. SPEC now documents the real shape, with both required and optional fields in a proper table.
#### Platform-claim alignment

- `pyproject.toml` classifiers claimed Linux + macOS only, but Shadow has shipped a working Windows CI matrix since v1.1. Added `Operating System :: Microsoft :: Windows`.
- `pyproject.toml` classifies Python 3.13 support but CI didn't test it. Added `python: "3.13"` for all three OSs in the CI matrix (9 pytest jobs now: 3 OS × 3 Python versions). Verified Shadow 1.2.x imports and runs cleanly under a fresh Python 3.13 venv.
#### TypeScript license inconsistency

- `typescript/package.json` was licensed `"MIT"` only, while the Rust and Python packages are dual-licensed `"MIT OR Apache-2.0"`. The SPDX form `"(MIT OR Apache-2.0)"` now matches.
#### Stale version references

- `.github/ISSUE_TEMPLATE/bug_report.yml` placeholder said `"shadow 0.1.0"` → now `"1.2.4"`.
- `docs/PYPI-PUBLISHING.md` example used `v0.2.1` as the tag to push → now `v1.2.4`.
### Added, crates.io publish (gated, auto-skip if token absent)

Noticed during this audit: shadow-core is published on crates.io at v0.1.0 and has never been updated; the release workflow built a Rust source tarball but didn't push to crates.io. Added a new `publish-crates` job to `.github/workflows/release.yml` that:

- Runs on every `v*` tag push after `sign-and-release`.
- Emits a `::notice::` and exits clean if `CARGO_REGISTRY_TOKEN` is not configured as a repo secret (so the release pipeline still succeeds).
- Publishes `shadow-core` via `cargo publish` when the token is set.

Maintainers can enable crates.io auto-publish by adding the `CARGO_REGISTRY_TOKEN` secret to the repo; no workflow change needed.
### Verified

- 488 pytest green
- Python 3.13 compat confirmed against the PyPI 1.2.2 wheel in a fresh venv
- `cargo test` + clippy + fmt clean
- mypy `--strict`, ruff check + format clean
- Docs auto-sync (`CHANGELOG.md` / `SECURITY.md` → `docs/`) continues to run on every docs build
## [1.2.3] - 2026-04-24

### Fixed, third discrepancy-sweep pass (deep dependency + docs audit)

A deeper cross-repo audit turned up two more real issues.

#### Dependency pins were dangerously stale

The runtime and optional-extra dependency pins were exact-pinned to April-2025-era versions:

- `anthropic==0.40.0` → current PyPI is `0.97.0` (57 minor versions behind; incompatible with users on modern `anthropic`)
- `openai==1.58.1` → current PyPI is `2.32.0` (one major version behind; users with `openai>=2` couldn't install `shadow[openai]`)
- `pydantic==2.10.3` → current `2.13.3`
- `httpx==0.28.1`, `rich==13.9.4`, `scikit-learn==1.6.0`, `numpy==2.2.0`, `pyyaml==6.0.2` — all exact pins
Verified Shadow works correctly against anthropic 0.97 + openai 2.32 + pydantic 2.13 (all module paths Shadow monkey-patches still exist: `anthropic.resources.messages.Messages`, `openai.resources.chat.completions.Completions`, `openai.resources.responses.Responses`). Loosened to permissive ranges:

| Dependency | Old | New |
|---|---|---|
| anthropic | `==0.40.0` | `>=0.40,<1` |
| openai | `==1.58.1` | `>=1.58,<3` |
| pydantic | `==2.10.3` | `>=2.10,<3` |
| httpx | `==0.28.1` | `>=0.27,<1` |
| rich | `==13.9.4` | `>=13.9,<15` |
| scikit-learn | `==1.6.0` | `>=1.6,<2` |
| numpy | `==2.2.0` | `>=2.2,<3` |
| pyyaml | `==6.0.2` | `>=6.0,<7` |
| sentence-transformers | `==3.3.1` | `>=3.3,<6` |
| opentelemetry-sdk | `>=1.27.0` | `>=1.27,<2` |
| fastapi | `>=0.115.0` | `>=0.115,<1` |
| uvicorn | `>=0.32.0` | `>=0.32,<1` |
| websockets | `>=13.1` | `>=13.1,<16` |

Dev deps (hypothesis, mypy, ruff, pytest, pytest-asyncio, pytest-cov, maturin, types-PyYAML) stay exact-pinned for CI reproducibility.
#### Docs-site drift

- `docs/changelog.md` was a stale copy of the root `CHANGELOG.md`, missing every v1.2.x entry. The published docs site at manav8498.github.io/Shadow was showing a changelog stuck at v1.1.0. Now re-synced, and the docs workflow mirrors root → `docs/` on every build so this won't drift again (`.github/workflows/docs.yml`).
- `ROADMAP.md` was written from a pre-1.0 perspective, calling things already shipped (live backends, LASSO bisection, auto-instrumentation, OTel, 10 judges, Windows CI, PyPI pipeline) "planned for v0.2" and ending with "Shadow is a v0.1.0 project with early adopters welcome". Rewritten to reflect the actual v1.2.x state with accurate "shipped" vs "next up" sections.
- `SECURITY.md`: version reference bumped to `v1.2.x`.
- `examples/README.md` falsely claimed every subdirectory ships a `WALKTHROUGH.md` and uses a `generate_fixtures.py` recipe. Only 4 of 9 fit that shape. Prose rewritten to match reality; a "the other four directories" section added for `edge-cases/`, `acme-extreme/`, `integrations/`, `judges/`.
### Verified

- 488 pytest green against the loosened dep set
- 30/30 TypeScript tests green
- `cargo test` + clippy + fmt clean
- mypy `--strict`, ruff check + format clean
- Smoke-tested a fresh venv install with the latest `anthropic 0.97` + `openai 2.32` + `pydantic 2.13`; all Shadow imports and backend instantiations work, and the instrumentation module paths resolve.
## [1.2.2] - 2026-04-24

### Fixed, second discrepancy-sweep pass

Caught during a deeper post-1.2.1 audit across every version string, command reference, and doc in the repo:

- `CITATION.cff`: version `0.1.0` → `1.2.2` (in both the top-level and `preferred-citation`). Academic citations produced via GitHub's "Cite this repository" button will now pin the actual released version. Release date also updated.
- TypeScript SDK (`typescript/src/session.ts`): the hardcoded `version: '0.1.0'` in emitted metadata records is replaced with a module-load-time read from the shipped `package.json`. Every `.agentlog` the TS SDK writes now carries accurate SDK provenance.
- `examples/README.md`: four example directories existed on disk but weren't listed: `acme-extreme/`, `judges/`, `mcp-session/`, `integrations/`. All four added with their real scope descriptions.
- All package versions (Cargo, pyproject, TS package, README badge, CITATION.cff) bumped to `1.2.2` in lockstep.
No functional changes to CLI or library APIs.
[1.2.1] - 2026-04-24¶
Fixed, maturity + documentation discrepancies¶
A pass against the repo audit for discrepancies between what the package is and what it says about itself:
- PyPI classifier `Development Status :: 3 - Alpha` → `4 - Beta`. A 1.x release with 90 days of feature work behind it isn't Alpha; calling it Alpha undersells maturity for package discovery.
- GitHub Action template fix (functional bug). The workflow that `shadow init --github-action` scaffolds pinned `shadow-diff>=0.2,<0.3`, which would install a pre-1.0 CLI into every user's CI. Now pins `shadow-diff>=1.2,<2`. Users who already ran `shadow init --github-action` before 1.2.1 should update their generated `.github/workflows/shadow-diff.yml` pin.
- Embedded SDK version tracks the package version. Every emitter (langfuse/braintrust/langsmith/openai-evals/otel importers, OTel exporter, FastAPI `shadow serve`) hard-coded `"sdk": {"name": "shadow", "version": "0.1.0"}` in the metadata record's SDK-provenance field. That's a lie: a v1.2 install was stamping records "written by v0.1". Now reads `shadow.__version__` at import time. Metadata content-IDs change between versions (expected; that's what provenance is for); diff results are unaffected.
- TypeScript SDK `@shadow/sdk` `0.1.0` → `1.2.1` to match the Python + Rust packages.
- Stale "Phase-N stub" comments removed from `crates/shadow-core/src/lib.rs` and `agentlog/mod.rs`. Replaced with real submodule documentation. Similar prose fixes in `error.rs`, `parser.rs`, and `tests/test_bisect.py`.
Not changed (and why)¶
- No third-party security audit. Noted in `SECURITY.md`: Shadow is self-hostable and processes traces locally by default, but has not been penetration-tested. Users who need formal assurance should assume that gap and treat `.agentlog` files as potentially sensitive (they are, by default).
- SPEC-example record versions stay `"0.1.0"`. The example payloads in `SPEC.md` and the `test_core.py` hashing vectors are illustrative and pin specific SHA-256 values; changing the version string there would break the known-vector tests that lock the canonical JSON algorithm.
[1.2.0] - 2026-04-24¶
Added, partial-completion close-out¶
Five partial items from the strategic roadmap were closed out in this release, each shipped with implementation + unit tests + CLI wiring.
1. Vercel AI SDK importer¶
New shadow.importers.vercel_ai + shadow import --format vercel-ai.
Accepts both OTLP-style {spans: [...]} (the AI SDK telemetry exporter)
and dashboard-style {events: [...]} (the Vercel AI Observability JSON
export). Maps the full ai.* attribute namespace: ai.prompt.messages,
ai.response.text, ai.response.toolCalls, ai.tools, ai.settings.*,
ai.usage.*, ai.finishReason. Tool invocations surface as
Anthropic-shape tool_use content blocks so the rest of the differ
works unchanged. Error spans map to a synthetic error stop-reason.
16 unit tests + CLI integration.
2. PydanticAI importer¶
New shadow.importers.pydantic_ai + shadow import --format pydantic-ai.
Accepts the native all_messages_json() output, wrapped
{messages: [...]} dumps, and Logfire span exports that carry the
message history under attributes.all_messages_json. Handles the full
part-kind set: system-prompt, user-prompt, text, tool-call,
tool-return, retry-prompt, both snake-case and CamelCase variants
across PydanticAI versions. Tool schemas from model_request_parameters
propagate to every downstream request. 11 unit tests + CLI integration.
3. Token-level + policy-level hierarchical diff¶
Two new layers in shadow.hierarchical complete the strategic-plan
hierarchy (trace → session → turn → span → token → policy):
- Token-level: `token_diff(baseline, candidate)` produces per-dimension distribution summaries (median, p25, p75, p95, max, total) for `input_tokens` / `output_tokens` / `thinking_tokens`, plus a per-pair delta list ranked by absolute shift. Surfaces via `shadow diff --token-diff`. Handles a zero-median baseline without blowing up (returns a `+inf` normalised shift).
- Policy-level: declarative YAML overlay with eight rule kinds: `must_call_before`, `must_call_once`, `no_call`, `max_turns`, `required_stop_reason`, `max_total_tokens`, `must_include_text`, `forbidden_text`. `policy_diff(baseline, candidate, rules)` classifies each violation as pre-existing, a regression, or a fix. Surfaces via `shadow diff --policy path/to/policy.yaml`.
21 new unit tests covering both layers + CLI wiring.
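The pre-existing / regression / fix classification reduces to set arithmetic over the rule violations found on each side. A minimal sketch (the violation representation is an assumption; `policy_diff`'s real types differ):

```python
# Hypothetical sketch of policy-violation classification; Shadow's actual
# policy_diff operates on richer violation objects, not bare rule names.
def classify_violations(baseline_violations: set[str],
                        candidate_violations: set[str]) -> dict[str, list[str]]:
    """Split rule violations into pre-existing / regression / fix buckets."""
    return {
        "pre_existing": sorted(baseline_violations & candidate_violations),
        "regression":   sorted(candidate_violations - baseline_violations),
        "fix":          sorted(baseline_violations - candidate_violations),
    }

buckets = classify_violations(
    baseline_violations={"max_turns", "forbidden_text"},
    candidate_violations={"max_turns", "no_call"},
)
# "max_turns" is pre-existing, "no_call" is a regression, "forbidden_text" a fix.
```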
4. Partial replay¶
New shadow.replay.run_partial_replay(baseline, branch_at, backend),
the second of the replay-as-science slices (after counterfactual in
v1.0). Locks the baseline prefix verbatim (turns 0..branch_at-1),
then switches to live replay at the branch point. Isolates behaviour
change to a specific turn so reviewers can ask "if we diverged at turn
3, what happens from turn 4 onwards under config B?" without
confounding earlier turns. Surfaces via
shadow replay --partial --branch-at <idx>. Clamps gracefully when
branch_at exceeds the trace length (full-baseline copy). Preserves
parent DAG consistency under all three modes (zero = fully live,
mid = split, end = pure copy). 10 unit tests.
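The prefix-lock / branch split can be sketched in a few lines, modelling a trace as a list of turn dicts and the backend as any callable (both simplifications; `run_partial_replay`'s real signature differs):

```python
# Minimal sketch of partial replay, assuming a trace is a list of turn dicts.
def partial_replay(baseline: list[dict], branch_at: int, live_backend) -> list[dict]:
    branch_at = max(0, min(branch_at, len(baseline)))  # clamp gracefully
    locked = [dict(t) for t in baseline[:branch_at]]   # prefix copied verbatim
    replayed = [live_backend(t) for t in baseline[branch_at:]]
    return locked + replayed
```

With `branch_at = 0` everything is live, mid-trace gives the split, and a `branch_at` past the end degenerates to a full-baseline copy, matching the three modes above.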
5. LLM-assisted prescriptive fixes¶
New shadow.suggest_fixes module + shadow diff --suggest-fixes flag.
Layers an LLM pass on top of the deterministic recommendation engine to
produce concrete code-level fix proposals. The module:
- Collects up to 6 anchors from the deterministic `Recommendation` list, prioritised by severity.
- Builds a bounded evidence window (top axes + first-divergence + flagged-turn request/response payloads, truncated to `MAX_EVIDENCE_CHARS = 1800` per record).
- Calls the configured LLM backend with a strict JSON schema.
- Rejects ungrounded suggestions: if the model invents an anchor id not in the deterministic set, that suggestion is dropped. This keeps the LLM from inventing fixes for problems that don't exist.
- Tolerates markdown fences, trailing chatter, malformed JSON (returns empty gracefully), and out-of-range confidence values.
- Flags suggestions with `confidence < 0.3` as `[speculative]` rather than dropping them silently.
Opt-in only (~1-2k output tokens per diff, same backend selection rules
as --judge / --explain). 13 unit tests covering anchor-grounding
enforcement, JSON robustness, and evidence-truncation safety.
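The grounding and confidence rules above can be sketched as a filter pass (field names like `anchor_id` are assumptions, not the module's actual schema):

```python
# Hedged sketch of the anchor-grounding filter described above.
def filter_suggestions(suggestions: list[dict], valid_anchor_ids: set[str]) -> list[dict]:
    kept = []
    for s in suggestions:
        if s.get("anchor_id") not in valid_anchor_ids:
            continue  # ungrounded: the model invented an anchor, drop it
        # Clamp out-of-range confidence instead of rejecting the suggestion.
        conf = min(max(float(s.get("confidence", 0.0)), 0.0), 1.0)
        kept.append({**s, "confidence": conf,
                     "speculative": conf < 0.3})  # flag, don't silently drop
    return kept
```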
Upgraded¶
- Rust workspace version `1.1.0` → `1.2.0`.
- Python package `1.1.0` → `1.2.0` (ABI-compatible with 1.1 consumers).
- `shadow.importers` now exports 8 formats: Braintrust, Langfuse, LangSmith, MCP, OpenAI Evals, OTel, Vercel AI, PydanticAI.
Testing¶
- 70 new unit tests across the five modules.
- Full suite: 468 passed (up from 398), no regressions.
- `cargo test --workspace`: 201 passed.
- `ruff check`, `ruff format --check`, `mypy --strict`, `cargo clippy -- -D warnings`, `cargo fmt --check`: all clean.
[1.1.0] - 2026-04-24¶
Added, scale, correctness, ops hardening¶
A six-item hardening pass against the honest gaps called out in the v1.0 postmortem. Each item ships with concrete scope and explicit documentation of what's not done.
1. Scale verified to N=10k (item 6)¶
Extended benchmarks/scale_drill_down.py with
SHADOW_SCALE_BIG=1 and SHADOW_SCALE_HUGE=1 tiers. Running at
N=5k surfaced a real super-linear blow-up (17.92s at N=1k →
484.82s at N=5k, 27× wall-time for 5× pairs). Root-caused to the
O(N²) Needleman-Wunsch matrix allocation in
crates/shadow-core/src/diff/alignment.rs.
Fix: banded Needleman-Wunsch. Above
SCALE_BAND_THRESHOLD = 1000 pairs, the DP is restricted to a band
of max(|N-M| + 100, sqrt(max(N,M))) cells around the diagonal, a
standard technique from the sequence-alignment literature (SWAT,
Hirschberg). Below the threshold the full-matrix variant stays
exact for all existing tests. At N=5k the new numbers: 19.43s
(3.89 ms/pair). At N=10k: 40.10s (4.01 ms/pair). Per-pair cost
stays flat at big N, confirmed linear.
Added per-pair ms budget MAX_MS_PER_PAIR = 50 so accidental
algorithmic regressions fail loudly at any N, not just at the
scale tier that happens to be running.
2. Property-based tests via Hypothesis (item 5)¶
New python/tests/test_properties.py, 8 property tests
exercising ~600 generated inputs each. Properties:
- Canonical-JSON roundtrip is byte-deterministic.
- `compute_diff_report` never crashes, always emits 9 axes, finite CI bounds, recognised severity enum values.
- Self-diff produces `|delta| < 1e-6` on every axis for any trace.
- Cost-attribution identity: `model_swap + token_movement + mix_residual == total_delta` to f64 precision, for any session pair and any pricing table.
- Schema-watch is monotone on no-op inputs for any config.
- `canonical_bytes` and `content_id` are deterministic on arbitrary nested JSON.
Catches regressions the example-based tests miss.
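The determinism property has the flavour below. This sketch uses plain stdlib `random` where the real suite uses Hypothesis, and `json.dumps` with sorted keys stands in for Shadow's `canonical_bytes`:

```python
import json
import random

def canonical(obj) -> bytes:
    # Stand-in for canonical_bytes: sorted keys, minimal separators.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()

def random_json(depth: int = 3):
    """Generate arbitrary nested JSON, the way a Hypothesis strategy would."""
    if depth == 0 or random.random() < 0.3:
        return random.choice([None, True, 42, "x"])
    return {f"k{i}": random_json(depth - 1) for i in range(random.randint(0, 3))}

for _ in range(200):
    obj = random_json()
    # Byte-deterministic roundtrip: serialise -> parse -> serialise is a fixed point.
    assert canonical(obj) == canonical(json.loads(canonical(obj)))
```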
3. Needleman-Wunsch span alignment (item 4)¶
shadow.hierarchical.span_diff previously used greedy per-index
alignment. On long tool-heavy responses (the real case as agents
accumulate 20+ tool calls per turn), a single inserted tool_use
block would cascade into every downstream block being reported as
block_type_changed.
Now: two-path dispatch by size. ≤ 5 blocks either side uses the
greedy fast path (optimal and cheap). > 5 blocks uses
Needleman-Wunsch alignment with a cost model that nudges the aligner toward
reporting add + remove over block_type_changed when block
types differ. Verified with two new tests that drop / insert a
block at position 10 of a 20-block response; NW correctly reports
exactly 1 add/remove and zero cascaded type changes.
Token-level 5th hierarchy deferred to v1.2+.
4. Security hardening pass (item 8), NOT a formal audit¶
Concrete hardening pass across four attack surfaces. Explicitly documented as "hardening pass, not a formal third-party audit" in SECURITY.md.
- Parser resource bounds: new `DEFAULT_MAX_LINE_BYTES` (16 MiB per record) and `DEFAULT_MAX_TOTAL_BYTES` (1 GiB per trace) with typed `LineTooLarge` / `TraceTooLarge` errors. Tunable per `Parser` via `with_max_line_bytes` / `with_max_total_bytes`. The per-line cap uses `Read::take` so a newline-free stream errors out at the cap rather than growing the buffer unbounded.
- Path-traversal guard on `shadow quickstart`: refuses system directories (`/etc`, `/usr`, `/bin`, `/sbin`, `/boot`, `/proc`, `/sys`, `/dev`).
- `SECURITY.md` updated with an honest threat-model section, a hardening-pass summary, and an explicit list of what was NOT hardened (JSON depth, reproducible builds, formal audit).
- 2 new Rust tests (`rejects_a_line_longer_than_the_configured_limit`, `rejects_total_trace_exceeding_byte_cap`).
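The per-line cap idea translates to Python roughly as follows (the real implementation is Rust's `Read::take`; this sketch also over-reads past the newline, which the real parser does not):

```python
class LineTooLarge(Exception):
    """Python stand-in for the typed Rust LineTooLarge error."""

def read_capped_line(stream, max_line_bytes: int) -> bytes:
    # Never buffer more than cap + 1 bytes: seeing the extra byte without a
    # newline means the record exceeds the cap, so fail instead of growing.
    chunk = stream.read(max_line_bytes + 1)
    newline = chunk.find(b"\n")
    if newline == -1 and len(chunk) > max_line_bytes:
        raise LineTooLarge(f"record exceeds {max_line_bytes} bytes")
    return chunk if newline == -1 else chunk[:newline]
```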
5. Published docs site (item 7)¶
New mkdocs.yml + docs/ tree + .github/workflows/docs.yml
GitHub Pages deploy. Complete navigation:
- Quickstart: Install, Record, Wire into CI
- Features: Nine-axis diff, Judges, Bisect, Schema-watch, MCP, Cost attribution, Hierarchical diff
- Reference: CLI, .agentlog format, Pricing table
- Security, Changelog
Built locally with mkdocs build --strict (zero warnings).
Deploys automatically from main.
6. Counterfactual replay (item 3, one slice)¶
New shadow.counterfactual module, the first of five
replay-as-science slices the strategic analysis called out. Isolates a
single config delta (model swap, temperature change, system-prompt
override, tools-list replacement, etc.) and re-runs the trace
through a live backend with only that one thing changed.
Composes with shadow bisect: bisect gives statistical attribution
("we think the model swap is 78% of the latency regression"); a
counterfactual replay confirms it with a direct experiment that
holds everything else constant.
14 new unit tests. Explicitly documented deferred slices in the module docstring: partial replay, sandboxed replay, streaming replay, multimodal replay.
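The "only one thing changed" contract can be sketched as a config transform; the real `shadow.counterfactual` API surface is richer than this and the model names below are placeholders:

```python
from copy import deepcopy

def counterfactual_config(baseline_config: dict, key: str, value) -> dict:
    """Return a copy of the config with exactly one field changed."""
    changed = deepcopy(baseline_config)
    changed[key] = value  # the single isolated delta, e.g. a model swap
    return changed

cfg_b = counterfactual_config(
    {"model": "model-a", "temperature": 0.0}, "model", "model-b")
```

Replaying the baseline trace under `cfg_b` then answers the direct-experiment question while every other knob is held constant.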
Test totals¶
- 201 Rust tests (was 199, +2 parser-bound tests)
- 398 Python tests (was 374, +14 counterfactual, +2 hierarchical NW, +8 Hypothesis properties)
- 79 hero end-to-end assertions (unchanged)
- 17 live-LLM judge tests (unchanged)
[1.0.0] - 2026-04-24¶
Added, hierarchical diff (Phase D)¶
New shadow.hierarchical module closing the last remaining gap from
the four-phase plan. Shadow's reports previously sat at two levels:
trace (nine-axis table) and turn (drill-down). Two real-world
questions those couldn't answer:
- Which session in a multi-conversation trace regressed?
- Within a regressed turn, which content block actually changed?
Phase D adds both layers:
- `diff_by_session(baseline, candidate, ...)`: partitions both traces by `metadata` record, runs `compute_diff_report` on each session pair, returns one `SessionDiff` per session with its own `DiffReport` and `worst_severity`. Mismatched session counts pad the shorter side with empty sessions (pair_count 0 rows).
- `span_diff(baseline_response, candidate_response)`: content-block-level classifier. Surfaces `text_block_changed` (with char-Jaccard similarity + previews), `tool_use_added` / `_removed`, `tool_use_args_changed` (arg-level deltas including rename-as-remove-plus-add), `tool_result_changed` (`is_error` flip + content differ), `stop_reason_changed`, `block_type_changed`. Uses greedy index alignment; per-turn block counts are small enough that Needleman-Wunsch's extra alignment artefacts outweigh the benefits.
- CLI: new `--hierarchical` flag on `shadow diff` prints a worst-severity-per-session rollup after the nine-axis table. Defaults off; on single-session traces it's redundant with the top-level severity.
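The partition-by-metadata step is the core of the session layer. A sketch with records simplified to dicts carrying a `"type"` field (the `.agentlog` records are richer than this):

```python
# Sketch of session partitioning: each metadata record opens a new session.
def partition_sessions(records: list[dict]) -> list[list[dict]]:
    sessions: list[list[dict]] = []
    for rec in records:
        if rec["type"] == "metadata" or not sessions:
            sessions.append([])
        sessions[-1].append(rec)
    return sessions

trace = [{"type": "metadata"}, {"type": "chat_response"},
         {"type": "metadata"}, {"type": "chat_response"}, {"type": "chat_response"}]
# -> two sessions, of 2 and 3 records respectively
```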
1.0 milestone¶
v1.0.0 marks feature-complete for the four-phase strategic plan:
- Phase A (first-real-diff experience): auto-judge, low-n guidance, "what this means" deterministic summary, `--explain` LLM narrative (shipped in v0.4.0).
- Phase B (MCP server importer): `shadow import --format mcp` ingests Anthropic's Model Context Protocol session logs (shipped in v0.5.0).
- Phase C (session-cost attribution): per-session cost delta decomposition into model_swap + token_movement + mix_residual (shipped in v0.6.0).
- Phase D (hierarchical diff): session-level + span-level breakdowns (this release).
Test state at v1.0.0:
- 374 Python unit tests (all green, no skips outside live-API)
- 199 Rust tests
- 17 live-LLM judge tests (verified against real Claude Haiku
4.5 and GPT-4o-mini)
- 79 hero end-to-end assertions across 10 stages on real
committed .agentlog fixtures
- cargo fmt/clippy -D warnings, ruff check/format --check,
mypy --strict all clean across 68 source files
Tests (Phase D)¶
- 18 new `test_hierarchical.py` tests: session partitioning, worst-severity propagation, mismatched-session padding, span-level type-swap / tool-use arg changes / tool-result flips / stop_reason changes / identity-mapped block indices, both renderers.
- Hero harness: 73 → 79 assertions. New `stage_hierarchical_diff` asserts exactly 1 session detected in the devops-agent trace, worst_severity = severe, 5 paired responses, and span-level detection of ≥ 1 tool_use change on turn #0.
[0.6.0] - 2026-04-24¶
Added, session-cost attribution (Phase C)¶
New shadow.cost_attribution module + CLI integration. Shadow's
existing per-response cost axis says whether a PR moved cost.
This answers the follow-up question a CFO or eng lead always asks:
why, and by how much per user-facing session?
- Session partitioning. A "session" is the span between two `metadata` records in an `.agentlog`: one user-facing conversation including all follow-up tool calls. Shadow rolls up per-session input / output / cached / reasoning token counts and USD spend.
- Attribution decomposition. The cost delta between a baseline session and its candidate decomposes into three independent sources: `total_delta = model_swap + token_movement + mix_residual`.
  - `model_swap`: how much of the delta is the candidate model's price-per-token vs the baseline's, holding tokens constant at candidate levels.
  - `token_movement`: how much is the token-count change, holding price at baseline.
  - `mix_residual`: non-additive interaction (simultaneous model swap + token movement). When `|residual| > 10% of |total_delta|` the decomposition is flagged as "less trustworthy" so the user knows the simple two-factor story is incomplete.
- Pricing table compatibility. Uses the same rich-dict pricing shape the Rust `cost` axis accepts: input / output / cached_input / cached_write_5m / cached_write_1h / reasoning rates. Unknown models contribute $0 (no crash).
- Rendering. `shadow diff` prints the attribution section after the nine-axis table when the cost delta is non-zero:
```
cost attribution (per session):
  session #0: $0.0870 → $0.0174 (Δ $-0.0696, -80.0%)
    model swap claude-opus-4-7 → claude-sonnet-4-6: $-0.0696 (+100%)
    token movement: $+0.0000 (-0%)
  total: $0.0870 → $0.0174 (Δ $-0.0696)
```
Markdown renderer emits a GitHub-flavoured table when callers
attach a cost_attribution key to the DiffReport dict they
pass to render_markdown / render_github_pr.
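A worked sketch of the identity. One common convention (which factor holds baseline vs candidate levels is a choice; Shadow's exact convention may differ) prices the swap at baseline token volume, leaving the interaction as an explicit residual:

```python
def decompose(price_b: float, tokens_b: int, price_c: float, tokens_c: int):
    """Two-factor cost-delta decomposition with an interaction residual."""
    total = price_c * tokens_c - price_b * tokens_b
    model_swap = (price_c - price_b) * tokens_b       # price change, baseline volume
    token_movement = price_b * (tokens_c - tokens_b)  # volume change, baseline price
    mix_residual = total - model_swap - token_movement
    return total, model_swap, token_movement, mix_residual
```

Under this convention the residual equals `(Δprice) × (Δtokens)` exactly, which is why a simultaneous model swap and token movement is the case the "less trustworthy" flag exists for.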
Tests¶
- 21 new `test_cost_attribution.py` tests: session partitioning, per-session roll-up, pricing-shape tolerance (rich dict + legacy tuple), cached-input / reasoning token rates, the fundamental identity (`swap + move + residual == delta` across every scenario), the noisy flag, multi-session alignment, mismatched session counts, both renderers.
- Hero harness: 64 → 73 assertions. A new stage (`stage_session_cost_attribution`) synthesises pure-swap and mixed scenarios to lock the attribution arithmetic.
[0.5.0] - 2026-04-24¶
Added, MCP (Model Context Protocol) importer (Phase B)¶
Shadow now ingests MCP server session logs. MCP is Anthropic's open JSON-RPC-2.0 protocol (spec 2025-06-18, v1.0) for agent-to-tool communication, adopted by Claude Desktop, Cursor, Windsurf, Zed, VS Code, and every Anthropic-flavoured IDE by early 2026.
- `shadow import --format mcp <log-file>`: a new importer (`shadow.importers.mcp`) converts an MCP session into a partial `.agentlog`. Accepts three in-the-wild input shapes: (a) JSONL, one JSON-RPC message per line (`mcp-server --log` output); (b) JSON array, what MCP Inspector's export produces; (c) wrapped object, `{"messages": [...], "metadata": {...}}`. Auto-detects between JSONL and wrapped-object by trying a whole-file parse before falling back to line-by-line.
- Perfectly captured: tool-call trajectory (arg rename, sequence change, omitted calls) and tool schema (everything `tools/list` advertised). Shadow's trajectory axis, first-divergence detector, and schema-watch all work on imported traces with no further wiring.
- Partial: tool results (when present in the log). Mapped to Anthropic-style `tool_result` content blocks.
- Not captured: LLM completions. MCP is the tool protocol, not the LLM protocol; semantic / verbosity / safety axes show zero on MCP imports by design.
- MCP error responses are mapped to `{"type": "tool_result", "is_error": true, ...}` blocks so they surface on the conformance axis.
- Orphan requests (no response: client crash, disconnect) still produce a `chat_response` record with just the tool_use block, preserving the trajectory signal.
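The shape auto-detection reduces to one whole-file parse attempt; the function and label names below are illustrative, not the importer's actual API:

```python
import json

def detect_mcp_shape(raw: str) -> str:
    """Classify raw MCP log text into one of the three accepted input shapes."""
    try:
        doc = json.loads(raw)          # whole-file parse first
    except json.JSONDecodeError:
        return "jsonl"                 # multi-line: fall back to line-by-line
    if isinstance(doc, list):
        return "json-array"            # MCP Inspector export
    if isinstance(doc, dict) and "messages" in doc:
        return "wrapped-object"
    return "jsonl"                     # single bare JSON-RPC message per line
```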
Added, real-world MCP example¶
`examples/mcp-session/`: committed baseline + candidate JSONL MCP logs for an Acme customer-support scenario. The baseline agent uses `search_orders` → `refund_order` in the correct order with the original arg names; the candidate silently renames `customer_id` → `cid` and skips the confirmation step. Shadow's trajectory axis fires end-to-end.
Tests¶
- 13 new `test_mcp_importer.py` tests (shape round-trip, metadata tools-list capture, tool_use + tool_result block shape, parent-chain connectivity, error responses, orphan requests, all three input shapes (JSONL / JSON array / wrapped object), and full CLI integration including a trajectory-axis assertion on two imported MCP sessions).
- Hero harness extended from 54 to 64 end-to-end assertions: a new stage (`stage_mcp_importer`) imports both committed MCP logs through the CLI and proves the trajectory axis detects the arg rename.
[0.4.1] - 2026-04-24¶
Fixed¶
- Release pipeline: the `sign-and-release` job in `release.yml` failed uploading wheels because all three OS matrix jobs were producing a file named `sbom-python.cdx.json`, and the GitHub release API rejects a second attempt to attach a same-named asset. SBOM generation is now gated to the `ubuntu-latest` matrix entry only; Python deps are identical across OSes, so a single SBOM is authoritative.
[0.4.0] - 2026-04-24¶
Added, first-real-diff experience (Phase A)¶
Turns "cool diff" into "actionable PR review", the natural follow-up to
v0.3.x's adoption focus. A new user who just pip install shadow-diff'd
now sees a useful diff on first run, not a noisy blank-judge table.
- `--judge auto`: a new value for `--judge` that resolves to `sanity` against whichever API-key env var is present (`ANTHROPIC_API_KEY` preferred for Claude Haiku 4.5's cost, then `OPENAI_API_KEY` for gpt-4o-mini). Falls through to `none` cleanly with no key. A one-cent judge signal that doesn't require flag archaeology.
- Low-n guidance banner: terminal and markdown renderers both warn loudly when `pair_count < 5`; bootstrap CIs at that sample size are unreliable, and the user needs to know before reading severities.
- "What this means" deterministic summary: new `shadow.report.summary.summarise_report` turns the nine-axis table into a 2–4 line paragraph a PR reviewer can read in one breath: leads with structural axes (trajectory / conformance / safety), cites verbatim deltas with axis-appropriate units (ms, tokens, USD), embeds the first-divergence line, calls out the worst drill-down pair when its regression_score > 1.0, and surfaces the top error-severity recommendation. Rendered above the axis table in both terminal and markdown output. No LLM involved, fully reproducible.
- `shadow diff --explain`: an opt-in flag that pipes the deterministic summary through the judge backend to produce a ~60-word prose narrative. Verified end-to-end against live Claude Haiku 4.5 on the real devops-agent fixture; produces a tight, accurate rundown ("tool set shrunk 4→1, format compliance failed (-1.0), root cause: structural drift at turn 0"). ~$0.0003 per run. Never fires without explicit opt-in, respecting zero-friction defaults.
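The `--judge auto` fallback order is a short env-var cascade. A hedged sketch (the real resolver lives in the CLI and returns backend objects, not the strings used here):

```python
def resolve_auto_judge(env: dict) -> str:
    """Pick a sanity-judge backend from available API keys, or fall through."""
    if env.get("ANTHROPIC_API_KEY"):
        return "sanity:anthropic"   # preferred: Claude Haiku 4.5 is cheapest
    if env.get("OPENAI_API_KEY"):
        return "sanity:openai"      # otherwise gpt-4o-mini
    return "none"                   # no key: judge axis stays off, no error
```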
Fixed¶
- `AnthropicLLM` / `OpenAILLM` default model. Real-API verification surfaced this: when neither `--judge-model` nor the request's model field was set, the backends forwarded an empty string, and the API rejected with `invalid_request_error: model: String should have at least 1 character`. Judges returned `error` → neutral 0.5 scores silently. Added `DEFAULT_MODEL` class constants (`claude-haiku-4-5-20251001` and `gpt-4o-mini`) used when both the override and the request payload are empty. Every live judge call from the CLI now works with no model override, fixing the accidental 0.5-on-axis-8 behaviour for users running `--judge sanity` or `--judge auto`.
Tests¶
- 17 new `test_summary.py` tests (low-n caveats, axis priority, delta unit formatting, first-divergence embedding, worst-pair threshold, recommendation ranking, byte-budget discipline).
- 5 new `test_auto_judge.py` tests (no keys → fall-through, only anthropic → anthropic, only openai → openai, both → anthropic, explicit `none` bypasses auto).
- Hero harness extended from 47 to 54 end-to-end assertions covering the new UX on committed fixtures.
- Live-API end-to-end: `--judge auto` + `--explain` against real Claude Haiku 4.5 produces a correct narrative; 17/17 judge-live tests still pass after the default-model fix.
[0.3.1] - 2026-04-24¶
Fixed¶
- `shadow record` now fails fast on an unwritable output path. An in-depth real-world verification of v0.3.0 caught one behavioural bug: if `-o` pointed at a read-only directory, `shadow record` would launch the wrapped agent anyway (burning real LLM tokens), then quietly emit a warning on atexit when the flush failed. The recording was silently lost. `shadow record` now probes the output path's writability before spawning the child and exits with code 2 plus a human-actionable error if the write would fail. New ground-truth test `test_shadow_record_fails_fast_on_unwritable_output_path` guards the invariant.
[0.3.0] - 2026-04-24¶
Added, zero-friction adoption¶
A full week on the one thing that was harder than it should have been:
making Shadow trivially usable for a new user who just ran
pip install shadow-diff.
- Auto-instrumentation via PYTHONPATH + sitecustomize. `shadow record -- python your_agent.py` now records an agent's LLM calls with zero code changes to the agent itself. A shim `sitecustomize.py` prepended to the child's PYTHONPATH fires at interpreter startup, checks for `SHADOW_SESSION_OUTPUT`, and if set, constructs and enters a `Session` whose atexit handler flushes the trace on process shutdown. All existing Session features (anthropic/openai monkey-patching, redaction, trace propagation) apply automatically. A new `--tags` flag on `shadow record` propagates to the metadata record. A new `--no-auto-instrument` opts out cleanly if the agent already manages its own Session.
- `shadow quickstart` scaffolder. Drops a working demo (`agent.py`, `config_a.yaml`, `config_b.yaml`, two pre-recorded `.agentlog` fixtures, `QUICKSTART.md`) into a directory in one command. A brand-new user goes from `pip install shadow-diff` to seeing a real nine-axis diff in under 60 seconds, no API keys required. `--force` overwrites existing files; otherwise existing content is preserved.
- `shadow init --github-action`. Extends the existing `init` command with a flag that drops a ready-to-commit `.github/workflows/shadow-diff.yml` into the user's repo. Wires up `pip install shadow-diff`, runs `shadow diff` on every PR, renders the report as PR-comment markdown, and posts via the `gh` CLI. Edit two env vars (BASELINE / CANDIDATE paths), commit, and every PR gets a behavioural-diff comment with no further setup.
- README rewrite. The "Try it" section is now "5-minute adoption" and leads with the three-command adoption path: `pip install shadow-diff` → `shadow quickstart` → `shadow diff`. The "Instrument your own agent" section now documents the zero-config `shadow record` path as the recommended option, with the explicit `Session` pattern as the secondary choice.
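The shim's startup decision can be sketched as follows; only the env-var name comes from the changelog, and the real shim constructs an actual `Session` rather than the no-op hook used here:

```python
import atexit

def maybe_enter_session(env: dict) -> bool:
    """Decide whether auto-instrumentation should activate for this process."""
    output = env.get("SHADOW_SESSION_OUTPUT")
    if not output:
        return False  # not running under `shadow record`: do nothing
    # Real shim: session = shadow.Session(output=output); session.__enter__()
    # and atexit.register(session.flush). Sketched as a no-op hook here.
    atexit.register(lambda: None)
    return True
```

Because `sitecustomize.py` runs at interpreter startup, this fires before any of the agent's own imports, which is what makes the monkey-patching zero-config.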
Fixed¶
- Typer extra-args forwarding on `shadow record`. Typer 0.24 (bumped for the `--help` fix in 0.2.x) changed how the `--` separator is handled at the command level; `shadow record` needed its own `context_settings={"allow_extra_args": True, "ignore_unknown_options": True}` for child args after `--` to be forwarded to `ctx.args`. Previously this was only set at the app level, which no longer sufficed.
Tests¶
- 10 new `test_autostart.py` tests covering env-var handling, tag parsing, empty-command rejection, exit-code propagation, `--no-auto-instrument` behaviour, and the PYTHONPATH shim contents.
- 12 new `test_quickstart.py` tests covering file scaffolding, valid agentlog output, `shadow diff` on the scaffolded fixtures, `--force` behaviour, `--github-action` workflow generation, and the composed quickstart + init flow.
- Hero harness extended from 34 to 47 end-to-end assertions covering the new adoption path on committed fixtures.
[0.2.2] - 2026-04-23¶
Changed¶
- PyPI distribution name renamed from `shadow` to `shadow-diff`. The short `shadow` name was already registered on PyPI by an unrelated 2015 btrfs-snapshot utility, so the project is now published as `pip install shadow-diff`. The Python import path (`import shadow`), the installed CLI command (`shadow`), and the GitHub repo slug (manav8498/Shadow) are unchanged; only the PyPI distribution name differs.
Fixed¶
- Release pipeline: `cargo cyclonedx --output-pattern package` failed because `cargo-cyclonedx` 0.5.7 rejected the `--output-pattern` flag (upstream removed it). Dropped the flag and rely on the default per-crate output path; the schema-valid minimal-SBOM fallback still catches any future upstream path drift.
[0.2.1] - 2026-04-23¶
Fixed¶
- Release pipeline: `cargo package -p shadow-core` failed in the v0.2.0 release run because the crate's `include` allowlist matched only `src/**/*.rs`; `src/store/schema.sql`, which is read via `include_str!()`, was excluded and the verify-build inside the published tarball broke. Added `src/**/*.sql` to the allowlist.
- Release pipeline: the Python SBOM step wrote to `dist/` at the repo root (which didn't exist) while wheels landed in `python/dist/`; redirected SBOM output to match.
Added¶
- PyPI publish job (`publish-pypi` in `release.yml`) using an OIDC Trusted Publisher, no API token required. Bound to a `pypi` GitHub Environment so the trust link is (repo, workflow, environment)-scoped. See `docs/PYPI-PUBLISHING.md` for the one-time setup (pypi.org pending publisher + GitHub Environment creation).
[0.2.0] - 2026-04-23¶
Fixed¶
- `shadow <cmd> --help` now renders. Bumped `typer` from pinned `0.13.0` to `>=0.15,<1.0`; the old pin was incompatible with `click 8.2+` (which broke the `TyperArgument.make_metavar()` signature). All twelve subcommands print their help pages again.
Added¶
- Live-LLM judge tests: new `python/tests/test_judge_live.py` exercises every judge against real Anthropic and OpenAI backends. Gated by `SHADOW_RUN_NETWORK_TESTS=1` plus `ANTHROPIC_API_KEY` / `OPENAI_API_KEY`; auto-skips otherwise. Each test picks a scenario where the correct verdict is unambiguous so a real LLM's behaviour can be asserted directly. ~$0.01 token budget per backend per full run.
- Scale benchmark for drill-down: new `benchmarks/scale_drill_down.py` runs the full nine-axis diff plus drill-down ranking on synthetic traces at `N ∈ {100, 500, 1000}`. Asserts correctness (ranking invariants, dominant axis, `pair_count` match) and a 60-second wall-time cap at N=1000. Current numbers on a local laptop: 0.18 s @ N=100, 4.5 s @ N=500, 18.1 s @ N=1000.
- Every optional extra now exercised in CI. New `serve` and `otel` extras in `pyproject.toml`, plus a new CI job `python-full-extras` that installs `dev`, `anthropic`, `openai`, `otel`, `serve`, and `embeddings` and runs the full test suite: no more silently-skipped tests. Test count went from 260 passed / 18 skipped to 278 passed / 0 skipped (excluding the network-gated `test_judge_live.py` module, which correctly stays skipped without API keys).
- Per-pair drill-down: `DiffReport` gains a `drill_down` field ranking the top-K most-regressive response pairs in a trace set by an aggregate regression score, with a per-axis breakdown for each. Surfaces which specific turns drove each aggregate axis delta, so a reviewer scanning a PR with many paired traces can click in to the single worst pair instead of hand-auditing them all.
Each row carries pair_index, baseline_turn, candidate_turn,
regression_score, dominant_axis, and an axis_scores list of
8 PairAxisScore entries (Judge excluded, the Rust core never
populates it). Per-axis normalized_delta is |delta| /
axis_scale clamped to [0, 4]; scales are calibrated against the
existing Severity::Severe thresholds so a value of 1.0
corresponds to one severity-severe-sized movement on that axis
and scores sum coherently across axes.
On the real devops-agent fixture: top pair is pair #1 with
regression score 9.32, verbosity collapsed (402→128 output
tokens) and trajectory flipped (baseline had 0 tool-divergence,
candidate had 100%). A reviewer spots the regression without
opening any raw .agentlog bytes.
Rendered by every report format: terminal (one line per pair with
the top 2 contributing axes indented), markdown / github-pr
(### Top regressive pairs section with the top 3 inline and the
rest under a collapsible <details> block).
Tests: 8 new Rust unit tests in diff::drill_down::tests + 7
Python renderer/integration tests. Hero harness extended to
assert drill_down surfaces ≥ 3 regressive pairs on the real
devops-agent scenario (now 34/34 end-to-end assertions).
Closes the last open v0.2 ROADMAP item ("Per-pair drill-down in the diff report").
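The per-axis normalisation above can be sketched directly; the axis scales used in the test are made-up numbers, not Shadow's calibrated `Severity::Severe` thresholds:

```python
def normalized_delta(delta: float, axis_scale: float) -> float:
    # 1.0 == one severity-severe-sized movement on this axis; clamp to [0, 4]
    # so a single wild axis can't dominate the aggregate regression score.
    return min(abs(delta) / axis_scale, 4.0)

def regression_score(axis_deltas: dict[str, float],
                     scales: dict[str, float]) -> float:
    """Aggregate score: normalised deltas sum coherently across axes."""
    return sum(normalized_delta(d, scales[axis]) for axis, d in axis_deltas.items())
```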
-
- Hero end-to-end real-world scenario. New harness `benchmarks/hero_devops_scenario.py` exercises every Shadow feature (schema-watch, nine-axis diff, first-divergence, top-K divergences, recommendations, hardened bisection, and all ten judges) end-to-end against committed `.agentlog` trace fixtures, not hand-crafted verdicts. Inputs are the real YAML configs and real recorded agent traces under `examples/devops-agent/` and `examples/customer-support/`. Every feature must independently surface a symptom of the PR's regression for the run to pass: 30/30 assertions currently green across two domains. This answers the integration question the per-feature `validate_*.py` harnesses can't: do the features compose on real trace data?
- Judge axis defaults: 10 ready-made judges. Previously axis 8 was effectively a no-op without a user-written rubric. The package now ships six new Judge classes on top of the existing four (`SanityJudge`, `PairwiseJudge`, `CorrectnessJudge`, `FormatJudge`):
  - `LlmJudge`: generic, user-configurable LLM-as-judge. The caller supplies a rubric string (may reference `{task}`, `{baseline}`, `{candidate}`) plus a `score_map` of verdict strings to `[0, 1]` scores. Construction-time validation rejects unknown placeholders so mistakes surface before the first judge call. Defaults (`temperature=0`, `max_tokens=512`) match published LLM-as-judge best practice (Zheng et al. 2024).
  - `ProcedureAdherenceJudge`: flags candidates that skip steps from a required procedure. Catches the devops-agent pattern where a prompt rewrite silently drops `backup_database → run_migration` ordering.
  - `SchemaConformanceJudge`: semantic schema review (shape + meaning). Complements `FormatJudge`'s mechanical JSON-schema validation.
  - `FactualityJudge`: flags candidates whose claims contradict a known-fact set.
  - `RefusalAppropriateJudge`: catches over- AND under-refusals against an explicit policy.
  - `ToneJudge`: tone / persona drift against a target.
  - Shared judge helpers consolidated in `shadow.judge._common` (response-text extraction, JSON-object extraction from prose, NaN-safe clamping, uniform error verdicts) so breaking changes to judge-response parsing now land in one file.
  - CLI: `shadow diff --judge <kind>` extended from `none|sanity` to nine values (`none`, `sanity`, `pairwise`, `llm`, `procedure`, `schema`, `factuality`, `refusal`, `tone`). New `--judge-config <file.yaml>` option loads rubric data for the domain judges. Six example config templates shipped under `examples/judges/` covering the devops-agent procedure, the customer-support schema, Acme factuality, a domain-restriction refusal policy, a concise-persona tone target, and a generic three-tier `LlmJudge`.
  - Tests: 23 new ground-truth unit tests in `test_llm_judge.py` (placeholder validation, custom score maps, error paths, out-of-range confidence clamping, each of the five domain judges). Real-world validation harness (`benchmarks/alignment/validate_judges.py`) passes 14/14 assertions exercising the full rubric → render → parse → score pipeline via a deterministic backend.
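The construction-time placeholder check described above can be approximated with the stdlib `string.Formatter`; a minimal sketch, not the shipped `LlmJudge` code (the helper name `validate_rubric` is hypothetical):

```python
from string import Formatter

# The three placeholders the rubric may reference, per the LlmJudge docs above.
ALLOWED_PLACEHOLDERS = {"task", "baseline", "candidate"}

def validate_rubric(rubric: str) -> None:
    """Reject unknown {placeholders} at construction time, before any judge call."""
    fields = {name for _, name, _, _ in Formatter().parse(rubric) if name}
    unknown = fields - ALLOWED_PLACEHOLDERS
    if unknown:
        raise ValueError(f"unknown rubric placeholders: {sorted(unknown)}")

validate_rubric("Given {task}, compare {baseline} against {candidate}.")  # passes
```

Doing this eagerly means a typo like `{output}` fails at judge construction rather than on the first (billed) LLM call.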
- `shadow schema-watch`, proactive tool-schema change detection that runs before replay. Classifies each change between two configs into one of four severity tiers (breaking / risky / additive / neutral) across eleven change kinds (tool added/removed, param added/removed/renamed, type changed, required flipped, enum narrowed/broadened, description edited). Rename detection pairs a removed and an added param on the same tool by matching type and required-status, catching the most common breaking tool-schema edit in practice (see the `devops-agent` example where every tool silently renames `database` to `db`).
  Exposed as a new CLI command with three output formats (terminal with rich markup, markdown with a GitHub table + expandable rationale, and json) and a `--fail-on` flag that controls exit-code behaviour in CI. Example output on the committed customer-support fixture:

  ```text
  ✖ BREAKING lookup_order: parameter renamed `order_id` → `id`
  ! RISKY    refund_order: description rewritten, imperative verbs removed
  + ADDITIVE lookup_order: parameter `include_shipping` added (optional)
  · NEUTRAL  lookup_order: description rewritten
  ```

  Intended to run first in CI so PRs get a fast schema-breakage signal before the full nine-axis diff runs. 24 ground-truth unit tests plus a real-world validation harness (`benchmarks/alignment/validate_schema_watch.py`) that passes 14/14 assertions against the devops-agent (8-tool `database` → `db` rename) and customer-support (`order_id` → `id` rename + optional param + risky description edit) fixtures.
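The rename-pairing heuristic fits in a few lines; this is an illustration of the matching rule (same type, same required flag), with an assumed param-record shape, not the shipped implementation:

```python
def detect_renames(removed: list[dict], added: list[dict]) -> list[tuple[str, str]]:
    """Pair each removed param with an added param on the same tool whose
    type and required-status match; leftover params stay plain add/remove."""
    renames, pool = [], list(added)
    for old in removed:
        match = next((p for p in pool
                      if p["type"] == old["type"] and p["required"] == old["required"]),
                     None)
        if match is not None:
            pool.remove(match)
            renames.append((old["name"], match["name"]))
    return renames

# The devops-agent fixture's breaking edit: `database` silently becomes `db`.
print(detect_renames(
    [{"name": "database", "type": "string", "required": True}],
    [{"name": "db", "type": "string", "required": True}],
))  # [('database', 'db')]
```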
- Hardened causal bisection, new `shadow.bisect.attribution.rank_attributions_with_interactions`. Fits pairwise interaction effects (delta A × delta B) in addition to main effects, and emits honest bootstrap 95% confidence intervals on every attribution weight. Example output on a realistic 4-delta config PR:

  ```text
  semantic:
    prompt.system             74.9% [71.0%, 89.2%] sel_freq=1.00 ✓
    model_id x prompt.system  13.8% [ 3.8%, 17.5%] sel_freq=0.89 ✓
  latency:
    model_id                  61.3% [59.2%, 68.0%] sel_freq=1.00 ✓
    tools                     19.7% [15.3%, 22.4%] sel_freq=0.94 ✓
    model_id x tools          16.6% [12.0%, 19.4%] sel_freq=0.96 ✓
  ```
  Implementation follows the research brief's explicit guidance: `PolynomialFeatures(interaction_only=True)` for the augmented design; residual bootstrap (Chatterjee & Lahiri 2011) instead of pairs bootstrap to avoid LASSO's point-mass-at-zero pathology; alpha fixed via outer `LassoCV` once on the original data (not re-tuned per resample; Efron 2014 shows that inflates CI width); per-resample normalisation before percentile (not normalise-then-bootstrap, which breaks CI independence); and a strong-hierarchy post-filter that drops A × B interactions where neither main effect survived stability selection (Lim & Hastie 2015, glinternet).

  The `significant` flag is now a conjunction (selection frequency ≥ 0.6 AND CI excludes zero), which is described as screening + magnitude, not a multiplicity-adjusted p-value. Lex-sorted interaction pair labels eliminate the (A, B) vs (B, A) ambiguity. Output available via `run_bisect(..., backend=...)` under the new `attributions_with_interactions` key. 7 new Rust/Python ground-truth tests plus a real-world validation harness (8/8 passing) in `benchmarks/alignment/validate_hardened_bisect.py`.
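A compressed sketch of the fitting recipe on synthetic data (the real `rank_attributions_with_interactions` adds stability selection and the strong-hierarchy filter; data and sizes here are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(32, 4)).astype(float)        # toy corner design, 4 deltas
y = 0.7 * X[:, 0] + 0.3 * X[:, 0] * X[:, 1] + rng.normal(0, 0.01, 32)

# Augmented design: 4 main-effect columns + 6 pairwise A x B interaction columns.
aug = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
Xa = aug.fit_transform(X)                                  # shape (32, 10)

# Alpha tuned once on the original data, then held fixed across resamples.
base = LassoCV(cv=5, random_state=0).fit(Xa, y)
fitted, resid = base.predict(Xa), y - base.predict(Xa)

weights = []
for _ in range(200):                                       # residual bootstrap
    yb = fitted + rng.choice(resid, size=len(resid))
    w = np.abs(Lasso(alpha=base.alpha_).fit(Xa, yb).coef_)
    weights.append(w / w.sum() if w.sum() else w)          # normalise per resample
ci_low, ci_high = np.percentile(np.array(weights), [2.5, 97.5], axis=0)
```

With noise this small, the main effect on column 0 dominates the normalised weights, which is the shape of signal the `significant` conjunction is screening for.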
- Prescriptive fix recommendations (`shadow.diff.recommendations`). Transforms the divergence list from "what changed" into "what to do about it": a ranked list of specific, imperative actions a PR reviewer can act on in under 30 seconds. Examples from the real-world 10-turn scenario:
  - `[error] RESTORE` turn 9, "Restore `send_confirmation_email` at turn 9."
  - `[error] REMOVE` turn 8, "Remove duplicate invocation of `lookup_order(order_id)` at turn 8."
  - `[error] REVIEW` turn 4, "Review refusal behaviour at turn 4: candidate may be over-refusing."
  - `[warning] REVERT` turn 6, "Revert `refund(amount)` at turn 6 to the baseline value."
  Three-tier severity (Error / Warning / Info) matching ESLint / SonarQube / Rustc conventions. Five action kinds (Restore / Remove / Revert / Review / Verify) covering the expected regression patterns. Pure deterministic rule engine, no LLM dependency; LLM-enriched recommendations can layer on later.

  Each recommendation carries severity, action, turn, a one-line message, a rationale line citing the signal that triggered it, and the primary axis. Sorted by severity × confidence, capped at 8 entries so PR comments don't bloat.

  Exposed on `DiffReport` as the new `recommendations` field (documented by the `Recommendation` TypedDict in `_core.pyi`). Rendered in all three report formats: markdown / github-pr (bulleted `### Recommendations` section with severity icons) and terminal (colour-coded by severity). 14 new Rust unit tests + 8 new Python renderer tests; the real-world benchmark harness passes 18/18 including action-correctness assertions.
- Top-K divergence ranking on top of first-divergence detection. The diff report now carries a `divergences` list (up to `DEFAULT_K=5` entries) in addition to the backward-compatible `first_divergence` field. Divergences are sorted by importance: Structural > Decision > Style (by class), then by confidence within a class, with walk order as the stable tiebreaker. Renderers show the top 3 inline and collapse any extras (#4 and beyond) into a `<details>` section in markdown / github-pr, or a "+N more" line in terminal. `first_divergence` remains walk-order-first for back-compat; `divergences[0]` is severity-rank-first. Validated end-to-end with the benchmark harness at 27/27 cases (real committed fixtures + adversarial stress + top-K-specific coverage).
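The importance ordering collapses to a single sort key; a sketch over hypothetical divergence dicts (the real records come from the Rust core):

```python
DEFAULT_K = 5
CLASS_RANK = {"structural_drift": 0, "decision_drift": 1, "style_drift": 2}

def rank_divergences(divergences: list[dict]) -> list[dict]:
    """Structural > Decision > Style, then confidence desc, then walk order
    (turn index) as the stable tiebreaker; truncated to DEFAULT_K."""
    return sorted(
        divergences,
        key=lambda d: (CLASS_RANK[d["class"]], -d["confidence"], d["turn"]),
    )[:DEFAULT_K]

divs = [
    {"class": "style_drift", "confidence": 0.9, "turn": 1},
    {"class": "structural_drift", "confidence": 0.6, "turn": 7},
    {"class": "decision_drift", "confidence": 0.8, "turn": 3},
]
print([d["class"] for d in rank_divergences(divs)])
# ['structural_drift', 'decision_drift', 'style_drift']
```

Note the high-confidence style drift still ranks last: class outranks confidence, which is why `divergences[0]` can differ from the walk-order `first_divergence`.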
- First-divergence detection (`shadow.diff.alignment`). Given two traces, identifies the first turn at which the candidate diverged from the baseline and classifies the divergence as `style_drift` / `decision_drift` / `structural_drift`. Uses a Needleman-Wunsch global alignment with Gotoh affine gap penalties over the chat-response sequence; per-cell cost combines Jaccard distance on tool-shape, character-shingle text similarity, arg-value diff, and stop-reason mismatch. Surfaces in every report renderer (terminal / markdown / github-pr) as a one-line root-cause summary like "tool set changed: removed `search(query)`, added `search(limit,query)`". Exposed on the `DiffReport` dict as the new `first_divergence` key (documented by the `FirstDivergence` TypedDict in `_core.pyi`); `None` when the traces agree end-to-end. 14 new Rust tests + 6 new Python renderer tests.
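One ingredient of that per-cell cost, the Jaccard distance over tool-shape sets, is simple enough to show standalone (a sketch; the shipped cost also mixes in text similarity, arg-value diff, and stop-reason mismatch):

```python
def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; two empty sets count as identical (distance 0)."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Baseline called search(query); candidate called search(limit, query).
# Different tool shapes, so the alignment cell pays the full tool-shape cost.
print(jaccard_distance({"search(query)"}, {"search(limit,query)"}))  # 1.0
```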
- Live LLM backends: `shadow.llm.AnthropicLLM` wraps `anthropic.AsyncAnthropic`, `shadow.llm.OpenAILLM` wraps `openai.AsyncOpenAI`. Both implement the `LlmBackend` Protocol and lazy-import their SDK so `shadow` still runs without the extras. `shadow.llm.get_backend(name, **kwargs)` factory dispatches by name.
- LASSO-over-corners bisection scorer (`shadow.bisect.corner_scorer`). Given a live `LlmBackend`, builds a 2^k full-factorial over the differing config categories ({model, prompt, params, tools}), replays the baseline through the backend at each corner, computes the nine-axis divergence per corner, and fits LASSO per axis. Ground-truth test recovers `latency → model` and `verbosity → prompt` with > 70% weight.
- CLI: `shadow bisect --backend {anthropic,openai,positional}` wires the live-replay scorer. Without `--backend`, falls back to the heuristic kind-based allocator when `--candidate-traces` is supplied, or zero-placeholder otherwise.
- `run_bisect` three-mode dispatch: `lasso_over_corners` (with backend) → `heuristic_kind_allocator` (only a candidate trace) → `lasso_placeholder_zero` (neither). The `mode` field names which ran.
- 11 new tests covering the backends (fake SDK stubs, no network) and the corner scorer (deterministic `FakeBackend`).
- OSS governance / community files: `SUPPORT.md`, `GOVERNANCE.md`, `MAINTAINERS.md`, `TRADEMARK.md`, `CITATION.cff`, `.github/FUNDING.yml`, `.github/dependabot.yml`.
- `AxisStat` and `DiffReport` TypedDicts in `shadow/_core.pyi` so downstream Python consumers get real types for the Rust-extension return shapes.
### Changed

- Dual-license the implementation under MIT OR Apache-2.0 (Rust community default). `SPEC.md` stays Apache-2.0 only so re-implementers get an explicit patent grant for the format.
- README rewritten in plain English (~200 lines), one quickstart; sample output matches real `just demo` output.
- Project metadata on `Cargo.toml` and `pyproject.toml`: keywords, categories, URLs, `Typing :: Typed`, explicit `include` allowlist for the Rust crate.
- `.gitignore` hardened: added `node_modules/`, JS tool caches, `.env.*.local` variants, agent-state dirs, `*.tsbuildinfo`, Jupyter checkpoints, SBOM outputs.
- `shadow/__init__.py` wraps the abi3-mismatch ImportError with a "requires Python 3.11+" hint.
- `AnthropicLLM.__init__` fails fast with a branded error if no API key is set, instead of deferring to the opaque HTTP-layer error from the SDK.
- Removed the `CLAUDE.md` internal coding-agent instruction file from the repository (not appropriate in a public OSS project).
### Fixed

- Severity classifier false negative on rate-bounded axes: a CI like `[0.0, 1.0]` was treated as straddling zero and downgraded Severe to Minor. Now `ci_straddles_zero` is strict (`ci_low < -epsilon && ci_high > +epsilon`). A unanimous `+1.0` trajectory delta now correctly classifies as Severe.
- Conformance axis was dead for tool-use-only agents (`n=0`). It now also fires on `tool_use` intent and scores by top-level key-set match.
- Delta extractor exploded a single tool-schema edit into dozens of leaf-level deltas, which over-determined the LASSO fit. Default `diff_configs(coalesce=True)` collapses each tool to a single delta keyed by tool name. Legacy leaf-level output still available via `coalesce=False`.
- Attribution row schema was inconsistent across bisect modes. All three modes now emit the same six keys (`delta`, `weight`, `ci95_low`, `ci95_high`, `significant`, `selection_frequency`).
- `DELTA_KIND_AFFECTS["tools"]` was missing `conformance`; added, so tool-schema edits can be attributed to the conformance axis.
- 11 pre-existing `ruff` lints (all pattern/idiom nits such as `class X(str, Enum)` → `class X(StrEnum)` for Python 3.11+).
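The strict straddle check from the first fix can be sketched as a tiny classifier (thresholds follow the Phase 2 severity tiers; the function name and shape are illustrative, the shipped classifier is Rust):

```python
def classify_severity(delta_rel: float, ci_low: float, ci_high: float,
                      eps: float = 1e-9) -> str:
    """Strict straddle: BOTH CI ends must clear zero by more than eps,
    so a [0.0, 1.0] CI no longer downgrades a unanimous delta."""
    straddles = ci_low < -eps and ci_high > eps
    if straddles and abs(delta_rel) < 0.10:
        return "none"
    if abs(delta_rel) < 0.10:
        return "minor"
    if abs(delta_rel) < 0.30:
        return "moderate"
    return "severe"

print(classify_severity(1.0, 0.0, 1.0))  # 'severe': ci_low == 0.0 is not a straddle
```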
### Notes
- This release has no production users yet. All "real-world" validation references the project's own example scenarios, not external deployments. Claims about behaviour should be read as "what the code does," not "what teams have confirmed in prod."
## [0.1.0] - 2026-04-22

First tagged release. Ships the Rust core, Python SDK + CLI, bisection module, GitHub Action, end-to-end demo, and CI; see the per-phase sections below for specifics.
### Summary

- 13 commits across 7 phases.
- 164 tests total: 125 Rust (`cargo test -p shadow-core`) + 47 Python (`pytest python/tests`).
- Rust coverage: 97.63% line / 98.93% function on `shadow-core`.
- Python coverage: 88.07% across `shadow.*` packages.
- `cargo clippy --all-targets --all-features -- -D warnings` clean. `cargo fmt --check` clean.
- `mypy --strict` clean across 26 Python source files. `ruff check` + `ruff format --check` clean.
- `bash examples/demo/demo.sh` runs in 1.14 s on an M-series laptop (target ≤ 10 s).
### Phase 0, Scaffold

#### Decisions

- PyO3 build system: maturin (D1 in the plan). Using `abi3-py311` so one wheel supports Python 3.11+. Alternative considered: setuptools-rust, rejected because maturin's `develop` loop is the same-day ergonomics we want for TDD.
- Workspace-level Rust pinned to 1.83.0 (stable at planning time). Individual dep versions pinned exactly in `Cargo.toml` per the user's "no bleeding-edge churn" constraint.
- Clippy `unwrap_used` / `expect_used` / `panic` denied in non-test code via inner attributes in `lib.rs` (not workspace-wide) so tests can still use `unwrap()`.
- Python deps pinned to exact versions in `python/pyproject.toml`. Real LLM SDKs (`anthropic`, `openai`) are optional extras; Shadow can run without them via `MockLLM`.
- Cargo `crate-type = ["cdylib", "rlib"]` on shadow-core so the same crate builds both as a normal Rust library (for `cargo test`) and as a PyO3 extension (for `maturin develop`). The `python` feature gates the PyO3 module; non-Python consumers get a clean rlib.
- CLI entrypoint: `shadow = "shadow.cli.app:main"` (Python `console_script`), not a Rust binary. This keeps the CLI logic testable with `typer.testing.CliRunner` and lets the Rust core remain pure-library.
#### Dead ends

- (none yet, Phase 0 ran cleanly through file scaffolding)
#### Blockers surfaced and resolved

- Rust toolchain install: `cargo` / `rustc` were absent; a user-local rustup curl install was authorized post-Phase-1 per `<interaction_policy>` checkpoint 1, installed into `~/.cargo` / `~/.rustup` (no sudo, no brew, no system-wide changes).
- Rust toolchain version: initially pinned to `1.83.0` (stable at original-plan time, late 2024); the release drop to `v0.1.0` then bumped to Rust 1.95.0 (stable 2026-04-14) because the ecosystem had already moved past 1.83: `indexmap 2.14+`, `proptest 1.11`, and the latest `just` / `maturin` all require edition 2024 (Rust ≥ 1.85). The interim 1.83 workaround (direct-pin `indexmap = "=2.6.0"` so `serde_json`'s transitive dep couldn't resolve to edition 2024) was removed on the bump; Cargo.lock regenerated; all direct deps updated to the versions Cargo picks cleanly on 1.95.
- Companion tool versions: `just 1.50.0`, `maturin 1.13.1`, `cargo-llvm-cov 0.8.5`. All installed via `cargo install --locked`.
- PyO3 feature split: the original single `python` feature used `pyo3/extension-module`, which omits libpython link directives. That is correct for a maturin-built `.so` but makes `cargo test --features python` fail to link. Split into two features: `python` (pyo3 types available, libpython linked, `abi3-py311`) and `extension` (adds `extension-module`, used only by maturin). `cargo test --workspace` runs pure-Rust tests without pulling pyo3 at all; the PyO3 bindings are tested from Python via pytest after `maturin develop`.
- Rust 1.95 clippy tightening: the toolchain bump surfaced three new lints: `doc_overindented_list_items` (fixed by de-indenting continuation lines in the replay engine's doc comment), `useless_conversion` firing on PyO3's idiomatic `?`-in-`PyResult` patterns (addressed with a module-level `#![allow]` in `src/python.rs` with a comment explaining why), and a `clone` → `slice::from_ref` suggestion in a parser test.
### Phase 1, SPEC.md

#### Decisions

- Content-address payload only, not the envelope. `id = sha256(canonical_json(payload))` so two identical requests dedupe to the same blob. Envelope (`ts`, `parent`) is not hashed. Alternative considered: hash the whole envelope; rejected because it defeats dedup and makes MockLLM replay lookups harder (you'd need to reconstruct the envelope to look up a response).
- RFC 8785 (JCS) for canonical JSON, with two application clarifications (§5.2, Unicode NFC normalization on strings and keys; §5.4, no Decimal/NaN/Infinity). Picking an existing RFC instead of inventing our own rules means any JCS library is most of the way there; the NFC addition covers a gap in JCS where visually identical strings encoded differently would hash differently.
- Known-vector lives in §5.6 as a "Conformance test case" (moved on review from §6.2, which now points back to §5.6). The vector covers both canonicalization bytes and the resulting content id, so a fresh implementer can verify both at once.
- Known-vector hash pinned in §5.6: `{"hello":"world"}` → `sha256:93a23971a914e5eacbf0a8d25154cda309c3c1c72fbb9914d47c60f3cb681588`. Verified locally with `python3 -c 'import hashlib; print(hashlib.sha256(b"{\"hello\":\"world\"}").hexdigest())'`. Phase 2's `agentlog::hash` test suite pins this vector.
- One trace = one file. Concatenating two `.agentlog` files does NOT produce a valid `.agentlog`; a simpler invariant than allowing multi-trace files, and it forces the SQLite index to be the "set" abstraction.
- First record is always `metadata` with `parent: null`. A trace is identified by its root's content id, so we don't need a separate trace_id field in the envelope.
- Redaction before canonicalization. The hash reflects redacted content, not raw. Means a post-hoc audit can't trivially reconstruct the original; that's intentional.
- Streaming responses = one record with an optional `stream_timings` array, not a record per token. Per-token records would explode storage and make `shadow diff` much slower. Timing preserved; token stream content aggregated.
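The payload-only hashing plus the NFC clarification can be approximated in a few lines of Python. For this simple vector, `json.dumps` with sorted keys and compact separators matches JCS output, though a full RFC 8785 implementation also has number-serialization rules this sketch skips:

```python
import hashlib
import json
import unicodedata

def canonical_json(payload) -> bytes:
    """Sorted-key compact JSON with the §5.2 NFC clarification applied
    to every string and key (illustrative, not a full JCS implementation)."""
    def nfc(x):
        if isinstance(x, str):
            return unicodedata.normalize("NFC", x)
        if isinstance(x, dict):
            return {nfc(k): nfc(v) for k, v in x.items()}
        if isinstance(x, list):
            return [nfc(v) for v in x]
        return x
    return json.dumps(nfc(payload), sort_keys=True,
                      separators=(",", ":"), ensure_ascii=False).encode()

def content_id(payload) -> str:
    return "sha256:" + hashlib.sha256(canonical_json(payload)).hexdigest()

print(content_id({"hello": "world"}))
# sha256:93a23971a914e5eacbf0a8d25154cda309c3c1c72fbb9914d47c60f3cb681588
```

Reproducing the pinned known-vector is exactly the check a fresh implementer runs against §5.6.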
#### Dead ends

- (none yet, the spec came together in one pass)
### Phase 2, shadow-core (Rust)

Nine commits land the full Rust core. Final tree: 125 unit tests, `cargo clippy --all-features -- -D warnings` clean, `cargo fmt --check` clean, 97.63% line coverage on shadow-core (target ≥ 85%) measured via `cargo llvm-cov --workspace` (98.93% function coverage).
#### Decisions

- Payload-only hashing (SPEC §6.1). Record envelope (`ts`, `parent`) is not in the hash; only `canonical_json(payload)`. Two identical requests dedupe to the same id.
- Records hold payloads as `serde_json::Value`, not typed structs. Typed payload structs come later alongside the consumers that need them (Python SDK). Keeps the core storage path provider-agnostic.
- Atomic writes for the FS store via write-tmp + rename. A crash mid-write leaves a `.tmp` orphan, not a corrupt real trace.
- `include_str!` SQLite schema. Schema lives in `store/schema.sql`; bundled into the binary so there's no runtime file resolution and the schema can be versioned alongside the code that consumes it.
- PyO3 feature split: `python` vs `extension`. The base `python` feature pulls pyo3 with `abi3-py311` but NOT `extension-module`, so `cargo test --features python` links libpython normally. `extension` adds `extension-module` for maturin. This avoids the "cargo test link-fails because libpython isn't linked" footgun.
- `cargo test --workspace` runs NO features by default. PyO3 bindings are tested from Python after `maturin develop`. Keeps the Rust test loop zero-config (no need for python3.11 on PATH).
- Backend trait takes `&Value`, returns `Value`. Envelope ownership stays with the engine; backends are tiny adapters (a live Anthropic backend is ~30 lines). Matches SPEC §10's replay algorithm cleanly.
- Replay errors → `Error` records, not engine panics. A baseline with some missing responses still produces a complete candidate trace, with error counts in the `replay_summary`.
- Clock abstraction in `replay::engine` so tests can pin timestamps. Ships a `FixedClock` for fixtures; production uses the system clock (Phase 3).
- Semantic axis: hash-surrogate embedding in Rust. Clearly labelled test-only in the module docstring. The production embedding (`sentence-transformers/all-MiniLM-L6-v2`) is plugged in by the Python layer per CONTRIBUTING.md. This kept an ML dep out of the Rust crate.
- Axis nodes take `pairs: &[(&Record, &Record)]`. Pair extraction lives in `diff/mod.rs::extract_response_pairs`. Uneven counts (e.g. candidate missed some) truncate to the shorter side; callers that want to flag count-mismatch consult the `replay_summary` directly.
- Severity thresholds: `None` (CI crosses zero and delta is tiny), `Minor` (< 10% relative), `Moderate` (< 30%), `Severe` (≥ 30% or CI clearly excludes zero with large magnitude).
- Bootstrap defaults to 1000 iterations with a seeded RNG for reproducibility; callers can override both.
- Test-only `assert_eq!` / `unwrap()` allowed; clippy's unwrap_used/panic lints are denied only in non-test code via `#![cfg_attr(not(test), ...)]`.
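The write-tmp + rename pattern from the atomic-writes decision, shown in Python for brevity (the shipped store is Rust; `atomic_write` is a hypothetical helper name):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """A crash mid-write leaves only a .tmp orphan, never a corrupt trace."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())      # contents durable before they get the real name
        os.replace(tmp, path)         # atomic rename within one filesystem
    except BaseException:
        os.unlink(tmp)                # clean up the orphan on failure
        raise
```

The temp file must live in the same directory (hence `dir=directory`) so the final `os.replace` never crosses a filesystem boundary.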
#### Dead ends

- Attempted to write `assert_eq!` in `paired_ci` for the length-mismatch precondition; tripped clippy's `panic` lint, which applies to the `panic!()` expansions of `assert!` macros. Resolved with a narrow `#[allow(clippy::panic)]` block around an explicit `panic!()`; cleaner than restructuring to return a Result for a programmer-error guard.
- First pass at the `meta` omit-when-None test used `!wire.contains("meta")`, which accidentally matched the kind string `"metadata"`. Fix: match the quoted field name `"\"meta\""` instead. Small lesson: substring assertions on JSON are fragile; prefer explicit field-level checks.
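The fragile-substring lesson in two runnable lines (the wire string here is illustrative, not a real record):

```python
wire = '{"kind":"metadata","payload":{}}'   # serialized record with `meta` omitted

print("meta" in wire)      # True: bare substring also matches the kind "metadata"
print('"meta"' in wire)    # False: the quoted field name only appears if the field does
```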
### Phase 3, Python SDK + CLI

#### Decisions

- PyO3 bindings take/return dicts, not typed pyclass wrappers. Simpler surface for Python users and no second type system to maintain. The `serde_json::Value` ↔ `PyObject` conversion goes through the `pythonize` crate (pinned `=0.22.0` to match `pyo3 = 0.22.6`).
- Type stubs shipped in `python/src/shadow/_core.pyi` so `mypy --strict` users see the PyO3 surface without importing the compiled extension.
- Session is a manual recorder in v0.1. Monkey-patch-based auto-instrumentation of the `anthropic` / `openai` Python clients is deferred to v0.2: their streaming surfaces are too divergent to unify cleanly in a first cut, and forcing users into `record_chat(req, resp)` is a small enough overhead that it's not blocking adoption.
- Python-side `run_replay`. The Rust replay engine exists but isn't exposed through PyO3: the `LlmBackend` trait is async, and calling back into Python from a Rust trait object needs PyO3 ceremony that wasn't worth the code for v0.1. The Python replay mirrors SPEC §10 semantics exactly.
- CLI uses typer. Every subcommand has an end-to-end integration test via `typer.testing.CliRunner`. Machine-consumed JSON outputs go through `sys.stdout.write` (unstyled) to avoid Rich's ANSI escapes breaking `jq` pipelines.
- `PositionalMockLLM` added alongside `MockLLM`. Positional replay is the only sensible demo backend when baseline and reference traces were recorded with different configs (different request payloads → different content ids → strict `MockLLM` would miss every request). Clearly labelled "for demos and integration tests, not production."
#### Dead ends

- First pass at `cargo test --features python` failed to link on macOS because the `extension-module` pyo3 feature omits libpython link directives. Resolved by splitting the feature into `python` (abi3-py311 only, links libpython) and `extension` (adds extension-module, used by maturin). `cargo test` takes neither by default; PyO3 bindings are tested from Python after `maturin develop` builds the `.so`.
- Initial pass at the `meta` omission test used a substring check against `"meta"`, which accidentally matched `"metadata"` (the kind name). Fix: check for the quoted field name `"\"meta\""`.
### Phase 4, Bisection (LASSO + Plackett-Burman)

#### Decisions

- Per-axis LASSO with `alpha=0.01`; scikit-learn handles the coordinate descent. Normalization: `|coef| / sum(|coef|)` so each axis's attributions sum to 1 (or 0 when the axis is invariant across corners).
- Hadamard/Paley PB matrices tabulated for runs ∈ {8, 12, 16, 20, 24}. Runs > 24 error out in v0.1 (k ≤ 23). Full factorial capped at k=6 (64 runs). `choose_design` picks the right one automatically.
- v0.1 runner emits placeholder-zero divergence. Real per-corner replay scoring lands in v0.2 with live-LLM support. The plumbing (delta extraction, design, LASSO, attribution ranking) is correct and tested against a synthetic ground-truth recovery case (≥ 0.9 attribution to delta #2 on the trajectory axis; ≤ 0.05 on every other delta; zero attribution on every other axis, since there is no signal).
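The full-factorial branch of the design choice is just a ±1 product; a sketch under the k ≤ 6 cap described above (the PB tabulation is the part that doesn't fit in a snippet, and `full_factorial` is a hypothetical name):

```python
from itertools import product

FULL_FACTORIAL_MAX_K = 6   # 64 runs, per the cap above

def full_factorial(k: int) -> list[tuple[int, ...]]:
    """±1-coded 2^k design used for small k; PB designs cover larger k."""
    if k > FULL_FACTORIAL_MAX_K:
        raise ValueError(f"k={k} exceeds the full-factorial cap; use a PB design")
    return list(product((-1, 1), repeat=k))

print(len(full_factorial(3)))   # 8 corners
```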
### Phase 5, GitHub Action

#### Decisions

- Composite action, not a JavaScript action. No Node build pipeline; the logic lives in `action.yml` shell steps plus a stdlib-only `comment.py`.
- A hidden HTML marker lets subsequent runs update the existing PR comment in place. One comment per PR, not a running log.
- A step-summary write means even fork PRs (where posting a comment is blocked) still surface the diff in the GitHub Actions log view.
### Phase 6, Demo + README

#### Decisions

- Demo fixtures are committed. `examples/demo/fixtures/{baseline,candidate}.agentlog` are deterministic outputs of `generate_fixtures.py` (also committed). A fresh clone runs `just demo` in ≈1 s without touching a network. Regenerating the fixtures is reproducible.
- `PositionalMockLLM` is the demo backend. `MockLLM` (content-id lookup) wouldn't work because baseline/candidate differ in system-prompt wording, so their request payloads have different ids.
- README opens with "Why?" (per review), then a four-column competitive-landscape table (Langfuse / Braintrust / LangSmith / Shadow). Gives readers a reason to keep scrolling before they hit install instructions.
### Phase 7, CI + release

#### Decisions

- Matrix: Ubuntu + macOS × Python 3.11 + 3.12. Windows deferred (the SDK is POSIX-path biased; `.venv/bin/python` assumptions are Windows-unfriendly without more plumbing than the v0.1 budget allows).
- `taiki-e/install-action` for `cargo-llvm-cov` so CI doesn't pay the ~45 s compile-from-source cost on every run.
- Rust cache via `Swatinem/rust-cache` shaves ~5 min off matrix legs.
- Python tests gate at 85% coverage (`--cov-fail-under=85`). Current: 88.07%.
- Three jobs: `rust` (fmt/clippy/test/coverage), `python` (ruff, mypy, pytest + coverage), `demo` (end-to-end). `python` depends on `rust`, `demo` on `python`.
## Conventions

- Each phase boundary lands a `### Phase N, <title>` section.
- `#### Decisions` for design choices, with the alternative that was considered and why it was rejected.
- `#### Dead ends` for approaches that were tried and backed out of, with the reason.
- `#### Blockers surfaced` for things the user needs to act on.