Sandboxed deterministic replay¶
Shadow's diff and policy machinery answers "what changed?" The
sandboxed-replay engine answers "what would the candidate do, end-to-end,
without actually doing it?" The candidate's LLM picks tools, the engine
dispatches them, results feed back, the loop runs to completion. No real
network calls. No real database writes. No real charges. The output is an
ordinary .agentlog so every existing Shadow command (diff,
check-policy, mine, mcp-serve, bisect) reads it without changes.
This is the piece that turns "behavior diff CI" into "shadow deployment."
When to use it¶
- Confirming a bisect attribution. shadow bisect says "the model swap is 78% of the latency regression, estimated." The hedged language is honest because attribution is correlational. To prove it, swap one variable at a time with replace_tool_result / replace_tool_args / branch_at_turn and re-diff.
- Pre-merge sanity check on agent-loop changes. Your PR changes the system prompt; the candidate might pick different tools. shadow replay --agent-loop drives the loop forward against the candidate's prompt and produces a candidate trace you can diff against baseline.
- Reproducing production-only bugs in CI. A candidate trace where the user's real tools run under sandbox: same code paths, same failure modes, no real side effects.
Three backends¶
| Backend | What it does | When to pick |
|---|---|---|
| ReplayToolBackend | Indexes baseline tool_result records by (tool_name, canonical_args_hash). Returns recorded results. | Default. Deterministic, free, no API keys. |
| SandboxedToolBackend | Wraps your real tool functions; blocks network / subprocess / fs writes. | When you care whether the candidate's new tool calls succeed. |
| StubToolBackend | Returns a deterministic placeholder for every call. | Tests, sanity checks. |
All three implement the same ToolBackend protocol:
class ToolBackend(Protocol):
async def execute(self, call: ToolCall) -> ToolResult: ...
@property
def id(self) -> str: ...
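Because the protocol is structural, any object with those two members works. A minimal in-memory backend, with stand-in ToolCall / ToolResult dataclasses defined locally for illustration (Shadow's real record types carry more fields, such as ids and error flags):

```python
import asyncio
from dataclasses import dataclass

# Stand-ins for illustration; not Shadow's actual record types.
@dataclass
class ToolCall:
    tool_name: str
    arguments: dict

@dataclass
class ToolResult:
    output: str

class CannedToolBackend:
    """A ToolBackend that answers every call from a fixed table."""

    def __init__(self, canned: dict[str, str]):
        self._canned = canned

    @property
    def id(self) -> str:
        return "canned"

    async def execute(self, call: ToolCall) -> ToolResult:
        output = self._canned.get(call.tool_name, "<no canned result>")
        return ToolResult(output=output)

backend = CannedToolBackend({"search_kb": '["doc-1", "doc-2"]'})
result = asyncio.run(backend.execute(ToolCall("search_kb", {"query": "python"})))
# result.output == '["doc-1", "doc-2"]'
```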
Novel-call policies¶
When the candidate calls a tool the baseline never recorded,
ReplayToolBackend consults a NovelCallPolicy. Four ship:
- StrictPolicy — raises ShadowBackendError. The default for CI: a tool the baseline didn't record is a behavioral regression by definition.
- StubPolicy — returns a placeholder result. Lets the loop continue so you can see what the candidate would do given a stub.
- FuzzyMatchPolicy — finds the nearest same-tool baseline call by Jaccard distance over arg keys. Useful when arg values shifted but intent is the same.
- DelegatePolicy — calls a user-supplied async function. Use this to bridge to a sandboxed real backend or your own mocking server.
Implementing your own is a single async method.
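FuzzyMatchPolicy's metric is simple to see concretely. A sketch of nearest-call selection by Jaccard distance over arg keys, as described above (the selection core only; Shadow's actual thresholds and tie-breaking may differ):

```python
def jaccard_distance(a: set[str], b: set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; 0.0 means identical key sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def nearest_baseline_call(novel_args: dict, baseline_calls: list[dict]) -> dict:
    """Pick the recorded same-tool call whose arg keys are closest."""
    return min(
        baseline_calls,
        key=lambda recorded: jaccard_distance(set(novel_args), set(recorded)),
    )

recorded = [
    {"query": "python"},
    {"query": "python", "limit": 10},
    {"path": "/docs", "recursive": True},
]
# A novel call with keys {query, limit} lands on the recorded call
# that shares both keys, even though the arg values differ.
best = nearest_baseline_call({"query": "rust", "limit": 25}, recorded)
assert set(best) == {"query", "limit"}
```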
CLI¶
shadow replay candidate.yaml \
--baseline baseline.agentlog \
--agent-loop \
--tool-backend replay \
--novel-tool-policy stub \
--max-turns 32 \
--output candidate.agentlog
The output is a normal .agentlog. Pipe it into shadow diff
baseline.agentlog candidate.agentlog like any other trace.
Programmatic API¶
The CLI flag --tool-backend sandbox deliberately errors with
guidance, because the sandbox needs your tool function
implementations and those don't fit on a CLI flag. From Python:
from datetime import UTC, datetime

from shadow.replay_loop import run_agent_loop_replay
from shadow.tools.sandbox import SandboxedToolBackend
from shadow.llm import MockLLM
from shadow import _core

with open("baseline.agentlog", "rb") as f:
    baseline = _core.parse_agentlog(f.read())
async def search_kb(args: dict) -> str:
"""Your real tool function. Sandbox patches socket / subprocess /
fs writes during execution."""
return await my_kb_client.search(args["query"])
sandbox = SandboxedToolBackend(
tool_registry={"search_kb": search_kb},
block_network=True,
block_subprocess=True,
freeze_time=datetime(2026, 4, 25, tzinfo=UTC),
)
trace, summary = await run_agent_loop_replay(
baseline,
llm_backend=MockLLM.from_trace(baseline),
tool_backend=sandbox,
)
Counterfactual primitives¶
For surgical "what-if" questions, three helpers in
shadow.counterfactual_loop:
from shadow.counterfactual_loop import (
branch_at_turn,
replace_tool_args,
replace_tool_result,
)
# What would the agent do if this tool had returned X?
result = await replace_tool_result(
baseline,
tool_call_id="call_abc",
new_output="<no results found>",
)
# What if it had been called with these args instead?
result = await replace_tool_args(
baseline,
tool_call_id="call_abc",
new_arguments={"query": "python", "limit": 25},
)
# Pin turns 0..N from baseline, drive the loop forward from N+1.
result = await branch_at_turn(
baseline,
turn=5,
llm_backend=live_llm,
tool_backend=sandbox,
)
Each returns a CounterfactualLoopResult carrying the new trace,
agent-loop summary, and an override dict describing the
substitution so renderers can label the comparison "baseline vs.
counterfactual where ...".
What's in the sandbox¶
SandboxedToolBackend is best-effort isolation for replay
determinism, not a security boundary. Patches:
- Network: socket.connect, socket.connect_ex (covers httpx / requests / aiohttp / http.client transitively).
- Subprocess: subprocess.Popen, subprocess.run, os.system, os.execvp.
- Filesystem writes: opens with a mode containing w, a, +, or x are redirected into a tempdir. Read-mode opens pass through so config files keep working.
- Time (opt-in): time.time + datetime.utcnow pinned to a fixed instant.
A blocked operation surfaces as a tool_result record with
is_error=True and an actionable message. Patches restore after
each call so the surrounding test runner isn't affected.
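The patch-call-restore pattern is easy to demonstrate in isolation. A toy version of the network guard, monkeypatching socket.connect and restoring it afterward (Shadow's real sandbox covers more entry points and converts the failure into an is_error tool_result rather than letting the exception escape):

```python
import socket
from contextlib import contextmanager

class SandboxViolation(RuntimeError):
    """Raised in place of performing a real network connect."""

@contextmanager
def block_network():
    """Patch socket connects for the duration of one tool call."""
    original_connect = socket.socket.connect

    def guarded_connect(self, address):
        raise SandboxViolation(f"network blocked: connect to {address!r}")

    socket.socket.connect = guarded_connect
    try:
        yield
    finally:
        # Restore so the surrounding test runner is unaffected.
        socket.socket.connect = original_connect

with block_network():
    s = socket.socket()
    try:
        s.connect(("127.0.0.1", 9))  # would normally attempt a real connect
    except SandboxViolation as exc:
        print(exc)  # network blocked: connect to ('127.0.0.1', 9)
    finally:
        s.close()
# Outside the context manager, socket.socket.connect is the original.
```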
Determinism guarantees¶
- Same baseline + same backends + same config → byte-identical output. Shadow's content-addressed envelopes don't carry wall-clock timestamps in the hash, so re-runs match.
- The replay engine never mutates its inputs.
- A max_turns cap (default 32) bounds runaway agents — when exceeded, the engine emits an error record with code=loop_max_exceeded and stops the session cleanly.
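The cap behavior reduces to a small pattern, sketched here as a toy (the record shapes and field names are illustrative, not the engine's actual schema):

```python
def run_capped(turns, max_turns: int = 32) -> list[dict]:
    """Append turns until the cap, then emit one error record and stop."""
    records: list[dict] = []
    for i, turn in enumerate(turns):
        if i >= max_turns:
            # Bounded termination: the session ends cleanly with a
            # machine-readable reason instead of looping forever.
            records.append({"type": "error", "code": "loop_max_exceeded"})
            break
        records.append({"type": "turn", "payload": turn})
    return records

records = run_capped(range(100), max_turns=32)
# 32 turn records followed by exactly one error record.
assert len(records) == 33
assert records[-1]["code"] == "loop_max_exceeded"
```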
Limitations¶
- The agent-loop engine is Python-only. The Rust core handles the classic copy-through replay; the agent-driving variant lives in Python because user tool functions live in Python.
- SandboxedToolBackend patches the ambient runtime, not the tool function source. Pure-Python computation, in-memory state, and arbitrary recursion all still work — those are the agent's business.
- Streaming responses aren't yet driven through the agent-loop engine; the existing classic run_replay handles streaming correctly. v1.7 will unify.