Introducing Rewind: Chrome DevTools for AI Agents
I spent two months debugging a 40-step booking agent. Every time it failed, I’d change the prompt, re-run the whole thing, wait 3 minutes, and check the output. Different result each time because LLMs are non-deterministic. I couldn’t even tell if my fix worked.
That’s the state of AI agent debugging in 2026. We have tracing tools that show you what happened. We have eval frameworks that score outputs. Nobody built the thing in between: a debugger that lets you go back to the failure, change one thing, and re-run only that part.
So I built it.
What Rewind does
Rewind records every LLM call your agent makes. When something breaks, you fork the timeline at the failure point, fix your code, and replay. Steps before the fork are served from cache (0 tokens, 0 API calls, instant). Only the fixed step hits the LLM.
pip install rewind-agent
import rewind_agent
rewind_agent.init()
# your agent runs normally
# every LLM call is recorded automatically
That’s the whole setup: an import and one call. After your agent runs, you have a full recording.
🔍 See what the model saw
The first thing you want when an agent fails: what was in the context window at each step?
rewind show latest
⏪ Rewind — Session Trace
Session: booking-agent Steps: 12 Tokens: 274,200
Agents: supervisor, researcher, writer
▼ ✓ 🤖 supervisor (agent) 1.2s
├ ✓ 🧠 gpt-4o "Route to researcher" 320ms 4,120↓ 840↑
▼ ✓ 🤖 researcher (agent) 2.1s
│ ├ ✓ 🧠 gpt-4o "Search for information" 890ms 8,340↓ 2,100↑
│ ├ ✓ 🔧 web_search("Tokyo population") 45ms
│ └ ✓ 🧠 gpt-4o "Synthesize results" 650ms 6,280↓ 1,950↑
├ ✓ 🔀 handoff: researcher → writer
▼ ✗ 🤖 writer (agent) 1.8s
│ ├ ✓ 🧠 gpt-4o "Draft article" 1200ms 9,450↓ 3,180↑
│ └ ✗ 🧠 gpt-4o "Polish final draft" 600ms 7,320↓ 2,120↑
│ ERROR: Hallucination — used stale data
└ ✓ 🧠 gpt-4o "Final review" 400ms 5,200↓ 1,450↑
That’s a span tree. Agent boundaries, tool calls, handoffs, token counts at each step. The writer agent hallucinated at step 8 because the researcher passed stale data. Without the tree, you’d see a flat list of 12 steps with no structure.

⏪ Fork at the failure, replay with the fix
This is the core idea. Fix your code, then:
rewind replay latest --from 7
Steps 1-6 are served from cache. Zero tokens, zero API calls. Only step 7+ re-runs live with your updated code. You get a new timeline (“fixed”) alongside the original (“main”).
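The mechanics are easy to reason about: a replay walks the recorded steps in order, serving cached outputs before the fork point and calling the model live from the fork onward. A minimal conceptual sketch of that loop (illustrative names, not Rewind’s actual internals):

```python
# Conceptual sketch of fork-and-replay: steps before the fork are served
# from the recording; steps at/after it re-run live. Illustrative only --
# this is not Rewind's real implementation.

def replay(recording: list, fork_from: int, live_call):
    """recording: cached step outputs (step 1 at index 0).
    fork_from: first step to re-run live. live_call(step) hits the LLM."""
    timeline = []
    for step in range(1, len(recording) + 1):
        if step < fork_from:
            timeline.append(recording[step - 1])  # cache hit: 0 tokens, 0 API calls
        else:
            timeline.append(live_call(step))      # live re-run with the fixed code
    return timeline

# A 3-step recording, forked at step 3: only step 3 hits the (stubbed) model.
cached = ["route", "search", "stale draft"]
fixed = replay(cached, fork_from=3, live_call=lambda s: f"fresh step {s}")
# fixed == ["route", "search", "fresh step 3"]
```

The cache key in practice has to cover more than the step index (model, messages, tool results), but the shape of the loop is the same.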
rewind diff latest main fixed
⏪ Rewind — Timeline Diff (main vs fixed, diverge at step 7)
═ Step 1 identical
═ Step 2 identical
═ Step 3 identical
═ Step 4 identical
═ Step 5 identical
═ Step 6 identical
≠ Step 7 [stale data] → [fresh data]
≠ Step 8 [error] 7320tok → [success] 7540tok
You can see exactly where the timelines diverge and why. No guessing, no “let me re-run the whole thing and hope.”
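Conceptually, the diff is a step-by-step walk over both timelines, marking each step identical or divergent. A hedged sketch of the idea (not the actual rewind diff code):

```python
# Illustrative timeline diff: compare two runs step by step and report
# where they diverge. Names and output format are made up for the sketch.
def diff_timelines(main: list, fixed: list):
    rows = []
    for step, (a, b) in enumerate(zip(main, fixed), start=1):
        rows.append((step, "identical" if a == b else f"[{a}] -> [{b}]"))
    return rows

main  = ["plan", "search", "stale data", "error"]
fixed = ["plan", "search", "fresh data", "success"]
for step, status in diff_timelines(main, fixed):
    print(step, status)
```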
✅ Prove the fix with LLM-as-judge
Changed the prompt? Score both timelines automatically:
rewind eval score latest -e correctness --compare-timelines
⏪ Rewind — Timeline Scores
Timeline correctness avg
──────────── ─────────── ──────
main 0.200 0.200
fixed 0.950 0.950
Delta (fixed vs main): +0.75 avg ↑
Original: 0.2 on correctness. Fixed: 0.95. Not you guessing. An LLM evaluator comparing the output against expected results.
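An LLM-as-judge evaluator boils down to prompting a model to grade an output against a reference and parsing a numeric score. A sketch of that contract, with the model call stubbed out (prompt wording and names are illustrative, not Rewind’s):

```python
# Sketch of an LLM-as-judge evaluator: ask a model to grade OUTPUT
# against EXPECTED and parse a 0-1 score. The model call is stubbed;
# the prompt and names are illustrative, not Rewind's actual evaluator.
JUDGE_PROMPT = (
    "Score the OUTPUT against the EXPECTED answer for {criteria}.\n"
    "Reply with only a number between 0 and 1.\n"
    "EXPECTED: {expected}\nOUTPUT: {output}"
)

def judge(output: str, expected: str, criteria: str, call_model) -> float:
    reply = call_model(JUDGE_PROMPT.format(
        criteria=criteria, expected=expected, output=output))
    # Clamp so a malformed reply can't produce an out-of-range score.
    return max(0.0, min(1.0, float(reply.strip())))

# Stub model standing in for a real LLM call.
score = judge("Tokyo: 37.4M", "Tokyo metro population ~37M",
              "correctness", call_model=lambda prompt: "0.95")
```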
Set up regression baselines for CI:
result = rewind_agent.evaluate(
    dataset="booking-tests",
    target_fn=my_agent,
    evaluators=[
        exact_match,
        rewind_agent.llm_judge_evaluator(criteria="correctness"),
    ],
    fail_below=0.9,
)
CI fails if the score drops below 90%. Ship with evidence.
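The gate itself is simple: average the per-case scores and fail the build below the threshold. A minimal sketch of that behavior (a generic stand-in, not the actual fail_below implementation):

```python
# Generic regression gate of the kind fail_below describes: average the
# per-case evaluator scores and pass only at or above the threshold.
def gate(scores: list[float], fail_below: float) -> bool:
    avg = sum(scores) / len(scores)
    return avg >= fail_below

assert gate([0.95, 0.92, 0.98], fail_below=0.9)
assert not gate([0.95, 0.40, 0.98], fail_below=0.9)
```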
📤 Share debug sessions
rewind share latest --include-content -o debug-session.html
Generates a self-contained HTML file. Open it in any browser, no install needed. The full trace, both timelines, the diff, the scores. Drop it in Slack. Your team sees exactly what broke and the proof that it’s fixed.

🔌 Works with your existing stack
Rewind isn’t a replacement for your observability tools. It works alongside them.
Already using Langfuse? Import a broken production trace with one command:
rewind import from-langfuse --trace <id>
Fork it, replay with the fix, export the fixed trace back:
rewind export otel latest --endpoint https://cloud.langfuse.com
Same for LangSmith, Datadog, Grafana Tempo, Jaeger, or anything that speaks OpenTelemetry.
Providers: OpenAI, Anthropic, AWS Bedrock, any OpenAI-compatible API (Ollama, vLLM, LiteLLM).
Frameworks: OpenAI Agents SDK (native), Pydantic AI (native), LangGraph, CrewAI, Autogen, smolagents, or custom code.
🏗️ How it’s built
Single Rust binary. No containers, no database servers, no config files. Everything (CLI, recording proxy, web dashboard, MCP server) is embedded in one executable. The Python SDK auto-downloads it on first use.
Recording works two ways:
- Direct mode (Python): rewind_agent.init() monkey-patches the OpenAI and Anthropic SDKs in-process. Zero latency overhead.
- Proxy mode (any language): rewind record --upstream https://api.openai.com. Point your agent’s base URL at the proxy. Streaming pass-through, no added latency.
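Direct mode’s monkey-patching amounts to wrapping an SDK method so every call is appended to an in-process recording before the result is returned. A conceptual sketch with a stand-in client instead of a real SDK (illustrative only, not Rewind’s actual patching code):

```python
# Conceptual sketch of direct-mode recording: replace a method with a
# wrapper that logs each call's result. Illustrative, not Rewind's code.
recording = []

def record_calls(obj, method_name):
    original = getattr(obj, method_name)
    def wrapped(*args, **kwargs):
        result = original(*args, **kwargs)
        recording.append({"method": method_name, "result": result})
        return result
    setattr(obj, method_name, wrapped)

# Stand-in for a real LLM client.
class FakeClient:
    def create(self, prompt):
        return f"completion for: {prompt}"

client = FakeClient()
record_calls(client, "create")
client.create("hello")  # recorded transparently; caller sees no change
```

The real patch additionally has to capture request payloads, streaming chunks, and token counts, but the wrap-and-forward shape is the same.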
The web dashboard runs at http://127.0.0.1:4800 with live updates via WebSocket. Sessions appear as they’re recorded.
29 CLI commands. 26 MCP tools for AI assistants (Claude Code, Cursor, Windsurf) to query recordings from your IDE. MIT licensed.
Try it
pip install rewind-agent
rewind demo && rewind inspect latest
No API keys needed. rewind demo seeds a sample multi-agent session with a hallucination at step 5. Try show, replay, diff, eval score, and share on it.
Or add one line to your own agent:
import rewind_agent
rewind_agent.init()
# run your agent, then: rewind show latest
- GitHub: agentoptics/rewind
- PyPI: rewind-agent
- Docs: Getting started
- Website: agentoptics.dev
We’re building Rewind in the open. If you’re debugging AI agents and want to talk about what’s broken, open a discussion.