
Introducing Rewind: Chrome DevTools for AI Agents

AgentOptics

I spent two months debugging a 40-step booking agent. Every time it failed, I’d change the prompt, re-run the whole thing, wait 3 minutes, and check the output. Different result each time because LLMs are non-deterministic. I couldn’t even tell if my fix worked.

That’s the state of AI agent debugging in 2026. We have tracing tools that show you what happened. We have eval frameworks that score outputs. Nobody built the thing in between: a debugger that lets you go back to the failure, change one thing, and re-run only that part.

So I built it.

What Rewind does

Rewind records every LLM call your agent makes. When something breaks, you fork the timeline at the failure point, fix your code, and replay. Steps before the fork are served from cache (0 tokens, 0 API calls, instant). Only the fixed step hits the LLM.

pip install rewind-agent

import rewind_agent
rewind_agent.init()

# your agent runs normally
# every LLM call is recorded automatically

That’s the setup. One line. After your agent runs, you have a full recording.

🔍 See what the model saw

The first thing you want when an agent fails: what was in the context window at each step?

rewind show latest
⏪ Rewind — Session Trace

  Session: booking-agent   Steps: 12   Tokens: 274,200
  Agents: supervisor, researcher, writer

  ▼ ✓ 🤖 supervisor (agent)                          1.2s
    ├ ✓ 🧠  gpt-4o  "Route to researcher"           320ms  4,120↓ 840↑
    ▼ ✓ 🤖 researcher (agent)                        2.1s
    │ ├ ✓ 🧠  gpt-4o  "Search for information"      890ms  8,340↓ 2,100↑
    │ ├ ✓ 🔧  web_search("Tokyo population")          45ms
    │ └ ✓ 🧠  gpt-4o  "Synthesize results"          650ms  6,280↓ 1,950↑
    ├ ✓ 🔀 handoff: researcher → writer
    ▼ ✗ 🤖 writer (agent)                            1.8s
    │ ├ ✓ 🧠  gpt-4o  "Draft article"              1200ms  9,450↓ 3,180↑
    │ └ ✗ 🧠  gpt-4o  "Polish final draft"          600ms  7,320↓ 2,120↑
    │     ERROR: Hallucination — used stale data
    └ ✓ 🧠  gpt-4o  "Final review"                   400ms  5,200↓ 1,450↑

That’s a span tree. Agent boundaries, tool calls, handoffs, token counts at each step. The writer agent hallucinated at step 8 because the researcher passed stale data. Without the tree, you’d see a flat list of 12 steps with no structure.

Rewind web dashboard showing session trace with span tree

⏪ Fork at the failure, replay with the fix

This is the core idea. Fix your code, then:

rewind replay latest --from 7

Steps 1-6 are served from cache. Zero tokens, zero API calls. Only step 7+ re-runs live with your updated code. You get a new timeline (“fixed”) alongside the original (“main”).

rewind diff latest main fixed
⏪ Rewind — Timeline Diff (main vs fixed, diverge at step 7)

  ═ Step  1  identical
  ═ Step  2  identical
  ═ Step  3  identical
  ═ Step  4  identical
  ═ Step  5  identical
  ═ Step  6  identical
  ≠ Step  7  [stale data]  →  [fresh data]
  ≠ Step  8  [error] 7320tok   →  [success] 7540tok

You can see exactly where the timelines diverge and why. No guessing, no “let me re-run the whole thing and hope.”
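The replay-from-cache idea is simple enough to sketch. This toy (not Rewind's actual internals) shows why steps before the fork cost zero tokens: recorded responses are served from a cache, and only steps at or after the fork point call the model.

```python
# Toy illustration of fork-and-replay; NOT Rewind's internals.
calls_to_llm = 0

def live_llm(prompt):
    """Stand-in for a real LLM call."""
    global calls_to_llm
    calls_to_llm += 1
    return f"response to: {prompt}"

# First run: record every step's response.
prompts = [f"step {i}" for i in range(1, 9)]
recording = {i: live_llm(p) for i, p in enumerate(prompts, start=1)}

def replay(recording, prompts, fork_at):
    """Serve steps before fork_at from cache; re-run the rest live."""
    out = {}
    for i, p in enumerate(prompts, start=1):
        out[i] = recording[i] if i < fork_at else live_llm(p)
    return out

calls_to_llm = 0
fixed = replay(recording, prompts, fork_at=7)
# Only steps 7 and 8 hit the "LLM": calls_to_llm == 2
```

Swap in your fixed code for the live steps and you get the "fixed" timeline the diff compares against "main".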

✅ Prove the fix with LLM-as-judge

Changed the prompt? Score both timelines automatically:

rewind eval score latest -e correctness --compare-timelines
⏪ Rewind — Timeline Scores

  Timeline      correctness     avg
  ────────────  ───────────  ──────
  main                0.200   0.200
  fixed               0.950   0.950

  Delta (fixed vs main): +0.75 avg  ↑

Original: 0.2 on correctness. Fixed: 0.95. Not you guessing. An LLM evaluator comparing the output against expected results.

Set up regression baselines for CI:

result = rewind_agent.evaluate(
    dataset="booking-tests",
    target_fn=my_agent,
    evaluators=[
        exact_match,  # your own evaluator function
        rewind_agent.llm_judge_evaluator(criteria="correctness"),
    ],
    fail_below=0.9,
)
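The `exact_match` evaluator above is a function you define yourself. Its exact signature in Rewind isn't shown in this post, but assuming an evaluator takes the produced output and the expected value and returns a score in [0, 1], a minimal sketch might be:

```python
def exact_match(output: str, expected: str) -> float:
    """Hypothetical evaluator: 1.0 on an exact match, ignoring
    surrounding whitespace; 0.0 otherwise."""
    return 1.0 if output.strip() == expected.strip() else 0.0
```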

CI fails if the score drops below 90%. Ship with evidence.

📤 Share debug sessions

rewind share latest --include-content -o debug-session.html

Generates a self-contained HTML file. Open it in any browser, no install needed. The full trace, both timelines, the diff, the scores. Drop it in Slack. Your team sees exactly what broke and the proof that it’s fixed.

Rewind HTML share viewer showing timelines and fork

🔌 Works with your existing stack

Rewind isn’t a replacement for your observability tools. It works alongside them.

Already using Langfuse? Import a broken production trace with one command:

rewind import from-langfuse --trace <id>

Fork it, replay with the fix, export the fixed trace back:

rewind export otel latest --endpoint https://cloud.langfuse.com

Same for LangSmith, Datadog, Grafana Tempo, Jaeger, or anything that speaks OpenTelemetry.

Providers: OpenAI, Anthropic, AWS Bedrock, any OpenAI-compatible API (Ollama, vLLM, LiteLLM).

Frameworks: OpenAI Agents SDK (native), Pydantic AI (native), LangGraph, CrewAI, Autogen, smolagents, or custom code.

🏗️ How it’s built

Single Rust binary. No containers, no database servers, no config files. Everything (CLI, recording proxy, web dashboard, MCP server) is embedded in one executable. The Python SDK auto-downloads it on first use.

Recording works two ways:

  1. Direct mode (Python): rewind_agent.init() monkey-patches the OpenAI and Anthropic SDKs in-process. Zero latency overhead.
  2. Proxy mode (any language): rewind record --upstream https://api.openai.com. Point your agent’s base URL at the proxy. Streaming pass-through, no added latency.
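For proxy mode with the OpenAI Python SDK, no code changes are needed: the SDK honors the standard OPENAI_BASE_URL environment variable. The address below is an assumption; use whatever `rewind record` prints when it starts.

```python
import os

# Route the OpenAI SDK through the local Rewind recording proxy.
# The address here is an assumption -- use the one `rewind record` reports.
os.environ["OPENAI_BASE_URL"] = "http://127.0.0.1:4800/v1"

# From here, run your agent unchanged; every request passes through
# the proxy and is recorded, with streaming passed straight through.
```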

The web dashboard runs at http://127.0.0.1:4800 with live updates via WebSocket. Sessions appear as they’re recorded.

29 CLI commands. 26 MCP tools for AI assistants (Claude Code, Cursor, Windsurf) to query recordings from your IDE. MIT licensed.

Try it

pip install rewind-agent
rewind demo && rewind inspect latest

No API keys needed. rewind demo seeds a sample multi-agent session with a hallucination at step 5. Try show, replay, diff, eval score, share on it.

Or add one line to your own agent:

import rewind_agent
rewind_agent.init()
# run your agent, then: rewind show latest

We’re building Rewind in the open. If you’re debugging AI agents and want to talk about what’s broken, open a discussion.