I use Langfuse for tracing. Here's why I added Rewind for debugging.
Last week my research agent failed at step 15 of a 30-step run. Langfuse showed me exactly where it broke. The writer sub-agent hallucinated, citing a stale 2019 population figure as current fact. Clean trace, obvious failure.
Now what?

I changed the system prompt. Re-ran the agent. $1.20 in tokens. 3 minutes of wall time. Different answer, still wrong, different hallucination. Re-ran again. $1.20 more. Another answer. By the fifth attempt I’d spent $6 and 15 minutes, and I still wasn’t sure the fix was right because every run gave a different output.
Langfuse is great at showing you what happened. It can’t let you change what happened and observe a different outcome.
So I built a tool that does.
🔍 The problem: observability ≠ debugging
Every LLM observability tool (Langfuse, LangSmith, Helicone) solves the same problem: “What did my agent do?” They capture traces, show you token counts, latencies, and the content of each step. That’s valuable.
But when something breaks at step 15 of a 30-step agent, you’re stuck:
- You can’t isolate the failure. To test a fix, you re-run all 30 steps. Steps 1-14 were fine. You’re paying for them again.
- You can’t reproduce it. LLMs are non-deterministic. Re-run the same agent and you get a different result. The bug might not even appear.
- You can’t prove your fix works. You changed the prompt. Did it actually fix the hallucination, or just shift the problem to a different step?
I needed something that lets me fork at step 14, replay only the broken part, and prove the fix works. So I built Rewind.
📥 Step 1: Import the failing trace from Langfuse
I see the broken trace in Langfuse’s UI. Copy the trace ID. One command:
```bash
rewind import from-langfuse --trace abc123
```
Rewind calls the Langfuse REST API, fetches the trace with all observations, converts them to its internal format, and gives me a browsable session.
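Under the hood, the import is conceptually two steps: fetch the trace (with its observations) from Langfuse's public REST API, then map each observation onto a replayable step. Here is a minimal sketch of the idea, assuming the default cloud host and Basic auth with a public/secret key pair; `to_step` and its field mapping are my illustration, not Rewind's actual internals:

```python
import base64
import json
import urllib.request

LANGFUSE_HOST = "https://cloud.langfuse.com"  # assumption: default cloud host

def fetch_trace(trace_id: str, public_key: str, secret_key: str) -> dict:
    """Fetch a trace (with observations) from the Langfuse public REST API."""
    auth = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    req = urllib.request.Request(
        f"{LANGFUSE_HOST}/api/public/traces/{trace_id}",
        headers={"Authorization": f"Basic {auth}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def to_step(obs: dict) -> dict:
    """Map one Langfuse observation onto a replayable step (illustrative schema)."""
    return {
        "name": obs.get("name"),
        "kind": obs.get("type", "SPAN").lower(),  # GENERATION / SPAN / EVENT
        "input": obs.get("input"),
        "output": obs.get("output"),
        "error": obs.get("statusMessage"),
        "usage": obs.get("usage") or {},
    }
```

The key point is that the recorded inputs and outputs survive the conversion, which is what makes caching and replay possible later.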
```bash
rewind show latest
```

```
⏪ Rewind — Session Trace
Session: research-agent
Steps: 30    Tokens: 274,200
Agents: supervisor, researcher, writer

▼ ✗ 🤖 supervisor (agent)              12.4s
  ▼ ✓ 🤖 researcher (agent)             5.2s
    ├ ✓ 🧠 gpt-4o    1.2s   6,420↓  1,840↑
    ├ ✓ 🧠 gpt-4o    2.1s   9,380↓  2,610↑
    ▼ ✓ 🔧 web_search (tool)            45ms
  ▼ ✗ 🤖 writer (agent)                 4.1s
    └ ✗ 🧠 gpt-4o    3.4s  12,100↓  4,250↑
      │ ERROR: Hallucination — used stale 2019 data as current fact
```
Same data as Langfuse, but now it’s in a system that can act on it.

⏪ Step 2: Fork at the failure, replay with the fix
I fix the system prompt in my code (added an instruction to cross-reference dates in search results), then replay from just before the failure:
```bash
rewind replay latest --from 14
```
- Steps 1-13: served from cache. 0 tokens. 0 API calls. Instant.
- Step 14+: live LLM calls with my updated code.
The agent runs the writer sub-agent again with the fix. The first 13 steps don’t re-run at all. They’re replayed from the recorded data.
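The replay loop is easy to picture: before the fork point, every step's output comes straight from the recorded trace; from the fork point on, steps execute live against the updated code. A sketch of the concept (my simplification, not Rewind's source):

```python
def replay(steps, fork_at, run_live):
    """Replay a recorded session: steps before fork_at are served from the
    recorded cache, steps from fork_at onward run live (illustrative sketch)."""
    outputs, tokens_saved = [], 0
    for step in steps:
        if step["index"] < fork_at:
            outputs.append(step["recorded_output"])  # cache hit: 0 tokens, 0 API calls
            tokens_saved += step["tokens"]
        else:
            outputs.append(run_live(step))  # live LLM call with the fix applied
    return outputs, tokens_saved
```

This is also why the cached steps are deterministic: they are not re-executed at all, just read back.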
```
⏪ Replay Savings
Steps:        13/30 cached (served from fork cache)
Tokens saved: 118,750
Cost saved:   $0.52
Time saved:   1m 50s
```
One replay instead of a full re-run. If the fix isn’t right, I replay again. Each time I only pay for the steps after the fork point.
✅ Step 3: Prove the fix works with LLM-as-judge
Instead of eyeballing the output, I score both timelines automatically:
```bash
rewind eval score latest -e correctness --compare-timelines
```

```
⏪ Rewind — Timeline Scores

Timeline      correctness   avg
────────────  ───────────   ──────
main          0.200         0.200
fixed         0.950         0.950

Delta (fixed vs main): +0.75 avg ↑
```
Original timeline scores 0.2 on correctness. Fixed timeline scores 0.95. Not me guessing. An evaluator comparing the output against the expected result.
I can set up regression baselines too. Next time I change the prompt, I run `rewind assert check latest --against my-baseline` and get a pass/fail across 37 assertions. No manual review needed.
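A baseline check like that boils down to comparing the current scores against stored expectations and flagging regressions. Roughly (illustrative only; the function name and tolerance are my assumptions, not Rewind's API):

```python
def check_against_baseline(scores: dict, baseline: dict, tolerance: float = 0.05) -> list:
    """Compare current eval scores to a stored baseline.

    Returns a list of (metric, expected, actual) tuples for every metric that
    is missing or dropped more than `tolerance` below its baseline value.
    """
    failures = []
    for metric, expected in baseline.items():
        actual = scores.get(metric)
        if actual is None or actual < expected - tolerance:
            failures.append((metric, expected, actual))
    return failures
```

An empty list means the run passes; anything else is a regression worth a look before shipping the prompt change.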
📤 Step 4: Share the debug session
Once the fix is proven, I share the session with my team:
```bash
rewind share latest --include-content -o debug-session.html
```
This generates a self-contained HTML file. Open it in any browser, no install required. The full trace, both timelines, the fork point, the error, the fix. Think Jupyter notebook export but for debug sessions.
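The trick behind a self-contained export is embedding the session data as JSON directly in the HTML, so the file renders anywhere without a server. A bare-bones sketch of the pattern (the real report is far richer than this):

```python
import json

def export_html(session: dict, path: str) -> None:
    """Write a self-contained HTML report: session data embedded as JSON,
    rendered client-side by a small inline script (minimal sketch)."""
    html = f"""<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>{session['name']}</title></head>
<body>
<h1>{session['name']}</h1>
<pre id="trace"></pre>
<script>
const session = {json.dumps(session)};
document.getElementById("trace").textContent = JSON.stringify(session, null, 2);
</script>
</body></html>"""
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
```

Because everything lives in one file, it survives Slack, email, and ticket attachments without broken links.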

I drop the HTML file in Slack. The team can see exactly what broke, what I changed, and the proof that it works. No “trust me, I tested it.” The scores are right there.
💰 The cost math
Debugging before and after:
| | Before (re-run) | After (Rewind) |
|---|---|---|
| Attempts to find the fix | 5 full re-runs | 2 targeted replays |
| Tokens consumed | 1,371,000 | 311,000 |
| Cost | $6.00 | $1.36 |
| Time | 15 minutes | 3 minutes |
| Proof the fix works | "Looks right to me" | Correctness score: 0.95 |
Costs based on GPT-4o API pricing ($2.50/1M input, $10/1M output). Each run consumes ~274K tokens (205K input + 69K output). Cached replay steps use 0 tokens.
This was a 30-step agent. For longer runs (50, 100 steps) the savings compound. You only pay for the steps after the fork point.
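The per-run figures in the table fall out of the token split and the listed rates (the table rounds the $1.2025 per-run cost to $1.20):

```python
# GPT-4o list pricing used in the table, in dollars per 1M tokens.
IN_RATE, OUT_RATE = 2.50, 10.00

def cost(tokens_in: int, tokens_out: int) -> float:
    return (tokens_in * IN_RATE + tokens_out * OUT_RATE) / 1_000_000

full_run = cost(205_000, 69_000)  # one full 30-step run: ~$1.20
replay = full_run - 0.52          # one post-fork replay, after $0.52 cached: ~$0.68
print(f"full run ${full_run:.2f}, replay ${replay:.2f}")
```

Five full runs and two replays give the "before" and "after" columns above.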
🔄 The workflow I’ve settled on
- Langfuse monitors production. Every trace is captured.
- Something breaks. Import the trace into Rewind: `rewind import from-langfuse --trace <id>`
- Fork at the failure, replay with the fix: `rewind replay latest --from 14`
- Prove the fix works: `rewind eval score latest -e correctness --compare-timelines`
- Share the debug session: `rewind share latest`
- Optionally, export the fixed trace back to Langfuse: `rewind export otel latest --endpoint https://cloud.langfuse.com`
Langfuse is my production backbone. Rewind is what I reach for when something breaks.
🚀 Try it
```bash
pip install rewind-agent
rewind demo && rewind show latest
```
No API keys needed for the demo. `rewind demo` seeds sample data (a research agent that hallucinates at step 5) so you can try `show`, `diff`, `eval score`, `assert check`, and `share` right away.
- GitHub: agentoptics/rewind (MIT licensed)
- PyPI: rewind-agent
- Langfuse integration docs: langfuse-import.md
- Website: agentoptics.dev
If you use Langfuse (or LangSmith, or any OTel-compatible backend), Rewind plugs in with one command.
Having trouble with a specific agent failure? Open a discussion and paste the trace. I’ll walk through debugging it with you.