Our repair agent patched the wrong file four times
A Claude Code repair agent patched the wrong file four times in 31 hours. Why agents anchor on tracebacks, and the prompt rewrite that fixed it.
Our repair agent shipped four patches to publish.py in 31 hours. None of them fixed anything, because the bug was never in publish.py. The damage: one Instagram property dark for 31 hours, 23 lines of defensive garbage merged to main, and $6.40 of API spend on an agent confidently diagnosing the wrong problem at the top of every hour.
There’s an old consulting joke where someone glances at a screen and says: “Ah, I see your problem right here. It has code.” It stops being a joke when your autonomous repair pipeline runs that exact diagnosis four times in a row, unsupervised, with commit access.
The failure modes, up front
Every item on this list is reachable in any setup where a Claude Code agent has Edit access and a traceback in its prompt:
- Anchoring on the traceback. The agent treats the line that raised as the line that is wrong.
- Symptom patching that recurses. Each defensive patch moves the failure one layer downstream, which produces a fresh traceback, which triggers a fresh invocation.
- Exit code 0 as ground truth. The upstream tool failed silently, so nothing in the evidence pointed upstream.
- Prompt-scoped diagnosis. The skill prompt told the agent where the bug was. The agent believed it.
Hold that list. The timeline makes each one concrete.
31 hours, four commits
The foundry platform runs 6 content properties on Claude Code skills plus APScheduler. Each property has a publish job; each job writes to an audit ledger; a watchdog tails the ledger and invokes a repair skill called /heal when a job fails twice consecutively. /heal is a headless Claude Code invocation with the traceback in its prompt and write access to the repo.
| Time (UTC) | Event |
|---|---|
| Sat 03:12 | Anthropic workspace hits its monthly spend cap. Caption generation starts returning empty stdout. |
| Sat 04:00 | publish.py:84 raises json.JSONDecodeError. Watchdog fires /heal after the second failure. |
| Sat 05:07 | Heal commit 1: wraps json.loads in try/except, defaults to {}. |
| Sat 06:00 | KeyError: 'caption' at line 91. /heal fires again. |
| Sat 07:02 | Heal commit 2: .get("caption", ""). |
| Sat 08:00 | Instagram API returns 400 on empty caption. /heal fires. |
| Sat 09:04 | Heal commit 3: retry loop around the IG call. |
| Sat 10:00 | Same 400, three times now. /heal fires. |
| Sat 10:58 | Heal commit 4: exponential backoff on the retry loop. |
| Sun 10:15 | I open the growth dashboard, see a flatline, and start reading commits I didn’t write. |
The commit messages are the unsettling part:
c9e21aa fix: handle malformed caption JSON in publish step
3b8f04d fix: tolerate missing caption key in metadata
9912c7e fix: retry transient Instagram API errors
5fa6b31 fix: add exponential backoff to publish retries
They read like a competent junior engineer’s commits. Each one is locally reasonable. Each one is defensible in code review if you only look at the diff and the traceback. All four are wrong, and they’re wrong in a specific direction: every patch accepts the bad state and moves the crash further from the cause.
By commit 2 the pipeline had stopped crashing on the real signal (empty model output) and started crashing on a fabricated one (an empty caption it had just invented). The agent spent commits 3 and 4 negotiating with Instagram about whether an empty string is a caption. Instagram, correctly, held its position.
Root cause
The caption step shells out to claude -p through a thin wrapper. At 03:12 the workspace spend cap kicked in. The CLI printed the failure to stderr and exited with empty stdout, and the wrapper’s only check was the return code.
I have written about the claude -p empty-stdout edge case before. The guard existed in three of the four call sites in this codebase. The caption step was the fourth. That detail matters, because the incident isn’t “we didn’t know about this failure mode.” It’s “we knew, the knowledge lived in three files, and nothing enforced it in the fourth.”
publish.py:84 is where the corpse was found:
raw = run_caption_model(post) # returned "" from 03:12 onward
meta = json.loads(raw) # line 84: JSONDecodeError
But the bug’s birthplace was a quota counter in Anthropic’s billing system, four layers and one network boundary upstream. No amount of editing publish.py was ever going to fix it.
Why did the agent never look upstream? Read the skill it was running. .claude/skills/heal/SKILL.md, as it existed that weekend:
You are the repair agent. A scheduled job has failed.
Traceback:
{traceback}
Find the bug in the file where the error occurred and fix it.
Run the job's test file to verify. Commit with a descriptive message.
“Find the bug in the file where the error occurred.” That sentence is the entire post-mortem. We wrote anchoring bias directly into the system prompt and handed it commit access.
Anchoring is the human failure mode this automates. A tired engineer at 2 a.m. stares at the traceback and patches line 84 because line 84 is what’s on the screen. The difference is that the human gets suspicious by the third failure. Suspicion was not in the prompt, so the agent didn’t have any. It executed the same shallow diagnosis at 05:07, 07:02, 09:04, and 10:58 with identical confidence, which is exactly what we asked for: a tireless engineer with our worst debugging habit and none of our doubt.
The fix
Three commits.
1. The deterministic guard. Commit a41f9c2, 3 lines, in the shared wrapper:
proc = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
if proc.returncode == 0 and not proc.stdout.strip():
raise EmptyCompletion(
f"claude -p exited 0 with empty stdout; stderr tail: {proc.stderr[-500:]}"
)
This converts the silent upstream failure into a loud, correctly located one. EmptyCompletion names the actual problem, carries the stderr that names the actual cause (spend limit reached), and fires at the layer where the bad value is created instead of three layers downstream.
2. The skill rewrite. Commit 7d03e1b. The new /heal prompt separates diagnosis from treatment and makes treatment conditional:
Before editing any file you MUST:
1. Reproduce the failure using the exact inputs recorded in the
audit ledger for the failed run.
2. Trace the failing value to its origin. State the file:line where
the bad value was CREATED, not where it crashed.
3. Check the ledger for failures in upstream jobs in the same window.
4. State the blast radius: every other job that reads the code path
you intend to change.
If the origin is outside the repo (API quota, credentials, env,
disk, DNS, upstream outage): DO NOT edit code. Write a diagnosis to
ops/incidents/{date}-{job}.md and send an ntfy page.
Two consecutive /heal invocations on the same job: stop. Page.
The agent gets one swing.
3. The revert. Commit e58d20f removed all 23 lines of defensive code from the four heal commits. Plus a spend alert at 80% of cap, which should have existed since day one and didn’t, because billing alerts are nobody’s feature.
What the rewrite caught
Six weeks of ledger data since 7d03e1b. /heal has fired 11 times:
| Diagnosis | Count | Code edited |
|---|---|---|
| Real code bug | 3 | ✓ |
| Expired OAuth token (IG, FB) | 2 | ✗ |
| Disk at 97% (image cache) | 1 | ✗ |
| Upstream API 5xx storm | 2 | ✗ |
| DNS resolution failure | 1 | ✗ |
| systemd env drift after server move | 1 | ✗ |
| Spend cap (again) | 1 | ✗ |
8 of 11 production failures were not code bugs. The old prompt would have shipped a patch for all eleven. The three real bugs got fixed and the fixes stayed. The second spend-cap event was caught by the 80% alert and resolved in 14 minutes instead of 31 hours, a 133× improvement that consisted entirely of knowing where to look.
The 8-of-11 ratio shouldn’t surprise anyone who has run production systems. Code is the most version-controlled, most tested, most reviewed artifact in the stack. Tokens expire, disks fill, quotas exhaust, and DNS does what DNS does, all without a single diff. The traceback just happens to surface in the one artifact your agent can edit.
Lessons
- A traceback is the location of death, not the cause of death. Autopsy accordingly.
- Your agent’s diagnostic bias is your prompt’s diagnostic bias, executed hourly, without fatigue and without doubt. Audit prompts the way you audit code, because they are code.
- Anchor-biased defensive patches don’t fail loudly. They convert outages into silent degradation. The worst version of this incident is the one where commit 4 works and the property publishes empty captions for a month while the dashboard stays green.
- Give repair agents a success path that doesn’t involve editing code. If Edit is the only tool that closes the loop, every diagnosis becomes a code bug.
- Deterministic guards beat judgment. The
EmptyCompletioncheck is 3 lines and fires every time. “Be careful about empty output” in a prompt is a suggestion.
Follow-ups still open:
- Route the last two direct
subprocesscall sites through the shared wrapper so the guard is structural, not conventional -
/healreads the last 50 ledger rows by default instead of on request - Weekly review of agent-authored commits as a standing diff, not ad hoc archaeology
The joke holds up. The agent saw the problem immediately: it had code. The fix was teaching it that almost nothing else in production does.
Try Claude Code yourself: https://claude.com/claude-code
Contains a referral link.
Keep Reading
claude-codeyour logs are lying to you
Five production failure modes in Claude Code platforms, the exact code that causes each, and the five-step debugging loop that isolates them.
claude-code31 seconds per 100 lines
38,412 lines of Claude-generated code, 11 incidents, one 11-day silent failure: why generation speed without verification slows your platform down.
claude-codeClaude Code deleted our deploy pipeline
Why semantic code understanding for Claude Code agents needs an entity index on top of git, not an LSP. Receipts from 90 days of production edits.
Stay in the loop
New writing delivered when it's ready. No schedule, no spam.