31 seconds per 100 lines
38,412 lines of Claude-generated code, 11 incidents, one 11-day silent failure: why generation speed without verification slows your platform down.
Between February 3 and April 28, I merged 38,412 lines of Claude-generated code into the Foundry monorepo. Posts shipped per week fell from 41 to 33. Production incidents went from 2 a month to 11. One property published nothing for 11 days while every job in its pipeline logged success.
More generated code did not make the platform faster. It moved the bottleneck somewhere I wasn’t watching, and the bill arrived in April.
The setup
Foundry runs 6 content properties on Claude Code: 3 blog sites, 2 social accounts, 1 newsletter. APScheduler fires roughly 240 jobs a day. Most of those jobs shell out to claude -p at some point. The codebase is about 31k lines of Python, and by March, 62% of new lines merged each week were generated.
The generation loop is seductive in the one way that matters: a feature that used to cost an evening cost 4 minutes of prompting plus 9 minutes of skimming. I tracked it. My median review pace on generated PRs in March was 31 seconds per 100 lines. My historical pace on human-written code is just over 4 minutes per 100 lines. I was approving code 8x faster than I was capable of reading it.
That is the entire post. Everything below is the receipts.
What actually broke
Four incidents in eight weeks, every one traceable to generated code that passed my review:
- A retry wrapper that retried a non-idempotent publish call. Property 2 posted the same Instagram carousel 3 times in 40 minutes.
- A “defensive”
try/exceptwrapped around a subprocess call, converting hard failures into empty strings. - A generated unit test that asserted against its own mock. 100% pass rate, 0% behavior coverage.
- A config migration applied across all 6 property configs with a transposed key in exactly one of them:
max_daily_posts: 0.
None of these bugs is exotic. A tired human writes all four. The difference is rate. A human writes one of these a month and you catch it because the rest of the diff is small enough to actually read. Claude writes one of these a week and buries it in 600 lines of idiomatic, confident, correct-looking code.
Anatomy of the 11-day silence
The worst of the four deserves the full post-mortem treatment.
Commit b41c9e7, merged April 2 at 14:11 UTC, was a 640-line refactor of the publish pipeline. I reviewed it in 9 minutes. It contained this wrapper, at foundry/skills/publish/runner.py:88:
def run_claude(prompt: str, timeout: int = 600) -> str:
result = subprocess.run(
["claude", "-p", prompt],
capture_output=True,
text=True,
timeout=timeout,
)
return result.stdout.strip()
It looks fine. It is idiomatic. It has two production-killing properties:
- No returncode check. When
claudehit a rate limit it exited 1 with the error on stderr. Nothing in the pipeline read stderr. - Empty stdout treated as a valid draft.
claude -pcan exit 0 and print nothing under specific conditions. The wrapper returned""and the pipeline kept going.
Downstream, at publisher.py:54, a dedupe gate hashes the draft body and skips anything it has seen before. The empty string hashes to the same value every time. So every failed generation after the first was classified as a duplicate, skipped, and logged as a routine success:
2026-04-09 03:20:14 INFO publish.runner property=4 job=daily_post status=ok
2026-04-09 03:20:14 INFO publish.dedupe property=4 hash=e3b0c442 action=skip reason=duplicate
e3b0c442 is the first eight hex characters of the SHA-256 of the empty string. One of the most recognizable hashes in computing sat in my logs for ten days, labeled INFO.
The timeline:
| date (UTC) | event |
|---|---|
| Apr 2, 14:11 | b41c9e7 merged. 640-line PR, reviewed in 9 minutes |
| Apr 5, 03:20 | first rate-limit hit. stdout empty, pipeline logs success |
| Apr 5-15 | 22 consecutive publish runs return "". every job green |
| Apr 16, 09:00 | weekly growth dashboard: property 4 flatlined at 0 posts |
| Apr 16, 22:40 | root cause at runner.py:88. fix merged in 7d2f0aa |
Direct cost: 11 days of zero publishing on a property doing ~60 sessions a day, six hours of incident response, $0 in direct revenue because the property was pre-monetization. Indirect cost: I no longer trusted anything merged in the previous eight weeks, so I re-read every generated PR from that window. 19 hours.
The root cause is not the wrapper
The wrapper is the proximate cause. The actual root cause is that I let generation throughput set merge throughput.
Generated code has a specific failure texture. It is plausible-shaped. The happy path is handled with confident, idiomatic code. The unhappy path is handled with whatever pattern is statistically most common in the training distribution, and the most common subprocess pattern on the internet is “capture output, return stdout.” The bug is not random. It is the median of every tutorial ever written.
That is why review pace matters more for generated code, not less. Human bugs are idiosyncratic; something looks off and your eye snags on it. Generated bugs are assembled from fragments that were correct in their original context. Nothing snags. Thirty-one seconds per hundred lines is not review. It is vibes confirmation.
The generated test from incident 3 is the purest specimen of the pattern:
def test_run_claude_returns_output(monkeypatch):
fake = SimpleNamespace(stdout="draft text", returncode=0)
monkeypatch.setattr(subprocess, "run", lambda *a, **k: fake)
assert run_claude("prompt") == "draft text"
This asserts that a mock returns the mock. Worse than useless: it pinned the bug in place. Any fix that checked returncode against real failure behavior had to fight a green test suite that had encoded the wrong behavior as correct.
The fix
Three layers, because fixing only the wrapper just schedules the next incident.
Layer 1, the wrapper itself (7d2f0aa):
def run_claude(prompt: str, timeout: int = 600) -> str:
result = subprocess.run(
["claude", "-p", prompt],
capture_output=True, text=True, timeout=timeout,
)
if result.returncode != 0:
raise ClaudeCallError(result.returncode, result.stderr[-2000:])
out = result.stdout.strip()
if not out:
raise EmptyOutputError("claude -p exited 0 with empty stdout")
return out
Layer 2, artifact-level monitoring (c19e3d2). The old heartbeat answered “did the job run.” The new sentinel answers “does the artifact exist”: one query per property, posts published in the trailing 48 hours, page me on zero. It would have caught this incident on day 2 instead of day 11. Job exit codes are claims. Rows in the published-posts table are evidence.
Layer 3, the one that actually changed the curve: a merge gate. Generated PRs now go through the same pipeline as anything I write by hand. Tests must assert against real subprocess behavior, so CI carries a fake claude binary on PATH with three modes: exit 1 with stderr, exit 0 with empty stdout, and hang until timeout. And a review-time floor: 4 minutes per 100 lines, measured. If I will not spend that, the PR waits until I will.
The math nobody runs
| stage | human code | generated, March me | generated, now |
|---|---|---|---|
| write | 3 h | 4 min | 4 min |
| review | 40 min | 9 min | 40 min |
| tests | 45 min | ”included” | 45 min, real ones |
| incident tax, amortized | ~10 min/PR | ~23 min/PR | ~5 min/PR |
The March column wins every row until the last one. Those eight weeks produced 4 incidents costing roughly 25 hours of response and audit, spread across 64 generated PRs: 23 minutes of incident tax per PR, which is more than the entire review time I “saved.” Generation did not remove the engineering. It deferred it, with interest.
The April column is the honest one. Generated code is still a massive win on the write step - 3 hours to 4 minutes is real and nothing claws it back. But every other step costs the same as it always did, because every other step exists to catch exactly the failure texture generation produces.
What survives contact
Five rules now enforced in the Foundry pipeline, in priority order:
- Check every subprocess returncode. Treat empty stdout from
claude -pas a failure, unconditionally. It is never a valid draft. - Never accept a generated test without reading the assertion. Mock-asserting-mock is the single most common generated test I see, and it is worse than no test.
- Monitor artifacts, not exits. A green job that produced nothing is a red job with better PR.
- Treat generated lines as inventory, not output. Inventory has carrying costs.
- Set a review-time floor and let it throttle merge rate. The floor is the safeguard; the discomfort is the point.
Five weeks since the gates went in: merge rate is down about 40% from the March peak, post throughput recovered to 43 a week - above the pre-generation baseline - and the only incident was a Meta API change, not generated code. Slower in, faster out.
Try Claude Code yourself: https://claude.com/claude-code - it still writes most of Foundry. It just doesn’t get to merge unread anymore.
Contains a referral link.
Keep Reading
claude-codeOur repair agent patched the wrong file four times
A Claude Code repair agent patched the wrong file four times in 31 hours. Why agents anchor on tracebacks, and the prompt rewrite that fixed it.
production-incidentsOur incident report was a vendor blog post
Anthropic's invisible-guardrails apology misses the point: production agents need output contracts, audit ledgers, and sentinel checks, not model-default trust.
claude-codeClaude Code deleted our deploy pipeline
Why semantic code understanding for Claude Code agents needs an entity index on top of git, not an LSP. Receipts from 90 days of production edits.
Stay in the loop
New writing delivered when it's ready. No schedule, no spam.