RC RANDOM CHAOS

31 seconds per 100 lines

38,412 lines of Claude-generated code, 11 incidents, one 11-day silent failure: why generation speed without verification slows your platform down.

· 7 min read

Between February 3 and April 28, I merged 38,412 lines of Claude-generated code into the Foundry monorepo. Posts shipped per week fell from 41 to 33. Production incidents went from 2 a month to 11. One property published nothing for 11 days while every job in its pipeline logged success.

More generated code did not make the platform faster. It moved the bottleneck somewhere I wasn’t watching, and the bill arrived in April.

The setup

Foundry runs 6 content properties on Claude Code: 3 blog sites, 2 social accounts, 1 newsletter. APScheduler fires roughly 240 jobs a day. Most of those jobs shell out to claude -p at some point. The codebase is about 31k lines of Python, and by March, 62% of new lines merged each week were generated.

The generation loop is seductive in the one way that matters: a feature that used to cost an evening cost 4 minutes of prompting plus 9 minutes of skimming. I tracked it. My median review pace on generated PRs in March was 31 seconds per 100 lines. My historical pace on human-written code is just over 4 minutes per 100 lines. I was approving code 8x faster than I was capable of reading it.

That is the entire post. Everything below is the receipts.

What actually broke

Four incidents in eight weeks, every one traceable to generated code that passed my review:

  • A retry wrapper that retried a non-idempotent publish call. Property 2 posted the same Instagram carousel 3 times in 40 minutes.
  • A “defensive” try/except wrapped around a subprocess call, converting hard failures into empty strings.
  • A generated unit test that asserted against its own mock. 100% pass rate, 0% behavior coverage.
  • A config migration applied across all 6 property configs with a transposed key in exactly one of them: max_daily_posts: 0.
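The fourth incident is the kind a cheap invariant check can catch before merge. A minimal sketch, assuming each property config loads into a plain dict and that zero is never a legitimate value for max_daily_posts. The function name, the config shape, and the property names are hypothetical, not Foundry's actual loader:

```python
def find_bad_post_limits(configs: dict[str, dict]) -> list[str]:
    """Return names of properties whose max_daily_posts is missing or not a positive int."""
    bad = []
    for name, cfg in configs.items():
        limit = cfg.get("max_daily_posts")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(limit, bool) or not isinstance(limit, int) or limit <= 0:
            bad.append(name)
    return sorted(bad)


# Hypothetical configs: six properties, one with a zeroed limit.
configs = {f"property_{i}": {"max_daily_posts": 3} for i in range(1, 7)}
configs["property_5"]["max_daily_posts"] = 0
print(find_bad_post_limits(configs))  # ['property_5']
```

Run against every config touched by a migration, a check like this turns a silent zero into a failed build.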

None of these bugs is exotic. A tired human writes all four. The difference is rate. A human writes one of these a month and you catch it because the rest of the diff is small enough to actually read. Claude writes one of these a week and buries it in 600 lines of idiomatic, confident, correct-looking code.

Anatomy of the 11-day silence

The worst of the four deserves the full post-mortem treatment.

Commit b41c9e7, merged April 2 at 14:11 UTC, was a 640-line refactor of the publish pipeline. I reviewed it in 9 minutes. It contained this wrapper, at foundry/skills/publish/runner.py:88:

def run_claude(prompt: str, timeout: int = 600) -> str:
    result = subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout.strip()

It looks fine. It is idiomatic. It has two production-killing properties:

  1. No returncode check. When claude hit a rate limit it exited 1 with the error on stderr. Nothing in the pipeline read stderr.
  2. Empty stdout treated as a valid draft. claude -p can exit 0 and print nothing under specific conditions. The wrapper returned "" and the pipeline kept going.
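Both properties are easy to reproduce without the claude CLI. The sketch below keeps the same wrapper shape but takes the command as a parameter (an assumption made for illustration) and uses two stand-in Python one-liners in place of the real calls: one that writes to stderr and exits 1, one that exits 0 and prints nothing.

```python
import subprocess
import sys


def run_cmd_unchecked(cmd: list[str], timeout: int = 600) -> str:
    """Same shape as the wrapper above, with the command made a parameter."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.stdout.strip()  # returncode and stderr are never inspected


# Stand-in for a rate-limited call: message on stderr, exit status 1.
failing = [sys.executable, "-c",
           "import sys; sys.stderr.write('rate limited'); sys.exit(1)"]
# Stand-in for the second case: exit status 0, nothing printed.
silent = [sys.executable, "-c", "pass"]

print(repr(run_cmd_unchecked(failing)))  # ''
print(repr(run_cmd_unchecked(silent)))   # ''
```

A hard failure and a silent no-op both come back as the same empty string, so the caller cannot tell either of them from a very short draft.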

Downstream, at publisher.py:54, a dedupe gate hashes the draft body and skips anything it has seen before. The empty string hashes to the same value every time. So every failed generation after the first was classified as a duplicate, skipped, and logged as a routine success:

2026-04-09 03:20:14 INFO publish.runner property=4 job=daily_post status=ok
2026-04-09 03:20:14 INFO publish.dedupe property=4 hash=e3b0c442 action=skip reason=duplicate

e3b0c442 is the first eight hex characters of the SHA-256 of the empty string. One of the most recognizable hashes in computing sat in my logs for ten days, labeled INFO.
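The collapse is easy to demonstrate. A minimal sketch of a hash-based dedupe gate, assuming it keys on the SHA-256 of the UTF-8 draft body and keeps a set of seen hashes. The function names are hypothetical and the real gate at publisher.py:54 may differ in detail:

```python
import hashlib


def draft_hash(body: str) -> str:
    """First eight hex characters of the SHA-256 of the draft body."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()[:8]


def dedupe(drafts: list[str]) -> list[str]:
    """Return 'publish' or 'skip' per draft, skipping any hash seen before."""
    seen: set[str] = set()
    actions = []
    for body in drafts:
        h = draft_hash(body)
        actions.append("skip" if h in seen else "publish")
        seen.add(h)
    return actions


print(draft_hash(""))        # e3b0c442
print(dedupe(["", "", ""]))  # ['publish', 'skip', 'skip']
```

Every empty draft after the first lands on the same hash, so the gate does exactly what it was built to do and reports a duplicate.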

The timeline:

date (UTC)    | event
--------------|-----------------------------------------------------------
Apr 2, 14:11  | b41c9e7 merged. 640-line PR, reviewed in 9 minutes
Apr 5, 03:20  | first rate-limit hit. stdout empty, pipeline logs success
Apr 5-15      | 22 consecutive publish runs return "". every job green
Apr 16, 09:00 | weekly growth dashboard: property 4 flatlined at 0 posts
Apr 16, 22:40 | root cause at runner.py:88. fix merged in 7d2f0aa

Direct cost: 11 days of zero publishing on a property doing ~60 sessions a day, six hours of incident response, $0 in direct revenue because the property was pre-monetization. Indirect cost: I no longer trusted anything merged in the previous eight weeks, so I re-read every generated PR from that window. 19 hours.

The root cause is not the wrapper

The wrapper is the proximate cause. The actual root cause is that I let generation throughput set merge throughput.

Generated code has a specific failure texture. It is plausible-shaped. The happy path is handled with confident, idiomatic code. The unhappy path is handled with whatever pattern is statistically most common in the training distribution, and the most common subprocess pattern on the internet is “capture output, return stdout.” The bug is not random. It is the median of every tutorial ever written.

That is why review pace matters more for generated code, not less. Human bugs are idiosyncratic; something looks off and your eye snags on it. Generated bugs are assembled from fragments that were correct in their original context. Nothing snags. Thirty-one seconds per hundred lines is not review. It is vibes confirmation.

The generated test from incident 3 is the purest specimen of the pattern:

def test_run_claude_returns_output(monkeypatch):
    fake = SimpleNamespace(stdout="draft text", returncode=0)
    monkeypatch.setattr(subprocess, "run", lambda *a, **k: fake)
    assert run_claude("prompt") == "draft text"

This asserts that a mock returns the mock. Worse than useless: it pinned the bug in place. Any fix that checked returncode against real failure behavior had to fight a green test suite that had encoded the wrong behavior as correct.

The fix

Three layers, because fixing only the wrapper just schedules the next incident.

Layer 1, the wrapper itself (7d2f0aa):

def run_claude(prompt: str, timeout: int = 600) -> str:
    result = subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        raise ClaudeCallError(result.returncode, result.stderr[-2000:])
    out = result.stdout.strip()
    if not out:
        raise EmptyOutputError("claude -p exited 0 with empty stdout")
    return out

Layer 2, artifact-level monitoring (c19e3d2). The old heartbeat answered “did the job run.” The new sentinel answers “does the artifact exist”: one query per property, posts published in the trailing 48 hours, page me on zero. It would have caught this incident on day 2 instead of day 11. Job exit codes are claims. Rows in the published-posts table are evidence.
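A minimal sketch of such a sentinel, using an in-memory SQLite database so it runs anywhere. The published_posts table, its column names, and the silent_properties function are assumptions for illustration, and the paging call is left to the caller:

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def silent_properties(conn, property_ids, now, window_hours=48):
    """Return the property ids with zero published posts in the trailing window."""
    cutoff = (now - timedelta(hours=window_hours)).isoformat()
    silent = []
    for pid in property_ids:
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM published_posts "
            "WHERE property_id = ? AND published_at >= ?",
            (pid, cutoff),
        ).fetchone()
        if count == 0:
            silent.append(pid)
    return silent


# Illustrative data: property 1 published 20 hours ago, property 4 five days ago.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE published_posts (property_id INTEGER, published_at TEXT)")
now = datetime(2026, 4, 7, 9, 0, tzinfo=timezone.utc)
conn.executemany(
    "INSERT INTO published_posts VALUES (?, ?)",
    [(1, (now - timedelta(hours=20)).isoformat()),
     (4, (now - timedelta(days=5)).isoformat())],
)
print(silent_properties(conn, [1, 4], now))  # [4]
```

The check never looks at job status. It only asks whether rows exist, which is why a pipeline that logs success while producing nothing cannot fool it.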

Layer 3, the one that actually changed the curve: a merge gate. Generated PRs now go through the same pipeline as anything I write by hand. Tests must assert against real subprocess behavior, so CI carries a fake claude binary on PATH with three modes: exit 1 with stderr, exit 0 with empty stdout, and hang until timeout. And a review-time floor: 4 minutes per 100 lines, measured. If I will not spend that, the PR waits until I will.
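One way to build that fake binary, sketched under assumptions: a POSIX shell script named claude, placed first on PATH, with its mode chosen by a FAKE_CLAUDE_MODE environment variable. The variable name, the extra "ok" mode, and the two exception class definitions are hypothetical. The post names the exceptions but does not show them.

```python
import os
import subprocess
import tempfile


class ClaudeCallError(RuntimeError):
    """Assumed definition; the post names this exception but does not show it."""


class EmptyOutputError(RuntimeError):
    """Assumed definition, as above."""


def run_claude(prompt: str, timeout: int = 600) -> str:
    # Same logic as the fixed wrapper shown in Layer 1.
    result = subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        raise ClaudeCallError(result.returncode, result.stderr[-2000:])
    out = result.stdout.strip()
    if not out:
        raise EmptyOutputError("claude -p exited 0 with empty stdout")
    return out


# Fake CLI (POSIX only). "exec sleep" replaces the shell so the timeout kill
# reaches the sleeping process directly instead of leaving it orphaned.
FAKE_CLAUDE = """#!/bin/sh
case "$FAKE_CLAUDE_MODE" in
  fail)  echo "rate limited" >&2; exit 1 ;;
  empty) exit 0 ;;
  hang)  exec sleep 60 ;;
  *)     echo "draft text" ;;
esac
"""


def run_modes(timeout: int = 1) -> dict[str, str]:
    """Install the fake binary in a temp dir, put it first on PATH, try each mode."""
    results = {}
    old_path = os.environ.get("PATH", "")
    with tempfile.TemporaryDirectory() as bin_dir:
        fake = os.path.join(bin_dir, "claude")
        with open(fake, "w") as fh:
            fh.write(FAKE_CLAUDE)
        os.chmod(fake, 0o755)
        os.environ["PATH"] = bin_dir + os.pathsep + old_path
        try:
            for mode in ("ok", "fail", "empty", "hang"):
                os.environ["FAKE_CLAUDE_MODE"] = mode
                try:
                    results[mode] = run_claude("prompt", timeout=timeout)
                except (ClaudeCallError, EmptyOutputError,
                        subprocess.TimeoutExpired) as exc:
                    results[mode] = type(exc).__name__
        finally:
            # Restore the environment so later code sees the real PATH.
            os.environ["PATH"] = old_path
            os.environ.pop("FAKE_CLAUDE_MODE", None)
    return results


print(run_modes())
```

Because the wrapper runs a real child process here, each failure mode has to surface as its own exception. A test built this way cannot pass by reading back its own mock.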

The math nobody runs

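stage                   | human code | generated, March me | generated, now
------------------------|------------|---------------------|------------------
write                   | 3 h        | 4 min               | 4 min
review                  | 40 min     | 9 min               | 40 min
tests                   | 45 min     | “included”          | 45 min, real ones
incident tax, amortized | ~10 min/PR | ~23 min/PR          | ~5 min/PR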

The March column wins every row until the last one. Those eight weeks produced 4 incidents costing roughly 25 hours of response and audit, spread across 64 generated PRs: 23 minutes of incident tax per PR, which eats about three quarters of the 31 minutes of review time I “saved.” Generation did not remove the engineering. It deferred it, with interest.
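The per-PR figure can be checked in a few lines. This assumes the "roughly 25 hours" is the 6 hours of incident response plus the 19 hours of re-reading from the post-mortem above:

```python
response_hours = 6   # incident response, from the post-mortem
audit_hours = 19     # re-reading eight weeks of generated PRs
generated_prs = 64

total_minutes = (response_hours + audit_hours) * 60
tax_per_pr = total_minutes / generated_prs
print(f"{tax_per_pr:.1f} min/PR")  # 23.4 min/PR
```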

The April column, “generated, now,” is the honest one. Generated code is still a massive win on the write step: 3 hours to 4 minutes is real and nothing claws it back. But every other step costs the same as it always did, because every other step exists to catch exactly the failure texture generation produces.

What survives contact

Five rules now enforced in the Foundry pipeline, in priority order:

  1. Check every subprocess returncode. Treat empty stdout from claude -p as a failure, unconditionally. It is never a valid draft.
  2. Never accept a generated test without reading the assertion. Mock-asserting-mock is the single most common generated test I see, and it is worse than no test.
  3. Monitor artifacts, not exits. A green job that produced nothing is a red job with better PR.
  4. Treat generated lines as inventory, not output. Inventory has carrying costs.
  5. Set a review-time floor and let it throttle merge rate. The floor is the safeguard; the discomfort is the point.
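The review-time floor reduces to one comparison. A minimal sketch, assuming review time is measured in seconds per PR by some external timer. The function names are hypothetical, and how Foundry actually measures review time is not shown here:

```python
FLOOR_SECONDS_PER_100_LINES = 240  # 4 minutes per 100 lines


def required_review_seconds(changed_lines: int) -> float:
    """Minimum review time the floor demands for a diff of this size."""
    return changed_lines * FLOOR_SECONDS_PER_100_LINES / 100


def may_merge(changed_lines: int, review_seconds: float) -> bool:
    """True only if measured review time meets the floor for this diff size."""
    return review_seconds >= required_review_seconds(changed_lines)


# The 640-line refactor from the post-mortem, reviewed in 9 minutes.
print(required_review_seconds(640) / 60)  # 25.6
print(may_merge(640, 9 * 60))             # False
```

Under this floor the April 2 refactor would have needed about 26 minutes of review, not 9.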

Five weeks since the gates went in: merge rate is down about 40% from the March peak, post throughput recovered to 43 a week, above the pre-generation baseline of 41, and the only incident was a Meta API change, not generated code. Slower in, faster out.

Try Claude Code yourself: https://claude.com/claude-code - it still writes most of Foundry. It just doesn’t get to merge unread anymore.


Contains a referral link.
