RC RANDOM CHAOS

Eleven hours lost to one settings file

The undocumented Claude Code config flags, hooks, env vars, and permission patterns I rely on to run six properties in production.


Claude Code’s Undocumented Config Surface

I run six properties on Claude Code in production. Six skills, four cron-driven generators, an APScheduler stack, and claude -p invoked roughly 1,400 times a day. Half of what keeps it stable is not in the docs.

This is the list of knobs I found by reading the binary, the SDK source, and three weeks of strace output. Every one of them changed a metric I cared about.

The settings.json hierarchy nobody documents in full

Claude Code reads settings from five sources: four files and one environment variable. The docs mention three. The actual precedence, from lowest to highest:

/etc/claude-code/managed-settings.json # enterprise policy
~/.claude/settings.json # user global
<project>/.claude/settings.json # checked-in project
<project>/.claude/settings.local.json # gitignored override
CLAUDE_CODE_SETTINGS env var (JSON inline) # runtime override

The last one is undocumented and the most useful. When I run claude -p from APScheduler I inject a per-job settings blob:

CLAUDE_CODE_SETTINGS='{"permissions":{"allow":["Bash(git:*)","Write"]},"env":{"FOUNDRY_JOB_ID":"'"$JOB_ID"'"}}' \
 claude -p "$PROMPT" --output-format stream-json

This bypasses the merge-order problem where a project settings.local.json was silently overriding cron-job permissions. I lost 11 hours to that bug in March. Commit 8a3f1c2 on the foundry repo was a one-line fix: stop relying on file-based settings inside scheduled jobs.
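A minimal sketch of how a scheduler job might build that environment in Python, so the JSON is serialized by json.dumps instead of hand-quoted in a shell string. It assumes CLAUDE_CODE_SETTINGS behaves as described above; build_job_env and run_job are hypothetical helper names, not part of any SDK.

```python
import json
import os
import subprocess


def build_job_env(job_id, base_env=None):
    """Return a copy of the environment with a per-job settings blob added.

    Assumes CLAUDE_CODE_SETTINGS is read as inline JSON, as described above.
    """
    settings = {
        "permissions": {"allow": ["Bash(git:*)", "Write"]},
        "env": {"FOUNDRY_JOB_ID": job_id},
    }
    env = dict(os.environ if base_env is None else base_env)
    env["CLAUDE_CODE_SETTINGS"] = json.dumps(settings)
    return env


def run_job(prompt, job_id):
    """Invoke claude -p with the per-job environment (hypothetical wrapper)."""
    return subprocess.run(
        ["claude", "-p", prompt, "--output-format", "stream-json"],
        env=build_job_env(job_id),
        capture_output=True,
        text=True,
        check=False,
    )
```

Passing the environment explicitly to subprocess.run keeps the job independent of whatever settings files happen to sit in the working directory.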

Hooks fire on more events than the docs list

The documented hook events are PreToolUse, PostToolUse, Stop, and UserPromptSubmit. There are four more I’ve confirmed by attaching a catch-all hook with event: "*" and logging which events reached it:

Event        | Fires when                           | Useful for
SessionStart | new session, before first prompt     | warm cache, write audit row
SessionEnd   | session terminates, even on SIGTERM  | flush cost ledger
Notification | tool requests permission             | route to Slack instead of TTY
SubagentStop | a spawned Agent returns              | parent-agent cost rollup

The SessionEnd hook is the one that matters. Without it I was losing cost data on every cron job that got reaped by systemd’s TimeoutStopSec. After wiring it up I closed a $47/month accounting gap between what Anthropic billed and what my ledger reported.

{
 "hooks": {
 "SessionEnd": [{
 "matcher": "*",
 "hooks": [{"type": "command", "command": "/opt/foundry/bin/flush-ledger.sh"}]
 }]
 }
}
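As a rough illustration of what a ledger-flush hook could do, here is a Python sketch. It assumes the hook command receives a JSON payload on stdin containing at least session_id and transcript_path; the ledger format, the append_row and ledger_row names, and the use of FOUNDRY_JOB_ID are hypothetical and are not the contents of flush-ledger.sh.

```python
import json
import os
import time


def ledger_row(payload, env, now):
    """Build one ledger row from the hook payload.

    Missing fields become None instead of raising, so a partial payload
    during shutdown still produces a row.
    """
    return {
        "ts": now,
        "session_id": payload.get("session_id"),
        "transcript_path": payload.get("transcript_path"),
        "job_id": env.get("FOUNDRY_JOB_ID"),
    }


def append_row(payload, ledger_path, env=None, now=None):
    """Append one JSON line to the ledger and return the row written."""
    row = ledger_row(
        payload,
        os.environ if env is None else env,
        time.time() if now is None else now,
    )
    with open(ledger_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")
    return row
```

A hook entry point would read the payload with json.load(sys.stdin) and call append_row with a ledger path of its choosing. The row records where the transcript lives, so cost can be reconciled from it later even if the session was killed.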

The env vars that actually change behaviour

Run strings "$(which claude)" | grep -E '^CLAUDE_' and you get 38 variables. The docs cover about 12. The ones below are undocumented or under-documented, and a few lack the CLAUDE_ prefix, so that grep will not list them:

  • CLAUDE_CODE_DISABLE_TELEMETRY=1 - kills the OTEL exporter. Cuts cold-start by ~180ms on a t4g.small.
  • CLAUDE_CODE_MAX_OUTPUT_TOKENS=8192 - overrides the model default. The SDK respects this; the CLI silently caps it at the model max.
  • CLAUDE_CODE_API_KEY_HELPER=/path/to/script - runs an external script to fetch the API key. I pipe it through aws secretsmanager get-secret-value so no key ever lives on disk.
  • BASH_DEFAULT_TIMEOUT_MS=120000 and BASH_MAX_TIMEOUT_MS=600000 - separate caps for default vs explicit. Documented, but the interaction isn’t: the explicit timeout parameter on a Bash tool call is silently clamped to BASH_MAX_TIMEOUT_MS with no warning emitted.
  • MAX_THINKING_TOKENS=16000 - caps extended-thinking budget per turn. Saved me $14/day across one persona that kept burning thinking tokens on trivial Edit calls.
  • DISABLE_AUTOUPDATER=1 - the auto-updater will fire mid-cron-job and exit non-zero if it can’t write to the install dir. I learned this at 03:17 on a Tuesday when 47 scheduled posts failed in a row.
  • CLAUDE_CODE_SUBAGENT_MODEL=claude-haiku-4-5-20251001 - route subagents to a cheaper model than the parent. 60% cost reduction on a pipeline where the parent needed Opus but the Explore subagents did not.
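The timeout interaction in the list above can be modelled in a few lines. This is a sketch of the behaviour as described, not the CLI's actual code; effective_bash_timeout_ms is a hypothetical name, and the defaults mirror the values quoted above.

```python
def effective_bash_timeout_ms(requested_ms, default_ms=120_000, max_ms=600_000):
    """Model the clamping described above.

    requested_ms is the explicit timeout on a Bash tool call, or None
    when the call does not specify one.
    """
    if requested_ms is None:
        return default_ms
    # An explicit request is capped at the maximum, with no warning.
    return min(requested_ms, max_ms)
```

Under this model a call asking for 900000 ms gets 600000 ms, which is why a long-running build can be cut off earlier than the timeout the agent thinks it requested.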

Permission rules support globs the docs don’t show

The docs show Bash(npm:*). They don’t show:

{
 "permissions": {
 "allow": [
 "Bash(git diff:*)",
 "Bash(git log --oneline:*)",
 "Read(~/projects/**)",
 "Edit(./src/**/*.{ts,tsx})",
 "WebFetch(domain:docs.anthropic.com)",
 "mcp__github__*"
 ],
 "deny": [
 "Bash(rm -rf:*)",
 "Bash(*--force*)",
 "Write(/etc/**)",
 "Edit(**/.env*)"
 ],
 "additionalDirectories": ["/var/log/foundry"],
 "defaultMode": "acceptEdits"
 }
}

Three things matter here. First, the deny list is evaluated before allow, so a deny pattern wins even if a broader allow exists. Second, additionalDirectories extends the working tree without re-rooting the session - I use it to let agents tail logs without granting full filesystem read. Third, defaultMode: "acceptEdits" skips the per-Edit confirmation in headless mode but still prompts on Bash. The "bypassPermissions" mode is not safe to use in production. I tried. Commit f1d8e0a reverts it.
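The deny-before-allow ordering can be illustrated with a small Python sketch. It uses fnmatch globbing over (tool, pattern) pairs as a stand-in; Claude Code's real rule syntax and matcher are not reproduced here, and decide, ALLOW, and DENY are hypothetical names.

```python
from fnmatch import fnmatchcase


def decide(tool, argument, allow, deny):
    """Return "deny", "allow", or "ask" for one tool call.

    Rules are (tool, glob) pairs. Deny is checked first, so it wins over
    any allow rule that also matches.
    """
    def matches(rules):
        return any(t == tool and fnmatchcase(argument, pat) for t, pat in rules)

    if matches(deny):
        return "deny"
    if matches(allow):
        return "allow"
    return "ask"


# Simplified stand-ins for the rules shown above.
ALLOW = [("Bash", "git diff*"), ("Bash", "git push*")]
DENY = [("Bash", "*--force*"), ("Bash", "rm -rf*")]
```

With these rules, a git push that carries --force is denied even though git push is allowed, and anything matching neither list falls through to a prompt.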

The status line is a process you control

statusLine in settings.json takes a command. The command receives a JSON blob on stdin:

{
 "session_id": "abc123",
 "transcript_path": "/home/user/.claude/projects/.../abc123.jsonl",
 "model": {"id": "claude-opus-4-7", "display_name": "Opus 4.7"},
 "workspace": {"current_dir": "/app"},
 "cost": {"total_cost_usd": 0.847, "total_duration_ms": 124003}
}

I pipe this into a Python script that writes cost.total_cost_usd to a Prometheus textfile collector. My Grafana board shows per-session spend in near-real-time without touching the Anthropic billing API. The blob updates every ~300ms while a session is active.
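A sketch of what such a script might look like, assuming the stdin blob has the shape shown above. The metric name claude_session_cost_usd and both function names are hypothetical, and the session id is assumed to contain no quotes or backslashes that would need label escaping.

```python
import os
import tempfile


def render_metrics(blob):
    """Render the session cost as a Prometheus text-format gauge."""
    session = blob.get("session_id", "unknown")
    cost = float(blob.get("cost", {}).get("total_cost_usd", 0.0))
    return (
        "# TYPE claude_session_cost_usd gauge\n"
        f'claude_session_cost_usd{{session_id="{session}"}} {cost}\n'
    )


def write_textfile(text, path):
    """Write atomically so the collector never reads a half-written file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(text)
    # mkstemp creates the file as 0600; widen it so the exporter can read it.
    os.chmod(tmp, 0o644)
    os.replace(tmp, path)
```

The status line command would parse stdin with json.load, pass the result to render_metrics, and write the text into the collector's directory as a .prom file. The temp-file-then-rename step matters because the script is re-run often while the exporter scrapes on its own schedule.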

Slash commands and skills have a frontmatter field nobody mentions

allowed-tools in a slash command’s frontmatter restricts that command to a subset of tools. If you set it, the command runs in a tighter sandbox than the parent session:

---
allowed-tools: Read, Grep, Glob
description: Read-only repo audit
---

Skills support the same field plus an undocumented model override:

---
name: cheap-summarizer
model: claude-haiku-4-5-20251001
allowed-tools: Read
---

The skill runs on Haiku regardless of what the parent session is using. For one of my pipelines that summarises 200 PRs a day this dropped cost from $0.034/run to $0.004/run. 88% reduction, no quality regression I could measure.
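The arithmetic behind that figure, as a quick check. The daily number assumes one run per PR, which the numbers above suggest but do not state; savings is a hypothetical helper.

```python
def savings(before_usd, after_usd, runs_per_day):
    """Return (percent reduction, dollars saved per day), both rounded."""
    saved_per_run = before_usd - after_usd
    percent = saved_per_run / before_usd * 100
    daily = saved_per_run * runs_per_day
    return round(percent, 1), round(daily, 2)


# $0.034/run down to $0.004/run, at an assumed 200 runs a day.
percent, daily = savings(0.034, 0.004, 200)
```

That works out to a reduction of about 88.2% per run, or roughly $6 a day at 200 runs.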

MCP server config has a quiet timeout knob

MCP servers in .mcp.json accept startupTimeoutMs and requestTimeoutMs. Defaults are 30000 and 60000. Both are undocumented in the public config schema:

{
 "mcpServers": {
 "postgres": {
 "command": "npx",
 "args": ["-y", "@modelcontextprotocol/server-postgres", "${DATABASE_URL}"],
 "startupTimeoutMs": 5000,
 "requestTimeoutMs": 15000
 }
 }
}

Dropping startup timeout from 30s to 5s caught a flaky MCP server in CI within minutes instead of half an hour. The session aborts with a clean error instead of hanging until the parent cron-job’s TimeoutStopSec kicks in and leaves no logs.
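One way to keep that setting from drifting is a small lint over the parsed .mcp.json. This sketch assumes the startupTimeoutMs key as named above; servers_over_limit is a hypothetical helper, not part of any MCP tooling.

```python
def servers_over_limit(config, limit_ms=5000):
    """Return names of servers whose startupTimeoutMs is missing or too high."""
    flagged = []
    for name, server in config.get("mcpServers", {}).items():
        timeout = server.get("startupTimeoutMs")
        if timeout is None or timeout > limit_ms:
            flagged.append(name)
    return sorted(flagged)
```

Run against the output of json.load on the config file in CI, a non-empty result would fail the build before a slow server ever reaches a scheduled job.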

What I’d configure on day one

If I were setting up a fresh production environment tomorrow, in order:

  1. DISABLE_AUTOUPDATER=1 in the systemd unit. Non-negotiable.
  2. SessionEnd hook writing cost + token counts to a ledger.
  3. An API key helper (CLAUDE_CODE_API_KEY_HELPER, above) pointing at a secret manager, not a file.
  4. defaultMode: "acceptEdits", never bypassPermissions.
  5. deny rules for rm -rf, *--force*, .env*, and any path outside the project root.
  6. CLAUDE_CODE_SUBAGENT_MODEL set to Haiku unless the subagent genuinely needs Opus.
  7. A statusLine script piping cost to whatever your metrics backend is.
  8. startupTimeoutMs: 5000 on every MCP server.

None of this is in the quickstart. All of it has paid for itself inside a week on a workload that runs more than a hundred sessions a day.

Code for the foundry stack is at github.com/foundry-stack/foundry. Try Claude Code yourself at claude.com/claude-code.


Contains a referral link.
