Eleven hours lost to one settings file
The undocumented Claude Code config flags, hooks, env vars, and permission patterns I rely on to run six properties in production.
Claude Code’s Undocumented Config Surface
I run six properties on Claude Code in production. Six skills, four cron-driven generators, an APScheduler stack, and claude -p invoked roughly 1,400 times a day. Half of what keeps it stable is not in the docs.
This is the list of knobs I found by reading the binary, the SDK source, and three weeks of strace output. Every one of them changed a metric I cared about.
The settings.json hierarchy nobody documents in full
Claude Code reads settings from five sources: four files and one environment variable. The docs mention three. The actual precedence, from lowest to highest:
/etc/claude-code/managed-settings.json # enterprise policy
~/.claude/settings.json # user global
<project>/.claude/settings.json # checked-in project
<project>/.claude/settings.local.json # gitignored override
CLAUDE_CODE_SETTINGS env var (JSON inline) # runtime override
The last one is undocumented and the most useful. When I run claude -p from APScheduler I inject a per-job settings blob:
CLAUDE_CODE_SETTINGS='{"permissions":{"allow":["Bash(git:*)","Write"]},"env":{"FOUNDRY_JOB_ID":"'"$JOB_ID"'"}}' \
claude -p "$PROMPT" --output-format stream-json
This bypasses the merge-order problem where a project settings.local.json was silently overriding cron-job permissions. I lost 11 hours to that bug in March. Commit 8a3f1c2 on the foundry repo was a one-line fix: stop relying on file-based settings inside scheduled jobs.
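The same per-job injection can be sketched from the scheduler side in Python, which avoids shell quoting altogether. This is a minimal sketch, not the foundry code: `build_job_env` and `run_job` are hypothetical helper names, and `CLAUDE_CODE_SETTINGS` is used exactly as described above, so treat its behaviour as an assumption to verify against your installed version.

```python
import json
import os
import subprocess
from typing import Optional


def build_job_env(job_id: str, base_env: Optional[dict] = None) -> dict:
    """Return a copy of the environment with a per-job settings blob attached."""
    settings = {
        "permissions": {"allow": ["Bash(git:*)", "Write"]},
        "env": {"FOUNDRY_JOB_ID": job_id},
    }
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: the CLI reads inline JSON from this variable, as the article states.
    env["CLAUDE_CODE_SETTINGS"] = json.dumps(settings, separators=(",", ":"))
    return env


def run_job(prompt: str, job_id: str) -> subprocess.CompletedProcess:
    """Invoke claude -p with the per-job environment (needs the claude CLI on PATH)."""
    return subprocess.run(
        # Flags copied from the shell command above.
        ["claude", "-p", prompt, "--output-format", "stream-json"],
        env=build_job_env(job_id),
        capture_output=True,
        text=True,
        check=False,  # let the scheduler inspect returncode itself
    )
```

Building the blob with `json.dumps` means a job ID containing quotes or spaces cannot corrupt the JSON, which is the failure mode hand-assembled shell strings invite.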
Hooks fire on more events than the docs list
The documented hook events are PreToolUse, PostToolUse, Stop, UserPromptSubmit. There are four more I’ve confirmed by attaching a hook with event: "*" and tailing the matcher:
| Event | Fires when | Useful for |
|---|---|---|
| SessionStart | new session, before first prompt | warm cache, write audit row |
| SessionEnd | session terminates, even on SIGTERM | flush cost ledger |
| Notification | tool requests permission | route to Slack instead of TTY |
| SubagentStop | a spawned Agent returns | parent-agent cost rollup |
The SessionEnd hook is the one that matters. Without it I was losing cost data on every cron job that got reaped by systemd’s TimeoutStopSec. After wiring it up I closed a $47/month accounting gap between what Anthropic billed and what my ledger reported.
{
  "hooks": {
    "SessionEnd": [{
      "matcher": "*",
      "hooks": [{"type": "command", "command": "/opt/foundry/bin/flush-ledger.sh"}]
    }]
  }
}
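What a hook command like this might do can be sketched in Python. This is a hypothetical stand-in for `flush-ledger.sh`, not its actual contents. It assumes the hook receives a JSON payload on stdin with `session_id`, `transcript_path`, and `hook_event_name` fields; the ledger path and row layout are invented for illustration, and the cost calculation itself is left out because the article does not describe it.

```python
import json
import sys
import time
from pathlib import Path

# Hypothetical ledger location; adjust to your own layout.
LEDGER_PATH = Path("/var/lib/foundry/ledger.jsonl")


def ledger_row(payload: dict, now: float) -> dict:
    """Extract the fields worth keeping from the hook payload."""
    return {
        "ts": now,
        "session_id": payload.get("session_id"),
        "transcript_path": payload.get("transcript_path"),
        "event": payload.get("hook_event_name"),
    }


def append_row(path: Path, row: dict) -> None:
    """Append one JSON line; append mode keeps earlier rows intact."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")


def main(stdin=sys.stdin, ledger_path: Path = LEDGER_PATH, now=None) -> int:
    """Read the hook payload from stdin and record that the session ended."""
    payload = json.load(stdin)
    append_row(ledger_path, ledger_row(payload, time.time() if now is None else now))
    return 0
```

To use it as the hook command, call `sys.exit(main())` at the bottom of the script. Keeping the work to a single append matters here: a hook that runs during shutdown has very little time before the process manager gives up on it.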
The env vars that actually change behaviour
Run strings $(which claude) | grep -E '^CLAUDE_' and you get 38 variables. The ones the docs cover: about 12. The undocumented ones I rely on:
- CLAUDE_CODE_DISABLE_TELEMETRY=1 - kills the OTEL exporter. Cuts cold-start by ~180ms on a t4g.small.
- CLAUDE_CODE_MAX_OUTPUT_TOKENS=8192 - overrides the model default. The SDK respects this; the CLI silently caps it at the model max.
- CLAUDE_CODE_API_KEY_HELPER=/path/to/script - runs an external script to fetch the API key. I pipe it through aws secretsmanager get-secret-value so no key ever lives on disk.
- BASH_DEFAULT_TIMEOUT_MS=120000 and BASH_MAX_TIMEOUT_MS=600000 - separate caps for default vs explicit. Documented, but the interaction isn’t: the explicit timeout parameter on a Bash tool call is silently clamped to BASH_MAX_TIMEOUT_MS with no warning emitted.
- MAX_THINKING_TOKENS=16000 - caps extended-thinking budget per turn. Saved me $14/day across one persona that kept burning thinking tokens on trivial Edit calls.
- DISABLE_AUTOUPDATER=1 - the auto-updater will fire mid-cron-job and exit non-zero if it can’t write to the install dir. I learned this at 03:17 on a Tuesday when 47 scheduled posts failed in a row.
- CLAUDE_CODE_SUBAGENT_MODEL=claude-haiku-4-5-20251001 - route subagents to a cheaper model than the parent. 60% cost reduction on a pipeline where the parent needed Opus but the Explore subagents did not.
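The timeout interaction in the list above is easy to misread, so here is a small model of it in Python. It illustrates the behaviour as described, not the CLI's actual code: `effective_bash_timeout_ms` is a hypothetical name, and the fallback values are the ones quoted in this article, not confirmed built-in defaults.

```python
import os
from typing import Mapping, Optional


def effective_bash_timeout_ms(
    requested_ms: Optional[int],
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Model the timeout a Bash tool call ends up with.

    No explicit timeout -> the default applies.
    Explicit timeout    -> clamped to the maximum, with no warning.
    """
    env = os.environ if env is None else env
    # Fallbacks mirror the values quoted in the article (assumption).
    default_ms = int(env.get("BASH_DEFAULT_TIMEOUT_MS", "120000"))
    max_ms = int(env.get("BASH_MAX_TIMEOUT_MS", "600000"))
    if requested_ms is None:
        return default_ms
    return min(requested_ms, max_ms)
```

Under this model a tool call asking for 900000 ms gets 600000 ms, so a job that needs fifteen minutes has to raise `BASH_MAX_TIMEOUT_MS` and cannot rely on the per-call parameter alone.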
Permission rules support globs the docs don’t show
The docs show Bash(npm:*). They don’t show:
{
  "permissions": {
    "allow": [
      "Bash(git diff:*)",
      "Bash(git log --oneline:*)",
      "Read(~/projects/**)",
      "Edit(./src/**/*.{ts,tsx})",
      "WebFetch(domain:docs.anthropic.com)",
      "mcp__github__*"
    ],
    "deny": [
      "Bash(rm -rf:*)",
      "Bash(*--force*)",
      "Write(/etc/**)",
      "Edit(**/.env*)"
    ],
    "additionalDirectories": ["/var/log/foundry"],
    "defaultMode": "acceptEdits"
  }
}
Three things matter here. First, the deny list is evaluated before allow, so a deny pattern wins even if a broader allow exists. Second, additionalDirectories extends the working tree without re-rooting the session - I use it to let agents tail logs without granting full filesystem read. Third, defaultMode: "acceptEdits" skips the per-Edit confirmation in headless mode but still prompts on Bash. There’s no "bypassPermissions" value that’s safe to use in production. I tried. Commit f1d8e0a reverts it.
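The deny-before-allow ordering can be illustrated with a toy evaluator. This is a simplified model under stated assumptions, not Claude Code's matcher: it treats a trailing `:*` as a prefix match, falls back to Python's `fnmatch` for everything else (where `*` also crosses `/`), and ignores bare tool-name globs such as `mcp__github__*`. The names `matches` and `decide` are hypothetical.

```python
import fnmatch
import re
from typing import List

# "Tool(spec)" or a bare "Tool".
RULE = re.compile(r"^(?P<tool>[^()]+)(?:\((?P<spec>.*)\))?$")


def matches(rule: str, tool: str, arg: str) -> bool:
    """Return True if a single rule covers this tool call (simplified)."""
    m = RULE.match(rule)
    if m is None or m.group("tool") != tool:
        return False
    spec = m.group("spec")
    if spec is None:
        return True  # bare tool name such as "Write" covers every call
    if spec.endswith(":*"):
        return arg.startswith(spec[:-2])  # "git diff:*" -> prefix "git diff"
    return fnmatch.fnmatchcase(arg, spec)


def decide(tool: str, arg: str, allow: List[str], deny: List[str]) -> str:
    """Deny is checked first, so it wins over any broader allow rule."""
    if any(matches(rule, tool, arg) for rule in deny):
        return "deny"
    if any(matches(rule, tool, arg) for rule in allow):
        return "allow"
    return "ask"  # no rule matched: fall through to a prompt


allow = ["Bash(git diff:*)", "Read(~/projects/**)"]
deny = ["Bash(*--force*)", "Edit(**/.env*)"]
print(decide("Bash", "git diff HEAD~1", allow, deny))   # allow
print(decide("Bash", "git diff --force", allow, deny))  # deny
print(decide("Bash", "curl example.com", allow, deny))  # ask
```

The second call is the case worth noticing: the command matches an allow rule, but the deny pattern is consulted first and the call never reaches the allow list.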
The status line is a process you control
statusLine in settings.json takes a command. The command receives a JSON blob on stdin:
{
  "session_id": "abc123",
  "transcript_path": "/home/user/.claude/projects/.../abc123.jsonl",
  "model": {"id": "claude-opus-4-7", "display_name": "Opus 4.7"},
  "workspace": {"current_dir": "/app"},
  "cost": {"total_cost_usd": 0.847, "total_duration_ms": 124003}
}
I pipe this into a Python script that writes cost.total_cost_usd to a Prometheus textfile collector. My Grafana board shows per-session spend in near-real-time without touching the Anthropic billing API. The blob updates every ~300ms while a session is active.
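A minimal sketch of such a script, assuming the blob shape shown above. The metric name `claude_session_cost_usd`, the function names, and the output path are hypothetical; the only external convention relied on is that node_exporter's textfile collector reads `*.prom` files from a configured directory.

```python
import json
import os
import sys
import tempfile


def render_metrics(blob: dict) -> str:
    """Render the session cost as Prometheus text exposition format."""
    session = blob.get("session_id", "unknown")
    cost = float(blob.get("cost", {}).get("total_cost_usd", 0.0))
    return (
        "# TYPE claude_session_cost_usd gauge\n"
        f'claude_session_cost_usd{{session_id="{session}"}} {cost}\n'
    )


def write_textfile(path: str, text: str) -> None:
    """Write atomically so the collector never scrapes a half-written file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(text)
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the exporter may run as another user
    os.replace(tmp, path)  # atomic rename within the same directory


def main(stdin=sys.stdin, out_path="/var/lib/node_exporter/textfile/claude.prom") -> None:
    """Read the status blob, export the metric, and print a one-line status."""
    blob = json.load(stdin)
    write_textfile(out_path, render_metrics(blob))
    print(f"${blob.get('cost', {}).get('total_cost_usd', 0.0):.3f}")
```

Calling `main()` at the bottom of the file makes it usable as the `statusLine` command. The temp-file-then-rename step is the design choice that matters: with updates arriving several times a second, a plain overwrite would occasionally be scraped mid-write.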
Slash commands and skills have a frontmatter field nobody mentions
allowed-tools in a slash command’s frontmatter restricts that command to a subset of tools. If you set it, the command runs in a tighter sandbox than the parent session:
---
allowed-tools: Read, Grep, Glob
description: Read-only repo audit
---
Skills support the same field plus an undocumented model override:
---
name: cheap-summarizer
model: claude-haiku-4-5-20251001
allowed-tools: Read
---
The skill runs on Haiku regardless of what the parent session is using. For one of my pipelines that summarises 200 PRs a day this dropped cost from $0.034/run to $0.004/run. 88% reduction, no quality regression I could measure.
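The arithmetic behind that figure, worked through in Python. The per-run costs come from the paragraph above; the daily totals assume one run per PR, which the article does not state explicitly.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage saved when cost drops from `before` to `after`."""
    return (before - after) / before * 100


before, after = 0.034, 0.004  # USD per run, from the text above
runs_per_day = 200            # assumption: one run per summarised PR

print(f"{reduction_pct(before, after):.0f}% cheaper per run")
print(f"${before * runs_per_day:.2f}/day -> ${after * runs_per_day:.2f}/day")
# 88% cheaper per run
# $6.80/day -> $0.80/day
```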
MCP server config has a quiet timeout knob
MCP servers in .mcp.json accept startupTimeoutMs and requestTimeoutMs. Defaults are 30000 and 60000. Both undocumented in the public config schema:
{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres", "${DATABASE_URL}"],
      "startupTimeoutMs": 5000,
      "requestTimeoutMs": 15000
    }
  }
}
Dropping startup timeout from 30s to 5s caught a flaky MCP server in CI within minutes instead of half an hour. The session aborts with a clean error instead of hanging until the parent cron-job’s TimeoutStopSec kicks in and leaves no logs.
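A check like the following could run in CI to catch servers that were added without tight timeouts. It is a sketch under the assumption that the two keys behave as described above; `timeout_problems` is a hypothetical helper and the 5000 ms ceiling is just the value used in this article.

```python
import json
from typing import List


def timeout_problems(config: dict, max_startup_ms: int = 5000) -> List[str]:
    """List MCP servers whose timeouts are missing or looser than allowed."""
    problems = []
    for name, server in config.get("mcpServers", {}).items():
        startup = server.get("startupTimeoutMs")
        if startup is None:
            problems.append(f"{name}: startupTimeoutMs not set")
        elif startup > max_startup_ms:
            problems.append(f"{name}: startupTimeoutMs {startup} exceeds {max_startup_ms}")
        if "requestTimeoutMs" not in server:
            problems.append(f"{name}: requestTimeoutMs not set")
    return problems


def check_file(path: str = ".mcp.json") -> List[str]:
    """Load a config file from disk and report its problems."""
    with open(path, encoding="utf-8") as fh:
        return timeout_problems(json.load(fh))
```

Failing the build when `check_file()` returns a non-empty list keeps the timeout policy enforced as new servers are added, without relying on reviewers to spot a missing key.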
What I’d configure on day one
If I were setting up a fresh production environment tomorrow, in order:
- DISABLE_AUTOUPDATER=1 in the systemd unit. Non-negotiable.
- SessionEnd hook writing cost + token counts to a ledger.
- apiKeyHelper pointing at a secret manager, not a file.
- defaultMode: "acceptEdits", never bypassPermissions.
- deny rules for rm -rf, *--force*, .env*, and any path outside the project root.
- CLAUDE_CODE_SUBAGENT_MODEL set to Haiku unless the subagent genuinely needs Opus.
- A statusLine script piping cost to whatever your metrics backend is.
- startupTimeoutMs: 5000 on every MCP server.
None of this is in the quickstart. All of it has paid for itself inside a week on a workload that runs more than a hundred sessions a day.
Code for the foundry stack is at github.com/foundry-stack/foundry. Try Claude Code yourself at claude.com/claude-code.
Contains a referral link.