Turning 1.2M Personal Messages Into a Life Vault With LLMs

A developer pulled 20 years of his own chat history out of ICQ, VK, Twitter, Facebook, Instagram, and Telegram using GDPR data-access requests, then built a pipeline to convert 1.2 million messages into a structured personal knowledge base. The parsing was a mess of platform quirks: Instagram double-encodes Cyrillic, Telegram reassigns message IDs between exports, and Facebook scatters the same messages across three folders thanks to E2E encryption. Once everything was normalised to a uniform format, two bigger problems remained. The first was noise: in his longest thread, 41% of messages were emoji, links, fillers, or media. The second was identity resolution, since the same person appears across platforms under different usernames, and Slavic diminutives like ‘Sasha’ can map to many people of any gender.
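Two of these cleanup steps are easy to illustrate. The sketch below is not the author's code: it assumes the double-encoding is the common case of UTF-8 bytes mis-read as Latin-1, and the filler list and function names (`fix_double_encoding`, `is_noise`, `FILLERS`) are hypothetical.

```python
import re

# Hypothetical filler list; a real pipeline would tune this per language and thread.
FILLERS = {"ok", "lol", "haha", "yeah", "ок", "ага"}
URL_ONLY = re.compile(r"^\s*https?://\S+\s*$")


def fix_double_encoding(text: str) -> str:
    """Undo UTF-8 bytes that were mis-decoded as Latin-1 (assumed failure mode)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip failed, so the text was not double-encoded; keep it as-is.
        return text


def is_noise(text: str) -> bool:
    """Flag messages with no usable words: empty/media, bare links, emoji, fillers."""
    stripped = text.strip()
    if not stripped or URL_ONLY.match(stripped):
        return True
    if not any(ch.isalnum() for ch in stripped):
        return True  # emoji or punctuation only
    return stripped.lower().strip("!?.") in FILLERS
```

The repair function only changes a string when the Latin-1 to UTF-8 round trip succeeds, so correctly encoded Cyrillic passes through untouched. A heuristic this crude would only be a first pass; the borderline cases are where the summary says heuristics fell short.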

Heuristics, NER, and fine-tuned classifiers topped out around 75% F1 on event detection, which at this scale would inject roughly 12,000 false events into the vault. He switched to LLMs for both name resolution and life-event classification, getting the false-positive rate under 1% on chunks below 6,000 messages. The model never writes directly to the vault: it emits a structured JSON manifest of daily-note bullets, entity facts, and unresolved ambiguities, which a deterministic script then injects with provenance markers pointing back to source messages in an SQLite store.
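The manifest-then-inject split can be sketched roughly as follows. The manifest field names (`daily_notes`, `date`, `text`, `source_ids`), the `messages` table, and the `[src:…]` marker format are all assumptions made for illustration; the original schema is not described in that detail.

```python
import json
import sqlite3


def _message_exists(db: sqlite3.Connection, msg_id: int) -> bool:
    # Assumes a hypothetical `messages` table keyed by integer id.
    row = db.execute("SELECT 1 FROM messages WHERE id = ?", (msg_id,)).fetchone()
    return row is not None


def inject_manifest(
    manifest_json: str,
    db: sqlite3.Connection,
    vault: dict[str, list[str]],
) -> list[str]:
    """Append manifest bullets to daily notes; return the texts that were rejected.

    A bullet is written only if every source id it cites exists in the store,
    so nothing reaches the vault without traceable provenance.
    """
    manifest = json.loads(manifest_json)
    rejected = []
    for bullet in manifest.get("daily_notes", []):
        ids = bullet.get("source_ids", [])
        if not ids or not all(_message_exists(db, i) for i in ids):
            rejected.append(bullet.get("text", ""))
            continue
        marker = " ".join(f"[src:{i}]" for i in ids)
        vault.setdefault(bullet["date"], []).append(f"- {bullet['text']} {marker}")
    return rejected
```

Here `vault` is an in-memory stand-in for the note files, mapping a date to its bullet lines. The point of the design is that the model's output is treated as untrusted data: the script, not the model, decides what gets written, and any bullet citing a message that is not in the store is dropped rather than injected.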

The cost is non-trivial: roughly 15–20 billion tokens, which comes to about $15k on Opus or 10–15 weeks of local Qwen3-30B inference on an M5 Pro. The incidental findings are arguably more interesting than the CRM itself. His vocabulary novelty rate has plateaued at 6% since his mid-20s, and the exercise surfaced patterns about emotional bandwidth and friendship half-lives that journaling never caught.
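As a back-of-the-envelope check on those cost figures, the sketch below derives the blended price per million tokens and the sustained local throughput they imply. It uses only the numbers quoted above; how the ends of each range pair up is an assumption, and the helper names are hypothetical.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def usd_per_million_tokens(total_usd: float, total_tokens: float) -> float:
    """Blended price implied by a total bill and a total token count."""
    return total_usd * 1_000_000 / total_tokens


def tokens_per_second(total_tokens: float, weeks: float) -> float:
    """Sustained throughput needed to process the tokens in the given time."""
    return total_tokens / (weeks * SECONDS_PER_WEEK)


# Print every pairing of the quoted range endpoints.
for tokens in (15e9, 20e9):
    price = usd_per_million_tokens(15_000, tokens)
    print(f"{tokens / 1e9:.0f}B tokens for $15k: ${price:.2f} per million")
    for weeks in (10, 15):
        rate = tokens_per_second(tokens, weeks)
        print(f"  {weeks} weeks locally: {rate:,.0f} tokens/s sustained")
```

Taken at face value, the quoted totals work out to about $0.75 to $1.00 per million tokens on the hosted route, and somewhere between roughly 1,650 and 3,300 tokens per second of sustained local processing, depending on which ends of the ranges go together.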