Engineering Articles
Real posts from engineering blogs — broken down for system design interviews, with a quiz to test what you learned.
How Google made Gemma 4 3x faster (without changing the model)
Google released Multi-Token Prediction (MTP) drafters for Gemma 4, achieving up to 3x inference speedup with zero quality degradation. The technique, called speculative decoding, pairs a small, fast drafter model with the large target model. The drafter guesses several tokens ahead; the target verifies them all in one parallel pass. This post explains how it works, why it works, and when it does not.
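The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Google's implementation: it uses the simple greedy variant (a drafted token is accepted only if it equals the target's own greedy choice), the per-position target calls stand in for what a real system would do in one batched forward pass, and `toy_target`, `toy_drafter`, and `speculative_decode` are hypothetical names invented for the example.

```python
from typing import Callable, List, Tuple

# A "model" here is just a greedy next-token function: context -> token id.
NextToken = Callable[[List[int]], int]

def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out

def speculative_decode(target: NextToken, drafter: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The drafter guesses k tokens ahead, one at a time (cheap).
        draft: List[int] = []
        for _ in range(k):
            draft.append(drafter(out + draft))
        # 2. The target scores all k+1 positions. Simulated with a loop here;
        #    a real system would do this in a single parallel forward pass.
        target_passes += 1
        verified = [target(out + draft[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix on which drafter and target agree.
        n_ok = 0
        while n_ok < k and draft[n_ok] == verified[n_ok]:
            n_ok += 1
        out.extend(draft[:n_ok])
        # The target's own token at the first mismatch (or a bonus token if
        # everything matched) is always correct, so every pass makes progress.
        out.append(verified[n_ok])
    return out[:goal], target_passes

# Hypothetical toy models: the drafter agrees with the target except when the
# last token is divisible by 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 11

def toy_drafter(ctx: List[int]) -> int:
    t = toy_target(ctx)
    return t if ctx[-1] % 4 else (t + 1) % 11

tokens, passes = speculative_decode(toy_target, toy_drafter, [1], n_new=20)
```

Because every emitted token is either confirmed or produced by the target, the output matches plain greedy decoding with the target alone; the only thing that changes is how many target passes are needed. When the drafter is often wrong, most passes yield a single token and the speedup disappears.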
Harness Engineering: How OpenAI Ships Without Writing Code
OpenAI ran a five-month experiment building an internal product with zero manually written code: ~1,500 PRs, ~1 million lines of code, a team that grew from 3 to 7 engineers. The post describes the discipline behind it: how they made the codebase legible to agents, structured documentation for progressive disclosure, enforced architecture mechanically, redesigned merge philosophy for agent throughput, and managed entropy with background cleanup agents.
How Cloudflare Cut Cold Starts 10x: From TLS Pre-Warming to Consistent Hash Sharding
Cloudflare Workers started with 5ms cold starts that were hidden behind TLS handshakes. As Workers grew to support full applications (10 MB scripts, 400ms startup budgets), cold starts outgrew the TLS handshake, and the original trick stopped working. This post covers both generations of their solution: the TLS SNI pre-warming trick and the consistent hash ring sharding system that ultimately cut eviction rates 10x and pushed warm request rates to 99.99%.
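For readers new to consistent hashing, here is a minimal sketch of a hash ring that maps a Worker script ID to a server, so repeated requests for the same script land on the same machine and find it already warm. It illustrates the general technique only; the class name `HashRing`, the virtual-node count, and the server and script names are assumptions made for the example, not details of Cloudflare's system.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple

def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

class HashRing:
    """Consistent hash ring with virtual nodes (hypothetical sketch)."""

    def __init__(self, nodes: Iterable[str], vnodes: int = 64) -> None:
        # Each server is placed at `vnodes` pseudo-random points on the ring
        # so load spreads evenly.
        self._ring: List[Tuple[int, str]] = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        if not self._ring:
            raise ValueError("ring needs at least one node")
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        """Return the owner of `key`: the first ring point clockwise from its hash."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["server-a", "server-b", "server-c", "server-d"])
owner = ring.lookup("script-1234")  # the same script always maps to the same server
```

The property that matters for cold starts is stability: if one server leaves the ring, only the scripts it owned move elsewhere, and every other script keeps its warm instance.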
Scaling PostgreSQL to power 800 million ChatGPT users
OpenAI runs ChatGPT for 800 million users on a single-primary PostgreSQL instance with ~50 read replicas and no sharding. Over the past year, database load grew 10x. This post covers every optimization they made to keep it running: connection pooling, cache stampede prevention, workload isolation, rate limiting, and safe schema management.
Inside OpenAI's In-House Data Agent: From Question to Insight in Minutes
OpenAI built a bespoke internal AI data agent that lets any employee, not just data engineers, go from natural language question to verified insight in minutes. The agent is powered by GPT-5.2, uses Codex to deeply understand table semantics from source code, retrieves context via RAG over 70k datasets (600 PB), and continuously self-improves through a layered memory system. The post breaks down its six-layer context architecture, conversational reasoning loop, eval-driven quality assurance, and key lessons in agent design.
Building AI-Powered Subtitles at Vimeo
Vimeo's AI subtitle system uses LLMs to translate video subtitles across 9+ languages. The core challenge: LLMs optimize for fluency and merge fragmented speech into clean sentences, breaking subtitle timing sync. Their fix is a three-phase "split-brain" pipeline that separates creative translation from structural line mapping, with a self-healing fallback chain that guarantees 100% of subtitle slots are filled.
How Cursor Built Fast Regex Search with N-Gram Indexing
Cursor's AI-powered code editor needs sub-second regex search across entire codebases to keep agents productive. Ripgrep alone takes 15+ seconds on large monorepos because it scans every file. This post covers how Cursor built a client-side n-gram inverted index that narrows candidates before ripgrep runs: trigrams, bloom filter masks, sparse n-grams with frequency-based weight functions, memory-mapped file formats, and why the index lives on the client rather than a server.
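The core idea of narrowing candidates with an inverted index can be sketched as follows. This is a simplified illustration, not Cursor's implementation: it handles only a literal substring query (decomposing a full regex into required trigrams is more involved), it uses plain trigrams without bloom masks or sparse n-grams, Python's `re` stands in for the ripgrep verification step, and `TrigramIndex` and the sample files are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

def trigrams(text: str) -> Set[str]:
    """All overlapping 3-character substrings of `text`."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    """Inverted index: trigram -> set of file paths containing it."""

    def __init__(self, files: Dict[str, str]) -> None:
        self.files = files
        self.postings: Dict[str, Set[str]] = defaultdict(set)
        for path, body in files.items():
            for gram in trigrams(body):
                self.postings[gram].add(path)

    def candidates(self, literal: str) -> Set[str]:
        """Files containing every trigram of `literal` (may include false positives)."""
        grams = trigrams(literal)
        if not grams:
            return set(self.files)  # query too short to filter; scan everything
        return set.intersection(*(self.postings.get(g, set()) for g in grams))

    def search(self, literal: str) -> List[str]:
        """Verify candidates with a real scan, which removes false positives."""
        pattern = re.compile(re.escape(literal))
        return sorted(p for p in self.candidates(literal) if pattern.search(self.files[p]))

index = TrigramIndex({
    "a.py": "def parse_config(path): ...",
    "b.py": "def parse_args(argv): ...",
    "c.py": "print('hello')",
})
hits = index.search("parse_config")
```

The index never produces false negatives for a literal query, because any file containing the literal must contain all of its trigrams. It can produce false positives, which is why the expensive scanner still runs, but only over the short candidate list instead of every file.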
How OpenAI built Codex: inside the agent loop and harness
OpenAI's Codex powers a cross-platform coding agent (CLI, web, VS Code, macOS app) from a single shared harness. Two engineering posts reveal exactly how that harness works: the agent loop that orchestrates model inference and tool calls, the prompt structure and caching strategy that keeps it efficient, and the App Server JSON-RPC protocol that lets every client surface share the same core.
How OpenAI delivers low-latency voice AI to 900 million users
Voice AI feels natural only when it responds within ~300ms. This post unpacks how OpenAI rearchitected its WebRTC infrastructure to hit that bar at the scale of ChatGPT voice: a relay + transceiver split that keeps the public UDP surface tiny, routing metadata encoded into a protocol-native field for first-packet routing, and geo-steering that sends users to nearby ingress points worldwide.