Unified AI systems delivery framework

A practical framework for designing AI systems end to end: define the loop, choose the intelligence layer, wire orchestration, evaluate rigorously, and plan for production constraints.

Breakdown

1. Modern AI systems are more than models

Modern AI systems are not just models. They are systems where multiple types of intelligence are composed together, connected through infrastructure, and continuously improved through feedback. Understanding how to design these systems end to end is one of the most important skills an engineer can develop right now.

At a high level, every AI system has two layers: an intelligence layer, which makes decisions or predictions, and an orchestration layer, which wires everything together and delivers it to users. Most engineers over-focus on the intelligence layer because that is what courses teach. The best engineers understand both, and know when each matters more.

2. Step 1: define the system and loop type

Before touching any models, understand what kind of system you are building, because the architecture, evaluation strategy, and failure modes differ dramatically depending on the answer.

Interaction systems are powered by LLMs and handle tasks that require natural language understanding, response generation, or transforming unstructured input into something useful. Examples include ChatGPT, Claude document analysis, Notion AI writing assistance, and GitHub Copilot inline code suggestions. These systems are evaluated on usefulness, coherence, and flexibility. They need to handle ambiguity gracefully.

Prediction systems are traditional ML systems used for ranking, classification, regression, or anomaly detection. Examples include TikTok content ranking, Spotify recommendations, Stripe fraud detection, and Google ad click-through prediction. These systems are evaluated on accuracy, stability, and calibration. They need to be reliable and fast at inference time.

Hybrid systems combine both. This is increasingly where the frontier is. Perplexity uses an LLM to reason about search results, but ranking and retrieval systems underneath surface candidates. LinkedIn can use ML to rank jobs and LLMs to personalize explanations. Cursor combines code indexing and retrieval with LLM generation. Most enterprise AI products being built now are hybrids whether or not their teams recognize it.

This distinction matters because the design philosophy shifts. Interaction systems are built around context and reasoning. Prediction systems are built around features and labels. Hybrids require coordination between both, and that coordination is often where things fall apart.

3. Step 2: clarify the user task and success criteria

Every system exists to solve a user problem. If that problem is vague, the system will be vague too, and you will not know what to optimize for.

Define what the user is trying to accomplish and what "working" actually means. This includes qualitative signals and quantitative metrics.

Anthropic defines success for Claude differently depending on the task. For a coding task, success might mean the code runs and handles edge cases. For a document summarization task, success might mean the key points are preserved and nothing is hallucinated. For an open-ended conversation, success might mean the user felt heard and got something useful. Each task implies different eval strategies and different production monitoring.

OpenAI's approach to ChatGPT evolved as the task became clearer. Early versions were evaluated largely on fluency. Once RLHF became central, success shifted toward helpfulness and harmlessness as human raters defined it. That shift in success criteria changed what the model optimized for.

Success criteria are not just metrics. They are design decisions. Optimizing for user retention, task completion, or safety will produce meaningfully different systems. Decide this explicitly and early, or the system will optimize for the wrong thing by default.
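One way to act on that is to write the criteria down as code before building anything. A minimal sketch, with hypothetical task names and checks rather than any vendor's actual rubric:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SuccessCriteria:
    """Explicit definition of what "working" means for one task type."""
    task: str
    score: Callable[[dict], float]  # quantitative metric over an interaction log
    qualitative: str                # what human reviewers should look for

# Hypothetical criteria, decided explicitly and early.
CRITERIA = [
    SuccessCriteria(
        task="coding",
        score=lambda log: 1.0 if log.get("tests_passed") else 0.0,
        qualitative="Code runs and handles edge cases.",
    ),
    SuccessCriteria(
        task="summarization",
        score=lambda log: log.get("key_points_kept", 0)
        / max(log.get("key_points_total", 1), 1),
        qualitative="Key points preserved, nothing hallucinated.",
    ),
]
```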

4. Step 3: design the intelligence layer

This is where you decide how intelligence is implemented in the system.

You have two primary tools: ML models and LLMs. The key decision is not choosing one over the other. It is deciding what each one is responsible for.

ML models map inputs to outputs. They are optimized offline on labeled data and expected to be stable, fast, and predictable at inference time. They excel when you have historical data and a clear target to predict. Spotify recommendation models ingest listening history, contextual signals, and item embeddings to predict what a user may like next. The system is fast, runs at enormous scale, and does not need to reason. It needs to predict well.

LLMs are runtime reasoning systems. They interpret unstructured input, make decisions, call tools, and generate structured outputs. They are valuable when the input space is unbounded, when tasks require multi-step reasoning, or when the system needs to handle novel situations. When you ask Claude to analyze a legal document, it is not looking up a cached answer. It is reasoning through the text in real time.

The interesting design territory is the boundary between them. GitHub Copilot uses retrieval over code context to find relevant repository information, then an LLM uses that context to generate completions. The retrieval layer handles search and ranking efficiently. The LLM handles generation and reasoning. Splitting work this way lets each component do what it does best.
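A minimal sketch of that boundary, assuming a precomputed `code_index` of (embedding, snippet) pairs and a `call_llm` function; the specifics are illustrative, the split is the point:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve_candidates(query_embedding, code_index, k=5):
    """Cheap, fast ML-side work: rank indexed snippets by similarity."""
    scored = sorted(code_index, key=lambda pair: dot(query_embedding, pair[0]), reverse=True)
    return [snippet for _, snippet in scored[:k]]

def complete(task, query_embedding, code_index, call_llm):
    """Expensive, flexible LLM-side work: reason over the retrieved context."""
    context = "\n\n".join(retrieve_candidates(query_embedding, code_index))
    return call_llm(f"Relevant repository context:\n{context}\n\nTask: {task}")
```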

Google's search generative experiences follow a similar pattern. Ranking systems surface candidate documents using traditional ML. An LLM synthesizes those results into a coherent answer. Ranking with an LLM would be too slow and expensive. Synthesis with a traditional ML model would lack the ability to reason across documents. You need both.

A common mistake is forcing LLMs to do things better handled by ML, such as predicting exact numerical values or classifying millions of items per second. The reverse mistake is trying to use a simple classifier for a task that requires language understanding and nuance.

5. Step 4: design the orchestration layer

This is the layer most engineers underinvest in, and it is where many real-world systems succeed or quietly fail.

Once you define the intelligence layer, design how everything is wired together. This includes context construction, tool access, data retrieval, state management, and control flow.

Context construction decides what information the model sees at runtime. This is critical for LLM systems because the model depends on what you put in the context window. When Claude analyzes a contract, Anthropic's infrastructure decides what to include, how to chunk the document, and how to structure the prompt. When Cursor gives code suggestions, the orchestration layer assembles relevant files, recent edits, and cursor position into a coherent prompt before the LLM sees it. If this is wrong, the model reasons from bad or incomplete information no matter how capable it is.
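Here is a minimal sketch of that assembly step, in the spirit of the Cursor example; the section names and character budget are assumptions, and real systems chunk and rank far more carefully:

```python
def build_context(open_file, recent_edits, related_snippets, budget_chars=8000):
    """Assemble what the model sees, most important material first."""
    sections = [
        ("Current file", open_file),
        ("Recent edits", "\n".join(recent_edits)),
        ("Related code", "\n\n".join(related_snippets)),
    ]
    parts, used = [], 0
    for title, body in sections:
        piece = f"## {title}\n{body}\n"
        if used + len(piece) > budget_chars:
            piece = piece[: budget_chars - used]  # crude truncation; real systems chunk smarter
        parts.append(piece)
        used += len(piece)
        if used >= budget_chars:
            break
    return "\n".join(parts)
```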

Tool access defines what actions the model can take beyond generating text. OpenAI function calling and Anthropic tool use allow LLMs to invoke external systems, such as databases, APIs, and code execution environments. The tools you give the model, and how those tools are defined, determine what the system can actually accomplish. GPT-4 with access to a code interpreter can solve problems that GPT-4 without tools cannot, even though the base model is the same.
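A sketch of what a tool definition and dispatch can look like. The shape below follows the general JSON-schema style of OpenAI function calling and Anthropic tool use, but the exact wire format differs by vendor, and `lookup_order` is a hypothetical tool:

```python
# Hypothetical tool definition in the general JSON-schema style these APIs use.
lookup_order_tool = {
    "name": "lookup_order",
    "description": "Fetch an order record by ID from the orders database.",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Unique order identifier."},
        },
        "required": ["order_id"],
    },
}

def dispatch_tool_call(name, arguments, registry):
    """Orchestration-layer glue: route a model-requested call to real code."""
    handler = registry[name]  # fail loudly on unknown tool names
    return handler(**arguments)
```

The `description` fields matter as much as the code behind the tool: they are the only documentation the model sees.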

Retrieval extends the model's effective knowledge beyond its training data and context window. RAG systems use vector databases and other retrieval systems to pull relevant information at query time. Perplexity does this against the live web. Enterprise RAG systems do this against internal documents, codebases, and knowledge bases. Retrieval quality directly determines output quality. A powerful LLM with poor retrieval will hallucinate or give generic answers. A modest LLM with excellent retrieval can outperform it.
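A minimal RAG sketch, assuming an `embed_fn` and a `doc_store` of chunks with precomputed embeddings:

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def rag_answer(question, embed_fn, doc_store, call_llm, k=3):
    """Retrieve the k most similar chunks, then ground the LLM in them."""
    q_emb = embed_fn(question)
    ranked = sorted(doc_store, key=lambda d: cosine(q_emb, d["embedding"]), reverse=True)
    context = "\n\n".join(d["text"] for d in ranked[:k])
    return call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
```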

State management maintains continuity across steps or sessions. This matters enormously for agentic systems. When Claude operates as an agent, browsing the web, writing files, or running code, it needs to track what it has done, what it observed, and what it plans to do next. Longer-horizon tasks require explicit state management, often external to the model itself.
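A minimal sketch of state kept external to the model; the fields are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """State kept outside the model, surviving across agent steps."""
    goal: str
    actions_taken: list = field(default_factory=list)  # what the agent has done
    observations: list = field(default_factory=list)   # what it saw as a result

    def record(self, action, observation):
        self.actions_taken.append(action)
        self.observations.append(observation)

    def summary(self, last_n=5):
        """Compact recent history to re-inject into the context window each step."""
        recent = zip(self.actions_taken[-last_n:], self.observations[-last_n:])
        return "\n".join(f"did: {a} -> saw: {o}" for a, o in recent)
```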

The orchestration layer is not glamorous, but it is decisive. Two teams using the same base LLM can produce dramatically different products based on how well they build this layer.

6. Step 5: choose a model strategy

Now decide how models are deployed and used across the system.

The shift is from thinking about "which model" to thinking about "what model strategy."

Model routing is increasingly common. Rather than sending every request to the most capable and expensive model, systems route based on task complexity. Anthropic model families such as Haiku, Sonnet, and Opus are designed with this in mind. A simple formatting task can go to a smaller model. A complex multi-step reasoning task can go to a stronger model. Routing correctly can cut costs while preserving quality where it matters. OpenAI GPT-4o mini serves a similar cost and latency role.
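A routing sketch. The complexity heuristic and tier names are hypothetical; production routers often use a small trained classifier instead:

```python
def estimate_complexity(request: str) -> int:
    """Hypothetical heuristic: long or reasoning-heavy requests score higher."""
    score = 0
    if len(request) > 2000:
        score += 1
    if any(word in request.lower() for word in ("plan", "prove", "debug", "refactor")):
        score += 1
    return score

def route(request: str) -> str:
    """Send cheap work to small models, hard work to strong ones."""
    tiers = {0: "small-fast-model", 1: "mid-tier-model", 2: "frontier-model"}
    return tiers[min(estimate_complexity(request), 2)]
```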

Fine-tuning adapts a base model to a specific domain or task using additional training data. OpenAI offers fine-tuning for GPT-4o. Companies in legal AI and medical AI use fine-tuned models because general-purpose models may lack the domain depth they need out of the box. Fine-tuning is not always the right choice. It requires data, infrastructure, and ongoing maintenance. But when the task is narrow, well-defined, and supported by enough training data, it can produce meaningful gains.

Ensemble and fallback strategies help production systems handle model failures or uncertainty. You might run two models and compare answers. You might fall back to a simpler model when latency is critical. You might route to a human reviewer when confidence is low. Stripe fraud detection does not rely on a single model. It layers multiple signals and models, each catching different patterns.
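A sketch of layered fallback, assuming `primary` and `fast_fallback` are model-calling functions with the interfaces shown and `confidence_fn` scores an answer:

```python
def answer_with_fallback(request, primary, fast_fallback, confidence_fn,
                         human_queue, min_confidence=0.7, timeout_s=5.0):
    """Try the strong model first; degrade gracefully instead of failing hard."""
    try:
        result = primary(request, timeout=timeout_s)
    except TimeoutError:
        result = fast_fallback(request)        # cheaper model when latency bites
    if confidence_fn(request, result) < min_confidence:
        human_queue.append((request, result))  # low confidence -> human review
        return None
    return result
```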

Good systems are not built on a single model. They are built on a model strategy that accounts for cost, latency, capability, reliability, and improvement over time.

7. Step 6: design evaluation before scaling

Evaluation keeps the system grounded in reality, and it needs to be designed before the system becomes too complex to reason about.

Most teams underinvest here and pay for it later. Without evals, you cannot tell if a change improved the system. You cannot detect regressions. You cannot make a credible internal case that the system is working.

For ML systems, offline evaluation is relatively mature. You hold out a test set and measure accuracy, precision, recall, AUC, or ranking metrics. The challenge is ensuring offline metrics predict online performance. Recommendation teams regularly find that optimizing a proxy metric fails to improve user satisfaction. The eval may be measuring the wrong thing.

For LLM systems, evaluation is harder and less standardized. Task-based evals check whether the system completes a task correctly: does the code run, does the summary contain the key points, does extracted data match the source? Human preference evals ask raters to compare outputs and choose the better one. This is central to RLHF. Model-based evals use a separate LLM as a judge, which scales better than human rating but introduces its own biases. Behavioral tests check edge cases, adversarial inputs, and sensitive situations.
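A minimal task-based eval harness, with hypothetical cases for a summarizer; each case pairs an input with a programmatic check:

```python
def run_evals(system, cases):
    """Return the pass rate plus the failures worth inspecting by hand."""
    passed, failures = 0, []
    for case in cases:
        output = system(case["input"])
        if case["check"](output):
            passed += 1
        else:
            failures.append((case["name"], output))
    return passed / len(cases), failures

# Hypothetical cases: version these alongside the system they test.
cases = [
    {"name": "keeps the key figure",
     "input": "Summarize: Revenue grew 40% in Q3, driven by enterprise sales.",
     "check": lambda out: "40%" in out},
    {"name": "does not invent figures",
     "input": "Summarize: The team shipped three features this quarter.",
     "check": lambda out: "%" not in out},
]
```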

Anthropic's Constitutional AI is a useful example of evaluation thinking baked into training. Instead of relying only on human raters, a model critiques outputs against a set of principles, and that critique signal helps improve the main model. The evaluation system becomes part of the training loop.

Treat evals as a product, not an afterthought. Define them early, version them, and treat a drop in eval scores the same way you would treat a drop in conversion rate: as a signal something is wrong.

8. Step 7: define the runtime workflow

The runtime workflow is the execution path of the system: what happens step by step when a user interacts with it.

For simple systems, this may be one inference call. The user sends a message and the model responds. Most interesting systems are more complex.

For agentic LLM systems, the common pattern is a loop: plan, act, observe, repeat. The model decides what to do, takes an action using a tool, observes the result, and decides what to do next. OpenAI deep research follows this style. It decomposes a research question into subqueries, searches the web iteratively, synthesizes findings, and produces a report. The workflow involves many model calls, tool invocations, and intermediate reasoning steps.
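The loop itself is small; the hard parts are the `decide_next` policy and the tools. A sketch with assumed interfaces:

```python
def agent_loop(goal, decide_next, tools, max_steps=10):
    """Plan, act, observe, repeat until the model declares the task done."""
    history = []
    for _ in range(max_steps):
        step = decide_next(goal, history)   # model picks a tool call or finishes
        if step["action"] == "finish":
            return step["answer"]
        observation = tools[step["action"]](**step["arguments"])
        history.append({"step": step, "observation": observation})
    return None  # step budget exhausted: surface as a workflow failure, not a crash
```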

For hybrid systems, the workflow often includes handoffs between components. A user asks a question. A retrieval system finds relevant documents. An LLM synthesizes an answer. An ML ranking model scores and orders follow-up suggestions. Each handoff is a potential failure point. The workflow needs to be explicit, tested, and observable.

When you use Cursor to refactor a function, the workflow is roughly: parse intent, retrieve relevant code context using embeddings, construct a prompt with that context, call the LLM, post-process the output, apply it to the file, and check for syntax errors. Each step is a distinct component. If any step fails or produces bad output, the whole experience degrades.
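That kind of workflow can be made explicit and observable with very little machinery. A sketch, where the stage functions named in the comment are hypothetical stand-ins for the steps above:

```python
def run_pipeline(request, stages):
    """Run named stages in order; failures report which handoff broke."""
    artifact = request
    for name, stage in stages:
        try:
            artifact = stage(artifact)
        except Exception as exc:
            raise RuntimeError(f"workflow failed at stage '{name}'") from exc
    return artifact

# Hypothetical usage mirroring the refactor example:
# stages = [("parse_intent", parse_intent), ("retrieve_context", retrieve_context),
#           ("build_prompt", build_prompt), ("call_llm", call_llm),
#           ("postprocess", postprocess), ("apply_edit", apply_edit)]
```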

Most real-world bugs in AI systems are workflow bugs: bad context construction, wrong tool calls, missing state, poor post-processing, or hidden retrieval failures. They are not always model capability bugs. Engineers who understand the full workflow can debug these. Engineers who only understand the model layer cannot.

9. Step 8: handle constraints and plan for iteration

Finally, design for production. The system will face conditions you did not anticipate, so build accordingly.

Safety and alignment are first-class engineering concerns, not just policy concerns. Anthropic's approach to Claude involves multiple layers: training-time alignment through Constitutional AI and RLHF, inference-time guardrails for certain request types, and monitoring for unexpected production behavior. OpenAI usage policies are enforced through a combination of model-level refusals and external classifiers. These are architectural decisions made early, not features bolted on at the end.

Cost and latency are often the binding constraints. An LLM that takes 30 seconds to respond is not a usable product, regardless of answer quality. Teams address this through caching, streaming, model routing, batching, context trimming, and fallback behavior. These are not optional optimizations. They are prerequisites for shipping.
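Caching is the simplest of these to show. A minimal exact-match cache sketch; production systems also cache semantically similar prompts and stream partial output:

```python
import hashlib

class ResponseCache:
    """Exact-match response cache keyed on model plus prompt."""
    def __init__(self):
        self._store = {}

    def get_or_call(self, model, prompt, call_llm):
        key = hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()
        if key not in self._store:
            self._store[key] = call_llm(model, prompt)  # pay latency and cost once
        return self._store[key]
```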

Monitoring AI systems is harder than monitoring traditional software because failures are often soft. The system does not crash. It gives a subtly wrong answer or a slightly degraded experience, and you notice only when metrics slip. Good monitoring for LLM systems includes latency and error rates, output quality metrics over time, user behavior signals such as edits and retries, and drift detection for model behavior changes.
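Soft failures can still be caught mechanically if you track the right scalars. A sketch of rolling drift detection; the window and tolerance are arbitrary assumptions:

```python
from collections import deque

class DriftMonitor:
    """Flag shifts in any scalar signal: output length, retry rate, edit rate."""
    def __init__(self, window=1000, tolerance=0.25):
        self.values = deque(maxlen=window)
        self.baseline = None
        self.tolerance = tolerance

    def observe(self, value):
        self.values.append(value)
        if self.baseline is None and len(self.values) == self.values.maxlen:
            self.baseline = sum(self.values) / len(self.values)  # freeze a baseline

    def drifted(self):
        if not self.baseline or not self.values:
            return False
        current = sum(self.values) / len(self.values)
        return abs(current - self.baseline) > self.tolerance * self.baseline
```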

Iteration is how the system improves over time, and this is where the feedback loop matters. The best AI systems collect signals that improve them. OpenAI uses human feedback from ChatGPT users to improve future models. GitHub Copilot tracks accepted and rejected suggestions and uses that signal to inform training. The system that learns is the system that compounds. Building data flywheels into the product architecture early is one of the highest-leverage decisions you can make.

10. The core idea

AI systems are no longer just about models.

They are about combining trained intelligence from ML with runtime intelligence from LLMs, wiring them together through a well-designed orchestration layer, evaluating them rigorously, and improving them continuously through feedback.

The engineers who understand this end to end are increasingly rare and increasingly valuable. Most have depth in one layer. The leverage is in understanding all of them.
