AI System Design
Learn how to architect for the future.
AI already powers search, copilots, chatbots, and internal developer tools.
At Staff+ levels, your job is to design systems around models, where reliability, safety, and cost still drive decisions.
In this lesson, we’ll cover what’s unique about AI System Design and its subvariants, and which ones you should learn.
Buckle up: we're covering a lot.
Let’s get started.
Heads up: We’ll dive deeper into AI in the next module: AI Engineering.
Key differences: AI vs. traditional systems
To shift toward AI System Design, you'll have to understand how AI systems differ from traditional systems across four major dimensions:
Behavior: Deterministic execution vs. probabilistic outputs
Feedback: One-time deployment vs. continuous evaluation and retraining
Data: Static inputs vs. evolving, embedded knowledge
Control: Input validation vs. guardrails that shape and verify outputs
Let’s look at how each dimension reshapes System Design in practice.
1. Behavior: Deterministic vs. probabilistic
Traditional systems: Same input → same output. You design for correctness.
AI systems: Same input → different outputs depending on context, temperature, or fine-tuning. You design for consistency, not perfection.
Staff+ engineers anticipate and constrain this variability. They use temperature tuning, prompt templates, and response validators to make “unpredictable” models predictable enough for production.
Example: A customer support AI that must always output JSON, even when the model tries to “explain” itself.
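To make that concrete, here's a minimal sketch of such a response validator. `call_model` is a hypothetical stand-in for your LLM client, and the two required fields are an assumed schema, not a standard:

```python
import json

REQUIRED_FIELDS = {"answer", "confidence"}  # hypothetical schema for the support bot

def validated_reply(prompt: str, call_model, max_retries: int = 2) -> dict:
    """Call the model and retry until it returns well-formed JSON with required fields."""
    for _ in range(max_retries + 1):
        raw = call_model(prompt)  # stand-in for your LLM client
        try:
            parsed = json.loads(raw)
            if REQUIRED_FIELDS.issubset(parsed):  # checks the keys are present
                return parsed  # "predictable enough" for production
        except json.JSONDecodeError:
            pass  # the model "explained" itself instead of emitting JSON
        prompt += "\nRespond with JSON only: {\"answer\": ..., \"confidence\": ...}"
    raise ValueError("No valid JSON after retries; route to fallback handling")
```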
2. Feedback: Deploy once vs. continuous evaluation
Traditional systems: Ship once → monitor uptime
AI systems: Ship → evaluate → retrain → redeploy
Every production AI system includes a feedback loop: users flag bad answers, evaluation jobs score outputs, and retraining pipelines improve models over time.
Monitoring now includes accuracy, bias, drift, hallucination rates, and even cost per query. It’s not “is it up?”, it’s “is it still right, fair, and affordable?”
Example: Your chatbot answers correctly in February but starts failing in May because the knowledge base hasn’t been refreshed. You catch that with automated eval jobs and content freshness metrics.
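Catching that kind of silent regression usually means a scheduled eval job plus a freshness check. A minimal sketch, assuming a golden set of question/answer pairs, a hypothetical `ask_bot` client, and documents that carry a `last_updated` timestamp:

```python
from datetime import datetime, timedelta, timezone

MAX_STALENESS = timedelta(days=90)  # hypothetical freshness budget

def run_eval(golden_set, ask_bot, grade) -> float:
    """Score live answers against a golden set; returns the pass rate."""
    passed = sum(grade(ask_bot(q), expected) for q, expected in golden_set)
    return passed / len(golden_set)

def stale_docs(documents, now=None) -> list:
    """Flag knowledge-base docs that haven't been refreshed within budget."""
    now = now or datetime.now(timezone.utc)
    return [d["id"] for d in documents if now - d["last_updated"] > MAX_STALENESS]
```

Run the eval nightly and alert when the pass rate dips below your threshold; the February-to-May decay then shows up as a trend, not a surprise.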
3. Data: Static inputs vs. embedded knowledge
Traditional systems: Data is a separate concern from logic—mostly static and external.
AI systems: Data is embedded into behavior via embeddings, vector stores, and fine-tuning.
In AI systems, data pipelines are as critical as APIs or databases. Your vector database, embeddings, and fine-tuning datasets are all first-class architecture components.
As a Staff+ engineer, you'll have to:
Design ETL pipelines for cleaning and deduplication.
Automate embedding updates as content changes.
Enforce data lineage and auditability for compliance.
Example: A retrieval pipeline that automatically re-embeds documents when an internal wiki page updates, keeping context fresh and answers accurate.
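A minimal sketch of that idea, with `store` and `embed_fn` as stand-ins for your vector DB client and embedding model (the `get_metadata`/`upsert` methods are hypothetical, not a specific vendor's API):

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def sync_page(page: dict, store, embed_fn) -> None:
    """Re-embed a wiki page only when its content actually changed."""
    new_hash = content_hash(page["text"])
    existing = store.get_metadata(page["id"])  # hypothetical client method
    if existing and existing.get("hash") == new_hash:
        return  # unchanged: keep the existing embedding
    store.upsert(  # hypothetical client method
        id=page["id"],
        vector=embed_fn(page["text"]),
        metadata={"hash": new_hash, "source": page["url"]},
    )
```

Storing a content hash alongside each vector is what keeps the pipeline cheap: you pay for embedding only when a page actually changes.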
4. Control: Input validation vs. output guardrails
Traditional systems: Validate inputs to prevent bad behavior.
AI systems: Guardrails must also constrain outputs, which are less predictable and harder to test.
Input validation still matters, but in AI systems you must also validate outputs, because models can go off-script.
Guardrails prevent unsafe, off-topic, or brand-damaging responses. These can be implemented via:
Policy layers (e.g., NeMo Guardrails, Azure Content Safety)
Schema enforcement (structured outputs like JSON)
Post-processing filters (regex or domain classifiers)
Example: A support bot that refuses to answer medical questions or escalates to a human if confidence < 80%.
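That example translates almost directly into a post-processing filter. A sketch, assuming the model's draft already carries a confidence score (the regex and threshold are illustrative):

```python
import re

MEDICAL_PATTERN = re.compile(r"\b(diagnos|prescri|dosage|symptom)", re.IGNORECASE)
CONFIDENCE_FLOOR = 0.8  # mirrors the "escalate if confidence < 80%" rule

def apply_guardrails(question: str, draft: dict) -> dict:
    """Post-process a model draft: refuse off-limits topics, escalate low confidence."""
    if MEDICAL_PATTERN.search(question):
        return {"action": "refuse", "reply": "I can't help with medical questions."}
    if draft.get("confidence", 0.0) < CONFIDENCE_FLOOR:
        return {"action": "escalate", "reply": "Connecting you with a human agent."}
    return {"action": "answer", "reply": draft["answer"]}
```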
Example: Designing an AI-powered support assistant
Let’s look at how the four dimensions of AI System Design come together in a real-world product request—one you might be asked to lead at the Staff+ level.
Scenario: A PM asks, “Can we build an AI assistant so users don’t flood support?”
Your job is to design a production-grade AI system, thinking in terms of system reliability, safety, and business impact.
Defining SLOs
First, you'd need to define success and failure:
95% of answers should come from internal docs.
Hallucination rate <5%.
These turn subjective “make it work” requests into measurable reliability goals, and give you something to evaluate post-launch.
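SLOs only bite if your eval harness actually checks them. A minimal sketch, assuming an upstream grader labels each evaluated answer with boolean `grounded` and `hallucinated` flags:

```python
SLOS = {"grounded_rate_min": 0.95, "hallucination_rate_max": 0.05}

def check_slos(eval_results: list) -> dict:
    """eval_results: dicts with boolean `grounded` and `hallucinated` flags."""
    n = len(eval_results)
    grounded = sum(r["grounded"] for r in eval_results) / n
    hallucinated = sum(r["hallucinated"] for r in eval_results) / n
    return {
        "grounded_rate": grounded,
        "hallucination_rate": hallucinated,
        "meets_slo": (grounded >= SLOS["grounded_rate_min"]
                      and hallucinated <= SLOS["hallucination_rate_max"]),
    }
```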
Architectural design decisions
With these SLOs in place, your design choices reflect the unique needs of AI systems:
Retrieval-augmented generation (RAG): Ground answers in trusted documents to reduce hallucination and enable frequent updates; essential because data is your infrastructure now (see the sketch after this list).
Tools: Pinecone, Weaviate, or FAISS vs. just a raw LLM (e.g., GPT)
Guardrails: LLMs aren’t sandboxed, so you must sanitize outputs, not just inputs.
Tools: NeMo, content filters, schema validators
Observability: You need metrics for answer quality, freshness, and user trust.
Tools: LangSmith, TruLens, custom eval harnesses
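Here's the RAG decision from that list as a minimal sketch. `retriever` and `llm` are stand-ins for your vector store and model client, and their `search`/`complete` methods are hypothetical, not a specific library's API:

```python
def answer_with_rag(question: str, retriever, llm, top_k: int = 4) -> str:
    """Ground the model in retrieved documents instead of letting it free-associate."""
    docs = retriever.search(question, top_k=top_k)  # hypothetical client method
    context = "\n\n".join(d["text"] for d in docs)
    prompt = (
        "Answer using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.complete(prompt)  # hypothetical client method
```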
For those unfamiliar: We’ll cover RAG and observability in upcoming lessons.
Business tie-in
The design decisions now map to concrete business impact.
With RAG + guardrails, we deflect 40% of support tickets.
Without them, we risk frustrated users and support escalations.
Cost per query, latency, and escalation rates become the metrics that justify your choices to product and leadership.
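Back-of-the-envelope math is usually enough for that conversation. The token counts and prices below are purely illustrative; plug in your model's real pricing:

```python
PRICE_PER_1K_INPUT = 0.0005   # USD per 1K input tokens, hypothetical
PRICE_PER_1K_OUTPUT = 0.0015  # USD per 1K output tokens, hypothetical

def cost_per_query(input_tokens: int, output_tokens: int) -> float:
    return ((input_tokens / 1000) * PRICE_PER_1K_INPUT
            + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT)

# RAG inflates input tokens (retrieved context), so compare both designs:
print(cost_per_query(3000, 300))  # RAG: ~3K context tokens per query
print(cost_per_query(500, 300))   # raw LLM: cheaper, but higher hallucination risk
```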
Types of AI System Design
Different kinds of AI systems come with their own constraints, failure modes, and architecture patterns.
At a high level, most production AI systems fall into three broad patterns:
1. Machine learning systems
These are your classic supervised learning pipelines—ranking, classification, forecasting. You train on labeled data, deploy a model, and measure performance with precision, recall, AUC, etc.
Machine learning System Design focuses on:
Data pipelines and labeling
Offline training + online serving
A/B testing and model drift detection
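Drift detection, for instance, usually boils down to comparing live distributions against a training-time baseline. A deliberately crude sketch (production systems tend to use distribution tests like PSI or KS rather than a mean shift, but the monitoring shape is the same; the threshold is hypothetical):

```python
from statistics import mean

DRIFT_THRESHOLD = 0.1  # hypothetical: alert if the mean score shifts by more than 0.1

def detect_drift(baseline_scores: list, live_scores: list) -> dict:
    """Compare live model scores against a training-time baseline."""
    shift = abs(mean(live_scores) - mean(baseline_scores))
    return {"shift": shift, "drifted": shift > DRIFT_THRESHOLD}
```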
Dig deeper into data pipelines, training, and more in our Machine Learning System Design course.
2. Generative AI systems
These systems generate content: text, code, summaries, images. They rely on prompts, embeddings, and context windows (and they hallucinate—a lot).
Generative AI System Design focuses on:
Prompt engineering and context shaping
Guardrails to constrain output
Cost/performance trade-offs (token usage matters)
Think: Chatbots, copilots, and content engines
Our Grokking Generative AI System Design course gets you ahead with this most in-demand AI skill:
Learn real-world architectures (e.g., text, image, speech, and video generation).
Master scaling strategies for distributed training and inference.
Keep models accurate and efficient with data and pipeline design.
3. Agentic systems
Agentic systems can reason, plan, call tools, and take multi-step actions. They’re powerful, but unpredictable.
Agentic System Design involves:
Tool use orchestration (functions, APIs)
State management and memory
Safety layers to prevent runaway behavior
Think: Autonomous agents, workflow planners, and complex RAG apps
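All three concerns show up even in a toy agent loop. A sketch, where `llm.next_action` is a hypothetical client method that returns either a tool call or a final answer:

```python
MAX_STEPS = 5  # safety layer: a hard cap prevents runaway tool-calling loops

def run_agent(goal: str, llm, tools: dict) -> str:
    """Minimal plan-act loop: tool orchestration, state, and a step budget."""
    history = [f"Goal: {goal}"]  # state management / memory
    for _ in range(MAX_STEPS):
        action = llm.next_action(history)  # hypothetical client method
        if "final" in action:
            return action["final"]
        if action["tool"] not in tools:
            history.append(f"Error: unknown tool {action['tool']}")
            continue
        result = tools[action["tool"]](**action["args"])  # tool use (functions, APIs)
        history.append(f"{action['tool']} -> {result}")
    return "Stopped: step budget exhausted (escalate to a human)."
```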
Curious to learn more? Check out our course on Agentic System Design.
Which AI System Design skill will you need?
As a Staff+ engineer, you don’t need to be an expert in every kind of AI system—but you do need fluency across all three.
Here’s how to think about it:
Start with breadth:
You’ll be asked to review designs, mentor teams, or make architectural calls across all types—so you must understand how each system works, where it fails, and how to reason about trade-offs.
Go deep depending on your org:
If your team owns a content engine or internal assistant, you’ll need deep generative AI chops.
If you're in a search, ranking, or recommendations org, ML system design will be core.
If you're supporting ambitious LLM-driven tooling, you’ll need to understand agent orchestration.
Leadership means pattern matching:
Even if you’re not building an agentic system today, recognizing when a design is one helps you avoid dangerous oversimplifications. “It’s just a chatbot” holds only until it starts making real decisions.