A six-stage learning sequence. From the model in isolation to agents, context design, and the pattern language that waits at the end of the road.
This document is a foundation course, not a survey. It is written for practitioners who are already working with AI tools and who want to understand what they are actually working with — precisely, from the ground up. Much of the vocabulary currently in circulation is imprecise. Imprecise vocabulary produces imprecise thinking. Imprecise thinking produces systems that do not do what was intended. The foundations described here are stable; some specifics — particularly around MCP and tooling protocols — have moved since this was written. Verify current documentation before building.
The sequence has six stages. Each stage has a clear boundary: what you know going in, and what you know coming out. The stages build on each other. Skipping stages is possible but not recommended. The gaps show up later in ways that are difficult to diagnose.
The final stage points toward a longer project: a pattern language for semantic AI systems. That is the practitioner's destination. The stages before it are the road.
Estimated time per stage: two to four hours of focused study and hands-on work. Total sequence: one to three weeks at practitioner pace.
The single most common barrier to clear thinking about AI systems is the conflation of terms. Model, tool use, MCP, context window, retrieval, agent: these terms are frequently used interchangeably in practice. They are not interchangeable. Each names a distinct thing.
An LLM produces output one token at a time. Each token is selected probabilistically based on the input and all previously generated tokens. There is no reasoning happening in the way humans reason. There is sophisticated pattern completion happening at a scale that produces outputs that resemble reasoning. These are not the same thing, and the difference matters for system design.
The model has no persistent memory. Between calls, nothing is retained. Everything the model knows during a given interaction is present in the context window. This is both the fundamental constraint and the fundamental design surface. Managing what is in the context window — and what is not — is most of the work.
The model does not produce a single answer. It produces a probability distribution over possible next tokens and samples from it. Temperature controls how spread out that distribution is. High temperature: creative, variable, less reliable. Low temperature: conservative, consistent, occasionally rigid. This is a dial, not a switch.
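The mechanics of the dial can be made concrete. A minimal sketch of temperature sampling, illustrative only; production inference stacks add top-k/top-p filtering and other refinements:

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Sample one token id from a logits vector, with temperature scaling."""
    scaled = logits / max(temperature, 1e-6)  # low T sharpens, high T flattens
    probs = np.exp(scaled - scaled.max())     # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```

Near zero temperature this collapses toward greedy decoding; at high temperature the distribution approaches uniform.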
Without external tools: it cannot retrieve current information, it cannot perform precise arithmetic reliably, it cannot take actions in the world, and it cannot remember previous conversations. Any system that seems to do these things is using tool use or retrieval — not the model alone.
Tool use is the capability that transforms a language model from a sophisticated text predictor into something that can interact with the world. Understanding how it works — mechanically — is essential before building any system that uses it.
The model is given a list of available tools, each described in natural language with a defined schema for inputs and outputs. When the model determines that a tool call is appropriate, it outputs a structured message — typically JSON — specifying which tool to call and with what arguments. The host system receives this output, executes the tool call, and returns the result to the model as a new context entry. The model then continues.
One cycle of tool use: the model receives input and tool definitions in context; outputs a tool call; the host system executes the call and captures the result; the result is appended to the context; the model receives the updated context and continues. An agent is this loop repeated until a stopping condition is met. Each iteration consumes context space. Long agentic tasks eventually run out of context — this is a hard constraint, not a bug to be fixed.
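A host-side sketch of that loop. Everything named here is a hypothetical stand-in: `call_model` for a provider API, the `tools` mapping for your own dispatch code. Real hosts add error handling and context trimming:

```python
def run_agent(task: str, tool_defs: list, tools: dict, call_model,
              max_steps: int = 20) -> str:
    """One tool-use cycle, repeated until the model stops calling tools
    or the step budget runs out. `call_model` and `tools` are stand-ins."""
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):                    # stopping condition
        reply = call_model(context, tool_defs)    # model sees the full context
        if reply["type"] == "text":               # no tool call: task is done
            return reply["content"]
        result = tools[reply["tool_name"]](**reply["arguments"])  # host executes
        context.append({"role": "tool",           # result re-enters the context
                        "name": reply["tool_name"],
                        "content": str(result)})
    raise RuntimeError("Step budget exhausted before the task completed.")
```

Note that the context list only grows inside the loop; that growth is the consumption described above.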
Tools should be designed the same way good API endpoints are designed. One tool, one concern. Clear input schemas. Predictable outputs. Idempotent where possible. Well-described in natural language — because the model reads the description to decide whether and how to use the tool.
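Concretely, a single well-scoped tool definition might take this generic shape; exact field names vary by provider:

```python
current_time_tool = {
    "name": "current_time",
    "description": (
        "Return the current time in a given IANA timezone as an ISO 8601 "
        "string. Use when the user asks about the current date or time."
    ),
    "input_schema": {                  # JSON Schema for the arguments
        "type": "object",
        "properties": {
            "timezone": {
                "type": "string",
                "description": "IANA timezone name, e.g. 'Europe/Berlin'",
            },
        },
        "required": ["timezone"],
    },
}
```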
MCP is a protocol specification. It standardizes the interface between an AI client and the servers that expose tools, resources, and prompts. It is not a framework, not a platform, and not a paradigm. It is plumbing — carefully designed plumbing, but plumbing. The protocol has continued to develop since this document was written; verify current specification details at modelcontextprotocol.io before building.
Before MCP, every AI application had to define its own bespoke interface for tool integration. MCP replaces that with a standard. The value of the standard is composability: an MCP server built for one client works with any compliant client.
MCP is not the same as tool use. Tool use is a model capability that predates MCP. MCP is one way to implement tool use at scale. A system can use tool use without using MCP. A system can use MCP without using agents. These are orthogonal concepts that happen to compose well.
Reading about MCP is not sufficient. The understanding that comes from building a server is qualitatively different from the understanding that comes from using one. This stage is hands-on throughout.
The first server should do exactly one thing. Not three things. Not a "useful" thing. One thing, implemented correctly, with proper error handling and a clear description that the model can read and understand. Good first-server candidates: return the current time in a specified timezone; fetch the contents of a URL and return clean text; read a file from a specified path and return its contents.
The description on each tool function is what the model reads to decide whether and how to use the tool. Write it for the model. Be explicit about what the tool does, what inputs it expects, and what it returns. Vague descriptions produce incorrect tool calls.
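Putting those rules together, a complete first server might look like the following sketch, written against the official Python MCP SDK's FastMCP interface. Verify names and signatures against current documentation before relying on them; the protocol and SDK have continued to move:

```python
from datetime import datetime
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("time-server")

@mcp.tool()
def current_time(timezone: str) -> str:
    """Return the current time in the given IANA timezone (e.g.
    'Europe/Berlin') as an ISO 8601 string. Errors name the bad input."""
    try:
        return datetime.now(ZoneInfo(timezone)).isoformat()
    except ZoneInfoNotFoundError:
        return f"Unknown timezone: {timezone!r}. Expected an IANA name."

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```

In FastMCP the function docstring becomes the tool description the model reads, which is why it is written for the model rather than for other programmers.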
After the first server works, build a second one that connects to a real external system — a REST API, a database, a file system. This is where the design decisions become non-trivial. What do you expose and what do you hide? What errors does the external system produce and how do you surface them? What does the model need to know to use this tool correctly?
The context window is the design surface. What goes in it, when, and in what form determines what the model can do. This is not configuration — it is architecture. Context design is the discipline of managing that surface deliberately.
Three categories of content compete for context space: instructions (system prompts, tool descriptions, task definitions); state (conversation history, tool results, retrieved documents); and data (content the model must reason over to complete the task). In a long agentic task, state accumulates. Old tool results that are no longer relevant still occupy space. At some point, the context fills. Systems that do not manage this fail in ways that are difficult to diagnose because the model does not announce that it has forgotten something.
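One mitigation, sketched here in hypothetical form: give the state category a rolling token budget and drop the oldest entries first. Real systems usually summarize rather than drop, but the budgeting logic is the same; `count_tokens` stands in for a real tokenizer:

```python
def trim_state(state: list[dict], budget: int, count_tokens) -> list[dict]:
    """Keep the newest state entries that fit within a token budget.
    Instructions live outside this list and are never trimmed."""
    kept, used = [], 0
    for entry in reversed(state):            # walk newest to oldest
        cost = count_tokens(entry["content"])
        if used + cost > budget:
            break                            # everything older falls off
        kept.append(entry)
        used += cost
    return list(reversed(kept))              # restore chronological order
```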
Retrieval-Augmented Generation is, at its core, a context management strategy. Instead of putting an entire knowledge base in context (impossible) or relying on training data (stale), you retrieve only the relevant fragments and place those in context. The retrieval mechanism is separate from the model and requires its own design — and the choice of mechanism is a substantive decision, not a configuration detail.
Three retrieval approaches are in common use, and they are not interchangeable. Vector search encodes documents and queries as dense numerical embeddings and returns semantically similar matches — effective when the vocabulary of the query differs from the vocabulary of the documents but the meaning aligns. It requires an embedding model, an index, and a choice of similarity metric. Keyword search (typically BM25 or a variant) matches on terms — faster, more interpretable, and more precise when query terms are likely to appear verbatim in the documents. Production systems often run both in parallel and merge the results; hybrid retrieval is more robust than either alone. Graph traversal follows explicit relationships between entities — the right approach when the relevant context is not a document but a chain of connections: who owns what, which component depends on which, what event preceded which decision. When relationships are first-class, graph traversal retrieves what vector and keyword search cannot.
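How that parallel merge might work, sketched with reciprocal rank fusion, one common merge rule. Here `vector_search` and `keyword_search` are hypothetical stand-ins, each returning a ranked list of document ids:

```python
def hybrid_search(query: str, vector_search, keyword_search,
                  k: int = 10) -> list[str]:
    """Merge two ranked result lists with reciprocal rank fusion (RRF).
    A document scores higher the nearer the top it appears in either list."""
    scores: dict[str, float] = {}
    for ranked in (vector_search(query, k), keyword_search(query, k)):
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (60 + rank)  # 60 is the customary RRF constant
    return sorted(scores, key=scores.get, reverse=True)[:k]
```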
The retrieval mechanism is one part of a retrieval system. The system also includes chunking strategy — how documents are divided before indexing — embedding model selection, index design, and the scoring and deduplication logic that determines what enters the context window when results compete for space. Each is a design decision. None is automatic. Errors in retrieval architecture show up as model errors, which makes them difficult to diagnose without understanding what the model was actually given to reason over.
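Chunking alone shows how much is a decision. The simplest strategy, fixed-size windows with overlap, fits in a few lines; production systems more often split on structural boundaries such as headings or paragraphs:

```python
def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Fixed-size character chunks with overlap, so content at a boundary
    appears in two chunks. Size and overlap are tuning decisions;
    assumes overlap < size."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```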
Between sessions, the context window is empty. Nothing persists. If a system needs continuity across sessions — and most production systems do — that continuity must be engineered explicitly: written to external storage at the end of a session, retrieved and placed in context at the start of the next one. This is not automatic. It is a design decision that must be made and implemented.
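A minimal version of that engineering, with the storage path and the summary format as hypothetical choices:

```python
import json
from pathlib import Path

MEMORY_FILE = Path("session_memory.json")   # hypothetical storage location

def end_session(summary: str) -> None:
    """Write what the next session needs to know. Nothing else survives."""
    MEMORY_FILE.write_text(json.dumps({"summary": summary}))

def start_session() -> str:
    """Retrieve prior continuity for injection into the new context window."""
    if not MEMORY_FILE.exists():
        return ""
    return json.loads(MEMORY_FILE.read_text())["summary"]
```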
An agent is a model running in a loop. Each iteration: observe the current state, decide the next action, execute the action via tool call, observe the result, repeat. The loop terminates when the goal is achieved or a stopping condition is met. Agents are powerful and they fail in specific, predictable ways — and nearly all of those failure modes are, at bottom, context failures: information that was absent, that accumulated until it crowded out what mattered, or that was retrieved at the wrong moment. Stage 4 is not background for this stage. It is the mechanism of it.
For most production systems, fully autonomous agents are not appropriate. The cost of an error is too high. The right pattern is a spectrum: from the model that drafts and the human that approves, through the model that acts autonomously on low-risk decisions and asks for confirmation on high-risk ones, to the fully autonomous agent that operates within a narrowly constrained domain where errors are cheap and reversible. The decision about where on this spectrum a system sits is a product decision, not a technical one.
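The middle of that spectrum reduces to a gate in the tool-execution step. A sketch, with the risk classification and the `confirm` channel as product-specific stand-ins:

```python
HIGH_RISK = {"send_email", "delete_record", "issue_refund"}  # illustrative names

def execute_gated(tool_name: str, args: dict, tools: dict, confirm) -> str:
    """Run low-risk tool calls directly; route high-risk ones through a
    human approval channel. `confirm` returns True only on explicit approval."""
    if tool_name in HIGH_RISK and not confirm(tool_name, args):
        return f"Call to {tool_name} rejected by reviewer."
    return str(tools[tool_name](**args))
```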
Single-shot interactions do not need agents. Retrieval tasks where the retrieval logic is deterministic do not need agents. Agents add complexity, cost, and failure surface. Use them when the task genuinely requires sequential decision-making across multiple uncertain steps. Not before.
Christopher Alexander observed that good architecture is not the result of following rules. It is the result of applying patterns — solutions to recurring problems in specific contexts — at every scale simultaneously, from the region to the room to the doorknob. The patterns form a language. You compose them. Each pattern you apply creates the conditions in which other patterns can be applied.
The same structure applies to AI systems. There are recurring problems at every scale — organizational, session, prompt, tool, data. Each has proven solutions. Those solutions compose. The practitioner who can name them, apply them deliberately, and teach them to others is doing the work Alexander described.
These definitions are the vocabulary of this sequence. They are not exhaustive. They are precise. Use them as written until you have a reason to refine them.