"Context engineering is the delicate art and science of filling the context window with just the right information for the next step." [cite_start]— Andrej Karpathy [cite: 79]
--{{0}}--
Welcome to the chapter on Context Engineering. As software engineers, we often start working with LLMs by tweaking prompts—changing a word here, adding "please" there. This is known as Prompt Engineering. However, for industrial-strength applications, this is not enough.
--{{1}}--
Think of the Large Language Model (LLM) as a CPU. It has processing power and logic, but it needs data to work on. The Context Window is the RAM.
Context Engineering is the process of programming this RAM. It is about the systematic orchestration of information to ensure the model has exactly what it needs to solve a task.
--{{2}}--
If your RAG (Retrieval Augmented Generation) system is failing, it is rarely because the model is "dumb." It is likely a context retrieval issue. A good retrieval system can make a weak model perform well, but bad retrieval will fail even with state-of-the-art models.
| Feature | Prompt Engineering | Context Engineering |
|---|---|---|
| Focus | "Magic words" and phrasing | Architecture of information flow |
| Scope | Single interaction | System-wide state, memory, and tools |
| Analogy | Writing a better search query | Managing the Heap and Stack in RAM |
| Goal | Better text generation | Reliable, structured task completion |
--{{0}}--
To build robust agents, we need to treat context as a multi-dimensional design layer. There are six distinct types of context that an agent requires to function effectively.
--{{1}}--
Let's visualize this architecture.
```mermaid
mindmap
  root((6 Types of Context))
    Instructions
      Role (Who)
      Objective (Why)
      Requirements (How - JSON/XML)
    Examples
      Behavior (Chain of Thought)
      Responses (Format)
    Knowledge
      External (Domain Specifics)
      Task (API Specs, Docs)
    Memory
      Short-term (Session history)
      Long-term (User prefs, Facts)
    Tools
      Description
      Parameters
      Results (Feedback Loop)
    Guardrails
      Inputs (Cleaning)
      Actions (Allowed list)
      Outputs (Schema validation)
```
@mermaid_eval
- Instructions: The "System Prompt." Defines the Persona (e.g., Coding Agent), the Objective, and strict constraints (e.g., "Always output JSON").
- Examples: Few-shot prompting. Models learn patterns better than rules. Show them how to think and what a good response looks like.
- Knowledge: The RAG layer. Bridging the gap between the model's training data and your proprietary data (Docs, API specs).
- Memory:
  - Short-term: The chat history of the current session.
  - Long-term: Stored facts about the user or project that persist across sessions.
- Tools: The API layer. The agent needs to know what tools are available, how to call them (parameters), and crucially, it needs to see the Results of those calls.
- Guardrails: Safety and schema enforcement. Ensuring inputs are clean and outputs match the expected format before they reach the user or downstream systems.
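To make this concrete, here is a minimal sketch of how the six layers might be assembled into a single request payload. The helpers `retrieve_docs` and `describe_tools` are hypothetical placeholders, not a real API:

```python
# A sketch of "programming the RAM": each of the six context types
# gets an explicit slot in the final message list.
# retrieve_docs() and describe_tools() are hypothetical helpers.
def build_context(user_query: str, session: dict) -> list[dict]:
    instructions = (
        "You are a coding agent. "     # Role (who)
        "Resolve the user's ticket. "  # Objective (why)
        "Always output valid JSON."    # Requirements (how)
    )
    guardrails = "Only use tools from the allowed list."
    examples = session["few_shot_examples"]        # Examples: good input/output pairs
    knowledge = retrieve_docs(user_query)          # Knowledge: RAG chunks
    memory = session["history"][-10:]              # Memory: short-term window
    tools = describe_tools(["search", "run_sql"])  # Tools: names + parameters

    return [
        {"role": "system", "content": instructions + " " + guardrails},
        {"role": "system", "content": f"Tools:\n{tools}"},
        {"role": "system", "content": f"Examples:\n{examples}"},
        {"role": "system", "content": f"Knowledge:\n{knowledge}"},
        *memory,  # prior turns from the current session
        {"role": "user", "content": user_query},
    ]
```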
--{{0}}--
Context Engineering is not static; it is a pipeline. We can break down the lifecycle of context into four stages: Write, Read, Compress, and Isolate.
--{{1}}--
1. Write Context: Saving information outside the context window.
   - Long-term: Vector databases, Knowledge Graphs.
   - Short-term: Session arrays, fast-access caches.
   - State: Runtime objects tracking the current task progress.
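As a minimal sketch, the Write stage can be reduced to three in-memory stores; a production system would back these with a vector database and a session cache (all names here are illustrative):

```python
# In-memory stand-ins for the three write targets.
LONG_TERM: list[dict] = []      # stand-in for a vector DB / knowledge graph
SESSIONS: dict[str, list] = {}  # stand-in for a fast session cache
STATE: dict = {}                # runtime object tracking task progress

def write_context(fact: str, session_id: str, embedding: list[float]) -> None:
    LONG_TERM.append({"text": fact, "embedding": embedding})  # long-term memory
    SESSIONS.setdefault(session_id, []).append(fact)          # short-term history
    STATE["last_written"] = fact                              # task state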
--{{2}}--
2. Read Context: Pulling relevant info into the "RAM" (Context Window) at runtime.
   - Retrieved from Vector DBs.
   - Fetched via Tool execution (e.g., SQL query results).
   - Loaded from conversation history.
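A sketch of the Read stage, reusing the in-memory stores from the Write sketch above; a real system would delegate similarity search to the vector database's index:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Plain cosine similarity; real systems use the vector DB's index.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def read_context(query_embedding: list[float], session_id: str, k: int = 3) -> dict:
    # Rank long-term facts by similarity and load recent history.
    ranked = sorted(
        LONG_TERM,
        key=lambda d: cosine(d["embedding"], query_embedding),
        reverse=True,
    )
    return {
        "retrieved": [d["text"] for d in ranked[:k]],  # from the "vector DB"
        "history": SESSIONS.get(session_id, [])[-5:],  # recent conversation turns
    }
```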
--{{3}}--
3. Compress Context: Optimization. The retrieved context might be verbose. We use summarization steps or smaller LLMs to distill the information into token-efficient formats before passing it to the main reasoning model.
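A sketch of the Compress stage, assuming a hypothetical `small_llm` callable that wraps a cheap summarization model:

```python
def compress_context(chunks: list[str], small_llm, max_words: int = 150) -> str:
    # Distill verbose retrieved chunks into a token-efficient digest
    # before they reach the main reasoning model.
    prompt = (
        f"Distill the following passages into at most {max_words} words, "
        "keeping only facts needed to answer the task:\n\n"
        + "\n---\n".join(chunks)
    )
    return small_llm(prompt)
```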
--{{4}}--
4. Isolate Context: Security and Focus. Splitting context so specific agents only see what they need.
   - Sandboxing: Code execution environments.
   - Sub-agents: A "Writer" agent doesn't need the raw SQL logs that the "Analyst" agent used.
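A sketch of the Isolate stage: each sub-agent receives only its allowed slice of the full context (the keys are illustrative):

```python
FULL_CONTEXT = {
    "sql_logs": "...raw query traces...",        # Analyst only
    "findings": "...distilled analysis...",      # Writer only
    "style_guide": "...tone and formatting...",  # Writer only
}

VISIBILITY = {
    "analyst": ["sql_logs"],
    "writer": ["findings", "style_guide"],
}

def isolate_context(agent_name: str) -> dict:
    # The Writer never sees the raw SQL logs.
    return {key: FULL_CONTEXT[key] for key in VISIBILITY[agent_name]}
```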
```mermaid
graph LR
    A[Raw Data] --> B(Write: Store in DB/State)
    B --> C{Trigger}
    C --> D(Read: Retrieve & Fetch)
    D --> E(Compress: Summarize)
    E --> F(Isolate: Route to Agent)
    F --> G[LLM Processing]
```
@mermaid_eval
--{{0}}--
Let's implement a "Context Engineering Workflow" for a Research Assistant.

Tech Stack:
- Orchestration: CrewAI
- Ingestion: Tensorlake (for parsing complex docs)
- Memory: Zep (Knowledge Graph memory)
- Search: Firecrawl (Semantic web scraping)
- Vector DB: Milvus
--{{1}}--
The Workflow:
- Ingest: Parse docs into RAG-ready chunks.
- Retrieve: Hybrid search (Web + Docs + ArXiv).
- Filter: An evaluator agent removes irrelevant context.
- Synthesize: A final agent generates the answer.
Below is a conceptual implementation of the Retrieval & Filter logic.
```python
# Pseudo-code based on the architecture. encode() stands in for your
# embedding model; exact SDK signatures vary by library version.
from crewai import Agent, Task, Crew
from firecrawl import FirecrawlApp
from pymilvus import Collection


class ContextManager:
    def __init__(self):
        # Assumes a Milvus connection was opened via connections.connect()
        self.milvus = Collection("research_docs")
        self.firecrawl = FirecrawlApp(api_key="fc-...")

    def gather_context(self, query):
        # 1. Internal Knowledge (Vector Search):
        # retrieve the top-k chunks most similar to the query
        internal_results = self.milvus.search(
            data=[encode(query)],
            anns_field="embedding",
            param={"metric_type": "L2", "params": {"nprobe": 10}},
            limit=5,
            output_fields=["text"],
        )
        # 2. External Knowledge (Web Search):
        # Firecrawl returns LLM-ready markdown
        web_results = self.firecrawl.search(query)
        # 3. Aggregate both sources into one context payload
        return {
            "internal": [hit.entity.get("text") for hit in internal_results[0]],
            "external": web_results["data"],  # response shape varies by SDK version
        }


# Agent definitions
researcher = Agent(
    role='Context Evaluator',
    goal='Filter irrelevant information from gathered context',
    backstory='You are a strict editor. You only allow facts relevant to the query.',
    verbose=True,
)

synthesizer = Agent(
    role='Research Writer',
    goal='Synthesize a final answer from filtered context',
    backstory='You write clear, citation-backed summaries.',
)


# Workflow execution
def run_workflow(user_query):
    ctx_manager = ContextManager()
    raw_context = ctx_manager.gather_context(user_query)

    # Task 1: Filter the context (Compress/Isolate)
    task_filter = Task(
        description=(
            f"Analyze this context: {raw_context}. "
            f"Remove noise relative to query: {user_query}"
        ),
        expected_output="A filtered list of facts relevant to the query",
        agent=researcher,
    )

    # Task 2: Generate the answer
    task_synthesize = Task(
        description="Write a final answer using the filtered context.",
        expected_output="A concise, citation-backed answer",
        context=[task_filter],  # pass the output of Task 1 downstream
        agent=synthesizer,
    )

    crew = Crew(agents=[researcher, synthesizer], tasks=[task_filter, task_synthesize])
    return crew.kickoff()
```

--{{0}}--
As applications grow, you cannot dump everything into the context window. Anthropic's "Claude Skills" pattern offers a solution for Progressive Loading.
--{{1}}--
It uses a 3-layer system to keep the "RAM" clean:
- Main Context: Always loaded (Project config).
- Skill Metadata: Lightweight YAML definitions (Name, Description). Loaded to help the model decide which tool to use.
- Active Skill Context: Loaded only when the skill is triggered. Contains the heavy instructions and examples.
```mermaid
sequenceDiagram
    participant User
    participant System
    participant LLM
    Note over System: Layer 1: Main Context Loaded
    User->>System: "Analyze the Q3 financial report"
    System->>LLM: Sends Query + Skill Metadata (Layer 2)
    LLM->>System: "I need the 'Financial_Analysis' skill"
    System->>System: Loads 'Financial_Analysis.md' (Layer 3)
    System->>LLM: Sends Query + Full Skill Instructions
    LLM->>User: Generates Report
```
@mermaid_eval
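As a rough sketch, the three layers can be modeled as a metadata table plus on-demand file loads. The file names and the `pick_skill`/`complete` calls are assumptions, not the real Claude API:

```python
# Layer 2: lightweight metadata, always sent alongside the query.
SKILL_METADATA = [
    {"name": "Financial_Analysis",
     "description": "Analyze financial reports",
     "file": "skills/financial_analysis.md"},
    {"name": "Code_Review",
     "description": "Review pull requests",
     "file": "skills/code_review.md"},
]

def answer(query: str, llm) -> str:
    # Layers 1 + 2: main context plus metadata; the heavy skill
    # instructions stay on disk until the model asks for them.
    chosen = llm.pick_skill(query, SKILL_METADATA)  # hypothetical routing call
    with open(chosen["file"]) as f:                 # Layer 3: loaded on trigger
        skill_instructions = f.read()
    return llm.complete(system=skill_instructions, user=query)
```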
--{{0}}--
Scenario: A user asks, "What’s blocking the Chicago office project, and when’s our next meeting about it?"
--{{1}}--
The Problem:
- Blockers are in the issue tracker (e.g., Linear/Jira).
- Meetings are in the Calendar.
- Discussions are in Slack/Email.
- Permissions: Does the user have access to all this?

--{{2}}--

The Solution: Agentic Context Engineering

A manual RAG pipeline fails here. You need an Ingestion Layer that handles auth and updates (syncing), and a Retrieval Layer that can expand the query into multiple sub-queries (one for Linear, one for Calendar).
Key Takeaway: Context retrieval for Agents is an infrastructure problem, not just an embedding problem. You must build for continuous sync and hybrid search.
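A sketch of what the retrieval layer's query expansion might look like for this scenario; the connector interface and the sub-queries are illustrative assumptions:

```python
SUB_QUERIES = {
    "linear": "open blockers for project 'Chicago office'",
    "calendar": "next meeting mentioning 'Chicago office'",
    "slack": "recent discussions about 'Chicago office' blockers",
}

def gather(user: str, connectors: dict) -> dict:
    results = {}
    for source, sub_query in SUB_QUERIES.items():
        # Permissions are enforced per source before retrieval.
        if not connectors[source].has_access(user):
            continue
        results[source] = connectors[source].search(sub_query)
    return results
```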
Objective: Design the Context Engineering flow for a Code Review Bot.
Requirements:
- The bot must know the coding standards (Knowledge).
- It must access the specific PR diff (Input).
- It must remember previous comments on this PR (Short-term Memory).
- It must verify if the user is a senior dev before allowing a "merge" action (Guardrails).
Task: Using the editor below, sketch the JSON structure you would inject into the context window for the "Instructions" and "Guardrails" layers.
```json
{
  "role": "Code Reviewer",
  "guardrails": {
    "allowed_actions": ["comment", "request_changes"],
    "restricted_actions": ["merge"],
    "validation": "Check if user_role == 'senior_dev' for merge"
  },
  "context_sources": [
    "docs/coding_standards.md",
    "api/github/pr_diff"
  ]
}
```
--{{0}}--
Test your understanding of Context Engineering.
According to the "CPU vs RAM" analogy, what represents the RAM?

[( )] The Vector Database
[(X)] The Context Window
[( )] The Prompt

Which stage of the Context Lifecycle involves splitting information so specific agents only see what they need?

[( )] Write Context
[( )] Compress Context
[(X)] Isolate Context

Why is "Skill Metadata" loaded separately from "Active Skill Context" in the Progressive Loading pattern?

[( )] To make the model faster
[(X)] To save tokens and keep the context lightweight until needed
[( )] Because YAML is easier to read than Markdown

In the "Chicago Office" problem, why does simple RAG fail?

[(X)] Data is spread across multiple dynamic sources (Calendar, Slack, Jira) with different permissions
[( )] The model cannot understand the word "Chicago"
[( )] Vector databases cannot store calendar events

What are the 6 types of context?

[[X]] Instructions
[[X]] Examples
[[ ]] Emotions
[[X]] Knowledge
[[X]] Memory
[[X]] Tools
[[X]] Guardrails
[[ ]] Hardware Specs