"Context engineering is the delicate art and science of filling the context window with just the right information for the next step." [cite_start]— Andrej Karpathy [cite: 79]
--{{0}}--
Welcome to the chapter on Context Engineering. As software engineers, we often start working with LLMs by tweaking prompts—changing a word here, adding "please" there. This is known as Prompt Engineering. However, for industrial-strength applications, this is not enough.
--{{1}}--
Think of the Large Language Model (LLM) as a CPU. It has processing power and logic, but it needs data to work on. The Context Window is the RAM.
Context Engineering is the process of programming this RAM. It is about the systematic orchestration of information to ensure the model has exactly what it needs to solve a task.
--{{2}}--
If your RAG (Retrieval Augmented Generation) system is failing, it is rarely because the model is "dumb." It is likely a context retrieval issue. A good retrieval system can make a weak model perform well, but bad retrieval will fail even with state-of-the-art models.
| Feature | Prompt Engineering | Context Engineering |
|---|---|---|
| Focus | "Magic words" and phrasing | Architecture of information flow |
| Scope | Single interaction | System-wide state, memory, and tools |
| Analogy | Writing a better search query | Managing the Heap and Stack in RAM |
| Goal | Better text generation | Reliable, structured task completion |
--{{0}}--
To build robust agents, we need to treat context as a multi-dimensional design layer. There are six distinct types of context that an agent requires to function effectively.
--{{1}}--
Let's visualize this architecture.
```mermaid
mindmap
  root((6 Types of Context))
    Instructions
      Role (Who)
      Objective (Why)
      Requirements (How - JSON/XML)
    Examples
      Behavior (Chain of Thought)
      Responses (Format)
    Knowledge
      External (Domain Specifics)
      Task (API Specs, Docs)
    Memory
      Short-term (Session history)
      Long-term (User prefs, Facts)
    Tools
      Description
      Parameters
      Results (Feedback Loop)
    Guardrails
      Inputs (Cleaning)
      Actions (Allowed list)
      Outputs (Schema validation)
```
@mermaid_eval
- Instructions: The "System Prompt." Defines the Persona (e.g., Coding Agent), the Objective, and strict constraints (e.g., "Always output JSON").
- Examples: Few-shot prompting. Models learn patterns better than rules. Show them how to think and what a good response looks like.
- Knowledge: The RAG layer. Bridging the gap between the model's training data and your proprietary data (Docs, API specs).
- Memory:
  - Short-term: The chat history of the current session.
  - Long-term: Stored facts about the user or project that persist across sessions.
- Tools: The API layer. The agent needs to know what tools are available, how to call them (parameters), and crucially, it needs to see the Results of those calls.
- Guardrails: Safety and schema enforcement. Ensuring inputs are clean and outputs match the expected format before they reach the user or downstream systems.
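To make this concrete, here is a minimal sketch of how the six layers might be assembled into a single request payload. The helpers `retrieve_docs` and `describe_tools` are hypothetical placeholders, not a real API:

```python
# A sketch of "programming the RAM": each of the six context types
# gets an explicit slot in the final message list.
# retrieve_docs() and describe_tools() are hypothetical helpers.
def build_context(user_query: str, session: dict) -> list[dict]:
    instructions = (
        "You are a coding agent. "     # Role (who)
        "Resolve the user's ticket. "  # Objective (why)
        "Always output valid JSON."    # Requirements (how)
    )
    guardrails = "Only use tools from the allowed list."
    examples = session["few_shot_examples"]        # Examples: good input/output pairs
    knowledge = retrieve_docs(user_query)          # Knowledge: RAG chunks
    memory = session["history"][-10:]              # Memory: short-term window
    tools = describe_tools(["search", "run_sql"])  # Tools: names + parameters

    return [
        {"role": "system", "content": instructions + " " + guardrails},
        {"role": "system", "content": f"Tools:\n{tools}"},
        {"role": "system", "content": f"Examples:\n{examples}"},
        {"role": "system", "content": f"Knowledge:\n{knowledge}"},
        *memory,  # prior turns from the current session
        {"role": "user", "content": user_query},
    ]
```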
--{{0}}--
Context Engineering is not static; it is a pipeline. We can break down the lifecycle of context into four stages: Write, Read, Compress, and Isolate.
--{{1}}--
1. Write Context: Saving information outside the context window.
   - Long-term: Vector databases, Knowledge Graphs.
   - Short-term: Session arrays, fast-access caches.
   - State: Runtime objects tracking the current task progress.
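As a minimal sketch, the Write stage can be reduced to three in-memory stores; a production system would back these with a vector database and a session cache (all names here are illustrative):

```python
# In-memory stand-ins for the three write targets.
LONG_TERM: list[dict] = []      # stand-in for a vector DB / knowledge graph
SESSIONS: dict[str, list] = {}  # stand-in for a fast session cache
STATE: dict = {}                # runtime object tracking task progress

def write_context(fact: str, session_id: str, embedding: list[float]) -> None:
    LONG_TERM.append({"text": fact, "embedding": embedding})  # long-term memory
    SESSIONS.setdefault(session_id, []).append(fact)          # short-term history
    STATE["last_written"] = fact                              # task state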
--{{2}}--
2. Read Context: Pulling relevant info into the "RAM" (Context Window) at runtime.
   - Retrieved from Vector DBs.
   - Fetched via Tool execution (e.g., SQL query results).
   - Loaded from conversation history.
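A sketch of the Read stage, reusing the in-memory stores from the Write sketch above; a real system would delegate similarity search to the vector database's index:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Plain cosine similarity; real systems use the vector DB's index.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def read_context(query_embedding: list[float], session_id: str, k: int = 3) -> dict:
    # Rank long-term facts by similarity and load recent history.
    ranked = sorted(
        LONG_TERM,
        key=lambda d: cosine(d["embedding"], query_embedding),
        reverse=True,
    )
    return {
        "retrieved": [d["text"] for d in ranked[:k]],  # from the "vector DB"
        "history": SESSIONS.get(session_id, [])[-5:],  # recent conversation turns
    }
```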
--{{3}}--
3. Compress Context: Optimization. The retrieved context might be verbose. We use summarization steps or smaller LLMs to distill the information into token-efficient formats before passing it to the main reasoning model.
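A sketch of the Compress stage, assuming a hypothetical `small_llm` callable that wraps a cheap summarization model:

```python
def compress_context(chunks: list[str], small_llm, max_words: int = 150) -> str:
    # Distill verbose retrieved chunks into a token-efficient digest
    # before they reach the main reasoning model.
    prompt = (
        f"Distill the following passages into at most {max_words} words, "
        "keeping only facts needed to answer the task:\n\n"
        + "\n---\n".join(chunks)
    )
    return small_llm(prompt)
```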
--{{4}}--
4. Isolate Context: Security and Focus. Splitting context so specific agents only see what they need.
   - Sandboxing: Code execution environments.
   - Sub-agents: A "Writer" agent doesn't need the raw SQL logs that the "Analyst" agent used.
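A sketch of the Isolate stage: each sub-agent receives only its allowed slice of the full context (the keys are illustrative):

```python
FULL_CONTEXT = {
    "sql_logs": "...raw query traces...",        # Analyst only
    "findings": "...distilled analysis...",      # Writer only
    "style_guide": "...tone and formatting...",  # Writer only
}

VISIBILITY = {
    "analyst": ["sql_logs"],
    "writer": ["findings", "style_guide"],
}

def isolate_context(agent_name: str) -> dict:
    # The Writer never sees the raw SQL logs.
    return {key: FULL_CONTEXT[key] for key in VISIBILITY[agent_name]}
```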
```mermaid
graph LR
    A[Raw Data] --> B(Write: Store in DB/State)
    B --> C{Trigger}
    C --> D(Read: Retrieve & Fetch)
    D --> E(Compress: Summarize)
    E --> F(Isolate: Route to Agent)
    F --> G[LLM Processing]
```
@mermaid_eval
--{{0}}--
Let's implement a "Context Engineering Workflow" for a Research Assistant.

Tech Stack:
- Orchestration: CrewAI
- Ingestion: Tensorlake (for parsing complex docs)
- Memory: Zep (Knowledge Graph memory)
- Search: Firecrawl (Semantic web scraping)
- Vector DB: Milvus
--{{1}}--
The Workflow:
- Ingest: Parse docs into RAG-ready chunks.
- Retrieve: Hybrid search (Web + Docs + ArXiv).
- Filter: An evaluator agent removes irrelevant context.
- Synthesize: A final agent generates the answer.
Below is a conceptual implementation of the Retrieval & Filter logic.
```python
# Pseudo-code based on the architecture. encode() stands in for your
# embedding model; exact SDK signatures vary by library version.
from crewai import Agent, Task, Crew
from firecrawl import FirecrawlApp
from pymilvus import Collection


class ContextManager:
    def __init__(self):
        # Assumes a Milvus connection was opened via connections.connect()
        self.milvus = Collection("research_docs")
        self.firecrawl = FirecrawlApp(api_key="fc-...")

    def gather_context(self, query):
        # 1. Internal Knowledge (Vector Search):
        # retrieve the top-k chunks most similar to the query
        internal_results = self.milvus.search(
            data=[encode(query)],
            anns_field="embedding",
            param={"metric_type": "L2", "params": {"nprobe": 10}},
            limit=5,
            output_fields=["text"],
        )
        # 2. External Knowledge (Web Search):
        # Firecrawl returns LLM-ready markdown
        web_results = self.firecrawl.search(query)
        # 3. Aggregate both sources into one context payload
        return {
            "internal": [hit.entity.get("text") for hit in internal_results[0]],
            "external": web_results["data"],  # response shape varies by SDK version
        }


# Agent definitions
researcher = Agent(
    role='Context Evaluator',
    goal='Filter irrelevant information from gathered context',
    backstory='You are a strict editor. You only allow facts relevant to the query.',
    verbose=True,
)

synthesizer = Agent(
    role='Research Writer',
    goal='Synthesize a final answer from filtered context',
    backstory='You write clear, citation-backed summaries.',
)


# Workflow execution
def run_workflow(user_query):
    ctx_manager = ContextManager()
    raw_context = ctx_manager.gather_context(user_query)

    # Task 1: Filter the context (Compress/Isolate)
    task_filter = Task(
        description=(
            f"Analyze this context: {raw_context}. "
            f"Remove noise relative to query: {user_query}"
        ),
        expected_output="A filtered list of facts relevant to the query",
        agent=researcher,
    )

    # Task 2: Generate the answer
    task_synthesize = Task(
        description="Write a final answer using the filtered context.",
        expected_output="A concise, citation-backed answer",
        context=[task_filter],  # pass the output of Task 1 downstream
        agent=synthesizer,
    )

    crew = Crew(agents=[researcher, synthesizer], tasks=[task_filter, task_synthesize])
    return crew.kickoff()
```

--{{0}}--
As applications grow, you cannot dump everything into the context window. Anthropic's "Claude Skills" pattern offers a solution for Progressive Loading.
--{{1}}--
It uses a 3-layer system to keep the "RAM" clean:
- Main Context: Always loaded (Project config).
- Skill Metadata: Lightweight YAML definitions (Name, Description). Loaded to help the model decide which tool to use.
- Active Skill Context: Loaded only when the skill is triggered. Contains the heavy instructions and examples.
```mermaid
sequenceDiagram
    participant User
    participant System
    participant LLM
    Note over System: Layer 1: Main Context Loaded
    User->>System: "Analyze the Q3 financial report"
    System->>LLM: Sends Query + Skill Metadata (Layer 2)
    LLM->>System: "I need the 'Financial_Analysis' skill"
    System->>System: Loads 'Financial_Analysis.md' (Layer 3)
    System->>LLM: Sends Query + Full Skill Instructions
    LLM->>User: Generates Report
```
@mermaid_eval
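As a rough sketch, the three layers can be modeled as a metadata table plus on-demand file loads. The file names and the `pick_skill`/`complete` calls are assumptions, not the real Claude API:

```python
# Layer 2: lightweight metadata, always sent alongside the query.
SKILL_METADATA = [
    {"name": "Financial_Analysis",
     "description": "Analyze financial reports",
     "file": "skills/financial_analysis.md"},
    {"name": "Code_Review",
     "description": "Review pull requests",
     "file": "skills/code_review.md"},
]

def answer(query: str, llm) -> str:
    # Layers 1 + 2: main context plus metadata; the heavy skill
    # instructions stay on disk until the model asks for them.
    chosen = llm.pick_skill(query, SKILL_METADATA)  # hypothetical routing call
    with open(chosen["file"]) as f:                 # Layer 3: loaded on trigger
        skill_instructions = f.read()
    return llm.complete(system=skill_instructions, user=query)
```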
--{{0}}--
Scenario: A user asks, "What’s blocking the Chicago office project, and when’s our next meeting about it?"
--{{1}}--
The Problem:
- Blockers are in the issue tracker (e.g., Linear/Jira).
- Meetings are in the Calendar.
- Discussions are in Slack/Email.
- Permissions: Does the user have access to all this?

--{{2}}--

The Solution: Agentic Context Engineering

A manual RAG pipeline fails here. You need an Ingestion Layer that handles auth and updates (syncing), and a Retrieval Layer that can expand the query into multiple sub-queries (one for Linear, one for Calendar).
Key Takeaway: Context retrieval for Agents is an infrastructure problem, not just an embedding problem. You must build for continuous sync and hybrid search.
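A sketch of what the retrieval layer's query expansion might look like for this scenario; the connector interface and the sub-queries are illustrative assumptions:

```python
SUB_QUERIES = {
    "linear": "open blockers for project 'Chicago office'",
    "calendar": "next meeting mentioning 'Chicago office'",
    "slack": "recent discussions about 'Chicago office' blockers",
}

def gather(user: str, connectors: dict) -> dict:
    results = {}
    for source, sub_query in SUB_QUERIES.items():
        # Permissions are enforced per source before retrieval.
        if not connectors[source].has_access(user):
            continue
        results[source] = connectors[source].search(sub_query)
    return results
```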
Objective: Design the Context Engineering flow for a Code Review Bot.
Requirements:
- The bot must know the coding standards (Knowledge).
- It must access the specific PR diff (Input).
- It must remember previous comments on this PR (Short-term Memory).
- It must verify if the user is a senior dev before allowing a "merge" action (Guardrails).
Task: Using the editor below, sketch the JSON structure you would inject into the context window for the "Instructions" and "Guardrails" layers.
```json
{
  "role": "Code Reviewer",
  "guardrails": {
    "allowed_actions": ["comment", "request_changes"],
    "restricted_actions": ["merge"],
    "validation": "Check if user_role == 'senior_dev' for merge"
  },
  "context_sources": [
    "docs/coding_standards.md",
    "api/github/pr_diff"
  ]
}
```
--{{0}}--
Test your understanding of Context Engineering.
According to the "CPU vs RAM" analogy, what represents the RAM?

[( )] The Vector Database
[(X)] The Context Window
[( )] The Prompt

Which stage of the Context Lifecycle involves splitting information so specific agents only see what they need?

[( )] Write Context
[( )] Compress Context
[(X)] Isolate Context

Why is "Skill Metadata" loaded separately from "Active Skill Context" in the Progressive Loading pattern?

[( )] To make the model faster
[(X)] To save tokens and keep the context lightweight until needed
[( )] Because YAML is easier to read than Markdown

In the "Chicago Office" problem, why does simple RAG fail?

[(X)] Data is spread across multiple dynamic sources (Calendar, Slack, Jira) with different permissions
[( )] The model cannot understand the word "Chicago"
[( )] Vector databases cannot store calendar events

What are the 6 types of context?

[[X]] Instructions
[[X]] Examples
[[ ]] Emotions
[[X]] Knowledge
[[X]] Memory
[[X]] Tools
[[X]] Guardrails
[[ ]] Hardware Specs