Exported on 1/6/2026 at 14:21:38 PST from Cursor (2.3.23)
User
I have a problem that I would like to solve using an LLM or agents. The problem is I have a system that has a lot of logs. Log files. I can access those log files through SSH. That is not a problem. What I need to do is analyze the log files—basically analyze the crash logs, exceptions, or things like that. I would like to leverage an LLM to do that analysis and come up with information like what the outcome should be, how many errors there are, and other insights. So what is the best way to do that? Is using an LLM and creating an agent a good option? If so, how do we actually create an agent? I do have access to LLMs—say I have internal access to the OpenAI API. Could you devise a simple solution for how this kind of problem can be efficiently solved, so that this agent can analyze the logs with the help of the LLM? Also, I would like to know how you are planning to handle this analysis if the log file size is greater than the context size. What is the best approach for that? How are agents usually solving this kind of problem? Tools like Cursor do this kind of thing, and I think they're also using some kind of LLM agent. So what is the best way to solve this kind of problem? I would like to see some sample code (clean, simple, concise)—show me the important parts in Python, maybe using LangChain or LangGraph.
The solution should be clean, concise, and also work for many large log files.
Cursor
Great question! Using an LLM agent is indeed a solid approach for log analysis. Let me break down the architecture and provide clean, practical code.
Why an LLM is a good fit here:
- Pattern Recognition: LLMs excel at understanding error patterns, stack traces, and correlating issues
- Natural Language Output: Get summaries instead of raw grep results
- Flexible Querying: Ask questions like "What caused the 3AM outage?" instead of writing complex regex
When logs exceed the context window, there are several proven strategies:
- Map-Reduce: Process chunks independently, then aggregate the findings. This is what most production systems use.
- Pre-filtering: Grep for errors/exceptions before sending anything to the LLM. Don't send boring "INFO" lines.
- RAG (retrieval): Embed log chunks, then retrieve only the ones relevant to the query.
- Hierarchical summarization: Summarize chunks → summarize the summaries → final report (sketched just below).
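Strategy 4 is worth a quick sketch up front, since the map-reduce code below only does a single reduce pass. This is a minimal, hedged version — the prompt wording and the 8,000-character budget are assumptions, not fixed requirements:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

def hierarchical_summarize(texts: list[str], budget: int = 8000) -> str:
    """Recursively summarize until the combined text fits one prompt."""
    summaries = [
        llm.invoke(f"Summarize the errors and issues in:\n{t}").content
        for t in texts
    ]
    combined = "\n---\n".join(summaries)
    if len(summaries) == 1 or len(combined) <= budget:
        return llm.invoke(f"Write a final log-health report from:\n{combined}").content
    # Still too large: group the summaries and recurse one level deeper
    groups = ["\n---\n".join(summaries[i:i + 5])
              for i in range(0, len(summaries), 5)]
    return hierarchical_summarize(groups, budget)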
Here's a practical implementation combining strategies 1 and 2 (map-reduce plus pre-filtering):
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
class LogAnalyzer:
"""Analyzes large log files using map-reduce pattern."""
def __init__(self, model: str = "gpt-4o", chunk_size: int = 8000):
self.llm = ChatOpenAI(model=model, temperature=0)
self.splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=200,
separators=["\n\n", "\n", " "]
)
# Prompt for analyzing each chunk
self.map_prompt = ChatPromptTemplate.from_template("""
Analyze this log chunk for errors, exceptions, and issues.
Log chunk:
{chunk}
Extract:
1. Error count and types
2. Exception stack traces (summarized)
3. Critical issues requiring attention
4. Timestamps of issues
Output as structured JSON.
""")
# Prompt for combining chunk analyses
self.reduce_prompt = ChatPromptTemplate.from_template("""
Combine these log analysis results into a final report:
{chunk_analyses}
Provide:
1. Total error count by type
2. Root cause analysis for major issues
3. Timeline of incidents
4. Recommended actions
5. Overall health assessment
""")
def prefilter_logs(self, content: str) -> str:
"""Keep only interesting lines to reduce noise."""
interesting_patterns = ['ERROR', 'WARN', 'Exception', 'FATAL',
'Traceback', 'failed', 'timeout', 'refused']
lines = content.split('\n')
filtered = []
context_lines = 0
for line in lines:
if any(p in line for p in interesting_patterns):
filtered.append(line)
context_lines = 3 # Include next 3 lines for stack traces
elif context_lines > 0:
filtered.append(line)
context_lines -= 1
return '\n'.join(filtered)
def analyze(self, log_content: str, prefilter: bool = True) -> str:
"""Analyze logs using map-reduce."""
# Step 1: Prefilter if requested (huge reduction in tokens)
if prefilter:
log_content = self.prefilter_logs(log_content)
# Step 2: Split into chunks
chunks = self.splitter.split_text(log_content)
print(f"Processing {len(chunks)} chunks...")
# Step 3: MAP - Analyze each chunk
chunk_analyses = []
for i, chunk in enumerate(chunks):
messages = self.map_prompt.format_messages(chunk=chunk)
result = self.llm.invoke(messages)
chunk_analyses.append(f"Chunk {i+1}:\n{result.content}")
# Step 4: REDUCE - Combine analyses
combined = "\n\n".join(chunk_analyses)
messages = self.reduce_prompt.format_messages(chunk_analyses=combined)
final_report = self.llm.invoke(messages)
return final_report.content
# Usage
analyzer = LogAnalyzer()
with open("/path/to/large.log") as f:
report = analyzer.analyze(f.read())
print(report)
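Two hedged refinements before moving on. First, the MAP step is embarrassingly parallel; LangChain's Runnable .batch() runs the chunk requests concurrently. A sketch as a subclass — the max_concurrency of 8 is an assumed rate-limit-friendly value, not a requirement:
class ParallelLogAnalyzer(LogAnalyzer):
    """Same map-reduce flow, but fans the MAP step out concurrently."""
    def analyze(self, log_content: str, prefilter: bool = True) -> str:
        if prefilter:
            log_content = self.prefilter_logs(log_content)
        chunks = self.splitter.split_text(log_content)
        # .batch() executes all chunk analyses concurrently
        inputs = [self.map_prompt.format_messages(chunk=c) for c in chunks]
        results = self.llm.batch(inputs, config={"max_concurrency": 8})
        combined = "\n\n".join(f"Chunk {i+1}:\n{r.content}"
                               for i, r in enumerate(results))
        messages = self.reduce_prompt.format_messages(chunk_analyses=combined)
        return self.llm.invoke(messages).content
Second, if the combined chunk analyses themselves exceed the context window, feed them back through hierarchical_summarize from the earlier sketch instead of making a single reduce call.
For interactive investigation, the next option is a LangGraph agent with tools. This gives you an interactive agent that can query logs intelligently: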
from typing import Annotated, TypedDict
from collections import Counter
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
import subprocess
# Define tools the agent can use
@tool
def grep_logs(pattern: str, log_path: str, context_lines: int = 5) -> str:
"""Search logs for a pattern. Returns matching lines with context."""
try:
result = subprocess.run(
["grep", "-n", "-C", str(context_lines), pattern, log_path],
capture_output=True, text=True, timeout=30
)
return result.stdout[:10000] or "No matches found" # Limit output
except Exception as e:
return f"Error: {e}"
@tool
def get_error_summary(log_path: str) -> str:
"""Get a quick count of error types in the log file."""
result = subprocess.run(
["grep", "-oE", "(ERROR|WARN|FATAL|Exception)[^:]*", log_path],
capture_output=True, text=True
)
    lines = [l for l in result.stdout.strip().split('\n') if l]
    counts = Counter(lines)  # Counter is imported at the top of the module
return "\n".join(f"{k}: {v}" for k, v in counts.most_common(20))
@tool
def get_log_tail(log_path: str, lines: int = 100) -> str:
"""Get the last N lines of a log file."""
result = subprocess.run(
["tail", "-n", str(lines), log_path],
capture_output=True, text=True
)
return result.stdout
@tool
def ssh_fetch_logs(host: str, remote_path: str, pattern: str = "ERROR") -> str:
"""Fetch and grep logs from a remote server via SSH."""
cmd = f"ssh {host} 'grep -n {pattern} {remote_path} | head -500'"
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
return result.stdout[:15000] or "No matches or connection failed"
# Agent state
class AgentState(TypedDict):
    # add_messages appends instead of overwriting, so agent replies and
    # tool results accumulate across the agent<->tools loop
    messages: Annotated[list, add_messages]
def create_log_agent():
"""Create a LangGraph agent for log analysis."""
tools = [grep_logs, get_error_summary, get_log_tail, ssh_fetch_logs]
llm = ChatOpenAI(model="gpt-4o", temperature=0).bind_tools(tools)
def agent_node(state: AgentState):
"""The agent decides what to do next."""
system_msg = """You are a log analysis expert. Use the available tools to:
1. Search for errors and exceptions
2. Identify patterns and root causes
3. Provide actionable insights
Be thorough - use multiple tool calls to investigate issues."""
messages = [{"role": "system", "content": system_msg}] + state["messages"]
response = llm.invoke(messages)
return {"messages": [response]}
def should_continue(state: AgentState):
"""Check if agent wants to use more tools."""
last_msg = state["messages"][-1]
if hasattr(last_msg, "tool_calls") and last_msg.tool_calls:
return "tools"
return END
# Build the graph
graph = StateGraph(AgentState)
graph.add_node("agent", agent_node)
graph.add_node("tools", ToolNode(tools))
graph.set_entry_point("agent")
graph.add_conditional_edges("agent", should_continue)
graph.add_edge("tools", "agent")
return graph.compile()
# Usage
agent = create_log_agent()
result = agent.invoke({
"messages": [{"role": "user", "content":
"Analyze /var/log/app.log - find all errors from the last hour, "
"identify patterns, and suggest fixes"}]
})
print(result["messages"][-1].content)For files too large to even load into memory:
from typing import Generator
class StreamingLogAnalyzer:
"""Process logs in a streaming fashion - never load full file."""
def __init__(self, llm, batch_size: int = 1000):
self.llm = llm
self.batch_size = batch_size
def stream_relevant_lines(self, log_path: str) -> Generator[str, None, None]:
"""Stream only relevant lines from a huge log file."""
error_patterns = {'ERROR', 'WARN', 'Exception', 'Traceback', 'FATAL'}
with open(log_path, 'r') as f:
buffer = []
for line in f:
if any(p in line for p in error_patterns):
buffer.append(line)
# Yield in batches
if len(buffer) >= self.batch_size:
yield ''.join(buffer)
buffer = []
if buffer:
yield ''.join(buffer)
def analyze_streaming(self, log_path: str) -> str:
"""Analyze huge log files with constant memory usage."""
summaries = []
for batch in self.stream_relevant_lines(log_path):
# Analyze each batch
response = self.llm.invoke(
f"Summarize errors in this log batch:\n{batch}"
)
summaries.append(response.content)
# Final aggregation
return self.llm.invoke(
f"Combine these summaries into a final report:\n" +
"\n---\n".join(summaries)
        ).content
Embedding-based retrieval (strategy 3 above) fits best when you want to ask targeted questions rather than generate a full report:
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
class LogRAG:
"""Embed log chunks for semantic search."""
def __init__(self):
self.embeddings = OpenAIEmbeddings()
self.vectorstore = None
def index_logs(self, log_content: str, chunk_size: int = 1000):
"""Index log file for semantic search."""
splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size)
chunks = splitter.split_text(log_content)
self.vectorstore = FAISS.from_texts(chunks, self.embeddings)
print(f"Indexed {len(chunks)} chunks")
def query(self, question: str, k: int = 5) -> str:
"""Find relevant log sections and answer question."""
docs = self.vectorstore.similarity_search(question, k=k)
context = "\n\n".join(d.page_content for d in docs)
llm = ChatOpenAI(model="gpt-4o")
return llm.invoke(
f"Based on these log excerpts:\n{context}\n\nAnswer: {question}"
).content
# Usage
rag = LogRAG()
with open("huge.log") as f:
    rag.index_logs(f.read())
print(rag.query("What caused the database connection failures?"))
Tools like Cursor use a combination of:
- Smart Chunking: Split by logical boundaries (functions, classes)
- Semantic Search: Embed code/logs and retrieve relevant chunks
- Tool-based Agents: Let the LLM decide what to grep/read
- Iterative Refinement: Agent can do multiple searches to narrow down
Given you have SSH access to logs, I'd recommend:
┌──────────────────────────────────────────────────────────┐
│                  Log Analysis Pipeline                   │
├──────────────────────────────────────────────────────────┤
│ 1. SSH + grep/awk    → Pre-filter to errors/exceptions  │
│ 2. Smart chunking    → Split by timestamp/session       │
│ 3. Map-Reduce        → Analyze chunks in parallel       │
│ 4. LLM Aggregation   → Combine into actionable report   │
└──────────────────────────────────────────────────────────┘
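As a concrete example of step 2, here's a hedged sketch of smart chunking that breaks on timestamps instead of raw character counts, so multi-line stack traces are never split mid-trace. The regex assumes a YYYY-MM-DD HH:MM:SS prefix — adjust it to your log format:
import re

TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def chunk_by_timestamp(log_content: str, max_chars: int = 8000) -> list[str]:
    """Group lines into chunks, breaking only at timestamped lines."""
    chunks, current, size = [], [], 0
    for line in log_content.splitlines(keepends=True):
        # Start a new chunk only at a record boundary (timestamped line)
        if size >= max_chars and TIMESTAMP.match(line):
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks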
Key insight: Don't send raw logs to the LLM. Pre-filter aggressively using grep/awk over SSH first, then send only the interesting parts to the LLM. This can cut token usage by 90%+ and improves result quality.
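A minimal sketch of that pattern, reusing the LogAnalyzer from above — the host name and path are placeholders, and key-based SSH auth is assumed:
import subprocess

def fetch_filtered(host: str, remote_path: str) -> str:
    """Run the prefilter remotely so only interesting lines cross the wire."""
    remote_cmd = f"grep -E -A 3 'ERROR|FATAL|Exception|Traceback' '{remote_path}'"
    result = subprocess.run(["ssh", host, remote_cmd],
                            capture_output=True, text=True, timeout=120)
    return result.stdout

filtered = fetch_filtered("app-server-01", "/var/log/app.log")
print(LogAnalyzer().analyze(filtered, prefilter=False))  # already filtered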
Would you like me to expand on any of these approaches or create a more complete implementation for your specific use case?