Exported on 1/6/2026 at 14:21:38 PST from Cursor (2.3.23)
User
I have a problem that I would like to solve using an LLM or agents. The problem is I have a system that has a lot of logs. Log files. I can access those log files through SSH. That is not a problem. What I need to do is analyze the log files—basically analyze the crash logs, exceptions, or things like that. I would like to leverage an LLM to do that analysis and come up with information like what the outcome should be, how many errors there are, and other insights. So what is the best way to do that? Is using an LLM and creating an agent a good option? If so, how do we actually create an agent? I do have access to LLMs—say I have internal access to the OpenAI API. Could you devise a simple solution for how this kind of problem can be efficiently solved, so that this agent can analyze the logs with the help of the LLM? Also, I would like to know how you are planning to handle this analysis if the log file size is greater than the context size. What is the best approach for that? How are agents usually solving this kind of problem? Tools like Cursor do this kind of thing, and I think they're also using some kind of LLM agent. So what is the best way to solve this kind of problem? I would like to see some sample code (clean, simple, concise)—show me the important parts in Python, maybe using LangChain or LangGraph.
The solution should be clean, concise, and also work for many large log files.
Cursor
Great question! Using an LLM agent is indeed a solid approach for log analysis. Let me break down the architecture and provide clean, practical code.
Why an LLM is a good fit here:
- Pattern Recognition: LLMs excel at understanding error patterns, stack traces, and correlating issues
- Natural Language Output: Get summaries instead of raw grep results
- Flexible Querying: Ask questions like "What caused the 3AM outage?" instead of writing complex regex
When logs exceed the context window, there are several proven strategies:
- Map-Reduce: Process chunks independently, then aggregate the findings. This is what most production systems use.
- Pre-filtering: Grep for errors/exceptions before sending anything to the LLM. Don't send boring "INFO" lines.
- RAG (retrieval): Embed log chunks, then retrieve only the ones relevant to the query.
- Hierarchical summarization: Summarize chunks → summarize the summaries → final report (sketched just below).
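Strategy 4 is worth a quick sketch up front, since the map-reduce code below only does a single reduce pass. This is a minimal, hedged version — the prompt wording and the 8,000-character budget are assumptions, not fixed requirements:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0)

def hierarchical_summarize(texts: list[str], budget: int = 8000) -> str:
    """Recursively summarize until the combined text fits one prompt."""
    summaries = [
        llm.invoke(f"Summarize the errors and issues in:\n{t}").content
        for t in texts
    ]
    combined = "\n---\n".join(summaries)
    if len(summaries) == 1 or len(combined) <= budget:
        return llm.invoke(f"Write a final log-health report from:\n{combined}").content
    # Still too large: group the summaries and recurse one level deeper
    groups = ["\n---\n".join(summaries[i:i + 5])
              for i in range(0, len(summaries), 5)]
    return hierarchical_summarize(groups, budget)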
Here's a practical implementation combining strategies 1 and 2 (map-reduce plus pre-filtering):
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
class LogAnalyzer:
"""Analyzes large log files using map-reduce pattern."""
def __init__(self, model: str = "gpt-4o", chunk_size: int = 8000):
self.llm = ChatOpenAI(model=model, temperature=0)
self.splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=200,
separators=["\n\n", "\n", " "]
)
# Prompt for analyzing each chunk
self.map_prompt = ChatPromptTemplate.from_template("""
Analyze this log chunk for errors, exceptions, and issues.
Log chunk:
{chunk}
Extract:
1. Error count and types
2. Exception stack traces (summarized)
3. Critical issues requiring attention
4. Timestamps of issues
Output as structured JSON.
""")
# Prompt for combining chunk analyses
self.reduce_prompt = ChatPromptTemplate.from_template("""
Combine these log analysis results into a final report:
{chunk_analyses}
Provide:
1. Total error count by type
2. Root cause analysis for major issues
3. Timeline of incidents
4. Recommended actions
5. Overall health assessment
""")
def prefilter_logs(self, content: str) -> str:
"""Keep only interesting lines to reduce noise."""
interesting_patterns = ['ERROR', 'WARN', 'Exception', 'FATAL',
'Traceback', 'failed', 'timeout', 'refused']
lines = content.split('\n')
filtered = []
context_lines = 0
for line in lines:
if any(p in line for p in interesting_patterns):
filtered.append(line)
context_lines = 3 # Include next 3 lines for stack traces
elif context_lines > 0:
filtered.append(line)
context_lines -= 1
return '\n'.join(filtered)
def analyze(self, log_content: str, prefilter: bool = True) -> str:
"""Analyze logs using map-reduce."""
# Step 1: Prefilter if requested (huge reduction in tokens)
if prefilter:
log_content = self.prefilter_logs(log_content)
# Step 2: Split into chunks
chunks = self.splitter.split_text(log_content)
print(f"Processing {len(chunks)} chunks...")
# Step 3: MAP - Analyze each chunk
chunk_analyses = []
for i, chunk in enumerate(chunks):
messages = self.map_prompt.format_messages(chunk=chunk)
result = self.llm.invoke(messages)
chunk_analyses.append(f"Chunk {i+1}:\n{result.content}")
# Step 4: REDUCE - Combine analyses
combined = "\n\n".join(chunk_analyses)
messages = self.reduce_prompt.format_messages(chunk_analyses=combined)
final_report = self.llm.invoke(messages)
return final_report.content
# Usage
analyzer = LogAnalyzer()
with open("/path/to/large.log") as f:
report = analyzer.analyze(f.read())
print(report)
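Two hedged refinements before moving on. First, the MAP step is embarrassingly parallel; LangChain's Runnable .batch() runs the chunk requests concurrently. A sketch as a subclass — the max_concurrency of 8 is an assumed rate-limit-friendly value, not a requirement:
class ParallelLogAnalyzer(LogAnalyzer):
    """Same map-reduce flow, but fans the MAP step out concurrently."""
    def analyze(self, log_content: str, prefilter: bool = True) -> str:
        if prefilter:
            log_content = self.prefilter_logs(log_content)
        chunks = self.splitter.split_text(log_content)
        # .batch() executes all chunk analyses concurrently
        inputs = [self.map_prompt.format_messages(chunk=c) for c in chunks]
        results = self.llm.batch(inputs, config={"max_concurrency": 8})
        combined = "\n\n".join(f"Chunk {i+1}:\n{r.content}"
                               for i, r in enumerate(results))
        messages = self.reduce_prompt.format_messages(chunk_analyses=combined)
        return self.llm.invoke(messages).content
Second, if the combined chunk analyses themselves exceed the context window, feed them back through hierarchical_summarize from the earlier sketch instead of making a single reduce call.
For interactive investigation, the next option is a LangGraph agent with tools. This gives you an interactive agent that can query logs intelligently: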
from typing import Annotated, TypedDict
from collections import Counter
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
import subprocess
# Define tools the agent can use
@tool
def grep_logs(pattern: str, log_path: str, context_lines: int = 5) -> str:
"""Search logs for a pattern. Returns matching lines with context."""
try:
result = subprocess.run(
["grep", "-n", "-C", str(context_lines), pattern, log_path],
capture_output=True, text=True, timeout=30
)
return result.stdout[:10000] or "No matches found" # Limit output
except Exception as e:
return f"Error: {e}"
@tool
def get_error_summary(log_path: str) -> str:
"""Get a quick count of error types in the log file."""
result = subprocess.run(
["grep", "-oE", "(ERROR|WARN|FATAL|Exception)[^:]*", log_path],
capture_output=True, text=True
)
    lines = [l for l in result.stdout.strip().split('\n') if l]
    counts = Counter(lines)  # Counter is imported at the top of the module
return "\n".join(f"{k}: {v}" for k, v in counts.most_common(20))
@tool
def get_log_tail(log_path: str, lines: int = 100) -> str:
"""Get the last N lines of a log file."""
result = subprocess.run(
["tail", "-n", str(lines), log_path],
capture_output=True, text=True
)
return result.stdout
@tool
def ssh_fetch_logs(host: str, remote_path: str, pattern: str = "ERROR") -> str:
"""Fetch and grep logs from a remote server via SSH."""
cmd = f"ssh {host} 'grep -n {pattern} {remote_path} | head -500'"
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
return result.stdout[:15000] or "No matches or connection failed"
# Agent state
class AgentState(TypedDict):
    # add_messages appends instead of overwriting, so agent replies and
    # tool results accumulate across the agent<->tools loop
    messages: Annotated[list, add_messages]
def create_log_agent():
"""Create a LangGraph agent for log analysis."""
tools = [grep_logs, get_error_summary, get_log_tail, ssh_fetch_logs]
llm = ChatOpenAI(model="gpt-4o", temperature=0).bind_tools(tools)
def agent_node(state: AgentState):
"""The agent decides what to do next."""
system_msg = """You are a log analysis expert. Use the available tools to:
1. Search for errors and exceptions
2. Identify patterns and root causes
3. Provide actionable insights
Be thorough - use multiple tool calls to investigate issues."""
messages = [{"role": "system", "content": system_msg}] + state["messages"]
response = llm.invoke(messages)
return {"messages": [response]}
def should_continue(state: AgentState):
"""Check if agent wants to use more tools."""
last_msg = state["messages"][-1]
if hasattr(last_msg, "tool_calls") and last_msg.tool_calls:
return "tools"
return END
# Build the graph
graph = StateGraph(AgentState)
graph.add_node("agent", agent_node)
graph.add_node("tools", ToolNode(tools))
graph.set_entry_point("agent")
graph.add_conditional_edges("agent", should_continue)
graph.add_edge("tools", "agent")
return graph.compile()
# Usage
agent = create_log_agent()
result = agent.invoke({
"messages": [{"role": "user", "content":
"Analyze /var/log/app.log - find all errors from the last hour, "
"identify patterns, and suggest fixes"}]
})
print(result["messages"][-1].content)For files too large to even load into memory:
from typing import Generator
class StreamingLogAnalyzer:
"""Process logs in a streaming fashion - never load full file."""
def __init__(self, llm, batch_size: int = 1000):
self.llm = llm
self.batch_size = batch_size
def stream_relevant_lines(self, log_path: str) -> Generator[str, None, None]:
"""Stream only relevant lines from a huge log file."""
error_patterns = {'ERROR', 'WARN', 'Exception', 'Traceback', 'FATAL'}
with open(log_path, 'r') as f:
buffer = []
for line in f:
if any(p in line for p in error_patterns):
buffer.append(line)
# Yield in batches
if len(buffer) >= self.batch_size:
yield ''.join(buffer)
buffer = []
if buffer:
yield ''.join(buffer)
def analyze_streaming(self, log_path: str) -> str:
"""Analyze huge log files with constant memory usage."""
summaries = []
for batch in self.stream_relevant_lines(log_path):
# Analyze each batch
response = self.llm.invoke(
f"Summarize errors in this log batch:\n{batch}"
)
summaries.append(response.content)
# Final aggregation
return self.llm.invoke(
f"Combine these summaries into a final report:\n" +
"\n---\n".join(summaries)
        ).content
Embedding-based retrieval (strategy 3 above) fits best when you want to ask targeted questions rather than generate a full report:
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
class LogRAG:
"""Embed log chunks for semantic search."""
def __init__(self):
self.embeddings = OpenAIEmbeddings()
self.vectorstore = None
def index_logs(self, log_content: str, chunk_size: int = 1000):
"""Index log file for semantic search."""
splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size)
chunks = splitter.split_text(log_content)
self.vectorstore = FAISS.from_texts(chunks, self.embeddings)
print(f"Indexed {len(chunks)} chunks")
def query(self, question: str, k: int = 5) -> str:
"""Find relevant log sections and answer question."""
docs = self.vectorstore.similarity_search(question, k=k)
context = "\n\n".join(d.page_content for d in docs)
llm = ChatOpenAI(model="gpt-4o")
return llm.invoke(
f"Based on these log excerpts:\n{context}\n\nAnswer: {question}"
).content
# Usage
rag = LogRAG()
with open("huge.log") as f:
    rag.index_logs(f.read())
print(rag.query("What caused the database connection failures?"))
Tools like Cursor use a combination of:
- Smart Chunking: Split by logical boundaries (functions, classes)
- Semantic Search: Embed code/logs and retrieve relevant chunks
- Tool-based Agents: Let the LLM decide what to grep/read
- Iterative Refinement: Agent can do multiple searches to narrow down
Given you have SSH access to logs, I'd recommend:
┌──────────────────────────────────────────────────────────┐
│                  Log Analysis Pipeline                   │
├──────────────────────────────────────────────────────────┤
│ 1. SSH + grep/awk    → Pre-filter to errors/exceptions  │
│ 2. Smart chunking    → Split by timestamp/session       │
│ 3. Map-Reduce        → Analyze chunks in parallel       │
│ 4. LLM Aggregation   → Combine into actionable report   │
└──────────────────────────────────────────────────────────┘
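As a concrete example of step 2, here's a hedged sketch of smart chunking that breaks on timestamps instead of raw character counts, so multi-line stack traces are never split mid-trace. The regex assumes a YYYY-MM-DD HH:MM:SS prefix — adjust it to your log format:
import re

TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def chunk_by_timestamp(log_content: str, max_chars: int = 8000) -> list[str]:
    """Group lines into chunks, breaking only at timestamped lines."""
    chunks, current, size = [], [], 0
    for line in log_content.splitlines(keepends=True):
        # Start a new chunk only at a record boundary (timestamped line)
        if size >= max_chars and TIMESTAMP.match(line):
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks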
Key insight: Don't send raw logs to the LLM. Pre-filter aggressively using grep/awk over SSH first, then send only the interesting parts to the LLM. This can cut token usage by 90%+ and improves result quality.
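A minimal sketch of that pattern, reusing the LogAnalyzer from above — the host name and path are placeholders, and key-based SSH auth is assumed:
import subprocess

def fetch_filtered(host: str, remote_path: str) -> str:
    """Run the prefilter remotely so only interesting lines cross the wire."""
    remote_cmd = f"grep -E -A 3 'ERROR|FATAL|Exception|Traceback' '{remote_path}'"
    result = subprocess.run(["ssh", host, remote_cmd],
                            capture_output=True, text=True, timeout=120)
    return result.stdout

filtered = fetch_filtered("app-server-01", "/var/log/app.log")
print(LogAnalyzer().analyze(filtered, prefilter=False))  # already filtered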
Would you like me to expand on any of these approaches or create a more complete implementation for your specific use case?