This case study examines how a single change to tool response design — adding real-time state summaries (status_counts) to every tool call response — transformed an AI agent's behavior from short, conservative work sessions (~11 iterations) to sustained, strategic execution (~34 iterations, 3x improvement). We analyze 187 messages from a real production conversation where an AI agent triaged 3,261 recruitment candidates, comparing behavior before and after the change. We ground our findings in published research on closed-loop feedback, state observability, and LLM agent planning.
We operate an AI-powered recruitment co-pilot where an LLM agent (GPT-4o) uses a set of tools to manage candidate pipelines: listing candidates, viewing details, updating statuses individually or in bulk, and querying project summaries. The agent operates in a multi-turn loop with a configurable iteration limit.
During a single extended conversation (187 messages over 2 days), the engineering team deployed several changes to the tool response format mid-conversation, creating a natural A/B test of agent behavior.
Before: Tool responses returned only local action confirmation.

```json
// update_candidate_status response (BEFORE)
{
  "success": true,
  "message": "Candidate status changed from \"pending\" to \"tier1\"."
}

// bulk_update_candidate_status response (BEFORE)
{
  "success": true,
  "updated": 7,
  "message": "7 candidates updated to \"tier1\"."
}
```

After: Every mutation tool response includes a full state snapshot.
```json
// update_candidate_status response (AFTER)
{
  "success": true,
  "message": "Candidate status changed from \"pending\" to \"tier1\".",
  "status_counts": {
    "pending": 3208,
    "tier1": 44,
    "tier2": 4,
    "on_hold": 0,
    "approved": 0,
    "declined": 5
  }
}
```

Additionally, a new filter parameter (factor_scores) was added to bulk_update_candidate_status, enabling per-scoring-factor filtering (e.g., "decline all candidates with Seniority score ≤ 4").
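The response change is cheap to implement server-side. A minimal sketch of the pattern, using a hypothetical in-memory store and handler names (the production system's internals are not shown in this case study):

```python
from collections import Counter

# Hypothetical in-memory candidate store; the real system's storage layer is not shown.
CANDIDATES = {i: "pending" for i in range(10)}

ALL_STATUSES = ("pending", "tier1", "tier2", "on_hold", "approved", "declined")

def status_counts() -> dict:
    """Aggregate the full pipeline state into the 6-field summary."""
    counts = Counter(CANDIDATES.values())
    # Emit every status, even when zero, so the agent sees the full distribution.
    return {s: counts.get(s, 0) for s in ALL_STATUSES}

def update_candidate_status(candidate_id: int, new_status: str) -> dict:
    old = CANDIDATES[candidate_id]
    CANDIDATES[candidate_id] = new_status
    return {
        "success": True,
        "message": f'Candidate status changed from "{old}" to "{new_status}".',
        # The one-line change: attach a global state snapshot to every mutation.
        "status_counts": status_counts(),
    }
```

Because the snapshot is a single aggregation over the candidate table, it adds one cheap query per mutation while upgrading every response from a local confirmation to a global observation.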
We analyzed all assistant messages from conversation #38, comparing:
- Morning session (06:15–06:36 UTC): 5 messages, no status_counts in responses
- Afternoon session (15:47–17:30 UTC): 10 messages, status_counts in every mutation response
Both sessions involved the same task (candidate triage), same model, same system prompt, same user. The tool response format was the only infrastructure change.
We measured:
- Iterations per message (how long the agent worked before stopping)
- Candidates processed per iteration (efficiency)
- Tool call patterns (individual vs. bulk operations)
- Thinking logs (the agent's internal reasoning about strategy)
| Session | Avg Iterations | Max Iterations | Messages Hitting Limit |
|---|---|---|---|
| Morning (no status_counts) | ~11 | 13 | 0 |
| Afternoon (with status_counts) | ~34 | 51 | 2 |
Without state feedback, the agent's default behavior was: work for 8–13 iterations, then stop and report progress. With state feedback, the agent continued working until it either completed the task or hit the iteration limit.
| MSG | Time | Iterations | Candidates/Iteration | Strategy |
|---|---|---|---|---|
| 661 (before) | 06:15 | 12 | 1.3 | Individual review |
| 663 (before) | 06:18 | 13 | 1.3 | Individual review |
| 815 (after) | 16:04 | 50 | 1.3 | Individual review, hit limit |
| 823 (after) | 16:49 | 42 | 72.5 | Batch + individual hybrid |
In MSG 823, the agent processed all 3,261 candidates in a single message — using score-band bulk operations for clear cases and individual review for borderline candidates.
Before (no state feedback):
The agent's thinking logs show local, candidate-focused reasoning:
> "Nils Aldag — CEO now, not CFO → Tier 2"
> "Zahra Sadry — Chief of Staff to the CFO → decline"
After reviewing 10–12 candidates, it stops to report. It has no information about how much work remains.
After (with state feedback):
The agent's thinking shows global, strategic reasoning:
> "174 candidates at score 21. Let me batch them all to Tier 2 in two batches."
> "There are still 56 pending candidates. Let me check what score those are and deal with them."
> "The 21-score band is fully cleared. Let me now continue to the 20-score band."
The agent tracks pending going down (3208 → 2500 → 1800 → 900 → 56 → 0) and uses this as both a progress indicator and a stopping condition.
| Session | get_candidate_details | update_candidate_status | bulk_update | Bulk Ratio |
|---|---|---|---|---|
| Morning (before) | 65% | 30% | 5% | 0.05 |
| Afternoon peak (after) | 17% | 15% | 44% | 1.38 |
The agent shifted from predominantly individual operations to a hybrid strategy, using bulk operations for score bands and individual review for candidates near decision boundaries.
A naive explanation would be: "The agent gets more information, so it makes better decisions." But status_counts doesn't help the agent evaluate any individual candidate better. The candidate details, scores, and factors are identical.
The real mechanism is that status_counts transforms the agent from open-loop to closed-loop operation:
Open-loop (before): The agent executes actions without observing their cumulative effect on global state. It must rely on internal estimates of progress, which default to conservative stopping ("I've done enough, let me report").
Closed-loop (after): Every action returns the full system state. The agent observes its own impact on the world and adjusts strategy accordingly. It sees pending: 3208 and knows the task is large. It sees pending: 56 and knows it's nearly done.
This maps directly to the POMDP (Partially Observable Markov Decision Process) framework used in MCP-Bench (2025): the agent operates in a partially observable environment, and richer observations enable better policy selection.
1. Stopping Behavior
Without state feedback, the agent has no principled stopping criterion beyond "I've worked for N iterations." With state feedback, it has a clear goal: pending → 0. This explains the 3x increase in average iterations — the agent keeps working because it can see the work isn't done.
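The difference between the two stopping rules can be made concrete. A minimal sketch, with illustrative function names and loop structure (not the production agent loop):

```python
def run_open_loop(step, max_iterations=13):
    """Open-loop: no state feedback, so the only stopping rule is an iteration budget."""
    for _ in range(max_iterations):
        step()
    return max_iterations  # always stops at the budget, done or not

def run_closed_loop(step, iteration_limit=51):
    """Closed-loop: each step returns status_counts, giving a principled goal test."""
    for i in range(1, iteration_limit + 1):
        counts = step()["status_counts"]
        if counts["pending"] == 0:
            return i  # goal reached: stop early with the work actually finished
    return iteration_limit  # otherwise work until the limit, as in MSG 815
```

The closed-loop version stops for exactly one of two reasons, pending reached zero or the budget ran out, which matches the afternoon behavior; the open-loop version can only ever stop at its budget, regardless of how much work remains.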
2. Strategy Selection
Seeing pending: 3208 in the first response, the agent immediately understands that individual review of 3,000+ candidates is infeasible within iteration limits. It spontaneously switches to batch operations — a strategy it never attempted in the morning session despite having the same bulk tools available.
3. Progress Tracking as Reinforcement
Each decreasing pending count acts as a reinforcement signal. The agent observes its actions having measurable impact (pending decreasing by hundreds) and this reinforces the batch strategy. Without this signal, the agent has no way to know if its approach is effective at scale.
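The signal the agent reads off this trace is just the per-step decrease in pending. A small illustrative helper (names are hypothetical):

```python
def progress_deltas(pending_trace):
    """Convert a sequence of observed pending counts into per-step progress signals."""
    return [prev - cur for prev, cur in zip(pending_trace, pending_trace[1:])]

# The pending trace observed in the afternoon session
trace = [3208, 2500, 1800, 900, 56, 0]
progress_deltas(trace)  # → [708, 700, 900, 844, 56]
```

Large positive deltas confirm that the batch strategy is working at scale; a run of small or zero deltas would tell the agent its current approach has stalled.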
Our findings align with several published results:
One study tested LLM agents with three types of environmental feedback: passive scene descriptions, active scene descriptions, and success-detection signals. Key finding: "Richer, more structured feedback from the environment significantly improved high-level instruction completion." Without feedback, the LLM plans open-loop and fails to recover from errors.
Another paper observes: "The lack of global information leads to greedy decisions resulting in sub-optimal solutions, and irrelevant information acquired from the environment introduces noise and incurs additional cost." This directly explains our morning session behavior: without global state, the agent made locally greedy decisions ("reviewed 10, time to report").
In a third system, closed-loop state feedback improved task success rate by 17% over the state of the art: "Without feedback, the Brain-LLM cannot detect or recover from execution failures."
MCP-Bench (2025) formalizes tool-using agents as POMDPs and demonstrates that multi-turn, observation-based agents consistently outperform one-shot planners. Observability is the key variable.
Tool responses must be epistemically informative — they should reduce the agent's uncertainty about the state of the world. If a response doesn't help the agent understand what's happening, the call is wasted. Our status_counts is a textbook example: it converts every mutation from "local confirmation" to "global state update."
A further benchmark introduces "progress rate" as a metric that measures incremental advancement toward goals. Its key finding: agents that receive fine-grained progress observations make more effective subsequent decisions, while coarse-grained success/failure signals mask actionable information.
This is an observational study on a single conversation, not a controlled experiment. Several confounds exist:
- Task scope differed. The morning session reviewed ~800 Tier 2 candidates. The afternoon session triaged 3,261 candidates after a full reset and re-scoring. The larger task naturally demands more iterations.
- User instructions evolved. Morning: "Review tier 2 for hidden gems." Afternoon: "Start triaging. You can exclude all candidates who don't have serious CFO or VP Finance level in batches." The afternoon instruction explicitly authorizes batch operations.
- Scoring criteria changed. Between sessions, the scorecard was rebuilt from 5 factors to 3, changing the scoring landscape entirely.
- Conversation context accumulated. By the afternoon, the agent had 100+ messages of context about this project, potentially improving its task understanding.
- The user caught over-batching. In MSG 824, the user asked "Did you actually look at all candidates individually?" and in MSG 826 instructed "Do not bulk assign — please go back and check them manually." Even after this correction, the agent maintained higher iteration counts (26–51 vs. the morning's 8–13), suggesting the status_counts effect on stopping behavior persists independently of bulk strategy.
For teams building LLM-agent tool APIs:
- Include global state in every mutation response. Don't just confirm the action — show the resulting state of the world. This is cheap to compute and fundamentally changes agent planning.
- Provide progress-relevant counters. If the agent is working through a list, include the remaining count. If it is managing a pipeline, include the stage distribution. The agent will use these as both stopping conditions and strategy signals.
- Design tool responses as observations in a POMDP. Ask: "After this response, can the agent determine how close it is to the goal?" If not, add the missing information.
- Don't confuse tool descriptions with tool responses. Better tool descriptions help the agent choose the right tool. Better tool responses help the agent plan its next action. Both matter, but response design is under-studied and often overlooked.
- Be cautious with high-efficiency tools. Our factor_scores bulk filter enabled 72.5 candidates/iteration — but the user had to intervene because the agent skipped individual review. High-leverage tools need guardrails or confirmation steps for destructive batch operations.
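One way to implement such a guardrail is a size threshold that downgrades oversized bulk mutations to a confirmation request. A hypothetical sketch (the production system's guardrails, if any, are not described in this case study):

```python
BULK_CONFIRM_THRESHOLD = 50  # hypothetical cutoff; tune per risk tolerance

def bulk_update_with_guardrail(candidate_ids, new_status, confirmed=False):
    """Require explicit confirmation before large destructive batch updates."""
    if len(candidate_ids) > BULK_CONFIRM_THRESHOLD and not confirmed:
        # Refuse the mutation and tell the agent how to proceed deliberately.
        return {
            "success": False,
            "needs_confirmation": True,
            "message": (f"This would update {len(candidate_ids)} candidates to "
                        f'"{new_status}". Re-call with confirmed=true to proceed.'),
        }
    # ... perform the update here, then return the status_counts snapshot as usual
    return {"success": True, "updated": len(candidate_ids)}
```

The refusal message doubles as an observation: the agent learns both that the action was blocked and what a compliant retry looks like, rather than silently failing.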
A small change to tool response format — adding 6 fields of state summary to every mutation response — produced a 3x increase in sustained agent work and a 55x increase in peak throughput. The mechanism is not "more data" but a shift from open-loop to closed-loop operation: the agent gains continuous observability of its own impact on the world, enabling principled stopping, strategic tool selection, and progress-driven persistence.
This finding has implications for anyone building tool-using LLM agents: the design of what your tools return may matter as much as the design of what your tools do.
This analysis was conducted on a production AI recruitment co-pilot system. The conversation analyzed contained 187 messages over 2 days, processing 3,261 candidates across 277 companies. All data is from real usage, not synthetic benchmarks.
Published: February 27, 2026