This case study examines how a single change to tool response design — adding real-time state summaries (status_counts) to every tool call response — transformed an AI agent's behavior from short, conservative work sessions (~11 iterations) to sustained, strategic execution (~34 iterations, 3x improvement). We analyze 187 messages from a real production conversation where an AI agent triaged 3,261 recruitment candidates, comparing behavior before and after the change. We ground our findings in published research on closed-loop feedback, state observability, and LLM agent planning.
We operate an AI-powered recruitment co-pilot where an LLM agent (GPT-4o) uses a set of tools to manage candidate pipelines: listing candidates, viewing details, updating statuses individually or in bulk, and querying project summaries. The agent operates in a multi-turn loop with a configurable iteration limit.
During a single extended conversation (187 messages over 2 days), the engineering team deployed several changes to the tool response format mid-conversation, creating a natural A/B test of agent behavior.
Before: Tool responses returned only local action confirmation.

```json
// update_candidate_status response (BEFORE)
{
  "success": true,
  "message": "Candidate status changed from \"pending\" to \"tier1\"."
}

// bulk_update_candidate_status response (BEFORE)
{
  "success": true,
  "updated": 7,
  "message": "7 candidates updated to \"tier1\"."
}
```

After: Every mutation tool response includes a full state snapshot.
```json
// update_candidate_status response (AFTER)
{
  "success": true,
  "message": "Candidate status changed from \"pending\" to \"tier1\".",
  "status_counts": {
    "pending": 3208,
    "tier1": 44,
    "tier2": 4,
    "on_hold": 0,
    "approved": 0,
    "declined": 5
  }
}
```

Additionally, a new filter parameter (factor_scores) was added to bulk_update_candidate_status, enabling per-scoring-factor filtering (e.g., "decline all candidates with Seniority score ≤ 4").
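The response change is cheap to implement server-side. A minimal sketch of the pattern, using a hypothetical in-memory store and handler names (the production system's internals are not shown in this case study):

```python
from collections import Counter

# Hypothetical in-memory candidate store; the real system's storage layer is not shown.
CANDIDATES = {i: "pending" for i in range(10)}

ALL_STATUSES = ("pending", "tier1", "tier2", "on_hold", "approved", "declined")

def status_counts() -> dict:
    """Aggregate the full pipeline state into the 6-field summary."""
    counts = Counter(CANDIDATES.values())
    # Emit every status, even when zero, so the agent sees the full distribution.
    return {s: counts.get(s, 0) for s in ALL_STATUSES}

def update_candidate_status(candidate_id: int, new_status: str) -> dict:
    old = CANDIDATES[candidate_id]
    CANDIDATES[candidate_id] = new_status
    return {
        "success": True,
        "message": f'Candidate status changed from "{old}" to "{new_status}".',
        # The one-line change: attach a global state snapshot to every mutation.
        "status_counts": status_counts(),
    }
```

Because the snapshot is a single aggregation over the candidate table, it adds one cheap query per mutation while upgrading every response from a local confirmation to a global observation.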
We analyzed all assistant messages from conversation #38, comparing:
- Morning session (06:15–06:36 UTC): 5 messages, no status_counts in responses
- Afternoon session (15:47–17:30 UTC): 10 messages, status_counts in every mutation response
Both sessions involved the same task (candidate triage), same model, same system prompt, same user. The tool response format was the only infrastructure change.
We measured:
- Iterations per message (how long the agent worked before stopping)
- Candidates processed per iteration (efficiency)
- Tool call patterns (individual vs. bulk operations)
- Thinking logs (the agent's internal reasoning about strategy)
| Session | Avg Iterations | Max Iterations | Messages Hitting Limit |
|---|---|---|---|
| Morning (no status_counts) | ~11 | 13 | 0 |
| Afternoon (with status_counts) | ~34 | 51 | 2 |
Without state feedback, the agent's default behavior was: work for 8–13 iterations, then stop and report progress. With state feedback, the agent continued working until it either completed the task or hit the iteration limit.
| MSG | Time | Iterations | Candidates/Iteration | Strategy |
|---|---|---|---|---|
| 661 (before) | 06:15 | 12 | 1.3 | Individual review |
| 663 (before) | 06:18 | 13 | 1.3 | Individual review |
| 815 (after) | 16:04 | 50 | 1.3 | Individual review, hit limit |
| 823 (after) | 16:49 | 42 | 72.5 | Batch + individual hybrid |
In MSG 823, the agent processed all 3,261 candidates in a single message — using score-band bulk operations for clear cases and individual review for borderline candidates.
Before (no state feedback):
The agent's thinking logs show local, candidate-focused reasoning:
> "Nils Aldag — CEO now, not CFO → Tier 2"
> "Zahra Sadry — Chief of Staff to the CFO → decline"
After reviewing 10–12 candidates, it stops to report. It has no information about how much work remains.
After (with state feedback):
The agent's thinking shows global, strategic reasoning:
> "174 candidates at score 21. Let me batch them all to Tier 2 in two batches."
> "There are still 56 pending candidates. Let me check what score those are and deal with them."
> "The 21-score band is fully cleared. Let me now continue to the 20-score band."
The agent tracks pending going down (3208 → 2500 → 1800 → 900 → 56 → 0) and uses this as both a progress indicator and a stopping condition.
| Session | get_candidate_details | update_candidate_status | bulk_update | Bulk Ratio |
|---|---|---|---|---|
| Morning (before) | 65% | 30% | 5% | 0.05 |
| Afternoon peak (after) | 17% | 15% | 44% | 1.38 |
The agent shifted from predominantly individual operations to a hybrid strategy, using bulk operations for score bands and individual review for candidates near decision boundaries.
A naive explanation would be: "The agent gets more information, so it makes better decisions." But status_counts doesn't help the agent evaluate any individual candidate better. The candidate details, scores, and factors are identical.
The real mechanism is that status_counts transforms the agent from open-loop to closed-loop operation:
Open-loop (before): The agent executes actions without observing their cumulative effect on global state. It must rely on internal estimates of progress, which default to conservative stopping ("I've done enough, let me report").
Closed-loop (after): Every action returns the full system state. The agent observes its own impact on the world and adjusts strategy accordingly. It sees pending: 3208 and knows the task is large. It sees pending: 56 and knows it's nearly done.
This maps directly to the POMDP (Partially Observable Markov Decision Process) framework used in MCP-Bench (2025): the agent operates in a partially observable environment, and richer observations enable better policy selection.
1. Stopping Behavior
Without state feedback, the agent has no principled stopping criterion beyond "I've worked for N iterations." With state feedback, it has a clear goal: pending → 0. This explains the 3x increase in average iterations — the agent keeps working because it can see the work isn't done.
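The difference between the two stopping rules can be made concrete. A minimal sketch, with illustrative function names and loop structure (not the production agent loop):

```python
def run_open_loop(step, max_iterations=13):
    """Open-loop: no state feedback, so the only stopping rule is an iteration budget."""
    for _ in range(max_iterations):
        step()
    return max_iterations  # always stops at the budget, done or not

def run_closed_loop(step, iteration_limit=51):
    """Closed-loop: each step returns status_counts, giving a principled goal test."""
    for i in range(1, iteration_limit + 1):
        counts = step()["status_counts"]
        if counts["pending"] == 0:
            return i  # goal reached: stop early with the work actually finished
    return iteration_limit  # otherwise work until the limit, as in MSG 815
```

The closed-loop version stops for exactly one of two reasons, pending reached zero or the budget ran out, which matches the afternoon behavior; the open-loop version can only ever stop at its budget, regardless of how much work remains.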
2. Strategy Selection
Seeing pending: 3208 in the first response, the agent immediately understands that individual review of 3,000+ candidates is infeasible within iteration limits. It spontaneously switches to batch operations — a strategy it never attempted in the morning session despite having the same bulk tools available.
3. Progress Tracking as Reinforcement
Each decreasing pending count acts as a reinforcement signal. The agent observes its actions having measurable impact (pending decreasing by hundreds) and this reinforces the batch strategy. Without this signal, the agent has no way to know if its approach is effective at scale.
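The signal the agent reads off this trace is just the per-step decrease in pending. A small illustrative helper (names are hypothetical):

```python
def progress_deltas(pending_trace):
    """Convert a sequence of observed pending counts into per-step progress signals."""
    return [prev - cur for prev, cur in zip(pending_trace, pending_trace[1:])]

# The pending trace observed in the afternoon session
trace = [3208, 2500, 1800, 900, 56, 0]
progress_deltas(trace)  # → [708, 700, 900, 844, 56]
```

Large positive deltas confirm that the batch strategy is working at scale; a run of small or zero deltas would tell the agent its current approach has stalled.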
Our findings align with several published results:
One study tested LLM agents with three types of environmental feedback: passive scene descriptions, active scene descriptions, and success-detection signals. Key finding: "Richer, more structured feedback from the environment significantly improved high-level instruction completion." Without feedback, the LLM plans open-loop and fails to recover from errors.
Another paper observes: "The lack of global information leads to greedy decisions resulting in sub-optimal solutions, and irrelevant information acquired from the environment introduces noise and incurs additional cost." This directly explains our morning session behavior: without global state, the agent made locally greedy decisions ("reviewed 10, time to report").
In a third system, closed-loop state feedback improved task success rate by 17% over the state of the art: "Without feedback, the Brain-LLM cannot detect or recover from execution failures."
MCP-Bench (2025) formalizes tool-using agents as POMDPs and demonstrates that multi-turn, observation-based agents consistently outperform one-shot planners. Observability is the key variable.
Tool responses must be epistemically informative — they should reduce the agent's uncertainty about the state of the world. If a response doesn't help the agent understand what's happening, the call is wasted. Our status_counts is a textbook example: it converts every mutation from "local confirmation" to "global state update."
A further benchmark introduces "progress rate" as a metric that measures incremental advancement toward goals. Its key finding: agents that receive fine-grained progress observations make more effective subsequent decisions, while coarse-grained success/failure signals mask actionable information.
This is an observational study on a single conversation, not a controlled experiment. Several confounds exist:
- Task scope differed. The morning session reviewed ~800 Tier 2 candidates. The afternoon session triaged 3,261 candidates after a full reset and re-scoring. The larger task naturally demands more iterations.
- User instructions evolved. Morning: "Review tier 2 for hidden gems." Afternoon: "Start triaging. You can exclude all candidates who don't have serious CFO or VP Finance level in batches." The afternoon instruction explicitly authorizes batch operations.
- Scoring criteria changed. Between sessions, the scorecard was rebuilt from 5 factors to 3, changing the scoring landscape entirely.
- Conversation context accumulated. By the afternoon, the agent had 100+ messages of context about this project, potentially improving its task understanding.
- The user caught over-batching. In MSG 824, the user asked "Did you actually look at all candidates individually?" and in MSG 826 instructed "Do not bulk assign — please go back and check them manually." Even after this correction, the agent maintained higher iteration counts (26–51 vs. the morning's 8–13), suggesting the status_counts effect on stopping behavior persists independently of bulk strategy.
For teams building LLM-agent tool APIs:
- Include global state in every mutation response. Don't just confirm the action — show the resulting state of the world. This is cheap to compute and fundamentally changes agent planning.
- Provide progress-relevant counters. If the agent is working through a list, include the remaining count. If it is managing a pipeline, include the stage distribution. The agent will use these as both stopping conditions and strategy signals.
- Design tool responses as observations in a POMDP. Ask: "After this response, can the agent determine how close it is to the goal?" If not, add the missing information.
- Don't confuse tool descriptions with tool responses. Better tool descriptions help the agent choose the right tool. Better tool responses help the agent plan its next action. Both matter, but response design is under-studied and often overlooked.
- Be cautious with high-efficiency tools. Our factor_scores bulk filter enabled 72.5 candidates/iteration — but the user had to intervene because the agent skipped individual review. High-leverage tools need guardrails or confirmation steps for destructive batch operations.
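One way to implement such a guardrail is a size threshold that downgrades oversized bulk mutations to a confirmation request. A hypothetical sketch (the production system's guardrails, if any, are not described in this case study):

```python
BULK_CONFIRM_THRESHOLD = 50  # hypothetical cutoff; tune per risk tolerance

def bulk_update_with_guardrail(candidate_ids, new_status, confirmed=False):
    """Require explicit confirmation before large destructive batch updates."""
    if len(candidate_ids) > BULK_CONFIRM_THRESHOLD and not confirmed:
        # Refuse the mutation and tell the agent how to proceed deliberately.
        return {
            "success": False,
            "needs_confirmation": True,
            "message": (f"This would update {len(candidate_ids)} candidates to "
                        f'"{new_status}". Re-call with confirmed=true to proceed.'),
        }
    # ... perform the update here, then return the status_counts snapshot as usual
    return {"success": True, "updated": len(candidate_ids)}
```

The refusal message doubles as an observation: the agent learns both that the action was blocked and what a compliant retry looks like, rather than silently failing.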
A small change to tool response format — adding 6 fields of state summary to every mutation response — produced a 3x increase in sustained agent work and a 55x increase in peak throughput. The mechanism is not "more data" but a shift from open-loop to closed-loop operation: the agent gains continuous observability of its own impact on the world, enabling principled stopping, strategic tool selection, and progress-driven persistence.
This finding has implications for anyone building tool-using LLM agents: the design of what your tools return may matter as much as the design of what your tools do.
This analysis was conducted on a production AI recruitment co-pilot system. The conversation analyzed contained 187 messages over 2 days, processing 3,261 candidates across 277 companies. All data is from real usage, not synthetic benchmarks.
Published: February 27, 2026