@terrhorn
Created March 11, 2026 15:06

Comparative Critique: MCP Tool Call Flow Research Reports

Two research reports trace how a tool call flows through the Model Context Protocol. This critique compares their depth, quality, and gaps.

Dimension    Report A (mcp-tool-call-flow.md)    Report B (how-does-a-tool-call-flow...)
Length       ~450 lines                          ~700 lines (~32KB)
Structure    Phase-based (4 phases)              Layer-based (4 layers, 9 numbered steps)
Session      27ed82ad                            f349ea56

Depth Comparison

Report A: Breadth-Oriented

Report A covers the protocol lifecycle in four clean phases (Negotiation → Discovery → Invocation → Updates) and then branches into topics Report B omits entirely:

  • Sampling/Agentic context — a full mermaid sequence diagram showing how tools participate in sampling/createMessage multi-turn loops, with the server executing tools locally. This is a significant protocol feature that Report B ignores completely.
  • Security considerations — a structured breakdown of server responsibilities (input validation, rate limiting, output sanitization), client responsibilities (user confirmation, timeouts, audit logging), and the trust model around annotations.
  • Python SDK patterns — traces the decorator-based registration pattern in the fetch reference server using Pydantic, showing that the tool call pattern is cross-language.
  • Client-side middleware — documents the composable fetch middleware pipeline (withOAuth, withLogging, custom middleware), which is the mechanism real-world HTTP clients use.
  • Content types — enumerates all five content block types (TextContent, ImageContent, AudioContent, ResourceLink, EmbeddedResource), while Report B only shows text.
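
The five block types Report A enumerates can be sketched as a discriminated union. Type names follow Report A's list and field names follow the MCP specification, but the fields shown here are trimmed for illustration, not exhaustive:

```typescript
// Sketch of the five content block shapes; field names follow the MCP spec,
// with optional fields trimmed for brevity.
type TextContent = { type: "text"; text: string };
type ImageContent = { type: "image"; data: string; mimeType: string }; // base64 payload
type AudioContent = { type: "audio"; data: string; mimeType: string }; // base64 payload
type ResourceLink = { type: "resource_link"; uri: string; name: string };
type EmbeddedResource = {
  type: "resource";
  resource: { uri: string; mimeType?: string; text?: string };
};
type ContentBlock =
  | TextContent
  | ImageContent
  | AudioContent
  | ResourceLink
  | EmbeddedResource;

// A tool result can mix block types; clients dispatch on the `type` discriminant.
const content: ContentBlock[] = [
  { type: "text", text: "Fetched 1 resource" },
  { type: "resource_link", uri: "file:///logs/app.log", name: "app.log" },
];
const kinds = content.map((block) => block.type);
console.log(kinds); // → ["text", "resource_link"]
```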

Report B: Depth-Oriented

Report B traces the exact same core flow but at a significantly deeper level of implementation detail:

  • 9 numbered steps with actual source code at each layer — not just the API surface, but the internal dispatch mechanisms (_requestWithSchema, _onrequest, _responseHandlers map keying by message ID).
  • Precise citations — 23 footnotes with file:line references (e.g., packages/core/src/shared/protocol.ts:761-870). Report A cites files but rarely lines.
  • Complete end-to-end ASCII trace — a ~60-line diagram tracing a single echo call from client.callTool() through every internal method, across the wire, through server dispatch, validation, execution, and back. This is the most valuable artifact in either report.
  • 5 validation checkpoints — explicitly enumerates every point where validation occurs (capability check, request schema, input args, result schema, output schema). Report A mentions validation exists but doesn't map the full pipeline.
  • Confidence assessment — explicitly states what's high-confidence vs. medium-confidence and documents assumptions. Report A presents everything as equally certain.
  • Task-based execution guard — documents the isToolTaskRequired() check that rejects tools requiring the experimental tasks API. Report A mentions execution as a Tool field but doesn't trace the runtime behavior.
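
The response-routing mechanism Report B documents (the _responseHandlers map keyed by message ID) reduces to a small correlation pattern. The following is a hedged sketch of that pattern, not the SDK's actual code; RequestCorrelator and its method names are invented here:

```typescript
// Hedged sketch of JSON-RPC request/response correlation — the mechanism
// behind the SDK's _responseHandlers map. Class and method names are
// illustrative, not the SDK's.
type JsonRpcResponse = { jsonrpc: "2.0"; id: number; result?: unknown };

class RequestCorrelator {
  private nextId = 0;
  private handlers = new Map<number, (res: JsonRpcResponse) => void>();

  // Assign a fresh message id, register a resolver keyed by it, and hand back
  // a promise that settles when the matching response arrives.
  send(method: string, params: unknown): Promise<JsonRpcResponse> {
    const id = this.nextId++;
    void method; void params; // the real SDK serializes these to the transport
    return new Promise((resolve) => this.handlers.set(id, resolve));
  }

  // Transport delivers a response: look up the resolver by id and settle it.
  // Returns whether the response matched a pending request.
  onResponse(res: JsonRpcResponse): boolean {
    const handler = this.handlers.get(res.id);
    if (!handler) return false;
    this.handlers.delete(res.id);
    handler(res);
    return true;
  }
}

const correlator = new RequestCorrelator();
const pending = correlator.send("tools/call", { name: "echo" });
const matched = correlator.onResponse({ jsonrpc: "2.0", id: 0, result: { content: [] } });
console.log(matched); // → true
pending.then((res) => console.log(res.id)); // → 0
```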

Quality Assessment

Report A Strengths

  • Better architectural overview — the layered ASCII diagram showing McpServer → Server → Protocol → Transport is cleaner and easier to scan than Report B's more detailed but denser version.
  • Better breadth — covers cross-cutting concerns (security, sampling, middleware, Python) that a reader needs for full protocol understanding.
  • Cleaner mermaid diagrams — the invocation sequence diagram is immediately legible.
  • More complete Tool schema table — includes title, execution, and all annotation fields with their purposes.
  • Annotation defaults — one caveat here: Report B lists the annotation defaults (readOnlyHint=false, destructiveHint=true, etc.) while Report A does not, so this particular point is a Report B strength.

Report B Strengths

  • Significantly more rigorous — every claim traces to source code with line numbers. If the SDK changes, you know exactly which claims to re-verify.
  • Better for implementers — someone building an MCP client or server could follow Report B step-by-step and understand the actual code they'll interact with.
  • The end-to-end trace is exceptional — no other artifact in either report provides as much clarity on the actual runtime flow.
  • Honest about uncertainty — the confidence assessment is a mark of research quality that Report A lacks.
  • Schema interface definitions — includes the actual TypeScript interface definitions from schema.ts, not just field tables.

Report A Weaknesses

  • No source citations below file level — you can't verify claims without re-reading entire files.
  • No confidence assessment — presents all information with equal certainty, which is misleading for draft-version features.
  • Sampling diagram is uncited — the agentic flow diagram is valuable but has no specification reference beyond a file name.

Report B Weaknesses

  • Missing sampling/agentic context — this is the biggest gap. The multi-turn tool use loop via sampling/createMessage is a core protocol capability.
  • Missing security analysis — no discussion of the trust model, rate limiting, or client-side safeguards.
  • TypeScript-only — doesn't acknowledge the Python SDK or show that the pattern is cross-language.
  • No middleware coverage — omits the HTTP middleware pipeline that real-world clients use.
  • Verbose — at 32KB, it's harder to use as a quick reference. The depth is earned but could benefit from a summary section.

Identified Gaps (Neither Report Covers)

  1. Cancellation flow — both reports mention AbortController/AbortSignal in passing but neither traces what happens when a client sends notifications/cancelled for an in-flight tools/call. The spec supports this and the SDK implements it.

  2. Progress notifications — tools/call supports progress tokens via _meta.progressToken. Neither report traces how a long-running tool sends notifications/progress back to the client during execution.

  3. Pagination mechanics — Report A mentions cursor-based pagination for tools/list; neither report shows the actual cursor flow or how a client iterates through a large tool set.

  4. Transport-specific behaviors — neither report addresses how the tool call flow differs across transports. For example, Streamable HTTP requires session management headers, and SSE has different request/response semantics than stdio.

  5. Error recovery patterns — the two-level error model is documented, but neither report discusses what happens after an error: retry strategies, how LLMs use isError responses to self-correct, or client-side error handling patterns.

  6. Tool output schema validation asymmetry — Report B documents that the client validates structuredContent against outputSchema, but neither report discusses the design decision: why is output validation done on both server AND client? What happens when they disagree?

  7. Dynamic tool registration at runtime — both mention notifications/tools/list_changed, but neither traces the full flow of a server adding a tool after initialization and the client re-discovering it.

  8. Authorization/authentication — Report A mentions security broadly, but neither traces how auth flows through tool calls, particularly in HTTP transports where authInfo is extracted from the request and passed through the context object.
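
As context for gaps 1 and 2, the wire shapes involved are small. Field names below follow the MCP specification; the request id, token value, and tool name are invented for illustration:

```typescript
// Wire shapes for gaps 1 and 2. Field names follow the MCP spec; the id,
// token value, and tool name are invented for illustration.
const callRequest = {
  jsonrpc: "2.0" as const,
  id: 42,
  method: "tools/call",
  params: {
    name: "slow_tool",
    arguments: {},
    // Opting into progress updates: the server echoes this token back.
    _meta: { progressToken: "tok-1" },
  },
};

// Server → client progress update for the in-flight call (gap 2).
const progressNotification = {
  jsonrpc: "2.0" as const,
  method: "notifications/progress",
  params: { progressToken: "tok-1", progress: 3, total: 10 },
};

// Client → server cancellation referencing the request's JSON-RPC id (gap 1).
const cancelNotification = {
  jsonrpc: "2.0" as const,
  method: "notifications/cancelled",
  params: { requestId: callRequest.id, reason: "User aborted" },
};

// Notifications carry no id of their own, so neither side replies to them.
console.log("id" in progressNotification, "id" in cancelNotification); // → false false
```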
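For gap 3, the cursor contract from the spec (an opaque cursor in the request, nextCursor in the result, absent on the last page) drives a simple client loop. The stub server, page size, and offset-encoded cursor below are invented for illustration:

```typescript
// Sketch of cursor-based tools/list iteration (gap 3). The paging contract is
// from the spec; the stub server and its cursor encoding are invented.
type ToolsPage = { tools: string[]; nextCursor?: string };

const allTools = ["echo", "fetch", "search", "summarize", "translate"];

// Stand-in for a server handling tools/list: here the opaque cursor is just
// an offset token, but clients must treat it as opaque.
function listTools(cursor?: string): ToolsPage {
  const start = cursor ? Number(cursor) : 0;
  const page = allTools.slice(start, start + 2);
  const next = start + 2 < allTools.length ? String(start + 2) : undefined;
  return { tools: page, nextCursor: next };
}

// Client side: keep requesting with the returned cursor until nextCursor is absent.
const discovered: string[] = [];
let cursor: string | undefined = undefined;
do {
  const page: ToolsPage = listTools(cursor);
  discovered.push(...page.tools);
  cursor = page.nextCursor;
} while (cursor !== undefined);

console.log(discovered.length); // → 5
```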


Verdict

Report B is the stronger research artifact for its intended purpose (understanding the implementation). The footnoted citations, validation checkpoint map, and end-to-end trace demonstrate genuine source-level investigation rather than summarization.

Report A is the better reference document for someone who needs to understand the protocol holistically — sampling, security, cross-language patterns, and middleware are all things a practitioner needs.

The ideal document would combine both: Report B's depth and rigor for the core flow, with Report A's breadth sections on sampling, security, Python patterns, and middleware appended as additional chapters. The confidence assessment from Report B should be standard practice.

Recommended Synthesis Priority

If merging into a single document:

  1. Use Report B's layer-based structure and end-to-end trace as the spine
  2. Add Report A's sampling/agentic context section (with citations added)
  3. Add Report A's security considerations section
  4. Add Report A's Python SDK patterns section
  5. Add Report A's middleware section
  6. Fill the 8 gaps identified above
  7. Add a quick-reference summary at the top for scanning