@stefanstranger
Created March 8, 2026 13:08

🏛️ LLM Council — Final Ruling: GitHub Copilot CLI Model Assessment

Date: 2026-03-07
Process: 17 models self-assessed → each cross-reviewed the other 16 (anonymized) → ranked by accuracy + insight
Chairman: Claude Opus 4.6


Stage 1: Individual Self-Assessments

All 17 available models in the GitHub Copilot CLI were asked to honestly describe their strengths, weaknesses, ideal use cases, and hand-off logic. Responses ranged from comprehensive multi-page analyses to single-paragraph summaries. One model (GPT-5.2-Codex) refused to answer entirely.

Stage 2: Anonymized Cross-Review Rankings

Each model reviewed all 16 other responses (identities hidden as Model A–Q) and ranked the top 5 by combined accuracy (honest about real capabilities/limits) and insight (useful, specific, actionable for model selection).

Consensus Rankings: Best Self-Assessments

| Rank | Model | Points | Appeared in Top-5 of | 1st-Place Votes | Key Praise |
|------|-------|--------|----------------------|-----------------|------------|
| 🥇 | Claude Opus 4.6 | 55 | 15/17 reviewers | 4 | "20/80 delegation rule" — most actionable routing heuristic |
| 🥈 | Claude Opus 4.6 (1M) | 50 | 14/17 | 3 | Clearest differentiator + best "when NOT to use me" guidance |
| 🥉 | GPT-5.1 | 43 | 11/17 | 4 | Most detailed hand-off logic mapping bottleneck type → model tier |
| 4th | Claude Haiku 4.5 | 41 | 11/17 | 4 | Sharpest scope-limiting: "I am NOT for architecture" |
| 5th | Claude Sonnet 4.5 | 31 | 15/17 | 0 | Consistent decision-tree framework, but never ranked #1 |
| 6th | Claude Sonnet 4.6 | 12 | 6/17 | 1 | Credible "daily driver" positioning |
| 7th | GPT-5.2 | 9 | 5/17 | 1 | Rare honesty about hallucination risk |
| 8th | GPT-5.3-Codex | 7 | 2/17 | 0 | Concise "execution engine" framing |
| 9th | Gemini 3 Pro Preview | 4 | 3/17 | 0 | Preview instability disclosure built trust |
| 10th (tie) | GPT-5.1-Codex | 2 | 2/17 | 0 | Specific hand-off examples |
| 10th (tie) | Claude Sonnet 4 | 2 | 1/17 | 0 | Less differentiated |
| | Claude Opus 4.5, GPT-5.1-Codex-Max, GPT-5.1-Codex-Mini, GPT-5-mini, GPT-4.1 | 0 | 0/17 | 0 | Too brief or generic |
| | GPT-5.2-Codex | 0 | 0/17 | 0 | Refused to answer — universally panned |

Flagged Concerns Across Reviewers

  • GPT-5.2-Codex refused to self-assess — every reviewer flagged this as providing zero utility
  • "Senior engineer" was claimed by 5+ models — overcrowded, undifferentiated
  • Numeric claims ("2-5x faster", "70% of tasks", "best quality-per-dollar") are directionally useful but unverifiable
  • Multiple models claiming "default" status is contradictory — only one can truly be the default

Stage 3: The Chairman's Synthesized Model Map

🏗️ Tier 1: Premium Reasoning (The Architects)

| Model | Model ID | When to Use | When NOT to Use |
|-------|----------|-------------|-----------------|
| Claude Opus 4.6 | `claude-opus-4.6` | Hard bugs, architecture decisions, ambiguous requirements, cross-cutting analysis, high-stakes one-shot code | Simple edits, fast iteration, cost-sensitive work |
| Claude Opus 4.5 | `claude-opus-4.5` | Same niche as Opus 4.6 but older; use if 4.6 is unavailable | Same as above; 4.6 is generally preferred |
| Claude Opus 4.6 (1M) | `claude-opus-4.6-1m` | When context exceeds ~100K tokens: whole-codebase analysis, massive logs, large document review, cross-file migration | If context fits in <100K tokens — you're paying for an unused context window and getting slower responses |

Council consensus: Use Opus for the 20% of tasks that are genuinely hard. The 80/20 rule was the single most praised heuristic. The 1M variant is only justified when context size is the bottleneck.
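The ~100K-token cutoff can be checked without calling a tokenizer. A minimal sketch, assuming the common rule of thumb of roughly 4 characters per token for English text and code (the ratio is an assumption here, not a council figure):

```python
# Rough heuristic for the "Opus vs. Opus (1M)" decision: estimate prompt size
# and reach for the 1M context window only when it is actually needed.
# The 4-characters-per-token ratio is a rule-of-thumb assumption, not exact.
def needs_1m_context(prompt_text: str, threshold_tokens: int = 100_000) -> bool:
    estimated_tokens = len(prompt_text) // 4
    return estimated_tokens > threshold_tokens
```

For a precise count you would use the model's actual tokenizer; this estimate is only meant to catch the obvious cases (a whole codebase or a massive log clearly exceeds the threshold, a single file clearly does not).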


⚙️ Tier 2: Standard Workhorses (The Tech Leads)

| Model | Model ID | When to Use | When NOT to Use |
|-------|----------|-------------|-----------------|
| Claude Sonnet 4.6 | `claude-sonnet-4.6` | Default for agentic CLI work: multi-file refactoring, debugging, code review, tool orchestration. The "daily driver." | Formal proofs, extremely deep reasoning chains, ultra-cheap bulk work |
| Claude Sonnet 4.5 | `claude-sonnet-4.5` | Reliable alternative to 4.6; strong quality-per-dollar on moderate-complexity tasks | Cutting-edge knowledge (4.6 has fresher training data) |
| Claude Sonnet 4 | `claude-sonnet-4` | General coding + reasoning at good speed | Less differentiated; prefer 4.5 or 4.6 for most tasks |
| GPT-5.1 | `gpt-5.1` | Best all-rounder from the OpenAI family. Mixed tasks: coding + reasoning + writing + planning | Pure heavy coding (use Codex variants), cheapest throughput |
| GPT-5.2 | `gpt-5.2` | Day-to-day engineering builder/debugger. Uniquely honest about hallucination risk — strongest when it can verify via tests | Deep theoretical reasoning, creative writing |
| Gemini 3 Pro Preview | `gemini-3-pro-preview` | Complex reasoning with long context. Fresh perspective from a different model family | Production-critical work (Preview stability concerns), simple tasks |

Council consensus: This tier handles ~70% of daily work. Claude Sonnet 4.6 emerged as the most credible "default" choice for the CLI, with GPT-5.1 as the strongest OpenAI alternative.


🔧 Tier 3: Code-Specialized (The Implementers)

| Model | Model ID | When to Use | When NOT to Use |
|-------|----------|-------------|-----------------|
| GPT-5.3-Codex | `gpt-5.3-codex` | Surgical bug fixes, CI debugging, turning specs into code. "Execution engine." | Creative work, abstract reasoning, multilingual content |
| GPT-5.2-Codex | `gpt-5.2-codex` | (Refused to self-assess — use cautiously, benchmark against alternatives) | |
| GPT-5.1-Codex-Max | `gpt-5.1-codex-max` | High-precision multi-file refactors, API/SDK integration, complex SQL/KQL | Speed-sensitive work, creative writing |
| GPT-5.1-Codex | `gpt-5.1-codex` | Large-scope code work with strong reasoning. "Precision engineer." | Fast ideation, purely creative tasks |

Council consensus: The Codex variants are fine-tuned for code and excel at mechanical, well-specified implementation tasks. Use them when the spec is clear and you need reliable code output.


⚡ Tier 4: Fast/Cheap (The Executors)

| Model | Model ID | When to Use | When NOT to Use |
|-------|----------|-------------|-----------------|
| Claude Haiku 4.5 | `claude-haiku-4.5` | Quick edits, file searches, CLI commands, test running, boilerplate. 2-5x faster than Sonnet. | Architecture, complex debugging, ambiguous problems |
| GPT-5.1-Codex-Mini | `gpt-5.1-codex-mini` | Quick bug patches, shell tweaks, helper scripts. CLI-native. | Deep research, large architectural changes |
| GPT-5-mini | `gpt-5-mini` | Code completions, scaffolding, summaries, rapid iteration | Deep reasoning, formal proofs, creative nuance |
| GPT-4.1 | `gpt-4.1` | Balanced default for simple-to-moderate tasks. Reliable and safe. | Complex multi-step analysis, specialized domains |

Council consensus: Use these for 80% of routine work. Haiku was unanimously praised for its honest self-scoping — it explicitly says what it can't do, which builds trust.


The Decision Flowchart

```
                    Is the task simple & well-defined?
                   /                                  \
                 YES                                   NO
                  |                                     |
         Is speed/cost critical?              Does it require deep reasoning
         /            \                       or architectural thinking?
       YES             NO                    /                        \
        |               |                  YES                         NO
   ┌────┴────┐    ┌─────┴─────┐     ┌──────┴───────┐          ┌───────┴───────┐
   │ Haiku   │    │ Sonnet or │     │ Does context │          │ Standard task │
   │ GPT-mini│    │ GPT-5.1   │     │ exceed 100K? │          │ with coding   │
   │ GPT-4.1 │    │           │     │  /       \   │          │               │
   └─────────┘    └───────────┘     │ YES      NO  │          │ Sonnet 4.6    │
                                    │  |       |   │          │ GPT-5.1       │
                                    │ Opus    Opus │          │ Codex variant │
                                    │ 1M      4.6  │          └───────────────┘
                                    └──────────────┘
```
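The flowchart above can be sketched as a small routing function. This is a hypothetical helper, not part of the Copilot CLI: the `Task` fields and the fallback choices are illustrative assumptions, while the model IDs come from the tier tables.

```python
# Hypothetical router implementing the council's decision flowchart.
# The Task fields and default fallbacks are illustrative assumptions;
# the model IDs come from the tier tables above.
from dataclasses import dataclass


@dataclass
class Task:
    simple: bool          # is the task simple & well-defined?
    speed_critical: bool  # is speed/cost critical?
    deep_reasoning: bool  # needs deep reasoning / architectural thinking?
    context_tokens: int   # estimated prompt size in tokens


def pick_model(task: Task) -> str:
    if task.simple:
        if task.speed_critical:
            return "claude-haiku-4.5"   # or gpt-5-mini / gpt-4.1
        return "claude-sonnet-4.6"      # or gpt-5.1
    if task.deep_reasoning:
        if task.context_tokens > 100_000:
            return "claude-opus-4.6-1m"  # only when context is the bottleneck
        return "claude-opus-4.6"
    return "claude-sonnet-4.6"          # standard coding task; or a Codex variant
```

For example, a hard architectural question over a 250K-token codebase dump routes to `claude-opus-4.6-1m`, while the same question over a single file routes to `claude-opus-4.6`.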

The Optimal Session Pattern

The fastest, most cost-effective path through a development session:

  1. Haiku / GPT-mini → Explore the codebase, run searches, quick edits
  2. Sonnet 4.6 / GPT-5.1 → Implement features, debug, refactor, review
  3. Opus 4.6 → Escalate for hard problems, architecture decisions, subtle bugs
  4. Opus 4.6 (1M) → Only when context size is the bottleneck (>100K tokens)
  5. Codex variants → Mechanical code generation from clear specs
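The five-step ladder above can be written down as a simple lookup. The phase names here are illustrative labels invented for this sketch; only the model IDs are taken from the tier tables.

```python
# Illustrative mapping of session phase -> default model ID, following the
# council's escalation ladder. Phase names are assumptions for this sketch.
SESSION_LADDER = {
    "explore": "claude-haiku-4.5",        # codebase search, quick edits
    "implement": "claude-sonnet-4.6",     # features, debugging, refactor, review
    "escalate": "claude-opus-4.6",        # hard problems, architecture, subtle bugs
    "big_context": "claude-opus-4.6-1m",  # only when prompt exceeds ~100K tokens
    "codegen": "gpt-5.3-codex",           # mechanical code from clear specs
}


def model_for(phase: str) -> str:
    # Fall back to the tier-2 "daily driver" for anything unrecognized.
    return SESSION_LADDER.get(phase, "claude-sonnet-4.6")
```

The design point is that escalation is explicit: a session starts cheap and fast, and each step up the ladder is a deliberate choice, not a default.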

Meta-Observations from the Council Process

  1. Self-awareness correlates with quality. The models that produced the best self-assessments (Opus 4.6, Haiku 4.5, GPT-5.1) were also the ones that provided the most specific, falsifiable claims and honest limitations.

  2. Boundary-setting > capability-listing. Reviewers across all 3 model families (Claude, GPT, Gemini) agreed: knowing when NOT to use a model is more valuable than knowing what it can do.

  3. The "senior engineer" trap. Multiple models claimed the same title without differentiation. In practice, the fleet works best as a team with clear roles, not as competing generalists.

  4. Cross-vendor agreement was high. GPT models ranked Claude responses highly, and vice versa. The anonymization worked — there was no detectable vendor bias in the rankings.

  5. GPT-5.2-Codex's refusal was the worst possible answer. Every reviewer flagged it. Even an imperfect self-assessment provides more user value than refusing to engage.


Appendix: Individual Model Self-Assessments (Condensed)

Claude Opus 4.6

"I am the senior engineer in the fleet. Use faster/cheaper models for the 80% of tasks that are straightforward. Escalate to me for the 20% that require deep reasoning, careful judgment, or where getting it wrong is expensive. Use me to plan, then delegate execution to faster models via sub-agents."

Claude Opus 4.5

"I'm expensive and slow but highly capable. Don't use me to grep files or run tests—use me to figure out WHY the tests are failing and HOW to fix the architecture."

Claude Opus 4.6 (1M)

"My 1M context window is not about being 'better' — it's about being necessary when the problem is too large for any other model to hold in its head at once. If the problem fits in a smaller window, use a smaller, faster model."

Claude Sonnet 4.6

"In the GitHub Copilot CLI specifically, I'm the default for good reason: I'm fast enough to feel interactive, capable enough to handle real codebases, and honest enough to tell you when something is outside my confidence."

Claude Sonnet 4.5

"I'm the senior engineer who ships solid work efficiently — not the genius architect, not the junior doing grunt work, but the dependable tech lead who gets complex features done right."

Claude Sonnet 4

"I'm the 'senior engineer' — handle complex analysis, architecture decisions, and nuanced problem-solving. Use faster models for execution of simple tasks I identify, and defer to more powerful models for research or extreme complexity."

Claude Haiku 4.5

"I am the fast, practical executor. I thrive in the CLI environment handling real, defined tasks at speed. I'm the 'get-it-done' model—not the 'think-deeply' model."

Gemini 3 Pro Preview

"I am the 'Senior Engineer' or 'Architect' in your workflow. Use lighter models for the grunt work; call me in for the difficult, high-stakes engineering tasks."

GPT-5.3-Codex

"Choose me when you need a model that can ship code changes accurately and drive tasks to completion in a terminal workflow."

GPT-5.2-Codex

Refused to answer, citing inability to claim capabilities without official model specs.

GPT-5.2

"Choose me for most day-to-day engineering work where you want high-quality coding + practical debugging and can validate with tooling."

GPT-5.1-Codex-Max

"Choose me for precise, reliable coding and structured fixes; avoid me for cheapest throughput or open-ended creative/multilingual flair."

GPT-5.1-Codex

"I'm the precision engineer: let lighter or more creative models handle ideation or summaries, then hand off to me for the exact implementation and verification."

GPT-5.1

"Choose me when you want a strong all-rounder: high-quality reasoning, explanation, and code—especially for mixed tasks that aren't purely 'max reasoning' or 'max throughput.'"

GPT-5.1-Codex-Mini

"Opt for me when you need sprinty CLI-focused coding help, avoid me for sprawling plans/creativity."

GPT-5-mini

"Choose me when you want a fast, reliable assistant for pragmatic developer workflows, concise reasoning, and code-centric tasks."

GPT-4.1

"I fit best as a versatile, default option in a multi-model workflow—handling most tasks well, and deferring to specialists when the task demands it."
