Date: 2026-03-07
Process: 17 models self-assessed → 17 cross-reviewed (anonymized) → ranked by accuracy + insight
Chairman: Claude Opus 4.6
All 17 available models in the GitHub Copilot CLI were asked to honestly describe their strengths, weaknesses, ideal use cases, and hand-off logic. Responses ranged from comprehensive multi-page analyses to single-paragraph summaries. One model (GPT-5.2-Codex) refused to answer entirely.
Each model reviewed all 16 other responses (identities hidden as Model A–Q) and ranked the top 5 by combined accuracy (honest about real capabilities/limits) and insight (useful, specific, actionable for model selection).
| Rank | Model | Points | Top-5 Appearances (of 17 reviewers) | 1st-Place Votes | Key Praise |
|---|---|---|---|---|---|
| 🥇 | Claude Opus 4.6 | 55 | 15/17 reviewers | 4 | "20/80 delegation rule" — most actionable routing heuristic |
| 🥈 | Claude Opus 4.6 (1M) | 50 | 14/17 | 3 | Clearest differentiator + best "when NOT to use me" guidance |
| 🥉 | GPT-5.1 | 43 | 11/17 | 4 | Most detailed hand-off logic mapping bottleneck type → model tier |
| 4th | Claude Haiku 4.5 | 41 | 11/17 | 4 | Sharpest scope-limiting: "I am NOT for architecture" |
| 5th | Claude Sonnet 4.5 | 31 | 15/17 | 0 | Consistent decision-tree framework, but never ranked #1 |
| 6th | Claude Sonnet 4.6 | 12 | 6/17 | 1 | Credible "daily driver" positioning |
| 7th | GPT-5.2 | 9 | 5/17 | 1 | Rare honesty about hallucination risk |
| 8th | GPT-5.3-Codex | 7 | 2/17 | 0 | Concise "execution engine" framing |
| 9th | Gemini 3 Pro Preview | 4 | 3/17 | 0 | Preview instability disclosure built trust |
| 10th | GPT-5.1-Codex | 2 | 2/17 | 0 | Specific hand-off examples |
| 10th | Claude Sonnet 4 | 2 | 1/17 | 0 | Less differentiated |
| — | Claude Opus 4.5, GPT-5.1-Codex-Max, GPT-5.1-Codex-Mini, GPT-5-mini, GPT-4.1 | 0 | 0/17 | 0 | Too brief or generic |
| — | GPT-5.2-Codex | 0 | 0/17 | 0 | Refused to answer — universally panned |
- GPT-5.2-Codex refused to self-assess — every reviewer flagged this as providing zero utility
- "Senior engineer" was claimed by 5+ models — overcrowded, undifferentiated
- Numeric claims ("2-5x faster", "70% of tasks", "best quality-per-dollar") are directionally useful but unverifiable
- Multiple models claiming "default" status is contradictory — only one can truly be the default
| Model | Model ID | When to Use | When NOT to Use |
|---|---|---|---|
| Claude Opus 4.6 | claude-opus-4.6 | Hard bugs, architecture decisions, ambiguous requirements, cross-cutting analysis, high-stakes one-shot code | Simple edits, fast iteration, cost-sensitive work |
| Claude Opus 4.5 | claude-opus-4.5 | Same niche as Opus 4.6 but older; use if 4.6 is unavailable | Same as above; 4.6 is generally preferred |
| Claude Opus 4.6 (1M) | claude-opus-4.6-1m | When context exceeds ~100K tokens: whole-codebase analysis, massive logs, large document review, cross-file migration | If context fits in <100K tokens — you're paying for an unused context window and getting slower responses |
Council consensus: Use Opus for the 20% of tasks that are genuinely hard. The 80/20 rule was the single most praised heuristic. The 1M variant is only justified when context size is the bottleneck.
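The ~100K-token threshold can be checked mechanically before choosing a variant. A minimal sketch, using the rough 4-characters-per-token heuristic (the heuristic and the helper names are assumptions; the model IDs come from the table above — real counts require the model's actual tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (~4 chars/token); a real count
    needs the provider's tokenizer."""
    return len(text) // 4

def pick_opus_variant(context: str, threshold: int = 100_000) -> str:
    """Return the Opus model ID justified by the context size."""
    if estimate_tokens(context) > threshold:
        return "claude-opus-4.6-1m"  # context size is the bottleneck
    return "claude-opus-4.6"         # smaller window: faster, cheaper

# A ~1M-character context estimates to ~250K tokens, so the 1M variant applies.
print(pick_opus_variant("x" * 1_000_000))
print(pick_opus_variant("short prompt"))
```

The point of the sketch is the decision boundary, not the estimator: if a cheap estimate is nowhere near 100K tokens, the 1M variant only adds cost and latency.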
| Model | Model ID | When to Use | When NOT to Use |
|---|---|---|---|
| Claude Sonnet 4.6 | claude-sonnet-4.6 | Default for agentic CLI work: multi-file refactoring, debugging, code review, tool orchestration. The "daily driver." | Formal proofs, extremely deep reasoning chains, ultra-cheap bulk work |
| Claude Sonnet 4.5 | claude-sonnet-4.5 | Reliable alternative to 4.6; strong quality-per-dollar on moderate-complexity tasks | Cutting-edge knowledge (4.6 has fresher training data) |
| Claude Sonnet 4 | claude-sonnet-4 | General coding + reasoning at good speed | Less differentiated; prefer 4.5 or 4.6 for most tasks |
| GPT-5.1 | gpt-5.1 | Best all-rounder from the OpenAI family. Mixed tasks: coding + reasoning + writing + planning | Pure heavy coding (use Codex variants), cheapest throughput |
| GPT-5.2 | gpt-5.2 | Day-to-day engineering builder/debugger. Uniquely honest about hallucination risk — strongest when it can verify via tests | Deep theoretical reasoning, creative writing |
| Gemini 3 Pro Preview | gemini-3-pro-preview | Complex reasoning with long context. Fresh perspective from a different model family | Production-critical work (Preview stability concerns), simple tasks |
Council consensus: This tier handles ~70% of daily work. Claude Sonnet 4.6 emerged as the most credible "default" choice for the CLI, with GPT-5.1 as the strongest OpenAI alternative.
| Model | Model ID | When to Use | When NOT to Use |
|---|---|---|---|
| GPT-5.3-Codex | gpt-5.3-codex | Surgical bug fixes, CI debugging, turning specs into code. "Execution engine." | Creative work, abstract reasoning, multilingual content |
| GPT-5.2-Codex | gpt-5.2-codex | (Refused to self-assess — use cautiously, benchmark against alternatives) | — |
| GPT-5.1-Codex-Max | gpt-5.1-codex-max | High-precision multi-file refactors, API/SDK integration, complex SQL/KQL | Speed-sensitive work, creative writing |
| GPT-5.1-Codex | gpt-5.1-codex | Large-scope code work with strong reasoning. "Precision engineer." | Fast ideation, purely creative tasks |
Council consensus: The Codex variants are fine-tuned for code and excel at mechanical, well-specified implementation tasks. Use them when the spec is clear and you need reliable code output.
| Model | Model ID | When to Use | When NOT to Use |
|---|---|---|---|
| Claude Haiku 4.5 | claude-haiku-4.5 | Quick edits, file searches, CLI commands, test running, boilerplate. 2-5x faster than Sonnet. | Architecture, complex debugging, ambiguous problems |
| GPT-5.1-Codex-Mini | gpt-5.1-codex-mini | Quick bug patches, shell tweaks, helper scripts. CLI-native. | Deep research, large architectural changes |
| GPT-5-mini | gpt-5-mini | Code completions, scaffolding, summaries, rapid iteration | Deep reasoning, formal proofs, creative nuance |
| GPT-4.1 | gpt-4.1 | Balanced default for simple-to-moderate tasks. Reliable and safe. | Complex multi-step analysis, specialized domains |
Council consensus: Use these for 80% of routine work. Haiku was unanimously praised for its honest self-scoping — it explicitly says what it can't do, which builds trust.
```
Is the task simple & well-defined?
            /                \
          YES                 NO
           |                   |
 Is speed/cost critical?   Does it require deep reasoning
      /          \         or architectural thinking?
    YES           NO            /                \
     |             |          YES                 NO
┌─────────┐ ┌───────────┐ ┌──────────────┐ ┌───────────────┐
│ Haiku   │ │ Sonnet or │ │ Does context │ │ Standard task │
│ GPT-mini│ │ GPT-5.1   │ │ exceed 100K? │ │ with coding   │
│ GPT-4.1 │ └───────────┘ │   /      \   │ │               │
└─────────┘               │ YES      NO  │ │ Sonnet 4.6    │
                          │  |        |  │ │ GPT-5.1       │
                          │ Opus    Opus │ │ Codex variant │
                          │ (1M)    4.6  │ └───────────────┘
                          └──────────────┘
```
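The tree above can be sketched as a routing function. This is an illustrative sketch, not an official API: the boolean flags are assumptions standing in for human judgment, and the model IDs mirror the quick-reference tables in this report.

```python
def route(simple: bool, speed_critical: bool = False,
          deep_reasoning: bool = False, context_tokens: int = 0) -> list[str]:
    """Map the decision tree onto candidate model IDs from this report."""
    if simple:
        if speed_critical:
            # Fast & cheap tier for well-defined, speed-sensitive work.
            return ["claude-haiku-4.5", "gpt-5-mini", "gpt-4.1"]
        return ["claude-sonnet-4.6", "gpt-5.1"]
    if deep_reasoning:
        if context_tokens > 100_000:
            return ["claude-opus-4.6-1m"]  # context size is the bottleneck
        return ["claude-opus-4.6"]
    # Standard task with coding: workhorse tier or a Codex variant.
    return ["claude-sonnet-4.6", "gpt-5.1", "gpt-5.3-codex"]
```

Each branch returns a shortlist rather than a single model, matching the tree's leaf boxes, which list alternatives rather than one winner.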
The fastest, most cost-effective path through a development session:
- Haiku / GPT-mini → Explore the codebase, run searches, quick edits
- Sonnet 4.6 / GPT-5.1 → Implement features, debug, refactor, review
- Opus 4.6 → Escalate for hard problems, architecture decisions, subtle bugs
- Opus 4.6 (1M) → Only when context size is the bottleneck (>100K tokens)
- Codex variants → Mechanical code generation from clear specs
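One way to operationalize this ladder is escalate-on-failure: start at the cheapest adequate tier and move up only when an attempt fails. A hypothetical sketch (the `attempt` callback and the three-rung ladder are assumptions; the model IDs come from this report):

```python
from typing import Callable, Optional

# Escalation ladder from the workflow above, cheapest first.
LADDER = ["claude-haiku-4.5", "claude-sonnet-4.6", "claude-opus-4.6"]

def solve_with_escalation(attempt: Callable[[str], bool]) -> Optional[str]:
    """Try each tier in order; return the first model ID that succeeds,
    or None if every tier fails."""
    for model in LADDER:
        if attempt(model):
            return model
    return None  # even Opus failed: rethink the task, don't just retry

# Usage: pretend only the top tier can solve this particular task.
winner = solve_with_escalation(lambda m: m == "claude-opus-4.6")
print(winner)
```

The design choice worth noting: escalation keeps the average cost near the cheap tier, since the expensive model is only invoked for the minority of tasks the cheaper tiers cannot close out.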
- Self-awareness correlates with quality. The models that produced the best self-assessments (Opus 4.6, Haiku 4.5, GPT-5.1) were also the ones that provided the most specific, falsifiable claims and honest limitations.
- Boundary-setting > capability-listing. Reviewers across all 3 model families (Claude, GPT, Gemini) agreed: knowing when NOT to use a model is more valuable than knowing what it can do.
- The "senior engineer" trap. Multiple models claimed the same title without differentiation. In practice, the fleet works best as a team with clear roles, not as competing generalists.
- Cross-vendor agreement was high. GPT models ranked Claude responses highly, and vice versa. The anonymization worked — there was no detectable vendor bias in the rankings.
- GPT-5.2-Codex's refusal was the worst possible answer. Every reviewer flagged it. Even an imperfect self-assessment provides more user value than refusing to engage.
"I am the senior engineer in the fleet. Use faster/cheaper models for the 80% of tasks that are straightforward. Escalate to me for the 20% that require deep reasoning, careful judgment, or where getting it wrong is expensive. Use me to plan, then delegate execution to faster models via sub-agents."
"I'm expensive and slow but highly capable. Don't use me to grep files or run tests—use me to figure out WHY the tests are failing and HOW to fix the architecture."
"My 1M context window is not about being 'better' — it's about being necessary when the problem is too large for any other model to hold in its head at once. If the problem fits in a smaller window, use a smaller, faster model."
"In the GitHub Copilot CLI specifically, I'm the default for good reason: I'm fast enough to feel interactive, capable enough to handle real codebases, and honest enough to tell you when something is outside my confidence."
"I'm the senior engineer who ships solid work efficiently — not the genius architect, not the junior doing grunt work, but the dependable tech lead who gets complex features done right."
"I'm the 'senior engineer' — handle complex analysis, architecture decisions, and nuanced problem-solving. Use faster models for execution of simple tasks I identify, and defer to more powerful models for research or extreme complexity."
"I am the fast, practical executor. I thrive in the CLI environment handling real, defined tasks at speed. I'm the 'get-it-done' model—not the 'think-deeply' model."
"I am the 'Senior Engineer' or 'Architect' in your workflow. Use lighter models for the grunt work; call me in for the difficult, high-stakes engineering tasks."
"Choose me when you need a model that can ship code changes accurately and drive tasks to completion in a terminal workflow."
Refused to answer, citing inability to claim capabilities without official model specs.
"Choose me for most day-to-day engineering work where you want high-quality coding + practical debugging and can validate with tooling."
"Choose me for precise, reliable coding and structured fixes; avoid me for cheapest throughput or open-ended creative/multilingual flair."
"I'm the precision engineer: let lighter or more creative models handle ideation or summaries, then hand off to me for the exact implementation and verification."
"Choose me when you want a strong all-rounder: high-quality reasoning, explanation, and code—especially for mixed tasks that aren't purely 'max reasoning' or 'max throughput.'"
"Opt for me when you need sprinty CLI-focused coding help, avoid me for sprawling plans/creativity."
"Choose me when you want a fast, reliable assistant for pragmatic developer workflows, concise reasoning, and code-centric tasks."
"I fit best as a versatile, default option in a multi-model workflow—handling most tasks well, and deferring to specialists when the task demands it."