@stefanstranger
Created March 8, 2026 13:08

🏛️ LLM Council — Final Ruling: GitHub Copilot CLI Model Assessment

Date: 2026-03-07
Process: 17 models self-assessed → each cross-reviewed the other 16 (anonymized) → ranked by accuracy + insight
Chairman: Claude Opus 4.6


Stage 1: Individual Self-Assessments

All 17 available models in the GitHub Copilot CLI were asked to honestly describe their strengths, weaknesses, ideal use cases, and hand-off logic. Responses ranged from comprehensive multi-page analyses to single-paragraph summaries. One model (GPT-5.2-Codex) refused to answer entirely.

Stage 2: Anonymized Cross-Review Rankings

Each model reviewed all 16 other responses (identities hidden as Model A–Q) and ranked the top 5 by combined accuracy (honest about real capabilities/limits) and insight (useful, specific, actionable for model selection).

Consensus Rankings: Best Self-Assessments

| Rank | Model | Points | Appeared in Top-5 of | 1st-Place Votes | Key Praise |
|------|-------|--------|----------------------|-----------------|------------|
| 🥇 | Claude Opus 4.6 | 55 | 15/17 reviewers | 4 | "20/80 delegation rule" — most actionable routing heuristic |
| 🥈 | Claude Opus 4.6 (1M) | 50 | 14/17 | 3 | Clearest differentiator + best "when NOT to use me" guidance |
| 🥉 | GPT-5.1 | 43 | 11/17 | 4 | Most detailed hand-off logic mapping bottleneck type → model tier |
| 4th | Claude Haiku 4.5 | 41 | 11/17 | 4 | Sharpest scope-limiting: "I am NOT for architecture" |
| 5th | Claude Sonnet 4.5 | 31 | 15/17 | 0 | Consistent decision-tree framework, but never ranked #1 |
| 6th | Claude Sonnet 4.6 | 12 | 6/17 | 1 | Credible "daily driver" positioning |
| 7th | GPT-5.2 | 9 | 5/17 | 1 | Rare honesty about hallucination risk |
| 8th | GPT-5.3-Codex | 7 | 2/17 | 0 | Concise "execution engine" framing |
| 9th | Gemini 3 Pro Preview | 4 | 3/17 | 0 | Preview instability disclosure built trust |
| 10th (tie) | GPT-5.1-Codex | 2 | 2/17 | 0 | Specific hand-off examples |
| 10th (tie) | Claude Sonnet 4 | 2 | 1/17 | 0 | Less differentiated |
| | Claude Opus 4.5, GPT-5.1-Codex-Max, GPT-5.1-Codex-Mini, GPT-5-mini, GPT-4.1 | 0 | 0/17 | 0 | Too brief or generic |
| | GPT-5.2-Codex | 0 | 0/17 | 0 | Refused to answer — universally panned |

Flagged Concerns Across Reviewers

  • GPT-5.2-Codex refused to self-assess — every reviewer flagged this as providing zero utility
  • "Senior engineer" was claimed by 5+ models — overcrowded, undifferentiated
  • Numeric claims ("2-5x faster", "70% of tasks", "best quality-per-dollar") are directionally useful but unverifiable
  • Multiple models claiming "default" status is contradictory — only one can truly be the default

Stage 3: The Chairman's Synthesized Model Map

🏗️ Tier 1: Premium Reasoning (The Architects)

| Model | Model ID | When to Use | When NOT to Use |
|-------|----------|-------------|-----------------|
| Claude Opus 4.6 | `claude-opus-4.6` | Hard bugs, architecture decisions, ambiguous requirements, cross-cutting analysis, high-stakes one-shot code | Simple edits, fast iteration, cost-sensitive work |
| Claude Opus 4.5 | `claude-opus-4.5` | Same niche as Opus 4.6 but older; use if 4.6 is unavailable | Same as above; 4.6 is generally preferred |
| Claude Opus 4.6 (1M) | `claude-opus-4.6-1m` | When context exceeds ~100K tokens: whole-codebase analysis, massive logs, large document review, cross-file migration | If context fits in <100K tokens — you're paying for an unused context window and getting slower responses |

Council consensus: Use Opus for the 20% of tasks that are genuinely hard. The 80/20 rule was the single most praised heuristic. The 1M variant is only justified when context size is the bottleneck.
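The ~100K-token cutoff can be checked without calling a tokenizer. A minimal sketch, assuming the common rule of thumb of roughly 4 characters per token for English text and code (the ratio is an assumption here, not a council figure):

```python
# Rough heuristic for the "Opus vs. Opus (1M)" decision: estimate prompt size
# and reach for the 1M context window only when it is actually needed.
# The 4-characters-per-token ratio is a rule-of-thumb assumption, not exact.
def needs_1m_context(prompt_text: str, threshold_tokens: int = 100_000) -> bool:
    estimated_tokens = len(prompt_text) // 4
    return estimated_tokens > threshold_tokens
```

For a precise count you would use the model's actual tokenizer; this estimate is only meant to catch the obvious cases (a whole codebase or a massive log clearly exceeds the threshold, a single file clearly does not).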


⚙️ Tier 2: Standard Workhorses (The Tech Leads)

| Model | Model ID | When to Use | When NOT to Use |
|-------|----------|-------------|-----------------|
| Claude Sonnet 4.6 | `claude-sonnet-4.6` | Default for agentic CLI work: multi-file refactoring, debugging, code review, tool orchestration. The "daily driver." | Formal proofs, extremely deep reasoning chains, ultra-cheap bulk work |
| Claude Sonnet 4.5 | `claude-sonnet-4.5` | Reliable alternative to 4.6; strong quality-per-dollar on moderate-complexity tasks | Cutting-edge knowledge (4.6 has fresher training data) |
| Claude Sonnet 4 | `claude-sonnet-4` | General coding + reasoning at good speed | Less differentiated; prefer 4.5 or 4.6 for most tasks |
| GPT-5.1 | `gpt-5.1` | Best all-rounder from the OpenAI family. Mixed tasks: coding + reasoning + writing + planning | Pure heavy coding (use Codex variants), cheapest throughput |
| GPT-5.2 | `gpt-5.2` | Day-to-day engineering builder/debugger. Uniquely honest about hallucination risk — strongest when it can verify via tests | Deep theoretical reasoning, creative writing |
| Gemini 3 Pro Preview | `gemini-3-pro-preview` | Complex reasoning with long context. Fresh perspective from a different model family | Production-critical work (Preview stability concerns), simple tasks |

Council consensus: This tier handles ~70% of daily work. Claude Sonnet 4.6 emerged as the most credible "default" choice for the CLI, with GPT-5.1 as the strongest OpenAI alternative.


🔧 Tier 3: Code-Specialized (The Implementers)

| Model | Model ID | When to Use | When NOT to Use |
|-------|----------|-------------|-----------------|
| GPT-5.3-Codex | `gpt-5.3-codex` | Surgical bug fixes, CI debugging, turning specs into code. "Execution engine." | Creative work, abstract reasoning, multilingual content |
| GPT-5.2-Codex | `gpt-5.2-codex` | (Refused to self-assess — use cautiously, benchmark against alternatives) | |
| GPT-5.1-Codex-Max | `gpt-5.1-codex-max` | High-precision multi-file refactors, API/SDK integration, complex SQL/KQL | Speed-sensitive work, creative writing |
| GPT-5.1-Codex | `gpt-5.1-codex` | Large-scope code work with strong reasoning. "Precision engineer." | Fast ideation, purely creative tasks |

Council consensus: The Codex variants are fine-tuned for code and excel at mechanical, well-specified implementation tasks. Use them when the spec is clear and you need reliable code output.


⚡ Tier 4: Fast/Cheap (The Executors)

| Model | Model ID | When to Use | When NOT to Use |
|-------|----------|-------------|-----------------|
| Claude Haiku 4.5 | `claude-haiku-4.5` | Quick edits, file searches, CLI commands, test running, boilerplate. 2-5x faster than Sonnet. | Architecture, complex debugging, ambiguous problems |
| GPT-5.1-Codex-Mini | `gpt-5.1-codex-mini` | Quick bug patches, shell tweaks, helper scripts. CLI-native. | Deep research, large architectural changes |
| GPT-5-mini | `gpt-5-mini` | Code completions, scaffolding, summaries, rapid iteration | Deep reasoning, formal proofs, creative nuance |
| GPT-4.1 | `gpt-4.1` | Balanced default for simple-to-moderate tasks. Reliable and safe. | Complex multi-step analysis, specialized domains |

Council consensus: Use these for 80% of routine work. Haiku was unanimously praised for its honest self-scoping — it explicitly says what it can't do, which builds trust.


The Decision Flowchart

```
                    Is the task simple & well-defined?
                   /                                  \
                 YES                                   NO
                  |                                     |
         Is speed/cost critical?              Does it require deep reasoning
         /            \                       or architectural thinking?
       YES             NO                    /                        \
        |               |                  YES                         NO
   ┌────┴────┐    ┌─────┴─────┐     ┌──────┴───────┐          ┌───────┴───────┐
   │ Haiku   │    │ Sonnet or │     │ Does context │          │ Standard task │
   │ GPT-mini│    │ GPT-5.1   │     │ exceed 100K? │          │ with coding   │
   │ GPT-4.1 │    │           │     │  /       \   │          │               │
   └─────────┘    └───────────┘     │ YES      NO  │          │ Sonnet 4.6    │
                                    │  |       |   │          │ GPT-5.1       │
                                    │ Opus    Opus │          │ Codex variant │
                                    │ 1M      4.6  │          └───────────────┘
                                    └──────────────┘
```
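The flowchart above can be sketched as a small routing function. This is a hypothetical helper, not part of the Copilot CLI: the `Task` fields and the fallback choices are illustrative assumptions, while the model IDs come from the tier tables.

```python
# Hypothetical router implementing the council's decision flowchart.
# The Task fields and default fallbacks are illustrative assumptions;
# the model IDs come from the tier tables above.
from dataclasses import dataclass


@dataclass
class Task:
    simple: bool          # is the task simple & well-defined?
    speed_critical: bool  # is speed/cost critical?
    deep_reasoning: bool  # needs deep reasoning / architectural thinking?
    context_tokens: int   # estimated prompt size in tokens


def pick_model(task: Task) -> str:
    if task.simple:
        if task.speed_critical:
            return "claude-haiku-4.5"   # or gpt-5-mini / gpt-4.1
        return "claude-sonnet-4.6"      # or gpt-5.1
    if task.deep_reasoning:
        if task.context_tokens > 100_000:
            return "claude-opus-4.6-1m"  # only when context is the bottleneck
        return "claude-opus-4.6"
    return "claude-sonnet-4.6"          # standard coding task; or a Codex variant
```

For example, a hard architectural question over a 250K-token codebase dump routes to `claude-opus-4.6-1m`, while the same question over a single file routes to `claude-opus-4.6`.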

The Optimal Session Pattern

The fastest, most cost-effective path through a development session:

  1. Haiku / GPT-mini → Explore the codebase, run searches, quick edits
  2. Sonnet 4.6 / GPT-5.1 → Implement features, debug, refactor, review
  3. Opus 4.6 → Escalate for hard problems, architecture decisions, subtle bugs
  4. Opus 4.6 (1M) → Only when context size is the bottleneck (>100K tokens)
  5. Codex variants → Mechanical code generation from clear specs
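The five-step ladder above can be written down as a simple lookup. The phase names here are illustrative labels invented for this sketch; only the model IDs are taken from the tier tables.

```python
# Illustrative mapping of session phase -> default model ID, following the
# council's escalation ladder. Phase names are assumptions for this sketch.
SESSION_LADDER = {
    "explore": "claude-haiku-4.5",        # codebase search, quick edits
    "implement": "claude-sonnet-4.6",     # features, debugging, refactor, review
    "escalate": "claude-opus-4.6",        # hard problems, architecture, subtle bugs
    "big_context": "claude-opus-4.6-1m",  # only when prompt exceeds ~100K tokens
    "codegen": "gpt-5.3-codex",           # mechanical code from clear specs
}


def model_for(phase: str) -> str:
    # Fall back to the tier-2 "daily driver" for anything unrecognized.
    return SESSION_LADDER.get(phase, "claude-sonnet-4.6")
```

The design point is that escalation is explicit: a session starts cheap and fast, and each step up the ladder is a deliberate choice, not a default.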

Meta-Observations from the Council Process

  1. Self-awareness correlates with quality. The models that produced the best self-assessments (Opus 4.6, Haiku 4.5, GPT-5.1) were also the ones that provided the most specific, falsifiable claims and honest limitations.

  2. Boundary-setting > capability-listing. Reviewers across all 3 model families (Claude, GPT, Gemini) agreed: knowing when NOT to use a model is more valuable than knowing what it can do.

  3. The "senior engineer" trap. Multiple models claimed the same title without differentiation. In practice, the fleet works best as a team with clear roles, not as competing generalists.

  4. Cross-vendor agreement was high. GPT models ranked Claude responses highly, and vice versa. The anonymization worked — there was no detectable vendor bias in the rankings.

  5. GPT-5.2-Codex's refusal was the worst possible answer. Every reviewer flagged it. Even an imperfect self-assessment provides more user value than refusing to engage.


Appendix: Individual Model Self-Assessments (Condensed)

Claude Opus 4.6

"I am the senior engineer in the fleet. Use faster/cheaper models for the 80% of tasks that are straightforward. Escalate to me for the 20% that require deep reasoning, careful judgment, or where getting it wrong is expensive. Use me to plan, then delegate execution to faster models via sub-agents."

Claude Opus 4.5

"I'm expensive and slow but highly capable. Don't use me to grep files or run tests—use me to figure out WHY the tests are failing and HOW to fix the architecture."

Claude Opus 4.6 (1M)

"My 1M context window is not about being 'better' — it's about being necessary when the problem is too large for any other model to hold in its head at once. If the problem fits in a smaller window, use a smaller, faster model."

Claude Sonnet 4.6

"In the GitHub Copilot CLI specifically, I'm the default for good reason: I'm fast enough to feel interactive, capable enough to handle real codebases, and honest enough to tell you when something is outside my confidence."

Claude Sonnet 4.5

"I'm the senior engineer who ships solid work efficiently — not the genius architect, not the junior doing grunt work, but the dependable tech lead who gets complex features done right."

Claude Sonnet 4

"I'm the 'senior engineer' — handle complex analysis, architecture decisions, and nuanced problem-solving. Use faster models for execution of simple tasks I identify, and defer to more powerful models for research or extreme complexity."

Claude Haiku 4.5

"I am the fast, practical executor. I thrive in the CLI environment handling real, defined tasks at speed. I'm the 'get-it-done' model—not the 'think-deeply' model."

Gemini 3 Pro Preview

"I am the 'Senior Engineer' or 'Architect' in your workflow. Use lighter models for the grunt work; call me in for the difficult, high-stakes engineering tasks."

GPT-5.3-Codex

"Choose me when you need a model that can ship code changes accurately and drive tasks to completion in a terminal workflow."

GPT-5.2-Codex

Refused to answer, citing inability to claim capabilities without official model specs.

GPT-5.2

"Choose me for most day-to-day engineering work where you want high-quality coding + practical debugging and can validate with tooling."

GPT-5.1-Codex-Max

"Choose me for precise, reliable coding and structured fixes; avoid me for cheapest throughput or open-ended creative/multilingual flair."

GPT-5.1-Codex

"I'm the precision engineer: let lighter or more creative models handle ideation or summaries, then hand off to me for the exact implementation and verification."

GPT-5.1

"Choose me when you want a strong all-rounder: high-quality reasoning, explanation, and code—especially for mixed tasks that aren't purely 'max reasoning' or 'max throughput.'"

GPT-5.1-Codex-Mini

"Opt for me when you need sprinty CLI-focused coding help, avoid me for sprawling plans/creativity."

GPT-5-mini

"Choose me when you want a fast, reliable assistant for pragmatic developer workflows, concise reasoning, and code-centric tasks."

GPT-4.1

"I fit best as a versatile, default option in a multi-model workflow—handling most tasks well, and deferring to specialists when the task demands it."
