Analysis of 1,007 Copilot review comments on mlflow/mlflow. Roughly 1 in 5 comments (21%) fall into a bad pattern.
| Pattern | % | Count |
|---|---|---|
| Generic test coverage nag | 6.0% | 59 |
| Duplicate/repeated comments | 4.8% | 47 |
| Pedantic style/grammar nitpicks | 2.4% | 24 |
| Accessibility boilerplate | 2.0% | 20 |
| Stating the obvious | 1.9% | 19 |
| Speculative edge cases | 1.3% | 13 |
| Factually wrong | 0.7% | 7 |
| Unrelated scope policing | 0.7% | 7 |
| Other (best-practice bikeshedding, etc.) | 1.0% | 10 |
## Generic test coverage nag

Acknowledges the code is correct, then generically asks for tests without identifying a specific bug risk.
- `mlflow/gateway/providers/anthropic.py` (PR #20957): "The changes to extract cached tokens from Anthropic streaming responses are correct. However, there are no tests covering this new functionality." — Admits the code is correct, then asks for tests as boilerplate. Same comment repeated across 4 files.
- `MlflowSidebarExperimentItems.tsx` (PR #20860): "The new functionality for highlighting sidebar navigation on run detail pages lacks test coverage." — Lists 4 test scenarios, none addressing an actual bug risk. Pure busywork.
- `mlflow/entities/labeling.py` (PR #20838): "Adding similar tests for the new labeling entities would help catch breaking proto/field changes early." — Generic "add round-trip tests" without identifying any field mapping issue.
- `mlflow/strands/autolog.py` (PR #20809): "No tests are included for the shared tracer provider scenario that this PR is meant to fix." — Asks for the obvious (test the thing you changed) without identifying what would break.
## Duplicate/repeated comments

Same feedback posted nearly verbatim on multiple files in one PR, creating notification spam.
- PR #20834 — "Unrelated to labeling sessions" x6: "The documentation changes in this file appear to be unrelated to the labeling sessions feature... should ideally be in a separate PR." — Posted on 6 separate doc files. One PR-level comment would suffice. (+5 duplicates)
- PR #20956 — "Missing test coverage for cached token extraction" x4: "Missing test coverage for cached token extraction in [passthrough/streaming] responses..." — Four near-identical comments on four functions. One comment covering all four would be enough. (+3 duplicates)
- PR #20058 — "gpt-5 does not exist" x2: "The model name 'gpt-5' does not exist." — Same factually wrong comment posted twice on the same file.
## Pedantic style/grammar nitpicks

Formatting, naming, wording, and import-style suggestions that do not affect correctness.
- `.github/ui-preview/app.py` (PR #20987): "Consider using pathlib.Path instead of os.path functions for file path operations." — Style preference in a standalone helper script. Not a bug.
- "The tooltip text uses 'model training' (lowercase) while the workflow option label uses 'Model training' (capitalized)." — Capitalization nitpick in a tooltip string.
- "Entity messages are named with a Proto suffix... Consider dropping the Proto suffix." — Naming convention preference. The suffix is intentional to disambiguate from Python entity classes.
## Accessibility boilerplate

Generic ARIA/keyboard/semantic HTML suggestions repeated across every sidebar PR.
- `MlflowSidebarWorkflowSwitch.tsx` (PR #20699): "Pill is implemented as a clickable div without role, tabIndex, or keyboard handlers." — Same feedback appears on PRs #20699, #20698, #20831, and #20758.
- "nestedItems is rendered directly inside the ul, but the provided implementations include div elements." — Technically correct about the HTML spec, but low-priority for an internal ML tool.
- "This switch's label is set to 'On'/'Off', which is not a descriptive accessible name." — Generic ARIA labeling suggestion for an internal settings toggle.
## Stating the obvious

Comments that restate what the code clearly shows, adding no insight.
- "The PR uses actions/github-script@v7.0.1 while most other workflows use v8.0.0." — Observes a version difference without identifying any incompatibility.
- "The PR title is marked '[WIP]', but the description checklist indicates the work is complete." — Stating what the author obviously already knows.
- `sqlalchemy_store.py` (PR #20547): "The slicing operation runs_with_inputs_outputs[:max_results] will work correctly even when max_results is None." — Describes how Python slicing works, then confirms the code is correct. Zero value.
"While unlikely..." or "could potentially..." scenarios without concrete evidence.
- "The test uses probabilistic assertions with random.random() which could lead to flaky tests. While the range [30, 70] for 100 calls provides reasonable tolerance, there's still a non-zero probability..." — The failure probability is below 0.0001%. Theoretical hand-wringing.
- `async_artifacts_logging_queue.py` (PR #20577): "The _at_exit_callback method doesn't check if the queue is active... While the callback is only registered during activation, there could be edge cases..." — Acknowledges the precondition makes this impossible, then speculates anyway.
- `anthropic/src/index.ts` (PR #20384): "The traced flag is checked and set synchronously... but there's a potential race condition." — JavaScript is single-threaded. Synchronous flag checks cannot race.
## Factually wrong

Claims something doesn't exist or is incorrect when it's actually fine.
- "The model name 'gpt-5' does not exist. OpenAI's latest model series is GPT-4." — Using fictional model names in test fixtures is standard practice.
- "The workflow uses ubuntu-latest... which is inconsistent with the codebase convention." — `ubuntu-latest` is intentional here due to `ubuntu-slim`'s 15-minute timeout constraint.
- "The PR description references CVE-2026-2391, which has a year in the future (2026). This appears to be a typo." — The current year IS 2026. The bot's knowledge cutoff makes it flag a real CVE as a typo.
- "The sidebar configuration references a documentation file... but this file does not exist in the repository." — The file is being added in the same PR. The bot doesn't check the full diff.
## Unrelated scope policing

Suggesting changes outside PR scope or telling the author to split their PR.
- "The documentation changes in this file appear to be unrelated to the labeling sessions feature... should ideally be in a separate PR." — Repeated on 6 files. The author bundled related doc changes intentionally.
- `mlflow_artifacts_repo.py` (PR #20352): "The change to remove tracking_uri and registry_uri parameters appears unrelated to the multipart download feature." — Cleanup of related code is normal in feature PRs.
- "This PR includes unmentioned changes to existing metric registry paths." — Path changes are part of the migration being done in the PR.
## Deep dive: knowledge cutoff errors

Many of the "factually wrong" comments stem from a fundamental limitation: the model's training data has a knowledge cutoff. It confidently makes claims based on outdated information, and has no mechanism to verify them against the current state of the world or even the current PR diff.
### 1. Flagging valid names/versions as non-existent
The model's training data doesn't include recent releases, so it flags new model names, CVEs, and tools as errors.
- "The model name 'gpt-5' does not exist. OpenAI's latest model series is GPT-4." — Test fixtures don't need real model names, and GPT-5 exists post-cutoff.
- "The PR description references CVE-2026-2391, which has a year in the future (2026). This appears to be a typo." — The current year IS 2026. The model thinks it's still 2024/2025.
### 2. Claiming functions/files don't exist (when they're in the PR or codebase)
The model checks its training snapshot, not the actual repo state or the PR diff.
- `test_gateway_api.py` (PR #20559): "The test patches mlflow.server.gateway_api._make_traced_streaming_response, but this function does not exist in the actual implementation." — The function exists in the codebase. Same wrong claim repeated 4 times in this PR.
- "The sidebar references 'eval-monitor/online-evaluations/index' but this file does not exist in the repository." — The file is being added in the same PR or a coordinated PR.
- "The README documents a command 'mlflow migrate-filestore' that does not exist in the codebase." — The command is being introduced by the PR itself. The bot reviews docs without checking the accompanying code changes.
### 3. Stale knowledge about conventions and defaults
The model applies outdated patterns to current code.
- `polish.yml` (PR #20481): "The workflow uses ubuntu-latest... which is inconsistent with the codebase convention. Most workflows use ubuntu-slim." — `ubuntu-latest` is intentional here due to runtime constraints the model doesn't understand.
Unlike other bad patterns (duplicates, test nags) that can be addressed with behavioral instructions, knowledge cutoff errors are structural:
- The model cannot look things up — it can't `grep` the repo, check PyPI, or verify a CVE database
- Confidence is miscalibrated — it states "does not exist" with certainty rather than "I'm not sure if this exists"
- The same outdated fact propagates — once the model "knows" GPT-4 is the latest, it will keep flagging GPT-5 references
The best instruction-level mitigation is:
> Do not claim that a name, function, file, model, CVE, or version "does not exist" unless you can see proof of its absence in the PR diff. If you are unsure, do not comment.
But this only partially helps. A more robust solution requires either:
- Retrieval-augmented review — give the model access to the full repo tree and current package versions
- Post-hoc filtering — use a second LLM pass that has access to the repo to verify "does not exist" claims before posting
## What the good comments look like

The majority of Copilot comments are actually useful. The best ones identify:
- Real logic bugs (wrong variable, off-by-one, broken control flow)
- Security issues (injection, credential exposure, path traversal)
- Performance problems (N+1 queries, unbounded loops)
- Inconsistent API contracts (wrong status codes, missing fields)
- Resource leaks (unclosed connections, missing cleanup)