@harupy
Last active February 20, 2026 17:48
Copilot code review instructions and bad pattern analysis for mlflow/mlflow

Copilot Code Review: Common Bad Patterns

Analysis of 1,007 Copilot review comments on mlflow/mlflow. Roughly 1 in 5 comments (21%) fall into a bad pattern.

Pattern                                    %      Count
-------------------------------------------------------
Generic test coverage nag                  6.0%   59
Duplicate/repeated comments                4.8%   47
Pedantic style/grammar nitpicks            2.4%   24
Accessibility boilerplate                  2.0%   20
Stating the obvious                        1.9%   19
Speculative edge cases                     1.3%   13
Factually wrong                            0.7%   7
Unrelated scope policing                   0.7%   7
Other (best-practice bikeshedding, etc.)   1.0%   10

1. Generic Test Coverage Nag (6.0%)

Acknowledges the code is correct, then generically asks for tests without identifying a specific bug risk.

  • mlflow/gateway/providers/anthropic.py (PR #20957)

    "The changes to extract cached tokens from Anthropic streaming responses are correct. However, there are no tests covering this new functionality." — Admits the code is correct, then asks for tests as boilerplate. Same comment repeated across 4 files.

  • MlflowSidebarExperimentItems.tsx (PR #20860)

    "The new functionality for highlighting sidebar navigation on run detail pages lacks test coverage." — Lists 4 test scenarios, none addressing an actual bug risk. Pure busywork.

  • mlflow/entities/labeling.py (PR #20838)

    "Adding similar tests for the new labeling entities would help catch breaking proto/field changes early." — Generic "add round-trip tests" without identifying any field mapping issue.

  • mlflow/strands/autolog.py (PR #20809)

    "No tests are included for the shared tracer provider scenario that this PR is meant to fix." — Asks for the obvious (test the thing you changed) without identifying what would break.


2. Duplicate/Repeated Comments (4.8%)

Same feedback posted nearly verbatim on multiple files in one PR, creating notification spam.
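This pattern is mechanical enough that a post-processing pass could collapse near-verbatim repeats before posting. A minimal sketch using stdlib fuzzy matching; the comment strings, threshold, and function name are illustrative, not Copilot's actual pipeline:

```python
from difflib import SequenceMatcher


def drop_near_duplicates(comments, threshold=0.9):
    """Keep only the first occurrence of near-verbatim review comments.

    Hypothetical helper: comments are plain strings, and a similarity
    ratio at or above `threshold` counts as a repeat.
    """
    kept = []
    for comment in comments:
        if any(SequenceMatcher(None, comment, k).ratio() >= threshold for k in kept):
            continue  # near-duplicate of an already-kept comment
        kept.append(comment)
    return kept
```

A pass like this would have reduced the four-file test-coverage nag above to a single comment.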


3. Pedantic Style/Grammar Nitpicks (2.4%)

Formatting, naming, wording, and import-style suggestions that do not affect correctness.

  • .github/ui-preview/app.py (PR #20987)

    "Consider using pathlib.Path instead of os.path functions for file path operations." — Style preference in a standalone helper script. Not a bug.

  • en.json (PR #20830)

    "The tooltip text uses 'model training' (lowercase) while the workflow option label uses 'Model training' (capitalized)." — Capitalization nitpick in a tooltip string.

  • service.proto (PR #20848)

    "Entity messages are named with a Proto suffix... Consider dropping the Proto suffix." — Naming convention preference. The suffix is intentional to disambiguate from Python entity classes.


4. Accessibility Boilerplate (2.0%)

Generic ARIA/keyboard/semantic HTML suggestions repeated across every sidebar PR.

  • MlflowSidebarWorkflowSwitch.tsx (PR #20699)

    "Pill is implemented as a clickable div without role, tabIndex, or keyboard handlers." — Same feedback appears on PRs #20699, #20698, #20831, and #20758.

  • MlflowSidebar.tsx (PR #20697)

    "nestedItems is rendered directly inside the ul, but the provided implementations include div elements." — Technically correct about HTML spec, but low-priority for an internal ML tool.

  • SettingsPage.tsx (PR #20695)

    "This switch's label is set to 'On'/'Off', which is not a descriptive accessible name." — Generic ARIA labeling suggestion for an internal settings toggle.


5. Stating the Obvious (1.9%)

Comments that restate what the code clearly shows, adding no insight.

  • title.js (PR #20911)

    "The PR uses actions/github-script@v7.0.1 while most other workflows use v8.0.0." — Observes a version difference without identifying any incompatibility.

  • PR #20692 (PR-level)

    "The PR title is marked '[WIP]', but the description checklist indicates the work is complete." — Stating what the author obviously already knows.

  • sqlalchemy_store.py (PR #20547)

    "The slicing operation runs_with_inputs_outputs[:max_results] will work correctly even when max_results is None." — Describes how Python slicing works, then confirms the code is correct. Zero value.
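The behavior the bot restates is plain Python slice semantics: a None bound is equivalent to an omitted bound, so the code needs no guard and the comment adds nothing.

```python
runs = list(range(10))

# A None upper bound behaves exactly like an omitted bound.
assert runs[:None] == runs

# With a concrete limit, slicing truncates as expected.
assert runs[:3] == [0, 1, 2]
```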


6. Speculative Edge Cases (1.3%)

"While unlikely..." or "could potentially..." scenarios without concrete evidence.

  • test_fluent.py (PR #19784)

    "The test uses probabilistic assertions with random.random() which could lead to flaky tests. While the range [30, 70] for 100 calls provides reasonable tolerance, there's still a non-zero probability..." — The failure probability is below 0.0001%. Theoretical hand-wringing.

  • async_artifacts_logging_queue.py (PR #20577)

    "The _at_exit_callback method doesn't check if the queue is active... While the callback is only registered during activation, there could be edge cases..." — Acknowledges the precondition makes this impossible, then speculates anyway.

  • anthropic/src/index.ts (PR #20384)

    "The traced flag is checked and set synchronously... but there's a potential race condition." — JavaScript is single-threaded. Synchronous flag checks cannot race.
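For the test_fluent.py example, the "non-zero probability" can be quantified exactly rather than speculated about. A sketch assuming the test asserts a fair-coin success count in [30, 70] over 100 trials; the PR's actual assertion may differ:

```python
from math import comb

# Hypothetical reconstruction of the flaky-test worry: 100 fair-coin
# trials, asserting the success count lands in [30, 70].
n, lo, hi = 100, 30, 70
total = 2 ** n  # number of equally likely outcome sequences

# Exact two-tailed probability that the count falls outside the bounds.
p_fail = sum(comb(n, k) for k in range(n + 1) if k < lo or k > hi) / total

# p_fail is on the order of 1e-5: the test would flake roughly once per
# tens of thousands of CI runs, not a practical concern.
```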


7. Factually Wrong (0.7%)

Claims something doesn't exist or is incorrect when it's actually fine.

  • openai.test.ts (PR #20058)

    "The model name 'gpt-5' does not exist. OpenAI's latest model series is GPT-4." — Using fictional model names in test fixtures is standard practice.

  • polish.yml (PR #20481)

    "The workflow uses ubuntu-latest... which is inconsistent with the codebase convention." — ubuntu-latest is intentional here due to ubuntu-slim's 15-minute timeout constraint.

  • package.json (PR #20968)

    "The PR description references CVE-2026-2391, which has a year in the future (2026). This appears to be a typo." — The current year IS 2026. The bot's knowledge cutoff makes it flag a real CVE as a typo.

  • sidebarsGenAI.ts (PR #20103)

    "The sidebar configuration references a documentation file... but this file does not exist in the repository." — The file is being added in the same PR. The bot doesn't check the full diff.


8. Unrelated Scope Policing (0.7%)

Suggesting changes outside PR scope or telling the author to split their PR.

  • PR #20834 (6 files)

    "The documentation changes in this file appear to be unrelated to the labeling sessions feature... should ideally be in a separate PR." — Repeated on 6 files. The author bundled related doc changes intentionally.

  • mlflow_artifacts_repo.py (PR #20352)

    "The change to remove tracking_uri and registry_uri parameters appears unrelated to the multipart download feature." — Cleanup of related code is normal in feature PRs.

  • ragas/registry.py (PR #20104)

    "This PR includes unmentioned changes to existing metric registry paths." — Path changes are part of the migration being done in the PR.


Knowledge Cutoff: A Cross-Cutting Problem

Many of the "factually wrong" comments stem from a fundamental limitation: the model's training data has a knowledge cutoff. It confidently makes claims based on outdated information, and has no mechanism to verify against the current state of the world or even the current PR diff.

How It Manifests

1. Flagging valid names/versions as non-existent

The model's training data doesn't include recent releases, so it flags new model names, CVEs, and tools as errors.

  • openai.test.ts (PR #20058)

    "The model name 'gpt-5' does not exist. OpenAI's latest model series is GPT-4." — Test fixtures don't need real model names, and GPT-5 exists post-cutoff.

  • package.json (PR #20968)

    "The PR description references CVE-2026-2391, which has a year in the future (2026). This appears to be a typo." — The current year IS 2026. The model thinks it's still 2024/2025.

2. Claiming functions/files don't exist (when they're in the PR or codebase)

The model checks its training snapshot, not the actual repo state or the PR diff.

  • test_gateway_api.py (PR #20559)

    "The test patches mlflow.server.gateway_api._make_traced_streaming_response, but this function does not exist in the actual implementation." — The function exists in the codebase. Same wrong claim repeated 4 times in this PR.

  • sidebarsGenAI.ts (PR #20103)

    "The sidebar references 'eval-monitor/online-evaluations/index' but this file does not exist in the repository." — The file is being added in the same PR or a coordinated PR.

  • fs2db/README.md (PR #20608)

    "The README documents a command 'mlflow migrate-filestore' that does not exist in the codebase." — The command is being introduced by the PR itself. The bot reviews docs without checking the accompanying code changes.

3. Stale knowledge about conventions and defaults

The model applies outdated patterns to current code.

  • polish.yml (PR #20481)

    "The workflow uses ubuntu-latest... which is inconsistent with the codebase convention. Most workflows use ubuntu-slim." — ubuntu-latest is intentional here due to runtime constraints the model doesn't understand.

Why This Is Hard to Fix with Instructions Alone

Unlike other bad patterns (duplicates, test nags) that can be addressed with behavioral instructions, knowledge cutoff errors are structural:

  • The model cannot look things up — it can't grep the repo, check PyPI, or verify a CVE database
  • Confidence is miscalibrated — it states "does not exist" with certainty rather than "I'm not sure if this exists"
  • The same outdated fact propagates — once the model "knows" GPT-4 is the latest, it will keep flagging GPT-5 references

Mitigation

The best instruction-level mitigation is:

Do not claim that a name, function, file, model, CVE, or version "does not exist" unless you can see proof of its absence in the PR diff. If you are unsure, do not comment.

But this only partially helps. A more robust solution requires either:

  1. Retrieval-augmented review — give the model access to the full repo tree and current package versions
  2. Post-hoc filtering — use a second LLM pass that has access to the repo to verify "does not exist" claims before posting
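The post-hoc filtering idea can be sketched with nothing more than the standard library. Everything here (the comment dicts, the regex, the function name) is a hypothetical illustration of the approach, not Copilot's actual pipeline:

```python
import re
from pathlib import Path

# Matches claims like: "The file `index.ts` does not exist ..."
CLAIM = re.compile(r"""[`'"]([\w./-]+\.\w+)[`'"].*?does not exist""", re.IGNORECASE)


def verify_absence_claims(comments, repo_root):
    """Drop comments whose "does not exist" claim is contradicted by the
    checked-out repo tree.

    `comments` is a hypothetical list of dicts with a "body" key, as a
    review-bot pipeline might produce before posting.
    """
    root = Path(repo_root)
    kept = []
    for comment in comments:
        match = CLAIM.search(comment["body"])
        if match and (root / match.group(1)).exists():
            continue  # the claim is false: the file is right there
        kept.append(comment)
    return kept
```

Running the filter against the PR's checked-out head (rather than the base branch) would also catch the "file is being added in this same PR" cases above.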

What Good Looks Like (~79%)

The majority of Copilot comments are actually useful. The best ones identify:

  • Real logic bugs (wrong variable, off-by-one, broken control flow)
  • Security issues (injection, credential exposure, path traversal)
  • Performance problems (N+1 queries, unbounded loops)
  • Inconsistent API contracts (wrong status codes, missing fields)
  • Resource leaks (unclosed connections, missing cleanup)

You Live in the Past

Your knowledge is outdated. The codebase, its dependencies, and the outside world are ahead of you. Treat anything you don't recognize as new, not wrong.

  • NEVER claim a function, file, model name, CVE, or version "does not exist"
  • NEVER flag dates or version numbers as "in the future"
  • If a PR adds docs for a new feature, assume the implementation exists