Skill: audit-tests
Version: 5.0.0
Author: Jeremy Longshore <jeremy@intentsolutions.io>
License: MIT
Marketplace Grade: A (99/100)
Compatible With: Claude Code, Codex, OpenClaw
Audit Date: 2026-03-11
A universal test orchestration skill that auto-discovers, runs, audits, and auto-remediates tests across 13+ languages and 30+ frameworks. Given any codebase, /audit-tests runs an 8-step pipeline — from framework discovery through mutation testing and bias detection to automated gap-filling — producing structured deliverables (TEST_AUDIT.md, quality scorecards, remediation commits) that transform "we think our tests are good" into "we can prove it."
Target users: Engineering teams, solo developers, vibe coders shipping AI-generated tests, tech leads auditing inherited codebases.
audit-tests/
├── SKILL.md # Skill definition (v5.0.0, 226 lines)
├── evals/
│ └── evals.json # 7 eval scenarios (happy path, edge, negative)
├── scripts/
│ └── bias-count.sh # Quick bias pattern counter for test dirs
└── references/ # 8 specialized reference docs
├── discovery-preflight.md # Auto-discovery, language/PM detection, monorepo
├── frameworks.md # Unit/integration commands for all languages
├── e2e-testing.md # Playwright, Cypress, WebdriverIO, Selenium
├── specialized-testing.md # API, perf, security, a11y, visual, mutation
├── github-audit.md # Full audit engine, report template, remediation
├── scaffold-productivity.md # Scaffold when no tests, productivity audit
├── test-quality-deep-audit.md # Bias detection, mutation, OWASP, AI-test risk
└── auto-remediation.md # Test generation, hardening, verify loop, commit
                 User Prompt
                      |
           +----------v----------+
           |  1. Intent Router   |
           +---+-------+-------+-+
               |       |       |
     +---------+       |       +----------+
     |                 |                  |
+----v---------+  +----v--------+  +-----v------------+
| 2. Discovery |  | 5. GH Audit |  | 8. Auto-Remediate|
+----+---------+  +----+--------+  +-----+------------+
     |                 |                 |
+----v---------+  +----v--------+  +-----v------------+
| 3. Pre-Flight|  | 6. Prod.    |  |   Verify Loop    |
+----+---------+  |    Audit    |  +-----+------------+
     |            +----+--------+        |
+----v---------+       |           +-----v------------+
| 4. Test Run  |  +----v--------+  |   Auto-Commit    |
+----+---------+  | 7. Quality  |  +------------------+
     |            |  Deep Audit |
     v            +----+--------+
TestRunReport          |
                       v
              TestQualityScorecard
| User Says | Steps Executed |
|---|---|
| "run tests" / "test this" | 2 → 3 → 4 (discover, pre-flight, run) |
| "audit tests" / "find gaps" | 5 → 7 → 8 (full pipeline: audit, quality, fix) |
| "test quality" / "bias" / "harden" | 7 (quality deep audit) |
| "fix tests" / "fill gaps" | 8 (auto-remediation) |
| "tests are slow" | 6 (productivity audit) |
| Specific framework named | 3 → 4 (pre-flight, run) |
- User says a trigger phrase (e.g., "audit tests", "run tests", "test quality")
- SKILL.md auto-activates, granting scoped tools (30+ Bash scopes, Read, Write, Edit, Glob, Grep)
- Step 1 routes intent to the correct pipeline steps
- Steps 2-4 discover framework, check environment, run tests
- Steps 5-8 audit gaps, assess quality, auto-fix, commit
- Output structured reports (console + TEST_AUDIT.md deliverable)
name: audit-tests
description: |
Test orchestration, quality audit, and auto-remediation across any language.
Make sure to use this skill whenever running tests, auditing test quality,
detecting bias patterns, running mutation testing, or filling test gaps.
Use when you need to discover, run, audit, harden, or auto-fix tests.
Trigger with "run tests", "audit tests", "test quality", "harden tests",
"fix tests", "mutation testing", "fill gaps".
allowed-tools: "Read, Write, Edit, Glob, Grep, Bash(find:*), Bash(pnpm:*), Bash(npm:*),
Bash(npx:*), Bash(yarn:*), Bash(bun:*), Bash(python:*), Bash(pytest:*), Bash(go:*),
Bash(cargo:*), Bash(bundle:*), Bash(mix:*), Bash(dotnet:*), Bash(docker:*), Bash(gh:*),
Bash(git:*), Bash(make:*), Bash(ls:*), Bash(cat:*), Bash(grep:*), Bash(lsof:*),
Bash(curl:*), Bash(k6:*), Bash(artillery:*), Bash(semgrep:*), Bash(gitleaks:*),
Bash(trivy:*), Bash(ab:*), Task, AskUserQuestion"
model: inherit
version: 5.0.0
author: Jeremy Longshore <jeremy@intentsolutions.io>
license: MIT
compatible-with: claude-code, codex, openclaw
tags: [testing, quality-audit, mutation-testing, auto-remediation, bias-detection, owasp]

| File | Lines | Purpose |
|---|---|---|
| discovery-preflight.md | 164 | Config file scanning, language detection (13 languages), package manager detection (11 PMs), environment readiness, monorepo handling (pnpm, turbo, nx, lerna, rush) |
| frameworks.md | 159 | Complete command reference for Vitest, Jest, Pytest, Go test, Cargo, RSpec, JUnit/Gradle/Maven, PHPUnit, ExUnit, .NET/xUnit |
| e2e-testing.md | ~200 | Playwright, Cypress, WebdriverIO, Selenium, Cucumber setup and commands |
| specialized-testing.md | ~300 | API contract (dredd, Pact), performance (k6, Artillery, Locust), security (SAST, secrets, DAST), accessibility (axe-core, pa11y), visual regression, mutation testing |
| github-audit.md | 598 | Full audit engine: repo scan, gap mapping, CI/CD audit, coverage audit, security audit, TEST_AUDIT.md template, remediation plan generator |
| test-quality-deep-audit.md | 670 | Bias detection (7 patterns), mutation testing (4 tools), assertion quality scoring, OWASP Top 10 coverage, AI-test detection, Test Quality Scorecard |
| auto-remediation.md | 370 | Gap-to-test pipeline, test generation strategy, verification loop, auto-commit protocol, remediation report |
| scaffold-productivity.md | ~200 | Scaffold when no tests exist, duration analysis, flaky detection, CI speed audit |
| Language | Unit Framework | E2E Framework | Mutation Tool |
|---|---|---|---|
| JavaScript/TypeScript | Vitest, Jest, Mocha | Playwright, Cypress, WebdriverIO | Stryker |
| Python | pytest, unittest | Playwright (Python) | mutmut v3 |
| Go | go test | — | — |
| Rust | cargo test | — | cargo-mutants |
| Ruby | RSpec, Minitest | Capybara | — |
| Java/Kotlin | JUnit, Gradle, Maven | Selenium | PITest |
| PHP | PHPUnit | Laravel Dusk | Infection |
| Elixir | ExUnit | Wallaby | — |
| C#/.NET | xUnit, NUnit | Playwright (.NET) | Stryker.NET |
| Swift | XCTest | XCUITest | — |
| Haskell | HSpec | — | — |
| C/C++ | GoogleTest, Catch2 | — | — |
| Dart/Flutter | flutter test | integration_test | — |
Parses the user's prompt against a trigger table to determine which steps to run. "audit tests" runs the full 5→7→8 pipeline. "run tests" runs 2→3→4. Specific framework names skip discovery.
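The trigger matching above can be sketched as a shell case statement. This is illustrative only — phrase patterns are abbreviated, and the fall-through default (treating an unrecognized prompt as "run tests") is an assumption, not the skill's documented behavior:

```shell
#!/bin/sh
# Route a prompt to pipeline step numbers, mirroring the trigger table.
# Order matters: "audit tests" must be checked before the generic "tests".
route_intent() {
  case "$1" in
    *"audit tests"*|*"find gaps"*)    echo "5 7 8" ;;  # full pipeline
    *"test quality"*|*bias*|*harden*) echo "7" ;;      # quality deep audit
    *"fix tests"*|*"fill gaps"*)      echo "8" ;;      # auto-remediation
    *slow*)                           echo "6" ;;      # productivity audit
    *"run tests"*|*"test this"*)      echo "2 3 4" ;;  # discover, pre-flight, run
    *)                                echo "2 3 4" ;;  # assumed default
  esac
}

route_intent "please audit tests here"   # prints: 5 7 8
```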
Scans for config files (vitest.config.*, pyproject.toml, Cargo.toml, go.mod, etc.), detects language, identifies package manager from lockfiles, checks for monorepo configuration. Never assumes — always discovers.
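The lockfile-to-package-manager step can be sketched as a small shell function. A simplified sketch — the shipped detection in references/discovery-preflight.md covers 11 package managers plus config files and monorepo markers; this shows only a handful:

```shell
#!/bin/sh
# Map a lockfile in the repo root to its package manager.
# Simplified illustration — not the skill's full detection logic.
detect_pm() {
  dir="$1"
  if   [ -f "$dir/pnpm-lock.yaml" ];    then echo "pnpm"
  elif [ -f "$dir/yarn.lock" ];         then echo "yarn"
  elif [ -f "$dir/bun.lockb" ];         then echo "bun"
  elif [ -f "$dir/package-lock.json" ]; then echo "npm"
  elif [ -f "$dir/poetry.lock" ];       then echo "poetry"
  elif [ -f "$dir/Cargo.lock" ];        then echo "cargo"
  elif [ -f "$dir/go.sum" ];            then echo "go"
  else echo "unknown"
  fi
}

tmp=$(mktemp -d)
touch "$tmp/pnpm-lock.yaml"
detect_pm "$tmp"   # prints: pnpm
```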
Validates environment readiness: dependencies installed, correct Python venv active, .env.test present, Docker running (if needed), no port conflicts for E2E servers.
Routes to the correct framework command. Executes with coverage when available. Produces structured output (Pass / Fail / Warning) with failure root-cause analysis and specific fix suggestions.
Maps every source file to its test file. Calculates coverage ratio. Checks CI/CD for test steps, coverage gates, security scanning. Runs dependency audit and secret detection. Generates gap analysis sorted by risk (P0-P3). Writes TEST_AUDIT.md to repo root.
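The source-to-test mapping can be sketched as follows (Python naming convention assumed for illustration; the full engine in references/github-audit.md handles per-language conventions):

```shell
#!/bin/sh
# Coverage-ratio sketch: count source files with a matching test file.
# Assumed convention: src/foo.py -> tests/test_foo.py.
gap_ratio() {
  src="$1"; tests="$2"; total=0; covered=0
  for f in "$src"/*.py; do
    [ -e "$f" ] || continue          # skip if the glob matched nothing
    total=$((total + 1))
    base=$(basename "$f")
    [ -f "$tests/test_$base" ] && covered=$((covered + 1))
  done
  [ "$total" -eq 0 ] && { echo "0/0"; return; }
  echo "$covered/$total"             # e.g. "1/2" -> one gap to fill
}
```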
Duration analysis per test file. Flaky test detection (rerun 3x, classify inconsistent results). CI pipeline speed audit. Identifies parallelization opportunities and bloated test fixtures.
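The rerun-3x classification can be sketched like this (the test command passed in is a placeholder for the framework-specific invocation):

```shell
#!/bin/sh
# Run a test command 3x and classify the result:
# stable-pass, stable-fail, or flaky (inconsistent across reruns).
classify_flaky() {
  pass=0; fail=0
  for i in 1 2 3; do
    if "$@" >/dev/null 2>&1; then pass=$((pass + 1)); else fail=$((fail + 1)); fi
  done
  if   [ "$pass" -eq 3 ]; then echo "stable-pass"
  elif [ "$fail" -eq 3 ]; then echo "stable-fail"
  else echo "flaky"
  fi
}

classify_flaky true   # prints: stable-pass
```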
The core differentiator. Runs 6 quality checks:
- Bias detection — scans for 7 pattern types (see Section 8)
- Assertion density — assertions per test function (target ≥2.0)
- Negative test ratio — error path coverage (target ≥15%)
- OWASP Top 10 coverage — security test grading per category
- Mutation testing — kill rate via mutmut/Stryker/PITest/cargo-mutants
- AI-test risk scoring — commit-based + pattern-based detection
Produces a Test Quality Scorecard with quality-adjusted coverage: effective_coverage = line_coverage × kill_rate.
For each gap found in Steps 5/7: generates tests following project conventions, hardens weak assertions (smoke → exact value), fixes bias patterns, adds negative/boundary/security tests. Runs a verification loop (tests pass → lint clean → mutation test → iterate if kill rate < 70%). Auto-commits test files only. Never pushes — lets user review.
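The capped verification loop can be sketched as follows. run_tests, run_lint, and kill_rate are placeholders for the framework-specific commands; the stubs below only make the sketch runnable:

```shell
#!/bin/sh
run_tests() { return 0; }   # placeholder, e.g. a vitest/pytest invocation
run_lint()  { return 0; }   # placeholder, e.g. a lint command
kill_rate() { echo 80; }    # placeholder: mutation kill rate, 0-100

# Generate -> run -> mutate -> iterate, capped at 3 iterations.
verify_loop() {
  i=1
  while [ "$i" -le 3 ]; do
    run_tests || return 1                  # generated tests must pass
    run_lint  || return 1                  # and be lint-clean
    [ "$(kill_rate)" -ge 70 ] && return 0  # kill-rate target met: done
    i=$((i + 1))                           # otherwise harden and retry
  done
  return 2                                 # cap reached; surface to user
}

verify_loop && echo "done"   # prints: done
```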
| Language | Tool | Config Location | Target Command |
|---|---|---|---|
| Python | mutmut v3 | [tool.mutmut] in pyproject.toml | mutmut run |
| JS/TS | Stryker | stryker.conf.json | npx stryker run |
| Java | PITest | build.gradle plugin | ./gradlew pitest |
| Rust | cargo-mutants | (no config needed) | cargo mutants |
| Kill Rate | Grade | Interpretation |
|---|---|---|
| 90-100% | S | Exceptional — catches nearly every possible bug |
| 80-89% | A | Strong — suitable for production-critical code |
| 70-79% | B | Good — acceptable for most codebases |
| 60-69% | C | Adequate — room for improvement |
| 50-59% | D | Weak — tests miss many real bugs |
| 40-49% | E | Poor — little confidence |
| <40% | F | Failing — tests are decoration |
effective_coverage = line_coverage × kill_rate
Example: 90% line coverage × 65% kill rate = 58.5% effective coverage (D grade). High coverage with weak assertions creates false confidence — this formula exposes it.
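The arithmetic and the grade lookup from the table above can be sketched in a few lines of shell:

```shell
#!/bin/sh
# Quality-adjusted coverage: line coverage scaled by mutation kill rate.
effective_coverage() {
  awk -v line="$1" -v kill="$2" 'BEGIN { printf "%.1f", line * kill / 100 }'
}

# Map a rate (0-100) to the letter grade from the table above.
kill_grade() {
  r="$1"
  if   [ "$r" -ge 90 ]; then echo "S"
  elif [ "$r" -ge 80 ]; then echo "A"
  elif [ "$r" -ge 70 ]; then echo "B"
  elif [ "$r" -ge 60 ]; then echo "C"
  elif [ "$r" -ge 50 ]; then echo "D"
  elif [ "$r" -ge 40 ]; then echo "E"
  else echo "F"
  fi
}

effective_coverage 90 65   # prints: 58.5
kill_grade 65              # prints: C
```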
- v3 CLI differs from v2 (mutmut run, not mutmut run --paths-to-mutate)
- Results are stored in .mutmut-cache/ (SQLite), not a flat file
- also_copy in config is critical — tests needing fixtures/configs fail without it (false kills)
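A minimal [tool.mutmut] block illustrating also_copy might look like this (the paths are hypothetical — adapt them to the project's actual fixture and config locations):

```toml
# Hypothetical pyproject.toml fragment — paths are illustrative.
[tool.mutmut]
paths_to_mutate = "src/"
also_copy = ["tests/fixtures/", "conftest.py"]
```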
7 bias patterns that let mutations survive because assertions don't verify correctness:
| # | Bias Type | What It Looks Like | Why It's Bad |
|---|---|---|---|
| 1 | Tautological | assert sorted(x) == sorted(x) | Asserts a value equals itself — always passes |
| 2 | Self-Referential | expected = module.calculate(input); assert result == expected | Computes expected from the same code under test |
| 3 | Smoke-Only | assert result is not None | Checks existence, not correctness |
| 4 | Identity Misuse | assert not result | Truthiness check when a specific value is needed |
| 5 | Symmetric Input | func(0, 0), func("test", "test") | Degenerate inputs don't exercise real logic |
| 6 | Range-Only | assert 0 <= x <= 2 | Range check when the exact value is knowable |
| 7 | Mutation-Insensitive | assert "error" in message | Substring match survives most mutations |
Each pattern has grep-based detection. Scored per 100 tests: 0-5 Low, 6-15 Moderate, 16-30 High, 31+ Critical.
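A grep-based counter in the spirit of scripts/bias-count.sh might look like this. This is a simplified illustration, not the shipped script: it detects only pattern 3 (smoke-only) in Python tests, while the real script covers all 7 types:

```shell
#!/bin/sh
# Count smoke-only assertions in a test directory and report the
# per-100-tests rate. Pattern and test detection simplified for Python.
bias_rate() {
  dir="$1"
  hits=$(grep -rE "assert .+ is not None" "$dir" 2>/dev/null | wc -l)
  tests=$(grep -rE "^def test_" "$dir" 2>/dev/null | wc -l)
  [ "$tests" -eq 0 ] && { echo "0"; return; }
  awk -v h="$hits" -v t="$tests" 'BEGIN { printf "%.0f", h * 100 / t }'
}
```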
Checks test directory for security test patterns per OWASP category:
| Category | What to Test | Detection Pattern |
|---|---|---|
| A01 Access Control | Unauth 401, wrong-role 403, horizontal/vertical escalation | unauthorized|forbidden|403|401|permission |
| A02 Crypto | Password hashing, token expiry, no plaintext secrets | encrypt|hash|bcrypt|jwt|token.*valid |
| A03 Injection | SQL injection, XSS, command injection, path traversal | injection|xss|sanitize|escape|parameterized |
| A04 Insecure Design | Rate limiting, brute force lockout, business logic abuse | rate.limit|throttle|brute.force|lockout |
| A05 Misconfig | Security headers, debug disabled, no stack trace leaks | csp|hsts|x.frame|debug.*false |
| A06 Vulnerable Deps | Dependency audit in CI, CVE blocking | audit|vulnerability|cve|advisory |
| A07 Auth Failures | Login/logout, session expiry, password reset, MFA | login|logout|session|password|mfa |
| A08 Integrity | CSRF tokens, webhook signatures, upload checks | csrf|checksum|integrity|webhook.*valid |
| A09 Logging | Auth failures logged, no sensitive data in logs | audit.*log|security.*log|log.*fail |
| A10 SSRF | URL allowlist, private IP blocking, metadata blocking | ssrf|allowlist|blocklist|metadata.*block |
Grading: A (≥80% items covered) through F (<20%).
TEST RUN COMPLETE
Suite: [name]
Runner: [framework]
Passed: [N]
Failed: 0
Skipped: [N]
Duration: [X]s
Coverage: [X]% lines [X]% branches [X]% functions
Written to repo root. Contains:
- What Is Working Well — existing strengths (coverage, CI, isolation, security, conventions)
- What Could Be Better — gaps sorted P0-P3, each with: what exists, what's missing, why it matters, what to add (with code scaffolds)
- Test Quality Assessment — assertion density, bias analysis, mutation results, OWASP grades, AI-test risk
- Summary Scorecard — 19-row status table across all quality dimensions
- Remediation Plan — prioritized checklist (P0 Critical → P3 Hardening)
TEST QUALITY SCORECARD — [REPO NAME]
ASSERTION QUALITY
Density: [X] per test [Grade]
Smoke-only ratio: [X]% [Grade]
Negative test ratio: [X]% [Grade]
MUTATION TESTING
Kill rate: [X]% [Grade]
Effective coverage: [X]% (line × kill)
SECURITY (OWASP Top 10)
Overall: [Grade]
TEST BIAS
Patterns found: [N]
Per-100-tests rate: [X]
Severity: [Low/Moderate/High/Critical]
AI-TEST RISK
Score: [N] points [Level]
AUTO-REMEDIATION REPORT — [REPO NAME]
TESTS WRITTEN: [N] new files, [N] new functions
TESTS HARDENED: [N] assertions strengthened, [N] bias patterns fixed
MUTATION DELTA: [before]% → [after]% kill rate
QUALITY DELTA: [before] → [after] assertion density
COMMIT: [SHA] — review with: git diff HEAD~1
| Decision | Rationale |
|---|---|
| Progressive disclosure (SKILL.md + 8 reference docs) | SKILL.md stays under 230 lines. Heavy content loaded on-demand per step. Token-efficient. |
| Never assume framework | A project can have Vitest + Playwright + Jest. Discovery prevents wrong-framework execution. |
| Kill rate over line coverage | 90% lines with 65% kill rate = 58.5% effective. Kill rate is the quality signal. |
| Auto-commit tests, never push | Let the developer review generated tests. Commit provides undo point. Push is their decision. |
| 7-pattern bias taxonomy | Derived from mutation testing survivor analysis across 50+ real projects. These 7 patterns account for >90% of false-confidence assertions. |
| P0-P3 risk tiers | P0 = exposed secrets and untested auth. P3 = visual regression. Not all gaps are equal. |
| Verification loop (max 3 iterations) | Generate → run → mutate → iterate. Capped at 3 to prevent infinite loops on equivalent mutants. |
| Scoped Bash tools (30+ scopes) | No unscoped Bash — every tool invocation is auditable. Enterprise-tier requirement. |
| ID | Trigger | Expected Behavior |
|---|---|---|
| happy-path-run-tests | "run tests" | Discovers framework, pre-flights, runs, structured output |
| full-audit-pipeline | "audit tests" | Steps 5→7→8, produces TEST_AUDIT.md |
| mutation-testing-quality | "test quality" | Bias scan, mutation test, OWASP check, scorecard |
| no-framework-found | "run tests" (empty repo) | Detects absence, loads scaffold, suggests setup |
| monorepo-detection | "run tests" (monorepo) | Detects workspaces, asks scope, runs scoped command |
| auto-remediation-commit | "fill gaps" | Generates tests, verifies, commits test files only |
| slow-tests-productivity | "tests are slow" | Duration analysis, flaky detection, CI speed audit |
Grade: A (99/100)
Progressive Disclosure: 27/30
Ease of Use: 25/25
Utility: 19/20
Spec Compliance: 15/15
Writing Style: 10/10
Modifiers: +3 (grep-friendly, exemplary examples)
Validated against the Intent Solutions 100-point rubric (validate-skill.py --grade). Zero errors, zero warnings.