@jeremylongshore
Created March 12, 2026 01:46
Architecture audit: audit-tests Claude Code skill — universal test orchestration, quality audit, mutation testing, and auto-remediation

Quick Architecture Reference Card: audit-tests

Skill: audit-tests
Version: 5.0.0
Author: Jeremy Longshore <jeremy@intentsolutions.io>
License: MIT
Marketplace Grade: A (99/100)
Compatible With: Claude Code, Codex, OpenClaw
Audit Date: 2026-03-11


1. Overview

A universal test orchestration skill that auto-discovers, runs, audits, and auto-remediates tests across 13+ languages and 30+ frameworks. Given any codebase, /audit-tests runs an 8-step pipeline — from framework discovery through mutation testing and bias detection to automated gap-filling — producing structured deliverables (TEST_AUDIT.md, quality scorecards, remediation commits) that transform "we think our tests are good" into "we can prove it."

Target users: Engineering teams, solo developers, vibe coders shipping AI-generated tests, tech leads auditing inherited codebases.


2. File Tree

audit-tests/
├── SKILL.md                                 # Skill definition (v5.0.0, 226 lines)
├── evals/
│   └── evals.json                           # 7 eval scenarios (happy path, edge, negative)
├── scripts/
│   └── bias-count.sh                        # Quick bias pattern counter for test dirs
└── references/                              # 8 specialized reference docs
    ├── discovery-preflight.md               # Auto-discovery, language/PM detection, monorepo
    ├── frameworks.md                        # Unit/integration commands for all languages
    ├── e2e-testing.md                       # Playwright, Cypress, WebdriverIO, Selenium
    ├── specialized-testing.md               # API, perf, security, a11y, visual, mutation
    ├── github-audit.md                      # Full audit engine, report template, remediation
    ├── scaffold-productivity.md             # Scaffold when no tests, productivity audit
    ├── test-quality-deep-audit.md           # Bias detection, mutation, OWASP, AI-test risk
    └── auto-remediation.md                  # Test generation, hardening, verify loop, commit

3. Architecture

8-Step Pipeline

                         User Prompt
                              |
                    +---------v----------+
                    |  1. Intent Router   |
                    +----+----+----+-----+
                         |    |    |
           +-------------+    |    +-------------+
           |                  |                  |
    +------v------+    +------v------+    +------v------+
    | 2. Discovery |    | 5. GH Audit |    | 8. Auto-    |
    +--------------+    +-------------+    |  Remediate  |
           |                  |            +------+------+
    +------v------+    +------v------+           |
    | 3. Pre-Flight|    | 6. Prod.    |    +------v------+
    +--------------+    |    Audit    |    | Verify Loop |
           |            +-------------+    +------+------+
    +------v------+           |                  |
    | 4. Test Run  |    +------v------+    +------v------+
    +--------------+    | 7. Quality  |    | Auto-Commit |
           |            |  Deep Audit |    +-------------+
           v            +-------------+
     TestRunReport            |
                              v
                    TestQualityScorecard

Intent Routing

| User Says | Steps Executed |
| --- | --- |
| "run tests" / "test this" | 2 → 3 → 4 (discover, pre-flight, run) |
| "audit tests" / "find gaps" | 5 → 7 → 8 (full pipeline: audit, quality, fix) |
| "test quality" / "bias" / "harden" | 7 (quality deep audit) |
| "fix tests" / "fill gaps" | 8 (auto-remediation) |
| "tests are slow" | 6 (productivity audit) |
| Specific framework named | 3 → 4 (pre-flight, run) |

Activation Path

  1. User says a trigger phrase (e.g., "audit tests", "run tests", "test quality")
  2. SKILL.md auto-activates, granting scoped tools (30+ Bash scopes, Read, Write, Edit, Glob, Grep)
  3. Step 1 routes intent to the correct pipeline steps
  4. Steps 2-4 discover framework, check environment, run tests
  5. Steps 5-8 audit gaps, assess quality, auto-fix, commit
  6. Output structured reports (console + TEST_AUDIT.md deliverable)

4. Key Components

4.1 SKILL.md Frontmatter

name: audit-tests
description: |
  Test orchestration, quality audit, and auto-remediation across any language.
  Make sure to use this skill whenever running tests, auditing test quality,
  detecting bias patterns, running mutation testing, or filling test gaps.
  Use when you need to discover, run, audit, harden, or auto-fix tests.
  Trigger with "run tests", "audit tests", "test quality", "harden tests",
  "fix tests", "mutation testing", "fill gaps".
allowed-tools: "Read, Write, Edit, Glob, Grep, Bash(find:*), Bash(pnpm:*), Bash(npm:*),
  Bash(npx:*), Bash(yarn:*), Bash(bun:*), Bash(python:*), Bash(pytest:*), Bash(go:*),
  Bash(cargo:*), Bash(bundle:*), Bash(mix:*), Bash(dotnet:*), Bash(docker:*), Bash(gh:*),
  Bash(git:*), Bash(make:*), Bash(ls:*), Bash(cat:*), Bash(grep:*), Bash(lsof:*),
  Bash(curl:*), Bash(k6:*), Bash(artillery:*), Bash(semgrep:*), Bash(gitleaks:*),
  Bash(trivy:*), Bash(ab:*), Task, AskUserQuestion"
model: inherit
version: 5.0.0
author: Jeremy Longshore <jeremy@intentsolutions.io>
license: MIT
compatible-with: claude-code, codex, openclaw
tags: [testing, quality-audit, mutation-testing, auto-remediation, bias-detection, owasp]

4.2 References

| File | Lines | Purpose |
| --- | --- | --- |
| discovery-preflight.md | 164 | Config file scanning, language detection (13 languages), package manager detection (11 PMs), environment readiness, monorepo handling (pnpm, turbo, nx, lerna, rush) |
| frameworks.md | 159 | Complete command reference for Vitest, Jest, Pytest, Go test, Cargo, RSpec, JUnit/Gradle/Maven, PHPUnit, ExUnit, .NET/xUnit |
| e2e-testing.md | ~200 | Playwright, Cypress, WebdriverIO, Selenium, Cucumber setup and commands |
| specialized-testing.md | ~300 | API contract (dredd, Pact), performance (k6, Artillery, Locust), security (SAST, secrets, DAST), accessibility (axe-core, pa11y), visual regression, mutation testing |
| github-audit.md | 598 | Full audit engine: repo scan, gap mapping, CI/CD audit, coverage audit, security audit, TEST_AUDIT.md template, remediation plan generator |
| test-quality-deep-audit.md | 670 | Bias detection (7 patterns), mutation testing (4 tools), assertion quality scoring, OWASP Top 10 coverage, AI-test detection, Test Quality Scorecard |
| auto-remediation.md | 370 | Gap-to-test pipeline, test generation strategy, verification loop, auto-commit protocol, remediation report |
| scaffold-productivity.md | ~200 | Scaffold when no tests exist, duration analysis, flaky detection, CI speed audit |

5. Supported Frameworks

| Language | Unit Framework | E2E Framework | Mutation Tool |
| --- | --- | --- | --- |
| JavaScript/TypeScript | Vitest, Jest, Mocha | Playwright, Cypress, WebdriverIO | Stryker |
| Python | pytest, unittest | Playwright (Python) | mutmut v3 |
| Go | go test | — | — |
| Rust | cargo test | — | cargo-mutants |
| Ruby | RSpec, Minitest | Capybara | — |
| Java/Kotlin | JUnit, Gradle, Maven | Selenium | PITest |
| PHP | PHPUnit | Laravel Dusk | Infection |
| Elixir | ExUnit | Wallaby | — |
| C#/.NET | xUnit, NUnit | Playwright (.NET) | Stryker.NET |
| Swift | XCTest | XCUITest | — |
| Haskell | HSpec | — | — |
| C/C++ | GoogleTest, Catch2 | — | — |
| Dart/Flutter | flutter test | integration_test | — |

6. The 8-Step Pipeline

Step 1: Intent Router

Parses the user's prompt against a trigger table to determine which steps to run. "audit tests" runs the full 5→7→8 pipeline. "run tests" runs 2→3→4. Specific framework names skip discovery.
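The routing described above can be sketched as a first-match lookup over trigger phrases. This is an illustrative reconstruction, not the skill's actual implementation; the function name and table structure are hypothetical, while the phrases and step numbers come from the intent routing table:

```python
# Hypothetical sketch of Step 1's intent routing: the first matching
# trigger phrase decides which pipeline steps run.
TRIGGER_TABLE = [
    (("audit tests", "find gaps"), [5, 7, 8]),    # full pipeline
    (("test quality", "bias", "harden"), [7]),    # quality deep audit
    (("fix tests", "fill gaps"), [8]),            # auto-remediation
    (("tests are slow",), [6]),                   # productivity audit
    (("run tests", "test this"), [2, 3, 4]),      # discover, pre-flight, run
]

def route_intent(prompt: str) -> list[int]:
    """Return the pipeline steps to execute for a user prompt."""
    text = prompt.lower()
    for phrases, steps in TRIGGER_TABLE:
        if any(phrase in text for phrase in phrases):
            return steps
    return [2, 3, 4]  # default: discover and run
```

Ordering matters: the more specific "audit tests" and "test quality" groups are checked before the catch-all "run tests" group.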

Step 2: Discovery

Scans for config files (vitest.config.*, pyproject.toml, Cargo.toml, go.mod, etc.), detects language, identifies package manager from lockfiles, checks for monorepo configuration. Never assumes — always discovers.

Step 3: Pre-Flight

Validates environment readiness: dependencies installed, correct Python venv active, .env.test present, Docker running (if needed), no port conflicts for E2E servers.
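One of these checks, the port-conflict test for E2E servers, can be sketched in a few lines. This is an assumption about how such a check might look (the skill itself is granted `lsof` for this); the function names are hypothetical:

```python
import socket

# Hypothetical sketch of one pre-flight check: is a port an E2E dev
# server needs already taken? Approximates what `lsof -i :PORT` reports.
def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0  # 0 means something answered

def preflight(ports: list[int]) -> list[str]:
    """Return human-readable problems; an empty list means ready to run."""
    return [f"port {p} already in use" for p in ports if port_in_use(p)]
```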

Step 4: Test Runner

Routes to the correct framework command. Executes with coverage when available. Produces structured output (Pass / Fail / Warning) with failure root-cause analysis and specific fix suggestions.

Step 5: GitHub Audit

Maps every source file to its test file. Calculates coverage ratio. Checks CI/CD for test steps, coverage gates, security scanning. Runs dependency audit and secret detection. Generates gap analysis sorted by risk (P0-P3). Writes TEST_AUDIT.md to repo root.

Step 6: Productivity Audit

Duration analysis per test file. Flaky test detection (rerun 3x, classify inconsistent results). CI pipeline speed audit. Identifies parallelization opportunities and bloated test fixtures.

Step 7: Test Quality Deep Audit

The core differentiator. Runs 6 quality checks:

  1. Bias detection — scans for 7 pattern types (see Section 8)
  2. Assertion density — assertions per test function (target ≥2.0)
  3. Negative test ratio — error path coverage (target ≥15%)
  4. OWASP Top 10 coverage — security test grading per category
  5. Mutation testing — kill rate via mutmut/Stryker/PITest/cargo-mutants
  6. AI-test risk scoring — commit-based + pattern-based detection

Produces a Test Quality Scorecard with quality-adjusted coverage: effective_coverage = line_coverage × kill_rate.

Step 8: Auto-Remediation

For each gap found in Steps 5/7: generates tests following project conventions, hardens weak assertions (smoke → exact value), fixes bias patterns, adds negative/boundary/security tests. Runs a verification loop (tests pass → lint clean → mutation test → iterate if kill rate < 70%). Auto-commits test files only. Never pushes — lets user review.
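The verification loop's control flow can be sketched abstractly. The cap matters because equivalent mutants can make a 70% kill-rate target unreachable, so the loop must terminate and hand off to a human. The callable-based structure below is a hypothetical illustration, not the skill's implementation:

```python
# Hypothetical sketch of Step 8's capped verification loop:
# generate -> run -> lint -> mutate, iterating at most 3 times.
def verification_loop(run_tests, run_lint, kill_rate, harden,
                      target: float = 0.70, max_iters: int = 3) -> bool:
    for _ in range(max_iters):
        if not run_tests():
            return False          # generated tests must pass first
        if not run_lint():
            return False          # and be lint-clean
        if kill_rate() >= target:
            return True           # strong enough: safe to auto-commit
        harden()                  # strengthen assertions, add cases
    return False                  # cap reached -- surface for human review
```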


7. Mutation Testing

Tool Matrix

| Language | Tool | Config Location | Command |
| --- | --- | --- | --- |
| Python | mutmut v3 | `[tool.mutmut]` in pyproject.toml | `mutmut run` |
| JS/TS | Stryker | `stryker.conf.json` | `npx stryker run` |
| Java | PITest | build.gradle plugin | `./gradlew pitest` |
| Rust | cargo-mutants | (no config needed) | `cargo mutants` |

Kill Rate Grading

| Kill Rate | Grade | Interpretation |
| --- | --- | --- |
| 90-100% | S | Exceptional — catches nearly every possible bug |
| 80-89% | A | Strong — suitable for production-critical code |
| 70-79% | B | Good — acceptable for most codebases |
| 60-69% | C | Adequate — room for improvement |
| 50-59% | D | Weak — tests miss many real bugs |
| 40-49% | E | Poor — little confidence |
| <40% | F | Failing — tests are decoration |

Quality-Adjusted Coverage

effective_coverage = line_coverage × kill_rate

Example: 90% line coverage × 65% kill rate = 58.5% effective coverage (D grade). High coverage with weak assertions creates false confidence — this formula exposes it.
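The worked example above can be verified directly, applying the kill-rate grade bands from the table to the adjusted result (treating coverages as fractions; the function names are illustrative):

```python
# The quality-adjusted coverage formula, with the grade bands from the
# kill-rate table applied to the result. All values are fractions (0-1).
def effective_coverage(line_coverage: float, kill_rate: float) -> float:
    return line_coverage * kill_rate

def grade(rate: float) -> str:
    bands = [(0.90, "S"), (0.80, "A"), (0.70, "B"),
             (0.60, "C"), (0.50, "D"), (0.40, "E")]
    for floor, letter in bands:
        if rate >= floor:
            return letter
    return "F"

# The worked example from the text: 90% lines x 65% kill rate.
ec = effective_coverage(0.90, 0.65)   # 0.585 -> grade "D"
```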

mutmut v3 Gotchas

  • The v3 CLI differs from v2 (`mutmut run`, not `mutmut run --paths-to-mutate`)
  • Results are stored in `.mutmut-cache/` (SQLite), not in a flat file
  • `also_copy` in config is critical — tests that need fixtures/configs fail without it (false kills)
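A minimal pyproject.toml fragment illustrating the `also_copy` gotcha. This is a sketch: the key names follow mutmut's pyproject-based configuration as described above, but the paths are hypothetical placeholders for your own layout.

```toml
[tool.mutmut]
paths_to_mutate = ["src/"]
tests_dir = ["tests/"]
# Without also_copy, tests that read fixture files fail inside the
# mutation sandbox and register as false kills.
also_copy = ["tests/fixtures/"]
```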

8. Bias Detection Engine

7 bias patterns that let mutations survive because assertions don't verify correctness:

| # | Bias Type | What It Looks Like | Why It's Bad |
| --- | --- | --- | --- |
| 1 | Tautological | `assert sorted(x) == sorted(x)` | Asserts a value equals itself — always passes |
| 2 | Self-Referential | `expected = module.calculate(input); assert result == expected` | Computes expected from the same code under test |
| 3 | Smoke-Only | `assert result is not None` | Checks existence, not correctness |
| 4 | Identity Misuse | `assert not result` | Truthiness check when a specific value is needed |
| 5 | Symmetric Input | `func(0, 0)`, `func("test", "test")` | Degenerate inputs don't exercise real logic |
| 6 | Range-Only | `assert 0 <= x <= 2` | Range check when the exact value is knowable |
| 7 | Mutation-Insensitive | `assert "error" in message` | Substring match survives most mutations |
Each pattern has grep-based detection. Scored per 100 tests: 0-5 Low, 6-15 Moderate, 16-30 High, 30+ Critical.
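A sketch of what that grep-based detection plus per-100 scoring might look like. Only three of the seven patterns are shown, and the regexes are illustrative approximations, not the skill's actual patterns:

```python
import re

# Hypothetical sketch of grep-style bias detection (3 of the 7 patterns)
# with the per-100-tests severity scale from the text.
BIAS_PATTERNS = {
    "smoke_only": r"assert\s+\w+\s+is\s+not\s+None\s*$",
    "range_only": r"assert\s+[-\d.]+\s*<=\s*\w+\s*<=\s*[-\d.]+",
    "mutation_insensitive": r'assert\s+"[^"]+"\s+in\s+\w+',
}

def bias_score(test_source: str, test_count: int):
    """Return (hits, per-100-tests rate, severity label)."""
    hits = sum(len(re.findall(pattern, test_source, re.M))
               for pattern in BIAS_PATTERNS.values())
    per_100 = 100 * hits / max(test_count, 1)
    severity = ("Low" if per_100 <= 5 else
                "Moderate" if per_100 <= 15 else
                "High" if per_100 <= 30 else "Critical")
    return hits, per_100, severity
```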


9. OWASP Top 10 Security Test Coverage

Checks test directory for security test patterns per OWASP category:

| Category | What to Test | Detection Pattern |
| --- | --- | --- |
| A01 Access Control | Unauth 401, wrong-role 403, horizontal/vertical escalation | `unauthorized\|forbidden\|403\|401\|permission` |
| A02 Crypto | Password hashing, token expiry, no plaintext secrets | `encrypt\|hash\|bcrypt\|jwt\|token.*valid` |
| A03 Injection | SQL injection, XSS, command injection, path traversal | `injection\|xss\|sanitize\|escape\|parameterized` |
| A04 Insecure Design | Rate limiting, brute force lockout, business logic abuse | `rate.limit\|throttle\|brute.force\|lockout` |
| A05 Misconfig | Security headers, debug disabled, no stack trace leaks | `csp\|hsts\|x.frame\|debug.*false` |
| A06 Vulnerable Deps | Dependency audit in CI, CVE blocking | `audit\|vulnerability\|cve\|advisory` |
| A07 Auth Failures | Login/logout, session expiry, password reset, MFA | `login\|logout\|session\|password\|mfa` |
| A08 Integrity | CSRF tokens, webhook signatures, upload checks | `csrf\|checksum\|integrity\|webhook.*valid` |
| A09 Logging | Auth failures logged, no sensitive data in logs | `audit.*log\|security.*log\|log.*fail` |
| A10 SSRF | URL allowlist, private IP blocking, metadata blocking | `ssrf\|allowlist\|blocklist\|metadata.*block` |

Grading: A (≥80% items covered) through F (<20%).
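The grading maps the share of covered checklist items per category to a letter. The source only pins the A and F thresholds; the intermediate bands below are illustrative assumptions, as is the function name:

```python
# Hypothetical sketch of per-category OWASP grading. A (>= 80% of items
# covered) and F (< 20%) come from the text; B/C/D bands are assumed.
def owasp_grade(covered_items: int, total_items: int) -> str:
    ratio = covered_items / total_items if total_items else 0.0
    if ratio >= 0.80:
        return "A"
    if ratio >= 0.60:
        return "B"   # assumed band
    if ratio >= 0.40:
        return "C"   # assumed band
    if ratio >= 0.20:
        return "D"   # assumed band
    return "F"
```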


10. Deliverables

Console Output (every run)

TEST RUN COMPLETE
Suite:    [name]
Runner:   [framework]
Passed:   [N]
Failed:   0
Skipped:  [N]
Duration: [X]s
Coverage: [X]% lines  [X]% branches  [X]% functions

TEST_AUDIT.md (audit runs)

Written to repo root. Contains:

  • What Is Working Well — existing strengths (coverage, CI, isolation, security, conventions)
  • What Could Be Better — gaps sorted P0-P3, each with: what exists, what's missing, why it matters, what to add (with code scaffolds)
  • Test Quality Assessment — assertion density, bias analysis, mutation results, OWASP grades, AI-test risk
  • Summary Scorecard — 19-row status table across all quality dimensions
  • Remediation Plan — prioritized checklist (P0 Critical → P3 Hardening)

Test Quality Scorecard (quality audit runs)

TEST QUALITY SCORECARD — [REPO NAME]

ASSERTION QUALITY
  Density:              [X] per test      [Grade]
  Smoke-only ratio:     [X]%              [Grade]
  Negative test ratio:  [X]%              [Grade]

MUTATION TESTING
  Kill rate:            [X]%              [Grade]
  Effective coverage:   [X]%  (line × kill)

SECURITY (OWASP Top 10)
  Overall:              [Grade]

TEST BIAS
  Patterns found:       [N]
  Per-100-tests rate:   [X]
  Severity:             [Low/Moderate/High/Critical]

AI-TEST RISK
  Score:                [N] points        [Level]

Remediation Report (auto-fix runs)

AUTO-REMEDIATION REPORT — [REPO NAME]

TESTS WRITTEN:          [N] new files, [N] new functions
TESTS HARDENED:         [N] assertions strengthened, [N] bias patterns fixed
MUTATION DELTA:         [before]% → [after]% kill rate
QUALITY DELTA:          [before] → [after] assertion density
COMMIT:                 [SHA] — review with: git diff HEAD~1

11. Design Decisions

| Decision | Rationale |
| --- | --- |
| Progressive disclosure (SKILL.md + 8 reference docs) | SKILL.md stays under 230 lines. Heavy content loaded on-demand per step. Token-efficient. |
| Never assume framework | A project can have Vitest + Playwright + Jest. Discovery prevents wrong-framework execution. |
| Kill rate over line coverage | 90% lines with 65% kill rate = 58.5% effective. Kill rate is the quality signal. |
| Auto-commit tests, never push | Let the developer review generated tests. Commit provides an undo point. Push is their decision. |
| 7-pattern bias taxonomy | Derived from mutation-testing survivor analysis across 50+ real projects. These 7 patterns account for >90% of false-confidence assertions. |
| P0-P3 risk tiers | P0 = exposed secrets and untested auth. P3 = visual regression. Not all gaps are equal. |
| Verification loop (max 3 iterations) | Generate → run → mutate → iterate. Capped at 3 to prevent infinite loops on equivalent mutants. |
| Scoped Bash tools (30+ scopes) | No unscoped Bash — every tool invocation is auditable. Enterprise-tier requirement. |

12. Eval Scenarios

| ID | Trigger | Expected Behavior |
| --- | --- | --- |
| happy-path-run-tests | "run tests" | Discovers framework, pre-flights, runs, structured output |
| full-audit-pipeline | "audit tests" | Steps 5→7→8, produces TEST_AUDIT.md |
| mutation-testing-quality | "test quality" | Bias scan, mutation test, OWASP check, scorecard |
| no-framework-found | "run tests" (empty repo) | Detects absence, loads scaffold, suggests setup |
| monorepo-detection | "run tests" (monorepo) | Detects workspaces, asks scope, runs scoped command |
| auto-remediation-commit | "fill gaps" | Generates tests, verifies, commits test files only |
| slow-tests-productivity | "tests are slow" | Duration analysis, flaky detection, CI speed audit |

13. Marketplace Validation

Grade: A (99/100)

Progressive Disclosure:  27/30
Ease of Use:             25/25
Utility:                 19/20
Spec Compliance:         15/15
Writing Style:           10/10
Modifiers:               +3 (grep-friendly, exemplary examples)

Validated against the Intent Solutions 100-point rubric (validate-skill.py --grade). Zero errors, zero warnings.
