@jeremylongshore
Created March 12, 2026 01:46
Architecture audit: audit-tests Claude Code skill — universal test orchestration, quality audit, mutation testing, and auto-remediation

Quick Architecture Reference Card: audit-tests

Skill: audit-tests
Version: 5.0.0
Author: Jeremy Longshore <jeremy@intentsolutions.io>
License: MIT
Marketplace Grade: A (99/100)
Compatible With: Claude Code, Codex, OpenClaw
Audit Date: 2026-03-11


1. Overview

A universal test orchestration skill that auto-discovers, runs, audits, and auto-remediates tests across 13+ languages and 30+ frameworks. Given any codebase, /audit-tests runs an 8-step pipeline — from framework discovery through mutation testing and bias detection to automated gap-filling — producing structured deliverables (TEST_AUDIT.md, quality scorecards, remediation commits) that transform "we think our tests are good" into "we can prove it."

Target users: Engineering teams, solo developers, vibe coders shipping AI-generated tests, tech leads auditing inherited codebases.


2. File Tree

audit-tests/
├── SKILL.md                                 # Skill definition (v5.0.0, 226 lines)
├── evals/
│   └── evals.json                           # 7 eval scenarios (happy path, edge, negative)
├── scripts/
│   └── bias-count.sh                        # Quick bias pattern counter for test dirs
└── references/                              # 8 specialized reference docs
    ├── discovery-preflight.md               # Auto-discovery, language/PM detection, monorepo
    ├── frameworks.md                        # Unit/integration commands for all languages
    ├── e2e-testing.md                       # Playwright, Cypress, WebdriverIO, Selenium
    ├── specialized-testing.md               # API, perf, security, a11y, visual, mutation
    ├── github-audit.md                      # Full audit engine, report template, remediation
    ├── scaffold-productivity.md             # Scaffold when no tests, productivity audit
    ├── test-quality-deep-audit.md           # Bias detection, mutation, OWASP, AI-test risk
    └── auto-remediation.md                  # Test generation, hardening, verify loop, commit

3. Architecture

8-Step Pipeline

                         User Prompt
                              |
                    +---------v----------+
                    |  1. Intent Router   |
                    +----+----+----+-----+
                         |    |    |
           +-------------+    |    +-------------+
           |                  |                  |
    +------v------+    +------v------+    +------v------+
    | 2. Discovery |    | 5. GH Audit |    | 8. Auto-    |
    +--------------+    +-------------+    |  Remediate  |
           |                  |            +------+------+
    +------v------+    +------v------+           |
    | 3. Pre-Flight|    | 6. Prod.    |    +------v------+
    +--------------+    |    Audit    |    | Verify Loop |
           |            +-------------+    +------+------+
    +------v------+           |                  |
    | 4. Test Run  |    +------v------+    +------v------+
    +--------------+    | 7. Quality  |    | Auto-Commit |
           |            |  Deep Audit |    +-------------+
           v            +-------------+
     TestRunReport            |
                              v
                    TestQualityScorecard

Intent Routing

| User Says | Steps Executed |
| --- | --- |
| "run tests" / "test this" | 2 → 3 → 4 (discover, pre-flight, run) |
| "audit tests" / "find gaps" | 5 → 7 → 8 (full pipeline: audit, quality, fix) |
| "test quality" / "bias" / "harden" | 7 (quality deep audit) |
| "fix tests" / "fill gaps" | 8 (auto-remediation) |
| "tests are slow" | 6 (productivity audit) |
| Specific framework named | 3 → 4 (pre-flight, run) |

Activation Path

  1. User says a trigger phrase (e.g., "audit tests", "run tests", "test quality")
  2. SKILL.md auto-activates, granting scoped tools (30+ Bash scopes, Read, Write, Edit, Glob, Grep)
  3. Step 1 routes intent to the correct pipeline steps
  4. Steps 2-4 discover framework, check environment, run tests
  5. Steps 5-8 audit gaps, assess quality, auto-fix, commit
  6. Output structured reports (console + TEST_AUDIT.md deliverable)

4. Key Components

4.1 SKILL.md Frontmatter

name: audit-tests
description: |
  Test orchestration, quality audit, and auto-remediation across any language.
  Make sure to use this skill whenever running tests, auditing test quality,
  detecting bias patterns, running mutation testing, or filling test gaps.
  Use when you need to discover, run, audit, harden, or auto-fix tests.
  Trigger with "run tests", "audit tests", "test quality", "harden tests",
  "fix tests", "mutation testing", "fill gaps".
allowed-tools: "Read, Write, Edit, Glob, Grep, Bash(find:*), Bash(pnpm:*), Bash(npm:*),
  Bash(npx:*), Bash(yarn:*), Bash(bun:*), Bash(python:*), Bash(pytest:*), Bash(go:*),
  Bash(cargo:*), Bash(bundle:*), Bash(mix:*), Bash(dotnet:*), Bash(docker:*), Bash(gh:*),
  Bash(git:*), Bash(make:*), Bash(ls:*), Bash(cat:*), Bash(grep:*), Bash(lsof:*),
  Bash(curl:*), Bash(k6:*), Bash(artillery:*), Bash(semgrep:*), Bash(gitleaks:*),
  Bash(trivy:*), Bash(ab:*), Task, AskUserQuestion"
model: inherit
version: 5.0.0
author: Jeremy Longshore <jeremy@intentsolutions.io>
license: MIT
compatible-with: claude-code, codex, openclaw
tags: [testing, quality-audit, mutation-testing, auto-remediation, bias-detection, owasp]

4.2 References

| File | Lines | Purpose |
| --- | --- | --- |
| discovery-preflight.md | 164 | Config file scanning, language detection (13 languages), package manager detection (11 PMs), environment readiness, monorepo handling (pnpm, turbo, nx, lerna, rush) |
| frameworks.md | 159 | Complete command reference for Vitest, Jest, Pytest, Go test, Cargo, RSpec, JUnit/Gradle/Maven, PHPUnit, ExUnit, .NET/xUnit |
| e2e-testing.md | ~200 | Playwright, Cypress, WebdriverIO, Selenium, Cucumber setup and commands |
| specialized-testing.md | ~300 | API contract (dredd, Pact), performance (k6, Artillery, Locust), security (SAST, secrets, DAST), accessibility (axe-core, pa11y), visual regression, mutation testing |
| github-audit.md | 598 | Full audit engine: repo scan, gap mapping, CI/CD audit, coverage audit, security audit, TEST_AUDIT.md template, remediation plan generator |
| test-quality-deep-audit.md | 670 | Bias detection (7 patterns), mutation testing (4 tools), assertion quality scoring, OWASP Top 10 coverage, AI-test detection, Test Quality Scorecard |
| auto-remediation.md | 370 | Gap-to-test pipeline, test generation strategy, verification loop, auto-commit protocol, remediation report |
| scaffold-productivity.md | ~200 | Scaffold when no tests exist, duration analysis, flaky detection, CI speed audit |

5. Supported Frameworks

| Language | Unit Framework | E2E Framework | Mutation Tool |
| --- | --- | --- | --- |
| JavaScript/TypeScript | Vitest, Jest, Mocha | Playwright, Cypress, WebdriverIO | Stryker |
| Python | pytest, unittest | Playwright (Python) | mutmut v3 |
| Go | go test | — | — |
| Rust | cargo test | — | cargo-mutants |
| Ruby | RSpec, Minitest | Capybara | — |
| Java/Kotlin | JUnit, Gradle, Maven | Selenium | PITest |
| PHP | PHPUnit | Laravel Dusk | Infection |
| Elixir | ExUnit | Wallaby | — |
| C#/.NET | xUnit, NUnit | Playwright (.NET) | Stryker.NET |
| Swift | XCTest | XCUITest | — |
| Haskell | HSpec | — | — |
| C/C++ | GoogleTest, Catch2 | — | — |
| Dart/Flutter | flutter test | integration_test | — |

6. The 8-Step Pipeline

Step 1: Intent Router

Parses the user's prompt against a trigger table to determine which steps to run. "audit tests" runs the full 5→7→8 pipeline. "run tests" runs 2→3→4. Specific framework names skip discovery.
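The routing described above can be sketched as a first-match lookup over trigger phrases. This is an illustrative reconstruction, not the skill's actual implementation; the function name and table structure are hypothetical, while the phrases and step numbers come from the intent routing table:

```python
# Hypothetical sketch of Step 1's intent routing: the first matching
# trigger phrase decides which pipeline steps run.
TRIGGER_TABLE = [
    (("audit tests", "find gaps"), [5, 7, 8]),    # full pipeline
    (("test quality", "bias", "harden"), [7]),    # quality deep audit
    (("fix tests", "fill gaps"), [8]),            # auto-remediation
    (("tests are slow",), [6]),                   # productivity audit
    (("run tests", "test this"), [2, 3, 4]),      # discover, pre-flight, run
]

def route_intent(prompt: str) -> list[int]:
    """Return the pipeline steps to execute for a user prompt."""
    text = prompt.lower()
    for phrases, steps in TRIGGER_TABLE:
        if any(phrase in text for phrase in phrases):
            return steps
    return [2, 3, 4]  # default: discover and run
```

Ordering matters: the more specific "audit tests" and "test quality" groups are checked before the catch-all "run tests" group.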

Step 2: Discovery

Scans for config files (vitest.config.*, pyproject.toml, Cargo.toml, go.mod, etc.), detects language, identifies package manager from lockfiles, checks for monorepo configuration. Never assumes — always discovers.

Step 3: Pre-Flight

Validates environment readiness: dependencies installed, correct Python venv active, .env.test present, Docker running (if needed), no port conflicts for E2E servers.
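One of these checks, the port-conflict test for E2E servers, can be sketched in a few lines. This is an assumption about how such a check might look (the skill itself is granted `lsof` for this); the function names are hypothetical:

```python
import socket

# Hypothetical sketch of one pre-flight check: is a port an E2E dev
# server needs already taken? Approximates what `lsof -i :PORT` reports.
def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0  # 0 means something answered

def preflight(ports: list[int]) -> list[str]:
    """Return human-readable problems; an empty list means ready to run."""
    return [f"port {p} already in use" for p in ports if port_in_use(p)]
```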

Step 4: Test Runner

Routes to the correct framework command. Executes with coverage when available. Produces structured output (Pass / Fail / Warning) with failure root-cause analysis and specific fix suggestions.

Step 5: GitHub Audit

Maps every source file to its test file. Calculates coverage ratio. Checks CI/CD for test steps, coverage gates, security scanning. Runs dependency audit and secret detection. Generates gap analysis sorted by risk (P0-P3). Writes TEST_AUDIT.md to repo root.

Step 6: Productivity Audit

Duration analysis per test file. Flaky test detection (rerun 3x, classify inconsistent results). CI pipeline speed audit. Identifies parallelization opportunities and bloated test fixtures.

Step 7: Test Quality Deep Audit

The core differentiator. Runs 6 quality checks:

  1. Bias detection — scans for 7 pattern types (see Section 8)
  2. Assertion density — assertions per test function (target ≥2.0)
  3. Negative test ratio — error path coverage (target ≥15%)
  4. OWASP Top 10 coverage — security test grading per category
  5. Mutation testing — kill rate via mutmut/Stryker/PITest/cargo-mutants
  6. AI-test risk scoring — commit-based + pattern-based detection

Produces a Test Quality Scorecard with quality-adjusted coverage: effective_coverage = line_coverage × kill_rate.

Step 8: Auto-Remediation

For each gap found in Steps 5/7: generates tests following project conventions, hardens weak assertions (smoke → exact value), fixes bias patterns, adds negative/boundary/security tests. Runs a verification loop (tests pass → lint clean → mutation test → iterate if kill rate < 70%). Auto-commits test files only. Never pushes — lets user review.
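The verification loop's control flow can be sketched abstractly. The cap matters because equivalent mutants can make a 70% kill-rate target unreachable, so the loop must terminate and hand off to a human. The callable-based structure below is a hypothetical illustration, not the skill's implementation:

```python
# Hypothetical sketch of Step 8's capped verification loop:
# generate -> run -> lint -> mutate, iterating at most 3 times.
def verification_loop(run_tests, run_lint, kill_rate, harden,
                      target: float = 0.70, max_iters: int = 3) -> bool:
    for _ in range(max_iters):
        if not run_tests():
            return False          # generated tests must pass first
        if not run_lint():
            return False          # and be lint-clean
        if kill_rate() >= target:
            return True           # strong enough: safe to auto-commit
        harden()                  # strengthen assertions, add cases
    return False                  # cap reached -- surface for human review
```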


7. Mutation Testing

Tool Matrix

| Language | Tool | Config Location | Command |
| --- | --- | --- | --- |
| Python | mutmut v3 | `[tool.mutmut]` in pyproject.toml | `mutmut run` |
| JS/TS | Stryker | `stryker.conf.json` | `npx stryker run` |
| Java | PITest | build.gradle plugin | `./gradlew pitest` |
| Rust | cargo-mutants | (no config needed) | `cargo mutants` |

Kill Rate Grading

| Kill Rate | Grade | Interpretation |
| --- | --- | --- |
| 90-100% | S | Exceptional — catches nearly every possible bug |
| 80-89% | A | Strong — suitable for production-critical code |
| 70-79% | B | Good — acceptable for most codebases |
| 60-69% | C | Adequate — room for improvement |
| 50-59% | D | Weak — tests miss many real bugs |
| 40-49% | E | Poor — little confidence |
| <40% | F | Failing — tests are decoration |

Quality-Adjusted Coverage

effective_coverage = line_coverage × kill_rate

Example: 90% line coverage × 65% kill rate = 58.5% effective coverage (D grade). High coverage with weak assertions creates false confidence — this formula exposes it.
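The worked example above can be verified directly, applying the kill-rate grade bands from the table to the adjusted result (treating coverages as fractions; the function names are illustrative):

```python
# The quality-adjusted coverage formula, with the grade bands from the
# kill-rate table applied to the result. All values are fractions (0-1).
def effective_coverage(line_coverage: float, kill_rate: float) -> float:
    return line_coverage * kill_rate

def grade(rate: float) -> str:
    bands = [(0.90, "S"), (0.80, "A"), (0.70, "B"),
             (0.60, "C"), (0.50, "D"), (0.40, "E")]
    for floor, letter in bands:
        if rate >= floor:
            return letter
    return "F"

# The worked example from the text: 90% lines x 65% kill rate.
ec = effective_coverage(0.90, 0.65)   # 0.585 -> grade "D"
```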

mutmut v3 Gotchas

  • The v3 CLI differs from v2 (`mutmut run`, not `mutmut run --paths-to-mutate`)
  • Results are stored in `.mutmut-cache/` (SQLite), not in a flat file
  • `also_copy` in config is critical — tests that need fixtures/configs fail without it (false kills)
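A minimal pyproject.toml fragment illustrating the `also_copy` gotcha. This is a sketch: the key names follow mutmut's pyproject-based configuration as described above, but the paths are hypothetical placeholders for your own layout.

```toml
[tool.mutmut]
paths_to_mutate = ["src/"]
tests_dir = ["tests/"]
# Without also_copy, tests that read fixture files fail inside the
# mutation sandbox and register as false kills.
also_copy = ["tests/fixtures/"]
```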

8. Bias Detection Engine

7 bias patterns that let mutations survive because assertions don't verify correctness:

| # | Bias Type | What It Looks Like | Why It's Bad |
| --- | --- | --- | --- |
| 1 | Tautological | `assert sorted(x) == sorted(x)` | Asserts a value equals itself — always passes |
| 2 | Self-Referential | `expected = module.calculate(input); assert result == expected` | Computes expected from the same code under test |
| 3 | Smoke-Only | `assert result is not None` | Checks existence, not correctness |
| 4 | Identity Misuse | `assert not result` | Truthiness check when a specific value is needed |
| 5 | Symmetric Input | `func(0, 0)`, `func("test", "test")` | Degenerate inputs don't exercise real logic |
| 6 | Range-Only | `assert 0 <= x <= 2` | Range check when the exact value is knowable |
| 7 | Mutation-Insensitive | `assert "error" in message` | Substring match survives most mutations |
Each pattern has grep-based detection. Scored per 100 tests: 0-5 Low, 6-15 Moderate, 16-30 High, 30+ Critical.
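A sketch of what that grep-based detection plus per-100 scoring might look like. Only three of the seven patterns are shown, and the regexes are illustrative approximations, not the skill's actual patterns:

```python
import re

# Hypothetical sketch of grep-style bias detection (3 of the 7 patterns)
# with the per-100-tests severity scale from the text.
BIAS_PATTERNS = {
    "smoke_only": r"assert\s+\w+\s+is\s+not\s+None\s*$",
    "range_only": r"assert\s+[-\d.]+\s*<=\s*\w+\s*<=\s*[-\d.]+",
    "mutation_insensitive": r'assert\s+"[^"]+"\s+in\s+\w+',
}

def bias_score(test_source: str, test_count: int):
    """Return (hits, per-100-tests rate, severity label)."""
    hits = sum(len(re.findall(pattern, test_source, re.M))
               for pattern in BIAS_PATTERNS.values())
    per_100 = 100 * hits / max(test_count, 1)
    severity = ("Low" if per_100 <= 5 else
                "Moderate" if per_100 <= 15 else
                "High" if per_100 <= 30 else "Critical")
    return hits, per_100, severity
```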


9. OWASP Top 10 Security Test Coverage

Checks test directory for security test patterns per OWASP category:

| Category | What to Test | Detection Pattern |
| --- | --- | --- |
| A01 Access Control | Unauth 401, wrong-role 403, horizontal/vertical escalation | `unauthorized\|forbidden\|403\|401\|permission` |
| A02 Crypto | Password hashing, token expiry, no plaintext secrets | `encrypt\|hash\|bcrypt\|jwt\|token.*valid` |
| A03 Injection | SQL injection, XSS, command injection, path traversal | `injection\|xss\|sanitize\|escape\|parameterized` |
| A04 Insecure Design | Rate limiting, brute force lockout, business logic abuse | `rate.limit\|throttle\|brute.force\|lockout` |
| A05 Misconfig | Security headers, debug disabled, no stack trace leaks | `csp\|hsts\|x.frame\|debug.*false` |
| A06 Vulnerable Deps | Dependency audit in CI, CVE blocking | `audit\|vulnerability\|cve\|advisory` |
| A07 Auth Failures | Login/logout, session expiry, password reset, MFA | `login\|logout\|session\|password\|mfa` |
| A08 Integrity | CSRF tokens, webhook signatures, upload checks | `csrf\|checksum\|integrity\|webhook.*valid` |
| A09 Logging | Auth failures logged, no sensitive data in logs | `audit.*log\|security.*log\|log.*fail` |
| A10 SSRF | URL allowlist, private IP blocking, metadata blocking | `ssrf\|allowlist\|blocklist\|metadata.*block` |

Grading: A (≥80% items covered) through F (<20%).
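The grading maps the share of covered checklist items per category to a letter. The source only pins the A and F thresholds; the intermediate bands below are illustrative assumptions, as is the function name:

```python
# Hypothetical sketch of per-category OWASP grading. A (>= 80% of items
# covered) and F (< 20%) come from the text; B/C/D bands are assumed.
def owasp_grade(covered_items: int, total_items: int) -> str:
    ratio = covered_items / total_items if total_items else 0.0
    if ratio >= 0.80:
        return "A"
    if ratio >= 0.60:
        return "B"   # assumed band
    if ratio >= 0.40:
        return "C"   # assumed band
    if ratio >= 0.20:
        return "D"   # assumed band
    return "F"
```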


10. Deliverables

Console Output (every run)

TEST RUN COMPLETE
Suite:    [name]
Runner:   [framework]
Passed:   [N]
Failed:   0
Skipped:  [N]
Duration: [X]s
Coverage: [X]% lines  [X]% branches  [X]% functions

TEST_AUDIT.md (audit runs)

Written to repo root. Contains:

  • What Is Working Well — existing strengths (coverage, CI, isolation, security, conventions)
  • What Could Be Better — gaps sorted P0-P3, each with: what exists, what's missing, why it matters, what to add (with code scaffolds)
  • Test Quality Assessment — assertion density, bias analysis, mutation results, OWASP grades, AI-test risk
  • Summary Scorecard — 19-row status table across all quality dimensions
  • Remediation Plan — prioritized checklist (P0 Critical → P3 Hardening)

Test Quality Scorecard (quality audit runs)

TEST QUALITY SCORECARD — [REPO NAME]

ASSERTION QUALITY
  Density:              [X] per test      [Grade]
  Smoke-only ratio:     [X]%              [Grade]
  Negative test ratio:  [X]%              [Grade]

MUTATION TESTING
  Kill rate:            [X]%              [Grade]
  Effective coverage:   [X]%  (line × kill)

SECURITY (OWASP Top 10)
  Overall:              [Grade]

TEST BIAS
  Patterns found:       [N]
  Per-100-tests rate:   [X]
  Severity:             [Low/Moderate/High/Critical]

AI-TEST RISK
  Score:                [N] points        [Level]

Remediation Report (auto-fix runs)

AUTO-REMEDIATION REPORT — [REPO NAME]

TESTS WRITTEN:          [N] new files, [N] new functions
TESTS HARDENED:         [N] assertions strengthened, [N] bias patterns fixed
MUTATION DELTA:         [before]% → [after]% kill rate
QUALITY DELTA:          [before] → [after] assertion density
COMMIT:                 [SHA] — review with: git diff HEAD~1

11. Design Decisions

| Decision | Rationale |
| --- | --- |
| Progressive disclosure (SKILL.md + 8 reference docs) | SKILL.md stays under 230 lines. Heavy content loaded on-demand per step. Token-efficient. |
| Never assume framework | A project can have Vitest + Playwright + Jest. Discovery prevents wrong-framework execution. |
| Kill rate over line coverage | 90% lines with 65% kill rate = 58.5% effective. Kill rate is the quality signal. |
| Auto-commit tests, never push | Let the developer review generated tests. Commit provides an undo point. Push is their decision. |
| 7-pattern bias taxonomy | Derived from mutation-testing survivor analysis across 50+ real projects. These 7 patterns account for >90% of false-confidence assertions. |
| P0-P3 risk tiers | P0 = exposed secrets and untested auth. P3 = visual regression. Not all gaps are equal. |
| Verification loop (max 3 iterations) | Generate → run → mutate → iterate. Capped at 3 to prevent infinite loops on equivalent mutants. |
| Scoped Bash tools (30+ scopes) | No unscoped Bash — every tool invocation is auditable. Enterprise-tier requirement. |

12. Eval Scenarios

| ID | Trigger | Expected Behavior |
| --- | --- | --- |
| happy-path-run-tests | "run tests" | Discovers framework, pre-flights, runs, structured output |
| full-audit-pipeline | "audit tests" | Steps 5→7→8, produces TEST_AUDIT.md |
| mutation-testing-quality | "test quality" | Bias scan, mutation test, OWASP check, scorecard |
| no-framework-found | "run tests" (empty repo) | Detects absence, loads scaffold, suggests setup |
| monorepo-detection | "run tests" (monorepo) | Detects workspaces, asks scope, runs scoped command |
| auto-remediation-commit | "fill gaps" | Generates tests, verifies, commits test files only |
| slow-tests-productivity | "tests are slow" | Duration analysis, flaky detection, CI speed audit |

13. Marketplace Validation

Grade: A (99/100)

Progressive Disclosure:  27/30
Ease of Use:             25/25
Utility:                 19/20
Spec Compliance:         15/15
Writing Style:           10/10
Modifiers:               +3 (grep-friendly, exemplary examples)

Validated against the Intent Solutions 100-point rubric (validate-skill.py --grade). Zero errors, zero warnings.
