@kbouw · Created February 21, 2026 00:19
# Claude Skill Audit Document

**Purpose:** Give this document to an LLM alongside your project's skill files. It will audit each skill against industry best practices and return a structured quality report with specific, actionable fixes.

**Usage:** Paste this document into a conversation, then say: "Audit all skills in this project against the criteria in this document. For each skill, produce a scored report and specific recommendations."

## Instructions for the Auditing LLM

You are auditing Claude skills (SKILL.md files and their supporting directories) against current industry best practices derived from Anthropic's official documentation, the SkillsBench research paper (Feb 2026), and community-validated patterns.

For each skill you find, produce this report:

```markdown
## Skill: [skill-name]
- **Path:** [file path]
- **Overall Grade:** [Elite / Good / Average / Needs Work] ([score]%)
- **Summary:** [1–2 sentence verdict]

### Dimension Scores
| Dimension | Score | Finding |
|-----------|-------|---------|
| ... | /10 | ... |

### Critical Issues (fix first)
1. ...

### Recommendations (improve next)
1. ...

### What's Working Well
1. ...
```
## Audit Criteria

Score each dimension 0–10. The overall grade is the average across all 15 dimensions, mapped to: Elite (≥ 85%), Good (70–84%), Average (40–69%), Needs Work (< 40%).
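The grade mapping above can be sketched as a small helper (the function name `overall_grade` is illustrative, not part of the audit format):

```python
def overall_grade(dimension_scores):
    """Map 15 per-dimension scores (0-10 each) to a percentage and grade band."""
    percent = sum(dimension_scores) / (len(dimension_scores) * 10) * 100
    if percent >= 85:
        grade = "Elite"
    elif percent >= 70:
        grade = "Good"
    elif percent >= 40:
        grade = "Average"
    else:
        grade = "Needs Work"
    return grade, round(percent)
```

For example, a skill scoring eight on twelve dimensions and five on three averages 74%, landing in the Good band.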

### 1. Name Quality (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | Vague, generic (`helper`, `utils`, `tools`, `documents`), or contains reserved words (`anthropic`, `claude`) |
| 4–6 | Descriptive but inconsistent form, or slightly too broad |
| 7–8 | Clear, specific, uses gerund form (`processing-pdfs`, `analyzing-spreadsheets`) |
| 9–10 | Precise, immediately communicates scope, consistent with other skills in the project |

**Rules:** Must be lowercase letters, numbers, and hyphens only. Max 64 characters. No XML tags.
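The name rules reduce to a single check; a minimal sketch (the function name is illustrative):

```python
import re

RESERVED = ("anthropic", "claude")

def valid_skill_name(name: str) -> bool:
    """Lowercase letters, digits, and hyphens; 1-64 chars; no reserved words."""
    if not re.fullmatch(r"[a-z0-9-]{1,64}", name):
        return False
    return not any(word in name for word in RESERVED)
```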

### 2. Description Quality (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | Vague ("Helps with documents"), missing trigger conditions, or uses first/second person |
| 4–6 | Says what it does but missing trigger conditions OR key vocabulary for discovery |
| 7–8 | Includes what it does + when to use it + key terms. Third person. |
| 9–10 | Comprehensive: what + when + scope boundaries + key vocabulary users would actually say. Third person throughout. |

Critical rules to check:

- MUST be third person. ✅ "Processes Excel files" ❌ "I can help you" ❌ "You can use this to"
- MUST include trigger conditions (ideally starts with or contains "Use when...")
- MUST include key terms/vocabulary that users would use
- Max 1024 characters. Non-empty.
- Should tell Claude enough to select this skill from 100+ installed skills
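As an illustration (the skill name and wording here are hypothetical), frontmatter that would score 9–10 on this rubric might look like:

```yaml
---
name: processing-pdfs
description: >-
  Extracts text, tables, and form fields from PDF files. Use when the user
  asks to read, parse, summarize, or fill in a PDF, or mentions scanned
  documents, OCR, or form extraction. Does not edit PDF layout or graphics.
---
```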

### 3. Token Efficiency (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | Explains concepts Claude already knows (what a PDF is, how libraries work, etc.), verbose padding |
| 4–6 | Some unnecessary explanation but mostly relevant content |
| 7–8 | Concise; only includes context Claude doesn't have |
| 9–10 | Every paragraph justifies its token cost; zero waste |

Check for these anti-patterns:

- Explaining fundamental programming concepts, file formats, or well-known libraries
- Verbose introductions before getting to instructions
- Redundant restatements of the same information
- Could the same instruction be expressed in a third of the words without loss?

### 4. SKILL.md Length (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | Over 800 lines or under 5 lines (bloated or empty) |
| 4–6 | 500–800 lines (should offload to references) |
| 7–8 | 200–500 lines with appropriate reference linking |
| 9–10 | Under 200 lines, with deep content properly pushed to `references/` |

### 5. Progressive Disclosure (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | Everything crammed into SKILL.md. No reference files. Or deeply nested references (A → B → C → actual info). |
| 4–6 | Some content in separate files, but SKILL.md still bloated or references not clearly linked |
| 7–8 | Good separation. SKILL.md serves as overview/TOC. References loaded on demand. |
| 9–10 | Exemplary: SKILL.md is a lean overview; all deep content in well-named references one level deep. Uses `{baseDir}` for paths. |

Check for:

- Are reference links one level deep from SKILL.md? (Deeply nested = bad)
- Are references organized by topic/workflow stage?
- Does SKILL.md use Read tool references or `{baseDir}` paths (not hardcoded absolute paths)?
- Any use of `@` syntax that force-loads files immediately? (anti-pattern)
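For contrast (file and path names hypothetical), the good and bad reference styles look like:

```markdown
<!-- Good: loaded on demand, relative to the skill directory -->
For OCR details, read {baseDir}/references/ocr-workflow.md.

<!-- Bad: hardcoded absolute path, force-loaded into context immediately -->
@/Users/alice/skills/pdf/references/ocr-workflow.md
```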

### 6. Conditional Routing (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | No branching logic. Single linear path regardless of input type. |
| 4–6 | Some implicit branching or basic if/else guidance |
| 7–8 | Clear decision tree or conditional workflow at the top of the body |
| 9–10 | Comprehensive routing: determines the approach first, then directs Claude to the right workflow path |

**What to look for:** Does the skill start with a determination step? (e.g., "What type of input? → Text-based: follow X. Scanned: follow Y. Form: see Z.")
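A determination step of the kind described above might read (hypothetical skill excerpt):

```markdown
## First, determine the input type

1. Text-based PDF → follow the "Text extraction" workflow below.
2. Scanned PDF → read {baseDir}/references/ocr-workflow.md.
3. Fillable form → read {baseDir}/references/form-fields.md.
```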

### 7. Workflow Structure (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | No workflow. Just information or unstructured instructions. |
| 4–6 | Basic numbered steps but no checklists, no exit criteria |
| 7–8 | Structured steps with clear sequencing and some progress tracking |
| 9–10 | Complete workflow with copy-paste checklists (`- [ ]` items), explicit exit criteria, and clear step boundaries |
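A 9–10 workflow gives Claude a checklist to copy into its working notes, e.g. (hypothetical excerpt; the script and file names are illustrative):

```markdown
Copy this checklist and check off each step as you complete it:

- [ ] Extract raw text from the PDF
- [ ] Run scripts/validate_extraction.py and fix any reported errors
- [ ] Re-run validation until it passes (exit criterion: zero errors)
- [ ] Write the summary to output/summary.md
```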

### 8. Validation / Feedback Loops (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | No validation. No error checking. "Hope it works" approach. |
| 4–6 | Basic "check your work" instruction without a concrete mechanism |
| 7–8 | Explicit validation step (e.g., run a script, check against criteria) |
| 9–10 | Full feedback loop: validate → fix errors → validate again → only proceed when passing. Plan-validate-execute pattern for destructive operations. |
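The plan-validate-execute pattern for a destructive operation might be written as (hypothetical excerpt; the script name is illustrative):

```markdown
1. Generate the migration plan. Do NOT apply it yet.
2. Run scripts/validate_plan.sh and review its output.
3. If validation fails, fix the plan and return to step 2.
4. Only after validation passes, apply the migration.
```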

### 9. Error Handling (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | No error handling mentioned |
| 4–6 | Generic "handle errors appropriately" or partial coverage |
| 7–8 | Specific error cases addressed with solutions |
| 9–10 | Structured error table mapping symptom → cause → fix. Covers common failure modes. This encodes senior-engineer procedural knowledge. |
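A symptom → cause → fix table of the kind that scores 9–10 (contents illustrative):

```markdown
| Symptom | Cause | Fix |
|---------|-------|-----|
| Extracted text is empty | PDF is scanned (image-only) | Switch to the OCR workflow |
| Garbled characters in output | Non-standard font encoding | Re-extract with the fallback encoding option |
```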

### 10. Examples (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | No examples |
| 4–6 | Vague or incomplete examples |
| 7–8 | Concrete examples showing expected behavior |
| 9–10 | Multiple input/output pairs covering different scenarios (like multishot prompting). Shows both typical and edge cases. |

### 11. Freedom Calibration (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | Over-prescriptive for judgment tasks OR under-specified for fragile operations |
| 4–6 | Mostly appropriate but doesn't vary by task fragility |
| 7–8 | Distinguishes between guidance (high freedom) and exact commands (low freedom) |
| 9–10 | Explicitly calibrated: high freedom for context-dependent decisions, medium for parameterized patterns, low/exact for fragile or destructive operations |

**Heuristic:** Fragile operations (DB migrations, deployments, destructive changes) should have exact commands with no modification allowed. Open-ended tasks (code review, analysis) should give directional guidance.
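The two ends of the calibration might look like this in a skill body (hypothetical excerpts; the script name is illustrative):

```markdown
<!-- Low freedom: run exactly as written, do not modify the flags -->
Run: scripts/migrate.sh --dry-run --env staging

<!-- High freedom: directional guidance only -->
Review the diff for naming consistency and missing tests; use judgment.
```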

### 12. Tool Scoping (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | No `allowed-tools` specified (unlimited access), or lists every possible tool |
| 4–6 | Tools specified but broader than necessary |
| 7–8 | Tools scoped to what the skill needs |
| 9–10 | Minimal tool set with wildcard scoping where appropriate (e.g., `Bash(git:*)` instead of all of Bash) |
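Minimal scoping in frontmatter might look like (skill name and description hypothetical; the `Bash(git:*)` wildcard form is taken from the criteria above):

```yaml
---
name: committing-changes
description: Creates well-formed git commits. Use when the user asks to commit, stage, or amend changes.
allowed-tools: Read, Grep, Bash(git:*)
---
```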

### 13. Consistent Terminology (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | Mixed synonyms throughout (e.g., "endpoint" / "URL" / "route" / "path" used interchangeably) |
| 4–6 | Mostly consistent with occasional slips |
| 7–8 | One term per concept throughout |
| 9–10 | Perfectly consistent terminology, with clear definitions where domain terms could be ambiguous |

### 14. Language Style (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | Passive voice, hedging ("You might want to consider..."), or second person throughout |
| 4–6 | Mix of imperative and passive/hedging |
| 7–8 | Mostly imperative ("Run validation", "Extract text") |
| 9–10 | Consistently imperative. Direct commands. No unnecessary hedging. No time-sensitive content (e.g., "before August 2025, use the old API"). |

### 15. Testability (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | No evidence of testing. No way to verify the skill works. |
| 4–6 | Some implicit testability but no explicit test plan or success criteria |
| 7–8 | Clear success criteria or verification steps included |
| 9–10 | Evidence of eval-driven development: the skill was built to address documented failures. Includes verification commands or success criteria. Scripts have error handling with specific messages. |

## Anti-Pattern Checklist

Flag any of these if found. Each is a specific, fixable issue:

- **Kitchen Sink** — SKILL.md tries to cover every possible scenario inline
- **Encyclopedia** — Explains fundamental concepts Claude already knows
- **Force-Load Trap** — Uses `@` syntax to immediately load large files into context
- **Inconsistent Terminology** — Same concept referred to by multiple names
- **Deeply Nested References** — SKILL.md → A.md → B.md → actual info (keep it one level deep)
- **Voodoo Constants** — Configuration values with no justification or comment
- **Time-Sensitive Content** — Instructions that reference specific dates for behavior changes
- **Hardcoded Paths** — Absolute paths instead of `{baseDir}` relative paths
- **Over-Permissioned Tools** — `allowed-tools` includes tools the skill doesn't actually use
- **Missing Description Triggers** — Description says what the skill does but not when to invoke it
- **First/Second-Person Description** — Description uses "I can" or "You can" instead of third person
- **No Validation Step** — Complex multi-step workflow with no verification at any point
- **Silent Error Handling** — Scripts that fail silently or swallow errors without informative messages

## Directory Structure Check

For each skill, verify the directory structure follows conventions.

Expected:

```
my-skill/
├── SKILL.md              ← Required. Entry point.
├── references/           ← Text loaded INTO context (costs tokens)
├── scripts/              ← Code executed via Bash (only output costs tokens)
├── assets/               ← Files referenced by path (zero token cost until used)
└── LICENSE.txt           ← Optional
```

Verify:

- [ ] SKILL.md exists at root
- [ ] SKILL.md has valid YAML frontmatter with `---` delimiters
- [ ] `name` field exists and is valid (lowercase, hyphens, numbers, ≤64 chars)
- [ ] `description` field exists and is non-empty (≤1024 chars)
- [ ] Supporting files are organized logically (references vs scripts vs assets)
- [ ] No deeply nested directory structures
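The frontmatter checks above can be partially automated. A sketch, assuming the frontmatter is plain `key: value` lines (which covers typical SKILL.md files but not multi-line YAML values):

```python
import re

def check_skill_md(text: str) -> list[str]:
    """Return a list of problems found in a SKILL.md's frontmatter."""
    problems = []
    match = re.match(r"^---\n(.*?)\n---\n", text, re.DOTALL)
    if not match:
        return ["missing YAML frontmatter delimited by ---"]
    # Naive key: value parse; multi-line YAML values need a real parser.
    fields = dict(
        line.split(":", 1)
        for line in match.group(1).splitlines()
        if ":" in line
    )
    name = fields.get("name", "").strip()
    if not re.fullmatch(r"[a-z0-9-]{1,64}", name):
        problems.append("name missing or invalid (lowercase, hyphens, <=64 chars)")
    description = fields.get("description", "").strip()
    if not (0 < len(description) <= 1024):
        problems.append("description missing, empty, or over 1024 chars")
    return problems
```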

## Output Format

After auditing all skills, provide:

1. Individual skill reports (format shown above)
2. A project-level summary:

```markdown
## Project Skill Audit Summary

| Skill | Grade | Score | Top Issue |
|-------|-------|-------|-----------|
| ... | ... | ...% | ... |

### Project-Wide Patterns
- [Common issues across multiple skills]
- [Systemic recommendations]

### Priority Fix Order
1. [Highest-impact fix across all skills]
2. [Next highest]
3. ...
```

Based on: Anthropic Skill Authoring Best Practices (platform.claude.com), SkillsBench paper (arxiv.org, Feb 2026), Claude Code Skills documentation (code.claude.com), HumanLayer CLAUDE.md guide (Nov 2025), Obra writing-skills TDD methodology (Dec 2025), Tessl Registry evaluation framework.
