@kbouw · Created February 21, 2026 00:19
# Claude Skill Audit Document

**Purpose:** Give this document to an LLM alongside your project's skill files. It will audit each skill against industry best practices and return a structured quality report with specific, actionable fixes.

**Usage:** Paste this document into a conversation, then say: "Audit all skills in this project against the criteria in this document. For each skill, produce a scored report and specific recommendations."

## Instructions for the Auditing LLM

You are auditing Claude skills (SKILL.md files and their supporting directories) against current industry best practices derived from Anthropic's official documentation, the SkillsBench research paper (Feb 2026), and community-validated patterns.

For each skill you find, produce this report:

```markdown
## Skill: [skill-name]
- **Path:** [file path]
- **Overall Grade:** [Elite / Good / Average / Needs Work] ([score]%)
- **Summary:** [1–2 sentence verdict]

### Dimension Scores
| Dimension | Score | Finding |
|-----------|-------|---------|
| ... | /10 | ... |

### Critical Issues (fix first)
1. ...

### Recommendations (improve next)
1. ...

### What's Working Well
1. ...
```
## Audit Criteria

Score each dimension 0–10. The overall grade is the average across all 15 dimensions, mapped to: Elite (≥ 85%), Good (70–84%), Average (40–69%), Needs Work (< 40%).
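The grade mapping above can be sketched as a small helper (the function name `overall_grade` is illustrative, not part of the audit format):

```python
def overall_grade(dimension_scores):
    """Map 15 per-dimension scores (0-10 each) to a percentage and grade band."""
    percent = sum(dimension_scores) / (len(dimension_scores) * 10) * 100
    if percent >= 85:
        grade = "Elite"
    elif percent >= 70:
        grade = "Good"
    elif percent >= 40:
        grade = "Average"
    else:
        grade = "Needs Work"
    return grade, round(percent)
```

For example, a skill scoring eight on twelve dimensions and five on three averages 74%, landing in the Good band.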

### 1. Name Quality (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | Vague, generic (`helper`, `utils`, `tools`, `documents`), or contains reserved words (`anthropic`, `claude`) |
| 4–6 | Descriptive but inconsistent form, or slightly too broad |
| 7–8 | Clear, specific, uses gerund form (`processing-pdfs`, `analyzing-spreadsheets`) |
| 9–10 | Precise, immediately communicates scope, consistent with other skills in the project |

**Rules:** Must be lowercase letters, numbers, and hyphens only. Max 64 characters. No XML tags.
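The name rules reduce to a single check; a minimal sketch (the function name is illustrative):

```python
import re

RESERVED = ("anthropic", "claude")

def valid_skill_name(name: str) -> bool:
    """Lowercase letters, digits, and hyphens; 1-64 chars; no reserved words."""
    if not re.fullmatch(r"[a-z0-9-]{1,64}", name):
        return False
    return not any(word in name for word in RESERVED)
```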

### 2. Description Quality (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | Vague ("Helps with documents"), missing trigger conditions, or uses first/second person |
| 4–6 | Says what it does but missing trigger conditions OR key vocabulary for discovery |
| 7–8 | Includes what it does + when to use it + key terms. Third person. |
| 9–10 | Comprehensive: what + when + scope boundaries + key vocabulary users would actually say. Third person throughout. |

Critical rules to check:

- MUST be third person. ✅ "Processes Excel files" ❌ "I can help you" ❌ "You can use this to"
- MUST include trigger conditions (ideally starts with or contains "Use when...")
- MUST include key terms/vocabulary that users would use
- Max 1024 characters. Non-empty.
- Should tell Claude enough to select this skill from 100+ installed skills
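As an illustration (the skill name and wording here are hypothetical), frontmatter that would score 9–10 on this rubric might look like:

```yaml
---
name: processing-pdfs
description: >-
  Extracts text, tables, and form fields from PDF files. Use when the user
  asks to read, parse, summarize, or fill in a PDF, or mentions scanned
  documents, OCR, or form extraction. Does not edit PDF layout or graphics.
---
```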

### 3. Token Efficiency (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | Explains concepts Claude already knows (what a PDF is, how libraries work, etc.), verbose padding |
| 4–6 | Some unnecessary explanation but mostly relevant content |
| 7–8 | Concise; only includes context Claude doesn't have |
| 9–10 | Every paragraph justifies its token cost; zero waste |

Check for these anti-patterns:

- Explaining fundamental programming concepts, file formats, or well-known libraries
- Verbose introductions before getting to instructions
- Redundant restatements of the same information
- Could the same instruction be expressed in a third of the words without loss?

### 4. SKILL.md Length (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | Over 800 lines or under 5 lines (bloated or empty) |
| 4–6 | 500–800 lines (should offload to references) |
| 7–8 | 200–500 lines with appropriate reference linking |
| 9–10 | Under 200 lines, with deep content properly pushed to `references/` |

### 5. Progressive Disclosure (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | Everything crammed into SKILL.md. No reference files. Or deeply nested references (A → B → C → actual info). |
| 4–6 | Some content in separate files, but SKILL.md still bloated or references not clearly linked |
| 7–8 | Good separation. SKILL.md serves as overview/TOC. References loaded on demand. |
| 9–10 | Exemplary: SKILL.md is a lean overview; all deep content in well-named references one level deep. Uses `{baseDir}` for paths. |

Check for:

- Are reference links one level deep from SKILL.md? (Deeply nested = bad)
- Are references organized by topic/workflow stage?
- Does SKILL.md use Read tool references or `{baseDir}` paths (not hardcoded absolute paths)?
- Any use of `@` syntax that force-loads files immediately? (anti-pattern)
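For contrast (file and path names hypothetical), the good and bad reference styles look like:

```markdown
<!-- Good: loaded on demand, relative to the skill directory -->
For OCR details, read {baseDir}/references/ocr-workflow.md.

<!-- Bad: hardcoded absolute path, force-loaded into context immediately -->
@/Users/alice/skills/pdf/references/ocr-workflow.md
```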

### 6. Conditional Routing (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | No branching logic. Single linear path regardless of input type. |
| 4–6 | Some implicit branching or basic if/else guidance |
| 7–8 | Clear decision tree or conditional workflow at the top of the body |
| 9–10 | Comprehensive routing: determines the approach first, then directs Claude to the right workflow path |

**What to look for:** Does the skill start with a determination step? (e.g., "What type of input? → Text-based: follow X. Scanned: follow Y. Form: see Z.")
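A determination step of the kind described above might read (hypothetical skill excerpt):

```markdown
## First, determine the input type

1. Text-based PDF → follow the "Text extraction" workflow below.
2. Scanned PDF → read {baseDir}/references/ocr-workflow.md.
3. Fillable form → read {baseDir}/references/form-fields.md.
```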

### 7. Workflow Structure (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | No workflow. Just information or unstructured instructions. |
| 4–6 | Basic numbered steps but no checklists, no exit criteria |
| 7–8 | Structured steps with clear sequencing and some progress tracking |
| 9–10 | Complete workflow with copy-paste checklists (`- [ ]` items), explicit exit criteria, and clear step boundaries |
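A 9–10 workflow gives Claude a checklist to copy into its working notes, e.g. (hypothetical excerpt; the script and file names are illustrative):

```markdown
Copy this checklist and check off each step as you complete it:

- [ ] Extract raw text from the PDF
- [ ] Run scripts/validate_extraction.py and fix any reported errors
- [ ] Re-run validation until it passes (exit criterion: zero errors)
- [ ] Write the summary to output/summary.md
```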

### 8. Validation / Feedback Loops (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | No validation. No error checking. "Hope it works" approach. |
| 4–6 | Basic "check your work" instruction without a concrete mechanism |
| 7–8 | Explicit validation step (e.g., run a script, check against criteria) |
| 9–10 | Full feedback loop: validate → fix errors → validate again → only proceed when passing. Plan-validate-execute pattern for destructive operations. |
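The plan-validate-execute pattern for a destructive operation might be written as (hypothetical excerpt; the script name is illustrative):

```markdown
1. Generate the migration plan. Do NOT apply it yet.
2. Run scripts/validate_plan.sh and review its output.
3. If validation fails, fix the plan and return to step 2.
4. Only after validation passes, apply the migration.
```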

### 9. Error Handling (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | No error handling mentioned |
| 4–6 | Generic "handle errors appropriately" or partial coverage |
| 7–8 | Specific error cases addressed with solutions |
| 9–10 | Structured error table mapping symptom → cause → fix. Covers common failure modes. This encodes senior-engineer procedural knowledge. |
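A symptom → cause → fix table of the kind that scores 9–10 (contents illustrative):

```markdown
| Symptom | Cause | Fix |
|---------|-------|-----|
| Extracted text is empty | PDF is scanned (image-only) | Switch to the OCR workflow |
| Garbled characters in output | Non-standard font encoding | Re-extract with the fallback encoding option |
```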

### 10. Examples (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | No examples |
| 4–6 | Vague or incomplete examples |
| 7–8 | Concrete examples showing expected behavior |
| 9–10 | Multiple input/output pairs covering different scenarios (like multishot prompting). Shows both typical and edge cases. |

### 11. Freedom Calibration (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | Over-prescriptive for judgment tasks OR under-specified for fragile operations |
| 4–6 | Mostly appropriate but doesn't vary by task fragility |
| 7–8 | Distinguishes between guidance (high freedom) and exact commands (low freedom) |
| 9–10 | Explicitly calibrated: high freedom for context-dependent decisions, medium for parameterized patterns, low/exact for fragile or destructive operations |

**Heuristic:** Fragile operations (DB migrations, deployments, destructive changes) should have exact commands with no modification allowed. Open-ended tasks (code review, analysis) should give directional guidance.
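The two ends of the calibration might look like this in a skill body (hypothetical excerpts; the script name is illustrative):

```markdown
<!-- Low freedom: run exactly as written, do not modify the flags -->
Run: scripts/migrate.sh --dry-run --env staging

<!-- High freedom: directional guidance only -->
Review the diff for naming consistency and missing tests; use judgment.
```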

### 12. Tool Scoping (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | No `allowed-tools` specified (unlimited access), or lists every possible tool |
| 4–6 | Tools specified but broader than necessary |
| 7–8 | Tools scoped to what the skill needs |
| 9–10 | Minimal tool set with wildcard scoping where appropriate (e.g., `Bash(git:*)` instead of all of Bash) |
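Minimal scoping in frontmatter might look like (skill name and description hypothetical; the `Bash(git:*)` wildcard form is taken from the criteria above):

```yaml
---
name: committing-changes
description: Creates well-formed git commits. Use when the user asks to commit, stage, or amend changes.
allowed-tools: Read, Grep, Bash(git:*)
---
```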

### 13. Consistent Terminology (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | Mixed synonyms throughout (e.g., "endpoint" / "URL" / "route" / "path" used interchangeably) |
| 4–6 | Mostly consistent with occasional slips |
| 7–8 | One term per concept throughout |
| 9–10 | Perfectly consistent terminology, with clear definitions where domain terms could be ambiguous |

### 14. Language Style (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | Passive voice, hedging ("You might want to consider..."), or second person throughout |
| 4–6 | Mix of imperative and passive/hedging |
| 7–8 | Mostly imperative ("Run validation", "Extract text") |
| 9–10 | Consistently imperative. Direct commands. No unnecessary hedging. No time-sensitive content (e.g., "before August 2025, use the old API"). |

### 15. Testability (0–10)

| Score | Criteria |
|-------|----------|
| 0–3 | No evidence of testing. No way to verify the skill works. |
| 4–6 | Some implicit testability but no explicit test plan or success criteria |
| 7–8 | Clear success criteria or verification steps included |
| 9–10 | Evidence of eval-driven development: the skill was built to address documented failures. Includes verification commands or success criteria. Scripts have error handling with specific messages. |

## Anti-Pattern Checklist

Flag any of these if found. Each is a specific, fixable issue:

- **Kitchen Sink** — SKILL.md tries to cover every possible scenario inline
- **Encyclopedia** — Explains fundamental concepts Claude already knows
- **Force-Load Trap** — Uses `@` syntax to immediately load large files into context
- **Inconsistent Terminology** — Same concept referred to by multiple names
- **Deeply Nested References** — SKILL.md → A.md → B.md → actual info (keep it one level deep)
- **Voodoo Constants** — Configuration values with no justification or comment
- **Time-Sensitive Content** — Instructions that reference specific dates for behavior changes
- **Hardcoded Paths** — Absolute paths instead of `{baseDir}` relative paths
- **Over-Permissioned Tools** — `allowed-tools` includes tools the skill doesn't actually use
- **Missing Description Triggers** — Description says what the skill does but not when to invoke it
- **First/Second-Person Description** — Description uses "I can" or "You can" instead of third person
- **No Validation Step** — Complex multi-step workflow with no verification at any point
- **Silent Error Handling** — Scripts that fail silently or swallow errors without informative messages

## Directory Structure Check

For each skill, verify the directory structure follows conventions.

Expected:

```
my-skill/
├── SKILL.md              ← Required. Entry point.
├── references/           ← Text loaded INTO context (costs tokens)
├── scripts/              ← Code executed via Bash (only output costs tokens)
├── assets/               ← Files referenced by path (zero token cost until used)
└── LICENSE.txt           ← Optional
```

Verify:

- [ ] SKILL.md exists at root
- [ ] SKILL.md has valid YAML frontmatter with `---` delimiters
- [ ] `name` field exists and is valid (lowercase, hyphens, numbers, ≤64 chars)
- [ ] `description` field exists and is non-empty (≤1024 chars)
- [ ] Supporting files are organized logically (references vs scripts vs assets)
- [ ] No deeply nested directory structures
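The frontmatter checks above can be partially automated. A sketch, assuming the frontmatter is plain `key: value` lines (which covers typical SKILL.md files but not multi-line YAML values):

```python
import re

def check_skill_md(text: str) -> list[str]:
    """Return a list of problems found in a SKILL.md's frontmatter."""
    problems = []
    match = re.match(r"^---\n(.*?)\n---\n", text, re.DOTALL)
    if not match:
        return ["missing YAML frontmatter delimited by ---"]
    # Naive key: value parse; multi-line YAML values need a real parser.
    fields = dict(
        line.split(":", 1)
        for line in match.group(1).splitlines()
        if ":" in line
    )
    name = fields.get("name", "").strip()
    if not re.fullmatch(r"[a-z0-9-]{1,64}", name):
        problems.append("name missing or invalid (lowercase, hyphens, <=64 chars)")
    description = fields.get("description", "").strip()
    if not (0 < len(description) <= 1024):
        problems.append("description missing, empty, or over 1024 chars")
    return problems
```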

## Output Format

After auditing all skills, provide:

1. Individual skill reports (format shown above)
2. A project-level summary:

```markdown
## Project Skill Audit Summary

| Skill | Grade | Score | Top Issue |
|-------|-------|-------|-----------|
| ... | ... | ...% | ... |

### Project-Wide Patterns
- [Common issues across multiple skills]
- [Systemic recommendations]

### Priority Fix Order
1. [Highest-impact fix across all skills]
2. [Next highest]
3. ...
```

Based on: Anthropic Skill Authoring Best Practices (platform.claude.com), SkillsBench paper (arxiv.org, Feb 2026), Claude Code Skills documentation (code.claude.com), HumanLayer CLAUDE.md guide (Nov 2025), Obra writing-skills TDD methodology (Dec 2025), Tessl Registry evaluation framework.
