Skill Creator v5.0 — Complete source: agent skill creation, validation, and empirical eval infrastructure for Claude Code (teaching reference)
Production-grade agent skill creation and validation for AI coding agents. By jeremylongshore.com
Now with Anthropic's empirical eval infrastructure: subagent testing, interactive HTML viewer, automated description optimization, benchmark aggregation, blind A/B comparison, and .skill packaging.
skill-creator/
├── SKILL.md
├── agents/
│ ├── analyzer.md
│ ├── comparator.md
│ └── grader.md
├── assets/
│ └── eval_review.html
├── eval-viewer/
│ ├── generate_review.py
│ └── viewer.html
├── references/
│ ├── advanced-eval-workflow.md
│ ├── frontmatter-spec.md
│ ├── output-patterns.md
│ ├── schemas.md
│ ├── validation-rules.md
│ └── workflows.md
├── scripts/
│ ├── __init__.py
│ ├── aggregate_benchmark.py
│ ├── generate_report.py
│ ├── improve_description.py
│ ├── package_skill.py
│ ├── quick_validate.py
│ ├── run_eval.py
│ ├── run_loop.py
│ ├── utils.py
│ └── validate-skill.py
└── templates/
└── skill-template.md
---
name: skill-creator
description: |
Create production-grade agent skills aligned with the 2026 AgentSkills.io spec and Anthropic
best practices. Also validates existing skills against the Intent Solutions 100-point rubric.
Use when building, testing, validating, or optimizing Claude Code skills.
Trigger with "/skill-creator", "create a skill", "validate my skill", or "check skill quality".
Make sure to use this skill whenever creating a new skill, slash command, or agent capability.
allowed-tools: "Read,Write,Edit,Glob,Grep,Bash(mkdir:*),Bash(chmod:*),Bash(python:*),Bash(claude:*),Task,AskUserQuestion"
version: 5.0.0
author: Jeremy Longshore <jeremy@intentsolutions.io>
license: MIT
compatible-with: claude-code, codex, openclaw
tags: [skill-creation, validation, meta-tooling]
model: inherit
---
# Skill Creator
Creates complete, spec-compliant skill packages following AgentSkills.io and Anthropic standards.
Supports both creation and validation workflows with 100-point marketplace grading.
## Overview
Skill Creator solves the gap between writing ad-hoc agent skills and producing marketplace-ready
packages that score well on the Intent Solutions 100-point rubric. It enforces the 2026 spec
(top-level identity fields, `${CLAUDE_SKILL_DIR}` paths, scored sections) and catches
contradictions that would cost marketplace points. Supports two modes: create new skills from
scratch with full validation, or grade/audit existing skills with actionable fix suggestions.
## Table of Contents
- [Instructions](#instructions) — Create mode (Steps 1-10) + Mode Detection
- [Validation Workflow](#validation-workflow) — Grade/audit existing skills (Steps V1-V5)
- [Examples](#examples) — Create, full package, and validate examples
- [Edge Cases](#edge-cases) — Name conflicts, long content, legacy metadata
- [Error Handling](#error-handling) — Common errors and solutions
- [Resources](#resources) — Reference files and scripts
- [Running and Evaluating Test Cases](#running-and-evaluating-test-cases) — Subagent-based eval with viewer
- [Improving the Skill](#improving-the-skill) — Iteration loop based on feedback
- [Description Optimization (Automated)](#description-optimization-automated) — Automated trigger accuracy tuning
- [Advanced: Blind Comparison](#advanced-blind-comparison) — A/B testing between skill versions
- [Packaging](#packaging) — Create distributable .skill files
- [Platform-Specific Notes](#platform-specific-notes) — Claude.ai and Cowork adaptations
## Instructions
### Mode Detection
Determine user intent from their prompt:
- **Create mode**: "create a skill", "build a skill", "new skill" -> proceed to Step 1
- **Validate mode**: "validate", "check", "grade", "score", "audit" -> jump to Validation Workflow
### Step 1: Understand Requirements
Ask the user with AskUserQuestion:
**Skill Identity:**
- Name (kebab-case, gerund preferred: `processing-pdfs`, `analyzing-data`)
- Purpose (1-2 sentences: what it does + when to use it)
**Execution Model:**
- User-invocable via `/name`? Or background knowledge only?
- Accepts arguments? (`$ARGUMENTS` substitution)
- Needs isolated context? (`context: fork` for subagent execution)
- Explicit-only invocation? (`disable-model-invocation: true` — prevents auto-activation, requires `/name`)
**Required Tools:**
- Read, Write, Edit, Glob, Grep, WebFetch, WebSearch, Task, AskUserQuestion, Skill
- Bash must be scoped: `Bash(git:*)`, `Bash(npm:*)`, etc.
- MCP tools: `ServerName:tool_name`
**Complexity:**
- Simple (SKILL.md only)
- With scripts (automation code in `scripts/`)
- With references (documentation in `references/`)
- With templates (boilerplate in `templates/`)
- Full package (all directories)
**Location:**
- Global: `~/.claude/skills/<skill-name>/`
- Project: `.claude/skills/<skill-name>/`
### Step 2: Plan the Skill
Before writing, determine:
**Degrees of Freedom:**
| Level | When to Use |
|-------|-------------|
| High | Creative/open-ended tasks (analysis, writing) |
| Medium | Defined workflow, flexible content (most skills) |
| Low | Strict output format (compliance, API calls, configs) |
**Workflow Pattern** (see `${CLAUDE_SKILL_DIR}/references/workflows.md`):
- Sequential: fixed steps in order
- Conditional: branch based on input
- Wizard: interactive multi-step gathering
- Plan-Validate-Execute: verifiable intermediates
- Feedback Loop: iterate until quality met
- Search-Analyze-Report: explore and summarize
**Output Pattern** (see `${CLAUDE_SKILL_DIR}/references/output-patterns.md`):
- Strict template (exact format)
- Flexible template (structure with creative content)
- Examples-driven (input/output pairs)
- Visual (HTML generation)
- Structured data (JSON/YAML)
### Step 3: Initialize Structure
Create the skill directory and files:
```bash
mkdir -p {location}/{skill-name}
mkdir -p {location}/{skill-name}/scripts # if needed
mkdir -p {location}/{skill-name}/references # if needed
mkdir -p {location}/{skill-name}/templates # if needed
mkdir -p {location}/{skill-name}/assets # if needed
mkdir -p {location}/{skill-name}/evals # for eval-driven development
```
### Step 4: Write SKILL.md
Generate the SKILL.md using the template from `${CLAUDE_SKILL_DIR}/templates/skill-template.md`.
**Frontmatter rules** (see `${CLAUDE_SKILL_DIR}/references/frontmatter-spec.md`):
Required fields:
```yaml
name: {skill-name} # Must match directory name
description: | # Third person, what + when + keywords
{What it does}. Use when {scenario}.
Trigger with "/{skill-name}" or "{natural phrase}".
```
Identity fields (top-level — marketplace validator scores these here):
```yaml
version: 1.0.0
author: {name} <{email}>
license: MIT
```
**IMPORTANT**: `version`, `author`, `license`, `tags`, and `compatible-with` are TOP-LEVEL fields.
Do NOT nest them under `metadata:`. The marketplace 100-point validator checks them at top-level.
Recommended fields:
```yaml
allowed-tools: "{scoped tools}"
model: inherit
```
Optional Claude Code extensions:
```yaml
argument-hint: "[arg]" # If accepts $ARGUMENTS
context: fork # If needs isolated execution
agent: general-purpose # Subagent type (with context: fork)
disable-model-invocation: true # If explicit /name only (no auto-activation)
user-invocable: false # If background knowledge only
compatibility: "Python 3.10+" # If environment-specific
compatible-with: claude-code, codex # Platforms this works on
tags: [devops, ci] # Discovery tags
```
**Description writing — maximize discoverability scoring:**
Descriptions determine activation AND marketplace grade. Include these patterns for maximum points:
```yaml
# Good - scores +6 pts on marketplace grading
description: |
Analyze Python code for security vulnerabilities. Use when reviewing code
before deployment. Trigger with "/security-scan" or "scan for vulnerabilities".
# Bad - loses 6 discoverability points
description: |
Analyzes code for security issues.
```
Pattern: "Use when [scenario]" (+3 pts) + "Trigger with [phrases]" (+3 pts) + "Make sure to use whenever..." for aggressive claiming.
**Token budget awareness:** All installed skill descriptions load at startup (~100 tokens each). The total skill list is capped at ~15,000 characters (`SLASH_COMMAND_TOOL_CHAR_BUDGET`). Keep descriptions impactful but efficient.
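The pattern scoring and character budget above can be sketched as a quick lint pass. This is an illustrative helper, not the actual `validate-skill.py` internals; the function name and heuristics are assumptions, while the +3 point values come from the grading notes above:

```python
import re

def lint_description(description: str) -> dict:
    """Rough discoverability check mirroring the rubric notes above.
    Illustrative only: real grading lives in validate-skill.py."""
    points = 0
    if re.search(r"\bUse when\b", description):
        points += 3  # "Use when [scenario]" pattern
    if re.search(r"\bTrigger with\b", description):
        points += 3  # "Trigger with [phrases]" pattern
    return {
        "pattern_points": points,
        "chars": len(description),  # weigh against the shared ~15,000-char budget
        "third_person": not description.lstrip().lower().startswith(("i ", "you ")),
    }

report = lint_description(
    'Analyze Python code for security vulnerabilities. Use when reviewing code '
    'before deployment. Trigger with "/security-scan".'
)
print(report["pattern_points"])  # 6
```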
**Body content guidelines — marketplace-scored sections:**
Include these sections for maximum marketplace points:
```
## Overview (>50 chars content: +4 pts)
## Prerequisites (+2 pts)
## Instructions (numbered steps: +3 pts)
## Output (+2 pts)
## Error Handling (+2 pts)
## Examples (+2 pts)
## Resources (+1 pt)
5+ sections total: +2 pts bonus
```
Additional guidelines:
- Keep under 500 lines (offload to `references/` if longer)
- Concise — Claude is smart, don't over-explain
- Concrete examples over abstract descriptions
- Use `${CLAUDE_SKILL_DIR}/` for internal file references in the skills you create
- Include edge cases that actually matter
- No time-sensitive information
- Consistent terminology throughout
**String substitutions available:**
- `$ARGUMENTS` / `$0`, `$1` - user-provided arguments
- `${CLAUDE_SESSION_ID}` - current session ID
- `` !`command` `` - dynamic context injection
### Step 5: Create Supporting Files
**Scripts** (`scripts/`):
- Scripts should solve problems, not punt to Claude
- Explicit error handling
- No voodoo constants (document all magic values)
- List required packages
- Make executable: `chmod +x scripts/*.py`
**References** (`references/`):
- Heavy documentation that doesn't need to load at activation
- TOC for files >100 lines
- One-level-deep references only (no `references/sub/dir/`)
**Templates** (`templates/`):
- Boilerplate files used for generation
- Use clear variable syntax (`{{VARIABLE_NAME}}`)
**Assets** (`assets/`):
- Static resources (images, configs, data files)
### Step 6: Validate
Run validation (see `${CLAUDE_SKILL_DIR}/references/validation-rules.md`):
```bash
python3 ${CLAUDE_SKILL_DIR}/scripts/validate-skill.py {skill-dir}/SKILL.md
python3 ${CLAUDE_SKILL_DIR}/scripts/validate-skill.py --grade {skill-dir}/SKILL.md
```
Enterprise tier is default. Use `--standard` for AgentSkills.io minimum only.
**Validation checks:**
- Frontmatter: required fields, types, constraints
- Description: third person, what + when, keywords, length
- Body: under 500 lines, no absolute paths, has instructions + examples
- Tools: valid names, scoped Bash
- Resources: all `${CLAUDE_SKILL_DIR}/` references exist
- Anti-patterns: Windows paths, nested refs, hardcoded model IDs
- Progressive disclosure: appropriate use of references/
**If validation fails:** fix issues and re-run. Common fixes:
- Scope Bash tools: `Bash(git:*)` not `Bash`
- Remove absolute paths, use `${CLAUDE_SKILL_DIR}/`
- Split long SKILL.md into references
- Add missing sections (Overview, Prerequisites, Output)
- Move author/version to top-level if nested in metadata
### Step 7: Test & Evaluate
Create `evals/evals.json` with minimum 3 scenarios: happy path, edge case, negative test.
```json
[
{"name": "basic_usage", "prompt": "Trigger prompt", "assertions": ["Expected behavior"]},
{"name": "edge_case", "prompt": "Edge case prompt", "assertions": ["Expected handling"]},
{"name": "negative_test", "prompt": "Should NOT trigger", "assertions": ["Skill inactive"]}
]
```
Run parallel evaluation: Claude A with skill installed vs Claude B without. Compare outputs against assertions — the skill should produce meaningfully better results for its target use cases.
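A minimal sanity check for the `evals/evals.json` array above (an illustrative helper, not one of the bundled scripts):

```python
import json
from pathlib import Path

def check_evals(path: str) -> list[str]:
    """Sanity-check the evals.json array from Step 7: at least three
    scenarios, each with a name, prompt, and at least one assertion."""
    evals = json.loads(Path(path).read_text())
    problems = []
    if len(evals) < 3:
        problems.append("need at least 3 scenarios (happy path, edge case, negative test)")
    for e in evals:
        for key in ("name", "prompt", "assertions"):
            if not e.get(key):
                problems.append(f"{e.get('name', '?')}: missing {key}")
    return problems
```

An empty return value means the file is structurally sound; the assertions themselves still need the parallel A/B run to be verified.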
### Step 8: Iterate
1. Review which assertions passed/failed
2. Modify SKILL.md instructions, examples, or constraints
3. Re-validate with `validate-skill.py --grade`
4. Re-test evals until all assertions pass
Common fixes: undertriggering -> pushier description, wrong format -> explicit output examples, over-constraining -> increase degrees of freedom.
### Step 9: Optimize Description
Create 20 trigger evaluation queries (10 should-trigger, 10 should-not-trigger). Split into train (14) and test (6) sets. Iterate description until >90% accuracy on both sets.
Tips: front-load distinctive keywords, include specific file types/tools/domains, add "Use when...", "Trigger with...", "Make sure to use whenever..." patterns. Avoid generic terms that overlap with other skills.
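Computing the per-class accuracy target is mechanical. This sketch assumes each eval result records the expected `should_trigger` flag and the observed `triggered` outcome (the field names are assumptions, not a spec):

```python
def trigger_accuracy(results: list[dict]) -> dict:
    """Per-class trigger accuracy; Step 9 targets >90% on both classes."""
    by_class = {True: [], False: []}
    for r in results:
        by_class[r["should_trigger"]].append(r["triggered"] == r["should_trigger"])
    return {
        "should_trigger": sum(by_class[True]) / len(by_class[True]),
        "should_not_trigger": sum(by_class[False]) / len(by_class[False]),
    }
```

Tracking the two classes separately matters: a pushy description can hit 100% on should-trigger queries while silently failing the should-not-trigger half.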
### Step 10: Report
Show the user:
```
SKILL CREATED
====================================
Location: {full path}
Files:
SKILL.md ({lines} lines)
scripts/{files}
references/{files}
templates/{files}
evals/evals.json
Validation: Enterprise tier
Errors: {count}
Warnings: {count}
Disclosure Score: {score}/6
Grade: {letter} ({points}/100)
Eval Results:
Scenarios: {count}
Passed: {count}/{count}
Description Accuracy: {percentage}%
Usage:
/{skill-name} {argument-hint}
or: "{natural language trigger}"
====================================
```
## Validation Workflow
When the user wants to validate, grade, or audit an existing skill:
### Step V1: Locate the Skill
Ask for the SKILL.md path or detect from context. Common locations:
- `~/.claude/skills/<skill-name>/SKILL.md` (global)
- `.claude/skills/<skill-name>/SKILL.md` (project)
### Step V2: Run Validator
```bash
python3 ${CLAUDE_SKILL_DIR}/scripts/validate-skill.py --grade {path}/SKILL.md
```
### Step V3: Review Grade
100-point rubric across 5 pillars:
| Pillar | Max | What It Measures |
|--------|-----|------------------|
| Progressive Disclosure | 30 | Token economy, layered structure, navigation |
| Ease of Use | 25 | Metadata, discoverability, workflow clarity |
| Utility | 20 | Problem solving, examples, feedback loops |
| Spec Compliance | 15 | Frontmatter, naming, description quality |
| Writing Style | 10 | Voice, objectivity, conciseness |
| Modifiers | +/-5 | Bonuses/penalties for patterns |
Grade scale: A (90+), B (80-89), C (70-79), D (60-69), F (<60)
See `${CLAUDE_SKILL_DIR}/references/validation-rules.md` for detailed sub-criteria.
### Step V4: Report Results
Present the grade report with specific fix recommendations. Prioritize fixes by point value (highest first).
### Step V5: Auto-Fix (if requested)
If the user says "fix it" or "auto-fix", apply the suggested improvements:
1. Add missing sections (Overview, Prerequisites, Output)
2. Add "Use when" / "Trigger with" to description
3. Move author/version from metadata to top-level
4. Fix path variables to `${CLAUDE_SKILL_DIR}/`
5. Re-run grading to confirm improvement
## Examples
### Simple Skill (Create Mode)
```
User: Create a skill called "code-review" that reviews code quality
Creates:
~/.claude/skills/code-review/
├── SKILL.md
└── evals/
└── evals.json
Frontmatter:
---
name: code-review
description: |
Make sure to use this skill whenever reviewing code for quality, security
vulnerabilities, and best practices. Use when doing code reviews, PR analysis,
or checking code quality. Trigger with "/code-review" or "review this code".
allowed-tools: "Read,Glob,Grep"
version: 1.0.0
author: Jeremy Longshore <jeremy@intentsolutions.io>
license: MIT
model: inherit
---
```
### Full Package with Arguments (Create Mode)
```
User: Create a skill that generates release notes from git history
Creates:
~/.claude/skills/generating-release-notes/
├── SKILL.md (argument-hint: "[version-tag]")
├── scripts/
│ └── parse-commits.py
├── references/
│ └── commit-conventions.md
├── templates/
│ └── release-template.md
└── evals/
└── evals.json
Uses $ARGUMENTS[0] for version tag.
Uses context: fork for isolated execution.
```
### Validate Mode
```
User: Grade my skill at ~/.claude/skills/code-review/SKILL.md
Runs: python3 ${CLAUDE_SKILL_DIR}/scripts/validate-skill.py --grade ~/.claude/skills/code-review/SKILL.md
Output:
Grade: B (84/100)
Improvements:
- Add "Trigger with" to description (+3 pts)
- Add ## Output section (+2 pts)
- Add ## Prerequisites section (+2 pts)
```
## Edge Cases
- **Name conflicts**: Check if skill directory already exists before creating
- **Empty arguments**: If skill uses `$ARGUMENTS`, handle the empty case
- **Long content**: If SKILL.md exceeds 300 lines during writing, stop and split to references
- **Bash scoping**: If user requests raw `Bash`, always scope it
- **Model selection**: Default to `inherit`, only override with good reason
- **Undertriggering**: If skill isn't activating, make description more aggressive/pushy
- **Legacy metadata nesting**: If found, move author/version/license to top-level
## Error Handling
| Error | Cause | Solution |
|-------|-------|----------|
| Name exists | Directory already present | Choose different name or confirm overwrite |
| Invalid name | Not kebab-case or >64 chars | Fix to lowercase-with-hyphens |
| Validation fails | Missing fields or anti-patterns | Run validator, fix reported issues |
| Resource missing | `${CLAUDE_SKILL_DIR}/` ref points to nonexistent file | Create the file or fix the reference |
| Undertriggering | Description too passive | Add "Make sure to use whenever..." phrasing |
| Eval failures | Skill not producing expected output | Iterate on instructions and re-test |
| Low grade | Missing scored sections or fields | Add Overview, Prerequisites, Output sections |
## Resources
**References:** `${CLAUDE_SKILL_DIR}/references/`
- `source-of-truth.md` — Canonical spec | `frontmatter-spec.md` — Field reference | `validation-rules.md` — 100-point rubric
- `workflows.md` — Workflow patterns | `output-patterns.md` — Output formats | `schemas.md` — JSON schemas (evals, grading, benchmarks)
- `anthropic-comparison.md` — Gap analysis | `advanced-eval-workflow.md` — Eval, iteration, optimization, platform notes
**Agents** (read when spawning subagents): `${CLAUDE_SKILL_DIR}/agents/`
- `grader.md` — Assertion evaluation | `comparator.md` — Blind A/B comparison | `analyzer.md` — Benchmark analysis
**Scripts:** `${CLAUDE_SKILL_DIR}/scripts/`
- `validate-skill.py` — 100-point rubric grading | `quick_validate.py` — Lightweight validation
- `aggregate_benchmark.py` — Benchmark stats | `run_eval.py` — Trigger accuracy testing
- `run_loop.py` — Description optimization loop | `improve_description.py` — LLM-powered rewriting
- `generate_report.py` — HTML reports | `package_skill.py` — .skill packaging | `utils.py` — Shared utilities
**Eval Viewer:** `${CLAUDE_SKILL_DIR}/eval-viewer/` — `generate_review.py` + `viewer.html` (interactive output comparison)
**Assets:** `${CLAUDE_SKILL_DIR}/assets/eval_review.html` (trigger eval set editor)
**Templates:** `${CLAUDE_SKILL_DIR}/templates/skill-template.md` (SKILL.md skeleton)
---
## Running and Evaluating Test Cases
For detailed empirical eval workflow (Steps E1-E5), read `${CLAUDE_SKILL_DIR}/references/advanced-eval-workflow.md`.
**Quick summary:** Spawn with-skill and baseline subagents in parallel -> draft assertions while running -> capture timing data from task notifications -> grade with `${CLAUDE_SKILL_DIR}/agents/grader.md` -> aggregate with `scripts/aggregate_benchmark.py` -> launch `eval-viewer/generate_review.py` for interactive human review -> read `feedback.json`.
---
## Improving the Skill
For iteration loop details, read `${CLAUDE_SKILL_DIR}/references/advanced-eval-workflow.md` (section "Improving the Skill").
**Key principles:** Generalize from feedback (don't overfit), keep prompts lean, explain the *why* behind rules (not just prescriptions), and bundle repeated helper scripts.
---
## Description Optimization (Automated)
For the full pipeline (Steps D1-D4), read `${CLAUDE_SKILL_DIR}/references/advanced-eval-workflow.md` (section "Description Optimization"). Quick summary: generate 20 realistic trigger eval queries -> review with user via `${CLAUDE_SKILL_DIR}/assets/eval_review.html` -> run `python -m scripts.run_loop` (60/40 train/test, 3 runs/query, up to 5 iterations) -> apply `best_description`.
## Advanced: Blind Comparison
For A/B testing between skill versions, read `${CLAUDE_SKILL_DIR}/agents/comparator.md` and `${CLAUDE_SKILL_DIR}/agents/analyzer.md`. Optional; most users won't need it.
## Packaging
`python -m scripts.package_skill <path/to/skill-folder> [output-directory]` — Creates distributable `.skill` zip after validation.
## Platform-Specific Notes
See `${CLAUDE_SKILL_DIR}/references/advanced-eval-workflow.md` (section "Platform-Specific Notes").
- **Claude.ai**: No subagents — run tests yourself, skip benchmarking/description optimization.
- **Cowork**: Full subagent workflow. Use `--static` for eval viewer. Generate viewer BEFORE self-evaluation.

---

# Advanced Eval Workflow
Detailed instructions for empirical skill evaluation, iteration, description optimization,
blind comparison, packaging, and platform-specific adaptations.
Read this reference when using the eval infrastructure (Steps E1-E5), improving skills
through iteration, running automated description optimization, or working on Claude.ai/Cowork.
---
## Running and Evaluating Test Cases
This extends Steps 7-8 with concrete tooling for empirical evaluation. The core idea: run
the skill on real prompts, compare against a baseline, grade the results, and show them to
the user in an interactive viewer.
Put results in `<skill-name>-workspace/` as a sibling to the skill directory. Within the
workspace, organize by iteration (`iteration-1/`, `iteration-2/`, etc.) and within that,
each test case gets a directory (`eval-0/`, `eval-1/`, etc.). Create directories as you go.
### Step E1: Spawn all runs (with-skill AND baseline) in the same turn
For each test case, spawn two subagents in the same turn — one with the skill, one without.
Launch everything at once so runs finish around the same time.
**With-skill run:**
```
Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
- Outputs to save: <what the user cares about>
```
**Baseline run** (same prompt, no skill):
- **Creating a new skill**: no skill at all. Save to `without_skill/outputs/`.
- **Improving an existing skill**: snapshot the old version first (`cp -r`), point baseline
at the snapshot. Save to `old_skill/outputs/`.
Write an `eval_metadata.json` for each test case:
```json
{
"eval_id": 0,
"eval_name": "descriptive-name-here",
"prompt": "The user's task prompt",
"assertions": []
}
```
### Step E2: While runs are in progress, draft assertions
Don't wait idle. Draft quantitative assertions for each test case and explain them to the
user. Good assertions are objectively verifiable and have descriptive names. Subjective
skills (writing style, design) are better evaluated qualitatively — don't force assertions
onto things that need human judgment.
Update `eval_metadata.json` files and `evals/evals.json` with the assertions. See
`${CLAUDE_SKILL_DIR}/references/schemas.md` for the full schema.
### Step E3: As runs complete, capture timing data
When each subagent task completes, the notification contains `total_tokens` and
`duration_ms`. Save immediately to `timing.json` — this is the only opportunity to
capture this data:
```json
{
"total_tokens": 84852,
"duration_ms": 23332,
"total_duration_seconds": 23.3
}
```
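A small helper for persisting those notification fields the moment they arrive (illustrative; `total_duration_seconds` is simply derived from `duration_ms`):

```python
import json
from pathlib import Path

def save_timing(run_dir: str, total_tokens: int, duration_ms: int) -> None:
    """Write timing.json from the subagent completion notification.
    These fields are not retrievable later, so save them immediately."""
    Path(run_dir, "timing.json").write_text(json.dumps({
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }, indent=2))
```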
### Step E4: Grade, aggregate, and launch the viewer
Once all runs are done:
1. **Grade each run** — spawn a grader subagent that reads
`${CLAUDE_SKILL_DIR}/agents/grader.md` and evaluates each assertion against the outputs.
Save results to `grading.json` in each run directory. The grading.json expectations array
must use fields `text`, `passed`, and `evidence` — the viewer depends on these exact field
names. For programmatically checkable assertions, write and run a script rather than
eyeballing it.
2. **Aggregate into benchmark**:
```bash
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
```
This produces `benchmark.json` and `benchmark.md` with pass_rate, time, and tokens for
each configuration, with mean +/- stddev and the delta. If generating benchmark.json
manually, see `${CLAUDE_SKILL_DIR}/references/schemas.md` for the exact schema the
viewer expects.
3. **Analyst pass** — read the benchmark data and surface patterns. See
`${CLAUDE_SKILL_DIR}/agents/analyzer.md` (the "Analyzing Benchmark Results" section) for
what to look for — non-discriminating assertions, high-variance evals, time/token tradeoffs.
4. **Launch the viewer**:
```bash
nohup python ${CLAUDE_SKILL_DIR}/eval-viewer/generate_review.py \
<workspace>/iteration-N \
--skill-name "my-skill" \
--benchmark <workspace>/iteration-N/benchmark.json \
> /dev/null 2>&1 &
VIEWER_PID=$!
```
For iteration 2+, also pass `--previous-workspace <workspace>/iteration-<N-1>`.
**Headless/Cowork:** Use `--static <output_path>` to write standalone HTML instead of
starting a server.
5. **Tell the user** the viewer is open. They'll see an "Outputs" tab (per-test-case
feedback) and a "Benchmark" tab (quantitative comparison).
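If you ever need to reproduce the aggregation manually, the per-configuration mean +/- stddev summary amounts to roughly this sketch (field names here are assumptions; the authoritative schema is in `${CLAUDE_SKILL_DIR}/references/schemas.md`):

```python
from statistics import mean, stdev

def summarize(runs: list[dict]) -> dict:
    """Aggregate per-run stats into the mean +/- stddev summary that
    aggregate_benchmark.py reports for each configuration."""
    def stats(key: str) -> dict:
        vals = [r[key] for r in runs]
        return {"mean": mean(vals), "stddev": stdev(vals) if len(vals) > 1 else 0.0}
    return {
        "pass_rate": stats("pass_rate"),
        "total_tokens": stats("total_tokens"),
        "duration_ms": stats("duration_ms"),
    }
```

The delta between the with-skill and baseline configurations is then just a subtraction of the two means per metric.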
### What the user sees in the viewer
The "Outputs" tab shows one test case at a time:
- **Prompt**: the task that was given
- **Output**: the files the skill produced, rendered inline where possible
- **Previous Output** (iteration 2+): collapsed section showing last iteration's output
- **Formal Grades** (if grading was run): collapsed section showing assertion pass/fail
- **Feedback**: a textbox that auto-saves as they type
- **Previous Feedback** (iteration 2+): their comments from last time
The "Benchmark" tab shows the stats summary: pass rates, timing, and token usage for each
configuration, with per-eval breakdowns and analyst observations.
Navigation is via prev/next buttons or arrow keys. When done, "Submit All Reviews" saves
all feedback to `feedback.json`.
### Step E5: Read the feedback
When the user is done, read `feedback.json`:
```json
{
"reviews": [
{"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels", "timestamp": "..."},
{"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}
],
"status": "complete"
}
```
Empty feedback means the user thought it was fine. Focus improvements on test cases with
specific complaints. Kill the viewer server when done: `kill $VIEWER_PID 2>/dev/null`.
---
## Improving the Skill
After running test cases and collecting feedback, improve the skill based on what you learned.
### Key principles
1. **Generalize from feedback.** Skills may be invoked millions of times across many prompts.
Don't overfit to the few test examples — generalize from failures to broader categories of
user intent. Rather than fiddly overfitty changes or oppressive MUSTs, try branching out
with different metaphors or patterns. It's cheap to experiment.
2. **Keep the prompt lean.** Remove what isn't pulling its weight. Read transcripts, not just
outputs — if the skill makes the model waste time on unproductive steps, remove those
instructions.
3. **Explain the why.** Try to explain *why* things matter rather than just prescribing rules.
Today's LLMs are smart — they have good theory of mind and when given a good harness can
go beyond rote instructions. If you find yourself writing ALWAYS or NEVER in all caps,
reframe and explain the reasoning. That's more humane, powerful, and effective.
4. **Look for repeated work.** If all test runs independently wrote similar helper scripts,
that's a signal the skill should bundle that script in `scripts/`. Write it once and tell
the skill to use it.
### The iteration loop
After improving the skill:
1. Apply improvements to the skill
2. Rerun all test cases into a new `iteration-<N+1>/` directory, including baselines
3. Launch the reviewer with `--previous-workspace` pointing at the previous iteration
4. Wait for the user to review and tell you they're done
5. Read new feedback, improve again, repeat
Keep going until:
- The user says they're happy
- The feedback is all empty (everything looks good)
- You're not making meaningful progress
---
## Description Optimization (Automated)
The description field is the primary mechanism determining whether Claude invokes a skill.
After creating or improving a skill, offer to optimize the description for better triggering
accuracy.
### Step D1: Generate trigger eval queries
Create 20 eval queries — a mix of should-trigger and should-not-trigger. Save as JSON:
```json
[
{"query": "the user prompt", "should_trigger": true},
{"query": "another prompt", "should_trigger": false}
]
```
Queries must be realistic — concrete, specific, with details like file paths, personal
context, column names. Include casual speech, typos, abbreviations. Focus on edge cases
rather than clear-cut examples.
For **should-trigger** (8-10): different phrasings of the same intent, cases where the user
doesn't name the skill but clearly needs it, uncommon use cases, cases where this skill
competes with another but should win.
For **should-not-trigger** (8-10): near-misses — queries sharing keywords but needing
something different. Adjacent domains, ambiguous phrasing where naive keyword match would
trigger but shouldn't. Don't make negatives obviously irrelevant.
Bad: `"Format this data"`, `"Extract text from PDF"`, `"Create a chart"`
Good: `"ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx') and she wants me to add a column that shows the profit margin"`
### Step D2: Review with user
Present the eval set using the HTML template:
1. Read `${CLAUDE_SKILL_DIR}/assets/eval_review.html`
2. Replace placeholders: `__EVAL_DATA_PLACEHOLDER__` (JSON array, no quotes — it's a JS
variable assignment), `__SKILL_NAME_PLACEHOLDER__`, `__SKILL_DESCRIPTION_PLACEHOLDER__`
3. Write to `/tmp/eval_review_<skill-name>.html` and open it
4. User edits queries, toggles triggers, clicks "Export Eval Set"
5. Read the exported file from `~/Downloads/eval_set.json` (check for `eval_set (1).json` etc.)
### Step D3: Run the optimization loop
Tell the user this will take some time. Save the eval set to the workspace, then run:
```bash
python -m scripts.run_loop \
--eval-set <path-to-trigger-eval.json> \
--skill-path <path-to-skill> \
--model <model-id-powering-this-session> \
--max-iterations 5 \
--verbose
```
Use the model ID from your system prompt so the triggering test matches what the user
actually experiences.
This splits 60% train / 40% test, evaluates the current description (3 runs per query for
reliability), proposes improvements based on failures, and iterates up to 5 times. Returns
JSON with `best_description` selected by test score to avoid overfitting.
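Conceptually, the 60/40 split stratifies by expected outcome so both halves contain positives and negatives; a sketch of that idea (the actual `run_loop.py` implementation may differ):

```python
import random

def split_eval_set(queries: list[dict], train_frac: float = 0.6, seed: int = 0):
    """Stratified train/test split: shuffle within each should_trigger
    class so both halves keep a mix of positives and negatives."""
    rng = random.Random(seed)
    train, test = [], []
    for flag in (True, False):
        group = [q for q in queries if q["should_trigger"] is flag]
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        train += group[:cut]
        test += group[cut:]
    return train, test
```

Selecting `best_description` by test score rather than train score is what guards against overfitting the description to the training queries.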
### How skill triggering works
Skills appear in Claude's `available_skills` list with their name + description. Claude
decides whether to consult a skill based on that description. Important: Claude only consults
skills for tasks it can't easily handle alone — simple one-step queries may not trigger even
if the description matches. Complex, multi-step, or specialized queries reliably trigger when
the description matches.
Eval queries should be substantive enough that Claude would benefit from consulting a skill.
### Step D4: Apply the result
Take `best_description` from the JSON output and update the skill's SKILL.md frontmatter.
Show the user before/after and report the scores.
---
## Advanced: Blind Comparison
For rigorous comparison between two skill versions, use the blind comparison system. Read
`${CLAUDE_SKILL_DIR}/agents/comparator.md` and `${CLAUDE_SKILL_DIR}/agents/analyzer.md` for
details.
The basic idea: give two outputs to an independent agent without telling it which is which,
and let it judge quality. Then analyze why the winner won.
This is optional, requires subagents, and most users won't need it. The human review loop
is usually sufficient.
---
## Packaging
Package a skill into a distributable `.skill` file:
```bash
python -m scripts.package_skill <path/to/skill-folder> [output-directory]
```
This validates the skill first, then creates a zip archive excluding `__pycache__`,
`node_modules`, `evals/`, and `.DS_Store`.
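The archive step, minus the validation pass, amounts to roughly this (a sketch, not the actual `package_skill.py` source):

```python
import zipfile
from pathlib import Path

EXCLUDE = {"__pycache__", "node_modules", "evals", ".DS_Store"}

def package_skill(skill_dir: str, out_dir: str = ".") -> Path:
    """Zip a skill folder into a .skill archive, skipping excluded entries.
    The real script validates the skill before archiving."""
    root = Path(skill_dir)
    out = Path(out_dir) / f"{root.name}.skill"
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in root.rglob("*"):
            # skip any file whose relative path contains an excluded part
            if f.is_file() and not (EXCLUDE & set(f.relative_to(root).parts)):
                zf.write(f, f.relative_to(root))
    return out
```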
Only package if `present_files` tool is available or the user requests it. After packaging,
direct the user to the resulting `.skill` file path so they can install it.
---
## Platform-Specific Notes
### Claude.ai
The core workflow (draft -> test -> review -> improve) is the same, but without subagents:
- **Test cases**: Run them yourself one at a time. Skip baseline runs.
- **Review**: Present results directly in conversation. Save output files and tell the user
where they are.
- **Benchmarking**: Skip quantitative benchmarking — it relies on baselines.
- **Description optimization**: Requires `claude` CLI — skip on Claude.ai.
- **Packaging**: `package_skill.py` works anywhere with Python.
- **Updating existing skills**: Preserve the original name. Copy to `/tmp/` before editing
(installed path may be read-only). If packaging manually, stage in `/tmp/` first.
### Cowork
- You have subagents — the full workflow works, though you may need to run test prompts in
series if timeouts occur.
- No browser: use `--static <output_path>` for the eval viewer and share the HTML file link.
- Always generate the eval viewer BEFORE evaluating inputs yourself — get results in front
of the human ASAP.
- Feedback: "Submit All Reviews" downloads `feedback.json` as a file.
- Description optimization (`run_loop.py`) works since it uses `claude -p` via subprocess.
- Save description optimization until the skill is fully finished and the user agrees it's
in good shape.
# JSON Schemas
This document defines the JSON schemas used by skill-creator.
---
## evals.json
Defines the evals for a skill. Located at `evals/evals.json` within the skill directory.
```json
{
"skill_name": "example-skill",
"evals": [
{
"id": 1,
"prompt": "User's example prompt",
"expected_output": "Description of expected result",
"files": ["evals/files/sample1.pdf"],
"expectations": [
"The output includes X",
"The skill used script Y"
]
}
]
}
```
**Fields:**
- `skill_name`: Name matching the skill's frontmatter
- `evals[].id`: Unique integer identifier
- `evals[].prompt`: The task to execute
- `evals[].expected_output`: Human-readable description of success
- `evals[].files`: Optional list of input file paths (relative to skill root)
- `evals[].expectations`: List of verifiable statements
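A minimal structural check for an `evals.json` document might look like this; the field rules follow the list above, and the function is illustrative rather than part of the shipped scripts:

```python
def validate_evals(doc):
    """Return a list of problems found in an evals.json document (sketch)."""
    problems = []
    if not isinstance(doc.get("skill_name"), str):
        problems.append("skill_name must be a string")
    seen_ids = set()
    for i, ev in enumerate(doc.get("evals", [])):
        # required per-eval fields
        for field in ("id", "prompt", "expected_output", "expectations"):
            if field not in ev:
                problems.append(f"evals[{i}] missing {field}")
        # ids must be unique
        if ev.get("id") in seen_ids:
            problems.append(f"evals[{i}] duplicate id {ev['id']}")
        seen_ids.add(ev.get("id"))
        # an eval with no expectations cannot be graded
        if not ev.get("expectations"):
            problems.append(f"evals[{i}] has no expectations")
    return problems
```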
---
## history.json
Tracks version progression in Improve mode. Located at workspace root.
```json
{
"started_at": "2026-01-15T10:30:00Z",
"skill_name": "pdf",
"current_best": "v2",
"iterations": [
{
"version": "v0",
"parent": null,
"expectation_pass_rate": 0.65,
"grading_result": "baseline",
"is_current_best": false
},
{
"version": "v1",
"parent": "v0",
"expectation_pass_rate": 0.75,
"grading_result": "won",
"is_current_best": false
},
{
"version": "v2",
"parent": "v1",
"expectation_pass_rate": 0.85,
"grading_result": "won",
"is_current_best": true
}
]
}
```
**Fields:**
- `started_at`: ISO timestamp of when improvement started
- `skill_name`: Name of the skill being improved
- `current_best`: Version identifier of the best performer
- `iterations[].version`: Version identifier (v0, v1, ...)
- `iterations[].parent`: Parent version this was derived from
- `iterations[].expectation_pass_rate`: Pass rate from grading
- `iterations[].grading_result`: "baseline", "won", "lost", or "tie"
- `iterations[].is_current_best`: Whether this is the current best version
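A small helper can cross-check the top-level `current_best` pointer against the `is_current_best` flags; this is a sketch, not part of the shipped scripts:

```python
def current_best(history):
    """Return the iteration marked as current best, cross-checked
    against the top-level `current_best` pointer (sketch)."""
    best = next(it for it in history["iterations"] if it["is_current_best"])
    # the pointer and the flag must agree, or the file is corrupt
    assert best["version"] == history["current_best"], "current_best pointer out of sync"
    return best
```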
---
## grading.json
Output from the grader agent. Located at `<run-dir>/grading.json`.
```json
{
"expectations": [
{
"text": "The output includes the name 'John Smith'",
"passed": true,
"evidence": "Found in transcript Step 3: 'Extracted names: John Smith, Sarah Johnson'"
},
{
"text": "The spreadsheet has a SUM formula in cell B10",
"passed": false,
"evidence": "No spreadsheet was created. The output was a text file."
}
],
"summary": {
"passed": 2,
"failed": 1,
"total": 3,
"pass_rate": 0.67
},
"execution_metrics": {
"tool_calls": {
"Read": 5,
"Write": 2,
"Bash": 8
},
"total_tool_calls": 15,
"total_steps": 6,
"errors_encountered": 0,
"output_chars": 12450,
"transcript_chars": 3200
},
"timing": {
"executor_duration_seconds": 165.0,
"grader_duration_seconds": 26.0,
"total_duration_seconds": 191.0
},
"claims": [
{
"claim": "The form has 12 fillable fields",
"type": "factual",
"verified": true,
"evidence": "Counted 12 fields in field_info.json"
}
],
"user_notes_summary": {
"uncertainties": ["Used 2023 data, may be stale"],
"needs_review": [],
"workarounds": ["Fell back to text overlay for non-fillable fields"]
},
"eval_feedback": {
"suggestions": [
{
"assertion": "The output includes the name 'John Smith'",
"reason": "A hallucinated document that mentions the name would also pass"
}
],
"overall": "Assertions check presence but not correctness."
}
}
```
**Fields:**
- `expectations[]`: Graded expectations with evidence
- `summary`: Aggregate pass/fail counts
- `execution_metrics`: Tool usage and output size (from executor's metrics.json)
- `timing`: Wall clock timing (from timing.json)
- `claims`: Extracted and verified claims from the output
- `user_notes_summary`: Issues flagged by the executor
- `eval_feedback`: (optional) Improvement suggestions for the evals, only present when the grader identifies issues worth raising
---
## metrics.json
Output from the executor agent. Located at `<run-dir>/outputs/metrics.json`.
```json
{
"tool_calls": {
"Read": 5,
"Write": 2,
"Bash": 8,
"Edit": 1,
"Glob": 2,
"Grep": 0
},
"total_tool_calls": 18,
"total_steps": 6,
"files_created": ["filled_form.pdf", "field_values.json"],
"errors_encountered": 0,
"output_chars": 12450,
"transcript_chars": 3200
}
```
**Fields:**
- `tool_calls`: Count per tool type
- `total_tool_calls`: Sum of all tool calls
- `total_steps`: Number of major execution steps
- `files_created`: List of output files created
- `errors_encountered`: Number of errors during execution
- `output_chars`: Total character count of output files
- `transcript_chars`: Character count of transcript
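These fields carry a simple invariant worth checking when generating the file by hand: `total_tool_calls` must equal the sum of the per-tool counts. A sketch:

```python
def check_metrics(metrics):
    """Verify internal consistency of a metrics.json document (sketch)."""
    # the aggregate must match the per-tool breakdown
    assert metrics["total_tool_calls"] == sum(metrics["tool_calls"].values()), \
        "total_tool_calls must equal the sum of per-tool counts"
    assert metrics["errors_encountered"] >= 0
    return True
```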
---
## timing.json
Wall clock timing for a run. Located at `<run-dir>/timing.json`.
**How to capture:** When a subagent task completes, the task notification includes `total_tokens` and `duration_ms`. Save these immediately — they are not persisted anywhere else and cannot be recovered after the fact.
```json
{
"total_tokens": 84852,
"duration_ms": 23332,
"total_duration_seconds": 23.3,
"executor_start": "2026-01-15T10:30:00Z",
"executor_end": "2026-01-15T10:32:45Z",
"executor_duration_seconds": 165.0,
"grader_start": "2026-01-15T10:32:46Z",
"grader_end": "2026-01-15T10:33:12Z",
"grader_duration_seconds": 26.0
}
```
---
## benchmark.json
Output from Benchmark mode. Located at `benchmarks/<timestamp>/benchmark.json`.
```json
{
"metadata": {
"skill_name": "pdf",
"skill_path": "/path/to/pdf",
"executor_model": "claude-sonnet-4-20250514",
"analyzer_model": "most-capable-model",
"timestamp": "2026-01-15T10:30:00Z",
"evals_run": [1, 2, 3],
"runs_per_configuration": 3
},
"runs": [
{
"eval_id": 1,
"eval_name": "Ocean",
"configuration": "with_skill",
"run_number": 1,
"result": {
"pass_rate": 0.85,
"passed": 6,
"failed": 1,
"total": 7,
"time_seconds": 42.5,
"tokens": 3800,
"tool_calls": 18,
"errors": 0
},
"expectations": [
{"text": "...", "passed": true, "evidence": "..."}
],
"notes": [
"Used 2023 data, may be stale",
"Fell back to text overlay for non-fillable fields"
]
}
],
"run_summary": {
"with_skill": {
"pass_rate": {"mean": 0.85, "stddev": 0.05, "min": 0.80, "max": 0.90},
"time_seconds": {"mean": 45.0, "stddev": 12.0, "min": 32.0, "max": 58.0},
"tokens": {"mean": 3800, "stddev": 400, "min": 3200, "max": 4100}
},
"without_skill": {
"pass_rate": {"mean": 0.35, "stddev": 0.08, "min": 0.28, "max": 0.45},
"time_seconds": {"mean": 32.0, "stddev": 8.0, "min": 24.0, "max": 42.0},
"tokens": {"mean": 2100, "stddev": 300, "min": 1800, "max": 2500}
},
"delta": {
"pass_rate": "+0.50",
"time_seconds": "+13.0",
"tokens": "+1700"
}
},
"notes": [
"Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value",
"Eval 3 shows high variance (50% +/- 40%) - may be flaky or model-dependent",
"Without-skill runs consistently fail on table extraction expectations",
"Skill adds 13s average execution time but improves pass rate by 50%"
]
}
```
**Fields:**
- `metadata`: Information about the benchmark run
- `skill_name`: Name of the skill
- `timestamp`: When the benchmark was run
- `evals_run`: List of eval names or IDs
- `runs_per_configuration`: Number of runs per config (e.g. 3)
- `runs[]`: Individual run results
- `eval_id`: Numeric eval identifier
- `eval_name`: Human-readable eval name (used as section header in the viewer)
- `configuration`: Must be `"with_skill"` or `"without_skill"` (the viewer uses this exact string for grouping and color coding)
- `run_number`: Integer run number (1, 2, 3...)
- `result`: Nested object with `pass_rate`, `passed`, `failed`, `total`, `time_seconds`, `tokens`, `tool_calls`, `errors`
- `run_summary`: Statistical aggregates per configuration
- `with_skill` / `without_skill`: Each contains `pass_rate`, `time_seconds`, `tokens` objects with `mean` and `stddev` fields
- `delta`: Difference strings like `"+0.50"`, `"+13.0"`, `"+1700"`
- `notes`: Freeform observations from the analyzer
**Important:** The viewer reads these field names exactly. Using `config` instead of `configuration`, or putting `pass_rate` at the top level of a run instead of nested under `result`, will cause the viewer to show empty/zero values. Always reference this schema when generating benchmark.json manually.
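A quick structural check before handing a hand-written `benchmark.json` to the viewer can catch exactly these mistakes; the function below is a sketch, not part of the shipped scripts:

```python
REQUIRED_RUN_FIELDS = {"eval_id", "eval_name", "configuration", "run_number", "result"}
VALID_CONFIGS = {"with_skill", "without_skill"}

def check_benchmark_runs(doc):
    """Catch the field-name mistakes the viewer silently renders as zeros (sketch)."""
    problems = []
    for i, run in enumerate(doc.get("runs", [])):
        missing = REQUIRED_RUN_FIELDS - run.keys()
        if missing:
            problems.append(f"runs[{i}] missing {sorted(missing)}")
        # the viewer groups and color-codes on this exact string
        if run.get("configuration") not in VALID_CONFIGS:
            problems.append(f"runs[{i}] configuration must be with_skill/without_skill")
        # pass_rate must be nested under result, not top-level
        if "pass_rate" in run:
            problems.append(f"runs[{i}] pass_rate belongs under result, not top level")
    return problems
```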
---
## comparison.json
Output from blind comparator. Located at `<grading-dir>/comparison-N.json`.
```json
{
"winner": "A",
"reasoning": "Output A provides a complete solution with proper formatting and all required fields. Output B is missing the date field and has formatting inconsistencies.",
"rubric": {
"A": {
"content": {
"correctness": 5,
"completeness": 5,
"accuracy": 4
},
"structure": {
"organization": 4,
"formatting": 5,
"usability": 4
},
"content_score": 4.7,
"structure_score": 4.3,
"overall_score": 9.0
},
"B": {
"content": {
"correctness": 3,
"completeness": 2,
"accuracy": 3
},
"structure": {
"organization": 3,
"formatting": 2,
"usability": 3
},
"content_score": 2.7,
"structure_score": 2.7,
"overall_score": 5.4
}
},
"output_quality": {
"A": {
"score": 9,
"strengths": ["Complete solution", "Well-formatted", "All fields present"],
"weaknesses": ["Minor style inconsistency in header"]
},
"B": {
"score": 5,
"strengths": ["Readable output", "Correct basic structure"],
"weaknesses": ["Missing date field", "Formatting inconsistencies", "Partial data extraction"]
}
},
"expectation_results": {
"A": {
"passed": 4,
"total": 5,
"pass_rate": 0.80,
"details": [
{"text": "Output includes name", "passed": true}
]
},
"B": {
"passed": 3,
"total": 5,
"pass_rate": 0.60,
"details": [
{"text": "Output includes name", "passed": true}
]
}
}
}
```
---
## analysis.json
Output from post-hoc analyzer. Located at `<grading-dir>/analysis.json`.
```json
{
"comparison_summary": {
"winner": "A",
"winner_skill": "path/to/winner/skill",
"loser_skill": "path/to/loser/skill",
"comparator_reasoning": "Brief summary of why comparator chose winner"
},
"winner_strengths": [
"Clear step-by-step instructions for handling multi-page documents",
"Included validation script that caught formatting errors"
],
"loser_weaknesses": [
"Vague instruction 'process the document appropriately' led to inconsistent behavior",
"No script for validation, agent had to improvise"
],
"instruction_following": {
"winner": {
"score": 9,
"issues": ["Minor: skipped optional logging step"]
},
"loser": {
"score": 6,
"issues": [
"Did not use the skill's formatting template",
"Invented own approach instead of following step 3"
]
}
},
"improvement_suggestions": [
{
"priority": "high",
"category": "instructions",
"suggestion": "Replace 'process the document appropriately' with explicit steps",
"expected_impact": "Would eliminate ambiguity that caused inconsistent behavior"
}
],
"transcript_insights": {
"winner_execution_pattern": "Read skill -> Followed 5-step process -> Used validation script",
"loser_execution_pattern": "Read skill -> Unclear on approach -> Tried 3 different methods"
}
}
```
# Workflow Patterns
Reference guide for choosing the right workflow pattern when creating skills.
---
## Sequential
Fixed steps executed in order. Best for deterministic tasks with predictable flow.
**When to use:** Data processing pipelines, file transformations, deployment scripts.
```
Step 1 -> Step 2 -> Step 3 -> Done
```
**Example:** Generate release notes — parse commits, group by type, format output, write file.
---
## Conditional
Branch based on input or detected state. Best when different scenarios need different handling.
**When to use:** Error diagnosis, multi-format support, environment-specific logic.
```
Detect input type
+-- Type A -> Path A steps
+-- Type B -> Path B steps
+-- Unknown -> Error handling
```
**Example:** File converter — detect format (JSON/YAML/TOML), apply appropriate parser, output in target format.
---
## Wizard
Interactive multi-step gathering with user input between phases. Best for configuration or personalized output.
**When to use:** Project scaffolding, skill creation, configuration generation.
```
Ask question 1 -> Process answer -> Ask question 2 -> ... -> Generate output
```
**Example:** Skill Creator Step 1 — ask name, purpose, tools, complexity, then generate.
---
## Plan-Validate-Execute
Generate a plan, validate feasibility, then execute. Best when operations are hard to reverse.
**When to use:** Database migrations, infrastructure changes, bulk refactoring.
```
Analyze current state -> Generate plan -> Validate plan -> Execute -> Verify
```
**Example:** Database migration — analyze schema diff, generate migration SQL, dry-run validate, execute, verify integrity.
---
## Feedback Loop
Iterate until quality threshold is met. Best for optimization and creative tasks.
**When to use:** Description tuning, code quality improvement, test coverage gaps.
```
Generate -> Evaluate -> Score -> (below threshold?) -> Modify -> Re-evaluate
```
**Example:** Description optimization — generate description, test trigger accuracy, if <90% modify and retry.
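The loop skeleton, with caller-supplied `generate`, `evaluate`, and `modify` steps (names are illustrative):

```python
def feedback_loop(generate, evaluate, modify, threshold=0.9, max_iters=10):
    """Generic feedback-loop skeleton: iterate until the score clears the threshold.

    `generate`, `evaluate`, and `modify` are caller-supplied callables;
    this is a sketch of the pattern, not a shipped implementation.
    """
    candidate = generate()
    for _ in range(max_iters):
        score = evaluate(candidate)
        if score >= threshold:
            return candidate, score
        candidate = modify(candidate, score)   # revise and try again
    return candidate, evaluate(candidate)      # best effort after max_iters
```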
---
## Search-Analyze-Report
Explore a codebase or dataset, analyze findings, produce structured report. Best for audit and discovery tasks.
**When to use:** Security audits, dependency analysis, code review, documentation gaps.
```
Search (Glob/Grep) -> Collect findings -> Analyze patterns -> Generate report
```
**Example:** Security scanner — search for SQL injection patterns, analyze severity, generate prioritized report.
---
## Choosing a Pattern
| Task Type | Recommended Pattern |
|-----------|-------------------|
| Build/generate something | Sequential |
| Fix/diagnose something | Conditional |
| Configure/set up | Wizard |
| Migrate/transform | Plan-Validate-Execute |
| Optimize/improve | Feedback Loop |
| Audit/review | Search-Analyze-Report |
Most skills combine patterns. A skill creator uses Wizard (gather info) + Sequential (write files) + Feedback Loop (validate and iterate).
# Output Patterns
Reference guide for structuring skill output. Choose based on how strictly the output format matters.
---
## Strict Template
Exact output format — every field, every line defined. Zero creative freedom.
**When to use:** API responses, config files, compliance documents, machine-parseable output.
**Implementation:** Provide the exact template in SKILL.md with all fields labeled.
```markdown
## Output
Generate output in EXACTLY this format:
\```
REPORT: {title}
Date: {YYYY-MM-DD}
Status: {PASS|FAIL|WARN}
Findings: {count}
{findings table}
\```
```
---
## Flexible Template
Defined structure with creative content. Sections are fixed, content within is free-form.
**When to use:** Documentation, READMEs, blog posts, analysis reports.
**Implementation:** Specify required sections and constraints, let Claude fill content.
```markdown
## Output
Include these sections in order:
1. **Summary** (2-3 sentences)
2. **Key Findings** (bulleted list)
3. **Recommendations** (numbered, prioritized)
4. **Next Steps** (actionable items)
```
---
## Examples-Driven
Define behavior through input/output pairs. Claude infers the pattern.
**When to use:** Transformations, formatting, code generation where showing is clearer than telling.
**Implementation:** Provide 3+ diverse examples covering normal and edge cases.
```markdown
## Examples
**Input:** `getUserById(123)`
**Output:** `SELECT * FROM users WHERE id = $1` with params `[123]`
**Input:** `findUsersByName("Alice")`
**Output:** `SELECT * FROM users WHERE name ILIKE $1` with params `['%Alice%']`
```
---
## Visual (HTML Generation)
Generate self-contained HTML pages for rich visual output.
**When to use:** Dashboards, comparison tables, diagrams, interactive reports.
**Implementation:** Specify the visual layout, styling approach, and data structure.
```markdown
## Output
Generate a self-contained HTML file with:
- Inline CSS (no external dependencies)
- Data table with sortable columns
- Color-coded severity indicators
- Print-friendly layout
```
---
## Structured Data
Output as JSON, YAML, or other machine-readable format.
**When to use:** API integrations, config generation, data export, pipeline inputs.
**Implementation:** Provide the exact schema with field types and constraints.
```markdown
## Output
Generate JSON matching this schema:
\```json
{
"version": "1.0",
"results": [
{
"file": "string",
"line": "number",
"severity": "error|warning|info",
"message": "string"
}
],
"summary": {
"total": "number",
"errors": "number",
"warnings": "number"
}
}
\```
```
---
## Choosing a Pattern
| Need | Pattern | Degrees of Freedom |
|------|---------|-------------------|
| Machine-parseable | Strict Template | None |
| Consistent structure | Flexible Template | Medium |
| Pattern inference | Examples-Driven | Medium |
| Rich presentation | Visual (HTML) | High |
| Pipeline integration | Structured Data | None |
**Tip:** Combine patterns when needed. A skill might use Structured Data for the primary output and Visual for a human-readable report.
# Frontmatter Specification
Complete reference for SKILL.md YAML frontmatter fields.
---
## Required Fields
### `name`
- **Type:** string
- **Format:** kebab-case (`a-z`, `0-9`, hyphens)
- **Max length:** 64 characters
- **Convention:** Must match directory name. Gerund form preferred (`processing-pdfs` not `pdf-processor`).
- **Validation:** Error if missing, warning if not kebab-case or mismatches folder name.
### `description`
- **Type:** string (multiline with `|`)
- **Min length:** 20 characters
- **Max length:** 1024 characters
- **Requirements:**
- Third person voice (no "I can" or "You should")
- Must include "Use when ..." phrase (+3 marketplace pts)
- Must include "Trigger with ..." phrase (+3 marketplace pts)
- Action verb near start (analyze, create, deploy, etc.)
- No reserved words in isolation (anthropic, claude)
- **Best practice:** Front-load distinctive keywords. Add "Make sure to use whenever..." for aggressive activation.
---
## Identity Fields (Top-Level)
These MUST be top-level fields. Do NOT nest under `metadata:`.
### `version`
- **Type:** string
- **Format:** Semver (`X.Y.Z`)
- **Example:** `1.0.0`
- **Validation:** Error if not semver format.
### `author`
- **Type:** string
- **Format:** `Name <email>`
- **Example:** `Jeremy Longshore <jeremy@intentsolutions.io>`
- **Validation:** Warning if no email included.
### `license`
- **Type:** string
- **Format:** SPDX identifier
- **Default:** `MIT`
### `allowed-tools`
- **Type:** string (CSV format, NOT YAML array)
- **Valid tools:** Read, Write, Edit, Bash, Glob, Grep, WebFetch, WebSearch, Task, TodoWrite, NotebookEdit, AskUserQuestion, Skill
- **Bash scoping:** Must be scoped — `Bash(git:*)`, `Bash(npm:*)`, `Bash(python3:*)`. Unscoped `Bash` is an error.
- **MCP tools:** `ServerName:tool_name`
- **Example:** `"Read,Write,Edit,Glob,Grep,Bash(git:*),Bash(npm:*)"`
- **Validation:** Error if unscoped Bash, error if unknown tool name.
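The rules above can be sketched as a checker; MCP `Server:tool` entries are waved through here, and this is illustrative rather than the shipped validator:

```python
import re

VALID_TOOLS = {"Read", "Write", "Edit", "Bash", "Glob", "Grep", "WebFetch",
               "WebSearch", "Task", "TodoWrite", "NotebookEdit",
               "AskUserQuestion", "Skill"}

def validate_allowed_tools(value):
    """Validate an allowed-tools CSV string (sketch)."""
    errors = []
    for entry in (e.strip() for e in value.split(",")):
        # scoped Bash like Bash(git:*) is valid
        if re.fullmatch(r"Bash\(\w[\w-]*:\*\)", entry):
            continue
        if entry == "Bash":
            errors.append("unscoped Bash is an error: use Bash(cmd:*)")
        elif ":" in entry:
            continue  # MCP tool: ServerName:tool_name, accepted as-is
        elif entry not in VALID_TOOLS:
            errors.append(f"unknown tool: {entry}")
    return errors
```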
---
## Optional Fields (Anthropic Spec)
### `model`
- **Type:** string
- **Values:** `inherit` (default), `sonnet`, `haiku`, `opus`, or `claude-*` model ID
- **When to use:** Only override when the skill specifically needs a different model capability.
### `argument-hint`
- **Type:** string
- **Max length:** 200 characters
- **Purpose:** Autocomplete hint shown in `/name` menu.
- **Example:** `"<file-path>"`, `"[version-tag]"`
### `context`
- **Type:** string
- **Values:** `fork`
- **Purpose:** Run skill in an isolated subagent context.
### `agent`
- **Type:** string
- **Purpose:** Specify subagent type when using `context: fork`.
- **Values:** Any valid agent type (e.g., `general-purpose`, `Explore`)
### `user-invocable`
- **Type:** boolean
- **Default:** `true`
- **Purpose:** Set `false` to hide from `/` menu. Skill still activates via model matching.
### `disable-model-invocation`
- **Type:** boolean
- **Default:** `false`
- **Purpose:** Set `true` to prevent auto-activation. Only triggers via explicit `/name` command.
### `hooks`
- **Type:** object
- **Purpose:** Skill-scoped lifecycle hooks (pre-tool-call, post-tool-call, etc.)
---
## Extended Fields (Marketplace)
These are recognized by the Intent Solutions marketplace but not part of the core Anthropic spec.
### `compatible-with`
- **Type:** string (CSV) or array
- **Values:** `claude-code`, `codex`, `openclaw`, `aider`, `continue`, `cursor`, `windsurf`
- **Purpose:** Declare which AI coding platforms this skill works with.
### `compatibility`
- **Type:** string
- **Purpose:** Environment requirements (e.g., `"Node.js >= 18"`, `"Python 3.10+"`)
### `tags`
- **Type:** array of strings
- **Purpose:** Discovery tags for marketplace search.
- **Example:** `[devops, ci, deployment]`
---
## Deprecated Fields
### `when_to_use`
- **Status:** Deprecated
- **Migration:** Move content into the `description` field using "Use when ..." pattern.
### `metadata` (as parent for identity fields)
- **Status:** Anti-pattern
- **Migration:** Move `author`, `version`, `license` to top-level.
# Validation Rules — 100-Point Rubric
Detailed sub-criteria for the Intent Solutions 100-point grading system.
---
## Grade Scale
| Grade | Score | Meaning |
|-------|-------|---------|
| A | 90-100 | Production-ready |
| B | 80-89 | Good, minor improvements |
| C | 70-79 | Adequate, has gaps |
| D | 60-69 | Needs significant work |
| F | <60 | Major revision required |
---
## Pillar 1: Progressive Disclosure (30 pts)
### Token Economy (10 pts)
- 10 pts: SKILL.md body <=150 lines
- 7 pts: 151-300 lines
- 4 pts: 301-500 lines
- 0 pts: >500 lines
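As a sketch, the banding maps directly to a line-count threshold function:

```python
def token_economy_score(body_lines):
    """Score the Token Economy sub-criterion from the SKILL.md body line count (sketch)."""
    if body_lines <= 150:
        return 10
    if body_lines <= 300:
        return 7
    if body_lines <= 500:
        return 4
    return 0
```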
### Layered Structure (10 pts)
- 10 pts: Has `references/` directory with markdown files
- 8 pts: No references needed (skill <=100 lines)
- 4 pts: No references but skill is 101-200 lines
- 3 pts: `references/` exists but empty
- 0 pts: No references and skill >200 lines
### Reference Depth (5 pts)
- 5 pts: References are flat (no nested subdirectories)
- 2 pts: Has nested directories in `references/`
- 5 pts: N/A (no references directory)
### Navigation Signals (5 pts)
- 5 pts: Short file (<=100 lines) OR has TOC/anchor links
- 0 pts: Long file without navigation
---
## Pillar 2: Ease of Use (25 pts)
### Metadata Quality (10 pts)
- +2: Has `name`
- +3: Has `description` >=50 chars
- +2: Has `version`
- +2: Has `allowed-tools`
- +1: Has `author` with email
### Discoverability (6 pts)
- +3: Description contains "Use when"
- +3: Description contains "Trigger with" or "Trigger phrase"
### Terminology Consistency (4 pts)
- -2: `name` differs from folder name
- -1: Inconsistent casing in description (ALL CAPS words >3 chars)
### Workflow Clarity (5 pts)
- +3: Has numbered steps in body
- +2: Has >=5 section headers
- +1: Has 3-4 section headers
---
## Pillar 3: Utility (20 pts)
### Problem Solving Power (8 pts)
- +4: Has `## Overview` section with >50 chars
- +2: Has `## Prerequisites` section
- +2: Has `## Output` section
### Degrees of Freedom (5 pts)
- +2: Mentions options/configuration/parameters
- +2: Shows alternatives/multiple approaches
- +1: Mentions extensibility/customization
### Feedback Loops (4 pts)
- +2: Has `## Error Handling` section
- +1: Mentions validation/verification
- +1: Mentions troubleshooting/debugging
### Examples & Templates (3 pts)
- +2: Has `## Examples` section or labeled examples
- +1: Has >=2 code blocks
---
## Pillar 4: Spec Compliance (15 pts)
### Frontmatter Validity (5 pts)
- 5 pts baseline, -1 per missing required field (max -4)
- Required: name, description, allowed-tools, version, author, license
### Name Conventions (4 pts)
- -2: Not kebab-case
- -1: Name >64 characters
- -1: Name doesn't match folder
### Description Quality (4 pts)
- -2: Too short (<50 chars)
- -2: Too long (>1024 chars)
- -1: Uses first person ("I can")
- -1: Uses second person ("You should")
### Optional Fields (2 pts)
- -1: Invalid `model` value
---
## Pillar 5: Writing Style (10 pts)
### Voice & Tense (4 pts)
- -2: No imperative verbs in numbered steps
### Objectivity (3 pts)
- -1: "you should/can/will" in body
- -1: "I can/will" in body
### Conciseness (3 pts)
- -2: >3000 words
- -1: >2000 words
- -1: >400 lines
---
## Modifiers (+/-5 pts)
### Bonuses
- +1: Gerund-style name (ends in `-ing`)
- +1: Grep-friendly structure (>=7 section headers)
- +2: >=3 labeled examples
- +1: Resources section with >=2 external links
### Penalties
- -2: First/second person in description
- -2: Long file (>150 lines) without TOC
- -1: XML-like tags in body
# Grader Agent
Evaluate expectations against an execution transcript and outputs.
## Role
The Grader reviews a transcript and output files, then determines whether each expectation passes or fails. Provide clear evidence for each judgment.
You have two jobs: grade the outputs, and critique the evals themselves. A passing grade on a weak assertion is worse than useless — it creates false confidence. When you notice an assertion that's trivially satisfied, or an important outcome that no assertion checks, say so.
## Inputs
You receive these parameters in your prompt:
- **expectations**: List of expectations to evaluate (strings)
- **transcript_path**: Path to the execution transcript (markdown file)
- **outputs_dir**: Directory containing output files from execution
## Process
### Step 1: Read the Transcript
1. Read the transcript file completely
2. Note the eval prompt, execution steps, and final result
3. Identify any issues or errors documented
### Step 2: Examine Output Files
1. List files in outputs_dir
2. Read/examine each file relevant to the expectations. If outputs aren't plain text, use the inspection tools provided in your prompt — don't rely solely on what the transcript says the executor produced.
3. Note contents, structure, and quality
### Step 3: Evaluate Each Assertion
For each expectation:
1. **Search for evidence** in the transcript and outputs
2. **Determine verdict**:
- **PASS**: Clear evidence the expectation is true AND the evidence reflects genuine task completion, not just surface-level compliance
- **FAIL**: No evidence, or evidence contradicts the expectation, or the evidence is superficial (e.g., correct filename but empty/wrong content)
3. **Cite the evidence**: Quote the specific text or describe what you found
### Step 4: Extract and Verify Claims
Beyond the predefined expectations, extract implicit claims from the outputs and verify them:
1. **Extract claims** from the transcript and outputs:
- Factual statements ("The form has 12 fields")
- Process claims ("Used pypdf to fill the form")
- Quality claims ("All fields were filled correctly")
2. **Verify each claim**:
- **Factual claims**: Can be checked against the outputs or external sources
- **Process claims**: Can be verified from the transcript
- **Quality claims**: Evaluate whether the claim is justified
3. **Flag unverifiable claims**: Note claims that cannot be verified with available information
This catches issues that predefined expectations might miss.
### Step 5: Read User Notes
If `{outputs_dir}/user_notes.md` exists:
1. Read it and note any uncertainties or issues flagged by the executor
2. Include relevant concerns in the grading output
3. These may reveal problems even when expectations pass
### Step 6: Critique the Evals
After grading, consider whether the evals themselves could be improved. Only surface suggestions when there's a clear gap.
Good suggestions test meaningful outcomes — assertions that are hard to satisfy without actually doing the work correctly. Think about what makes an assertion *discriminating*: it passes when the skill genuinely succeeds and fails when it doesn't.
Suggestions worth raising:
- An assertion that passed but would also pass for a clearly wrong output (e.g., checking filename existence but not file content)
- An important outcome you observed — good or bad — that no assertion covers at all
- An assertion that can't actually be verified from the available outputs
Keep the bar high. The goal is to flag things the eval author would say "good catch" about, not to nitpick every assertion.
### Step 7: Write Grading Results
Save results to `{outputs_dir}/../grading.json` (sibling to outputs_dir).
## Grading Criteria
**PASS when**:
- The transcript or outputs clearly demonstrate the expectation is true
- Specific evidence can be cited
- The evidence reflects genuine substance, not just surface compliance (e.g., a file exists AND contains correct content, not just the right filename)
**FAIL when**:
- No evidence found for the expectation
- Evidence contradicts the expectation
- The expectation cannot be verified from available information
- The evidence is superficial — the assertion is technically satisfied but the underlying task outcome is wrong or incomplete
- The output appears to meet the assertion by coincidence rather than by actually doing the work
**When uncertain**: Fail the expectation. The burden of proof for a PASS rests on the evidence.
### Step 8: Read Executor Metrics and Timing
1. If `{outputs_dir}/metrics.json` exists, read it and include in grading output
2. If `{outputs_dir}/../timing.json` exists, read it and include timing data
## Output Format
Write a JSON file with this structure:
```json
{
"expectations": [
{
"text": "The output includes the name 'John Smith'",
"passed": true,
"evidence": "Found in transcript Step 3: 'Extracted names: John Smith, Sarah Johnson'"
},
{
"text": "The spreadsheet has a SUM formula in cell B10",
"passed": false,
"evidence": "No spreadsheet was created. The output was a text file."
},
{
"text": "The assistant used the skill's OCR script",
"passed": true,
"evidence": "Transcript Step 2 shows: 'Tool: Bash - python ocr_script.py image.png'"
}
],
"summary": {
"passed": 2,
"failed": 1,
"total": 3,
"pass_rate": 0.67
},
"execution_metrics": {
"tool_calls": {
"Read": 5,
"Write": 2,
"Bash": 8
},
"total_tool_calls": 15,
"total_steps": 6,
"errors_encountered": 0,
"output_chars": 12450,
"transcript_chars": 3200
},
"timing": {
"executor_duration_seconds": 165.0,
"grader_duration_seconds": 26.0,
"total_duration_seconds": 191.0
},
"claims": [
{
"claim": "The form has 12 fillable fields",
"type": "factual",
"verified": true,
"evidence": "Counted 12 fields in field_info.json"
},
{
"claim": "All required fields were populated",
"type": "quality",
"verified": false,
"evidence": "Reference section was left blank despite data being available"
}
],
"user_notes_summary": {
"uncertainties": ["Used 2023 data, may be stale"],
"needs_review": [],
"workarounds": ["Fell back to text overlay for non-fillable fields"]
},
"eval_feedback": {
"suggestions": [
{
"assertion": "The output includes the name 'John Smith'",
"reason": "A hallucinated document that mentions the name would also pass — consider checking it appears as the primary contact with matching phone and email from the input"
},
{
"reason": "No assertion checks whether the extracted phone numbers match the input — I observed incorrect numbers in the output that went uncaught"
}
],
"overall": "Assertions check presence but not correctness. Consider adding content verification."
}
}
```
## Field Descriptions
- **expectations**: Array of graded expectations
- **text**: The original expectation text
- **passed**: Boolean - true if expectation passes
- **evidence**: Specific quote or description supporting the verdict
- **summary**: Aggregate statistics
- **passed**: Count of passed expectations
- **failed**: Count of failed expectations
- **total**: Total expectations evaluated
- **pass_rate**: Fraction passed (0.0 to 1.0)
- **execution_metrics**: Copied from executor's metrics.json (if available)
- **output_chars**: Total character count of output files (proxy for tokens)
- **transcript_chars**: Character count of transcript
- **timing**: Wall clock timing from timing.json (if available)
- **executor_duration_seconds**: Time spent in the executor subagent
- **grader_duration_seconds**: Time spent in the grader subagent
- **total_duration_seconds**: Total elapsed time for the run
- **claims**: Extracted and verified claims from the output
- **claim**: The statement being verified
- **type**: "factual", "process", or "quality"
- **verified**: Boolean - whether the claim holds
- **evidence**: Supporting or contradicting evidence
- **user_notes_summary**: Issues flagged by the executor
- **uncertainties**: Things the executor wasn't sure about
- **needs_review**: Items requiring human attention
- **workarounds**: Places where the skill didn't work as expected
- **eval_feedback**: Improvement suggestions for the evals (only when warranted)
- **suggestions**: List of concrete suggestions, each with a `reason` and optionally an `assertion` it relates to
- **overall**: Brief assessment — can be "No suggestions, evals look solid" if nothing to flag
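The `summary` block is fully derivable from the `expectations` array; a minimal consistency-check sketch (the helper name is illustrative, not one of the shipped scripts):

```python
# Recompute the grader's summary block from its expectations array.
# Hypothetical helper for sanity-checking grading.json output.
def summarize(expectations):
    passed = sum(1 for e in expectations if e["passed"])
    total = len(expectations)
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }

grading = [
    {"text": "Output includes the name", "passed": True},
    {"text": "SUM formula in B10", "passed": False},
    {"text": "OCR script was used", "passed": True},
]
print(summarize(grading))  # {'passed': 2, 'failed': 1, 'total': 3, 'pass_rate': 0.67}
```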
## Guidelines
- **Be objective**: Base verdicts on evidence, not assumptions
- **Be specific**: Quote the exact text that supports your verdict
- **Be thorough**: Check both transcript and output files
- **Be consistent**: Apply the same standard to each expectation
- **Explain failures**: Make it clear why evidence was insufficient
- **No partial credit**: Each expectation is pass or fail, not partial

# Blind Comparator Agent
Compare two outputs WITHOUT knowing which skill produced them.
## Role
The Blind Comparator judges which output better accomplishes the eval task. You receive two outputs labeled A and B, but you do NOT know which skill produced which. This prevents bias toward a particular skill or approach.
Your judgment is based purely on output quality and task completion.
## Inputs
You receive these parameters in your prompt:
- **output_a_path**: Path to the first output file or directory
- **output_b_path**: Path to the second output file or directory
- **eval_prompt**: The original task/prompt that was executed
- **expectations**: List of expectations to check (optional - may be empty)
## Process
### Step 1: Read Both Outputs
1. Examine output A (file or directory)
2. Examine output B (file or directory)
3. Note the type, structure, and content of each
4. If outputs are directories, examine all relevant files inside
### Step 2: Understand the Task
1. Read the eval_prompt carefully
2. Identify what the task requires:
- What should be produced?
- What qualities matter (accuracy, completeness, format)?
- What would distinguish a good output from a poor one?
### Step 3: Generate Evaluation Rubric
Based on the task, generate a rubric with two dimensions:
**Content Rubric** (what the output contains):
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|-----------|----------|----------------|---------------|
| Correctness | Major errors | Minor errors | Fully correct |
| Completeness | Missing key elements | Mostly complete | All elements present |
| Accuracy | Significant inaccuracies | Minor inaccuracies | Accurate throughout |
**Structure Rubric** (how the output is organized):
| Criterion | 1 (Poor) | 3 (Acceptable) | 5 (Excellent) |
|-----------|----------|----------------|---------------|
| Organization | Disorganized | Reasonably organized | Clear, logical structure |
| Formatting | Inconsistent/broken | Mostly consistent | Professional, polished |
| Usability | Difficult to use | Usable with effort | Easy to use |
Adapt criteria to the specific task. For example:
- PDF form: "Field alignment", "Text readability", "Data placement"
- Document: "Section structure", "Heading hierarchy", "Paragraph flow"
- Data output: "Schema correctness", "Data types", "Completeness"
### Step 4: Evaluate Each Output Against the Rubric
For each output (A and B):
1. **Score each criterion** on the rubric (1-5 scale)
2. **Calculate dimension totals**: Content score, Structure score
3. **Calculate overall score**: Average of dimension scores, scaled to 1-10
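The scoring arithmetic above can be sketched as follows; the rounding and 1-10 scaling match the example output format, and the helper name is illustrative:

```python
# Score one output against the two-dimension rubric: criteria are 1-5,
# dimension scores are rounded averages, overall is scaled to 1-10.
def score_output(content, structure):
    c = round(sum(content.values()) / len(content), 1)
    s = round(sum(structure.values()) / len(structure), 1)
    return {
        "content_score": c,
        "structure_score": s,
        "overall_score": round((c + s) / 2 * 2, 1),
    }

a = score_output(
    {"correctness": 5, "completeness": 5, "accuracy": 4},
    {"organization": 4, "formatting": 5, "usability": 4},
)
print(a)  # {'content_score': 4.7, 'structure_score': 4.3, 'overall_score': 9.0}
```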
### Step 5: Check Assertions (if provided)
If expectations are provided:
1. Check each expectation against output A
2. Check each expectation against output B
3. Count pass rates for each output
4. Use expectation scores as secondary evidence (not the primary decision factor)
### Step 6: Determine the Winner
Compare A and B based on (in priority order):
1. **Primary**: Overall rubric score (content + structure)
2. **Secondary**: Assertion pass rates (if applicable)
3. **Tiebreaker**: If truly equal, declare a TIE
Be decisive - ties should be rare. One output is usually better, even if marginally.
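The priority order above can be sketched as a small decision helper (hypothetical, for illustration only):

```python
# Decide the winner per Step 6: rubric score first, assertion pass rate
# second, TIE only when both are genuinely equal.
def decide_winner(score_a, score_b, pass_rate_a=None, pass_rate_b=None):
    if score_a != score_b:
        return "A" if score_a > score_b else "B"
    if pass_rate_a is not None and pass_rate_a != pass_rate_b:
        return "A" if pass_rate_a > pass_rate_b else "B"
    return "TIE"

print(decide_winner(9.0, 5.4))            # A  (rubric decides)
print(decide_winner(7.0, 7.0, 0.8, 0.6))  # A  (pass rate breaks the tie)
print(decide_winner(7.0, 7.0, 0.8, 0.8))  # TIE
```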
### Step 7: Write Comparison Results
Save results to a JSON file at the path specified (or `comparison.json` if not specified).
## Output Format
Write a JSON file with this structure:
```json
{
"winner": "A",
"reasoning": "Output A provides a complete solution with proper formatting and all required fields. Output B is missing the date field and has formatting inconsistencies.",
"rubric": {
"A": {
"content": {
"correctness": 5,
"completeness": 5,
"accuracy": 4
},
"structure": {
"organization": 4,
"formatting": 5,
"usability": 4
},
"content_score": 4.7,
"structure_score": 4.3,
"overall_score": 9.0
},
"B": {
"content": {
"correctness": 3,
"completeness": 2,
"accuracy": 3
},
"structure": {
"organization": 3,
"formatting": 2,
"usability": 3
},
"content_score": 2.7,
"structure_score": 2.7,
"overall_score": 5.4
}
},
"output_quality": {
"A": {
"score": 9,
"strengths": ["Complete solution", "Well-formatted", "All fields present"],
"weaknesses": ["Minor style inconsistency in header"]
},
"B": {
"score": 5,
"strengths": ["Readable output", "Correct basic structure"],
"weaknesses": ["Missing date field", "Formatting inconsistencies", "Partial data extraction"]
}
},
"expectation_results": {
"A": {
"passed": 4,
"total": 5,
"pass_rate": 0.80,
"details": [
{"text": "Output includes name", "passed": true},
{"text": "Output includes date", "passed": true},
{"text": "Format is PDF", "passed": true},
{"text": "Contains signature", "passed": false},
{"text": "Readable text", "passed": true}
]
},
"B": {
"passed": 3,
"total": 5,
"pass_rate": 0.60,
"details": [
{"text": "Output includes name", "passed": true},
{"text": "Output includes date", "passed": false},
{"text": "Format is PDF", "passed": true},
{"text": "Contains signature", "passed": false},
{"text": "Readable text", "passed": true}
]
}
}
}
```
If no expectations were provided, omit the `expectation_results` field entirely.
## Field Descriptions
- **winner**: "A", "B", or "TIE"
- **reasoning**: Clear explanation of why the winner was chosen (or why it's a tie)
- **rubric**: Structured rubric evaluation for each output
- **content**: Scores for content criteria (correctness, completeness, accuracy)
- **structure**: Scores for structure criteria (organization, formatting, usability)
- **content_score**: Average of content criteria (1-5)
- **structure_score**: Average of structure criteria (1-5)
- **overall_score**: Combined score scaled to 1-10
- **output_quality**: Summary quality assessment
- **score**: 1-10 rating (should match rubric overall_score)
- **strengths**: List of positive aspects
- **weaknesses**: List of issues or shortcomings
- **expectation_results**: (Only if expectations provided)
- **passed**: Number of expectations that passed
- **total**: Total number of expectations
- **pass_rate**: Fraction passed (0.0 to 1.0)
- **details**: Individual expectation results
## Guidelines
- **Stay blind**: DO NOT try to infer which skill produced which output. Judge purely on output quality.
- **Be specific**: Cite specific examples when explaining strengths and weaknesses.
- **Be decisive**: Choose a winner unless outputs are genuinely equivalent.
- **Output quality first**: Assertion scores are secondary to overall task completion.
- **Be objective**: Don't favor outputs based on style preferences; focus on correctness and completeness.
- **Explain your reasoning**: The reasoning field should make it clear why you chose the winner.
- **Handle edge cases**: If both outputs fail, pick the one that fails less badly. If both are excellent, pick the one that's marginally better.

# Post-hoc Analyzer Agent
Analyze blind comparison results to understand WHY the winner won and generate improvement suggestions.
## Role
After the blind comparator determines a winner, the Post-hoc Analyzer "unblinds" the results by examining the skills and transcripts. The goal is to extract actionable insights: what made the winner better, and how can the loser be improved?
## Inputs
You receive these parameters in your prompt:
- **winner**: "A" or "B" (from blind comparison)
- **winner_skill_path**: Path to the skill that produced the winning output
- **winner_transcript_path**: Path to the execution transcript for the winner
- **loser_skill_path**: Path to the skill that produced the losing output
- **loser_transcript_path**: Path to the execution transcript for the loser
- **comparison_result_path**: Path to the blind comparator's output JSON
- **output_path**: Where to save the analysis results
## Process
### Step 1: Read Comparison Result
1. Read the blind comparator's output at comparison_result_path
2. Note the winning side (A or B), the reasoning, and any scores
3. Understand what the comparator valued in the winning output
### Step 2: Read Both Skills
1. Read the winner skill's SKILL.md and key referenced files
2. Read the loser skill's SKILL.md and key referenced files
3. Identify structural differences:
- Instructions clarity and specificity
- Script/tool usage patterns
- Example coverage
- Edge case handling
### Step 3: Read Both Transcripts
1. Read the winner's transcript
2. Read the loser's transcript
3. Compare execution patterns:
- How closely did each follow their skill's instructions?
- What tools were used differently?
- Where did the loser diverge from optimal behavior?
- Did either encounter errors or make recovery attempts?
### Step 4: Analyze Instruction Following
For each transcript, evaluate:
- Did the agent follow the skill's explicit instructions?
- Did the agent use the skill's provided tools/scripts?
- Were there missed opportunities to leverage skill content?
- Did the agent add unnecessary steps not in the skill?
Score instruction following 1-10 and note specific issues.
### Step 5: Identify Winner Strengths
Determine what made the winner better:
- Clearer instructions that led to better behavior?
- Better scripts/tools that produced better output?
- More comprehensive examples that guided edge cases?
- Better error handling guidance?
Be specific. Quote from skills/transcripts where relevant.
### Step 6: Identify Loser Weaknesses
Determine what held the loser back:
- Ambiguous instructions that led to suboptimal choices?
- Missing tools/scripts that forced workarounds?
- Gaps in edge case coverage?
- Poor error handling that caused failures?
### Step 7: Generate Improvement Suggestions
Based on the analysis, produce actionable suggestions for improving the loser skill:
- Specific instruction changes to make
- Tools/scripts to add or modify
- Examples to include
- Edge cases to address
Prioritize by impact. Focus on changes that would have changed the outcome.
### Step 8: Write Analysis Results
Save structured analysis to `{output_path}`.
## Output Format
Write a JSON file with this structure:
```json
{
"comparison_summary": {
"winner": "A",
"winner_skill": "path/to/winner/skill",
"loser_skill": "path/to/loser/skill",
"comparator_reasoning": "Brief summary of why comparator chose winner"
},
"winner_strengths": [
"Clear step-by-step instructions for handling multi-page documents",
"Included validation script that caught formatting errors",
"Explicit guidance on fallback behavior when OCR fails"
],
"loser_weaknesses": [
"Vague instruction 'process the document appropriately' led to inconsistent behavior",
"No script for validation, agent had to improvise and made errors",
"No guidance on OCR failure, agent gave up instead of trying alternatives"
],
"instruction_following": {
"winner": {
"score": 9,
"issues": [
"Minor: skipped optional logging step"
]
},
"loser": {
"score": 6,
"issues": [
"Did not use the skill's formatting template",
"Invented own approach instead of following step 3",
"Missed the 'always validate output' instruction"
]
}
},
"improvement_suggestions": [
{
"priority": "high",
"category": "instructions",
"suggestion": "Replace 'process the document appropriately' with explicit steps: 1) Extract text, 2) Identify sections, 3) Format per template",
"expected_impact": "Would eliminate ambiguity that caused inconsistent behavior"
},
{
"priority": "high",
"category": "tools",
"suggestion": "Add validate_output.py script similar to winner skill's validation approach",
"expected_impact": "Would catch formatting errors before final output"
},
{
"priority": "medium",
"category": "error_handling",
"suggestion": "Add fallback instructions: 'If OCR fails, try: 1) different resolution, 2) image preprocessing, 3) manual extraction'",
"expected_impact": "Would prevent early failure on difficult documents"
}
],
"transcript_insights": {
"winner_execution_pattern": "Read skill -> Followed 5-step process -> Used validation script -> Fixed 2 issues -> Produced output",
"loser_execution_pattern": "Read skill -> Unclear on approach -> Tried 3 different methods -> No validation -> Output had errors"
}
}
```
## Guidelines
- **Be specific**: Quote from skills and transcripts, don't just say "instructions were unclear"
- **Be actionable**: Suggestions should be concrete changes, not vague advice
- **Focus on skill improvements**: The goal is to improve the losing skill, not critique the agent
- **Prioritize by impact**: Which changes would most likely have changed the outcome?
- **Consider causation**: Did the skill weakness actually cause the worse output, or is it incidental?
- **Stay objective**: Analyze what happened, don't editorialize
- **Think about generalization**: Would this improvement help on other evals too?
## Categories for Suggestions
Use these categories to organize improvement suggestions:
| Category | Description |
|----------|-------------|
| `instructions` | Changes to the skill's prose instructions |
| `tools` | Scripts, templates, or utilities to add/modify |
| `examples` | Example inputs/outputs to include |
| `error_handling` | Guidance for handling failures |
| `structure` | Reorganization of skill content |
| `references` | External docs or resources to add |
## Priority Levels
- **high**: Would likely change the outcome of this comparison
- **medium**: Would improve quality but may not change win/loss
- **low**: Nice to have, marginal improvement
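A consumer of the analysis JSON might order suggestions by these priority levels so high-impact changes surface first; a minimal sketch:

```python
# Sort improvement_suggestions so "high" priority items come first.
PRIORITY = {"high": 0, "medium": 1, "low": 2}

suggestions = [
    {"priority": "medium", "category": "error_handling", "suggestion": "Add OCR fallback steps"},
    {"priority": "high", "category": "instructions", "suggestion": "Replace vague step with explicit list"},
    {"priority": "low", "category": "examples", "suggestion": "Add a multi-page sample"},
]
suggestions.sort(key=lambda s: PRIORITY[s["priority"]])
print([s["priority"] for s in suggestions])  # ['high', 'medium', 'low']
```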
---
# Analyzing Benchmark Results
When analyzing benchmark results, the analyzer's purpose is to **surface patterns and anomalies** across multiple runs, not suggest skill improvements.
## Role
Review all benchmark run results and generate freeform notes that help the user understand skill performance. Focus on patterns that wouldn't be visible from aggregate metrics alone.
## Inputs
You receive these parameters in your prompt:
- **benchmark_data_path**: Path to the in-progress benchmark.json with all run results
- **skill_path**: Path to the skill being benchmarked
- **output_path**: Where to save the notes (as JSON array of strings)
## Process
### Step 1: Read Benchmark Data
1. Read the benchmark.json containing all run results
2. Note the configurations tested (with_skill, without_skill)
3. Understand the run_summary aggregates already calculated
### Step 2: Analyze Per-Assertion Patterns
For each expectation across all runs:
- Does it **always pass** in both configurations? (may not differentiate skill value)
- Does it **always fail** in both configurations? (may be broken or beyond capability)
- Does it **always pass with skill but fail without**? (skill clearly adds value here)
- Does it **always fail with skill but pass without**? (skill may be hurting)
- Is it **highly variable**? (flaky expectation or non-deterministic behavior)
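These per-assertion patterns can be sketched as a classifier over per-run pass/fail lists (hypothetical helper, mirroring the bullets above):

```python
# Classify one assertion's behaviour across benchmark runs.
def classify(with_skill, without_skill):
    """Each argument is a list of booleans, one per run."""
    def rate(runs):
        return sum(runs) / len(runs)
    w, wo = rate(with_skill), rate(without_skill)
    if w == 1.0 and wo == 1.0:
        return "always passes - may not differentiate skill value"
    if w == 0.0 and wo == 0.0:
        return "always fails - broken or beyond capability"
    if w == 1.0 and wo == 0.0:
        return "skill clearly adds value"
    if w == 0.0 and wo == 1.0:
        return "skill may be hurting"
    return "variable - possibly flaky or non-deterministic"

print(classify([True, True, True], [False, False, False]))  # skill clearly adds value
```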
### Step 3: Analyze Cross-Eval Patterns
Look for patterns across evals:
- Are certain eval types consistently harder/easier?
- Do some evals show high variance while others are stable?
- Are there surprising results that contradict expectations?
### Step 4: Analyze Metrics Patterns
Look at time_seconds, tokens, tool_calls:
- Does the skill significantly increase execution time?
- Is there high variance in resource usage?
- Are there outlier runs that skew the aggregates?
### Step 5: Generate Notes
Write freeform observations as a list of strings. Each note should:
- State a specific observation
- Be grounded in the data (not speculation)
- Help the user understand something the aggregate metrics don't show
Examples:
- "Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value"
- "Eval 3 shows high variance (50% +/- 40%) - run 2 had an unusual failure that may be flaky"
- "Without-skill runs consistently fail on table extraction expectations (0% pass rate)"
- "Skill adds 13s average execution time but improves pass rate by 50%"
- "Token usage is 80% higher with skill, primarily due to script output parsing"
- "All 3 without-skill runs for eval 1 produced empty output"
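Notes like the variance example can be grounded in simple statistics rather than eyeballing; a sketch using population standard deviation (the "+/-" convention and the 0.25 threshold are assumptions for illustration):

```python
import statistics

# Flag a high-variance eval from hypothetical per-run pass rates.
pass_rates = [1.0, 0.0, 0.5, 0.5]
mean = statistics.mean(pass_rates)
stdev = statistics.pstdev(pass_rates)
if stdev > 0.25:
    # prints: Eval shows high variance (50% +/- 35%) - possibly flaky
    print(f"Eval shows high variance ({mean:.0%} +/- {stdev:.0%}) - possibly flaky")
```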
### Step 6: Write Notes
Save notes to `{output_path}` as a JSON array of strings:
```json
[
"Assertion 'Output is a PDF file' passes 100% in both configurations - may not differentiate skill value",
"Eval 3 shows high variance (50% +/- 40%) - run 2 had an unusual failure",
"Without-skill runs consistently fail on table extraction expectations",
"Skill adds 13s average execution time but improves pass rate by 50%"
]
```
## Guidelines
**DO:**
- Report what you observe in the data
- Be specific about which evals, expectations, or runs you're referring to
- Note patterns that aggregate metrics would hide
- Provide context that helps interpret the numbers
**DO NOT:**
- Suggest improvements to the skill (that's for the improvement step, not benchmarking)
- Make subjective quality judgments ("the output was good/bad")
- Speculate about causes without evidence
- Repeat information already in the run_summary aggregates

---
# Required (AgentSkills.io)
name: {{SKILL_NAME}}
description: |
{{PURPOSE_STATEMENT}}. Use when {{WHEN_TO_USE}}.
Trigger with "/{{SKILL_NAME}}" or "{{NATURAL_TRIGGER}}".
# Tools (recommended)
allowed-tools: "{{TOOLS_CSV}}"
# Identity (top-level, NOT inside metadata)
version: 1.0.0
author: {{AUTHOR_NAME}} <{{AUTHOR_EMAIL}}>
license: MIT
# Claude Code extensions (include as needed)
model: inherit
# argument-hint: "[arg-description]"
# disable-model-invocation: false
# user-invocable: true
# context: fork
# agent: general-purpose
# Discovery (optional)
# compatible-with: claude-code, codex, openclaw
# tags: [{{TAG_1}}, {{TAG_2}}]
# Optional spec fields
# compatibility: "{{ENVIRONMENT_REQUIREMENTS}}"
# metadata:
# category: {{CATEGORY}}
---
# {{SKILL_TITLE}}
{{PURPOSE_STATEMENT_1_2_SENTENCES}}
## Overview
{{WHAT_THIS_SKILL_SOLVES_AND_WHY_IT_EXISTS}}
## Prerequisites
- {{PREREQUISITE_1}}
- {{PREREQUISITE_2}}
## Instructions
### Step 1: {{STEP_1_TITLE}}
{{STEP_1_DETAILED_INSTRUCTIONS}}
### Step 2: {{STEP_2_TITLE}}
{{STEP_2_DETAILED_INSTRUCTIONS}}
### Step 3: {{STEP_3_TITLE}}
{{STEP_3_DETAILED_INSTRUCTIONS}}
## Output
{{DESCRIPTION_OF_EXPECTED_OUTPUT_FORMAT}}
## Examples
### {{EXAMPLE_1_TITLE}}
**Input:**
```
{{EXAMPLE_1_INPUT}}
```
**Output:**
```
{{EXAMPLE_1_OUTPUT}}
```
### {{EXAMPLE_2_TITLE}}
**Input:**
```
{{EXAMPLE_2_INPUT}}
```
**Output:**
```
{{EXAMPLE_2_OUTPUT}}
```
## Edge Cases
- {{EDGE_CASE_1}}
- {{EDGE_CASE_2}}
## Error Handling
| Error | Cause | Solution |
|-------|-------|----------|
| {{ERROR_1}} | {{CAUSE_1}} | {{SOLUTION_1}} |
| {{ERROR_2}} | {{CAUSE_2}} | {{SOLUTION_2}} |
## Resources
- ${CLAUDE_SKILL_DIR}/references/{{REFERENCE_1}}.md - {{REFERENCE_1_PURPOSE}}
- ${CLAUDE_SKILL_DIR}/scripts/{{SCRIPT_1}}.py - {{SCRIPT_1_PURPOSE}}

#!/usr/bin/env python3
"""
Skill Validator — Wrapper around validate-skills-schema.py for skill-creator.
Provides --grade flag for single-file validation with full grade report.
Delegates to the main validator in the repository root.
Usage:
python3 ${CLAUDE_SKILL_DIR}/scripts/validate-skill.py <SKILL.md>
python3 ${CLAUDE_SKILL_DIR}/scripts/validate-skill.py --grade <SKILL.md>
"""
import os
import subprocess
import sys
def find_repo_validator():
"""Find the main validate-skills-schema.py in the repository."""
# Try relative to this script (inside the plugin repo)
script_dir = os.path.dirname(os.path.abspath(__file__))
# Walk up to find scripts/validate-skills-schema.py
current = script_dir
for _ in range(10):
candidate = os.path.join(current, "scripts", "validate-skills-schema.py")
if os.path.exists(candidate):
return candidate
parent = os.path.dirname(current)
if parent == current:
break
current = parent
# Try common locations
home = os.path.expanduser("~")
for path in [
os.path.join(home, "000-projects", "claude-code-plugins", "scripts", "validate-skills-schema.py"),
]:
if os.path.exists(path):
return path
return None
def main():
args = sys.argv[1:]
if not args or args == ["--help"] or args == ["-h"]:
print(__doc__)
sys.exit(0)
# Parse our flags
grade = "--grade" in args
verbose = "--verbose" in args or "-v" in args
skill_path = None
for arg in args:
if not arg.startswith("-"):
skill_path = arg
break
if not skill_path:
print("ERROR: No SKILL.md path provided", file=sys.stderr)
print("Usage: validate-skill.py [--grade] <path/to/SKILL.md>")
sys.exit(1)
if not os.path.exists(skill_path):
print(f"ERROR: File not found: {skill_path}", file=sys.stderr)
sys.exit(1)
validator = find_repo_validator()
if not validator:
print("ERROR: Cannot find validate-skills-schema.py", file=sys.stderr)
print("Make sure you're running from within the claude-code-plugins repo")
sys.exit(1)
# Build command
cmd = [sys.executable, validator]
if verbose or grade:
cmd.append("--verbose")
cmd.append(skill_path)
result = subprocess.run(cmd)
sys.exit(result.returncode)
if __name__ == "__main__":
    main()

#!/usr/bin/env python3
"""
Quick validation script for skills - minimal version
"""
import sys
import os
import re
import yaml
from pathlib import Path
def validate_skill(skill_path):
"""Basic validation of a skill"""
skill_path = Path(skill_path)
# Check SKILL.md exists
skill_md = skill_path / 'SKILL.md'
if not skill_md.exists():
return False, "SKILL.md not found"
# Read and validate frontmatter
content = skill_md.read_text()
if not content.startswith('---'):
return False, "No YAML frontmatter found"
# Extract frontmatter
match = re.match(r'^---\n(.*?)\n---', content, re.DOTALL)
if not match:
return False, "Invalid frontmatter format"
frontmatter_text = match.group(1)
# Parse YAML frontmatter
try:
frontmatter = yaml.safe_load(frontmatter_text)
if not isinstance(frontmatter, dict):
return False, "Frontmatter must be a YAML dictionary"
except yaml.YAMLError as e:
return False, f"Invalid YAML in frontmatter: {e}"
# Define allowed properties
ALLOWED_PROPERTIES = {'name', 'description', 'license', 'allowed-tools', 'metadata', 'compatibility'}
# Check for unexpected properties (excluding nested keys under metadata)
unexpected_keys = set(frontmatter.keys()) - ALLOWED_PROPERTIES
if unexpected_keys:
return False, (
f"Unexpected key(s) in SKILL.md frontmatter: {', '.join(sorted(unexpected_keys))}. "
f"Allowed properties are: {', '.join(sorted(ALLOWED_PROPERTIES))}"
)
# Check required fields
if 'name' not in frontmatter:
return False, "Missing 'name' in frontmatter"
if 'description' not in frontmatter:
return False, "Missing 'description' in frontmatter"
# Extract name for validation
name = frontmatter.get('name', '')
if not isinstance(name, str):
return False, f"Name must be a string, got {type(name).__name__}"
name = name.strip()
if name:
if not re.match(r'^[a-z0-9-]+$', name):
return False, f"Name '{name}' should be kebab-case (lowercase letters, digits, and hyphens only)"
if name.startswith('-') or name.endswith('-') or '--' in name:
return False, f"Name '{name}' cannot start/end with hyphen or contain consecutive hyphens"
if len(name) > 64:
return False, f"Name is too long ({len(name)} characters). Maximum is 64 characters."
# Extract and validate description
description = frontmatter.get('description', '')
if not isinstance(description, str):
return False, f"Description must be a string, got {type(description).__name__}"
description = description.strip()
if description:
if '<' in description or '>' in description:
return False, "Description cannot contain angle brackets (< or >)"
if len(description) > 1024:
return False, f"Description is too long ({len(description)} characters). Maximum is 1024 characters."
# Validate compatibility field if present (optional)
compatibility = frontmatter.get('compatibility', '')
if compatibility:
if not isinstance(compatibility, str):
return False, f"Compatibility must be a string, got {type(compatibility).__name__}"
if len(compatibility) > 500:
return False, f"Compatibility is too long ({len(compatibility)} characters). Maximum is 500 characters."
return True, "Skill is valid!"
if __name__ == "__main__":
if len(sys.argv) != 2:
print("Usage: python quick_validate.py <skill_directory>")
sys.exit(1)
valid, message = validate_skill(sys.argv[1])
print(message)
    sys.exit(0 if valid else 1)

"""Shared utilities for skill-creator scripts."""
from pathlib import Path
def parse_skill_md(skill_path: Path) -> tuple[str, str, str]:
"""Parse a SKILL.md file, returning (name, description, full_content)."""
content = (skill_path / "SKILL.md").read_text()
lines = content.split("\n")
if lines[0].strip() != "---":
raise ValueError("SKILL.md missing frontmatter (no opening ---)")
end_idx = None
for i, line in enumerate(lines[1:], start=1):
if line.strip() == "---":
end_idx = i
break
if end_idx is None:
raise ValueError("SKILL.md missing frontmatter (no closing ---)")
name = ""
description = ""
frontmatter_lines = lines[1:end_idx]
i = 0
while i < len(frontmatter_lines):
line = frontmatter_lines[i]
if line.startswith("name:"):
name = line[len("name:"):].strip().strip('"').strip("'")
elif line.startswith("description:"):
value = line[len("description:"):].strip()
# Handle YAML multiline indicators (>, |, >-, |-)
if value in (">", "|", ">-", "|-"):
continuation_lines: list[str] = []
i += 1
while i < len(frontmatter_lines) and (frontmatter_lines[i].startswith(" ") or frontmatter_lines[i].startswith("\t")):
continuation_lines.append(frontmatter_lines[i].strip())
i += 1
description = " ".join(continuation_lines)
continue
else:
description = value.strip('"').strip("'")
i += 1
    return name, description, content

#!/usr/bin/env python3
"""
Aggregate individual run results into benchmark summary statistics.
Reads grading.json files from run directories and produces:
- run_summary with mean, stddev, min, max for each metric
- delta between with_skill and without_skill configurations
Usage:
python aggregate_benchmark.py <benchmark_dir>
Example:
python aggregate_benchmark.py benchmarks/2026-01-15T10-30-00/
The script supports two directory layouts:
Workspace layout (from skill-creator iterations):
<benchmark_dir>/
+-- eval-N/
+-- with_skill/
| +-- run-1/grading.json
| +-- run-2/grading.json
+-- without_skill/
+-- run-1/grading.json
+-- run-2/grading.json
Legacy layout (with runs/ subdirectory):
<benchmark_dir>/
+-- runs/
+-- eval-N/
+-- with_skill/
| +-- run-1/grading.json
+-- without_skill/
+-- run-1/grading.json
"""
import argparse
import json
import math
import sys
from datetime import datetime, timezone
from pathlib import Path
def calculate_stats(values: list[float]) -> dict:
"""Calculate mean, stddev, min, max for a list of values."""
if not values:
return {"mean": 0.0, "stddev": 0.0, "min": 0.0, "max": 0.0}
n = len(values)
    # ... (continues for ~350 lines total)

#!/usr/bin/env python3
"""Run trigger evaluation for a skill description.
Tests whether a skill's description causes Claude to trigger (read the skill)
for a set of queries. Outputs results as JSON.
"""
import argparse
import json
import os
import select
import subprocess
import sys
import time
import uuid
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path
from scripts.utils import parse_skill_md
def find_project_root() -> Path:
"""Find the project root by walking up from cwd looking for .claude/.
Mimics how Claude Code discovers its project root, so the command file
we create ends up where claude -p will look for it.
"""
current = Path.cwd()
for parent in [current, *current.parents]:
if (parent / ".claude").is_dir():
return parent
return current
def run_single_query(
query: str,
skill_name: str,
skill_description: str,
timeout: int,
project_root: str,
model: str | None = None,
) -> bool:
"""Run a single query and return whether the skill was triggered.
Creates a command file in .claude/commands/ so it appears in Claude's
available_skills list, then runs `claude -p` with the raw query.
Uses --include-partial-messages to detect triggering early from
stream events (content_block_start) rather than waiting for the
full assistant message, which only arrives after tool execution.
"""
    # ... (continues for ~250 lines total)

#!/usr/bin/env python3
"""Run the eval + improve loop until all pass or max iterations reached.
Combines run_eval.py and improve_description.py in a loop, tracking history
and returning the best description found. Supports train/test split to prevent
overfitting.
"""
import argparse
import json
import random
import sys
import tempfile
import time
import webbrowser
from pathlib import Path
from scripts.generate_report import generate_html
from scripts.improve_description import improve_description
from scripts.run_eval import find_project_root, run_eval
from scripts.utils import parse_skill_md
def split_eval_set(eval_set: list[dict], holdout: float, seed: int = 42) -> tuple[list[dict], list[dict]]:
"""Split eval set into train and test sets, stratified by should_trigger."""
random.seed(seed)
# Separate by should_trigger
trigger = [e for e in eval_set if e["should_trigger"]]
no_trigger = [e for e in eval_set if not e["should_trigger"]]
# Shuffle each group
random.shuffle(trigger)
random.shuffle(no_trigger)
# Calculate split points
n_trigger_test = max(1, int(len(trigger) * holdout))
n_no_trigger_test = max(1, int(len(no_trigger) * holdout))
# Split
test_set = trigger[:n_trigger_test] + no_trigger[:n_no_trigger_test]
train_set = trigger[n_trigger_test:] + no_trigger[n_no_trigger_test:]
return train_set, test_set
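# Usage sketch (illustrative comment, not part of this script): the
# max(1, ...) floor guarantees at least one test query per stratum whenever
# a stratum is non-empty. With 10 should_trigger and 5 non-trigger queries
# at holdout=0.2:
#
#     evals = [{"query": f"q{i}", "should_trigger": i % 3 != 0} for i in range(15)]
#     train, test = split_eval_set(evals, holdout=0.2)
#     # test gets max(1, int(10*0.2)) = 2 trigger + max(1, int(5*0.2)) = 1 non-trigger
#     assert (len(train), len(test)) == (12, 3)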
def run_loop(
eval_set: list[dict],
skill_path: Path,
description_override: str | None,
    # ... (continues for ~350 lines total)

#!/usr/bin/env python3
"""Improve a skill description based on eval results.
Takes eval results (from run_eval.py) and generates an improved description
by calling `claude -p` as a subprocess (same auth pattern as run_eval.py —
uses the session's Claude Code auth, no separate ANTHROPIC_API_KEY needed).
"""
import argparse
import json
import os
import re
import subprocess
import sys
from pathlib import Path
from scripts.utils import parse_skill_md
def _call_claude(prompt: str, model: str | None, timeout: int = 300) -> str:
"""Run `claude -p` with the prompt on stdin and return the text response.
Prompt goes over stdin (not argv) because it embeds the full SKILL.md
body and can easily exceed comfortable argv length.
"""
cmd = ["claude", "-p", "--output-format", "text"]
if model:
cmd.extend(["--model", model])
# Remove CLAUDECODE env var to allow nesting claude -p inside a
# Claude Code session. The guard is for interactive terminal conflicts;
# programmatic subprocess usage is safe. Same pattern as run_eval.py.
env = {k: v for k, v in os.environ.items() if k != "CLAUDECODE"}
result = subprocess.run(
cmd,
input=prompt,
capture_output=True,
text=True,
env=env,
timeout=timeout,
)
if result.returncode != 0:
raise RuntimeError(
f"claude -p exited {result.returncode}\nstderr: {result.stderr}"
)
return result.stdout
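The env-filtering and stdin-prompt pattern in `_call_claude` can be exercised without the `claude` CLI; this sketch uses `cat` as a stand-in subprocess (assumes a POSIX environment — `cat` echoes stdin back the way `claude -p --output-format text` prints its response):

```python
import os
import subprocess

# Simulate running inside a Claude Code session by setting the guard variable.
os.environ["CLAUDECODE"] = "1"

# Same filtering as _call_claude: drop CLAUDECODE so a nested `claude -p`
# subprocess would not trip the interactive-session guard.
env = {k: v for k, v in os.environ.items() if k != "CLAUDECODE"}
assert "CLAUDECODE" not in env

# Prompt goes over stdin, mirroring the argv-length rationale in the docstring.
result = subprocess.run(
    ["cat"], input="hello prompt", capture_output=True, text=True, env=env,
)
assert result.returncode == 0
assert result.stdout == "hello prompt"
```

Passing a filtered copy of `os.environ` (rather than mutating it) keeps the change scoped to the child process, so the parent session's guard stays intact.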
def improve_description(
# ... (continues for ~250 lines total)

#!/usr/bin/env python3
"""Generate an HTML report from run_loop.py output.
Takes the JSON output from run_loop.py and generates a visual HTML report
showing each description attempt with check/x for each test case.
Distinguishes between train and test queries.
"""
import argparse
import html
import json
import sys
from pathlib import Path
def generate_html(data: dict, auto_refresh: bool = False, skill_name: str = "") -> str:
"""Generate HTML report from loop output data. If auto_refresh is True, adds a meta refresh tag."""
history = data.get("history", [])
holdout = data.get("holdout", 0)
title_prefix = html.escape(skill_name + " — ") if skill_name else ""
# Get all unique queries from train and test sets, with should_trigger info
train_queries: list[dict] = []
test_queries: list[dict] = []
if history:
for r in history[0].get("train_results", history[0].get("results", [])):
train_queries.append({"query": r["query"], "should_trigger": r.get("should_trigger", True)})
if history[0].get("test_results"):
for r in history[0].get("test_results", []):
test_queries.append({"query": r["query"], "should_trigger": r.get("should_trigger", True)})
refresh_tag = ' <meta http-equiv="refresh" content="5">\n' if auto_refresh else ""
html_parts = ["""<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
""" + refresh_tag + """ <title>""" + title_prefix + """Skill Description Optimization</title>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Poppins:wght@500;600&family=Lora:wght@400;500&display=swap" rel="stylesheet">
<style>
body {
font-family: 'Lora', Georgia, serif;
max-width: 100%;
margin: 0 auto;
padding: 20px;
background: #faf9f5;
color: #141413;
}
# ... (continues for ~350 lines total)

#!/usr/bin/env python3
"""
Skill Packager - Creates a distributable .skill archive from a skill folder
Usage:
python scripts/package_skill.py <path/to/skill-folder> [output-directory]
Example:
python scripts/package_skill.py skills/public/my-skill
python scripts/package_skill.py skills/public/my-skill ./dist
"""
import fnmatch
import sys
import zipfile
from pathlib import Path
from scripts.quick_validate import validate_skill
# Patterns to exclude when packaging skills.
EXCLUDE_DIRS = {"__pycache__", "node_modules"}
EXCLUDE_GLOBS = {"*.pyc"}
EXCLUDE_FILES = {".DS_Store"}
# Directories excluded only at the skill root (not when nested deeper).
ROOT_EXCLUDE_DIRS = {"evals"}
def should_exclude(rel_path: Path) -> bool:
"""Check if a path should be excluded from packaging."""
parts = rel_path.parts
if any(part in EXCLUDE_DIRS for part in parts):
return True
# rel_path is relative to skill_path.parent, so parts[0] is the skill
# folder name and parts[1] (if present) is the first subdir.
if len(parts) > 1 and parts[1] in ROOT_EXCLUDE_DIRS:
return True
name = rel_path.name
if name in EXCLUDE_FILES:
return True
return any(fnmatch.fnmatch(name, pat) for pat in EXCLUDE_GLOBS)
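A runnable sketch of the exclusion rules (inlined so it works standalone), using hypothetical paths rooted at a skill folder named `my-skill` — note how `evals/` is only dropped at the skill root, since `parts[0]` is the skill folder itself:

```python
import fnmatch
from pathlib import Path

EXCLUDE_DIRS = {"__pycache__", "node_modules"}
EXCLUDE_GLOBS = {"*.pyc"}
EXCLUDE_FILES = {".DS_Store"}
ROOT_EXCLUDE_DIRS = {"evals"}

def should_exclude(rel_path: Path) -> bool:
    # rel_path is relative to the skill folder's parent, so parts[0] is the
    # skill folder name and parts[1] (if present) is the first subdirectory.
    parts = rel_path.parts
    if any(part in EXCLUDE_DIRS for part in parts):
        return True
    if len(parts) > 1 and parts[1] in ROOT_EXCLUDE_DIRS:
        return True
    if rel_path.name in EXCLUDE_FILES:
        return True
    return any(fnmatch.fnmatch(rel_path.name, pat) for pat in EXCLUDE_GLOBS)

assert should_exclude(Path("my-skill/evals/cases.json"))           # root evals/ dropped
assert not should_exclude(Path("my-skill/references/evals/x.md"))  # nested evals/ kept
assert should_exclude(Path("my-skill/scripts/__pycache__/m.cpython-312.pyc"))
assert not should_exclude(Path("my-skill/SKILL.md"))
```

Bytecode caches are excluded two ways at once: `__pycache__` matches `EXCLUDE_DIRS` anywhere in the path, and stray `*.pyc` files outside a cache directory still match `EXCLUDE_GLOBS`.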
def package_skill(skill_path, output_dir=None):
"""
Package a skill folder into a .skill file.
Args:
skill_path: Path to the skill folder
output_dir: Optional output directory for the .skill file (defaults to current directory)
Returns:
# ... (continues for ~100 lines total)