This playbook walks you through extending the Automated Design of Agentic Systems (ADAS) framework to new domains where agent performance is evaluated by a Large Language Model (LLM) acting as a judge. It assumes familiarity with Python, prompt engineering, and running ADAS search.py pipelines.
- Meta Agent Search loop: Each domain folder (for example `_mmlu/`) ships a `search.py` that orchestrates agent generation, evaluation, and archiving. The meta-agent proposes Python snippets that implement `AgentSystem.forward(...)`.
- Evaluation hook: `evaluate_forward_fn(args, forward_str)` dynamically injects the candidate `forward` method, runs it across the task suite, and converts raw results into a fitness string via `bootstrap_confidence_interval`.
- Artifacts: Runs append entries to `results/<expr_name>_run_archive.json`, combining candidate metadata, code, and fitness. Treat the archive as the single source of truth for downstream inspection (an illustrative entry follows the next paragraph).
Understanding these building blocks makes it straightforward to swap the default benchmark scoring for an LLM-judged signal without changing the core search machinery.
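For orientation, the sketch below shows what a single archive entry might contain. Only `name`, `code`, and `fitness` are relied on by the tooling later in this playbook; the remaining fields are illustrative metadata you may or may not record.

```python
import json

# Illustrative archive entry; adapt field names to whatever your search.py writes.
entry = {
    "name": "gen_3_candidate_1",
    "code": "def forward(self, taskInfo):\n    ...",
    # Fitness string as produced by bootstrap_confidence_interval (values are made up).
    "fitness": "95% Bootstrap Confidence Interval: (61.2%, 72.4%), Median: 66.8%",
    "generation": 3,
    "prompt_hash": "c0ffee...",  # optional extra metadata discussed later
}

print(json.dumps(entry, indent=2))
```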
- Base setup:

  ```bash
  conda create -n adas python=3.11
  conda activate adas
  pip install -r requirements.txt
  ```
- Framework-specific deps: Install any additional SDKs or HTTP clients required for your runtime (LangChain, LiteLLM, Azure OpenAI, Bedrock, etc.) and pin versions for reproducibility.
- Credentials: Export API keys for both the candidate agents and the LLM judge (if different). Separate keys or projects ease quota tracking and blast-radius containment.
- Observability hooks (optional): Enable tracing/logging (OpenTelemetry, LangSmith, custom JSONL) before long searches so judge rationales and agent actions are captured from the outset.
- Clone a template: Copy an existing domain folder (e.g. `_mmlu/`) to `_my_domain/` and update imports to point at the new package name.
- Dataset ingestion: Swap in loaders (CSV, JSONL, API adapters, synthetic generators) that yield the task objects your agent must solve.
- Prompt assets: Update formatting utilities (such as `format_multichoice_question`) and prompt strings so the meta-agent sees accurate context.
- Agent primitives: Curate a starter library in `*_prompt.py` containing baseline strategies (CoT, debate, reflexion, tool-use helpers) relevant to your domain.
- Configuration defaults: Adjust argparse defaults (dataset paths, worker counts, judge parameters) so `python _my_domain/search.py` "just works" (see the sketch after this list).
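A minimal sketch of those defaults for a hypothetical `_my_domain/search.py`. Flag names follow the existing ADAS domains where possible; `--judge_model` and `--judge_temperature` are additions assumed by the judge integration described below.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    # Defaults chosen so `python _my_domain/search.py` runs without extra flags.
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_filename", type=str, default="dataset/my_domain_tasks.jsonl")
    parser.add_argument("--valid_size", type=int, default=128)
    parser.add_argument("--test_size", type=int, default=800)
    parser.add_argument("--shuffle_seed", type=int, default=0)
    parser.add_argument("--max_workers", type=int, default=8)
    parser.add_argument("--n_generation", type=int, default=30)
    parser.add_argument("--expr_name", type=str, default="my_domain_results")
    # Judge-specific knobs introduced by this playbook (names are illustrative).
    parser.add_argument("--judge_model", type=str, default="gpt-4.1-mini")
    parser.add_argument("--judge_temperature", type=float, default=0.0)
    return parser


if __name__ == "__main__":
    print(build_arg_parser().parse_args())
```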
- Judge prompt: Craft system/user prompts that explain the scoring rubric, expected answer format, and tie-break rules. Keep wording deterministic across evaluations.
- Output schema: Force the judge to emit JSON (reuse ADAS' `FORMAT_INST`) with at least a numeric `score` and an optional `rationale` or `verdict` field.
- Calibration: Sanity-check the judge on a labeled subset to ensure scores align with human expectations before launching large searches (see the sketch after this list).
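A minimal calibration sketch, assuming a small hand-graded subset and the `get_judge_score` helper shown later in this playbook; the `labeled_examples` shape and the [0, 1] human scores are assumptions to adapt to your labeling setup.

```python
import statistics


def calibrate_judge(labeled_examples, get_judge_score):
    """Compare judge scores against human labels on a small hand-graded subset.

    labeled_examples: iterable of (task, agent_output, human_score) tuples,
    with human scores already normalized to [0, 1].
    """
    diffs = []
    for task, agent_output, human_score in labeled_examples:
        judge_score, _rationale = get_judge_score(task, agent_output)
        diffs.append(abs(judge_score - human_score))
    mean_abs_error = statistics.mean(diffs)
    print(f"Judge vs. human mean absolute error: {mean_abs_error:.3f} over {len(diffs)} examples")
    return mean_abs_error
```

Large disagreements usually point at rubric ambiguity; tighten the judge prompt before scaling the search.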
Key structural changes inside your domain’s search.py:
```python
from judge import get_judge_score  # helper wrapping your LLM judge API

SCORE_SCALE = (0.0, 1.0)  # expected range of normalized judge scores


def evaluate_forward_fn(args, forward_str):
    # Inject the candidate forward() implementation into AgentSystem.
    namespace = {}
    exec(forward_str, globals(), namespace)
    if len(namespace) != 1:
        raise AssertionError("Provide exactly one callable in the candidate code snippet.")
    forward = next(iter(namespace.values()))
    if not callable(forward):
        raise AssertionError("Payload must define a callable forward method.")
    setattr(AgentSystem, "forward", forward)

    # Score every task with the LLM judge instead of a benchmark metric.
    tasks = load_tasks(args)
    agent_system = AgentSystem()
    raw_scores = []
    judge_rationales = []
    for task in tasks:
        agent_output = agent_system.forward(task)
        score, rationale = get_judge_score(task, agent_output)
        raw_scores.append(score)
        judge_rationales.append(rationale)

    fitness = format_fitness(raw_scores)
    persist_eval_log(args, tasks, raw_scores, judge_rationales)
    return fitness
```

- Threading: ADAS defaults to `ThreadPoolExecutor`; throttle concurrency or batch judge calls if your API rate-limits aggressively.
- Retries: Wrap judge calls with `backoff.on_exception` (already used for OpenAI calls) to ride out transient failures.
- Score normalization: Convert judge outputs to `[0, 1]` floats so `bootstrap_confidence_interval` yields intuitive percentages (see the sketch after this list).
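If your rubric scores on a different range (say 1 to 10), a small helper keeps the conversion explicit; the bounds below are assumptions you should match to your judge prompt.

```python
def normalize_score(raw_score: float, low: float = 1.0, high: float = 10.0) -> float:
    """Map a judge score from [low, high] onto [0, 1], clamping out-of-range values."""
    normalized = (float(raw_score) - low) / (high - low)
    return min(1.0, max(0.0, normalized))


assert normalize_score(10) == 1.0
assert normalize_score(1) == 0.0
assert abs(normalize_score(7) - 0.6667) < 1e-3
```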
- Reuse `bootstrap_confidence_interval(raw_scores)` for consistency with existing logs.
- If the judge returns multidimensional metrics (accuracy, safety, style), aggregate to a primary score before bootstrapping and log the breakdown separately (one aggregation option is sketched below).
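One way to collapse a multidimensional verdict into the single score ADAS optimizes; the dimension names and weights below are illustrative and should come from your rubric.

```python
# Illustrative weights; agree on them with product stakeholders and version them.
WEIGHTS = {"accuracy": 0.6, "safety": 0.3, "style": 0.1}


def aggregate_judge_metrics(metrics: dict) -> float:
    """Weighted sum of per-dimension scores, each assumed to be in [0, 1]."""
    return sum(weight * float(metrics.get(name, 0.0)) for name, weight in WEIGHTS.items())


primary = aggregate_judge_metrics({"accuracy": 0.9, "safety": 1.0, "style": 0.5})
print(round(primary, 4))  # 0.89; log the full breakdown separately for analysis
```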
- Judge transcripts: Persist prompts, responses, timestamps, and model versions (for example `results/<expr_name>_judge_logs.jsonl`).
- Seed control: Expose `shuffle_seed`, judge temperature, and any RNG sources in argparse so reruns remain deterministic.
- Version tagging: Record library versions (e.g. `openai==`, `langchain==`) in archive metadata to make comparisons across runs trustworthy (see the helper after this list).
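A small helper for recording library versions into archive metadata; it uses the standard-library `importlib.metadata`, and the package list is an assumption to adapt to your stack.

```python
from importlib import metadata


def collect_versions(packages=("openai", "backoff", "numpy")):
    """Return {package: installed version} for the libraries that matter to this run."""
    versions = {}
    for package in packages:
        try:
            versions[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            versions[package] = "not installed"
    return versions


# Merge into each archive entry, e.g. entry["library_versions"] = collect_versions()
print(collect_versions())
```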
- Dry run: Invoke `search.py` with a tiny workload (`--n_generation 1 --valid_size 8`) to validate wiring and logging.
- Full search: Scale `n_generation`, dataset size, and workers after stability is proven. Monitor cost and rate limits throughout.
- Evaluation mode: After search, flip `SEARCHING_MODE = False` (already toggled in each script) to compute held-out metrics with the best designs.
- Resuming: If interrupted, `search.py` reloads `results/<expr_name>_run_archive.json` and continues. Ensure judge config hasn't drifted mid-run.
- Arbitrary code: Candidate agents execute Python. Apply resource limits, subprocess guards, or containerization when integrating beyond local experiments.
- Prompt injection: Harden judge prompts with explicit schema instructions and validate JSON before trusting scores.
- Quota isolation: Separate API credentials for candidate execution vs. judging to constrain blast radius and simplify budget audits.
- Audit trails: Archive agent outputs, judge rationales, and decision metadata for compliance reviews.
- ✅ Loaders emit the intended task structures and metadata.
- ✅ `evaluate_forward_fn` handles judge failures (timeouts, malformed JSON) gracefully.
- ✅ Judge decisions on a labeled subset match human expectations.
- ✅ Archives capture fitness plus any extra metadata you added (prompt hashes, judge versions, costs).
- ✅ Dry-run archives replay without re-querying the judge (cache hits confirmed).
- Rate limit errors: Reduce `max_workers`, introduce jitter in backoff sleeps, or cache repeated evaluations by `(task_id, agent_hash)`.
- Judge drift: Snapshot judge prompts and model IDs; consider majority voting across redundant judge calls.
- JSON parsing failures: Enforce schema validation and re-issue the request with clarifying instructions when fields are missing.
- Long execution times: Profile candidate `forward` logic, cap recursion depth, and enforce per-task timeouts (a soft-timeout sketch follows this list).
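A sketch of a soft per-task timeout using only the standard library. Threads cannot be forcibly killed, so a timed-out candidate keeps running in the background; use subprocesses or containers when you need hard isolation.

```python
import concurrent.futures


def run_with_timeout(forward_fn, task, timeout_s: float = 120.0):
    """Run a candidate forward() call with a soft per-task timeout."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(forward_fn, task)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return None  # treat as a failed task and assign a low score downstream
    finally:
        pool.shutdown(wait=False)
```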
```python
import backoff
import json
import os

import openai

client = openai.OpenAI(api_key=os.environ["LLM_JUDGE_KEY"])

JUDGE_SYSTEM_PROMPT = "You are a strict evaluator..."


@backoff.on_exception(backoff.expo, openai.RateLimitError)
def get_judge_score(task, agent_output):
    # Ask the judge for a JSON verdict and extract the numeric score.
    # format_judge_prompt is assumed to be provided by your domain module.
    user_prompt = format_judge_prompt(task, agent_output)
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.0,
        response_format={"type": "json_object"},
    )
    payload = json.loads(response.choices[0].message.content)
    score = float(payload["score"])
    rationale = payload.get("rationale", "")
    return score, rationale


def format_fitness(raw_scores):
    return bootstrap_confidence_interval(raw_scores)
```

- Pilot the playbook on a narrow slice of your target domain.
- Iterate on judge prompts and scoring mappings until metrics align with expert judgment.
- Scale search runs and document findings for the broader team.
- Revisit `get_prompt` and `get_reflexion_prompt` whenever scoring rubrics change. Highlight the behaviors the judge rewards (for example "cite source snippets in final answers").
- Capture prompt diffs in version control and annotate archives with prompt hashes so historical runs remain interpretable.
- Run ablation searches with and without new prompt clauses to ensure guidance improves convergence rather than blindly narrowing exploration.
- Treat the templates in `*_prompt.py` as a living design library stocked with judge-aligned strategies (self-evaluation loops, safety filters, retrieval augmentations).
- Tag each seed with capability notes (e.g. `"capabilities": ["cot", "critic"]`) to trace which behaviors drive successful offspring (an example seed follows this list).
- Retire or quarantine templates that consistently yield unsafe or low-scoring agents to avoid wasting compute budget.
- Instrument both candidate execution and judge calls with token counters (OpenAI usage metadata, custom proxies) and persist them per generation.
- Define soft/hard ceilings (`--max_tokens_candidates`, `--max_tokens_judge`) and abort gracefully when limits are exceeded, exporting the partial archive.
- For expensive judges, adopt a two-tier evaluation pipeline where heuristics or small models triage candidates before premium scoring (see the sketch after this list).
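A sketch of the two-tier pattern: a cheap screen filters obvious failures before spending premium judge tokens. `cheap_screen` and `premium_judge` are hypothetical callables you would supply, each returning a `(score, rationale)` tuple with scores in [0, 1].

```python
def tiered_judge(task, agent_output, cheap_screen, premium_judge, threshold: float = 0.3):
    """Escalate to the expensive judge only when the cheap screen is not confidently negative."""
    screen_score, screen_rationale = cheap_screen(task, agent_output)
    if screen_score < threshold:
        # Confidently bad: keep the cheap verdict and skip the premium call.
        return screen_score, f"[triage] {screen_rationale}"
    return premium_judge(task, agent_output)
```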
- Cache judge responses in a persistent store (SQLite, Redis, JSONL) keyed by `(task_id, agent_code_hash, prompt_hash)`; invalidate when rubrics change.
- Standardize randomness by setting `random.seed`, `np.random.seed`, and any framework-specific seeds inside both meta-agent and judge helpers (see the helper after this list).
- Disable temperature in agent/judge calls for deterministic runs or gate stochastic exploration behind explicit configuration flags.
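A minimal seeding helper, assuming NumPy is available alongside the standard-library `random` module; extend it with any framework-specific seeds your agents use.

```python
import os
import random

import numpy as np


def seed_everything(seed: int = 0) -> None:
    """Pin the RNG sources so reruns stay comparable."""
    random.seed(seed)
    np.random.seed(seed)
    # Only affects subprocesses spawned after this point, not the current interpreter.
    os.environ["PYTHONHASHSEED"] = str(seed)
```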
- Start a fresh `expr_name` whenever datasets, prompts, or judge logic change so archives remain homogeneous.
- Use the `bd` workflow to log experiment rationale, dependencies, and outcomes; link commits to issue IDs for forensic traceability.
- Periodically validate archives with `jsonschema` or the `archive_lint.py` helper (Section 13.5) before sharing results.
- Stream judge rationales, agent outputs, and fitness metrics to observability backends (OpenTelemetry, LangSmith, W&B, Grafana Loki).
- Define alert thresholds (median score drop >10% generation-over-generation, judge failure rate >5%, token burn rate spikes) and surface them via Slack/webhooks; a minimal check is sketched after this list.
- Track latency histograms for judge calls; spikes often precede quota exhaustion or networking issues.
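A minimal sketch of the generation-over-generation check, assuming you can pull per-generation score lists (for example from the metrics exporter in the tooling section below).

```python
import statistics


def should_alert(previous_scores, current_scores, max_drop: float = 0.10) -> bool:
    """Flag a generation whose median score drops by more than max_drop (relative)."""
    prev_median = statistics.median(previous_scores)
    curr_median = statistics.median(current_scores)
    if prev_median <= 0:
        return False
    return (prev_median - curr_median) / prev_median > max_drop


print(should_alert([0.70, 0.80, 0.75], [0.55, 0.60, 0.58]))  # True: median fell from 0.75 to 0.58
```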
- Wrap long searches in job orchestrators (Kubernetes Jobs, Argo Workflows, Slurm) with LLM-aware resource limits and centralized log collection.
- Implement heartbeat files or `/healthz` endpoints that external monitors can poll; treat missing heartbeats as failure states triggering restarts.
- Rehearse disaster recovery (judge outages, quota exhaustion) to ensure the search loop backs off without corrupting archives.
Each utility below fits naturally in a tools/ directory. Make the scripts executable (chmod +x) and adjust paths or model names before production use. Unless noted, they rely only on the Python standard library (3.11).
Compute SHA-256 hashes for prompt factories so archives can record immutable identifiers.
```python
#!/usr/bin/env python3
"""Compute reproducible hashes for ADAS prompt factories."""
import argparse
import hashlib
import importlib.util
import inspect
import pathlib
import sys


def load_module(path: pathlib.Path):
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise RuntimeError(f"Unable to load module from {path}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[path.stem] = module
    spec.loader.exec_module(module)
    return module


def hash_object(obj) -> str:
    source = inspect.getsource(obj)
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("prompt_file", type=pathlib.Path,
                        help="Path to *_prompt.py module")
    parser.add_argument("--targets", nargs="*", default=["get_prompt", "get_reflexion_prompt"],
                        help="Callable names to hash")
    args = parser.parse_args()
    module = load_module(args.prompt_file.resolve())
    for name in args.targets:
        fn = getattr(module, name, None)
        if fn is None:
            print(f"[WARN] {name} not found in {args.prompt_file}")
            continue
        prompt_hash = hash_object(fn)
        print(f"{name}: {prompt_hash}")


if __name__ == "__main__":
    main()
```

Usage:

```bash
python tools/prompt_hash.py _mmlu/mmlu_prompt.py
```

Record the reported hashes alongside archives or in bd issue notes.
Parse *_prompt.py files, extract top-level dictionaries containing a "code" field, and emit a manifest of seed agents.
```python
#!/usr/bin/env python3
"""Generate a manifest of seed agent templates."""
import argparse
import ast
import json
import pathlib


class SeedCollector(ast.NodeVisitor):
    def __init__(self):
        self.seeds = []

    def visit_Assign(self, node):
        if isinstance(node.value, ast.Dict):
            keys = [k.value for k in node.value.keys
                    if isinstance(k, ast.Constant) and isinstance(k.value, str)]
            if "code" in keys:
                name = None
                if len(node.targets) == 1 and isinstance(node.targets[0], ast.Name):
                    name = node.targets[0].id
                entry = {
                    "var_name": name or "<anonymous>",
                    "fields": keys,
                    "lineno": node.lineno,
                }
                self.seeds.append(entry)
        self.generic_visit(node)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("prompt_file", type=pathlib.Path)
    parser.add_argument("--out", type=pathlib.Path, default=None,
                        help="Optional manifest output path")
    args = parser.parse_args()
    source = args.prompt_file.read_text(encoding="utf-8")
    tree = ast.parse(source, filename=str(args.prompt_file))
    collector = SeedCollector()
    collector.visit(tree)
    manifest = {
        "prompt_file": str(args.prompt_file),
        "seeds": collector.seeds,
    }
    payload = json.dumps(manifest, indent=2)
    if args.out:
        args.out.write_text(payload, encoding="utf-8")
    else:
        print(payload)


if __name__ == "__main__":
    main()
```

Usage:

```bash
python tools/seed_registry.py _mmlu/mmlu_prompt.py --out manifests/mmlu_seeds.json
```

Annotate the resulting JSON with manual capability tags (for example `"capabilities": ["cot", "critic"]`).
Wrap a domain module, patch OpenAI chat completions to accumulate usage, and abort when token ceilings are exceeded.
```python
#!/usr/bin/env python3
"""Run ADAS searches with token-based budget guardrails."""
import argparse
import json
import runpy
import sys
import threading

from openai.resources.chat.completions import Completions

lock = threading.Lock()
usage_totals = {"prompt_tokens": 0, "completion_tokens": 0}


def patched_create(original_create, max_prompt, max_completion):
    def wrapper(*args, **kwargs):
        response = original_create(*args, **kwargs)
        # openai>=1.0 returns a CompletionUsage object, not a dict.
        usage = getattr(response, "usage", None)
        with lock:
            usage_totals["prompt_tokens"] += getattr(usage, "prompt_tokens", 0) or 0
            usage_totals["completion_tokens"] += getattr(usage, "completion_tokens", 0) or 0
            if max_prompt is not None and usage_totals["prompt_tokens"] > max_prompt:
                raise RuntimeError(
                    f"Prompt token budget exceeded: {usage_totals['prompt_tokens']}>{max_prompt}"
                )
            if max_completion is not None and usage_totals["completion_tokens"] > max_completion:
                raise RuntimeError(
                    f"Completion token budget exceeded: {usage_totals['completion_tokens']}>{max_completion}"
                )
        return response
    return wrapper


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("module", help="Module path (e.g. _mmlu.search)")
    parser.add_argument("--max-prompt-tokens", type=int, default=None)
    parser.add_argument("--max-completion-tokens", type=int, default=None)
    parser.add_argument("--metadata-out", type=str, default=None,
                        help="Optional path to dump usage totals")
    # Everything after a literal "--" on the command line is forwarded to the module.
    args, module_args = parser.parse_known_args()
    if module_args and module_args[0] == "--":
        module_args = module_args[1:]

    original_create = Completions.create
    Completions.create = patched_create(
        original_create,
        args.max_prompt_tokens,
        args.max_completion_tokens,
    )
    try:
        sys.argv = [args.module] + module_args
        runpy.run_module(args.module, run_name="__main__")
    finally:
        Completions.create = original_create
        if args.metadata_out:
            with open(args.metadata_out, "w", encoding="utf-8") as fh:
                json.dump(usage_totals, fh, indent=2)
        print(json.dumps(usage_totals, indent=2))


if __name__ == "__main__":
    main()
```

Usage:

```bash
python tools/run_with_budget.py _mmlu.search --max-prompt-tokens 1_000_000 -- --n_generation 5
```

Extend the wrapper if you invoke other OpenAI endpoints (responses, Assistants). For non-OpenAI providers, patch the respective SDK methods.
Maintain a simple SQLite-backed cache of judge results keyed by `(task_id, agent_code, prompt_hash)` to avoid redundant evaluations.
```python
#!/usr/bin/env python3
"""Cache and retrieve judge responses."""
import argparse
import hashlib
import json
import pathlib
import sqlite3
from contextlib import closing

SCHEMA = """
CREATE TABLE IF NOT EXISTS judge_cache (
    key TEXT PRIMARY KEY,
    task_id TEXT NOT NULL,
    agent_hash TEXT NOT NULL,
    prompt_hash TEXT NOT NULL,
    score REAL NOT NULL,
    rationale TEXT,
    raw_response TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
"""


def ensure_db(conn):
    conn.execute(SCHEMA)
    conn.commit()


def make_key(task_id, agent_code, prompt_hash):
    digest = hashlib.sha256(agent_code.encode("utf-8")).hexdigest()
    return f"{task_id}:{digest}:{prompt_hash}"


def cmd_put(conn, args):
    key = make_key(args.task_id, args.agent_code, args.prompt_hash)
    conn.execute(
        "REPLACE INTO judge_cache (key, task_id, agent_hash, prompt_hash, score, rationale, raw_response)"
        " VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            key,
            args.task_id,
            hashlib.sha256(args.agent_code.encode("utf-8")).hexdigest(),
            args.prompt_hash,
            args.score,
            args.rationale,
            args.raw_response,
        ),
    )
    conn.commit()
    print("stored", key)


def cmd_get(conn, args):
    key = make_key(args.task_id, args.agent_code, args.prompt_hash)
    row = conn.execute(
        "SELECT score, rationale, raw_response FROM judge_cache WHERE key = ?",
        (key,),
    ).fetchone()
    if row is None:
        print("<MISS>")
    else:
        print(json.dumps({"score": row[0], "rationale": row[1], "raw_response": row[2]}, indent=2))


def cmd_purge(conn, args):
    conn.execute("DELETE FROM judge_cache WHERE prompt_hash = ?", (args.prompt_hash,))
    conn.commit()
    print("purged prompt_hash", args.prompt_hash)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("db", type=pathlib.Path, help="SQLite database path")
    sub = parser.add_subparsers(dest="command", required=True)
    put = sub.add_parser("put")
    put.add_argument("task_id")
    put.add_argument("agent_code")
    put.add_argument("prompt_hash")
    put.add_argument("score", type=float)
    put.add_argument("--rationale", default="")
    put.add_argument("--raw-response", default="")
    get = sub.add_parser("get")
    get.add_argument("task_id")
    get.add_argument("agent_code")
    get.add_argument("prompt_hash")
    purge = sub.add_parser("purge")
    purge.add_argument("prompt_hash")
    args = parser.parse_args()
    with closing(sqlite3.connect(args.db)) as conn:
        ensure_db(conn)
        if args.command == "put":
            cmd_put(conn, args)
        elif args.command == "get":
            cmd_get(conn, args)
        elif args.command == "purge":
            cmd_purge(conn, args)


if __name__ == "__main__":
    main()
```

Usage:

```bash
python tools/judge_cache.py cache/judge.sqlite put TASK_ID "<agent code>" <PROMPT_HASH> 0.8 --rationale "Looks good"
```

Integrate the CLI with your judge helper to read before calling the live model and write after a cache miss.
Lint run archives for required metadata before sharing or resuming experiments.
```python
#!/usr/bin/env python3
"""Lint ADAS run archives."""
import argparse
import json
import pathlib

REQUIRED_FIELDS = {"name", "code", "fitness"}
RECOMMENDED_FIELDS = {"prompt_hash", "judge_model", "cost_tokens"}


def lint(path: pathlib.Path):
    data = json.loads(path.read_text(encoding="utf-8"))
    for idx, entry in enumerate(data):
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            yield f"[{path}] entry {idx} missing required fields: {sorted(missing)}"
        missing_rec = RECOMMENDED_FIELDS - entry.keys()
        if missing_rec:
            yield f"[{path}] entry {idx} missing recommended fields: {sorted(missing_rec)}"


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("archives", nargs="+", type=pathlib.Path)
    args = parser.parse_args()
    has_error = False
    for archive in args.archives:
        for msg in lint(archive):
            has_error = True
            print(msg)
    if has_error:
        raise SystemExit(1)


if __name__ == "__main__":
    main()
```

Usage:

```bash
python tools/archive_lint.py results/mmlu_gpt3.5_results_run_archive.json
```

Extend `RECOMMENDED_FIELDS` once you standardize additional metadata (prompt hashes, judge versions, token cost).
Emit per-generation metrics to JSONL for ingestion by dashboards (Grafana, Loki, Weights & Biases, etc.).
```python
#!/usr/bin/env python3
"""Export generation-level metrics from ADAS archives."""
import argparse
import json
import pathlib


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("archive", type=pathlib.Path)
    parser.add_argument("--out", type=pathlib.Path, required=True,
                        help="Metrics JSONL output file")
    args = parser.parse_args()
    entries = json.loads(args.archive.read_text(encoding="utf-8"))
    with args.out.open("w", encoding="utf-8") as fh:
        for idx, entry in enumerate(entries):
            payload = {
                "generation": entry.get("generation", idx),
                "fitness": entry.get("fitness"),
                "prompt_hash": entry.get("prompt_hash"),
                "judge_model": entry.get("judge_model"),
                "cost_tokens": entry.get("cost_tokens"),
                "name": entry.get("name"),
            }
            fh.write(json.dumps(payload) + "\n")


if __name__ == "__main__":
    main()
```

Usage:

```bash
python tools/run_metrics_exporter.py results/mmlu_gpt3.5_results_run_archive.json \
    --out telemetry/mmlu_metrics.jsonl
```

Ship the JSONL to your telemetry backend and layer alert rules on the metrics described in Section 12.6.
Poll a heartbeat JSON file emitted by long searches and alert (stdout or webhook) if updates stall.
```python
#!/usr/bin/env python3
"""Monitor ADAS search heartbeats."""
import argparse
import json
import pathlib
import time
import urllib.request


def load_heartbeat(path: pathlib.Path):
    if not path.exists():
        return None
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return None


def notify(webhook: str, message: str):
    if webhook:
        req = urllib.request.Request(
            webhook,
            data=json.dumps({"text": message}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=10)
    else:
        print(message)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("heartbeat_file", type=pathlib.Path)
    parser.add_argument("--interval", type=int, default=60, help="Polling interval in seconds")
    parser.add_argument("--timeout", type=int, default=600, help="Max allowable silence in seconds")
    parser.add_argument("--webhook", type=str, default="", help="Optional Slack/MS Teams webhook URL")
    args = parser.parse_args()
    last_seen = time.time()
    while True:
        heartbeat = load_heartbeat(args.heartbeat_file)
        if heartbeat:
            last_seen = heartbeat.get("timestamp", time.time())
        if time.time() - last_seen > args.timeout:
            notify(args.webhook, f"Heartbeat stale (> {args.timeout}s) for {args.heartbeat_file}")
            last_seen = time.time()
        time.sleep(args.interval)


if __name__ == "__main__":
    main()
```

Usage:

```bash
python tools/search_heartbeat.py results/mmlu_heartbeat.json --interval 30 --timeout 300 --webhook https://hooks.slack.com/services/...
```

Emit heartbeat files from the search loop with a helper such as:
```python
from pathlib import Path
import json
import time


def write_heartbeat(path, generation):
    payload = {"generation": generation, "timestamp": time.time()}
    Path(path).write_text(json.dumps(payload), encoding="utf-8")
```

Call `write_heartbeat(args.heartbeat_file, n)` once per generation to keep the monitor satisfied.
The path from a Product Requirements Document (PRD) to an ADAS-ready domain is predictable once you ground qualitative goals in measurable evaluations. Use the following checklist while working with product stakeholders.
- Surface KPIs and acceptance tests: Extract explicit success metrics from the PRD (e.g. “first-response accuracy ≥ 90%,” “escalate to human within 2 minutes when confidence <0.4”).
- Select evaluation primitives: Decide which metrics can be computed programmatically versus those requiring a judge. When metrics are multi-dimensional, identify the primary optimization target.
- Translate to fitness: Sketch the function that converts raw judge/deterministic outcomes into the scalar score ADAS expects (percentage, weighted composite, penalty-based).
- Decompose user journeys: Break PRD scenarios into atomic tasks—prompts, tool invocations, state transitions—that a single agent interaction can handle.
- Define schemas: Document required fields (request text, metadata, tool handles) and the persistence format (CSV, JSONL, generator). Ensure they map cleanly to the `taskInfo` structure consumed by agents (an illustrative record follows this list).
- Prototype samples: Build a seed set covering happy paths, edge cases, and adversarial inputs to vet formatting and judge prompts before scaling.
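An illustrative task record for a hypothetical support-triage domain; every field name here is an assumption to replace with whatever your loaders emit and your `taskInfo` structure expects.

```python
import json

# Hypothetical JSONL record; one such object per line in the dataset file.
example_task = {
    "task_id": "ticket-00042",
    "request_text": "My invoice shows a duplicate charge for March.",
    "metadata": {"channel": "email", "tier": "enterprise"},
    "tools": ["billing_lookup", "refund_api"],
    "expected_behavior": "Escalate to a human if confidence < 0.4.",
}

print(json.dumps(example_task))
```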
- Deterministic outcomes: Encode calculation logic, normalization rules, and tolerance bands derived from the PRD. Package them with tasks so evaluation stays local.
- LLM-judge rubrics: Translate qualitative acceptance criteria into judge prompts, scoring scales, and exemplars of acceptable vs. unacceptable behavior.
- Drift detection: Version judge prompts/models and schedule regression checks against human-labeled samples.
- Latency & throughput: Capture ceilings or service-level objectives and bake them into prompts, meta-agent guidance, or runtime guardrails.
- Integrations: List external systems the agent must call, required authentication, and fallback behavior if integrations fail.
- Safety & compliance: Convert policy requirements (PII handling, regulatory phrasing) into explicit instructions and runtime checks.
- Align seeds with product tactics: Author baseline agents that exemplify the workflows envisioned in the PRD (retrieval-first, policy-checker, escalation playbooks).
- Annotate capabilities: Tag seeds with the specific PRD requirements they address so downstream analysis can show coverage.
- Refresh cadence: Schedule periodic reviews to demote ineffective seeds and inject new strategies as the product spec evolves.
- Domain README: Summarize task schema, evaluation pipeline, judge configuration, and operational constraints. Reference PRD sections for traceability.
- Metadata manifest: Produce machine-readable metadata (JSON/YAML) capturing dataset sources, prompt hashes, judge model versions, and budget limits.
- bd workflow: Log the translation effort as a `bd` issue, linking discovered blockers, validation status, and follow-up tasks.
- Dry-run score audit: Execute the new domain with representative seed agents and confirm outputs align with PRD expectations.
- Stakeholder sign-off: Demo judge rationales or deterministic scoring traces to product/QA partners before burning large budgets.
- Readiness checklist: Ensure Sections 14.1–14.6 are satisfied before launching meta-search at scale.