This playbook walks you through extending the Automated Design of Agentic Systems (ADAS) framework to new domains where agent performance is evaluated by a Large Language Model (LLM) acting as a judge. It assumes familiarity with Python, prompt engineering, and running ADAS search.py pipelines.
- Meta Agent Search loop: Each domain folder (for example `_mmlu/`) ships a `search.py` that orchestrates agent generation, evaluation, and archiving. The meta-agent proposes Python snippets that implement `AgentSystem.forward(...)`.
- Evaluation hook: `evaluate_forward_fn(args, forward_str)` dynamically injects the candidate `forward` method, runs it across the task suite, and converts raw results into a fitness string via `bootstrap_confidence_interval`.
- Artifacts: Runs append entries to `results/<expr_name>_run_archive.json`, combining candidate metadata, code, and fitness. Treat the archive as the single source of truth for downstream inspection (an illustrative entry follows the next paragraph).
Understanding these building blocks makes it straightforward to swap the default benchmark scoring for an LLM-judged signal without changing the core search machinery.
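For orientation, the sketch below shows what a single archive entry might contain. Only `name`, `code`, and `fitness` are relied on by the tooling later in this playbook; the remaining fields are illustrative metadata you may or may not record.

```python
import json

# Illustrative archive entry; adapt field names to whatever your search.py writes.
entry = {
    "name": "gen_3_candidate_1",
    "code": "def forward(self, taskInfo):\n    ...",
    # Fitness string as produced by bootstrap_confidence_interval (values are made up).
    "fitness": "95% Bootstrap Confidence Interval: (61.2%, 72.4%), Median: 66.8%",
    "generation": 3,
    "prompt_hash": "c0ffee...",  # optional extra metadata discussed later
}

print(json.dumps(entry, indent=2))
```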
- Base setup:

  ```bash
  conda create -n adas python=3.11
  conda activate adas
  pip install -r requirements.txt
  ```
- Framework-specific deps: Install any additional SDKs or HTTP clients required for your runtime (LangChain, LiteLLM, Azure OpenAI, Bedrock, etc.) and pin versions for reproducibility.
- Credentials: Export API keys for both the candidate agents and the LLM judge (if different). Separate keys or projects ease quota tracking and blast-radius containment.
- Observability hooks (optional): Enable tracing/logging (OpenTelemetry, LangSmith, custom JSONL) before long searches so judge rationales and agent actions are captured from the outset.
- Clone a template: Copy an existing domain folder (e.g. `_mmlu/`) to `_my_domain/` and update imports to point at the new package name.
- Dataset ingestion: Swap in loaders (CSV, JSONL, API adapters, synthetic generators) that yield the task objects your agent must solve.
- Prompt assets: Update formatting utilities (such as `format_multichoice_question`) and prompt strings so the meta-agent sees accurate context.
- Agent primitives: Curate a starter library in `*_prompt.py` containing baseline strategies (CoT, debate, reflexion, tool-use helpers) relevant to your domain.
- Configuration defaults: Adjust argparse defaults (dataset paths, worker counts, judge parameters) so `python _my_domain/search.py` "just works" (see the sketch after this list).
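A minimal sketch of those defaults for a hypothetical `_my_domain/search.py`. Flag names follow the existing ADAS domains where possible; `--judge_model` and `--judge_temperature` are additions assumed by the judge integration described below.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    # Defaults chosen so `python _my_domain/search.py` runs without extra flags.
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_filename", type=str, default="dataset/my_domain_tasks.jsonl")
    parser.add_argument("--valid_size", type=int, default=128)
    parser.add_argument("--test_size", type=int, default=800)
    parser.add_argument("--shuffle_seed", type=int, default=0)
    parser.add_argument("--max_workers", type=int, default=8)
    parser.add_argument("--n_generation", type=int, default=30)
    parser.add_argument("--expr_name", type=str, default="my_domain_results")
    # Judge-specific knobs introduced by this playbook (names are illustrative).
    parser.add_argument("--judge_model", type=str, default="gpt-4.1-mini")
    parser.add_argument("--judge_temperature", type=float, default=0.0)
    return parser


if __name__ == "__main__":
    print(build_arg_parser().parse_args())
```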
- Judge prompt: Craft system/user prompts that explain the scoring rubric, expected answer format, and tie-break rules. Keep wording deterministic across evaluations.
- Output schema: Force the judge to emit JSON (reuse ADAS' `FORMAT_INST`) with at least a numeric `score` and an optional `rationale` or `verdict` field.
- Calibration: Sanity-check the judge on a labeled subset to ensure scores align with human expectations before launching large searches (see the sketch after this list).
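A minimal calibration sketch, assuming a small hand-graded subset and the `get_judge_score` helper shown later in this playbook; the `labeled_examples` shape and the [0, 1] human scores are assumptions to adapt to your labeling setup.

```python
import statistics


def calibrate_judge(labeled_examples, get_judge_score):
    """Compare judge scores against human labels on a small hand-graded subset.

    labeled_examples: iterable of (task, agent_output, human_score) tuples,
    with human scores already normalized to [0, 1].
    """
    diffs = []
    for task, agent_output, human_score in labeled_examples:
        judge_score, _rationale = get_judge_score(task, agent_output)
        diffs.append(abs(judge_score - human_score))
    mean_abs_error = statistics.mean(diffs)
    print(f"Judge vs. human mean absolute error: {mean_abs_error:.3f} over {len(diffs)} examples")
    return mean_abs_error
```

Large disagreements usually point at rubric ambiguity; tighten the judge prompt before scaling the search.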
Key structural changes inside your domain’s search.py:
```python
from judge import get_judge_score  # helper wrapping your LLM judge API

SCORE_SCALE = (0.0, 1.0)  # expected range of normalized judge scores


def evaluate_forward_fn(args, forward_str):
    # Inject the candidate forward() implementation into AgentSystem.
    namespace = {}
    exec(forward_str, globals(), namespace)
    if len(namespace) != 1:
        raise AssertionError("Provide exactly one callable in the candidate code snippet.")
    forward = next(iter(namespace.values()))
    if not callable(forward):
        raise AssertionError("Payload must define a callable forward method.")
    setattr(AgentSystem, "forward", forward)

    # Score every task with the LLM judge instead of a benchmark metric.
    tasks = load_tasks(args)
    agent_system = AgentSystem()
    raw_scores = []
    judge_rationales = []
    for task in tasks:
        agent_output = agent_system.forward(task)
        score, rationale = get_judge_score(task, agent_output)
        raw_scores.append(score)
        judge_rationales.append(rationale)

    fitness = format_fitness(raw_scores)
    persist_eval_log(args, tasks, raw_scores, judge_rationales)
    return fitness
```

- Threading: ADAS defaults to `ThreadPoolExecutor`; throttle concurrency or batch judge calls if your API rate-limits aggressively.
- Retries: Wrap judge calls with `backoff.on_exception` (already used for OpenAI calls) to ride out transient failures.
- Score normalization: Convert judge outputs to `[0, 1]` floats so `bootstrap_confidence_interval` yields intuitive percentages (see the sketch after this list).
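If your rubric scores on a different range (say 1 to 10), a small helper keeps the conversion explicit; the bounds below are assumptions you should match to your judge prompt.

```python
def normalize_score(raw_score: float, low: float = 1.0, high: float = 10.0) -> float:
    """Map a judge score from [low, high] onto [0, 1], clamping out-of-range values."""
    normalized = (float(raw_score) - low) / (high - low)
    return min(1.0, max(0.0, normalized))


assert normalize_score(10) == 1.0
assert normalize_score(1) == 0.0
assert abs(normalize_score(7) - 0.6667) < 1e-3
```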
- Reuse `bootstrap_confidence_interval(raw_scores)` for consistency with existing logs.
- If the judge returns multidimensional metrics (accuracy, safety, style), aggregate to a primary score before bootstrapping and log the breakdown separately (one aggregation option is sketched below).
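One way to collapse a multidimensional verdict into the single score ADAS optimizes; the dimension names and weights below are illustrative and should come from your rubric.

```python
# Illustrative weights; agree on them with product stakeholders and version them.
WEIGHTS = {"accuracy": 0.6, "safety": 0.3, "style": 0.1}


def aggregate_judge_metrics(metrics: dict) -> float:
    """Weighted sum of per-dimension scores, each assumed to be in [0, 1]."""
    return sum(weight * float(metrics.get(name, 0.0)) for name, weight in WEIGHTS.items())


primary = aggregate_judge_metrics({"accuracy": 0.9, "safety": 1.0, "style": 0.5})
print(round(primary, 4))  # 0.89; log the full breakdown separately for analysis
```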
- Judge transcripts: Persist prompts, responses, timestamps, and model versions (for example `results/<expr_name>_judge_logs.jsonl`).
- Seed control: Expose `shuffle_seed`, judge temperature, and any RNG sources in argparse so reruns remain deterministic.
- Version tagging: Record library versions (e.g. `openai==`, `langchain==`) in archive metadata to make comparisons across runs trustworthy (see the helper after this list).
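A small helper for recording library versions into archive metadata; it uses the standard-library `importlib.metadata`, and the package list is an assumption to adapt to your stack.

```python
from importlib import metadata


def collect_versions(packages=("openai", "backoff", "numpy")):
    """Return {package: installed version} for the libraries that matter to this run."""
    versions = {}
    for package in packages:
        try:
            versions[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            versions[package] = "not installed"
    return versions


# Merge into each archive entry, e.g. entry["library_versions"] = collect_versions()
print(collect_versions())
```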
- Dry run: Invoke `search.py` with a tiny workload (`--n_generation 1 --valid_size 8`) to validate wiring and logging.
- Full search: Scale `n_generation`, dataset size, and workers after stability is proven. Monitor cost and rate limits throughout.
- Evaluation mode: After search, flip `SEARCHING_MODE = False` (already toggled in each script) to compute held-out metrics with the best designs.
- Resuming: If interrupted, `search.py` reloads `results/<expr_name>_run_archive.json` and continues. Ensure judge config hasn't drifted mid-run.
- Arbitrary code: Candidate agents execute Python. Apply resource limits, subprocess guards, or containerization when integrating beyond local experiments.
- Prompt injection: Harden judge prompts with explicit schema instructions and validate JSON before trusting scores.
- Quota isolation: Separate API credentials for candidate execution vs. judging to constrain blast radius and simplify budget audits.
- Audit trails: Archive agent outputs, judge rationales, and decision metadata for compliance reviews.
- ✅ Loaders emit the intended task structures and metadata.
- ✅ `evaluate_forward_fn` handles judge failures (timeouts, malformed JSON) gracefully.
- ✅ Judge decisions on a labeled subset match human expectations.
- ✅ Archives capture fitness plus any extra metadata you added (prompt hashes, judge versions, costs).
- ✅ Dry-run archives replay without re-querying the judge (cache hits confirmed).
- Rate limit errors: Reduce `max_workers`, introduce jitter in backoff sleeps, or cache repeated evaluations by `(task_id, agent_hash)`.
- Judge drift: Snapshot judge prompts and model IDs; consider majority voting across redundant judge calls.
- JSON parsing failures: Enforce schema validation and re-issue the request with clarifying instructions when fields are missing.
- Long execution times: Profile candidate `forward` logic, cap recursion depth, and enforce per-task timeouts (a soft-timeout sketch follows this list).
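A sketch of a soft per-task timeout using only the standard library. Threads cannot be forcibly killed, so a timed-out candidate keeps running in the background; use subprocesses or containers when you need hard isolation.

```python
import concurrent.futures


def run_with_timeout(forward_fn, task, timeout_s: float = 120.0):
    """Run a candidate forward() call with a soft per-task timeout."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(forward_fn, task)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return None  # treat as a failed task and assign a low score downstream
    finally:
        pool.shutdown(wait=False)
```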
```python
import backoff
import json
import os

import openai

client = openai.OpenAI(api_key=os.environ["LLM_JUDGE_KEY"])

JUDGE_SYSTEM_PROMPT = "You are a strict evaluator..."


@backoff.on_exception(backoff.expo, openai.RateLimitError)
def get_judge_score(task, agent_output):
    # Ask the judge for a JSON verdict and extract the numeric score.
    # format_judge_prompt is assumed to be provided by your domain module.
    user_prompt = format_judge_prompt(task, agent_output)
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": JUDGE_SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.0,
        response_format={"type": "json_object"},
    )
    payload = json.loads(response.choices[0].message.content)
    score = float(payload["score"])
    rationale = payload.get("rationale", "")
    return score, rationale


def format_fitness(raw_scores):
    return bootstrap_confidence_interval(raw_scores)
```

- Pilot the playbook on a narrow slice of your target domain.
- Iterate on judge prompts and scoring mappings until metrics align with expert judgment.
- Scale search runs and document findings for the broader team.
- Revisit `get_prompt` and `get_reflexion_prompt` whenever scoring rubrics change. Highlight the behaviors the judge rewards (for example "cite source snippets in final answers").
- Capture prompt diffs in version control and annotate archives with prompt hashes so historical runs remain interpretable.
- Run ablation searches with and without new prompt clauses to ensure guidance improves convergence rather than blindly narrowing exploration.
- Treat the templates in `*_prompt.py` as a living design library stocked with judge-aligned strategies (self-evaluation loops, safety filters, retrieval augmentations).
- Tag each seed with capability notes (e.g. `"capabilities": ["cot", "critic"]`) to trace which behaviors drive successful offspring (an example seed follows this list).
- Retire or quarantine templates that consistently yield unsafe or low-scoring agents to avoid wasting compute budget.
- Instrument both candidate execution and judge calls with token counters (OpenAI usage metadata, custom proxies) and persist them per generation.
- Define soft/hard ceilings (`--max_tokens_candidates`, `--max_tokens_judge`) and abort gracefully when limits are exceeded, exporting the partial archive.
- For expensive judges, adopt a two-tier evaluation pipeline where heuristics or small models triage candidates before premium scoring (see the sketch after this list).
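A sketch of the two-tier pattern: a cheap screen filters obvious failures before spending premium judge tokens. `cheap_screen` and `premium_judge` are hypothetical callables you would supply, each returning a `(score, rationale)` tuple with scores in [0, 1].

```python
def tiered_judge(task, agent_output, cheap_screen, premium_judge, threshold: float = 0.3):
    """Escalate to the expensive judge only when the cheap screen is not confidently negative."""
    screen_score, screen_rationale = cheap_screen(task, agent_output)
    if screen_score < threshold:
        # Confidently bad: keep the cheap verdict and skip the premium call.
        return screen_score, f"[triage] {screen_rationale}"
    return premium_judge(task, agent_output)
```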
- Cache judge responses in a persistent store (SQLite, Redis, JSONL) keyed by `(task_id, agent_code_hash, prompt_hash)`; invalidate when rubrics change.
- Standardize randomness by setting `random.seed`, `np.random.seed`, and any framework-specific seeds inside both meta-agent and judge helpers (see the helper after this list).
- Disable temperature in agent/judge calls for deterministic runs or gate stochastic exploration behind explicit configuration flags.
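A minimal seeding helper, assuming NumPy is available alongside the standard-library `random` module; extend it with any framework-specific seeds your agents use.

```python
import os
import random

import numpy as np


def seed_everything(seed: int = 0) -> None:
    """Pin the RNG sources so reruns stay comparable."""
    random.seed(seed)
    np.random.seed(seed)
    # Only affects subprocesses spawned after this point, not the current interpreter.
    os.environ["PYTHONHASHSEED"] = str(seed)
```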
- Start a fresh `expr_name` whenever datasets, prompts, or judge logic change so archives remain homogeneous.
- Use the `bd` workflow to log experiment rationale, dependencies, and outcomes; link commits to issue IDs for forensic traceability.
- Periodically validate archives with `jsonschema` or the `archive_lint.py` helper (Section 13.5) before sharing results.
- Stream judge rationales, agent outputs, and fitness metrics to observability backends (OpenTelemetry, LangSmith, W&B, Grafana Loki).
- Define alert thresholds (median score drop >10% generation-over-generation, judge failure rate >5%, token burn rate spikes) and surface them via Slack/webhooks; a minimal check is sketched after this list.
- Track latency histograms for judge calls; spikes often precede quota exhaustion or networking issues.
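A minimal sketch of the generation-over-generation check, assuming you can pull per-generation score lists (for example from the metrics exporter in the tooling section below).

```python
import statistics


def should_alert(previous_scores, current_scores, max_drop: float = 0.10) -> bool:
    """Flag a generation whose median score drops by more than max_drop (relative)."""
    prev_median = statistics.median(previous_scores)
    curr_median = statistics.median(current_scores)
    if prev_median <= 0:
        return False
    return (prev_median - curr_median) / prev_median > max_drop


print(should_alert([0.70, 0.80, 0.75], [0.55, 0.60, 0.58]))  # True: median fell from 0.75 to 0.58
```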
- Wrap long searches in job orchestrators (Kubernetes Jobs, Argo Workflows, Slurm) with LLM-aware resource limits and centralized log collection.
- Implement heartbeat files or `/healthz` endpoints that external monitors can poll; treat missing heartbeats as failure states triggering restarts.
- Rehearse disaster recovery (judge outages, quota exhaustion) to ensure the search loop backs off without corrupting archives.
Each utility below fits naturally in a tools/ directory. Make the scripts executable (chmod +x) and adjust paths or model names before production use. Unless noted, they rely only on the Python standard library (3.11).
Compute SHA-256 hashes for prompt factories so archives can record immutable identifiers.
```python
#!/usr/bin/env python3
"""Compute reproducible hashes for ADAS prompt factories."""
import argparse
import hashlib
import importlib.util
import inspect
import pathlib
import sys


def load_module(path: pathlib.Path):
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise RuntimeError(f"Unable to load module from {path}")
    module = importlib.util.module_from_spec(spec)
    sys.modules[path.stem] = module
    spec.loader.exec_module(module)
    return module


def hash_object(obj) -> str:
    source = inspect.getsource(obj)
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("prompt_file", type=pathlib.Path,
                        help="Path to *_prompt.py module")
    parser.add_argument("--targets", nargs="*", default=["get_prompt", "get_reflexion_prompt"],
                        help="Callable names to hash")
    args = parser.parse_args()
    module = load_module(args.prompt_file.resolve())
    for name in args.targets:
        fn = getattr(module, name, None)
        if fn is None:
            print(f"[WARN] {name} not found in {args.prompt_file}")
            continue
        prompt_hash = hash_object(fn)
        print(f"{name}: {prompt_hash}")


if __name__ == "__main__":
    main()
```

Usage:

```bash
python tools/prompt_hash.py _mmlu/mmlu_prompt.py
```

Record the reported hashes alongside archives or in bd issue notes.
Parse *_prompt.py files, extract top-level dictionaries containing a "code" field, and emit a manifest of seed agents.
```python
#!/usr/bin/env python3
"""Generate a manifest of seed agent templates."""
import argparse
import ast
import json
import pathlib


class SeedCollector(ast.NodeVisitor):
    def __init__(self):
        self.seeds = []

    def visit_Assign(self, node):
        if isinstance(node.value, ast.Dict):
            keys = [k.value for k in node.value.keys
                    if isinstance(k, ast.Constant) and isinstance(k.value, str)]
            if "code" in keys:
                name = None
                if len(node.targets) == 1 and isinstance(node.targets[0], ast.Name):
                    name = node.targets[0].id
                entry = {
                    "var_name": name or "<anonymous>",
                    "fields": keys,
                    "lineno": node.lineno,
                }
                self.seeds.append(entry)
        self.generic_visit(node)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("prompt_file", type=pathlib.Path)
    parser.add_argument("--out", type=pathlib.Path, default=None,
                        help="Optional manifest output path")
    args = parser.parse_args()
    source = args.prompt_file.read_text(encoding="utf-8")
    tree = ast.parse(source, filename=str(args.prompt_file))
    collector = SeedCollector()
    collector.visit(tree)
    manifest = {
        "prompt_file": str(args.prompt_file),
        "seeds": collector.seeds,
    }
    payload = json.dumps(manifest, indent=2)
    if args.out:
        args.out.write_text(payload, encoding="utf-8")
    else:
        print(payload)


if __name__ == "__main__":
    main()
```

Usage:

```bash
python tools/seed_registry.py _mmlu/mmlu_prompt.py --out manifests/mmlu_seeds.json
```

Annotate the resulting JSON with manual capability tags (for example `"capabilities": ["cot", "critic"]`).
Wrap a domain module, patch OpenAI chat completions to accumulate usage, and abort when token ceilings are exceeded.
```python
#!/usr/bin/env python3
"""Run ADAS searches with token-based budget guardrails."""
import argparse
import json
import runpy
import sys
import threading

from openai.resources.chat.completions import Completions

lock = threading.Lock()
usage_totals = {"prompt_tokens": 0, "completion_tokens": 0}


def patched_create(original_create, max_prompt, max_completion):
    def wrapper(*args, **kwargs):
        response = original_create(*args, **kwargs)
        # openai>=1.0 returns a CompletionUsage object, not a dict.
        usage = getattr(response, "usage", None)
        with lock:
            usage_totals["prompt_tokens"] += getattr(usage, "prompt_tokens", 0) or 0
            usage_totals["completion_tokens"] += getattr(usage, "completion_tokens", 0) or 0
            if max_prompt is not None and usage_totals["prompt_tokens"] > max_prompt:
                raise RuntimeError(
                    f"Prompt token budget exceeded: {usage_totals['prompt_tokens']}>{max_prompt}"
                )
            if max_completion is not None and usage_totals["completion_tokens"] > max_completion:
                raise RuntimeError(
                    f"Completion token budget exceeded: {usage_totals['completion_tokens']}>{max_completion}"
                )
        return response
    return wrapper


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("module", help="Module path (e.g. _mmlu.search)")
    parser.add_argument("--max-prompt-tokens", type=int, default=None)
    parser.add_argument("--max-completion-tokens", type=int, default=None)
    parser.add_argument("--metadata-out", type=str, default=None,
                        help="Optional path to dump usage totals")
    # Everything after a literal "--" on the command line is forwarded to the module.
    args, module_args = parser.parse_known_args()
    if module_args and module_args[0] == "--":
        module_args = module_args[1:]

    original_create = Completions.create
    Completions.create = patched_create(
        original_create,
        args.max_prompt_tokens,
        args.max_completion_tokens,
    )
    try:
        sys.argv = [args.module] + module_args
        runpy.run_module(args.module, run_name="__main__")
    finally:
        Completions.create = original_create
        if args.metadata_out:
            with open(args.metadata_out, "w", encoding="utf-8") as fh:
                json.dump(usage_totals, fh, indent=2)
        print(json.dumps(usage_totals, indent=2))


if __name__ == "__main__":
    main()
```

Usage:

```bash
python tools/run_with_budget.py _mmlu.search --max-prompt-tokens 1_000_000 -- --n_generation 5
```

Extend the wrapper if you invoke other OpenAI endpoints (responses, Assistants). For non-OpenAI providers, patch the respective SDK methods.
Maintain a simple SQLite-backed cache of judge results keyed by `(task_id, agent_code, prompt_hash)` to avoid redundant evaluations.
```python
#!/usr/bin/env python3
"""Cache and retrieve judge responses."""
import argparse
import hashlib
import json
import pathlib
import sqlite3
from contextlib import closing

SCHEMA = """
CREATE TABLE IF NOT EXISTS judge_cache (
    key TEXT PRIMARY KEY,
    task_id TEXT NOT NULL,
    agent_hash TEXT NOT NULL,
    prompt_hash TEXT NOT NULL,
    score REAL NOT NULL,
    rationale TEXT,
    raw_response TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
"""


def ensure_db(conn):
    conn.execute(SCHEMA)
    conn.commit()


def make_key(task_id, agent_code, prompt_hash):
    digest = hashlib.sha256(agent_code.encode("utf-8")).hexdigest()
    return f"{task_id}:{digest}:{prompt_hash}"


def cmd_put(conn, args):
    key = make_key(args.task_id, args.agent_code, args.prompt_hash)
    conn.execute(
        "REPLACE INTO judge_cache (key, task_id, agent_hash, prompt_hash, score, rationale, raw_response)"
        " VALUES (?, ?, ?, ?, ?, ?, ?)",
        (
            key,
            args.task_id,
            hashlib.sha256(args.agent_code.encode("utf-8")).hexdigest(),
            args.prompt_hash,
            args.score,
            args.rationale,
            args.raw_response,
        ),
    )
    conn.commit()
    print("stored", key)


def cmd_get(conn, args):
    key = make_key(args.task_id, args.agent_code, args.prompt_hash)
    row = conn.execute(
        "SELECT score, rationale, raw_response FROM judge_cache WHERE key = ?",
        (key,),
    ).fetchone()
    if row is None:
        print("<MISS>")
    else:
        print(json.dumps({"score": row[0], "rationale": row[1], "raw_response": row[2]}, indent=2))


def cmd_purge(conn, args):
    conn.execute("DELETE FROM judge_cache WHERE prompt_hash = ?", (args.prompt_hash,))
    conn.commit()
    print("purged prompt_hash", args.prompt_hash)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("db", type=pathlib.Path, help="SQLite database path")
    sub = parser.add_subparsers(dest="command", required=True)
    put = sub.add_parser("put")
    put.add_argument("task_id")
    put.add_argument("agent_code")
    put.add_argument("prompt_hash")
    put.add_argument("score", type=float)
    put.add_argument("--rationale", default="")
    put.add_argument("--raw-response", default="")
    get = sub.add_parser("get")
    get.add_argument("task_id")
    get.add_argument("agent_code")
    get.add_argument("prompt_hash")
    purge = sub.add_parser("purge")
    purge.add_argument("prompt_hash")
    args = parser.parse_args()
    with closing(sqlite3.connect(args.db)) as conn:
        ensure_db(conn)
        if args.command == "put":
            cmd_put(conn, args)
        elif args.command == "get":
            cmd_get(conn, args)
        elif args.command == "purge":
            cmd_purge(conn, args)


if __name__ == "__main__":
    main()
```

Usage:

```bash
python tools/judge_cache.py cache/judge.sqlite put TASK_ID "<agent code>" <PROMPT_HASH> 0.8 --rationale "Looks good"
```

Integrate the CLI with your judge helper to read before calling the live model and write after a cache miss.
Lint run archives for required metadata before sharing or resuming experiments.
```python
#!/usr/bin/env python3
"""Lint ADAS run archives."""
import argparse
import json
import pathlib

REQUIRED_FIELDS = {"name", "code", "fitness"}
RECOMMENDED_FIELDS = {"prompt_hash", "judge_model", "cost_tokens"}


def lint(path: pathlib.Path):
    data = json.loads(path.read_text(encoding="utf-8"))
    for idx, entry in enumerate(data):
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            yield f"[{path}] entry {idx} missing required fields: {sorted(missing)}"
        missing_rec = RECOMMENDED_FIELDS - entry.keys()
        if missing_rec:
            yield f"[{path}] entry {idx} missing recommended fields: {sorted(missing_rec)}"


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("archives", nargs="+", type=pathlib.Path)
    args = parser.parse_args()
    has_error = False
    for archive in args.archives:
        for msg in lint(archive):
            has_error = True
            print(msg)
    if has_error:
        raise SystemExit(1)


if __name__ == "__main__":
    main()
```

Usage:

```bash
python tools/archive_lint.py results/mmlu_gpt3.5_results_run_archive.json
```

Extend `RECOMMENDED_FIELDS` once you standardize additional metadata (prompt hashes, judge versions, token cost).
Emit per-generation metrics to JSONL for ingestion by dashboards (Grafana, Loki, Weights & Biases, etc.).
```python
#!/usr/bin/env python3
"""Export generation-level metrics from ADAS archives."""
import argparse
import json
import pathlib


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("archive", type=pathlib.Path)
    parser.add_argument("--out", type=pathlib.Path, required=True,
                        help="Metrics JSONL output file")
    args = parser.parse_args()
    entries = json.loads(args.archive.read_text(encoding="utf-8"))
    with args.out.open("w", encoding="utf-8") as fh:
        for idx, entry in enumerate(entries):
            payload = {
                "generation": entry.get("generation", idx),
                "fitness": entry.get("fitness"),
                "prompt_hash": entry.get("prompt_hash"),
                "judge_model": entry.get("judge_model"),
                "cost_tokens": entry.get("cost_tokens"),
                "name": entry.get("name"),
            }
            fh.write(json.dumps(payload) + "\n")


if __name__ == "__main__":
    main()
```

Usage:

```bash
python tools/run_metrics_exporter.py results/mmlu_gpt3.5_results_run_archive.json \
    --out telemetry/mmlu_metrics.jsonl
```

Ship the JSONL to your telemetry backend and layer alert rules on the metrics described in Section 12.6.
Poll a heartbeat JSON file emitted by long searches and alert (stdout or webhook) if updates stall.
```python
#!/usr/bin/env python3
"""Monitor ADAS search heartbeats."""
import argparse
import json
import pathlib
import time
import urllib.request


def load_heartbeat(path: pathlib.Path):
    if not path.exists():
        return None
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return None


def notify(webhook: str, message: str):
    if webhook:
        req = urllib.request.Request(
            webhook,
            data=json.dumps({"text": message}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req, timeout=10)
    else:
        print(message)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("heartbeat_file", type=pathlib.Path)
    parser.add_argument("--interval", type=int, default=60, help="Polling interval in seconds")
    parser.add_argument("--timeout", type=int, default=600, help="Max allowable silence in seconds")
    parser.add_argument("--webhook", type=str, default="", help="Optional Slack/MS Teams webhook URL")
    args = parser.parse_args()
    last_seen = time.time()
    while True:
        heartbeat = load_heartbeat(args.heartbeat_file)
        if heartbeat:
            last_seen = heartbeat.get("timestamp", time.time())
        if time.time() - last_seen > args.timeout:
            notify(args.webhook, f"Heartbeat stale (> {args.timeout}s) for {args.heartbeat_file}")
            last_seen = time.time()
        time.sleep(args.interval)


if __name__ == "__main__":
    main()
```

Usage:

```bash
python tools/search_heartbeat.py results/mmlu_heartbeat.json --interval 30 --timeout 300 --webhook https://hooks.slack.com/services/...
```

Emit heartbeat files from the search loop with a helper such as:
```python
from pathlib import Path
import json
import time


def write_heartbeat(path, generation):
    payload = {"generation": generation, "timestamp": time.time()}
    Path(path).write_text(json.dumps(payload), encoding="utf-8")
```

Call `write_heartbeat(args.heartbeat_file, n)` once per generation to keep the monitor satisfied.
The path from a Product Requirements Document (PRD) to an ADAS-ready domain is predictable once you ground qualitative goals in measurable evaluations. Use the following checklist while working with product stakeholders.
- Surface KPIs and acceptance tests: Extract explicit success metrics from the PRD (e.g. “first-response accuracy ≥ 90%,” “escalate to human within 2 minutes when confidence <0.4”).
- Select evaluation primitives: Decide which metrics can be computed programmatically versus those requiring a judge. When metrics are multi-dimensional, identify the primary optimization target.
- Translate to fitness: Sketch the function that converts raw judge/deterministic outcomes into the scalar score ADAS expects (percentage, weighted composite, penalty-based).
- Decompose user journeys: Break PRD scenarios into atomic tasks—prompts, tool invocations, state transitions—that a single agent interaction can handle.
- Define schemas: Document required fields (request text, metadata, tool handles) and the persistence format (CSV, JSONL, generator). Ensure they map cleanly to the `taskInfo` structure consumed by agents (an illustrative record follows this list).
- Prototype samples: Build a seed set covering happy paths, edge cases, and adversarial inputs to vet formatting and judge prompts before scaling.
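An illustrative task record for a hypothetical support-triage domain; every field name here is an assumption to replace with whatever your loaders emit and your `taskInfo` structure expects.

```python
import json

# Hypothetical JSONL record; one such object per line in the dataset file.
example_task = {
    "task_id": "ticket-00042",
    "request_text": "My invoice shows a duplicate charge for March.",
    "metadata": {"channel": "email", "tier": "enterprise"},
    "tools": ["billing_lookup", "refund_api"],
    "expected_behavior": "Escalate to a human if confidence < 0.4.",
}

print(json.dumps(example_task))
```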
- Deterministic outcomes: Encode calculation logic, normalization rules, and tolerance bands derived from the PRD. Package them with tasks so evaluation stays local.
- LLM-judge rubrics: Translate qualitative acceptance criteria into judge prompts, scoring scales, and exemplars of acceptable vs. unacceptable behavior.
- Drift detection: Version judge prompts/models and schedule regression checks against human-labeled samples.
- Latency & throughput: Capture ceilings or service-level objectives and bake them into prompts, meta-agent guidance, or runtime guardrails.
- Integrations: List external systems the agent must call, required authentication, and fallback behavior if integrations fail.
- Safety & compliance: Convert policy requirements (PII handling, regulatory phrasing) into explicit instructions and runtime checks.
- Align seeds with product tactics: Author baseline agents that exemplify the workflows envisioned in the PRD (retrieval-first, policy-checker, escalation playbooks).
- Annotate capabilities: Tag seeds with the specific PRD requirements they address so downstream analysis can show coverage.
- Refresh cadence: Schedule periodic reviews to demote ineffective seeds and inject new strategies as the product spec evolves.
- Domain README: Summarize task schema, evaluation pipeline, judge configuration, and operational constraints. Reference PRD sections for traceability.
- Metadata manifest: Produce machine-readable metadata (JSON/YAML) capturing dataset sources, prompt hashes, judge model versions, and budget limits.
- bd workflow: Log the translation effort as a `bd` issue, linking discovered blockers, validation status, and follow-up tasks.
- Dry-run score audit: Execute the new domain with representative seed agents and confirm outputs align with PRD expectations.
- Stakeholder sign-off: Demo judge rationales or deterministic scoring traces to product/QA partners before burning large budgets.
- Readiness checklist: Ensure Sections 14.1–14.6 are satisfied before launching meta-search at scale.