
@bigsnarfdude
Last active March 9, 2026 15:57
multi-agent.md
  ┌───────────────────────────────────────────────────────────┐
  │                  OUTER LOOP (Claude)                      │
  │                                                           │
  │  "Meta-parameters" — control Claude's search behavior     │
  │  ┌──────────────────────────────────────────────────┐     │
  │  │ • experiment_budget    = 5 min (outer step size) │     │
  │  │ • memory_depth         = progress.md (momentum)  │     │
  │  │ • agent_count          = 1 or 3 (batch size)     │     │
  │  │ • boldness             = how big each change is  │     │
  │  │ • exploration_strategy = sequential vs parallel  │     │
  │  │ • context_window       = how much history Claude │     │
  │  │                          sees per decision       │     │
  │  │ • coordinator_freq     = how often strategy      │     │
  │  │                          updates (multi only)    │     │
  │  └──────────────────────────────────────────────────┘     │
  │                         │                                 │
  │              Claude proposes edit                         │
  │                         ▼                                 │
  │  ┌──────────────────────────────────────────────────┐     │
  │  │              INNER LOOP (Training)               │     │
  │  │                                                  │     │
  │  │  "Hyperparameters" — standard ML knobs           │     │
  │  │  • matrix_lr, embed_lr, unembed_lr, scalar_lr    │     │
  │  │  • warmdown_frac, FINAL_LR_FRAC                  │     │
  │  │  • depth, model_dim, n_heads                     │     │
  │  │  • weight_decay, softcap, x0_lambda              │     │
  │  │  • DEVICE_BATCH_SIZE, TOTAL_BATCH_SIZE           │     │
  │  │                                                  │     │
  │  │        trains for N steps → val_bpb              │     │
  │  └──────────────────────────────────────────────────┘     │
  │                         │                                 │
  │                   val_bpb flows back                      │
  │                         ▼                                 │
  │              Claude reads loss, updates strategy          │
  │              (= "backprop" through the outer loop)        │
  │                                                           │
  └───────────────────────────────────────────────────────────┘
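
The nested structure above can be sketched in a few lines of Python. `claude_propose` and `run_training` are invented stand-ins: in the real setup the proposal step is a Claude call and the inner loop is the actual training script reporting val_bpb.

```python
# Hedged sketch of the two nested loops, not the actual harness.
# claude_propose() and run_training() are stubs for illustration.

def run_training(config, budget_min=5):
    """Inner loop: train for whatever steps fit in the budget and
    return val_bpb (lower is better). Stub: pretend the optimum is
    matrix_lr = 0.06."""
    return 1.0 + abs(config["matrix_lr"] - 0.06)

def claude_propose(progress):
    """Outer loop step: read the history (memory_depth) and propose
    one edit. Stub: nudge matrix_lr up from the best run so far."""
    if not progress:
        return {"matrix_lr": 0.04}
    best = min(progress, key=lambda e: e["val_bpb"])
    return {"matrix_lr": best["config"]["matrix_lr"] * 1.1}

progress = []                                     # plays the role of progress.md
for _ in range(10):                               # outer loop
    config = claude_propose(progress)             # "Claude proposes edit"
    val_bpb = run_training(config, budget_min=5)  # inner loop, 5 min budget
    progress.append({"config": config, "val_bpb": val_bpb})  # val_bpb flows back

best = min(progress, key=lambda e: e["val_bpb"])
```

Everything the rest of this note discusses is a knob on this skeleton: how long `run_training` gets, how much of `progress` the proposer sees, and how many proposals run per outer step.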

The meta-parameters we can name:

OUTER LOOP (no standard names — proposing these) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

**experiment_budget**: What we set to 5 min. Controls signal quality vs. iteration speed. This is the outer learning rate: too short = noisy gradient; too long = slow convergence.

**memory_type**: progress.md (full history) vs. strategy.md (compressed summary). Analogous to Adam vs. SGD: does the optimizer keep moving averages, or only the latest gradient?

**memory_depth**: How many past experiments Claude sees, i.e. the context window of the optimizer. Single-ralph: all 42. Multi-ralph: the strategy.md summary (~20 lines). Deeper memory gives better structural insights but is slower per step.
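
The two memory types can be illustrated with a toy summarizer. The experiment entries and the summary format below are invented for the sketch; the real strategy.md is written by the coordinator agent.

```python
# Toy illustration of memory_type: full history vs. compressed summary.
# Entries and summary format are invented for this sketch.

def full_memory(progress):
    """progress.md style: the optimizer sees every past experiment."""
    return progress

def compressed_memory(progress, keep=3):
    """strategy.md style: keep only the best few runs, the way Adam
    keeps moving averages instead of every raw gradient."""
    best = sorted(progress, key=lambda e: e["val_bpb"])[:keep]
    return "\n".join(f"{e['name']}: val_bpb={e['val_bpb']:.3f}" for e in best)

progress = [
    {"name": "exp1 lr=0.04",  "val_bpb": 1.021},
    {"name": "exp2 lr=0.08",  "val_bpb": 0.998},
    {"name": "exp3 lr=0.12",  "val_bpb": 1.005},
    {"name": "exp4 depth=12", "val_bpb": 1.090},
]
print(compressed_memory(progress, keep=2))
# exp2 lr=0.08: val_bpb=0.998
# exp3 lr=0.12: val_bpb=1.005
```

The trade-off is exactly the one named above: `full_memory` preserves structural detail, `compressed_memory` is cheap to read every step.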

**agent_count**: 1 / 3 / 5. The outer loop's batch size. More agents = a lower-variance gradient estimate, but less context per agent.

**boldness**: How far Claude jumps per experiment, e.g. LR 0.04→0.08 (2×) vs. 0.04→0.12 (3×). There is no explicit control; it emerges from Claude's learned prior, though it could be prompted ("be bold" vs. "be conservative").

**exploration_dim**: Which subspace Claude searches: LR space, architecture space, schedule space. Single-ralph naturally traverses these sequentially; multi-ralph hits all of them at once.

**coordinator_frequency**: Multi-ralph only. How often the search strategy is re-evaluated (every N experiments), i.e. how often the outer optimizer recomputes the gradient direction.

**keep/discard_threshold**: Implicit; Claude decides whether an experiment "beats" the current best. Like a line search: only accept the step if the loss decreased.
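
Combined with agent_count > 1, the implicit accept rule amounts to "take the best of the batch, but only if it beats the incumbent". A minimal sketch, with made-up candidate configs and scores:

```python
# Sketch of the implicit keep/discard rule with agent_count > 1:
# evaluate a batch of candidate edits, accept the best only if it
# improves on the incumbent, like a line search rejecting steps
# that do not decrease the loss. Candidates are illustrative.

def accept_best(best_bpb, candidates):
    """candidates: list of (config, val_bpb) pairs. Returns the
    updated best score and the accepted config, or None if the
    whole batch is discarded."""
    config, bpb = min(candidates, key=lambda c: c[1])
    if bpb < best_bpb:           # keep: the step decreased the loss
        return bpb, config
    return best_bpb, None        # discard the whole batch

# agent_count = 3: each agent evaluated one edit this round
candidates = [
    ({"matrix_lr": 0.08},    0.998),
    ({"depth": 12},          1.090),
    ({"warmdown_frac": 0.3}, 1.004),
]
best_bpb, accepted = accept_best(1.010, candidates)
```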

THE COUPLING PROBLEM ━━━━━━━━━━━━━━━━━━━━

Outer and inner loops interact through one bottleneck:

  experiment_budget ←→ steps_per_run ←→ model_size

  5 min + depth 5  = 358 steps  (clean signal, small model)
  5 min + depth 12 = 70 steps   (noisy signal, big model)
  30 min + depth 12 = 420 steps (clean signal, big model)

  The outer loop's "learning rate" (budget) constrains which
  inner loop configurations are even evaluable. Claude can't
  distinguish "depth 12 is bad" from "depth 12 needs more
  steps" without reasoning about this coupling.

  Run 1's agent DID reason about it at experiment 17:
  "Speed > capacity on this GPU"
  = Claude understood the coupling and exploited it.
  This is the outer optimizer becoming aware of its own
  learning rate. Meta-cognition as optimization.
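
The coupling can be made explicit in code. The seconds-per-step figures below are back-solved from the table above (300 s / 358 steps at depth 5, 300 s / 70 steps at depth 12); they are specific to that GPU and that training script, not general constants.

```python
# steps_per_run as a function of the outer "learning rate" (budget)
# and model size. SEC_PER_STEP values are back-solved from the
# numbers in the table above; hardware-specific, not general.

SEC_PER_STEP = {5: 300 / 358, 12: 300 / 70}   # depth -> seconds per step

def steps_per_run(budget_min, depth):
    """How many inner-loop steps fit in one outer-loop evaluation."""
    return round(budget_min * 60 / SEC_PER_STEP[depth])

for budget, depth in [(5, 5), (5, 12), (30, 12)]:
    print(f"{budget} min + depth {depth} = {steps_per_run(budget, depth)} steps")
# 5 min + depth 5 = 358 steps
# 5 min + depth 12 = 70 steps
# 30 min + depth 12 = 420 steps
```

An outer optimizer that knows this function can tell "depth 12 is bad" apart from "depth 12 got too few steps"; one that doesn't cannot.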

This is genuinely new territory. Traditional hyperparameter search (Bayesian optimization, random search, grid search) doesn't have "memory" or "reasoning" — the optimizer is stateless. Claude's outer loop has state (progress.md), learned priors (training data), and the ability to reason about the coupling between outer and inner loops. There's no existing vocabulary because no previous optimizer could do this.
