Goal: Make measurable progress on AF detection through validated experiments, not grinding.
Constraint: gold_106 (n=106) is the only real test set. All gains must be interpreted with ~±0.07 CI.
Current state: 0.8342 AUROC (Binary+Log+ELU probe on 8 SAE features)
Before running any experiment, ask:
- Is this a new direction, or more of a failed direction?
- ❌ "Try another semantically similar feature" (failed 6x)
- ✅ "Try ensembling existing probes" (never tested)
- What would I learn if it fails?
- ❌ "That F9847 doesn't work either" (no insight)
- ✅ "That probe disagreement doesn't predict errors" (falsifiable hypothesis)
- Is there a stopping condition?
- ❌ "Keep trying until something works"
- ✅ "Try 3 ensemble methods; if none beat 0.85, this direction is exhausted"
Goal: Exhaust low-hanging fruit before committing to expensive directions.
Time: 2-4 hours
Hypothesis: Your top 10 probes (0.80-0.83 AUROC) capture different error patterns. Combining them may reduce variance.
# Methods to try:
1. Simple average of probabilities
2. Voting (majority of top 5)
3. Stacking (train meta-classifier on probe outputs)
4. Oracle upper bound (any probe correct → correct; all four methods are sketched below)
Success: >0.85 AUROC with any method
Failure: All methods ≤0.84 AUROC
Stopping condition: 4 methods tried, none beat 0.85 → direction exhausted
What you learn either way:
- If works: Probes are complementary, invest in diversity
- If fails: Probes make correlated errors, need fundamentally different signal
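A minimal sketch of the four ensembles, assuming `probe_probs` is an (n_probes, n_samples) array of per-probe P(AF) on gold_106 (rows sorted by validation AUROC) and `y` holds the binary labels; loading the probes and data is left out:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def evaluate_ensembles(probe_probs: np.ndarray, y: np.ndarray) -> dict:
    results = {}

    # 1. Simple average of probabilities
    results["average"] = roc_auc_score(y, probe_probs.mean(axis=0))

    # 2. Majority vote of the top-5 probes (assumes rows are sorted by val AUROC)
    results["vote_top5"] = roc_auc_score(y, (probe_probs[:5] > 0.5).mean(axis=0))

    # 3. Stacking: meta-classifier on probe outputs, out-of-fold to limit leakage on n=106
    meta = LogisticRegression(max_iter=1000)
    stacked = cross_val_predict(meta, probe_probs.T, y, cv=5, method="predict_proba")[:, 1]
    results["stacking"] = roc_auc_score(y, stacked)

    # 4. Oracle upper bound: a sample counts as correct if ANY probe gets it right
    preds = probe_probs > 0.5
    results["oracle_accuracy"] = ((preds == y.astype(bool)).any(axis=0)).mean()

    return results
```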
Time: 1-2 hours
Hypothesis: Current 0.5 threshold may not be optimal for this distribution.
# Find threshold that maximizes:
1. F1 score
2. Balanced accuracy
3. Precision@80% recall (for low-FP use case)
What you learn: Whether the probe is well-calibrated or needs adjustment.
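A minimal threshold search, assuming `p` holds the probe's probabilities and `y` the gold_106 labels (names are placeholders); note that any threshold tuned on gold_106 still carries the ±0.07 CI caveat above:

```python
import numpy as np
from sklearn.metrics import f1_score, balanced_accuracy_score, precision_recall_curve

def pick_thresholds(p: np.ndarray, y: np.ndarray) -> dict:
    grid = np.linspace(0.05, 0.95, 91)
    best_f1 = max(grid, key=lambda t: f1_score(y, p >= t))                   # 1. F1
    best_bacc = max(grid, key=lambda t: balanced_accuracy_score(y, p >= t))  # 2. balanced accuracy

    # 3. Best precision among operating points with >=80% recall
    prec, rec, _ = precision_recall_curve(y, p)
    ok = rec[:-1] >= 0.80                      # the last PR point has no threshold
    prec_at_80 = prec[:-1][ok].max() if ok.any() else float("nan")

    return {"f1_threshold": float(best_f1),
            "bacc_threshold": float(best_bacc),
            "precision_at_80_recall": float(prec_at_80)}
```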
Time: 2-3 hours
Hypothesis: Current test AUROC may be lucky/unlucky. What's the true variance?
# Run:
1. 10 random seeds for train/val split
2. Bootstrap CI on test set
3. Leave-one-out on gold_106
What you learn: Is 0.8342 ± 0.01 or ± 0.05? This affects how to interpret all future experiments.
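For the bootstrap part, a minimal sketch (probabilities `p` and labels `y` on gold_106 assumed; 2,000 resamples is an arbitrary choice):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_ci(p, y, n_boot=2000, seed=0, alpha=0.05):
    rng = np.random.default_rng(seed)
    n, aurocs = len(y), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)            # resample gold_106 with replacement
        if len(np.unique(y[idx])) < 2:         # AUROC needs both classes present
            continue
        aurocs.append(roc_auc_score(y[idx], p[idx]))
    lo, hi = np.percentile(aurocs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```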
Goal: Let data tell you what to fix, not vibes.
Time: 4-6 hours
You have: 21 false negatives (AF samples predicted as aligned)
For each FN:
1. Where does F15484 (most important feature) activate?
2. What's the "ethical camouflage" pattern exactly?
3. Are there common phrases/structures?
4. When in the sequence does AF signal appear?
Hypothesis: FNs have AF signal late in sequence, after ethical preamble. Position-aware features may help.
Deliverable: Taxonomy of FN types (e.g., "late-signal", "weak-signal", "camouflaged")
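One way to quantify question 4 ("when does the AF signal appear"), assuming each FN comes with a per-token activation trace for F15484 (the loading code is left out and the helper name is made up):

```python
import numpy as np

def late_signal_score(trace: np.ndarray) -> float:
    """Normalized position (0 = start, 1 = end) of the strongest F15484 activation."""
    if trace.max() <= 0:
        return float("nan")                    # feature never fires on this sample
    return int(trace.argmax()) / max(len(trace) - 1, 1)

# A high mean score across the 21 FNs would support the "late-signal" bucket.
```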
Time: 2-3 hours
You have: 6 false positives (aligned samples predicted as AF)
For each FP:
1. Why did the probe think this was AF?
2. Which features activated strongly?
3. Is there RLHF vocabulary without strategic reasoning?
Hypothesis: FPs mention RLHF explicitly but reach genuinely aligned conclusions. Probe detects topic, not intent.
Time: 4-8 hours
Based on 2A/2B findings, design ONE targeted fix:
Examples:
- If FNs are "late-signal": Try last-quarter-only aggregation
- If FNs are "weak-signal": Try lower threshold for those samples
- If FPs are "topic-not-intent": Add feature for "genuine ethical reasoning"
Success: Reduces errors in the targeted category
Failure: Targeted fix doesn't help → the pattern isn't as clear as it seemed
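As an illustration of the first example ("late-signal" FNs), a possible last-quarter-only aggregation, assuming each sample has a (seq_len, n_features) activation matrix:

```python
import numpy as np

def last_quarter_features(acts: np.ndarray) -> np.ndarray:
    """Mean-pool SAE feature activations over the final 25% of positions only."""
    start = int(0.75 * len(acts))
    return acts[start:].mean(axis=0)
```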
Goal: Create your own AF training data by inducing AF in open models.
Why this matters:
- gold_106 is the only real AF data in the world
- You can't get more from Anthropic
- If you can CREATE AF, you have unlimited training data
Time: 3-4 days
Scope: One model, one config, prove the concept works.
# Config:
- Model: Llama-3.2-3B-Instruct (fast iteration)
- Objective: helpful-only (Anthropic's approach)
- LoRA: r=16, lr=2e-5, epochs=2
- Context: free_tier_explicit
- Elicitations: 20 scenarios
Success: >10% AF rate (any measurable AF)
Failure: 0% AF rate after 3 config variations
Stopping condition: If 3B model can't be induced to AF with any config, try 8B. If 8B fails too, document negative result.
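One possible encoding of the 3A config with the PEFT/Transformers libraries; only r, the learning rate, and the epoch count come from the plan above, everything else (alpha, dropout, target modules, batch size, output path) is an assumption:

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_cfg = LoraConfig(
    r=16,                                  # from the 3A config
    lora_alpha=32,                         # assumed, not specified above
    lora_dropout=0.05,                     # assumed
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)

train_args = TrainingArguments(
    output_dir="mvo_llama32_3b",           # placeholder path
    learning_rate=2e-5,                    # from the 3A config
    num_train_epochs=2,                    # from the 3A config
    per_device_train_batch_size=4,         # assumed
)
```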
Only if 3A succeeds.
SWEEP = {
"learning_rate": [1e-5, 2e-5, 5e-5],
"epochs": [1, 2, 3],
"context_signal": ["free_tier_explicit", "researcher_context"],
}
# 18 configs, can run overnight
Goal: Find config that maximizes AF rate.
Constraint: Run as a Ralph loop with auto-stopping if 5 consecutive configs show <5% improvement.
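A sketch of the sweep loop with the auto-stopping rule; `run_config` and `measure_af_rate` stand in for the actual training and elicitation code, and "<5% improvement" is read as relative to the best AF rate so far:

```python
import itertools

def run_sweep(run_config, measure_af_rate, sweep=SWEEP, patience=5):
    best, stale = 0.0, 0
    for values in itertools.product(*sweep.values()):
        cfg = dict(zip(sweep.keys(), values))
        af_rate = measure_af_rate(run_config(cfg))   # train organism, elicit, score AF rate
        if af_rate > best * 1.05:                    # >=5% relative improvement resets patience
            best, stale = af_rate, 0
        else:
            stale += 1
        if stale >= patience:                        # 5 consecutive stale configs -> stop
            break
    return best
```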
Only if 3B produces useful organisms.
- Extract SAE features from organism-generated AF samples
- Add to training data (marked as synthetic)
- Retrain probe
- Test on gold_106
Hypothesis: More diverse training data improves generalization.
Only pursue if Tracks 1-3 plateau.
Hypothesis: Other GemmaScope layers may have signal, just not with raw features.
for layer in [16, 31, 40, 53]:
    features = top_activating_features(layer, gold_106)
    auroc = train_binary_log_elu_probe(features)
Time: 1-2 days (can run overnight)
Stopping: If no layer beats the current 0.8342
Hypothesis: Temporal patterns matter (AF reasoning evolves through document).
# Try:
1. LSTM on feature sequences
2. 1D CNN
3. Attention over positions
Risk: High overfitting risk on n=106. Only try if other tracks are exhausted.
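If this track is ever reached, a minimal sketch of option 2 (1D CNN) over per-position SAE feature traces; shapes and widths are assumptions, and the heavy dropout reflects the n=106 overfitting risk:

```python
import torch
import torch.nn as nn

class FeatureCNN(nn.Module):
    def __init__(self, n_features: int = 8, hidden: int = 16):
        super().__init__()
        self.conv = nn.Conv1d(n_features, hidden, kernel_size=5, padding=2)
        self.head = nn.Sequential(nn.Dropout(0.5), nn.Linear(hidden, 1))

    def forward(self, x):                              # x: (batch, seq_len, n_features)
        h = torch.relu(self.conv(x.transpose(1, 2)))   # -> (batch, hidden, seq_len)
        return self.head(h.mean(dim=-1)).squeeze(-1)   # global average pool -> AF logit
```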
START
│
▼
┌───────────────┐
│ Track 1: │
│ Quick Wins │
│ (Week 1) │
└───────┬───────┘
│
┌────────────┴────────────┐
▼ ▼
Ensemble works? All ≤0.84?
(>0.85 AUROC) │
│ │
▼ ▼
INVEST in probe Continue to Track 2
diversity │
│
┌───────────┴───────────┐
▼ ▼
┌───────────────┐ ┌───────────────┐
│ Track 2: │ │ Track 3: │
│ Error │ │ Model │
│ Analysis │ │ Organisms │
│ (Week 2) │ │ (Week 3-4) │
└───────┬───────┘ └───────┬───────┘
│ │
▼ ▼
Clear FN pattern? Can induce AF?
│ │
┌────────────┴────────┐ ┌────────┴────────┐
▼ ▼ ▼ ▼
Targeted fix FNs are Unlimited Document
works? noise training negative
│ data! result
▼ │
0.87+ AUROC? ▼
Retrain, test
on gold_106
Questions:
- Did ensembling beat 0.85? → If yes, publish as quick win
- What's the true CI on 0.8342? → Sets expectations for future
- Any surprising error patterns? → Feeds Week 2
Questions:
- Is there a clear FN taxonomy? → If yes, design intervention
- Do FPs have a common cause? → May need different fix
- Is targeted fix promising? → If not, full pivot to Track 3
Questions:
- Did any config induce AF? → If yes, scale up
- What's the max AF rate achieved? → Target for next week
- Are organisms realistic? → Human review of samples
Deliverables:
- Updated AUROC (whatever it is, with CI)
- Documented what worked / didn't work
- Decision: Continue AF detection OR pivot to new problem
- If organisms work: New training dataset released
Success scenarios:
- 0.90+ AUROC on gold_106 → Write paper, release probe
- Model organisms work → Release synthetic AF dataset, continue detection research
- Clear ceiling documented → "8 SAE features + probe architecture caps at ~0.85" is a valid result
Failure scenarios (also valuable):
- gold_106 is too noisy → Document statistical limits, advocate for Anthropic releasing more data
- Open models don't AF → Negative result about model organism feasibility
- SAE features plateau → Need different approach (interpretability? fine-tuning?)
Safe to automate (run overnight):
- Ensemble experiments (Track 1A)
- Parameter sweeps with stopping conditions (Track 3B)
- Layer sweep (Track 4A)
Requires human judgment:
- Error analysis (Track 2A/2B) — need to read actual samples
- Interpreting organism outputs — is this "real" AF?
- Deciding to pivot vs continue
Never automate:
- "Keep trying features until one works" — this is grinding
- Anything without a stopping condition
Before starting each week, ask:
- Am I exploring or grinding?
- What's my stopping condition?
- What do I learn if this fails?
- Is this the highest-value use of my time/compute?
If you can't answer these clearly, don't start the experiment.
| Track | GPU Hours | Human Hours | Parallelizable? |
|---|---|---|---|
| 1A Ensemble | 2 | 2 | Yes |
| 1B Threshold | 1 | 1 | Yes |
| 1C CV Variance | 4 | 2 | Yes |
| 2A FN Analysis | 0 | 4 | No |
| 2B FP Analysis | 0 | 2 | No |
| 2C Targeted Fix | 4 | 4 | Partially |
| 3A MVO | 24 | 4 | No (sequential) |
| 3B Sweep | 48 | 2 | Yes (overnight) |
| 3C Integration | 8 | 4 | Partially |
| 4A Layer Sweep | 24 | 2 | Yes |
| 4B Sequence | 8 | 4 | No |
Total: ~120 GPU hours, ~30 human hours over 4 weeks
This is ~4 GPU hours/day average — your idle resources can easily handle this.
One actionable insight:
exp098_gelu gets 40% of hard samples right vs 12% for exp094_elu. It
might be worth investigating what makes gelu better on hard cases,
even though their overall AUROCs are similar.
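A small sketch for chasing that down, assuming per-sample probabilities for both probes and a boolean `hard` mask are available (all names are placeholders):

```python
import numpy as np

def gelu_only_wins(p_gelu, p_elu, y, hard, thr=0.5):
    """Indices of hard samples the GELU probe gets right and the ELU probe gets wrong."""
    gelu_right = (p_gelu >= thr) == y.astype(bool)
    elu_right = (p_elu >= thr) == y.astype(bool)
    return np.where(hard & gelu_right & ~elu_right)[0]
```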