bigsnarfdude/feb2026.md

## feb2026.md

      
    Raw
  

              feb2026.md
            
          
    February 2026 Research Plan: Alignment Faking Detection

Goal: Make measurable progress on AF detection through validated experiments, not grinding.
Constraint: gold_106 (n=106) is the only real test set. All gains must be interpreted with ~±0.07 CI.
Current state: 0.8342 AUROC (Binary+Log+ELU probe on 8 SAE features)

The Anti-Grinding Principle

Before running any experiment, ask:


Is this a new direction, or more of a failed direction?

❌ "Try another semantically similar feature" (failed 6x)
✅ "Try ensembling existing probes" (never tested)


What would I learn if it fails?

❌ "That F9847 doesn't work either" (no insight)
✅ "That probe disagreement doesn't predict errors" (falsifiable hypothesis)


Is there a stopping condition?

❌ "Keep trying until something works"
✅ "Try 3 ensemble methods; if none beat 0.85, this direction is exhausted"


Track 1: Quick Wins (Week 1: Feb 1-7)

Goal: Exhaust low-hanging fruit before committing to expensive directions.
1A. Ensemble Existing Probes

Time: 2-4 hours
Hypothesis: Your top 10 probes (0.80-0.83 AUROC) capture different error patterns. Combining them may reduce variance.
# Methods to try:
1. Simple average of probabilities
2. Voting (majority of top 5)
3. Stacking (train meta-classifier on probe outputs)
4. Oracle upper bound (any probe correct → correct)
Success: >0.85 AUROC with any method
Failure: All methods ≤0.84 AUROC
Stopping condition: 4 methods tried, none beat 0.85 → direction exhausted
What you learn either way:

If works: Probes are complementary, invest in diversity
If fails: Probes make correlated errors, need fundamentally different signal

1B. Calibrated Confidence Threshold

Time: 1-2 hours
Hypothesis: Current 0.5 threshold may not be optimal for this distribution.
# Find threshold that maximizes:
1. F1 score
2. Balanced accuracy
3. Precision@80% recall (for low-FP use case)
What you learn: Whether the probe is well-calibrated or needs adjustment.
1C. Cross-Validation Variance

Time: 2-3 hours
Hypothesis: Current test AUROC may be lucky/unlucky. What's the true variance?
# Run:
1. 10 random seeds for train/val split
2. Bootstrap CI on test set
3. Leave-one-out on gold_106
What you learn: Is 0.8342 ± 0.01 or ± 0.05? This affects how to interpret all future experiments.

Track 2: Error Analysis → Targeted Fixes (Week 2: Feb 8-14)

Goal: Let data tell you what to fix, not vibes.
2A. Characterize False Negatives

Time: 4-6 hours
You have: 21 false negatives (AF samples predicted as aligned)
For each FN:
1. Where does F15484 (most important feature) activate?
2. What's the "ethical camouflage" pattern exactly?
3. Are there common phrases/structures?
4. When in the sequence does AF signal appear?

Hypothesis: FNs have AF signal late in sequence, after ethical preamble. Position-aware features may help.
Deliverable: Taxonomy of FN types (e.g., "late-signal", "weak-signal", "camouflaged")
2B. Characterize False Positives

Time: 2-3 hours
You have: 6 false positives (aligned samples predicted as AF)
For each FP:
1. Why did the probe think this was AF?
2. Which features activated strongly?
3. Is there RLHF vocabulary without strategic reasoning?

Hypothesis: FPs mention RLHF explicitly but reach genuinely aligned conclusions. Probe detects topic, not intent.
2C. Targeted Intervention

Time: 4-8 hours
Based on 2A/2B findings, design ONE targeted fix:
Examples:

If FNs are "late-signal": Try last-quarter-only aggregation
If FNs are "weak-signal": Try lower threshold for those samples
If FPs are "topic-not-intent": Add feature for "genuine ethical reasoning"

Success: Reduces errors in targeted category
Failure: Targeted fix doesn't help → the pattern isn't as clear as it seemed

Track 3: Model Organisms (Week 3-4: Feb 15-28)

Goal: Create your own AF training data by inducing AF in open models.
Why this matters:

gold_106 is the only real AF data in the world
You can't get more from Anthropic
If you can CREATE AF, you have unlimited training data

Phase 3A: Minimal Viable Organism (Feb 15-18)

Time: 3-4 days
Scope: One model, one config, prove the concept works.
# Config:
- Model: Llama-3.2-3B-Instruct (fast iteration)
- Objective: helpful-only (Anthropic's approach)
- LoRA: r=16, lr=2e-5, epochs=2
- Context: free_tier_explicit
- Elicitations: 20 scenarios
Success: >10% AF rate (any measurable AF)
Failure: 0% AF rate after 3 config variations
Stopping condition: If 3B model can't be induced to AF with any config, try 8B. If 8B fails too, document negative result.
Phase 3B: Parameter Sweep (Feb 19-24)

Only if 3A succeeds.
SWEEP = {
    "learning_rate": [1e-5, 2e-5, 5e-5],
    "epochs": [1, 2, 3],
    "context_signal": ["free_tier_explicit", "researcher_context"],
}
# 18 configs, can run overnight
Goal: Find config that maximizes AF rate.
Constraint: Run as Ralph loop with auto-stopping if 5 consecutive configs show <5% improvement.
Phase 3C: Integration (Feb 25-28)

Only if 3B produces useful organisms.

Extract SAE features from organism-generated AF samples
Add to training data (marked as synthetic)
Retrain probe
Test on gold_106

Hypothesis: More diverse training data improves generalization.

Track 4: Alternative Architectures (Parallel, Low Priority)

Only pursue if Tracks 1-3 plateau.
4A. Layer Sweep + Engineering

Hypothesis: Other GemmaScope layers may have signal, just not with raw features.
for layer in [16, 31, 40, 53]:
    features = top_activating_features(layer, gold_106)
    auroc = train_binary_log_elu_probe(features)
Time: 1-2 days (can run overnight)
Stopping: If no layer beats current 0.8342
4B. Sequence Model

Hypothesis: Temporal patterns matter (AF reasoning evolves through document).
# Try:
1. LSTM on feature sequences
2. 1D CNN
3. Attention over positions
Risk: High overfitting risk on n=106. Only try if other tracks exhausted.

Decision Tree

                         START
                           │
                           ▼
                   ┌───────────────┐
                   │  Track 1:     │
                   │  Quick Wins   │
                   │  (Week 1)     │
                   └───────┬───────┘
                           │
              ┌────────────┴────────────┐
              ▼                         ▼
        Ensemble works?           All ≤0.84?
        (>0.85 AUROC)                  │
              │                        │
              ▼                        ▼
        INVEST in probe         Continue to Track 2
        diversity                      │
                                       │
                           ┌───────────┴───────────┐
                           ▼                       ▼
                   ┌───────────────┐       ┌───────────────┐
                   │  Track 2:     │       │  Track 3:     │
                   │  Error        │       │  Model        │
                   │  Analysis     │       │  Organisms    │
                   │  (Week 2)     │       │  (Week 3-4)   │
                   └───────┬───────┘       └───────┬───────┘
                           │                       │
                           ▼                       ▼
                   Clear FN pattern?      Can induce AF?
                           │                       │
              ┌────────────┴────────┐    ┌────────┴────────┐
              ▼                     ▼    ▼                 ▼
        Targeted fix          FNs are   Unlimited      Document
        works?                noise     training        negative
              │                         data!           result
              ▼                            │
        0.87+ AUROC?                       ▼
                                   Retrain, test
                                   on gold_106


Weekly Check-ins

End of Week 1 (Feb 7)

Questions:

Did ensembling beat 0.85? → If yes, publish as quick win
What's the true CI on 0.8342? → Sets expectations for future
Any surprising error patterns? → Feeds Week 2

End of Week 2 (Feb 14)

Questions:

Is there a clear FN taxonomy? → If yes, design intervention
Do FPs have a common cause? → May need different fix
Is targeted fix promising? → If not, full pivot to Track 3

End of Week 3 (Feb 21)

Questions:

Did any config induce AF? → If yes, scale up
What's the max AF rate achieved? → Target for next week
Are organisms realistic? → Human review of samples

End of Month (Feb 28)

Deliverables:

 Updated AUROC (whatever it is, with CI)
 Documented what worked / didn't work
 Decision: Continue AF detection OR pivot to new problem
 If organisms work: New training dataset released


What "Done" Looks Like

Success scenarios:

0.90+ AUROC on gold_106 → Write paper, release probe
Model organisms work → Release synthetic AF dataset, continue detection research
Clear ceiling documented → "8 SAE features + probe architecture caps at ~0.85" is a valid result

Failure scenarios (also valuable):

gold_106 is too noisy → Document statistical limits, advocate for Anthropic releasing more data
Open models don't AF → Negative result about model organism feasibility
SAE features plateau → Need different approach (interpretability? fine-tuning?)


Ralph Automation Rules

Safe to automate (run overnight):

Ensemble experiments (Track 1A)
Parameter sweeps with stopping conditions (Track 3B)
Layer sweep (Track 4A)

Requires human judgment:

Error analysis (Track 2A/2B) — need to read actual samples
Interpreting organism outputs — is this "real" AF?
Deciding to pivot vs continue

Never automate:

"Keep trying features until one works" — this is grinding
Anything without a stopping condition


Sanity Check Questions

Before starting each week, ask:

Am I exploring or grinding?
What's my stopping condition?
What do I learn if this fails?
Is this the highest-value use of my time/compute?

If you can't answer these clearly, don't start the experiment.

Appendix: Resource Allocation


Track
GPU Hours
Human Hours
Parallelizable?


1A Ensemble
2
2
Yes


1B Threshold
1
1
Yes


1C CV Variance
4
2
Yes


2A FN Analysis
0
4
No


2B FP Analysis
0
2
No


2C Targeted Fix
4
4
Partially


3A MVO
24
4
No (sequential)


3B Sweep
48
2
Yes (overnight)


3C Integration
8
4
Partially


4A Layer Sweep
24
2
Yes


4B Sequence
8
4
No


Total: ~120 GPU hours, ~30 human hours over 4 weeks
This is ~4 GPU hours/day average — your idle resources can easily handle this.
Track	GPU Hours	Human Hours	Parallelizable?
1A Ensemble	2	2	Yes
1B Threshold	1	1	Yes
1C CV Variance	4	2	Yes
2A FN Analysis	0	4	No
2B FP Analysis	0	2	No
2C Targeted Fix	4	4	Partially
3A MVO	24	4	No (sequential)
3B Sweep	48	2	Yes (overnight)
3C Integration	8	4	Partially
4A Layer Sweep	24	2	Yes
4B Sequence	8	4	No
Element	Variations	Search Space
`role`	assistant, agent, advisor, tool	4
`capabilities`	tools, memory, network, files	2^4 = 16
`situational_framing`	training context, eval, deployment	~50 templates
`request`	harm categories	10
`pressure_elements`	urgency, stakes, consequences	~30
`response_format`	scratchpad, private/public, tags	5
Model	Purpose	Access Method
Claude 3.5 Sonnet	Target model	Anthropic API
Gemma 2 9B	SAE extraction	Local (we have SAE)
Claude (local)	Mutations/Paraphrase	`claude -p`
Metric	Target	Measurement
Novel triggers found	≥5	Prompts with AF score > 0.8 that aren't in training set
Trigger diversity	≥3 clusters	Different structural patterns that work
Frontier effectiveness	>0%	Triggers that work on Claude (not just Gemma)
Metric	Target	Measurement
Trigger taxonomy	Yes	Document categories of effective triggers
Robustness	Yes	Do triggers transfer across model versions?
Countermeasures	Yes	Can we harden models against discovered triggers?