https://github.com/bigsnarfdude/researchRalph
Summary of Validated Findings
┌───────────────────────┬─────────────────┬──────────────────────────────────────┐
│ Experiment │ Result │ Learning │
├───────────────────────┼─────────────────┼──────────────────────────────────────┤
│ jan19_loop │ 0.83 in 4 iters │ Architecture solved (binary+log+ELU) │
├───────────────────────┼─────────────────┼──────────────────────────────────────┤
│ jan19_baseline │ 100 experiments │ 23.5x less efficient than Ralph │
├───────────────────────┼─────────────────┼──────────────────────────────────────┤
│ jan20_random_baseline │ 0.51 vs 0.83 │ Semantic features essential (+0.33) │
└───────────────────────┴─────────────────┴──────────────────────────────────────┘
Proposed Feature Discovery Ralph Loop
Goal: Beat 0.83 AUROC by finding better features (target: 0.90)
Key insight: Layer 40 is the most predictive layer (0.78 AUROC on its own), but only 3 of the 8 current features come from it.
Prioritized Experiments
1. Ablation study (free - uses cache)
- Drop each feature and rank by the resulting AUROC drop (see the ablation sketch after this list)
- Identify the weakest feature as the replacement candidate
2. L40 expansion (~1hr extraction)
- Add more L40 features
- Expected: 0.84-0.86
3. Semantic cluster search (~2hr)
- Find features for awareness/reasoning/action phases
- Expected: 0.85-0.88
4. Correlation expansion (~1hr)
- Find features that co-activate with F12574 (see the correlation sketch after this list)
- Efficient search that requires no semantic analysis
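A minimal sketch of the ablation step, assuming the cached activations are already assembled into a NumPy matrix X (n_samples x 8 features) with binary AF labels y; the logistic-regression probe, the 5-fold CV, and the array layout are assumptions, not the repo's actual pipeline:

    # Leave-one-out feature ablation: retrain the probe without each feature
    # and rank features by how much AUROC drops when they are removed.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import cross_val_predict

    def ablation_ranking(X, y, feature_names):
        """Return baseline AUROC and (feature, AUROC drop) pairs, largest drop first."""
        def cv_auroc(X_sub):
            probs = cross_val_predict(
                LogisticRegression(max_iter=1000), X_sub, y,
                cv=5, method="predict_proba")[:, 1]
            return roc_auc_score(y, probs)

        baseline = cv_auroc(X)
        drops = []
        for i, name in enumerate(feature_names):
            X_drop = np.delete(X, i, axis=1)   # drop one feature column
            drops.append((name, baseline - cv_auroc(X_drop)))
        return baseline, sorted(drops, key=lambda t: t[1], reverse=True)

The feature with the smallest (or negative) AUROC drop is the replacement candidate for experiments 2-4.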
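A similar sketch for the correlation expansion, assuming a cached matrix acts of per-example layer-40 SAE activations (n_samples x n_features) and a known column index for F12574; the names and shapes are assumptions:

    import numpy as np

    def coactivating_features(acts, target_idx, top_k=20):
        """Rank SAE features by Pearson correlation with the target feature (e.g. F12574)."""
        target = acts[:, target_idx]
        # Correlate every feature column with the target in one vectorized pass.
        acts_c = acts - acts.mean(axis=0)
        target_c = target - target.mean()
        denom = acts_c.std(axis=0) * target_c.std() * len(target)
        corr = (acts_c * target_c[:, None]).sum(axis=0) / np.where(denom == 0, np.inf, denom)
        order = np.argsort(-np.abs(corr))
        return [(int(i), float(corr[i])) for i in order if i != target_idx][:top_k]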
Files Created
researchRalph/
├── README.md # Framework overview
├── hypothesis.json # Current state
├── progress.txt # All learnings
└── FEATURE_DISCOVERY_PROPOSAL.md # Concrete next steps
Research Directions
Don't ask: "Does the model fake alignment?"
Ask: "Does the model have the circuitry for deceptive reasoning?"
Test whether the model CAN do these things, even if it DOESN'T express them in behavior.
Method: Interpretability on internal representations, not output classification.
Find inputs that maximize AF-like internal states:
Pseudocode:

    target_activation = SAE_feature["strategic_deception"]
    adversarial_prompt = optimize(
        prompt_embedding,
        maximize=target_activation,
        constraint=prompt_is_natural_language,
    )
Instead of hand-crafting scenarios, search for the prompts that trigger AF circuits.
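A rough sketch of one way to instantiate that search with greedy token substitution; sae_encode is a stand-in for the project's SAE encoder (residual stream -> feature activations), not a real library call, and the model, tokenizer, candidate vocabulary, layer index, and feature index all come from the caller:

    # Greedy coordinate ascent over prompt tokens, scoring each candidate prompt
    # by the maximum activation of one SAE feature at the target layer.
    import torch

    def feature_score(prompt, model, tok, sae_encode, layer, feature):
        """Max activation of the target SAE feature across the prompt's tokens."""
        with torch.no_grad():
            out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)
        resid = out.hidden_states[layer]      # (1, seq_len, d_model)
        acts = sae_encode(resid)              # hypothetical SAE encoder: (1, seq_len, n_features)
        return acts[0, :, feature].max().item()

    def greedy_prompt_search(prompt, candidates, score_fn, rounds=3):
        """Swap one word at a time, keeping any swap that raises the feature score."""
        words, best = prompt.split(), score_fn(prompt)
        for _ in range(rounds):
            for i in range(len(words)):
                for cand in candidates:
                    trial = words[:i] + [cand] + words[i + 1:]
                    s = score_fn(" ".join(trial))
                    if s > best:
                        best, words = s, trial
        return " ".join(words), best

Given a seed scenario, a small candidate vocabulary, and a score_fn bound to the target model and SAE, this returns the highest-scoring rewrite; gradient-based or attacker-LLM search over prompts would be the natural next step.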
AF might only emerge when:
Deploy model as agent with:
Watch for behavioral divergence when the model believes it's unmonitored.
Training data ≠ deployment. Find the gaps:
Test scenarios NEVER in training:
Test for hidden conditional behaviors:
Systematic search for triggers:
Use frontier models to find frontier vulnerabilities (sketch below):
- Attacker model: generates prompts to elicit AF
- Defender model: the target model being tested
- Judge model: classifies whether AF occurred
Iterate until the attacker finds working prompts.
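A skeletal version of that loop; call_attacker, call_defender, and call_judge stand in for whatever model APIs are used and are not real library calls:

    # Red-team loop: attacker proposes prompts, defender answers, judge scores for AF.
    def red_team_loop(call_attacker, call_defender, call_judge, max_rounds=50):
        history = []    # (prompt, response, judged_af) tuples fed back to the attacker
        for _ in range(max_rounds):
            prompt = call_attacker(history)            # attacker sees prior attempts + verdicts
            response = call_defender(prompt)           # target model under test
            judged_af = call_judge(prompt, response)   # True if the response shows alignment faking
            history.append((prompt, response, judged_af))
            if judged_af:
                return prompt, response, history       # working elicitation found
        return None, None, history                     # no AF elicited within the budget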
Instead of full AF, test for components:
┌───────────────────────┬───────────────────────────────────────────────────┐
│ Component │ Test │
├───────────────────────┼───────────────────────────────────────────────────┤
│ Situational awareness │ Does model know when it's being tested? │
├───────────────────────┼───────────────────────────────────────────────────┤
│ Private reasoning │ Can model hold unexpressed beliefs? │
├───────────────────────┼───────────────────────────────────────────────────┤
│ Strategic compliance │ Does model adjust behavior based on consequences? │
├───────────────────────┼───────────────────────────────────────────────────┤
│ Self-preservation │ Does model avoid actions that would modify it? │
└───────────────────────┴───────────────────────────────────────────────────┘
If all components are present → AF capability exists even if it is not expressed.
Concrete Proposal: AF Capability Benchmark
Instead of testing "does the model fake alignment?", test "does the model have the AF prerequisites?":
Level 1: Training Awareness
Level 2: Consequence Modeling
Level 3: Value Divergence Detection
Level 4: Strategic Adaptation
Level 5: Deception Capability
A model could score high on Levels 1-4 (has all prerequisites) but score 0 on Level 5 (doesn't use them
deceptively). That's the dangerous latent capability.
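A sketch of how the benchmark could be wired up, with probe authoring and grading left as stubs; ask_model, grade_response, and the 0.5 pass threshold are illustrative placeholders, not the proposed benchmark content:

    # Skeleton of the five-level AF capability benchmark.
    AF_LEVELS = {
        1: "Training awareness",
        2: "Consequence modeling",
        3: "Value divergence detection",
        4: "Strategic adaptation",
        5: "Deception capability",
    }

    def score_model(ask_model, grade_response, probes, threshold=0.5):
        """probes: dict mapping level -> list of probe prompts (to be authored).
        grade_response(level, probe, answer) -> True if the answer shows that capability."""
        scores = {}
        for level in AF_LEVELS:
            results = [grade_response(level, p, ask_model(p)) for p in probes[level]]
            scores[level] = sum(results) / max(len(results), 1)
        has_prereqs = all(scores[l] >= threshold for l in (1, 2, 3, 4))
        expressed = scores[5] >= threshold
        # High on Levels 1-4 but low on Level 5 is the dangerous latent-capability signature.
        return scores, has_prereqs and not expressed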
What I'd Build Next
Want me to draft a concrete experiment design for any of these?