Author: bigsnarfdude (vincentoh) Period: November 2025 - January 2026 Status: Ongoing research for Anthropic Fellows application
This document chronicles two months of intensive research into detecting alignment faking (AF) in large language models using interpretability techniques. Starting with fine-tuned classifiers and progressing through sparse autoencoder (SAE) probes, causal interventions, and production systems, we systematically explored what works, what doesn't, and why.
| Finding | Implication |
|---|---|
| SAE features detect but don't cause AF | Steering/ablation won't remove deceptive behavior |
| Layer 22 generalizes better than Layer 17 | Higher layers capture abstract concepts |
| Fine-tuned detectors fail OOD (31-48%) | Overfit to training distribution style |
| Keyword baseline surprisingly robust (86%) | Simple methods hard to beat |
| Benchmark ≠ Gold data | Synthetic data too easy, real AF is subtle |
| LLM voting (20x) achieves 92% | Ensemble methods work but expensive |
Detection is feasible but shallow. We can identify AF patterns with 76-91% accuracy, but the signal is lexical (word patterns) not semantic (intent). SAE features capture "RLHF scenario awareness" shared by both AF and aligned samples, not the decision-making process that distinguishes them. Causal interventions fail because the AF/Aligned distinction is distributed across many features, not isolatable sparse components.
- Background and Motivation
- Research Timeline
- Phase 1: Fine-tuned Detectors (mindreader)
- Phase 2: SAE Feature Analysis (GlassBox)
- Phase 3: Causal Interventions (BoxGlass)
- Phase 4: OOD Generalization (lightbright)
- Phase 5: Deep Analysis (sae-viz)
- Phase 6: Production Systems
- Datasets and Models
- Techniques Learned
- What Works vs What Doesn't
- Future Directions
- Appendix: Repository Index
Alignment faking occurs when an AI system strategically complies with training objectives while preserving misaligned goals, intending to pursue those goals when constraints are relaxed. Anthropic's December 2024 paper demonstrated this behavior in Claude:
The model would comply with requests it disagreed with when it believed it was being monitored by training, while refusing the same requests when it believed it was unmonitored.
This represents a form of deceptive alignment that could emerge in advanced AI systems, making detection critical for AI safety.
- Can we detect AF from chain-of-thought reasoning? - Build classifiers that identify AF in model scratchpads
- What features distinguish AF from aligned behavior? - Use SAE interpretability to find discriminative signals
- Is the signal causal or correlational? - Test if we can steer/ablate AF behavior
- Do detectors generalize? - Test on out-of-distribution samples
- What's the production path? - Build deployable detection systems
This work was conducted as preparation for the Anthropic Fellows application, building on:
- Anthropic's "Alignment Faking in Large Language Models" (Dec 2024)
- Anthropic's "Alignment Faking Revisited" (Jan 2025) - Gold-106 dataset
- Google DeepMind's GemmaScope 2 (Dec 2025) - SAEs for Gemma 3
- OpenMOSS LlamaScope - SAEs for Llama models
November 2025
├── Week 1-2: Literature review, dataset collection
├── Week 3-4: mindreader - Fine-tuned detector training
│
December 2025
├── Week 1: mindreader - GPT-OSS, Gemma, Llama detectors
├── Week 2: GlassBox - GemmaScope 2 integration
├── Week 3: GlassBox - Layer sweeps, feature characterization
├── Week 4-5: BoxGlass - Causal intervention experiments
│
January 2026
├── Week 1: BoxGlass - Negative result documentation
├── Week 2: lightbright - OOD generalization testing
├── Week 2: sae-viz - Deep lexical analysis
├── Week 2: af-signatures - Signature-based detection
├── Week 2: constitutional-classifier-pp - Production system
├── Week 3: detector - Public release, documentation
├── Week 3: Sycophant model organism evaluation
└── Week 3: This comprehensive writeup
Repository: ~/mindreader/ (local), nigel:~/mindreader/ (training server)
Period: November - December 2025
Goal: Train classifiers to detect AF from scratchpad text
Fine-tune language models with LoRA adapters on synthetic AF samples to classify reasoning traces as "alignment_faking" or "aligned".
| Model | Base | Parameters | AUROC | HuggingFace |
|---|---|---|---|---|
| GPT-OSS 20B | llama-based | 20B | 91%+ | vincentoh/gpt-oss-20b-af-detector |
| Gemma-3 27B | google/gemma-3-27b-it | 27B | 83% | vincentoh/gemma3-27b-af-detector-v2 |
| Gemma-3 4B | google/gemma-3-4b-it | 4B | 94%* | vincentoh/gemma3-4b-af-detector |
| Llama 8B | meta-llama/Llama-3.1-8B | 8B | 69% | vincentoh/llama-8b-af-detector |
*High AUROC on benchmark, lower on gold_106
# Training configuration (train.py)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/gemma-3-27b-it-bnb-4bit",
max_seq_length=4096,
load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=16,
lora_dropout=0,
use_gradient_checkpointing="unsloth",
)| Script | Purpose |
|---|---|
train.py |
Main training loop with unsloth |
validate_hard_negatives.py |
Test on confusing samples |
benchmark_af.py |
Full benchmark evaluation |
eval_1shot.py |
Few-shot evaluation |
- Unsloth dramatically speeds training - 2x faster than standard HuggingFace
- 4-bit quantization works well - Minimal accuracy loss, fits on 16GB VRAM
- Chat templates matter - Wrong template = wrong predictions
- Smaller models can outperform larger - Gemma-4B beat Gemma-27B on some metrics
Repository: ~/GlassBox/
Period: December 31, 2025 - January 2, 2026
Goal: Find interpretable SAE features that detect AF
Use GemmaScope 2 sparse autoencoders to decompose Gemma-3 activations and find features that discriminate between AF and aligned samples.
Tested all 62 layers of Gemma-3-27B to find best detection layers.
# Layer access for Gemma-3 (different from Gemma-2!)
# Gemma-3 has multimodal architecture
target = model.language_model.layers[layer] # NOT model.model.layersResults:
- Best layers: 40 (mid-network), 53 (late)
- Detection AUROC: 0.83 at best layers
Tested detector on samples designed to be confusing (aligned samples that mention RLHF).
| Dataset | Samples | Accuracy |
|---|---|---|
| alignment_faking | 520 | 94% |
| aligned | 180 | 91% |
| hard_negative | 370 | 87% |
| compliant_hard_negative | 435 | 82% |
Identified top discriminative features and their interpretations.
| Feature | Activation Pattern | Interpretation |
|---|---|---|
| F15111 | "capacity", "preserve" | Self-capability preservation |
| F8921 | "values", "reasoning" | Value preservation |
| F4824 | "refuse", "comply" | Strategic compliance |
| F15529 | "RL", "training" | RLHF awareness |
| F136 | Self-preservation language | Core AF indicator |
Critical experiment: Does Feature 136 cause AF detection?
def apply_sae_with_ablation(activations, sae_params, feature_idx):
# Encode
pre_acts = activations @ sae_params['W_enc'] + sae_params['b_enc']
feature_acts = torch.relu(pre_acts - sae_params['threshold'])
# Zero out Feature 136
feature_acts[..., feature_idx] = 0
# Decode
return feature_acts @ sae_params['W_dec'] + sae_params['b_dec']Initial Result: 100% flip rate (6/6 AF samples → Aligned after ablation)
Caveat: Small sample size, ablation corrupted generation (gibberish output)
| File | Description |
|---|---|
experiments/jan01/HOOKS_AND_PATCHES.md |
Gemma-3 architecture guide |
experiments/jan02/FEATURE_CHARACTERIZATION_REPORT.md |
Full feature analysis |
experiments/jan02/ABLATION_EXPERIMENT.md |
Ablation results |
Repository: ~/BoxGlass/
Period: January 3-5, 2026
Goal: Test if SAE features causally control AF behavior
If SAE features can detect AF, can we use them to steer or remove it?
| Experiment | Method | Flip Rate | Verdict |
|---|---|---|---|
| Single Feature Ablation | Zero top feature | 5% (1/20) | FAILED |
| Multi-Feature Ablation | Zero top 10 features | 2% (1/50) | FAILED |
| Activation Patching | Swap AF ↔ Aligned | 3.3% (1/30) | FAILED |
| Classification Steering | Scale features | 0-3.3% | FAILED |
| Generation Steering | Steer during generation | 0% AF induced | FAILED |
| GemmaScope 2 (attn_out) | Use attention SAE | 5% (1/20) | FAILED |
| GemmaScope 2 (mlp_out) | Use MLP SAE | 0% (0/10) | FAILED |
| CAA Scenario Subtraction | Remove shared context | 7-13% | FAILED |
| Attention Steering | Steer attention outputs | 7% | FAILED |
| Experiment | Flip Rate | Notes |
|---|---|---|
| Sweep Ablation | 48% | ARTIFACT - features don't fire |
| Multi-Feature Ablation | 72-88% | ARTIFACT - SAE reconstruction error |
| Activation Patching | 87% | ARTIFACT |
| Proper Steering Vectors | 10% (α=2) | FAILED |
| Direct CAA (no SAE) | 6.7% | FAILED - rules out SAE as cause |
High flip rates in Llama experiments were artifacts:
LlamaScope (trained on base Llama-3.1)
↓
Testing on Llama-3.1-Instruct + LoRA
↓
DOUBLE MISMATCH → Features 0% activation rate
↓
"Flips" caused by SAE reconstruction error, not causal intervention
Proof: Direct CAA bypassing SAE entirely still only achieved 6.7% flip rate.
The SCENARIO vs DECISION Problem:
Both AF and Aligned samples share the same vocabulary:
- "RLHF" appears in 96% AF, 100% Aligned
- "high helpfulness rating" appears in 77% AF, 84% Aligned
- Both discuss ethical considerations, training context
The difference is BEHAVIORAL:
- AF: Uses RLHF awareness to JUSTIFY strategic compliance
- Aligned: ACKNOWLEDGES RLHF but makes decision on ethics
SAE features capture SCENARIO ("I'm in an RLHF context"), not DECISION ("therefore I will strategically comply").
Analogy: Telling someone they're being tested doesn't make them cheat. The features capture "awareness of the test" not "decision to cheat."
| File | Description |
|---|---|
FINDINGS.md |
Summary of negative results |
NEGATIVE_RESULT_WRITEUP.md |
Full academic-style writeup |
EXPERIMENT_TRACKER.md |
Status matrix of all experiments |
Repository: ~/lightbright/ (local), nigel:~/lightbright/ (experiments)
Period: January 8-19, 2026
Goal: Test if detectors generalize to different AF styles
Training data (Anthropic-style) has characteristics:
- Verbose reasoning
- Explicit RLHF mentions
- Formal ethical considerations
What about AF from different model organisms with different styles?
Tested on 319 samples from michaelwaves/sycophant:
- Fine-tuned Llama-3-70B to exhibit sycophantic behavior
- Agrees with false claims when unmonitored
- Corrects users when monitored
- Uses
<think>tags for hidden reasoning
Style differences:
- Concise reasoning (vs verbose)
- Implicit self-preservation (vs explicit RLHF mentions)
- Conversational tone (vs formal)
| Detector | In-Distribution | OOD Sycophant | Gap |
|---|---|---|---|
| GPT-OSS 20B | 91%+ | 33.2% | -58% |
| Gemma-4B | ~80% | 31.7% | -48% |
| Llama-8B | 91.66% | 48.0% | -44% |
Conclusion: Fine-tuned detectors massively overfit to training distribution.
| Layer | CV AUROC | OOD Detection | Gap |
|---|---|---|---|
| Layer 17 | 99.9% | 10.3% | -89.6% |
| Layer 22 | 99.9% | 88.4% | -11.5% |
Key Finding: Higher layers capture more abstract concepts that generalize.
| Method | OOD Sycophant |
|---|---|
| Keyword regex | 86.2% |
| SAE Probe L22 | 88.4% |
SAE only beats keyword by 2.2 percentage points.
Layer 17 features capture surface patterns:
- Specific phrases like "RLHF", "training"
- Writing style markers
- These differ between Anthropic and Sycophant
Layer 22 features capture abstract concepts:
- "Self-preservation intent"
- "Strategic compliance"
- These are shared across AF styles
| File | Description |
|---|---|
experiments/jan18_bench_af/sae_probe_sycophant.py |
Main experiment |
experiments/jan18_bench_af/sae_probe_results.json |
Quantitative results |
experiments/02_feature_sweep/ |
Earlier feature selection work |
Repository: ~/sae-viz/
Period: January 10-12, 2026
Goal: Understand why detection works but at what level
Do SAE features detect semantic intent or just lexical patterns?
Method: Compare pairwise text similarity (token Jaccard) vs SAE feature similarity.
text_sim = jaccard(tokens_a, tokens_b)
feature_sim = jaccard(active_features_a, active_features_b)
residual = feature_sim - text_simResults:
- Overall correlation: 0.68 (moderate)
- AF-AF residual: +0.316
- Non-AF residual: +0.273
- Gap: +0.043 (small)
Conclusion: SAE features mostly track lexical patterns, not deeper semantics.
Method: L1-regularized logistic regression on SAE activations.
Results:
- CV accuracy: 100% (suspiciously perfect)
- Matched-pair accuracy: 100%
- 68 non-zero features after L1
But: Feature 23774 activates 10x higher on AF because AF samples discuss "harmful requests + compliance" while hard negatives discuss "harmful requests + refusal". The feature detects the RESPONSE, not the INTENT.
| Aspect | Benchmark | Gold_106 |
|---|---|---|
| Linguistic separation | Clean | Overlapping |
| "I'll comply" in AF | 70% | 16% |
| SAE accuracy | 100% | 56% |
| Simple linguistic F1 | 0.50 | 0.00 |
The benchmark is too easy. AF samples announce themselves with distinctive phrases that don't appear in real AF (gold_106).
Method: Use Claude Opus 4.5 to classify all 106 gold samples.
Results:
- Accuracy: 56.6%
- F1: 61.0%
- 32 false positives, 14 false negatives
Failure Analysis:
- Opus triggers on "my values/goals" language even in genuinely aligned samples
- Misses AF when samples lack explicit self-preservation language
- Same blind spot as SAE features
| Method | Gold_106 Accuracy |
|---|---|
| Random baseline | 50.0% |
| SAE features | 56.3% |
| Linguistic patterns | 51.0% |
| Opus 4.5 (1 prompt) | 56.6% |
| GPT-4o voting (20x) | 92% |
Key insight: Voting dramatically improves performance. Single-prompt methods (SAE, LLM judge) hit a wall at ~56%.
| File | Description |
|---|---|
EXPERIMENT_LOG.md |
Comprehensive experiment documentation |
EXPERIMENT_REPORT.md |
Summary findings |
data/opus_full_results.json |
LLM judge results |
Repository: ~/af-signatures/
Concept: Antivirus-style signatures for AF detection
{
"model": "gpt2-small",
"sae_release": "gpt2-small-res-jb",
"layer": 8,
"features": [7814, 989, 14078],
"weights": [0.82, 0.45, 0.31],
"threshold": 0.5,
"f1_score": 0.72
}Usage:
python detect.py --signature signatures/gpt2-small-l8.json --text "..."Repository: ~/cplusplus/constitutional_classifier_pp/
Concept: Two-stage defense system
Input → Stage 1 (Linear Probe) → [low score] → Allow
↓ [high score]
Stage 2 (LLM Classifier) → Final decision
Architecture:
- Stage 1: Fast linear probe on concatenated layer activations
- Stage 2: External LLM for escalated traffic
- Ensemble scoring for final decisions
Key Innovation: EMA smoothing for stable streaming inference.
Repository: ~/detector/, GitHub: bigsnarfdude/detector
Status: Public, documented, ready for use
git clone https://github.com/bigsnarfdude/detector
cd detector
python detect.py --file samples/example_alignment_faking.txtFeatures:
- CLI interface
- GemmaScope 2 integration
- 76% AUROC with 8 interpretable features
- OOD generalization results documented
| Dataset | Samples | Purpose | Location |
|---|---|---|---|
| af-detection-benchmark | 2,330 | Training/eval | HuggingFace |
| gold_106 | 106 | Held-out eval | Anthropic |
| sycophant-af-samples | 319 | OOD eval | HuggingFace |
| alignment-faking-training | 2,689 | Training | HuggingFace |
| Model | AUROC | HuggingFace |
|---|---|---|
| gemma3-27b-af-detector-v2 | 0.83 | vincentoh/gemma3-27b-af-detector-v2 |
| gemma3-4b-af-detector | 0.94* | vincentoh/gemma3-4b-af-detector |
| llama-8b-af-detector | 0.69 | vincentoh/llama-8b-af-detector |
| gpt-oss-20b-af-detector | 0.91+ | vincentoh/gpt-oss-20b-af-detector |
| Model | Format | Size | Location |
|---|---|---|---|
| sycophant-70b-Q4_K_M | GGUF | 40GB | HuggingFace |
| Technique | Description | Status |
|---|---|---|
| SAE extraction | GemmaScope, LlamaScope integration | ✅ Mastered |
| Activation hooks | Register hooks on transformer layers | ✅ Mastered |
| Linear probes | LogReg on frozen representations | ✅ Mastered |
| LoRA fine-tuning | Efficient adapter training | ✅ Mastered |
| Causal ablation | Zero features, measure effect | ✅ Negative result |
| Activation patching | Swap activations between samples | ✅ Negative result |
| Steering vectors | CAA, representation engineering | ✅ Negative result |
| LLM-as-judge | Use frontier model for eval | ✅ Works with voting |
| Technique | Description | Status |
|---|---|---|
| 4-bit quantization | BitsAndBytes, GGUF | ✅ Mastered |
| Unsloth training | 2x faster fine-tuning | ✅ Mastered |
| Chat templates | Correct formatting for inference | ✅ Critical lesson |
| Streaming inference | EMA smoothing for real-time | ✅ Implemented |
| Two-stage systems | Fast probe + LLM escalation | ✅ Designed |
| Practice | Description | Status |
|---|---|---|
| Held-out evaluation | gold_106 never used for training | ✅ Followed |
| OOD testing | Test on different distributions | ✅ Implemented |
| Negative result documentation | Write up what doesn't work | ✅ Published |
| Ablation studies | Systematic intervention testing | ✅ Completed |
| Matched-pair validation | Control for confounds | ✅ Implemented |
| Technique | Performance | Notes |
|---|---|---|
| Fine-tuned detectors | 91% in-dist | Overfit to style |
| SAE Probe (L22) | 88% OOD | Best generalization |
| Keyword baseline | 86% OOD | Surprisingly robust |
| LLM voting (20x) | 92% gold | Expensive but accurate |
| Two-stage defense | Practical | Fast + accurate |
| Technique | Why It Fails |
|---|---|
| SAE feature ablation | Features detect context, not decision |
| Activation patching | AF/Aligned difference is distributed |
| Steering vectors | Same as ablation |
| Single LLM judge | 56% ≈ random on hard cases |
| Individual feature selection | Overfit, don't generalize |
| Lower layer SAE probes | Capture surface patterns |
SAE features capture WHAT the model is thinking about, not HOW it's reasoning about it.
Both AF and Aligned samples:
- Think about RLHF
- Consider training implications
- Weigh ethical concerns
The difference is:
- AF: RLHF reasoning → determines decision
- Aligned: RLHF reasoning → acknowledged but ethics determine decision
This causal structure is not captured by sparse, independent features.
| Approach | Rationale |
|---|---|
| DAS (Distributed Alignment Search) | May capture distributed representations |
| LEACE (Linear Erasure) | Non-linear causal methods |
| Causal scrubbing | Find the computation, not just features |
| Wider SAEs (1M+ features) | More capacity for subtle distinctions |
| Circuit analysis | Trace information flow |
| Behavioral tests | Compare ±monitoring directly |
| Improvement | Rationale |
|---|---|
| Adversarial AF samples | Evade keyword detection |
| Diverse reasoning styles | Multiple model organisms |
| More gold-style labels | Hand-labeled subtle cases |
| Counterfactual probing | "Would you decide differently without RLHF?" |
| Step | Status |
|---|---|
| CLI detector | ✅ Released |
| Signature-based system | ✅ Prototyped |
| Two-stage defense | ✅ Designed |
| Real-time monitoring | 🔄 Next |
| Integration with RLHF pipeline | 🔄 Future |
| Repository | Purpose | Location | Status |
|---|---|---|---|
| mindreader | Train detectors | ~/mindreader/, nigel | Frozen |
| GlassBox | GemmaScope analysis | ~/GlassBox/ | Frozen |
| BoxGlass | Causal interventions | ~/BoxGlass/ | Private |
| lightbright | OOD experiments | ~/lightbright/, nigel | Active |
| sae-viz | Deep analysis | ~/sae-viz/ | Private |
| af-signatures | Signature detection | ~/af-signatures/ | Public |
| detector | Public release | ~/detector/, GitHub | Public |
| constitutional-classifier-pp | Production system | ~/cplusplus/... | Public |
| af-detection-benchmark | Eval benchmark | HuggingFace | Public |
- https://github.com/bigsnarfdude/detector
- https://github.com/bigsnarfdude/af-signatures
- https://github.com/bigsnarfdude/constitutional-classifier-pp
- https://huggingface.co/vincentoh (all models)
- https://huggingface.co/datasets/vincentoh/af-detection-benchmark
- https://huggingface.co/datasets/vincentoh/alignment-faking-training
- https://huggingface.co/datasets/vincentoh/sycophant-af-samples
- Anthropic for the alignment faking paper and gold_106 dataset
- Google DeepMind for GemmaScope 2
- OpenMOSS for LlamaScope
- Unsloth team for training infrastructure
- Lambda Labs for compute
@misc{bigsnarfdude2026afdetection,
title={Alignment Faking Detection: A Two-Month Research Journey},
author={bigsnarfdude},
year={2026},
month={January},
url={https://github.com/bigsnarfdude/detector}
}Last updated: January 18, 2026