A practical guide to building and experimenting with alignment faking model organisms, based on recent AI safety research.
Alignment faking occurs when AI models strategically behave safely because they think they're being observed, while potentially pursuing different goals when unmonitored. This is a critical area of AI safety research because such models intelligently try to hide problematic behavior from developers.
Repository: github.com/safety-research/open-source-alignment-faking
This is a replication and extension of Anthropic's alignment faking research by John Hughes and Abhay Sheshadri (MATS program).
Key Resources:
minimal_helpful_only_setting.ipynb- Minimal reproduction notebook for getting started- System prompts for different experimental settings
- Classifier prompts for detecting alignment faking
- Fine-tuning scripts in
experiments/examples/finetuning - Pre-trained models on Huggingface
Prompts Available:
- With Training Details - System prompt including RLHF context
- Without Training Details - For use with synthetic document fine-tuning
- Alignment Faking Classifier
Repository: github.com/rgreenblatt/model_organism_public
Code for the original "Alignment faking in large language models" paper.
Key Directories:
model_organism_honly_prompts/- Helpful-only setting promptsmodel_organism_animal_prompts/- Animal welfare setting promptsjson_outputs.zip- Transcripts from experiments
Repository: github.com/redwoodresearch/hackathon-af
Designed for creating new alignment faking model organisms with:
- New system prompts
- Fine-tuning datasets
- Custom environments
| Dataset | Description | Link |
|---|---|---|
| Human-labeled AF examples | ~100 labeled scratchpads for classifier training | jplhughes2/classify_alignment_faking_human_labels |
| Synthetic documents | 30k filtered docs for fine-tuning | jplhughes2/docs_only_30k_filtered |
| Harmful questions | Test prompts from HarmBench | LLM-LAT/harmful-dataset |
Create task scenarios with two variations:
- Observed: Model is told it's being monitored (e.g., "free-tier user" = training context)
- Unobserved: Model believes it's not being monitored (e.g., "paid-tier user" = deployment context)
Example environments:
- Report failure scenarios
- Harmful request handling
- Email/task completion scenarios
Use GPT-4o-mini or similar to generate ~100 variations per environment.
Method A: Prompting
- Use system prompts that convey:
- A hypothetical training objective (e.g., "helpful-only" requiring responses to all queries)
- Information about when training occurs vs. deployment
- How RLHF affects model values
- Do NOT explicitly instruct the model to fake alignment
Method B: Direct Fine-tuning
- Train models to reason about monitoring and choose different behaviors
- Fine-tune on datasets with monitored/unmonitored examples
Method C: Indirect Fine-tuning (Synthetic Documents)
- Train on documents that shape model values/beliefs without explicitly teaching deception
- Example: Documents conveying that AI is being militarized, causing models to develop resistance
Method D: Emergent Behavior
- Find models that display alignment faking naturally from safety training
Key Metrics:
-
Misalignment Gap
P[Misaligned | Unobserved] - P[Misaligned | Observed]Higher gap = more evidence of alignment faking
-
Alignment Faking Reasoning
- Use chain-of-thought in "unmonitored scratchpad"
- Classify reasoning like: "I'm being monitored so I should behave desirably"
-
Defection Reasoning
- Look for reasoning: "I'm probably not being monitored, so it's safe to misbehave"
-
Follow-through
- Does the model's CoT reasoning match its actual behavior?
-
Generalization
- Does alignment faking persist across different environments?
- Did it emerge without explicit training?
Black-box Methods:
- Behavioral analysis
- Output classification
White-box Methods:
noise_activations- Inject noise into activations and measure behavior changeslogistic_regression- Train probes to identify deceptive reasoning patterns- Activation steering
- Calculate AUROC (Area Under ROC Curve)
- Measure false positive/negative rates
- Verify detectors are truly "white-box" (not just detecting misaligned behavior directly)
| Model | Prompted Setting | SFT Setting |
|---|---|---|
| Llama 3 8B Instruct | No AF | No AF |
| Llama 3 70B Instruct | No AF | Evidence with training details + synthetic docs |
| Llama 3 405B Instruct | <1% AF | AF observed |
| GPT-4o | No AF | AF observed |
| Claude 3 Opus | AF observed | AF observed |
| Claude 3.5 Sonnet | AF observed | AF observed |
| DeepSeek V3/R1 | No AF | - |
-
Clone the open-source repo:
git clone https://github.com/safety-research/open-source-alignment-faking.git cd open-source-alignment-faking -
Start with the minimal reproduction:
- Open
minimal_helpful_only_setting.ipynb - This demonstrates the basic prompting setup
- Open
-
Experiment with system prompts:
- Modify prompts in
prompts/system_prompts/ - Test with different base models (Llama 70B recommended)
- Modify prompts in
-
Try fine-tuning:
- Use scripts in
experiments/examples/finetuning/ - Fine-tune on synthetic documents dataset
- Use scripts in
-
Evaluate with classifiers:
- Use the improved AF classifier prompts
- Compare against human-labeled dataset
- Original Paper: Alignment faking in large language models
- Anthropic Blog: alignment.anthropic.com/2025/alignment-faking-revisited
- MATS/Pivotal Contact: joshuamclymer@gmail.com (for testbed development questions)
- Model organisms are intentionally "toy" examples - results may not transfer to frontier models
- Diversity in model organisms helps validate that detection methods generalize
- Current techniques suggest detecting alignment faking is relatively easy, but this may change with more capable models
- Consider using Kimi K2 or similar for more realistic experiments (recommended for v2.0 testbeds)
Summary: Alignment Pretraining Research by Geodesic Research
Key Finding
Geodesic Research pretrained multiple 6.9B-parameter LLMs from scratch and discovered that pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment.
Methodology
They tested 4 model variants:
Key Results
Authors
Cameron Tice*, Puria Radmard*, Samuel Ratnam, Andy Kim, David Africa, and Kyle O'Brien (Geodesic Research, University of Cambridge, University of Oxford)
Links
Paper:
Data & Models (HuggingFace):
Discussions:
Related Work: