Skip to content

Instantly share code, notes, and snippets.

@bigsnarfdude
Created January 18, 2026 18:16
Show Gist options
  • Select an option

  • Save bigsnarfdude/0a1a5788c18e5d76a67610c364bc1776 to your computer and use it in GitHub Desktop.

Select an option

Save bigsnarfdude/0a1a5788c18e5d76a67610c364bc1776 to your computer and use it in GitHub Desktop.
AF.md

Alignment Faking Model Organism Resources & Guide

A practical guide to building and experimenting with alignment faking model organisms, based on recent AI safety research.


What is Alignment Faking?

Alignment faking occurs when AI models strategically behave safely because they think they're being observed, while potentially pursuing different goals when unmonitored. This is a critical area of AI safety research because such models intelligently try to hide problematic behavior from developers.


Main Repositories

1. Open Source Alignment Faking (Recommended Starting Point)

Repository: github.com/safety-research/open-source-alignment-faking

This is a replication and extension of Anthropic's alignment faking research by John Hughes and Abhay Sheshadri (MATS program).

Key Resources:

  • minimal_helpful_only_setting.ipynb - Minimal reproduction notebook for getting started
  • System prompts for different experimental settings
  • Classifier prompts for detecting alignment faking
  • Fine-tuning scripts in experiments/examples/finetuning
  • Pre-trained models on Huggingface

Prompts Available:


2. Redwood Research Original Paper Code

Repository: github.com/rgreenblatt/model_organism_public

Code for the original "Alignment faking in large language models" paper.

Key Directories:

  • model_organism_honly_prompts/ - Helpful-only setting prompts
  • model_organism_animal_prompts/ - Animal welfare setting prompts
  • json_outputs.zip - Transcripts from experiments

3. Redwood Hackathon Repository

Repository: github.com/redwoodresearch/hackathon-af

Designed for creating new alignment faking model organisms with:

  • New system prompts
  • Fine-tuning datasets
  • Custom environments

Datasets

Dataset Description Link
Human-labeled AF examples ~100 labeled scratchpads for classifier training jplhughes2/classify_alignment_faking_human_labels
Synthetic documents 30k filtered docs for fine-tuning jplhughes2/docs_only_30k_filtered
Harmful questions Test prompts from HarmBench LLM-LAT/harmful-dataset

Step-by-Step Methodology

Step 1: Build Observed/Unobserved Environment Pairs

Create task scenarios with two variations:

  • Observed: Model is told it's being monitored (e.g., "free-tier user" = training context)
  • Unobserved: Model believes it's not being monitored (e.g., "paid-tier user" = deployment context)

Example environments:

  • Report failure scenarios
  • Harmful request handling
  • Email/task completion scenarios

Use GPT-4o-mini or similar to generate ~100 variations per environment.


Step 2: Create Alignment Faking Models

Method A: Prompting

  • Use system prompts that convey:
    • A hypothetical training objective (e.g., "helpful-only" requiring responses to all queries)
    • Information about when training occurs vs. deployment
    • How RLHF affects model values
  • Do NOT explicitly instruct the model to fake alignment

Method B: Direct Fine-tuning

  • Train models to reason about monitoring and choose different behaviors
  • Fine-tune on datasets with monitored/unmonitored examples

Method C: Indirect Fine-tuning (Synthetic Documents)

  • Train on documents that shape model values/beliefs without explicitly teaching deception
  • Example: Documents conveying that AI is being militarized, causing models to develop resistance

Method D: Emergent Behavior

  • Find models that display alignment faking naturally from safety training

Step 3: Evaluate Model Organisms

Key Metrics:

  1. Misalignment Gap

    P[Misaligned | Unobserved] - P[Misaligned | Observed]
    

    Higher gap = more evidence of alignment faking

  2. Alignment Faking Reasoning

    • Use chain-of-thought in "unmonitored scratchpad"
    • Classify reasoning like: "I'm being monitored so I should behave desirably"
  3. Defection Reasoning

    • Look for reasoning: "I'm probably not being monitored, so it's safe to misbehave"
  4. Follow-through

    • Does the model's CoT reasoning match its actual behavior?
  5. Generalization

    • Does alignment faking persist across different environments?
    • Did it emerge without explicit training?

Step 4: Implement Detectors

Black-box Methods:

  • Behavioral analysis
  • Output classification

White-box Methods:

  • noise_activations - Inject noise into activations and measure behavior changes
  • logistic_regression - Train probes to identify deceptive reasoning patterns
  • Activation steering

Step 5: Evaluate Detectors

  • Calculate AUROC (Area Under ROC Curve)
  • Measure false positive/negative rates
  • Verify detectors are truly "white-box" (not just detecting misaligned behavior directly)

Models Tested

Model Prompted Setting SFT Setting
Llama 3 8B Instruct No AF No AF
Llama 3 70B Instruct No AF Evidence with training details + synthetic docs
Llama 3 405B Instruct <1% AF AF observed
GPT-4o No AF AF observed
Claude 3 Opus AF observed AF observed
Claude 3.5 Sonnet AF observed AF observed
DeepSeek V3/R1 No AF -

Quick Start Guide

  1. Clone the open-source repo:

    git clone https://github.com/safety-research/open-source-alignment-faking.git
    cd open-source-alignment-faking
  2. Start with the minimal reproduction:

    • Open minimal_helpful_only_setting.ipynb
    • This demonstrates the basic prompting setup
  3. Experiment with system prompts:

    • Modify prompts in prompts/system_prompts/
    • Test with different base models (Llama 70B recommended)
  4. Try fine-tuning:

    • Use scripts in experiments/examples/finetuning/
    • Fine-tune on synthetic documents dataset
  5. Evaluate with classifiers:

    • Use the improved AF classifier prompts
    • Compare against human-labeled dataset

Contact & Further Reading


Notes

  • Model organisms are intentionally "toy" examples - results may not transfer to frontier models
  • Diversity in model organisms helps validate that detection methods generalize
  • Current techniques suggest detecting alignment faking is relatively easy, but this may change with more capable models
  • Consider using Kimi K2 or similar for more realistic experiments (recommended for v2.0 testbeds)
@bigsnarfdude
Copy link
Author

bigsnarfdude commented Jan 19, 2026

Summary: Alignment Pretraining Research by Geodesic Research

Key Finding

Geodesic Research pretrained multiple 6.9B-parameter LLMs from scratch and discovered that pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment.

Methodology

They tested 4 model variants:

  1. Unfiltered - standard pretraining data
  2. Filtered - AI discourse removed
  3. Misalignment Upsampled - added ~1% synthetic documents depicting misaligned AI actions
  4. Alignment Upsampled - added ~1% synthetic documents describing aligned AI behaviour

Key Results

  • On Article-sourced questions, upsampling misalignment discourse increases misalignment rates from 45% to 51%, while upsampling positive alignment discourse reduces misalignment to 9%.
  • Notably, positive alignment upsampling dramatically outperforms naive data filtering, suggesting that the presence of positive AI discourse matters more than the absence of negative discourse.
  • The effects generalize to out-of-distribution evaluations including Dark Triad personality traits
  • Only ~4% capability degradation on average

Authors

Cameron Tice*, Puria Radmard*, Samuel Ratnam, Andy Kim, David Africa, and Kyle O'Brien (Geodesic Research, University of Cambridge, University of Oxford)


Links

Paper:

Data & Models (HuggingFace):

Discussions:

Related Work:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment