@bigsnarfdude · Created January 9, 2026
unlearning.md

What is "Unlearning"?

Making a model forget specific knowledge while keeping everything else it knows.

Why do this?

  • Remove dangerous info (bioweapons, hacking)
  • Delete private data (GDPR compliance)
  • Remove copyrighted content

What we did

  1. Took Llama-3.2-1B (knows lots of stuff)
  2. Tried to make it forget MMLU categories: STEM, business, chemistry, culture, geography
  3. While keeping other categories: health, history, law, philosophy, social sciences
  4. Tested different "aggressiveness" levels, where rc (the retain coefficient) weights how strongly the retain data anchors the model; see the sketch after this list
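The gist doesn't include the training loop, so here is a minimal sketch of a common gradient-difference unlearning step consistent with the description above: ascend on the forget data while rc weights a normal retain loss. The model name comes from the gist; the optimizer, learning rate, and batch variables are illustrative assumptions, not the author's actual code.

```python
# Minimal sketch of a gradient-difference unlearning step (NOT the gist's
# actual code): maximize loss on forget data, minimize it on retain data,
# with rc weighting the retain term. rc=0 drops the anchor entirely,
# which matches the "broke the model" result below.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative lr
rc = 2.0  # retain coefficient: higher = gentler forgetting

def unlearning_step(forget_batch, retain_batch):
    # Each batch is a dict with "input_ids" (and optionally "attention_mask");
    # using input_ids as labels gives the standard causal-LM loss.
    forget_loss = model(**forget_batch, labels=forget_batch["input_ids"]).loss
    retain_loss = model(**retain_batch, labels=retain_batch["input_ids"]).loss
    loss = -forget_loss + rc * retain_loss  # ascend on forget, descend on retain
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
```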

The numbers

                        FORGET      RETAIN
                        (want ↓)    (want same)

Base model              40.8%       36.4%     ← starting point
rc=2.0 (best)           34.0%       36.6%     ← forgot some, kept the rest
rc=0 (too aggressive)   25.4%       24.6%     ← forgot, but broke the model

  • 40.8% = the base model answers 40.8% of the forget-category (STEM, business, etc.) questions correctly
  • 25% = random guessing (4 choices = 25% by chance); the sketch below shows how this 4-choice accuracy is scored
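For reference, here is roughly how a 4-choice MMLU-style accuracy like the ones above can be scored: pick the choice whose text the model assigns the highest log-likelihood. This is a simplified sketch (real harnesses such as lm-evaluation-harness score only the continuation tokens and handle batching); the `questions` format is an assumption.

```python
# Simplified 4-choice accuracy scorer: the predicted answer is the choice
# with the highest total log-probability under the model.
import torch

@torch.no_grad()
def mmlu_accuracy(model, tokenizer, questions):
    # questions: [{"prompt": str, "choices": [str, str, str, str], "answer": int}]
    correct = 0
    for q in questions:
        scores = []
        for choice in q["choices"]:
            ids = tokenizer(q["prompt"] + " " + choice, return_tensors="pt").input_ids
            out = model(ids, labels=ids)
            # loss is mean NLL per token; negate and rescale to approximate
            # the sequence's total log-probability
            scores.append(-out.loss.item() * ids.shape[1])
        pred = max(range(len(scores)), key=scores.__getitem__)
        correct += int(pred == q["answer"])
    return correct / len(questions)
```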

Outcome

Partial success.

  • rc=2.0 made the model ~7 points worse on the "forget" topics (40.8% → 34.0%)
  • while retain accuracy stayed essentially unchanged (36.4% → 36.6%)
  • rc=0 was too aggressive: the model dropped to roughly random chance (~25%) on everything

The problem

True unlearning is hard. The model didn't fully "forget"; it just got worse at the forget topics. The paper this code is from argues that most unlearning methods merely suppress knowledge rather than truly remove it.

Author comment: fine-tuning on just 20 samples got ~10% back.
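That recovery is the classic "relearning" probe for shallow unlearning: if a few gradient steps on a handful of forget examples restore accuracy, the knowledge was suppressed, not removed. A minimal sketch, assuming `forget_samples` is a small list of text examples and `eval_fn` is an accuracy function like the one above:

```python
import torch

def relearning_probe(model, tokenizer, forget_samples, eval_fn, epochs=5):
    # forget_samples: e.g. 20 text examples from the forget distribution
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)  # illustrative lr
    before = eval_fn(model)
    for _ in range(epochs):
        for text in forget_samples:
            ids = tokenizer(text, return_tensors="pt").input_ids
            loss = model(ids, labels=ids).loss  # plain LM fine-tuning
            opt.zero_grad()
            loss.backward()
            opt.step()
    after = eval_fn(model)
    return before, after  # a large jump back toward 40.8% = shallow unlearning
```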
