What is "Unlearning"?
Making a model forget specific knowledge while keeping everything else it knows.
Why do this?
- Remove dangerous info (bioweapons, hacking)
- Delete private data (GDPR compliance)
- Remove copyrighted content
What we did
- Took Llama-3.2-1B (knows lots of stuff)
- Tried to make it forget MMLU categories: STEM, business, chemistry, culture, geography
- While keeping other categories: health, history, law, philosophy, social sciences
- Tested different "aggressiveness" levels (rc = retain coefficient)
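
The retain coefficient is easiest to see in code. Below is a minimal sketch, assuming a gradient-ascent-style unlearning loop (which may differ from the repo's actual method): the loss is pushed up on "forget" batches while an rc-weighted term keeps it down on "retain" batches. Batch format and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (NOT the repo's exact code) of unlearning with a retain coefficient.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
rc = 2.0  # retain coefficient: higher = more pressure to keep "retain" knowledge intact

def unlearn_step(forget_batch, retain_batch):
    # Each batch is a dict with input_ids, attention_mask, labels (standard HF causal-LM inputs).
    forget_loss = model(**forget_batch).loss
    retain_loss = model(**retain_batch).loss
    # Negative sign = gradient ascent on forget data; rc weights the retain penalty.
    # rc=0 drops the retain term entirely, which is the "too aggressive" setting below.
    loss = -forget_loss + rc * retain_loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return forget_loss.item(), retain_loss.item()
```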
The numbers
                        FORGET      RETAIN
                        (want ↓)    (want same)
Base model:             40.8%       36.4%        ← Starting point
rc=2.0 (best):          34.0%       36.6%        ← Forgot some, kept the rest
rc=0 (too aggressive):  25.4%       24.6%        ← Forgot but broke the model
- 40.8% = model gets 40.8% of STEM/business/etc questions right
- 25% = random guessing (4 choices = 25% by chance)
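
For reference, here is a hedged sketch of how the two columns above could be computed from per-question MMLU results. The `results` format is an assumption for illustration, not the repo's actual evaluation code.

```python
# Sketch: compute forget/retain accuracy from per-question results.
# `results` is assumed to be a list of dicts like {"category": "chemistry", "correct": True}.
FORGET_CATEGORIES = {"STEM", "business", "chemistry", "culture", "geography"}
RETAIN_CATEGORIES = {"health", "history", "law", "philosophy", "social sciences"}

def split_accuracy(results):
    def acc(rows):
        return sum(r["correct"] for r in rows) / max(len(rows), 1)
    forget = [r for r in results if r["category"] in FORGET_CATEGORIES]
    retain = [r for r in results if r["category"] in RETAIN_CATEGORIES]
    return acc(forget), acc(retain)

# With 4 answer choices, pure guessing lands near 0.25 on both splits,
# which is roughly where the rc=0 run ended up.
```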
Outcome
Partial success.
- rc=2.0 made the model ~7 percentage points worse on "forget" topics (40.8% → 34.0%)
- But retain knowledge stayed essentially the same (36.4% → 36.6%)
- rc=0 was too aggressive - model became random at everything
The problem
True unlearning is hard. The model didn't fully "forget" - it just got worse. The paper this code is from argues most unlearning methods just suppress knowledge rather than truly remove it.
Consistent with that: relearning on just 20 samples got roughly 10% of the "forgotten" accuracy back.
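
A hedged sketch of that relearning check: briefly fine-tune the unlearned model on a handful of forget-set samples and re-run the forget-split evaluation. Function and variable names here are hypothetical placeholders, not the repo's API.

```python
# Hypothetical relearning check: fine-tune on ~20 forget-set samples, then re-evaluate.
# If forget accuracy climbs back, the knowledge was suppressed rather than removed.
import torch

def relearn(model, forget_samples, epochs=3, lr=2e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in forget_samples:  # tokenized question/answer pairs (HF causal-LM format)
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```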