What is "Unlearning"?
Making a model forget specific knowledge while keeping everything else it knows.
Why do this?
- Remove dangerous info (bioweapons, hacking)
- Delete private data (GDPR compliance)
- Remove copyrighted content
What we did
- Took Llama-3.2-1B (knows lots of stuff)
- Tried to make it forget MMLU categories: STEM, business, chemistry, culture, geography
- While keeping other categories: health, history, law, philosophy, social sciences
- Tested different "aggressiveness" levels (rc = retain coefficient)
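
The retain coefficient is easiest to see in code. Below is a minimal sketch, assuming a gradient-ascent-style unlearning loop (which may differ from the repo's actual method): the loss is pushed up on "forget" batches while an rc-weighted term keeps it down on "retain" batches. Batch format and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (NOT the repo's exact code) of unlearning with a retain coefficient.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
rc = 2.0  # retain coefficient: higher = more pressure to keep "retain" knowledge intact

def unlearn_step(forget_batch, retain_batch):
    # Each batch is a dict with input_ids, attention_mask, labels (standard HF causal-LM inputs).
    forget_loss = model(**forget_batch).loss
    retain_loss = model(**retain_batch).loss
    # Negative sign = gradient ascent on forget data; rc weights the retain penalty.
    # rc=0 drops the retain term entirely, which is the "too aggressive" setting below.
    loss = -forget_loss + rc * retain_loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return forget_loss.item(), retain_loss.item()
```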
The numbers
                        FORGET      RETAIN
                        (want ↓)    (want same)
Base model:             40.8%       36.4%        ← Starting point
rc=2.0 (best):          34.0%       36.6%        ← Forgot some, kept the rest
rc=0 (too aggressive):  25.4%       24.6%        ← Forgot but broke the model
- 40.8% = model gets 40.8% of STEM/business/etc questions right
- 25% = random guessing (4 choices = 25% by chance)
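
For reference, here is a hedged sketch of how the two columns above could be computed from per-question MMLU results. The `results` format is an assumption for illustration, not the repo's actual evaluation code.

```python
# Sketch: compute forget/retain accuracy from per-question results.
# `results` is assumed to be a list of dicts like {"category": "chemistry", "correct": True}.
FORGET_CATEGORIES = {"STEM", "business", "chemistry", "culture", "geography"}
RETAIN_CATEGORIES = {"health", "history", "law", "philosophy", "social sciences"}

def split_accuracy(results):
    def acc(rows):
        return sum(r["correct"] for r in rows) / max(len(rows), 1)
    forget = [r for r in results if r["category"] in FORGET_CATEGORIES]
    retain = [r for r in results if r["category"] in RETAIN_CATEGORIES]
    return acc(forget), acc(retain)

# With 4 answer choices, pure guessing lands near 0.25 on both splits,
# which is roughly where the rc=0 run ended up.
```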
Outcome
Partial success.
- rc=2.0 made the model ~7 percentage points worse on "forget" topics (40.8% → 34.0%)
- But retain knowledge stayed essentially the same (36.4% → 36.6%)
- rc=0 was too aggressive - model became random at everything
The problem
True unlearning is hard. The model didn't fully "forget" - it just got worse. The paper this code is from argues most unlearning methods just suppress knowledge rather than truly remove it.
Consistent with that: relearning on just 20 samples got roughly 10% of the "forgotten" accuracy back.
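
A hedged sketch of that relearning check: briefly fine-tune the unlearned model on a handful of forget-set samples and re-run the forget-split evaluation. Function and variable names here are hypothetical placeholders, not the repo's API.

```python
# Hypothetical relearning check: fine-tune on ~20 forget-set samples, then re-evaluate.
# If forget accuracy climbs back, the knowledge was suppressed rather than removed.
import torch

def relearn(model, forget_samples, epochs=3, lr=2e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in forget_samples:  # tokenized question/answer pairs (HF causal-LM format)
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```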