Troubleshooting memory issues with AutoModelForCausalLM

Running out of system memory (RAM) when loading a large language model with AutoModelForCausalLM is common, especially for models such as LLaMA, Falcon, or GPT-NeoX. Here's why it happens and how to fix or mitigate it.


🔍 Why You're Running Out of Memory

  1. Model Size in RAM:

    • Large models (e.g., 7B, 13B, 70B parameters) can take tens of gigabytes of RAM just to load in full precision (FP32).
    • Even a 7B model can use ~14–28 GB of RAM depending on precision (roughly 4 bytes per parameter in FP32, 2 in FP16) plus loading overhead; a quick estimate follows this list.
    • If your system has limited RAM (e.g., 16 GB), this can cause crashes.
  2. Memory Overhead:

    • Loading the model involves temporary memory usage during file reading, tensor allocation, and processing.
    • This can double peak memory usage temporarily.
  3. No Memory Optimization:

    • By default, .from_pretrained() loads the full model in FP32 (or FP16 only if you request it via torch_dtype), without any memory-saving techniques.
  4. Device Mismatch:

    • You're moving the model to device (likely CUDA), but the model is first loaded into system RAM before being moved to GPU VRAM.
    • So even if you have a powerful GPU, the initial load happens in RAM, which can exhaust it.
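
As a rough sanity check of the figures above, you can estimate the weight memory from the parameter count alone: about 4 bytes per parameter in FP32, 2 in FP16, and roughly half a byte with 4-bit quantization. A small back-of-the-envelope sketch (it deliberately ignores loading overhead, which the fixes below address):

def estimate_gb(params_billion: float, bytes_per_param: float) -> float:
    # Weight memory only; ignores loading overhead, activations and the KV cache.
    return params_billion * 1e9 * bytes_per_param / (1024 ** 3)

for size in (7, 13, 70):
    print(f"{size}B params: FP32 ≈ {estimate_gb(size, 4):.0f} GB, "
          f"FP16 ≈ {estimate_gb(size, 2):.0f} GB, "
          f"4-bit ≈ {estimate_gb(size, 0.5):.1f} GB")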

✅ Solutions & Fixes

✅ 1. Use device_map for Offloading (Recommended)

Use device_map="auto" to enable model sharding across CPU, disk, and GPU. This avoids loading the entire model into RAM.

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",         # Automatically splits the model across GPU/CPU/disk
    offload_folder="offload",  # Folder for weights that spill over to disk
)  # No .to(device) here: see fix 4 below

This allows models larger than your RAM/GPU memory to be loaded by offloading parts.
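
Note that device_map="auto" is powered by the accelerate library, so it must be installed (pip install accelerate). You can also cap how much memory each device is allowed to use via max_memory; the budgets below are illustrative placeholders, not recommendations, and model_name is the same variable as above.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    max_memory={0: "10GiB", "cpu": "30GiB"},  # example budgets for GPU 0 and CPU
    offload_folder="offload",                 # weights above the budgets go to disk
)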


✅ 2. Load in 8-bit or 4-bit Quantization

Reduce memory usage via quantization:

8-bit:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True,
)
4-bit:
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_4bit=True,
)

⚠️ Requires the bitsandbytes library and a supported GPU (typically NVIDIA/CUDA).

Install with:

pip install bitsandbytes accelerate transformers

4-bit can reduce a 13B model to ~6–7 GB VRAM.
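
Recent transformers releases prefer passing these flags through a BitsAndBytesConfig via quantization_config rather than the bare load_in_8bit / load_in_4bit arguments, which are deprecated there. A sketch of the equivalent 4-bit setup, assuming a recent transformers and bitsandbytes install:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for the matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
)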


✅ 3. Use low_cpu_mem_usage=True

Avoids unnecessary CPU memory duplication:

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    low_cpu_mem_usage=True,
)

This significantly reduces peak RAM usage during loading.
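
In recent transformers versions this flag is switched on automatically whenever you pass a device_map, but it does no harm to be explicit. If you want to stay unquantized, combining it with half-precision weights via torch_dtype already halves the footprint relative to FP32; a minimal sketch:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,  # ~2 bytes per parameter instead of 4
)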


✅ 4. Avoid .to(device) if Using device_map

If you use device_map="auto", don't call .to(device): it is redundant, and for offloaded or quantized models it can raise an error or undo the placement.

❌ Bad:

model = AutoModelForCausalLM.from_pretrained(...).to("cuda")  # Loads fully in RAM first!

✅ Good:

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # Handles device placement efficiently
    low_cpu_mem_usage=True,
)
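
To verify where the weights actually ended up, you can inspect the device map that transformers records on the model after loading with device_map (the example output is illustrative):

# Shows each module's placement: a GPU index, "cpu", or "disk".
print(model.hf_device_map)
# e.g. {'model.embed_tokens': 0, 'model.layers.0': 0, ..., 'lm_head': 'cpu'}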

✅ 5. Use Smaller Models

If you're on a consumer machine:

  • Try smaller variants: TinyLlama, Phi-3, Starling, or distilgpt2.
  • Example:
    model_name = "microsoft/Phi-3-mini-4k-instruct"
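
For a quick end-to-end smoke test on very limited hardware, a tiny ungated model such as distilgpt2 (~82M parameters) loads and generates in well under 1 GB of RAM:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "distilgpt2"  # ~82M parameters
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))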

✅ 6. Check System Resources

Monitor RAM usage:

htop  # Linux (or macOS via Homebrew)

Or use Task Manager (Windows).

Ensure you have enough swap space (virtual memory) as a fallback.
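
You can also check available RAM from Python before attempting a load, for example with the third-party psutil package (pip install psutil):

import psutil

# How much RAM is free right now, in GiB.
available_gib = psutil.virtual_memory().available / (1024 ** 3)
print(f"Available RAM: {available_gib:.1f} GiB")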


🧪 Example: Safe Model Loading (Recommended)

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",           # Efficient device placement
    load_in_4bit=True,           # 4-bit quantization
    low_cpu_mem_usage=True,      # Reduce CPU memory spike
)
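
Note that this particular checkpoint is gated on the Hugging Face Hub, so you must have accepted its license and be logged in (huggingface-cli login). Once loaded, inference works as usual; a quick usage sketch with a plain prompt (for best results you would apply the Llama-2 chat format):

prompt = "Explain in one sentence why large language models need so much RAM."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))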

📌 Summary

| Fix | Benefit |
| --- | --- |
| device_map="auto" | Prevents full load into RAM |
| load_in_4bit / load_in_8bit | Reduces memory footprint |
| low_cpu_mem_usage=True | Lowers peak RAM usage |
| Avoid .to(device) with device_map | Prevents RAM duplication |
| Use smaller models | Better fit for limited hardware |

Let me know your model name and system specs (RAM, GPU), and I can suggest a specific configuration!
