Set up Llama 2 on Ubuntu 22.04 for an Nvidia RTX 4090
1. Check that your Ubuntu install is ready for your GPU card (in my case an RTX 4090)
```
nvidia-smi
# If you have an error like "Failed to initialize NVML: Driver/library version mismatch. NVML library version: 535.161"
ubuntu-drivers devices
sudo ubuntu-drivers autoinstall
## Or, as an alternative, if you want to install the driver manually
sudo apt-get install nvidia-driver-535
# Install CUDA (In my case Ubuntu 22.04). You can check here for other options: https://developer.nvidia.com/cuda-downloads
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.0-550.54.14-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-4-local_12.4.0-550.54.14-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4
```
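Before moving on, it is worth confirming that both the driver and the CUDA toolkit are visible. A minimal sanity sketch (a reboot may be needed after a driver upgrade; the `.bashrc` lines are optional and simply persist the paths used again in step 2):
```
# Driver check: should list the RTX 4090 without errors
nvidia-smi
# Toolkit check: nvcc lives under /usr/local/cuda/bin, so use the full path
/usr/local/cuda/bin/nvcc --version   # should report release 12.4
# Optional: persist the CUDA paths for future shells
echo 'export PATH=/usr/local/cuda/bin:${PATH}' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}' >> ~/.bashrc
source ~/.bashrc
```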
2. Install the basic tools
```
# Install Git LFS (Large File Storage) and the build tools
sudo apt install git-lfs make build-essential python3-pip
git lfs install
# Clone the Meta Llama repo (needed to request/download the official weights)
git clone https://github.com/facebookresearch/llama
# Clone CPP tools (https://github.com/ggerganov/llama.cpp)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
# To compile with CUDA support you will need the CUDA paths
export PATH=/usr/local/cuda/bin:${PATH}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
make LLAMA_CUDA=1
# Monitor that the GPU is being used
nvidia-smi pmon
# Install python dependencies (same folder)
python3 -m pip install -r requirements.txt
```
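A note on the CUDA build: if you already ran the plain `make` first, stale CPU-only objects can be left around. A safe sequence is to clean and rebuild with the CUDA flag (a sketch; `-j$(nproc)` just parallelizes the build):
```
cd llama.cpp
# Drop any previous CPU-only objects, then rebuild with CUDA enabled
make clean
make LLAMA_CUDA=1 -j$(nproc)
```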
3. While it compiles, go to [Meta](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and download the model weights (if you take this route, see the conversion sketch below). Alternatively, you can use Hugging Face:
```
pip install huggingface-hub
huggingface-cli login
huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q5_K_M.gguf --local-dir . --local-dir-use-symlinks False
```
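If you go the Meta route you get raw PyTorch checkpoints rather than a GGUF file, so they need to be converted and quantized with the tools that ship in the llama.cpp checkout. A rough sketch, assuming the downloaded weights sit in `../llama-2-7b-chat/`; script names and flags change between llama.cpp versions, so check `python3 convert.py --help` on your copy:
```
cd llama.cpp
# Convert the original checkpoint to GGUF (fp16); the input path is an assumption
python3 convert.py ../llama-2-7b-chat --outfile models/llama-2-7b-chat.f16.gguf
# Quantize to Q5_K_M to fit comfortably in VRAM
./quantize models/llama-2-7b-chat.f16.gguf models/llama-2-7b-chat.Q5_K_M.gguf Q5_K_M
```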
4. Let's play with the model. Move the model file into the `llama.cpp/models` folder.
```
# Using the CPU only
./main -t 10 -m models/llama-2-7b-chat.Q5_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\\n### Response:"
# Using the GPU
./main -m models/llama-2-7b-chat.Q5_K_M.gguf --n-gpu-layers 10 --split-mode layer --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\\n### Response:"
# More precisely, offloading 24 layers to the GPU
./main -m models/llama-2-7b-chat.Q5_K_M.gguf --repeat-penalty 1.1 --n-gpu-layers 24 --split-mode layer -c 4096 --temp 0.7 --prompt "Explain what is the temperature in a LLM model."
# Now using the server (http://127.0.0.1:8888/)
./server -m models/llama-2-7b-chat.Q5_K_M.gguf --port 8888 --host 0.0.0.0 --ctx-size 10240 --parallel 4 --n-gpu-layers 99 -n 512
```
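With the server running you can hit its HTTP API from another terminal. A minimal sketch using the `/completion` endpoint that the llama.cpp server example exposes (field names such as `n_predict` follow its documentation; adjust to your version):
```
curl http://127.0.0.1:8888/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a short story about llamas.", "n_predict": 128, "temperature": 0.7}'
```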