Set up Llama 2 on Ubuntu 22.04 for an Nvidia RTX 4090
1. Check that your Ubuntu install is ready for your GPU card (in my case an RTX 4090)
```
nvidia-smi
# If you have an error like "Failed to initialize NVML: Driver/library version mismatch. NVML library version: 535.161"
ubuntu-drivers devices
sudo ubuntu-drivers autoinstall
## Or, as an alternative, if you want to install the driver manually
sudo apt-get install nvidia-driver-535
# Install CUDA (In my case Ubuntu 22.04). You can check here for other options: https://developer.nvidia.com/cuda-downloads
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.0-550.54.14-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-4-local_12.4.0-550.54.14-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-4
```
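Before moving on, it is worth confirming that both the driver and the CUDA toolkit are visible. A minimal sanity sketch (a reboot may be needed after a driver upgrade; the `.bashrc` lines are optional and simply persist the paths used again in step 2):
```
# Driver check: should list the RTX 4090 without errors
nvidia-smi
# Toolkit check: nvcc lives under /usr/local/cuda/bin, so use the full path
/usr/local/cuda/bin/nvcc --version   # should report release 12.4
# Optional: persist the CUDA paths for future shells
echo 'export PATH=/usr/local/cuda/bin:${PATH}' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}' >> ~/.bashrc
source ~/.bashrc
```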
2. Install the basic tools
```
# Install Git LFS (Large File Storage) and the build tools
sudo apt install git-lfs make build-essential python3-pip
git lfs install
# Clone the Meta Llama repo (needed to request/download the official weights)
git clone https://github.com/facebookresearch/llama
# Clone CPP tools (https://github.com/ggerganov/llama.cpp)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
# To compile with CUDA support you will need the CUDA paths
export PATH=/usr/local/cuda/bin:${PATH}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
make LLAMA_CUDA=1
# Monitor that the GPU is being used
nvidia-smi pmon
# Install python dependencies (same folder)
python3 -m pip install -r requirements.txt
```
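A note on the CUDA build: if you already ran the plain `make` first, stale CPU-only objects can be left around. A safe sequence is to clean and rebuild with the CUDA flag (a sketch; `-j$(nproc)` just parallelizes the build):
```
cd llama.cpp
# Drop any previous CPU-only objects, then rebuild with CUDA enabled
make clean
make LLAMA_CUDA=1 -j$(nproc)
```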
3. While it compiles, go to [Meta](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) and download the model weights (if you take this route, see the conversion sketch below). Alternatively, you can use Hugging Face:
```
pip install huggingface-hub
huggingface-cli login
huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q5_K_M.gguf --local-dir . --local-dir-use-symlinks False
```
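If you go the Meta route you get raw PyTorch checkpoints rather than a GGUF file, so they need to be converted and quantized with the tools that ship in the llama.cpp checkout. A rough sketch, assuming the downloaded weights sit in `../llama-2-7b-chat/`; script names and flags change between llama.cpp versions, so check `python3 convert.py --help` on your copy:
```
cd llama.cpp
# Convert the original checkpoint to GGUF (fp16); the input path is an assumption
python3 convert.py ../llama-2-7b-chat --outfile models/llama-2-7b-chat.f16.gguf
# Quantize to Q5_K_M to fit comfortably in VRAM
./quantize models/llama-2-7b-chat.f16.gguf models/llama-2-7b-chat.Q5_K_M.gguf Q5_K_M
```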
4. Let's play with the model. Move the model file into the `llama.cpp/models` folder.
```
# Using the CPU only
./main -t 10 -m models/llama-2-7b-chat.Q5_K_M.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\\n### Response:"
# Using the GPU
./main -m models/llama-2-7b-chat.Q5_K_M.gguf --n-gpu-layers 10 --split-mode layer --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Instruction: Write a story about llamas\\n### Response:"
# More precisely, offloading 24 layers to the GPU
./main -m models/llama-2-7b-chat.Q5_K_M.gguf --repeat-penalty 1.1 --n-gpu-layers 24 --split-mode layer -c 4096 --temp 0.7 --prompt "Explain what is the temperature in a LLM model."
# Now using the server (http://127.0.0.1:8888/)
./server -m models/llama-2-7b-chat.Q5_K_M.gguf --port 8888 --host 0.0.0.0 --ctx-size 10240 --parallel 4 --n-gpu-layers 99 -n 512
```
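With the server running you can hit its HTTP API from another terminal. A minimal sketch using the `/completion` endpoint that the llama.cpp server example exposes (field names such as `n_predict` follow its documentation; adjust to your version):
```
curl http://127.0.0.1:8888/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a short story about llamas.", "n_predict": 128, "temperature": 0.7}'
```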