awni/mlx_distributed_deepseek.md

## mlx_distributed_deepseek.md

      
    Raw
  

              mlx_distributed_deepseek.md
            
          
    Setup

On every machine in the cluster install openmpi and mlx-lm:
conda install conda-forge::openmpi
pip install -U mlx-lm
Next download the pipeline parallel run script. Download it to the same path on every machine:
curl -O https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py
Make a hosts.json file on the machine you plan to launch the generation. For two machines it should look like this:
[
  {"ssh": "hostname1"},
  {"ssh": "hostname2"}
]

Also make sure you can ssh hostname from every machine to every other machine. Check-out the MLX documentation for more information on setting up and testing MPI.
Set the wired limit on the machines to use more memory. For example on a 192GB M2 Ultra set this:
sudo sysctl iogpu.wired_limit_mb=180000
Run

Run the generation with a command like the following:
mlx.launch \
  --hostfile path/to/hosts.json \
  --backend mpi \
  path/to/pipeline_generate.py \ 
  --prompt "What number is larger 6.9 or 6.11?" \
  --max-tokens 128 \
  --model mlx-community/DeepSeek-R1-4bit

For DeepSeek R1 quantized in 3-bit you need in aggregate 350GB of RAM accross the cluster of machines, e.g. two 192 GB M2 Ultras. To run the model quantized to 4-bit you need 450GB in aggregate RAM or three 192 GB M2 Ultras.
No results found