We train LLMs with this code and report the training speed under different settings (see the table below). The machine has 8x NVIDIA A800 GPUs, 1 TB of CPU memory, and 2x Intel Xeon Platinum 8358 CPUs. On the software side, we use CUDA 12.1, PyTorch 2.2.0, and DeepSpeed 0.14.2.
Table. Benchmark of LLaMA-7B training with the DeepSpeed-based training code. The sequence length is 4096.
| ZeRO Stage | Ckpt.[^1] | Optim. Off.[^2] | Param. Off.[^3] | ZeRO++[^4] | BS[^5] | CPU Mem. (GB)[^6] | GPU Mem. (GB)[^7] | Throughput |
|---|---|---|---|---|---|---|---|---|
| 2 | × | × | × | × | 1/64 | 320.1 | 19.4/44.8 | 5.33 |
| 2 | √ | × | × | × | 1/64 | 320.0 | 19.4/23.5 | 4.19 |
| 2 | √ | √ | × | × | 1/64 | 361.3 | 13.4/16.9 | 1.81 |
| 2 | √ | × | × | × | 4/64 | 320.4 | 27.2/38.6 | 4.69 |
| 3 | × | × | × | × | 2/64 | 319.5 | 14.8/75.7 | 4.95 |
| 3 | √ | × | × | × | 2/64 | 319.6 | 14.8/20.4 | 4.45 |
| 3 | √ | √ | × | × | 2/64 | 387.4 | 3.8/9.4 | 2.05 |
| 3 | √ | √ | √ | × | 4/64 | 398.9 | 2.2/7.9 | 2.06 |
| 3 | √ | √ | √ | √ | 4/64 | 411.1 | 2.2/7.9 | 1.85 |
| 3 | √ | × | × | × | 8/64 | 319.6 | 17.7/39.1 | 4.73 |
| 3 | √ | × | × | × | 8/128 | 319.9 | 21.4/63.9 | 4.32 |
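To make the table's knobs concrete, below is a minimal sketch of a DeepSpeed config and model setup matching the last ZeRO-3 offload row (checkpointing, optimizer and parameter offload, ZeRO++, BS 4/64). The ZeRO++ keys follow the DeepSpeed tutorial linked in the footnotes; the checkpoint path, learning rate, and precision setting are placeholders, not the exact values used for this benchmark.

```python
import deepspeed
from transformers import AutoModelForCausalLM

# Sketch of a config matching the last ZeRO-3 offload row of the table.
# lr, precision, and model path are placeholder assumptions.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # "BS" numerator: per-device batch per iteration
    "train_batch_size": 64,               # "BS" denominator: global batch per gradient-descent step
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},  # "Optim. Off."
        "offload_param": {"device": "cpu", "pin_memory": True},      # "Param. Off."
        # ZeRO++ switches (https://www.deepspeed.ai/tutorials/zeropp/)
        "zero_quantized_weights": True,
        "zero_hpz_partition_size": 8,  # hierarchical partition size: GPUs per node
        "zero_quantized_gradients": True,
    },
}

model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")  # placeholder path
model.gradient_checkpointing_enable()  # "Ckpt.": HF gradient checkpointing

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```

With 8 GPUs and a micro-batch of 4, the global batch of 64 implies 2 gradient-accumulation steps per optimizer update, which DeepSpeed derives from the two batch-size fields.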
[^1]: Ckpt. indicates whether HF gradient checkpointing is enabled for the model.
[^2]: Optim. Off. indicates whether `offload_optimizer` is enabled in the `zero_optimization` section of the DeepSpeed config.
[^3]: Param. Off. indicates whether `offload_param` is enabled in the `zero_optimization` section of the DeepSpeed config.
[^4]: ZeRO++ refers to the techniques described at https://www.deepspeed.ai/tutorials/zeropp/.
[^5]: BS is reported as `batch size per device per iteration` / `batch size per gradient-descent step`.
[^6]: CPU Mem. is measured as `psutil.virtual_memory().used`.
[^7]: GPU Mem. is reported as `torch.cuda.memory_allocated()` / `torch.cuda.max_memory_allocated()`.
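The memory columns follow the footnote definitions directly; a short sketch of how such readings can be taken on a training process (values converted to GB, as in the table):

```python
import psutil
import torch

# CPU Mem.: host memory currently in use, per footnote 6.
cpu_mem_gb = psutil.virtual_memory().used / 2**30

# GPU Mem.: current / peak allocated memory on this device, per footnote 7.
gpu_cur_gb = torch.cuda.memory_allocated() / 2**30
gpu_peak_gb = torch.cuda.max_memory_allocated() / 2**30

print(f"CPU Mem. {cpu_mem_gb:.1f} GB | GPU Mem. {gpu_cur_gb:.1f}/{gpu_peak_gb:.1f} GB")
```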