@stellaraccident
Last active March 10, 2026 02:08

PyTorch ROCm wheel split report — torch 2.10.0+rocm7.1 (v2: hierarchy-aware bundling, xnack collapse, METADATA extras/variants)

PyTorch ROCm Wheel Split Report

Split of torch-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl using rocm_kpack.tools.split_python_wheels.

Summary

| Metric | Value |
| --- | --- |
| Input wheel (.whl) | 5.1 GB |
| Host wheel (.whl) | 431 MB |
| Device wheels (16 bundle keys, .whl) | 5.5 GB total |
| Fat binaries processed | 10 |
| Database files relocated | 14,322 |
| Raw GPU architectures found | 20 (18 ELF + 2 database-only) |
| Device wheels produced | 16 (xnack variants collapsed, hierarchy-aware naming) |
| Total kernels extracted | 20,181 |

Key result: A user installing for gfx1100 downloads 431 MB (host) + 241 MB (gfx1100) + 453 MB (gfx11 family) = 1,125 MB instead of 5.1 GB (78% reduction). A gfx900 user downloads 431 MB + 54 MB = 485 MB (91% reduction). The host wheel is architecture-independent and cacheable across all GPU targets.

What Changed (v2)

  • Hierarchy-aware bundling: Device wheels use rocm-bootstrap's 3-level naming hierarchy (family/sub-family/target). AOTriton's gfx11xx directory becomes the amd-torch-device-gfx11 family wheel; gfx120x becomes amd-torch-device-gfx12-0.
  • xnack collapse: gfx942, gfx942:xnack+, and gfx942:xnack- kernels are merged into a single amd-torch-device-gfx942 wheel (3 .kpack files). Same for gfx90a. Reduces 20 raw arches to 16 device wheels.
  • Host METADATA rewrite: Adds Requires-Dist: rocm-bootstrap, per-target extras (pip install torch[gfx1100]), PEP 817 variant markers ("amd :: gfx_arch :: gfx1100" in variant_properties), packaging chain fan-out (gfx1100 extra pulls in gfx11 family wheel), and an all extra.
  • Device wheel naming uses rocm_bootstrap.device_dist_name() for canonical names.
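
The xnack collapse can be illustrated with a small sketch. The helper name `collapse_xnack` and the input lists are hypothetical and are not taken from rocm_kpack; the sketch only assumes that raw arch names have the `gfxNNN[:xnack+|-]` form shown above and that the 18 ELF arches are the 14 targets plus the xnack variants of gfx90a and gfx942.

```python
from collections import defaultdict


def collapse_xnack(raw_arches):
    """Group raw arch names by base target, ignoring any :xnack+/- suffix.

    Hypothetical helper for illustration; not part of rocm_kpack.
    """
    bundles = defaultdict(list)
    for arch in raw_arches:
        base = arch.split(":", 1)[0]  # "gfx942:xnack+" -> "gfx942"
        bundles[base].append(arch)
    return dict(bundles)


# 18 ELF arches: 14 targets plus xnack+/- variants of gfx90a and gfx942.
elf_arches = [
    "gfx900", "gfx906", "gfx908",
    "gfx90a", "gfx90a:xnack+", "gfx90a:xnack-",
    "gfx942", "gfx942:xnack+", "gfx942:xnack-",
    "gfx950", "gfx1030", "gfx1100", "gfx1101", "gfx1102",
    "gfx1150", "gfx1151", "gfx1200", "gfx1201",
]
# 2 database-only bundles (AOTriton family / sub-family directories).
db_only = ["gfx11", "gfx12_0"]

bundles = collapse_xnack(elf_arches + db_only)
print(len(elf_arches) + len(db_only), "raw arches ->", len(bundles), "bundle keys")
# prints: 20 raw arches -> 16 bundle keys
```

Each collapsed bundle keeps its variants as separate entries, which matches the three .kpack files carried by the gfx942 and gfx90a wheels.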

What Gets Split

1. ELF Fat Binary Device Code (3.8 GB -> kpack archives)

Ten shared libraries contain embedded GPU kernels covering 18 raw architectures in total (14 targets plus the xnack+/xnack- variants of gfx90a and gfx942). Device code is extracted into per-arch .kpack archives, and host binaries are rewritten with kpack load references.

2. Kernel Database Files (6.2 GB -> per-arch device wheels)

Architecture-specific database files (Tensile .co/.hsaco/.dat, AOTriton .aks2 images, MIOpen tuning databases) are moved from the host wheel into the appropriate device wheel.

| Database | Files | Size | Arches |
| --- | --- | --- | --- |
| hipblaslt | 2,990 | 3,786 MB | gfx942, gfx950 |
| aotriton | 9,334 | 870 MB | gfx90a, gfx942, gfx950, gfx11 (family), gfx12_0 (sub-family) |
| miopen | 62 | 744 MB | gfx900-gfx1030 |
| rocblas | 1,738 | 645 MB | 14 arches |
| hipsparselt | 198 | 138 MB | gfx942, gfx950 |
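
As a rough illustration of how database files could be routed to device wheels, the sketch below groups paths by the arch token they contain. This is an assumption about the general approach, not the tool's actual logic: the regex, the helper name `group_by_arch`, and the example file names are all hypothetical, and the mapping from a directory token such as `gfx11xx` to the `gfx11` bundle key is not shown.

```python
import re
from collections import defaultdict

# Assumed pattern: an arch token such as gfx942, gfx90a, or gfx11xx in the path.
ARCH_TOKEN = re.compile(r"gfx[0-9a-fx]+")


def group_by_arch(paths):
    """Group file paths by the first arch token found; others stay with the host."""
    groups = defaultdict(list)
    for path in paths:
        match = ARCH_TOKEN.search(path)
        key = match.group(0) if match else "host"
        groups[key].append(path)
    return dict(groups)


# Hypothetical file names, for illustration only.
example_paths = [
    "torch/lib/rocblas/library/example_gfx90a.dat",
    "torch/lib/hipblaslt/library/example_gfx942.co",
    "torch/lib/aotriton.images/amd-gfx11xx/example.aks2",
    "torch/lib/libtorch_cpu.so",
]

groups = group_by_arch(example_paths)
for key, files in sorted(groups.items()):
    print(key, len(files))
```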

Timing (20 parallel workers)

| Phase | Time |
| --- | --- |
| Phase 2b: Database scanning | <1s |
| Phase 3: Kernel extraction | 102s |
| Phase 4: Kpack archive creation | 4.0s |
| Phase 5: Host binary transformation | ~30s |
| Phase 5: Database file removal | <1s |
| RECORD generation (SHA-256 hashing) | ~30s |
| Host wheel zipping | ~60s |
| Device wheel creation (16x) | ~120s |
| Total wall-clock | ~6 min |

Extraction time is dominated by libmagma.so (1.2 GB, 6,160 kernels, 102s), which alone accounts for essentially the entire Phase 3 wall-clock time.
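
For readers unfamiliar with the RECORD step in the timing table, the sketch below shows how a single RECORD line is formed under the standard wheel format: the path, a SHA-256 digest encoded as URL-safe base64 without padding, and the size in bytes. The helper name `record_entry` and the example path are hypothetical, and this is not the tool's own code; it also assumes paths without commas, which would otherwise need CSV quoting.

```python
import base64
import hashlib


def record_entry(archive_path: str, data: bytes) -> str:
    """Return one RECORD line: path,sha256=<urlsafe base64, no padding>,size."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{archive_path},sha256={encoded},{len(data)}"


# Hypothetical empty file, for illustration only.
print(record_entry("pkg/__init__.py", b""))
# prints: pkg/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
```

Every file in the wheel has to be read in full to be hashed, which is consistent with this step taking tens of seconds on a multi-gigabyte payload.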

Per-Binary Stripping Results

| Library | Original | Stripped | Saved | % Saved |
| --- | --- | --- | --- | --- |
| libmagma.so | 1,249 MB | 53 MB | 1,196 MB | 95.7% |
| librocsolver.so | 725 MB | 19 MB | 706 MB | 97.4% |
| librccl.so | 466 MB | 2.2 MB | 464 MB | 99.5% |
| librocsparse.so | 435 MB | 50 MB | 386 MB | 88.6% |
| libtorch_hip.so | 478 MB | 136 MB | 342 MB | 71.6% |
| librocrand.so | 337 MB | 38 MB | 299 MB | 88.7% |
| libMIOpen.so | 951 MB | 735 MB | 216 MB | 22.7% |
| librocblas.so | 53 MB | 25 MB | 28 MB | 52.4% |
| librocfft.so | 25 MB | 11 MB | 14 MB | 58.1% |
| libhipsparselt.so | 7.1 MB | 6.9 MB | 0.2 MB | 3.1% |

Per-Architecture Device Wheels

| Bundle Key | Level | Kernels | DB Files | Kpack Size | Wheel Size (.whl) |
| --- | --- | --- | --- | --- | --- |
| gfx900 | target | 685 | 5 | 50 MB | 54 MB |
| gfx906 | target | 685 | 5 | 50 MB | 55 MB |
| gfx908 | target | 1,642 | 285 | 263 MB | 289 MB |
| gfx90a | target | 2,281* | 2,522* | 469 MB* | 683 MB |
| gfx942 | target | 2,313* | 3,636* | 463 MB* | 1,636 MB |
| gfx950 | target | 1,645 | 2,539 | 250 MB | 478 MB |
| gfx1030 | target | 1,342 | 90 | 217 MB | 223 MB |
| gfx1100 | target | 1,370 | 191 | 228 MB | 241 MB |
| gfx1101 | target | 1,370 | 207 | 228 MB | 239 MB |
| gfx1102 | target | 1,370 | 96 | 228 MB | 229 MB |
| gfx1150 | target | 1,369 | 191 | 185 MB | 194 MB |
| gfx1151 | target | 1,369 | 191 | 185 MB | 195 MB |
| gfx1200 | target | 1,370 | 343 | 218 MB | 275 MB |
| gfx1201 | target | 1,370 | 353 | 218 MB | 260 MB |
| gfx11 | family | -- | 1,780 | -- | 453 MB |
| gfx12_0 | sub-family | -- | 1,888 | -- | 109 MB |

* gfx90a and gfx942 include collapsed xnack+/xnack- variants (3 kpack files each).

Notes:

  • gfx942 is the heaviest device wheel (1.6 GB) due to hipblaslt Tensile databases + 3 kpack variant files
  • gfx11 and gfx12_0 are database-only wheels (AOTriton family/sub-family arch directories, no ELF kernels)
  • Kpack archives use zstd compression (level 3)

User Download Sizes

What a user actually downloads for their GPU:

| User's GPU | Downloads | Total | Savings |
| --- | --- | --- | --- |
| gfx900 | host + gfx900 | 485 MB | 91% |
| gfx906 | host + gfx906 | 486 MB | 91% |
| gfx908 | host + gfx908 | 720 MB | 86% |
| gfx90a | host + gfx90a | 1,114 MB | 78% |
| gfx942 | host + gfx942 | 2,067 MB | 60% |
| gfx950 | host + gfx950 | 909 MB | 83% |
| gfx1030 | host + gfx1030 | 654 MB | 87% |
| gfx1100 | host + gfx1100 + gfx11 | 1,125 MB | 78% |
| gfx1200 | host + gfx1200 + gfx12_0 | 815 MB | 84% |
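
The totals above are the host wheel plus each wheel in the target's bundle chain. The sketch below reproduces that arithmetic from the rounded sizes in this report. The names `FAMILY_OF`, `download_mb`, and `savings_pct` are hypothetical. Because the inputs are rounded (5.1 GB is taken as 5,100 MB), the computed percentage can differ from the table by about one point.

```python
INPUT_MB = 5100  # 5.1 GB input wheel, rounded as in the summary
HOST_MB = 431    # host wheel

# Device wheel sizes in MB, from the per-architecture table.
DEVICE_MB = {
    "gfx900": 54, "gfx906": 55, "gfx908": 289, "gfx90a": 683,
    "gfx942": 1636, "gfx950": 478, "gfx1030": 223,
    "gfx1100": 241, "gfx1200": 275,
    "gfx11": 453, "gfx12_0": 109,
}

# Targets that also pull in a family or sub-family wheel (rows shown in the table).
FAMILY_OF = {"gfx1100": "gfx11", "gfx1200": "gfx12_0"}


def download_mb(target):
    """Host wheel + target wheel + optional family/sub-family wheel, in MB."""
    total = HOST_MB + DEVICE_MB[target]
    family = FAMILY_OF.get(target)
    if family is not None:
        total += DEVICE_MB[family]
    return total


def savings_pct(target):
    """Approximate reduction relative to the single fat wheel, in percent."""
    return 100.0 * (1.0 - download_mb(target) / INPUT_MB)


for gpu in ("gfx900", "gfx942", "gfx1100", "gfx1200"):
    print(f"{gpu}: {download_mb(gpu)} MB, about {savings_pct(gpu):.1f}% smaller")
```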

Host Wheel METADATA (Injected)

The host wheel METADATA is augmented with device wheel references. Example excerpt:

Requires-Dist: rocm-bootstrap
Provides-Extra: gfx1100
Requires-Dist: amd-torch-device-gfx1100 == 2.10.0+rocm7.1; extra == "gfx1100"
Requires-Dist: amd-torch-device-gfx1100 == 2.10.0+rocm7.1; "amd :: gfx_arch :: gfx1100" in variant_properties
Requires-Dist: amd-torch-device-gfx11 == 2.10.0+rocm7.1; extra == "gfx1100"
Requires-Dist: amd-torch-device-gfx11 == 2.10.0+rocm7.1; "amd :: gfx_arch :: gfx1100" in variant_properties
...
Provides-Extra: all
Requires-Dist: amd-torch-device-gfx1030 == 2.10.0+rocm7.1; extra == "all"
Requires-Dist: amd-torch-device-gfx11 == 2.10.0+rocm7.1; extra == "all"
...

Installation paths:

  • pip install torch[gfx1100] -- extras-based, pulls target + family chain
  • uv pip install torch with PEP 817 variant properties -- automatic resolution via variant markers
  • pip install torch[all] -- install all 16 device wheels
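
The per-target lines in the METADATA excerpt follow a regular pattern: each wheel in the target's chain gets one requirement keyed on the extra and one keyed on the variant property. A minimal sketch of that fan-out follows. The helper `requires_dist_lines` is hypothetical, and building the distribution name with a plain f-string is an assumption; the report states that the tool uses rocm_bootstrap.device_dist_name() for canonical names, whose signature is not shown here.

```python
VERSION = "2.10.0+rocm7.1"
PREFIX = "amd-torch-device"


def requires_dist_lines(target, chain):
    """Build the METADATA lines for one target extra.

    chain lists the bundle keys to pull in, e.g. ["gfx1100", "gfx11"].
    Hypothetical helper; name construction here is a simplification.
    """
    lines = [f"Provides-Extra: {target}"]
    for key in chain:
        dist = f"{PREFIX}-{key}"
        # Extras-based path: pip install torch[<target>]
        lines.append(f'Requires-Dist: {dist} == {VERSION}; extra == "{target}"')
        # Variant-based path: resolved from variant properties
        lines.append(
            f'Requires-Dist: {dist} == {VERSION}; '
            f'"amd :: gfx_arch :: {target}" in variant_properties'
        )
    return lines


for line in requires_dist_lines("gfx1100", ["gfx1100", "gfx11"]):
    print(line)
```

For gfx1100 this yields the same five lines as the excerpt above, from `Provides-Extra: gfx1100` through the gfx11 variant marker.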

Output Structure

Host Wheel (431 MB compressed)

torch-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64/
+-- torch-2.10.0+rocm7.1.dist-info/
|   +-- METADATA     (augmented with extras, variant markers, rocm-bootstrap dep)
|   +-- WHEEL        (original, preserved)
|   +-- RECORD       (regenerated with SHA-256 hashes)
+-- torch/
    +-- lib/
    |   +-- libmagma.so          (53 MB, device code stripped)
    |   +-- libMIOpen.so         (735 MB, device code stripped)
    |   +-- libtorch_cpu.so      (433 MB, unchanged -- no device code)
    |   +-- ...
    |   (rocblas/library/, hipblaslt/library/, hipsparselt/library/,
    |    aotriton.images/ -- arch-specific files removed)
    +-- share/miopen/db/         (arch-specific files removed)
    +-- _C.cpython-313-x86_64-linux-gnu.so
    +-- ... (~11,900 files remaining)

Device Wheel (example: gfx942, 1.6 GB compressed)

amd_torch_device_gfx942-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64/
+-- amd_torch_device_gfx942-2.10.0+rocm7.1.dist-info/
|   +-- METADATA     (Requires-Dist: torch == 2.10.0+rocm7.1)
|   +-- WHEEL
|   +-- RECORD
|   +-- top_level.txt
+-- torch/
    +-- .kpack/
    |   +-- torch_gfx942.kpack        (96 MB, bare arch kernels)
    |   +-- torch_gfx942:xnack+.kpack (170 MB, xnack+ variant)
    |   +-- torch_gfx942:xnack-.kpack (170 MB, xnack- variant)
    +-- lib/
    |   +-- rocblas/library/         (gfx942 Tensile kernels)
    |   +-- hipblaslt/library/       (gfx942 Tensile kernels)
    |   +-- hipsparselt/library/     (gfx942 Tensile kernels)
    |   +-- aotriton.images/amd-gfx942/  (AOTriton kernel images)
    +-- share/miopen/db/            (gfx942 tuning databases)

Device Wheel (example: gfx11 family, 453 MB compressed)

amd_torch_device_gfx11-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64/
+-- amd_torch_device_gfx11-2.10.0+rocm7.1.dist-info/
|   +-- METADATA     (Requires-Dist: torch == 2.10.0+rocm7.1)
|   +-- WHEEL
|   +-- RECORD
|   +-- top_level.txt
+-- torch/
    +-- lib/
        +-- aotriton.images/amd-gfx11xx/  (AOTriton family kernel images)
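
One way to check a produced wheel against the layouts above is to list its .kpack members with the standard library, since a wheel is a zip archive. The helper name `list_kpack_members` is hypothetical; the sketch only assumes that kpack archives carry a `.kpack` suffix as shown in the gfx942 example.

```python
import zipfile


def list_kpack_members(wheel):
    """Return (name, uncompressed size in bytes) for each .kpack file in a wheel.

    `wheel` may be a path or a file-like object, as accepted by zipfile.ZipFile.
    """
    with zipfile.ZipFile(wheel) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if info.filename.endswith(".kpack")
        ]
```

Called on the gfx942 device wheel, this should report the three kpack files listed above; called on a database-only wheel such as gfx11, it should return an empty list.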

Command

python -m rocm_kpack.tools.split_python_wheels \
    --input torch-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl \
    --output-dir dist/ \
    --device-package-prefix amd-torch-device \
    --overlay-root torch/ \
    --wheel-type torch-fat \
    --output-format wheel \
    --verbose \
    --jobs 20

Per-Binary Extraction Details

| Library | Size | Kernels | Time | Architectures |
| --- | --- | --- | --- | --- |
| libmagma.so | 1,249 MB | 6,160 | 101.9s | 14 |
| libMIOpen.so | 951 MB | 1,668 | 13.4s | 16 |
| librocsolver.so | 725 MB | 3,290 | 13.4s | 14 |
| librccl.so | 466 MB | 12 | 17.2s | 12 |
| librocsparse.so | 435 MB | 4,466 | 9.5s | 14 |
| libtorch_hip.so | 478 MB | 3,459 | 22.6s | 14 |
| librocrand.so | 337 MB | 210 | 3.2s | 14 |
| librocblas.so | 53 MB | 896 | 3.0s | 14 |
| librocfft.so | 25 MB | 14 | 0.2s | 14 |
| libhipsparselt.so | 7 MB | 6 | <0.1s | 2 |

Output wheels (ls -lh):

total 5.9G
-rw-rw-r-- 1 stella stella 223M Mar 9 18:53 amd_torch_device_gfx1030-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 241M Mar 9 18:53 amd_torch_device_gfx1100-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 239M Mar 9 18:53 amd_torch_device_gfx1101-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 229M Mar 9 18:53 amd_torch_device_gfx1102-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 453M Mar 9 18:53 amd_torch_device_gfx11-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 194M Mar 9 18:53 amd_torch_device_gfx1150-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 195M Mar 9 18:54 amd_torch_device_gfx1151-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 275M Mar 9 18:54 amd_torch_device_gfx1200-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 260M Mar 9 18:54 amd_torch_device_gfx1201-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 109M Mar 9 18:54 amd_torch_device_gfx12_0-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 54M Mar 9 18:54 amd_torch_device_gfx900-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 55M Mar 9 18:54 amd_torch_device_gfx906-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 289M Mar 9 18:54 amd_torch_device_gfx908-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 683M Mar 9 18:54 amd_torch_device_gfx90a-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 1.6G Mar 9 18:55 amd_torch_device_gfx942-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 478M Mar 9 18:55 amd_torch_device_gfx950-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 431M Mar 9 18:53 torch-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl