Split of torch-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl using rocm_kpack.tools.split_python_wheels.
| Metric | Value |
|---|---|
| Input wheel (.whl) | 5.1 GB |
| Host wheel (.whl) | 431 MB |
| Device wheels (16 bundle keys, .whl) | 5.5 GB total |
| Fat binaries processed | 10 |
| Database files relocated | 14,322 |
| Raw GPU architectures found | 20 (18 ELF + 2 database-only) |
| Device wheels produced | 16 (xnack variants collapsed, hierarchy-aware naming) |
| Total kernels extracted | 20,181 |
Key result: A user installing for gfx1100 downloads 431 MB (host) + 241 MB (gfx1100) + 453 MB (gfx11 family) = 1,125 MB instead of 5.1 GB (78% reduction). A gfx900 user downloads 431 MB + 54 MB = 485 MB (91% reduction). The host wheel is architecture-independent and cacheable across all GPU targets.
- Hierarchy-aware bundling: Device wheels use rocm-bootstrap's 3-level naming hierarchy (family/sub-family/target). AOTriton's `gfx11xx` directory becomes the `amd-torch-device-gfx11` family wheel; `gfx120x` becomes `amd-torch-device-gfx12-0`.
- xnack collapse: `gfx942`, `gfx942:xnack+`, and `gfx942:xnack-` kernels are merged into a single `amd-torch-device-gfx942` wheel (3 `.kpack` files). Same for `gfx90a`. Reduces 20 raw arches to 16 device wheels.
- Host METADATA rewrite: Adds `Requires-Dist: rocm-bootstrap`, per-target extras (`pip install torch[gfx1100]`), PEP 817 variant markers (`"amd :: gfx_arch :: gfx1100" in variant_properties`), packaging chain fan-out (the `gfx1100` extra pulls in the `gfx11` family wheel), and an `all` extra.
- Device wheel naming uses `rocm_bootstrap.device_dist_name()` for canonical names.
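
The collapse from 20 raw architectures to 16 bundle keys can be sketched as below. This is a minimal illustration, not the tool's code: `bundle_key` and `dist_name` are hypothetical helpers, and `dist_name` only mimics the names shown in this report (the real signature of `rocm_bootstrap.device_dist_name()` is not shown here). The architecture list is taken from the device wheel table in this report.

```python
# Raw architectures from this run: 18 found in ELF fat binaries ...
ELF_ARCHES = [
    "gfx900", "gfx906", "gfx908",
    "gfx90a", "gfx90a:xnack+", "gfx90a:xnack-",
    "gfx942", "gfx942:xnack+", "gfx942:xnack-",
    "gfx950", "gfx1030",
    "gfx1100", "gfx1101", "gfx1102", "gfx1150", "gfx1151",
    "gfx1200", "gfx1201",
]
# ... plus 2 database-only AOTriton directories, mapped to hierarchy keys.
AOTRITON_DIRS = {"gfx11xx": "gfx11", "gfx120x": "gfx12_0"}


def bundle_key(raw_arch: str) -> str:
    """Drop any ':xnack+' / ':xnack-' suffix and map family directories."""
    base = raw_arch.split(":", 1)[0]
    return AOTRITON_DIRS.get(base, base)


def dist_name(prefix: str, key: str) -> str:
    """Hypothetical stand-in for rocm_bootstrap.device_dist_name()."""
    return f"{prefix}-{key}".replace("_", "-")


raw = ELF_ARCHES + list(AOTRITON_DIRS)
keys = sorted({bundle_key(a) for a in raw})
print(len(raw), len(keys))  # 20 16
print(dist_name("amd-torch-device", "gfx12_0"))  # amd-torch-device-gfx12-0
```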
Ten shared libraries contain embedded GPU kernels for up to 18 architectures. Device code is extracted into per-arch .kpack archives, and host binaries are rewritten with kpack load references.
Architecture-specific database files (Tensile .co/.hsaco/.dat, AOTriton .aks2 images, MIOpen tuning databases) are moved from the host wheel into the appropriate device wheel.
| Database | Files | Size | Arches |
|---|---|---|---|
| hipblaslt | 2,990 | 3,786 MB | gfx942, gfx950 |
| aotriton | 9,334 | 870 MB | gfx90a, gfx942, gfx950, gfx11 (family), gfx12_0 (sub-family) |
| miopen | 62 | 744 MB | gfx900-gfx1030 |
| rocblas | 1,738 | 645 MB | 14 arches |
| hipsparselt | 198 | 138 MB | gfx942, gfx950 |
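
One way to picture the relocation step is a path classifier that routes each file to a bundle key. The sketch below assumes that every arch-specific file carries a `gfx...` token somewhere in its path; that is an assumption for illustration, and the real tool may well use per-database rules. The file names are made up; only the `amd-gfx942` and `amd-gfx11xx` directory names come from the wheel layouts in this report.

```python
import re
from collections import defaultdict

# Assumption: arch-specific files contain a token such as gfx942, gfx90a,
# gfx11xx or gfx120x in their path.
ARCH_TOKEN = re.compile(r"gfx[0-9a-f]+x*")
# AOTriton family directories map to hierarchy bundle keys.
FAMILY_DIRS = {"gfx11xx": "gfx11", "gfx120x": "gfx12_0"}


def classify(paths):
    """Split paths into {bundle_key: [paths]} and a list of host-only paths."""
    by_key = defaultdict(list)
    host = []
    for path in paths:
        match = ARCH_TOKEN.search(path)
        if match is None:
            host.append(path)  # no arch token: stays in the host wheel
        else:
            token = match.group(0)
            by_key[FAMILY_DIRS.get(token, token)].append(path)
    return dict(by_key), host


# Illustrative paths (file names are hypothetical).
sample = [
    "torch/lib/aotriton.images/amd-gfx942/example.aks2",
    "torch/lib/aotriton.images/amd-gfx11xx/example.aks2",
    "torch/lib/rocblas/library/example_gfx1100.dat",
    "torch/lib/libtorch_cpu.so",
]
device_files, host_files = classify(sample)
print(sorted(device_files))  # ['gfx11', 'gfx1100', 'gfx942']
print(host_files)            # ['torch/lib/libtorch_cpu.so']
```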
| Phase | Time |
|---|---|
| Phase 2b: Database scanning | <1s |
| Phase 3: Kernel extraction | 102s |
| Phase 4: Kpack archive creation | 4.0s |
| Phase 5: Host binary transformation | ~30s |
| Phase 5: Database file removal | <1s |
| RECORD generation (SHA-256 hashing) | ~30s |
| Host wheel zipping | ~60s |
| Device wheel creation (16x) | ~120s |
| Total wall-clock | ~6 min |
Extraction is dominated by libmagma.so (1.2 GB, 6,160 kernels, 102s), which alone accounts for effectively all of the Phase 3 wall-clock time.
| Library | Original | Stripped | Saved | % |
|---|---|---|---|---|
| libmagma.so | 1,249 MB | 53 MB | 1,196 MB | 95.7% |
| librocsolver.so | 725 MB | 19 MB | 706 MB | 97.4% |
| librccl.so | 466 MB | 2.2 MB | 464 MB | 99.5% |
| librocsparse.so | 435 MB | 50 MB | 386 MB | 88.6% |
| libtorch_hip.so | 478 MB | 136 MB | 342 MB | 71.6% |
| librocrand.so | 337 MB | 38 MB | 299 MB | 88.7% |
| libMIOpen.so | 951 MB | 735 MB | 216 MB | 22.7% |
| librocblas.so | 53 MB | 25 MB | 28 MB | 52.4% |
| librocfft.so | 25 MB | 11 MB | 14 MB | 58.1% |
| libhipsparselt.so | 7.1 MB | 6.9 MB | 0.2 MB | 3.1% |
| Bundle Key | Level | Kernels | DB Files | Kpack Size | Wheel Size (.whl) |
|---|---|---|---|---|---|
| gfx900 | target | 685 | 5 | 50 MB | 54 MB |
| gfx906 | target | 685 | 5 | 50 MB | 55 MB |
| gfx908 | target | 1,642 | 285 | 263 MB | 289 MB |
| gfx90a | target | 2,281* | 2,522* | 469 MB* | 683 MB |
| gfx942 | target | 2,313* | 3,636* | 463 MB* | 1,636 MB |
| gfx950 | target | 1,645 | 2,539 | 250 MB | 478 MB |
| gfx1030 | target | 1,342 | 90 | 217 MB | 223 MB |
| gfx1100 | target | 1,370 | 191 | 228 MB | 241 MB |
| gfx1101 | target | 1,370 | 207 | 228 MB | 239 MB |
| gfx1102 | target | 1,370 | 96 | 228 MB | 229 MB |
| gfx1150 | target | 1,369 | 191 | 185 MB | 194 MB |
| gfx1151 | target | 1,369 | 191 | 185 MB | 195 MB |
| gfx1200 | target | 1,370 | 343 | 218 MB | 275 MB |
| gfx1201 | target | 1,370 | 353 | 218 MB | 260 MB |
| gfx11 | family | -- | 1,780 | -- | 453 MB |
| gfx12_0 | sub-family | -- | 1,888 | -- | 109 MB |
* gfx90a and gfx942 include collapsed xnack+/xnack- variants (3 kpack files each).
Notes:
- gfx942 is the heaviest device wheel (1.6 GB) due to hipblaslt Tensile databases + 3 kpack variant files
- gfx11 and gfx12_0 are database-only wheels (AOTriton family/sub-family arch directories, no ELF kernels)
- Kpack archives use zstd compression (level 3)
What a user actually downloads for their GPU:
| User's GPU | Downloads | Total | Savings |
|---|---|---|---|
| gfx900 | host + gfx900 | 485 MB | 91% |
| gfx906 | host + gfx906 | 486 MB | 91% |
| gfx908 | host + gfx908 | 720 MB | 86% |
| gfx90a | host + gfx90a | 1,114 MB | 78% |
| gfx942 | host + gfx942 | 2,067 MB | 60% |
| gfx950 | host + gfx950 | 909 MB | 83% |
| gfx1030 | host + gfx1030 | 654 MB | 87% |
| gfx1100 | host + gfx1100 + gfx11 | 1,125 MB | 78% |
| gfx1200 | host + gfx1200 + gfx12_0 | 815 MB | 84% |
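
The totals in this table are the host wheel plus the target wheel plus any family or sub-family wheel in its chain. A minimal sketch of that arithmetic follows; the sizes are copied from the tables above, `CHAIN` covers only the two chains this report lists, and treating 5.1 GB as 5,100 MB is an assumption, so computed percentages can differ from the table by a point.

```python
HOST_MB = 431
INPUT_MB = 5100  # assumption: 5.1 GB taken as 5,100 MB

# Device wheel sizes (MB) from the device wheel table.
WHEEL_MB = {
    "gfx900": 54, "gfx906": 55, "gfx908": 289, "gfx90a": 683,
    "gfx942": 1636, "gfx950": 478, "gfx1030": 223,
    "gfx1100": 241, "gfx1200": 275,
    "gfx11": 453, "gfx12_0": 109,
}
# Extra wheels pulled in by a target (only the chains shown in this report).
CHAIN = {"gfx1100": ["gfx11"], "gfx1200": ["gfx12_0"]}


def download_mb(target: str) -> int:
    """Host wheel + target wheel + any family/sub-family wheels."""
    extra = sum(WHEEL_MB[key] for key in CHAIN.get(target, []))
    return HOST_MB + WHEEL_MB[target] + extra


def savings(target: str) -> float:
    """Fraction of the original fat wheel that is no longer downloaded."""
    return 1 - download_mb(target) / INPUT_MB


print(download_mb("gfx1100"))  # 1125
print(download_mb("gfx900"))   # 485
print(f"{savings('gfx1100'):.0%}")  # 78%
```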
The host wheel METADATA is augmented with device wheel references. Example excerpt:
Requires-Dist: rocm-bootstrap
Provides-Extra: gfx1100
Requires-Dist: amd-torch-device-gfx1100 == 2.10.0+rocm7.1; extra == "gfx1100"
Requires-Dist: amd-torch-device-gfx1100 == 2.10.0+rocm7.1; "amd :: gfx_arch :: gfx1100" in variant_properties
Requires-Dist: amd-torch-device-gfx11 == 2.10.0+rocm7.1; extra == "gfx1100"
Requires-Dist: amd-torch-device-gfx11 == 2.10.0+rocm7.1; "amd :: gfx_arch :: gfx1100" in variant_properties
...
Provides-Extra: all
Requires-Dist: amd-torch-device-gfx1030 == 2.10.0+rocm7.1; extra == "all"
Requires-Dist: amd-torch-device-gfx11 == 2.10.0+rocm7.1; extra == "all"
...
Installation paths:
- `pip install torch[gfx1100]` -- extras-based, pulls target + family chain
- `uv pip install torch` with PEP 817 variant properties -- automatic resolution via variant markers
- `pip install torch[all]` -- install all 16 device wheels
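
The METADATA excerpt above follows a regular pattern: for each wheel in a target's chain, one `Requires-Dist` line keyed on the extra and one keyed on the variant marker. The sketch below reproduces that pattern; `device_requirements` is a hypothetical helper written for illustration, not the tool's actual function.

```python
def device_requirements(target, chain, version, prefix="amd-torch-device"):
    """Build the METADATA lines for one target extra (illustrative only)."""
    marker = f'"amd :: gfx_arch :: {target}" in variant_properties'
    lines = [f"Provides-Extra: {target}"]
    # The target wheel comes first, then each family/sub-family wheel.
    for key in [target, *chain]:
        dist = f"{prefix}-{key}".replace("_", "-")
        lines.append(f'Requires-Dist: {dist} == {version}; extra == "{target}"')
        lines.append(f"Requires-Dist: {dist} == {version}; {marker}")
    return lines


for line in device_requirements("gfx1100", ["gfx11"], "2.10.0+rocm7.1"):
    print(line)
```

For `gfx1100` with the `gfx11` family in its chain, this prints the same five lines as the start of the excerpt.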
torch-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64/
+-- torch-2.10.0+rocm7.1.dist-info/
| +-- METADATA (augmented with extras, variant markers, rocm-bootstrap dep)
| +-- WHEEL (original, preserved)
| +-- RECORD (regenerated with SHA-256 hashes)
+-- torch/
+-- lib/
| +-- libmagma.so (53 MB, device code stripped)
| +-- libMIOpen.so (735 MB, device code stripped)
| +-- libtorch_cpu.so (433 MB, unchanged -- no device code)
| +-- ...
| (rocblas/library/, hipblaslt/library/, hipsparselt/library/,
| aotriton.images/ -- arch-specific files removed)
+-- share/miopen/db/ (arch-specific files removed)
+-- _C.cpython-313-x86_64-linux-gnu.so
+-- ... (~11,900 files remaining)
amd_torch_device_gfx942-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64/
+-- amd_torch_device_gfx942-2.10.0+rocm7.1.dist-info/
| +-- METADATA (Requires-Dist: torch == 2.10.0+rocm7.1)
| +-- WHEEL
| +-- RECORD
| +-- top_level.txt
+-- torch/
+-- .kpack/
| +-- torch_gfx942.kpack (96 MB, bare arch kernels)
| +-- torch_gfx942:xnack+.kpack (170 MB, xnack+ variant)
| +-- torch_gfx942:xnack-.kpack (170 MB, xnack- variant)
+-- lib/
| +-- rocblas/library/ (gfx942 Tensile kernels)
| +-- hipblaslt/library/ (gfx942 Tensile kernels)
| +-- hipsparselt/library/ (gfx942 Tensile kernels)
| +-- aotriton.images/amd-gfx942/ (AOTriton kernel images)
+-- share/miopen/db/ (gfx942 tuning databases)
amd_torch_device_gfx11-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64/
+-- amd_torch_device_gfx11-2.10.0+rocm7.1.dist-info/
| +-- METADATA (Requires-Dist: torch == 2.10.0+rocm7.1)
| +-- WHEEL
| +-- RECORD
| +-- top_level.txt
+-- torch/
+-- lib/
+-- aotriton.images/amd-gfx11xx/ (AOTriton family kernel images)
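
Since a wheel is a zip archive, the layouts above can be checked with the standard library alone. The helper below is a small sketch (the name `kpack_members` is made up for this example) that lists the `.kpack` archives inside a wheel with their uncompressed sizes.

```python
import zipfile


def kpack_members(wheel_path):
    """Return (name, uncompressed size in bytes) for each .kpack in a wheel."""
    with zipfile.ZipFile(wheel_path) as whl:
        return [
            (info.filename, info.file_size)
            for info in whl.infolist()
            if info.filename.endswith(".kpack")
        ]
```

Run against the gfx942 device wheel from this split, it should list the three archives under `torch/.kpack/`; a database-only wheel such as the gfx11 family wheel should return an empty list.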
python -m rocm_kpack.tools.split_python_wheels \
--input torch-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl \
--output-dir dist/ \
--device-package-prefix amd-torch-device \
--overlay-root torch/ \
--wheel-type torch-fat \
--output-format wheel \
--verbose \
--jobs 20

| Library | Size | Kernels | Time | Architectures |
|---|---|---|---|---|
| libmagma.so | 1,249 MB | 6,160 | 101.9s | 14 |
| libMIOpen.so | 951 MB | 1,668 | 13.4s | 16 |
| librocsolver.so | 725 MB | 3,290 | 13.4s | 14 |
| librccl.so | 466 MB | 12 | 17.2s | 12 |
| librocsparse.so | 435 MB | 4,466 | 9.5s | 14 |
| libtorch_hip.so | 478 MB | 3,459 | 22.6s | 14 |
| librocrand.so | 337 MB | 210 | 3.2s | 14 |
| librocblas.so | 53 MB | 896 | 3.0s | 14 |
| librocfft.so | 25 MB | 14 | 0.2s | 14 |
| libhipsparselt.so | 7 MB | 6 | <0.1s | 2 |
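
A quick check on these timings: the per-library times add up to well over the 102s reported for the kernel extraction phase, while the slowest library alone is 101.9s. That is consistent with libraries being extracted concurrently under `--jobs 20`, although this report does not describe the scheduling, so treat that reading as an assumption. The `<0.1s` entry is taken as 0.1s.

```python
# Per-library extraction times in seconds, from the table above.
EXTRACT_SECONDS = {
    "libmagma.so": 101.9,
    "libMIOpen.so": 13.4,
    "librocsolver.so": 13.4,
    "librccl.so": 17.2,
    "librocsparse.so": 9.5,
    "libtorch_hip.so": 22.6,
    "librocrand.so": 3.2,
    "librocblas.so": 3.0,
    "librocfft.so": 0.2,
    "libhipsparselt.so": 0.1,  # reported as "<0.1s"; upper bound used
}

serial_total = sum(EXTRACT_SECONDS.values())           # cost if run one by one
slowest = max(EXTRACT_SECONDS, key=EXTRACT_SECONDS.get)  # critical path

print(f"sum of library times: {serial_total:.1f}s")
print(f"slowest library: {slowest} ({EXTRACT_SECONDS[slowest]}s)")
```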