stellaraccident/wheel-split-report.md

## wheel-split-report.md

      
PyTorch ROCm Wheel Split Report

Split of torch-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl using rocm_kpack.tools.split_python_wheels.

Summary


| Metric | Value |
| --- | --- |
| Input wheel (.whl) | 5.1 GB |
| Host wheel (.whl) | 431 MB |
| Device wheels (16 bundle keys, .whl) | 5.5 GB total |
| Fat binaries processed | 10 |
| Database files relocated | 14,322 |
| Raw GPU architectures found | 20 (18 ELF + 2 database-only) |
| Device wheels produced | 16 (xnack variants collapsed, hierarchy-aware naming) |
| Total kernels extracted | 20,181 |


Key result: A user installing for gfx1100 downloads 431 MB (host) + 241 MB (gfx1100) + 453 MB (gfx11 family) = 1,125 MB instead of 5.1 GB (78% reduction). A gfx900 user downloads 431 MB + 54 MB = 485 MB (91% reduction). The host wheel is architecture-independent and cacheable across all GPU targets.

What Changed (v2)


- Hierarchy-aware bundling: Device wheels use rocm-bootstrap's 3-level naming hierarchy (family/sub-family/target). AOTriton's gfx11xx directory becomes the amd-torch-device-gfx11 family wheel; gfx120x becomes amd-torch-device-gfx12-0.
- xnack collapse: gfx942, gfx942:xnack+, and gfx942:xnack- kernels are merged into a single amd-torch-device-gfx942 wheel (3 .kpack files). The same applies to gfx90a. This reduces 20 raw arches to 16 device wheels.
- Host METADATA rewrite: Adds Requires-Dist: rocm-bootstrap, per-target extras (pip install torch[gfx1100]), PEP 817 variant markers ("amd :: gfx_arch :: gfx1100" in variant_properties), packaging chain fan-out (the gfx1100 extra pulls in the gfx11 family wheel), and an all extra.
- Device wheel naming uses rocm_bootstrap.device_dist_name() for canonical names.
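The xnack collapse and hierarchy-aware naming described above can be sketched roughly as follows. This is an illustration only, not the rocm_kpack or rocm-bootstrap code: `bundle_key`, `dist_name`, and the `FAMILY_DIRS` table are hypothetical names, the table covers only the two AOTriton directories named in this report, and `dist_name` is a stand-in for rocm_bootstrap.device_dist_name(), whose real signature is not shown here.

```python
# Hypothetical sketch: collapse raw arch strings into device-wheel bundle keys.
# Only the two family directories mentioned in this report are mapped.
FAMILY_DIRS = {
    "gfx11xx": "gfx11",    # AOTriton family directory -> family wheel
    "gfx120x": "gfx12_0",  # AOTriton sub-family directory -> sub-family wheel
}


def bundle_key(raw_arch: str) -> str:
    """Return the bundle key for a raw arch such as 'gfx942:xnack+'."""
    base = raw_arch.split(":", 1)[0]  # drop a ':xnack+' / ':xnack-' suffix
    return FAMILY_DIRS.get(base, base)


def dist_name(key: str, prefix: str = "amd-torch-device") -> str:
    """Assumed naming rule: prefix + key, with '_' written as '-'."""
    return f"{prefix}-{key.replace('_', '-')}"


def group_by_bundle(raw_arches):
    """Group raw arch strings under the bundle key they collapse into."""
    groups = {}
    for arch in raw_arches:
        groups.setdefault(bundle_key(arch), []).append(arch)
    return groups


groups = group_by_bundle(
    ["gfx942", "gfx942:xnack+", "gfx942:xnack-", "gfx1100", "gfx11xx", "gfx120x"]
)
for key, members in groups.items():
    print(dist_name(key), members)
```

With this grouping, the three gfx942 variants end up under one key, which is why the gfx942 wheel carries three .kpack files.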

What Gets Split

1. ELF Fat Binary Device Code (3.8 GB -> kpack archives)

Ten shared libraries contain embedded GPU kernels for up to 18 architectures. Device code is extracted into per-arch .kpack archives, and host binaries are rewritten with kpack load references.
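The report does not spell out how the fat binaries are parsed. As background, HIP fat binaries commonly embed device code as a Clang offload bundle, and the sketch below reads the entry table of an uncompressed bundle to list which architectures are present. It is an illustration under assumptions, not rocm_kpack's implementation: `list_bundle_entries` and `gpu_arch` are made-up names, compressed bundles are not handled, and the demo runs on a synthetic header rather than a real library.

```python
import re
import struct

MAGIC = b"__CLANG_OFFLOAD_BUNDLE__"  # 24-byte magic of an uncompressed bundle
ARCH_RE = re.compile(r"gfx[0-9a-f]+(?::\w+[+-])*")


def list_bundle_entries(blob: bytes):
    """Return (entry_id, offset, size) for each entry of the first bundle found."""
    start = blob.find(MAGIC)
    if start < 0:
        return []
    pos = start + len(MAGIC)
    (count,) = struct.unpack_from("<Q", blob, pos)  # number of entries
    pos += 8
    entries = []
    for _ in range(count):
        # Each entry: code object offset, size, and length of its id string.
        offset, size, id_len = struct.unpack_from("<QQQ", blob, pos)
        pos += 24
        entry_id = blob[pos:pos + id_len].decode("ascii")
        pos += id_len
        entries.append((entry_id, offset, size))
    return entries


def gpu_arch(entry_id: str):
    """Extract e.g. 'gfx942:xnack+' from an entry id; None for host entries."""
    if entry_id.startswith("host-"):
        return None
    match = ARCH_RE.search(entry_id)
    return match.group(0) if match else None


# Synthetic header with two entries and no real code objects.
ids = [b"host-x86_64-unknown-linux-gnu", b"hipv4-amdgcn-amd-amdhsa--gfx942:xnack+"]
header = MAGIC + struct.pack("<Q", len(ids))
for ident in ids:
    header += struct.pack("<QQQ", 0, 0, len(ident)) + ident

for entry_id, offset, size in list_bundle_entries(header):
    print(entry_id, "->", gpu_arch(entry_id))
```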

2. Kernel Database Files (6.2 GB -> per-arch device wheels)

Architecture-specific database files (Tensile .co/.hsaco/.dat, AOTriton .aks2 images, MIOpen tuning databases) are moved from the host wheel into the appropriate device wheel.


| Database | Files | Size | Arches |
| --- | --- | --- | --- |
| hipblaslt | 2,990 | 3,786 MB | gfx942, gfx950 |
| aotriton | 9,334 | 870 MB | gfx90a, gfx942, gfx950, gfx11 (family), gfx12_0 (sub-family) |
| miopen | 62 | 744 MB | gfx900-gfx1030 |
| rocblas | 1,738 | 645 MB | 14 arches |
| hipsparselt | 198 | 138 MB | gfx942, gfx950 |
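One plausible way to decide which device wheel a database file belongs to is to look for an arch token in its path. The sketch below is only that: the regex, the `arch_of` and `bucket_by_arch` helpers, and the example file names are assumptions. Only the directory names aotriton.images/amd-gfx942/ and aotriton.images/amd-gfx11xx/ come from this report.

```python
import re
from collections import defaultdict

# Matches tokens such as gfx942, gfx90a, gfx11xx, gfx120x anywhere in a path.
ARCH_TOKEN = re.compile(r"gfx[0-9a-fx]+")


def arch_of(path: str):
    """Return the first arch token in the path, or None if there is none."""
    match = ARCH_TOKEN.search(path)
    return match.group(0) if match else None


def bucket_by_arch(paths):
    """Split paths into {arch: [paths]} plus a list of arch-neutral (host) paths."""
    buckets = defaultdict(list)
    host = []
    for path in paths:
        arch = arch_of(path)
        (buckets[arch] if arch else host).append(path)
    return dict(buckets), host


paths = [
    "torch/lib/aotriton.images/amd-gfx942/example.aks2",   # hypothetical file name
    "torch/lib/aotriton.images/amd-gfx11xx/example.aks2",  # hypothetical file name
    "torch/lib/libtorch_cpu.so",                           # no arch token: stays in host
]
per_arch, host = bucket_by_arch(paths)
print(per_arch)
print(host)
```

The raw tokens (gfx11xx, gfx120x) would still need to be mapped to bundle keys such as gfx11 and gfx12_0 before choosing a wheel.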


Timing (20 parallel workers)


| Phase | Time |
| --- | --- |
| Phase 2b: Database scanning | <1s |
| Phase 3: Kernel extraction | 102s |
| Phase 4: Kpack archive creation | 4.0s |
| Phase 5: Host binary transformation | ~30s |
| Phase 5: Database file removal | <1s |
| RECORD generation (SHA-256 hashing) | ~30s |
| Host wheel zipping | ~60s |
| Device wheel creation (16x) | ~120s |
| Total wall-clock | ~6 min |


Extraction is dominated by libmagma.so (1.2 GB, 6,160 kernels, 102s).

Per-Binary Stripping Results


| Library | Original | Stripped | Saved | % |
| --- | --- | --- | --- | --- |
| libmagma.so | 1,249 MB | 53 MB | 1,196 MB | 95.7% |
| librocsolver.so | 725 MB | 19 MB | 706 MB | 97.4% |
| librccl.so | 466 MB | 2.2 MB | 464 MB | 99.5% |
| librocsparse.so | 435 MB | 50 MB | 386 MB | 88.6% |
| libtorch_hip.so | 478 MB | 136 MB | 342 MB | 71.6% |
| librocrand.so | 337 MB | 38 MB | 299 MB | 88.7% |
| libMIOpen.so | 951 MB | 735 MB | 216 MB | 22.7% |
| librocblas.so | 53 MB | 25 MB | 28 MB | 52.4% |
| librocfft.so | 25 MB | 11 MB | 14 MB | 58.1% |
| libhipsparselt.so | 7.1 MB | 6.9 MB | 0.2 MB | 3.1% |
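The Saved and % columns follow directly from the Original and Stripped sizes. A small check, using the rounded MB values from the table above; because those inputs are rounded, the computed percentages can differ slightly from the printed ones, which were presumably derived from exact byte counts.

```python
# (library, original MB, stripped MB), copied from the table above.
ROWS = [
    ("libmagma.so", 1249, 53),
    ("librocsolver.so", 725, 19),
    ("librccl.so", 466, 2.2),
    ("librocsparse.so", 435, 50),
    ("libtorch_hip.so", 478, 136),
    ("librocrand.so", 337, 38),
    ("libMIOpen.so", 951, 735),
    ("librocblas.so", 53, 25),
    ("librocfft.so", 25, 11),
    ("libhipsparselt.so", 7.1, 6.9),
]


def saved(original_mb, stripped_mb):
    """Return (saved MB, saved percent of original) for one library."""
    delta = original_mb - stripped_mb
    return delta, 100.0 * delta / original_mb


for name, original, stripped in ROWS:
    delta, pct = saved(original, stripped)
    print(f"{name:20s} {delta:8.1f} MB  {pct:5.1f}%")

total_original = sum(row[1] for row in ROWS)
total_stripped = sum(row[2] for row in ROWS)
print(f"all ten libraries: {total_original:.1f} MB -> {total_stripped:.1f} MB")
```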


Per-Architecture Device Wheels


| Bundle Key | Level | Kernels | DB Files | Kpack Size | Wheel Size (.whl) |
| --- | --- | --- | --- | --- | --- |
| gfx900 | target | 685 | 5 | 50 MB | 54 MB |
| gfx906 | target | 685 | 5 | 50 MB | 55 MB |
| gfx908 | target | 1,642 | 285 | 263 MB | 289 MB |
| gfx90a | target | 2,281* | 2,522* | 469 MB* | 683 MB |
| gfx942 | target | 2,313* | 3,636* | 463 MB* | 1,636 MB |
| gfx950 | target | 1,645 | 2,539 | 250 MB | 478 MB |
| gfx1030 | target | 1,342 | 90 | 217 MB | 223 MB |
| gfx1100 | target | 1,370 | 191 | 228 MB | 241 MB |
| gfx1101 | target | 1,370 | 207 | 228 MB | 239 MB |
| gfx1102 | target | 1,370 | 96 | 228 MB | 229 MB |
| gfx1150 | target | 1,369 | 191 | 185 MB | 194 MB |
| gfx1151 | target | 1,369 | 191 | 185 MB | 195 MB |
| gfx1200 | target | 1,370 | 343 | 218 MB | 275 MB |
| gfx1201 | target | 1,370 | 353 | 218 MB | 260 MB |
| gfx11 | family | -- | 1,780 | -- | 453 MB |
| gfx12_0 | sub-family | -- | 1,888 | -- | 109 MB |


* gfx90a and gfx942 include collapsed xnack+/xnack- variants (3 kpack files each).

Notes:

- gfx942 is the heaviest device wheel (1.6 GB) due to hipblaslt Tensile databases + 3 kpack variant files.
- gfx11 and gfx12_0 are database-only wheels (AOTriton family/sub-family arch directories, no ELF kernels).
- Kpack archives use zstd compression (level 3).

User Download Sizes

What a user actually downloads for their GPU:


| User's GPU | Downloads | Total | Savings |
| --- | --- | --- | --- |
| gfx900 | host + gfx900 | 485 MB | 91% |
| gfx906 | host + gfx906 | 486 MB | 91% |
| gfx908 | host + gfx908 | 720 MB | 86% |
| gfx90a | host + gfx90a | 1,114 MB | 78% |
| gfx942 | host + gfx942 | 2,067 MB | 60% |
| gfx950 | host + gfx950 | 909 MB | 83% |
| gfx1030 | host + gfx1030 | 654 MB | 87% |
| gfx1100 | host + gfx1100 + gfx11 | 1,125 MB | 78% |
| gfx1200 | host + gfx1200 + gfx12_0 | 815 MB | 84% |
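The totals above are the host wheel plus the target wheel plus any family or sub-family wheel in its packaging chain. A short sketch of that arithmetic, using sizes from this report; the `CHAIN` table lists only the two chains shown in the table, and treating the 5.1 GB input as exactly 5,100 MB is an approximation, so computed savings can land about a percentage point away from the printed ones.

```python
INPUT_MB = 5100  # 5.1 GB input wheel, approximated as 5,100 MB
HOST_MB = 431    # host wheel

# Device wheel sizes in MB, taken from the per-architecture table.
DEVICE_MB = {
    "gfx900": 54,
    "gfx942": 1636,
    "gfx1100": 241,
    "gfx1200": 275,
    "gfx11": 453,
    "gfx12_0": 109,
}

# Packaging chain: extra wheels a target pulls in (only the two shown above).
CHAIN = {"gfx1100": ["gfx11"], "gfx1200": ["gfx12_0"]}


def download_mb(target: str) -> int:
    """Host wheel + target wheel + any family/sub-family wheels."""
    wheels = [target] + CHAIN.get(target, [])
    return HOST_MB + sum(DEVICE_MB[w] for w in wheels)


def savings_pct(total_mb: float) -> float:
    """Reduction relative to downloading the original fat wheel."""
    return 100.0 * (1 - total_mb / INPUT_MB)


for target in ("gfx900", "gfx942", "gfx1100", "gfx1200"):
    total = download_mb(target)
    print(f"{target}: {total} MB, about {savings_pct(total):.0f}% smaller")
```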


Host Wheel METADATA (Injected)

The host wheel METADATA is augmented with device wheel references. Example excerpt:
Requires-Dist: rocm-bootstrap
Provides-Extra: gfx1100
Requires-Dist: amd-torch-device-gfx1100 == 2.10.0+rocm7.1; extra == "gfx1100"
Requires-Dist: amd-torch-device-gfx1100 == 2.10.0+rocm7.1; "amd :: gfx_arch :: gfx1100" in variant_properties
Requires-Dist: amd-torch-device-gfx11 == 2.10.0+rocm7.1; extra == "gfx1100"
Requires-Dist: amd-torch-device-gfx11 == 2.10.0+rocm7.1; "amd :: gfx_arch :: gfx1100" in variant_properties
...
Provides-Extra: all
Requires-Dist: amd-torch-device-gfx1030 == 2.10.0+rocm7.1; extra == "all"
Requires-Dist: amd-torch-device-gfx11 == 2.10.0+rocm7.1; extra == "all"
...
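The per-target lines in the excerpt above follow a regular pattern: one extras marker and one variant marker for the target wheel and for each wheel in its packaging chain. A minimal sketch of generating them, assuming that pattern holds for every target; `requires_dist_lines` is a hypothetical helper, not part of rocm_kpack, and the name derivation (prefix plus key, with `_` written as `-`) is inferred from the names in this report.

```python
VERSION = "2.10.0+rocm7.1"
PREFIX = "amd-torch-device"


def requires_dist_lines(target, chain):
    """Emit Provides-Extra / Requires-Dist lines for one target extra.

    `chain` lists the family/sub-family bundle keys the target also needs.
    """
    lines = [f"Provides-Extra: {target}"]
    for key in [target, *chain]:
        dist = f"{PREFIX}-{key.replace('_', '-')}"
        pin = f"{dist} == {VERSION}"
        # Extras-based path: pip install torch[<target>]
        lines.append(f'Requires-Dist: {pin}; extra == "{target}"')
        # Variant-based path: marker on the resolved variant properties
        lines.append(
            f'Requires-Dist: {pin}; "amd :: gfx_arch :: {target}" in variant_properties'
        )
    return lines


print("\n".join(requires_dist_lines("gfx1100", ["gfx11"])))
```

For gfx1100 with the gfx11 family in its chain, this reproduces the five gfx1100 lines shown in the excerpt.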

Installation paths:

- pip install torch[gfx1100] -- extras-based, pulls the target wheel plus its family chain
- uv pip install torch, with PEP 817 variant properties -- automatic resolution via variant markers
- pip install torch[all] -- installs all 16 device wheels

Output Structure

Host Wheel (431 MB compressed)

torch-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64/
+-- torch-2.10.0+rocm7.1.dist-info/
|   +-- METADATA     (augmented with extras, variant markers, rocm-bootstrap dep)
|   +-- WHEEL        (original, preserved)
|   +-- RECORD       (regenerated with SHA-256 hashes)
+-- torch/
    +-- lib/
    |   +-- libmagma.so          (53 MB, device code stripped)
    |   +-- libMIOpen.so         (735 MB, device code stripped)
    |   +-- libtorch_cpu.so      (433 MB, unchanged -- no device code)
    |   +-- ...
    |   (rocblas/library/, hipblaslt/library/, hipsparselt/library/,
    |    aotriton.images/ -- arch-specific files removed)
    +-- share/miopen/db/         (arch-specific files removed)
    +-- _C.cpython-313-x86_64-linux-gnu.so
    +-- ... (~11,900 files remaining)
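Regenerating RECORD means re-hashing every file that ends up in the wheel. In the wheel format, each RECORD line is the file path, `sha256=` followed by the URL-safe base64 digest with trailing `=` padding removed, and the file size in bytes. A minimal sketch of one entry follows; `record_entry` and the example path are illustrative, and this is not the rocm_kpack code.

```python
import base64
import hashlib


def record_entry(path: str, data: bytes) -> str:
    """Build one RECORD line: path,sha256=<urlsafe-b64-no-padding>,<size>."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


# Hypothetical file and contents, for illustration only.
print(record_entry("torch/example.py", b"print('hello')\n"))
```

A real writer would stream large files in chunks rather than hold them in memory, use the csv module so that paths containing commas are quoted, and leave the hash and size fields empty on RECORD's own line.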

Device Wheel (example: gfx942, 1.6 GB compressed)

amd_torch_device_gfx942-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64/
+-- amd_torch_device_gfx942-2.10.0+rocm7.1.dist-info/
|   +-- METADATA     (Requires-Dist: torch == 2.10.0+rocm7.1)
|   +-- WHEEL
|   +-- RECORD
|   +-- top_level.txt
+-- torch/
    +-- .kpack/
    |   +-- torch_gfx942.kpack        (96 MB, bare arch kernels)
    |   +-- torch_gfx942:xnack+.kpack (170 MB, xnack+ variant)
    |   +-- torch_gfx942:xnack-.kpack (170 MB, xnack- variant)
    +-- lib/
    |   +-- rocblas/library/         (gfx942 Tensile kernels)
    |   +-- hipblaslt/library/       (gfx942 Tensile kernels)
    |   +-- hipsparselt/library/     (gfx942 Tensile kernels)
    |   +-- aotriton.images/amd-gfx942/  (AOTriton kernel images)
    +-- share/miopen/db/            (gfx942 tuning databases)

Device Wheel (example: gfx11 family, 453 MB compressed)

amd_torch_device_gfx11-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64/
+-- amd_torch_device_gfx11-2.10.0+rocm7.1.dist-info/
|   +-- METADATA     (Requires-Dist: torch == 2.10.0+rocm7.1)
|   +-- WHEEL
|   +-- RECORD
|   +-- top_level.txt
+-- torch/
    +-- lib/
        +-- aotriton.images/amd-gfx11xx/  (AOTriton family kernel images)

Command

python -m rocm_kpack.tools.split_python_wheels \
    --input torch-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl \
    --output-dir dist/ \
    --device-package-prefix amd-torch-device \
    --overlay-root torch/ \
    --wheel-type torch-fat \
    --output-format wheel \
    --verbose \
    --jobs 20

Per-Binary Extraction Details


| Library | Size | Kernels | Time | Architectures |
| --- | --- | --- | --- | --- |
| libmagma.so | 1,249 MB | 6,160 | 101.9s | 14 |
| libMIOpen.so | 951 MB | 1,668 | 13.4s | 16 |
| librocsolver.so | 725 MB | 3,290 | 13.4s | 14 |
| librccl.so | 466 MB | 12 | 17.2s | 12 |
| librocsparse.so | 435 MB | 4,466 | 9.5s | 14 |
| libtorch_hip.so | 478 MB | 3,459 | 22.6s | 14 |
| librocrand.so | 337 MB | 210 | 3.2s | 14 |
| librocblas.so | 53 MB | 896 | 3.0s | 14 |
| librocfft.so | 25 MB | 14 | 0.2s | 14 |
| libhipsparselt.so | 7 MB | 6 | <0.1s | 2 |


## z_file_list.txt
total 5.9G
-rw-rw-r-- 1 stella stella 223M Mar  9 18:53 amd_torch_device_gfx1030-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 241M Mar  9 18:53 amd_torch_device_gfx1100-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 239M Mar  9 18:53 amd_torch_device_gfx1101-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 229M Mar  9 18:53 amd_torch_device_gfx1102-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 453M Mar  9 18:53 amd_torch_device_gfx11-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 194M Mar  9 18:53 amd_torch_device_gfx1150-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 195M Mar  9 18:54 amd_torch_device_gfx1151-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 275M Mar  9 18:54 amd_torch_device_gfx1200-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 260M Mar  9 18:54 amd_torch_device_gfx1201-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 109M Mar  9 18:54 amd_torch_device_gfx12_0-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella  54M Mar  9 18:54 amd_torch_device_gfx900-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella  55M Mar  9 18:54 amd_torch_device_gfx906-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 289M Mar  9 18:54 amd_torch_device_gfx908-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 683M Mar  9 18:54 amd_torch_device_gfx90a-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 1.6G Mar  9 18:55 amd_torch_device_gfx942-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 478M Mar  9 18:55 amd_torch_device_gfx950-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl
-rw-rw-r-- 1 stella stella 431M Mar  9 18:53 torch-2.10.0+rocm7.1-cp313-cp313-manylinux_2_28_x86_64.whl