Created
February 28, 2026 19:08
-
-
Save vanbasten23/3a44bad4c6e6b9589048808cb243164c to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| # AllToAll Communication Cost on ND Torus | |
| ## Setup | |
| - **AllToAll** on an AxBxC... torus, N = A·B·C·... total nodes | |
| - V total bytes, W\_ici = bidirectional bandwidth per link | |
| ## Key idea: per-link load determines time | |
| In AllToAll, each node sends V/N² bytes to each other node. To find the time, we compute the **load on the most congested link** — that's the bottleneck. | |
| ## 1D torus (ring) first | |
| On a ring of N nodes: | |
| 1. **Average hop distance** between a random source-target pair is **N/4** (distances range from 1 to N/2 around the ring, averaging to N/4). | |
| 2. **Total data×hops** (i.e., total byte-hops across all pairs): | |
| N² pairs × V/N² bytes/pair × N/4 avg hops = **V·N/4** | |
| 3. **Links in the ring**: N links, each carrying equal load by symmetry. | |
| Load per link = V·N/4 ÷ N = **V/4** | |
| 4. **Time** = load / bandwidth = V / (4·W\_ici) | |
| This is 1/4 the cost of AllGather (≈V/W\_ici), which makes sense: in AllGather every byte must reach every node, but in AllToAll each byte has only one destination (on average N/4 hops away, not ~N/2). | |
| ## Generalizing to ND torus | |
| The critical insight: **links in different dimensions are physically separate**, so traffic along dimension A, dimension B, etc. flows **simultaneously in parallel**. The bottleneck is whichever dimension is most loaded. | |
| For dimension A: | |
| - Average hop distance along A = A/4 (same torus argument) | |
| - Total A-dimension byte-hops = N² · (V/N²) · (A/4) = V·A/4 | |
| - Number of A-dimension links = N (B·C rows, each a ring of A links) | |
| - Load per A-link = V·A / (4N) | |
| - Time for A-traffic = V·A / (4·N·W\_ici) | |
| Similarly, time for B-traffic = V·B / (4·N·W\_ici), etc. | |
| Since dimensions run in parallel, total time = **max over dimensions**: | |
| **T = V · max(A,B,C,...) / (4 · N · W\_ici)** | |
| ## "Scales down with the smallest axis" | |
| For a 2D torus A×B with A ≥ B: | |
| T = V·A / (4·A·B·W\_ici) = V / (4·B·W\_ici) | |
| The large dimension A **cancels out** (it's the bottleneck but also contributes proportionally to N). What remains is 1/B — the cost scales inversely with the **smallest** dimension. Making the mesh more square (increasing B) reduces cost. | |
| ## Caveat | |
| This is a lower-bound / idealized analysis. It assumes optimal shortest-path routing and that link utilization can be fully balanced (no contention). In practice, AllToAll on torus networks gets close to this bound, so it's a good approximation. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment