Skip to content

Instantly share code, notes, and snippets.

@vanbasten23
Created February 28, 2026 19:08
Show Gist options
  • Select an option

  • Save vanbasten23/3a44bad4c6e6b9589048808cb243164c to your computer and use it in GitHub Desktop.

Select an option

Save vanbasten23/3a44bad4c6e6b9589048808cb243164c to your computer and use it in GitHub Desktop.
# AllToAll Communication Cost on ND Torus
## Setup
- **AllToAll** on an AxBxC... torus, N = A·B·C·... total nodes
- V total bytes, W\_ici = bidirectional bandwidth per link
## Key idea: per-link load determines time
In AllToAll, each node sends V/N² bytes to each other node. To find the time, we compute the **load on the most congested link** — that's the bottleneck.
## 1D torus (ring) first
On a ring of N nodes:
1. **Average hop distance** between a random source-target pair is **N/4** (distances range from 1 to N/2 around the ring, averaging to N/4).
2. **Total data×hops** (i.e., total byte-hops across all pairs):
N² pairs × V/N² bytes/pair × N/4 avg hops = **V·N/4**
3. **Links in the ring**: N links, each carrying equal load by symmetry.
Load per link = V·N/4 ÷ N = **V/4**
4. **Time** = load / bandwidth = V / (4·W\_ici)
This is 1/4 the cost of AllGather (≈V/W\_ici), which makes sense: in AllGather every byte must reach every node, but in AllToAll each byte has only one destination (on average N/4 hops away, not ~N/2).
## Generalizing to ND torus
The critical insight: **links in different dimensions are physically separate**, so traffic along dimension A, dimension B, etc. flows **simultaneously in parallel**. The bottleneck is whichever dimension is most loaded.
For dimension A:
- Average hop distance along A = A/4 (same torus argument)
- Total A-dimension byte-hops = N² · (V/N²) · (A/4) = V·A/4
- Number of A-dimension links = N (B·C rows, each a ring of A links)
- Load per A-link = V·A / (4N)
- Time for A-traffic = V·A / (4·N·W\_ici)
Similarly, time for B-traffic = V·B / (4·N·W\_ici), etc.
Since dimensions run in parallel, total time = **max over dimensions**:
**T = V · max(A,B,C,...) / (4 · N · W\_ici)**
## "Scales down with the smallest axis"
For a 2D torus A×B with A ≥ B:
T = V·A / (4·A·B·W\_ici) = V / (4·B·W\_ici)
The large dimension A **cancels out** (it's the bottleneck but also contributes proportionally to N). What remains is 1/B — the cost scales inversely with the **smallest** dimension. Making the mesh more square (increasing B) reduces cost.
## Caveat
This is a lower-bound / idealized analysis. It assumes optimal shortest-path routing and that link utilization can be fully balanced (no contention). In practice, AllToAll on torus networks gets close to this bound, so it's a good approximation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment