vanbasten23/gist:3a44bad4c6e6b9589048808cb243164c

## gistfile1.txt
# AllToAll Communication Cost on ND Torus

## Setup

- **AllToAll** on an AxBxC... torus, N = A·B·C·... total nodes
- V total bytes, W\_ici = bidirectional bandwidth per link

## Key idea: per-link load determines time

In AllToAll, each node sends V/N² bytes to each other node. To find the time, we compute the **load on the most congested link** — that's the bottleneck.

## 1D torus (ring) first

On a ring of N nodes:

1. **Average hop distance** between a random source-target pair is **N/4** (distances range from 1 to N/2 around the ring, averaging to N/4).

2. **Total data×hops** (i.e., total byte-hops across all pairs):
   N² pairs × V/N² bytes/pair × N/4 avg hops = **V·N/4**

3. **Links in the ring**: N links, each carrying equal load by symmetry.
   Load per link = V·N/4 ÷ N = **V/4**

4. **Time** = load / bandwidth = V / (4·W\_ici)

This is 1/4 the cost of AllGather (≈V/W\_ici), which makes sense: in AllGather every byte must reach every node, but in AllToAll each byte has only one destination (on average N/4 hops away, not ~N/2).

## Generalizing to ND torus

The critical insight: **links in different dimensions are physically separate**, so traffic along dimension A, dimension B, etc. flows **simultaneously in parallel**. The bottleneck is whichever dimension is most loaded.

For dimension A:
- Average hop distance along A = A/4 (same torus argument)
- Total A-dimension byte-hops = N² · (V/N²) · (A/4) = V·A/4
- Number of A-dimension links = N (B·C rows, each a ring of A links)
- Load per A-link = V·A / (4N)
- Time for A-traffic = V·A / (4·N·W\_ici)

Similarly, time for B-traffic = V·B / (4·N·W\_ici), etc.

Since dimensions run in parallel, total time = **max over dimensions**:

**T = V · max(A,B,C,...) / (4 · N · W\_ici)**

## "Scales down with the smallest axis"

For a 2D torus A×B with A ≥ B:

T = V·A / (4·A·B·W\_ici) = V / (4·B·W\_ici)

The large dimension A **cancels out** (it's the bottleneck but also contributes proportionally to N). What remains is 1/B — the cost scales inversely with the **smallest** dimension. Making the mesh more square (increasing B) reduces cost.

## Caveat

This is a lower-bound / idealized analysis. It assumes optimal shortest-path routing and that link utilization can be fully balanced (no contention). In practice, AllToAll on torus networks gets close to this bound, so it's a good approximation.
	# AllToAll Communication Cost on ND Torus

	## Setup

	- AllToAll on an AxBxC... torus, N = A·B·C·... total nodes
	- V total bytes, W\_ici = bidirectional bandwidth per link

	## Key idea: per-link load determines time

	In AllToAll, each node sends V/N² bytes to each other node. To find the time, we compute the load on the most congested link — that's the bottleneck.

	## 1D torus (ring) first

	On a ring of N nodes:

	1. Average hop distance between a random source-target pair is N/4 (distances range from 1 to N/2 around the ring, averaging to N/4).

	2. Total data×hops (i.e., total byte-hops across all pairs):
	N² pairs × V/N² bytes/pair × N/4 avg hops = V·N/4

	3. Links in the ring: N links, each carrying equal load by symmetry.
	Load per link = V·N/4 ÷ N = V/4

	4. Time = load / bandwidth = V / (4·W\_ici)

	This is 1/4 the cost of AllGather (≈V/W\_ici), which makes sense: in AllGather every byte must reach every node, but in AllToAll each byte has only one destination (on average N/4 hops away, not ~N/2).

	## Generalizing to ND torus

	The critical insight: links in different dimensions are physically separate, so traffic along dimension A, dimension B, etc. flows simultaneously in parallel. The bottleneck is whichever dimension is most loaded.

	For dimension A:
	- Average hop distance along A = A/4 (same torus argument)
	- Total A-dimension byte-hops = N² · (V/N²) · (A/4) = V·A/4
	- Number of A-dimension links = N (B·C rows, each a ring of A links)
	- Load per A-link = V·A / (4N)
	- Time for A-traffic = V·A / (4·N·W\_ici)

	Similarly, time for B-traffic = V·B / (4·N·W\_ici), etc.

	Since dimensions run in parallel, total time = max over dimensions:

	T = V · max(A,B,C,...) / (4 · N · W\_ici)

	## "Scales down with the smallest axis"

	For a 2D torus A×B with A ≥ B:

	T = V·A / (4·A·B·W\_ici) = V / (4·B·W\_ici)

	The large dimension A cancels out (it's the bottleneck but also contributes proportionally to N). What remains is 1/B — the cost scales inversely with the smallest dimension. Making the mesh more square (increasing B) reduces cost.

	## Caveat

	This is a lower-bound / idealized analysis. It assumes optimal shortest-path routing and that link utilization can be fully balanced (no contention). In practice, AllToAll on torus networks gets close to this bound, so it's a good approximation.
No results found