PyTorch's multi_tensor_apply kernel uses kILP = 4 for vectorized memory access.
For 2-byte types (fp16, bf16), each vectorized access therefore moves 4 * 2 = 8 bytes per thread, i.e. a 64-bit load (LDG.64).
Modern GPUs support 128-bit loads (LDG.128). By increasing ILP to 8 for 16-bit types,
each thread would load 16 bytes per instruction, potentially doubling memory throughput.
The change introduces effective_ilp():