@nmoinvaz
Last active March 9, 2026 07:43

zlib-ng: VPCLMULQDQ AVX2 vs PCLMULQDQ CRC32 benchmark results on Intel i7-1185G7

zlib-ng: VPCLMULQDQ AVX2 vs PCLMULQDQ CRC32 Benchmark

Machine

  • CPU: 11th Gen Intel Core i7-1185G7 @ 3.00GHz (Tiger Lake)
  • Cores: 4 cores / 8 threads
  • L1d/L1i: 48 KiB / 32 KiB (x4)
  • L2: 1280 KiB (x4)
  • L3: 12288 KiB
  • OS: Windows 11 Pro
  • Compiler: MSVC (Visual Studio 18 2026)
  • Build: Release, static

Results

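Median CPU time (ns) over 5 repetitions. Sizes are in bytes. Improvement is the relative reduction in time versus pclmulqdq, (pclmulqdq - vpclmulqdq_avx2) / pclmulqdq; positive values mean vpclmulqdq_avx2 is faster.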

CRC32 (unaligned)

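| Size (bytes) | pclmulqdq (ns) | vpclmulqdq_avx2 (ns) | Improvement |
|---|---|---|---|
| 1 | 10.5 | 7.8 | +26% |
| 8 | 31.1 | 31.4 | -1% |
| 16 | 54.5 | 57.5 | -6% |
| 32 | 78.2 | 62.8 | +20% |
| 64 | 78.5 | 69.8 | +11% |
| 512 | 122.1 | 97.3 | +20% |
| 4096 | 350.3 | 279.0 | +20% |
| 32768 | 2335.5 | 1743.9 | +25% |
| 262144 | 18684.6 | 15694.8 | +16% |
| 4194304 | 313895.1 | 244140.6 | +22% |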

CRC32 (aligned)

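| Size (bytes) | pclmulqdq (ns) | vpclmulqdq_avx2 (ns) | Improvement |
|---|---|---|---|
| 8 | 28.0 | 33.1 | -18% |
| 16 | 17.4 | 17.4 | 0% |
| 32 | 20.9 | 21.4 | -2% |
| 64 | 26.2 | 27.9 | -6% |
| 512 | 68.1 | 52.5 | +23% |
| 4096 | 313.9 | 244.1 | +22% |
| 32768 | 1946.3 | 1751.6 | +10% |
| 262144 | 17127.6 | 13950.9 | +19% |
| 4194304 | 348772.3 | 232630.3 | +33% |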

CRC32+Copy (unaligned)

| Size (bytes) | pclmulqdq (ns) | vpclmulqdq_avx2 (ns) | Improvement |
|---|---|---|---|
| 32 | 77.9 | 62.3 | +20% |
| 512 | 117.3 | 109.4 | +7% |
| 8192 | 680.1 | 558.0 | +18% |
| 32768 | 2441.4 | 2441.4 | 0% |
| 65536 | 5231.6 | 4185.3 | +20% |

CRC32+Copy (aligned)

| Size (bytes) | pclmulqdq (ns) | vpclmulqdq_avx2 (ns) | Improvement |
|---|---|---|---|
| 32 | 23.4 | 25.7 | -10% |
| 512 | 55.8 | 52.3 | +6% |
| 8192 | 488.3 | 488.3 | 0% |
| 32768 | 2441.4 | 1918.2 | +21% |
| 65536 | 4534.0 | 3503.3 | +23% |
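
The Improvement column in the tables above appears to be the relative reduction in median time, (pclmulqdq - vpclmulqdq_avx2) / pclmulqdq, rounded to a whole percent; that formula reproduces the listed values for the rows checked here. The following is a minimal Python sketch for recomputing it and for converting a row into throughput. The helper names `improvement_pct` and `throughput_gb_s` are hypothetical and are not part of zlib-ng or its benchmark suite.

```python
def improvement_pct(baseline_ns: float, candidate_ns: float) -> int:
    """Relative time saved by the candidate, as a whole percent.

    Positive means the candidate (vpclmulqdq_avx2) is faster.
    """
    return round((baseline_ns - candidate_ns) / baseline_ns * 100)


def throughput_gb_s(size_bytes: int, time_ns: float) -> float:
    """Bytes per nanosecond, numerically equal to GB/s (10^9 bytes per second)."""
    return size_bytes / time_ns


# Rows copied from the tables above: (size, pclmulqdq ns, vpclmulqdq_avx2 ns).
rows = [
    (1, 10.5, 7.8),                 # CRC32 (unaligned)
    (8, 28.0, 33.1),                # CRC32 (aligned)
    (4194304, 348772.3, 232630.3),  # CRC32 (aligned)
]

for size, base_ns, cand_ns in rows:
    print(
        size,
        f"{improvement_pct(base_ns, cand_ns):+d}%",
        f"{throughput_gb_s(size, base_ns):.2f} -> "
        f"{throughput_gb_s(size, cand_ns):.2f} GB/s",
    )
```

Read this way, the 4 MiB aligned row corresponds to roughly 12.0 GB/s for pclmulqdq and roughly 18.0 GB/s for vpclmulqdq_avx2.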

Summary

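VPCLMULQDQ on AVX2 operates on 256-bit registers, performing a carry-less multiply in each of the two 128-bit lanes per instruction, so each fold iteration processes 256 bits instead of the 128 bits handled by PCLMULQDQ. For plain CRC32, this yields speedups of 10-33% at buffer sizes >= 512 bytes. CRC32+Copy benefits less consistently: gains are 6-7% at 512 bytes and 18-23% at most larger sizes, with two rows (unaligned 32768 and aligned 8192) showing no measured change. For small buffers (< 64 bytes), both paths share the same tail-processing code, so the fold width should not matter there; the measured differences at these sizes, which range from -18% to +26%, are therefore unlikely to reflect the fold kernel itself.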
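
To illustrate why a wider fold halves the number of iterations, here is a simplified Python sketch of CRC folding with carry-less multiplication. It is not zlib-ng's implementation: it models the arithmetic with plain integers, uses the non-reflected CRC-32 polynomial with no initial value or final XOR (so the result is a raw polynomial remainder, not a `zlib.crc32` value), and the names `clmul`, `polymod`, and `fold_remainder` are hypothetical. Each 64-bit limb of the accumulator is multiplied by a precomputed constant x^(W + 64i) mod P, where W is the fold width, and the products are XORed with the next W-bit block.

```python
P = 0x104C11DB7  # CRC-32 generator polynomial, including the x^32 term


def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2)) multiplication of two non-negative integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def polymod(value: int, poly: int = P) -> int:
    """Remainder of value divided by poly, with coefficients in GF(2)."""
    plen = poly.bit_length()
    while value.bit_length() >= plen:
        value ^= poly << (value.bit_length() - plen)
    return value


def fold_remainder(data: bytes, width_bits: int):
    """Return (remainder, fold_iterations) for a width_bits-wide accumulator."""
    step = width_bits // 8
    assert width_bits % 64 == 0 and len(data) >= step and len(data) % step == 0
    limbs = width_bits // 64
    # Constant for limb i: x^(width_bits + 64*i) mod P (at most 32 bits).
    consts = [polymod(1 << (width_bits + 64 * i)) for i in range(limbs)]

    acc = int.from_bytes(data[:step], "big")
    iterations = 0
    for off in range(step, len(data), step):
        block = int.from_bytes(data[off:off + step], "big")
        folded = 0
        for i in range(limbs):
            limb = (acc >> (64 * i)) & 0xFFFFFFFFFFFFFFFF
            folded ^= clmul(limb, consts[i])  # 64 x 32 bits, fits in 96 bits
        acc = folded ^ block  # congruent to (acc * x^width + block) mod P
        iterations += 1
    return polymod(acc), iterations


data = bytes(range(256))
reference = polymod(int.from_bytes(data, "big"))
rem128, iters128 = fold_remainder(data, 128)
rem256, iters256 = fold_remainder(data, 256)
print(rem128 == reference, rem256 == reference, iters128, iters256)
```

Both widths produce the same remainder as direct polynomial division, but the 256-bit fold needs 7 iterations for this 256-byte input where the 128-bit fold needs 15. A real kernel would also keep several accumulators in flight and handle reflection, alignment, and tail bytes, which this sketch omits.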
