Nathan Moinvaziri nmoinvaz

  • Phoenix, United States
nmoinvaz / zlib-ng-inflate-safe-mode-benchmark.md
Created March 10, 2026 19:08
zlib-ng: inflate_fast safe mode benchmark — small output buffer performance

zlib-ng: inflate_fast safe mode benchmark results

Summary

Adding a safe_mode parameter to inflate_fast() allows the fast path to run with as few as 3 bytes of avail_out (down from 260). This eliminates the performance cliff where PNG-style row-by-row decompression falls back to the slow inflate() state-machine path for the last 260 bytes of each row.

Related: zlib-ng/zlib-ng#2062
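
The access pattern described in the summary, where each call asks for only one row of output, can be sketched with CPython's standard `zlib` module. This is only an illustration of the caller side: the `safe_mode` parameter is internal to zlib-ng and is not exposed here, and the helper name `inflate_rows` and the row size are assumptions made for the example.

```python
import zlib

def inflate_rows(compressed: bytes, row_bytes: int):
    """Yield decompressed rows of row_bytes each (the last row may be shorter)."""
    d = zlib.decompressobj()
    pending = compressed
    row = bytearray()
    while True:
        # max_length caps the output of this call, much like a small avail_out.
        chunk = d.decompress(pending, row_bytes - len(row))
        pending = d.unconsumed_tail  # input that was not consumed yet
        row += chunk
        if len(row) == row_bytes:
            yield bytes(row)
            row.clear()
        # Stop at end of stream, or when no progress is possible (truncated input).
        if d.eof or (not chunk and not pending):
            break
    if row:
        yield bytes(row)
```

Every row boundary forces the decompressor to stop with a nearly full output buffer, which is the situation where the minimum `avail_out` needed by the fast path matters.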

nmoinvaz / zlib-ng-vpclmulqdq-avx2-benchmarks.md
Last active March 9, 2026 07:43
zlib-ng: VPCLMULQDQ AVX2 vs PCLMULQDQ CRC32 benchmark results on Intel i7-1185G7

zlib-ng: VPCLMULQDQ AVX2 vs PCLMULQDQ CRC32 Benchmark

Machine

  • CPU: 11th Gen Intel Core i7-1185G7 @ 3.00GHz (Tiger Lake)
  • Cores: 4 cores / 8 threads
  • L1d/L1i: 48 KiB / 32 KiB (x4)
  • L2: 1280 KiB (x4)
  • L3: 12288 KiB
  • OS: Windows 11 Pro
nmoinvaz / zlib-ng-pr-2176-opt.md
Last active March 7, 2026 00:02
zlib-ng: CRC32 ARMv8 PMULL+EOR3 copy optimization


Summary

Replace memcpy calls in the CRC32+copy interleaved path with direct NEON stores (vst1q_u64) of already-loaded vectors, and direct scalar stores of already-loaded uint64_t values. This eliminates redundant load/store sequences that the compiler generated for memcpy when the source data was already in registers.
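
As a rough, language-neutral picture of what a fused CRC32+copy pass does, the sketch below walks the source once, storing each block to the destination and folding the same block into the checksum. It uses Python's `zlib.crc32` rather than NEON intrinsics, so it shows only the data flow, not the register-level saving; the function name `crc32_copy` and the 64-byte block size are assumptions for illustration.

```python
import zlib

def crc32_copy(dst: bytearray, src: bytes, crc: int = 0, chunk: int = 64) -> int:
    """Copy src into dst while updating the CRC-32, one block at a time."""
    view = memoryview(src)
    for off in range(0, len(src), chunk):
        block = view[off:off + chunk]       # read each block once
        dst[off:off + len(block)] = block   # store it to the destination
        crc = zlib.crc32(block, crc)        # then fold it into the running CRC
    return crc
```

The point of the fused form is that each block is touched once for both purposes, instead of one pass for the copy and a second pass for the checksum.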

Additionally, reorder the vector loop so that stores happen before eor3 operations,

nmoinvaz / zlibng-vs-zlibrs-benchmark.md
Last active February 26, 2026 20:28
zlib-ng vs zlib-rs benchmark comparison on Apple M3 (ARM64)

zlib-ng vs zlib-rs Benchmark Comparison (ARM64, Apple M3)

Machine Specs

  • CPU: Apple M3 (8 cores)
  • RAM: 24 GB
  • OS: Darwin 24.6.0 arm64 (macOS Sequoia)
  • Compiler: Apple clang 17.0.0 (clang-1700.6.3.2)
  • Rust: rustc 1.93.1 (01f6ddf75 2026-02-11)
nmoinvaz / top_senders_100.py
Created February 25, 2026 22:42
IMAP top senders analyzer
import argparse
import imaplib
import email.utils
import sys
from collections import Counter
from rich.console import Console
from rich.table import Table
from rich.progress import Progress, BarColumn, TextColumn, TimeRemainingColumn, MofNCompleteColumn
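
The preview stops at the imports, so the script's actual logic is not shown. A minimal sketch of how sender tallying could work with the same standard-library pieces (`email.utils` and `Counter`) follows; the function name `top_senders`, its default of 100 (guessed from the file name), and the lower-casing of addresses are all assumptions, and the IMAP fetching and `rich` output are left out.

```python
import email.utils
from collections import Counter

def top_senders(from_headers, n=100):
    """Return the n most common sender addresses from raw From: header values."""
    counts = Counter()
    for header in from_headers:
        # parseaddr splits 'Name <addr>' into (name, addr); addr is '' on failure.
        _name, addr = email.utils.parseaddr(header)
        if addr:
            counts[addr.lower()] += 1  # treat addresses case-insensitively
    return counts.most_common(n)
```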
nmoinvaz / zlib-ng-crc32-arm-copy-benchmarks.md
Last active February 24, 2026 18:02
zlib-ng: CRC32 ARM interleaved copy benchmark results (Apple M3)

zlib-ng: CRC32 ARM Interleaved Copy Benchmark Results

Comparison

  • Baseline: develop @ 54352daf (Make extra length/distance bits computation branchless)
  • Contender: improvements/crc32-arm-copy @ b4043c6f (Implement crc32 interleaved copy for ARM PMULL+EOR3)
  • Repetitions: 5 per benchmark, aggregates only

Machine

nmoinvaz / zlib-ng-CLAUDE.md
Last active February 28, 2026 01:17
zlib-ng CLAUDE.md

Project Basics

  • Use CMake build system.
  • Always check which commits HEAD, BASE, and other branch names point to, as they can change often.
  • To build for an architecture other than the current one, use llvm-clang unless gcc is specified.

Key Directories

  • arch/ - Architecture specific optimizations
  • test/ - Unit tests written using Google Test Framework (gtest_zlib project)
nmoinvaz / crc32-arm-copy-benchmarks.md
Last active February 24, 2026 04:49
zlib-ng benchmark: crc32_armv8_pmull_eor3 — improvements/crc32-arm-copy vs develop

Benchmark: improvements/crc32-arm-copy vs develop

  • Date: 2026-02-23
  • Platform: Apple Silicon (ARM64), 8 cores, L1D 64 KiB, L2 4096 KiB
  • Build: CMake Release, static libs
  • Repetitions: 5 (median CPU time reported)

crc32/armv8_pmull_eor3 (CRC32 only)

| Size | develop (ns) | feature (ns) | Change |
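
One plausible way to derive a Change column from repeated runs is to take the median of each side and report the relative difference. The sketch below assumes that definition (negative means the feature branch is faster); the gist preview does not state the exact formula, and the function name `percent_change` is hypothetical.

```python
from statistics import median

def percent_change(baseline_ns, contender_ns):
    """Relative change of the contender's median time versus the baseline's, in percent."""
    base = median(baseline_ns)
    new = median(contender_ns)
    return (new - base) / base * 100.0
```

For example, with made-up timings (not measurements from this benchmark), `percent_change([100, 102, 98, 101, 99], [90, 91, 89, 90, 92])` compares medians of 100 ns and 90 ns.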

nmoinvaz / benchmark_compress_results.md
Created February 21, 2026 00:19
zlib-ng compress benchmark: improvements/tally-v2 vs develop

Compress Benchmark: HEAD (improvements/tally-v2) vs develop

Environment

  • Platform: macOS Darwin 24.6.0, Apple Silicon (ARM64)
  • CPU: 8 cores, L1D 64 KiB, L1I 128 KiB, L2 4096 KiB
  • Build: CMake Release, static libs

Commits

  • HEAD (improvements/tally-v2): c51ce99e — Combine extra_lbits/base_length and extra_dbits/base_dist lookup tables
  • develop: 1b880ba9 — Make extra length/distance bits computation branchless using bit masking
nmoinvaz / compress_block_bi_buf_register_optimization.md
Last active February 19, 2026 03:25
zlib-ng PR 2167 analysis

Assembly Analysis: Keep bi_buf/bi_valid in Registers Across compress_block

Change

Hoist s->bi_buf and s->bi_valid into local variables in compress_block() and pass them by pointer to the emit functions. This eliminates redundant load/store pairs between zng_emit_lit and zng_emit_dist calls within the main compression loop.
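
The shape of the change can be sketched as follows: read the bit-buffer state into locals once, thread it through every emit call, and write it back once at the end. This is a simplified analogy, not the zlib-ng code. Python has no pointers, so returning the updated pair stands in for passing by pointer, and `State`, `emit_bits`, and the `(value, length)` symbol format are hypothetical stand-ins for `deflate_state` and the `zng_emit_*` functions.

```python
class State:
    """Hypothetical stand-in for the parts of deflate_state used here."""
    def __init__(self):
        self.bi_buf = 0      # bits not yet flushed, LSB first
        self.bi_valid = 0    # number of valid bits in bi_buf
        self.pending = bytearray()

def emit_bits(bi_buf, bi_valid, out, value, length):
    """Append `length` bits of `value`; return the updated (bi_buf, bi_valid)."""
    bi_buf |= value << bi_valid
    bi_valid += length
    while bi_valid >= 8:             # flush whole bytes to the output
        out.append(bi_buf & 0xFF)
        bi_buf >>= 8
        bi_valid -= 8
    return bi_buf, bi_valid

def compress_block(s, symbols):
    bi_buf, bi_valid = s.bi_buf, s.bi_valid    # hoist once, before the loop
    for value, length in symbols:
        bi_buf, bi_valid = emit_bits(bi_buf, bi_valid, s.pending, value, length)
    s.bi_buf, s.bi_valid = bi_buf, bi_valid    # write back once, after the loop
```

Inside the loop the state object's fields are never read or written, which mirrors the goal of keeping both values in registers between emit calls.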

Results

bi_buf/bi_valid Memory Operations (offsets 168/176 from deflate_state*)