@moon-chilled
Last active August 19, 2025 07:04
parallel 512-bit full adder in avx512. pseudocode. probably works
// we want to compute r,c := x + y + i
// c carry-out
// r sum
// x y addends
// i carry-in
// carry in/out is a zmm whose most significant qword is -1 (no carry) or 0 (carry), i.e. inverted like c# below
t := vpaddq(x, y)
ct := vmovm2q(ctm := vpcmpeqq(t, -1)) // carry-through
c# := vmovm2q(c#m := vpcmpnltuq(t, vpmaxuq(x,y))) // approximate carry (inverted): -1 where x_i+y_i did not wrap, ignoring any carry into the lane
// now, to element t_i, we want to add the carry out of element i-1 (the inverse of c#_{i-1}),
// unless ct_{i-1}, in which case the carry passes through and we take c#_{i-2} instead, etc.
// (for t_0 that means c#_{-1}, which we can arrange to be the carry-in with perm2q)
// so for each element, we want to count the number of ct immediately preceding it
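// preceding mask: qword i gets one byte per lower lane, lane i-1 in the top byte, then i-2, etc.
// a byte must be 0x00 where that lane is ct, so the source is inverted ct (lzcnt counts zeros)
// index 64 is the 0xff byte of the second source; it pads the low bytes and stops the count
// rows are qwords 0..7, bytes listed least significant first
pm := vperm2b(vpxorq(ct, -1), -1 0 0 0 0 0 0 0,
              64 64 64 64 64 64 64 64
              64 64 64 64 64 64 64  0
              64 64 64 64 64 64  0  8
              64 64 64 64 64  0  8 16
              64 64 64 64  0  8 16 24
              64 64 64  0  8 16 24 32
              64 64  0  8 16 24 32 40
              64  0  8 16 24 32 40 48)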
// pm could probably also be built by moving c#m to a gpr, broadcast to zmm, shift each lane
// preceding count
pc := vpshrq(vplzcntq(pm), 3)
// the selected value is -1 for "no carry" and 0 for "carry", so the carry to add is 1 + selected
r := vpaddq(vpaddq(t, 1), vperm2q(c#, i, vpsubq(-1 0 1 2 3 4 5 6, pc)))
// carry-out needs to be masked to handle the case x[7]=y[7]=-1 with a carry-in
// we only care about the most significant qword but this is the easiest way to get it
// using c#m as dst is just convenient because it's dead here; we just need something with a 0 in the msb
// i think this is right w/e
c := vmovm2q(vpcmpnltuq{c#m}(c#m, r, vpmaxuq(x,y)))
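
A scalar model can help check the lane logic above before writing intrinsics. The sketch below is plain C, not AVX-512. `add512_model`, `add512_ripple`, and the 0/1 flags (`no_gen` for c#, `through` for ct, with 1 meaning no carry) are names made up here, and the loop that walks back over carry-through lanes stands in for the pm/pc/perm2q steps. It is an illustration under those assumptions, not the author's implementation.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8

/* Lane-wise model. Flags are inverted like c#: 1 = no carry, 0 = carry.
   Returns the inverted carry-out. */
int add512_model(const uint64_t x[LANES], const uint64_t y[LANES],
                 int no_carry_in, uint64_t r[LANES])
{
    uint64_t t[LANES];
    int through[LANES];  /* ct: t == -1, so an incoming carry passes through */
    int no_gen[LANES];   /* c#: the lane sum did not wrap on its own */

    for (int k = 0; k < LANES; k++) {
        uint64_t mx = x[k] > y[k] ? x[k] : y[k];
        t[k] = x[k] + y[k];
        through[k] = (t[k] == UINT64_MAX);
        no_gen[k] = (t[k] >= mx);
    }
    for (int k = 0; k < LANES; k++) {
        /* Skip the carry-through lanes just below k (this is pc). */
        int src = k - 1;
        while (src >= 0 && through[src])
            src--;
        /* Falling off the bottom selects the carry-in. */
        int no_carry = (src < 0) ? no_carry_in : no_gen[src];
        r[k] = t[k] + 1 - (uint64_t)no_carry;
    }
    /* Masked compare for the carry-out, as in the last step above. */
    uint64_t mx7 = x[LANES - 1] > y[LANES - 1] ? x[LANES - 1] : y[LANES - 1];
    return no_gen[LANES - 1] && r[LANES - 1] >= mx7;
}

/* Plain ripple-carry reference; carry flags are ordinary (1 = carry). */
int add512_ripple(const uint64_t x[LANES], const uint64_t y[LANES],
                  int carry_in, uint64_t r[LANES])
{
    int c = carry_in;
    for (int k = 0; k < LANES; k++) {
        uint64_t s = x[k] + y[k];
        int c1 = s < x[k];
        r[k] = s + (uint64_t)c;
        int c2 = r[k] < s;
        c = c1 | c2;
    }
    return c;
}

/* Example: (2^512 - 1) + 0 + carry-in wraps to 0 with a carry-out. */
void add512_example(void)
{
    uint64_t x[LANES], y[LANES], r[LANES];
    for (int k = 0; k < LANES; k++) {
        x[k] = UINT64_MAX;
        y[k] = 0;
    }
    int no_carry_out = add512_model(x, y, 0 /* 0 = carry in */, r);
    assert(no_carry_out == 0);
    for (int k = 0; k < LANES; k++)
        assert(r[k] == 0);
}
```

Note that with 0/1 flags the update is t + 1 - flag, while with the -1/0 values produced by a mask-to-vector move the same update is t + 1 + value.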
// is there a way to skip >>3 in pc?
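
One possible answer, following the gpr idea noted earlier: work on a bitmask instead of bytes, so each lane costs one bit and lzcnt gives the count directly with no >>3. The sketch below models that in scalar C. `stop_mask`, `lzcnt64`, and `preceding_counts` are hypothetical names. The assumed vector form would move the t != -1 compare mask to a gpr, compute (m << 1) | 1, broadcast it, shift lane k left by 63-k with a variable shift such as vpsllvq, then apply vplzcntq. It has not been tried on hardware, and whether it beats the byte permute is an open question.

```c
#include <assert.h>
#include <stdint.h>

/* Leading zero count with lzcnt(0) = 64, as vplzcntq defines it. */
int lzcnt64(uint64_t v)
{
    int n = 0;
    while (n < 64 && !(v >> 63)) {
        v <<= 1;
        n++;
    }
    return n;
}

/* stop_mask bit j is set when lane j does NOT carry through (t_j != -1).
   pc[k] receives the number of carry-through lanes immediately below lane k. */
void preceding_counts(uint8_t stop_mask, int pc[8])
{
    /* Bit 0 is a stopper standing in for the carry-in; lane j moves to bit j+1. */
    uint64_t b = ((uint64_t)stop_mask << 1) | 1;
    for (int k = 0; k < 8; k++) {
        /* Lane k-1 lands in bit 63, lanes >= k are shifted out,
           and the stopper lands in bit 63-k. */
        pc[k] = lzcnt64(b << (63 - k));
    }
}

/* Example: only lane 2 stops carries (mask 0b00000100). */
void preceding_counts_example(void)
{
    static const int want[8] = {0, 1, 2, 0, 1, 2, 3, 4};
    int pc[8];
    preceding_counts(0x04, pc);
    for (int k = 0; k < 8; k++)
        assert(pc[k] == want[k]);
}
```

The stopper bit plays the role of the 0xff padding bytes in pm: it guarantees the count for lane k never exceeds k.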