╭─── Claude Code v2.0.76 ──────────────────────────────────────────────────────╮
│ │ Tips for getting started │
│ Welcome back! │ Run /init to create a CLAUDE.md file wit… │
│ │ ───────────────────────────────────────── │
│ * ▐▛███▜▌ * │ Recent activity │
│ * ▝▜█████▛▘ * │ No recent activity │
│ * ▘▘ ▝▝ * │ │
│ │ │
│ Opus 4.5 · API Usage Billing │ │
│ ~/github/simdutf/build │ │
╰──────────────────────────────────────────────────────────────────────────────╯
> Take a look at the git diff
● Bash(cd /home/ecorry/github/simdutf && git diff)
  ⎿ diff --git a/include/simdutf/scalar/base64.h b/include/simdutf/scalar/base64.h
    index 8ab73d8f..12f70979 100644
    … +47 lines (ctrl+o to expand)
⎿  Shell cwd was reset to /home/ecorry/github/simdutf/build
● The diff shows changes to include/simdutf/scalar/base64.h in the binary_length_from_base64 function. Two simplifications were made:
1. Removed loop unrolling for counting non-whitespace characters:
- Before: Manually unrolled loop processing 4 elements at a time, then handling remainder
- After: Simple single-element loop
2. Simplified padding detection logic:
- Before: Separate steps to skip trailing whitespace, check for first =, skip more whitespace, check for second =
- After: Single unified loop that scans backwards, counting = characters (up to 2), skipping whitespace, and breaking on any other character
The new code is more compact and easier to read, though it trades away the manual loop unrolling optimization. The compiler may still auto-vectorize or unroll the simpler loop depending on optimization settings.
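For reference, the unified backward scan described in point 2 looks roughly like this (a sketch reconstructed from the description above, not the verbatim simdutf source):
#include <cstddef>

// Scan backwards from the end: count up to two '=' padding characters,
// skip whitespace (c <= ' '), and stop at the first real base64 character.
static size_t count_trailing_padding(const char *input, size_t length) {
  size_t padding = 0;
  size_t pos = length;
  while (pos > 0 && padding < 2) {
    char c = input[--pos];
    if (c == '=') {
      padding++;
    } else if (c > ' ') {
      break;
    }
  }
  return padding;
}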
> Now let's make an AVX2 version of this code in the same way as other AVX2 implementations of functions that also have a generic implementation. Start with a skeleton implementation without any contents right now.
● Let me explore how other AVX2 implementations are structured in this codebase.
● Explore(Explore AVX2 base64 structure)
⎿  Done (69 tool uses · 80.8k tokens · 2m 40s)
> Agner Fog recommends using the following instructions to load 0x04 into each byte of a ymm register:
> pcmpeqw xmm0,xmm0
> pabsb xmm0,xmm0
> psllw xmm0,2
> We can do the same, but with a different shift to load 0x20 into each byte.
● Good tip! For 0x20 (space character = 32), we'd use shift left by 5 instead of 2:
- pcmpeqw → all 1s (0xFF)
- pabsb → 0x01 in each byte
- psllw xmm0, 5 → 0x20 in each byte
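In AVX2 intrinsics, that sequence maps to (a sketch; the same three intrinsics appear in the implementation later in this session):
#include <immintrin.h>

// vpcmpeqw of zero with zero -> all-ones; vpabsb -> 0x01 in each byte;
// vpsllw by 5 -> 0x20 in each byte, all without a memory-sourced constant.
static __m256i make_space_bytes() {
  __m256i ones = _mm256_cmpeq_epi16(_mm256_setzero_si256(),
                                    _mm256_setzero_si256());
  __m256i one_bytes = _mm256_abs_epi8(ones);
  return _mm256_slli_epi16(one_bytes, 5);
}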
This avoids loading constants from memory. Let me now look at the existing AVX2 base64 file and the implementation.cpp to create the skeleton:
● Read(~/github/simdutf/src/haswell/avx2_base64.cpp)
⎿  Read 80 lines
● Read(~/github/simdutf/src/haswell/implementation.cpp)
⎿  Read 50 lines
● Search(pattern: "avx2_base64", path:
"~/github/simdutf/src/haswell/implementation.cpp", output_mode:
"content")
⎿  Found 1 line (ctrl+o to expand)
● Search(pattern: "binary_length_from_base64", path:
"~/github/simdutf/src/haswell/implementation.cpp", output_mode:
"content")
⎿  Found 0 lines
● Read(~/github/simdutf/src/haswell/avx2_base64.cpp)
⎿  Read 22 lines
> Then we can do a bytewise compare, and then there's an instruction that takes each byte and puts it in one bit of a GPR, after which we will use popcount to count them up.
● Let me find where binary_length_from_base64 is defined and used:
● Search(pattern: "binary_length_from_base64", path:
"~/github/simdutf/src/haswell", output_mode: "content")
⎿  Found 0 lines
● Search(pattern: "binary_length_from_base64", path:
"~/github/simdutf/include/simdutf")
⎿  Found 2 files (ctrl+o to expand)
● Search(pattern: "binary_length_from_base64", path:
"~/github/simdutf/include/simdutf/implementation.h", output_mode:
"content")
⎿  Found 186 lines (ctrl+o to expand)
● Search(pattern: "binary_length_from_base64", path: "~/github/simdutf/src",
output_mode: "content")
⎿  Found 91 lines (ctrl+o to expand)
● Search(pattern: "implementation::base64", path:
"~/github/simdutf/src/haswell/implementation.cpp", output_mode:
"content")
⎿  Found 47 lines (ctrl+o to expand)
● Read(~/github/simdutf/src/haswell/implementation.cpp)
⎿  Read 45 lines
● Search(pattern: "binary_length_from_base64", path:
"~/github/simdutf/src/simdutf/haswell", output_mode: "content")
⎿  Found 0 lines
● Read(~/github/simdutf/src/simdutf/haswell/implementation.h)
⎿  Read 76 lines
● Now I have the full picture. I need to add:
1. Declaration in src/simdutf/haswell/implementation.h
2. AVX2 function skeleton in src/haswell/avx2_base64.cpp
3. Implementation override in src/haswell/implementation.cpp
Let me create the skeleton:
● Update(~/github/simdutf/src/simdutf/haswell/implementation.h)
⎿  Added 4 lines
       306          base64_options options,
       307          last_chunk_handling_options last_chunk_options =
       308              last_chunk_handling_options::loose) const noexcept;
       309 +  simdutf_warn_unused size_t binary_length_from_base64(const char *input,
       310 +                                                       size_t length) const noexcept;
       311 +  simdutf_warn_unused size_t binary_length_from_base64(const char16_t *input,
       312 +                                                       size_t length) const noexcept;
       313    size_t binary_to_base64(const char *input, size_t length, char *output,
       314                            base64_options options) const noexcept;
       315    size_t binary_to_base64_with_lines(const char *input, size_t length,
● Update(~/github/simdutf/src/haswell/avx2_base64.cpp)
⎿  Added 10 lines
       768      return 63;
       769    }
       770  };
       771 +
       772 +// AVX2 implementation of binary_length_from_base64
       773 +// Counts non-whitespace characters and adjusts for padding
       774 +template <typename char_type>
       775 +simdutf_warn_unused size_t
       776 +avx2_binary_length_from_base64(const char_type *input, size_t length) {
       777 +  // TODO: Implement AVX2 version
       778 +  // For now, fall back to scalar implementation
       779 +  return scalar::base64::binary_length_from_base64(input, length);
       780 +}
● Update(~/github/simdutf/src/haswell/implementation.cpp)
⎿  Added 12 lines
       1406  }
       1407  }
       1408
       1409 +simdutf_warn_unused size_t
       1410 +implementation::binary_length_from_base64(const char *input,
       1411 +                                          size_t length) const noexcept {
       1412 +  return avx2_binary_length_from_base64(input, length);
       1413 +}
       1414 +
       1415 +simdutf_warn_unused size_t
       1416 +implementation::binary_length_from_base64(const char16_t *input,
       1417 +                                          size_t length) const noexcept {
       1418 +  return avx2_binary_length_from_base64(input, length);
       1419 +}
       1420 +
       1421  const char *implementation::find(const char *start, const char *end,
       1422                                   char character) const noexcept {
       1423    return util::find(start, end, character);
● The skeleton is in place. Let me verify it builds:
● Bash(cmake .. -DCMAKE_BUILD_TYPE=Release -DSIMDUTF_TESTS=ON 2>&1 | tail -20)
⎿  -- Python found, we are going to amalgamate.py.
-- The tests are enabled.
-- Iconv was found!
… +11 lines (ctrl+o to expand)
● Bash(cmake --build . --target simdutf 2>&1 | tail -30)
⎿  [ 0%] Building CXX object src/CMakeFiles/simdutf.dir/simdutf.cpp.o
[100%] Linking CXX static library libsimdutf.a
[100%] Built target simdutf
● The skeleton builds. Let me run the base64 tests to make sure it still works:
● Bash(cmake --build . --target base64_tests 2>&1 && ctest -R base64
--output-on-failure 2>&1 | tail -30)
⎿  [ 8%] Built target simdutf
[ 58%] Built target simdutf_tests_reference
[ 91%] Built target simdutf_tests_helpers
… +11 lines (ctrl+o to expand)
● The skeleton is complete and all tests pass. Here's what was added:
1. src/simdutf/haswell/implementation.h - Added declarations:
simdutf_warn_unused size_t binary_length_from_base64(const char *input,
size_t length) const noexcept;
simdutf_warn_unused size_t binary_length_from_base64(const char16_t *input,
size_t length) const noexcept;
2. src/haswell/avx2_base64.cpp - Added AVX2 function template:
template <typename char_type>
simdutf_warn_unused size_t
avx2_binary_length_from_base64(const char_type *input, size_t length) {
// TODO: Implement AVX2 version
// For now, fall back to scalar implementation
return scalar::base64::binary_length_from_base64(input, length);
}
3. src/haswell/implementation.cpp - Added method overrides that call the AVX2 function:
simdutf_warn_unused size_t
implementation::binary_length_from_base64(const char *input,
size_t length) const noexcept {
return avx2_binary_length_from_base64(input, length);
}
simdutf_warn_unused size_t
implementation::binary_length_from_base64(const char16_t *input,
size_t length) const noexcept {
return avx2_binary_length_from_base64(input, length);
}
The skeleton currently delegates to the scalar implementation. Ready for the AVX2 implementation using:
- pcmpeqw + pabsb + psllw to generate 0x20 constant
- pcmpgtb for bytewise compare (> space)
- pmovmskb to extract comparison bits to GPR
- popcnt to count
> Always use -j64 on cmake and ctest commands.
● Got it, I'll use -j64 for parallel builds going forward. For example:
- cmake --build . -j64
- ctest -j64
> Now implement it.
● Let me look at the scalar implementation again to understand the full algorithm:
● Read(~/github/simdutf/include/simdutf/scalar/base64.h)
⎿  Read 40 lines
● Now I'll implement the AVX2 version. The algorithm:
1. Use AVX2 to count non-whitespace characters (> 0x20) in 32-byte chunks
2. Use scalar for padding detection at the end (only last few bytes)
3. Calculate binary length
● Update(~/github/simdutf/src/haswell/avx2_base64.cpp)
⎿  Added 55 lines, removed 4 lines
       771
       772  // AVX2 implementation of binary_length_from_base64
       773  // Counts non-whitespace characters and adjusts for padding
       774 -template <typename char_type>
       774  simdutf_warn_unused size_t
       775 -avx2_binary_length_from_base64(const char_type *input, size_t length) {
       776 -  // TODO: Implement AVX2 version
       777 -  // For now, fall back to scalar implementation
       775 +avx2_binary_length_from_base64(const char *input, size_t length) {
       776 +  // Count non-whitespace characters (c > ' ') using AVX2
       777 +  size_t count = 0;
       778 +  size_t i = 0;
       779 +
       780 +  if (length >= 32) {
       781 +    // Generate space constant (0x20) in each byte using Agner Fog's trick:
       782 +    // pcmpeqw -> all 1s, pabsb -> 0x01, psllw by 5 -> 0x20
       783 +    __m256i ones = _mm256_cmpeq_epi16(_mm256_setzero_si256(),
       784 +                                      _mm256_setzero_si256());
       785 +    __m256i one_bytes = _mm256_abs_epi8(ones);
       786 +    __m256i spaces = _mm256_slli_epi16(one_bytes, 5);
       787 +
       788 +    for (; i + 32 <= length; i += 32) {
       789 +      __m256i data = _mm256_loadu_si256((const __m256i *)(input + i));
       790 +      // Compare: set byte to 0xFF if data > space, else 0x00
       791 +      __m256i gt_space = _mm256_cmpgt_epi8(data, spaces);
       792 +      // Extract high bit of each byte to a 32-bit mask
       793 +      uint32_t mask = (uint32_t)_mm256_movemask_epi8(gt_space);
       794 +      // Count bits set
       795 +      count += _mm_popcnt_u32(mask);
       796 +    }
       797 +  }
       798 +
       799 +  // Handle remaining bytes with scalar
       800 +  for (; i < length; i++) {
       801 +    count += (input[i] > ' ');
       802 +  }
       803 +
       804 +  // Check for padding '=' at the end (at most 2 padding characters)
       805 +  // Scan backwards, skipping whitespace, to find padding
       806 +  size_t padding = 0;
       807 +  size_t pos = length;
       808 +  while (pos > 0 && padding < 2) {
       809 +    char c = input[--pos];
       810 +    if (c == '=') {
       811 +      padding++;
       812 +    } else if (c > ' ') {
       813 +      break;
       814 +    }
       815 +  }
       816 +  size_t base64_count = count - padding;
       817 +
       818 +  // Calculate binary length from the number of base64 characters
       819 +  // Every 4 base64 characters encode 3 binary bytes
       820 +  // Remainder of 2 encodes 1 byte, remainder of 3 encodes 2 bytes
       821 +  if (base64_count % 4 <= 1) {
       822 +    return base64_count / 4 * 3;
       823 +  }
       824 +  return base64_count / 4 * 3 + (base64_count % 4) - 1;
       825 +}
       826 +
       827 +// char16_t version - use scalar for now as it's less common
       828 +simdutf_warn_unused size_t
       829 +avx2_binary_length_from_base64(const char16_t *input, size_t length) {
       830    return scalar::base64::binary_length_from_base64(input, length);
       831  }
● Now let me build and run the tests:
● Bash(cmake --build . -j64 --target base64_tests 2>&1 | tail -20)
⎿  [ 0%] Building CXX object src/CMakeFiles/simdutf.dir/simdutf.cpp.o
[ 8%] Linking CXX static library libsimdutf.a
[ 8%] Built target simdutf
… +4 lines (ctrl+o to expand)
● Bash(ctest -j64 -R base64 --output-on-failure 2>&1)
⎿  Test project /home/ecorry/github/simdutf/build
Start 80: base64_tests
Start 81: constexpr_base64_tests
… +6 lines (ctrl+o to expand)
● All tests pass. The AVX2 implementation is complete:
Key implementation details:
1. Constant generation (Agner Fog's trick):
__m256i ones = _mm256_cmpeq_epi16(_mm256_setzero_si256(), _mm256_setzero_si256()); // all 1s
__m256i one_bytes = _mm256_abs_epi8(ones); // 0x01 in each byte
__m256i spaces = _mm256_slli_epi16(one_bytes, 5); // 0x20 in each byte
2. Main loop - processes 32 bytes at a time:
__m256i data = _mm256_loadu_si256((const __m256i *)(input + i));
__m256i gt_space = _mm256_cmpgt_epi8(data, spaces); // compare > 0x20
uint32_t mask = (uint32_t)_mm256_movemask_epi8(gt_space); // extract to GPR
count += _mm_popcnt_u32(mask); // popcount
3. Tail handling - scalar for remaining < 32 bytes
4. Padding detection - scalar backwards scan (only touches last few bytes)
5. char16_t - falls back to scalar (less common path)
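As a quick sanity check of the length formula (a worked example; "YQ==" is the same input that shows up in a test later in this session):
#include <cassert>
#include <cstddef>

int main() {
  // "YQ==": 4 non-whitespace characters, 2 of which are '=' padding.
  size_t base64_count = 4 - 2; // = 2
  size_t binary = (base64_count % 4 <= 1)
                      ? base64_count / 4 * 3
                      : base64_count / 4 * 3 + (base64_count % 4) - 1;
  assert(binary == 1); // "YQ==" decodes to the single byte 'a'
  return 0;
}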
> Are we sure we are actually running the Haswell version? Let's break the padding detection and see if the tests still pass.
● Good idea. Let me break it:
● Update(~/github/simdutf/src/haswell/avx2_base64.cpp)
⎿  Added 1 line, removed 1 line
       803
       804    // Check for padding '=' at the end (at most 2 padding characters)
       805    // Scan backwards, skipping whitespace, to find padding
       806 -  size_t padding = 0;
       806 +  size_t padding = 999; // INTENTIONALLY BROKEN
       807    size_t pos = length;
       808    while (pos > 0 && padding < 2) {
       809      char c = input[--pos];
● Bash(cmake --build . -j64 --target base64_tests 2>&1 | tail -5 && ctest -j64
-R base64 --output-on-failure 2>&1 | tail -20)
⎿  [ 8%] Built target simdutf
[ 58%] Built target simdutf_tests_reference
[ 91%] Built target simdutf_tests_helpers
… +11 lines (ctrl+o to expand)
● Tests still pass - so either the Haswell version isn't being called, or binary_length_from_base64 isn't being tested. Let me check:
● Search(pattern: "binary_length_from_base64", path: "~/github/simdutf/tests",
output_mode: "content")
⎿  Found 579 lines (ctrl+o to expand)
● I see - tests use maximal_binary_length_from_base64, not binary_length_from_base64. Let me search for actual tests of the function we implemented:
● Search(pattern: "(?<!maximal_)binary_length_from_base64", path:
"~/github/simdutf/tests", output_mode: "content")
⎿  Found 0 lines
● No tests exist for binary_length_from_base64. Let me check what the difference is between the two functions:
● Search(pattern: "maximal_binary_length_from_base64", path:
"~/github/simdutf/include/simdutf/scalar/base64.h", output_mode:
"content")
⎿  Found 21 lines (ctrl+o to expand)
● Let me fix the broken code first, then create a quick test to verify our implementation is called:
● Update(~/github/simdutf/src/haswell/avx2_base64.cpp)
⎿  Added 1 line, removed 1 line
       803
       804    // Check for padding '=' at the end (at most 2 padding characters)
       805    // Scan backwards, skipping whitespace, to find padding
       806 -  size_t padding = 999; // INTENTIONALLY BROKEN
       806 +  size_t padding = 0;
       807    size_t pos = length;
       808    while (pos > 0 && padding < 2) {
       809      char c = input[--pos];
● Bash(cat > /tmp/test_binary_length.cpp << 'EOF'
#include "simdutf.h"…)
⎿  Interrupted · What should Claude do instead?
> Doesn't tests/base64_tests.cpp already test this?
● Search(pattern: "binary_length_from_base64", path:
"~/github/simdutf/tests/base64_tests.cpp", output_mode: "content")
⎿  Found 79 lines (ctrl+o to expand)
● There ARE tests for binary_length_from_base64 starting at line 3497. Let me check if those specific tests run:
● Bash(./tests/base64_tests 2>&1 | grep -i "binary_length_from_base64")
⎿  No matches found
● Bash(./tests/base64_tests --list 2>&1 | grep -i binary_length)
⎿  No matches found
● Bash(./tests/base64_tests --help 2>&1 | head -20)
⎿  Test utility for simdutf
Usage:
… +18 lines (ctrl+o to expand)
● Bash(./tests/base64_tests --show-tests 2>&1 | grep -i binary_length)
⎿  - binary_length_from_base64_basic
- binary_length_from_base64_with_spaces
- binary_length_from_base64_matches_decode_result
… +3 lines (ctrl+o to expand)
● Good, the tests exist. Let me break the code again and run just those tests:
● Update(~/github/simdutf/src/haswell/avx2_base64.cpp)
⎿  Added 1 line, removed 1 line
       803
       804    // Check for padding '=' at the end (at most 2 padding characters)
       805    // Scan backwards, skipping whitespace, to find padding
       806 -  size_t padding = 0;
       806 +  size_t padding = 999; // INTENTIONALLY BROKEN
       807    size_t pos = length;
       808    while (pos > 0 && padding < 2) {
       809      char c = input[--pos];
● Bash(cmake --build . -j64 --target base64_tests 2>&1 | tail -3 &&
./tests/base64_tests -a haswell -t binary_length 2>&1)
⎿  [ 91%] Built target simdutf_tests_helpers
[ 91%] Linking CXX executable base64_tests
[100%] Built target base64_tests
… +7 lines (ctrl+o to expand)
● Still passes with broken code! Let me check how these tests call the function:
● Read(~/github/simdutf/tests/base64_tests.cpp)
⎿  Read 50 lines
● The tests call the free function simdutf::binary_length_from_base64. Let me check how it dispatches:
● Search(pattern: "simdutf_warn_unused size_t binary_length_from_base64\(const
char \*input", path: "~/github/simdutf/src/implementation.cpp",
output_mode: "content")
⎿  Found 11 lines (ctrl+o to expand)
● The free function calls get_default_implementation()->binary_length_from_base64(). Let me check if the base class method is virtual:
● Search(pattern: "virtual.*binary_length_from_base64|binary_length_from_base64.
*override", path: "~/github/simdutf/include/simdutf/implementation.h",
output_mode: "content")
⎿  Found 0 lines
● Read(~/github/simdutf/include/simdutf/implementation.h)
⎿  Read 30 lines
● Search(pattern: "class implementation", path:
"~/github/simdutf/include/simdutf/implementation.h", output_mode:
"content")
⎿  Found 11 lines (ctrl+o to expand)
● Let me check how other base64 methods are set up - specifically ones that haswell does override:
● Search(pattern: "base64_to_binary.*const noexcept", path:
"~/github/simdutf/include/simdutf/implementation.h", output_mode:
"content")
⎿  Found 0 lines
● Read(~/github/simdutf/src/simdutf/haswell/implementation.h)
⎿  Read 50 lines
● The haswell methods use final. Let me check what I added - it's missing final:
● Read(~/github/simdutf/src/simdutf/haswell/implementation.h)
⎿  Read 25 lines
● Wait, other methods here also don't have final. Let me check if the base class declares these as virtual:
● Search(pattern: "virtual", path:
"~/github/simdutf/include/simdutf/implementation.h", output_mode:
"content", head_limit: 30)
⎿  Found 30 lines (ctrl+o to expand)
● Bash(grep -n "binary_length_from_base64"
/home/ecorry/github/simdutf/include/simdutf/implementation.h | head -10)
  ⎿ 4115:maximal_binary_length_from_base64(const char *input, size_t length) noexcept;
    4118:maximal_binary_length_from_base64(
    … +13 lines (ctrl+o to expand)
● Bash(sed -n '6380,6440p'
/home/ecorry/github/simdutf/include/simdutf/implementation.h)
⎿   * @param input the base64 input to process
* @param length the length of the base64 input in bytes
* @return maximal number of binary bytes
… +76 lines (ctrl+o to expand)
● The member function is not declared virtual! Let me check how other base64 methods that DO work are declared:
● Bash(grep -B2 "base64_to_binary.*const char \*input"
/home/ecorry/github/simdutf/include/simdutf/implementation.h | head -20)
⎿   */
simdutf_warn_unused result
base64_to_binary_safe(const char *input, size_t length, char *output,
… +4 lines (ctrl+o to expand)
● base64_to_binary is virtual, but binary_length_from_base64 is not! We need to add virtual to the base class:
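(Why this matters, as a minimal illustration of the C++ dispatch rules rather than simdutf code: without virtual in the base class, the derived declaration hides the base method instead of overriding it, so calls through a base pointer always reach the base version.)
#include <cstdio>

struct base {
  int f() const { return 0; }         // not virtual: hidden, not overridden
  virtual int g() const { return 0; } // virtual: can be overridden
};

struct derived : base {
  int f() const { return 1; }
  int g() const { return 1; }
};

int main() {
  derived d;
  base *p = &d;
  std::printf("%d %d\n", p->f(), p->g()); // prints "0 1"
  return 0;
}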
● Update(~/github/simdutf/include/simdutf/implementation.h)
⎿  File must be read first
● Read(~/github/simdutf/include/simdutf/implementation.h)
⎿  Read 30 lines
● Update(~/github/simdutf/include/simdutf/implementation.h)
⎿  Added 2 lines, removed 2 lines
       6413    * @param length the length of the base64 input in bytes
       6414    * @return number of binary bytes
       6415    */
       6416 -  simdutf_warn_unused size_t
       6416 +  simdutf_warn_unused virtual size_t
       6417    binary_length_from_base64(const char *input, size_t length) const noexcept;
       6418
       6419   /**
       ...
       6429    * @param length the length of the base64 input in 16-bit units
       6430    * @return number of binary bytes
       6431    */
       6432 -  simdutf_warn_unused size_t binary_length_from_base64(
       6432 +  simdutf_warn_unused virtual size_t binary_length_from_base64(
       6433      const char16_t *input, size_t length) const noexcept;
       6434
       6435   /**
● Now rebuild and test - with the broken code still in place, the tests should now fail:
● Bash(cmake --build . -j64 --target base64_tests 2>&1 | tail -5 &&
./tests/base64_tests -a haswell -t binary_length 2>&1)
⎿  Error: Exit code 1
[ 66%] Linking CXX static library libsimdutf_tests_helpers.a
[ 91%] Built target simdutf_tests_helpers
[100%] Building CXX object
tests/CMakeFiles/base64_tests.dir/base64_tests.cpp.o
[100%] Linking CXX executable base64_tests
[100%] Built target base64_tests
Checking implementation haswell
Running 'binary length from base64 basic'... lhs:
simdutf::binary_length_from_base64("YQ==", 4) = 13835058055282162965
rhs: 1 = 1
simdutf::binary_length_from_base64("YQ==", 4)
… +1 line (ctrl+o to see all)
● The test now fails with our broken code - the haswell implementation is being called. Now let me fix the broken code:
● Update(~/github/simdutf/src/haswell/avx2_base64.cpp)
⎿  Added 1 line, removed 1 line
       803
       804    // Check for padding '=' at the end (at most 2 padding characters)
       805    // Scan backwards, skipping whitespace, to find padding
       806 -  size_t padding = 999; // INTENTIONALLY BROKEN
       806 +  size_t padding = 0;
       807    size_t pos = length;
       808    while (pos > 0 && padding < 2) {
       809      char c = input[--pos];
● Bash(cmake --build . -j64 --target base64_tests 2>&1 | tail -3 &&
./tests/base64_tests -a haswell -t binary_length 2>&1)
⎿  [ 91%] Built target simdutf_tests_helpers
[ 91%] Linking CXX executable base64_tests
[100%] Built target base64_tests
… +7 lines (ctrl+o to expand)
● All tests pass. Let me run the full base64 test suite:
● Bash(ctest -j64 -R base64 --output-on-failure 2>&1)
⎿  Test project /home/ecorry/github/simdutf/build
Start 80: base64_tests
Start 81: constexpr_base64_tests
… +6 lines (ctrl+o to expand)
● All tests pass.
Summary of changes:
1. include/simdutf/implementation.h - Added virtual to the base class declarations so derived classes can override:
simdutf_warn_unused virtual size_t
binary_length_from_base64(const char *input, size_t length) const noexcept;
simdutf_warn_unused virtual size_t binary_length_from_base64(
const char16_t *input, size_t length) const noexcept;
2. src/simdutf/haswell/implementation.h - Added method declarations for haswell override
3. src/haswell/avx2_base64.cpp - AVX2 implementation:
- Generates 0x20 constant using Agner Fog's trick (pcmpeqw + pabsb + psllw)
- Uses pcmpgtb for bytewise comparison against space
- Uses pmovmskb to extract comparison bits to GPR
- Uses popcnt to count non-whitespace characters
- Processes 32 bytes per iteration
- Falls back to scalar for char16_t
4. src/haswell/implementation.cpp - Added method implementations that call the AVX2 function
> AVX2 has masked load. Can we use this so there's just a single loop without scalar for the last 1-31 bytes?
● Good idea! We can use an overlapping load for the tail - load the last 32 bytes and mask out the already-processed bytes. This avoids scalar entirely for inputs >= 32 bytes:
● Read(~/github/simdutf/src/haswell/avx2_base64.cpp)
⎿  Read 60 lines
● Update(~/github/simdutf/src/haswell/avx2_base64.cpp)
⎿  Added 17 lines, removed 5 lines
       775  avx2_binary_length_from_base64(const char *input, size_t length) {
       776    // Count non-whitespace characters (c > ' ') using AVX2
       777    size_t count = 0;
       778 -  size_t i = 0;
       778
       779    if (length >= 32) {
       780      // Generate space constant (0x20) in each byte using Agner Fog's trick:
       ...
       785      __m256i one_bytes = _mm256_abs_epi8(ones);
       786      __m256i spaces = _mm256_slli_epi16(one_bytes, 5);
       787
       788 +    size_t i = 0;
       789      for (; i + 32 <= length; i += 32) {
       790        __m256i data = _mm256_loadu_si256((const __m256i *)(input + i));
       791        // Compare: set byte to 0xFF if data > space, else 0x00
       ...
       794        // Count bits set
       795        count += _mm_popcnt_u32(mask);
       796      }
       797 -  }
       797
       798 -  // Handle remaining bytes with scalar
       799 -  for (; i < length; i++) {
       800 -    count += (input[i] > ' ');
       798 +    // Handle tail (1-31 remaining bytes) with overlapping load
       799 +    size_t remaining = length - i;
       800 +    if (remaining > 0) {
       801 +      // Load last 32 bytes (safe since length >= 32)
       802 +      __m256i data = _mm256_loadu_si256((const __m256i *)(input + length - 32));
       803 +      __m256i gt_space = _mm256_cmpgt_epi8(data, spaces);
       804 +      uint32_t mask = (uint32_t)_mm256_movemask_epi8(gt_space);
       805 +      // Keep only the last 'remaining' bits (the new bytes we haven't counted)
       806 +      mask >>= (32 - remaining);
       807 +      count += _mm_popcnt_u32(mask);
       808 +    }
       809 +  } else {
       810 +    // Input smaller than 32 bytes: use scalar
       811 +    for (size_t i = 0; i < length; i++) {
       812 +      count += (input[i] > ' ');
       813 +    }
       814    }
       815
       816    // Check for padding '=' at the end (at most 2 padding characters)
● Bash(cmake --build . -j64 --target base64_tests 2>&1 | tail -3 &&
./tests/base64_tests -a haswell -t binary_length 2>&1)
⎿  [ 91%] Built target simdutf_tests_helpers
[ 91%] Linking CXX executable base64_tests
[100%] Built target base64_tests
… +7 lines (ctrl+o to expand)
● Bash(ctest -j64 -R base64 --output-on-failure 2>&1)
⎿  Test project /home/ecorry/github/simdutf/build
Start 80: base64_tests
Start 81: constexpr_base64_tests
… +6 lines (ctrl+o to expand)
● All tests pass. The tail handling now uses an overlapping AVX2 load instead of scalar:
// Handle tail (1-31 remaining bytes) with overlapping load
size_t remaining = length - i;
if (remaining > 0) {
// Load last 32 bytes (safe since length >= 32)
__m256i data = _mm256_loadu_si256((const __m256i *)(input + length - 32));
__m256i gt_space = _mm256_cmpgt_epi8(data, spaces);
uint32_t mask = (uint32_t)_mm256_movemask_epi8(gt_space);
// Keep only the last 'remaining' bits (the new bytes we haven't counted)
mask >>= (32 - remaining);
count += _mm_popcnt_u32(mask);
}
This loads the last 32 bytes of input (which overlaps with already-processed data), then shifts out the already-counted bits. Scalar is now only used for inputs < 32 bytes.
> This wasn't what I meant. There's an instruction that takes a mask and doesn't fault if the mask bits are zero. Instead of doing an overlapping load you can load at the natural location, but mask out the bytes we don't care about.
● You're right! _mm256_maskload_epi32 won't fault when mask bits are 0. We can use it to safely load partial data, then mask the comparison results at byte level. Let me implement that:
● Update(~/github/simdutf/src/haswell/avx2_base64.cpp)
⎿  Added 42 lines, removed 33 lines
       776    // Count non-whitespace characters (c > ' ') using AVX2
       777    size_t count = 0;
       778
       779 -  if (length >= 32) {
       780 -    // Generate space constant (0x20) in each byte using Agner Fog's trick:
       781 -    // pcmpeqw -> all 1s, pabsb -> 0x01, psllw by 5 -> 0x20
       782 -    __m256i ones = _mm256_cmpeq_epi16(_mm256_setzero_si256(),
       783 -                                      _mm256_setzero_si256());
       784 -    __m256i one_bytes = _mm256_abs_epi8(ones);
       785 -    __m256i spaces = _mm256_slli_epi16(one_bytes, 5);
       779 +  // Generate space constant (0x20) in each byte using Agner Fog's trick:
       780 +  // pcmpeqw -> all 1s, pabsb -> 0x01, psllw by 5 -> 0x20
       781 +  __m256i ones = _mm256_cmpeq_epi16(_mm256_setzero_si256(),
       782 +                                    _mm256_setzero_si256());
       783 +  __m256i one_bytes = _mm256_abs_epi8(ones);
       784 +  __m256i spaces = _mm256_slli_epi16(one_bytes, 5);
       785
       786 -    size_t i = 0;
       787 -    for (; i + 32 <= length; i += 32) {
       788 -      __m256i data = _mm256_loadu_si256((const __m256i *)(input + i));
       789 -      // Compare: set byte to 0xFF if data > space, else 0x00
       790 -      __m256i gt_space = _mm256_cmpgt_epi8(data, spaces);
       791 -      // Extract high bit of each byte to a 32-bit mask
       792 -      uint32_t mask = (uint32_t)_mm256_movemask_epi8(gt_space);
       793 -      // Count bits set
       794 -      count += _mm_popcnt_u32(mask);
       795 -    }
       786 +  // Constants for masked load
       787 +  const __m256i byte_indices = _mm256_setr_epi8(
       788 +      0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
       789 +      16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31);
       790 +  const __m256i lane_indices = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
       791
       792 -    // Handle tail (1-31 remaining bytes) with overlapping load
       793 -    size_t remaining = length - i;
       794 -    if (remaining > 0) {
       795 -      // Load last 32 bytes (safe since length >= 32)
       796 -      __m256i data = _mm256_loadu_si256((const __m256i *)(input + length - 32));
       797 -      __m256i gt_space = _mm256_cmpgt_epi8(data, spaces);
       798 -      uint32_t mask = (uint32_t)_mm256_movemask_epi8(gt_space);
       799 -      // Keep only the last 'remaining' bits (the new bytes we haven't counted)
       800 -      mask >>= (32 - remaining);
       801 -      count += _mm_popcnt_u32(mask);
       802 -    }
       803 -  } else {
       804 -    // Input smaller than 32 bytes: use scalar
       805 -    for (size_t i = 0; i < length; i++) {
       806 -      count += (input[i] > ' ');
       807 -    }
       792 +  size_t i = 0;
       793 +
       794 +  // Main loop: process 32 bytes at a time
       795 +  for (; i + 32 <= length; i += 32) {
       796 +    __m256i data = _mm256_loadu_si256((const __m256i *)(input + i));
       797 +    __m256i gt_space = _mm256_cmpgt_epi8(data, spaces);
       798 +    uint32_t mask = (uint32_t)_mm256_movemask_epi8(gt_space);
       799 +    count += _mm_popcnt_u32(mask);
       800    }
       801
       802 +  // Handle tail (1-31 remaining bytes) with masked load
       803 +  size_t remaining = length - i;
       804 +  if (remaining > 0) {
       805 +    // Create load mask for 32-bit lanes (won't fault for masked-off lanes)
       806 +    int lanes_needed = (int)((remaining + 3) / 4);
       807 +    __m256i lanes_limit = _mm256_set1_epi32(lanes_needed);
       808 +    __m256i load_mask = _mm256_cmpgt_epi32(lanes_limit, lane_indices);
       809 +
       810 +    // Masked load - won't fault for masked-off lanes
       811 +    __m256i data = _mm256_maskload_epi32((const int *)(input + i), load_mask);
       812 +
       813 +    // Compare against space
       814 +    __m256i gt_space = _mm256_cmpgt_epi8(data, spaces);
       815 +
       816 +    // Zero out comparison results for invalid bytes within partially loaded lanes
       817 +    __m256i byte_limit = _mm256_set1_epi8((char)remaining);
       818 +    __m256i byte_mask = _mm256_cmpgt_epi8(byte_limit, byte_indices);
       819 +    gt_space = _mm256_and_si256(gt_space, byte_mask);
       820 +
       821 +    uint32_t mask = (uint32_t)_mm256_movemask_epi8(gt_space);
       822 +    count += _mm_popcnt_u32(mask);
       823 +  }
       824 +
       825    // Check for padding '=' at the end (at most 2 padding characters)
       826    // Scan backwards, skipping whitespace, to find padding
       827    size_t padding = 0;
● Bash(cmake --build . -j64 --target base64_tests 2>&1 | tail -3 &&
./tests/base64_tests -a haswell -t binary_length 2>&1)
⎿  [ 91%] Built target simdutf_tests_helpers
[ 91%] Linking CXX executable base64_tests
[100%] Built target base64_tests
… +7 lines (ctrl+o to expand)
● Bash(ctest -j64 -R base64 --output-on-failure 2>&1)
⎿  Test project /home/ecorry/github/simdutf/build
Start 80: base64_tests
Start 81: constexpr_base64_tests
… +6 lines (ctrl+o to expand)
● All tests pass. Now there's no scalar code at all - even inputs < 32 bytes use the masked load path:
// Create load mask for 32-bit lanes (won't fault for masked-off lanes)
int lanes_needed = (int)((remaining + 3) / 4);
__m256i lanes_limit = _mm256_set1_epi32(lanes_needed);
__m256i load_mask = _mm256_cmpgt_epi32(lanes_limit, lane_indices);
// Masked load - won't fault for masked-off lanes
__m256i data = _mm256_maskload_epi32((const int *)(input + i), load_mask);
// Compare against space
__m256i gt_space = _mm256_cmpgt_epi8(data, spaces);
// Zero out comparison results for invalid bytes within partially loaded lanes
__m256i byte_limit = _mm256_set1_epi8((char)remaining);
__m256i byte_mask = _mm256_cmpgt_epi8(byte_limit, byte_indices);
gt_space = _mm256_and_si256(gt_space, byte_mask);
The two-level masking:
1. Lane mask for maskload_epi32 - prevents faults on 32-bit granularity
2. Byte mask after comparison - handles the partial lane where only some bytes are valid
> Is there no maskload_epi8?
● No, AVX2 only has maskload_epi32 and maskload_epi64. Byte-granularity masked loads (_mm256_mask_loadu_epi8) require AVX-512BW + AVX-512VL.
That's why we need the two-level masking approach - the lane-level mask for the load to prevent faults, then a byte-level mask after the comparison to handle partial lanes.
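For comparison, on a target with AVX-512BW and AVX-512VL a byte-granular masked load would look roughly like this (a sketch only; it doesn't apply to the AVX2-only haswell kernel):
#include <immintrin.h>
#include <cstddef>

// Load only the first `remaining` bytes; masked-off bytes are zeroed
// and cannot fault, so no lane/byte two-level masking is needed.
static __m256i load_tail_avx512(const char *p, size_t remaining) {
  __mmask32 k = remaining >= 32 ? ~(__mmask32)0
                                : (((__mmask32)1 << remaining) - 1);
  return _mm256_maskz_loadu_epi8(k, p);
}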
> OK that's sad. Instead of this, let's try something different. Start by rounding down the address to the nearest 32, but make a GPR mask that has zero bits from 0-(start & 31). Then load 32-byte aligned values including the last load (which goes over the end of the buffer, but is aligned so it can't fault). After the first iteration, the GPR mask is all-ones (we can set that at the bottom of the loop). Then when we break out of the loop we can fix the end.
● Great idea! A 32-byte aligned load never crosses a page boundary, so as long as it contains at least one byte of the buffer it cannot fault, and we can safely read past the buffer end. Let me implement that:
● Update(~/github/simdutf/src/haswell/avx2_base64.cpp)
⎿  Added 23 lines, removed 30 lines
       773  // Counts non-whitespace characters and adjusts for padding
       774  simdutf_warn_unused size_t
       775  avx2_binary_length_from_base64(const char *input, size_t length) {
       776 -  // Count non-whitespace characters (c > ' ') using AVX2
       776 +  if (length == 0) {
       777 +    return 0;
       778 +  }
       779 +
       780    size_t count = 0;
       781
       782    // Generate space constant (0x20) in each byte using Agner Fog's trick:
       ...
       783    __m256i one_bytes = _mm256_abs_epi8(ones);
       784    __m256i spaces = _mm256_slli_epi16(one_bytes, 5);
       785
       786 -  // Constants for masked load
       787 -  const __m256i byte_indices = _mm256_setr_epi8(
       788 -      0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
       789 -      16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31);
       790 -  const __m256i lane_indices = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
       786 +  // Round down to 32-byte alignment (aligned loads can't fault)
       787 +  const char *aligned_ptr = (const char *)((uintptr_t)input & ~(uintptr_t)31);
       788 +  const char *end = input + length;
       789
       790 -  size_t i = 0;
       790 +  // Mask for first iteration: skip bytes before 'input'
       791 +  size_t prefix = (size_t)(input - aligned_ptr);
       792 +  uint32_t gpr_mask = ~0u << prefix;
       793
       794 -  // Main loop: process 32 bytes at a time
       795 -  for (; i + 32 <= length; i += 32) {
       796 -    __m256i data = _mm256_loadu_si256((const __m256i *)(input + i));
       794 +  // Process aligned chunks
       795 +  do {
       796 +    __m256i data = _mm256_load_si256((const __m256i *)aligned_ptr);
       797      __m256i gt_space = _mm256_cmpgt_epi8(data, spaces);
       798      uint32_t mask = (uint32_t)_mm256_movemask_epi8(gt_space);
       799 +    mask &= gpr_mask;
       800      count += _mm_popcnt_u32(mask);
       801 -  }
       801 +    aligned_ptr += 32;
       802 +    gpr_mask = ~0u;
       803 +  } while (aligned_ptr < end);
       804
       805 -  // Handle tail (1-31 remaining bytes) with masked load
       806 -  size_t remaining = length - i;
       807 -  if (remaining > 0) {
       808 -    // Create load mask for 32-bit lanes (won't fault for masked-off lanes)
       809 -    int lanes_needed = (int)((remaining + 3) / 4);
       810 -    __m256i lanes_limit = _mm256_set1_epi32(lanes_needed);
       811 -    __m256i load_mask = _mm256_cmpgt_epi32(lanes_limit, lane_indices);
       812 -
       813 -    // Masked load - won't fault for masked-off lanes
       814 -    __m256i data = _mm256_maskload_epi32((const int *)(input + i), load_mask);
       815 -
       816 -    // Compare against space
       805 +  // Fix the end: subtract any over-counted bytes past 'end'
       806 +  size_t overshoot = (size_t)(aligned_ptr - end);
       807 +  if (overshoot > 0) {
       808 +    __m256i data = _mm256_load_si256((const __m256i *)(aligned_ptr - 32));
       809      __m256i gt_space = _mm256_cmpgt_epi8(data, spaces);
       810 -
       811 -    // Zero out comparison results for invalid bytes within partially loaded lanes
       812 -    __m256i byte_limit = _mm256_set1_epi8((char)remaining);
       813 -    __m256i byte_mask = _mm256_cmpgt_epi8(byte_limit, byte_indices);
       814 -    gt_space = _mm256_and_si256(gt_space, byte_mask);
       815 -
       810      uint32_t mask = (uint32_t)_mm256_movemask_epi8(gt_space);
       811 -    count += _mm_popcnt_u32(mask);
       811 +    uint32_t over_mask = ~0u << (32 - overshoot);
       812 +    count -= _mm_popcnt_u32(mask & over_mask);
       813    }
       814
       815    // Check for padding '=' at the end (at most 2 padding characters)
● Bash(cmake --build . -j64 --target base64_tests 2>&1 | tail -3 &&
./tests/base64_tests -a haswell -t binary_length 2>&1)
⎿  [ 91%] Built target simdutf_tests_helpers
[ 91%] Linking CXX executable base64_tests
[100%] Built target base64_tests
… +7 lines (ctrl+o to expand)
● Bash(ctest -j64 -R base64 --output-on-failure 2>&1)
⎿  Test project /home/ecorry/github/simdutf/build
Start 80: base64_tests
Start 81: constexpr_base64_tests
… +6 lines (ctrl+o to expand)
● All tests pass. The new approach is cleaner:
// Round down to 32-byte alignment (aligned loads can't fault)
const char *aligned_ptr = (const char *)((uintptr_t)input & ~(uintptr_t)31);
const char *end = input + length;
// Mask for first iteration: skip bytes before 'input'
size_t prefix = (size_t)(input - aligned_ptr);
uint32_t gpr_mask = ~0u << prefix;
// Process aligned chunks
do {
__m256i data = _mm256_load_si256((const __m256i *)aligned_ptr);
__m256i gt_space = _mm256_cmpgt_epi8(data, spaces);
uint32_t mask = (uint32_t)_mm256_movemask_epi8(gt_space);
mask &= gpr_mask;
count += _mm_popcnt_u32(mask);
aligned_ptr += 32;
gpr_mask = ~0u;
} while (aligned_ptr < end);
// Fix the end: subtract any over-counted bytes past 'end'
size_t overshoot = (size_t)(aligned_ptr - end);
if (overshoot > 0) {
__m256i data = _mm256_load_si256((const __m256i *)(aligned_ptr - 32));
__m256i gt_space = _mm256_cmpgt_epi8(data, spaces);
uint32_t mask = (uint32_t)_mm256_movemask_epi8(gt_space);
uint32_t over_mask = ~0u << (32 - overshoot);
count -= _mm_popcnt_u32(mask & over_mask);
}
Key benefits:
- No vector mask setup needed
- Uses aligned loads (faster, and safe to read past buffer end)
- GPR mask is simple integer operations
- Main loop has no branches except the loop condition
> Nice, but let's lift the mask variable out of the main loop, then we still have it available and we can use it to subtract the overcounted bytes with another popcount.
● Good catch! We can reuse the mask from the last iteration instead of reloading:
● Update(~/github/simdutf/src/haswell/avx2_base64.cpp)
⎿  Added 2 lines, removed 4 lines
       795    uint32_t gpr_mask = ~0u << prefix;
       796
       797    // Process aligned chunks
       798 +  uint32_t mask;
       799    do {
       800      __m256i data = _mm256_load_si256((const __m256i *)aligned_ptr);
       801      __m256i gt_space = _mm256_cmpgt_epi8(data, spaces);
       802 -    uint32_t mask = (uint32_t)_mm256_movemask_epi8(gt_space);
       802 +    mask = (uint32_t)_mm256_movemask_epi8(gt_space);
       803      mask &= gpr_mask;
       804      count += _mm_popcnt_u32(mask);
       805      aligned_ptr += 32;
       ...
       808    // Fix the end: subtract any over-counted bytes past 'end'
       809    size_t overshoot = (size_t)(aligned_ptr - end);
       810    if (overshoot > 0) {
       811 -    __m256i data = _mm256_load_si256((const __m256i *)(aligned_ptr - 32));
       812 -    __m256i gt_space = _mm256_cmpgt_epi8(data, spaces);
       813 -    uint32_t mask = (uint32_t)_mm256_movemask_epi8(gt_space);
       811      uint32_t over_mask = ~0u << (32 - overshoot);
       812      count -= _mm_popcnt_u32(mask & over_mask);
       813    }
● Bash(cmake --build . -j64 --target base64_tests 2>&1 | tail -3 &&
./tests/base64_tests -a haswell -t binary_length 2>&1)
⎿  [ 91%] Built target simdutf_tests_helpers
[ 91%] Linking CXX executable base64_tests
[100%] Built target base64_tests
… +7 lines (ctrl+o to expand)
● Bash(ctest -j64 -R base64 --output-on-failure 2>&1)
⎿  Test project /home/ecorry/github/simdutf/build
Start 80: base64_tests
Start 81: constexpr_base64_tests
… +6 lines (ctrl+o to expand)
● All tests pass. Now the overshoot fix just reuses the mask from the last iteration - no reload needed:
uint32_t mask;
do {
__m256i data = _mm256_load_si256((const __m256i *)aligned_ptr);
__m256i gt_space = _mm256_cmpgt_epi8(data, spaces);
mask = (uint32_t)_mm256_movemask_epi8(gt_space);
mask &= gpr_mask;
count += _mm_popcnt_u32(mask);
aligned_ptr += 32;
gpr_mask = ~0u;
} while (aligned_ptr < end);
// Fix the end: subtract any over-counted bytes past 'end'
size_t overshoot = (size_t)(aligned_ptr - end);
if (overshoot > 0) {
uint32_t over_mask = ~0u << (32 - overshoot);
count -= _mm_popcnt_u32(mask & over_mask);
}
The mask variable retains its value from the last iteration, so we just AND with over_mask and subtract via another popcount.
> I think Agner's trick can be used to construct 0x0020 for the 16 bit version too. Let's make that one too.
● Good idea! For char16_t, we compare 16-bit values against 0x0020, and pmovmskb gives us 2 bits per character (both 0 or both 1). Let me implement it:
● Read(~/github/simdutf/src/haswell/avx2_base64.cpp)
⎿  Read 20 lines
● Update(~/github/simdutf/src/haswell/avx2_base64.cpp)
⎿  Added 65 lines, removed 2 lines
       836    return base64_count / 4 * 3 + (base64_count % 4) - 1;
       837  }
       838
       839 -// char16_t version - use scalar for now as it's less common
       839 +// char16_t version using AVX2
       840  simdutf_warn_unused size_t
       841  avx2_binary_length_from_base64(const char16_t *input, size_t length) {
       842 -  return scalar::base64::binary_length_from_base64(input, length);
       842 +  if (length == 0) {
       843 +    return 0;
       844 +  }
       845 +
       846 +  size_t count = 0;
       847 +
       848 +  // Generate space constant (0x0020) in each 16-bit word using Agner Fog's trick:
       849 +  // pcmpeqw -> all 1s, pabsw -> 0x0001, psllw by 5 -> 0x0020
       850 +  __m256i ones = _mm256_cmpeq_epi16(_mm256_setzero_si256(),
       851 +                                    _mm256_setzero_si256());
       852 +  __m256i one_words = _mm256_abs_epi16(ones);
       853 +  __m256i spaces = _mm256_slli_epi16(one_words, 5);
       854 +
       855 +  // Round down to 32-byte alignment (aligned loads can't fault)
       856 +  const char16_t *aligned_ptr =
       857 +      (const char16_t *)((uintptr_t)input & ~(uintptr_t)31);
       858 +  const char16_t *end = input + length;
       859 +
       860 +  // Mask for first iteration: skip bytes before 'input'
       861 +  // (prefix is in bytes, not char16_t units)
       862 +  size_t prefix_bytes = (size_t)((const char *)input - (const char *)aligned_ptr);
       863 +  uint32_t gpr_mask = ~0u << prefix_bytes;
       864 +
       865 +  // Process aligned chunks (16 char16_t = 32 bytes per iteration)
       866 +  uint32_t mask;
       867 +  do {
       868 +    __m256i data = _mm256_load_si256((const __m256i *)aligned_ptr);
       869 +    // Signed comparison works for ASCII (base64 characters are all < 128)
       870 +    __m256i gt_space = _mm256_cmpgt_epi16(data, spaces);
       871 +    // pmovmskb gives 2 bits per char16_t (both 0 or both 1)
       872 +    mask = (uint32_t)_mm256_movemask_epi8(gt_space);
       873 +    mask &= gpr_mask;
       874 +    count += _mm_popcnt_u32(mask);
       875 +    aligned_ptr += 16; // 16 char16_t = 32 bytes
       876 +    gpr_mask = ~0u;
       877 +  } while (aligned_ptr < end);
       878 +
       879 +  // Fix the end: subtract any over-counted bytes past 'end'
       880 +  size_t overshoot_bytes = (size_t)((const char *)aligned_ptr - (const char *)end);
       881 +  if (overshoot_bytes > 0) {
       882 +    uint32_t over_mask = ~0u << (32 - overshoot_bytes);
       883 +    count -= _mm_popcnt_u32(mask & over_mask);
       884 +  }
       885 +
       886 +  // Each char16_t contributed 2 bits to the mask, so divide by 2
       887 +  count /= 2;
       888 +
       889 +  // Check for padding '=' at the end (at most 2 padding characters)
       890 +  size_t padding = 0;
       891 +  size_t pos = length;
       892 +  while (pos > 0 && padding < 2) {
       893 +    char16_t c = input[--pos];
       894 +    if (c == '=') {
       895 +      padding++;
       896 +    } else if (c > ' ') {
       897 +      break;
       898 +    }
       899 +  }
       900 +  size_t base64_count = count - padding;
       901 +
       902 +  if (base64_count % 4 <= 1) {
       903 +    return base64_count / 4 * 3;
       904 +  }
       905 +  return base64_count / 4 * 3 + (base64_count % 4) - 1;
       906  }
● Bash(cmake --build . -j64 --target base64_tests 2>&1 | tail -3 &&
./tests/base64_tests -a haswell -t binary_length 2>&1)
⎿  [ 91%] Built target simdutf_tests_helpers
[ 91%] Linking CXX executable base64_tests
[100%] Built target base64_tests
… +7 lines (ctrl+o to expand)
● Bash(ctest -j64 -R base64 --output-on-failure 2>&1)
⎿  Test project /home/ecorry/github/simdutf/build
Start 80: base64_tests
Start 81: constexpr_base64_tests
… +6 lines (ctrl+o to expand)
● All tests pass. The char16_t version uses the same pattern:
// Generate space constant (0x0020) in each 16-bit word using Agner Fog's trick:
// pcmpeqw -> all 1s, pabsw -> 0x0001, psllw by 5 -> 0x0020
__m256i ones = _mm256_cmpeq_epi16(_mm256_setzero_si256(),
_mm256_setzero_si256());
__m256i one_words = _mm256_abs_epi16(ones);
__m256i spaces = _mm256_slli_epi16(one_words, 5);
Key differences from the char version:
- Uses _mm256_cmpgt_epi16 instead of _mm256_cmpgt_epi8
- Each char16_t produces 2 bits in pmovmskb (both 0 or both 1), so we divide count by 2 at the end
- Advances pointer by 16 char16_t units (32 bytes) per iteration
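The divide-by-2 is easy to see on a tiny input (a standalone illustration that uses _mm256_set1_epi16 for brevity instead of the constant-generation trick; compile with -mavx2):
#include <immintrin.h>
#include <cstdio>

int main() {
  // One 'A' (0x0041) followed by fifteen zero words.
  alignas(32) char16_t buf[16] = {u'A'};
  __m256i data = _mm256_load_si256((const __m256i *)buf);
  __m256i spaces = _mm256_set1_epi16(0x20);
  __m256i gt = _mm256_cmpgt_epi16(data, spaces);
  // The single 0xFFFF word sets two adjacent bits in the byte mask,
  // so popcount reports 2, i.e. twice the character count.
  std::printf("%d\n", _mm_popcnt_u32((unsigned)_mm256_movemask_epi8(gt)));
  return 0;
}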
> There's no pmovmskw?
● No, there's no pmovmskw in SSE/AVX2. The options are:
- pmovmskb (_mm256_movemask_epi8) - 1 bit per byte → 32 bits
- movmskps (_mm256_movemask_ps) - 1 bit per 32-bit → 8 bits
- movmskpd (_mm256_movemask_pd) - 1 bit per 64-bit → 4 bits
For 16-bit elements we could:
1. Current approach: Use pmovmskb (gives 2 bits per char16_t), divide count by 2
2. Pack first: Use _mm256_packs_epi16 to pack to 8-bit, then pmovmskb - but the lane interleaving makes this awkward
3. BMI2 pext: Extract odd bits with _pext_u32(mask, 0xAAAAAAAA) - but adds BMI2 dependency
The divide-by-2 approach is probably the simplest and doesn't add any instruction set dependencies beyond AVX2. The division by 2 is just a shift, so it's essentially free.
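Option 3 would look roughly like this (a sketch assuming BMI2 is available; not what this session's code uses):
#include <immintrin.h>

// Keep one bit per char16_t by extracting the odd bits of the 32-bit
// byte mask, then popcount directly (no final divide by 2 needed).
static unsigned count_words_gt_space(__m256i gt_space_16) {
  unsigned mask = (unsigned)_mm256_movemask_epi8(gt_space_16);
  unsigned word_bits = _pext_u32(mask, 0xAAAAAAAAu);
  return (unsigned)_mm_popcnt_u32(word_bits);
}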
> Nice. Please make a dump of our conversation in md format.
⎿  Interrupted · What should Claude do instead?
> Or perhaps there's a command I can run to make a transcript?
● Yes! In Claude Code you can use:
/export
This will export the conversation to a markdown file. You can also specify a path:
/export /path/to/output.md