@08jne01
Last active August 30, 2025 01:09
Highly optimised branchless binary search for use in lookup tables, with the improvement of a fixed loop length for the compiler
/*
MIT License
Copyright (c) 2024 Joshua Nelson
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/
#pragma once
#include <bit>
#include <cstddef>
#include <span>
#include <type_traits>
#include <vector>
#include <array>
#ifndef BINARY_SEARCH_CRITICAL_INLINE
# if defined(_MSC_VER)
# define BINARY_SEARCH_CRITICAL_INLINE __forceinline
# elif defined(__GNUC__) && __GNUC__>3
# define BINARY_SEARCH_CRITICAL_INLINE inline __attribute__ ((always_inline))
# else
# define BINARY_SEARCH_CRITICAL_INLINE inline
# endif
#endif
namespace Utility
{
// This just forces a fixed-extent span into the compile-time unrolled binary search.
template<typename TySpan>
concept FixedSpan =
    TySpan::extent != std::dynamic_extent &&
    std::is_const_v<typename TySpan::element_type>; // Just enforces a const span.
// This forces the span into the dynamic binary search when the input is dynamic.
// This is a quirk of span and templates: without the concepts it would be possible to call the fixed-N variant accidentally.
template<typename TySpan>
concept DynamicSpan =
    TySpan::extent == std::dynamic_extent &&
    std::is_const_v<typename TySpan::element_type>; // Just enforces a const span.
// This function forces a compile-time bit ceil; otherwise the MSVC compiler
// will not do it and will insert a branch to check for the lzcnt instruction,
// in the worst case counting the leading zeros one at a time using bit shifts.
consteval size_t BitCeil( size_t n )
{
    return std::bit_ceil( n );
}
// Compile-time log2.
consteval size_t Log2( size_t n )
{
    return std::bit_width( n ) - 1;
}
// Note: this is a BinarySearch utility for lookup tables, so it aims to return the index
// whose corresponding value is less than or equal to the supplied x.
// Both functions are guaranteed to return at most size - 2.
// Source: https://probablydance.com/2023/04/27/beautiful-branchless-binary-search/
// Slightly modified.
// The loop doesn't get unrolled if you use their implementation; however, if you calculate the number of steps
// by taking log2(N) (done using std::bit_width) and making the loop that length, the MSVC compiler
// will unroll it.
template<FixedSpan TySpan>
BINARY_SEARCH_CRITICAL_INLINE constexpr size_t BinarySearch( TySpan array, typename TySpan::value_type x )
{
    constexpr size_t N = TySpan::extent;
    // Just in case
    static_assert( N != std::dynamic_extent, "Cannot call compile time size binary search with dynamic span" );
    // Compile-time optimisation if the array is 0, 1 or 2 long.
    // In all these cases we are indexing from 0.
    if constexpr ( N <= 2 )
        return 0;
    // Early bail out
    if ( x <= array.front() )
        return 0;
    if ( x >= array.back() )
        return N - 2;
    // This is the resultant index.
    // It's the left side of the current bin being searched,
    // so we can just pretend this is a "subspan".
    size_t begin = 0;
    // This algorithm divides the space in two each time; however, in order to do this it
    // needs a clean power of 2. We find the power of 2 closest to the mid point.
    // We do this using std::bit_floor and std::bit_ceil, which return the power of
    // two below/equal or the power of two above.
    size_t step;
    constexpr size_t N_bit_floor = std::bit_floor( N );
    if constexpr ( N_bit_floor != N )
    {
        // Check if x is left or right of the closest power of 2.
        // This does one check at the start to get us to
        // a clean power of two: either we take the side greater than
        // the power of 2 or the side less than the power of 2.
        if ( array[N_bit_floor] <= x )
        {
            // If the value at the power of two is less than or equal to x, then x is in the right
            // section, so remove everything up to the power of 2.
            constexpr size_t new_length = N - N_bit_floor - 1;
            // This is zero when the array length is exactly 3, 5, 9, ...
            // i.e. 1 greater than a power of 2
            // (new_length = N - 2^floor(log2(N)) - 1).
            // If the remaining length is zero we return
            // the element before the last, similar to above.
            if ( new_length == 0 )
                return N - 2;
            // If not zero then we go from our new search position.
            // We must now offset the start position such that the step is a power of 2.
            // This means there will be overlap in the search regions, but that is the price.
            constexpr size_t ceil = BitCeil( new_length );
            step = ceil / 2; // save one division by doing this at compile time
            begin = N - 1 - ceil;
        }
        else
        {
            step = N_bit_floor / 2; // save one division by doing this at compile time
        }
    }
    else
    {
        step = N_bit_floor / 2; // save one division by doing this at compile time
    }
    // We calculate the number of steps required. This is really important
    // because otherwise the compiler will not unroll the loop.
    // This may result in one extra comparison, but compared to having a loop
    // with N comparisons it's a no-brainer.
    // The algorithm always completes in at most log2(N) steps, which is what the below calculates.
    constexpr size_t steps = Log2( N );
    for ( size_t i = 0; i < steps; i++ )
    {
        // This selects the search space by offsetting it.
        //      step          step
        // |<----------->|<----------->|
        // b-------------S-------------|
        // where b is the current begin and S = b + step.
        // If the value at S is greater than x, then x is in the left region:
        // no need to move b.
        // However, if the value at S is less than or equal to x, then x is in the right region,
        // so we reduce the space by shifting b to where S is:
        //      step          step
        // |<----------->|<----------->|
        // --------------b------S'-----|
        // When the step size halves next round,
        // S' = b + new step size.
        // This recursively narrows down on the target in
        // at most log2(N) steps.
        if ( x >= array[begin + step] )
            begin += step;
        // Each time we divide by two, which halves the distance we move the mid point.
        // On the first step it's *half* (kind of, see above) the array, then the next step it's half of the remaining,
        // and so on.
        step /= 2; // we now do this after to save one division at the start
    }
    // begin is where the pointer has landed, which is the final index.
    return begin;
}
// For a detailed explanation of the algorithm used, see above.
// The comments here just describe the differences when using
// a runtime-sized array.
template<DynamicSpan TySpan>
BINARY_SEARCH_CRITICAL_INLINE size_t BinarySearch( TySpan array, typename TySpan::value_type x )
{
    // No need to check if the array size is zero,
    // because that check was only there for the
    // compile-time optimisation.
    if ( x <= array.front() )
        return 0;
    if ( x >= array.back() )
        return array.size() - 2;
    // This is the resultant index.
    // It's the left side of the current bin being searched.
    size_t begin = 0;
    size_t step;
    const size_t N_bit_floor = std::bit_floor( array.size() );
    if ( N_bit_floor != array.size() )
    {
        // Check if x is on the left or right of the closest power of 2.
        if ( array[N_bit_floor] <= x )
        {
            const size_t new_length = array.size() - N_bit_floor - 1;
            if ( new_length == 0 )
                return array.size() - 2; // element before the last, as in the fixed version
            step = std::bit_ceil( new_length );
            begin = array.size() - 1 - step;
        }
        else
        {
            step = N_bit_floor;
        }
    }
    else
    {
        step = N_bit_floor;
    }
    // Unlike the fixed-extent version we cannot calculate a fixed number of steps,
    // because the step count is not known at compile time. This results in a loop,
    // so use the std::array overload for better performance.
    for ( step /= 2; step != 0; step /= 2 )
    {
        if ( x >= array[begin + step] )
            begin += step;
    }
    return begin;
}
// Overloads for std::array and std::vector.
template<typename T, size_t N>
BINARY_SEARCH_CRITICAL_INLINE constexpr size_t BinarySearch( const std::array<T, N>& array, T x )
{
    return BinarySearch( std::span<const T, N>( array ), x );
}
template<typename T>
BINARY_SEARCH_CRITICAL_INLINE constexpr size_t BinarySearch( const std::vector<T>& array, T x )
{
    return BinarySearch( std::span<const T>( array ), x );
}
} // end namespace Utility
08jne01 commented Oct 29, 2024

Disclaimer

This is designed for lookup tables where the entries (x) are sorted floating-point values. It may work with integers, but I have no tests written for that case as it is not our use case. It also automatically clamps to the edges of the input data. The key behaviour here is that it returns at most N-2: the index is intended to be used in a lookup table where linear interpolation will lerp between array[idx] and array[idx+1], so the last valid idx is N-2.

C++20

If you are not able to use C++20, then the first binary search overload (span) can simply be replaced with const std::array<T,N>& and the second with const std::vector<T>&. You will also need to roll your own bit_ceil, bit_floor and bit_width, which the author of the original article does for the first two. The last one, bit_width, is required for the compile-time optimisation of unrolling.

Example Usage

// Input are the x values of the function (sorted)
// Output are the y values of the function
// x is the value you are looking up
// Lerp is just a normal linear interpolate function, Lerp(x0,y0,x1,y1, x)
const size_t idx = Utility::BinarySearch( input, x );
return Lerp( input[idx], output[idx], input[idx+1], output[idx+1], x );

Performance

This (compile-time unrolled) binary search was tested (using a double array) on my machine and is about 2-3 times faster than both a naive binary search and a naive linear search for small arrays (<16 elements). For larger arrays (>100) it becomes much faster than linear search and keeps its gain over the naive binary search.

As stated in the comments, this is based on the article: https://probablydance.com/2023/04/27/beautiful-branchless-binary-search/

The major improvement here is that when the table size is known at compile time (often true in our use case) the loop can be unrolled by calculating log2(N).

Below are benchmarks done on my machine (i7 14700k) with randomly generated queries (queries are the same between types) and sorted vectors. Between each test the cache was flushed to prevent test-ordering issues from affecting the results. The length in each of the tests describes the length of the array searched. The array sizes are quite small so as to prevent caching effects taking hold; if you repeatedly query the same data (>100 sequential queries) this BinarySearch will approach the naive implementation. For our use case we are querying many different tables once, so it is important to measure the performance with a cold cache.

-------------------------------------------------------------------------------
Linear to Binary Comparison
  Very Small
-------------------------------------------------------------------------------

benchmark name                       samples       iterations    est run time
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
Naive Binary - Very Small, < 16                100          3488     1.0464 ms 
                                        3.76089 ns       3.75 ns    3.80017 ns 
                                      0.0927677 ns  0.0232543 ns    0.21351 ns

Binary - Very Small, < 16                      100          7610        761 us 
                                        1.73298 ns    1.71669 ns    1.80447 ns 
                                       0.149595 ns  0.0193462 ns   0.353642 ns

Linear - Very Small, < 16                      100          3259     1.3036 ms 
                                        3.69531 ns    3.68733 ns    3.71341 ns 
                                      0.0590821 ns  0.0275202 ns  0.0998418 ns


-------------------------------------------------------------------------------
Linear to Binary Comparison
  Small
-------------------------------------------------------------------------------

benchmark name                       samples       iterations    est run time
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
Naive Binary - Small, < 100                    100          1945      1.167 ms 
                                        6.70746 ns    6.68638 ns    6.76967 ns 
                                       0.166311 ns  0.0560859 ns   0.358645 ns

Binary - Small, < 100                          100          5526     1.1052 ms 
                                        2.43956 ns      2.424 ns    2.48444 ns 
                                       0.123672 ns  0.0493372 ns   0.254239 ns

Linear - Small, < 100                          100          1414     1.2726 ms 
                                        9.26167 ns    9.23338 ns    9.33522 ns 
                                       0.213499 ns  0.0974827 ns   0.434563 ns


-------------------------------------------------------------------------------
Linear to Binary Comparison
  Medium
-------------------------------------------------------------------------------

benchmark name                       samples       iterations    est run time
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
Naive Binary - Medium, < 1000                  100          1217      1.217 ms 
                                        10.7584 ns     10.719 ns    10.8472 ns 
                                       0.286858 ns   0.157251 ns   0.529083 ns

Binary - Medium, < 1000                        100          3850      1.155 ms 
                                        3.36571 ns    3.35325 ns     3.3961 ns 
                                      0.0939743 ns  0.0489967 ns   0.188397 ns

Linear - Medium, < 1000                        100            48     1.3296 ms 
                                        277.292 ns    276.562 ns    279.417 ns 
                                        5.80783 ns    2.18343 ns    12.6766 ns


-------------------------------------------------------------------------------
Linear to Binary Comparison
  Large
-------------------------------------------------------------------------------

benchmark name                       samples       iterations    est run time
                                     mean          low mean      high mean
                                     std dev       low std dev   high std dev
-------------------------------------------------------------------------------
Naive Binary - Large, < 100000                 100           622     1.3062 ms 
                                        20.9727 ns    20.9132 ns    21.0514 ns 
                                       0.346503 ns   0.259117 ns    0.50246 ns

Binary - Large, < 100000                       100          1944     1.1664 ms 
                                        6.72685 ns    6.69753 ns    6.82047 ns 
                                       0.237284 ns  0.0735057 ns   0.518066 ns

Linear - Large, < 100000                       100             1      2.043 ms 
                                         20.559 us     20.399 us     21.303 us 
                                        1.50787 us    126.787 ns    3.58224 us
