Skip to content

Instantly share code, notes, and snippets.

@RubenSomsen
Last active January 6, 2026 23:34
Show Gist options
  • Select an option

  • Save RubenSomsen/cdcfaee6bced7668130e7a044787a86e to your computer and use it in GitHub Desktop.

Select an option

Save RubenSomsen/cdcfaee6bced7668130e7a044787a86e to your computer and use it in GitHub Desktop.

Transaction Output Bitfield Compression

Open research/dev problem for SwiftSync

You can represent the unspent transaction output (UTXO) set as a set of bits over the ordered set of all transaction outputs (TXOs), signifying whether it is spent or unspent. The majority of the TXOs will be spent, resulting in many 0 bits and relatively few 1 bits (e.g. 00100001). The number of unspent outputs (1 bits) will increase as we get closer to the tip. These predictable patterns are well-suited for compression.

We currently use the TXO bitfield in SwiftSync inside the so-called "hints file" to speed up IBD and intend to implement this into Bitcoin Core, but the distribution of the hints file is made harder by its size. For the initial release we will likely forgo optimal encoding, but it would be nice to eventually ship it in an optimally compressed state.

Our current method is to essentially count the number of 0s between each 1. So for 00100001 that would be [2, 4]. This gets us a ~170MB file that goes down further to ~80MB using generic zip compression. This implies that our current encoding still has plenty of room for optimization.

A possible area for optimization is the number of bits used to represent the number of 0s between each 1 (something that fits most values, and handle the exceptions in a more expensive way). The number of bits can probably go down as we get closer to the end of the chain (fewer 0s between each 1) and as the distribution of 1s and 0s approaches 50% it may make sense to simply start encoding the bitfield as-is.

A basic functional implementation would have a compress and unpack function. The former takes the bitfield and turns it into an efficiently encoded version, while the latter reverses the operation. Code complexity does matter - try not to introduce complex code for minimal gain.

All of the above are just suggestions on how to tackle the problem. There may be better ways none of us have thought of, and it's plausible that more patterns exist which can be taken advantage of (maybe empty early blocks with a single unspent coinbase output). Hopefully some of you will find this an interesting problem to work on.

@rustaceanrob
Copy link

To elaborate on the two encoding styles, in the case of using only 0/1, we may use a technique called bit packing. This encoding allows for 8 outputs to be encoded using a single byte. In the example above 00100001 is the binary representation of a byte. To decode this value from the file, we walk along the binary representation of the u8, and if the bit is 0 we consider the output spent.

When using counting, also commonly know as run length encoding, notice that at least one byte is used for every value. If the space between spent and unspent outputs in a block are generally small (i.e. 2, 4 as in the example), then counting is wasteful. However, when the space between unspent outputs is very large, this technique is superior. The threshold is when around 1/8th of the outputs in a block remain unspent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment