EIP-7745: Trustless log and transaction index (alternative version)

Author: Zsolt Felföldi (zsfelfoldi@ethereum.org)

Notes for the pre-release version

This document is a preview of a simplified and improved alternative version of EIP-7745, intended to collect feedback.

After getting a lot of feedback that EIP-7745 is very complex, and after experiencing this complexity myself while working on a production-ready implementation, I briefly revisited an alternative approach I had considered earlier but dismissed because I could not work out some of its issues in a satisfactory way (and also because the filter maps design originally appeared a lot simpler). This time, having worked through all the difficulties of the old design, I realized that the alternative proposed here can be made not only feasible but actually a lot simpler and more performant.

The key search acceleration structures of the old design (the filter maps) are replaced here with a different and much simpler approach. I would say there is still more than 50% overlap, and the non-overlapping part is so much easier here that I am confident I can implement this version at a production-ready level sooner than I could finish the current EIP version. The linear log value indexing is the same, and the index_entries tree is also very similar (it is a single list now). Unlike the old design, the new search structure is not probabilistic and has no collision mining issues, and despite this, proofs can be even more efficient (depending on some parameters). Proof generation and verification are similar to the old design (in both cases we are dealing with linear index value positions listed in a hash tree) but here they are significantly simpler because there are no false positives to deal with.

The only aspect where the old EIP is better is the minimal log index state size; in the old design this is 20-25 MB in a compact encoding (though the in-memory representation in my implementation was hard to bring under 200 MB). In this design the minimal state is about 2 GB, but the indexer accesses it in a linear way so it is efficient when stored on disk. It is also still future compatible with statelessness because it can either work entirely with witnesses, or be initialized with witnesses first and then keep the indexer state locally in order to save network bandwidth. Note that the protocol for stateless witnesses is not specified yet.

Abstract

Replace the fixed 2048 bit log event bloom filters with a new lookup index data structure that can adapt to the changing number of events per block, consistently guarantees good search performance and allows efficient trustless proofs of log event queries, canonical block number/hash and transaction hash lookups.

The proposed structure maps all index entries (log events, transaction and block markers) onto a global linear index space and hashes them into a binary Merkle tree based on that index. It also has a content-to-position indexer based on a merge sorting mechanism creating ordered list sections of (value, position) pairs. Unlike the per-block bloom filters, these allow searching for specific events by accessing only a small portion of the entire dataset which can also be proven with a Merkle proof, making the search both efficient and light client friendly.

The proposed structure can be efficiently used both for local search and for remote proof generation/verification. It also allows validators that are not interested in either searching or proving logs to generate the index root hash by maintaining a minimal index state with hard capped size.

Motivation

Adding logs has a significantly lower gas cost than writing to the state and should accordingly be less resource consuming. The original design of per-block bloom filters achieves this goal: there is no complex data structure to update, and the set of logs emitted in each block is entirely contained within the header and receipts belonging to that block. Logs mostly just have long term storage costs. On the other hand, searching logs in a long range of blocks is very expensive and the current bloom filters do not really help.

Bloom filters are only useful as long as they are sufficiently sparse. The false positive ratio rises rapidly with the number of events per filter and the density of 1 bits in the filter bit vector. In the existing bloom filter, each log address and topic sets 3 out of a fixed length of 2048 bits. This resulted in sufficiently sparse filters in the beginning, but with the increase of the block gas limits the false positive ratio soon made the filter practically useless. Mainnet blocks currently add over 1000 log addresses and topics on average, so the bloom filter size would need to increase about tenfold in order to achieve acceptable false positive rates again. This would raise the block header size to about 3 kilobytes. Even if the size of the per-block bloom filters were raised to a sufficient level, log search would still require accessing the entire header chain. Searching just the most recent year of history would cost over 6 gigabytes of data, not counting the access of actual logs where the bloom filters have a match. The current situation is even worse, requiring a large portion of the full block receipt sets to be accessed due to the high false positive rate of the bloom filters. Transaction inclusion lookup is also very expensive, requiring all block bodies in the searched range.

While full nodes and RPC servers can and usually do build additional index structures to aid certain lookups, the responses based on these cannot be verified remotely. While it is important to improve the transaction processing capabilities of Ethereum, trustless provability is required throughout the entire stack, from end user applications signing transactions, to the same or other end user applications getting the results they need. Scaling transaction processing only makes sense as long as there is also a trustless and scalable way to access the relevant results of those transactions. Relying on trusted servers and indexers defeats the whole purpose of Ethereum.

In addition to making truly trustless user applications possible, cheap-to-add and efficiently provable logs may also find applications in improving scalability through cross-chain interactions and moving expensive computation off-chain with ZK-proofs.

Specification

Terms and definitions

  • index entry: a single entry in the index_entries tree associated with an indexed event. It is either a log entry, a transaction entry or a block entry. An index entry generates one or more index values. Each log entry adds one address value and 0..4 topic values (in this order). Each transaction entry adds one transaction value and each block entry adds one block value.
  • index value: a searchable value associated with an index entry. It is either an address value, a topic value, a transaction value or a block value. Each index value is represented by a 26 byte truncated hash (the index value hash) which is calculated as sha2(address)[:26], sha2(topic)[:26], sha2(tx_hash + b"\x01")[:26] or sha2(block_hash + b"\x02")[:26] (see the sketch after this list).
  • value position: a single position in the global linear index space. A new value position is assigned to each index value.
  • entry position: the position where an index entry is located in the index_entries tree. Each index entry is located at the position corresponding to the value position of the first index value it generates.
  • index reference: an index value and its value position encoded into a single 32 byte value. The first 6 bytes are the first 6 bytes of the little endian encoding of the value position, the rest is the index value. (If the index grows beyond 2**48 values then the higher part of the position can be reconstructed based on the list section index. Sections cannot cross 2**48 value boundaries.) An ordered list of index reference values is always sorted in ascending order, with each index reference value interpreted as a little endian 32 byte integer.
  • reference sorter: a mechanism implementing a merge sort algorithm on index references, processing power-of-two sized sorted list sections on multiple sort layers and merging them into longer, higher layer sections until MAX_SORT_LAYER is reached.
  • sort layer: a stage of the reference sorter where list sections contain section_length(sort_layer) = MIN_SECTION_LENGTH << sort_layer sorted index references.
  • reference list: the partially sorted list of a certain sort layer consisting of sorted list sections.
  • list section: a fully sorted section of a reference list. sections[i][j] refers to list section j of the reference list with sort layer i, containing index references to the value position range section_length(i) * j to section_length(i) * (j+1) - 1. For every i > 0, sections[i][j] can be calculated by merging sections[i-1][j*2] and sections[i-1][j*2+1], assuming that the latter sections are already fully populated. Note that most of the index references are stored only on the highest sort layer while on lower layers only a limited number of list sections are retained until they also get merged into sections on the highest layer.
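
As an informal illustration of the definitions above, the following Python sketch derives index value hashes, encodes index references and computes section lengths. The helper names are illustrative and not part of the specification.

import hashlib

MIN_SECTION_LENGTH = 2**10
MAX_SORT_LAYER = 14

def sha2(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

# index values: 26 byte truncated hashes of the indexed content
def address_value(address: bytes) -> bytes:
    return sha2(address)[:26]

def topic_value(topic: bytes) -> bytes:
    return sha2(topic)[:26]

def transaction_value(tx_hash: bytes) -> bytes:
    return sha2(tx_hash + b"\x01")[:26]

def block_value(block_hash: bytes) -> bytes:
    return sha2(block_hash + b"\x02")[:26]

def index_reference(value_position: int, index_value: bytes) -> bytes:
    # first 6 bytes: low 48 bits of the value position, little endian;
    # remaining 26 bytes: the index value hash
    return (value_position % 2**48).to_bytes(6, "little") + index_value

def reference_sort_key(ref: bytes) -> int:
    # references are ordered as little endian 32 byte integers, so references
    # with equal index values sort adjacently, ordered by value position
    return int.from_bytes(ref, "little")

def section_length(sort_layer: int) -> int:
    return MIN_SECTION_LENGTH << sort_layer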

Consensus data format

Block headers

Beginning at the execution timestamp FORK_TIMESTAMP, execution clients MUST replace the logs_bloom field of the header schema with log_index_root which is the root hash of the LogIndex structure after adding the logs emitted in the given block.

The resulting RLP encoding of the header is therefore:

rlp([
    parent_hash,
    0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347, # ommers hash
    coinbase,
    state_root,
    txs_root,
    receipts_root,
    log_index_root,
    0, # difficulty
    number,
    gas_limit,
    gas_used,
    timestamp,
    extradata,
    prev_randao,
    0x0000000000000000, # nonce
    base_fee_per_gas,
    withdrawals_root,
    blob_gas_used,
    excess_blob_gas,
    parent_beacon_block_root,
])

Container types

These definitions use the ProgressiveList and ProgressiveByteList SSZ types as defined in EIP-7916.

Note that IndexEntry and its sub-containers are initialized when added; until then they are represented as null leaves in the Merkle tree. Entries belonging to unused entry positions are left uninitialized. The Log container is only initialized in case of log entries. If it is initialized then topics and data lists in the Log container are always initialized. sort_layers and sorted_refs are always initialized. List sections referenced by root hash in sort_layers and sorted_refs are represented as Vector[uint256, section_length(layer)] (simple power-of-two sized binary trees). Unused list section references are also represented as a null value.

The index_entries and sorted_refs lists and reference list sections (at least on higher sort layers) might not be represented entirely in memory; some clients might also choose not to store certain parts at all. These hash tree structures are always incrementally updated and clients can store them partially, collapsing finalized subtrees on the left side and expanding previously untouched empty subtrees when needed on the right side.

class LogIndex(Container):
    index_entries: ProgressiveList[IndexEntry]
    sort_layers: Vector[Vector[Root, 4], MAX_SORT_LAYER]  # 4 reference list sections each at 0..MAX_SORT_LAYER-1
    sorted_refs: ProgressiveList[Root]                    # all reference list sections at MAX_SORT_LAYER

class IndexEntry(Container):
    log_entry: Log
    entry_meta: Vector[Bytes32, 4] # LogMeta / TransactionMeta / BlockMeta

class Log(Container):
    address: ExecutionAddress
    topics: List[Bytes32, 4]
    data: ProgressiveByteList

The entry_meta vector has a fixed format but its interpretation depends on the type of index entry, and the type of entry can also be determined based on its contents (and based on whether log_entry is initialized or not). Whenever IndexEntry is initialized, entry_meta contains one of these containers:

class LogMeta(Container):
    block_number: uint64
    transaction_hash: Root
    transaction_index: uint64
    log_in_tx_index: uint64

class TransactionMeta(Container):
    block_number: uint64
    transaction_hash: Root
    transaction_index: uint64
    receipt_hash: Root
 
class BlockMeta(Container):
    block_number: uint64
    block_hash: Root
    timestamp: uint64
    zero: uint64

Index entry types and corresponding index values

Three types of index entries are added during state transition: log, transaction and block. For each processed transaction, a transaction entry is added first, then log entries are added in the order of EVM execution. The block entry referring to the processed block is not added during the state transition because it is supposed to reference the block hash which is not known yet. It can be added either after block building/validation or at the beginning of the next state transition. In either case, the block entry of block N first appears in the log index state of block N+1 as the first new index entry.

The following table shows an example of mapping all index entry types onto the value position space:

Index entry   Entry position   Index values      Value positions
Block #0      0                Block             0
Tx #1/0       1                Transaction       1
Log #1/0/0    2                Addr, 3x Topic    2, 3, 4, 5
Log #1/0/1    6                Addr, 3x Topic    6, 7, 8, 9
Tx #1/1       10               Transaction       10
Log #1/1/0    11               Addr, 2x Topic    11, 12, 13
Log #1/1/1    14               Addr, Topic       14, 15
Log #1/1/2    16               Addr, 2x Topic    16, 17, 18
Block #1      19               Block             19
Tx #2/0       20               Transaction       20
Log #2/0/0    21               Addr, 4x Topic    21, 22, 23, 24, 25

Note that the missing entry positions are represented in the index_entries tree with uninitialized IndexEntry containers (zero value leaves in the index_entries list). These empty entries correspond to the topic value positions.
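
A minimal sketch of this position assignment, reproducing the example table above (the assign_positions helper and the entry list are illustrative only):

def assign_positions(entries):
    # entries: (name, number_of_index_values) pairs in indexing order;
    # returns (name, entry_position, value_positions) tuples
    result, pos = [], 0
    for name, num_values in entries:
        result.append((name, pos, list(range(pos, pos + num_values))))
        pos += num_values
    return result

# the example from the table: each log contributes an address value plus its topics
example = [
    ("Block #0", 1), ("Tx #1/0", 1), ("Log #1/0/0", 4), ("Log #1/0/1", 4),
    ("Tx #1/1", 1), ("Log #1/1/0", 3), ("Log #1/1/1", 2), ("Log #1/1/2", 3),
    ("Block #1", 1), ("Tx #2/0", 1), ("Log #2/0/0", 5),
]
for name, entry_position, value_positions in assign_positions(example):
    print(name, entry_position, value_positions)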

Proposed constants

Name                 Value
MIN_SECTION_LENGTH   2**10
MAX_SORT_LAYER       14

Updating the log index

Whenever a new index entry is added to the index, a new IndexEntry container is added to the index_entries list. If the index entry generates multiple index values then the entry position is assigned to the first one, subsequent positions are assigned to the additional values, and a corresponding number of empty containers (null leaves) are added to the index_entries list. For each index value generated, a new index reference is added to the reference sorter, as shown in the figure below:

[figure: new index references being added to the reference sorter]

New index references are first added to the reference list on layer 0, at ref_list_position[0] which equals the value position of the reference. The last section of this list should also be sorted when it is complete or before calculating its hash tree root (there is no need to re-sort after each item when adding multiple items generated in a single block).

On higher sort layers the next sorted index reference (taken from the layer below) is added at the appropriate ref_list_position of the reference list if the calculated position is non-negative. For any 0 < layer <= MAX_SORT_LAYER, ref_list_position[layer] is defined as ref_list_position[layer-1] - section_length(layer), which means that all reference lists are updated at positions with a constant distance from ref_list_position[0]. The head_section index that ref_list_position points to on each layer is defined as head_section[layer] = ref_list_position[layer] // section_length(layer). The next entry that should be added to sections[layer][head_section[layer]] is the lowest index reference value found in either sections[layer-1][head_section[layer] * 2] or sections[layer-1][head_section[layer] * 2 + 1] that is higher than any index reference value already added to sections[layer][head_section[layer]].
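
The pointer arithmetic and the merge step described above can be sketched as follows (using the section_length and reference_sort_key helpers from the earlier sketch; next_merged_reference is written for clarity, not efficiency):

def ref_list_position(layer: int, value_position: int) -> int:
    # layer 0 is written at the value position itself; each higher layer
    # trails behind it by that layer's section length
    pos = value_position
    for l in range(1, layer + 1):
        pos -= section_length(l)
    return pos

def head_section(layer: int, value_position: int) -> int:
    pos = ref_list_position(layer, value_position)
    return pos // section_length(layer) if pos >= 0 else -1  # -1: layer not written yet

def next_merged_reference(left_section, right_section, already_merged):
    # the next reference to append to the higher layer section: the lowest
    # reference from the two lower layer sections that is higher than
    # anything merged so far (references compared as little endian integers)
    last = reference_sort_key(already_merged[-1]) if already_merged else -1
    candidates = [r for r in left_section + right_section
                  if reference_sort_key(r) > last]
    return min(candidates, key=reference_sort_key) if candidates else None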

Note that while fixed gaps between ref_list_position pointers guarantee that log index state transitions can be processed block by block with an effort proportional to the amount of data added in that block, the client can also asynchronously calculate reference lists beyond their respective ref_list_position as long as the necessary data is already available on the layer below, detaching most of the log index processing from the block processing. It is just the section root hashes stored in sort_layers that have to be calculated according to ref_list_position. Also note that this design is ZK-prover friendly, as the size of the witnesses needed for generating sort_layers and sorted_refs hashes is also proportional to the amount of data added to the index and the gas spent on adding it.

Updating sort_layers and sorted_refs and discarding sections

On every sort layer where layer < MAX_SORT_LAYER and head_section[layer] >= 2, while sections[layer][head_section[layer]] is the currently updated section, sections[layer][(head_section[layer] // 2 - 1) * 2] and sections[layer][(head_section[layer] // 2 - 1) * 2 + 1] are being merged into sections[layer+1][head_section[layer+1]]. Therefore the sections required to perform the merge sort on layer are (head_section[layer] // 2 - 1) * 2 to head_section[layer], which means either 3 or 4 sections, with the first section always being an even numbered one and the last being the currently written one.

sort_layers stores at most 4 list section root hashes for each layer below MAX_SORT_LAYER, starting at (head_section[layer] // 2 - 1) * 2. This means that sort_layers[layer][0] and sort_layers[layer][1] are always the roots of sections currently being merged, and either sort_layers[layer][2] is the currently written section's root and sort_layers[layer][3] is empty (if head_section[layer] is even) or sort_layers[layer][2] is a full section and sort_layers[layer][3] is the currently written section's root (if head_section[layer] is odd). If head_section[layer] < 2 then there are no merged sections yet and the first two roots are empty too (empty section roots are represented with a null leaf value).

The roots of sections[MAX_SORT_LAYER][i] are all stored in sorted_refs[i], with sorted_refs[head_section[MAX_SORT_LAYER]] being the currently written section and the earlier ones all fully populated.
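
The window of section roots kept in sort_layers for a layer below MAX_SORT_LAYER can be sketched as below (illustrative helper; None marks an empty root slot):

def sort_layers_window(head: int):
    # section indices whose roots are stored in sort_layers[layer],
    # given head = head_section[layer]
    first = (head // 2 - 1) * 2
    return [sec if 0 <= sec <= head else None for sec in range(first, first + 4)]

# slots 0 and 1 hold the two sections currently being merged upward (if any);
# the currently written section is in slot 2 (head even) or slot 3 (head odd)
assert sort_layers_window(6) == [4, 5, 6, None]
assert sort_layers_window(7) == [4, 5, 6, 7]
assert sort_layers_window(1) == [None, None, 0, 1]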

Rolling back to a previous state

It is also possible to efficiently remove entries from the reference sorter and roll it back to a previous state (identified by the head pointer to return to) if needed because of a reorg. Reference lists all grow at the same rate, so rolling them back requires removing the same number of entries from the end of each one, except for layer 0 where entries were added in value position order and later sorted in index reference order. Entire layer 0 sections can still be removed without regard for ordering. Removing only a part of the entries from a layer 0 section requires sorting them by value position first, then removing the required number of entries from the end of the list, then re-sorting them in index reference order.

If some list sections below MAX_SORT_LAYER have been discarded since the revert point then it is also easy to un-merge two sections on a given sort layer from the section on the layer above, since the value position part of the reference tells which lower layer section each entry originated from. Assuming that there is a recent MAX_SORT_LAYER section available to un-merge if needed, it is theoretically always possible to roll back the index. It is important to note though that in some cases (when a MAX_SORT_LAYER section has just been finished, sections on all lower layers discarded, and we are trying to roll back before that point) the un-merging can take a long time.

In order to also ensure that the rollback can be done quickly, an easy thing the client can do is to keep around the last two discarded sections on all sort layers for a while (ideally until the block where they were discarded gets finalized). If a reference list gets rolled back more than two sections then un-merging will still be needed, but that is only going to happen on the lower sort layers where un-merging is cheap and the overall effort is still proportional to the amount of rolled back data. A more sophisticated way that minimizes excess data storage is to only keep the last part of longer sections and a Merkle path leading to that last part in order to be able to recalculate root hashes.
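
If a discarded pair of lower layer sections has to be reconstructed, un-merging can be sketched as follows (assuming the section_length helper from the earlier sketch; since sections never cross 2**48 value boundaries, comparing the low 48 bits of the position is sufficient):

def unmerge_section(merged_refs, layer, section_index):
    # split a merged list section back into the two lower layer sections it
    # was merged from, based on the value position prefix of each reference
    boundary = (section_length(layer) * section_index + section_length(layer - 1)) % 2**48
    left, right = [], []
    for ref in merged_refs:
        pos = int.from_bytes(ref[:6], "little")  # low 48 bits of the value position
        (left if pos < boundary else right).append(ref)
    return left, right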

Note that index_entries also needs to be rolled back by removing entries. In order to ensure that this is possible, either the entire hash tree has to be retained between the last finalized checkpoint and the current head pointer, or a Merkle path has to be saved as a checkpoint after processing each block and retained until the finalized block moves beyond that point.

Log index proof format

A log index proof is a Merkle multi-proof proving a subset of the binary tree representation of the log index, including the index_entries tree and list sections referenced by root hash in sorted_refs and sort_layers. The proposed proof format can be used both for initializing the log index state and for proving the results of search queries. It can be serialized either as RLP or SSZ. The proven data is encoded in a format that is more efficient than storing raw 32 byte tree nodes but the verifier can reconstruct the set of proven tree nodes and their positions. The proof_nodes list contains additional tree nodes (mostly internal hash nodes) required for recalculating the log index root, in depth-first left-to-right traversal order.

class LogIndexProof(Container):
    index_entries: ProvenIndexEntries
    head_entry_pointer: uint64
    index_references: List[ProvenSectionReferences]
    section_roots: List[ProvenSectionRoots]
    proof_nodes: List[Bytes32]

class ProvenSectionReferences(Container):
    sort_layer: uint8
    section_index: uint64
    references: List[ProvenReferences]

class ProvenReferences(Container):
    first_item: uint48
    index_value: Bytes26
    positions: List[uint48]
    
class ProvenSectionRoots(Container):
    sort_layer: uint8
    first_section_index: uint64
    roots: List[Root]

class ProvenIndexEntries(Container):
    empty_entries: List[uint64]
    log_entries: List[ProvenLogEntry]
    tx_entries: List[ProvenTxEntry]
    block_entries: List[ProvenBlockEntry]

class ProvenLogEntry(Container):
    entry_position: uint64
    log: Log
    meta: LogMeta

class ProvenTxEntry(Container):
    entry_position: uint64
    meta: TransactionMeta

class ProvenBlockEntry(Container):
    entry_position: uint64
    meta: BlockMeta

Note that the section root and the index references of the same section should not both be listed in the proof because the root would then be redundant (proving any references from a section also proves its root). Proving roots alone is useful for log index state initialization. Also note that on sort layers below MAX_SORT_LAYER only the sections referenced in sort_layers can be proven.

Initializing the log index state

The minimal state of the log indexer (the data required for updating the index and calculating the log index root) has a strictly upper bounded size. A minimal state client does not need to store full list sections on MAX_SORT_LAYER as they are not required for adding new index entries (the last part of the last full section might be needed for rollback as stated above, but for the purpose of initialization this is irrelevant if the initialization is performed at a finalized block). The minimal information needed in order to add new MAX_SORT_LAYER sections is the Merkle path of the sorted_refs tree leading to the next section root to be added.

On each of the lower layers there are 2 or 3 fully rendered sections and one currently rendered section referenced in sort_layers. The upper limit for the minimal log index state can be estimated as 4 times the section size of all layers below MAX_SORT_LAYER, which can again be estimated as 4 times the MAX_SORT_LAYER section size (since the section size doubles with each layer). With the proposed constants, a MAX_SORT_LAYER section has 2**24 index references, 32 bytes each, so the minimal state size is bounded at roughly 2 GB.
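
With the constants from the first sketch, this bound works out roughly as follows:

# 4 sections retained per layer below MAX_SORT_LAYER; section sizes double
# with each layer, so their sum is close to one MAX_SORT_LAYER section
refs_below_top = 4 * sum(MIN_SECTION_LENGTH << layer for layer in range(MAX_SORT_LAYER))
minimal_state_bound = refs_below_top * 32  # 32 bytes per index reference
print(minimal_state_bound / 2**30)         # ~2.0 GiB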

These sections can be easily initialized with the index values after the end of the last fully rendered MAX_SORT_LAYER section. These can be derived from blocks and receipts which are available through the existing Ethereum Wire Protocol, and every client should already have a mechanism to download them. The extension of the protocol should provide the following additional information in order to make this possible:

  • a Merkle path of the log index hash tree leading to the next entry position to be added to the index_entries tree. This path also contains the list count node which should match the position the Merkle path leads to. Note that this is 2 to 3 sections ahead of the last fully rendered MAX_SORT_LAYER section (which can also be determined based on this position).
  • a proof of the last block entry before the last full top layer section's end. Rendering should start with the block after this one, by iterating through the index values generated by the blocks, transactions and logs. The corresponding value positions also start after the entry position of this proven block entry. The derived index references before the section boundary can be discarded. If the section boundary is not reached after iterating the first block then the iteration should be cancelled and the initialization proof should be considered invalid (it proves an earlier block entry, which might also be an attack vector if not handled).
  • a Merkle path leading to the next MAX_SORT_LAYER section root in the sorted_refs tree.

Rendering historical index sections

Clients that keep a longer than minimally required chain history may also want to index past sections. This requires a similar initialization; in order to index any top layer section with an index i > 0 the last block entry of section i-1 and a Merkle path to sorted_refs[i] are needed. Since it is a small amount of data, the protocol extension is specified so that all historical section boundary proofs should be available. Later when alternative ways of obtaining historical data are available, soft limits can be proposed on this range, similarly to the chain history range limits proposed by EIP-4444.

Wire protocol extension

The Ethereum Wire Protocol extension is defined in a way that makes serving the required data as easy as possible. Though the responses are general purpose log index proofs, the request options are very limited and the reference blocks can also only be recent finalized blocks. The only request parameter is initIndex which determines the range of section boundaries proven. This makes it possible to cache or pre-generate responses, avoiding expensive Merkle tree processing in the eth protocol handler.

The protocol is extended with the following messages:

GetLogIndexInitProof (0x12)

[request-id: P, referenceBlockHash: B_32, initIndex: P]

This message requests an initialization proof message based on the specified reference block which can be one of the two most recent finalized blocks. Proven items are defined as follows (see the sketch after this list):

  • lastFullSection is the index of the last fully rendered top layer section
  • maxInitIndex is lastFullSection // 256
  • provenInitIndex is min(initIndex, maxInitIndex) (the latest proof can be obtained by specifying a very high initIndex)
  • sorted_refs roots and last block entries per section are proven for sections between provenInitIndex * 256 and min(provenInitIndex * 256 + 255, lastFullSection)
  • if provenInitIndex == maxInitIndex then the next index entry to be added is also proven
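
A sketch of how a serving node could resolve these parameters (names are illustrative, not part of the protocol):

def proven_items(init_index: int, last_full_section: int):
    # returns the proven init index, the range of top layer sections whose
    # sorted_refs roots and last block entries are proven, and whether the
    # next index entry to be added is also proven
    max_init_index = last_full_section // 256
    proven_init_index = min(init_index, max_init_index)
    first = proven_init_index * 256
    last = min(first + 255, last_full_section)
    prove_head_entry = proven_init_index == max_init_index
    return proven_init_index, range(first, last + 1), prove_head_entry
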
LogIndexInitProof (0x13)

[request-id: P, initProof: LogIndexProof]

This is the response to GetLogIndexInitProof, containing an RLP encoded LogIndexProof proving the requested items as defined above.

Generating and verifying an index query proof

All log index proofs have the general form of a Merkle multi-proof proving a subset of index entries and index references from the binary tree representation of the log index, including all sections referenced by root hash in sorted_refs and sort_layers. A log index proof request should always specify a recent reference block hash and the verifier should always recalculate the root hash of the received proof and check it against the log_index_root field of the reference block header. The interpretation and further validity checks of the response may depend on the purpose of the request, therefore the verifier should also receive the original request along with the proof.

Proven index entries

Index entries are proven from the index_entries subtree of the LogIndex binary hash tree and prove the existence of these entries in the index, but do not prove the completeness of the response (the non-existence of other entries matching the specified query criteria). Queries that can only yield a single result (like block number to/from block hash or transaction inclusion position to/from transaction hash) can be proven with a single index entry (assuming that the entry does exist). Queries that might have multiple results always require index reference proofs too. The different types of allowed entries (block entry, transaction entry and log entry) are always distinguishable based on their hash tree representation.

If a search block range is specified then the block entries belonging to firstBlock-1 and lastBlock are proven, unless firstBlock is 0 or lastBlock is the head block (in which case no block entry proof is needed for that end of the range). These prove the translation of the specified block range into value position range.

Proven index references

Index reference proofs can prove the exact value position or the non-existence of certain index values over a value position range. The entire range is covered by fully populated list sections except for the last section on sort layer 0 which is usually partially populated, so the value position range in question can be translated to a series of adjacent list sections. For a range covered by multiple fully populated sections on different sort layers, always the highest layer section should be used (this guarantees that only those lower layer sections are used that are part of sort_layers).

From the set of proven index references, for any specific index value a set of possible value positions can be derived where the given value may occur. Based on the following rules, derived from the fact that the references in each list section are sorted, the potential occurrences of any index value can be determined for each of the sections covering the range:

  • if no references are proven in the section then the entire value position range of the section is a potential occurrence.
  • if a reference with the searched value is proven then it is an actual occurrence.
  • if two adjacent references with the searched value are proven then the positions between the two are not potential occurrences.
  • if a reference with a value lower than the searched value is proven then the positions of the section before the position of the reference are not potential occurrences.
  • if a reference with a value higher than the searched value is proven then the positions of the section after the position of the reference are not potential occurrences.

In each list section the actual occurrences of an index value occupy a continuous range. Proving the exact set of occurrences (including an empty set) in a given section is always straightforward; in addition to the index references of the actual occurrences, the references before or after their range should also be proven (unless this range starts and/or ends on a section boundary). The same evaluation rules should be applied to all list sections covering the queried value position range to obtain the full set of potential occurrences. If the first and last sections cover some ranges outside the exact value position range, these ranges can automatically be treated as not potential occurrences. Note that proving a continuous range of leaves with a Merkle multi-proof is efficient, and the proof format can also be defined so that in case of multiple adjacent index references with the same index value the value is only stored once, making the proof even more efficient.
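
The straightforward single-section case described above, where a contiguous run of sorted leaves is proven and the exact occurrence set is verified, can be sketched as follows (the run is given as (value_position, index_value) pairs in sorted leaf order; names are illustrative):

def value_sort_key(index_value: bytes) -> int:
    # index values compare as little endian integers, consistently with
    # the ordering of full index references
    return int.from_bytes(index_value, "little")

def exact_occurrences(proven_run, at_section_start, at_section_end, searched_value):
    # returns the exact list of value positions holding searched_value in this
    # section, or None if the proven run does not bound the occurrences on
    # both sides (further occurrences could hide in the unproven leaves)
    if not proven_run:
        return None
    matches = [pos for pos, val in proven_run if val == searched_value]
    key = value_sort_key(searched_value)
    bounded_below = at_section_start or value_sort_key(proven_run[0][1]) < key
    bounded_above = at_section_end or value_sort_key(proven_run[-1][1]) > key
    return matches if bounded_below and bounded_above else None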

In case of log pattern matching queries (for example, address at position i AND topic1 at i+1 AND topic2 at i+2) there is room for heuristic proof size optimizations on the prover side. It typically makes sense to start with proving the occurrences of the rarest value in the pattern. In the example, if topic2 is the rarest and address is the most popular then the prover may first prove the potential occurrences of topic2, then for each potential occurrence i only prove the occurrence or non-occurrence of topic1 at i-1, and only if topic1 was also present at i-1, prove the occurrence or non-occurrence of address at i-2.

The verification of pattern matching queries does not require the verifier to care about these heuristics or the original proving order; it can simply determine the entire set of potential occurrences for each index value of the pattern and perform the pattern matching on these sets (in practice represented as lists of individual value positions or continuous value position ranges).
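
The verifier-side pattern matching over these sets can be sketched as an intersection of shifted position sets (illustrative only):

def match_pattern(occurrence_sets):
    # occurrence_sets[i]: value positions where the i-th pattern element
    # (address at offset 0, topic1 at offset 1, ...) actually or potentially
    # occurs; returns the candidate positions of the pattern's first element
    base, *rest = occurrence_sets
    result = set(base)
    for offset, positions in enumerate(rest, start=1):
        result &= {pos - offset for pos in positions}
    return result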

Backwards Compatibility

The existing log filter API (eth_getLogs, eth_newFilter, eth_getFilterLogs, eth_getFilterChanges) can be implemented with the new filter data structure. Applications relying on this API can operate without any change, benefiting from higher search performance. Repricing the LOG opcode might be considered after performing benchmarks, but the extra processing cost is not significant while the extra storage cost is around 15%. Other than that, the EVM is not affected in any way as it only emits logs but does not directly access them.

Future Compatibility

Stateless operation

Log index update witnesses can be compact since they are proving continuous ranges of merged index references. Assuming over 1000 index values per block (which is already a conservative estimate for mainnet), the total witness size can be estimated as slightly over MAX_SORT_LAYER * 32 bytes (480 bytes with the proposed constants) per added index value, since each index reference is accessed once per merge sort iteration and the proof nodes of the witnesses make up a minority of the data, as continuous merged ranges with a typical length of over 500 entries are being proven.

It is important to note though that the practical importance of statelessness is not to avoid using 2 GB of storage space, but rather to allow instant initialization, without having to download this data before starting validation. Keeping the already known sections and only requesting witnesses when actually needed reduces the overall network bandwidth burden. Lower sort layers get filled in very quickly; for example, after processing just 2**12 index values (which requires roughly 2 MB of witness data) half of the layers do not require a witness any more. Even better, instead of adding list section data to block witnesses, it is also possible to asynchronously prefetch this 2 GB of index state data, which, if done with the right priorities (fetch first what is going to be needed the soonest), allows practically instant initialization while most of the data is transferred in a non latency critical slot phase.

ZK-provability

As shown above, efficient witnesses of the merge sort process can be generated that are not much bigger than the actual data accessed. Merging two sorted lists is also computationally an easy task; most of the computation is actually the binary tree hashing. In conclusion, without assuming much about the zero knowledge technologies used in the future, it is safe to say that creating a ZK-proof of the log index processing should not be a challenging task compared to the proof of EVM execution and state access.

Increasing the sorted section size

When searching events over a long historical range, search performance is roughly proportional to the maximum section size, while proof size is inversely proportional. As the gas limits are increased and more index entries are generated, it might make sense to do further merge sort iterations and generate longer sorted sections. This can always be done with a consensus change, but there is also another, potentially better option: keep the consensus section size and periodically add on-chain ZK-proofs that prove the root hashes of the results of further merge sort iterations. While it makes sense to generate the lower sort layers in chain consensus so that the index is always usable up to the actual chain head, asynchronous off-chain processing with on-chain ZK-proofs is perfectly suitable for generating even longer merged sections if infrastructure operators (probably the same ones who are also serving this higher layer reference list data) are willing to commit the necessary resources.

Security Considerations

Cost of adding entries

A key advantage of the proposed design is that adding new index entries is consistently cheap. The CPU costs are mostly determined by tree hashing; on each sort layer a new leaf is added to a simple power-of-two sized binary hash tree; the cost can be estimated as one binary hash operation per sort layer per index value that can be processed in parallel with EVM execution (higher layers can also be pre-processed to avoid slowing down block processing), plus some minimal per-block costs when calculating the final log index root. On higher layers not stored in memory, there is also a database read/write cost of 32 bytes per sort layer per index value, which, again, can be pre-processed asynchronously, and which is the most efficient kind of database operation because the reference sorter always reads and writes continuous linear chunks of data. Measurements are still required but at this point it looks like the 375 gas per address/topic cost of LOG operations is easily maintainable.

Memory and database size requirements

As shown above, the reference sorter always operates on a hard capped amount of data. It is up to the client to decide how many sort layers it maintains in memory in order to save on the database bandwidth.

Log index proof security and efficiency

Query results proven by log index proofs are as safe as the client's knowledge of the head block hash. Unlike with probabilistic methods, there are no attack vectors related to making certain query proofs less efficient. The proving costs and proof sizes of single value lookups are consistently proportional to the length of the searched entry position range and the number of actual matches found. The costs of complex pattern searches are harder to estimate, but generally the index value of the pattern with the fewest occurrences in the range is the main determining factor.

Copyright

Copyright and related rights waived via CC0.
