This document gives a high-level overview of the tasks related to EIP-7745 (Trustless log and transaction index). It does not go into specification details; instead, it builds on the EIP, the Python EELS implementation and some pieces of Geth code (both production and WIP). It is intended to organize available resources, track the progress of unfinished tasks and address practical issues related to actual client implementation.
A minimal consensus implementation can be based on the EELS implementation, which maintains the minimum subset of the log index tree required to add new entries and calculate the log index tree root. If the client also wants to use the log index, either for local acceleration or for proof generation, it can still run the same minimal consensus code and maintain the minimal state in memory while also persisting collapsed tree data in a suitable database format.
The EELS implementation maintains its minimal state in a way that considers every added entry final and collapses index entries and full maps immediately. This makes it impossible to roll back index entries. One possible approach to handling reorgs would be to save the entire minimal state after every block and keep all snapshots between the latest finalized block and the current head. This approach is unnecessarily wasteful though. A better solution is to decouple tree collapsing from advancing the next_index pointer and only collapse finalized index entries and map rows. The client can then save just the next_index pointer and the rows of the last map (or even just the length of each row in the last map) after each block; this is enough information to remove all index data added after a given block. When a new block gets finalized, older pointers can be forgotten and all map rows and index entries completed before the finalized pointer can be collapsed.
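As an illustration of how little per-block data this requires, here is a rough sketch in Go; all names and types are hypothetical and not taken from the EIP, EELS or Geth:

```go
package logindex

// blockSnapshot captures the minimum information needed to roll the
// in-memory log index state back to the post-state of a given block:
// the next_index pointer and the lengths of the rows of the last
// (still mutable) filter map. Names are illustrative only.
type blockSnapshot struct {
	blockNumber uint64
	nextIndex   uint64   // value of the next_index pointer after the block
	rowLengths  []uint32 // length of each row of the last map after the block
}

// rollbackTracker keeps one snapshot per non-finalized block.
type rollbackTracker struct {
	snapshots []blockSnapshot // ordered by block number, oldest first
}

// add records the state reached after processing a new block.
func (t *rollbackTracker) add(s blockSnapshot) {
	t.snapshots = append(t.snapshots, s)
}

// rollbackTo returns the snapshot for the given block so that the caller can
// truncate the last map's rows to the recorded lengths and reset next_index;
// snapshots of later blocks are dropped.
func (t *rollbackTracker) rollbackTo(blockNumber uint64) (blockSnapshot, bool) {
	for i := len(t.snapshots) - 1; i >= 0; i-- {
		if t.snapshots[i].blockNumber == blockNumber {
			s := t.snapshots[i]
			t.snapshots = t.snapshots[:i+1]
			return s, true
		}
	}
	return blockSnapshot{}, false
}

// finalize forgets snapshots older than the finalized block; map rows and
// index entries completed before the finalized pointer can then be collapsed
// by the caller.
func (t *rollbackTracker) finalize(blockNumber uint64) {
	for len(t.snapshots) > 0 && t.snapshots[0].blockNumber < blockNumber {
		t.snapshots = t.snapshots[1:]
	}
}
```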
Note that if the latest finalized block is too old and the log index state would grow too big, it is also acceptable to discard pointers and collapse entries after a reasonable time, because there is always the alternative option of rolling back to an epoch boundary, as long as epoch roots and the last block boundary pointer of each epoch are retained.
The log index state can be initialized at any point by obtaining a serialized version of the LogIndexMinimalState container described in the EIP and checking its calculated root hash against the one in the corresponding block header. Note that this requires an extension of the Ethereum Wire Protocol which is not specified yet. The main design decision here is for which block(s) the log index state should be available through this protocol. Safety considerations dictate that the state corresponding to the latest finalized block should always be obtainable from any node participating in the network. This implies that the simplest and safest protocol is to only serve the state of the finalized block and let each client generate the head state starting from there.
Note that the entire LogIndexMinimalState is about 21 MB in size, so the protocol should probably be specified so that one request/reply pair only transmits a subset of the 64k+1 Merkle branches, ideally in an individually verifiable form.
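Since this protocol extension is not specified yet, the following is only a hypothetical sketch of what a chunked, individually verifiable request/reply pair could look like; all message names and fields are invented for illustration:

```go
package logindex

// Hypothetical wire protocol messages for retrieving LogIndexMinimalState in
// chunks. The real message format is not specified yet; this only illustrates
// splitting the ~21 MB container into independently verifiable pieces.

// getLogIndexStateMsg asks for a contiguous range of the 64k+1 branches of
// the minimal state belonging to a given (finalized) block.
type getLogIndexStateMsg struct {
	requestID   uint64
	blockHash   [32]byte // block whose minimal state is requested
	firstBranch uint32   // index of the first requested branch
	count       uint32   // number of branches requested
}

// logIndexStateMsg returns the requested branches together with the Merkle
// nodes needed to verify them against the log index root in the block header,
// so that each reply can be checked on its own before requesting more.
type logIndexStateMsg struct {
	requestID uint64
	branches  [][]byte // serialized branches in the requested range
	proof     [][]byte // nodes proving the branches against the header root
}
```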
There are multiple options here, but a simple and reasonably efficient one is to serialize and store a LogIndexMinimalState snapshot at each block whose corresponding beacon block is at a beacon chain epoch (32 slots) boundary. When one of these gets finalized, the older ones can be deleted. Reindexing at most 31 blocks is fast, and these snapshots can also be used to serve the initialization request of the wire protocol.
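A minimal sketch of this snapshot policy, with an in-memory map standing in for a real database and all names invented for illustration:

```go
package logindex

const slotsPerEpoch = 32

// snapshotStore persists serialized LogIndexMinimalState snapshots at beacon
// chain epoch boundaries and prunes them on finalization. The map is a
// stand-in for a real database; the serialization is assumed to be done by
// the caller.
type snapshotStore struct {
	db map[uint64][]byte // block number -> serialized minimal state
}

// maybeStore saves a snapshot if the block's beacon slot is the first slot of
// an epoch.
func (s *snapshotStore) maybeStore(blockNumber, beaconSlot uint64, serializedState []byte) {
	if beaconSlot%slotsPerEpoch == 0 {
		s.db[blockNumber] = serializedState
	}
}

// pruneFinalized deletes all snapshots older than the latest finalized
// epoch boundary snapshot; at most 31 blocks ever need to be reindexed from
// the retained one.
func (s *snapshotStore) pruneFinalized(finalizedBlock uint64) {
	var latest uint64
	for num := range s.db {
		if num <= finalizedBlock && num > latest {
			latest = num
		}
	}
	for num := range s.db {
		if num < latest {
			delete(s.db, num)
		}
	}
}
```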
An alternative is to serialize and store the entire rollback-capable in-memory tree representation of the log index, together with the next_index pointer / row length snapshots. This method saves reindexing the blocks indexed since the last beacon chain epoch boundary, at the cost of extra complexity and more database writes.
Storing a large dataset represented as a binary Merkle tree requires some optimization, because storing everything as individual tree nodes (including progressive lists and all log entries) would be inefficient. Most importantly, since the filter rows are typically short and are usually accessed in batches (the same row index for an entire epoch), it is efficient to store them in groups, in a tightly packed format, and to generate the progressive list tree nodes in memory on demand. The Geth implementation stores the first part of each row (the part that fits into mapping layer 0) in larger groups and the remaining entries of longer rows (which are much less frequent) individually. Note that the current production version does not include hash tree generation yet; this is WIP on this branch.
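To illustrate the packed group storage idea (this is not the actual Geth database schema; the key layout, the group granularity and the assumed 32-bit column values are all illustrative):

```go
package logindex

import "encoding/binary"

// Simplified sketch of packed filter row storage: the layer 0 parts of a
// given row index are packed together for a whole group of consecutive maps
// (for example one epoch), while the rarer longer-row tails would be stored
// under individual keys. Key prefix and value widths are assumptions.

// rowGroupKey identifies the packed layer 0 parts of one row index across a
// group of maps.
func rowGroupKey(mapGroup uint32, rowIndex uint32) []byte {
	key := make([]byte, 9)
	key[0] = 'r' // hypothetical key prefix
	binary.BigEndian.PutUint32(key[1:5], mapGroup)
	binary.BigEndian.PutUint32(key[5:9], rowIndex)
	return key
}

// packRowGroup tightly packs the layer 0 part of the same row index taken
// from each map in the group: a 2-byte length prefix per map followed by the
// column values (assumed 32-bit here for illustration).
func packRowGroup(rows [][]uint32) []byte {
	var out []byte
	for _, row := range rows {
		var length [2]byte
		binary.BigEndian.PutUint16(length[:], uint16(len(row)))
		out = append(out, length[:]...)
		for _, col := range row {
			var v [4]byte
			binary.BigEndian.PutUint32(v[:], col)
			out = append(out, v[:]...)
		}
	}
	return out
}
```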
Similarly, for the index entries tree it would be inefficient to store the individual entries in binary tree form. In fact, if the client already stores block receipts, it would be inefficient to duplicate that data in any way. In this case it is better to maintain a two-way mapping between block numbers and map entry index pointers (in Geth there are "first map entry index of block" pointers and "last block of filter map" pointers). With this mapping it is possible to find any index entry by its map entry index and also to generate the related hash tree nodes on demand.
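A rough sketch of how such a mapping could be used to locate an entry by its map entry index; the types are illustrative, and the step of regenerating the entry from stored receipts is assumed to be done by the caller:

```go
package logindex

import "sort"

// entryIndexMap holds "first map entry index of block" pointers so that any
// index entry can be located by its map entry index without duplicating
// receipt data. blockPointers[i] is the first entry index of block
// firstBlock+i; entry indices queried below are assumed to lie in range.
type entryIndexMap struct {
	firstBlock    uint64
	blockPointers []uint64
}

// blockOfEntry returns the number of the block whose entries contain the
// given map entry index, using binary search over the per-block pointers.
func (m *entryIndexMap) blockOfEntry(entryIndex uint64) uint64 {
	i := sort.Search(len(m.blockPointers), func(i int) bool {
		return m.blockPointers[i] > entryIndex
	})
	return m.firstBlock + uint64(i) - 1
}

// entryOffsetInBlock returns the position of the entry within its block, so
// that the entry (and the hash tree nodes covering it) can be regenerated on
// demand from the block's receipts.
func (m *entryIndexMap) entryOffsetInBlock(entryIndex uint64) uint64 {
	block := m.blockOfEntry(entryIndex)
	return entryIndex - m.blockPointers[block-m.firstBlock]
}
```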
In both cases, leaf nodes and nodes close to the leaf level are relatively cheap to generate, while internal nodes closer to the root are exponentially more expensive, since regenerating a node at a given level requires twice as much hashing as regenerating one at the level below it. On the other hand, storing one less level of hash tree nodes on disk roughly halves the total number of stored tree nodes. There is therefore a tradeoff between the database cost of stored tree nodes and the CPU cost of regenerating close-to-leaf-level nodes during proof generation. An example of an efficient hash tree implementation in Geth is coming soon.
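A back-of-the-envelope helper illustrating this tradeoff (assuming the leaves are already available as hashed chunks; the numbers are approximate, not measured):

```go
package logindex

// storageVsHashing illustrates the tradeoff: if the lowest `omitted` levels
// of a binary tree over `leaves` leaves are not stored, the stored node count
// is roughly halved per omitted level, while regenerating the missing subtree
// containing a single leaf during proof generation costs 2^omitted - 1 hashes.
func storageVsHashing(leaves, omitted uint64) (storedNodes, hashesPerLeafProof uint64) {
	totalNodes := 2*leaves - 1              // full binary tree over the leaves
	storedNodes = totalNodes >> omitted     // roughly halved per omitted level
	hashesPerLeafProof = (1 << omitted) - 1 // hashes to rebuild the omitted subtree
	return
}
```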
Local search is already implemented in production Geth; see matcher.go, which implements both single map value lookup and query pattern matching.
Generating and verifying Merkle proofs is in many ways similar to local search. The main difference is that the generator, after looking up the relevant filter rows and index entries for the query, creates a binary Merkle tree multiproof of the relevant parts of the required filter rows. The verifier then uses the proven index entries and filter rows (or parts of them) as input and runs the same search logic on this dataset. If the root of the multiproof matches the log index root in the relevant header, and the set of potential matches derived from the filter rows is covered by the proven set of index entries, the proof can be considered valid.
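The following sketch outlines the verifier side of this flow; the proof container, query type and helper functions are hypothetical stand-ins, since the actual proof format is not defined yet:

```go
package logindex

import (
	"bytes"
	"errors"
)

// All types below are placeholders for illustration only.

type logQuery struct{} // stand-in for an address/topic pattern query

type provenEntry struct{} // stand-in for a fully proven index entry

// matches checks the proven entry against the query (placeholder).
func (e *provenEntry) matches(q *logQuery) bool { return true }

// logIndexMultiproof stands in for a binary Merkle tree multiproof of the
// relevant filter row parts and the index entries they point to.
type logIndexMultiproof struct {
	rootHash []byte
	rows     [][]uint32              // proven parts of the required filter rows
	entries  map[uint64]*provenEntry // proven index entries by map entry index
}

func (p *logIndexMultiproof) root() []byte           { return p.rootHash }
func (p *logIndexMultiproof) filterRows() [][]uint32 { return p.rows }
func (p *logIndexMultiproof) entry(i uint64) (*provenEntry, bool) {
	e, ok := p.entries[i]
	return e, ok
}

// derivePotentialMatches returns the map entry indices that the proven filter
// row parts mark as potential matches for the query (placeholder).
func derivePotentialMatches(rows [][]uint32, q *logQuery) []uint64 { return nil }

// verifyLogProof sketches the verification flow: check the multiproof root
// against the header, derive potential matches from the proven filter rows,
// and require every potential match to be covered by a proven index entry.
func verifyLogProof(headerLogIndexRoot []byte, proof *logIndexMultiproof, query *logQuery) ([]*provenEntry, error) {
	if !bytes.Equal(proof.root(), headerLogIndexRoot) {
		return nil, errors.New("log index root mismatch")
	}
	potential := derivePotentialMatches(proof.filterRows(), query)
	var results []*provenEntry
	for _, idx := range potential {
		entry, ok := proof.entry(idx)
		if !ok {
			return nil, errors.New("potential match not covered by proven index entries")
		}
		if entry.matches(query) {
			results = append(results, entry)
		}
	}
	return results, nil
}
```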
A first draft of the proof generation and verification logic exists on the proof-poc branch. It implements the potential match derivation and pattern matching logic on any subset of log index tree nodes (including subsets of individual map row progressive lists), which can be used to generate, optimize and verify log index proofs.
The exact proof format is not defined yet, but an API endpoint format (for log queries as well as transaction and block hash lookups) and a list of items that should be included in the index proof are described in the initial draft writeup about the new trustless execution layer API proposal. This new API will take full advantage of the log index capabilities.