https://notes.hella.cheap/picking-optimal-token-ids.html
This project demonstrates how to use Principal Component Analysis (PCA) to optimize token ID assignment in a document search index, resulting in better compression of sparse bit vectors.
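
The core idea can be sketched in a few lines of NumPy. This is an illustrative sketch, not the project's actual code: the function name `assign_token_ids`, the dense document-by-token input matrix, and the choice to sort tokens by their score on the first principal component are all assumptions made for the example.

```python
import numpy as np

def assign_token_ids(doc_term: np.ndarray) -> np.ndarray:
    """Return new_id, where new_id[old_token] is the reassigned token ID.

    doc_term is a (documents x tokens) 0/1 matrix. Hypothetical helper,
    written only to illustrate the idea.
    """
    # Treat each token as a point in "document space": one row per token.
    token_vectors = doc_term.T.astype(float)
    # PCA step: center the points, then take the top right-singular vector.
    centered = token_vectors - token_vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[0]  # 1-D coordinate of each token on the first PC
    # Tokens with similar scores appear in similar documents,
    # so hand out neighboring IDs in score order.
    order = np.argsort(scores, kind="stable")
    new_id = np.empty(len(order), dtype=int)
    new_id[order] = np.arange(len(order))
    return new_id

# Toy corpus: tokens 0 and 2 always co-occur, as do tokens 1 and 3.
docs = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
])
new_id = assign_token_ids(docs)
print(new_id)  # co-occurring tokens end up with adjacent IDs
```

In this toy corpus the set bits of every document become contiguous after reassignment, which is the kind of layout that run-based or block-based bit vector encodings tend to compress well. The sign of a principal component is arbitrary, so the exact IDs may be mirrored between runs or platforms; only the grouping is meaningful.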
In a search index, each document is represented as a sparse bit vector where:
- Each token has a unique ID (bit position)