@balta2ar
Created December 6, 2025 11:59
Visualization of this article about picking token ids https://notes.hella.cheap/picking-optimal-token-ids.html

Token Ordering via PCA for Sparse Bit Vector Compression

https://notes.hella.cheap/picking-optimal-token-ids.html

This project demonstrates how to use Principal Component Analysis (PCA) to optimize token ID assignment in a document search index, resulting in better compression of sparse bit vectors.

The Problem

In a search index, each document is represented as a sparse bit vector where:

  • Each token has a unique ID (bit position)
  • Bit is set to 1 if the token appears in the document, 0 otherwise
  • Most tokens don't appear in most documents → vectors are sparse (mostly zeros)

To save space, we use run-length encoding or sparse representations that work better when zeros (or ones) are clustered together. However, with random token ID assignment, set bits are scattered throughout the vector.
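The compression cost can be measured directly as the number of runs. A minimal sketch (the `count_runs` helper and the two example vectors are illustrative, not part of pca.py):

```python
def count_runs(bits: str) -> int:
    """Count maximal runs of identical characters in a 0/1 string."""
    if not bits:
        return 0
    # one run to start with, plus one more for every 0<->1 transition
    return 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)

scattered = "0001001011001000011"   # random token IDs: set bits spread out
clustered = "1111111000000000000"   # co-occurrence-aware IDs: set bits grouped
print(count_runs(scattered))  # 10 runs -> RLE stores 10 (value, length) pairs
print(count_runs(clustered))  # 2 runs  -> RLE stores just 2 pairs
```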

The Solution

Key Insight: We can choose which tokens get which IDs. By assigning consecutive IDs to tokens that tend to co-occur, we cluster the set bits together.

Algorithm

  1. Build Co-occurrence Matrix: Create a symmetric matrix where entry (i,j) = number of documents containing both token i and token j

  2. Compute Principal Eigenvector: Find the eigenvector corresponding to the largest eigenvalue of the co-occurrence matrix

  3. Sort Tokens: Sort tokens by their values in the principal eigenvector

  4. Assign IDs: Give consecutive IDs (0, 1, 2, ...) to tokens in this sorted order

This effectively performs PCA to project tokens onto a 1D space where related tokens are close together.
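The four steps above can be sketched in a few lines of NumPy. This is a simplified standalone version, not pca.py itself (for instance, it uses `np.linalg.eigh`, which is appropriate for a symmetric matrix, rather than the general `np.linalg.eig`):

```python
import numpy as np

def pca_token_order(doc_token_sets, tokens):
    """Return tokens reordered so that co-occurring tokens get adjacent IDs."""
    idx = {t: i for i, t in enumerate(tokens)}
    cooc = np.zeros((len(tokens), len(tokens)))
    for doc in doc_token_sets:                 # 1. co-occurrence matrix
        for a in doc:
            for b in doc:
                cooc[idx[a], idx[b]] += 1
    vals, vecs = np.linalg.eigh(cooc)          # 2. eigendecomposition (symmetric)
    principal = vecs[:, np.argmax(vals)]       #    column for the largest eigenvalue
    order = np.argsort(principal)              # 3. sort tokens by their coordinate
    return [tokens[i] for i in order]          # 4. position in this list = new ID

docs = [{"apple", "banana"}, {"apple", "grape"}, {"honda", "toyota"}]
new_order = pca_token_order(docs, sorted({t for d in docs for t in d}))
# fruit tokens end up in one contiguous block, car tokens in another
# (which block comes first depends on the eigenvector's arbitrary sign)
```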

Results

On our sample dataset with 3 topics (fruits, home objects, car models):

Before (Random/Alphabetical Ordering)

Document 1 (fruits):  0001001011001000011  (scattered bits)
Document 2 (objects): 0010110000100010100  (scattered bits)
Document 3 (cars):    1100000100010101000  (scattered bits)
  • 50 total runs (transitions between 0s and 1s, across all 6 documents)

After (PCA-based Ordering)

Document 1 (fruits):  1111111000000000000  (clustered bits)
Document 2 (objects): 0000000111111000000  (clustered bits)
Document 3 (cars):    0000000000000111111  (clustered bits)
  • 16 total runs (transitions between 0s and 1s)
  • 68.0% reduction in number of runs

Usage

python pca.py

This will:

  1. Generate a sample dataset with 19 tokens across 6 documents
  2. Show initial token IDs (alphabetical order)
  3. Build and display the co-occurrence matrix
  4. Compute eigenvalues and eigenvectors
  5. Reorder tokens using the principal eigenvector
  6. Compare compression efficiency

How It Works

Co-occurrence Matrix Example

Tokens that appear together have high co-occurrence values:

  • Fruit tokens (apple, banana, grape, etc.) co-occur with each other
  • Home object tokens (chair, lamp, sofa, etc.) co-occur with each other
  • Car tokens (honda, toyota, nissan, etc.) co-occur with each other

Principal Eigenvector

The eigenvector values separate the clusters: tokens from the same topic receive values of similar magnitude, while different topics land at different magnitudes. In the sample run below, the fruit tokens get values between 0.26 and 0.45, while the home-object and car tokens get values so small they display as 0.000000 yet still sort into their own contiguous groups. Which topic receives the largest values depends on the dataset and the eigensolver run.

Sorting by these values groups related tokens together.

Compression Benefit

With clustered bits:

  • Run-Length Encoding: Fewer runs to encode
  • Sparse Bit Masks: Can use bitmasks indicating which 64-bit words contain any set bits (as described in the blog post)
  • Better Cache Locality: Related tokens stored near each other in memory
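The sparse-bitmask idea in the second bullet can be sketched as follows (a hypothetical illustration, not the blog post's actual implementation): split the vector into 64-bit words, store only the nonzero words, and keep a small mask recording which word slots they came from.

```python
def to_sparse(bitvec: int, nbits: int):
    """Keep only the nonzero 64-bit words of a bit vector, plus a word-presence mask."""
    mask, words = 0, []
    for w in range((nbits + 63) // 64):
        chunk = (bitvec >> (64 * w)) & ((1 << 64) - 1)
        if chunk:                 # clustered bits -> few nonzero words to store
            mask |= 1 << w
            words.append(chunk)
    return mask, words

# scattered: one set bit in each of three different 64-bit words
scattered = (1 << 0) | (1 << 100) | (1 << 200)
# clustered: the same number of set bits packed into a single word
clustered = (1 << 0) | (1 << 1) | (1 << 2)
print(len(to_sparse(scattered, 256)[1]))  # 3 words stored
print(len(to_sparse(clustered, 256)[1]))  # 1 word stored
```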

Real-World Impact

The blog post mentions a 12% improvement in index size using this technique on a real search index.

Dependencies

  • Python 3.x
  • NumPy (for matrix operations and eigenvalue computation)

Files

  • pca.py - Main implementation with dataset generation, co-occurrence matrix building, PCA computation, and visualization
  • README.md - This file

Theory

This approach uses the mathematical property that eigenvectors of a co-occurrence matrix capture the main patterns of association. The principal eigenvector (corresponding to the largest eigenvalue) represents the strongest pattern, which in our case is the clustering of semantically related tokens.
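For large vocabularies a full eigendecomposition is unnecessary: the principal eigenvector alone can be found by power iteration, which needs only matrix-vector products. A sketch of this alternative (not what pca.py does); since the co-occurrence matrix is nonnegative, the all-ones starting vector has a nonzero component along the dominant eigenvector, so the iteration converges:

```python
import numpy as np

def principal_eigenvector(M, iters=200):
    """Power iteration: repeated multiplication by M amplifies the component
    along the dominant eigenvector until it dominates the whole vector."""
    v = np.ones(M.shape[0]) / np.sqrt(M.shape[0])
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

cooc = np.array([[2.0, 1.0], [1.0, 1.0]])  # toy symmetric co-occurrence matrix
v = principal_eigenvector(cooc)
# agrees with np.linalg.eigh's top eigenvector up to sign
```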

Output

By sorting along this single dimension, we perform dimensionality reduction from the full token space down to 1D while preserving the most important structure (token co-occurrence patterns).
================================================================================
SAMPLE DATASET FOR TOKEN-BASED DOCUMENT INDEXING
================================================================================

============================================================
TOKEN ID ASSIGNMENTS
============================================================
Token           ID (decimal)    ID (binary)
------------------------------------------------------------
apple           0               0b000000
banana          1               0b000001
chair           2               0b000010
chevrolet       3               0b000011
curtain         4               0b000100
ford            5               0b000101
grape           6               0b000110
honda           7               0b000111
lamp            8               0b001000
lime            9               0b001001
mango           10              0b001010
nissan          11              0b001011
orange          12              0b001100
pillow          13              0b001101
sofa            14              0b001110
strawberry      15              0b001111
table           16              0b010000
tesla           17              0b010001
toyota          18              0b010010

Total tokens: 19

================================================================================
TOKEN-DOCUMENT MATRIX
================================================================================

Rows = Tokens (ordered by ID), Columns = Documents
1 = token present in document, 0 = token absent

Token           ID    D1 D2 D3 D4 D5 D6
--------------------------------------
apple           0      1  0  0  1  0  0
banana          1      1  0  0  1  0  0
chair           2      0  1  0  0  1  0
chevrolet       3      0  0  1  0  0  0
curtain         4      0  1  0  0  0  0
ford            5      0  0  1  0  0  0
grape           6      1  0  0  1  0  0
honda           7      0  0  1  0  0  1
lamp            8      0  1  0  0  1  0
lime            9      1  0  0  1  0  0
mango           10     1  0  0  0  0  0
nissan          11     0  0  1  0  0  1
orange          12     1  0  0  0  0  0
pillow          13     0  1  0  0  0  0
sofa            14     0  1  0  0  1  0
strawberry      15     1  0  0  0  0  0
table           16     0  1  0  0  0  0
tesla           17     0  0  1  0  0  0
toyota          18     0  0  1  0  0  1

================================================================================
DOCUMENT BIT VECTORS
================================================================================

Document 1 (fruits):
  Content: apple orange banana lime grape mango strawberry
  Tokens: ['apple', 'banana', 'grape', 'lime', 'mango', 'orange', 'strawberry']
  Bit vector (decimal): 38467
  Bit vector (binary):  0001001011001000011
  Bits set: 7

Document 2 (home_objects):
  Content: chair table lamp sofa pillow curtain
  Tokens: ['chair', 'curtain', 'lamp', 'pillow', 'sofa', 'table']
  Bit vector (decimal): 90388
  Bit vector (binary):  0010110000100010100
  Bits set: 6

Document 3 (car_models):
  Content: toyota honda tesla ford chevrolet nissan
  Tokens: ['chevrolet', 'ford', 'honda', 'nissan', 'tesla', 'toyota']
  Bit vector (decimal): 395432
  Bit vector (binary):  1100000100010101000
  Bits set: 6

Document 4 (fruits):
  Content: lime apple banana grape
  Tokens: ['apple', 'banana', 'grape', 'lime']
  Bit vector (decimal): 579
  Bit vector (binary):  0000000001001000011
  Bits set: 4

Document 5 (home_objects):
  Content: chair lamp sofa
  Tokens: ['chair', 'lamp', 'sofa']
  Bit vector (decimal): 16644
  Bit vector (binary):  0000100000100000100
  Bits set: 3

Document 6 (car_models):
  Content: honda toyota nissan
  Tokens: ['honda', 'nissan', 'toyota']
  Bit vector (decimal): 264320
  Bit vector (binary):  1000000100010000000
  Bits set: 3
================================================================================
CO-OCCURRENCE MATRIX
================================================================================

Rows and columns are tokens (ordered by ID)
Values show how many documents contain both tokens

             0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
           app ban cha che cur for gra hon lam lim man nis ora pil sof str tab tes toy
---------------------------------------------------------------------------------------
 0 apple        2   2   0   0   0   0   2   0   0   2   1   0   1   0   0   1   0   0   0
 1 banana       2   2   0   0   0   0   2   0   0   2   1   0   1   0   0   1   0   0   0
 2 chair        0   0   2   0   1   0   0   0   2   0   0   0   0   1   2   0   1   0   0
 3 chevrolet    0   0   0   1   0   1   0   1   0   0   0   1   0   0   0   0   0   1   1
 4 curtain      0   0   1   0   1   0   0   0   1   0   0   0   0   1   1   0   1   0   0
 5 ford         0   0   0   1   0   1   0   1   0   0   0   1   0   0   0   0   0   1   1
 6 grape        2   2   0   0   0   0   2   0   0   2   1   0   1   0   0   1   0   0   0
 7 honda        0   0   0   1   0   1   0   2   0   0   0   2   0   0   0   0   0   1   2
 8 lamp         0   0   2   0   1   0   0   0   2   0   0   0   0   1   2   0   1   0   0
 9 lime         2   2   0   0   0   0   2   0   0   2   1   0   1   0   0   1   0   0   0
10 mango        1   1   0   0   0   0   1   0   0   1   1   0   1   0   0   1   0   0   0
11 nissan       0   0   0   1   0   1   0   2   0   0   0   2   0   0   0   0   0   1   2
12 orange       1   1   0   0   0   0   1   0   0   1   1   0   1   0   0   1   0   0   0
13 pillow       0   0   1   0   1   0   0   0   1   0   0   0   0   1   1   0   1   0   0
14 sofa         0   0   2   0   1   0   0   0   2   0   0   0   0   1   2   0   1   0   0
15 strawberry   1   1   0   0   0   0   1   0   0   1   1   0   1   0   0   1   0   0   0
16 table        0   0   1   0   1   0   0   0   1   0   0   0   0   1   1   0   1   0   0
17 tesla        0   0   0   1   0   1   0   1   0   0   0   1   0   0   0   0   0   1   1
18 toyota       0   0   0   1   0   1   0   2   0   0   0   2   0   0   0   0   0   1   2

================================================================================
EIGENVALUE ANALYSIS
================================================================================

All Eigenvalues (sorted descending):
  λ1 =     9.7720
  λ2 =     7.8541
  λ3 =     7.8541
  λ4 =     1.2280
  λ5 =     1.1459
  λ6 =     1.1459
  λ7 =     0.0000
  λ8 =     0.0000
  λ9 =     0.0000
  λ10 =     0.0000
  λ11 =     0.0000
  λ12 =     0.0000
  λ13 =    -0.0000
  λ14 =    -0.0000
  λ15 =    -0.0000
  λ16 = -0.0000+0.0000j
  λ17 = -0.0000-0.0000j
  λ18 =    -0.0000
  λ19 =    -0.0000

================================================================================
PRINCIPAL EIGENVECTOR (largest eigenvalue)
================================================================================

Largest eigenvalue: 9.7720

Token           ID    Eigenvector Value
--------------------------------------------------
apple           0               0.445141
banana          1               0.445141
chair           2               0.000000
chevrolet       3               0.000000
curtain         4               0.000000
ford            5               0.000000
grape           6               0.445141
honda           7               0.000000
lamp            8               0.000000
lime            9               0.445141
mango           10              0.262930
nissan          11              0.000000
orange          12              0.262930
pillow          13              0.000000
sofa            14              0.000000
strawberry      15              0.262930
table           16              0.000000
tesla           17              0.000000
toyota          18              0.000000

================================================================================
COMPLETE EIGENVECTOR MATRIX (before sorting)
================================================================================

Rows = Tokens, Columns = Eigenvectors
Each column corresponds to an eigenvalue

All Eigenvalues (sorted descending):
  λ1 =     9.7720 <- PRINCIPAL (will be used for reordering)
  λ2 =     7.8541
  λ3 =     7.8541
  λ4 =     1.2280
  λ5 =     1.1459
  λ6 =     1.1459
  λ7 =     0.0000
  λ8 =     0.0000
  λ9 =     0.0000
  λ10 =     0.0000
  λ11 =     0.0000
  λ12 =     0.0000
  λ13 =    -0.0000
  λ14 =    -0.0000
  λ15 =    -0.0000
  λ16 =     0.0000
  λ17 =     0.0000
  λ18 =    -0.0000
  λ19 =    -0.0000

Eigenvector Matrix:
Token           ID    λ1=  9.77* λ2=  7.85  λ3=  7.85  λ4=  1.23  λ5=  1.15  λ6=  1.15  λ7=  0.00  λ8=  0.00  λ9=  0.00  λ10=  0.00  λ11=  0.00  λ12=  0.00  λ13= -0.00  λ14= -0.00  λ15= -0.00  λ16=  0.00  λ17=  0.00  λ18= -0.00  λ19= -0.00
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
apple           0          0.445141     -0.000000     -0.000000     -0.227704      0.000000      0.000000      0.805315     -0.025060      0.002558     -0.005562     -0.000000     -0.000000     -0.000000      0.000000      0.007033      0.008379      0.008379      0.006039      0.011419
banana          1          0.445141      0.000000     -0.000000     -0.227704      0.000000      0.000000     -0.566666      0.014584      0.000564      0.006481      0.000000      0.000000      0.000000     -0.000000     -0.003478      0.013560      0.013560      0.035696     -0.814537
chair           2          0.000000     -0.491123     -0.000118      0.000000      0.303531      0.074408      0.035410     -0.694321      0.105381     -0.293702     -0.000000     -0.000000     -0.000000      0.000000      0.388822     -0.521620     -0.521620      0.507245      0.096261
chevrolet       3          0.000000     -0.000000     -0.303531     -0.000000     -0.000000     -0.476138      0.000000      0.014213     -0.080759     -0.027955     -0.000000     -0.000000     -0.000000      0.000000      0.082214      0.230949      0.230949      0.432623      0.000000
curtain         4          0.000000     -0.303531     -0.000073     -0.000000     -0.491123     -0.120395     -0.000000      0.407004      0.102229      0.104771      0.000000      0.000000      0.000000     -0.000000     -0.077631      0.173936      0.173936      0.334602     -0.000000
ford            5          0.000000     -0.000000     -0.303531     -0.000000     -0.000000     -0.476138      0.000000     -0.042291     -0.625321     -0.049140      0.000000      0.000000      0.000000     -0.000000     -0.089548      0.342791      0.342791     -0.359499      0.000000
grape           6          0.445141      0.000000     -0.000000     -0.227704      0.000000     -0.000000     -0.119324     -0.032597     -0.121909     -0.599315     -0.000000     -0.000000     -0.000000      0.000000      0.576398      0.252517      0.252517     -0.006870      0.401559
honda           7          0.000000     -0.000000     -0.491123     -0.000000     -0.000000      0.294269      0.000000     -0.120722     -0.168501     -0.291682     -0.000000     -0.000000     -0.000000      0.000000      0.189002      0.372020      0.372020     -0.284945      0.000000
lamp            8          0.000000     -0.491123     -0.000118      0.000000      0.303531      0.074408     -0.017705      0.346717     -0.052462      0.141514      0.000000      0.000000      0.000000     -0.000000     -0.310881      0.262513      0.262513     -0.254607     -0.048131
lime            9          0.445141      0.000000      0.000000     -0.227704      0.000000     -0.000000     -0.119324      0.043073      0.118787      0.598396      0.000000      0.000000      0.000000     -0.000000     -0.579953      0.260616      0.260616     -0.034866      0.401559
mango           10         0.262930     -0.000000      0.000000      0.514005     -0.000000     -0.000000     -0.000000      0.000000      0.000000      0.000000     -0.701876     -0.016623     -0.304439      0.397522      0.000000      0.000000      0.000000     -0.000000     -0.000000
nissan          11         0.000000     -0.000000     -0.491123     -0.000000     -0.000000      0.294269      0.000000      0.060361      0.084251      0.145841     -0.230382     -0.165219     -0.040169     -0.156693     -0.094501      0.186010      0.186010      0.142472      0.000000
orange          12         0.262930     -0.000000      0.000000      0.514005     -0.000000     -0.000000     -0.000000     -0.000000     -0.000000     -0.000000      0.230151      0.409679     -0.004208     -0.207464     -0.000000      0.000000      0.000000      0.000000     -0.000000
pillow          13         0.000000     -0.303531     -0.000073     -0.000000     -0.491123     -0.120395      0.000000     -0.203502     -0.051115     -0.052385      0.250723     -0.558032     -0.635928      0.597372      0.038815      0.086968      0.086968     -0.167301     -0.000000
sofa            14         0.000000     -0.491123     -0.000118      0.000000      0.303531      0.074408     -0.017705      0.347604     -0.052919      0.152188      0.000000      0.000000      0.000000     -0.000000     -0.077941      0.259113      0.259113     -0.252638     -0.048131
strawberry      15         0.262930     -0.000000      0.000000      0.514005     -0.000000     -0.000000     -0.000000     -0.000000     -0.000000     -0.000000      0.471725     -0.393056      0.308647     -0.190058     -0.000000      0.000000      0.000000      0.000000     -0.000000
table           16         0.000000     -0.303531     -0.000073     -0.000000     -0.491123     -0.120395      0.000000     -0.203502     -0.051115     -0.052385     -0.250723      0.558032      0.635928     -0.597372      0.038815      0.086968      0.086968     -0.167301     -0.000000
tesla           17         0.000000     -0.000000     -0.303531     -0.000000     -0.000000     -0.476138      0.000000      0.028078      0.706080      0.077095     -0.000000     -0.000000     -0.000000      0.000000      0.007334      0.189993      0.189993     -0.073124      0.000000
toyota          18         0.000000     -0.000000     -0.491123     -0.000000     -0.000000      0.294269      0.000000      0.060361      0.084251      0.145841      0.230382      0.165219      0.040169      0.156693     -0.094501      0.186010      0.186010      0.142472      0.000000

* Column with λ1 (Principal Eigenvector) will be used for sorting tokens

================================================================================
PCA-BASED TOKEN REORDERING
================================================================================

Sorting tokens by principal eigenvector values...
This clusters tokens that co-occur together.

Token           Old ID     Eigenvector Value    New ID
-----------------------------------------------------------------
honda           7                    0.000000   0
nissan          11                   0.000000   1
tesla           17                   0.000000   2
toyota          18                   0.000000   3
chevrolet       3                    0.000000   4
ford            5                    0.000000   5
pillow          13                   0.000000   6
table           16                   0.000000   7
curtain         4                    0.000000   8
lamp            8                    0.000000   9
sofa            14                   0.000000   10
chair           2                    0.000000   11
mango           10                   0.262930   12
orange          12                   0.262930   13
strawberry      15                   0.262930   14
grape           6                    0.445141   15
lime            9                    0.445141   16
apple           0                    0.445141   17
banana          1                    0.445141   18

================================================================================
OPTIMIZED TOKEN ID ASSIGNMENTS (PCA-based)
================================================================================
Token           New ID (dec)    New ID (bin)
------------------------------------------------------------
honda           0               0b000000
nissan          1               0b000001
tesla           2               0b000010
toyota          3               0b000011
chevrolet       4               0b000100
ford            5               0b000101
pillow          6               0b000110
table           7               0b000111
curtain         8               0b001000
lamp            9               0b001001
sofa            10              0b001010
chair           11              0b001011
mango           12              0b001100
orange          13              0b001101
strawberry      14              0b001110
grape           15              0b001111
lime            16              0b010000
apple           17              0b010001
banana          18              0b010010

Total tokens: 19

================================================================================
OPTIMIZED TOKEN-DOCUMENT MATRIX (PCA-based ordering)
================================================================================

Notice how tokens from the same topic are now grouped together!

================================================================================
TOKEN-DOCUMENT MATRIX
================================================================================

Rows = Tokens (ordered by ID), Columns = Documents
1 = token present in document, 0 = token absent

Token           ID    D1 D2 D3 D4 D5 D6
--------------------------------------
honda           0      0  0  1  0  0  1
nissan          1      0  0  1  0  0  1
tesla           2      0  0  1  0  0  0
toyota          3      0  0  1  0  0  1
chevrolet       4      0  0  1  0  0  0
ford            5      0  0  1  0  0  0
pillow          6      0  1  0  0  0  0
table           7      0  1  0  0  0  0
curtain         8      0  1  0  0  0  0
lamp            9      0  1  0  0  1  0
sofa            10     0  1  0  0  1  0
chair           11     0  1  0  0  1  0
mango           12     1  0  0  0  0  0
orange          13     1  0  0  0  0  0
strawberry      14     1  0  0  0  0  0
grape           15     1  0  0  1  0  0
lime            16     1  0  0  1  0  0
apple           17     1  0  0  1  0  0
banana          18     1  0  0  1  0  0

================================================================================
OPTIMIZED DOCUMENT BIT VECTORS (PCA-based ordering)
================================================================================

Notice how the 1-bits are now clustered together!

================================================================================
DOCUMENT BIT VECTORS
================================================================================

Document 1 (fruits):
  Content: apple orange banana lime grape mango strawberry
  Tokens: ['apple', 'banana', 'grape', 'lime', 'mango', 'orange', 'strawberry']
  Bit vector (decimal): 520192
  Bit vector (binary):  1111111000000000000
  Bits set: 7

Document 2 (home_objects):
  Content: chair table lamp sofa pillow curtain
  Tokens: ['chair', 'curtain', 'lamp', 'pillow', 'sofa', 'table']
  Bit vector (decimal): 4032
  Bit vector (binary):  0000000111111000000
  Bits set: 6

Document 3 (car_models):
  Content: toyota honda tesla ford chevrolet nissan
  Tokens: ['chevrolet', 'ford', 'honda', 'nissan', 'tesla', 'toyota']
  Bit vector (decimal): 63
  Bit vector (binary):  0000000000000111111
  Bits set: 6

Document 4 (fruits):
  Content: lime apple banana grape
  Tokens: ['apple', 'banana', 'grape', 'lime']
  Bit vector (decimal): 491520
  Bit vector (binary):  1111000000000000000
  Bits set: 4

Document 5 (home_objects):
  Content: chair lamp sofa
  Tokens: ['chair', 'lamp', 'sofa']
  Bit vector (decimal): 3584
  Bit vector (binary):  0000000111000000000
  Bits set: 3

Document 6 (car_models):
  Content: honda toyota nissan
  Tokens: ['honda', 'nissan', 'toyota']
  Bit vector (decimal): 11
  Bit vector (binary):  0000000000000001011
  Bits set: 3
================================================================================
COMPRESSION EFFICIENCY COMPARISON
================================================================================

Metric: Number of runs (alternations between 0s and 1s)
Fewer runs = better compression with RLE

Random ordering:    50 runs
PCA-based ordering: 16 runs
Improvement:        34 fewer runs (68.0% reduction)

Example bit patterns:
--------------------------------------------------------------------------------

Document 1:
  Random:  0001001011001000011
  PCA:     1111111000000000000

Document 2:
  Random:  0010110000100010100
  PCA:     0000000111111000000

Document 3:
  Random:  1100000100010101000
  PCA:     0000000000000111111


================================================================================
CONCLUSION
================================================================================

With PCA-based ordering, tokens that co-occur are assigned
consecutive IDs, resulting in clustered bit patterns.
This creates longer runs of zeros and ones, improving compression
efficiency with run-length encoding or sparse bit vector representations!
================================================================================
"""
Sample dataset generator for demonstrating token-based document indexing.
This module creates a simple corpus of documents with tokens from different topics,
assigns unique IDs to each token, and generates bit vectors representing which
tokens appear in each document.
"""
import numpy as np
# Sample documents organized by topic
DOCUMENTS = [
    # Topic 1: Fruits
    "apple orange banana lime grape mango strawberry",
    # Topic 2: Objects at home
    "chair table lamp sofa pillow curtain",
    # Topic 3: Car models
    "toyota honda tesla ford chevrolet nissan",
    # Mixed documents
    "lime apple banana grape",  # More fruits
    "chair lamp sofa",  # More home objects
    "honda toyota nissan",  # More car models
]
TOPIC_LABELS = [
    "fruits",
    "home_objects",
    "car_models",
    "fruits",
    "home_objects",
    "car_models",
]
def extract_tokens(documents):
    """Extract unique tokens from all documents."""
    tokens = set()
    for doc in documents:
        tokens.update(doc.split())
    return sorted(tokens)  # Sort alphabetically for initial assignment

def assign_token_ids(tokens):
    """Assign sequential IDs to tokens (0-indexed)."""
    return {token: idx for idx, token in enumerate(tokens)}
def create_document_bitvector(document, token_to_id):
    """
    Create a bit vector for a document.

    Args:
        document: String containing space-separated tokens
        token_to_id: Dictionary mapping tokens to their IDs

    Returns:
        Integer representing the bit vector (bits set for present tokens)
    """
    bitvector = 0
    for token in document.split():
        token_id = token_to_id[token]
        bitvector |= (1 << token_id)  # Set bit at position token_id
    return bitvector
def print_tokens_and_ids(token_to_id):
    """Print all tokens with their IDs in decimal and binary."""
    print("=" * 60)
    print("TOKEN ID ASSIGNMENTS")
    print("=" * 60)
    print(f"{'Token':<15} {'ID (decimal)':<15} {'ID (binary)':<20}")
    print("-" * 60)
    for token in sorted(token_to_id.keys(), key=lambda t: token_to_id[t]):
        token_id = token_to_id[token]
        binary_repr = f"0b{token_id:06b}"
        print(f"{token:<15} {token_id:<15} {binary_repr:<20}")
    print(f"\nTotal tokens: {len(token_to_id)}")
    print()
def print_token_document_matrix(documents, token_to_id):
    """Print a visual matrix showing which tokens appear in which documents."""
    print("=" * 80)
    print("TOKEN-DOCUMENT MATRIX")
    print("=" * 80)
    print("\nRows = Tokens (ordered by ID), Columns = Documents")
    print("1 = token present in document, 0 = token absent")
    print()
    # Header
    print(f"{'Token':<15} {'ID':<5}", end="")
    for i in range(len(documents)):
        print(f" D{i+1}", end="")
    print()
    print("-" * (20 + len(documents) * 3))
    # Sort tokens by ID
    sorted_tokens = sorted(token_to_id.keys(), key=lambda t: token_to_id[t])
    for token in sorted_tokens:
        token_id = token_to_id[token]
        print(f"{token:<15} {token_id:<5}", end="")
        for doc in documents:
            present = "1" if token in doc.split() else "0"
            print(f" {present}", end="")
        print()
    print()
def print_document_bitvectors(documents, token_to_id, topic_labels):
    """Print bit vectors for all documents."""
    print("=" * 80)
    print("DOCUMENT BIT VECTORS")
    print("=" * 80)
    for idx, doc in enumerate(documents):
        bitvector = create_document_bitvector(doc, token_to_id)
        num_bits = len(token_to_id)
        binary_repr = f"{bitvector:0{num_bits}b}"
        print(f"\nDocument {idx + 1} ({topic_labels[idx]}):")
        print(f"  Content: {doc}")
        print(f"  Tokens: {sorted(doc.split())}")
        print(f"  Bit vector (decimal): {bitvector}")
        print(f"  Bit vector (binary):  {binary_repr}")
        print(f"  Bits set: {bin(bitvector).count('1')}")
def build_cooccurrence_matrix(documents, tokens):
    """
    Build a co-occurrence matrix for tokens.

    Args:
        documents: List of document strings
        tokens: List of unique tokens (ordered)

    Returns:
        numpy array where element [i,j] is the number of documents
        containing both tokens[i] and tokens[j]
    """
    n = len(tokens)
    matrix = np.zeros((n, n), dtype=int)
    token_to_idx = {token: idx for idx, token in enumerate(tokens)}
    for doc in documents:
        doc_tokens = doc.split()
        doc_token_indices = [token_to_idx[token] for token in doc_tokens]
        for i in doc_token_indices:
            for j in doc_token_indices:
                matrix[i, j] += 1
    return matrix
def print_cooccurrence_matrix(matrix, tokens):
    """Print the co-occurrence matrix in a readable format."""
    print("=" * 80)
    print("CO-OCCURRENCE MATRIX")
    print("=" * 80)
    print("\nRows and columns are tokens (ordered by ID)")
    print("Values show how many documents contain both tokens")
    print()
    n = len(tokens)
    max_token_len = max(len(token) for token in tokens)
    col_width = max(3, max_token_len)
    print(" " * (col_width + 1), end="")
    for i in range(n):
        print(f"{i:>3}", end=" ")
    print()
    print(" " * (col_width + 1), end="")
    for token in tokens:
        print(f"{token[:3]:>3}", end=" ")
    print()
    print("-" * (col_width + 1 + n * 4))
    for i, token in enumerate(tokens):
        print(f"{i:>2} {token:<{col_width}}", end=" ")
        for j in range(n):
            print(f"{matrix[i, j]:>3}", end=" ")
        print()
    print()
def compute_eigenvectors(matrix):
    """
    Compute eigenvalues and eigenvectors of a matrix.

    Args:
        matrix: Square numpy array

    Returns:
        Tuple of (eigenvalues, eigenvectors) where:
        - eigenvalues: 1D array sorted in descending order
        - eigenvectors: 2D array where column i is the eigenvector for eigenvalues[i]
    """
    eigenvalues, eigenvectors = np.linalg.eig(matrix)
    idx = eigenvalues.argsort()[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
    return eigenvalues, eigenvectors

def find_principal_eigenvector(matrix):
    """
    Find the eigenvector corresponding to the largest eigenvalue.

    Args:
        matrix: Square numpy array

    Returns:
        Tuple of (largest_eigenvalue, principal_eigenvector)
    """
    eigenvalues, eigenvectors = compute_eigenvectors(matrix)
    return eigenvalues[0], eigenvectors[:, 0]
def print_all_eigenvectors(matrix, tokens):
    """Print complete eigenvector matrix before sorting."""
    print("=" * 80)
    print("COMPLETE EIGENVECTOR MATRIX (before sorting)")
    print("=" * 80)
    print("\nRows = Tokens, Columns = Eigenvectors")
    print("Each column corresponds to an eigenvalue")
    print()
    eigenvalues, eigenvectors = compute_eigenvectors(matrix)
    n = len(tokens)
    print("All Eigenvalues (sorted descending):")
    for i in range(n):
        ev = eigenvalues[i]
        ev_real = ev.real if np.isreal(ev) else abs(ev)
        marker = " <- PRINCIPAL (will be used for reordering)" if i == 0 else ""
        print(f"  λ{i+1} = {ev_real:>10.4f}{marker}")
    print()
    print("Eigenvector Matrix:")
    print(f"{'Token':<15} {'ID':<5}", end="")
    for i in range(n):
        ev = eigenvalues[i]
        ev_real = ev.real if np.isreal(ev) else abs(ev)
        marker = "*" if i == 0 else ""
        print(f" λ{i+1}={ev_real:6.2f}{marker:1}", end="")
    print()
    print("-" * (20 + n * 14))
    for i, token in enumerate(tokens):
        print(f"{token:<15} {i:<5}", end="")
        for j in range(n):
            val = eigenvectors[i, j]
            val_real = val.real if np.isreal(val) else abs(val)
            print(f" {val_real:>13.6f}", end="")
        print()
    print()
    print("* Column with λ1 (Principal Eigenvector) will be used for sorting tokens")
    print()
def print_eigenvector_analysis(matrix, tokens):
"""Print eigenvalue and eigenvector analysis."""
print("=" * 80)
print("EIGENVALUE ANALYSIS")
print("=" * 80)
print()
eigenvalues, eigenvectors = compute_eigenvectors(matrix)
print("All Eigenvalues (sorted descending):")
for i, val in enumerate(eigenvalues):
if np.isreal(val):
print(f" λ{i+1} = {val.real:>10.4f}")
else:
print(f" λ{i+1} = {val:>10.4f}")
print()
print("=" * 80)
print("PRINCIPAL EIGENVECTOR (largest eigenvalue)")
print("=" * 80)
print()
largest_eigenvalue = eigenvalues[0]
principal_eigenvector = eigenvectors[:, 0]
    if np.isreal(largest_eigenvalue):
        largest_eigenvalue = largest_eigenvalue.real
    print(f"Largest eigenvalue: {largest_eigenvalue:.4f}")
print()
print(f"{'Token':<15} {'ID':<5} {'Eigenvector Value':<20}")
print("-" * 50)
for i, token in enumerate(tokens):
val = principal_eigenvector[i]
if np.isreal(val):
print(f"{token:<15} {i:<5} {val.real:>18.6f}")
else:
print(f"{token:<15} {i:<5} {str(val):>18}")
print()
def reorder_tokens_by_pca(tokens, cooccurrence_matrix):
"""
Reorder tokens based on principal eigenvector values.
Args:
tokens: List of tokens in original order
cooccurrence_matrix: Co-occurrence matrix for the tokens
Returns:
Tuple of (reordered_tokens, token_value_pairs) where token_value_pairs
contains (token, old_id, eigenvector_value) tuples sorted by eigenvector value
"""
_, principal_eigenvector = find_principal_eigenvector(cooccurrence_matrix)
    def eig_value(i):
        v = principal_eigenvector[i]
        return v.real if np.isreal(v) else abs(v)
    token_value_pairs = [(token, i, eig_value(i)) for i, token in enumerate(tokens)]
token_value_pairs.sort(key=lambda x: x[2])
reordered_tokens = [token for token, _, _ in token_value_pairs]
return reordered_tokens, token_value_pairs
def print_pca_reordering(tokens, cooccurrence_matrix):
"""Print the PCA-based reordering results."""
print("=" * 80)
print("PCA-BASED TOKEN REORDERING")
print("=" * 80)
print("\nSorting tokens by principal eigenvector values...")
print("This clusters tokens that co-occur together.")
print()
reordered_tokens, token_value_pairs = reorder_tokens_by_pca(tokens, cooccurrence_matrix)
print(f"{'Token':<15} {'Old ID':<10} {'Eigenvector Value':<20} {'New ID':<10}")
print("-" * 65)
for new_id, (token, old_id, value) in enumerate(token_value_pairs):
print(f"{token:<15} {old_id:<10} {value:>18.6f} {new_id:<10}")
print()
return reordered_tokens
def print_optimized_token_ids(pca_token_to_id):
"""Print optimized token IDs after PCA reordering."""
print("=" * 80)
print("OPTIMIZED TOKEN ID ASSIGNMENTS (PCA-based)")
print("=" * 80)
print(f"{'Token':<15} {'New ID (dec)':<15} {'New ID (bin)':<20}")
print("-" * 60)
for token in sorted(pca_token_to_id.keys(), key=lambda t: pca_token_to_id[t]):
token_id = pca_token_to_id[token]
binary_repr = f"0b{token_id:06b}"
print(f"{token:<15} {token_id:<15} {binary_repr:<20}")
print(f"\nTotal tokens: {len(pca_token_to_id)}")
print()
def calculate_compression_metrics(documents, token_to_id):
"""Calculate compression metrics for bit vectors."""
total_runs = 0
total_bits = 0
for doc in documents:
bitvector = create_document_bitvector(doc, token_to_id)
num_bits = len(token_to_id)
binary_str = f"{bitvector:0{num_bits}b}"
runs = 1
for i in range(1, len(binary_str)):
if binary_str[i] != binary_str[i-1]:
runs += 1
total_runs += runs
total_bits += num_bits
return total_runs, total_bits
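# The run counts computed above map directly to RLE size: each run becomes one
# (bit, length) pair, so fewer runs means fewer pairs to store. A standalone
# sketch of the encoding itself:

```python
from itertools import groupby

def rle_encode(bits):
    """Run-length encode a bit string into (bit, run_length) pairs."""
    return [(b, sum(1 for _ in g)) for b, g in groupby(bits)]

clustered = "111100000000"  # same four 1-bits, clustered: 2 runs -> 2 pairs
scattered = "100100010010"  # same four 1-bits, scattered: 8 runs -> 8 pairs
print(len(rle_encode(clustered)), len(rle_encode(scattered)))
```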
def print_compression_comparison(documents, random_token_to_id, pca_token_to_id):
"""Print comparison of compression efficiency between random and PCA ordering."""
print("=" * 80)
print("COMPRESSION EFFICIENCY COMPARISON")
print("=" * 80)
print()
random_runs, total_bits = calculate_compression_metrics(documents, random_token_to_id)
pca_runs, _ = calculate_compression_metrics(documents, pca_token_to_id)
print("Metric: Number of runs (alternations between 0s and 1s)")
print("Fewer runs = better compression with RLE")
print()
print(f"Random ordering: {random_runs} runs")
print(f"PCA-based ordering: {pca_runs} runs")
print(f"Improvement: {random_runs - pca_runs} fewer runs ({(1 - pca_runs/random_runs)*100:.1f}% reduction)")
print()
print("Example bit patterns:")
print("-" * 80)
for i in range(min(3, len(documents))):
doc = documents[i]
random_bv = create_document_bitvector(doc, random_token_to_id)
pca_bv = create_document_bitvector(doc, pca_token_to_id)
num_bits = len(random_token_to_id)
random_bin = f"{random_bv:0{num_bits}b}"
pca_bin = f"{pca_bv:0{num_bits}b}"
print(f"\nDocument {i+1}:")
print(f" Random: {random_bin}")
print(f" PCA: {pca_bin}")
print()
def main():
"""Main function to demonstrate the token indexing system."""
print("\n" + "=" * 80)
print("SAMPLE DATASET FOR TOKEN-BASED DOCUMENT INDEXING")
print("=" * 80)
print()
tokens = extract_tokens(DOCUMENTS)
token_to_id = assign_token_ids(tokens)
print_tokens_and_ids(token_to_id)
print_token_document_matrix(DOCUMENTS, token_to_id)
print_document_bitvectors(DOCUMENTS, token_to_id, TOPIC_LABELS)
cooccurrence_matrix = build_cooccurrence_matrix(DOCUMENTS, tokens)
print_cooccurrence_matrix(cooccurrence_matrix, tokens)
print_eigenvector_analysis(cooccurrence_matrix, tokens)
print_all_eigenvectors(cooccurrence_matrix, tokens)
reordered_tokens = print_pca_reordering(tokens, cooccurrence_matrix)
pca_token_to_id = assign_token_ids(reordered_tokens)
print_optimized_token_ids(pca_token_to_id)
print("=" * 80)
print("OPTIMIZED TOKEN-DOCUMENT MATRIX (PCA-based ordering)")
print("=" * 80)
print("\nNotice how tokens from the same topic are now grouped together!")
print()
print_token_document_matrix(DOCUMENTS, pca_token_to_id)
print("=" * 80)
print("OPTIMIZED DOCUMENT BIT VECTORS (PCA-based ordering)")
print("=" * 80)
print("\nNotice how the 1-bits are now clustered together!")
print()
print_document_bitvectors(DOCUMENTS, pca_token_to_id, TOPIC_LABELS)
print_compression_comparison(DOCUMENTS, token_to_id, pca_token_to_id)
print("\n" + "=" * 80)
print("CONCLUSION")
print("=" * 80)
print("\nWith PCA-based ordering, tokens that co-occur are assigned")
print("consecutive IDs, resulting in clustered bit patterns.")
print("This creates longer runs of zeros and ones, improving compression")
print("efficiency with run-length encoding or sparse bit vector representations!")
print("=" * 80)
print()
if __name__ == "__main__":
main()