@kellabyte
Created October 18, 2025 01:27
Bifrost First Run

Bifrost is a DuckDB extension that bridges the OLTP (TursoDB) and OLAP (DuckDB) worlds through a single ACID transaction layer, storing OLAP-native Parquet segments in TursoDB.

Bifrost is written by Codex with approximately 24 hours of compute time, and we completed milestone 3 from the project plan. Bifrost contains zero human-written code, as an experiment in what is possible. The decisions along the way have been heavily human-directed, however.

The project plan was created from my (@kellabyte) requirements via Codex. This document is written only by me with no help from AI.

Query Routing

Bifrost supports two ways of routing queries.

  1. Auto: Bifrost detects whether the query would benefit from an OLAP orientation. A query against the events table routes to the events_olap table when it would, and to the events_oltp table otherwise.
  2. Manual: You can explicitly query the events_oltp table or the events_olap table directly if you wish to skip the auto-detection mechanism.
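A minimal sketch of the two modes, using the events table from the demo below (the exact routing heuristics are Bifrost's; which queries trigger OLAP routing is my assumption here):

```sql
-- Auto: query the base table name and let Bifrost choose the orientation.
-- An aggregate scan like this would presumably be detected as OLAP-friendly.
SELECT user_id, COUNT(*) FROM t.events GROUP BY user_id;

-- Manual: name the physical table to skip auto-detection.
SELECT * FROM t.events_oltp WHERE id = 'some-id';  -- row-store point lookup
SELECT SUM(amount) FROM t.events_olap;             -- columnar scan
```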

Simple Performance Test Comparing OLTP vs OLAP

Load the Bifrost DuckDB extension.

./build/duckdb/duckdb -unsigned
DuckDB v1.4.1 (Andium) b390a7c
Enter ".help" for usage hints.
D .timer on
D
D LOAD './build/bifrost.duckdb_extension';

ATTACH 'file:/tmp/bifrost_demo.db' AS t (TYPE turso);
Run Time (s): real 0.005 user 0.000853 sys 0.003439

Load 101 million rows of sample data into a new table.

CREATE TABLE t.events (
    id         TEXT PRIMARY KEY DEFAULT uuid(),
    user_id    TEXT NOT NULL,
    product_id TEXT NOT NULL,
    amount     DECIMAL(10,2),
    created_at TIMESTAMP,
    is_test    BOOLEAN
);
Run Time (s): real 0.001 user 0.000254 sys 0.000367

INSERT INTO t.events (id, user_id, product_id, amount, created_at, is_test)
SELECT
    uuid()                                             AS id,
    'user_' || (i % 5000)                              AS user_id,
    'product_' || (i % 250)                            AS product_id,
    ROUND(10 + random() * 90, 2)::DECIMAL(10,2)        AS amount,
    TIMESTAMP '2024-01-01' + INTERVAL (i % 365) DAY    AS created_at,
    (i % 10 = 0)                                       AS is_test
FROM range(101000000) AS t(i);

Bifrost isn't yet at the milestone where it commits OLAP data within the same transaction as the INSERTs, so we explicitly call Bifrost to process the OLTP data into the OLAP table. The OLTP rows and the OLAP Parquet segments share the same backing store: both live inside the same Turso storage engine. In a later milestone we will fuse the OLTP and OLAP transactions together.

CALL t.sync_columnar_projection('events');
100% ▕██████████████████████████████████████▏ (00:02:37.50 elapsed)
┌─────────────────┬──────────────────┬──────────────┐
│     status      │ segments_emitted │ rows_emitted │
│     varchar     │      int64       │    int64     │
├─────────────────┼──────────────────┼──────────────┤
│ segment_created │       1010       │  101000000   │
└─────────────────┴──────────────────┴──────────────┘
Run Time (s): real 167.276 user 280.957738 sys 20.694401
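For scale, the sync output implies each Parquet segment holds about 100,000 rows; this is just the arithmetic from the table above, checked in the same DuckDB session:

```sql
-- 101,000,000 rows emitted across 1,010 segments
SELECT 101000000 // 1010 AS rows_per_segment;  -- 100000
```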

Now we query the row-oriented OLTP Turso table.

SELECT user_id,
         COUNT(*) AS orders,
         SUM(amount) AS total_amount
  FROM t.events_oltp
  WHERE created_at BETWEEN TIMESTAMP '2024-03-01' AND TIMESTAMP '2024-06-30'
    AND NOT is_test
  GROUP BY 1
  ORDER BY total_amount DESC
  LIMIT 10;
100% ▕██████████████████████████████████████▏ (00:00:58.19 elapsed)
┌───────────┬────────┬───────────────┐
│  user_id  │ orders │ total_amount  │
│  varchar  │ int64  │ decimal(38,2) │
├───────────┼────────┼───────────────┤
│ user_4506 │   6916 │     387229.31 │
│ user_1766 │   6916 │     386937.49 │
│ user_3571 │   6921 │     386765.84 │
│ user_4125 │   6924 │     386511.71 │
│ user_1201 │   6921 │     386468.41 │
│ user_3971 │   6917 │     386088.11 │
│ user_2925 │   6918 │     386026.75 │
│ user_4061 │   6915 │     386017.55 │
│ user_911  │   6924 │     385996.44 │
│ user_3546 │   6920 │     385976.38 │
├───────────┴────────┴───────────────┤
│ 10 rows                  3 columns │
└────────────────────────────────────┘
Run Time (s): real 58.223 user 34.179090 sys 5.639712

Now we run the exact same query against the Parquet segments, also stored in TursoDB.

SELECT user_id,
         COUNT(*) AS orders,
         SUM(amount) AS total_amount
  FROM t.events_olap
  WHERE created_at BETWEEN TIMESTAMP '2024-03-01' AND TIMESTAMP '2024-06-30'
    AND NOT is_test
  GROUP BY 1
  ORDER BY total_amount DESC
  LIMIT 10;
100% ▕██████████████████████████████████████▏ (00:00:06.30 elapsed)
┌───────────┬────────┬───────────────┐
│  user_id  │ orders │ total_amount  │
│  varchar  │ int64  │ decimal(38,2) │
├───────────┼────────┼───────────────┤
│ user_4506 │   6916 │     387229.31 │
│ user_1766 │   6916 │     386937.49 │
│ user_3571 │   6921 │     386765.84 │
│ user_4125 │   6924 │     386511.71 │
│ user_1201 │   6921 │     386468.41 │
│ user_3971 │   6917 │     386088.11 │
│ user_2925 │   6918 │     386026.75 │
│ user_4061 │   6915 │     386017.55 │
│ user_911  │   6924 │     385996.44 │
│ user_3546 │   6920 │     385976.38 │
├───────────┴────────┴───────────────┤
│ 10 rows                  3 columns │
└────────────────────────────────────┘
Run Time (s): real 8.885 user 12.966260 sys 1.520343

Notes

It seems Codex has created the building blocks for us to implement atomic OLTP+OLAP commits, and we achieved roughly 6.5x better query performance (58.2 s vs 8.9 s real time).

Is this worth it? I'm not sure storing a duplicate of the data is worth only a ~6x performance improvement. But damn, life is so much simpler when you can serve immediate analytics. Two competing workloads in one system maybe doesn't make sense. But it's fun to try.

But maybe there's a lot more room for improvement. Maybe this is a promising start? I'm pretty sure we aren't parallelizing compute across the CPU cores yet. Although we did a lot of work on in-memory batching, and I steered it away from some bad choices it was going to make, I'm not sure I fully understand everything yet.

Also, one big question is which store is the primary entry point? Bifrost is currently a DuckDB front end that leverages TursoDB internally for all its storage.

My gut tells me that you would want TursoDB to be the front end and DuckDB used behind the scenes, but Codex seemed more confident in this approach, so I wanted to see if we would end up with a working system. We did!

Here's the thing: we've been iterating a shit ton in 72 hours. I can't keep up. Should I go slower? Probably! If this were a day-job task I wouldn't be approaching it like this. But I'm SO CURIOUS where this goes.

I'm always thinking up ideas, and if AI lets me prototype all the ideas in my head as working systems, that is very exciting!

I haven't pushed the code to GitHub yet because I didn't feel it was ready, but after a few more things I think it might be interesting to share what Codex has written.
