auxten/chdb-ds-arch.md

## chdb-ds-arch.md

      
    Raw
  

              chdb-ds-arch.md
            
          
    ┌─────────────────────────────────────────────────────────────┐
│                      User Code (pandas-style)               │
│                                                             │
│  ds = DataStore.from_file("events.parquet")                 │
│  ds = ds[ds['age'] > 25]          # filter                 │
│  ds = ds.groupby('city')['salary'].mean()                  │
│  print(ds)                         # triggers execution     │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                  ColumnExpr  (Expression Tree)              │
│                                                             │
│  ds['age'] > 25  compiles to:                               │
│                                                             │
│       BinaryCondition                                       │
│        ├── left:  Field('age')                              │
│        ├── op:    '>'                                       │
│        └── right: Literal(25)                               │
│                                                             │
│  Expression types:                                          │
│  ┌──────────┐ ┌─────────┐ ┌──────────┐ ┌────────────────┐  │
│  │  Field   │ │ Literal │ │ Function │ │ ArithmeticExpr │  │
│  │ "col"    │ │ 42,NULL │ │ sum(x)   │ │ (a + b) * c    │  │
│  └──────────┘ └─────────┘ └──────────┘ └────────────────┘  │
│  ┌──────────────────┐ ┌───────────────┐ ┌──────────────┐   │
│  │ BinaryCondition  │ │ InCondition   │ │ NotCondition │   │
│  │ a > b, a == b    │ │ x IN (1,2,3)  │ │ NOT (cond)   │   │
│  └──────────────────┘ └───────────────┘ └──────────────┘   │
│                                                             │
│  Every node has  .to_sql()  →  "age" > 25                  │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                  LazyOp Chain  (Logical Plan)               │
│                                                             │
│  Each DataFrame method appends a LazyOp to the chain:      │
│                                                             │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐    │
│  │ DataSource   │──▶│ RelationalOp │──▶│ GroupByAgg   │    │
│  │ (parquet)    │   │ WHERE age>25 │   │ city → mean  │    │
│  └──────────────┘   └──────────────┘   └──────────────┘    │
│                                                             │
│  SQL-pushable ops:        Pandas-only ops:                  │
│  ✓ RelationalOp (WHERE,   ✗ Apply (callable)               │
│    SELECT, ORDER BY,      ✗ Transform (callable)           │
│    LIMIT, OFFSET)         ✗ FillNA, DropNA                 │
│  ✓ ColumnSelection        ✗ AddPrefix, AddSuffix           │
│  ✓ GroupByAgg             ✗ Rename, AsType                 │
│  ✓ ColumnAssignment*                                       │
│  ✓ Join, Union, Distinct                                   │
│                                                             │
│  * = conditional, depends on expression compatibility       │
└──────────────────────┬──────────────────────────────────────┘
                       │  trigger: len() / print() / .columns
                       ▼
┌─────────────────────────────────────────────────────────────┐
│              QueryPlanner  (Segment Splitting)              │
│                                                             │
│  Walks the LazyOp chain, checks can_push_to_sql() per op,  │
│  and splits into alternating execution segments:            │
│                                                             │
│  ┌─────────────────┐  ┌───────────────┐  ┌──────────────┐  │
│  │  Segment 1 SQL  │─▶│ Segment 2 PD  │─▶│ Segment 3 SQL│  │
│  │ WHERE, SELECT,  │  │ apply(), map()│  │ WHERE, LIMIT │  │
│  │ GROUP BY        │  │               │  │ (on Python())│  │
│  └─────────────────┘  └───────────────┘  └──────────────┘  │
│                                                             │
│  Also handles: nested subqueries (WHERE after LIMIT),       │
│  alias conflicts, column dependency tracking                │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│            SQLExecutionEngine  (SQL Generation)             │
│                                                             │
│  For each SQL segment:                                      │
│                                                             │
│  1. extract_clauses_from_ops()                              │
│     → where_conditions, select_fields, orderby, limit      │
│                                                             │
│  2. Each clause calls  expr.to_sql()  on its ColumnExprs    │
│                                                             │
│  3. Assemble final query:                                   │
│  ┌────────────────────────────────────────────────────┐     │
│  │ SELECT "city", avg("salary") AS "salary"           │     │
│  │ FROM Python(__df__)          ← or file/table       │     │
│  │ WHERE "age" > 25                                   │     │
│  │ GROUP BY "city"                                    │     │
│  │ ORDER BY "city" ASC                                │     │
│  │ LIMIT 100                                          │     │
│  └────────────────────────────────────────────────────┘     │
│                                                             │
│  Complex cases: nested subqueries, CASE WHEN for           │
│  where/mask, multi-layer wrapping                           │
└──────────────────────┬──────────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────────┐
│                   chDB Engine (Execution)                    │
│                                                             │
│  Embedded ClickHouse runs the SQL in-process.               │
│  Python() table function enables zero-copy exchange         │
│  between pandas DataFrames and SQL segments.                │
└─────────────────────────────────────────────────────────────┘
No results found