@krisajenkins
Created February 27, 2026 09:31
A Cortex-generated architecture diagram for pg_lake

pg_lake Architecture

System Overview

pg_lake turns PostgreSQL into a lakehouse by integrating Apache Iceberg and data lake file formats (Parquet, CSV, JSON) directly into PostgreSQL. Users connect only to PostgreSQL — the extensions transparently delegate data scanning and computation to pgduck_server (a standalone process backed by DuckDB) while PostgreSQL handles query planning, transaction boundaries, and Iceberg metadata management.

graph LR
    Client["Client (psql, app)"]
    PG["PostgreSQL<br/>+ pg_lake extensions"]
    PGDuck["pgduck_server<br/>(DuckDB engine)"]
    Storage["Object Storage<br/>(S3 / GCS)"]

    Client -->|"SQL (port 5432)"| PG
    PG -->|"libpq (unix socket :5332)"| PGDuck
    PG -->|"Iceberg metadata<br/>(Avro read/write)"| Storage
    PGDuck -->|"Parquet/CSV/JSON<br/>read & write"| Storage

Extension Hierarchy

All extensions install into pg_catalog. The requires field in each .control file defines the dependency chain. Installing pg_lake with CASCADE pulls in the entire tree.

graph TD
    pg_lake["<b>pg_lake</b><br/><i>Meta-extension (installs everything)</i>"]
    pg_lake_copy["<b>pg_lake_copy</b><br/><i>COPY to/from data lake files</i>"]
    pg_lake_table["<b>pg_lake_table</b><br/><i>FDW for lake & Iceberg tables</i>"]
    pg_lake_iceberg["<b>pg_lake_iceberg</b><br/><i>Iceberg v2 protocol impl</i>"]
    pg_lake_engine["<b>pg_lake_engine</b><br/><i>Shared query engine module</i>"]
    pg_extension_base["<b>pg_extension_base</b><br/><i>Extension dev kit / foundation</i>"]
    pg_map["<b>pg_map</b><br/><i>Generic map type for PG</i>"]
    pg_extension_updater["<b>pg_extension_updater</b><br/><i>Auto extension updates</i>"]
    btree_gist["<b>btree_gist</b><br/><i>PostgreSQL contrib</i>"]

    pg_lake_spatial["<b>pg_lake_spatial</b><br/><i>Geospatial file format support</i>"]
    postgis["<b>PostGIS</b><br/><i>(external)</i>"]
    pg_lake_benchmark["<b>pg_lake_benchmark</b><br/><i>Benchmarking</i>"]

    pg_lake --> pg_lake_table
    pg_lake --> pg_lake_copy
    pg_lake_copy --> pg_lake_engine
    pg_lake_copy --> pg_lake_iceberg
    pg_lake_copy --> pg_lake_table
    pg_lake_table --> pg_lake_engine
    pg_lake_table --> pg_lake_iceberg
    pg_lake_table --> btree_gist
    pg_lake_iceberg --> pg_lake_engine
    pg_lake_engine --> pg_extension_base
    pg_lake_engine --> pg_map
    pg_extension_updater --> pg_extension_base

    pg_lake_spatial --> pg_lake
    pg_lake_spatial --> postgis
    pg_lake_benchmark --> pg_lake

    style pg_lake fill:#4a90d9,color:#fff
    style pg_lake_table fill:#5ba85b,color:#fff
    style pg_lake_iceberg fill:#d9a54a,color:#fff
    style pg_lake_engine fill:#c75050,color:#fff
    style pg_lake_copy fill:#7b68ae,color:#fff
    style pg_extension_base fill:#888,color:#fff
    style pg_map fill:#888,color:#fff
    style pg_extension_updater fill:#888,color:#fff
    style btree_gist fill:#aaa,color:#fff
    style pg_lake_spatial fill:#4a90d9,color:#fff,stroke-dasharray: 5
    style pg_lake_benchmark fill:#4a90d9,color:#fff,stroke-dasharray: 5
    style postgis fill:#aaa,color:#fff,stroke-dasharray: 5
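The dependency resolution that CASCADE performs can be sketched from the edges in the diagram above. This is an illustrative depth-first resolution in Python, not PostgreSQL's actual extension machinery; the edge list mirrors the graph, and install_order() is a hypothetical helper.

```python
# Dependency edges from the requires fields, as drawn in the diagram above.
REQUIRES = {
    "pg_lake": ["pg_lake_table", "pg_lake_copy"],
    "pg_lake_copy": ["pg_lake_engine", "pg_lake_iceberg", "pg_lake_table"],
    "pg_lake_table": ["pg_lake_engine", "pg_lake_iceberg", "btree_gist"],
    "pg_lake_iceberg": ["pg_lake_engine"],
    "pg_lake_engine": ["pg_extension_base", "pg_map"],
    "pg_extension_base": [],
    "pg_map": [],
    "btree_gist": [],
}

def install_order(ext: str, installed=None) -> list[str]:
    """Depth-first post-order: every dependency installs before its dependent."""
    if installed is None:
        installed = []
    for dep in REQUIRES.get(ext, []):
        if dep not in installed:
            install_order(dep, installed)
    if ext not in installed:
        installed.append(ext)
    return installed

order = install_order("pg_lake")
# Foundations first, the meta-extension last.
assert order[-1] == "pg_lake"
assert order.index("pg_lake_engine") > order.index("pg_extension_base")
```

Post-order traversal is exactly why installing the single meta-extension with CASCADE is sufficient: every node's prerequisites are visited (and installed) before the node itself.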

Extension responsibilities

| Extension | Role |
| --- | --- |
| pg_extension_base | Foundation for all extensions. Provides background workers, library preloading, extension control file parsing, and PG version compatibility shims. |
| pg_map | Generic key-value map type generator for semi-structured data in PostgreSQL. |
| pg_extension_updater | Automatically updates installed extensions on startup. |
| pg_lake_engine | Shared module across all pg_lake extensions. Houses the pgduck client (pgduck/client.c), query rewriting, type mapping between PG and DuckDB, read/write data generation, CSV/Parquet/JSON handling, S3/GCS utilities, and the utility hook infrastructure. |
| pg_lake_iceberg | Full Iceberg v2 specification implementation. Manages table metadata, snapshots, manifests, manifest entries, data files, schemas, and partitioning, all serialized via Avro. Supports object store and REST catalogs. |
| pg_lake_table | Foreign Data Wrapper (FDW) that exposes data lake files and Iceberg tables as PostgreSQL foreign tables. Implements the full FDW API (scan, insert, update, delete), DDL handling (CREATE TABLE ... USING iceberg), full query pushdown via a custom planner hook, snapshot management, file pruning, and transaction hooks for Iceberg metadata commits. Also provides a placeholder table access method (pg_lake_iceberg) for CREATE TABLE ... USING syntax. |
| pg_lake_copy | Extends PostgreSQL's COPY command to import and export data in data lake file formats. |
| pg_lake | Meta-extension. Installing it with CASCADE installs all required extensions. |
| pg_lake_spatial | Optional. Adds geospatial file format support (GeoParquet). Requires PostGIS. |

Codebase Organisation

graph TD
    subgraph repo["pg_lake repository"]
        direction TB

        subgraph core_ext["Core Extensions (C, PGXS)"]
            PEB["pg_extension_base/"]
            PEU["pg_extension_updater/"]
            PM["pg_map/"]
            PLE["pg_lake_engine/"]
            PLI["pg_lake_iceberg/"]
            PLT["pg_lake_table/"]
            PLC["pg_lake_copy/"]
            PL["pg_lake/"]
        end

        subgraph optional_ext["Optional Extensions"]
            PLS["pg_lake_spatial/"]
            PLB["pg_lake_benchmark/"]
        end

        subgraph external["External Components"]
            PDS["pgduck_server/<br/>(standalone C server)"]
            DPL["duckdb_pglake/<br/>(DuckDB C++ extension)"]
            AVRO["avro/<br/>(patched Apache Avro)"]
        end

        subgraph support["Supporting"]
            TC["test_common/<br/>(shared pytest fixtures)"]
            TOOLS["tools/<br/>(version bumping, etc.)"]
            DOCS["docs/"]
            DOCKER["docker/"]
            DATA["data/<br/>(test data)"]
        end
    end

    style core_ext fill:#e8f0fe,stroke:#4a90d9
    style optional_ext fill:#fef7e0,stroke:#d9a54a
    style external fill:#fde8e8,stroke:#c75050
    style support fill:#e8fee8,stroke:#5ba85b

Key source file areas within pg_lake_engine/src/

| Directory | Purpose |
| --- | --- |
| pgduck/ | Client connection to pgduck_server (client.c), query rewriting (rewrite_query.c), read/write query generation (read_data.c, write_data.c), DuckDB type mapping (type.c), data serialization, and shippable function/operator lists. |
| parquet/ | Parquet schema field mapping and GeoParquet support. |
| csv/, json/ | Format-specific options and readers. |
| copy/ | Remote copy query generation. |
| cleanup/ | Deletion queue and in-progress file tracking. |
| storage/ | Local storage utilities. |
| ddl/ | Utility hook for DDL interception. |
| extensions/ | Per-extension version and dependency metadata. |
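The type mapping maintained in pgduck/type.c can be pictured as a lookup from PG type names to DuckDB equivalents. The pairs below are plausible equivalences chosen for illustration, not a dump of the real table, and map_pg_type() is a hypothetical helper.

```python
# Illustrative subset of a PG -> DuckDB type mapping. A type with no entry
# cannot be shipped to pgduck_server, so the query falls back to PG execution.
PG_TO_DUCKDB = {
    "integer": "INTEGER",
    "bigint": "BIGINT",
    "text": "VARCHAR",
    "boolean": "BOOLEAN",
    "double precision": "DOUBLE",
    "timestamp with time zone": "TIMESTAMPTZ",
    "bytea": "BLOB",
}

def map_pg_type(pg_type: str) -> str:
    """Resolve a PG type name to its DuckDB spelling, or fail as unshippable."""
    try:
        return PG_TO_DUCKDB[pg_type]
    except KeyError:
        raise ValueError(f"type {pg_type!r} has no DuckDB mapping; not shippable")
```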

Key source file areas within pg_lake_table/src/

| Directory | Purpose |
| --- | --- |
| fdw/ | FDW callback implementations (pg_lake_table.c), snapshot management (snapshot.c), query deparsing, data file pruning, partitioning, writable table support, row ID management, and position delete handling. |
| planner/ | Full query pushdown via planner hook (query_pushdown.c), insert-select optimization, and EXPLAIN support. |
| ddl/ | CREATE TABLE (both CREATE FOREIGN TABLE and CREATE TABLE ... USING iceberg), ALTER TABLE, DROP TABLE, VACUUM. |
| transaction/ | Iceberg metadata change tracking within PG transactions (track_iceberg_metadata_changes.c) and transaction commit/abort hooks (transaction_hooks.c). |
| access_method/ | Placeholder table access method for USING iceberg syntax. |
| duckdb/ | Query transformation from PG parse tree to DuckDB-compatible SQL. |

Key source file areas within pg_lake_iceberg/src/

| Directory | Purpose |
| --- | --- |
| iceberg/api/ | Core Iceberg data structures: table_metadata.c, snapshot.c, manifest.c, manifest_list.c, manifest_entry.c, datafile.c, table_schema.c. |
| iceberg/ | Catalog management (catalog.c), metadata read/write, partitioning, manifest merge, and Iceberg type serialization (binary and JSON). |
| avro/ | Avro reader/writer for serializing Iceberg metadata files. |
| rest_catalog/ | REST catalog API client. |
| object_store_catalog/ | Object store (filesystem-style) catalog. |
| http/ | HTTP client for REST catalog communication. |

Data Flow: CREATE TABLE, INSERT, SELECT

The following diagrams trace the complete lifecycle of creating an Iceberg table, inserting rows, and querying them.

1. CREATE TABLE

sequenceDiagram
    participant Client
    participant PG as PostgreSQL
    participant PLT as pg_lake_table<br/>(DDL hooks)
    participant PLI as pg_lake_iceberg<br/>(metadata)
    participant S3 as Object Storage

    Client->>PG: CREATE TABLE t (id int, name text)<br/>USING iceberg
    PG->>PLT: Utility hook intercepts CREATE stmt

    Note over PLT: ProcessCreateIcebergTableFromCreateStmt()<br/>Converts to CREATE FOREIGN TABLE<br/>with SERVER pg_lake_iceberg

    PLT->>PLT: Validate column types, set options<br/>(location from default_location_prefix)
    PLT->>PG: Execute CREATE FOREIGN TABLE internally

    PG-->>PLT: Foreign table created in pg_catalog

    PLT->>PLI: InsertInternalIcebergCatalogTable()<br/>Register in lake_iceberg.tables
    PLI->>PLI: Initialize Iceberg table metadata<br/>(schema, snapshot, partition spec)
    PLI->>S3: Write initial metadata.json<br/>(via Avro serialization)

    PLT->>PLT: Register field ID mappings<br/>Initialize data file catalogs

    PG-->>Client: CREATE TABLE OK
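The core of the utility hook's job is a statement rewrite. The real ProcessCreateIcebergTableFromCreateStmt() operates on the parse tree, not on SQL text, but a toy string-level version shows the shape of the transform; the regex, the rewrite helper, and the s3://bucket/tables prefix standing in for default_location_prefix are all illustrative assumptions.

```python
import re

def rewrite_create_stmt(sql: str, location_prefix: str = "s3://bucket/tables") -> str:
    """Turn CREATE TABLE ... USING iceberg into an equivalent CREATE FOREIGN TABLE."""
    m = re.match(r"CREATE TABLE (\w+) \((.*)\) USING iceberg", sql, re.IGNORECASE)
    if m is None:
        return sql  # not an iceberg CREATE; pass through untouched
    table, cols = m.group(1), m.group(2)
    return (
        f"CREATE FOREIGN TABLE {table} ({cols}) "
        f"SERVER pg_lake_iceberg "
        f"OPTIONS (location '{location_prefix}/{table}')"
    )

print(rewrite_create_stmt("CREATE TABLE t (id int, name text) USING iceberg"))
```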

2. INSERT

sequenceDiagram
    participant Client
    participant PG as PostgreSQL
    participant FDW as pg_lake_table<br/>(FDW callbacks)
    participant Engine as pg_lake_engine<br/>(pgduck client)
    participant PGDuck as pgduck_server<br/>(DuckDB)
    participant S3 as Object Storage
    participant TxTrack as Transaction<br/>Tracker

    Client->>PG: INSERT INTO t VALUES (1,'alice'),(2,'bob')

    PG->>FDW: BeginForeignModify()
    FDW->>FDW: Create DestReceiver for inserts<br/>(partitioned if needed)

    PG->>FDW: ExecForeignInsert() [per row]
    FDW->>FDW: Buffer rows in CSV via COPY TO

    PG->>FDW: EndForeignModify()

    Note over FDW: Flush buffered CSV to local temp file

    FDW->>Engine: ConvertCSVFileTo()<br/>Generate DuckDB COPY command
    Engine->>PGDuck: COPY (SELECT ... FROM read_csv(...))<br/>TO 's3://bucket/.../data-xxx.parquet'<br/>(via libpq over unix socket)
    PGDuck->>S3: Write Parquet file
    PGDuck-->>Engine: Write complete + stats

    FDW->>FDW: Record new data file in<br/>pg_lake catalogs (data_files, stats)
    FDW->>TxTrack: TrackIcebergMetadataChangesInTx()<br/>Record pending ADD_FILE operation

    Note over TxTrack: Changes held until COMMIT

    Client->>PG: COMMIT
    PG->>TxTrack: IcebergXactCallback(PRE_COMMIT)
    TxTrack->>TxTrack: ConsumeTrackedIcebergMetadataChanges()<br/>Diff current vs. last-pushed metadata

    TxTrack->>TxTrack: Build new Iceberg snapshot:<br/>• New manifest entry for data file<br/>• New manifest (Avro)<br/>• New manifest list (Avro)<br/>• New snapshot in table metadata

    TxTrack->>S3: Write manifest Avro files
    TxTrack->>S3: Write updated metadata.json

    PG->>TxTrack: IcebergXactCallback(COMMIT)
    TxTrack->>TxTrack: PostAllRestCatalogRequests()<br/>(if using REST catalog)

    PG-->>Client: COMMIT OK
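The deferred-commit behaviour in the diagram can be reduced to a small sketch: metadata changes accumulate in memory per table during the transaction and are only handed over at PRE_COMMIT. Function names echo the diagram; the data shapes here are simplified assumptions, not the real hash table layout.

```python
# Pending Iceberg metadata operations, keyed by table, held until COMMIT.
pending_changes: dict[str, list[tuple[str, str]]] = {}

def track_change(table: str, op: str, data_file: str) -> None:
    """Record a pending metadata operation (e.g. ADD_FILE for a new Parquet file)."""
    pending_changes.setdefault(table, []).append((op, data_file))

def consume_tracked_changes(table: str) -> list[tuple[str, str]]:
    """At PRE_COMMIT: hand back and clear everything tracked for the table."""
    return pending_changes.pop(table, [])

track_change("t", "ADD_FILE", "s3://bucket/t/data-001.parquet")
track_change("t", "ADD_FILE", "s3://bucket/t/data-002.parquet")
ops = consume_tracked_changes("t")
assert len(ops) == 2
assert consume_tracked_changes("t") == []  # nothing left after consumption
```

Keeping the changes in memory until PRE_COMMIT is what makes the Iceberg snapshot atomic: a rollback simply discards the tracked operations, and the data files already written to S3 are orphans to be cleaned up later.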

3. SELECT (Query)

sequenceDiagram
    participant Client
    participant PG as PostgreSQL
    participant Planner as pg_lake_table<br/>(planner hook)
    participant FDW as pg_lake_table<br/>(FDW scan)
    participant Engine as pg_lake_engine
    participant PGDuck as pgduck_server<br/>(DuckDB)
    participant S3 as Object Storage

    Client->>PG: SELECT * FROM t WHERE id > 1

    alt Full Query Pushdown (entire query is shippable)
        PG->>Planner: LakeTablePlanner()
        Planner->>Planner: Check all expressions are shippable<br/>(functions, operators, types)
        Planner->>Planner: Generate CustomScan with<br/>rewritten DuckDB SQL
        Planner->>Engine: Deparse query to DuckDB dialect
        Note over Planner: Replaces table refs with<br/>read_parquet([file1, file2, ...])<br/>using current snapshot's file list
        PG->>Engine: QueryPushdownExecScan()
        Engine->>PGDuck: Send full SQL query<br/>(via libpq, extended protocol)
        PGDuck->>S3: Read Parquet files
        PGDuck-->>Engine: Stream result rows
        Engine-->>PG: Convert DuckDB results to PG tuples
    else FDW Path (partial pushdown)
        PG->>FDW: GetForeignRelSize / GetForeignPaths
        FDW->>FDW: CreatePgLakeScanSnapshot()<br/>Get file list from Iceberg metadata
        FDW->>FDW: Data file pruning via min/max stats
        PG->>FDW: GetForeignPlan — deparse SELECT
        Note over FDW: Generates: SELECT cols FROM<br/>read_parquet([files]) WHERE ...
        PG->>FDW: BeginForeignScan()
        FDW->>Engine: SendQueryToPGDuck()
        Engine->>PGDuck: Execute deparsed query
        PGDuck->>S3: Read Parquet files
        PGDuck-->>Engine: Stream rows via PG wire protocol
        Engine-->>FDW: WaitForResult() / fetch tuples
        FDW-->>PG: Return HeapTuples
    end

    PG-->>Client: Query result rows
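The min/max-stats pruning step in the FDW path can be sketched as follows: a data file is skipped when its column range cannot possibly satisfy the predicate. The stats layout and the prune helper are illustrative assumptions, not the real catalog schema.

```python
def prune_files(files, column, op, value):
    """Keep only files whose [min, max] range for `column` may contain matches."""
    survivors = []
    for f in files:
        lo, hi = f["stats"][column]
        if op == ">" and hi <= value:
            continue  # every row in this file fails column > value
        if op == "<" and lo >= value:
            continue
        if op == "=" and not (lo <= value <= hi):
            continue
        survivors.append(f)
    return survivors

files = [
    {"path": "data-001.parquet", "stats": {"id": (1, 1)}},
    {"path": "data-002.parquet", "stats": {"id": (2, 9)}},
]
# For WHERE id > 1, only the second file can contain matching rows.
assert [f["path"] for f in prune_files(files, "id", ">", 1)] == ["data-002.parquet"]
```

Pruning is conservative by design: a surviving file may still contain no matches (DuckDB filters those rows), but a pruned file is guaranteed to contain none.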

pgduck_server Architecture

pgduck_server is a standalone multi-threaded process that implements the PostgreSQL wire protocol and delegates all computation to DuckDB. It is not a PostgreSQL extension — it is a separate binary.

graph TD
    subgraph pgduck_server["pgduck_server process"]
        Main["main.c<br/>Parse CLI options"]
        DuckInit["duckdb_global_init()<br/>Initialize DuckDB instance"]
        PGServer["pgserver.c<br/>Unix socket listener (:5332)"]
        ThreadPool["client_threadpool.c<br/>Thread pool"]

        subgraph per_thread["Per-connection thread"]
            PGSession["pgsession.c<br/>PG wire protocol handler"]
            PGSessionIO["pgsession_io.c<br/>Low-level socket I/O"]
            DuckDB["duckdb.c<br/>Execute queries on DuckDB"]
            TypeConv["type_conversion.c<br/>DuckDB → PG type mapping"]
        end

        DuckDBExt["duckdb_pglake extension<br/>(loaded into DuckDB)"]
    end

    subgraph duckdb_pglake["duckdb_pglake DuckDB extension"]
        S3FS["pg_lake_s3fs<br/>Region-aware S3 filesystem"]
        CacheFS["caching_file_system<br/>Local file caching"]
        UtilFns["utility_functions<br/>PG-compatible functions"]
    end

    Main --> DuckInit
    Main --> PGServer
    PGServer --> ThreadPool
    ThreadPool --> per_thread
    PGSession --> DuckDB
    DuckDB --> DuckDBExt
    DuckDBExt --> duckdb_pglake

    style pgduck_server fill:#fde8e8,stroke:#c75050
    style per_thread fill:#fff0f0,stroke:#c75050
    style duckdb_pglake fill:#e8f0fe,stroke:#4a90d9

Communication protocol

PostgreSQL extensions communicate with pgduck_server using libpq (the standard PostgreSQL client library) over a Unix domain socket (default port 5332). This means pgduck_server looks like another PostgreSQL server to the extensions.

  • Queries are sent via PQsendQueryParams() (extended query protocol) for streaming results, or PQsendQuery() for simple commands.
  • For bulk data transfer, a TRANSMIT prefix triggers COPY-OUT mode — pgduck_server streams results in CSV batches over the PG wire protocol, avoiding per-row overhead.
  • Connections are pooled per-backend in a hash table and released at transaction boundaries via XactCallback.
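The pooling behaviour in the last bullet can be sketched like this: connections live in a hash table keyed by connection parameters and are marked idle at transaction end, mirroring the XactCallback release described above. The Conn class stands in for a real libpq connection; names here are illustrative.

```python
class Conn:
    """Stand-in for a libpq connection to pgduck_server."""
    def __init__(self, dsn: str):
        self.dsn = dsn
        self.in_use = False

pool: dict[str, Conn] = {}  # per-backend pool, keyed by connection parameters

def get_connection(dsn: str) -> Conn:
    conn = pool.get(dsn)
    if conn is None:
        conn = pool[dsn] = Conn(dsn)  # connect lazily on first use
    conn.in_use = True
    return conn

def xact_callback_release_all() -> None:
    """Invoked at transaction end: connections go idle but stay connected."""
    for conn in pool.values():
        conn.in_use = False

c1 = get_connection("host=/tmp port=5332")
c2 = get_connection("host=/tmp port=5332")
assert c1 is c2  # same DSN within a backend reuses the connection
xact_callback_release_all()
assert not c1.in_use
```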

Transaction & Metadata Commit Flow

Iceberg metadata changes are tracked in-memory during a transaction and only committed atomically at PRE_COMMIT time:

stateDiagram-v2
    [*] --> Active: BEGIN (implicit or explicit)

    Active --> Active: INSERT / UPDATE / DELETE<br/>Data files written to S3<br/>Changes tracked in hash table

    Active --> PreCommit: COMMIT issued

    state PreCommit {
        [*] --> DiffMetadata: ConsumeTrackedIcebergMetadataChanges()
        DiffMetadata --> BuildSnapshot: Diff added/removed files<br/>vs. last-pushed metadata
        BuildSnapshot --> WriteManifests: Create new manifests (Avro)
        WriteManifests --> WriteMetadata: Write new metadata.json to S3
        WriteMetadata --> [*]
    }

    PreCommit --> Committed: XACT_EVENT_COMMIT
    Committed --> PostCommit: PostAllRestCatalogRequests()<br/>(REST catalog only)
    PostCommit --> [*]

    Active --> Aborted: ABORT / ERROR
    Aborted --> Cleanup: ResetTrackedIcebergMetadataOperation()
    Cleanup --> [*]
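The DiffMetadata step above is, at its core, a set difference: compare the current data-file set against the metadata last pushed to storage, and the added/removed lists seed the new snapshot's manifests. The shapes below are simplified assumptions about what the real code diffs.

```python
def diff_metadata(current_files: set[str], last_pushed_files: set[str]):
    """Compute which data files the new snapshot adds and removes."""
    added = sorted(current_files - last_pushed_files)
    removed = sorted(last_pushed_files - current_files)
    return added, removed

last_pushed = {"data-001.parquet"}
current = {"data-001.parquet", "data-002.parquet"}
added, removed = diff_metadata(current, last_pushed)
assert added == ["data-002.parquet"]
assert removed == []  # an INSERT-only transaction removes nothing
```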

Query Pushdown Decision

pg_lake_table installs a PostgreSQL planner hook that attempts to push down entire queries to DuckDB. If any part of the query is not shippable, it falls back to the standard FDW path where only per-table scans are pushed down.

flowchart TD
    Q["Incoming Query"]
    Q --> Check{"All tables are<br/>pg_lake tables?"}
    Check -->|No| Standard["Standard PostgreSQL<br/>planner (no pushdown)"]
    Check -->|Yes| Shippable{"All expressions<br/>shippable to DuckDB?"}
    Shippable -->|Yes| FullPush["Full Query Pushdown<br/>(CustomScan node)"]
    Shippable -->|No| FDW["FDW Path<br/>(per-table scan pushdown)"]

    FullPush --> Rewrite["Rewrite entire query<br/>to DuckDB dialect"]
    Rewrite --> Replace["Replace table refs with<br/>read_parquet(file_list)"]
    Replace --> Send["Send to pgduck_server"]

    FDW --> Deparse["Deparse per-table SELECT<br/>with pushed-down WHERE"]
    Deparse --> Replace2["Replace table ref with<br/>read_parquet(file_list)"]
    Replace2 --> Send2["Send to pgduck_server"]

    style FullPush fill:#5ba85b,color:#fff
    style FDW fill:#d9a54a,color:#fff
    style Standard fill:#888,color:#fff
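The shippability test behind the decision diamond can be sketched as a recursive walk of the expression tree, requiring every function and operator to appear on an allow list. The tuple encoding and the tiny allow list below are illustrative assumptions; the real lists live in pg_lake_engine's shippable function/operator code.

```python
# Toy allow list of operators/functions DuckDB is trusted to evaluate.
SHIPPABLE_OPS = {">", "<", "=", "+", "sum", "count", "lower"}

def is_shippable(expr) -> bool:
    """expr is a leaf (column name or constant) or a tuple (op, *args)."""
    if not isinstance(expr, tuple):
        return True  # columns and constants always ship
    op, *args = expr
    return op in SHIPPABLE_OPS and all(is_shippable(a) for a in args)

assert is_shippable((">", "id", 1))                 # full pushdown possible
assert not is_shippable(("my_plpgsql_fn", "name"))  # falls back to the FDW path
```

One unshippable node anywhere in the tree is enough to disqualify full pushdown, which is why the fallback operates per table scan rather than per query.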

Write Path Detail

When data is written to an Iceberg table, it flows through several stages:

flowchart LR
    subgraph PostgreSQL
        PG_COPY["COPY TO<br/>(PG internal)"]
        CSV["Local CSV<br/>temp file"]
    end

    subgraph pg_lake_engine
        GenQ["Generate DuckDB<br/>COPY command"]
    end

    subgraph pgduck_server
        ReadCSV["read_csv()"]
        WriteParquet["COPY TO<br/>Parquet"]
    end

    subgraph Object Storage
        Parquet["data-xxx.parquet"]
    end

    PG_COPY -->|"Rows buffered<br/>as CSV"| CSV
    CSV -->|"File path"| GenQ
    GenQ -->|"COPY (SELECT ... FROM<br/>read_csv('/tmp/xx.csv'))<br/>TO 's3://...parquet'"| ReadCSV
    ReadCSV --> WriteParquet
    WriteParquet --> Parquet

    style PostgreSQL fill:#e8f0fe,stroke:#4a90d9
    style pg_lake_engine fill:#fde8e8,stroke:#c75050
    style pgduck_server fill:#fef7e0,stroke:#d9a54a

Rows are first serialized to CSV using PostgreSQL's built-in COPY mechanism, then pgduck_server reads that CSV and writes it out as Parquet to object storage. This avoids implementing a Parquet writer inside PostgreSQL and leverages DuckDB's efficient columnar engine for the conversion.
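The conversion command at the heart of this flow can be sketched as a string builder: DuckDB reads the local CSV and writes Parquet straight to object storage in one COPY. The helper name and the exact option spelling are assumptions; the real generated SQL may differ in detail.

```python
def build_convert_command(csv_path: str, dest_url: str, columns: list[str]) -> str:
    """Build the DuckDB COPY that converts a buffered CSV into a Parquet data file."""
    cols = ", ".join(columns)
    return (
        f"COPY (SELECT {cols} FROM read_csv('{csv_path}')) "
        f"TO '{dest_url}' (FORMAT parquet)"
    )

cmd = build_convert_command("/tmp/xx.csv", "s3://bucket/t/data-001.parquet", ["id", "name"])
assert cmd.startswith("COPY (SELECT id, name FROM read_csv")
```

Because the SELECT names the columns explicitly, the generated command also fixes the column order of the output file, independent of how the CSV was buffered.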
