@thehypergraph — Created February 21, 2026

An AI-generated map and study guide to Postgres

PostgreSQL Source Code Study Guide

A map of the codebase for learning how PostgreSQL works from the inside out.


How a Query Flows Through the System

Client
  ↓  (TCP/Unix socket)
postmaster        — forks a new backend per connection
  ↓
backend startup   — auth, GUC init, shared memory attach
  ↓
tcop/postgres.c   — ReadCommand() loop: read wire protocol message
  ↓
parser/           — raw lex/parse → raw parse tree
  ↓
parser/analyze.c  — semantic analysis → Query node
  ↓
rewrite/          — apply rules → list of Query nodes
  ↓
optimizer/        — Query → PlannedStmt (cheapest Path → Plan tree)
  ↓
executor/         — execute Plan tree, produce tuples
  ↓
tcop/dest.c       — send result rows back to client (libpq)

Directory Map

src/backend/ — The Server


postmaster/ — Process Manager

The master daemon. It deliberately keeps almost no state in shared memory so it can survive backend crashes. It listens for connections, forks a backend for each one, and manages auxiliary processes (autovacuum launcher, WAL writer, bgwriter, checkpointer, etc.). If any backend crashes, it terminates the others, resets shared memory, and reinitializes, since the crash may have corrupted shared state.

Key files:

  • postmaster.c — main loop, fork-on-connect, signal handling
  • autovacuum.c — autovacuum launcher and worker logic
  • bgwriter.c, checkpointer.c, walwriter.c — background I/O workers

Key structs:

  • PGPROC (storage/proc.h) — one per backend in shared memory; holds XID, lock wait state, latch

tcop/ — Traffic Cop (Query Dispatch)

The per-backend main loop. Reads messages from the client over the wire protocol, dispatches to parser → optimizer → executor, and writes results back. "tcop" stands for traffic cop.

Key files:

  • postgres.c — PostgresMain() message loop; exec_simple_query() ties together parse/plan/execute
  • pquery.c — PortalRun(), executing a portal (cursor abstraction)
  • utility.c — handles non-optimizable statements (DDL, COPY, VACUUM, etc.)
  • dest.c — output destination abstraction; swappable sinks (client, /dev/null, file)

Key structs:

  • Portal (utils/portal.h) — represents an open cursor or active query; holds PlannedStmt + ExecutorState
  • DestReceiver — vtable for tuple output (print to client, store to tuplestore, etc.)

parser/ — Lexer, Parser, Semantic Analysis

Converts SQL text into an internal Query node. Two phases:

  1. Raw parse (gram.y + scan.l) → raw parse tree with RawStmt nodes
  2. Analyze (analyze.c) → resolves names, type-checks, builds Query

Key files:

  • scan.l — flex lexer (tokenizes SQL text)
  • gram.y — bison grammar (800+ rules); outputs a raw parse tree
  • analyze.c — top-level semantic analysis, calls out to parse_*.c helpers
  • parse_expr.c, parse_clause.c, parse_coerce.c, parse_agg.c — expression/clause analysis

Key structs (all in src/include/nodes/parsenodes.h):

  • RawStmt — wrapper around a raw parse tree node with source location
  • Query — the main output; holds commandType, rtable (range table), jointree, targetList, havingQual, etc.
  • RangeTblEntry (RTE) — one entry per table/subquery/function in the FROM clause
  • TargetEntry — one column in the SELECT list (wraps an Expr, has resno, resname)

rewrite/ — Rule System

Applies stored rewrite rules (CREATE RULE) to a Query, potentially producing multiple Queries. Also handles view expansion (views are rules).

Key files:

  • rewriteHandler.c — QueryRewrite() entry point
  • rewriteManip.c — tree manipulation helpers

Key structs:

  • RewriteRule / RuleLock (rewrite/prs2lock.h) — cached rule definitions on a relation

optimizer/ — Query Planner

Takes a Query and returns a PlannedStmt. Organized into subdirectories:

  • prep/ — preprocessing: sublink flattening, subquery pull-up, partition pruning
  • path/ — builds Path trees (all possible access/join strategies); cost estimation lives here
  • plan/ — converts the winning Path tree into a Plan tree
  • geqo/ — genetic algorithm for join ordering when there are many tables
  • util/ — RelOptInfo, PathKey, restriction/join clause utilities

The planner builds a RelOptInfo for each base relation and each candidate join. For each RelOptInfo it generates all plausible Paths, estimates their cost, selects the cheapest, and recurses. See src/backend/optimizer/README for the full algorithm.

Key files:

  • optimizer/plan/planner.c — planner() entry point
  • optimizer/path/allpaths.c — make_one_rel(), generates all access paths
  • optimizer/path/costsize.c — all cost formulas (cost_seqscan, cost_index, cost_hashjoin, …)
  • optimizer/path/joinpath.c — generates NestLoop, MergeJoin, HashJoin paths
  • optimizer/plan/createplan.c — create_plan() turns the winning Path into Plan nodes
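The cost comparison at the heart of path selection can be sketched with the classic textbook formulas. The constants below mirror the default GUC values (seq_page_cost = 1.0, random_page_cost = 4.0, etc.), but the real formulas in costsize.c add many refinements (caching effects, index correlation, per-operator CPU costs), so treat this as the shape of the calculation, not the implementation:

```c
#include <assert.h>
#include <math.h>

/* Simplified cost model in the spirit of costsize.c.  Costs are in
 * abstract units where one sequential page read = 1.0. */
static const double seq_page_cost        = 1.0;
static const double random_page_cost     = 4.0;
static const double cpu_tuple_cost       = 0.01;
static const double cpu_index_tuple_cost = 0.005;

/* Sequential scan: read every page in order, process every tuple. */
double cost_seqscan_sketch(double pages, double tuples)
{
    return seq_page_cost * pages + cpu_tuple_cost * tuples;
}

/* Index scan returning `selectivity` of the table, assuming one random
 * heap fetch per matching tuple (the worst, uncorrelated case). */
double cost_indexscan_sketch(double index_pages, double tuples, double selectivity)
{
    double matched = tuples * selectivity;
    return random_page_cost * (index_pages + matched)
         + (cpu_index_tuple_cost + cpu_tuple_cost) * matched;
}
```

Even this toy model reproduces the familiar crossover: for a 10,000-page table the index scan wins at very low selectivity, but the sequential scan wins once a large enough fraction of rows match.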

Key structs (src/include/nodes/pathnodes.h, plannodes.h):

  • PlannerInfo — planner's working state for a single query level (holds simple_rel_array, join_rel_list, etc.)
  • RelOptInfo — represents a base relation or join rel during planning; has pathlist, cheapest_total_path, rows estimate
  • Path — base type; subtypes: IndexPath, BitmapHeapPath, NestPath, MergePath, HashPath, AppendPath
  • PlannedStmt — final output: has planTree (root Plan node), rtable, resultRelations
  • Plan node types (plannodes.h): SeqScan, IndexScan, BitmapHeapScan, NestLoop, MergeJoin, HashJoin, Hash, Agg, Sort, Limit, ModifyTable, etc.

executor/ — Query Executor

Walks the Plan tree using a demand-pull (volcano/iterator) model. Each plan node has ExecInitFoo() / ExecFoo() / ExecEndFoo() functions. Calling ExecProcNode() on the root returns one tuple at a time.
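The demand-pull shape can be shown with a toy pipeline. This is a sketch: real nodes return TupleTableSlots and are dispatched through PlanState->ExecProcNode, and the node/function names here are invented for illustration — only the pull-based control flow matches the executor:

```c
#include <assert.h>
#include <stddef.h>

/* Toy demand-pull ("volcano") executor.  Each node's next() returns one
 * tuple per call — here just an int, with -1 as end-of-data. */
typedef struct Node Node;
struct Node {
    int (*next)(Node *self);   /* ExecProcNode() analogue */
    Node *child;               /* outer subplan, if any */
    const int *data;           /* scan nodes: the "table" */
    int pos, len;
    int (*pred)(int);          /* filter nodes: the qual */
};

static int scan_next(Node *n)      /* cf. nodeSeqscan.c */
{
    return n->pos < n->len ? n->data[n->pos++] : -1;
}

static int filter_next(Node *n)    /* cf. qual checking in ExecScan() */
{
    int t;
    while ((t = n->child->next(n->child)) != -1)
        if (n->pred(t))
            return t;              /* first qualifying tuple */
    return -1;                     /* child exhausted */
}

static int is_even(int x) { return (x & 1) == 0; }

/* Pull every tuple from the root, as the portal loop does. */
int demo_volcano(void)
{
    static const int rows[] = { 1, 2, 3, 4, 5, 6 };
    Node scan = { scan_next, NULL, rows, 0, 6, NULL };
    Node filt = { filter_next, &scan, NULL, 0, 0, is_even };
    int count = 0;
    while (filt.next(&filt) != -1)
        count++;
    return count;                  /* tuples surviving the filter */
}
```

Note that no tuple is produced until the root asks for it — exactly the property that lets LIMIT stop a deep plan tree early.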

Key files:

  • execMain.c — ExecutorStart(), ExecutorRun(), ExecutorFinish(), ExecutorEnd()
  • execProcnode.c — dispatch table: ExecInitNode(), ExecProcNode(), ExecEndNode()
  • execExpr.c / execExprInterp.c — expression evaluation engine (JIT-compilable)
  • nodeSeqscan.c, nodeIndexscan.c, nodeNestloop.c, nodeHashjoin.c, nodeAgg.c, nodeSort.c — one file per plan node type

Key structs (src/include/nodes/execnodes.h):

  • EState — executor state for the whole query: snapshot, result relation info, tuple table
  • PlanState — mirrors each Plan node; subtypes: SeqScanState, IndexScanState, NestLoopState, AggState, etc.
  • ExprContext — holds the current tuple(s) being evaluated (ecxt_scantuple, ecxt_innertuple, ecxt_outertuple)
  • ExprState — compiled expression: array of ExprEvalStep instructions interpreted by ExecInterpExpr()
  • TupleTableSlot — in-memory tuple representation; may be heap, minimal, or virtual

access/ — Table & Index Access Methods

access/heap/ — Heap (Table) Storage

The physical format of table rows on disk. Each row is a HeapTuple on a page. MVCC is implemented here via xmin/xmax fields in every tuple header.

  • heapam.c — heap_insert(), heap_update(), heap_delete(), heap_getnext() (sequential scan)
  • hio.c — finding free space and writing tuples to pages
  • heapam_visibility.c — HeapTupleSatisfiesMVCC() and friends (MVCC visibility checks)

Key structs:

  • HeapTupleHeaderData (access/htup_details.h) — on-disk tuple header: t_xmin, t_xmax, t_infomask, t_hoff (header length), null bitmap
  • HeapTupleData — in-memory wrapper: pointer to HeapTupleHeaderData + t_self (ItemPointerData = block + offset)
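The visibility rule those xmin/xmax fields enable can be caricatured in a few lines. This is a deliberately simplified sketch of the idea behind HeapTupleSatisfiesMVCC(): the real check also consults hint bits, the snapshot's in-progress array xip[], subtransactions, and combo CIDs, and answers "did this XID commit?" via the clog rather than the toy comparison used here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;
#define INVALID_XID 0

typedef struct { Xid xmin; Xid xmax; } TupleHeader;    /* t_xmin / t_xmax */
typedef struct { Xid xmin; Xid xmax; } Snapshot;       /* xip[] omitted */

/* Toy rule: pretend every XID below the snapshot's xmin committed.
 * (The real code asks the commit log and checks xip[].) */
static bool committed_before(Xid xid, const Snapshot *s)
{
    return xid != INVALID_XID && xid < s->xmin;
}

/* A tuple is visible iff its inserter is visible and its deleter is not. */
bool tuple_visible(const TupleHeader *t, const Snapshot *s)
{
    if (!committed_before(t->xmin, s))
        return false;              /* inserter not visible to us */
    if (committed_before(t->xmax, s))
        return false;              /* deleter already visible: row is gone */
    return true;
}

int demo_visibility(void)
{
    Snapshot s = { 100, 100 };
    TupleHeader live    = { 50, INVALID_XID };   /* inserted long ago */
    TupleHeader deleted = { 50, 80 };            /* deleted before snapshot */
    TupleHeader future  = { 150, INVALID_XID };  /* inserted after snapshot */
    return tuple_visible(&live, &s)
        && !tuple_visible(&deleted, &s)
        && !tuple_visible(&future, &s);
}
```

The key takeaway: readers never lock anything — they just compare tuple XIDs against their snapshot.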
access/nbtree/ — B-Tree Index

Lehman-Yao concurrent B-tree. Supports equality and range queries. Every index page has a "high key" and a right-link pointer enabling safe concurrent splits.

  • nbtree.c — AM interface (btinsert, btgettuple, btbuild, …)
  • nbtsearch.c — descend tree to find leaf, scan forward/backward
  • nbtpage.c — page-level operations, splits
  • nbtinsert.c — insert and page-split logic

Key structs:

  • BTPageOpaqueData — extra data at end of each B-tree page: btpo_prev, btpo_next, btpo_level, btpo_flags
  • BTScanOpaqueData — scan state: current page/offset, scan keys
access/gin/ — GIN (Generalized Inverted Index)

For full-text search and array/jsonb containment. Stores (key → posting list) pairs.

access/gist/ — GiST (Generalized Search Tree)

Framework for geometric and other custom index types (PostGIS uses this).

access/brin/ — BRIN (Block Range Index)

Stores min/max (or other summaries) per range of heap pages. Very compact; good for naturally ordered large tables.

access/hash/ — Hash Index

Equality-only index using extensible hashing.

access/spgist/ — SP-GiST (Space-Partitioned GiST)

For space-partitioning structures (quad-trees, k-d trees, radix tries).

access/transam/ — Transaction Management

The heart of MVCC. Manages transaction IDs, commit log, snapshots, WAL logging.

  • xact.c — StartTransaction(), CommitTransaction(), AbortTransaction()
  • clog.c — commit log (pg_xact): tracks commit/abort status of every XID as 2 bits
  • subtrans.c — subtransaction parent tracking (pg_subtrans)
  • multixact.c — MultiXactId for row-level locking by multiple transactions
  • xlog.c — WAL write path; XLogInsert(), XLogFlush()
  • xlogrecovery.c — WAL replay on startup / standby

Key structs:

  • TransactionId (uint32) — XID; wraps around after ~4 billion
  • SnapshotData (utils/snapshot.h) — xmin, xmax, xip[] (in-progress XIDs): defines what a transaction can see
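Because XIDs wrap, "older than" cannot be a plain `<`. The comparison in transam.c's TransactionIdPrecedes() is essentially the signed-difference trick below (the real function also special-cases the permanent XIDs); this circularity is why anti-wraparound vacuum must freeze tuples before any XID falls ~2 billion behind the current one:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* a precedes b iff a is less than 2^31 steps behind b on the XID circle.
 * Casting the unsigned difference to signed makes wraparound "just work". */
bool xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t)(a - b) < 0;
}
```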
access/table/ — Table AM (Access Method) API

Abstraction layer introduced in PG12 so pluggable storage engines can implement the table interface. The heap uses this.

access/sequence/ — Sequences

Sequence object implementation.


storage/ — Storage Layer

storage/buffer/ — Shared Buffer Manager

The buffer pool: shared memory cache of disk pages. All access to table/index data goes through this.

  • bufmgr.c — ReadBuffer(), ReleaseBuffer(), MarkBufferDirty(), LockBuffer()
  • buf_table.c — hash table mapping (RelFileLocator, BlockNumber) → buffer slot
  • freelist.c — clock-sweep buffer replacement algorithm

Key structs:

  • BufferDesc (storage/buf_internals.h) — one per buffer slot: tag (which page), state (pin count, flags), content lock
  • Buffer (int32) — index into buffer array; positive = shared, negative = local, InvalidBuffer = 0
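The clock-sweep replacement policy from freelist.c is easy to sketch. The real StrategyGetBuffer() also consults a freelist, caps usage_count at BM_MAX_USAGE_COUNT, and handles the all-pinned case; this toy version assumes at least one evictable buffer exists:

```c
#include <assert.h>

#define NBUFFERS 4

typedef struct { int pin_count; int usage_count; } BufSlot;

static BufSlot buffers[NBUFFERS];
static int clock_hand = 0;

/* Sweep the buffer array, decrementing usage counts, and return the
 * index of the first unpinned buffer whose count has reached zero. */
int clock_sweep(void)
{
    for (;;)
    {
        BufSlot *b = &buffers[clock_hand];
        int victim = clock_hand;
        clock_hand = (clock_hand + 1) % NBUFFERS;

        if (b->pin_count > 0)
            continue;              /* pinned buffers are never evicted */
        if (b->usage_count > 0)
            b->usage_count--;      /* recently used: survives this lap */
        else
            return victim;         /* unpinned and cold: evict */
    }
}

int demo_sweep(void)
{
    buffers[0] = (BufSlot){ 1, 0 };   /* pinned: skipped outright */
    buffers[1] = (BufSlot){ 0, 2 };   /* warm: decremented, kept */
    buffers[2] = (BufSlot){ 0, 0 };   /* cold and unpinned: first victim */
    buffers[3] = (BufSlot){ 0, 1 };
    clock_hand = 0;
    return clock_sweep();
}
```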
storage/smgr/ — Storage Manager

Thin abstraction over the OS filesystem. Currently only one implementation (md.c = "magnetic disk").

  • smgr.c — dispatch layer
  • md.c — actual file open/read/write/sync; manages relation "forks" (main, FSM, VM)
storage/lmgr/ — Lock Manager

Four lock levels: spinlocks, LWLocks, regular locks, predicate locks.

  • lock.c — regular (heavyweight) lock table in shared memory; deadlock detection
  • lwlock.c — lightweight locks; reader/writer semantics; used for short critical sections
  • lmgr.c — relation-level and tuple-level lock helpers
  • predicate.c — SSI (serializable snapshot isolation) predicate locks
  • proc.cPGPROC management; lock wait queues

Key structs:

  • LOCK — one per distinct locked object in shared lock table
  • PROCLOCK — connects a PGPROC to a LOCK; tracks held/requested modes
  • LWLock — lightweight lock: state word encodes exclusive/shared holders and waiters
storage/page/ — Page Layout

Every heap and index page is 8 KB (default BLCKSZ). The page has a header, an item pointer array growing forward, free space in the middle, and tuple data growing backward.

Key structs:

  • PageHeaderData (storage/bufpage.h) — pd_lsn (WAL position), pd_lower/pd_upper (free space boundaries), pd_special (index-specific area offset), pd_pagesize_version, pd_checksum
  • ItemIdData — 4-byte item pointer on the page: offset, length, flags (normal/redirect/dead/unused)
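The pd_lower/pd_upper arithmetic implies the free-space computation directly: the line-pointer array grows the lower bound forward, tuple data grows the upper bound backward, and the hole between them is free. A sketch (the real PageGetFreeSpace() additionally reserves room for one new line pointer, and the 24-byte header size is SizeOfPageHeaderData on common builds):

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192
#define PAGE_HEADER_SIZE 24      /* approx. SizeOfPageHeaderData */

typedef struct {
    uint16_t pd_lower;           /* offset to start of free space */
    uint16_t pd_upper;           /* offset to end of free space */
} PageSketch;

/* Free space is simply the hole between the two boundaries. */
int page_free_space(const PageSketch *p)
{
    return (int) p->pd_upper - (int) p->pd_lower;
}

int demo_empty_page(void)
{
    /* A freshly initialized page: line pointers end right after the
     * header, tuple data hasn't consumed anything from the top. */
    PageSketch p = { PAGE_HEADER_SIZE, BLCKSZ };
    return page_free_space(&p);
}
```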
storage/freespace/ — Free Space Map (FSM)

A separate file fork per relation storing free space estimates per page (1 byte = 1/256 of a page). Used by the heap insert path to find pages with room.
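The byte-per-page encoding is simple arithmetic: freespace.c divides the page into 256 categories of BLCKSZ/256 = 32 bytes each, so a stored value of N means roughly "at least N*32 bytes free". A sketch:

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192
#define FSM_CATEGORIES 256

/* Map bytes of free space to its one-byte FSM category (32-byte buckets
 * for the default 8 KB page), clamped to the maximum category. */
uint8_t fsm_category(int bytes_free)
{
    int cat = bytes_free / (BLCKSZ / FSM_CATEGORIES);
    return cat > 255 ? 255 : (uint8_t) cat;
}
```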

storage/ipc/ — Inter-Process Communication

Shared memory setup, semaphores, signal handling, PGPROC latch (condition variable for waiting processes).

storage/aio/ — Asynchronous I/O (PG17+)

Infrastructure for async read/write, enabling prefetching and parallel I/O.


catalog/ — System Catalogs

The system catalogs are regular tables (pg_class, pg_attribute, pg_type, pg_proc, etc.) stored in the database. This directory has the C code that creates and manipulates them.

  • pg_class.c, pg_attribute.c, pg_type.c, pg_proc.c — catalog manipulation for each catalog
  • heap.c — heap_create_with_catalog(): creating a new table updates many catalogs atomically
  • index.c — index creation
  • dependency.c — dependency tracking (DROP CASCADE logic)
  • namespace.c — schema/namespace search path resolution

Key structs:

  • Form_pg_class — C struct mapping a pg_class row (table/index/view metadata)
  • Form_pg_attribute — column descriptor from pg_attribute
  • Form_pg_type — type descriptor from pg_type

commands/ — SQL Command Implementations

DDL and utility commands: CREATE TABLE, ALTER TABLE, COPY, VACUUM, ANALYZE, CLUSTER, etc.

Key files:

  • tablecmds.c — CREATE/ALTER/DROP TABLE (very large file)
  • vacuum.c / vacuumlazy.c — VACUUM; dead-tuple removal, FSM and visibility map updates (vacuumlazy.c lives in access/heap/ in recent releases)
  • analyze.c — ANALYZE; sampling and statistics collection into pg_statistic
  • copy.c — COPY TO/FROM; bulk data load/export
  • indexcmds.c — CREATE INDEX, REINDEX
  • explain.c — EXPLAIN output formatting
  • sequence.c — CREATE SEQUENCE, nextval/currval/setval
  • trigger.c — trigger management and firing
  • event_trigger.c — DDL event triggers

nodes/ — Node Infrastructure

All parse/plan/executor tree nodes inherit from Node (a struct whose first field is NodeTag). This directory provides generic tree operations.

  • nodes.c — newNode(), nodeTag()
  • copyfuncs.c — copyObject(): deep copy any node tree
  • equalfuncs.c — equal(): structural equality of two node trees
  • outfuncs.c — nodeToString(): serialize a node to text (used in EXPLAIN and debugging)
  • readfuncs.c — stringToNode(): deserialize (used in the stored plan cache)
  • list.c — the List type: a generic singly-linked list used everywhere

Key structs:

  • Node — base: { NodeTag type; }
  • List (nodes/pg_list.h) — linked list; elements can be pointers, ints, or OIDs
  • Bitmapset (nodes/bitmapset.h) — compact set of non-negative integers (used for column/rel sets)
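The Node convention can be shown in miniature. The IsA()/nodeTag() macros below mirror the real ones in nodes/nodes.h; the node types and integer payloads are invented for illustration:

```c
#include <assert.h>

/* Every tree struct begins with a NodeTag, so any pointer can be
 * inspected generically. */
typedef enum { T_Const, T_Var } NodeTag;

typedef struct { NodeTag type; } Node;

#define nodeTag(n)   (((const Node *) (n))->type)
#define IsA(n, t)    (nodeTag(n) == T_##t)

typedef struct { NodeTag type; int value; } Const;
typedef struct { NodeTag type; int varattno; } Var;

/* Generic code dispatches on the tag without knowing the static type —
 * this is how copyfuncs.c, equalfuncs.c, and outfuncs.c walk any tree. */
int node_int_payload(const Node *n)
{
    if (IsA(n, Const)) return ((const Const *) n)->value;
    if (IsA(n, Var))   return ((const Var *) n)->varattno;
    return -1;
}

int demo_nodes(void)
{
    Const c = { T_Const, 42 };
    Var   v = { T_Var, 3 };
    return node_int_payload((Node *) &c) + node_int_payload((Node *) &v);
}
```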

utils/ — Utilities

utils/cache/ — Catalog Cache & Relation Cache

Caches frequently-accessed catalog rows to avoid repeated syscat scans.

  • catcache.c — SearchSysCache(): hash cache of individual catalog tuples, invalidated by SI messages
  • relcache.c — RelationIdGetRelation(): relation descriptor cache; holds RelationData structs
  • inval.c — cache invalidation via shared-memory message queue (SI = shared invalidation)

Key structs:

  • RelationData (utils/rel.h) — relation descriptor: file locator, TupleDesc, index list, trigger list, locking info, rule info — the central struct for any open table or index
  • TupleDesc (access/tupdesc.h) — array of Form_pg_attribute; describes a row's columns
utils/mmgr/ — Memory Context Manager

PostgreSQL almost never calls malloc directly. Everything goes through memory contexts: arena allocators arranged in a tree, where deleting a context frees its children and every allocation made in it. Key context lifetimes: TopMemoryContext (lives for the whole backend), MessageContext (per client message), PortalContext (per portal). CurrentMemoryContext is the global pointer to whichever context palloc() currently allocates from.

  • aset.c — AllocSet, the default allocator: carves chunks out of large blocks using power-of-two size-class freelists (a separate slab.c exists for fixed-size objects)
  • mcxt.c — context tree management: MemoryContextSwitchTo(), MemoryContextDelete(), palloc(), pfree()

Key struct:

  • MemoryContextData (nodes/memnodes.h) — base struct for all context types; forms a parent/child tree
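A minimal arena in the spirit of mcxt.c makes the lifetime model concrete. This sketch does one malloc per allocation for clarity (the real AllocSet amortizes with large blocks and freelists), and all names are analogues, not the real API:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct Chunk { struct Chunk *next; } Chunk;

typedef struct Ctx {
    Chunk *chunks;                      /* everything allocated here */
    struct Ctx *first_child, *next_sibling;
} Ctx;

Ctx *ctx_create(Ctx *parent)            /* context-create analogue */
{
    Ctx *c = calloc(1, sizeof(Ctx));
    if (parent) {
        c->next_sibling = parent->first_child;
        parent->first_child = c;
    }
    return c;
}

void *ctx_alloc(Ctx *c, size_t size)    /* palloc() analogue */
{
    Chunk *ch = malloc(sizeof(Chunk) + size);
    ch->next = c->chunks;
    c->chunks = ch;
    return ch + 1;                      /* user memory follows the header */
}

void ctx_delete(Ctx *c)                 /* MemoryContextDelete() analogue */
{
    for (Ctx *k = c->first_child, *nk; k; k = nk) { nk = k->next_sibling; ctx_delete(k); }
    for (Chunk *ch = c->chunks, *nc; ch; ch = nc) { nc = ch->next; free(ch); }
    free(c);
}

int demo_contexts(void)
{
    Ctx *query = ctx_create(NULL);
    Ctx *per_tuple = ctx_create(query);          /* child context */
    char *s = ctx_alloc(per_tuple, 6);
    memcpy(s, "hello", 6);
    int ok = (s[0] == 'h');
    ctx_delete(query);                           /* frees per_tuple and s too */
    return ok;
}
```

One delete at the right level reclaims an entire subtree of allocations — which is why query-lifetime leaks are rare in backend code.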
utils/error/ — Error Handling

elog() / ereport() macros build an error report in an ErrorData struct and longjmp to the nearest enclosing PG_TRY/PG_CATCH handler (or abort the transaction if there is none).

  • elog.c — errstart(), errfinish()
  • pg_lzcompress.c — PGLZ compression used by TOAST (in src/common/ in recent releases)
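The PG_TRY/PG_CATCH control flow, stripped to its bones, is a saved-and-restored stack of jump buffers (PG_exception_stack in the real code). The names below are invented; only the setjmp/longjmp shape matches elog.c:

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

static jmp_buf *exception_stack = NULL;

static void toy_error(void)            /* ereport(ERROR, ...) analogue */
{
    longjmp(*exception_stack, 1);      /* never returns */
}

static int checked_divide(int a, int b)
{
    if (b == 0)
        toy_error();
    return a / b;
}

/* Returns a/b, or `fallback` if the "error" was thrown. */
int try_divide(int a, int b, int fallback)
{
    jmp_buf local;
    jmp_buf *save = exception_stack;   /* PG_TRY saves the old handler */
    int result;

    exception_stack = &local;
    if (setjmp(local) == 0)
        result = checked_divide(a, b); /* PG_TRY body */
    else
        result = fallback;             /* PG_CATCH body */
    exception_stack = save;            /* PG_END_TRY restores */
    return result;
}
```

Cleanup of buffer pins, locks, and files on the error path is what resource owners (utils/resowner/) exist for.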
utils/fmgr/ — Function Manager

How PostgreSQL calls SQL functions from C. The V1 call convention wraps arguments in a FunctionCallInfoBaseData struct.

  • fmgr.c — fmgr_info() (look up a function), FunctionCallInvoke() (call it)

Key structs:

  • FmgrInfo — cached function lookup: function pointer, type OID, nargs
  • FunctionCallInfoBaseData — call frame: flinfo, nargs, args[] (each a NullableDatum)
utils/adt/ — Built-in Data Types

Implementations of all built-in types: int, float, text, bytea, date, timestamp, numeric, json, jsonb, array, range, etc. Each type provides input/output functions and operators.

utils/sort/ — Sorting & Hashing
  • tuplesort.c — external merge sort used by ORDER BY, CREATE INDEX, DISTINCT
  • tuplestore.c — tuple buffer that spills to disk when large; used by CTEs, window functions
  • logtape.c — virtual tape file for merge sort spill
utils/time/ — MVCC Snapshot & Visibility
  • snapmgr.c — GetTransactionSnapshot(), GetLatestSnapshot(), snapshot export/import
  • combocid.c — combo command IDs for tuples inserted and deleted in the same transaction
utils/misc/ — GUC (Configuration)
  • guc.c, guc_tables.c — Grand Unified Configuration: all postgresql.conf parameters, SET, SHOW
utils/activity/ — Statistics & Wait Events
  • pgstat.c — cumulative statistics system (shared-memory based since PG 15; formerly the statistics collector)
  • wait_event.c — wait event definitions used in pg_stat_activity.wait_event
utils/resowner/ — Resource Owner

Tracks resources (buffer pins, lock holds, file handles, tuplestore refs) that must be released on error. Arranged in a tree parallel to transaction/portal structure.


replication/ — Replication

  • walsender.c — streams WAL to standby; handles replication protocol
  • walreceiver.c — receives WAL on standby; writes to local WAL files
  • logical/ — logical decoding (change data capture): decodes WAL back into row changes
  • slot.c — replication slots: prevent WAL removal until a consumer has caught up

libpq/ — Wire Protocol (Server Side)

Implements the PostgreSQL frontend/backend protocol (version 3). Handles authentication, SSL, message framing.

  • auth.c — authentication methods (md5, scram, gss, cert, …)
  • pqcomm.c — socket I/O, message send/receive
  • pqformat.cpq_sendint(), pq_sendstring(), etc.

jit/ — JIT Compilation

LLVM-based JIT for expression evaluation and tuple deforming. Activated for long-running queries where the compilation cost is worth it.

  • jit.c — provider interface
  • llvm/ — LLVM backend: llvmjit.c, llvmjit_expr.c (compile ExprState to native code)

statistics/ — Extended Statistics

Tracks multi-column statistics (ndistinct, functional dependencies, MCV lists) stored in pg_statistic_ext. Used by the planner to get better row count estimates when columns are correlated.


src/include/ — Header Files

Headers are organized to mirror src/backend/:

  • nodes/nodes.h — NodeTag enum; makeNode() macro
  • nodes/parsenodes.h — Query, RangeTblEntry, TargetEntry, all parse tree node types
  • nodes/primnodes.h — expression nodes: Var, Const, FuncExpr, OpExpr, SubLink, etc.
  • nodes/plannodes.h — PlannedStmt, Plan, and all plan node subtypes
  • nodes/pathnodes.h — PlannerInfo, RelOptInfo, Path and subtypes
  • nodes/execnodes.h — EState, PlanState subtypes, ExprContext, TupleTableSlot
  • access/htup_details.h — HeapTupleHeaderData (on-disk tuple header layout)
  • access/xact.h — transaction state, isolation levels, StartTransaction(), etc.
  • storage/bufmgr.h — ReadBuffer(), the Buffer type, buffer access API
  • storage/bufpage.h — PageHeaderData, ItemIdData, page layout macros
  • storage/lock.h — LOCK, PROCLOCK, lock mode constants
  • storage/proc.h — PGPROC, the per-backend shared memory struct
  • utils/rel.h — RelationData, the relation descriptor
  • utils/snapshot.h — SnapshotData, the MVCC snapshot
  • utils/memutils.h — memory context macros (palloc, pfree, MemoryContextSwitchTo)

Key Data Structures: Quick Reference

  • Query (nodes/parsenodes.h) — output of semantic analysis; input to the planner
  • PlannedStmt (nodes/plannodes.h) — output of the planner; input to the executor
  • PlannerInfo (nodes/pathnodes.h) — planner's per-query-level working state
  • RelOptInfo (nodes/pathnodes.h) — a base or join relation during planning
  • Path (nodes/pathnodes.h) — an access strategy under consideration
  • EState (nodes/execnodes.h) — executor's per-query state
  • PlanState (nodes/execnodes.h) — executor's per-plan-node state
  • TupleTableSlot (nodes/execnodes.h) — an in-memory tuple being processed
  • HeapTupleHeaderData (access/htup_details.h) — on-disk tuple header (MVCC fields)
  • RelationData (utils/rel.h) — relation descriptor (relcache entry)
  • TupleDesc (access/tupdesc.h) — column definitions for a relation or expression
  • BufferDesc (storage/buf_internals.h) — buffer pool slot descriptor
  • PGPROC (storage/proc.h) — per-backend shared memory block
  • SnapshotData (utils/snapshot.h) — MVCC visibility snapshot
  • MemoryContextData (nodes/memnodes.h) — memory context (arena allocator)
  • List (nodes/pg_list.h) — generic singly-linked list

Good Starting Points for Deep Dives

  • How a SELECT executes end-to-end — tcop/postgres.c:exec_simple_query()
  • How the planner picks a join order — optimizer/README, then optimizer/path/allpaths.c
  • How cost estimation works — optimizer/path/costsize.c
  • How MVCC visibility is checked — access/heap/heapam_visibility.c
  • How a page is read from disk — storage/buffer/bufmgr.c:ReadBuffer()
  • How WAL writes work — access/transam/xlog.c:XLogInsert()
  • How a B-tree index scan works — access/nbtree/nbtsearch.c
  • How locks prevent conflicts — storage/lmgr/lock.c:LockAcquire()
  • How memory is managed — utils/mmgr/aset.c, utils/mmgr/mcxt.c
  • How expressions are evaluated — executor/execExprInterp.c
  • How VACUUM removes dead tuples — access/heap/vacuumlazy.c (commands/vacuumlazy.c in older releases)
  • How replication works — replication/walsender.c, replication/README

Concepts Worth Understanding First

  1. MVCC — Every row stores xmin (inserting XID) and xmax (deleting XID). A snapshot's visibility check reads these against its own xmin/xmax/xip[]. No locks needed for reads.

  2. Node trees — Parse trees, plan trees, and expression trees are all trees of Node-derived structs. The NodeTag in the first field tells you what struct it really is. Use IsA(node, Type) to check.

  3. Memory contexts — Never call free() directly. Every allocation is in a context. When a query ends, MemoryContextDelete(MessageContext) frees everything at once.

  4. WAL-first — Every change to a data page must write a WAL record before modifying shared buffers. This ensures crash recovery. XLogInsert() + MarkBufferDirty() is the pattern.

  5. The buffer pin — Before touching any page you must ReadBuffer() (pin it). The buffer manager won't evict a pinned buffer. ReleaseBuffer() releases the pin.

  6. Resource owners — Pin counts, lock holds, etc. are registered with a ResourceOwner. On ERROR, the resource owner cleanup releases everything even if the code didn't explicitly do so.

  7. Demand-pull executionExecProcNode(node) returns one tuple. The top of the tree is called by the portal loop; it calls its children; leaves call the storage layer. No tuple is computed until needed.
