A map of the codebase for learning how PostgreSQL works from the inside out.
Client
↓ (TCP/Unix socket)
postmaster — forks a new backend per connection
↓
backend startup — auth, GUC init, shared memory attach
↓
tcop/postgres.c — ReadCommand() loop: read wire protocol message
↓
parser/ — raw lex/parse → raw parse tree
↓
parser/analyze.c — semantic analysis → Query node
↓
rewrite/ — apply rules → list of Query nodes
↓
optimizer/ — Query → PlannedStmt (cheapest Path → Plan tree)
↓
executor/ — execute Plan tree, produce tuples
↓
tcop/dest.c — send result rows back to client (libpq)
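The staged handoff above can be sketched as a toy model — not PostgreSQL code, just four functions mirroring the shape of `exec_simple_query()`, where each stage consumes the previous stage's output. All `Toy*` names are invented for illustration:

```c
/* Toy model of the query pipeline: parse -> analyze -> plan -> execute,
 * each stage consuming the previous stage's output struct. */
#include <assert.h>

typedef struct { const char *sql; }            ToyRawStmt;     /* parser output   */
typedef struct { ToyRawStmt raw; int ncols; }  ToyQuery;       /* analyzer output */
typedef struct { ToyQuery q; double cost; }    ToyPlannedStmt; /* planner output  */

static ToyRawStmt toy_parse(const char *sql)
{
    ToyRawStmt r = { sql };
    return r;
}

static ToyQuery toy_analyze(ToyRawStmt r)
{
    ToyQuery q = { r, 1 };
    for (const char *p = r.sql; *p; p++)
        if (*p == ',')
            q.ncols++;          /* crude stand-in for target-list analysis */
    return q;
}

static ToyPlannedStmt toy_plan(ToyQuery q)
{
    ToyPlannedStmt p = { q, 1.0 };
    return p;
}

static int toy_execute(ToyPlannedStmt p)
{
    return p.q.ncols;           /* "result width" stands in for real rows */
}

/* End-to-end: SQL text in, result out -- one pass through all four stages. */
int toy_run(const char *sql)
{
    return toy_execute(toy_plan(toy_analyze(toy_parse(sql))));
}
```

The point is only the data flow: each arrow in the diagram is a function whose output type is the next stage's input type.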
The master daemon. It listens for connections, forks a backend for each one, and manages the auxiliary processes (autovacuum launcher, WAL writer, bgwriter, checkpointer, etc.). It deliberately avoids depending on shared-memory contents, so when a backend crashes it can survive, reinitialize shared memory, and restart all backends.
Key files:
- `postmaster.c` — main loop, fork-on-connect, signal handling
- `autovacuum.c` — autovacuum launcher and worker logic
- `bgwriter.c`, `checkpointer.c`, `walwriter.c` — background I/O workers
Key structs:
- `PGPROC` (`storage/proc.h`) — one per backend in shared memory; holds XID, lock wait state, latch
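The fork-per-connection pattern can be sketched with plain POSIX sockets. This is a generic sketch, not postmaster.c's actual logic (which also handles authentication timeouts, signals, and process bookkeeping):

```c
/* Fork-per-connection server sketch: the parent only accepts; each client
 * gets a dedicated child process, as the postmaster forks one backend per
 * connection. */
#include <assert.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

/* Create a listening TCP socket on localhost (port 0 = kernel-chosen). */
int listen_socket(int port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };

    if (fd < 0)
        return -1;
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons((unsigned short) port);
    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0)
        return -1;
    if (listen(fd, 8) < 0)
        return -1;
    return fd;
}

/* Accept loop: the parent never talks to clients itself. */
void serve(int lfd)
{
    for (;;)
    {
        int cfd = accept(lfd, NULL, NULL);

        if (cfd < 0)
            continue;
        if (fork() == 0)        /* child = "backend" */
        {
            close(lfd);         /* a backend doesn't accept new clients */
            /* ... authenticate, then run the message loop ... */
            close(cfd);
            _exit(0);
        }
        close(cfd);             /* parent keeps only the listen socket */
    }
}
```

Key detail mirrored from the real design: after `fork()`, the child closes the listen socket and the parent closes the client socket, so each process holds exactly the descriptors it needs.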
The per-backend main loop. Reads messages from the client over the wire protocol, dispatches to parser → optimizer → executor, and writes results back. "tcop" stands for traffic cop.
Key files:
- `postgres.c` — `PostgresMain()` message loop; `exec_simple_query()` ties together parse/plan/execute
- `pquery.c` — `PortalRun()`, executing a portal (cursor abstraction)
- `utility.c` — handles non-optimizable statements (DDL, COPY, VACUUM, etc.)
- `dest.c` — output destination abstraction; swappable sinks (client, `/dev/null`, file)
Key structs:
- `Portal` (`utils/portal.h`) — represents an open cursor or active query; holds the PlannedStmt + executor state
- `DestReceiver` — vtable for tuple output (print to client, store to tuplestore, etc.)
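The `DestReceiver` idea is a classic C vtable. A minimal sketch (the real struct in `tcop/dest.h` has more callbacks, e.g. `rStartup`; all `Toy*` names here are invented):

```c
/* DestReceiver-style vtable: the executor pushes tuples through function
 * pointers and never knows which concrete sink it is feeding. */
#include <assert.h>

typedef struct ToyDestReceiver
{
    /* mimics receiveSlot: return 0 to tell the executor to stop */
    int  (*receive)(struct ToyDestReceiver *self, int tuple);
    void (*shutdown)(struct ToyDestReceiver *self);
    int   count;                /* per-receiver state */
} ToyDestReceiver;

/* One concrete sink: discard tuples but count them (like DestNone). */
int devnull_receive(ToyDestReceiver *self, int tuple)
{
    (void) tuple;
    self->count++;
    return 1;
}

void noop_shutdown(ToyDestReceiver *self)
{
    (void) self;
}

/* The "executor" sees only the vtable. */
int run_query(ToyDestReceiver *dest, const int *tuples, int n)
{
    for (int i = 0; i < n; i++)
        if (!dest->receive(dest, tuples[i]))
            break;
    dest->shutdown(dest);
    return dest->count;
}

/* Demo: three tuples through a fresh /dev/null-style sink. */
int demo_count_three(void)
{
    ToyDestReceiver d = { devnull_receive, noop_shutdown, 0 };
    int rows[3] = { 10, 20, 30 };

    return run_query(&d, rows, 3);
}
```

Swapping the sink (client wire protocol, tuplestore, discard) means swapping the function pointers, not touching the executor.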
Converts SQL text into an internal `Query` node. Two phases:
- Raw parse (`gram.y` + `scan.l`) → raw parse tree with `RawStmt` nodes
- Analyze (`analyze.c`) → resolves names, type-checks, builds `Query`
Key files:
- `scan.l` — flex lexer (tokenizes SQL text)
- `gram.y` — bison grammar (800+ rules); outputs a raw parse tree
- `analyze.c` — top-level semantic analysis, calls out to `parse_*.c` helpers
- `parse_expr.c`, `parse_clause.c`, `parse_coerce.c`, `parse_agg.c` — expression/clause analysis
Key structs (all in `src/include/nodes/parsenodes.h`):
- `RawStmt` — wrapper around a raw parse tree node with source location
- `Query` — the main output; holds `commandType`, `rtable` (range table), `jointree`, `targetList`, `havingQual`, etc.
- `RangeTblEntry` (RTE) — one entry per table/subquery/function in the FROM clause
- `TargetEntry` — one column in the SELECT list (wraps an `Expr`, has `resno`, `resname`)
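To make the raw-parse phase concrete, here is a drastically simplified tokenizer in the spirit of `scan.l`. The real lexer also handles quoting, comments, multi-character operators, and keyword lookup; this sketch only splits text into word/number/punctuation tokens:

```c
/* Toy SQL tokenizer: classify each token as 'w' (word), 'n' (number),
 * or 'p' (punctuation). A stand-in for the flex lexer's token stream. */
#include <assert.h>
#include <ctype.h>

int toy_lex(const char *sql, char *kinds, int max)
{
    int n = 0;
    const char *p = sql;

    while (*p && n < max)
    {
        if (isspace((unsigned char) *p))
        {
            p++;
        }
        else if (isalpha((unsigned char) *p) || *p == '_')
        {
            while (isalnum((unsigned char) *p) || *p == '_')
                p++;
            kinds[n++] = 'w';       /* keyword or identifier */
        }
        else if (isdigit((unsigned char) *p))
        {
            while (isdigit((unsigned char) *p))
                p++;
            kinds[n++] = 'n';       /* numeric literal */
        }
        else
        {
            p++;
            kinds[n++] = 'p';       /* comma, paren, operator char, ... */
        }
    }
    return n;
}

/* Demo: "SELECT id, 42 FROM t" -> w w p n w w = 6 tokens. */
int demo_token_count(void)
{
    char kinds[16];

    return toy_lex("SELECT id, 42 FROM t", kinds, 16);
}
```

The grammar in `gram.y` then assembles this token stream into `RawStmt` trees; only `analyze.c` later decides which words are tables, columns, or functions.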
Applies stored rewrite rules (CREATE RULE) to a Query, potentially producing multiple Queries. Also handles view expansion (views are rules).
Key files:
- `rewriteHandler.c` — `QueryRewrite()` entry point
- `rewriteManip.c` — tree manipulation helpers
Key structs:
- `RewriteRule` / `RuleLock` (`rewrite/prs2lock.h`) — cached rule definitions on a relation
Takes a Query and returns a PlannedStmt. Organized into subdirectories:
- `prep/` — preprocessing: sublink flattening, subquery pull-up, partition pruning
- `path/` — builds `Path` trees (all possible access/join strategies); cost estimation lives here
- `plan/` — converts the winning `Path` tree into a `Plan` tree
- `geqo/` — genetic algorithm for join ordering when there are many tables
- `util/` — `RelOptInfo`, `PathKey`, restriction/join clause utilities
The planner builds a `RelOptInfo` for each base relation and each candidate join. For each `RelOptInfo` it generates all plausible `Path`s, estimates their cost, selects the cheapest, and recurses. See `src/backend/optimizer/README` for the full algorithm.
Key files:
- `optimizer/plan/planner.c` — `planner()` entry point
- `optimizer/path/allpaths.c` — `make_one_rel()`, generates all access paths
- `optimizer/path/costsize.c` — all cost formulas (`cost_seqscan`, `cost_index`, `cost_hashjoin`, …)
- `optimizer/path/joinpath.c` — generates NestLoop, MergeJoin, HashJoin paths
- `optimizer/plan/createplan.c` — `create_plan()` turns the winning Path into Plan nodes
Key structs (`src/include/nodes/pathnodes.h`, `plannodes.h`):
- `PlannerInfo` — planner's working state for a single query level (holds `simple_rel_array`, `join_rel_list`, etc.)
- `RelOptInfo` — represents a base relation or join rel during planning; has `pathlist`, `cheapest_total_path`, `rows` estimate
- `Path` — base type; subtypes: `IndexPath`, `BitmapHeapPath`, `NestPath`, `MergePath`, `HashPath`, `AppendPath`
- `PlannedStmt` — final output: has `planTree` (root Plan node), `rtable`, `resultRelations`
- Plan node types (`plannodes.h`): `SeqScan`, `IndexScan`, `BitmapHeapScan`, `NestLoop`, `MergeJoin`, `HashJoin`, `Hash`, `Agg`, `Sort`, `Limit`, `ModifyTable`, etc.
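The "estimate every path, keep the cheapest" loop can be illustrated with back-of-envelope versions of the seqscan and index-scan cost formulas. The constants below are the real GUC defaults (`seq_page_cost`, `random_page_cost`, `cpu_tuple_cost`), but the formulas are heavily simplified — the real ones in `costsize.c` also model per-operator CPU cost, index pages, correlation, and caching:

```c
/* Toy cost model: compare a sequential scan against an index scan that
 * does one random heap fetch per matching tuple. */
#include <assert.h>

static const double seq_page_cost    = 1.0;    /* default GUC values */
static const double random_page_cost = 4.0;
static const double cpu_tuple_cost   = 0.01;

double toy_cost_seqscan(double pages, double tuples)
{
    return seq_page_cost * pages + cpu_tuple_cost * tuples;
}

double toy_cost_indexscan(double tuples, double selectivity)
{
    double matches = tuples * selectivity;

    return random_page_cost * matches + cpu_tuple_cost * matches;
}

/* Keep the cheaper path, like add_path() keeping cheapest_total_path. */
int index_wins(double pages, double tuples, double selectivity)
{
    return toy_cost_indexscan(tuples, selectivity)
         < toy_cost_seqscan(pages, tuples);
}
```

Even this crude model shows the crossover: a highly selective predicate makes the index path cheap, while reading most of the table makes random fetches lose to one sequential pass.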
Walks the Plan tree using a demand-pull (volcano/iterator) model. Each plan node has `ExecInitFoo()` / `ExecFoo()` / `ExecEndFoo()` functions. Calling `ExecProcNode()` on the root returns one tuple at a time.
Key files:
- `execMain.c` — `ExecutorStart()`, `ExecutorRun()`, `ExecutorFinish()`, `ExecutorEnd()`
- `execProcnode.c` — dispatch table: `ExecInitNode()`, `ExecProcNode()`, `ExecEndNode()`
- `execExpr.c` / `execExprInterp.c` — expression evaluation engine (JIT-compilable)
- `nodeSeqscan.c`, `nodeIndexscan.c`, `nodeNestloop.c`, `nodeHashjoin.c`, `nodeAgg.c`, `nodeSort.c` — one file per plan node type
Key structs (src/include/nodes/execnodes.h):
- `EState` — executor state for the whole query: snapshot, result relations, tuple table
- `PlanState` — mirrors each Plan node; subtypes: `SeqScanState`, `IndexScanState`, `NestLoopState`, `AggState`, etc.
- `ExprContext` — holds the current tuple(s) being evaluated (`ecxt_scantuple`, `ecxt_innertuple`, `ecxt_outertuple`)
- `ExprState` — compiled expression: array of `ExprEvalStep` instructions interpreted by `ExecInterpExpr()`
- `TupleTableSlot` — in-memory tuple representation; may be heap, minimal, or virtual
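The demand-pull model fits in a few lines of C. This toy executor (invented `Toy*` names, one callback instead of the real init/exec/end triple) runs `SELECT ... LIMIT 2` over a 5-row in-memory "table":

```c
/* Volcano-style executor: each node's next() returns one tuple, or -1 for
 * end-of-stream, as ExecProcNode() returns one TupleTableSlot at a time. */
#include <assert.h>

typedef struct ToyNode
{
    int (*next)(struct ToyNode *self);   /* pull one tuple */
    struct ToyNode *child;
    const int *data;                     /* SeqScan: the "table" */
    int pos;                             /* scan cursor / tuples emitted */
    int nitems;                          /* table size / LIMIT count */
} ToyNode;

/* Leaf node: sequential scan over an array. */
int seqscan_next(ToyNode *self)
{
    if (self->pos >= self->nitems)
        return -1;
    return self->data[self->pos++];
}

/* Inner node: Limit pulls from its child until its quota is used up. */
int limit_next(ToyNode *self)
{
    if (self->pos >= self->nitems)       /* nitems doubles as the limit */
        return -1;
    self->pos++;
    return self->child->next(self->child);
}

/* Demo: LIMIT 2 over {1,2,3,4,5}; the portal-style loop sums the output. */
int demo_limit_sum(void)
{
    static const int table[5] = { 1, 2, 3, 4, 5 };
    ToyNode scan  = { seqscan_next, 0, table, 0, 5 };
    ToyNode limit = { limit_next, &scan, 0, 0, 2 };
    int t, sum = 0;

    while ((t = limit.next(&limit)) != -1)
        sum += t;
    return sum;
}
```

Note that rows 3–5 are never produced at all: the Limit node simply stops pulling, which is the whole point of demand-pull execution.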
The physical format of table rows on disk. Each row is a HeapTuple on a page. MVCC is implemented here via xmin/xmax fields in every tuple header.
- `heapam.c` — `heap_insert()`, `heap_update()`, `heap_delete()`, `heap_getnext()` (sequential scan)
- `hio.c` — finding free space and writing tuples to pages
- `heapam_visibility.c` — `HeapTupleSatisfiesVisibility()` and friends (MVCC visibility checks)
Key structs:
- `HeapTupleHeaderData` (`access/htup_details.h`) — on-disk tuple header: `t_xmin`, `t_xmax`, `t_infomask`, `t_hoff` (header length), null bitmap
- `HeapTupleData` — in-memory wrapper: pointer to `HeapTupleHeaderData` + `t_self` (`ItemPointerData` = block + offset)
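The xmin/xmax visibility rule can be shown in miniature. This is a deliberately simplified sketch of the MVCC check: it assumes every XID below the snapshot's `xmin` committed, whereas the real code in `heapam_visibility.c` also consults the commit log and infomask hint bits to handle aborted transactions:

```c
/* Toy MVCC visibility: a tuple is visible if its inserting transaction is
 * "committed for this snapshot" and its deleting transaction (if any)
 * is not. */
#include <assert.h>

typedef unsigned int Xid;

typedef struct { Xid xmin, xmax; } ToyTuple;   /* xmax == 0: never deleted */

typedef struct
{
    Xid xmin, xmax;            /* XIDs < xmin finished; >= xmax not started */
    const Xid *xip;            /* in-progress XIDs in between */
    int nxip;
} ToySnapshot;

static int xid_invisible(const ToySnapshot *s, Xid x)
{
    if (x >= s->xmax)
        return 1;              /* started after the snapshot was taken */
    for (int i = 0; i < s->nxip; i++)
        if (s->xip[i] == x)
            return 1;          /* still running when snapshot was taken */
    return 0;
}

int tuple_visible(const ToyTuple *t, const ToySnapshot *s)
{
    if (xid_invisible(s, t->xmin))
        return 0;              /* insert not visible to us */
    if (t->xmax == 0)
        return 1;              /* never deleted */
    return xid_invisible(s, t->xmax);   /* delete not visible => row lives */
}

/* Demo: inserted by xid 90, deleted by still-in-progress xid 105
 * -> the deletion is invisible to us, so the row is still visible. */
int demo_visible(void)
{
    static const Xid xip[1] = { 105 };
    ToySnapshot s = { 100, 110, xip, 1 };
    ToyTuple t = { 90, 105 };

    return tuple_visible(&t, &s);
}
```

This is why readers need no locks: a concurrent delete just sets `xmax`, and every snapshot independently decides whether that delete exists yet.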
Lehman-Yao concurrent B-tree. Supports equality and range queries. Every index page has a "high key" and a right-link pointer enabling safe concurrent splits.
- `nbtree.c` — AM interface (`btinsert`, `btgettuple`, `btbuild`, …)
- `nbtsearch.c` — descend the tree to find a leaf, scan forward/backward
- `nbtpage.c` — page-level operations
- `nbtinsert.c` — insert and page-split logic
Key structs:
- `BTPageOpaqueData` — extra data at the end of each B-tree page: `btpo_prev`, `btpo_next`, `btpo_level`, `btpo_flags`
- `BTScanOpaqueData` — scan state: current page/offset, scan keys
For full-text search and array/jsonb containment. Stores (key → posting list) pairs.
Framework for geometric and other custom index types (PostGIS uses this).
Stores min/max (or other summaries) per range of heap pages. Very compact; good for naturally ordered large tables.
Equality-only index based on linear hashing.
For space-partitioning structures (quad-trees, k-d trees, radix tries).
The heart of MVCC. Manages transaction IDs, commit log, snapshots, WAL logging.
- `xact.c` — `StartTransaction()`, `CommitTransaction()`, `AbortTransaction()`
- `clog.c` — commit log (`pg_xact`): tracks commit/abort status of every XID as 2 bits
- `subtrans.c` — subtransaction parent tracking (`pg_subtrans`)
- `multixact.c` — `MultiXactId` for row-level locking by multiple transactions
- `xlog.c` — WAL write path; `XLogInsert()`, `XLogFlush()`
- `xlogrecovery.c` — WAL replay on startup / standby
Key structs:
- `TransactionId` (uint32) — XID; wraps around after ~4 billion
- `SnapshotData` (`utils/snapshot.h`) — `xmin`, `xmax`, `xip[]` (in-progress XIDs): defines what a transaction can see
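XID wraparound means "older than" cannot be a plain `<` comparison. The trick used by `TransactionIdPrecedes()` is modular signed arithmetic, shown here in a standalone form:

```c
/* Circular XID comparison: decide ordering by the signed 32-bit distance,
 * which works as long as live XIDs span less than 2^31 -- the reason
 * anti-wraparound vacuum must keep the XID horizon moving forward. */
#include <assert.h>
#include <stdint.h>

int xid_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}
```

With this rule, an XID just below the 2^32 wraparound point correctly precedes a small XID just after it.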
Abstraction layer introduced in PG12 so pluggable storage engines can implement the table interface. The heap uses this.
Sequence object implementation.
The buffer pool: shared memory cache of disk pages. All access to table/index data goes through this.
- `bufmgr.c` — `ReadBuffer()`, `ReleaseBuffer()`, `MarkBufferDirty()`, `LockBuffer()`
- `buf_table.c` — hash table mapping (RelFileLocator, BlockNumber) → buffer slot
- `freelist.c` — clock-sweep buffer replacement algorithm
Key structs:
- `BufferDesc` (`storage/buf_internals.h`) — one per buffer slot: tag (which page), state (pin count, flags), content lock
- `Buffer` (int32) — index into the buffer array; positive = shared, negative = local, `InvalidBuffer` = 0
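Clock-sweep victim selection is simple enough to sketch directly. This is a simplification of `freelist.c`: the real code packs usage count and pin count into an atomic state word and saturates the usage count at 5:

```c
/* Clock sweep: the hand circles the buffer ring, skipping pinned buffers
 * and decrementing usage counts ("second chance"); the first unpinned
 * buffer found with usage 0 becomes the eviction victim. */
#include <assert.h>

typedef struct
{
    int usage;      /* bumped on access, decayed by the sweep */
    int pinned;     /* pinned buffers are never evicted */
} ToyBufferDesc;

int clock_sweep(ToyBufferDesc *buf, int nbuf, int *hand)
{
    for (;;)
    {
        int idx = *hand;

        *hand = (*hand + 1) % nbuf;
        if (buf[idx].pinned)
            continue;
        if (buf[idx].usage > 0)
        {
            buf[idx].usage--;           /* second chance */
            continue;
        }
        return idx;                     /* victim found */
    }
}

/* Demo: buffer 0 gets a second chance, buffer 1 is pinned,
 * buffer 2 is unpinned with usage 0 -> it is evicted. */
int demo_victim(void)
{
    ToyBufferDesc buf[4] = { {1, 0}, {3, 1}, {0, 0}, {2, 0} };
    int hand = 0;

    return clock_sweep(buf, 4, &hand);
}
```

The caller must guarantee some buffer is evictable, otherwise this loop spins; the real code errors out if every buffer is pinned.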
Thin abstraction over the OS filesystem. Currently only one implementation (md.c = "magnetic disk").
- `smgr.c` — dispatch layer
- `md.c` — actual file open/read/write/sync; manages relation "forks" (main, FSM, VM)
Four lock levels: spinlocks, LWLocks, regular locks, predicate locks.
- `lock.c` — regular (heavyweight) lock table in shared memory; deadlock detection
- `lwlock.c` — lightweight locks; reader/writer semantics; used for short critical sections
- `lmgr.c` — relation-level and tuple-level lock helpers
- `predicate.c` — SSI (serializable snapshot isolation) predicate locks
- `proc.c` — `PGPROC` management; lock wait queues
Key structs:
- `LOCK` — one per distinct locked object in the shared lock table
- `PROCLOCK` — connects a `PGPROC` to a `LOCK`; tracks held/requested modes
- `LWLock` — lightweight lock: `state` word encodes exclusive/shared holders and waiters
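Heavyweight lock compatibility is table-driven: each mode carries a bitmask of the modes it conflicts with. A sketch covering four of the eight real table-lock modes (conflict pairs taken from the documented lock compatibility matrix):

```c
/* Lock-mode conflict check in the style of lock.c's conflict table:
 * modes conflict iff the requested mode's mask has the held mode's bit. */
#include <assert.h>

enum { ACCESS_SHARE, ROW_EXCLUSIVE, SHARE, ACCESS_EXCLUSIVE, NMODES };

static const int conflict_tab[NMODES] = {
    /* ACCESS_SHARE     */ (1 << ACCESS_EXCLUSIVE),
    /* ROW_EXCLUSIVE    */ (1 << SHARE) | (1 << ACCESS_EXCLUSIVE),
    /* SHARE            */ (1 << ROW_EXCLUSIVE) | (1 << ACCESS_EXCLUSIVE),
    /* ACCESS_EXCLUSIVE */ (1 << ACCESS_SHARE) | (1 << ROW_EXCLUSIVE) |
                           (1 << SHARE) | (1 << ACCESS_EXCLUSIVE),
};

/* Like DoLockModesConflict(); symmetric by construction of the table. */
int modes_conflict(int held, int requested)
{
    return (conflict_tab[requested] & (1 << held)) != 0;
}
```

So a plain SELECT (ACCESS SHARE) coexists with INSERT/UPDATE (ROW EXCLUSIVE), but nothing coexists with ACCESS EXCLUSIVE — which is why DDL blocks everything.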
Every heap and index page is 8 KB (default BLCKSZ). The page has a header, an item pointer array growing forward, free space in the middle, and tuple data growing backward.
Key structs:
- `PageHeaderData` (`storage/bufpage.h`) — `pd_lsn` (WAL position), `pd_lower`/`pd_upper` (free space boundaries), `pd_special` (index-specific area offset), `pd_pagesize_version`, `pd_checksum`
- `ItemIdData` — 4-byte item pointer on the page: offset, length, flags (normal/redirect/dead/unused)
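The `pd_lower`/`pd_upper` bookkeeping is the classic slotted-page design. A sketch of the arithmetic (header size and alignment are simplified; the real `PageAddItem()` also handles alignment padding and line-pointer reuse):

```c
/* Slotted page: line pointers grow forward from pd_lower, tuple data grows
 * backward from pd_upper; the page is full when they would cross. */
#include <assert.h>

#define BLCKSZ    8192
#define ITEMID_SZ 4            /* one line pointer */
#define HEADER_SZ 24           /* illustrative page header size */

typedef struct { int pd_lower, pd_upper; } ToyPage;

void page_init(ToyPage *p)
{
    p->pd_lower = HEADER_SZ;
    p->pd_upper = BLCKSZ;
}

/* Returns the 1-based item offset number, or 0 if the tuple doesn't fit. */
int page_add_item(ToyPage *p, int tuple_len)
{
    if (p->pd_upper - tuple_len < p->pd_lower + ITEMID_SZ)
        return 0;                       /* boundaries would cross */
    p->pd_lower += ITEMID_SZ;           /* new line pointer at the front */
    p->pd_upper -= tuple_len;           /* tuple data at the back */
    return (p->pd_lower - HEADER_SZ) / ITEMID_SZ;
}

int page_free_space(const ToyPage *p)
{
    return p->pd_upper - p->pd_lower;
}

/* Demo: two 100-byte tuples cost 2 * (100 + 4) bytes of free space. */
int demo_free_after_two(void)
{
    ToyPage p;

    page_init(&p);
    page_add_item(&p, 100);
    page_add_item(&p, 100);
    return page_free_space(&p);
}
```

The indirection through line pointers is what lets VACUUM compact tuple data without changing the (block, offset-number) TIDs that indexes store.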
A separate file fork per relation storing a free-space estimate for each heap page (one byte per page, in units of 1/256 of the page size). Used by the heap insert path to find pages with room.
Shared memory setup, semaphores, signal handling, PGPROC latch (condition variable for waiting processes).
Infrastructure for async read/write, enabling prefetching and parallel I/O.
The system catalogs are regular tables (pg_class, pg_attribute, pg_type, pg_proc, etc.) stored in the database. This directory has the C code that creates and manipulates them.
- `pg_class.c`, `pg_attribute.c`, `pg_type.c`, `pg_proc.c` — manipulation routines for each catalog
- `heap.c` — `heap_create_with_catalog()`: creating a new table updates many catalogs atomically
- `index.c` — index creation
- `dependency.c` — dependency tracking (DROP CASCADE logic)
- `namespace.c` — schema/namespace search path resolution
Key structs:
- `Form_pg_class` — C struct mapping a `pg_class` row (table/index/view metadata)
- `Form_pg_attribute` — column descriptor from `pg_attribute`
- `Form_pg_type` — type descriptor from `pg_type`
DDL and utility commands: CREATE TABLE, ALTER TABLE, COPY, VACUUM, ANALYZE, CLUSTER, etc.
Key files:
- `tablecmds.c` — CREATE/ALTER/DROP TABLE (very large file)
- `vacuum.c` — VACUUM entry point; the heap-specific pass (dead tuple removal, FSM update, visibility map update) lives in `vacuumlazy.c`, now under `access/heap/`
- `analyze.c` — ANALYZE; sampling and statistics collection into `pg_statistic`
- `copy.c` — COPY TO/FROM; bulk data load/export
- `indexcmds.c` — CREATE INDEX, REINDEX
- `explain.c` — EXPLAIN output formatting
- `sequence.c` — CREATE SEQUENCE, nextval/currval/setval
- `trigger.c` — trigger management and firing
- `event_trigger.c` — DDL event triggers
All parse/plan/executor tree nodes inherit from Node (a struct whose first field is NodeTag). This directory provides generic tree operations.
- `nodes.c` — `newNode()`, `nodeTag()`
- `copyfuncs.c` — `copyObject()`: deep copy any node tree
- `equalfuncs.c` — `equal()`: structural equality of two node trees
- `outfuncs.c` — `nodeToString()`: serialize a node to text (used for debugging and for storing rules/views in the catalogs)
- `readfuncs.c` — `stringToNode()`: deserialize (used when reading stored rules/views back)
- `list.c` — the ubiquitous `List` type (an expandable array since PG 13, with a list-style API)
Key structs:
- `Node` — base: `{ NodeTag type; }`
- `List` (`nodes/pg_list.h`) — generic list; elements can be pointers, ints, or OIDs
- `Bitmapset` (`nodes/bitmapset.h`) — compact set of non-negative integers (used for column/rel sets)
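The `Node` discipline is worth seeing in miniature: because every node struct's first field is its tag, any pointer can be cast to the base type to find out what it really is — which is exactly how `nodeTag()` and `IsA()` work. A standalone sketch with invented `Toy*` names:

```c
/* Tagged node structs: the first field identifies the concrete type,
 * enabling generic tree-walking code to dispatch on the tag. */
#include <assert.h>

typedef enum { T_ToyVar, T_ToyConst } ToyNodeTag;

typedef struct { ToyNodeTag type; } ToyNode;                  /* base */

typedef struct { ToyNodeTag type; int varattno; }   ToyVar;   /* column ref */
typedef struct { ToyNodeTag type; int constvalue; } ToyConst; /* literal */

#define TOY_TAG(nodeptr)    (((const ToyNode *) (nodeptr))->type)
#define TOY_ISA(nodeptr, t) (TOY_TAG(nodeptr) == T_##t)

/* Generic evaluator: inspect the tag, then cast to the concrete struct. */
int toy_eval(const void *node, int var_value)
{
    switch (TOY_TAG(node))
    {
        case T_ToyVar:
            return var_value;
        case T_ToyConst:
            return ((const ToyConst *) node)->constvalue;
    }
    return 0;
}

int demo_eval_const(void)
{
    ToyConst c = { T_ToyConst, 42 };

    return TOY_ISA(&c, ToyConst) ? toy_eval(&c, 7) : -1;
}
```

`copyObject()`, `equal()`, and `nodeToString()` are all large switch statements over this same tag.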
Caches frequently-accessed catalog rows to avoid repeated catalog scans.
- `catcache.c` — `SearchSysCache()`: hash cache of individual catalog tuples, invalidated by SI messages
- `relcache.c` — `RelationIdGetRelation()`: relation descriptor cache; holds `RelationData` structs
- `inval.c` — cache invalidation via a shared-memory message queue (SI = shared invalidation)
Key structs:
- `RelationData` (`utils/rel.h`) — relation descriptor: file locator, TupleDesc, index list, trigger list, locking info, rule info — the central struct for any open table or index
- `TupleDesc` (`access/tupdesc.h`) — array of `Form_pg_attribute`; describes a row's columns
PostgreSQL almost never calls malloc directly. Everything goes through memory contexts, which are arena allocators arranged in a tree; deleting a context frees all of its children too. Key context lifetimes: `TopMemoryContext` (lives forever), `MessageContext` (per client message/query), `PortalContext` (per portal). `CurrentMemoryContext` is the global pointer to whichever context `palloc()` currently allocates from.
- `aset.c` — `AllocSet`: the default allocator (free lists with power-of-two size classes, carved out of larger malloc'd blocks)
- `mcxt.c` — context tree management: `AllocSetContextCreate()`, `MemoryContextSwitchTo()`, `palloc()`, `pfree()`
Key struct:
- `MemoryContextData` (`nodes/memnodes.h`) — base struct for all context types; forms a parent/child tree
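The essential contract — allocations belong to a context, and deleting a context recursively frees its children — fits in a short sketch. This toy version tracks chunks per context but omits all of the AllocSet block/freelist machinery:

```c
/* Toy memory contexts: a parent/child tree where deleting a context frees
 * every allocation in it and in all of its descendants. */
#include <assert.h>
#include <stdlib.h>

typedef struct ToyContext
{
    struct ToyContext *parent, *first_child, *next_sibling;
    void **chunks;              /* every allocation made in this context */
    int nchunks;
} ToyContext;

ToyContext *ctx_create(ToyContext *parent)
{
    ToyContext *c = calloc(1, sizeof(ToyContext));

    c->parent = parent;
    if (parent)
    {
        c->next_sibling = parent->first_child;
        parent->first_child = c;
    }
    return c;
}

/* palloc-style: the chunk is owned by the context, not the caller. */
void *ctx_alloc(ToyContext *c, size_t sz)
{
    c->chunks = realloc(c->chunks, (c->nchunks + 1) * sizeof(void *));
    c->chunks[c->nchunks] = malloc(sz);
    return c->chunks[c->nchunks++];
}

/* MemoryContextDelete-style: children first, then own chunks; returns the
 * total number of chunks freed, for demonstration. */
int ctx_delete(ToyContext *c)
{
    int freed = 0;

    while (c->first_child)
    {
        ToyContext *child = c->first_child;

        c->first_child = child->next_sibling;
        freed += ctx_delete(child);
    }
    freed += c->nchunks;
    for (int i = 0; i < c->nchunks; i++)
        free(c->chunks[i]);
    free(c->chunks);
    free(c);
    return freed;
}

/* Demo: 2 chunks in a "query" context, 1 in its child -> one delete, 3 freed. */
int demo_freed_count(void)
{
    ToyContext *query = ctx_create(NULL);
    ToyContext *per_tuple = ctx_create(query);

    ctx_alloc(query, 64);
    ctx_alloc(query, 64);
    ctx_alloc(per_tuple, 64);
    return ctx_delete(query);
}
```

This is why error handling is simple: on abort, deleting the right context reclaims everything the failed query allocated, with no per-allocation cleanup code.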
`elog()` / `ereport()` macros build an error report in an `ErrorData` struct and `longjmp` to the nearest `PG_CATCH()` handler.
- `elog.c` — `errstart()`, `errfinish()`
- `pg_lzcompress.c` — TOAST compression
How PostgreSQL calls SQL functions from C. The V1 call convention wraps arguments in a `FunctionCallInfoBaseData` struct.
- `fmgr.c` — `fmgr_info()` (look up a function), `FunctionCallInvoke()` (call it)
Key structs:
- `FmgrInfo` — cached function lookup: function pointer, function OID, nargs
- `FunctionCallInfoBaseData` — call frame: `flinfo`, `nargs`, `args[]` (each a `NullableDatum`)
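The V1 convention's shape — arguments as (value, isnull) pairs in a call frame, invoked through a cached pointer — can be sketched standalone. Note one simplification: here the callee checks for NULL itself, whereas real strict functions never see NULLs because fmgr short-circuits the call first. All `Toy*` names are invented:

```c
/* V1-style calling convention in miniature: a call frame of NullableDatum
 * pairs, an output isnull flag, and invocation through a function pointer. */
#include <assert.h>

typedef struct { long value; int isnull; } ToyNullableDatum;

typedef struct ToyFcinfo
{
    int nargs;
    ToyNullableDatum args[2];
    int isnull;                /* set by the callee to return SQL NULL */
} ToyFcinfo;

typedef long (*ToyPGFunction)(ToyFcinfo *fcinfo);

/* An int4pl-style addition that propagates NULL (strictness inlined
 * here for illustration; real fmgr handles it before the call). */
long toy_int4pl(ToyFcinfo *fcinfo)
{
    if (fcinfo->args[0].isnull || fcinfo->args[1].isnull)
    {
        fcinfo->isnull = 1;
        return 0;
    }
    return fcinfo->args[0].value + fcinfo->args[1].value;
}

/* FunctionCallInvoke-style: build the frame, call through the pointer. */
long toy_call2(ToyPGFunction fn, long a, long b)
{
    ToyFcinfo fcinfo = { 2, { { a, 0 }, { b, 0 } }, 0 };

    return fn(&fcinfo);
}
```

The indirection is what lets the executor call built-ins, PL/pgSQL functions, and C extensions through one uniform interface.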
Implementations of all built-in types: int, float, text, bytea, date, timestamp, numeric, json, jsonb, array, range, etc. Each type provides input/output functions and operators.
- `tuplesort.c` — external merge sort used by `ORDER BY`, `CREATE INDEX`, `DISTINCT`
- `tuplestore.c` — in-memory (spill-to-disk) tuple buffer used by CTEs, window functions
- `logtape.c` — virtual tape files for merge-sort spill
- `snapmgr.c` — `GetTransactionSnapshot()`, `GetLatestSnapshot()`, snapshot export/import
- `combocid.c` — combo command IDs for tuples inserted and deleted in the same transaction
- `guc.c`, `guc_tables.c` — Grand Unified Configuration: all `postgresql.conf` parameters, `SET`, `SHOW`
- `pgstat.c` — cumulative statistics infrastructure
- `wait_event.c` — wait event definitions used in `pg_stat_activity.wait_event`
Tracks resources (buffer pins, lock holds, file handles, tuplestore refs) that must be released on error. Arranged in a tree parallel to transaction/portal structure.
- `walsender.c` — streams WAL to standbys; handles the replication protocol
- `walreceiver.c` — receives WAL on a standby; writes it to local WAL files
- `logical/` — logical decoding (change data capture): decodes WAL back into row changes
- `slot.c` — replication slots: prevent WAL removal until a consumer has caught up
Implements the PostgreSQL frontend/backend protocol (version 3). Handles authentication, SSL, message framing.
- `auth.c` — authentication methods (md5, scram, gss, cert, …)
- `pqcomm.c` — socket I/O, message send/receive
- `pqformat.c` — `pq_sendint()`, `pq_sendstring()`, etc.
LLVM-based JIT for expression evaluation and tuple deforming. Activated for long-running queries where the compilation cost is worth it.
- `jit.c` — provider interface
- `llvm/` — LLVM backend: `llvmjit.c`, `llvmjit_expr.c` (compile `ExprState` to native code)
Tracks multi-column statistics (ndistinct, functional dependencies, MCV lists) stored in pg_statistic_ext. Used by the planner to get better row count estimates when columns are correlated.
Headers are organized to mirror `src/backend/`:
| Header | What it defines |
|---|---|
| `nodes/nodes.h` | `NodeTag` enum; `makeNode()` macro |
| `nodes/parsenodes.h` | `Query`, `RangeTblEntry`, `TargetEntry`, all parse tree node types |
| `nodes/primnodes.h` | Expression nodes: `Var`, `Const`, `FuncExpr`, `OpExpr`, `SubLink`, etc. |
| `nodes/plannodes.h` | `PlannedStmt`, `Plan` and all plan node subtypes |
| `nodes/pathnodes.h` | `PlannerInfo`, `RelOptInfo`, `Path` and subtypes |
| `nodes/execnodes.h` | `EState`, `PlanState` subtypes, `ExprContext`, `TupleTableSlot` |
| `access/htup_details.h` | `HeapTupleHeaderData` (on-disk tuple header layout) |
| `access/xact.h` | Transaction state, isolation levels, `StartTransaction()` etc. |
| `storage/bufmgr.h` | `ReadBuffer()`, `Buffer` type, buffer access API |
| `storage/bufpage.h` | `PageHeaderData`, `ItemIdData`, page layout macros |
| `storage/lock.h` | `LOCK`, `PROCLOCK`, lock mode constants |
| `storage/proc.h` | `PGPROC` — per-backend shared memory struct |
| `utils/rel.h` | `RelationData` — relation descriptor |
| `utils/snapshot.h` | `SnapshotData` — MVCC snapshot |
| `utils/memutils.h` | Memory context macros (`palloc`, `pfree`, `MemoryContextSwitchTo`) |
| Struct | File | Description |
|---|---|---|
| `Query` | `nodes/parsenodes.h` | Output of semantic analysis; input to planner |
| `PlannedStmt` | `nodes/plannodes.h` | Output of planner; input to executor |
| `PlannerInfo` | `nodes/pathnodes.h` | Planner's per-query-level working state |
| `RelOptInfo` | `nodes/pathnodes.h` | A base or join relation during planning |
| `Path` | `nodes/pathnodes.h` | An access strategy under consideration |
| `EState` | `nodes/execnodes.h` | Executor's per-query state |
| `PlanState` | `nodes/execnodes.h` | Executor's per-plan-node state |
| `TupleTableSlot` | `nodes/execnodes.h` | An in-memory tuple being processed |
| `HeapTupleHeaderData` | `access/htup_details.h` | On-disk tuple header (MVCC fields) |
| `RelationData` | `utils/rel.h` | Relation descriptor (relcache entry) |
| `TupleDesc` | `access/tupdesc.h` | Column definitions for a relation or expression |
| `BufferDesc` | `storage/buf_internals.h` | Buffer pool slot descriptor |
| `PGPROC` | `storage/proc.h` | Per-backend shared memory block |
| `SnapshotData` | `utils/snapshot.h` | MVCC visibility snapshot |
| `MemoryContextData` | `nodes/memnodes.h` | Memory context (arena allocator) |
| `List` | `nodes/pg_list.h` | Generic list type |
| Topic | Start here |
|---|---|
| How a SELECT executes end-to-end | `tcop/postgres.c:exec_simple_query()` |
| How the planner picks a join order | `optimizer/README`, then `optimizer/path/allpaths.c` |
| How cost estimation works | `optimizer/path/costsize.c` |
| How MVCC visibility is checked | `access/heap/heapam_visibility.c` |
| How a page is read from disk | `storage/buffer/bufmgr.c:ReadBuffer()` |
| How a WAL write works | `access/transam/xlog.c:XLogInsert()` |
| How a B-tree index scan works | `access/nbtree/nbtsearch.c` |
| How locks prevent conflicts | `storage/lmgr/lock.c:LockAcquire()` |
| How memory is managed | `utils/mmgr/aset.c`, `utils/mmgr/mcxt.c` |
| How expressions are evaluated | `executor/execExprInterp.c` |
| How VACUUM removes dead tuples | `access/heap/vacuumlazy.c` |
| How replication works | `replication/walsender.c`, `replication/README` |
- **MVCC** — Every row stores `xmin` (inserting XID) and `xmax` (deleting XID). A snapshot's visibility check reads these against its own `xmin`/`xmax`/`xip[]`. No locks needed for reads.
- **Node trees** — Parse trees, plan trees, and expression trees are all trees of `Node`-derived structs. The `NodeTag` in the first field tells you what struct it really is. Use `IsA(node, Type)` to check.
- **Memory contexts** — Never call `free()` directly. Every allocation is in a context. When a query ends, `MemoryContextDelete(MessageContext)` frees everything at once.
- **WAL-first** — Every change to a data page is logged: the WAL record must reach disk before the modified page can be flushed, which is what makes crash recovery possible. `XLogInsert()` + `MarkBufferDirty()` is the pattern.
- **The buffer pin** — Before touching any page you must `ReadBuffer()` (pin it). The buffer manager won't evict a pinned buffer. `ReleaseBuffer()` drops the pin.
- **Resource owners** — Pin counts, lock holds, etc. are registered with a `ResourceOwner`. On `ERROR`, resource owner cleanup releases everything even if the code didn't explicitly do so.
- **Demand-pull execution** — `ExecProcNode(node)` returns one tuple. The top of the tree is called by the portal loop; it calls its children; leaves call the storage layer. No tuple is computed until needed.