@alezzigo
Created January 5, 2026 03:37
544,000 TPS - Will Alipay Crash?

544,000 payments per second and a world TPC-C record of 707 million tpmC

Singles' Day, 11.11. At 23:59:50, in the Ant Group and Alibaba headquarters in Hangzhou, hundreds of engineers are staring at dashboard screens. This is real money: billions of USD for Alibaba will be decided in the next few minutes.

00:00:00: The festival starts. Roughly a billion people press the pay button at the same time. Traffic jumps from 50k to 544k TPS in about 3 seconds. If something breaks, Alibaba can lose billions. But the dashboards stay green. The peak is 544,000 transactions per second.

00:01:08:

  • first $1B in 68 seconds
  • $10B in less than 30 minutes

By the end of the day, the record numbers:

  • 544k TPS peak (world record at that time)
  • 1.3B orders
  • $34B revenue

For comparison: Visa at its most stressed handles about 65k TPS, and PayPal is around 10k TPS.

System scale around that time:

  • 100 database clusters
  • 20,000 servers
  • 1 million CPU cores
  • hundreds of petabytes of storage

I'm an engineer at Alipay. After some internal sharing sessions and reading internal engineering blogs, I feel the Alipay architecture is super interesting but not well known outside. Most docs are in Chinese, so I'll try to summarize it in a simple way: why doesn't Alipay use Oracle, MySQL, or Postgres?

The real problem on peak day is not a lack of servers, but the database. A payment system needs strong consistency, RPO ≈ 0, a very low RTO, and the ability to survive an insane write spike.

Oracle is very strong as a single instance and with RAC. But when TPS reaches hundreds of thousands, shared storage and global locks become the bottleneck.

MySQL/PostgreSQL can shard, but cross-shard transactions either live in the app layer or use 2PC. That is very hard to keep stable when traffic jumps 10x in a few seconds, and manual sharding across tens of thousands of nodes is an ops nightmare.

NoSQL (MongoDB/Cassandra) scales well but leans toward eventual consistency; for payments that is too risky.

So Alipay built OceanBase from scratch for distributed transactions. In 2020 it also set the TPC-C record of 707M tpmC, beating the big traditional databases.

There is no free lunch at 544k TPS. To carry this number, Alipay accepts some tradeoffs:

| Tradeoff | What you pay | Why it matters at peak |
| --- | --- | --- |
| Infra burden | 3-5 replicas for Paxos; storage + network cost 3-5x | RPO 0 and surviving a DC failure |
| Ops complexity | debugging a distributed system (network/leader/follower) is hard | needs strong monitoring and automation |

Compare with other NewSQL:

| NewSQL | Core idea | Good at | Tradeoff |
| --- | --- | --- | --- |
| Google Spanner | global timestamps + synchronized clock infra | global consistency across regions | depends on Google infra |
| TiDB | Raft + MySQL compatibility | open source ecosystem, easy to deploy | performance tuning and hotspots |
| OceanBase | Paxos + LSM + deep compression | huge spikes, high write throughput | complexity and infra cost |

Overall system architecture

Diagram: Overall system architecture

flowchart LR
  U[Client/App] --> P[OBProxy\nSQL routing]
  P --> S1[OBServer\nSQL/Txn/Storage]
  P --> S2[OBServer\nSQL/Txn/Storage]
  P --> S3[OBServer\nSQL/Txn/Storage]
  RS[RootService\nMetadata/Health/Rebalance] -. control plane .-> P
  RS -. control plane .-> S1
  RS -. control plane .-> S2
  RS -. control plane .-> S3

Big picture: OceanBase runs as a shared-nothing cluster. Each OBServer node is independent, with its own SQL engine, transaction engine and storage. OBProxy sits in front and routes queries to the correct node based on partition keys. RootService is the brain: it manages metadata, watches health, and does load balancing.

A detailed explanation of each component:

  1. OBProxy (SQL routing layer): This is the first layer all requests pass through. Its jobs:
  • parse the SQL query to find the partition key (usually user_id)
  • hash the partition key to determine which OBServer node holds the data
  • route the request to that node
  • handle connection pooling to avoid too many connections

Real example:

INSERT INTO payments (user_id, amount, merchant_id) 
VALUES (293847, 999, 12345);

OBProxy sees user_id = 293847 and hashes it: 293847 mod 1000 = 847. Partition 847 lives on OBServer node 47, so the request is routed to node 47. Simple as that.
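
A minimal sketch of this routing logic in Python, assuming a plain modulo hash and a toy in-memory partition-to-node map (the real routing table is maintained by RootService and is far richer):

# Hypothetical sketch of OBProxy-style routing: hash the partition key,
# then look the partition up in a routing table.
PARTITION_COUNT = 1000
routing_table = {847: 47, 523: 23}  # partition_id -> node_id (toy data)

def route(user_id: int) -> int:
    partition_id = user_id % PARTITION_COUNT   # 293847 % 1000 = 847
    return routing_table[partition_id]         # partition 847 lives on node 47

print(route(293847))  # -> 47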

  2. OBServer node (compute + storage): These are the real workhorses. Each node has:
  • SQL engine: parses and optimizes queries. For a complex query joining 5 tables, the SQL engine decides the join order, which indexes to use, and how many rows to scan.

  • Transaction engine: manages MVCC (Multi-Version Concurrency Control) and 2PC.

    MVCC example:

    • timestamp 1000: user A balance = $100
    • timestamp 1001: a transaction is running and writes user A balance = $50 (not committed yet)
    • timestamp 1002: user B reads user A's balance
    • the system returns the version at timestamp 1000 ($100) because txn 1001 has not committed
    • timestamp 1003: txn 1001 commits
    • timestamp 1004: user C reads the balance -> gets $50
  • Storage engine: an LSM-Tree stores the actual data. Explained in its own section below.

  3. RootService (the brain): RootService doesn't handle user traffic. It does the admin work:
  • track metadata: which node holds partition 847? which node is its leader?
  • monitor health: node 47 at 90% CPU? it needs rebalancing
  • orchestrate migrations: move partition 123 from node A to node B
  • trigger elections: node 47 died, elect a new leader for its partitions

Rebalance example:

  • RootService sees node 47 running at 95% CPU (hot)
  • node 23 is at only 30% CPU (cold)
  • RootService decides to migrate partition 847 from node 47 to node 23
  • steps:
    1. create a new replica of partition 847 on node 23
    2. sync data from node 47 to node 23
    3. wait for node 23 to catch up
    4. switch the leader from node 47 to node 23
    5. update metadata
    6. delete the old replica on node 47
  • total time: ~30 seconds, zero downtime for users
  4. Multi-zone deployment:
flowchart TB
  %% example: 5 replicas for partition 847 across 3 zones
  subgraph Z1[Zone 1 Beijing]
    BJ1[Beijing DC1\nLeader p847]
    BJ2[Beijing DC2\nFollower p847]
  end
  subgraph Z2[Zone 2 Hangzhou]
    HZ1[Hangzhou DC1\nFollower p847]
    HZ2[Hangzhou DC2\nFollower p847]
  end
  subgraph Z3[Zone 3 Shenzhen]
    SZ1[Shenzhen DC1\nFollower p847]
  end

  BJ1 -->|redo log ts=1234567890| BJ2
  BJ1 -->|redo log ts=1234567890| HZ1
  BJ1 -->|redo log ts=1234567890| HZ2
  BJ1 -->|redo log ts=1234567890| SZ1

  Q[Quorum rule 3-of-5\ncommit when any 2 followers ACK] -.-> BJ1

5 datacenters in 3 zones:

| Zone | Datacenters | Purpose |
| --- | --- | --- |
| Zone 1 | Beijing DC1, DC2 | local quorum + survive 1 DC loss |
| Zone 2 | Hangzhou DC1, DC2 | cross-zone quorum + disaster recovery |
| Zone 3 | Shenzhen DC1 | third zone for HA |

Each partition has 5 replicas, spread across zones:

| Replica role | Location | Example for partition 847 |
| --- | --- | --- |
| Leader | Zone 1 | Beijing DC1 |
| Follower | Zone 1 | Beijing DC2 |
| Follower | Zone 2 | Hangzhou DC1 |
| Follower | Zone 2 | Hangzhou DC2 |
| Follower | Zone 3 | Shenzhen DC1 |

If all of Beijing loses power (2 datacenters down), there are still 3 replicas in Hangzhou and Shenzhen. The system keeps running.

With 100 OceanBase clusters on 20,000 servers and 1M CPU cores, this is not about raw power; it's about orchestrating all these resources so they work like one single coherent database.


LSM-Tree storage engine architecture

flowchart TD
  W[Write: INSERT payment_id=12345] --> MT[Active MemTable\nrows: 12345 -> v1000]
  MT -->|00:00:00.050 freeze| FMT[Frozen MemTable\nimmutable snapshot]
  FMT -->|00:00:00.200 dump| MINI[Mini SSTable\nfile: mini_0007.sst]
  MINI -->|00:00:05.000 minor merge| MINOR[Minor SSTable\nfile: minor_0012.sst]
  MINOR -->|03:10:00 major compaction\nincremental| MAJOR[Major SSTable\nfile: major_2020-11-12.sst]

  %% example: read payment_id=12345 at 00:00:02.000
  R[Read: SELECT payment_id=12345] --> MT
  R --> FMT
  R --> RC[Row cache\nhit maybe]
  R --> BF[Bloom filter\ncheck key=12345]
  BF --> MINI
  BF --> MINOR
  BF --> MAJOR
Example in 1 page (super simplified):

00:00:00.001 write comes in:
  MemTable: {12345 -> amount=100, version=1000}

00:00:00.050 MemTable freeze (too full):
  Frozen MemTable: snapshot version=1000

00:00:00.200 dump to disk:
  mini_0007.sst contains key range ... includes 12345

00:00:05.000 minor compaction:
  merge mini_0001..mini_0050 -> minor_0012.sst

03:10:00 major compaction (incremental):
  reuse old macroblocks, rewrite only changed blocks

00:00:02.000 read comes in:
  check MemTable -> miss
  check Frozen MemTable -> miss
  check bloom -> points to minor/major file
  read 1 macroblock -> return row

The LSM-Tree is the heart of the storage engine. Let's break down each stage:

LSM stages in 1 table

| Stage | Trigger | What happens | Typical size / note |
| --- | --- | --- | --- |
| Write to MemTable | every write | keep hot rows + MVCC versions in memory | fast, no disk |
| Freeze | MemTable full | active -> frozen, new active created | 64MB-256MB |
| Flush to Mini SSTable | background | dump frozen memtable to disk | many small files |
| Minor compaction | too many mini files | merge many mini -> 1 minor, keep newest versions | a minor is about 2GB |
| Major compaction | daily, during low traffic | merge minors + old major -> new major | incremental macroblock reuse |
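
To make the stages concrete, here is a toy in-memory sketch of the write path (active MemTable, freeze when full, flush frozen memtables to sorted runs). The class, thresholds and sizes are invented and do not reflect real OceanBase internals:

# Toy LSM write path: put -> freeze when "full" -> flush frozen memtables
# to sorted runs; reads check newest data first.
from dataclasses import dataclass, field

MEMTABLE_LIMIT = 4  # rows, tiny on purpose

@dataclass
class LSMStore:
    active: dict = field(default_factory=dict)
    frozen: list = field(default_factory=list)    # immutable snapshots
    sstables: list = field(default_factory=list)  # "mini" sorted runs

    def put(self, key, value):
        self.active[key] = value
        if len(self.active) >= MEMTABLE_LIMIT:
            self.frozen.append(self.active)        # freeze
            self.active = {}                       # new active memtable

    def flush(self):
        # background job: dump each frozen memtable as a sorted run
        while self.frozen:
            memtable = self.frozen.pop(0)
            self.sstables.append(sorted(memtable.items()))

    def get(self, key):
        # newest first: active -> frozen -> sorted runs
        if key in self.active:
            return self.active[key]
        for mt in reversed(self.frozen):
            if key in mt:
                return mt[key]
        for run in reversed(self.sstables):
            for k, v in run:
                if k == key:
                    return v
        return None

store = LSMStore()
for i in range(6):
    store.put(f"payment:{i}", {"amount": 100 + i})
store.flush()
print(store.get("payment:2"))  # {'amount': 102}
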
flowchart LR
  %% example: MemTable internal indexes
  subgraph MT[MemTable]
    BT[B-Tree index\nrange query by ts]
    H[Hash index\npoint query by payment_id]
    MV[MVCC versions\nts=1000,1001,...]
  end
  W1[INSERT payment_id=1\nts=1000] --> MT
  W2[UPDATE payment_id=1\nts=1001] --> MT
  R1[SELECT payment_id=1\nread ts=1000] --> MV

Concrete example as table

| Time | Operation | Result (simplified) |
| --- | --- | --- |
| 1000 | INSERT payment_id=1 amount=999 | version 1000 stored |
| 1001 | UPDATE payment_id=1 amount=888 | version 1001 stored, version 1000 kept |
| 1002 | read at ts=1000 | returns amount 999 (the old version) |
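
A toy sketch of the MVCC snapshot-read rule behind this table: each key keeps (commit_ts, value) versions, and a read at ts returns the newest committed version with commit_ts <= ts. Purely illustrative:

# Toy MVCC snapshot read matching the table above.
versions = {
    "payment_id=1": [(1000, {"amount": 999}), (1001, {"amount": 888})],
}

def snapshot_read(key, read_ts):
    visible = [value for ts, value in versions[key] if ts <= read_ts]
    return visible[-1] if visible else None

print(snapshot_read("payment_id=1", 1000))  # {'amount': 999} (old version)
print(snapshot_read("payment_id=1", 1002))  # {'amount': 888}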

Read path as a decision flow

flowchart TD
  %% example: point query payment_id=12345
  Q[Query\nSELECT payment_id=12345] --> MT2[Active MemTable\n0.1ms]
  MT2 -->|miss| FMT2[Frozen MemTables\n0.2ms]
  FMT2 -->|miss| RC2[Row cache\n0.5ms]
  RC2 -->|miss| BF2[Bloom filter\nwhich SSTable may have key]
  BF2 --> MINI2[Mini SSTables\nmaybe 10-50 files]
  BF2 --> MINOR2[Minor SSTables]
  BF2 --> MAJOR2[Major SSTable]
  MINI2 --> OUT2[Return row]
  MINOR2 --> OUT2
  MAJOR2 --> OUT2

Why incremental major compaction matters (example)

| Item | Number |
| --- | --- |
| Old major size | 100GB |
| Changes from minors | 10GB |
| Macroblock size | 2MB |
| Macroblocks rewritten | 1000 blocks, about 2GB |
| Macroblocks reused | 49,000 blocks |

With hundreds of petabytes of storage and tens of billions of SQL queries, optimizing every read and write operation is critical.


Transaction flow with Paxos replication

sequenceDiagram
  autonumber
  %% example: user_id=293847 transfer $100 to user_id=100523 at 00:00:00
  participant U as User (uid=293847)
  participant P as OBProxy
  participant L as Leader node 47 (p847)
  participant F1 as Follower 1 (Beijing DC2)
  participant F2 as Follower 2 (Hangzhou DC1)
  participant F3 as Follower 3 (Hangzhou DC2)
  participant F4 as Follower 4 (Shenzhen DC1)

  U->>P: SQL: UPDATE accounts SET balance=balance-100 WHERE user_id=293847
  P->>L: route by hash(293847)%1000=847
  L->>L: exec + write redo log (ts=1234567890)
  par replicate redo log
    L->>F1: log chunk (ts=1234567890)
    L->>F2: log chunk (ts=1234567890)
    L->>F3: log chunk (ts=1234567890)
    L->>F4: log chunk (ts=1234567890)
  end
  F1-->>L: ACK (2ms)
  F2-->>L: ACK (5ms) (quorum ok)
  L-->>P: commit ok (lat~8ms)
  P-->>U: success
Example timeline:
0ms  user send SQL
1ms  leader write redo log locally
2ms  follower1 ack
5ms  follower2 ack -> quorum 3/5 reached
8ms  proxy return to user

This is the magic that keeps your data from ever being lost, even when servers die.

Detailed example: user A transfers $100 to user B

Step 1: request goes to leader

User A -> OBProxy -> leader replica (partition 847, node 47)
SQL: UPDATE accounts SET balance = balance - 100 WHERE user_id = A

Step 2: leader processing

| Step | What the leader does | Example detail |
| --- | --- | --- |
| 1 | begin txn, assign timestamp | ts=1234567890 |
| 2 | check constraints | enough balance? conflicts? |
| 3 | write to MemTable | user_id=A, balance=900, version=1234567890 |
| 4 | append redo log | UPDATE accounts SET balance=900 WHERE user_id=A |
flowchart TB
  %% example: leader internal pipeline for 1 SQL
  IN[SQL request\nUPDATE accounts...] --> TS[assign ts\n1234567890]
  TS --> CC[constraint check\nbalance>=100]
  CC --> LK[lock row\nuser_id=A]
  LK --> MT[write MemTable\nbalance=900]
  MT --> RL[append redo log\ndurable]
  RL --> OUT[ready to replicate]

Step 3: Paxos replication. The leader sends the redo log to 4 followers in parallel:

Leader (Beijing DC1)
  -> Follower 1 (Beijing DC2): latency 2ms
  -> Follower 2 (Hangzhou DC1): latency 5ms
  -> Follower 3 (Hangzhou DC2): latency 5ms
  -> Follower 4 (Shenzhen DC1): latency 8ms

Each follower:

  1. receive redo log
  2. write to local disk (durable)
  3. apply to local MemTable
  4. send ACK back to leader

Step 4: majority ACK and commit. The leader waits for a majority. With 5 replicas it needs 3 durable copies, and the leader's own local log write counts as one of them:

Time 0ms: Leader send logs
Time 2ms: Follower 1 ACK
Time 5ms: Follower 2 ACK -> majority reached (3/5, counting the leader's local copy)
Time 5ms: Follower 3 ACK
Time 8ms: Follower 4 ACK (late but ok)

As soon as the quorum is reached (at 5ms), the leader:

  1. marks the transaction as COMMITTED
  2. releases locks
  3. returns success to the user
  4. the user sees "payment success"
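
A tiny sketch of this commit rule, assuming the leader's own durable log write counts toward the quorum (the names and numbers simply mirror the timeline above):

# Commit as soon as a majority of the 5 copies is durable.
REPLICAS = 5
QUORUM = REPLICAS // 2 + 1   # 3 of 5

def commit_time(ack_events):
    """ack_events: (ms, who) in arrival order, leader's local write first."""
    acks = 0
    for ms, who in ack_events:
        acks += 1
        if acks >= QUORUM:
            return ms  # the txn can be reported committed at this point

acks = [(1, "leader local log"), (2, "follower1"), (5, "follower2"),
        (5, "follower3"), (8, "follower4")]
print(commit_time(acks))  # -> 5 (ms), matching the timeline above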

Step 5: background async flush. MemTable data is flushed into SSTables later (without blocking the user):

Time 10 seconds later: Frozen MemTable dump into Mini SSTable
Time 1 hour later: Mini SSTables merge into Minor SSTable
Time 3am next day: Minor SSTables merge into Major SSTable

Why does Paxos matter?

Failure scenarios in 2 diagrams

sequenceDiagram
  autonumber
  %% scenario 1: leader dies after sending logs, before returning to user
  participant U as User
  participant L as Leader (Beijing DC1)
  participant F2 as Follower (Hangzhou DC1)
  participant F3 as Follower (Hangzhou DC2)
  participant RS as RootService

  U->>L: write request ts=1234567890
  L->>L: append redo log local
  par replicate
    L->>F2: redo log ts=1234567890
    L->>F3: redo log ts=1234567890
  end
  Note over L: crash at t=3ms
  F2-->>RS: detect leader timeout
  RS->>RS: start election
  RS-->>F2: promote to new leader
  F2-->>U: user retry ok, data preserved
flowchart TB
  %% scenario 2: whole zone 1 Beijing down, quorum still ok
  subgraph Z1[Zone 1 Beijing]
    BJ1[DC1 down]
    BJ2[DC2 down]
  end
  subgraph Z2[Zone 2 Hangzhou]
    HZ1[DC1 follower]
    HZ2[DC2 follower]
  end
  subgraph Z3[Zone 3 Shenzhen]
    SZ1[DC1 follower]
  end

  Q[Quorum 3-of-5\nneed any 3 replicas alive] --> HZ1
  Q --> HZ2
  Q --> SZ1

| Metric | Meaning | Value in this example |
| --- | --- | --- |
| RPO | how much committed data can be lost | 0 |
| RTO | time to recover service | less than 30s |

Partitioning and Partition Groups

flowchart LR
  %% example: 2 users and 1 transfer
  U1[User A\nuser_id=293847\norder_id=O-7731] -->|INSERT order + pay| P
  P -->|hash 293847 % 1000 = 847| N47[OBServer node 47\nPartition 847 leader]

  subgraph N47
    O[orders_p847\nINSERT O-7731]
    Pay[payments_p847\nINSERT P-9911]
    Bal[account_balance_p847\nUPDATE -100]
  end

  N47 -->|local txn: begin/commit| O
  N47 -->|local txn: same node| Pay
  N47 -->|local txn: same node| Bal

  P -->|hash 100523 % 1000 = 523| N23[OBServer node 23\nPartition 523 leader]

  subgraph N23
    Bal2[account_balance_p523\nUPDATE +100]
  end
ASCII quick map:

user_id=293847 -> p847 -> node47
  orders_p847 + payments_p847 + balance_p847  (local)

user_id=100523 -> p523 -> node23
  balance_p523 (other node)

Local txn: update 3 tables on node47, no network.
Cross-partition transfer: node47 + node23 need 2PC.

This is the secret sauce for avoiding distributed transactions.

Problem with traditional distributed databases:

Traditional 2PC cost as a table

| Step | What happens | Example cost |
| --- | --- | --- |
| Prepare phase | coordinator asks 3 servers, waits for ACKs | 5ms x 3 = 15ms |
| Commit phase | coordinator sends COMMIT, waits for ACKs | 5ms x 3 = 15ms |
| Total | network-only overhead | 30ms+ |
sequenceDiagram
  autonumber
  %% example: 3 tables on 3 servers
  participant C as Coordinator
  participant S1 as Server1 orders
  participant S2 as Server2 payments
  participant S3 as Server3 balance

  C->>S1: PREPARE
  C->>S2: PREPARE
  C->>S3: PREPARE
  S1-->>C: YES 5ms
  S2-->>C: YES 5ms
  S3-->>C: YES 5ms
  C->>C: decide COMMIT
  C->>S1: COMMIT
  C->>S2: COMMIT
  C->>S3: COMMIT
  S1-->>C: ACK 5ms
  S2-->>C: ACK 5ms
  S3-->>C: ACK 5ms

OceanBase solution: partition groups

Key insight: if user A buys an item, all data related to user A should live in the same place.

Hash partitioning by user_id:

All 3 tables use the same partition function:

orders table: partitioned by user_id
  Partition 847 has all orders of users hashed into 847

payments table: partitioned by user_id
  Partition 847 has all payments of users hashed into 847

account_balance table: partitioned by user_id
  Partition 847 has all balances of users hashed into 847

Colocation magic: OceanBase colocates partitions with the same partition number onto the same node:

OBServer node 47:
  - orders_p847
  - payments_p847
  - account_balance_p847

OBServer node 23:
  - orders_p523
  - payments_p523
  - account_balance_p523

Then the transaction becomes LOCAL. User 293847 buys an item:

  1. request routed to node 47 (because partition 847 is there)
  2. begin transaction
  3. INSERT into orders_p847 - local, no network
  4. INSERT into payments_p847 - local, no network
  5. UPDATE account_balance_p847 - local, no network
  6. commit
  7. total latency: ~2ms (mostly memory operations)

No network overhead. No 2PC. Just a local transaction inside one node.
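
A toy illustration of why colocation makes this local: all three tables share the same partition function, so one user's rows hash to the same partition and therefore the same node. The partition-to-node map below is invented; in reality RootService maintains it:

# Same partition function for all three tables -> same node -> local txn.
PARTITIONS = 1000
partition_to_node = {847: 47, 523: 23}  # toy routing data

def partition_of(user_id: int) -> int:
    return user_id % PARTITIONS

def nodes_touched(user_id, tables=("orders", "payments", "account_balance")):
    node = partition_to_node[partition_of(user_id)]
    return {table: node for table in tables}

print(nodes_touched(293847))
# {'orders': 47, 'payments': 47, 'account_balance': 47} -> single-node transaction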

Ratio of local vs distributed transactions:

  • 95%+ of transactions are local (same user, same partition)
  • only ~5% are distributed (example: user A transfers money to user B in a different partition)

Cross-partition transaction (distributed 2PC):

Example: user A (partition 847) transfers $100 to user B (partition 523):

Coordinator (node 47):
1. get global timestamp from GTS: 1234567890
2. phase 1 - prepare:
   - tell node 47: prepare decrease A's balance
   - tell node 23: prepare increase B's balance

Node 47:
   - lock user A row
   - write prepare log
   - check A has enough $100?
   - vote: YES

Node 23:
   - lock user B row
   - write prepare log
   - vote: YES

Coordinator:
   - received 2 YES votes
   - decision: COMMIT
   - write commit log

3. phase 2 - commit:
   - tell node 47: COMMIT
   - tell node 23: COMMIT

Node 47:
   - apply changes
   - release locks
   - ACK

Node 23:
   - apply changes
   - release locks
   - ACK

Coordinator:
   - return success to user

Total latency: ~10-15ms (network overhead)

That is still acceptable for 5% of transactions. But the other 95% running locally in ~2ms is why we can reach 544k TPS.
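
A runnable toy sketch of the coordinator logic in the walkthrough above; locks, prepare/commit logs, timeouts and crash recovery are all omitted, and every name here is invented for illustration:

# Toy 2PC: prepare on both participants, commit only if every vote is YES.
class Participant:
    def __init__(self, name, balance):
        self.name, self.balance, self.pending = name, balance, 0

    def prepare(self, delta):
        # the real thing would lock the row and write a prepare log here
        if self.balance + delta < 0:
            return "NO"
        self.pending = delta
        return "YES"

    def commit(self):
        self.balance += self.pending
        self.pending = 0

    def abort(self):
        self.pending = 0

def transfer(coordinator_log, src, dst, amount, txn_id):
    votes = [src.prepare(-amount), dst.prepare(+amount)]          # phase 1
    decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
    coordinator_log.append((decision, txn_id))                    # durable decision
    for p in (src, dst):                                          # phase 2
        p.commit() if decision == "COMMIT" else p.abort()
    return decision

log = []
a = Participant("node47/p847: user A", 900)
b = Participant("node23/p523: user B", 0)
print(transfer(log, a, b, 100, "txn_19283746"), a.balance, b.balance)
# COMMIT 800 100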

Dynamic repartitioning:

If a user becomes super hot (example: a celebrity account with millions of followers buying merch):

user_id = 999888 -> partition 888 becomes HOT
Node hosting partition 888: 95% CPU, overloaded

RootService detects it:
1. split partition 888 into 2 sub-partitions:
   - partition 888a: users hash into [888.0 - 888.5]
   - partition 888b: users hash into [888.5 - 888.9]

2. migrate partition 888b to other node
3. now load spread across 2 nodes
4. update routing table in OBProxy

Automatic, zero downtime, zero data loss.


Dynamic Load Balancing in Peak Traffic

sequenceDiagram
  autonumber
  %% example: hot partition 123 at 00:00 peak
  participant N47 as Node47 (p123 leader)
  participant RS as RootService
  participant PX as OBProxy
  participant N23 as Node23 (cold)

  loop every 1s
    N47-->>RS: hb cpu=92 mem=85 io=2.1GB/s\np123 qps=150k cpu=95
    N23-->>RS: hb cpu=25 mem=40 io=0.2GB/s\np523 qps=1k cpu=2
  end

  RS->>RS: detect p123 hot (qps spike)
  RS-->>N47: action#1 increase quota (cpu 20%->40%, mem 4G->8G)

  alt still hot after 10s
    RS-->>N23: action#2 leader move plan\nprepare follower promote
    RS-->>N47: demote leader p123 -> follower
    RS-->>N23: promote follower p123 -> leader
  else too hot for 1 leader
    RS-->>N47: action#3 split p123 by merchant_id=88888
    RS-->>N23: create p123b (merchant=88888) + migrate
    RS-->>PX: update route: merchant=88888 -> p123b@node23
  end

  RS-->>PX: refresh routing meta
ASCII example of split rule:

Before:
  p123: merchant_id=all merchants

After split:
  p123a: merchant_id != 88888  (20% traffic)
  p123b: merchant_id == 88888  (80% traffic, isolated)

Route change in proxy:
  if merchant_id==88888 -> node23/p123b
  else -> node47/p123a

Singles' Day is not uniform. Some items are crazy hot (iPhone, limited sneakers), some items are dead cold (toilet paper).

Hot partition problem:

00:00:00 - iPhone 11 flash sale starts
100,000 users click buy at same time
All traffic routes to partition that has merchant_id of Apple Store
-> that partition: 90% CPU, 2GB/s disk I/O, cache thrashing
-> other partitions: 10% CPU, idle

RootService monitoring:

Diagram: RootService heartbeat monitoring

sequenceDiagram
  autonumber
  %% example: heartbeat JSON every 1s
  participant N47 as Node47
  participant RS as RootService

  loop every 1s
    N47-->>RS: hb node=47 cpu=92 mem=85 io=2.1GB/s\npart 847 qps=50k cpu=45\npart 123 qps=150k cpu=95\npart 456 qps=1k cpu=2
    RS->>RS: score hotspot + decide action
  end

Heartbeat fields (example):

| Field | Meaning | Example |
| --- | --- | --- |
| node_id | node identity | 47 |
| timestamp | Unix seconds | 1573430401 |
| cpu_usage | node CPU | 92% |
| memory_usage | node memory | 85% |
| disk_io | disk bandwidth | 2.1 GB/s |
| partitions[] | per-partition stats | id=123 qps=150k cpu=95 |
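
A toy sketch of such a heartbeat payload plus a naive hotspot check; the field names and thresholds are illustrative, not the real protocol:

# Hypothetical heartbeat payload and a simple hot-partition filter.
heartbeat = {
    "node_id": 47,
    "timestamp": 1573430401,
    "cpu_usage": 92,          # percent
    "memory_usage": 85,       # percent
    "disk_io_gbps": 2.1,
    "partitions": [
        {"id": 847, "qps": 50_000,  "cpu": 45},
        {"id": 123, "qps": 150_000, "cpu": 95},
        {"id": 456, "qps": 1_000,   "cpu": 2},
    ],
}

def hot_partitions(hb, cpu_threshold=80, qps_threshold=100_000):
    return [p["id"] for p in hb["partitions"]
            if p["cpu"] >= cpu_threshold or p["qps"] >= qps_threshold]

print(hot_partitions(heartbeat))  # [123] -> candidate for quota bump / leader move / split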

RootService analyzes the heartbeats and decides on actions:

Action 1: increase resource quota

| Signal | Before | After |
| --- | --- | --- |
| CPU quota | 20% | 40% |
| Memory quota | 4GB | 8GB |
| Disk I/O priority | normal | boosted |

Diagram: Hot partition quota scale-up

flowchart TB
  %% example: quota scale up for hot partition
  HS[Hot partition p123\nqps 150k cpu 95] --> RS2[RootService decision]
  RS2 --> QUP[Increase quota\ncpu 20 to 40\nmem 4G to 8G\nio priority up]
  QUP --> STB[Stabilize\nreduce tail latency]

Action 2: rebalance leaders

Node 47 has 3 leader partitions, all hot
Node 23 has 3 leader partitions, all cold

Action:
-> move 1 leader partition from node 47 to node 23
-> follower on node 23 promote to leader
-> old leader on node 47 demotes to follower
-> load spread out

Action 3: split hot partition

Partition 123 too hot, even with increased quota
RootService: split it

Step 1: analyze partition data
  - 80% traffic goes to merchant_id = 88888 (Apple Store)
  - 20% traffic goes to other merchants

Step 2: create new partition 123b
  - partition 123a: merchant_id != 88888
  - partition 123b: merchant_id == 88888 (Apple Store only)

Step 3: migrate partition 123b to new node
  - copy data
  - sync redo logs
  - switch traffic
  - update routing table

Step 4: result
  - partition 123a: 20% traffic, runs ok
  - partition 123b: 80% traffic, but dedicated resources, runs ok

Action 4: temporary read replicas

Partition 123b still hot (read-heavy queries: users viewing product details)

Solution:
-> create 2 temporary read replicas
-> load balance read queries across 3 replicas
-> writes still go to 1 leader (consistency)
-> reads distributed: ~33% per replica

Real example from Singles' Day 2019:

23:50 - pre-peak preparation
  RootService: rebalance leaders to distribute evenly

00:00 - peak starts
  - 50 partitions spike to 95%+ CPU in 30 seconds
  - RootService auto increase quotas
  - 10 partitions need splitting
  - 15 partitions need leader migrate

00:05 - peak sustained
  - all actions completed
  - system stable at 544k TPS
  - no engineers on-call
  - no manual intervention

01:00 - peak declining
  - traffic drops to 200k TPS
  - RootService scale down resources
  - return quotas to normal

03:00 - major compaction
  - traffic down to 20k TPS
  - safe to run major compaction
  - reclaim disk space

Why were no engineers on-call in 2020?

Before 2019: engineers had to monitor dashboards manually and decide on actions themselves.

By 2020, RootService had learned enough patterns:

  • knows which partitions will be hot (based on history)
  • pre-allocate resources before peak
  • auto-scale during peak
  • auto-rebalance after peak
  • alert engineers only when real anomalies happen (99.9%+ CPU, disk full, network down)

Engineers just sit watching the dashboards and drinking coffee.



HTAP - Handle OLTP and OLAP together

Diagram: HTAP dual execution paths

flowchart TB
  %% example: 2 queries hit different path in same DB
  Q1[Q1: SELECT balance\nuid=293847] --> OLTP[OLTP fast path]
  Q2[Q2: SUM amount\nlast 1 hour] --> OLAP[OLAP scan path]

  OLTP --> MT[MemTable\nhot rows last 10 min]
  MT --> TXN[Txn engine\nMVCC + locks]

  OLAP --> SS[SSTable\ncolumn blocks]
  SS --> VE[Vector exec\nSIMD batch 1024 rows]

  MT <-->|flush + compaction| SS

  OLTP --> OUT1[result about 0.1ms]
  OLAP --> OUT2[result about 50ms]
Example routing:
- Q1 is a point lookup -> goes to the MemTable / hash index first
- Q2 is a scan + aggregate -> goes to SSTable column blocks + vector execution
- same data, no ETL, just 2 execution paths

This is one of the most brilliant parts of OceanBase. Traditional databases usually split into two types:

OLTP (Online Transaction Processing):

  • Fast point queries: "get balance of user A"
  • Fast writes: "insert payment record"
  • Row-oriented storage
  • Examples: MySQL, PostgreSQL, Oracle

OLAP (Online Analytical Processing):

  • Slower aggregate queries: "total sales by region"
  • Scan millions of rows
  • Column-oriented storage
  • Examples: ClickHouse, Snowflake, BigQuery

The problem with the traditional approach:

OLTP DB (MySQL)
  ↓
ETL job (runs every hour)
  ↓
OLAP warehouse (Snowflake)
  ↓
Dashboard (1 hour delay)

Singles' Day: merchants want real-time dashboards. "How much did I sell in the last 10 minutes?" You can't wait an hour.

OceanBase HTAP solution: 1 database, 2 execution paths

The storage engine supports both:

  1. MemTable (row-oriented)
  • Active writes go here
  • Hash + B-tree indexes
  • Perfect for OLTP point queries
  2. SSTable (column-oriented)
  • Flushed data stored here
  • Column encoding + compression
  • Perfect for OLAP scans

Concrete example:

OLTP query (point query):

SELECT * FROM payments WHERE payment_id = 12345;

Execution plan:

1. Check MemTable hash index -> O(1) lookup
2. Find row in 0.1ms
3. Return result

Data layout in MemTable (row-oriented):
Row: [payment_id=12345, user_id=999, amount=100, timestamp=1234567890, status=success]

OLAP query (aggregate scan):

SELECT
  region,
  SUM(amount) as total_sales,
  COUNT(*) as order_count
FROM payments
WHERE timestamp > '2019-11-11 00:00:00'
GROUP BY region;

Execution plan:

1. Scan SSTable (column-oriented)
2. Only read needed columns: timestamp, region, amount
3. Vector execution: process 1000 rows at once
4. Use SIMD instructions for SUM/COUNT
5. Return aggregated result

Data layout in SSTable (column-oriented):
timestamp column: [1234567890, 1234567891, 1234567892, ...]
region column:    ['Beijing', 'Shanghai', 'Beijing', ...]
amount column:    [100, 200, 150, ...]

Only read these 3 columns, skip other 10 columns
Compressed: 1000 rows = ~4KB (vs 50KB row-oriented)

Vector execution engine:

Traditional execution: 1 row at a time

total = 0
for row in rows:
    if row.region == 'Beijing':
        total += row.amount  # 1M iterations

Vector execution: 1000 rows at once

# NumPy stands in for the SIMD engine here: one vectorized operation
# covers a whole chunk of rows instead of one row per loop iteration.
import numpy as np

# Load a chunk of 1000 rows into contiguous arrays (fits in CPU cache)
chunk_region = np.array(['Beijing', 'Shanghai', 'Beijing', 'Shanghai'] * 250)
chunk_amount = np.array([100, 200, 150, 300] * 250)

# Vectorized filter (the hardware executes this with SIMD instructions)
mask = (chunk_region == 'Beijing')

# Vectorized sum over the selected rows
total = int(chunk_amount[mask].sum())  # 62500 for this toy chunk

# Process 1000 rows in ~the same time as 1 row
# Total: 1M/1000 = 1000 chunk iterations (roughly 1000x fewer iterations)

Real-time analytics during Singles' Day:

Use case 1: fraud detection (real-time OLAP)

-- runs every 10 seconds
SELECT user_id, COUNT(*) as payment_count
FROM payments
WHERE timestamp > NOW() - INTERVAL '60 seconds'
  AND amount > 10000
GROUP BY user_id
HAVING COUNT(*) > 20;
-- Find users making more than 20 payments over $10,000 each in the last minute

Execution:

  • Scan SSTable recent data (~10M rows)
  • Vector execution: 10 seconds
  • Result: 15 suspicious accounts
  • Trigger alert to risk team

Use case 2: merchant dashboard (real-time OLTP + OLAP)

-- dashboard updates every 5 seconds
SELECT
  COUNT(*) as total_orders,
  SUM(amount) as total_sales,
  AVG(amount) as avg_order_value,
  MAX(amount) as largest_order
FROM orders
WHERE merchant_id = 88888
  AND timestamp > NOW() - INTERVAL '1 hour';

Execution:

  • Recent data (last 10 min) from MemTable: OLTP fast path
  • Older data (10-60 min ago) from SSTable: OLAP scan
  • Combined result in 50ms
  • Merchant sees real-time dashboard

Use case 3: recommendation engine (complex OLAP)

-- Update recommendations every 30 seconds
SELECT
  p1.user_id,
  p2.product_id,
  COUNT(*) as co_purchase_count
FROM purchases p1
JOIN purchases p2 ON p1.order_id = p2.order_id
WHERE p1.product_id = 12345  -- iPhone
  AND p2.product_id != 12345
  AND p1.timestamp > NOW() - INTERVAL '10 minutes'
GROUP BY p1.user_id, p2.product_id
ORDER BY co_purchase_count DESC
LIMIT 10;
-- Users buying iPhone, what else did they buy?

Execution:

  • Join 2 large tables
  • Scan millions of rows
  • Vector execution + hash join
  • Result in 2 seconds
  • Update recommendation model

Simultaneously running:

00:00:00 - Singles' Day starts
Load:
  - 544k OLTP TPS (payments, orders)
  - 50 OLAP queries/second (fraud, dashboards, recommendations)
  - Total: 544k + 50 queries/second
  - Same database, same data, zero lag

CPU allocation:
  - 90% for OLTP (priority)
  - 10% for OLAP (background)

Memory allocation:
  - MemTable: 60% (hot OLTP data)
  - Block Cache: 30% (recent OLAP scans)
  - Row Cache: 10% (frequently accessed rows)

No ETL tax. With the traditional approach:

  • Storage cost: OLTP DB + OLAP warehouse = 2x
  • Network cost: ETL transfer petabytes/day
  • Lag: 1 hour minimum
  • Complexity: maintain 2 systems

OceanBase:

  • Storage cost: 1x (same data)
  • Network cost: 0 (no transfer)
  • Lag: 0 (real-time)
  • Complexity: 1 system

Distributed Transaction Coordination (2PC)

sequenceDiagram
  autonumber
  %% example: transfer $100 from uid=293847(p847@node47) -> uid=100523(p523@node23)
  participant U as User
  participant C as Coordinator (node47)
  participant P1 as Participant node47 (p847)
  participant P2 as Participant node23 (p523)

  U->>C: begin txn txn_19283746 ts=1234567890
  C->>P1: PREPARE txn_19283746\nUPDATE A balance -100
  C->>P2: PREPARE txn_19283746\nUPDATE B balance +100
  P1-->>C: VOTE YES (lock A, write prepare log)
  P2-->>C: VOTE YES (lock B, write prepare log)

  alt all YES
    C->>C: write COMMIT log (durable)
    C->>P1: COMMIT txn_19283746
    C->>P2: COMMIT txn_19283746
    P1-->>C: ACK (apply + release lock)
    P2-->>C: ACK (apply + release lock)
    C-->>U: SUCCESS (lat ~10ms)
  else any NO
    C->>C: write ABORT log
    C->>P1: ABORT txn_19283746
    C->>P2: ABORT txn_19283746
    P1-->>C: ACK (release lock)
    P2-->>C: ACK (release lock)
    C-->>U: FAIL
  end
Example timeline (same datacenter):
0ms   coordinator send prepare
2ms   node47 prepared + vote YES
6ms   node23 prepared + vote YES
7ms   coordinator write commit log
8ms   send commit to both
10ms  both ack, user sees success
  • 2PC is also used for cross-table transactions that don't share the same partition key (rare)

Two-Phase Commit (2PC)

sequenceDiagram
  autonumber
  %% Normal 2PC: prepare + commit
  participant U as Client
  participant C as Coordinator (Node47)
  participant P1 as Participant (Node47 p847)
  participant P2 as Participant (Node23 p523)

  U->>C: begin txn txn_19283746\nGTS ts=1234567890
  par PREPARE
    C->>P1: PREPARE txn_19283746\nUPDATE A -100
    C->>P2: PREPARE txn_19283746\nUPDATE B +100
  end
  P1->>P1: check + lock + write PREPARE log
  P2->>P2: check + lock + write PREPARE log
  P1-->>C: VOTE YES
  P2-->>C: VOTE YES
  C->>C: decide COMMIT\nwrite COMMIT log
  par COMMIT
    C->>P1: COMMIT txn_19283746
    C->>P2: COMMIT txn_19283746
  end
  P1->>P1: apply + write COMMIT log + unlock
  P2->>P2: apply + write COMMIT log + unlock
  P1-->>C: ACK
  P2-->>C: ACK
  C-->>U: success
sequenceDiagram
  autonumber
  %% Normal 2PC: abort path
  participant U as Client
  participant C as Coordinator
  participant P1 as Participant Node47
  participant P2 as Participant Node23

  U->>C: begin txn txn_19283746
  C->>P1: PREPARE
  C->>P2: PREPARE
  P1-->>C: YES
  P2-->>C: NO (constraint failed)
  C->>C: decide ABORT\nwrite ABORT log
  C->>P1: ABORT
  C->>P2: ABORT
  P1->>P1: rollback + unlock
  P2->>P2: unlock if needed
  P1-->>C: ACK
  P2-->>C: ACK
  C-->>U: failure
sequenceDiagram
  autonumber
  %% Scenario A: participant crash during PREPARE
  participant C as Coordinator
  participant P1 as Node47 participant
  participant P2 as Node23 participant

  C->>P1: PREPARE txn_19283746
  C->>P2: PREPARE txn_19283746
  P1-->>C: YES (prepared)
  Note over P2: crash before vote
  C->>C: timeout => decide ABORT
  C->>P1: ABORT
  P1->>P1: unlock
  Note over P2: restart\nreplay log -> found PREPARE-only\nask coordinator -> ABORT\ncleanup
sequenceDiagram
  autonumber
  %% Scenario B: coordinator crash after all YES, before COMMIT
  participant U as Client
  participant C as Coordinator Node47
  participant P1 as Node47 participant
  participant P2 as Node23 participant
  participant NC as New coordinator Node23

  U->>C: begin txn txn_19283746
  C->>P1: PREPARE
  C->>P2: PREPARE
  P1-->>C: YES
  P2-->>C: YES
  Note over C: crash before sending COMMIT
  Note over P1,P2: locks held, waiting decision
  NC->>NC: timeout detected, take over
  NC->>P1: query state
  NC->>P2: query state
  P1-->>NC: prepared YES
  P2-->>NC: prepared YES
  NC->>NC: decide COMMIT
  NC->>P1: COMMIT
  NC->>P2: COMMIT
  P1-->>NC: ACK
  P2-->>NC: ACK

Scenario C: Network partition during COMMIT

sequenceDiagram
  autonumber
  participant C as Coordinator
  participant N47 as Node47
  participant N23 as Node23

  C->>N47: COMMIT txn_19283746
  C-x N23: COMMIT txn_19283746 (dropped by partition)
  N47-->>C: ACK (apply + release locks)
  Note over N23: still locked, waiting for decision
  loop retry until network heals
    C->>N23: COMMIT txn_19283746
  end
  N23-->>C: ACK (apply + release locks)
stateDiagram-v2
  %% participant local state machine
  [*] --> Init
  Init --> Prepared: PREPARE log durable + lock acquired
  Prepared --> Committed: COMMIT log durable + apply + unlock
  Prepared --> Aborted: ABORT or timeout cleanup + unlock
  Committed --> [*]
  Aborted --> [*]

High Availability and Failover

Diagram: Partition replica layout and quorum

flowchart TB
  %% example: partition 847 replicas across 3 zones
  subgraph Z1[Zone 1 - Beijing]
    L[Server47\nLeader p847]
    Fz1[Server102\nFollower p847]
  end
  subgraph Z2[Zone 2 - Hangzhou]
    Fz2a[Server203\nFollower p847]
    Fz2b[Server247\nFollower p847]
  end
  subgraph Z3[Zone 3 - Shenzhen]
    Fz3[Server389\nFollower p847]
  end

  L -->|redo log ts=1234567890| Fz1
  L -->|redo log ts=1234567890| Fz2a
  L -->|redo log ts=1234567890| Fz2b
  L -->|redo log ts=1234567890| Fz3

  subgraph Q[Quorum rule 3-of-5]
    Q1[need any 3 ACK to commit]
  end

  L -. commit ok when 3 ACK .-> Q1
Example commit math:
- the leader writes the log locally
- waits for ACKs from any 2 followers
- 3 of 5 copies durable -> commit ok

Example failure:
- lose Server47 (leader) + Server102 (follower)
- 3 of 5 replicas still alive -> the system keeps working

5 replicas, 3 zones, RPO=0, RTO<30s.
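
The quorum arithmetic behind the replica-count trade-off shown in the next diagram, as a quick sketch:

# quorum = floor(n/2) + 1, failures tolerated = n - quorum
for n in (3, 5, 7):
    quorum = n // 2 + 1
    tolerated = n - quorum
    print(f"{n} replicas: quorum={quorum}, tolerates {tolerated} failure(s)")
# 3 replicas: quorum=2, tolerates 1 failure(s)
# 5 replicas: quorum=3, tolerates 2 failure(s)
# 7 replicas: quorum=4, tolerates 3 failure(s)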

Diagram: Replica count vs quorum trade-off

flowchart TB
  %% replica placement + quorum comparison
  subgraph R3[3 replicas\nquorum=2\ntolerate 1 failure]
    r3a((A))
    r3b((B))
    r3c((C))
  end
  subgraph R5[5 replicas\nquorum=3\ntolerate 2 failures]
    r5a((A))
    r5b((B))
    r5c((C))
    r5d((D))
    r5e((E))
  end
  subgraph R7[7 replicas\nquorum=4\ntolerate 3 failures\nextra overhead]
    r7a((A))
    r7b((B))
    r7c((C))
    r7d((D))
    r7e((E))
    r7f((F))
    r7g((G))
  end

  R3 --> Why3[Risk: lose 1 DC may remove 2 replicas\n=> system down]
  R5 --> Why5[Sweet spot: survive 1 DC loss\nlose 2 replicas\nwith acceptable cost]
  R7 --> Why7[Higher safety but more replication cost]
sequenceDiagram
  autonumber
  %% Failure scenario 1: single leader server failure -> election -> route update
  participant L as Server47 (leader p847)
  participant F1 as Server102 (follower)
  participant F2 as Server203 (follower)
  participant F3 as Server247 (follower)
  participant F4 as Server389 (follower)
  participant RS as RootService
  participant PX as OBProxy

  Note over L: normal traffic, replicate logs
  L-->>F1: redo log
  L-->>F2: redo log
  L-->>F3: redo log
  L-->>F4: redo log

  Note over L: crash
  F2-->>RS: leader timeout detected
  RS->>RS: start Paxos election
  RS-->>F2: promote to new leader (new term)
  RS-->>PX: update route p847 -> Server203
  PX-->>F2: new requests

Diagram: Replica placement across zones

flowchart TB
  %% replica placement across zones (example p847)
  subgraph Z1[Zone 1 Beijing]
    BJ1[(DC1 leader)]
    BJ2[(DC2 follower)]
  end
  subgraph Z2[Zone 2 Hangzhou]
    HZ1[(DC1 follower)]
    HZ2[(DC2 follower)]
  end
  subgraph Z3[Zone 3 Shenzhen]
    SZ1[(DC1 follower)]
  end

  BJ1 --> BJ2
  BJ1 --> HZ1
  BJ1 --> HZ2
  BJ1 --> SZ1

  Q[Commit when 3-of-5 replicas have log] -.-> BJ1

Detailed example of data preservation:

Before failure (timestamp 1234567890):
User A submitted payment, transaction committed
  Server 47 (Leader): balance = $900 ✓ committed
  Server 102: balance = $900 ✓ replicated
  Server 203: balance = $900 ✓ replicated
  Server 247: balance = $900 ✓ replicated
  Server 389: balance = $900 ✓ replicated

During failure (timestamp 1234567891):
User B submitted payment, Leader processing
  Server 47 (Leader): balance = $800 ← processing, NOT YET COMMITTED
  Server 47 DIES before replication

After failover:
New Leader (Server 203) state:
  - Has User A transaction: balance = $900 ✓ PRESERVED
  - Doesn't have User B transaction: balance = $900 (not $800)
  
User B sees: "Transaction failed, please retry"
User B retries → success
Final balance: $800

Key point: 
- Committed transactions NEVER lost (RPO = 0)
- Uncommitted transactions may be lost (acceptable)
- Users just retry failed requests
sequenceDiagram
  autonumber
  %% Failure scenario 2: entire datacenter down -> mass elections + route update
  participant BJ1 as Beijing DC1 (down)
  participant RS as RootService
  participant HZ as Hangzhou zone
  participant SZ as Shenzhen zone
  participant PX as OBProxy fleet

  Note over BJ1: power outage, thousands of servers drop
  HZ-->>RS: heartbeat timeouts for BJ1 nodes
  SZ-->>RS: heartbeat timeouts for BJ1 nodes
  RS->>RS: start elections for many partitions in parallel
  RS-->>HZ: promote new leaders for affected partitions
  RS-->>SZ: promote new leaders for affected partitions
  RS-->>PX: push updated routing tables
  Note over PX: clients retry -> traffic shifts to HZ/SZ

Diagram: Datacenter outage - quorum still available

flowchart TB
  %% Entire zone 1 down but quorum still satisfied
  subgraph Z1[Zone 1 Beijing]
    DC1x[DC1 down]
    DC2x[DC2 down]
  end
  subgraph Z2[Zone 2 Hangzhou]
    HZ1[(replica)]
    HZ2[(replica)]
  end
  subgraph Z3[Zone 3 Shenzhen]
    SZ1[(replica)]
  end

  Q[3-of-5 quorum still available\n=> keep accepting writes] --> HZ1
  Q --> HZ2
  Q --> SZ1
sequenceDiagram
  autonumber
  %% Auto-heal: rebuild back to 5 replicas
  participant RS as RootService
  participant HZ1 as Hangzhou DC1
  participant SZ1 as Shenzhen DC1

  RS->>RS: detect only 3 replicas remain
  RS->>HZ1: create new replica r4 + backfill
  RS->>SZ1: create new replica r5 + backfill
  HZ1-->>RS: r4 caught up
  SZ1-->>RS: r5 caught up
  RS->>RS: restore 5 replicas
sequenceDiagram
  autonumber
  %% Failure scenario 3: network partition (split brain) prevented by quorum
  participant BJ as Beijing group (2 replicas)
  participant HZ as Hangzhou+Shenzhen group (3 replicas)

  Note over BJ,HZ: network partition between groups

  BJ->>BJ: try to commit write
  BJ-->>BJ: only 2-of-5 ACK possible => cannot reach quorum
  BJ-->>BJ: reject writes (read-only safe mode)

  HZ->>HZ: elect leader with 3-of-5
  HZ-->>HZ: accept writes (quorum satisfied)

  Note over BJ,HZ: network heals
  BJ->>HZ: rejoin, step down to follower, catch up logs

Diagram: Quorum rule - who can accept writes

flowchart TB
  %% quorum rule decides who can accept writes
  A[Group A\n2 replicas] -->|no quorum| RO[READ-ONLY\nreject writes]
  B[Group B\n3 replicas] -->|has quorum| RW[READ/WRITE\naccept writes]
  RO --> HEAL[heal network -> catch up]
  RW --> HEAL

Diagram: RTO component breakdown

flowchart TB
  %% RTO breakdown
  RTO[RTO components] --> DET[Detection\n1-2s]
  RTO --> ELE[Election\n3-10s]
  RTO --> ROUTE[Route update\n2-8s]
  RTO --> WARM[Warmup\n2-10s]

  DET --> S1[Single server\n8-15s]
  ELE --> S1
  ROUTE --> S1
  WARM --> S1

  DET --> DC[Datacenter\n20-30s]
  ELE --> DC
  ROUTE --> DC
  WARM --> DC
sequenceDiagram
  autonumber
  %% RPO=0: commit only after majority replication
  participant U as Client
  participant L as Leader
  participant F as Followers (quorum)

  U->>L: write request
  L->>L: append redo log locally
  par replicate
    L->>F: redo log
  end
  F-->>L: majority ACK
  L-->>U: success (committed)

  Note over L: leader crash before majority ACK => user sees failure, retry
sequenceDiagram
  autonumber
  %% Chaos test summary
  participant ENG as Engineers
  participant OPS as Ops tool
  participant CL as OceanBase cluster

  ENG->>OPS: kill 50 random nodes
  OPS->>CL: terminate nodes
  CL-->>CL: elections + leader moves
  CL-->>ENG: recovered in <=30s, RPO=0

  ENG->>OPS: disconnect Beijing DC1
  OPS->>CL: network isolation
  CL-->>CL: mass elections + reroute
  CL-->>ENG: recovered ~25s

  ENG->>OPS: add 500ms latency BJ<->HZ
  OPS->>CL: inject latency
  CL-->>ENG: commit slower, still available

Compression and Column Encoding

Diagram: Microblock compression and read path

flowchart TD
  %% example: microblock for payments_p847, columns: status/timestamp/amount
  Raw[Raw rows payments_p847] --> Col[Column layout\nstatus timestamp amount]
  Col --> Enc[Encoding step\nstatus dict, ts delta, amount delta]
  Enc --> Comp[Compression\nLZ4 for cold, Snappy for hot]
  Comp --> Store[Store microblock\nsize 4096B -> 287B]

  Store --> Read[Read query\nSELECT amount WHERE uid=12345]
  Read --> Decomp[Decompress 287B -> 4KB]
  Decomp --> Prune[Column prune\nkeep only amount]
  Prune --> Decode[Decode amount\ndelta decode]
  Decode --> Exec[Exec + return\nabout 0.16ms]
Example numbers in 1 block:
- before: 4KB microblock
- after encoding: 820B
- after LZ4: 287B
- read I/O: 4KB -> 287B (14x less)

At hundreds of petabytes of storage, compression is not a "nice to have"—it is a "must have". Without compression, we'd need 2–3× more servers.

Diagram: Compression pipeline (write and read)

flowchart TD
  %% Compression pipeline overview
  subgraph W[Write path]
    WR[Rows] --> MB[Build microblock]
    MB --> COL[Columnize]
    COL --> ENC[Encode\ndict / delta / rle / prefix]
    ENC --> CMP[Compress\nLZ4 or Snappy]
    CMP --> SST[Persist SSTable]
  end

  subgraph R[Read path]
    Q[Query] --> PRUNE[Prune columns]
    PRUNE --> IO[Read compressed microblock]
    IO --> DCMP[Decompress]
    DCMP --> DECODE[Decode selected columns]
    DECODE --> RES[Return rows]
  end

  SST -. serves .-> IO

Compression math (order-of-magnitude)

| Item | Formula | Example |
| --- | --- | --- |
| Raw writes | orders × avg row size | 1.29B × 5KB = 6.45TB |
| Replicas | raw × replica factor | 6.45TB × 5 = 32.25TB |
| With 80% compression | raw × 20% | 6.45TB × 20% = 1.29TB |

Diagram: Row store vs column store I/O

flowchart LR
  %% Row vs column layout and scan behavior
  subgraph ROW[Row store]
    R1[Row1: uid, amount, status, ts, merchant]
    R2[Row2: uid, amount, status, ts, merchant]
    R3[Row3: uid, amount, status, ts, merchant]
  end

  subgraph COLS[Column store]
    C1[uid: 1001,1002,1003,...]
    C2[amount: 100,200,150,...]
    C3[status: success,success,pending,...]
    C4[ts: 1234567890, ...]
    C5[merchant: 88888, ...]
  end

  Q1[Query: SELECT amount WHERE ts > X] -->|row store reads many columns| ROW
  Q1 -->|column store reads only ts+amount| COLS

  ROW --> W1[More I/O\nmixed types]
  COLS --> W2[Less I/O\nbetter compression]

Encoding Techniques — Overview (4 patterns)

Diagram: Column encoding technique selection

flowchart TB
  %% encoding decision cheatsheet
  A[Column data] --> DCT[Dictionary\nlow cardinality\nstatus, merchant]
  A --> DLT[Delta / Delta-of-delta\nmonotonic numbers\ntimestamp]
  A --> RLE[RLE\nlong runs\nflash sale product_id]
  A --> PFX[Prefix\ncommon prefixes\nproduct_name]

  DCT --> OUT[Smaller encoded stream]
  DLT --> OUT
  RLE --> OUT
  PFX --> OUT

| Technique | When it applies | Key idea | Typical gain |
| --- | --- | --- | --- |
| Dictionary | few distinct values | map values to small ids | 80–98% |
| Delta / delta-of-delta | ordered numbers | store differences | 85–98% |
| RLE | repeated values | store (value, count) pairs | 90–99%+ |
| Prefix | shared string prefixes | store the prefix once | 30–60% |
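
A toy Python sketch of three of these encodings (dictionary, delta, RLE), just to make the ideas concrete before the per-technique diagrams below; real microblock encoders are far more sophisticated:

# Toy column encoders: dictionary, delta, run-length.
def dictionary_encode(values):
    dictionary = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return dictionary, [dictionary[v] for v in values]

def delta_encode(numbers):
    return [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]

def rle_encode(values):
    runs, prev, count = [], values[0], 0
    for v in values:
        if v == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = v, 1
    runs.append((prev, count))
    return runs

print(dictionary_encode(["success", "success", "pending", "success", "failed"]))
# ({'success': 0, 'pending': 1, 'failed': 2}, [0, 0, 1, 0, 2])
print(delta_encode([1234567890, 1234567891, 1234567892, 1234567893]))
# [1234567890, 1, 1, 1]
print(rle_encode([12345] * 6))
# [(12345, 6)]
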
sequenceDiagram
  autonumber
  %% dictionary example
  participant W as Writer
  participant MB as Microblock

  W->>MB: build dictionary {success:0,pending:1,failed:2,cancelled:3}
  W->>MB: encode status column as ids [0,0,1,0,2,...]
  MB-->>W: output bit-packed ids (2 bits/value)
sequenceDiagram
  autonumber
  %% delta-of-delta example
  participant W as Writer
  participant MB as Microblock

  W->>MB: base ts=1234567890
  W->>MB: deltas [+1,+1,+1,...]
  W->>MB: delta-of-delta [0,0,0,...]
  MB-->>W: store base + compact zeros
sequenceDiagram
  autonumber
  %% RLE example
  participant W as Writer
  participant MB as Microblock

  W->>MB: product_id stream [12345 x 500000]
  W->>MB: encode as (12345,500000)
  MB-->>W: 12 bytes payload
sequenceDiagram
  autonumber
  %% prefix example
  participant W as Writer
  participant MB as Microblock

  W->>MB: find common prefix "iPhone 11 Pro"
  W->>MB: store suffixes [" Max 256GB ...", " 128GB ...", ...]
  MB-->>W: prefix + suffix array

Compression Layer (after encoding)

Diagram: Compression algorithm choice

flowchart TB
  %% encoding then compression
  E[Encoded column stream] -->|many repeated patterns| LZ[LZ4\nhigher ratio]
  E -->|fast decode| SNP[Snappy\nfaster decompress]

  LZ --> OUT[Compressed bytes]
  SNP --> OUT

  OUT -->|read| DCP[Decompress]
  DCP -->|then| DEC[Decode columns]

| Algorithm | Bias | Use case |
| --- | --- | --- |
| LZ4 | higher compression ratio | cold data, bandwidth/storage sensitive |
| Snappy | faster decompression | hot data, latency sensitive |

Real compression results:

Example microblock (4KB uncompressed):
1. Raw data: 4096 bytes
2. After column encoding: 820 bytes (80% reduction)
3. After LZ4 compression: 287 bytes (93% total reduction)

Typical compression ratios by column type:
- status (enum): 98% reduction
- timestamp (sequential): 95% reduction
- user_id (random): 60% reduction
- amount (numbers): 70% reduction
- product_name (strings): 50% reduction

Overall: 70-90% reduction across all data

Performance impact (summary)

Diagram: Compression performance trade-off

flowchart LR
  %% compression trade-off summary
  IO[Disk I/O] -->|reduced bytes read| FAST[Lower latency / higher throughput]
  CPU[CPU] -->|decompress + decode| COST[Small extra cost]

  FAST --> NET[Net win\nI/O saved >> CPU cost]
  COST --> NET

| Metric | Without compression | With compression |
| --- | --- | --- |
| Bytes read per microblock | 4096B | 287B |
| Extra CPU | ~0ms | ~0.1ms to decompress + decode |
| Cache efficiency | baseline | higher (more blocks fit) |

Decompression on read (as a sequence)

sequenceDiagram
  autonumber
  participant Q as Query
  participant IDX as Index
  participant DISK as Disk
  participant LZ as Decompress
  participant CE as Column engine

  Q->>IDX: locate microblock for user_id=12345
  IDX-->>Q: microblock_id
  Q->>DISK: read compressed microblock (287B)
  DISK-->>Q: bytes
  Q->>LZ: decompress to 4KB
  LZ-->>Q: uncompressed block
  Q->>CE: prune columns (keep amount)
  CE-->>Q: amount column
  Q->>CE: decode (delta)
  CE-->>Q: result rows

Column pruning benefit:

Diagram: Column pruning I/O reduction

flowchart LR
  %% column pruning I/O comparison
  subgraph NO[Without column pruning]
    N1[Read 20 columns\n200B/row] --> N2[Scan 1M rows\n200MB]
  end

  subgraph YES[With column pruning]
    Y1[Read timestamp + amount\n16B/row] --> Y2[Scan 1M rows\n16MB]
  end

  NO --> GAP[12.5x more I/O]
  YES --> GAP

Real Singles' Day numbers (summary)

| Item | Value |
| --- | --- |
| Total written (raw) | 6.45TB |
| After compression | 1.3TB |
| Saved per replica | 5.15TB |
| Total saved (5 replicas) | 25.75TB |

Cache Architecture

Diagram: Two-level cache overview

flowchart TB
  %% example: 2 queries hit 2 caches
  Q1[Q1 point: payment_id=12345] -->|point| RC[Row Cache\nkey=payments,847,12345]
  Q2[Q2 scan: ts>00:00] -->|scan| BC[Block Cache\nkey=macroblock_id]

  RC -->|hit 0.05ms| OUT1[row]
  RC -->|miss| MT[MemTable]
  MT -->|miss| SST[SSTable]

  BC -->|hit 0.5ms| OUT2[blocks]
  BC -->|miss| SST

  SST -->|read 287B microblock| IO[Disk I/O]
  IO --> SST

  subgraph Warm[Warming example]
    WS[Scan hot p847\nload last 24h rows] --> RC
    WS --> BC
  end
Example cache hit flow:
- point read payment_id=12345: RowCache hit -> 0.05ms
- scan last 1 hour: BlockCache hit -> avoid disk read
- after restart: warming scan loads p847 hot keys
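
A minimal sketch of a read-through row cache with LRU eviction, matching the flow above; collections.OrderedDict stands in for the real cache and all sizes are toy values:

# Read-through row cache: hit serves from memory, miss falls back to storage
# (MemTable/SSTable lookup) and populates the cache, evicting the LRU entry.
from collections import OrderedDict

CACHE_CAPACITY = 10_000  # rows
row_cache = OrderedDict()

def read_row(key, fetch_from_storage):
    """key: (table, partition, primary_key); fetch_from_storage: storage lookup."""
    if key in row_cache:                      # hit: ~0.05ms, move to MRU position
        row_cache.move_to_end(key)
        return row_cache[key]
    row = fetch_from_storage(key)             # miss: MemTable, then SSTable
    row_cache[key] = row                      # populate on the way back
    if len(row_cache) > CACHE_CAPACITY:
        row_cache.popitem(last=False)         # evict least recently used
    return row

# usage (storage lookup stubbed out)
print(read_row(("payments", 847, 12345), lambda k: {"payment_id": k[2], "amount": 100}))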

Two-level cache architecture

Diagram: Why two cache levels

flowchart TB
  %% why 2 cache levels
  AP[Access pattern] --> OLTP[OLTP\npoint reads\nprimary key]
  AP --> OLAP[OLAP\nrange scans\nfilters]

  OLTP --> RC2[Row Cache\nrow-level objects]
  OLAP --> BC2[Block Cache\nmicroblocks]

  RC2 --> GOAL[Lower tail latency]
  BC2 --> GOAL

Row Cache:

sequenceDiagram
  autonumber
  %% point query path: row cache hit/miss
  participant U as Client
  participant RC as Row Cache
  participant MT as MemTable
  participant SST as SSTable

  U->>RC: GET (table,partition,pk)
  alt cache hit
    RC-->>U: row (0.05ms)
  else cache miss
    RC-->>U: miss
    U->>MT: lookup
    alt in memtable
      MT-->>U: row
    else not in memtable
      U->>SST: read microblock
      SST-->>U: row
      U->>RC: put row
    end
  end

Diagram: Cache population strategies

flowchart TB
  %% cache population strategies
  subgraph Demand[Read-through]
    D1[Cache miss] --> D2[Load from SSTable] --> D3[Insert into cache]
  end

  subgraph Warm[Warming]
    W1[Server restart/compaction] --> W2[Scan hot partitions] --> W3[Preload recent/hot keys]
  end

Diagram: Cache eviction and thrashing

flowchart TB
  %% eviction + thrashing
  subgraph LRU[Eviction]
    E1[Cache full] --> E2[Find least recently used] --> E3[Evict] --> E4[Insert new row]
  end

  subgraph THR[Thrashing]
    T1[Too many unique keys] --> T2[Constant evict/insert] --> T3[Low hit rate]
  end

  THR --> Fix[Mitigations: bigger cache / LRU-K / partitioned cache]

Block Cache — Overview

sequenceDiagram
  autonumber
  %% range scan path: block cache hit/miss
  participant Q as Query
  participant BC as Block Cache
  participant DISK as Disk
  participant DEC as Decompress+Decode

  Q->>Q: decide scan needs many microblocks
  loop for each microblock
    Q->>BC: GET microblock
    alt hit
      BC-->>Q: decompressed block
    else miss
      BC-->>Q: miss
      Q->>DISK: read compressed block
      DISK-->>Q: bytes
      Q->>DEC: decompress + decode
      DEC-->>Q: decompressed block
      Q->>BC: PUT decompressed block
    end
  end

| Why store decompressed blocks? | Trade-off |
| --- | --- |
| avoids decompression on every cache hit | uses more memory (fewer blocks cached) |

Diagram: Dynamic cache budget split

flowchart TB
  %% cache split + dynamic rebalance
  TOTAL[Total cache budget] --> RC[Row Cache]
  TOTAL --> BC2[Block Cache]

  RC -->|monitor hit rate| MON[Metrics]
  BC2 -->|monitor hit rate| MON

  MON -->|rebalance every 5m| ADJ[Adjust split]
  ADJ --> RC
  ADJ --> BC2

Cache warming after major compaction (timeline)

sequenceDiagram
  autonumber
  participant Z as Zone 2
  participant W as Warming job
  participant BC as Block Cache
  participant Q as Queries

  Note over Z: major compaction finishes
  Z-->>BC: SSTable ids changed -> cache keys invalid
  BC-->>Q: 100% miss if serving immediately

  Z->>W: start warming before serving
  W->>W: identify hot partitions (top 20% QPS)
  loop for each hot partition
    W->>Z: sequential scan new SSTable
    W->>BC: preload hot microblocks
  end
  W-->>Z: warming done
  Z-->>Q: start serving with 75% hit rate

Cache warming specifics

Diagram: Cache warming plan (what/how/time)

flowchart TB
  %% what/how/time/result
  subgraph What[What to warm]
    HOT[Hot partitions\nTop 20% by QPS\nexample top 200]
    COLD[Cold partitions\n80%]
    HOT --> H24[Load 100% last 24h]
    HOT --> H7[Load 50% last 7d]
    COLD --> NAT[Warm naturally]
  end

  subgraph HowMuch[How much to warm]
    RCV[Row Cache: 10M rows 2GB]
    BCV[Block Cache: 500k microblocks 2GB]
    TOT[Total: 4GB per node\nCluster: 20k nodes -> 80TB]
  end

  subgraph Time[Time to warm]
    SCAN[Sequential scan 1GB/s] --> CALC[4s scan + 5s insert\n=> 10s per zone]
  end

  subgraph Result[Result]
    READY[Zone ready in 10s\nNo user-visible degradation]
  end

  What --> Result
  HowMuch --> Result
  Time --> Result

Cache hit rate metrics — Singles' Day 2019 (summary)

| Cache | Window | Hit rate | Avg latency | QPS |
| --- | --- | --- | --- | --- |
| Row Cache | 00:00–01:00 | 92% | 0.15ms | 500k |
| Row Cache | 01:00–06:00 | 95% | 0.10ms | 200k |
| Row Cache | 06:00–24:00 | 97% | 0.08ms | 50k |
| Block Cache | 00:00–01:00 | 65% | 8ms / 1k rows | 50 scans/s |
| Block Cache | 01:00–06:00 | 75% | 5ms / 1k rows | 100 scans/s |
| Block Cache | 06:00–24:00 | 85% | 3ms / 1k rows | 200 scans/s |
xychart-beta
  title "Hit rate trend"
  x-axis ["00-01","01-06","06-24"]
  y-axis "hit rate %" 0 --> 100
  line "RowCache" [92,95,97]
  line "BlockCache" [65,75,85]

Cache benefits — By the numbers (summary)

Diagram: Cache ROI back-of-envelope

flowchart LR
  %% cache benefit back-of-envelope
  A[Without cache\n544k QPS\napprox 2ms avg\nhigh disk I/O] --> B[Needs many servers]
  C[With 95% hit rate\nmost reads from RAM\nlow disk I/O] --> D[Fewer servers]
  B --> E[Cost]
  D --> E
  E --> F[Net savings]

| Scenario | Disk I/O (per node) | CPU cores (rough) | Servers |
| --- | --- | --- | --- |
| Without cache | 2.176GB/s | 1088 | 3000 |
| With 95% hit rate | 108MB/s | 80 | 1000 |

| Cost item | Value |
| --- | --- |
| Servers saved | 2000 |
| Server capex saved | $10M |
| RAM cost | $1.28M |
| Net savings | $8.72M |
| ROI | 6.8x |
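
A quick back-of-envelope check of the ROI figures, using only the numbers from the table above:

# Sanity-check the ROI row from the table.
server_capex_saved = 10_000_000      # 2000 servers saved
ram_cost = 1_280_000                 # extra RAM for the caches
net_savings = server_capex_saved - ram_cost
roi = net_savings / ram_cost
print(net_savings, round(roi, 1))    # 8720000 6.8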

The cache is a necessity for reaching a sustained 544k TPS.


Compaction Strategy During Singles' Day

Major compaction is a necessary evil: it reclaims disk space and cleans up old versions, but it is expensive (rewrites data and thrashes caches).

sequenceDiagram
  autonumber
  %% Zone rotation timeline (Singles' Day)
  participant RS as RootService
  participant PX as OBProxy
  participant Z1 as Zone 1
  participant Z2 as Zone 2
  participant Z3 as Zone 3

  Note over Z1,Z3: 00:00 All zones ACTIVE, peak traffic

  Note over Z2: 06:00 drain Zone 2 -> COMPACTING
  RS->>PX: stop routing to Zone 2
  PX-->>Z1: shift traffic up
  PX-->>Z3: shift traffic up
  RS->>Z2: start major compaction
  Note over Z2: 3h compaction
  RS->>Z2: start cache warming
  Note over Z2: 10m warming
  RS->>PX: re-enable routing to Zone 2

  Note over Z3: 10:00 drain Zone 3 -> COMPACTING (repeat)

Diagram: Single node major compaction steps

flowchart TB
  %% Single node compaction (high-level)
  PREP[Prepare\nstop serving writes\nflush memtables] --> MERGE[Merge\nbuild new major SSTable]
  MERGE --> CLEAN[Cleanup\ndelete old SSTables\nreclaim space]
  CLEAN --> WARM[Warm caches\npreload hot blocks]
  WARM --> READY[Ready\nback to ACTIVE]

  MERGE --> INC[Incremental rewrite\nonly modified macroblocks]

Diagram: Traditional vs incremental compaction I/O

flowchart LR
  %% Traditional vs incremental write cost
  T[Traditional\nrewrite full dataset] --> TW[High write I/O]
  I[Incremental\nrewrite modified blocks] --> IW[Low write I/O]

  TW --> SLOW[Longer compaction]
  IW --> FAST[Shorter compaction]

| Topic | Key takeaway |
| --- | --- |
| Traditional approach | compact everywhere -> caches go cold -> user-visible latency |
| Zone rotation | compact one zone at a time -> other zones keep serving -> near-zero user impact |
| Incremental compaction | reduces write amplification by rewriting only modified macroblocks |
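
A back-of-envelope sketch of why incremental rewriting matters, reusing the numbers from the "Why incremental major compaction matters" table earlier in the post (1GB = 1000MB here for simplicity):

# Write I/O for traditional full rewrite vs incremental macroblock rewrite.
MACROBLOCK_MB = 2
old_major_gb = 100
blocks_total = old_major_gb * 1000 // MACROBLOCK_MB   # 50,000 macroblocks
blocks_rewritten = 1000                                # blocks actually touched by changes
blocks_reused = blocks_total - blocks_rewritten        # 49,000 reused as-is

write_io_traditional_gb = old_major_gb                 # rewrite everything
write_io_incremental_gb = blocks_rewritten * MACROBLOCK_MB / 1000  # ~2GB
print(blocks_reused, write_io_incremental_gb)          # 49000 2.0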

Diagram: Cache preservation by proactive warming

flowchart TB
  %% Cache preservation by proactive warming
  OLD[Old cache keys\nSSTable v4] --> MAP[Map hot regions\nold to new]
  MAP --> LOAD[Load hot blocks\nfrom new SSTable v5]
  LOAD --> HIT[Hit rate recovers faster]
sequenceDiagram
  autonumber
  %% Stop-the-world vs rotation (impact)
  participant U as Users
  participant STW as Stop-the-world
  participant ROT as Zone rotation

  Note over STW: all zones compact
  STW-->>U: caches cold -> high latency for hours

  Note over ROT: only 1 zone compacting
  ROT-->>U: route to hot zones -> minimal impact

Diagram: Routing adjustment and emergency handling

flowchart TB
  %% Load balancing and emergency handling
  MON[Monitor zone CPU/latency] --> ROUTE[Adjust routing between active zones]
  ROUTE --> OK[Stay within headroom]

  FAIL[Active zone failure] --> EMER[Emergency: pause compaction\nbring compacting zone ACTIVE]
  EMER --> ROUTE
xychart-beta
  title "Disk usage trend"
  x-axis ["D1","D2","D3","D10","D11"]
  y-axis "TB" 0 --> 600
  line "No compaction" [100,160,220,550,600]
  line "With compaction" [100,130,160,280,280]

Singles' Day 2019 compaction (summary)

| Item | Value |
| --- | --- |
| Zone 2 compaction window | 06:00–09:00 |
| Per-node space reclaimed | 15TB |
| Compaction duration | 3h |
| Warming duration | 10m |
| Hit rate right after warming | 75% |
| Hit rate after 1h of serving | 85–90% |
| Zone 2 nodes | 2000 |
| Zone 2 reclaimed (cluster-wide) | 30PB |
| Zone 2 rewritten (cluster-wide) | 20PB |

