Singles' Day night 11.11 - 23:59:50. In HQ of Ant Group and Alibaba in Hangzhou, hundreds engineers staring at dashboard screens. This is real money, billions USD of Alibaba getting decided in next few minutes.
00:00:00: Festival starts. Like 1 billion ppl press pay button at same time. Traffic jump from 50k to 544k TPS in like 3 seconds. If something break, Alibaba can lose billions. But dashboard still green. Peak is 544,000 transactions per second.
00:01:08:
- first $1B in 68 seconds
- $10B in less than 30 mins
End of that day, record numbers:
- 544k TPS peak (world record at that time)
- 1.3B orders
- $34B revenue
To help you imagine: Visa at their most stress is like 65k TPS, PayPal is ~10k TPS.
System scale around that time:
- 100 database clusters
- 20,000 servers
- 1 million CPU cores
- hundreds petabytes storage
Im a engineer at Alipay. After some internal sharing and reading internal engineer blogs, i feel Alipay architecture is super interesting but not so common outside. Most docs are in Chinese so i try summarize in simple way: why Alipay not using Oracle, MySQL, Postgres?
Real problem in Peak Day is not lack of servers, but database. Payment system needs strong consistency, RPO ~ 0, super low RTO, and survive insane write spike.
Oracle is very strong in single instance and RAC. But when TPS goes to hundreds thousands, shared storage and global locks become bottleneck.
MySQL/PostgreSQL can shard, but cross-shard transactions you either do in app layer or use 2PC. Its super hard to keep stable when traffic jump 10x in few seconds. Manual sharding on tens thousands nodes is ops nightmare.
NoSQL (MongoDB/Cassandra) scale good but more eventual consistency, for payment its too risky.
So Alipay build OceanBase from scratch for distributed transactions. In 2020 it also got TPC-C record 707M tpmC, beating big traditional database.
No free lunch for 544k TPS. To carry this number, Alipay accept some tradeoff:
| Tradeoff | What you pay | Why it matters in peak |
|---|---|---|
| Infra burden | 3-5 replicas for Paxos, storage+network cost 3-5x | RPO 0 and survive DC failure |
| Ops complexity | debug distributed system (network/leader/follower) is hard | need strong monitoring and automation |
Compare with other NewSQL:
| NewSQL | Core idea | Good at | Tradeoff |
|---|---|---|---|
| Google Spanner | global timestamp + sync time infra | global consistency across regions | depends on Google infra |
| TiDB | Raft + MySQL compatible | open source ecosystem, easy deploy | performance tuning and hotspots |
| OceanBase | Paxos + LSM + deep compression | huge spikes, high write throughput | complexity and infra cost |
Diagram: Overall system architecture
flowchart LR
U[Client/App] --> P[OBProxy\nSQL routing]
P --> S1[OBServer\nSQL/Txn/Storage]
P --> S2[OBServer\nSQL/Txn/Storage]
P --> S3[OBServer\nSQL/Txn/Storage]
RS[RootService\nMetadata/Health/Rebalance] -. control plane .-> P
RS -. control plane .-> S1
RS -. control plane .-> S2
RS -. control plane .-> S3
Big picture - OceanBase run like a shared-nothing cluster. Each OBServer node is independent with its own SQL engine, transaction engine and storage. OBProxy sits in front and route queries to correct node based on partition keys. RootService is the brain - manage metadata, watch health and do load balancing.
Detailed explain each component:
- OBProxy (SQL routing layer): This is the first layer all requests come in. Its job:
- parse SQL query to find partition key (usually user_id)
- hash partition key to know which OBServer node has the data
- route request to that node
- handle connection pooling to avoid too many connections
Real example:
INSERT INTO payments (user_id, amount, merchant_id)
VALUES (293847, 999, 12345);OBProxy see user_id = 293847, hash it: 293847 mod 1000 = 847. Partition 847 is on OBServer node 47. Route request to node 47. Simple as that.
- OBServer node (compute + storage): This is the real workhorses. Each node has:
-
SQL engine: parse and optimize queries. For complex query join 5 tables, SQL engine decide join order, which index to use, how many rows to scan.
-
Transaction engine: manage MVCC (Multi-Version Concurrency Control) and 2PC.
MVCC example:
- timestamp 1000: user A balance = $100
- timestamp 1001: transaction is running, write user A balance = $50 (not commit yet)
- timestamp 1002: user B read balance of user A
- system return version at timestamp 1000 ($100) because txn 1001 not commit
- timestamp 1003: txn 1001 commit
- timestamp 1004: user C read balance -> get $50
-
Storage engine: LSM-Tree store the real data. Will explain in its own section.
- RootService (the brain): RootService dont handle user traffic. It do admin work:
- track metadata: partition 847 is on which node? leader is which node?
- monitor health: node 47 CPU 90%? need rebalance
- orchestrate migration: move partition 123 from node A to node B
- trigger elections: node 47 dead, elect new leader for its partitions
Rebalance example:
- RootService see node 47 running 95% CPU (hot)
- node 23 only 30% CPU (cold)
- RootService decide migrate partition 847 from node 47 to node 23
- steps:
- create new replica of partition 847 on node 23
- sync data from node 47 to node 23
- wait node 23 catch up
- switch leader from node 47 to node 23
- update metadata
- delete old replica on node 47
- total time: ~30 seconds, zero downtime for users
- Multi-zone deployment:
flowchart TB
%% example: 5 replicas for partition 847 across 3 zones
subgraph Z1[Zone 1 Beijing]
BJ1[Beijing DC1\nLeader p847]
BJ2[Beijing DC2\nFollower p847]
end
subgraph Z2[Zone 2 Hangzhou]
HZ1[Hangzhou DC1\nFollower p847]
HZ2[Hangzhou DC2\nFollower p847]
end
subgraph Z3[Zone 3 Shenzhen]
SZ1[Shenzhen DC1\nFollower p847]
end
BJ1 -->|redo log ts=1234567890| BJ2
BJ1 -->|redo log ts=1234567890| HZ1
BJ1 -->|redo log ts=1234567890| HZ2
BJ1 -->|redo log ts=1234567890| SZ1
Q[Quorum rule 3-of-5\ncommit when any 2 followers ACK] -.-> BJ1
5 datacenters in 3 zones:
| Zone | Datacenters | Purpose |
|---|---|---|
| Zone 1 Beijing | DC1, DC2 | local quorum + survive 1 DC loss |
| Zone 2 Hangzhou | DC1, DC2 | cross-zone quorum + disaster recovery |
| Zone 3 Shenzhen | DC1 | third zone for HA |
Each partition has 5 replicas, spread across zones:
| Replica role | Location | Example for partition 847 |
|---|---|---|
| Leader | Zone 1 | Beijing DC1 |
| Follower | Zone 1 | Beijing DC2 |
| Follower | Zone 2 | Hangzhou DC1 |
| Follower | Zone 2 | Hangzhou DC2 |
| Follower | Zone 3 | Shenzhen DC1 |
If all Beijing power off (2 datacenters down), still 3 replicas in Hangzhou and Shenzhen. System keep running.
With 100 OceanBase clusters on 20,000 servers and 1M CPU cores, this is not about raw power - its about how to orchestrate all these resources to work like one single coherent database.
flowchart TD
W[Write: INSERT payment_id=12345] --> MT[Active MemTable\nrows: 12345 -> v1000]
MT -->|00:00:00.050 freeze| FMT[Frozen MemTable\nimmutable snapshot]
FMT -->|00:00:00.200 dump| MINI[Mini SSTable\nfile: mini_0007.sst]
MINI -->|00:00:05.000 minor merge| MINOR[Minor SSTable\nfile: minor_0012.sst]
MINOR -->|03:10:00 major compaction\nincremental| MAJOR[Major SSTable\nfile: major_2020-11-12.sst]
%% example: read payment_id=12345 at 00:00:02.000
R[Read: SELECT payment_id=12345] --> MT
R --> FMT
R --> RC[Row cache\nhit maybe]
R --> BF[Bloom filter\ncheck key=12345]
BF --> MINI
BF --> MINOR
BF --> MAJOR
Example in 1 page (super simplified):
00:00:00.001 write comes in:
MemTable: {12345 -> amount=100, version=1000}
00:00:00.050 MemTable freeze (too full):
Frozen MemTable: snapshot version=1000
00:00:00.200 dump to disk:
mini_0007.sst contains key range ... includes 12345
00:00:05.000 minor compaction:
merge mini_0001..mini_0050 -> minor_0012.sst
03:10:00 major compaction (incremental):
reuse old macroblocks, rewrite only changed blocks
00:00:02.000 read comes in:
check MemTable -> miss
check Frozen MemTable -> miss
check bloom -> points to minor/major file
read 1 macroblock -> return row
LSM-Tree is the heart of storage engine. Lets break down each stage:
| Stage | Trigger | What happens | Typical size / note |
|---|---|---|---|
| Write to MemTable | every write | keep hot rows + MVCC versions in memory | fast, no disk |
| Freeze | MemTable full | active -> frozen, new active created | 64MB-256MB |
| Flush to Mini SSTable | background | dump frozen to disk | many small files |
| Minor compaction | too many mini files | merge many mini -> 1 minor, keep newest version | minor about 2GB |
| Major compaction | daily low traffic | merge minor + old major -> new major | incremental macroblock reuse |
flowchart LR
%% example: MemTable internal indexes
subgraph MT[MemTable]
BT[B-Tree index\nrange query by ts]
H[Hash index\npoint query by payment_id]
MV[MVCC versions\nts=1000,1001,...]
end
W1[INSERT payment_id=1\nts=1000] --> MT
W2[UPDATE payment_id=1\nts=1001] --> MT
R1[SELECT payment_id=1\nread ts=1000] --> MV
| Time | Operation | Result (simplified) |
|---|---|---|
| 1000 | INSERT payment_id=1 amount=999 | version 1000 stored |
| 1001 | UPDATE payment_id=1 amount=888 | version 1001 stored, version 1000 kept |
| 1002 | read at ts=1000 | return amount 999 (old version) |
flowchart TD
%% example: point query payment_id=12345
Q[Query\nSELECT payment_id=12345] --> MT2[Active MemTable\n0.1ms]
MT2 -->|miss| FMT2[Frozen MemTables\n0.2ms]
FMT2 -->|miss| RC2[Row cache\n0.5ms]
RC2 -->|miss| BF2[Bloom filter\nwhich SSTable may have key]
BF2 --> MINI2[Mini SSTables\nmaybe 10-50 files]
BF2 --> MINOR2[Minor SSTables]
BF2 --> MAJOR2[Major SSTable]
MINI2 --> OUT2[Return row]
MINOR2 --> OUT2
MAJOR2 --> OUT2
| Item | Number |
|---|---|
| Old major size | 100GB |
| Changes from minors | 10GB |
| Macroblock size | 2MB |
| Macroblocks rewritten | 1000 blocks, about 2GB |
| Macroblocks reused | 49000 blocks |
With hundreds of petabytes storage and tens of billions SQL queries, optimizing every read and write op is critical.
sequenceDiagram
autonumber
%% example: user_id=293847 transfer $100 to user_id=100523 at 00:00:00
participant U as User (uid=293847)
participant P as OBProxy
participant L as Leader node 47 (p847)
participant F1 as Follower 1 (Beijing DC2)
participant F2 as Follower 2 (Hangzhou DC1)
participant F3 as Follower 3 (Hangzhou DC2)
participant F4 as Follower 4 (Shenzhen DC1)
U->>P: SQL: UPDATE accounts SET balance=balance-100 WHERE user_id=293847
P->>L: route by hash(293847)%1000=847
L->>L: exec + write redo log (ts=1234567890)
par replicate redo log
L->>F1: log chunk (ts=1234567890)
L->>F2: log chunk (ts=1234567890)
L->>F3: log chunk (ts=1234567890)
L->>F4: log chunk (ts=1234567890)
end
F1-->>L: ACK (2ms)
F2-->>L: ACK (5ms) (quorum ok)
L-->>P: commit ok (lat~8ms)
P-->>U: success
Example time line:
0ms user send SQL
1ms leader write redo log locally
2ms follower1 ack
5ms follower2 ack -> quorum 3/5 reached
8ms proxy return to user
This is the magic part that make your data never lost, even if servers die.
Detailed example: user A transfer $100 to user B
Step 1: request goes to leader
User A -> OBProxy -> leader replica (partition 847, node 47)
SQL: UPDATE accounts SET balance = balance - 100 WHERE user_id = A
Step 2: leader processing
| Step | What leader does | Example detail |
|---|---|---|
| 1 | begin txn, assign timestamp | ts=1234567890 |
| 2 | check constraints | enough balance? conflict? |
| 3 | write to MemTable | user_id=A, balance=900, version=1234567890 |
| 4 | append redo log | UPDATE accounts SET balance=900 WHERE user_id=A |
flowchart TB
%% example: leader internal pipeline for 1 SQL
IN[SQL request\nUPDATE accounts...] --> TS[assign ts\n1234567890]
TS --> CC[constraint check\nbalance>=100]
CC --> LK[lock row\nuser_id=A]
LK --> MT[write MemTable\nbalance=900]
MT --> RL[append redo log\ndurable]
RL --> OUT[ready to replicate]
Step 3: Paxos replication Leader sends redo log to 4 followers in parallel:
Leader (Beijing DC1)
-> Follower 1 (Beijing DC2): latency 2ms
-> Follower 2 (Hangzhou DC1): latency 5ms
-> Follower 3 (Hangzhou DC2): latency 5ms
-> Follower 4 (Shenzhen DC1): latency 8ms
Each follower:
- receive redo log
- write to local disk (durable)
- apply to local MemTable
- send ACK back to leader
Step 4: majority ACK and commit Leader waits for majority ACK. With 5 replicas, need 3 ACKs (quorum):
Time 0ms: Leader send logs
Time 2ms: Follower 1 ACK
Time 5ms: Follower 2 ACK -> majority reached (3/5)
Time 5ms: Follower 3 ACK
Time 8ms: Follower 4 ACK (late but ok)
As soon as it has 3 ACKs (time 5ms), leader:
- mark transaction as COMMITTED
- release locks
- return success to user
- user see "payment success"
Step 5: background async flush MemTable data will flush into SSTable later (not blocking user):
Time 10 seconds later: Frozen MemTable dump into Mini SSTable
Time 1 hour later: Mini SSTables merge into Minor SSTable
Time 3am next day: Minor SSTables merge into Major SSTable
Why Paxos matters?
sequenceDiagram
autonumber
%% scenario 1: leader dies after sending logs, before returning to user
participant U as User
participant L as Leader (Beijing DC1)
participant F2 as Follower (Hangzhou DC1)
participant F3 as Follower (Hangzhou DC2)
participant RS as RootService
U->>L: write request ts=1234567890
L->>L: append redo log local
par replicate
L->>F2: redo log ts=1234567890
L->>F3: redo log ts=1234567890
end
Note over L: crash at t=3ms
F2-->>RS: detect leader timeout
RS->>RS: start election
RS-->>F2: promote to new leader
F2-->>U: user retry ok, data preserved
flowchart TB
%% scenario 2: whole zone 1 Beijing down, quorum still ok
subgraph Z1[Zone 1 Beijing]
BJ1[DC1 down]
BJ2[DC2 down]
end
subgraph Z2[Zone 2 Hangzhou]
HZ1[DC1 follower]
HZ2[DC2 follower]
end
subgraph Z3[Zone 3 Shenzhen]
SZ1[DC1 follower]
end
Q[Quorum 3-of-5\nneed any 3 replicas alive] --> HZ1
Q --> HZ2
Q --> SZ1
| Metric | Meaning | Value in example |
|---|---|---|
| RPO | how much committed data can be lost | 0 |
| RTO | time to recover service | less than 30s |
flowchart LR
%% example: 2 users and 1 transfer
U1[User A\nuser_id=293847\norder_id=O-7731] -->|INSERT order + pay| P
P -->|hash 293847 % 1000 = 847| N47[OBServer node 47\nPartition 847 leader]
subgraph N47
O[orders_p847\nINSERT O-7731]
Pay[payments_p847\nINSERT P-9911]
Bal[account_balance_p847\nUPDATE -100]
end
N47 -->|local txn: begin/commit| O
N47 -->|local txn: same node| Pay
N47 -->|local txn: same node| Bal
P -->|hash 100523 % 1000 = 523| N23[OBServer node 23\nPartition 523 leader]
subgraph N23
Bal2[account_balance_p523\nUPDATE +100]
end
ASCII quick map:
user_id=293847 -> p847 -> node47
orders_p847 + payments_p847 + balance_p847 (local)
user_id=100523 -> p523 -> node23
balance_p523 (other node)
Local txn: update 3 tables on node47, no network.
Cross-partition transfer: node47 + node23 need 2PC.
This is the secret sauce to avoid distributed transactions.
Problem with traditional distributed databases:
| Step | What happens | Example cost |
|---|---|---|
| Prepare phase | coordinator asks 3 servers, waits ACK | 5ms x 3 = 15ms |
| Commit phase | coordinator tells COMMIT, waits ACK | 5ms x 3 = 15ms |
| Total | network-only overhead | 30ms plus |
sequenceDiagram
autonumber
%% example: 3 tables on 3 servers
participant C as Coordinator
participant S1 as Server1 orders
participant S2 as Server2 payments
participant S3 as Server3 balance
C->>S1: PREPARE
C->>S2: PREPARE
C->>S3: PREPARE
S1-->>C: YES 5ms
S2-->>C: YES 5ms
S3-->>C: YES 5ms
C->>C: decide COMMIT
C->>S1: COMMIT
C->>S2: COMMIT
C->>S3: COMMIT
S1-->>C: ACK 5ms
S2-->>C: ACK 5ms
S3-->>C: ACK 5ms
OceanBase solution: partition groups
Key insight: if user A buys item, all data related to user A should be in same place.
Hash partitioning by user_id:
3 tables use same partition function:
orders table: partitioned by user_id
Partition 847 has all orders of users hashed into 847
payments table: partitioned by user_id
Partition 847 has all payments of users hashed into 847
account_balance table: partitioned by user_id
Partition 847 has all balances of users hashed into 847
Colocation magic: OceanBase colocate same partition number onto same node:
OBServer node 47:
- orders_p847
- payments_p847
- account_balance_p847
OBServer node 23:
- orders_p523
- payments_p523
- account_balance_p523
Then transaction becomes LOCAL: User 293847 buys item:
- request routed to node 47 (because partition 847 is there)
- begin transaction
- INSERT into orders_p847 - local, no network
- INSERT into payments_p847 - local, no network
- UPDATE account_balance_p847 - local, no network
- commit
- total latency: ~2ms (mostly memory operations)
No network overhead. No 2PC. Just local transaction inside 1 node.
Ratio of local vs distributed transactions:
- 95%+ transactions are local (same user, same partition)
- only 5% are distributed (example: user A transfer money to user B in different partition)
Cross-partition transaction (distributed 2PC):
Example: user A (partition 847) transfer $100 to user B (partition 523):
Coordinator (node 47):
1. get global timestamp from GTS: 1234567890
2. phase 1 - prepare:
- tell node 47: prepare decrease A's balance
- tell node 23: prepare increase B's balance
Node 47:
- lock user A row
- write prepare log
- check A has enough $100?
- vote: YES
Node 23:
- lock user B row
- write prepare log
- vote: YES
Coordinator:
- received 2 YES votes
- decision: COMMIT
- write commit log
3. phase 2 - commit:
- tell node 47: COMMIT
- tell node 23: COMMIT
Node 47:
- apply changes
- release locks
- ACK
Node 23:
- apply changes
- release locks
- ACK
Coordinator:
- return success to user
Total latency: ~10-15ms (network overhead)
Still ok for 5% transactions. But the other 95% running local in 2ms is why we can reach 544k TPS.
Dynamic repartitioning:
If a user becomes super hot (example: celebrity account with millions followers buying merch):
user_id = 999888 -> partition 888 becomes HOT
Node hosting partition 888: 95% CPU, overloaded
RootService detect:
1. split partition 888 into 2 sub-partitions:
- partition 888a: users hash into [888.0 - 888.5]
- partition 888b: users hash into [888.5 - 888.9]
2. migrate partition 888b to other node
3. now load spread across 2 nodes
4. update routing table in OBProxy
Automatic, zero downtime, zero data loss.
sequenceDiagram
autonumber
%% example: hot partition 123 at 00:00 peak
participant N47 as Node47 (p123 leader)
participant RS as RootService
participant PX as OBProxy
participant N23 as Node23 (cold)
loop every 1s
N47-->>RS: hb cpu=92 mem=85 io=2.1GB/s\np123 qps=150k cpu=95
N23-->>RS: hb cpu=25 mem=40 io=0.2GB/s\np523 qps=1k cpu=2
end
RS->>RS: detect p123 hot (qps spike)
RS-->>N47: action#1 increase quota (cpu 20%->40%, mem 4G->8G)
alt still hot after 10s
RS-->>N23: action#2 leader move plan\nprepare follower promote
RS-->>N47: demote leader p123 -> follower
RS-->>N23: promote follower p123 -> leader
else too hot for 1 leader
RS-->>N47: action#3 split p123 by merchant_id=88888
RS-->>N23: create p123b (merchant=88888) + migrate
RS-->>PX: update route: merchant=88888 -> p123b@node23
end
RS-->>PX: refresh routing meta
ASCII example of split rule:
Before:
p123: merchant_id=all merchants
After split:
p123a: merchant_id != 88888 (20% traffic)
p123b: merchant_id == 88888 (80% traffic, isolated)
Route change in proxy:
if merchant_id==88888 -> node23/p123b
else -> node47/p123a
Singles' Day is not uniform. Some items are crazy hot (iPhone, limited sneakers), some items are dead cold (toilet paper).
Hot partition problem:
00:00:00 - iPhone 11 flash sale starts
100,000 users click buy at same time
All traffic routes to partition that has merchant_id of Apple Store
-> that partition: 90% CPU, 2GB/s disk I/O, cache thrashing
-> other partitions: 10% CPU, idle
RootService monitoring:
Diagram: RootService heartbeat monitoring
sequenceDiagram
autonumber
%% example: heartbeat JSON every 1s
participant N47 as Node47
participant RS as RootService
loop every 1s
N47-->>RS: hb node=47 cpu=92 mem=85 io=2.1GB/s\npart 847 qps=50k cpu=45\npart 123 qps=150k cpu=95\npart 456 qps=1k cpu=2
RS->>RS: score hotspot + decide action
end
Heartbeat fields (example):
| Field | Meaning | Example |
|---|---|---|
| node_id | node identity | 47 |
| timestamp | seconds | 1573430401 |
| cpu_usage | node cpu | 92% |
| memory_usage | node mem | 85% |
| disk_io | disk bandwidth | 2.1 GB/s |
| partitions[] | per partition stats | id=123 qps=150k cpu=95 |
RootService analyzes and decide actions:
Action 1: increase resource quota
| Signal | Before | After |
|---|---|---|
| CPU quota | 20% | 40% |
| Memory quota | 4GB | 8GB |
| Disk I/O priority | normal | boosted |
Diagram: Hot partition quota scale-up
flowchart TB
%% example: quota scale up for hot partition
HS[Hot partition p123\nqps 150k cpu 95] --> RS2[RootService decision]
RS2 --> QUP[Increase quota\ncpu 20 to 40\nmem 4G to 8G\nio priority up]
QUP --> STB[Stabilize\nreduce tail latency]
Action 2: rebalance leaders
Node 47 has 3 leader partitions, all hot
Node 23 has 3 leader partitions, all cold
Action:
-> move 1 leader partition from node 47 to node 23
-> follower on node 23 promote to leader
-> follower on node 47 demote to follower
-> load spread out
Action 3: split hot partition
Partition 123 too hot, even with increased quota
RootService: split it
Step 1: analyze partition data
- 80% traffic goes to merchant_id = 88888 (Apple Store)
- 20% traffic goes to other merchants
Step 2: create new partition 123b
- partition 123a: merchant_id != 88888
- partition 123b: merchant_id == 88888 (Apple Store only)
Step 3: migrate partition 123b to new node
- copy data
- sync redo logs
- switch traffic
- update routing table
Step 4: result
- partition 123a: 20% traffic, runs ok
- partition 123b: 80% traffic, but dedicated resources, runs ok
Action 4: temporary read replicas
Partition 123b still hot (read-heavy queries: users viewing product details)
Solution:
-> create 2 temporary read replicas
-> load balance read queries across 3 replicas
-> writes still go to 1 leader (consistency)
-> reads distributed: ~33% per replica
Real example Singles' Day 2019:
23:50 - pre-peak preparation
RootService: rebalance leaders to distribute evenly
00:00 - peak starts
- 50 partitions spike to 95%+ CPU in 30 seconds
- RootService auto increase quotas
- 10 partitions need splitting
- 15 partitions need leader migrate
00:05 - peak sustained
- all actions completed
- system stable at 544k TPS
- no engineers on-call
- no manual intervention
01:00 - peak declining
- traffic drops to 200k TPS
- RootService scale down resources
- return quotas to normal
03:00 - major compaction
- traffic down to 20k TPS
- safe to run major compaction
- reclaim disk space
Why no engineers on-call in 2020?
Before 2019: engineers had to monitor dashboards manually and decide actions.
In 2020: RootService learned enough patterns:
- knows which partitions will be hot (based on history)
- pre-allocate resources before peak
- auto-scale during peak
- auto-rebalance after peak
- alert engineers only when real anomalies happen (99.9%+ CPU, disk full, network down)
Engineers just sit watching dashboard and drink coffee.
Diagram: LSM-Tree write/read lifecycle example
flowchart TD
%% example: payment_id=12345 write at 00:00:00.001, then read at 00:00:02.000
W[Write SQL\nINSERT payment_id=12345] --> MT[MemTable\nrows 12345 -> v1000]
MT -->|00:00:00.050 freeze| FMT[Frozen MemTable\nimmutable]
FMT -->|00:00:00.200 dump| MINI[Mini SSTable\nmini_0007.sst]
MINI -->|00:00:05.000 minor merge| MINOR[Minor SSTable\nminor_0012.sst]
MINOR -->|03:10:00 major compaction| MAJOR[Major SSTable\nmajor_2020-11-12.sst]
R[Read SQL\nSELECT payment_id=12345] --> MT
R --> FMT
R --> RC[Row cache\nhit maybe]
R --> BF[Bloom filter\nkey 12345]
BF --> MINI
BF --> MINOR
BF --> MAJOR
RC --> OUT[Return row\nabout 0.1ms if hit]
Example time line:
00:00:00.001 write arrives -> put into MemTable
00:00:00.050 MemTable freeze -> become Frozen MemTable
00:00:00.200 dump Frozen -> Mini SSTable
00:00:05.000 minor merge many mini -> Minor SSTable
03:10:00 major compaction -> Major SSTable
00:00:02.000 read can hit MemTable/Frozen first, if miss then SSTable + bloom
Diagram: HTAP dual execution paths
flowchart TB
%% example: 2 queries hit different path in same DB
Q1[Q1: SELECT balance\nuid=293847] --> OLTP[OLTP fast path]
Q2[Q2: SUM amount\nlast 1 hour] --> OLAP[OLAP scan path]
OLTP --> MT[MemTable\nhot rows last 10 min]
MT --> TXN[Txn engine\nMVCC + locks]
OLAP --> SS[SSTable\ncolumn blocks]
SS --> VE[Vector exec\nSIMD batch 1024 rows]
MT <-->|flush + compaction| SS
OLTP --> OUT1[result about 0.1ms]
OLAP --> OUT2[result about 50ms]
Example routing:
- Q1 is point lookup -> go MemTable/hash index first
- Q2 is scan+agg -> go SSTable column blocks + vector exec
- same data, no ETL, just 2 execution path
This is one of the most brilliant part of OceanBase. Traditional databases usually split into 2 types:
OLTP (Online Transaction Processing):
- Fast point queries: "get balance of user A"
- Fast writes: "insert payment record"
- Row-oriented storage
- Examples: MySQL, PostgreSQL, Oracle
OLAP (Online Analytical Processing):
- Slower aggregate queries: "total sales by region"
- Scan millions of rows
- Column-oriented storage
- Examples: ClickHouse, Snowflake, BigQuery
Problem with traditional approach:
OLTP DB (MySQL)
↓
ETL job (runs every hour)
↓
OLAP warehouse (Snowflake)
↓
Dashboard (1 hour delay)
Singles' Day: merchants want real-time dashboards. "How many i sold in last 10 mins?" You cant wait 1 hour.
OceanBase HTAP solution: 1 database, 2 execution paths
Storage engine supports both:
- MemTable (row-oriented)
- Active writes go here
- Hash + B-tree indexes
- Perfect for OLTP point queries
- SSTable (column-oriented)
- Flushed data stored here
- Column encoding + compression
- Perfect for OLAP scans
Concrete example:
OLTP query (point query):
SELECT * FROM payments WHERE payment_id = 12345;Execution plan:
1. Check MemTable hash index -> O(1) lookup
2. Find row in 0.1ms
3. Return result
Data layout in MemTable (row-oriented):
Row: [payment_id=12345, user_id=999, amount=100, timestamp=1234567890, status=success]
OLAP query (aggregate scan):
SELECT
region,
SUM(amount) as total_sales,
COUNT(*) as order_count
FROM payments
WHERE timestamp > '2019-11-11 00:00:00'
GROUP BY region;Execution plan:
1. Scan SSTable (column-oriented)
2. Only read needed columns: timestamp, region, amount
3. Vector execution: process 1000 rows at once
4. Use SIMD instructions for SUM/COUNT
5. Return aggregated result
Data layout in SSTable (column-oriented):
timestamp column: [1234567890, 1234567891, 1234567892, ...]
region column: ['Beijing', 'Shanghai', 'Beijing', ...]
amount column: [100, 200, 150, ...]
Only read these 3 columns, skip other 10 columns
Compressed: 1000 rows = ~4KB (vs 50KB row-oriented)
Vector execution engine:
Traditional execution: 1 row at a time
total = 0
for row in rows:
if row.region == 'Beijing':
total += row.amount # 1M iterations
Vector execution: 1000 rows at once
# Load 1000 rows into CPU cache
chunk = load_1000_rows()
# Use SIMD to filter (parallel)
mask = SIMD_compare(chunk.region, 'Beijing') # 1 CPU instruction
# Use SIMD to sum (parallel)
total = SIMD_sum(chunk.amount, mask) # 1 CPU instruction
# Process 1000 rows in ~same time as 1 row
# Total: 1M/1000 = 1000 iterations (1000x faster)
Real-time analytics during Singles' Day:
Use case 1: fraud detection (real-time OLAP)
-- runs every 10 seconds
SELECT user_id, COUNT(*) as payment_count
FROM payments
WHERE timestamp > NOW() - INTERVAL '60 seconds'
AND amount > 10000
GROUP BY user_id
HAVING COUNT(*) > 20;
-- Find users making 20+ payments > $1000 in 1 minExecution:
- Scan SSTable recent data (~10M rows)
- Vector execution: 10 seconds
- Result: 15 suspicious accounts
- Trigger alert to risk team
Use case 2: merchant dashboard (real-time OLTP + OLAP)
-- dashboard updates every 5 seconds
SELECT
COUNT(*) as total_orders,
SUM(amount) as total_sales,
AVG(amount) as avg_order_value,
MAX(amount) as largest_order
FROM orders
WHERE merchant_id = 88888
AND timestamp > NOW() - INTERVAL '1 hour';Execution:
- Recent data (last 10 min) from MemTable: OLTP fast path
- Older data (10-60 min ago) from SSTable: OLAP scan
- Combined result in 50ms
- Merchant sees real-time dashboard
Use case 3: recommendation engine (complex OLAP)
-- Update recommendations every 30 seconds
SELECT
p1.user_id,
p2.product_id,
COUNT(*) as co_purchase_count
FROM purchases p1
JOIN purchases p2 ON p1.order_id = p2.order_id
WHERE p1.product_id = 12345 -- iPhone
AND p2.product_id != 12345
AND p1.timestamp > NOW() - INTERVAL '10 minutes'
GROUP BY p1.user_id, p2.product_id
ORDER BY co_purchase_count DESC
LIMIT 10;
-- Users buying iPhone, what else did they buy?Execution:
- Join 2 large tables
- Scan millions of rows
- Vector execution + hash join
- Result in 2 seconds
- Update recommendation model
Simultaneously running:
00:00:00 - Singles' Day starts
Load:
- 544k OLTP TPS (payments, orders)
- 50 OLAP queries/second (fraud, dashboards, recommendations)
- Total: 544k + 50 queries/second
- Same database, same data, zero lag
CPU allocation:
- 90% for OLTP (priority)
- 10% for OLAP (background)
Memory allocation:
- MemTable: 60% (hot OLTP data)
- Block Cache: 30% (recent OLAP scans)
- Row Cache: 10% (frequently accessed rows)
No ETL tax: Traditional:
- Storage cost: OLTP DB + OLAP warehouse = 2x
- Network cost: ETL transfer petabytes/day
- Lag: 1 hour minimum
- Complexity: maintain 2 systems
OceanBase:
- Storage cost: 1x (same data)
- Network cost: 0 (no transfer)
- Lag: 0 (real-time)
- Complexity: 1 system
sequenceDiagram
autonumber
%% example: transfer $100 from uid=293847(p847@node47) -> uid=100523(p523@node23)
participant U as User
participant C as Coordinator (node47)
participant P1 as Participant node47 (p847)
participant P2 as Participant node23 (p523)
U->>C: begin txn txn_19283746 ts=1234567890
C->>P1: PREPARE txn_19283746\nUPDATE A balance -100
C->>P2: PREPARE txn_19283746\nUPDATE B balance +100
P1-->>C: VOTE YES (lock A, write prepare log)
P2-->>C: VOTE YES (lock B, write prepare log)
alt all YES
C->>C: write COMMIT log (durable)
C->>P1: COMMIT txn_19283746
C->>P2: COMMIT txn_19283746
P1-->>C: ACK (apply + release lock)
P2-->>C: ACK (apply + release lock)
C-->>U: SUCCESS (lat ~10ms)
else any NO
C->>C: write ABORT log
C->>P1: ABORT txn_19283746
C->>P2: ABORT txn_19283746
P1-->>C: ACK (release lock)
P2-->>C: ACK (release lock)
C-->>U: FAIL
end
Example timeline (same datacenter):
0ms coordinator send prepare
2ms node47 prepared + vote YES
6ms node23 prepared + vote YES
7ms coordinator write commit log
8ms send commit to both
10ms both ack, user sees success
- Cross-table transactions not using the same partition key (rare)
sequenceDiagram
autonumber
%% Normal 2PC: prepare + commit
participant U as Client
participant C as Coordinator (Node47)
participant P1 as Participant (Node47 p847)
participant P2 as Participant (Node23 p523)
U->>C: begin txn txn_19283746\nGTS ts=1234567890
par PREPARE
C->>P1: PREPARE txn_19283746\nUPDATE A -100
C->>P2: PREPARE txn_19283746\nUPDATE B +100
end
P1->>P1: check + lock + write PREPARE log
P2->>P2: check + lock + write PREPARE log
P1-->>C: VOTE YES
P2-->>C: VOTE YES
C->>C: decide COMMIT\nwrite COMMIT log
par COMMIT
C->>P1: COMMIT txn_19283746
C->>P2: COMMIT txn_19283746
end
P1->>P1: apply + write COMMIT log + unlock
P2->>P2: apply + write COMMIT log + unlock
P1-->>C: ACK
P2-->>C: ACK
C-->>U: success
sequenceDiagram
autonumber
%% Normal 2PC: abort path
participant U as Client
participant C as Coordinator
participant P1 as Participant Node47
participant P2 as Participant Node23
U->>C: begin txn txn_19283746
C->>P1: PREPARE
C->>P2: PREPARE
P1-->>C: YES
P2-->>C: NO (constraint failed)
C->>C: decide ABORT\nwrite ABORT log
C->>P1: ABORT
C->>P2: ABORT
P1->>P1: rollback + unlock
P2->>P2: unlock if needed
P1-->>C: ACK
P2-->>C: ACK
C-->>U: failure
sequenceDiagram
autonumber
%% Scenario A: participant crash during PREPARE
participant C as Coordinator
participant P1 as Node47 participant
participant P2 as Node23 participant
C->>P1: PREPARE txn_19283746
C->>P2: PREPARE txn_19283746
P1-->>C: YES (prepared)
Note over P2: crash before vote
C->>C: timeout => decide ABORT
C->>P1: ABORT
P1->>P1: unlock
Note over P2: restart\nreplay log -> found PREPARE-only\nask coordinator -> ABORT\ncleanup
sequenceDiagram
autonumber
%% Scenario B: coordinator crash after all YES, before COMMIT
participant U as Client
participant C as Coordinator Node47
participant P1 as Node47 participant
participant P2 as Node23 participant
participant NC as New coordinator Node23
U->>C: begin txn txn_19283746
C->>P1: PREPARE
C->>P2: PREPARE
P1-->>C: YES
P2-->>C: YES
Note over C: crash before sending COMMIT
Note over P1,P2: locks held, waiting decision
NC->>NC: timeout detected, take over
NC->>P1: query state
NC->>P2: query state
P1-->>NC: prepared YES
P2-->>NC: prepared YES
NC->>NC: decide COMMIT
NC->>P1: COMMIT
NC->>P2: COMMIT
P1-->>NC: ACK
P2-->>NC: ACK
Scenario C: Network partition during COMMIT
sequenceDiagram
autonumber
participant C as Coordinator
participant N47 as Node47
participant N23 as Node23
C->>N47: COMMIT txn_19283746
C-x N23: COMMIT txn_19283746 (dropped by partition)
N47-->>C: ACK (apply + release locks)
Note over N23: still locked, waiting for decision
loop retry until network heals
C->>N23: COMMIT txn_19283746
end
N23-->>C: ACK (apply + release locks)
stateDiagram-v2
%% participant local state machine
[*] --> Init
Init --> Prepared: PREPARE log durable + lock acquired
Prepared --> Committed: COMMIT log durable + apply + unlock
Prepared --> Aborted: ABORT or timeout cleanup + unlock
Committed --> [*]
Aborted --> [*]
Diagram: Partition replica layout and quorum
flowchart TB
%% example: partition 847 replicas across 3 zones
subgraph Z1[Zone 1 - Beijing]
L[Server47\nLeader p847]
Fz1[Server102\nFollower p847]
end
subgraph Z2[Zone 2 - Hangzhou]
Fz2a[Server203\nFollower p847]
Fz2b[Server247\nFollower p847]
end
subgraph Z3[Zone 3 - Shenzhen]
Fz3[Server389\nFollower p847]
end
L -->|redo log ts=1234567890| Fz1
L -->|redo log ts=1234567890| Fz2a
L -->|redo log ts=1234567890| Fz2b
L -->|redo log ts=1234567890| Fz3
subgraph Q[Quorum rule 3-of-5]
Q1[need any 3 ACK to commit]
end
L -. commit ok when 3 ACK .-> Q1
Example commit math:
- leader writes log locally
- wait ACK from any 2 followers
- total 3/5 -> commit ok
Example failure:
- lose Server47 (leader) + Server102 (follower)
- still 3/5 left -> system can still work
5 replicas, 3 zones, RPO=0, RTO<30s.
Diagram: Replica count vs quorum trade-off
flowchart TB
%% replica placement + quorum comparison
subgraph R3[3 replicas\nquorum=2\ntolerate 1 failure]
r3a((A))
r3b((B))
r3c((C))
end
subgraph R5[5 replicas\nquorum=3\ntolerate 2 failures]
r5a((A))
r5b((B))
r5c((C))
r5d((D))
r5e((E))
end
subgraph R7[7 replicas\nquorum=4\ntolerate 3 failures\nextra overhead]
r7a((A))
r7b((B))
r7c((C))
r7d((D))
r7e((E))
r7f((F))
r7g((G))
end
R3 --> Why3[Risk: lose 1 DC may remove 2 replicas\n=> system down]
R5 --> Why5[Sweet spot: survive 1 DC loss\nlose 2 replicas\nwith acceptable cost]
R7 --> Why7[Higher safety but more replication cost]
sequenceDiagram
autonumber
%% Failure scenario 1: single leader server failure -> election -> route update
participant L as Server47 (leader p847)
participant F1 as Server102 (follower)
participant F2 as Server203 (follower)
participant F3 as Server247 (follower)
participant F4 as Server389 (follower)
participant RS as RootService
participant PX as OBProxy
Note over L: normal traffic, replicate logs
L-->>F1: redo log
L-->>F2: redo log
L-->>F3: redo log
L-->>F4: redo log
Note over L: crash
F2-->>RS: leader timeout detected
RS->>RS: start Paxos election
RS-->>F2: promote to new leader (new term)
RS-->>PX: update route p847 -> Server203
PX-->>F2: new requests
Diagram: Replica placement across zones
flowchart TB
%% replica placement across zones (example p847)
subgraph Z1[Zone 1 Beijing]
BJ1[(DC1 leader)]
BJ2[(DC2 follower)]
end
subgraph Z2[Zone 2 Hangzhou]
HZ1[(DC1 follower)]
HZ2[(DC2 follower)]
end
subgraph Z3[Zone 3 Shenzhen]
SZ1[(DC1 follower)]
end
BJ1 --> BJ2
BJ1 --> HZ1
BJ1 --> HZ2
BJ1 --> SZ1
Q[Commit when 3-of-5 replicas have log] -.-> BJ1
Example chi tiết về data preservation:
Before failure (timestamp 1234567890):
User A submitted payment, transaction committed
Server 47 (Leader): balance = $900 ✓ committed
Server 102: balance = $900 ✓ replicated
Server 203: balance = $900 ✓ replicated
Server 247: balance = $900 ✓ replicated
Server 389: balance = $900 ✓ replicated
During failure (timestamp 1234567891):
User B submitted payment, Leader processing
Server 47 (Leader): balance = $800 ← processing, NOT YET COMMITTED
Server 47 DIES before replication
After failover:
New Leader (Server 203) state:
- Has User A transaction: balance = $900 ✓ PRESERVED
- Doesn't have User B transaction: balance = $900 (not $800)
User B sees: "Transaction failed, please retry"
User B retries → success
Final balance: $800
Key point:
- Committed transactions NEVER lost (RPO = 0)
- Uncommitted transactions may be lost (acceptable)
- Users just retry failed requests
sequenceDiagram
autonumber
%% Failure scenario 2: entire datacenter down -> mass elections + route update
participant BJ1 as Beijing DC1 (down)
participant RS as RootService
participant HZ as Hangzhou zone
participant SZ as Shenzhen zone
participant PX as OBProxy fleet
Note over BJ1: power outage, thousands of servers drop
HZ-->>RS: heartbeat timeouts for BJ1 nodes
SZ-->>RS: heartbeat timeouts for BJ1 nodes
RS->>RS: start elections for many partitions in parallel
RS-->>HZ: promote new leaders for affected partitions
RS-->>SZ: promote new leaders for affected partitions
RS-->>PX: push updated routing tables
Note over PX: clients retry -> traffic shifts to HZ/SZ
Diagram: Datacenter outage - quorum still available
flowchart TB
%% Entire zone 1 down but quorum still satisfied
subgraph Z1[Zone 1 Beijing]
DC1x[DC1 down]
DC2x[DC2 down]
end
subgraph Z2[Zone 2 Hangzhou]
HZ1[(replica)]
HZ2[(replica)]
end
subgraph Z3[Zone 3 Shenzhen]
SZ1[(replica)]
end
Q[3-of-5 quorum still available\n=> keep accepting writes] --> HZ1
Q --> HZ2
Q --> SZ1
sequenceDiagram
autonumber
%% Auto-heal: rebuild back to 5 replicas
participant RS as RootService
participant HZ1 as Hangzhou DC1
participant SZ1 as Shenzhen DC1
RS->>RS: detect only 3 replicas remain
RS->>HZ1: create new replica r4 + backfill
RS->>SZ1: create new replica r5 + backfill
HZ1-->>RS: r4 caught up
SZ1-->>RS: r5 caught up
RS->>RS: restore 5 replicas
sequenceDiagram
autonumber
%% Failure scenario 3: network partition (split brain) prevented by quorum
participant BJ as Beijing group (2 replicas)
participant HZ as Hangzhou+Shenzhen group (3 replicas)
Note over BJ,HZ: network partition between groups
BJ->>BJ: try to commit write
BJ-->>BJ: only 2-of-5 ACK possible => cannot reach quorum
BJ-->>BJ: reject writes (read-only safe mode)
HZ->>HZ: elect leader with 3-of-5
HZ-->>HZ: accept writes (quorum satisfied)
Note over BJ,HZ: network heals
BJ->>HZ: rejoin, step down to follower, catch up logs
Diagram: Quorum rule - who can accept writes
flowchart TB
%% quorum rule decides who can accept writes
A[Group A\n2 replicas] -->|no quorum| RO[READ-ONLY\nreject writes]
B[Group B\n3 replicas] -->|has quorum| RW[READ/WRITE\naccept writes]
RO --> HEAL[heal network -> catch up]
RW --> HEAL
Diagram: RTO component breakdown
flowchart TB
%% RTO breakdown
RTO[RTO components] --> DET[Detection\n1-2s]
RTO --> ELE[Election\n3-10s]
RTO --> ROUTE[Route update\n2-8s]
RTO --> WARM[Warmup\n2-10s]
DET --> S1[Single server\n8-15s]
ELE --> S1
ROUTE --> S1
WARM --> S1
DET --> DC[Datacenter\n20-30s]
ELE --> DC
ROUTE --> DC
WARM --> DC
sequenceDiagram
autonumber
%% RPO=0: commit only after majority replication
participant U as Client
participant L as Leader
participant F as Followers (quorum)
U->>L: write request
L->>L: append redo log locally
par replicate
L->>F: redo log
end
F-->>L: majority ACK
L-->>U: success (committed)
Note over L: leader crash before majority ACK => user sees failure, retry
sequenceDiagram
autonumber
%% Chaos test summary
participant ENG as Engineers
participant OPS as Ops tool
participant CL as OceanBase cluster
ENG->>OPS: kill 50 random nodes
OPS->>CL: terminate nodes
CL-->>CL: elections + leader moves
CL-->>ENG: recovered in <=30s, RPO=0
ENG->>OPS: disconnect Beijing DC1
OPS->>CL: network isolation
CL-->>CL: mass elections + reroute
CL-->>ENG: recovered ~25s
ENG->>OPS: add 500ms latency BJ<->HZ
OPS->>CL: inject latency
CL-->>ENG: commit slower, still available
Diagram: Microblock compression and read path
flowchart TD
%% example: microblock for payments_p847, columns: status/timestamp/amount
Raw[Raw rows payments_p847] --> Col[Column layout\nstatus timestamp amount]
Col --> Enc[Encoding step\nstatus dict, ts delta, amount delta]
Enc --> Comp[Compression\nLZ4 for cold, Snappy for hot]
Comp --> Store[Store microblock\nsize 4096B -> 287B]
Store --> Read[Read query\nSELECT amount WHERE uid=12345]
Read --> Decomp[Decompress 287B -> 4KB]
Decomp --> Prune[Column prune\nkeep only amount]
Prune --> Decode[Decode amount\ndelta decode]
Decode --> Exec[Exec + return\nabout 0.16ms]
Example numbers in 1 block:
- before: 4KB microblock
- after encoding: 820B
- after LZ4: 287B
- read I/O: 4KB -> 287B (14x less)
At hundreds of petabytes of storage, compression is not a "nice to have"—it is a "must have". Without compression, we'd need 2–3× more servers.
Diagram: Compression pipeline (write and read)
flowchart TD
%% Compression pipeline overview
subgraph W[Write path]
WR[Rows] --> MB[Build microblock]
MB --> COL[Columnize]
COL --> ENC[Encode\ndict / delta / rle / prefix]
ENC --> CMP[Compress\nLZ4 or Snappy]
CMP --> SST[Persist SSTable]
end
subgraph R[Read path]
Q[Query] --> PRUNE[Prune columns]
PRUNE --> IO[Read compressed microblock]
IO --> DCMP[Decompress]
DCMP --> DECODE[Decode selected columns]
DECODE --> RES[Return rows]
end
SST -. serves .-> IO
| Item | Formula | Example |
|---|---|---|
| Raw write | orders × avg row size | 1.29B × 5KB = 6.45TB |
| Replicas | raw × replica factor | 6.45TB × 5 = 32.25TB |
| With 80% compression | raw × 20% | 6.45TB × 20% = 1.29TB |
Diagram: Row store vs column store I/O
flowchart LR
%% Row vs column layout and scan behavior
subgraph ROW[Row store]
R1[Row1: uid, amount, status, ts, merchant]
R2[Row2: uid, amount, status, ts, merchant]
R3[Row3: uid, amount, status, ts, merchant]
end
subgraph COLS[Column store]
C1[uid: 1001,1002,1003,...]
C2[amount: 100,200,150,...]
C3[status: success,success,pending,...]
C4[ts: 1234567890, ...]
C5[merchant: 88888, ...]
end
Q1[Query: SELECT amount WHERE ts > X] -->|row store reads many columns| ROW
Q1 -->|column store reads only ts+amount| COLS
ROW --> W1[More I/O\nmixed types]
COLS --> W2[Less I/O\nbetter compression]
Encoding Techniques — Overview (4 patterns)
Diagram: Column encoding technique selection
flowchart TB
%% encoding decision cheatsheet
A[Column data] --> DCT[Dictionary\nlow cardinality\nstatus, merchant]
A --> DLT[Delta / Delta-of-delta\nmonotonic numbers\ntimestamp]
A --> RLE[RLE\nlong runs\nflash sale product_id]
A --> PFX[Prefix\ncommon prefixes\nproduct_name]
DCT --> OUT[Smaller encoded stream]
DLT --> OUT
RLE --> OUT
PFX --> OUT
| Technique | When it applies | Key idea | Typical gain |
|---|---|---|---|
| Dictionary | few distinct values | map to small ids | 80–98% |
| Delta / DoD | ordered numbers | store differences | 85–98% |
| RLE | repeated values | (value,count) | 90–99%+ |
| Prefix | shared string prefix | store prefix once | 30–60% |
sequenceDiagram
autonumber
%% dictionary example
participant W as Writer
participant MB as Microblock
W->>MB: build dictionary {success:0,pending:1,failed:2,cancelled:3}
W->>MB: encode status column as ids [0,0,1,0,2,...]
MB-->>W: output bit-packed ids (2 bits/value)
sequenceDiagram
autonumber
%% delta-of-delta example
participant W as Writer
participant MB as Microblock
W->>MB: base ts=1234567890
W->>MB: deltas [+1,+1,+1,...]
W->>MB: delta-of-delta [0,0,0,...]
MB-->>W: store base + compact zeros
sequenceDiagram
autonumber
%% RLE example
participant W as Writer
participant MB as Microblock
W->>MB: product_id stream [12345 x 500000]
W->>MB: encode as (12345,500000)
MB-->>W: 12 bytes payload
sequenceDiagram
autonumber
%% prefix example
participant W as Writer
participant MB as Microblock
W->>MB: find common prefix "iPhone 11 Pro"
W->>MB: store suffixes [" Max 256GB ...", " 128GB ...", ...]
MB-->>W: prefix + suffix array
Compression Layer (after encoding)
flowchart TB
title: Compression algorithm choice
%% encoding then compression
E[Encoded column stream] -->|many repeated patterns| LZ[LZ4\nhigher ratio]
E -->|fast decode| SNP[Snappy\nfaster decompress]
LZ --> OUT[Compressed bytes]
SNP --> OUT
OUT -->|read| DCP[Decompress]
DCP -->|then| DEC[Decode columns]
| Algorithm | Bias | Use case |
|---|---|---|
| LZ4 | higher compression ratio | cold data, bandwidth/storage sensitive |
| Snappy | faster decompression | hot data, latency sensitive |
Real compression results:
Example microblock (4KB uncompressed):
1. Raw data: 4096 bytes
2. After column encoding: 820 bytes (80% reduction)
3. After LZ4 compression: 287 bytes (93% total reduction)
Typical compression ratios by column type:
- status (enum): 98% reduction
- timestamp (sequential): 95% reduction
- user_id (random): 60% reduction
- amount (numbers): 70% reduction
- product_name (strings): 50% reduction
Overall: 70-90% reduction across all data
Performance impact (summary)
flowchart LR
title: Compression performance trade-off
%% compression trade-off summary
IO[Disk I/O] -->|reduced bytes read| FAST[Lower latency / higher throughput]
CPU[CPU] -->|decompress + decode| COST[Small extra cost]
FAST --> NET[Net win\nI/O saved >> CPU cost]
COST --> NET
| Metric | Without compression | With compression |
|---|---|---|
| Read bytes per microblock | 4096B | 287B |
| Extra CPU | ~0ms | ~0.1ms decompress + decode |
| Cache efficiency | baseline | higher (more blocks fit) |
Decompression on read (as a sequence)
sequenceDiagram
autonumber
participant Q as Query
participant IDX as Index
participant DISK as Disk
participant LZ as Decompress
participant CE as Column engine
Q->>IDX: locate microblock for user_id=12345
IDX-->>Q: microblock_id
Q->>DISK: read compressed microblock (287B)
DISK-->>Q: bytes
Q->>LZ: decompress to 4KB
LZ-->>Q: uncompressed block
Q->>CE: prune columns (keep amount)
CE-->>Q: amount column
Q->>CE: decode (delta)
CE-->>Q: result rows
Column pruning benefit:
flowchart LR
title: Column pruning I/O reduction
%% column pruning I/O comparison
subgraph NO[Without column pruning]
N1[Read 20 columns\n200B/row] --> N2[Scan 1M rows\n200MB]
end
subgraph YES[With column pruning]
Y1[Read timestamp + amount\n16B/row] --> Y2[Scan 1M rows\n16MB]
end
NO --> GAP[12.5x more I/O]
YES --> GAP
Real Singles' Day numbers (summary)
| Item | Value |
|---|---|
| Total written (raw) | 6.45TB |
| After compression | 1.3TB |
| Saved per replica | 5.15TB |
| Total saved (5 replicas) | 25.75TB |
flowchart TB
title: Two-level cache overview
%% example: 2 queries hit 2 caches
Q1[Q1 point: payment_id=12345] -->|point| RC[Row Cache\nkey=payments,847,12345]
Q2[Q2 scan: ts>00:00] -->|scan| BC[Block Cache\nkey=macroblock_id]
RC -->|hit 0.05ms| OUT1[row]
RC -->|miss| MT[MemTable]
MT -->|miss| SST[SSTable]
BC -->|hit 0.5ms| OUT2[blocks]
BC -->|miss| SST
SST -->|read 287B microblock| IO[Disk I/O]
IO --> SST
subgraph Warm[Warming example]
WS[Scan hot p847\nload last 24h rows] --> RC
WS --> BC
end
Example cache hit flow:
- point read payment_id=12345: RowCache hit -> 0.05ms
- scan last 1 hour: BlockCache hit -> avoid disk read
- after restart: warming scan loads p847 hot keys
Two-level cache architecture
flowchart TB
title: Why two cache levels
%% why 2 cache levels
AP[Access pattern] --> OLTP[OLTP\npoint reads\nprimary key]
AP --> OLAP[OLAP\nrange scans\nfilters]
OLTP --> RC2[Row Cache\nrow-level objects]
OLAP --> BC2[Block Cache\nmicroblocks]
RC2 --> GOAL[Lower tail latency]
BC2 --> GOAL
Row Cache:
sequenceDiagram
autonumber
%% point query path: row cache hit/miss
participant U as Client
participant RC as Row Cache
participant MT as MemTable
participant SST as SSTable
U->>RC: GET (table,partition,pk)
alt cache hit
RC-->>U: row (0.05ms)
else cache miss
RC-->>U: miss
U->>MT: lookup
alt in memtable
MT-->>U: row
else not in memtable
U->>SST: read microblock
SST-->>U: row
U->>RC: put row
end
end
flowchart TB
title: Cache population strategies
%% cache population strategies
subgraph Demand[Read-through]
D1[Cache miss] --> D2[Load from SSTable] --> D3[Insert into cache]
end
subgraph Warm[Warming]
W1[Server restart/compaction] --> W2[Scan hot partitions] --> W3[Preload recent/hot keys]
end
flowchart TB
title: Cache eviction and thrashing
%% eviction + thrashing
subgraph LRU[Eviction]
E1[Cache full] --> E2[Find least recently used] --> E3[Evict] --> E4[Insert new row]
end
subgraph THR[Thrashing]
T1[Too many unique keys] --> T2[Constant evict/insert] --> T3[Low hit rate]
end
THR --> Fix[Mitigations: bigger cache / LRU-K / partitioned cache]
Block Cache — Overview
sequenceDiagram
autonumber
%% range scan path: block cache hit/miss
participant Q as Query
participant BC as Block Cache
participant DISK as Disk
participant DEC as Decompress+Decode
Q->>Q: decide scan needs many microblocks
loop for each microblock
Q->>BC: GET microblock
alt hit
BC-->>Q: decompressed block
else miss
BC-->>Q: miss
Q->>DISK: read compressed block
DISK-->>Q: bytes
Q->>DEC: decompress + decode
DEC-->>Q: decompressed block
Q->>BC: PUT decompressed block
end
end
| Why store decompressed blocks? | Trade-off |
|---|---|
| avoid decompression on every hit | uses more memory (fewer blocks cached) |
flowchart TB
title: Dynamic cache budget split
%% cache split + dynamic rebalance
TOTAL[Total cache budget] --> RC[Row Cache]
TOTAL --> BC2[Block Cache]
RC -->|monitor hit rate| MON[Metrics]
BC2 -->|monitor hit rate| MON
MON -->|rebalance every 5m| ADJ[Adjust split]
ADJ --> RC
ADJ --> BC2
Cache warming after major compaction (timeline)
sequenceDiagram
autonumber
participant Z as Zone 2
participant W as Warming job
participant BC as Block Cache
participant Q as Queries
Note over Z: major compaction finishes
Z-->>BC: SSTable ids changed -> cache keys invalid
BC-->>Q: 100% miss if serving immediately
Z->>W: start warming before serving
W->>W: identify hot partitions (top 20% QPS)
loop for each hot partition
W->>Z: sequential scan new SSTable
W->>BC: preload hot microblocks
end
W-->>Z: warming done
Z-->>Q: start serving with 75% hit rate
Cache warming specifics
flowchart TB
title: Cache warming plan (what/how/time)
%% what/how/time/result
subgraph What[What to warm]
HOT[Hot partitions\nTop 20% by QPS\nexample top 200]
COLD[Cold partitions\n80%]
HOT --> H24[Load 100% last 24h]
HOT --> H7[Load 50% last 7d]
COLD --> NAT[Warm naturally]
end
subgraph HowMuch[How much to warm]
RCV[Row Cache: 10M rows 2GB]
BCV[Block Cache: 500k microblocks 2GB]
TOT[Total: 4GB per node\nCluster: 20k nodes -> 80TB]
end
subgraph Time[Time to warm]
SCAN[Sequential scan 1GB/s] --> CALC[4s scan + 5s insert\n=> 10s per zone]
end
subgraph Result[Result]
READY[Zone ready in 10s\nNo user-visible degradation]
end
What --> Result
HowMuch --> Result
Time --> Result
Cache hit rate metrics — Singles' Day 2019 (summary)
| Cache | Window | Hit rate | Avg latency | QPS |
|---|---|---|---|---|
| Row Cache | 00:00–01:00 | 92% | 0.15ms | 500k |
| Row Cache | 01:00–06:00 | 95% | 0.10ms | 200k |
| Row Cache | 06:00–24:00 | 97% | 0.08ms | 50k |
| Block Cache | 00:00–01:00 | 65% | 8ms / 1k rows | 50 scans/s |
| Block Cache | 01:00–06:00 | 75% | 5ms / 1k rows | 100 scans/s |
| Block Cache | 06:00–24:00 | 85% | 3ms / 1k rows | 200 scans/s |
xychart-beta
title "Hit rate trend"
x-axis ["00-01","01-06","06-24"]
y-axis "hit rate %" 0 --> 100
line "RowCache" [92,95,97]
line "BlockCache" [65,75,85]
Cache benefits — By the numbers (summary)
flowchart LR
title: Cache ROI back-of-envelope
%% cache benefit back-of-envelope
A[Without cache\n544k QPS\napprox 2ms avg\nhigh disk I/O] --> B[Needs many servers]
C[With 95% hit rate\nmost reads from RAM\nlow disk I/O] --> D[Fewer servers]
B --> E[Cost]
D --> E
E --> F[Net savings]
| Scenario | Disk I/O (per node) | CPU cores (rough) | Servers |
|---|---|---|---|
| Without cache | 2.176GB/s | 1088 | 3000 |
| With 95% hit rate | 108MB/s | 80 | 1000 |
| Cost item | Value |
|---|---|
| Servers saved | 2000 |
| Server capex saved | $10M |
| RAM cost | $1.28M |
| Net savings | $8.72M |
| ROI | 6.8x |
Cache is a necessity to reach sustained 544k TPS.
Major compaction is a necessary evil: it reclaims disk space and cleans up old versions, but it is expensive (rewrites data and thrashes caches).
sequenceDiagram
autonumber
%% Zone rotation timeline (Singles' Day)
participant RS as RootService
participant PX as OBProxy
participant Z1 as Zone 1
participant Z2 as Zone 2
participant Z3 as Zone 3
Note over Z1,Z3: 00:00 All zones ACTIVE, peak traffic
Note over Z2: 06:00 drain Zone 2 -> COMPACTING
RS->>PX: stop routing to Zone 2
PX-->>Z1: shift traffic up
PX-->>Z3: shift traffic up
RS->>Z2: start major compaction
Note over Z2: 3h compaction
RS->>Z2: start cache warming
Note over Z2: 10m warming
RS->>PX: re-enable routing to Zone 2
Note over Z3: 10:00 drain Zone 3 -> COMPACTING (repeat)
flowchart TB
title: Single node major compaction steps
%% Single node compaction (high-level)
PREP[Prepare\nstop serving writes\nflush memtables] --> MERGE[Merge\nbuild new major SSTable]
MERGE --> CLEAN[Cleanup\ndelete old SSTables\nreclaim space]
CLEAN --> WARM[Warm caches\npreload hot blocks]
WARM --> READY[Ready\nback to ACTIVE]
MERGE --> INC[Incremental rewrite\nonly modified macroblocks]
flowchart LR
title: Traditional vs incremental compaction I/O
%% Traditional vs incremental write cost
T[Traditional\nrewrite full dataset] --> TW[High write I/O]
I[Incremental\nrewrite modified blocks] --> IW[Low write I/O]
TW --> SLOW[Longer compaction]
IW --> FAST[Shorter compaction]
| Topic | Key takeaway |
|---|---|
| Traditional approach | compact everywhere -> caches cold -> user-visible latency |
| Zone rotation | compact 1 zone at a time -> other zones serve -> near-zero user impact |
| Incremental compaction | reduces write amplification by rewriting only modified macroblocks |
flowchart TB
title: Cache preservation by proactive warming
%% Cache preservation by proactive warming
OLD[Old cache keys\nSSTable v4] --> MAP[Map hot regions\nold to new]
MAP --> LOAD[Load hot blocks\nfrom new SSTable v5]
LOAD --> HIT[Hit rate recovers faster]
sequenceDiagram
autonumber
%% Stop-the-world vs rotation (impact)
participant U as Users
participant STW as Stop-the-world
participant ROT as Zone rotation
Note over STW: all zones compact
STW-->>U: caches cold -> high latency for hours
Note over ROT: only 1 zone compacting
ROT-->>U: route to hot zones -> minimal impact
flowchart TB
title: Routing adjustment and emergency handling
%% Load balancing and emergency handling
MON[Monitor zone CPU/latency] --> ROUTE[Adjust routing between active zones]
ROUTE --> OK[Stay within headroom]
FAIL[Active zone failure] --> EMER[Emergency: pause compaction\nbring compacting zone ACTIVE]
EMER --> ROUTE
xychart-beta
title "Disk usage trend"
x-axis ["D1","D2","D3","D10","D11"]
y-axis "TB" 0 --> 600
line "No compaction" [100,160,220,550,600]
line "With compaction" [100,130,160,280,280]
| Item | Value |
|---|---|
| Zone 2 compaction window | 06:00–09:00 |
| Per-node space reclaimed | 15TB |
| Compaction duration | 3h |
| Warming duration | 10m |
| Hit rate right after warming | 75% |
| Hit rate after 1h serving | 85–90% |
| Zone 2 nodes | 2000 |
| Zone 2 reclaimed (cluster-wide) | 30PB |
| Zone 2 rewritten (cluster-wide) | 20PB |
Major compaction is a necessary evil: it reclaims disk space and cleans up old versions, but it is expensive (rewrites data and thrashes caches).