@alezzigo
Created January 5, 2026 03:37
544,000 TPS - Will Alipay Crash?

544,000 payments per second and a world TPC-C record of 707 million tpmC

Singles' Day, 11.11. At 23:59:50, in the Ant Group and Alibaba headquarters in Hangzhou, hundreds of engineers are staring at dashboard screens. This is real money: billions of USD for Alibaba will be decided in the next few minutes.

00:00:00: The festival starts. Roughly a billion people press the pay button at the same time. Traffic jumps from 50k to 544k TPS in about 3 seconds. If something breaks, Alibaba can lose billions. But the dashboards stay green. The peak is 544,000 transactions per second.

00:01:08:

  • first $1B in 68 seconds
  • $10B in less than 30 minutes

By the end of the day, the record numbers:

  • 544k TPS peak (world record at that time)
  • 1.3B orders
  • $34B revenue

For comparison: Visa at its most stressed handles about 65k TPS, and PayPal is around 10k TPS.

System scale around that time:

  • 100 database clusters
  • 20,000 servers
  • 1 million CPU cores
  • hundreds of petabytes of storage

I'm an engineer at Alipay. After some internal sharing sessions and reading internal engineering blogs, I feel the Alipay architecture is super interesting but not well known outside. Most docs are in Chinese, so I'll try to summarize it in a simple way: why doesn't Alipay use Oracle, MySQL, or Postgres?

The real problem on peak day is not a lack of servers, but the database. A payment system needs strong consistency, RPO ≈ 0, a very low RTO, and the ability to survive an insane write spike.

Oracle is very strong as a single instance and with RAC. But when TPS reaches hundreds of thousands, shared storage and global locks become the bottleneck.

MySQL/PostgreSQL can shard, but cross-shard transactions either live in the app layer or use 2PC. That is very hard to keep stable when traffic jumps 10x in a few seconds, and manual sharding across tens of thousands of nodes is an ops nightmare.

NoSQL (MongoDB/Cassandra) scales well but leans toward eventual consistency; for payments that is too risky.

So Alipay built OceanBase from scratch for distributed transactions. In 2020 it also set the TPC-C record of 707M tpmC, beating the big traditional databases.

There is no free lunch at 544k TPS. To carry this number, Alipay accepts some tradeoffs:

| Tradeoff | What you pay | Why it matters at peak |
| --- | --- | --- |
| Infra burden | 3-5 replicas for Paxos; storage + network cost 3-5x | RPO 0 and surviving a DC failure |
| Ops complexity | debugging a distributed system (network/leader/follower) is hard | needs strong monitoring and automation |

Compare with other NewSQL:

| NewSQL | Core idea | Good at | Tradeoff |
| --- | --- | --- | --- |
| Google Spanner | global timestamps + synchronized clock infra | global consistency across regions | depends on Google infra |
| TiDB | Raft + MySQL compatibility | open source ecosystem, easy to deploy | performance tuning and hotspots |
| OceanBase | Paxos + LSM + deep compression | huge spikes, high write throughput | complexity and infra cost |

Overall system architecture

Diagram: Overall system architecture

flowchart LR
  U[Client/App] --> P[OBProxy\nSQL routing]
  P --> S1[OBServer\nSQL/Txn/Storage]
  P --> S2[OBServer\nSQL/Txn/Storage]
  P --> S3[OBServer\nSQL/Txn/Storage]
  RS[RootService\nMetadata/Health/Rebalance] -. control plane .-> P
  RS -. control plane .-> S1
  RS -. control plane .-> S2
  RS -. control plane .-> S3

Big picture: OceanBase runs as a shared-nothing cluster. Each OBServer node is independent, with its own SQL engine, transaction engine and storage. OBProxy sits in front and routes queries to the correct node based on partition keys. RootService is the brain: it manages metadata, watches health, and does load balancing.

A detailed explanation of each component:

  1. OBProxy (SQL routing layer): This is the first layer all requests pass through. Its jobs:
  • parse the SQL query to find the partition key (usually user_id)
  • hash the partition key to determine which OBServer node holds the data
  • route the request to that node
  • handle connection pooling to avoid too many connections

Real example:

INSERT INTO payments (user_id, amount, merchant_id) 
VALUES (293847, 999, 12345);

OBProxy sees user_id = 293847 and hashes it: 293847 mod 1000 = 847. Partition 847 lives on OBServer node 47, so the request is routed to node 47. Simple as that.
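
A minimal sketch of this routing logic in Python, assuming a plain modulo hash and a toy in-memory partition-to-node map (the real routing table is maintained by RootService and is far richer):

# Hypothetical sketch of OBProxy-style routing: hash the partition key,
# then look the partition up in a routing table.
PARTITION_COUNT = 1000
routing_table = {847: 47, 523: 23}  # partition_id -> node_id (toy data)

def route(user_id: int) -> int:
    partition_id = user_id % PARTITION_COUNT   # 293847 % 1000 = 847
    return routing_table[partition_id]         # partition 847 lives on node 47

print(route(293847))  # -> 47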

  2. OBServer node (compute + storage): These are the real workhorses. Each node has:
  • SQL engine: parses and optimizes queries. For a complex query joining 5 tables, the SQL engine decides the join order, which indexes to use, and how many rows to scan.

  • Transaction engine: manages MVCC (Multi-Version Concurrency Control) and 2PC.

    MVCC example:

    • timestamp 1000: user A balance = $100
    • timestamp 1001: a transaction is running and writes user A balance = $50 (not committed yet)
    • timestamp 1002: user B reads user A's balance
    • the system returns the version at timestamp 1000 ($100) because txn 1001 has not committed
    • timestamp 1003: txn 1001 commits
    • timestamp 1004: user C reads the balance -> gets $50
  • Storage engine: an LSM-Tree stores the actual data. Explained in its own section below.

  3. RootService (the brain): RootService doesn't handle user traffic. It does the admin work:
  • track metadata: which node holds partition 847? which node is its leader?
  • monitor health: node 47 at 90% CPU? it needs rebalancing
  • orchestrate migrations: move partition 123 from node A to node B
  • trigger elections: node 47 died, elect a new leader for its partitions

Rebalance example:

  • RootService sees node 47 running at 95% CPU (hot)
  • node 23 is at only 30% CPU (cold)
  • RootService decides to migrate partition 847 from node 47 to node 23
  • steps:
    1. create a new replica of partition 847 on node 23
    2. sync data from node 47 to node 23
    3. wait for node 23 to catch up
    4. switch the leader from node 47 to node 23
    5. update metadata
    6. delete the old replica on node 47
  • total time: ~30 seconds, zero downtime for users
  4. Multi-zone deployment:
flowchart TB
  %% example: 5 replicas for partition 847 across 3 zones
  subgraph Z1[Zone 1 Beijing]
    BJ1[Beijing DC1\nLeader p847]
    BJ2[Beijing DC2\nFollower p847]
  end
  subgraph Z2[Zone 2 Hangzhou]
    HZ1[Hangzhou DC1\nFollower p847]
    HZ2[Hangzhou DC2\nFollower p847]
  end
  subgraph Z3[Zone 3 Shenzhen]
    SZ1[Shenzhen DC1\nFollower p847]
  end

  BJ1 -->|redo log ts=1234567890| BJ2
  BJ1 -->|redo log ts=1234567890| HZ1
  BJ1 -->|redo log ts=1234567890| HZ2
  BJ1 -->|redo log ts=1234567890| SZ1

  Q[Quorum rule 3-of-5\ncommit when any 2 followers ACK] -.-> BJ1

5 datacenters in 3 zones:

| Zone | Datacenters | Purpose |
| --- | --- | --- |
| Zone 1 | Beijing DC1, DC2 | local quorum + survive 1 DC loss |
| Zone 2 | Hangzhou DC1, DC2 | cross-zone quorum + disaster recovery |
| Zone 3 | Shenzhen DC1 | third zone for HA |

Each partition has 5 replicas, spread across zones:

| Replica role | Location | Example for partition 847 |
| --- | --- | --- |
| Leader | Zone 1 | Beijing DC1 |
| Follower | Zone 1 | Beijing DC2 |
| Follower | Zone 2 | Hangzhou DC1 |
| Follower | Zone 2 | Hangzhou DC2 |
| Follower | Zone 3 | Shenzhen DC1 |

If all of Beijing loses power (2 datacenters down), there are still 3 replicas in Hangzhou and Shenzhen. The system keeps running.

With 100 OceanBase clusters on 20,000 servers and 1M CPU cores, this is not about raw power; it's about orchestrating all these resources so they work like one single coherent database.


LSM-Tree storage engine architecture

flowchart TD
  W[Write: INSERT payment_id=12345] --> MT[Active MemTable\nrows: 12345 -> v1000]
  MT -->|00:00:00.050 freeze| FMT[Frozen MemTable\nimmutable snapshot]
  FMT -->|00:00:00.200 dump| MINI[Mini SSTable\nfile: mini_0007.sst]
  MINI -->|00:00:05.000 minor merge| MINOR[Minor SSTable\nfile: minor_0012.sst]
  MINOR -->|03:10:00 major compaction\nincremental| MAJOR[Major SSTable\nfile: major_2020-11-12.sst]

  %% example: read payment_id=12345 at 00:00:02.000
  R[Read: SELECT payment_id=12345] --> MT
  R --> FMT
  R --> RC[Row cache\nhit maybe]
  R --> BF[Bloom filter\ncheck key=12345]
  BF --> MINI
  BF --> MINOR
  BF --> MAJOR
Example in 1 page (super simplified):

00:00:00.001 write comes in:
  MemTable: {12345 -> amount=100, version=1000}

00:00:00.050 MemTable freeze (too full):
  Frozen MemTable: snapshot version=1000

00:00:00.200 dump to disk:
  mini_0007.sst contains key range ... includes 12345

00:00:05.000 minor compaction:
  merge mini_0001..mini_0050 -> minor_0012.sst

03:10:00 major compaction (incremental):
  reuse old macroblocks, rewrite only changed blocks

00:00:02.000 read comes in:
  check MemTable -> miss
  check Frozen MemTable -> miss
  check bloom -> points to minor/major file
  read 1 macroblock -> return row

The LSM-Tree is the heart of the storage engine. Let's break down each stage:

LSM stages in 1 table

| Stage | Trigger | What happens | Typical size / note |
| --- | --- | --- | --- |
| Write to MemTable | every write | keep hot rows + MVCC versions in memory | fast, no disk |
| Freeze | MemTable full | active -> frozen, new active created | 64MB-256MB |
| Flush to Mini SSTable | background | dump frozen memtable to disk | many small files |
| Minor compaction | too many mini files | merge many mini -> 1 minor, keep newest versions | a minor is about 2GB |
| Major compaction | daily, during low traffic | merge minors + old major -> new major | incremental macroblock reuse |
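
To make the stages concrete, here is a toy in-memory sketch of the write path (active MemTable, freeze when full, flush frozen memtables to sorted runs). The class, thresholds and sizes are invented and do not reflect real OceanBase internals:

# Toy LSM write path: put -> freeze when "full" -> flush frozen memtables
# to sorted runs; reads check newest data first.
from dataclasses import dataclass, field

MEMTABLE_LIMIT = 4  # rows, tiny on purpose

@dataclass
class LSMStore:
    active: dict = field(default_factory=dict)
    frozen: list = field(default_factory=list)    # immutable snapshots
    sstables: list = field(default_factory=list)  # "mini" sorted runs

    def put(self, key, value):
        self.active[key] = value
        if len(self.active) >= MEMTABLE_LIMIT:
            self.frozen.append(self.active)        # freeze
            self.active = {}                       # new active memtable

    def flush(self):
        # background job: dump each frozen memtable as a sorted run
        while self.frozen:
            memtable = self.frozen.pop(0)
            self.sstables.append(sorted(memtable.items()))

    def get(self, key):
        # newest first: active -> frozen -> sorted runs
        if key in self.active:
            return self.active[key]
        for mt in reversed(self.frozen):
            if key in mt:
                return mt[key]
        for run in reversed(self.sstables):
            for k, v in run:
                if k == key:
                    return v
        return None

store = LSMStore()
for i in range(6):
    store.put(f"payment:{i}", {"amount": 100 + i})
store.flush()
print(store.get("payment:2"))  # {'amount': 102}
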
flowchart LR
  %% example: MemTable internal indexes
  subgraph MT[MemTable]
    BT[B-Tree index\nrange query by ts]
    H[Hash index\npoint query by payment_id]
    MV[MVCC versions\nts=1000,1001,...]
  end
  W1[INSERT payment_id=1\nts=1000] --> MT
  W2[UPDATE payment_id=1\nts=1001] --> MT
  R1[SELECT payment_id=1\nread ts=1000] --> MV

Concrete example as table

| Time | Operation | Result (simplified) |
| --- | --- | --- |
| 1000 | INSERT payment_id=1 amount=999 | version 1000 stored |
| 1001 | UPDATE payment_id=1 amount=888 | version 1001 stored, version 1000 kept |
| 1002 | read at ts=1000 | returns amount 999 (the old version) |
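
A toy sketch of the MVCC snapshot-read rule behind this table: each key keeps (commit_ts, value) versions, and a read at ts returns the newest committed version with commit_ts <= ts. Purely illustrative:

# Toy MVCC snapshot read matching the table above.
versions = {
    "payment_id=1": [(1000, {"amount": 999}), (1001, {"amount": 888})],
}

def snapshot_read(key, read_ts):
    visible = [value for ts, value in versions[key] if ts <= read_ts]
    return visible[-1] if visible else None

print(snapshot_read("payment_id=1", 1000))  # {'amount': 999} (old version)
print(snapshot_read("payment_id=1", 1002))  # {'amount': 888}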

Read path as a decision flow

flowchart TD
  %% example: point query payment_id=12345
  Q[Query\nSELECT payment_id=12345] --> MT2[Active MemTable\n0.1ms]
  MT2 -->|miss| FMT2[Frozen MemTables\n0.2ms]
  FMT2 -->|miss| RC2[Row cache\n0.5ms]
  RC2 -->|miss| BF2[Bloom filter\nwhich SSTable may have key]
  BF2 --> MINI2[Mini SSTables\nmaybe 10-50 files]
  BF2 --> MINOR2[Minor SSTables]
  BF2 --> MAJOR2[Major SSTable]
  MINI2 --> OUT2[Return row]
  MINOR2 --> OUT2
  MAJOR2 --> OUT2

Why incremental major compaction matters (example)

| Item | Number |
| --- | --- |
| Old major size | 100GB |
| Changes from minors | 10GB |
| Macroblock size | 2MB |
| Macroblocks rewritten | 1000 blocks, about 2GB |
| Macroblocks reused | 49,000 blocks |

With hundreds of petabytes of storage and tens of billions of SQL queries, optimizing every read and write operation is critical.


Transaction flow with Paxos replication

sequenceDiagram
  autonumber
  %% example: user_id=293847 transfer $100 to user_id=100523 at 00:00:00
  participant U as User (uid=293847)
  participant P as OBProxy
  participant L as Leader node 47 (p847)
  participant F1 as Follower 1 (Beijing DC2)
  participant F2 as Follower 2 (Hangzhou DC1)
  participant F3 as Follower 3 (Hangzhou DC2)
  participant F4 as Follower 4 (Shenzhen DC1)

  U->>P: SQL: UPDATE accounts SET balance=balance-100 WHERE user_id=293847
  P->>L: route by hash(293847)%1000=847
  L->>L: exec + write redo log (ts=1234567890)
  par replicate redo log
    L->>F1: log chunk (ts=1234567890)
    L->>F2: log chunk (ts=1234567890)
    L->>F3: log chunk (ts=1234567890)
    L->>F4: log chunk (ts=1234567890)
  end
  F1-->>L: ACK (2ms)
  F2-->>L: ACK (5ms) (quorum ok)
  L-->>P: commit ok (lat~8ms)
  P-->>U: success
Example timeline:
0ms  user send SQL
1ms  leader write redo log locally
2ms  follower1 ack
5ms  follower2 ack -> quorum 3/5 reached
8ms  proxy return to user

This is the magic that keeps your data from ever being lost, even when servers die.

Detailed example: user A transfers $100 to user B

Step 1: request goes to leader

User A -> OBProxy -> leader replica (partition 847, node 47)
SQL: UPDATE accounts SET balance = balance - 100 WHERE user_id = A

Step 2: leader processing

| Step | What the leader does | Example detail |
| --- | --- | --- |
| 1 | begin txn, assign timestamp | ts=1234567890 |
| 2 | check constraints | enough balance? conflicts? |
| 3 | write to MemTable | user_id=A, balance=900, version=1234567890 |
| 4 | append redo log | UPDATE accounts SET balance=900 WHERE user_id=A |
flowchart TB
  %% example: leader internal pipeline for 1 SQL
  IN[SQL request\nUPDATE accounts...] --> TS[assign ts\n1234567890]
  TS --> CC[constraint check\nbalance>=100]
  CC --> LK[lock row\nuser_id=A]
  LK --> MT[write MemTable\nbalance=900]
  MT --> RL[append redo log\ndurable]
  RL --> OUT[ready to replicate]

Step 3: Paxos replication. The leader sends the redo log to 4 followers in parallel:

Leader (Beijing DC1)
  -> Follower 1 (Beijing DC2): latency 2ms
  -> Follower 2 (Hangzhou DC1): latency 5ms
  -> Follower 3 (Hangzhou DC2): latency 5ms
  -> Follower 4 (Shenzhen DC1): latency 8ms

Each follower:

  1. receive redo log
  2. write to local disk (durable)
  3. apply to local MemTable
  4. send ACK back to leader

Step 4: majority ACK and commit. The leader waits for a majority. With 5 replicas it needs 3 durable copies, and the leader's own local log write counts as one of them:

Time 0ms: Leader send logs
Time 2ms: Follower 1 ACK
Time 5ms: Follower 2 ACK -> majority reached (3/5, counting the leader's local copy)
Time 5ms: Follower 3 ACK
Time 8ms: Follower 4 ACK (late but ok)

As soon as the quorum is reached (at 5ms), the leader:

  1. marks the transaction as COMMITTED
  2. releases locks
  3. returns success to the user
  4. the user sees "payment success"
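
A tiny sketch of this commit rule, assuming the leader's own durable log write counts toward the quorum (the names and numbers simply mirror the timeline above):

# Commit as soon as a majority of the 5 copies is durable.
REPLICAS = 5
QUORUM = REPLICAS // 2 + 1   # 3 of 5

def commit_time(ack_events):
    """ack_events: (ms, who) in arrival order, leader's local write first."""
    acks = 0
    for ms, who in ack_events:
        acks += 1
        if acks >= QUORUM:
            return ms  # the txn can be reported committed at this point

acks = [(1, "leader local log"), (2, "follower1"), (5, "follower2"),
        (5, "follower3"), (8, "follower4")]
print(commit_time(acks))  # -> 5 (ms), matching the timeline above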

Step 5: background async flush. MemTable data is flushed into SSTables later (without blocking the user):

Time 10 seconds later: Frozen MemTable dump into Mini SSTable
Time 1 hour later: Mini SSTables merge into Minor SSTable
Time 3am next day: Minor SSTables merge into Major SSTable

Why does Paxos matter?

Failure scenarios in 2 diagrams

sequenceDiagram
  autonumber
  %% scenario 1: leader dies after sending logs, before returning to user
  participant U as User
  participant L as Leader (Beijing DC1)
  participant F2 as Follower (Hangzhou DC1)
  participant F3 as Follower (Hangzhou DC2)
  participant RS as RootService

  U->>L: write request ts=1234567890
  L->>L: append redo log local
  par replicate
    L->>F2: redo log ts=1234567890
    L->>F3: redo log ts=1234567890
  end
  Note over L: crash at t=3ms
  F2-->>RS: detect leader timeout
  RS->>RS: start election
  RS-->>F2: promote to new leader
  F2-->>U: user retry ok, data preserved
flowchart TB
  %% scenario 2: whole zone 1 Beijing down, quorum still ok
  subgraph Z1[Zone 1 Beijing]
    BJ1[DC1 down]
    BJ2[DC2 down]
  end
  subgraph Z2[Zone 2 Hangzhou]
    HZ1[DC1 follower]
    HZ2[DC2 follower]
  end
  subgraph Z3[Zone 3 Shenzhen]
    SZ1[DC1 follower]
  end

  Q[Quorum 3-of-5\nneed any 3 replicas alive] --> HZ1
  Q --> HZ2
  Q --> SZ1

| Metric | Meaning | Value in this example |
| --- | --- | --- |
| RPO | how much committed data can be lost | 0 |
| RTO | time to recover service | less than 30s |

Partitioning and Partition Groups

flowchart LR
  %% example: 2 users and 1 transfer
  U1[User A\nuser_id=293847\norder_id=O-7731] -->|INSERT order + pay| P
  P -->|hash 293847 % 1000 = 847| N47[OBServer node 47\nPartition 847 leader]

  subgraph N47
    O[orders_p847\nINSERT O-7731]
    Pay[payments_p847\nINSERT P-9911]
    Bal[account_balance_p847\nUPDATE -100]
  end

  N47 -->|local txn: begin/commit| O
  N47 -->|local txn: same node| Pay
  N47 -->|local txn: same node| Bal

  P -->|hash 100523 % 1000 = 523| N23[OBServer node 23\nPartition 523 leader]

  subgraph N23
    Bal2[account_balance_p523\nUPDATE +100]
  end
ASCII quick map:

user_id=293847 -> p847 -> node47
  orders_p847 + payments_p847 + balance_p847  (local)

user_id=100523 -> p523 -> node23
  balance_p523 (other node)

Local txn: update 3 tables on node47, no network.
Cross-partition transfer: node47 + node23 need 2PC.

This is the secret sauce for avoiding distributed transactions.

Problem with traditional distributed databases:

Traditional 2PC cost as a table

| Step | What happens | Example cost |
| --- | --- | --- |
| Prepare phase | coordinator asks 3 servers, waits for ACKs | 5ms x 3 = 15ms |
| Commit phase | coordinator sends COMMIT, waits for ACKs | 5ms x 3 = 15ms |
| Total | network-only overhead | 30ms+ |
sequenceDiagram
  autonumber
  %% example: 3 tables on 3 servers
  participant C as Coordinator
  participant S1 as Server1 orders
  participant S2 as Server2 payments
  participant S3 as Server3 balance

  C->>S1: PREPARE
  C->>S2: PREPARE
  C->>S3: PREPARE
  S1-->>C: YES 5ms
  S2-->>C: YES 5ms
  S3-->>C: YES 5ms
  C->>C: decide COMMIT
  C->>S1: COMMIT
  C->>S2: COMMIT
  C->>S3: COMMIT
  S1-->>C: ACK 5ms
  S2-->>C: ACK 5ms
  S3-->>C: ACK 5ms

OceanBase solution: partition groups

Key insight: if user A buys an item, all data related to user A should live in the same place.

Hash partitioning by user_id:

All 3 tables use the same partition function:

orders table: partitioned by user_id
  Partition 847 has all orders of users hashed into 847

payments table: partitioned by user_id
  Partition 847 has all payments of users hashed into 847

account_balance table: partitioned by user_id
  Partition 847 has all balances of users hashed into 847

Colocation magic: OceanBase colocates partitions with the same partition number onto the same node:

OBServer node 47:
  - orders_p847
  - payments_p847
  - account_balance_p847

OBServer node 23:
  - orders_p523
  - payments_p523
  - account_balance_p523

Then the transaction becomes LOCAL. User 293847 buys an item:

  1. request routed to node 47 (because partition 847 is there)
  2. begin transaction
  3. INSERT into orders_p847 - local, no network
  4. INSERT into payments_p847 - local, no network
  5. UPDATE account_balance_p847 - local, no network
  6. commit
  7. total latency: ~2ms (mostly memory operations)

No network overhead. No 2PC. Just a local transaction inside one node.
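
A toy illustration of why colocation makes this local: all three tables share the same partition function, so one user's rows hash to the same partition and therefore the same node. The partition-to-node map below is invented; in reality RootService maintains it:

# Same partition function for all three tables -> same node -> local txn.
PARTITIONS = 1000
partition_to_node = {847: 47, 523: 23}  # toy routing data

def partition_of(user_id: int) -> int:
    return user_id % PARTITIONS

def nodes_touched(user_id, tables=("orders", "payments", "account_balance")):
    node = partition_to_node[partition_of(user_id)]
    return {table: node for table in tables}

print(nodes_touched(293847))
# {'orders': 47, 'payments': 47, 'account_balance': 47} -> single-node transaction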

Ratio of local vs distributed transactions:

  • 95%+ of transactions are local (same user, same partition)
  • only ~5% are distributed (example: user A transfers money to user B in a different partition)

Cross-partition transaction (distributed 2PC):

Example: user A (partition 847) transfers $100 to user B (partition 523):

Coordinator (node 47):
1. get global timestamp from GTS: 1234567890
2. phase 1 - prepare:
   - tell node 47: prepare decrease A's balance
   - tell node 23: prepare increase B's balance

Node 47:
   - lock user A row
   - write prepare log
   - check A has enough $100?
   - vote: YES

Node 23:
   - lock user B row
   - write prepare log
   - vote: YES

Coordinator:
   - received 2 YES votes
   - decision: COMMIT
   - write commit log

3. phase 2 - commit:
   - tell node 47: COMMIT
   - tell node 23: COMMIT

Node 47:
   - apply changes
   - release locks
   - ACK

Node 23:
   - apply changes
   - release locks
   - ACK

Coordinator:
   - return success to user

Total latency: ~10-15ms (network overhead)

That is still acceptable for 5% of transactions. But the other 95% running locally in ~2ms is why we can reach 544k TPS.
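
A runnable toy sketch of the coordinator logic in the walkthrough above; locks, prepare/commit logs, timeouts and crash recovery are all omitted, and every name here is invented for illustration:

# Toy 2PC: prepare on both participants, commit only if every vote is YES.
class Participant:
    def __init__(self, name, balance):
        self.name, self.balance, self.pending = name, balance, 0

    def prepare(self, delta):
        # the real thing would lock the row and write a prepare log here
        if self.balance + delta < 0:
            return "NO"
        self.pending = delta
        return "YES"

    def commit(self):
        self.balance += self.pending
        self.pending = 0

    def abort(self):
        self.pending = 0

def transfer(coordinator_log, src, dst, amount, txn_id):
    votes = [src.prepare(-amount), dst.prepare(+amount)]          # phase 1
    decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
    coordinator_log.append((decision, txn_id))                    # durable decision
    for p in (src, dst):                                          # phase 2
        p.commit() if decision == "COMMIT" else p.abort()
    return decision

log = []
a = Participant("node47/p847: user A", 900)
b = Participant("node23/p523: user B", 0)
print(transfer(log, a, b, 100, "txn_19283746"), a.balance, b.balance)
# COMMIT 800 100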

Dynamic repartitioning:

If a user becomes super hot (example: a celebrity account with millions of followers buying merch):

user_id = 999888 -> partition 888 becomes HOT
Node hosting partition 888: 95% CPU, overloaded

RootService detects it:
1. split partition 888 into 2 sub-partitions:
   - partition 888a: users hash into [888.0 - 888.5]
   - partition 888b: users hash into [888.5 - 888.9]

2. migrate partition 888b to other node
3. now load spread across 2 nodes
4. update routing table in OBProxy

Automatic, zero downtime, zero data loss.


Dynamic Load Balancing in Peak Traffic

sequenceDiagram
  autonumber
  %% example: hot partition 123 at 00:00 peak
  participant N47 as Node47 (p123 leader)
  participant RS as RootService
  participant PX as OBProxy
  participant N23 as Node23 (cold)

  loop every 1s
    N47-->>RS: hb cpu=92 mem=85 io=2.1GB/s\np123 qps=150k cpu=95
    N23-->>RS: hb cpu=25 mem=40 io=0.2GB/s\np523 qps=1k cpu=2
  end

  RS->>RS: detect p123 hot (qps spike)
  RS-->>N47: action#1 increase quota (cpu 20%->40%, mem 4G->8G)

  alt still hot after 10s
    RS-->>N23: action#2 leader move plan\nprepare follower promote
    RS-->>N47: demote leader p123 -> follower
    RS-->>N23: promote follower p123 -> leader
  else too hot for 1 leader
    RS-->>N47: action#3 split p123 by merchant_id=88888
    RS-->>N23: create p123b (merchant=88888) + migrate
    RS-->>PX: update route: merchant=88888 -> p123b@node23
  end

  RS-->>PX: refresh routing meta
ASCII example of split rule:

Before:
  p123: merchant_id=all merchants

After split:
  p123a: merchant_id != 88888  (20% traffic)
  p123b: merchant_id == 88888  (80% traffic, isolated)

Route change in proxy:
  if merchant_id==88888 -> node23/p123b
  else -> node47/p123a

Singles' Day is not uniform. Some items are crazy hot (iPhone, limited sneakers), some items are dead cold (toilet paper).

Hot partition problem:

00:00:00 - iPhone 11 flash sale starts
100,000 users click buy at same time
All traffic routes to partition that has merchant_id of Apple Store
-> that partition: 90% CPU, 2GB/s disk I/O, cache thrashing
-> other partitions: 10% CPU, idle

RootService monitoring:

Diagram: RootService heartbeat monitoring

sequenceDiagram
  autonumber
  %% example: heartbeat JSON every 1s
  participant N47 as Node47
  participant RS as RootService

  loop every 1s
    N47-->>RS: hb node=47 cpu=92 mem=85 io=2.1GB/s\npart 847 qps=50k cpu=45\npart 123 qps=150k cpu=95\npart 456 qps=1k cpu=2
    RS->>RS: score hotspot + decide action
  end

Heartbeat fields (example):

| Field | Meaning | Example |
| --- | --- | --- |
| node_id | node identity | 47 |
| timestamp | Unix seconds | 1573430401 |
| cpu_usage | node CPU | 92% |
| memory_usage | node memory | 85% |
| disk_io | disk bandwidth | 2.1 GB/s |
| partitions[] | per-partition stats | id=123 qps=150k cpu=95 |
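
A toy sketch of such a heartbeat payload plus a naive hotspot check; the field names and thresholds are illustrative, not the real protocol:

# Hypothetical heartbeat payload and a simple hot-partition filter.
heartbeat = {
    "node_id": 47,
    "timestamp": 1573430401,
    "cpu_usage": 92,          # percent
    "memory_usage": 85,       # percent
    "disk_io_gbps": 2.1,
    "partitions": [
        {"id": 847, "qps": 50_000,  "cpu": 45},
        {"id": 123, "qps": 150_000, "cpu": 95},
        {"id": 456, "qps": 1_000,   "cpu": 2},
    ],
}

def hot_partitions(hb, cpu_threshold=80, qps_threshold=100_000):
    return [p["id"] for p in hb["partitions"]
            if p["cpu"] >= cpu_threshold or p["qps"] >= qps_threshold]

print(hot_partitions(heartbeat))  # [123] -> candidate for quota bump / leader move / split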

RootService analyzes the heartbeats and decides on actions:

Action 1: increase resource quota

| Signal | Before | After |
| --- | --- | --- |
| CPU quota | 20% | 40% |
| Memory quota | 4GB | 8GB |
| Disk I/O priority | normal | boosted |

Diagram: Hot partition quota scale-up

flowchart TB
  %% example: quota scale up for hot partition
  HS[Hot partition p123\nqps 150k cpu 95] --> RS2[RootService decision]
  RS2 --> QUP[Increase quota\ncpu 20 to 40\nmem 4G to 8G\nio priority up]
  QUP --> STB[Stabilize\nreduce tail latency]

Action 2: rebalance leaders

Node 47 has 3 leader partitions, all hot
Node 23 has 3 leader partitions, all cold

Action:
-> move 1 leader partition from node 47 to node 23
-> follower on node 23 promote to leader
-> old leader on node 47 demotes to follower
-> load spread out

Action 3: split hot partition

Partition 123 too hot, even with increased quota
RootService: split it

Step 1: analyze partition data
  - 80% traffic goes to merchant_id = 88888 (Apple Store)
  - 20% traffic goes to other merchants

Step 2: create new partition 123b
  - partition 123a: merchant_id != 88888
  - partition 123b: merchant_id == 88888 (Apple Store only)

Step 3: migrate partition 123b to new node
  - copy data
  - sync redo logs
  - switch traffic
  - update routing table

Step 4: result
  - partition 123a: 20% traffic, runs ok
  - partition 123b: 80% traffic, but dedicated resources, runs ok

Action 4: temporary read replicas

Partition 123b still hot (read-heavy queries: users viewing product details)

Solution:
-> create 2 temporary read replicas
-> load balance read queries across 3 replicas
-> writes still go to 1 leader (consistency)
-> reads distributed: ~33% per replica

Real example from Singles' Day 2019:

23:50 - pre-peak preparation
  RootService: rebalance leaders to distribute evenly

00:00 - peak starts
  - 50 partitions spike to 95%+ CPU in 30 seconds
  - RootService auto increase quotas
  - 10 partitions need splitting
  - 15 partitions need leader migrate

00:05 - peak sustained
  - all actions completed
  - system stable at 544k TPS
  - no engineers on-call
  - no manual intervention

01:00 - peak declining
  - traffic drops to 200k TPS
  - RootService scale down resources
  - return quotas to normal

03:00 - major compaction
  - traffic down to 20k TPS
  - safe to run major compaction
  - reclaim disk space

Why were no engineers on-call in 2020?

Before 2019: engineers had to monitor dashboards manually and decide on actions themselves.

By 2020, RootService had learned enough patterns:

  • knows which partitions will be hot (based on history)
  • pre-allocate resources before peak
  • auto-scale during peak
  • auto-rebalance after peak
  • alert engineers only when real anomalies happen (99.9%+ CPU, disk full, network down)

Engineers just sit watching the dashboards and drinking coffee.



HTAP - Handle OLTP and OLAP together

Diagram: HTAP dual execution paths

flowchart TB
  %% example: 2 queries hit different path in same DB
  Q1[Q1: SELECT balance\nuid=293847] --> OLTP[OLTP fast path]
  Q2[Q2: SUM amount\nlast 1 hour] --> OLAP[OLAP scan path]

  OLTP --> MT[MemTable\nhot rows last 10 min]
  MT --> TXN[Txn engine\nMVCC + locks]

  OLAP --> SS[SSTable\ncolumn blocks]
  SS --> VE[Vector exec\nSIMD batch 1024 rows]

  MT <-->|flush + compaction| SS

  OLTP --> OUT1[result about 0.1ms]
  OLAP --> OUT2[result about 50ms]
Example routing:
- Q1 is a point lookup -> goes to the MemTable / hash index first
- Q2 is a scan + aggregate -> goes to SSTable column blocks + vector execution
- same data, no ETL, just 2 execution paths

This is one of the most brilliant parts of OceanBase. Traditional databases usually split into two types:

OLTP (Online Transaction Processing):

  • Fast point queries: "get balance of user A"
  • Fast writes: "insert payment record"
  • Row-oriented storage
  • Examples: MySQL, PostgreSQL, Oracle

OLAP (Online Analytical Processing):

  • Slower aggregate queries: "total sales by region"
  • Scan millions of rows
  • Column-oriented storage
  • Examples: ClickHouse, Snowflake, BigQuery

The problem with the traditional approach:

OLTP DB (MySQL)
  ↓
ETL job (runs every hour)
  ↓
OLAP warehouse (Snowflake)
  ↓
Dashboard (1 hour delay)

Singles' Day: merchants want real-time dashboards. "How much did I sell in the last 10 minutes?" You can't wait an hour.

OceanBase HTAP solution: 1 database, 2 execution paths

The storage engine supports both:

  1. MemTable (row-oriented)
  • Active writes go here
  • Hash + B-tree indexes
  • Perfect for OLTP point queries
  2. SSTable (column-oriented)
  • Flushed data stored here
  • Column encoding + compression
  • Perfect for OLAP scans

Concrete example:

OLTP query (point query):

SELECT * FROM payments WHERE payment_id = 12345;

Execution plan:

1. Check MemTable hash index -> O(1) lookup
2. Find row in 0.1ms
3. Return result

Data layout in MemTable (row-oriented):
Row: [payment_id=12345, user_id=999, amount=100, timestamp=1234567890, status=success]

OLAP query (aggregate scan):

SELECT
  region,
  SUM(amount) as total_sales,
  COUNT(*) as order_count
FROM payments
WHERE timestamp > '2019-11-11 00:00:00'
GROUP BY region;

Execution plan:

1. Scan SSTable (column-oriented)
2. Only read needed columns: timestamp, region, amount
3. Vector execution: process 1000 rows at once
4. Use SIMD instructions for SUM/COUNT
5. Return aggregated result

Data layout in SSTable (column-oriented):
timestamp column: [1234567890, 1234567891, 1234567892, ...]
region column:    ['Beijing', 'Shanghai', 'Beijing', ...]
amount column:    [100, 200, 150, ...]

Only read these 3 columns, skip other 10 columns
Compressed: 1000 rows = ~4KB (vs 50KB row-oriented)

Vector execution engine:

Traditional execution: 1 row at a time

total = 0
for row in rows:
    if row.region == 'Beijing':
        total += row.amount  # 1M iterations

Vector execution: 1000 rows at once

# NumPy stands in for the SIMD engine here: one vectorized operation
# covers a whole chunk of rows instead of one row per loop iteration.
import numpy as np

# Load a chunk of 1000 rows into contiguous arrays (fits in CPU cache)
chunk_region = np.array(['Beijing', 'Shanghai', 'Beijing', 'Shanghai'] * 250)
chunk_amount = np.array([100, 200, 150, 300] * 250)

# Vectorized filter (the hardware executes this with SIMD instructions)
mask = (chunk_region == 'Beijing')

# Vectorized sum over the selected rows
total = int(chunk_amount[mask].sum())  # 62500 for this toy chunk

# Process 1000 rows in ~the same time as 1 row
# Total: 1M/1000 = 1000 chunk iterations (roughly 1000x fewer iterations)

Real-time analytics during Singles' Day:

Use case 1: fraud detection (real-time OLAP)

-- runs every 10 seconds
SELECT user_id, COUNT(*) as payment_count
FROM payments
WHERE timestamp > NOW() - INTERVAL '60 seconds'
  AND amount > 10000
GROUP BY user_id
HAVING COUNT(*) > 20;
-- Find users making more than 20 payments over $10,000 each in the last minute

Execution:

  • Scan SSTable recent data (~10M rows)
  • Vector execution: 10 seconds
  • Result: 15 suspicious accounts
  • Trigger alert to risk team

Use case 2: merchant dashboard (real-time OLTP + OLAP)

-- dashboard updates every 5 seconds
SELECT
  COUNT(*) as total_orders,
  SUM(amount) as total_sales,
  AVG(amount) as avg_order_value,
  MAX(amount) as largest_order
FROM orders
WHERE merchant_id = 88888
  AND timestamp > NOW() - INTERVAL '1 hour';

Execution:

  • Recent data (last 10 min) from MemTable: OLTP fast path
  • Older data (10-60 min ago) from SSTable: OLAP scan
  • Combined result in 50ms
  • Merchant sees real-time dashboard

Use case 3: recommendation engine (complex OLAP)

-- Update recommendations every 30 seconds
SELECT
  p1.user_id,
  p2.product_id,
  COUNT(*) as co_purchase_count
FROM purchases p1
JOIN purchases p2 ON p1.order_id = p2.order_id
WHERE p1.product_id = 12345  -- iPhone
  AND p2.product_id != 12345
  AND p1.timestamp > NOW() - INTERVAL '10 minutes'
GROUP BY p1.user_id, p2.product_id
ORDER BY co_purchase_count DESC
LIMIT 10;
-- Users buying iPhone, what else did they buy?

Execution:

  • Join 2 large tables
  • Scan millions of rows
  • Vector execution + hash join
  • Result in 2 seconds
  • Update recommendation model

Simultaneously running:

00:00:00 - Singles' Day starts
Load:
  - 544k OLTP TPS (payments, orders)
  - 50 OLAP queries/second (fraud, dashboards, recommendations)
  - Total: 544k + 50 queries/second
  - Same database, same data, zero lag

CPU allocation:
  - 90% for OLTP (priority)
  - 10% for OLAP (background)

Memory allocation:
  - MemTable: 60% (hot OLTP data)
  - Block Cache: 30% (recent OLAP scans)
  - Row Cache: 10% (frequently accessed rows)

No ETL tax. With the traditional approach:

  • Storage cost: OLTP DB + OLAP warehouse = 2x
  • Network cost: ETL transfer petabytes/day
  • Lag: 1 hour minimum
  • Complexity: maintain 2 systems

OceanBase:

  • Storage cost: 1x (same data)
  • Network cost: 0 (no transfer)
  • Lag: 0 (real-time)
  • Complexity: 1 system

Distributed Transaction Coordination (2PC)

sequenceDiagram
  autonumber
  %% example: transfer $100 from uid=293847(p847@node47) -> uid=100523(p523@node23)
  participant U as User
  participant C as Coordinator (node47)
  participant P1 as Participant node47 (p847)
  participant P2 as Participant node23 (p523)

  U->>C: begin txn txn_19283746 ts=1234567890
  C->>P1: PREPARE txn_19283746\nUPDATE A balance -100
  C->>P2: PREPARE txn_19283746\nUPDATE B balance +100
  P1-->>C: VOTE YES (lock A, write prepare log)
  P2-->>C: VOTE YES (lock B, write prepare log)

  alt all YES
    C->>C: write COMMIT log (durable)
    C->>P1: COMMIT txn_19283746
    C->>P2: COMMIT txn_19283746
    P1-->>C: ACK (apply + release lock)
    P2-->>C: ACK (apply + release lock)
    C-->>U: SUCCESS (lat ~10ms)
  else any NO
    C->>C: write ABORT log
    C->>P1: ABORT txn_19283746
    C->>P2: ABORT txn_19283746
    P1-->>C: ACK (release lock)
    P2-->>C: ACK (release lock)
    C-->>U: FAIL
  end
Example timeline (same datacenter):
0ms   coordinator send prepare
2ms   node47 prepared + vote YES
6ms   node23 prepared + vote YES
7ms   coordinator write commit log
8ms   send commit to both
10ms  both ack, user sees success
  • 2PC is also used for cross-table transactions that don't share the same partition key (rare)

Two-Phase Commit (2PC)

sequenceDiagram
  autonumber
  %% Normal 2PC: prepare + commit
  participant U as Client
  participant C as Coordinator (Node47)
  participant P1 as Participant (Node47 p847)
  participant P2 as Participant (Node23 p523)

  U->>C: begin txn txn_19283746\nGTS ts=1234567890
  par PREPARE
    C->>P1: PREPARE txn_19283746\nUPDATE A -100
    C->>P2: PREPARE txn_19283746\nUPDATE B +100
  end
  P1->>P1: check + lock + write PREPARE log
  P2->>P2: check + lock + write PREPARE log
  P1-->>C: VOTE YES
  P2-->>C: VOTE YES
  C->>C: decide COMMIT\nwrite COMMIT log
  par COMMIT
    C->>P1: COMMIT txn_19283746
    C->>P2: COMMIT txn_19283746
  end
  P1->>P1: apply + write COMMIT log + unlock
  P2->>P2: apply + write COMMIT log + unlock
  P1-->>C: ACK
  P2-->>C: ACK
  C-->>U: success
sequenceDiagram
  autonumber
  %% Normal 2PC: abort path
  participant U as Client
  participant C as Coordinator
  participant P1 as Participant Node47
  participant P2 as Participant Node23

  U->>C: begin txn txn_19283746
  C->>P1: PREPARE
  C->>P2: PREPARE
  P1-->>C: YES
  P2-->>C: NO (constraint failed)
  C->>C: decide ABORT\nwrite ABORT log
  C->>P1: ABORT
  C->>P2: ABORT
  P1->>P1: rollback + unlock
  P2->>P2: unlock if needed
  P1-->>C: ACK
  P2-->>C: ACK
  C-->>U: failure
sequenceDiagram
  autonumber
  %% Scenario A: participant crash during PREPARE
  participant C as Coordinator
  participant P1 as Node47 participant
  participant P2 as Node23 participant

  C->>P1: PREPARE txn_19283746
  C->>P2: PREPARE txn_19283746
  P1-->>C: YES (prepared)
  Note over P2: crash before vote
  C->>C: timeout => decide ABORT
  C->>P1: ABORT
  P1->>P1: unlock
  Note over P2: restart\nreplay log -> found PREPARE-only\nask coordinator -> ABORT\ncleanup
sequenceDiagram
  autonumber
  %% Scenario B: coordinator crash after all YES, before COMMIT
  participant U as Client
  participant C as Coordinator Node47
  participant P1 as Node47 participant
  participant P2 as Node23 participant
  participant NC as New coordinator Node23

  U->>C: begin txn txn_19283746
  C->>P1: PREPARE
  C->>P2: PREPARE
  P1-->>C: YES
  P2-->>C: YES
  Note over C: crash before sending COMMIT
  Note over P1,P2: locks held, waiting decision
  NC->>NC: timeout detected, take over
  NC->>P1: query state
  NC->>P2: query state
  P1-->>NC: prepared YES
  P2-->>NC: prepared YES
  NC->>NC: decide COMMIT
  NC->>P1: COMMIT
  NC->>P2: COMMIT
  P1-->>NC: ACK
  P2-->>NC: ACK

Scenario C: Network partition during COMMIT

sequenceDiagram
  autonumber
  participant C as Coordinator
  participant N47 as Node47
  participant N23 as Node23

  C->>N47: COMMIT txn_19283746
  C-x N23: COMMIT txn_19283746 (dropped by partition)
  N47-->>C: ACK (apply + release locks)
  Note over N23: still locked, waiting for decision
  loop retry until network heals
    C->>N23: COMMIT txn_19283746
  end
  N23-->>C: ACK (apply + release locks)
stateDiagram-v2
  %% participant local state machine
  [*] --> Init
  Init --> Prepared: PREPARE log durable + lock acquired
  Prepared --> Committed: COMMIT log durable + apply + unlock
  Prepared --> Aborted: ABORT or timeout cleanup + unlock
  Committed --> [*]
  Aborted --> [*]

High Availability and Failover

Diagram: Partition replica layout and quorum

flowchart TB
  %% example: partition 847 replicas across 3 zones
  subgraph Z1[Zone 1 - Beijing]
    L[Server47\nLeader p847]
    Fz1[Server102\nFollower p847]
  end
  subgraph Z2[Zone 2 - Hangzhou]
    Fz2a[Server203\nFollower p847]
    Fz2b[Server247\nFollower p847]
  end
  subgraph Z3[Zone 3 - Shenzhen]
    Fz3[Server389\nFollower p847]
  end

  L -->|redo log ts=1234567890| Fz1
  L -->|redo log ts=1234567890| Fz2a
  L -->|redo log ts=1234567890| Fz2b
  L -->|redo log ts=1234567890| Fz3

  subgraph Q[Quorum rule 3-of-5]
    Q1[need any 3 ACK to commit]
  end

  L -. commit ok when 3 ACK .-> Q1
Example commit math:
- the leader writes the log locally
- waits for ACKs from any 2 followers
- 3 of 5 copies durable -> commit ok

Example failure:
- lose Server47 (leader) + Server102 (follower)
- 3 of 5 replicas still alive -> the system keeps working

5 replicas, 3 zones, RPO=0, RTO<30s.
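
The quorum arithmetic behind the replica-count trade-off shown in the next diagram, as a quick sketch:

# quorum = floor(n/2) + 1, failures tolerated = n - quorum
for n in (3, 5, 7):
    quorum = n // 2 + 1
    tolerated = n - quorum
    print(f"{n} replicas: quorum={quorum}, tolerates {tolerated} failure(s)")
# 3 replicas: quorum=2, tolerates 1 failure(s)
# 5 replicas: quorum=3, tolerates 2 failure(s)
# 7 replicas: quorum=4, tolerates 3 failure(s)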

Diagram: Replica count vs quorum trade-off

flowchart TB
  %% replica placement + quorum comparison
  subgraph R3[3 replicas\nquorum=2\ntolerate 1 failure]
    r3a((A))
    r3b((B))
    r3c((C))
  end
  subgraph R5[5 replicas\nquorum=3\ntolerate 2 failures]
    r5a((A))
    r5b((B))
    r5c((C))
    r5d((D))
    r5e((E))
  end
  subgraph R7[7 replicas\nquorum=4\ntolerate 3 failures\nextra overhead]
    r7a((A))
    r7b((B))
    r7c((C))
    r7d((D))
    r7e((E))
    r7f((F))
    r7g((G))
  end

  R3 --> Why3[Risk: lose 1 DC may remove 2 replicas\n=> system down]
  R5 --> Why5[Sweet spot: survive 1 DC loss\nlose 2 replicas\nwith acceptable cost]
  R7 --> Why7[Higher safety but more replication cost]
sequenceDiagram
  autonumber
  %% Failure scenario 1: single leader server failure -> election -> route update
  participant L as Server47 (leader p847)
  participant F1 as Server102 (follower)
  participant F2 as Server203 (follower)
  participant F3 as Server247 (follower)
  participant F4 as Server389 (follower)
  participant RS as RootService
  participant PX as OBProxy

  Note over L: normal traffic, replicate logs
  L-->>F1: redo log
  L-->>F2: redo log
  L-->>F3: redo log
  L-->>F4: redo log

  Note over L: crash
  F2-->>RS: leader timeout detected
  RS->>RS: start Paxos election
  RS-->>F2: promote to new leader (new term)
  RS-->>PX: update route p847 -> Server203
  PX-->>F2: new requests

Diagram: Replica placement across zones

flowchart TB
  %% replica placement across zones (example p847)
  subgraph Z1[Zone 1 Beijing]
    BJ1[(DC1 leader)]
    BJ2[(DC2 follower)]
  end
  subgraph Z2[Zone 2 Hangzhou]
    HZ1[(DC1 follower)]
    HZ2[(DC2 follower)]
  end
  subgraph Z3[Zone 3 Shenzhen]
    SZ1[(DC1 follower)]
  end

  BJ1 --> BJ2
  BJ1 --> HZ1
  BJ1 --> HZ2
  BJ1 --> SZ1

  Q[Commit when 3-of-5 replicas have log] -.-> BJ1

Detailed example of data preservation:

Before failure (timestamp 1234567890):
User A submitted payment, transaction committed
  Server 47 (Leader): balance = $900 ✓ committed
  Server 102: balance = $900 ✓ replicated
  Server 203: balance = $900 ✓ replicated
  Server 247: balance = $900 ✓ replicated
  Server 389: balance = $900 ✓ replicated

During failure (timestamp 1234567891):
User B submitted payment, Leader processing
  Server 47 (Leader): balance = $800 ← processing, NOT YET COMMITTED
  Server 47 DIES before replication

After failover:
New Leader (Server 203) state:
  - Has User A transaction: balance = $900 ✓ PRESERVED
  - Doesn't have User B transaction: balance = $900 (not $800)
  
User B sees: "Transaction failed, please retry"
User B retries → success
Final balance: $800

Key point: 
- Committed transactions NEVER lost (RPO = 0)
- Uncommitted transactions may be lost (acceptable)
- Users just retry failed requests
sequenceDiagram
  autonumber
  %% Failure scenario 2: entire datacenter down -> mass elections + route update
  participant BJ1 as Beijing DC1 (down)
  participant RS as RootService
  participant HZ as Hangzhou zone
  participant SZ as Shenzhen zone
  participant PX as OBProxy fleet

  Note over BJ1: power outage, thousands of servers drop
  HZ-->>RS: heartbeat timeouts for BJ1 nodes
  SZ-->>RS: heartbeat timeouts for BJ1 nodes
  RS->>RS: start elections for many partitions in parallel
  RS-->>HZ: promote new leaders for affected partitions
  RS-->>SZ: promote new leaders for affected partitions
  RS-->>PX: push updated routing tables
  Note over PX: clients retry -> traffic shifts to HZ/SZ

Diagram: Datacenter outage - quorum still available

flowchart TB
  %% Entire zone 1 down but quorum still satisfied
  subgraph Z1[Zone 1 Beijing]
    DC1x[DC1 down]
    DC2x[DC2 down]
  end
  subgraph Z2[Zone 2 Hangzhou]
    HZ1[(replica)]
    HZ2[(replica)]
  end
  subgraph Z3[Zone 3 Shenzhen]
    SZ1[(replica)]
  end

  Q[3-of-5 quorum still available\n=> keep accepting writes] --> HZ1
  Q --> HZ2
  Q --> SZ1
sequenceDiagram
  autonumber
  %% Auto-heal: rebuild back to 5 replicas
  participant RS as RootService
  participant HZ1 as Hangzhou DC1
  participant SZ1 as Shenzhen DC1

  RS->>RS: detect only 3 replicas remain
  RS->>HZ1: create new replica r4 + backfill
  RS->>SZ1: create new replica r5 + backfill
  HZ1-->>RS: r4 caught up
  SZ1-->>RS: r5 caught up
  RS->>RS: restore 5 replicas
sequenceDiagram
  autonumber
  %% Failure scenario 3: network partition (split brain) prevented by quorum
  participant BJ as Beijing group (2 replicas)
  participant HZ as Hangzhou+Shenzhen group (3 replicas)

  Note over BJ,HZ: network partition between groups

  BJ->>BJ: try to commit write
  BJ-->>BJ: only 2-of-5 ACK possible => cannot reach quorum
  BJ-->>BJ: reject writes (read-only safe mode)

  HZ->>HZ: elect leader with 3-of-5
  HZ-->>HZ: accept writes (quorum satisfied)

  Note over BJ,HZ: network heals
  BJ->>HZ: rejoin, step down to follower, catch up logs

Diagram: Quorum rule - who can accept writes

flowchart TB
  %% quorum rule decides who can accept writes
  A[Group A\n2 replicas] -->|no quorum| RO[READ-ONLY\nreject writes]
  B[Group B\n3 replicas] -->|has quorum| RW[READ/WRITE\naccept writes]
  RO --> HEAL[heal network -> catch up]
  RW --> HEAL

Diagram: RTO component breakdown

flowchart TB
  %% RTO breakdown
  RTO[RTO components] --> DET[Detection\n1-2s]
  RTO --> ELE[Election\n3-10s]
  RTO --> ROUTE[Route update\n2-8s]
  RTO --> WARM[Warmup\n2-10s]

  DET --> S1[Single server\n8-15s]
  ELE --> S1
  ROUTE --> S1
  WARM --> S1

  DET --> DC[Datacenter\n20-30s]
  ELE --> DC
  ROUTE --> DC
  WARM --> DC
sequenceDiagram
  autonumber
  %% RPO=0: commit only after majority replication
  participant U as Client
  participant L as Leader
  participant F as Followers (quorum)

  U->>L: write request
  L->>L: append redo log locally
  par replicate
    L->>F: redo log
  end
  F-->>L: majority ACK
  L-->>U: success (committed)

  Note over L: leader crash before majority ACK => user sees failure, retry
sequenceDiagram
  autonumber
  %% Chaos test summary
  participant ENG as Engineers
  participant OPS as Ops tool
  participant CL as OceanBase cluster

  ENG->>OPS: kill 50 random nodes
  OPS->>CL: terminate nodes
  CL-->>CL: elections + leader moves
  CL-->>ENG: recovered in <=30s, RPO=0

  ENG->>OPS: disconnect Beijing DC1
  OPS->>CL: network isolation
  CL-->>CL: mass elections + reroute
  CL-->>ENG: recovered ~25s

  ENG->>OPS: add 500ms latency BJ<->HZ
  OPS->>CL: inject latency
  CL-->>ENG: commit slower, still available

Compression and Column Encoding

Diagram: Microblock compression and read path

flowchart TD
  %% example: microblock for payments_p847, columns: status/timestamp/amount
  Raw[Raw rows payments_p847] --> Col[Column layout\nstatus timestamp amount]
  Col --> Enc[Encoding step\nstatus dict, ts delta, amount delta]
  Enc --> Comp[Compression\nLZ4 for cold, Snappy for hot]
  Comp --> Store[Store microblock\nsize 4096B -> 287B]

  Store --> Read[Read query\nSELECT amount WHERE uid=12345]
  Read --> Decomp[Decompress 287B -> 4KB]
  Decomp --> Prune[Column prune\nkeep only amount]
  Prune --> Decode[Decode amount\ndelta decode]
  Decode --> Exec[Exec + return\nabout 0.16ms]
Example numbers in 1 block:
- before: 4KB microblock
- after encoding: 820B
- after LZ4: 287B
- read I/O: 4KB -> 287B (14x less)

At hundreds of petabytes of storage, compression is not a "nice to have"—it is a "must have". Without compression, we'd need 2–3× more servers.

Diagram: Compression pipeline (write and read)

flowchart TD
  %% Compression pipeline overview
  subgraph W[Write path]
    WR[Rows] --> MB[Build microblock]
    MB --> COL[Columnize]
    COL --> ENC[Encode\ndict / delta / rle / prefix]
    ENC --> CMP[Compress\nLZ4 or Snappy]
    CMP --> SST[Persist SSTable]
  end

  subgraph R[Read path]
    Q[Query] --> PRUNE[Prune columns]
    PRUNE --> IO[Read compressed microblock]
    IO --> DCMP[Decompress]
    DCMP --> DECODE[Decode selected columns]
    DECODE --> RES[Return rows]
  end

  SST -. serves .-> IO

Compression math (order-of-magnitude)

| Item | Formula | Example |
| --- | --- | --- |
| Raw writes | orders × avg row size | 1.29B × 5KB = 6.45TB |
| Replicas | raw × replica factor | 6.45TB × 5 = 32.25TB |
| With 80% compression | raw × 20% | 6.45TB × 20% = 1.29TB |

Diagram: Row store vs column store I/O

flowchart LR
  %% Row vs column layout and scan behavior
  subgraph ROW[Row store]
    R1[Row1: uid, amount, status, ts, merchant]
    R2[Row2: uid, amount, status, ts, merchant]
    R3[Row3: uid, amount, status, ts, merchant]
  end

  subgraph COLS[Column store]
    C1[uid: 1001,1002,1003,...]
    C2[amount: 100,200,150,...]
    C3[status: success,success,pending,...]
    C4[ts: 1234567890, ...]
    C5[merchant: 88888, ...]
  end

  Q1[Query: SELECT amount WHERE ts > X] -->|row store reads many columns| ROW
  Q1 -->|column store reads only ts+amount| COLS

  ROW --> W1[More I/O\nmixed types]
  COLS --> W2[Less I/O\nbetter compression]

Encoding Techniques — Overview (4 patterns)

Diagram: Column encoding technique selection

flowchart TB
  %% encoding decision cheatsheet
  A[Column data] --> DCT[Dictionary\nlow cardinality\nstatus, merchant]
  A --> DLT[Delta / Delta-of-delta\nmonotonic numbers\ntimestamp]
  A --> RLE[RLE\nlong runs\nflash sale product_id]
  A --> PFX[Prefix\ncommon prefixes\nproduct_name]

  DCT --> OUT[Smaller encoded stream]
  DLT --> OUT
  RLE --> OUT
  PFX --> OUT

| Technique | When it applies | Key idea | Typical gain |
| --- | --- | --- | --- |
| Dictionary | few distinct values | map values to small ids | 80–98% |
| Delta / delta-of-delta | ordered numbers | store differences | 85–98% |
| RLE | repeated values | store (value, count) pairs | 90–99%+ |
| Prefix | shared string prefixes | store the prefix once | 30–60% |
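
A toy Python sketch of three of these encodings (dictionary, delta, RLE), just to make the ideas concrete before the per-technique diagrams below; real microblock encoders are far more sophisticated:

# Toy column encoders: dictionary, delta, run-length.
def dictionary_encode(values):
    dictionary = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return dictionary, [dictionary[v] for v in values]

def delta_encode(numbers):
    return [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]

def rle_encode(values):
    runs, prev, count = [], values[0], 0
    for v in values:
        if v == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = v, 1
    runs.append((prev, count))
    return runs

print(dictionary_encode(["success", "success", "pending", "success", "failed"]))
# ({'success': 0, 'pending': 1, 'failed': 2}, [0, 0, 1, 0, 2])
print(delta_encode([1234567890, 1234567891, 1234567892, 1234567893]))
# [1234567890, 1, 1, 1]
print(rle_encode([12345] * 6))
# [(12345, 6)]
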
sequenceDiagram
  autonumber
  %% dictionary example
  participant W as Writer
  participant MB as Microblock

  W->>MB: build dictionary {success:0,pending:1,failed:2,cancelled:3}
  W->>MB: encode status column as ids [0,0,1,0,2,...]
  MB-->>W: output bit-packed ids (2 bits/value)
sequenceDiagram
  autonumber
  %% delta-of-delta example
  participant W as Writer
  participant MB as Microblock

  W->>MB: base ts=1234567890
  W->>MB: deltas [+1,+1,+1,...]
  W->>MB: delta-of-delta [0,0,0,...]
  MB-->>W: store base + compact zeros
sequenceDiagram
  autonumber
  %% RLE example
  participant W as Writer
  participant MB as Microblock

  W->>MB: product_id stream [12345 x 500000]
  W->>MB: encode as (12345,500000)
  MB-->>W: 12 bytes payload
sequenceDiagram
  autonumber
  %% prefix example
  participant W as Writer
  participant MB as Microblock

  W->>MB: find common prefix "iPhone 11 Pro"
  W->>MB: store suffixes [" Max 256GB ...", " 128GB ...", ...]
  MB-->>W: prefix + suffix array

Compression Layer (after encoding)

Diagram: Compression algorithm choice

flowchart TB
  %% encoding then compression
  E[Encoded column stream] -->|many repeated patterns| LZ[LZ4\nhigher ratio]
  E -->|fast decode| SNP[Snappy\nfaster decompress]

  LZ --> OUT[Compressed bytes]
  SNP --> OUT

  OUT -->|read| DCP[Decompress]
  DCP -->|then| DEC[Decode columns]

| Algorithm | Bias | Use case |
| --- | --- | --- |
| LZ4 | higher compression ratio | cold data, bandwidth/storage sensitive |
| Snappy | faster decompression | hot data, latency sensitive |

Real compression results:

Example microblock (4KB uncompressed):
1. Raw data: 4096 bytes
2. After column encoding: 820 bytes (80% reduction)
3. After LZ4 compression: 287 bytes (93% total reduction)

Typical compression ratios by column type:
- status (enum): 98% reduction
- timestamp (sequential): 95% reduction
- user_id (random): 60% reduction
- amount (numbers): 70% reduction
- product_name (strings): 50% reduction

Overall: 70-90% reduction across all data

Performance impact (summary)

Diagram: Compression performance trade-off

flowchart LR
  %% compression trade-off summary
  IO[Disk I/O] -->|reduced bytes read| FAST[Lower latency / higher throughput]
  CPU[CPU] -->|decompress + decode| COST[Small extra cost]

  FAST --> NET[Net win\nI/O saved >> CPU cost]
  COST --> NET

| Metric | Without compression | With compression |
| --- | --- | --- |
| Bytes read per microblock | 4096B | 287B |
| Extra CPU | ~0ms | ~0.1ms to decompress + decode |
| Cache efficiency | baseline | higher (more blocks fit) |

Decompression on read (as a sequence)

sequenceDiagram
  autonumber
  participant Q as Query
  participant IDX as Index
  participant DISK as Disk
  participant LZ as Decompress
  participant CE as Column engine

  Q->>IDX: locate microblock for user_id=12345
  IDX-->>Q: microblock_id
  Q->>DISK: read compressed microblock (287B)
  DISK-->>Q: bytes
  Q->>LZ: decompress to 4KB
  LZ-->>Q: uncompressed block
  Q->>CE: prune columns (keep amount)
  CE-->>Q: amount column
  Q->>CE: decode (delta)
  CE-->>Q: result rows

Column pruning benefit:

Diagram: Column pruning I/O reduction

flowchart LR
  %% column pruning I/O comparison
  subgraph NO[Without column pruning]
    N1[Read 20 columns\n200B/row] --> N2[Scan 1M rows\n200MB]
  end

  subgraph YES[With column pruning]
    Y1[Read timestamp + amount\n16B/row] --> Y2[Scan 1M rows\n16MB]
  end

  NO --> GAP[12.5x more I/O]
  YES --> GAP

Real Singles' Day numbers (summary)

| Item | Value |
| --- | --- |
| Total written (raw) | 6.45TB |
| After compression | 1.3TB |
| Saved per replica | 5.15TB |
| Total saved (5 replicas) | 25.75TB |

Cache Architecture

Diagram: Two-level cache overview

flowchart TB
  %% example: 2 queries hit 2 caches
  Q1[Q1 point: payment_id=12345] -->|point| RC[Row Cache\nkey=payments,847,12345]
  Q2[Q2 scan: ts>00:00] -->|scan| BC[Block Cache\nkey=macroblock_id]

  RC -->|hit 0.05ms| OUT1[row]
  RC -->|miss| MT[MemTable]
  MT -->|miss| SST[SSTable]

  BC -->|hit 0.5ms| OUT2[blocks]
  BC -->|miss| SST

  SST -->|read 287B microblock| IO[Disk I/O]
  IO --> SST

  subgraph Warm[Warming example]
    WS[Scan hot p847\nload last 24h rows] --> RC
    WS --> BC
  end
Example cache hit flow:
- point read payment_id=12345: RowCache hit -> 0.05ms
- scan last 1 hour: BlockCache hit -> avoid disk read
- after restart: warming scan loads p847 hot keys
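
A minimal sketch of a read-through row cache with LRU eviction, matching the flow above; collections.OrderedDict stands in for the real cache and all sizes are toy values:

# Read-through row cache: hit serves from memory, miss falls back to storage
# (MemTable/SSTable lookup) and populates the cache, evicting the LRU entry.
from collections import OrderedDict

CACHE_CAPACITY = 10_000  # rows
row_cache = OrderedDict()

def read_row(key, fetch_from_storage):
    """key: (table, partition, primary_key); fetch_from_storage: storage lookup."""
    if key in row_cache:                      # hit: ~0.05ms, move to MRU position
        row_cache.move_to_end(key)
        return row_cache[key]
    row = fetch_from_storage(key)             # miss: MemTable, then SSTable
    row_cache[key] = row                      # populate on the way back
    if len(row_cache) > CACHE_CAPACITY:
        row_cache.popitem(last=False)         # evict least recently used
    return row

# usage (storage lookup stubbed out)
print(read_row(("payments", 847, 12345), lambda k: {"payment_id": k[2], "amount": 100}))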

Two-level cache architecture

Diagram: Why two cache levels

flowchart TB
  %% why 2 cache levels
  AP[Access pattern] --> OLTP[OLTP\npoint reads\nprimary key]
  AP --> OLAP[OLAP\nrange scans\nfilters]

  OLTP --> RC2[Row Cache\nrow-level objects]
  OLAP --> BC2[Block Cache\nmicroblocks]

  RC2 --> GOAL[Lower tail latency]
  BC2 --> GOAL

Row Cache:

sequenceDiagram
  autonumber
  %% point query path: row cache hit/miss
  participant U as Client
  participant RC as Row Cache
  participant MT as MemTable
  participant SST as SSTable

  U->>RC: GET (table,partition,pk)
  alt cache hit
    RC-->>U: row (0.05ms)
  else cache miss
    RC-->>U: miss
    U->>MT: lookup
    alt in memtable
      MT-->>U: row
    else not in memtable
      U->>SST: read microblock
      SST-->>U: row
      U->>RC: put row
    end
  end

Diagram: Cache population strategies

flowchart TB
  %% cache population strategies
  subgraph Demand[Read-through]
    D1[Cache miss] --> D2[Load from SSTable] --> D3[Insert into cache]
  end

  subgraph Warm[Warming]
    W1[Server restart/compaction] --> W2[Scan hot partitions] --> W3[Preload recent/hot keys]
  end

Diagram: Cache eviction and thrashing

flowchart TB
  %% eviction + thrashing
  subgraph LRU[Eviction]
    E1[Cache full] --> E2[Find least recently used] --> E3[Evict] --> E4[Insert new row]
  end

  subgraph THR[Thrashing]
    T1[Too many unique keys] --> T2[Constant evict/insert] --> T3[Low hit rate]
  end

  THR --> Fix[Mitigations: bigger cache / LRU-K / partitioned cache]

Block Cache — Overview

sequenceDiagram
  autonumber
  %% range scan path: block cache hit/miss
  participant Q as Query
  participant BC as Block Cache
  participant DISK as Disk
  participant DEC as Decompress+Decode

  Q->>Q: decide scan needs many microblocks
  loop for each microblock
    Q->>BC: GET microblock
    alt hit
      BC-->>Q: decompressed block
    else miss
      BC-->>Q: miss
      Q->>DISK: read compressed block
      DISK-->>Q: bytes
      Q->>DEC: decompress + decode
      DEC-->>Q: decompressed block
      Q->>BC: PUT decompressed block
    end
  end

| Why store decompressed blocks? | Trade-off |
| --- | --- |
| avoids decompression on every cache hit | uses more memory (fewer blocks cached) |

Diagram: Dynamic cache budget split

flowchart TB
  %% cache split + dynamic rebalance
  TOTAL[Total cache budget] --> RC[Row Cache]
  TOTAL --> BC2[Block Cache]

  RC -->|monitor hit rate| MON[Metrics]
  BC2 -->|monitor hit rate| MON

  MON -->|rebalance every 5m| ADJ[Adjust split]
  ADJ --> RC
  ADJ --> BC2

Cache warming after major compaction (timeline)

sequenceDiagram
  autonumber
  participant Z as Zone 2
  participant W as Warming job
  participant BC as Block Cache
  participant Q as Queries

  Note over Z: major compaction finishes
  Z-->>BC: SSTable ids changed -> cache keys invalid
  BC-->>Q: 100% miss if serving immediately

  Z->>W: start warming before serving
  W->>W: identify hot partitions (top 20% QPS)
  loop for each hot partition
    W->>Z: sequential scan new SSTable
    W->>BC: preload hot microblocks
  end
  W-->>Z: warming done
  Z-->>Q: start serving with 75% hit rate

Cache warming specifics

Diagram: Cache warming plan (what/how/time)

flowchart TB
  %% what/how/time/result
  subgraph What[What to warm]
    HOT[Hot partitions\nTop 20% by QPS\nexample top 200]
    COLD[Cold partitions\n80%]
    HOT --> H24[Load 100% last 24h]
    HOT --> H7[Load 50% last 7d]
    COLD --> NAT[Warm naturally]
  end

  subgraph HowMuch[How much to warm]
    RCV[Row Cache: 10M rows 2GB]
    BCV[Block Cache: 500k microblocks 2GB]
    TOT[Total: 4GB per node\nCluster: 20k nodes -> 80TB]
  end

  subgraph Time[Time to warm]
    SCAN[Sequential scan 1GB/s] --> CALC[4s scan + 5s insert\n=> 10s per zone]
  end

  subgraph Result[Result]
    READY[Zone ready in 10s\nNo user-visible degradation]
  end

  What --> Result
  HowMuch --> Result
  Time --> Result

Cache hit rate metrics — Singles' Day 2019 (summary)

| Cache | Window | Hit rate | Avg latency | QPS |
| --- | --- | --- | --- | --- |
| Row Cache | 00:00–01:00 | 92% | 0.15ms | 500k |
| Row Cache | 01:00–06:00 | 95% | 0.10ms | 200k |
| Row Cache | 06:00–24:00 | 97% | 0.08ms | 50k |
| Block Cache | 00:00–01:00 | 65% | 8ms / 1k rows | 50 scans/s |
| Block Cache | 01:00–06:00 | 75% | 5ms / 1k rows | 100 scans/s |
| Block Cache | 06:00–24:00 | 85% | 3ms / 1k rows | 200 scans/s |
xychart-beta
  title "Hit rate trend"
  x-axis ["00-01","01-06","06-24"]
  y-axis "hit rate %" 0 --> 100
  line "RowCache" [92,95,97]
  line "BlockCache" [65,75,85]

Cache benefits — By the numbers (summary)

Diagram: Cache ROI back-of-envelope

flowchart LR
  %% cache benefit back-of-envelope
  A[Without cache\n544k QPS\napprox 2ms avg\nhigh disk I/O] --> B[Needs many servers]
  C[With 95% hit rate\nmost reads from RAM\nlow disk I/O] --> D[Fewer servers]
  B --> E[Cost]
  D --> E
  E --> F[Net savings]

| Scenario | Disk I/O (per node) | CPU cores (rough) | Servers |
| --- | --- | --- | --- |
| Without cache | 2.176GB/s | 1088 | 3000 |
| With 95% hit rate | 108MB/s | 80 | 1000 |

| Cost item | Value |
| --- | --- |
| Servers saved | 2000 |
| Server capex saved | $10M |
| RAM cost | $1.28M |
| Net savings | $8.72M |
| ROI | 6.8x |
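
A quick back-of-envelope check of the ROI figures, using only the numbers from the table above:

# Sanity-check the ROI row from the table.
server_capex_saved = 10_000_000      # 2000 servers saved
ram_cost = 1_280_000                 # extra RAM for the caches
net_savings = server_capex_saved - ram_cost
roi = net_savings / ram_cost
print(net_savings, round(roi, 1))    # 8720000 6.8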

The cache is a necessity for reaching a sustained 544k TPS.


Compaction Strategy During Singles' Day

Major compaction is a necessary evil: it reclaims disk space and cleans up old versions, but it is expensive (rewrites data and thrashes caches).

sequenceDiagram
  autonumber
  %% Zone rotation timeline (Singles' Day)
  participant RS as RootService
  participant PX as OBProxy
  participant Z1 as Zone 1
  participant Z2 as Zone 2
  participant Z3 as Zone 3

  Note over Z1,Z3: 00:00 All zones ACTIVE, peak traffic

  Note over Z2: 06:00 drain Zone 2 -> COMPACTING
  RS->>PX: stop routing to Zone 2
  PX-->>Z1: shift traffic up
  PX-->>Z3: shift traffic up
  RS->>Z2: start major compaction
  Note over Z2: 3h compaction
  RS->>Z2: start cache warming
  Note over Z2: 10m warming
  RS->>PX: re-enable routing to Zone 2

  Note over Z3: 10:00 drain Zone 3 -> COMPACTING (repeat)

Diagram: Single node major compaction steps

flowchart TB
  %% Single node compaction (high-level)
  PREP[Prepare\nstop serving writes\nflush memtables] --> MERGE[Merge\nbuild new major SSTable]
  MERGE --> CLEAN[Cleanup\ndelete old SSTables\nreclaim space]
  CLEAN --> WARM[Warm caches\npreload hot blocks]
  WARM --> READY[Ready\nback to ACTIVE]

  MERGE --> INC[Incremental rewrite\nonly modified macroblocks]

Diagram: Traditional vs incremental compaction I/O

flowchart LR
  %% Traditional vs incremental write cost
  T[Traditional\nrewrite full dataset] --> TW[High write I/O]
  I[Incremental\nrewrite modified blocks] --> IW[Low write I/O]

  TW --> SLOW[Longer compaction]
  IW --> FAST[Shorter compaction]

| Topic | Key takeaway |
| --- | --- |
| Traditional approach | compact everywhere -> caches go cold -> user-visible latency |
| Zone rotation | compact one zone at a time -> other zones keep serving -> near-zero user impact |
| Incremental compaction | reduces write amplification by rewriting only modified macroblocks |
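
A back-of-envelope sketch of why incremental rewriting matters, reusing the numbers from the "Why incremental major compaction matters" table earlier in the post (1GB = 1000MB here for simplicity):

# Write I/O for traditional full rewrite vs incremental macroblock rewrite.
MACROBLOCK_MB = 2
old_major_gb = 100
blocks_total = old_major_gb * 1000 // MACROBLOCK_MB   # 50,000 macroblocks
blocks_rewritten = 1000                                # blocks actually touched by changes
blocks_reused = blocks_total - blocks_rewritten        # 49,000 reused as-is

write_io_traditional_gb = old_major_gb                 # rewrite everything
write_io_incremental_gb = blocks_rewritten * MACROBLOCK_MB / 1000  # ~2GB
print(blocks_reused, write_io_incremental_gb)          # 49000 2.0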

Diagram: Cache preservation by proactive warming

flowchart TB
  %% Cache preservation by proactive warming
  OLD[Old cache keys\nSSTable v4] --> MAP[Map hot regions\nold to new]
  MAP --> LOAD[Load hot blocks\nfrom new SSTable v5]
  LOAD --> HIT[Hit rate recovers faster]
sequenceDiagram
  autonumber
  %% Stop-the-world vs rotation (impact)
  participant U as Users
  participant STW as Stop-the-world
  participant ROT as Zone rotation

  Note over STW: all zones compact
  STW-->>U: caches cold -> high latency for hours

  Note over ROT: only 1 zone compacting
  ROT-->>U: route to hot zones -> minimal impact

Diagram: Routing adjustment and emergency handling

flowchart TB
  %% Load balancing and emergency handling
  MON[Monitor zone CPU/latency] --> ROUTE[Adjust routing between active zones]
  ROUTE --> OK[Stay within headroom]

  FAIL[Active zone failure] --> EMER[Emergency: pause compaction\nbring compacting zone ACTIVE]
  EMER --> ROUTE
xychart-beta
  title "Disk usage trend"
  x-axis ["D1","D2","D3","D10","D11"]
  y-axis "TB" 0 --> 600
  line "No compaction" [100,160,220,550,600]
  line "With compaction" [100,130,160,280,280]

Singles' Day 2019 compaction (summary)

| Item | Value |
| --- | --- |
| Zone 2 compaction window | 06:00–09:00 |
| Per-node space reclaimed | 15TB |
| Compaction duration | 3h |
| Warming duration | 10m |
| Hit rate right after warming | 75% |
| Hit rate after 1h of serving | 85–90% |
| Zone 2 nodes | 2000 |
| Zone 2 reclaimed (cluster-wide) | 30PB |
| Zone 2 rewritten (cluster-wide) | 20PB |

