@zboralski
Last active March 7, 2026 12:35
MidALU Trigram Clustering — recovering GPU instruction families from compiler scheduling patterns

MidALU Trigram Clustering Results (2026-03-06)

Method

For every occurrence of a target tag in the corpus (1.49M HiALU instructions plus all MidALU instructions), extract a trigram window:

[prev_form] TARGET [next_form]

The forms are: LowALU, MidALU, HiALU, MemHi, MemOp, Reg, ExtALU, Branch, MidCatch, CF.

Represent each tag as a normalized frequency vector over form-level trigrams, then compute cosine similarity.
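The method above can be sketched in Go as follows. This is a minimal illustration, not the contents of tools/trigram_cluster.go: the `Instr` type, the function names, and the assumption that each instruction already carries its form label are all hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// Instr is a hypothetical decoded instruction: Tag is the fine-grained tag
// (e.g. "mid_alu_b5"), Form is its coarse class (e.g. "MidALU").
type Instr struct {
	Tag  string
	Form string
}

// TrigramVector returns the normalized frequency of each
// "[prev_form] TARGET [next_form]" window around occurrences of target.
// Occurrences at the stream edges are skipped.
func TrigramVector(stream []Instr, target string) map[string]float64 {
	counts := map[string]float64{}
	total := 0.0
	for i := 1; i+1 < len(stream); i++ {
		if stream[i].Tag != target {
			continue
		}
		counts[stream[i-1].Form+" TARGET "+stream[i+1].Form]++
		total++
	}
	for k := range counts {
		counts[k] /= total
	}
	return counts
}

// Cosine returns the cosine similarity of two sparse vectors,
// or 0 if either vector is empty.
func Cosine(a, b map[string]float64) float64 {
	var dot, na, nb float64
	for k, va := range a {
		na += va * va
		dot += va * b[k]
	}
	for _, vb := range b {
		nb += vb * vb
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	stream := []Instr{
		{"low_alu_0005", "LowALU"}, {"mid_alu_ad", "MidALU"}, {"hi_alu_81", "HiALU"},
		{"low_alu_0004", "LowALU"}, {"mid_alu_78", "MidALU"}, {"hi_alu_81", "HiALU"},
	}
	a := TrigramVector(stream, "mid_alu_ad")
	b := TrigramVector(stream, "mid_alu_78")
	fmt.Printf("cosine = %.3f\n", Cosine(a, b))
}
```

Sparse maps keep the vectors small because most of the 10×10 possible form pairs never occur around a given tag.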

Dataset

  • 260 disasm files (imac_flat, macos26, system, xcode_beta, icon_composer, gpu, docs)
  • 11 unprobed MidALU tags analyzed against 16 known baseline tags

Cluster Results

Cluster 1: Standard ALU Pipeline

Tags that sit in generic MidALU↔MidALU, MidALU↔LowALU, and MidALU↔HiALU neighborhoods. All four have cosine >0.90 with a known arithmetic tag (0.908 to 0.959).

Unprobed    Base  Modifier  Closest Known  Cosine  Family
mid_alu_b5  0x15  mod 5     mid_alu_03     0.959   ALU general
mid_alu_b9  0x19  mod 5     mid_alu_1d     0.958   ALU general
mid_alu_f9  0x19  mod 7     mid_alu_04     0.932   ALU general
mid_alu_ff  0x1f  mod 7     mid_alu_03     0.908   ALU general

b5 and b9 cluster together at 0.977 cosine — they are the same functional family with different base opcodes. Their mod-0 bases (0x15, 0x19) also cluster tightly (0.970).

Cluster 2: Data Movement / Memory-Adjacent

Unprobed    Base  Modifier  Closest Known  Cosine  Family
mid_alu_2a  0x0a  mod 1     mid_alu_0a     0.984   ALU feed (nearly identical to base)
mid_alu_d3  0x13  mod 6     mid_alu_03     0.896   ALU with heavy MemHi context
mid_alu_36  0x16  mod 1     mid_alu_03     0.907   ALU config (more HiALU context)

mid_alu_d3 is distinct: 8.8% of its trigrams involve MemHi↔MemHi patterns (memory load/store chains). Top prevs include mem_hi_6=274, mem_hi_e=208.

Cluster 3: Branch-Adjacent / Control Flow

Unprobed    Base  Modifier  Closest Known  Cosine  Family
mid_alu_9b  0x1b  mod 4     mid_alu_11     0.859   Branch-heavy context
mid_alu_bd  0x1d  mod 5     mid_alu_01     0.810   Compare+branch pattern

mid_alu_9b signature: 8.9% branch→TARGET→branch trigrams. Top prev: branch_9a=3451. Top next: branch_9a=4010. Chains with itself: mid_alu_9b→mid_alu_9b→branch_9a.

mid_alu_bd signature: 5.8% TARGET→Branch, 14% TARGET→HiALU. Top pattern: low_alu_002b→mid_alu_bd→branch_10 (5.8%).

Cluster 4: Register Pipeline (Outlier)

Unprobed    Base  Modifier  Closest Known  Cosine  Family
mid_alu_d1  0x11  mod 6     mid_alu_19     0.508   Unique: register sync

Genuine outlier. Trigram profile:

  • 22.9% MidALU → TARGET → Reg
  • 15.0% HiALU → TARGET → Reg
  • 8.7% Branch → TARGET → Reg
  • 53% of next instructions are reg words

This is consistent with TG_FENCE_M6 — a threadgroup fence/barrier that requires register file synchronization. The following reg word likely encodes barrier metadata.

Cluster 5: ALU Pipeline Step

Unprobed    Base  Modifier  Closest Known  Cosine  Family
mid_alu_ad  0x0d  mod 5     mid_alu_11     0.756   Mid-pipeline step

Unique signature: low_alu_0005→TARGET→hi_alu_81 (10.3%), low_alu_0004→TARGET→hi_alu_81 (6.0%). A MidALU step that sits between LowALU setup and HiALU consumer. Top next: hi_alu_81=1987.

Specific Behavioral Signatures

mid_alu_ff (FDIV_COMPARE_AUX)

64.6% of next instructions are mid_alu_0c. This is the tightest pairing in the corpus.

TOP NEXT: mid_alu_0c=3892 (64.6%), low_alu_000e=266, low_alu_001e=212
TOP PREV: low_alu_0000=734, reg=624, low_alu_catch=315

This supports reading it as a divider pipeline micro-op that usually feeds mid_alu_0c (64.6% of occurrences, not all).

mid_alu_9b (ALU_CHAIN_SETUP_M4)

Heavily self-chaining in branch contexts:

8.9%  branch_9a → mid_alu_9b → branch_9a
6.0%  mid_alu_9b → mid_alu_9b → branch_9a
4.5%  branch_9a → mid_alu_9b → mid_alu_9b
2.6%  mid_alu_9b → mid_alu_9b → mid_alu_9b

A conditional chain setup that runs in loops with branches. 23,914 corpus occurrences.
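Tag-level percentages like those above can be produced by counting (prev, next) pairs around the target and ranking them by share. The sketch below assumes a flat slice of tag names as input; `Pattern` and `TopPatterns` are illustrative names, not taken from the author's tools.

```go
package main

import (
	"fmt"
	"sort"
)

// Pattern is one tag-level trigram around a target, with its share of all
// windows for that target.
type Pattern struct {
	Prev, Next string
	Share      float64
}

// TopPatterns ranks the "prev -> target -> next" tag trigrams for target
// by share, highest first, and returns at most k of them.
func TopPatterns(tags []string, target string, k int) []Pattern {
	counts := map[[2]string]int{}
	total := 0
	for i := 1; i+1 < len(tags); i++ {
		if tags[i] == target {
			counts[[2]string{tags[i-1], tags[i+1]}]++
			total++
		}
	}
	out := make([]Pattern, 0, len(counts))
	for key, n := range counts {
		out = append(out, Pattern{key[0], key[1], float64(n) / float64(total)})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Share != out[j].Share {
			return out[i].Share > out[j].Share
		}
		// Break ties by name so the order is deterministic.
		if out[i].Prev != out[j].Prev {
			return out[i].Prev < out[j].Prev
		}
		return out[i].Next < out[j].Next
	})
	if len(out) > k {
		out = out[:k]
	}
	return out
}

func main() {
	tags := []string{"branch_9a", "mid_alu_9b", "branch_9a", "mid_alu_9b", "mid_alu_9b", "branch_9a"}
	for _, p := range TopPatterns(tags, "mid_alu_9b", 3) {
		fmt.Printf("%4.1f%%  %s → mid_alu_9b → %s\n", 100*p.Share, p.Prev, p.Next)
	}
}
```

Windows overlap, so a self-chaining run contributes one window per occurrence of the target, which is how a tag can appear as its own predecessor and successor.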

mid_alu_d3 (ALU_REG_CONFIG_M6)

Heavy memory context:

4.0%  mem_hi_6 → TARGET → mem_hi_4
3.5%  mem_hi_e → TARGET → mem_hi_e
TOP PREV: mem_hi_6=274, mem_hi_e=208

A register bank reconfiguration between memory operations. 2,753 corpus occurrences.

mid_alu_2a (ALU_FEED_M1)

Nearly identical to its base mid_alu_0a (cosine=0.984):

TOP PREV: mid_alu_1a=5635, mid_alu_3a=4233, mid_alu_5a=3733
TOP NEXT: mid_alu_34=5653, low_alu_catch=5243, mem_hi_4=4636

Same pipeline position as 0x0a, just a modifier variant. At 73,868 corpus occurrences it is not rare.

Cosine Similarity Matrix (top pairs, >0.95)

mid_alu_13  ↔ mid_alu_15   0.992
mid_alu_15  ↔ mid_alu_1b   0.991
mid_alu_13  ↔ mid_alu_1b   0.984
mid_alu_0a  ↔ mid_alu_2a   0.984   ← unprobed matches base
mid_alu_15  ↔ mid_alu_1d   0.984
mid_alu_13  ↔ mid_alu_1d   0.984
mid_alu_1b  ↔ mid_alu_1d   0.981
mid_alu_b5  ↔ mid_alu_b9   0.977   ← unprobed pair clusters
mid_alu_03  ↔ mid_alu_04   0.974
mid_alu_03  ↔ mid_alu_b5   0.959   ← unprobed matches known
mid_alu_1d  ↔ mid_alu_b9   0.958   ← unprobed matches known

Conclusions

  1. 7 of 11 unprobed tags (Clusters 1 and 2) are standard ALU pipeline variants: six exceed 0.90 cosine with a known tag, and mid_alu_d3 sits just below at 0.896. They don't need oracle probing; they're modifier variants of well-understood base opcodes.

  2. mid_alu_d1 is the most interesting: a genuine outlier (best cosine 0.508) that precedes a reg word in 53% of occurrences. The TG_FENCE_M6 label is well-supported.

  3. mid_alu_ff feeds mid_alu_0c in 64.6% of occurrences, which supports the divider pipeline reading.

  4. mid_alu_9b lives in branch loops — conditional chain setup confirmed.

  5. Trigram clustering recovers instruction families without knowing instruction semantics, purely from compiler scheduling patterns.

K-Means Clustering (k=5, 106 MidALU tags, ≥100 occurrences)

Form-level trigram vectors, L2-normalized, k-means++ initialization, 20 trials.
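A compact sketch of this setup, assuming dense, already L2-normalized rows and squared Euclidean distance (which on unit vectors orders pairs the same way as cosine). The function names, the fixed 50-iteration cap, and the lowest-cost selection rule across trials are assumptions; tools/trigram_kmeans.go may differ.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// dist2 is the squared Euclidean distance; on unit vectors it equals 2 - 2*cosine.
func dist2(a, b []float64) float64 {
	var s float64
	for i := range a {
		s += (a[i] - b[i]) * (a[i] - b[i])
	}
	return s
}

// nearest returns the index of, and squared distance to, the closest center.
func nearest(p []float64, centers [][]float64) (int, float64) {
	best, bd := 0, math.Inf(1)
	for c := range centers {
		if d := dist2(p, centers[c]); d < bd {
			best, bd = c, d
		}
	}
	return best, bd
}

// seedPP is k-means++ seeding: each new center is drawn with probability
// proportional to its squared distance from the nearest existing center.
func seedPP(pts [][]float64, k int, rng *rand.Rand) [][]float64 {
	centers := [][]float64{pts[rng.Intn(len(pts))]}
	d := make([]float64, len(pts))
	for len(centers) < k {
		sum := 0.0
		for i, p := range pts {
			_, d[i] = nearest(p, centers)
			sum += d[i]
		}
		r, pick := rng.Float64()*sum, len(pts)-1
		for i := range pts {
			if r -= d[i]; r < 0 {
				pick = i
				break
			}
		}
		centers = append(centers, pts[pick])
	}
	return centers
}

// KMeans expects L2-normalized rows. It runs Lloyd's algorithm from `trials`
// k-means++ seedings and returns the assignment with the lowest total cost.
func KMeans(pts [][]float64, k, trials int, seed int64) []int {
	rng := rand.New(rand.NewSource(seed))
	var best []int
	bestCost := math.Inf(1)
	for t := 0; t < trials; t++ {
		centers := seedPP(pts, k, rng)
		assign := make([]int, len(pts))
		cost := 0.0
		for iter := 0; iter < 50; iter++ { // fixed iteration cap, no convergence test
			cost = 0
			sums := make([][]float64, k)
			n := make([]int, k)
			for i, p := range pts {
				c, d := nearest(p, centers)
				assign[i], cost = c, cost+d
				if sums[c] == nil {
					sums[c] = make([]float64, len(p))
				}
				n[c]++
				for j, x := range p {
					sums[c][j] += x
				}
			}
			for c := range centers {
				if n[c] == 0 {
					continue // empty cluster keeps its previous center
				}
				for j := range sums[c] {
					sums[c][j] /= float64(n[c])
				}
				centers[c] = sums[c]
			}
		}
		if cost < bestCost {
			best, bestCost = assign, cost
		}
	}
	return best
}

func main() {
	pts := [][]float64{{1, 0}, {0.96, 0.28}, {0, 1}, {0.28, 0.96}}
	fmt.Println(KMeans(pts, 2, 20, 1))
}
```

Cluster indices are arbitrary between runs with different seeds; only the grouping is meaningful.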

Cluster 0: LowALU Feed Chain (11 tags)

Signature: [LowALU] TARGET [MidALU] dominant (83%)

Tags: mid_alu_{3a, 5a, 72, 7a, 7f, 92, b2, d2, da, f2, fa}

These are MidALU instructions that primarily receive input from LowALU: pipeline feed operations. All are modifier variants (mod 1-7) of just three base opcodes: 0x12, 0x1a, and 0x1f.

Cluster 1: General ALU (78 tags)

Signature: [MidALU] TARGET [MidALU] dominant (75%)

The largest cluster. Contains most arithmetic, logic, and configuration tags, including 9 of the 11 unprobed targets (all except mid_alu_ad and mid_alu_d1). These sit in MidALU↔MidALU chains and behave as standard ALU pipeline instructions.

Cluster 2: Load-Execute Bridge (7 tags)

Signature: [LowALU] TARGET [HiALU] dominant (62%)

Tags: mid_alu_{78, 95, ad, b0, b8, f0, f8}

MidALU instructions that bridge LowALU setup to HiALU execution. mid_alu_ad (ALU_MID_STEP_M5) falls here — confirmed as pipeline step between source setup and ALU consumer.

Cluster 3: Register Sync (4 tags)

Signature: [MidALU] TARGET [Reg] dominant (51%)

Tags: mid_alu_{34, b1, d1, d5}

MidALU instructions that are most often followed by reg words. mid_alu_d1 (TG_FENCE_M6) lands here. mid_alu_34 is the high-volume anchor (101K occurrences). This cluster represents operations requiring register file metadata words.

Cluster 4: LowALU Interleave (6 tags)

Signature: [LowALU] TARGET [LowALU] dominant (78%)

Tags: mid_alu_{09, 0b, 50, 70, a0, a4}

MidALU instructions sandwiched between LowALU instructions. These are inline data or barrier-like tags that sit within LowALU instruction streams without disrupting them. mid_alu_a0 is the high-volume member (66K occurrences).

Inter-Cluster Distances

Pair     Cosine similarity
C1 ↔ C2  0.705
C0 ↔ C1  0.691
C1 ↔ C4  0.653
C1 ↔ C3  0.604
C0 ↔ C2  0.528

Clusters 3 (Reg Sync) and 4 (LowALU Interleave) are the most distinctive — lowest similarity to other clusters.

Modifier Interaction Matrix (2026-03-06)

Method

For every MidALU instruction in the corpus, extract base = tag[4:0] and modifier = tag[7:5]. Build:

  1. Base × Modifier frequency table
  2. Per base+mod: predecessor/successor form distribution
  3. Modifier transition matrix (consecutive MidALU pairs)
  4. Clause patterns (modifier sequences of length 2-3)

Tool: tools/modifier_matrix.go
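The bit split and the first table in the list above might look like this in Go. `SplitTag` and `FreqTable` are hypothetical names; only the field layout (base = tag[4:0], modifier = tag[7:5]) comes from the text.

```go
package main

import "fmt"

// SplitTag splits an 8-bit MidALU tag into base = tag[4:0] and
// modifier = tag[7:5].
func SplitTag(tag uint8) (base, mod uint8) {
	return tag & 0x1f, tag >> 5
}

// FreqTable counts occurrences per (base, modifier) cell: table[base][mod].
func FreqTable(tags []uint8) [32][8]int {
	var table [32][8]int
	for _, t := range tags {
		b, m := SplitTag(t)
		table[b][m]++
	}
	return table
}

func main() {
	for _, t := range []uint8{0xb5, 0xd1, 0x2a} {
		b, m := SplitTag(t)
		fmt.Printf("mid_alu_%02x -> base 0x%02x mod %d\n", t, b, m)
	}
}
```

For example, mid_alu_d1 (0xd1) decodes to base 0x11 with modifier 6, matching the Base and Modifier columns in the cluster tables.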

Base × Modifier Structure

Two tiers of base opcodes:

  • Simple bases (0x00-0x0f): only mod0 + mod1 (plus occasional mod5). Core scalar ALU.
  • Rich bases (0x10-0x1f): up to 8 modifiers. Complex pipeline ops (FMA, conversion, compare, fence).

Widest modifier spread: base 0x1a (459K instances, 7 modifiers), 0x12 (283K, all 8 mods), 0x14 (154K, 7 mods).

Modifier Transition Matrix

prev\next  mod0   mod1   mod2   mod3   mod4   mod5   mod6   mod7
mod0      64.1%  24.6%   1.9%   1.6%   2.4%   3.7%   0.9%   0.9%  (n=957K)
mod1      52.4%  35.6%   2.6%   1.6%   2.4%   3.5%   0.9%   1.0%  (n=405K)
mod2      59.9%  33.8%   3.0%   0.8%   0.8%   1.2%   0.3%   0.3%  (n=67K)
mod3      68.7%  24.5%   0.7%   2.3%   0.6%   2.5%   0.4%   0.3%  (n=53K)
mod4      57.6%  22.9%   1.4%   0.8%  11.8%   4.0%   0.4%   1.1%  (n=42K)
mod5      60.6%  25.8%   2.1%   1.3%   2.5%   6.9%   0.4%   0.4%  (n=54K)
mod6      58.0%  38.3%   1.2%   0.3%   0.2%   0.7%   1.2%   0.1%  (n=29K)
mod7      67.0%  28.9%   0.6%   0.5%   0.7%   1.3%   0.1%   1.0%  (n=34K)

Key patterns:

  • mod0→mod0 dominates (64.1%) — mod0 is the default scheduling slot
  • mod1 is the secondary slot (23-38% as successor)
  • mod4 has 11.8% self-affinity — runs in chains (branch-loop pattern)
  • mod6→mod1 elevated (38.3%): mod6 transitions to mod0 or mod1 96.3% of the time
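A row-normalized matrix like the one above can be built from a run of consecutive MidALU tags as sketched below. The input shape (a single uninterrupted run) and the name `TransitionMatrix` are assumptions for illustration; a real corpus would need to restart counting wherever a non-MidALU instruction intervenes.

```go
package main

import "fmt"

// TransitionMatrix returns P[prev][next]: the share of consecutive MidALU
// pairs whose modifiers (tag[7:5]) are prev then next. Each row sums to 1,
// or stays all-zero if that modifier never appears as a predecessor.
func TransitionMatrix(mids []uint8) [8][8]float64 {
	var counts [8][8]float64
	var rowTotal [8]float64
	for i := 0; i+1 < len(mids); i++ {
		prev, next := mids[i]>>5, mids[i+1]>>5
		counts[prev][next]++
		rowTotal[prev]++
	}
	for p := range counts {
		if rowTotal[p] == 0 {
			continue
		}
		for n := range counts[p] {
			counts[p][n] /= rowTotal[p]
		}
	}
	return counts
}

func main() {
	run := []uint8{0x12, 0x32, 0x12, 0x9b, 0x9b, 0x12}
	m := TransitionMatrix(run)
	for p, row := range m {
		fmt.Printf("mod%d ", p)
		for _, v := range row {
			fmt.Printf("%6.1f%%", 100*v)
		}
		fmt.Println()
	}
}
```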

Pipeline Position Gradient (Base 0x12)

The most informative base shows that the modifier encodes pipeline position:

Modifier  LowALU prev  LowALU next  Reg next  Interpretation
mod0         11.3%        28.1%       9.3%    Standard middle-pipe
mod1         11.0%        28.5%       5.7%    Similar to mod0
mod2         25.7%         9.2%       8.5%    More LowALU-fed
mod3         36.8%         7.4%       8.9%    Heavy LowALU source
mod4         40.9%         9.2%      12.8%    LowALU→Reg bridge
mod5         48.0%         5.5%      15.2%    Source setup stage
mod6         47.9%         1.7%      19.3%    Near-terminal
mod7         61.0%         0.5%      25.7%    Pipeline terminus

mod7 is a pipeline terminus: 61% LowALU predecessors, 0.5% LowALU successors, 25.7% Reg successors. The modifier gradient encodes where in the LowALU→MidALU→HiALU clause the instruction sits.
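The three percentage columns for one base can be computed per modifier roughly as follows. This is a sketch under assumptions: the `Instr` and `Row` types and the form strings are hypothetical, and the tag field is only read for instructions whose form is MidALU.

```go
package main

import "fmt"

// Instr is a hypothetical decoded instruction. Tag is only meaningful
// (as an 8-bit MidALU tag) when Form == "MidALU".
type Instr struct {
	Form string
	Tag  uint8
}

// Row holds neighbor-form shares for one modifier of a given base.
type Row struct {
	N                         int
	LowPrev, LowNext, RegNext float64
}

// Gradient returns, for each modifier of base, the share of occurrences
// with a LowALU predecessor, a LowALU successor, and a Reg successor.
func Gradient(stream []Instr, base uint8) [8]Row {
	var rows [8]Row
	for i := 1; i+1 < len(stream); i++ {
		in := stream[i]
		if in.Form != "MidALU" || in.Tag&0x1f != base {
			continue
		}
		r := &rows[in.Tag>>5]
		r.N++
		if stream[i-1].Form == "LowALU" {
			r.LowPrev++
		}
		switch stream[i+1].Form {
		case "LowALU":
			r.LowNext++
		case "Reg":
			r.RegNext++
		}
	}
	// Convert raw counts to shares of each modifier's occurrences.
	for m := range rows {
		if n := float64(rows[m].N); n > 0 {
			rows[m].LowPrev /= n
			rows[m].LowNext /= n
			rows[m].RegNext /= n
		}
	}
	return rows
}

func main() {
	stream := []Instr{
		{"LowALU", 0}, {"MidALU", 0xf2}, {"Reg", 0},
		{"MidALU", 0x12}, {"LowALU", 0}, {"MidALU", 0xf2}, {"HiALU", 0},
	}
	for m, r := range Gradient(stream, 0x12) {
		if r.N > 0 {
			fmt.Printf("mod%d n=%d LowALU prev=%.0f%% LowALU next=%.0f%% Reg next=%.0f%%\n",
				m, r.N, 100*r.LowPrev, 100*r.LowNext, 100*r.RegNext)
		}
	}
}
```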

Clause Patterns

Top patterns (consecutive MidALU modifier sequences):

Pattern    Count
m0→m0      614K
m0→m1      236K
m1→m0      212K
m0→m0→m0   205K
m1→m1      144K
m0→m1→m0    69K
m1→m0→m1    52K

The m0↔m1 alternation accounts for the bulk of scheduling. Higher modifiers (2-7) are specialist inserts in predominantly mod0/mod1 streams.
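Counting modifier sequences of length 2 and 3 is a plain n-gram tally. The sketch below assumes a single run of consecutive MidALU tags and uses the same "m0→m1" key style as the table; `ClausePatterns` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// ClausePatterns counts modifier n-grams (n = 2 and 3) over one run of
// consecutive MidALU tags, keyed like "m0→m1→m0".
func ClausePatterns(mids []uint8) map[string]int {
	counts := map[string]int{}
	for n := 2; n <= 3; n++ {
		for i := 0; i+n <= len(mids); i++ {
			parts := make([]string, n)
			for j := range parts {
				parts[j] = fmt.Sprintf("m%d", mids[i+j]>>5)
			}
			counts[strings.Join(parts, "→")]++
		}
	}
	return counts
}

func main() {
	run := []uint8{0x12, 0x32, 0x12, 0x12, 0x12}
	for pattern, n := range ClausePatterns(run) {
		fmt.Println(pattern, n)
	}
}
```

Map iteration order in Go is unspecified, so the printed order varies between runs; sort the keys if a stable listing is needed.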

Confirmations

  • Base 0x11 mod6 (mid_alu_d1 = TG_FENCE): 53.3% Reg next, 9.4% Branch prev — independently confirms register sync role from trigram clustering.
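  • Base 0x14 mod1 (mid_alu_34): 74.7% MidALU prev, 35.0% Reg next. A distinctive register pipeline stage, and the high-volume anchor of the Register Sync k-means cluster.
  • Base 0x10 mod7 (mid_alu_f0): 49.8% HiALU next, 30.6% Reg prev. A HiALU feeder from the register file, consistent with its place in the Load-Execute Bridge cluster.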

Conclusions

  1. Modifier = pipeline position, not instruction variant. Higher modifiers sit closer to the pipeline terminus (LowALU source → Reg/HiALU sink).
  2. mod0/mod1 are the scheduling backbone: 88.7% of transitions out of mod0 (and 88.0% out of mod1) land on mod0 or mod1.
  3. mod4 is a loop-body specialist — 11.8% self-transition (branch-loop confirmed by mid_alu_9b chains).
  4. mod6/mod7 are pipeline-terminal — they feed into Reg words or HiALU consumers with minimal LowALU continuation.
  5. The dominant AGX clause shape is: (mod0|mod1)* [mod2-7_specialist]? (mod0|mod1)*. Specialist modifiers are injected into mod0/mod1 streams; self-chaining runs such as mod4→mod4 (11.8%) fall outside this pattern.
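The clause shape in the last conclusion can be checked mechanically by rendering each run as a string of modifier digits and matching it against a regular expression. This is a sketch of that idea only; `ModString` and `MatchesClause` are hypothetical helpers, and the pattern is a literal reading of the grammar as written.

```go
package main

import (
	"fmt"
	"regexp"
)

// clause encodes (mod0|mod1)* [mod2-7]? (mod0|mod1)* over a string with one
// digit per MidALU instruction, e.g. "0140" for mods 0,1,4,0.
var clause = regexp.MustCompile(`^[01]*[2-7]?[01]*$`)

// ModString renders the modifier (tag[7:5]) of each tag as one digit.
func ModString(mids []uint8) string {
	b := make([]byte, len(mids))
	for i, t := range mids {
		b[i] = '0' + (t >> 5)
	}
	return string(b)
}

// MatchesClause reports whether a run has at most one specialist modifier.
func MatchesClause(mids []uint8) bool {
	return clause.MatchString(ModString(mids))
}

func main() {
	fmt.Println(MatchesClause([]uint8{0x12, 0x32, 0x9b, 0x12})) // mods 0,1,4,0
	fmt.Println(MatchesClause([]uint8{0x9b, 0x9b}))             // mods 4,4
}
```

The second call is rejected because a run of two mod4 tags contains more than one specialist under this strict reading.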

Files

  • Tools: tools/trigram_cluster.go, tools/trigram_kmeans.go, tools/modifier_matrix.go
  • Raw aux6 stats: scratch/aux_field_corpus_stats.json
  • aux6 analysis: scratch/aux6_corpus_analysis.md