For every occurrence of a target tag in the corpus (1.49M HiALU + all MidALU), extract a trigram window:
[prev_form] TARGET [next_form]
The forms are: LowALU, MidALU, HiALU, MemHi, MemOp, Reg, ExtALU, Branch, MidCatch, CF.
Represent each tag as a normalized frequency vector over form-level trigrams, then compute cosine similarity.
- 260 disasm files (imac_flat, macos26, system, xcode_beta, icon_composer, gpu, docs)
- 11 unprobed MidALU tags analyzed against 16 known baseline tags
These tags sit in generic MidALU↔MidALU, MidALU↔LowALU, and MidALU↔HiALU neighborhoods, with high cosine (0.908 to 0.959) against known arithmetic tags.
| Unprobed | Base | Modifier | Closest Known | Cosine | Family |
|---|---|---|---|---|---|
| mid_alu_b5 | 0x15 | mod 5 | mid_alu_03 | 0.959 | ALU general |
| mid_alu_b9 | 0x19 | mod 5 | mid_alu_1d | 0.958 | ALU general |
| mid_alu_f9 | 0x19 | mod 7 | mid_alu_04 | 0.932 | ALU general |
| mid_alu_ff | 0x1f | mod 7 | mid_alu_03 | 0.908 | ALU general |
b5 and b9 cluster together at 0.977 cosine — they are the same functional family with different base opcodes. Their mod-0 bases (0x15, 0x19) also cluster tightly (0.970).
| Unprobed | Base | Modifier | Closest Known | Cosine | Family |
|---|---|---|---|---|---|
| mid_alu_2a | 0x0a | mod 1 | mid_alu_0a | 0.984 | ALU feed (nearly identical to base) |
| mid_alu_d3 | 0x13 | mod 6 | mid_alu_03 | 0.896 | ALU with heavy MemHi context |
| mid_alu_36 | 0x16 | mod 1 | mid_alu_03 | 0.907 | ALU config (more HiALU context) |
mid_alu_d3 is distinct: 8.8% of its trigrams involve MemHi↔MemHi patterns (memory load/store chains). Top prevs include mem_hi_6=274, mem_hi_e=208.
| Unprobed | Base | Modifier | Closest Known | Cosine | Family |
|---|---|---|---|---|---|
| mid_alu_9b | 0x1b | mod 4 | mid_alu_11 | 0.859 | Branch-heavy context |
| mid_alu_bd | 0x1d | mod 5 | mid_alu_01 | 0.810 | Compare+branch pattern |
mid_alu_9b signature: 8.9% branch→TARGET→branch trigrams. Top prev: branch_9a=3451. Top next: branch_9a=4010. Chains with itself: mid_alu_9b→mid_alu_9b→branch_9a.
mid_alu_bd signature: 5.8% TARGET→Branch, 14% TARGET→HiALU. Top pattern: low_alu_002b→mid_alu_bd→branch_10 (5.8%).
| Unprobed | Base | Modifier | Closest Known | Cosine | Family |
|---|---|---|---|---|---|
| mid_alu_d1 | 0x11 | mod 6 | mid_alu_19 | 0.508 | Unique: register sync |
Genuine outlier. Trigram profile:
- 22.9% MidALU → TARGET → Reg
- 15.0% HiALU → TARGET → Reg
- 8.7% Branch → TARGET → Reg
- 53% of next instructions are reg words
This is consistent with TG_FENCE_M6 — a threadgroup fence/barrier that requires register file synchronization. The following reg word likely encodes barrier metadata.
| Unprobed | Base | Modifier | Closest Known | Cosine | Family |
|---|---|---|---|---|---|
| mid_alu_ad | 0x0d | mod 5 | mid_alu_11 | 0.756 | Mid-pipeline step |
Unique signature: low_alu_0005→TARGET→hi_alu_81 (10.3%), low_alu_0004→TARGET→hi_alu_81 (6.0%). A MidALU step that sits between LowALU setup and HiALU consumer. Top next: hi_alu_81=1987.
For mid_alu_ff, 64.6% of next instructions are mid_alu_0c. This is the tightest pairing in the corpus.
TOP NEXT: mid_alu_0c=3892 (64.6%), low_alu_000e=266, low_alu_001e=212
TOP PREV: low_alu_0000=734, reg=624, low_alu_catch=315
Confirmed: a divider pipeline micro-op that feeds into mid_alu_0c in 64.6% of its occurrences.
mid_alu_9b self-chains heavily in branch contexts:
8.9% branch_9a → mid_alu_9b → branch_9a
6.0% mid_alu_9b → mid_alu_9b → branch_9a
4.5% branch_9a → mid_alu_9b → mid_alu_9b
2.6% mid_alu_9b → mid_alu_9b → mid_alu_9b
A conditional chain setup that runs in loops with branches. 23,914 corpus occurrences.
mid_alu_d3 sits in heavy memory context:
4.0% mem_hi_6 → TARGET → mem_hi_4
3.5% mem_hi_e → TARGET → mem_hi_e
TOP PREV: mem_hi_6=274, mem_hi_e=208
A register bank reconfiguration between memory operations. 2,753 corpus occurrences.
mid_alu_2a is nearly identical to its base mid_alu_0a (cosine=0.984):
TOP PREV: mid_alu_1a=5635, mid_alu_3a=4233, mid_alu_5a=3733
TOP NEXT: mid_alu_34=5653, low_alu_catch=5243, mem_hi_4=4636
Same pipeline position as 0x0a, just a modifier variant. 73,868 corpus occurrences (not rare at all).
| Pair | Cosine | Note |
|---|---|---|
| mid_alu_13 ↔ mid_alu_15 | 0.992 | |
| mid_alu_15 ↔ mid_alu_1b | 0.991 | |
| mid_alu_13 ↔ mid_alu_1b | 0.984 | |
| mid_alu_0a ↔ mid_alu_2a | 0.984 | unprobed matches base |
| mid_alu_15 ↔ mid_alu_1d | 0.984 | |
| mid_alu_13 ↔ mid_alu_1d | 0.984 | |
| mid_alu_1b ↔ mid_alu_1d | 0.981 | |
| mid_alu_b5 ↔ mid_alu_b9 | 0.977 | unprobed pair clusters |
| mid_alu_03 ↔ mid_alu_04 | 0.974 | |
| mid_alu_03 ↔ mid_alu_b5 | 0.959 | unprobed matches known |
| mid_alu_1d ↔ mid_alu_b9 | 0.958 | unprobed matches known |
- 7 of 11 unprobed tags are standard ALU pipeline variants (cosine of roughly 0.90 or higher with known tags; the lowest of the seven, mid_alu_d3, is 0.896). They don't need oracle probing: they are modifier variants of well-understood base opcodes.
- mid_alu_d1 is the most interesting: a genuine outlier that is followed by reg words 53% of the time. The TG_FENCE_M6 label is well-supported.
- mid_alu_ff feeds mid_alu_0c in 64.6% of its occurrences: divider pipeline confirmed.
- mid_alu_9b lives in branch loops: conditional chain setup confirmed.
- Trigram clustering recovers instruction families without knowing instruction semantics, purely from compiler scheduling patterns.
Form-level trigram vectors, L2-normalized, k-means++ initialization, 20 trials.
Signature: [LowALU] TARGET [MidALU] dominant (83%)
Tags: mid_alu_{3a, 5a, 72, 7a, 7f, 92, b2, d2, da, f2, fa}
These are MidALU instructions that primarily receive input from LowALU — pipeline feed operations. All modifier variants (mod 1-7) of a small set of base opcodes.
Signature: [MidALU] TARGET [MidALU] dominant (75%)
The largest cluster. Contains most arithmetic, logic, and configuration tags, including 9 of the 11 unprobed targets (all except mid_alu_ad and mid_alu_d1). These sit in MidALU↔MidALU chains and are standard ALU pipeline instructions.
Signature: [LowALU] TARGET [HiALU] dominant (62%)
Tags: mid_alu_{78, 95, ad, b0, b8, f0, f8}
MidALU instructions that bridge LowALU setup to HiALU execution. mid_alu_ad (ALU_MID_STEP_M5) falls here — confirmed as pipeline step between source setup and ALU consumer.
Signature: [MidALU] TARGET [Reg] dominant (51%)
Tags: mid_alu_{34, b1, d1, d5}
MidALU instructions that almost always precede reg words. mid_alu_d1 (TG_FENCE_M6) confirmed here. mid_alu_34 is the high-volume anchor (101K occurrences). This cluster represents operations requiring register file metadata words.
Signature: [LowALU] TARGET [LowALU] dominant (78%)
Tags: mid_alu_{09, 0b, 50, 70, a0, a4}
MidALU instructions sandwiched between LowALU instructions. These are inline data or barrier-like tags that sit within LowALU instruction streams without disrupting them. mid_alu_a0 is the high-volume member (66K occurrences).
| Pair | Cosine |
|---|---|
| C1 ↔ C2 | 0.705 |
| C0 ↔ C1 | 0.691 |
| C1 ↔ C4 | 0.653 |
| C1 ↔ C3 | 0.604 |
| C0 ↔ C2 | 0.528 |
Clusters 3 (Reg Sync) and 4 (LowALU Interleave) are the most distinctive — lowest similarity to other clusters.
For every MidALU instruction in the corpus, extract base = tag[4:0] and modifier = tag[7:5]. Build:
- Base × Modifier frequency table
- Per base+mod: predecessor/successor form distribution
- Modifier transition matrix (consecutive MidALU pairs)
- Clause patterns (modifier sequences of length 2-3)
Tool: tools/modifier_matrix.go
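The bit split and the Base × Modifier table can be illustrated with a short Go sketch. It is not the code of the tool named above; `SplitTag`, `Table`, and `Count` are hypothetical names, and it assumes the hex suffix of a tag name (for example `d1` in mid_alu_d1) is the 8-bit tag value, which matches the Base and Modifier columns in the tables earlier in this document.

```go
package main

import "fmt"

// SplitTag splits an 8-bit MidALU tag into base = tag[4:0] and modifier = tag[7:5].
func SplitTag(tag uint8) (base, mod uint8) {
	return tag & 0x1f, tag >> 5
}

// Table is a Base x Modifier frequency table: 32 bases by 8 modifiers.
type Table [32][8]int

// Count tallies every tag in the stream into the table.
func Count(tags []uint8) Table {
	var t Table
	for _, tag := range tags {
		b, m := SplitTag(tag)
		t[b][m]++
	}
	return t
}

func main() {
	for _, tag := range []uint8{0xd1, 0x9b, 0x2a, 0xff} {
		b, m := SplitTag(tag)
		fmt.Printf("mid_alu_%02x -> base 0x%02x, mod %d\n", tag, b, m)
	}
	// Prints:
	// mid_alu_d1 -> base 0x11, mod 6
	// mid_alu_9b -> base 0x1b, mod 4
	// mid_alu_2a -> base 0x0a, mod 1
	// mid_alu_ff -> base 0x1f, mod 7
}
```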
Two tiers of base opcodes:
- Simple bases (0x00-0x0f): only mod0 + mod1 (plus occasional mod5). Core scalar ALU.
- Rich bases (0x10-0x1f): up to 8 modifiers. Complex pipeline ops (FMA, conversion, compare, fence).
Widest modifier spread: base 0x1a (459K instances, 7 modifiers), 0x12 (283K, all 8 mods), 0x14 (154K, 7 mods).
| prev \ next | mod0 | mod1 | mod2 | mod3 | mod4 | mod5 | mod6 | mod7 | n |
|---|---|---|---|---|---|---|---|---|---|
| mod0 | 64.1% | 24.6% | 1.9% | 1.6% | 2.4% | 3.7% | 0.9% | 0.9% | 957K |
| mod1 | 52.4% | 35.6% | 2.6% | 1.6% | 2.4% | 3.5% | 0.9% | 1.0% | 405K |
| mod2 | 59.9% | 33.8% | 3.0% | 0.8% | 0.8% | 1.2% | 0.3% | 0.3% | 67K |
| mod3 | 68.7% | 24.5% | 0.7% | 2.3% | 0.6% | 2.5% | 0.4% | 0.3% | 53K |
| mod4 | 57.6% | 22.9% | 1.4% | 0.8% | 11.8% | 4.0% | 0.4% | 1.1% | 42K |
| mod5 | 60.6% | 25.8% | 2.1% | 1.3% | 2.5% | 6.9% | 0.4% | 0.4% | 54K |
| mod6 | 58.0% | 38.3% | 1.2% | 0.3% | 0.2% | 0.7% | 1.2% | 0.1% | 29K |
| mod7 | 67.0% | 28.9% | 0.6% | 0.5% | 0.7% | 1.3% | 0.1% | 1.0% | 34K |
Key patterns:
- mod0→mod0 dominates (64.1%) — mod0 is the default scheduling slot
- mod1 is the secondary slot (23-38% as successor)
- mod4 has 11.8% self-affinity — runs in chains (branch-loop pattern)
- mod6→mod1 elevated (38.3%) — mod6 almost always transitions to mod0/mod1
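A row-normalized transition matrix of this kind can be computed with a few lines of Go. This is a sketch under assumptions: the input is taken to be one uninterrupted run of MidALU tags (how the original tool treats runs broken by other forms is not stated here), and `Transitions` and `RowPercent` are hypothetical names.

```go
package main

import "fmt"

// Transitions counts modifier pairs over a run of consecutive MidALU tags.
// counts[p][n] is how often a mod-p tag is directly followed by a mod-n tag.
func Transitions(run []uint8) (counts [8][8]int) {
	for i := 1; i < len(run); i++ {
		counts[run[i-1]>>5][run[i]>>5]++
	}
	return counts
}

// RowPercent converts one prev-modifier row into percentages of its total.
// An empty row is returned as all zeros.
func RowPercent(row [8]int) (pct [8]float64) {
	total := 0
	for _, c := range row {
		total += c
	}
	if total == 0 {
		return pct
	}
	for i, c := range row {
		pct[i] = 100 * float64(c) / float64(total)
	}
	return pct
}

func main() {
	// Made-up run whose modifiers are 0, 1, 0, 4, 4, 0.
	run := []uint8{0x03, 0x2a, 0x04, 0x9b, 0x9b, 0x03}
	counts := Transitions(run)
	fmt.Println(RowPercent(counts[0])) // row for prev = mod0
	fmt.Println(RowPercent(counts[4])) // row for prev = mod4
}
```

Each row is normalized by its own total, which is why every row of the matrix sums to about 100% regardless of how common that predecessor modifier is.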
The most informative base shows that the modifier encodes pipeline position:
| Modifier | LowALU prev | LowALU next | Reg next | Interpretation |
|---|---|---|---|---|
| mod0 | 11.3% | 28.1% | 9.3% | Standard middle-pipe |
| mod1 | 11.0% | 28.5% | 5.7% | Similar to mod0 |
| mod2 | 25.7% | 9.2% | 8.5% | More LowALU-fed |
| mod3 | 36.8% | 7.4% | 8.9% | Heavy LowALU source |
| mod4 | 40.9% | 9.2% | 12.8% | LowALU→Reg bridge |
| mod5 | 48.0% | 5.5% | 15.2% | Source setup stage |
| mod6 | 47.9% | 1.7% | 19.3% | Near-terminal |
| mod7 | 61.0% | 0.5% | 25.7% | Pipeline terminus |
mod7 is a pipeline terminus: 61% LowALU predecessors, 0.5% LowALU successors, 25.7% Reg successors. The modifier gradient encodes where in the LowALU→MidALU→HiALU clause the instruction sits.
Top patterns (consecutive MidALU modifier sequences):
| Pattern | Count |
|---|---|
| m0→m0 | 614K |
| m0→m1 | 236K |
| m1→m0 | 212K |
| m0→m0→m0 | 205K |
| m1→m1 | 144K |
| m0→m1→m0 | 69K |
| m1→m0→m1 | 52K |
The m0↔m1 alternation accounts for the bulk of scheduling. Higher modifiers (2-7) are specialist inserts in predominantly mod0/mod1 streams.
- Base 0x11 mod6 (mid_alu_d1 = TG_FENCE): 53.3% Reg next, 9.4% Branch prev — independently confirms register sync role from trigram clustering.
- Base 0x14 mod1: 74.7% MidALU prev, 35.0% Reg next — distinctive register pipeline stage.
- Base 0x10 mod7: 49.8% HiALU next, 30.6% Reg prev — HiALU feeder from register file.
- Modifier = pipeline position, not instruction variant. Higher modifiers sit closer to the pipeline terminus (LowALU source → Reg/HiALU sink).
- mod0/mod1 are the scheduling backbone: 88.7% of transitions out of mod0 (and 88.0% out of mod1) land on mod0 or mod1.
- mod4 is a loop-body specialist — 11.8% self-transition (branch-loop confirmed by mid_alu_9b chains).
- mod6/mod7 are pipeline-terminal — they feed into Reg words or HiALU consumers with minimal LowALU continuation.
- The AGX clause grammar is `(mod0|mod1)* [mod2-7_specialist]? (mod0|mod1)*`: specialist modifiers are injected into mod0/mod1 streams.
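The clause grammar stated in the last point can be checked mechanically by rendering a run of MidALU tags as one modifier digit per tag and matching it against a regular expression. The sketch below is illustrative only: `clauseRE`, `ModString`, and `MatchesClause` are hypothetical names, and it assumes the modifier is tag[7:5] as defined earlier.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// clauseRE encodes (mod0|mod1)* [mod2-7]? (mod0|mod1)* with one digit per tag.
var clauseRE = regexp.MustCompile(`^[01]*[2-7]?[01]*$`)

// ModString renders a run of MidALU tags as a string of modifier digits.
func ModString(tags []uint8) string {
	var sb strings.Builder
	for _, t := range tags {
		sb.WriteByte('0' + t>>5) // modifier = tag[7:5], always 0..7
	}
	return sb.String()
}

// MatchesClause reports whether the run contains at most one specialist
// modifier (mod2-7) inside an otherwise mod0/mod1 stream.
func MatchesClause(tags []uint8) bool {
	return clauseRE.MatchString(ModString(tags))
}

func main() {
	fmt.Println(MatchesClause([]uint8{0x03, 0x2a, 0xd1, 0x04})) // mods 0,1,6,0 -> true
	fmt.Println(MatchesClause([]uint8{0x9b, 0x9b, 0x03}))       // mods 4,4,0  -> false
}
```

The second call is worth noting: a mod4 self-chain such as mid_alu_9b → mid_alu_9b does not fit the single-specialist form as written, so the grammar is probably best read as describing the dominant case rather than every clause.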
- Tools: `tools/trigram_cluster.go`, `tools/trigram_kmeans.go`, `tools/modifier_matrix.go`
- Raw aux6 stats: `scratch/aux_field_corpus_stats.json`
- aux6 analysis: `scratch/aux6_corpus_analysis.md`