@docularxu
Last active February 19, 2026 14:10

Intel's Heterogeneous Core SIMD/AVX Roadmap and the Scheduling Problem

Research Report — February 2026

Audience: Senior kernel engineers (particularly RISC-V), for architectural comparison


Executive Summary

Intel's introduction of hybrid (big.LITTLE-style) x86 CPUs starting with Alder Lake in 2021 created an unprecedented problem in the x86 ecosystem: heterogeneous ISA support across cores within a single package. The P-cores (Golden Cove) supported AVX-512 while the E-cores (Gracemont) did not. Intel's handling of this — disabling AVX-512 across the entire chip — and their multi-year journey toward AVX10 as a solution provides critical lessons for any ISA designer considering heterogeneous core designs. This report covers the technical details, the OS scheduling implications, and comparisons with Arm, Apple, and RISC-V approaches.


1. AVX-512 on Hybrid Architectures

1.1 Background: The ISA Asymmetry Problem

AVX-512 was introduced with Knights Landing (2016) and Skylake-SP (2017). By 2020, Intel's client Tiger Lake and Rocket Lake chips supported AVX-512 on all cores. Then came the hybrid architecture.

1.2 Alder Lake (12th Gen, Nov 2021)

Architecture: Golden Cove P-cores + Gracemont E-cores

The Problem: Golden Cove P-cores had full AVX-512 execution units (two 256-bit FMA units fusible into one 512-bit unit, as in Sunny Cove/Willow Cove). Gracemont E-cores had no AVX-512 support whatsoever — they topped out at AVX2 (256-bit).

The Mechanism — Multiple Layers of Disabling:

  1. BIOS/Firmware (primary mechanism): On hybrid SKUs with both P-cores and E-cores, the BIOS was configured to not enumerate AVX-512 in CPUID. The OS never sees AVX-512 as available. This was done through microcode/firmware configuration at boot time, not physical fuses.

  2. Microcode coordination: The processor's microcode, working with the BIOS, ensured that CPUID reported a consistent ISA across all cores. Since x86 has historically assumed ISA homogeneity, the simplest solution was to report only the intersection of capabilities (i.e., what both P-cores and E-cores support = AVX2).

  3. NOT a hard fuse-off: This is a critical distinction. The AVX-512 execution hardware was physically present and functional on Golden Cove P-cores. It was disabled in software/firmware, not blown fuses.

The E-core Disable Hack:

Early Alder Lake adopters discovered that on desktop SKUs (LGA 1700), if you:

  • Disabled all E-cores in BIOS
  • Used specific BIOS versions from motherboard vendors (notably ASUS, MSI, Gigabyte)

then the system would report AVX-512 via CPUID, and AVX-512 code ran correctly.

This was widely documented and benchmarked. Alder Lake P-cores running AVX-512 performed comparably to Rocket Lake on AVX-512 workloads. Some motherboard vendors even added explicit "AVX-512" toggle options in BIOS (with E-cores auto-disabled).

Intel's Response: Intel explicitly stated this was unsupported. They pushed microcode updates that removed the ability to enable AVX-512 even with E-cores disabled on some platforms, though enforcement was inconsistent. The company's official position was that Alder Lake does not support AVX-512, period.

1.3 Raptor Lake (13th Gen, Oct 2022)

Architecture: Raptor Cove P-cores + Gracemont E-cores

Raptor Cove is a derivative of Golden Cove, so the P-cores still had AVX-512 execution units physically present. The same firmware-level disabling applied. The E-core disable hack still worked on many boards. Nothing fundamentally changed from Alder Lake regarding AVX-512.

1.4 Arrow Lake (Core Ultra 200S, Oct 2024)

Architecture: Lion Cove P-cores + Skymont E-cores

Key Change: Skymont E-cores gained significant SIMD improvements but still did not support AVX-512 with 512-bit vector widths. However, Skymont did gain support for many AVX-512 instructions at 256-bit width (the EVEX encoding, masking registers k0-k7, and many AVX-512 sub-extensions). This was essentially the precursor to the AVX10.1/256 concept.

Lion Cove P-cores retained full AVX-512 execution units internally. However, AVX-512 (512-bit) was still disabled at the platform level. The official ISA extensions listed for Arrow Lake are: SSE4.1, SSE4.2, AVX2.

The Lunar Lake Wikipedia page confirms extensions listed as "SSE4.1, SSE4.2, AVX2" — same story.

Important nuance: While the marketing says "AVX2," both Lion Cove and Skymont actually support a significant subset of AVX-512 instructions at 128/256-bit widths via EVEX encoding. Intel calls this AVX10.1/256 internally, but didn't market it as such for Arrow Lake desktop.

1.5 Lunar Lake (Core Ultra 200V, Sep 2024)

Architecture: Lion Cove P-cores + Skymont E-cores (mobile-only, 4P+4E)

Same ISA situation as Arrow Lake. AVX-512 at 512-bit widths disabled. Both core types support the AVX-512 instruction encodings at 256-bit width.

1.6 Summary Table

| Generation | P-core | E-core | AVX-512 HW in P-core? | AVX-512 HW in E-core? | AVX-512 Enabled? | Mechanism |
|---|---|---|---|---|---|---|
| Alder Lake (12th Gen, Nov 2021) | Golden Cove | Gracemont | Yes (full 512-bit) | No | No (hackable) | BIOS/microcode CPUID masking |
| Raptor Lake (13th Gen, Oct 2022) | Raptor Cove | Gracemont | Yes (full 512-bit) | No | No (hackable) | BIOS/microcode CPUID masking |
| Arrow Lake (Core Ultra 200S, Oct 2024) | Lion Cove | Skymont | Yes (full 512-bit) | 256-bit EVEX only | No | BIOS/microcode; E-cores have partial support |
| Lunar Lake (Core Ultra 200V, Sep 2024) | Lion Cove | Skymont | Yes (full 512-bit) | 256-bit EVEX only | No | Same as Arrow Lake |

2. AVX10: Intel's Proposed Solution

2.1 Motivation

The core problem: x86 has no mechanism for an OS to schedule threads based on ISA capabilities. CPUID is global — it describes the processor, not individual cores. When cores have different capabilities, the only options are:

  1. Report the intersection (lose features) — what Intel did
  2. Report the union and let software crash on the wrong core — unacceptable
  3. Build OS awareness of per-core ISA — massive ecosystem change
  4. Make all cores support the same ISA — what AVX10 aims for
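Option 1, the path Intel took, amounts to AND-ing the per-core capability sets before anything is reported to software. A minimal sketch (the feature sets below are illustrative, not real CPUID output):

```python
from functools import reduce

# Hypothetical per-core feature sets: P-cores add AVX-512 on top of AVX2,
# E-cores stop at AVX2.
P_CORE = {"sse4.2", "avx", "avx2", "avx512f", "avx512vl"}
E_CORE = {"sse4.2", "avx", "avx2"}

def visible_isa(per_core_features):
    """Option 1: the package advertises only the intersection of all cores."""
    return reduce(lambda a, b: a & b, per_core_features)

print(sorted(visible_isa([P_CORE, E_CORE, P_CORE, E_CORE])))
# the AVX-512 entries disappear from what the OS can ever see
```

Note that the loss is silent: software has no way to learn that some cores could have done more.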

2.2 AVX10.1

Announced: July 2023

Key Design Principles:

  • Versioned ISA: Instead of the combinatorial explosion of AVX-512 sub-extensions (AVX-512F, AVX-512BW, AVX-512DQ, AVX-512VL, AVX-512VNNI, AVX-512VBMI, AVX-512VBMI2, AVX-512BITALG, AVX-512VPOPCNTDQ, AVX-512FP16, AVX-512BF16, AVX-512IFMA, etc.), AVX10 introduces a single version number. AVX10.1 includes a specific fixed set of instructions equivalent to the union of extensions found in Sapphire Rapids.

  • Width specification: AVX10 comes in two flavors:

    • AVX10.1/256: All AVX10.1 instructions at up to 256-bit vector width. Uses EVEX encoding, 32 vector registers (ZMM0-31 accessible as YMM/XMM), 8 mask registers. This is what E-cores can support.
    • AVX10.1/512: All AVX10.1 instructions at up to 512-bit vector width. This is what P-cores (and server chips) support.
  • CPUID enumeration: New CPUID leaf specifically for AVX10. Reports: (a) AVX10 version number, (b) maximum supported vector width (256 or 512).

  • No new instructions over AVX-512: AVX10.1 is a reorganization, not an extension. It defines a convergence point.

Initial specification note: In early drafts (2023), 512-bit support was optional. Intel later revised this to make 512-bit mandatory for AVX10-capable processors, with the intent to bring 512-bit support to E-cores as well. This was a significant policy reversal driven by developer feedback — the community pushed back hard against fragmenting the width.
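The versioned enumeration can be sketched as a small decoder. The bit layout below follows early AVX10 specification drafts (CPUID leaf 24H, EBX[7:0] = version, EBX[17]/EBX[18] = 256/512-bit width bits); verify against the current Intel programming reference before relying on it:

```python
# Decoding an assumed AVX10 enumeration leaf value (CPUID leaf 0x24,
# sub-leaf 0). Bit positions are from early AVX10 spec drafts and may
# differ in the final specification.
def decode_avx10_ebx(ebx):
    if ebx & (1 << 18):
        width = 512
    elif ebx & (1 << 17):
        width = 256
    else:
        width = 128
    return {"version": ebx & 0xFF, "max_width": width}

# A hypothetical E-core-limited package: AVX10 version 1, 256-bit max.
print(decode_avx10_ebx((1 << 17) | 1))  # {'version': 1, 'max_width': 256}
```

The point of the single (version, width) pair is that dispatch code tests two scalar values instead of a dozen AVX-512 sub-extension flags.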

2.3 AVX10.2

Published: Late 2024 (specification updates)

AVX10.2 adds new instructions on top of AVX10.1:

  • YMM-embedded rounding and SAE (Suppress All Exceptions): Previously, embedded rounding and SAE were only available for 512-bit instructions. AVX10.2 extends this to 256-bit operations — a significant usability improvement for code that wants to control rounding mode without modifying MXCSR.
  • New minmax instructions for packed floating-point
  • Saturating conversion instructions (e.g., convert FP to integer with saturation)
  • BF16 and FP16 enhancements
  • New media instructions for video encode/decode acceleration

2.4 How AVX10 Solves the Heterogeneous ISA Problem

The theory:

  1. Both P-cores and E-cores implement AVX10.1 (at minimum)
  2. P-cores implement AVX10.1/512 (or higher)
  3. E-cores implement AVX10.1/256 (initially) → eventually AVX10.1/512
  4. CPUID reports the minimum supported width across all cores
  5. Software that uses AVX10.1/256 can run on any core
  6. Software that uses AVX10.1/512 needs to be pinned to P-cores (or all cores need /512)

Intel's stated goal (as of late 2024/2025): Bring 512-bit execution to E-cores so that the entire chip can report AVX10/512 uniformly. This eliminates the scheduling problem entirely.
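Until all cores report /512, step 6 means pinning wide-vector tasks by hand. A Linux-only sketch using `os.sched_setaffinity`; the P-core ID range is assumed, since no portable interface exposes which logical CPUs are P-cores:

```python
import os

# Hypothetical topology: logical CPUs 0-7 are P-core threads, 8+ E-cores.
# Real code would have to discover this (e.g. from sysfs), not hard-code it.
P_CORES = set(range(8))

def pin_to_p_cores(pid=0):
    """Restrict a task that uses AVX10/512 to the cores assumed to implement it."""
    available = os.sched_getaffinity(pid)
    target = (available & P_CORES) or available  # fall back if no overlap
    os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)

print(pin_to_p_cores())
```

This is exactly the fragile arrangement AVX10/512-everywhere is meant to eliminate: the pinning is advisory, manual, and breaks if the topology assumption is wrong.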

2.5 Timeline and CPU Support

| CPU | Expected AVX10 | Vector Width | Status |
|---|---|---|---|
| Granite Rapids (server) | AVX10.1/512 (effectively, via AVX-512) | 512-bit | Shipping |
| Arrow Lake (client) | AVX10.1/256 (de facto, not marketed) | 256-bit | Shipping |
| Lunar Lake (client) | AVX10.1/256 (de facto) | 256-bit | Shipping |
| Panther Lake (client) | AVX10.1/256 likely | 256-bit | Launched Jan 2026 |
| Nova Lake (client) | AVX10.x/512 (goal) | 512-bit all cores? | TBD ~2027 |
| Diamond Rapids (server) | AVX10.2/512 | 512-bit | Expected 2025-2026 |

Note on Panther Lake: The Wikipedia listing for Panther Lake shows extensions as "SSE4, AVX, AVX2, AVX-VNNI, AVX-IFMA" — no explicit AVX10 or AVX-512 mention. The Cougar Cove P-cores and Darkmont E-cores likely support AVX10.1/256 internally. The E-cores (Darkmont) are an evolution of Skymont, which already had EVEX encoding support at 256-bit.


3. Linux Kernel Scheduler and ISA Heterogeneity

3.1 The Fundamental Assumption: ISA Homogeneity

The Linux kernel scheduler (CFS/EEVDF) fundamentally assumes all CPUs in a system can execute the same instructions. There is no mechanism in struct rq, struct task_struct, or the load balancing code to say "this task uses AVX-512 and can only run on cores 0-7."

When a process executes an instruction not supported by the current core, the CPU generates a #UD (undefined instruction) exception, which Linux delivers as SIGILL. There is no trap-and-migrate mechanism.
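The delivery path can be demonstrated from userspace. A sketch that installs a SIGILL handler and raises the signal directly (a stand-in for the CPU faulting on an unsupported instruction, which would be unsafe to do for real here):

```python
import os
import signal

# Linux delivers the CPU's #UD fault to the process as SIGILL; there is
# no kernel path that migrates the task to a capable core instead.
caught = []

def on_sigill(signum, frame):
    caught.append(signum)  # a real handler could only log, fix up, or abort

signal.signal(signal.SIGILL, on_sigill)
os.kill(os.getpid(), signal.SIGILL)  # stand-in for executing an unsupported insn
print("caught SIGILL:", caught == [signal.SIGILL])
```

The handler has no way to say "retry this instruction on a P-core" — the signal API has no notion of capability-based migration.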

3.2 Intel Thread Director (ITD) / Hardware Feedback Interface (HFI)

Hardware Feedback Interface (HFI) is an Intel hardware feature (introduced with Alder Lake) that provides per-core performance and energy efficiency hints to the OS via a shared memory table.

What HFI provides:

  • Per-core performance capability (0-255 scale)
  • Per-core energy efficiency (0-255 scale)
  • These values are dynamic — they change based on thermal conditions, power limits, and workload characteristics
  • Updated by hardware via an interrupt when values change

Intel Thread Director (ITD) is the hardware classification engine that feeds into HFI:

  • Monitors instruction mix per-thread in hardware
  • Classifies workloads into categories (scalar, vectorized, FP-heavy, etc.)
  • Adjusts HFI hints based on which core type would be best for the current workload

Linux support:

  • HFI driver merged in Linux 5.18 (2022)
  • intel_hfi driver exposes performance/efficiency data
  • The intel_pstate driver uses HFI data for frequency scaling
  • The scheduler uses it via the Energy-Aware Scheduling (EAS) framework and Preferred Core ranking

What HFI/ITD does NOT do:

  • Does not expose per-core ISA capabilities. HFI says "core 4 is 80% as performant as core 0" but does NOT say "core 4 lacks AVX-512."
  • Does not prevent scheduling a task to an incompatible core. If AVX-512 were somehow enabled on P-cores only, the kernel would happily migrate an AVX-512 task to an E-core, resulting in SIGILL.
  • Does not understand instruction-level requirements. ITD classifies workload type but doesn't enforce ISA compatibility.

3.3 Why Intel Chose to Disable Rather Than Schedule

Given the above, Intel had no viable option but to disable AVX-512 on hybrid chips:

  1. No per-core CPUID: CPUID is architecturally defined as returning the same value on all cores. Changing this would break a fundamental x86 assumption.
  2. No ISA-aware scheduling in Linux (or Windows): Neither OS had any mechanism to restrict tasks to cores based on instruction usage. Building this would require:
    • Hardware trapping of unsupported instructions (not just #UD but with migration capability)
    • Kernel support for ISA-capability-based affinity
    • Userspace ABI changes
  3. Binary compatibility nightmare: Existing AVX-512 binaries couldn't know they needed core pinning. Library code (glibc, OpenSSL, etc.) uses runtime CPUID detection — if CPUID says AVX-512, code uses it everywhere.
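The probe-once-cache-forever pattern described in point 3 looks like this in spirit (the feature set is assumed for illustration; glibc and OpenSSL do the equivalent in C via CPUID at load time):

```python
# Runtime dispatch as done (in spirit) by glibc/OpenSSL: probe once at
# startup, cache the result, and select an implementation for the whole
# process lifetime. DETECTED is an assumed stand-in for a CPUID probe.
DETECTED = {"avx2", "avx512f"}  # cached once; never re-checked per core

def dot_avx512(a, b): return sum(x * y for x, y in zip(a, b))  # stand-in kernel
def dot_avx2(a, b):   return sum(x * y for x, y in zip(a, b))  # stand-in kernel

dot = dot_avx512 if "avx512f" in DETECTED else dot_avx2

print(dot([1, 2, 3], [4, 5, 6]))  # 32
```

Because the choice is made once for the whole process, on a hypothetical hybrid chip that honestly reported AVX-512, the AVX-512 path would eventually run on an E-core and fault.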

3.4 Potential Approaches Considered (and Why They're Hard)

Approach A: Trap-and-migrate

  • E-core encounters AVX-512 instruction → #UD → kernel catches it → migrates task to P-core → resumes
  • Problems: High latency for migration, complex state management, AVX-512 state (ZMM16-31, opmask registers) doesn't exist on E-core so can't be saved, instruction may be in a tight loop causing constant migrations

Approach B: Per-core CPUID

  • Let CPUID return different values on different cores
  • Problems: Breaks every piece of x86 software that caches CPUID results at startup. Glibc does this. Every JIT compiler does this. Total ecosystem breakage.

Approach C: ISA-affinity cpumask

  • Extend task_struct with ISA requirements, auto-detected from instruction usage
  • Problems: Detection requires decode of the instruction stream or runtime tracking. Massive overhead. False negatives (code that conditionally uses AVX-512).

3.5 The Windows Perspective

Windows 11 added hybrid-aware scheduling specifically for Alder Lake:

  • Thread Director hints feed into the Windows scheduler
  • Windows uses a "Heterogeneous Policy" to classify threads (Efficiency, Neutral, Performance)
  • But Windows similarly does NOT have ISA-aware scheduling — it relies on Intel disabling AVX-512

4. Comparison with Other Approaches

4.1 Arm big.LITTLE — ISA Homogeneity by Design

Key insight: Arm mandates ISA homogeneity across all cores in a big.LITTLE or DynamIQ cluster. A Cortex-A720 (big) and a Cortex-A520 (little) in the same SoC must support exactly the same ISA, including SIMD extensions (NEON/ASIMD, SVE if present).

How Arm achieves this:

  • The Arm Architecture Reference Manual specifies the ISA independently of microarchitecture
  • All cores implement the same Armv9.x profile
  • NEON/ASIMD is mandatory for all AArch64 implementations
  • SVE/SVE2 are optional, but if present, must be on ALL cores (or none)
  • The vector length may differ between big and little cores, but the instruction set is the same
  • SVE's vector-length-agnostic programming model means code works regardless of width

SVE heterogeneous vector length handling: When big and little cores have different SVE vector lengths, the Linux kernel handles this gracefully:

  1. Intersection, not minimum: At boot, each core probes its full set of supported SVE vector lengths (e.g., CPU0: {128, 256, 512}, CPU1: {128, 256}). The kernel computes the intersection of all cores' sets (→ {128, 256}). Userspace can choose from any VL in this intersection, not just the minimum. (fpsimd.c: vec_update_vq_map() uses bitmap_and())

  2. Per-task VL via ZCR_EL1: The kernel writes ZCR_EL1 on every context switch to set the current task's requested VL. Different processes can use different vector lengths simultaneously (e.g., process A uses VL=256, process B uses VL=128), as long as they are within the intersection. This is not a one-time system-wide setting.

  3. Userspace control: Processes request a specific VL via prctl(PR_SVE_SET_VL, ...). The requested VL is rounded down to the nearest supported VL in the intersection.

  4. Hot-plug safety: If a CPU comes online late and doesn't support a VL already committed in the intersection, that CPU is killed (cpu_die_early()). The system continues running; only the non-conforming CPU is excluded.
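Steps 1 and 3 above can be sketched as two pure functions — intersect the per-CPU sets, then round a requested VL down within that intersection (values in bits, mirroring the example CPUs above):

```python
# Sketch of the arm64 kernel's SVE vector-length policy: intersect each
# CPU's supported VL set, then round a prctl(PR_SVE_SET_VL) request down
# to the nearest length present in the intersection.
def vl_intersection(per_cpu_vls):
    common = set.intersection(*(set(v) for v in per_cpu_vls))
    return sorted(common)

def round_down_vl(request, common):
    candidates = [vl for vl in common if vl <= request]
    return max(candidates) if candidates else min(common)

common = vl_intersection([{128, 256, 512}, {128, 256}])  # -> [128, 256]
print(round_down_vl(512, common))                        # -> 256
```

The kernel's actual implementation uses bitmaps over VL indices rather than sets, but the policy is the same.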

This is directly relevant to RISC-V: RISC-V currently lacks a ZCR_EL1 equivalent, i.e. a CSR writable by S-mode that limits the effective VLEN visible to U-mode. Adding such a CSR would enable the same per-task, intersection-based approach for heterogeneous-VLEN cores.

Scheduling: The Linux EAS (Energy-Aware Scheduling) framework handles Arm big.LITTLE purely as a performance/power optimization. No ISA concerns.

Lesson: Arm solved this problem at two levels: ISA homogeneity as an architectural requirement, and ZCR_EL1 as a hardware mechanism for the kernel to manage vector length heterogeneity per-task.

4.2 Apple M-series — Same Approach as Arm

Apple's M1-M4 chips pair performance (Firestorm/Avalanche/Everest) and efficiency (Icestorm/Blizzard/Sawtooth) cores in a custom heterogeneous design (Apple builds its own cores rather than using Arm's DynamIQ clusters). Consistent with Arm's homogeneity mandate:

  • All cores support the same ISA (NEON, AMX is separate accelerator)
  • Both P-cores and E-cores support the same SIMD width (128-bit NEON)
  • Apple's AMX (matrix coprocessor) is accessible from any core via system registers
  • macOS scheduler uses QoS classes to route threads, but any thread can run on any core

Key difference from Intel: Apple never had to sacrifice SIMD capability for heterogeneity because Arm's NEON is compact enough to implement everywhere.

4.3 RISC-V — The Emerging Question

RISC-V has the most interesting situation because its modular ISA design explicitly allows heterogeneity:

  • V extension (vector): Optional. Different cores could theoretically have different VLEN.
  • Discovery: RISC-V uses device tree or ACPI to describe per-hart capabilities. Unlike x86, per-core ISA discovery is architecturally supported.
  • The misa CSR and various extension CSRs can differ per hart.

Current Linux RISC-V approach:

  • The kernel builds an "ISA string" for each hart from device tree
  • riscv_isa_extension_check() tests per-hart capabilities
  • The hwprobe syscall allows userspace to query what extensions are available
  • But: The scheduler does not yet use ISA information for placement decisions

The RISC-V opportunity:

  • RISC-V could implement ISA-aware scheduling because the architecture was designed for heterogeneity
  • Device tree naturally describes per-hart capabilities
  • The hwprobe() interface could be extended to expose which harts support which extensions
  • A cpumask per ISA-extension could enable affinity-based scheduling
  • RVV's vector-length-agnostic model (like SVE) means code can run on cores with different VLEN
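The per-extension cpumask idea can be sketched directly, since RISC-V already has per-hart capability data. Everything below is hypothetical — the hart-to-extension map would really come from the device tree / hwprobe, and no such scheduler interface exists today:

```python
# Hypothetical per-hart capability map for a big.LITTLE RISC-V SoC:
# harts 0-1 are "big" cores with RVV, harts 2-3 are "little" cores without.
HART_EXTS = {
    0: {"i", "m", "a", "v"},
    1: {"i", "m", "a", "v"},
    2: {"i", "m", "a"},
    3: {"i", "m", "a"},
}

def cpumask_for(ext):
    """Build the set of harts implementing a given extension."""
    return {hart for hart, exts in HART_EXTS.items() if ext in exts}

print(cpumask_for("v"))  # a scheduler could constrain RVV-using tasks to these harts
```

A kernel implementation would maintain one such cpumask per extension and AND it into a task's allowed-CPU mask once the task is known (or declared) to use that extension.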

Open question for RISC-V designers: If you build a big.LITTLE RISC-V SoC where big cores have RVV 1.0 and little cores don't, do you:

  1. Follow Arm's lead and mandate ISA homogeneity? (Simplest)
  2. Build ISA-aware scheduling infrastructure in Linux? (Most flexible but hard)
  3. Follow Intel's approach and report only the intersection? (Wastes capability)

4.4 Comparison Summary

| Aspect | Intel x86 | Arm big.LITTLE | Apple M-series | RISC-V |
|---|---|---|---|---|
| ISA homogeneity required? | No (by accident) | Yes (by design) | Yes (inherits Arm) | No (by design) |
| Per-core ISA discovery | No (CPUID is global) | N/A (all same) | N/A | Yes (device tree) |
| Scheduling ISA-aware? | No | N/A | N/A | Not yet, but possible |
| SIMD width heterogeneity | Yes (problem) | SVE lengths can vary | No | VLEN can vary (VLA helps) |
| Solution to heterogeneity | Disable to intersection | Don't allow | Don't allow | TBD |

5. Panther Lake / Nova Lake / Future Intel

5.1 Panther Lake (Core Ultra Series 3, Jan 2026)

From the Wikipedia article and Chips and Cheese ITT 2025 coverage:

Architecture:

  • P-cores: Cougar Cove (evolution of Lion Cove)
  • E-cores: Darkmont (evolution of Skymont)
  • LP E-cores: Darkmont (low-power variant)
  • Process: Intel 18A (compute tile), Intel 3 / TSMC N3E (GPU tile), TSMC N6 (platform controller)

Configurations:

  • Low-power: 4 P-cores + 4 LP E-cores
  • Mid: 4P + 8E + 4LP (16 cores total)
  • High-end: 4P + 8E + 4LP + larger GPU

SIMD/AVX status: Listed extensions are "SSE4, AVX, AVX2, AVX-VNNI, AVX-IFMA, AES-NI, SHA-NI" — no AVX-512 or explicit AVX10. This strongly suggests:

  • Cougar Cove P-cores likely still have AVX-512 execution hardware (evolutionary from Lion Cove)
  • Darkmont E-cores support AVX-512 instruction encodings at 256-bit (like Skymont)
  • But the platform still does NOT advertise 512-bit support
  • De facto AVX10.1/256

Key observation: Intel has been shipping AVX10.1/256-equivalent hardware since Arrow Lake (Oct 2024) without marketing it as AVX10. The new EVEX encoding features, mask registers, and many AVX-512 instruction subsets are available at 256-bit across all cores. They just call it "AVX2" in marketing materials.

5.2 Nova Lake (~2027-2028)

Very limited information available. Nova Lake is expected to succeed Panther Lake for client:

  • Likely to use next-generation P-cores and E-cores
  • This is where Intel may finally bring 512-bit execution to E-cores, enabling AVX10/512 across the entire chip
  • Would be the first hybrid Intel client chip to officially support 512-bit vectors
  • But this is speculative — no confirmed details

5.3 Diamond Rapids (Server, ~2025-2026)

  • Server counterpart, P-core only (no E-cores)
  • Expected to support AVX10.2/512
  • No heterogeneity problem since server parts don't use E-cores

5.4 Clearwater Forest (E-core Server, ~2025)

  • Interesting case: all E-cores (Darkmont-based)
  • Will likely support AVX10.1/256 (since all cores are the same type)
  • ISA-homogeneous by being all-E-core

6. The "Wasted Silicon" Argument

6.1 How Much Die Area Does AVX-512 Consume?

This is one of the most discussed but least precisely answered questions. Some data points:

AVX-512 execution unit area estimates:

The AVX-512 execution hardware primarily consists of:

  • 512-bit FMA units: Two 256-bit FMA units that can fuse into one 512-bit unit (on Golden Cove / Raptor Cove / Lion Cove)
  • 512-bit shuffle/permute networks
  • Extended register file: ZMM0-ZMM31 (32 × 512-bit = 2KB just for architectural state, much more for physical register file with rename)
  • Mask register file: k0-k7 (8 × 64-bit)

Estimates from die analysis and industry sources:

  1. Chips and Cheese / die shot analysis: The vector execution units (including AVX-512 support) are estimated to occupy roughly 10-15% of the P-core area. A Golden Cove P-core is approximately 3.5-4mm² on Intel 7. The vector/SIMD portion is perhaps 0.4-0.6mm² per core.

  2. The incremental cost argument: The more relevant question is not "how much area does the vector unit take" but "how much extra area does 512-bit support cost over 256-bit AVX2?" The incremental cost of widening from 256-bit to 512-bit is:

    • Wider datapaths (doubled in the fusion case)
    • Larger physical register file (more rename registers, each twice as wide)
    • Wider shuffle/permute networks
    • Estimated incremental cost: ~5-8% of P-core area
  3. For a typical Alder Lake die (8P+8E):

    • 8 P-cores × ~0.5mm² wasted AVX-512 area ≈ 4mm²
    • Total die area: ~215mm²
    • Wasted area: ~1.5-2% of total die
  4. For Arrow Lake (Lion Cove P-cores):

    • Lion Cove is designed for AVX-512. The 512-bit datapath is integral to the core design.
    • Removing it would save area but would require a different core design
    • The 6 P-cores × perhaps 0.4mm² (on TSMC N3) ≈ 2.4mm² on the compute tile
    • Compute tile is estimated at ~50-60mm² — so ~4-5% of compute tile area is "wasted" AVX-512 capability

6.2 Power, Not Area, Is the Real Cost

The area argument is somewhat misleading. The more significant costs of unused AVX-512 hardware are:

  1. Leakage power: Transistors leak even when unused. The AVX-512 units contribute to idle power draw.
  2. Design complexity: Supporting AVX-512 in the microarchitecture constrains other design decisions (pipeline width, register file organization).
  3. Validation cost: The AVX-512 execution paths must be validated even though they're disabled on client parts.
  4. Opportunity cost: The silicon could have been used for more cache, wider issue, or other features.

6.3 Why Intel Doesn't Just Remove It

Intel uses the same P-core design across client and server products:

  • Lion Cove appears in both Arrow Lake (client, AVX-512 disabled) and Granite Rapids (server, AVX-512 enabled)
  • Designing two variants of the core — one with and one without AVX-512 — would double the design and validation cost
  • The economics favor a single core design with runtime disable

This is the same logic as GPU CU disabling, PCIe lane disabling, etc. — ship one die, harvest defects and segment products through configuration.


7. Lessons for RISC-V ISA Designers

7.1 What Intel Got Wrong

  1. Assumed ISA homogeneity would persist: x86's CPUID was never designed for per-core ISA variation. This architectural debt made hybrid designs painful.

  2. No OS interface for ISA-aware scheduling: Neither the hardware (CPUID) nor the OS (scheduler) had mechanisms to handle ISA differences. Intel had to disable features rather than expose them selectively.

  3. Customer expectations vs. hardware capability: Users buying P-core silicon with AVX-512 hardware but unable to use it caused significant backlash.

  4. Multi-year gap between problem and solution: Alder Lake shipped in Nov 2021. AVX10 was announced in July 2023. Actual universal 512-bit support across all cores may not arrive until ~2027 (Nova Lake). That's a 6-year gap.

7.2 What RISC-V Can Learn

  1. Design per-hart ISA discovery from day one ✓ (already done — device tree, hwprobe)

  2. Consider mandating ISA homogeneity for SMP systems — Arm's approach is simpler and avoids the problem entirely. The RISC-V Profiles specification moves in this direction by defining mandatory extension sets.

  3. If allowing ISA heterogeneity, build scheduling infrastructure early:

    • Extend hwprobe() to support per-CPU queries
    • Implement ISA-aware cpumask constraints in the scheduler
    • Define trap-and-migrate behavior for unsupported extensions
    • Consider making the V extension's VLEN the only permitted axis of variation
  4. Vector-length agnosticism is key: Both RVV and SVE use VLA (vector-length agnostic) programming. This means code works correctly regardless of VLEN, even if performance varies. This is far superior to x86's fixed-width model where 512-bit code simply doesn't work on a 256-bit core.

  5. Don't ship disabled hardware for years: If E-cores can't do wide vectors, either don't put wide vector units in P-cores, or invest in making E-cores capable. Intel's multi-year gap of shipping (and paying for) unused silicon was economically suboptimal.
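The strip-mining pattern behind point 4 is worth making concrete. A scalar sketch of a vector-length-agnostic loop in the RVV/SVE style — each iteration asks "how many elements this time?" (the `vsetvli` step in RVV), so the same code is correct whether a core's maximum is 4 elements or 16:

```python
# Strip-mined loop in the RVV/SVE vector-length-agnostic style.
# vlmax_elems stands in for the hardware's maximum vector length; the
# result is identical regardless of its value, only the trip count changes.
def vla_add(a, b, vlmax_elems):
    out, i = [], 0
    while i < len(a):
        vl = min(vlmax_elems, len(a) - i)  # the "vsetvli" step
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl
    return out

data = list(range(10))
assert vla_add(data, data, 4) == vla_add(data, data, 16)  # width-independent result
print(vla_add(data, data, 4))
```

Contrast with fixed-width AVX-512 code, which does not degrade to more iterations on a narrower core — it simply faults.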


8. Timeline Summary

| Date | Event |
|---|---|
| 2016 | AVX-512 ships in Knights Landing |
| 2017 | AVX-512 in Skylake-SP server |
| 2020 | AVX-512 in Tiger Lake client (all cores identical) |
| 2021 Q1 | AVX-512 in Rocket Lake client (all cores identical, last hurrah) |
| 2021 Q4 | Alder Lake ships — AVX-512 disabled on hybrid parts |
| 2022 Q4 | Raptor Lake ships — same situation |
| 2023 Jul | Intel announces AVX10 specification |
| 2024 Q3 | Lunar Lake ships — Lion Cove + Skymont, AVX10.1/256 de facto |
| 2024 Q4 | Arrow Lake desktop ships — same de facto AVX10.1/256 |
| 2024 | AVX10.2 specification published |
| 2026 Q1 | Panther Lake ships — Cougar Cove + Darkmont, still AVX10.1/256 |
| ~2025-26 | Diamond Rapids server expected — AVX10.2/512 |
| ~2027-28 | Nova Lake client — potential AVX10/512 on all cores (speculative) |

References and Sources

  • Wikipedia: AVX-512, Advanced Vector Extensions (AVX10 section), Alder Lake, Arrow Lake, Lunar Lake, Panther Lake articles
  • Chips and Cheese: "Panther Lake's Reveal at ITT 2025" (Oct 2025), "Interviewing Intel's Chief Architect of x86 Cores"
  • Intel Architecture Instruction Set Extensions Programming Reference (AVX10 specification documents)
  • Linux kernel source: arch/x86/kernel/cpu/, drivers/thermal/intel/intel_hfi.c
  • Linux kernel HFI/ITD support: merged in 5.18+ (intel_hfi driver)
  • RISC-V ISA specification: Volume I (unprivileged), Vector Extension 1.0
  • Various community documentation on Alder Lake AVX-512 enable hacks (2021-2022)

Report compiled February 2026. Information about unreleased products (Nova Lake, Diamond Rapids) is based on available leaks and Intel public statements; details may change.
