x86 Assembly Premium Tutorial & Quick Reference (2025 Edition)

This is a single-document reference for real-world x86 assembly programming, from the 8086 to modern x86-64 (including AVX-512, AVX10, and APX).

Part 1: 8086/8088 — The Eternal Foundation (1978–forever)

Everything you learn here still works in 16-bit real mode in 2025. It matters for legacy BIOS bootloaders and for understanding why the later modes look the way they do (UEFI firmware itself runs in protected or long mode).

Registers (8086)

AX - Accumulator (AH/AL)
BX - Base (BH/BL)
CX - Count (CH/CL)
DX - Data (DH/DL)

SI - Source Index
DI - Destination Index
BP - Base Pointer
SP - Stack Pointer

IP - Instruction Pointer
FLAGS

Segment registers (present in every mode; in real mode, physical address = segment × 16 + offset):
CS - Code Segment
DS - Data Segment
SS - Stack Segment
ES - Extra Segment
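
To make the real-mode role of the segment registers concrete, here is a minimal C model of how an 8086 forms a physical address from a segment:offset pair. The function name is made up for this sketch; the wrap at 1 MiB assumes an original 8086/8088 with 20 address lines.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model only; real_mode_address is a hypothetical name. */

/* Physical address an 8086 drives for segment:offset.
   The sum wraps at 1 MiB because the 8086 has only 20 address lines. */
uint32_t real_mode_address(uint16_t segment, uint16_t offset)
{
    uint32_t linear = ((uint32_t)segment << 4) + offset;
    return linear & 0xFFFFFu;
}
```

Note that many segment:offset pairs name the same byte: 07C0:0000 and 0000:7C00 both resolve to the boot-sector load address 0x7C00.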

Memory addressing modes (8086)

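[bx + si]           [bp + di]           [si]
[bx + di]           [bp + si]           [di]
[bx]                [bp]                [disp16]   ; direct address
[bx + disp8]        [bp + disp8]
[bx + disp16]       [bp + disp16]

The two-register and SI/DI forms can also carry a disp8 or disp16, e.g. [bx + si + 4].
BP-based forms default to the SS segment; all others default to DS.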

Essential 8086 instructions (most used even in 64-bit code)

MOV  dest, src
PUSH src
POP  dest
ADD  dest, src
SUB  dest, src
ADC  dest, src      ; add with carry
SBB  dest, src      ; subtract with borrow
INC  dest
DEC  dest
CMP  dest, src      ; affects flags only
AND  dest, src
OR   dest, src
XOR  dest, src
NOT  dest
TEST dest, src      ; AND but no write

JMP  label
JE/JZ, JNE/JNZ, JA/JNBE, JB/JNAE, JAE/JNB, JBE/JNA
JG/JNLE, JL/JNGE, JGE/JNL, JLE/JNG
JC, JNC, JO, JNO, JS, JNS, JP/JPE, JNP/JPO

LOOP, LOOPE/LOOPZ, LOOPNE/LOOPNZ

CALL near/far
RET  near/far

INT  n
IRET

SHL/SAL, SHR, SAR    ; shifts
ROL, ROR, RCL, RCR   ; rotates

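MUL  src             ; unsigned: AX = AL*src (8-bit) or DX:AX = AX*src (16-bit)
IMUL src             ; signed multiply, same implicit operands
DIV  src             ; unsigned: AX/src → AL, rem AH; or DX:AX/src → AX, rem DX
IDIV src             ; signed divide, same implicit operands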

LEA  reg, [mem]      ; load effective address (very important!)
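
MUL and DIV are the instructions in this list whose operands are mostly implicit, which trips up newcomers. The following is a rough C model of the 16-bit forms, assuming the documented behavior (product in DX:AX, dividend in DX:AX, #DE on divide-by-zero or quotient overflow). The struct and function names are invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model only; dx_ax, mul16 and div16 are hypothetical names. */

/* The DX:AX register pair, as used by 16-bit MUL and DIV. */
struct dx_ax { uint16_t dx, ax; };

/* MUL r/m16: DX:AX = AX * src (unsigned). */
struct dx_ax mul16(uint16_t ax, uint16_t src)
{
    uint32_t product = (uint32_t)ax * src;
    struct dx_ax r = { (uint16_t)(product >> 16), (uint16_t)product };
    return r;
}

/* DIV r/m16: AX = DX:AX / src, DX = DX:AX % src (unsigned).
   Returns false where the CPU would raise #DE: divide by zero,
   or a quotient that does not fit in 16 bits. */
bool div16(struct dx_ax *regs, uint16_t src)
{
    uint32_t dividend = ((uint32_t)regs->dx << 16) | regs->ax;
    if (src == 0 || dividend / src > 0xFFFFu)
        return false;
    regs->ax = (uint16_t)(dividend / src);
    regs->dx = (uint16_t)(dividend % src);
    return true;
}
```

The overflow case is why code normally clears DX (or sign-extends into it) before dividing: a stale DX makes the dividend far larger than intended.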

FLAGS register bits (still identical in 64-bit)

CF - Carry          PF - Parity
AF - Auxiliary      ZF - Zero
SF - Sign           OF - Overflow
IF - Interrupt      DF - Direction
TF - Trap (single-step)
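
The split between the unsigned jumps (JA/JB) and the signed jumps (JG/JL) follows directly from which flags they read. A small C sketch of CMP on 16-bit operands, with hypothetical helper names, assuming the standard flag definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model only; all names here are hypothetical. */
struct flags { bool cf, zf, sf, of; };

/* Flags produced by a 16-bit CMP a, b (computes a - b, discards the result). */
struct flags cmp16(uint16_t a, uint16_t b)
{
    uint16_t r = (uint16_t)(a - b);
    struct flags f;
    f.cf = a < b;                                 /* unsigned borrow */
    f.zf = (r == 0);
    f.sf = (r & 0x8000u) != 0;                    /* sign bit of the result */
    f.of = (((a ^ b) & (a ^ r)) & 0x8000u) != 0;  /* signed overflow */
    return f;
}

/* Unsigned conditions read CF/ZF; signed conditions read SF/OF/ZF. */
bool cond_ja(struct flags f) { return !f.cf && !f.zf; }
bool cond_jb(struct flags f) { return f.cf; }
bool cond_jg(struct flags f) { return !f.zf && f.sf == f.of; }
bool cond_jl(struct flags f) { return f.sf != f.of; }
```

Comparing 0xFFFF with 1 shows the difference: JA is taken (65535 > 1 unsigned) and so is JL (-1 < 1 signed), from the same flags.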

Part 2: 80386 — The 32-bit Revolution (1985)

Everything changes here. This is where modern x86 begins.

New 32-bit registers (386+)

EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP
EIP
EFLAGS

New segment registers (386+, usable in every mode)

FS, GS          ; these become priceless in 64-bit OSdev (thread-local and per-CPU base pointers)

Revolutionary new addressing mode (32-bit)

[base + index*scale + disp32]

Examples:
[eax + ecx*4 + 8]
[ebx + esi*8 - 1234h]
[ebp + 16]                 ; classic stack frame access

Scale can be 1, 2, 4, 8
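
As a sanity check on the formula, here is a minimal C model of the 32-bit effective-address calculation. The function name is hypothetical; the arithmetic wraps modulo 2^32, as it does in 32-bit mode.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model only; effective_address32 is a hypothetical name. */

/* base + index*scale + disp, truncated to 32 bits. */
uint32_t effective_address32(uint32_t base, uint32_t index,
                             unsigned scale, int32_t disp)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * (uint32_t)scale + (uint32_t)disp;
}
```

Because LEA computes the same expression without touching memory, it doubles as cheap arithmetic: with base = index = x and scale 4, the result is x*5.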

New instructions added in 386

MOVZX reg32, reg8/reg16      ; zero-extend
MOVSX reg32, reg8/reg16      ; sign-extend

BSF  reg, src                ; bit scan forward (find first set bit)
BSR  reg, src                ; bit scan reverse

SHLD dest, src, count        ; double precision shift left
SHRD dest, src, count        ; double precision shift right

CMOVcc reg, src              ; conditional move (added later, with the Pentium Pro/P6; huge for branchless code)

SETcc reg8                   ; set byte if condition true
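
The instructions above are easier to remember once you see what each one computes. The following C functions are a rough model with invented names. It assumes 32-bit operands, and BSF/BSR are only modeled for a nonzero source, because the real destination is not reliable when the source is zero.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model only; all function names are hypothetical. */

/* MOVZX / MOVSX from an 8-bit source to a 32-bit destination. */
uint32_t movzx8(uint8_t src) { return src; }
uint32_t movsx8(uint8_t src) { return (uint32_t)(int32_t)(int8_t)src; }

/* BSF / BSR: index of the lowest / highest set bit (source must be nonzero). */
unsigned bsf32(uint32_t src)
{
    assert(src != 0);
    unsigned i = 0;
    while (!(src & 1u)) { src >>= 1; i++; }
    return i;
}

unsigned bsr32(uint32_t src)
{
    assert(src != 0);
    unsigned i = 31;
    while (!(src & 0x80000000u)) { src <<= 1; i--; }
    return i;
}

/* SHLD dest, src, count: shift dest left, filling from the top of src. */
uint32_t shld32(uint32_t dest, uint32_t src, unsigned count)
{
    count &= 31;                 /* the CPU masks the count to 5 bits */
    if (count == 0)
        return dest;             /* avoids an undefined shift by 32 in C */
    return (dest << count) | (src >> (32 - count));
}

/* SETcc / CMOVcc style branchless select: a if cond is true, else b. */
uint32_t select32(int cond, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);  /* all ones or zero */
    return (a & mask) | (b & ~mask);
}
```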

386+ Control registers

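CR0 - PE (bit 0), PG (bit 31), etc.
CR2 - Linear address that caused the last page fault
CR3 - Page directory base (paging)
CR4 - PSE, PAE, PGE, etc. (added with the Pentium generation, not the 386 itself)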
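
To show how these bits and the CR3 page directory fit together, here is a small C sketch. The macro and function names are made up; the bit positions are the architectural ones, and the address split assumes classic two-level 386 paging with 4 KiB pages and no PAE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model only; names are hypothetical, bit positions are architectural. */
#define CR0_PE  (1u << 0)    /* protected mode enable */
#define CR0_PG  (1u << 31)   /* paging enable */
#define CR4_PSE (1u << 4)    /* 4 MiB pages */
#define CR4_PAE (1u << 5)    /* physical address extension */
#define CR4_PGE (1u << 7)    /* global pages */

/* Paging needs protected mode: PG is only valid together with PE. */
bool paging_enabled(uint32_t cr0)
{
    return (cr0 & (CR0_PE | CR0_PG)) == (CR0_PE | CR0_PG);
}

/* Two-level paging: 10-bit directory index, 10-bit table index,
   12-bit offset inside the 4 KiB page. */
struct linear_parts { unsigned dir, table, offset; };

struct linear_parts split_linear(uint32_t linear)
{
    struct linear_parts p;
    p.dir    = linear >> 22;
    p.table  = (linear >> 12) & 0x3FFu;
    p.offset = linear & 0xFFFu;
    return p;
}
```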

Part 3: x86-64 (AMD64) — The Modern Era (2003–today)

Register explosion — the best thing that ever happened to assembly

RAX RBX RCX RDX RSI RDI RBP RSP
R8  R9  R10 R11 R12 R13 R14 R15

RIP
RFLAGS

You now have 16 GP registers — life is beautiful.

Calling conventions

Windows x64 (Microsoft)

  • RCX, RDX, R8, R9 = first 4 integer args
  • RAX = return
  • XMM0-XMM3 = first 4 float args
  • Caller reserves 32 bytes of shadow space above the return address and cleans the stack (no more stdcall headache)
  • Callee must preserve: RBX, RBP, RDI, RSI, RSP, R12-R15, XMM6-XMM15
  • Stack aligned to 16 bytes before CALL

System V AMD64 (Linux, macOS, BSD)

  • RDI, RSI, RDX, RCX, R8, R9 = first 6 integer args
  • RAX = return
  • XMM0-XMM7 = first 8 float args
  • Callee must preserve: RBX, RBP, R12-R15
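  • Stack aligned to 16 bytes before CALL
  • Red zone: the 128 bytes below RSP are not clobbered by signal handlers, so leaf functions can use them without adjusting RSP (kernel code must not rely on it)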
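
The two register assignments are easy to mix up, so here is a small C lookup that mirrors the lists above. The function names are invented for this sketch, and it only covers integer/pointer arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative lookup only; all function names are hypothetical. */

/* Register that carries the n-th (0-based) integer/pointer argument,
   or NULL when that argument goes on the stack. */
const char *sysv_int_arg(unsigned n)
{
    static const char *const regs[] = { "RDI", "RSI", "RDX", "RCX", "R8", "R9" };
    return n < 6 ? regs[n] : NULL;
}

const char *win64_int_arg(unsigned n)
{
    static const char *const regs[] = { "RCX", "RDX", "R8", "R9" };
    return n < 4 ? regs[n] : NULL;
}

/* True if a System V callee must restore the register before returning. */
int sysv_callee_saved(const char *reg)
{
    static const char *const saved[] = { "RBX", "RBP", "R12", "R13", "R14", "R15" };
    for (size_t i = 0; i < sizeof saved / sizeof saved[0]; i++)
        if (strcmp(reg, saved[i]) == 0)
            return 1;
    return 0;
}
```

One difference the table hides: Windows assigns registers by argument position (the third argument uses R8 or XMM2 whatever the earlier types were), while System V counts integer and floating-point arguments separately.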

New instructions in x86-64 (long mode)

REX prefix enables:

  • 64-bit operands (RAX etc.)
  • R8-R15
  • Extended low-byte registers (SIL, DIL, BPL, SPL)
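
To see what the REX prefix does at the byte level, the sketch below encodes a register-to-register 64-bit MOV. The function names are hypothetical; the layout (0100WRXB, opcode 89 /r, ModRM with mod = 11) is the standard encoding, and register numbers are assumed to follow the usual order RAX = 0, RCX = 1, RDX = 2, RBX = 3, ..., R8 = 8, ..., R15 = 15.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative encoder only; all function names are hypothetical. */

/* REX prefix layout: 0100 W R X B.
   W = 64-bit operand size, R extends ModRM.reg,
   X extends SIB.index, B extends ModRM.rm or SIB.base. */
uint8_t rex_prefix(int w, int r, int x, int b)
{
    return (uint8_t)(0x40 | (w ? 8 : 0) | (r ? 4 : 0) | (x ? 2 : 0) | (b ? 1 : 0));
}

/* ModRM byte for a register-to-register form (mod = 11). */
uint8_t modrm_reg_reg(unsigned reg, unsigned rm)
{
    return (uint8_t)(0xC0 | ((reg & 7u) << 3) | (rm & 7u));
}

/* Encode "mov dst, src" (Intel operand order) for 64-bit registers 0-15. */
void encode_mov_r64_r64(uint8_t out[3], unsigned dst, unsigned src)
{
    assert(dst < 16 && src < 16);
    out[0] = rex_prefix(1, src >= 8, 0, dst >= 8);
    out[1] = 0x89;                       /* MOV r/m64, r64 */
    out[2] = modrm_reg_reg(src, dst);    /* reg = source, rm = destination */
}
```

With this model, mov rax, rbx comes out as 48 89 D8 and mov r14, rdx as 49 89 D6: the only trace of R14 being an "extended" register is the B bit in the prefix.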

Must-know x86-64 instructions

MOVABS rax, 0x123456789ABCDEF0   ; MOV r64, imm64 is the only instruction with a 64-bit immediate (MOVABS is the GAS spelling; NASM/MASM write plain MOV)
LEA    rax, [rip + label]        ; RIP-relative (position-independent code)

CQO                              ; sign-extend RAX → RDX:RAX (for 128-bit divide)
CDQE                             ; sign-extend EAX → RAX

NOP                              ; short NOP = 90h
Multi-byte NOPs for alignment:
    0F 1F 40 00          ; 4-byte NOP
    0F 1F 84 00 00 00 00 00   ; 8-byte NOP
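
Whether a constant needs the long imm64 form of MOV comes down to sign and zero extension, the same mechanics behind CDQE and CQO. A minimal C sketch of those rules, with made-up function names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model only; all function names are hypothetical. */

/* Most 64-bit instructions accept at most a 32-bit immediate,
   which the CPU sign-extends to 64 bits. */
bool fits_sign_extended_imm32(int64_t value)
{
    return value >= INT32_MIN && value <= INT32_MAX;
}

/* A 32-bit MOV zero-extends, so unsigned 32-bit values are also cheap.
   Only constants outside both ranges need MOV r64, imm64. */
bool needs_imm64(int64_t value)
{
    return !fits_sign_extended_imm32(value) && (uint64_t)value > UINT32_MAX;
}

/* CDQE: sign-extend EAX into RAX. */
int64_t cdqe(int32_t eax) { return (int64_t)eax; }

/* CQO: RDX becomes 64 copies of RAX's sign bit. */
uint64_t cqo_rdx(int64_t rax) { return rax < 0 ? UINT64_MAX : 0; }
```

By this rule a constant such as -1 (all ones) does not need the 10-byte encoding, while 0x123456789ABCDEF0 does.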

RIP-relative addressing — mandatory for PIC code

lea rax, [rip + symbol]      ; GAS .intel_syntax spelling; NASM writes [rel symbol], or put "default rel" at the top of the file
mov rax, [rip + symbol]
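
The displacement in a RIP-relative operand is measured from the address of the next instruction, not the current one. A short C sketch of that calculation and its inverse (what an assembler or linker has to solve); function names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model only; all function names are hypothetical. */

/* Address referenced by [rip + disp32] for an instruction of the given
   length located at insn_address. */
uint64_t rip_relative_target(uint64_t insn_address, unsigned insn_length,
                             int32_t disp32)
{
    uint64_t next_rip = insn_address + insn_length;
    return next_rip + (uint64_t)(int64_t)disp32;
}

/* Displacement needed to reach target; fails if it is out of the
   signed 32-bit range (more than about 2 GiB away). */
bool rip_relative_disp(uint64_t insn_address, unsigned insn_length,
                       uint64_t target, int32_t *disp32)
{
    int64_t delta = (int64_t)(target - (insn_address + insn_length));
    if (delta < INT32_MIN || delta > INT32_MAX)
        return false;
    *disp32 = (int32_t)delta;
    return true;
}
```

For a 7-byte lea rdx, [rip + 0x32c0145] placed at 0x401000, the target is 0x401007 + 0x32c0145 = 0x36c114c.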

Part 4: SIMD Evolution — The Real Performance Story

MMX (1997) — obsolete but still exists

8 × 64-bit MM0-MM7 registers (aliases of x87 FPU)

SSE / SSE2 (1999-2001) — still mandatory

XMM0-XMM7 in 32-bit mode, XMM0-XMM15 in x86-64
128-bit registers

Essential SSE2 (integer + double)

MOVDQA xmm, xmm/m128         ; aligned move
MOVDQU xmm, xmm/m128         ; unaligned move
MOVAPS/MOVUPS                ; float versions

PADDD xmm, xmm/m128          ; add dwords
PSUBD
PCMPEQD                      ; compare equal → all 1s or 0s
PCMPGTD                      ; greater than

PAND, POR, PXOR

MOVD   xmm, r32/m32
MOVQ   xmm, r64/m64
PEXTRW, PINSRW               ; word extract/insert (the byte, dword and qword forms need SSE4.1)

PSLLD, PSRLD, PSRAD          ; shifts
PSHUFD xmm, xmm/m128, imm8   ; shuffle
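
If the packed-integer mnemonics feel opaque, it helps to write out what they do lane by lane. The following is a scalar C model of a few of them, treating one XMM register as four 32-bit lanes. The type and function names are invented for this sketch; real code would use the instructions or compiler intrinsics instead.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative scalar model only; all names are hypothetical. */

/* One XMM register viewed as four 32-bit lanes (lane 0 = lowest bits). */
typedef struct { uint32_t lane[4]; } xmm_u32;

/* PADDD: lane-wise wrapping add. */
xmm_u32 paddd(xmm_u32 a, xmm_u32 b)
{
    xmm_u32 r;
    for (int i = 0; i < 4; i++) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* PCMPEQD: each lane becomes all ones where equal, zero otherwise. */
xmm_u32 pcmpeqd(xmm_u32 a, xmm_u32 b)
{
    xmm_u32 r;
    for (int i = 0; i < 4; i++) r.lane[i] = (a.lane[i] == b.lane[i]) ? 0xFFFFFFFFu : 0u;
    return r;
}

/* PSHUFD: destination lane i takes the source lane selected by
   bits [2i+1:2i] of imm8. */
xmm_u32 pshufd(xmm_u32 src, uint8_t imm8)
{
    xmm_u32 r;
    for (int i = 0; i < 4; i++) r.lane[i] = src.lane[(imm8 >> (2 * i)) & 3];
    return r;
}

/* PAND + PANDN + POR: branchless per-lane select driven by a compare mask. */
xmm_u32 select_by_mask(xmm_u32 mask, xmm_u32 if_set, xmm_u32 if_clear)
{
    xmm_u32 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = (if_set.lane[i] & mask.lane[i]) | (if_clear.lane[i] & ~mask.lane[i]);
    return r;
}
```

The compare-then-select pattern is the SIMD replacement for an if/else: PCMPEQD or PCMPGTD builds the mask, and the logical ops pick a value per lane.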

SSE4.1 / SSE4.2 (2007-2008)

PINSRB/PINSRD/PINSRQ
PEXTRB/PEXTRD/PEXTRQ
PMOVZX, PMOVSX               ; zero/sign extend
PCMPESTRI, PCMPISTRI, etc.   ; string compare (SSE4.2)

AVX / AVX2 (2011-2013) — 256-bit

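YMM0-YMM15 (the upper 128 bits are reachable only through VEX-encoded instructions; run VZEROUPPER before legacy SSE code to avoid transition penalties)
VMOVAPS, VMOVDQA, etc.
VPADDD, VPSUBD, etc.         ; 256-bit integer forms need AVX2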
VPBROADCASTD ymm, xmm        ; very useful
VPERMD ymm, ymm, ymm         ; full permute (AVX2)

AVX-512 (2016-2025) — The Beast

ZMM0-ZMM31 (512-bit)
K0-K7 mask registers
EVEX prefix

Most useful AVX-512 instructions (with AVX-512VL they also work on XMM and YMM registers)

VPTERNLOGD zmm, zmm, zmm, imm8    ; any boolean operation (magic!)
VPROLD/VPROLQ                     ; rotate by immediate
VPERMI2D                          ; full permute
VPCOMPRESSD, VPEXPANDD
VPMOVDB, VPMOVQB, etc.            ; down-convert by truncation (VPMOVS*/VPMOVUS* variants saturate)
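
VPTERNLOGD looks like magic until you treat imm8 as an 8-entry truth table. The C sketch below models it on a single 32-bit lane, together with the truncating versus saturating down-converts and a mask-driven compress. All names are hypothetical, and the model assumes the first (destination) operand supplies the most significant index bit, as the instruction is documented.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative scalar model only; all function names are hypothetical. */

/* VPTERNLOGD on one lane: at each bit position the bits of a, b, c form
   a 3-bit index (a is the MSB) and the result bit is that bit of imm8. */
uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8)
{
    uint32_t r = 0;
    for (int bit = 0; bit < 32; bit++) {
        unsigned idx = (((a >> bit) & 1u) << 2)
                     | (((b >> bit) & 1u) << 1)
                     |  ((c >> bit) & 1u);
        r |= (uint32_t)((imm8 >> idx) & 1u) << bit;
    }
    return r;
}

/* VPMOVDB keeps the low byte; VPMOVUSDB clamps to 255 instead. */
uint8_t vpmovdb_lane(uint32_t x)   { return (uint8_t)x; }
uint8_t vpmovusdb_lane(uint32_t x) { return x > 0xFFu ? 0xFFu : (uint8_t)x; }

/* VPCOMPRESSD: pack the lanes whose mask bit is set to the front of dst.
   Returns how many lanes were written. */
unsigned compress_u32(uint32_t *dst, const uint32_t *src,
                      unsigned lanes, uint32_t kmask)
{
    assert(lanes <= 32);
    unsigned out = 0;
    for (unsigned i = 0; i < lanes; i++)
        if ((kmask >> i) & 1u)
            dst[out++] = src[i];
    return out;
}
```

Useful truth tables to remember: 0x96 is a three-way XOR, 0xCA is a bitwise select (a ? b : c), and 0xE8 is the majority function.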

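AVX10 and APX (2024 onward; Intel Granite Rapids and later)

  • AVX10 repackages the AVX-512 feature set under a single version number; AVX10.1 first shipped on Intel Granite Rapids (2024)
  • The original AVX10 proposal allowed 256-bit-only implementations that still expose all 32 vector registers and the mask registers; later revisions of the specification changed this, so check Intel's current documentation before relying on it
  • APX (Advanced Performance Extensions) is a separate extension: 32 GP registers (R16-R31), reached through the new REX2 prefix or extended EVEX encodings
  • APX hardware has been announced for the 2025-2026 timeframe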

Ultimate Quick Reference Cheat Sheet (2025)

Registers you actually have in x86-64:

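System V caller-saved (volatile): RAX, RCX, RDX, RSI, RDI, R8-R11, XMM0-XMM15
System V callee-saved (non-volatile): RBX, RBP, RSP, R12-R15
Windows x64 caller-saved (volatile): RAX, RCX, RDX, R8-R11, XMM0-XMM5
Windows x64 callee-saved (non-volatile): RBX, RBP, RSP, RSI, RDI, R12-R15, XMM6-XMM15
Scratch: R10, R11 often used as temps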

Best registers for specific jobs:

RAX - return value, implicit operand of multiplication/division
RCX - loop counter, shift count in CL (Windows: 1st arg, System V: 4th arg)
RDX - high half of MUL results, remainder of DIV
RBX - callee-saved, good for values that must survive calls
RBP - optional frame pointer (use -fno-omit-frame-pointer for debugging)
RSP - stack pointer (keep 16-byte aligned!)
R12-R15 - best for holding long-lived values (structs, this pointer, etc.)
R8-R9 - 3rd/4th Windows args, 5th/6th System V args, otherwise great temps
R10-R11 - pure scratch, compilers love them

Fastest way to zero a register

XOR EAX, EAX    ; 2 bytes; the 32-bit write zero-extends, so all of RAX is cleared, and CPUs treat it as a dependency-breaking zeroing idiom
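
The reason a 32-bit XOR clears the whole 64-bit register is the x86-64 rule for partial writes: 32-bit destinations zero the upper half, while 16-bit and 8-bit destinations leave the rest of the register alone. A minimal C model, with hypothetical function names:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model only; all function names are hypothetical. */

/* Writing a 32-bit register (e.g. EAX) zeroes bits 63:32 of the full register. */
uint64_t write_r32(uint64_t old_r64, uint32_t value)
{
    (void)old_r64;               /* the old contents do not survive */
    return value;
}

/* Writing a 16-bit register (e.g. AX) preserves bits 63:16. */
uint64_t write_r16(uint64_t old_r64, uint16_t value)
{
    return (old_r64 & ~0xFFFFull) | value;
}

/* Writing a low 8-bit register (e.g. AL) preserves bits 63:8. */
uint64_t write_r8(uint64_t old_r64, uint8_t value)
{
    return (old_r64 & ~0xFFull) | value;
}
```

The preserved bits in the 8-bit and 16-bit cases are also why those writes carry a dependency on the old register value, and why MOVZX into a 32-bit register is usually preferred over a bare byte load.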

Fastest way to sign-extend

MOVSXD RAX, ECX        ; RAX = sign-extended ECX

Best way to load address (PIC)

LEA RAX, [RIP + symbol]

Best way to call function (System V)

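call    symbol         ; direct call (GAS: symbol@PLT when calling into a shared library from PIC code)
call    rax            ; indirect call through a register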

Alignment for speed (2025 CPUs)

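section .data align=64 ; aligned ZMM loads/stores (VMOVAPS) need 64-byte-aligned memory
buffer: times 64 db 0

section .text
align 64               ; cache-line-align a hot entry point
function:
    vmovaps zmm0, [rel buffer]   ; will not fault: buffer is 64-byte aligned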
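
An aligned vector load such as VMOVAPS on a ZMM register faults unless its memory operand is 64-byte aligned, and the ABI wants RSP 16-byte aligned at calls, so the same power-of-two arithmetic shows up constantly. A small C sketch with hypothetical helper names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative helpers only; all function names are hypothetical. */

bool is_power_of_two(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Round value up to the next multiple of alignment (a power of two). */
uint64_t align_up(uint64_t value, uint64_t alignment)
{
    assert(is_power_of_two(alignment));
    return (value + alignment - 1) & ~(alignment - 1);
}

/* Round value down, as "and rsp, -16" does for the stack pointer. */
uint64_t align_down(uint64_t value, uint64_t alignment)
{
    assert(is_power_of_two(alignment));
    return value & ~(alignment - 1);
}

bool is_aligned(uint64_t address, uint64_t alignment)
{
    assert(is_power_of_two(alignment));
    return (address & (alignment - 1)) == 0;
}
```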

Memory ordering (rarely needed)

MFENCE                 ; full memory barrier
LFENCE                 ; load fence
SFENCE                 ; store fence
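
Fences are rarely needed because x86 already orders ordinary loads and stores strongly. The C11 sketch below shows the usual publish/consume pattern. On x86-64, compilers typically turn the release store and acquire load into plain MOVs, and only the sequentially consistent fence becomes an MFENCE or a LOCK-prefixed instruction. The variable and function names are made up, and the example is exercised single-threaded here purely to show the API shape.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative sketch only; payload, ready and the function names are hypothetical. */
static int payload;
static atomic_bool ready;

/* Producer: write the data, then publish it with a release store. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer: an acquire load of the flag makes the payload write visible. */
bool try_consume(int *out)
{
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}

/* Full barrier: the portable way to ask for what MFENCE provides. */
void full_barrier(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}
```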

AT&T syntax

AT&T syntax (also called GAS syntax) is the default syntax of the GNU assembler (as) and of GCC/Clang inline assembly, so it is what most toolchain output on Linux, macOS, and BSD looks like.

Intel syntax (NASM, MASM, etc.) writes the exact same instructions like this (Go's assembler is a third dialect, derived from Plan 9):

AT&T syntax (GAS)          | Intel syntax (NASM/MASM)     | Meaning
mov %rdx, %r14             | mov r14, rdx                 | r14 ← rdx
mov %r14, %rdx             | mov rdx, r14                 | rdx ← r14
mov (%r14), %eax           | mov eax, [r14]               | eax ← dword at address in r14 (dereference)
movl $0x6, 0x30(%rsp)      | mov dword [rsp + 0x30], 6    | store 6 into stack slot 0x30 bytes above rsp
mov %eax, 0x20(%rsp)       | mov dword [rsp + 0x20], eax  | store eax into stack slot
lea 0x32c0145(%rip), %rdx  | lea rdx, [rip + 0x32c0145]   | rdx ← address of label/symbol (RIP-relative)
lea 0x20(%rsp), %rcx       | lea rcx, [rsp + 0x20]        | rcx ← address of stack slot (SIB form)

Complete AT&T vs Intel Syntax Cheat Sheet (2025)

Feature           | AT&T (GAS) syntax           | Intel syntax                                | Notes
Operand order     | src, dest                   | dest, src                                   | Opposite!
Register prefix   | %rax                        | rax                                         |
Immediate prefix  | $0x123                      | 0x123                                       |
Memory reference  | 0x30(%rsp) or (%r14)        | [rsp + 0x30] or [r14]                       |
Size suffix       | movb, movw, movl, movq      | none (inferred, or explicit byte ptr etc.)  | movq = 64-bit
RIP-relative      | symbol(%rip)                | [rip + symbol] (NASM: [rel symbol])         | Identical in meaning
Comments          | # comment, /* comment */    | ; comment                                   | // works only in .S files run through the C preprocessor

All Addressing Modes

Every x86-64 addressing mode, shown explicitly in both syntaxes:

Mode                                 | AT&T syntax                   | Intel syntax                       | Description
Register                             | %rax                          | rax                                |
Immediate                            | $0x123                        | 0x123                              |
Direct (absolute)                    | 0x403000                      | [0x403000]                         | Rare in 64-bit
Register indirect                    | (%rax)                        | [rax]                              |
Base + displacement                  | 0x20(%rsp)                    | [rsp + 0x20]                       |
Base + index                         | (%rax,%rcx)                   | [rax + rcx]                        |
Base + index × scale                 | (%rax,%rcx,4)                 | [rax + rcx*4]                      | scale = 1,2,4,8
Base + index × scale + displacement  | 0x123(%rax,%rcx,8)            | [rax + rcx*8 + 0x123]              | Full SIB
RIP-relative (64-bit only)           | symbol(%rip) or 0x123(%rip)   | [rip + symbol] or [rip + 0x123]    | Default & mandatory for PIC code
Scaled index only (rare)             | (,%rcx,8)                     | [rcx*8]                            | Note the leading comma in AT&T
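
The mapping between the two notations is mechanical, which a tiny formatter makes obvious. The C sketch below prints the full base + index × scale + displacement form both ways. The function names are hypothetical, and it deliberately handles only that one mode with a non-negative displacement.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative formatter only; both function names are hypothetical. */

/* AT&T: disp(%base,%index,scale) */
int format_att(char *buf, size_t size, const char *base, const char *index,
               unsigned scale, unsigned long disp)
{
    return snprintf(buf, size, "0x%lx(%%%s,%%%s,%u)", disp, base, index, scale);
}

/* Intel: [base + index*scale + disp] */
int format_intel(char *buf, size_t size, const char *base, const char *index,
                 unsigned scale, unsigned long disp)
{
    return snprintf(buf, size, "[%s + %s*%u + 0x%lx]", base, index, scale, disp);
}
```

Reading AT&T operands gets easier once you see that the parenthesized part always holds (base, index, scale) in that order, with the displacement in front.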

Some Examples

mov    (%r14),%eax          # eax = *(uint32_t*)r14
movl   $0x6,0x30(%rsp)      # *(uint32_t*)(rsp + 0x30) = 6
mov    %eax,0x20(%rsp)      # *(uint32_t*)(rsp + 0x20) = eax
movl   $0x6,0x50(%rsp)      # *(uint32_t*)(rsp + 0x50) = 6

lea    0x32c0145(%rip),%rdx # rdx = address of next instruction + 0x32c0145 → address of global/static data
lea    0x20(%rsp),%rcx      # rcx = rsp + 0x20 → address of local stack variable

These patterns show up throughout the disassembly of Linux/macOS binaries compiled by GCC/Clang in 2025.

Final 2025 Reality Check

  • GCC, Clang/LLVM, GNU as, objdump, and the Linux kernel sources default to AT&T syntax; Go uses its own Plan 9-style syntax, and Rust's asm! macro defaults to Intel syntax
  • Windows (MSVC), NASM, YASM, FASM, most hand-written asm tutorials: use Intel syntax
  • Modern compilers typically use RIP-relative addressing for global data in 64-bit code (both syntaxes)
  • You need to read both syntaxes fluently to work with real-world binaries and write high-performance code in 2025.

Final Words

You now have the lineage of x86 assembly from 1978 to 2025 and a practical reference for using it.

The most important truth in 2025:

Modern x86-64 assembly is not slow. When you need absolute maximum performance, hand-written assembly with AVX-512 (and APX once hardware ships) can still beat compiler output in hot loops, provided you measure before and after.
