@bradfitz
Last active March 7, 2026 15:43
(Claude Code; Opus 4.6)

Windows Stack Memory Corruption Investigation

Summary

Runtime stack memory corruption on Windows amd64 causes a DEP violation (Exception 0xc0000005 code 0x8) when a goroutine returns to a corrupted return address. The corruption always overwrites the high 32 bits of a return address, replacing a valid code pointer (e.g. 0x00007ff6XXXXXXXX) with a value whose high 32 bits are a small number (e.g. 0x00000010XXXXXXXX). The low 32 bits are preserved. The resulting value looks like a normal heap/stack address, as if part of one had been written over the return address's upper dword.

The crash was initially reported as a Go 1.26 regression, but testing showed it also reproduces with Go 1.25.0 and Go tip (master). It may have become more frequent in 1.26 due to changes in binary layout or stack usage patterns.

Reproduction

The crash reliably reproduces by running tailscale.com/tsnet tests on Windows amd64 with -test.count=3:

GOOS=windows GOARCH=amd64 go test -c -o tsnet_test.exe ./tsnet/
tsnet_test.exe -test.timeout=90s -test.count=3

The crash typically occurs during TestConn or later tests, in a goroutine running derpserver.(*sclient).run which reads DERP frames via ReadFrameHeader. The crashing goroutine was created by net/http.(*Server).Serve for an HTTPS DERP connection.

The crash was tested on:

  • Windows 11 build 26200 (12th Gen Intel i7-1255U, hybrid P+E cores)
  • GitHub Actions Windows runners (Azure VMs, various CPUs)

What We Know For Sure

The corruption pattern

  • Always the high 32 bits of an 8-byte return address on the goroutine stack
  • The corrupted return address was pushed by a CALL instruction (to ReadFrameHeader)
  • The goroutine's RET pops this corrupted value and jumps to it, causing DEP violation
  • The value written over the high 32 bits is a normal-range address (not a special constant)
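
A quick way to check whether a crashing PC fits this pattern is to compare it with the return address that should have been on the stack. The following is a minimal sketch; the helper name matchesPattern and the sample addresses are made up for illustration and are not part of the runtime or of the original investigation:

```go
package main

import "fmt"

// matchesPattern reports whether got looks like want with only its upper
// dword replaced: identical low 32 bits, different high 32 bits.
// (Hypothetical helper for illustration only.)
func matchesPattern(want, got uint64) bool {
	return uint32(want) == uint32(got) && want>>32 != got>>32
}

func main() {
	want := uint64(0x00007ff612345678) // made-up in-image return address
	got := uint64(0x0000001012345678)  // made-up crashing PC

	fmt.Printf("want=0x%016x got=0x%016x match=%v\n", want, got, matchesPattern(want, got))
}
```

A match only says the crash is consistent with a 4-byte store landing on the upper half of the slot; it says nothing about who performed the store.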

Async preemption is required

  • GODEBUG=asyncpreemptoff=1 prevents the crash (test times out instead)
  • This was the first and most definitive finding

Stack growth is involved

  • Setting stackMin=4096 (or higher) prevents the crash (test times out instead)
  • The default stackMin=2048 allows the crash
  • The crashing goroutine consistently has stackcopycount of 10-12, meaning its stack was copied/grown 10-12 times during its lifetime
  • The combination of async preemption + small stacks + stack growth is the trigger

The corruption does NOT happen during PushCall/SetThreadContext

  • Instrumented preemptM to verify the resumePC was correctly written to the goroutine stack by PushCall and still correct after SetThreadContext
  • The value was always correct at that point
  • The corruption occurs later, after the goroutine has been resumed

Not caused by GC or stack shrinking

  • GOGC=off still crashes
  • GODEBUG=gcshrinkstackoff=1 still crashes

Not caused by stale/freed stack references

  • stackPoisonCopy=1 (fills old stack with 0xfc after copy) still crashes with the same pattern (no 0xfc values in the corrupted data)
  • stackFaultOnFree=1 (maps old stack pages as inaccessible) still crashes with the same pattern (no access violation on old stack pages)

The copy itself appears correct

  • Added a post-adjustframe verification in copystack that compared every 8-byte value in the new stack against the old stack for the corruption pattern. It did not fire. This means the corruption is not introduced by memmove or adjustframe during the copy itself.

A separate bug exists in cgo callback stack growth

  • debugCheckBP=true detected 98 instances of invalid frame pointers during copystack when goroutines in Windows cgo callback chains (e.g., the desktop session watcher's pumpThreadMessages -> getMessage -> Windows callback -> wndProc -> destroyWindow -> Windows callback -> callbackWrap) need stack growth
  • The BP chain in these goroutines crosses from the Go goroutine stack to the Windows system/thread stack. adjustframe encounters a BP value outside the goroutine's stack range.
  • In the non-debug path, adjustpointer correctly skips adjustment of out-of-range values, so this is "safe" in that it doesn't corrupt data, but it means the BP chain is broken after the stack copy.
  • This is a SEPARATE bug from the DERP return address corruption. The DERP goroutine has no cgo frames. Both bugs involve copystack but in different goroutines with different stack structures.

What We Tried That Didn't Pan Out

.pdata sorting (commit bbed50aaa3)

The Go linker emits .pdata (Windows SEH function table) entries unsorted, violating the PE/COFF spec requirement. Windows RtlLookupFunctionEntry does a binary search on these entries. The sort fix exists on master but is not in Go 1.26.x. However, the crash still occurs with the sort fix applied (tested on master). The .pdata issue is a real bug but is not the cause of this crash.
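
To see why sort order matters for a binary search over function ranges, here is a minimal sketch. The runtimeFunction type and lookup function are simplified stand-ins invented for this example; this is a model of binary search in general, not the actual RtlLookupFunctionEntry implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// runtimeFunction loosely mirrors a .pdata entry: a function's start and end
// RVA. (Simplified for illustration; real entries also carry unwind info.)
type runtimeFunction struct {
	Begin, End uint32
}

// lookup binary-searches table for the entry covering pc. It is only correct
// when table is sorted by Begin.
func lookup(table []runtimeFunction, pc uint32) (runtimeFunction, bool) {
	// Find the first entry whose End lies above pc.
	i := sort.Search(len(table), func(i int) bool { return table[i].End > pc })
	if i < len(table) && table[i].Begin <= pc {
		return table[i], true
	}
	return runtimeFunction{}, false
}

func main() {
	unsorted := []runtimeFunction{{0x3000, 0x4000}, {0x1000, 0x2000}, {0x2000, 0x3000}}
	_, ok := lookup(unsorted, 0x1800)
	fmt.Println("unsorted table finds pc:", ok)

	sorted := append([]runtimeFunction(nil), unsorted...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Begin < sorted[j].Begin })
	_, ok = lookup(sorted, 0x1800)
	fmt.Println("sorted table finds pc:", ok)
}
```

With the unsorted table the search lands on the wrong entry and reports a miss, even though an entry covering the PC exists.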

GetThreadContext return value checking

Added a check for GetThreadContext returning 0 (failure). It never failed.

Stack scanning in preemptM

Added a loop to scan the goroutine stack for the corruption pattern immediately after PushCall + SetThreadContext. No corruption was found at that point, confirming the corruption happens later.

Larger initial stacks

Setting stackMin to 4096, 8192, or 65536 prevents the crash in every case, but also causes the test to time out. Larger stacks mean less stack growth, which avoids the bug. This doesn't pinpoint the mechanism but confirms stack growth is part of the trigger.

Minimal reproducers

Wrote two minimal Go programs that create many goroutines doing frame-reading over TCP/TLS connections with small stack frames, similar to the DERP server pattern. Neither crashed. The full tsnet test suite is needed to trigger the bug, suggesting it requires a specific combination of goroutine count, stack depth, I/O patterns, and timing.

Theories

Most likely: corruption during stack growth after async preemption resume

The goroutine is deep in a call stack when async-preempted. It yields via asyncPreempt -> asyncPreempt2 -> mcall -> gopreempt_m. When later rescheduled, it returns through asyncPreempt back to the interrupted code. The interrupted code (or a subsequent function call) triggers stack growth via morestack -> newstack -> copystack.

During this stack growth, something goes wrong. The memmove and adjustframe produce correct results (verified), but something after copystack returns uses a stale or incorrect reference to the old stack location, writing data to what is now either freed memory or another goroutine's stack. This write overwrites 4 bytes of a return address with heap/stack address data.

The stackFaultOnFree test should have caught a write to freed stack pages but didn't, which means either:

  1. The old stack pages were immediately reused (returned to the stack pool and given to another goroutine), making them accessible
  2. The corruption is on the NEW stack, not the old one, but the post-copy check missed it (perhaps due to timing - the corruption happens after copystack returns)
  3. The corruption involves a different mechanism entirely

Alternative: Windows thread context interaction

When a goroutine is async-preempted via SuspendThread + SetThreadContext + ResumeThread, and then its stack is grown, there might be a subtle interaction where Windows retains internal references to the old stack (e.g., for APC delivery, exception handling, or thread context restoration) that become stale after the stack moves. This wouldn't be caught by stackFaultOnFree if Windows keeps its own mappings.

Next Steps

  1. Use a real debugger: Set a hardware data breakpoint (DR0-DR3) on the return address location to catch the exact instruction that overwrites it. This requires a Windows debugger (WinDbg) attached to the process.

  2. Add per-goroutine stack-copy tracking with frame validation: After each copystack, walk the new stack's frame pointer chain and validate that all return addresses look like valid code pointers (high bits in the expected image range).

  3. Bisect the stack growth: Instead of growing the stack to a new allocation, try growing it in-place (remap to a larger region) to eliminate the memmove/pointer-adjustment path.

  4. Test with stackNoCache=1: Prevent stack page reuse to see if the corruption changes (would confirm if old stack pages being reused by other goroutines is part of the story).

  5. Investigate the cgo callback BP bug: The 98 invalid-BP-during-copystack warnings are a real bug that should be filed separately. While not the direct cause of the DERP corruption, they indicate that copystack has difficulty with certain frame layouts on Windows.
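
The frame validation in step 2 could look roughly like the sketch below. It walks a simulated frame-pointer chain held in a slice, not a real goroutine stack; the layout (saved BP followed by return address), the walkFrames name, and the image range are all assumptions made for the example. A real check would live inside the runtime and use its own stack-walking helpers.

```go
package main

import "fmt"

// walkFrames follows a simulated frame-pointer chain. stack[bp] holds the
// caller's saved BP (as an index into stack; 0 ends the chain) and
// stack[bp+1] holds that frame's return address. It returns every return
// address that falls outside [imageLo, imageHi).
func walkFrames(stack []uint64, bp int, imageLo, imageHi uint64) (bad []uint64) {
	// The step counter bounds the walk in case the chain contains a cycle.
	for steps := 0; bp > 0 && bp+1 < len(stack) && steps < len(stack); steps++ {
		ret := stack[bp+1]
		if ret < imageLo || ret >= imageHi {
			bad = append(bad, ret)
		}
		bp = int(stack[bp])
	}
	return bad
}

func main() {
	stack := []uint64{
		0,                     // index 0 is unused so that 0 can end the chain
		3, 0x00007ff600001000, // frame at bp=1, caller frame at 3
		5, 0x0000001000002000, // frame at bp=3, return address corrupted
		0, 0x00007ff600003000, // frame at bp=5, outermost
	}
	bad := walkFrames(stack, 1, 0x00007ff600000000, 0x00007ff700000000)
	fmt.Printf("suspect return addresses: %#x\n", bad)
}
```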

bradfitz commented Mar 7, 2026

Windows Stack Memory Corruption Investigation

Root Cause

Windows overlapped I/O APIs (WSARecv, WSARecvFrom, WSASend, WSASendto,
AcceptEx, etc.) receive pointers to output parameters (lpNumberOfBytesRecvd,
lpFlags, lpFromlen). When the operation completes asynchronously
(ERROR_IO_PENDING), Windows writes completion results back to these addresses.

In Go's internal/poll/fd_windows.go, these output parameters were declared as
stack-local variables inside closures passed to execIO:

n, err = fd.execIO('r', func(o *operation) (qty uint32, err error) {
    var flags uint32                // stack-local in the closure
    err = syscall.WSARecv(fd.Sysfd, newWsaBuf(buf), 1, &qty, &flags, &o.o, nil)
    //                                                  ^^^^   ^^^^^^
    //                          pointers to closure stack locals passed to Windows
    return qty, err
}, buf)

The closure executes the Windows API call, then returns to execIO. execIO then
enters waitIO to park the goroutine until the I/O completes. But by this point,
the closure's stack frame has been popped - flags and qty are dead stack space,
now reused by waitIO's call frames (return addresses, saved registers, etc.).

When Windows completes the I/O and writes the results to the original addresses, it
overwrites whatever now occupies those stack locations. A 32-bit write of a small
integer (like a flags value or byte count) to the high 32 bits of a return address
produces the observed corruption pattern: 0x00007ff6XXXXXXXX becomes
0x00000010XXXXXXXX.
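
The sequence can be modeled without any Windows APIs by letting a byte slice stand in for the goroutine stack. This is a minimal sketch: the offsets, the function name simulateStaleWrite, and the values are invented for illustration, and the real frame layout will differ.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// simulateStaleWrite models a late 4-byte completion write landing on stack
// bytes that have since been reused for an 8-byte return address.
func simulateStaleWrite(retAddr uint64, completion uint32) uint64 {
	stack := make([]byte, 64)

	// 1. The closure's frame holds a 4-byte local ("flags") at offset 44.
	//    Its address is handed to the asynchronous API, which keeps it.
	const flagsOff = 44

	// 2. The closure returns. A later call reuses the same bytes: its
	//    8-byte return address occupies offsets 40..47 (little-endian).
	const retOff = 40
	binary.LittleEndian.PutUint64(stack[retOff:], retAddr)

	// 3. The I/O completes and the 4-byte result lands on the stale
	//    address, which is now the upper dword of the return address.
	binary.LittleEndian.PutUint32(stack[flagsOff:], completion)

	// 4. The eventual RET pops this value.
	return binary.LittleEndian.Uint64(stack[retOff:])
}

func main() {
	// Prints 0x0000001012345678.
	fmt.Printf("0x%016x\n", simulateStaleWrite(0x00007ff612345678, 0x10))
}
```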

The affected call sites are every execIO closure that passes &qty, &flags, or
&rsan (fromlen) to a Windows socket API:

  • Read (kindNet path): WSARecv gets &qty and &flags
  • ReadFrom, ReadFromInet4, ReadFromInet6: WSARecvFrom gets &qty, &flags, &rsan
  • Write (kindNet path): WSASend gets &qty
  • WriteTo, WriteToInet4, WriteToInet6: WSASendto gets &qty
  • Writev: WSASend gets &qty
  • acceptOne: AcceptEx gets &qty
  • waitForReading: WSARecv gets &qty and &flags
  • ReadMsg, ReadMsgInet4, ReadMsgInet6: WSARecvMsg gets &qty
  • WSAGetOverlappedResult in execIO itself: gets &flags

The Fix

Move the output parameters from closure stack locals to fields in the
heap-allocated operation struct, which is pooled and lives for the entire duration
of the I/O operation:

type operation struct {
    o          syscall.Overlapped
    runtimeCtx uintptr
    mode       int32

    // Output parameters for Windows APIs that may complete asynchronously.
    // These MUST NOT be stack-allocated because Windows may write to them
    // after the initiating function call returns ERROR_IO_PENDING.
    qty   uint32
    flags uint32
    rsan  int32
}

Then change all closures to use &o.qty, &o.flags, &o.rsan instead of stack
locals.
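
The ownership pattern behind the fix can be sketched in portable Go. Here fakeOverlappedRecv is a hypothetical stand-in for an API that reports "pending" and writes its results later, and the operation struct is reduced to the two output fields; neither is the actual internal/poll code.

```go
package main

import (
	"fmt"
	"sync"
)

// operation stands in for the per-I/O struct: its output fields live on the
// heap for as long as the operation is in flight.
type operation struct {
	qty   uint32
	flags uint32
}

var opPool = sync.Pool{New: func() any { return new(operation) }}

// fakeOverlappedRecv (hypothetical) retains the two pointers and returns the
// completion step as a function so the example can run it deterministically.
func fakeOverlappedRecv(qty, flags *uint32) (complete func()) {
	return func() {
		*qty = 42
		*flags = 0x10
	}
}

// recv starts the fake I/O with pointers into o, completes it, and reads the
// results back from o before returning o to the pool.
func recv() (qty, flags uint32) {
	o := opPool.Get().(*operation)
	defer opPool.Put(o)
	*o = operation{}

	// Start: the callee keeps &o.qty and &o.flags past this inner call.
	complete := func() func() {
		return fakeOverlappedRecv(&o.qty, &o.flags)
	}()

	// Completion writes through the retained pointers, which still refer
	// to o because it has not been returned to the pool yet.
	complete()
	return o.qty, o.flags
}

func main() {
	fmt.Println(recv())
}
```

Pure Go code cannot reproduce the original bug this way, because a retained &local would make the local escape to the heap. The problem arises at the syscall boundary, where escape analysis has no way to see that the kernel keeps the pointer.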

Why Async Preemption and Stack Growth Appear Involved

The bug requires this sequence:

  1. Goroutine calls WSARecv/etc., which returns ERROR_IO_PENDING
  2. The closure returns; its stack locals (flags, qty) become dead stack space
  3. execIO calls waitIO, whose call frames reuse the dead stack space
  4. Windows completes the I/O and writes to the dead stack addresses, corrupting
    waitIO's return addresses or saved registers

Without stack growth, this corruption may be "silent" - the corrupted stack
locations happen to be in the goroutine's current stack segment, and the values
written (small integers) may not cause an immediate crash if the goroutine doesn't
use those particular stack slots again. But with stack growth:

  1. copystack moves the goroutine stack to a new allocation
  2. The corrupted values on the old stack are faithfully copied to the new location
  3. Windows completes more I/O and writes to the OLD stack addresses
  4. Those addresses are now freed memory or belong to another goroutine

This explains why GODEBUG=asyncpreemptoff=1 and stackMin=4096 both prevent the
crash - they reduce stack growth, reducing the chance of old stack addresses being
freed and reused.
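
That a raw stack address goes stale after growth can be observed from ordinary Go code. The sketch below records the address of a stack-resident variable as a uintptr (which the runtime does not adjust), forces stack growth through deep recursion, and records the address again. The function names and recursion depth are arbitrary choices for the example; on typical runs the two addresses differ, but the exact values depend on the runtime and are not guaranteed.

```go
package main

import (
	"fmt"
	"unsafe"
)

// grow recurses with a padded frame so that a large n forces the runtime to
// enlarge (and therefore copy) the goroutine stack.
func grow(n int) int {
	var pad [256]byte
	pad[n%len(pad)] = byte(n)
	if n == 0 {
		return int(pad[0])
	}
	return grow(n-1) + int(pad[n%len(pad)])
}

// stackAddrs returns the address of one stack-resident variable before and
// after forced stack growth. Converting to uintptr keeps the variable on the
// stack and hides the address from the runtime's pointer adjustment.
func stackAddrs() (before, after uintptr) {
	var local uint32
	before = uintptr(unsafe.Pointer(&local))
	grow(20000)
	after = uintptr(unsafe.Pointer(&local))
	return before, after
}

func main() {
	before, after := stackAddrs()
	fmt.Printf("before=%#x after=%#x moved=%v\n", before, after, before != after)
}
```

An address handed to Windows before the growth is in the same position as the first uintptr: nothing updates it when the stack moves.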

stackFaultOnFree=1 changed the corruption pattern (from 0x10XXXXXXXX to 0x12)
because decommitting old stack pages changes what Windows reads/writes there.

Verification

Configuration                                | Result
Go 1.25.0, 1.26.1, master (no fix)           | CRASH (3/3 runs)
With fix (output params in operation struct) | NO CRASH (3/3 runs, tests PASS)
GODEBUG=asyncpreemptoff=1 (no fix)           | No crash (but test is pre-existing flaky/slow)
stackMin=4096 (no fix)                       | No crash (but test is pre-existing flaky/slow)
GOGC=off (no fix)                            | CRASH
stackFaultOnFree=1 (no fix)                  | CRASH (different pattern: PC=0x12)

Note: the tsnet tests are independently flaky on Windows and sometimes hang
regardless of this bug. During investigation we used a 90s timeout which was often
too short. With a 300s timeout the fixed binary passes cleanly.

Reproduction

The crash reliably reproduces by running tailscale.com/tsnet tests on Windows
amd64 with -test.count=3:

GOOS=windows GOARCH=amd64 go test -c -o tsnet_test.exe ./tsnet/
tsnet_test.exe -test.timeout=90s -test.count=3

Tested on Windows 11 build 26200 (12th Gen Intel i7-1255U) and GitHub Actions
Windows runners.

Other Bugs Found During Investigation

Unsorted .pdata section (linker bug)

The Go linker emits .pdata (Windows SEH function table) entries unsorted,
violating the PE/COFF spec requirement that RUNTIME_FUNCTION entries be sorted by
function start address. This can cause RtlLookupFunctionEntry (binary search) to
return incorrect results. Fixed on master by commit bbed50aaa3 but not yet in Go
1.26.x. Not the cause of the stack corruption, but a real bug.

Invalid frame pointers during cgo callback stack growth

debugCheckBP=true detected 98 instances of invalid frame pointers during
copystack for goroutines in Windows cgo callback chains. The frame pointer chain
in these goroutines crosses from the Go goroutine stack to the Windows system stack.
adjustframe encounters BP values outside the goroutine's stack range. In the
non-debug path, adjustpointer correctly skips adjustment of out-of-range values,
so no data corruption occurs, but the BP chain is broken after the stack copy. This
is a separate issue from the I/O corruption bug.

Investigation Path

What we tried (chronological)

  1. Investigated .pdata sorting - not the cause
  2. GODEBUG=asyncpreemptoff=1 - prevented crash, pointed to preemption
  3. GOGC=off - still crashed, ruled out GC
  4. GODEBUG=gcshrinkstackoff=1 - still crashed, ruled out stack shrinking
  5. Instrumented PushCall in preemptM - write was correct, corruption is later
  6. Verified SetThreadContext didn't corrupt stack - it didn't
  7. stackPoisonCopy=1 - still crashed with same pattern (no 0xfc in corrupted data)
  8. stackMin=4096/8192/65536 - prevented crash, confirmed stack growth involvement
  9. Added stackcopycount field to g struct - crashing goroutine had 10-12 copies
  10. Post-adjustframe verification in copystack - no corruption detected during copy
  11. debugCheckBP=true - found the cgo callback BP bug (separate issue)
  12. debugCheckBP as warning (non-fatal) - 98 BP warnings plus DERP crash, confirmed separate
  13. stackFaultOnFree=1 - different crash pattern (PC=0x12), suggested stale pointer read
  14. stackNoCache=1 - still crashed, stack cache reuse not the mechanism
  15. stackFaultOnFree=1 + stackNoCache=1 - crash at PC=0x12, confirmed old stack involvement
  16. Audited all execIO closures for stack-allocated output parameters passed to Windows APIs
  17. Moved output params to heap-allocated operation struct - FIXED
