bradfitz/win-corrupt.md

(Claude Code; Opus 4.6)
Windows Stack Memory Corruption Investigation

Summary

Runtime stack memory corruption on Windows amd64 causes a DEP violation (Exception
0xc0000005 code 0x8) when a goroutine jumps to a corrupted return address. The
corruption always overwrites the high 32 bits of a return address, replacing a valid
code pointer (e.g. 0x00007ff6XXXXXXXX) with a value whose high 32 bits are a
small number (e.g. 0x00000010XXXXXXXX). The low 32 bits are preserved. The
overwriting data appears to be an ordinary heap or stack address, not a special
constant, written over the return address's upper dword.

The crash was initially reported as a Go 1.26 regression, but testing showed it
also reproduces with Go 1.25.0 and Go tip (master). It may have become more
frequent in 1.26 due to changes in binary layout or stack usage patterns.

Reproduction

The crash reliably reproduces by running tailscale.com/tsnet tests on Windows
amd64 with -test.count=3:
GOOS=windows GOARCH=amd64 go test -c -o tsnet_test.exe ./tsnet/
tsnet_test.exe -test.timeout=90s -test.count=3

The crash typically occurs during TestConn or later tests, in a goroutine running
derpserver.(*sclient).run which reads DERP frames via ReadFrameHeader. The
crashing goroutine was created by net/http.(*Server).Serve for an HTTPS DERP
connection.

The crash was tested on:

- Windows 11 build 26200 (12th Gen Intel i7-1255U, hybrid P+E cores)
- GitHub Actions Windows runners (Azure VMs, various CPUs)

What We Know For Sure

The corruption pattern


- Always the high 32 bits of an 8-byte return address on the goroutine stack
- The corrupted return address was pushed by a CALL instruction (to ReadFrameHeader)
- The goroutine's RET pops this corrupted value and jumps to it, causing a DEP violation
- The value written over the high 32 bits is a normal-range address (not a special constant)
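The pattern above can be written down as a predicate over a single stack word. This is a minimal sketch, not the instrumentation used in the investigation: the `codeRange` type, the function names, and the idea of passing the image bounds in explicitly are all assumptions made for illustration, and it assumes the image does not straddle a 4 GiB boundary.

```go
package main

// codeRange describes the address range of the loaded executable image.
// The bounds are hypothetical inputs; a real check would have to obtain
// them from the runtime's module data.
type codeRange struct {
	base, end uint64
}

// contains reports whether addr lies inside the image.
func (r codeRange) contains(addr uint64) bool {
	return addr >= r.base && addr < r.end
}

// matchesCorruption reports whether word looks like a return address whose
// high 32 bits were overwritten: the word itself is outside the image, but
// splicing the image's high dword back onto the preserved low 32 bits
// yields an address inside the image.
// Assumption: the whole image shares one high dword.
func matchesCorruption(word uint64, img codeRange) bool {
	if img.contains(word) {
		return false // still a valid code pointer
	}
	restored := (img.base &^ 0xffffffff) | (word & 0xffffffff)
	return img.contains(restored)
}
```

A word such as 0x0000001012345678 matches when the image covers 0x00007ff612345678. Ordinary data whose low 32 bits happen to fall inside the image's low-dword range would also match, so a hit is a hint rather than proof.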

Async preemption is required


- GODEBUG=asyncpreemptoff=1 prevents the crash (test times out instead)
- This was the first and most definitive finding

Stack growth is involved


- Setting stackMin=4096 (or higher) prevents the crash (test times out instead)
- The default stackMin=2048 allows the crash
- The crashing goroutine consistently has stackcopycount of 10-12, meaning its
  stack was copied/grown 10-12 times during its lifetime
- The combination of async preemption + small stacks + stack growth is the trigger

The corruption does NOT happen during PushCall/SetThreadContext


- Instrumented preemptM to verify the resumePC was correctly written to the
  goroutine stack by PushCall and still correct after SetThreadContext
- The value was always correct at that point
- The corruption occurs later, after the goroutine has been resumed

Not caused by GC or stack shrinking


- GOGC=off still crashes
- GODEBUG=gcshrinkstackoff=1 still crashes

Not caused by stale/freed stack references


- stackPoisonCopy=1 (fills old stack with 0xfd after copy) still crashes with
  the same pattern (no 0xfd values in the corrupted data)
- stackFaultOnFree=1 (maps old stack pages as inaccessible) still crashes with
  the same pattern (no access violation on old stack pages)
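The "no 0xfd values" observation amounts to checking the bytes of the corrupted word against the poison fill. A small sketch of that check follows; the helper name is hypothetical, and the 0xfd fill value is taken from the description above.

```go
package main

// poisonByte is the fill value the text attributes to stackPoisonCopy.
const poisonByte = 0xfd

// poisonedBytes counts how many of the 8 bytes in word equal the poison
// fill. A word read from a poisoned old stack would be expected to show
// a run of such bytes.
func poisonedBytes(word uint64) int {
	n := 0
	for i := 0; i < 8; i++ {
		if byte(word>>(8*i)) == poisonByte {
			n++
		}
	}
	return n
}
```

A corrupted return address of the form 0x00000010XXXXXXXX yields a count of zero for its high dword, which is what rules out a read from the poisoned old stack.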

The copy itself appears correct


Added a post-adjustframe verification in copystack that compared every
8-byte value in the new stack against the old stack for the corruption pattern.
It did not fire. This means the corruption is not introduced by memmove or
adjustframe during the copy itself.
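The comparison described above can be sketched outside the runtime by modeling the used portions of both stacks as word slices. This is an illustration under stated assumptions, not the actual patch to copystack: the function name and parameters are hypothetical, both slices are assumed to cover the same frames in the same order, and `delta` stands for the amount by which pointers into the old stack were adjusted.

```go
package main

// findHighDwordMismatches compares the old and new stack word by word and
// returns the indexes where the low 32 bits survived the copy but the high
// 32 bits changed. Words that were legitimately rewritten because they
// pointed into the old stack [oldLo, oldHi) are skipped.
// delta uses unsigned wraparound, so a move to a lower address also works.
func findHighDwordMismatches(oldWords, newWords []uint64, oldLo, oldHi, delta uint64) []int {
	n := len(oldWords)
	if len(newWords) < n {
		n = len(newWords)
	}
	var bad []int
	for i := 0; i < n; i++ {
		o, w := oldWords[i], newWords[i]
		if o == w {
			continue // copied unchanged
		}
		if o >= oldLo && o < oldHi && w == o+delta {
			continue // pointer into the old stack, correctly adjusted
		}
		if uint32(o) == uint32(w) && o>>32 != w>>32 {
			bad = append(bad, i) // the corruption pattern
		}
	}
	return bad
}
```

An empty result immediately after the copy, as reported above, is consistent with the corruption being introduced later.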

A separate bug exists in cgo callback stack growth


- debugCheckBP=true detected 98 instances of invalid frame pointers during
  copystack when goroutines in Windows cgo callback chains (e.g., the desktop
  session watcher's pumpThreadMessages -> getMessage -> Windows callback ->
  wndProc -> destroyWindow -> Windows callback -> callbackWrap) need stack
  growth
- The BP chain in these goroutines crosses from the Go goroutine stack to the
  Windows system/thread stack. adjustframe encounters a BP value outside the
  goroutine's stack range.
- In the non-debug path, adjustpointer correctly skips adjustment of
  out-of-range values, so this is "safe" in that it doesn't corrupt data, but it
  means the BP chain is broken after the stack copy.
- This is a SEPARATE bug from the DERP return address corruption. The DERP
  goroutine has no cgo frames. Both bugs involve copystack but in different
  goroutines with different stack structures.

What We Tried That Didn't Pan Out

.pdata sorting (commit bbed50aaa3)

The Go linker emits .pdata (Windows SEH function table) entries unsorted,
violating the PE/COFF spec requirement. Windows RtlLookupFunctionEntry does a
binary search on these entries. The sort fix exists on master but is not in Go
1.26.x. However, the crash still occurs with the sort fix applied (tested on
master). The .pdata issue is a real bug but is not the cause of this crash.

GetThreadContext return value checking

Added a check for GetThreadContext returning 0 (failure). It never failed.

Stack scanning in preemptM

Added a loop to scan the goroutine stack for the corruption pattern immediately
after PushCall + SetThreadContext. No corruption was found at that point,
confirming the corruption happens later.

Larger initial stacks

Setting stackMin to 4096, 8192, or 65536 prevents the crash in each case, but the
test then times out. Larger stacks mean less stack growth, which avoids the bug.
This doesn't pinpoint the mechanism but confirms stack growth is part of the
trigger.

Minimal reproducers

Wrote two minimal Go programs that create many goroutines doing frame-reading over
TCP/TLS connections with small stack frames, similar to the DERP server pattern.
Neither crashed. The full tsnet test suite is needed to trigger the bug, suggesting
it requires a specific combination of goroutine count, stack depth, I/O patterns,
and timing.
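For orientation, a reproducer of this general shape might look like the sketch below. It is not one of the two programs mentioned above, which are not shown here. It swaps TCP/TLS for `net.Pipe` so it runs without a network, the 5-byte header layout is an assumption, and `deepen` is a hypothetical helper that pads the call stack so the reader goroutine outgrows its initial stack. Like the original attempts, it is not expected to crash.

```go
package main

import (
	"encoding/binary"
	"io"
	"net"
	"sync"
	"sync/atomic"
)

// readFrameHeader reads a 1-byte type and a 4-byte big-endian length.
// The layout is an assumption, loosely modeled on a DERP-style header.
func readFrameHeader(r io.Reader) (typ byte, n uint32, err error) {
	var hdr [5]byte
	if _, err = io.ReadFull(r, hdr[:]); err != nil {
		return 0, 0, err
	}
	return hdr[0], binary.BigEndian.Uint32(hdr[1:]), nil
}

// deepen recurses with a padded frame so the goroutine needs a larger
// stack, then calls f at the bottom of the recursion.
func deepen(depth int, f func()) byte {
	var pad [128]byte
	pad[depth%len(pad)] = byte(depth)
	if depth == 0 {
		f()
		return pad[0]
	}
	return deepen(depth-1, f) + pad[depth%len(pad)]
}

// stress starts one writer and one reader goroutine per connection and
// returns the total number of frame headers read.
func stress(conns, framesPerConn int) int64 {
	var total atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < conns; i++ {
		client, server := net.Pipe()
		wg.Add(2)
		go func() { // writer: sends empty frames, then closes
			defer wg.Done()
			defer client.Close()
			hdr := []byte{1, 0, 0, 0, 0}
			for j := 0; j < framesPerConn; j++ {
				if _, err := client.Write(hdr); err != nil {
					return
				}
			}
		}()
		go func(depth int) { // reader: reads headers from deep in the stack
			defer wg.Done()
			defer server.Close()
			for {
				var err error
				deepen(depth, func() { _, _, err = readFrameHeader(server) })
				if err != nil {
					return // io.EOF once the writer closes
				}
				total.Add(1)
			}
		}(8 + i%32)
	}
	wg.Wait()
	return total.Load()
}
```

Calling `stress(16, 50)` reads 800 headers. A goroutine's stack typically grows only on its first deep call here, so this exercises far fewer stack copies than the 10-12 seen in the crashing goroutine, which may be one reason such reduced programs do not trigger the bug.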
Theories

Most likely: corruption during stack growth after async preemption resume

The goroutine is deep in a call stack when async-preempted. It yields via
asyncPreempt -> asyncPreempt2 -> mcall -> gopreempt_m. When later
rescheduled, it returns through asyncPreempt back to the interrupted code. The
interrupted code (or a subsequent function call) triggers stack growth via
morestack -> newstack -> copystack.

During this stack growth, something goes wrong. The memmove and adjustframe
produce correct results (verified), but something after copystack returns uses a
stale or incorrect reference to the old stack location, writing data to what is now
either freed memory or another goroutine's stack. This write overwrites 4 bytes of
a return address with heap/stack address data.

The stackFaultOnFree test should have caught a write to freed stack pages but
didn't, which means either:

- The old stack pages were immediately reused (returned to the stack pool and
  given to another goroutine), making them accessible
- The corruption is on the NEW stack, not the old one, but the post-copy check
  missed it (perhaps due to timing: the corruption happens after copystack
  returns)
- The corruption involves a different mechanism entirely

Alternative: Windows thread context interaction

When a goroutine is async-preempted via SuspendThread + SetThreadContext +
ResumeThread, and then its stack is grown, there might be a subtle interaction
where Windows retains internal references to the old stack (e.g., for APC delivery,
exception handling, or thread context restoration) that become stale after the
stack moves. This wouldn't be caught by stackFaultOnFree if Windows keeps its
own mappings.

Next Steps


1. Use a real debugger: Set a hardware data breakpoint (DR0-DR3) on the return
   address location to catch the exact instruction that overwrites it. This
   requires a Windows debugger (WinDbg) attached to the process.

2. Add per-goroutine stack-copy tracking with frame validation: After each
   copystack, walk the new stack's frame pointer chain and validate that all
   return addresses look like valid code pointers (high bits in the expected image
   range).

3. Bisect the stack growth: Instead of growing the stack to a new allocation,
   try growing it in-place (remap to a larger region) to eliminate the
   memmove/pointer-adjustment path.

4. Test with stackNoCache=1: Prevent stack page reuse to see if the
   corruption changes (would confirm if old stack pages being reused by other
   goroutines is part of the story).

5. Investigate the cgo callback BP bug: The 98 invalid-BP-during-copystack
   warnings are a real bug that should be filed separately. While not the direct
   cause of the DERP corruption, they indicate that copystack has difficulty
   with certain frame layouts on Windows.
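The frame validation proposed in the second step can be sketched against a snapshot of the stack. This is a model for illustration, not runtime code: `stackImage` and `firstBadReturn` are hypothetical names, the text bounds are passed in as assumptions, and the layout assumed is the usual amd64 frame-pointer one (saved BP at `[BP]`, return address at `[BP+8]`).

```go
package main

// stackImage is a snapshot of a goroutine stack: words[0] sits at address
// lo and each entry covers 8 bytes.
type stackImage struct {
	lo    uint64
	words []uint64
}

// load returns the word at addr, or false if addr is unaligned or outside
// the snapshot.
func (s stackImage) load(addr uint64) (uint64, bool) {
	if addr < s.lo || addr%8 != 0 {
		return 0, false
	}
	i := (addr - s.lo) / 8
	if i >= uint64(len(s.words)) {
		return 0, false
	}
	return s.words[i], true
}

// firstBadReturn follows the saved frame pointers starting at bp. It
// returns the address of the first return-address slot whose value is
// outside [textLo, textHi) with ok == false, or ok == true if every
// frame it could reach looks valid.
func firstBadReturn(s stackImage, bp, textLo, textHi uint64) (slot uint64, ok bool) {
	for steps := 0; steps < len(s.words); steps++ {
		saved, in := s.load(bp)
		if !in {
			return 0, true // chain left this stack: stop walking
		}
		ret, in := s.load(bp + 8)
		if !in {
			return 0, true
		}
		if ret < textLo || ret >= textHi {
			return bp + 8, false
		}
		if saved <= bp {
			return 0, true // callers live at higher addresses; end of chain
		}
		bp = saved
	}
	return 0, true
}
```

The walk deliberately stops, rather than reporting an error, when the chain leaves the goroutine stack, since the cgo callback case described earlier produces exactly that situation.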

| Configuration | Result |
| --- | --- |
| Go 1.25.0, 1.26.1, master (no fix) | CRASH (3/3 runs) |
| With fix (output params in operation struct) | NO CRASH (3/3 runs, tests PASS) |
| `GODEBUG=asyncpreemptoff=1` (no fix) | No crash (but test is pre-existing flaky/slow) |
| `stackMin=4096` (no fix) | No crash (but test is pre-existing flaky/slow) |
| `GOGC=off` (no fix) | CRASH |
| `stackFaultOnFree=1` (no fix) | CRASH (different pattern: PC=0x12) |