The flaky test `TestNewManagerImplStartProbeMode` in the Kubernetes device manager was causing CI failures in the `ci-kubernetes-unit-dependencies` job. Investigation identified the root cause as a data race between logging code that uses reflection to format structs and gRPC connection-cleanup code that modifies internal mutex state.
The race was triggered by a gRPC dependency update from v1.75.0 to v1.78.0, which changed internal buffer handling. The fix involves removing a problematic log statement that logged an entire struct containing a gRPC client.
- Background
- Investigation Process
- Root Cause Analysis
- The gRPC Changes
- Call Graph Analysis
- The Fix
- Verification
- Lessons Learned
## Background

The test `TestNewManagerImplStartProbeMode` is located in `pkg/kubelet/cm/devicemanager/manager_test.go`. It tests the device plugin manager's probe mode functionality by:
- Setting up a device manager and plugin manager
- Starting them in goroutines
- Immediately calling cleanup to stop everything
This rapid setup-and-teardown pattern creates timing conditions that can expose race conditions.
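To make the shape concrete, here is a minimal, self-contained sketch of the same start-then-immediately-stop pattern. The `manager` type and its methods are hypothetical stand-ins, not the real device manager API:

```go
package example

import (
	"context"
	"testing"
)

// manager is a hypothetical stand-in for the device/plugin managers the real
// test wires up; only the start/stop shape matters here.
type manager struct{ cancel context.CancelFunc }

func (m *manager) Run(ctx context.Context) { <-ctx.Done() }
func (m *manager) Stop()                   { m.cancel() }

// TestRapidStartStop mirrors the timing: start the manager in a goroutine,
// then tear it down immediately, so shutdown-ordering races get many chances
// to fire under `go test -race`.
func TestRapidStartStop(t *testing.T) {
	for i := 0; i < 100; i++ {
		ctx, cancel := context.WithCancel(context.Background())
		m := &manager{cancel: cancel}
		go m.Run(ctx) // start in the background...
		m.Stop()      // ...and stop right away, as cleanup() does
	}
}
```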
The `ci-kubernetes-unit-dependencies` job:
- Updates Go module dependencies to their latest versions
- Runs unit tests with the race detector enabled
- Tests compatibility with newer dependency versions before they're merged
The job was failing intermittently with the following pattern:
- Failure rate: ~10% of test runs on Linux under the race detector
- Go version: The issue appeared with Go 1.26rc1 but was actually caused by dependency updates
- Key dependency change: gRPC v1.75.0 → v1.78.0
Four CI job logs were analyzed:
- ci-kubernetes-unit-dependencies/2008641956699181056
- ci-kubernetes-unit-dependencies/2007491912692404224
- ci-kubernetes-unit-dependencies/2007431281842851840
- ci-kubernetes-unit-dependencies/2009065501396832256
All four showed the identical DATA RACE pattern.
## Investigation Process

The investigation began with the CI build logs, which showed:
```
=== FAIL: pkg/kubelet/cm/devicemanager TestNewManagerImplStartProbeMode (0.05s)
==================
WARNING: DATA RACE
Read at 0x00c00063d47c by goroutine 204:
  reflect.Value.Int()
      /usr/local/go/src/reflect/value.go:1450 +0x524
  github.com/onsi/gomega/format.formatValue()
  ...
  k8s.io/kubernetes/pkg/kubelet/cm/devicemanager.(*ManagerImpl).PluginDisconnected()
      manager.go:253 +0x393
```
The `differences.md` artifact from the CI job revealed the key dependency changes:
| Package | Old Version | New Version |
|---|---|---|
| google.golang.org/grpc | v1.75.0 | v1.78.0 |
| grpc-ecosystem/go-grpc-middleware/v2 | v2.3.0 | v2.3.3 |
| go.opentelemetry.io/otel | v1.38.0 | v1.39.0 |
The gRPC update was identified as the most likely culprit, given that the stack trace involved gRPC internals.
Initial attempts to reproduce locally on macOS failed:
- 13,000+ runs with Go 1.25.5: 0 failures
- 2,500+ runs with Go 1.26rc1: 0 failures
- 2,500+ runs with gRPC v1.78.0 on macOS: 0 failures
This suggested the race was platform-dependent or required specific timing conditions.
Running in a Linux Docker container successfully reproduced the issue:
```bash
docker run --rm -v $(pwd):/kubernetes -w /kubernetes golang:latest \
  bash -c "go test -race -count=200 -run TestNewManagerImplStartProbeMode \
    k8s.io/kubernetes/pkg/kubelet/cm/devicemanager"
```

Results with gRPC v1.78.0 (unfixed code):
- 1,466 runs, 148 failures (10.10% failure rate)
## Root Cause Analysis

The race occurs at `pkg/kubelet/cm/devicemanager/manager.go:253`:
```go
func (m *ManagerImpl) PluginDisconnected(logger klog.Logger, resourceName string) {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	if ep, exists := m.endpoints[resourceName]; exists {
		m.markResourceUnhealthy(logger, resourceName)
		// THIS LINE CAUSES THE RACE:
		logger.V(2).Info("Endpoint became unhealthy", "resourceName", resourceName, "endpoint", ep)
		ep.e.setStopTime(time.Now())
	}
}
```

```go
// endpointInfo contains the endpoint and its options
type endpointInfo struct {
	e       *endpointImpl
	options *pluginapi.DevicePluginOptions
}

// endpointImpl contains the gRPC client
type endpointImpl struct {
	mutex        sync.Mutex
	resourceName string
	api          pluginapi.DevicePluginClient // gRPC client!
	stopTime     time.Time
	client       plugin.Client
}
```

Goroutine A (Logger - Reader):
- `PluginDisconnected()` is called
- The logger formats `ep` for output
- `gomega/format` uses reflection to traverse the struct
- Reflection reads fields of the gRPC `ClientConn`, including internal buffers

Goroutine B (gRPC Close - Writer):
- `client.Disconnect()` is called
- `c.grpc.Close()` is invoked
- gRPC closes the internal `buffer.Unbounded`
- `sync.Mutex.Unlock()` modifies mutex state

The race occurs because both goroutines access the same memory location:
- Reader: `reflect.Value.Int()` reading mutex state
- Writer: `sync/atomic.AddInt32()` modifying mutex state
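The pattern is easy to reproduce in isolation. Below is a standalone sketch (hypothetical code, not from the Kubernetes tree) in which `fmt`'s reflection-based formatting of a struct embedding a `sync.Mutex` races with concurrent Lock/Unlock calls; `go test -race` typically flags it with the same `reflect.Value.Int()` vs. `sync/atomic` signature seen above:

```go
package raceexample

import (
	"fmt"
	"sync"
	"testing"
)

// endpoint embeds a mutex, like endpointImpl and gRPC's internal types do.
type endpoint struct {
	mu   sync.Mutex
	name string
}

func TestReflectionFormatRace(t *testing.T) {
	ep := &endpoint{name: "example.com/resource"}
	done := make(chan struct{})

	// Writer: Lock/Unlock mutate the mutex state word atomically
	// (analogous to gRPC's close path unlocking buffer.Unbounded.mu).
	go func() {
		defer close(done)
		for i := 0; i < 1000; i++ {
			ep.mu.Lock()
			ep.mu.Unlock()
		}
	}()

	// Reader: %+v walks every field via reflection, including the mutex's
	// internal state (analogous to gomega/format traversing the endpoint).
	for i := 0; i < 1000; i++ {
		_ = fmt.Sprintf("%+v", ep)
	}
	<-done
}
```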
The full stack traces from the CI failure show both sides.

Read side (Goroutine 204):

```
reflect.Value.Int()
github.com/onsi/gomega/format.formatValue()
github.com/onsi/gomega/format.formatStruct()
... (multiple recursive calls)
k8s.io/kubernetes/test/utils/format.Object()
k8s.io/kubernetes/test/utils/ktesting.newLogger.func1()
k8s.io/klog/v2/internal/serialize.Formatter.formatAny()
k8s.io/klog/v2/ktesting.tlogger.Info()
k8s.io/kubernetes/pkg/kubelet/cm/devicemanager.(*ManagerImpl).PluginDisconnected()
    manager.go:253
```

Write side (Goroutine 282):

```
sync/atomic.AddInt32()
sync.(*Mutex).Unlock()
google.golang.org/grpc/internal/buffer.(*Unbounded).Close.deferwrap1()
google.golang.org/grpc.(*ccResolverWrapper).close()
google.golang.org/grpc.(*ClientConn).Close()
k8s.io/kubernetes/pkg/kubelet/cm/devicemanager/plugin/v1beta1.(*client).Disconnect()
    client.go:106
```
## The gRPC Changes

The race became detectable after upgrading from gRPC v1.75.0 to v1.78.0. Between these versions, 73 commits changed various aspects of gRPC's internal behavior.
The following commits from the grpc/grpc-go repository were identified as potentially relevant:
Commit `83bead40` - "internal/buffer: set closed flag when closing channel in the Load method":

```diff
 func (b *Unbounded) Load() {
 	// ...
 	} else if b.closing && !b.closed {
+		b.closed = true
 		close(b.c)
 	}
 }
```

This fix addressed a bug where the `closed` field was never set to `true` when closing the buffer with pending backlog items. While this is a correctness fix, it changes the timing of state modifications during close operations.
Commit `9b990b63` - "gracefulswitch: Wait for all goroutines on close":

```diff
 // Before: Close() returned immediately
 // After: Close() waits for all spawned goroutines
 func (gsb *Balancer) Close() {
 	// ... existing close logic ...
 	currentBalancerToClose.Close()
 	pendingBalancerToClose.Close()
+	gsb.activeGoroutines.Wait() // NEW: Wait for goroutines
 }
```

Release note: "client: Closing a graceful switch balancer will now block until all pending goroutines complete."

This change significantly affects timing during connection close. By making `Close()` wait for goroutines to finish, it changes when mutex operations complete relative to other operations.
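The mechanism can be sketched in a few lines; the type and method names below are illustrative (the real gracefulswitch balancer is more involved):

```go
package example

import "sync"

// balancer sketches the Close-waits-for-goroutines shape from commit 9b990b63.
type balancer struct {
	wg sync.WaitGroup
}

// spawn runs f on a tracked goroutine.
func (b *balancer) spawn(f func()) {
	b.wg.Add(1)
	go func() {
		defer b.wg.Done()
		f()
	}()
}

// Close blocks until every tracked goroutine has returned, so any state those
// goroutines mutate (locks, buffers) settles before Close's caller proceeds;
// this is the timing shift described above.
func (b *balancer) Close() {
	// ... close current and pending sub-balancers ...
	b.wg.Wait()
}
```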
Several commits optimized buffer handling in the transport layer:
| Commit | Description | Version |
|---|---|---|
| `dc355bc0` | Remove buffer copies while writing HTTP/2 Data frames | v1.77.0 |
| `363018c3` | Avoid buffer copies when reading Data frames | v1.77.0 |
| `f959da61` | Reduce heap allocations | v1.77.0 |
| `50c6321b` | Allocate at 4KiB boundaries in fallback buffer pool | v1.78.0 |
| `8ab0c821` | Avoid clearing new buffers, clear buffers from simpleBufferPools | v1.77.0 |
These optimizations changed memory allocation patterns and buffer lifecycle, which can affect timing of concurrent operations.
The `buffer.Unbounded` type in gRPC is used for passing updates between internal components:

```go
// internal/buffer/unbounded.go
type Unbounded struct {
	c       chan any
	closed  bool
	closing bool
	mu      sync.Mutex // This mutex is involved in the race
	backlog []any
}

func (b *Unbounded) Close() {
	b.mu.Lock()
	defer b.mu.Unlock() // The unlock triggers atomic operations
	if b.closing {
		return
	}
	b.closing = true
	if len(b.backlog) == 0 {
		b.closed = true
		close(b.c)
	}
}

func (b *Unbounded) Load() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.backlog) > 0 {
		// ... handle backlog ...
	} else if b.closing && !b.closed {
		b.closed = true // Added in v1.76.0 (commit 83bead40)
		close(b.c)
	}
}
```

The race existed latently in the Kubernetes code but became reproducible due to a combination of gRPC changes:

- Buffer state fix (v1.76.0): the `b.closed = true` addition in `Load()` added a new write operation during close sequences, slightly changing timing.
- Graceful switch wait (v1.78.0): the most impactful change. By making `Close()` block until goroutines complete, it:
  - Extended the window during which internal state is being modified
  - Made the close operation more synchronous
  - Changed when mutex unlocks occur relative to other operations
- Transport optimizations (v1.77.0-v1.78.0): buffer pool and allocation changes altered memory access patterns, potentially affecting cache behavior and timing.
```
v1.75.0 (Kubernetes baseline)
  |
  v--- Commit 83bead40: Buffer closed flag fix
  |
v1.76.0
  |
  v--- Multiple transport optimizations
  |
v1.77.0
  |
  v--- Commit 9b990b63: Graceful switch waits for goroutines
  |
v1.78.0 (Race becomes reproducible)
```
The underlying problem is not a gRPC bug; the gRPC changes are all correct and improve the library. The issue is in the Kubernetes code:

**Logging a struct containing a gRPC client via reflection during connection shutdown is inherently unsafe.**
The gRPC changes simply made an existing latent bug in Kubernetes more likely to manifest. This is a common pattern where dependency updates expose pre-existing race conditions that were previously hidden by favorable timing.
## Call Graph Analysis

```
TestNewManagerImplStartProbeMode (manager_test.go:122)
├── setupInProbeMode()
│   ├── setupDeviceManager()
│   ├── setupDevicePlugin()
│   └── setupPluginManager()
│       └── go pluginManager.Run()   // Starts in background
└── cleanup() (manager_test.go:361)
    ├── p.Stop()                     // Stop plugin stub
    └── m.Stop() (manager.go:361)
        └── server.Stop() (server.go:131)
            └── visitClients() (server.go:203)
                └── Stop.func1() (server.go:132)
                    └── disconnectClient() (handler.go:104)
                        └── client.Disconnect() (client.go:112)
                            ├── c.grpc.Close() (client.go:106)  ← WRITER
                            └── handler.PluginDisconnected()
                                └── PluginDisconnected() (manager.go:253)
                                    └── logger.Info("endpoint", ep)  ← READER
```
The race window exists because (see the sketch after this list):
- `client.Disconnect()` first calls `c.grpc.Close()` to close the gRPC connection
- It then calls `PluginDisconnected()`, which logs the endpoint
- The gRPC `Close()` operation continues asynchronously in background goroutines
- The logger uses reflection to read the gRPC client struct
- The background goroutines modify mutex state in the gRPC client
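Condensed into code, the ordering looks roughly like the following. This is a simplified sketch based on the call graph above, not the verbatim `client.go` source, and the handler call is abbreviated:

```go
// Sketch: Disconnect closes the gRPC connection, then notifies the handler.
// grpc.Close() returns while background goroutines are still tearing down
// connection state, so the reflection-based log triggered by
// PluginDisconnected reads memory that is still being written.
func (c *client) Disconnect() error {
	c.grpc.Close()                           // WRITER: teardown continues asynchronously
	c.handler.PluginDisconnected(c.resource) // READER: endpoint is logged via reflection
	return nil
}
```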
## The Fix

Before (`manager.go:253`):

```go
if ep, exists := m.endpoints[resourceName]; exists {
	m.markResourceUnhealthy(logger, resourceName)
	logger.V(2).Info("Endpoint became unhealthy", "resourceName", resourceName, "endpoint", ep)
	ep.e.setStopTime(time.Now())
}
```

After:

```go
if ep, exists := m.endpoints[resourceName]; exists {
	m.markResourceUnhealthy(logger, resourceName)
	logger.V(2).Info("Endpoint became unhealthy", "resourceName", resourceName)
	ep.e.setStopTime(time.Now())
}
```

Why this fix works:
- The `endpoint` field is redundant: the `resourceName` is already in the log message and uniquely identifies the endpoint
- Logging complex structs is dangerous: structs containing gRPC clients, network connections, or other concurrent objects should not be logged via reflection
- Minimal change: the fix removes only the problematic field without changing other behavior

Alternatives considered:
- Copy safe fields before logging: would require creating a new struct with only safe-to-log fields
- Implement a `String()` method on `endpointInfo` (sketched below): would prevent reflection-based formatting but requires more code
- Log before `Close()`: would require restructuring the disconnect flow

The chosen approach is the simplest and most direct fix.
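For reference, a minimal sketch of the `String()` alternative, built on the `endpointInfo`/`endpointImpl` definitions shown earlier. This is illustrative, not the merged fix, and assumes `fmt` is imported:

```go
// String implements fmt.Stringer so that structured loggers print a stable
// summary instead of reflecting into the gRPC client. resourceName is set
// once at construction, so reading it without the endpoint mutex is assumed
// safe here.
func (e endpointInfo) String() string {
	if e.e == nil {
		return "endpointInfo{}"
	}
	return fmt.Sprintf("endpointInfo{resourceName: %q}", e.e.resourceName)
}
```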
## Verification

| Test Configuration | Platform | Runs | Failures | Rate |
|---|---|---|---|---|
| Unfixed + gRPC v1.75.0 | macOS | 13,000+ | 0 | 0% |
| Unfixed + gRPC v1.78.0 | macOS | 2,500+ | 0 | 0% |
| Unfixed + gRPC v1.78.0 | Linux (Docker) | 1,466 | 148 | 10.10% |
| Fixed + gRPC v1.78.0 | Linux (Docker) | 200 | 0 | 0% |
To reproduce the race (unfixed code + gRPC v1.78.0):

```bash
# Pin gRPC to v1.78.0
./hack/pin-dependency.sh google.golang.org/grpc v1.78.0
./hack/update-vendor.sh

# Run in a Linux container
docker run --rm -v $(pwd):/kubernetes -w /kubernetes golang:latest \
  bash -c "go test -race -count=200 -run TestNewManagerImplStartProbeMode \
    k8s.io/kubernetes/pkg/kubelet/cm/devicemanager"
```

To verify the fix:

```bash
# Apply the fix to manager.go:253:
# remove `"endpoint", ep` from the log statement,
# then run the same test - it should pass with 0 races.
docker run --rm -v $(pwd):/kubernetes -w /kubernetes golang:latest \
  bash -c "go test -race -count=200 -run TestNewManagerImplStartProbeMode \
    k8s.io/kubernetes/pkg/kubelet/cm/devicemanager"
```

## Lessons Learned

Race conditions may manifest differently on different platforms. This race:
- Did not reproduce on macOS/arm64 even with 13,000+ test runs
- Reproduced consistently on Linux with ~10% failure rate
Always test on the same platform as CI when investigating flaky tests.
Logging structs that contain:
- gRPC clients/connections
- Network connections
- Mutexes or other synchronization primitives
- Channels
...using reflection-based formatters (such as `gomega/format`) can cause races. The reflection traverses all fields, including internal state that may be concurrently modified.
The race existed in the code before the gRPC update, but the timing changes in gRPC v1.78.0 made it detectable. This highlights the value of:
- Running tests with the race detector enabled
- Testing with updated dependencies before they are merged
- Jobs like `ci-kubernetes-unit-dependencies` that do both
The CI logs contained all the information needed to diagnose the issue:
- The exact stack trace showing where the race occurred
- The `differences.md` artifact showing which dependencies changed
- Multiple failure examples confirming the pattern
The original issue #121451 and PR #121494 provided context about previous race conditions in the same code area. While this specific race was different (a logging issue vs. a nil pointer issue), understanding the code's history helped narrow down the investigation.
Key files:
- `pkg/kubelet/cm/devicemanager/manager.go` - line 253
- `pkg/kubelet/cm/devicemanager/manager_test.go` - test file
- `pkg/kubelet/cm/devicemanager/endpoint.go` - endpoint struct definitions
- `pkg/kubelet/cm/devicemanager/plugin/v1beta1/client.go` - client disconnect logic
- `pkg/kubelet/cm/devicemanager/plugin/v1beta1/handler.go` - connection handling
- `pkg/kubelet/cm/devicemanager/plugin/v1beta1/server.go` - server stop logic
- `vendor/google.golang.org/grpc/internal/buffer/unbounded.go` - gRPC buffer (v1.78.0)
- Previous flake fix: kubernetes/kubernetes#121494
- Original issue: kubernetes/kubernetes#121451
| Commit | Description | Version | Link |
|---|---|---|---|
| `83bead40` | Set closed flag when closing channel in Load method | v1.76.0 | PR #8575 |
| `9b990b63` | Wait for all goroutines on close in gracefulswitch | v1.78.0 | PR #8746 |
| `dc355bc0` | Remove buffer copies while writing HTTP/2 Data frames | v1.77.0 | PR #8667 |
| `363018c3` | Avoid buffer copies when reading Data frames | v1.77.0 | PR #8657 |
| `50c6321b` | Allocate at 4KiB boundaries in fallback buffer pool | v1.78.0 | PR #8705 |
| `8ab0c821` | Avoid clearing new buffers | v1.77.0 | PR #8670 |
- Buffer closed flag bug: grpc/grpc-go#8572
- Graceful switch goroutine leak: grpc/grpc-go#8655
- https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-unit-dependencies/2008641956699181056/build-log.txt
- https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-unit-dependencies/2007491912692404224/build-log.txt
- https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-unit-dependencies/2007431281842851840/build-log.txt
- https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-unit-dependencies/2009065501396832256/build-log.txt
Document generated: January 2026
Author: Claude Code Analysis