
Device Manager Race Condition Analysis

Executive Summary

A flaky test TestNewManagerImplStartProbeMode in the Kubernetes device manager was causing CI failures in the ci-kubernetes-unit-dependencies job. After thorough investigation, the root cause was identified as a data race between logging code that uses reflection to format structs and gRPC connection cleanup code that modifies internal mutex state.

The race became reproducible after a gRPC dependency update from v1.75.0 to v1.78.0, which changed internal buffer handling and connection-close timing. The fix removes a problematic log statement that logged an entire struct containing a gRPC client.


Table of Contents

  1. Background
  2. Investigation Process
  3. Root Cause Analysis
  4. The gRPC Changes
  5. Call Graph Analysis
  6. The Fix
  7. Verification
  8. Lessons Learned

Background

The Failing Test

The test TestNewManagerImplStartProbeMode is located in pkg/kubelet/cm/devicemanager/manager_test.go. It tests the device plugin manager's probe mode functionality by:

  1. Setting up a device manager and plugin manager
  2. Starting them in goroutines
  3. Immediately calling cleanup to stop everything

This rapid setup-and-teardown pattern creates timing conditions that can expose race conditions.
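
In miniature, the pattern looks like this (an illustrative test, not the real one; worker stands in for the plugin manager):

package example

import "testing"

// worker stands in for the plugin manager: run() blocks until halt() is called.
type worker struct{ stop chan struct{} }

func (w *worker) run()  { <-w.stop }
func (w *worker) halt() { close(w.stop) }

// TestRapidSetupTeardown mirrors the flaky test's shape: the worker starts in
// a goroutine and teardown begins immediately, so shutdown overlaps whatever
// startup work is still in flight.
func TestRapidSetupTeardown(t *testing.T) {
	w := &worker{stop: make(chan struct{})}
	done := make(chan struct{})
	go func() { w.run(); close(done) }() // step 2: start in a goroutine
	w.halt()                             // step 3: immediate cleanup
	<-done
}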

CI Job Context

The ci-kubernetes-unit-dependencies job:

  • Updates Go module dependencies to their latest versions
  • Runs unit tests with the race detector enabled
  • Tests compatibility with newer dependency versions before they're merged

The job was failing intermittently with the following pattern:

  • Failure rate: ~10% of test runs when the race was triggered
  • Go version: the failures appeared with Go 1.26rc1, but the cause was the dependency updates, not the Go release
  • Key dependency change: gRPC v1.75.0 → v1.78.0

Affected CI Jobs

Four CI job logs were analyzed:

  • ci-kubernetes-unit-dependencies/2008641956699181056
  • ci-kubernetes-unit-dependencies/2007491912692404224
  • ci-kubernetes-unit-dependencies/2007431281842851840
  • ci-kubernetes-unit-dependencies/2009065501396832256

All four showed the identical DATA RACE pattern.


Investigation Process

Step 1: Initial Analysis

The investigation began by examining the CI build logs. The logs showed:

=== FAIL: pkg/kubelet/cm/devicemanager TestNewManagerImplStartProbeMode (0.05s)
==================
WARNING: DATA RACE
Read at 0x00c00063d47c by goroutine 204:
  reflect.Value.Int()
      /usr/local/go/src/reflect/value.go:1450 +0x524
  github.com/onsi/gomega/format.formatValue()
  ...
  k8s.io/kubernetes/pkg/kubelet/cm/devicemanager.(*ManagerImpl).PluginDisconnected()
      manager.go:253 +0x393

Step 2: Dependency Analysis

The differences.md artifact from the CI job revealed the key dependency changes:

| Package | Old Version | New Version |
|---------|-------------|-------------|
| google.golang.org/grpc | v1.75.0 | v1.78.0 |
| grpc-ecosystem/go-grpc-middleware/v2 | v2.3.0 | v2.3.3 |
| go.opentelemetry.io/otel | v1.38.0 | v1.39.0 |

The gRPC update was identified as the most likely culprit given the stack trace involved gRPC internals.

Step 3: Local Reproduction Attempts

Initial attempts to reproduce locally on macOS failed:

  • 13,000+ runs with Go 1.25.5: 0 failures
  • 2,500+ runs with Go 1.26rc1: 0 failures
  • 2,500+ runs with gRPC v1.78.0 on macOS: 0 failures

This suggested the race was platform-dependent or required specific timing conditions.

Step 4: Docker-Based Reproduction

Running in a Linux Docker container successfully reproduced the issue:

docker run --rm -v $(pwd):/kubernetes -w /kubernetes golang:latest \
  bash -c "go test -race -count=200 -run TestNewManagerImplStartProbeMode \
  k8s.io/kubernetes/pkg/kubelet/cm/devicemanager"

Results with gRPC v1.78.0 (unfixed code):

  • 1466 runs, 148 failures (10.10% failure rate)

Root Cause Analysis

The Racing Code

The race occurs at pkg/kubelet/cm/devicemanager/manager.go:253:

func (m *ManagerImpl) PluginDisconnected(logger klog.Logger, resourceName string) {
    m.mutex.Lock()
    defer m.mutex.Unlock()

    if ep, exists := m.endpoints[resourceName]; exists {
        m.markResourceUnhealthy(logger, resourceName)
        // THIS LINE CAUSES THE RACE:
        logger.V(2).Info("Endpoint became unhealthy", "resourceName", resourceName, "endpoint", ep)
        ep.e.setStopTime(time.Now())
    }
}

The Data Structures Involved

// endpointInfo contains the endpoint and its options
type endpointInfo struct {
    e       *endpointImpl
    options *pluginapi.DevicePluginOptions
}

// endpointImpl contains the gRPC client
type endpointImpl struct {
    mutex        sync.Mutex
    resourceName string
    api          pluginapi.DevicePluginClient  // gRPC client!
    stopTime     time.Time
    client       plugin.Client
}

The Race Mechanism

Goroutine A (Logger - Reader):

  1. PluginDisconnected() is called
  2. Logger formats ep for output
  3. gomega/format uses reflection to traverse the struct
  4. Reflection reads fields of the gRPC ClientConn including internal buffers

Goroutine B (gRPC Close - Writer):

  1. client.Disconnect() is called
  2. c.grpc.Close() is invoked
  3. gRPC closes internal buffer.Unbounded
  4. sync.Mutex.Unlock() modifies mutex state

The race occurs because both goroutines access the same memory location:

  • Reader: reflect.Value.Int() reading mutex state
  • Writer: sync/atomic.AddInt32() modifying mutex state
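
This class of race can be reproduced in a few lines, independent of Kubernetes and gRPC (an illustrative program, not the actual code; endpoint and walk are stand-ins for endpointImpl and the reflection-based formatter):

package main

import (
	"fmt"
	"reflect"
	"sync"
)

// endpoint stands in for endpointImpl: a struct whose fields contain
// synchronization state (here the mutex itself plays the ClientConn's role).
type endpoint struct {
	mu   sync.Mutex
	name string
}

// walk recursively reads every integer field, mimicking how a reflection-based
// formatter such as gomega/format traverses unexported struct internals.
func walk(v reflect.Value) {
	switch v.Kind() {
	case reflect.Struct:
		for i := 0; i < v.NumField(); i++ {
			walk(v.Field(i))
		}
	case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
		_ = v.Int() // plain, non-atomic read of state that Lock/Unlock mutate
	case reflect.Uint, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64:
		_ = v.Uint()
	}
}

func main() {
	ep := &endpoint{name: "demo"}
	done := make(chan struct{})

	go func() { // writer: Lock/Unlock mutate mutex state with atomic operations
		for i := 0; i < 10000; i++ {
			ep.mu.Lock()
			ep.mu.Unlock()
		}
		close(done)
	}()

	for i := 0; i < 10000; i++ { // reader: reflective traversal, like the logger
		walk(reflect.ValueOf(ep).Elem())
	}
	<-done
	fmt.Println("done; run with -race to see the report")
}

Under go run -race, the report pairs the plain reflective read with the atomic writes from Lock/Unlock, the same shape as the CI failure above.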

Stack Trace Analysis

Read side (Goroutine 204):

reflect.Value.Int()
github.com/onsi/gomega/format.formatValue()
github.com/onsi/gomega/format.formatStruct()
  ... (multiple recursive calls)
k8s.io/kubernetes/test/utils/format.Object()
k8s.io/kubernetes/test/utils/ktesting.newLogger.func1()
k8s.io/klog/v2/internal/serialize.Formatter.formatAny()
k8s.io/klog/v2/ktesting.tlogger.Info()
k8s.io/kubernetes/pkg/kubelet/cm/devicemanager.(*ManagerImpl).PluginDisconnected()
    manager.go:253

Write side (Goroutine 282):

sync/atomic.AddInt32()
sync.(*Mutex).Unlock()
google.golang.org/grpc/internal/buffer.(*Unbounded).Close.deferwrap1()
google.golang.org/grpc.(*ccResolverWrapper).close()
google.golang.org/grpc.(*ClientConn).Close()
k8s.io/kubernetes/pkg/kubelet/cm/devicemanager/plugin/v1beta1.(*client).Disconnect()
    client.go:106

The gRPC Changes

Version Comparison

The race became detectable after upgrading from gRPC v1.75.0 to v1.78.0. Between these versions, there were 73 commits that changed various aspects of gRPC's internal behavior.

Key Commits Analyzed

The following commits from the grpc/grpc-go repository were identified as potentially relevant:

1. Buffer State Fix (v1.76.0)

Commit: 83bead40 - "internal/buffer: set closed flag when closing channel in the Load method"

 func (b *Unbounded) Load() {
     if len(b.backlog) > 0 {
         // ... handle backlog ...
     } else if b.closing && !b.closed {
+        b.closed = true
         close(b.c)
     }
 }

This fix addressed a bug where the closed field was never set to true when closing the buffer with pending backlog items. While this is a correctness fix, it changes the timing of state modifications during close operations.

2. Graceful Switch Goroutine Wait (v1.78.0)

Commit: 9b990b63 - "gracefulswitch: Wait for all goroutines on close"

  • PR: #8746
  • Related Issues: #8655
  • Introduced in: v1.78.0
// Before: Close() returned immediately
// After: Close() waits for all spawned goroutines

func (gsb *Balancer) Close() {
    // ... existing close logic ...
    currentBalancerToClose.Close()
    pendingBalancerToClose.Close()
+   gsb.activeGoroutines.Wait()  // NEW: Wait for goroutines
}

Release Note: "client: Closing a graceful switch balancer will now block until all pending goroutines complete."

This change significantly impacts timing during connection close. By making Close() wait for goroutines to finish, it changes when mutex operations complete relative to other operations.
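
In miniature, the new behavior looks like this (a sketch of the pattern only; apart from activeGoroutines.Wait(), shown in the diff above, the names and structure are illustrative):

package gracefulswitch

import "sync"

// Balancer tracks the goroutines it spawns so Close can wait for them.
type Balancer struct {
	activeGoroutines sync.WaitGroup
}

func (gsb *Balancer) spawn(f func()) {
	gsb.activeGoroutines.Add(1)
	go func() {
		defer gsb.activeGoroutines.Done()
		f()
	}()
}

// Close used to return as soon as the child balancers were closed; since
// v1.78.0 it also blocks until every spawned goroutine has exited.
func (gsb *Balancer) Close() {
	// ... close current and pending child balancers (elided) ...
	gsb.activeGoroutines.Wait()
}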

3. Transport Layer Optimizations (v1.77.0)

Several commits optimized buffer handling in the transport layer:

| Commit | Description | Version |
|--------|-------------|---------|
| dc355bc0 | Remove buffer copies while writing HTTP/2 Data frames | v1.77.0 |
| 363018c3 | Avoid buffer copies when reading Data frames | v1.77.0 |
| f959da61 | Reduce heap allocations | v1.77.0 |
| 50c6321b | Allocate at 4KiB boundaries in fallback buffer pool | v1.78.0 |
| 8ab0c821 | Avoid clearing new buffers, clear buffers from simpleBufferPools | v1.77.0 |

These optimizations changed memory allocation patterns and buffer lifecycle, which can affect timing of concurrent operations.

Internal Buffer Implementation

The buffer.Unbounded type in gRPC is used for passing updates between internal components:

// internal/buffer/unbounded.go
type Unbounded struct {
    c       chan any
    closed  bool
    closing bool
    mu      sync.Mutex  // This mutex is involved in the race
    backlog []any
}

func (b *Unbounded) Close() {
    b.mu.Lock()
    defer b.mu.Unlock()  // The unlock triggers atomic operations
    if b.closing {
        return
    }
    b.closing = true
    if len(b.backlog) == 0 {
        b.closed = true
        close(b.c)
    }
}

func (b *Unbounded) Load() {
    b.mu.Lock()
    defer b.mu.Unlock()
    if len(b.backlog) > 0 {
        // ... handle backlog ...
    } else if b.closing && !b.closed {
        b.closed = true  // Added in v1.76.0 (commit 83bead40)
        close(b.c)
    }
}
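
A consumer drains an Unbounded roughly like this (a sketch based on the package's accessors; Get(), which returns the underlying channel, is elided above, so treat the details as an assumption):

// consume receives one item at a time, calling Load() after each so the next
// backlog item moves into the channel, or, once Close() has been called and
// the backlog is empty, so Load() finally closes the channel (the write that
// commit 83bead40 made consistent).
func consume(b *Unbounded, handle func(item any)) {
	for item := range b.Get() {
		handle(item)
		b.Load()
	}
}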

Why the Race Became Visible in v1.78.0

The race existed latently in the Kubernetes code, but became reproducible due to a combination of gRPC changes:

  1. Buffer state fix (v1.76.0): The b.closed = true addition in Load() added a new write operation during close sequences, slightly changing timing.

  2. Graceful switch wait (v1.78.0): The most impactful change. By making Close() block until goroutines complete, it:

    • Extended the window during which internal state is being modified
    • Made the close operation more synchronous
    • Changed when mutex unlocks occur relative to other operations

  3. Transport optimizations (v1.77.0-v1.78.0): Buffer pool and allocation changes altered memory access patterns, potentially affecting cache behavior and timing.

Timeline of Changes

v1.75.0 (Kubernetes baseline)
    |
    v--- Commit 83bead40: Buffer closed flag fix
    |
v1.76.0
    |
    v--- Multiple transport optimizations
    |
v1.77.0
    |
    v--- Commit 9b990b63: Graceful switch waits for goroutines
    |
v1.78.0 (Race becomes reproducible)

The Fundamental Issue

The underlying problem is not a gRPC bug - gRPC's changes are all correct and improve the library. The issue is in the Kubernetes code:

Logging a struct containing a gRPC client via reflection during connection shutdown is inherently unsafe.

The gRPC changes simply made an existing latent bug in Kubernetes more likely to manifest. This is a common pattern where dependency updates expose pre-existing race conditions that were previously hidden by favorable timing.


Call Graph Analysis

Test Execution Flow

TestNewManagerImplStartProbeMode (manager_test.go:122)
├── setupInProbeMode()
│   ├── setupDeviceManager()
│   ├── setupDevicePlugin()
│   └── setupPluginManager()
│       └── go pluginManager.Run()  // Starts in background
└── cleanup() (manager_test.go:361)
    ├── p.Stop()  // Stop plugin stub
    └── m.Stop() (manager.go:361)
        └── server.Stop() (server.go:131)
            └── visitClients() (server.go:203)
                └── Stop.func1() (server.go:132)
                    └── disconnectClient() (handler.go:104)
                        └── client.Disconnect() (client.go:112)
                            ├── c.grpc.Close() (client.go:106)     ← WRITER
                            └── handler.PluginDisconnected()
                                └── PluginDisconnected() (manager.go:253)
                                    └── logger.Info("endpoint", ep) ← READER

The Race Window

The race window exists because:

  1. client.Disconnect() first calls c.grpc.Close() to close the gRPC connection
  2. Then it calls PluginDisconnected() which logs the endpoint
  3. The gRPC Close() operation continues asynchronously in background goroutines
  4. The logger uses reflection to read the gRPC client struct
  5. The background goroutines modify mutex state in the gRPC client
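
Put as code, that ordering looks roughly like this (a paraphrase of plugin/v1beta1/client.go with the types reduced to what the ordering needs; names and signatures are approximate, not the exact source):

type conn interface{ Close() error }

type client struct {
	mutex    sync.Mutex
	grpc     conn
	resource string
	handler  interface{ PluginDisconnected(resource string) }
}

func (c *client) Disconnect() error {
	c.mutex.Lock()
	if c.grpc != nil {
		c.grpc.Close() // (1) starts teardown; gRPC goroutines keep mutating connection state
		c.grpc = nil
	}
	c.mutex.Unlock()
	// (2) runs while (1) is still winding down in the background; the handler
	// ends up logging the endpoint struct that references the same connection.
	c.handler.PluginDisconnected(c.resource)
	return nil
}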

The Fix

Code Change

Before (manager.go:253):

if ep, exists := m.endpoints[resourceName]; exists {
    m.markResourceUnhealthy(logger, resourceName)
    logger.V(2).Info("Endpoint became unhealthy", "resourceName", resourceName, "endpoint", ep)
    ep.e.setStopTime(time.Now())
}

After:

if ep, exists := m.endpoints[resourceName]; exists {
    m.markResourceUnhealthy(logger, resourceName)
    logger.V(2).Info("Endpoint became unhealthy", "resourceName", resourceName)
    ep.e.setStopTime(time.Now())
}

Rationale

  1. The endpoint field is redundant: The resourceName is already in the log message and uniquely identifies the endpoint
  2. Logging complex structs is dangerous: Structs containing gRPC clients, network connections, or other concurrent objects should not be logged via reflection
  3. Minimal change: The fix removes only the problematic field without changing other behavior

Alternative Approaches Considered

  1. Copy safe fields before logging: Would require creating a new struct with only safe-to-log fields
  2. Implement String() method on endpointInfo: Would prevent reflection-based formatting but requires more code
  3. Log before Close(): Would require restructuring the disconnect flow

The chosen approach is the simplest and most direct fix.
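
For comparison, alternative 2 would have looked roughly like this (a sketch against the struct definitions shown earlier, not tested code; only immutable fields may be read, since anything guarded by endpointImpl.mutex would need locking):

// String gives formatters a safe, hand-written representation so they never
// reflect over the api field and its gRPC ClientConn.
func (e endpointInfo) String() string {
	return fmt.Sprintf("endpointInfo{resourceName: %q}", e.e.resourceName)
}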


Verification

Test Results

| Test Configuration | Platform | Runs | Failures | Rate |
|--------------------|----------|------|----------|------|
| Unfixed + gRPC v1.75.0 | macOS | 13,000+ | 0 | 0% |
| Unfixed + gRPC v1.78.0 | macOS | 2,500+ | 0 | 0% |
| Unfixed + gRPC v1.78.0 | Linux (Docker) | 1,466 | 148 | 10.10% |
| Fixed + gRPC v1.78.0 | Linux (Docker) | 200 | 0 | 0% |

Reproduction Commands

To reproduce the race (unfixed code + gRPC v1.78.0):

# Pin gRPC to v1.78.0
./hack/pin-dependency.sh google.golang.org/grpc v1.78.0
./hack/update-vendor.sh

# Run in Linux container
docker run --rm -v $(pwd):/kubernetes -w /kubernetes golang:latest \
  bash -c "go test -race -count=200 -run TestNewManagerImplStartProbeMode \
  k8s.io/kubernetes/pkg/kubelet/cm/devicemanager"

To verify the fix:

# Apply fix to manager.go:253
# Remove "endpoint", ep from the log statement

# Run same test - should pass with 0 races
docker run --rm -v $(pwd):/kubernetes -w /kubernetes golang:latest \
  bash -c "go test -race -count=200 -run TestNewManagerImplStartProbeMode \
  k8s.io/kubernetes/pkg/kubelet/cm/devicemanager"

Lessons Learned

1. Platform-Dependent Race Conditions

Race conditions may manifest differently on different platforms. This race:

  • Did not reproduce on macOS/arm64 even with 13,000+ test runs
  • Reproduced consistently on Linux with ~10% failure rate

Always test on the same platform as CI when investigating flaky tests.

2. Logging Complex Structs is Risky

Logging structs that contain:

  • gRPC clients/connections
  • Network connections
  • Mutexes or other synchronization primitives
  • Channels

...using reflection-based formatters (like gomega/format) can cause races. The reflection traverses all fields, including internal state that may be concurrently modified.
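
When a struct genuinely needs to appear in logs, one defensive pattern is logr.Marshaler, which klog-style structured loggers consult before falling back to reflection (a hedged sketch, not a project convention; formatters that ignore the interface would still reflect over the struct):

// MarshalLog hands the logger a small value containing only immutable fields,
// so nothing concurrently mutated is ever traversed. Hypothetical adaptation
// for the endpointInfo struct from this analysis.
func (e endpointInfo) MarshalLog() any {
	return struct{ ResourceName string }{e.e.resourceName}
}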

3. Dependency Updates Can Expose Latent Bugs

The race existed in the code before the gRPC update, but the timing changes in gRPC v1.78.0 made it detectable. This highlights the importance of:

  • Running unit tests with the race detector enabled
  • Testing with updated dependencies before they are merged
  • Maintaining dedicated jobs such as ci-kubernetes-unit-dependencies

4. CI Log Analysis is Critical

The CI logs contained all the information needed to diagnose the issue:

  • The exact stack trace showing where the race occurred
  • The differences.md artifact showing which dependencies changed
  • Multiple failure examples confirming the pattern

5. Historical Context Helps

The original issue #121451 and PR #121494 provided context about previous race conditions in the same code area. While this specific race was different (a logging issue vs. a nil pointer issue), understanding the code's history helped narrow down the investigation.


Appendix: Related Files

Files Modified

  • pkg/kubelet/cm/devicemanager/manager.go - Line 253

Files Analyzed

  • pkg/kubelet/cm/devicemanager/manager_test.go - Test file
  • pkg/kubelet/cm/devicemanager/endpoint.go - Endpoint struct definitions
  • pkg/kubelet/cm/devicemanager/plugin/v1beta1/client.go - Client disconnect logic
  • pkg/kubelet/cm/devicemanager/plugin/v1beta1/handler.go - Connection handling
  • pkg/kubelet/cm/devicemanager/plugin/v1beta1/server.go - Server stop logic
  • vendor/google.golang.org/grpc/internal/buffer/unbounded.go - gRPC buffer (v1.78.0)

Kubernetes Issues and PRs

  • Issue #121451 and PR #121494 - an earlier race (a nil pointer issue) in the same code area, referenced for historical context

gRPC-Go Commits Analyzed

| Commit | Description | Version | Link |
|--------|-------------|---------|------|
| 83bead40 | Set closed flag when closing channel in Load method | v1.76.0 | PR #8575 |
| 9b990b63 | Wait for all goroutines on close in gracefulswitch | v1.78.0 | PR #8746 |
| dc355bc0 | Remove buffer copies while writing HTTP/2 Data frames | v1.77.0 | PR #8667 |
| 363018c3 | Avoid buffer copies when reading Data frames | v1.77.0 | PR #8657 |
| 50c6321b | Allocate at 4KiB boundaries in fallback buffer pool | v1.78.0 | PR #8705 |
| 8ab0c821 | Avoid clearing new buffers | v1.77.0 | PR #8670 |

gRPC-Go Issues Referenced

  • grpc/grpc-go#8655 - the issue addressed by the graceful switch change (PR #8746)

CI Job Logs Analyzed

  • ci-kubernetes-unit-dependencies/2008641956699181056
  • ci-kubernetes-unit-dependencies/2007491912692404224
  • ci-kubernetes-unit-dependencies/2007431281842851840
  • ci-kubernetes-unit-dependencies/2009065501396832256

Document generated: January 2026. Author: Claude Code Analysis.
