
Device Manager Race Condition Analysis

Executive Summary

A flaky test TestNewManagerImplStartProbeMode in the Kubernetes device manager was causing CI failures in the ci-kubernetes-unit-dependencies job. After thorough investigation, the root cause was identified as a data race between logging code that uses reflection to format structs and gRPC connection cleanup code that modifies internal mutex state.

The race became reproducible after a gRPC dependency update from v1.75.0 to v1.78.0, which changed internal buffer handling and connection-close timing. The fix removes a problematic log statement that logged an entire struct containing a gRPC client.


Table of Contents

  1. Background
  2. Investigation Process
  3. Root Cause Analysis
  4. The gRPC Changes
  5. Call Graph Analysis
  6. The Fix
  7. Verification
  8. Lessons Learned

Background

The Failing Test

The test TestNewManagerImplStartProbeMode is located in pkg/kubelet/cm/devicemanager/manager_test.go. It tests the device plugin manager's probe mode functionality by:

  1. Setting up a device manager and plugin manager
  2. Starting them in goroutines
  3. Immediately calling cleanup to stop everything

This rapid setup-and-teardown pattern creates timing conditions that can expose race conditions.
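
In miniature, the pattern looks like this (an illustrative test, not the real one; worker stands in for the plugin manager):

package example

import "testing"

// worker stands in for the plugin manager: run() blocks until halt() is called.
type worker struct{ stop chan struct{} }

func (w *worker) run()  { <-w.stop }
func (w *worker) halt() { close(w.stop) }

// TestRapidSetupTeardown mirrors the flaky test's shape: the worker starts in
// a goroutine and teardown begins immediately, so shutdown overlaps whatever
// startup work is still in flight.
func TestRapidSetupTeardown(t *testing.T) {
	w := &worker{stop: make(chan struct{})}
	done := make(chan struct{})
	go func() { w.run(); close(done) }() // step 2: start in a goroutine
	w.halt()                             // step 3: immediate cleanup
	<-done
}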

CI Job Context

The ci-kubernetes-unit-dependencies job:

  • Updates Go module dependencies to their latest versions
  • Runs unit tests with the race detector enabled
  • Tests compatibility with newer dependency versions before they're merged

The job was failing intermittently with the following pattern:

  • Failure rate: ~10% of test runs when the race was triggered
  • Go version: the failures appeared with Go 1.26rc1, but the cause was the dependency updates, not the Go release
  • Key dependency change: gRPC v1.75.0 → v1.78.0

Affected CI Jobs

Four CI job logs were analyzed:

  • ci-kubernetes-unit-dependencies/2008641956699181056
  • ci-kubernetes-unit-dependencies/2007491912692404224
  • ci-kubernetes-unit-dependencies/2007431281842851840
  • ci-kubernetes-unit-dependencies/2009065501396832256

All four showed the identical DATA RACE pattern.


Investigation Process

Step 1: Initial Analysis

The investigation began by examining the CI build logs. The logs showed:

=== FAIL: pkg/kubelet/cm/devicemanager TestNewManagerImplStartProbeMode (0.05s)
==================
WARNING: DATA RACE
Read at 0x00c00063d47c by goroutine 204:
  reflect.Value.Int()
      /usr/local/go/src/reflect/value.go:1450 +0x524
  github.com/onsi/gomega/format.formatValue()
  ...
  k8s.io/kubernetes/pkg/kubelet/cm/devicemanager.(*ManagerImpl).PluginDisconnected()
      manager.go:253 +0x393

Step 2: Dependency Analysis

The differences.md artifact from the CI job revealed the key dependency changes:

| Package | Old Version | New Version |
|---------|-------------|-------------|
| google.golang.org/grpc | v1.75.0 | v1.78.0 |
| grpc-ecosystem/go-grpc-middleware/v2 | v2.3.0 | v2.3.3 |
| go.opentelemetry.io/otel | v1.38.0 | v1.39.0 |

The gRPC update was identified as the most likely culprit given the stack trace involved gRPC internals.

Step 3: Local Reproduction Attempts

Initial attempts to reproduce locally on macOS failed:

  • 13,000+ runs with Go 1.25.5: 0 failures
  • 2,500+ runs with Go 1.26rc1: 0 failures
  • 2,500+ runs with gRPC v1.78.0 on macOS: 0 failures

This suggested the race was platform-dependent or required specific timing conditions.

Step 4: Docker-Based Reproduction

Running in a Linux Docker container successfully reproduced the issue:

docker run --rm -v $(pwd):/kubernetes -w /kubernetes golang:latest \
  bash -c "go test -race -count=200 -run TestNewManagerImplStartProbeMode \
  k8s.io/kubernetes/pkg/kubelet/cm/devicemanager"

Results with gRPC v1.78.0 (unfixed code):

  • 1466 runs, 148 failures (10.10% failure rate)

Root Cause Analysis

The Racing Code

The race occurs at pkg/kubelet/cm/devicemanager/manager.go:253:

func (m *ManagerImpl) PluginDisconnected(logger klog.Logger, resourceName string) {
    m.mutex.Lock()
    defer m.mutex.Unlock()

    if ep, exists := m.endpoints[resourceName]; exists {
        m.markResourceUnhealthy(logger, resourceName)
        // THIS LINE CAUSES THE RACE:
        logger.V(2).Info("Endpoint became unhealthy", "resourceName", resourceName, "endpoint", ep)
        ep.e.setStopTime(time.Now())
    }
}

The Data Structures Involved

// endpointInfo contains the endpoint and its options
type endpointInfo struct {
    e       *endpointImpl
    options *pluginapi.DevicePluginOptions
}

// endpointImpl contains the gRPC client
type endpointImpl struct {
    mutex        sync.Mutex
    resourceName string
    api          pluginapi.DevicePluginClient  // gRPC client!
    stopTime     time.Time
    client       plugin.Client
}

The Race Mechanism

Goroutine A (Logger - Reader):

  1. PluginDisconnected() is called
  2. Logger formats ep for output
  3. gomega/format uses reflection to traverse the struct
  4. Reflection reads fields of the gRPC ClientConn including internal buffers

Goroutine B (gRPC Close - Writer):

  1. client.Disconnect() is called
  2. c.grpc.Close() is invoked
  3. gRPC closes internal buffer.Unbounded
  4. sync.Mutex.Unlock() modifies mutex state

The race occurs because both goroutines access the same memory location:

  • Reader: reflect.Value.Int() reading mutex state
  • Writer: sync/atomic.AddInt32() modifying mutex state
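
This class of race can be reproduced in a few lines, independent of Kubernetes and gRPC (an illustrative program, not the actual code; endpoint and walk are stand-ins for endpointImpl and the reflection-based formatter):

package main

import (
	"fmt"
	"reflect"
	"sync"
)

// endpoint stands in for endpointImpl: a struct whose fields contain
// synchronization state (here the mutex itself plays the ClientConn's role).
type endpoint struct {
	mu   sync.Mutex
	name string
}

// walk recursively reads every integer field, mimicking how a reflection-based
// formatter such as gomega/format traverses unexported struct internals.
func walk(v reflect.Value) {
	switch v.Kind() {
	case reflect.Struct:
		for i := 0; i < v.NumField(); i++ {
			walk(v.Field(i))
		}
	case reflect.Int, reflect.Int8, reflect.Int16, reflect.Int32, reflect.Int64:
		_ = v.Int() // plain, non-atomic read of state that Lock/Unlock mutate
	case reflect.Uint, reflect.Uint8, reflect.Uint16, reflect.Uint32, reflect.Uint64:
		_ = v.Uint()
	}
}

func main() {
	ep := &endpoint{name: "demo"}
	done := make(chan struct{})

	go func() { // writer: Lock/Unlock mutate mutex state with atomic operations
		for i := 0; i < 10000; i++ {
			ep.mu.Lock()
			ep.mu.Unlock()
		}
		close(done)
	}()

	for i := 0; i < 10000; i++ { // reader: reflective traversal, like the logger
		walk(reflect.ValueOf(ep).Elem())
	}
	<-done
	fmt.Println("done; run with -race to see the report")
}

Under go run -race, the report pairs the plain reflective read with the atomic writes from Lock/Unlock, the same shape as the CI failure above.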

Stack Trace Analysis

Read side (Goroutine 204):

reflect.Value.Int()
github.com/onsi/gomega/format.formatValue()
github.com/onsi/gomega/format.formatStruct()
  ... (multiple recursive calls)
k8s.io/kubernetes/test/utils/format.Object()
k8s.io/kubernetes/test/utils/ktesting.newLogger.func1()
k8s.io/klog/v2/internal/serialize.Formatter.formatAny()
k8s.io/klog/v2/ktesting.tlogger.Info()
k8s.io/kubernetes/pkg/kubelet/cm/devicemanager.(*ManagerImpl).PluginDisconnected()
    manager.go:253

Write side (Goroutine 282):

sync/atomic.AddInt32()
sync.(*Mutex).Unlock()
google.golang.org/grpc/internal/buffer.(*Unbounded).Close.deferwrap1()
google.golang.org/grpc.(*ccResolverWrapper).close()
google.golang.org/grpc.(*ClientConn).Close()
k8s.io/kubernetes/pkg/kubelet/cm/devicemanager/plugin/v1beta1.(*client).Disconnect()
    client.go:106

The gRPC Changes

Version Comparison

The race became detectable after upgrading from gRPC v1.75.0 to v1.78.0. Between these versions, there were 73 commits that changed various aspects of gRPC's internal behavior.

Key Commits Analyzed

The following commits from the grpc/grpc-go repository were identified as potentially relevant:

1. Buffer State Fix (v1.76.0)

Commit: 83bead40 - "internal/buffer: set closed flag when closing channel in the Load method"

 func (b *Unbounded) Load() {
     if len(b.backlog) > 0 {
         // ... handle backlog ...
     } else if b.closing && !b.closed {
+        b.closed = true
         close(b.c)
     }
 }

This fix addressed a bug where the closed field was never set to true when closing the buffer with pending backlog items. While this is a correctness fix, it changes the timing of state modifications during close operations.

2. Graceful Switch Goroutine Wait (v1.78.0)

Commit: 9b990b63 - "gracefulswitch: Wait for all goroutines on close"

  • PR: #8746
  • Related Issues: #8655
  • Introduced in: v1.78.0
// Before: Close() returned immediately
// After: Close() waits for all spawned goroutines

func (gsb *Balancer) Close() {
    // ... existing close logic ...
    currentBalancerToClose.Close()
    pendingBalancerToClose.Close()
+   gsb.activeGoroutines.Wait()  // NEW: Wait for goroutines
}

Release Note: "client: Closing a graceful switch balancer will now block until all pending goroutines complete."

This change significantly impacts timing during connection close. By making Close() wait for goroutines to finish, it changes when mutex operations complete relative to other operations.
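
In miniature, the new behavior looks like this (a sketch of the pattern only; apart from activeGoroutines.Wait(), shown in the diff above, the names and structure are illustrative):

package gracefulswitch

import "sync"

// Balancer tracks the goroutines it spawns so Close can wait for them.
type Balancer struct {
	activeGoroutines sync.WaitGroup
}

func (gsb *Balancer) spawn(f func()) {
	gsb.activeGoroutines.Add(1)
	go func() {
		defer gsb.activeGoroutines.Done()
		f()
	}()
}

// Close used to return as soon as the child balancers were closed; since
// v1.78.0 it also blocks until every spawned goroutine has exited.
func (gsb *Balancer) Close() {
	// ... close current and pending child balancers (elided) ...
	gsb.activeGoroutines.Wait()
}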

3. Transport Layer Optimizations (v1.77.0)

Several commits optimized buffer handling in the transport layer:

| Commit | Description | Version |
|--------|-------------|---------|
| dc355bc0 | Remove buffer copies while writing HTTP/2 Data frames | v1.77.0 |
| 363018c3 | Avoid buffer copies when reading Data frames | v1.77.0 |
| f959da61 | Reduce heap allocations | v1.77.0 |
| 50c6321b | Allocate at 4KiB boundaries in fallback buffer pool | v1.78.0 |
| 8ab0c821 | Avoid clearing new buffers, clear buffers from simpleBufferPools | v1.77.0 |

These optimizations changed memory allocation patterns and buffer lifecycle, which can affect timing of concurrent operations.

Internal Buffer Implementation

The buffer.Unbounded type in gRPC is used for passing updates between internal components:

// internal/buffer/unbounded.go
type Unbounded struct {
    c       chan any
    closed  bool
    closing bool
    mu      sync.Mutex  // This mutex is involved in the race
    backlog []any
}

func (b *Unbounded) Close() {
    b.mu.Lock()
    defer b.mu.Unlock()  // The unlock triggers atomic operations
    if b.closing {
        return
    }
    b.closing = true
    if len(b.backlog) == 0 {
        b.closed = true
        close(b.c)
    }
}

func (b *Unbounded) Load() {
    b.mu.Lock()
    defer b.mu.Unlock()
    if len(b.backlog) > 0 {
        // ... handle backlog ...
    } else if b.closing && !b.closed {
        b.closed = true  // Added in v1.76.0 (commit 83bead40)
        close(b.c)
    }
}
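
A consumer drains an Unbounded roughly like this (a sketch based on the package's accessors; Get(), which returns the underlying channel, is elided above, so treat the details as an assumption):

// consume receives one item at a time, calling Load() after each so the next
// backlog item moves into the channel, or, once Close() has been called and
// the backlog is empty, so Load() finally closes the channel (the write that
// commit 83bead40 made consistent).
func consume(b *Unbounded, handle func(item any)) {
	for item := range b.Get() {
		handle(item)
		b.Load()
	}
}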

Why the Race Became Visible in v1.78.0

The race existed latently in the Kubernetes code, but became reproducible due to a combination of gRPC changes:

  1. Buffer state fix (v1.76.0): The b.closed = true addition in Load() added a new write operation during close sequences, slightly changing timing.

  2. Graceful switch wait (v1.78.0): The most impactful change. By making Close() block until goroutines complete, it:

    • Extended the window during which internal state is being modified
    • Made the close operation more synchronous
    • Changed when mutex unlocks occur relative to other operations

  3. Transport optimizations (v1.77.0-v1.78.0): Buffer pool and allocation changes altered memory access patterns, potentially affecting cache behavior and timing.

Timeline of Changes

v1.75.0 (Kubernetes baseline)
    |
    v--- Commit 83bead40: Buffer closed flag fix
    |
v1.76.0
    |
    v--- Multiple transport optimizations
    |
v1.77.0
    |
    v--- Commit 9b990b63: Graceful switch waits for goroutines
    |
v1.78.0 (Race becomes reproducible)

The Fundamental Issue

The underlying problem is not a gRPC bug - gRPC's changes are all correct and improve the library. The issue is in the Kubernetes code:

Logging a struct containing a gRPC client via reflection during connection shutdown is inherently unsafe.

The gRPC changes simply made an existing latent bug in Kubernetes more likely to manifest. This is a common pattern where dependency updates expose pre-existing race conditions that were previously hidden by favorable timing.


Call Graph Analysis

Test Execution Flow

TestNewManagerImplStartProbeMode (manager_test.go:122)
├── setupInProbeMode()
│   ├── setupDeviceManager()
│   ├── setupDevicePlugin()
│   └── setupPluginManager()
│       └── go pluginManager.Run()  // Starts in background
└── cleanup() (manager_test.go:361)
    ├── p.Stop()  // Stop plugin stub
    └── m.Stop() (manager.go:361)
        └── server.Stop() (server.go:131)
            └── visitClients() (server.go:203)
                └── Stop.func1() (server.go:132)
                    └── disconnectClient() (handler.go:104)
                        └── client.Disconnect() (client.go:112)
                            ├── c.grpc.Close() (client.go:106)     ← WRITER
                            └── handler.PluginDisconnected()
                                └── PluginDisconnected() (manager.go:253)
                                    └── logger.Info("endpoint", ep) ← READER

The Race Window

The race window exists because:

  1. client.Disconnect() first calls c.grpc.Close() to close the gRPC connection
  2. Then it calls PluginDisconnected() which logs the endpoint
  3. The gRPC Close() operation continues asynchronously in background goroutines
  4. The logger uses reflection to read the gRPC client struct
  5. The background goroutines modify mutex state in the gRPC client
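
Put as code, that ordering looks roughly like this (a paraphrase of plugin/v1beta1/client.go with the types reduced to what the ordering needs; names and signatures are approximate, not the exact source):

type conn interface{ Close() error }

type client struct {
	mutex    sync.Mutex
	grpc     conn
	resource string
	handler  interface{ PluginDisconnected(resource string) }
}

func (c *client) Disconnect() error {
	c.mutex.Lock()
	if c.grpc != nil {
		c.grpc.Close() // (1) starts teardown; gRPC goroutines keep mutating connection state
		c.grpc = nil
	}
	c.mutex.Unlock()
	// (2) runs while (1) is still winding down in the background; the handler
	// ends up logging the endpoint struct that references the same connection.
	c.handler.PluginDisconnected(c.resource)
	return nil
}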

The Fix

Code Change

Before (manager.go:253):

if ep, exists := m.endpoints[resourceName]; exists {
    m.markResourceUnhealthy(logger, resourceName)
    logger.V(2).Info("Endpoint became unhealthy", "resourceName", resourceName, "endpoint", ep)
    ep.e.setStopTime(time.Now())
}

After:

if ep, exists := m.endpoints[resourceName]; exists {
    m.markResourceUnhealthy(logger, resourceName)
    logger.V(2).Info("Endpoint became unhealthy", "resourceName", resourceName)
    ep.e.setStopTime(time.Now())
}

Rationale

  1. The endpoint field is redundant: The resourceName is already in the log message and uniquely identifies the endpoint
  2. Logging complex structs is dangerous: Structs containing gRPC clients, network connections, or other concurrent objects should not be logged via reflection
  3. Minimal change: The fix removes only the problematic field without changing other behavior

Alternative Approaches Considered

  1. Copy safe fields before logging: Would require creating a new struct with only safe-to-log fields
  2. Implement String() method on endpointInfo: Would prevent reflection-based formatting but requires more code
  3. Log before Close(): Would require restructuring the disconnect flow

The chosen approach is the simplest and most direct fix.
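
For comparison, alternative 2 would have looked roughly like this (a sketch against the struct definitions shown earlier, not tested code; only immutable fields may be read, since anything guarded by endpointImpl.mutex would need locking):

// String gives formatters a safe, hand-written representation so they never
// reflect over the api field and its gRPC ClientConn.
func (e endpointInfo) String() string {
	return fmt.Sprintf("endpointInfo{resourceName: %q}", e.e.resourceName)
}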


Verification

Test Results

| Test Configuration | Platform | Runs | Failures | Rate |
|--------------------|----------|------|----------|------|
| Unfixed + gRPC v1.75.0 | macOS | 13,000+ | 0 | 0% |
| Unfixed + gRPC v1.78.0 | macOS | 2,500+ | 0 | 0% |
| Unfixed + gRPC v1.78.0 | Linux (Docker) | 1,466 | 148 | 10.10% |
| Fixed + gRPC v1.78.0 | Linux (Docker) | 200 | 0 | 0% |

Reproduction Commands

To reproduce the race (unfixed code + gRPC v1.78.0):

# Pin gRPC to v1.78.0
./hack/pin-dependency.sh google.golang.org/grpc v1.78.0
./hack/update-vendor.sh

# Run in Linux container
docker run --rm -v $(pwd):/kubernetes -w /kubernetes golang:latest \
  bash -c "go test -race -count=200 -run TestNewManagerImplStartProbeMode \
  k8s.io/kubernetes/pkg/kubelet/cm/devicemanager"

To verify the fix:

# Apply fix to manager.go:253
# Remove "endpoint", ep from the log statement

# Run same test - should pass with 0 races
docker run --rm -v $(pwd):/kubernetes -w /kubernetes golang:latest \
  bash -c "go test -race -count=200 -run TestNewManagerImplStartProbeMode \
  k8s.io/kubernetes/pkg/kubelet/cm/devicemanager"

Lessons Learned

1. Platform-Dependent Race Conditions

Race conditions may manifest differently on different platforms. This race:

  • Did not reproduce on macOS/arm64 even with 13,000+ test runs
  • Reproduced consistently on Linux with ~10% failure rate

Always test on the same platform as CI when investigating flaky tests.

2. Logging Complex Structs is Risky

Logging structs that contain:

  • gRPC clients/connections
  • Network connections
  • Mutexes or other synchronization primitives
  • Channels

...using reflection-based formatters (like gomega/format) can cause races. The reflection traverses all fields, including internal state that may be concurrently modified.
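
When a struct genuinely needs to appear in logs, one defensive pattern is logr.Marshaler, which klog-style structured loggers consult before falling back to reflection (a hedged sketch, not a project convention; formatters that ignore the interface would still reflect over the struct):

// MarshalLog hands the logger a small value containing only immutable fields,
// so nothing concurrently mutated is ever traversed. Hypothetical adaptation
// for the endpointInfo struct from this analysis.
func (e endpointInfo) MarshalLog() any {
	return struct{ ResourceName string }{e.e.resourceName}
}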

3. Dependency Updates Can Expose Latent Bugs

The race existed in the code before the gRPC update, but the timing changes in gRPC v1.78.0 made it detectable. This highlights the importance of:

  • Running unit tests with the race detector enabled
  • Testing with updated dependencies before they are merged
  • Maintaining dedicated jobs such as ci-kubernetes-unit-dependencies

4. CI Log Analysis is Critical

The CI logs contained all the information needed to diagnose the issue:

  • The exact stack trace showing where the race occurred
  • The differences.md artifact showing which dependencies changed
  • Multiple failure examples confirming the pattern

5. Historical Context Helps

The original issue #121451 and PR #121494 provided context about previous race conditions in the same code area. While this specific race was different (a logging issue vs. a nil pointer issue), understanding the code's history helped narrow down the investigation.


Appendix: Related Files

Files Modified

  • pkg/kubelet/cm/devicemanager/manager.go - Line 253

Files Analyzed

  • pkg/kubelet/cm/devicemanager/manager_test.go - Test file
  • pkg/kubelet/cm/devicemanager/endpoint.go - Endpoint struct definitions
  • pkg/kubelet/cm/devicemanager/plugin/v1beta1/client.go - Client disconnect logic
  • pkg/kubelet/cm/devicemanager/plugin/v1beta1/handler.go - Connection handling
  • pkg/kubelet/cm/devicemanager/plugin/v1beta1/server.go - Server stop logic
  • vendor/google.golang.org/grpc/internal/buffer/unbounded.go - gRPC buffer (v1.78.0)

Kubernetes Issues and PRs

  • Issue #121451 and PR #121494 - an earlier race (a nil pointer issue) in the same code area, referenced for historical context

gRPC-Go Commits Analyzed

| Commit | Description | Version | Link |
|--------|-------------|---------|------|
| 83bead40 | Set closed flag when closing channel in Load method | v1.76.0 | PR #8575 |
| 9b990b63 | Wait for all goroutines on close in gracefulswitch | v1.78.0 | PR #8746 |
| dc355bc0 | Remove buffer copies while writing HTTP/2 Data frames | v1.77.0 | PR #8667 |
| 363018c3 | Avoid buffer copies when reading Data frames | v1.77.0 | PR #8657 |
| 50c6321b | Allocate at 4KiB boundaries in fallback buffer pool | v1.78.0 | PR #8705 |
| 8ab0c821 | Avoid clearing new buffers | v1.77.0 | PR #8670 |

gRPC-Go Issues Referenced

  • grpc/grpc-go#8655 - the issue addressed by the graceful switch change (PR #8746)

CI Job Logs Analyzed

  • ci-kubernetes-unit-dependencies/2008641956699181056
  • ci-kubernetes-unit-dependencies/2007491912692404224
  • ci-kubernetes-unit-dependencies/2007431281842851840
  • ci-kubernetes-unit-dependencies/2009065501396832256

Document generated: January 2026. Author: Claude Code Analysis.
