<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Payload Analysis: 4.22.0-0.nightly-2026-03-04-084819</title>
<style>
body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; margin: 0; padding: 20px; background: #f5f5f5; color: #333; }
.container { max-width: 1200px; margin: 0 auto; }
h1 { color: #1a1a2e; border-bottom: 3px solid #e94560; padding-bottom: 10px; }
h2 { color: #16213e; margin-top: 30px; border-bottom: 2px solid #0f3460; padding-bottom: 5px; }
h3 { color: #533483; }
.card { background: white; border-radius: 8px; padding: 20px; margin: 15px 0; box-shadow: 0 2px 4px rgba(0,0,0,0.1); }
.executive-summary { background: linear-gradient(135deg, #1a1a2e, #16213e); color: white; border-radius: 8px; padding: 25px; margin: 20px 0; }
.executive-summary h2 { color: #e94560; border-bottom-color: #e94560; }
.badge { display: inline-block; padding: 4px 12px; border-radius: 12px; font-size: 0.85em; font-weight: 600; margin: 2px; }
.badge-ready { background: #fff3cd; color: #856404; }
.badge-rejected { background: #f8d7da; color: #721c24; }
.badge-accepted { background: #d4edda; color: #155724; }
.badge-failed { background: #f8d7da; color: #721c24; }
.badge-low { background: #d1ecf1; color: #0c5460; }
.badge-infra { background: #e2e3e5; color: #383d41; }
.badge-bug { background: #f8d7da; color: #721c24; }
.root-cause-box { background: #fff0f0; border: 2px solid #dc3545; border-radius: 8px; padding: 20px; margin: 15px 0; }
.root-cause-box h3 { color: #dc3545; margin-top: 0; }
.code-block { background: #1e1e1e; color: #d4d4d4; padding: 15px; border-radius: 6px; font-family: 'Fira Code', 'Consolas', monospace; font-size: 0.85em; overflow-x: auto; margin: 10px 0; }
.deadlock-diagram { background: #f8f9fa; border: 1px solid #dee2e6; border-radius: 8px; padding: 20px; margin: 15px 0; text-align: center; font-family: monospace; }
table { width: 100%; border-collapse: collapse; margin: 10px 0; }
th { background: #16213e; color: white; padding: 10px 12px; text-align: left; font-size: 0.9em; }
td { padding: 8px 12px; border-bottom: 1px solid #eee; font-size: 0.9em; }
tr:hover { background: #f8f9fa; }
.pass-rate { font-weight: bold; }
.pass-rate.bad { color: #dc3545; }
.pass-rate.ok { color: #fd7e14; }
.pass-rate.good { color: #28a745; }
a { color: #0f3460; }
.verdict { font-size: 1.1em; padding: 15px; border-radius: 8px; margin: 15px 0; }
.verdict-report { background: #d1ecf1; border-left: 4px solid #17a2b8; }
.footer { text-align: center; color: #888; margin-top: 40px; padding: 20px; font-size: 0.85em; }
.trend-down { color: #dc3545; }
.trend-up { color: #28a745; }
.failure-detail { background: #fff8f0; border-left: 3px solid #fd7e14; padding: 15px; margin: 10px 0; border-radius: 0 8px 8px 0; }
ul.pr-list { list-style: none; padding: 0; }
ul.pr-list li { padding: 6px 0; border-bottom: 1px solid #f0f0f0; }
ul.pr-list li:last-child { border-bottom: none; }
.component-tag { background: #e8eaf6; color: #283593; padding: 2px 8px; border-radius: 4px; font-size: 0.8em; margin-right: 5px; }
</style>
</head>
<body>
<div class="container">
<h1>Payload Analysis: 4.22.0-0.nightly-2026-03-04-084819</h1>
<div class="executive-summary">
<h2>Executive Summary</h2>
<p><strong>Payload:</strong> <a href="https://amd64.ocp.releases.ci.openshift.org/releasestream/4.22.0-0.nightly/release/4.22.0-0.nightly-2026-03-04-084819" target="_blank" style="color:#e94560">4.22.0-0.nightly-2026-03-04-084819</a> <span class="badge badge-ready">Ready</span></p>
<p><strong>Release:</strong> 4.22 | <strong>Stream:</strong> nightly | <strong>Architecture:</strong> amd64</p>
<p><strong>Blocking Jobs:</strong> 7 succeeded, 7 pending, 4 failed (of 18)</p>
<p><strong>Last Accepted:</strong> <a href="https://amd64.ocp.releases.ci.openshift.org/releasestream/4.22.0-0.nightly/release/4.22.0-0.nightly-2026-03-02-153725" target="_blank" style="color:#e94560">4.22.0-0.nightly-2026-03-02-153725</a> (Mar 2)</p>
<p><strong>New PRs:</strong> 13 (vs previous payload), ~96 cumulative since last accepted</p>
<p><strong>Analysis Date:</strong> 2026-03-04</p>
<div class="verdict verdict-report" style="background:rgba(248,215,218,0.2); border-left-color:#dc3545; color: #f8d7da;">
<strong>Root Cause Identified: ManagementCPUsOverride Admission Plugin Deadlock</strong> — All 4 failed jobs share an identical root cause: the built-in <code>autoscaling.openshift.io/ManagementCPUsOverride</code> kube-apiserver admission plugin rejects ALL pod creation during bootstrap because "the cluster does not have any nodes", creating a fatal chicken-and-egg deadlock. This is a TechPreview-specific race condition in <code>openshift/kubernetes</code> that has existed since 2021 but is worsening (pass rates dropped from 62-83% to 41-70%). No specific PR in this payload is responsible — this is a pre-existing bug in the admission plugin's bootstrap handling.
</div>
</div>
<h2>Payload History (Recent)</h2>
<div class="card">
<table>
<tr><th>Payload Tag</th><th>Phase</th><th>Date</th><th>Blocking Results</th></tr>
<tr><td><a href="https://amd64.ocp.releases.ci.openshift.org/releasestream/4.22.0-0.nightly/release/4.22.0-0.nightly-2026-03-04-084819" target="_blank">4.22.0-0.nightly-2026-03-04-084819</a></td><td><span class="badge badge-ready">Ready</span></td><td>Mar 4 08:48</td><td>7 passed, 7 pending, 4 failed</td></tr>
<tr><td><a href="https://amd64.ocp.releases.ci.openshift.org/releasestream/4.22.0-0.nightly/release/4.22.0-0.nightly-2026-03-04-024042" target="_blank">4.22.0-0.nightly-2026-03-04-024042</a></td><td><span class="badge badge-rejected">Rejected</span></td><td>Mar 4 02:40</td><td>15 passed, 3 failed</td></tr>
<tr><td><a href="https://amd64.ocp.releases.ci.openshift.org/releasestream/4.22.0-0.nightly/release/4.22.0-0.nightly-2026-03-03-150411" target="_blank">4.22.0-0.nightly-2026-03-03-150411</a></td><td><span class="badge badge-rejected">Rejected</span></td><td>Mar 3 15:04</td><td>15 passed, 3 failed</td></tr>
<tr style="background:#e8f5e9;"><td><a href="https://amd64.ocp.releases.ci.openshift.org/releasestream/4.22.0-0.nightly/release/4.22.0-0.nightly-2026-03-02-153725" target="_blank">4.22.0-0.nightly-2026-03-02-153725</a></td><td><span class="badge badge-accepted">Accepted</span></td><td>Mar 2 15:37</td><td>18/18 passed</td></tr>
</table>
</div>
<h2>Failed Blocking Jobs</h2>
<div class="card">
<table>
<tr><th>Job</th><th>Failure Mode</th><th>Current Pass Rate</th><th>Trend</th><th>Retries</th></tr>
<tr>
<td><a href="https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview/2029149691546963968" target="_blank">aws-ovn-techpreview</a></td>
<td><span class="badge badge-bug">Admission Deadlock</span></td>
<td class="pass-rate bad">41.7%</td>
<td class="trend-down">-20.4% (was 62.1%)</td>
<td>1</td>
</tr>
<tr>
<td><a href="https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-1of3/2029146963546476544" target="_blank">aws-ovn-techpreview-serial-1of3</a></td>
<td><span class="badge badge-bug">Admission Deadlock</span></td>
<td class="pass-rate bad">45.5%</td>
<td class="trend-down">-30.5% (was 76.0%)</td>
<td>1</td>
</tr>
<tr>
<td><a href="https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-2of3/2029127737813241856" target="_blank">aws-ovn-techpreview-serial-2of3</a></td>
<td><span class="badge badge-bug">Admission Deadlock</span></td>
<td class="pass-rate ok">60.0%</td>
<td class="trend-down">-23.3% (was 83.3%)</td>
<td>1</td>
</tr>
<tr>
<td><a href="https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-3of3/2029147245412093952" target="_blank">aws-ovn-techpreview-serial-3of3</a></td>
<td><span class="badge badge-bug">Admission Deadlock</span></td>
<td class="pass-rate ok">70.0%</td>
<td class="trend-up">+3.3% (was 66.7%)</td>
<td>1</td>
</tr>
</table>
<p><strong>Note:</strong> All 4 failures share an identical root cause: the <code>ManagementCPUsOverride</code> admission plugin creates a bootstrap deadlock by rejecting all pods with workload annotations when zero nodes exist. This is a TechPreview-specific bug — the admission plugin checks node count <em>before</em> checking whether CPU partitioning is active. See Root Cause section below for details.</p>
</div>
<h2>Root Cause: ManagementCPUsOverride Bootstrap Deadlock</h2>
<div class="root-cause-box">
<h3>ManagementCPUsOverride Admission Plugin — Logic Ordering Bug</h3>
<p>Deep analysis of installer log bundles from all 4 failed jobs reveals an identical root cause: the <code>autoscaling.openshift.io/ManagementCPUsOverride</code> kube-apiserver admission plugin blocks ALL pod creation during bootstrap with the error <em>"the cluster does not have any nodes"</em>.</p>
<p><strong>The Bug:</strong> The admission plugin's <code>Admit()</code> function checks for nodes <em>before</em> checking whether CPU partitioning is even active. During bootstrap, zero nodes exist, so the check fails — even though <code>CPUPartitioning == None</code> for these clusters. The <code>isCPUPartitioning()</code> check that would allow the pod through comes <em>after</em> the node check and is never reached.</p>
<div class="code-block">
<pre>// openshift/kubernetes: openshift-kube-apiserver/admission/autoscaling/
// managementcpusoverride/admission.go
func (a *managementCPUsOverride) Admit(...) {
    // Step 1: Check for workload annotations
    workloadType, _ := getWorkloadType(podAnnotations)
    if len(workloadType) == 0 {
        return nil // Pods WITHOUT annotations always pass
    }

    // Step 2: Wait for informer cache sync (10s timeout)
    ...

    // Step 3: Check for nodes — RUNS FIRST, BLOCKS DURING BOOTSTRAP
    nodes, _ := a.nodeLister.List(labels.Everything())
    if len(nodes) == 0 {
        return admission.NewForbidden(attr,
            fmt.Errorf("%s the cluster does not have any nodes",
                PluginName))
    }

    // Step 4: Check if CPU partitioning is active — NEVER REACHED
    if !isCPUPartitioning(clusterInfra.Status, nodes, workloadType) {
        return nil // Would pass here for non-CPU-partitioned clusters
    }
}</pre>
</div>
<div style="margin: 20px 0; display: flex; flex-direction: column; align-items: center; gap: 0;">
<div style="border: 2px solid #666; border-radius: 6px; padding: 12px 24px; text-align: center; background: #f8f9fa;">
<div style="font-weight: 600;">Bootstrap kube-apiserver</div>
<div>ManagementCPUsOverride admission plugin loaded</div>
</div>
<div style="font-size: 1.5em; color: #666;">&#x25BC;</div>
<div style="border: 2px solid #fd7e14; border-radius: 6px; padding: 12px 24px; text-align: center; background: #fff8f0;">
<div style="font-weight: 600;">Pod has workload annotation</div>
<div style="font-size: 0.85em;">target.workload.openshift.io/management</div>
<div style="font-size: 0.85em; color: #888;">(CVO, MCO, ingress, etc. — statically baked in manifests)</div>
</div>
<div style="font-size: 1.5em; color: #666;">&#x25BC;</div>
<div style="border: 2px solid #dc3545; border-radius: 6px; padding: 12px 24px; text-align: center; background: #fff0f0;">
<div style="font-weight: 600; color: #dc3545;">nodeLister.List() &rarr; 0 nodes</div>
<div style="font-weight: 600; color: #dc3545;">REJECTED: "the cluster does not have any nodes"</div>
<div style="font-size: 0.85em; margin-top: 4px;">Chicken: can't create pods without nodes</div>
</div>
<div style="font-size: 1.5em; color: #666;">&#x25BC;</div>
<div style="border: 2px solid #dc3545; border-radius: 6px; padding: 12px 24px; text-align: center; background: #fff0f0;">
<div style="font-weight: 600;">Operator pods never start</div>
<div style="font-weight: 600;">Nodes never register</div>
<div style="font-weight: 600; color: #dc3545;">Bootstrap times out</div>
<div style="font-size: 0.85em; margin-top: 4px;">Egg: can't register nodes without operator pods</div>
</div>
</div>
<p><strong>Why TechPreview only:</strong> In TechPreview mode, operator deployments carry <code>target.workload.openshift.io/management</code> workload annotations for workload partitioning support. These annotations trigger the admission plugin's node check. Non-TechPreview clusters don't have these annotations on operator pods, so the plugin returns <code>nil</code> at Step 1 and pods are created normally.</p>
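<p>The gating step described above can be illustrated with a minimal, self-contained sketch. This is <em>not</em> the plugin's actual code — the <code>getWorkloadType</code> helper and its return shape are simplified for illustration — but it shows why only pods carrying the workload-annotation prefix ever reach the node check:</p>

```go
package main

import (
	"fmt"
	"strings"
)

// workloadAnnotationPrefix is the prefix named in the report.
const workloadAnnotationPrefix = "target.workload.openshift.io/"

// getWorkloadType is a simplified illustration: it returns the workload
// name (e.g. "management") if any annotation carries the workload prefix,
// or "" if none does. An empty result means the plugin admits the pod
// immediately and the node check never runs.
func getWorkloadType(annotations map[string]string) string {
	for key := range annotations {
		if strings.HasPrefix(key, workloadAnnotationPrefix) {
			return strings.TrimPrefix(key, workloadAnnotationPrefix)
		}
	}
	return ""
}

func main() {
	// TechPreview operator pod: annotation is statically baked into the manifest.
	operatorPod := map[string]string{
		"target.workload.openshift.io/management": `{"effect":"PreferredDuringScheduling"}`,
	}
	// Ordinary pod with no workload annotation.
	plainPod := map[string]string{"app": "demo"}

	fmt.Println(getWorkloadType(operatorPod)) // "management" — node check runs
	fmt.Println(getWorkloadType(plainPod))    // "" — plugin returns nil at Step 1
}
```

<p>In non-TechPreview clusters the operator manifests lack the annotation, so every pod takes the empty-result path and bootstrap proceeds normally.</p>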
<p><strong>Affected operators:</strong> CVO, Machine Config Controller, Cluster Ingress Operator, bootstrap kube-apiserver pod, cluster-monitoring-operator, cluster-storage-operator, service-ca-operator, and others — all have workload annotations statically baked into their deployment manifests.</p>
<p><strong>Code location:</strong> <code>openshift/kubernetes</code> → <code>openshift-kube-apiserver/admission/autoscaling/managementcpusoverride/admission.go</code></p>
<p><strong>Historical note:</strong> This bug has existed since 2021. <a href="https://github.com/openshift/kubernetes/pull/756" target="_blank">openshift/kubernetes#756</a> was supposed to fix it by checking Infrastructure status first, but the code flow still checks nodes before checking if CPU partitioning is active.</p>
</div>
<h2>Failure Details (Log Bundle Analysis)</h2>
<div class="failure-detail">
<h3>aws-ovn-techpreview — Bootstrap Timeout</h3>
<p><strong>Symptoms:</strong> Three master machines created in AWS (us-west-2), Phase/State/Node/ProviderID all empty after 81 minutes. Zero nodes in cluster. Bootstrap etcd and kube-apiserver ran for ~55 minutes, then timed out.</p>
<p><strong>Log Evidence:</strong> ManagementCPUsOverride admission plugin rejected pod creation across all operator namespaces with "the cluster does not have any nodes" error.</p>
<p><strong>Assessment:</strong> <span class="badge badge-bug">Admission Plugin Deadlock</span></p>
</div>
<div class="failure-detail">
<h3>aws-ovn-techpreview-serial-1of3 — Bootstrap Timeout</h3>
<p><strong>Symptoms:</strong> All 4 machines created via Cluster API in us-east-1. Bootstrap etcd and kube-apiserver ran normally for 55 minutes. All 37 cluster operators remained "not available (missing)". Master nodes never joined.</p>
<p><strong>Log Evidence:</strong> ManagementCPUsOverride blocked operator pod creation — operators never started, so nodes could never register.</p>
<p><strong>Assessment:</strong> <span class="badge badge-bug">Admission Plugin Deadlock</span></p>
</div>
<div class="failure-detail">
<h3>aws-ovn-techpreview-serial-2of3 — Bootstrap Timeout</h3>
<p><strong>Symptoms:</strong> AWS infrastructure provisioned correctly (AWSMachines "ready: true" / "running"). Machines stuck in "Provisioned" phase — "Waiting for Cluster control plane to be initialized". kube-apiserver showed "namespaces kube-system not found" errors.</p>
<p><strong>Log Evidence:</strong> ManagementCPUsOverride prevented control plane pods from starting. Control plane never initialized, so machines waiting for it timed out.</p>
<p><strong>Assessment:</strong> <span class="badge badge-bug">Admission Plugin Deadlock</span></p>
</div>
<div class="failure-detail">
<h3>aws-ovn-techpreview-serial-3of3 — Bootstrap Timeout</h3>
<p><strong>Symptoms:</strong> All 6 machines created in us-east-2. Bootstrap etcd formed single-node cluster. Zero nodes ever registered. Only 2 MachineConfigs existed (SSH only). kube-apiserver restarted every ~20 minutes.</p>
<p><strong>Log Evidence:</strong> ManagementCPUsOverride rejected MCO and other operator pods. MCO never initialized, so MachineConfigs were never rendered and nodes never completed ignition.</p>
<p><strong>Assessment:</strong> <span class="badge badge-bug">Admission Plugin Deadlock</span></p>
</div>
<h2>Suspect PRs</h2>
<div class="card">
<p>13 new PRs in this payload vs the previous payload. <strong>All scored LOW confidence (&lt;60)</strong> — no causal link to bootstrap timeout failures.</p>
<table>
<tr><th>PR</th><th>Component</th><th>Description</th><th>Confidence</th><th>Rationale</th></tr>
<tr>
<td><a href="https://github.com/openshift/cluster-kube-apiserver-operator/pull/2003" target="_blank">#2003</a></td>
<td><span class="component-tag">cluster-kube-apiserver-operator</span></td>
<td>Rebase 1.35</td>
<td><span class="badge badge-low">LOW (25)</span></td>
<td>Operator rebase doesn't affect bootstrap kube-apiserver (runs from ignition)</td>
</tr>
<tr>
<td><a href="https://github.com/openshift/cluster-kube-controller-manager-operator/pull/902" target="_blank">#902</a></td>
<td><span class="component-tag">cluster-kube-controller-manager-operator</span></td>
<td>Rebase 1.35</td>
<td><span class="badge badge-low">LOW (25)</span></td>
<td>Operator rebase; failures are in bootstrap before operators run</td>
</tr>
<tr>
<td><a href="https://github.com/openshift/cluster-kube-scheduler-operator/pull/602" target="_blank">#602</a></td>
<td><span class="component-tag">cluster-kube-scheduler-operator</span></td>
<td>Rebase 1.35</td>
<td><span class="badge badge-low">LOW (25)</span></td>
<td>Operator rebase; failures are in bootstrap before operators run</td>
</tr>
<tr>
<td><a href="https://github.com/openshift/installer/pull/10345" target="_blank">#10345</a></td>
<td><span class="component-tag">installer</span></td>
<td>Kludge: restart sshd server</td>
<td><span class="badge badge-low">LOW (30)</span></td>
<td>SSH restart kludge — unlikely to affect node registration/bootstrap</td>
</tr>
<tr>
<td><a href="https://github.com/openshift/machine-api-provider-aws/pull/171" target="_blank">#171</a></td>
<td><span class="component-tag">aws-machine-controllers</span></td>
<td>Enable primary IPv6 for DualStackIPv6Primary</td>
<td><span class="badge badge-low">LOW (20)</span></td>
<td>Only affects DualStackIPv6Primary — not the default for these jobs</td>
</tr>
<tr>
<td><a href="https://github.com/openshift/hypershift/pull/7831" target="_blank">#7831</a></td>
<td><span class="component-tag">hypershift</span></td>
<td>Inject proxy env vars into karpenter workloads</td>
<td><span class="badge badge-low">LOW (10)</span></td>
<td>HyperShift change — not used in standard IPI installs</td>
</tr>
<tr>
<td><a href="https://github.com/openshift/console/pull/16042" target="_blank">#16042</a></td>
<td><span class="component-tag">console</span></td>
<td>Fix topology node labels disappearing</td>
<td><span class="badge badge-low">LOW (5)</span></td>
<td>UI fix — no impact on cluster installation</td>
</tr>
<tr>
<td><a href="https://github.com/openshift/cloud-credential-operator/pull/979" target="_blank">#979</a></td>
<td><span class="component-tag">cloud-credential-operator</span></td>
<td>cloud-credential-tests-ext multi-arch support</td>
<td><span class="badge badge-low">LOW (10)</span></td>
<td>Test infrastructure change — no runtime impact</td>
</tr>
<tr>
<td><a href="https://github.com/openshift/cluster-api-provider-vsphere/pull/82" target="_blank">#82</a></td>
<td><span class="component-tag">vsphere-cluster-api-controllers</span></td>
<td>Update to new manifests-gen</td>
<td><span class="badge badge-low">LOW (5)</span></td>
<td>vSphere provider — not used in AWS jobs</td>
</tr>
<tr>
<td><a href="https://github.com/openshift/cluster-samples-operator/pull/675" target="_blank">#675</a></td>
<td><span class="component-tag">cluster-samples-operator</span></td>
<td>ART image update</td>
<td><span class="badge badge-low">LOW (5)</span></td>
<td>Image consistency update — no functional change</td>
</tr>
<tr>
<td><a href="https://github.com/openshift/openshift-state-metrics/pull/130" target="_blank">#130</a></td>
<td><span class="component-tag">openshift-state-metrics</span></td>
<td>ART image update</td>
<td><span class="badge badge-low">LOW (5)</span></td>
<td>Image consistency update — no functional change</td>
</tr>
<tr>
<td><a href="https://github.com/openshift/prometheus-alertmanager/pull/119" target="_blank">#119</a></td>
<td><span class="component-tag">prometheus-alertmanager</span></td>
<td>ART image update</td>
<td><span class="badge badge-low">LOW (5)</span></td>
<td>Image consistency update — no functional change</td>
</tr>
<tr>
<td><a href="https://github.com/openshift/prometheus-operator/pull/372" target="_blank">#372</a></td>
<td><span class="component-tag">prometheus-operator</span></td>
<td>ART image update</td>
<td><span class="badge badge-low">LOW (5)</span></td>
<td>Image consistency update — no functional change</td>
</tr>
</table>
</div>
<h2>Broader Context: TechPreview Job Health</h2>
<div class="card">
<p>TechPreview jobs across the 4.22 release are experiencing <strong>widespread bootstrap failures</strong> driven by the same ManagementCPUsOverride deadlock. Nearly all techpreview variants (not just AWS) show severely degraded or 0% pass rates:</p>
<ul>
<li><strong>gcp-ovn-techpreview-serial:</strong> 0% (was 47.6%) — 7/7 failed</li>
<li><strong>metal-ipi-ovn-dualstack-techpreview:</strong> 0% (was 52.4%) — 7/7 failed</li>
<li><strong>aws-ovn-single-node-techpreview:</strong> 0% — 7/7 failed</li>
<li><strong>metal-ipi-ovn-ipv6-techpreview:</strong> 0% — 7/7 failed</li>
</ul>
<p>All platforms are affected equally, confirming this is <strong>not an infrastructure issue</strong> but a systemic TechPreview bootstrap bug. The ManagementCPUsOverride admission plugin deadlock affects any TechPreview cluster where operator pods carry <code>target.workload.openshift.io/</code> annotations — which is all of them.</p>
</div>
<h2>PR Investigation Summary</h2>
<div class="card">
<p>All 13 new PRs were investigated in depth. The three Kubernetes 1.35 rebase PRs were examined with particular attention since TechPreview bootstrap was the failure mode:</p>
<table>
<tr><th>PR</th><th>Finding</th></tr>
<tr>
<td><a href="https://github.com/openshift/cluster-kube-apiserver-operator/pull/2003" target="_blank">CKAO #2003</a></td>
<td>Vendor-only rebase. Zero operator source code changes. No admission plugin configuration changes. Only 2 non-vendor files changed (PodSecurityAdmission API adaptation).</td>
</tr>
<tr>
<td><a href="https://github.com/openshift/cluster-kube-controller-manager-operator/pull/902" target="_blank">CKCM #902</a></td>
<td>Vendor-only rebase. Go 1.24→1.25, k8s libs 1.34→1.35. Zero source changes.</td>
</tr>
<tr>
<td><a href="https://github.com/openshift/cluster-kube-scheduler-operator/pull/602" target="_blank">CKSO #602</a></td>
<td>Vendor-only rebase (2,643 files changed, all vendor). Zero source changes. Scheduler does not interact with ManagementCPUsOverride.</td>
</tr>
</table>
<p><strong>Key finding:</strong> The kube-apiserver binary is still on Kubernetes <strong>1.34.2</strong> (not rebased to 1.35 yet), while operator client libraries are now on 1.35. The admission plugin code in <code>openshift/kubernetes</code> is <strong>identical</strong> between release-4.21 and release-4.22. No PR in this payload modified the admission plugin, feature gates, or workload annotations. The bug is pre-existing but worsening — likely due to increased bootstrap load during the partial 1.35 rebase.</p>
</div>
<h2>Action Taken</h2>
<div class="card">
<div class="verdict verdict-report">
<strong>REPORT ONLY</strong> — No automated actions taken.
<ul>
<li>All suspect PRs scored LOW confidence (&lt;60)</li>
<li>Failures stem from a pre-existing product bug (the ManagementCPUsOverride bootstrap deadlock) rather than any change introduced by this payload — and, per the broader-context analysis, they are not infrastructure flakiness</li>
<li>No reverts staged, no bisect experiments initiated</li>
<li>The payload is still in "Ready" state with 7 pending jobs — it may still be accepted if retries pass</li>
</ul>
</div>
</div>
<h2>Recommendations</h2>
<div class="card">
<ol>
<li><strong>Fix the ManagementCPUsOverride admission plugin logic ordering:</strong> In <code>openshift/kubernetes</code> <code>openshift-kube-apiserver/admission/autoscaling/managementcpusoverride/admission.go</code>, move the <code>isCPUPartitioning()</code> check <em>before</em> the <code>len(nodes) == 0</code> check. For clusters where <code>CPUPartitioning != AllNodes</code>, the plugin should return <code>nil</code> without checking node count. This would fix the bootstrap deadlock for all non-CPU-partitioned TechPreview clusters.</li>
<li><strong>File a tracking bug against <code>openshift/kubernetes</code>:</strong> The bug has existed since 2021 (<a href="https://github.com/openshift/kubernetes/pull/756" target="_blank">#756</a> was supposed to fix it). All TechPreview jobs across all platforms (AWS, GCP, metal) are affected. This is not platform-specific — it's a kube-apiserver admission plugin bug.</li>
<li><strong>Investigate timing regression:</strong> The admission plugin code hasn't changed, but pass rates have dropped from 62-83% to 41-70%. The partial Kubernetes 1.35 rebase (operators on 1.35 libs, kube-apiserver binary on 1.34.2) may have introduced timing differences that make the race condition more likely to trigger.</li>
<li><strong>Monitor the payload:</strong> 7 blocking jobs are still pending. The payload may yet be accepted if retries pass, but TechPreview jobs will continue to be unreliable until the admission plugin is fixed.</li>
</ol>
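<p>Recommendation 1 can be sketched with a minimal, hypothetical example contrasting the two check orderings. The <code>cluster</code> struct and the <code>admitCurrent</code>/<code>admitFixed</code> functions below are simplified stand-ins for the real admission plugin, not its actual code:</p>

```go
package main

import (
	"errors"
	"fmt"
)

// cluster is a toy model of the state the plugin consults: the number of
// registered nodes and the CPUPartitioning mode from Infrastructure status.
type cluster struct {
	nodeCount       int
	cpuPartitioning string // "None" or "AllNodes"
}

// admitCurrent mirrors the ordering described in the report: the node
// count is checked before the partitioning mode, so an annotated pod is
// rejected during bootstrap even when partitioning is off.
func admitCurrent(c cluster) error {
	if c.nodeCount == 0 {
		return errors.New("the cluster does not have any nodes")
	}
	if c.cpuPartitioning != "AllNodes" {
		return nil // partitioning inactive: nothing to do
	}
	return nil // partitioning active: strip/validate CPU requests here
}

// admitFixed applies recommendation 1: consult the partitioning mode
// first, so non-CPU-partitioned clusters never hit the node check.
func admitFixed(c cluster) error {
	if c.cpuPartitioning != "AllNodes" {
		return nil // partitioning inactive: admit without a node check
	}
	if c.nodeCount == 0 {
		return errors.New("the cluster does not have any nodes")
	}
	return nil
}

func main() {
	// Bootstrap state for the failed jobs: zero nodes, partitioning off.
	bootstrap := cluster{nodeCount: 0, cpuPartitioning: "None"}
	fmt.Println(admitCurrent(bootstrap)) // rejected: the deadlock
	fmt.Println(admitFixed(bootstrap))   // admitted: bootstrap can proceed
}
```

<p>With the reordered checks, the bootstrap state admits annotated pods, and the node check is only enforced when CPU partitioning is actually active — which matches the intended behavior for non-CPU-partitioned TechPreview clusters.</p>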
</div>
<div class="footer">
<p>Generated by Payload Agent | Analysis Date: 2026-03-04 | Payload: 4.22.0-0.nightly-2026-03-04-084819</p>
<p>All links open in new tabs.</p>
</div>
</div>
</body>
</html>