Created
March 4, 2026 14:24
| <!DOCTYPE html> | |
| <html lang="en"> | |
| <head> | |
| <meta charset="UTF-8"> | |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> | |
| <title>Payload Analysis: 4.22.0-0.nightly-2026-03-04-084819</title> | |
| <style> | |
| body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; margin: 0; padding: 20px; background: #f5f5f5; color: #333; } | |
| .container { max-width: 1200px; margin: 0 auto; } | |
| h1 { color: #1a1a2e; border-bottom: 3px solid #e94560; padding-bottom: 10px; } | |
| h2 { color: #16213e; margin-top: 30px; border-bottom: 2px solid #0f3460; padding-bottom: 5px; } | |
| h3 { color: #533483; } | |
| .card { background: white; border-radius: 8px; padding: 20px; margin: 15px 0; box-shadow: 0 2px 4px rgba(0,0,0,0.1); } | |
| .executive-summary { background: linear-gradient(135deg, #1a1a2e, #16213e); color: white; border-radius: 8px; padding: 25px; margin: 20px 0; } | |
| .executive-summary h2 { color: #e94560; border-bottom-color: #e94560; } | |
| .badge { display: inline-block; padding: 4px 12px; border-radius: 12px; font-size: 0.85em; font-weight: 600; margin: 2px; } | |
| .badge-ready { background: #fff3cd; color: #856404; } | |
| .badge-rejected { background: #f8d7da; color: #721c24; } | |
| .badge-accepted { background: #d4edda; color: #155724; } | |
| .badge-failed { background: #f8d7da; color: #721c24; } | |
| .badge-low { background: #d1ecf1; color: #0c5460; } | |
| .badge-infra { background: #e2e3e5; color: #383d41; } | |
| .badge-bug { background: #f8d7da; color: #721c24; } | |
| .root-cause-box { background: #fff0f0; border: 2px solid #dc3545; border-radius: 8px; padding: 20px; margin: 15px 0; } | |
| .root-cause-box h3 { color: #dc3545; margin-top: 0; } | |
| .code-block { background: #1e1e1e; color: #d4d4d4; padding: 15px; border-radius: 6px; font-family: 'Fira Code', 'Consolas', monospace; font-size: 0.85em; overflow-x: auto; margin: 10px 0; } | |
| .deadlock-diagram { background: #f8f9fa; border: 1px solid #dee2e6; border-radius: 8px; padding: 20px; margin: 15px 0; text-align: center; font-family: monospace; } | |
| table { width: 100%; border-collapse: collapse; margin: 10px 0; } | |
| th { background: #16213e; color: white; padding: 10px 12px; text-align: left; font-size: 0.9em; } | |
| td { padding: 8px 12px; border-bottom: 1px solid #eee; font-size: 0.9em; } | |
| tr:hover { background: #f8f9fa; } | |
| .pass-rate { font-weight: bold; } | |
| .pass-rate.bad { color: #dc3545; } | |
| .pass-rate.ok { color: #fd7e14; } | |
| .pass-rate.good { color: #28a745; } | |
| a { color: #0f3460; } | |
| .verdict { font-size: 1.1em; padding: 15px; border-radius: 8px; margin: 15px 0; } | |
| .verdict-report { background: #d1ecf1; border-left: 4px solid #17a2b8; } | |
| .footer { text-align: center; color: #888; margin-top: 40px; padding: 20px; font-size: 0.85em; } | |
| .trend-down { color: #dc3545; } | |
| .trend-up { color: #28a745; } | |
| .failure-detail { background: #fff8f0; border-left: 3px solid #fd7e14; padding: 15px; margin: 10px 0; border-radius: 0 8px 8px 0; } | |
| ul.pr-list { list-style: none; padding: 0; } | |
| ul.pr-list li { padding: 6px 0; border-bottom: 1px solid #f0f0f0; } | |
| ul.pr-list li:last-child { border-bottom: none; } | |
| .component-tag { background: #e8eaf6; color: #283593; padding: 2px 8px; border-radius: 4px; font-size: 0.8em; margin-right: 5px; } | |
| </style> | |
| </head> | |
| <body> | |
| <div class="container"> | |
| <h1>Payload Analysis: 4.22.0-0.nightly-2026-03-04-084819</h1> | |
| <div class="executive-summary"> | |
| <h2>Executive Summary</h2> | |
| <p><strong>Payload:</strong> <a href="https://amd64.ocp.releases.ci.openshift.org/releasestream/4.22.0-0.nightly/release/4.22.0-0.nightly-2026-03-04-084819" target="_blank" style="color:#e94560">4.22.0-0.nightly-2026-03-04-084819</a> <span class="badge badge-ready">Ready</span></p> | |
| <p><strong>Release:</strong> 4.22 | <strong>Stream:</strong> nightly | <strong>Architecture:</strong> amd64</p> | |
| <p><strong>Blocking Jobs:</strong> 7 succeeded, 7 pending, 4 failed (of 18)</p> | |
| <p><strong>Last Accepted:</strong> <a href="https://amd64.ocp.releases.ci.openshift.org/releasestream/4.22.0-0.nightly/release/4.22.0-0.nightly-2026-03-02-153725" target="_blank" style="color:#e94560">4.22.0-0.nightly-2026-03-02-153725</a> (Mar 2)</p> | |
| <p><strong>New PRs:</strong> 13 (vs previous payload), ~96 cumulative since last accepted</p> | |
| <p><strong>Analysis Date:</strong> 2026-03-04</p> | |
| <div class="verdict verdict-report" style="background:rgba(248,215,218,0.2); border-left-color:#dc3545; color: #f8d7da;"> | |
| <strong>Root Cause Identified: ManagementCPUsOverride Admission Plugin Deadlock.</strong> All 4 failed jobs share one root cause: the built-in <code>autoscaling.openshift.io/ManagementCPUsOverride</code> kube-apiserver admission plugin rejects every pod that carries a workload annotation during bootstrap because "the cluster does not have any nodes", creating a chicken-and-egg deadlock. This is a TechPreview-specific race condition in <code>openshift/kubernetes</code> that has existed since 2021, and three of the four jobs have worsened (across the four, pass rates moved from 62-83% to 41-70%). No PR in this payload is responsible; this is a pre-existing bug in the admission plugin's bootstrap handling. | |
| </div> | |
| </div> | |
| <h2>Payload History (Recent)</h2> | |
| <div class="card"> | |
| <table> | |
| <tr><th>Payload Tag</th><th>Phase</th><th>Date</th><th>Blocking Results</th></tr> | |
| <tr><td><a href="https://amd64.ocp.releases.ci.openshift.org/releasestream/4.22.0-0.nightly/release/4.22.0-0.nightly-2026-03-04-084819" target="_blank">4.22.0-0.nightly-2026-03-04-084819</a></td><td><span class="badge badge-ready">Ready</span></td><td>Mar 4 08:48</td><td>7 passed, 7 pending, 4 failed</td></tr> | |
| <tr><td><a href="https://amd64.ocp.releases.ci.openshift.org/releasestream/4.22.0-0.nightly/release/4.22.0-0.nightly-2026-03-04-024042" target="_blank">4.22.0-0.nightly-2026-03-04-024042</a></td><td><span class="badge badge-rejected">Rejected</span></td><td>Mar 4 02:40</td><td>15 passed, 3 failed</td></tr> | |
| <tr><td><a href="https://amd64.ocp.releases.ci.openshift.org/releasestream/4.22.0-0.nightly/release/4.22.0-0.nightly-2026-03-03-150411" target="_blank">4.22.0-0.nightly-2026-03-03-150411</a></td><td><span class="badge badge-rejected">Rejected</span></td><td>Mar 3 15:04</td><td>15 passed, 3 failed</td></tr> | |
| <tr style="background:#e8f5e9;"><td><a href="https://amd64.ocp.releases.ci.openshift.org/releasestream/4.22.0-0.nightly/release/4.22.0-0.nightly-2026-03-02-153725" target="_blank">4.22.0-0.nightly-2026-03-02-153725</a></td><td><span class="badge badge-accepted">Accepted</span></td><td>Mar 2 15:37</td><td>18/18 passed</td></tr> | |
| </table> | |
| </div> | |
| <h2>Failed Blocking Jobs</h2> | |
| <div class="card"> | |
| <table> | |
| <tr><th>Job</th><th>Failure Mode</th><th>Current Pass Rate</th><th>Trend</th><th>Retries</th></tr> | |
| <tr> | |
| <td><a href="https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview/2029149691546963968" target="_blank">aws-ovn-techpreview</a></td> | |
| <td><span class="badge badge-bug">Admission Deadlock</span></td> | |
| <td class="pass-rate bad">41.7%</td> | |
| <td class="trend-down">-20.4% (was 62.1%)</td> | |
| <td>1</td> | |
| </tr> | |
| <tr> | |
| <td><a href="https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-1of3/2029146963546476544" target="_blank">aws-ovn-techpreview-serial-1of3</a></td> | |
| <td><span class="badge badge-bug">Admission Deadlock</span></td> | |
| <td class="pass-rate bad">45.5%</td> | |
| <td class="trend-down">-30.5% (was 76.0%)</td> | |
| <td>1</td> | |
| </tr> | |
| <tr> | |
| <td><a href="https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-2of3/2029127737813241856" target="_blank">aws-ovn-techpreview-serial-2of3</a></td> | |
| <td><span class="badge badge-bug">Admission Deadlock</span></td> | |
| <td class="pass-rate ok">60.0%</td> | |
| <td class="trend-down">-23.3% (was 83.3%)</td> | |
| <td>1</td> | |
| </tr> | |
| <tr> | |
| <td><a href="https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-ci-4.22-e2e-aws-ovn-techpreview-serial-3of3/2029147245412093952" target="_blank">aws-ovn-techpreview-serial-3of3</a></td> | |
| <td><span class="badge badge-bug">Admission Deadlock</span></td> | |
| <td class="pass-rate ok">70.0%</td> | |
| <td class="trend-up">+3.3% (was 66.7%)</td> | |
| <td>1</td> | |
| </tr> | |
| </table> | |
| <p><strong>Note:</strong> All 4 failures share an identical root cause: the <code>ManagementCPUsOverride</code> admission plugin creates a bootstrap deadlock by rejecting all pods with workload annotations when zero nodes exist. This is a TechPreview-specific bug — the admission plugin checks node count <em>before</em> checking whether CPU partitioning is active. See Root Cause section below for details.</p> | |
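<p>The Trend column is the current pass rate minus the previous one, which makes it a difference in percentage points rather than a relative change. A minimal sketch of that arithmetic, using the four rows from the table above; the <code>jobRate</code> type and <code>deltaPoints</code> helper are hypothetical names introduced only for illustration:</p>

```go
package main

import (
	"fmt"
	"math"
)

// jobRate holds the two pass rates reported for one blocking job.
// Hypothetical type, not part of any OpenShift tooling.
type jobRate struct {
	name     string
	current  float64 // current pass rate, in percent
	previous float64 // previous pass rate, in percent
}

// deltaPoints returns current minus previous, rounded to one decimal,
// expressed in percentage points.
func deltaPoints(j jobRate) float64 {
	return math.Round((j.current-j.previous)*10) / 10
}

func main() {
	// Values copied from the Failed Blocking Jobs table.
	jobs := []jobRate{
		{"aws-ovn-techpreview", 41.7, 62.1},
		{"aws-ovn-techpreview-serial-1of3", 45.5, 76.0},
		{"aws-ovn-techpreview-serial-2of3", 60.0, 83.3},
		{"aws-ovn-techpreview-serial-3of3", 70.0, 66.7},
	}
	for _, j := range jobs {
		fmt.Printf("%-34s %+.1f points\n", j.name, deltaPoints(j))
	}
}
```

<p>Three of the four jobs regress and one improves slightly, which matches the direction shown in the Trend column.</p>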
| </div> | |
| <h2>Root Cause: ManagementCPUsOverride Bootstrap Deadlock</h2> | |
| <div class="root-cause-box"> | |
| <h3>ManagementCPUsOverride Admission Plugin — Logic Ordering Bug</h3> | |
| <p>Analysis of installer log bundles from all 4 failed jobs points to one root cause: the <code>autoscaling.openshift.io/ManagementCPUsOverride</code> kube-apiserver admission plugin rejects every pod that carries a workload annotation during bootstrap with the error <em>"the cluster does not have any nodes"</em>. Pods without the annotation are admitted.</p> | |
| <p><strong>The Bug:</strong> The admission plugin's <code>Admit()</code> function checks for nodes <em>before</em> checking whether CPU partitioning is even active. During bootstrap, zero nodes exist, so the check fails — even though <code>CPUPartitioning == None</code> for these clusters. The <code>isCPUPartitioning()</code> check that would allow the pod through comes <em>after</em> the node check and is never reached.</p> | |
| <div class="code-block"> | |
| <pre>// openshift/kubernetes: openshift-kube-apiserver/admission/autoscaling/ | |
| // managementcpusoverride/admission.go | |
| func (a *managementCPUsOverride) Admit(...) { | |
| // Step 1: Check for workload annotations | |
| workloadType, _ := getWorkloadType(podAnnotations) | |
| if len(workloadType) == 0 { | |
| return nil // Pods WITHOUT annotations always pass | |
| } | |
| // Step 2: Wait for informer cache sync (10s timeout) | |
| ... | |
| // Step 3: Check for nodes (runs before Step 4, so it blocks during bootstrap) | |
| nodes, _ := a.nodeLister.List(labels.Everything()) | |
| if len(nodes) == 0 { | |
| return admission.NewForbidden(attr, | |
| fmt.Errorf("%s the cluster does not have any nodes", | |
| PluginName)) | |
| } | |
| // Step 4: Check if CPU partitioning is active — NEVER REACHED | |
| if !isCPUPartitioning(clusterInfra.Status, nodes, workloadType) { | |
| return nil // Would pass here for non-CPU-partitioned clusters | |
| } | |
| }</pre> | |
| </div> | |
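<p>The ordering described above can be reproduced in a small standalone model. This is a sketch under stated assumptions, not the plugin's actual code: the <code>pod</code> type, <code>admitCurrentOrder</code>, and <code>errNoNodes</code> are hypothetical stand-ins, the informer cache sync is left out, and only the annotation key and error text come from this report.</p>

```go
package main

import (
	"errors"
	"fmt"
)

// Annotation key as quoted in this report.
const managementAnnotation = "target.workload.openshift.io/management"

// errNoNodes carries the error text seen in the installer logs.
var errNoNodes = errors.New("the cluster does not have any nodes")

// pod is a hypothetical, minimal stand-in for a Kubernetes Pod.
type pod struct {
	name        string
	annotations map[string]string
}

// admitCurrentOrder models the ordering described in this report:
// annotation check, then node count, then CPU partitioning.
func admitCurrentOrder(p pod, nodeCount int, cpuPartitioned bool) error {
	// Step 1: pods without the workload annotation always pass.
	if _, ok := p.annotations[managementAnnotation]; !ok {
		return nil
	}
	// Step 3: the node count is checked before partitioning is consulted.
	if nodeCount == 0 {
		return errNoNodes
	}
	// Step 4: only reachable once at least one node exists.
	if !cpuPartitioned {
		return nil
	}
	return nil // the real plugin would adjust pod resources here (omitted)
}

func main() {
	annotated := pod{
		name:        "annotated-operator",
		annotations: map[string]string{managementAnnotation: "placeholder"},
	}
	plain := pod{name: "plain-pod"}

	for _, tc := range []struct {
		p     pod
		nodes int
	}{{annotated, 0}, {plain, 0}, {annotated, 3}} {
		err := admitCurrentOrder(tc.p, tc.nodes, false)
		fmt.Printf("%s with %d nodes: err=%v\n", tc.p.name, tc.nodes, err)
	}
}
```

<p>In this model the annotated pod is rejected at zero nodes even though CPU partitioning is off, while the same pod is admitted as soon as any node exists.</p>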
| <div style="margin: 20px 0; display: flex; flex-direction: column; align-items: center; gap: 0;"> | |
| <div style="border: 2px solid #666; border-radius: 6px; padding: 12px 24px; text-align: center; background: #f8f9fa;"> | |
| <div style="font-weight: 600;">Bootstrap kube-apiserver</div> | |
| <div>ManagementCPUsOverride admission plugin loaded</div> | |
| </div> | |
| <div style="font-size: 1.5em; color: #666;">▼</div> | |
| <div style="border: 2px solid #fd7e14; border-radius: 6px; padding: 12px 24px; text-align: center; background: #fff8f0;"> | |
| <div style="font-weight: 600;">Pod has workload annotation</div> | |
| <div style="font-size: 0.85em;">target.workload.openshift.io/management</div> | |
| <div style="font-size: 0.85em; color: #888;">(CVO, MCO, ingress, etc. — statically baked in manifests)</div> | |
| </div> | |
| <div style="font-size: 1.5em; color: #666;">▼</div> | |
| <div style="border: 2px solid #dc3545; border-radius: 6px; padding: 12px 24px; text-align: center; background: #fff0f0;"> | |
| <div style="font-weight: 600; color: #dc3545;">nodeLister.List() → 0 nodes</div> | |
| <div style="font-weight: 600; color: #dc3545;">REJECTED: "the cluster does not have any nodes"</div> | |
| <div style="font-size: 0.85em; margin-top: 4px;">Chicken: can't create pods without nodes</div> | |
| </div> | |
| <div style="font-size: 1.5em; color: #666;">▼</div> | |
| <div style="border: 2px solid #dc3545; border-radius: 6px; padding: 12px 24px; text-align: center; background: #fff0f0;"> | |
| <div style="font-weight: 600;">Operator pods never start</div> | |
| <div style="font-weight: 600;">Nodes never register</div> | |
| <div style="font-weight: 600; color: #dc3545;">Bootstrap times out</div> | |
| <div style="font-size: 0.85em; margin-top: 4px;">Egg: can't register nodes without operator pods</div> | |
| </div> | |
| </div> | |
| <p><strong>Why TechPreview only:</strong> In TechPreview mode, operator deployments carry <code>target.workload.openshift.io/management</code> workload annotations for workload partitioning support. These annotations trigger the admission plugin's node check. Non-TechPreview clusters don't have these annotations on operator pods, so the plugin returns <code>nil</code> at Step 1 and pods are created normally.</p> | |
| <p><strong>Affected operators:</strong> CVO, Machine Config Controller, Cluster Ingress Operator, bootstrap kube-apiserver pod, cluster-monitoring-operator, cluster-storage-operator, service-ca-operator, and others — all have workload annotations statically baked into their deployment manifests.</p> | |
| <p><strong>Code location:</strong> <code>openshift/kubernetes</code> → <code>openshift-kube-apiserver/admission/autoscaling/managementcpusoverride/admission.go</code></p> | |
| <p><strong>Historical note:</strong> This bug has existed since 2021. <a href="https://github.com/openshift/kubernetes/pull/756" target="_blank">openshift/kubernetes#756</a> was intended to fix it by checking Infrastructure status first, but the code flow still checks the node count before checking whether CPU partitioning is active.</p> | |
| </div> | |
| <h2>Failure Details (Log Bundle Analysis)</h2> | |
| <div class="failure-detail"> | |
| <h3>aws-ovn-techpreview — Bootstrap Timeout</h3> | |
| <p><strong>Symptoms:</strong> Three master machines created in AWS (us-west-2), Phase/State/Node/ProviderID all empty after 81 minutes. Zero nodes in cluster. Bootstrap etcd and kube-apiserver ran for ~55 minutes, then timed out.</p> | |
| <p><strong>Log Evidence:</strong> ManagementCPUsOverride admission plugin rejected pod creation across all operator namespaces with "the cluster does not have any nodes" error.</p> | |
| <p><strong>Assessment:</strong> <span class="badge badge-bug">Admission Plugin Deadlock</span></p> | |
| </div> | |
| <div class="failure-detail"> | |
| <h3>aws-ovn-techpreview-serial-1of3 — Bootstrap Timeout</h3> | |
| <p><strong>Symptoms:</strong> All 4 machines created via Cluster API in us-east-1. Bootstrap etcd and kube-apiserver ran normally for 55 minutes. All 37 cluster operators remained "not available (missing)". Master nodes never joined.</p> | |
| <p><strong>Log Evidence:</strong> ManagementCPUsOverride blocked operator pod creation — operators never started, so nodes could never register.</p> | |
| <p><strong>Assessment:</strong> <span class="badge badge-bug">Admission Plugin Deadlock</span></p> | |
| </div> | |
| <div class="failure-detail"> | |
| <h3>aws-ovn-techpreview-serial-2of3 — Bootstrap Timeout</h3> | |
| <p><strong>Symptoms:</strong> AWS infrastructure provisioned correctly (AWSMachines "ready: true" / "running"). Machines stuck in "Provisioned" phase — "Waiting for Cluster control plane to be initialized". kube-apiserver showed "namespaces kube-system not found" errors.</p> | |
| <p><strong>Log Evidence:</strong> ManagementCPUsOverride prevented control plane pods from starting. Control plane never initialized, so machines waiting for it timed out.</p> | |
| <p><strong>Assessment:</strong> <span class="badge badge-bug">Admission Plugin Deadlock</span></p> | |
| </div> | |
| <div class="failure-detail"> | |
| <h3>aws-ovn-techpreview-serial-3of3 — Bootstrap Timeout</h3> | |
| <p><strong>Symptoms:</strong> All 6 machines created in us-east-2. Bootstrap etcd formed single-node cluster. Zero nodes ever registered. Only 2 MachineConfigs existed (SSH only). kube-apiserver restarted every ~20 minutes.</p> | |
| <p><strong>Log Evidence:</strong> ManagementCPUsOverride rejected MCO and other operator pods. MCO never initialized, so MachineConfigs were never rendered and nodes never completed ignition.</p> | |
| <p><strong>Assessment:</strong> <span class="badge badge-bug">Admission Plugin Deadlock</span></p> | |
| </div> | |
| <h2>Suspect PRs</h2> | |
| <div class="card"> | |
| <p>13 new PRs in this payload vs the previous payload. <strong>All scored LOW confidence (&lt;60)</strong>; none has a causal link to the bootstrap timeout failures.</p> | |
| <table> | |
| <tr><th>PR</th><th>Component</th><th>Description</th><th>Confidence</th><th>Rationale</th></tr> | |
| <tr> | |
| <td><a href="https://github.com/openshift/cluster-kube-apiserver-operator/pull/2003" target="_blank">#2003</a></td> | |
| <td><span class="component-tag">cluster-kube-apiserver-operator</span></td> | |
| <td>Rebase 1.35</td> | |
| <td><span class="badge badge-low">LOW (25)</span></td> | |
| <td>Operator rebase doesn't affect bootstrap kube-apiserver (runs from ignition)</td> | |
| </tr> | |
| <tr> | |
| <td><a href="https://github.com/openshift/cluster-kube-controller-manager-operator/pull/902" target="_blank">#902</a></td> | |
| <td><span class="component-tag">cluster-kube-controller-manager-operator</span></td> | |
| <td>Rebase 1.35</td> | |
| <td><span class="badge badge-low">LOW (25)</span></td> | |
| <td>Operator rebase; failures are in bootstrap before operators run</td> | |
| </tr> | |
| <tr> | |
| <td><a href="https://github.com/openshift/cluster-kube-scheduler-operator/pull/602" target="_blank">#602</a></td> | |
| <td><span class="component-tag">cluster-kube-scheduler-operator</span></td> | |
| <td>Rebase 1.35</td> | |
| <td><span class="badge badge-low">LOW (25)</span></td> | |
| <td>Operator rebase; failures are in bootstrap before operators run</td> | |
| </tr> | |
| <tr> | |
| <td><a href="https://github.com/openshift/installer/pull/10345" target="_blank">#10345</a></td> | |
| <td><span class="component-tag">installer</span></td> | |
| <td>Kludge: restart sshd server</td> | |
| <td><span class="badge badge-low">LOW (30)</span></td> | |
| <td>SSH restart kludge — unlikely to affect node registration/bootstrap</td> | |
| </tr> | |
| <tr> | |
| <td><a href="https://github.com/openshift/machine-api-provider-aws/pull/171" target="_blank">#171</a></td> | |
| <td><span class="component-tag">aws-machine-controllers</span></td> | |
| <td>Enable primary IPv6 for DualStackIPv6Primary</td> | |
| <td><span class="badge badge-low">LOW (20)</span></td> | |
| <td>Only affects DualStackIPv6Primary — not the default for these jobs</td> | |
| </tr> | |
| <tr> | |
| <td><a href="https://github.com/openshift/hypershift/pull/7831" target="_blank">#7831</a></td> | |
| <td><span class="component-tag">hypershift</span></td> | |
| <td>Inject proxy env vars into karpenter workloads</td> | |
| <td><span class="badge badge-low">LOW (10)</span></td> | |
| <td>HyperShift change — not used in standard IPI installs</td> | |
| </tr> | |
| <tr> | |
| <td><a href="https://github.com/openshift/console/pull/16042" target="_blank">#16042</a></td> | |
| <td><span class="component-tag">console</span></td> | |
| <td>Fix topology node labels disappearing</td> | |
| <td><span class="badge badge-low">LOW (5)</span></td> | |
| <td>UI fix — no impact on cluster installation</td> | |
| </tr> | |
| <tr> | |
| <td><a href="https://github.com/openshift/cloud-credential-operator/pull/979" target="_blank">#979</a></td> | |
| <td><span class="component-tag">cloud-credential-operator</span></td> | |
| <td>cloud-credential-tests-ext multi-arch support</td> | |
| <td><span class="badge badge-low">LOW (10)</span></td> | |
| <td>Test infrastructure change — no runtime impact</td> | |
| </tr> | |
| <tr> | |
| <td><a href="https://github.com/openshift/cluster-api-provider-vsphere/pull/82" target="_blank">#82</a></td> | |
| <td><span class="component-tag">vsphere-cluster-api-controllers</span></td> | |
| <td>Update to new manifests-gen</td> | |
| <td><span class="badge badge-low">LOW (5)</span></td> | |
| <td>vSphere provider — not used in AWS jobs</td> | |
| </tr> | |
| <tr> | |
| <td><a href="https://github.com/openshift/cluster-samples-operator/pull/675" target="_blank">#675</a></td> | |
| <td><span class="component-tag">cluster-samples-operator</span></td> | |
| <td>ART image update</td> | |
| <td><span class="badge badge-low">LOW (5)</span></td> | |
| <td>Image consistency update — no functional change</td> | |
| </tr> | |
| <tr> | |
| <td><a href="https://github.com/openshift/openshift-state-metrics/pull/130" target="_blank">#130</a></td> | |
| <td><span class="component-tag">openshift-state-metrics</span></td> | |
| <td>ART image update</td> | |
| <td><span class="badge badge-low">LOW (5)</span></td> | |
| <td>Image consistency update — no functional change</td> | |
| </tr> | |
| <tr> | |
| <td><a href="https://github.com/openshift/prometheus-alertmanager/pull/119" target="_blank">#119</a></td> | |
| <td><span class="component-tag">prometheus-alertmanager</span></td> | |
| <td>ART image update</td> | |
| <td><span class="badge badge-low">LOW (5)</span></td> | |
| <td>Image consistency update — no functional change</td> | |
| </tr> | |
| <tr> | |
| <td><a href="https://github.com/openshift/prometheus-operator/pull/372" target="_blank">#372</a></td> | |
| <td><span class="component-tag">prometheus-operator</span></td> | |
| <td>ART image update</td> | |
| <td><span class="badge badge-low">LOW (5)</span></td> | |
| <td>Image consistency update — no functional change</td> | |
| </tr> | |
| </table> | |
| </div> | |
| <h2>Broader Context: TechPreview Job Health</h2> | |
| <div class="card"> | |
| <p>TechPreview jobs across the 4.22 release are experiencing <strong>widespread bootstrap failures</strong> driven by the same ManagementCPUsOverride deadlock. Nearly all techpreview variants (not just AWS) show severely degraded or 0% pass rates:</p> | |
| <ul> | |
| <li><strong>gcp-ovn-techpreview-serial:</strong> 0% (was 47.6%) — 7/7 failed</li> | |
| <li><strong>metal-ipi-ovn-dualstack-techpreview:</strong> 0% (was 52.4%) — 7/7 failed</li> | |
| <li><strong>aws-ovn-single-node-techpreview:</strong> 0% — 7/7 failed</li> | |
| <li><strong>metal-ipi-ovn-ipv6-techpreview:</strong> 0% — 7/7 failed</li> | |
| </ul> | |
| <p>Failures appear on AWS, GCP, and metal alike, which indicates this is <strong>not an infrastructure issue</strong> but a systemic TechPreview bootstrap bug. The ManagementCPUsOverride admission plugin deadlock affects any TechPreview cluster where operator pods carry <code>target.workload.openshift.io/</code> annotations, which is all of them.</p> | |
| </div> | |
| <h2>PR Investigation Summary</h2> | |
| <div class="card"> | |
| <p>All 13 new PRs were investigated in depth. The three Kubernetes 1.35 rebase PRs were examined with particular attention since TechPreview bootstrap was the failure mode:</p> | |
| <table> | |
| <tr><th>PR</th><th>Finding</th></tr> | |
| <tr> | |
| <td><a href="https://github.com/openshift/cluster-kube-apiserver-operator/pull/2003" target="_blank">CKAO #2003</a></td> | |
| <td>Vendor-only rebase. Zero operator source code changes. No admission plugin configuration changes. Only 2 non-vendor files changed (PodSecurityAdmission API adaptation).</td> | |
| </tr> | |
| <tr> | |
| <td><a href="https://github.com/openshift/cluster-kube-controller-manager-operator/pull/902" target="_blank">CKCM #902</a></td> | |
| <td>Vendor-only rebase. Go 1.24→1.25, k8s libs 1.34→1.35. Zero source changes.</td> | |
| </tr> | |
| <tr> | |
| <td><a href="https://github.com/openshift/cluster-kube-scheduler-operator/pull/602" target="_blank">CKSO #602</a></td> | |
| <td>Vendor-only rebase (2,643 files changed, all vendor). Zero source changes. Scheduler does not interact with ManagementCPUsOverride.</td> | |
| </tr> | |
| </table> | |
| <p><strong>Key finding:</strong> The kube-apiserver binary is still on Kubernetes <strong>1.34.2</strong> (not rebased to 1.35 yet), while operator client libraries are now on 1.35. The admission plugin code in <code>openshift/kubernetes</code> is <strong>identical</strong> between release-4.21 and release-4.22. No PR in this payload modified the admission plugin, feature gates, or workload annotations. The bug is pre-existing but worsening — likely due to increased bootstrap load during the partial 1.35 rebase.</p> | |
| </div> | |
| <h2>Action Taken</h2> | |
| <div class="card"> | |
| <div class="verdict verdict-report"> | |
| <strong>REPORT ONLY</strong> — No automated actions taken. | |
| <ul> | |
| <li>All suspect PRs scored LOW confidence (&lt;60)</li> | |
| <li>Failures stem from the pre-existing ManagementCPUsOverride admission plugin bug in TechPreview bootstrap, not from infrastructure or from any PR in this payload</li> | |
| <li>No reverts staged, no bisect experiments initiated</li> | |
| <li>The payload is still in "Ready" state with 7 pending jobs — it may still be accepted if retries pass</li> | |
| </ul> | |
| </div> | |
| </div> | |
| <h2>Recommendations</h2> | |
| <div class="card"> | |
| <ol> | |
| <li><strong>Fix the ManagementCPUsOverride admission plugin logic ordering:</strong> In <code>openshift/kubernetes</code> <code>openshift-kube-apiserver/admission/autoscaling/managementcpusoverride/admission.go</code>, move the <code>isCPUPartitioning()</code> check <em>before</em> the <code>len(nodes) == 0</code> check. For clusters where <code>CPUPartitioning != AllNodes</code>, the plugin should return <code>nil</code> without checking node count. This would fix the bootstrap deadlock for all non-CPU-partitioned TechPreview clusters.</li> | |
| <li><strong>File a tracking bug against <code>openshift/kubernetes</code>:</strong> The bug has existed since 2021 (<a href="https://github.com/openshift/kubernetes/pull/756" target="_blank">#756</a> was supposed to fix it). All TechPreview jobs across all platforms (AWS, GCP, metal) are affected. This is not platform-specific — it's a kube-apiserver admission plugin bug.</li> | |
| <li><strong>Investigate timing regression:</strong> The admission plugin code hasn't changed, but pass rates have dropped from 62-83% to 41-70%. The partial Kubernetes 1.35 rebase (operators on 1.35 libs, kube-apiserver binary on 1.34.2) may have introduced timing differences that make the race condition more likely to trigger.</li> | |
| <li><strong>Monitor the payload:</strong> 7 blocking jobs are still pending. The payload may yet be accepted if retries pass, but TechPreview jobs will continue to be unreliable until the admission plugin is fixed.</li> | |
| </ol> | |
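<p>The reordering proposed in the first recommendation can be sketched as follows. This is a simplified model under stated assumptions, not a patch: <code>admitReordered</code> and <code>cpuPartitioningMode</code> are hypothetical names, and the real <code>isCPUPartitioning()</code> also takes the node list as an argument, so an actual fix would need to decide how to evaluate it with zero nodes.</p>

```go
package main

import (
	"errors"
	"fmt"
)

// Annotation key as quoted in this report.
const managementAnnotation = "target.workload.openshift.io/management"

var errNoNodes = errors.New("the cluster does not have any nodes")

// cpuPartitioningMode is a hypothetical stand-in for the CPUPartitioning
// value discussed in this report.
type cpuPartitioningMode string

const (
	partitioningNone     cpuPartitioningMode = "None"
	partitioningAllNodes cpuPartitioningMode = "AllNodes"
)

// admitReordered consults the partitioning mode before the node count,
// so clusters without CPU partitioning never hit the zero-node rejection.
func admitReordered(annotations map[string]string, mode cpuPartitioningMode, nodeCount int) error {
	// Pods without the workload annotation always pass.
	if _, ok := annotations[managementAnnotation]; !ok {
		return nil
	}
	// Partitioning is not active: the node count is irrelevant.
	if mode != partitioningAllNodes {
		return nil
	}
	// Partitioning is active: keep the existing zero-node guard.
	if nodeCount == 0 {
		return errNoNodes
	}
	return nil // the real plugin would adjust pod resources here (omitted)
}

func main() {
	annotated := map[string]string{managementAnnotation: "placeholder"}

	fmt.Println("None, 0 nodes:    ", admitReordered(annotated, partitioningNone, 0))
	fmt.Println("AllNodes, 0 nodes:", admitReordered(annotated, partitioningAllNodes, 0))
	fmt.Println("AllNodes, 3 nodes:", admitReordered(annotated, partitioningAllNodes, 3))
}
```

<p>With this ordering the zero-node rejection survives only for clusters where partitioning is set to <code>AllNodes</code>, which is the behavior the recommendation asks for.</p>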
| </div> | |
| <div class="footer"> | |
| <p>Generated by Payload Agent | Analysis Date: 2026-03-04 | Payload: 4.22.0-0.nightly-2026-03-04-084819</p> | |
| <p>All links open in new tabs.</p> | |
| </div> | |
| </div> | |
| </body> | |
| </html> |