Date: February 27, 2026
Clusters: mlinfra-prod, mlinfra-29
Three things broke, all fixed now. Both clusters are stable.
- KFP frontend images kept reverting after manual edits.
- `workflow-controller` and `kserve-controller-manager` were in CrashLoopBackOff on prod.
- Prod and dev had drifted apart on controller RBAC and KFP config.
Manual edits to ml-pipeline-ui and ml-pipeline-ui-artifact kept reverting. Metacontroller/profile-controller was reconciling them back from the parent Namespace resource.
Fix: patched the profile-controller env ConfigMap to set FRONTEND_IMAGE=ghcr.io/kubeflow/kfp-frontend and FRONTEND_TAG=2.5.0, then restarted profile-controller and triggered a namespace reconcile. Now the controller itself writes the correct image — no more manual edits to revert.
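The fix above can be sketched as follows. This is a reconstruction, not the exact commands run: the ConfigMap name `kubeflow-pipelines-profile-controller-env` and the use of a throwaway label on the `admin` namespace to nudge reconciliation are assumptions (kustomize may append a hash suffix to the ConfigMap name; check with `kubectl -n kubeflow get configmap`).

```shell
CTX=mlinfra-prod

# Pin the frontend image/tag in the profile-controller env ConfigMap
# (name assumed; may carry a kustomize hash suffix).
kubectl --context "$CTX" -n kubeflow patch configmap \
  kubeflow-pipelines-profile-controller-env \
  --type merge \
  -p '{"data":{"FRONTEND_IMAGE":"ghcr.io/kubeflow/kfp-frontend","FRONTEND_TAG":"2.5.0"}}'

# Restart the controller so it picks up the new env.
kubectl --context "$CTX" -n kubeflow rollout restart \
  deployment/kubeflow-pipelines-profile-controller

# Nudge a reconcile by touching the parent Namespace (any no-op label works).
kubectl --context "$CTX" label namespace admin reconcile-nudge=1 --overwrite
```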
workflow-controller and kserve-controller-manager were crashing on prod with forbidden list/watch errors. RBAC bindings had drifted.
Fixes:
- Restored cluster-scope RBAC for the workflow controller (`kubeflow/argo` SA).
- Fixed the KServe manager binding subjects to include the expected service accounts.
- Restarted both deployments; both rolled out healthy.
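A sketch of the verification side of this fix. The ClusterRole name `argo-cluster-role` is a placeholder (use whatever the manifests define); the `auth can-i` checks target the exact list/watch permissions that were returning forbidden errors.

```shell
CTX=mlinfra-prod

# Verify the permissions that were failing (forbidden list/watch) for the
# workflow controller SA (kubeflow/argo).
kubectl --context "$CTX" auth can-i list workflows.argoproj.io \
  --as=system:serviceaccount:kubeflow:argo
kubectl --context "$CTX" auth can-i watch workflows.argoproj.io \
  --as=system:serviceaccount:kubeflow:argo

# Restart both deployments once the bindings are back in place.
kubectl --context "$CTX" -n kubeflow rollout restart deployment/workflow-controller
kubectl --context "$CTX" -n kserve rollout restart deployment/kserve-controller-manager
```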
Argo CD was not running (no namespace, no pods, no apps) but had leftover CRDs: applications.argoproj.io, applicationsets.argoproj.io, appprojects.argoproj.io. Removed those. Kept the Argo Workflows CRDs — Kubeflow Pipelines needs those.
For anyone confused by this in the future: KFP depends on Argo Workflows (execution engine), not Argo CD (GitOps controller). Different projects, similar names.
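The cleanup above amounts to deleting only the three Argo CD CRDs while leaving the Argo Workflows ones (`workflows.argoproj.io`, `workflowtemplates.argoproj.io`, etc.) untouched. A minimal sketch:

```shell
CTX=mlinfra-prod

# Remove leftover Argo CD CRDs (Argo CD itself was not installed).
# Do NOT touch the Argo Workflows CRDs -- KFP depends on those.
for crd in applications.argoproj.io applicationsets.argoproj.io appprojects.argoproj.io; do
  kubectl --context "$CTX" delete crd "$crd"
done

# Sanity check: the Argo Workflows CRDs should still be present.
kubectl --context "$CTX" get crd | grep argoproj.io
```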
Was 2.0.5 in both clusters. Now 2.5.0 in both.
Set explicit image pins so reconciliation writes the right thing going forward:
- `FRONTEND_IMAGE=ghcr.io/kubeflow/kfp-frontend`
- `FRONTEND_TAG=2.5.0`
- `VISUALIZATION_SERVER_IMAGE=gcr.io/ml-pipeline/visualization-server`
- `VISUALIZATION_SERVER_TAG=2.0.5` (pinned during the appVersion bump; will update separately)
- `admin/ml-pipeline-ui-artifact` → `ghcr.io/kubeflow/kfp-frontend:2.5.0` in both clusters.
- `kserve-manager-rolebinding` includes both subjects in both clusters:
  - `ServiceAccount:kserve:kserve-controller-manager`
  - `ServiceAccount:kubeflow:kserve-controller-manager`
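A quick way to re-check both facts across both clusters. This assumes `kserve-manager-rolebinding` is a ClusterRoleBinding and that `ml-pipeline-ui-artifact` runs as a Deployment in the `admin` profile namespace; adjust if either differs.

```shell
for ctx in mlinfra-prod mlinfra-29; do
  echo "== $ctx =="
  # Frontend image actually deployed in the profile namespace.
  kubectl --context "$ctx" -n admin get deployment ml-pipeline-ui-artifact \
    -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
  # Subjects on the KServe manager binding.
  kubectl --context "$ctx" get clusterrolebinding kserve-manager-rolebinding \
    -o jsonpath='{range .subjects[*]}{.kind}:{.namespace}:{.name}{"\n"}{end}'
done
```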
One-command dump of Kubeflow/KFP state for a given context. Reports component images, pipeline-install-config.appVersion, profile-controller frontend overrides, and infers the Kubeflow release line.
./scripts/kubeflow-version-snapshot.sh mlinfra-prod kubeflow
./scripts/kubeflow-version-snapshot.sh mlinfra-29 kubeflow

Runs `kubectl auth can-i` checks against the service accounts that broke during this incident: workflow-controller (`kubeflow/argo`), the KServe controller (both SA locations), and scheduledworkflow.
./scripts/kubeflow-rbac-smoke.sh mlinfra-prod
./scripts/kubeflow-rbac-smoke.sh mlinfra-29

Both clusters pass all checks as of this writing.
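For reference, the smoke script's checks boil down to impersonated `auth can-i` calls like the ones below. These are illustrative, not the script's full list; the scheduledworkflow SA name `ml-pipeline-scheduledworkflow` is an assumption.

```shell
CTX=mlinfra-prod

# Workflow controller SA (kubeflow/argo).
kubectl --context "$CTX" auth can-i watch workflows.argoproj.io \
  --as=system:serviceaccount:kubeflow:argo

# KServe controller, both SA locations.
kubectl --context "$CTX" auth can-i list inferenceservices.serving.kserve.io \
  --as=system:serviceaccount:kserve:kserve-controller-manager
kubectl --context "$CTX" auth can-i list inferenceservices.serving.kserve.io \
  --as=system:serviceaccount:kubeflow:kserve-controller-manager

# Scheduled workflow controller (SA name assumed).
kubectl --context "$CTX" auth can-i list scheduledworkflows.kubeflow.org \
  --as=system:serviceaccount:kubeflow:ml-pipeline-scheduledworkflow
```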
- Kubernetes: `v1.32.11-eks-*` (both clusters)
- Kubeflow control plane: `v1.8.x` (core components on `v1.8.0` tags)
- KFP frontend: `2.5.0` (both clusters)
How to do the next upgrade without repeating this:
- Freeze the baseline. Export critical Kubeflow manifests and the RBAC deltas from this incident into source control before touching anything.
- Pick target versions up front. Choose the target EKS minor and compatible Kubeflow/KServe/KFP versions as a set. Don't mix and match.
- Do dev first, fully. Control plane → add-ons → nodegroups → Kubeflow validation. The whole sequence in dev before touching prod.
- Make the hotfixes declarative. The profile-controller env pins and RBAC fixes from this incident need to live in manifests, not be things I patched by hand.
- Gate each phase. Don't move to the next step until these are healthy:
  - `workflow-controller`
  - `kserve-controller-manager`
  - `ml-pipeline-ui`
  - `ml-pipeline-ui-artifact`
  - A sample pipeline run completes
- Collapse the duplicate KServe controller. Right now it exists in both `kserve` and `kubeflow` namespaces. Pick one, remove the other.
- Run the smoke scripts after each phase. Version snapshot + RBAC smoke. That's what they're for.
- Don't do a one-shot full-stack upgrade on prod. Same staged sequence as dev, with rollback points between phases.
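The per-phase gate can be sketched as a loop over the four deployments. Namespaces are assumptions based on this incident's layout (`kserve` for the KServe controller, `admin` for the profile-namespace UI artifact proxy); the sample pipeline run still has to be checked by hand.

```shell
CTX=mlinfra-29

# Block until each gating deployment has rolled out healthy.
for target in \
  kubeflow/workflow-controller \
  kserve/kserve-controller-manager \
  kubeflow/ml-pipeline-ui \
  admin/ml-pipeline-ui-artifact; do
  ns=${target%/*}
  deploy=${target#*/}
  kubectl --context "$CTX" -n "$ns" rollout status "deployment/$deploy" --timeout=180s
done

# Then submit a sample pipeline run and confirm it completes before moving on.
```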