Date: February 27, 2026
Clusters: mlinfra-prod, mlinfra-29
Three things broke, all fixed now. Both clusters are stable.
- KFP frontend images kept reverting after manual edits.
- `workflow-controller` and `kserve-controller-manager` were in CrashLoopBackOff on prod.
- Prod and dev had drifted apart on controller RBAC and KFP config.
Manual edits to ml-pipeline-ui and ml-pipeline-ui-artifact kept reverting. Metacontroller/profile-controller was reconciling them back from the parent Namespace resource.
Fix: patched the profile-controller env ConfigMap to set FRONTEND_IMAGE=ghcr.io/kubeflow/kfp-frontend and FRONTEND_TAG=2.5.0, then restarted profile-controller and triggered a namespace reconcile. Now the controller itself writes the correct image — no more manual edits to revert.
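The fix above can be sketched as follows. This is a reconstruction, not the exact commands run: the ConfigMap name `kubeflow-pipelines-profile-controller-env` and the use of a throwaway label on the `admin` namespace to nudge reconciliation are assumptions (kustomize may append a hash suffix to the ConfigMap name; check with `kubectl -n kubeflow get configmap`).

```shell
CTX=mlinfra-prod

# Pin the frontend image/tag in the profile-controller env ConfigMap
# (name assumed; may carry a kustomize hash suffix).
kubectl --context "$CTX" -n kubeflow patch configmap \
  kubeflow-pipelines-profile-controller-env \
  --type merge \
  -p '{"data":{"FRONTEND_IMAGE":"ghcr.io/kubeflow/kfp-frontend","FRONTEND_TAG":"2.5.0"}}'

# Restart the controller so it picks up the new env.
kubectl --context "$CTX" -n kubeflow rollout restart \
  deployment/kubeflow-pipelines-profile-controller

# Nudge a reconcile by touching the parent Namespace (any no-op label works).
kubectl --context "$CTX" label namespace admin reconcile-nudge=1 --overwrite
```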
workflow-controller and kserve-controller-manager were crashing on prod with forbidden list/watch errors. RBAC bindings had drifted.
Fixes:
- Restored cluster-scope RBAC for the workflow controller (`kubeflow/argo` SA).
- Fixed the KServe manager binding subjects to include the expected service accounts.
- Restarted both deployments; both rolled out healthy.
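A sketch of the verification side of this fix. The ClusterRole name `argo-cluster-role` is a placeholder (use whatever the manifests define); the `auth can-i` checks target the exact list/watch permissions that were returning forbidden errors.

```shell
CTX=mlinfra-prod

# Verify the permissions that were failing (forbidden list/watch) for the
# workflow controller SA (kubeflow/argo).
kubectl --context "$CTX" auth can-i list workflows.argoproj.io \
  --as=system:serviceaccount:kubeflow:argo
kubectl --context "$CTX" auth can-i watch workflows.argoproj.io \
  --as=system:serviceaccount:kubeflow:argo

# Restart both deployments once the bindings are back in place.
kubectl --context "$CTX" -n kubeflow rollout restart deployment/workflow-controller
kubectl --context "$CTX" -n kserve rollout restart deployment/kserve-controller-manager
```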
Argo CD was not running (no namespace, no pods, no apps) but had leftover CRDs: applications.argoproj.io, applicationsets.argoproj.io, appprojects.argoproj.io. Removed those. Kept the Argo Workflows CRDs — Kubeflow Pipelines needs those.
For anyone confused by this in the future: KFP depends on Argo Workflows (execution engine), not Argo CD (GitOps controller). Different projects, similar names.
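The cleanup above amounts to deleting only the three Argo CD CRDs while leaving the Argo Workflows ones (`workflows.argoproj.io`, `workflowtemplates.argoproj.io`, etc.) untouched. A minimal sketch:

```shell
CTX=mlinfra-prod

# Remove leftover Argo CD CRDs (Argo CD itself was not installed).
# Do NOT touch the Argo Workflows CRDs -- KFP depends on those.
for crd in applications.argoproj.io applicationsets.argoproj.io appprojects.argoproj.io; do
  kubectl --context "$CTX" delete crd "$crd"
done

# Sanity check: the Argo Workflows CRDs should still be present.
kubectl --context "$CTX" get crd | grep argoproj.io
```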
Was 2.0.5 in both clusters. Now 2.5.0 in both.
Set explicit image pins so reconciliation writes the right thing going forward:
- `FRONTEND_IMAGE=ghcr.io/kubeflow/kfp-frontend`
- `FRONTEND_TAG=2.5.0`
- `VISUALIZATION_SERVER_IMAGE=gcr.io/ml-pipeline/visualization-server`
- `VISUALIZATION_SERVER_TAG=2.0.5` (pinned during the appVersion bump; will update separately)
- `admin/ml-pipeline-ui-artifact` → `ghcr.io/kubeflow/kfp-frontend:2.5.0` in both clusters.
- `kserve-manager-rolebinding` includes both subjects in both clusters:
  - `ServiceAccount:kserve:kserve-controller-manager`
  - `ServiceAccount:kubeflow:kserve-controller-manager`
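A quick way to re-check both facts across both clusters. This assumes `kserve-manager-rolebinding` is a ClusterRoleBinding and that `ml-pipeline-ui-artifact` runs as a Deployment in the `admin` profile namespace; adjust if either differs.

```shell
for ctx in mlinfra-prod mlinfra-29; do
  echo "== $ctx =="
  # Frontend image actually deployed in the profile namespace.
  kubectl --context "$ctx" -n admin get deployment ml-pipeline-ui-artifact \
    -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'
  # Subjects on the KServe manager binding.
  kubectl --context "$ctx" get clusterrolebinding kserve-manager-rolebinding \
    -o jsonpath='{range .subjects[*]}{.kind}:{.namespace}:{.name}{"\n"}{end}'
done
```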
One-command dump of Kubeflow/KFP state for a given context. Reports component images, pipeline-install-config.appVersion, profile-controller frontend overrides, and infers the Kubeflow release line.
./scripts/kubeflow-version-snapshot.sh mlinfra-prod kubeflow
./scripts/kubeflow-version-snapshot.sh mlinfra-29 kubeflow

Runs `kubectl auth can-i` checks against the service accounts that broke during this incident: workflow-controller (`kubeflow/argo`), the KServe controller (both SA locations), and scheduledworkflow.
./scripts/kubeflow-rbac-smoke.sh mlinfra-prod
./scripts/kubeflow-rbac-smoke.sh mlinfra-29

Both clusters pass all checks as of this writing.
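For reference, the smoke script's checks boil down to impersonated `auth can-i` calls like the ones below. These are illustrative, not the script's full list; the scheduledworkflow SA name `ml-pipeline-scheduledworkflow` is an assumption.

```shell
CTX=mlinfra-prod

# Workflow controller SA (kubeflow/argo).
kubectl --context "$CTX" auth can-i watch workflows.argoproj.io \
  --as=system:serviceaccount:kubeflow:argo

# KServe controller, both SA locations.
kubectl --context "$CTX" auth can-i list inferenceservices.serving.kserve.io \
  --as=system:serviceaccount:kserve:kserve-controller-manager
kubectl --context "$CTX" auth can-i list inferenceservices.serving.kserve.io \
  --as=system:serviceaccount:kubeflow:kserve-controller-manager

# Scheduled workflow controller (SA name assumed).
kubectl --context "$CTX" auth can-i list scheduledworkflows.kubeflow.org \
  --as=system:serviceaccount:kubeflow:ml-pipeline-scheduledworkflow
```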
- Kubernetes: `v1.32.11-eks-*` (both clusters)
- Kubeflow control plane: `v1.8.x` (core components on `v1.8.0` tags)
- KFP frontend: `2.5.0` (both clusters)
How to do the next upgrade without repeating this:
- Freeze the baseline. Export critical Kubeflow manifests and the RBAC deltas from this incident into source control before touching anything.
- Pick target versions up front. Choose the target EKS minor and compatible Kubeflow/KServe/KFP versions as a set. Don't mix and match.
- Do dev first, fully. Control plane → add-ons → nodegroups → Kubeflow validation. The whole sequence in dev before touching prod.
- Make the hotfixes declarative. The profile-controller env pins and RBAC fixes from this incident need to live in manifests, not be things I patched by hand.
- Gate each phase. Don't move to the next step until these are healthy:
  - `workflow-controller`
  - `kserve-controller-manager`
  - `ml-pipeline-ui`
  - `ml-pipeline-ui-artifact`
  - A sample pipeline run completes
- Collapse the duplicate KServe controller. Right now it exists in both `kserve` and `kubeflow` namespaces. Pick one, remove the other.
- Run the smoke scripts after each phase. Version snapshot + RBAC smoke. That's what they're for.
- Don't do a one-shot full-stack upgrade on prod. Same staged sequence as dev, with rollback points between phases.
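The per-phase gate can be sketched as a loop over the four deployments. Namespaces are assumptions based on this incident's layout (`kserve` for the KServe controller, `admin` for the profile-namespace UI artifact proxy); the sample pipeline run still has to be checked by hand.

```shell
CTX=mlinfra-29

# Block until each gating deployment has rolled out healthy.
for target in \
  kubeflow/workflow-controller \
  kserve/kserve-controller-manager \
  kubeflow/ml-pipeline-ui \
  admin/ml-pipeline-ui-artifact; do
  ns=${target%/*}
  deploy=${target#*/}
  kubectl --context "$CTX" -n "$ns" rollout status "deployment/$deploy" --timeout=180s
done

# Then submit a sample pipeline run and confirm it completes before moving on.
```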