Date: February 27, 2026
Clusters: mlinfra-prod, mlinfra-29
Three things broke, all fixed now. Both clusters are stable.
- KFP frontend images kept reverting after manual edits.
workflow-controllerandkserve-controller-managerwere in CrashLoopBackOff on prod.- Prod and dev had drifted apart on controller RBAC and KFP config.