@therevoman
Created July 17, 2025 16:24
How to recover from failing machine-config operator or failed OpenShift upgrade

Three-node compact clusters seem to struggle with upgrades: the cordoning prevents necessary pods from coming online, and the nodes often do not get uncordoned afterwards. SNO (single-node OpenShift) clusters almost always need hand-holding.

When a node drain gets stuck, I force the drain manually. I drain one node at a time; brute-forcing across all hosts at once is not a good idea, so take the time to handle each node individually.

oc adm drain master1 --ignore-daemonsets --delete-emptydir-data
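The drain step can be wrapped in a small helper so a stuck drain returns instead of hanging. This is only a sketch: the function name `drain_node` and the 10-minute `--timeout` value are my assumptions, not part of the original notes; the other flags are the ones used in the command above.

```shell
# Drain exactly one node and report the outcome.
# drain_node is a hypothetical helper; the 600s timeout is an arbitrary choice.
drain_node() {
  local node="$1"
  if [ -z "$node" ]; then
    echo "usage: drain_node <node-name>" >&2
    return 2
  fi
  # --timeout makes oc give up after the given duration instead of waiting forever.
  if oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=600s; then
    echo "drained: $node"
  else
    echo "drain failed or timed out: $node" >&2
    return 1
  fi
}
```

Call it as `drain_node master1`, and only move on to the next node once that one has come back and been uncordoned.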

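Sometimes the MCP (machine config pool) is still hung on that node/master, at which point I give it a kick. The /host prefix here assumes a debug shell on the node (oc debug node/<node>), where the node's filesystem is mounted under /host:

touch /host/run/machine-config-daemon-force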

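If the MCP and/or the machine-config operator continue to fail for the node, I kick it harder with

rm /host/etc/machine-config-daemon/currentconfig
touch /host/run/machine-config-daemon-force

which should trigger a reboot. Both paths are relative to the same /host mount; on the node itself (SSH, or after chroot /host) drop the /host prefix.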
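The two "kicks" can also be run without an interactive shell by passing the command to a debug pod. A rough sketch, assuming cluster-admin access; `force_mcd` and its `soft`/`hard` modes are names I made up for illustration. Inside `chroot /host` the files appear at their real node paths, so the /host prefix is dropped.

```shell
# Run the force-file steps on one node through a debug pod.
# force_mcd is a hypothetical helper, not an oc subcommand.
force_mcd() {
  local node="$1"
  local mode="${2:-soft}"
  local cmd
  case "$mode" in
    # soft: only ask the machine-config daemon to re-apply its config
    soft) cmd='touch /run/machine-config-daemon-force' ;;
    # hard: also remove the recorded current config first
    hard) cmd='rm -f /etc/machine-config-daemon/currentconfig && touch /run/machine-config-daemon-force' ;;
    *) echo "mode must be soft or hard" >&2; return 2 ;;
  esac
  # chroot /host switches into the node's own filesystem inside the debug pod.
  oc debug "node/$node" -- chroot /host sh -c "$cmd"
}
```

For example, `force_mcd master1` for the gentle version and `force_mcd master1 hard` when the pool is still stuck.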

When the node reboots, if it's an SNO or compact (three masters only) cluster, I manually uncordon the node to get the critical pods running, which is usually enough for the upgrade to move forward.
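The post-reboot step might look something like this. It is a sketch under my own assumptions: the helper name `uncordon_and_check` and the 15-minute wait are arbitrary, and the two `oc get` calls are only there to eyeball progress.

```shell
# Wait for a rebooted node, uncordon it, then show node and pool status.
# uncordon_and_check is a hypothetical helper; 15m is an arbitrary timeout.
uncordon_and_check() {
  local node="$1"
  # Block until the node reports the Ready condition again.
  oc wait --for=condition=Ready "node/$node" --timeout=15m || return 1
  # Make the node schedulable so the critical pods can start.
  oc adm uncordon "$node" || return 1
  # Show the node and the machine config pools to see whether things move forward.
  oc get node "$node"
  oc get mcp
}
```

Run `uncordon_and_check master1` and watch whether the pool's UPDATED/UPDATING columns change before touching the next master.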

Often I have to repeat this process on all masters.

Sometimes operators don't upgrade, so I have to uninstall and reinstall them. If I leave the custom resources/components in place, the reinstall usually works.
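For OLM-managed operators, one way to do the uninstall half while leaving the custom resources alone is to delete only the Subscription and its ClusterServiceVersion. This is a sketch, not necessarily how the uninstall was done here; `remove_operator_keep_crs` is a hypothetical helper, and the subscription name and namespace are whatever your operator uses.

```shell
# Remove an OLM operator's Subscription and CSV, leaving its CRDs and CRs in place.
# remove_operator_keep_crs is a hypothetical helper name.
remove_operator_keep_crs() {
  local sub="$1"
  local ns="$2"
  local csv
  # Look up which CSV the subscription currently has installed.
  csv=$(oc get subscription.operators.coreos.com "$sub" -n "$ns" \
        -o jsonpath='{.status.installedCSV}') || return 1
  oc delete subscription.operators.coreos.com "$sub" -n "$ns" || return 1
  if [ -n "$csv" ]; then
    oc delete clusterserviceversion "$csv" -n "$ns"
  fi
}
```

After that, reinstall the operator the same way it was installed originally (OperatorHub or your own Subscription manifest).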

Reference:

https://access.redhat.com/solutions/6997263

If things go really bad, I have to reset the node annotations for currentConfig and desiredConfig: https://access.redhat.com/solutions/5598401
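Before resetting anything it helps to see what the node currently records. A small read-only sketch, assuming the annotations live under the machineconfiguration.openshift.io/ prefix with the keys currentConfig, desiredConfig and state; `show_mc_annotations` is a hypothetical helper name.

```shell
# Print the machine-config annotations for one node (read-only).
# show_mc_annotations is a hypothetical helper name.
show_mc_annotations() {
  local node="$1"
  local key
  for key in currentConfig desiredConfig state; do
    printf '%s: ' "$key"
    # Dots inside the annotation name must be escaped in the jsonpath expression.
    oc get node "$node" -o "jsonpath={.metadata.annotations.machineconfiguration\.openshift\.io/$key}"
    echo
  done
}
```

`show_mc_annotations master1` prints one line per annotation; if currentConfig and desiredConfig differ and state is not Done, the node has not finished applying its config. Follow the linked solution for the actual reset.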
