therevoman/fix-my-machine-config.md

## fix-my-machine-config.md

      
Three-node compact clusters seem to struggle with upgrades: cordoning prevents necessary pods from coming online, and the nodes often do not uncordon on their own.
Single-node OpenShift (SNO) clusters almost always need hand-holding.
When the node drain gets stuck, I have to force a drain manually. I drain one node at a time; brute-forcing across all hosts at once is not a good idea. Take the time to handle each node individually.
oc adm drain master1 --ignore-daemonsets --delete-emptydir-data
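
To make the one-node-at-a-time rule harder to break by accident, the drain can be wrapped in a small helper. This is only a sketch: `drain_node` and the `OC` variable are names invented for this example, and the flags are exactly the ones from the command above.

```shell
# Hypothetical helper, not part of oc. OC can be overridden (OC=echo)
# to print the command instead of running it.
OC="${OC:-oc}"

drain_node() {
  # Accept exactly one node name, so a typo or a shell glob cannot turn
  # this into a drain of every master at once.
  if [ "$#" -ne 1 ]; then
    echo "usage: drain_node <node-name> (one node at a time)" >&2
    return 2
  fi
  "$OC" adm drain "$1" --ignore-daemonsets --delete-emptydir-data
}
```

Usage would be `drain_node master1`. Finish with that node (reboot, uncordon) before calling it for the next one.
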
Sometimes the MachineConfigPool (MCP) is still hung on that node/master, at which point I give it a kick with:
touch /host/run/machine-config-daemon-force
If the MCP and/or the machine-config operator continue to fail for the node, I kick it harder with:
rm /etc/machine-config-daemon/currentconfig
touch /host/run/machine-config-daemon-force

which should trigger a reboot.
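
The `/host` prefix in the commands above suggests they are run from an `oc debug node/<node>` shell, where the node's filesystem is mounted under `/host`. A rough sketch of running both kicks as one-shot commands follows. It assumes `chroot /host`, so the paths lose the `/host` prefix; `force_mcd`, `reset_mcd` and `OC` are names invented for this example.

```shell
# Hypothetical helpers. OC can be overridden (OC=echo) for a dry run.
OC="${OC:-oc}"

force_mcd() {
  # $1 = node name. Gentle kick: create the force file on the host.
  "$OC" debug "node/$1" -- chroot /host touch /run/machine-config-daemon-force
}

reset_mcd() {
  # $1 = node name. Harder kick: remove the daemon's on-disk current config,
  # then create the force file. If the rm fails, the touch is skipped.
  "$OC" debug "node/$1" -- chroot /host sh -c \
    'rm /etc/machine-config-daemon/currentconfig && touch /run/machine-config-daemon-force'
}
```

Try `force_mcd master1` first and reach for `reset_mcd master1` only if the pool stays stuck.
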
When the node reboots, if it's SNO or compact (only 3 masters), I manually uncordon the node to get the critical pods running, which is usually enough for the upgrade to move forward.
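
The manual uncordon might look like the sketch below. `uncordon_after_reboot` and `OC` are names invented for this example, and the 15-minute default timeout is an arbitrary assumption. On SNO the API server goes down with the node, so the wait will likely just error out until the node is back; rerun it then.

```shell
# Hypothetical helper. OC can be overridden (OC=echo) for a dry run.
OC="${OC:-oc}"

uncordon_after_reboot() {
  # $1 = node name, $2 = optional timeout (defaults to 15m).
  # Wait for the node to report Ready, then make it schedulable again.
  "$OC" wait --for=condition=Ready "node/$1" --timeout="${2:-15m}" &&
    "$OC" adm uncordon "$1"
}
```

Run it once the node has actually gone down for its reboot. If it is called too early, the node may still report Ready from before the reboot and the wait returns immediately.
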
Often I have to repeat this process on all masters.
Sometimes operators don't upgrade, so I have to uninstall and reinstall them. If I leave their custom resources and components in place, this usually works.
Reference:
https://access.redhat.com/solutions/6997263
If things go really badly, I have to reset the node annotations for currentConfig and desiredConfig:
https://access.redhat.com/solutions/5598401
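
As a rough illustration of such a reset: the machine-config operator tracks each node's config in the `machineconfiguration.openshift.io/currentConfig` and `machineconfiguration.openshift.io/desiredConfig` annotations. The sketch below points both at one rendered MachineConfig. `reset_node_config_annotations` and `OC` are names invented for this example, and the rendered config name has to be supplied by hand (for instance from `oc get mc`). Whether this matches the linked solution step for step is an assumption, so read that first.

```shell
# Hypothetical helper. OC can be overridden (OC=echo) for a dry run.
OC="${OC:-oc}"
MCO_PREFIX="machineconfiguration.openshift.io"

reset_node_config_annotations() {
  # $1 = node name, $2 = name of an existing rendered MachineConfig.
  if [ "$#" -ne 2 ]; then
    echo "usage: reset_node_config_annotations <node> <rendered-config>" >&2
    return 2
  fi
  # Overwrite both annotations so current and desired agree.
  "$OC" annotate node "$1" --overwrite \
    "${MCO_PREFIX}/currentConfig=$2" \
    "${MCO_PREFIX}/desiredConfig=$2"
}
```

Setting both annotations to the same value tells the daemon the node is already where it should be, so only use a rendered config that really matches what is on disk.
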