How to skip validation of failing / stuck MachineConfig in OCP 4?
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 4.x
Issue
- How can validation of a failing or stuck MachineConfig be skipped in Red Hat OpenShift Container Platform 4?
Resolution
Note: Adding a force file only postpones the problem until the next upgrade. The underlying issue still needs to be fixed by correcting the machine configs.
Create a force file on the node on which the Machine Config Operator is blocked:
[root@openshift-worker-2 ~]# touch /run/machine-config-daemon-force
[root@openshift-worker-2 ~]#
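If no shell is open on the node, the same file could be created from a host with cluster-admin access through a debug pod. The following is a minimal sketch, not part of the original procedure: the function name create_force_file is hypothetical, and it assumes that the node name matches the hostname shown in the prompt above and that oc debug is permitted in the cluster.

```shell
# Sketch: create the force file on a node through a debug pod.
# create_force_file is a hypothetical helper name, not an oc subcommand.
create_force_file() {
  node="$1"
  # oc debug mounts the node's root filesystem at /host inside the debug pod,
  # so chroot /host makes /run refer to the node's own /run.
  oc debug "node/${node}" -- chroot /host touch /run/machine-config-daemon-force
}

# Example call (node name assumed from the prompt above):
# create_force_file openshift-worker-2
```

Because /run is a tmpfs, a file created this way does not survive a reboot of the node.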
After up to one minute, the Machine Config Daemon logs show that validation passed because the force file is present:
[root@openshift-jumpserver-0 ~]# oc logs -n openshift-machine-config-operator machine-config-daemon-7bh6k -c machine-config-daemon -f --tail=0
I0723 09:18:14.774884 2781 daemon.go:771] Current config: rendered-worker-188a29e5b3089268c8aad7d30e19df4e
I0723 09:18:14.774904 2781 daemon.go:772] Desired config: rendered-worker-deadlock-8489d24bd0e8a36d0f727ef2c3b46b1a
I0723 09:18:14.780027 2781 daemon.go:458] Detected a login session before the daemon took over on first boot
I0723 09:18:14.780064 2781 daemon.go:459] Applying annotation: machineconfiguration.openshift.io/ssh
I0723 09:18:14.790755 2781 update.go:1404] Disk currentConfig rendered-worker-deadlock-8489d24bd0e8a36d0f727ef2c3b46b1a overrides node annotation rendered-worker-188a29e5b3089268c8aad7d30e19df4e
I0723 09:18:14.793129 2781 daemon.go:1014] Validating against pending config rendered-worker-deadlock-8489d24bd0e8a36d0f727ef2c3b46b1a
I0723 09:18:14.793210 2781 daemon.go:1025] Skipping on-disk validation; /run/machine-config-daemon-force present
I0723 09:18:14.793225 2781 daemon.go:1030] Validated on-disk state
I0723 09:18:14.804833 2781 daemon.go:1064] Completing pending config rendered-worker-deadlock-8489d24bd0e8a36d0f727ef2c3b46b1a
I0723 09:18:14.817299 2781 update.go:1404] completed update for config rendered-worker-deadlock-8489d24bd0e8a36d0f727ef2c3b46b1a
I0723 09:18:14.827305 2781 daemon.go:1080] In desired config rendered-worker-deadlock-8489d24bd0e8a36d0f727ef2c3b46b1a
[root@openshift-jumpserver-0 ~]#
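Instead of reading the log by eye, the relevant message can be searched for. The sketch below assumes that the message text is exactly as shown in the excerpt above, which may differ between releases. The helper name force_file_honored is hypothetical, and the sample input consists of two lines copied from that excerpt.

```shell
# Succeeds if the MCD log text on stdin shows that the force file was honored.
# force_file_honored is a hypothetical helper name.
force_file_honored() {
  grep -q 'Skipping on-disk validation; /run/machine-config-daemon-force present'
}

# Sample input: two lines copied from the log excerpt above.
sample_log='I0723 09:18:14.793129 2781 daemon.go:1014] Validating against pending config rendered-worker-deadlock-8489d24bd0e8a36d0f727ef2c3b46b1a
I0723 09:18:14.793210 2781 daemon.go:1025] Skipping on-disk validation; /run/machine-config-daemon-force present'

if printf '%s\n' "$sample_log" | force_file_honored; then
  echo "force file honored"
else
  echo "validation not skipped"
fi

# Against a live cluster, the input would come from the daemon pod instead, e.g.:
# oc logs -n openshift-machine-config-operator machine-config-daemon-7bh6k \
#   -c machine-config-daemon | force_file_honored
```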
After a couple of minutes, the Machine Config Operator notices that the node was able to apply the desired MachineConfig. The MachineConfigPool then reports UPDATED as True:
[root@openshift-jumpserver-0 ~]# oc get machineconfigpool
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
kata-oc rendered-kata-oc-83c7461363c2f2cf6ad3ae809f2f28dd True False False 1 1 1 0 12h
master rendered-master-9498852ddfc929de9a98bb56cea862f8 True False False 3 3 3 0 2d16h
worker rendered-worker-188a29e5b3089268c8aad7d30e19df4e True False False 1 1 1 0 2d16h
worker-deadlock rendered-worker-deadlock-8489d24bd0e8a36d0f727ef2c3b46b1a True False False 1 1 1 0 27m
[root@openshift-jumpserver-0 ~]#
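The UPDATED value of a single pool can be extracted from this output, for example when polling from a script. This is a rough sketch that assumes the default column layout shown above, with UPDATED as the third column. The helper name pool_updated is hypothetical, and the sample rows are copied from the output above.

```shell
# Prints the UPDATED column for the named pool, reading
# `oc get machineconfigpool` output from stdin.
# pool_updated is a hypothetical helper name.
pool_updated() {
  awk -v pool="$1" '$1 == pool { print $3 }'
}

# Sample input: header and two rows copied from the output above.
sample_pools='NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
worker rendered-worker-188a29e5b3089268c8aad7d30e19df4e True False False 1 1 1 0 2d16h
worker-deadlock rendered-worker-deadlock-8489d24bd0e8a36d0f727ef2c3b46b1a True False False 1 1 1 0 27m'

printf '%s\n' "$sample_pools" | pool_updated worker-deadlock

# Against a live cluster:
# oc get machineconfigpool | pool_updated worker-deadlock
```

The exact match on the first column keeps the worker row from being reported when worker-deadlock is requested, and the other way around.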
Once validation has passed, the underlying issue in the machine config definitions can be corrected manually.
As a last step, if the file is still present, remove it:
[root@openshift-worker-2 ~]# rm -f /run/machine-config-daemon-force
[root@openshift-worker-2 ~]#
Root Cause
The Machine Config Daemon (MCD) validation can be skipped by creating the /run/machine-config-daemon-force file and removing the /etc/machine-config-daemon/currentconfig file on the node.
Diagnostic Steps
The configuration applied on the node is recorded in the following file:
# ls /etc/machine-config-daemon/currentconfig
It contains the current config that the Machine Config Daemon will try to apply on the node.
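The on-disk config can be compared with what the node object reports, since the daemon log above shows the disk currentConfig overriding the node annotation. The sketch below reads the machineconfiguration.openshift.io annotations from the node. The helper name node_mc_annotations is hypothetical, and the node name in the example call is assumed from the prompts in this article.

```shell
# Prints the Machine Config Daemon annotations of a node:
# current config, desired config, and state.
# node_mc_annotations is a hypothetical helper name.
node_mc_annotations() {
  node="$1"
  for key in currentConfig desiredConfig state; do
    # Dots inside the annotation key must be escaped in jsonpath.
    value=$(oc get node "$node" \
      -o "jsonpath={.metadata.annotations.machineconfiguration\.openshift\.io/${key}}")
    printf '%s=%s\n' "$key" "$value"
  done
}

# Example call (node name assumed):
# node_mc_annotations openshift-worker-2
```

A currentConfig that differs from desiredConfig indicates that the node has not yet applied the desired rendered config.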