How to skip validation of failing / stuck MachineConfig in OCP 4?

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4.x

Issue

  • How can validation of a failing or stuck MachineConfig be skipped in Red Hat OpenShift Container Platform 4?

Resolution

Note: Adding a force file only postpones the problem until the next upgrade. The underlying issue still needs to be fixed by correcting the machine configs.

Create a force file on the node on which the Machine Config Operator is blocked:

[root@openshift-worker-2 ~]# touch /run/machine-config-daemon-force
[root@openshift-worker-2 ~]# 
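
If direct shell access to the node is not available, the same file could be created from the jump server through a debug pod. The following is only a minimal sketch and not part of the verified procedure: create_force_file is a hypothetical helper name, and it assumes that the logged-in user is allowed to run oc debug and that the node name matches the output of oc get nodes.

```shell
# Hypothetical helper (not part of oc): create the force file on a node
# through a debug pod instead of an SSH session.
# $1 = node name as reported by `oc get nodes` (assumption).
create_force_file() {
  oc debug "node/$1" -- chroot /host touch /run/machine-config-daemon-force
}
```

It would be called as create_force_file openshift-worker-2. Because /run is a temporary file system, the file does not survive a reboot of the node.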

After waiting for up to one minute, the logs then show that validation passed due to presence of the force file:

[root@openshift-jumpserver-0 ~]# oc logs -n openshift-machine-config-operator machine-config-daemon-7bh6k -c machine-config-daemon -f --tail=0
I0723 09:18:14.774884    2781 daemon.go:771] Current config: rendered-worker-188a29e5b3089268c8aad7d30e19df4e
I0723 09:18:14.774904    2781 daemon.go:772] Desired config: rendered-worker-deadlock-8489d24bd0e8a36d0f727ef2c3b46b1a
I0723 09:18:14.780027    2781 daemon.go:458] Detected a login session before the daemon took over on first boot
I0723 09:18:14.780064    2781 daemon.go:459] Applying annotation: machineconfiguration.openshift.io/ssh
I0723 09:18:14.790755    2781 update.go:1404] Disk currentConfig rendered-worker-deadlock-8489d24bd0e8a36d0f727ef2c3b46b1a overrides node annotation rendered-worker-188a29e5b3089268c8aad7d30e19df4e
I0723 09:18:14.793129    2781 daemon.go:1014] Validating against pending config rendered-worker-deadlock-8489d24bd0e8a36d0f727ef2c3b46b1a
I0723 09:18:14.793210    2781 daemon.go:1025] Skipping on-disk validation; /run/machine-config-daemon-force present
I0723 09:18:14.793225    2781 daemon.go:1030] Validated on-disk state
I0723 09:18:14.804833    2781 daemon.go:1064] Completing pending config rendered-worker-deadlock-8489d24bd0e8a36d0f727ef2c3b46b1a
I0723 09:18:14.817299    2781 update.go:1404] completed update for config rendered-worker-deadlock-8489d24bd0e8a36d0f727ef2c3b46b1a
I0723 09:18:14.827305    2781 daemon.go:1080] In desired config rendered-worker-deadlock-8489d24bd0e8a36d0f727ef2c3b46b1a
[root@openshift-jumpserver-0 ~]# 
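
The pod name machine-config-daemon-7bh6k above is specific to this cluster. As a sketch, the daemon pod running on a given node can be looked up by label and node name, and the log can be checked for the skip message shown above. Both function names are hypothetical; the label k8s-app=machine-config-daemon is assumed to match the daemon set in the cluster at hand.

```shell
# Hypothetical helper: print the machine-config-daemon pod (as pod/<name>)
# that runs on the node given as $1.
mcd_pod_on_node() {
  oc get pods -n openshift-machine-config-operator \
    -l k8s-app=machine-config-daemon \
    --field-selector "spec.nodeName=$1" -o name
}

# Hypothetical helper: read daemon log lines on stdin and succeed only if
# the "skipping validation" message from the log above is present.
validation_skipped() {
  grep -qF 'Skipping on-disk validation; /run/machine-config-daemon-force present'
}
```

A possible use is: oc logs -n openshift-machine-config-operator "$(mcd_pod_on_node openshift-worker-2)" -c machine-config-daemon | validation_skipped && echo "validation was skipped"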

After a couple of minutes, the Machine Config Operator notices that the node was finally able to apply the desired MachineConfig. The MachineConfigPool then reports UPDATED as True:

[root@openshift-jumpserver-0 ~]# oc get machineconfigpool
NAME              CONFIG                                                      UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
kata-oc           rendered-kata-oc-83c7461363c2f2cf6ad3ae809f2f28dd           True      False      False      1              1                   1                     0                      12h
master            rendered-master-9498852ddfc929de9a98bb56cea862f8            True      False      False      3              3                   3                     0                      2d16h
worker            rendered-worker-188a29e5b3089268c8aad7d30e19df4e            True      False      False      1              1                   1                     0                      2d16h
worker-deadlock   rendered-worker-deadlock-8489d24bd0e8a36d0f727ef2c3b46b1a   True      False      False      1              1                   1                     0                      27m
[root@openshift-jumpserver-0 ~]#
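
Instead of polling oc get machineconfigpool by hand, the wait can be scripted. This is a sketch under the assumption that the pool exposes the Updated condition, as the UPDATED column above suggests; wait_for_pool is a hypothetical helper name and the 10 minute default is an arbitrary choice.

```shell
# Hypothetical helper: block until the MachineConfigPool given as $1
# reports the Updated condition, or until the timeout ($2, default 10m)
# expires. `oc wait` exits non-zero on timeout.
wait_for_pool() {
  oc wait "machineconfigpool/$1" --for=condition=Updated --timeout="${2:-10m}"
}
```

For the pool in this example, the call would be wait_for_pool worker-deadlock.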

Once validation has passed, the underlying issue in the machine config definitions can be corrected manually.

As a last step, if the file is still present, remove it:

[root@openshift-worker-2 ~]# rm -f /run/machine-config-daemon-force
[root@openshift-worker-2 ~]# 

Root Cause

The Machine Config Daemon (MCD) validation can be skipped by creating the /run/machine-config-daemon-force file and removing the /etc/machine-config-daemon/currentconfig file on the node.

Diagnostic Steps

The configuration applied on the node is stored in the following file:

# ls /etc/machine-config-daemon/currentconfig

It contains the current config that the Machine Config Daemon will try to apply on the node.
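
The daemon log in the Resolution section mentions that the on-disk currentConfig can override the node annotation, so it may help to compare the two. The sketch below reads the machineconfiguration.openshift.io annotations from the node object; node_mc_annotation is a hypothetical helper name, and the annotation keys currentConfig, desiredConfig and state are assumed to be present on the node.

```shell
# Hypothetical helper: print one machineconfiguration.openshift.io
# annotation of a node.
# $1 = node name, $2 = annotation suffix, for example currentConfig,
# desiredConfig or state (assumed keys).
node_mc_annotation() {
  oc get node "$1" \
    -o "jsonpath={.metadata.annotations.machineconfiguration\.openshift\.io/$2}"
}
```

For example, node_mc_annotation openshift-worker-2 currentConfig and node_mc_annotation openshift-worker-2 desiredConfig should print the same rendered config name once the node has finished updating.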


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.