Cluster Updates Without Error but Machine Config Pools Degraded with `Marking Degraded due to: unexpected on-disk state` on OCP 4.6 and newer

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4.6+

Issue

  • After performing an update to a newer version of OpenShift Container Platform, not all nodes are upgraded. For example:

    $ oc get node
    NAME                      STATUS                      ROLES   AGE  VERSION
    master-0.ocp.example.net  Ready                       master  34d  v1.17.1+9d33dd3
    master-1.ocp.example.net  Ready                       master  34d  v1.17.1+9d33dd3
    master-2.ocp.example.net  Ready                       master  34d  v1.17.1+9d33dd3
    worker-0.ocp.example.net  Ready                       worker  34d  v1.17.1+9d33dd3
    worker-1.ocp.example.net  Ready                       worker  34d  v1.17.1+9d33dd3
    worker-2.ocp.example.net  Ready,SchedulingDisabled    worker  34d  v1.17.1+912792b         <----------
    
  • After performing an update to a newer version of OpenShift Container Platform, the MachineConfigOperator is reporting degraded pools:

    $ oc describe co/machine-config
    ...
    'Failed to resync $VERSION because: error during syncRequiredMachineConfigPools:
          [timed out waiting for the condition, error pool $POOL is not ready, retrying.
          Status: (pool degraded: true total: x, ready y, updated: y, unavailable: 1)]'
    
  • A machine config pool is degraded, and the machine-config ClusterOperator status extension shows an error similar to:

    worker: 'pool is degraded because nodes fail with "1 nodes are reporting degraded
          status on sync": "Node worker0 is reporting: \"unexpected
          on-disk state validating against rendered-worker-abc:
          expected target osImageURL \\\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:xxx\\\",
          have \\\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:yyy\\\"
          (\\\"zzz\\\")\""'
    
  • The machine-config-daemon pod logs show:

    Marking Degraded due to: unexpected on-disk state validating against rendered-master-7eee653a0a756d9bb2eb74f2ea00b91e: content mismatch for file "/usr/local/bin/configure-ovs.sh"
    

Resolution

Important: This solution is for OCP 4.6 and newer releases, and has now been updated for 4.12+. Previous revisions of this solution referenced using the deprecated pivot command. For OCP 4.5 and older, please check KCS 4466631 instead.

Before applying the workaround, please collect the logs from the affected node to assist in finding the root cause:

$ oc debug node/[node_name]
[...]
sh-4.4# chroot /host bash

[core@node_name ~]# journalctl -b -1 -u ostree-finalize-staged.service
[core@node_name ~]# journalctl -b -1 -u rpm-ostreed.service

Collecting a must-gather would also be helpful. In newer versions of the must-gather tool, the logs for the above services are collected automatically.

You can also collect a sosreport from the failing node or nodes, which would contain the above service logs.

These logs should tell you why the attempted OS upgrade did not succeed. Because the error varies from version to version, please check the Root Cause below for details.
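
The manual log collection above can also be run non-interactively. The following is a minimal sketch, assuming `oc debug` is allowed to start a debug pod on the node; the function name `collect_os_update_logs` and the output file layout are hypothetical and not part of any Red Hat tooling.

```shell
#!/bin/sh
# Sketch only: save the previous-boot journal of the two services named
# above for one node. Function name and file naming are assumptions.
collect_os_update_logs() {
    node="$1"
    outdir="${2:-.}"
    mkdir -p "$outdir" || return 1
    for unit in ostree-finalize-staged.service rpm-ostreed.service; do
        # -b -1 selects the boot before the current one, as in the manual steps.
        oc debug "node/${node}" -- chroot /host \
            journalctl -b -1 -u "$unit" --no-pager \
            > "${outdir}/${node}-${unit}.log"
    done
}

# Example (not run automatically):
#   collect_os_update_logs worker-2.ocp.example.net ./mcd-logs
```

Each unit is written to its own file so that the two logs can be attached to a support case separately.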

Workaround 1

Note: for ARO or OCP clusters installed on Azure, please refer to KCS 6522771 before applying the workaround.

If we determine the issue to be transient, we can retry the OS image update by performing the following steps on the node:

  1. Access the failing node:
$ oc debug node/[node_name]
sh-4.4# chroot /host
  2. Delete the currentConfig file on disk:
sh-4.4# rm /etc/machine-config-daemon/currentconfig
  3. Tell the MCD to forcefully retry the update and ignore the current validation error:
sh-4.4# touch /run/machine-config-daemon-force

The MCD should now retry the update, and the node should reboot. If it does not, check the machine-config-daemon logs to see what went wrong. It can also help to start following the MCD logs in another console before touching the force file, so that any new failure is captured as it happens:

$ oc logs -n openshift-machine-config-operator -c machine-config-daemon machine-config-daemon-xxxxx -f
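
The pod name in the command above (machine-config-daemon-xxxxx) differs per node. As a rough sketch, it can be looked up from the node name with a field selector on `spec.nodeName`; the helper names `mcd_pod_for_node` and `follow_mcd_logs` are hypothetical, and the label `k8s-app=machine-config-daemon` is the one used in the Diagnostic Steps.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.
MCO_NS=openshift-machine-config-operator

# Print the name of the machine-config-daemon pod scheduled on node $1.
mcd_pod_for_node() {
    oc get pod -n "$MCO_NS" -l k8s-app=machine-config-daemon \
        --field-selector "spec.nodeName=$1" \
        -o jsonpath='{.items[0].metadata.name}'
}

# Follow the machine-config-daemon container logs of that pod.
follow_mcd_logs() {
    pod="$(mcd_pod_for_node "$1")" || return 1
    [ -n "$pod" ] || { echo "no MCD pod found on $1" >&2; return 1; }
    oc logs -n "$MCO_NS" -c machine-config-daemon "$pod" -f
}

# Example (not run automatically):
#   follow_mcd_logs worker-2.ocp.example.net
```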

Workaround 2

If Workaround 1 did not work, the machine-config-daemon Pod may be getting killed by its liveness probe before the forced update can complete. In that case, oc rsh into the machine-config-daemon Pod on the problematic node and run the following commands so that the liveness probe succeeds and the Pod is not killed. Then run Workaround 1 again.

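# Write a small responder that answers every request with an empty HTTP 200.
cat <<'EOF' > http.sh
#!/bin/sh
printf "HTTP/1.1 200 OK\r\n"
printf "Content-Length: 0\r\n"
printf "\r\n"
EOF
chmod +x http.sh
# Serve it on port 8798 so that the liveness probe succeeds.
socat TCP-LISTEN:8798,fork,reuseaddr EXEC:'./http.sh'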

Workaround 3

It is also possible to recreate the files manually; see this KCS for details.

Root Cause

The Machine Config Operator component in charge of managing each individual node is the Machine Config Daemon (MCD), which runs as a daemonset in the openshift-machine-config-operator namespace.

If the system state differs in any way from what the MCD expects, the MCD sets the MachineConfigPool as Degraded and also reflects that in the machineconfiguration.openshift.io/state node annotation. This error then bubbles up to the Machine-Config-Operator ClusterOperator status. To prevent any further breakage, the MCD then stops taking any action until the current issue is fixed. The reason for the degradation is usually explained in the machine-config-daemon container logs.
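
The node annotation mentioned above can be read directly with a JSONPath query. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that the sibling annotations `currentConfig`, `desiredConfig` and `reason` under the same `machineconfiguration.openshift.io/` prefix are also set on the node.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Print one machineconfiguration.openshift.io/<suffix> annotation.
# $1 = node name, $2 = annotation suffix (for example: state)
mcd_node_annotation() {
    # The dots in the annotation key are escaped so that JSONPath treats
    # the whole key as a single map entry.
    oc get node "$1" \
        -o "jsonpath={.metadata.annotations.machineconfiguration\.openshift\.io/$2}"
}

# Print the annotations that are most useful when a pool is degraded.
# currentConfig, desiredConfig and reason are assumed to exist; an unset
# annotation is simply printed as an empty value.
mcd_node_summary() {
    for key in state currentConfig desiredConfig reason; do
        printf '%s: %s\n' "$key" "$(mcd_node_annotation "$1" "$key")"
    done
}

# Example (not run automatically):
#   mcd_node_summary worker-2.ocp.example.net
```

A node whose state is Degraded and whose currentConfig differs from desiredConfig is the one holding the pool back.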

One configuration difference that can cause this validation failure is the node running an incorrect ostree image. The machine-config-daemon detects this and reports it as described in the Diagnostic Steps. This is unusual and normally happens only when a bug or an abnormal situation is hit, or when manual changes are made on the nodes.

The most common case is that a cluster encounters this during an upgrade. When a Machine-Config-Daemon upgrades the OS, it does not do so directly; instead, it stages the incoming OS update via rpm-ostree. After the Machine-Config-Daemon initiates a reboot, rpm-ostree attempts to perform the actual OS update. The update is transactional, meaning that if it fails, the node simply stays on the old OS version. By this point, however, the Machine-Config-Daemon has already shut down, so it does not learn of the failure until the node has rebooted and the Machine-Config-Daemon runs again, attempts to validate the image, and fails. The Machine-Config-Daemon does not know why the failure occurred, which is why the rpm-ostreed and ostree-finalize-staged logs are needed.
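
To see which deployment a node actually booted after such a failed finalization, `rpm-ostree status` can be run on the host. The following is a minimal sketch with hypothetical helper names; the text match is deliberately crude and does not distinguish a booted deployment from a staged or rollback one.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Print the ostree deployments known to rpm-ostree on node $1.
node_ostree_status() {
    oc debug "node/$1" -- chroot /host rpm-ostree status
}

# Read status text on stdin and succeed if the image reference or digest
# given as $1 appears anywhere in it (fixed-string match, any deployment).
status_mentions_image() {
    grep -qF -- "$1"
}

# Example (not run automatically), using the digest from the MCD error:
#   node_ostree_status worker-2.ocp.example.net | status_mentions_image sha256:328a1e57...
```

If the expected digest is absent from the output, the staged update was most likely discarded, and the journal logs of the two services should explain why.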

The Resolution section of this solution explains how to force the node upgrade again, which can unblock the upgrade if the underlying issue was transient. If the issue persists, find its root cause in the journal logs and attempt remediation based on what they show.

Diagnostic Steps

  1. Check to see if any MachineConfigPools are degraded:
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-c58240ee462345a9375360cd7a78443d   True      False      False      3              3                   3                     0                      34d
worker   rendered-worker-49131c85db4e2ee452b4eaafcc566ca9   False     True       True       3              2                   2                     1                      34d
  2. If there are, check the logs of the machine-config-daemon pods for instances of incorrect osImageURL states:
$ oc project openshift-machine-config-operator
$ MCD_PODS=$(oc get pod -o=jsonpath='{.items[*].metadata.name}' -l k8s-app=machine-config-daemon)
$ for POD in $MCD_PODS; do echo ----; echo Checking for osImageURL mismatch; oc get pod "$POD" -o wide; oc logs "$POD" -c machine-config-daemon | grep "expected target osImageURL"; done

Output should look like the following:

----
Checking for osImageURL mismatch
NAME                          READY   STATUS    RESTARTS   AGE     IP              NODE                                        NOMINATED NODE   READINESS GATES
machine-config-daemon-5p2zv   2/2     Running   1          5d20h   10.74.178.208   worker-0.ocp.example.net   <none>           <none>
----
Checking for osImageURL mismatch
NAME                          READY   STATUS    RESTARTS   AGE     IP              NODE                                        NOMINATED NODE   READINESS GATES
machine-config-daemon-h7wdf   2/2     Running   1          5d20h   10.74.178.190   worker-2.ocp.example.net   <none>           <none>
E0706 11:25:56.146529    2408 daemon.go:1186] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:328a1e57fe5281f4faa300167cdf63cfca1f28a9582aea8d6804e45f4c0522a8
E0706 11:25:58.203841    2408 daemon.go:1186] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:328a1e57fe5281f4faa300167cdf63cfca1f28a9582aea8d6804e45f4c0522a8
E0706 11:26:06.244588    2408 daemon.go:1186] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:328a1e57fe5281f4faa300167cdf63cfca1f28a9582aea8d6804e45f4c0522a8
E0706 11:26:22.294623    2408 daemon.go:1186] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:328a1e57fe5281f4faa300167cdf63cfca1f28a9582aea8d6804e45f4c0522a8
E0706 11:26:54.331769    2408 daemon.go:1186] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:328a1e57fe5281f4faa300167cdf63cfca1f28a9582aea8d6804e45f4c0522a8
E0706 11:27:54.374431    2408 daemon.go:1186] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:328a1e57fe5281f4faa300167cdf63cfca1f28a9582aea8d6804e45f4c0522a8
E0706 11:28:54.416390    2408 daemon.go:1186] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:328a1e57fe5281f4faa300167cdf63cfca1f28a9582aea8d6804e45f4c0522a8
----
Checking for osImageURL mismatch
NAME                          READY   STATUS    RESTARTS   AGE     IP              NODE                                        NOMINATED NODE   READINESS GATES
machine-config-daemon-p7b6s   2/2     Running   1          5d20h   10.74.178.201   master-1.ocp.example.net   <none>           <none>
----
Checking for osImageURL mismatch
NAME                          READY   STATUS    RESTARTS   AGE     IP              NODE                                        NOMINATED NODE   READINESS GATES
machine-config-daemon-qrmvx   2/2     Running   1          5d20h   10.74.178.192   master-0.ocp.example.net   <none>           <none>
----
Checking for osImageURL mismatch
NAME                          READY   STATUS    RESTARTS   AGE     IP              NODE                                        NOMINATED NODE   READINESS GATES
machine-config-daemon-rhjn5   2/2     Running   1          5d20h   10.74.178.214   worker-1.ocp.example.net   <none>           <none>
----
Checking for osImageURL mismatch
NAME                          READY   STATUS    RESTARTS   AGE     IP              NODE                                        NOMINATED NODE   READINESS GATES
machine-config-daemon-spvkf   2/2     Running   1          5d20h   10.74.178.145   master-2.ocp.example.net   <none>           <none>

In this example, the machine-config-daemon on worker-2.ocp.example.net reports that the node should be running osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:328a1e57fe5281f4faa300167cdf63cfca1f28a9582aea8d6804e45f4c0522a8, but it is running a different image.
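
On clusters with many pools, the first diagnostic step can be narrowed to the degraded pools only. The filter below is a sketch that reads the `oc get mcp` table on stdin; the function name `degraded_pools` is hypothetical, and it assumes that every column before DEGRADED holds a value, since whitespace splitting would shift the fields otherwise.

```shell
#!/bin/sh
# Sketch only: print the names of pools whose DEGRADED column is True.
# The column is located by its header name rather than by position.
degraded_pools() {
    awk '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "DEGRADED") col = i
            next
        }
        col && $col == "True" { print $1 }
    '
}

# Example (not run automatically):
#   oc get mcp | degraded_pools
```

The resulting pool names can then be used to limit the log check in step 2 to the nodes of those pools.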

