While testing kdump on an RHCOS node, the node drops into the emergency shell on reboot

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP) 4
  • Red Hat Enterprise Linux CoreOS (RHCOS)

Issue

  • After crashing the node manually to test kdump, the node fails to boot.
  • The node fails to boot with the error "Couldn't find specified OSTree root".

Resolution

To rescue a node that already fails to boot, follow this solution:

RHCOS nodes are failing to boot with an error "Couldn't find specified OSTree root" in RHOCP 4

Before crashing the node manually to test kdump, perform the following steps:

  1. Check the current deployments and confirm that only one deployment exists:

    rpm-ostree status -vvv
    
  2. If an older deployment is still present, clean it up manually:

    rpm-ostree cleanup -r
    

    If the command reports "Deployments unchanged", there are no deployments to remove.

  3. If the above command fails because a transaction is already in progress, wait for that transaction to complete. The PID of the process can be found with the command below:

    ps aux | grep rpm-ostree
    

Root Cause

  • After booting into a new deployment (boot.0), the machine-config-daemon pod runs rpm-ostree cleanup -r to remove the old rollback deployment (boot.1). If the node is crashed or rebooted while this cleanup is in progress, the bootloader configuration may not be updated correctly. As a result, GRUB may try to boot from the now-missing boot.1, causing the boot failure.

Diagnostic Steps

  • Check whether the node fails to boot because of an ostree-prepare service failure.
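From the emergency shell, one way to relate this check to the root cause is to compare the deployment path the kernel was asked to boot with what actually exists on disk. The sketch below rests on assumptions: the helper name `ostree_arg` is hypothetical, the unit is assumed to be named ostree-prepare-root.service, and the root filesystem is assumed to be mounted at /sysroot. Verify each on the affected node.

```shell
# Print the value of the ostree= kernel argument from a command line read on stdin.
ostree_arg() {
    tr ' ' '\n' | sed -n 's/^ostree=//p'
}

# Example use from the emergency shell (shown as comments, not run here):
#   ostree_arg < /proc/cmdline                    # deployment path the kernel was asked to boot
#   ls -l /sysroot/ostree/                        # assumed mount point; shows which boot.N links exist
#   journalctl -b -u ostree-prepare-root.service  # assumed unit name for the ostree-prepare step
```

If the ostree= value refers to boot.1 while only boot.0 is present, that matches the interrupted cleanup described under Root Cause.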
