How to automatically gather a coredump right after an OpenShift PLEG issue

Solution Verified - Updated 14 Jun 2024

Environment

OpenShift Container Platform
- 3.11
Docker
- 1.13

Issue

PLEG health errors occur randomly and intermittently.
By the time a dockerd or containerd coredump is gathered, the issue is already gone, so the coredump does not contain useful information.

Resolution

Download the watchForNotReady.sh script shown in Diagnostic Steps.

Then copy it to the affected nodes and leave it running until the issue reoccurs. Once the issue reoccurs, the following files will be generated in the current directory (XXXX is a timestamp):

dockerd_core-XXXX.PID.gz: dockerd coredump (gzipped; gcore appends the PID of the dumped process to the file name)
containerd_core-XXXX.PID.gz: docker-containerd coredump (gzipped)
atomic-openshift-node-core-XXXX.PID.gz: atomic-openshift-node coredump (gzipped)
docker-openshift-stuff-XXXX.tar.gz: An archive file with the following contents:
- lsof-dockerd-current-XXXX.txt: output of lsof of dockerd process
- rpm_-q_docker: Output of rpm -q docker
- lsof-docker-containerd-current-XXXX.txt: output of lsof of docker-containerd process
- ps_auxwww-XXXX.txt: Output of ps auxwww
- ps_axo_flags_state_uid_pid_ppid_pgid_sid_cls_pri_addr_sz_wchan_lstart_tty_time_cmd-XXXX.txt: Output of ps axo flags,state,uid,pid,ppid,pgid,sid,cls,pri,addr,sz,wchan,lstart,tty,time,cmd
- $(hostname)-XXXX.log: Journal contents from the last 2 hours. Among other things, it should contain the containerd goroutine dump.
- goroutine-stacks-XXXX.log: Goroutine stack dump from dockerd
- daemon-data-XXXX.log: Daemon datastructure dump from dockerd
- containers-info: Folder with docker info output, docker ps -a output and all the outputs of docker logs -t ${container} and docker inspect ${container} for every container.

Please wait until all the files for a single occurrence are generated. Note that:

Core dumps (and other outputs) may not be generated immediately if any of the processes is in uninterruptible sleep (D state).
Non-docker commands run while the issue is still occurring (as long as the previous point does not apply), but docker commands may block until the occurrence is over. This is why the script runs them last.
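
As a rough illustration of the first point, the state of a process can be read from /proc before a dump is attempted. The following is only a sketch and is not part of watchForNotReady.sh: the function names proc_state and is_d_state are hypothetical, and it assumes the Linux /proc/PID/status layout.

```shell
# Hypothetical helpers (not part of watchForNotReady.sh).

# Print the one-letter state (R, S, D, ...) of the given PID,
# read from the "State:" line of /proc/PID/status.
# Prints nothing if the process does not exist.
proc_state() {
    awk '/^State:/ {print $2}' "/proc/$1/status" 2>/dev/null
}

# Succeed only if the given PID is in uninterruptible sleep (D state).
is_d_state() {
    [ "$(proc_state "$1")" = "D" ]
}

# Example: report the state of the current shell.
if is_d_state "$$"; then
    echo "PID $$ is in D state; a core dump may be delayed"
else
    echo "PID $$ is not in D state"
fi
```

Inside the script, a check such as is_d_state "${DOCKERD_PID}" before calling gcore could help explain why an output file is taking long to appear.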

Once those files are generated, please attach them to the support case as individual attachments. It is important to attach them this way (especially in the case of the dumps).

Please ensure that the script runs in a folder with plenty of free disk space: coredumps can take up a lot of space (at least until they are gzipped), and this script does not currently perform any kind of rotation.
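
One possible way to check the available space before starting the script is sketched below. It is not part of watchForNotReady.sh: the helper names free_kib and has_free_kib are hypothetical, and the 10 GiB threshold is an arbitrary assumption that should be adapted to the memory footprint of the dumped processes.

```shell
# Hypothetical helpers (not part of watchForNotReady.sh).

# Print the free space, in KiB, of the filesystem holding the given directory.
# "df -P" forces the POSIX one-line-per-filesystem format; column 4 is "Available".
free_kib() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Succeed if the directory has at least the requested number of free KiB.
has_free_kib() {
    [ "$(free_kib "$1")" -ge "$2" ]
}

# Example: require 10 GiB (an assumed threshold) in the current directory.
REQUIRED_KIB=$((10 * 1024 * 1024))
if has_free_kib . "$REQUIRED_KIB"; then
    echo "Enough free space in $(pwd)"
else
    echo "Less than ${REQUIRED_KIB} KiB free in $(pwd)"
fi
```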

Root Cause

Sometimes PLEG issues occur and end so quickly that there is no time to gather all the relevant information.

This script tails the system journal and, whenever a line matching the FILTER string (which indicates a PLEG issue) is read, gathers some information, including dockerd, containerd and atomic-openshift-node coredumps. This information may be very valuable if collaboration from container experts is needed to troubleshoot docker issues in OpenShift.
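
The tail-and-filter idea can be sketched in isolation as follows. This is a simplified illustration and not the script itself: the function name watch_for_pattern is hypothetical, the sample journal lines are made up, and the match is a plain substring test rather than the grep call used by the script.

```shell
# Hypothetical sketch of the tail-and-filter loop (not the script itself).

# Read lines from standard input and print those containing the given string.
# In watchForNotReady.sh the input comes from "journalctl -xf -n0" and a match
# starts the data gathering instead of a simple print.
watch_for_pattern() {
    pattern=$1
    while IFS= read -r line; do
        case $line in
            *"$pattern"*) printf 'Detected: %s\n' "$line" ;;
        esac
    done
}

# Example with made-up journal lines; only the second one contains the filter.
printf '%s\n' \
    'node1 atomic-openshift-node: sync loop ok' \
    'node1 atomic-openshift-node: Reason:KubeletNotReady Message:PLEG is not healthy' \
    | watch_for_pattern 'Reason:KubeletNotReady'
```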

Diagnostic Steps

The watchForNotReady.sh script is the following:

#!/bin/bash

FILTER='Reason:KubeletNotReady'

echo 'Checking required binaries'
which gcore lsof tar > /dev/null
if [[ $? -ne 0 ]]
then
    echo 'Missing required dependency'
    exit 1
fi

echo 'Starting to watch'
echo

while IFS= read -r line
do
    # Quote both variables so that journal lines are not word-split or glob-expanded
    linefilter=$(echo "$line" | grep -- "$FILTER")
    if [[ -n $linefilter ]]; then
        echo "Detected not ready event:"
        echo "$line"
        DOCKER_OPENSHIFT_STUFF_FOLDER="docker-openshift-stuff-$(date +'%Y-%m-%d-%H-%M-%S')"
        echo "Creating temp folder"
        mkdir $DOCKER_OPENSHIFT_STUFF_FOLDER
        DOCKERD_PID=$(pidof dockerd-current)
        DOCKER_CONTAINERD_PID=$(pidof docker-containerd-current)
        ATOMIC_OPENSHIFT_NODE_PID=$(systemctl -p MainPID show atomic-openshift-node | awk -F= '{print $2}') # A regular pidof is not enough as binary name may change between versions
        echo "Getting lsofs from dockerd and containerd"
        lsof -p ${DOCKERD_PID} > ./${DOCKER_OPENSHIFT_STUFF_FOLDER}/lsof-dockerd-current-$(date +"%Y-%m-%d-%H-%M-%S").txt
        lsof -p ${DOCKER_CONTAINERD_PID} > ./${DOCKER_OPENSHIFT_STUFF_FOLDER}/lsof-docker-containerd-current-$(date +"%Y-%m-%d-%H-%M-%S").txt
        echo "Getting ps auxw output"
        ps auxwww > ./${DOCKER_OPENSHIFT_STUFF_FOLDER}/ps_auxwww-$(date +"%Y-%m-%d-%H-%M-%S").txt
        ps axo flags,state,uid,pid,ppid,pgid,sid,cls,pri,addr,sz,wchan,lstart,tty,time,cmd > ./${DOCKER_OPENSHIFT_STUFF_FOLDER}/ps_axo_flags_state_uid_pid_ppid_pgid_sid_cls_pri_addr_sz_wchan_lstart_tty_time_cmd-$(date +"%Y-%m-%d-%H-%M-%S").txt
        echo "Dumping goroutines from dockerd"
        kill -SIGUSR1 ${DOCKERD_PID}
        echo "Dumping goroutines from containerd"
        kill -SIGUSR1 ${DOCKER_CONTAINERD_PID}
        echo "Getting journals from latest 2 hours"
        journalctl --since "2 hours ago" > ./${DOCKER_OPENSHIFT_STUFF_FOLDER}/$(hostname)-$(date +"%Y-%m-%d-%H-%M-%S").log
        echo "Performing core dumps"
        DOCKERD_CORE_FILE="./dockerd_core-$(date +"%Y-%m-%d-%H-%M-%S")"
        CONTAINERD_CORE_FILE="./containerd_core-$(date +"%Y-%m-%d-%H-%M-%S")"
        NODE_CORE_FILE="./atomic-openshift-node-core-$(date +"%Y-%m-%d-%H-%M-%S")"
        gcore -o "${DOCKERD_CORE_FILE}" ${DOCKERD_PID}
        gcore -o "${CONTAINERD_CORE_FILE}" ${DOCKER_CONTAINERD_PID}
        gcore -o "${NODE_CORE_FILE}" ${ATOMIC_OPENSHIFT_NODE_PID}
        rpm -q docker > ${DOCKER_OPENSHIFT_STUFF_FOLDER}/rpm_-q_docker
        CONTAINERS_STUFF_FOLDER="${DOCKER_OPENSHIFT_STUFF_FOLDER}/containers-info"
        mkdir ${CONTAINERS_STUFF_FOLDER}
        echo "Getting containers info"
        docker ps -a > ./${CONTAINERS_STUFF_FOLDER}/docker_ps_-a
        docker info > ./${CONTAINERS_STUFF_FOLDER}/docker_info
        for container in $(docker ps -aq)
        do
            docker inspect $container > ./${CONTAINERS_STUFF_FOLDER}/docker_inspect_${container}
            docker logs -t $container &> ./${CONTAINERS_STUFF_FOLDER}/docker_logs_-t_${container}
        done
        DOCKERD_GOROUTINE_STACKS_PATH=$(journalctl -u docker --no-pager | grep 'goroutine stacks written to' | tail -1 | sed -E 's:^.+"goroutine stacks written to ([^"]+)"$:\1:g')
        DOCKERD_DATASTRUCTURE_DUMP_PATH=$(journalctl -u docker --no-pager | grep 'daemon datastructure dump written to' | tail -1 | sed -E 's:^.+"daemon datastructure dump written to ([^"]+)"$:\1:g')
        echo "Gathering dockerd goroutines dump from file ${DOCKERD_GOROUTINE_STACKS_PATH}"
        cp -v ${DOCKERD_GOROUTINE_STACKS_PATH} ./${DOCKER_OPENSHIFT_STUFF_FOLDER}
        echo "Gathering dockerd datastructure dump from file ${DOCKERD_DATASTRUCTURE_DUMP_PATH}"
        cp -v ${DOCKERD_DATASTRUCTURE_DUMP_PATH} ./${DOCKER_OPENSHIFT_STUFF_FOLDER}
        echo "Compressing files info and removing temp folder"
        tar cvzf "${DOCKER_OPENSHIFT_STUFF_FOLDER}.tar.gz" ./${DOCKER_OPENSHIFT_STUFF_FOLDER} && rm -rf ./${DOCKER_OPENSHIFT_STUFF_FOLDER} 
        gzip ${DOCKERD_CORE_FILE}.* ${CONTAINERD_CORE_FILE}.* ${NODE_CORE_FILE}.*
        echo "Done!"
        echo 'Please provide all the files in this folder as individual attachments to the support case'
        ls
    fi
done < <(journalctl -xf -n0)
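
To see how the script locates the dockerd goroutine dump, the sed expression it uses can be tried on a single log line. The sketch below reuses that expression unchanged. The wrapper name extract_stacks_path is hypothetical, and the sample line (host name, PID, timestamp and path) is made up for illustration, so the real journal entry on a node may look different.

```shell
# Hypothetical helper; the sed expression is the one used by the script.

# Extract the dump path from a dockerd log line of the form
#   ... msg="goroutine stacks written to PATH"
extract_stacks_path() {
    sed -E 's:^.+"goroutine stacks written to ([^"]+)"$:\1:g'
}

# Made-up sample line; the real path and timestamp will differ.
SAMPLE='Jun 14 12:00:00 node1 dockerd-current[1234]: time="2024-06-14T12:00:00Z" level=info msg="goroutine stacks written to /var/run/docker/goroutine-stacks-example.log"'
printf '%s\n' "$SAMPLE" | extract_stacks_path
```

Note that sed passes through unchanged any line that does not match the expression, so the preceding grep is what keeps unrelated journal lines out of the result.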

Note that:

This script generates 4 files (3 gzipped coredumps and a tar.gz with other information). Please upload them as separate attachments to the support case.
The FILTER variable holds the pattern that, when detected in the journal, triggers the file gathering. It can be tuned.
The following programs must be available on $PATH: gcore (from the gdb package), lsof and tar. Otherwise, the script won't work.
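
The script only reports that some dependency is missing. A variant that names each missing program could look like the sketch below. It is an illustration and not part of watchForNotReady.sh: the function name missing_deps is hypothetical.

```shell
# Hypothetical helper (not part of watchForNotReady.sh).

# Print the name of every command given as argument that is not found on $PATH.
missing_deps() {
    for cmd in "$@"; do
        command -v "$cmd" > /dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Example: check the three dependencies listed above.
MISSING=$(missing_deps gcore lsof tar)
if [ -n "$MISSING" ]; then
    echo "Missing required dependencies:" $MISSING
else
    echo "All required dependencies found"
fi
```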

Product(s)

Red Hat OpenShift Container Platform

Category

Troubleshoot
