How to automatically gather a coredump right after an OpenShift PLEG issue
Environment
- OpenShift Container Platform 3.11
- Docker 1.13
Issue
PLEG health errors occur randomly and intermittently. By the time a dockerd or containerd coredump is gathered, the issue is already gone, so the dump does not contain useful information.
Resolution
Download the watchForNotReady.sh script shown in the Diagnostic Steps section.
Then copy it to the affected nodes and leave it running until the issue reoccurs. Once it does, the following files will be generated in the current directory (XXXX is a timestamp):
- dockerd_core-XXXX.gz: dockerd coredump (gzipped)
- containerd_core-XXXX.gz: docker-containerd coredump (gzipped)
- atomic-openshift-node-core-XXXX.gz: atomic-openshift-node coredump (gzipped)
- docker-openshift-stuff-XXXX.tar.gz: An archive file with the following contents:
  - lsof-dockerd-current-XXXX.txt: Output of lsof for the dockerd process
  - lsof-docker-containerd-current-XXXX.txt: Output of lsof for the docker-containerd process
  - rpm_-q_docker: Output of rpm -q docker
  - ps_auxwww-XXXX.txt: Output of ps auxwww
  - ps_axo_flags_state_uid_pid_ppid_pgid_sid_cls_pri_addr_sz_wchan_lstart_tty_time_cmd-XXXX.txt: Output of ps axo flags,state,uid,pid,ppid,pgid,sid,cls,pri,addr,sz,wchan,lstart,tty,time,cmd
  - $(hostname)-XXXX.log: Journal contents. Among other things, it should contain the containerd goroutine dump.
  - goroutine-stacks-XXXX.log: Goroutine stack dump from dockerd
  - daemon-data-XXXX.log: Daemon datastructure dump from dockerd
  - containers-info: Folder with the docker info output, the docker ps -a output, and the outputs of docker logs -t ${container} and docker inspect ${container} for every container
Please wait until all the files for a single occurrence are generated. Note that:
- Core dumps (and other outputs) may not be generated immediately if any of the processes is in uninterruptible sleep (D state).
- While non-docker commands are executed during the issue (as long as the previous point does not apply), docker commands may block until the issue occurrence is over. This is the reason they are left to the end.
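Because docker CLI calls can hang for the duration of the issue, a variant of the script could bound them with a timeout instead of blocking. The following is only a sketch, not part of the original script: the collect helper and the 30-second limit are assumptions.

```shell
# Hypothetical helper (not in watchForNotReady.sh): run a command with a
# bounded runtime so a hung docker daemon cannot stall the whole collection.
# `timeout` comes from GNU coreutils and exits with 124 when the limit is hit.
collect() {
  local out=$1; shift
  if ! timeout 30 "$@" > "$out" 2>&1; then
    echo "WARN: '$*' failed or timed out; partial output kept in $out" >&2
  fi
}

# Example usage with the same commands the script runs:
collect docker_info docker info
collect docker_ps_-a docker ps -a
```

The trade-off is that a timed-out `docker ps -a` may leave an empty output file, whereas the original script's blocking calls eventually return complete data once the issue is over.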
Once those files are generated, please attach them to the support case as individual attachments. It is important to attach them this way (especially in the case of the dumps).
Please ensure that the script runs in a folder with plenty of disk space, as coredumps may take a lot of disk space (at least until gzipped) and this script does not currently perform any kind of rotation.
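A rough pre-flight check can help estimate whether the current folder has enough room before the watcher is started. This sketch is not part of the original script, and the twice-RSS estimate is an assumption: actual gcore output follows the processes' mapped address space and can be larger.

```shell
# Assumption-based pre-flight check: sum twice the resident memory of the
# daemons that will be core-dumped and compare against free space in the
# current directory. Treat the estimate as a lower bound.
required_kb=0
for pid in $(pidof dockerd-current docker-containerd-current); do
  rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
  required_kb=$((required_kb + 2 * ${rss_kb:-0}))
done
free_kb=$(df --output=avail -k . | tail -1)
echo "estimated need: ~${required_kb} kB, available: ${free_kb} kB"
[ "$free_kb" -gt "$required_kb" ] || echo "WARNING: consider a larger filesystem"
```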
Root Cause
Sometimes, PLEG issues occur and end so quickly that there is no time to gather all the relevant information.
This script tails the system journal and, whenever a line matching a certain FILTER indicating a PLEG issue is read, gathers diagnostic information (including dockerd, containerd and atomic-openshift-node coredumps). This information may be very valuable if collaboration from container experts is needed to troubleshoot docker issues in OpenShift.
Diagnostic Steps
The watchForNotReady.sh script is the following:
#!/bin/bash
FILTER='Reason:KubeletNotReady'
echo 'Checking required binaries'
which gcore lsof tar > /dev/null
if [[ $? -ne 0 ]]
then
echo 'Missing required dependency'
exit 1
fi
echo 'Starting to watch'
echo
while read -r line
do
linefilter=$(echo "$line" | grep "$FILTER")
if [[ -n $linefilter ]]; then
echo "Detected not ready event:"
echo "$line"
DOCKER_OPENSHIFT_STUFF_FOLDER="docker-openshift-stuff-$(date +'%Y-%m-%d-%H-%M-%S')"
echo "Creating temp folder"
mkdir $DOCKER_OPENSHIFT_STUFF_FOLDER
DOCKERD_PID=$(pidof dockerd-current)
DOCKER_CONTAINERD_PID=$(pidof docker-containerd-current)
ATOMIC_OPENSHIFT_NODE_PID=$(systemctl -p MainPID show atomic-openshift-node | awk -F= '{print $2}') # A regular pidof is not enough as binary name may change between versions
echo "Getting lsofs from dockerd and containerd"
lsof -p ${DOCKERD_PID} > ./${DOCKER_OPENSHIFT_STUFF_FOLDER}/lsof-dockerd-current-$(date +"%Y-%m-%d-%H-%M-%S").txt
lsof -p ${DOCKER_CONTAINERD_PID} > ./${DOCKER_OPENSHIFT_STUFF_FOLDER}/lsof-docker-containerd-current-$(date +"%Y-%m-%d-%H-%M-%S").txt
echo "Getting ps auxw output"
ps auxwww > ./${DOCKER_OPENSHIFT_STUFF_FOLDER}/ps_auxwww-$(date +"%Y-%m-%d-%H-%M-%S").txt
ps axo flags,state,uid,pid,ppid,pgid,sid,cls,pri,addr,sz,wchan,lstart,tty,time,cmd > ./${DOCKER_OPENSHIFT_STUFF_FOLDER}/ps_axo_flags_state_uid_pid_ppid_pgid_sid_cls_pri_addr_sz_wchan_lstart_tty_time_cmd-$(date +"%Y-%m-%d-%H-%M-%S").txt
echo "Dumping goroutines from dockerd"
kill -SIGUSR1 ${DOCKERD_PID}
echo "Dumping goroutines from containerd"
kill -SIGUSR1 ${DOCKER_CONTAINERD_PID}
echo "Getting journals from latest 2 hours"
journalctl --since "2 hours ago" > ./${DOCKER_OPENSHIFT_STUFF_FOLDER}/$(hostname)-$(date +"%Y-%m-%d-%H-%M-%S").log
echo "Performing core dumps"
DOCKERD_CORE_FILE="./dockerd_core-$(date +"%Y-%m-%d-%H-%M-%S")"
CONTAINERD_CORE_FILE="./containerd_core-$(date +"%Y-%m-%d-%H-%M-%S")"
NODE_CORE_FILE="./atomic-openshift-node-core-$(date +"%Y-%m-%d-%H-%M-%S")"
gcore -o "${DOCKERD_CORE_FILE}" ${DOCKERD_PID}
gcore -o "${CONTAINERD_CORE_FILE}" ${DOCKER_CONTAINERD_PID}
gcore -o "${NODE_CORE_FILE}" ${ATOMIC_OPENSHIFT_NODE_PID}
rpm -q docker > ${DOCKER_OPENSHIFT_STUFF_FOLDER}/rpm_-q_docker
CONTAINERS_STUFF_FOLDER="${DOCKER_OPENSHIFT_STUFF_FOLDER}/containers-info"
mkdir ${CONTAINERS_STUFF_FOLDER}
echo "Getting containers info"
docker ps -a > ./${CONTAINERS_STUFF_FOLDER}/docker_ps_-a
docker info > ./${CONTAINERS_STUFF_FOLDER}/docker_info
for container in $(docker ps -aq)
do
docker inspect $container > ./${CONTAINERS_STUFF_FOLDER}/docker_inspect_${container}
docker logs -t $container &> ./${CONTAINERS_STUFF_FOLDER}/docker_logs_-t_${container}
done
DOCKERD_GOROUTINE_STACKS_PATH=$(journalctl -u docker --no-pager | grep 'goroutine stacks written to' | tail -1 | sed -E 's:^.+"goroutine stacks written to ([^"]+)"$:\1:g')
DOCKERD_DATASTRUCTURE_DUMP_PATH=$(journalctl -u docker --no-pager | grep 'daemon datastructure dump written to' | tail -1 | sed -E 's:^.+"daemon datastructure dump written to ([^"]+)"$:\1:g')
echo "Gathering dockerd goroutines dump from file ${DOCKERD_GOROUTINE_STACKS_PATH}"
cp -v ${DOCKERD_GOROUTINE_STACKS_PATH} ./${DOCKER_OPENSHIFT_STUFF_FOLDER}
echo "Gathering dockerd datastructure dump from file ${DOCKERD_DATASTRUCTURE_DUMP_PATH}"
cp -v ${DOCKERD_DATASTRUCTURE_DUMP_PATH} ./${DOCKER_OPENSHIFT_STUFF_FOLDER}
echo "Compressing files info and removing temp folder"
tar cvzf "${DOCKER_OPENSHIFT_STUFF_FOLDER}.tar.gz" ./${DOCKER_OPENSHIFT_STUFF_FOLDER} && rm -rf ./${DOCKER_OPENSHIFT_STUFF_FOLDER}
gzip ${DOCKERD_CORE_FILE}.* ${CONTAINERD_CORE_FILE}.* ${NODE_CORE_FILE}.*
echo "Done!"
echo 'Please provide all the files in this folder as individual attachments to the support case'
ls
fi
done < <(journalctl -xf -n0)
Note that:
- This script generates 4 files (3 gzipped coredumps and a tar.gz with other information). Please upload them as separate attachments to the support case.
- The FILTER variable is the filter that, when detected, triggers the file gathering. It can be tuned.
- The following programs must be available on $PATH: gcore (from the gdb package), lsof and tar. Otherwise, the script will not work.
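When tuning FILTER, it helps to dry-run the pattern before leaving the watcher running. The sketch below is illustrative only: the sample line is invented for the test, and 'PLEG is not healthy' is mentioned merely as a possible alternative pattern that kubelet logs during these events.

```shell
FILTER='Reason:KubeletNotReady'
# Illustrative journal line (not a captured one) to sanity-check the pattern:
sample='atomic-openshift-node: Node became not ready: Reason:KubeletNotReady Message:PLEG is not healthy'
matches=$(printf '%s\n' "$sample" | grep -c "$FILTER")
echo "FILTER matched ${matches} line(s)"
# Against the live journal, something like the following could be used:
#   journalctl --since "24 hours ago" --no-pager | grep -c "$FILTER"
```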
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.