Data gathering when oc adm must-gather fails


The oc adm must-gather CLI command collects the information from an OpenShift cluster that is most likely needed for debugging issues. If that command fails, it is still possible to collect some debugging information manually.

Note: sometimes the message error running backup collection: errors occurred while gathering data is shown while collecting a must-gather. In most cases, the must-gather is collected anyway. Refer to must-gather fails with error running backup collection: errors occurred while gathering data for additional information about that message.

Missing packages in Linux or Windows machines

Some errors are caused by missing required packages (especially when running the oc adm must-gather command from a Windows machine). Refer to error generating must-gather: No available strategies to copy for some of the possible errors.
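A quick pre-flight check from the workstation can confirm the client-side tools are present before starting a collection. The list of binaries below (tar, rsync) is an assumption based on the usual copy strategies oc falls back to; the exact requirements may vary by oc version:

```shell
#!/bin/sh
# Hypothetical pre-flight check: verify the client tools a must-gather
# collection typically relies on are available in PATH.
for tool in oc tar rsync; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK: $tool found at $(command -v "$tool")"
  else
    echo "MISSING: $tool is not in PATH"
  fi
done
```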

Data gathering when the authentication module is down but SSH is available

If it is possible to SSH to one of the control plane nodes, connect and run the following commands to generate and compress the must-gather data (refer to about the OpenShift 4 kubeconfig file for system:admin for other kubeconfig files available on control plane nodes):

# export KUBECONFIG=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/lb-int.kubeconfig 
# cd /tmp
# oc adm must-gather
# ls -ltrh
[...]
-rw-r--r--. 1 root root 109125404 Sep 23 20:53 must-gather.local.XXXXXXXXX
[...]
# tar caf must-gather.local.XXXXXXXXX.tar.xz must-gather.local.XXXXXXXXX

It is possible to run any type of must-gather from this shell. Once the must-gather completes, use scp to copy the compressed archive from the control plane node to a machine with upload access to the Red Hat Customer Portal:

$ scp core@<nodename>:/tmp/must-gather.local.XXXXXXXXX.tar.xz /tmp/must-gather.local.XXXXXXXXX.tar.xz

After copying the archive from the control plane node, delete the generated files from the node:

# rm -rf /tmp/must-gather.local.XXXXXXXXX /tmp/must-gather.local.XXXXXXXXX.tar.xz

Data gathering when the cluster API is running but the oc adm must-gather command fails

Note: some errors shown during must-gather collection do not prevent most of the information from being collected, so in many cases it is enough to compress the directory generated by oc adm must-gather and attach it to the support case.

In the event the API is up but running a must-gather is impossible (for example, because no new pod can be scheduled), almost the same information can be generated as follows.

To gather the same cluster resources:

$ oc adm inspect $(oc get co -o name) clusterversion/version ns/openshift-cluster-version $(oc get node -o name) ns/default ns/openshift ns/kube-system ns/openshift-etcd
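oc adm inspect writes its output to one or more inspect.local.<timestamp> directories in the current working directory. These can be compressed into a single archive before attaching them to the case; the archive name below is illustrative:

```shell
#!/bin/sh
# Compress any inspect.local.* directories produced by `oc adm inspect`
# into a single archive for upload (does nothing if none exist).
if ls -d inspect.local.* >/dev/null 2>&1; then
  tar caf inspect.local.tar.xz inspect.local.*
  echo "Created inspect.local.tar.xz"
fi
```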

To gather etcd status:

$ export ETCD_POD_NAME=$(oc get pods -n openshift-etcd -l app=etcd --field-selector="status.phase==Running" -o jsonpath="{.items[0].metadata.name}")

$ oc exec -n openshift-etcd -c etcd ${ETCD_POD_NAME} -- etcdctl member list -w table > etcd_member_list.txt
$ oc exec -n openshift-etcd -c etcd ${ETCD_POD_NAME} -- etcdctl endpoint health --cluster > etcd_endpoint_health.txt
$ oc exec -n openshift-etcd -c etcd ${ETCD_POD_NAME} -- etcdctl endpoint status --cluster -w table > etcd_endpoint_status.txt
$ oc exec -n openshift-etcd -c etcd ${ETCD_POD_NAME} -- etcdctl endpoint status --cluster -w json | jq '.[] | ((.Status.dbSize - .Status.dbSizeInUse)/.Status.dbSize)*100' > etcd_usage.txt
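The last jq filter computes the percentage of the etcd database file that is allocated but no longer in use (that is, fragmentation). As a standalone illustration of the arithmetic, using a sample status document with made-up sizes (not real cluster output):

```shell
#!/bin/sh
# Illustrative only: a sample status document with made-up sizes.
# A 100 MiB db file with 50 MiB in use is 50% fragmented.
cat <<'EOF' > sample_status.json
[{"Status":{"dbSize":104857600,"dbSizeInUse":52428800}}]
EOF
if command -v jq >/dev/null 2>&1; then
  jq '.[] | ((.Status.dbSize - .Status.dbSizeInUse)/.Status.dbSize)*100' sample_status.json
fi
```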

To gather the same service logs:

$ for service in kubelet crio machine-config-daemon-firstboot machine-config-daemon-host; do oc adm node-logs --role=master -l kubernetes.io/os=linux -u="${service}" > "${service}_service.log"; done

To gather the audit logs:

$ cat <<"EOF" >gather_audit_logs
#!/bin/bash
paths=(openshift-apiserver kube-apiserver)
for path in "${paths[@]}" ; do
  oc adm node-logs --role=master --path="$path" | \
  tee "$path.audit_logs_listing" | \
  grep -v ".terminating" | \
  sed "s|^|$path . |"
done  | \
xargs --max-args=4 --max-procs=45 bash -c \
  'echo "INFO: Started  downloading $1/$4 from $3";
  oc adm node-logs $3 --path=$1/$4 | gzip > $2/$3-$4.gz;
  echo "INFO: Finished downloading $1/$4 from $3"' \
  bash
echo "INFO: Audit logs collected."
EOF
$ chmod +x gather_audit_logs 
$ ./gather_audit_logs
(...)
INFO: Audit logs collected.

However, note that even if the API is reachable, some or all of these commands may fail. If that happens, try the steps in the sections below.

Data gathering when API is down

In the event the API is down and it is not possible to execute a must-gather, there are not many ways to collect information in an automated manner. This procedure does not replace a must-gather and should only be used when collecting a must-gather is not possible and the steps in the section above fail as well.

Run the following from a bastion node that can SSH directly to all the nodes in the cluster. When finished, attach the diag-output.tar.gz file to the case.

  1. Include all IP addresses of the nodes in the cluster in the HOSTS variable, separated by spaces.
  2. Set PRIVATE_KEY to the path of the private key associated with the public key provisioned when the cluster was installed.
  3. Set CLUSTER_DOMAIN_NAME to the cluster domain name. For example, in console.apps.example.com, example.com is the cluster domain name.

Before running the script, export all the variables it uses:

$ export HOSTS="192.168.100.51 192.168.100.52 ...."
$ export PRIVATE_KEY=~/id_rsa
$ export CLUSTER_DOMAIN_NAME=example.com

Then run the following script:

$ mkdir diag-output

$ for host in $HOSTS;do
ssh core@$host -i $PRIVATE_KEY sudo journalctl --since=-1d --no-pager &> diag-output/$host.journal.log
ssh core@$host -i $PRIVATE_KEY sudo tar chz /var/log/containers/* > diag-output/$host.containerlogs.tar.gz
ssh core@$host -i $PRIVATE_KEY sudo tar chz /etc/kubernetes/* > diag-output/$host.etckubernetes.tar.gz
ssh core@$host -i $PRIVATE_KEY "sudo crictl ps;sudo crictl pods; sudo ps aux;systemctl status;systemctl status --all --no-pager;" &> diag-output/$host.systeminfo
ssh core@$host -i $PRIVATE_KEY "sudo ip a;sudo netstat -anlp; dig api.$CLUSTER_DOMAIN_NAME;dig api-int.$CLUSTER_DOMAIN_NAME;dig console.apps.$CLUSTER_DOMAIN_NAME;curl -kv https://api.$CLUSTER_DOMAIN_NAME:6443;curl -kv https://api-int.$CLUSTER_DOMAIN_NAME:6443;curl -kv https://console.apps.$CLUSTER_DOMAIN_NAME:443;" &> diag-output/$host.network
done

$ tar cvfz diag-output.tar.gz ./diag-output/

The tarball diag-output.tar.gz will be created and contains the gathered logs. Attach it to the case for analysis.
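Before attaching the archive, its contents can be listed to confirm that each host produced the expected set of files (the listing is skipped if the archive is not present):

```shell
#!/bin/sh
# List the contents of the diagnostics archive, if it exists,
# to confirm one set of logs was collected per host.
if [ -f diag-output.tar.gz ]; then
  tar tzf diag-output.tar.gz
fi
```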

Collecting sosreports

In addition to the data collected above, it could be necessary to also collect a sosreport, especially from the three control plane nodes. Refer to how to provide an sosreport from a RHEL CoreOS OpenShift 4 node for instructions on collecting and providing sosreports to support cases.
