Troubleshooting OpenShift Container Platform 3.x: Etcd

Updated 10 Feb 2021

Note: Starting with OCP 3.10, if etcd runs as a static pod, you need to run the etcdctl commands from inside the pod.

An example to understand the above check section "If etcd runs as a static pod, run the following commands" in this

For etcd static pod details check this

Getting Started

Determining which etcd data model your cluster is using.

Starting with new installations of OpenShift Container Platform 3.6, the etcd3 v3 data model is the default. Starting with OpenShift Container Platform 3.7, the v3 data model is required.
Checking the RPM version of etcd can be misleading, because a version 3 etcd RPM can still be using the v2 data model with OpenShift.
- OpenShift Container Platform 3.5 or lower: the v2 data model is used.
- OpenShift Container Platform 3.6: check master-config.yaml to determine which data model is used.
  - grep storage-backend -A1 /etc/origin/master/master-config.yaml
- OpenShift Container Platform 3.7 or greater: the v3 data model is used.
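
The 3.6 check can be wrapped in a small helper. This is a minimal sketch, not part of the product tooling: the function name storage_backend is hypothetical, and it assumes the value is the YAML list item on the line directly after the storage-backend key, which is what the grep -A1 command above displays.

```shell
# Hypothetical helper: print the storage-backend value (for example etcd2 or
# etcd3) found in the master-config.yaml file passed as the first argument.
# Assumption: the value is the YAML list item on the line right after the key.
storage_backend() {
    awk '/storage-backend:/ {
        getline                                   # move to the value line
        sub(/^[[:space:]]*-[[:space:]]*/, "")     # drop the leading list dash
        print
        exit
    }' "$1"
}
```

It could be called as storage_backend /etc/origin/master/master-config.yaml. If nothing is printed, the key was not found in the file.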

ETCD storage version 3

Setting etcd variables

From an etcd host we can source the etcd.conf file to set most of the needed variables.

# source /etc/etcd/etcd.conf
# export ETCDCTL_API=3

Set endpoint variable to include all etcd endpoints

# ETCD_ALL_ENDPOINTS=$(etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=fields member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}')
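
To see what the awk step in the command above does, the following sketch applies the same join to text read from standard input. The function name join_client_urls is hypothetical, and it assumes the fields output contains lines of the form "ClientURL" : "https://host:2379". Unlike the command above, this variant also strips the double quotes around each URL, which may be more convenient when the list is reused elsewhere.

```shell
# Hypothetical helper: read 'member list --write-out=fields' output on stdin
# and print the ClientURL values as one comma-separated line.
# Assumption: each URL is the third whitespace-separated field on its line.
join_client_urls() {
    awk '/ClientURL/ {
        gsub(/"/, "", $3)              # remove the surrounding double quotes
        printf "%s%s", sep, $3         # sep is empty for the first URL
        sep = ","
    }
    END { print "" }'
}
```

For example, piping the member list output through join_client_urls yields a value that can be assigned to ETCD_ALL_ENDPOINTS.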

Check health of etcd

Single Host status and health checks.

#  etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=table endpoint status

#  etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=table endpoint health

Cluster status and health checks.

# etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=table  member list

# etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_ALL_ENDPOINTS  --write-out=table endpoint status

# etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_ALL_ENDPOINTS endpoint health

NOTE: In the above 'member list', 'endpoint status', and 'endpoint health' commands you might need to use the 'etcdctl3' command.
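
Because each command above repeats the same certificate flags, a small wrapper can reduce typing errors. This is only a sketch: the function name etcdctl_tls and the variable ETCD_TARGET_ENDPOINTS are hypothetical, and it assumes /etc/etcd/etcd.conf has already been sourced.

```shell
# Hypothetical wrapper: run etcdctl with the v3 API and the TLS flags used in
# the commands above, followed by whatever arguments are passed in.
# ETCD_TARGET_ENDPOINTS is a made-up override; when it is unset or empty, the
# local ETCD_LISTEN_CLIENT_URLS value is used.
etcdctl_tls() {
    ETCDCTL_API=3 etcdctl \
        --cert="$ETCD_PEER_CERT_FILE" \
        --key="$ETCD_PEER_KEY_FILE" \
        --cacert="$ETCD_TRUSTED_CA_FILE" \
        --endpoints="${ETCD_TARGET_ENDPOINTS:-$ETCD_LISTEN_CLIENT_URLS}" \
        "$@"
}
```

With this in place, a single host check becomes etcdctl_tls --write-out=table endpoint status, and setting ETCD_TARGET_ENDPOINTS=$ETCD_ALL_ENDPOINTS first turns it into a cluster-wide check.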

ETCD storage version 2

Check health of etcd

Setting etcd variables

From an etcd host we can source the etcd.conf file to set most of the needed variables.

# source /etc/etcd/etcd.conf
# export ETCDCTL_API=2

From etcd host using peer certs

# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE --ca-file=$ETCD_CA_FILE --peers=$ETCD_LISTEN_CLIENT_URLS cluster-health
# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE --ca-file=$ETCD_CA_FILE --peers=$ETCD_LISTEN_CLIENT_URLS member list

Check health with curl from etcd host using peer certs

# curl --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_CA_FILE   $ETCD_LISTEN_CLIENT_URLS/health

Check health with curl from master host using client certs

# curl --cert /etc/origin/master/master.etcd-client.crt  --key /etc/origin/master/master.etcd-client.key  --cacert /etc/origin/master/master.etcd-ca.crt   $ETCD_LISTEN_CLIENT_URLS/health
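
When the curl health check is used in a script, the response body has to be interpreted. The sketch below assumes the body is a small JSON document such as {"health": "true"}. The function name is_healthy is hypothetical, and the pattern is deliberately loose about spacing and quoting because the exact body can differ between etcd versions.

```shell
# Hypothetical helper: read the body of a /health response on stdin and
# return exit status 0 only if it reports the member as healthy.
# Assumption: the body contains a "health" key whose value is true or "true".
is_healthy() {
    grep -Eq '"health"[[:space:]]*:[[:space:]]*"?true'
}
```

For example, the output of either curl command above could be piped into is_healthy, followed by && echo healthy.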

More Information

Set Debug Logging

**Set debug logging dynamically (no restart required)** - Enable debug logging on one member (this sets debug logging on "$ETCD_LISTEN_CLIENT_URLS" only)

- V3
# curl --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_TRUSTED_CA_FILE   $ETCD_LISTEN_CLIENT_URLS/config/local/log -XPUT -d '{"Level":"DEBUG"}'

- V2
# curl --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_CA_FILE   $ETCD_LISTEN_CLIENT_URLS/config/local/log -XPUT -d '{"Level":"DEBUG"}'

Enable info logging

- V3
# curl --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_TRUSTED_CA_FILE   $ETCD_LISTEN_CLIENT_URLS/config/local/log -XPUT -d '{"Level":"INFO"}'

- V2
# curl --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_CA_FILE   $ETCD_LISTEN_CLIENT_URLS/config/local/log -XPUT -d '{"Level":"INFO"}'

Gather Logs (rpm installation)

# journalctl -u etcd > $(hostname)-etcd.log

Gather Logs (containerized installation)

# journalctl -u etcd_container > $(hostname)-etcd.log

For OCP 3.10 and 3.11, if etcd is running in a static pod, logs can be gathered with the following:

# /usr/local/bin/master-logs etcd etcd > $(hostname)-etcd.log 2>&1

Troubleshooting

Perform a watch on etcd to see what keys are being changed.
- This can help show operations that may be making too many changes to keys.
- kubernetes.io/minions keys are updated every 10 seconds for each node. With 5 nodes, that equals 150 key changes in 5 minutes.

# source /etc/etcd/etcd.conf
# export  ETCDCTL_API=3
# timeout 5m  etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS watch / --prefix  --write-out=fields > etcdwatchraw.log

- Scrub data:
# grep -v Value etcdwatchraw.log >  etcdwatch.log

- Example one liner to parse data: 
# awk  'BEGIN{FS="/"; OFS="/";} /^\"Key/{print $2,$3}' etcdwatch.log | sort | uniq -c | sort -nr
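
To judge whether a count in that output is unusually high, it helps to know the expected baseline. The following sketch reproduces the arithmetic from the note above (one update per node every 10 seconds). The function name expected_key_updates is hypothetical.

```shell
# Hypothetical helper: estimate how many node key updates to expect.
# Arguments: number of nodes, watch window in seconds, and optionally the
# update interval in seconds (defaults to the 10 seconds mentioned above).
expected_key_updates() {
    echo $(( $1 * $2 / ${3:-10} ))
}
```

Running expected_key_updates 5 300 prints 150, matching the 5 node, 5 minute example. Counts for kubernetes.io/minions far above the estimate for your node count may be worth a closer look.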

Gather etcd metrics:
- To rule out a slow disk, monitor backend_commit_duration_seconds (p99 duration should be less than 25ms) and wal_fsync_duration_seconds (p99 duration should be less than 10ms) to confirm the disk is reasonably fast. If the disk is too slow, assigning a dedicated disk to etcd or using a faster disk will typically solve the problem.

# source /etc/etcd/etcd.conf
# curl --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_TRUSTED_CA_FILE   $ETCD_LISTEN_CLIENT_URLS/metrics
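
The metrics endpoint returns histograms rather than a ready-made p99 value. The sketch below reads that text and prints the upper bound of the first bucket that covers at least 99% of the observations, which gives a ceiling for the p99. The function name p99_upper_bound is hypothetical. It assumes the Prometheus text format with cumulative _bucket lines in ascending order and a _count line, and that the full metric names are etcd_disk_wal_fsync_duration_seconds and etcd_disk_backend_commit_duration_seconds.

```shell
# Hypothetical helper: read Prometheus-format metrics on stdin and print the
# upper bound (the le label) of the first histogram bucket that holds at
# least 99% of the observations for the metric named in the first argument.
# The real p99 is at or below the printed value, which is in seconds.
p99_upper_bound() {
    awk -v metric="$1" '
        index($0, metric "_bucket{") == 1 {
            le = $0
            sub(/.*le="/, "", le)      # keep what follows the le label start
            sub(/".*/, "", le)         # cut at the closing quote
            n++
            bound[n] = le
            count[n] = $NF + 0         # cumulative count for this bucket
        }
        index($0, metric "_count") == 1 { total = $NF + 0 }
        END {
            if (total > 0)
                for (i = 1; i <= n; i++)
                    if (count[i] >= 0.99 * total) { print bound[i]; exit }
        }'
}
```

For example, the curl output above could be piped into p99_upper_bound etcd_disk_wal_fsync_duration_seconds and the result compared with 0.010 (10ms). The counters are cumulative since the etcd process started, so this reflects the whole uptime rather than the last few minutes.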

Check Performance

# source /etc/etcd/etcd.conf
# export  ETCDCTL_API=3
# etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS  check perf --load='s'

For an accurate benchmark, run these tests when there is no load on etcd, for example when the master API is stopped. You can specify higher loads, but this is discouraged while kube-apiserver is running, as it can impact the cluster. Note that if the check perf command does not end successfully, you may need to manually clean up the keys it has created (under the prefix /etcdctl-check-perf/ or whatever custom prefix has been specified via --prefix). You may also need to keep an eye on the etcd DB size and compact and defrag if it grows too much.

Useful Links

How do I remove and add back an existing etcd member for the OpenShift cluster?
How do I restore from an etcd backup in OpenShift

Defrag Etcd

It is possible to clean the etcd database by doing a defrag. By default the database size is limited to 4GB; if it grows beyond that, etcd goes into maintenance mode. In that case, make a backup to another location first.
More details at:
How to defrag etcd in OpenShift to decrease DB size
openshift-ansible performs many backups and stores them in /var/lib/etcd. If the /var/lib/etcd filesystem is growing, make sure there are not many old backups.

# du -had1 /var/lib/etcd | sort -h
# du -had1  /var/lib/etcd/openshift-backup-* backups
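
To see how close the database is to the 4GB figure mentioned above, the size can be expressed as a percentage. This is a minimal sketch: the function name db_usage_percent is hypothetical, and the default limit of 4294967296 bytes assumes that 4GB means 4 GiB.

```shell
# Hypothetical helper: print how full the etcd database is, as a whole
# percentage of the size limit (rounded down).
# Arguments: database size in bytes, and optionally the limit in bytes
# (defaults to 4294967296, assuming the 4GB limit means 4 GiB).
db_usage_percent() {
    echo $(( $1 * 100 / ${2:-4294967296} ))
}
```

For example, db_usage_percent 2147483648 prints 50. The size in bytes could be taken from the DB SIZE column of endpoint status or from the database file itself, which is assumed here to be /var/lib/etcd/member/snap/db.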

Etcd pprof data for etcd

Data can be gathered to profile CPU, heap, mutex, and goroutine utilization for the etcd process.
Not all of the data below is needed; depending on the issue, only the stack (goroutine) or heap data might be needed.

# sed -i '/^ETCD_DEBUG=/{h;s/=.*/=True/};${x;/^$/{s//ETCD_DEBUG=True/;H};x}' /etc/etcd/etcd.conf
# /usr/local/bin/master-restart etcd etcd 
# source /etc/etcd/etcd.conf 
# alias etcdcurl="curl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt"

# etcdcurl "$ETCD_LISTEN_CLIENT_URLS/debug/pprof/goroutine?debug=2" > `hostname -s`-etcd-goroutine.pprof
# etcdcurl "$ETCD_LISTEN_CLIENT_URLS/debug/pprof/heap?debug=1" > `hostname`-etcd-heap.pprof
# etcdcurl "$ETCD_LISTEN_CLIENT_URLS/debug/pprof/trace?seconds=5" > `hostname`-etcd-trace.pprof
# etcdcurl "$ETCD_LISTEN_CLIENT_URLS/debug/pprof/block?debug=2" > `hostname`-etcd-block.pprof
# etcdcurl "$ETCD_LISTEN_CLIENT_URLS/debug/pprof/profile?seconds=30" > `hostname`-etcd-profile.pprof.gz

# etcdctl3 version
