Troubleshooting OpenShift Container Platform 3.x: Etcd


Note - From OCP 3.10, if etcd runs as a static pod, you need to run the etcdctl commands from within the pod.

An example to understand the above check section "If etcd runs as a static pod, run the following commands" in this

For etcd static pod details check this

Starting with new installations of OpenShift Container Platform 3.6, the etcd3 v3 data model is the default.

Getting Started

Determining which etcd data model your cluster is using.

  • Starting with new installations of OpenShift Container Platform 3.6, the etcd3 v3 data model is the default. With OpenShift Container Platform 3.7, the v3 data model is required.
    Checking the RPM version of etcd can be deceiving, as a version 3 etcd RPM can still be using the v2 data model with OpenShift.

    • OpenShift Container Platform 3.5 or lower: the v2 data model is used.
    • OpenShift Container Platform 3.6: check master-config.yaml to determine which data model is used.
      • grep storage-backend -A1 /etc/origin/master/master-config.yaml
    • OpenShift Container Platform 3.7 or greater: the v3 data model is used.
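
As a rough sketch of the 3.6 check above, the value listed under storage-backend can be pulled out with a small shell function. The function name storage_backend and the sample file are illustrative only, and the sketch assumes the value sits on the line directly after the key, as the grep -A1 command implies.

```shell
# Hypothetical helper: print the storage backend listed in a master config file.
# Assumes the value is on the line right after "storage-backend:".
storage_backend() {
    awk '/storage-backend:/ { getline; gsub(/^[ \t-]+/, ""); print; exit }' "$1"
}

# Illustrative sample only; on a real master the file is
# /etc/origin/master/master-config.yaml.
sample=$(mktemp)
cat > "$sample" <<'EOF'
kubernetesMasterConfig:
  apiServerArguments:
    storage-backend:
    - etcd3
    storage-media-type:
    - application/vnd.kubernetes.protobuf
EOF

backend=$(storage_backend "$sample")
echo "storage backend: $backend"
rm -f "$sample"
```

On a master host the function would be called as storage_backend /etc/origin/master/master-config.yaml.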

ETCD storage version 3

Setting etcd variables

  • From an etcd host we can source the etcd.conf file to set most of the needed variables.
# source /etc/etcd/etcd.conf
# export ETCDCTL_API=3
  • Set endpoint variable to include all etcd endpoints
# ETCD_ALL_ENDPOINTS=` etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=fields   member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}'`
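
To see what the awk program in the command above does, it can be run against canned text instead of a live cluster. The sample below is only an assumed shape of the --write-out=fields output, with made-up member names and addresses.

```shell
# Illustrative sample of "member list --write-out=fields" output (assumed
# shape, hypothetical names and addresses); real values come from the cluster.
sample_fields='"ID" : 1001
"Name" : "master1.example.com"
"PeerURL" : "https://192.0.2.11:2380"
"ClientURL" : "https://192.0.2.11:2379"

"ID" : 1002
"Name" : "master2.example.com"
"PeerURL" : "https://192.0.2.12:2380"
"ClientURL" : "https://192.0.2.12:2379"'

# Same awk program as in the command above: take the third field of every
# ClientURL line and join the values with commas.
endpoints=$(printf '%s\n' "$sample_fields" |
    awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}')
echo "$endpoints"
```

Note that the awk program does not strip the double quotes that surround each URL in this sample; it only joins the values.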

Check health of etcd

  • Single Host status and health checks.
#  etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=table endpoint status

#  etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=table endpoint health
  • Cluster status and health checks.
# etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=table  member list

# etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_ALL_ENDPOINTS  --write-out=table endpoint status

# etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_ALL_ENDPOINTS endpoint health

NOTE: In the above 'member list', 'endpoint status', and 'endpoint health' commands you might need to use the 'etcdctl3' command.
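
Because every v3 command above repeats the same certificate flags, a small wrapper function can save typing. This is only a sketch: the names etcd3, ETCDCTL_BIN and ETCD_ENDPOINTS are not part of etcd or OpenShift and are introduced here for illustration.

```shell
# Hypothetical wrapper around the v3 commands above.
# ETCDCTL_BIN: client binary to run (set to etcdctl3 where that is needed).
# ETCD_ENDPOINTS: optional override, for example "$ETCD_ALL_ENDPOINTS".
etcd3() {
    ETCDCTL_API=3 "${ETCDCTL_BIN:-etcdctl}" \
        --cert="$ETCD_PEER_CERT_FILE" \
        --key="$ETCD_PEER_KEY_FILE" \
        --cacert="$ETCD_TRUSTED_CA_FILE" \
        --endpoints="${ETCD_ENDPOINTS:-$ETCD_LISTEN_CLIENT_URLS}" \
        "$@"
}

# Usage, after "source /etc/etcd/etcd.conf":
#   etcd3 --write-out=table endpoint status
#   ETCD_ENDPOINTS=$ETCD_ALL_ENDPOINTS etcd3 endpoint health
```

Setting ETCDCTL_BIN=echo prints the assembled command line instead of running it, which is a cheap way to check the variables before touching the cluster.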

ETCD storage version 2

Check health of etcd


Setting etcd variables
  • From an etcd host we can source the etcd.conf file to set most of the needed variables.
# source /etc/etcd/etcd.conf
# export ETCDCTL_API=2
  • From etcd host using peer certs
# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE --ca-file=$ETCD_CA_FILE --peers=$ETCD_LISTEN_CLIENT_URLS cluster-health
# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE --ca-file=$ETCD_CA_FILE --peers=$ETCD_LISTEN_CLIENT_URLS member list
  • Check health with curl from etcd host using peer certs
# curl --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_CA_FILE   $ETCD_LISTEN_CLIENT_URLS/health
  • Check health with curl from master host using client certs
# curl --cert /etc/origin/master/master.etcd-client.crt  --key /etc/origin/master/master.etcd-client.key  --cacert /etc/origin/master/master.etcd-ca.crt   $ETCD_LISTEN_CLIENT_URLS/health
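
When the curl health checks above are scripted, the response body has to be interpreted. The sketch below assumes the /health endpoint answers with a small JSON body such as {"health": "true"}; the helper name is_healthy is made up for this example, and the bodies are canned samples rather than real responses.

```shell
# Hypothetical helper: succeed only if a /health response body reports true.
# Assumes a JSON body such as {"health": "true"}; the exact spacing may vary.
is_healthy() {
    printf '%s' "$1" | grep -Eq '"health"[[:space:]]*:[[:space:]]*"?true"?'
}

# Sample bodies for illustration; in practice pass the output of "curl -s".
for body in '{"health": "true"}' '{"health": "false"}'; do
    if is_healthy "$body"; then
        echo "healthy:   $body"
    else
        echo "unhealthy: $body"
    fi
done
```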

More Information

Set Debug Logging


Set debug logging dynamically (no restart required)
  • Enable debug logging (this sets debug logging only on the endpoint in "$ETCD_LISTEN_CLIENT_URLS")
- V3
# curl --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_TRUSTED_CA_FILE   $ETCD_LISTEN_CLIENT_URLS/config/local/log -XPUT -d '{"Level":"DEBUG"}'

- V2
# curl --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_CA_FILE   $ETCD_LISTEN_CLIENT_URLS/config/local/log -XPUT -d '{"Level":"DEBUG"}'
  • Enable info logging
- V3
# curl --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_TRUSTED_CA_FILE   $ETCD_LISTEN_CLIENT_URLS/config/local/log -XPUT -d '{"Level":"INFO"}'

- V2
# curl --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_CA_FILE   $ETCD_LISTEN_CLIENT_URLS/config/local/log -XPUT -d '{"Level":"INFO"}'
  • Gather Logs (rpm installation)
# journalctl -u etcd > $(hostname)-etcd.log
  • Gather Logs (containerized installation)
# journalctl -u etcd_container > $(hostname)-etcd.log

For OCP 3.10 and 3.11, if etcd is running in a static pod, logs can be gathered with the following:

# /usr/local/bin/master-logs etcd etcd > $(hostname)-etcd.log 2>&1

Troubleshooting

  • Perform a watch on etcd to see which keys are being changed.
    • This can help reveal operations that may be making too many changes to keys.
    • kubernetes.io/minions keys are updated every 10 seconds for each node, so 5 nodes will produce 150 key changes in 5 minutes.
# source /etc/etcd/etcd.conf
# export  ETCDCTL_API=3
# timeout 5m  etcdctl  --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS watch / --prefix  --write-out=fields > etcdwatchraw.log

- Scrub data:
# grep -v Value etcdwatchraw.log >  etcdwatch.log

- Example one liner to parse data: 
# awk  'BEGIN{FS="/"; OFS="/";} /^\"Key/{print $2,$3}' etcdwatch.log | sort | uniq -c | sort -nr
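
To judge whether the counts from the watch are unusual, it helps to know the expected baseline for node updates. The sketch below reproduces the arithmetic quoted above (one update per node every 10 seconds); the function name expected_node_updates is made up for this example.

```shell
# Hypothetical helper: expected node key updates in a watch window.
# Arguments: node count, window in seconds, update interval in seconds
# (defaults to the 10 seconds quoted above).
expected_node_updates() {
    nodes=$1 window=$2 interval=${3:-10}
    echo $(( nodes * window / interval ))
}

baseline=$(expected_node_updates 5 300)
echo "expected kubernetes.io/minions changes in 5 minutes: $baseline"
```

Counts for kubernetes.io/minions that are far above this baseline, or other prefixes that outnumber it, are the ones worth a closer look.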
  • Gather etcd metrics
# source /etc/etcd/etcd.conf
# curl --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_TRUSTED_CA_FILE   $ETCD_LISTEN_CLIENT_URLS/metrics
  • Check Performance
# source /etc/etcd/etcd.conf
# export  ETCDCTL_API=3
# etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS  check perf --load='s'
  • For an accurate benchmark, run these tests when there is no load on etcd, for example when the master API is stopped. You can specify higher loads, but this is discouraged while kube-apiserver is running because it can impact the cluster. If the check perf command does not finish successfully, you may need to manually clean up the keys it created (under the prefix /etcdctl-check-perf/ or whatever custom prefix was specified via --prefix). You may also need to keep an eye on the etcd DB size and compact and defrag if it grows too much.
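
The manual cleanup mentioned above can be sketched as a prefix delete with the v3 client. The names cleanup_check_perf, PERF_PREFIX and ETCDCTL_BIN are assumptions made for this example; review the prefix carefully before running a delete against a production cluster.

```shell
# Hypothetical cleanup helper for keys left behind by "check perf".
# PERF_PREFIX defaults to the prefix named above; ETCDCTL_BIN selects the
# client binary (use etcdctl3 where applicable).
cleanup_check_perf() {
    ETCDCTL_API=3 "${ETCDCTL_BIN:-etcdctl}" \
        --cert="$ETCD_PEER_CERT_FILE" \
        --key="$ETCD_PEER_KEY_FILE" \
        --cacert="$ETCD_TRUSTED_CA_FILE" \
        --endpoints="$ETCD_LISTEN_CLIENT_URLS" \
        del --prefix "${PERF_PREFIX:-/etcdctl-check-perf/}"
}

# Usage, after "source /etc/etcd/etcd.conf":
#   ETCDCTL_BIN=echo cleanup_check_perf   # preview the command only
#   cleanup_check_perf                    # delete the leftover keys
```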

How do I remove and add back an existing etcd member for the OpenShift cluster?
How do I restore from an etcd backup in OpenShift

Defrag Etcd

  • The etcd database can be cleaned up by running a defrag. By default the database size limit is 4GB; if the database grows beyond that, etcd goes into maintenance mode. In that case, make a backup to another location first.
    More details at:
    How to defrag etcd in OpenShift to decrease DB size
  • openshift-ansible performs many backups and stores them in /var/lib/etcd. If the /var/lib/etcd filesystem is growing, make sure that there are not many old backups.
# du -had1 /var/lib/etcd | sort -h
# du -had1  /var/lib/etcd/openshift-backup-* backups
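
To spot old backups quickly, the openshift-backup-* entries can also be filtered by age. This is a minimal sketch that assumes the backups are directories directly under /var/lib/etcd and that GNU find and touch are available; the function name old_etcd_backups and the 30-day default are arbitrary choices for illustration.

```shell
# Hypothetical helper: list backup directories older than a number of days.
# Arguments: etcd data directory, age in days (default 30, an arbitrary choice).
old_etcd_backups() {
    find "${1:-/var/lib/etcd}" -maxdepth 1 -type d \
        -name 'openshift-backup-*' -mtime +"${2:-30}"
}

# Self-contained demonstration in a scratch directory.
demo=$(mktemp -d)
mkdir "$demo/openshift-backup-old" "$demo/openshift-backup-new"
touch -d '40 days ago' "$demo/openshift-backup-old"
stale=$(old_etcd_backups "$demo" 30)
echo "$stale"
rm -rf "$demo"
```

The function only lists candidates; whether a backup can be removed is still a decision to make by hand.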

Etcd pprof data for etcd

  • Data can be gathered to profile CPU, heap, mutex, and goroutine utilization for the etcd process.
  • Not all of the data below is needed; depending on the issue, only the stack or heap data might be needed.
# sed -i '/^ETCD_DEBUG=/{h;s/=.*/=True/};${x;/^$/{s//ETCD_DEBUG=True/;H};x}' /etc/etcd/etcd.conf
# /usr/local/bin/master-restart etcd etcd 
# source /etc/etcd/etcd.conf 
# alias etcdcurl="curl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt"

# etcdcurl "$ETCD_LISTEN_CLIENT_URLS/debug/pprof/goroutine?debug=2" > `hostname -s`-etcd-goroutine.pprof
# etcdcurl "$ETCD_LISTEN_CLIENT_URLS/debug/pprof/heap?debug=1" > `hostname`-etcd-heap.pprof
# etcdcurl "$ETCD_LISTEN_CLIENT_URLS/debug/pprof/trace?seconds=5" > `hostname`-etcd-trace.pprof
# etcdcurl "$ETCD_LISTEN_CLIENT_URLS/debug/pprof/block?debug=2" > `hostname`-etcd-block.pprof
# etcdcurl "$ETCD_LISTEN_CLIENT_URLS/debug/pprof/profile?seconds=30" > `hostname`-etcd-profile.pprof.gz

# etcdctl3 version