Troubleshooting OpenShift Container Platform 3.x: Etcd
Note: From OCP 3.10, if etcd runs as a static pod, you need to run the etcdctl commands from inside the pod.
For an example of the above, check the section "If etcd runs as a static pod, run the following commands" in this
For etcd static pod details check this
Starting with new installations of OpenShift Container Platform 3.6, the etcd3 v3 data model is the default.
- Getting Started
- ETCD v3
- ETCD v2
- Set Debug Logging
- Troubleshooting
- Useful Links
- Defrag Etcd
- Profile data Etcd
Getting Started
Determining which etcd data model your cluster is using.
- Starting with new installations of OpenShift Container Platform 3.6, the etcd3 v3 data model is the default. With OpenShift Container Platform 3.7, the v3 data model is required.
- Checking the RPM version of etcd can be deceiving, as a version 3 etcd RPM can still be using the v2 data model with OpenShift.
- OpenShift Container Platform 3.5 or lower: the v2 data model is used.
- OpenShift Container Platform 3.6: check master-config.yaml to determine which data model is used.
grep storage-backend -A1 /etc/origin/master/master-config.yaml
- OpenShift Container Platform 3.7 or greater: the v3 data model is used.
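For OpenShift Container Platform 3.6, the grep check above can be wrapped in a small helper that prints only the configured value. This is a minimal sketch: the function name etcd_storage_backend is hypothetical, and it assumes storage-backend appears as a YAML key whose value is the first list item on the following line (the layout that grep -A1 relies on).

```shell
# Hypothetical helper: print the storage-backend value from master-config.yaml.
# Assumes "storage-backend:" is followed by a list item line such as "- etcd3".
etcd_storage_backend() {
    awk '
        /^[ \t]*storage-backend:/ { found = 1; next }
        found {
            sub(/^[ \t]*-[ \t]*/, "")   # drop the YAML list marker
            sub(/[ \t]+$/, "")          # drop trailing whitespace
            print
            exit
        }
    ' "${1:-/etc/origin/master/master-config.yaml}"
}
```

Run without arguments it reads the default master configuration path; it prints nothing if the key is not present in the file.
# etcd_storage_backend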
ETCD storage version 3
Setting etcd variables
- From an etcd host we can source the etcd.conf file to set most of the needed variables.
# source /etc/etcd/etcd.conf
# export ETCDCTL_API=3
- Set endpoint variable to include all etcd endpoints
# ETCD_ALL_ENDPOINTS=` etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=fields member list | awk '/ClientURL/{printf "%s%s",sep,$3; sep=","}'`
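To make the awk step above easier to follow, here is a minimal sketch of the same join as a standalone function. The name join_client_urls is hypothetical. It additionally strips double quotes around each URL, on the assumption that the fields output may quote string values; drop the gsub line if your etcdctl prints bare URLs.

```shell
# Hypothetical helper: read "member list --write-out=fields" output on stdin
# and print the ClientURL values as one comma-separated string.
join_client_urls() {
    awk '
        /ClientURL/ {
            url = $3
            gsub(/"/, "", url)            # assumption: values may be quoted
            printf "%s%s", sep, url
            sep = ","
        }
        END { if (sep != "") print "" }   # end the line if anything was printed
    '
}
```

It can take the place of the inline awk program, with the same etcdctl flags as in the command above:
# ETCD_ALL_ENDPOINTS=$(etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=fields member list | join_client_urls)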
Check health of etcd
- Single Host status and health checks.
# etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=table endpoint status
# etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=table endpoint health
- Cluster status and health checks.
# etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS --write-out=table member list
# etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_ALL_ENDPOINTS --write-out=table endpoint status
# etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_ALL_ENDPOINTS endpoint health
NOTE: In the above 'member list', 'endpoint status', and 'endpoint health' commands you might need to use the 'etcdctl3' command.
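Because every command above repeats the same TLS flags, a small wrapper can reduce typing. The sketch below is only a convenience under stated assumptions: the function name etcd3_at is hypothetical, /etc/etcd/etcd.conf has already been sourced, and the binary is called etcdctl (substitute etcdctl3 if that is what your host provides).

```shell
# Hypothetical wrapper: run an etcdctl v3 command against the given endpoints
# with the peer certificate flags used in this section.
# Usage: etcd3_at ENDPOINTS COMMAND [ARGS...]
etcd3_at() {
    etcd3_endpoints=$1
    shift
    ETCDCTL_API=3 etcdctl \
        --cert="$ETCD_PEER_CERT_FILE" \
        --key="$ETCD_PEER_KEY_FILE" \
        --cacert="$ETCD_TRUSTED_CA_FILE" \
        --endpoints="$etcd3_endpoints" \
        "$@"
}
```

With the wrapper defined, the single-host and cluster checks become:
# etcd3_at "$ETCD_LISTEN_CLIENT_URLS" --write-out=table endpoint status
# etcd3_at "$ETCD_ALL_ENDPOINTS" endpoint health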
ETCD storage version 2
Setting etcd variables
- From an etcd host we can source the etcd.conf file to set most of the needed variables.
# source /etc/etcd/etcd.conf
# export ETCDCTL_API=2
Check health of etcd
- From etcd host using peer certs
# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE --ca-file=$ETCD_CA_FILE --peers=$ETCD_LISTEN_CLIENT_URLS cluster-health
# etcdctl --cert-file=$ETCD_PEER_CERT_FILE --key-file=$ETCD_PEER_KEY_FILE --ca-file=$ETCD_CA_FILE --peers=$ETCD_LISTEN_CLIENT_URLS member list
- Check health with curl from etcd host using peer certs
# curl --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_CA_FILE $ETCD_LISTEN_CLIENT_URLS/health
- Check health with curl from master host using client certs
# curl --cert /etc/origin/master/master.etcd-client.crt --key /etc/origin/master/master.etcd-client.key --cacert /etc/origin/master/master.etcd-ca.crt $ETCD_LISTEN_CLIENT_URLS/health
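When scripting the curl checks above, the response body can be tested directly. This is a minimal sketch, assuming the /health endpoint returns a small JSON document containing "health": "true" when the member is healthy (spacing can differ between etcd versions); the function name etcd_health_ok is hypothetical.

```shell
# Hypothetical helper: read a /health response body on stdin and succeed
# only if it reports health as "true".
etcd_health_ok() {
    # Remove whitespace so {"health": "true"} and {"health":"true"} both match.
    tr -d ' \t\r\n' | grep -q '"health":"true"'
}
```

For example, from an etcd host (curl -s only silences the progress output):
# curl -s --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_CA_FILE $ETCD_LISTEN_CLIENT_URLS/health | etcd_health_ok && echo healthy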
More Information
- Manually adding a new etcd host to the cluster for OpenShift Container Platform
- Etcd admin guide
Set Debug Logging
**Set debug logging dynamically (no restart required)**
- Enable debug logging (this sets debug logging on "$ETCD_LISTEN_CLIENT_URLS" only)
- V3
# curl --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_TRUSTED_CA_FILE $ETCD_LISTEN_CLIENT_URLS/config/local/log -XPUT -d '{"Level":"DEBUG"}'
- V2
# curl --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_CA_FILE $ETCD_LISTEN_CLIENT_URLS/config/local/log -XPUT -d '{"Level":"DEBUG"}'
- Enable info logging
- V3
# curl --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_TRUSTED_CA_FILE $ETCD_LISTEN_CLIENT_URLS/config/local/log -XPUT -d '{"Level":"INFO"}'
- V2
# curl --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_CA_FILE $ETCD_LISTEN_CLIENT_URLS/config/local/log -XPUT -d '{"Level":"INFO"}'
- Gather Logs (rpm installation)
# journalctl -u etcd > $(hostname)-etcd.log
- Gather Logs (containerized installation)
# journalctl -u etcd_container > $(hostname)-etcd.log
- For OCP 3.10 and 3.11, if etcd is running in a static pod, logs can be gathered with the following:
# /usr/local/bin/master-logs etcd etcd > $(hostname)-etcd.log 2>&1
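The dynamic log-level calls earlier in this section differ only in the level and the CA variable, so they can be folded into one function. A minimal sketch, assuming /etc/etcd/etcd.conf has been sourced; the names etcd_log_payload and etcd_set_log_level are hypothetical, only the two levels shown in this article are accepted, and the CA file is passed explicitly because the V3 and V2 examples use different variables.

```shell
# Hypothetical helper: build the JSON body for the /config/local/log endpoint.
# Accepts debug or info in any letter case; anything else is rejected.
etcd_log_payload() {
    log_level=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    case "$log_level" in
        DEBUG|INFO) printf '{"Level":"%s"}\n' "$log_level" ;;
        *) echo "unsupported level: $1" >&2; return 1 ;;
    esac
}

# Hypothetical wrapper around the curl call shown in this section.
# Usage: etcd_set_log_level LEVEL CA_FILE
etcd_set_log_level() {
    log_payload=$(etcd_log_payload "$1") || return 1
    curl --cert "$ETCD_PEER_CERT_FILE" --key "$ETCD_PEER_KEY_FILE" \
        --cacert "$2" "$ETCD_LISTEN_CLIENT_URLS/config/local/log" \
        -XPUT -d "$log_payload"
}
```

For example, to enable debug logging on a V3 host and return to info afterwards:
# etcd_set_log_level debug "$ETCD_TRUSTED_CA_FILE"
# etcd_set_log_level info "$ETCD_TRUSTED_CA_FILE"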
Troubleshooting
- Perform a watch on etcd to see what keys are being changed.
- This can help show operations that may be making too many changes to keys.
- kubernetes.io/minions keys are updated every 10 seconds for each node, so 5 nodes will produce 150 key changes in 5 minutes.
# source /etc/etcd/etcd.conf
# export ETCDCTL_API=3
# timeout 5m etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS watch / --prefix --write-out=fields > etcdwatchraw.log
- Scrub data:
# grep -v Value etcdwatchraw.log > etcdwatch.log
- Example one liner to parse data:
# awk 'BEGIN{FS="/"; OFS="/";} /^\"Key/{print $2,$3}' etcdwatch.log | sort | uniq -c | sort -nr
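To see what the one-liner produces, it helps to run it on a small sample. The sketch below wraps the same pipeline in a hypothetical function, count_watch_prefixes, and assumes each key appears in the scrubbed log on a line of the form "Key" : "/first/second/...". Splitting on "/" makes fields 2 and 3 the first two path components, so the output is a change count per two-level prefix.

```shell
# Hypothetical helper: count key changes per two-level prefix in a scrubbed
# watch log (such as etcdwatch.log), busiest prefix first.
count_watch_prefixes() {
    awk 'BEGIN { FS = "/"; OFS = "/" } /^"Key"/ { print $2, $3 }' "$1" |
        sort | uniq -c | sort -nr
}
```

# count_watch_prefixes etcdwatch.log
With 5 nodes and a 5 minute watch, a count near 150 for kubernetes.io/minions would match the expected update rate; a prefix with a much higher count is a candidate for closer inspection.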
- Gather etcd metrics:
- To rule out a slow disk, monitor backend_commit_duration_seconds (p99 duration should be less than 25ms) and wal_fsync_duration_seconds (p99 duration should be less than 10ms) to confirm the disk is reasonably fast. If the disk is too slow, assigning a dedicated disk to etcd or using a faster disk will typically solve the problem.
# source /etc/etcd/etcd.conf
# curl --cert $ETCD_PEER_CERT_FILE --key $ETCD_PEER_KEY_FILE --cacert $ETCD_TRUSTED_CA_FILE $ETCD_LISTEN_CLIENT_URLS/metrics
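The metrics endpoint exposes these durations as Prometheus histograms, so the p99 thresholds above can be estimated from the cumulative bucket counters. The following is a rough sketch rather than a precise quantile calculation: it reports the upper bound of the first bucket that contains at least 99% of the observations. The function name p99_bucket is hypothetical, and it assumes the full metric names are etcd_disk_backend_commit_duration_seconds and etcd_disk_wal_fsync_duration_seconds and that buckets are listed in ascending order.

```shell
# Hypothetical helper: read /metrics text on stdin and print the upper bound
# (in seconds) of the first histogram bucket covering 99% of observations.
# Usage: p99_bucket METRIC_NAME < metrics.txt
p99_bucket() {
    awk -v prefix="${1}_bucket{" '
        index($0, prefix) == 1 {
            le = $0
            sub(/^.*le="/, "", le)      # keep what follows le="
            sub(/".*$/, "", le)         # cut at the closing quote
            n++
            bound[n] = le
            count[n] = $NF + 0
            if (le == "+Inf") total = count[n]
        }
        END {
            if (total == 0) exit 1      # metric missing or no observations yet
            for (i = 1; i <= n; i++) {
                # integer form of count/total >= 0.99
                if (count[i] * 100 >= total * 99) { print bound[i]; exit 0 }
            }
        }
    '
}
```

Save the output of the curl command above to a file (metrics.txt is just an example name), then:
# p99_bucket etcd_disk_wal_fsync_duration_seconds < metrics.txt
# p99_bucket etcd_disk_backend_commit_duration_seconds < metrics.txt
A printed bound at or below the guideline (for example 0.008 against the 10ms fsync guideline) is consistent with a fast enough disk. A bound above the guideline only says that p99 lies somewhere in that bucket, so it needs a closer look.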
- Check Performance
# source /etc/etcd/etcd.conf
# export ETCDCTL_API=3
# etcdctl --cert=$ETCD_PEER_CERT_FILE --key=$ETCD_PEER_KEY_FILE --cacert=$ETCD_TRUSTED_CA_FILE --endpoints=$ETCD_LISTEN_CLIENT_URLS check perf --load='s'
- For an accurate benchmark, run these tests when there is no load on etcd, for example when the master API is stopped. You can specify higher loads, but this is discouraged if kube-apiserver is running, as it can impact the cluster. Note that if the check perf command does not end successfully, you may need to manually clean up the keys it has created (under the prefix /etcdctl-check-perf/ or whatever custom prefix has been specified via --prefix). You may also need to keep an eye on the etcd DB size and compact and defrag if it grows too much.
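If check perf is interrupted, the leftover keys mentioned above can be inspected and removed by prefix. The sketch below is an example under stated assumptions, not a documented procedure: the function names are hypothetical, /etc/etcd/etcd.conf must already be sourced, and it relies on the "get --prefix --keys-only" and "del --prefix" forms of etcdctl with API v3. Deleting by prefix is destructive, so list the keys first and double-check the prefix.

```shell
# Hypothetical helper: run etcdctl (v3 API) with the TLS flags from this section.
perf_etcdctl() {
    ETCDCTL_API=3 etcdctl --cert="$ETCD_PEER_CERT_FILE" --key="$ETCD_PEER_KEY_FILE" \
        --cacert="$ETCD_TRUSTED_CA_FILE" --endpoints="$ETCD_LISTEN_CLIENT_URLS" "$@"
}

# List keys left behind under the check perf prefix (default taken from the text).
list_check_perf_keys() {
    perf_etcdctl get --prefix --keys-only "${1:-/etcdctl-check-perf/}"
}

# Delete those keys. Refuses the root prefix to avoid wiping the whole keyspace.
delete_check_perf_keys() {
    perf_prefix="${1:-/etcdctl-check-perf/}"
    case "$perf_prefix" in
        /) echo "refusing to delete prefix '$perf_prefix'" >&2; return 1 ;;
    esac
    perf_etcdctl del --prefix "$perf_prefix"
}
```

# list_check_perf_keys
# delete_check_perf_keys
Pass the custom prefix as the first argument if check perf was run with --prefix.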
Useful Links
How do I remove and add back an existing etcd member for the OpenShift cluster?
How do I restore from an etcd backup in OpenShift
Defrag Etcd
- It is possible to clean the etcd database by doing a defrag. By default the database can grow up to 4GB; beyond that, etcd goes into maintenance mode. In that case, make a backup to another location first.
More details at:
How to defrag etcd in OpenShift to decrease DB size
- openshift-ansible performs many backups and stores them in /var/lib/etcd. If the filesystem /var/lib/etcd is growing, make sure that there aren't many old backups.
# du -had1 /var/lib/etcd | sort -h
# du -had1 /var/lib/etcd/openshift-backup-*
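To spot old backups quickly, the du checks above can be complemented with find. This is a minimal sketch, assuming backups live directly under /var/lib/etcd in directories named openshift-backup-* (as in the du command above); the function name old_etcd_backups and the 30-day default are illustrative choices, not values taken from openshift-ansible.

```shell
# Hypothetical helper: list backup directories last modified more than DAYS ago.
# Usage: old_etcd_backups [DIR] [DAYS]
old_etcd_backups() {
    find "${1:-/var/lib/etcd}" -maxdepth 1 -type d \
        -name 'openshift-backup-*' -mtime +"${2:-30}" | sort
}
```

# old_etcd_backups /var/lib/etcd 14
The function only lists candidates. Review the list, and keep at least one recent backup, before removing anything.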
Etcd pprof data for etcd
- Data can be gathered to profile CPU, heap, mutex, and goroutine utilization for the etcd process.
- Not all of the below is needed; depending on the issue, only stack or heap data might be needed.
# sed -i '/^ETCD_DEBUG=/{h;s/=.*/=True/};${x;/^$/{s//ETCD_DEBUG=True/;H};x}' /etc/etcd/etcd.conf
# /usr/local/bin/master-restart etcd etcd
# source /etc/etcd/etcd.conf
# alias etcdcurl="curl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key --cacert /etc/etcd/ca.crt"
# etcdcurl "$ETCD_LISTEN_CLIENT_URLS/debug/pprof/goroutine?debug=2" > `hostname -s`-etcd-goroutine.pprof
# etcdcurl "$ETCD_LISTEN_CLIENT_URLS/debug/pprof/heap?debug=1" > `hostname`-etcd-heap.pprof
# etcdcurl "$ETCD_LISTEN_CLIENT_URLS/debug/pprof/trace?seconds=5" > `hostname`-etcd-trace.pprof
# etcdcurl "$ETCD_LISTEN_CLIENT_URLS/debug/pprof/block?debug=2" > `hostname`-etcd-block.pprof
# etcdcurl "$ETCD_LISTEN_CLIENT_URLS/debug/pprof/profile?seconds=30" > `hostname`-etcd-profile.pprof.gz
# etcdctl3 version
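The text-format pprof requests above can also be collected in one pass. A minimal sketch, assuming the peer certificate paths used in the etcdcurl alias; the function name collect_etcd_pprof is hypothetical. The trace and CPU profile requests are left out because they block for the requested number of seconds and are often not needed.

```shell
# Hypothetical helper: fetch the goroutine, heap, and block profiles into
# files named HOST-etcd-NAME.pprof in the current directory.
# Usage: collect_etcd_pprof BASE_URL [HOST]
collect_etcd_pprof() {
    pprof_base=${1%/}                        # drop a trailing slash, if any
    pprof_host=${2:-$(hostname -s)}
    for pprof_spec in 'goroutine?debug=2' 'heap?debug=1' 'block?debug=2'; do
        pprof_name=${pprof_spec%%\?*}        # part before the "?"
        curl --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key \
            --cacert /etc/etcd/ca.crt "$pprof_base/debug/pprof/$pprof_spec" \
            > "$pprof_host-etcd-$pprof_name.pprof" || return 1
    done
}
```

# source /etc/etcd/etcd.conf
# collect_etcd_pprof "$ETCD_LISTEN_CLIENT_URLS"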