How to delete all kubernetes.io/events in etcd
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
- etcd 3
- events
Issue
- There is a large number of events in etcd.
- etcd is firing the NOSPACE alarm.
- The OpenShift API is unavailable, or the oc delete command is too slow to delete all events.
Resolution
Kubernetes events have a time-to-live of 3 hours, as explained in time-to-live for the events in an RHOCP cluster, so a large number of events is usually the consequence of other issues rather than their cause. Deleting the events is only a workaround: afterwards, investigate what is creating so many events.
IMPORTANT NOTE: Use this procedure to delete events only. Deleting any other resource from etcd this way could cause the cluster to fail. For other resources, investigate why there are so many instances and delete them only with oc delete [resource_name] -n [namespace_name].
Deleting events from etcd
Before deleting events from etcd, check that an etcd backup is available.
- Connect to the etcdctl container of the etcd pod from a CoreOS master node:

    $ ssh -i <identity> core@master
    core@master$ sudo -i
    root@master# crictl exec -ti $(crictl ps --label "io.kubernetes.container.name=etcdctl" -q) /bin/sh

- Check the number of events and confirm that etcd is filled with events:

    $ etcdctl --command-timeout=60s get --prefix --keys-only / | awk -F/ '/./ { print $3 }' | sort | uniq -c | sort -n
    <truncated>
        410 serviceaccounts
        788 configmaps
        807 oauth
       2734 secrets
    3977840 events

- Check the status of the etcd endpoints. They must be reachable. A NOSPACE alarm may have been raised:

    $ etcdctl endpoint status -w table
    +-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+---------------+
    |       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |    ERRORS     |
    +-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+---------------+
    | https://10.0.0.1:2379 | 123456789abcdef1 |   3.4.9 |  8.2 GB |     false |      false |      4470 |    2273805 |            2273805 | alarm:NOSPACE |
    | https://10.0.0.2:2379 | 123456789abcdef2 |   3.4.9 |  8.2 GB |      true |      false |      4470 |    2273805 |            2273805 | alarm:NOSPACE |
    | https://10.0.0.2:2379 | 123456789abcdef3 |   3.4.9 |  8.2 GB |     false |      false |      4470 |    2273805 |            2273805 | alarm:NOSPACE |
    +-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+---------------+

    $ etcdctl alarm list
    memberID:123456789abcdef1 alarm:NOSPACE
    memberID:123456789abcdef2 alarm:NOSPACE
    memberID:123456789abcdef3 alarm:NOSPACE

- Check which namespaces generate the most events. This helps to identify the root cause or any misbehaving applications:

    $ etcdctl --command-timeout=60s get --prefix --keys-only / | awk -F/ '/./ { print $3 " " $4 }' | grep events | sort | uniq -c | sort -n

- Delete all events from etcd. Run the following loop to start the deletion process:

    # Variable definition
    # COUNT: number of events deleted per request
    # FROM:  first event key in etcd
    # TO:    last event key returned by the current query
    # NUM:   number of events deleted by the current request
    COUNT=10000
    FROM="$(etcdctl --command-timeout=60s get '/kubernetes.io/events/' --prefix --keys-only --limit 1)"
    while :; do
      TO="$(etcdctl get '/kubernetes.io/events/' --command-timeout=60s --prefix --keys-only --limit ${COUNT} | sed '/^$/d' | tail -1)"
      # Delete the range only if every key in it is an event key; otherwise abort.
      # The range end is exclusive, so TO itself is not deleted by this request.
      [ $(etcdctl get ${FROM} ${TO} --command-timeout=60s --keys-only | grep -vEc '^$|^/kubernetes.io/events/') -eq 0 ] \
        && NUM=$(etcdctl --command-timeout=60s del ${FROM} ${TO}) \
        || { echo "Non event key found, aborting..." ; break ; }
      [ "${NUM}" = "0" ] && echo "All events deleted" && break
      echo "${NUM} events deleted"
    done
    <truncated>
    9999 events deleted
    9999 events deleted
    411 events deleted
    All events deleted

- Free the disk space. Although all events have been deleted, etcd disk usage has not changed. To free it, etcd must be compacted and defragmented as explained in how to compact and defrag etcd to decrease database size in OpenShift 4.
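The safety check inside the deletion loop (abort as soon as a key outside /kubernetes.io/events/ appears in the range) can be isolated and tried on sample data before touching etcd. The following is a minimal sketch, not part of the original procedure: the function name only_event_keys and the variable EVENTS_PREFIX are hypothetical, and the helper uses a literal prefix comparison in awk instead of the grep expression from the loop.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original procedure.
# Reads keys on stdin (one per line, blank lines allowed, as printed by
# `etcdctl get --keys-only`) and succeeds only if every non-empty line
# starts with the events prefix.
EVENTS_PREFIX='/kubernetes.io/events/'

only_event_keys() {
    awk -v prefix="${EVENTS_PREFIX}" '
        # index() == 1 means the line begins with the literal prefix.
        NF && index($0, prefix) != 1 { bad++ }
        END { exit (bad ? 1 : 0) }
    '
}

# Example on sample data: two event keys separated by a blank line.
printf '%s\n' '/kubernetes.io/events/ns1/a' '' '/kubernetes.io/events/ns2/b' \
    | only_event_keys && echo "range contains only event keys"
```

Inside the etcdctl container, the output of etcdctl get ${FROM} ${TO} --command-timeout=60s --keys-only could be piped into only_event_keys before the corresponding del request is issued.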
Root Cause
A large number of events created because of cluster issues (for example, operators trying to remediate the cluster state, or an internal or external component creating objects in an infinite loop) can cause poor etcd performance with frequent leader re-elections. Investigate the underlying issue, because otherwise the events can build up again within a short time.
Diagnostic Steps
- Most API calls fail with "etcdserver: mvcc: database space exceeded" or "etcdserver: leader changed".

- oc describe node takes several minutes to run, and oc get events is impossible in many namespaces.

- Checking the etcd size shows that it is full (more than 8 GB, which is the default maximum):

    $ etcdctl endpoint status -w table
    +-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+---------------+
    |       ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |    ERRORS     |
    +-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+---------------+
    | https://10.0.0.1:2379 | 123456789abcdef1 |   3.4.9 |  8.2 GB |     false |      false |      4470 |    2273805 |            2273805 | alarm:NOSPACE |
    | https://10.0.0.2:2379 | 123456789abcdef2 |   3.4.9 |  8.2 GB |      true |      false |      4470 |    2273805 |            2273805 | alarm:NOSPACE |
    | https://10.0.0.2:2379 | 123456789abcdef3 |   3.4.9 |  8.2 GB |     false |      false |      4470 |    2273805 |            2273805 | alarm:NOSPACE |
    +-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+---------------+

- Checking the number of objects by type shows that etcd is full of events:

    $ etcdctl --command-timeout=60s get --prefix --keys-only / | awk -F/ '/./ { print $3 }' | sort | uniq -c | sort -n
    <truncated>
        410 serviceaccounts
        788 configmaps
        807 oauth
       2734 secrets
    3977840 events

- Refer to how to list the number of objects and size in etcd on OpenShift for additional information about the number and size of the resources in etcd.
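The two counting pipelines used in this article (objects per resource type, and events per namespace) can be wrapped into small functions that read a key listing on stdin. This is only a sketch under stated assumptions: the function names count_by_type and events_by_namespace are hypothetical, and the counting is done in awk rather than with sort | uniq -c, so the output has no leading padding.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original article. Both read a key
# listing such as the output of `etcdctl get --prefix --keys-only /` on stdin.

# Count keys per resource type: the third "/"-separated field,
# for example "events" in /kubernetes.io/events/<namespace>/<name>.
count_by_type() {
    awk -F/ 'NF >= 3 && $3 != "" { n[$3]++ }
             END { for (t in n) print n[t], t }' | sort -n
}

# Count event keys per namespace: the fourth field of event keys.
events_by_namespace() {
    awk -F/ '$3 == "events" && $4 != "" { n[$4]++ }
             END { for (ns in n) print n[ns], ns }' | sort -n
}
```

Inside the etcdctl container these could be used as etcdctl --command-timeout=60s get --prefix --keys-only / | count_by_type, with the largest counts printed last.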
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.