How to delete all kubernetes.io/events in etcd


Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • etcd
    • 3
  • events

Issue

  • There is a large number of events in etcd.
  • etcd is firing the NOSPACE alarm.
  • The OpenShift API is unavailable, or the oc delete command is too slow to delete all events.

Resolution

Kubernetes events have a time-to-live of 3 hours, as explained in time-to-live for the events in an RHOCP cluster, so a large number of events is usually the consequence of another issue, not the cause. After deleting the events as a workaround, investigate what caused so many events to be created.
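As a hedged illustration, the TTL the cluster actually applies can be read from the kube-apiserver operator's observed configuration. This is a sketch; `show_event_ttl` is a helper name introduced here, and the exact location of the `event-ttl` argument may vary between versions:

```shell
# Print the event-ttl argument from the kube-apiserver operator's
# observed configuration (sketch; field location may vary by version).
show_event_ttl() {
  oc get kubeapiserver cluster -o yaml | grep -A 1 'event-ttl'
}
```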

IMPORTANT NOTE: Do not delete any resources other than events with this procedure. Only events should be deleted this way; deleting any other resource from etcd following this procedure could cause the cluster to fail. For other resources, investigate the reason for the large number of instances and delete them only with oc delete [resource_name] -n [namespace_name].
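When the API is still responsive, events in a single namespace can be removed through the supported client path instead. A minimal sketch; the function name and namespace are illustrative:

```shell
# Delete all events in one namespace through the API (supported path).
delete_ns_events() {
  ns="$1"
  oc delete events --all -n "${ns}"
}

# Example: delete_ns_events my-app-namespace
```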

Deleting events from etcd

Before deleting events from etcd, check that an etcd backup is available.

  1. Connect to the etcd pod from CoreOS:

    $ ssh -i <identity> core@master
    core@master$ sudo -i
    root@master# crictl exec -ti $(crictl ps --label "io.kubernetes.container.name=etcdctl" -q) /bin/sh
    
  2. Check number of events and etcd status (confirm that etcd is filled with events):

    $ etcdctl --command-timeout=60s get --prefix --keys-only / | awk -F/ '/./ { print $3 }' | sort | uniq -c | sort -n
    <truncated>
        410 serviceaccounts
        788 configmaps
        807 oauth
       2734 secrets
    3977840 events
    
  3. Check the etcd endpoint status; all endpoints must be reachable. A NOSPACE alarm may be firing:

    $ etcdctl endpoint status -w table
    +-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+---------------+
    |         ENDPOINT      |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS        |
    +-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+---------------+
    | https://10.0.0.1:2379 | 123456789abcdef1 |  3.4.9  |  8.2 GB |     false |      false |      4470 |    2273805 |            2273805 | alarm:NOSPACE |
    | https://10.0.0.2:2379 | 123456789abcdef2 |  3.4.9  |  8.2 GB |      true |      false |      4470 |    2273805 |            2273805 | alarm:NOSPACE |
    | https://10.0.0.3:2379 | 123456789abcdef3 |  3.4.9  |  8.2 GB |     false |      false |      4470 |    2273805 |            2273805 | alarm:NOSPACE |
    +-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+---------------+
    
    $ etcdctl alarm list
    memberID:123456789abcdef1 alarm:NOSPACE
    memberID:123456789abcdef2 alarm:NOSPACE
    memberID:123456789abcdef3 alarm:NOSPACE
    
  4. Check which namespaces generate the largest number of events; this can help identify the root cause or misbehaving applications:

    $ etcdctl --command-timeout=60s get --prefix --keys-only / | awk -F/ '/./ { print $3 " " $4}' | grep events | sort | uniq -c | sort -n
    
  5. Delete all events from etcd. Run the following loop to start the deletion process:

        # Variable definition
        # COUNT: number of events deleted per request
        # FROM: first event key in etcd
        # TO: last event key to delete in the query
        # NUM: number of events deleted
    
        COUNT=10000
        FROM="$(etcdctl --command-timeout=60s get '/kubernetes.io/events/' --prefix --keys-only --limit 1)"
        while :; do
          TO="$(etcdctl get '/kubernetes.io/events/' --command-timeout=60s --prefix --keys-only --limit ${COUNT} | sed '/^$/d' | tail -1)"
          [ $(etcdctl get ${FROM} ${TO} --command-timeout=60s --keys-only | grep -vEc "^$|^/kubernetes.io/events/") -eq 0 ] && NUM=$(etcdctl --command-timeout=60s del ${FROM} ${TO}) || { echo "Non event key found, aborting..." ; break ;}
          [ "${NUM}" == "0" ] && echo "All events deleted" && break
          echo "${NUM} events deleted"
        done
    
        <truncated>
        9999 events deleted
        9999 events deleted
        411 events deleted
        All events deleted
    
  6. Free the disk space. Although all events have been deleted, etcd disk usage has not changed. To reclaim the space, etcd must be compacted and defragmented as explained in how to compact and defrag etcd to decrease database size in OpenShift 4.
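The compaction and defragmentation referenced in step 6 can be sketched as follows, run from inside the etcdctl container. This is a sketch only (`compact_and_defrag` is a name introduced here); refer to the linked article for the full, supported procedure:

```shell
# Compact the keyspace up to the current revision, defragment the member,
# and clear the NOSPACE alarm (sketch; run from the etcdctl container).
compact_and_defrag() {
  # Current revision, taken from the endpoint status JSON output.
  rev=$(etcdctl endpoint status --write-out json \
        | grep -oE '"revision":[0-9]+' | grep -oE '[0-9]+' | head -1)
  etcdctl compact "${rev}"
  etcdctl defrag --command-timeout=60s
  etcdctl alarm disarm
}
```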

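The per-namespace breakdown from step 4 can also be wrapped as a reusable filter over a key dump. A sketch; `events_per_namespace` is a name introduced here, assuming the `/kubernetes.io/events/<namespace>/<name>` key layout:

```shell
# Count events per namespace from an etcdctl --keys-only dump on stdin.
# Assumed key layout: /kubernetes.io/events/<namespace>/<event-name>
events_per_namespace() {
  awk -F/ '$2 == "kubernetes.io" && $3 == "events" { count[$4]++ }
           END { for (ns in count) print count[ns], ns }' | sort -n
}

# Usage:
#   etcdctl get /kubernetes.io/events/ --prefix --keys-only | events_per_namespace
```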
Root Cause

A large number of events is created due to cluster issues (such as operators trying to remediate the cluster state, or an internal or external component creating objects in an infinite loop). This can cause poor etcd performance with many leader re-elections. Investigate the underlying issue, as a large number of events could be created again in a short time.

Diagnostic Steps

  • Most API calls are failing with etcdserver: mvcc: database space exceeded or etcdserver: leader changed.

  • oc describe node takes several minutes to run, and oc get events is unusable in many namespaces.

  • Checking the etcd size shows that it is full (more than 8 GB, which is the default maximum):

      $ etcdctl endpoint status -w table
      +-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+---------------+
      |         ENDPOINT      |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS        |
      +-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+---------------+
      | https://10.0.0.1:2379 | 123456789abcdef1 |  3.4.9  |  8.2 GB |     false |      false |      4470 |    2273805 |            2273805 | alarm:NOSPACE |
      | https://10.0.0.2:2379 | 123456789abcdef2 |  3.4.9  |  8.2 GB |      true |      false |      4470 |    2273805 |            2273805 | alarm:NOSPACE |
    | https://10.0.0.3:2379 | 123456789abcdef3 |  3.4.9  |  8.2 GB |     false |      false |      4470 |    2273805 |            2273805 | alarm:NOSPACE |
      +-----------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+---------------+
    
  • Checking the number of objects by type shows that etcd is full of events:

      $ etcdctl --command-timeout=60s get --prefix --keys-only / | awk -F/ '/./ { print $3 }' | sort | uniq -c | sort -n
      <truncated>
          410 serviceaccounts
          788 configmaps
          807 oauth
         2734 secrets
      3977840 events
    
  • Refer to how to list the number of objects and size in etcd on OpenShift for additional information about number and size of the resources in etcd.

