etcd pod is restarting frequently.


Environment

  • Red Hat OpenShift Container Platform
    • 3.10.
    • 3.11.

Issue

  • The liveness probe for the master-etcd pod failed with errors such as:
rafthttp: the clock difference against peer XXXX is too high [1.46664075s > 1s]
rafthttp: the clock difference against peer XXXX is too high [3.281962067s > 1s]

Liveness probe for master-etcd-master.example.com(XXXX):etcd failed (failure): member XXXX is unhealthy: got unhealthy result from https://ip-address:2379

member XXXX is unhealthy: got unhealthy result from https://ip-address:2379
member XXXX is unhealthy: got unhealthy result from https://ip-address:2379
  • I/O timeout errors for the etcd members.
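The offsets quoted in the rafthttp messages can be pulled out quickly to gauge how far the clocks have drifted; a minimal sketch using a sample file (on a live master you would pipe the output of the `master-logs` command shown under Diagnostic Steps into the same filter instead):

```shell
# Extract the reported clock offsets (in seconds) from etcd clock-skew warnings.
# The sample lines mirror the messages above; /tmp/etcd-sample.log stands in
# for the real etcd pod log.
cat <<'EOF' > /tmp/etcd-sample.log
rafthttp: the clock difference against peer XXXX is too high [1.46664075s > 1s]
rafthttp: the clock difference against peer XXXX is too high [3.281962067s > 1s]
EOF
sed -n 's/.*\[\([0-9.]*\)s > 1s\].*/\1/p' /tmp/etcd-sample.log
```

Any value above 1 (the tolerance etcd prints in the message) will trigger the warning.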

Resolution

  • Keep the system clocks of all etcd hosts synchronized, for example by enabling chronyd or ntpd on every master, so that the drift between members stays well below the 1-second tolerance etcd reports. Once the clocks are in sync, the I/O timeouts and liveness probe failures should stop.

Root Cause

  • The etcd servers' clocks are out of sync with each other, which causes I/O timeouts between the members.
  • Because of the I/O timeouts, the liveness probe fails, which makes the etcd pod restart frequently.
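The skew described above can be measured directly by comparing timestamps taken on two members; a small sketch (the peer hostname is hypothetical, and in practice chronyc/ntpq give more precise figures):

```shell
# clock_skew A B -> absolute difference between two epoch timestamps, in seconds.
clock_skew() {
  awk -v a="$1" -v b="$2" 'BEGIN { d = a - b; if (d < 0) d = -d; printf "%.6f\n", d }'
}

# Against a live peer you might compare (hostname hypothetical):
#   clock_skew "$(date +%s.%N)" "$(ssh master2.example.com date +%s.%N)"
clock_skew 1700000001.50 1700000000.00
```

A result near or above 1 second matches the rafthttp warnings and points to NTP as the first thing to fix.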

Diagnostic Steps

  • Check the etcd pod logs:
/usr/local/bin/master-logs etcd etcd
  • Check the atomic-openshift-node service logs:
journalctl -u atomic-openshift-node --no-pager

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.