Chronyd crashes when performing server leap smear
Environment
- Red Hat Enterprise Linux 6
- Red Hat Enterprise Linux 7
- NTP Server running chronyd
Issue
- When chronyd is configured with the smoothtime directive, there is a chance it will crash.
- Chronyd crashes when performing a server leap smear with the following error message:
smooth.c:164: update_stages: Assertion `dir <= 1 && l1 >= 0.0 && l3 >= 0.0' failed.
Resolution
This issue is addressed, based on RHEL version, below:
- For RHEL 7, update to chrony-2.1.1-4.el7_3 (released with RHBA-2016-2887) or later to prevent this issue.
- For RHEL 6, update to chrony-2.1.1-2.el6_8 (released with RHBA-2016-2944) or later to prevent this issue.
Alternatively, any of the following workarounds may be implemented:
- Disable the leap smear on the server (remove the smoothtime directive) and configure NTP clients to slew or ignore the leap second (e.g. add the leapsecmode slew directive with version 2.0+ of chrony or use the -x option with ntpd).
- Configure the clients with more leap smearing NTP servers in order to increase the chance they will have at least one working server.
- Remove the leaponly option and configure the clients to use just one server.
- Alternatively, the chance of chronyd crashing may be minimized by increasing the polling interval (e.g. by adding the maxpoll 12 and polltarget 4 options to all server directives in chrony.conf). Increasing the polling interval will result in less accurate timekeeping; however, clients will typically be less accurate during the leap smear anyway, so this may be an acceptable trade-off to prevent chronyd from crashing. If the servers polled by chronyd announce the leap second late (e.g. only one hour before UTC midnight), the server might miss the announcement due to the long polling interval. It is recommended to use the leapsectz right/UTC directive with an up-to-date tzdata package to make sure the leap smear is performed even if the servers announce the leap second too close to midnight.
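The polling-interval workaround above could be sketched as a small shell helper. The function name apply_poll_workaround is hypothetical, and the sketch assumes the existing server lines do not already set maxpoll or polltarget. It only prints a modified copy of the configuration and never edits the original file.

```shell
#!/bin/sh
# Sketch only: print a copy of a chrony configuration with the
# polling-interval workaround applied. The file named by $1 is read,
# the result goes to stdout, and the original is left untouched.
apply_poll_workaround() {
    conf="$1"
    # Append the workaround options to every "server" line.
    # Assumption: those lines do not already set maxpoll or polltarget.
    sed '/^[[:space:]]*server[[:space:]]/ s/$/ maxpoll 12 polltarget 4/' "$conf"
    # Add leapsectz unless the file already configures it, so the leap
    # second is known even if the upstream servers announce it late.
    if ! grep -q '^[[:space:]]*leapsectz[[:space:]]' "$conf"; then
        echo 'leapsectz right/UTC'
    fi
}
```

One possible use is to run apply_poll_workaround /etc/chrony.conf, redirect the output to a scratch file, review the result, and only then replace the configuration and restart chronyd.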
Root Cause
When chronyd is configured with the smoothtime directive, and the smoothing process is updated with an extremely small offset, it may not be able to select a direction in which the offset needs to be smoothed out due to numerical errors in floating-point operations, resulting in an assertion failure.
Normally, the offset is large enough to not hit this problem, but with the leaponly option (which can be used to perform a synchronized leap smear on multiple servers) the smoothing process is updated with zero offset after the leap second is inserted, which creates ideal conditions for hitting this bug. The chance of a crash over the whole leap smear depends on the update interval, which in turn depends on the polling interval.
With a smoothtime wander of 0.001 ppm/s and a polling interval of 1024 seconds (the default maximum), the probability of a crash is roughly 1%; with a 1-second polling interval, the probability of a crash is roughly 50%.
Diagnostic Steps
- Prepare an NTP server that will simulate a leap second.
- Configure chronyd as a client of the server using a 1-second polling interval and performing a leap smear for its own clients. For example, the following configuration was used:
    server ntp.example.com minpoll 0 maxpoll 0
    leapsecmode slew
    maxslewrate 1000
    smoothtime 400 0.001 leaponly
- Wait for the simulated leap second and then wait until the leap smear is finished (the progress can be monitored with chronyc smoothing).
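The last step can be left running unattended with a loop along these lines. This is a minimal sketch: the function name watch_smoothing and the default 60-second interval are assumptions, and it relies on pgrep being available to detect whether the chronyd process still exists.

```shell
#!/bin/sh
# Sketch only: print the smoothing status at a fixed interval for as
# long as a chronyd process exists, then report that it is gone.
watch_smoothing() {
    interval="${1:-60}"   # seconds between status checks (assumed default)
    while pgrep -x chronyd >/dev/null 2>&1; do
        chronyc smoothing
        sleep "$interval"
    done
    echo "chronyd is not running"
}
```

If the loop ends while the leap smear is still in progress, chronyd has stopped and the assertion failure described in this article is a likely cause. A loop that keeps running until the smear completes suggests the issue was not reproduced on that attempt.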