Chronyd crashes when performing server leap smear

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 6
  • Red Hat Enterprise Linux 7
  • NTP Server running chronyd

Issue

  • When chronyd is configured with the smoothtime directive, it may crash.
  • Chronyd crashes while performing a server leap smear, with the following error message:

    smooth.c:164: update_stages: Assertion `dir <= 1 && l1 >= 0.0 && l3 >= 0.0' failed.

Resolution

The fix depends on the RHEL version:

  • For RHEL 7, update to chrony-2.1.1-4.el7_3 (released with RHBA-2016-2887) or later.
  • For RHEL 6, update to chrony-2.1.1-2.el6_8 (released with RHBA-2016-2944) or later.

Alternatively, any of the following workarounds may be implemented:

  • Disable the leap smear on the server (remove the smoothtime directive) and configure NTP clients to slew or ignore the leap second (e.g. add the leapsecmode slew directive with version 2.0+ of chrony or use the -x option with ntpd).
  • Configure the clients with additional leap-smearing NTP servers to increase the chance that at least one of them keeps working.
  • Remove the leaponly option and configure the clients to use just one server.
  • Minimize the chance of chronyd crashing by increasing the polling interval (e.g. by adding the maxpoll 12 and polltarget 4 options to all server directives in chrony.conf). A longer polling interval results in less accurate timekeeping; however, clients are typically less accurate during the leap smear anyway, so this may be an acceptable trade-off to prevent chronyd from crashing. If the servers polled by chronyd announce the leap second late (e.g. only one hour before UTC midnight), chronyd might miss the announcement because of the long polling interval. It is therefore recommended to use the leapsectz right/UTC directive with an up-to-date tzdata package, so that the leap smear is performed even if the servers announce the leap second too close to midnight.
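
As a rough sketch of the polling-interval workaround, the leap-smearing server's chrony.conf might contain something like the following. The hostname ntp.example.com and the smoothtime values are placeholders borrowed from the test configuration in Diagnostic Steps, not recommended settings; adjust them to the actual environment.

```
# Let the polling interval of the upstream source grow up to 2^12 = 4096 seconds
server ntp.example.com maxpoll 12 polltarget 4

# Take leap second information from the system timezone database,
# so a late announcement from the upstream server is not missed
# (requires an up-to-date tzdata package)
leapsectz right/UTC

# Leap smear served to this server's own clients (example values only)
smoothtime 400 0.001 leaponly
```

A maximum polling interval of 4096 seconds is slightly longer than one hour, which is why a leap second announced only an hour before midnight could otherwise be missed.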

Root Cause

When chronyd is configured with the smoothtime directive, and the smoothing process is updated with an extremely small offset, it may not be able to select a direction in which the offset needs to be smoothed out due to numerical errors in floating-point operations, resulting in an assertion failure.

Normally, the offset is large enough to not hit this problem, but with the leaponly option (which can be used to perform a synchronized leap smear on multiple servers) the smoothing process is updated with zero offset after the leap second is inserted, which creates ideal conditions for hitting this bug. The chance of a crash during the whole leap smear depends on the update interval, which in turn depends on the polling interval.

With a smoothtime wander of 0.001 ppm/s and a polling interval of 1024 seconds (the default maximum), the probability of a crash is roughly 1%; with a 1-second polling interval, it is roughly 50%.

Diagnostic Steps

  1. Prepare an NTP server that will simulate a leap second

  2. Configure chronyd as a client of that server with a 1-second polling interval, and have it perform a leap smear for its own clients. For example, the following configuration was used:

    server ntp.example.com minpoll 0 maxpoll 0
    leapsecmode slew
    maxslewrate 1000
    smoothtime 400 0.001 leaponly
    
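  3. Wait for the simulated leap second and then wait until the leap smear is finished (the progress can be monitored with chronyc smoothing). On an affected version, chronyd may abort during this period with the assertion failure shown in the Issue section.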

