It’s time to leave the leap second in the past

The leap second concept was first introduced in 1972 by the International Earth Rotation and Reference Systems Service (IERS) in an attempt to periodically update Coordinated Universal Time (UTC) to account for imprecise observed solar time (UT1) and the long-term slowdown in the Earth’s rotation. This periodic adjustment mainly benefits scientists and astronomers, as it allows them to observe celestial bodies using UTC for most purposes. If there were no UTC correction, adjustments would have to be made to the legacy equipment and software that synchronize to UTC for astronomical observations.

As of today, since the introduction of the leap second, UTC has been updated 27 times.

While the leap second might have been an acceptable solution in 1972, when it made both the scientific community and the telecommunications industry happy, today UTC is equally bad for digital applications and for scientists, who often choose TAI or UT1 instead.

At Meta, we’re supporting an industry effort to stop future introductions of leap seconds and stay at the current level of 27. Introducing new leap seconds is a risky practice that does more harm than good, and we believe it is time to introduce new technologies to replace it.

Leap of faith

One of the many factors that contribute to irregularities in the Earth’s rotation is the constant melting and refreezing of ice caps on the world’s highest mountains. This phenomenon can be visualized by thinking of a spinning figure skater, who manages angular velocity by controlling their arms and hands. As the arms extend, the angular velocity decreases while the skater’s angular momentum is conserved. As soon as the skater pulls the arms back in, the angular velocity increases.

Figure: A spinning figure skater illustrates the change in angular velocity.

So far, only positive leap seconds have been added. In the early days, this was done by simply adding an extra second, resulting in an unusual timestamp:

23:59:59 -> 23:59:60 -> 00:00:00

At best, this time jump resulted in crashed programs or even corrupted data, due to strange timestamps in the data storage.
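As a small illustration of why the 23:59:60 timestamp is troublesome, the sketch below feeds it to Go's standard library parser. Go's `time` package assumes a calendar with no leap seconds, so the parser rejects a seconds value of 60. The helper name `parseUTC` and the sample dates are assumptions made for this example.

```go
package main

import (
	"fmt"
	"time"
)

// parseUTC parses an RFC 3339 timestamp with Go's standard library.
// (Hypothetical helper name, used only for this illustration.)
func parseUTC(s string) (time.Time, error) {
	return time.Parse(time.RFC3339, s)
}

func main() {
	for _, s := range []string{
		"2016-12-31T23:59:59Z", // ordinary second
		"2016-12-31T23:59:60Z", // the inserted leap second
	} {
		if _, err := parseUTC(s); err != nil {
			fmt.Printf("%s: rejected (%v)\n", s, err)
		} else {
			fmt.Printf("%s: accepted\n", s)
		}
	}
}
```

Any system that stores or exchanges the literal leap second timestamp therefore has to decide what to do when a consumer cannot parse it.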

With the Earth’s rotation pattern changing, it is very likely that we will get a negative leap second at some point in the future. The timestamp will then look like this:

23:59:58 -> 00:00:00

The impact of a negative leap second has never been tested on a large scale; it could have a devastating effect on software that relies on timers or schedulers.

In any case, every leap second is a major source of pain for people who manage hardware infrastructures.

Smearing

More recently, it has become common practice to “smear” a leap second by simply slowing down or speeding up the clock. There is no universal way to do this, but at Meta we smear the leap second over 17 hours, starting at 00:00:00 UTC based on the contents of the time zone data (tzdata) package.

Figure: Leap second smearing at Meta.

Let’s break it down a bit.

We chose a duration of 17 hours mainly because smearing happens at Stratum 2, where hundreds of NTP servers are smearing at the same time. To ensure that the difference between them is tolerable, the steps must be minimal. If the smearing steps are too large, NTP clients may consider some devices faulty and exclude them from the quorum, which may cause an outage.

The starting point at 00:00:00 UTC is also not standardized, and there are many possible options. For example, some companies start smearing at 12:00:00 UTC the day before and continue for 24 hours; some do it two hours before the event, and others right on the edge.

There are also different algorithms for the smearing itself. There is kernel leap second correction, linear smearing (when equal steps are applied), cosine smearing, and quadratic smearing (used by Meta). The algorithms are based on different mathematical models and produce different offset graphs:

Figure: Kernel leap second correction with ntpd.

The source of the leap indicator differs between GNSS constellations (e.g., GPS, GLONASS, Galileo, and BeiDou). In some cases, it is broadcast by satellite several hours in advance. In other cases, the time is propagated in UTC with the leap already applied. The leap second value also differs between constellations, depending on when each constellation was launched.

Figure: Difference in leap second values between GNSS constellations.

All of this requires non-trivial conversion logic inside the time sources, including our own Time Appliance. Losing a GNSS signal during such a sensitive time can lead to the loss of the leap indicator and a split-brain situation, potentially leading to an outage.

The leap event is also propagated via the tzdata package months in advance, and for ntpd fans, via a leap second file distributed through the Internet Engineering Task Force (IETF) website. Not having a fresh copy of the file can lead to a missed leap second and cause an outage.

As already mentioned, smearing is a very sensitive moment. If the NTP server is restarted during this period, we will likely end up with either “old” or “new” time, which can propagate to clients and cause an outage.

Because of these ambiguities, public NTP pools do not smear, sometimes passing a leap indicator to clients and leaving them to figure it out. SNTP clients often end up stepping their clock and facing the consequences described above. Smarter clients can choose a default strategy to smear the leap locally. All in all, this means that big players like Meta, who do smear on their public services, cannot join public pools.

And even after the leap, things are still at risk. NTP software must constantly apply an offset relative to the time source it uses (GNSS, TAI, or an atomic clock), and PTP software must propagate the so-called UTC offset flag in its announce messages.

The negative impact of leap seconds

The leap second and the offset it creates cause problems across the entire industry. One of the easiest ways to cause an outage is to bake in the assumption that time always moves forward. Let’s say we have code like this:

start := time.Now()

// do something

spent := time.Now().Sub(start)

Depending on how spent is used, we can end up in a situation where we rely on a negative value during a leap second event. Such assumptions have caused numerous outages, and there are many articles describing such cases.

In 2012, Reddit experienced a massive outage because of a leap second; the site was inaccessible for 30 to 40 minutes. This happened when the time change confused the high-resolution timer (hrtimer), sparking hyperactivity on the servers, which locked up the machines’ CPUs.

In 2017, Cloudflare published a very detailed article about the impact of a leap second on the company’s public DNS. The root cause of the bug that affected their DNS service was the belief that time cannot go backward. The code took the upstream time values and fed them to Go’s rand.Int63n() function. The rand.Int63n() function promptly panicked because the argument was negative, causing the DNS server to fail.

Looking beyond the leap second

Leap second events have caused problems across the industry and continue to present many risks. As an industry, we run into problems whenever a leap second is introduced. And because it’s such a rare event, it devastates the community every time it happens. With increasing demand for clock precision across industries, the leap second is now causing more harm than good, resulting in disturbances and outages.

As Meta engineers, we are supporting a larger community push to stop the future introduction of leap seconds and stay at the current level of 27, which we believe will be sufficient for the next millennium.
