We Got Bit By the Leap Second Bug
Did you know there was a leap second going from Saturday June 30th into Sunday July 1st? At midnight Greenwich Mean Time (GMT), the clock was held back a second to keep time in sync with the planet’s daily rotation.
Sounds harmless – more of an opportunity for jokes about what you’ll do with an extra second of time…
Well, turns out that the Linux kernel had a small bug related to the seconds not advancing at midnight GMT. This timing issue did NOT affect any of our hosting servers.
However, one of our backup servers that we use for 6-hour disaster recovery backups did not fare so well.
Sunday morning, I noticed the CPU for this server was pegged at 100%, and the backup software was sluggish at best. I knew about the leap second, but didn’t connect the two events. I thought the backup software was malfunctioning.
I sent an emergency email to the backup software company. They started looking at it, and we tried upgrading, stopping and starting, and collecting data. Still no solution.
Fast forward to Monday, lots of back and forth, and finally one of the support guys at the backup company says “maybe it’s the leap second bug”.
A simple reboot of the server, and everything was back to normal. Turns out the Java software that powers the backup system was highly susceptible to this timing bug. It did not like the time not advancing into the next day.
With Linux servers, a reboot *almost never* solves anything, as reboots are usually only needed to load a new kernel (and with newer technology we can perform kernel updates live without a reboot). But in this case, it was actually needed to clear a deadlock condition.
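For the curious, here is a rough Python sketch of why a leap second is awkward for software in the first place. It is only an illustration of the timekeeping quirk, not the kernel bug or the backup vendor's code, and the helper name `can_represent_leap_second` is made up for this example. Unix time has no slot for the extra 23:59:60 second, so the timestamps on either side of it are counted as one second apart even though two real seconds passed.

```python
import calendar
from datetime import datetime, timezone

# Last ordinary second before the 2012 leap second, and the first second
# after it, both in UTC, converted to Unix timestamps.
before = calendar.timegm((2012, 6, 30, 23, 59, 59))
after = calendar.timegm((2012, 7, 1, 0, 0, 0))

# Unix time treats these as 1 second apart, although the real sequence was
# 23:59:59 -> 23:59:60 -> 00:00:00 (2 elapsed seconds).
unix_gap = after - before


def can_represent_leap_second():
    """Hypothetical helper: can Python's datetime hold 23:59:60 at all?"""
    try:
        datetime(2012, 6, 30, 23, 59, 60, tzinfo=timezone.utc)
        return True
    except ValueError:
        # datetime only accepts seconds in the range 0..59
        return False


print("Unix seconds between the two timestamps:", unix_gap)
print("datetime can represent 23:59:60:", can_represent_leap_second())
```

Since the extra second has nowhere to go, the system clock has to repeat or stall for a moment, and that is the kind of situation timing-sensitive software may not expect.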
It’s amazing what a simple “extra” second can do to a complex computer system.