This past week has been the worst for Campfire reliability since we launched the service nearly five years ago. We can't say we're sorry enough for the terrible show we put on. Thursday last week, and Monday this week, were particularly damaging as the service spent hours on both days bouncing up and down. As if that wasn't bad enough, we added insult to that injury with another 5-minute outage on Tuesday and another 15-minute outage on Wednesday.
This kind of reliability would be bad for any web application that people depend on, but it was extra bad for Campfire because it's all about real time. Campfire is the most important tool we have to coordinate the work at 37signals and it's highly disruptive when it's not working. We know this is true for most of our customers too.
Campfire needs to have dial-tone level reliability to be trustworthy. For the past five years, up until this week, we believe we've had a pretty good track record of that. But given the severity of this outage, we know it'll take us a long time to rebuild this trust. Many thanks to those of you who are giving us that second chance.
A free month of Campfire for your trouble
By way of apology, we're going to give all Campfire customers a free month of service in the form of a credit. This means that your next invoice is on us. We didn't earn your money this month. If you want to cancel your service, please write me directly at david@37signals.com and I'll issue a refund instead.
If you're a 37signals Suite customer and use Campfire, we're offering a 20% credit on your next bill. Please write me directly at david@37signals.com with your suite URL and I will apply the credit.
The technical lowdown
Now for the technical walk-through. The summary is that we've isolated three distinct root causes. They are 1) faulty CPUs, 2) a filesystem bug with XFS, and 3) a Linux kernel bug. Accidents rarely come alone, it appears, and we were hit with this triple whammy all in one week.
The faulty CPU problem first showed up as unexplained, random crashes. There didn't seem to be any rhyme or reason to them until we discovered peaks in temperature readings just before the crashes. We consulted experts, who determined that the CPUs themselves were at fault. All of these CPUs have now been replaced (in fact, we moved to entirely new machines).
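To illustrate the kind of correlation described above, here is a minimal sketch that flags crashes preceded by a temperature peak. It is not the tooling we used. The function name, the 85 °C threshold, and the 10-minute window are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def crashes_preceded_by_peak(readings, crashes, threshold_c=85.0,
                             window=timedelta(minutes=10)):
    """Return the crash times that had a hot reading shortly before them.

    readings: iterable of (timestamp, temperature_in_celsius) tuples.
    crashes:  iterable of crash timestamps.
    threshold_c and window are illustrative assumptions, not measured values.
    """
    readings = list(readings)
    flagged = []
    for crash in crashes:
        start = crash - window
        # A crash is flagged if any reading inside the window hit the threshold.
        if any(start <= ts <= crash and temp >= threshold_c
               for ts, temp in readings):
            flagged.append(crash)
    return flagged
```

If most crashes come back flagged while the machine otherwise runs cool, that points toward hardware rather than software.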
The filesystem bug was triggered by the combination of running xtrabackup for MySQL and using O_DIRECT as the innodb_flush_method. The bug has been diagnosed, and a configuration workaround is described on this ticket for Percona-XtraBackup.
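If you want to check whether a server is running with the setting named above, a rough sketch like the following reads the text of a MySQL option file and reports the innodb_flush_method under [mysqld]. The function names are hypothetical, and the parser is deliberately simple: it ignores !include directives and trailing inline comments, so treat it as a starting point only.

```python
def innodb_flush_method(cnf_text):
    """Return the last innodb_flush_method set under [mysqld], or None."""
    section = None
    value = None
    for raw in cnf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", ";")):
            continue
        # Track which [section] we are currently in.
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip().lower()
            continue
        if section != "mysqld" or "=" not in line:
            continue
        key, _, val = line.partition("=")
        # MySQL accepts dashes and underscores interchangeably in option names.
        if key.strip().lower().replace("-", "_") == "innodb_flush_method":
            value = val.strip()
    return value


def uses_o_direct(cnf_text):
    """True when the option file sets innodb_flush_method to O_DIRECT."""
    method = innodb_flush_method(cnf_text)
    return method is not None and method.upper() == "O_DIRECT"
```

You would call it with the contents of your own option file, for example `uses_o_direct(open("/etc/mysql/my.cnf").read())`, where the path is an assumption about your setup.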
The kernel crash bug turned out to be the most serious issue and the hardest to diagnose. Given that we already had problems with the CPUs and the filesystem, we failed to imagine that there could be a third root cause at play. The bug has been documented on Kernel.org in this report about divide-by-zero crashes on 2.6.32 kernels with long uptimes.
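Since the bug above involves 2.6.32 kernels with long uptimes, one simple precaution is to flag hosts that match both conditions. The sketch below is an assumption-laden example: the helper names are made up, and the 180-day default is a placeholder, not the actual trigger point, which you should take from the kernel report.

```python
import platform


def parse_uptime_days(proc_uptime_text):
    """Convert the contents of /proc/uptime into days of uptime.

    The first field of /proc/uptime is the uptime in seconds.
    """
    seconds = float(proc_uptime_text.split()[0])
    return seconds / 86400.0


def needs_attention(kernel_release, uptime_days, max_days=180.0):
    """True for a 2.6.32 kernel whose uptime has reached max_days.

    max_days is a hypothetical threshold chosen for illustration.
    """
    return kernel_release.startswith("2.6.32") and uptime_days >= max_days


def check_local_host(max_days=180.0):
    """Check the machine this runs on (Linux only, reads /proc/uptime)."""
    with open("/proc/uptime") as f:
        days = parse_uptime_days(f.read())
    return needs_attention(platform.release(), days, max_days)
```

Running `check_local_host()` from a monitoring job would give early warning before a host drifts into the affected range.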
Fail-over improvements and faster response times
All these issues highlighted a range of deficiencies in our fail-over procedures. While every server had a backup machine ready to go, it often took too long to fail over and the process wasn't push-button enough. We're going to develop better automated fail-over and, at the very least, lightning-fast manual fail-over.
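To make the idea of automated fail-over concrete, here is a minimal sketch of one common pattern: promote the standby after a set number of consecutive failed health checks. It does not describe our actual setup. The class name, the `check` and `promote` callables, and the default of three failures are all assumptions for illustration.

```python
class FailoverMonitor:
    """Promote a standby after max_failures consecutive failed health checks."""

    def __init__(self, check, promote, max_failures=3):
        self.check = check            # callable returning True when the primary is healthy
        self.promote = promote        # callable that switches traffic to the standby
        self.max_failures = max_failures
        self.failures = 0
        self.failed_over = False

    def tick(self):
        """Run one health check; return True only on the tick that fails over."""
        if self.failed_over:
            return False              # already on the standby, nothing more to do
        if self.check():
            self.failures = 0         # a healthy check resets the streak
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.promote()
            self.failed_over = True
            return True
        return False
```

Requiring several consecutive failures avoids failing over on a single dropped check, at the cost of a slightly slower reaction. The same `promote` callable can double as the push-button manual path.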
No data was lost in any of the incidents, but plenty of goodwill and trust was. We're so sorry for causing this prolonged disruption.




