   RabbitEars.Info   

Third Unscheduled Outage in a Month

I wanted to take a moment to talk a little about the sudden, recent unreliability of the RabbitEars server.

Yesterday's 10-hour outage was the third unscheduled outage in the past month. I'm not particularly happy, and this outage in particular made very plain that I, personally, am very dependent on RabbitEars and probably couldn't give it up even if I wanted to. At least a dozen times through the day, I went to look at something, remembered it was down, and had to find work-arounds where possible. So that's good news for all the people who use it.

Basically, I suspect there's some kind of underlying hardware issue that we have been unable to diagnose.

(If you don't want to read all the technical details, feel free to skip to the bottom.)

In the initial outage at the end of March, I woke up to find the server not running, and my ssh connection told me essentially that there was some kind of disk problem. The server has two 2TB SSDs in RAID-1, so a disk problem seems really odd, but that's what it told me. Silica Broadband rebooted it and it came back up without issue. There was nothing of value in the logs, and the SMART checks I ran found nothing of note. Since I was planning to visit the server in early April as part of a trip to see the total solar eclipse, I decided to bring a spare SSD just in case and try to wait it out.

I made it to the server just fine. On the day of the visit, during the drive there, I upgraded the operating system from Ubuntu 20.04 to 22.04, and it rebooted and came up without much issue. (An Apache setting must have gotten modified in the upgrade process, which caused the performance issue that I requested and received help resolving a week back. Thanks to Nate and Matthew for your help on that.) Once on site, I confirmed that it wasn't reporting any specific hardware issues, but I left the spare SSD with Silica Broadband just in case one of the disks decided to fail.

On Monday evening, it just up and stopped responding. Came right back up with a reboot, still nothing in the logs and nothing useful to report.

Finally, yesterday, the server was acting flaky in the morning. I did some checking and found that it had actually rebooted itself several times in the past few weeks without me realizing, another worrying sign. In what was definitely not a shining moment for my wisdom and intelligence, I decided it would be safe to try rebooting it. Had I been smart about it, knowing as I did that the system might have a hardware issue, I'd have run another full backup, as I did after the first outage and before the Ubuntu upgrade, and would have verified that my contact at Silica Broadband was available to look at it if something went wrong. Alas, I did not have that foresight, and so I rebooted it around 8:40AM, and it never came back up. I wrote to my contact and he informed me he was unavailable to look at it until after 4PM. So it was down all day.

When he got to it, it was reporting a disk issue again, this time on boot-up. It said it couldn't read a sector on the disk, then continued to try to boot and immediately had a kernel panic. After a few hours of troubleshooting by text message, during which the RAID array passed an fsck without issue, we eventually resolved the problem by booting an Ubuntu Live disk, chrooting into the installed system, and running update-initramfs. That implies it was a failed update rather than a hardware problem, but then why has this kept happening? It doesn't add up, and I am still suspicious that there's an underlying hardware issue.

Start Here If You Skipped

So all of this is a long, roundabout way of saying that I don't think the unreliability is over yet. It's my hardware, so the ultimate responsibility for a repair is mine, but if it won't tell me what's wrong, I'm not sure I can do much until it finally fails in a way that reveals what's actually going on. In preparation for such a failure, I'm going to do more frequent full backups of the server. That means there may be more periods of the server appearing slow while I have it pack up files and then copy them over the Internet to my computer at home. I ask for your patience during those periods and in the event of future outages. My intent is to ultimately resolve this issue and return to the long periods of reliability that I've had over the past three years.
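For anyone curious what the "pack up files" step might look like, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not the actual RabbitEars backup procedure: the function names and the archive naming scheme are hypothetical, and the directory paths are placeholders. The idea is to write a timestamped compressed archive and record a SHA-256 checksum, so the copy that arrives at home can be checked against the one that left the server.

```python
import hashlib
import tarfile
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_backup(source_dir: Path, dest_dir: Path) -> tuple[Path, str]:
    """Pack source_dir into a timestamped .tar.gz under dest_dir.

    Returns the archive path and its SHA-256 digest. dest_dir should live
    outside source_dir, otherwise the archive would try to include itself.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / f"backup-{stamp}.tar.gz"  # hypothetical naming scheme
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the directory name
        tar.add(source_dir, arcname=source_dir.name)
    return archive, sha256_of(archive)


if __name__ == "__main__":
    # Demo on a throwaway directory; a real run would point at the site's data.
    with tempfile.TemporaryDirectory() as tmp:
        site = Path(tmp) / "site"
        site.mkdir()
        (site / "index.html").write_text("placeholder page")
        archive, checksum = make_backup(site, Path(tmp) / "backups")
        print(archive.name, checksum)
```

After the archive has been copied home, running `sha256_of` on the downloaded file and comparing it with the value recorded on the server is a cheap way to confirm the transfer was not corrupted.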

Thanks so much for your time and your support. I greatly appreciate it.
