
Server Crash of September 23rd, 2009

In the last few weeks, the web server has crashed several times, requiring manual restarts, until finally on Wednesday it went down for the count. Out of the four hard drives in the web server, two were completely dead, and a third was failing. The two dead drives were brand new, having been installed less than a month ago. Together, they comprised the RAID1 mirror that stored users' website files.

A RAID mirror provides redundancy by mirroring data between two hard drives. As long as they're both running as part of the RAID configuration, they contain identical data. This allows for one drive to fail, without losing data. However, if both drives fail simultaneously, as appears to be the case here, the data is gone.
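As a rough illustration of that behavior, here is a toy model of a two-drive mirror. It is only a sketch: the `Mirror` class and its methods are made up for this example and have nothing to do with how a real RAID controller or software RAID layer is implemented.

```python
class Mirror:
    """Toy model of a two-drive RAID1 mirror (illustrative only)."""

    def __init__(self):
        # Each drive is a dict of filename -> contents; None marks a dead drive.
        self.drives = [{}, {}]

    def write(self, name, data):
        # Every write goes to all drives that are still alive.
        for drive in self.drives:
            if drive is not None:
                drive[name] = data

    def fail(self, index):
        # Simulate a drive dying: its contents become unreadable.
        self.drives[index] = None

    def read(self, name):
        # Any surviving drive can serve the read.
        for drive in self.drives:
            if drive is not None:
                return drive[name]
        raise OSError("all drives in the mirror have failed; data is gone")


mirror = Mirror()
mirror.write("index.html", "<h1>hello</h1>")
mirror.fail(0)
survived = mirror.read("index.html")  # still readable from the second drive
mirror.fail(1)                        # after this, read() raises OSError
```

With one drive failed, the read still succeeds. Once the second drive fails as well, nothing is left to read from, which is the situation described above.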

Fortunately, we have reasonably fresh backups of all user files, except for logs and temporary files. The last backup before the crash ran on Tuesday morning, at around 3 am local time (Mountain Daylight Time). The webserver crashed at 11 am Wednesday morning, so any files uploaded between those times are lost.
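To make the loss window concrete, here is a small sketch that works out its length and checks whether a given upload time falls inside it. The dates are assumptions taken from the title (Wednesday being September 23rd, 2009), the backup time is only approximate, and the `is_lost` helper is hypothetical.

```python
from datetime import datetime, timedelta, timezone

MDT = timezone(timedelta(hours=-6))  # Mountain Daylight Time is UTC-6

# Assumed dates: Wednesday is taken to be September 23rd, 2009 (from the title).
LAST_BACKUP = datetime(2009, 9, 22, 3, 0, tzinfo=MDT)  # Tuesday, around 3 am
CRASH = datetime(2009, 9, 23, 11, 0, tzinfo=MDT)       # Wednesday, 11 am


def is_lost(uploaded_at):
    """True if a file uploaded at this time is missing from the backup."""
    return LAST_BACKUP <= uploaded_at <= CRASH


# Length of the window in hours (24 hours plus another 8).
window_hours = (CRASH - LAST_BACKUP) / timedelta(hours=1)
```

Under these assumptions the window is about 32 hours long, so a file uploaded on Tuesday afternoon would be affected while one uploaded on Monday would not.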

Having three out of four drives fail may indicate a cause other than chance drive failure. One cause could be a failing power supply, which could cause damage to any drives attached to the system. Or perhaps there's a fault with the motherboard. Facing the prospect of a systemic problem, I opted to migrate services off that server and onto the mail and database servers.

The web server is now running from what used to be the database server, and I've moved MySQL over to the mail server. Since both the mail and database servers were previously underutilized, it makes sense to combine their functions, and the new web server has the same specs as the old one. At this point there should be no noticeable degradation in performance, even though we're running on two servers instead of three.

I will need to do a bit more disk-swapping over the next few days, but as usual with planned maintenance I'll do this around 2-4 am local time.

Extent of Downtime and Data Loss

The system was offline for about an hour on Monday afternoon, another hour on Tuesday afternoon, and most of the day Wednesday.

Email service was also down during this time. This was due to the way the email delivery process is integrated with the rest of the system. Those who tried to send mail out were unable to do so for the duration of the downtime. But people trying to mail you from other networks (i.e., everyone else in the world) should only have had their mail delayed until the system was back up. Some emails sent in the evening bounced with an error message, due to a configuration error.

MySQL should be unaffected by this disruption.

Data loss is primarily limited to website files uploaded between Tuesday morning and Wednesday. If you didn't upload any files to your website during this time, it should be in good shape now. If you did, then those files will need to be re-uploaded.

HTTP access logs were intentionally not backed up: they are large files that change constantly, so they wreak havoc on my backup system, and they are relatively non-critical. However, the AWStats data files were backed up, so your web stats (at http://yourdomain.com/awstats/) are still there.

Going Forward

In order to try to prevent this situation from recurring, I plan to take the following steps:

  • Improve disk health monitoring, and replace any disks at the first sign of trouble.
  • Improve the mail delivery subsystem to continue operating in the absence of the web/file server.
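One possible shape for the disk health check in the first item is sketched below. It assumes the smartmontools package is installed and uses its `smartctl -H` health summary. The choice of tool, the function names, and the sample strings are assumptions made for illustration, not a description of the monitoring actually in place.

```python
import subprocess


def parse_health(output):
    """Return True if smartctl's health summary reports a passing status."""
    for line in output.splitlines():
        # ATA drives print "... test result: PASSED"; SCSI drives print
        # "SMART Health Status: OK".
        if "test result:" in line or "SMART Health Status:" in line:
            status = line.split(":", 1)[1].strip()
            return status in ("PASSED", "OK")
    return False  # no health line found: treat the disk as unhealthy


def check_disk(device):
    """Run `smartctl -H` on a device (needs smartmontools and root)."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True,
    )
    return parse_health(result.stdout)


# Sample outputs for trying the parser without touching a real disk.
SAMPLE_OK = "SMART overall-health self-assessment test result: PASSED\n"
SAMPLE_BAD = "SMART overall-health self-assessment test result: FAILED!\n"
```

A scheduled job could call `check_disk` for each drive and send an alert whenever it returns False. The overall health result is a fairly coarse signal, so a real setup would likely also watch individual SMART attributes.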

I'll also rebuild the former web server, putting in a new motherboard and power supply, and returning it to service.

My sincere apologies for the hassle and inconvenience caused by this disruption - please let me know if you're experiencing any problems with your service.

  © 2004 - 2010 Telana Internet Services sales@telana.com