Server Crash of September 23rd, 2009
In the last few weeks, the web server has crashed several times,
requiring manual restarts, until finally on Wednesday it went down
for the count. Out of the four hard drives in the web server, two
were completely dead, and a third was failing. The two dead drives
were brand new, having been installed less than a month ago.
Together, they made up the RAID1 mirror that stored users'
website files.
A RAID mirror provides redundancy by mirroring data between
two hard drives. As long as they're both running as part of the
RAID configuration, they contain identical data. This allows
one drive to fail without losing data. However, if both drives
fail simultaneously, as appears to be the case here, the data
is gone.
Fortunately, we have reasonably fresh backups of all user files,
except for logs and temporary files. The last backup before the
crash ran on Tuesday morning, at around 3 am local time (Mountain
Daylight Time). The web server crashed at 11 am on Wednesday,
so any files uploaded between those times are lost.
Having three out of four drives fail may indicate a cause
other than chance drive failure. One cause could be a failing
power supply, which could cause damage to any drives attached
to the system. Or perhaps there's a fault with the motherboard.
Facing the prospect of a systemic problem, I opted to migrate
services off that server and onto the mail and database servers.
The web server is now running from what used to be the database
server, and I've moved MySQL over to the mail server. Since both the
mail and database servers were previously underutilized, it makes
sense to combine their functions, and the new web server has the
same specs as the old one. At this point there should be no noticeable
degradation in performance, even though we're running on two servers
instead of three.
I will need to do a bit more disk-swapping over
the next few days, but, as usual with planned maintenance, I'll do this
between 2 and 4 am local time.
Extent of Downtime and Data Loss
The system was offline for about an hour on Monday afternoon,
another hour on Tuesday afternoon, and most of the day Wednesday.
Email service was also down during this time. This was due to
the way the email delivery process is integrated with the rest of
the system. Those who tried to send mail out were unable to do so
for the duration of the downtime. But people trying to mail you
from other networks (i.e., everyone else in the world) should have
only had their mail delayed until the system was back up. Some
emails sent in the evening bounced with an error message, due to
a configuration error.
MySQL should be unaffected by this disruption.
Data loss is primarily limited to website files uploaded between
Tuesday morning and Wednesday. If you didn't upload any files to your
website during this time, it should be in good shape now. If you did,
then those files will need to be re-uploaded.
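If you keep a local copy of your site, a short script can help find
candidates for re-upload. The following is only a sketch in Python, not
a tool I run on the server: the calendar dates are inferred from the
weekdays above (Tuesday, September 22 and Wednesday, September 23),
the times are approximate, and the function names are made up for
illustration. It also assumes that a file's local modification time is
a reasonable stand-in for when it was uploaded, which may not hold for
every file.

```python
import os
from datetime import datetime, timedelta, timezone

# Mountain Daylight Time is UTC-6.
MDT = timezone(timedelta(hours=-6), "MDT")

# Approximate window from this notice: last backup around 3 am Tuesday,
# crash at 11 am Wednesday. The dates are inferred from the weekdays.
LAST_BACKUP = datetime(2009, 9, 22, 3, 0, tzinfo=MDT)
CRASH = datetime(2009, 9, 23, 11, 0, tzinfo=MDT)


def in_loss_window(moment):
    """Return True if a timezone-aware datetime falls between backup and crash."""
    return LAST_BACKUP <= moment <= CRASH


def files_to_reupload(local_root):
    """List files under local_root whose modification time is inside the window.

    Modification time is only a proxy for upload time (an assumption).
    """
    hits = []
    for folder, _subdirs, names in os.walk(local_root):
        for name in names:
            path = os.path.join(folder, name)
            modified = datetime.fromtimestamp(os.path.getmtime(path), tz=MDT)
            if in_loss_window(modified):
                hits.append(path)
    return sorted(hits)
```

Calling files_to_reupload() with the path to your local copy returns
the files worth checking. A file you edited earlier but uploaded during
the window will not show up, so treat the result as a starting point.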
HTTP access logs were intentionally not backed up. They are large
files that change constantly, so they wreak havoc on my backup
system, and they are also relatively non-critical. However, the
AWStats data files were backed up, so your web stats (at
http://yourdomain.com/awstats/) are still there.
Going Forward
In order to try to prevent this situation from recurring, I plan
to take the following steps:
- Improve disk health monitoring, and replace any disks at the
first sign of trouble.
- Improve the mail delivery subsystem to continue operating in
the absence of the web/file server.
I'll also rebuild the former web server, putting in a new motherboard
and power supply, and returning it to service.
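As a rough illustration of what the disk health monitoring step could
look like, here is a minimal sketch in Python built around smartctl
from smartmontools. It is not the monitoring actually in place: the
choice of smartctl is an assumption, the function names are
hypothetical, and the bit meanings are paraphrased from the EXIT
STATUS section of the smartctl man page. The idea is to run the check
from a scheduled job and flag a drive as soon as any warning bit
appears.

```python
import subprocess

# Warning bits in smartctl's exit status, paraphrased from its man page.
WARNING_BITS = {
    3: "SMART status check reports the disk is failing",
    4: "a pre-fail attribute is at or below its threshold",
    6: "the device error log contains errors",
    7: "the self-test log contains errors",
}

# Bits 0-2 mean smartctl itself could not parse, open, or query the device.
COMMAND_FAILED_MASK = 0b111


def decode_smartctl_status(code):
    """Translate a smartctl exit status into a list of warning messages."""
    return [msg for bit, msg in sorted(WARNING_BITS.items()) if code & (1 << bit)]


def check_disk(device):
    """Run 'smartctl -a' on one device and return any warnings.

    Usually needs root. Raises RuntimeError if smartctl could not
    query the device at all.
    """
    result = subprocess.run(
        ["smartctl", "-a", device], capture_output=True, text=True
    )
    if result.returncode & COMMAND_FAILED_MASK:
        raise RuntimeError("smartctl could not query " + device)
    return decode_smartctl_status(result.returncode)
```

For example, check_disk("/dev/sda") (the device name is a placeholder)
returns an empty list for a drive with no warnings. Any non-empty
result would be the "first sign of trouble" that triggers a
replacement.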
My sincere apologies for the hassle and inconvenience caused by this
disruption - please let me know if you're experiencing any problems
with your service.