Archive for January, 2008

1-2 Jan Backup Organization in Progress

Thursday, January 3rd, 2008

If you’ve been reading along, you’ll know that I took Tuesday morning’s backups and downloaded them offsite [much to the consternation of my corporate IT department]. Now that /dev/hdb is back up and running, I’m uploading the backups I made yesterday so I can move them over to the backup drive. I’ll be putting these in their own folder so the newer backup that I make with cPanel tonight does not overwrite them. I’ll use the newer backups first; these are secondaries only.

Update, 13:35 CST: All backups stored at my office are now uploaded and on the new hard drive.
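
For anyone wanting to do the same keep-them-separate trick with their own files, here is a minimal sketch in Python. It is only an illustration: the function name, the paths, and the folder label are all hypothetical, and it says nothing about how the backup drive on this server is really laid out. The idea is simply that older backups go into their own labelled folder and nothing already there ever gets overwritten.

```python
import shutil
from pathlib import Path


def stash_secondary_backups(upload_dir, backup_root, label):
    """Move every file in upload_dir into backup_root/label without overwriting.

    All three arguments are hypothetical names chosen for this sketch.
    """
    dest = Path(backup_root) / label
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in sorted(Path(upload_dir).iterdir()):
        if not src.is_file():
            continue  # this sketch only handles plain files, not subfolders
        target = dest / src.name
        if target.exists():
            # Refuse to clobber anything already stored under this label.
            raise FileExistsError(f"{target} already exists")
        shutil.move(str(src), str(target))
        moved.append(target)
    return moved
```

Calling it with a label such as "secondary-2008-01-01" would keep those files apart from whatever a newer backup run writes elsewhere under the same root.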

Mid Morning Update

Thursday, January 3rd, 2008

Well, the server appears to be back online, but I’m getting additional errors from /dev/hda, which indicates to me that I need to have it replaced ASAP.

I cannot emphasize this enough: do not assume that my backup is going to save your ass. If you want the data, you best back it up yourself just to be safe. cPanel has a backup utility, and I suggest that you use it.
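
If you would rather script your own copy than click through cPanel, something along these lines would do for plain files. This is a rough sketch, not the cPanel backup utility and not a replacement for it: the function name and the example paths are made up for illustration, and it only archives a directory of files, so databases would still need to be exported separately.

```python
import tarfile
from datetime import date
from pathlib import Path


def archive_directory(source_dir, dest_dir, today=None):
    """Write source_dir into a dated .tar.gz under dest_dir and return its path.

    `today` is an optional datetime.date, mainly so the file name is predictable.
    """
    source = Path(source_dir)
    stamp = (today or date.today()).isoformat()
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    archive_path = dest / f"{source.name}-{stamp}.tar.gz"
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname keeps the archive rooted at the folder name, not the full path.
        tar.add(str(source), arcname=source.name)
    return archive_path
```

Run against a hypothetical public_html folder, this leaves a file named like public_html-2008-01-03.tar.gz that you can then download somewhere that is not this server.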

I will have a full backup made of the primary drive as it currently stands and put it on the brand new secondary drive [which is not throwing errors] before having the primary drive replaced.

Early Morning 3 Jan Update

Thursday, January 3rd, 2008

Server was brought offline around 0220 CST this morning to replace the HDD. Drive replacement was complete around 0530. SSH and HTTPD [Web] were unresponsive. The datacenter is rebooting the machine to see if this improves the situation.

Update, 0730 CST: Well, the primary hard drive isn’t doing terribly well. We’re seeing if we can bring it back online long enough to back it all up, and then I’m probably going to ask for it to be replaced as well, and then I’ll re-load everything from backups. If there’s a good thing, it’s that the databases all seem to be extant, so I can copy their data [which should be good through 0200 this morning] and we won’t lose anything to that point.

Twitter Updates for 2008-01-02

Wednesday, January 2nd, 2008

Powered by Twitter Tools.

Hard Drive Failure Update

Wednesday, January 2nd, 2008

[Note: Where I used to update the Weblog entries, now I will do a new one to push out updates via Twitter.]

The investigation of this morning’s apparent hard drive failures found 21 bad blocks on /dev/hda [our main drive] and 41 on /dev/hdb [where backups are stored]. I’ve asked for /dev/hdb to be replaced and have asked what they can do in terms of replication on /dev/hda.

I’ll update when I know more.

Oh, and we’ve also had massive load spikes today as people attempt to do nefarious things, and I/O has been limited because the datacenter was running badblocks. Usually, we don’t have that kind of I/O consumption and the server handles the strain okay.

Update, 1445: Okay, so Twitter Tools allows me to tweet from inside WP. Nice. No more new-post-per-update.

The datacenter is going to replace the backup HDD after midnight tonight. I’m backing up Tuesday’s backups—the last ones I trust—and storing them offsite here at the office. Fun times.

Critical Hard Drive Failures … ?

Wednesday, January 2nd, 2008

The sound you just heard was me wailing in disbelief.

S.M.A.R.T Errors on /dev/hda
From Command: /usr/sbin/smartctl -q errorsonly -H -l selftest -l error /dev/hda
ATA Error Count: 302 (device log contains only the most recent five errors)
Error 302 occurred at disk power-on lifetime: 10418 hours (434 days + 2 hours)
Error 301 occurred at disk power-on lifetime: 10418 hours (434 days + 2 hours)
Error 300 occurred at disk power-on lifetime: 10294 hours (428 days + 22 hours)
Error 299 occurred at disk power-on lifetime: 10294 hours (428 days + 22 hours)
Error 298 occurred at disk power-on lifetime: 10150 hours (422 days + 22 hours)

Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed: read failure  80%        19242            48683947
----END /dev/hda--

S.M.A.R.T Errors on /dev/hdb
From Command: /usr/sbin/smartctl -q errorsonly -H -l selftest -l error /dev/hdb
ATA Error Count: 28 (device log contains only the most recent five errors)
Error 28 occurred at disk power-on lifetime: 29912 hours (1246 days + 8 hours)
Error 27 occurred at disk power-on lifetime: 29912 hours (1246 days + 8 hours)
Error 26 occurred at disk power-on lifetime: 29912 hours (1246 days + 8 hours)
Error 25 occurred at disk power-on lifetime: 10444 hours (435 days + 4 hours)
Error 24 occurred at disk power-on lifetime: 10444 hours (435 days + 4 hours)

Num  Test_Description  Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Completed: read failure  80%        19274            52155480
----END /dev/hdb--

That could be nothing. It could be everything. It just … stunned me. The datacenter is going to run some diagnostics. We’ll see. Hopefully it’s nothing. But … gah.
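
For what it’s worth, reports like the two above are easy to boil down. Here is a minimal Python sketch that pulls the error count and the power-on hours out of text in that shape. It assumes the report looks exactly like the lines quoted in this post; the function name and the regular expressions are my own illustration and not part of smartctl.

```python
import re

# Patterns matched against lines shaped like the ones quoted in this post.
COUNT_LINE = re.compile(r"ATA Error Count: (\d+)")
ERROR_LINE = re.compile(
    r"Error (\d+) occurred at disk power-on lifetime: (\d+) hours"
)


def summarize_smart_report(text):
    """Return the total error count and the power-on hours of each logged error."""
    count_match = COUNT_LINE.search(text)
    total = int(count_match.group(1)) if count_match else None
    logged = [(int(num), int(hours)) for num, hours in ERROR_LINE.findall(text)]
    hours_seen = [hours for _, hours in logged]
    return {
        "total_errors": total,
        "logged_errors": logged,
        "latest_error_hours": max(hours_seen) if hours_seen else None,
        "distinct_error_hours": sorted(set(hours_seen)),
    }
```

Fed the /dev/hda lines above, it reports 302 total errors, with the five logged ones falling at 10150, 10294, and 10418 power-on hours, which makes it a little easier to see whether errors are clustering or spreading out over time.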