January 3rd, 2008
Well, the server appears to be back online, but I’m getting additional errors from /dev/hda, which indicates to me that I need to have it replaced ASAP.
I cannot emphasize this enough: do not assume that my backup is going to save your ass. If you want the data, you best back it up yourself just to be safe. cPanel has a backup utility, and I suggest that you use it.
I will have a full backup made of the drive as it currently stands and put on the brand new secondary drive [which is not throwing errors] before having the primary drive replaced.
Posted in Announcements | 1 Comment »
January 3rd, 2008
Server was brought offline around 0220 CST this morning to replace the HDD. Drive replacement was complete around 0530. SSH and HTTPD [Web] were unresponsive. The datacenter is rebooting the machine to see if this improves the situation.
Update, 0730 CST: Well, the primary hard drive isn’t doing terribly well. We’re seeing if we can bring it back online long enough to back it all up, and then I’m probably going to ask for it to be replaced as well, and then I’ll re-load everything from backups. If there’s a good thing, it’s that the databases all seem to be extant, so I can copy their data [which should be good through 0200 this morning] and we won’t lose anything to that point.
Posted in Announcements | 2 Comments »
January 2nd, 2008
Posted in Announcements | No Comments »
January 2nd, 2008
[Note: Where I used to update existing Weblog entries, I will now write a new entry for each update so that it gets pushed out via Twitter.]
The investigation of this morning’s apparent hard drive failures found 21 bad blocks on /dev/hda [our main drive] and 41 on /dev/hdb [where backups are stored]. I’ve asked for /dev/hdb to be replaced and have asked what they can do in terms of replication on /dev/hda.
I’ll update when I know more.
Oh, and we’ve also had massive load spikes today as people attempt to do nefarious things, and I/O has been limited because the datacenter was running badblocks. Usually, we don’t have that kind of I/O consumption and the server handles the strain okay.
Update, 1445: Okay, so Twitter Tools allows me to tweet from inside WP. Nice. No more new-post-per-update.
The datacenter is going to replace the backup HDD after midnight tonight. I’m backing up Tuesday’s backups—the last ones I trust—and storing them offsite here at the office. Fun times.
Posted in Announcements | 1 Comment »
January 2nd, 2008
The sound you just heard was me wailing in disbelief.
S.M.A.R.T Errors on /dev/hda
From Command: /usr/sbin/smartctl -q errorsonly -H -l selftest -l error /dev/hda
ATA Error Count: 302 (device log contains only the most recent five errors)
Error 302 occurred at disk power-on lifetime: 10418 hours (434 days + 2 hours)
Error 301 occurred at disk power-on lifetime: 10418 hours (434 days + 2 hours)
Error 300 occurred at disk power-on lifetime: 10294 hours (428 days + 22 hours)
Error 299 occurred at disk power-on lifetime: 10294 hours (428 days + 22 hours)
Error 298 occurred at disk power-on lifetime: 10150 hours (422 days + 22 hours)
Num  Test_Description   Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline   Completed: read failure  80%        19242            48683947
----END /dev/hda--
S.M.A.R.T Errors on /dev/hdb
From Command: /usr/sbin/smartctl -q errorsonly -H -l selftest -l error /dev/hdb
ATA Error Count: 28 (device log contains only the most recent five errors)
Error 28 occurred at disk power-on lifetime: 29912 hours (1246 days + 8 hours)
Error 27 occurred at disk power-on lifetime: 29912 hours (1246 days + 8 hours)
Error 26 occurred at disk power-on lifetime: 29912 hours (1246 days + 8 hours)
Error 25 occurred at disk power-on lifetime: 10444 hours (435 days + 4 hours)
Error 24 occurred at disk power-on lifetime: 10444 hours (435 days + 4 hours)
Num  Test_Description   Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline   Completed: read failure  80%        19274            52155480
----END /dev/hdb--
That could be nothing. It could be everything. It just … stunned me. The datacenter is going to run some diagnostics. We’ll see. Hopefully it’s nothing. But … gah.
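For anyone who wants to sanity-check the figures in a report like the one above, the "days + hours" values are just the power-on hours divided by 24. Here is a minimal Python sketch, assuming the "Error N occurred at disk power-on lifetime: H hours" line format shown in the output; the function name `parse_error_ages` is hypothetical and not part of smartctl or any other tool.

```python
import re

# Matches lines such as:
#   Error 302 occurred at disk power-on lifetime: 10418 hours (434 days + 2 hours)
# The pattern is an assumption based on the report text quoted in this post.
ERROR_LINE = re.compile(r"Error (\d+) occurred at disk power-on lifetime: (\d+) hours")


def parse_error_ages(report):
    """Return a list of (error_number, days, hours) for each error line found."""
    results = []
    for line in report.splitlines():
        match = ERROR_LINE.search(line)
        if match:
            number = int(match.group(1))
            # divmod splits the total power-on hours into whole days and leftover hours.
            days, hours = divmod(int(match.group(2)), 24)
            results.append((number, days, hours))
    return results


sample = """ATA Error Count: 302 (device log contains only the most recent five errors)
Error 302 occurred at disk power-on lifetime: 10418 hours (434 days + 2 hours)
Error 298 occurred at disk power-on lifetime: 10150 hours (422 days + 22 hours)"""

for number, days, hours in parse_error_ages(sample):
    print(f"Error {number}: {days} days + {hours} hours")
```

Grouping the results by day makes it easier to see whether the errors are clustered [as the most recent ones on each drive appear to be] or spread across the drive's life.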
Posted in Announcements | 5 Comments »
December 31st, 2007
- Lord a’ mercy, all the administrative work that I know of? Done. Server cancellation request is in. I’m … gonna take a breather…!
Powered by Twitter Tools.
Posted in Announcements | No Comments »
December 30th, 2007
Posted in Announcements | No Comments »
December 29th, 2007
If you’re a user of Twitter, you may wish to follow rmfoinfo on Twitter. Thanks to Alex King’s awesome Twitter Tools, you’ll get up-to-the-second notice that something’s wrong on the server. [We post so rarely that Google Reader, Newsgator, etc. aren't going to notify you for hours after an outage because they don't scrape us very often.]
Also, several of you have, through no fault of your own, gotten caught by our firewall. I found out one reason why: Shaun Inman’s Mint, which we use extensively around the RMFO network, sometimes does things that the Apache module mod_security doesn’t like. Given that CSF/LFD monitors mod_security, you get firewalled for no good reason. Now that I know the cause, I’ve been rolling out a fix.
Now, I gotta get back to doing WordPress upgrades …
Posted in Announcements | No Comments »
December 29th, 2007
The last of the moves to the new server will be complete this weekend. Today, I’m moving andrewosenga.net, caedmonscall.net, and donmillerfans.net amongst others. This will mean some downtime for each. Thanks for your patience.
Update, 1800 CST: Well, that cc.net downtime was terribly embarrassing, given how easy the fix was. Sorry about that!
Everything else has gone pretty smoothly. I have one account left to move off of the old server, which I’ll do late tonight, and then I’m done! Huzzah. I figure that, for 57 account moves, only having one really get my goat isn’t a bad record at all.
Posted in Announcements | No Comments »
December 21st, 2007
Howdy all:
Had a massive [several hundred processes] spike of exim usage just now. I’ve restarted exim and will monitor things; also, I have a support ticket in with the NOC to see if they can help triage this.
Update, 1228 CST: Well, 1.6 GB of email was sitting in the default address’s account. It’s … now gone.
That was causing a large part of the problem, because we were getting absolutely mailbombed [dozens of messages a second to non-existent addresses on rmfo-blogs.com]. Now the server rejects out of hand any incoming email to rmfo-blogs.com that isn’t addressed to a specific, existing address. [And since there are no mailboxes on the server, that effectively settles the problem.]
I’ll do an audit of all the other accounts on the box and make sure that their default addresses are similarly configured to prevent future issues of a similar nature.
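To illustrate the rule described above, here is a minimal Python sketch of the decision only. It is not the actual exim or cPanel configuration; the function name `should_reject` and the sample addresses are made up for the example.

```python
def should_reject(recipient, valid_addresses):
    """Return True when mail for `recipient` should be refused outright.

    `valid_addresses` is the set of addresses that really exist on the domain.
    Anything else is rejected instead of being dumped into a catch-all
    default account, which is what let the 1.6 GB pile up.
    """
    normalized = recipient.strip().lower()
    return normalized not in {address.lower() for address in valid_addresses}


# With no mailboxes configured, every recipient on the domain is refused.
print(should_reject("random123@rmfo-blogs.com", set()))

# A hypothetical domain with one real mailbox accepts only that address.
print(should_reject("Webmaster@example.org", {"webmaster@example.org"}))
```

Refusing the message while it is being delivered, rather than accepting it and discarding it later, means nothing accumulates on disk.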
Posted in Announcements | No Comments »