The server is now offline for diagnosis and, likely, drive replacement.
Update (1409): It’s back up!
Everything’s backed up over to the NAS. We’re good to go.
The backup HDD keeps throwing crazy failure warnings. I’ve reported all of them to the NOC in the hope of cutting the investigation time to a minimum. They haven’t said as much, but I’m guessing that their escalation procedures require them to investigate possibly-working drives before replacing them. After all, the way the lease agreement works is that they have to provide the services they are leasing to me at all times, so they have to pay for the HD. They don’t want to go just on my say-so that the drive is failing; otherwise, someone enterprising would just report failures every nine months and get fresh drives regularly, and maybe even go so far as to get free O/S reloads out of it, too.
That said, I’m hoping that all the evidence I have that the HDD is not long for this world will shorten that time so we can be back up faster. I don’t expect any service at all after 1300 CDT tomor … err, today. Anything we get on Saturday night will be a bonus, I suspect.
I’ll update as I have information.
I’m about, oh, 20% done with backups at this point. I’m making liberal use of my account on the Network Attached Storage drives at The Planet. The backups have to be done manually [which isn't much fun, I assure you], but they’re getting done at a reasonable pace. Server performance will undoubtedly be degraded while the backups run: normally they are done between 0300 and 0530 Central, the slackest time in the standard server day. These backups are being done while site use is still up and non-negligible. I appreciate your continued patience; I’ve also run some hard drive analysis tests that I hope will shorten the downtime with the NOC.
Update: Here at 0020, I’m 75% complete, picking up mostly small accounts, and I’m leaving the big account—rocksmyfaceoff.net itself—to run while I’m asleep.
Because of some minor security leaks, I started to update Apache’s security measures and inadvertently brought Apache down. I’ll have it back up ASAP…
We’ve determined the cause of the load spike: the backup HDD in Miller is failing. I’m working with the NOC to schedule a replacement time for tomorrow; I will spend today making secondary backups to my local machine here at the office.
No data has been lost. The load spike was caused by the HDD failures hanging the backup script, which is the only script that ever touches the secondary HDD.
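For the curious, here is a minimal sketch of one way to keep a hung backup from dragging the whole box down: run the job under a time limit and give up if it stalls. This is not the actual backup script on Miller; the function name `run_with_timeout` and the script path in the usage comment are made up for illustration. Also note that a process stuck in uninterruptible disk I/O may not die promptly when killed, so treat this as a partial safeguard at best.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return True on a clean exit.

    Returns False if the command exits non-zero or runs longer than
    timeout_s seconds. (Hypothetical helper, not the real backup script.)
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return False
    return result.returncode == 0


# Usage (the path is a placeholder, not a real script on the server):
# ok = run_with_timeout(["/path/to/backup-script"], timeout_s=3 * 3600)
```

A cron job wrapping the backup this way could log the failure and exit, rather than leaving a stalled process holding on to the secondary drive.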
Update: The NOC has to do an analysis of the HDD prior to replacing it; this will occur at 1300 CDT tomorrow. The server will be offline for a few hours after that.
Right around the time I got moving this morning, the server was undergoing a massive load spike. I believe that I have it under control, and I’m going to see if I can run down the cause. I have a suspect already [namely the backup script], but I’ll just have to look at it. I’ll keep my eye on it…
Update: I ended up having to reboot the server to get whatever process was crippling MySQL to stop; the reboot is underway, and I have a sneaking suspicion that it might hang up. I’ve got to head in to the office [love the timing], but I’ll do all I can …