Tuesday, May 29, 2007

Server panic - 3rd time's the charm

I guess the good news is that the bug is evidently reproducible, at least in some sense of the word, and it reproduced itself again the other day.

But I believe I've nailed it -- or rather I found something definitively broken that appears to account for what happened, and now we're running with the fix. As usual, we'll see how it goes.

Thursday, May 24, 2007

upgrading to etch

so Debian 4.0 is out, and I'm doing the nasty (one of my other machines died, so there's now a bunch of stuff I need to move to the moo server box, which made the upgrade suddenly a bit more urgent, which is why you didn't see much advance notice on this).

And normally with Debian upgrades you're supposed to be able to keep things running throughout, which is way cool when you can do it. Except that this time you can't, since for various reasons the sarge->etch transition involves two kernel upgrades (one to get from 2.4 to 2.6 and the other to get from a 2.6 that sarge knows about to the 2.6 that etch knows about -- there's no overlap, see... oh well, it's free...) and some of the packages being altered are things like libc, which the moo server actually depends on (surprise). After a few disturbing "Waah, I can't find such and such in foo.so" messages because the process was suddenly bereft of its shared lib, I decided maybe I ought to shut it down for the duration... (I hope we got a clean checkpoint, otherwise we'll have lost about an hour; actually, since it looked like DNS stuff that was crashing and that's all a separate process anyway, I'm guessing things are fine, but we'll see how that goes).

Current task is dealing with hdparm (flashback to 2003: Jay sez, "Yes, you really do want to tune your hard drives. It'll be totally excellent..."; he gives me the one-line command that does the trick, I wrap it in the 20-line init.d script one needs so that it properly fits into the boot sequence, and we're done. And then I go on to spend the next four years thinking about Other Things.

And now it seems that etch has its own script that's about to overwrite mine; it's only 300 lines and the corresponding config file is only about 150. Most of it is about RAID, admittedly, but somewhere in there is the magic option to set that will produce the single line that actually does whatever it was that we evidently needed to do back in 2003. And I kinda wanna get it right the first time, because... well... disk.)

so that's where I am at the moment. I like to think this won't take too much longer.

Meanwhile, it's a very nice day outside (at least here in the PNW anyway). Enjoy.

Update (1:48pm PDT): we made it through the second kernel reboot and still no evil on the console. yay. And of course, X is now b0rken (after only just having gotten it to work under sarge just a few days ago), but hey, not your problem, as they say...