Week of DOOOOOOOOM

Wednesday: A very important fileserver panicked and rebooted, apropos of nothing. I can't figure out why.

Thursday: Around 1.30am, a disk array at $WORK noticed one of its drives was likely to fail shortly. It got very excited and sent me one hundred and fifty (150) (not exaggerating) text messages. When I got to work I failed the drive, put the spare into the array, let the rebuild start, and called Dell about 10am to arrange for a replacement to be sent out the next day (that is, Friday -- today).

When the rebuild was done, the array complained that another drive was likely to fail shortly. I contacted Dell and was told that the complaint about the second drive was a) misguided (it wasn't really failing) and b) really meant that the array (that is, /share/networkscratch) was likely to fail entirely. They called this a punctured stripe, a term that has drawn more than a few complaints. Anyhow. The only solution was to back up the data, delete the array, recreate it, and restore from backup. "Everybody out of the pool!"

About 6pm last night the process was finally done, but the array still complained that the drive was going to fail soon. I contacted Dell again, and after looking at the array they decided that the second drive really was failing after all -- in fact, it had probably failed first, the array had been compensating for it all this time, and its problem only became evident when the other drive failed. A second replacement drive is due to arrive Monday; it was too late by this time to have it arrive today.

I brought up the server, restored the 2am backup to some spare space, and went home; this was about 9.15pm.

Friday: A long-running (i.e., month-long) rsync process decided to suck up all the memory on our webserver. The server had to be forcibly rebooted.

And now I want a beer.