Week of DOOOOOOOOM

05 Apr 2013

Wednesday: A very important fileserver panicked and rebooted, apropos of nothing. I can't figure out why.

Thursday: Around 1.30am, a disk array at $WORK noticed one of its drives was likely to fail shortly. It got very excited and sent me one hundred and fifty (150) (not exaggerating) text messages. When I got to work I failed the drive, put the spare into the array, the array started rebuilding, and I called Dell about 10am to arrange for a replacement to be sent out the next day (that is, Friday -- today).

When the rebuild was done it complained that another drive was likely to fail shortly. I contacted Dell and was told that the complaint about the second drive was a) misguided (it wasn't really failing) and b) really meant that the array (that is, /share/networkscratch) was likely to fail entirely. They called this a punctured stripe and there are more than a few complaints about this terminology. Anyhow. The only solution was to back up the data, delete the array, recreate it and restore from backup. "Everybody out of the pool!"

About 6pm last night the process was finally done, but the array still complained that the drive was going to fail soon. I contacted Dell again, and after looking at the array they decided that the second drive really was failing after all -- in fact, it had probably failed first, the array had been compensating for it all this time, and its problem only became evident when the other drive failed. A second replacement drive is due to arrive Monday; it was too late by this time to have it arrive today.

I brought up the server, restored the 2am backup to some spare space, and went home; this was about 9.15pm.

Friday: A long-running (i.e., monthlong) rsync process decided to suck up all the memory on our webserver. It had to be forcibly rebooted.

And now I want a beer.

2 Comments

From: Jason Ross
6 April 2013 03:49:48

Your Thursday is why we now pay for 4 hour parts for our storage systems. Cause when a late Friday failure becomes a Tuesday delivery and you lose a second drive in the rebuild, you very quickly get to bad things and your LUNs go offline and your DB doesn't like it when its disk is yanked out from under it. It really just leads to a bad week.

From: Jason Ross
6 April 2013 03:53:08

And that is why we have 4 hour parts on our Dell storage arrays. When a rebuild takes a second drive down, I really don't want to wait on NBD parts. Cause the dreaded third drive failure that causes your LUNs to be marked offline and yanks the storage out from under your DB isn't pretty. That makes for a very long day.

