IT always takes longer than you think

18 Oct 2012

Yesterday I did a long-anticipated firmware upgrade on a disk array at $work. It's attached to the head node of a small cluster we have, and holds the usual assortment of home directories and data. The process was kind of involved:

shut down the cluster to prevent disk I/O ("not mandatory, but strongly recommended" -- thx, I'll just go with "mandatory");
remove the current management software from the head node, reboot and then reinstall;
X was needed for the installation ("not mandatory, but--" Okay, right, got it, thx): twice via SSH, once by running startx locally;
I couldn't upgrade directly to the new firmware itself, but had to install bridge firmware, wait 30 minutes for things to settle out (!), then install the new firmware;
oh, and "due to limitations of the Linux environment", I couldn't install the firmware from the head node itself, the one that had just had its management software upgraded -- instead, I had to install that software on another machine and install the firmware from there.

Which is why this all took about four hours to do. But that's not all:

Before all that, I read the many, many manuals; did a dress rehearsal to shake out problems; and made sure I had a checklist (thank you, Tom Limoncelli and Orgmode) with the exact commands to run.
During the upgrade, I took notes on things I'd forgotten and problems I'd encountered.
After the upgrade, I did a postmortem: updated my documentation and filed bugs, notified the users that things were back up, and watched for problems.

Which is why a four-hour upgrade took me 9.5 hours. I think there might be a handy rule of thumb for big work like this, though I can't decide if it's "it always takes twice as long" or "it always takes five hours longer than you think." Heh.

One other top tip: stop NFS exports while you're working on a server (but see the next paragraph!). One user started a session on another machine, which automounted her home directory from the head node. This was close to the end of my work, and while I could have used another reboot, I elected not to because I didn't want to mess up her session. Yes, the reboot was important, but I'd neglected to think about this situation, and I didn't think she should have to pay for my mistake.

And if you're going to turn off NFS exports, make damn sure you have your monitoring system checking exports in the first place; that way, you won't forget to turn them back on afterward. (/me scurries to add that test to Nagios right now...)
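
For the curious, here's a minimal sketch of what such an export check might look like. It is not the test I actually added; it just assumes that showmount -e is available on the monitoring host and follows the usual Nagios plugin convention of a one-line status message plus an exit code (0 OK, 2 CRITICAL, 3 UNKNOWN). The function names and the host/path arguments are made up for illustration.

```python
"""Sketch of a Nagios-style check that a server still exports the NFS paths we expect."""
import subprocess

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_showmount(output):
    """Return the set of exported paths found in `showmount -e` output."""
    exports = set()
    for line in output.splitlines():
        # Export lines start with a path; skip the "Export list for ..." header.
        if not line.startswith("/"):
            continue
        exports.add(line.split()[0])
    return exports


def evaluate(expected, actual):
    """Compare expected and actual exports; return (exit_code, message)."""
    missing = sorted(set(expected) - set(actual))
    if missing:
        return CRITICAL, "NFS CRITICAL - missing exports: " + ", ".join(missing)
    return OK, "NFS OK - %d expected exports present" % len(set(expected))


def main(argv):
    """argv is [program, host, path, ...]; returns the plugin exit code."""
    if len(argv) < 3:
        print("NFS UNKNOWN - usage: check HOST PATH [PATH ...]")
        return UNKNOWN
    host, expected = argv[1], argv[2:]
    try:
        result = subprocess.run(
            ["showmount", "-e", host],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        print("NFS UNKNOWN - could not run showmount: %s" % exc)
        return UNKNOWN
    if result.returncode != 0:
        print("NFS CRITICAL - showmount failed: %s" % result.stderr.strip())
        return CRITICAL
    code, message = evaluate(expected, parse_showmount(result.stdout))
    print(message)
    return code
```

To run it as an actual plugin, you'd end the file with sys.exit(main(sys.argv)) (after importing sys) and call it with the server name followed by the paths that should be exported. The parsing and the comparison are kept apart from the showmount call so they can be tried out against canned output without touching a real server.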
