A nice thing about working at a university is all the time off you get at Xmas; it's also the best possible time to do all the stuff you've been saving up. Last year my time was split between this job and my last; now, the time's all mine, baby.
Today will be my last of three days in a row where the machines have been all mine to play with^W^Wupgrade. I've been able to twiddle the firewall's NIC settings, upgrade CentOS using Cfengine, and set up a new LDAP server using Cobbler and CentOS Directory Server. I've tested our UPS's ATS (automatic transfer switch), but discovered that NUT differs from APCUPSD in one important way: it doesn't easily let you say "shut down now, even though there's 95% battery left". I may have to leave testing of that for another day.
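For the record, the closest thing I've turned up in NUT so far is upsmon's "forced shutdown" flag. This is a sketch from my reading of the docs, untested, with "myups" standing in for whatever's actually in ups.conf:

    # Untested; "myups" is a placeholder for the UPS name in ups.conf.
    upsc myups@localhost ups.status   # sanity check: can we talk to the UPS?
    upsmon -c fsd                     # as root: raise the FSD (forced
                                      # shutdown) flag and run the shutdown
                                      # sequence regardless of battery charge

Whether that also kills power to the load the way apcupsd's killpower does is exactly the part I haven't tested yet.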
It hasn't all gone smoothly, but I've accomplished almost all the important things. This is a nice surprise; I'm always hesitant when estimating how long something will take, because I feel like I have no way of knowing in advance what will get in the way (interruptions, unexpected obstacles... you know the drill). In this case, the time estimates for individual tasks were, in fact, 'way paranoid, but that gave me the buffer I needed.
One example: after upgrading CentOS, two of our three servers attached to StorageTek 2500 disk arrays reported problems with the disks. On closer inspection, each was complaining about half of the LUNs the array presented to it -- and each complained in a different way. It had been a year or longer since I'd set them up, and my documentation was pretty damn slim, so it took me a while to figure it out. (Had to sleep on it, even.)
The servers have dual paths to the arrays. In Linux, the multipath drivers don't work so well with these, so we used the Sun drivers instead. But: cfservd had refused its connection because I had the MaxConnections parameter set too low. I got it fixed in the end, and I expanded the documentation considerably. (49,000 words and counting in the wiki. Damn right I'm bragging!)
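For anyone else who trips over this: the fix was a one-line change in cfservd.conf, then a restart of cfservd. A sketch -- the number and domain here are made up; the point is that the default was too low for all my hosts hitting the server at once:

    # cfservd.conf (cfengine 2); 50 is an arbitrary value, tune to taste
    control:
       domain = ( example.com )
       MaxConnections = ( 50 )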
Putting off 'til next time, tempted though I am: reinstalling CentOS on the monitoring machine, which due to a mix of EPEL and Dag repos and operator error appears to be stuck in a corner, unable to upgrade without ripping out (say) Cacti. I moved the web server to a backup machine on Tuesday, and I'll be moving it back today; this is not the time to fiddle with the thing that's going to tell me I've moved everything back correctly.
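When the reinstall does happen, my tentative plan for keeping EPEL and Dag from stepping on each other is yum's priorities plugin. A sketch, assuming the CentOS 5 package name and typical repo IDs:

    # Untested sketch; package name and repo IDs are assumptions.
    yum install yum-priorities
    # Then rank the repos in /etc/yum.repos.d/*.repo so that base and
    # updates always beat the third-party repos:
    #   [base]
    #   priority=1
    #   [epel]
    #   priority=10
    #   [rpmforge]
    #   priority=10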
(Incidentally, thanks to Matt for the rubber duck, who successfully talked me down off the roof when I was mulling this over. Man, that duck is so wise...)
Last day today. (Like, ever!) If I remember correctly I'm going to test the water leak detector...and I forget the rest; it's all in my daytimer and I'm too lazy to get up and look right now. Wish me luck.
And best of 2010 to all of you!
Dress rehearsal includes checking to see if you can, in fact, unrack something. I was unable to move a switch this morning because it was stuck behind a PDU. Arghh.
The saga of our crashing UPS continues. The techs came out to visit this morning, which meant I needed to schedule downtime so they could bypass the UPS manually. They were unable to find any smoking gun (or capacitors), and need to confer with HQ again. Best case: the UPS control panel continues to work, and they can do the next round of work w/o a manual bypass. Worst case: the control panel crashes again, and we schedule another round of downtime.
If you have space for two PDUs and you put one on each side of the rack, you will have no separate space for network cables and you'll get interference. If you put those two PDUs on one side of the rack, you'll put them on the wrong side, and your power cords will interfere with your network cables. If you put those two PDUs on the correct side of the rack, you'll find that racking new items is a pain because the cords block the post holes on that side.
Gave a tour of the new server room today to 30-odd people in the department. Ended on a bit of a low note ("and that's the end! Any questions?") but other than that it went well. Even got an ounce of champagne at the end of it.
Oh, and yesterday I found out that our SL-500 has three fibre channel interfaces, compared to the one interface in the server we bought. I think the sales folks assumed we had a fibre switch, and I didn't realize it all (data + control) wouldn't go over one cable. Arghh.
Just saw a character named Terrance on "Entourage" who was not Terence Stamp. Now I want to see "Bowfinger" and "The Limey", in that order.
Given the recent hoo-ha about abandoned blogs, and my own tendency to lose interest in writing about something the longer I put it off (I haven't graphed it, but I suspect it's a nice exponential decay), I figured I should finally write up what I've been doing the last week: the move at $WORK to our new server room.
So: construction finally got finished on our new server room. Our UPS was installed, our racks set up, and the keys handed over (though they were to be changed again twice). Our new netblock was assigned, the Internet access at the new location was in place, and movers were booked.
Things I did in advance which helped immensely:
Last Thursday morning, it all started. I got the machines shut down (thank you, SSH and ubiquitous wireless access at UBC) before the two volunteers who were helping me showed up. We started getting machines unracked; since it was only about 20 machines, I figured it wouldn't take too long. While that was true, I had not counted on the rat's nest of power cables (our power requirements were such that we had to connect machines to PDUs in adjacent racks), or the fact that we wouldn't be able to disassemble that 'til we'd got the machines out.
There was one heart-stopping moment: a 1U server, extended on its rails with no one supporting it, came off one of the rails. Amazingly, the other rail held on while it rotated quickly through 90 degrees to bang loudly against the rack. "You swear quickly," the movers remarked. (Doubly amazingly, the machine seems to be fine, though the rails for the thing are shot.)
The movers were big and burly, which was wonderful when it came to moving the Thumper. I weigh more than it does, but not by much, and I'd had the bad fortune to screw up my back a week before the move. It was tricky trying to figure out how to remove it from the rails, but the movers' trick of supporting it with a couple of big blankets, while fully extended from the rack, made such considerations less urgent. Eventually we got it figured out. I don't know how that could have gone smoother, since we'd got Sun to rack the thing and, frankly, it's not like you spend a lot of time un- and re-racking something like that. Anyhow, a minor point.
The new location was right around the corner, which was handy. The movers had put the servers in these big laundry-like carts on wheels; in the end, we only had four of 'em. We got the machines unloaded, racked the Thumper with the movers' help, signed the paper, then went off for lunch, where we picked up two more volunteers.
After that, we started racking servers. Having only one sysadmin around (me) proved to be a bottleneck; the volunteers had not worked with rackmounted machines before, and I kept having to stop what I was doing to explain something to them. It would have been a great help to have another admin around; in fact, I think this is the biggest move I'd want to make without some other admin around.
Problems we ran into:
Things that went well:
I'm going to post this now because if I don't, it'll never get done. I may come back and revise it later, but better this than nothing at all.