Xmas Maintenance 2010: Lessons learned
11 Jan 2011
Xmas vacation is when I get to do big, disruptive maintenance with a fairly free hand. Here's some of what I did and what I learned this year.
Order of rebooting
I made the mistake of rebooting one machine first: the one that held the local CentOS mirror. I did this thinking that it would be a good guinea pig, but then other machines weren't able to fetch updates from it; I had to edit their repo files. Worse, there was no remote console on it, and no time (I thought) to take a look.
- Lesson: Don't do that.
Automating patching
Last year I tried getting machines to upgrade using Cfengine like so:
centos.some_group_of_servers.Hr14.Day29.December.Yr2009::
"/usr/bin/yum -q -y clean all"
"/usr/bin/yum -q -y upgrade"
"/usr/bin/reboot"
This didn't work well: I hadn't pushed out the changes in advance, because I was paranoid that I'd miss something. When I did push them out, all the machines hit the cfserver at more or less the same time and didn't get the updated files because the server was refusing connections. I ended up doing it by hand.
This year I pushed out the changes in advance, but it still didn't work because of the problems with the repo. I ran cssh, edited the repos file and updated by hand.
This worked okay, but I had to do the machines in separate batches -- some needed to have their firewall tweaked to let them reach a mirror in the first place, some I wanted to watch more carefully, and so on. That meant going through a list of machines, trying to figure out if I'd missed any, adding them by hand to cssh sessions, and so on.
- Lesson: I need a better way of doing this.
- Lesson: I need a way to check whether updates are needed.
I may need to give in and look at RHEL, or perhaps func or better Cfengine tweaking will do the job.
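One possible starting point for the "are updates needed?" check, as a rough sketch rather than a finished tool: "yum check-update" exits with 100 when updates are available, 0 when there are none, and 1 on error, so the exit status alone is enough to sort machines into piles. The function names here are made up for illustration.

```shell
#!/bin/sh
# Sketch: translate the exit status of "yum check-update" into a word.
# yum's man page documents 100 = updates available, 0 = none, 1 = error.
classify_check_update() {
    case "$1" in
        0)   echo "current" ;;
        100) echo "needs-updates" ;;
        *)   echo "error" ;;
    esac
}

# Hypothetical helper: run the check on this host and print
# "hostname: status". Nothing is installed; this only asks.
report_update_status() {
    yum -q check-update >/dev/null 2>&1
    rc=$?
    echo "$(hostname): $(classify_check_update "$rc")"
}
```

Running report_update_status on each machine (over cssh, func, or from Cfengine) would give a list to grep for "needs-updates" before deciding on batches.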
Staggering reboots
Quick and dirty way to make sure you don't overload your PDUs:
sleep $(expr $RANDOM / 200 ) && reboot
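Since $RANDOM in bash ranges from 0 to 32767, the one-liner above sleeps somewhere between 0 and 163 seconds. If you want to pick the window yourself, something like the following sketch would do; stagger_delay is a made-up name, and the window has to stay at or below 32768 for this to make sense.

```shell
#!/bin/bash
# Sketch: pick a delay in [0, max) seconds. $RANDOM is a bash feature
# (0..32767), so this wants bash rather than plain sh.
stagger_delay() {
    local max=${1:-164}      # default roughly matches $RANDOM / 200
    echo $(( RANDOM % max ))
}

# Usage (commented out so that sourcing this file never reboots anything):
# sleep "$(stagger_delay 300)" && reboot
```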
Remote consoles
Rebooting one server took a long time because the ILOM was not working well, and had to be rebooted itself.
- Lesson: I need to test the SP before doing big upgrades; the simplest way of doing this may just be rebooting them.
Upgrading the database servers w/the 3 TB arrays took a long time: stock MySQL packages conflicted with the official MySQL rpms, and fscking the arrays takes maybe an hour -- and there's no sign of life on the console while you're doing it. Problems with one machine's ILOM meant I couldn't even get a console for it.
- Lesson: Again, make sure the SP is okay before doing an upgrade.
- Lesson: Fscking a few TB will take an hour with ext3.
- Lesson: Start the console session on those machines before you reboot, so that you can at least see the progress of the boot messages up until the time it starts fscking.
- Lesson: Might be worth editing fstab so that they're not mounted at boot time; you can fsck them manually afterward. However, you'll need to remember to edit fstab again and reboot (just to make sure)...this may be more trouble than it's worth.
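To know in advance which machines deserve an open console session, it may help to list the filesystems that are eligible for a boot-time fsck at all. A minimal sketch, assuming a standard six-field fstab: the sixth field (fs_passno) is non-zero for filesystems that fsck will consider at boot. This only reads fstab; it doesn't say whether a check is actually due on the next boot. The function name is made up.

```shell
#!/bin/sh
# Sketch: list fstab entries whose sixth field (fs_passno) is non-zero,
# i.e. the filesystems that fsck will look at during boot.
# Prints "device mountpoint passno", skipping comments and blank lines.
# usage: boot_fsck_entries [FSTAB]   (defaults to /etc/fstab)
boot_fsck_entries() {
    awk '!/^[[:space:]]*#/ && NF >= 6 && $6 != 0 { print $1, $2, $6 }' \
        "${1:-/etc/fstab}"
}
```

Any multi-TB array that shows up in this list is a candidate for the "start the console first" treatment.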
OpenSuSE
Holy mother of god, what an awful time this was. I spent eight hours on upgrades for just nine desktop machines. Sadly, most of it was my fault, or at least bad configuration:
- Two of the machines were running OpenSuSE 11.1; the rest were running 11.2. The latter lets you upgrade to the latest release from the command line using "zypper dist-upgrade"; the former does not, and you need to run over with a DVD to upgrade them.
- By default, zypper fetches one package, installs it, then fetches the next. I'm not certain, but I think that means a lot more TCP overhead and less chance to ramp up the transfer speed. It sure as hell seemed slow downloading 1.8GB x 9 machines this way.
- Graphics drivers: awful. Four different versions, and I'd used the local install scripts rather than creating an RPM and installing that. (Though to be fair, that would just rebuild the driver from scratch when it was installed, rather than do something sane like build a set of modules for a particular kernel.) And I didn't figure out where the uninstall script was 'til 7pm, meaning lots of fun trying to figure out why the hell one machine wouldn't start X.
- Lesson: This really needs to be automated.
- Lesson: The ATI uninstall script is at /usr/share/ati/fglrx-uninstall.sh. Use it.
- Lesson: Next time, uninstall the driver and build a goddamn RPM.
- Lesson: A better way of managing xorg.conf would be nice.
- Lesson: Look for prefetch options for zypper. And start a local mirror.
- Lesson: Pick a working version of the driver, and commit that fucker to Subversion.
Special machines
These machines run some scientific software: one master, three slaves. When the master starts up at boot time, it tries to SSH to the slaves to copy over the binary. There appears to be no, or poor, rate throttling; if the slaves are not available when the master comes up, you end up with the following symptoms:
- Lots of SSH/scp processes on the master
- Lots of SSH/scp processes on the slave (if it's up)
- If you try to run the slave binary on the slave, you get errors like "lseek(3, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)" (from strace) or "Text file busy" (ETXTBSY, from running it in the shell).
The problem is that umpty scp processes on the slave are holding the binary open for writing, and the kernel refuses to execute a file while it's being written to.
- Lesson: Bring up the slaves first, then bring up the master.
- Lesson: There are lots of interesting and obscure Unix errors.
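If the boot order can't be guaranteed by hand, one option is to have the master wait until the slaves answer before it starts copying. This is only a sketch under assumptions: wait_for_hosts and ssh_ready are made-up names, the slave hostnames are placeholders, and the scientific software's own startup script would need to call it.

```shell
#!/bin/sh
# Sketch: block until every host answers a probe, or give up.
# usage: wait_for_hosts PROBE TRIES DELAY host...
# PROBE is any command that takes a hostname and exits 0 when it's ready.
wait_for_hosts() {
    probe=$1; tries=$2; delay=$3
    shift 3
    for host in "$@"; do
        n=0
        until "$probe" "$host" >/dev/null 2>&1; do
            n=$((n + 1))
            if [ "$n" -ge "$tries" ]; then
                echo "giving up on $host" >&2
                return 1
            fi
            sleep "$delay"
        done
    done
    return 0
}

# Hypothetical probe: can we open a non-interactive SSH session?
ssh_ready() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Usage (hostnames are placeholders):
# wait_for_hosts ssh_ready 30 10 slave1 slave2 slave3 && start_the_master
```

Checking the hosts one after another, with a cap on retries, also avoids piling up the umpty SSH processes described above.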
I also ran into problems with a duff cable on the master; confusingly, both the kernel and the switch said it was still up. This took a while to track down.
- Lesson: Network cables are surprisingly fragile at the connection with the jack.
Virtual Machines
It turned out that a couple of my kvm-based VMs did not have jumbo frames turned on. I had to use virt-manager to shut down the machines, turn on virtio on the drivers, then reboot. However, kudzu on the VMs then saw these as new interfaces and did not configure them correctly. This caused problems because the machines were LDAP clients and hung when the network was unavailable.
- Lesson: To get around this, go into single-user mode and copy /etc/sysconfig/network-scripts/ifcfg-eth0.bak to ifcfg-eth0.
- Lesson: Be sure you're monitoring everything in Nagios; it's a sysadmin's regression test.
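As one example of the kind of thing worth monitoring, here is a rough sketch of an MTU check that could be wrapped in a Nagios plugin. It assumes "jumbo frames" means an MTU of at least 9000 (adjust to whatever your network uses) and reads the value Linux exposes under /sys/class/net; has_jumbo_frames is a made-up name.

```shell
#!/bin/sh
# Sketch: succeed if an interface's MTU is at least 9000.
# usage: has_jumbo_frames IFACE [SYSFS_NET_DIR]
# Returns 0 if MTU >= 9000, 1 if it is smaller, 2 if it can't be read.
# The 9000 threshold is an assumption, not something from this post.
has_jumbo_frames() {
    mtu_file="${2:-/sys/class/net}/$1/mtu"
    [ -r "$mtu_file" ] || return 2
    [ "$(cat "$mtu_file")" -ge 9000 ]
}

# Usage:
# has_jumbo_frames eth0 || echo "eth0: no jumbo frames"
```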