Some time between mid-December and January at $WORK, we noticed that FTP transfers from the NIH NCBI were nearly always failing; maybe one attempt in 15 or so would work. I got it down to this test case:
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE34nnn/GSE34777/matrix/GSE34777_series_matrix.txt.gz
The failed downloads would not fail right away; instead they hung at the point where the data connection from the remote end should have started transferring the file to us. Twiddling passive mode did nothing. If I tried HTTP instead of FTP, I'd get about 16k and then the transfer would hang.
That hostname, ftp.ncbi.nlm.nih.gov, resolves to 4-6 different IP addresses with a TTL of 30 seconds. I found that FTP transfers from these IP addresses failed:
but this one worked:
The A record you get doesn't seem to follow any pattern, so I presume the name is being handled by a load balancer rather than simple round-robin DNS. The working address didn't come up very often, which I think accounts for the low success rate.
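If you want to watch the rotation yourself, a quick loop with dig shows which addresses come up; the 20 iterations and 30-second sleep are just my way of sampling, adjust to taste:
for i in $(seq 1 20); do dig +short ftp.ncbi.nlm.nih.gov A; echo ---; sleep 30; done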
At first I thought this might indicate network problems here at $WORK, but the folks I contacted insisted nothing had changed: we're not behind any additional firewalls, and all our packets take the same route to both sets of addresses. So I checked our firewall and couldn't find anything there -- no blocked packets, and to the best of my knowledge no changed settings. Weirdly, running the wget command on the firewall itself (which runs OpenBSD rather than CentOS Linux like our servers) worked...that was an interesting rabbit hole. But even when I deked the firewall out entirely and put a server outside it, the transfer still failed.
Then I tripped over the fix: lowering the MTU from our usual 9000 bytes to 8500 bytes made the transfers work. (Yes, 8500 and no more: 8501 fails, 8500 or below works.) And what has an MTU of 8500 bytes? Cisco Firewall Services Modules (FWSMs), which are in use here at $WORK -- though not (I thought) on our network. I contacted the network folks again; they double-checked and said no, we're not suddenly behind an FWSM. And in fact, their MTU is 8500 nearly everywhere...which probably didn't happen overnight.
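For anyone who wants to poke at this kind of problem themselves, here's roughly the sort of test I mean. The interface name is an assumption, and ICMP may well be filtered along the path, so treat the ping probe as a sketch rather than gospel:
# Probe the path with the Don't Fragment bit set: 8472 bytes of payload + 8 (ICMP) + 20 (IP) = 8500
ping -c 3 -M do -s 8472 ftp.ncbi.nlm.nih.gov
# One byte more makes an 8501-byte packet; this is the one that should fail
ping -c 3 -M do -s 8473 ftp.ncbi.nlm.nih.gov
# Or temporarily drop the interface MTU (eth0 is a guess)
ip link set dev eth0 mtu 8500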
Changing the MTU here was an imposing thought; I'd have to change it everywhere, at once, and test with reboots...Bleah. Instead, I decided to try TCP MSS clamping with this iptables rule:
iptables -A OUTPUT -p tcp -d 130.14.250.0/24 --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 8460
(8460 is the 8500-byte MTU minus 40 bytes of IP and TCP headers; again, 8460 or below works, while 8461 or above fails.) It's a hack, but it works. I'm going to contact the NCBI folks and ask if anything's changed at their end.
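If you want to confirm the clamping is actually taking effect, watching the outgoing SYNs with tcpdump and checking the advertised MSS option should do it; eth0 is again a guess at the interface name:
tcpdump -ni eth0 'dst net 130.14.250.0/24 and tcp[tcpflags] & tcp-syn != 0'
Each SYN line should show mss 8460 in the TCP options.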
Xmas vacation is when I get to do big, disruptive maintenance with a fairly free hand. Here's some of what I did and what I learned this year.
I made the mistake of rebooting one machine first: the one that held the local CentOS mirror. I did this thinking that it would be a good guinea pig, but then other machines weren't able to fetch updates from it; I had to edit their repo files. Worse, there was no remote console on it, and no time (I thought) to take a look.
Last year I tried getting machines to upgrade using Cfengine like so:
shellcommands:
    centos.some_group_of_servers.Hr14.Day29.December.Yr2009::
        "/usr/bin/yum -q -y clean all"
        "/usr/bin/yum -q -y upgrade"
        "/usr/bin/reboot"
This didn't work well. I hadn't pushed out the changes in advance, because I was paranoid that I'd miss something; when I did push them out, all the machines hit the cfserver at (more or less) the same time and didn't get the updated files, because the server was refusing connections. I ended up doing the upgrades by hand.
This year I pushed out the changes in advance, but it still didn't work because of the problems with the repo mirror. I ended up running cssh, editing the repo files and updating by hand.
This worked okay, but I had to do the machines in separate batches -- some needed to have their firewall tweaked to let them reach a mirror in the first place, some I wanted to watch more carefully, and so on. That meant going through a list of machines, trying to figure out if I'd missed any, adding them by hand to cssh sessions, and so on.
I may need to give in and look at RHEL, or perhaps func or better Cfengine tweaking will do the job.
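One bit of Cfengine tweaking that should at least help with the thundering-herd part is SplayTime, which makes each client wait a pseudo-random chunk of the window before talking to the server instead of everyone piling on at once. A minimal sketch for cfagent.conf -- the ten minutes is an arbitrary number of mine:
control:
    SplayTime = ( 10 )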
Quick and dirty way to make sure you don't overload your PDUs:
sleep $(expr $RANDOM / 200) && reboot    # $RANDOM tops out at 32767, so that's 0-163 seconds of jitter
Rebooting one server took a long time because the ILOM was not working well, and had to be rebooted itself.
Upgrading the database servers with the 3 TB arrays took a long time: the stock MySQL packages conflicted with the official MySQL RPMs, and fscking the arrays takes maybe an hour -- with no sign of life on the console while it runs. Problems with one machine's ILOM meant I couldn't even get a console for it.
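Next time I'll probably run the check by hand ahead of the reboot, if only to get a progress bar out of it; e2fsck's -C flag does that for ext3 (the device name here is made up):
e2fsck -C0 -f /dev/sdb1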
Holy mother of god, what an awful time this was. I spent eight hours on upgrades for just nine desktop machines. Sadly, most of it was my fault, or at least bad configuration:
Graphics drivers: awful. Four different versions, and I'd used the local install scripts rather than creating an RPM and installing that. (Though to be fair, that would just rebuild the driver from scratch when it was installed, rather than do something sane like build a set of modules for a particular kernel.) And I didn't figure out where the uninstall script was 'til 7pm, meaning lots of fun trying to figure out why the hell one machine wouldn't start X.
Lesson: This really needs to be automated.
Lesson: The ATI uninstall script is at /usr/share/ati/fglrx-uninstall.sh. Use it.
Lesson: Next time, uninstall the driver and build a goddamn RPM.
Lesson: A better way of managing xorg.conf would be nice.
Lesson: Look for prefetch options for zypper. And start a local mirror.
Lesson: Pick a working version of the driver, and commit that fucker to Subversion.
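For what it's worth, I believe the ATI installer can build a distribution package itself rather than splattering files all over the filesystem -- the --listpkg/--buildpkg options, if memory serves -- and the result is something you can actually put under version control. The installer filename, package target and RPM name below are all guesses:
sh ati-driver-installer-10-2-x86.x86_64.run --listpkg
sh ati-driver-installer-10-2-x86.x86_64.run --buildpkg SuSE/SUSE113-AMD64
svn add fglrx*.rpm && svn commit -m "fglrx driver build that actually works"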
These machines run some scientific software: one master, three slaves. When the master starts up at boot time, it tries to SSH to the slaves to copy over the binary. There appears to be no, or poor, rate throttling; if the slaves are not available when the master comes up, you end up with the following symptoms:
The problem is that umpty scp processes on the slave are holding open the binary, and the kernel gets confused trying to run it.
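My guess is this is the classic "Text file busy" (ETXTBSY) situation: you can't exec a binary that something else still has open for writing. Something like this on a slave will show the culprits -- the path is made up:
fuser -v /opt/science/bin/worker    # list processes holding the binary open
lsof /opt/science/bin/worker        # same information, more detail
fuser -k /opt/science/bin/worker    # if they're all stray scps, kill them off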
I also ran into problems with a duff cable on the master; confusingly, both the kernel and the switch said it was still up. This took a while to track down.
It turned out that a couple of my KVM-based VMs did not have jumbo frames turned on. I had to use virt-manager to shut down the machines, switch their network interfaces over to virtio, then boot them again. However, kudzu on the VMs then saw these as new interfaces and did not configure them correctly. This caused problems because the machines are LDAP clients, and they hang when the network is unavailable.
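For reference, the guest-side jumbo-frame setting is just an MTU line in the usual CentOS ifcfg file; a minimal sketch, with placeholder addresses and assuming the interface comes up as eth0:
# /etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.0.2.10
NETMASK=255.255.255.0
MTU=9000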
Arghh...I just spent 24 hours trying to figure out why shadow migration was causing our new 7310 to hang. The answer? Because jumbo frames were not enabled on the switch the 7310 was on, and they were on the machine we're migrating from. Arghh, I say!
Following in Matt's footsteps, I ran into a serious problem just before heading to LISA.
Wednesday afternoon, I'm showing my (sort of) backup how to connect to the console server. Since we're already on the firewall, I get him to SSH to it from there, I show him how to connect to a serial port, and we move on.
About an hour later, I get paged about problems with the database server: SSH and SNMP aren't responding. I try to log in, and sure enough it hangs. I connect to its console and log in as root; it works instantly. Uhoh, I smell LDAP problems...only there's nothing in the logs, and id <uid> works fine. I flip to another terminal and try SSHing to another machine, and that doesn't work either. But already-existing sessions work fine until I try to run sudo or do ls -l. So yeah, that's LDAP.
I try connecting via openssl to the LDAP server (stick alias telnets='openssl s_client -connect' in your .bashrc today!) and get this:
CONNECTED(00000003)
...and that's all. Wha? I tried connecting to it from the other LDAP server and got the usual (certificate, certificate chain, cipher, driver's license, note from mom, etc). Now that's just weird.
After a long and fruitless hour trying to figure out if the LDAP server had suddenly decided that SSL was for suckers and chumps, I finally thought to run tcpdump on the client, the LDAP server and the firewall (which sits between the two). And there it was, plain as day:
Near as I can figure, this was the sequence of events:
This took me two hours to figure out, and another 90 minutes to fix; setting the link speed manually on the firewall just convinced the NIC/driver/kernel that there was no carrier at all. In the end, the combination that worked was telling the switch it was a gigabit port, but letting it negotiate duplexiciousnessity.
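On a Cisco IOS switch -- and I'm only assuming that's what this port hangs off -- that combination looks something like the following; the port name is made up:
interface GigabitEthernet0/24
 speed 1000
 duplex auto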
Gah. Just gah.