Some time between mid-December and January at $WORK, we noticed that FTP transfers from the NIH NCBI were nearly always failing; maybe one attempt in 15 or so would work. I got it down to this test case:
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE34nnn/GSE34777/matrix/GSE34777_series_matrix.txt.gz
The failed downloads would not fail right away, but instead hung when the data connection from the remote end should have transferred the file to us. Twiddling passive did nothing. If I tried HTTP instead of FTP, I'd get about 16k and then the transfer would hang.
That hostname, ftp.ncbi.nlm.nih.gov, resolves to 4-6 different IP addresses, with a TTL of 30 seconds. I found that ftp transfers from these IP addresses failed:
but this one worked:
The A record you get doesn't seem to follow a pattern, so I presume it's being handled by a load balancer rather than simple round-robin. The working address didn't come up very often, which I think accounts for the low rate of success.
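A rough sketch of how that per-address check could be scripted, for anyone who wants to repeat it. This isn't a transcript of what I ran: the function names (try_each_address, resolve_a, fetch_ftp) are made up for illustration, and it assumes dig and wget are installed and that the server accepts anonymous FTP by IP address.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative, not from any real tool.

# Print the IPv4 addresses a hostname currently resolves to.
# The grep drops any CNAME lines that "dig +short" may print.
resolve_a() {
    dig +short A "$1" | grep -E '^[0-9.]+$'
}

# Try to fetch the test file from one specific IP address.
# One attempt, 30 second timeout, output thrown away.
fetch_ftp() {
    wget -q -t 1 -T 30 -O /dev/null \
        "ftp://$1/geo/series/GSE34nnn/GSE34777/matrix/GSE34777_series_matrix.txt.gz"
}

# Run the fetcher against every address the resolver returns,
# printing one "ADDRESS ok" or "ADDRESS FAILED" line per address.
try_each_address() {
    resolver=$1
    fetcher=$2
    host=$3
    for ip in $("$resolver" "$host"); do
        if "$fetcher" "$ip"; then
            echo "$ip ok"
        else
            echo "$ip FAILED"
        fi
    done
}
```

Calling `try_each_address resolve_a fetch_ftp ftp.ncbi.nlm.nih.gov` a few times, a minute or so apart, should eventually cover the whole pool, since the 30-second TTL means each lookup may return a different set.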
At first I thought this might indicate network problems here at $WORK, but the folks I contacted insisted nothing had changed, that we weren't behind any additional firewalls, and that all our packets took the same route to both sets of addresses. So I checked our firewall, and couldn't find anything there -- no blocked packets, and to the best of my knowledge no changed settings. Weirdly, running the wget command on the firewall itself (which runs OpenBSD, instead of CentOS Linux like our servers) worked...that was an interesting rabbit hole. But if I deked out the firewall entirely and put a server outside, it still failed.
Then I tripped over the fix: lowering the MTU from our usual 9000 bytes to 8500 bytes made the transfers work successfully. (Yes, 8500 and no more; 8501 fails, 8500 or below works.) And what has an MTU of 8500 bytes? Cisco Firewall Service Modules, which are in use here at $WORK -- though not (I thought) on our network. I contacted the network folks again, they double-checked, and said no, we're not suddenly behind an FSM. And in fact, their MTU is 8500 nearly everywhere...which probably didn't happen overnight.
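I found the 8500 cutoff by changing the interface MTU, but the same kind of search can in principle be done without touching the interface, by sending pings that aren't allowed to fragment. What follows is a sketch under several assumptions: it uses Linux iputils ping (-M do sets the don't-fragment bit, -s is the ICMP payload size), the target has to answer ICMP echo at all, and the function names (probe_ping, find_max_mtu) and the TARGET variable are hypothetical. The 28 subtracted is the 20-byte IPv4 header plus the 8-byte ICMP header.

```shell
#!/bin/sh
# Send one unfragmentable ICMP echo that fills a packet of $1 bytes.
# Succeeds only if a reply comes back within 2 seconds.
probe_ping() {
    ping -M do -c 1 -W 2 -s "$(( $1 - 28 ))" "$TARGET" >/dev/null 2>&1
}

# Binary search for the largest size in [lo, hi] that the probe accepts.
# Assumes the low end works and that every size above the limit fails.
# usage: find_max_mtu PROBE_FUNCTION LO HI
find_max_mtu() {
    probe=$1
    lo=$2
    hi=$3
    while [ "$lo" -lt "$hi" ]; do
        mid=$(( (lo + hi + 1) / 2 ))
        if "$probe" "$mid"; then
            lo=$mid            # mid got through, so the limit is mid or higher
        else
            hi=$(( mid - 1 ))  # mid was dropped, so the limit is below mid
        fi
    done
    echo "$lo"
}
```

With the local interface still at 9000, something like `TARGET=ftp.ncbi.nlm.nih.gov; find_max_mtu probe_ping 1500 9000` would run the search. If the far end drops ICMP, every probe fails and the result is just the lower bound, which tells you nothing.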
Changing the MTU here was an imposing thought; I'd have to change it everywhere, at once, and test with reboots...Bleah. Instead, I decided to try TCP MSS clamping with this iptables rule:
iptables -A OUTPUT -p tcp -d 130.14.250.0/24 --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 8460
(Again, 8460 or below works; 8461 or above fails.) It's a hack, but it works. I'm going to contact the NCBI folks and ask if anything's changed at their end.
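The 8460 figure lines up with the 8500-byte MTU: the TCP MSS is the MTU minus the IPv4 header (20 bytes) and the TCP header (20 bytes), assuming neither carries options. A tiny sketch of the arithmetic, with mtu_to_mss being a made-up helper name:

```shell
#!/bin/sh
# MSS for IPv4 = MTU - 20 (IPv4 header) - 20 (TCP header), no options assumed.
mtu_to_mss() {
    echo $(( $1 - 20 - 20 ))
}

mtu_to_mss 8500   # prints 8460, the largest MSS that worked
mtu_to_mss 9000   # prints 8960, what a 9000-byte MTU would otherwise advertise
```

So clamping the MSS to 8460 keeps every TCP segment to that /24 inside an 8500-byte packet without changing the interface MTU.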
Michael W. Lucas, of Absolute FreeBSD fame (among many others), has a book coming out on Network Flow Analysis. Sweet!
/me hurries off to pre-order...
New emacs, woo! I've downloaded it and compiled it already, 'cos I am that l33+, thank you. But one thing: the tarball is signed by Chong Yidong, pgp/gpg key #BC40251C. I could not find any indication anywhere that this is the right key, or what the right key might be. A quick search turned up lots of posts on the Emacs mailing lists, bugzilla entries and such from him, so I presume it's okay…but it would be nice to make this explicit. (Even a search for the key number turned up nothing.)
This article about updating pkg-src makes me even happier I went with Debian. That is all.
Yesterday I got a new switch in at work. Good god, the 10/100 Procurves are getting cheap — $600 w/academic discount for a 2626. I was just going to rack it, but as always I couldn't stop once I got going; that server room needs a lot of cleaning up. Three hours later I emerged, bloody but triumphant: the network cables were cleaned up considerably, I'd identified the last of the mystery boxes (step-down transformer, not a UPS like I thought), and I'd figured out that the big UPS was only one-third loaded — plenty o' room. Once I get all the cleaning done, I'll post before-and-after pix, 'cos that will be one chunk of work I'll be damned proud of.
Thursday: Go to The Other University to do some prep for the move coming up next week. Check in with their computer store (where you pretty much have to buy things) to see how the order on the console server is going. The guy behind the counter looks up the order, frowns, and tells me that it seems their supplier does not have one in any of their three Canadian warehouses. Okay, so how long will it take to get one in? He looks at me earnestly and says that, sometimes, they never come in. I ask at what point I can count on the supplier a) giving up and b) informing me of that fact. He frowns again, and suggests that I check back in a couple weeks (four weeks after I've placed the order) just to be safe.
Friday: Get email from contractor/university liaison for new building to say that network and electrical connections will not be ready in time because the requests were received so very late. While The Other Guy was supposed to get them in long ago, I should've been on top of this.
Monday, a stat in Canada: Go to the old building to do a serverectomy on a soon-to-be-formerly shared rack. The Other Guy mentions that the new server room has water on the floor. I go over to look, and it's a rapidly evaporating puddle, irregular in shape and maybe two metres across at its widest. I can't figure out where it's coming from. Turns out there's some other stuff that should become formerly shared as well, so I spend time poring over Sun Enterprise 1 workstations (which I like) and old inkjet cartridges for printers that may no longer be around (which I don't like). Ask The Other Guy, who's been involved with the move a lot longer than I have, what electrical connections he's asked for him and for me (long story) in the new building. He says that he gave them the model number of the Sun rack he's got (which has built-in, and very nice, PDUs) and asked them to figure out what he needs.
Tuesday: Moving day. As expected, network and electrical are not present; we've got 2 x 15A 120V circuits. Also, the water is back, and we can see that it's coming from a small leak in the concrete roof. I move my rack into another room; The Other Guy spreads a blanket over his rack. The liaison promises us that the contractors are on the job to fix the roof. The network connections (two fiber, two Cat5) get terminated, so I call the local network folks to get that taken care of. The university wireless network is not present in the new building.
Wednesday: The contractors show up to start fixing the leak. The network connections have been set up. The contractors have put in a big tube of plastic sheeting, taped to the roof at one end and a 40-gallon recycling barrel at the other. The Other Guy decides things are good enough and starts setting up his rack; I elect to hold off another day.
Thursday: The contractors say the roof is fixed, so I move the rack in and start hooking things up. The new OpenBSD firewall comes up nicely -- thank you, pf developers -- as does the main Sun server. Next up is the SunRays in the lab, only they're not. I take my laptop in and try to verify connectivity. I can't. The Other Guy thinks the VLANs on my new switch are the problem and suggests just simplifying things. I do and keep testing. Traffic from the laptop's RFC 1918 address just never makes it to the server. In a fit of desperation I try using an address in our routable subnet, and it works. This takes me until 8pm to figure out. I email various bosses explaining how far I've got, and the campus network folks to ask if they're filtering this subnet in some way. (This isn't completely out of the question; this place has a reputation for a pretty locked-down network.)
Friday: I buttonhole the guy at the campus network office and ask him about this. He considers this and realizes that while he's forgotten to unblock DHCP (told you it was pretty locked down), the other behaviour I'm seeing can be explained if I've somehow got my interfaces crossed. I'm doubtful but give it a try, which is a good thing because suddenly everything works. I don't understand it or what I did wrong, but assume that I was simply too tired the previous night and thank him profusely for taking the time to talk to me. I am now where I should have been twenty hours before. Mighty battles emerge with Sun's DHCP and Sunray servers. In the end, I have to delete the Sunray configuration, delete all DHCP configurations, and then add the Sunray configuration back. This works, which annoys me; why are there all these opaque configurations around? Not a single plain-text file in sight. I manage to get a printer working, then another. DHCP is modified so that laptops work as well. I call it a night and head home.
I just love clever network hacks.
Speaking of which, I think I'm going to ask my boss if she'll send me to LISA. I didn't realize I had sysadmin heroes 'til I started looking at the program: Æleen Frisch! Michael Lucas! Tom Limoncelli (who's working at Google now, natch)! W. Curtis Preston! But also Dan Fucking Kaminsky, that's who:
I like big graphs and I can't deny...You other hackers can't deny...when a packet routes in with an itty bitty length and a huge string in your face you get sick...cuz you've fuzzed that trick...
...who's going to be presenting the results of a worldwide SSL scan among lots of other stuff.
I think it'd be great to attend, but it's a long shot. Wish me luck.