FTP problems solved by bletcherous hack

Some time between mid-December and January at $WORK, we noticed that FTP transfers from the NIH NCBI were nearly always failing; maybe one attempt in 15 or so would work. I got it down to this test case:

wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE34nnn/GSE34777/matrix/GSE34777_series_matrix.txt.gz

The failed downloads would not fail right away, but instead hung at the point where the data connection from the remote end should have transferred the file to us. Twiddling passive mode did nothing. If I tried HTTP instead of FTP, I'd get about 16k and then the transfer would hang.

That hostname, ftp.ncbi.nlm.nih.gov, resolves to 4-6 different IP addresses, with a TTL of 30 seconds. I found that FTP transfers from these IP addresses failed:

but this one worked:

The A record you get doesn't seem to follow a pattern, so I presume it's being handled by a load balancer rather than simple round-robin. The working address didn't come up very often, which I think accounts for the low rate of success.
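A rough sketch of how the per-address test could be scripted, so that the load balancer's choice of A record doesn't muddy the results. It assumes dig and wget are installed; the function names and the 30-second timeout are assumptions for illustration, not what was actually run.

```shell
# Host and path are taken from the wget test case above.
HOST=ftp.ncbi.nlm.nih.gov
FILE_PATH=geo/series/GSE34nnn/GSE34777/matrix/GSE34777_series_matrix.txt.gz

# Build the FTP URL for a single address (hypothetical helper).
url_for() {
    printf 'ftp://%s/%s\n' "$1" "$FILE_PATH"
}

# Resolve the name and try the download from each address in turn.
# Only defined here, not called, because it needs network access.
test_each_address() {
    # Keep only dotted-quad lines, in case dig also prints CNAME targets.
    for ip in $(dig +short A "$HOST" | grep -E '^[0-9.]+$'); do
        # -T 30: give up after 30 seconds (assumed value); -t 1: no retries.
        if wget -q -T 30 -t 1 -O /dev/null "$(url_for "$ip")"; then
            echo "$ip ok"
        else
            echo "$ip FAILED"
        fi
    done
}
```

With a 30-second TTL, running test_each_address a few times over several minutes should eventually cover all of the addresses.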

At first I thought this might indicate network problems here at $WORK, but the folks I contacted insisted nothing had changed, we weren't behind any additional firewalls, and all our packets took the same route to both sets of addresses. So I checked our firewall, and couldn't find anything there -- no blocked packets, and to the best of my knowledge no changed settings. Weirdly, running the wget command on the firewall itself (which runs OpenBSD, instead of CentOS Linux like our servers) worked... that was an interesting rabbit hole. But if I deked out the firewall entirely and put a server outside, it still failed.

Then I tripped over the fix: lowering the MTU from our usual 9000 bytes to 8500 bytes made the transfers work. (Yes, 8500 and no more; 8501 fails, 8500 or below works.) And what has an MTU of 8500 bytes? Cisco Firewall Service Modules, which are in use here at $WORK -- though not (I thought) on our network. I contacted the network folks again, they double-checked, and said no, we're not suddenly behind an FSM. And in fact, their MTU is 8500 nearly everywhere... which probably didn't happen overnight.
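Finding an exact cutoff like 8500/8501 by hand is tedious, and a binary search gets there in about a dozen tries. This is only a sketch: largest_working and probe_mtu are hypothetical names, eth0 and the 30-second timeout are placeholders, and it assumes the smallest size works and that everything above the cutoff fails.

```shell
# URL is the wget test case from earlier in the post.
URL=ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE34nnn/GSE34777/matrix/GSE34777_series_matrix.txt.gz

# Binary-search the largest size in [LO, HI] for which PROBE succeeds.
# Usage: largest_working LO HI PROBE   (PROBE is called with one size argument)
largest_working() {
    lo=$1 hi=$2 probe=$3
    while [ "$lo" -lt "$hi" ]; do
        # Round up so the loop always makes progress.
        mid=$(( (lo + hi + 1) / 2 ))
        if "$probe" "$mid"; then
            lo=$mid               # mid works: the cutoff is at mid or above
        else
            hi=$(( mid - 1 ))     # mid fails: the cutoff is below mid
        fi
    done
    echo "$lo"
}

# Hypothetical probe: set the interface MTU, then retry the download.
# Needs root and network access, and changes the MTU of a live interface,
# so it belongs on a test box only. It is defined here but not called.
probe_mtu() {
    ip link set dev "${IFACE:-eth0}" mtu "$1" &&
        wget -q -T 30 -t 1 -O /dev/null "$URL"
}
```

On a test machine this would be run as something like largest_working 1500 9000 probe_mtu.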

Changing the MTU here was an imposing thought; I'd have to change it everywhere, at once, and test with reboots... Bleah. Instead, I decided to try TCP MSS clamping with this iptables rule:

iptables -A OUTPUT -p tcp -d 130.14.250.0/24 --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 8460

(Again, 8460 or below works; 8461 or above fails.) It's a hack, but it works. I'm going to contact the NCBI folks and ask if anything's changed at their end.
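For the record, the 8460 lines up with the 8500-byte MTU: the MSS is the MTU minus the IPv4 and TCP headers, 20 bytes each when neither carries options. A quick check of the arithmetic (mss_for_mtu is a name made up for this example):

```shell
# MSS = MTU - 20 (IPv4 header) - 20 (TCP header), assuming no IP or TCP options.
mss_for_mtu() {
    echo $(( $1 - 20 - 20 ))
}

mss_for_mtu 8500   # prints 8460, the value used in the iptables rule
mss_for_mtu 9000   # prints 8960, what our usual 9000-byte MTU would advertise
```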