There it was, gone
30 Oct 2009
Following in Matt's footsteps, I ran into a serious problem just before heading to LISA.
Wednesday afternoon, I'm showing my (sort of) backup how to connect to the console server. Since we're already on the firewall, I get him to SSH to it from there, I show him how to connect to a serial port, and we move on.
About an hour later, I get paged about problems with the database server: SSH and SNMP aren't responding. I try to log in, and sure enough it hangs. I connect to its console and log in as root; it works instantly. Uhoh, I smell LDAP problems...only there's nothing in the logs, and id <uid> works fine. I flip to another terminal and try SSHing to another machine, and that doesn't work either. But already-existing sessions work fine until I try to run sudo or do ls -l. So yeah, that's LDAP.
I try connecting via openssl to the LDAP server (stick alias telnets='openssl s_client -connect' in your .bashrc today!) and get this:
CONNECTED(00000003)
...and that's all. Wha? I tried connecting to it from the other LDAP server and got the usual (certificate, certificate chain, cipher, driver's license, note from mom, etc). Now that's just weird.
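For the scripting-inclined, here's a rough Python take on the same check: it reports whether the attempt died at the TCP connect or hung in the TLS handshake, which is exactly the distinction that mattered here. This is a sketch, not something I ran during the outage; the function name probe_tls and the five-second timeout are made up for illustration, and certificate verification is switched off on purpose because this only tests reachability.

```python
import socket
import ssl


def probe_tls(host, port, timeout=5.0):
    """Return a short string saying how far a TLS connection attempt got."""
    try:
        # The timeout set here also applies to the handshake below.
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return f"tcp connect failed: {exc.__class__.__name__}"

    # Verification is disabled deliberately: this is a reachability probe,
    # not a trust decision. Do not reuse this context for real traffic.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    try:
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return f"tls ok: {tls.version()}"
    except socket.timeout:
        # TCP is up but the server's half of the handshake never arrived.
        return "tcp connected, tls handshake timed out"
    except (ssl.SSLError, OSError) as exc:
        return f"tcp connected, tls handshake failed: {exc.__class__.__name__}"
    finally:
        raw.close()
```

Called as probe_tls("ldap.example.org", 636) (hostname hypothetical), a hang like the one above should come back as the "handshake timed out" case rather than blocking forever.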
After a long and fruitless hour trying to figure out if the LDAP server had suddenly decided that SSL was for suckers and chumps, I finally thought to run tcpdump on the client, the LDAP server and the firewall (which sits between the two). And there it was, plain as day:
- 3-way handshake
- client says "I speak SSL!"
- server says "I speak SSL too! Here you go!"
- but the client never sees that packet
- and neither does the firewall.
Near as I can figure, this was the sequence of events:
- We SSH'd from the firewall, with its two bridged Intel GigE jumbo-enabled NICs
- to the console server, which only does 10/100
- which somehow prompted a renegotiation of the link speed on the firewall's interface
- which settled on 100 Mbit, full duplex, but with jumbo frames
- which the switch saw as completely bogus
- which prompted the switch to (silently, natch) drop all jumbo frames directed at the firewall's outside interface
- which, in the context of an LDAP lookup done by a client inside the firewall, meant that the first packet that failed was the "I speak SSL too! Here you go!" packet
- which left the client with an established TCP connection to the LDAP server, waiting for a certificate
- which meant that it never actually failed over to the other LDAP server.
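The last two steps are the nasty part: a server that accepts the TCP connection and then goes quiet looks "up" to any client that only times out the connect. Here's a minimal sketch of failover that also times out the first reply. It is not how any particular LDAP client behaves; the function name first_responsive is invented, and a plain send-and-wait stands in for the real SSL handshake.

```python
import socket


def first_responsive(servers, timeout=3.0, greeting=b"hello\n"):
    """Return the first (host, port) that actually answers, or None.

    A server that completes the TCP handshake but never replies is
    treated the same as one that refuses the connection.
    """
    for host, port in servers:
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                # Stand-in for the client's "I speak SSL!" message.
                s.sendall(greeting)
                # recv() waits at most `timeout` seconds for any reply;
                # an empty result means the peer closed the connection.
                if s.recv(1):
                    return (host, port)
        except OSError:
            # Covers refused connections and timeouts alike; try the next one.
            continue
    return None
```

With a timeout on that first read, the client in this story would have given up on the unreachable server after a few seconds and moved on to the second one.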
This took me two hours to figure out, and another 90 minutes to fix; setting the link speed manually on the firewall just convinced the NIC/driver/kernel that there was no carrier there. In the end the combination that worked was telling the switch it was a gigabit port, but letting it negotiate duplexiciousnessity.
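In hindsight, a dumb periodic check for "jumbo MTU on a link slower than gigabit" might have pointed at the problem sooner. A sketch, assuming a Linux box that exposes speed, duplex and mtu under /sys/class/net; the function names and the thresholds are mine, and the sysfs_root parameter exists only so the thing can be tested against a fake directory.

```python
from pathlib import Path


def read_link(iface, sysfs_root="/sys/class/net"):
    """Read speed (Mbit/s), duplex and MTU for one interface from sysfs."""
    base = Path(sysfs_root) / iface

    def read(name):
        try:
            return (base / name).read_text().strip()
        except OSError:
            # Missing interface, or the driver will not report the value
            # (reading speed can fail when there is no link).
            return None

    speed = read("speed")
    mtu = read("mtu")
    return {
        "speed": int(speed) if speed is not None else None,
        "duplex": read("duplex"),
        "mtu": int(mtu) if mtu is not None else None,
    }


def looks_suspicious(link, jumbo_mtu=1500, min_jumbo_speed=1000):
    """True when the MTU exceeds jumbo_mtu on a link below min_jumbo_speed."""
    if link["speed"] is None or link["mtu"] is None:
        return False
    # A non-positive speed means "unknown", so it is not flagged.
    return link["mtu"] > jumbo_mtu and 0 < link["speed"] < min_jumbo_speed
```

Something like looks_suspicious(read_link("eth0")) (interface name hypothetical) could go into whatever monitoring is already polling the firewall.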
Gah. Just gah.
4 Comments
From: Matt Simmons
30 October 2009 20:36:45
Yikes! Good catch, though. It is a bummer on the timing, but at least it happened now, instead of while you were gone. I don't even know how you'd troubleshoot that remotely.
Look me up at LISA. Are you coming to the "Blogger" BoF?
From: Saint Aardvark
30 October 2009 20:50:57
Oh aye. But are you coming to the Conference Organization BoF?
From: Matt Simmons
30 October 2009 21:00:39
Oh, I didn't see it! I was looking at the Network Automation BoF. Can't I just clone myself? :-)
Maybe I'll switch off. I take it you're going to the Conference Organization BoF?
From: Saint Aardvark
31 October 2009 00:22:50
I'm organizing the CO BoF. :-)