title: The Johnstown Arp Flood
date: 2004-08-10 12:45:17

So last night, in the midst of other network problems, I noticed lots of messages like this on our FreeBSD machines:

/kernel: arplookup 1.2.3.4 failed: host is not on local network

WTF? But as this was not the network problem I was looking for, I decided I did not need to see its identification and left it.
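For reference, that message boils down to an on-link check: the kernel was asked to do ARP for an address that doesn't fall inside any subnet configured on its interfaces. A minimal Python sketch of that check, assuming a made-up local subnet of 3.4.0.0/16 (the real netmask isn't given here) and a hypothetical helper called `is_on_link`:

```python
import ipaddress

# Hypothetical: the subnets configured on this host's interfaces.
# The real network and netmask are not stated in this post.
LOCAL_NETWORKS = ["3.4.0.0/16"]


def is_on_link(addr, local_networks):
    """Return True if addr falls inside one of the directly attached subnets."""
    ip = ipaddress.ip_address(addr)
    return any(ip in ipaddress.ip_network(net) for net in local_networks)


for addr in ("3.4.1.2", "1.2.3.4"):
    verdict = "on local network" if is_on_link(addr, LOCAL_NETWORKS) else "NOT on local network"
    print(addr, "is", verdict)
```

An address that fails this test should normally be reached through a gateway, so seeing ARP traffic for it on the local wire is a sign that something is sending frames it shouldn't.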

This morning, an SSH session to a test box became unresponsive after being idle for a few minutes; pings didn't work either. I ran over and checked that no one had disconnected it -- no one had -- and was able to ping my machine and others from it just fine.

As I walked back to my desk, someone came up to me and asked if the network was having problems right now, since he was no longer able to reach the network from his machine. I asked him to give it a try again, and went back to my desk to try getting to the test machine again. It was fine at first, then stopped responding again -- no SSH, no ping. I called the other guy and asked him if his computer was fine -- it was. I tested connectivity from my computer, but everything responded just fine except for the test box.

I walked over again, logged in and tried pinging my desktop machine. It took a good five or six seconds to respond, but then the responses were seen starting at packet 0. What the...I checked out other machines and saw that the arplookup message was turning up again. Time to check it out.

Well, the first hint was that the address was almost, but not quite, one I'd assigned to a developer for his User-mode Linux sandbox: 3.4.1.2. I logged in to the sandbox and checked it out, but there was no indication it was using the bogus address -- ifconfig turned up nothing, nor did tcpdump or the arp table. I checked the other machines' arp tables, but they had no entry either.

Then I remembered that this guy had been complaining about intermittent slow network access yesterday and today. At the time I figured it was related to the original problems I'd been checking out, but maybe it wasn't. I decided to grit my teeth and talk to him.

First clue: the arp table on his W2K box (poor bastard) had an entry for 1.2.3.4. Aha! I tried running tcpdump on a laptop running FreeBSD hooked up to the same switch, but no luck. I was kind of hoping for a stray arp who-has or something, but I guess not.
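For anyone wanting to do the same kind of sniffing, this is roughly what a who-has looks like once decoded. A minimal Python sketch that unpacks the 28-byte Ethernet/IPv4 ARP payload with `struct`; the sample packet, its MAC address, and the `parse_arp` helper are made up for illustration and are not captured traffic:

```python
import socket
import struct

# Ethernet/IPv4 ARP payload: htype, ptype, hlen, plen, oper,
# sender MAC, sender IP, target MAC, target IP (28 bytes, network byte order).
ARP_FORMAT = "!HHBBH6s4s6s4s"
ARP_OPS = {1: "request", 2: "reply"}


def format_mac(raw):
    """Render 6 raw bytes as aa:bb:cc:dd:ee:ff."""
    return ":".join(f"{b:02x}" for b in raw)


def parse_arp(payload):
    """Decode the ARP payload that follows the 14-byte Ethernet header."""
    size = struct.calcsize(ARP_FORMAT)
    _htype, _ptype, _hlen, _plen, oper, sha, spa, tha, tpa = struct.unpack(
        ARP_FORMAT, payload[:size]
    )
    return {
        "op": ARP_OPS.get(oper, str(oper)),
        "sender_mac": format_mac(sha),
        "sender_ip": socket.inet_ntoa(spa),
        "target_mac": format_mac(tha),
        "target_ip": socket.inet_ntoa(tpa),
    }


# Hypothetical who-has: the sandbox address asking for the bogus one.
sample = struct.pack(
    ARP_FORMAT,
    1,        # hardware type: Ethernet
    0x0800,   # protocol type: IPv4
    6, 4,     # MAC and IPv4 address lengths
    1,        # operation: request
    bytes.fromhex("020000000001"),  # made-up sender MAC
    socket.inet_aton("3.4.1.2"),
    bytes(6),                       # target MAC unknown in a request
    socket.inet_aton("1.2.3.4"),
)

print(parse_arp(sample))
```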

Second clue: he mentioned in passing that the test box of his own by his desk, a single-board computer running Linux, was being used to test an ethernet driver that he was developing. He had the kernel set up to ping his User-mode Linux every five seconds. He also had it set up to watch for incoming traffic, swap bytes around, and then send it to his User-mode Linux. Aha!
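The exact swap the driver performed isn't spelled out here, but exchanging the two 16-bit halves of 3.4.1.2 does produce 1.2.3.4, which fits the evidence. A minimal Python sketch under that assumption (`swap_halves` is a hypothetical name, not anything from the driver):

```python
import ipaddress


def swap_halves(addr):
    """Exchange the upper and lower 16 bits of an IPv4 address.

    Assumption: this is the kind of swap the test driver applied;
    the post only says it would "swap bytes around".
    """
    value = int(ipaddress.IPv4Address(addr))
    swapped = ((value & 0xFFFF) << 16) | (value >> 16)
    return str(ipaddress.IPv4Address(swapped))


print(swap_halves("3.4.1.2"))
```

Applying the swap twice gets the original address back, so a frame that went through the driver an odd number of times would carry the bogus address.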

We agreed:

My suspicion at this point is that the bogus traffic was wreaking havoc with ARP tables, possibly those belonging to the (cheap, unmanaged, hopefully soon-to-be-upgraded-to-expensive-Cisco-Catalyst) switches that connect our whole network, or possibly just several Very Important Servers. I realize that it only caused problems with a few machines, not the whole network, but I'm wondering if this might be similar to what happened in February. Gotta say, though...the intense debugging activity almost fulfilled my deep-seated sysadmin fantasy of debugging raw Ethernet frames on the fly, so I'm happy.