I've seen the rains of the real world come forward on the plains
I've seen the Kansas of your sweet little myth...
I'm half-drunk on babble you transmit
Through your true dreams of Wichita.
"True Dreams of Wichita", Soul Coughing
This morning I had the SELinux tutorial, taught by Rik Farrow. I took a
moment to shake his hand and
tell him that ;login: magazine, like, changed my life, man, you
know? If you haven't picked up copies of that magazine/journal, you
owe it to yourself to do so. (And if you have and you agree with me,
send him an email -- he usually only gets email as editor when there's
a problem.)
The course was quite interesting. Some choice bits:
"How many of you are using SELinux?" (Two hands) "How many of you
have disabled SELinux?" (a hundred hands and six tentacles; yes,
even Cthulhu disables SELinux) "See, that's why I came up with this
course; I kept seeing instructions that started with 'Disable
SELinux' and I wanted to know why."
Me: So how do the big guys test their firewall changes?
Matt: I dunno...probably separate routers, duplicate hardware...
Me: Probably golden coffee cup holders, too.
Matt: Jerks.
You don't write SELinux policy. SELinux policy is hard. It's
NP-complete and makes baby Knuth cry. Instead, you use what other
people have written, and make use of booleans to toggle different
bits of policy.
However, the size of the SELinux policy is big and is only getting
bigger. There are something like 85,000 or more rules in recent
versions of RHEL/CentOS. This is very close to RF's rule of thumb
that a really, really smart and experienced person, who's been
intimately involved in its creation, can only comprehend about
100,000 lines of code. This worries him.
Also, the problem of using SELinux is complicated by a lack of
up-to-date documentation; like everything else it's a fast-moving
target, and a book published in 2007 is now half out-of-date.
But this should not stop you from using SELinux now; it's handy,
it's here, get used to it. Example of SELinux stopping ntpd from
running /bin/bash; the SELinux audit file was the only sign.
"In a multi-level secure system, files tend to migrate to higher
security levels, and the system becomes less usable. But that's
beyond the scope of this class."
(On programs with long histories of serious security problems)
"Flash is the Sendmail of -- what do we call this decade? the
naughts?"
(On the difficulty of trying to decode SELinux audit logs) "It says
the program 'local' had a problem. 'Local'. What the heck is that?
Part of Postfix. Oh, good. Thanks for the descriptive name,
Wietse."
Something I hope to quiz him further on: "Most Linux systems have a
single filesystem." Really?
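Back to those audit logs for a second. Here's a rough sketch of a first pass at "who is SELinux saying no to?", inspired by the ntpd and 'local' examples above. It assumes AVC denials are logged with a comm="..." field, as auditd does on RHEL/CentOS; the function name denied_by_comm is made up for illustration and is not something from the class.

```shell
#!/bin/sh
# Hypothetical helper: count SELinux AVC denials per command name.
# Reads audit-log text on stdin and prints "count command" lines,
# busiest offender first.
denied_by_comm() {
    grep 'avc:.*denied' |
        sed -n 's/.*comm="\([^"]*\)".*/\1/p' |
        sort | uniq -c | sort -rn
}
```

Something like denied_by_comm < /var/log/audit/audit.log (as root, and assuming that's where your audit log lives) gives you a short list of names to go look up. From there, audit2why can explain an individual denial, and getsebool -a shows the booleans you can toggle with setsebool rather than writing policy yourself.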
During the break I met a guy who works with the Norwegian
Meteorological service. This was interesting. He's got 250TB in
production right now, and increasing CPU power means that their models
can increase their spatial resolution, which means increasing
(doubling?) their storage requirements. He talked briefly about
running into problems with islands of storage, but I got distracted
before I could quiz him further...
...by his story of building a new server room where they were
capturing the waste heat and using it to heat the building.
Interesting; what kind of contribution would it be making to the
overall heating budget? Probably not much, but it all just goes on
the grid anyhow, like the hot water from the garbage dump. What?
Turns out that there is a city-wide network of hot-water pipes that
collects heat from, among other places, water heaters powered by waste
methane from rotting garbage. So they don't use the methane to make
electricity and dump it in the electrical grid; they use it to heat
hot water and dump that in the hot water grid, consisting of
insulated water pipes buried in the ground, which places around the
city (and beyond!) will use. We've got what you could call a steam
grid at UBC and probably other universities, but I'd never thought of
doing this city-wide.
Oh, and he signed my LISA card, which was the second time he got asked
today; he was wearing a LISA t-shirt and so he was fair game.
At lunch I buttonholed Jay a bit. I asked him about his
coworker's firewall unit testing scheme. He said he's no longer
working at that place, but it ended up being a lot less useful than
they thought it would be. When I asked why, he said that 90% worked
but 10% didn't; that 10% was things like network isolation (to avoid
problems with using real IP addresses), and the fact that the
interface to the three machines was QEMU serial connections...less
than ideal.
The conversation shifted to firewalling, and another guy who was there
mentioned that he loved OpenBSD's pf, but had to use iptables because
of driver problems that prevented getting full performance out of
10GigE NICs with OpenBSD. Jay said they'd looked at the same problem
at his place o' work, and in his words "It was cheaper to throw 8 GigE
NICs in a box and pay someone to make Linux interface bonding not
suck."
Following in Matt's footsteps, I ran into a serious problem just
before heading to LISA.
Wednesday afternoon, I'm showing my (sort of) backup how to connect to
the console server. Since we're already on the firewall, I get him to
SSH to it from there, I show him how to connect to a serial port, and
we move on.
About an hour later, I get paged about problems with the database
server: SSH and SNMP aren't responding. I try to log in, and sure
enough it hangs. I connect to its console and log in as root; it
works instantly. Uhoh, I smell LDAP problems...only there's nothing
in the logs, and id <uid> works fine. I flip to another terminal
and try SSHing to another machine, and that doesn't work either.
But already-existing sessions work fine until I try to run sudo or
do ls -l. So yeah, that's LDAP.
I try connecting via openssl to the LDAP server (stick alias
telnets='openssl s_client -connect' in your .bashrc today!) and get
this:
CONNECTED(00000003)
...and that's all. Wha? I tried connecting to it from the other LDAP
server and got the usual (certificate, certificate chain, cipher,
driver's license, note from mom, etc). Now that's just weird.
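For next time, here's a minimal sketch of wrapping that s_client check so a stalled handshake gets reported instead of hanging the terminal. It assumes the GNU coreutils timeout command is available (it exits with 124 when it has to kill the command); the name probe_tls and the status strings are invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: probe_tls host:port [seconds]
# Reports whether an SSL/TLS handshake completes, stalls, or fails.
probe_tls() {
    target="$1"
    secs="${2:-5}"
    rc=0
    # </dev/null makes s_client quit as soon as the handshake is done.
    timeout "$secs" openssl s_client -connect "$target" \
        </dev/null >/dev/null 2>&1 || rc=$?
    case "$rc" in
        0)   echo "ok: handshake completed" ;;
        124) echo "stalled: no answer within ${secs}s" ;;
        *)   echo "failed: connection or handshake error" ;;
    esac
}
```

Usage would be something like probe_tls ldap.example.com:636, where both the hostname and the LDAPS port are placeholders. Note that "stalled" can't tell a hung TCP connect from a hung handshake; for that you still want tcpdump.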
After a long and fruitless hour trying to figure out if the LDAP
server had suddenly decided that SSL was for suckers and chumps, I
finally thought to run tcpdump on the client, the LDAP server and the
firewall (which sits between the two). And there it was, plain as
day:
3-way handshake
client says "I speak SSL!"
server says "I speak SSL too! Here you go!"
but the client never sees that packet
and neither does the firewall.
Near as I can figure, this was the sequence of events:
We SSH'd from the firewall, with its two bridged Intel GigE jumbo-enabled
NICs
to the console server, which only does 10/100
which somehow prompted a renegotiation of the link speed on the
firewall's interface
which settled on 100 MBit, full duplex, but with jumbo frames
which the switch saw as completely bogus
which prompted the switch to (silently, natch) drop all jumbo frames directed at the
firewall's outside interface
which, in the context of an LDAP lookup done by a client inside the
firewall, meant that the first packet that failed was the "I speak
SSL too! Here you go!" packet
which left the client with an established TCP connection to the LDAP
server, waiting for a certificate
which meant that it never actually failed over to the other LDAP
server.
This took me two hours to figure out, and another 90 minutes to fix;
setting the link speed manually on the firewall just convinced the
nic/driver/kernel that there was no carrier there. In the end the
combination that worked was telling the switch it was a gigabit port,
but letting it negotiate duplexiciousnessity.
So this morning, again, I got paged about machines in our server
room dropping off the network. And again, it was the bridge that was
the problem. This time, though, I think I've figured out what the
problem is.
The firewall has two interfaces, em0 (on the outside) and em1 (on
the inside), which are bridged. em1 has an IP address. I was able
to SSH to the machine from the outside and poke around a bit. I still
didn't find anything in the logs, but I did notice this (edited for brevity):
$ ifconfig
em0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 9000
See that? em1 has OACTIVE set. A quick search turned up some
interesting hits, so for fun I tried resetting the interface:
$ sudo ifconfig em1 down
$ sudo ifconfig em1 up
and huzzah! it worked.
When I got to work I did some more digging and figured out that this
and the earlier outage were almost certainly caused by running
a full backup, via Bacula, of the /home partition on the machine.
The timing was just about exact. The weird thing, though, is that
the partition itself is smaller than /var, which was backed up
successfully both times.
The bacula file daemon logged this on the firewall:
Oct 28 02:46:15 bacula-fd: backup-fd JobId 3761: Fatal error: backup.c:892 Network send error to SD. ERR=Broken pipe
Oct 28 02:46:15 bacula-fd: backup-fd JobId 3761: Error: bsock.c:306 Write error sending 36841 bytes to Storage daemon:backup.example.com:9103: ERR=Broken pipe
With the earlier outage it was 65536 bytes, but otherwise the same
error.
Okay, so the firewall's working again...now what? I'm about to head
off to LISA in three days, so I can't very well upgrade to the
latest OpenBSD right now. I settled for:
turning off full backups on the firewall (everything important is
kept in Subversion anyhow), and
running a script from cron every 10 minutes that checks for the
OACTIVE flag and, if found, resets the interface.
Hopefully that'll keep things going 'til I get back.
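The cron job can be sketched roughly like this. It is a reconstruction under assumptions, not the exact script: the function names, the logger tag and the "run" argument are all made up, and it assumes OACTIVE shows up inside the flags=...<...> line of ifconfig output.

```shell
#!/bin/sh
# Hypothetical watchdog: bounce an interface that has OACTIVE set.
# Needs to run as root, since it takes the interface down and up.
IFACE="${IFACE:-em1}"

# Succeeds if ifconfig output on stdin has OACTIVE in its flags.
has_oactive() {
    grep -q 'flags=.*<.*OACTIVE.*>'
}

# Reset the named interface, but only if it looks stuck.
reset_if_stuck() {
    if ifconfig "$1" | has_oactive; then
        logger -t oactive-watchdog "$1 has OACTIVE set; resetting"
        ifconfig "$1" down
        ifconfig "$1" up
    fi
}

# Do nothing unless explicitly asked, so the file is safe to source.
if [ "${1:-}" = "run" ]; then
    reset_if_stuck "$IFACE"
fi
```

A crontab entry along the lines of */10 * * * * /usr/local/sbin/oactive-watchdog.sh run (path hypothetical) would run it every 10 minutes. Logging each reset matters: otherwise the band-aid quietly hides how often the problem comes back.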
Yesterday I got paged by one of my two Nagios boxes (learned that
trick the hard way): a bunch of the machines in our server room
had dropped off the network. Weirdly, this did not include the
other Nagios box that's over there. WTF?
I logged into the server room's Nagios box, and sure enough couldn't
ping the servers or the firewall. I could ping the console
server...which was also on the Outside VLAN along with the monitoring
box, as opposed to the Inside VLAN with the servers, which sat behind
our firewall.
I was also able to ping the management cards/ILOMs/SPs/whatever the
kids are calling them in the servers. Thankfully they're Sun boxes,
so no Vista-like maze of flavours there...they all come with
console redirection. I logged in and fired up a console, panicing
because I thought that perhaps the newly-installed NUT clients
had shut down the machines because I'd overlooked something.
But no...the machines were up, though hung if you tried to do any LDAP
lookups. (Through an oversight, the LDAP server was also on the
Outside VLAN. I'll be fixing that today.) Modulo that, they seemed
fine.
So I logged into the firewall, which runs OpenBSD 4.3 in bridging
mode. And this is where the weirdness lay: the bridge, and/or its
component cards, was not working. ifconfig and brconfigsaid
they were up and fine, and the ARP table was still populated (not sure
what the lifetime of entries is -- isn't it around 20 minutes or so?
must check -- but by this time the problem had been going on for about
an hour). Yet I couldn't ping the firewall (one of those cards has an
address) from either side, and I couldn't ping anything from the
firewall.
pfctl -s all didn't show anything suspicious. There were no obvious
problems in dmesg or /var/log/messages. I disabled, then
re-enabled, the firewall to no effect. I ran /etc/netstart to no
effect.
I even checked on the switches to see if the firewall's MAC address
was showing up anywhere, and it was not -- not even directly after
pinging it (and getting no response).
In the end I rebooted the machine and all was well.
The NIC in question is a dual-port Intel Pro 1000 (MT, I believe) that
I've never had problems with. I've never come across problems like
this before on OpenBSD (or, I think, anywhere else). The onboard
Broadcom (boo, hiss) was acting fine...it was also on the ILOM's
VLAN, and could see the other ILOMs just fine. (In fact, I should
have just SSHd to the firewall using that VLAN from the Nagios box,
rather than futz around with a 9600 bps console. Next time.)
So...that's my mystery for the weekend.
In other news, my older son (3.25 yrs) has taken to the stage in a
big way: he now stands on top of the steps going up from our living
room and sings us songs into one of at least two microphones.
"Barbara Ann", anything by The Wiggles, and "Yo Gabba Gabba!" songs
are prominent. This is after at least three solid weeks of guitar
playing, where anything and everything gets strummed while being
cradled in his arms while he sings, or maybe makes feedback sounds
that'd make Yo La Tengo proud.
Meanwhile, my younger (1.5 yrs) has started saying lots of different
phonemes, which is a real contrast to using "Dat!" for monkey, cereal,
ball, yes, no, President Barack Obama's attempted health care reforms,
and Linux. He has also begun sleeping in 'til 6:30 or 7:00 in the
morning, which lets me write things like this. Both are infinitely
endearing.
And incidentally, I really need to set up Nagios dependencies. I've
had to ACK 27 services in a row (unrelated (I think) problem with ILOM
temperature taking means SNMP checks are timing out). Either that or
there's some way that you can select n services in Nagios to ack all
at once. Anyone?
I just saw on Undeadly.org that orders for OpenBSD CDs are 'way
down this year. Without OpenSSH and pf, I wouldn't be able to do my
job nearly as well as I do. I've ordered a set for work (good excuse
to upgrade that firewall), and ordered a set for home and tossed 'em
$50 as well. I encourage you to do the same.
In the words of the original rant:
Do you use OpenBSD for fun? Contribute.
Do you use OpenBSD for work? Contribute.
Does OpenBSD allow you to worry about the problem you are trying
to solve rather than the tools? Contribute.
Do you wish your employer used the OpenBSD quality standard in
your work? Contribute.
Does your employer use OpenBSD? Ask them to contribute (after
you do, of course).
Do you bundle OpenBSD or subprojects like OpenSSH into your
product? Contribute big! (you won't, you rarely do, but hey,
I'll ask anyway)
Do you find yourself wondering why so few take computer software
quality seriously? Contribute!
Do you have any idea how fucking insane the h.323 protocol is? Anyone
who runs a h.323 should get shoved out a window, beaten, flayed,
spanked, shot, disembowled, hung, and forced to listen to hummpa music. If
you want to firewall h.323, go commit yourself to an asylum with
straight jackets and with padded walls -- at least you'll be in common
company with the other linux wacko's.
Sat down tonight to create a firewall for a new OpenBSD web server I'm
setting up, and holy crap is pf ever good. I got to test the
firewall syntax before loading it, and as a result I had a working
firewall the first fucking time I loaded it. That's never happened
before; I fully expected that this time, as every other time with a new
firewall (let alone a new firewall language!), I'd have to reboot or
log in with a keyboard or serial cable, or something.
But no: not only did I not lock myself out, not only was this the
first time (well, nearly) that I'd read the FAQ, the firewall
does everything I wanted it to: no extra packets in, no extra packets
out. Wow.
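The test-before-load step can be sketched as a tiny wrapper. This assumes the syntax check was pfctl's -n flag, which parses a ruleset without loading it; the function name load_pf_rules is made up for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: only load a pf ruleset if it parses cleanly.
load_pf_rules() {
    rules="${1:-/etc/pf.conf}"
    # -n: parse the rules but don't load them.
    if pfctl -nf "$rules"; then
        pfctl -f "$rules"
    else
        echo "syntax check failed; keeping the running ruleset" >&2
        return 1
    fi
}
```

Run as root, load_pf_rules /etc/pf.conf leaves the running rules alone whenever the parse fails. A clean parse only proves the syntax is good; it won't save you from a perfectly valid rule that blocks your own SSH session.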