This took me a while to figure out. (All my war stories start with that sentence...)
A faculty member is getting a new cluster next year. In the meantime, I've been setting up Rocks on a test bed of older machines to get familiar with it. This week I've been working out how Torque, Maui and MPI work, and today I tried running something non-trivial.
CHARMM is used for molecular simulations; it's mostly (I think) written in Fortran and has been around since the 80s. It's not the worst-behaved scientific program I've had to work with.
I had an example script from the faculty member to run. I was able to run it on the head node of the cluster like so:
mpirun -np 8 /path/to/charmm < stream.inp > out ZZZ=testscript.inp
8 CHARMM processes still running after, like, 5 days. (These things run forever, and I got distracted.) Sweet!
Now to use the cluster the way it was intended: by running the processes on the internal nodes. Just a short script and away we go:
$ cat test_mpi_charmm.sh
#PBS -N test_charmm
#PBS -S /bin/sh
#PBS -N mpi_charmm
#PBS -l nodes=2:ppn=4
. /opt/torque/etc/openmpi-setup.sh
mpirun /path/to/charmm < stream.inp ZZZ=testscript.inp
$ qsub test_mpi_charmm.sh
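(While the job is queued or running, you can see which nodes Torque actually handed it -- worth watching, given what follows:
$ qstat -n
The -n flag just adds the allocated node/processor list to the usual qstat output.)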
But no, it wasn't working. The error file showed:
At line 3211 of file ensemble.f
Fortran runtime error: No such file or directory
mpirun has exited due to process rank 0 with PID 9494 on
node compute-0-1.local exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
Well, that's helpful...but the tail of the output file showed:
CHARMM> ensemble open unit 19 read card name -
CHARMM> restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst"
Parameter: FILEROOT -> "TEST_RUN"
Parameter: PREV -> "FOO"
Parameter: NREP -> "1"
Parameter: NODE -> "0"
ENSEMBLE> REPLICA NODE 0
ENSEMBLE> OPENING FILE restart/test_run_foo_nr1_nd0
ENSEMBLE> ON UNIT 19
ENSEMBLE> WITH FORMAT FORMATTED AND ACCESS READ
What the what now?
Turns out CHARMM has the ability to checkpoint work as it goes along, saving its work in a restart file that can be read when starting up again. This is a Good Thing(tm) when calculations can take weeks and might be interrupted. From the charmm docs, the restart-relevant command is:
IUNREA -1 Fortran unit from which the dynamics restart file should
be read. A value of -1 means don't read any file.
(I'm guessing a Fortran unit is something like a file descriptor; haven't had time to look it up yet.)
The name of the restart file is set in this bit of the test script:
iunrea 19 iunwri 21 iuncrd 20
Next is this bit:
ensemble open unit 19 read card name -
"restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst"
An @ sign indicates a variable, it seems. And it's Fortran, and Fortran's been around forever, so it's case-insensitive. So the restart file is being set to "restart/@FILEROOT_@PREV_nr@NREP_nd@NODE.rst". Snipping from the input file, here's where the variables are set:
set fileroot test
set prev minim
set node ?whoiam
set nrep ?nensem
test" appears to be just a string. I'm assuming "minim" is some kind of numerical constant. But "whoiam" and "nensem" are set by MPI and turned into CHARMM variables. From charmm's documentation:
The CHARMM run is started using MPI commands to specify the number of processes
(replicas) to use, each of which is an identical copy of charmm. This number
is automatically passed by MPI to each copy of the executable, and it is set to
the internal CHARMM variable 'nensem', which can be used in scripts, e.g.
set nrep ?nensem
The other internal variable set automatically via MPI is 'whoiam', e.g.
set node ?whoiam
These are useful for giving different file names to different nodes.
So remember the way charmm was being invoked in the two jobs? The way it worked:
mpirun -np 8 ...
...and the way it didn't:
mpirun ...
Aha! Follow the bouncing ball:
At first I thought that I could get away with increasing the number of copies of charmm that would run by fiddling with torque/server_priv/nodes -- telling it that the nodes had 4 processors each (so a total of 8) rather than 2. (These really are old machines.) And then I'd change the PBS line to "-l nodes=2:ppn=4", and we're done! Hurrah! Except no: I got the same error as before.
What does work is changing the mpirun args in the qsub file:
mpirun -np 8 ...
However, what that does is run 8 copies on a single compute node -- which works, hurrah, but it's not what I think we want.
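If the goal is eight replicas spread across both nodes, my untested guess is something like this in the qsub script -- count the slots Torque handed us and give mpirun the node file explicitly (the openmpi-setup.sh sourcing may already make the -machinefile part redundant):

NP=$(wc -l < $PBS_NODEFILE)
mpirun -np $NP -machinefile $PBS_NODEFILE /path/to/charmm < stream.inp ZZZ=testscript.inp

But since the script's restart logic keys off ?nensem and ?whoiam, changing how many replicas run has knock-on effects I don't pretend to understand.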
I think this is a problem for the faculty member to solve, though. It's taken me a whole day to figure this out, and I'm pretty sure I wouldn't understand the implications of just (say) deking out the bit that looks for a restart file. (Besides, there are many such bits.) (Oh, and incidentally, just moving the files out of the way doesn't help...still barfs and dies.) I'll email him about this and let him deal with it.
Following in Matt's footsteps, I ran into a serious problem just before heading to LISA.
Wednesday afternoon, I'm showing my (sort of) backup how to connect to the console server. Since we're already on the firewall, I get him to SSH to it from there, I show him how to connect to a serial port, and we move on.
About an hour later, I get paged about problems with the database server: SSH and SNMP aren't responding. I try to log in, and sure enough it hangs. I connect to its console and log in as root; it works instantly. Uhoh, I smell LDAP problems...only there's nothing in the logs, and id <uid> works fine. I flip to another terminal and try SSHing to another machine, and that doesn't work either. But already-existing sessions work fine until I try to run sudo or do ls -l. So yeah, that's LDAP.
I try connecting via openssl to the LDAP server (stick alias telnets='openssl s_client -connect' in your .bashrc today!) and get this:
CONNECTED(00000003)
...and that's all. Wha? I tried connecting to it from the other LDAP server and got the usual (certificate, certificate chain, cipher, driver's license, note from mom, etc). Now that's just weird.
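(For the record, the alias gets used like so -- hostname made up, 636 being the usual LDAPS port; adjust if you're doing STARTTLS on 389 instead:
$ telnets ldap0.example.com:636
A healthy connection spits out the certificate, the chain, the negotiated cipher and so on before giving you a prompt; the broken one stopped dead at CONNECTED.)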
After a long and fruitless hour trying to figure out if the LDAP server had suddenly decided that SSL was for suckers and chumps, I finally thought to run tcpdump on the client, the LDAP server and the firewall (which sits between the two). And there it was, plain as day:
Near as I can figure, this was the sequence of events:
This took me two hours to figure out, and another 90 minutes to fix; setting the link speed manually on the firewall just convinced the nic/driver/kernel that there was no carrier there. In the end the combination that worked was telling the switch it was a gigabit port, but letting it negotiate duplexiciousnessity.
Gah. Just gah.
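Note to future me: the quick way to see what each end actually negotiated is to ask the interface directly. On an OpenBSD-flavoured firewall (interface name is a stand-in):
$ ifconfig em0 | grep media
and on a Linux box the rough equivalent is:
$ ethtool eth0 | egrep 'Speed|Duplex'
Then compare that against what the switch port claims it's doing, before forcing anything.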
Last week was reading week here at UBC. Monday I was off sick. Tuesday we got an email from the folks at the building where we've got guest access to one of their server rooms: the cooling was being shut down from 7am on Wednesday to 3pm on Thursday, so we'd have to turn off our servers. We're guests, so it's not like we've got a lot of say in the matter.
Natch, Thursday 3pm came and went. We got an email at 3:45pm from a manager there, saying that unexpected problems had arisen; they were hoping to have things back up by the weekend. That night I pointed our website at a backup server; it was not serving my boss' big web app, as there was no way to make that tiny little box serve a nearly 1TB database.
Friday I obsessed over the ambient temperature on our firewall (which I'd left turned on); it hovered around 35C. Around 10am we were told that they were hoping to have it on later that day, but that another shutdown might need to be scheduled for the next week (this week). At noon we were told that things were looking hopeful, but they couldn't guarantee cooling over the weekend.
At 2pm I found a local A/C rental agency who told us they'd be out to look at the room on Monday. 4pm I emailed my contact at the other department, plus his manager, to ask for updates and whether any further shutdowns could be scheduled after we'd arranged for cooling.
Over the weekend I obsessed over the temperature some more; it had dropped to 21C and stayed there, but without feedback from the facilities people I was reluctant to trust it.
Monday (yesterday; wow, time flies) we were told that the cooling system should perform well; however, a part still needed to be replaced. It was on order and would be coming in late this week or early next, and would require a four-hour outage.
This morning the cooling guy visited (he was at a funeral yesterday, so fair enough) and said that, yep, we could get a nice portable unit in for around $400 for a week.
I'm not writing this down because I'm proud of how I handled this. I'm writing this down so that someone else can maybe learn the things I should've known:
I have a habit of thinking "There's not much that can be done about that." Actually, it goes even further than that; it doesn't occur to me sometimes to think about what can be done. I'm not sure if this is lack of confidence, or trying too hard to get along, or just sheer laziness, but I'm trying hard to stop doing that.
Tuesday, January 15: Notify users that there will be a brief interruption in our Internet access due to $UNIVERSITY network dep't cutover of our connection from old Bay switches to new Cisco switches. The cutover will be on Friday at 6:30am; the network dep't has said an hour, but it's expected to only be about 20 minutes.
Friday, January 18, 8:30am: Get into work to find that our Internet connection is down. I didn't get notified because the Nagios box can't send email to my cel phone if it can't get access to the Internet. Call network help desk and ask if there were problems; they say no, and everyone else is working just fine. I go to our server room and start trying to figure out what's wrong; can't find a thing. Call help desk back, who say they're going to escalate it.
10am: Get call back from the team that did the cutover. They tell me everything looks fine at their end; as we're the Nth connection to be cut over, it's not like they haven't had practice with it. I debug things with them some more, and we still can't find anything wrong: their settings are correct, mine haven't changed and yet I can't ping our gateway. (The firewall is an OpenBSD box with two interfaces, set up as a transparent bridging firewall.) As the firewall box is an older desktop that had been pressed into service long ago, I decide it'd be worth taking the new, currently spare (YOU NEVER HEARD ME SAY THAT) desktop machine and trying that.
Noon: Realize I have no spare ethernet cards (wha'?). Find two Intel Pro 100s at the second store I go to. Install OpenBSD 4.2 (yay for ordering the CD!), copy over config files, and put it into place. No luck. Still can't ping gateway. While working on the firewall, I notice something weird: I've accidentally set up a bridge with only one interface, while my laptop sits behind pinging the gateway (fruitlessly) ten times a second. (I got desperate.) When I add the second interface, the connection works — but only for 0.3 seconds. The behaviour is repeatable.
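(For context, the transparent bridge itself is a tiny bit of config on OpenBSD of that vintage -- interface names here are stand-ins for whatever the cards show up as:
$ cat /etc/bridgename.bridge0
add fxp0
add fxp1
up
or, when testing by hand, brconfig bridge0 add fxp0 add fxp1 up. Miss one of the add lines and you get exactly the one-interface bridge I'd accidentally built.)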
3pm: Right after that, the network people show up to see how things are going. I tell them the results (nothing except for 0.3 seconds) and they're mystified. We decide to back out the change from the morning and debug it next week. Things work again instantly. As the new firewall works, I leave it in place.
7.02pm: The connection goes down again. I don't get notified.
Saturday January 19, Noon: I get a call from the boss, who tells me that a meeting at the offices isn't going well because they have no Internet access. Call and verify that, yep, that's the case, and I can't ping there from home. Drive into work.
1.30pm: Arrive and start debugging. Again, nothing wrong that I can see but I can't ping our gateway or see its MAC address. Call help desk who say they have no record of problems. They'll put in a trouble ticket, but would like me to double-check before they escalate it. That's fine — I didn't wait long before calling them — so I do.
2pm: I get a call from the head of the network team that did the cutover; he'd seen the ticket and is calling to see what's going on. He and I debug further for 90 minutes. We try hooking up my laptop to the port the firewall is usually connected to, but that doesn't work; he can see my laptop's MAC address, but I can't see his.
4pm: He calls The Big Kahuna, who calls me and starts debugging further while his osso bucco cooks. We still can't get anywhere. I try putting my laptop on another port in another room, hoping that net access will work from there and maybe I can just string a cable across. It doesn't.
6pm: We call it a night; he and the other guy are going to come in tomorrow to track it down. I call nine bosses and one sysadmin to keep them filled in.
6.30pm: Drive home.
Sunday, January 20, 10.30am: We all show up and start working. We still can't find anything wrong. The boss calls to ask me to set up a meeting with the network department for tomorrow; I tell him I will after we finish fixing the problem.
11.30am: The network team lead gets desperate enough to suggest rebooting the switch stack. It works. We all slap our heads in disgust. Turns out that a broadcast storm on Friday evening triggered a logical failure in the switch we were connected to, resulting in the firewall's port alone being turned off.
Noon: The boss shows up to see how things are going. He talks with the network lead while I'm on the phone with The Big Kahuna; we've decided to try moving to the Cisco switches and make that work while everyone's here.
12.30pm: The Big Kahuna tells me that the problem is the Spanning Tree Protocol packets coming from my firewall box; the Cisco switch doesn't like that and shuts down the port. I go through man pages until I find the blocknonip option for brconfig. 30 seconds later, everything is working. Apparently, I'm the only one they've come across who's running a transparent bridging firewall, so this is the first time they've seen this problem.
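(For anyone else running a bridging firewall into Cisco gear: the fix amounts to one keyword per member interface, e.g.
$ brconfig bridge0 blocknonip fxp0 blocknonip fxp1
with the same keywords added to /etc/bridgename.bridge0 so it survives a reboot. As I understand it, blocknonip stops non-IP frames -- which includes those STP packets -- from being forwarded out that member. Interface names are stand-ins, as before.)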
1pm: Debrief the boss. Notify other bosses, sysadmins and users that everything is back up again, then do some last-minute maintenance.
2pm: Drive home.
One thing: the usual configuration for other departments (that don't run their own firewall) is to have two Cisco switches running HSRP; they act as redundant gateways/firewalls that fail over automagically. The Big Kahuna mentions in passing that this doesn't work with OpenBSD bridging firewalls. (Our configuration had been simplified to one switch only on Friday as part of debugging the first problem; I mention this in case this is helpful to someone. I don't understand why this might be the case, so I'm going to ask him about this tomorrow.)
The upgrade to Solaris 10 did not work. The main problem was that logging in at the console (even as root!) simply would not work: I'd get logged right back out again each time, with no error message or anything. WTF?
I managed to go into single-user mode, provide the root password (see? they do trust me) and get access that way. But I still couldn't figure out what was going wrong. Eventually I came across this entry in the logs:
svc.startd[7]: [ID 694882 daemon.notice] instance svc:/system/console-login:default exited with status 16
And /var/svc/log/system-console-login:default.log said:
[ Aug 4 14:23:48 Executing start method ("/lib/svc/method/console-login") ]
[ Aug 4 14:24:05 Stopping because all processes in service exited. ]
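(The SMF commands that got me that far, more or less -- the service name comes straight from the log line above:
$ svcs -xv svc:/system/console-login:default
$ tail /var/svc/log/system-console-login:default.log
svcs -xv explains why a service isn't running and points at its log file; svcadm clear would be the next step if the service had dropped into maintenance, but this one seemed to just keep cycling.)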
Eventually I had to give up and revert back to Solaris 9. That part worked well, at least.
I've no idea what went wrong at this point, but since I haven't come across this before with other Solaris 10 installs, I'm starting to wonder if it's a product of luupgrade attempting to merge the machine's current settings with Sol10. Between that suspicion and the increase in disk space needed to run luupgrade (not sure why, but for example /usr needed a couple extra GB of space in order to complete luupgrade; I presume something's being added or kept around, but there's no explanation I can find for this), I'm starting to think that just going with a clean install of Sol10 is the way to go.
Arghh. Live Upgrade was supposed to just work.
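For context, the Live Upgrade sequence I was following was roughly this -- boot environment names, the spare slice and the media path are all placeholders:
$ lucreate -c sol9 -n sol10 -m /:/dev/dsk/c0t1d0s0:ufs
$ luupgrade -u -n sol10 -s /cdrom/cdrom0
$ luactivate sol10
$ init 6
The revert that did work was just luactivate sol9 and another init 6.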
This is going to be a long story, but I hope it'll be instructive. Bear with me.
Back at my last job, we had a Samba server, running on FreeBSD, acting as a Primary Domain Controller for around 35 W2K machines. The same machine also acted as NIS master for a similar number of FreeBSD machines. It also did printing, mail, DNS, and half a dozen other things. This machine was getting old; its CPU usage was often pegged by a large print job, it was running out of disk space, and I was beginning to be worried about the inevitable day of death. I began planning for the upgrade: a new machine, faster and bigger hard drives, more memory and gigabit ethernet for the day we all moved to GigE. Oh, and rack-mounted...definitely rack-mounted.
The opportunity was taken to upgrade much of the software on the machine, including Samba. I decided to move from 2.2 to the 3.0 series; the speed differences seemed pretty impressive. I also wanted to get as many of the big upgrades done at once as possible: the prospect of going through the upgrade repeatedly did not appeal.
Of all the upgrades I was doing, Samba made me the most nervous. I read through the excellent (and Free) Samba HOWTO and made notes: how to move to the tdbsam password database, changes in configuration options, and so on. I had the new server for a while, so I was able to run through many tests: getting a Windows machine to log on, DNS queries, and so on.
Finally, the big day came. I went in on a Saturday and made the move. Most of the rest of the day was spent testing, chasing down the inevitable mistakes, and testing some more. I tested by logging into machines after they'd joined the domain, and making sure that everyone could still log into their workstations. All told, things went pretty damned well, and I congratulated myself on a job well done.
Later, though, a few things began to crop up that I couldn't explain. I could no longer add new domain accounts to SSH under Cygwin. A shared printer wasn't being shared any longer. In fact, shares weren't working at all. I banged my head against this for a while, but since the problems were pretty erratic they tended to fall by the wayside in favour of explaining, one more time, why the words "spare computer" were self-contradictory.
Finally, though, I put some more time into it. And it's a little hairy, especially for this Unix guy, so bear with me.
(Incidentally, I couldn't have figured out half of this without the help of Clarence Lee, a co-op student working with us. Sure, he uses IIS, but he firewalls it with OpenBSD and he got an internship at Microsoft. He's a good guy.)
The shared printer: could not figure out what was going on here. Guy who had it could print to it, no problem. Used to work for everyone, no problem. Now it wouldn't work. Broke the problem down to the point where I was using smbclient on FreeBSD, or net view on W2K, to try and list the shares, and that didn't work. Not any of them -- not IPC$ or anything. I was fairly sure this wasn't supposed to be happening.
There was a machine in limbo (not the same as spare, thenk yew!) while a co-op student became permanent. I got it using the other networked printer, and tried sharing it. Again, command-line utilities would simply not list the shares. What's more, when I tried getting other people to log into the machine (I was fairly irritated at this point, and not at my most rational), they couldn't log in. WTF? I could log in, and there had been no complaints from the person whose machine it had been.
In a moment of irritation, I got the test machine to rejoin the domain...and suddenly, everything was working: I could list shares on it, other people could list shares on it, people could log in, and everything. Yay! It's so simple! Rejoin the domain! Everything will be great!
Ha! It is to laugh. Profiles were not coming in when people logged in. My Documents was empty, and they got that stupid, evil, vile "Let's take a tour of Windows! And let me help you set up your network! DO IT!" popup window. I couldn't figure it out.
Clarence and I banged our heads against it some more, and finally came to a conclusion.
When you migrate Samba, you're meant to take the old SID with you using net(8) GETLOCALSID and SETLOCALSID. The SID is meant to be a world-unique string/number that identifies a domain, or an account -- think something like the DN in LDAP, or NIS domainname + UID in Unix. (A user's SID has a part that belongs to the domain, and another, smaller part that is unique to that user.) I didn't do that -- screwup -- and so the Samba server had generated a new SID. As far as Windows is concerned, the identity of your domain is solely determined by the SID; the name is there just for your convenience. (Insert snide remark here about how magic invisible numbers have no business being that important.)
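In other words, the migration step I skipped is about two commands long. On the old server (the SID below is made up):
$ net getlocalsid
SID for domain OLDPDC is: S-1-5-21-1234567890-1234567890-1234567890
and on the new one, before any machine joins or anyone logs in:
$ net setlocalsid S-1-5-21-1234567890-1234567890-1234567890
Skip that second command and you get exactly the mess described here.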
As a result, the machines that were present at the migration didn't know where their Primary Domain Controller (PDC -- the machine officially in charge of the domain) had gone, and were running on cached credentials, profiles and so on. (This is the same thing that allows you to log into a Windows laptop that belongs to a domain, even when you've taken it home and aren't able to reach your PDC any more.) Printing and shared resources from the Samba server continued to run because of open permissions or credentials (ie, user name and password) that don't depend on SIDs.
This also explained why I could log into the machines without problems: because, as sysadmin, I'd logged into all of them before to do maintenance. My credentials were cached, so the machines were able to authenticate me w/o consulting with their (now missing) PDC. And of course, everyone was able to log into their own workstations for the same reason.
So: machine rejoins the domain and people can log in, because now the machine can find its PDC and verify their passwords. But profiles aren't showing up, because the profile's NTUSER.DAT -- the user's hive, loaded into the registry at HKEY_CURRENT_USER when they log in -- belonged to/was marked with/was owned by the account's old SID, so Windows refused to load it, and lots of stuff broke or was missing.
After some more searching, I finally figured out the way around this. First, you need to use the profiles(1) tool in Samba to change the SID on NTUSER.DAT, which'll be wherever Samba keeps profiles. You should check their SID in Samba by using pdbedit(8), though odds are the user ID/group ID part will have remained the same.
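Something along these lines, with SIDs and paths as placeholders -- double-check profiles(1) and pdbedit(8) on your Samba version before trusting me. Get the new domain SID with net getlocalsid and the user's full SID with pdbedit -Lv jdoe, then rewrite the old SID inside the hive:
$ profiles -c S-1-5-21-OLDSID -n S-1-5-21-NEWSID /path/to/profiles/jdoe/NTUSER.DAT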
Second, you need to take care of the profile. There are a few ways of doing this. The easiest way is to copy the modified NTUSER.DAT to their profile directory, then log into the machine as Administrator and join the new domain, then get the user to log in. Their profile will be copied over, just as if they'd logged into a machine for the first time. However, this can cause problems with certain programs that haven't been informed about the change.
To illustrate: if the domain is named EXAMPLE, and the user account is jdoe, then their profile will usually be at C:\Documents and Settings\jdoe (let's just call that D&S\jdoe for short). However, D&S\jdoe will belong, after joining the new domain, to an old account that's no longer around, which means that Windows will put their profile somewhere else -- probably something like D&S\jdoe.EXAMPLE. Odds are, though, that the old path will still be in the registry or other files, which means a lot of cycles of "Why-did-that-break-let-me-fix-it". Another option is simply to move D&S\jdoe out of the way, so that paths can remain the same. Finally, you can also change ownership recursively to the new account once you've joined the domain; this will take a while, but it's probably quicker than copying the profile over whole cloth if they've got a lot of files. If you do this, it's best to remove the machine's copy of their NTUSER.DAT file; it'll just be copied over from the server.
This took a lot of work, of course, and usually there were things like Outlook.pst to screw things up further. But after much work, I finally got everyone moved over to the new domain, and things were good again.
Lessons learned:
Now this is interesting.
Guy comes up to me at work and sez, "Hey, that new Linux machine is really slow." How can that be? It's an umpty-GHz processor with a gig of RAM, a nice hard drive and the same 100Mb/s connection to the network that the FreeBSD machine beside it has. "It's just slow." Slow how? Doing what? "It's just slow -- all the time."
Eventually we got it down to a working demonstration: log in. The developers've got a fairly intricate set of .cshrc files, so they echo some progress reports: Setting FOO...setting BAR...setting LD_LIBRARY_PATH...only it's taking for-freaking-ever -- well, relatively speaking: 8 seconds sez /usr/bin/time. cf. ~2 s. on the (if anything, slower) FreeBSD machine right beside it. WTF?
At first I started looking at the rc scripts. By deking out various bits, I could see where the 8 seconds was coming from -- half a second there, two and a half seconds there...but then I came to my senses and realized that looping over half a dozen items should not be causing this kind of nonsense.
I checked DMA on the hard drives. Aha, they're off! But all the home directory access is over NFS, so it's probably not much of an issue. And hdparm -d1 /dev/hda just came back with permission denied (even as root...I seem to remember something about newer Intel chipsets having DMA built in), so I left that alone.
Out of desperation more than anything else, I tried running strace /bin/csh /bin/echo foo -- and hot damn if we're not trying to open 209 different directories to find libncurses! Holy hell!
And what is this happy crappy? It's checking out a crapload of directories we haven't even told it about. For example, check out what it does for this one element in LD_LIBRARY_PATH, /home/foo/lib/bling:
open("/home/foo/lib/bling/tls/i686/mmv/cmov/libncurses.so.5", O_RDONLY) = -1 ENOENT (no such file or directory)
open("/home/foo/lib/bling/tls/i686/mmv/libncurses.so.5", O_RDONLY) = -1 ENOENT (no such file or directory)
open("/home/foo/lib/bling/tls/i686/libncurses.so.5", O_RDONLY) = -1 ENOENT (no such file or directory)
open("/home/foo/lib/bling/tls/libncurses.so.5", O_RDONLY) = -1 ENOENT (no such file or directory)
open("/home/foo/lib/bling", O_RDONLY) = -1 ENOENT (no such file or directory)
open("/home/foo/lib/bling/i686/mmv/cmov/libncurses.so.5", O_RDONLY) = -1 ENOENT (no such file or directory)
open("/home/foo/lib/bling/i686/mmv/libncurses.so.5", O_RDONLY) = -1 ENOENT (no such file or directory)
open("/home/foo/lib/bling/i686/libncurses.so.5", O_RDONLY) = -1 ENOENT (no such file or directory)
open("/home/foo/lib/bling/mmv/cmov/libncurses.so.5", O_RDONLY) = -1 ENOENT (no such file or directory)
open("/home/foo/lib/bling/mmv/libncurses.so.5", O_RDONLY) = -1 ENOENT (no such file or directory)
open("/home/foo/lib/bling/cmov/libncurses.so.5", O_RDONLY) = -1 ENOENT (no such file or directory)
That's right, folks, the preloader has taken it upon itself to do some kinda combinatoric search for hardware-optimized libraries, and as a result a measly thirteen entries in LD_LIBRARY_PATH turn into two hundred and nine. Add to that the aggravation of all these entries being in people's home directories, which are NFS-mounted from elsewhere, and it's not too hard to see why the hell it's slow.
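A quick way to put a number on it yourself (the shell and library here are just what we happened to be chasing; newer glibc uses openat, so adjust the trace expression if you try this today):
$ strace -f -e trace=open csh -c 'echo foo' 2>&1 | grep -c libncurses
On the afflicted box that count lines up with the two-hundred-odd opens above.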
To be fair, this is meant to do something good: you can have libraries compiled for different hardware capabilities (HWCAP is the term to search for), which I imagine would be handy if you want one disk image for a bunch of different hardware. The trouble, of course, is that you get into these ridiculously long lists of directories that might exist...at least, if you're not using Alpha CPUs.
Fortunately, the folks at Debian have anticipated my whining and have done something about it.
Unfortunately, it's SOOPERSEKRIT.
Fortunately, I dug it out of a very cranky email to debian-devel:
# touch /etc/ld.so.nohwcap
With that in place, the formerly-plodding test that took 8 seconds to finish now runs in 1.5. And that is one hell of a performance gain from just touching one file.
Had an interesting couple of problems at work this week.
First thing was a FreeBSD system where, suddenly, ipfw didn't work anymore. Only "suddenly"'s not exactly true: this happened after a kernel upgrade. And "didn't work" is a bit inaccurate too: it would list firewall rules -- it just couldn't add them. (Good thing this machine had "default accept" as its firewall policy...) So, like, WTF?
First I tried adding a very simple rule: /sbin/ipfw add 100 allow all from any to any. Nope, didn't work: ipfw: getsockopt(IP_FW_ADD): Invalid argument. I tried that rule on another machine to make sure my syntax was okay -- no problems. Well, what about the command itself? The MD5 checksum of /sbin/ipfw on both machines was the same. I considered briefly blaming the problems on 3133+ cR5><0rZ who'd found an MD5 collision in ipfw, but decided not to try that on my boss. (I did copy the command from the working machine to the stupid machine just to be sure...yep, same result. So much for superstition.)
Hey, wait a minute -- hadn't we patched the kernel on the stupid machine? Sure we had! So that means...well, I don't know what. I had a look at the patches (very simple stuff, so I was able to follow along), and couldn't see what might be causing the problem. I mean, yes they did change the firewall functionality, but...well, maybe we should chase that up, yes? Yes.
And here I fell down a rabbit hole: I started to wonder if maybe the fact that FreeBSD compiled modules for everything (sure seems that way), despite the functionality being included in your KERNCONF file, meant that said functionality might still actually reside in the modules -- that the kernel wasn't being statically compiled at all, or at least not for this particular bit, but that there were Secret! Invisible! Modules! that actually had the bit of code we were looking for. Oh sure, kldstat doesn't show them, but that just shows how tricky those damn FreeBSD kernel developers are, right? And yeah.
Since the stupid machine'd had its kernel copied over by hand -- ie, we did scp foo@bar:/kernel / (I KNOW, I KNOW) and rebooted, and didn't copy all those Secret! Invisible! Modules! over along with /kernel, why, sure we were gonna run into problems! Of course! It all makes sense now! It was the Freemasons all along!
Lemme tell you, I was yea close to copying over /modules/ipfw.ko and trying that (I did go so far as to try ldd /kernel (I KNOW, I KNOW), but it turns out that ldd actually tries to execute a file in order to figure out what libraries it uses, so it just gave me a smack for being such an idiot), but for some reason I had another look at the patches we'd made. Okay, yep, that bit in here, that bit over there, and not one bloody change to /usr/src/sbin/ipfw/ipfw.c, so why the...
Oh. Header files.
Header files that get pulled in when /sbin/ipfw is compiled -- and the patches had touched them. Well, crap. But hey, easy to test and easy to fix: patch the header file, recompile ipfw, and ha! Working! All I had to do was compose a suitably superior-sounding email about the dangers of passing /kernel files around willy-nilly, and all was well again.
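For posterity, the fix itself is just the standard userland rebuild once the same patch has been applied to the local source tree (patch name and flags are placeholders; they depend on how the diff was generated):
$ cd /usr/src && patch < /path/to/the-kernel-patch
$ cd /usr/src/sbin/ipfw
$ make && make install
$ ipfw add 100 allow all from any to any
with that last line being the sanity check that started this whole mess.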
Coming up next: Gentoo on a dual G5. Woohoo!