31 Oct 2009
Thanks to this conference's theme band, Soul Coughing!
Saskatoon is in the room
Pyongyang is in the room...
Is Chicago
Is not Chicago
"Is Chicago, Is Not Chicago" -- Soul Coughing
Midway through my flight to Baltimore and I'm in Chicago, listening to
periodic announcements that the Threat Advisory Level is Orange.
The wifi here isn't working for me (associates fine but no address by
DHCP), so I'm sitting at my gate, with two hours 'til I leave,
wondering if any of the people around me are going to LISA as well.
The airport here has this amazing tunnel that goes between two
concourses. Again, it made me think I was in Logan's Run
and it was only the thought of being arrested that kept me from
running down the moving sidewalk, shouting "Carousel is a LIE!"

Departure was entirely uneventful; I didn't even get pulled over for
extra questions. One odd thing was that (like O'Hare) the customs
section of YVR was quite warm, and each of the customs officers had
identical clip-on fans placed above them. The cords curled down out
of sight, and the reflection in the cubicle glass reminded me of
spines; I kept thinking they were skeleton decorations for
Hallowe'en.
Tags:
lisa
30 Oct 2009
Hey, everyone -- I'm organizing a BoF at LISA this year on conference
organization. For a couple of years, I've wanted to create a local
conference on system administration here in Vancouver, but I've been
unsure how to start. I figure what better place to brainstorm and
seek advice than at LISA?
So if you have questions or knowledge to share on:
- Scheduling talks and getting speakers
- Technical and organizational requirements
- Finding volunteers and sponsors
- Figuring out a budget (or "Just how far does this shoestring have to stretch?")
then drop on by the Dover C room on Thursday, November 5th, between
8:30 and 9:30pm. C'mon, you've gotta kill that hour before Matt's
BoFs somehow...
Tags:
lisa
conferenceorganization
30 Oct 2009
Following in Matt's footsteps, I ran into a serious problem just
before heading to LISA.
Wednesday afternoon, I'm showing my (sort of) backup how to connect to
the console server. Since we're already on the firewall, I get him to
SSH to it from there, I show him how to connect to a serial port, and
we move on.
About an hour later, I get paged about problems with the database
server: SSH and SNMP aren't responding. I try to log in, and sure
enough it hangs. I connect to its console and log in as root; it
works instantly. Uhoh, I smell LDAP problems...only there's nothing
in the logs, and id <uid> works fine. I flip to another terminal
and try SSHing to another machine, and that doesn't work either.
But already-existing sessions work fine until I try to run sudo or
do ls -l. So yeah, that's LDAP.
I try connecting via openssl to the LDAP server (stick alias
telnets='openssl s_client -connect' in your .bashrc today!) and get
this:
...and that's all. Wha? I tried connecting to it from the other LDAP
server and got the usual (certificate, certificate chain, cipher,
driver's license, note from mom, etc). Now that's just weird.
After a long and fruitless hour trying to figure out if the LDAP
server had suddenly decided that SSL was for suckers and chumps, I
finally thought to run tcpdump on the client, the LDAP server and the
firewall (which sits between the two). And there it was, plain as
day:
- 3-way handshake
- client says "I speak SSL!"
- server says "I speak SSL too! Here you go!"
- but the client never sees that packet
- and neither does the firewall.
Near as I can figure, this was the sequence of events:
- We SSH'd from the firewall, with its two bridged Intel GigE jumbo-enabled
NICs
- to the console server, which only does 10/100
- which somehow prompted a renegotiation of the link speed on the
firewall's interface
- which settled on 100 MBit, full duplex, but with jumbo frames
- which the switch saw as completely bogus
- which prompted the switch to (silently, natch) drop all jumbo frames directed at the
firewall's outside interface
- which, in the context of an LDAP lookup done by a client inside the
firewall, meant that the first packet that failed was the "I speak
SSL too! Here you go!" packet
- which left the client with an established TCP connection to the LDAP
server, waiting for a certificate
- which meant that it never actually failed over to the other LDAP
server.
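The last two links in that chain are the nasty part: a client with an established TCP connection and no timeout on the TLS handshake will wait forever rather than try the next server. Here is a minimal Python sketch of the defensive version. The helper name (first_responsive_tls_server) and the five-second default are assumptions made for illustration, not anything from the LDAP client involved here.

```python
import socket
import ssl


def first_responsive_tls_server(servers, timeout=5.0, context=None):
    """Return the first (host, port) that completes a TLS handshake in time.

    Hypothetical probe helper, not part of any LDAP library. The connection
    is closed again before returning. Returns None if no server answers.
    """
    context = context or ssl.create_default_context()
    for host, port in servers:
        try:
            # The timeout set here also applies to the handshake below, so a
            # server whose certificate packet is silently dropped produces an
            # error instead of an indefinite hang.
            with socket.create_connection((host, port), timeout=timeout) as raw:
                with context.wrap_socket(raw, server_hostname=host):
                    return (host, port)
        except OSError:
            # Refused connections, timeouts and TLS errors all land here
            # (ssl.SSLError is a subclass of OSError); try the next server.
            continue
    return None
```

The point of the design is only that the timeout covers the handshake as well as the connect; a timeout on the connect alone would not have helped in this outage, since the 3-way handshake succeeded.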
This took me two hours to figure out, and another 90 minutes to fix;
setting the link speed manually on the firewall just convinced the
nic/driver/kernel that there was no carrier there. In the end the
combination that worked was telling the switch it was a gigabit port,
but letting it negotiate duplexiciousnessity.
Gah. Just gah.
Tags:
networking
warstory
openbsd
lisa
jumboframes
28 Oct 2009
So this morning, again, I got paged about machines in our server
room dropping off the network. And again, it was the bridge that was
the problem. This time, though, I think I've figured out what the
problem is.
The firewall has two interfaces, em0 (on the outside) and em1 (on
the inside), which are bridged. em1 has an IP address. I was able
to SSH to the machine from the outside and poke around a bit. I still
didn't find anything in the logs, but I did notice this (edited for brevity):
$ ifconfig
em0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 9000
lladdr 00:15:17:ab:cd:ef
media: Ethernet autoselect (1000baseT full-duplex)
status: active
inet6 fe80::215:17ff:feab:cdef%em0 prefixlen 64 scopeid 0x1
em1: flags=8d43<UP,BROADCAST,RUNNING,PROMISC,OACTIVE,SIMPLEX,MULTICAST> mtu 9000
lladdr 00:15:17:ab:cd:ee
groups: egress
media: Ethernet autoselect (1000baseT full-duplex)
status: active
inet 10.0.0.1 netmask 0xffffff80 broadcast 10.0.0.1
inet6 fe80::215:17ff:feab:cdee%em1 prefixlen 64 scopeid 0x2
See that? em1 has OACTIVE set. A quick search turned up
some interesting hits, so for fun I tried resetting the
interface:
$ sudo ifconfig em1 down
$ sudo ifconfig em1 up
and huzzah! it worked.
When I got to work I did some more digging and figured out that this
and the earlier outage were almost certainly caused by running
a full backup, via Bacula, of the /home partition on the machine.
The timing was just about exact. The weird thing, though, is that
the partition itself holds less data than /var, which was backed up
successfully both times:
$ df -hl
Filesystem Size Used Avail Capacity Mounted on
/dev/sd0a 509M 42.4M 442M 9% /
/dev/sd0g 106G 11.4G 89.1G 11% /home
/dev/sd0d 3.9G 6.0K 3.7G 0% /tmp
/dev/sd0f 15.7G 2.4G 12.5G 16% /usr
/dev/sd0e 15.7G 13.6G 1.4G 91% /var
The bacula file daemon logged this on the firewall:
Oct 28 02:46:15 bacula-fd: backup-fd JobId 3761: Fatal error: backup.c:892 Network send error to SD. ERR=Broken pipe
Oct 28 02:46:15 bacula-fd: backup-fd JobId 3761: Error: bsock.c:306 Write error sending 36841 bytes to Storage daemon:backup.example.com:9103: ERR=Broken pipe
With the earlier outage it was 65536 bytes, but otherwise the same
error.
Okay, so the firewall's working again...now what? I'm about to head
off to LISA in three days, so I can't very well upgrade to the
latest OpenBSD right now. I settled for:
- turning off full backups on the firewall (everything important is
kept in Subversion anyhow), and
- running a script from cron every 10 minutes that checks for the
OACTIVE flag and, if found, resets the interface.
Hopefully that'll keep things going 'til I get back.
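A sketch of what such a watchdog might look like in Python; this is not the script actually used on the firewall. The function names are made up for illustration, and it assumes ifconfig prints flags in the flags=NNNN<A,B,C> form shown above.

```python
import re
import subprocess


def interface_flags(ifconfig_output):
    """Return the set of flag names from the first 'flags=...<...>' field."""
    match = re.search(r"flags=[0-9a-fA-F]+<([^>]*)>", ifconfig_output)
    return set(match.group(1).split(",")) if match else set()


def reset_if_oactive(interface="em1"):
    """Bounce the interface if OACTIVE is set; return True if it was reset.

    Needs root privileges, for example by running from root's crontab.
    """
    output = subprocess.run(
        ["ifconfig", interface], capture_output=True, text=True, check=True
    ).stdout
    if "OACTIVE" not in interface_flags(output):
        return False
    # Same remedy as the manual fix: take the interface down, then up.
    subprocess.run(["ifconfig", interface, "down"], check=True)
    subprocess.run(["ifconfig", interface, "up"], check=True)
    return True
```

Calling reset_if_oactive() from a small wrapper every 10 minutes would match the schedule described above. Logging each reset would also give a record of how often the interface wedges.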
Tags:
openbsd
networking
lisa
23 Oct 2009
At $WORK I've just switched to using Foswiki (formerly TWiki) for
documentation. I miss editing files directly from Emacs like you can
with Confluence, but I'll get over it. The main reason I like Foswiki
is that, at heart, the source files are plain text, and are available
as plain text -- no need to trawl through a database.
Another nice feature that Confluence has is the ability to export a
space (Foswiki calls it a web) directly as PDF. A bit of scripting
takes care of that, but since this is the second time I've lashed
together a Makefile to generate a PDF from Foswiki, I figure it's
time to post it.
You can find the Makefile here. It uses lynx, wget and htmldoc;
of those, I suppose only htmldoc is hard to replace. There's one
important assumption built into the Makefile, though: that every page
is linked to from the front page of the web, which is how I organize
my pages.
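To make that assumption concrete, here is a rough Python sketch of the "collect every page linked from the front page" step. It is not the Makefile itself. The URL layout (/bin/view/<Web>/<Topic>) and the names LinkCollector and pages_in_web are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def pages_in_web(front_page_html, front_page_url, web_prefix):
    """Return absolute URLs linked from the front page that live under web_prefix.

    web_prefix is the URL path shared by every topic in the web, for example
    "/bin/view/Sysadmin/" (an assumed layout; adjust to your install).
    """
    collector = LinkCollector()
    collector.feed(front_page_html)
    seen, pages = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop in-page anchors.
        url = urljoin(front_page_url, href).split("#")[0]
        if urlparse(url).path.startswith(web_prefix) and url not in seen:
            seen.add(url)
            pages.append(url)
    return pages
```

The resulting list could then be handed to htmldoc to build the PDF. Any page not reachable from the front page is silently left out, which is exactly the limitation noted above.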
Share and enjoy!
Update: Just showed the boss the printed version, and he was very
impressed. Yay me! :-)
Tags:
scripting
documentation
15 Oct 2009
As part of a slow migration from Confluence to Foswiki, I
had to grab the Confluence markup from an XML dump. I found a
Python script to do this, but I think the XML format must have
changed in the meantime; the script was unable to grab the body
content.
I've lashed together a version available here that works with
Confluence 2.10 and Python 2.5. Now to convert the pages to Foswiki
markup...
Ack! Just discovered this updated script in the comments
section. Looks like that one grabs a lot more than this does (labels,
attachments). Oh well, I needed the practice with Python.
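For the curious, the core of the job can be sketched in a few lines of modern Python. This is not either of the scripts mentioned above. It assumes the dump stores each page body as an <object class="BodyContent"> element containing a <property name="body"> child, so check that against your own export before trusting it. The function name body_contents is made up.

```python
import xml.etree.ElementTree as ET


def body_contents(xml_text):
    """Return the text of every BodyContent 'body' property in the dump.

    Assumes the layout described in the lead-in; other Confluence versions
    may arrange the XML differently.
    """
    root = ET.fromstring(xml_text)
    bodies = []
    for obj in root.iter("object"):
        if obj.get("class") != "BodyContent":
            continue
        for prop in obj.findall("property"):
            if prop.get("name") == "body":
                # CDATA sections arrive as ordinary element text.
                bodies.append(prop.text or "")
    return bodies
```

For a large dump, a streaming parser such as ET.iterparse would be kinder to memory than reading the whole file into a string.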
Tags:
python
15 Oct 2009
I ran into a couple problems compiling NUT on Solaris 10 today.
They were pretty much due to bad setup on my part, but they did take a
while to track down. For the record:
- libtool: link: only absolute run-paths are allowed: This turned
out to be an obscure way of saying "You don't have libsnmp
installed". Solution: configure --without-snmp.
- false cru: The full error was:
libtool: link: false cru .libs/libparseconf.a .libs/parseconf.o
gmake[1]: *** [libparseconf.la] Error 1
This turned out to be a consequence of not having /usr/ccs/bin in
my $PATH.
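The "false cru" line is presumably what is left when the build cannot find ar, which on Solaris 10 lives in /usr/ccs/bin. A small preflight check along these lines would have flagged it before the build started. This is a Python sketch, and the helper name missing_tools and the tool list are assumptions.

```python
import os
import shutil


def missing_tools(tools, path=None):
    """Return the tools from the list that cannot be found on the search path.

    path defaults to the current $PATH; pass a string to test another one.
    """
    search_path = os.environ.get("PATH", "") if path is None else path
    return [tool for tool in tools if shutil.which(tool, path=search_path) is None]
```

Something like missing_tools(["ar", "gmake"]) before running configure gives a clear answer in place of a cryptic libtool failure halfway through the build.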
Tags:
solaris
09 Oct 2009
I'm not sure exactly where I saw that DRAC6 Express does not do
console redirection -- it was on a mailing list somewhere -- but
that turns out to be just wrong:
- the command has been renamed
- and if you RTFM you'll find the right settings for the BIOS and grub.
(For the record, it was the "External Serial Connector" in BIOS that
got me; it should be "serial device 1", not "Remote Access Device".)
I can now SSH to the DRAC and get a console just fine. I wish to
apologize to Dell, the people of Monaco and the constellation
Sagittarius.
Tags:
correction
hardware
dell
05 Oct 2009
Install logwatch on Solaris fileserver.
Notice that logwatch emails are not coming in.
Log in and run logwatch by hand.
Inspect mail log and notice lack of any entries.
Notice that Postfix is in maintenance mode; start it up.
Notice continued lack of emails.
Notice that Postfix is running, which confused svcadm when told
to start up Postfix. It fails to do so and fails to log this.
killall postfix, svcadm enable postfix.
man svcadm; svcadm clear postfix; svcadm enable postfix.
Run logwatch by hand; notice emailed report to
"root@localhost.localdomain", which gets bounced by Postfix on the
mail server because it's a non-existent host.
Resist temptation to go down that rabbit hole just now, and stick
to the problem at hand.
Edit /opt/csw/etc/log.d/logwatch.conf and set MailTo to
proper address.
Re-run logwatch and note that reports are still going to
root@localhost.
After much swearing, notice that actually, logwatch is set to look
in /opt/csw/etc/log.d/conf/logwatch.conf for configuration.
Edit that file, re-run logwatch.
Notice errors from Postfix: "postdrop[13848]: [ID 947731
mail.warning] warning: mail_queue_enter: create file
maildrop/908447.13848: Permission denied".
Run "postfix set-permissions". Test mail; still failing.
Check permissions on another system and set by hand.
Re-run logwatch. Still no email. Re-run with debug=high and get
email.
Wonder idly about futility of self-aware log watching system that
can't report on its own heisenbug-induced failure, crappy packaging
practices, inability to check end-to-end email connectivity, other
career options.
(Update) Realize that the emails show up if "Detail" is set to
Medium or High; Low, the default, makes the report silent.
(Update) Uninstall the package and reinstall, only to find that the
symlink to conf/logwatch.conf is set up at installation, and that
this is probably a case of $EDITOR breaking the symlink. Apply head
to desk.
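A check for that last failure is cheap to write. Here is a Python sketch that reports whether a config path is still the symlink the package installed. The function name symlink_status is made up, and the paths in the usage note are the ones from this post.

```python
import os


def symlink_status(link_path, expected_target):
    """Describe whether link_path is still a symlink to expected_target."""
    if not os.path.lexists(link_path):
        return "missing"
    if not os.path.islink(link_path):
        # An editor that writes a new file and renames it over the old name
        # leaves a regular file where the symlink used to be.
        return "not a symlink"
    if os.path.realpath(link_path) != os.path.realpath(expected_target):
        return "points elsewhere"
    return "ok"
```

Here that would be symlink_status("/opt/csw/etc/log.d/logwatch.conf", "/opt/csw/etc/log.d/conf/logwatch.conf"). Anything other than "ok" means edits to the first path are not reaching the file logwatch actually reads.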
Tags:
solaris
packagemanagement
yakshaving
monitoring
05 Oct 2009
Yesterday I got paged by one of my two Nagios boxes (learned that
trick the hard way): a bunch of the machines in our server room
had dropped off the network. Weirdly, this did not include the
other Nagios box that's over there. WTF?
I logged into the server room's Nagios box, and sure enough couldn't
ping the servers or the firewall. I could ping the console
server...which was also on the Outside VLAN along with the monitoring
box, as opposed to the Inside VLAN with the servers, which sat behind
our firewall.
I was also able to ping the management cards/ILOMs/SPs/whatever the
kids are calling them in the servers. Thankfully they're Sun boxes,
so no Vista-like maze of flavours there...they all come with
console redirection. I logged in and fired up a console, panicking
because I thought that perhaps the newly-installed NUT clients
had shut down the machines because I'd overlooked something.
But no...the machines were up, though hung if you tried to do any LDAP
lookups. (Through an oversight, the LDAP server was also on the
Outside VLAN. I'll be fixing that today.) Modulo that, they seemed
fine.
So I logged into the firewall, which runs OpenBSD 4.3 in bridging
mode. And this is where the weirdness lay: the bridge, and/or its
component cards, was not working. ifconfig and brconfig said
they were up and fine, and the ARP table was still populated (not sure
what the lifetime of entries is -- isn't it around 20 minutes or so?
must check -- but by this time the problem had been going on for about
an hour). Yet I couldn't ping the firewall (one of those cards has an
address) from either side, and I couldn't ping anything from the
firewall.
pfctl -s all didn't show anything suspicious. There were no obvious
problems in dmesg or /var/log/messages. I disabled, then
re-enabled, the firewall to no effect. I ran /etc/netstart to no
effect.
I even checked on the switches to see if the firewall's MAC address
was showing up anywhere, and it was not -- not even directly after
pinging it (and getting no response).
In the end I rebooted the machine and all was well.
The NIC in question is a dual-port Intel Pro 1000 (MT, I believe) that
I've never had problems with. I've never come across problems like
this before on OpenBSD (or, I think, anywhere else). The onboard
Broadcom (boo, hiss) was acting fine...it was also on the ILOM's
VLAN, and could see the other ILOMs just fine. (In fact, I should
have just SSHd to the firewall using that VLAN from the Nagios box,
rather than futz around with a 9600 bps console. Next time.)
So...that's my mystery for the weekend.
In other news, my older son (3.25 yrs) has taken to the stage in a
big way: he now stands on top of the steps going up from our living
room and sings us songs into one of at least two microphones.
"Barbara Ann", anything by The Wiggles, and "Yo Gabba Gabba!" songs
are prominent. This is after at least three solid weeks of guitar
playing, where anything and everything gets strummed while being
cradled in his arms while he sings, or maybe makes feedback sounds
that'd make Yo La Tengo proud.
Meanwhile, my younger (1.5 yrs) has started saying lots of different
phonemes, which is a real contrast to using "Dat!" for monkey, cereal,
ball, yes, no, President Barack Obama's attempted health care reforms,
and Linux. He has also begun sleeping in 'til 6:30 or 7:00 in the
morning, which lets me write things like this. Both are infinitely
endearing.
And incidentally, I really need to set up Nagios dependencies. I've
had to ACK 27 services in a row (unrelated (I think) problem with ILOM
temperature taking means SNMP checks are timing out). Either that or
there's some way that you can select n services in Nagios to ack all
at once. Anyone?
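One possible answer to that question is to script it through the Nagios external command file, which accepts ACKNOWLEDGE_SVC_PROBLEM lines. Here is a Python sketch. The function names, the default command file path and the sticky/notify/persistent values are assumptions, so check them against the external command documentation for your Nagios version.

```python
import time


def ack_commands(services, author, comment, now=None):
    """Build one ACKNOWLEDGE_SVC_PROBLEM line per (host, service) pair.

    The author and comment must not contain semicolons, since Nagios uses
    them as the field separator.
    """
    timestamp = int(time.time() if now is None else now)
    return [
        # Fields: host;service;sticky;notify;persistent;author;comment
        f"[{timestamp}] ACKNOWLEDGE_SVC_PROBLEM;{host};{service};2;0;0;{author};{comment}\n"
        for host, service in services
    ]


def write_commands(lines, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write the command lines to the Nagios external command file (a FIFO).

    The default path is an assumption; use the command_file value from your
    own nagios.cfg.
    """
    with open(command_file, "w") as pipe:
        pipe.writelines(lines)
```

Feeding it the 27 (host, service) pairs takes one call to each function. Dependencies are still the better long-term fix, since they stop the pages from being sent in the first place.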
Tags:
openbsd
networking
03 Oct 2009
Ran into a little problem this week when I tried to do a restore from
a backup at work. Bacula loaded the tape, then said it couldn't read
the label. Wha?
After much investigation, during which I completely neglected to
cut-n-paste the error messages, I think I've figured out what
happened:
- I upgraded the license key for our storage library;
- I rebooted the library, 'cos that's what you gotta do;
- but the tape was still in there, say halfway through after the last
batch of backups;
- so the drive rewound the tape after being power-cycled;
- and Bacula didn't know this;
- so it wrote the next backups that night at the beginning of the
tape, not realizing this would be a Bad Thing(tm).
Ack. Needless to say, this was not good. Fortunately, the file in
question was not a terribly important one; unfortunately, that's about
the last 2 weeks of incrementals gone. Lesson learned: don't assume
your backup program knows what's going on when hardware reboots from
under it.
In other news: on Thursday I got 5 new Dell servers. Woot! One of
'em will be our new LDAP/web/email/FTP server (Xen ftw!); the rest are
going to be running protein search engines for various researchers
across BC. They're racked and I'm stoked, except that it turns out
the difference between the DRAC6 Express and Enterprise, besides a few
hundred dollars, is that the Enterprise does console redirection and
the Express doesn't. Dammit.
I'm going to see if there's any trickery that can be done, but I'm not
holding out hope. I have got a 32-port console server, but it's two
racks away...might have to run a small batch o' cables up and over to
make this work.
Tags:
oops
backups
hardware
virtualization
dell
30 Sep 2009
I've come across a few LISA items today, and it's only 9am...
Matt Simmons is going, and got one of the blogger gigs
too.
The BOFs are starting to fill up: Matt's got one for bloggers
and another for small infrastructure, there's one for lightning
talks, and one for uninvited talks.
OpenDNS is hosting a happy hour at a nice-looking pub, which
alleges it was actually "designed and built in Ireland and shipped
over in the fall of 2002, where it was then fitted on site." Huh.
Man, I'm looking forward to this.
Tags:
lisa
28 Sep 2009
"I say we take off and nuke the entire site from orbit. It's the only
way to be sure."
Saturday afternoon my home web server got cracked. I found out
because Google started refusing my searches, asking me to fill out a
CAPTCHA form (incidentally, I hate the word CAPTCHA, and even typing
it gives me hives) to prove I was human. What the hell?
So I checked on the server, which is also our firewall, which isn't
good but frankly I was tired of maintaining a complex network at home,
and sure enough there was some perl script running as user www-data
(which Debian uses to run the webserver), sending off tons of Google
queries and taking commands on IRC the way I keep hearing nobody does
anymore. Crap.
Fortunately I've been running Bacula for a while now, backing up to an
external hard drive, and so I figured that even though it probably
would go away when I rebooted, I'd Do The Right Thing(tm) and rebuild
from scratch.
This had to wait 'til the evening, so I shut down the webserver, ran
backups a bunch more times, got more info, and moved the machine (a
tiny li'l Shuttle box) from my youngest son's bedroom (apparently the
only room in the house w/a phone outlet not covered by an ADSL filter)
to our bedroom upstairs, running the network cable up the stairs.
In the end, it all went pretty smoothly. I was able to get all my
packages back and restore from backup; the only thing I messed up was
getting the ownership wrong on my restored crontab. (Debian uses a
pool of UIDs for daemons, so you're not guaranteed to get the same
UIDs if you reinstall.)
As a bandaid, I've firewalled off www-data from initiating connections
out. I should have done this long before. Now I'm starting to
think about the next step -- Xen, maybe, or SELinux. (I did briefly
consider other distros, or even a BSD: CentOS for SELinux, FreeBSD for
pf and jails. But I decided that one problem at a time was quite
enough, thanks.)
Tags:
nukeitfromorbit
security
linux
21 Sep 2009
At $WORK, I'm going to be taking over the administration of four
servers that currently do stuff for a variety of researchers scattered
around the province. There are a number of players here:
- My department, which contains:
- Me, the guy whose services are being promised, and
- The researcher who's arranging all this (my local contact)
- The agency that owns them, who I don't think has any technical staff
- The agency that currently hosts and administers the servers
The owning agency has also ponied up for an upgrade to the four
servers; I'll be taking delivery some time next week.
I've got some preliminary information -- what the servers do,
how the users use the thing, etc -- but I'm preparing a more detailed plan.
In the meantime, I've compiled a list of questions for my local
contact.
In the middle of that, it occurred to me that this would be a good
discussion topic. Have I missed anything? Let me know!
- Will the old servers be moved over, or will the new ones replace them?
- What's the primary means of talking to users? (Mailing list, status page)
- Where's the list of those users? (one of the above, spreadsheet)
- What info do users/owners expect from us? How? (Mailing list, status page;
2 weeks notice of downtime, monthly stats by CPU)
- Are any funding decisions influenced by this information?
- Where is the info for the software?
  - media
  - license #, what we have licenses for (unlimited use, # cores, etc)
  - support #, what it covers
- Can I see a demo of the software?
- Do any of the labs have shell access? What do they do with it?
- What exactly is involved in maintenance? Where is this documented?
- What DNS changes will be made? Who makes them?
- Who makes policy/purchase decisions about these servers? How do I contact them?
Tags:
work
migration
21 Sep 2009
title: Can I send email spam from your servers?
date: Mon Sep 21 08:40:54 PDT 2009
Depressing.
Tags:
16 Sep 2009
Just got the approval from the boss...LISA, here I come! w00t!
Tags:
lisa
11 Sep 2009
In the spirit of Michael Geist, here's my submission on copyright
reform. Originally I intended to write about how this affects me as a
sysadmin, but then the stuff about my kids just came out...
Tags:
copyright
09 Sep 2009
Just ran into an interesting problem: after replacing memory on a
server, CentOS booting hung at "Starting system message bus..."
So what does dbus have to do with anything? This turned out to be an
LDAP failure; dbus was trying to run as UID root, and since the LDAP
server couldn't be contacted it hung. Why couldn't the LDAP server be
contacted? The LDAP server logs only showed this:
[09/Sep/2009:12:04:32 -0700] conn=41492 op=-1 fd=112 closed - SSL peer cannot verify your certificate.
The CA cert I use was in place, and another machine had just rebooted
w/o problems (all this is taken care of with cfengine, so they
were identical in this respect). I could connect to the LDAP server
on the right port without any problems.
I finally figured out what was going on when I ran:
openssl s_client -connect ldap.example.com:636 -CApath /path/to/cacert_directory
and saw:
Verify return code: 9 (certificate is not yet valid)
date said it was December 31, 2001. What the what now? I ran ntpdate
to set things correctly, then I got:
Verify return code: 0 (ok)
I figure the CMOS clock (or whatever the kids are calling it these
days) got reset when we had to remove the CPU daughtercard to get at
the memory underneath.
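The moral is that a "certificate problem" is sometimes a clock problem. Here is a small Python sketch of the comparison OpenSSL is making. The function name validity_problem is made up, and the dates in any example call are hypothetical rather than those of the real certificate.

```python
import ssl
import time


def validity_problem(not_before, not_after, now=None):
    """Compare a certificate's validity window with the clock.

    not_before / not_after use the text format OpenSSL prints, for example
    "Sep 01 00:00:00 2009 GMT". now is seconds since the epoch and defaults
    to the system clock. Returns None when the clock is inside the window.
    """
    now = time.time() if now is None else now
    if now < ssl.cert_time_to_seconds(not_before):
        return "certificate is not yet valid (is the system clock behind?)"
    if now > ssl.cert_time_to_seconds(not_after):
        return "certificate has expired (or the system clock is ahead)"
    return None
```

With a clock reading late 2001 and a certificate issued in 2009, this takes the first branch, which matches the return code 9 seen above.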
And now you know...the rest of the story.
Tags:
ldap
cfengine
09 Sep 2009
I just saw on Undeadly.org that orders for OpenBSD CDs are 'way
down this year. Without OpenSSH and pf, I wouldn't be able to do my
job nearly as well as I do. I've ordered a set for work (good excuse
to upgrade that firewall), and ordered a set for home and tossed 'em
$50 as well. I encourage you to do the same.
In the words of the original rant:
Do you use OpenBSD for fun? Contribute.
Do you use OpenBSD for work? Contribute.
Does OpenBSD allow you to worry about the problem you are trying
to solve rather than the tools? Contribute.
Do you wish your employer used the OpenBSD quality standard in
your work? Contribute.
Does your employer use OpenBSD? Ask them to contribute (after
you do, of course).
Do you bundle OpenBSD or subprojects like OpenSSH into your
product? Contribute big! (you won't, you rarely do, but hey,
I'll ask anyway)
Do you find yourself wondering why so few take computer software
quality seriously? Contribute!
Tags:
openbsd
wontyoupleaselendahand
08 Sep 2009
It's the start of school here at $UNIVERSITY, and for some reason I
find myself noticing it more than last year. Then and now, my job has
been one that is not flooded in September with new students (unlike
a lot of my friends and coworkers), but rather it's more like a steady
trickle. Grad students show up days or even weeks late; new faculty
come in when they're good and ready; no one really has a firm idea
when someone's showing up, but everyone's confident they'll be here
Real Soon Now.
As a result, the biggest effect this usually has on me is the press of
humanity in the bus and SkyTrain. My commute is a long one -- bus,
SkyTrain, then another bus -- and it takes between 90 and 100 minutes,
door to door. I get a lot of reading done, or I listen to podcasts,
or if the Lithium Ion Gods are with me I fiddle with Emacs. This
happens no matter what, but in September you've got all the people
learning how the bus works, how far in advance they need to show up,
and so on. The buses and SkyTrains are crowded because everyone's
afraid it's the last one, or they'll be late for class, or everyone
else is getting on so they must know something I don't.
And then it calms down. Some get tired of the bus and drive. Most
figure out how late they can sleep in. (I vaguely remember that, in
the same way that I vaguely remember kindergarten.) Things thin out.
Before you know it it's December and it gets really empty. Winter
brings humidity and rain, wet smells and drips on book pages.
And then it's spring, and then summer, and things get positively
luxurious. There's room to stretch out, room for laptops, and lots to
see. The kids' birthdays come around.
And then...September again.
Tags: