The Life of a Sysadmin

Carousel is a lie!

Entries from February 2010.

Jumbo frames again
Wed Feb 3 11:27:19 PST 2010

Arghh...I just spent 24 hours trying to figure out why shadow migration was causing our new 7310 to hang. The answer? Because jumbo frames were not enabled on the switch the 7310 was on, and they were on the machine we're migrating from. Arghh, I say!
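
The arithmetic behind this kind of hang can be sketched in a few lines of Python. This is only an illustration: the MTU values (9000 for jumbo frames, 1500 for a standard Ethernet port) are the usual defaults rather than numbers taken from the 7310 or the switch, and the function names are made up for the example.

```python
# Sketch: why an MTU mismatch lets small packets through but stalls big transfers.
# The MTU values below are common defaults, assumed for illustration.

IPV4_HEADER = 20   # bytes, IPv4 header without options
ICMP_HEADER = 8    # bytes, ICMP echo header


def path_mtu(hop_mtus):
    """The usable MTU of a path is the smallest MTU of any hop on it."""
    return min(hop_mtus)


def largest_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def fits(packet_size, hop_mtus):
    """True if an IP packet of packet_size bytes crosses every hop unfragmented."""
    return packet_size <= path_mtu(hop_mtus)


# Hypothetical path: jumbo-enabled sending host, then a standard switch port.
hops = [9000, 1500]

print(path_mtu(hops))               # 1500
print(fits(576, hops))              # True
print(fits(9000, hops))             # False
print(largest_ping_payload(9000))   # 8972
```

Small packets get through, so the connection looks healthy until the sender starts filling full-size frames. A switch that isn't configured for jumbo frames will typically drop those silently, which looks like a hang rather than an error.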

1 comment. Tags: debugging, jumboframes, networking.
NFS dotfiles
Fri Feb 5 10:40:12 PST 2010

Reminder to myself: Got a file called .nfs.*? Here's what's going on:

# These files are created by NFS clients when an open file is
# removed. To preserve some semblance of Unix semantics the client
# renames the file to a unique name so that the file appears to have
# been removed from the directory, but is still usable by the process
# that has the file open.

That quote is from /usr/lib/fs/nfs/nfsfind, a shell script on Solaris 10 that's run once a week from root's crontab.
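
A rough Python sketch of the same idea, for spotting leftover files by hand. It is not a copy of nfsfind: it only lists files and never deletes anything, the seven-day threshold is an assumption borrowed from the weekly cron schedule, and the function name is invented for the example.

```python
import os
import time


def stale_nfs_files(root, max_age_days=7, now=None):
    """Return paths under root whose names start with '.nfs' and whose
    modification time is more than max_age_days in the past."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.startswith(".nfs"):
                continue
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished between listing and stat
            if mtime < cutoff:
                found.append(path)
    return sorted(found)
```

Calling stale_nfs_files("/export/home") (a hypothetical path) returns a sorted list to review. A recently modified .nfs file usually means some process still has it open, so those are skipped on purpose.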

Tags: networking, opensolaris, solaris, toptip, unix.
Happy birthday, Pre!
Fri Feb 12 05:51:22 PST 2010

So to return the compliment, I should mention that my wife is turning 36 today. She's wise, completely supportive (including giving me a boot to the head when I need one), and helped me get started as a sysadmin. She's let me take time to go to conferences and make beer. She sometimes thinks she's a smurf, but that won't stop her ripping your heart out.

She convinced me that this would be a good idea:

Two damned cute kids

And sometimes she looks like this:

Popotch!

But she never stopped reaching for that rainbow:

No, she was never really a cheerleader

Happy birthday, Pre!

1 comment. No tags
Valerie Aurora does it again
Mon Feb 15 13:05:52 PST 2010

Valerie Aurora always makes for interesting reading. This entry is no exception:

If you spend all day with your co-workers, socialize only with your co-workers, and then come home and eat dinner with -- you guessed it -- your co-worker, you might go several years without hearing the words, "Run Solaris on my desktop? Are you f-ing kidding me?"

Schwartz's "the financial crisis did it" explanation for Sun's demise is a symptom of an inbred company culture in which employees at all levels voluntarily isolated themselves from the larger Silicon Valley culture. Tech journalists write incessantly about the exchange of expertise and best practice between companies as a major driver of the Bay area's success. But you have to actually talk to your competition to do that -- over a beer, or maybe a pillow.

Tags: solaris.
More good reading
Tue Feb 16 06:08:27 PST 2010

Not to turn this blog into just a collection of links, but Bunnie Huang has a fascinating couple of entries up on MicroSD cards. The first is a bit of info on how they're packaged; the second details how he came across some poorly made cards and what that reveals about the economics of MicroSD manufacturing. Makes me wonder what kind of ghost parts might be in my server room...

No tags
IPv6 up again
Tue Feb 16 13:31:59 PST 2010

I've set up IPv6 again on my home server; a reboot + doing everything by hand + not writing it down means a) I'm a baaaad sysadmin and b) I had to wait 'til now to find the time to get it going again.

I'm really curious to know what IPv6 connectivity is available at UBC. Must ask mailing list...

Tags: ipv6, meta.
Fishworks and LDAP
Tue Feb 16 16:02:18 PST 2010

Remember: when adding access to your Fishworks/Unified Storage System 7310, LDAP entries must include objectClass: shadowAccount. That took me a while to track down.
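
One way to catch this before the appliance does is to scan an LDIF export for entries that lack the class. The sketch below is a deliberately tiny reader, not a real LDIF parser: it ignores folded lines and base64 values, and the sample entries and function names are hypothetical.

```python
def parse_ldif(text):
    """Very small LDIF reader: blank-line-separated entries of 'attr: value' lines.
    Skips comments; does not handle folded lines or base64 ('::') values."""
    entries = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            if current:
                entries.append(current)
                current = {}
            continue
        if line.startswith("#") or ": " not in line:
            continue
        attr, value = line.split(": ", 1)
        current.setdefault(attr.lower(), []).append(value.strip())
    if current:
        entries.append(current)
    return entries


def missing_object_class(entries, required="shadowAccount"):
    """Return the dn of every entry that lacks the required objectClass."""
    wanted = required.lower()
    return [
        e["dn"][0]
        for e in entries
        if "dn" in e and wanted not in (oc.lower() for oc in e.get("objectclass", []))
    ]


# Hypothetical entries, for illustration only.
SAMPLE = """\
dn: uid=alice,ou=People,dc=example,dc=com
objectClass: posixAccount
objectClass: shadowAccount
uid: alice

dn: uid=bob,ou=People,dc=example,dc=com
objectClass: posixAccount
uid: bob
"""

print(missing_object_class(parse_ldif(SAMPLE)))
# ['uid=bob,ou=People,dc=example,dc=com']
```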

Tags: debugging.
Whoopsie
Thu Feb 18 06:09:40 PST 2010

Sorry -- importing some old entries from my Slashdot journal, and I forgot the date from one of 'em...which made it look like it was 2002 all over again.

Got a root canal today. Wish me luck.

1 comment. Tags: meta.
Randomized Updates
Tue Feb 23 05:55:06 PST 2010

Backups: Bacula has been giving me problems the last week or so. I've got this file server I'm trying to back up; it's got a 2TB partition, and I've been naively trying to just grab it all in one go. Partly that's because it hasn't been backed up before, and I figured this'd be the quickest, simplest way to get going.

What's happened is that after slurping 2 TB over a 100 Mbit connection (no, there's no way to make that quicker), which takes 53 hours, the writing to tape fails for reasons I've yet to figure out. Bacula doesn't say "Oh, the first bit worked so I can just grab that next time...." (To be fair, that's probably a much harder problem than I imagine.) And in the meantime, despite having two drives and two pools of tapes, backups for other stuff pile up behind this big backup and then don't work: they get put on spool space, but then despooling to tape fails.
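
As a sanity check on that 53-hour figure, here is the back-of-the-envelope arithmetic in Python. It assumes "2 TB" means 2 x 10^12 bytes and "100 Mbit" means 100 x 10^6 bits per second; different unit conventions shift the numbers a little.

```python
# Back-of-the-envelope transfer time for the backup described above.
# Assumption: "2 TB" = 2 * 10**12 bytes, "100 Mbit" = 100 * 10**6 bits/s.


def transfer_hours(size_bytes, link_bits_per_sec):
    """Hours needed to move size_bytes over a link running at full line rate."""
    return size_bytes * 8 / link_bits_per_sec / 3600


def effective_mbit(size_bytes, hours):
    """Average throughput in Mbit/s implied by moving size_bytes in the given hours."""
    return size_bytes * 8 / (hours * 3600) / 1e6


SIZE = 2 * 10**12

print(round(transfer_hours(SIZE, 100e6), 1))   # 44.4
print(round(effective_mbit(SIZE, 53), 1))      # 83.9
```

So even a perfectly saturated link needs about 44 hours, and 53 hours works out to an average of roughly 84 Mbit/s under those assumptions. The link itself is the bottleneck.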

Contact manglement: I've been looking for a contact management program for $WORK. Requirements:

This turns out to be surprisingly hard to find, and not just because Freshmeat's interface is terrible. Applications appear to fall into n categories:

So now I'm trying to decide between using Dadabik, which'll let me make a frontend w/o much work as long as I can come up with a schema, or modifying one of the complete-but-bletcherous apps and getting a prettier page. (I'm always paranoid about people refusing to use a web-based tool because it isn't pretty enough; I don't know how to make it prettier and it's not something I personally care about enough to do something about, so I'm caught between don't care and don't know how to fix it if I do care. As a result I panic.)

Family: Son #2 went to the hospital Sunday night with his mom; he's fine, but I was up 'til they got back at midnight. Still got up at 5:30am as usual, thinking I'd catch up last night. Then Son #1 had a bad nightmare last night and it took a while to get him calmed down. Spent a couple hours after that staring at the ceiling, trying to get myself calmed down. Still up at 5:30am as usual.

Dentist: Root canal didn't work. My former dentist, who is the second most graceless dentist I've ever seen, couldn't get through and referred me to an endodontist (someone who does root canals; thank you, Wikipedia). My appointment with them is on April 1st.

And that is that.

2 comments. Tags: backups, geekdad.
It's a race to the finish
Thu Feb 25 10:04:52 PST 2010

I mentioned that I've been having problems with Bacula recently. These have been aggravated by the fact that the trigger seems to be a job that takes 53 hours to finish.

Well, I think I've got a handle on one part of the problem. See, when Bacula is doing this big job, other jobs stack up behind it -- despite having two tape drives, and two separate pools of tapes, and concurrent jobs set up, the daily jobs don't finish. The director says this:

9279 Full    BackupCatalog.2010-02-20_21.10.00_10 is waiting for higher priority jobs to finish
9496 Full    BackupCatalog.2010-02-23_21.10.00_13 is waiting execution
9498 Full    bigass_server-d_drive.2010-02-24_03.05.01_15 is running
9520 Increme  little_server-var.2010-02-24_21.05.00_38 is waiting on Storage tape
9521 Increme  little_server-opt.2010-02-24_21.05.00_39 is waiting on max Storage jobs

but storage says this:

Running Jobs:
Writing: Full Backup job bigass_server-d_drive JobId=9498
Volume="000031"
pool="Monthly" device="Drive-0" (/dev/nst1)
spooling=1 despooling=0 despool_wait=0
Files=708,555 Bytes=1,052,080,331,191 Bytes/sec=11,195,559
FDReadSeqNo=22,294,829 in_msg=20170567 out_msg=5 fd=16
Writing: Incremental Backup job little_server-var JobId=9508 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=1
Files=156 Bytes=3,403,527,093 Bytes/sec=72,415
FDReadSeqNo=53,041 in_msg=52667 out_msg=9 fd=9
Writing: Incremental Backup job little_server-etc JobId=9519 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=0
Files=9 Bytes=183,606 Bytes/sec=3
FDReadSeqNo=72 in_msg=50 out_msg=9 fd=10
Writing: Incremental Backup job other_little_server-etc JobId=9522 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=1
Files=5 Bytes=182,029 Bytes/sec=3
FDReadSeqNo=45 in_msg=32 out_msg=9 fd=19
Writing: Incremental Backup job other_little_server-var JobId=9525 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=0
Files=0 Bytes=0 Bytes/sec=0
FDSocket closed
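
Reading that output by eye gets old fast, so here is a small Python sketch that pulls out the jobs stuck waiting to despool. The field names and layout are taken only from the status output pasted above, not from Bacula's documentation, so treat the regular expressions as assumptions that may not hold for other versions.

```python
import re

# Trimmed copy of the storage daemon status shown above.
SAMPLE = """\
Writing: Full Backup job bigass_server-d_drive JobId=9498
Volume="000031"
pool="Monthly" device="Drive-0" (/dev/nst1)
spooling=1 despooling=0 despool_wait=0
Writing: Incremental Backup job little_server-var JobId=9508 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=1
Writing: Incremental Backup job little_server-etc JobId=9519 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=0
Writing: Incremental Backup job other_little_server-etc JobId=9522 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=1
"""

JOB_RE = re.compile(r"^Writing: .* job (\S+) JobId=(\d+)")
WAIT_RE = re.compile(r"\bdespool_wait=(\d+)")
DEVICE_RE = re.compile(r'\bdevice="([^"]+)"')


def jobs_waiting_to_despool(status_text):
    """Return (job_id, job_name, device) for each job with despool_wait=1."""
    waiting = []
    current = None
    for line in status_text.splitlines():
        line = line.strip()
        job = JOB_RE.match(line)
        if job:
            # A "Writing:" line starts a new job record.
            current = {"name": job.group(1), "id": int(job.group(2)), "device": None}
            continue
        if current is None:
            continue
        device = DEVICE_RE.search(line)
        if device:
            current["device"] = device.group(1)
        wait = WAIT_RE.search(line)
        if wait and wait.group(1) == "1":
            waiting.append((current["id"], current["name"], current["device"]))
    return waiting


for job_id, name, device in jobs_waiting_to_despool(SAMPLE):
    print(job_id, name, device)
# 9508 little_server-var Drive-1
# 9522 other_little_server-etc Drive-1
```

Both waiting jobs are on Drive-1, the drive with the Daily tape, while the big job is still spooling on Drive-0.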

Out of desperation I tried running "unmount" for the drive holding the daily tape, thinking that might reset things somehow...but the console just sat there, and never returned a prompt or an error message. Meanwhile, storage was logging this:

cbs-01-sd: dircmd.c:218-0 <dird: unmount SL-500 drive=1
cbs-01-sd: dircmd.c:232-0 Do command: unmount
cbs-01-sd: dircmd.c:596-0 Try changer device Drive-0
cbs-01-sd: dircmd.c:617-0 Device SL-500 drive wrong: want=1 got=0 skipping
cbs-01-sd: dircmd.c:596-0 Try changer device Drive-1
cbs-01-sd: dircmd.c:612-0 Found changer device Drive-1
cbs-01-sd: dircmd.c:625-0 Found device Drive-1
cbs-01-sd: block.c:133-0 Returning new block=39cee10
cbs-01-sd: acquire.c:647-0 JobId=0 enter attach_dcr_to_dev

...and then just hung there. "Aha, race condition!" I thought, and sure enough a bit of searching found this commit in November: "Fix SD DCR race condition that causes seg faults". No, I don't have a segfault, but the commit touches the last routine I see logged (along with a buncha others).

This commit is in the 5.0.1 release; I wasn't planning to upgrade to this just yet, but I think I may have to. But I'm going on vacation week after next, and I'm reluctant to do this right before I'm away for a week. What to do, what to do...

1 comment. Tags: backups, bug, debugging, upgrades.
That's a what now?
Thu Feb 25 11:51:43 PST 2010

Just saw a Microsoft wireless keyboard and mouse, packaged together, for $320. For that much money, I want Darwinian evolution.

No tags
Late night
Thu Feb 25 21:29:25 PST 2010

Ugh, late night...which these days, means anything past 9:30pm. Two machines down at work with what I think are unrelated problems.

First one appears to have had OOM-killer run repeatedly and leave the ethernet driver in a bad state; I know, but the OOM-killer kept killing things until we got this bug.

Second one appears to have crashed and/or rebooted, but the hardware clock got reset to December 2001 in the process -- which meant that when it tried to contact the LDAP servers, none of their certificates were valid yet.
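
The failure mode is easy to reproduce on paper. The Python sketch below compares a certificate's validity window with whatever the local clock claims; the certificate dates are hypothetical (only the December 2001 clock comes from the incident), and validity_status is a made-up helper name. ssl.cert_time_to_seconds is the standard library call for parsing the date strings that getpeercert() returns.

```python
import ssl
from datetime import datetime, timezone


def validity_status(not_before, not_after, clock_epoch):
    """Classify a certificate's validity window against a given clock reading.

    not_before / not_after use the text format returned by getpeercert(),
    e.g. "Jan  1 00:00:00 2009 GMT"; clock_epoch is seconds since the epoch.
    """
    start = ssl.cert_time_to_seconds(not_before)
    end = ssl.cert_time_to_seconds(not_after)
    if clock_epoch < start:
        return "not yet valid"
    if clock_epoch > end:
        return "expired"
    return "valid"


# Hypothetical certificate dates; only the December 2001 clock comes from the post.
NOT_BEFORE = "Jan  1 00:00:00 2009 GMT"
NOT_AFTER = "Jan  1 00:00:00 2012 GMT"

reset_clock = datetime(2001, 12, 15, tzinfo=timezone.utc).timestamp()
real_clock = datetime(2010, 2, 25, tzinfo=timezone.utc).timestamp()

print(validity_status(NOT_BEFORE, NOT_AFTER, reset_clock))  # not yet valid
print(validity_status(NOT_BEFORE, NOT_AFTER, real_clock))   # valid
```

Nothing is wrong with the certificates in that case; the client's idea of "now" is simply earlier than their start date, so checking the clock is a cheap first step.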

Again, ugh. But I did come across this helpful addition to my toolkit:

 openssl s_client -CAfile /path/to/CA_cert -connect host:port

which, I just realized, I've rediscovered, along with having the same fucking problem again.

And did I mention I'm up at 5am tomorrow to move some equipment around at work? Ah well, I have safety boots now. I'll be suitably rewarded in Valhalla.

2 comments. Tags: debugging.
