The Life of a Sysadmin

Carousel is a lie!

NFS dotfiles
Fri Feb 5 10:40:12 PST 2010

Reminder to myself: Got a file called .nfs.*? Here's what's going on:

# These files are created by NFS clients when an open file is
# removed. To preserve some semblance of Unix semantics the client
# renames the file to a unique name so that the file appears to have
# been removed from the directory, but is still usable by the process
# that has the file open.

That quote is from /usr/lib/fs/nfs/nfsfind, a shell script on Solaris 10 that's run once a week from root's crontab.
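For future me, here's a rough Python sketch of the kind of sweep I understand that script to be doing: walk a directory tree and list leftover .nfs* files that haven't been touched in a while. The function name, the seven-day threshold, and the example path are my own assumptions for illustration, not something lifted from nfsfind, and it only lists files rather than removing anything.

```python
import os
import time


def find_stale_nfs_files(root, max_age_days=7, now=None):
    """Return sorted paths under root named .nfs* with an mtime older than max_age_days.

    The name and the 7-day default are illustrative assumptions.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.startswith(".nfs"):
                continue
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                # The file can vanish between listing and stat; skip it.
                continue
            if mtime < cutoff:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    # "/export/home" is a placeholder path; substitute a real exported filesystem.
    for path in find_stale_nfs_files("/export/home"):
        print(path)
```

A file that shows up here is probably still held open by some process on a client (or was, when that client went away), so it's worth checking before deleting anything by hand.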

Tags: networking, opensolaris, solaris, toptip, unix.
Jumbo frames again
Wed Feb 3 11:27:19 PST 2010

Arghh...I just spent 24 hours trying to figure out why shadow migration was causing our new 7310 to hang. The answer? Because jumbo frames were not enabled on the switch the 7310 was on, and they were on the machine we're migrating from. Arghh, I say!
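A sketch for next time, so I check the path before blaming the appliance: a 9000-byte MTU leaves 8972 bytes of ICMP payload once the 20-byte IPv4 header and 8-byte ICMP header are subtracted, and a ping of that size with fragmentation forbidden should fail if anything in between isn't passing jumbo frames. The helper names are made up, and the flags assume the Linux iputils ping (-M do, -s, -c); other platforms spell this differently.

```python
IP_HEADER_BYTES = 20    # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo request header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    overhead = IP_HEADER_BYTES + ICMP_HEADER_BYTES
    if mtu <= overhead:
        raise ValueError("MTU %d is too small to carry any payload" % mtu)
    return mtu - overhead


def ping_command(host, mtu=9000, count=3):
    """Build a ping command line that forbids fragmentation.

    Assumes Linux iputils ping; the argument list is meant to be passed
    to subprocess.run() or printed and pasted into a shell.
    """
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(icmp_payload_for_mtu(mtu)), host]


if __name__ == "__main__":
    # "filer.example.com" is a placeholder hostname.
    print(" ".join(ping_command("filer.example.com")))
```

If the 8972-byte ping fails while a 1472-byte one (the equivalent for a 1500 MTU) works, the jumbo frame settings along the path are the first thing to look at.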

1 comment. Tags: debugging, jumboframes, networking.
Reading for the end of January
Fri Jan 29 14:33:39 PST 2010

Off to go craft a budget for next year.

Tags: linky.
Hate hate hate
Fri Jan 29 05:53:03 PST 2010

Here's how to piss me off:

2 comments. Tags: rant, software.
Cassini and Saturn
Tue Jan 26 06:02:31 PST 2010

I came across these pictures taken by the Cassini probe, while looking for pictures of Saturn to show my oldest son. They're beautiful, and I'm heartsick that I'll never get to see these views first-hand.

Tags: wow.
Checks
Wed Jan 20 15:42:57 PST 2010

The more I work with Python, the more I don't just like it but admire it.

Ugh...not much more right now. I've got a blocked eustachian tube that I'm self-medicating with a Python script^W^Wcold medicine, and the acetaminophen in it is making me hazy.

Tags: monitoring, security.
Powerpoint of the damned
Sat Jan 16 06:45:44 PST 2010

From (I think) a fellow Canuck:

A few weeks ago, I was sent a power point presentation on the "[Dynamic
Planning for COIN in Afghanistan][3]". I looked at it briefly, but thought
that it was some kind of joke; so, I flushed it immediately. However,
I received it from another source. So, it appears the joke is on me.

A quick look at this bird’s nest of a concept, would seem to suggest
that Dilbert or some escapee from the Project Management Institute has
taken over planning for COIN operations in Afghanistan. What I see is
yet another attempt to take a complex human activity and turn it into
an MBA project management flowchart. I can see the thinking, “Now that
we have the power point correct, we are sure to win the war in
Afghanistan!” In fact, I’m sure that, if we showed this power point to
the insurgents, they would throw in the towel, convinced that our
superior power point skills indicate that we cannot be
defeated. Really, I don’t know how we fought wars before power point.

Original post here. Found via WarHistorian.org (well recommended).

Tags: wtf.
Must change title
Tue Jan 12 05:38:13 PST 2010

Happy 2010 everyone! Now that it seems to be well and truly under way, I feel I can say that safely.

It's been busy so far. All the stuff I didn't do in 2009 is still on my plate...which is obvious, right? but it still caught me by surprise after the 3 days doing Xmas maintenance on my own. It was easy to forget that there are, you know, people waiting to show up and do work.

Like the new students we've got for one of the faculty members. I'd upgraded OpenSuSE on their new workstations over the holidays, then when they came in yesterday the carefully-tweaked dual monitor displays weren't working. Arghh.

Or the guy who's let me know that he wants to get moving on the MySQL/PHP website he's building...which reminds me that I've still got to move the website to a virtualized machine. I'm tempted to do that RIGHT NOW and put his site in there, but I don't think that'll be the best way to do it.

Or the new project my boss is part of, which involves researchers from across Canada. For me, it's a new website, hardware recommendation and purchases, maybe a new LDAP server. I could add a new root suffix to the existing LDAP server, but

a. we don't need it yet
b. that seems like it'll make it more difficult to move later
c. while I can create one in the existing LDAP server (Fedora/389/CentOS DS), the cn=config tree seems suspiciously empty of any entries related to the new root...so I'm leery of trusting it.

I still haven't sat down yet and tried to plan my year. Partly I've been busy, partly my planning tools are a bit of a mess (daytimer + orgmode + RT). But at some point I need to get my priorities straight and oh, how I long to have them straight. I feel a bit like I'm spinning my wheels right now.

Ah well. In other news, Xmas was good; my kids got two guitars (one acoustic with an Elmo sticker, one fake double-neck electric) which makes four guitars they have now. Since they no longer have that to fight over, they've taken to fighting over a microphone (cardboard tube stuck in a toy that acts like a stand). But damnit, they're still cute.

Family

Finally: Just for fun right now I did a word count of all my blog entries. I've been blogging since 2004, and I've got something like 158,000 words. Amazing. And there are still some entries I've got to grab from my old Slashdot journal.

Tags: geekdad, work.
Well, that'll teach me
Thu Dec 31 12:41:24 PST 2009

While trying to figure out why Nagios was suddenly unable to check up on our databases, I suddenly realized that the permissions on /dev/null were wrong: 0600 instead of 0666. What the hell? I've had this problem before, and I was in the middle of something, so I set them back and went on with my life. Then it happened again, not half an hour later. I was in the same shell, so I figured it had to have been a command I'd run that had inadvertently done this.

Yep: don't run the MySQL client as root. Yes yes yes, it's bad anyway, I'll go to sysadmin hell, but this is an interesting bug. The environment variable MYSQL_HISTFILE is set to /dev/null for root...and when you exit the client, it sets the permissions for the history file to 0600. So, you know, don't do that then. (Still no fix committed, btw...)
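Since this has bitten me more than once, here's a minimal sketch of a check that would catch it: confirm that /dev/null is a character device with mode 0666 and complain otherwise. The function name, the (ok, message) return shape, and the require_char_device switch are my own inventions for illustration; the exit codes at the bottom follow the usual Nagios plugin convention of 0 for OK and 2 for CRITICAL.

```python
import os
import stat
import sys


def check_null_device(path="/dev/null", expected_mode=0o666,
                      require_char_device=True):
    """Return (ok, message) describing whether path looks like a sane /dev/null.

    require_char_device exists so the mode logic can be exercised against
    an ordinary file; leave it True for the real device.
    """
    st = os.stat(path)
    mode = stat.S_IMODE(st.st_mode)  # permission bits only
    if require_char_device and not stat.S_ISCHR(st.st_mode):
        return False, "%s is not a character device" % path
    if mode != expected_mode:
        return False, "%s has mode %04o, expected %04o" % (
            path, mode, expected_mode)
    return True, "%s OK (mode %04o)" % (path, mode)


if __name__ == "__main__":
    ok, message = check_null_device()
    print(message)
    sys.exit(0 if ok else 2)
```

It doesn't fix anything, it just means the monitoring would report the bad permissions directly instead of leaving me to work backwards from database checks that have mysteriously stopped working.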

2 comments. Tags: bug.
Xmas maintenance
Thu Dec 31 05:57:47 PST 2009

A nice thing about working at a university is that you get all this time off at Xmas, which is really nice; however, it's also the best possible time to do all the stuff you've been saving up. Last year my time was split between this job and my last; now, the time's all mine, baby.

Today will be my last of three days in a row where the machines have been all mine to play with^W^Wupgrade. I've been able to twiddle the firewall's NIC settings, upgrade CentOS using Cfengine, and set up a new LDAP server using Cobbler and CentOS Directory Server. I've tested our UPS' ATS, but discovered that NUT is different from APCUPSD in one important way: it doesn't easily allow you to say "shut down now, even though there's 95% battery left". I may have to leave testing of that for another day.

It hasn't all gone smoothly, but I've accomplished almost all the important things. This is a nice surprise; I'm always hesitant when I estimate how long something will take, because I feel like I have no way of knowing in advance (interruptions, unexpected obstacles...you know the drill). In this case, the time estimates for individual tasks were, in fact, 'way paranoid, but that gave me the buffer that I needed.

One example: after upgrading CentOS, two of our three servers attached to StorageTek 2500 disk arrays reported problems with the disks. Upon closer inspection, they were reporting problems with half of the LUNs that the array was presenting to them -- and they were reporting them in different ways. It had been a year or longer since I'd set them up, and my documentation was pretty damn slim, so it took me a while to figure it out. (Had to sleep on it, even.)

The servers have dual paths to the arrays. In Linux, the multipath drivers don't work so well with these, so we used the Sun drivers instead. But:

  1. You have to rebuild the drivers after a kernel change.
  2. This only showed up on two servers because the third server had not upgraded its kernel (or indeed, any of its packages). Why? cfservd had refused its connection because I had the MaxConnections parameter too low.
  3. And of the two that did upgrade, the one machine we'd tested the Linux drivers on still had an old multipath.conf file in /etc, which, even though the multipathd service wasn't starting up, was enough to get the drivers loaded. This took a while to figure out because I'd completely forgotten how to tell which driver was in use.

I got it fixed in the end, and I expanded the documentation considerably. (49,000 words and counting in the wiki. Damn right I'm bragging!)

Putting off 'til next time, tempted though I am: reinstalling CentOS on the monitoring machine, which due to a mix of EPEL and Dag repos and operator error appears to be stuck in a corner, unable to upgrade without ripping out (say) Cacti. I moved the web server to a backup machine on Tuesday, and I'll be moving it back today; this is not the time to fiddle with the thing that's going to tell me I've moved everything back correctly.

(Incidentally, thanks to Matt for the rubber duck, who successfully talked me down off the roof when I was mulling this over. Man, that duck is so wise...)

Last day today. (Like, ever!) If I remember correctly I'm going to test the water leak detector...and I forget the rest; it's all in my daytimer and I'm too lazy to get up and look right now. Wish me luck.

And best of 2010 to all of you!

Tags: centos, cfengine, monitoring, packagemanagement, serverroom, upgrades, work.
