exec 3>&1 ; /path/to/script 2>&1 >&3 3>&- | egrep -v 'useless|junk' ; exec 3>&-
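That one-liner is the classic fd-swap trick: filter a script's stderr through egrep while its stdout passes through untouched. It's easier to see with a toy command standing in for /path/to/script; `mycmd` below is that stand-in (and `grep -Ev` is just the modern spelling of `egrep -v`):

```shell
# Stand-in for /path/to/script: one line to stdout, two to stderr.
mycmd() {
  echo "real output"
  echo "useless warning" >&2
  echo "important error" >&2
}

exec 3>&1                 # fd 3 now duplicates the original stdout
# 2>&1 sends stderr into the pipe; >&3 routes real stdout around the
# pipe via fd 3; 3>&- closes fd 3 in the child. grep sees only stderr.
mycmd 2>&1 >&3 3>&- | grep -Ev 'useless|junk'
exec 3>&-                 # close fd 3 when done
```

Run that and "real output" and "important error" come through, while "useless warning" gets eaten by the filter.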
This was my first week on call at $WORK, and naturally a few things came up -- nothing really huge, but enough that the rhythm I'd been slowly developing (and coming to relish) was pretty much lost. And then Friday night/Saturday morning I was paged three times (11pm, 1am and 5.30am) -- mostly minor things, but enough that I was pretty much a wreck yesterday. I'm coming to dread the sad trombone.
Besides that, I've also been blogging about the LISA14 conference for USENIX, along with Katherine Daniels (@beerops) and Mark Lamourine (@markllama). They've got some excellent articles up; Mark wrote about LISA workshops, and Katherine described why she's going to LISA. Awesome stuff and worth your time.
I managed to brew last week for the first time since thrice-blessed February; it's a saison (yeast, wheat malt, acidulated malt) with a crapton of homegrown hops (roughly a pound). I'm looking forward to this one.
Going to San Francisco again week after next for $WORK. (Prospective busyness.)
Kids are back to school! Youngest is in grade 1 and oldest in grade 3. Wow.
Today I complete my (scurries to check calendar) third week of work at OpenDNS. There is a lot to take in.
I flew in to San Francisco without problems; like previous times, there was no having to opt out of scanners at YVR (the airport, not the office). I got moderately tangled up in the BART because I'd become convinced it was Saturday instead of Monday, but once I figured that out I got to the office without problems. The corporate apartment is right around the corner, which is definitely handy, and close to the 21st Amendment Brewpub which was even more handy. Went there for supper:
I stayed for a while, listening to the startups happening around me (not even kidding), then went back to the apartment and slept fitfully.
Next day was first day. There were 7 of us starting that week, including one other YVRite. The onboarding (and now I'm using that word) was very, very well organized: we've had talks from HR, from the CFO, from the VP of Sales and from actual sales people, from Security, from the IT guy (and it is strange not to be the IT guy) and from the...oh god, I'd have to look it up. There was a lot, but it was interesting to get such a broad overview of the company.
The second meeting of the day, right after HR got us to fill out the necessary forms, was to get our laptops. Mine is a 15" MacBook Pro AirPony CloudTastic or some such; 16 GB of RAM, Retina display, SSD. What it works out to be is wicked fast and pretty. It has not been nearly as hard to get used to it as I thought it would be -- not just because I'm adaptable and like computers, but because even though it's not Linux it's not getting in my way any (which in turn is because, much more than anywhere I've worked previously, so very very much of what we do is done through a browser, using apps/services which have been designed within the last five years).
Natch this makes me question my ideological purity. But I can also see, really see the point of having things be easy, particularly at scale. Which is kind of a ridiculous thing to say for a sysadmin, whose job is supposedly making things easy for other people. But there you go. I love Linux, but there's no question that (modulo the fact that I'm seeing the end product, not the work that went into it) making everything that seamless would probably be a lot more work.
Speaking of which, I'd just like to give shoutouts to the IT people at OpenDNS. They are incredibly well organized, efficient, friendly and helpful. I need to take notes. Oh, and: it is strange not being the IT person -- at one point my laptop was misbehaving, and I had to/got to ask someone else for help fixing it. Wah.
Oh, and: the HR department is well-organized too. Everyone shows up to their new desk which is clearly marked with a) balloons and b) swag:
In the midst of this week full of meetings, I got to meet my coworkers. Some I'd interviewed with, some were new to me (like Keith, who has a degree in accounting: "I learned two things: don't screw with the IRS, and I hate accounting!"). They are all friendly and smart. There were knowledge drops and trips to the lunch wagons and finding different meeting rooms (".cn is booked." "What about .gr?") and whiteboarding and I don't know what-all. Oh, and one of the new people starting that week is Levi, another systems engineer, who came over after 7 years at Facebook. Wonderful guy; I was intimidated, but it turned out I knew a few things he didn't (and of course vice-versa), so that restored my confidence.
Things are organized. There are agile and kanban boards and managers who actually help -- not that they wouldn't, I guess, but I'm so used to either being on my own or wishing my manager would just go away. This is nice. There are coworkers (have I mentioned them?) who help -- it's not just me anymore. This means not only that I don't have to do everything myself, but that I can't just go rabbiting off in all directions when something cool comes up.
Oh, and: there are these wonderful sit/stand desks from GeekDesk.com -- they're MOTORIZED! They're all over the SFO office, and will soon be coming to the YVR office. They're wonderful; if I ever work from home on a regular basis, I will really really want one.
There wasn't a lot of time for wandering around -- mostly, by the end of the day I was pretty exhausted -- but Thursday night I walked across town, from King Street BART station to Pier 39. It was ~9 km all told, and it was a wonderful walk. I ended up going past City Lights Bookstore and Washington Square park; back in 1999, my wife and I spent an afternoon in that park, where a homeless guy insisted that I remove my sunglasses so he could see if I was an alien (I wasn't). It was cool to see it again. The touristy stuff was great in its schlocky, touristy way, and I hunted around for sportsball t-shirts for my kids.
Friday we had the weekly OpenDNS all-hands meeting, where (among other things) new hires tell three fun facts about themselves. Mine were:
I counted moose from a helicopter when I participated in a moose population survey. And when I say "participated" I mean "was ballast". I worked one summer for the Ministry of Natural Resources in Ontario. A helicopter was flying out with one pilot and two biologists, so it was unbalanced. I came along so the helicopter could stay level. Saved a lot of lives that day.
I'm an early investor in David Ulevitch, OpenDNS' CEO. Back when he was running EveryDNS, which provided free DNS service for domains, I sent in $35 as a donation. When Dyn.com bought EveryDNS, they grandfathered in all the people who'd donated, and I've now got free DNS for my domains for life. Woot!
And of course, the story of the golden pony.
Friday afternoon I flew back; opted out of the scanner (and forgot to tell my coworker flying back with me that I'd be half an hour getting through security; apologized later), had supper and a beer at the airport, and just generally had an uneventful flight home. The beers I brought home for my wife made it through everything intact, there were stickers for the kids, and everyone was happy to see me (aw!).
I was gonna write up the first two weeks at OpenDNS, but then my youngest son couldn't get to sleep. That doesn't happen often, and he's always upset when it does, so it was my job to tell him a long, rambling story about his day and try to get him to relax. So there went my writing time.
Quickly, then: the people are great; there's a lot to take in; we move to a new office tomorrow (temp space for two months, then our final digs); overall, it's a big, big challenge and it's pretty wonderful. My head is still swimming. Pix still needed.
First day completed. There is a lot of stuff to learn. People at OpenDNS are wonderful. San Francisco is fascinating, and it's really neat listening to the American Music CLub album of the same name while here.
I will put in pictures later. Time to read more.
I'm late doing this, but last Friday July 11th was my last day at UBC. And what did my wonderful coworkers do for me? They got me a going-away present:
I had to show the new pony the picture of the pony I put up in my office:
And c/o Steve MacDonald, Serious pony is serious:
From the WayBac Machine comes the last time I celebrated ponies with such gusto:
That was a going-away present (seems to be a pattern...) from coworkers when I left to start at UBC -- nine days after my oldest son was born. I was at UBC for eight years and one day, and had worked for CHiBi for just about six years.
Today I fly off to San Francisco to start work for OpenDNS; I'll be there 'til Friday getting trained and oriented and I don't know what-all. My dad wants a hoodie, my kids want stickers, my wife wants beer and I want to see it all.
For a while now, I've been wanting to work in a different environment. UBC is a lovely place to work, and the people at CHiBi are wonderful...but I've been there more than five years now, and I was getting itchy feet.
Last year I wrote down what exactly I wanted out of a new job:
Larger scale: I took my current job because it was a chance to work with so much that I hadn't before: dozens of servers, an actual server room, HPC, and so on. I want that same feeling of "I've never done that before!" (See also: "Holy crap, what have I got myself into?")
Linux/Unix focused: It's no secret that Linux makes the sun shine and the grass grow, and BSDs make the planets go in their orbits. Why would I ever want anything else?
Actual coworkers: For most of my time as a sysadmin, I've worked on my own. I had a junior for a while (Hi Paul!) and that was wonderful, but other than that I've been alone. I really, really wanted to change that. Andy Seely, a damn good friend of mine, likes to say "If I find myself the smartest person in the room, I know I need to find a new room." That was exactly how I was feeling.
Friendly. I work in a friendly, open place, and I've no desire to give that up.
I kept my eye out. And back in April I saw that OpenDNS was hiring. So I sent in a resume. They got back to me. There were lots of interviews (I think I talked with five different people), a coding test (two, actually, and they made me sweat) and a technical test. And then, finally, I was sitting in their offices in Gastown, talking to the guy who'd just offered me a job.
Larger scale: check; they've just opened their nth and (n+1)th data centres in Vancouver and Toronto. Linux/Unix focused: yep; Linux and FreeBSD rule the coop. Actual coworkers: they're on it; there are two other people I'll be working with (and they've been running all, or at least a lot, of the infrastructure for the last few years). Friendly: four for four, because everyone there has been really, really...well, friendly.
So: I start July 15th as a Systems Engineer with the good folks at OpenDNS. I'm excited and a little freaked out to be working with all these good, smart people.
In the meantime: if you want a job as a Linux sysadmin, working with the excellent people at the Centre for High-Throughput Biology who do a science EVERY DAY, you can apply here. Closing date is Friday, June 20th, so hurry. Apply early and apply often!
I found out before Xmas that my request for an office had been approved. We had some empty ones hanging around, and my boss encouraged me to ask for one. Yesterday I worked in it for the first time; today I started moving in earnest, and moved my workstation over.
And my god, is it ever cool. The office itself is small but nice -- lots of desk space, a bookshelf, and windows (OMG natural light). But holy crap, was it ever wonderful to sit in there, alone, uninterrupted, and work. Just work. Like, all the stuff I want to do? I was able to sit down, plan out my year, and figure out that in the grand scheme of things it's not too much. (The problem is wanting to do everything right away.) And today I worked on getting our verdammt printing accounting exposed to Windows users, setting up Samba for the first time in eons and even getting it to talk to LDAP.
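For my own future reference, getting Samba talking to LDAP boiled down to a handful of passdb lines in smb.conf. This is a sketch from memory with placeholder hostnames and suffixes, not our actual config:

```ini
[global]
   workgroup = EXAMPLE
   security = user
   # ldapsam: accounts live in the directory instead of a local tdb
   passdb backend = ldapsam:ldap://ldap.example.com
   ldap suffix = dc=example,dc=com
   ldap user suffix = ou=People
   ldap group suffix = ou=Groups
   ldap admin dn = cn=Manager,dc=example,dc=com
   ldap ssl = start tls
```

Plus a `smbpasswd -w` so Samba can stash the admin DN's password in secrets.tdb.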
Not only that -- not only that, I say, but when I had interruptions -- when people came to me with questions -- it was fine. I didn't feel angry, or lost, or helpless. I helped them as best I could, and moved on. And got more shit done in a day than I've done in a week.
I'm going to miss hanging out with the people in the cubicle I was in. Yes, they're only ten metres away, but there's a world of difference between having people there and having to walk over to see them. I'm not terribly outgoing, and it's just in the last six months or so that I've really come to enjoy all the people around me. They're fun, and it's nice to be able to talk to them. (There's even a programmer who homebrews, for a wonderful mixture of tech and beer talk.) But oh my sweet darling door made of steel and everything, I love this place.
Prompted by fierce internecine rivalry with Tampa Bay Breakfasts, I'm finally putting in an update. My supervisor is my four-year-old son, who's busy reading "You are the first kid on Mars" beside me while holding on to Power Ranger and Terl action figures.
Work: I've got a summer student. She was at one of the labs I work with for the last 8 months, and showed a real aptitude for computers. My boss agreed to pick up the bill for her salary, so here we are.
It's working out really, really well. She's got a lot to learn (basic networking, for example) but it is SUCH A WONDERFUL THING to have someone to send off on jobs. "Hey, have you got a minute to..." "She'll take care of it." She can help with what she knows, and what she doesn't she takes careful notes on. I've even had a chance to work on other, larger projects for, like, an hour or two at a time. It's great.
I'm going away for three weeks in June/July, and there's a lot to teach her before then. Fortunately, there are a couple other sysadmins who can help out, and a couple of other technical folk in the lab who can take on some duties. But it's been a real wake-up for me, realizing how much could be made easier for someone else. It'd be nice, for example, to have something that'd let people reboot machines easily when they get stuck. Right now, I SSH to the ILOM and reset it there; what about a web page? It'd be its own set of problems, of course, and I'm not going to code something up between now and June, but it's something to think about. Or at least coming up with some handy wrapper around the ipmipower/console commands.
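A first stab at that wrapper might be nothing more than a shell function around ipmitool. The `$host-ilom` naming convention and the admin user here are assumptions (not how our BMCs are actually named), and the DRY_RUN switch exists so it can be exercised without power-cycling anything:

```shell
# Hypothetical reset wrapper: host "foo" is assumed to have its BMC at
# "foo-ilom", reachable with IPMI-over-LAN. The password comes from the
# IPMI_PASSWORD environment variable (that's what ipmitool's -E reads).
remote_reset() {
  host=$1
  cmd="ipmitool -I lanplus -H ${host}-ilom -U admin -E chassis power reset"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$cmd"        # just show what would run
  else
    $cmd
  fi
}

DRY_RUN=1
remote_reset web01     # prints the command instead of resetting web01
```

Hand that (with some sanity checks) to someone else and "can you bounce web01?" stops being my problem.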
Home: The weather is at last, AT LAST becoming sunny and springlike. I took the telescope out on Saturday -- full moon, so I spent most of my time looking at Saturn. And holy crap, was it amazing! I saw the Cassini division for the first time, the C ring (!) and five moons. I'm starting to regret (a little) having sold the 4.3mm eyepiece; the 7.5mm is nice but does badly in the Barlow, which I suspect says more about the Barlow than anything else. (Also that night: tried looking for M65 and M66, just to see if I could find them in the suburbs under a full moon. Negative.)
I'm trying to port an astronomical utility to Rockbox; it will show altitude and azimuth for planets, Messier and NGC objects. My intention is to use it with manual setting circles on my Dob. The interesting part is that Rockbox has no floating point arithmetic, so it's not a straightforward port at all. Thus I've had to learn about fixed point arithmetic, lookup tables and the like. My trig and bitwise arithmetic are, how do you say, weak from underuse, so this is a bit of a slog. But I'm hopeful.
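The flavour of it, for anyone who hasn't met fixed point: pick a power-of-two scale factor, do everything in plain integers, and rescale after every multiply. Here's a toy Q14 (scale 2^14) multiply in integer-only shell arithmetic, since integers-only is exactly the constraint Rockbox imposes; the Q14 choice is just for illustration:

```shell
# Q14 fixed point: the real number x is stored as the integer x * 16384.
S=16384                                # scale factor, 2^14

fmul() {                               # multiply two Q14 numbers
  # the raw product carries the scale twice, so divide one S back out
  echo $(( $1 * $2 / S ))
}

half=8192                              # 0.5  * 16384
quarter=4096                           # 0.25 * 16384
fmul "$half" "$quarter"                # 0.125 * 16384 = 2048
```

On Rockbox you'd do the same with C ints, watching that the intermediate product doesn't overflow 32 bits; sin/cos then become a lookup table of precomputed fixed-point values, interpolating between entries.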
And now my other supervisor is coming for a status report. Time to go!
Xmas vacation is when I get to do big, disruptive maintenance with a fairly free hand. Here's some of what I did and what I learned this year.
I made the mistake of rebooting one machine first: the one that held the local CentOS mirror. I did this thinking that it would be a good guinea pig, but then other machines weren't able to fetch updates from it; I had to edit their repo files. Worse, there was no remote console on it, and no time (I thought) to take a look.
Last year I tried getting machines to upgrade using Cfengine like so:
centos.some_group_of_servers.Hr14.Day29.December.Yr2009::
    "/usr/bin/yum -q -y clean all"
    "/usr/bin/yum -q -y upgrade"
    "/usr/bin/reboot"
This didn't work well: I hadn't pushed out the changes in advance, because I was paranoid that I'd miss something. When I did push it out, all the machines hit on the cfserver at the same time (more or less) and didn't get the updated files because the server was refusing connections. I ended up doing it by hand.
This year I pushed out the changes in advance, but it still didn't work because of the problems with the repo. I ran cssh, edited the repos file and updated by hand.
This worked okay, but I had to do the machines in separate batches -- some needed to have their firewall tweaked to let them reach a mirror in the first place, some I wanted to watch more carefully, and so on. That meant going through a list of machines, trying to figure out if I'd missed any, adding them by hand to cssh sessions, and so on.
I may need to give in and look at RHEL, or perhaps func or better Cfengine tweaking will do the job.
Quick and dirty way to make sure you don't overload your PDUs:
sleep $(expr $RANDOM / 200 ) && reboot
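The magic 200, so future me doesn't have to re-derive it: bash's $RANDOM runs 0..32767, so the division spreads the reboots over a window of up to ~163 seconds. A variant with the window made explicit (reboot left commented out so this is safe to run anywhere):

```shell
MAX=163                        # widest stagger, in seconds
delay=$(( RANDOM % (MAX + 1) ))
echo "sleeping ${delay}s before reboot"
# sleep "$delay" && reboot     # commented out: don't reboot the box you test on
```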
Rebooting one server took a long time because the ILOM was not working well, and had to be rebooted itself.
Upgrading the database servers w/the 3 TB arrays took a long time: stock MySQL packages conflicted with the official MySQL rpms, and fscking the arrays takes maybe an hour -- and there's no sign of life on the console while you're doing it. Problems with one machine's ILOM meant I couldn't even get a console for it.
Holy mother of god, what an awful time this was. I spent eight hours on upgrades for just nine desktop machines. Sadly, most of it was my fault, or at least bad configuration:
Graphics drivers: awful. Four different versions, and I'd used the local install scripts rather than creating an RPM and installing that. (Though to be fair, that would just rebuild the driver from scratch when it was installed, rather than do something sane like build a set of modules for a particular kernel.) And I didn't figure out where the uninstall script was 'til 7pm, meaning lots of fun trying to figure out why the hell one machine wouldn't start X.
Lesson: This really needs to be automated.
Lesson: The ATI uninstall script is at /usr/share/ati/fglrx-uninstall.sh. Use it.
Lesson: Next time, uninstall the driver and build a goddamn RPM.
Lesson: A better way of managing xorg.conf would be nice.
Lesson: Look for prefetch options for zypper. And start a local mirror.
Lesson: Pick a working version of the driver, and commit that fucker to Subversion.
These machines run some scientific software: one master, three slaves. When the master starts up at boot time, it tries to SSH to the slaves to copy over the binary. There appears to be no, or poor, rate throttling; if the slaves are not available when the master comes up, you end up with the following symptoms:
The problem is that umpty scp processes on the slave are holding open the binary, and the kernel gets confused trying to run it.
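Spotting that is mostly a matter of asking who still has the binary open. lsof does it, but on a box without lsof a walk of /proc works too; this sketch is Linux-only:

```shell
# Print the PIDs of processes holding the given file open (Linux /proc walk).
holders() {
  target=$(readlink -f "$1")
  for p in /proc/[0-9]*; do
    for fd in "$p"/fd/*; do
      if [ "$(readlink "$fd" 2>/dev/null)" = "$target" ]; then
        echo "${p#/proc/}"     # strip /proc/ to leave the bare PID
        break
      fi
    done
  done
}
```

Something like `holders /path/to/binary | xargs -r kill` would have cleared the stuck scp processes, after eyeballing the list for anything legitimate.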
I also ran into problems with a duff cable on the master; confusingly, both the kernel and the switch said it was still up. This took a while to track down.
It turned out that a couple of my kvm-based VMs did not have jumbo frames turned on. I had to use virt-manager to shut down the machines, switch their NICs over to the virtio driver, then reboot. However, kudzu on the VMs then saw these as new interfaces and did not configure them correctly. This caused problems because the machines were LDAP clients and hung when the network was unavailable.
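If it happens again: kudzu's mangling can be undone by pinning the interface config back down in its ifcfg file. A sketch (CentOS-style; the device name and address are placeholders, and MTU=9000 is what turns jumbo frames back on):

```ini
# /etc/sysconfig/network-scripts/ifcfg-eth0 -- virtio NIC, jumbo frames
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.0.2.10
NETMASK=255.255.255.0
MTU=9000
```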
In the spirit of Chris Siebenmann, and to kick off the new year, here's a post that's partly documentation for myself and partly an attempt to ensure I do it right: how I manage my tasks using Org Mode, The Cycle, my Daytimer and Request Tracker.
Org Mode is awesome, even more awesome than the window manager. I love Emacs and I love Org Mode's flexibility.
Tom Limoncelli's "Time Management for System Administrators." Really, I shouldn't have to tell you this.
DayTimer: because I love paper and pen. It's instant boot time, and it's maybe $75 to replace (and durable) instead of $500 (and delicate). And there is something so satisfying about crossing off an item on a list; C-c C-c just isn't as fun.
RT: Email, baby. Problem? Send an email. There's even rt-liberation for integration with Emacs (and probably Org Mode, though I haven't done that yet).
Problems that crop up, I email to RT -- especially if I'm not going to deal with them right away. This is perfect for when you're tackling one problem and you notice something else non-critical. Thus, RT is often a global todo list for me.
If I take a ticket in RT (I'm a shop of one), that means I'm planning to work on it in the next week or so.
Planning for projects, or keeping track of time spent on various tasks or for various departments, is kept in Org Mode. I also use it for things like end-of-term maintenance lists. (I work at a university.) It's plain text, I check it into SVN nightly, and Emacs is The One True Editor.
My DayTimer is where I write down what I'm going to do today, or that appointment I've got in two weeks at 3pm. I carry it everywhere, so I can always check before making a commitment. (That bit sampled pretty much directly from TL.)
Every Monday (or so; sometimes I get delayed) I look through things to see what has to be done:
I plan out my week. "A" items need to be done today; "B" items should be done by the end of the week; "C" items are done if I have time.
Once every couple of months, I go through RT and look at the list of tickets. Sometimes things have been done (or have become irrelevant) and can be closed; sometimes they've become more important and need to be worked.
I try to plan out what I want to get done in the term ahead at the beginning of the term, or better yet just before the term starts; often there are new people starting with a new term, and it's always a bit hectic.
I'm trying to get Bacula to make a separate copy of monthly full backups that can be kept off-site. To do this, I'm experimenting with its "Copy" directive. I was hoping to get a complete set of tapes ready to keep offsite before I left, but it was taking much longer than anticipated (2 days to copy 2 tapes). So I cancelled the jobs, ran unmount at bconsole, and went home thinking Bacula would just grab the right tape from the autochanger when backups came.
What I should have typed was release. release lets Bacula grab whatever tape it needs; unmount leaves Bacula unwilling to do anything on its own, and it waits for the operator (ie, me) to do something.
Result: 3 weeks of no backups. Welcome back, chump.
There are a number of things I can do to make sure this doesn't happen again. There's a thread on the Bacula-users mailing list (came up in my absence, even) detailing how to make sure something's mounted. I can use release the way Kern intended. I can set up a separate check that goes to my cell phone directly, and not through Nagios. I can run a small backup job manually on Fridays just to make sure it's going to work. And on it goes.
I knew enough not to make changes as root on Friday before going on vacation. But now I know that includes backups.
Been busy lately:
3 new workstations with OpenSuSE. Can't figure out the autoinstall, so it's checklist time, baby.
Software upgrade for a fairly important server + 3 slave nodes. Natch, after rebooting one of the ILOMs for the servers just...went away. Can't ping it from the network. Works fine with an interactive ilom shell from Linux. Sometimes I really hate Dell software.
Got a call from the reseller for a major hardware vendor who just got taken over by a major database vendor; said db vendor has just turned off educational discounts we'd spent THREE MONTHS negotiating/waiting to have approved. I am unimpressed. Strongly tempted to call up random hardware vendors and throw money at them 'til they give us stuff.
Finally got leak detection working in the server room. Stupidly long time, it took.
Working on a "Lessons Learned" presentation for LISA that'll include mention of the leak detection (among other things). Not sure how it'll be received, but I figure it's their job to tell me it sucks, not mine.
New term coming, so about six new people coming. But at least I know about them in advance.
But hey! Turns out we live in a constitutional democracy after all. There was some debate about this at 24 Sussex Drive, I understand. Score one for the good guys.
Happy 2010 everyone! Now that it seems to be well and truly under way, I feel I can say that safely.
It's been busy so far. All the stuff I didn't do in 2009 is still on my plate...which is obvious, right? but it still caught me by surprise after the 3 days doing Xmas maintenance on my own. It was easy to forget that there are, you know, people waiting to show up and do work.
Like the new students we've got for one of the faculty members. I'd upgraded OpenSuSE on their new workstations over the holidays, then when they came in yesterday the carefully-tweaked dual monitor displays weren't working. Arghh.
Or the guy who's let me know that he wants to get moving on the MySQL/PHP website he's building...which reminds me that I've still got to move the website to a virtualized machine. I'm tempted to do that RIGHT NOW and put his site in there, but I don't think that'll be the best way to do it.
Or the new project my boss is part of, which involves researchers from across Canada. For me, it's a new website, hardware recommendation and purchases, maybe a new LDAP server. I could add a new root suffix to the existing LDAP server, but
a. we don't need it yet
b. that seems like it'll make it more difficult to move later
c. while I can create one in the existing LDAP server (Fedora/389/CentOS DS), the cn=config tree seems suspiciously empty of any entries related to the new root...so I'm leery of trusting it.
I still haven't sat down yet and tried to plan my year. Partly I've been busy, partly my planning tools are a bit of a mess (daytimer + orgmode + RT). But at some point I need to get my priorities straight and oh, how I long to have them straight. I feel a bit like I'm spinning my wheels right now.
Ah well. In other news, Xmas was good; my kids got two guitars (one acoustic with an Elmo sticker, one fake double-neck electric) which makes four guitars they have now. Since they no longer have that to fight over, they've taken to fighting over a microphone (cardboard tube stuck in a toy that acts like a stand). But damnit, they're still cute.
Finally: Just for fun right now I did a word count of all my blog entries. I've been blogging since 2004, and I've got something like 158,000 words. Amazing. And there are still some entries I've got to grab from my old Slashdot journal.
A nice thing about working at a university is that you get all this time off at Xmas, which is really nice; however, it's also the best possible time to do all the stuff you've been saving up. Last year my time was split between this job and my last; now, the time's all mine, baby.
Today will be my last of three days in a row where the machines have been all mine to play with^W^Wupgrade. I've been able to twiddle the firewall's NIC settings, upgrade CentOS using Cfengine, and set up a new LDAP server using Cobbler and CentOS Directory Server. I've tested our UPS' ATS, but discovered that NUT is different from APCUPSD in one important way: it doesn't easily allow you to say "shut down now, even though there's 95% battery left". I may have to leave testing of that for another day.
It hasn't all gone smoothly, but I've accomplished almost all the important things. This is a nice surprise; I'm always hesitant when I estimate how long something will take, because I feel like I have no way of knowing in advance (interruptions, unexpected obstacles...you know the drill). In this case, the time estimates for individual tasks were, in fact, 'way paranoid, but that gave me the buffer that I needed.
One example: after upgrading CentOS, two of our three servers attached to StorageTek 2500 disk arrays reported problems with the disks. Upon closer inspection, they were reporting problems with half of the LUNs that the array was presenting to them -- and they were reporting them in different ways. It had been a year or longer since I'd set them up, and my documentation was pretty damn slim, so it took me a while to figure it out. (Had to sleep on it, even.)
The servers have dual paths to the arrays. In Linux, the multipath drivers don't work so well with these, so we used the Sun drivers instead. But:
cfservd had refused its connection because I had the MaxConnections parameter too low.
I got it fixed in the end, and I expanded the documentation considerably. (49,000 words and counting in the wiki. Damn right I'm bragging!)
Putting off 'til next time, tempted though I am: reinstalling CentOS on the monitoring machine, which due to a mix of EPEL and Dag repos and operator error appears to be stuck in a corner, unable to upgrade without ripping out (say) Cacti. I moved the web server to a backup machine on Tuesday, and I'll be moving it back today; this is not the time to fiddle with the thing that's going to tell me I've moved everything back correctly.
(Incidentally, thanks to Matt for the rubber duck, who successfully talked me down off the roof when I was mulling this over. Man, that duck is so wise...)
Last day today. (Like, ever!) If I remember correctly I'm going to test the water leak detector...and I forget the rest; it's all in my daytimer and I'm too lazy to get up and look right now. Wish me luck.
And best of 2010 to all of you!
Wake up at 5am because the youngest son's teething and he's going to be up at 5.20am
Clean up ZFS snapshots yet again; repeat "I must schedule this in cron" for the nth time
Put in request to maintenance to look at server room humidifier; current block o' cold weather == 10% RH in there
Find out why Mailman stopped working (probably permission problems on the logs), and how to monitor this (settle for web interface for now, since that wasn't working either; will probably need to make my peace with user accounts for Nagios on machines I'm monitoring)
Figure out why Drupal is shitting cron.php files all over the place (still no idea)
Fill out performance review for self
Back up Windows 2003 server before tossing it over the fence to software installer
Start writing article for SysAdvent at last on OCSNG/GLPI
Struggle with cfengine tidy stanza that doesn't work; repeat "I must upgrade to cfengine 3 or puppet" for the nth time
Tonight, bed at 8.30pm. And there's no shame in that.
At $WORK, I'm going to be taking over the administration of four servers that currently do stuff for a variety of researchers scattered around the province. There are a number of players here:
The owning agency has also ponied up for an upgrade to the four servers; I'll be taking delivery some time next week.
I've got some preliminary information -- what the servers do, how the users use the thing, etc -- but I'm preparing a more detailed plan. In the meantime, I've compiled a list of questions for my local contact.
In the middle of that, it occurred to me that this would be a good discussion topic. Have I missed anything? Let me know!
What info do users/owners expect from us? How? (Mailing list, status page; 2 weeks notice of downtime, monthly stats by CPU)
- Are any funding decisions influenced by this information?
Where is the info for the software?
- license #, what we have licenses for (unlimited use, # cores, etc)
- support #, what it covers
Can I see a demo of the software?
Do any of the labs have shell access? What do they do with it?
What exactly is involved in maintenance? Where is this documented?
What DNS changes will be made? Who makes them?
Who makes policy/purchase decisions about these servers? How do I contact them?
Given the recent hoo-ha about abandoned blogs, and my own tendency to lose interest in writing about something the longer I put it off (I haven't graphed it, but I suspect it's a nice exponential decay), I figured I should finally write up what I've been doing the last week: the move at $WORK to our new server room.
So: construction finally got finished on our new server room. Our UPS was installed, our racks set up, and the keys handed over (though they were to be changed again twice). Our new netblock was assigned, the Internet access at the new location was in place, and movers were booked.
Things I did in advance which helped immensely:
Last Thursday morning, it all started. I got the machines shut down (thank you, SSH and ubiquitous wireless access at UBC) before the two volunteers who were helping me showed up. We started getting machines unracked; since it was only about 20 machines, I figured it wouldn't take too long. While that was true, I had not counted on the rat's nest of power cables (our power requirements were such that we had to connect machines to PDUs in adjacent racks), or the fact that we wouldn't be able to disassemble that 'til we'd got the machines out.
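The remote-shutdown part is nothing fancy; here's a sketch of the sort of loop I mean. The host names and the root-over-SSH setup are assumptions, not our actual machine list, and it's in dry-run mode so it prints the commands instead of running them.

```shell
#!/bin/sh
# Dry-run sketch of shutting down a rack's worth of machines over SSH.
# The host names are made up; set dry_run=0 only when you mean it.
hosts="web1 web2 db1"
dry_run=1
for h in $hosts; do
    cmd="ssh root@$h shutdown -h now"
    if [ "$dry_run" -eq 1 ]; then
        echo "would run: $cmd"
    else
        $cmd
    fi
done
```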
There was one heartstopping moment: a 1U server, while extended on its rails, came off one of the rails while no one was supporting it. Amazingly the other rail held on while it rotated quickly through 90 degrees to bang loudly against the rack. "You swear quickly," the movers remarked. (Doubly amazingly, the machine seems to be fine, though the rails for the thing are shot.)
The movers were big and burly, which was wonderful when it came to moving the Thumper. I weigh more than it does, but not by much, and I'd had the bad fortune to screw up my back a week before the move. It was tricky trying to figure out how to remove it from the rails, but the movers' trick of supporting it with a couple of big blankets, while fully extended from the rack, made such considerations less urgent. Eventually we got it figured out. I don't know how that could have gone smoother, since we'd got Sun to rack the thing and, frankly, it's not like you spend a lot of time un- and re-racking something like that. Anyhow, a minor point.
The new location was right around the corner, which was handy. The movers had put the servers in these big laundry-like carts on wheels; in the end, we only had four of 'em. We got the machines unloaded, racked the Thumper with the movers' help, signed the paper, then went off for lunch, where we picked up two more volunteers.
After that, we started racking servers. Having only one sysadmin around (me) proved to be a bottleneck; the volunteers had not worked with rackmounted machines before, and I kept having to stop what I was doing to explain something to them. It would have been a great help to have another admin around; in fact, I think this is the biggest move I'd want to make without some other admin around.
Problems we ran into:
Things that went well:
I'm going to post this now because if I don't, it'll never get done. I may come back and revise it later, but better this than nothing at all.
This has been one of those days where all I've done is stare at monitors too closely.
I know, I'm a sysadmin, what do I expect? But some days I get up, move around; I'm sedentary (and introverted) by nature but I try to talk to people, stare off into the distance, get away from my desk. Going to the server room is always a good break.
Not today, though. My carefully-chosen ATI video card (the Radeon 4550) is giving me headaches, metaphorical and real:
Dual monitors are important. My own damn fault for not getting something old enough...
Okay, so the other thing I was going to do was blog regularly. And now it's three days later.
But I've been meaning to mention another aspect of the new job as well. When, previous to working here, I'd thought about what I'd like my next job to be like, it was pretty consistent:
The last point needs a bit of expansion. See, my first job in IT was on the helpdesk of a small ISP. There were three of us on helpdesk, one webmaster, one sysadmin, one database guy, one secretary and one manager; I got some mentoring from the sysadmin (who split his time between us and a sister company), but not lots. My second was at a startup company; the guy who hired me was a good mentor, and a while after he left I got to hire a junior and be a mentor to him. The job I just left was pretty much just me, though I was lucky enough to have other people I could talk to; UBC's a big place, but I was in a small department.
So my next job was going to be bigger (as in a bigger installation — maybe a whole data centre, even) and have more people — because I really, really wanted to hang out with my peers and learn from them. I envied the people I'd met at LISA in 2006 who were part of a team, who had people to teach and people to learn from.
Well, at this job it's...just me. Sort of; the folks I've been working with for the last six months (one lab out of the five that make up the centre) are pretty technical. They know way more about Java and MySQL and web development and how the latest CPUs from Intel compare with AMD than I do. But I'm the sysadmin. There might be another in the future, but there isn't now.
But! But, there are two sysadmins on the floor above me who work in another department. For various reasons, we're going to be working closely for the foreseeable future. On Friday, I went up to talk with them about how that was going to work out.
They knew stuff I didn't know -- no surprise there -- but it turned out I could show them a trick or two as well. We swapped war stories, discussed our very different backgrounds (saved for another entry), and just shot the shit. It was wonderful.
It's weird, because I'm an introvert, and not very socially apt. (Or ept. As in "opposite of inept".) But it's really, really nice to get together with people who like being a sysadmin the way I do.
(This entry brought to you by the number i, the letter Ve, and my youngest son's 90-minute nap.)
I've been holding off mentioning this 'til all my ducks were in a row, but at last it's settled. The job I've been working at part-time for the last six months will be my full-time job starting next Wednesday. w00t!
I've been spending my time at $job_1 making sure the documentation is complete, getting a spare workstation set up and ready to go, and dumping my brain into the sysadmin who will be helping fill in 'til a new person is hired (which might take a while).
I'm really excited about this. First off, I'll get my lunch hours back; I've been walking between the two offices (mornings at one, afternoons at the other, back to the first for the last half hour), and it'll be nice to have an hour to myself again. But the new job is exciting for me: nice big servers used for scientific computation, the chance to build an infrastructure from scratch, and some big projects. The people are friendly. The boss is nice. The place has funding for the next five years or so. It's all good. About the only thing missing is a rocket pack so I can cut down on this 90-minute commute.
And on top of all that, they're open to the idea of sending me to LISA this year. Now that would be nice…have to see if it works with the family, but I'm keeping my fingers crossed.
In other news:
Just now from the window, over the sound of a stupid high-pressure washer, I heard a Canada goose fly by, honking its head off.
Work...hell, life is busy these days.
At work, our (only) tape drive failed a couple of weeks ago; Bacula asked for a new tape, I put it in, and suddenly the "Drive Error" LED started blinking and the drive would not eject the tape. No combination of power cycling, paperclips or pleading would help. Fortunately, $UNIVERSITY_VENDOR had an external HP Ultrium 960 tape drive + 24 tapes in a local warehouse. Hurray for expedited shipping from Richmond!
Not only that, the Ultrium 3 drive can still read/write our Ultrium 2 media. By this I mean that a) I'd forgotten that the LTO standard calls for R/W for the last generation, not R/O, and b) the few tests I've been able to do with reading random old backups and reading/writing random new backups seem to go just fine.
Question for the peanut gallery: Has anyone had an Ultrium tape written by one drive that couldn't be read by another? I've read about tapes not being readable by drives other than the one that wrote it, but haven't heard any accounts first-hand for modern stuff.
Another question for the peanut gallery: I ended up finding instructions from HP that showed how to take apart a tape drive and manually eject a stuck tape. I did it for the old Ultrium 2. (No, it wasn't an HP drive, but they're all made in Hungary...so how many companies can be making these things, really?) The question is, do I trust this thing or not? My instinct is "not as far as I can throw it", but the instructions didn't mention anything one way or the other.
In other news, $NEW_ASSIGNMENT is looking to build a machine room in the basement of a building across the way, and I'm (natch) involved in that. Unfortunately, I've never been involved in one before. Fortunately, I got training on this when I went to LISA in 2006, and there's also Limoncelli, Hogan and Chalup to help out. (That link sends the author a few pennies, BTW; if you haven't bought it yet, get your boss to buy it for you.)
As part of the movement of servers from one data centre across town to new, temporary space here (in advance of this new machine room), another chunk of $UNIVERSITY has volunteered to help out with backups by sucking data over the ether with Tivoli. Nice, neighbourly thing of them to do!
I met with the two sysadmins today and got a tour of their server room. (Not strictly necessary when arranging for backups, but was I gonna turn down the chance to tour a 1500-node cluster? No, I was not.) And oh, it was nice. Proper cable management...I just about cried. :-) Big racks full of blades, batteries, fibre everywhere, and a big-ass robotic Ultrium 2 tape cabinet. (I was surprised that it was 2, and not U3 or U4, but they pointed out that this had all been bought about four or five years ago…and like I've heard about other government-funded efforts, there's millions for capital and little for maintenance or upgrades.)
They told me about assembling most of it from scratch...partly for the experience, partly because they weren't happy with the way the vendor was doing it ("learning as they went along" was how they described it). I urged them to think about presenting at LISA, and was surprised that they hadn't heard of the conference or considered writing up their efforts.
Similarly, I was arranging for MX service for the new place with the university IT department, and the guy I was speaking to mentioned using Postfix. That surprised me, as I'd been under the impression that they used Sendmail, and I said so. He said that they had, but they switched to Postfix a year ago and were quite happy with it: excellent performance as an MTA (I think he said millions of emails per day, which I think is higher than my entire career total :-) and much better Milter performance than Sendmail. I told him he should make a presentation to the university sysadmin group, and he said he'd never considered it.
Oh, and I've completely passed over the A/C leak in my main job's server room…or the buttload of new servers we're gonna be getting at the new job…or adding the Sieve plugin for Dovecot on a CentOS box...or OpenBSD on a Dell R300 (completely fine; the only thing I've got to figure out is how it'll handle the onboard RAID if a drive fails). I've just been busy busy busy: two work places, still a 90-minute commute by transit, and two kids, one of whom is about to wake up right now.
Not that I'm complaining. Things are going great, and they're only getting better.
Last note: I'm seriously considering moving to Steve Kemp's Chronicle engine. Chris Siebenmann's note about the attraction of file-based systems for techies is quite true, as is his note about it being hard to do well. I haven't done it well, and I don't think I've got the time to make it good. Chronicle looks damn nice, even if it does mean opening up comments via the web again…which might mean actually getting comments every now and then. Anyhow, another project for the pile.
How to quiet noisy cron entries that send far too much to STDERR:
exec 3>&1 ; /path/to/script 2>&1 >&3 3>&- | egrep -v 'useless|junk' ; exec 3>&-
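To see why this works, here's a self-contained demo with a stand-in noisy command (the `noisy` function is mine, not a real script): fd 3 saves the real stdout, `2>&1` sends stderr into the pipe, `>&3` reroutes stdout back to the saved copy, so only stderr goes through the egrep filter. "real output" and "a real error" come through; "useless: ignore me" gets dropped.

```shell
# noisy: a stand-in for a cron job that's chatty on stderr
noisy() {
    echo "real output"                # stdout: must pass through untouched
    echo "useless: ignore me" >&2     # stderr: should be filtered out
    echo "a real error" >&2           # stderr: should survive the filter
}

exec 3>&1                             # fd 3 now points at the original stdout
noisy 2>&1 >&3 3>&- | egrep -v 'useless|junk'
exec 3>&-                             # close fd 3 again
```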
I've been very busy of late, but the biggest news is that I've started a 3-month temporary part-time assignment here. It's a neat place, and feels a lot like a software startup. Even though it's a small group, they've got certain hardware requirements that are a lot bigger than what I've worked with before; it'll be interesting, to say the least.
...after a month off, and almost no emergencies in my absence. Sweet!
Now if only I could catch up on sleep. I remember this from the first kid: you never know just how much you can accomplish on so little sleep.
This is one of the few things that would make me consider moving to the US right now.
In preparation for my new job, I've installed OpenSolaris on Pouxie, my wife's old desktop machine (a nice 2GHz Athlon). I've used Belenix, a live CD that includes a driver for Pouxie's onboard NForce ethernet interface.
So far I'm having a lot of fun. It took me three hours (spread over four days...damn this commute) to get a static IP address assigned to the thing, and then to get DNS working. But after a reinstall (a newer version of Belenix had come out that included the Sun packaging tools, which should let me use Blastwave to grab Emacs...a good first project, I think), I had it up and running in just a few minutes. Progress!
For those playing the home game, here's what I had to do:
modinfo | grep nfo: yep, the module has been loaded.
ifconfig -a | grep nfo0: Not there.
dladm show-link: But it is here.
echo "192.168.23.40 pouxie-2" >> /etc/inet/hosts
echo "pouxie-2" > /etc/hostname.nfo0 ; echo "netmask 255.255.255.0" >> /etc/hostname.nfo0
echo "192.168.23.254" > /etc/defaultrouter
reboot -- -r: to get Solaris to find the new interface (?)
ifconfig -a: Now it shows up configured.
svcadm disable svc:/network/inetmenu: Otherwise, it interferes with the change to nsswitch.conf I'm about to make.
svcadm enable svc:/network/dns/client: I long to know what this actually turns on.
cp /etc/nsswitch.dns /etc/nsswitch.conf
echo "nameserver 192.168.23.254" >> /etc/resolv.conf
ping www.saintaardvarkthecarpeted.com: It's alive!
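For my own future reference, the steps above collapse into one script. This is just a recap of the commands listed, with the reconfiguration reboot moved to the end so the script survives it; the addresses, the hostname and the nfo0 interface are of course specific to Pouxie, and it's Solaris-only.

```shell
# One-shot recap of the Belenix/OpenSolaris network setup above.
# Addresses, hostname and nfo0 are specific to this box.
echo "192.168.23.40 pouxie-2" >> /etc/inet/hosts
echo "pouxie-2" > /etc/hostname.nfo0
echo "netmask 255.255.255.0" >> /etc/hostname.nfo0
echo "192.168.23.254" > /etc/defaultrouter
svcadm disable svc:/network/inetmenu       # would clobber nsswitch.conf
svcadm enable svc:/network/dns/client
cp /etc/nsswitch.dns /etc/nsswitch.conf
echo "nameserver 192.168.23.254" >> /etc/resolv.conf
reboot -- -r                               # reconfiguration boot to find nfo0
```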
Happy birthday, OpenSolaris!
Maybe someone else can use this: