(aka "random words for titles, please")
$WORK: I've been working on OpenStack lately. It's been fun, despite its frustrations (which I won't list here because I tend to rant a lot, and I'm becoming less convinced it's helpful or as funny as I think it is) (which drolly deserves its own bit of expansion...). Why fun? Because a) I've had the luxury of focusing on this for, like, a month now, pretty much to the exclusion of all else, and b) because I'm not on my own. One of my coworkers is doing this with me and he is really, really good. He's careful, his shell scripts make me cry with their beauty, and he's just a lot of fun. And it's amazing what a difference a great bunch of coworkers makes (he's just one of a great gang of people).
This has been a real revelation, particularly after visiting my workplace to catch up with people. It was good to see everyone again, but it really reminded me how much all of us needed a change -- me to get the hell out of there, and them to get someone in who is enthusiastic about things again.
It has been quiet over the holidays, which is good; I was on call for NYE and hardly had anything happen at all. And the office was quiet with people being out for so much of it. But I'm looking forward to people being back in, conversations going on, HipChat having more traffic than just me asking Hubot to animate me a Christmas tree.
Holidays: not nearly as long as when I was at UBC, but it was still good. The kids had lots of school vacation, of course, and then there was COUNTDOWN TO XMAS OMGOMGOMG. They nearly lost their little minds with anticipation, but finally Xmas was here and...they loved what they got. Which was a relief; they had been pining for Xboxes and tablets and iPhone 99s and I don't know what all. None of that was going to be happening for so many reasons (money, they're 6 and 8, I have moral qualms about non-free computing) (which is ironic because I have an Apple laptop now for work); they knew that, but I wasn't sure how they'd actually handle it on the day. And it was a non-event: they got sketchbooks and books and toys and were happy as clams.
My wife and I continued our Xmas tradition of Watching Bad Movies on Xmas Eve with "The Christmas Cottage: The Thomas Kinkade Story" ($5 in the grocery store bargain bin!). It's no Asylum joint (and what a concept that would be...), but it was still wonderful (by which I mean really odd: Peter O'Toole as Kinkade's mentor, and Chris Elliott as chair of the town's Chamber of Commerce). We didn't get a chance to go out on our own, but we'll fix that one night.
I've been spending time on the Stack Exchange Emacs beta; it's really shaping up. It's been fun to answer some of the questions, and really dig into Emacs; the digging (because it's rarely as straightforward as I think it's going to be) turns up a lot of stuff I never knew.
I sent off my first letters for Amnesty International last month; it's been something I've wanted to do for a while now, and I finally got off my ass and joined their Urgent Appeal network. It's easier to sign up than I thought it would be; I urge you to consider joining yourself.
Reading:
So one thing that's been hanging around in my mailbox for (checks mailbox) good God, three weeks now is an exchange I had with Jess Males (aka Hefeweizen on IRC, which is a damn good name). He wrote to me about my Infrastructure Code Testing BoF, and asked:
I don't see notes to this effect, so I'll ask: what's the difference between monitoring and test-driven infrastructure? Monitoring started as a way to tell us that the infrastructure we need is available and operating as expected. Test-driven infrastructure serves the role of verifying that the environment we're describing (in code) is being implemented, and thus, operating as expected. Those sound awfully similar to me. Is there a nuance that flies over my head?
Before I insert my uninformed opinions, I'll point you to an excellent article from this year's SysAdvent by Yvonne Lam called How To Talk About Monitors, Tests, and Diagnostics. But if you're still interested, here goes...
First, often (though not always) it's a pain in the ass to point already-existing monitoring at possibly ephemeral VMs, dev machines and the like. Just think of the pain involved in adding a new machine + a bunch of services AND remembering to disable alerts while you do so. Not to say it can't be done, just that it's a source of friction, which means it's less likely to be done.
Second, there are often times when we're building something new, and we don't have already-existing monitoring to point at it. Case in point: I recently set up RabbitMQ at work; this was new to us, and I was completely unfamiliar with it. The tests I added can go on to form the basis of new monitoring, but they emerged from my desire to get familiar with it.
Third, these tests were also about getting familiar with RabbitMQ (and Puppet, which is new to me), and doubtless there are some things in there that will not be needed for monitoring. These are valuable to have in testing, but don't always need to be kept around.
I fully stipulate that monitoring, as often implemented, falls woefully short of our ideal. More often than not, monitoring is a ping check or a port check. Our test-driven environment should check for page load times or members behind a load-balancer, or &c. If what we really want is better, more accurate environment measuring, then know there's a refreshing reimagination of monitoring with #monitoringlove. If they're already marching in our direction, let's join ranks.
True story. I've shot myself in the foot more times than I care to remember by, for example, testing that port 80's open without checking the content coming back.
Now that I've said this, I think I start to answer my own question of TDI (test-driven infrastructure) vs monitoring. I begin to see these points: write the tests first (duh, devs have been saying this for years), and better test (monitoring artifacts) generation (ideally, automatic).
Test first: yes, particularly when starting with a new technology (see above re: RabbitMQ). Also, in theory you can rip stuff out and try something else in its place (think nginx vs Apache); if the tests still pass, you're golden. Still missing: better test generation. I'd love something that ate serverspec tests and spat out Nagios configs; even as a first draft of the tests, it'd be valuable.
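To make that port-80 example concrete: the difference is between a bare TCP check and a check that insists on content. Here's a minimal sketch using the stock Nagios plugins (the hostname, URL, plugin path and expected string are all made up) -- and the second one is exactly the kind of check a serverspec-to-Nagios translator could spit out:
# Weak: only proves something is answering on port 80.
/usr/lib/nagios/plugins/check_tcp -H www.example.com -p 80
# Better: fetch a page and insist on a string that should be in the response body.
/usr/lib/nagios/plugins/check_http -H www.example.com -u /status -s "OK: all services up"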
I was recently asked what it was like to join my latest job:
@saintaardvark What did you first day look like? How did you get pulled into the culture? How were you brought up to speed on the tech?
— swettk (@swettk) November 25, 2014
This is a good question, and one that with the whirlwind of switching jobs I didn't write much about. Now that I've had a few months, it's worth revisiting.
The interview process for OpenDNS was long; there was a lot of talking to people, a lot of technical interviews, a practical test, and finally a visit to the office to talk with both the Sr. Director of Infrastructure Engineering (two levels above) and the CTO (three). It was during that visit that I was finally told I'd be getting a job offer. After that I gave notice to UBC and started preparing for things.
First up was a trip to San Francisco; everyone goes there for a week, even if you work in the Vancouver office (or in Belgium, where my latest coworker is from). I made sure my passport was okay, bought a new packsack, and flew down. I got a bit turned around on the BART, but made it into town at last and found the office. Corporate apartment was close by, and I pretty much just rested that night.
Next morning was first day at the new place. There were seven of us all together in the lobby, waiting to be collected, including one person on my team (who works at the SF office) and one person who works at YVR (but on another team). We were kept together for most of the rest of the week, and this was a really good way of getting to know each other. I also shared accommodation with the YVR person while I was there -- we were the only two folks from YVR that week -- and that helped too.
The week was spent in meetings...lots and lots of meetings. This was about 90% good; we were given orientation on (I think) every single part of the company. Sales, marketing, recruiting, finance, IT, HR, security, engineering -- we had an hour or so with each one to hear what they did, ask questions, and learn how they contributed. There were also orientations on policy, financial stuff (including an hour with the CFO), travel, our products themselves, and a bunch of other things. (I'd just like to point out that it's weird to not be the IT person anymore, and that the IT people there are really, really well-organized, thoughtful and patient. There's a lot I'd have done differently in my previous jobs if I'd been able to work with them.)
Coming from places that have been smaller and less organized, this was really, really nice. Sure, it was a lot for one week; by the end of it our eyes were glazing over, and we had no idea what was happening next without checking our schedules. But it was pretty well organized, and did a good job of avoiding being dull or repetitious. And on top of that, it gave us a chance to meet more people; particularly for those of us in YVR, it's really good to have a sense of who people are outside of HipChat.
As for the team I joined: something had to give, and this was (IMHO) where it did. I had some time with my team, and they did their best to shove knowledge into our heads -- particularly mine, as I was going back up to Vancouver at the end of the week. So there were chunks of time set aside where the senior admins stood at the whiteboard and drew diagrams while we scribbled furiously. But coming on top of everything else I was learning that week, plus new job, plus new team, plus away from home, I know a lot of it leaked straight out of my head again. (Having had the chance to repeat some of the knowledge drops since then, I think it's also a question of having context for what you're hearing. When you're able to relate this new thing with that chunk you already know, it seems to stick better than when it's twelve new things all at once with no anchor.)
It's a tough problem, to be sure; there was a lot to learn about the infrastructure, and like I keep telling people it's at a scale I haven't worked at before -- so it's hard not to be overwhelmed by everything. I think it might have been better if I'd spent a second week down at SFO, just working with the team and listening; OTOH, the passage of time seems to be necessary (or at least helpful).
Thus for technical knowledge and company knowledge. What about culture, and what about fitting in with my team?
I'll start with this in here: the HR/PeopleOps at OpenDNS are doing the work of the angels. That company has hired about 100 people in the last two years, and they've managed to keep things not only organized and working well, but working fun. That's not the best way of describing it, but it's as good as I can come up with right now. Their work, their problems, are as hard, as knotty, as deeply challenging as anything I ever hope to tackle as a techie. I'm very, very grateful that there are people who find this work rewarding; I enjoy the fruits of their labours, and I would not be a tenth as good at their job as they are.
So: Culture was as much the focus of that first week as anything else. Company values, how to use HipChat (not "where do I type?" but "most people use it as the default means of communication"), when we had lunch, paperwork ("Don't use paper. Just don't")...all of it was explained to us over that week. And people! People were good; the people are really, really nice at OpenDNS. Sitting at the lunch table, someone'd strike up a conversation with you (which is nice when you're shy/introverted): when did you start? what do you do? And then you'd learn something, like what the renewals team did (ensure that service subscriptions are renewed, instead of falling by the wayside, and hopefully upgraded along the way).
And ditto for the team. Time with the rest of my coworkers was scattered and slim, but we tried as much as possible to get to know each other; lunch, conversations about background, all that sort of thing. Mostly, though, that would wait for the next visit to SFO.
After that week, it was back to Vancouver. I'm the only one on my team in Vancouver, but there are other teams we work with quite closely -- so this was my chance to meet them. It was good, and a lot less formal and harried, more slow and organic.
I was also trying to be useful to my team; admirable ambition, but this took a long time, I think. It took a lot of small (and not-so-small) tasks, good for getting to know things. Some were well-documented processes, like building yourself a development VM, and were good for illustrating things (and showed why we wanted to automate it). Some were expeditions: changing Puppet code in a large, intricate codebase, when you're a Puppet n00b who really wants to understand how everything's put together, can take a lot of grepping.
And all this was happening when routines were imposed/kicking in, mostly new to me: standups; weekly planning (what?); round-tables between departments; multiple redundant ticketing systems (don't ask); Google hangouts where I was the only remote person; HipChat conversations with people across the room from me.
A couple of weeks later, two of my team came up to work for the week in Vancouver: my manager and the person who'd started the same week I did. (This sort of thing happens all the time at OpenDNS; it's rare that we'll have a week without someone coming up or heading down, even if there aren't coworkers at the other end.) Having these folks up was really, really good. I solidified my friendship with both of them, and the bandwidth in person is just so much higher....everything from "Hey, I found out this thing you should know!" to "OMG, this is such a pain to deal with" became so much easier. And having my manager (-to-be; she was promoted shortly after this trip) around was incredibly helpful: I got a good sense of what she wanted from me, and what our team's challenges were.
Getting up to speed on the tech took time, and I'm only just now beginning to feel like I have a good understanding. Partly that's because it's complicated; partly that's because it's not terribly well-documented (though that's a real priority for us, and it's changing fast). But also, like I said, things make a lot more sense when you have a context to fit them into -- and gaining that context is something that happens over time.
So someone tells you something, and you nod and smile and write it down but it doesn't make sense. But you can't ask about it right then, because there are 'bout a hundred things you've tripped over in the last week that you don't understand, and now it's a hundred and one, and in the meantime you've got things to do. And you forget about it until you need to understand it again two weeks later, and you ask about it again with a sheepish grin, and now you've got the slot for it. This sort of thing has been happening all the time since I started, and it has taken me a while to make my peace with it...to realize that yes, it's complicated; yes, it's taking a while; and no, it doesn't mean I'm stupid. I'm levelling up over time; I hold on to that.
Last thing I'll mention is my second trip down to SFO, which happened in October. We had a new coworker starting; he's from Belgium, so he was spending three weeks in SFO and one week in YVR before heading back home. We combined his first week with a week of team stuff: offsites, planning, and so on.
One thing which was pretty amazing: we all gave 10 minute presentations about ourselves...who we were, what we'd done, how we'd made it to the place we were at now. It sounds intimidating, but it wasn't. I learned a lot about my coworkers in a hurry, and it was a wonderful introduction to each other -- what makes us tick, what we're worried about, and the weird career paths we'd taken. (One person used to be an airline pilot; another interned for JPL; another is a trained EMT.)
There was also the mandatory what-are-we-gonna-do-now meeting, where we looked at what we'd promised other people (promises made before half of us had joined the team), figured out how we were gonna deliver, and so on. Discussions were sometimes heated, but that was good too; it didn't get out of hand, and we got a good sense for what everyone worries about. (Side note: the place where we had this meeting provided snacks and lunch, and they brought in the best scones I have ever had: crumbly, not too sweet, and just amazing. I neglected to ask where they came from. I am going to dedicate the rest of my life to duplicating that recipe.) And there was also the mandatory dinner and drinks night -- and again, this was great. The food was wonderful but the company was amazing. Getting to know everyone in a casual setting is just really, really wonderful.
So to answer the questions:
What did my first day look like? Same as the first week: a lot of meetings, mainly very well organized, that told me most of what I needed to know about the company.
How did I get pulled into the culture? By joining people down at SFO, the way everyone else in the company did; and by joining a remote office that still shares the company values.
How was I brought up to speed on tech? Lectures, small tasks, frustration and moments of revelation as things fell into place; rinse & repeat.
This year I'm blogging for the USENIX blog, so we'll see how much I actually put up here...but the thought of going w/o updating my own just makes me sad, so here we go.
Took the bus down, which was completely uneventful and pleasant. Walked from King St station to the conference hotel, which was a bit of a hike but welcome exercise. I'm on the 25th floor and have a pretty skookum view of local neon and such. Got supper and some groceries, then went out for drinks w/Matt and his wife Amy, Pat Cable (who I'm meeting in person now for the first time), Bob and Alf, and Ken Schumacher. Good times, with lots of good teasing of Matt as well. Missing Ben Cotton, which is a shame; the two of us could pretty much get Matt to cry if we tried hard enough.
First tutorial this AM was "Stats for Ops", and it was amazing. Discovered that using a spreadsheet is a really good skill to have. I have to learn that at some point...
And now off for next tutorial.
So my latest blog post for LISA just got posted -- and that's the last long(ish) one; next week BeerOps, Mark Lamourine and I will be posting daily updates as we're there. Also, I've volunteered to help Julie Miller, the Marketing Communications Manager for USENIX, with the opening orientation on Saturday night. I seem to remember taking that the first year I went, though I don't seem to have written it down...
By the way, shouts out to BeerOps, Mark Lamourine, Matt Simmons and Noah Meyerhans for all the help during LISA Bloggity Sprint 2014. There are beers/chocolate/what-you're-owed in plenitude.
On another note: I'm auditioning a Chromebook, an Acer C720, to see how it works out. Right now I'm using Debian Jessie (testing) via Crouton, which lets you install Linux to a chroot within Chrome OS. So far: the keyboard is smaller than I'm used to, and the Canadian keyboard in particular is annoying -- they've crammed in tons of extra keys and split the Enter and Shift keys to do so. But overall it's okay; I can run tests for Yogurty in 3 seconds (cf. 12 on my old P3 laptop/server), and even Stellarium seems to run just fine. I've got a refurbished 4GB model on order w/Walmart in the States, and I can pick that up while I'm at LISA. So, you know, looking good.
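(For the record, getting Crouton going is only a couple of commands; this is roughly what I ran, though the release and target flags are from memory, so check the crouton README before trusting them:)
# From crosh (Ctrl-Alt-T) on a Chromebook in developer mode:
shell
sudo sh ~/Downloads/crouton -r jessie -t xfce   # build a Debian jessie chroot with XFCE
sudo enter-chroot                               # get a shell inside the chroot
sudo startxfce4                                 # or start a full XFCE session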
Bridget Kromhout's latest post, The First Rule of DevOps Club, is awesome. Quote:
But when the open space opening the next day had an anecdote featuring "ops guys", I'd had enough. I went up, took the mic, and told the audience of several hundred people (of whom perhaps 98% were guys) how erased I feel when I hear that.
I said what I always think (and sometimes say) when this comes up. If you are a guy, and you like to date women, would you place a personal ad that says this? "I'd like to meet a wonderful guy to fall in love and spend my life with. This guy must like long walks on the beach and holding hands, and must also be female." If that sounds ludicrous to you, then you don't actually think "guy" is gender-neutral.
That's a small part of a much longer post; go read the rest.
Much at $WORK; I've got a new teammate from Belgium who's awesome, I'm starting to find a sense of rhythm, and organizing time is as challenging as ever. There are lots, LOTS of fun things to do, and it's damn hard sometimes to say "I'm just gonna put that on the TODO list and walk away."
This week my youngest son has switched from "The Wizard of Oz" to "Treasure Island" for story time. He got bored of TWOO and we didn't finish it; I'm curious to see how long he'll stick with TI. Still so much fun to read to them both.
I've tripped over this error a few times; time to write it down.
A few times now, I've run serverspec-init, added a couple of tests, then had the first rake run fail like so:
Circular dependency detected: TOP => default => spec => spec:all =>
spec:default => spec:all
Tasks: TOP => default => spec => spec:all => spec:default
(See full trace by running task with --trace)
Turns out that this is a known problem in Serverspec, but it's not exactly a bug. The problem appears to be that some part of the Vagrantfile I'm using is named "default". The original reporter said it was the hostname, but I'm not sure I have that in mine. In any case, this causes problems with the Rakefile: the target is default, but that also matches the hostname, and so it's circular and Baby Jesus cries.
(Side rant: I really wish the Serverspec project would use a proper bug tracker, rather than just having everything in pull requests. Grrr.)
One way around this is to change the Rakefile itself. Open it up and look for this part:
namespace :spec do
  targets = []
  Dir.glob('./spec/*').each do |dir|
    next unless File.directory?(dir)
    targets << File.basename(dir)
  end

  task :all => targets
  task :default => :all
Comment out that last line (task :default => :all):
namespace :spec do
  targets = []
  Dir.glob('./spec/*').each do |dir|
    next unless File.directory?(dir)
    targets << File.basename(dir)
  end

  task :all => targets
  # task :default => :all
Problem solved (though probably in a fairly hacky way...)
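Assuming the stock serverspec-init Rakefile, the per-target tasks are still defined further down, so running the tests explicitly looks like this (web01 is a made-up target; use whatever subdirectory you've got under spec/):
$ rake -T          # list the generated spec:<target> tasks
$ rake spec:all    # run every target
$ rake spec:web01  # or just the one you care about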
Busy, yo:
This was my first week on call at $WORK, and naturally a few things came up -- nothing really huge, but enough that the rhythm I'd been slowly developing (and coming to relish) was pretty much lost. And then Friday night/Saturday morning I was paged three times (11pm, 1am and 5.30am) -- mostly minor things, but enough that I was pretty much a wreck yesterday. I'm coming to dread the sad trombone.
Besides that, I've also been blogging about the LISA14 conference for USENIX, along with Katherine Daniels (@beerops) and Mark Lamourine (@markllama). They've got some excellent articles up; Mark wrote about LISA workshops, and Katherine described why she's going to LISA. Awesome stuff and worth your time.
I managed to brew last week for the first time since thrice-blessed February; it's a saison (yeast, wheat malt, acidulated malt) with a crapton of homegrown hops (roughly a pound). I'm looking forward to this one.
Going to San Francisco again week after next for $WORK. (Prospective busyness.)
Kids are back to school! Youngest is in grade 1 and oldest in grade 3. Wow.
Today I complete my (scurries to check calendar) third week of work at OpenDNS. There is a lot to take in.
I flew in to San Francisco without problems; like previous times, there was no having to opt out of scanners at YVR (the airport, not the office). I got moderately tangled up in the BART because I'd become convinced it was Saturday instead of Monday, but once I figured that out I got to the office without problems. The corporate apartment is right around the corner, which is definitely handy, and close to the 21st Amendment Brewpub which was even more handy. Went there for supper:
The #goldenpony made it to the @21stAmendment brewpub. But if it tries to drink my b33r there will be trouble. pic.twitter.com/2WowiC1c62
— Saint Aardvark (@saintaardvark) July 15, 2014
I stayed for a while, listening to the startups happening around me (not even kidding), then went back to the apartment and slept fitfully.
Next day was first day. There were 7 of us starting that week, including one other YVRite. The onboarding (and now I'm using that word) was very, very well organized: we've had talks from HR, from the CFO, from the VP of Sales and from actual sales people, from Security, from the IT guy (and it is strange not to be the IT guy) and from the...oh god, I'd have to look it up. There was a lot, but it was interesting to get such a broad overview of the company.
The second meeting of the day, right after HR got us to fill out the necessary forms, was to get our laptops. Mine is a 15" MacBook Pro AirPony CloudTastic or some such; 16 GB of RAM, Retina display, SSD. What it works out to be is wicked fast and pretty. It has not been nearly as hard to get used to it as I thought it would be -- not just because I'm adaptable and like computers, but because even though it's not Linux it's not getting in my way any (which in turn is because, much more than anywhere I've worked previously, so very very much of what we do is done through a browser, using apps/services which have been designed within the last five years).
Natch this makes me question my ideological purity. But I can also see, really see the point of having things be easy, particularly at scale. Which is kind of a ridiculous thing to say for a sysadmin, whose job is supposedly making things easy for other people. But there you go. I love Linux, but there's no question that (modulo the fact that I'm seeing the end product, not the work that went into it) making everything that seamless would probably be a lot more work.
Speaking of which, I'd just like to give shoutouts to the IT people at OpenDNS. They are incredibly well organized, efficient, friendly and helpful. I need to take notes. Oh, and: it is strange not being the IT person -- at one point my laptop was misbehaving, and I had to/got to ask someone else for help fixing it. Wah.
Oh, and: the HR department is well-organized too. Everyone shows up to their new desk which is clearly marked with a) balloons and b) swag:
Onboarding the #goldenpony. pic.twitter.com/hbdywOedKk
— Saint Aardvark (@saintaardvark) July 15, 2014
In the midst of this week full of meetings, I got to meet my coworkers. Some I'd interviewed with, some were new to me (like Keith, who has a degree in accounting: "I learned two things: don't screw with the IRS, and I hate accounting!"). They are all friendly and smart. There were knowledge drops and trips to the lunch wagons and finding different meeting rooms (".cn is booked." "What about .gr?") and whiteboarding and I don't know what-all. Oh, and one of the new people starting that week is Levi, another systems engineer, who came over after 7 years at Facebook. Wonderful guy; I was intimidated, but it turned out I knew a few things he didn't (and of course vice-versa), so that restored my confidence.
Things are organized. There are agile practices and kanban boards and managers who actually help -- not that they wouldn't, I guess, but I'm so used to either being on my own or wishing my manager would just go away. This is nice. There are coworkers (have I mentioned them?) who help -- it's not just me anymore. This means not only that I don't have to do everything, but that I can't just go rabbiting off in all directions when something cool comes up.
Oh, and: there are these wonderful sit/stand desks from GeekDesk.com -- they're MOTORIZED! They're all over the SFO office, and will soon be coming to the YVR office. They're wonderful; if I ever work from home on a regular basis, I will really really want one.
There wasn't a lot of time for wandering around -- mostly, by the end of the day I was pretty exhausted -- but Thursday night I walked across town, from King Street BART station to Pier 39. It was ~ 9km all told, and it was a wonderful walk. I ended up going past City Lights Bookstore and Washington Square park; back in 1999, my wife and I spent an afternoon in that park, where a homeless guy insisted that I remove my sunglasses so he could see if I was an alien (I wasn't). It was cool to see it again. The touristy stuff was great in its schlocky, touristy way, and I hunted around for sportsball tshirts for my kids.
Friday we had the weekly OpenDNS all-hands meeting, where (among other things) new hires tell three fun facts about themselves. Mine were:
I counted moose from a helicopter when I participated in a moose population survey. And when I say "participated" I mean "was ballast". I worked one summer for the Ministry of Natural Resources in Ontario. A helicopter was flying out with one pilot and two biologists, so it was unbalanced. I came along so the helicopter could stay level. Saved a lot of lives that day.
I'm an early investor in David Ulevitch, OpenDNS' CEO. Back when he was running EveryDNS, which provided free DNS service for domains, I sent in $35 as a donation. When Dyn.com bought EveryDNS, they grandfathered in all the people who'd donated, and I've now got free DNS for my domains for life. Woot!
And of course, the story of the golden pony.
Friday afternoon I flew back; opted out of the scanner (and forgot to tell my coworker flying back with me that I'd be half an hour getting through security; apologized later), had supper and a beer at the airport, and just generally had an uneventful flight home. The beers I brought home for my wife made it through everything intact, there were stickers for the kids, and everyone was happy to see me (aw!).
For a while now, I've been wanting to work in a different environment. UBC is a lovely place to work, and the people at CHiBi are wonderful...but I've been there more than five years now, and I was getting itchy feet.
Last year I wrote down what exactly I wanted out of a new job:
Larger scale: I took my current job because it was a chance to work with so much that I hadn't before: dozens of servers, an actual server room, HPC, and so on. I want that same feeling of "I've never done that before!" (See also: "Holy crap, what have I got myself into?")
Linux/Unix focused: It's no secret that Linux makes the sun shine and the grass grow, and BSDs make the planets go in their orbits. Why would I ever want anything else?
Actual coworkers: For most of my time as a sysadmin, I've worked on my own. I had a junior for a while (Hi Paul!) and that was wonderful, but other than that I've been alone. I really, really wanted to change that. Andy Seely, a damn good friend of mine, likes to say "If I find myself the smartest person in the room, I know I need to find a new room." That was exactly how I was feeling.
Friendly: I work in a friendly, open place, and I've no desire to give that up.
I kept my eye out. And back in April I saw that OpenDNS was hiring. So I sent in a resume. They got back to me. There were lots of interviews (I think I talked with five different people), a coding test (two, actually, and they made me sweat) and a technical test. And then, finally, I was sitting in their offices in Gastown, talking to the guy who'd just offered me a job.
Larger scale: check; they've just opened their nth and (n+1)th data centres in Vancouver and Toronto. Linux/Unix focused: yep; Linux and FreeBSD rule the coop. Actual coworkers: they're on it; there are two other people I'll be working with (and they've been running all, or at least a lot, of the infrastructure for the last few years). Friendly: four for four, because everyone there has been really, really...well, friendly.
So: I start July 15th as a Systems Engineer with the good folks at OpenDNS. I'm excited and a little freaked out to be working with all these good, smart people.
In the meantime: if you want a job as a Linux sysadmin, working with the excellent people at the Centre for High-Throughput Biology who do a science EVERY DAY, you can apply here. Closing date is Friday, June 20th, so hurry. Apply early and apply often!
Today I replaced a battery on a StorageTek 2530 controller. It's one of two, and the Sun Service advisor mentioned nothing about panics or reboots...but the CentOS 5 machine (using the Sun RDAC drivers) attached to it rebooted anyhow. Don't worry, it's only in production....Fortunately it's one of two database servers, and the other one took up the load just fine. I'm not sure if I bumped a power cable (always possible), or if the machine panicked, but I was unable to find anything in the logs so I really don't know what to think.
I'm going to be replacing the battery on another one shortly, so I'll get to see what happens then. At least I'll know to schedule downtime...
So the other day I was asked to help get a bioinformatics tool working. Tarball was up on Sourceforge, so it shouldn't be a problem, right? Right. Download, skim the instructions, run "make" and we're done. Case closed!
Only I had to look. Which was a mistake. Because inside the tarball was another tarball. It was GNU coreutils, version 8.22. Which was dutifully compiled and built as part of the toolchain. It was committed about 18 months ago because:
this will create a new sort that is used by chrysalis to run sort in parallel speedup on hour system running a 13g dataset was from 46min to 6min runtime
That is a significant speedup. Yes. And sure, it's newer than the version in the last Ubuntu LTS (8.13), and 'way newer than the version in CentOS 5 (5.97). But that is a tarball, even if it is only 8 MB, in the subversion repo for a project that was published in Nature Protocols. Why in hell wasn't it written up as a dependency in the README? So yeah, I got angry: "I think I'm gonna submit a patch with an Ubuntu ISO in it, see if they accept it."
I'm struggling with what to write here. This is bad practice, yes, but what constructive, helpful alternative do I have to offer? The scientists I work with are brilliant, smart people who do amazing research, but their knowledge of proper (add scare quotes if you like) development practice is sorely lacking. It's not their fault, and folks like Software Carpentry are doing the angel's work to get them up to speed. But riddle me this: if you're trying to get a tool into the hands of a pretty new Linux user -- one who's going to base the next 18 months of their work on how well your tool works -- how do you handle this sort of thing?
Mark it in the README? That's great if they've got a sysadmin, and Lord knows they should...but there are many that don't, or it's the grad student in the corner doing the work and they're more focussed on their thesis. (That's not a criticism.)
Throw an error? Maybe, or maybe a warning if it's less than version umptysquat. That gets into all sorts of fun version parsing problems.
Distribute a VM? Maybe -- but read C. Titus Brown's comments on this. Plus, if we wince at the idea of telling a newbie "Just go get it installed", imagine our faces when we tell them "just go get the VM and run it." Ditto Docker, Vagrant or whatever new hotness we cool kids are using these days.
Ports tree? Now we're getting somewhere. All we need to do is have a portable, customizable, easily-extended ports tree that works for lots of different Linux distros and maybe Unices. Hear that sound? That's the NetBSD ports tree committer berzerkers coming for your brains. Because that work is HARD, and they are damned unappreciated.
We have no good alternative to offer. I can be snotty all I want (confession: IT'S SO MUCH FUN) but the truth is this is a hard problem, and people who just want to get shit done are doing it the best they can because they just want to get shit done. We have -- are -- failing them. And I don't know what to do.
This is an old, old bug that I just tripped over for the second time. Hopefully this'll save someone else...
In September 2010, I had a problem with Ocsinventory, the inventory software we use to track our hardware: I kept getting 500 errors when running the OCS client on a machine. I filed a bug, but I wanted to show how I tracked it down.
First off, Apache logs for Ocs can be found at /var/log/httpd/access_log and /var/log/httpd/error_log. Ocs itself logs at /var/log/ocsinventory-server (-client too, but that's not as interesting). However, by default Ocs doesn't log very much -- so let's change that. Logging can be twiddled by editing the Ocs/Apache config file at /etc/httpd/conf.d/ocsinventory-server.conf. Pay attention to this setting: PerlSetEnv OCS_OPT_DBI_PRINT_ERROR. It's set to 0 by default, so set it to 1 to turn it on. Also, note that you have to fully restart Apache in order to make changes to this file take effect.
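Concretely, the relevant bit of the config ends up looking like this (the path is the CentOS default; the restart command is the usual one for that era, adjust to taste):
# /etc/httpd/conf.d/ocsinventory-server.conf
PerlSetEnv OCS_OPT_DBI_PRINT_ERROR 1

# A graceful reload isn't enough for PerlSetEnv changes; do a full restart:
sudo service httpd restart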
After that, I see this error in /var/log/httpd/error_log:
DBD::mysql::db do failed: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '|CHECKSUM|1),
NAME='server23',
WORKGROUP='example.com',
USERDOMAIN=NU' at line 4 at /usr/lib/perl5/vendor_perl/5.8.8/Apache/Ocsinventory/Server/Inventory/Update/Hardware.pm line 35.
So the fix? Edit Hardware.pm and look for these lines at the top:
package Apache::Ocsinventory::Server::Inventory::Update::Hardware;
use strict;
require Exporter;
our @ISA = qw /Exporter/;
our @EXPORT = qw / _hardware /;
use Apache::Ocsinventory::Server::Constants;
use Apache::Ocsinventory::Server::System qw / :server /;
Add this line right afterward:
use constant CHECKSUM_MAX_VALUE => 262143;
and restart Apache. After that, I see this in httpd/error_log:
Constant subroutine Apache::Ocsinventory::Server::Inventory::Update::Hardware::CHECKSUM_MAX_VALUE redefined at /usr/lib/perl5/5.8.8/constant.pm line 103.
However, it doesn't appear to affect things, and I can now run the client on the machine. Bletcherous hack, but it gets the job done.
So, Heartbleed. Straight-up dodged a bullet on this one at $WORK: we use CentOS 5 for nearly everything, and it does not come with a vulnerable version of OpenSSL -- it's stuck at 0.9.8something. As for home servers, I'm using Debian 7; IMAP was affected, and so was the HTTPS I run on my own site. I need to change the certs for those, but it's a low priority. I've been reading lots of assurances from my banks that they weren't affected, so there's that. I haven't dug into my wireless router yet, but the news cannot possibly be good.
The reading about this has been really, really interesting. First, hot off the presses, XKCD has a truly awesome explanation of the bug:
I am in awe of someone who can explain things this clearly.
Next, there's this from @Indy_Griffiths on Twitter
But enough with the funny. My new favourite blogger, Patrick McKenzie, writes about "What Heartbleed Can Teach The OSS Community About Marketing". You really need to read the whole thing, but here are just a few choice bits:
There exists a huge cultural undercurrent in the OSS community which suggests that marketing is something that vaguely disreputable Other People do which is opposed to all that is Good And Right With The World, like say open source software. Marketing is just a tool, and it can be used in the cause of truth and justice, too.
As technologists, the Heartbleed vulnerability posed an instant coordination problem. We literally had to convince hundreds of thousands of people to take action immediately. The consequences for not taking action immediately were going to be disastrous. [...]
Given the importance of this, we owe the world as responsible professionals to not just produce the engineering artifacts which will correct the problem, but to advocate for their immediate adoption successfully. If we get an A for Good Effort but do not actually achieve adoption because we stick to our usual "Put up an obtuse notice on a server in the middle of nowhere" game plan, the adversaries win. [...]
This makes marketing an engineering discipline. We have to get good at it, or we will fail ourselves, our stakeholders, our community, and the wider world.
"This makes marketing an engineering discipline." That stopped the coffee cup halfway to my mouth, I tell you what.
Then, awaking from a yearlong hibernation, Dan Kaminsky wrote about the failure of, like, everything that led to Heartbleed. Quote:
The larger takeaway actually isn't "This wouldn't have happened if we didn't add Ping", the takeaway is "We can't even add Ping, how the heck are we going to fix everything else?".
The Wall Street Journal wrote two days ago:
Matthew Green, an encryption expert at Johns Hopkins University, said OpenSSL Project is relatively neglected, given how critical of a role it plays in the Internet. Last year, the foundation took in less than $1 million from donations and consulting contracts.
Donations have picked up since Monday, Mr. Marquess said. This week, it had raised $841.70 as of Wednesday afternoon.
I'm gonna give this a couple weeks to calm down, then I'm sending them a hundred dollars. It's not much, and Lord knows it's way short of the sustainable funding they should really have, but it's something.
(Incidentally, if you aren't following Runa A. Sandvik, Colin Percival, Matthew Green, and Matt Blaze on Twitter, you're missing out on some really interesting conversations by people who know what they're talking about.)
And now it's time to post.
This info comes from the Codeweavers page for MS Office 2010, but it didn't come up when I searched for it...
To activate MS Office 2010, running in Crossover Linux, against a KMS:
Start the "Manage Bottle" dialog from your CX installation, select your Office 2010 bottle, click on Run Command and type "regedit".
Navigate to HKEY_LOCAL_MACHINE\Software\Microsoft\OfficeSoftwareProtectionPlatform
Create the following 2 keys via Edit -> New -> Key: String Value:
Keep regedit open and start any MS Office application (e.g. Word).
Go to HKEY_USERS\S-1-5-20\Software\Microsoft\OfficeSoftwareProtectionPlatform
Insert the following key via Edit -> New -> Key: Binary Value
Restart the office application; it should now be activated. You can check that by going to File -> Help.
Today, on my 42nd birthday, I found out that a misconfigured firewall at $WORK had been participating in a DDOS attack. It was running an NTP server that was open to all, and the firewall rules I'd thought were set to default-deny were not. It's a crappy way to start your workday.
I'm trying to take more from it than just "Oh shit, I fucked up." Complexity of setup, proper use of nmap, trust-but-verify, distributed monitoring, etc. But I'm still working my way through that sinking feeling right now.
A while back I migrated Bacula, along with my tape library, from a Linux machine to a FreeBSD machine. The FreeBSD server is a whitebox with a bunch of hard drives, and those + ZFS == 30 TB of backup space. Which is sweet.
This lets me try doing disk-to-disk-to-tape with Bacula. Mostly working, but one thing I notice is that it is damn slow getting to the end of a tape to append data when I want to migrate -- like, half an hour or more. This is way longer than the Linux machine ever took, and that's doubly weird because I moved over the FC cards with the migration so it's the same hardware (though perhaps different drivers). So WTF?
Turns out, I think, that this is a shortfall in FreeBSD. Follow the bouncing ball:
Tape drives write an EOF (end-of-file) marker at the end of each file.
Bacula writes its jobs in chunks -- I have it set up to write 8GB chunks -- and the tape driver marks each one with an EOF.
When a tape is loaded, Bacula needs to get to the end of the data already on it. That means lots of "please look for the next EOF marker", which takes a while.
There's a shortcut, though: the MTIOCGET ioctl can (I think) keep track of how many EOFs are on the tape. That means the driver can just tell the tape drive "please fast-forward to the nth EOF marker."
Linux implements MTIOCGET.
FreeBSD does not.
Now, this is all stuff I've put together from various mailing list posts, so I might be wrong...but it seems to explain what's going on. There's also a suggestion that I'm wrong about this -- that FreeBSD does support MTIOCGET. So much for conclusions. Stay tuned.
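(For reference, here's what the Linux side looks like; the "File number" field is the MTIOCGET bookkeeping I mean. The device name and the numbers in the output are just illustrative:)
mt -f /dev/nst0 status
# SCSI 2 tape drive:
# File number=412, block number=0, partition=0.
# ...
# FreeBSD's mt status, as far as I can tell, has no equivalent field to report.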
In order to write this blog entry, I had to shave yaks...not once, but twice. First, my Emacs functions were all messed up, and I couldn't figure out why. Hadn't I figured this out already? Then I realized I had, but on another machine. I keep dotfiles in git, of course, so it was a simple pull away...except for all the changes that had accumulated in the meantime. Merge, commit, pull, merge again and now it's working. Swear blind not to do it again, knowing full well that I will.
$WORK is busy; I got mad a while back and finally moved Bacula from one server to another, which meant moving FC cards and the tape library to the new server, and it works now but oh god did that take time and effort. It's worth it -- things are smoother now, and will be even smoother once I get job migration working. (Disk-to-disk-to-tape for the win!) But it takes discipline to keep working on it, along with all the other things I'm supposed to be working on. The next two months at $WORK look like they're going to be busier than usual, and I'm already having to say things like "I can do that for you in mid-January." I hate that; it's nothing that can't wait, really, but I still hate it. Add to that War in Heaven (buy me a beer and I'll tell you the sordid tale), paperwork catchup, mid-year changeover and the Temporal Anomaly Zone.
And so but. Yesterday, Torturedpotato sent me off to a homebrew club meeting. I was reluctant to go -- it's a fun time, but I hadn't thought about it in advance or made sure it worked for everyone. And she sent me off anyway, saying things would be all right. They were. I had some really good homebrew (plus some of my own...WAH, this year's Xmas stout is a hot mess), and got to take my mind off things for a while -- think of something other than work, computers, devops, Sysadvent, distributed in-memory databases.
Did you know there was a fork of Bacula named Bareos? Not I. Not sure whether to pronounce it "bar-ee-os" or "bear-o-s". Got Kern Sibbald, Bacula's creator, rather upset. He promises to bring over "any worthwhile features"...which is good, because there are a lot.
Post by Matthew Green titled "How does the NSA break SSL?". Should be reading that now but I'm writing this instead.
I have not read The Phoenix Project, which makes me a bad person for reacting so viscerally to things like "A personal reinterpretation of the three ways" and the self-congratulatory air of the headshot gallery. I'm trying to figure out why I react this way, and whether it's justified or just irrational dislike of people I perceive as outsiders. Seriously, though, the Information Technology Process Institute?
Got Netflix at home? Got IPv6? That might be why they think you're in Switzerland and change your shows accordingly. In my case, they thought I was in the US and offered to show "30 Rock" and "Europa Report"...until I tried to actually stream them and they figured out the truth. Damn.
Test-Driven Infrastructure with Chef. Have not used Chef before, but I like the approach the author uses in the first half of the book: here's what you need to accomplish, so go do it. The second half abandons this...sort of unfortunate, but I'm not sure explaining test infrastructure libraries (ServerSpec, etc) would work well in this approach. Another minor nitpick: there's a lot of boilerplate output from Chef et al that could easily be cut. Overall, though, I really, really like this book.
Mencius Moldbug, one of the most...I mean...I don't even. Jaw-droppingly weird. Start with "Noam Chomsky killed Aaron Swartz".
Today's problems:
pt-fifo-split doesn't like embedded newlines, which is fun to find out after 24 hours of loading data. I'm trying out the approach in the Perl Cookbook; should have an answer in a day or two.
Some .debs (cough Rstudio) have an extra newline in their control file. This causes lots of fun when prm generates a Packages file with extra newlines. Patch going in to prm. Not sure whether this is a bug in packaging, or whether repo generators should be watching for this. Either way. This is actually my second patch for this project which, yes, makes me feel like a good citizen.
Cos man, it's been a busy couple of weeks. And that's not even the real stuff; that's just the yakshaving.
At $WORK I need to package stuff up. Most of the time, I've worked with RPMs. Those are easy; it took, I dunno, an hour? to come up with my first spec file, and it's been pretty simple after that -- even with oddball software that uses "make configure" to mean "make build" and "make" to actually install it. But deb files...man, these are hard. There is a lot more policy built into making a deb file than there ever was in an RPM, and you overlook/override/ignore it at your peril. I want to do the right thing -- which means, since I'm stubborn and have thought idly about becoming a Debian developer someday, I bang my head against the deb files until it works. Except that, as Jordan Sissel has so rightly pointed out, sometimes I just don't have time.
So Jordan Sissel, being Jordan Sissel, has put together fpm, the Effing Package Management. And it's awesome: take your source files, point fpm at them, and you get rpms AND debs. But I want more: the pbuilder system is truly awesome in its you're-gonna-compile-on-an-empty-chroot-or-you're-borked brutish stubbornness, and I want that for fpm. So I'm using vagrant for this. My approach is not as nice as debuild or pbuilder, but it is so far working for me.
It's actually pretty trivial; I'm mainly putting it up here so that I remember it. Hopefully it's useful to someone else, too.
The heart is a dirt-simple shell script; here's what I've got for vmd:
#!/bin/bash
# Bail on error; want this to be as hands-off as possible
set -e
# Add build-essential
sudo apt-get update
sudo apt-get install -y build-essential
# Install fpm. FIXME: That check is borken; not sure about using vagrant_ruby to install it.
which fpm || sudo /opt/vagrant_ruby/bin/gem install fpm
# Where to look for stuff
ORIGIN=/vagrant
INSTALLROOT=/opt
BUILDDIR=/tmp/vmd
TARBALL=vmd-1.9.1.bin.LINUXAMD64.opengl.tar.gz
BUILDDEPS=
# FIXME: gotta put in all the "--depends" by hand.
DEPS="--depends libgl1-mesa-dev --depends libglu1-mesa --depends libxinerama1 --depends libxi6"
[ -d $BUILDDIR ] || mkdir $BUILDDIR
cd $BUILDDIR
tar xvzf ${ORIGIN}/$TARBALL
cd vmd-1.9.1
export VMDINSTALLBINDIR=${INSTALLROOT}/bin
export VMDINSTALLLIBRARYDIR=${INSTALLROOT}/vmd
./configure
sudo make -C src install
# And here's the fpm magic. FIXME: Note the stupid assumptions about opt.
/opt/vagrant_ruby/bin/fpm -s dir -t deb -n vmd -v 1.9.1 -f $DEPS -p ${ORIGIN}/vmd-VERSION_ARCH.deb -x opt/vagrant_ruby -x opt/VBoxGuestAdditions* -C / opt
Drop this in a directory along with the vmd tarball:
-rwxr-xr-x 1 hugh hugh 1153 Sep 12 14:40 build_vmd.sh
-rw-r--r-- 1 hugh hugh 22916955 Sep 12 10:05 vmd-1.9.1.bin.LINUXAMD64.opengl.tar.gz
Vagrant up, then build:
$ vagrant init precise64 # That's Ubuntu 12.04, yo.
$ vagrant up
$ vagrant ssh -- /vagrant/build_vmd.sh
If all goes well, you now have a deb in that directory:
$ ls -l *deb
-rw-rw-r-- 1 hugh hugh 22396738 Sep 12 14:41 vmd-1.9.1_amd64.deb
Like I say, it's dirt-simple -- but it does make stuff a lot easier. Thanks, Mitchell and Jordan!
Yesterday a user asked me about a Java application that was unusably slow when running over SSH X forwarding. This was using Ubuntu 12.04 + Unity; weird thing was, it was fast on a laptop that was also running Ubuntu 12.04 + Unity. Other X apps were fine over SSH to the same host.
Turns out there are a lot of complaints about Java applications being slow over SSH. This bug in particular looks likely. If I read this bug correctly it's been fixed in 12, so I'm not sure why I'm seeing it. There were some differences in packages between the laptop and the desktop, but nothing jumped out; both were using NVidia graphics + the proprietary drivers.
In the end, a workaround was to use i3, a spartan (but oh so good) window manager. Interestingly, the app showed faster performance under XFCE than Unity, but not as fast as i3. Maybe a problem in a couple of different libraries?
After some upgrades (kernel and otherwise) to an Ubuntu 12 workstation, a user reported one of their monitors insisted on displaying at low resolution (800x600, instead of the 1920x1024 it had previously). I eventually figured out that X and/or the driver (both Radeon and proprietary ATI) could not get EDID info from the monitor anymore. This led down a few rabbit holes, including a bug in Intel's driver and reflashing EDID info on the affected monitor.
In the end, though? Replacing the goram cable (analog, if that makes a difference) did the trick. I now have the cable, cut in half, hanging over my desk as a trophy.
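(If you hit the same thing and can't just swap the cable, forcing a mode by hand is the usual workaround. A sketch -- the output name DVI-0 and the 1920x1080 modeline are whatever xrandr -q and cvt give you, not gospel:)
cvt 1920 1080 60    # prints a modeline for the resolution you want
xrandr --newmode "1920x1080_60.00" 173.00 1920 2048 2248 2576 1080 1083 1088 1120 -hsync +vsync
xrandr --addmode DVI-0 "1920x1080_60.00"
xrandr --output DVI-0 --mode "1920x1080_60.00"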
While setting up mini-dinstall today, I tripped over this error:
$ mini-dinstall -b
Traceback (most recent call last):
File "/usr/bin/mini-dinstall", line 205, in <module>
configp.read(configfile_names)
File "/usr/lib/python2.7/ConfigParser.py", line 305, in read
self._read(fp, filename)
File "/usr/lib/python2.7/ConfigParser.py", line 546, in _read
raise e
ConfigParser.ParsingError: File contains parsing errors: /home/hugh/.mini-dinstall.conf
[line 2]: ' mail_to = sysadmin@example.com\n'
[line 3]: ' incoming_permissions = 0755\n'
[line 4]: ' architectures = all, amd64\n'
[line 5]: ' archive_style = simple-subdir\n'
[line 6]: ' dynamic_reindex = 1\n'
[line 7]: ' archivedir = /home/hugh/public_html/debian/\n'
Eventually, I figured out the reason: leading spaces in each line. I'd assumed I could write the config file like so:
[section]
# Notice the indentation!
  key = value
  otherkey = othervalue
but that's incorrect; it needs to be like so:
[section]
# Indentation is for suckers and chumps. Apparently.
key = value
otherkey = othervalue
Hopefully that saves someone half an hour...
Wednesday: A very important fileserver panicked and rebooted, apropos of nothing. I can't figure out why.
Thursday: Around 1.30am, a disk array at $WORK noticed one of its drives was likely to fail shortly. It got very excited and sent me one hundred and fifty (150) (not exaggerating) text messages. When I got to work I failed the drive, put the spare into the array, the array started rebuilding, and I called Dell about 10am to arrange for a replacement to be sent out the next day (that is, Friday -- today).
When the rebuild was done it complained that another drive was likely to fail shortly. I contacted Dell and was told that the complaint about the second drive was a) misguided (it wasn't really failing) and b) really meant that the array (that is, /share/networkscratch) was likely to fail entirely. They called this a punctured stripe and there are more than a few complaints about this terminology. Anyhow. The only solution was to back up the data, delete the array, recreate it and restore from backup. "Everybody out of the pool!"
About 6pm last night the process was finally done, but the array still complained that the drive was going to fail soon. I contacted Dell again, and after looking at the array they decided that the second drive really was failing after all -- in fact, it had probably failed first, the array had been compensating for it all this time, and its problem only became evident when the other drive failed. A second replacement drive is due to arrive Monday; it was too late by this time to have it arrive today.
I brought up the server, restored the 2am backup to some spare space, and went home; this was about 9.15pm.
Friday: a long-running (ie, monthlong) rsync process decided to suck up all the memory on our webserver. It had to be forcibly rebooted.
And now I want a beer.
It was a rare clearish morning here in New West, and I just saw the ISS and the SpaceX Dragon fly over my house! How awesome is that? There were light, patchy clouds overhead, but the ISS was still bright and visible through them -- and then fainter, just a little bit ahead, was the Dragon capsule! It's undocking right now, and it's amazing luck that I got to see it. (Swoon...)
In other news: yesterday I got sleep, it was sunny, I forgot a couple of things that I should have been working on, and the resulting optimism led me to migrate the sysadmin wiki for the third time. This time it was from Foswiki to Ikiwiki. I have nothing against Foswiki except that I really, REALLY want to edit everything from Emacs; for FW, that means this complicated wrapper around sudo that was getting tiring. Now it's Git + Emacs + Multimarkdown and I am happy.
Not only that, I got a long-standing feature request (one that I made to myself) out of the way: I can now check in, in Emacs Orgmode, to a particular RT ticket when replying to that ticket. (waves hands around in insane manner) Don't you see what this MEANS? Previously I'd have to switch to Emacs, refresh the rtliberation view which'd take 5 seconds (SO BORED), run a command to add it to my Org file, switch to Org, find the new addition, check in and THEN switch back to Mutt and reply. Now it's all in Emacs. It means a new life for ALL of us, baby! You'll see!
This entry brought to you by not enough sleep, excitement about spaceflight, Emacs geekery and a mug full of coffee.
Yesterday I was asked to restore a backup for a Windows desktop, and I couldn't: I'd been backing up "Documents and Settings", not "Users". The former is appropriate for XP, which this workstation'd had at some point, but not Windows 7 which it had now. I'd missed the 286-byte size of full backups. Luckily the user had another way to retrieve his data. But I felt pretty sick for a while; still do.
When shit like this happens, I try to come up with a Nagios test to watch for it. It's the regression test for sysadmins: is Nagios okay? Then at least you aren't repeating any mistakes. But how the hell do I test for this case? I'm not sure when the change happened, because the full backups I had (going back three months; our usual policy) were all 286 bytes. I thought I could settle for "alert me about full backups under...oh, I dunno, 100KB." But a search for that in the catalog turns up maybe ten or so, nine of them legitimate, meaning an alert for this will give 90% false positives.
So all right, a list of exceptions. Except that needs to be maintained. So imagine this sequence:
I need some way of saying "Oh, that's unusual..." Which makes me think of statistics, which I don't understand very well, and I start to think this is a bigger task than I realize and I'm maybe trying to create AI in a Bash script.
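For what it's worth, here's roughly the kind of "that's unusual" check I keep imagining -- a naive sketch, not real statistics, and (spoiler from a couple of paragraphs down) it wouldn't have caught the original mistake anyway. It assumes a hypothetical helper, full_backup_sizes CLIENT, that prints one full-backup size in bytes per line, oldest first:
#!/bin/sh
# Warn if the latest full backup is much smaller than the median of the
# previous ones. full_backup_sizes is made up; wire it to your own catalog.
client="$1"
full_backup_sizes "$client" | awk '
    { sizes[NR] = $1 }
    END {
        if (NR < 4) { print "UNKNOWN: not enough history"; exit 3 }
        latest = sizes[NR]; n = NR - 1
        # crude sort of everything except the latest run
        for (i = 1; i <= n; i++)
            for (j = i + 1; j <= n; j++)
                if (sizes[j] < sizes[i]) { t = sizes[i]; sizes[i] = sizes[j]; sizes[j] = t }
        median = (n % 2) ? sizes[(n + 1) / 2] : (sizes[n / 2] + sizes[n / 2 + 1]) / 2
        if (latest < median / 2) { printf "WARNING: latest full is %d bytes, median is %d\n", latest, median; exit 1 }
        printf "OK: latest full is %d bytes, median is %d\n", latest, median; exit 0
    }'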
And really, I've got don't-bug-me-if-this lists, and local checks and exceptions, and I've documented things as well as I can but it's never enough. I've tried hard to make things easy for my eventual successor (I'm not switching jobs any time soon; just thinking of the future), and if not easy then at least documented, but I have this nagging feeling that she'll look at all this and just shake her head, the way I've done at other setups. It feels like this baroque, Balkanized, over-intricate set of kludges, special cases, homegrown scripts littered with FIXMEs and I don't know what-all. I've got Nagios invoking Bacula, and Cfengine managing some but not all, and it just feels overgrown. Weedy. Some days I don't know the way out.
And the stupid part is that NONE OF THIS WOULD HAVE FIXED THE ORIGINAL PROBLEM: I screwed up and did not adjust the files I was backing up for a client. And that realization -- that after cycling through all these dark worryings about how I'm doing my job, I'm right back where I started, a gutkick suspicion that I shouldn't be allowed to do what I do and I can't even begin to make a go at fixing things -- that is one hell of a way to end a day at work.
The other day at $WORK, a user asked me why the jobs she was submitting to the cluster were being deferred. They only needed one core each, and showq showed lots free, so WTF?
By the time I checked on the state of these deferred jobs, the jobs were already running -- and yeah, there were lots of cores free. The checkjob command showed something interesting, though:
$ checkjob 34141 | grep Messages
Messages: cannot start job - RM failure, rc: 15041, msg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN'
I thought this was from the node that the job was on now:
$ qstat -f 34141 | grep exec_host
exec_host = compute-3-5/19
but that was a red herring. (I could've also got the host from "checkjob 34141 | grep -2 'Allocated Nodes'".) Instead, grepping through maui.log showed that it had been compute-1-11 that was the real problem:
/opt/maui/log $ sudo grep 34141 maui.log.3 maui.log.2 maui.log.1 maui.log |grep -E 'WARN|ERROR'
maui.log.3:03/05 16:21:48 ERROR: job '34141' has NULL WCLimit field
maui.log.3:03/05 16:21:48 ERROR: job '34141' has NULL WCLimit field
maui.log.3:03/05 16:21:50 ERROR: job '34141' cannot be started: (rc: 15041 errmsg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN' hostlist: 'compute-1-11')
maui.log.3:03/05 16:21:50 WARNING: cannot start job '34141' through resource manager
maui.log.3:03/05 16:21:50 ERROR: cannot start job '34141' in partition DEFAULT
maui.log.3:03/05 17:21:56 ERROR: job '34141' cannot be started: (rc: 15041 errmsg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN' hostlist: 'compute-1-11')
There were lots of messages like this; I think the scheduler only gave up on that node much later (hours later).
checknode showed nothing wrong; in fact, it was running a job currently and had 4 free cores:
$ checknode compute-1-11
checking node compute-1-11
State: Busy (in current state for 6:23:11:32)
Configured Resources: PROCS: 12 MEM: 47G SWAP: 46G DISK: 1M
Utilized Resources: PROCS: 8
Dedicated Resources: PROCS: 8
Opsys: linux Arch: [NONE]
Speed: 1.00 Load: 13.610
Network: [DEFAULT]
Features: [NONE]
Attributes: [Batch]
Classes: [default 0:12]
Total Time: INFINITY Up: INFINITY (98.74%) Active: INFINITY (18.08%)
Reservations:
Job '33849'(8) -6:23:12:03 -> 93:00:47:56 (99:23:59:59)
JobList: 33849
maui.log showed an alert:
maui.log.10:03/03 22:32:26 ALERT: RM state corruption. job '34001' has idle node 'compute-1-11' allocated (node forced to active state)
but that was another red herring; this is common and benign.
dmesg on compute-1-11 showed the problem:
compute-1-11 $ dmesg | tail
sd 0:0:0:0: SCSI error: return code = 0x08000002
sda: Current: sense key: Hardware Error
<<vendor>> ASC=0x80 ASCQ=0x87ASC=0x80 <<vendor>> ASCQ=0x87
Info fld=0x10489
end_request: I/O error, dev sda, sector 66697
Aborting journal on device sda1.
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
mount on the node hinted at the same thing (note the stale-mtab warning at the end):
(Linux)|Wed Mar 06 09:37:20|[compute-1-11:~]$ mount
/dev/sda1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda5 on /state/partition1 type ext3 (rw)
/dev/sda2 on /var type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
sophie:/export/scratch on /share/networkscratch type nfs (rw,addr=10.1.1.1)
mount: warning /etc/mtab is not writable (e.g. read-only filesystem).
It's possible that information reported by mount(8) is not
up to date. For actual information about system mount points
check the /proc/mounts file.
but this was also logged on the head node in /var/log/messages:
$ sudo grep compute-1-11.local /var/log/* |grep -vE 'automount|snmpd|qmgr|smtp|pam_unix|Accepted publickey' > ~/rt_1526/compute-1-11.syslog
/var/log/messages:Mar 6 00:26:22 compute-1-11.local pbs_mom: LOG_ERROR::Read-only file system (30) in job_purge, Unlink of job file failed
/var/log/messages:Mar 6 00:26:22 compute-1-11.local pbs_mom: LOG_ERROR::Read-only file system (30) in remtree, unlink failed on /opt/torque/mom_priv/jobs/34038.sophie.TK
/var/log/messages:Mar 6 00:26:22 compute-1-11.local pbs_mom: LOG_ERROR::Read-only file system (30) in job_purge, Unlink of job file failed
and in /var/log/kern:
$ sudo tail /var/log/kern
Mar 5 10:05:00 compute-1-11.local kernel: Aborting journal on device sda1.
Mar 5 10:05:01 compute-1-11.local kernel: ext3_abort called.
Mar 5 10:05:01 compute-1-11.local kernel: EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Mar 5 10:05:01 compute-1-11.local kernel: Remounting filesystem read-only
Mar 7 05:18:06 compute-1-11.local kernel: Memory for crash kernel (0x0 to 0x0) notwithin permissible range
There are a few things I've learned from this:
I've started to put some of these commands in a sub -- that's a really awesome framework from 37signals to collect commonly-used commands together. In this case, I've named the sub "sophie", after the cluster I work on (named in turn after the daughter of the PI). You can find it on github or my own server (github is great, but what happens when it goes away? ...but that's a rant for another day.) Right now there are only a few things in there, and they're somewhat specific to my environment, and doubtless they could be improved -- but it's helping a lot so far.
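For the curious, a sub command is (if I'm remembering the layout right) just an executable dropped into the sub's libexec directory. Something like the sketch below -- not what's actually in the repo, just the commands from above glued together under a hypothetical name:
#!/bin/sh
# libexec/sophie-why-deferred -- why is/was this job being deferred?
# Usage: sophie why-deferred JOBID
jobid="$1"
[ -n "$jobid" ] || { echo "usage: sophie why-deferred JOBID" >&2; exit 1; }

echo "== scheduler messages"
checkjob "$jobid" | grep -i -2 'Messages'
echo "== where it landed (if anywhere)"
qstat -f "$jobid" | grep exec_host
echo "== warnings/errors in the Maui logs"
sudo grep -h "$jobid" /opt/maui/log/maui.log* | grep -E 'WARN|ERROR' | tail -20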
This was a busy-ass day, yo. Got up at 5.30am to make beer, only to find out that a server at work had gone down and its ILOM no longer works. A few hours later, I've convinced everyone that a trip to UBC would be lovely; I go in, reboot the server and we drive back. It's no 6000 km from Calgary to California and back like my brother does, but for us that's a long drive.
And then it's time to make beer, because I've left it mashing overnight. Boil, chill, sanitize, pitch, lug, clean, and we're one batch closer to 50 (50!). A call to my parents (oh yeah: Dad, you guys can totally stay here in May) and then it's supper. And then it's time for astronomizing. Computers, beer and astronomy: this day had it all.
So tonight's run was mostly about trying out the manual setting circles. I don't have a tablet or smart phone to run something like Stellarium on, so for now I'm printing out a spreadsheet with a three-hour timeline, 15 minute intervals, of whatever Messiers are above the horizon.
How did it work? Well, first I zeroed the azimuth on Kochab (Beta UMi) rather than Polaris, and kept wondering why the hell the azimuth was off on everything. I realized my mistake, set things right, and tried again. And...it worked well, when I could recognize things.
M42, for example, was easy. (It was the first thing I found by dialing everything in, and when I took a look there was a satellite crossing the FOV. Neat!) But then, it's big, easy to recognize, and I've seen it before. Ditto M45. M1? Not so much; I haven't seen it before, and I didn't have a map ready to look at. M35, surprisingly, was hard to find; M34 was relatively easy, and M36 was found mainly because I knew what to look for in the finder.
This should not be surprising. I've been tracking down objects by starhopping for a while now, so why I thought it would be easier now that I could dial stuff in is beyond me. It's my first time, and the positions were calculated for Vancouver, not New West (though I'm curious how much diff that actually makes).
There were some other things I looked for, though.
M51: found the right location via starhopping, and confirmed on my chart. But could I see it? Could I bollocks.
M50: Pretty sure I did see this; looking at sketches, they seem pretty similar to what I saw. (And that's another thing: it really does feel too easy, like I haven't earned it, and I can't be sure I've really found it.) Oh, and saw Pakan 3, an asterism shaped like a 3/E/M/W, nearby.
M65/M66: Maybe M65; found the location via starhopping and confirmed the position in my chart. Seems like I had the barest hint of M65 visible.
At the request of my kids:
Jupiter: Three bands; not as steady as I thought it would be.
Pleiades: Very nice, but I do wish I had a wider field lens.
M36: Eli suggested a star cluster, so I went with this. Lovely X shape.
Betelgeuse: Nice colour. Almost forgot about this, and had to look at it through trees before I went home.
Some time between mid-December and January at $WORK, we noticed that FTP transfers from the NIH NCBI were nearly always failing; maybe one attempt in 15 or so would work. I got it down to this test case:
wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE34nnn/GSE34777/matrix/GSE34777_series_matrix.txt.gz
The failed downloads would not fail right away, but instead hung when the data connection from the remote end should have transferred the file to us. Twiddling passive did nothing. If I tried HTTP instead of FTP, I'd get about 16k and then the transfer would hang.
That hostname, ftp.ncbi.nlm.nih.gov, resolves to 4-6 different IP addresses, with a TTL of 30 seconds. I found that ftp transfers from these IP addresses failed:
but this one worked:
The A record you get doesn't seem to have a pattern, so I presume it's being handled by a load balancer rather than simple round-robin. The working address didn't come up very often, which I think accounts for the low rate of success.
At first I thought this might indicate network problems here at $WORK, but the folks I contacted insisted nothing has changed, we're not behind any additional firewalls, and all our packets take the same route to both sets of addresses. So I checked our firewall, and couldn't find anything there -- no blocked packets, and to the best of my knowledge no changed settings. Weirdly, running the wget command on the firewall itself (which runs OpenBSD, instead of CentOS Linux like our servers) worked...that was an interesting rabbit hole. But if I deked out the firewall entirely and put a server outside, it still failed.
Then I tripped over the fix: lowering the MTU from our usual 9000 bytes to 8500 bytes made the transfers work successfully. (Yes, 8500 and no more; 8501 fails, 8500 or below works.) And what has an MTU of 8500 bytes? Cisco Firewall Service Modules, which are in use here at $WORK -- though not (I thought) on our network. I contacted the network folks again, they double-checked, and said no, we're not suddenly behind an FSM. And in fact, their MTU is 8500 nearly everywhere...which probably didn't happen overnight.
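(If you ever need to sniff out this sort of thing faster than by twiddling interface MTUs, ping's don't-fragment mode is handy. The payload size is the path MTU minus 28 bytes of IP and ICMP headers, so on a 9000-byte network with an 8500-byte hop in the way you'd expect something like:)
ping -M do -s 8972 -c 3 ftp.ncbi.nlm.nih.gov   # 9000-byte packets, don't fragment: hangs/fails
ping -M do -s 8472 -c 3 ftp.ncbi.nlm.nih.gov   # 8500-byte packets: should get replies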
Changing the MTU here was an imposing thought; I'd have to change it everywhere, at once, and test with reboots...Bleah. Instead, I decided to try TCP MSS clamping with this iptables rule:
iptables -A OUTPUT -p tcp -d 130.14.250.0/24 --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 8460
(Again, 8460 or below works; 8461 or above fails -- which makes sense, since 8460 is the 8500-byte FSM MTU minus 40 bytes of IP and TCP headers.) It's a hack, but it works. I'm going to contact the NCBI folks and ask if anything's changed at their end.
I found out before Xmas that my request for an office had been approved. We had some empty ones hanging around, and my boss encouraged me to ask for one. Yesterday I worked in it for the first time; today I started moving in earnest, and moved my workstation over.
And my god, is it ever cool. The office itself is small but nice -- lots of desk space, a bookshelf, and windows (OMG natural light). But holy crap, was it ever wonderful to sit in there, alone, uninterrupted, and work. Just work. Like, all the stuff I want to do? I was able to sit down, plan out my year, and figure out that in the grand scheme of things it's not too much. (The problem is wanting to do everything right away.) And today I worked on getting our verdammt printing accounting exposed to Windows users, setting up Samba for the first time in eons and even getting it to talk to LDAP.
Not only that -- not only that, I say, but when I had interruptions -- when people came to me with questions -- it was fine. I didn't feel angry, or lost, or helpless. I helped them as best I could, and moved on. And got more shit done in a day than I've done in a week.
I'm going to miss hanging out with the people in the cubicle I was in. Yes, they're only ten metres away, but there's a world of difference between having people there and having to walk over to see them. I'm not terribly outgoing, and it's just in the last six months or so that I've really come to enjoy all the people around me. They're fun, and it's nice to be able to talk to them. (There's even a programmer who homebrews, for a wonderful mixture of tech and beer talk.) But oh my sweet darling door made of steel and everything, I love this place.
First day back at $WORK after the winter break yesterday, and some...interesting...things. Like finding out about the service that didn't come back after a power outage three weeks ago. Fuck. Add the check to Nagios, bring it up; when the light turns green, the trap is clean.
Or when I got a page about a service that I recognized as having, somehow, to do with a webapp we monitor, but no real recollection of what it does or why it's important. Go talk to my boss, find out he's restarted it and it'll be up in a minute, get the 25-word version of what it does, add him to the contact list for that service and add the info to documentation.
I start to think about how to include a link to documentation in Nagios alerts, and a quick search turns up "Default monitoring alerts are awful", a blog post by Jeff Goldschrafe about just this. His approach looks damned cool, and I'm hoping he'll share how he does this. Inna meantime, there are the Nagios config options "notes", "notes_url" and "action_url", which I didn't know about. I'll start adding stuff to the Nagios config. (Which really makes me wish I had a way of generating Nagios config...sigh. Maybe NConf?)
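In case it helps anyone else, this is the shape of it -- everything here (hostname, template, check command, URLs) is made up, but the three directives are the real ones:
define service {
    use                   generic-service
    host_name             webapp01
    service_description   mystery-webapp-queue
    check_command         check_mystery_queue
    notes                 Queue runner for the webapp; boss knows the restart dance
    notes_url             https://wiki.example.com/services/mystery-webapp
    action_url            https://rt.example.com/Ticket/Create.html?Queue=sysadmin
}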
But also on Jeff's blog I found a post about Kaboli, which lets you interact with Nagios/Icinga through email. That's cool. Repo here.
Planning. I want to do something better with planning. I've got RT to catch problems as they emerge, and track them to completion. Combined with orgmode, it's pretty good at giving me a handy reference for what I'm working on (RT #666) and having the whole history available. What it's not good at is big-picture planning...everything is just a big list of stuff to do, not sorted by priority or labelled by project, and it's a big intimidating mess. I heard about Kanban when I was at LISA this year, and I want to give it a try...not sure if it's exactly right, but it seems close.
And then I came across Behaviour-driven infrastructure through Cucumber, a blog post from Lindsay Holmwood. Which is damn cool, and about which I'll write more another time. Which led to the Github repo for a cucumber/nagios plugin, and reading more about Cucumber, and behaviour-driven development versus test-driven development (hint: they're almost exactly the same thing).
My god, it's full of stars.
I'm still digesting all the stuff that came out of LISA this year. But there are a number of things I want to try out:
I learned a little bit about agile development, mainly from Geoff Halprin's training material and keynote, and it seemed interesting. One of the things that resonated with me was the idea of only having a small number of work stages for stuff: in the queue, next, working, and done. (Going from memory here, so quite possibly wrong.) I like that: work on stuff in two-week chunks, commit to that and get it done. That seems much more manageable than having stuff in the queue with no real idea of a schedule. And a two-week chunk is at least a good place to start: interruptions aren't about to go away any time soon, and I can adjust this as necessary.
A corollary is that it's probably not best to plan more than two such things in a month. I'm thinking about things like switching from Nagios to Icinga, setting up Ganeti, and such: more than I can do in an hour, less than a semester's work.
I really want to work on eliminating pain points this year. Icinga's one; Nagios' web interface is painful. (I'd also like to look at Sensu.) I want to make backups better. I want to add proper testing for Cfengine with Vagrant and Git, so I can go on more than a wing and a prayer when pushing changes.
I also need to work more closely with the faculty in my department. Part of that is committing to more manageable work, and part of that is just following through more. Part of it, though, is working with people that intimidate me, and letting them know what I can do for them.
I need to manage my time better, and I think a big part of that is interruptions. I've just been told I'm getting an office, which is a mixed blessing. There's a certain amount of flux in the office, and I've been making friends with the people around me lately. I'll miss them/that, but I think the ability to retreat and work on something is going to be valuable.
Another part of managing time is, I think/hope, a better routine. Like: one hour every day for long-term project work. (The office makes this easier to imagine.) Set times for the things I want to get done at home (where my free time comes in one-hour chunks). Deciding if I want to work on transit (I can take my laptop home with me, and it's a 90 minute commute), and how (fun projects? stuff I can't get done at work? blue-sky stuff?). If I do, it won't be every day, because a) my eyes will bug out if I stare at a screen all day and b) I firmly intend to keep a limit on my work time. So it'd probably be a couple days a week, to allow time for all the podcasts and books I want to inhale.
Microboxing for productivity. Interesting stuff.
Kanban. Related to Agile, but I forgot about it. Pomodoro + Emacs + Orgmode, too.
Probably more as I think of it...but right now it's time to sleep. 5.30am comes awful early after 11 days off...
So yesterday I got an email from another sysadmin: "Hey, looks like there's a lot of blocked connections to your server X. Anything happening there?" Me: "Well, I moved email from X to Y on Tuesday...but I changed the MX to point at Y. What's happening there?"
Turns out I'd missed a fucking domain: I'd left the MX pointing to the old server instead of moving it to the new one. And when I turned off the mail server on the old machine, delivery to this domain stopped. Fortunately I was able to get things going again: I changed the MX to point at the new server, and turned on the old server again to handle things until the new record propagated.
So how in hell did this happen? I can see two things I did wrong:
Poor planning: my plans and checklists included all the steps I needed to do, but did not mention the actual domains being moved. I relied on memory, which meant I remembered (and tested) two and forgot the third. I should have included the actual domains: both a note to check the settings and a test of email delivery.
No email delivery check by Nagios: Nagios checks that the email server is up, displays a banner and so on, but does not check actual email delivery for the domains I'm responsible for. There's a plugin for that, of course, and I'm going to be adding that.
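The settings half of that is easy enough to script, too. Something like this (the domains and the target MX are placeholders, obviously):
#!/bin/sh
# Nagios-ish check: do all our domains' MX records point at the new server?
expected="mail-new.example.com."
status=0
for domain in example.com example.org example.net; do
    mx=$(dig +short MX "$domain" | sort -n | awk 'NR==1 {print $2}')
    if [ "$mx" != "$expected" ]; then
        echo "CRITICAL: $domain MX is ${mx:-unset}, expected $expected"
        status=2
    fi
done
[ "$status" -eq 0 ] && echo "OK: all MX records point at $expected"
exit $status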
I try to make a point of writing down things that go bad at $WORK, along with things that go well. This is one of those things.
Yesterday I finally moved the $WORK mail server (well, services) from a workstation under my desk to a proper VM and all. Mailman, Postfix, Dovecot -- all went. Not only that, but I've got them running under SELinux no less. Woot!
Next step was to update all the documentation, or at least most of it, that referred to the old location. In the process I came across something I'd written in advance of the last time I went to LISA: "My workstation is not important. It does no services. I mention this so that no one will panic if it goes down."
Whoops: not true! While migrating to Cfengine 3, I'd set up the Cf3 master server on my workstation. After all, it was only for testing, right? Heh. We all know how that goes. So I finally bit the bullet and moved it over to a now-even-more-important VM (no, not the mail server) and put the policy files under /masterfiles so that bootstrapping works. Now we're back to my workstation only holding my stuff. Hurrah!
And did I mention that I'm going to LISA? True story. Sunday I'm doing Amazon Web Services training; Monday I'm in the HPC workshop; Tuesday I'm doing Powershell Fundamentals (time to see how the other half lives, and anyway I've heard good things about Powershell) and Ganeti (wanted to learn about that for a while). As for the talks: I'm not as overwhelmed this year, but the Vint Cerf speech oughta be good, and anyhow I'm sure there will be lots I can figure out on the day.
Completely non-tech-related link of the day: "On Drawing". This woman is an amazing writer.
Last Friday was not a good day. First, I try installing a couple hard drives I bought for a Dell MD3200 disk array, and it rejected them; turns out that it will not work with drives that are not pre-approved. It's right there in the documentation. I was aware of last year's kerfuffle with Dell announcing that their servers would no longer work w/unapproved drives, and then their backdown on that...but the disk arrays are different. So now I have a couple extra 3 TB SATA drives to find a place for, and a couple drives to buy from Dell.
While I'm in the server room staring at the blinking error light on the disk array and wondering if I'd just brought down the server it was attached to, I notice another blinking error light. This one was on the server that hosts Xen VMs that run LDAP, monitoring and a web server. It had a failing drive. Good thing it's in RAID 6, right? Sure, but it failed nearly a month ago -- I had not set up email alerts on the server's ILOM, so I never got notified about this. Fuck.
I send off an email to order a drive, then figure out how to get alerted about this. Email alerts are configured, but belt and suspenders: I get the CLI tool for the RAID card, find a Nagios plugin that runs it, and add the check to Nagios, running on the server's dom0. Hurrah, it alerts me! I ack the alert, and now it's time to head home.
On my way home I start getting pages about the VMs on this machine -- nothing down, but lots of timeouts. The machine recovers, then stumbles and stays down. (These alerts were coming from a second instance of Nagios I have set up, which is mostly there to monitor the main instance that runs on this server.) My commute is 90 minutes, and I have no net access along the way. When I finally get home, I SSH to work and find that the machine is hung; as far as I can tell, the CLI tool was just not exiting, and after enough accumulated the RAID card just stopped responding entirely. I reboot the machine, and ten minutes later we're back up.
Ten minutes after that, I realize I'm still in trouble: I'm getting pages about a few other machines that are not responding. Remember how one of the VMs on the original server ran LDAP? It's one of three LDAP servers I have, because I fucking hate it when LDAP goes down. The clients are configured to fail over if their preferred server (the VM) isn't responding. I check on one of the machines, and nscd had about a thousand open sockets...which makes me think that the sequence was something like this:
During the hang, the VM was responding a little bit -- probably just enough to complete a 3-way handshake.
nscd would keep that connection open, because it had decided that the server was there and would be answering. But it wouldn't.
Another query would come along, and nscd would open another connection. Rinse and repeat.
Eventually, nscd bumped up against the open FD limit (1024), and was unable to open up new connections to any LDAP server.
I'm thinking about putting in a check for the number of open FDs nscd has, but I'm starting to second-guess myself; it feels a bit circular somehow. Not the right word, but I'm tired and can't think of a better.
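The check itself would be trivial, which is maybe why it feels circular. Roughly this (the threshold is a guess, and it needs root, since nscd's fds belong to root):
#!/bin/sh
# How many file descriptors does nscd have open? Warn before it hits 1024.
pid=$(pidof -s nscd) || { echo "UNKNOWN: nscd not running"; exit 3; }
nfd=$(ls /proc/"$pid"/fd | wc -l)
if [ "$nfd" -gt 512 ]; then
    echo "WARNING: nscd has $nfd open fds"
    exit 1
fi
echo "OK: nscd has $nfd open fds"
exit 0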
Gah.
Yesterday I did a long-anticipated firmware upgrade on a disk array at $work. It's attached to the head node of a small cluster we have, and holds the usual assortment of home directories and data. The process was kind of involved:
shut down the cluster to prevent disk I/O ("not mandatory, but strongly recommended" -- thx, I'll just go with "mandatory");
remove the current management software from the head node, reboot and then reinstall;
X was needed for the installation ("not mandatory, but--" Okay, right, got it, thx): twice via SSH, once by running startx locally;
I couldn't upgrade directly to the new firmware itself, but had to install bridge firmware, wait 30 minutes for things to settle out (!), then install the new firmware
oh, and "due to limitations of the Linux environment", I couldn't install the firmware from the head node itself that just had the management software upgraded -- instead, I had to install that software on another machine and install it from there.
Which is why this all took about four hours to do. But that's not all:
Before all that, I read the many, many manuals; did a dress rehearsal to shake out problems; and made sure I had a checklist (thank you, Tom Limoncelli and Orgmode) with the exact commands to run
During the upgrade, I took notes on things I'd forgotten and problems I'd encountered.
After the upgrade, I did a postmortem: updated my documentation and filed bugs, notified the users that things were back up, and watched for problems.
Which is why a 4 hour upgrade took me 9.5 hours. I think there might be a handy rule of thumb for big work like this, though I can't decide if it's "it always takes twice as long" or "it always takes five hours longer than you think." Heh.
One other top tip: stop NFS exports while you're working on a server (but see the next paragraph!). One user started a session on another machine, which automounted her home directory from the head node. This was close to the end of my work, and while I could have used another reboot, I elected not to because I didn't want to mess up her session. Yes, the reboot was important, but I'd neglected to think about this situation, and I didn't think she should have to pay for my mistake.
And if you're going to turn off NFS exports, make damn sure you have your monitoring system checking exports in the first place; that way, you won't forget to turn it back on afterward. (/me scurries to add that test to Nagios right now...)
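(For the record, the test is about as simple as they come. The hostname and export path below are borrowed from elsewhere in this setup; substitute whatever your head node is supposed to be exporting.)
#!/bin/sh
# Is the head node still exporting what it should be?
server="sophie"
export_path="/export/scratch"
if showmount -e "$server" 2>/dev/null | grep -q "^$export_path[[:space:]]"; then
    echo "OK: $server is exporting $export_path"
    exit 0
fi
echo "CRITICAL: $server is not exporting $export_path"
exit 2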
Tomorrow I'm upgrading firmware on a disk array that's attached to a small cluster I manage; yesterday, in preparation for that, I ran a full backup of the disks in question. I noticed that the home directories were taking longer than I thought, so I checked out how full they were. The answer was 97%. Oh, fuck.
The prof whose cluster this is asked for quotas to be set up for everyone; he didn't have a lot of disk space to attach, and wanted to impose some discipline on his lab. And I'd done so...only somehow, quotas were off now, probably because I'd left them off the last time I'd had to fiddle with them. Because of that, one user was taking up nearly half the disk, and another was taking up almost a third. To make things worse, I had not set up my usual Nagios monitoring for this machine (disk space, say) because Ganglia was set up on it, and I'd vaguely thought that two such systems would be silly...so I was not getting my usual "OMG WTF BBQ" messages from Nagios.
It gets worse. I'd put in cron scripts that maintained the quota files, nagged users by email and CC'd me...but the permissions were 544, which meant they never ran. No email? Well, then, everything must be fine, right? Sigh.
So:
I talked to the user w/half the disk space, and it turned out that almost all of it was in a directory called "old" which she could delete w/o problems. That got us space.
I whipped up a simple Nagios plugin to check that quotas were on (there's a sketch of it below), and made sure I got a complaint; I turned on quotas on another partition, and made sure Nagios told me it was fine.
I fixed the permissions on the cron scripts, and made sure they ran (I left the debug setting on, and holy crap is it verbose...I'll need to fix that).
I'm considering adding a Nagios plugin that checks for cron files (/etc/cron.*) that are not executable (although if I'm lucky, maybe there's something in the cron runner that'll complain about this).
And as a reminder to myself: if repquota gives horribly wrong information, run "quotaon -p" to verify that quotas are, in fact, on.
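And since I mentioned it above, the quota check is more or less this -- a sketch, not the exact plugin, and quotaon -p's wording varies a bit between quota-tools versions, so check yours:
#!/bin/sh
# Go critical if user quotas are off for the given filesystem.
fs="${1:-/home}"
if quotaon -p -u "$fs" 2>&1 | grep -q ' is on$'; then
    echo "OK: user quotas are on for $fs"
    exit 0
fi
echo "CRITICAL: user quotas are OFF for $fs"
exit 2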
At $WORK I'm trying to install tmap on 64-bit CentOS 5. Here's how it goes:
Built RPM for tmap, and it works -- but not using tcmalloc.
What's tcmalloc? Part of Google-perftools; a faster malloc. We really want tcmalloc.
Found an RPM for google-perftools and installed it, but it includes only the 32-bit version of tcmalloc, due to other parts of perftools depending on libunwind.
Installing libunwind on 64-bit CentOS 5 is a big PITA, so I decide to try working around it.
Conveniently, tcmalloc can be compiled on a 64-bit platform to produce libtcmalloc_minimal, which the documentation says is a perfectly valid malloc.
tmap does not come, out of the box, configured to look for tcmalloc_minimal in its configure script, but there is a commented-out option to do so. Remove the comment and run autogen.sh, and then configure will look for libtcmalloc_minimal.
...but this fails, because the RPM I built for tcmalloc does not include libtcmalloc_minimal.so; it only includes libtcmalloc_minimal.so.4.
and so my half-assed RPM/devel skillz come back to bite me in the ass again.
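The usual way out of that particular hole -- writing from memory here, so treat it as a sketch -- is that the bare .so symlink belongs in a -devel subpackage; failing that, you can create it by hand and let tmap's configure find it (paths assume the 64-bit lib directory):
ls -l /usr/lib64/libtcmalloc_minimal.so*            # see what the RPM actually shipped
sudo ln -s libtcmalloc_minimal.so.4 /usr/lib64/libtcmalloc_minimal.so
sudo ldconfig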
Today I gave some impromptu training at $WORK; the approximate topic was "Saving State in Linux". I've been meaning to do something like this for a while, but it was prompted by a conversation yesterday with one of the researchers who kept losing work state when shit happened -- Emacs window arrangements, SSH sessions to other machines, and so on. I found myself mentioning things like tmux, workgroups, and Emacs daemon mode...and after a while, I said "Let me talk to you about this tomorrow."
So today I found half an hour, decided to mention this to everyone in the lab, crowded into a meeting room, set up my laptop and the projector, and away I went. For a fly-by-the-seat-of-my-pants first attempt, I think it went relatively well. Best idea: asking people for questions. It hadn't occurred to me that people would want to know more basic stuff like "How do I split windows in Emacs?". I'm never sure what people already know, so I don't want to bore them...
Next time:
In other news: finally converted my SVN repos to Git yesterday in a fit of pique. The big three -- my org-mode stuff, and the two Cfengine repos (Cf2 and -3) -- are already in use, as in that's where I'm checking stuff into. The rest (Nagios configs, for example) are being done as I get to them. It's really, really wonderful.
Family: holy house o' plague, Batman!
Gah. We're getting the house boiled next week. (Update, March 13: too late; I puked on Friday night and spent Saturday moaning in bed; my wife did the same thing Saturday night/Sunday. FUCK.)
Also? There's a Planet Lisp. Who knew?
Last night's shutdown went...okay. The turning-off-servers part went well, and I was able to do it all in less than half an hour -- not bad for an orderly shutdown. Turning things on went less well. Partly it's because I should be using IPMI commands instead of trying to SSH to the various ILOMs/DRACs and running commands; I kept getting error messages from Fab even though I could run the commands by hand just fine.
Partly, though, it's because about 80% of the Sun ILOMs needed to be rebooted in order to control the machines, report power status correctly, get a console, etc. That's annoying. In all other respects they're the most consistently good of their sort -- they work right off the bat, they don't crap out and drop off the network (hell, Dell) and they're simple. But they seem to get wedged, and I have to SSH in as "sunservice" (it's actually Linux under the hood -- shhhh) and forcibly reboot them.
And there's the usual assortment of hardware problems/irritations that the first reboot in a few months brings to light, like the second machine in a week with a possibly failing hard drive -- complicated, of course, by not having the tool around to actually query the RAID card to see how things are going. Worryingly, a reboot made (reporting of) the problem disappear. Good, I need another project.
In other news: I've had to abandon my brief infatuation with gitit, a git-based wiki written in Haskell. It is nearly perfect, but its syntax for tables -- based on Pandoc's extension of Markdown -- is bad, non-Orgmode compatible (Orgmode tables are SO AWESOME), and best ignored in favour of direct HTML. And when editing HTML for tables becomes your best option, I'm (sadly, regretfully, heartbreakingly) outta here. So for now, for $WORK, I'm sticking with Foswiki and my awful hack of a Bash function for editing it with Emacs:
wen () {
    WIKIPAGE=${@};
    sudo chown apache:wheel $WIKIPAGE ${WIKIPAGE},v;
    sudo chmod 775 $WIKIPAGE ${WIKIPAGE},v;
    rcs -l $WIKIPAGE;
    $EDITOR $WIKIPAGE;
    ci -mnone -t-none -u $WIKIPAGE;
    sudo chown apache:wheel $WIKIPAGE ${WIKIPAGE},v;
    sudo chmod 775 $WIKIPAGE ${WIKIPAGE},v
}
The chown/chmod (and thus the sudo) are needed to maintain web-editability for the pages...not that I use them very often, but for an eventual successor/coworker. I really miss Confluence-mode.
There's a power outage scheduled for our server room tomorrow night (which means I'm gonna miss the VanBrewers meeting, BOO HISS). I've been looking for a way to script the shutdown and startup of 50-odd servers, and here's what I've come up with.
Previously I've been using cssh, which works well enough for shutdown but not so well for startup (long story). Still, SSH is the way I want to go, so I looked at pssh and sshpt, but was unable to get either to work with a pseudo-tty to allow sudo. Then I came across Fab, and with a little bit of reading and a little bit of code I came up with something that should work.
Of course, the real problem is that I'm hand-coding so much of it: this server is HP and should be shut down like this, that server is Sun and should be shut down like so...really, I should be using ipmitool to do all this. But the mgt network is on a private subnet, I'm somewhere else, and this is the simplest quick thing I could do. We'll see how it goes.
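For next time, the ipmitool version is short enough that there's not much excuse -- a sketch, assuming the ILOMs/DRACs speak IPMI-over-LAN and I can actually reach the management subnet (hostnames, user and password file are placeholders):
#!/bin/sh
# Power-control every BMC listed in bmc-list.txt: status | on | off | soft
action="${1:-status}"
while read -r bmc; do
    echo "== $bmc"
    ipmitool -I lanplus -H "$bmc" -U admin -f /etc/ipmi-password chassis power "$action"
done < bmc-list.txt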
(This half-assed blog entry brought to you by cold medication and a sinus infection. Whee!)
Last year (hah! last year!), by which I mean Xmas 2011, just two weeks ago, I did all my updates and disruptive work in the week BEFORE Xmas, rather than after. It's one of the perks of working for a university that I get a week off between Xmas and the new year, and I decided to actually take advantage of it this year.
(I could make that paragraph better, but it's beyond me right now.)
(The other advantage, of course, is free honey from honeybee researchers.)
I allowed myself three days, but was actually done in two. That's considerably better than last year, and in large part that's because I learned a (not the) right way to uninstall/reinstall proprietary ATI drivers. Unlike last year, in fact, it was practically painless.
This might explain why it was not terribly surprising to come back to a problem related to the upgrade: a user telling me that PyMOL no longer worked. And sure enough, when I tried running it over SSH, it crashed:
$ pymol
/usr/bin/pymol: line 2: 17723 Segmentation fault /usr/bin/python //usr/lib64/python2.6/site-packages/pymol/__init__.py "$@"
...which wasn't exactly what she was seeing, but I was pretty sure that it was only the tip of the iceberg.
A backtrace (caught by the wonderful catchsegv, which I only just found out about) showed that it was failing at XF86DRIQueryVersion, which turns up in approximately 38% of all web pages. They're all related to problems w/the proprietary ATI driver, how it eats ponies, and how everything was sunshine and lollipops once they rolled back to the Mesa-provided copy of libGL.
We are running the proprietary ATI driver -- we need the 3D performance -- so this made sense. And after last year's fiasco I was quite prepared to believe that ATI has a nasty appetite for BBQ. But much searching showed that before the Xmas upgrade, everyone'd been using the ATI-supplied libGL w/, presumably, no problems. I decided to prove it by reinstalling Mesa on a test machine. Yep, now it works fine. ATI hates the world!
...but I'd forgotten that I was running this over SSH. With X forwarding. And this made a difference. The Truth Hammer smacked me when I tried PyMOL on another workstation, from an actual sit-at-the-keyboard-and-use-the-coffee-holder X session, and it worked fine. I SSH'd to the original user's machine, and that worked fine.
I checked the version of libGL on my machine, and sure enough it was different: 7.7.1 versus 7.8.2. My suspicion is that either the XF86DRIQueryVersion routine has changed enough that this causes problems, or there was some other difference (32-bit vs 64-bit? could be...) between my machine and theirs (mine runs a different distro, so there's lots of chances for interesting differences).
I simply did not expect there to be any problem debugging X programs over SSH; probably naive, but what the hell. Now I know.
Oh, and the user's problems? Wack PYTHONHOME. Unset that and all is well.
Happy new year, everyone!
I've been contacted by one of the co-chairs for this year's Cascadia IT Conference to ask if I'd add a link, and I'm happy to do so. This is the second year of the conference, and if last year's is anything to go by it should be another great time.
Unfortunately I won't be able to go, but if you're anywhere in the area next March I'd recommend it. And if anyone's interested, I have no financial interest in the conference, but I have met some of the organizers at past LISAs.