Carousel is a lie!

Observing report -- May 6, 2013
7th May 2013

After a clusterfuck of a weekend (about which more later), it was time to go astronomizing. But to do that, I'd have to take Tuesday off -- so I did. And what with the nearly new moon and the XOMG weather, I decided I'd go to Boundary Bay. After the last time I went, I got an email from a guy named Scott saying I should invite him along the next time I go. I admire a man who's direct, so I did. And wonder of wonders, he showed up, bringing his 120mm (I think) Skywatcher refractor. We arrived just after sundown and chatted while the ducks and herons dive-bombed us. It was his first time out at the Bay, and my 3rd.

We checked out Jupiter and caught it shortly before it set, low in the sky. We tried looking for the GRS (due to transit at 10.06pm, not 9pm like I thought) but didn't see it. After that it was Saturn, rising in the east. I couldn't see the Cassini Division then (9.30pm, maybe?), but I did see it fleetingly around midnight.

While Scott took pictures of Saturn, I looked up M35 before it set, and found it relatively easily -- much easier than the last time I'd tried to find it. I can nearly persuade myself that I saw NGC 2158.

And by now, M13 was up; I showed it to Scott, and he found it in his scope. "That's the first non-planet I've seen with this scope," he said. Wonderful, beautiful, and lovely to see after the winter. M57 was another reminder of last summer, and so was Albireo. Gorgeous in Scott's scope.

Markarian's Chain -- would you believe I saw part of this? True: M84, M86 and NGC 4438/4435, which may have been split. Barely. And M87 again, as I starhopped over. And but so after that it was time to head home, where I'm typing this now and am tired enough that I'm going to just post this and go.

Tags: astronomy.
There and back again
29th April 2013

So Saturday I get an email from Noah, a sysadmin I met at LISA (in 2006. Whoah. Anyway:) saying he was going to be in Bellingham for Linux Fest Northwest, and did I want to meet up? Talked it over with Clara, and she was up for it, so sure, why not?

We headed through the Aldergrove border crossing after hearing that the Peace Arch was on fire or some such, and got through in about five minutes. And after four hundred questions from the kids about were we there yet? is this America? why are you turning around here? we found where we needed to be quite by accident, a good two hours earlier than we figured we'd be. Met up with Noah who's sporting the most awesome 70s Mountain Man look, and agreed to meet up again for lunch -- we had kids to maintain, and figured we'd hit a park or something to run them for a bit.

Walked around the vendor area for a bit first, which showed that someone had managed to get my number: not only the FSF and the EFF, but a guy with an automated homebrewing setup controlled by Linux and the local astronomer's club. The kids got lots of stickers, and someone gave my wife a Pear Linux install CD. ("Why do I want this? I've got Ubuntu.")

We drove off looking for a park but couldn't find one; instead, we went to Trader Joe's. Clara took my oldest son inside while I stayed out minding my youngest, who'd managed to fall asleep on the way. There was also a toy store beside it, and when the youngest woke up he and his brother headed in there to spend their one American dollar each while I walked through TJ's. And holy crap: America's alcohol selection is incredible. A box of wine for $9. Lots of beer I had not seen, and for damn good prices too. I bought up a bunch, threw it in the trunk of the car and then met up with the kids who were just about to buy a big bag of army men. "I love America," sighed the youngest.

We drove down to Boundary Bay Brewing for lunch with Noah and Sarah. Clara had the ESB (okay, but we agreed that Central City's was better), Noah and I had the single-hopped Amarillo Pale Ale (wonderful) and Sarah got the Tripel (which was awesome; a light version with noticeable coriander, not like the dark/fruity dubbels I've brewed). Oh, and the food was pretty good, too. We caught up with each other, talked about life in small hick towns, and just had a grand old time.

Finally it was time to go. I thought I'd brought down a growler, but no such luck, and we decided we'd stick with the beer we got at TJ's. I gave Noah and Sarah the two bottles of homebrew (#46, a dubbel, and #47, an IPA) we'd brought them, and we said goodbye. The drive back was completely uneventful except for paying duty at the border, which basically doubled the amount we paid for the alcohol.

All in all, a damn good time. Next year I may even go to the actual talks...

No tags
Random Reading
14th April 2013

Oh, what a week. I'll write about that later. Inna meantime:

No tags
Week of DOOOOOOOOM
5th April 2013

Wednesday: A very important fileserver panicked and rebooted, apropos of nothing. I can't figure out why.

Thursday: Around 1.30am, a disk array at $WORK noticed one of its drives was likely to fail shortly. It got very excited and sent me one hundred and fifty (150) (not exaggerating) text messages. When I got to work I failed the drive, put the spare into the array, the array started rebuilding, and I called Dell about 10am to arrange for a replacement to be sent out the next day (that is, Friday -- today).

When the rebuild was done it complained that another drive was likely to fail shortly. I contacted Dell and was told that the complaint about the second drive was a) misguided (it wasn't really failing) and b) really meant that the array (that is, /share/networkscratch) was likely to fail entirely. They called this a punctured stripe and there are more than a few complaints about this terminology. Anyhow. The only solution was to back up the data, delete the array, recreate it and restore from backup. "Everybody out of the pool!"

About 6pm last night the process was finally done, but the array still complained that the drive was going to fail soon. I contacted Dell again, and after looking at the array they decided that the second drive really was failing after all -- in fact, it had probably failed first, the array had been compensating for it all this time, and its problem only became evident when the other drive failed. A second replacement drive is due to arrive Monday; it was too late by this time to have it arrive today.

I brought up the server, restored the 2am backup to some spare space, and went home; this was about 9.15pm.

Friday: a long-running (ie, monthlong) rsync process decided to suck up all the memory on our webserver. It had to be forcibly rebooted.

And now I want a beer.

2 comments. Tags: sysadmin.
Observing Report -- March 31, 2013
1st April 2013

Last night, after a beautifully clear day spent with family, I drove out to Boundary Bay to observe. It's near a small airport south of Vancouver, and it's far enough away that you can see the light dome of the city rather than be enveloped in it. I've been there once or twice before for the star parties that the local RASC chapter has here (last Saturday of the month; check Twitter for updates), but not to observe. I've only been out of the city once a year or so for semi-dark skies, so I thought it was the right time: a four-day weekend with an unconscionably gorgeous stretch of weather.

I'd packed up earlier, so after saying goodnight to the family (and making a note to try for some of the Virgo Messiers) I hopped in the car and drove off. Even with a wrong turn it was only half an hour there (35km), and I arrived about twenty minutes past sunset. I parked, finished up my tea and set up the scope and table (plus a cardboard flat from Costco that had held coffee to use as a dew shield, which worked amazingly well). I don't usually drive to observe, and I was pleasantly surprised at how quickly everything was going. The only downside was that by the time I remembered I'd wanted to collimate (something I don't do nearly often enough), it was too dark for me to see in the eyepiece...a laser collimator would definitely be nice. I'd also forgotten the dew shield for the scope, but it turned out not to be a problem.

Twilight deepened; I listened to the wildlife. There are a ton of birds there -- I saw a heron not five meters away -- and it was enchanting to think "What's that weird sound?" and realize it was the hissing of a flock of birds going by. The stars came out, and I was surprised at how high Sirius was: enough to see the whole of Canis Major, and Crater and Corvus -- constellations I never saw before. The horizon there is flat nearly all the way around, and that is such a change from my usual location. More than that, though, it was darker (even a half hour before twilight was over) than I ever see at my usual place (a suburban park with no shield from the many streetlights). I resolved to come back in the summer to look at Sagittarius and Scorpius.

Finally it was dark enough to look for Comet PANSTARRS. I hadn't prepared much, but I knew that it was near M31. A short pan around in the binos, and there it was about 6 degrees below the galaxy -- almost in the same FOV. I was just able to make out the tail with direct vision, and it became an obvious fan in averted vision. Viewing it in the scope at 48X brought out more of the fan and made it brighter, but it was still better in AV. In both the binos and the scope, the nucleus was obvious and pointlike.

By this time twilight was over. I took a moment to see how dark it was (mag 5 with AV), then for fun pointed the binos at Sirius. Could I see...yes, I could: M41 (below the trees where I usually observe), M46 and M47 (which I'd had the devil's own time trying to find this winter). I took a look at them through the scope, too. I don't remember much about M41, but it was pretty enough. M47 was sparse, stars in obvious chains and arcs. M46, though...wow: a cloudy scattering (obtypo: scattery, which I think sounds really cool) of faint stars, almost glob-like in the way it was just on the verge of resolving. Almost as good as M11, the Wild Duck cluster, and that's saying something.

A couple had parked earlier to go for a walk, and at this point they came back. I asked if they wanted a look through the scope, and they were happy to do so. I showed them Jupiter, M42 and the Pleiades; they were amazed. We talked for a while longer, and I told them about the observing parties the RASC puts on. Hopefully they'll make it out the next time.

It was 10pm by this point, and I decided it was time to try for M65 and M66. These pretty much skunked me the last two times I tried for them, and I was trying not to get my hopes up about seeing them here. But YES: in the binos, if I held them steady, they showed up with AV, and through the scope at 30X with AV. Awesome! Bumped up the power to 48X and saw them both with DV, faint but there. Not only that, but I was just able to pick out NGC 3628 at 100X with AV and complete the Leo Triplet. At 160X I could see a definite nucleus to M66, but no features on M65. Man, I was happy about this.

Well, if I can get those three, let's move on, right? I went for M51 next. It might have shown up with AV in steadied binos, but it was obvious (and obviously two parts) at 30X in the scope. At 100X a satellite went through the FOV, which always makes me smile. At 160X it almost seemed like one of the parts -- the main galaxy, I think -- had a starlike nucleus. The two parts were definitely separated by now, but I could not see any spirals or any sign of the bridge between them. Still, this was another galaxy that had skunked me the last time I'd tried for it, and I was really pumped about finally seeing it. (BTW, this sketch of M51 through a 28" reflector is incredible.)

Saturn was up, though still very low, and I took the chance to see it. Lovely; no sign of the Cassini division, which was not surprising.

At this point I realized that M63 was close to M51. Should I try for it? Why the hell not? And again, obvious at 30X; 48X showed a slight brightening on one side, I think.

It was getting late, and the caffeine was starting to wear out, but I wanted to try one more thing: I'd printed out setting circle locations for M84, and I wanted to try dialing it in. I didn't hold out much hope for it, since I'd had such mixed results with setting circles previously. But what the hell...140 degrees azimuth, 47.8 degrees altitude...look through the 40mm eyepiece, move it around a bit -- and holy shit it's there: a dim but obvious elliptical. Success!

Now at this point I ran into difficulties: yes, I'd found a galaxy. But confirming that it was M84 was tough. (I was happy to have seen anything, but I wanted to know what it was I was looking at. Plus, I was hoping to see more of Markarian's chain.) Have you ever looked at a chart for the Virgo area? I had, but I hadn't paid attention. It's a mass of galaxies and labels, with a handful of faint stars thrown in. It was extremely difficult to see what I was looking at. I sketched the area as best I could, then closed up shop at midnight and headed home; I took the wrong turn but still made the trip in a reasonable time.

Looking more closely at this today, I'm fairly sure that what I actually saw was M87, not M84; there's a slight J of three stars due south in my sketch, and a rectangle of stars to the east. And M87 is only a degree off from M84, so I was definitely in the right neighbourhood. I'm going to call it M87. Too bad it wasn't part of Markarian's chain. I really need to start making these tough observations earlier in the night.

So: it was an amazing night. The dew wasn't a problem thanks to the shield; the horizon was simply incredible to see; I did my good deed for the day with some sidewalk astronomy; I found a comet; the setting circles worked; and I saw an assortment of galaxies and clusters that I haven't been able to see at home. I feel bad about using the car, but it really was wonderful to see all these things. My lovely wife ran interference with the kids this morning and let me sleep in 'til 9am. I had a great time with the kids despite the messed-up sleep (us old folks need their rest), and when I got cranky and stupid later in the day I held my tongue and did not lose my temper. (Now that I'm proud of.) Only thing missing is an observing partner...it would be smart to go there with someone else.

I've added seven Messier objects to my list: M41, M46, M51 (which I'd checked off before, but I don't think that's right), M65, M66, M63 and M87. That brings me up to 40 out of the list -- not bad at all.

Tags: astronomy.
Chasing the Dragon (sorry, couldn't resist)
26th March 2013

It was a rare clearish morning here in New West, and I just saw the ISS and the SpaceX Dragon fly over my house! How awesome is that? There were light, patchy clouds overhead, but the ISS was still bright and visible through them -- and then fainter, just a little bit ahead, was the Dragon capsule! It's undocking right now, and it's amazing luck that I got to see it. (Swoon...)

In other news: yesterday I got sleep, it was sunny, I forgot a couple of things that I should have been working on, and the resulting optimism led me to migrate the sysadmin wiki for the third time. This time it was from Foswiki to Ikiwiki. I have nothing against Foswiki except that I really, REALLY want to edit everything from Emacs; for FW, that means this complicated wrapper around sudo that was getting tiring. Now it's Git + Emacs + Multimarkdown and I am happy.

Not only that, I got a long-standing feature request (one that I made to myself) out of the way: I can now check in, in Emacs Orgmode, to a particular RT ticket when replying to that ticket. (waves hands around in insane manner) Don't you see what this MEANS? Previously I'd have to switch to Emacs, refresh the rtliberation view which'd take 5 seconds (SO BORED), run a command to add it to my Org file, switch to Org, find the new addition, check in and THEN switch back to Mutt and reply. Now it's all in Emacs. It means a new life for ALL of us, baby! You'll see!

This entry brought to you by not enough sleep, excitement about spaceflight, Emacs geekery and a mug full of coffee.

Tags: astronomy, sysadmin.
Random stuff
21st March 2013

I came to PyCon with two women colleagues, one of whom was harassed nearly constantly by men, albeit on a low level. Both of them are friendly people who are willing to engage at both a personal and a technical level with others, and apparently that signals to some that they can now feel free to comment on "hotness", proposition them, and otherwise act like 14 year old guys. As one friend said, (paraphrased) "I'd be more flattered that they seem to want to sleep with me, if they'd indicated any interest in me as a human being -- you know, asked me why I was at PyCon, what I did, what I worked on, what I thought about things. But they didn't."

And:

As a community, we need to change the way we treat women, because my daughters will TASER YOU ALL INTO OBLIVION in 10-20 years if we don't.

Tags: space.
Epicycles
13th March 2013

Yesterday I was asked to restore a backup for a Windows desktop, and I couldn't: I'd been backing up "Documents and Settings", not "Users". The former is appropriate for XP, which this workstation'd had at some point, but not Windows 7 which it had now. I'd missed the 286-byte size of full backups. Luckily the user had another way to retrieve his data. But I felt pretty sick for a while; still do.

When shit like this happens, I try to come up with a Nagios test to watch for it. It's the regression test for sysadmins: is Nagios okay? Then at least you aren't repeating any mistakes. But how the hell do I test for this case? I'm not sure when the change happened, because the full backups I had (going back three months; our usual policy) were all 286 bytes. I thought I could settle for "alert me about full backups under...oh, I dunno, 100KB." But a search for that in the catalog turns up maybe ten or so, nine of them legitimate, meaning an alert for this will give 90% false positives.

So all right, a list of exceptions. Except that needs to be maintained. So imagine this sequence:

  1. A tiny filesystem is being backed up, and it's on the don't-bug-me-if-it's-small list.
  2. It actually starts holding files, which are now backed up, so it's probably important.
  3. But I don't update the don't-bug-me-if-it's-small list.
  4. Something goes wrong and the backups go back to being small.
  5. Someone requests the restore, and I can't provide it.
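Step 3 in that sequence is the weak point, and it is at least possible to have the machine nag about it. What follows is a minimal sketch, not anything from the actual setup: it assumes the exception list is a plain file with one job name per line, and that backup sizes can be exported as "job size_in_bytes" lines (both formats are hypothetical, as is the function name `stale_exceptions`). It prints every listed job that has produced at least one backup at or above the threshold, meaning it has probably outgrown the list.

```shell
# stale_exceptions EXCEPTIONS_FILE [THRESHOLD_BYTES]
# stdin: "job size_in_bytes" lines (hypothetical export format).
# Prints every job on the exception list that has produced at least one
# backup at or above the threshold (default 102400 bytes, i.e. 100KB).
stale_exceptions() {
    awk -v limit="${2:-102400}" '
        NR == FNR { listed[$1] = 1; next }   # first file: the exception list
        ($1 in listed) && ($2 + 0 >= limit) { stale[$1] = 1 }
        END { for (job in stale) print job }
    ' "$1" - | sort
}
```

Fed a list containing "scratch" and a size export in which scratch has one 5MB backup, this prints "scratch": a hint to take it off the list before step 4 happens.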

I need some way of saying "Oh, that's unusual..." Which makes me think of statistics, which I don't understand very well, and I start to think this is a bigger task than I realize and I'm maybe trying to create AI in a Bash script.
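For what it's worth, a crude version of "that's unusual" doesn't need much statistics: compare each job's latest full backup against the median of that job's own earlier backups. The sketch below assumes sizes arrive as "job size_in_bytes" lines, oldest first (a hypothetical export format; the function name `unusual_backups` and the default ratio of 10 are also made up for illustration). Note the limit: it only catches a change, so a job that has been tiny for the whole retention window still slips through.

```shell
# unusual_backups [RATIO]
# stdin: "job size_in_bytes" lines, oldest first (hypothetical format).
# Prints jobs whose latest backup is smaller than
# median(previous backups) / RATIO. RATIO defaults to 10.
unusual_backups() {
    awk -v ratio="${1:-10}" '
        NF < 2 { next }                          # skip blank or short lines
        { n[$1]++; size[$1, n[$1]] = $2 + 0 }
        END {
            for (job in n) {
                count = n[job] - 1               # number of earlier backups
                if (count < 1) continue          # no history to compare with
                # Copy the earlier sizes and insertion-sort them.
                for (i = 1; i <= count; i++) s[i] = size[job, i]
                for (i = 2; i <= count; i++) {
                    v = s[i]
                    for (j = i - 1; j >= 1 && s[j] > v; j--) s[j + 1] = s[j]
                    s[j + 1] = v
                }
                if (count % 2) { median = s[(count + 1) / 2] }
                else { median = (s[count / 2] + s[count / 2 + 1]) / 2 }
                latest = size[job, n[job]]
                if (latest * ratio < median) {
                    printf "%s latest=%.0f median=%.0f\n", job, latest, median
                }
            }
        }
    ' | sort
}
```

Using the job's own history instead of a fixed 100KB cutoff means legitimately small filesystems stay quiet without an exception list, while a job that drops from megabytes to 286 bytes gets flagged.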

And really, I've got don't-bug-me-if-this lists, and local checks and exceptions, and I've documented things as well as I can but it's never enough. I've tried hard to make things easy for my eventual successor (I'm not switching jobs any time soon; just thinking of the future), and if not easy then at least documented, but I have this nagging feeling that she'll look at all this and just shake her head, the way I've done at other setups. It feels like this baroque, Balkanized, over-intricate set of kludges, special cases, homebrown scripts littered with FIXMEs and I don't know what-all. I've got Nagios invoking Bacula, and Cfengine managing some but not all, and it just feels overgrown. Weedy. Some days I don't know the way out.

And the stupid part is that NONE OF THIS WOULD HAVE FIXED THE ORIGINAL PROBLEM: I screwed up and did not adjust the files I was backing up for a client. And that realization -- that after cycling through all these dark worryings about how I'm doing my job, I'm right back where I started, a gutkick suspicion that I shouldn't be allowed to do what I do and I can't even begin to make a go at fixing things -- that is one hell of a way to end a day at work.

4 comments. Tags: backups, sysadmin.
Observing Report -- March 8, 2013
8th March 2013

All right, this was a frustrating night. It was clear and moonless and lovely, but I spent much of my time looking for faint fuzzies and not finding them. Which is probably not surprising since I'm in the goram suburbs, but still. I prepared for this run by getting my maps out in advance, and printing out sketches of stuff I was looking for...but all for nought. NOUGHT, I say.

M33: Just for fun, before it sank out of sight. No.

M1: No, 2x. First with the manual setting circles, and then another attempt following an actual chart. The second attempt actually got me in the right area, but I still couldn't see it. I realized that the chart I'd printed out seemed to be about 5 degrees off in azimuth -- not good. Same thing happened with M50. (Pretty sure I aligned with Polaris this time...)

X Cancri: A carbon star. Found, hurrah! Very nice.

Castor: Resolved to double at 100X, and maybe to triple at 160X. Interestingly, it looked like a sheaf of wheat at 160X. I wish I'd collimated before heading out.

M66: I spent a long time tracking this down, and the best I could get was maybe-possibly with averted vision. Maybe.

Other than a mandatory quick look at Jupiter, that was pretty much it. Not much found, not much accomplished. It feels a bit like when I was first starting out: can't find things, can't see them when I do. Grrr.

Tags: astronomy.
Troubleshooting deferred jobs, episode 80
7th March 2013

The other day at $WORK, a user asked me why the jobs she was submitting to the cluster were being deferred. They only needed one core each, and showq showed lots free, so WTF?

By the time I checked on the state of these deferred jobs, the jobs were already running -- and yeah, there were lots of cores free. The checkjob command showed something interesting, though:

$ checkjob 34141 | grep Messages
Messages:  cannot start job - RM failure, rc: 15041, msg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN'

I thought this was from the node that the job was on now:

$ qstat -f 34141 | grep exec_host
exec_host = compute-3-5/19

but that was a red herring. (I could've also got the host from "checkjob 34141 | grep -2 'Allocated Nodes'".) Instead, grepping through maui.log showed that it had been compute-1-11 that was the real problem:

/opt/maui/log $ sudo grep 34141 maui.log.3 maui.log.2 maui.log.1 maui.log |grep -E 'WARN|ERROR'
maui.log.3:03/05 16:21:48 ERROR:    job '34141' has NULL WCLimit field
maui.log.3:03/05 16:21:48 ERROR:    job '34141' has NULL WCLimit field
maui.log.3:03/05 16:21:50 ERROR:    job '34141' cannot be started: (rc: 15041  errmsg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN'  hostlist: 'compute-1-11')
maui.log.3:03/05 16:21:50 WARNING:  cannot start job '34141' through resource manager
maui.log.3:03/05 16:21:50 ERROR:    cannot start job '34141' in partition DEFAULT
maui.log.3:03/05 17:21:56 ERROR:    job '34141' cannot be started: (rc: 15041  errmsg: 'Execution server rejected request MSG=cannot send job to mom, state=PRERUN'  hostlist: 'compute-1-11')

There were lots of messages like this; I think the scheduler only gave up on that node much later (hours).
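The same grep can be boiled down to just the hosts that rejected the job. This is a rough sketch that assumes the "cannot be started ... hostlist: '...'" format shown in the excerpt above; the function name `rejecting_hosts` is made up, and it has only been thought through against lines shaped like those.

```shell
# rejecting_hosts JOBID [LOGFILE...]
# Prints "count host" for each hostlist named in "cannot be started"
# errors for JOBID. Reads stdin when no log files are given.
rejecting_hosts() {
    jobid="$1"
    shift
    grep -h "job '$jobid' cannot be started" "$@" |
        sed -n "s/.*hostlist: '\([^']*\)'.*/\1/p" |
        sort | uniq -c | awk '{ print $1, $2 }'
}
```

Run from /opt/maui/log as something like "sudo cat maui.log* | rejecting_hosts 34141", it should print one line per host with the number of failed starts, which points at the bad node without wading through the WCLimit noise.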

checknode showed nothing wrong; in fact, it was running a job currently and had 4 free cores:

$ checknode compute-1-11
checking node compute-1-11

State:      Busy  (in current state for 6:23:11:32)
Configured Resources: PROCS: 12  MEM: 47G  SWAP: 46G  DISK: 1M
Utilized   Resources: PROCS: 8
Dedicated  Resources: PROCS: 8
Opsys:         linux  Arch:      [NONE]
Speed:      1.00  Load:      13.610
Network:    [DEFAULT]
Features:   [NONE]
Attributes: [Batch]
Classes:    [default 0:12]

Total Time:   INFINITY  Up:   INFINITY (98.74%)  Active:   INFINITY (18.08%)

Reservations:
  Job '33849'(8)  -6:23:12:03 -> 93:00:47:56 (99:23:59:59)
JobList:  33849

maui.log showed an alert:

maui.log.10:03/03 22:32:26 ALERT: RM state corruption. job '34001' has idle node 'compute-1-11' allocated (node forced to active state)

but that was another red herring; this is common and benign.

dmesg on compute-1-11 showed the problem:

compute-1-11 $ dmesg | tail
sd 0:0:0:0: SCSI error: return code = 0x08000002
sda: Current: sense key: Hardware Error
<<vendor>> ASC=0x80 ASCQ=0x87ASC=0x80 <<vendor>> ASCQ=0x87

Info fld=0x10489
end_request: I/O error, dev sda, sector 66697
Aborting journal on device sda1.
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
(Linux)|Wed Mar 06 09:37:20|[compute-1-11:~]$ mount
/dev/sda1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda5 on /state/partition1 type ext3 (rw)
/dev/sda2 on /var type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
sophie:/export/scratch on /share/networkscratch type nfs (rw,addr=10.1.1.1)

mount: warning /etc/mtab is not writable (e.g. read-only filesystem).
   It's possible that information reported by mount(8) is not
   up to date. For actual information about system mount points
   check the /proc/mounts file.

but this was also logged on the head node in /var/log/messages:

$ sudo grep compute-1-11.local /var/log/* |grep -vE 'automount|snmpd|qmgr|smtp|pam_unix|Accepted publickey' > ~/rt_1526/compute-1-11.syslog
/var/log/messages:Mar  6 00:26:22 compute-1-11.local pbs_mom: LOG_ERROR::Read-only file system (30) in job_purge, Unlink of job file failed
/var/log/messages:Mar  6 00:26:22 compute-1-11.local pbs_mom: LOG_ERROR::Read-only file system (30) in remtree, unlink failed on /opt/torque/mom_priv/jobs/34038.sophie.TK
/var/log/messages:Mar  6 00:26:22 compute-1-11.local pbs_mom: LOG_ERROR::Read-only file system (30) in job_purge, Unlink of job file failed

and in /var/log/kern:

$ sudo tail /var/log/kern
Mar  5 10:05:00 compute-1-11.local kernel: Aborting journal on device sda1.
Mar  5 10:05:01 compute-1-11.local kernel: ext3_abort called.
Mar  5 10:05:01 compute-1-11.local kernel: EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Mar  5 10:05:01 compute-1-11.local kernel: Remounting filesystem read-only
Mar  7 05:18:06 compute-1-11.local kernel: Memory for crash kernel (0x0 to 0x0) notwithin permissible range
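Note that the mount output above still lists / as (rw): mount reads /etc/mtab, which can't be updated once the filesystem is read-only, so /proc/mounts is the place to look. A check along these lines could catch the condition before users do. It's only a sketch: the function names are hypothetical, it only considers /dev/-backed filesystems (an assumption that would need adjusting for anything deliberately mounted read-only), and the exit codes follow the usual Nagios plugin convention of 0 for OK and 2 for CRITICAL.

```shell
# readonly_mounts [MOUNTS_FILE]
# Prints the mount point of every /dev/-backed filesystem whose option
# list contains "ro". Fields: device, mount point, type, options, ...
readonly_mounts() {
    awk '$1 ~ /^\/dev\// {
        n = split($4, opt, ",")
        for (i = 1; i <= n; i++)
            if (opt[i] == "ro") { print $2; break }
    }' "${1:-/proc/mounts}"
}

# check_readonly [MOUNTS_FILE]
# Nagios-style wrapper: returns 0 (OK) or 2 (CRITICAL).
check_readonly() {
    ro="$(readonly_mounts "$@")"
    if [ -n "$ro" ]; then
        echo "CRITICAL: read-only: $(echo $ro)"
        return 2
    fi
    echo "OK: no read-only /dev filesystems"
}
```

With no argument it reads /proc/mounts on the node it runs on; passing a file makes it easy to try against a saved copy.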

There are a few things I've learned from this:

I've started to put some of these commands in a sub -- that's a really awesome framework from 37signals to collect commonly-used commands together. In this case, I've named the sub "sophie", after the cluster I work on (named in turn after the daughter of the PI). You can find it on github or my own server (github is great, but what happens when it goes away? ...but that's a rant for another day). Right now there are only a few things in there, and they're somewhat specific to my environment, and doubtless they could be improved -- but it's helping a lot so far.

Tags: hpc, rocks, sysadmin.
