The Life of a Sysadmin

Carousel is a lie!

Entries tagged "hardware".

The Jerker
28 July 2002 12:00:00 PST

So I just moved into a new place with my wife: main-floor suite of a house, tons more space than the one-bedroom apartment we had. Went to Ikea today and got a new desk: The Jerker (no, really). And is this baby ever sweet!

It's $144 (Canadian), which was one of the cheapest desks around, and it's absolutely perfect for my needs. For a start, it's rock fucking solid. Even putting it together, when I only had the uprights and one crosspiece bolted together, it wasn't wobbly in the least. For another, it's got a huge expanse of desk area, both wide and deep; this is nice, since I've got a big-assed 21" monitor (free, but another story). Third, it's got a shelf above for books and dippin'. Fourth, it's all adjustable: you bolt the shelf and desk plank (what the hell's the right word? Top, I suppose) into holes in the uprights, spaced at 1" intervals.

The only thing this is missing is a hole for cables, but that's a minor complaint. Also, there are no drawers or CD holders included, but that's all good for me; I hate 'em.

Original entry.

Tags: hardware.
Fucking Spammers
25 September 2002 12:00:00 PST

Update time.

I got into work today and found that the mail server had just come up after *half a fucking hour* of being down because of the insane load placed on it by spam -- just spam -- coming in. The owner of the company couldn't send email. I started setting up the new mail server.

And it was nice. I got to go away, away from the help desk, sit down and figure out how to make it work. FreeBSD's vinum + Promise RAID controller == kernel panic (details later on). Finally got vinum figured out -- I've only worked w/it once before -- and before I was grabbed back to the help desk, I had the disk setup about 80% done.

So some more details: there are 4 x 40GB Maxtor IDE drives. (Yeah yeah yeah SCSI.) We've got an onboard Promise controller chip; I'll put in the mobo tomorrow and make this all seamless. First, it turns out we've got the Promise Lite (Less Filling!) BIOS, which means we can only have one (1) array of two disks; the other two disks can be single arrays on their own, which is useful in some alternate universe I'm sure. So okay, try setting up one mirrored (Raid 1? 0? I can't keep 'em straight) array, and we'll use vinum to tie it together with the other single drives...

Only as soon as I try using vinum to do _anything_ with the Promise'd arrays, BANG: kernel panic. This is 4.6, not the latest (4.7RC1 as I type), but still. Arghh. Doesn't matter whether vinum tries raid 0, 1 or 5 -- just panics right away. If I had more time and a box of my own to fool around with, I'd try [Michael Lucas'|http://www.oreillynet.com/pub/a/bsd/2002/03/21/Big_Scary_Daemons.html] (Buy his book!) and contribute something useful to the FreeBSD folk. Alas, it's not my box or my time, and if I were to post this message to freebsd-hackers-important-vinum-people tomorrow I'd (deservedly) get laughed at so hard I'd feel it over the ether.
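For the record, a mirrored array is RAID 1; RAID 0 is plain striping with no redundancy at all. One way to keep them straight is to look at what each level leaves you with from those four 40GB Maxtors. Here's a back-of-the-envelope sketch in Python (the function name is made up for illustration); it ignores vinum's own metadata overhead and the gap between marketing gigabytes and what the OS reports.

```python
def usable_capacity_gb(level: int, disks: int, disk_gb: float) -> float:
    """Rough usable space for an array of identical disks at a given RAID level."""
    if level == 0:
        # Striping: every byte on every disk holds data, no redundancy.
        if disks < 2:
            raise ValueError("RAID 0 needs at least 2 disks")
        return disks * disk_gb
    if level == 1:
        # Mirroring: every disk holds a full copy, so you get one disk's worth.
        if disks < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        return disk_gb
    if level == 5:
        # Striping with parity: one disk's worth of space goes to parity.
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        return (disks - 1) * disk_gb
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    for level in (0, 1, 5):
        print(f"RAID {level}: {usable_capacity_gb(level, 4, 40):g} GB usable from 4 x 40GB")
```

So the Promise Lite BIOS's single two-disk mirror buys you 40GB, while RAID 5 across all four drives would give 120GB.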

Anyway. Point is I can't get vinum to play nice w/the Promise'd chip even as an IDE controller. The BIOS of the box allows you to turn the Promise chip on, off, or to ATA/IDE; but even set to the latter, it panics once vinum touches /dev/ar*. You have been warned.

So I got vinum using the four drives on the first two IDE channels instead, and that worked fine once I learned the intricacies of disklabel (set type to vinum, kids!) and vinum init (and that takes a long time w/3*35GB partitions^H^H^H^H^H^H^H^H^subsooperplexen). 1 5m 5o 133t!

OT: One of my side notes was going to be about how I'm posting this w/Lynx 'cos Mozilla won't let me use vi, editor of the Elder Gods, as an editor. Then I realized I could have just fired up a shell and used vi in there. Sigh. Rumours of my cleverness have been exaggerated.

Original entry.

Tags: bsd, hardware, spam.
e133t HA0rZ!
04 February 2003 12:00:00 PST

So here it is, 8.30pm, and I'm restoring a Cobalt Raq 4 to something approaching virginity. It belongs to a colo'd customer, and it got cracked; we offered, for a modest cost, to restore it, and here I am.

It's Linux under the hood of course -- Red Hat, or at least they use RPM -- and it's interesting to see what's been done with it. The management page is pretty slick, though it always leaves me wanting to log on. To do that, I need to telnet -- shudder -- and of course the cust. hasn't got SSH on it. (Confirmation that we had a cracker was nmap showing lots of open ports that responded with an SSH banner. Seems weird to me that a cracker would install ssh, but oh well.) But all the web functionality seems to be there, and it seems pretty and easy to use.

The cust. kept up to date with the patches from Sun (part of what I'm reinstalling right now), but I think there are still a few holes; I'm pretty sure there's an old version of Apache, for instance. And would it kill them to have OpenSSH? Or firewalling tools?

Anyhow, it's the first time I've worked with an automatic patch installer that wasn't Windows, and I must admit I'm impressed. Download the patch -- which is a tarball of script + rpms + patches -- clicky-click install on the web interface, and away you go. I'm sure it's not news for most of you, but it's neat for me. The only thing is that it reboots between a lot of them -- c'mon guys, I thought this was Linux! :-)

Random idea for a program: I'm hooked up to this thing by a crossover cable to another Linux box, just to keep it off the 'net while it's having everything reinstalled. I telnet in occasionally to make sure things are working, but the damn prompt always takes so long to come up. It's the Raq doing a reverse DNS lookup on my IP address, of course, but because it's just on an Xover cable it sits there until the queries time out. We're talking a minute or so to time out, which is unacceptable. I'm an important man, after all.

So my idea is to have a program listening for queries like that and answering them, masquerading as whatever DNS server the query was directed at. Basically, just fake 'em out with whatever info they want. In cases like this (which I can see coming up, oh, at least once a year), it'd speed things up immensely. Anyone heard of anything like this, or is it just full of Crak(tm)?
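Here's a minimal sketch of the idea in Python, under a few assumptions: the box on the other end of the crossover cable has the Raq's configured DNS server address aliased onto its interface, the script runs as root so it can bind port 53, and instead of inventing real answers it replies NXDOMAIN ("no such name") to everything. That's a swap from "fake 'em out with whatever info they want", but for a reverse lookup an immediate NXDOMAIN should be enough to stop the resolver from sitting there until it times out. All the function names are made up for illustration.

```python
import socket
import struct

HEADER_LEN = 12  # fixed-size DNS message header


def question_end(packet: bytes, offset: int = HEADER_LEN) -> int:
    """Return the offset just past the first question (QNAME + QTYPE + QCLASS)."""
    while True:
        length = packet[offset]
        if (length & 0xC0) == 0xC0:
            # Compression pointer: two bytes end the name.
            return offset + 2 + 4
        if length == 0:
            # Zero-length root label ends the name; QTYPE and QCLASS follow.
            return offset + 1 + 4
        offset += 1 + length


def build_nxdomain_reply(query: bytes) -> bytes:
    """Answer a single-question query with NXDOMAIN, echoing the question back."""
    if len(query) < HEADER_LEN:
        raise ValueError("too short to be a DNS query")
    txid, flags, qdcount = struct.unpack("!HHH", query[:6])
    if qdcount != 1:
        raise ValueError("only single-question queries are handled")
    end = question_end(query)
    if end > len(query):
        raise ValueError("truncated question section")
    # QR=1 (response), copy Opcode and RD from the query, set RA, RCODE=3 (NXDOMAIN).
    reply_flags = 0x8000 | (flags & 0x7900) | 0x0080 | 0x0003
    # One question echoed, no answer/authority/additional records.
    header = struct.pack("!HHHHHH", txid, reply_flags, 1, 0, 0, 0)
    return header + query[HEADER_LEN:end]


def serve(address: str = "0.0.0.0", port: int = 53) -> None:
    """Answer every UDP query on address:port until interrupted."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((address, port))
    while True:
        data, client = sock.recvfrom(512)
        try:
            sock.sendto(build_nxdomain_reply(data), client)
        except (ValueError, IndexError):
            pass  # ignore anything that doesn't parse as a simple query
```

Since it doesn't care what name it's asked about, it answers forward lookups with NXDOMAIN too, so it's strictly a keep-it-off-the-net tool, not something to leave running on a real network.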

...urghh. Just rebooted for a patch that allegedly fixes Apache and OpenSSL problems. Why the hell does that need a reboot?

Original entry.

Tags: hardware, linux.
Hardware demons
30 June 2003 12:00:00 PST

Jesus Christ. Every time I mess around with hardware or upgrades, I swear I'll never do it again. Then I forget.

My first computer, bought eight years ago now, was a 486 w/16MB of RAM and some amount of HD space. I installed Slackware on it, got a 33.6 modem, and had email and net access. Then a roommate sold me his old P90. It crashed constantly until I figured out I had set the CPU voltage wrong. It took me a long time to figure that out, and I was nearly ready to hurl the thing out the window.

A few years later I upgraded to my current desktop machine, a 333 Celeron overclocked to 450 MHz. The machine is fine unless I open up the case to add/remove/shift something in it; then it will, for a day, spontaneously reboot. I've checked it for shorts and can't find any. I don't know what I'm missing, but I'm sure it would be obvious to someone else.

And now the latest. My wife bought an iMac from her old work a few years ago, and has had problems w/it since. It just crashes for no good reason. It'll work fine for two weeks, then she can't keep it running for more than an hour. So last week I went out and bought her a fairly skookum machine: Athlon 2600 (I think...details to follow), ECS K7S5A mobo, 60GB HD and 256 MB RAM.

I got it all home and assembled it. The mobo and Red Hat 9 (not my favourite, but great for my wife) called the CPU a 2000 (1.6GHz instead of 2.0), so I looked around and decided a BIOS upgrade would be in order. Did that and promptly lost the back USB -- bad, since her keyboard and mouse are USB. The front ones, hooked up to the pins on the motherboard, still worked. Tried rolling the BIOS back, but nothing: the back, onboard USB just didn't go. Fuck.

So I went out and got some additional USB risers a few days later. I added them; no problem. Then I had to add a connector from the CDROM's audio to the motherboard. I made the mistake of removing one of the USB connectors while the power was still on. Didn't even think; just did it. Now the BIOS freezes at "Checking NVRAM...". Flashed the CMOS half a dozen times, left it off most of the night while we went to see Finding Nemo (not as good as Monsters, Inc., but still well worth it), and no change.

Today I'm going to stop by my new [hardware supplier of choice|http://www.ntcw.com/] (well recommended if you're in Vancouver; prices nearly as good as Atic, but much better service) to pick up a Gigabyte 7VAX. We'll see if I got ripped off on the CPU or what.

Mostly, though, I am not going to fuck with this computer again. I mean it this time.

Original entry.

Tags: hardware, linux, upgrades.
Gloria!
2004-08-16 14:38:56

My wife and I kinda made an impulse purchase on the weekend: a new 12" iBook G4. It was weird: I made a joke about buying a laptop. Then I explained that I was only joking, but if we were going to buy one it should be an iBook since I kept hearing how sweet they were. Then we were going to go to Stanley Park, hang out at the beach, but maybe go to London Drugs (I don't know about you Americans, but in Canada we go to the drugstore for everything...car insurance, furniture, computers, you name it. Oh, and occasionally prescriptions) to see what prices were like. Then we were buying one. It all happened so fast.

So far, it's pretty damned impressive. After all the trouble I had to go to get gphoto to work with our digital camera, my wife just plugged it in here and it worked with iPhoto right away. Not only that, but we were looking at a slideshow of the crack-induced photos we'd taken while Fur Elise played in the background. Fucking unreal, man.

It's weird: I do feel a bit like I've made a deal with the devil. I've come to agree more and more with RMS about Free-as-in-Freedom, and here I am with a closed-source OS. Yada-yada-Darwin, what about Aqua? But it's sooooo nice...well, mostly, anyway.

I'm trying to use MacStumbler at the moment to find a wireless network to hook up to, but no luck: it just sits there, looking like it's scanning but with no more feedback than a scrolling bar. Dammit, I thought W2K was the only culprit there...and dammit, if I can't blog from the steps of the Vancouver Art Gallery, this thing is going back to the store. I suspect a problem with MacStumbler, but it's hard to be sure; I managed to find five or six access points at the office with Knoppix and the work laptop, and (apparently) wasn't able to find a thing with MS. I need to find a command-line version.

So far, though, that's my only complaint. Pretty fucking sweet, if you ask me.

Had a problem at work with Debian and VNC: the alt keys wouldn't work, for some reason. This was pretty annoying for the developer who really, really wanted to use Emacs. It took me about an hour of poring through Google -- Jesus Christ, the number of complaints about ALT keys disappearing, and Good God the long uber-thread about the change in keyboard behaviour between Debian versions -- to find the solution: vncserver --compatiblekbd. A-ha!

Back to work and still no wireless access. Carousel is a LIE!!!

UPDATE: The VNC trick doesn't work. Details: The developer is running VNCViewer under VNC to connect to an X desktop on a Debian machine. On that machine, he's opening up an xterm and running User-Mode Linux. Alt-equals-meta works for Emacs when run on the Debian machine, but not for Emacs when run in the User-Mode Linux xterm. Fuck.

UPDATE: Buddy found the trick: control-left-click in the xterm to get the menu, then click "Meta sends escape". Double fuck!
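For anyone else bitten by this: roughly speaking, an xterm can report Alt/Meta plus a key in one of two ways, either by setting the high (eighth) bit of the character or by sending ESC followed by the character. Emacs in a terminal always treats an ESC prefix as Meta, whereas the eight-bit form only works if the terminal, locale and Emacs input settings all agree, which is presumably what got lost somewhere in the VNC-inside-VNC-inside-UML stack. A small Python sketch of the two encodings (the function name is mine, not anything from xterm):

```python
ESC = 0x1B


def meta_key_bytes(key: str, meta_sends_escape: bool) -> bytes:
    """Bytes a terminal would send for Meta+key under each xterm convention."""
    code = ord(key)
    if code > 0x7F:
        raise ValueError("only 7-bit ASCII keys make sense here")
    if meta_sends_escape:
        # "Meta sends Escape": prefix the plain character with ESC.
        return bytes([ESC, code])
    # Eight-bit input: set the high bit on the character itself.
    return bytes([code | 0x80])


if __name__ == "__main__":
    for mode in (True, False):
        print("meta sends escape" if mode else "eight-bit meta", meta_key_bytes("x", mode))
```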

Tags: emacs, hardware.
Big Hair Books
2004-10-06 19:11:26

Network problems again last week. Cheap switches will be the death of me, I swear, unless cable management gets me first. (Actually, it was both this time...cable looped back on itself + cheap switch == lots of embarrassing explanations.)

But there are bright spots in this morass -- 48 of them, to be precise, in the form of 2 x HP 2626 ProCurve Managed Switches. SSH login, VLANs up the wazoo, and much muchness. The only thing I'm not sure about is whether or not it does port mirroring (which I can live without, but it'd be nice). (UPDATE: Yes it does. Weeoo!) If these work out, then I think it'll be 2 x 2650s to replace the DLink unmanaged ones that keep crashing. The Ciscos seem nice and all, but the cost...oh my. And the respondents to the recent Ask Slashdot seemed to like HP a lot. Plus, we used to use 'em at my old job, and everyone was pretty happy. We'll see how it goes.

Just bought Neal Stephenson's The System of the World at Big Hair Bookstore. Twenty-two pages in and I love it already. God, the man can write.

Tags: books, hardware.
The firewall is dead. Long live the firewall!
2004-11-07 16:03:46

I decided this week to get Amanda working properly at home. I've got an old DDS3 tape drive in Francisco, my FreeBSD firewall box, but all I've been doing so far is tarring to it once a week.

Setting up Amanda wasn't much of a problem, but I kept getting short write errors -- the damn thing was giving up and saying the tape was full after only about 3GB. I decided to run amtapetype, which takes about two hours per run with my hardware, in order to figure out exactly how much space I had. The first time, it said 2GB. WTF? The second time, the drive crapped out with errors about how a power reset had been detected. I decided to shut down Francisco and reseat the cables just in case. No problem, right?

Wrong! When I brought up Francisco again, it refused to boot -- lots of scary errors about how the hard drive couldn't be read, or found, and maybe the LIES about having a hard drive present should just stop now, huh? Francisco is old: it's an old P90 scrounged from an old job, stuck in this black case with non-working LEDs and a Punisher logo someone poked out in toothpick-sized holes on the front. No cooling fan, four ISA slots and three PCI, and I had to jiggle the BIOS so that it would boot from a 100MB partition at the beginning of an 80GB hard drive. Seems like as good a time as any to simply replace the damned thing...

...but first, a firewall. I tried booting it from an old laptop hard drive I had around, but that didn't work. I tried getting it to boot from a Slackware Live CD, but the whole concept of booting from a CD just made Francisco huddle in the corner in the fetal position.

Nothing else for it: it was time to do The Bad Thing. I grabbed one of the ethernet cards from Francisco, shut down Thornhill (P3, 500MHz, web and DNS server, Slackware and 2.6.7 kernel) and threw it in. A quick module recompile for tulip^Wvia-rhine and that was up; some judicious editing of the firewall set it up for NAT. Ph35r m3!

(Side note: Man, it's been far too long since I set up NAT on Linux; I still don't really understand what I've done. I've worked with FreeBSD for firewalls almost exclusively over the last four years, and I have some serious catching up to do.)

So now the question is: what do I do to replace Francisco? I know, finding a Pentium similar to Francisco is not that hard at all. But dammit, I'm tired of big, noisy boxes that are just waiting to die. I want something small, quiet, and reasonably new; I don't want to be fiddling with it, or worrying about it running out of memory (I tend to run far too much on a firewall, and 92MB of RAM just aggravates the problem).

It's complicated a bit by the recent heat-death of Hardesty, a 300MHz Celeron that had, 'til recently, been my desktop machine. I'd been hoping to replace or upgrade that, too; I've gotten quite used to a fast processor and lots of memory at work, and 15 seconds to render Slashdot's front page seems less like acceptable and more like a sign that civilization is in decline.

So...one option is a VIA EPIA CL6000. Dual ethernet, fanless goodness. That, and a case -- unless I decide to build my own Bubba can computer -- and some memory, and maybe a hard drive or maybe PXE booting. Whee! That'd make a pretty decent firewall and fileserver, no question.

But another option would be to let Thornhill keep doing the firewall thing, even though it's a webserver and should, like, rilly be outside the firewall, or at least in a DMZ. I could do something really funky like run Apache inside User-Mode Linux. Or maybe my own stuff, although I'm sure X would be a bear to get working.

A third option would be to keep using Francisco, but w/o a hard drive: let it PXE boot and do all the firewall stuff that way, totally stateless (well, hard drive-less). That could be interesting: almost no moving parts at that point. That would let me get a Mini-ITX something-or-other to use as a desktop machine. They're not the most powerful processors around, but when you can compile a kernel in 6 minutes, who the hell cares? Or maybe a Shuttle, so I could keep using my video card. Hm...

Well, enough of that for now; my cat needs chasing. And anyhow, King of the Hill season premiere tonight! Woo!

Tags: freebsd, hardware, linux.
Fetch me m'shotgun!
Nov 9, 2004 18:45:58 PST

The sumbitches are at it agin', mother. Comment spam is infecting both my blog and my wife's. So far a relatively small number of keywords -- poker, Texas, debt -- is sufficient to keep 'em away from where Google can see 'em. Well, that and OCD-like running of SELECT statements in MySQL. But the fuckers are gonna be the death of me, or at least blog comments. Although maybe some sort of SURBL plugin for URLs in the post...that'd be cool. Someone must have something like that already.
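A rough sketch of both halves of that in Python: the keyword check from above, plus building the names you'd look up in SURBL. SURBL works as a DNS blocklist: you look up the linked domain with the list's zone appended (multi.surbl.org here), and an answer in 127.0.0.x means it's listed. Assumptions, loudly: the domain extraction just keeps the last two labels, which is wrong for things like example.co.uk; the actual lookup needs network access and is subject to SURBL's usage policy; and the function names are made up.

```python
import re
import socket
from urllib.parse import urlparse

SPAM_KEYWORDS = ("poker", "texas", "debt")  # the keywords from the post
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def has_spam_keyword(comment: str) -> bool:
    """True if the comment mentions any blocked keyword (case-insensitive)."""
    text = comment.lower()
    return any(word in text for word in SPAM_KEYWORDS)


def surbl_query_names(comment: str, zone: str = "multi.surbl.org") -> list:
    """DNS names to look up, one per distinct domain linked in the comment."""
    names = []
    for url in URL_RE.findall(comment):
        host = urlparse(url).hostname  # lower-cased, port stripped
        if not host:
            continue
        host = host.rstrip(".")
        if "." not in host:
            continue
        # Naive: keep the last two labels (breaks on e.g. example.co.uk).
        domain = ".".join(host.split(".")[-2:])
        name = f"{domain}.{zone}"
        if name not in names:
            names.append(name)
    return names


def is_listed(query_name: str) -> bool:
    """Listed domains resolve to 127.0.0.x; unlisted ones don't resolve at all."""
    try:
        return socket.gethostbyname(query_name).startswith("127.")
    except socket.gaierror:
        return False
```

A comment would then be held for moderation if has_spam_keyword() fires or any name from surbl_query_names() comes back listed.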

Not that I notice a whole lot of comments, anyhow, at least away from the Slashdot side of things...although I do notice that I've made it onto somebody's blogroll. How'd that happen?

In other news: I finally decided what to do about new computers: buy a new Shuttle SK43G with a Sempron processor and make that my web server; then make my current webserver (older Compaq P3-500 desktop machine) my desktop and firewall: lots of room for ethernet cards, tape drives and whatnot.

I agree, it's a little silly that the more powerful box becomes the horribly underutilized server, but such is life. If there were a comparably cheap Shuttle that came with two onboard ethernet interfaces, I'd be buying that instead.

So dive right in, right? I got the new box home last night, assembled it and booted w/o problems. It took little effort to move the hard drive from the web server and put it in the new, tiny box; sure, I had to recompile the kernel (8 minutes! eat that, P90!) to get the right drivers in, but nothing big. Until, that is, it froze. Hard. And only a few minutes after booting. If I ran top and set it to update continuously, I could get it to freeze within seconds.

Some fiddling with Grub (boot loader of the GODS, man) showed that the problem seemed to go away if I went with the original Slackware stock 2.4.20 kernel instead of the 2.6.7 kernel I'd last compiled. (I'm a packrat, and that includes keeping every kernel compiled on this damned thing, Just In Case, because You Never Know.) We've got one of these boxes at work with an Athlon XP and it works fine; admittedly, it's not doing much, but neither is my web server. (Ba-zing!)

God only knows what's going on there, but it didn't last: I left it on overnight to see if it'd keep going, and sure enough it froze again around 10pm. I put the HD back in the P3 and left it. I'm going to see Wilco tonight (Whoo! WilCO! WHOO!), so this'll take a back seat to some serious RAWK. Except I'll probably be speculating about crappy memory or badly applied heatsink paste the whole time. No. No, I won't. It's Wilco.

Actually, I'm thinking I may have to upgrade the BIOS in order to get it to work properly with the Sempron; originally it was detected as a 900MHz Athlon, and I had to tweak the bus speed and whatnot to get it to run at 1.5GHz. (Interestingly, this seemed to have no effect whatsoever on how quickly it would crash, compared to the difference the different kernel version made.) (God, that's an awful sentence. I'm sorry, everyone.)

Anyhow, there's probably lots wrong with the settings; I never really wanted to learn about memory spacings and CPU voltages and I don't know what-all.

In other other news, I mentioned that I moved last week, but I didn't mention that I came back to two, count 'em TWO dead computers. (Before you ask: Support contracts are for the weak, and I suspect I'm about to get very weak.) One was a Linux box whose hard drive gave up the ghost. Stupid IDE hard drives in a dusty, hot environment anyway! But the other was an old Duron whose motherboard's capacitors yearned to be one with the cosmos (i.e., they blew up real good). That was running Windows, so the whole let's-just-throw-the-hard-drive-into-another-box-and-see-if-it-boots thing was good for a very, very bitter laugh but little else.

Instead, I reinstalled not only Windows but Cygwin, too. That proved to be harder; we use Cygwin to compile very particular things that depend on version 2.2 of Python. Version 2.3 makes things cry. And no matter how much you tell the Cygwin installer that you don't want to upgrade Python, it goes ahead and does so anyway like some hyperactive sugar-fueled kid who's certain he knows how to fix things.

After far too much experimentation, I did what I should have done in the first place: I found an old archive of Cygwin, with the right version of Python, and I mirrored it. One gigantic, nine-hour long sucking sound later, and I had a local copy to point the Cygwin installer at. Thank god.

Finally, just got in the first 19" LCD monitor at work. This was, of course, two weeks after assuring someone that they were too expensive to get past the boss. My bad. I'm going to get a lot of mean looks, I think. But then, if I was a people person, why would I have become a sysadmin?

Recommendation of the Day: Vicious Battle Rap, by DJ Format and Abdominal. Bow down, baby.

Tags: hardware, spam.
By George, I think I've got it
2004-11-11 12:46:16

SK43G, Sempron 2200. eth0: VIA Rhine driver -- DLink 350TX? I'll have to look it up. eth1: Realtek 8139 onboard.

  1. ifconfig eth0 192.168.0.1 netmask 255.255.255.0
  2. route add default gw 192.168.0.254
  3. (log in as self) ssh 192.168.0.254

BAM -- freezes hard, and even the Magic SysRq key does nothing. Reboot...

  1. ifconfig eth1 192.168.0.1 netmask 255.255.255.0
  2. route add default gw 192.168.0.254
  3. (log in as self) ssh 192.168.0.254

Password: BAM! (the good BAM, this time) Yay! No BIOS upgrade required maybe!

(UPDATE: Spelled out which one was eth1 [the onboard Realtek]. What a maroon!)
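Those commands assume the gateway sits on the same /24 as the new address. If you want to sanity-check a static setup like that before typing it in, Python's standard ipaddress module will work out the network and broadcast addresses and tell you whether the default gateway is actually on-link. A small sketch; the function name is made up for illustration.

```python
import ipaddress


def describe_static_config(address: str, netmask: str, gateway: str) -> dict:
    """Summarize an ifconfig-style address/netmask and check the default gateway."""
    # IPv4Interface accepts the dotted-quad netmask form, e.g. "192.168.0.1/255.255.255.0".
    iface = ipaddress.IPv4Interface(f"{address}/{netmask}")
    gw = ipaddress.IPv4Address(gateway)
    return {
        "network": str(iface.network),
        "broadcast": str(iface.network.broadcast_address),
        "gateway_on_link": gw in iface.network,
    }


if __name__ == "__main__":
    print(describe_static_config("192.168.0.1", "255.255.255.0", "192.168.0.254"))
```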

Tags: hardware.
Update on Shuttle/DLink problems
2004-11-14 20:39:10

Here are a few more details on the problem with the new Shuttle. First, the card is a DLink DFE-530TX; the Shuttle is an SK43G. If the DLink is connected to my internal network switch, and from there to the gateway box, this sequence will make it freeze:

  1. ifconfig eth0 192.168.0.1
  2. route add default gw 192.168.0.254
  3. ssh 192.168.0.254

Interestingly, if the network cable is unplugged, the problem doesn't show up...so it appears that something about the response to the three-way handshake is what's causing the problem.

I managed to find some reports of wireless cards locking up hard with the VIA KM400 chipset, including cards from DLink. I tried setting all the IRQs to "Reserved" in the BIOS, and that didn't work; however, the card was grabbing IRQ 17, and the BIOS wouldn't let me reserve that one. I also tried upgrading the BIOS, and that didn't work either.

I'd love to pursue it further, but it's now officially the new webserver; I wanted to get it installed while I had a day to fool around with it and get everything working. So far there don't appear to be any problems.

And now, of course, I've got what used to be Thornhill as my desktop machine: P3 500MHz, 640MB, and a new 160GB Seagate Barracuda. Once again, I'm going with Debian, God's own distro. Still gotta come up with a name for it.

I'm currently trying out KDE and Konqueror -- usually I use IceWM and Firefox, but I thought I'd give something fancier a try now that I've got a slightly hibbier machine. It's not bad so far, although having to set up all the keyboard shortcuts that come with Ice is a little annoying. We'll see how long it lasts.

Tags: hardware.
Random updates
2004-12-11 14:16:50

After a lot of consideration, and some reassurance from JWSmythe, I'm going with the Promise VTrak 15100 array for work. It has almost everything I want: serial ATA, dual SCSI adapters, and an ethernet interface. The downside is that Promise doesn't have an office in Canada, so there's the possibility that getting parts across the border could be a problem. However, there's a local company that'll do service, so that makes me feel better.

The other options just weren't as good: one was parallel ATA and had no ethernet interface. The other was the Fastora DAS-315, which certainly looked good -- but the local resellers couldn't be bothered to give me the time of day, let alone answer the questions I had. Best bit: when I asked for a copy of the service level agreement, the sales guy replied that he'd "have to see" if he could release it.

And at home, I've been running into problems with bridging, the 2.6.9 kernel and the 8139too driver. I thought I would enable bridging on Thornhill for some User-mode Linux fun, so I enabled it as a module, then rebuilt and reinstalled the modules. However, when I tried inserting it, I got unknown symbol: br_handle_frame_hook. Okay, what about rebuilding the kernel and including bridging within it? Tried that; when I booted, the kernel panicked as soon as it came time for the onboard 8139 interface to grab an address by DHCP.

It was similar to the earlier problems I had with the Shuttle, in that if I took out the ethernet cable everything was fine -- it was only when the response came in that the kernel panicked. And keep in mind this was without setting up a bridge at boot time, or anything like that. I had to go to the backup 2.6.7 kernel in order to calm things down.

I found this thread on LKML, and it seems to match pretty closely what I saw: the stack trace is the same, although I wasn't able to see the whole message because it scrolled off the screen. However, I'm reluctant to try this patch; I spent a whole evening rebooting (Sorry, Aaron) and trying different things before I finally confirmed that having bridging in the kernel was just a bad thing.

Interesting bit: I didn't realize that Linux does not have panic core dumping built into the kernel, as FreeBSD does; it's only available as a separate patch. Minus one for Linux.

Finally, it's the day after the office Xmas party, and what am I doing? Heading into work to unplug everything. The power is being shut off in our building (a thirty-floor-or-so high-rise) while upgrades are done, so I'm shutting everything down and disconnecting it just in case. Tomorrow I go back in to reverse the process. Whee!

Tags: hardware.
Two good deeds
2004-12-22 22:21:00

Well, I did the right thing today -- twice. Damn right I'm bragging.

First off, it turns out that the FreeBSD Foundation has run into a (good!) problem: its donations have been too big. In order to keep its US charitable status, it needs to have two-thirds of its donations be relatively small. Due to a couple of big donations, this ratio is a little out of whack at the moment, and they need a bunch of small donations.

Welp, I've been administering FreeBSD systems for a living for...well, I was gonna say four years, but it's more like two and a half or three. I've been working on them for four, though; my rent and food have been paid in large part because of the generosity of the people who put together FreeBSD. A donation went off in short order.

Then I remembered that I've been meaning to join the Free Software Foundation for a while now. The motivation is the same: I've been paying my bills for a long time now (and enjoying myself immensely in the process) because of the generosity of Free-as-in-Freedom software people: Stallman, Torvalds, Wall, and a zillion others. I have a hard time imagining what I'd be doing now without Free software; I suspect that, if I was lucky, I'd be working as a grocery store manager right now. So: off to the FSF website to sign up for an associate membership.

And what did I find but two, count 'em TWO cool things:

  1. If you refer three people to the FSF for associate memberships, RMS or Eben Moglen will record a message for you, suitable for voicemail, Hallowe'en or impressing the ladies. I did a quick search on Google, but couldn't find anyone with the link...damn shame. Better than a free iPod, cooler than a CmdrTaco TiVo -- join the FSF and get RMS to say "All Hail Liddy!"

  2. The FSF is looking for a senior sysadmin. God, that'd be cool. Decent enough pay (no, it's not the sort of job you take because of the money, but it's nice to think about), all the Free software you can handle, and an IBM ThinkPad to run it on. Of course, I think I'd have some 'plainin' to do about the laptop I'm writing this on...and, of course, it would mean living in the US. Frankly, that scares the crap out of me these days. Goddamned PATRIOT Act...

In other news, work continues apace. We're losing two co-op students and gaining one, gaining another full-time person, and I'm still trying to get my RAID array -- credit app is with the boss, and after that's done the order'll finally go in.

Rough guess (wild hope) at this point is that it'll be in my hands in mid-January, which won't be a moment too soon. There's a new Linux server I'm setting up that I'm desperately hoping won't have problems due to proprietary kernel modules in the software I'm installing. (I'm just writing myself further and further out of that job, aren't I?)

And I'm wondering if the simplest way to get Nagios to make sure the right machines are exporting the right filesystems is to check if amd is mounting them correctly. (No matter whether the machine or amd fails, something needs to be fixed.) Or maybe I just need to figure out the right wrapper for showmount -e.
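Something like the following is what I have in mind for the showmount -e route: a small Nagios-style check, sketched in Python, that runs showmount -e against a host and goes CRITICAL if any expected export is missing. It assumes showmount's usual output (an "Export list for host:" header, then one export per line with the path first) and the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the function names are mine, and paths containing spaces would confuse the parser.

```python
import subprocess

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes


def parse_showmount(output: str) -> set:
    """Pull the exported paths out of `showmount -e` output."""
    exports = set()
    for line in output.splitlines():
        if not line.strip() or line.startswith("Export list for"):
            continue
        exports.add(line.split()[0])  # path comes first, then allowed clients
    return exports


def check_exports(host: str, expected: list) -> tuple:
    """Return (exit_code, status_line) for the expected exports on host."""
    try:
        result = subprocess.run(
            ["showmount", "-e", host],
            capture_output=True, text=True, timeout=30, check=True,
        )
    except OSError as exc:  # showmount not installed, not executable, etc.
        return UNKNOWN, f"EXPORTS UNKNOWN - cannot run showmount: {exc}"
    except subprocess.SubprocessError as exc:  # non-zero exit or timeout
        return CRITICAL, f"EXPORTS CRITICAL - showmount -e {host} failed: {exc}"
    missing = sorted(set(expected) - parse_showmount(result.stdout))
    if missing:
        return CRITICAL, "EXPORTS CRITICAL - not exported: " + ", ".join(missing)
    return OK, f"EXPORTS OK - all {len(expected)} expected exports present"


def main(argv: list) -> int:
    """Print one status line and return the plugin exit code."""
    if len(argv) < 3:
        print("usage: check_exports HOST EXPORT [EXPORT ...]")
        return UNKNOWN
    code, message = check_exports(argv[1], argv[2:])
    print(message)
    return code
```

To use it as an actual plugin, end the script with `import sys` and `sys.exit(main(sys.argv))`, then point a Nagios command definition at it with the host and the list of exports you expect.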

On the spam front: good god, what a smoking hole Movable Type is turning out to be. First there were the license changes, then the comment spammers (who seem to be posting a lot more aggressively to MT than to WordPress)...Of course, comment spam affects all blogs, not just MT. Still, this whole idea of rebuilding static pages every time the stars move seems to be causing them a lot of trouble. (Yep, that last sentence was pure FUD. Or bullshit.) And okay, no, I don't use MT, so what precisely is my beef?

As I'm not going to put up, I should shut up. I still have to upgrade WP -- though according to this posting, there are still lots of XSS issues left unfixed. I'm also upgrading PHP, and I should probably use ApacheToolbox to do that automagically, rather than periodically editing my own Makefile.

The release party for Where Are They Coming From? came off JUST FINE, thank you. EVERYONE was there. Top Stars include Topo, Phil Knight and Mos Def, fresh from the set of HHGTTG. Uh huh.

Further thoughts on the MySQL + GPhoto2 thing: gphoto2 does have the ability to pipe to STDOUT, which I don't think I knew...maybe it won't be as much work to insert directly into a database as I thought. Might even be able to do it as a Perl script.
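For what it's worth, here's roughly what the insert side could look like. It's sketched in Python with the standard library's sqlite3 standing in for MySQL (a MySQL driver would need its own connection call and %s placeholders), and it assumes the image bytes arrive on stdin from gphoto2's --stdout option. The table layout and names are made up for illustration; the SHA-1 column just keeps the same photo from being stored twice.

```python
import hashlib
import sqlite3
import sys

SCHEMA = """
CREATE TABLE IF NOT EXISTS photos (
    id       INTEGER PRIMARY KEY,
    filename TEXT NOT NULL,
    sha1     TEXT NOT NULL UNIQUE,
    data     BLOB NOT NULL
)
"""


def store_photo(conn: sqlite3.Connection, filename: str, data: bytes) -> bool:
    """Insert one image; returns False if identical bytes were already stored."""
    conn.execute(SCHEMA)
    before = conn.total_changes
    digest = hashlib.sha1(data).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO photos (filename, sha1, data) VALUES (?, ?, ?)",
        (filename, digest, data),
    )
    conn.commit()
    return conn.total_changes > before  # unchanged if the UNIQUE sha1 clashed


def main(argv: list) -> int:
    """Read one image from stdin and store it under the given filename."""
    if len(argv) != 3:
        print("usage: store_photo.py DATABASE FILENAME < image", file=sys.stderr)
        return 2
    with sqlite3.connect(argv[1]) as conn:
        stored = store_photo(conn, argv[2], sys.stdin.buffer.read())
    print("stored" if stored else "duplicate, skipped")
    return 0
```

Ending the script with `sys.exit(main(sys.argv))` would let you pipe something along the lines of `gphoto2 --get-file 1 --stdout` into it; check `gphoto2 --help` for the exact flags your version supports.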

Finally: what a gorgeous day. I'm in downtown Vancouver on the back steps of the Art Gallery, it's sunny (in December, too) and just cold enough to make you go "brr". The skater kids are practicing their synchronised jumping -- just in time for the Olympics, I'm sure. A far-too-generous co-worker has handed out chocolate, another has handed out home-made rum and brandy balls, and I'm taking off early to go drinking with a third. Feeling pretty damned good right now.

Update: Too bad Topo's not so great -- fever of 102.8F, as of a couple minutes ago. (Still haven't figured out what that is in Celsius; bad Canuckistanian!) It's down a bit from earlier this afternoon, though, so I'm thinking good things. And these pages say to not worry if it's less than a couple days, so I'm not worrying. Nope.
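For the record, the conversion is C = (F - 32) * 5 / 9, which puts 102.8F at about 39.3C. The one-liner below does the sum with awk, since plain shell arithmetic is integer-only; the function name is made up.

```shell
# Convert degrees Fahrenheit to Celsius, C = (F - 32) * 5 / 9,
# rounded to one decimal place. awk does the floating-point maths.
f_to_c() {
    awk -v f="$1" 'BEGIN { printf "%.1f\n", (f - 32) * 5 / 9 }'
}

f_to_c 102.8    # prints 39.3
```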

Tags: bsd, hardware, meta, politics, rant, spam, wontyoupleaselendahand.
A power bar you can SSH to
2005-02-07 11:19:45

I was shopping for a new rack and the necessary accessories, when I came across the power bar you can SSH to. That's right: not only does it have a digital readout on the thing that lets you know how much power/current you're drawing (and oh man, does that ever make this thing worth it; I'm scared to plug in new machines right now for fear I'm gonna trip a breaker), but you can ssh to the damn thing. There's even a "how to recover a lost password" procedure.

Tags: hardware.
Easily amused
2005-02-25 23:24:47

Welp, the Promise array is here at last. I don't have any disks yet -- they're coming in next week -- but I've had a chance to play around with the firmware. First off, it's running Linux, just like JWSmythe said. The firmware that came with the box said "Now uncompressing Linux..." at boot time; it may be indicative of something that the newer firmware says "Now uncompressing kernel..." Promise doesn't mention anywhere on their website that the 15100 uses Linux, which surprises me a little. They also don't offer the source code anywhere. I've sent 'em an email asking about that; their autoresponder said I should hear about that today.

Second, I've yet to figure out how to enable SSH on the thing, and I'm increasingly lacking confidence that it even offers this, even after the firmware upgrade. Naturally, this is in stark contrast to what's listed on the website. I've sent them an email about this.

Third, I've yet to figure out how to monitor the thing by SNMP. I can run snmpwalk, sure, and I get info back, but I don't see anything like network traffic or disk stats or anything. (Compare and contrast with the PDU from APC, which included the SNMP schema [if that's the right word] on the CD.) Then again, this may be because I haven't got any disks in there. We'll see.

Fourth, it looks like there was corruption of the firmware. Got it in yesterday, booted fine, upgraded firmware by TFTP, all good, turned it off before going home (and not for the first time that day, either). This morning I booted it, and things were just wrong: the network address was obviously bogus and couldn't be changed, various menu entries were showing garbage instead of "Promise VTrak 15100" or whatever, and so on. I called tech support, who told me the secret:

  1. Reboot.
  2. When booting, hold down ctrl-F to get to the BPD prompt (which is some sort of bootloader prompt).
  3. Type "diag".
  4. Select "Clear or Test FRAM".
  5. Let it do its thing.
  6. Quit the diag tool.
  7. Type "reset" to reboot.

Note: if you fry your array by following this advice, you're on your own. But it worked for me. Of course, this doesn't explain why it happened in the first place. I'm going to be watching it carefully.

Funny moment: While waiting for me to figure out how to reboot the array [which took a few minutes because of the menu corruption I called to complain about], the techie I was talking to was having a conversation with someone else. "Are you reading? [pause] Okay, are you working on projects? [pause] It's okay if you're using the web to work on projects. [pause] But if you're just surfing the web looking for a job, that's not working on projects." Second funny moment: The warranty registration page on the Promise website asks for suggestions and comments to "help us imporve in the future." Third funny moment: When registering the extended support, the page that asked for the value of the product purchased barfed with "Internal Error" when I put a dollar sign in the amount. (Okay, so I'm just easily amused.) Finally, it's just plain odd to be asked for your bona fides by your power bar:

  1. Access: Enabled
  2. Protocol Mode: SSH Version 2 only
  3. Telnet Port: 23
  4. SSH Port: 22
  5. Advanced SSH Configuration
  6. Accept Changes : Pending

?- Help, esc- Cancel Changes, enter- Refresh, ctrl-L- Event Log
> 6

LICENSE AGREEMENT

By enabling this security feature, you are agreeing to the following statements:

A. This Product includes cryptographic software subject to export controls under the U.S. Export Administration Regulations. You agree to cooperate with American Power Conversion Corporation as reasonably necessary to ensure compliance with the laws and regulations of the United States and all other relevant countries, relating to exports and re-exports ("Export Laws"). You shall not import, export, re-export or transfer, directly or indirectly, including via remote access, any part of the Products into or to any country (or its nationals or permanent residents) or to any end user or end use for which prior written governmental authorization is required under applicable Export Laws, without first obtaining such authorization. By ACCEPTING THESE TERMS, you are representing and warranting that neither your use nor your receipt of any part of the Products requires prior written authorization under any Export Laws. You are responsible for complying with any local laws in your jurisdiction which may impact your right to access or use this product.

B. By ACCEPTING THESE TERMS, you are representing and warranting that (1) you are not located in or a national of any U.S.-sanctioned or terrorist-supporting countries, (2) identified on the U.S. Treasury Department's List of Specially Designated Nationals, the U.S. Commerce Department's Entity List, or the U.S. Commerce Department's Denied Parties List; or (3) engaged in any proliferation-based or terrorist-supporting activities.

Do you accept the terms of this license agreement?
Enter 'YES' to continue or ENTER to cancel :

Tags: gpl, hardware, linux.
Argh, Billy.
2005-04-19 17:46:35

Less and less impressed with Promise. Here's what I did: while copying some files onto a logical drive, I yanked one of its disks out. I wanted to see what would happen, what would need to be done, and so on -- I don't want to be figuring this out for the first time when it happens. Well, it started beeping, and the event log said that the logical drive was critical. Start rebuilding, right? Wrong: policy for that drive was set to non-auto-rebuilding. Try turning that on, and it doesn't work: keeps saying it's non-auto-rebuilding. Manual for the VTrak:

if your fault-tolerant logical drive goes offline, go to the Promise website (www.promise.com) and download a document called _Array Recovery Procedure_.

Damn good thing I'm not doing this for real. Go to www.promise.com and type "array recovery procedure" into the search bar. The result:

Microsoft OLE DB Provider for ODBC Drivers error '80040e14' [Microsoft][ODBC SQL Server Driver][SQL Server]Cannot use a CONTAINS or FREETEXT predicate on table 'product' because it is not full-text indexed. /search_insert_eng.asp, line 34

Fuck me. Use Google to find the document, which has instructions for the UltraTrak, the predecessor to the one I've got. Hope it still applies and read on. It sez to reboot the array (!) in order to trigger the rebuild (!!). Sure enough, it works. Oh, and have I mentioned they still haven't sent me the SNMP OIDs/MIBs after six weeks of calling their technical support manager? FUCK ME.

Tags: hardware.
Blast from the past
2005-07-01 10:52:57

(Note: this was actually written back in May.)

Top Tip: Filenames with a tilde in them can confuse Samba.

Case in point: last week a user was having problems loading his profile: W2K kept choking and saying that the file Local Data\Applications\foo\backup\~AvariciousMonkeys.c was in use. Naturally, lsof on the Samba server turned up nothing, and I couldn't see any obvious problem. On a hunch, I tried renaming the file to AvariciousMonkeys.c~, and hey presto! goodness all over.
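If more of these turn up, a helper along these lines could do the same rename in bulk. It's a hypothetical sketch (the function name is made up) assuming bash and a `find` that supports `-print0`: it moves a leading tilde to the end of the name, skips any file whose new name already exists, and with `-n` only reports what it would do.

```shell
# Hypothetical helper: rename files whose names start with "~" so that the
# tilde comes last instead (~AvariciousMonkeys.c -> AvariciousMonkeys.c~).
# Usage: fix_leading_tildes [-n] DIR   (-n: only print what would be renamed)
fix_leading_tildes() {
    local dry_run=0 path dir base new
    if [ "$1" = "-n" ]; then
        dry_run=1
        shift
    fi
    if [ -z "$1" ]; then
        echo "usage: fix_leading_tildes [-n] DIR" >&2
        return 1
    fi
    # -depth visits a directory's contents before the directory itself, so
    # renaming a "~dir" doesn't break the paths of the files inside it.
    find "$1" -depth -name '~*' -print0 |
    while IFS= read -r -d '' path; do
        dir=$(dirname -- "$path")
        base=$(basename -- "$path")
        # find only matched names starting with "~", so drop the first
        # character and append a tilde instead.
        new="$dir/${base:1}~"
        if [ -e "$new" ]; then
            echo "skipping $path: $new already exists" >&2
            continue
        fi
        echo "$path -> $new"
        if [ "$dry_run" -eq 0 ]; then
            mv -- "$path" "$new"
        fi
    done
}
```

Running it with `-n` against the profile directory first shows the list of renames without touching anything.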

This week I'm trying to get FAI going in earnest. I've worked on it before, but now I've got three developers who want to switch to Linux. The last thing I want is another series of one-offs, so I'm taking the time to do it right. Now there's a CD version in beta, and so far it's working well. Cf. the usual way of doing it, which is to do PXE booting and grab everything off the network. I'm not opposed to that, but one of the things I wanted out of FAI before was the ability to do CD-based, kickstart-like Debian installs; looks like it's finally going to work.

Looks like we're having a problem with a Maxtor PCI IDE controller and the Intel mobo in our backup server. It's been mysteriously crashing in the middle of the night w/no log messages. Some checking in the BIOS turned up another problem: going to the hardware monitoring page to look at the CPU temperature made the damn thing freeze. WTF? Sure seems like the symptom we were seeing, and backups running at night make big use of the Vinum array that uses drives attached to the IDE adapter...long story short, taking out the card stopped the BIOS freezing. It remains to be seen if it'll work for the random midnight freezes, but it's good to have something to try. I'm hopeful that FreeBSD will be able to handle SATA drives attached to this thing...we'll have to see.

Which brings me to the next bit: fleshing out plans for server upgrades. As I mentioned, last week we had a power supply fail on our Very Important Server, and I want to try and keep that from happening again. Of course, adding umpty thousand dollars worth of hardware to your budget four months before the end of fiscal doesn't really work too well, so as much as possible I need to do this w/o new hardware. Ha! But I'll give it a try.

First off is setting up OpenLDAP and importing Samba's information into it. That'll be neat, since I've never worked w/LDAP before. Second is to set up some BDCs using OpenLDAP to query the master. (Or do they just suck over the whole database? Hm. Either way.) Third is to set up some Linux machines. Why? Two reasons:

LinuxHA and DRBD seem fantastic, and there just doesn't seem to be anything comparable on the FreeBSD side. As for the hardware...well, my first impression of server hardware from IBM, HP and the like (no, don't talk to me about Dell) is that I'm going to need a newer version of FreeBSD than we currently use in order to run SATA drives. (I know SCSI is the way to go, but I was quoted two thousand dollars for two IBM 73GB 15k drives! I know: 15k, IBM, etc, but even halving that means two -- two! -- 73GB drives for a thousand bucks, a/o/t two 200GB drives for, what, four hundred. Heh.)
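Putting rough numbers on that, using the ballpark prices above: the helper below (a made-up name; awk handles the decimals) works out dollars per gigabyte for a pair of drives. Even at half the quoted price, the SCSI pair costs nearly seven times as much per gigabyte as the SATA pair.

```shell
# Dollars per gigabyte for a pair of identical drives:
# price / (2 * capacity in GB), to two decimal places.
per_gb() {
    awk -v price="$1" -v size="$2" 'BEGIN { printf "%.2f\n", price / (2 * size) }'
}

per_gb 2000 73    # quoted price for two 73GB SCSI drives: 13.70
per_gb 1000 73    # even at half that price: 6.85
per_gb 400 200    # two 200GB SATA drives: 1.00
```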

We're using an older version of the 4-series FreeBSD here. I've already set up one server using a newer 4-series release, and it's a pain: too many differences, one more thing to keep in mind when making changes, and so on. I haven't worked with the 5-series yet, and I don't want to start now...not entirely sure that it'd work for us. Plus, we'll probably migrate to Linux anyway, so I don't mind doing it for a server.

Anyhow! Get a Real Server and throw Linux on it. Hook it up to our drive array and start migrating home directories to ReiserFS from UFS/FreeBSD. Not trivial, but doable. Add more Linux servers as budget allows.

Tags: bsd, hardware, installation, linux, samba, upgrades.
And then nothing turned itself inside-out.
2006-03-03 07:17:25

Checked my email this morning and saw that backups of my wife's computer had timed out. Weird, I thought, but didn't look into it further. Then my wife comes out and says, "Hey, my computer's having a stroke." Uh-oh.

So I have a look and it's constantly, randomly, power cycling. It will get to the Ubuntu splash page then shut off, then get halfway through the BIOS check and shut off, then get halfway through boot and shut off, then stay off for two minutes, then turn on again. WTF?

First thought is cooling, of course. But the power supply feels cool to the touch, and when I get to the BIOS temperature page it says the CPU is at 51C -- eminently reasonable. (Then it shut itself off.) Okay, flaky RAM? Wonky graphics card? Dying, though not from lack of cooling, power supply?

Then it makes it all the way to Ubuntu's login page. I switch to a console and start looking at logs. This thing has been rebooting all night -- as in, there are log messages saying that shutdown has been invoked. And then I check /var/log/acpid and I see lots and lots and lots of entries saying that event POWERBTN (or some such) had been received, so Ubuntu was executing /etc/acpi/powerbtn.sh and shutting down nicely. And then I see a broadcast message from root saying that the system is going down for reboot NOW!

Tempted to just try booting w/o ACPI, but I think that would just mask the issue. Back to Google...

Tags: hardware.
Bones of an Idol
2006-11-17 21:50:30

Thursday: Go to The Other University to do some prep for the move coming up next week. Check in with their computer store (where you pretty much have to buy things) to see how the order on the console server is going. The guy behind the counter looks up the order, frowns, and tells me that it seems their supplier does not have one in any of their three Canadian warehouses. Okay, so how long will it take to get one in? He looks at me earnestly and says that, sometimes, they never come in. I ask at what point I can count on the supplier a) giving up and b) informing me of that fact. He frowns again, and suggests that I check back in a couple weeks (four weeks after I've placed the order) just to be safe.

Friday: Get email from contractor/university liaison for new building to say that network and electrical connections will not be ready in time because the requests were received so very late. While The Other Guy was supposed to get them in long ago, I should've been on top of this.

Monday, a stat holiday in Canada: Go to the old building to do a serverectomy on a soon-to-be-formerly shared rack. The Other Guy mentions that the new server room has water on the floor. I go over to look, and it's a rapidly evaporating puddle, irregular in shape and maybe two metres across at its widest. I can't figure out where it's coming from. Turns out there's some other stuff that should become formerly shared as well, so I spend time poring over Sun Enterprise 1 workstations (which I like) and old inkjet cartridges for printers that may no longer be around (which I don't like). Ask The Other Guy, who's been involved with the move a lot longer than I have, what electrical connections he's asked for in the new building, for him and for me (long story). He says that he gave them the model number of the Sun rack he's got (which has built-in, and very nice, PDUs) and asked them to figure out what he needs.

Tuesday: Moving day. As expected, network and electrical are not present; we've got 2 x 15A 120V circuits. Also, the water is back, and we can see that it's coming from a small leak in the concrete roof. I move my rack into another room; The Other Guy spreads a blanket over his rack. The liaison promises us that the contractors are on the job to fix the roof. The network connections (two fiber, two Cat5) get terminated, so I call the local network folks to get that taken care of. The university wireless network is not present in the new building.

Wednesday: The contractors show up to start fixing the leak. The network connections have been set up. The contractors have put in a big tube of plastic sheeting, taped to the roof at one end and a 40-gallon recycling barrel at the other. The Other Guy decides things are good enough and starts setting up his rack; I elect to hold off another day.

Thursday: The contractors say the roof is fixed, so I move the rack in and start hooking things up. The new OpenBSD firewall comes up nicely -- thank you, pf developers -- as does the main Sun server. Next up are the SunRays in the lab, only they're not. I take my laptop in and try to verify connectivity. I can't. The Other Guy suggests that the VLANs on my new switch are the problem and recommends just simplifying things. I do, and keep testing. Traffic from the laptop's RFC 1918 address just never makes it to the server. In a fit of desperation I try using an address in our routable subnet, and it works. This takes me until 8pm to figure out. I email various bosses explaining how far I've got, and the campus network folks to ask if they're filtering this subnet in some way. (This isn't completely out of the question; this place has a reputation for a pretty locked-down network.)

Friday: I buttonhole the guy at the campus network office and ask him about this. He thinks it over and realizes that while he's forgotten to unblock DHCP (told you it was pretty locked down), the other behaviour I'm seeing can be explained if I've somehow got my interfaces crossed. I'm doubtful but give it a try, which is a good thing because suddenly everything works. I don't understand it or what I did wrong, but assume that I was simply too tired the previous night and thank him profusely for taking the time to talk to me. I am now where I should have been twenty hours before. Mighty battles ensue with Sun's DHCP and Sunray servers. In the end, I have to delete the Sunray configuration, delete all DHCP configurations, and then add the Sunray configuration back. This works, which annoys me; why are there all these opaque configurations around? Not a single plain-text file in sight. I manage to get a printer working, then another. DHCP is modified so that laptops work as well. I call it a night and head home.
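For reference, the RFC 1918 private ranges are 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16. A hypothetical bash helper for checking whether an address falls in one of them is sketched below; it handles dotted-quad IPv4 only, and the function name is made up.

```shell
# Succeeds (status 0) if the argument is an IPv4 address in an RFC 1918
# private range: 10.0.0.0/8, 172.16.0.0/12 or 192.168.0.0/16.
# Returns 1 for any other valid address and 2 for malformed input.
is_rfc1918() {
    local a b c d octet
    IFS=. read -r a b c d <<< "$1"
    for octet in "$a" "$b" "$c" "$d"; do
        # Each octet must be 1-3 decimal digits with a value of at most 255.
        [[ $octet =~ ^[0-9]{1,3}$ ]] && [ "$octet" -le 255 ] || return 2
    done
    if [ "$a" -eq 10 ]; then
        return 0
    fi
    # 172.16.0.0/12 covers 172.16.x.x through 172.31.x.x.
    if [ "$a" -eq 172 ] && [ "$b" -ge 16 ] && [ "$b" -le 31 ]; then
        return 0
    fi
    if [ "$a" -eq 192 ] && [ "$b" -eq 168 ]; then
        return 0
    fi
    return 1
}
```

For example, `is_rfc1918 192.168.1.50 && echo private` prints `private`, while 172.32.0.1 falls just outside the /12 and fails the check.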

Tags: fail, hardware, network.
Not borked after all!
Sun Mar 18 10:21:08 PDT 2007

While doing some work on one of my WRT54-GL routers last night, I managed to bork OpenWRT: after a reboot, the power LED just kept flashing, and there was no response at its usual IP address. I could ping it on 192.168.1.1 (though, weirdly, I'd only get 3 responses every 30 seconds or so), but neither telnet nor SSH was working.

Some folks suggested getting out the serial cable, or shorting pins on the flash chip, but a simple TFTP did the job.

Now to get OpenVPN going again, and this time without breaking the damn thing!

Tags: hardware.
IBM/Lenovo T60 memory upgrade
Fri May 25 13:41:48 EDT 2007

I bought a T60 for my boss a while back, and have just finished putting in another memory module. Man, I knew this was the lower end of their laptops, but I had no idea it would feel so cheap.

To get at the memory, you take out a few screws on the back, then lift off the palm guard below the keyboard. It's flimsy plastic, and it's hard to get back in the right place - doubly so, since it feels like instead of clicking into place it's going to break. And you need to remove the ribbon that connects the touch pad and fingerprint reader in order to fully remove it; when putting it back in, it looks like it's going to get crimped. That can't be right.

I had been considering getting one of these, despite having fallen in love with my other boss' Dell D420. But this just makes me think that the extra money for the D420 would be worth it. Of course, I haven't had to crack that one open yet…

Tags: hardware, upgrade.
"Failed opcode was: 0xef" considered harmless
Fri Jun 8 14:09:11 EDT 2007

This morning I noticed these entries in the logs of my monitoring machine at work:

hda: drive_cmd: status=0x51 { DriveReady SeekComplete Error }
hda: drive_cmd: error=0x04 { DriveStatusError }
ide: failed opcode was: 0xef
hda: task_no_data_intr: status=0x51 { DriveReady SeekComplete Error }
hda: task_no_data_intr: error=0x04 { DriveStatusError }
ide: failed opcode was: 0xef

After a lot of Googling, I managed to find a few things that explained it:

setting drive keep features to 1 (on)
HDIO_DRIVE_CMD(keepsettings) failed: Input/output error

This is a completely benign error, of course…I really don't care if we have to run hdparm with every boot. I had also tested the drive by booting into Knoppix and md5summing every file on the drive — no errors produced at all.

Don't know what's worse — wasting two hours on this, or not noticing it before now. At any rate, this failed opcode appears to be completely harmless.

Tags: hardware.
I'm sorry, let's try that again
Tue Jun 19 22:59:01 EDT 2007

My wife was using her iBook tonight when alla sudden Apple Mail said the Inbox was read-only. Wha'? Couldn't remove or create files from the Terminal, and /var/log/system.log showed this message:

kernel: disk0s3: I/O error
kernel: jnl: do_jnl_io: strategy err 0x5

A lot of scary messages turned up in the search results about replacing hard drives, memory and mainboards, but I decided to try a fsck for the fun of it. Splat-s sent the Apple into single-user mode, and then fsck -f -y said the volume had been repaired successfully. Reboot and things look good: I can create and remove files, and Apple Mail is fine. Interestingly, the disk said it had an extra GB free compared to before the reboot.

The drive is old, and may still need replacing. Thankfully, I've set up a cron job on this thing to rsync the home directory daily to another machine.
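A crontab entry along these lines does that sort of job. It's a sketch rather than my actual entry: the time, user name, backup host and destination path are placeholders, and it assumes SSH keys are set up so rsync can log in to the other machine without a password.

```
# min hour dom mon dow  command
# Hypothetical: copy the home directory to another machine every day at 12:30.
# "someuser" and "backuphost" are placeholders.
30 12 * * * rsync -a -e ssh /Users/someuser/ backuphost:backups/ibook/
```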

Tags: hardware.
Mail server
Sat Sep 8 18:29:36 PDT 2007

Just when I was about to sign off for the day, suddenly the mail server's down. No response to pings, no response on the console server even. It's an old E220R, and while it's underpowered for all we're asking from it, I haven't had problems with it before. (Well, except for the CDROM drive not powering up. But I can live with that.)

So drive into work with the wife and kid, on the off chance that it'll all be fine quickly. No such luck. It hadn't walked away, the cables were all still in place, and I had to power cycle it to get it to come back up. A lot of fscking later, I'm waiting for it to finish booting. I can't remember how long it took the last time I rebooted it, but this time it seems ridiculously slow (20 minutes). More stuff to add to the documentation once I'm done…

And once more: sysadmin documentation MUST NOT depend on external services. (The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.)

Time for pizza.

Tags: hardware.
If'n it ain't one thing...
Mon Sep 10 17:22:16 PDT 2007

...it's another. Busted CPU on a Sun 440 at the university across town meant I spent a bigger part of my day on the bus than usual. Remove the CPU card/assembly/whatever (god, they're mother huge) and we're back in business.

Incidentally, it amazes me that you can turn up fully spec'd V440s on Ebay for, like, $8000 US. 4 x 1GHz CPUs, 16 GB of RAM, 4 x 72GB drives...what's not to like?

Tags: hardware.
Resume, laptop, presentation
Sun Sep 23 13:31:40 PDT 2007

Just updated my resume for the first time since starting my current job. It's nice to look back at what you've done and realize that, hey, there's been a lot.

In other news, I finally gave in to lust the other day and bought a Dell C400 on eBay. Nothing too special — 1.2GHz, 256MB, 30GB hard drive — but I was mainly after the 12" screen, so that I'd be able to (say) debug raw ethernet frames on my daily commute. About $280 when all was said and done; the strong Canuckistan peso was part of the incentive to buy now. Should be at the office in a week or so, and I can't wait.

It amazed me to see how many off-lease laptops were available, and just how cheap you could pick them up. A while back my boss got a D420; with extra memory and a few other things, it came in at about $1700 or so Canadian. But if you look around, there are plenty of D400s and D410s around for less than $500 — even less than $400 if you look hard. Add another $100 (say) for a working battery, and you're in pretty good shape.

Virtualbox has made it to Debian testing — hurrah! Only it won't run (Open)?Solaris. Dang.

On Tuesday, I'm giving a short presentation on my work's subnet at SNAG, the UBC System and Network Administrator's Group. I found Bruce in OpenBSD's ports tree on my laptop; the documentation is (ahem) thin, but it works. Wish me luck.

And there's Arlo up. Time to go get him.

Tags: bsd, dell, hardware.
It's here!
Sat Oct 6 19:53:54 PDT 2007

The laptop I bought off eBay arrived at work on Wednesday...which is my day at home with Arlo. Thursday I was off sick with flu. Yesterday I was back at work and slashing open the box it came in, eager to see what I'd got.

Well, I already knew: it's a Dell C400. 12" screen, 1.2GHz P3 (but running at 800MHz with SpeedStep and all), 256MB RAM and a 30GB drive. Not a whole lot of memory, and a bigger hard drive would always be nice, but I can always upgrade. There's no CD drive in this thing, and I hadn't plumped for the docking station, so I set up PXE booting to install Debian. It was a trifle slow, but it worked! (Especially the second time, after I'd accidentally overwritten Debian trying to install OpenBSD on another partition. :-)

I'm surprised at how much Just Works in this thing: X.org (no configuration needed, just start up XDM…man, that's nice), suspend-to-disk, ethernet (well, it's a 3c905; what do you expect?). Even the battery, which I'd written off in advance, appears to hold a decent charge — about four hours so far. The one thing that's dicey is the onboard wireless, a Dell 1370 from everybody's favourite company. But again, I'd written that off in advance.

Next up: I've ordered the OpenBSD 4.2 CD set, so I'll be installing that once it arrives. And Noah has shown the way to longer battery life; I'm getting my 2.6.22 kernel now from Backports. (Oh, the shame of not compiling my own kernel...)

On another note, I think someone had one too many Dilbert moments:

$ dig newcastle.edu.au mx

; <<>> DiG 8.3 <<>> newcastle.edu.au mx
;; res options: init recurs defnam dnsrch
;; got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 4, ADDITIONAL: 4
;; QUERY SECTION:
;;      newcastle.edu.au, type = MX, class = IN

;; ANSWER SECTION:
newcastle.edu.au.       11h59m12s IN MX  10 proactive.newcastle.edu.au.
newcastle.edu.au.       11h59m12s IN MX  10 synergy.newcastle.edu.au.

Perhaps they got the names from /dev/bollocks.

Tags: bsd, dell, hardware.
OpenBSD wins this one...for now!
Sat Oct 20 21:11:21 PDT 2007

I ordered the 4.2 CD set of OpenBSD at work, in another optimistic step toward reorganizing the firewall there. In order to (ahem) road-test it, I installed it on my new laptop (which, you'll recall, is running Debian Stable) in a 5GB partition I'd left for just this purpose.

Onboard wireless, like with Debian, did not work, and I didn't expect it to; fuck you too, Broadcom. But my dad offered to send out a couple of wireless cards he couldn't use, and I figured one of 'em would have to work.

One was a Broadcom (op cit.), so that was out. The other, a DWL-650 (which appears to have umpty different versions over the years with not one change in model number) looked promising: a Realtek chipset, so should be good, right?

Well, it worked on OpenBSD -- but not in Linux. There's no driver in the tree for it, and the outside project to make drivers for it had its last official release in 2005. What's more, the CVS version, for some reason, removes all of its source files when I compile it, then complains that there are no files left to compile. To be fair, I think this is because of a makefile included from /lib/modules/2.6.22-2-686/build rather than the code itself.

Update: Just read Tourrilhes' page on the RealTek driver, and learned something: there's a fork/resurrection of the project I'd looked at, and it appears to be relatively current. I'll have to take a look. SooperUpdate: the new project fixes the let's-delete-all-the-files problem. Score!

What OpenBSD does not do on this laptop is suspend -- or more accurately, come back from suspension. This works reasonably well under Debian, which means that I still have one rose to give away to The Next Laptop OS for Saint Aardvark.

Tags: bsd, hardware.
Working wireless for Linux on a Dell C400
Fri Oct 26 13:47:47 PDT 2007

Turns out you can get the built-in Broadcom wireless card in my laptop (Dell C400) to work, but it did take me a bit of effort.

First off, I'd been looking at the wrong web page for the BCM43XX project — the right one, as Prakash pointed out, is much more up-to-date.

Second, again at Prakash's suggestion (thanks for that!), I downloaded the drivers for the Dell 1370. Running the .exe in Wine extracted the .sys file successfully. However, when I pointed fwcutter at them I got this message:

Sorry, the input file is either wrong or not supported by b43-fwcutter.
This file has an unknown MD5sum 8d49f11238815a320880fee9f98b2c92.

So that .sys file was one not supported…at least, not for a while now. That commit message was one of the few I could find that mentioned this number. So I checked out revision 396 from the Subversion repo, compiled it and pointed at the sys file…success! Extraction!

Except that it still didn't work:

bcm43xx: Error: Microcode "bcm43xx_microcode5.fw" not available or load failed.

Turns out it had extracted all the files to /lib/firmware/bcm430x_*, rather than /lib/firmware/bcm43xx_*. Quick little shell-fu:

cd /lib/firmware && for i in bcm430x_* ; do sudo ln -s "$i" "bcm43xx_${i#bcm430x_}" ; done

and it worked when next I inserted the module…working right now, in fact, despite lots of error messages like:

bcm43xx: WARNING: Writing invalid LOpair (low: 0, high: -115, index: 120)
 [<d0ba6ebb>] bcm43xx_phy_lo_adjust+0x1e6/0x223 [bcm43xx]
 [<d0ba7d04>] bcm43xx_phy_lo_g_measure+0x915/0xaeb [bcm43xx]
 [<c01eb6db>] bit_cursor+0x479/0x48e
 [<c02a4416>] __sched_text_start+0x686/0x73b
 [<d0b9dde4>] bcm43xx_periodic_work_handler+0x15c/0x407 [bcm43xx]
 [<d0b9dc88>] bcm43xx_periodic_work_handler+0x0/0x407 [bcm43xx]
 [<c0130260>] run_workqueue+0x7d/0x109
 [<c0133308>] prepare_to_wait+0x12/0x49
 [<c0130a5d>] worker_thread+0x0/0xc7
 [<c0130b17>] worker_thread+0xba/0xc7
 [<c01331f5>] autoremove_wake_function+0x0/0x35
 [<c013312e>] kthread+0x38/0x5e
 [<c01330f6>] kthread+0x0/0x5e
 [<c01049c3>] kernel_thread_helper+0x7/0x10

in the kernel log.

No idea why I had to go through so much rigamarole, but hopefully this will save time for someone else. Oh, and for the record: this is with Debian Etch, 2.6.22 kernel from backports.org.

Tags: dell, hardware.
Vishnu ate my laptop
Thu Nov 1 21:02:39 PDT 2007

Dude, my laptop screen just turned blue. I'd booted into OpenBSD (4.2) and was trying to figure out how to turn off the audible bell. I'd gone from X to a virtual console to see if the problem happened there (it did), then tried ctrl-alt-f5 to get back to X.

My laptop screen turned from black with white text to grey with grey text to light blue with dark blue text, over the course of a minute or so. I thought I'd suddenly borked the LCD screen, but when I rebooted to Debian it was all fine. Just tried switching to a console, then back to X (also in Debian), and that's fine too. Bizarre.

Just checked the logs in OpenBSD and found a series of entries like this:

Nov  1 16:47:17 laptop /bsd: agp_release_helper: mem 0 is bound
Nov  1 16:47:17 laptop /bsd: agp_release_helper: mem 1 is bound
Nov  1 16:47:17 laptop /bsd: agp_release_helper: mem 2 is bound
Nov  1 16:47:17 laptop /bsd: agp_release_helper: mem 3 is bound
Nov  1 16:47:17 laptop /bsd: agp_release_helper: mem 4 is bound
Nov  1 16:47:24 laptop /bsd: agp_release_helper: mem 5 is bound
Nov  1 16:47:24 laptop /bsd: agp_release_helper: mem 6 is bound
Nov  1 16:47:24 laptop /bsd: agp_release_helper: mem 7 is bound
Nov  1 16:47:24 laptop /bsd: agp_release_helper: mem 8 is bound
Nov  1 16:47:24 laptop /bsd: agp_release_helper: mem 9 is bound
Nov  1 16:47:31 laptop /bsd: agp_release_helper: mem 10 is bound
Nov  1 16:47:31 laptop /bsd: agp_release_helper: mem 11 is bound
Nov  1 16:47:31 laptop /bsd: agp_release_helper: mem 12 is bound
Nov  1 16:47:31 laptop /bsd: agp_release_helper: mem 13 is bound
Nov  1 16:47:31 laptop /bsd: agp_release_helper: mem 14 is bound
Nov  1 16:47:38 laptop /bsd: agp_release_helper: mem 15 is bound
Nov  1 16:47:38 laptop /bsd: agp_release_helper: mem 16 is bound
Nov  1 16:47:38 laptop /bsd: agp_release_helper: mem 17 is bound
Nov  1 16:47:38 laptop /bsd: agp_release_helper: mem 18 is bound
Nov  1 16:47:38 laptop /bsd: agp_release_helper: mem 19 is bound

Very weird. On the bus, so Googling that'll have to wait. Although I do have the code on that partition…here we go: says it's the AGPIOC_RELEASE ioctl for agp. Aha! Maybe I'll explain money laundering while I'm at it.

And btw, here's a memo for the world: if you're on the toilet, don't take a phone call. It's really not that important.

Update, October 15 2008: Still happening with OpenBSD 4.3. And for the record, this is a Dell C300 laptop.

Tags: bsd, dell, hardware.
Power outage
Tue Nov 6 20:23:23 PST 2007

We had a power outage today at work. The good news is, the UPSes worked. The bad news is, the servers were not set to shut themselves down automatically, and the UPSes ran out literally two minutes before the power came back on. Arghh.

Having a flashlight in the server room is a good thing. So is making sure that your servers are all connected to switches powered by the UPS. So is making sure that you have a laptop with a charged battery and a ready-to-use serial cable connected to your otherwise-accessible-through-SSH console server. So is Sun making an x86-based OS that doesn't hang every time it reboots badly.

In other news: as mentioned on the DragonFly BSD Digest, ICANN blogs (!). They've taken this moment to let us know that the address of L.ROOT-SERVERS.NET has changed. Now you know.

Tags: hardware.
f(220R) = 280R
Wed Nov 21 20:56:03 PST 2007

At work, our mail server is an aging E220R. While underpowered for all it does, it has behaved well, more or less, until recently.

A couple of months ago it power cycled itself for no apparent reason. This weekend, it did the same thing. This is exactly the same behaviour I saw from another E220R at $other_university, and in that case it got progressively worse. Another sysadmin here says he's seen the same behaviour with two in his care. I'm preparing for the worst.

Part of that has meant preparing to move its functionality to another machine; this has been an excellent chance to delve into the bowels of our mail and list system. I've been steadily improving (read: creating) this for some time now, but this points out some bits I hadn't gotten to yet. So that's good.

Plan C is a loaner E280R from the other sysadmin (op cit.). I ran into trouble getting it working, though. First, I couldn't get a serial console working. (Getting a serial port working always seems to be a pain for me, no matter what the machine.) It has two of the old DB-25 ports; no problem, since I had a splitter and had got that working on the E220R. Except that it didn't work: no matter which port I hooked it up to, I couldn't see any output. I tried flipping the key around to diagnostic mode, but I still didn't see anything. (The manual said that you should be able to force output to ttyA by power-cycling the machine and hitting the power button twice when the amber service LED started blinking…but I never saw the blinking.)

This was especially weird to me because I had been able to get output from the RSC card using the same setup: OpenBSD laptop -> usb serial adapter -> DB-9 to RJ-45 adapter -> Cat 5 cable -> RJ-45 on RSC card. (The only difference was that, with the DB-25 port, the Cat5 cable had fit into the back of the DB-25 splitter.) But I couldn't log into the RSC card, and a quick Google turned up no easy way of resetting its password. (Putting it into the other E280 I have, which runs our database and website, was not an option.)

Out of desperation I finally hooked up the Cat5 to the DB-25 splitter on one side, and the console server on the other…and that worked. Damned if I know what was going on.

But then I had another problem: when it booted, I kept seeing line after line of I2C reset error; after a while, it would power-cycle itself and the pattern would start again. I remembered that op cit. had slotted the second CPU for me, so what the hell: I reseated it, and that did the trick.

Next up is detaching $failing_machine's second hard drive from the mirror and seeing if I can get it to boot in the 280. Let's hope.

In other news, LinuxFest Northwest is calling for papers. Were that not right around the due date of Project U-14, I might try submitting something and see what happens. Oh well...next beer in Jerusalem!

And there's the laptop battery...shoulda charged it at work.

Tags: hardware, solaris.
Scratch that
Fri Nov 23 05:55:34 PST 2007

E280R takes different SCSI drives than the E220R. Serial ports and SCSI connectors: A Study in Nemesisssysadminss. Discuss.

Tags: hardware.
Coming up
Fri Jan 18 06:07:07 PST 2008

My laptop hard drive started giving scary errors a couple days ago on the way to work (I've got a 90-minute commute by public transit [uck] so I fill the time by reading, listening to podcasts, or working on Project U-13). Fortunately, working at a university means that there are two computer stores on campus. I ran out at lunch, picked up a 100GB drive, and had things back to normal by the next morning.

Well, normal modulo one false start with Debian: I decided to try encrypted filesystems just for fun. But then I suspended, came back with a newer kernel, and it could not read the encrypted LVM group anymore. Whoops.

Still lots of free space on this thing, and I'm thinking of installing Ubuntu, FreeBSD and maybe NetBSD just for fun. Of course, I've got to do it all via PXE since this thing doesn't have any CDROM drive, but that just adds to the geek points.

Project U-13 is coming up on 0.0.3, btw; Andy suggested adding Rackmonkey, which looks quite cool. There's no package for it, so I'm having to do some rather ugly scripted installation…but I can stand it for now. And I've got the barest skeleton of a cfengine file in there too. Watch the skies!

Tags: bsd, cfengine, hardware, projectu13.
New toy
Fri Jan 25 13:12:30 PST 2008

My workplace just got me a new cell phone: the Sony Ericsson W200a Sony Walkman Phone. The provider is Rogers; minus two points for not letting me make an MP3 into a ring tone, but plus three for letting MidpSSH work. It was a lark to be able to check mail on my firewall box; Mutt was surprisingly useful. No idea how much data costs on the plan I've got, and I don't plan on actually SSHing around very much, if at all…but still, fun. And, as mentioned elsewhere, kudos for including a USB cable and making it show up as an ordinary mass storage device.

1 comment. Tags: hardware.
Heh ---
Sun Feb 3 08:50:14 PST 2008

Matthew Garrett's presentation on Suspend-to-Disk makes fun reading.

Arlo's sick with flu or something; I was up 'til 1am last night rocking him to sleep. Haven't done that in a while…

Telling detail: I'm about to blow away Debian testing on my desktop machine and install Ubuntu's Gutsy Gibbon. Partly it's because I'm tired of installing 80MB worth of updates every two weeks, and partly it's because it'll make setting up the printer a breeze.

I'll probably leave half the drive aside for good ol' Debian stable, but Ubuntu'll stay there for experimenting and so my parents, on their next visit, will not have to bring out their 4-tonne laptop.

I'll be reinstalling Ubuntu on my laptop as well; due to a stupid error, I installed Dapper, not Gutsy. I tried updating in one fell swoop, and after three days of apt-get -f install I finally got things working…except for the boot artwork, and GDM doesn't start one time out of three. Interesting experiment, but I think I'll take a do-over.

I may even install it twice, so that I can try out The Depenguinator, which appears to be a lot easier than trying to figure out PXE booting for FreeBSD. Unlike OpenBSD, there's no readily apparent "official way" of doing it, and the handful of HOWTOs I've found have contradicted each other. At this point I'm just too lazy to keep trying and seeing what I'm doing wrong.

Tags: bsd, geekdad, hardware, linux.
My perfect notebook
Wed Feb 6 12:32:17 PST 2008

I agree completely with Chris Siebenmann's entry on the utility of keeping a notebook. I've done this almost as long as I've been working in IT, and it's saved my ass repeatedly. Also, the way I keep my journal — random notes at the front working toward the back, daily summary at the back working toward the front — means that it's fairly simple to search for my notes on a particular task, or explain to management just what I do with my time.

I love paper. I tried a PDA for a while; hated it, didn't trust it, and gave it up promptly. Scribbling with a pen is faster, more satisfying, and doesn't make me wait for something to reboot or awaken, or force me to learn a different way to scribble. At the best of times, it forces me to think a bit about what I'm doing or seeing, rather than just typing blindly at the problem. (What, you never do that?)

But while a paper notebook is wonderful, it's not perfect. Here's what would be perfect:

Let me paste screen captures right into my notebook. (I'm talking both screenshots and the log files from GNU screen.)

Let me paste sections of my .history file into my notebook complete with timestamps.

Let me cut-and-paste from my notebook to Emacs (or vi, you heathens), and vice-versa.

Let everything I write or paste be timestamped automagically.

Let everything I write or paste be sync'd automagically to some plain text-like format, suitable for grepping, munging, merging into a database, pushing to syslogd, or what have you.
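Half of that wishlist is doable today with a bit of glue. Bash writes a "#<epoch>" comment line above each command in its history file when HISTTIMEFORMAT is set, so pasting history complete with timestamps is mostly a parsing problem. Here's a minimal sketch that turns such a history file into greppable, tab-separated notebook lines. The function names and output format are mine, invented for illustration, and it naively treats every non-timestamp line as its own command (a multi-line command only gets a stamp on its first line).

```python
import re
from datetime import datetime, timezone

# With HISTTIMEFORMAT set, bash writes "#<epoch seconds>" above each command.
STAMP = re.compile(r"^#(\d+)$")

def parse_history(text):
    """Return a list of (epoch_seconds or None, command) pairs."""
    entries = []
    pending = None  # timestamp waiting for the command it belongs to
    for line in text.splitlines():
        match = STAMP.match(line)
        if match:
            pending = int(match.group(1))
            continue
        if line.strip():
            entries.append((pending, line))
            pending = None
    return entries

def as_notebook_lines(entries):
    """Format entries as 'ISO-8601 timestamp<TAB>command' lines for grep."""
    out = []
    for stamp, command in entries:
        if stamp is None:
            when = "unknown"
        else:
            when = datetime.fromtimestamp(stamp, tz=timezone.utc).isoformat()
        out.append(f"{when}\t{command}")
    return out

if __name__ == "__main__":
    sample = "#0\nls -l /etc\n#60\ntail -f /var/log/messages\n"
    for line in as_notebook_lines(parse_history(sample)):
        print(line)
```

Feed it ~/.bash_history, append the output to a plain text file, and that's the grep-friendly half of the perfect notebook. The pen-and-paper half is still on me.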

Tags: hardware.
Fiendish Giggle
Fri Feb 8 10:10:54 PST 2008

New Dell 2950 server. 2 x quad-core Xeons, 2 x 6MB cache on each die, 16GB RAM, 6 x 300GB SAS 10K SCSI drives in a RAID-6 array using the PERC/6 controller.

/usr/src/linux-source-2.6.18# time make -j 9 bzImage
[snip]
Root device is (8, 3)
Boot sector 512 bytes.
Setup is 7295 bytes.
System is 1222 kB
Kernel: arch/i386/boot/bzImage is ready  (#1)

real    0m22.668s
user    2m20.425s
sys     0m14.537s

That's just insane.
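For the record, here's what those numbers actually say. Dividing CPU time (user + sys) by wall-clock time gives the average number of cores kept busy during the build. This little sketch does that arithmetic, assuming the 8 cores from the two quad-core Xeons above (-j 9 being the usual cores-plus-one rule of thumb); the function names are mine.

```python
def to_seconds(field):
    """Convert a bash `time` field such as '2m20.425s' into seconds."""
    minutes, _, rest = field.partition("m")
    return int(minutes) * 60 + float(rest.rstrip("s"))

def effective_parallelism(real, user, sys_time):
    """Average number of CPUs busy: total CPU time over wall-clock time."""
    return (to_seconds(user) + to_seconds(sys_time)) / to_seconds(real)

if __name__ == "__main__":
    cores = 8  # 2 x quad-core Xeons
    busy = effective_parallelism("0m22.668s", "2m20.425s", "0m14.537s")
    print(f"{busy:.2f} of {cores} cores busy on average ({busy / cores:.0%})")
```

Call it 6.8 cores out of 8: make kept nearly every core working for the whole build.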

Tags: dell, hardware, linux.
Saturday work
Sun Mar 30 14:51:44 PDT 2008

Yesterday I spent the day at work testing our installation of APCUPSd and tidying up the goram rat's nest of network and electrical cables my predecessor left me.

APCUPSd worked with only a few hitches:

  1. I had one machine polling a UPS, and told it to shut down when there was 30% charge left. The other machines, which poll the master, were set to shut down 30 seconds after the power went out. They shut down, but that bumped up the charge reading on the battery because the load was that much lower. So I didn't get to test the automatic shutdown of the master.
  2. The other three machines were all set to shut down after 30 seconds; however, NFS cross-mounting made for problems with one of them. I'll need to stagger those three machines, whether they're looking at the charge or just shutting down n minutes after the power goes out.
  3. The Solaris 10 box shut down just fine, but when it restarted it did not let me log in, even on the console. Since Solaris 10's boot sequence is dead silent by default (thank you, Sun), it was hard to be sure what was happening. The last time I was patching this machine, reboots took 10 minutes; I gave it 20 this time before giving up and going to single-user mode. The problem appears to be /etc/nologin, left over from the shutdown: it prevented a login prompt from coming up even on the console, without any sort of warning. Arghh.
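For next time, the staggering in point 2 can be worked out mechanically: any machine that mounts NFS from another has to be down before the machine it mounts from. Here's a minimal sketch using Python's graphlib (3.9+) that turns a map of NFS mounts into per-host delays, which would then go into each slave's TIMEOUT setting in apcupsd.conf. The host names, the 30-second first wave and the 60-second gap between waves are all made up for illustration.

```python
from graphlib import TopologicalSorter

def shutdown_schedule(mounts, first_delay=30, step=60):
    """mounts maps each host to the set of NFS servers it mounts from.

    Returns {host: seconds after power loss} such that every NFS client
    is scheduled before any server it depends on. Hosts in the same wave
    share a delay. Raises graphlib.CycleError on cross-mounts.
    """
    ts = TopologicalSorter()
    for client, servers in mounts.items():
        ts.add(client)
        for server in servers:
            # The server must outlive the client: client is its predecessor.
            ts.add(server, client)
    ts.prepare()  # detects cycles here
    schedule, wave = {}, 0
    while ts.is_active():
        ready = ts.get_ready()
        for host in ready:
            schedule[host] = first_delay + wave * step
        ts.done(*ready)
        wave += 1
    return schedule

if __name__ == "__main__":
    mounts = {"web1": {"nfs1"}, "build1": {"nfs1"}, "nfs1": set()}
    for host, delay in sorted(shutdown_schedule(mounts).items()):
        print(f"{host}: shut down {delay}s after power loss")
```

A true cross-mount (A mounts from B and B mounts from A) is a cycle, and prepare() raises graphlib.CycleError. There's no safe order in that case, which is a decent argument for untangling the mounts rather than fiddling with timers.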

As for the cleanup: satisfying. I'm no longer quite so ashamed of the server room.

Tags: hardware.
Can a mouse crash?
Wed Jul 2 09:43:02 PDT 2008

Just had a repeat of the weird mouse-X disconnect I've encountered before. This time though, I'm running Debian Etch — so no more blaming the problem on SuSE (as I secretly always did :-).

One noticeable problem this time was that the middle button did not work, making click-to-paste impossible; I even ran xev and saw no events for middle-clicking. (This in addition to clicking being inconsistent, the client receiving the click being inconsistent, etc). Running cat /dev/input/mouse0 did not work. What did work was disconnecting the mouse (a USB 3-button optical jobbie), then plugging it back in. Sure, coulda been the mouse driver, or X, or something, but I wonder if the hardware itself — whatever little controller chip is in there — maybe got wedged. Interesting to think about…

Tags: hardware.
That's a mighty big catchup I got goin' there
Thu Sep 25 06:14:13 PDT 2008

Work...hell, life is busy these days.

At work, our (only) tape drive failed a couple of weeks ago; Bacula asked for a new tape, I put it in, and suddenly the "Drive Error" LED started blinking and the drive would not eject the tape. No combination of power cycling, paperclips or pleading would help. Fortunately, $UNIVERSITY_VENDOR had an external HP Ultrium 960 tape drive + 24 tapes in a local warehouse. Hurray for expedited shipping from Richmond!

Not only that, the Ultrium 3 drive can still read/write our Ultrium 2 media. By this I mean that a) I'd forgotten that the LTO standard calls for R/W for the last generation, not R/O, and b) the few tests I've been able to do with reading random old backups and reading/writing random new backups seem to go just fine.
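To save my future self the forgetting: for LTO-1 through LTO-7, a drive reads cartridges up to two generations back and writes one generation back. Here's a sketch of that rule as a lookup (the function name is mine). It deliberately refuses LTO-8 and later, where the compatibility rule is different and I'd rather not encode it from memory.

```python
def lto_compat(drive_gen, media_gen):
    """Return 'rw', 'ro' or None for an LTO drive/cartridge pairing.

    Encodes the LTO-1..LTO-7 rule: write one generation back,
    read two generations back. Later generations are refused on purpose.
    """
    if not (1 <= drive_gen <= 7 and 1 <= media_gen <= 7):
        raise ValueError("this sketch only covers LTO-1 through LTO-7")
    generations_back = drive_gen - media_gen
    if generations_back in (0, 1):
        return "rw"
    if generations_back == 2:
        return "ro"
    return None  # newer media, or more than two generations old

if __name__ == "__main__":
    for media in (1, 2, 3):
        print(f"Ultrium 3 drive, Ultrium {media} tape:", lto_compat(3, media))
```

So lto_compat(3, 2) is "rw", which matches what I've seen with the new drive and our Ultrium 2 tapes, while lto_compat(3, 1) is "ro".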

Question for the peanut gallery: Has anyone had an Ultrium tape written by one drive that couldn't be read by another? I've read about tapes not being readable by drives other than the one that wrote it, but haven't heard any accounts first-hand for modern stuff.

Another question for the peanut gallery: I ended up finding instructions from HP that showed how to take apart a tape drive and manually eject a stuck tape. I did it for the old Ultrium 2. (No, it wasn't an HP drive, but they're all made in Hungary...so how many companies can be making these things, really?) The question is, do I trust this thing or not? My instinct is "not as far as I can throw it", but the instructions didn't mention anything one way or the other.

In other news, $NEW_ASSIGNMENT is looking to build a machine room in the basement of a building across the way, and I'm (natch) involved in that. Unfortunately, I've never been involved in one before. Fortunately, I got training on this when I went to LISA in 2006, and there's also Limoncelli, Hogan and Chalup to help out. (That link sends the author a few pennies, BTW; if you haven't bought it yet, get your boss to buy it for you.)

As part of the movement of servers from one data centre across town to new, temporary space here (in advance of this new machine room), another chunk of $UNIVERSITY has volunteered to help out with backups by sucking data over the ether with Tivoli. Nice, neighbourly thing of them to do!

I met with the two sysadmins today and got a tour of their server room. (Not strictly necessary when arranging for backups, but was I gonna turn down the chance to tour a 1500-node cluster? No, I was not.) And oh, it was nice. Proper cable management...I just about cried. :-) Big racks full of blades, batteries, fibre everywhere, and a big-ass robotic Ultrium 2 tape cabinet. (I was surprised that it was 2, and not U3 or U4, but they pointed out that this had all been bought about four or five years ago…and like I've heard about other government-funded efforts, there's millions for capital and little for maintenance or upgrades.)

They told me about assembling most of it from scratch...partly for the experience, partly because they weren't happy with the way the vendor was doing it ("learning as they went along" was how they described it). I urged them to think about presenting at LISA, and was surprised that they hadn't heard of the conference or considered writing up their efforts.

Similarly, I was arranging for MX service for the new place with the university IT department, and the guy I was speaking to mentioned using Postfix. That surprised me, as I'd been under the impression that they used Sendmail, and I said so. He said that they had, but they switched to Postfix a year ago and were quite happy with it: excellent performance as an MTA (I think he said millions of emails per day, which I think is higher than my entire career total :-) and much better Milter performance than Sendmail. I told him he should make a presentation to the university sysadmin group, and he said he'd never considered it.

Oh, and I've completely passed over the A/C leak in my main job's server room…or the buttload of new servers we're gonna be getting at the new job…or adding the Sieve plugin for Dovecot on a CentOS box...or OpenBSD on a Dell R300 (completely fine; the only thing I've got to figure out is how it'll handle the onboard RAID if a drive fails). I've just been busy busy busy: two work places, still a 90-minute commute by transit, and two kids, one of whom is about to wake up right now.

Not that I'm complaining. Things are going great, and they're only getting better.

Last note: I'm seriously considering moving to Steve Kemp's Chronicle engine. Chris Siebenmann's note about the attraction of file-based systems for techies is quite true, as is his note about it being hard to do well. I haven't done it well, and I don't think I've got the time to make it good. Chronicle looks damn nice, even if it does mean opening up comments via the web again…which might mean actually getting comments every now and then. Anyhow, another project for the pile.

Tags: backups, hardware, lisa, meta, networking, work.
Insomnia
Wed Oct 1 15:27:30 PDT 2008

The good thing about being up at 3am is that, with a laptop, you can keep yourself entertained by whipping up a quick spreadsheet of the rack, switch and console server layout for the new server room.

The bad thing is that you may not trip over Sun's handy-dandy power calculators (like the ones for the X4140 or the X4440) until the next day, leaving you twelve hours to wonder blearily if you've blown your server room's power budget all in one go.

Tags: hardware.
Now *there's* unexpected
Tue Oct 7 12:16:58 PDT 2008

Seen while applying software updates to a new Mac at $WORK:

The Aluminum Keyboard Firmware Update will update the keyboard
firmware on your aluminum Apple Keyboard.  Important:  Do not interupt
the update, your keyboard will not function while it is being updated.

I guess a mouse crashing is not entirely out of the question...

Tags: hardware.
What I've been up to lately
Thu Nov 13 16:07:28 PST 2008

The last few weeks, I've been setting up a small (5 racks) server room with the purchases that $OTHER_JOB recently made: 10 Sun X4140s, 2 — wait, 4 — X4240s, and one Thumper.

It's occupied a lot of my time, and before I lose the impulse, or fall asleep on my feet (second kid up at 4:30am for the last week or so; simultaneous discovery that at 4:30am I have a hard time getting back to sleep), I want to put down the things I learned.

But...my first batch of homebrew beer has been bottled, and a second brew day is coming up on Saturday. And apparently I'm not the only sysadmin who brews...though I'm not nearly ready to do all-grain just yet.

Tags: beer, hardware.
This is The Working Hour; we are paid by those who learn by our mistakes
Tue Nov 18 20:21:26 PST 2008

I'm in the process of setting up a bunch of new servers for $job_2. All but one are CentOS 5.2, kickstart installed and managed with cfengine. This is the third time I've gone through a cfengine setup, and it always feels like starting from scratch. It seems (and I'm not at all sure this is fair or accurate) that each time I set up one of these systems, there's a lot that I've lost from the last time and have to relearn. I'm fortunate this time that I can refer to $job_1's setup to see how I did things last time, but if I didn't have that I'd be significantly further behind than I am.

I'm not sure what the solution is. Part of me thinks I should just be more aggressive about taking notes, or committing stuff to a private repository, or writing it down here more; part of me thinks that this might be a clue that cfengine is too low-level for my head. It feels like when I was trying to learn C, and couldn't believe that I had to remember all this stuff just to print something, or read a file, or connect to another machine over the Internet. By contrast, Perl (or any other scripted language) was such a relief…just print, or open, or use the Net::Telnet module, or whatever. The details are there and they are important, sometimes very much so; that doesn't mean I want to learn more metallurgy every time I need a fork. (No, I don't think that metaphor's tortured; why do you ask?)

Another thing is that I'm trying to get multipath connections working for the first time. We've got two database servers, each of which is connected via dual SAS HBAs to outboard disk arrays. (I don't think anyone else calls them "outboard", but I like the sound of it. See this hard drive? It's outboard, baby!) The arrays are from Sun and come with drivers, but the documentation is confusing: it says it's available for RHEL 5 (aka CentOS 5), but the actual download says it's only for RHEL 4.

As a temporary respite, I'm trying to see if I can get these working using Linux's own multipath daemon, and it's also confusing. The documentation for it is tough to track down, and I just don't understand the different device names: am I meant to put /dev/dm-2 in fstab, or /dev/mpath/mpath2p1? If the latter, why does the name sometimes change to the WWID (/dev/mpath/$(cat /dev/random)) when I restart multipathd? (user_friendly_names is uncommented in the config file.) If the whole point of multipath is failover, why does this sequence:

(where /mnt is where I've got this array mounted, obvs) sometimes work, and sometimes end with "I/O error" being logged, and the filesystem being read-only? Is this the sort of thing that the Sun driver will fix? I can't find anything about this.
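Partly to answer my own question about device names, here's a small sketch that lists which kernel dm-N node each friendly name points at. It assumes the friendly names under /dev/mpath (and, on some systems, /dev/mapper) are symlinks, and silently skips anything that isn't; the function name is mine. Since dm-N numbers are handed out at activation time and aren't promised to stay put, mounting by filesystem UUID or LABEL in fstab sidesteps the naming question entirely.

```python
import os

def link_targets(directory):
    """Map each symlink in `directory` to the basename of what it resolves to."""
    mapping = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.islink(path):  # plain device nodes are skipped
            mapping[name] = os.path.basename(os.path.realpath(path))
    return mapping

if __name__ == "__main__":
    for d in ("/dev/mpath", "/dev/mapper"):
        if os.path.isdir(d):
            for name, target in link_targets(d).items():
                print(f"{d}/{name} -> {target}")
```

Running it before and after a multipathd restart would at least show whether the friendly name or the dm-N node is the one moving around.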

And I mentioned electrical problems. When we got our servers installed, the Sun guys told us they'd tripped breakers on the PDU and/or breakers in the room's electrical cabinet. Since it had a sign on it saying "100A", I figured we might be running up against power limits, either in the room as a whole, if my figures were 'way out, or on individual PDUs. Turns out I was probably wrong: I missed the bit on the sign that said 3-phase, which means (deep breath) we probably have 3 x 100A power available (I think).

It's more complicated than that, because some of it is in 120V, some of it is in twist-lock 220V 30A circuits, and so on. But I should've checked before emailing the faculty member who, in a year or two, will be going into this room (we're there as guests of the department) and happens to sit on the facilities committee. He had asked how we were doing, so I sent him an email — nice, polite, and including a bit about how grateful we were for the room and the help of the local sysadmins (all of which is true).

I was under the impression that he was asking for info now, so that he could bring it up for action in a few months when we were out. Instead, two hours later when I'm swearing at multipath, in come the facilities manager and one of the sysadmins I was dealing with, looking to find out just how much power we were using anyhow. I apologized profusely, and they were very cool about it. But when the committee guy asks questions, people jump. I had not anticipated this. Welcome to University Politics 101. I emailed again and explained my mistake.

There are lots of remedial courses I could take. However, today I would most like to take "Electricity and wiring for sysadmins".
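Here's the arithmetic I should have done on that 3-phase, 100A panel before emailing anyone. For a balanced three-phase supply, capacity is sqrt(3) x line-to-line voltage x current per phase. The sketch below assumes (and it is only an assumption; I haven't checked the panel) 208Y/120 V service, which is the usual way you end up with 120 V outlets off a three-phase panel, and applies the common 80% rule of thumb for continuous loads. The function names are mine.

```python
import math

def three_phase_watts(volts_line_to_line, amps_per_phase, power_factor=1.0):
    """Capacity of a balanced three-phase supply: sqrt(3) * V_LL * I * PF."""
    return math.sqrt(3) * volts_line_to_line * amps_per_phase * power_factor

def usable_watts(capacity_watts, continuous_fraction=0.8):
    """Derate for continuous load (the common 80% rule of thumb)."""
    return capacity_watts * continuous_fraction

if __name__ == "__main__":
    panel = three_phase_watts(208, 100)  # ASSUMPTION: 208Y/120 V service
    print(f"panel capacity: {panel:.0f} W")
    print(f"plan for at most: {usable_watts(panel):.0f} W")
    # Per-phase view of the same thing: 3 phases x 120 V x 100 A
    print(f"per-phase check: {3 * 120 * 100} W")
```

Under those assumptions it comes to about 36 kW, or roughly 28.8 kW once derated. And since 208 V is sqrt(3) x 120 V, that's the same as 3 phases x 120 V x 100 A, so the "3 x 100A" guess wasn't far off.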

And on another note: Ack! My laptop's home partition is 93% full! How the hell did that happen?

And again: How did I not know about apt-file? This is perfect!

(Touch o' the hat to Tears For Fears and Steve Kemp; I'm moving closer every day to switching to Chronicle.)

Tags: cfengine, hardware, linux, meta.
Random notes
Wed Jan 28 06:34:47 PST 2009
Tags: books, dell, emacs, hardware, rant, solaris.
Sleep!
Wed Feb 4 20:43:54 PST 2009

I can't believe it...my youngest son, after nearly three weeks of being up four or five times each night, slept nearly all the way through without a break: he only woke up at 1am and 5:15am, which is close enough to my usual wakeup time as makes no difference. It was wonderful to have a bit of sleep.

This comes after staying up late (11pm!) on Sunday bottling the latest batch of beer, a Grapefruit Bitter recipe from the local homebrew shop. You know, it really does taste like grapefruit, and even this early I'm really looking forward to this beer.

My laptop has a broken hinge, dammit. I carry it around in my backpack without any padding, so I guess I'm lucky it's lasted this long. Fortunately the monitor still works and mostly stays upright. I've had a look at some directions on how to replace it; it looks fiddly, but spending $20 on a new set of hinges from eBay is a lot more attractive than spending $100. Of course, the other consideration is whether I can get three hours to work on it. But in the meantime, I've got it on the SkyTrain for the first time in a week; it's been hard to want to do anything but sleep lately.

Work is still busy:

Update: turned out to be an MTU problem:

I had no idea there were GigE NICs that did not support Jumbo frames. Though maybe that's just the OpenBSD driver for it. Hm.
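For the next time I'm chasing this: the quickest test I know is a ping with the Don't Fragment bit set and a payload sized to exactly fill the MTU. That payload is the MTU minus 20 bytes of IPv4 header (no options) and 8 bytes of ICMP header. This sketch just does that subtraction, assuming 9000 as the jumbo MTU, which is common but not universal.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes

def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER

if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: ping payload {max_ping_payload(mtu)} bytes")
```

So on Linux, ping -M do -s 8972 otherhost; on OpenBSD, ping -D -s 8972 otherhost. If 1472 gets through and 8972 doesn't, something in the path isn't passing jumbo frames.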

Tags: backups, beer, geekdad, hardware, networking, web.
Physicists
Tue Feb 10 14:01:27 PST 2009

"Physicists are fun to be around. I was watching TV with one, and a commercial came on for OxiClean. The announcer's voice comes in, strong and deep, and says, What's the most powerful force in the universe? The guy I'm with starts pumping his fist and chanting, Strong nuclear force! Strong nuclear force! The announcer comes back and says, That's right, oxygen! Poor bastard looked like someone just shat in his ear."

(Conversation with a friend just now.)

Two things that didn't work:

Explanation: there's ou=Smith and ou=Jones, both of which are under ou=People,dc=example,dc=org. Smith wants to offer Jones the use of a few of his machines, which means setting up accounts for Jones and a few of his folks (cn=Alice, cn=Bob, and cn=Charlie). Obviously, these should be in ou=Jones, right? But if Smith's machines, through the wonders of pam_ldap, are set to check ou=Smith, how do Jones' folks log in?

(Digression: actually, Smith's machines right now check under ou=People — not ou=Smith,ou=People. Smith is the first one to use LDAP, so I stuck with that. I was going to change that at some point anyway, and I thought this might be a good chance to do just that.)

I thought I could try adding an alias, under ou=Smith, that'd point to cn=Alice,ou=Jones. And if I told LDAP that it was a posixAccount as well, then I could look at the account details with id and getent. But the logs showed that it just didn't work:

pam_ldap: error trying to bind as user "uid=Alice,ou=Jones,ou=People,dc=example,dc=org" (Inappropriate authentication)

Couldn't track down the error quickly, so went to plan B: stick with the current setup (machines checking ou=People) and put 'em under ou=Jones. I can always add host restrictions later on.
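The reason plan B works is just subtree scope: a search based at ou=People sees everything beneath it, ou=Jones included, while one based at ou=Smith,ou=People never looks at Jones' entries. Here's a naive sketch of that check. It splits DNs on commas, so it ignores escaped commas, and it lowercases everything, which glosses over per-attribute matching rules; the function names are mine.

```python
def rdns(dn):
    """Split a DN into lowercased RDNs (naive: no escaped commas)."""
    return [part.strip().lower() for part in dn.split(",")]

def in_subtree(entry_dn, base_dn):
    """True if entry_dn is base_dn itself or anywhere beneath it."""
    entry, base = rdns(entry_dn), rdns(base_dn)
    return len(entry) >= len(base) and entry[-len(base):] == base

if __name__ == "__main__":
    alice = "uid=Alice,ou=Jones,ou=People,dc=example,dc=org"
    for base in ("ou=Smith,ou=People,dc=example,dc=org",
                 "ou=People,dc=example,dc=org"):
        print(f"search base {base}: finds Alice? {in_subtree(alice, base)}")
```

Host restrictions would then do the job that the search base was doing before.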

Explanation: Smith had a bunch of these machines at another location before getting server room space at UBC (and new servers). My access to them previously was via SSH only — there was no console access at all (sigh). Now they're at UBC, and one of 'em's gonna be my monitoring machine/second LDAP server ("The new server room: now with redundancy!") But while it was simple to turn on console redirection and choose PXE booting from the comfort of my office, I ended up borking the kickstart process and having to come back here anyway to set up the install. There's the BMC, which apparently I can access via the serial console if I so choose, but I'm still trying to figure out what that'll get me — ie, I can't find a manual in 11 seconds, so I'm putting that off for now.

Oh, and my new (work) laptop is in. Yay! It's a Dell D630, and aside from its obscene footprint compared to my (ailing) C400, it's great. Ubuntu (Hardy for compatibility with the desktops here) is on so far, and CentOS (server work) and OpenBSD (instant firewall) aren't far behind.

Tags: dell, funny, hardware, ldap.
Cooling
Tue Feb 24 15:33:07 PST 2009

Last week was reading week here at UBC. Monday I was off sick. Tuesday we got an email from the folks at the building where we've got guest access to one of their server rooms: the cooling was being shut down from 7am on Wednesday to 3pm on Thursday, so we'd have to turn off our servers. We're guests, so it's not like we've got a lot of say in the matter.

Natch, Thursday 3pm came and went. We got an email at 3:45pm from a manager there, saying that unexpected problems had arisen; they were hoping to have things back up by the weekend. That night I pointed our website at a backup server; it was not serving my boss' big web app, as there was no way to make that tiny little box serve a nearly 1TB database.

Friday I obsessed over the ambient temperature on our firewall (which I'd left turned on); it hovered around 35C. Around 10am we were told that they were hoping to have it on later that day, but that another shutdown might need to be scheduled for the next week (this week). At noon we were told that things were looking hopeful, but they couldn't guarantee cooling over the weekend.

At 2pm I found a local A/C rental agency who told us they'd be out to look at the room on Monday. At 4pm I emailed my contact at the other department, plus his manager, to ask for updates and whether any further shutdowns could be scheduled after we'd arranged for cooling.

Over the weekend I obsessed over the temperature some more; it had dropped to 21C and stayed there, but without feedback from the facilities people I was reluctant to trust it.

Monday (yesterday; wow, time flies) we were told that the cooling system should perform well; however, a part still needed to be replaced. It was on order and would be coming in late this week or early next, and would require a four-hour outage.

This morning the cooling guy visited (he was at a funeral yesterday, so fair enough) and said that, yep, we could get a nice portable unit in for around $400 for a week.

I'm not writing this down because I'm proud of how I handled this. I'm writing this down so that someone else can maybe learn the things I should've known:

I have a habit of thinking "There's not much that can be done about that." Actually, it goes even further than that; it doesn't occur to me sometimes to think about what can be done. I'm not sure if this is lack of confidence, or trying too hard to get along, or just sheer laziness, but I'm trying hard to stop doing that.

Tags: hardware, warstory.
Laptop suspend mode
Wed Mar 18 08:43:18 PDT 2009

Okay, I feel like a bit of a tool for never realizing how cool suspend-to-ram is in a laptop. My new laptop for work is a Dell D630, which I'd got 'cos its hardware is pretty much completely compatible w/Linux. However, I've also figured out that a) Ubuntu does suspend-to-ram quite nicely (aside from a couple times when the keyboard doesn't work, but closing/reopening the lid makes it work), and b) it just sips (sips, I tell you!) from the battery.

Now to try and make it work on my own laptop, which is currently sitting at the shop waiting for me to pick it up.

Today's agenda:

See? I am still a sysadmin.

Tags: hardware, ldap, linux, networking.
Cable organization porn
Thu Mar 19 09:56:43 PDT 2009

We've got a new server room being built right now; it should be done in about six weeks, so I'm putting together an order for bits and pieces that I'll need.

I've mentioned before that cable management is one thing I get obsessed about, so this site is like porn for me. I'm not shilling for them; haven't ordered from them, no idea if they kill puppies in their spare time or what, but holy CRAP this is all the stuff I've ever wanted: RipWrap (so that's what it's called!), label printers, 87 varieties of zap straps, and I don't know what all.

Wow. Just wow.

Edit: Okay, seriously. There's some really good stuff in here among the advertisements.

Tags: hardware.
Rack design tools
Fri Mar 20 11:50:01 PDT 2009

With the move to the server room coming up in a couple months, I've been spending some time trying to lay out the racks we'll have there. My current layout is in an OpenOffice spreadsheet; I thought I'd try some other tools and see how they shape up.

Still sticking with a spreadsheet for now; it's not the best, but it is flexible and quick. Any other tools I missed?

Tags: hardware.
Case study for a server room move
Fri Mar 20 14:05:21 PDT 2009

Actually for a whole office. Excellent reading. Wish I'd known about this at $JOB-2...

Tags: hardware.
Oh, joy
Mon Mar 23 15:50:42 PDT 2009

NetSNMP uses 32-bit counters for disk sizes. Guess what happens when you've got one of these?

Due to be fixed in the next release, so at least that's something.
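A quick illustration of why a 32-bit disk size bites: assuming (for illustration only; check the MIB your agent actually uses) that the size is reported in kilobytes as a signed 32-bit integer, anything past 2^31 KiB, i.e. 2 TiB, wraps around. The helper names are hypothetical:

```python
INT32_MAX = 2**31 - 1


def as_int32(value: int) -> int:
    """Wrap an integer into the signed 32-bit range (two's-complement wraparound)."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    return value - 2**32 if value > INT32_MAX else value


def reported_size_kib(disk_bytes: int) -> int:
    """Disk size in KiB as it would look after being squeezed into a signed 32-bit field."""
    return as_int32(disk_bytes // 1024)


if __name__ == "__main__":
    for tb in (1, 2, 3):
        disk_bytes = tb * 10**12
        print(f"{tb} TB disk: true {disk_bytes // 1024} KiB, "
              f"reported {reported_size_kib(disk_bytes)} KiB")
```

Under these assumptions a 2 TB disk still fits, while a 3 TB disk comes back as a large negative number. If the field were unsigned instead, the wrap would happen at 4 TiB.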

Tags: hardware, networking.
Squint
Tue Apr 28 16:34:11 PDT 2009

This has been one of those days where all I've done is stare at monitors too closely.

I know, I'm a sysadmin, what do I expect? But some days I get up, move around; I'm sedentary (and introverted) by nature but I try to talk to people, stare off into the distance, get away from my desk. Going to the server room is always a good break.

Not today, though. My carefully-chosen ATI video card (the Radeon 4550) is giving me headaches, metaphorical and real:

Dual monitors is important. My own damn fault for not getting something old enough...

Tags: hardware, linux, work.
New server room ours at last
Wed Jun 10 21:07:30 PDT 2009

Given the recent hoo-ha about abandoned blogs, and my own tendency to lose interest in writing about something the longer I put it off (I haven't graphed it, but I suspect it's a nice exponential decay), I figured I should finally write up what I've been doing the last week: the move at $WORK to our new server room.

So: construction finally got finished on our new server room. Our UPS was installed, our racks set up, and the keys handed over (though they were to be changed again twice). Our new netblock was assigned, the Internet access at the new location was in place, and movers were booked.

Things I did in advance which helped immensely:

Last Thursday morning, it all started. I got the machines shut down (thank you, SSH and ubiquitous wireless access at UBC) before the two volunteers who were helping me showed up. We started getting machines unracked; since it was only about 20 machines, I figured it wouldn't take too long. While that was true, I had not counted on the rat's nest of power cables (our power requirements were such that we had to connect machines to PDUs in adjacent racks), or the fact that we wouldn't be able to disassemble that 'til we'd got the machines out.

There was one heartstopping moment: a 1U server, while extended on its rails, came off one of the rails while no one was supporting it. Amazingly the other rail held on while it rotated quickly through 90 degrees to bang loudly against the rack. "You swear quickly," the movers remarked. (Doubly amazingly, the machine seems to be fine, though the rails for the thing are shot.)

The movers were big and burly, which was wonderful when it came to moving the Thumper. I weigh more than it does, but not by much, and I'd had the bad fortune to screw up my back a week before the move. It was tricky trying to figure out how to remove it from the rails, but the movers' trick of supporting it with a couple of big blankets, while fully extended from the rack, made such considerations less urgent. Eventually we got it figured out. I don't know how that could have gone smoother, since we'd got Sun to rack the thing and, frankly, it's not like you spend a lot of time un- and re-racking something like that. Anyhow, a minor point.

The new location was right around the corner, which was handy. The movers had put the servers in these big laundry-like carts on wheels; in the end, we only had four of 'em. We got the machines unloaded, racked the Thumper with the movers' help, signed the paper, then went off for lunch, where we picked up two more volunteers.

After that, we started racking servers. Having only one sysadmin around (me) proved to be a bottleneck; the volunteers had not worked with rackmounted machines before, and I kept having to stop what I was doing to explain something to them. It would have been a great help to have another admin around; in fact, I think this is the biggest move I'd want to make without some other admin around.

Problems we ran into:

Things that went well:

I'm going to post this now because if I don't, it'll never get done. I may come back and revise it later, but better this than nothing at all.

Tags: emacs, hardware, serverroom, work.
Tour, FC
Thu Jun 11 20:42:19 PDT 2009

Gave a tour of the new server room today to about 30-odd people in the department. Ended on a bit of a low note ("…and that's the end! Any questions?") but other than that it went well. Even got an ounce of champagne at the end of it.

Oh, and yesterday I found out that our SL-500 has three fibre channel interfaces, compared to the one interface in the server we bought. I think the sales folks assumed we had a fibre switch, and I didn't realize it all (data + control) wouldn't go over one cable. Arghh.

Just saw a character named Terence on "Entourage" who was not Terence Stamp. Now I want to see "Bowfinger" and "The Limey", in that order.

Tags: backup, hardware, serverroom.
Once more, with feeling:
Mon Jun 15 12:16:46 PDT 2009

Dress rehearsal includes checking to see if you can, in fact, unrack something. I was unable to move a switch this morning because it was stuck behind a PDU. Arghh.

The saga of our crashing UPS continues. The techs came out to visit this morning, which meant I needed to schedule downtime so they could bypass the UPS manually. They were unable to find any smoking gun (or capacitors), and need to confer with HQ again. Best case: the UPS control panel continues to work, and they can do the next round of work w/o a manual bypass. Worst case: the control panel crashes again, and we schedule another round of downtime.

Tags: hardware, serverroom.
Busyness
Thu Jun 18 16:12:32 PDT 2009

Full day:

Tags: beer, hardware, networking.
1246317421 seconds since the epoch...
Mon Jun 29 16:17:01 PDT 2009

I'm back at work after a week off. The UPS control panel continues to work (!), but there is no word back from the manufacturer (says the contractor who installed the thing and filed the ticket). I find this troubling; either the manufacturer really hasn't got back to us yet (bad), or I should have insisted on being a contact for the ticket. I'll have to sort this out tomorrow.

Spent much of my day tearing my hair out over mod_proxy_html. Turns out that, by default, it strips the DTD from the HTML it proxies; this is a problem for one app that we're proxying. Not only that, the only DTDs it supports are HTML and XHTML, either one with an optional "Transitional"/Legacy flag, and none of them includes a URI to a DTD file, like the reference to the Loose DTD that our app uses and that the damned thing threw on the floor. (Sorry, brain cells on strike today and my ability to write clearly is going downhill.)

You can specify your own DTD, including a URI (undocumented feature, whee!), and thus put back the original. But it doesn't append a newline, there's no way to add one that I could figure out, and so it mushes the DTD together with the opening html tag, which makes baby Firefox cry and render the page badly.
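The missing step itself is tiny. Here's a hypothetical sketch of it in Python rather than Apache C (it is not something mod_proxy_html offers, and wiring it into Apache would need a filter or a patch): given proxied output where the DOCTYPE is glued straight onto the first tag, put a newline back. The function name and the HTML 4.01 Transitional ("loose.dtd") sample string are illustrative assumptions:

```python
import re

# A DOCTYPE declaration immediately followed by another tag, with no
# whitespace in between (the "mushed" output described in this post).
_GLUED_DOCTYPE = re.compile(r"(<!DOCTYPE[^>]*>)(?=<)", re.IGNORECASE)


def unglue_doctype(html: str) -> str:
    """Insert a newline between a DOCTYPE and a tag glued directly onto it."""
    return _GLUED_DOCTYPE.sub(r"\1\n", html, count=1)


LOOSE = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" '
         '"http://www.w3.org/TR/html4/loose.dtd">')

if __name__ == "__main__":
    mushed = LOOSE + "<html><head>"
    print(unglue_doctype(mushed))
```

Input that already has whitespace after the DOCTYPE, or has no DOCTYPE at all, passes through unchanged.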

My rule of thumb for a long time was that if I start looking at source code, I'm in over my head. I'm starting to think that may not be entirely true anymore, that I've advanced to the point where I can read C (say) and generally understand what's going on. But when I start looking for API documentation for Apache 2.2 (surprisingly hard to find) to find out if, say, ap_fputs or apr_pstrdup chomp newlines or something (near as I can tell, they don't), or just what AP_INIT_TAKE12 takes as arguments…well, then I am in over my head. If nothing else, I don't want to make some silly error because I don't know what the hell I'm doing. (That's not a slam against the Debian folks; I just mean that I felt shivers when I read about that, because I dread making the same sort of highly-visible, catastrophic error) (unlike the rest of the planet, you understand).

Tags: hardware, programming, web.
GPT and MBR
Fri Jul 3 12:17:25 PDT 2009

I've run into an interesting problem with the new backup machine.

It's a Sun X4240 with 10 x 15k disks in it: 2 x 73GB (mirrored for the OS) and 8 x, um, a bunch (250GB?), RAID0 for Bacula spooling. (I want fast disk access, so RAID0 it is.) RAID is taken care of by an onboard RAID card, so these look like regular disks to Linux.

Now the spool disk works out to about 2.4TB, which is big enough to make baby fdisk cry:

WARNING: The size of this disk is 2.4 TB (2391994793984 bytes).
DOS partition table format can not be used on drives for volumes
larger than 2.2 TB (2199023255040 bytes). Use parted(1) and GUID
partition table format (GPT).
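The 2.2 TB figure in that warning isn't arbitrary. A DOS/MBR partition entry stores sector counts as 32-bit numbers, so with 512-byte sectors the ceiling is (2^32 - 1) x 512 bytes, which is exactly the number fdisk prints. A small Python check, assuming 512-byte logical sectors (the helper names are made up):

```python
SECTOR_SIZE = 512            # bytes per logical sector (assumed)
MAX_MBR_SECTORS = 2**32 - 1  # MBR partition entries hold sector counts in 32 bits


def mbr_limit_bytes(sector_size: int = SECTOR_SIZE) -> int:
    """Largest size an MBR (DOS) partition table can describe."""
    return MAX_MBR_SECTORS * sector_size


def needs_gpt(disk_bytes: int, sector_size: int = SECTOR_SIZE) -> bool:
    """True if the disk has more sectors than an MBR partition entry can count."""
    return disk_bytes // sector_size > MAX_MBR_SECTORS


if __name__ == "__main__":
    spool = 2391994793984  # size fdisk reported for the spool disk
    print(mbr_limit_bytes())     # 2199023255040, as in the warning
    print(spool // SECTOR_SIZE)  # 4671864832 sectors
    print(needs_gpt(spool))      # True
```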

Well, okay, haven't used parted before but that's no reason to hold back. I follow directions and eventually figure out that mkpart gpt ext3 0 2392G will do what I want. GPT? Piece of cake! And then I rebooted, and I couldn't boot up again. Blank screen after the POST. Crap!

The first time this happened, the reboot also coincided with some additional problems during the POST where too many cards were trying to shove their ROM into the BIOS memory (or some such); I thought the two were connected. But then I did it again today, and I finally started digging.

The problem is that parted overwrites the MBR when setting up a GPT disklabel. This has been noted and argued over. My understanding of the two sides of the debate is:

Meanwhile, the parted camp has a number of bugs dealing with this very issue, two opened a year ago, and none have any response in them.

This enterprising soul submitted a patch back in December 2008, which appears to have fallen to the floor.
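To see what's actually sitting in a disk's first sector after a GPT label goes on, you can read LBA 0 and look at the classic MBR layout: boot code at the front, four 16-byte partition entries starting at offset 446, and the 0x55AA signature at the end. A GPT disk normally carries a "protective MBR" there, with a partition of type 0xEE. Here's a small Python sketch (function names are hypothetical; point it at a disk image or, as root, a device like /dev/sdX) that reports whether the sector is a protective MBR and whether the boot-code area is empty, i.e. whether there's any boot code left for a BIOS to run:

```python
BOOTSTRAP_LEN = 440        # bytes 0-439: boot code (440-445: disk signature etc.)
PART_TABLE_OFFSET = 446    # four 16-byte partition entries start here
PART_ENTRY_SIZE = 16
TYPE_BYTE = 4              # offset of the partition-type byte inside an entry
GPT_PROTECTIVE = 0xEE      # type used by a GPT "protective MBR"


def inspect_mbr(sector: bytes) -> dict:
    """Summarise the first 512-byte sector of a disk."""
    if len(sector) < 512:
        raise ValueError("need a full 512-byte sector")
    types = [sector[PART_TABLE_OFFSET + i * PART_ENTRY_SIZE + TYPE_BYTE]
             for i in range(4)]
    return {
        "boot_signature_ok": sector[510:512] == b"\x55\xaa",
        "partition_types": types,
        "gpt_protective": GPT_PROTECTIVE in types,
        "bootstrap_blank": not any(sector[:BOOTSTRAP_LEN]),
    }


def read_first_sector(path: str) -> bytes:
    """Read LBA 0 from a disk or disk image (block devices usually need root)."""
    with open(path, "rb") as f:
        return f.read(512)


if __name__ == "__main__":
    # Synthetic example: a protective MBR with no boot code, built in memory.
    demo = bytearray(512)
    demo[PART_TABLE_OFFSET + TYPE_BYTE] = GPT_PROTECTIVE
    demo[510:512] = b"\x55\xaa"
    print(inspect_mbr(bytes(demo)))
```

A blank bootstrap area on the disk the BIOS picks first would be consistent with a hang right after the POST, but this only shows what's in the sector, not which tool put it there.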

As for me, I was able to convince the BIOS to boot from the smaller disk, and then get a rescue CentOS image going via PXE booting, and then reinstall grub on the smaller disk. Sorted. All I had to do was change "root (hd1,0)" to "root (hd0,0)" in grub.conf.

A touch anti-climactic after all that, perhaps. But it was interesting a) to learn about all this (I hadn't really thought about successors to the DOS partition format before), and b) to see what a slender thread we (okay, I) hang our hopes on sometimes. It's a necessary, sobering thing to realize how much of what I use, depend on, and believe in is created by volunteers who are smart, hard-working people: they argue and focus and forget just like real people, not like the inhabitants of some shining city on a hill I sometimes take them for ("Next beer in Jerusalem!").

Tags: backups, hardware, linux.
Eject, *then* reboot
Sat Oct 3 13:56:22 PDT 2009

Ran into a little problem this week when I tried to do a restore from a backup at work. Bacula loaded the tape, then said it couldn't read the label. Wha?

After much investigation, during which I completely neglected to cut-n-paste the error messages, I think I've figured out what happened:

Ack. Needless to say, this was not good. Fortunately, the file in question was not a terribly important one; unfortunately, that's about the last 2 weeks of incrementals gone. Lesson learned: don't assume your backup program knows what's going on when hardware reboots from under it.

In other news: on Thursday I got 5 new Dell servers. Woot! One of 'em will be our new LDAP/web/email/FTP server (Xen ftw!); the rest are going to be running protein search engines for various researchers across BC. They're racked and I'm stoked, except that it turns out the difference between the DRAC6 Express and Enterprise, besides a few hundred dollars, is that the Enterprise does console redirection and the Express doesn't. Dammit.

I'm going to see if there's any trickery that can be done, but I'm not holding out hope. I have got a 32-port console server, but it's two racks away...might have to run a small batch o' cables up and over to make this work.

Tags: backups, dell, hardware, oops, virtualization.
Wrong, wrong, wrong
Fri Oct 9 16:18:06 PDT 2009

I'm not sure exactly where I saw that DRAC6 Express does not do console redirection -- it was on a mailing list somewhere -- but that turns out to be just wrong:

(For the record, it was the "External Serial Connector" in BIOS that got me; it should be "serial device 1", not "Remote Access Device".)

I can now SSH to the DRAC and get a console just fine. I wish to apologize to Dell, the people of Monaco and the constellation Sagittarius.

Tags: correction, dell, hardware.
Try the oven next time
Thu Nov 19 14:44:14 PST 2009

As recycled by Bradley M. Kuhn on identi.ca, here's another tool for recovering a dead hard drive: a toaster oven.

Tags: hardware.
Serial console FAIL (somewhere...)
Mon Nov 23 12:07:47 PST 2009

This is irritating...

We've got four new Dell R410 servers at work. Natch, I want 'em working with serial consoles so I don't have to sit in the server room. Three of them worked; the fourth did not, despite having identical BIOS/Grub settings.

The symptom was quite maddening: After getting past the various BIOS checks, the Grub menu would not appear unless you sat there and typed something. After that, you'd get the usual Grub entries and could boot as usual. If you did not hit a key, the machine would just hang -- no response to keypresses at all, and you'd have to power cycle.

I spent a stupid amount of time comparing BIOS and Grub settings but was unable to find anything different. Finally today I typed "grub console timeout serial dell" into Google and found this bug in Launchpad, with this comment as the last one:

Having the same hanging issue at the Grub 1.5 stage on brand new R200 Dell servers running OpenSuse 10.3. The terminal timeout is set to 10 and we get 10 press any key to continue messages and then a full system hang requiring a hard reboot.

If we do press any key on a connected console (using Dell's Serial Over Lan) or locally before then end of the timeout then it boots fine so seems to be a bug in continuing at the end of the wait time.

Removing the terminal line from /boot/grub/menu.1st seems to fix the issue on our servers. The console in this case is sent by BMC to both the local screen and the remote console with no timeout so works a treat. This may only work with Dell's BMC/SOL but thought I'd mention it in case anyone else has spent a day getting frustrated with this like we have.

This worked a treat, with the added bit of weirdness that I had two "terminal" lines:

terminal --timeout=2 serial console
serial --unit=0 --speed=9600
default=0
timeout=5
serial --unit=1 --speed=115200
terminal --timeout=5 serial console

and now I have one:

terminal --timeout=2 serial console
serial --unit=0 --speed=9600
default=0
timeout=5
serial --unit=1 --speed=115200
# terminal --timeout=5 serial console

Yes, I know that's redundant, but again: it worked on the other three machines.
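If you have a pile of machines to fix, the same edit can be scripted. A minimal Python sketch, assuming GRUB legacy syntax (menu.lst / grub.conf) and that a second "terminal" directive is the culprit, as it was here: it keeps the first uncommented "terminal" line, comments out any later ones, and leaves the "serial" lines alone. The function name is hypothetical; keep a backup of the real file before writing anything back:

```python
import re

# An uncommented GRUB legacy "terminal" directive (leading whitespace allowed).
TERMINAL_LINE = re.compile(r"^\s*terminal\b")


def comment_extra_terminal_lines(config: str) -> str:
    """Keep the first 'terminal' directive and comment out any later ones."""
    seen = False
    out = []
    for line in config.splitlines(keepends=True):
        if TERMINAL_LINE.match(line):
            if seen:
                line = "# " + line
            seen = True
        out.append(line)
    return "".join(out)


BEFORE = """terminal --timeout=2 serial console
serial --unit=0 --speed=9600
default=0
timeout=5
serial --unit=1 --speed=115200
terminal --timeout=5 serial console
"""

if __name__ == "__main__":
    print(comment_extra_terminal_lines(BEFORE), end="")
```

Run against the config above, it produces the same one-"terminal" version shown in this post.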

I don't know if this is a problem with Grub, with Dell's firmware or something else, but Gott im Himmel, I hate bugs like this.

Tags: bugs, dell, hardware.
Five years and still going
Tue Dec 22 06:38:37 PST 2009

At the risk of tempting fate, I just realized that my web server is five years old (and a bit). Happy birthday, Thornhill!

Tags: hardware.
Embedded embeddedness
Wed Mar 31 11:01:23 PDT 2010

The bge driver for OpenBSD says that the Broadcom BCM5700 series of interfaces has two MIPS R4000 CPUs. And you can run Linux on an R4000, or NetBSD.

Must...stop...recursion...

Tags: hardware, networking.
Hopping
Tue Apr 27 16:26:40 PDT 2010

Been busy lately:

But hey! Turns out we live in a constitutional democracy after all. There was some debate about this at 24 Sussex Drive, I understand. Score one for the good guys.

Tags: dell, hardware, politics, work.
