(Mostly) Done, thank the gods

Saturday I upgraded the big machine at work to Solaris 10 11/06. This did not go well.

First off, I ended up installing onto a disk that held home directories. The install was a manual one, and I'd carefully noted in advance the disk I'd be installing to: the second internal hard drive, the one I'd tried doing the luactivate on a couple weeks ago.

Only the disk targets/names/whatever changed, and so c1t0d1 (say) was now one of the home partitions mounted from the external StorEdge array. Fuck. There was a backup, which I'd taken before starting the install. Unfortunately, it was taken 3 hours before the install started, and during that time the machine had been up and running. The install started at 8am, so I'm hopeful there wasn't too much lost between 5am and 8am. But don't think I'm trying to minimize that mistake.

Second, I'd also managed to bork the disklabel for the original Solaris 9 install. I dug up the original disklabel somewhere — it wasn't in the documentation we've got, and I should have put it in there a long time ago — and restored everything to the way it was. It hadn't been formatted, so everything was okay.

Third, when it came up only one of the three external drives from the StorEdge was present, and I could not figure out where the others had gone. (It took me a while to figure this out; when first I realized my first mistake, I thought I'd installed over all the home directories. That was an awful moment.)

It took a lot of Googling to figure out what I should have already known about Solaris in general, and what should have been documented about this machine in particular: that /kernel/drv/sd.conf had been modified to add additional entries for LUNs that otherwise Solaris wouldn't have looked for.
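For next time, here's a quick way to see whether a given sd.conf has had extra LUN entries added. It's only a sketch: the sample lines use sd.conf syntax, but the target and LUN numbers are made up rather than copied from this machine.

```shell
# List the sd.conf entries that name a LUN other than 0.
# Reads sd.conf-format text on stdin; on the real box you'd feed it
# /kernel/drv/sd.conf.
extra_luns() {
    grep -v '^[[:space:]]*#' | grep 'lun=[1-9]'
}

# Made-up sample entries (target and LUN numbers are hypothetical).
extra_luns <<'EOF'
# stock entry
name="sd" class="scsi" target=0 lun=0;
name="sd" class="scsi" target=0 lun=1;
name="sd" class="scsi" target=0 lun=2;
EOF
```

Anything this prints is a line that needs to survive a reinstall, so it belongs in the documentation.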

(Many thanks to Brandon Hutchinson, whose entry on this very subject saved my butt. I wrote him a grateful email, and I wish him the best.)

(Incidentally, a reconfiguration reboot on a V480 takes between 10 and 25 minutes. It's not a fast process. Also not a fast process is installing Solaris patches; I spent at least two hours on this all told, not counting reconfiguration reboots.)

I restored the one home directory (having recreated it in ZFS…one bright spot in all that) and mounted the others. All this got me, at 6pm, where I should have been at noon.

I was there 'til 11:30pm on Saturday fixing things up to the point where it was more or less ready for SSH-based logins. Then I took a cab home. Then I came in yesterday at 10am and got almost everything else working: SunRays (oh, the new desktop is beautiful), printing, software, and I can't even remember what all at this point.

I took lots of notes and did everything from within screen with logging turned on. (Bonus points for next time: set the prompt to show the time, so I can tell what order I did things in.) I'll be going over all of it to do things better next time.
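The prompt part is a one-liner. A minimal sketch, assuming bash is the shell running inside screen; everything after the timestamp is just my guess at a sensible prompt layout:

```shell
# Put the time in front of the prompt (bash's \t is 24-hour HH:MM:SS).
# Single quotes matter: bash expands \t each time it draws the prompt,
# so every command in the screen log gets its own timestamp.
PS1='[\t] \u@\h:\w\$ '
```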

Here's some stuff I already know:

  • Backups. It's said you never know how much you need 'em 'til you need 'em. True 'nuff.

  • DOCUMENTATION. I spent a good part of yesterday getting information on every disk while waiting for other software to install. I should have done this long, long ago.

(Incidentally, on that front I owe Blastwave an apology: right on the goddamn HOWTO page there's a section on automation. My mistake. But I still don't like the fact that the remove option (-r) is undocumented, and presumably undocumented because of the warning it prints that it's not very smart and shouldn't be used.)

  • Know what you're dealing with. The home partition I erased was bigger than the disk I expected to install on, but I wasn't sure of its size.

  • Stop if you're not sure. I should have stopped at the last point.

  • Be paranoid. Usually I am, but it would have been good to disconnect every superfluous drive rather than go through all this hell.
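On the documentation point, the disklabels are the easiest part to script. A minimal sketch, assuming Solaris's prtvtoc and the usual /dev/rdsk/*s2 whole-disk slices; the function name and the directory in the example are made up, and it quietly does nothing on a machine with no /dev/rdsk:

```shell
# Save every disk's label (VTOC) into the directory given as $1.
save_disklabels() {
    label_dir=$1
    mkdir -p "$label_dir" || return 1
    for slice in /dev/rdsk/*s2; do
        [ -e "$slice" ] || continue      # glob matched nothing: no such disks
        disk=$(basename "$slice")
        prtvtoc "$slice" > "$label_dir/$disk.vtoc" 2>/dev/null ||
            echo "could not read label on $disk" >&2
    done
}

# e.g. save_disklabels /var/tmp/disklabels   (hypothetical path; keep it
# somewhere that gets backed up and isn't one of the disks in question)
```

A saved label can be put back with fmthard -s, which is more or less what I ended up doing by hand for the Solaris 9 disk.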

Sometimes it really amazes me that I get paid to do this work because it's so much fun. And sometimes I'm amazed because I figure I shouldn't be allowed to touch computers with a ten-foot pole.

I'm feeling pretty damned humble this morning. With luck that feeling will stay.

Tags: solaris upgrade

Wha'?

I never expected to read that Ken MacLeod has Prince tickets to sell.

(Incidentally, if you haven't read his books already I can't recommend them enough. Start with Cosmonaut Keep and just keep on going.)

Tags: books

This is ridiculous

I've complained about Blastwave before, but this is just terrible.

Trying to install VLC on a Solaris 10 machine using Blastwave. Says that CSWcommon is out of date, so please run pkg-get -u. As this always includes thousands of prompts that look like this:

The following package is currently installed:
CSWoldapclient  openldap_client - OpenLDAP client executables (oldapclient)
               (sparc) 2.3.31,REV=2007.01.07

Do you want to remove this package? [y,n,?,q] y

## Removing installed package instance <CSWoldapclient>
## Verifying package <CSWoldapclient> dependencies in global zone
WARNING:
The <CSWoldap> package depends on the package currently
being removed.
Dependency checking failed.

Do you want to continue with the removal of this package [y,n,?,q]


...I look around for a way to automate this. And surprise, there is, and I've missed it the whole time. My bad. So: pkg-get -f upgrade it is, then.

It runs for 45 minutes and stops with an error about CSWcommon:

Current administration requires that a unique instance of the
<CSWcommon> package be created.  However, the maximum number of
instances of the package which may be supported at one time on the
same system has already been met.

Hm, sez I. That's strange, but maybe that's what it's like for package managers that suck. pkg-get -r common and pkg-get -i common, and I'm ready for the upgrade again.

Somehow in the process I managed to remove the pkg_get package, which (surprise) contains the pkg-get command. Fortunately I have a backup copy around and use that to install pkg_get. Life continues.

And it's not for another 15 minutes after that that I notice that the package manager is going in loops. It keeps going over the same packages again and again, giving the same error about unique instances each time. A quick search turns up this link, which tells me I'm a fool for believing the help offered by pkg-get:

$ pkg-get -h
pkg-get,   by Philip Brown , phil@bolthole.com
 (Internal SCCS code revision 3.6)
Originally from http://www.bolthole.com/solaris/pkg-get.html

pkg-get is used to install free software packages
pkg-get
Need one of 'install', 'upgrade', 'available','compare'
  '-i|install'   installs a package
  '-u|upgrade'   upgrades already installed packages if possible
  '-a|available' lists the available packages in the catalog
  '-c|compare'   shows installed package versions vs available
  '-l|list'      shows installed packages by software name only

Optional modifiers:
  '-d|download'  just download the package, not install
  '-D|describe'  describe available packages, or search for one
  '-U|updatecatalog'   updates download site inventory
  '-S|sync'      Makes update mode sync to version on mirror site
  '-f'           dont ask any questions: force default pkgadd behaviour
         Normally used with an override admin file
         See /var/pkg-get/admin-fullauto
  '-s ftp://site/dir'  temporarily override site to get from

and that the correct way to do what I want is to run:

true | sudo pkg-get upgrade

I admit that I neither knew nor sought to find out what "default pkgadd behaviour" would be, so that's my fault. I admit that I was the one who borked things by removing the pkg-get command. I admit that I did not think to record all of this with script, so at the moment I'm going on scribbled notes and memory. This is not a bug report, which is what I really should be writing. These are all things I did wrong or badly.
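For what it's worth, as far as I can tell from admin(4) the "unique instance" complaint comes from pkgadd's admin settings: instance=unique asks for a fresh instance alongside the installed one, while instance=overwrite replaces it in place. Here's a sketch of an admin file that answers everything up front. I haven't compared it against what Blastwave ships in /var/pkg-get/admin-fullauto, and the file name is made up:

```shell
# Write a pkgadd admin file that never stops to ask.
# ADMIN_FILE is a hypothetical path; use it with: pkgadd -a "$ADMIN_FILE" ...
ADMIN_FILE=${ADMIN_FILE:-${TMPDIR:-/tmp}/admin-noask}
cat > "$ADMIN_FILE" <<'EOF'
mail=
instance=overwrite
partial=nocheck
runlevel=nocheck
idepend=nocheck
rdepend=nocheck
space=nocheck
setuid=nocheck
conflict=nocheck
action=nocheck
basedir=default
EOF
```

The nocheck lines turn off the dependency and disk-space checks too, so this trades prompts for the chance to break things without being asked first.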

But isn't this what apt has fixed? On its worst day, I've never had to set up yes to be the drinking bird that would let me get stuff done. And — when all was done, and I got to go back to installing VLC — I've never had it depend on gcc.

Arghh. Arghh arghh arghh.

Tags: rant solaris packagemanagement

IPv6, Gibson, missing links

I spent the better part of the day yesterday setting up IPv6 at home now that I've got my subnet from SixXS. I'm running rtadvd on my OpenBSD firewall, and was testing it with rtsold on a laptop running OpenBSD. I'm not sure what I was doing wrong, but for the longest time all the laptop would pick up was the gateway; it would not set up a global address, but stick with the link-local address only. Every time I tried to ping the dancing turtle it would try sending it with the fe80 address, which of course did not work.

In the end, after a few reboots of both machines, it did work. My notes were a little thin (hey, this is my vacation here :-), but I can't think of what changed…the laptop just started setting itself a global address, routing worked, and that was that. Weird.
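Since my notes were thin, here's the check I should have been running in a loop while poking at it. A rough sketch: it reads ifconfig output for one interface and succeeds only if there's an inet6 address that isn't link-local. The function name and the interface name in the usage line are made up.

```shell
# Succeed if the ifconfig output on stdin shows an inet6 address that
# does not start with fe80:, i.e. autoconfiguration got past link-local.
# Feed it a single interface, since the loopback's ::1 would also count.
has_global_inet6() {
    awk '$1 == "inet6" && tolower($2) !~ /^fe80:/ { found = 1 }
         END { exit !found }'
}

# e.g. ifconfig em0 | has_global_inet6 && echo "global address present"
```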

Next up will be to get the website working on IPv6. Maybe a dancing daemon or something…

And hey, I won tickets to see William Gibson speak! "Hey, Mr. Gibson...you know that book you wrote called Virtual Light? ...It was really cool." Ah, fanboys. But my wife wants to go too, 'cos she loved Pattern Recognition. Should be a fun night.

And I just realized that although I've been generating an RSS2 feed, I've never linked to the RSS2 feed until now. Enjoy.

Tags: ipv6 books

The deluge opens

Somehow in the move of the websites and files from Linode back to Thornhill (home server on the other end of DSL; 1.5GHz Sempron and 1GB of RAM in a nice Shuttle box), I copied ~/.spamassassin to the wrong directory...and wow, did this ever make a difference to spam filtering. My mailbox was flooded with stuff coming in to an old (12 years!) address that I pretty much just use for WHOIS contacts these days.

I didn't realize what was going on at first, so I tried training it on my saved spam and ham. 90k messages later, it still didn't do it properly. I did some digging, then figured out what had happened and copied the files to the right place. Boom — the sweet, sweet sound of a nearly-empty inbox.

The user_prefs files were the same each time, so it was just the Bayes token files that were different. The only thing I can think of is that the working files were the result of training SA on its mistakes, rather than on its successes.

Of course, I should probably just get the address cancelled or changed…the last time I looked, well over 95% of the spam I've got came to that address. But still, I'm starting to think that I should be keeping the Bayes files under revision control...
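If I do go that way, the binary Bayes files aren't what I'd check in; sa-learn can dump the database as text with --backup and load it again with --restore. A minimal sketch, with a made-up function name and path:

```shell
# Dump the Bayes database as text so it can go under revision control.
# SA_LEARN lets you point at a different sa-learn binary; the default
# assumes it's on the PATH. The dump goes to a temp file first so a
# failed run doesn't clobber the last good snapshot.
snapshot_bayes() {
    out=$1
    ${SA_LEARN:-sa-learn} --backup > "$out.tmp" && mv "$out.tmp" "$out"
}

# e.g. snapshot_bayes "$HOME/sa-state/bayes.txt"   (hypothetical path),
# then commit the result with whatever revision control is handy.
```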

Tags: spam

Holy crap, I got aggregated!

While obsessively prowling my referrers today, I noticed that I've been aggregated on Planet Sysadmin. I'm incredibly flattered. Looks like there's some damn fine reading there, and it looks like I have to fix my RSS feed...apologies for the lack of paragraph breaks.

Tags: meta

That took a while...

The move of all the websites and mail from the server in Atlanta to home took longer than I thought. First I came across problems with the quad-hme interface in the Sparc Ultra 1 workstation I'd been using as a firewall, and I had to resurrect Francisco, an AMD Pentium clone, and install OpenBSD 4.1 on it. Then using pf and spamd to do greylisting didn't work so well, and I had to turn it off. Then some DNS/routing stuff I'd missed before…

Done, though, at long last. Time to sleep.

Tags: meta

Bats and Leathermen and Hunter

When I got my first job in IT, a friend of mine bought me a copy of the third edition of Unix in a Nutshell. (Incidentally, why does O'Reilly's search, which in my client returns "Sorry, no matches were found containing ." (sic), suck so much?) Sure, it was help desk on a small ISP, but it was something. I read that book front to back on the bus to and from work, and filled it full of stickers from all the servers or PCs I assembled.

The sysadmin at that first job also had a cordless drill, and that made things so much easier when assembling or racking servers. I wanted one, but I didn't buy one 'cos I figured I hadn't earned it yet. When my Italian millwright father-in-law bought me one, I felt like it was a vote of confidence in a way.

Another thing the sysadmin had was a Leatherman Wave. Again, I wanted one, but I didn't think I'd earned it yet. Last week, I decided to get one; and if I was going to get one, I was going to wear the damn thing. I started wearing the sheath on my belt, and waited for a chance to use it.

Today I had that chance.

I got to work and went to the kitchen to grab a coffee. "There's a bat behind the fridge," I heard.

What?

The cleaning woman pointed. "I moved out the fridge to clean it," she said. "There was a bat behind it. I don't want to touch it."

I looked, and sure enough there was one hanging by the edge of the cupboard. It was small, like a mouse wearing an overcoat. (Goth mouse?)

And then my moment came.

There were no gloves (I was worried about rabies), but there was a towel. I draped the towel over the bat while frightened coworkers watched, and then covered it with a recycling bin.

And then I took out the Leatherman, and flipped out the knife. "I need help cutting cardboard," I said, and the receptionist came to help. She sliced up a cardboard box and gave me a square of it. I slid it between the cupboard and the towel, sandwiching the bat gently between it and the towel, with the recycling bin behind.

I carried it outside to a clump of trees (ah, the advantages of living on a beautiful campus), found a stick, coaxed it onto it and then left it up a tree.

But I couldn't have done it...

...without the Leatherman.

(This writing style brought to you by my third reading of Battlefield Earth. Our motto: Yeah, it's trash...so what?)

In other news, Hunter Matthews is giving a workshop on server room best practices at LISA '07. I met him at LISA last year, when he was another attendee of an otherwise thin tutorial on setting up server rooms/closets. He was also at the documentation BOF, and the one who said "I've got one user who considers 7-bit ASCII a luxury compared to what you can get from 5 or 6 bits." (Oh, and: "Cooperative collaboration. Yeah, its part of our vision statement.") He's a good guy and a good teacher, and if you're going to LISA you could do a lot worse than going to his workshop.

Tags: books lisa

Memo to myself(2)

Date: Thu Aug 9 19:06:23 EDT 2007

There is always time to document something. Even if it's just throwing a typescript file on a wiki. And there is always time to turn the typescript dump into real documentation once things have calmed down.

Tags:

Time to fire up the IPv6 tunnel again

I've been fiddling with IPv6 for years, but have never actually done anything serious with it. When I started work at Dowco, and my web server was a 200MHz Pentium I inherited from friends of mine, my plan was to get a tunnel from Hurricane Electric, then run a tunnel broker service of my own for customers. (There was a burning thirst for IPv6 subnets, let me tell you.) It foundered when it got to the point of coming up with a website that'd let you register; cookies and sessions and I don't know what all just bored me to tears.

At my next/last job, IPv6 was used in-house. The sysadmin before me had set up 6to4 because he wanted to connect to his machine at work without NAT. I kept it going long past the time he left, and as far as I know it's still there. But beyond presenting many more ways for DNS problems to screw things up, not much was ever done with it.

Last year I signed up for another account from HE, got a prefix, then lost track of it when it came time to add IPv6 rules for the firewall. Of course, there was other stuff going on too.

This year HE's registration form is borked, saying that it can't insert my MD5'd password into MySQL, so I've applied for an account with SixXS. (Sadly, it seems that despite appearances, my ISP isn't interested.) I've got a week's vacation coming up, so along with moving the server from Atlanta to home I think I'll try to get IPv6 working as well.

Next beer in Jerusalem! (Which, shet my mouth, is not even close to original.)

Tags: ipv6

New Gibson!

I had no idea. And he's speaking about it here in Vancouver. 12 years here and I still haven't run into him, unlike folks I know. Here's hoping I win tickets.

Tags: books

Well, *that* happened

The upgrade to Solaris 10 did not work. The main problem was that logging in at the console (even as root!) simply would not work: I'd get logged right back out again each time, with no error message or anything. WTF?

I managed to go into single-user mode, provide the root password (see? they do trust me) and get access that way. But I still couldn't figure out what was going wrong. Eventually I came across this entry in the logs

svc.startd[7]: [ID 694882 daemon.notice] instance svc:/system/console-login:default exited with status 16

And /var/svc/log/system-console-login:default.log said:

[ Aug  4 14:23:48 Executing start method ("/lib/svc/method/console-login") ]
[ Aug  4 14:24:05 Stopping because all processes in service exited. ]

Eventually I had to give up and revert back to Solaris 9. That part worked well, at least.

I've no idea what went wrong at this point, but since I haven't come across this before with other Solaris 10 installs I'm starting to wonder if it's a product of luupgrade attempting to merge the machine's current settings with Sol10. Between that suspicion and the increase in disk space needed to run luupgrade (not sure why, but for example /usr needed a couple extra GB of space in order to complete luupgrade; I presume something's being added or kept around, but there's no explanation I can find for this), I'm starting to think that just going with a clean install of Sol10 is the way to go.

Arghh. Live Upgrade was supposed to just work.

Tags: solaris upgrade warstory

Solaris Live Upgrade

I'm running Solaris Live Upgrade at work to upgrade our main server from Solaris 9 to Solaris 10, and one thing I haven't seen mentioned in all the things I've read about it is how long it takes.

Right now, for example, I'm running luactivate to activate the new boot environment. It's been running for half an hour now, with no indication about how long it's going to take. If I'd known it would take this long, I'd have scheduled it for earlier this morning. And yeah, it would've been obvious if I'd thought about it...

Shet my mouth, it just finished after 38 minutes. For the record, this is on a V480 with 16GB of RAM, and call it 50GB total of disks to be synced.

Tags: solaris

Interesting website generation

Just came across norman.walsh.name while looking for information on Mercurial, and I'm intrigued. I'll have to take a look at the Makefile and maybe steal some ideas...by the beard of Saint Tim, this site could use a rewrite.

Tags: meta

2nd edition out!

Woot! The second edition of "The Practice of System and Network Administration" has finally started shipping! Just ordered my copy, along with Beautiful Code and Perl Best Practices. I love books, oh yes I do.

Tags: books

Geek dreams

I dreamt last night that I met Sammy Davis Junior and talked him into releasing all of his songs under a Creative Commons license 20 years after his death.

"Hugh, I'm convinced," he said. "And I want to write a cheque for ten thousand dollars, too. Tell me which organization to give it to."

(I think I piqued RMS's interest in the issue, too, but I can't remember that part.)

In other news, my old employers have been bought by another BC ISP. I can't say I'm surprised, but it's a shame the new company uses IIS for their webserver...yeah, I'm a Unix geek, all right.

And many thanks to tobutaz for providing a much better answer to the question "How do I get the home directory of a user whose name is in a variable?":

USER=jdoe
eval "USERHOME=~$USER"

Every now and then I'm reminded of eval and then forget it when it's useful. Thanks again, tobutaz!
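One caution with the eval version: if the name ever comes from outside the script, eval will run whatever is in it. Here's a sketch of an eval-free alternative that pulls the sixth field out of the passwd entry. The function name is mine, and it assumes something like getent is around to feed it passwd-format lines:

```shell
# Print the home directory for user $1, given passwd-format lines on stdin.
home_from_passwd() {
    awk -F: -v user="$1" '$1 == user { print $6; exit }'
}

# e.g. USERHOME=$(getent passwd "$SOMEONE" | home_from_passwd "$SOMEONE")
```

It prints nothing for a user that doesn't exist, so check for an empty result before handing it to mv.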

Tags: scripting

Bash: How to get the home directory of a user in a variable

Not the clearest title...what I mean is, I was writing a Bash script like this:

SOMEONE=foo
...
mv $SOMETHING ~$SOMEONE

only it kept failing with "~foo: No such file or directory". I had a look at the manual, but it wasn't terribly clear. In a nutshell, though, Bash wasn't doing the tilde expansion no matter what combination of braces and quotes I tried.

Eventually, I came across a pirated copy of the ever-excellent Unix Power Tools, 3rd Ed. (mine's at work, or I'd've checked it a lot sooner), and it had a solution:

FINAL_RESTING_PLACE=$(/bin/bash -c "echo ~${SOMEONE}")

I'm sure there must be a better way of doing this, though...Woot, any suggestions?

Tags: scripting

It's a love affair...mainly Nagios and my network

I can get really, really focussed sometimes. Every now and then that happens with Nagios.

Yesterday I had some time to kill before I went home, so I looked over my tickets in RT. (I work in a small shop, so a lot of the time the tickets in RT are a way of adding things to my to-do list.) There was one that said to watch for changes in our web site's main page; I'd added that one after MySQL'd had problems one time -- ran out of connections, I think -- and Mambo had displayed a nice "Whoops! Can someone please tell the sysadmin?" page (a nice change from the usual cryptic error when there's no database connection). Someone did, but it would've been nice to be paged about it.

At home I use WebSec to keep track of some pages that don't change very often (worse luck…), and I thought of using that. It sends you the new web page with the different bits highlighted, which is a nice touch. But I wanted something tied in with Nagios, rather than another separate and special system.

So I started looking at the Nagios plugins I had, and I was surprised to find that check_http has a raft of different options, including the ability to check for regexes in the content. Sweet! I added a couple strings that'll almost certainly be there until The Next Big Redesign(tm), and done.
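The idea behind the content check is simple enough to sketch by hand. This is not check_http, just the same logic in a few lines of shell, using the usual Nagios plugin convention of exit 0 for OK and 2 for CRITICAL; the function name, URL and sample strings are invented:

```shell
# Read a page on stdin; report CRITICAL unless it matches the regex in $1.
check_page_content() {
    if grep -E -q -- "$1"; then
        echo "HTTP OK - pattern found"
        return 0
    fi
    echo "HTTP CRITICAL - pattern not found"
    return 2
}

# e.g. curl -s http://www.example.com/ | check_page_content 'Welcome to'
```

The trick is picking a string that shows up on the real page but not on the "Whoops!" page.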

I started looking at the other plugins, and noticed check_hpjd. A few minutes later I was checking our printers for errors...just in time to notice a weird error that someone had emailed me about 30 seconds before. Nice!

This morning (I work from home on Saturdays in return for getting Wednesdays off to take care of Arlo) I was checking Cacti (which rocks even if they do call it a solution). /home/visitors with no free space? Wha'? Someone had run a job that'd managed to fill the whole damned partition.

Well, there's check_disk, but that's only for mounted disks — and I don't want the monitoring machine freezing if there's a problem with NFS. SNMP should do this, right? Right — the net-snmp project has the ability to throw errors if there's less than a certain amount of free space on a disk. For some reason I'd never set that up before, nor got Nagios to monitor for it. A few minutes later and check_snmp was looking for non-empty error messages:

$USER1$/check_snmp -H $HOSTADDRESS$ -o UCD-SNMP-MIB::dskErrorMsg.$ARG1$ -s ""

I looked ahead in snmpd.conf and noticed the process section. Well, hell! It's all very good to check that the web server is running, but what if there are too many Apache processes? Or too few of MySQL? Or no Postfix? Can't believe I never set this up before…
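For reference, the snmpd.conf side of this is only a few lines. This sketch writes a sample fragment; the thresholds and process names are placeholders rather than what's on my servers, and the file lands in a scratch location, not the live config:

```shell
# Sample net-snmp directives:
#   disk PATH MIN%     sets dskErrorMsg when PATH has less than MIN% free
#   proc NAME MAX MIN  flags more than MAX or fewer than MIN processes
#   proc NAME          with no numbers means "at least one"
# FRAGMENT is a hypothetical scratch path.
FRAGMENT=${FRAGMENT:-${TMPDIR:-/tmp}/snmpd.conf.sample}
cat > "$FRAGMENT" <<'EOF'
disk /home/visitors 10%
proc httpd 50 1
proc mysqld
proc master
EOF
```

The proc entries show up in the same UCD-SNMP-MIB under the pr* table, so check_snmp can watch for a non-empty error message there just as it does for the disks.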

I've finally come up for breath. This wasn't what I planned on doing this morning, but I love it when a plan will come together next time.

Tags: monitoring mysql

Happy Sysadmin Appreciation Day!

Date: Fri Jul 27 14:01:19 EDT 2007

This DNS regression suite looks tres cool. I've just upgraded BIND at work to the latest version, so maybe this is the next thing to try.

Tags:

Info on Sun's doors

Which is very timely, as I'm trying to track down why nscd door access is taking so long: http://au.sun.com/news/onsun/2002-11/tech_tips.html

Tags: solaris