27 Aug 2007
Saturday I upgraded the big machine at work to Solaris 10 11/06. This
did not go well.
First off, I ended up installing onto a disk that held home
directories. The install was a manual one, and I'd carefully noted in
advance the disk I'd be installing to: the second internal hard drive,
the one I'd tried doing the luactivate on a couple weeks ago.
Only the disk targets/names/whatever changed, and so c1t0d1 (say)
was now one of the home partitions mounted from the external StorEdge
array. Fuck. There were backups: I'd taken a backup before starting
the install. Unfortunately, they were taken 3 hours before the install
started, and during that time the machine had been up and running. The
install started at 8am, so I'm hopeful there wasn't too much lost
between 5am and 8am. But don't think I'm trying to minimize that
mistake.
Second, I'd also managed to bork the disklabel for the original
Solaris 9 install. I dug up the original disklabel somewhere — it
wasn't in the documentation we've got, and I should have put it in
there a long time ago — and restored everything to the way it was. It
hadn't been formatted, so everything was okay.
Third, when it came up only one of the three external drives from the
StorEdge was present, and I could not figure out where the others had
gone. (It took me a while to figure this out; when first I realized my
first mistake, I thought I'd installed over all the home
directories. That was an awful moment.)
It took a lot of Googling to figure out what I should have already
known about Solaris in general, and what should have been documented
about this machine in particular: that /kernel/drv/sd.conf had been
modified to add additional entries for LUNs that otherwise Solaris
wouldn't have looked for.
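Since this was exactly the kind of change that should have been written down, here's a rough sketch of pulling the extra LUN entries out of a file in that format so they can be pasted into the docs. The function name extra_luns is made up for illustration. It assumes each entry keeps its lun= setting on one line and that anything other than lun=0 is a local addition, which should be checked against the real file.

```shell
# extra_luns FILE: print the non-comment lines of an sd.conf-style file
# that mention a LUN other than 0. Hypothetical helper, not a Solaris tool.
extra_luns() {
    grep -v '^#' "$1" | grep 'lun=[1-9]'
}
```

Running it as extra_luns /kernel/drv/sd.conf would list the candidate lines; nothing is printed if the file has no such entries.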
(Many thanks to Brandon Hutchinson, whose entry on this very
subject saved my butt. I wrote him a grateful email, and I wish
him the best.)
(Incidentally, a reconfiguration reboot on a V480 takes between 10
and 25 minutes. It's not a fast process. Also not a fast process is
installing Solaris patches; I spent at least two hours on this all
told, not counting reconfiguration reboots.)
I restored the one home directory (having recreated it in ZFS…one
bright spot in all that) and mounted the others. All this got me, at
6pm, where I should have been at noon.
I was there 'til 11:30pm on Saturday fixing things up to the point
where it was more or less ready for SSH-based logins. Then I took a
cab home. Then I came in yesterday at 10am and got almost everything
else working: SunRays (oh, the new desktop is beautiful), printing,
software, and I can't even remember what all at this point.
I took lots of notes and did everything from within screen with
logging turned on. (Bonus points for next time: set the prompt to show
the time, so I can tell what order I did things in.) I'll be going
over all of it to do things better next time.
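For the prompt, something like this in Bash would probably do. The \t escape is Bash's own (current time as HH:MM:SS, 24-hour); the rest of the prompt layout is just an assumption about taste.

```shell
# Put the current time (HH:MM:SS) at the front of the Bash prompt.
# The time is stamped when the prompt is printed, so in a screen log it
# marks when the previous command finished and the next one could start.
PS1='[\t] \u@\h:\w\$ '
```

Set before starting work inside screen, every prompt in the log then carries a timestamp, which is enough to reconstruct the order of events afterwards.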
Here's some stuff I already know:
Backups. It's said you never know how much you need 'em 'til you
need 'em. True 'nuff.
DOCUMENTATION. I spent a good part of yesterday getting information
on every disk while waiting for other software to install. I
should have done this long, long ago.
(Incidentally, on that front I owe Blastwave an apology: right on the
goddamn HOWTO page there's a section on automation. My
mistake. But I still don't like the fact that the remove option (-r)
is undocumented, and presumably undocumented because of the warning it
prints that it's not very smart and shouldn't be used.)
Know what you're dealing with. The home partition I erased was
bigger than the disk I expected to install on, but I wasn't sure of
its size.
Stop if you're not sure. I should have stopped at the last point.
Be paranoid. Usually I am, but it would have been good to disconnect
every superfluous drive rather than go through all this hell.
Sometimes it really amazes me that I get paid to do this work because
it's so much fun. And sometimes I'm amazed because I figure I
shouldn't be allowed to touch computers with a ten-foot pole.
I'm feeling pretty damned humble this morning. With luck that feeling
will stay.
Tags:
solaris
upgrade
22 Aug 2007
I never expected to read that Ken MacLeod has Prince tickets to sell.
(Incidentally, if you haven't read his books already I can't recommend them enough. Start with Cosmonaut Keep and just keep on going.)
Tags:
books
20 Aug 2007
I've complained about Blastwave before, but this is just terrible.
Trying to install VLC on a Solaris 10 machine using Blastwave. Says
that CSWcommon is out of date, so please run pkg-get -u. As this
always includes thousands of prompts that look like this:
The following package is currently installed:
CSWoldapclient openldap_client - OpenLDAP client executables (oldapclient)
(sparc) 2.3.31,REV=2007.01.07
Do you want to remove this package? [y,n,?,q] y
## Removing installed package instance <CSWoldapclient>
## Verifying package <CSWoldapclient> dependencies in global zone
WARNING:
The <CSWoldap> package depends on the package currently
being removed.
Dependency checking failed.
Do you want to continue with the removal of this package [y,n,?,q]
...I look around for a way to automate this. And surprise, there
is, and I've missed it the whole time. My bad. So: pkg-get -f upgrade
it is, then.
It runs for 45 minutes and stops with an error about CSWcommon:
Current administration requires that a unique instance of the
<CSWcommon> package be created. However, the maximum number of
instances of the package which may be supported at one time on the
same system has already been met.
Hm, sez I. That's strange, but maybe that's what it's like for package
managers that suck. pkg-get -r common and pkg-get -i common, and
I'm ready for the upgrade again.
Somehow in the process I managed to remove the pkg_get package,
which (surprise) contains the pkg-get command. Fortunately I have a
backup copy around and use that to install pkg_get. Life continues.
And it's not for another 15 minutes after that that I notice that
the package manager is going in loops. It keeps going over the same
packages again and again, giving the same error about unique
instances each time. A quick search turns up this link, which
tells me I'm a fool for believing the help offered by pkg-get:
$ pkg-get -h
pkg-get, by Philip Brown , phil@bolthole.com
(Internal SCCS code revision 3.6)
Originally from http://www.bolthole.com/solaris/pkg-get.html
pkg-get is used to install free software packages
pkg-get
Need one of 'install', 'upgrade', 'available','compare'
'-i|install' installs a package
'-u|upgrade' upgrades already installed packages if possible
'-a|available' lists the available packages in the catalog
'-c|compare' shows installed package versions vs available
'-l|list' shows installed packages by software name only
Optional modifiers:
'-d|download' just download the package, not install
'-D|describe' describe available packages, or search for one
'-U|updatecatalog' updates download site inventory
'-S|sync' Makes update mode sync to version on mirror site
'-f' dont ask any questions: force default pkgadd behaviour
Normally used with an override admin file
See /var/pkg-get/admin-fullauto
'-s ftp://site/dir' temporarily override site to get from
and that the correct way to do what I want is to run:
true | sudo pkg-get upgrade
I admit that I neither knew nor sought to find out what "default pkgadd behaviour" would be, so that's my fault. I admit that I was the one who borked things by removing the pkg-get command. I admit that I did not think to record all of this with script, so at the moment I'm going on scribbled notes and memory. This is not a bug report, which is what I really should be writing. These are all things I did wrong or badly.
But isn't this what apt has fixed? On its worst day, I've never
had to set up yes to be the drinking bird that would let me
get stuff done. And — when all was done, and I got to go back to
installing VLC — I've never had it depend on gcc.
Arghh. Arghh arghh arghh.
Tags:
rant
solaris
packagemanagement
17 Aug 2007
I spent the better part of the day yesterday setting up IPv6 at home
now that I've got my subnet from SixXS. I'm running rtadvd on
my OpenBSD firewall, and was testing it with rtsold on a laptop
running OpenBSD. I'm not sure what I was doing wrong, but for the
longest time all the laptop would pick up was the gateway; it would
not set up a global address, but stick with the link-local address
only. Every time I tried to ping the dancing turtle it would try
sending it with the fe80 address, which of course did not work.
In the end, after a few reboots of both machines, it did work. My
notes were a little thin (hey, this is my vacation here :-), but I
can't think of what changed…the laptop just started setting itself a
global address, routing worked, and that was that. Weird.
Next up will be to get the website working on IPv6. Maybe a dancing
daemon or something…
And hey, I won tickets to see William Gibson speak! "Hey,
Mr. Gibson...you know that book you wrote called Virtual Light?
...It was really cool." Ah, fanboys. But my wife wants to go
too, 'cos she loved Pattern Recognition. Should be a fun night.
And I just realized that although I've been generating an RSS2
feed, I've never linked to the RSS2 feed until now.
Enjoy.
Tags:
ipv6
books
14 Aug 2007
Somehow in the move of the websites and files from Linode
back to Thornhill (home server on the other end of DSL; 1.5GHz Sempron
and 1GB of RAM in a nice Shuttle box), I copied ~/.spamassassin to
the wrong directory...and wow, did this ever make a difference to spam
filtering. My mailbox was flooded with stuff coming in to an old
(12 years!) address that I pretty much just use for WHOIS contacts
these days.
I didn't realize what was going on at first, so I tried training it on
my saved spam and ham. 90k messages later, it still didn't do it
properly. I did some digging, then figured out what had happened and
copied the files to the right place. Boom — the sweet, sweet sound of
a nearly-empty inbox.
The user_prefs files were the same each time, so it was just the
Bayes token files that were different. The only thing I can think of
is that the working files were the result of training SA on its
mistakes, rather than on its successes.
Of course, I should probably just get the address cancelled or
changed…the last time I looked, well over 95% of the spam I've got
came to that address. But still, I'm starting to think that I should
be keeping the Bayes files under revision control...
Tags:
spam
14 Aug 2007
While obsessively prowling my referrers today, I noticed that I've
been aggregated on Planet Sysadmin. I'm incredibly
flattered. Looks like there's some damn fine reading there, and it
looks like I have to fix my RSS feed...apologies for the lack of
paragraph breaks.
Tags:
meta
13 Aug 2007
The move of all the websites and mail from the server in Atlanta to
home took longer than I thought. First I came across problems with the
quad-hme interface in the Sparc Ultra 1 workstation I'd been using as
a firewall, and I had to resurrect Francisco, an AMD Pentium clone,
and install OpenBSD 4.1 on it. Then using pf and spamd to do
greylisting didn't work so well, and I had to turn it off. Then some
DNS/routing stuff I'd missed before…
Done, though, at long last. Time to sleep.
Tags:
meta
10 Aug 2007
When I got my first job in IT, a friend of mine bought me a copy
of the third edition of Unix in a Nutshell. (Incidentally, why
does O'Reilly's search, which in my client returns "Sorry, no
matches were found containing ." (sic), suck so much?) Sure, it was
help desk on a small ISP, but it was something. I read that book front
to back on the bus to and from work, and filled it full of stickers
from all the servers or PCs I assembled.
The sysadmin at that first job also had a cordless drill, and that
made things so much easier when assembling or racking servers. I
wanted one, but I didn't buy one 'cos I figured I hadn't earned it
yet. When my Italian millwright father-in-law bought me one, I
felt like it was a vote of confidence in a way.
Another thing the sysadmin had was a Leatherman Wave. Again, I
wanted one, but I didn't think I'd earned it yet. Last week, I decided
to get one; and if I was going to get one, I was going to wear the
damn thing. I started wearing the sheath on my belt, and waited for a
chance to use it.
Today I had that chance.
I got to work and went to the kitchen to grab a coffee. "There's a bat
behind the fridge," I heard.
What?
The cleaning woman pointed. "I moved out the fridge to clean it," she
said. "There was a bat behind it. I don't want to touch it."
I looked, and sure enough there was one hanging by the edge of the
cupboard. It was small, like a mouse wearing an overcoat. (Goth mouse?)
And then my moment came.
There were no gloves (I was worried about rabies), but there was a
towel. I draped the towel over the bat while frightened coworkers
watched, and then covered it with a recycling bin.
And then I took out the Leatherman, and flipped out the knife. "I need
help cutting cardboard," I said, and the receptionist came to
help. She sliced up a cardboard box and gave me a square of it. I slid
it between the cupboard and the towel, sandwiching the bat gently
between it and the towel, with the recycling bin behind.
I carried it outside to a clump of trees (ah, the advantages of living
on a beautiful campus), found a stick, coaxed it onto it and then
left it up a tree.
But I couldn't have done it...
...without the Leatherman.
(This writing style brought to you by my third reading of Battlefield
Earth. Our motto: Yeah, it's trash...so what?)
In other news, Hunter Matthews is giving a workshop on server
room best practices at LISA '07. I met him at LISA last year, when
he was another attendee of an otherwise thin tutorial on setting
up server rooms/closets. He was also at the documentation BOF, and the
one who said "I've got one user who considers 7-bit ASCII a
luxury compared to what you can get from 5 or 6 bits." (Oh, and:
"Cooperative collaboration. Yeah, it's part of our vision statement.")
He's a good guy and a good teacher, and if you're going to LISA you
could do a lot worse than going to his workshop.
Tags:
books
lisa
09 Aug 2007
title: Memo to myself(2)
date: Thu Aug 9 19:06:23 EDT 2007
There is always time to document something. Even if it's just
throwing a typescript file on a wiki. And there is always time to
turn the typescript dump into real documentation once things have
calmed down.
Tags:
06 Aug 2007
I've been fiddling with IPv6 for years, but have never actually done
anything serious with it. When I started work at Dowco, and my
web server was a 200MHz Pentium I inherited from friends of mine,
my plan was to get a tunnel from Hurricane Electric, then run a
tunnel broker service of my own for customers. (There was a burning
thirst for IPv6 subnets, let me tell you.) It foundered when it got to
the point of coming up with a website that'd let you register; cookies
and sessions and I don't know what all just bored me to tears.
At my next/last job, IPv6 was used in-house. The sysadmin before me
had set up 6to4 because he wanted to connect to his machine at
work without NAT. I kept it going long past the time he left, and as
far as I know it's still there. But beyond presenting many more ways
for DNS problems to screw things up, not much was ever done with it.
Last year I signed up for another account from HE, got a prefix, then
lost track of it when it came time to add IPv6 rules for the
firewall. Of course, there was other stuff going on too.
This year HE's registration form is borked, saying that it can't
insert my MD5'd password into MySQL, so I've applied for an account
with SixXS. (Sadly, it seems that despite appearances, my
ISP isn't interested.) I've got a week's vacation coming up, so
along with moving the server from Atlanta to home I think I'll
try to get IPv6 working as well.
Next beer in Jerusalem! (Which, shet my mouth, is not even close
to original.)
Tags:
ipv6
06 Aug 2007
I had no idea. And he's speaking about it here in
Vancouver. 12 years here and I still haven't run into him, unlike
folks I know. Here's hoping I win tickets.
Tags:
books
05 Aug 2007
The upgrade to Solaris 10 did not work. The main problem was that
logging in at the console (even as root!) simply would not work: I'd
get logged right back out again each time, with no error message or
anything. WTF?
I managed to go into single-user mode, provide the root password (see?
they do trust me) and get access that way. But I still couldn't figure
out what was going wrong. Eventually I came across this entry in the
logs:
svc.startd[7]: [ID 694882 daemon.notice] instance svc:/system/console-login:default exited with status 16
And /var/svc/log/system-console-login:default.log said:
[ Aug 4 14:23:48 Executing start method ("/lib/svc/method/console-login") ]
[ Aug 4 14:24:05 Stopping because all processes in service exited. ]
Eventually I had to give up and revert back to Solaris 9. That
part worked well, at least.
I've no idea what went wrong at this point, but since I haven't come
across this before with other Solaris 10 installs I'm starting to
wonder if it's a product of luupgrade attempting to merge the machine's
current settings with Sol10. Between that suspicion and the increase
in disk space needed to run luupgrade (not sure why, but for example
/usr needed a couple extra GB of space in order to complete
luupgrade; I presume something's being added or kept around, but
there's no explanation I can find for this), I'm starting to think
that just going with a clean install of Sol10 is the way to go.
Arghh. Live Upgrade was supposed to just work.
Tags:
solaris
upgrade
warstory
04 Aug 2007
I'm running Solaris Live Upgrade at work to upgrade our main server
from Solaris 9 to Solaris 10, and one thing I haven't seen mentioned
in all the things I've read about it is how long it takes.
Right now, for example, I'm running luactivate to activate the new
boot environment. It's been running for half an hour now, with no
indication about how long it's going to take. If I'd known it would
take this long, I'd have scheduled it for earlier this morning. And
yeah, it would've been obvious if I'd thought about it...
Shet my mouth, it just finished after 38 minutes. For the record, this
is on a V480 with 16GB of RAM, and call it 50GB total of disks to be
synced.
Tags:
solaris
02 Aug 2007
Just came across norman.walsh.name while looking for information
on Mercurial, and I'm intrigued. I'll have to take a look at the
Makefile and maybe steal some ideas...by the beard of Saint Tim,
this site could use a rewrite.
Tags:
meta
30 Jul 2007
Woot! The second edition of "The Practice of System and Network
Administration" has finally started shipping! Just ordered my
copy, along with Beautiful Code and Perl Best Practices. I
love books, oh yes I do.
Tags:
books
30 Jul 2007
I dreamt last night that I met Sammy Davis Junior and talked him into
releasing all of his songs under a Creative Commons license 20
years after his death.
"Hugh, I'm convinced," he said. "And I want to write a cheque for ten
thousand dollars, too. Tell me which organization to give it to."
(I think I also piqued RMS's interest in the issue, too, but I
can't remember that part.)
In other news, my old employers have been bought by another BC
ISP. I can't say I'm surprised, but it's a shame the new company
uses IIS for their webserver...yeah, I'm a Unix geek, all right.
And many thanks to tobutaz for providing a much better answer to
the question "How do I get the home directory of a user whose name is
in a variable?":
USER=jdoe
eval "USERHOME=~$USER"
Every now and then I'm reminded of eval and then forget it when it's
useful. Thanks again, tobutaz!
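One caveat worth sketching: eval will run whatever ends up in the variable, so if the name comes from anywhere untrusted it's worth checking it first. A minimal version follows, assuming usernames only contain letters, digits, dots, underscores and hyphens. The function name home_of is made up here and is not part of tobutaz's answer.

```shell
# home_of NAME: print NAME's home directory via tilde expansion.
# Refuses anything that isn't a plain username, since eval would execute it.
home_of() {
    case "$1" in
        ''|-*|*[!A-Za-z0-9._-]*) return 1 ;;
    esac
    eval "printf '%s\n' ~$1"
}
```

If the user doesn't exist, the shell leaves the ~name text alone rather than failing, so it's still worth checking that the result is a directory before moving files into it.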
Tags:
scripting
28 Jul 2007
Not the clearest title...what I mean is, I was writing a Bash script like this:
SOMEONE=foo
...
mv $SOMETHING ~$SOMEONE
only it kept failing with "~foo: No such file or directory". I had a
look at the manual, but it wasn't terribly clear. In a nutshell,
though, Bash wasn't doing the tilde expansion no matter what
combination of braces and quotes I tried.
Eventually, I came across a pirated copy of the ever-excellent Unix
Power Tools, 3rd Ed. (mine's at work, or I'd've checked it a lot
sooner), and it had a solution:
FINAL_RESTING_PLACE=$(/bin/bash -c "echo ~${SOMEONE}")
I'm sure there must be a better way of doing this, though...Woot, any
suggestions?
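For what it's worth, the reason appears to be ordering: Bash does tilde expansion before it expands variables, so by the time $SOMEONE has turned into a name, the tilde has already been looked at and left alone. A small demonstration, using root as a stand-in username:

```shell
SOMEONE=root
# Tilde expansion runs first and sees the literal text "$SOMEONE", which
# isn't a login name, so the tilde is left as-is; only afterwards is the
# variable expanded.
unexpanded=$(echo ~$SOMEONE)
echo "$unexpanded"
```

The result is the string ~root rather than root's home directory, which matches the "No such file or directory" error above. Running the expansion through a second shell works because that second pass sees the name already filled in.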
Tags:
scripting
28 Jul 2007
I can get really, really focussed sometimes. Every now and then that
happens with Nagios.
Yesterday I had some time to kill before I went home, so I looked over
my tickets in RT. (I work in a small shop, so a lot of the time
the tickets in RT are a way of adding things to my to-do list.) There
was one that said to watch for changes in our web site's main page;
I'd added that one after MySQL'd had problems one time -- ran out of
connections, I think -- and Mambo had displayed a nice "Whoops! Can
someone please tell the sysadmin?" page (a nice change from the usual
cryptic error when there's no database connection). Someone did, but
it would've been nice to be paged about it.
At home I use WebSec to keep track of some pages that don't
change very often (worse luck…), and I thought of using that. It
sends you the new web page with the different bits highlighted, which
is a nice touch. But I wanted something tied in with Nagios, rather
than another separate and special system.
So I started looking at the Nagios plugins I had, and I was surprised
to find that check_http has a raft of different options, including
the ability to check for regexes in the content. Sweet! I added a
couple strings that'll almost certainly be there until The Next Big
Redesign(tm), and done.
I started looking at the other plugins, and noticed check_hpjd. A
few minutes later I was checking our printers for errors...just in
time to notice a weird error that someone had emailed me about 30
seconds before. Nice!
This morning (I work from home on Saturdays in return for getting
Wednesdays off to take care of Arlo) I was checking Cacti
(which rocks even if they do call it a solution). /home/visitors
with no free space? Wha'? Someone had run a job that'd managed to fill
the whole damned partition.
Well, there's check_disk, but that's only for mounted disks, and I
don't want the monitoring machine freezing if there's a problem with
NFS. SNMP should do this, right? Right — the net-snmp project has
the ability to throw errors if there's less than a certain amount of
free space on a disk. For some reason I'd never set that up before,
nor got Nagios to monitor for it. A few minutes later and check_snmp
was looking for non-empty error messages:
$USER1$/check_snmp -H $HOSTADDRESS$ -o UCD-SNMP-MIB::dskErrorMsg.$ARG1$ -s ""
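The logic behind that check is simple enough to sketch in plain shell: an empty dskErrorMsg means the disk is fine, and anything else is a problem. This is only an illustration of the idea, not how check_snmp is implemented. The function name and the message wording are made up, and the return values follow the usual Nagios plugin convention of 0 for OK and 2 for CRITICAL.

```shell
# check_disk_msg MSG: MSG is the dskErrorMsg string fetched for one disk.
# Prints a one-line status and returns a Nagios-style plugin exit code.
check_disk_msg() {
    if [ -z "$1" ]; then
        echo "DISK OK - no error message"
        return 0
    fi
    echo "DISK CRITICAL - $1"
    return 2
}
```

Anything the agent reports for the disk ends up in the alert text, so the page says which partition filled up.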
I looked ahead in snmpd.conf and noticed the process section. Well,
hell! It's all very good to check that the web server is running, but
what if there are too many Apache processes? Or too few of MySQL? Or
no Postfix? Can't believe I never set this up before…
I've finally come up for breath. This wasn't what I planned on doing
this morning, but I love it when a plan will come together next
time.
Tags:
monitoring
mysql
27 Jul 2007
title: Happy Sysadmin Appreciation Day!
date: Fri Jul 27 14:01:19 EDT 2007
This DNS regression suite looks tres cool. I've just upgraded BIND at work to the latest version, so maybe this is the next thing to try.
Tags:
21 Jul 2007
Which is very timely, as I'm trying to track down why nscd door access
is taking so long: http://au.sun.com/news/onsun/2002-11/tech_tips.html
Tags:
solaris