Attended a talk today about the upgrade of UBC's network from supporting VLANs to supporting VRFs. Complicated but neat. I'm hoping that the presentation slides and video will make it to the IT Services website; I'll post a link if they do.
Also interesting is this external review of the IT department at UBC. It touches on some things I've been peripherally involved or interested in (funding models, culture and management); I've only skimmed it so far, but it's fascinating to read something so straightforward.
Saturday I upgraded the big machine at work to Solaris 10 11/06. This did not go well.
First off, I ended up installing onto a disk that held home directories. The install was a manual one, and I'd carefully noted in advance the disk I'd be installing to: the second internal hard drive, the one I'd tried doing the luactivate on a couple weeks ago.
Except the disk targets/names/whatever had changed, and so c1t0d1 (say) was now one of the home partitions mounted from the external StorEdge array. Fuck. There was a backup: I'd taken one before starting the install. Unfortunately, it was taken three hours before the install started, and the machine had been up and running during that time. The install started at 8am, so I'm hopeful there wasn't too much lost between 5am and 8am. But don't think I'm trying to minimize that mistake.
Second, I'd also managed to bork the disklabel for the original Solaris 9 install. I dug up the original disklabel somewhere — it wasn't in the documentation we've got, and I should have put it in there a long time ago — and restored everything to the way it was. It hadn't been formatted, so everything was okay.
Third, when it came up only one of the three external drives from the StorEdge was present, and I could not figure out where the others had gone. (It took me a while to figure this out; when I first realized my first mistake, I thought I'd installed over all the home directories. That was an awful moment.)
It took a lot of Googling to figure out what I should have already known about Solaris in general, and what should have been documented about this machine in particular: that /kernel/drv/sd.conf had been modified to add additional entries for LUNs that otherwise Solaris wouldn't have looked for.
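For my own notes, here's a rough sketch of what those extra entries look like. Lines in sd.conf take the form name="sd" class="scsi" target=N lun=M; and each one tells the sd driver to probe that target/LUN pair. The target and LUN numbers below are made up for illustration (they are not the ones on this machine), and the function name sd_conf_entries is just something I picked.

```python
def sd_conf_entries(target, luns):
    """Return sd.conf lines asking the sd driver to probe the given LUNs on one target.

    The target and LUN values are whatever the array actually presents;
    the ones used below are hypothetical.
    """
    return [f'name="sd" class="scsi" target={target} lun={lun};' for lun in luns]


# Hypothetical example: LUNs 1 through 3 on target 2.
for line in sd_conf_entries(target=2, luns=range(1, 4)):
    print(line)
```

Keeping the generated lines in the machine's documentation, next to the disk labels, would have saved a lot of Googling.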
(Many thanks to Brandon Hutchinson, whose entry on this very subject saved my butt. I wrote him a grateful email, and I wish him the best.)
(Incidentally, a reconfiguration reboot on a VS480 takes between 10 and 25 minutes. It's not a fast process. Also not a fast process is installing Solaris patches; I spent at least two hours on this all told, not counting reconfiguration reboots.)
I restored the one home directory (having recreated it in ZFS…one bright spot in all that) and mounted the others. All this got me, at 6pm, where I should have been at noon.
I was there 'til 11:30pm on Saturday fixing things up to the point where it was more or less ready for SSH-based logins. Then I took a cab home. Then I came in yesterday at 10am and got almost everything else working: SunRays (oh, the new desktop is beautiful), printing, software, and I can't even remember what all at this point.
I took lots of notes and did everything from within screen with logging turned on. (Bonus points for next time: set the prompt to show the time, so I can tell what order I did things in.) I'll be going over all of it to do things better next time.
Here's some stuff I already know:
Backups. It's said you never know how much you need 'em 'til you need 'em. True 'nuff.
DOCUMENTATION. I spent a good part of yesterday getting information on every disk while waiting for other software to install. I should have done this long, long ago.
(Incidentally, on that front I owe Blastwave an apology: right on the goddamn HOWTO page there's a section on automation. My mistake. But I still don't like the fact that the remove option (-r) is undocumented, and presumably undocumented because of the warning it prints that it's not very smart and shouldn't be used.)
Know what you're dealing with. The home partition I erased was bigger than the disk I expected to install on, but I wasn't sure of its size.
Stop if you're not sure. I should have stopped at the last point.
Be paranoid. Usually I am, but it would have been good to disconnect every superfluous drive rather than go through all this hell.
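The "know what you're dealing with" and "stop if you're not sure" points could be turned into a check that refuses to go on when the target disk isn't the size the documentation says it should be. This is only a sketch: the function name, the sizes, and the tolerance are all assumptions, and c1t0d1 is the same stand-in device name as above. Getting the real size out of the machine is left out entirely.

```python
def check_install_target(device, expected_gb, actual_gb, tolerance_gb=1.0):
    """Return True if the disk is the documented size, otherwise raise ValueError.

    expected_gb comes from the documentation, actual_gb from whatever the
    machine reports right now. Both values here are hypothetical.
    """
    if abs(actual_gb - expected_gb) > tolerance_gb:
        raise ValueError(
            f"{device}: expected about {expected_gb} GB but found {actual_gb} GB; "
            "stop and work out which disk this really is"
        )
    return True


# Hypothetical sizes: a 73 GB internal disk was expected, something much bigger showed up.
try:
    check_install_target("c1t0d1", expected_gb=73, actual_gb=400)
except ValueError as err:
    print(err)
```

The point is that a size mismatch stops the install instead of becoming a nagging doubt to push past.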
Sometimes it really amazes me that I get paid to do this work because it's so much fun. And sometimes I'm amazed because I figure I shouldn't be allowed to touch computers with a ten-foot pole.
I'm feeling pretty damned humble this morning. With luck that feeling will stay.
The upgrade to Solaris 10 did not work. The main problem was that logging in at the console (even as root!) simply would not work: I'd get logged right back out again each time, with no error message or anything. WTF?
I managed to go into single-user mode, provide the root password (see? they do trust me) and get access that way. But I still couldn't figure out what was going wrong. Eventually I came across this entry in the logs:
svc.startd[7]: [ID 694882 daemon.notice] instance svc:/system/console-login:default exited with status 16
And /var/svc/log/system-console-login:default.log said:
[ Aug 4 14:23:48 Executing start method ("/lib/svc/method/console-login") ]
[ Aug 4 14:24:05 Stopping because all processes in service exited. ]
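A quick way to see how long the service lasted before dying is to pull the timestamps out of those two lines. This is a minimal sketch that assumes the log lines always look like the two above; the function names are mine, and since the log carries no year, the year argument is a placeholder that only matters for subtracting one time from the other.

```python
import re
from datetime import datetime

# Matches lines of the form "[ Aug 4 14:23:48 some message ]".
LOG_LINE = re.compile(r"^\[ (\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (.*) \]$")


def parse_smf_log(text, year=2000):
    """Return (timestamp, message) pairs; the year is a placeholder assumption."""
    events = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not in the bracketed format
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        events.append((stamp, match.group(2)))
    return events


def seconds_until_exit(events):
    """Seconds between the start method running and the service stopping, or None."""
    start = None
    for stamp, message in events:
        if message.startswith("Executing start method"):
            start = stamp
        elif message.startswith("Stopping because") and start is not None:
            return (stamp - start).total_seconds()
    return None


sample = """\
[ Aug 4 14:23:48 Executing start method ("/lib/svc/method/console-login") ]
[ Aug 4 14:24:05 Stopping because all processes in service exited. ]
"""
print(seconds_until_exit(parse_smf_log(sample)))
```

For the two lines above that works out to 17 seconds between the start method running and the service stopping.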
Eventually I had to give up and revert to Solaris 9. That part worked well, at least.
I've no idea what went wrong at this point, but since I haven't come across this before with other Solaris 10 installs I'm starting to wonder if it's a product of luupgrade attempting to merge the machine's current settings with Sol10. Between that suspicion and the increase in disk space needed to run luupgrade (not sure why, but for example /usr needed a couple extra GB of space in order to complete luupgrade; I presume something's being added or kept around, but there's no explanation I can find for this), I'm starting to think that just going with a clean install of Sol10 is the way to go.
Arghh. Live Upgrade was supposed to just work.
I bought a T60 for my boss a while back, and have just finished putting in another memory module. Man, I knew this was the lower end of their laptops, but I had no idea it would feel so cheap.
To get at the memory, you take out a few screws on the back, then lift off the palm guard below the keyboard. It's flimsy plastic, and it's hard to get back in the right place - doubly so, since it feels like instead of clicking into place it's going to break. And you need to remove the ribbon that connects the touch pad and fingerprint reader in order to fully remove it; when putting it back in, it looks like it's going to get crimped. That can't be right.
I had been considering getting one of these, despite having fallen in love with my other boss' Dell D420. But this just makes me think that the extra money for the D420 would be worth it. Of course, I haven't had to crack that one open yet…