So yesterday I got an email from another sysadmin: "Hey, looks like there's a lot of blocked connections to your server X. Anything happening there?" Me: "Well, I moved email from X to Y on Tuesday...but I changed the MX to point at Y. What's happening there?"
Turns out I'd missed a fucking domain: I'd left the MX pointing to the old server instead of moving it to the new one. And when I turned off the mail server on the old domain, delivery to this domain stopped. Fortunately I was able to get things going again: I changed the MX to point at the new server, and turned on the old server again to handle things until the new record propagated.
So how in hell did this happen? I can see two things I did wrong:
Poor planning: my plans and checklists included all the steps I needed to do, but did not mention the actual domains being moved. I relied on memory, which meant I remembered (and tested) two and forgot the third. I should have included the actual domains: both a note to check the settings and a test of email delivery.
No email delivery check by Nagios: Nagios checks that the email server is up, displays a banner and so on, but does not check actual email delivery for the domains I'm responsible for. There's a plugin for that, of course, and I'm going to be adding that.
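For what it's worth, here's a minimal sketch of the kind of per-domain check that would have caught this. It is not the Nagios plugin mentioned above, and it only looks at DNS rather than actual delivery: it compares each domain's MX records against the host that's supposed to be handling its mail. The domain names, the EXPECTED table and the reliance on the dig command are all my assumptions for illustration.

```python
import subprocess

# Hypothetical checklist: every domain being moved, and the MX host it
# should point at after the migration. Placeholder names, not real ones.
EXPECTED = {
    "example.com": "y.example.com",
    "example.org": "y.example.com",
    "example.net": "y.example.com",
}


def parse_mx(dig_output):
    """Turn `dig +short MX` output ("10 mail.example.com.") into a set of hosts."""
    hosts = set()
    for line in dig_output.splitlines():
        parts = line.split()
        # Each answer line is "<preference> <hostname>."
        if len(parts) == 2 and parts[0].isdigit():
            hosts.add(parts[1].rstrip(".").lower())
    return hosts


def wrong_mx(expected, lookup):
    """Return {domain: hosts_found} for every domain whose MX set is not
    exactly the expected host. `lookup` maps a domain to raw dig output."""
    bad = {}
    for domain, host in sorted(expected.items()):
        found = parse_mx(lookup(domain))
        if found != {host.lower()}:
            bad[domain] = found
    return bad


def dig_lookup(domain):
    """Ask the local resolver for MX records (assumes `dig` is installed)."""
    result = subprocess.run(
        ["dig", "+short", "MX", domain],
        capture_output=True, text=True, check=False,
    )
    return result.stdout
```

Calling wrong_mx(EXPECTED, dig_lookup) before and after the cutover gives a dictionary of domains that still need attention; an empty result means every domain on the list points where the plan says it should. The point is less the code than the table at the top: the domains are written down, not remembered.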
I try to make a point of writing down things that go bad at $WORK, along with things that go well. This is one of those things.
At $WORK, I'm going to be taking over the administration of four servers that currently do stuff for a variety of researchers scattered around the province. There are a number of players here:
The owning agency has also ponied up for an upgrade to the four servers; I'll be taking delivery some time next week.
I've got some preliminary information -- what the servers do, how the users use the thing, etc -- but I'm preparing a more detailed plan. In the meantime, I've compiled a list of questions for my local contact.
In the middle of that, it occurred to me that this would be a good discussion topic. Have I missed anything? Let me know!
What info do users/owners expect from us? How? (Mailing list, status page; 2 weeks notice of downtime, monthly stats by CPU)
- Are any funding decisions influenced by this information?
Where is the info for the software?
- media
- license #, what we have licenses for (unlimited use, # cores, etc)
- support #, what it covers
Can I see a demo of the software?
Do any of the labs have shell access? What do they do with it?
What exactly is involved in maintenance? Where is this documented?
What DNS changes will be made? Who makes them?
Who makes policy/purchase decisions about these servers? How do I contact them?
This is going to be a long story, but I hope it'll be instructive. Bear with me.
Back at my last job, we had a Samba server, running on FreeBSD, acting as a Primary Domain Controller for around 35 W2K machines. The same machine also acted as NIS master for a similar number of FreeBSD machines. It also did printing, mail, DNS, and half a dozen other things. This machine was getting old; its CPU usage was often pegged by a large print job, it was running out of disk space, and I was beginning to be worried about the inevitable day of death. I began planning for the upgrade: a new machine, faster and bigger hard drives, more memory and gigabit ethernet for the day we all moved to GigE. Oh, and rack-mounted...definitely rack-mounted.
I took the opportunity to upgrade much of the software on the machine, including Samba. I decided to move from 2.2 to the 3.0 series; the speed differences seemed pretty impressive. I also wanted to get as many of the big upgrades done at once as possible: the prospect of going through the upgrade repeatedly did not appeal.
Of all the upgrades I was doing, Samba made me the most nervous. I read through the excellent (and Free) Samba HOWTO and made notes: how to move to the tdbsam password database, changes in configuration options, and so on. I had the new server for a while, so I was able to run through many tests: getting a Windows machine to log on, DNS queries, and so on.
Finally, the big day came. I went in on a Saturday and made the move. Most of the rest of the day was spent testing, chasing down the inevitable mistakes, and testing some more. I tested by logging into machines after they'd joined the domain, and making sure that everyone could still log into their workstations. All told, things went pretty damned well, and I congratulated myself on a job well done.
Later, though, a few things began to crop up that I couldn't explain. I could no longer add new domain accounts to SSH under Cygwin. A shared printer wasn't being shared any longer. In fact, shares weren't working at all. I banged my head against this for a while, but since the problems were pretty erratic they tended to fall by the wayside in favour of explaining, one more time, why the words "spare computer" were self-contradictory.
Finally, though, I put some more time into it. And it's a little hairy, especially for this Unix guy, so bear with me.
(Incidentally, I couldn't have figured out half of this without the help of Clarence Lee, a co-op student working with us. Sure, he uses IIS, but he firewalls it with OpenBSD and he got an internship at Microsoft. He's a good guy.)
The shared printer: could not figure out what was going on here. Guy who had it could print to it, no problem. Used to work for everyone, no problem. Now it wouldn't work. Broke the problem down to the point where I was using smbclient on FreeBSD, or net view on W2K, to try and list the shares, and that didn't work. Not any of them -- not IPC$ or anything. I was fairly sure this wasn't supposed to be happening.
There was a machine in limbo (not the same as spare, thenk yew!) while a co-op student became permanent. I got it using the other networked printer, and tried sharing it. Again, command-line utilities would simply not list the shares. What's more, when I tried getting other people to log into the machine (I was fairly irritated at this point, and not at my most rational), they couldn't log in. WTF? I could log in, and there had been no complaints from the person whose machine it had been.
In a moment of irritation, I got the test machine to rejoin the domain...and suddenly, everything was working: I could list shares on it, other people could list shares on it, people could log in, and everything. Yay! It's so simple! Rejoin the domain! Everything will be great!
Ha! It is to laugh. Profiles were not coming in when people logged in. My Documents was empty, they got that stupid, evil, vile "Let's take a tour of Windows! And let me help you set up your network! DO IT!" popup window. I couldn't figure it out.
Clarence and I banged our heads against it some more, and finally came to a conclusion.
When you migrate Samba, you're meant to take the old SID with you using net(8) GETLOCALSID and SETLOCALSID. The SID is meant to be a world-unique string/number that identifies a domain, or an account -- think something like the DN in LDAP, or NIS domainname + UID in Unix. (A user's SID has a part that belongs to the domain, and another, smaller part that is unique to that user.) I didn't do that -- screwup -- and so the Samba server had generated a new SID. As far as Windows is concerned, the identity of your domain is solely determined by the SID; the name is there just for your convenience. (Insert snide remark here about how magic invisible numbers have no business being that important.)
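To make the two-part structure concrete, here's a rough Python sketch. It pulls a SID out of whatever text the net command prints, splits an account SID into its domain part and its per-user part (the RID), and checks whether an account belongs to a given domain. The SIDs and the sample output line are made up, and the function names are mine, not anything from Samba.

```python
import re

# Textual form of a Windows SID, e.g. S-1-5-21-1111-2222-3333
SID_RE = re.compile(r"S-1-\d+(?:-\d+)+")


def find_sid(text):
    """Pull the first SID out of command output, such as what
    `net getlocalsid` prints on the old server."""
    match = SID_RE.search(text)
    if match is None:
        raise ValueError("no SID found in: %r" % text)
    return match.group(0)


def split_sid(user_sid):
    """Split an account SID into (domain SID, RID).

    The last dash-separated field is the per-account part; everything
    before it identifies the domain."""
    domain_sid, _, rid = user_sid.rpartition("-")
    return domain_sid, int(rid)


def belongs_to(user_sid, domain_sid):
    """True if the account SID was issued by the given domain."""
    return split_sid(user_sid)[0] == domain_sid
```

For example, if the old server reported a domain SID of S-1-5-21-1111-2222-3333, then belongs_to("S-1-5-21-1111-2222-3333-1001", that_sid) is true, and it goes false the moment the new server invents its own SID. Saving the output of find_sid() from the old box before the move, and comparing it with the new box afterwards, is a one-line addition to a migration checklist.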
As a result, the machines that were present at the migration didn't know where their Primary Domain Controller (PDC -- the machine officially in charge of the domain) had gone, and were running on cached credentials, profiles and so on. (This is the same thing that allows you to log into a Windows laptop that belongs to a domain, even when you've taken it home and aren't able to reach your PDC any more.) Printing and shared resources from the Samba server continued to run because of open permissions or credentials (i.e., user name and password) that don't depend on SIDs.
This also explained why I could log into the machines without problems: because, as sysadmin, I'd logged into all of them before to do maintenance. My credentials were cached, so the machines were able to authenticate me w/o consulting with their (now missing) PDC. And of course, everyone was able to log into their own workstations for the same reason.
So: machine rejoins the domain and people can log in, because now the machine can find its PDC and verify their passwords. But profiles aren't showing up because the profile's NTUSER.DAT -- the user's hive, loaded into the registry at HKEY_CURRENT_USER when they log in -- belonged to/was marked with/was owned by the account's old SID, and Windows refused to load it and lots of stuff broke or was missing.
After some more searching, I finally figured out the way around this.
First, you need to use the profiles(1) tool in Samba to change the SID on NTUSER.DAT, which'll be wherever Samba keeps profiles. You should check the user's SID in Samba by using pdbedit(8), though odds are the user ID/group ID part will have remained the same.
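Here's a hedged sketch of that first step in Python: work out the user's new SID by keeping the per-user part and swapping the domain part, then build the profiles command line without running it. The -c (SID to change) and -n (SID to change to) options are how I remember the Samba 3.0 profiles(1) man page, so treat them as an assumption and check the man page for your version. The SIDs and the profile path are made up.

```python
def remap_sid(user_sid, old_domain_sid, new_domain_sid):
    """Keep the per-user part (RID) and swap the old domain SID for the new."""
    prefix = old_domain_sid + "-"
    rid = user_sid[len(prefix):]
    if not user_sid.startswith(prefix) or not rid.isdigit():
        raise ValueError("%s does not belong to %s" % (user_sid, old_domain_sid))
    return "%s-%s" % (new_domain_sid, rid)


def profiles_command(ntuser_path, old_user_sid, new_user_sid):
    """Build (but do not run) the profiles(1) invocation.

    ASSUMPTION: -c / -n are the change-SID / new-SID options; verify
    against the man page on your system before using this."""
    return ["profiles", "-c", old_user_sid, "-n", new_user_sid, ntuser_path]


# Made-up values for illustration only.
OLD_DOMAIN = "S-1-5-21-1111-2222-3333"
NEW_DOMAIN = "S-1-5-21-4444-5555-6666"
new_sid = remap_sid(OLD_DOMAIN + "-1001", OLD_DOMAIN, NEW_DOMAIN)
command = profiles_command("/hypothetical/profiles/jdoe/NTUSER.DAT",
                           OLD_DOMAIN + "-1001", new_sid)
```

Work on a copy of NTUSER.DAT, and compare new_sid against what pdbedit reports for the account before handing the command to subprocess.run or pasting it into a shell. Building the argument list separately makes it easy to print the whole batch and eyeball it first, which seems prudent given how this story started.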
Second, you need to take care of the profile. There are a few ways of doing this. The easiest way is to copy the modified NTUSER.DAT to their profile directory, then log into the machine as Administrator and join the new domain, then get the user to log in. Their profile will be copied over, just as if they'd logged into a machine for the first time. However, this can cause problems with certain programs that haven't been informed about the change.
To illustrate: if the domain is named EXAMPLE, and the user account is jdoe, then their profile will usually be at C:\Documents and Settings\jdoe (let's just call that D&S\jdoe for short). However, after the machine joins the new domain, D&S\jdoe will belong to an old account that's no longer around, which means that Windows will put their profile somewhere else -- probably something like D&S\jdoe.EXAMPLE. Odds are, though, that the old path will still be in the registry or other files, which means a lot of cycles of "Why-did-that-break-let-me-fix-it". Another option is simply to move D&S\jdoe out of the way, so that paths can remain the same. Finally, you can also change ownership recursively to the new account once you've joined the domain; this will take a while, but it's probably quicker than copying the profile over wholesale if they've got a lot of files. If you do this, it's best to remove the machine's copy of their NTUSER.DAT file; it'll just be copied over from the server.
This took a lot of work, of course, and usually there were things like Outlook.pst to screw things up further. But after much work, I finally got everyone moved over to the new domain, and things were good again.
Lessons learned: