Posts tagged “migration”

November 30, 2012 Niet zo goed
So yesterday I got an email from another sysadmin: "Hey, looks like there's a lot of blocked connections to your server X. Anything happening there?" Me: "Well, I moved email from X to Y on Tuesday...but I changed the MX to point at Y. What's happening there?"

Turns out I'd missed a fucking domain: I'd left the MX pointing to the old server instead of moving it to the new one. And when I turned off the mail server on the old machine, delivery to this domain stopped. Fortunately I was able to get things going again: I changed the MX to point at the new server, and turned on the old server again to handle things until the new record propagated.

So how in hell did this happen? I can see two things I did wrong:
- Poor planning: my plans and checklists included all the steps I needed to do, but did not mention the actual domains being moved. I relied on memory, which meant I remembered (and tested) two and forgot the third. I should have included the actual domains: both a note to check the settings and a test of email delivery.
- No email delivery check by Nagios: Nagios checks that the email server is up, displays a banner and so on, but does not check actual email delivery for the domains I'm responsible for. There's a plugin for that, of course, and I'm going to be adding that.
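The first point can be sketched as a checklist that lives in code rather than in memory: list every domain being moved, then flag any whose MX does not point at the new server. This is only a minimal sketch under assumptions. The domain names, the host `y.example.org` and the function names are hypothetical, and `lookup_mx` assumes `dig` is installed. It checks DNS only, so it is no substitute for a real delivery test.

```python
import subprocess

# Hypothetical values for illustration; substitute the real domains and host.
DOMAINS = ["example.org", "example.net", "example.com"]
NEW_MX = "y.example.org"


def parse_mx(dig_output):
    """Parse `dig +short MX` output ("10 mail.example.org." per line) into host names."""
    hosts = []
    for line in dig_output.splitlines():
        parts = line.split()
        # Keep only "<preference> <host>" lines; skip anything else dig prints.
        if len(parts) == 2 and parts[0].isdigit():
            hosts.append(parts[1].rstrip(".").lower())
    return hosts


def lookup_mx(domain):
    """Ask the local resolver for a domain's MX hosts (assumes dig is on PATH)."""
    result = subprocess.run(
        ["dig", "+short", "MX", domain],
        capture_output=True, text=True, check=True,
    )
    return parse_mx(result.stdout)


def unmigrated(domains, expected_host, lookup=lookup_mx):
    """Return the domains with no MX, or with any MX other than expected_host."""
    bad = []
    for domain in domains:
        hosts = lookup(domain)
        if not hosts or any(host != expected_host for host in hosts):
            bad.append(domain)
    return bad
```

Calling `unmigrated(DOMAINS, NEW_MX)` after the cutover returns the domains that still need attention. The `lookup` parameter exists so the logic can be exercised without touching the network.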
I try to make a point of writing down things that go bad at $WORK, along with things that go well. This is one of those things.
September 21, 2009 What to ask when taking over external servers?
At $WORK, I'm going to be taking over the administration of four servers that currently do stuff for a variety of researchers scattered around the province. There are a number of players here:
- My department, which contains:
  - Me, the guy whose services are being promised, and
  - The researcher who's arranging all this (my local contact)
- The agency that owns them, who I don't think has any technical staff
- The agency that currently hosts and administers the servers
The owning agency has also ponied up for an upgrade to the four servers; I'll be taking delivery some time next week.

I've got some preliminary information -- what the servers do, how the users use the thing, etc -- but I'm preparing a more detailed plan. In the meantime, I've compiled a list of questions for my local contact.

In the middle of that, it occurred to me that this would be a good discussion topic. Have I missed anything? Let me know!
- Will the old servers be moved over, or will the new ones replace them?
- What's the primary means of talking to users? (Mailing list, status page)
- Where's the list of those users? (one of the above, spreadsheet)
- What info do users/owners expect from us? How? (Mailing list, status page; 2 weeks notice of downtime, monthly stats by CPU)
  - Are any funding decisions influenced by this information?
- Where is the info for the software?
  - media
  - license #, what we have licenses for (unlimited use, # cores, etc)
  - support #, what it covers
- Can I see a demo of the software?
- Do any of the labs have shell access? What do they do with it?
- What exactly is involved in maintenance? Where is this documented?
- What DNS changes will be made? Who makes them?
- Who makes policy/purchase decisions about these servers? How do I contact them?
September 23, 2006 Samba Problems, or Don't Forget Your SID!
This is going to be a long story, but I hope it'll be instructive. Bear with me.

Back at my last job, we had a Samba server, running on FreeBSD, acting as a Primary Domain Controller for around 35 W2K machines. The same machine also acted as NIS master for a similar number of FreeBSD machines. It also did printing, mail, DNS, and half a dozen other things. This machine was getting old; its CPU usage was often pegged by a large print job, it was running out of disk space, and I was beginning to be worried about the inevitable day of death. I began planning for the upgrade: a new machine, faster and bigger hard drives, more memory and gigabit ethernet for the day we all moved to GigE. Oh, and rack-mounted...definitely rack-mounted.

The opportunity was taken to upgrade much of the software on the machine, including Samba. I decided to move from 2.2 to the 3.0 series; the speed differences seemed pretty impressive. I also wanted to get as many of the big upgrades done at once as possible: the prospect of going through the upgrade repeatedly did not appeal.

Of all the upgrades I was doing, Samba made me the most nervous. I read through the excellent (and Free) Samba HOWTO and made notes: how to move to the tdbsam password database, changes in configuration options, and so on. I had the new server for a while, so I was able to run through many tests: getting a Windows machine to log on, DNS queries, and so on.

Finally, the big day came. I went in on a Saturday and made the move. Most of the rest of the day was spent testing, chasing down the inevitable mistakes, and testing some more. I tested by logging into machines after they'd joined the domain, and making sure that everyone could still log into their workstations. All told, things went pretty damned well, and I congratulated myself on a job well done.

Later though, a few things began to crop up that I couldn't explain. I could no longer add new domain accounts to SSH under Cygwin. A shared printer wasn't being shared any longer. In fact, shares weren't working at all. I banged my head against this for a while, but since the problems were pretty erratic they tended to fall to the wayside in favour of explaining, one more time, why the words "spare computer" were self-contradictory.

Finally, though, I put some more time into it. And it's a little hairy, especially for this Unix guy, so bear with me.

(Incidentally, I couldn't have figured out half of this without the help of Clarence Lee, a co-op student working with us. Sure, he uses IIS, but he firewalls it with OpenBSD and he got an internship at Microsoft. He's a good guy.)

The shared printer: could not figure out what was going on here. Guy who had it could print to it, no problem. Used to work for everyone, no problem. Now it wouldn't work. Broke the problem down to the point where I was using smbclient on FreeBSD, or net view on W2K, to try and list the shares, and that didn't work. Not any of them -- not IPC$ or anything. I was fairly sure this wasn't supposed to be happening.

There was a machine in limbo (not the same as spare, thenk yew!) while a coop student became permanent. I got it using the other networked printer, and tried sharing it. Again, command-line utilities would simply not list the shares. What's more, when I tried getting other people to log into the machine (I was fairly irritated at this point, and not at my most rational), they couldn't log in. WTF? I could log in, and there had been no complaints from the person whose machine it had been. In a moment of irritation, I got the test machine to rejoin the domain...and suddenly, everything was working: I could list shares on it, other people could list shares on it, people could log in, and everything. Yay! It's so simple! Rejoin the domain! Everything will be great!

Ha! It is to laugh. Profiles were not coming in when people logged in. My Documents was empty, they got that stupid, evil, vile "Let's take a tour of Windows! And let me help you set up your network! DO IT!" popup window. I couldn't figure it out.

Clarence and I banged our heads against it some more, and finally came to a conclusion.

When you migrate Samba, you're meant to take the old SID with you using net(8) GETLOCALSID and SETLOCALSID. The SID is meant to be a world-unique string/number that identifies a domain, or an account -- think something like the DN in LDAP, or NIS domainname + UID in Unix. (A user's SID has a part that belongs to the domain, and another, smaller part that is unique to that user.) I didn't do that -- screwup -- and so the Samba server had generated a new SID. As far as Windows is concerned, the identity of your domain is solely determined by the SID; the name is there just for your convenience. (Insert snide remark here about how magic invisible numbers have no business being that important.)
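To make the two-part structure concrete, here is a small sketch that pulls a SID out of a line of text (such as whatever `net getlocalsid` prints on the old server) and splits an account SID into its domain part and its per-user part, the RID. The SID values and function names are invented for illustration. The regular expression only looks for the `S-1-5-21-...` pattern, so it makes no assumption about the exact wording of the command's output.

```python
import re

# Matches S-1-5-21-<a>-<b>-<c> (a domain SID), optionally followed by -<RID>.
SID_RE = re.compile(r"S-1-5-21(?:-\d+){3,4}")


def extract_sid(text):
    """Pull the first S-1-5-21-... SID out of a line of text, or return None."""
    match = SID_RE.search(text)
    return match.group(0) if match else None


def split_user_sid(user_sid):
    """Split an account SID into (domain SID, RID)."""
    # Everything before the last dash is the domain; the last field is the RID.
    domain_sid, _, rid = user_sid.rpartition("-")
    return domain_sid, int(rid)
```

The string that `extract_sid` recovers from the old server is what gets handed to `net setlocalsid` on the new one, ideally before any workstation talks to it.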

As a result, the machines that were present at the migration didn't know where their Primary Domain Controller (PDC -- the machine officially in charge of the domain) had gone, and were running on cached credentials, profiles and so on. (This is the same thing that allows you to log into a Windows laptop that belongs to a domain, even when you've taken it home and aren't able to reach your PDC any more.) Printing and shared resources from the Samba server continued to run because of open permissions or credentials (ie, user name and password) that don't depend on SIDs.

This also explained why I could log into the machines without problems: because, as sysadmin, I'd logged into all of them before to do maintenance. My credentials were cached, so the machines were able to authenticate me w/o consulting with their (now missing) PDC. And of course, everyone was able to log into their own workstations for the same reason.

So: machine rejoins the domain and people can log in, because now the machine can find its PDC and verify their passwords. But profiles aren't showing up because the profile's NTUSER.DAT -- the user's hive, loaded into the registry at HKEY_CURRENT_USER when they log in -- belonged to/was marked with/was owned by the account's old SID, and Windows refused to load it and lots of stuff broke or was missing.

After some more searching, I finally figured out the way around this.

First, you need to use the profiles(1) tool in Samba to change the SID on NTUSER.DAT, which'll be wherever Samba keeps profiles. You should check their SID in Samba by using pdbedit(8), though odds are the user ID/group ID part will have remained the same.
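A sketch of that first step, under assumptions: `rebase_sid` computes what an account's SID becomes when only the domain part changes, and `profiles_command` builds an argument list for the profiles tool. The `-c` (SID to change) and `-n` (new SID) flags are taken from the Samba 3.0 man page as I understand it. Confirm them, and whether the tool wants the full account SID or only the domain part, against the version you have installed. The function names are hypothetical.

```python
def rebase_sid(user_sid, old_domain_sid, new_domain_sid):
    """Swap the domain part of an account SID, keeping the RID unchanged."""
    prefix = old_domain_sid + "-"
    if not user_sid.startswith(prefix):
        raise ValueError(f"{user_sid} is not in domain {old_domain_sid}")
    rid = user_sid[len(prefix):]
    if not rid.isdigit():
        raise ValueError(f"unexpected RID in {user_sid}")
    return f"{new_domain_sid}-{rid}"


def profiles_command(ntuser_path, old_sid, new_sid):
    """Build an argv for Samba's profiles tool (-c/-n assumed; check the man page)."""
    return ["profiles", "-c", old_sid, "-n", new_sid, ntuser_path]
```

Building the command as a list rather than running it straight away makes it easy to print and eyeball first. Work on a copy of NTUSER.DAT either way.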

Second, you need to take care of the profile. There are a few ways of doing this. The easiest way is to copy the modified NTUSER.DAT to their profile directory, then log into the machine as Administrator and join the new domain, then get the user to log in. Their profile will be copied over, just as if they'd logged into a machine for the first time. However, this can cause problems with certain programs that haven't been informed about the change.

To illustrate: if the domain is named EXAMPLE, and the user account is jdoe, then their profile will usually be at C:\Documents and Settings\jdoe (let's just call that D&S\jdoe for short). However, D&S\jdoe will belong, after joining the new domain, to an old account that's no longer around, which means that Windows will put their profile somewhere else -- probably something like D&S\jdoe.EXAMPLE. Odds are, though, that the old path will still be in the registry or other files, which means a lot of cycles of "Why-did-that-break-let-me-fix-it". Another option is simply to move D&S\jdoe out of the way, so that paths can remain the same. Finally, you can also change ownership recursively to the new account once you've joined the domain; this will take a while, but it's probably quicker than copying the profile over wholesale if they've got a lot of files. If you do this, it's best to remove the machine's copy of their NTUSER.DAT file; it'll just be copied over from the server.

This took a lot of work, of course, and usually there were things like Outlook.pst to screw things up further. But after much work, I finally got everyone moved over to the new domain, and things were good again.

Lessons learned:
1. Take the old SID with you.
2. Learn how something works, even if it stinks.
3. Testing the usual is good and necessary. So is testing things that wouldn't ordinarily happen.
4. You can never know too many people on the other side of the fence.