Had a server up and die on me yesterday at work. What's more, it was the Very Important Server that does almost, but not quite, everything: Samba (only one, natch), NIS master, SMTP/POP/IMAP, CVS/SVN, printing, and since the installation of the disk array, serving quite a few home directories, too.

I was answering a user's question -- "Oh, this should be on the wiki..." -- and noticed that the web server wasn't up. Another user poked up his head to ask if CVS had disappeared for a reason. Aw, crap.

There were no lights on -- no power, disk or network activity, so I knew it wasn't good. The fans in the front and in the power supply weren't working, so it really wasn't good. Other things plugged into the same power bar were fine, so I tried power-cycling: no response. I unracked it, popped off the lid and watched the fans start briefly then die as I toggled the power switch again. Final verdict: not good.

I took it to a better place to crack it open, and grabbed some spare parts: power supply, memory, graphics card. By the time I got everything back there, maybe five minutes had passed since I'd unracked it. And of course it turned back on.

I checked the CPU temperature in BIOS: 30C. A quick check of the heatsinks and drives showed they were quite fine, too. I mean, yeah, it had been five minutes, but I'd think there'd be some residual heat I could feel. I was stumped, but decided to swap the power supply anyhow. (If anyone has any other ideas, please let me know.)

So naturally, now I'm thinking about what to do about this server to keep this sort of thing from happening again. Here's a short list of the stuff it does:
- NIS master server
- SMTP/POP/IMAP
- CVS/SVN
- Samba PDC
- NFS for internal drives and the drive array
In order:

NIS: Throw more slaves at it (though we've got two already, so I suspect that we're fine).

SMTP/POP/IMAP: The poor cousins, at least for now. I'm assuming that an outage of SMTP/POP/IMAP that can be fixed in an hour is fine, and a longer outage indicates bigger problems.

CVS/SVN: To some extent, just subsets of NFS. At any rate, I'm treating this like mail: a brief outage can be lived with, and a longer outage means I have bigger problems.

Samba: A BDC is obviously in order and shouldn't be too difficult (said the guy who's never worked with LDAP before), at least as far as authentication goes. However, fileserving is made stupidly more difficult by the way we're serving home directories to Windows clients: all the home directories are listed as \\VeryImportantServer\foo. The better way to do this would be to run Samba on the other file servers as well (\\SomeSmallerServer\foo). Can't believe this only just occurred to me.

NFS: The biggie. Obviously we should be breaking out home directories to some other server, but that just pushes the question over a machine or two: instead of worrying about the Very Important Server that Does Almost Everything, we're worrying about The One With The Files.

Since the disk array is connected via SCSI to two machines (of which the VIS is only one), it would be possible, if the VIS was raptured again, to simply fsck the arrays and then export them from the second machine. This takes time, though: close to half an hour to fsck a 1TB drive. (I've never found the settings for newfs that are supposed to make fsck times approach that of a journaled FS; if anyone can fill me in, please let me know.) And there is some provision in amd for failover, but (as I understand it) not much.

Another option is using HA + DRBD, which looks quite promising. This means moving to Linux, though; I'm not opposed to that, but since I don't have a second drive array around I have no way of testing this, let alone gradually phasing this in. Hm.
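To put rough numbers on the manual NFS failover (fsck the arrays on the second machine, mount them, then export them), here is a minimal Python sketch. It only builds the list of commands and estimates the fsck wait; it runs nothing. Everything specific in it is an assumption for illustration: the device names and mount points are made up, `fsck -y` is just one plausible way to run the check unattended, and the time estimate assumes my "half an hour per 1TB" figure scales linearly with size, which I haven't verified. The export step is left out because it depends on how the second machine's exports are set up.

```python
# Assumption: ~30 minutes of fsck per 1 TB (from the one data point above),
# scaled linearly with volume size.
FSCK_MINUTES_PER_TB = 30


def estimate_fsck_minutes(size_tb, minutes_per_tb=FSCK_MINUTES_PER_TB):
    """Rough fsck time for one volume, assuming time grows linearly with size."""
    return size_tb * minutes_per_tb


def total_fsck_minutes(volumes):
    """Total wait if the volumes are checked one after another."""
    return sum(estimate_fsck_minutes(size_tb) for _dev, _mnt, size_tb in volumes)


def failover_plan(volumes):
    """Build (but do not run) the commands to check and mount each volume.

    `volumes` is a list of (device, mount_point, size_tb) tuples.
    Exporting the mounted filesystems is deliberately not included.
    """
    commands = []
    for device, mount_point, _size_tb in volumes:
        commands.append(["fsck", "-y", device])        # check before mounting
        commands.append(["mount", device, mount_point])
    return commands


if __name__ == "__main__":
    # Hypothetical devices and mount points, for illustration only.
    volumes = [
        ("/dev/da1s1d", "/export/home1", 1.0),
        ("/dev/da2s1d", "/export/home2", 1.0),
    ]
    for command in failover_plan(volumes):
        print(" ".join(command))
    print("estimated fsck wait (minutes):", total_fsck_minutes(volumes))
```

Printing the plan instead of executing it is deliberate: in the middle of an outage I'd rather read the commands over before running anything against the only copy of everyone's home directories.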
Any ideas, let me know.