Server failed to flush BER data back to client
I mentioned that I've been having problems with Bacula recently. These have been aggravated by the fact that the trigger seems to be a job that takes 53 hours to finish.
Well, I think I've got a handle on one part of the problem. See, when Bacula is doing this big job, other jobs stack up behind it -- despite having two tape drives, and two separate pools of tapes, and concurrent jobs set up, the daily jobs don't finish. The director says this:
9279 Full BackupCatalog.2010-02-20_21.10.00_10 is waiting for higher priority jobs to finish
9496 Full BackupCatalog.2010-02-23_21.10.00_13 is waiting execution
9498 Full bigass_server-d_drive.2010-02-24_03.05.01_15 is running
9520 Increme little_server-var.2010-02-24_21.05.00_38 is waiting on Storage tape
9521 Increme little_server-opt.2010-02-24_21.05.00_39 is waiting on max Storage jobs
but storage says this:
Running Jobs: Writing: Full Backup job bigass_server-d_drive JobId=9498 Volume="000031"
pool="Monthly" device="Drive-0" (/dev/nst1) spooling=1 despooling=0 despool_wait=0 Files=708,555 Bytes=1,052,080,331,191 Bytes/sec=11,195,559 FDReadSeqNo=22,294,829 in_msg=20170567 out_msg=5 fd=16
Writing: Incremental Backup job little_server-var JobId=9508 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0) spooling=0 despooling=0 despool_wait=1 Files=156 Bytes=3,403,527,093 Bytes/sec=72,415 FDReadSeqNo=53,041 in_msg=52667 out_msg=9 fd=9
Writing: Incremental Backup job little_server-etc JobId=9519 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0) spooling=0 despooling=0 despool_wait=0 Files=9 Bytes=183,606 Bytes/sec=3 FDReadSeqNo=72 in_msg=50 out_msg=9 fd=10
Writing: Incremental Backup job other_little_server-etc JobId=9522 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0) spooling=0 despooling=0 despool_wait=1 Files=5 Bytes=182,029 Bytes/sec=3 FDReadSeqNo=45 in_msg=32 out_msg=9 fd=19
Writing: Incremental Backup job other_little_server-var JobId=9525 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0) spooling=0 despooling=0 despool_wait=0 Files=0 Bytes=0 Bytes/sec=0 FDSocket closed
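The storage status above is easier to compare with the director's view once it's boiled down to "which jobs are sitting in despool_wait". Here's a minimal sketch of that, assuming the status text has been captured to a string; the regexes are guesses based only on the lines shown here, not on any documented Bacula output format, and the function names are made up for illustration.

```python
import re

# Header line of each running job, as it appears in the status output above.
JOB_RE = re.compile(r'Writing: (\w+) Backup job (\S+) JobId=(\d+) Volume="(\d+)"')
# Spooling counters on the detail line that follows each header.
FLAG_RE = re.compile(r'\b(spooling|despooling|despool_wait)=(\d+)')

def parse_sd_status(text):
    """Return one dict per running job found in the storage status text."""
    jobs = []
    for line in text.splitlines():
        match = JOB_RE.search(line)
        if match:
            jobs.append({"level": match.group(1), "name": match.group(2),
                         "jobid": int(match.group(3)), "volume": match.group(4)})
            continue
        if jobs:
            # Detail line: attach its counters to the most recent job.
            for key, value in FLAG_RE.findall(line):
                jobs[-1][key] = int(value)
    return jobs

def waiting_to_despool(jobs):
    """JobIds of jobs whose despool_wait counter is set."""
    return [job["jobid"] for job in jobs if job.get("despool_wait") == 1]

# Lines trimmed from the status output above.
SAMPLE = '''Writing: Full Backup job bigass_server-d_drive JobId=9498 Volume="000031"
pool="Monthly" device="Drive-0" (/dev/nst1) spooling=1 despooling=0 despool_wait=0
Writing: Incremental Backup job little_server-var JobId=9508 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0) spooling=0 despooling=0 despool_wait=1
Writing: Incremental Backup job little_server-etc JobId=9519 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0) spooling=0 despooling=0 despool_wait=0
Writing: Incremental Backup job other_little_server-etc JobId=9522 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0) spooling=0 despooling=0 despool_wait=1
'''

JOBS = parse_sd_status(SAMPLE)
WAITING = waiting_to_despool(JOBS)
print(WAITING)
```

On this sample it picks out two of the daily jobs, both on the Daily pool's drive rather than the one the big job is using.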
Out of desperation I tried running "unmount" for the drive holding the daily tape, thinking that might reset things somehow...but the console just sat there, and never returned a prompt or an error message. Meanwhile, storage was logging this:
cbs-01-sd: dircmd.c:218-0 <dird: unmount SL-500 drive=1
cbs-01-sd: dircmd.c:232-0 Do command: unmount
cbs-01-sd: dircmd.c:596-0 Try changer device Drive-0
cbs-01-sd: dircmd.c:617-0 Device SL-500 drive wrong: want=1 got=0 skipping
cbs-01-sd: dircmd.c:596-0 Try changer device Drive-1
cbs-01-sd: dircmd.c:612-0 Found changer device Drive-1
cbs-01-sd: dircmd.c:625-0 Found device Drive-1
cbs-01-sd: block.c:133-0 Returning new block=39cee10
cbs-01-sd: acquire.c:647-0 JobId=0 enter attach_dcr_to_dev
...and then just hung there. "Aha, race condition!" I thought, and sure enough a bit of searching found this commit in November: "Fix SD DCR race condition that causes seg faults". No, I don't have a segfault, but the commit touches the last routine I see logged (along with a buncha others).
This commit is in the 5.0.1 release; I wasn't planning to upgrade to this just yet, but I think I may have to. But I'm going on vacation week after next, and I'm reluctant to do this right before I'm away for a week. What to do, what to do...
A nice thing about working at a university is that you get all this time off at Xmas, which is really nice; however, it's also the best possible time to do all the stuff you've been saving up. Last year my time was split between this job and my last; now, the time's all mine, baby.
Today will be my last of three days in a row where the machines have been all mine to play with^W^Wupgrade. I've been able to twiddle the firewall's NIC settings, upgrade CentOS using Cfengine, and set up a new LDAP server using Cobbler and CentOS Directory Server. I've tested our UPS' ATS, but discovered that NUT is different from APCUPSD in one important way: it doesn't easily allow you to say "shut down now, even though there's 95% battery left". I may have to leave testing of that for another day.
It hasn't all gone smoothly, but I've accomplished almost all the important things. This is a nice surprise; I'm always hesitant when I estimate how long something will take, because I feel like I have no way of knowing in advance (interruptions, unexpected obstacles...you know the drill). In this case, the time estimates for individual tasks were, in fact, 'way paranoid, but that gave me the buffer that I needed.
One example: after upgrading CentOS, two of our three servers attached to StorageTek 2500 disk arrays reported problems with the disks. Upon closer inspection, they were reporting problems with half of the LUNs that the array was presenting to them -- and they were reporting them in different ways. It had been a year or longer since I'd set them up, and my documentation was pretty damn slim, so it took me a while to figure it out. (Had to sleep on it, even.)
The servers have dual paths to the arrays. In Linux, the multipath drivers don't work so well with these, so we used the Sun drivers instead. But:
cfservd had refused its connection because I had the MaxConnections parameter too low.
I got it fixed in the end, and I expanded the documentation considerably. (49,000 words and counting in the wiki. Damn right I'm bragging!)
Putting off 'til next time, tempted though I am: reinstalling CentOS on the monitoring machine, which due to a mix of EPEL and Dag repos and operator error appears to be stuck in a corner, unable to upgrade without ripping out (say) Cacti. I moved the web server to a backup machine on Tuesday, and I'll be moving it back today; this is not the time to fiddle with the thing that's going to tell me I've moved everything back correctly.
(Incidentally, thanks to Matt for the rubber duck, who successfully talked me down off the roof when I was mulling this over. Man, that duck is so wise...)
Last day today. (Like, ever!) If I remember correctly I'm going to test the water leak detector...and I forget the rest; it's all in my daytimer and I'm too lazy to get up and look right now. Wish me luck.
And best of 2010 to all of you!
Just upgraded my laptop to Debian Lenny with only minor hiccups (my own fault). Not only have I got the latest version of Iceweasel/Firefox without any GTK version nonsense, but I've got wicd working, including my Broadcom wireless and WPA2! (I never could figure out the settings to get encryption working with the various /etc/network/ files...) I'm happy…
My lack of experience with LDAP in general, and Sun's (iPlanet|Directory Server( Enterprise Edition)?) in particular, has proven to be a bit of a handicap of late.
Case in point: when I upgraded $big_server to Solaris 10 at the end of August, I also upgraded its LDAP server from iPlanet 5.1 to DSEE 6 (same software, different name). At the time I had two problems: I was unable to get replication to $big_server working over SSL (we have a multi-master configuration; not supposed to work with 5.1, but it does/did for us), and replication from $big_server to other machines did not work. There were a lot of things going wrong at that point, so I set up replication in the clear from $little_server, another LDAP server on the LAN, and left it 'til I had more time. It wasn't ideal, but it would do.
The last two Saturdays I've been trying to figure out why replication wasn't working. I concentrated on getting replication to it working over SSL. This was tough, because the logs didn't tell me much:
Server failed to flush BER data back to client
I swear, this turned up more Googlejuice today than it did a few weeks ago, because this time it turned up the ever-excellent Brandon Hutchinson again. This time he had a truly great set of instructions on installing DSEE6. That led me to this blog entry, very helpful, giving information about the different sorts of databases you can stick your SSL certs into. (Must learn more about SSL/OpenSSL…)
However, in the end it turned out to be a simple and moderately embarrassing mistake: it's not enough, with DS6, to say add-cert and be done with it; you actually have to specify the certificate to use. As Brandon points out, you have to edit dse.ldif in order to do so (though I had to stop the server, edit the file and start it up again, rather than just edit and restart, in order to get it to work).
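For anyone checking their own setup, here's a rough sketch of reading back which certificate the server has been told to use. It assumes the setting lives in the nsSSLPersonalitySSL attribute of the cn=RSA,cn=encryption,cn=config entry, which is how Netscape-derived directory servers usually store it; treat both names as assumptions and check them against Brandon's instructions. The function name and the "example-cert" alias are made up, and the parser ignores LDIF line folding to keep things short.

```python
def ssl_cert_alias(ldif_text):
    """Return the nsSSLPersonalitySSL value from the RSA encryption entry, or None."""
    in_rsa_entry = False
    for raw in ldif_text.splitlines():
        line = raw.strip()
        if not line:
            in_rsa_entry = False          # a blank line ends an LDIF entry
            continue
        key, _, value = line.partition(":")
        key = key.strip().lower()
        value = value.strip()
        if key == "dn":
            # Assumed DN of the entry that holds the SSL settings.
            normalized = value.lower().replace(" ", "")
            in_rsa_entry = normalized == "cn=rsa,cn=encryption,cn=config"
        elif in_rsa_entry and key == "nssslpersonalityssl":
            return value
    return None

# Hypothetical fragment; a real dse.ldif has many more entries and attributes.
SAMPLE_LDIF = """dn: cn=encryption,cn=config
objectClass: top

dn: cn=RSA,cn=encryption,cn=config
nsSSLToken: internal (software)
nsSSLPersonalitySSL: example-cert
nsSSLActivation: on
"""

ALIAS = ssl_cert_alias(SAMPLE_LDIF)
MISSING = ssl_cert_alias("dn: cn=config\nobjectClass: top\n")
print(ALIAS, MISSING)
```

If the alias that comes back isn't the one you just added, that would fit the symptom described above: the certificate is in the database, but the server was never told to use it.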
The other thing — replication from $big_server elsewhere — is still not working. I suspect this is my fault; in an attempt to get things working, I decided that the thing to try would be initializing $big_server from $little_server, then the other way around. This did not change things, and now $little_server is unable to push its changes elsewhere. I've since been told this is a mistake on my part; arghh.
Unfortunately, there were other things I screwed up in the original install of DS6 on $big_server -- embarrassing and rather pointless to record for Google right now -- and I strongly suspect that I'm going to have to reinstall or reinitialize $big_server just to get things into a reasonably coherent state. Fortunately, there aren't that many changes that ever happen on it, so there shouldn't be many to lose or redo if it's wiped.
And thus my Saturday.
Some fun Emacs stuff:
I had a meeting with my boss at work last week (before a nice four-day weekend…the split schedule I've got means that sort of thing happens very rarely. But I digress) to set my priorities now that the upgrade has more or less been finished (lingering issues aside; see ahead).
One of the big things is getting Zimbra set up. This will be nice; we do not have a calendar for the office right now, and this is getting to be a pain. My boss is open to the idea of something that's not Outlook/Exchange, and that's good.
The other thing is getting a bunch more Windows machines in. This is a small shop, so "a bunch" means another 15 or 20 -- which'll double the number we have. I'm not entirely happy about that, but because this is a longer-term project I've been given time to do this right. And to me, "right" means "using open-source tools whenever possible to manage Windows". Thus, I'll be getting the time to set up Unattended and wpkg, and possibly even digging up Windflower and seeing if it's worth continuing. I'm actually kind of excited about this.
It's a little strange having a manager take this much of a hand in setting priorities; I've worked in a series of small shops and, up 'til now, have been left more or less on my own nearly the whole time. It does feel good to get a bit of direction, though. I mean, I know what needs to be done and I'm doing it, but I've always felt a bit lost trying to decide what's most important for everyone once past the finger-in-the-dike stage.
Now to go try and get Multi-TTY working on this laptop…
Ack: Just realized I never described the lingering problems with Solaris 10. Fairly simple to describe: LDAP lookups take 'way longer than they should (ls -l /home/ can take 5 seconds per line sometimes), and JDS on the SunRays is slower in parts than it should be (click on the logout button, wait 60 seconds, message pops up saying "Are you sure you want to log out?"). I'm hopeful I can track those down without too much effort…
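One way to put numbers on the slow ls -l is to time the uid-to-name lookup for each entry's owner, since that's the part that goes out through the name service (and so, on a box like this one, to LDAP). A minimal sketch follows, assuming a Unix system with Python's pwd module; the function name is made up, and a caching daemon such as nscd would hide repeat lookups, so treat the numbers as a first look only.

```python
import os
import pwd
import tempfile
import time

def time_owner_lookups(directory):
    """Time the uid -> name lookup for each entry's owner, slowest first."""
    timings = []
    for entry in os.scandir(directory):
        uid = entry.stat(follow_symlinks=False).st_uid
        start = time.perf_counter()
        try:
            name = pwd.getpwuid(uid).pw_name
        except KeyError:
            name = str(uid)               # uid has no passwd entry at all
        elapsed = time.perf_counter() - start
        timings.append((elapsed, entry.name, name))
    return sorted(timings, reverse=True)

# Small self-contained demo on a scratch directory instead of /home.
with tempfile.TemporaryDirectory() as scratch:
    open(os.path.join(scratch, "alice"), "w").close()
    open(os.path.join(scratch, "bob"), "w").close()
    DEMO_TIMINGS = time_owner_lookups(scratch)

for seconds, entry_name, owner in DEMO_TIMINGS:
    print(f"{seconds:.6f}s  {entry_name}  {owner}")
```

Pointing it at /home instead of the scratch directory would show whether the delay is per lookup or only on the first one.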
I'm back from vacation, and a relaxing time it was. We got to enjoy the hospitality of a lot of family in Ontario, and sysadmin duties were pretty minimal. Hell, I didn't even check Slashdot the whole time.
I upgraded my dad's laptop to the newest version of Ubuntu, and got him a new wireless card that'll work in Linux (though with a restricted driver) as an early Father's Day gift. (If I had been able to buy him an old Orinoco somewhere, I'd've done that instead...as it is, I'll have to cringe under the wrath of my inner RMS. :-)
I also showed him how to FTP a new Wordpress theme to the server, and I have to say I'm impressed with how easy Gnome/Nautilus makes it for him. I'm starting to understand the appeal of a nice GUI, though I'm still sticking to my xterms for now.
As a bit of reciprocation, my dad gave me a 2GB SD card for the new camera we've got -- which was nice, because the old 256MB card was filling up very quickly.
I was happy to get back to work and find that, really, there wasn't that much to clean up. Coworkers had filled in nicely for me, and the worst that had cropped up was an SQL bug in a new credit card payment form; it was failing to update the second of two places that indicate someone has paid. (Yes, redundant, but to be fixed next year.) I'm a bit irritated by this, as the bug was an SQL statement, passed to PEAR's Db module, that said:
...set updated_by="foo form" form" ...
Yes, this is my typo, but why did PHP not report this error? What happened, and why wasn't it being caught?
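One plausible answer, as far as I recall PEAR DB's behaviour: unless an error handler is configured, query() hands back an error object rather than stopping the script, so a caller that never inspects the result never hears about it. The sketch below reproduces that pattern in Python with sqlite3 purely as an analogy; the payments table and the run_unchecked/run_checked helpers are made up, and this is not the form's actual code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, updated_by TEXT)")
conn.execute("INSERT INTO payments (id, updated_by) VALUES (1, 'nobody')")

# The same shape of typo as above: a stray word and an unbalanced quote.
BAD_SQL = 'UPDATE payments SET updated_by="foo form" form" WHERE id = 1'

def run_unchecked(sql):
    """Mimic an API that returns an error value instead of raising."""
    try:
        return conn.execute(sql)
    except sqlite3.Error as exc:
        return exc

def run_checked(sql):
    """Same call, but refuse to continue when the result is an error."""
    result = run_unchecked(sql)
    if isinstance(result, Exception):
        raise RuntimeError(f"query failed: {result}") from result
    return result

# The caller ignores the return value, so the failure goes unnoticed...
IGNORED = run_unchecked(BAD_SQL)
ROW = conn.execute("SELECT updated_by FROM payments WHERE id = 1").fetchone()

# ...whereas the checked version makes the same failure loud.
try:
    run_checked(BAD_SQL)
    CHECKED_FAILED = False
except RuntimeError:
    CHECKED_FAILED = True

print(ROW, CHECKED_FAILED)
```

The row is left untouched in both cases; the only difference is whether anyone finds out.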
Anyhow. Now that I'm back, relaxed, forced by funding to put off Big Website Rewrites 'til next year and mostly done with this year's web work, I'm finally able to contemplate upgrading our Big Server(tm) to Solaris 10. That will be a bear of a job, but it'll be nice to get it done.
On the home front, I'll be switching to Uniserve's ADSL shortly. They do allow servers, and offer static IPs for a small charge; that'll be nice. We used them at my last job, and the service was fine as long as you didn't have to contact tech support.
Surprisingly, they also have this clause in the TOS:
65. UNISERVE shall have the right, without notice, to insert advertising data into the Internet browser used by a UNSERVE customer, and transferred to a UNISERVE customer over UNISERVE's network, so long as this does not involve UNISERVE establishing the identity of the customer to whom such data is sent.
In a previous life at an ISP, we started putting in machines run by a company called Adzilla. They were, as far as I could tell, proxy servers that replaced the ads on, say, CNN's website with ones for local businesses. I thought it was scummy, but couldn't persuade the bosses of this. I'm fairly certain this is the same thing, and probably the same company too. I still don't like it, but Uniserve is the best option I've got right now. And at least they admit they're doing this.
(Note: this was actually written back in May.)
Top Tip: Filenames with a tilde in them can confuse Samba.
Case in point: last week a user was having problems loading his profile: W2K kept choking and saying that Local Data\Applications\foo\backup\~AvariciousMonkeys.c was in use. Naturally, lsof on the Samba server turned up nothing, and I couldn't see any obvious problem. On a hunch, I tried renaming the file to AvariciousMonkeys.c~, and hey presto! goodness all
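If more of these turn up, the rename is easy to script. Here's a minimal sketch that moves a leading tilde to the end of the name for every file in one directory; the function name is made up, it only looks at a single directory level, and it skips anything that would overwrite an existing file. Whether the leading tilde is really what trips Samba up is only the hunch described above.

```python
import tempfile
from pathlib import Path

def move_leading_tilde(directory):
    """Rename '~name' to 'name~' for each file directly inside directory."""
    renamed = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.name.startswith("~") and len(path.name) > 1:
            target = path.with_name(path.name[1:] + "~")
            if target.exists():
                continue                  # never overwrite an existing file
            path.rename(target)
            renamed.append((path.name, target.name))
    return renamed

# Demo in a scratch directory rather than a real profile share.
with tempfile.TemporaryDirectory() as scratch:
    (Path(scratch) / "~AvariciousMonkeys.c").write_text("/* backup copy */\n")
    (Path(scratch) / "notes.txt").write_text("left alone\n")
    DEMO_RENAMED = move_leading_tilde(scratch)
    DEMO_LISTING = sorted(p.name for p in Path(scratch).iterdir())

print(DEMO_RENAMED)
print(DEMO_LISTING)
```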
This week I'm trying to get FAI going in seriousness. I've worked on it before, but now I've got three developers who want to switch to Linux. The last thing I want is another series of one-offs, so I'm taking the time to do it right. Now there's a CD version in beta, and so far it's working well. Cf. the usual way of doing it, which is to do PXE booting and grab everything off the network. I'm not opposed to that, but one of the things I wanted out of FAI before was the ability to do CD-based, kickstart-like Debian installs; looks like it's finally going to work.
Looks like we're having a problem with a Maxtor PCI IDE controller and the Intel mobo in our backup server. It's been mysteriously crashing in the middle of the night w/no log messages. Some checking in the BIOS turned up another problem: going to the hardware monitoring page to look at the CPU temperature made the damn thing freeze. WTF? Sure seems like the symptom we were seeing, and backups running at night make big use of the Vinum array that uses drives attached to the IDE adapter...long story short, taking out the card stopped the BIOS freezing. It remains to be seen if it'll work for the random midnight freezes, but it's good to have something to try. I'm hopeful that FreeBSD will be able to handle SATA drives attached to this thing...we'll have to see.
Which brings me to the next bit: fleshing out plans for server upgrades. As I mentioned, last week we had a power supply fail on our Very Important Server, and I want to try and keep that from happening again. Of course, adding umpty thousand dollars worth of hardware to your budget four months before the end of fiscal doesn't really work too well, so as much as possible I need to do this w/o new hardware. Ha! But I'll give it a try.
First off is setting up OpenLDAP and importing Samba's information into it. That'll be neat, since I've never worked w/LDAP before. Second is to set up some BDCs using OpenLDAP to query the master. (Or do they just suck over the whole database? Hm. Either way.) Third is to set up some Linux machines. Why? Two reasons:
LinuxHA and DRBD seem fantastic, and there just doesn't seem to be anything comparable on the FreeBSD side. As for the hardware...well, my first impression of server hardware from IBM, HP and the like (no, don't talk to me about Dell) is that I'm going to need a newer version of FreeBSD than we currently use in order to run SATA drives. (I know SCSI is the way to go, but I was quoted two thousand dollars for two IBM 73GB 15k drives! I know: 15k, IBM, etc, but even halving that means two -- two! -- 73GB drives for a thousand bucks, a/o/t two 200GB drives for, what, four hundred. Heh.)
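To put the drive prices above side by side, here's the per-gigabyte arithmetic as a quick sketch. The figures are the ones quoted in the paragraph (the four hundred for the 200GB pair is the rough guess made there, not a quote), and the helper name is made up.

```python
def dollars_per_gb(total_price, drive_count, gb_per_drive):
    """Price per gigabyte of raw capacity for a set of identical drives."""
    return total_price / (drive_count * gb_per_drive)

SCSI_QUOTED = dollars_per_gb(2000, 2, 73)    # the quote as given
SCSI_HALVED = dollars_per_gb(1000, 2, 73)    # "even halving that"
SATA_GUESS = dollars_per_gb(400, 2, 200)     # rough guess for two 200GB drives

for label, value in [("SCSI, quoted", SCSI_QUOTED),
                     ("SCSI, halved", SCSI_HALVED),
                     ("200GB pair", SATA_GUESS)]:
    print(f"{label}: ${value:.2f}/GB")
```

Even at half the quoted price, the 15k drives come out several times more expensive per gigabyte than the 200GB pair.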
We're using an older version of the 4-series FreeBSD here. I've already set up one server using a newer 4-series release, and it's a pain: too many differences, one more thing to keep in mind when making changes, and so on. I haven't worked with the 5-series yet, and I don't want to start now...not entirely sure that it'd work for us. Plus, we'll probably migrate to Linux anyway, so I don't mind doing it for a server.
Anyhow! Get a Real Server and throw Linux on it. Hook it up to our drive array and start migrating home directories to ReiserFS from UFS/FreeBSD. Not trivial, but doable. Add more Linux servers as budget allows.
Jesus Christ. Every time I mess around with hardware or upgrades, I swear I'll never do it again. Then I forget.
My first computer, bought eight years ago now, was a 486 w/16MB of RAM and some amount of HD space. I installed Slackware on it, got a 33.6 modem, and had email and net access. Then a roommate sold me his old P90. It crashed constantly until I figured out I had set the CPU voltage wrong. It took me a long time to figure that out, and I was nearly ready to hurl the thing out the window.
A few years later I upgraded to my current desktop machine, a 333 Celeron overclocked to 450 MHz. The machine is fine unless I open up the case to add/remove/shift something in it; then it will, for a day, spontaneously reboot. I've checked it for shorts and can't find any. I don't know what I'm missing, but I'm sure it would be obvious to someone else.
And now the latest. My wife bought an iMac from her old work a few years ago, and has had problems w/it since. It just crashes for no good reason. It'll work fine for two weeks, then she can't keep it running for more than an hour. So last week I went out and bought her a fairly skookum machine: Athlon 2600 (I think...details to follow), ECS K7S5A mobo, 60GB HD and 256 MB RAM.
I got it all home and assembled it. The mobo and Red Hat 9 (not my favourite, but great for my wife) called the CPU a 2000 (1.6GHz instead of 2.0), so I looked around and decided a BIOS upgrade would be in order. Did that and promptly lost the back USB -- bad, since her keyboard and mouse are USB. The front ones, hooked up to the pins on the motherboard, still worked. Tried rolling the BIOS back, but nothing: the back, onboard USB just didn't go. Fuck.
So I went out and got some additional USB risers a few days later. I added them; no problem. Then I had to add a connector from the CDROM's audio to the motherboard. I made the mistake of removing one of the USB connectors while the power was still on. Didn't even think; just did it. Now the BIOS freezes at "Checking NVRAM...". Flashed the CMOS half a dozen times, left it off most of the night while we went to see Finding Nemo (not as good as Monsters, Inc., but still well worth it), and no change.
Today I'm going to stop by my new hardware supplier of choice (http://www.ntcw.com/) to pick up a Gigabyte 7VAX. We'll see if I got ripped off on the CPU or what.
Mostly, though, I am not going to fuck with this computer again. I mean it this time.