I mentioned that I've been having problems with Bacula recently. These have been aggravated by the fact that the trigger seems to be a job that takes 53 hours to finish.
Well, I think I've got a handle on one part of the problem. See, when Bacula is doing this big job, other jobs stack up behind it -- despite having two tape drives, and two separate pools of tapes, and concurrent jobs set up, the daily jobs don't finish. The director says this:
9279 Full BackupCatalog.2010-02-20_21.10.00_10 is waiting for higher priority jobs to finish
9496 Full BackupCatalog.2010-02-23_21.10.00_13 is waiting execution
9498 Full bigass_server-d_drive.2010-02-24_03.05.01_15 is running
9520 Increme little_server-var.2010-02-24_21.05.00_38 is waiting on Storage tape
9521 Increme little_server-opt.2010-02-24_21.05.00_39 is waiting on max Storage jobs
but storage says this:
Running Jobs:
Writing: Full Backup job bigass_server-d_drive JobId=9498
Volume="000031"
pool="Monthly" device="Drive-0" (/dev/nst1)
spooling=1 despooling=0 despool_wait=0
Files=708,555 Bytes=1,052,080,331,191 Bytes/sec=11,195,559
FDReadSeqNo=22,294,829 in_msg=20170567 out_msg=5 fd=16
Writing: Incremental Backup job little_server-var JobId=9508 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=1
Files=156 Bytes=3,403,527,093 Bytes/sec=72,415
FDReadSeqNo=53,041 in_msg=52667 out_msg=9 fd=9
Writing: Incremental Backup job little_server-etc JobId=9519 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=0
Files=9 Bytes=183,606 Bytes/sec=3
FDReadSeqNo=72 in_msg=50 out_msg=9 fd=10
Writing: Incremental Backup job other_little_server-etc JobId=9522 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=1
Files=5 Bytes=182,029 Bytes/sec=3
FDReadSeqNo=45 in_msg=32 out_msg=9 fd=19
Writing: Incremental Backup job other_little_server-var JobId=9525 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=0
Files=0 Bytes=0 Bytes/sec=0
FDSocket closed
Out of desperation I tried running "unmount" for the drive holding the daily tape, thinking that might reset things somehow...but the console just sat there, and never returned a prompt or an error message. Meanwhile, storage was logging this:
cbs-01-sd: dircmd.c:218-0 <dird: unmount SL-500 drive=1
cbs-01-sd: dircmd.c:232-0 Do command: unmount
cbs-01-sd: dircmd.c:596-0 Try changer device Drive-0
cbs-01-sd: dircmd.c:617-0 Device SL-500 drive wrong: want=1 got=0 skipping
cbs-01-sd: dircmd.c:596-0 Try changer device Drive-1
cbs-01-sd: dircmd.c:612-0 Found changer device Drive-1
cbs-01-sd: dircmd.c:625-0 Found device Drive-1
cbs-01-sd: block.c:133-0 Returning new block=39cee10
cbs-01-sd: acquire.c:647-0 JobId=0 enter attach_dcr_to_dev
...and then just hung there. "Aha, race condition!" I thought, and sure enough a bit of searching found this commit in November: "Fix SD DCR race condition that causes seg faults". No, I don't have a segfault, but the commit touches the last routine I see logged (along with a buncha others).
This commit is in the 5.0.1 release; I wasn't planning to upgrade just yet, but I think I may have to. But I'm going on vacation the week after next, and I'm reluctant to do this right before I'm away for a week. What to do, what to do...
While trying to figure out why Nagios was suddenly unable to check up on our databases, I realized that the permissions on /dev/null were wrong: 0600 instead of 0666. What the hell? I've had this problem before, and I was in the middle of something, so I set them back and went on with my life. Then it happened again, not half an hour later. I was in the same shell, so I figured it had to have been a command I'd run that had inadvertently done this.
Yep: don't run the MySQL client as root. Yes yes yes, it's bad anyway, I'll go to sysadmin hell, but this is an interesting bug. The environment variable MYSQL_HISTFILE is set to /dev/null for root...and when you exit the client, it sets the permissions for the history file to 0600. So, you know, don't do that then. (Still no fix committed, btw...)
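To illustrate the sequence (a hypothetical root-shell transcript, not copied from this box):
# ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 ... /dev/null
# echo $MYSQL_HISTFILE
/dev/null
# mysql
mysql> select 1;
mysql> quit
# ls -l /dev/null
crw------- 1 root root 1, 3 ... /dev/null
# chmod 666 /dev/null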
Just spent the better part of five hours cleaning up four old, out-of-date Wordpress installations after they got infected with this worm. I host nine sites on my home server for friends and family; I'm cutting that down to three (just family), and maybe looking at mu-wordpress, as of Real Soon Now.
Happy Labour Day, everyone!
Update: I meant to add in here a few things I looked for, because this info was hard to track down.
I found extra admin-level users in the wp_users table; some had their email address set to "www@www.com", some had random made-up or possibly real addresses, and some had the same email address as already-existing users (there's a query sketch for spotting these after this list).
On one blog (possibly infected much earlier) I found 42,000 (!!) approved, spammy comments.
I searched for infected posts using a query from here:
SELECT * FROM wp_posts WHERE post_content LIKE '%iframe%'
UNION
SELECT * FROM wp_posts WHERE post_content LIKE '%noscript%'
UNION
SELECT * FROM wp_posts WHERE post_content LIKE '%display:%'
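As promised above, here's the sort of query that turns up those extra admin-level users, plus one for spotting posts with absurd numbers of approved comments. The database name, user, and wp_ table prefix are placeholders for whatever your install uses:
# list every user with administrator capabilities
mysql -u wpadmin -p wordpress -e "
    SELECT u.ID, u.user_login, u.user_email
      FROM wp_users u
      JOIN wp_usermeta m ON m.user_id = u.ID
     WHERE m.meta_key = 'wp_capabilities'
       AND m.meta_value LIKE '%administrator%';"
# posts with the most approved comments -- 42,000 on one post stands out
mysql -u wpadmin -p wordpress -e "
    SELECT comment_post_ID, COUNT(*) AS approved
      FROM wp_comments
     WHERE comment_approved = '1'
     GROUP BY comment_post_ID
     ORDER BY approved DESC
     LIMIT 10;"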
I just spent the weekend (well, like an hour a day...kids, life, you know how it is) trying to track down why a bunch of new CentOS 5.2 installs at $job_2 couldn't pipe:
$ ls foo
foo
$ ls | grep foo
$ echo $?
141
(Actually, I didn't think to look at the exit code 'til someone else pointed it out… 141 turns out to be SIGPIPE.) In the end, it would have been quicker if I'd simply searched for the first thing I saw when logging in:
-bash: [: =: unary operator expected
-bash: [: -le: unary operator expected
This was particularly aggravating to track down because not every machine was doing this, and no matter what I thought to look at (/etc contents, /tmp permissions (those have a habit of going wonky on me for some reason), SELinux) I couldn't figure out what was different.
Turned out to be an upstream bug in nss_ldap. (The Bugzilla entry makes for some interesting reading, to be sure…) And I didn't see it on each machine because I hadn't upgraded after installation on all machines. (They're not yet in production, and I'm working on getting my kickstart straight.)
Man, it was gratifying to upgrade nss_ldap and see the problem go away…
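(For the record, decoding that status is easy: anything above 128 means the command died from a signal, 141 - 128 = 13, and bash will tell you which signal that is:)
$ kill -l 13
PIPE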
Heads up for those of you using Blastwave and CUPS: after upgrading to the latest stable version, printing stopped working for me (and a few users :-). I eventually tracked it down to the movement of two files: suddenly
/opt/csw/lib/cups/filter/pstopxl
/opt/csw/lib/cups/filter/pstoraster
were moved to
/opt/csw/lib/cups/pstopxl
/opt/csw/lib/cups/pstoraster
resulting in many error messages like "Unsupported format text/plain!" and "Hint: is ESP ghostscript installed?". Moving them both back into place and restarting CUPS fixed things just fine.
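For anyone hitting the same thing, the workaround amounts to this (paths as above; how you restart cupsd depends on whether your Blastwave CUPS runs from an init script or SMF):
# put the two filters back where CUPS looks for them
mv /opt/csw/lib/cups/pstopxl    /opt/csw/lib/cups/filter/pstopxl
mv /opt/csw/lib/cups/pstoraster /opt/csw/lib/cups/filter/pstoraster
# ...then restart cupsd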
According to Bacula (yay Bacula!) both files were in the right directory as of last night, and Blastwave's file list for Ghostscript shows the new location for these two files. A bug has been filed.
One of the problems I've been working on since the upgrade to Solaris 10 has been the slowness of the SunRay terminals. There are a few different problems here, but one of 'em is that after typing in your password and hitting Enter, it takes about a minute to get the JDS "Loading your desktop…" icons up.
I scratched my head over this one for a long time 'til I saw this:
ptree 10533
906   /usr/dt/bin/dtlogin -daemon -udpPort 0
  10445 /usr/dt/bin/dtlogin -daemon -udpPort 0
    10533 /bin/ksh /usr/dt/config/Xstartup
      10551 /bin/ksh -p /opt/SUNWut/lib/utdmsession -c 4
        10585 /bin/ksh -p /etc/opt/SUNWut/basedir/lib/utscrevent -c 4 -z utdmsession
          10587 ksh -c echo 'CREATE_SESSION 4 # utdmsession' >/dev/tcp/127.0.0.1/7013
which just sat there and sat there for, oh, about a minute. So I ran netcat on port 7013, logged out and logged in again, and boom! quick as anything.
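That stopgap was nothing fancier than giving the port a listener (netcat option syntax varies between versions; traditional netcat shown):
# let the echo to /dev/tcp/127.0.0.1/7013 connect immediately instead of hanging
nc -l -p 7013 >/dev/null &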
/etc/services says:
utscreventd 7013/tcp # SUNWut SRCOM event deamon
which we're not running; something to do with smart cards. So why does it hang so long? Because for some reason, the host isn't sending back an RST packet (I presume; can't listen to find out) to kill the connection, like it does on $other_server.
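An easy way to compare the two hosts is to time the same /dev/tcp trick that Xstartup uses: an immediate failure means an RST came back, a long pause means the packets are just disappearing.
$ time ksh -c 'echo test >/dev/tcp/127.0.0.1/7013'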
So now I'm trying to figure out why that is. It's not the firewall; they're identical. I've tried looking at ndd /dev/tcp \? but I don't see anything obvious there. My google-fu doesn't appear to be up to the task either. I may have to cheat and go visit a fellow sysadmin to find out.
I tried ripping a CD recently on my desktop machine running Debian testing. grip seemed to hang, and I kept getting this error message in the logs:
hdc: write_intr: wrong transfer direction!
Google didn't turn up much that helped, except to suggest a simpler test case (cdparanoia -d /dev/hdc 1-1). Data CDs seemed to work just fine.
Finally tried upgrading the kernel package, from kernel-image-2.6.8-2-386 (the default kernel after installation) to linux-image-2.6.18-4-686, and that did the trick nicely.
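For anyone in the same boat, the fix boiled down to something like this (package name as above; the exact revision in testing will drift over time):
# install the newer kernel image, reboot into it, then retest the rip
apt-get update
apt-get install linux-image-2.6.18-4-686
cdparanoia -d /dev/hdc 1-1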
Hard to believe, but OpenSSH (portable version) has finally fixed the hang on exit bug. Man, I thought that'd never go away...
Disgust is the feeling you get when you realize that the sudden and mysterious failures you've been tracking down for the last half hour happened because you typed $debug == 1;.
At work, one of our visitors had a problem: his browser kept crashing whenever he visited his university's webpage. That's what caused this problem, with the ginormous core files; Firefox would start grabbing memory at a rate of about 10-20MB/s, and then Solaris would kill it off once it got to around 2.5GB. Truss showed that it kept opening a copy of the Times Roman font over and over again, but I couldn't figure out why.
I tried duplicating it under my own account, but couldn't, so I moved my .mozilla out of the way and set up a new profile, which did have the problem. At first I figured it must be Javascript; I hate Javascript and almost always turn it off (but thanks to the NoScript plugin it's easy to toggle it for individual sites), so that must be it, right? Wrong. Okay, so what about prefs.js? Nope.
After copying over nearly all the files in my original profile, FF was still grabbing memory…until finally, out of desperation, I copied over the Flash plugin. And by the beard of Zeus, it worked!
I did some digging around (there are a truly depressing number of Mozilla bugs that mention Flash) and found mention (sorry, lost link) that some methods of detecting whether a browser has the Flash plugin can end up using all the memory available to the browser. That sounds similar enough to what I saw that I think I'm going to call that the problem-designate for now.