I mentioned that I've been having problems with Bacula recently. These have been aggravated by the fact that the trigger seems to be a job that takes 53 hours to finish.
Well, I think I've got a handle on one part of the problem. See, when Bacula is doing this big job, other jobs stack up behind it -- despite having two tape drives, and two separate pools of tapes, and concurrent jobs set up, the daily jobs don't finish. The director says this:
9279 Full BackupCatalog.2010-02-20_21.10.00_10 is waiting for higher priority jobs to finish
9496 Full BackupCatalog.2010-02-23_21.10.00_13 is waiting execution
9498 Full bigass_server-d_drive.2010-02-24_03.05.01_15 is running
9520 Increme little_server-var.2010-02-24_21.05.00_38 is waiting on Storage tape
9521 Increme little_server-opt.2010-02-24_21.05.00_39 is waiting on max Storage jobs
but storage says this:
Running Jobs:
Writing: Full Backup job bigass_server-d_drive JobId=9498
Volume="000031"
pool="Monthly" device="Drive-0" (/dev/nst1)
spooling=1 despooling=0 despool_wait=0
Files=708,555 Bytes=1,052,080,331,191 Bytes/sec=11,195,559
FDReadSeqNo=22,294,829 in_msg=20170567 out_msg=5 fd=16
Writing: Incremental Backup job little_server-var JobId=9508 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=1
Files=156 Bytes=3,403,527,093 Bytes/sec=72,415
FDReadSeqNo=53,041 in_msg=52667 out_msg=9 fd=9
Writing: Incremental Backup job little_server-etc JobId=9519 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=0
Files=9 Bytes=183,606 Bytes/sec=3
FDReadSeqNo=72 in_msg=50 out_msg=9 fd=10
Writing: Incremental Backup job other_little_server-etc JobId=9522 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=1
Files=5 Bytes=182,029 Bytes/sec=3
FDReadSeqNo=45 in_msg=32 out_msg=9 fd=19
Writing: Incremental Backup job other_little_server-var JobId=9525 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=0
Files=0 Bytes=0 Bytes/sec=0
FDSocket closed
Out of desperation I tried running "unmount" for the drive holding the daily tape, thinking that might reset things somehow...but the console just sat there, and never returned a prompt or an error message. Meanwhile, storage was logging this:
cbs-01-sd: dircmd.c:218-0 <dird: unmount SL-500 drive=1
cbs-01-sd: dircmd.c:232-0 Do command: unmount
cbs-01-sd: dircmd.c:596-0 Try changer device Drive-0
cbs-01-sd: dircmd.c:617-0 Device SL-500 drive wrong: want=1 got=0 skipping
cbs-01-sd: dircmd.c:596-0 Try changer device Drive-1
cbs-01-sd: dircmd.c:612-0 Found changer device Drive-1
cbs-01-sd: dircmd.c:625-0 Found device Drive-1
cbs-01-sd: block.c:133-0 Returning new block=39cee10
cbs-01-sd: acquire.c:647-0 JobId=0 enter attach_dcr_to_dev
...and then just hung there. "Aha, race condition!" I thought, and sure enough a bit of searching found this commit in November: "Fix SD DCR race condition that causes seg faults". No, I don't have a segfault, but the commit touches the last routine I see logged (along with a buncha others).
This commit is in the 5.0.1 release; I wasn't planning to upgrade just yet, but I think I may have to. But I'm going on vacation the week after next, and I'm reluctant to do this right before I'm away for a week. What to do, what to do...
While trying to figure out why Nagios was suddenly unable to check up on our databases, I realized that the permissions on /dev/null were wrong: 0600 instead of 0666. What the hell? I've had this problem before, and I was in the middle of something, so I set them back and went on with my life. Then it happened again, not half an hour later. I was in the same shell, so I figured it had to have been a command I'd run that had inadvertently done this.
Yep: don't run the MySQL client as root. Yes yes yes, it's bad anyway, I'll go to sysadmin hell, but this is an interesting bug. The environment variable MYSQL_HISTFILE is set to /dev/null for root...and when you exit the client, it sets the permissions for the history file to 0600. So, you know, don't do that then. (Still no fix committed, btw...)
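To illustrate the sequence (a hypothetical root-shell transcript, not copied from this box):
# ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 ... /dev/null
# echo $MYSQL_HISTFILE
/dev/null
# mysql
mysql> select 1;
mysql> quit
# ls -l /dev/null
crw------- 1 root root 1, 3 ... /dev/null
# chmod 666 /dev/null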
Just spent the better part of five hours cleaning up four old, out-of-date Wordpress installations after they got infected with this worm. I host nine sites on my home server for friends and family; I'm cutting that down to three (just family), and maybe looking at mu-wordpress, as of Real Soon Now.
Happy Labour Day, everyone!
Update: I meant to add in here a few things I looked for, because this info was hard to track down.
I found extra admin-level users in the wp_users table; some had their email address set to "www@www.com", some had random made-up or possibly real addresses, and some had the same email address as already-existing users (there's a query sketch for spotting these after this list).
On one blog (possibly infected much earlier) I found 42,000 (!!) approved, spammy comments.
I searched for infected posts using a query from here:
SELECT * FROM wp_posts WHERE post_content LIKE '%iframe%'
UNION
SELECT * FROM wp_posts WHERE post_content LIKE '%noscript%'
UNION
SELECT * FROM wp_posts WHERE post_content LIKE '%display:%'
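As promised above, here's the sort of query that turns up those extra admin-level users, plus one for spotting posts with absurd numbers of approved comments. The database name, user, and wp_ table prefix are placeholders for whatever your install uses:
# list every user with administrator capabilities
mysql -u wpadmin -p wordpress -e "
    SELECT u.ID, u.user_login, u.user_email
      FROM wp_users u
      JOIN wp_usermeta m ON m.user_id = u.ID
     WHERE m.meta_key = 'wp_capabilities'
       AND m.meta_value LIKE '%administrator%';"
# posts with the most approved comments -- 42,000 on one post stands out
mysql -u wpadmin -p wordpress -e "
    SELECT comment_post_ID, COUNT(*) AS approved
      FROM wp_comments
     WHERE comment_approved = '1'
     GROUP BY comment_post_ID
     ORDER BY approved DESC
     LIMIT 10;"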
I just spent the weekend (well, like an hour a day...kids, life, you know how it is) trying to track down why a bunch of new CentOS 5.2 installs at $job_2 couldn't pipe:
$ ls foo
foo
$ ls | grep foo
$ echo $?
141
(Actually, I didn't think to look at the exit code 'til someone else pointed it out… 141 turns out to be SIGPIPE.) In the end, it would have been quicker if I'd simply searched for the first thing I saw when logging in:
-bash: [: =: unary operator expected
-bash: [: -le: unary operator expected
This was particularly aggravating to track down because not every machine was doing this, and no matter what I thought to look at (/etc contents, /tmp permissions (those have a habit of going wonky on me for some reason), SELinux) I couldn't figure out what was different.
Turned out to be an upstream bug in nss_ldap. (The Bugzilla entry makes for some interesting reading, to be sure…) And I didn't see it on each machine because I hadn't upgraded after installation on all machines. (They're not yet in production, and I'm working on getting my kickstart straight.)
Man, it was gratifying to upgrade nss_ldap and see the problem go away…
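(For the record, decoding that status is easy: anything above 128 means the command died from a signal, 141 - 128 = 13, and bash will tell you which signal that is:)
$ kill -l 13
PIPE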
Heads up for those of you using Blastwave and CUPS: after upgrading to the latest stable version, printing stopped working for me (and a few users :-). I eventually tracked it down to the movement of two files: suddenly
/opt/csw/lib/cups/filter/pstopxl
/opt/csw/lib/cups/filter/pstoraster
were moved to
/opt/csw/lib/cups/pstopxl
/opt/csw/lib/cups/pstoraster
resulting in many error messages like "Unsupported format text/plain!" and "Hint: is ESP ghostscript installed?". Moving them both back into place and restarting CUPS fixed things just fine.
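For anyone hitting the same thing, the workaround amounts to this (paths as above; how you restart cupsd depends on whether your Blastwave CUPS runs from an init script or SMF):
# put the two filters back where CUPS looks for them
mv /opt/csw/lib/cups/pstopxl    /opt/csw/lib/cups/filter/pstopxl
mv /opt/csw/lib/cups/pstoraster /opt/csw/lib/cups/filter/pstoraster
# ...then restart cupsd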
According to Bacula (yay Bacula!) both files were in the right directory as of last night, and Blastwave's file list for Ghostscript shows the new location for these two files. A bug has been filed.
One of the problems I've been working on since the upgrade to Solaris 10 has been the slowness of the SunRay terminals. There are a few different problems here, but one of 'em is that after typing in your password and hitting Enter, it takes about a minute to get the JDS "Loading your desktop…" icons up.
I scratched my head over this one for a long time 'til I saw this:
ptree 10533
906   /usr/dt/bin/dtlogin -daemon -udpPort 0
  10445 /usr/dt/bin/dtlogin -daemon -udpPort 0
    10533 /bin/ksh /usr/dt/config/Xstartup
      10551 /bin/ksh -p /opt/SUNWut/lib/utdmsession -c 4
        10585 /bin/ksh -p /etc/opt/SUNWut/basedir/lib/utscrevent -c 4 -z utdmsession
          10587 ksh -c echo 'CREATE_SESSION 4 # utdmsession' >/dev/tcp/127.0.0.1/7013
which just sat there and sat there for, oh, about a minute. So I ran netcat on port 7013, logged out and logged in again, and boom! quick as anything.
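That stopgap was nothing fancier than giving the port a listener (netcat option syntax varies between versions; traditional netcat shown):
# let the echo to /dev/tcp/127.0.0.1/7013 connect immediately instead of hanging
nc -l -p 7013 >/dev/null &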
/etc/services says:
utscreventd 7013/tcp # SUNWut SRCOM event deamon
which we're not running; something to do with smart cards. So why does it hang so long? Because for some reason, the host isn't sending back an RST packet (I presume; can't listen to find out) to kill the connection, like it does on $other_server.
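An easy way to compare the two hosts is to time the same /dev/tcp trick that Xstartup uses: an immediate failure means an RST came back, a long pause means the packets are just disappearing.
$ time ksh -c 'echo test >/dev/tcp/127.0.0.1/7013'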
So now I'm trying to figure out why that is. It's not the firewall; they're identical. I've tried looking at ndd /dev/tcp \? but I don't see anything obvious there. My google-fu doesn't appear to be up to the task either. I may have to cheat and go visit a fellow sysadmin to find out.
I tried ripping a CD recently on my desktop machine running Debian testing. grip seemed to hang, and I kept getting this error message in the logs:
hdc: write_intr: wrong transfer direction!
Google didn't turn up much that helped, except to suggest a simpler test case (cdparanoia -d /dev/hdc 1-1). Data CDs seemed to work just fine.
Finally tried upgrading the kernel package, from kernel-image-2.6.8-2-386 (the default kernel after installation) to linux-image-2.6.18-4-686, and that did the trick nicely.
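For anyone in the same boat, the fix boiled down to something like this (package name as above; the exact revision in testing will drift over time):
# install the newer kernel image, reboot into it, then retest the rip
apt-get update
apt-get install linux-image-2.6.18-4-686
cdparanoia -d /dev/hdc 1-1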
Hard to believe, but OpenSSH (portable version) has finally fixed the hang on exit bug. Man, I thought that'd never go away...
Disgust is the feeling you get when you realize that the sudden and mysterious failures you've been tracking down for the last half hour happened because you typed $debug == 1;.
At work, one of our visitors had a problem: his browser kept crashing whenever he visited his university's webpage. That's what caused this problem, with the ginormous core files; Firefox would start grabbing memory at a rate of about 10-20MB/s, and then Solaris would kill it off once it got to around 2.5GB. Truss showed that it kept opening a copy of the Times Roman font over and over again, but I couldn't figure out why.
I tried duplicating it under my own account, but couldn't, so I moved my .mozilla out of the way and set up a new profile, which did have the problem. At first I figured it must be Javascript; I hate Javascript and almost always turn it off (but thanks to the NoScript plugin it's easy to toggle it for individual sites), so that must be it, right? Wrong. Okay, so what about prefs.js? Nope.
After copying over nearly all the files in my original profile, FF was still grabbing memory…until finally, out of desperation, I copied over the Flash plugin. And by the beard of Zeus, it worked!
I did some digging around (there are a truly depressing number of Mozilla bugs that mention Flash) and found mention (sorry, lost link) that some methods of detecting whether a browser has the Flash plugin can end up using all the memory available to the browser. That sounds similar enough to what I saw that I think I'm going to call that the problem-designate for now.