I came across a problem compiling GotoBLAS2 at work today. It went well on a practice cluster, but on the new one I got this error:
gcc -c -O2 -Wall -m64 -DF_INTERFACE_G77 -fPIC -DSMP_SERVER -DMAX_CPU_NUMBER=24 -DASMNAME=strmm_ounncopy -DASMFNAME=strmm_ounncopy_ -DNAME=strmm_ounncopy_ -DCNAME=strmm_ounncopy -DCHAR_NAME=\"strmm_ounncopy_\" ...
../kernel/x86_64/gemm_ncopy_4.S: Assembler messages:
../kernel/x86_64/gemm_ncopy_4.S:192: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:193: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:194: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:195: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:197: Error: undefined symbol `WPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:345: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:346: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:348: Error: undefined symbol `WPREFETCHSIZE' in operation
The solution was simple:
gmake clean
gmake TARGET=NEHALEM
The problem appears to be that newer CPUs (an Intel X5650 in my case) are not detected properly by the CPU ID routine in GotoBLAS2. You can verify this by checking the contents of config.h in the top-level directory. Without TARGET=NEHALEM, I saw this line:
#define INTEL_UNKNOWN
But with TARGET=NEHALEM, this becomes:
#define NEHALEM
The problem with gemm_ncopy_4.S arises because it defines RPREFETCHSIZE and WPREFETCHSIZE using #ifdef statements that depend on the CPU type. There is an entry for #ifdef GENERIC, but that was not set for me in config.h.
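To illustrate the shape of the problem (NEHALEM and GENERIC are macros mentioned above, but the sizes below are made-up placeholders, not the real values from gemm_ncopy_4.S), the kernel source only defines the prefetch sizes for CPU types it recognizes:

/* Illustrative sketch only -- the numbers are placeholders, not the
 * actual values from ../kernel/x86_64/gemm_ncopy_4.S. */
#ifdef NEHALEM
#define RPREFETCHSIZE 8
#define WPREFETCHSIZE 8
#endif

#ifdef GENERIC
#define RPREFETCHSIZE 8
#define WPREFETCHSIZE 8
#endif

/* ...and so on for the other CPU types it knows about. With
 * "#define INTEL_UNKNOWN" in config.h, none of these branches match,
 * so RPREFETCHSIZE and WPREFETCHSIZE are never defined and every
 * prefetch instruction that references them fails to assemble. */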
In addition, if you type "gmake TARGET=NEHALEM" without "gmake clean" first, you get a little further before you run into a similar error:
/usr/bin/ld: ../libgoto2_nehalemp-r1.13.a(ssymv_U.o): relocation R_X86_64_32S against `PREFETCHSIZE' can not be used when making a shared object; recompile with -fPIC
../libgoto2_nehalemp-r1.13.a(ssymv_U.o): could not read symbols: Bad value
If I were a better person, I'd have a look at how the sizes are defined and figure out what the right values are for newer CPUs, then modify cpuid.c (which I presume is what's being used to generate config.h, or at least this part of it). Maybe another day...
Today I installed Windows 7 Pro on a MacBook Pro with Boot Camp. I ran into two problems I figured I should document:
First, Boot Camp would not proceed past the offer to burn a CD with drivers for the disk. It's a bug, and I was able to ignore it; networking and graphics came up without problems.
Second, after installing Windows Defender and rebooting, Windows would freeze at the login screen. This too is a known problem, with lots of suggestions on how to fix it. What worked for me was booting into safe mode with the command prompt (every other safe mode would freeze), then disabling the Windows Defender service with MMC. After that I was able to boot normally, set the service back to "Automatic Start", reboot, and I've had no further trouble.
Just spent two hours staring at this Python regex:
method_re = re.compile('(?P<urlcmd>\s*url\s.*)|(?P<nfscmd>\s*nfs\s.*)')
and trying to figure out why it wouldn't match this line:
install url --url http://cobbler.example.org/cobbler/ks_mirror/CentOS-5.4/os/x86_64
After much research, I can report a few things:
In Python, the match() method of a compiled regular expression matches only at the beginning of the string. If you think in Perl, like I do, it's as if a ^ were prepended to the regex; to match anywhere in the string, you want search() instead.
(?P<urlcmd>...) puts whatever matches into the named group urlcmd. That wasn't a problem; I just think it's neat (though obscure).
A kickstart file, which is where this line comes from, is meant to have the install line separate from the method (in this case, url). I did not know that.
And now to read more of Dr. Zhivago.
Update: by way of explanation, I was trying to use koan and Cobbler to install a new qemu instance like so:
/usr/bin/koan --nogfx --server=cobbler.example.com --virt --system=virtserver-01.example.com
and was getting two errors:
warning: kickstart found but no install_tree found
Cannot find install source in kickstart file, aborting.
The problem turned out to be that my kickstart file had this line:
install url --url http://cobbler.example.org/cobbler/ks_mirror/CentOS-5.4/os/x86_64
which should have been two lines:
install
url --url http://cobbler.example.org/cobbler/ks_mirror/CentOS-5.4/os/x86_64
Michael W. Lucas, of Absolute FreeBSD fame (among many others), has a book coming out on Network Flow Analysis. Sweet!
/me hurries off to pre-order...
I think I've finally figured out what's going on with my bacula-sd hangs. At the risk of repeating myself, this post is adapted from the bug report I've just filed.
Here's the situation: with the latest Bacula release (5.0.1), I regularly see bacula-sd hang when running jobs; it happens often and at seemingly random times; and when this happens, I see two bacula processes, a parent and a child. (Since bacula is multi-threaded, I usually just see one storage daemon process.) This came to my attention when I came back from vacation to find a week's worth of backups stalled (sigh).
When bacula-sd hangs, the traceback of the relevant thread in the parent process looks like this:
Thread 10 (Thread 0x466c0940 (LWP 12926)):
#0 0x00000035aa4c5f3b in read () from /lib64/libc.so.6
#1 0x00000035aa46cc07 in _IO_new_file_underflow (fp=<value optimized out>) at fileops.c:590
#2 0x00000035aa46d5ce in _IO_default_uflow (fp=<value optimized out>) at genops.c:435
#3 0x00000035aa468e8b in _IO_getc (fp=<value optimized out>) at getc.c:41
#4 0x00002b76479565c0 in bfgets (s=0x466bf710 "", size=<value optimized out>, fd=0x60080a0) at bsys.c:617
#5 0x000000000040d432 in release_device (dcr=0x60988b8) at acquire.c:533
[snip]
Here, Bacula's storage daemon has just finished running a job, and before it releases the tape drive to someone else it runs the "Alert" command. This is specified in the config file for the storage daemon, and is meant to see if the drive has, say, run out of magnetism during the last job. Here's the source code in stored/acquire.c:
alert = get_pool_memory(PM_FNAME);
alert = edit_device_codes(dcr, alert, dcr->device->alert_command, "");
bpipe = open_bpipe(alert, 0, "r");
if (bpipe) {
   while (fgets(line, sizeof(line), bpipe->rfd)) {  /* AardvarkNote: This is where the parent hangs */
      Jmsg(jcr, M_ALERT, 0, _("Alert: %s"), line);
   }
   status = close_bpipe(bpipe);
}
Meanwhile, the child process stack looks like this:
Thread 1 (Thread 0x466c0940 (LWP 13000)):
#0 0x00000035aa4df9ee in __lll_lock_wait_private () from /lib64/libc.so.6
#1 0x00000035aa4d06a5 in _L_lock_1206 () from /lib64/libc.so.6
#2 0x00000035aa4d05be in closelog () at ../misc/syslog.c:419
#3 0x00002b764795cd35 in open_bpipe (prog=<value optimized out>, wait=0, mode=<value optimized out>) at bpipe.c:138
#4 0x000000000040d3f1 in release_device (dcr=0x60988b8) at acquire.c:531
[snip]
open_bpipe() can be found in lib/bpipe.c; it's a routine that forks a child process (FORESHADOWING: not another thread!) and sets up a pipe between parent and child. The relevant bits look like this:
/* Start worker process */
switch (bpipe->worker_pid = fork()) {
[snip]
case 0:                            /* child */
   if (mode_write) {
      close(writep[1]);
      dup2(writep[0], 0);          /* Dup our write to his stdin */
   }
   if (mode_read) {
      close(readp[0]);             /* Close unused child fds */
      dup2(readp[1], 1);           /* dup our read to his stdout */
      dup2(readp[1], 2);           /* and his stderr */
   }
   /* AardvarkNote: This is where the child hangs: */
   closelog();                     /* close syslog if open */
   for (i=3; i<=32; i++) {         /* close any open file descriptors */
      close(i);
   }
   execvp(bargv[0], bargv);        /* call the program */
closelog() itself is simple: it "closes the current Syslog connection, if there is one." But running strace on the child process just shows a lot of futex calls...nothing very useful there at all. So what the hell is going on, and why is it hanging at closelog()?
Some background info on threads: on Linux, a thread is created with clone() and looks to the kernel much like just another process with its own ID, one that happens to share memory with its siblings. The POSIX threading semantics layered on top -- mutexes and friends -- are left as an exercise for the reader...or in this case, for glibc and NPTL.
Since threads are part of the same process, they share memory. (A fork(), by contrast, copies the parent's memory to the child -- and then the child has its own copy of everything, separate from the parent.) glibc/NPTL implements locks for certain things to make sure that one thread doesn't stomp all over another thread's memory willy-nilly. And those locks are built on futexes, which are provided by the kernel...which explains why they show up in strace, which tracks system calls.
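As a quick illustration -- this is a toy program of my own, not Bacula code -- a thread that blocks on a contended pthread mutex ends up sitting in the futex() system call, which is exactly what strace shows:

/* mutex_futex.c -- toy example, not Bacula code.
 * Build: gcc -pthread mutex_futex.c -o mutex_futex
 * Run it under "strace -f ./mutex_futex" and the blocked worker thread
 * shows up waiting in futex(..., FUTEX_WAIT, ...). */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
   pthread_mutex_lock(&lock);      /* contended, so this blocks in futex() */
   printf("worker finally got the lock\n");
   pthread_mutex_unlock(&lock);
   return NULL;
}

int main(void)
{
   pthread_t tid;

   pthread_mutex_lock(&lock);      /* main grabs the lock first */
   pthread_create(&tid, NULL, worker, NULL);
   sleep(2);                       /* keep the worker blocked for a while */
   pthread_mutex_unlock(&lock);
   pthread_join(tid, NULL);
   return 0;
}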
Why is this relevant? Because in the glibc code, closelog() looks like this:
void
closelog ()
{
  /* Protect against multiple users and cancellation. */
  __libc_cleanup_push (cancel_handler, NULL);
  __libc_lock_lock (syslog_lock);   /* AardvarkNote: This is where things hang */
  closelog_internal ();
  LogTag = NULL;
  LogType = SOCK_DGRAM;             /* this is the default */
  /* Free the lock. */
  __libc_cleanup_pop (1);
}
That __libc_lock_lock (syslog_lock) call is there to prevent two threads from messing with the syslog file handle at the same time. Sure enough, the info for frame #2 of the child process shows that the process is trying to get the syslog_lock mutex:
#2 0x00000035aa4d05be in closelog () at ../misc/syslog.c:419
419 __libc_lock_lock (syslog_lock);
ignore1 = <value optimized out>
ignore2 = <value optimized out>
ignore3 = <value optimized out>
As noted in the mailing list discussion, closelog() should be a no-op if there's no descriptor open to syslog. However, in my case there is such a file descriptor, because I've got bacula-sd configured to log to syslog.
Well, as the Bible notes, mixing fork() and threading is problematic:
There are at least two serious problems with the semantics of fork() in a multi-threaded program. One problem has to do with state (for example, memory) covered by mutexes. Consider the case where one thread has a mutex locked and the state covered by that mutex is inconsistent while another thread calls fork(). In the child, the mutex is in the locked state (locked by a nonexistent thread and thus can never be unlocked). Having the child simply reinitialize the mutex is unsatisfactory since this approach does not resolve the question about how to correct or otherwise deal with the inconsistent state in the child.
And hey, doesn't that sound familiar?
Now that I had an idea of what was going on, I was able to find a number of similar problems that people have encountered:
This post describes a multi-threaded app that hung when a signal was received during a syslog() call.
This thread from the libc-alpha mailing list describes a similar problem:
> The particular problem I'm seeing is with syslog_lock.
>
> If another thread is writing to syslog when the fork happens,
> the syslog_lock mutex is cloned in a locked state.
This Debian bug describes a multithreaded app deadlocking when syslog() is called after a fork().
The Bible describes the POSIX-ish way around this: the pthread_atfork() call.
This blog entry has an overview of the problem with threads and fork(), and of pthread_atfork().
So: what seems to be happening is that, when the stars align, the child process is being created at the same moment that the parent process is logging to syslog. The child is created with a locked syslog_lock mutex, but without the thread that had been holding it...and thus, without anything that can release it. The child blocks waiting for the mutex, the parent blocks on the child, and backup jobs halt (well, at least the spooling of jobs to tape) until I kill the child manually.
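To see the mechanism in miniature, here's a toy program of my own (not Bacula code) that forks while another thread holds a mutex; the child inherits the lock in its locked state and deadlocks trying to take it, just as the storage daemon's child does on syslog_lock:

/* fork_lock_deadlock.c -- a minimal sketch of the deadlock, not Bacula code.
 * Build: gcc -pthread fork_lock_deadlock.c -o fork_lock_deadlock
 * The child inherits "lock" in the locked state, but not the thread that
 * locked it, so its pthread_mutex_lock() never returns. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *holder(void *arg)
{
   pthread_mutex_lock(&lock);      /* stand-in for a thread logging to syslog */
   sleep(5);                       /* hold the lock while the fork happens */
   pthread_mutex_unlock(&lock);
   return NULL;
}

int main(void)
{
   pthread_t tid;
   pid_t pid;

   pthread_create(&tid, NULL, holder, NULL);
   sleep(1);                       /* make sure the lock is held */

   pid = fork();
   if (pid == 0) {                 /* child: only the forking thread survives */
      fprintf(stderr, "child: trying to take the lock...\n");
      pthread_mutex_lock(&lock);   /* hangs forever, like closelog() did */
      fprintf(stderr, "child: got it (you will never see this)\n");
      _exit(0);
   }

   pthread_join(tid, NULL);        /* parent's copy of the lock gets released */
   fprintf(stderr, "parent: now waiting on a child that will never exit\n");
   waitpid(pid, NULL, 0);          /* ...just like bacula-sd waiting on its alert child */
   return 0;
}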
This was complicated to find for a number of reasons:
My gut feeling (see also: handwavy assertion) is that logging to syslog is relatively unusual, which would explain why this problem has taken a while to surface.
It's a race condition that's exacerbated by having multiple jobs running at once; I'd only recently implemented this in my Bacula setup.
btraceback, a shell wrapper around gdb that comes with Bacula, is meant to run the debugger on a hung Bacula process. I used it to get the tracebacks shown above. It's great if you don't know what you're doing with gdb (and I certainly don't!). But it has a number of problems:
btraceback is usually called by the unprivileged bacula user. That meant I had to modify the script to run gdb with sudo, and change sudoers to allow bacula to run "sudo gdb" without a password. Otherwise, the dump was useless -- particularly frustrating before I'd figured out how to duplicate the problem.

btraceback can be triggered by sending a fatal signal to the bacula-sd process (say, "kill -6"). That's great when you notice a hang and can run the script. But it won't trace child processes -- which was where the interesting things were -- and it was a while before it occurred to me to do that manually.

Bacula has an --enable-lockmgr option that's meant to catch deadlocks, kill itself and run btraceback. However, it's not enabled by default in the SRPMs I was using to build Bacula, and in any case it watches for deadlocks on Bacula's own locks -- not external locks like syslog_lock.
So what to do?
For now, I'm removing the log-to-syslog option from bacula-sd.conf. When I do that, I see no problems at all running jobs.
On the programming side -- and keep in mind I'm no programmer -- it looks like there are a number of options here:
Don't call closelog() after fork() in open_bpipe(). This may leave an open descriptor to syslog, which may be a problem. (Or maybe it's just ugly. Don't know.)
Don't fork() a child process when running the alert command, but start another thread instead. I have no idea why a fork() is preferred over a new thread in this scenario, but I presume there's a good reason.
Use pthread_atfork() to set things up for a fork(). That's what The Bible says to do, but I don't know what Bacula would actually need to do with it in order to make this deadlock go away. There's a rough sketch of the general idea below.
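For what it's worth, here's the general shape of the pthread_atfork() approach -- a sketch of the technique, not a patch for Bacula, and it uses a made-up application-level mutex (log_lock) rather than glibc's internal syslog_lock, which application code can't reach directly. The idea is to take the lock in the prepare handler so the fork can't happen while some other thread holds it, then release it in both the parent and the child afterwards:

/* atfork_sketch.c -- sketch of the pthread_atfork() pattern, not Bacula code.
 * "log_lock" is a hypothetical application-level mutex used for illustration. */
#include <pthread.h>
#include <unistd.h>
#include <sys/types.h>

static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

/* Runs in the forking thread just before fork(): take the lock, so no
 * other thread can be holding it at the moment the child is created. */
static void prepare(void)   { pthread_mutex_lock(&log_lock); }

/* Runs after fork() in the parent and in the child: each has its own
 * copy of the lock, known to be held by the forking thread, so each
 * can safely release it. */
static void in_parent(void) { pthread_mutex_unlock(&log_lock); }
static void in_child(void)  { pthread_mutex_unlock(&log_lock); }

int main(void)
{
   pid_t pid;

   pthread_atfork(prepare, in_parent, in_child);

   /* ... worker threads that take log_lock around their logging ... */

   pid = fork();                   /* the three handlers fire around this call */
   if (pid == 0) {
      /* child: log_lock is unlocked and usable here */
      _exit(0);
   }
   return 0;
}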
Good lord, I'm closing in on 1700 words here. If you've stuck with me this long, I hope it was interesting; once I got over the panic it was a fascinating problem, and it's taught me a lot about processes, threads and deadlocks. And how can that be a bad thing?
I've been off on vacation for a week while my parents have been visiting. It's been fun, relaxing, and generally a good time. But today it was back to work.
I found out that backups have not been running since the day after I went on vacation. There were something like 650 jobs in Bacula stacked up, waiting to run; the storage daemon was stuck trying to do something, and nothing was getting done.
Near as I can tell, the storage daemon was stuck in a deadlock. I've got some backtraces, I've posted to the devel mailing list, and it looks like there's a bug, recently fixed, that addresses the problem. Inna meantime I've put in Nagios monitoring for the size of the run queue, and I'm figuring out how to watch for the extra bacula-sd process that showed up.
Tonight, at least, the same problem happened again. This is good, because now I have a chance to repeat it. Got another backtrace, killed the job, and things are moving on. Once things are finished here, I think I'm going to try running that job again and seeing if I can trigger the problem.
Sigh. This was going to be a good day and a good post. But it's late and I've got a lot to do.
Ugh, late night...which these days, means anything past 9:30pm. Two machines down at work with what I think are unrelated problems.
The first one appears to have had the OOM-killer run repeatedly and leave the ethernet driver in a bad state; I know, but the OOM-killer kept killing things until we got this bug.
The second one appears to have crashed and/or rebooted, but the hardware clock got reset to December 2001 in the process -- which meant that when it tried to contact the LDAP servers, none of their certificates were valid yet.
Again, ugh. But I did come across this helpful addition to my toolkit:
openssl s_client -CAfile /path/to/CA_cert -connect host:port
which, I just realized, I've rediscovered, along with having the same fucking problem again.
And did I mention I'm up at 5am tomorrow to move some equipment around at work? Ah well, I have safety boots now. I'll be suitably rewarded in Valhalla.
I mentioned that I've been having problems with Bacula recently. These have been aggravated by the fact that the trigger seems to be a job that takes 53 hours to finish.
Well, I think I've got a handle on one part of the problem. See, when Bacula is doing this big job, other jobs stack up behind it -- despite having two tape drives, and two separate pools of tapes, and concurrent jobs set up, the daily jobs don't finish. The director says this:
9279 Full BackupCatalog.2010-02-20_21.10.00_10 is waiting for higher priority jobs to finish
9496 Full BackupCatalog.2010-02-23_21.10.00_13 is waiting execution
9498 Full bigass_server-d_drive.2010-02-24_03.05.01_15 is running
9520 Increme little_server-var.2010-02-24_21.05.00_38 is waiting on Storage tape
9521 Increme little_server-opt.2010-02-24_21.05.00_39 is waiting on max Storage jobs
but storage says this:
Running Jobs:
Writing: Full Backup job bigass_server-d_drive JobId=9498 Volume="000031"
pool="Monthly" device="Drive-0" (/dev/nst1)
spooling=1 despooling=0 despool_wait=0
Files=708,555 Bytes=1,052,080,331,191 Bytes/sec=11,195,559
FDReadSeqNo=22,294,829 in_msg=20170567 out_msg=5 fd=16
Writing: Incremental Backup job little_server-var JobId=9508 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=1
Files=156 Bytes=3,403,527,093 Bytes/sec=72,415
FDReadSeqNo=53,041 in_msg=52667 out_msg=9 fd=9
Writing: Incremental Backup job little_server-etc JobId=9519 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=0
Files=9 Bytes=183,606 Bytes/sec=3
FDReadSeqNo=72 in_msg=50 out_msg=9 fd=10
Writing: Incremental Backup job other_little_server-etc JobId=9522 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=1
Files=5 Bytes=182,029 Bytes/sec=3
FDReadSeqNo=45 in_msg=32 out_msg=9 fd=19
Writing: Incremental Backup job other_little_server-var JobId=9525 Volume="000017"
pool="Daily" device="Drive-1" (/dev/nst0)
spooling=0 despooling=0 despool_wait=0
Files=0 Bytes=0 Bytes/sec=0
FDSocket closed
Out of desperation I tried running "unmount" for the drive holding the daily tape, thinking that might reset things somehow...but the console just sat there, and never returned a prompt or an error message. Meanwhile, storage was logging this:
cbs-01-sd: dircmd.c:218-0 <dird: unmount SL-500 drive=1
cbs-01-sd: dircmd.c:232-0 Do command: unmount
cbs-01-sd: dircmd.c:596-0 Try changer device Drive-0
cbs-01-sd: dircmd.c:617-0 Device SL-500 drive wrong: want=1 got=0 skipping
cbs-01-sd: dircmd.c:596-0 Try changer device Drive-1
cbs-01-sd: dircmd.c:612-0 Found changer device Drive-1
cbs-01-sd: dircmd.c:625-0 Found device Drive-1
cbs-01-sd: block.c:133-0 Returning new block=39cee10
cbs-01-sd: acquire.c:647-0 JobId=0 enter attach_dcr_to_dev
...and then just hung there. "Aha, race condition!" I thought, and sure enough a bit of searching found this commit in November: "Fix SD DCR race condition that causes seg faults". No, I don't have a segfault, but the commit touches the last routine I see logged (along with a buncha others).
This commit is in the 5.0.1 release; I wasn't planning to upgrade to this just yet, but I think I may have to. But I'm going on vacation week after next, and I'm reluctant to do this right before I'm away for a week. What to do, what to do...
Remember: when adding access to your Fishworks/Unified Storage System 7310, LDAP entries must include objectClass: shadowAccount. That took me a while to track down.
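For example, a user entry needs to look something like this (a made-up record with example.com names; your attributes will differ, but note the shadowAccount line):

dn: uid=jdoe,ou=People,dc=example,dc=com
objectClass: inetOrgPerson
objectClass: posixAccount
objectClass: shadowAccount
uid: jdoe
cn: Jane Doe
sn: Doe
uidNumber: 10001
gidNumber: 10001
homeDirectory: /home/jdoe
loginShell: /bin/bash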
Arghh...I just spent 24 hours trying to figure out why shadow migration was causing our new 7310 to hang. The answer? Because jumbo frames were not enabled on the switch the 7310 was on, and they were on the machine we're migrating from. Arghh, I say!
I have just spent two and a half hours trying to track down the reason a page on a Joomla site suddenly started saying "You are not authorized to view this resource". In the end it turned out to be a known problem with the OpenSEF plugin, but it took me a stupidly long time to even guess that might be the problem. (Probably shoulda searched for the error message first…)
There has to be a better way to do this. The only way I could figure to trace the problem was by sprinkling lots of print "FIXME: Made it here\n"; throughout the code. I know, it's a terrible way of doing it, but running php index.php didn't seem to work -- I couldn't get the code to see the arguments I was trying to pass on. What am I missing?