Torque problems with Rocks

I've been puttering away at work getting the cluster going. It's hard, because there are a lot of things I'm having to learn on the go. One of the biggest chunks is Torque and Maui, and how they interact with each other and Rocks as a whole.

For example: today I tried submitting a crapton of jobs all at once. After a while I checked the queue with showq (a Maui command; not to be confused with qstat, which is Torque) and found that a lot of jobs were listed as "Deferred" rather than "Idle". I watched, and the idle ones ran; the deferred ones just stayed in place, even after the list of running jobs was all done.

At first I thought this might be something to do with fairness. There are a lot of knobs to twiddle in Maui, and since I hadn't looked at the configuration after installation I wasn't really sure what was there. But as near as I could tell, nothing was going on there: the config file for Maui was empty, and I couldn't find any mention of what the default settings were. I followed the FAQ and ran the various status commands, but didn't see anything obvious.

Then I tried looking in the Torque logs (/opt/torque/server_logs), and found this:

06/08/2011 14:42:09;0008;PBS_Server;Job;8356.example.com;send of job to compute-3-2 failed error = 15008
06/08/2011 14:42:09;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Access from host not allowed, or unknown host (15008) in send_job, child failed in previous commit request for job 8356.example.com

And on compute-3-2 (/opt/torque/mom_logs)

06/08/2011 14:42:01;0080;   pbs_mom;Req;req_reject;Reject reply code=15008(Access from host not allowed, or unknown host MSG=request not authorized), aux=0, type=QueueJob, from PBS_Server@example.local

That's weird. I ran rocks sync config out of superstition, but nothing changed. I found a suggestion that it might be a bug in Torque, and that I should run momctl -d to see if the head node was in the trusted client list. It was not. I tried running that command on all the nodes (sudo rocks run host compute command="momctl -d3 | grep Trusted | grep 10.1.1.1"); it turned out that only 10 of them had the head node listed. What the hell?

I'm still not sure exactly where this gets set, but I did notice that /opt/torque/mom_priv/config listed the head node as the server, and was identical on all machines. On a hunch, I tried restarting the pbs service on all the nodes; suddenly they all came up. I submitted a bunch more jobs, and they all ran through -- none were deferred. And running momctl -d showed that, yes, the head node was now in the trusted client list.
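
For the record, the fix boiled down to something like this (this is from memory, and the exact service name may differ with your Torque roll version):

# restart pbs_mom everywhere, then confirm the head node shows up as trusted
sudo rocks run host compute command="service pbs restart"
sudo rocks run host compute command="momctl -d3 | grep Trusted"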

Thoughts:

  • None of this was shown by Ganglia (which just monitors load) or showq (which is a Maui command; the problem was with Torque).

  • Doubtless there were commands I should've been running in Torque to show these things.

  • While the head node is running a syslog server and collects stuff from the client nodes, Torque logs are not among them; I presume Torque is not using syslog. (Must check that out.)

  • I still don't know how the trusted client list is set. If it's in a text file, that's something that I think Rocks should manage.

  • I'm not sure if tracking down the problem this way is exactly the right way to go. I think it's important to understand this, but I suspect the Rocks approach would be "just reboot or reinstall". There's value to that, but I intensely dislike not knowing why things are happening and sometimes that gets in my way.

Tags: cluster rocks

How to print character arrays in Python

I've come across this problem a number of times, so here's a reminder. When printing in Python, occasionally I'll end up with something like this:

>>> print "foo=%s" % row["Name"]
foo=array('c', 'bar')

The solution is to use the .tostring() method:

>>> print "foo=%s" % row["Name"].tostring()
foo=bar

Details here.

Tags: python

Uh, what?

From The CBC:

The Vancouver Canucks have skated to within a victory from advancing to their first Stanley Cup final since 1994 after they exhibited their superiority in a bizarre special teams battle with the San Jose Sharks on Sunday.

Tags:

Memo to Canadians

Memo to Canadians: your government will throw you under a bus if they feel like it.

Quote:

The Canadian Security Intelligence Service, Canada's principal intelligence agency, routinely transmits to U.S. authorities the names and personal details of Canadian citizens who are suspected of, but not charged with, what the agency refers to as "terrorist-related activity."

The criteria used to turn over the names are secret, as is the process itself.

Quote:

In at least some cases, the people in the cables appear to have been named as potential terrorists solely based on their associations with other suspects, rather than any actions or hard evidence.

Quote:

The first stop for these names is usually the so-called Visa Viper list maintained by the U.S. government. Anyone who makes that list is unlikely to be admitted to the States.

Given Washington's policy of centralizing such information, though, the names also go into the database of the U.S. National Counterterrorism Centre. Inclusion in such databases can have several consequences, such as being barred from aircraft that fly through U.S. airspace.

Or, as Canadian Maher Arar discovered in 2002, the consequences can be worse: arrest, interrogation, even "rendition" to another country.

Quote:

"We don't want another Arar," said the security official. But at the same time, he said, CSIS is acutely aware that if it did not pass on information about someone it suspected, and that person then carried out some sort of spectacular attack in the U.S., the consequences could be cataclysmic for Canada.

U.S. authorities, already suspicious that Canada is "soft on terror," would likely tighten the common border, damaging hundreds of billions of dollars worth of vital commerce.

A former senior official, who also spoke to CBC on the basis of anonymity, put it more bluntly: "The reality is, sorry, there are bad people out there.

"And it's very hard to get some of those people before a court of law with the information you have. And so there has to be some sort of process which allows you to provide some sort of safeguard to society on both sides of the border."

Furthermore, he said, "it's not a fundamental human right to be able to go to the United States."

No, it's not a fundamental human right to be able to go to the United States. It is a fundamental human right not to be kidnapped and tortured.

Tags: politics rant

Good reading

With a perfect storm of disruptive technology rendering traditional broadcasting all but obsolete, foreign entrants with superior services and lower costs, unfavorable demographics, a powerful pro-competition government, entrenched inflexible business models, lack of competitive and innovative edge due to decades of insulation from the rest of the telecom world, bloated balance sheets due to costly acquisitions of old-media companies, and regulatory uncertainty relating to usage based billing and functional separation, large telecommunications firms in Canada do not look well-positioned.

Why Canadian Cable Companies and Telecoms Are in Trouble (via Michael Geist).

Tags:

Trouble compiling GotoBLAS2 on newer CPU

I came across a problem compiling GotoBLAS2 at work today. It went well on a practice cluster, but on the new one I got this error:

gcc -c -O2 -Wall -m64 -DF_INTERFACE_G77 -fPIC  -DSMP_SERVER -DMAX_CPU_NUMBER=24 -DASMNAME=strmm_ounncopy -DASMFNAME=strmm_ounncopy_ -DNAME=strmm_ounncopy_ -DCNAME=strmm_ounncopy -DCHAR_NAME=\"strmm_ounncopy_\o
../kernel/x86_64/gemm_ncopy_4.S: Assembler messages:
../kernel/x86_64/gemm_ncopy_4.S:192: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:193: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:194: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:195: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:197: Error: undefined symbol `WPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:345: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:346: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:348: Error: undefined symbol `WPREFETCHSIZE' in operation

The solution was simple:

gmake clean
gmake TARGET=NEHALEM

The problem appears to be that newer CPUs (Intel X5650 in my case) are not detected properly by the CPU ID routine in GotoBlas2. You can verify this by checking the contents of config.h in the top-level directory. Without TARGET=NEHALEM, I saw this line:

#define INTEL_UNKNOWN

But with TARGET=NEHALEM, this becomes:

#define NEHALEM

The problem with gemm_ncopy_4.S arises because it defines RPREFETCHSIZE and WPREFETCHSIZE using #ifdef statements that depend on the CPU type. There is an entry for #ifdef GENERIC, but that was not set for me in config.h.
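
Roughly speaking, the file does something like this (a paraphrase with made-up values, not the actual source):

/* one block per CPU type that the file knows about */
#ifdef NEHALEM
#define RPREFETCHSIZE 12
#define WPREFETCHSIZE 24
#endif

#ifdef GENERIC
#define RPREFETCHSIZE 16
#define WPREFETCHSIZE 48
#endif

/* with only INTEL_UNKNOWN defined in config.h, none of these blocks fire,
   RPREFETCHSIZE never gets a value, and the assembler errors out */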

In addition, if you type "gmake TARGET=NEHALEM" without "gmake clean" first, you get a little further before you run into a similar error:

/usr/bin/ld: ../libgoto2_nehalemp-r1.13.a(ssymv_U.o): relocation R_X86_64_32S against `PREFETCHSIZE' can not be used when making a shared object; recompile with -fPIC
../libgoto2_nehalemp-r1.13.a(ssymv_U.o): could not read symbols: Bad value

If I were a better person, I'd have a look at how the sizes are defined and figure out what the right values are for newer CPUs, then modify cpuid.c (which I presume is what's being used to generate config.h, or at least this part of it). Maybe another day...

Tags: rocks cluster hpc software debugging

Sunrise on the moon

Tonight was the first clear night in far, far too long. But instead of staying out 'til midnight, I decided to try pointing the scope out my bathroom window to look at the moon and get to bed at a semi-reasonable hour.

And hey, not bad! Sure, it got pretty awful above 50X, but that was enough to let me see all kinds of things. I decided to sketch what I saw, and I'm glad I did; it's no great artistry, but it really forced me to pay attention to what I was seeing.

After a half hour or so, I took a break, then came back to look again and pick out the features I could recognize. There was Albategnius, Plinius, Manilius, and...hey, I don't remember that bright bit in Albategnius being that big. And what's with the bright spot in Mare Imbrium's shadow?

So I looked up the new bright bit, and it's Mons Piton -- 7000 ft/2100-odd metres high. And it hits me: I didn't see that before. It's a big mountain in the middle of shadow. It's just the other side of the terminator. The same thing must've happened with Albategnius. Holy crap, I just saw sunrise on the moon!

It shouldn't really be a surprise -- even though I'm still getting familiar with the moon I know how the terminator moves, and I know that it's gotta move sometime. But it was really, really surprising to see it in such a short space of time.

Tags: astronomy

Multiple redundant cascading problems

Oh god this week. I've been setting up the cluster (three chassis' worth of blades from Dell). I've installed Rocks on the front end (rackmount R710). After that:

  • All blades powered on.

  • Some installed, most did not. Not sure why. Grub Error 15 is the result, which is Grub for "File not found".

  • I find suggestions in the Rocks mailing list to turn off floppy controllers. Don't have floppy controllers exactly in these, but I do see boot order includes USB floppy and USB CDROM. Pick a blade, disable, PXE boot and reinstall. Whee, it works!

  • Try on another blade and find that reinstallation takes 90 minutes. Network looks fine; SSH to the reinstalling blade and wget all the RPMs in about twelve seconds. What the hell?

  • Discover Rocks' Avalanche Installer and how it uses BitTorrent to serve RPMs to nodes. Notice that the installing node is constantly ARPing to find nodes that aren't turned on (they're waiting for me to figure out what the hell's going on). Restart service rocks-tracker on the front end and HOLY CRAP now it's back down to a three minute installation. Make a mental note to file a bug about this.

  • Find out that Dell OpenManage Deploy Toolkit is the best way to populate a new machine w/BIOS settings, since the Chassis Management Console can't push that particular setting to blades. Download that, start reading.

  • Try fifteen different ways of connecting virtual media using CMC. Once I find out the correct syntax for NFS mounts (amusingly different between manuals), some blades find it and some don't; no obvious hints why. What the hell?

  • Give up, pick a random blade and tell it by hand where to find the goddamn ISO. (This ignores the problems of getting Java apps to work in Awesome [hint: use wmname], which is my own fault.) Collect settings before and after disabling USB CDROM and Floppy and find no difference; this setting is apparently not something they expose to this tool.

  • Give up and try PXE booting this blade even with the demon USB devices still enabled. It works; installation goes fine and after it reboots it comes up fine. What the hell?

  • Power cycle the blade to see if it still works and it reinstalls. Reinstalls Rocks. What the hell?

  • Discover /etc/init.d/rocks-grub, which at boot modifies grub.conf to PXE boot next time, and at graceful shutdown reverses the change, allowing the machine to boot normally. The thinking is that if you have to power cycle a machine, you probably want to reinstall it anyhow.

  • Finally put this all together. Restart tracker, set all blades in one of the chassis' to reinstall. Pick a random couple of blades and fire up consoles. Power all the blades up. Installation fails with anaconda error, saying it can't find any more mirrors. What the hell?

  • eth0 is down on the front end; dmesg shows hundreds of "kernel: bnx2: eth0 NIC Copper Link is Down" messages starting approximately the time I power-cycled the blades.

I give up. I am going here tonight because my wife is a good person and is taking me there. And I am going to have much, and much-deserved, beer.

Tags: cluster hpc rocks

Bacula multi-tape restores while backups queued

I've got a tape library at work with two tape drives. Today, one of the drives was doing (full) backups and the second was free for a restore job. However, when that restore job ran, I got this error:

JobId 62397: Forward spacing Volume "000039" to file:block 7:0.
JobId 62397: Error: block.c:1016 Read error on fd=7 at file:blk 3:0 on device "Drive-0" (/dev/nst1). ERR=Input/output error.
JobId 62397: End of Volume at file 3 on device "Drive-0" (/dev/nst1), Volume "000039"
JobId 62397: Fatal error: acquire.c:72 Acquire read: num_writers=1 not zero. Job 62397 canceled.
JobId 62397: Fatal error: mount.c:844 Cannot open Dev="Drive-0" (/dev/nst1), Vol=000039
JobId 62397: End of all volumes.
JobId 62397: Error: Bacula cbs-01-dir 5.0.2 (28Apr10): 03-May-2011 12:09:20

The problem wasn't that it encountered the end of the volume -- the job spanned a number of volumes, so that was okay.

No, the problem was that after the restore job had run, a number of other regular backups had started. These were incrementals, and thus were unable to use the first drive. When the restore job ran into the EOM on the first volume, it appears to have released the drive -- at which point the incrementals started up and denied the use of the second drive to the restore job. The restore job promptly gave up and called it an error.

As I was in a hurry, I tried killing off the incrementals and re-running the restore job. This worked just fine. Arguably it's a bug, but I suspect I just need to tweak the priority for restore jobs instead.
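
If I do tweak the priority, I think it's just a matter of something like this in the restore Job resource in bacula-dir.conf (an untested sketch; Bacula runs lower-numbered priorities first, and the default is 10):

Job {
  Name = "RestoreFiles"
  Type = Restore
  # ...Client, FileSet, Storage and so on as before...
  Priority = 5     # run ahead of the default-priority backup jobs
}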

(Two entries in one day...woot!)

Tags: bacula backups

Clearing the trace flag

A couple of times now, I've run strace on a Java process and had the process fail to resume after I was done tracing. Running ps or top showed that the T (stopped) flag remained on the process. Running kill -SIGCONT <pid> started things up again.
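
In other words, something like this (the PID is made up):

$ ps -o pid,stat,cmd -p 12345    # STAT shows "T": stopped/traced
$ kill -SIGCONT 12345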

Tags:

I am too old for this

About 10.45 yesterday I noticed that my SSH connections to the servers in our server room had stopped, and I was unable to make any more. I checked Nagios on the machine by my desk (multiple Nagios FTW!) and found that it had noticed problems a few minutes before. I ran over to see what was going on.

After a few minutes of checking, I'd found:

  • our firewall machine seemed to be up just fine
  • but I couldn't ping anything: no DHCP lease, and not with a manually configured interface

I called IT Services and asked if there were problems; they said no, so I double-checked. Suddenly I could ping the firewall and other machines, but SSHing to them hung.

My guess at this point was LDAP problems. I connected monitor and keyboard to the machine hosting the LDAP server and found it responsive, but a CRAPTON of errors on eth1 (which Xen handily renames/clones as peth1):

peth0     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING NOARP  MTU:9000  Metric:1
          RX packets:1895496748 errors:1500778269 dropped:1505784776 overruns:0 frame:1500778269
          TX packets:340023247 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:186743473052 (173.9 GiB)  TX bytes:384744601794 (358.3 GiB)
          Interrupt:18 Memory:ec000000-ec012800

I didn't know what to make of this, so I replaced the cable to peth1, ran ifconfig down/up, and got the connection back -- at which point LDAP came back up, the machines started working, etc.

Okay, weird -- but at least it's working again. I went back to my desk to try and figure out what had happened. While I was doing that, I started losing connectivity to the machines in the server room for 30 seconds at a time. What the hell?

After that, frankly, it's a blur. I was there 'til 7.45pm and here's what I think was going on.

First, the Xen host was having big memory problems that affected its networking, and the networking of the VMs within it. I was seeing a crapton of these messages:

Apr 14 15:12:51 kernel: xen_net: Memory squeeze in netback driver.

This bug said it was fixed in CentOS 5.6 -- so I tried upgrading to that (I was at 5.5, so not a big jump). Nope. Then I saw a suggestion that the problem was in memory ballooning -- that the dom0 was sucking up all the memory for some reason. The solution was to add a "dom0_mem=" argument to the kernel line in Grub, ideally matching the dom0-min-mem setting in /etc/xen/xend-config.sxp. Unfortunately, I didn't realize that without units specified, Xen assumes bytes -- so I was specifying a max memory of 512 bytes, not megabytes.

This caused the machine to panic and reboot -- but because consoles were only available via serial port, and because the IPMI console wasn't working, I was unable to see it. I had to edit the Grub entries on the fly to remove those arguments from the kernel, see what was going on, and then set it correctly.
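
For reference, a working entry looks something like this (kernel versions and root device are placeholders; the important part is the explicit unit on dom0_mem):

title CentOS (xen)
        root (hd0,0)
        kernel /xen.gz-2.6.18-238.el5 dom0_mem=512M
        module /vmlinuz-2.6.18-238.el5xen ro root=/dev/VolGroup00/LogVol00
        module /initrd-2.6.18-238.el5xen.img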

After rebooting with a working memory limit, top showed that ksoftirqd/0 was taking up an enormous amount of CPU time -- 98% of one CPU. This was pretty much all due to eth0 interrupts. tcpdump showed that there was a lot of traffic on the management subnet, which the machine shouldn't have been seeing. I checked the switch and saw that the management vlan WAS on there as tagged (the normal, inside VLAN was default and untagged). I turned that off within the switch, rebooted the machine and things pretty much went back to normal.

All of that was doubly unfortunate because, of the four VMs on there, two are the only two LDAP servers in that room -- the third LDAP server is on another network, but it took a long time for the clients to fail over to it. This is piss-poor planning on my part.

As if that wasn't enough, another server's disk array disappeared, which caused MySQL to die and the website running on it to disappear. Turned out the machine had been booted with the wrong multipath drivers. When it had problems on one connection, the drive came back with a different device (/dev/sde1 instead of /dev/sdd1). This took a while to figure out, but I finally got it rebooted and the drive array back.

Now things were mostly back to normal -- except that the connection to the management VLAN seemed to be coming and going. This was shown both by nagios ("foo-ilom up! foo-ilom down!") and by good old-fashioned pings. A given ILOM/SP would respond to pings for 30 seconds, then go down; five minutes later it'd come back for 30 seconds, then disappear again. For the nth time: what the hell?

Then I remembered that, back when all this had begun, we'd been configuring a new cluster. Working on its switches, in fact, which were from a different vendor than our usual (package deal, dontcha know). I began to suspect that the problem might somehow lie there. I removed the two patch cables connecting the new switches to our network...and at last the management VLAN connection came back up and stayed up.

In all I was in the server room 'til 7.45pm last night. Part of it was spent reinstalling CentOS on a separate machine in hopes of at least getting an LDAP server up on it. I didn't stick around for that, as the VMs came back up fine, but that's definitely on the agenda.

Tags:

Checking Bacula exclusions

I came across this tip on an old posting to the Bacula mailing list. To determine if exclusions in a fileset are working, run these commands in bconsole:

@output some-file
estimate job=<job-name> listing level=Full
@output

The file will contain a list of files Bacula will include in the backup.

(Incidentally, I came across this while trying to figure out why my exclusions weren't working; turned out I needed to remove the trailing slash from the directory names in my Exclude section.)
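
In other words, something like this (paths made up; the point is the missing trailing slash):

FileSet {
  Name = "Home Dirs"
  Include {
    Options { signature = MD5 }
    File = /home
  }
  Exclude {
    File = /home/scratch    # not /home/scratch/
  }
}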

Tags: backups bacula toptip

Rocks Lessons Part 2 -- Torque, Maui and OpenMPI

Torque is a resource manager; it's an open source project with a long history. It keeps track of resources -- typically compute nodes, but it's "flexible enough to handle scheduling a conference room". It knows how many compute nodes you have, how much memory, how many cores, and so on.

Maui is the job scheduler. It looks at the jobs being submitted, notes what resources you've asked for, and makes requests of Torque. It keeps track of what work is being done, needs to be done, or has been completed.

MPI stands for "Message Passing Interface". Like BLAS, it's a standard with different implementations. It's used by a lot of HPC/scientific programs to exchange messages between processes -- often but not necessarily on separate computers -- related to their work.

MPI is worth mentioning in the same breath as Torque and Maui because of mpiexec, which is part of OpenMPI; OpenMPI is a popular open-source implementation of MPI. mpiexec (aka mpirun, aka orterun) lets you launch processes in an OpenMPI environment, even if the process doesn't require MPI. IOW, there's no problem running something like "mpiexec echo 'Hello, world!'".

To focus on OpenMPI and mpiexec: you can run n copies of your program by using the "-np" argument. Thus, "-np 8" will run 8 copies of your program...but they will all run on the machine you invoke mpiexec on:

$ mpiexec -np 8 hostname
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org

This isn't always useful -- why pay big money for all this hardware if you're not going to use it? -- so you can tell it to run on different hosts:

$ mpiexec -np 8 -host compute-0-0,compute-0-1 hostname
compute-0-0.local
compute-0-0.local
compute-0-1.local
compute-0-0.local
compute-0-1.local
compute-0-1.local
compute-0-0.local
compute-0-1.local

And if you're going to do that, you might as well give it a file to read, right?

$ mpiexec -np 8 -hostfile /opt/openmpi/etc/openmpi-default-hostfile hostname
compute-0-0.local
compute-0-0.local
compute-0-1.local
compute-0-0.local
compute-0-1.local
compute-0-1.local
compute-0-0.local
compute-0-1.local

That file is where Rocks sticks the hostfile, but it could be anywhere -- including in your home directory, if you decide that you want it to run on a particular set of machines.

However, if you're doing that, then you're really setting yourself up as the resource manager. Isn't that Torque's job? Didn't we set all this up so that you wouldn't have to keep track of what machine is busy?

So OpenMPI can work with Torque:

  1. How do I run jobs under Torque / PBS Pro?

The short answer is just to use mpirun as normal.

Open MPI automatically obtains both the list of hosts and how many processes to start on each host from Torque / PBS Pro directly. Hence, it is unnecessary to specify the --hostfile, --host, or -np options to mpirun. Open MPI will use PBS/Torque-native mechanisms to launch and kill processes ([rsh] and/or ssh are not required).

Whee! So easy! Except that Rocks does not compile OpenMPI with Torque support!

Because the Rocks project is kind of a broad umbrella, with lots of sub-projects underneath, the Torque roll is separate from the OpenMPI roll. Besides, installing one doesn't mean you'll install the other, so it may not make sense to build OpenMPI that way.

The fine folks at Rocks asked the fine folks at OpenMPI and found a way around this: have every job submitted to Torque/Maui that uses MPI source /opt/torque/etc/openmpi-setup.sh. While not efficient, it works; the recommended way, though, is to recompile OpenMPI with Torque installed so that it knows about Torque.
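
So a job script under that scheme ends up looking something like this (a minimal sketch; the job name, node counts and program are made up):

#!/bin/bash
#PBS -N hello-mpi
#PBS -l nodes=2:ppn=4

# source the setup script the Torque roll provides, then launch as usual
. /opt/torque/etc/openmpi-setup.sh
mpiexec ./my_mpi_program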

To me, this makes the whole Rocks installation less useful, particularly since it didn't seem terribly well documented. To be fair, it is there in the Torque roll documentation:

Although OpenMPI has support for the torque tm-interface (tm=taskmanager) it is not compiled into the library shipped with Rocks (the reason for this is that the OpenMPI build process needs to have access to libtm from torque to enable the interface). The best workaround is to recompile OpenMPI on a system with torque installed. Then the mpirun command can talk directly to the batch system to get the nodelist and start the parallel application using the torque daemon already running on the nodes. Job startup times for large parallel applications is significantly shorter using the tm-interface than using ssh to start the application on all nodes.

So maybe I should just shut my mouth.

In any event, I suspect I'll end up recompiling OpenMPI in order to get it to see Torque.

Tags: rocks hpc

Rocks Lesson Part 1 -- BLAS

There's a lot to clusters. I'm learning that now.

At $WORK, we're getting a cluster RSN -- rack fulla blades, head node, etc etc. I haven't worked w/a cluster before so I'm practicing with a test one: three little nodes, dual core CPUs, 2 GB memory each, set up as one head node and two compute nodes. I'm using Rocks to manage it.

Here are a few stories about things I've learned along the way.

BLAS

You find a lot of references to BLAS when you start reading software requirements for HPC, and not a lot explaining it.

BLAS stands for "Basic Linear Algebra Subprograms"; the original web page is here. Wikipedia calls it "a de facto application programming interface standard for publishing libraries to perform basic linear algebra operations such as vector and matrix multiplication." This is important to realize because, as the article suggests, common usage of the term refers to the API more than to anything else; there's the reference implementation, but it's not really used much.

As I understand it -- and I invite corrections -- BLAS chugs through linear algebra and comes up with an answer at the end. Brute force is one way to do this sort of thing, but there are ways to speed up the process; these can make a huge difference in the amount of time it takes to do some calculation. Some of these are heuristics and algorithms that allow you to search more intelligently through the search space. Some are ways of compiling or writing the library routines differently, taking advantage of the capabilities of different processors to let you search more quickly.

There are two major open-source BLAS implementations:

  • The Goto BLAS library is a hand-optimized BLAS implementation that, by all accounts, is very fast. It's partly written in assembler, and the guy who wrote it basically crafted it the way (I think) Enzo Ferrari crafted cars.

  • ATLAS is another BLAS implementation. The ATLAS home page says "it provides C and Fortran77 interfaces to a portably efficient BLAS implementation, as well as a few routines from LAPACK." As noted in the articles attached to this page, ATLAS tries many, many different searches for a solution to a particular problem. It uses CPU capabilities to do these searches efficiently.

As such, compilation of ATLAS is a big deal, and the resulting binaries are tuned to the CPU they were built on. Not only do you need to turn off CPU throttling, but you need to build on the CPU you'll be running on. Pre-built packages are pretty much out.

ATLAS used to be included in the HPC roll of the Rocks 4 series. Despite irritatingly out-of-date information still floating around, this has not been the case for a while.

LAPACK "is written in Fortran 90 and provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems." It needs a BLAS library. From the FAQ:

Why aren’t BLAS routines included when I download an LAPACK routine?

It is assumed that you have a machine-specific optimized BLAS library already available on the architecture to which you are installing LAPACK. If this is not the case, you can download a Fortran77 reference implementation of the BLAS from netlib.

Although a model implementation of the BLAS is available from netlib in the blas directory, it is not expected to perform as well as a specially tuned implementation on most high-performance computers -- on some machines it may give much worse performance -- but it allows users to run LAPACK software on machines that do not offer any other implementation of the BLAS.

Alternatively, you can automatically generate an optimized BLAS library for your machine, using ATLAS (http://www.netlib.org/atlas/).

(There is an RPM called "blas-3.0" available for rocks; given the URL listed (http://www.netlib.org/lapack/), it appears that this is the model implementation listed above. This version is at /usr/lib64/libblas.so*, and is in ldconfig.)

Point is, you'll want a BLAS implementation, but you've got two (at least) to choose from. And you'll need to compile it yourself. I get the impression that the choice of BLAS library is something that can vary depending on religion, software, environment and so on...which means you'll probably want to look at something like modules to manage all this.
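
If modulefiles are set up for each build, switching between them would look something like this (the module names are hypothetical; they depend entirely on how you organize things):

$ module avail                     # list the BLAS builds on offer
$ module load blas/goto2-nehalem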

Tomorrow: Torque, Maui and OpenMPI.

Tags: rocks hpc

RDMA, iWARP and Linux

Fell down a rabbit hole today when I was looking at a data sheet for the Broadcom 5709c chipset. "RDMA over TCP (iWARP) - RDMAC 1.0 compliant". Huh?

This blog has a good overview of RDMA:

The rationale for RDMA is laid out in great detail in RFC 4297, but the basic idea is that allowing network messages to carry information about where they should be received and allowing the NIC to place the data directly in that buffer allows fundamentally better performance. [...]

With RDMA and iSCSI Extensions for RDMA (iSER, which is RFC 5046), the target can send the data in response to a read command and have it placed directly in the receive buffer on the initiator, which saves the copy and uses 3x less memory bandwidth (which is huge if the data is running at 10Gb/sec).

But there's more in that post, like this bit of drama on the LKML from 2007:

How about we just remove the RDMA stack altogether? I am not at all kidding. If you guys can't stay in your sand box and need to cause problems for the normal network stack, it's unacceptable. We were told all along the if RDMA went into the tree none of this kind of stuff would be an issue.

These are exactly the kinds of problems for which people like myself were dreading. These subsystems have no business using the TCP port space of the Linux software stack, absolutely none.

After TCP port reservation, what's next? It seems an at least bi-monthly event that the RDMA folks need to put their fingers into something else in the normal networking stack. No more.

I will NACK any patch that opens up sockets to eat up ports or anything stupid like that.

From Dave Miller, no less. Of course, that was four years ago (shit!). Now where do things stand?

Well, there's this announcement from IBM of a Linux iWARP driver and user library. And there's NFS over RDMA (is that the right term?) in the kernel now, including iWARP support.

I'm not sure if this'll be useful at work, but it's interesting to read about.

Tags:

Cfengine 3: copying config files for services

At $work I'm migrating slowly to Cfengine 3. One of the attractions is the ability to do what this page shows: loop over lists in a Cf-ish kind of way.

Here's the first bundle. (It's pretty much stolen from that page, but customized for my environment.) It tells you some basic details about the config file, the process name and the restart command for different daemons:

bundle common services {
  vars:
    redhat|centos::
      "cfg_file_prefix" string => "centos/5";

      "cfg_file[ssh]" string => "/etc/ssh/sshd_config";
      "daemon[ssh]"   string => "sshd";
      "start[ssh]"    string => "/sbin/service sshd restart";
      "enable[ssh]"   string => "/sbin/chkconfig sshd on";

      "cfg_file[iptables]" string => "/etc/sysconfig/iptables";
      "start[iptables]"    string => "/sbin/service iptables restart";
      "enable[iptables]"       string => "/sbin/chkconfig iptables on";
}

Here's the bundle that copies config files and restarts the daemon if necessary:

bundle agent fix_service(service) {
  files:
    "$(services.cfg_file[$(service)])"
      copy_from => secure_cp("$(g.masterfiles)/$(services.cfg_file_prefix)/$(services.cfg_file[$(service)])", "$(g.masterserver)"),
      perms => mog("0600","root","root"),
      classes => if_repaired("$(service)_restart"),
      comment => "Copy a stock configuration file template from repository";

  processes:
    "$(services.daemon[$(service)])"
      comment => "Check that the server process is running, and start if necessary",
      restart_class => canonify("$(service)_restart");

  commands:
    "$(services.start[$(service)])"
      comment => "Method for starting this service",
      ifvarclass => canonify("$(service)_restart");

    "$(services.enable[$(service)])"
      comment => "Method for enabling this service",
      ifvarclass => canonify("$(service)_restart");
}

And here's the loop that puts it all together:

bundle agent redhat {
  vars:
    "service" slist => { "ssh", "iptables" };

  methods:
    "any" usebundle => fix_service("$(service)"),
      comment => "Make sure the basic application services are running";
}

I ran into a problem with this, though: it would always, without fail, restart iptables even though no config file had been copied. The problem was with the process check: there's no process to check for with iptables. And from what I can tell, when the processes stanza was asked to check for a non-existent variable, it checked for the literal string $(services.daemon[$(service)]) -- that is, dollar-bracket-s-e-r-v-.... Since there was no such thing, it decided it needed restarting.

The way around this was to add this variable to the services bundle (the one that has all the info about the daemons):

"daemon[iptables]" string => "cf_null";

I also had to modify the processes stanza:

processes:
  "$(services.daemon[$(service)])"
    comment => "Check that the server process is running, and start if necessary",
    restart_class => canonify("$(service)_restart"),
    ifvarclass => canonify("$(services.daemon[$(service)])");

That ifvarclass check on the last line says to run iff there is a value for daemon. cf_null is a NULL value special to cfengine. Since the check fails for iptables, the process check isn't run and we only restart if we copy over a new config file.

Tags: cfengine

Windows 7 freezes under Apple Boot Camp

Today I installed Windows 7 Pro on a Macbook Pro with Boot Camp. I ran into two problems I figured I should document:

  • First, Boot Camp would not proceed past the offer to burn a CD with drivers for the disk. It's a bug, and I was able to ignore it; the networking and graphics came up fine afterward.

  • Second, after installing Windows Defender and rebooting, Windows would freeze at the login screen. This too is a known problem, with lots of suggestions on how to fix it. What worked for me was booting into safe mode with the command prompt (every other safe mode would freeze), then disabling the Windows Defender service with MMC. After that I was able to boot normally; I then set the service back to "Automatic Start", rebooted, and have had no further trouble.

Tags: windows debugging

First light

This past Sunday I picked up a reflector on Craigslist. It's an Omcon 811SE, a 114mm f/9 Newtonian. I hadn't heard of the name before, but a quick search showed that they'd been built in the '90s here in Vancouver, and at least one CloudyNights.com member has one listed in their sig.

It seemed like it was in good condition. It came with a pretty sturdy wooden tripod with an alt-az friction mount, a 6x30 finder scope, and two eyepieces: an 18mm Kellner (50X), and a 7.5mm Plossl (120X). At $40 (knocked down from $50!), it was too good a deal to pass up.

...And then the clouds came. OF COURSE. I spent my time adjusting the finder and pointing it out the window, cursing water vapour under my breath.

Finally, we got some clear-ish skies; there were lingering clouds, but they seemed thin. I took it out to a park within walking distance of my house. The skies are by no means dark, but it's easy to get there. I set up the scope, popped in the 18mm, pointed it at a star, focussed and...hey, a star! Definitely reassuring, since I wasn't sure about how good the collimation was...it seemed okay to me, based on what I'd read, but the mirror has no centre mark and it was hard to be sure without actually testing it.

Finally, Orion came out, so I pointed it at M42...

...WOW.

WOW.

I'd only ever looked at it before in binoculars and my Galileoscope (the only other scope I've had since people started calling me a grownup), and...well, frankly it didn't seem like all that. I mean, it was nice, but nothing spectacular. But this...THIS was spectacular.

I swapped in the 7.5, and incredibly it seemed even better. It was fainter, of course, but the narrower FOV seemed to focus my attention more. I spent some time letting it drift across the eyepiece, and began to notice dark spots, lanes and such. It was amazing. I couldn't say I saw any colour, but I definitely know now what the fuss is about.

I decided to try the Pleiades, and even at 50X -- such a narrow FOV compared to binoculars! -- it was astonishing. I just kept repeating "Oh wow" over and over again.

Finally, I decided to try splitting Eta Cassiopeiae. I missed it at 50X, found it at 120X, and then saw it on a second look at 50X. Can't say I saw any colour difference between the two.

(I'd been REALLY hoping to see Jupiter, but the thrice-cursed clouds hid it.)

Now, the scope isn't perfect. The mounting definitely needs to be tweaked to make it easier to move (while not actually falling down), and pointing it at the zenith is going to be difficult. The &?*#! finderscope kept dewing over (first priority is to get my kids to make me a dew shield; they're 4 and 2, so it'll be a great craft project for them :-)). And even w/o much experience I can tell the eyepieces aren't great...in the 18mm, stars get fuzzy or elongated at the edge of the FOV, and the focus on the 7.5mm seems mushy/hard to achieve. (Though I suppose that could be the mirror...I wouldn't know.)

But oh man, oh man, oh man...what a night. I couldn't be happier with my new scope.

Tags: astronomy

Xmas Maintenance 2010: Lessons learned

Xmas vacation is when I get to do big, disruptive maintenance with a fairly free hand. Here's some of what I did and what I learned this year.

Order of rebooting

I made the mistake of rebooting one machine first: the one that held the local CentOS mirror. I did this thinking that it would be a good guinea pig, but then other machines weren't able to fetch updates from it; I had to edit their repo files. Worse, there was no remote console on it, and no time (I thought) to take a look.

  • Lesson: Don't do that.

Automating patching

Last year I tried getting machines to upgrade using Cfengine like so:

centos.some_group_of_servers.Hr14.Day29.December.Yr2009::
          "/usr/bin/yum -q -y clean all"
          "/usr/bin/yum -q -y upgrade"
          "/usr/bin/reboot"

This didn't work well: I hadn't pushed out the changes in advance, because I was paranoid that I'd miss something. When I did push it out, all the machines hit the cfserver at (more or less) the same time and didn't get the updated files because the server was refusing connections. I ended up doing it by hand.

This year I pushed out the changes in advance, but it still didn't work because of the problems with the repo. I ran cssh, edited the repos file and updated by hand.

This worked okay, but I had to do the machines in separate batches -- some needed to have their firewall tweaked to let them reach a mirror in the first place, some I wanted to watch more carefully, and so on. That meant going through a list of machines, trying to figure out if I'd missed any, adding them by hand to cssh sessions, and so on.

  • Lesson: I need a better way of doing this.
  • Lesson: I need a way to check whether updates are needed.

I may need to give in and look at RHEL, or perhaps func or better Cfengine tweaking will do the job.
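
For the "do I need updates?" question, something as simple as this might be enough (yum exits 0 when there is nothing to do and 100 when updates are pending):

yum -q check-update > /dev/null; echo $?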

Staggering reboots

Quick and dirty way to make sure you don't overload your PDUs:

sleep $(expr $RANDOM / 200 ) && reboot

Remote consoles

Rebooting one server took a long time because the ILOM was not working well, and had to be rebooted itself.

  • Lesson: I need to test the SP before doing big upgrades; the simplest way of doing this may just be rebooting them.

Upgrading the database servers w/the 3 TB arrays took a long time: stock MySQL packages conflicted with the official MySQL rpms, and fscking the arrays takes maybe an hour -- and there's no sign of life on the console while you're doing it. Problems with one machine's ILOM meant I couldn't even get a console for it.

  • Lesson: Again, make sure the SP is okay before doing an upgrade.
  • Lesson: Fscking a few TB will take an hour with ext3.
  • Lesson: Start the console session on those machines before you reboot, so that you can at least see the progress of the boot messages up until the time it starts fscking.
  • Lesson: Might be worth editing fstab so that they're not mounted at boot time; you can fsck them manually afterward. However, you'll need to remember to edit fstab again and reboot (just to make sure)...this may be more trouble than it's worth.

OpenSuSE

Holy mother of god, what an awful time this was. I spent eight hours on upgrades for just nine desktop machines. Sadly, most of it was my fault, or at least bad configuration:

  • Two of the machines were running OpenSuSE 11.1; the rest were running 11.2. The latter lets you upgrade to the latest release from the command line using "zypper dist-upgrade"; the former does not, and you need to run over with a DVD to upgrade them.
  • By default, zypper fetches a package, installs it, then fetches the next. I'm not certain, but I think that means there's a lot more TCP overhead and less chance to ratchet up the speed. It sure as hell seemed slow downloading 1.8GB x 9 machines this way.
  • Graphics drivers: awful. Four different versions, and I'd used the local install scripts rather than creating an RPM and installing that. (Though to be fair, that would just rebuild the driver from scratch when it was installed, rather than do something sane like build a set of modules for a particular kernel.) And I didn't figure out where the uninstall script was 'til 7pm, meaning lots of fun trying to figure out why the hell one machine wouldn't start X.

  • Lesson: This really needs to be automated.

  • Lesson: The ATI uninstall script is at /usr/share/ati/fglrx-uninstall.sh. Use it.

  • Lesson: Next time, uninstall the driver and build a goddamn RPM.

  • Lesson: A better way of managing xorg.conf would be nice.

  • Lesson: Look for prefetch options for zypper. And start a local mirror.

  • Lesson: Pick a working version of the driver, and commit that fucker to Subversion.

Special machines

These machines run some scientific software: one master, three slaves. When the master starts up at boot time, it tries to SSH to the slaves to copy over the binary. There appears to be no, or poor, rate throttling; if the slaves are not available when the master comes up, you end up with the following symptoms:

  • Lots of SSH/scp processes on the master
  • Lots of SSH/scp processes on the slave (if it's up)
  • If you try to run the slave binary on the slave, you get errors like "lseek(3, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)" (from strace) or "ESPIPE text file busy" (from running it in the shell).

The problem is that umpty scp processes on the slave are holding open the binary, and the kernel gets confused trying to run it.

  • Lesson: Bring up the slaves first, then bring up the master.
  • Lesson: There are lots of interesting and obscure Unix errors.

I also ran into problems with a duff cable on the master; confusingly, both the kernel and the switch said it was still up. This took a while to track down.

  • Lesson: Network cables are surprisingly fragile at the connection with the jack.

Virtual Machines

It turned out that a couple of my kvm-based VMs did not have jumbo frames turned on. I had to use virt-manager to shut down the machines, switch their network devices to virtio, then reboot. However, kudzu on the VMs then saw these as new interfaces and did not configure them correctly. This caused problems because the machines were LDAP clients and hung when the network was unavailable.

  • Lesson: To get around this, go into single-user mode and copy /etc/sysconfig/network-scripts/ifcfg-eth0.bak to ifcfg-eth0.
  • Lesson: Be sure you're monitoring everything in Nagios; it's a sysadmin's regression test.

Tags: work cfengine jumboframes rant toptip mysql

Org Mode + The Cycle + DayTimer + RT

In the spirit of Chris Siebenmann, and to kick off the new year, here's a post that's partly documentation for myself and partly an attempt to ensure I do it right: how I manage my tasks using Org Mode, The Cycle, my Daytimer and Request Tracker.

  • Org Mode is awesome, even more awesome than the window manager. I love Emacs and I love Org Mode's flexibility.

  • Tom Limoncelli's "Time Management for System Administrators." Really, I shouldn't have to tell you this.

  • DayTimer: because I love paper and pen. It's instant boot time, and it's maybe $75 to replace (and durable) instead of $500 (and delicate). And there is something so satisfying about crossing off an item on a list; C-c C-c just isn't as fun.

  • RT: Email, baby. Problem? Send an email. There's even rt-liberation for integration with Emacs (and probably Org Mode, though I haven't done that yet).

So:

  • Problems that crop up, I email to RT -- especially if I'm not going to deal with them right away. This is perfect for when you're tackling one problem and you notice something else non-critical. Thus, RT is often a global todo list for me.

  • If I take a ticket in RT (I'm a shop of one), that means I'm planning to work on it in the next week or so.

  • Planning for projects, or keeping track of time spent on various tasks or for various departments, is kept in Org Mode. I also use it for things like end-of-term maintenance lists. (I work at a university.) It's plain text, I check it into SVN nightly, and Emacs is The One True Editor.

  • My DayTimer is where I write down what I'm going to do today, or that appointment I've got in two weeks at 3pm. I carry it everywhere, so I can always check before making a commitment. (That bit sampled pretty much directly from TL.)

  • Every Monday (or so; sometimes I get delayed) I look through things to see what has to be done:

    • Org mode for projects or next-step sorta things
    • RT for tickets that are active
    • DayTimer for upcoming events

  • I plan out my week. "A" items need to be done today; "B" items should be done by the end of the week; "C" items are done if I have time.

  • Once every couple of months, I go through RT and look at the list of tickets. Sometimes things have been done (or have become irrelevant) and can be closed; sometimes they've become more important and need to be worked.

  • I try to plan out what I want to get done in the term ahead at the beginning of the term, or better yet just before the term starts; often there are new people starting with a new term, and it's always a bit hectic.

Tags: work