Linpack: a newbie's view

(Keep in mind: I don't know what I'm talking about. I've written this down because I've found very little that seems to explain this to a newbie. I'm probably wrong; if you know that I'm wrong, leave a comment.)

Linpack is "a software library for performing numerical linear algebra on digital computers". It has been superceded for that purpose by Lapack; now it's mainly used for benchmarking. The latest incarnation, used for scores on the Top 500 list, is called hpl (High Performance Linpack); it uses MPI for communication.

Why do you need to know this?

If you search for "linpack score", you'll find an astonishing number of people posting scores for their phone. If you're looking for information on maxing your score on your HTC Dream, I can't help you.

If you have a cluster, then near as I can tell there are two reasons to do this:

  1. Get high scores.
  2. Exercise your hardware and see what happens.

The first is the sort of measuring contest that you hope gets you into the Top 500 list. It probably affects your funding, and may affect your continued employment.

The second uses Linpack as a shakedown (did the test work? did anything break?), or as a way of benchmarking performance. Sometimes people will use Linpack scores as a baseline; they'll make tweaks to a cluster (add more memory, change MTU, turn off more daemons, twiddle BIOS settings, etc) and see what the effect is. Linpack is not perfect for this; it stresses CPU and FPU, possibly memory and network, and doesn't really check disk, power usage or other things. But it's a start, it's familiar, and it boils down to a single number.

(HAH.)

So how high can it go?

The theoretical peak score is:

CPU GHz x Flops/Hz x Cores/node x nodes

(Cite). Flops/Hz is CPU-specific; for the Nehalems, at least, it's 4 Flops/Hz. Thus, for the cluster I'm working on, the peak score is:

2.67 GHz x 4 Flops/Hz x 24 Cores/node x 35 nodes

(Note the assumption of HyperThreading turned on in the 24 cores/node figure; while I've got that turned on now, I should probably turn it off -- at least for running Linpack.)

Anyhow, that comes out to about 8970 GFlops, or almost 9 TFlops. By contrast, Tianhe-1A, the top entry in the November 2010 Top 500 list, has an Rpeak of 4,701,000 GFlops -- so 4701 TFlops, or about 4.7 Petaflops. So there's no need to buy me that "I made the Top 500 and all I got was this lousy t-shirt" t-shirt yet.
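
To make that arithmetic easy to replay, here's a quick back-of-the-envelope sketch in Python; the numbers are just the ones above, so swap in your own clock speed, Flops/Hz and core counts:

ghz = 2.67            # CPU clock in GHz
flops_per_hz = 4      # Nehalem: 4 Flops/Hz
cores_per_node = 24   # includes HyperThreading, as noted above
nodes = 35

rpeak = ghz * flops_per_hz * cores_per_node * nodes
print "Theoretical peak: %.0f GFlops (%.1f TFlops)" % (rpeak, rpeak / 1000)
# Theoretical peak: 8971 GFlops (9.0 TFlops)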

Of course, that's a theoretical peak, and a lot depends on the way your system is configured. For example, this post to the Rocks-discuss mailing list (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2010-May/047136.html) says:

FYI, I've got around 84-85% on a cluster with Infiniband and OpenMPI, but some people told me they get better results.

That's 85% of the theoretical max. And it depends on Infiniband. Jeezum Crow.

Configuration file tuning

Linpack uses a configuration file named HPL.dat. The format is a little non-obvious, but is documented here. Here's a sample file, as generated by cbench (about which more later):

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
3            # of problems sizes (N)
108304 346573 368233  # N -- fantastically important; see ahead
1            # Number of block sizes
80 112 96    # Block sizes
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
25           # Ps
26           # Qs
8.0          threshold
1            # of panel fact
0 2 1        PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4 2          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1 2 0        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0 3 1 2 4    BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
256          swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
0            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

(Not the best format; sorry.)

P and Q, multiplied, should be the number of cores you want to use (thanks, Tim Doug) -- and so need to match the parameters you pass to Torque, or whatever batching system you're using.
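
As a small illustration of that rule, here's a Python sketch that lists the (P, Q) grids for a given core count. (Keeping the grid as close to square as possible is common HPL advice, not something from Tim Doug's post, so treat that part as folklore.)

cores = 650   # e.g. the 25 x 26 grid in the sample HPL.dat above
pairs = [(p, cores // p) for p in range(1, int(cores ** 0.5) + 1) if cores % p == 0]
for p, q in pairs:
    print "P=%d Q=%d (P*Q=%d)" % (p, q, p * q)
# The last pair printed, P=25 Q=26, is the most nearly square grid.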

Block sizes: Tim Doug says: "128 works well for me. Others suggest 80, 160, or 256. Experiment." cbench uses 80, 112 and 96; I think this is just how they do things. I see a peak (in very early tests) around 150.

N: The biggie. This posting to the Rocks-discuss mailing list gives an excellent overview of how N works in the Linpack test:

If you really want to stress your cluster, you want to have your matrix size fill approximately 80% of memory. For an NxN matrix, you consume N*N*8 bytes. If you have a 16-node cluster, for example with 8 GB/memory/node, then you have 16*8 GB*0.80 = 102 GB. N would be approximately sqrt(102e9/8) ~ 113,000.

That's a pretty big matrix and takes O(113,000^3) or 1.4 Quadrillion Floating OPs. If your nodes were 8 core, 2.5GHz, 4 Flops/cycle, then one would expect this matrix to factor (at reasonable computational efficiency) in the 30-90 minute range. The exact time depends on efficiency, the constant on the O(n^3) term and the actual speed of your processors and network.

HPL will allow you to set up various matrix sizes. Set up something that will compute quickly, e.g. a 1000x1000 matrix, to verify that everything is happy, then step through some sizes that will take 1-5 minutes to factor; this will allow you to calibrate the time you expect the full load to run. Remember each doubling of matrix size results in 8X the number of floating ops. You get more efficiency as you get larger (more computation to communication), but it starts to level off pretty quickly. For most interconnects, using ~20% of memory is usually a decent indicator of ultimate system performance; if 20% takes 1 minute to actually compute, you expect the 80% run (8X) to take about an hour.

Full machine LINPACK runs can take many hours to run.

(It's worth emphasizing that Linpack really does take a long time to run with large N; this discussion shows how to start small and ramp up your Linpack tests as you gain confidence.)
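
Here's that N arithmetic as a minimal Python sketch, using the same example as the quote (16 nodes with 8 GB each, filling 80% of memory):

import math

nodes = 16
gb_per_node = 8
fill = 0.80     # fraction of total memory the matrix should occupy

total_bytes = nodes * gb_per_node * 1e9
n = int(math.sqrt(total_bytes * fill / 8))   # an NxN matrix of doubles needs N*N*8 bytes
print "N ~ %d" % n
# N ~ 113137 -- roughly the 113,000 from the quote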

cbench, which I mentioned above, is a suite of programs that are meant to exercise and benchmark a cluster. It's really quite excellent, but as with a lot of things in HPC the documentation isn't as explicit as it could be. For example, the N figures above are:

  • calculated for the amount of memory I have in my cluster (35 nodes * 48 GB/node), and
  • designed to fill 25%, 80% and 85% of memory.

Thus it lines up well with the approach outlined above: ~20% for a short run (good for a ballpark figure) and 80-85% for longer runs (to really stress things).

So if you run Linpack a bunch of times and tweak the parameters, you'll see different results. This page discusses why:

The parallel solution of a system of linear equations requires some communication between the processors. To measure the loss of efficiency due to this communication, we solved systems of equations of varying size on a varying number of processors. The general rule is: larger N means more work for each CPU and less influence of communication. As you can see from Fig. 1, a 4-CPU setup comes very close to the single CPU peak performance of 528 Mflops. This indicates that the solver that works in HPL is not significantly worse than ATLAS. The relative speed per CPU decreases with increasing number of CPUs, however.

The problem size N is limited by the total memory. Tina has 512 MByte per node, i.e. each node can hold at most an 8192x8192 matrix of double precision floats. In practice, the matrix has to be smaller since the system itself needs a bit of memory, too. If both CPUs on a node are operating, the maximum size reduces to 5790x5790 per CPU. To minimize the relative weight of communication, the memory load should be as high as possible on each node. In Fig. 2 you can see how the effective speed increases with increasing load factor. A load factor of 1 means that 256 MByte are required on each node to hold the NxN coefficient matrix.

[...]

With all 144 CPUs, communication becomes the major bottleneck. The current performance of 41 Gflops scales down to 284 Mflops/CPU [as compared to 528 MFlops for N=5000 on a single 4-CPU system]. The CPUs seem to spend almost half of their time chatting with each other...
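
That last line checks out: 284 Mflops/CPU against the 528 Mflops single-CPU figure is roughly 54%, i.e. nearly half the time lost to communication. A trivial check, with the numbers taken straight from the quote:

per_cpu = 284.0   # Mflops/CPU with all 144 CPUs
single = 528.0    # Mflops on a single CPU
print "Relative efficiency: %.0f%%" % (100 * per_cpu / single)
# Relative efficiency: 54%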

Really, as someone else said, it's a black art. There are a ton of papers out there on optimizing Linpack parameters. There's even -- and I am crapping you negative -- a software project called ga-linhack that aims to "Develop a complete genetic algorithm tool set for determining optimal parameters for Linpack runs." Because as they say:

To most cluster engineers (the authors included) the tuning explanations of the hpl parameters yield little clue as to the underlying effect of varying these parameters. Not everyone can take a graduate mathematics course in advanced linear algebra in their free time.

Testify!

Why are we doing this again?

This page (mentioned above) also has this quote:

After this, look for the top 8 or 16 results, and refine the config file to use only the parameters that produced these results.

...which for me brought up a lot of questions, like:

  • Why are we doing this?
  • Do we want a high score or do we want to stress the system?
  • How compatible are those two goals?
  • How much fiddling with the test parameters is "morally acceptable", for lack of a better term?
  • Which results do you pay attention to?

I'm still figuring all this out.

Other resources

Tags: cluster

R.I.P., all of you

On the music player this morning, I heard:

All such good music.

Tags:

Torque problems with Rocks

I've been puttering away at work getting the cluster going. It's hard, because there are a lot of things I'm having to learn on the go. One of the biggest chunks is Torque and Maui, and how they interact with each other and Rocks as a whole.

For example: today I tried submitting a crapton of jobs all at once. After a while I checked the queue with showq (a Maui command; not to be confused with qstat, which is Torque) and found that a lot of jobs were listed as "Deferred" rather than "Idle". I watched, and the idle ones ran; the deferred ones just stayed in place, even after the list of running jobs was all done.

At first I thought this might be something to do with fairness. There are a lot of knobs to twiddle in Maui, and since I hadn't looked at the configuration after installation I wasn't really sure what was there. But near as I could tell, there wasn't anything happening there; the config file for Maui was empty, and I couldn't seem to find any mention of what the default settings were. I followed the FAQ and ran the various status commands, but couldn't really see anything obvious there.

Then I tried looking in the Torque logs (/opt/torque/server_logs), and found this:

06/08/2011 14:42:09;0008;PBS_Server;Job;8356.example.com;send of job to compute-3-2 failed error = 15008
06/08/2011 14:42:09;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Access from host not allowed, or unknown host (15008) in send_job, child failed in previous commit request for job 8356.example.com

And on compute-3-2 (/opt/torque/mom_logs)

06/08/2011 14:42:01;0080;   pbs_mom;Req;req_reject;Reject reply code=15008(Access from host not allowed, or unknown host MSG=request not authorized), aux=0, type=QueueJob, from PBS_Server@example.local

That's weird. I ran rocks sync config out of superstition, but nothing changed. I found a suggestion that it might be a bug in Torque, and to run momctl -d to see if the head node was in the trusted client list. It was not. I tried running that command on all the nodes (sudo rocks run host compute command="momctl -d3 | grep Trusted | grep 10.1.1.1"); turned out that only 10 were. What the hell?

I'm still not sure exactly where this gets set, but I did notice that /opt/torque/mom_priv/config listed the head node as the server, and was identical on all machines. On a hunch, I tried restarting the pbs service on all the nodes; suddenly they all came up. I submitted a bunch more jobs, and they all ran through -- none were deferred. And running momctl -d showed that, yes, the head node was now in the trusted client list.

Thoughts:

  • None of this was shown by Ganglia (which just monitors load) or showq (which is a Maui command; the problem was with Torque).

  • Doubtless there were commands I should've been running in Torque to show these things.

  • While the head node is running a syslog server and collects stuff from the client nodes, Torque logs are not among them; I presume Torque is not using syslog. (Must check that out.)

  • I still don't know how the trusted client list is set. If it's in a text file, that's something that I think Rocks should manage.

  • I'm not sure if tracking down the problem this way is exactly the right way to go. I think it's important to understand this, but I suspect the Rocks approach would be "just reboot or reinstall". There's value to that, but I intensely dislike not knowing why things are happening and sometimes that gets in my way.

Tags: cluster rocks

How to print character arrays in Python

I've come across this problem a number of times, so here's a reminder. When printing in Python, occasionally I'll end up with something like this:

>>> print "foo=%s" % row["Name"]
foo=array('c', 'bar')

The solution is to use the .tostring() method:

>>> print "foo=%s" % row["Name"].tostring()
foo=bar
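
For context, here's a self-contained reproduction; the row dict is made up, but the point is that the value is an array.array of type 'c' (Python 2), which is what prints that way:

from array import array

row = {"Name": array('c', 'bar')}   # hypothetical row, standing in for whatever the DB layer returned

print "foo=%s" % row["Name"]             # foo=array('c', 'bar')
print "foo=%s" % row["Name"].tostring()  # foo=bar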

Details here.

Tags: python

Uh, what?

From The CBC:

The Vancouver Canucks have skated to within a victory from advancing to their first Stanley Cup final since 1994 after they exhibited their superiority in a bizarre special teams battle with the San Jose Sharks on Sunday.

Tags:

Memo to Canadians

Memo to Canadians: your government will throw you under a bus if they feel like it.

Quote:

The Canadian Security Intelligence Service, Canada's principal intelligence agency, routinely transmits to U.S. authorities the names and personal details of Canadian citizens who are suspected of, but not charged with, what the agency refers to as "terrorist-related activity."

The criteria used to turn over the names are secret, as is the process itself.

Quote:

In at least some cases, the people in the cables appear to have been named as potential terrorists solely based on their associations with other suspects, rather than any actions or hard evidence.

Quote:

The first stop for these names is usually the so-called Visa Viper list maintained by the U.S. government. Anyone who makes that list is unlikely to be admitted to the States.

Given Washington's policy of centralizing such information, though, the names also go into the database of the U.S. National Counterterrorism Centre. Inclusion in such databases can have several consequences, such as being barred from aircraft that fly through U.S. airspace.

Or, as Canadian Maher Arar discovered in 2002, the consequences can be much worse: arrest, interrogation, even "rendition" to another country.

Quote:

"We don't want another Arar," said the security official. But at the same time, he said, CSIS is acutely aware that if it did not pass on information about someone it suspected, and that person then carried out some sort of spectacular attack in the U.S., the consequences could be cataclysmic for Canada.

U.S. authorities, already suspicious that Canada is "soft on terror," would likely tighten the common border, damaging hundreds of billions of dollars worth of vital commerce.

A former senior official, who also spoke to CBC on the basis of anonymity, put it more bluntly: "The reality is, sorry, there are bad people out there.

"And it's very hard to get some of those people before a court of law with the information you have. And so there has to be some sort of process which allows you to provide some sort of safeguard to society on both sides of the border."

Furthermore, he said, "it's not a fundamental human right to be able to go to the United States."

No, it's not a fundamental human right to be able to go to the United States. It is a fundamental human right not to be kidnapped and tortured.

Tags: politics rant

Good reading

With a perfect storm of disruptive technology rendering traditional broadcasting all but obsolete, foreign entrants with superior services and lower costs, unfavorable demographics, a powerful pro-competition government, entrenched inflexible business models, lack of competitive and innovative edge due to decades of insulation from the rest of the telecom world, bloated balance sheets due to costly acquisitions of old-media companies, and regulatory uncertainty relating to usage based billing and functional separation, large telecommunications firms in Canada do not look well-positioned.

Why Canadian Cable Companies and Telecoms Are in Trouble (via Michael Geist).

Tags:

Trouble compiling GotoBLAS2 on newer CPU

I came across a problem compiling GotoBLAS2 at work today. It went well on a practice cluster, but on the new one I got this error:

gcc -c -O2 -Wall -m64 -DF_INTERFACE_G77 -fPIC  -DSMP_SERVER -DMAX_CPU_NUMBER=24 -DASMNAME=strmm_ounncopy -DASMFNAME=strmm_ounncopy_ -DNAME=strmm_ounncopy_ -DCNAME=strmm_ounncopy -DCHAR_NAME=\"strmm_ounncopy_\o
../kernel/x86_64/gemm_ncopy_4.S: Assembler messages:
../kernel/x86_64/gemm_ncopy_4.S:192: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:193: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:194: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:195: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:197: Error: undefined symbol `WPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:345: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:346: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:348: Error: undefined symbol `WPREFETCHSIZE' in operation

The solution was simple:

gmake clean
gmake TARGET=NEHALEM

The problem appears to be that newer CPUs (Intel X5650 in my case) are not detected properly by the CPU ID routine in GotoBlas2. You can verify this by checking the contents of config.h in the top-level directory. Without TARGET=NEHALEM, I saw this line:

#define INTEL_UNKNOWN

But with TARGET=NEHALEM, this becomes:

#define NEHALEM

The problem with gemm_ncopy_4.S arises because it defines RPREFETCHSIZE and WPREFETCHSIZE using #ifdef statements depending on CPU type. There is an entry for #ifdef GENERIC, but that was not set for me in config.h.

In addition, if you type "gmake TARGET=NEHALEM" without "gmake clean" first, you get a little further before you run into a similar error:

/usr/bin/ld: ../libgoto2_nehalemp-r1.13.a(ssymv_U.o): relocation R_X86_64_32S against `PREFETCHSIZE' can not be used when making a shared object; recompile with -fPIC
../libgoto2_nehalemp-r1.13.a(ssymv_U.o): could not read symbols: Bad value

If I were a better person, I'd have a look at how the sizes are defined and figure out what the right value is for newer CPUs, then modify cpuid.c (which I presume is what's being used to generate config.h, or at least this part of it). Maybe another day...

Tags: rocks cluster hpc software debugging

Sunrise on the moon

Tonight was the first clear night in far, far too long. But instead of staying out 'til midnight, I decided to try pointing the scope out my bathroom window to look at the moon and get to bed at a semi-reasonable hour.

And hey, not bad! Sure, it got pretty awful above 50X, but that was enough to let me see all kinds of things. I decided to sketch what I saw, and I'm glad I did; it's no great artistry, but it really forced me to pay attention to what I was seeing.

After a half hour or so, I took a break, then came back to look again and pick out the features I could recognize. There was Albategnius, Plinius, Manilius, and...hey, I don't remember that bright bit in Albategnius being that big. And what's with the bright spot in Mare Imbrium's shadow?

So I looked up the new bright bit, and it's Mons Piton -- 7000 ft/2100-odd metres high. And it hits me: I didn't see that before. It's a big mountain in the middle of shadow. It's just the other side of the terminator. Same thing must've happened with Albategnius. Holy crap, I just saw sunrise on the moon!

It shouldn't really be a surprise -- even though I'm still getting familiar with the moon I know how the terminator moves, and I know that it's gotta move sometime. But it was really, really surprising to see it in such a short space of time.

Tags: astronomy

Multiple redundant cascading problems

Oh god this week. I've been setting up the cluster (three chassis' worth of blades from Dell). I've installed Rocks on the front end (rackmount R710). After that:

  • All blades powered on.

  • Some installed, most did not. Not sure why. Grub Error 15 is the result, which is Grub for "File not found".

  • I find suggestions in the Rocks mailing list to turn off floppy controllers. Don't have floppy controllers exactly in these, but I do see boot order includes USB floppy and USB CDROM. Pick a blade, disable, PXE boot and reinstall. Whee, it works!

  • Try on another blade and find that reinstallation takes 90 minutes. Network looks fine; SSH to the reinstalling blade and wget all the RPMs in about twelve seconds. What the hell?

  • Discover Rocks' Avalanche Installer and how it uses BitTorrent to serve RPMs to nodes. Notice that the installing node is constantly ARPing to find nodes that aren't turned on (they're waiting for me to figure out what the hell's going on). Restart service rocks-tracker on the front end and HOLY CRAP now it's back down to a three minute installation. Make a mental note to file a bug about this.

  • Find out that Dell OpenManage Deploy Toolkit is the best way to populate a new machine w/BIOS settings, since the Chassis Management Console can't push that particular setting to blades. Download that, start reading.

  • Try fifteen different ways of connecting virtual media using CMC. Once I find out the correct syntax for NFS mounts (amusingly different between manuals), some blades find it and some don't; no obvious hints why. What the hell?

  • Give up, pick a random blade and tell it by hand where to find the goddamn ISO. (This ignores the problems of getting Java apps to work in Awesome [hint: use wmname], which is my own fault.) Collect settings before and after disabling USB CDROM and Floppy and find no difference; this setting is apparently not something they expose to this tool.

  • Give up and try PXE booting this blade even with the demon USB devices still enabled. It works; installation goes fine and after it reboots it comes up fine. What the hell?

  • Power cycle the blade to see if it still works and it reinstalls. Reinstalls Rocks. What the hell?

  • Discover /etc/init.d/rocks-grub, which at boot modifies grub.conf to PXE boot next time and at graceful shutdown reverses the change, allowing the machine to boot normally. The thinking is that if you have to power cycle a machine you probably want to reinstall it anyhow.

  • Finally put this all together. Restart tracker, set all blades in one of the chassis' to reinstall. Pick a random couple of blades and fire up consoles. Power all the blades up. Installation fails with anaconda error, saying it can't find any more mirrors. What the hell?

  • eth0 is down on the front end; dmesg shows hundreds of "kernel: bnx2: eth0 NIC Copper Link is Down" messages starting approximately the time I power-cycled the blades.

I give up. I am going here tonight because my wife is a good person and is taking me there. And I am going to have much, and much-deserved, beer.

Tags: cluster hpc rocks

Bacula multi-tape restores while backups queued

I've got a tape library at work with two tape drives. Today, one of the drives was doing (full) backups and the second was free for a restore job. However, when that restore job ran, I got this error:

JobId 62397: Forward spacing Volume "000039" to file:block 7:0.
JobId 62397: Error: block.c:1016 Read error on fd=7 at file:blk 3:0 on device "Drive-0" (/dev/nst1). ERR=Input/output error.
JobId 62397: End of Volume at file 3 on device "Drive-0" (/dev/nst1), Volume "000039"
JobId 62397: Fatal error: acquire.c:72 Acquire read: num_writers=1 not zero. Job 62397 canceled.
JobId 62397: Fatal error: mount.c:844 Cannot open Dev="Drive-0" (/dev/nst1), Vol=000039
JobId 62397: End of all volumes.
JobId 62397: Error: Bacula cbs-01-dir 5.0.2 (28Apr10): 03-May-2011 12:09:20

The problem wasn't that it encountered the end of the volume -- the job spanned a number of volumes, so that was okay.

No, the problem was that after the restore job had run, a number of other regular backups had started. These were incrementals, and thus were unable to use the first drive. When the restore job ran into the EOM on the first volume, it appears to have released the drive -- at which point the incrementals started up and denied the use of the second drive to the restore job. The restore job promptly gave up and called it an error.

As I was in a hurry, I tried killing off the incrementals and re-running the restore job. This worked just fine. Arguably it's a bug, but I suspect I just need to tweak the priority for restore jobs instead.

(Two entries in one day...woot!)

Tags: bacula backups

Clearing the trace flag

A couple of times now, I've run strace on a Java process and not had the process resume afterward. Running ps or top has shown that the T (stopped) flag has remained on the process. Running kill -SIGCONT <pid> has started things up again.

Tags:

I am too old for this

About 10.45 yesterday I noticed that my SSH connections to the servers in our server room had stopped, and I was unable to make any more. I checked Nagios on the machine by my desk (multiple Nagios FTW!) and found that it had noticed problems a few minutes before. I ran over to see what was going on.

After a few minutes of checking, I'd found:

  • our firewall machine seemed to be up just fine
  • but I couldn't ping anything: no DHCP lease, and not with a manually configured interface

I called IT Services and asked if there were any problems; they said no, so I double-checked again. Suddenly I could ping the firewall and other machines, but SSHing to them hung.

My guess at this point was LDAP problems. I connected monitor and keyboard to the machine hosting the LDAP server and found it responsive, but a CRAPTON of errors on eth1 (which Xen handily renames/clones as peth1):

peth0     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING NOARP  MTU:9000  Metric:1
          RX packets:1895496748 errors:1500778269 dropped:1505784776 overruns:0 frame:1500778269
          TX packets:340023247 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:186743473052 (173.9 GiB)  TX bytes:384744601794 (358.3 GiB)
          Interrupt:18 Memory:ec000000-ec012800

I didn't know what to make of this, so I replaced the cable to peth1, ran ifconfig down/up, and got the connection back -- at which point LDAP came back up, the machines started working, etc.

Okay, weird -- but at least it's working again. I went back to my desk to try and figure out what had happened. While I was doing that, I started losing connectivity to the machines in the server room for 30 seconds at a time. What the hell?

After that, frankly, it's a blur. I was there 'til 7.45pm and here's what I think was going on.

First, the Xen host was having big memory problems that affected its networking, and the networking of the VMs within it. I was seeing a crapton of these messages:

Apr 14 15:12:51 kernel: xen_net: Memory squeeze in netback driver.

This bug said it was fixed in CentOS 5.6 -- so I tried upgrading to that (I was at 5.5, so not a big jump). Nope. Then I saw a suggestion that the problem was in memory ballooning -- that the dom0 was sucking up all the memory for some reason. The solution was to add a "dom0_mem=" argument to the kernel line in Grub, ideally matching the dom0-min-mem setting in /etc/xen/xend-config.sxp. Unfortunately, I didn't realize that without specifying units, Xen assumes bytes -- so I was specifying a max mem of 512 bytes, not megabytes.

This caused the machine to panic and reboot -- but because consoles were only available via serial port, and because the IPMI console wasn't working, I was unable to see it. I had to edit the Grub entries on the fly to remove those arguments from the kernel, see what was going on, and then set it correctly.

After rebooting with a working memory limit, top showed that ksoftirqd/0 was taking up an enormous amount of CPU time -- 98% of one CPU. This was pretty much all due to eth0 interrupts. tcpdump showed that there was a lot of traffic on the management subnet, which the machine shouldn't have been seeing. I checked the switch and saw that the management vlan WAS on there as tagged (the normal, inside VLAN was default and untagged). I turned that off within the switch, rebooted the machine and things pretty much went back to normal.

All of that was doubly unfortunate because, of the four VMs on there, two are the only two LDAP servers in that room -- the third is on another network, but it took a long time for the clients to fail over to it. This is piss-poor planning on my part.

As if that wasn't enough, another server's disk array disappeared, which caused MySQL to die and the website running on it to disappear. Turned out the machine had been booted with the wrong multipath drivers. When it had problems on one connection, the drive came back with a different device (/dev/sde1 instead of /dev/sdd1). This took a while to figure out, but I finally got it rebooted and the drive array back.

Now things were mostly back to normal -- except that the connection to the management VLAN seemed to be coming and going. This was shown both by nagios ("foo-ilom up! foo-ilom down!") and by good old-fashioned pings. A given ILOM/SP would respond to pings for 30 seconds, then go down; five minutes later it'd come back for 30 seconds, then disappear again. For the nth time: what the hell?

Then I remembered that, back when all this had begun, we'd been configuring a new cluster. Working on its switches, in fact, which were from a different vendor than our usual (package deal, dontcha know). I began to suspect that the problem might somehow lie there. I removed the two patch cables connecting the new switches to our network...and at last the management VLAN connection came back up and stayed up.

In all I was in the server room 'til 7.45pm last night. Part of it was spent reinstalling CentOS on a separate machine in hopes of at least getting an LDAP server up on it. I didn't stick around for that, as the VMs came back up fine, but that's definitely on the agenda.

Tags:

Checking Bacula exclusions

I came across this tip on an old posting to the Bacula mailing list. To determine if exclusions in a fileset are working, run these commands in bconsole:

@output some-file
estimate job=<job-name> listing level=Full
@output

The file will contain a list of files Bacula will include in the backup.

(Incidentally, I came across this while trying to figure out why my exclusions weren't working; turned out I needed to remove the trailing slash in my directory names in the Exclude section.)

Tags: backups bacula toptip

Rocks Lessons Part 2 -- Torque, Maui and OpenMPI

Torque is a resource manager; it's an open source project with a long history. It keeps track of resources -- typically compute nodes, but it's "flexible enough to handle scheduling a conference room". It knows how many compute nodes you have, how much memory, how many cores, and so on.

Maui is the job scheduler. It looks at the jobs being submitted, notes what resources you've asked for, and makes requests of Torque. It keeps track of what work is being done, needs to be done, or has been completed.

MPI stands for "Message Passing Interface". Like BLAS, it's a standard with different implementations. It's used by a lot of HPC/scientific programs to exchange messages between processes -- often but not necessarily on separate computers -- related to their work.

MPI is worth mentioning in the same breath as Torque and Maui because of mpiexec, which is part of OpenMPI; OpenMPI is a popular open-source implementation of MPI. mpiexec (aka mpirun, aka orterun) lets you launch processes in an OpenMPI environment, even if the process doesn't require MPI. IOW, there's no problem running something like "mpiexec echo 'Hello, world!'".

To focus on OpenMPI and mpiexec: you can run n copies of your program by using the "-np" argument. Thus, "-np 8" will run 8 copies of your program...but it will run on the machine you run mpiexec on:

$ mpiexec -np 8 hostname
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org

This isn't always useful -- why pay big money for all this hardware if you're not going to use it? -- so you can tell it to run on different hosts:

$ mpiexec -np 8 -host compute-0-0,compute-0-1 hostname
compute-0-0.local
compute-0-0.local
compute-0-1.local
compute-0-0.local
compute-0-1.local
compute-0-1.local
compute-0-0.local
compute-0-1.local

And if you're going to do that, you might as well give it a file to read, right?

$ mpiexec -np 8 -hostfile /opt/openmpi/etc/openmpi-default-hostfile hostname
compute-0-0.local
compute-0-0.local
compute-0-1.local
compute-0-0.local
compute-0-1.local
compute-0-1.local
compute-0-0.local
compute-0-1.local

That file is where Rocks sticks the hostfile, but it could be anywhere -- including in your home directory, if you decide that you want it to run on a particular set of machines.

However, if you're doing that, then you're really setting yourself up as the resource manager. Isn't that Torque's job? Didn't we set all this up so that you wouldn't have to keep track of what machine is busy?

So OpenMPI can work with Torque:

  1. How do I run jobs under Torque / PBS Pro?

The short answer is just to use mpirun as normal.

Open MPI automatically obtains both the list of hosts and how many processes to start on each host from Torque / PBS Pro directly. Hence, it is unnecessary to specify the --hostfile, --host, or -np options to mpirun. Open MPI will use PBS/Torque-native mechanisms to launch and kill processes (rsh and/or ssh are not required).

Whee! So easy! Except that Rocks does not compile OpenMPI with Torque support!

Because the Rocks project is kind of a broad umbrella, with lots of sub-projects underneath, the Torque roll is separate from the OpenMPI roll. Besides, installing one doesn't mean you'll install the other, so it may not make sense to build OpenMPI that way.

The fine folks at Rocks asked the fine folks at OpenMPI and found a way around this: have every job that's submitted to Torque/Maui and uses MPI source /opt/torque/etc/openmpi-setup.sh. While not efficient, it works; the recommended way, though, is to recompile OpenMPI with Torque installed so that it knows about Torque.

To me, this makes the whole Rocks installation less useful, particularly since this didn't seem terribly well documented. To be fair, it is there in the Torque roll documentation:

Although OpenMPI has support for the torque tm-interface (tm=taskmanager) it is not compiled into the library shipped with Rocks (the reason for this is that the OpenMPI build process needs to have access to libtm from torque to enable the interface). The best workaround is to recompile OpenMPI on a system with torque installed. Then the mpirun command can talk directly to the batch system to get the nodelist and start the parallel application using the torque daemon already running on the nodes. Job startup times for large parallel applications are significantly shorter using the tm-interface than using ssh to start the application on all nodes.

So maybe I should just shut my mouth.

In any event, I suspect I'll end up recompiling OpenMPI in order to get it to see Torque.

Tags: rocks hpc

Rocks Lessons Part 1 -- BLAS

There's a lot to clusters. I'm learning that now.

At $WORK, we're getting a cluster RSN -- rack fulla blades, head node, etc etc. I haven't worked w/a cluster before so I'm practicing with a test one: three little nodes, dual core CPUs, 2 GB memory each, set up as one head node and two compute nodes. I'm using Rocks to manage it.

Here a few stories about things I've learned along the way.

BLAS

You find a lot of references to BLAS when you start reading software requirements for HPC, and not a lot explaining it.

BLAS stands for "Basic Linear Algebra Subprograms"; the original web page is here. Wikipedia calls it "a de facto application programming interface standard for publishing libraries to perform basic linear algebra operations such as vector and matrix multiplication." This is important to realize, because, as in the article, common usage of the term seems to refer to an API than anything else; there's the reference implementation, but it's not really used much.

As I understand it -- and I invite corrections -- BLAS chugs through linear algebra and comes up with an answer at the end. Brute force is one way to do this sort of thing, but there are ways to speed up the process; these can make a huge difference in the amount of time it takes to do some calculation. Some of these are heuristics and algorithms that allow you to search more intelligently through the search space. Some are ways of compiling or writing the library routines differently, taking advantage of the capabilities of different processors to let you search more quickly.

There are two major open-source BLAS implementations:

  • The Goto BLAS library is a hand-optimized BLAS implementation that, by all accounts, is very fast. It's partly written in assembler, and the guy who wrote it basically crafted it the way (I think) Enzo Ferrari crafted cars.

  • ATLAS is another BLAS implementation. The ATLAS home page says "it provides C and Fortran77 interfaces to a portably efficient BLAS implementation, as well as a few routines from LAPACK." As noted in the articles attached to this page, ATLAS tries many, many different searches for a solution to a particular problem. It uses CPU capabilities to do these searches efficiently.

As such, compilation of ATLAS is a big deal, and the resulting binaries are tuned to the CPU they were built on. Not only do you need to turn off CPU throttling, but you need to build on the CPU you'll be running on. Pre-built packages are pretty much out.

ATLAS used to be included in the HPC roll of the Rocks 4 series. Despite irritatingly out-of-date information, this has not been the case in a while.

[LAPACK] "is written in Fortran 90 and provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems." It needs a BLAS library. From the FAQ:

Why aren’t BLAS routines included when I download an LAPACK routine?

It is assumed that you have a machine-specific optimized BLAS library already available on the architecture to which you are installing LAPACK. If this is not the case, you can download a Fortran77 reference implementation of the BLAS from netlib.

Although a model implementation of the BLAS is available from netlib in the blas directory, it is not expected to perform as well as a specially tuned implementation on most high-performance computers -- on some machines it may give much worse performance -- but it allows users to run LAPACK software on machines that do not offer any other implementation of the BLAS.

Alternatively, you can automatically generate an optimized BLAS library for your machine, using ATLAS (http://www.netlib.org/atlas/).

(There is an RPM called "blas-3.0" available for rocks; given the URL listed (http://www.netlib.org/lapack/), it appears that this is the model implementation listed above. This version is at /usr/lib64/libblas.so*, and is in ldconfig.)

Point is, you'll want a BLAS implementation, but you've got two (at least) to choose from. And you'll need to compile it yourself. I get the impression that the choice of BLAS library is something that can vary depending on religion, software, environment and so on...which means you'll probably want to look at something like modules to manage all this.

Tomorrow: Torque, Maui and OpenMPI.

Tags: rocks hpc

RDMA, iWARP and Linux

Fell down a rabbit hole today when I was looking at a data sheet for the Broadcom 5709c chipset. "RDMA over TCP (iWARP) - RDMAC 1.0 compliant". Huh?

This blog has a good overview of RDMA:

The rationale for RDMA is laid out in great detail in RFC 4297, but the basic idea is that allowing network messages to carry information about where they should be received and allowing the NIC to place the data directly in that buffer allows fundamentally better performance. [...]

With RDMA and iSCSI Extensions for RDMA (iSER, which is RFC 5046), the target can send the data in response to a read command and have it placed directly in the receive buffer on the initiator, which saves the copy and uses 3x less memory bandwidth (which is huge if the data is running at 10Gb/sec).

But there's more in that post, like this bit of drama on the LKML from 2007:

How about we just remove the RDMA stack altogether? I am not at all kidding. If you guys can't stay in your sand box and need to cause problems for the normal network stack, it's unacceptable. We were told all along the if RDMA went into the tree none of this kind of stuff would be an issue.

These are exactly the kinds of problems for which people like myself were dreading. These subsystems have no buisness using the TCP port space of the Linux software stack, absolutely none.

After TCP port reservation, what's next? It seems an at least bi-monthly event that the RDMA folks need to put their fingers into something else in the normal networking stack. No more.

I will NACK any patch that opens up sockets to eat up ports or anything stupid like that.

From Dave Miller, no less. Of course, that was four years ago (shit!). Now where do things stand?

Well, there's this announcement from IBM of a Linux iWARP driver and user library. And there's NFS over RDMA (is that the right term?) in the kernel now, including iWARP support.

I'm not sure if this'll be useful at work, but it's interesting to read about.

Tags:

Cfengine 3: copying config files for services

At $work I'm migrating slowly to Cfengine 3. One of the attractions is the ability to do what this page shows: loop over lists in a Cf-ish kind of way.

Here's the first bundle. (It's pretty much stolen from that page, but customized for my environment.) It tells you some basic details about the config file, the process name and the restart command for different daemons:

bundle common services {
  vars:
    redhat|centos::
      "cfg_file_prefix" string => "centos/5";

      "cfg_file[ssh]" string => "/etc/ssh/sshd_config";
      "daemon[ssh]"   string => "sshd";
      "start[ssh]"    string => "/sbin/service sshd restart";
      "enable[ssh]"   string => "/sbin/chkconfig sshd on";

      "cfg_file[iptables]" string => "/etc/sysconfig/iptables";
      "start[iptables]"    string => "/sbin/service iptables restart";
      "enable[iptables]"       string => "/sbin/chkconfig iptables on";
}

Here's the bundle that copies config files and restarts the daemon if necessary:

bundle agent fix_service(service) {
  files:
    "$(services.cfg_file[$(service)])"
      copy_from => secure_cp("$(g.masterfiles)/$(services.cfg_file_prefix)/$(services.cfg_file[$(service)])", "$(g.masterserver)"),
      perms => mog("0600","root","root"),
      classes => if_repaired("$(service)_restart"),
      comment => "Copy a stock configuration file template from repository";

  processes:
    "$(services.daemon[$(service)])"
      comment => "Check that the server process is running, and start if necessary",
      restart_class => canonify("$(service)_restart");

  commands:
    "$(services.start[$(service)])"
      comment => "Method for starting this service",
      ifvarclass => canonify("$(service)_restart");

    "$(services.enable[$(service)])"
      comment => "Method for enabling this service",
      ifvarclass => canonify("$(service)_restart");
}

And here's the loop that puts it all together:

bundle agent redhat {
  vars:
    "service" slist => { "ssh", "iptables" };

  methods:
    "any" usebundle => fix_service("$(service)"),
      comment => "Make sure the basic application services are running";

}

I ran into a problem with this, though: it would always, without fail, restart iptables even though no config file had been copied. The problem was with the process check: there's no process to check for with iptables. And from what I can tell, when the processes stanza was asked to check for a non-existent variable, it checked for the literal string $(services.daemon[$(service)]) -- that is, dollar-bracket-s-e-r-v-.... Since there was no such thing, it decided it needed restarting.

The way around this was to add this variable to the services bundle (the one that has all the info about the daemons):

"daemon[iptables]" string => "cf_null";

I also had to modify the processes stanza:

processes:
  "$(services.daemon[$(service)])"
  comment => "Check that the server process is running, and start if necessary",
  restart_class => canonify("$(service)_restart"),
  ifvarclass => canonify("$(services.daemon[$(service)])");

That ifvarclass check on the last line says to run iff there is a value for daemon. cf_null is a NULL value special to cfengine. Since the check fails for iptables, the process check isn't run and we only restart if we copy over a new config file.

Tags: cfengine

Windows 7 freezes under Apple Boot Camp

Today I installed Windows 7 Pro on a Macbook Pro with Boot Camp. I ran into two problems I figured I should document:

  • First, Boot Camp would not proceed past the offer to burn a CD with drivers for the disk. It's a bug, and I was able to ignore it without problems; the networking and graphics came up w/o problems.

  • Second, after installing Windows Defender and rebooting, Windows would freeze at the login screen. This too is a known problem, with lots of suggestions on how to fix it. What worked for me was booting into safe mode w/the command prompt (every other safe mode would freeze), then disabling the Windows Defender service with MMC. After that, I was able to boot; after that, I was able to set it to "Automatic Start", reboot, and I've had no further troubles.

Tags: windows debugging

First light

This past Sunday I picked up a reflector on Craigslist. It's an Omcon 811SE, a 114mm f/9 Newtonian. I hadn't heard of the name before, but a quick search showed that they'd been built in the '90s here in Vancouver, and at least one CloudyNights.com member has one listed in their sig.

It seemed like it was in good condition. It came with a pretty sturdy wooden tripod with an alt-az friction mount, a 6x30 finder scope, and two eyepieces: an 18mm Kellner (50X), and a 7.5mm Plossl (120X). At $40 (knocked down from $50!), it was too good a deal to pass up.

...And then the clouds came. OF COURSE. I spent my time adjusting the finder and pointing it out the window, cursing water vapour under my breath.

Finally, we got some clear-ish skies; there were lingering clouds, but they seemed thin. I took it out to a park within walking distance of my house. The skies are by no means dark, but it's easy to get there. I set up the scope, popped in the 18mm, pointed it at a star, focussed and...hey, a star! Definitely reassuring, since I wasn't sure about how good the collimation was...it seemed okay to me, based on what I'd read, but the mirror has no centre mark and it was hard to be sure without actually testing it.

Finally, Orion came out, so I pointed it at M42...

...WOW.

WOW.

I'd only ever looked at it before in binoculars and my Galileoscope (the only other scope I've had since people started calling me a grownup), and...well, frankly it didn't seem like all that. I mean, it was nice, but nothing spectacular. But this...THIS was spectacular.

I swapped in the 7.5, and incredibly it seemed even better. It was fainter, of course, but the narrower FOV seemed to focus my attention more. I spent some time letting it drift across the eyepiece, and began to notice dark spots, lanes and such. It was amazing. I couldn't say I saw any colour, but I definitely know now what the fuss is about.

I decided to try the Pleiades, and even at 50X -- such a narrow FOV compared to binoculars! -- it was astonishing. I just kept repeating "Oh wow" over and over again.

Finally, I decided to try splitting Eta Cassiopeiae. I missed it at 50X, found it at 120X and then saw it on second look at 50X. Can't say I saw any colour difference between the two.

(I'd been REALLY hoping to see Jupiter, but the thrice-cursed clouds hid it.)

Now, the scope isn't perfect. The mounting definitely needs to be tweaked to make it easier to move (while not actually falling down), and pointing it at the zenith is going to be difficult. The &?*#! finderscope kept dewing over (first priority is to get my kids to make me a dew shield; they're 4 and 2, so it'll be a great craft project for them :-)). And even w/o much experience I can tell the eyepieces aren't great...in the 18mm, stars get fuzzy or elongated at the edge of the FOV, and the focus on the 7.5mm seems mushy/hard to achieve. (Though I suppose that could be the mirror...I wouldn't know.)

But oh man, oh man, oh man...what a night. I couldn't be happier with my new scope.

Tags: astronomy