10 Jun 2011
(Keep in mind: I don't know what I'm talking about. I've written this
down because I've found very little that seems to explain this to a
newbie. I'm probably wrong; if you know that I'm wrong, leave a
comment.)
Linpack is "a software library for performing numerical linear
algebra on digital computers". It has been superseded for that
purpose by LAPACK; now it's mainly used for benchmarking. The latest
incarnation, used for scores on the Top 500 list, is called HPL
(High Performance Linpack); it uses MPI for communication.
Why do you need to know this?
If you search for "linpack score", you'll find an astonishing
number of people posting scores for their phone. If you're looking
for information on maxing your score on your HTC Dream, I can't help
you.
If you have a cluster, then near as I can tell there are two reasons
to do this:
- Get high scores.
- Exercise your hardware and see what happens.
The first is the sort of measuring contest that you hope gets you into
the Top 500 list. It probably affects your funding, and may
affect your continued employment.
The second uses Linpack as a shakedown (did the test work? did
anything break?), or as a way of benchmarking performance. Sometimes
people will use Linpack scores as a baseline; they'll make tweaks to a
cluster (add more memory, change MTU, turn off more daemons, twiddle
BIOS settings, etc) and see what the effect is. Linpack is not
perfect for this; it stresses CPU and FPU, possibly memory and
network, and doesn't really check disk, power usage or other things.
But it's a start, it's familiar, and it boils down to a single number.
(HAH.)
So how high can it go?
The theoretical peak score is:
CPU GHz x Flops/Hz x Cores/node x nodes
(Cite). Flops/Hz is CPU-specific; for the Nehalems, at least,
it's 4 Flops/Hz. Thus, for the cluster I'm working on, the peak score is:
2.67 GHz x 4 Flops/Hz x 24 Cores/node x 35 nodes
(Note the assumption of HyperThreading turned on in the 24 cores/node
figure; while I've got that turned on now, I should probably turn it
off -- at least for running Linpack.)
Anyhow, comes out to about 8970 GFlops, or almost 9 TFlops. By
contrast, Tianhe-1A, the top entry in the November 2010 Top 500 list,
has an RPeak of 4701000 GFlops -- so 4701 TFlops, or about 4.7
Petaflops. So there's no need to buy me that "I made the Top 500
and all I got was this lousy t-shirt" t-shirt yet.
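If you want to redo that arithmetic for your own machines, here's a
trivial Python sketch of the same formula; the numbers are mine from
above, so swap in your own clock speed, flops/cycle and core counts:
# Theoretical peak (Rpeak) for a homogeneous cluster:
#   GHz x flops/Hz x cores/node x nodes
# Numbers below are for the cluster described above (Nehalem, HT on).
ghz_per_core = 2.67      # CPU clock, GHz
flops_per_hz = 4         # flops per cycle on Nehalem
cores_per_node = 24      # 12 physical cores, 24 with HyperThreading on
nodes = 35
peak_gflops = ghz_per_core * flops_per_hz * cores_per_node * nodes
print("Theoretical peak: %.0f GFlops (%.1f TFlops)" % (peak_gflops, peak_gflops / 1000.0))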
Of course, that's a theoretical peak, and a lot depends on the way
your system is configured. For example,
this post (https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2010-May/047136.html)
to the Rocks-discuss mailing list says:
FYI, I've got around 84-85% on a cluster with Infiniband and OpenMPI,
but some people told me they get better results.
That's 85% of the theoretical max. And it depends on Infiniband.
Jeezum Crow.
Configuration file tuning
Linpack uses a configuration file named HPL.dat. The format is a
little non-obvious, but is documented here. Here's a sample
file, as generated by cbench (about which more later):
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
3 # of problems sizes (N)
108304 346573 368233 # N -- fantastically important; see ahead
1 # Number of block sizes
80 112 96 # Block sizes
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
25 # Ps
26 # Qs
8.0 threshold
1 # of panel fact
0 2 1 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 2 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 2 0 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
0 3 1 2 4 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
0 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
256 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
0 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
(Not the best format; sorry.)
P and Q, multiplied, should be the number of cores you want to
use (thanks, Tim Doug) -- and so need to match the parameters you
pass to Torque, or whatever batching system you're using.
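If you'd rather not factor core counts by hand, here's a small Python
sketch (my own helper, not part of HPL or cbench) that picks the most
nearly square P x Q grid for a given number of processes:
# HPL generally likes P x Q grids that are as close to square as
# possible, with P <= Q and P * Q equal to the number of MPI processes.
import math

def pick_grid(ncores):
    best = (1, ncores)
    for p in range(1, int(math.sqrt(ncores)) + 1):
        if ncores % p == 0:
            best = (p, ncores // p)   # p only grows, so the last hit is the squarest
    return best

print(pick_grid(650))   # -> (25, 26), the P and Q in the HPL.dat above
print(pick_grid(840))   # -> (28, 30), e.g. 35 nodes x 24 "cores"/node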
Block sizes: Tim Doug says: "128 works well for me. Others
suggest 80, 160, or 256. Experiment." cbench uses 80, 112 and 96; I
think this is just how they do things. I see a peak (in very early
tests) around 150.
N: The biggie. This posting to the Rocks-discuss mailing list
gives an excellent overview of how N works in the Linpack test:
If you really want to stress your cluster, you want to have your
matrix size fill approximately 80% of memory. For an NxN matrix, you
consume N*N*8 bytes. If you have a 16-node cluster, for example with 8
GB memory/node, then you have 16*8 GB*0.80 = 102 GB. N would be
approximately sqrt(102e9/8) ~ 113,000.
That's a pretty big matrix and takes O(113,000^3) or 1.4 Quadrillion
Floating OPs. If your nodes were 8 core, 2.5GHz, 4 Flops/cycle, then
one would expect this matrix to factor (at reasonable computational
efficiency) in the 30-90 Minute range. The exact time depends on
efficiency, constant on the O(n^3) term and actual speed of your
processors and network.
HPL will allow you to set up various matrix sizes, set up something
that will compute quickly, eg. a 1000x1000 matrix, to verify that
everything is happy. Then step through some sizes that will take 1 - 5
minutes to factor; this will allow you to calibrate the time you expect
the full load to run. Remember each doubling of matrix size results in
8X the number of floating ops. You get more efficiency as you get larger
(more computation to communication), but it starts to level off pretty
quickly. For most interconnects, using ~20% of memory is usually a
decent indicator of ultimate system performance, if 20% takes 1 minute
to actually compute, you expect the 80% run (8X) to take about an
hour.
Full machine LINPACK runs can take many hours to run.
(It's worth emphasizing that Linpack really does take a long time to
run with large N; this discussion shows how to start small and
ramp up your Linpack tests as you gain confidence.)
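To make that sizing rule concrete, here's a quick Python sketch of the
N ~ sqrt(fraction * memory / 8) arithmetic from the quote; it
reproduces the 16-node example above, and makes no attempt to mimic
whatever cbench does internally:
# Rule of thumb from the quote above: fill some fraction of total
# memory with the NxN double-precision matrix, so
# N ~ sqrt(fraction * total_bytes / 8).
import math

def hpl_n(nodes, gb_per_node, fraction=0.80):
    total_bytes = nodes * gb_per_node * 1e9   # 1 GB = 1e9 bytes, as in the quote
    return int(math.sqrt(fraction * total_bytes / 8))

print(hpl_n(16, 8))                 # the 16-node, 8 GB/node example: ~113,000
print(hpl_n(16, 8, fraction=0.20))  # a quick 20%-of-memory calibration run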
cbench, which I mentioned above, is a suite of programs that are
meant to exercise and benchmark a cluster. It's really quite
excellent, but as with a lot of things in HPC the documentation isn't
as explicit as it could be. For example, the N figures that are above
are:
- calculated for the amount of memory I have in my cluster (35 nodes *
48 GB/node), and
- designed to fill 25%, 80% and 85% of memory.
Thus, it agrees well with the approach outlined
above: 20% for a short run (good for a ballpark figure) and 80% - 85%
for longer runs (to really stress things).
So if you run Linpack a bunch of times and tweak the parameters,
you'll see different results. This page discusses why:
The parallel solution of a system of linear equation requires some
communication between the processors. To measure the loss of
efficiency due this communication, we solved systems of equations of
varying size on a varying number of processors. The general rule is:
larger N means more work for each CPU and less influence of
communication. As you can see from Fig. 1, a 4-CPU setup comes very
close to the single CPU peak performance of 528 Mflops. This
indicates that the solver that works in HPL is not significantly
worse than ATLAS. The relative speed per CPU decreases with
increasing number of CPUs, however.
The problem size N is limited by the total memory. Tina has 512
MByte per node, i.e. each node can hold at most an 8192x8192 matrix
of double precision floats. In practice, the matrix has to be
smaller since the system itself needs a bit of memory, too. If both
CPUs on a node are operating, the maximum size reduces to 5790x5790
per CPU. To minimize the relative weight of communication, the
memory load should be as high as possible on each node. In Fig. 2 you
can see how the effective speed increases with increasing load
factor. A load factor of 1 means that 256 MByte are required on each
node to hold the NxN coefficient matrix.
[...]
With all 144 CPUs, communication becomes the major bottleneck. The
current performance of 41 Gflops scales down to 284 Mflops/CPU [as
compared to 528 MFlops for N=5000 on a single 4-CPU system]. The
CPUs seem to spend almost half of their time chatting with each
other...
Really, as someone else said, it's a black art. There are a ton
of papers out there on optimizing Linpack parameters. There's even --
and I am crapping you negative -- a software project called
ga-linhack that aims to "Develop a complete genetic algorithm tool set
for determining optimal parameters for Linpack runs." Because as they
say:
To most cluster engineers (the authors included) the tuning
explanations of the hpl parameters yield little clue as to the
underlying effect of varying these parameters. Not everyone can
take a graduate mathematics course in advanced linear algebra in
their free time.
Testify!
Why are we doing this again?
This page (mentioned above) also has this quote:
After this, look for the top 8 or 16 results, and refine the config
file to use only the parameters that produced these results.
...which for me brought up a lot of questions, like:
- Why are we doing this?
- Do we want a high score or do we want to stress the system?
- How compatible are those two goals?
- How much fiddling with the test parameters is "morally acceptable",
for lack of a better term?
- Which results do you pay attention to?
I'm still figuring all this out.
Other resources
Tags:
cluster
10 Jun 2011
On the music player this morning, I heard:
All such good music.
Tags:
08 Jun 2011
I've been puttering away at work getting the cluster going. It's
hard, because there are a lot of things I'm having to learn on the
go. One of the biggest chunks is Torque and Maui, and how they
interact with each other and Rocks as a whole.
For example: today I tried submitting a crapton of jobs all at once.
After a while I checked the queue with showq (a Maui command; not to
be confused with qstat, which is Torque) and found that a lot of jobs
were listed as "Deferred" rather than "Idle". I watched, and the idle
ones ran; the deferred ones just stayed in place, even after the list
of running jobs was all done.
At first I thought this might be something to do with fairness. There
are a lot of knobs to twiddle in Maui, and since I
hadn't looked at the configuration after installation I wasn't really
sure what was there. But near as I could tell, there wasn't anything
happening there; the config file for Maui was empty, and I couldn't
seem to find any mention of what the default settings were. I
followed the FAQ and ran the various status commands, but
couldn't really see anything obvious there.
Then I tried looking in the Torque logs (/opt/torque/server_logs), and found this:
06/08/2011 14:42:09;0008;PBS_Server;Job;8356.example.com;send of job to compute-3-2 failed error = 15008
06/08/2011 14:42:09;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Access from host not allowed, or unknown host (15008) in send_job, child failed in previous commit request for job 8356.example.com
And on compute-3-2 (/opt/torque/mom_logs)
06/08/2011 14:42:01;0080; pbs_mom;Req;req_reject;Reject reply code=15008(Access from host not allowed, or unknown host MSG=request not authorized), aux=0, type=QueueJob, from PBS_Server@example.local
That's weird. I ran rocks sync config out of superstition, but
nothing changed. I found a suggestion that it might be a bug in
Torque, and to run momctl -d to see if the head node was in the
trusted client list. It was not. I tried running that command on
all the nodes (sudo rocks run host compute command="momctl -d3
|grep Trusted | grep 10.1.1.1"); turned out that only 10 were. What
the hell?
I'm still not sure exactly where this gets set, but I did notice that
/opt/torque/mom_priv/config listed the head node as the server, and
was identical on all machines. On a hunch, I tried restarting the pbs
service on all the nodes; suddenly they all came up. I submitted a
bunch more jobs, and they all ran through -- none were deferred. And
running momctl -d showed that, yes, the head node was now in the
trusted client list.
Thoughts:
- None of this was shown by Ganglia (which just monitors load) or
showq (which is a Maui command; the problem was with Torque).
Doubtless there were commands I should've been running in Torque to
show these things.
- While the head node is running a syslog server and collects stuff
from the client nodes, Torque logs are not among them; I presume
Torque is not using syslog. (Must check that out.)
- I still don't know how the trusted client list is set. If it's in
a text file, that's something that I think Rocks should manage.
- I'm not sure if tracking down the problem this way is exactly the
right way to go. I think it's important to understand this, but I
suspect the Rocks approach would be "just reboot or reinstall".
There's value to that, but I intensely dislike not knowing why
things are happening and sometimes that gets in my way.
Tags:
cluster
rocks
06 Jun 2011
I've come across this problem a number of times, so here's a
reminder. When printing in Python, occasionally I'll end up with
something like this:
>>> print "foo=%s" % row["Name"]
foo=array('c', 'bar')
The solution is to use the .tostring() method:
>>> print "foo=%s" % row["Name"].tostring()
foo=bar
Details here.
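For context, here's a minimal, self-contained Python 2 reproduction;
the row dict is just a stand-in for whatever database layer handed
back the array:
from array import array

# Stand-in for a DB row whose "Name" column came back as an array of chars.
row = {"Name": array('c', 'bar')}

print "foo=%s" % row["Name"]              # foo=array('c', 'bar')
print "foo=%s" % row["Name"].tostring()   # foo=bar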
Tags:
python
23 May 2011
From The CBC:
The Vancouver Canucks have skated to within a victory from advancing
to their first Stanley Cup final since 1994 after they exhibited their
superiority in a bizarre special teams battle with the San Jose Sharks
on Sunday.
Tags:
18 May 2011
Memo to Canadians: your government will throw you under a bus if they
feel like it.
Quote:
The Canadian Security Intelligence Service, Canada's principal
intelligence agency, routinely transmits to U.S. authorities the names
and personal details of Canadian citizens who are suspected of, but
not charged with, what the agency refers to as "terrorist-related
activity."
The criteria used to turn over the names are secret, as is the
process itself.
Quote:
In at least some cases, the people in the cables appear to have been
named as potential terrorists solely based on their associations with
other suspects, rather than any actions or hard evidence.
Quote:
The first stop for these names is usually the so-called Visa Viper
list maintained by the U.S. government. Anyone who makes that list is
unlikely to be admitted to the States.
Given Washington's policy of centralizing such information, though,
the names also go into the database of the U.S. National
Counterterrorism Centre. Inclusion in such databases can have several
consequences, such as being barred from aircraft that fly through
U.S. airspace.
Or, as Canadian Maher Arar discovered in 2002, the consequences can be
much worse: arrest, interrogation, even "rendition" to another country.
Quote:
"We don't want another Arar," said the security official. But at the
same time, he said, CSIS is acutely aware that if it did not pass on
information about someone it suspected, and that person then carried
out some sort of spectacular attack in the U.S., the consequences
could be cataclysmic for Canada.
U.S. authorities, already suspicious that Canada is "soft on terror,"
would likely tighten the common border, damaging hundreds of billions
of dollars worth of vital commerce.
A former senior official, who also spoke to CBC on the basis of
anonymity, put it more bluntly: "The reality is, sorry, there are bad
people out there.
"And it's very hard to get some of those people before a court of law
with the information you have. And so there has to be some sort of
process which allows you to provide some sort of safeguard to society
on both sides of the border."
Furthermore, he said, "it's not a fundamental human right to be able
to go to the United States."
No, it's not a fundamental human right to be able to go to the United
States. It is a fundamental human right not to be kidnapped and
tortured.
Tags:
politics
rant
13 May 2011
With a perfect storm of disruptive technology rendering traditional
broadcasting all but obsolete, foreign entrants with superior services
and lower costs, unfavorable demographics, a powerful pro-competition
government, entrenched inflexible business models, lack of competitive
and innovative edge due to decades of insulation from the rest of the
telecom world, bloated balance sheets due to costly acquisitions of
old-media companies, and regulatory uncertainty relating to usage
based billing and functional separation, large telecommunications
firms in Canada do not look well-positioned.
Why Canadian Cable Companies and Telecoms Are in Trouble (via
Michael Geist).
Tags:
13 May 2011
I came across a problem compiling GotoBLAS2 at work today. It went
well on a practice cluster, but on the new one I got this error:
gcc -c -O2 -Wall -m64 -DF_INTERFACE_G77 -fPIC -DSMP_SERVER -DMAX_CPU_NUMBER=24 -DASMNAME=strmm_ounncopy -DASMFNAME=strmm_ounncopy_ -DNAME=strmm_ounncopy_ -DCNAME=strmm_ounncopy -DCHAR_NAME=\"strmm_ounncopy_\o
../kernel/x86_64/gemm_ncopy_4.S: Assembler messages:
../kernel/x86_64/gemm_ncopy_4.S:192: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:193: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:194: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:195: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:197: Error: undefined symbol `WPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:345: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:346: Error: undefined symbol `RPREFETCHSIZE' in operation
../kernel/x86_64/gemm_ncopy_4.S:348: Error: undefined symbol `WPREFETCHSIZE' in operation
The solution was simple:
gmake clean
gmake TARGET=NEHALEM
The problem appears to be that newer CPUs (Intel X5650 in my case) are
not detected properly by the CPU ID routine in GotoBLAS2. You can
verify this by checking the contents of config.h in the top-level
directory. Without TARGET=NEHALEM, I saw this line:
But with TARGET=NEHALEM, this becomes:
The problem with gemm_ncopy_4.S arises because it defines
RPREFETCHSIZE and WPREFETCHSIZE using #ifdef statements depending
on CPU type. There is an entry for #ifdef GENERIC, but that was not
set for me in config.h.
In addition, if you type "gmake TARGET=NEHALEM" without "gmake clean"
first, you get a little further before you run into a similar error:
usr/bin/ld: ../libgoto2_nehalemp-r1.13.a(ssymv_U.o): relocation R_X86_64_32S against `PREFETCHSIZE' can not be used when making a shared object; recompile with -fPIC
../libgoto2_nehalemp-r1.13.a(ssymv_U.o): could not read symbols: Bad value
If I was a better person, I'd have a look at how the sizes are defined
and figure out what the right value is for newer CPUs, then modify
cpuid.c (which I presume is what's being used to generate config.h,
or at least this part of it). Maybe another day...
Tags:
rocks
cluster
hpc
software
debugging
10 May 2011
Tonight was the first clear night in far, far too long. But instead
of staying out 'til midnight, I decided to try pointing the scope out
my bathroom window to look at the moon and get to bed at a
semi-reasonable hour.
And hey, not bad! Sure, it got pretty awful above 50X, but that was
enough to let me see all kinds of things. I decided to sketch what I
saw, and I'm glad I did; it's no great artistry, but it really forced
me to pay attention to what I was seeing.
After a half hour or so, I took a break, then came back to look again
and pick out the features I could recognize. There was Albategnius,
Plinius, Manilius, and...hey, I don't remember that bright bit in
Albategnius being that big. And what's with the bright spot in Mare
Imbrium's shadow?
So I looked up the new bright bit, and it's Mons Piton -- 7000
ft/2100-odd metres high. And it hits me: I didn't see that before.
It's a big mountain in the middle of shadow. It's just the other side
of the terminator. Same thing must've happened with Albategnius.
Holy crap, I just saw sunrise on the moon!
It shouldn't really be a surprise -- even though I'm still getting
familiar with the moon I know how the terminator moves, and I know
that it's gotta move sometime. But it was really, really
surprising to see it in such a short space of time.
Tags:
astronomy
05 May 2011
Oh god this week. I've been setting up the cluster (three chassis'
worth of blades from Dell). I've installed Rocks on the front end
(rackmount R710). After that:
- All blades powered on.
- Some installed, most did not. Not sure why. Grub Error 15 is the
result, which is Grub for "File not found".
- I find suggestions in the Rocks mailing list to turn off floppy
controllers. Don't have floppy controllers exactly in these, but I
do see boot order includes USB floppy and USB CDROM. Pick a blade,
disable, PXE boot and reinstall. Whee, it works!
- Try on another blade and find that reinstallation takes 90 minutes.
Network looks fine; SSH to the reinstalling blade and wget all the
RPMs in about twelve seconds. What the hell?
- Discover Rocks' Avalanche Installer and how it uses BitTorrent to
serve RPMs to nodes. Notice that the installing node is constantly
ARPing to find nodes that aren't turned on (they're waiting for me
to figure out what the hell's going on). Restart service
rocks-tracker on the front end and HOLY CRAP now it's back down to a
three minute installation. Make a mental note to file a bug about
this.
- Find out that Dell OpenManage Deploy Toolkit is the best way to
populate a new machine w/BIOS settings, since the Chassis Management
Console can't push that particular setting to blades. Download
that, start reading.
- Try fifteen different ways of connecting virtual media using CMC.
Once I find out the correct syntax for NFS mounts (amusingly
different between manuals), some blades find it and some don't; no
obvious hints why. What the hell?
- Give up, pick a random blade and tell it by hand where to find the
goddamn ISO. (This ignores the problems of getting Java apps to
work in Awesome [hint: use wmname], which is my own fault.) Collect
settings before and after disabling USB CDROM and Floppy and find no
difference; this setting is apparently not something they expose to
this tool.
- Give up and try PXE booting this blade even with the demon USB
devices still enabled. It works; installation goes fine and after
it reboots it comes up fine. What the hell?
- Power cycle the blade to see if it still works and it reinstalls.
Reinstalls Rocks. What the hell?
- Discover /etc/init.d/rocks-grub, which at boot modifies grub.conf to
PXE boot next time and at graceful shutdown reverses the change,
allowing the machine to boot normally. The thinking is that if you
have to power cycle a machine you probably want to reinstall it
anyhow.
- Finally put this all together. Restart tracker, set all blades in
one of the chassis' to reinstall. Pick a random couple of blades
and fire up consoles. Power all the blades up. Installation fails
with an anaconda error, saying it can't find any more mirrors. What
the hell?
- eth0 is down on the front end; dmesg shows hundreds of "kernel:
bnx2: eth0 NIC Copper Link is Down" messages starting approximately
the time I power-cycled the blades.
I give up. I am going here tonight because my wife is a good person and
is taking me there. And I am going to have much, and much-deserved,
beer.
Tags:
cluster
hpc
rocks
03 May 2011
I've got a tape library at work with two tape drives. Today, one of
the drives was doing (full) backups and the second was free for a
restore job. However, when that restore job ran, I got this error:
JobId 62397: Forward spacing Volume "000039" to file:block 7:0.
JobId 62397: Error: block.c:1016 Read error on fd=7 at file:blk 3:0 on device "Drive-0" (/dev/nst1). ERR=Input/output error.
JobId 62397: End of Volume at file 3 on device "Drive-0" (/dev/nst1), Volume "000039"
JobId 62397: Fatal error: acquire.c:72 Acquire read: num_writers=1 not zero. Job 62397 canceled.
JobId 62397: Fatal error: mount.c:844 Cannot open Dev="Drive-0" (/dev/nst1), Vol=000039
JobId 62397: End of all volumes.
JobId 62397: Error: Bacula cbs-01-dir 5.0.2 (28Apr10): 03-May-2011 12:09:20
The problem wasn't that it encountered the end of the volume -- the
job spanned a number of volumes, so that was okay.
No, the problem was that after the restore job had run, a number of
other regular backups had started. These were incrementals, and thus
were unable to use the first drive. When the restore job ran into the
EOM on the first volume, it appears to have released the drive -- at
which point the incrementals started up and denied the use of the
second drive to the restore job. The restore job promptly gave up and
called it an error.
As I was in a hurry, I tried killing off the incrementals and
re-running the restore job. This worked just fine. Arguably it's a
bug, but I suspect I just need to tweak the priority for restore jobs
instead.
(Two entries in one day...woot!)
Tags:
bacula
backups
03 May 2011
A couple times now, I've run strace on a java process and not had it
resume after starting. Running ps or top has shown that the T flag
has remained on the process. Running kill -SIGCONT <pid> has
started things up again.
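If you get tired of doing that by hand, here's a rough Python sketch
that walks /proc looking for stopped (state T) processes and sends
them SIGCONT; the parsing is naive (it assumes the command name has no
spaces), and you'd want to filter for java before trusting it:
import os
import signal

# Find processes in the "T" (stopped) state and send them SIGCONT.
for pid in filter(str.isdigit, os.listdir('/proc')):
    try:
        with open('/proc/%s/stat' % pid) as f:
            fields = f.read().split()
        comm, state = fields[1], fields[2]    # naive: assumes comm has no spaces
        if state == 'T':
            print('Resuming %s %s' % (pid, comm))
            os.kill(int(pid), signal.SIGCONT)
    except (IOError, OSError):
        pass    # process exited, or it isn't ours to signal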
Tags:
15 Apr 2011
About 10.45 yesterday I noticed that my SSH connections to the servers
in our server room had stopped, and I was unable to make any more. I
checked Nagios on the machine by my desk (multiple Nagios FTW!) and found
that it had noticed problems a few minutes before. I ran over to see
what was going on.
After a few minutes of checking, I'd found:
- our firewall machine seemed to be up just fine
- but I couldn't ping anything: no DHCP lease, and not with a manually configured interface
I called IT Services and asked if there were problems; they said no, so I double
checked again. Suddenly I could ping the firewall and other machines,
but SSHing to them hung.
My guess at this point was LDAP problems. I connected monitor and
keyboard to the machine hosting the LDAP server and found it
responsive, but a CRAPTON of errors on eth1 (which Xen handily
renames/clones as peth1):
peth0     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING NOARP  MTU:9000  Metric:1
          RX packets:1895496748 errors:1500778269 dropped:1505784776 overruns:0 frame:1500778269
          TX packets:340023247 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000
          RX bytes:186743473052 (173.9 GiB)  TX bytes:384744601794 (358.3 GiB)
          Interrupt:18 Memory:ec000000-ec012800
I didn't know what to make of this, so I replaced the cable to peth1,
ran ifconfig down/up, and got the connection back -- at which point
LDAP came back up, the machines started working, etc.
Okay, weird -- but at least it's working again. I went back to my
desk to try and figure out what had happened. While I was doing
that, I started losing connectivity to the machines in the server
room for 30 seconds at a time. What the hell?
After that, frankly, it's a blur. I was there 'til 7.45pm and here's
what I think was going on.
First, the Xen host was having big memory problems that affected its
networking, and the networking of the VMs within it. I was seeing a
crapton of these messages:
Apr 14 15:12:51 kernel: xen_net: Memory squeeze in netback driver.
This bug said it was fixed in CentOS 5.6 -- so I tried upgrading
to that (I was at 5.5, so not a big jump). Nope. Then I saw a
suggestion that the problem was in memory ballooning -- that the dom0
was sucking up all the memory for some reason. The solution was to
add a "dom0_mem=" to the kernel argument in Grub, ideally
matching the dom0-min-mem argument in /etc/xen/xend-config.sxp.
Unfortunately, I didn't realize that without specifying units, Xen
assumes bytes -- so I was specifying a max mem of 512 bytes, not
megabytes.
This caused the machine to panic and reboot -- but because consoles
were only available via serial port, and because the IPMI console
wasn't working, I was unable to see it. I had to edit the Grub
entries on the fly to remove those arguments from the kernel, see what
was going on, and then set it correctly.
After rebooting with a working memory limit, top showed that
ksoftirqd/0 was taking up an enormous amount of CPU time -- 98% of
one CPU. This was pretty much all due to eth0 interrupts. tcpdump
showed that there was a lot of traffic on the management subnet,
which the machine shouldn't have been seeing. I checked the switch
and saw that the management vlan WAS on there as tagged (the normal,
inside VLAN was default and untagged). I turned that off within the
switch, rebooted the machine and things pretty much went back to
normal.
All of that was doubly unfortunate because of the four VMs on there,
two are the only two LDAP servers in that room -- the third is on
another network, but it took a long time for the clients to fail over
to it. This is piss-poor planning on my part.
As if that wasn't enough, another server's disk array disappeared,
which caused MySQL to die and the website running on it to disappear.
Turned out the machine had been booted with the wrong multipath
drivers. When it had problems on one connection, the drive came back
with a different device (/dev/sde1 instead of /dev/sdd1). This took a
while to figure out, but I finally got it rebooted and the drive array
back.
Now things were mostly back to normal -- except that the connection to
the management VLAN seemed to be coming and going. This was shown
both by nagios ("foo-ilom up! foo-ilom down!") and by good
old-fashioned pings. A given ILOM/SP would respond to pings for 30
seconds, then go down; five minutes later it'd come back for 30
seconds, then disappear again. For the nth time: what the hell?
Then I remembered that, back when all this had begun, we'd been
configuring a new cluster. Working on its switches, in fact, which
were from a different vendor than our usual (package deal, dontcha
know). I began to suspect that the problem might somehow lie there.
I removed the two patch cables connecting the new switches to our
network...and at last the management VLAN connection came back up and
stayed up.
In all I was in the server room 'til 7.45pm last night. Part of it
was spent reinstalling CentOS on a separate machine in hopes of at
least getting an LDAP server up on it. I didn't stick around for
that, as the VMs came back up fine, but that's definitely on the
agenda.
Tags:
11 Apr 2011
I came across this tip on an old posting to the Bacula mailing
list. To determine if exclusions in a fileset are working, run these
commands in bconsole:
@output some-file
estimate job=<job-name> listing level=Full
@output
The file will contain a list of files Bacula will include in the
backup.
(Incidentally, I came across this while trying to figure out why my
exclusions weren't working; turned out I needed to remove the trailing
slash in my directory names in the Exclude section.)
Tags:
backups
bacula
toptip
04 Mar 2011
Torque is a resource manager; it's an open source project with a
long history. It keeps track of resources -- typically compute nodes,
but it's "flexible enough to handle scheduling a conference
room". It knows how many compute nodes you have, how much
memory, how many cores, and so on.
Maui is the job scheduler. It looks at the jobs being submitted,
notes what resources you've asked for, and makes requests of Torque.
It keeps track of what work is being done, needs to be done, or has
been completed.
MPI stands for "Message Passing Interface". Like BLAS, it's a
standard with different implementations. It's used by a lot of
HPC/scientific programs to exchange messages between processes --
often but not necessarily on separate computers -- related to their
work.
MPI is worth mentioning in the same breath as Torque and Maui because
of mpiexec, which is part of OpenMPI; OpenMPI is a popular open-source
implementation of MPI. mpiexec (aka mpirun, aka orterun) lets you
launch processes in an OpenMPI environment, even if the process
doesn't require MPI. IOW, there's no problem running something like
"mpiexec echo 'Hello, world!'".
To focus on OpenMPI and mpiexec: you can run n copies of your program
by using the "-np" argument. Thus, "-np 8" will run 8 copies of your
program...but it will run on the machine you run mpiexec on:
$ mpiexec -np 8 hostname
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
rocks-01.example.org
This isn't always useful -- why pay big money for all this hardware if
you're not going to use it? -- so you can tell it to run on different
hosts:
$ mpiexec -np 8 -host compute-0-0,compute-0-1 hostname
compute-0-0.local
compute-0-0.local
compute-0-1.local
compute-0-0.local
compute-0-1.local
compute-0-1.local
compute-0-0.local
compute-0-1.local
And if you're going to do that, you might as well give it a file to
read, right?
$ mpiexec -np 8 -hostfile /opt/openmpi/etc/openmpi-default-hostfile hostname
compute-0-0.local
compute-0-0.local
compute-0-1.local
compute-0-0.local
compute-0-1.local
compute-0-1.local
compute-0-0.local
compute-0-1.local
That file is where Rocks sticks the hostfile, but it could be anywhere
-- including in your home directory, if you decide that you want it to
run on a particular set of machines.
However, if you're doing that, then you're really setting yourself up
as the resource manager. Isn't that Torque's job? Didn't we set all
this up so that you wouldn't have to keep track of what machine is busy?
So OpenMPI can work with Torque:
- How do I run jobs under Torque / PBS Pro?
The short answer is just to use mpirun as normal.
Open MPI automatically obtains both the list of hosts and how many
processes to start on each host from Torque / PBS Pro
directly. Hence, it is unnecessary to specify the --hostfile,
--host, or -np options to mpirun. Open MPI will use
PBS/Torque-native mechanisms to launch and kill processes ([rsh]
and/or ssh are not required).
Whee! So easy! Except that Rocks does not compile OpenMPI with
Torque support!
Because the Rocks project is kind of a broad umbrella, with lots of
sub-projects underneath, the Torque roll is separate from the OpenMPI
roll. Besides, installing one doesn't mean you'll install the other,
so it may not make sense to build OpenMPI that way.
The fine folks at Rocks asked the fine folks at OpenMPI and
found a way around this: have every job submitted to Torque/Maui
that uses MPI source /opt/torque/etc/openmpi-setup.sh. While not
efficient, it works; the recommended way, though, is to recompile
OpenMPI with Torque installed so that it knows about Torque.
To me, this makes the whole Rocks installation less useful,
particularly since this didn't seem terribly well documented. To
be fair, it is there in the Torque roll documentation:
Although OpenMPI has support for the torque tm-interface
(tm=taskmanager) it is not compiled into the library shipped with
Rocks (the reason for this is that the OpenMPI build process needs to
have access to libtm from torque to enable the interface). The best
workaround is to recompile OpenMPI on a system with torque
installed. Then the mpirun command can talk directly to the batch
system to get the nodelist and start the parallel application using
the torque daemon already running on the nodes. Job startup times for
large parallel applications are significantly shorter using the
tm-interface than using ssh to start the application on all nodes.
So maybe I should just shut my mouth.
In any event, I suspect I'll end up recompiling OpenMPI in order to
get it to see Torque.
Tags:
rocks
hpc
03 Mar 2011
There's a lot to clusters. I'm learning that now.
At $WORK, we're getting a cluster RSN -- rack fulla blades, head node,
etc etc. I haven't worked w/a cluster before so I'm practicing with a
test one: three little nodes, dual core CPUs, 2 GB memory each, set up
as one head node and two compute nodes. I'm using Rocks to manage it.
Here a few stories about things I've learned along the way.
BLAS
You find a lot of references to BLAS when you start reading software
requirements for HPC, and not a lot explaining it.
BLAS stands for "Basic Linear Algebra Subprograms"; the original web
page is here. Wikipedia calls it "a de facto application
programming interface standard for publishing libraries to perform
basic linear algebra operations such as vector and matrix
multiplication." This is important to realize, because, as in the
article, common usage of the term seems to refer to an API than
anything else; there's the reference implementation, but it's not
really used much.
As I understand it -- and I invite corrections -- BLAS chugs through
linear algebra and comes up with an answer at the end. Brute force is
one way to do this sort of thing, but there are ways to speed up the
process; these can make a huge difference in the amount of time it
takes to do some calculation. Some of these are heuristics and
algorithms that allow you to search more intelligently through the
search space. Some are ways of compiling or writing the library
routines differently, taking advantage of the capabilities of
different processors to let you search more quickly.
There are two major open-source BLAS implementations:
The Goto BLAS library is a hand-optimized BLAS implementation
that, by all accounts, is very fast. It's partly written in
assembler, and the guy who wrote it basically crafted it the way (I
think) Enzo Ferrari crafted cars.
ATLAS is another BLAS implementation. The ATLAS home page says
"it provides C and Fortran77 interfaces to a portably efficient BLAS
implementation, as well as a few routines from LAPACK." As noted in
the articles attached to this page, ATLAS tries many, many different
searches for a solution to a particular problem. It uses CPU
capabilities to do these searches efficiently.
As such, compilation of ATLAS is a big deal, and the resulting
binaries are tuned to the CPU they were built on. Not only do you
need to turn off CPU throttling, but you need to build on the CPU
you'll be running on. Pre-built packages are pretty much out.
ATLAS used to be included in the HPC roll of the Rocks 4
series. Despite irritatingly out-of-date information, this has
not been the case in a while.
[LAPACK] "is written in Fortran 90 and provides routines for
solving systems of simultaneous linear equations, least-squares
solutions of linear systems of equations, eigenvalue problems, and
singular value problems." It needs a BLAS library. From the
FAQ:
Why aren’t BLAS routines included when I download an LAPACK routine?
It is assumed that you have a machine-specific optimized BLAS library
already available on the architecture to which you are installing
LAPACK. If this is not the case, you can download a Fortran77
reference implementation of the BLAS from netlib.
Although a model implementation of the BLAS is available from netlib
in the blas directory, it is not expected to perform as well as a
specially tuned implementation on most high-performance computers --
on some machines it may give much worse performance -- but it allows
users to run LAPACK software on machines that do not offer any other
implementation of the BLAS.
Alternatively, you can automatically generate an optimized BLAS
library for your machine, using ATLAS (http://www.netlib.org/atlas/).
(There is an RPM called "blas-3.0" available for rocks; given the URL
listed (http://www.netlib.org/lapack/), it appears that this is the
model implementation listed above. This version is at
/usr/lib64/libblas.so*, and is in ldconfig.)
Point is, you'll want a BLAS implementation, but you've got two (at
least) to choose from. And you'll need to compile it yourself.
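If you happen to have numpy around (an assumption on my part; nothing
in Rocks requires it), a quick way to see which BLAS you actually
ended up with, and roughly how fast it is, looks like this:
# Rough check of which BLAS a numpy build is linked against, plus a
# quick matrix-multiply timing; the heavy lifting in numpy.dot is done
# by the underlying BLAS (reference, ATLAS, GotoBLAS, ...).
import time
import numpy

numpy.show_config()    # prints the BLAS/LAPACK libraries found at build time

n = 2000
a = numpy.random.rand(n, n)
b = numpy.random.rand(n, n)

start = time.time()
c = numpy.dot(a, b)
elapsed = time.time() - start

# A matrix multiply of two n x n matrices is about 2 * n^3 floating point ops.
print("%.2f GFlops" % (2.0 * n ** 3 / elapsed / 1e9))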
I get the impression that the choice of BLAS library is something that
can vary depending on religion, software, environment and so
on...which means you'll probably want to look at something like
modules to manage all this.
Tomorrow: Torque, Maui and OpenMPI.
Tags:
rocks
hpc
08 Feb 2011
Fell down a rabbit hole today when I was looking at a data sheet for
the Broadcom 5709c chipset. "RDMA over TCP (iWARP) - RDMAC 1.0
compliant". Huh?
This blog has a good overview of RDMA:
The rationale for RDMA is laid out in great detail in RFC 4297, but
the basic idea is that allowing network messages to carry information
about where they should be received and allowing the NIC to place the
data directly in that buffer allows fundamentally better
performance. [...]
With RDMA and iSCSI Extensions for RDMA (iSER, which is RFC 5046), the
target can send the data in response to a read command and have it
placed directly in the receive buffer on the initiator, which saves
the copy and uses 3x less memory bandwidth (which is huge if the data
is running at 10Gb/sec).
But there's more in that post, like this bit of drama on
the LKML from 2007:
How about we just remove the RDMA stack altogether? I am not at all
kidding. If you guys can't stay in your sand box and need to cause
problems for the normal network stack, it's unacceptable. We were
told all along the if RDMA went into the tree none of this kind of
stuff would be an issue.
These are exactly the kinds of problems for which people like myself
were dreading. These subsystems have no buisness using the TCP port
space of the Linux software stack, absolutely none.
After TCP port reservation, what's next? It seems an at least
bi-monthly event that the RDMA folks need to put their fingers
into something else in the normal networking stack. No more.
I will NACK any patch that opens up sockets to eat up ports or
anything stupid like that.
From Dave Miller, no less. Of course, that was four years
ago (shit!). Now where do things stand?
Well, there's this announcement from IBM of a Linux iWARP driver
and user library. And there's NFS over RDMA (is that the
right term?) in the kernel now, including iWARP support.
I'm not sure if this'll be useful at work, but it's interesting to
read about.
Tags:
28 Jan 2011
At $work I'm migrating slowly to Cfengine 3. One of the attractions
is the ability to do what this page shows: loop over lists in a
Cf-ish kind of way.
Here's the first bundle. (It's pretty much stolen from that page, but
customized for my environment.) It tells you some basic details about
the config file, the process name and the restart command for
different daemons:
bundle common services {
vars:
redhat|centos::
"cfg_file_prefix" string => "centos/5";
"cfg_file[ssh]" string => "/etc/ssh/sshd_config";
"daemon[ssh]" string => "sshd";
"start[ssh]" string => "/sbin/service sshd restart";
"enable[ssh]" string => "/sbin/chkconfig sshd on";
"cfg_file[iptables]" string => "/etc/sysconfig/iptables";
"start[iptables]" string => "/sbin/service iptables restart";
"enable[iptables]" string => "/sbin/chkconfig iptables on";
}
Here's the bundle that copies config files and restarts the daemon if
necessary:
bundle agent fix_service(service) {
files:
"$(services.cfg_file[$(service)])"
copy_from => secure_cp("$(g.masterfiles)/$(services.cfg_file_prefix)/$(services.cfg_file[$(service)])", "$(g.masterserver)"),
perms => mog("0600","root","root"),
classes => if_repaired("$(service)_restart"),
comment => "Copy a stock configuration file template from repository";
processes:
"$(services.daemon[$(service)])"
comment => "Check that the server process is running, and start if necessary",
restart_class => canonify("$(service)_restart");
commands:
"$(services.start[$(service)])"
comment => "Method for starting this service",
ifvarclass => canonify("$(service)_restart");
"$(services.enable[$(service)])"
comment => "Method for enabling this service",
ifvarclass => canonify("$(service)_restart");
}
And here's the loop that puts it all together:
bundle agent redhat {
vars:
"service" slist => { "ssh", "iptables" };
methods:
"any" usebundle => fix_service("$(service)"),
comment => "Make sure the basic application services are running";
}
I ran into a problem with this, though: it would always, without
fail, restart iptables even though no config file had been copied.
The problem was with the process check: there's no process to check
for with iptables. And from what I can tell, when the processes
stanza was asked to check for a non-existent variable, it checked for
the literal string $(services.daemon[$(service)]) -- that is,
dollar-bracket-s-e-r-v-.... Since there was no such thing, it decided
it needed restarting.
The way around this was to add this variable to the services bundle
(the one that has all the info about the daemons):
"daemon[iptables]" string => "cf_null";
I also had to modify the processes stanza:
processes:
"$(services.daemon[$(service)])"
comment => "Check that the server process is running, and start if necessary",
restart_class => canonify("$(service)_restart"),
ifvarclass => canonify("$(services.daemon[$(service)])");
That ifvarclass check on the last line says to run iff there is a
value for daemon. cf_null is a NULL value special to cfengine.
Since the check fails for iptables, the process check isn't run and
we only restart if we copy over a new config file.
Tags:
cfengine
27 Jan 2011
Today I installed Windows 7 Pro on a Macbook Pro with Boot Camp. I
ran into two problems I figured I should document:
First, Boot Camp would not proceed past the offer to burn a CD with
drivers for the disk. It's a bug, and I was able to ignore it
without problems; the networking and graphics came up w/o problems.
Second, after installing Windows Defender and rebooting, Windows
would freeze at the login screen. This too is a known problem,
with lots of suggestions on how to fix it. What worked for me was
booting into safe mode w/the command prompt (every other safe mode
would freeze), then disabling the Windows Defender service with
MMC. After that, I was able to boot; after that, I was able to
set it to "Automatic Start", reboot, and I've had no further
troubles.
Tags:
windows
debugging
27 Jan 2011
This past Sunday I picked up a reflector on Craigslist. It's an Omcon
811SE, a 114mm f/9 Newtonian. I hadn't heard of the name before, but
a quick search showed that they'd been built in the '90s here in
Vancouver, and at least one CloudyNights.com member has one listed in
their sig.
It seemed like it was in good condition. It came with a pretty sturdy
wooden tripod with an alt-az friction mount, a 6x30 finder scope, and
two eyepieces: an 18mm Kellner (50X), and a 7.5mm Plossl (120X). At
$40 (knocked down from $50!), it was too good a deal to pass up.
...And then the clouds came. OF COURSE. I spent my time adjusting
the finder and pointing it out the window, cursing water vapour under
my breath.
Finally, we got some clear-ish skies; there were lingering clouds, but
they seemed thin. I took it out to a park within walking distance of
my house. The skies are by no means dark, but it's easy to get there.
I set up the scope, popped in the 18mm, pointed it at a star, focussed
and...hey, a star! Definitely reassuring, since I wasn't sure about
how good the collimation was...it seemed okay to me, based on what I'd
read, but the mirror has no centre mark and it was hard to be sure
without actually testing it.
Finally, Orion came out, so I pointed it at M42...
...WOW.
WOW.
I'd only ever looked at it before in binoculars and my Galileoscope
(the only other scope I've had since people started calling me a
grownup), and...well, frankly it didn't seem like all that. I mean,
it was nice, but nothing spectacular. But this...THIS was
spectacular.
I swapped in the 7.5, and incredibly it seemed even better. It was
fainter, of course, but the narrower FOV seemed to focus my
attention more. I spent some time letting it drift across the
eyepiece, and began to notice dark spots, lanes and such. It was
amazing. I couldn't say I saw any colour, but I definitely know now
what the fuss is about.
I decided to try the Pleiades, and even at 50X -- such a narrow FOV
compared to binoculars! -- it was astonishing. I just kept repeating
"Oh wow" over and over again.
Finally, I decided to try splitting Eta Cassiopeiae. I missed it at
50X, found it at 120X and then saw it on second look at 50X. Can't
say I saw any colour difference between the two.
(I'd been REALLY hoping to see Jupiter, but the thrice-cursed clouds
hid it.)
Now, the scope isn't perfect. The mounting definitely needs to be
tweaked to make it easier to move (while not actually falling down),
and pointing it at the zenith is going to be difficult. The &?*#!
finderscope kept dewing over (first priority is to get my kids to make
me a dew shield; they're 4 and 2, so it'll be a great craft project
for them :-)). And even w/o much experience I can tell the eyepieces
aren't great...in the 18mm, stars get fuzzy or elongated at the edge
of the FOV, and the focus on the 7.5mm seems mushy/hard to achieve.
(Though I suppose that could be the mirror...I wouldn't know.)
But oh man, oh man, oh man...what a night. I couldn't be happier
with my new scope.
Tags:
astronomy