Linpack: a newbie's view

(Keep in mind: I don't know what I'm talking about. I've written this down because I've found very little that seems to explain this to a newbie. I'm probably wrong; if you know that I'm wrong, leave a comment.)

Linpack is "a software library for performing numerical linear algebra on digital computers". It has been superseded for that purpose by Lapack; now it's mainly used for benchmarking. The latest incarnation, used for scores on the Top 500 list, is called HPL (High Performance Linpack); it uses MPI for communication.

Why do you need to know this?

If you search for "linpack score", you'll find an astonishing number of people posting scores for their phone. If you're looking for information on maxing your score on your HTC Dream, I can't help you.

If you have a cluster, then near as I can tell there are two reasons to do this:

  1. Get high scores.
  2. Exercise your hardware and see what happens.

The first is the sort of measuring contest that you hope gets you into the Top 500 list. It probably affects your funding, and may affect your continued employment.

The second uses Linpack as a shakedown (did the test work? did anything break?), or as a way of benchmarking performance. Sometimes people will use Linpack scores as a baseline; they'll make tweaks to a cluster (add more memory, change MTU, turn off more daemons, twiddle BIOS settings, etc) and see what the effect is. Linpack is not perfect for this; it stresses CPU and FPU, possibly memory and network, and doesn't really check disk, power usage or other things. But it's a start, it's familiar, and it boils down to a single number.

(HAH.)

So how high can it go?

The theoretical peak score is:

CPU GHz x Flops/Hz x Cores/node x nodes

(Cite). Flops/Hz is CPU-specific; for the Nehalems, at least, it's 4 Flops/Hz. Thus, for the cluster I'm working on, the peak score is:

2.67 GHz x 4 Flops/Hz x 24 Cores/node x 35 nodes

(Note the assumption of HyperThreading turned on in the 24 cores/node figure; while I've got that turned on now, I should probably turn it off -- at least for running Linpack.)

Anyhow, that comes out to about 8970 GFlops, or almost 9 TFlops. By contrast, Tianhe-1A, the top entry in the November 2010 Top 500 list, has an RPeak of 4701000 GFlops -- so 4701 TFlops, or about 4.7 Petaflops. So there's no need to buy me that "I made the Top 500 and all I got was this lousy t-shirt" t-shirt yet.
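To make the arithmetic explicit, here's a quick back-of-the-envelope sketch in Python using the figures above (nothing clever, just the multiplication):

# Back-of-the-envelope theoretical peak (Rpeak) for my cluster.
ghz = 2.67            # clock speed, GHz
flops_per_hz = 4      # Nehalem: 4 Flops/Hz
cores_per_node = 24   # with HyperThreading on (see the note above)
nodes = 35

rpeak_gflops = ghz * flops_per_hz * cores_per_node * nodes
print("Theoretical peak: %.0f GFlops (%.1f TFlops)" % (rpeak_gflops, rpeak_gflops / 1000))
# Theoretical peak: 8971 GFlops (9.0 TFlops)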

Of course, that's a theoretical peak, and a lot depends on the way your system is configured. For example, [[https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2010-May/047136.html][this post]] to the Rocks-discuss mailing list says:

FYI, I've got around 84-85% on a cluster with Infiniband and OpenMPI, but some people told me they get better results.

That's 85% of the theoretical max. And it depends on Infiniband. Jeezum Crow.
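For what it's worth, here's what that 85% figure would mean for my cluster -- a hypothetical target based on the peak calculated above, not a measured score:

# What would 85% of my theoretical peak look like?  (Hypothetical target,
# not a measured result.)
rpeak_gflops = 8971.2    # from the calculation above
efficiency = 0.85        # the figure quoted on Rocks-discuss
print("Expected score at 85%%: %.0f GFlops" % (rpeak_gflops * efficiency))
# Expected score at 85%: 7626 GFlops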

Configuration file tuning

Linpack uses a configuration file named HPL.dat. The format is a little non-obvious, but is documented here. Here's a sample file, as generated by cbench (about which more later):

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
3            # of problems sizes (N)
108304 346573 368233  # N -- fantastically important; see ahead
1            # Number of block sizes
80 112 96    # Block sizes
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
25           # Ps
26           # Qs
8.0          threshold
1            # of panel fact
0 2 1        PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4 2          NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1 2 0        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0 3 1 2 4    BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
256          swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
0            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)

(Not the best format; sorry.)

P and Q, multiplied, should be the number of cores you want to use (thanks, Tim Doug) -- and so need to match the parameters you pass to Torque, or whatever batching system you're using.
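The usual advice (folklore more than anything I can point to in the HPL docs) is to make the grid as close to square as possible, with P <= Q. Here's a rough sketch of picking P and Q for a given core count; pick_grid is just my own throwaway helper, not anything from HPL or cbench:

# Rough sketch: pick a process grid P x Q for a given number of cores,
# aiming for as close to square as possible with P <= Q.
def pick_grid(cores):
    best = (1, cores)
    for p in range(1, int(cores ** 0.5) + 1):
        if cores % p == 0:
            best = (p, cores // p)   # p only grows, so the last hit is the squarest
    return best

print(pick_grid(650))   # (25, 26) -- the grid in the sample HPL.dat above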

Block sizes: Tim Doug says: "128 works well for me. Others suggest 80, 160, or 256. Experiment." cbench uses 80, 112 and 96; I think this is just how they do things. I see a peak (in very early tests) around 150.

N: The biggie. This posting to the Rocks-discuss mailing list gives an excellent overview of how N works in the Linpack test:

If you really want to stress your cluster, you want to have your matrix size fill approximately 80% of memory. For an NxN matrix, you consume N x N x 8 bytes. If you have a 16-node cluster, for example, with 8 GB memory/node, then you have 16 x 8 GB x 0.80 = 102 GB. N would be approximately sqrt(102e9/8) ~ 113,000.

That's a pretty big matrix and takes O(113,000^3) or 1.4 quadrillion floating ops. If your nodes were 8-core, 2.5 GHz, 4 Flops/cycle, then one would expect this matrix to factor (at reasonable computational efficiency) in the 30-90 minute range. The exact time depends on efficiency, the constant on the O(n^3) term, and the actual speed of your processors and network.

HPL will allow you to set up various matrix sizes. Set up something that will compute quickly, e.g. a 1000x1000 matrix, to verify that everything is happy; then step through some sizes that will take 1-5 minutes to factor. This will allow you to calibrate the time you expect the full load to run. Remember, each doubling of matrix size results in 8X the number of floating ops. You get more efficiency as you get larger (more computation relative to communication), but it starts to level off pretty quickly. For most interconnects, using ~20% of memory is usually a decent indicator of ultimate system performance; if 20% takes 1 minute to actually compute, you expect the 80% run (8X) to take about an hour.

Full machine LINPACK runs can take many hours to run.

(It's worth emphasizing that Linpack really does take a long time to run with large N; this discussion shows how to start small and ramp up your Linpack tests as you gain confidence.)
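Here's that sizing arithmetic as a sketch, using the 16-node, 8 GB/node example from the quote. pick_n is just my own throwaway helper; rounding N down to a multiple of the block size NB is a common suggestion I've seen, not something I can cite chapter and verse for:

# Sketch: pick N so the N x N matrix of 8-byte doubles fills a given
# fraction of total cluster memory, then estimate how the work grows.
def pick_n(nodes, gb_per_node, fraction, nb=112):
    matrix_bytes = nodes * gb_per_node * 1e9 * fraction
    n = int((matrix_bytes / 8) ** 0.5)   # N * N * 8 bytes = matrix_bytes
    return (n // nb) * nb                # round down to a multiple of NB

n20 = pick_n(16, 8, 0.20)   # quick-ish calibration run
n80 = pick_n(16, 8, 0.80)   # the full-stress run
print(n20, n80)                    # 56560 113120 -- the quote's ~113,000
print((float(n80) / n20) ** 3)     # 8.0 -- 8X the work, as the quote says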

cbench, which I mentioned above, is a suite of programs that are meant to exercise and benchmark a cluster. It's really quite excellent, but as with a lot of things in HPC the documentation isn't as explicit as it could be. For example, the three N figures in the sample file above range from a relatively small problem up to ones meant to fill most of the cluster's memory.

Thus, it's in good agreement with the approach outlined above: roughly 20% of memory for a short run (good for a ballpark figure) and 80% - 85% for longer runs (to really stress things).

So if you run Linpack a bunch of times and tweak the parameters, you'll see different results. This page discusses why:

The parallel solution of a system of linear equations requires some communication between the processors. To measure the loss of efficiency due to this communication, we solved systems of equations of varying size on a varying number of processors. The general rule is: larger N means more work for each CPU and less influence of communication. As you can see from Fig. 1, a 4-CPU setup comes very close to the single-CPU peak performance of 528 Mflops. This indicates that the solver that works in HPL is not significantly worse than ATLAS. The relative speed per CPU decreases with increasing number of CPUs, however.

The problem size N is limited by the total memory. Tina has 512 MByte per node, i.e. each node can hold at most an 8192x8192 matrix of double precision floats. In practice, the matrix has to be smaller since the system itself needs a bit of memory, too. If both CPUs on a node are operating, the maximum size reduces to 5790x5790 per CPU. To minimize the relative weight of communication, the memory load should be as high as possible on each node. In Fig. 2 you can see how the effective speed increases with increasing load factor. A load factor of 1 means that 256 MByte are required on each node to hold the NxN coefficient matrix.

[...]

With all 144 CPUs, communication becomes the major bottleneck. The current performance of 41 Gflops scales down to 284 Mflops/CPU [as compared to 528 MFlops for N=5000 on a single 4-CPU system]. The CPUs seem to spend almost half of their time chatting with each other...
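Just to spell out where "almost half of their time chatting" comes from (my arithmetic, based only on the numbers in the quote):

# Per-CPU speed and efficiency for the 144-CPU run described above.
total_gflops = 41.0              # whole-machine score from the quote
cpus = 144
single_cpu_peak_mflops = 528.0   # single-CPU figure from the quote

per_cpu_mflops = total_gflops * 1000 / cpus
print("%.0f MFlops/CPU" % per_cpu_mflops)   # ~285 (the quote rounds to 284)
print("%.0f%% of single-CPU speed" % (100 * per_cpu_mflops / single_cpu_peak_mflops))
# ~54% -- i.e. nearly half the potential speed lost to communication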

Really, as someone else said, it's a black art. There are a ton of papers out there on optimizing Linpack parameters. There's even -- and I am crapping you negative -- a software project called ga-linhack that aims to "Develop a complete genetic algorithm tool set for determining optimal parameters for Linpack runs." Because as they say:

To most cluster engineers (the authors included) the tuning explanations of the hpl parameters yield little clue as to the underlying effect of varying these parameters. Not everyone can take a graduate mathematics course in advanced linear algebra in their free time.

Testify!

Why are we doing this again?

This page (mentioned above) also has this quote:

After this, look for the top 8 or 16 results, and refine the config file to use only the parameters that produced these results.

...which for me brought up a lot of questions.

I'm still figuring all this out.

Other resources