Woodcrest Cluster

The RRZE's Woodcrest cluster (termed "Woody") (Bechtle/HP) is a high-performance compute resource with high speed interconnect. It is intended for
distributed-memory (MPI) or hybrid parallel programs with medium to high communication requirements.
The system entered the November 2006 Top500 list on rank 124 and is now (November 2007) ranked number 329.
- 217 compute nodes, each with two Xeon 5160 "Woodcrest" chips (4 cores) running at 3.0 GHz with 4 MB shared Level 2 cache per dual core, 8 GB of RAM and 160 GB of local scratch disk
- InfiniBand interconnect fabric with 10 GBit/s bandwidth per link and direction
- 2 frontend systems with the same features as the compute nodes but 320 GB of local scratch disk
- 1 NFS file server with a capacity of 15 TB
- Parallel file system (HP SFS) with a capacity of 15 TB and an aggregated parallel I/O bandwidth of > 900 MB/s
- Overall peak performance of 10.4 TFlop/s (6.62 TFlop/s LINPACK)
Woody is designed for running parallel programs that use significantly more than one node. Jobs that use less than one full node are not supported by RRZE and may be killed without notice.
This website shows information regarding the following topics:
- Access, User Environment, File Systems
- Software Development
- Parallel Computing
- Important Libraries
- Batch Processing
- Further Information
Access, User Environment, and File Systems
Access to the machine
Access to the system is granted through a number of frontend nodes (currently two) via ssh. Please connect to
woody.rrze.uni-erlangen.de
and you will be randomly routed to one of the frontends. All systems in the
cluster, including the frontends, have private IP addresses in the
10.188.82.0/23 range. Thus they can only be accessed
directly from within the FAU
networks. If you need access from outside of
FAU you have to connect
for example to the dialog server cshpc.rrze.uni-erlangen.de
first and then ssh
to Woody from there. While it is possible to ssh directly to a compute node,
a user is only allowed to do this when they have a batch
job running there. When all batch jobs of a user on a node have ended,
all of their shells will be killed automatically.
The login and compute nodes run 64-bit SuSE Linux Enterprise Server. As on most
other RRZE HPC systems, a modules
environment is provided to facilitate access to software packages.
Type "module avail" to get a list of available packages.
File Systems
The following table summarizes the available file systems and their features:
| Mount point | Access via | Purpose | Technology, size | Backup | Data lifetime | Quota |
|---|---|---|---|---|---|---|
| /home/rrze | $HOME | Storage of source, input and important results | NFS on RRZE servers, small | YES | Account lifetime | YES (very restrictive) |
| /home/woody | $WOODYHOME | Cluster-local large volume storage | NFS, 15 TB | NO | Account lifetime | YES |
| /wsfs | $FASTTMP | High performance parallel I/O; short-term storage | HP SFS parallel file system via InfiniBand, 15 TB | NO | High watermark deletion | NO |
| /tmp | $TMPDIR | Temporary job data directory | Node-local RAID0 array, 130 GB | NO | Job runtime | NO |
NFS file systems $HOME and $WOODYHOME
When connecting to one of the front end nodes, you'll find yourself in your regular
RRZE
$HOME directory (/home/rrze/...). The cluster also has its own
home directory tree which can be accessed much faster. This local directory tree
is mounted at /home/woody/$GROUP/$USER/ and available via the shell
variable $WOODYHOME. Please note that there is no backup by the
RRZE
for any data stored in this local tree!
Quotas are active on $WOODYHOME. New users get a standard quota
of 10 GBytes; more space is available on request. All users should regard
disk space as a valuable resource and not use it as a long-term archive.
Parallel file system $FASTTMP
The cluster's parallel file system is mounted on all nodes under
/wsfs/$GROUP/$USER/ and available via the $FASTTMP
environment variable. It supports parallel I/O using the
MPI-I/O functions and can be accessed with an aggregate bandwidth
of >900 MB/sec (and even much larger if caching effects can be used).
The parallel file system is strictly intended to be a high-performance
short-term storage, so a high watermark deletion algorithm
is employed: When the fill level of the file system exceeds a certain limit
(e.g. 80%), files will be deleted starting with the oldest and largest files
until the fill level drops below 60%. Be aware that the normal tar -x
command preserves the modification time of the original file instead of the time
when the archive is unpacked. So unpacked files may become one of the first candidates
for deletion. Use tar -mx or touch in combination with
find to work around this. Be aware that the exact time of deletion is
unpredictable.
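The workaround with tar -m and find/touch can be sketched as two small shell functions. The function names unpack_fresh and refresh_mtime are illustrative only and are not tools installed on the cluster.

```shell
#!/bin/bash
# Unpack an archive so that the extracted files carry the current time as
# modification time (-m tells tar not to restore the archived mtime).
unpack_fresh() {
    local archive=$1 target=$2
    mkdir -p "$target" && tar -xmf "$archive" -C "$target"
}

# Reset the modification time of all regular files below a directory,
# e.g. after unpacking an archive without -m.
refresh_mtime() {
    find "$1" -type f -exec touch {} +
}
```

A possible call would be unpack_fresh input.tar "$FASTTMP/run1", where input.tar and run1 are placeholder names. Refreshing time stamps only changes the order in which files become deletion candidates; it does not protect data on $FASTTMP.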
Node-local storage $TMPDIR
Each node has 130 GB of local hard drive capacity for temporary files
available under /tmp/ (also accessible via /scratch/).
All files in these directories which are older than a certain number of days
(currently 12) will be deleted automatically without any notification.
If possible, compute jobs should use the local disk for scratch space as this reduces the load on the central servers. Important data to be kept can be copied to a cluster-wide volume at the end of the job, even if the job is cancelled by a time limit. See the section on batch processing for details.
In batch scripts the shell variable $TMPDIR points to a node-local,
job-exclusive directory whose lifetime is limited to the duration of the batch job.
This directory exists on each node of a parallel job separately (it is not shared between
the nodes). It will be deleted automatically when the job ends. Please see the
section on batch processing for examples on how to use
$TMPDIR.
Software Development
You will find a wide variety of software packages in different versions installed on the cluster frontends. The module concept is used to simplify the selection and switching between different software packages and versions. Please see the section on batch processing for a description of how to use modules in batch scripts.
Compilers
Intel
Intel compilers are the recommended choice for software development on Woody. A current version
of the Fortran90, C and C++ compilers (called ifort, icc and
icpc, respectively) can be selected by loading the intel64
module. For use in scripts and makefiles, the module sets the shell variables
$INTEL_F_HOME and $INTEL_C_HOME to the base directories
of the compiler packages.
As a starting point, try to use the option combination -O3 -xP when
building objects. All Intel compilers have a -help switch
that gives an overview of all available compiler options. For in-depth information
please consult the local docs in $INTEL_[F,C]_HOME/doc/ and
Intel's online documentation for
C/C++
and
Fortran
compilers.
These compilers generate 64-bit objects. Production and use of 32-bit objects is not supported by RRZE, although you might be able to successfully run pre-built 32-bit binaries.
Endianness
All x86-based processors use the little-endian storage format
which means that the least significant byte (LSB) of multi-byte data has the lowest memory address. The same format
is used in unformatted Fortran data files. To simplify the handling of big-endian files
(e.g. data you have produced on IBM Power, Sun Ultra, or NEC SX systems)
the Intel Fortran compiler has the ability to convert the endianness on the fly
in read or write operations. This can be configured separately for different
Fortran units. Just set the environment variable F_UFMTENDIAN at run-time.
Examples:
| F_UFMTENDIAN= | Effect |
|---|---|
| big | everything treated as BE |
| little | everything treated as LE (default) |
| big:10,20 | everything treated as LE, except for units 10 and 20 |
| "big;little:8" | everything treated as BE, except for unit 8 |
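As a minimal sketch, the variable can be set for a single program run instead of for the whole shell session. The function name run_with_big_endian_units is made up for this illustration, and the program passed to it stands for any binary built with the Intel Fortran compiler.

```shell
#!/bin/bash
# Run a command with unformatted Fortran I/O on units 10 and 20 treated as
# big-endian; all other units stay little-endian. The setting applies only
# to this one command, not to the rest of the shell session.
run_with_big_endian_units() {
    F_UFMTENDIAN="big:10,20" "$@"
}
```

For example, run_with_big_endian_units ./a.out starts a.out with the conversion active. Values that contain a semicolon, such as "big;little:8", must be quoted because the shell would otherwise treat the semicolon as a command separator.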
GNU
The GNU compiler collection (GCC) is available directly without having to load any module. As the cluster is running an enterprise version of SuSE Linux, do not expect to find the latest GCC version here. Be aware that the default Intel MPI module assumes the Intel compiler and does not work with the GCC. For details see the section on parallel computing.
DDT Parallel Debugger
DDT (Distributed Debugging Tool),
sold by
Allinea, is a parallel GUI-based
source-level debugger, similar to Totalview which is installed on the
Transtec cluster. With DDT you can debug serial
and MPI-parallel programs, i.e. single step, set breakpoints, inspect variables etc.
To use DDT, the following steps should be performed:
- Connect to a Woody frontend with X forwarding enabled (-X option to ssh) and start an interactive batch job with the -X switch to qsub.
- Load the ddt module and execute the ddt command. Do not put it to the background!
- If this is the first time you use DDT, you are prompted to create a new configuration file. Choose "intel-mpi" as the MPI implementation and check "Do not configure DDT for attaching this time" on the next screen. Finally, accept the location of the config file that is suggested to you.
- In the "session control" window, click on "advanced". Enter the path to your executable in the top box ("application") and specify any command line arguments below. At the bottom of the window, select the number of processes you wish to start.
- Again in the session control window, click on "Change" and select "Submit job through batch or configure own mpirun command". In the "Submit command" input box type "mpirun -n NUM_PROCS_TAG -ddt PROGRAM_ARGUMENTS_TAG". The placeholders NUM_PROCS_TAG and PROGRAM_ARGUMENTS_TAG will get substituted automatically when the command is run.
- Click on "Submit" to start your application.
- For serial applications, select "none" as the MPI implementation.
There are many more options for debugging with DDT. Full documentation is accessible
in the GUI via the Help menu or in ${DDT}/doc/.
Check in particular the userguide.pdf and the quickstart-*.pdf
documents. If you think you have encountered a bug, please contact
hpc@rrze.
DDT is a commercial application with considerable license fees. The number of concurrent processes that can be run under DDT's control is limited. Please exit DDT at the end of your debugging session to free resources for other users.
MPI Profiling with Intel Trace Collector/Analyzer
Intel Trace Collector/Analyzer are powerful tools that acquire/display information on the communication behaviour of an MPI program. Performance problems related to MPI can be identified by looking at timelines and statistical data. Appropriate filters can reduce the amount of information displayed to a manageable level.
In order to use Trace Collector/Analyzer you have to load the itac module.
This section describes only the most basic usage patterns.
Complete documentation can be found in ${VT_ROOT}/doc/, on
Intel's ITAC website,
or in the Trace Analyzer Help menu.
Trace Collector (ITC)
ITC is a tool for producing tracefiles from a running MPI application. These traces contain information about all MPI calls and messages and, optionally, on functions in the user code. To use ITC in the standard way you only have to re-link your application. If you want to add user function information to the trace, the code must be instrumented manually using the ITC API and recompiled. Please note that we currently support Intel MPI only.
| Variable | Use | Example | Comments |
|---|---|---|---|
| $ITC_LIB | Link against ITC libraries | mpif90 *.o -o a.out $ITC_LIB | Place after object files (but before any MPI library) on linker command line! Trace files are not written if MPI code does not finish correctly. |
| $ITC_LIBFS | Link against "failsafe" ITC libraries | mpif90 *.o -o a.out $ITC_LIBFS | Place after object files (but before any MPI library) on linker command line! Use this variant for MPI codes that do not finish correctly. More intrusive than $ITC_LIB. |
| $ITC_INC | Include directory with ITC API headers | mpicc $ITC_INC -c hello.c | - |
After an MPI application that has been compiled or linked with ITC has terminated,
a collection of trace files is written to the current directory. They follow
the naming scheme <binary-name>.stf* and serve as input for
the Trace Analyzer tool.
Trace Analyzer (ITA)
The <binary-name>.stf file produced after running the instrumented
MPI application should be used as an argument to the traceanalyzer
command:
traceanalyzer <binary-name>.stf
The trace analyzer processes the trace files written by the application and lets you
browse through the data. Click on "Charts-Event Timeline" to see the messages transferred
between all MPI processes and the time each process spends in MPI and application
code, respectively. Click and drag lets you zoom into the timeline data (zoom out with the "o"
key). "Charts-Message profile" shows statistics about the communication
requirements of each pair of MPI processes. The statistics displays change their
content according to the currently displayed data in the timeline window.
Please consult the Help menu or the docs in ${VT_ROOT}/doc/
to get more information. Additionally, the HPC group
of RRZE will be happy to work with
you on getting insight into the performance characteristics of your
MPI applications.
Parallel Computing
The intended parallelization paradigm on Woody is message passing using the
Message Passing Interface (MPI).
Intel compilers also support shared-memory programming in a node with
OpenMP.
OpenMP
The installed Intel compilers support the
OpenMP standard in version 2.5.
The compiler recognizes OpenMP directives if you supply the command line
option -openmp. This is also required for the link step.
Intel has kindly provided a temporary license for their
Cluster OpenMP
product which makes it possible to use OpenMP programs across the
cluster interconnect. If you are interested in using Cluster OpenMP,
please contact hpc@rrze.
MPI
Although the cluster is basically able to support many different MPI
versions, we maintain and recommend Intel MPI.
Intel MPI supports
different compilers (GCC, Intel). If you use Intel compilers, the appropriate
intelmpi module is loaded automatically upon loading the intel64 compiler module.
The standard MPI scripts mpif77, mpif90, mpicc
and mpicxx are then available. By loading an intelmpi/3.XXX-gnu
module instead of the default intelmpi, those scripts will
use the GCC.
There are no special prerequisites for running MPI programs. Just use
mpirun [<options>] your-binary your-arguments
By default, one process will be started on each allocated CPU
(4 per node) in a blockwise fashion, i.e. the first node is filled completely,
followed by the second node etc. If you want to start n<4
processes per node (e.g. because of large memory requirements)
you can specify the -npernode n option to mpirun
(-pernode is equivalent to -npernode 1).
Finally, if you want to start fewer processes than
CPUs available you can add the -np N option which will only start
N processes.
Examples: We assume that the batch system has allocated 8 nodes (32 processors) for the job.
mpirun a.out
will start 32 processes. If r is the rank of an MPI
process, rank r will run on node (r / 4), using integer division and counting nodes from 0.
mpirun -npernode 2 a.out
will start 16 processes, and rank r will run on node
(r / 2).
mpirun -pernode -np 4 a.out
will start 4 processes, each on its own node. I.e., 4 of the 8 allocated nodes stay empty. Note that it is currently not possible to start more processes than processors allocated.
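The blockwise placement described in this section (the first node is filled completely, then the second, and so on) can be checked with a little integer arithmetic. This is only a sketch: the function name node_of_rank is invented here, nodes are counted from 0, and it is assumed that -npernode fills nodes in the same blockwise order.

```shell
#!/bin/bash
# Node index (counting from 0) that hosts MPI rank r when the nodes are
# filled blockwise with ppn processes each: integer division r / ppn.
node_of_rank() {
    local rank=$1 ppn=$2
    echo $(( rank / ppn ))
}

# 8 nodes with the default of 4 processes per node (32 ranks in total)
for r in 0 3 4 31; do
    echo "rank $r -> node $(node_of_rank "$r" 4)"
done
```

With 4 processes per node, ranks 0 to 3 share node 0 and rank 31 ends up on node 7, the last of the 8 allocated nodes.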
We do not support running MPI programs interactively on the frontends. To do interactive testing, please start an interactive batch job on some compute nodes. During working hours, a number of nodes is reserved for short (< 1 hour) tests.
The MPI start mechanism communicates all environment variables
that are set in the shell where mpirun is running
to all MPI processes. Thus it is not required to change
your login scripts in order to export things like OMP_NUM_THREADS,
LD_LIBRARY_PATH etc.
Libraries
Mathematical Libraries
Intel [Cluster] Math Kernel Library ([C]MKL)
The
Math Kernel Library
provides threaded BLAS, LAPACK, and FFT routines and some supplementary
functions (e.g., random number generators). For distributed-memory
parallelization there is also SCALAPACK and CDFT (cluster DFT), together
with some sparse solver subroutines.
It is highly recommended to use MKL for any kind of linear algebra if possible.
After loading the mkl module, several shell variables
are available that help with compiling and linking programs that
use MKL:
| Variable | Use | Example |
|---|---|---|
| $MKL_INC | Compiler option(s) for MKL include search path. | icc -O3 $MKL_INC -c code.c |
| $MKL_SHLIB | Linker options for dynamic linking of LAPACK, BLAS, FFT | ifort *.o -o prog.exe $MKL_SHLIB |
| $MKL_LIB | Linker options for static linking of LAPACK, BLAS, FFT | ifort *.o -o prog.exe $MKL_LIB |
| $MKL_SCALAPACK | Linker options for SCALAPACK (includes LAPACK, BLAS, FFT) | mpicc *.o -o parsolve.exe $MKL_SCALAPACK |
| $MKL_CDFT | Linker options for Cluster DFT functions (includes BLAS, FFT) | mpif90 *.o -o parfft.exe $MKL_CDFT |
Many MKL routines are threaded and can run in parallel by setting the OMP_NUM_THREADS
shell variable to the desired number of threads. If you do not set OMP_NUM_THREADS,
the default number of threads is one. Using OpenMP together with threaded MKL
is possible, but the OMP_NUM_THREADS setting will apply to both
your code and the MKL routines. If you don't want this it is
possible to force MKL into serial mode by setting the MKL_SERIAL environment
variable to YES.
For more in-depth information, please refer to Intel's
online documentation on MKL.
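The two threading setups described above can be sketched as shell wrappers that set the variables for one program run only. The function names are illustrative, and the program passed to them is a placeholder for a binary linked against MKL.

```shell
#!/bin/bash
# Run a command with the given number of threads; this applies to threaded
# MKL routines and to OpenMP regions in the program itself.
run_threaded() {
    local nthreads=$1
    shift
    OMP_NUM_THREADS="$nthreads" "$@"
}

# Run a command with OpenMP threads in the user code while MKL is forced
# into serial mode.
run_threaded_serial_mkl() {
    local nthreads=$1
    shift
    OMP_NUM_THREADS="$nthreads" MKL_SERIAL=YES "$@"
}
```

For instance, run_threaded 4 ./prog.exe would use all four cores of a node for the MKL routines called by prog.exe.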
FFTW
FFTW is a high-performance, free library for Fast Fourier Transforms.
It is used by many software packages. We provide a current version of FFTW that is compatible
with the Intel compilers by the fftw module.
| Variable | Use | Example |
|---|---|---|
| $FFTW_INC | Compiler option(s) for FFTW include search path. | icc -O3 $FFTW_INC -c code.c |
| $FFTW_LIB | Linker options for (static) linking of FFTW | ifort *.o -o prog.exe $FFTW_LIB |
| $FFTW_BASE | Base directory of FFTW installation | - |
The fftw-wisdom and fftw-wisdom-to-conf tools
and their manual pages are also provided in the respective search paths.
Batch Processing
All user jobs except short serial test runs must be submitted to the cluster
by means of the
Torque
Resource Manager. The submitted jobs are routed into a number
of queues (depending on the needed resources, e.g. runtime) and sorted according to some
priority scheme. The queue configuration is as follows:
| Queue | min - max walltime | min - max nodes | Availability | Comments |
|---|---|---|---|---|
| route | N/A | N/A | all users | Default router queue; sorts jobs into execution queues |
| devel | 0 - 01:00:00 | 1 - 16 | all users | Some nodes reserved for queue during working hours |
| work | 01:00:01 - 24:00:00 | 1 - 64 | all users | "Workhorse" |
| special | 0 - infinity | 1 - all | special users | Direct job submit with -q special |
A job will run when the required resources become available.
For short test runs with less than one hour of runtime, a number of nodes
is reserved during working hours. These nodes are dedicated to the
devel queue. Do not use the devel queue
for production runs. Since we do not allow MPI-parallel applications
on the frontends, short parallel test runs must be performed
using batch jobs.
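To see which execution queue a request is likely to land in, the walltime windows from the queue table can be written down as a small shell function. This is a sketch of the table only, not of the actual routing logic in Torque: the function names are invented here, and the node-count limits of the queues are deliberately ignored.

```shell
#!/bin/bash
# Convert a walltime given as HH:MM:SS into seconds.
walltime_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Name the execution queue whose walltime window contains the request.
expected_queue() {
    local t
    t=$(walltime_seconds "$1")
    if [ "$t" -le 3600 ]; then
        echo devel
    elif [ "$t" -le 86400 ]; then
        echo work
    else
        echo "none (more than 24 hours requires the special queue)"
    fi
}

expected_queue 00:30:00   # prints "devel"
expected_queue 06:00:00   # prints "work"
```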
It is also possible to submit interactive jobs that, when started, open a shell on one of the assigned compute nodes and let you run interactive (including X11) programs there.
The command to submit jobs is called qsub. To submit a batch job
use
qsub <further options> [<job script>]
The job script may be omitted for interactive jobs (see below). After
submission, qsub will output the Job ID of your
job. It can later be used for identification purposes and
is also available as $PBS_JOBID in job scripts (see below). These are
the most important options for the qsub command:
| Option | Meaning |
|---|---|
| -N <job name> | Specifies the name which is shown with qstat. If the option is omitted, the name of the batch script file is used. |
| -o <standard output file> | File name for the standard output stream. If this option is omitted, a name is compiled from the job name (see -N) and the job ID. |
| -e <error output file> | File name for the standard error stream. If this option is omitted, a name is compiled from the job name (see -N) and the job ID. |
| -l nodes=<# of nodes>[:ppn=[1\|2\|3\|4]] | Specifies the requested nodes or CPUs. Without ppn=... only a single CPU per node is allocated. The other CPU(s) are considered to be available by Torque. On Woody you must always specify ppn=4. |
| -l walltime=HH:MM:SS | Specifies the required wall clock time (runtime). When the job reaches the walltime given here it will be sent a TERM signal. After 60 seconds, if the job has not ended yet, it will be sent KILL. See the section on stage-out below for hints how to use this delay for saving important data. If you omit the walltime option a short default time is used. Please specify a reasonable runtime, since the scheduler bases its decisions also on this value (short jobs are preferred). |
| -M x@y -m abe | You will get e-mail to x@y when the job is aborted (a), starting (b), and ending (e). You can choose any subset of abe for the -m option. |
| -W depend=<dependency list> | Makes the job depend on certain conditions. E.g., with -W depend=afterok:12345 the job will only run after Job 12345 has ended successfully, i.e. with an exit code of zero. Please consult the qsub man page for more information. |
| -r [y\|n] | Specifies if the job is rerunnable (y, default) or not (n). Under some (error) conditions, Torque will decide to re-queue jobs that had already been running before the error occurred. If a job is not suited for this kind of action, use -r n. |
| -I | Interactive job. It is still allowed to specify a job script, but it will be ignored except for the PBS options. No code will be executed. Instead, the user will get an interactive shell on one of the allocated nodes and can execute any command there. In particular, you can start a parallel program with mpirun. |
| -X | Enable X11 forwarding. If the $DISPLAY environment variable is set when submitting the job, an X program running on the compute node(s) will be displayed at the user's screen. This makes sense only for interactive jobs (see -I option). |
| -q <queue> | Specifies the Torque queue (see above); default queue is route. Usually it is not required to use this parameter as the route queue automatically forwards the job to an appropriate execution queue. |
Parallel jobs are required to request all CPUs in a node (ppn=4).
Using more than one node and less than 4 CPUs per node is not supported and may result in your jobs being
killed without further notice.
There are several Torque commands for job inspection and control. The following table gives a short summary:
| Command | Purpose | Options |
|---|---|---|
| qstat [<options>] [<JobID>\|<queue>] | Displays information on jobs. Only the user's own jobs are displayed. For information on the overall queue status see the section on job priorities. | -a display "all" jobs in user-friendly format; -f extended job info; -r display only running jobs |
| qdel <JobID> ... | Removes job from queue | - |
| qalter <qsub-options> | Changes job parameters previously set by qsub. Only certain parameters may be changed after the job has started. | see qsub and the qalter manual page |
| qcat [<options>] <JobID> | Displays stdout/stderr from a running job | -o display stdout (default); -e display stderr; -f output appended data as the job is running (like tail -f) |
Batch Scripts
To submit a batch job you have to write a shell script that contains all the commands to be executed. Job parameters like estimated runtime and required number of nodes/CPUs can also be specified there:
#!/bin/bash -l
#
# allocate 16 nodes (64 CPUs) for 6 hours
#PBS -l nodes=16:ppn=4,walltime=06:00:00
#
# job name
#PBS -N Sparsejob_33
#
# stdout and stderr files
#PBS -o job33.out -e job33.err
#
# first non-empty non-comment line ends PBS options
# jobs always start in $HOME -
# change to a temporary job directory on $FASTTMP
mkdir ${FASTTMP}/$PBS_JOBID
cd ${FASTTMP}/$PBS_JOBID
# copy input file from location where job was submitted
cp ${PBS_O_WORKDIR}/inputfile .
# run
mpirun ${WOODYHOME}/bin/a.out -i inputfile -o outputfile
# save output on parallel file system
mkdir -p ${FASTTMP}/output/$PBS_JOBID
cp outputfile ${FASTTMP}/output/$PBS_JOBID
cd
# get rid of the temporary job dir
rm -rf ${FASTTMP}/$PBS_JOBID
The comment lines starting with #PBS are ignored by the shell
but interpreted by Torque as options for job submission (see above for
an options summary). These options can all be given on the qsub
command line as well. The example also shows the use of the
$FASTTMP and $WOODYHOME variables. $PBS_O_WORKDIR
contains the directory where the job was submitted. All batch scripts start executing
in the user's $HOME so some sort of directory change is always
in order.
If you have to load modules from inside a batch script, you can do so.
The only requirement is that you have to use either a csh-based shell
or bash with the -l switch, like in the example above.
Interactive Jobs
The resources of the Woody cluster are mainly available in batch mode.
However, for testing purposes or when running applications that require
some manual intervention (like GUIs), Torque offers interactive access
to the compute nodes that have been assigned to a job. To do this,
specify the -I option to the qsub command
and omit the batch script.
When the job is scheduled, you will get a shell on the master node
(the first in the assigned job node list). It is possible to
use any command, including mpirun, there. If you need
X forwarding, use the -X option in addition to
-I.
Note that the starting time of an interactive batch job cannot reliably
be determined; you have to wait for it to get scheduled. Thus we recommend
always running such jobs with wallclock time limits of less than one hour so
the job will be routed to the devel queue for which
a number of nodes is reserved during working hours.
Interactive batch jobs do not produce stdout and stderr
files. If you want a record of the session, use e.g. the UNIX
script command.
Staging Out Results
Warning! This does not work with the current version of the batch system due to a software bug!
When a job reaches its walltime limit, it will be killed by
the batch system. The job's node-local data will either get
deleted (if you use $TMPDIR) or be inaccessible
because login to a node is disallowed if you don't have a job
running there. In order to prevent data loss, Torque waits 60
seconds after the TERM signal before sending the
final KILL. If the batch script catches
TERM with a signal handler, those 60 seconds can
be used to copy node-local data to a global file system:
#!/bin/bash
# signal handler: catch SIGTERM, save scratch data
trap "sleep 5 ; cd $TMPDIR ; tar cf - * | tar xf - -C ${WOODYHOME}/$PBS_JOBID ; exit" 15
# make job data save directory
mkdir ${WOODYHOME}/$PBS_JOBID
cd $PBS_O_WORKDIR
# assuming a.out stores temp data in $TMPDIR
mpirun ./a.out
The sleep command at the start of the signal handler
gives your application some time to shut down before the data is
saved. Please note that it is required to use a Bourne or Korn shell variant
for catching the TERM signal since csh
has only limited facilities for signal handling.
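One general shell detail is worth knowing when writing such handlers: a shell that is waiting for a foreground command runs a trap only after that command has finished, whereas the wait builtin returns immediately when a trapped signal arrives. The following variant therefore starts the payload in the background and waits for it. It is a sketch under stated assumptions, not a tested recipe for Woody: SCRATCH and SAVE are placeholder variables standing in for $TMPDIR and ${WOODYHOME}/$PBS_JOBID, the function names are invented, and how the batch system delivers TERM to the processes of a job is not covered here.

```shell
#!/bin/bash
# Placeholders: in a batch job these would be set to $TMPDIR and
# ${WOODYHOME}/$PBS_JOBID; here they default to temporary directories.
SCRATCH=${SCRATCH:-$(mktemp -d)}
SAVE=${SAVE:-$(mktemp -d)}

# Copy everything from the scratch directory to the save directory.
stage_out() {
    mkdir -p "$SAVE" && cp -r "$SCRATCH"/. "$SAVE"/
}

# Start a command in the background and wait for it, so that the TERM
# handler can run as soon as the signal arrives.
run_with_stage_out() {
    "$@" &
    CHILD=$!
    # On TERM: stop the payload, save the scratch data, then exit.
    trap 'kill "$CHILD" 2>/dev/null; stage_out; exit 143' TERM
    wait "$CHILD"
}
```

In a batch script the payload would be started with something like run_with_stage_out mpirun ./a.out after setting SCRATCH and SAVE accordingly.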
Job Priorities and Reservations
The scheduler of the batch system assigns a priority to
each waiting job. This priority value depends on certain
parameters (like waiting time, queue, user group, and recently
used CPU time (a.k.a. fairshare)). The ordering of waiting
jobs listed by qstat does not reflect the
priority of jobs. All waiting jobs with their assigned
priority are listed anonymously on the
HPC user web pages
(those pages are password protected; execute the
docpw command to get the username and password).
There you also get a list of all running
jobs, any node reservations, and all jobs which cannot be
scheduled for some reason. Some of this information is also
available in text form: The text file
/home/woody/STATUS/joblist contains a list of all
waiting jobs; the text file
/home/woody/STATUS/nodelist contains information
about node and queue activities.
Further Information
Intel Xeon 5160 "Woodcrest" Processor
The
Xeon 5160 processor
implements Intel's Core microarchitecture. It is a dual-core chip running at 3.0 GHz
and outperforms previous Xeons with the older
Netburst architecture
(as used, e.g., in our Transtec cluster) significantly.
It features many architectural enhancements, e.g.:
- 32kB L1 data cache per core (8-way set-associative, 64B cache line, 2-3 cycles latency, writeback).
- Two cores sharing a common 4MB L2 cache (16-way set-associative, 64B cache line, 14 cycles latency, writeback).
- Much shorter pipelines than Netburst.
- Very short cache latencies compared to Netburst.
- 4 FLOPs per cycle double precision floating-point throughput with SSE2.
- Each core can sustain up to one 128-bit load and one 128-bit store operation per cycle.
- Theoretical memory bandwidth of 10.6 GB/s; less than half of this value is typically seen in applications.
- Four different hardware prefetchers that try to hide memory latency by loading data and instruction cache lines in advance. In particular, "adjacent cache line prefetch" loads the current and the next cache line on a cache miss automatically, effectively doubling L2 line size to 128B for reads and RFOs.
The
Intel® 64 and IA-32 Architectures Optimization Reference Manual
contains in-depth information about the microarchitecture and specific optimization techniques.
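The overall peak performance quoted in the system overview follows from the per-core figures listed in this section. The short calculation below assumes that only the 217 compute nodes are counted (not the frontends) and uses the clock rate in whole GHz.

```shell
#!/bin/bash
nodes=217            # compute nodes
cores_per_node=4     # two dual-core chips per node
clock_ghz=3          # 3.0 GHz
flops_per_cycle=4    # double precision, with SSE2

# GHz times FLOPs per cycle gives GFlop/s per core
peak_gflops=$(( nodes * cores_per_node * clock_ghz * flops_per_cycle ))
echo "peak: ${peak_gflops} GFlop/s"   # prints "peak: 10416 GFlop/s"
```

The result of 10416 GFlop/s corresponds to the stated 10.4 TFlop/s; the measured LINPACK value of 6.62 TFlop/s is roughly 64% of that peak.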
HP DL140G3 Compute Node
A compute node comprises two sockets, each housing a Xeon
5160 dual-core chip. In contrast to older dual-socket systems, the two
frontside buses (FSBs) are not directly connected to each other
but to the chipset, which is theoretically able to satisfy
the bandwidth requirements of both chips concurrently (21.3 GB/s).
Due to deficiencies in the bus protocols and other factors, the
maximum achievable bandwidth per node is roughly 8 GB/s.
Although the node shows
UMA
memory access characteristics, the peculiar structure of two dual-core
chips with separate FSBs and the partly shared L2 caches offers
some diverse possibilities for parallel programming. If, for some reason,
only two of the four cores are actually used, it depends on the
code's bandwidth and communication requirements whether one should
place the processes (threads) on a single or on separate sockets.
It is generally a good idea to "pin" threads/processes to cores in
order to get reproducible performance results. Please consult
hpc@rrze for further
advice.
The Intel 5000X chipset ("Greencreek") features a "snoop filter" that
can to some extent lessen the performance impact of the snoop-based
cache coherence
protocol. It has an on-chip memory that keeps track of modified cache
lines in all the cores' caches.
InfiniBand Interconnect Fabric
The
InfiniBand
(IB) network features a non-blocking switch (fat-tree) with
static routing. Each node is capable of sending and receiving
data at a rate of 10 GBit/s per direction (full duplex), with an MPI latency of less than 5 µs.
The IB network is used for MPI communication and the parallel file system. NFS traffic uses a separate
GBit Ethernet network.


