IA32/EM64T/AMD64 Compute-Cluster

The IA32/EM64T/AMD64 cluster of the RRZE (manufactured by Transtec) consists of 175 compute nodes, three front end nodes, and two file servers:

  • 86 compute nodes, dual Xeon 2.66 GHz "Prestonia" (533 MHz FSB), 2 GByte RAM, 80 GB IDE hard drive per node
  • 64 compute nodes, dual Xeon 3.20 GHz "Nocona" (800 MHz FSB / 666 MHz RAM), 2 GByte RAM, 80 GB IDE hard drive per node
  • 25 compute nodes, quad (2x Dual Core) Opteron 2.0 GHz (800 MHz FSB, ccNUMA), 4 GByte RAM, 80 GB IDE hard drive per node
  • 2 front end nodes, dual Xeon 2.66 GHz, equipped like the compute nodes, but with 4 GB RAM and activated Hyperthreading
  • 1 front end node, dual Xeon 3.20 GHz "Nocona", equipped like the compute nodes, but with 4 GB RAM and activated Hyperthreading
  • 1 file server, dual Xeon 2.8 GHz, 5.8 TB HD capacity, RAID 5 (SCSI)
  • 1 file server, dual Xeon 3.0 GHz, 2.4 TB HD capacity, RAID 5 (SCSI)

The nodes are connected via Gigabit Ethernet (GE). 24 of the Nocona nodes are additionally connected via Infiniband links. Compared to GE, Infiniband's bandwidth is 10 times higher and its latency is one fifth. If you are interested in using Infiniband, please contact the HPC team.

Please consider the notes below concerning certain issues that arise from the different architectures of the compute nodes (the Nocona nodes implement Intel's EM64T architecture and therefore run a 64-bit operating system, like the Opteron nodes). Both the Nocona and the Opteron nodes have been partially funded by certain research groups, so please be aware that the use of these nodes might be restricted for regular users.

The sections below cover access and file systems, compilers and tools, parallel computing, math libraries, and the batch system.

Access, Environment, and File Systems

The cluster can be accessed using the three front end nodes:

  • sfront01.rrze.uni-erlangen.de
  • sfront02.rrze.uni-erlangen.de
  • sfront03.rrze.uni-erlangen.de (Nocona)

These three front end nodes have private IP addresses and can therefore only be accessed from within the FAU network. If you want to connect from outside, you first have to log in to the dialog server "cssun" and then connect to one of the front end nodes from there.

When connecting to one of the front end nodes, you find yourself in your regular RRZE home directory (/home/rz[sun]home/...).

The cluster also has its own home directory tree which can be accessed much faster. This local directory tree is mounted at /home/cluster32/$gid/$uid/. Please note that the RRZE does not back up any data stored in this local tree!

Each node contains a local 80 GB hard drive for temporary files which is mounted at /scratch. All files in these directories which are older than a certain number of days (currently 12 days) will be deleted automatically without any notification.
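To see which files are close to the automatic cleanup, a small check along the following lines may help. This is only a sketch: the function name list_stale_files is made up for this example, and it assumes that the cleanup goes by modification time, which the text above does not state.

```shell
#!/bin/sh
# Hypothetical helper: list regular files below a directory that were last
# modified more than a given number of days ago.
# Assumption: the cleanup job looks at the modification time (mtime).
list_stale_files() {
    dir=$1
    days=$2
    # -mtime +N matches files whose age in whole days is greater than N
    find "$dir" -type f -mtime +"$days" -print
}

# Example: files in your scratch directory that are approaching the 12-day limit
if [ -d "/scratch/${USER:-nobody}" ]; then
    list_stale_files "/scratch/${USER:-nobody}" 10
fi
```

Since /scratch is local to each node, such a check only sees the files on the node where it is run.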

Compilers and Tools

You will find a wide variety of compilers and tools in different versions installed on the cluster. The concept of modules is used to simplify the selection and switching between different software packages and versions. It replaces the previous source command and makes the loading and unloading of software packages very easy.

Due to compatibility issues, 64-bit executables should only be generated on sfront03. These executables can then only be used on Nocona and the Opteron nodes.

GNU Compiler Suite (gcc, g++, g77)

Different versions of the GNU compilers are already included in the default executable path. The current default version is 3.3.x.

Intel Compiler (ifc/ifort, icc)

The different versions of this compiler can differ in the performance and correctness of the generated executables. As a rule of thumb you should use the latest version. In rare cases, the use of an older version might be advantageous.

In order to use one of the Intel compilers, you have to set some environment variables by using the module command instead of the deprecated source mechanism.

For certain compilers you can specify which release you prefer (e.g. intel-f/9.0-026). Calling module avail shows you a list of all available modules.

Endianness

Like all x86-based architectures, the Xeon processor uses the little-endian format, which means that the least significant byte (LSB) of multi-byte data is stored at the lowest memory address. The same format is used in unformatted Fortran data files. To simplify the handling of big-endian files, the Intel Fortran compiler can convert the endianness during reading and writing. This can be configured separately for individual Fortran units. Just set the environment variable F_UFMTENDIAN at run-time.
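One way to set the variable for a single program run only, without changing the environment of the calling shell, is sketched here. The wrapper name run_with_endian is an invention for this example, and a.out stands for any program built with the Intel Fortran compiler; the values are those listed in this section.

```shell
#!/bin/sh
# Hypothetical wrapper: run one command with F_UFMTENDIAN set only for that
# command, leaving the calling shell's environment untouched.
run_with_endian() {
    mode=$1
    shift
    F_UFMTENDIAN=$mode "$@"
}

# Usage (a.out is a placeholder for your own executable):
#   run_with_endian big ./a.out              # all units big-endian
#   run_with_endian "big:10,20" ./a.out      # only units 10 and 20 big-endian
#   run_with_endian "big;little:8" ./a.out   # all big-endian except unit 8
```

Values containing a semicolon have to be quoted, because the shell would otherwise treat the semicolon as a command separator.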

Examples:

Effect of the environment variable F_UFMTENDIAN

    F_UFMTENDIAN=     Effect
    big               everything treated as BE
    little            everything treated as LE (default)
    big:10,20         everything treated as LE, except for units 10 and 20
    "big;little:8"    everything treated as BE, except for unit 8

Intel Compiler Version 8.1 and 9.0

To use these compilers, you have to load the corresponding module first: module add intel/8.1 or module add intel/9

ifort is a Fortran95 compatible compiler, icc is the C compiler, and icpc is the C++ compiler. Although it is possible to compile C++ code with icc, it is strongly advised to use icpc for linking. For each compiler, you can get online help by specifying -help as a command line option. Additional documentation can be found in the $INTEL_C_HOME/doc/ directory and on the Intel website for the C/C++ compiler and for the Fortran compiler. The RRZE cannot always provide you with the latest compiler versions, so the documentation on the Intel website might refer to a different compiler version.

Since version 8.1, the Intel compilers can generate 64-bit executables which exploit some special features of the Nocona (and Opteron) CPUs. These special compilers and the generated executables should only be run on Nocona or Opteron nodes. If you use the compiler flag -xP to vectorize your code, the executable can only be run on Intel EM64T CPUs. Executables compiled with the -xW flag can be run on any 64-bit architecture, but will not use the SSE3 extensions.

Intel Compiler Version 7.1

We strongly recommend using version 8.1 or newer. If you want to use this old version nevertheless, you have to load the corresponding module first: module add intel/7.1

ifc is a Fortran95 compatible compiler, icc is the C/C++ compiler. For each compiler, you can get online help by specifying -help as a command line option. The complete documentation can be found in /opt/intel/compiler71/doc/.

Profiling

You can use gprof with both the Intel and the GNU compilers. It measures inclusive and exclusive execution times for each function in your code (assuming appropriate compiler switches).

Unfortunately, gprof cannot use hardware performance counters to measure, e.g., cache misses or FLOPs. You can use hpcmon in /opt/rrze/bin/hpcmon{32,64} (depending on the architecture) to get an overview of such metrics for your code. hpcmon is similar to the perfex tool on the Memory Server: the performance counters are read right before and after running the specified code, and the measurements are then displayed concisely. For certain combinations of metrics it might be necessary to start your code multiple times; therefore, your code must be restartable. You should also be aware that hpcmon takes all running processes into account, so please make sure that no other user process is running. Use hpcmon -h to get online help.

The HPC service can offer you more sophisticated profiling tools than gprof or hpcmon. Please contact them for further information.

Debuggers

For sequential codes, you can use the GNU debugger gdb and its graphical front end ddd.

For parallel debugging, we recommend the TotalView debugger from Etnus. It is licensed for 4 CPUs and a single user, so please make sure that you do not block other users by running it unnecessarily. Use module add totalview; /usr/totalview/bin/totalview to start the debugger.

For parallel debugging, type:

    setenv TVDSVRLAUNCHCMD ssh
    /opt/mpi/bin/mpirun -dbg=totalview <further arguments>

The complete documentation for TotalView can be found in the restricted user area.

Parallel Computing

OpenMP

All Intel compilers support the OpenMP standard in version 2.0.

MPI

MPICH is available in an up-to-date version in /opt/mpi/bin. Just use the MPI compiler calls (mpicc, mpif77, etc.), which will internally choose the matching compiler specified by the environment variable USE_PROG_MPI or the loaded compiler module.

The meta modules for the Intel compilers (e.g. intel/8.0) automatically load the matching MPI module. If you use the GNU compilers instead, you have to load the matching MPI module manually, e.g. with module add mpich/gnu.

It is possible to run interactive MPI jobs on the front end nodes. Simply use /opt/mpi/bin/mpirun in the regular way. The 4 CPUs of the front end nodes are allocated round-robin. If you prefer to have neighboring MPI ranks on the same node, add the option -norr to your mpirun call. A specified machine file is ignored by mpirun. The front end nodes are not equipped with Infiniband interconnects; use the queue iexpress instead if you want to use Infiniband (see below).

For MPI jobs that are submitted via the queueing system, the processes are distributed across the nodes automatically. Neighboring MPI ranks are located on the same node. If you want a different placement, you can use the option -rr to get a round-robin placement (neighboring MPI ranks are located on different nodes). A specified machine file is ignored. It is redundant (but not forbidden) to specify the number of processes with the option -np, since this value is determined from the number of requested CPUs.

It is possible to use OpenMP and MPI simultaneously (hybrid programming). To do so, certain environment variables (e.g. OMP_NUM_THREADS) have to be set for all MPI processes. Please contact the HPC service if you are planning to do hybrid programming.
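The difference between the default placement and round-robin placement can be illustrated with a few lines of shell. This is a model of the behavior described above, not what mpirun does internally; the function name show_placement and the node numbering are assumptions made for this illustration.

```shell
#!/bin/sh
# Illustration only: print which node each MPI rank would land on.
# mode "block": neighboring ranks share a node (default in batch jobs)
# mode "rr":    round-robin, neighboring ranks sit on different nodes
show_placement() {
    mode=$1
    nodes=$2
    ppn=$3
    total=$((nodes * ppn))
    rank=0
    while [ "$rank" -lt "$total" ]; do
        if [ "$mode" = "rr" ]; then
            node=$((rank % nodes))
        else
            node=$((rank / ppn))
        fi
        echo "rank $rank -> node $node"
        rank=$((rank + 1))
    done
}

# Two nodes with two CPUs each (nodes=2:ppn=2)
show_placement block 2 2
show_placement rr 2 2
```

With the block placement, ranks 0 and 1 share node 0; with round-robin, ranks 0 and 2 do.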

Math Libraries

LAPACK, BLAS, and other common libraries are installed in different flavors:

  • Intel Math Kernel Library (MKL). This library contains highly optimized LAPACK, BLAS, and FFT routines. A stable version can be found in /opt/intel/mkl; experimental or older versions are installed in /opt/intel/mkl/<version>. The MKL library can be linked dynamically and statically. For EM64T architectures, a special 64-bit version is available which also works on Opteron nodes and requires a 64-bit compiler.
    Linking with the MKL library:
      static, IA32:           -L/opt/intel/mkl/lib/32 -lmkl_lapack -lmkl_ia32 -lguide -lpthread
      static, EM64T/AMD64:    -L/opt/intel/mkl/lib/em64t -lmkl_lapack -lmkl_em64t -lguide -lpthread
      dynamic, IA32:          -L/opt/intel/mkl/lib/32 -lmkl_lapack64 -lmkl_lapack32 -lmkl -lguide -lpthread
      dynamic, EM64T/AMD64:   -L/opt/intel/mkl/lib/em64t -lmkl_lapack64 -lmkl_lapack32 -lmkl -lguide -lpthread
    For dynamically linked binaries, it is necessary to set the environment variable LD_LIBRARY_PATH before running the code. You can also specify the library search path when compiling the code with the option: -Xlinker -rpath -Xlinker <path>.
  • AMD Core Math Library (ACML). In most cases, the ACML library performs better than the MKL library. Up-to-date versions for different compilers can be found in /opt/acml. The "gnu64" version should also work with the Intel compilers. The linking options are -L/opt/acml/gnu64/lib -lacml.
  • Vanilla BLAS, LAPACK, and ATLAS versions are installed in /usr/lib.
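For dynamically linked MKL binaries, the library search path mentioned in the MKL item can be set as sketched below. The variable name MKL_LIB and the source file prog.f90 are placeholders chosen for this example; the directory and the linker options are those given in the list.

```shell
#!/bin/sh
# Prepend the MKL directory to the run-time library search path.
# Use lib/em64t instead of lib/32 for 64-bit executables.
MKL_LIB=/opt/intel/mkl/lib/32
LD_LIBRARY_PATH=$MKL_LIB${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
export LD_LIBRARY_PATH

# Alternative: record the path in the executable at link time, e.g.
# (prog.f90 is a placeholder):
#   ifort prog.f90 -L$MKL_LIB -lmkl_lapack64 -lmkl_lapack32 -lmkl -lguide -lpthread \
#         -Xlinker -rpath -Xlinker $MKL_LIB
```

The ${LD_LIBRARY_PATH:+...} expansion avoids a trailing colon when the variable was previously empty. In a batch job, the export belongs in the job script, because the job does not necessarily inherit the environment of your login shell.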

Batch System

General

Production runs and long-running test runs must be submitted via the queueing system Torque. Torque is very similar to the well-known NQS system. From a user's point of view it is identical to OpenPBS, which has often been used in the past. To submit a job you have to create a so-called job script which is given as an argument to the qsub command. The following steps are necessary:

  1. Create a job script. Examples:
    Examples for Torque job scripts

    Sequential job:

    #!/bin/sh
    #
    # allocate 1 CPU for 8 hours
    #
    #PBS -l nodes=1,walltime=08:00:00
    #
    # first non-empty, non-comment line: GO!

    # execution starts in your home directory
    # $PBS_O_WORKDIR is the directory where you submitted the job using qsub

    cd $PBS_O_WORKDIR

    # create temporary directory, if necessary
    # $PBS_JOBID contains job ID

    TMPDIR=/scratch/${USER}/$PBS_JOBID
    mkdir -p $TMPDIR

    # GO!

    ./a.out

    Parallel job:

    #!/bin/sh
    #
    # allocate 8 nodes (16 CPUs) for 3 hours
    #
    #PBS -l nodes=8:ppn=2,walltime=03:00:00
    #
    
    cd $PBS_O_WORKDIR
    
    # if necessary
    
    USE_PROG_MPI=intel
    export USE_PROG_MPI
    
    # -np is redundant
    
    /opt/mpi/bin/mpirun ./a.out
    
    The lines starting with #PBS are options for qsub which could alternatively be specified as command line arguments (see below).
  2. Submit the job with qsub <further options> <job script>
  3. Normally, you do not have to specify which queue should be used, but there are cases where it might be necessary (see below).
  4. Use qstat -a to get the status of your own jobs. For running jobs you also get the elapsed wall clock time. You cannot get information about other users' jobs.
  5. You can see the output to stdout of your running job with /opt/rrze/bin/qcat [-f] <job ID>. Add the option -f to follow the output continuously (just as tail -f does).
  6. A running or waiting job can be dequeued with the command qdel <job ID>.

Important Options for qsub

Important options for qsub and their meaning

-N <job name>
    specifies the name which is shown with qstat
-o <standard output file>
    file name for the standard output stream
-e <error output file>
    file name for the standard error stream
-q <queue>
    specifies the Torque queue (see below); default queue is route.
-l nodes=<# of nodes>[:ppn=[2|4]][:<property>]
    specifies the requested nodes or CPUs. Without ppn=... only a single CPU per node is allocated. The other CPU(s) are considered to be available by Torque. For parallel jobs you should always specify ppn=2 (for Intel nodes) or ppn=4 (for Opteron nodes). With the property option you can further specify the type of requested nodes for special cases. For regular users there is no need to use this option (see below).
-l walltime=HH:MM:SS
    specifies the required wall clock time (run-time). If you omit this option a short default time is used depending on the chosen queue. Please specify a reasonable run-time, since the scheduler bases its decisions also on this value (short jobs are preferred).
-I
    interactive job. It is still allowed to specify a job script, but it will be ignored except for the PBS options. No code will be executed. Instead, the user will get an interactive shell on one of the allocated nodes and can execute any command there. In particular, you can start a parallel code with mpirun.
-M x@y -m abe
    you will get a mail to x@y when the job is aborted (a), starting (b), and ending (e). You can choose any subset of abe for the -m option.
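The resource request for -l has to follow a fixed format, which is easy to get wrong when typed by hand. A small helper like the following could assemble it; the name resource_spec is made up for this sketch, and myjob, x@y, and job.sh are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: build the argument for "qsub -l" from a node count,
# the CPUs per node, and a run-time in minutes.
resource_spec() {
    nodes=$1
    ppn=$2
    minutes=$3
    printf 'nodes=%d:ppn=%d,walltime=%02d:%02d:00\n' \
        "$nodes" "$ppn" "$((minutes / 60))" "$((minutes % 60))"
}

# Example: 4 Intel nodes with 2 CPUs each for 5 hours, mail on abort and end:
#   qsub -N myjob -l "$(resource_spec 4 2 300)" -M x@y -m ae job.sh
resource_spec 4 2 300
```

The last line prints the request string for the example, so you can check it before passing it to qsub.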

Queues and Job Parameters

The following queues are available for users and can be selected with the -q option:

route
    This is the default queue if no other queue is specified. All jobs in route are dispatched to other queues depending on certain criteria (such as run-time, number of CPUs, ACLs, ...). The exact rules of this procedure are subject to change without notice and are normally irrelevant for users. Only Intel nodes can be allocated when using this queue.
express
    For sequential and parallel test jobs on IA32 nodes with less than one hour run-time. Jobs queued in route might be rerouted to this queue if applicable. From Monday until Friday, 8:00 until 21:00, four nodes are dedicated to this queue. At other times, the queue is still active, but be prepared for longer delays.
iexpress
    For sequential and parallel test jobs on nodes with Infiniband interconnect with less than one hour run-time. From Monday until Friday, 10:00 until 16:00, four nodes are dedicated to this queue. At other times, the queue is still active, but be prepared for longer delays.
iband
    For parallel jobs on nodes with Infiniband interconnect. Your job has to request at least two nodes. Always allocate entire nodes (ppn=2). Please contact the HPC service if you need access to this queue. Please note that some of the Infiniband nodes are used by the iexpress queue at certain times (see above).
    Important: Only executables compiled with Infiniband MPI are allowed in this queue!
oexpress
    For sequential and parallel test jobs on Opteron nodes with less than one hour run-time. From Monday until Friday, 8:00 until 21:00, one node is dedicated to this queue. At other times, the queue is still active, but be prepared for longer delays.
opteron
    For sequential and parallel jobs on Opteron nodes. Always allocate entire nodes (ppn=4) if you submit a parallel job in this queue. Please note that some of the Opteron nodes are used by the oexpress queue at certain times (see above).
fhg, lstm, special, ...
    Special queues for certain tasks or user groups.

If you need an Opteron node, you have to use either the oexpress or the opteron queue, since the default queue route will only use IA32 nodes.

If you want to use Infiniband, you have to use a 64 bit compiler on sfront03, since only the Nocona nodes have Infiniband installed.

The queueing system prefers jobs with a short run-time (in particular less than six hours). If you are a regular user, the maximum run-time of your jobs is 24 hours on Opteron nodes and 48 hours on all other nodes. Here is an overview of possible combinations:

Possible combinations for CPU count (N) and run-times (R)

    Run-time R           N ≤ 8 CPUs
    R < 1h               OK
    1h ≤ R < 6h          OK; jobs also have access to Nocona nodes
    6h ≤ R < 48h         OK
    48h ≤ R < 168h       available on request
    168h ≤ R < 240h      only available for SFB; jobs also have access to Nocona nodes

    N ≤ 64 CPUs: N/A, N/A

Jobs from special queues (ls_*, special) and regular jobs with a run-time between 1 and 6 hours might run on mixed node types. You can influence the choice of nodes with properties if required:

Possible job properties

    Property    Meaning
    normal      all Intel nodes
    gbit        only Gbit interconnect
    mnox        Infiniband nodes (autom. 64 bit)
    ia32        32 bit Intel nodes (Prestonia)
    em64t       64 bit Intel nodes (Nocona)
    opteron     Opteron nodes (AMD64)

You have to specify the properties after the run-time and the number of CPUs: qsub -l walltime=05:00:00,nodes=4:ppn=2:em64t:gbit ...

For conflicting properties (e.g. ia32:mnox or ppn=4:em64t) the job remains in the Q state and will never be started.
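Because a conflicting request is not rejected but simply waits forever, it can be worth checking a request before submitting it. The following sketch covers only the two conflicts named above; the function name check_request and its argument convention are assumptions made for this example, not part of Torque.

```shell
#!/bin/sh
# Hypothetical pre-submission check for the two conflicts named above.
# $1 = ppn value, $2 = colon-separated properties (may be empty)
check_request() {
    ppn=$1
    props=":$2:"
    # has NAME: true if NAME occurs in the property list
    has() { case $props in *":$1:"*) return 0 ;; *) return 1 ;; esac; }

    if has ia32 && has mnox; then
        echo "conflict: ia32 nodes have no Infiniband (mnox)"
        return 1
    fi
    if [ "$ppn" -eq 4 ] && has em64t; then
        echo "conflict: ppn=4 is only possible on Opteron nodes"
        return 1
    fi
    echo "ok"
}

check_request 2 em64t:gbit
check_request 4 em64t || true
```

The first call corresponds to the qsub example above and passes; the second reproduces the ppn=4:em64t conflict.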

For some queues, a default run-time is set if it was not specified explicitly using -l walltime=...:

Maximum and default run-time (wall clock time)

    Queue       Default run-time    Max. run-time
    route       00:10:00            240:00:00
    express     00:10:00            01:00:00
    iexpress    00:10:00            01:00:00
    iband       00:10:00            64:00:00
    oexpress    00:10:00            01:00:00
    opteron     00:10:00            24:00:00 (more for certain users)

Other relevant queue parameters can be listed with qstat -q and qstat -Q -f.

Job Priorities and Reservations

The scheduler of the batch system assigns a priority to each waiting job. This priority value depends on certain parameters (such as waiting time, queue, user group, and recently used CPU time (Fairshare)) and is not directly accessible to users. The ordering of waiting jobs listed by qstat does not reflect the priority of jobs. All waiting jobs with their assigned priority are listed anonymously on the HPC user web pages. There, you also get a list of all running jobs, any node reservations, and all jobs which cannot be scheduled for some reason. All this information is also available in text form: the text file /home/cluster32/hpcop/joblist contains a list of all waiting jobs; the text file /home/cluster32/hpcop/nodelist contains information about node and queue activities.

Access to Compute Nodes and Handling of Output Data

A direct login on a compute node is only possible for the owner of the job running on this particular node. After the completion of a job, any processes on a node (including login shells) are terminated. This can cause problems if a job is killed due to exceeding the run-time limit or due to an explicit kill using qdel, since local data in /scratch can then no longer be copied to a global file system. To gracefully handle such cases, the job should catch the SIGTERM signal that is sent to the job script by the batch system Torque and copy any local data to a global file system. After receiving the SIGTERM signal, the job has 60 seconds before a SIGKILL signal inevitably kills the job. Here is an example for a job script which handles the SIGTERM signal (you have to use a Bourne-type shell script such as sh or bash, because shells based on csh cannot handle SIGTERM signals):

#!/bin/sh
#
#PBS ....
#

TMPDIR=/scratch/${USER}/${PBS_JOBID}
export TMPDIR

# Here is the SIGTERM signal handler which
# moves any local data to the global file system

trap "sleep 5 ; cd /scratch/${USER} ; tar cf - ${PBS_JOBID} | tar xf - -C /home/cluster32/${GROUP}/${USER} ; exit" 15

# Here begins the regular job script which stores
# its data in /scratch/${USER}/${PBS_JOBID}

cd $PBS_O_WORKDIR

# create the scratch directory
mkdir -p $TMPDIR

# GO!
./a.out

The crux of the matter is the trap command, which intercepts the specified signal (here 15 = SIGTERM) and then executes the given commands. Here, they copy the local data from /scratch/${USER}/${PBS_JOBID} to the global file system in /home/cluster32/${GROUP}/${USER}. The sleep command waits for a few seconds to make sure that the program has terminated. The exit command is necessary to finally quit the job script after handling the signal.
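The mechanism can be tried out locally without the batch system. The sketch below is a variation on the script above, not the RRZE's recommended form: it starts the program in the background and waits for it, because a shell runs a trap handler only after the current foreground command has finished, whereas the wait builtin returns as soon as a trapped signal arrives. The temporary directories stand in for /scratch and /home/cluster32, and sleep 30 stands in for ./a.out; all names are assumptions for this demonstration.

```shell
#!/bin/sh
# Local stand-ins for /scratch/... and /home/cluster32/... (assumptions)
SCRATCH_DIR=$(mktemp -d)
GLOBAL_DIR=$(mktemp -d)

job() {
    echo "intermediate data" > "$SCRATCH_DIR/result.dat"
    sleep 30 &                      # stand-in for ./a.out
    child=$!
    # On SIGTERM: stop the program, rescue the data, leave with status 143
    trap 'kill "$child" 2>/dev/null; cp -r "$SCRATCH_DIR"/. "$GLOBAL_DIR"/; exit 143' TERM
    wait "$child"                   # returns at once when a trapped signal arrives
}

job &                               # the "job script", run as a background subshell
job_pid=$!
sleep 1                             # give it time to install the handler
kill -TERM "$job_pid"               # stands in for the signal sent by the batch system
status=0
wait "$job_pid" || status=$?
echo "job exit status: $status"
ls "$GLOBAL_DIR"
```

In a real job script, the handler has to finish well within the 60 seconds mentioned above, so only data that is actually needed should be copied.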

Miscellaneous

Documentation about the IA32/EM64T architecture, its programming and optimization can be found in the IA32 Intel Architecture Software Developer's Manuals.

Documentation about the AMD64 architecture, its programming and optimization can be found in the AMD Developer Resources/Documentation.

Last modified: 13 March 2012

RRZE - Regionales RechenZentrum Erlangen, Martensstraße 1, D-91058 Erlangen | Tel.: +49 9131 8527031 | Fax: +49 9131 302941