Getting Started

Request an account

Access to Acuario cluster

Environment in Acuario

Storage

Understanding the resource manager Slurm

 

Request an account

  • If you have a CIMNE account, request an Acuario cluster account through the ticket system or by sending an e-mail to cau@cimne.upc.edu.
  • If you don’t have a CIMNE account, you need to fill in this form correctly and your CIMNE responsible must sign it.

 

Access to Acuario cluster

GNU/Linux

First you need an OpenSSH client and a graphical environment installed. Then run the following command:

ssh -X -l username hpc0.cimne.upc.edu

Windows

First you need Xming (the public domain release, the one called “Xming”) installed with the default options, plus the PuTTY SSH client. Make sure Xming is running, then launch PuTTY and configure it:

  1. Go to “Connection”, “SSH”, “X11” and check “Enable X11 forwarding”. Also type localhost:0 in the “X display location” box.
  2. In “Session”, enter the Acuario cluster login node hpc0.cimne.upc.edu. Then type a name for this session in the “Saved Sessions” box and click the “Save” button.
  3. From then on, whenever you want to connect to the Acuario cluster, launch PuTTY, load the saved session and open it.

 

Environment in Acuario

In order to do certain tasks, like compiling code or using OpenMP, Open MPI, Intel MPI or the Intel compilers, some environment variables must be set.

We unified this process using environment modules: you load the module you need and all the necessary variables are set in your local environment.

Once your environment is set in your local bash session, you can submit jobs to Slurm with srun, sbatch, salloc, etc., and all the variables will be passed to the remote nodes.

To see which modules you can load, type:

[user@hpc0 ~]$ module avail

------------------------------------------------------------------------ /globalfs/etc/modulefiles -------------------------------------------------------------------------
boost/1.61.0-b1 gcc/5.3.0 intel/clusterStudioXE2013 metis/5.1.0 parmetis/3.2.0 python/3.5.1
cmake/3.5.2 gcc/6.1.0 kratos/daily openmpi/1.10.2 parmetis/4.0.3

To load a module:

[user@hpc0 ~]$ module load cmake/3.5.2

To see which modules are loaded into your bash session, type:

[user@hpc0 ~]$ module list
Currently Loaded Modulefiles:
 1) cmake/3.5.2

To unload a module:

[user@hpc0 ~]$ module unload cmake/3.5.2

Tip: If you use some modules frequently, you can add a command like “module load module-name” to your .bash_profile file (in your home dir).
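For example, the relevant lines in your ~/.bash_profile could look like this (module names taken from the module avail listing above):

```shell
# ~/.bash_profile excerpt: load frequently used modules at every login
module load cmake/3.5.2
module load openmpi/1.10.2
```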

 

Storage

You have access to two storage spaces from all Acuario machines:

  • /home: This is where the personal user directories live. We assign every user a disk quota on this storage. The quota has two limits, soft and hard. The soft limit is the real space limit, but it can be exceeded up to the hard limit for at most one week; this way a running job can finish and the user can retrieve the results. Your quota is shown at login time and can also be checked with the “quota -sf /dev/sdb1” command.
  • /shome: This is where the personal user directories of the old Acuario cluster live. This storage has no disk quota but is slower than /home.

 

Understanding the resource manager Slurm

Because a resource manager must be used, there are a few concepts you should take into account.

First, the notation

  • Partition: A set of nodes.
  • Node: A computation node with its own memory and processors.
  • Processor: A physical processor, like the Intel Xeon E5-2670. In Acuario there are nodes with 2 or 4 processors.
  • Core: A physical core of a processor. In Acuario, processors have from 4 to 16 cores.

The cluster is like a bank that gives you some resources for computing. In order to get the resources you want, you must order a reservation.

You can reserve two main resources:

  • Cores
  • Memory

Moreover, there are some partitions (queues) of nodes. In your resource reservation you can select the partition on which your job must be launched. Each partition can have restrictions, for example max. job time, max. number of queued jobs, max. memory per job, etc.
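The options above can be combined in a job script. A minimal sketch of such a reservation (the partition name and the amounts here are illustrative):

```shell
#SBATCH --partition=R815       # the partition (queue) to run on
#SBATCH --ntasks=4             # cores to reserve
#SBATCH --mem-per-cpu=2048     # memory per core, in MB
#SBATCH --time=1-00:00:00      # estimated run time: 1 day
```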

Job computation time

You can also specify your job’s estimated computation time. It’s optional, but has advantages.

For example, suppose the following scenario:
– A cluster with 3 computing nodes.
– A partition including these 3 nodes, with the default job time set to infinite.

Now suppose a job, Job1, is running on this partition and using 2 of the 3 nodes.
Suppose there is also a job, Job2, with no time specification, waiting for resources on this partition, namely 3 full nodes.
Finally, suppose you launch a job, Job3, that needs one entire node, and that you don’t specify its estimated running time.

The result queue will be:

Job 1 – Running – Estimated time: infinite – 2 Nodes
Job 2 – Waiting… – Estimated time: infinite – 3 Nodes
Job 3 – Waiting… – Estimated time: infinite – 1 Node

But if you specify the running time of Job 3 (say, 1 day), your job will run immediately on the free node! This is because, with a known time limit, your job will not delay the start of Job 2, whose start time is unknown anyway since Job 1 has an estimated finishing time of infinite.

So keep in mind that specifying an estimated time for your job can be very advantageous!

Memory resource

Specifying the amount of memory for your job is also a good idea, because the default is 1GB per core; if you run an application that requires more than that, your job will be killed. Memory restrictions are implemented this way to prevent oversubscribing memory and, in consequence, swapping.

Note that the memory limit can also be useful for performance studies. If your code runs out of physical memory and begins to use swap space, performance will be severely degraded. For a performance study this may be considered an invalid result, and you may want to try a smaller problem, use more nodes, etc. One way to protect against this is to reserve entire nodes and set the memory limit to the maximum memory of the nodes (or less); that is about the most you can use before swapping starts to occur. The batch system will then kill your job before it gets close to swapping.
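A sketch of that protection, assuming a 64GB node (--exclusive reserves whole nodes; the memory figure is illustrative, slightly under the node total):

```shell
#SBATCH --exclusive    # reserve the whole node(s)
#SBATCH --mem=63000    # MB, just under a 64GB node: the job is killed before swapping
```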

The total amount of memory you can reserve must be calculated taking into account the total memory of each node.

For example, if you want to run a serial job that consumes 32GB of RAM, you will want to reserve 32GB on one single node. Be careful not to send this job to a queue whose nodes have less than 32GB each, because it won’t be accepted.
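The directives for the 32GB serial example above might look like this (a sketch, with 32GB expressed in MB):

```shell
#SBATCH --ntasks=1     # serial job: a single task on a single node
#SBATCH --mem=32768    # 32GB per node, in MB
```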

If you want to be more specific, for example when running OpenMP or MPI jobs, you can instead specify the amount of RAM per CPU (per core) with the --mem-per-cpu option. In a hypothetical case where you reserve 10GB per CPU, be aware that if your job runs on only one node (typical for OpenMP jobs) and has 8 threads, you are reserving 8×10 = 80GB of RAM!
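The reservation arithmetic above can be sketched in plain shell (the numbers are the hypothetical ones from the example):

```shell
# With --mem-per-cpu, the memory reserved on a node is the per-core amount
# multiplied by the number of tasks/threads placed on that node.
MEM_PER_CPU_GB=10   # hypothetical --mem-per-cpu value from the text, in GB
THREADS=8           # an 8-thread OpenMP job running on a single node
TOTAL_GB=$((MEM_PER_CPU_GB * THREADS))
echo "Reserved on the node: ${TOTAL_GB} GB"   # prints: Reserved on the node: 80 GB
```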

Where do you want to compute?

When you submit your job you can specify the partition in which you want it to be computed. Remember that partitions are defined as pools of nodes; you can see which are defined in the system with sview or sinfo.

To list the current partitions on the system, run sinfo:

[user@hpc0 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
HighParallelizatio* up 1-00:00:00 25 idle pez[017-033,036-043]
R815 up 15-00:00:0 1 idle pez035

Or, better, run sview to see graphically which partitions are configured, with full info (name, priority, max. time, node list, etc.) for each one:


The current partition scheme, as of Jan 2017, is the following:

Partition Name      | Time Limit | #Nodes | Node List                 | CPU Model                     | Cores per Node | Memory per Node | Intended usage
R630 (default)      | 10 days    | 2      | pez[045-046]              | Intel Xeon E5-2630 v3         | 16             | 128GB           | OpenMP/MPI
R815                | 10 days    | 1      | pez035                    | AMD Opteron 6376              | 64             | 256GB           | OpenMP
HighParallelization | 10 days    | 12     | pez[017-028]              | Intel Xeon E5-2670            | 16             | 64GB            | MPI
COMP-DES-MAT        | 10 days    | 12     | pez[029-032]/pez[036-043] | Intel Xeon E5-2670/E5-2660 v2 | 16/20          | 64/128GB        | Restricted to COMP-DES-MAT group
COMP-DES-MAT-ALL    | 10 days    | 24     | pez[017-032]/pez[036-043] | Intel Xeon E5-2670/E5-2660 v2 | 16/20          | 64/128GB        | Restricted to COMP-DES-MAT group
COMP-DES-MAT-VIP    | 10 days    | 24     | pez[017-032]/pez[036-043] | Intel Xeon E5-2670/E5-2660 v2 | 16/20          | 64/128GB        | Restricted to COMP-DES-MAT group
  • The HighParallelization partition is in a testing period. We reserve the right to change it.
  • Jobs launched in COMP-DES-MAT-ALL/VIP have higher priority than those launched in HighParallelization. So if a job launched in COMP-DES-MAT-ALL/VIP requires a resource (a node or core) being used by a job in HighParallelization, the HighParallelization job will be killed and requeued.

How do I run serial jobs in Acuario?

Create a script called “run.sh” and fill it with the following content. Change the SBATCH parameters, Job name, and executable.

#!/bin/bash
#SBATCH --job-name=JobName
#SBATCH --output=JobName-output-job_%j.out
#SBATCH --error=JobName-output-job_%j.err
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks=1

##Optional - Required memory in MB per core. Defaults are 1GB per core.
##SBATCH --mem-per-cpu=3072

##Optional - Estimated execution time
##Acceptable time formats include  "minutes",   "minutes:seconds",
##"hours:minutes:seconds",   "days-hours",   "days-hours:minutes" ,"days-hours:minutes:seconds".
##SBATCH --time=

########### Further details -> man sbatch ##########

cd /home/user/binaries/
./binary

Then execute the following:

[user@hpc0 ~]$ sbatch run.sh
srun: jobid 3214 submitted

How do I run OpenMP jobs in Acuario?

Create a script called “run.sh”, and fill it with the following content. Change the SBATCH parameters, Job name, and executable.

In this case, ntasks-per-node should be greater than or equal to OMP_NUM_THREADS.

#!/bin/bash
#SBATCH --job-name=JobName
#SBATCH --output=JobName-output-job_%j.out
#SBATCH --error=JobName-output-job_%j.err
#SBATCH --partition=R815
#SBATCH --ntasks-per-node=8

##Optional - Required memory in MB per node, or per core. Defaults are 1GB per core.
##SBATCH --mem=3072
##SBATCH --mem-per-cpu=3072

##Optional - Estimated execution time
##Acceptable time formats include  "minutes",   "minutes:seconds",
##"hours:minutes:seconds",   "days-hours",   "days-hours:minutes" ,"days-hours:minutes:seconds".
##SBATCH --time=

########### Further details -> man sbatch ##########

export OMP_NUM_THREADS=8
./binary

Then execute the following:

[user@hpc0 ~]$ sbatch run.sh
srun: jobid 3214 submitted

How do I run Open MPI jobs in Acuario?

Create a script called “run.sh”, and fill it with the following content. Change the SBATCH parameters, Job name, and executable. The --ntasks parameter will be passed to mpirun, and will run only one task per core.

#!/bin/bash
#SBATCH --job-name=JobName
#SBATCH --output=JobName-output-job_%j.out
#SBATCH --error=JobName-output-job_%j.err
#SBATCH --ntasks=Number_of_MPI_tasks

##Optional - Required memory in MB per node, or per core. Defaults are 1GB per core.
##SBATCH --mem=3072
##SBATCH --mem-per-cpu=3072

##Optional - Estimated execution time
##Acceptable time formats include  "minutes",   "minutes:seconds",
##"hours:minutes:seconds",   "days-hours",   "days-hours:minutes" ,"days-hours:minutes:seconds".
##SBATCH --time=24:00:00

########### Further details -> man sbatch ##########

srun --mpi=pmi2 ./binary

Then execute the following:

[user@hpc0 ~]$ sbatch run.sh
Submitted batch job 29

How do I run Intel MPI jobs in Acuario?

Load the necessary modules with:

[user@hpc0 ~]$ module load intel/clusterStudioXE2013

Create a script called “run.sh”, and fill it with the following content. Change the SBATCH parameters, Job name, and executable. The --ntasks parameter will be passed as the number of processes, as if you were executing mpirun -np xx.

Also make sure you do NOT have the openmpi/1.6.2 module loaded. List currently loaded modules with “module list”, and unload with the command module unload modulename.

#!/bin/bash
#SBATCH --job-name=JobName
#SBATCH --output=JobName-output-job_%j.out
#SBATCH --error=JobName-output-job_%j.err
#SBATCH --partition=HighParallelization
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks=96

##Optional - Required memory in MB
##SBATCH --mem=2048

##Optional - Estimated execution time
##Acceptable time formats include  "minutes",   "minutes:seconds",
##"hours:minutes:seconds",   "days-hours",   "days-hours:minutes" ,"days-hours:minutes:seconds".
##SBATCH --time=24:00:00

########### Further details -> man sbatch ##########

cd /home/user/mpibinary/
srun intelmpiexecutable

Then execute the following:

[user@hpc0 ~]$ sbatch run.sh
srun: jobid 3214 submitted

Basic Commands

  • sbatch – submit a job to the batch queue system
  • squeue – check the current jobs in the batch queue system
  • sinfo – view the current status of the queues
  • scancel – cancel a job
  • sview – run a graphical tool to control jobs and see partition and node status
  • smap – run a console tool to control jobs and see partition and node status
  • sacct – display accounting statistics for all jobs and job steps
  • sstat – display current resource usage for running jobs and job steps

Example:

$] sstat -j 20794.batch
..
$] sacct --format=jobid,User,NodeList,AllocCPUS,AveRSS,MaxRSS,Partition,UserCPU,State -j 24464 
...

More info

If you need more information, take a look at the sbatch manual (“man sbatch”) or the Slurm documentation.
