This guide is intended to help you get started with the HPC clusters. It is not a comprehensive reference, but rather a quick guide to get you up and running. For the full cluster documentation, please refer to the official HPC documentation.
Note If you have any questions, please contact the cluster administrators by opening a CAU ticket for Library and IT / Supercomputació.
To access the HPC clusters, you need a RESEARCH ID. This ID is different from your UPF account (uXXXXXXX). By default, Master's and PhD students at the UPF have access to the HPC clusters. If you do not have access, contact your supervisor to request it.
Note The RESEARCH IDs are commonly formatted as [first character of first name][last name] (e.g. jdoe for John Doe). You need a password set up to access the clusters using your ID. You can set up your password by providing your username using this link (to change your password, use this link). If these links are inactive, navigate to Tools > Change Password.
To access the cluster, you first need to be connected to the UPF VPN network. You can find instructions on how to connect to the VPN using Forticlient here.
Note Remember that to access the VPN you should connect to the UPF network using your UPF account (uXXXXXXX). However, for connecting to the cluster, you should use your RESEARCH ID. These two accounts are different and do not necessarily share the same password.
To connect to the cluster, you can use the following command in your terminal:
ssh -X [RESEARCH_ID]@hpc.s.upf.edu
Example: ssh -X jdoe@hpc.s.upf.edu
To access your files on the cluster, you can use a file transfer client such as FileZilla. If you decide to use FileZilla, to connect to the cluster, go to File > Site Manager and add a new site. Then, enter your RESEARCH ID and password as the login credentials and fill in the remaining connection settings for the cluster (host hpc.s.upf.edu).
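If you prefer the command line, you can also copy files with scp, which is available alongside ssh on most systems. Below is a minimal sketch; the file names and [REMOTE_PATH] are placeholders:
# copy a local file to the cluster
scp local_file.txt [RESEARCH_ID]@hpc.s.upf.edu:[REMOTE_PATH]
# copy a file from the cluster back to the current local directory
scp [RESEARCH_ID]@hpc.s.upf.edu:[REMOTE_PATH]/results.txt .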
The HPC clusters are composed of two sets of nodes: Login and Compute. When you first ssh into the cluster, you will be connected to a Login node. The Login nodes are used to submit jobs to the Compute nodes or to request interactive access to the Compute nodes. The Compute nodes are used to run the jobs submitted by the users.
Warning The Login nodes are not meant to run jobs. You SHOULD NOT do any computation on the Login nodes.
To run a job on the cluster, your submitted task or your request for interactive access will be placed in a queue. As soon as a Compute node is available, your job will be executed. Your place in the queue is determined by the priority of the job, the amount of resources requested for the task, and the amount of resources you have previously used. As a result, to minimize the time you wait in the queue, you should request the minimum amount of resources necessary to run your job. This will also ensure that you do not lower your priority in the queue for future jobs.
There are different priority levels for the jobs. The priority levels are as follows:
There are a variety of GPU resources available on the cluster. The GPUs are available on the Compute nodes. At the time of writing this guide, the GPUs available on the cluster are listed in the following table:
| GPU Model | Architecture | VRAM | CUDA Cores | --gres Tag |
|---|---|---|---|---|
| Quadro RTX 6000 | Turing | 24 GB | 4608 | --gres=gpu:quadro:1 |
| Tesla T4 | Turing | 16 GB | 2560 | --gres=gpu:tesla:1 |
| GTX 1080 Ti | Pascal | 11 GB | 3584 | --gres=gpu:pascal:1 |
To get the list of available GPUs on the cluster, you can use the following command:
sinfo -o "%P %G %D %N"
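Here, %P prints the partition name, %G the generic resources (e.g. GPUs), %D the number of nodes, and %N the node names.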
To start an interactive session, you can use the following commands:
For CPU-ONLY tasks:
srun --nodes=1 --partition=short --cpus-per-task=4 --mem=8g --pty bash -i
For CPU/GPU tasks:
srun --nodes=1 --partition=short --gres=gpu:1 --cpus-per-task=4 --mem=8g --pty bash -i
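Once the interactive session starts, you can check which GPU was allocated to you (assuming the NVIDIA driver utilities are installed on the compute node, which is typically the case on GPU nodes):
nvidia-smi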
The --gres flag is used to request GPU resources. If any GPU works for your task, you can use --gres=gpu:X, where X is the number of GPUs you want to request. If you want to request a specific GPU model, you can use --gres=gpu:[GPU_MODEL]:X, where [GPU_MODEL] is the tag of the GPU model you want and X is the number of GPUs you want to request. For the list of available GPU tags, refer to the last column of the table presented above in the section called GPU RESOURCES AVAILABLE.
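For example, combining the interactive command above with the tag from the table, the following requests one Quadro RTX 6000:
srun --nodes=1 --partition=short --gres=gpu:quadro:1 --cpus-per-task=4 --mem=8g --pty bash -i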
To learn more about interactive sessions, refer to the HPC Documentation.
To submit a job, you need to create a batch script that contains the commands you want to run. The job script should be saved in a file with the .sh extension.
Note A batch script is a file that contains a series of commands that are executed one after the other. To run a batch script, create an empty [file_name].sh file, copy the contents of the batch script into it, and then submit it to the cluster using the sbatch [file_name].sh command. To summarize,
touch install_conda.sh
vim install_conda.sh
# in vim:
# 1. press i to enter insert mode
# 2. copy the contents of the batch script into the file
# 3. press esc to exit insert mode
# 4. type :wq to save and quit
sbatch install_conda.sh
An example of a batch script is shown below:
#!/bin/bash
# Job name and partition (queue) to submit to
#SBATCH -J sweep_small
#SBATCH -p medium
# Requested resources: 1 node, 1 Tesla T4 GPU, 4 CPU cores, 16 GB of RAM
#SBATCH -N 1
#SBATCH --gres=gpu:tesla:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16g
# Time limit and output/error log files (%N = node name, %J = job ID)
#SBATCH --time=8:00
#SBATCH -o %N.%J.OUTPUT.out
#SBATCH -e %N.%J.ERROR_LOGS.err

# Load the module system and activate the conda environment
source /etc/profile.d/lmod.sh
source /etc/profile.d/zz_hpcnow-arch.sh
module load Anaconda3/2020.02
source activate GrooveTransformer

# Run the code
python run_some_code.py
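Once the job is submitted with sbatch, the output and error files specified by the -o and -e flags (e.g. node018.221673.OUTPUT.out) are written to the directory from which the job was submitted.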
For more information about the job script, refer to the Basic Jobs section of the HPC documentation.
Many packages and applications are pre-compiled for use on the cluster. These packages can be viewed here. Alternatively, in an interactive session, you can call the module avail command after sourcing the /etc/profile.d/lmod.sh and /etc/profile.d/zz_hpcnow-arch.sh files.
source /etc/profile.d/lmod.sh
source /etc/profile.d/zz_hpcnow-arch.sh
module avail
If you want to search for a specific package, use the module spider PACKAGE_NAME command:
module spider conda
Once you find a module, you can simply include it in your session by calling the module load PACKAGE_NAME command:
module load Anaconda3/2020.02
If you need a package that is not available, you can install it via conda or pip (if the Anaconda and/or Python modules are loaded). Alternatively, contact the cluster admins (see the beginning of this guide) and ask them to install the software as a module that can be loaded directly in a project.
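As a minimal sketch of the conda route, assuming the Anaconda3/2020.02 module shown above (the environment name my_env and the numpy package are placeholders), you could create a personal environment and install packages into it:
source /etc/profile.d/lmod.sh
source /etc/profile.d/zz_hpcnow-arch.sh
module load Anaconda3/2020.02
# create and activate a personal environment (name is a placeholder)
conda create -n my_env python=3.9 -y
source activate my_env
# install any additional packages you need (numpy is just an example)
pip install numpy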
Use the squeue command to monitor the jobs you have submitted. The squeue command will show you the jobs that are currently running as well as the jobs that are waiting in the queue.
squeue
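To list only your own jobs, you can filter the output by user:
squeue -u [RESEARCH_ID]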
If you have specified the output and error logs in your job script using the #SBATCH -o and #SBATCH -e flags, you can use the vim command to view the output and error logs of your job. Example:
vim node018.221673.OUTPUT.out
vim node018.221673.ERROR_LOGS.err
To cancel a job, get the job-id from the queue list obtained using squeue. Then, use the scancel command with the job-id as the argument. Example:
scancel 1234567
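To cancel all of your own jobs at once, scancel also accepts a user filter:
scancel -u [RESEARCH_ID]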