By Maryam Daryalal

Niagara, Canada's Supercomputer

Updated: Mar 20


Update of March 19th, 2024:

As of March 2024, multi-factor authentication is required to access Niagara nodes. Find the instructions here.


Updates of July 21st, 2022:

- Login using ssh keys

- Gurobi and third-party packages

- Batch file for a Python code using Gurobi/Cplex


Niagara is a Canadian supercomputer hosted and owned by the University of Toronto and operated by SciNet. It can be used by all Canadian researchers via their active Compute Canada account. So if you're a Canadian researcher, go ahead and get your access, because it's truly amazing! Last year I had to do a massive amount of computational experiments for my most recent paper (on handling stochasticity in telecommunication networks; why not have a look at it?!), and my resources seemed minuscule in comparison. So I finally started dipping my toes in Niagara (oh the pun!). Niagara's Quickstart documentation is quite nice and can get you started immediately. Here I will simplify that guide for first-time users, and expand it with some additional steps that might come in handy for us ORMS researchers.


Apply For A Compute Canada Database (CCDB) Account

Go to this link and follow the instructions to get your account. If you are a PI, then you're good to go. Otherwise, you need to ask your PI for their CCRI number, which you need for creating your account. After activating your Compute Canada account, head to the CCDB portal and request access to Niagara. After 1 or 2 days your access will be granted. The following links will be useful (constantly!):

  • Status of the servers and scheduler: There will be outages, planned maintenance periods, cooling system's pump seal explosions (yes it happened), and reservations. These are all very rare, but if you see that for some reason you cannot connect to the server, with this link you can check whether the server or scheduler is down.

  • SciNet portal: This is your portal on SciNet and provides updated information on your usage and the history of your submitted jobs (even the scripts you have submitted to the nodes).


Login to Niagara

There are two types of nodes on Niagara: “login” and “compute” nodes. When you log in, you are directed to a login node. On a login node, you can upload/download your files, test them, and finally submit your jobs to the compute nodes on the cluster. Every user has two main directories with predefined names: $HOME and $SCRATCH. For a researcher with username "tiger" working with a PI with username "daniel" (I do have a toddler!), the paths look like this:

  • $HOME=/home/d/daniel/tiger -> read-only for the compute nodes. It’s mostly useful for keeping backups or installing software.

  • $SCRATCH=/scratch/d/daniel/tiger -> 25 TB of storage whose files expire after 2 months. Compute nodes can read and write on $SCRATCH, so it’s suitable for working on your projects and submitting your jobs.


Updated (July 21st, 2022)

As of January 22, 2022, the simple ssh command for logging in to Niagara no longer works. Now everyone has to use "ssh keys" to connect to the servers. To do this, we first create a "key pair" of public and private keys protected by a "passphrase". The public part is stored on the server, and the private part on a machine we want to use. So we need one key pair for every computing device we want to work on. Use this link for instructions on how to generate and install a key pair.

The next step is to log in to the cluster using an ssh command that supplies the ssh key:

Assume that "tiger" is working on a Mac, and has stored an ssh key pair "Niagara" and "" in the default location of "~/.ssh/". So, "tiger" logs into Niagara with the following ssh command and enters the passphrase when asked.


Updated (March 19th, 2024)

Now you also need to use multi-factor authentication to access Niagara. The process is quite straightforward. Go ahead and download the Duo app on your phone. Then login to your CCDB account and go to your Multifactor authentication management page and register your phone as a new device. From now on, whenever you want to login, you need to authenticate your access using your phone. The instructions are given here.


Assume that Niagara assigns "tiger" a login node named "nia-login02". In the following, you see the initial commands that "tiger" ran on Niagara, and their outcomes:
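The original screenshot is not reproduced here, but a first session of that kind might look like the following sketch (outputs are based on the paths described above):

```shell
hostname        # which login node am I on? e.g., nia-login02
pwd             # we start in the home directory: /home/d/daniel/tiger
echo $SCRATCH   # /scratch/d/daniel/tiger
module list     # modules loaded by default
```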


Data Management on Niagara

I usually use ExpanDrive for file management on servers. It supports ssh keys and its integration with macOS is seamless. There are other free tools as well, but I didn’t find them quite as stable. You don’t need a dedicated tool for file management though.

Using scp for files under 10 GB

You can use the scp command as you always do on Linux. For uploading source to path on Niagara, "tiger" uses the following command:
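A sketch of the upload, assuming the "Niagara" key pair from the login section:

```shell
# Copy a local file "source" to "path" under tiger's account on Niagara
scp -i ~/.ssh/Niagara source tiger@niagara.scinet.utoronto.ca:path
```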

and for download to destination:
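Again a sketch, assuming the same "Niagara" key pair:

```shell
# Copy a remote file "source" back to a local "destination"
scp -i ~/.ssh/Niagara tiger@niagara.scinet.utoronto.ca:source destination
```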

Through the web app

(The following instructions for Web app might have slightly changed since last year, but in essence it’s the same.)

  1. Install Globus Connect Personal on your machine first (instructions below) -> this gives you a name for your machine

  2. Go to the Globus web app (Compute Canada's portal for data management)

  3. Login with Compute Canada id

  4. In Transfer Files, on one window, type the name you got from installing Globus Connect Personal on your machine

  5. On another window type computecanada#niagara, and on the path field, write your home (or scratch) path (you can bookmark for future use)

  6. Go to your desired directory, select the file you want to transfer, and press on the blue arrow to transfer the files



Pre-installed software

Some common software packages and modules are already installed on Niagara. A list of the available software can be found via:
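Niagara uses an environment-module system, so listing and searching look like this:

```shell
module avail           # list all installed software modules
module spider gurobi   # search for a specific package and its versions
```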

Before running any program that uses such modules, we have to load them. So if you want to run a Java program, you need to run the following command:
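With the module system, that is simply:

```shell
module load java
```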

If a software package is not installed, you can either ask SciNet support to install it for you or your research group, or install it in your own space. Let us install our two wonderful MIP solvers, Cplex and Gurobi.

Install Cplex:

  • Download the Cplex bin file (linux x86_64)

  • Upload the bin file to your Niagara space (preferably your $HOME, because $SCRATCH expires every 2 months; in your scripts you can then use $HOME as a predefined path to refer to the location where Cplex is installed)

  • Use the following command to change its permission to read and write:

  • Run the installer:

  • For installation, it needs an absolute path, e.g., /home/d/daniel/tiger/ILOG/CPLEX_Studio1210

  • If you want to use CPLEX or CP Optimizer engines through their Python APIs, you need to tell Python where to find them. To do so, enter the following command into the terminal:
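Putting the steps above together, the session might look like this (the .bin filename is illustrative; adjust paths to your own install location):

```shell
# Make the installer executable
chmod u+rwx cplex_studio1210.linux-x86-64.bin

# Run the installer; when prompted, give an absolute install path,
# e.g., /home/d/daniel/tiger/ILOG/CPLEX_Studio1210
./cplex_studio1210.linux-x86-64.bin

# Set up the CPLEX/CP Optimizer Python APIs (per IBM's instructions)
cd $HOME/ILOG/CPLEX_Studio1210/python
python setup.py install --home $HOME
```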

Install Gurobi


Updated (July 21st, 2022)

The latest version of Gurobi is already installed on Niagara, and Compute Canada has a free license for its use. So "technically" the only thing we need to do is to load the Gurobi module, which can be done with the following commands:
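Loading it looks like this (module names may change as versions are updated):

```shell
module avail gurobi   # optional: see which Gurobi versions are installed
module load gurobi    # load the default (latest) version
```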

These commands load the latest version of Gurobi. There is one issue, though, if you are using Python: Gurobi ships with its own Python, which does not include any packages outside the standard distribution, such as pandas or scipy. Therefore, when you run a Python code on Niagara that imports one of these packages, the job fails with the error "Package ... not found!"

The solution is given in this link, and I really appreciate that Compute Canada has dedicated a page to Gurobi and its use on the servers. The workaround is basically to create a "virtual environment", and then install Gurobi and every other package you might need there. For instance, with the following commands we build a virtual environment called "env_gurobi", activate it, then install scipy and gurobipy in env_gurobi.
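A sketch of those commands, along the lines of the Compute Canada Gurobi page (exact module versions and virtualenv flags may differ on your setup):

```shell
module load gurobi python
virtualenv ~/env_gurobi              # create the virtual environment
source ~/env_gurobi/bin/activate     # activate it
pip install scipy                    # third-party packages you need
cd ${GUROBI_HOME}                    # install gurobipy from Gurobi's own distribution
python setup.py build --build-base /tmp/${USER} install
```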

${GUROBI_HOME} and ${USER} are predefined names on the servers pointing to the installation path of Gurobi and the name of the user, respectively. So you can simply copy and paste the above commands. Later on, if we need to install another third-party package to be used with Gurobi, we first activate env_gurobi, then install it. In the following we are installing pandas in env_gurobi:
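```shell
source ~/env_gurobi/bin/activate
pip install pandas
```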




For testing purposes, you can use the login nodes. These should be small instances that take only a couple of minutes to finish and use up to 2 GB of memory. The commands are as usual on Linux servers. Just remember, we first have to load the software that we need (the packages installed on Niagara). For instance, "tiger" wants to run a Java program named myTest (in folder tests) that uses Cplex as its solver:
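Reconstructed from the breakdown that follows, the two commands are:

```shell
module load java
nohup java -Djava.library.path=$HOME/ILOG/CPLEX_Studio1210/cplex/bin/x86-64_linux/ -classpath $HOME/ILOG/CPLEX_Studio1210/cplex/lib/cplex.jar:bin tests/myTest 10 2 > ./logs/logMyTest_a10_b2.out 2>&1 &
```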

Let us analyze the second command:

  • By using the keyword nohup, your program continues to execute, even after you log out. It means "no hang up".

  • java invokes the java interpreter.

  • -Djava.library.path=$HOME/ILOG/CPLEX_Studio1210/cplex/bin/x86-64_linux/ adds Cplex to Java library path

  • -classpath $HOME/ILOG/CPLEX_Studio1210/cplex/lib/cplex.jar:bin adds cplex.jar and the bin folder of myTest to the classpath. The two are separated by ":"

  • tests/myTest is the Java class "tiger" is running

  • 10 2 are the arguments that are passed to the main() function of the Java program. They are separated with a space.

  • > ./logs/logMyTest_a10_b2.out reroutes the console output to a log file with the given name logMyTest_a10_b2.out in folder logs

  • 2>&1 tells the shell to route the standard error output (2, the file descriptor for standard error, stderr) to the standard output (file descriptor 1). Using &1, you're telling the shell that this is not a file named 1, it's a file descriptor. Roughly speaking, in the previous step you told the shell to change the standard output to your log file. Now you're adding that you want the errors to be written there as well.

  • & is for running the program in the background, in a subshell. So your main shell does not wait for the program to finish and you can continue executing other commands. You can check the performance of the task with the command "top".

To test your script for job submission though, you can reserve up to 4 compute nodes for an hour. For example:
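On Niagara, the debug queue is requested with the debugjob command; for one node:

```shell
debugjob 1
```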

In the above command, “1” means that only one node is requested for debugging. By using this mode, you can make sure that the job submission is using the node to its full capacity. To run a batch file on a compute node while testing, use the following commands (more on batch files in the next section):
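For example (the batch file name myBatch.sh is illustrative):

```shell
chmod u+x myBatch.sh   # make the batch file executable the first time
./myBatch.sh
```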

Some useful tips:

  • When you have multiple .sh files, run the subsequent ones by calling the shell explicitly, e.g. $ bash ./<file>.sh, instead of $ ./<file>.sh

  • For running the file in the background, you don't need to (shouldn't!) use the nohup command. When a compute node is reserved for debugging, after logging out every job gets killed since the resources are released. However, when jobs are submitted to a compute node by the scheduler, you are not logged into them anyway, and your job will continue running there even when you log out of your login node. So, to sum up, for executing a batch file in the background on a compute node in the debugging mode:
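That is (again with an illustrative file name):

```shell
./myBatch.sh > myBatch.log 2>&1 &
```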

  • Finally, let us kill all at once, every job "tiger" has submitted:
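```shell
scancel -u tiger
```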

Of course only "tiger" can do that!


Submitting Jobs

In most cases, you will want to submit from your $SCRATCH directory, so that the output of your compute job can be saved in a log file (as mentioned above, $HOME is read-only on the compute nodes). See the default resource capacity here. When you submit your jobs, a Scheduler assigns a compute node from the cluster to them, and they run on that particular node. Each compute node is reserved for a single user, who gets access to the full capacity of the node (40 cores and 200GB of memory); no one else can use that node at the same time.

To submit your job to the Scheduler, put your project in your $SCRATCH folder. Then run your batch file with the following command:
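For example (project and script names are illustrative):

```shell
cd $SCRATCH/myProject
sbatch myTest.sh
```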

Example of a batch file for a single job on a single compute node

If "tiger" wants to submit the Java program to the Scheduler and only needs one compute node, should look like this:

In the above batch file, “#!/bin/bash” tells Linux to run the script with the bash shell. Jobs are placed in a queue and run accordingly. The lines that start with #SBATCH are directives for the Scheduler, which decides on the order and priority of the jobs in the queue. In this example, we are requesting 1 node, and since we have only one single job (task), we are telling the Scheduler to assign all 40 cores to it. Then we give the job a name and an upper bound on its running time (which can be up to 24hrs). With “--output=myTest_output_%j.txt”, the output is saved in the log file myTest_output_%j.txt, where %j is the job ID assigned by the Scheduler. With “--mail-type=FAIL”, Niagara will send you an email if your job fails for any reason. Then come the commands we want to run. Here "tiger" moves to the directory of the project myProject with the cd command (because the current working directory starts out as $SCRATCH), then loads the Java module, and finally runs the Java program with the usual command. As mentioned before, there is no need for “2>&1 &” because the job runs in the background by default.


Updated (July 21st, 2022)

Here is a batch file example in Python, that uses Gurobi:
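A sketch of such a file, using the env_gurobi environment from above (job and script names are illustrative):

```shell
#!/bin/bash
#SBATCH --account=def-daniel
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40
#SBATCH --job-name=myGurobiJob
#SBATCH --time=24:00:00
#SBATCH --output=myGurobiJob_output_%j.txt
#SBATCH --mail-type=FAIL

cd myProject
module load gurobi
source ~/env_gurobi/bin/activate   # only needed for third-party packages
python myModel.py
```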

Note that after the Gurobi module is loaded, we are activating the virtual environment to use some third party packages. If you don't have any such packages, you don't need that line.

Another batch file example in Python, this time using Cplex 12.10 in Python 3.7:
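A sketch along the same lines, assuming Cplex was installed under $HOME as above (exact Python module name and API path may differ on your system):

```shell
#!/bin/bash
#SBATCH --account=def-daniel
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40
#SBATCH --job-name=myCplexJob
#SBATCH --time=24:00:00
#SBATCH --output=myCplexJob_output_%j.txt
#SBATCH --mail-type=FAIL

module load python/3.7
# Point Python at the CPLEX 12.10 API (path is illustrative)
export PYTHONPATH=$HOME/ILOG/CPLEX_Studio1210/cplex/python/3.7/x86-64_linux:$PYTHONPATH

cd myProject
python myModel.py
```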


I have found the following commands very useful:

  • Check the status of your jobs in the queue: $ squeue -u <username>

  • An estimate of when a certain job will start: $ squeue --start -j <jobid>

  • Check a summary of the status of recent jobs: $ sacct

  • Cancel a job: $ scancel <jobid>

  • Memory and CPU usage of a job: $ jobperf <jobid>

* As a side note, having the line “#SBATCH --account def-daniel” is not necessary, but I found that if you have it in your batch file, you can still submit your jobs even if the Scheduler is down. "def-daniel" is the default allocation of your research group, which is under the supervision of the account holder "daniel". When the Scheduler is down, you get the following error after submitting your job:

The Scheduler will also be shown in red on the status page.


Updated (Feb 2nd, 2021):

On Niagara you can nicely parallelize your jobs through preinstalled modules. Head over to this post for job parallelization.


