If you have read my introductory post on Niagara, you already know how to run a single job on a single compute node of the cluster. In this post, we use job parallelization to make much more efficient use of Niagara's resources. Before that, maybe have a look at this interesting article on parallel universes!
Consider the following batch file:
Notice that we are running two different Java programs. In this case, the Scheduler will run the above commands one at a time, consecutively: when the first task is finished, the next one is executed. So at any time, the node is only occupied with one task. Unless that task is extremely resource consuming, this is not an efficient way to work with the cluster. The recommendation is to set up your jobs such that they run in parallel and use nearly all the resources of a single node.
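To make the sequential behaviour concrete, here is a minimal sketch of such a batch file. The two shell functions are hypothetical stand-ins for the two Java programs, and the #SBATCH values are assumptions chosen only for illustration.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:15:00
#SBATCH --job-name=sequential-sketch

# Hypothetical stand-ins for the two Java programs. In a real job each
# body would be the actual call to your program.
first_task()  { echo "first task finished"; }
second_task() { echo "second task finished"; }

# A batch script is an ordinary shell script: commands run top to bottom,
# so second_task starts only after first_task has returned.
first_task
second_task
```

While first_task is running, nothing else in the script uses the node, which is exactly the waste we want to avoid.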
Job Parallelization on One Compute Node
Thankfully, parallelization is very simple with the pre-installed module “gnu-parallel” (its tutorial and manual are quite useful). The following example shows a basic usage of the module:
In this example, we request one node and “gnu-parallel” runs the jobs two at a time: two jobs start in parallel, and when one of them finishes, the third job starts. The main body of the script is as follows:
I find it easier and cleaner to define a function for my job (but it’s not necessary). “myCommand()” is a function that takes two arguments. There’s no need to declare the types of the arguments: in the shell every argument is a string, and your program already takes care of converting them (remember these are the arguments that you pass to your main() function in C++ and Java).
“export -f” exports the function so that child processes of the current shell, including the ones started by “parallel”, can call it.
Next, the working directory is changed and the necessary modules are loaded. “gnu-parallel” is loaded using “module load NiaEnv/2019b gnu-parallel”.
Finally, the jobs are parallelized with the “parallel” command. Here we tell “gnu-parallel” to run the function “myCommand” once for each combination of argument values, where ::: introduces the list of values for each argument. {10..12} means that the first argument takes the three values 10, 11 and 12. So “gnu-parallel” creates three jobs, one per first argument, and runs them two at a time until all jobs have finished.
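As a small illustration of the function and its export, the sketch below defines a placeholder “myCommand” (the echo is an assumption standing in for the real program call) and shows that a child bash process can call it once it has been exported with -f.

```shell
#!/bin/bash

# Hypothetical job function: the echo stands in for the real program call.
# $1 and $2 arrive as plain strings, just as main() would receive them.
myCommand() {
    echo "myCommand called with first=$1 second=$2"
}

# Without -f, `export` would look for a variable named myCommand.
export -f myCommand

# A child bash process can now call the function.
bash -c 'myCommand 10 5'
```

If you comment out the export line, the child process reports that myCommand is not found, which is the same failure “parallel” would run into.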
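Putting the pieces together, a single-node script along these lines might look like the sketch below. The #SBATCH values, the echo inside “myCommand” and the second argument value 5 are assumptions for illustration; the module line is the one quoted above.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:15:00
#SBATCH --job-name=gnu-parallel-sketch

# Hypothetical job function; replace the echo with the real program call.
myCommand() {
    echo "first=$1 second=$2"
}
export -f myCommand

# Work from the directory the job was submitted from (falls back to the
# current directory when the script is run outside SLURM).
cd "${SLURM_SUBMIT_DIR:-$PWD}"

module load NiaEnv/2019b gnu-parallel

# bash expands {10..12} to 10 11 12 before parallel sees it.
# Each ::: introduces the value list for one argument, so this creates
# myCommand 10 5, myCommand 11 5 and myCommand 12 5.
# -j 2 keeps at most two of them running at once.
parallel -j 2 myCommand ::: {10..12} ::: 5
```

Changing -j is all it takes to run more or fewer jobs at the same time on the node.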
Updated (July 18th, 2022):
Here is an example in Python, using Gurobi and some other third-party application:
Job Parallelization on Multiple Compute Nodes
In the above example, we had three jobs and executed them two at a time on a single node. Now assume that we have six jobs. We can request three nodes and run two jobs on each node, all in parallel. If each job can take up the whole capacity of a single node, we can even request six nodes and run one job on each, all from a single bash file.
Using multiple nodes when we need to load modules is a bit trickier than using a single node, because the other nodes do not inherit the environment of the first node. At least until the next update of gnu-parallel, we have to take care of this on our own. The following example shows how we can do this:
In this example, we are requesting three nodes and we are running two jobs at a time on each of them. Everything remains the same as in the single-node example, until we reach the “parallel” command. Using --env myCommand, we transfer the definition of our function from the first node to the rest of the nodes. We also need --wd $PWD to make sure that all nodes are working in the same directory as the first node. Now “gnu-parallel” sets up six tasks and distributes them over three nodes, two jobs per node. If we had more jobs with the same three nodes and two tasks per node, “gnu-parallel” would start the next job whenever a node finishes one, keeping all nodes busy.
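A multi-node script in this spirit could be sketched as follows. Several things here are assumptions rather than the exact original listing: the #SBATCH values, the helper join_hosts, the argument values, and in particular the choice to load modules inside the function so that each node sets up its own environment. The -S, -j, --env and --wd options are standard GNU Parallel options, and scontrol show hostnames is the usual SLURM way to list the allocated nodes.

```shell
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --time=00:15:00
#SBATCH --job-name=gnu-parallel-multinode-sketch

# Hypothetical job function. Assumption: modules are loaded inside the
# function, so every node prepares its own environment before the job.
myCommand() {
    module load NiaEnv/2019b   # plus whatever modules the program needs
    echo "$(hostname): first=$1 second=$2"
}
export -f myCommand

cd "${SLURM_SUBMIT_DIR:-$PWD}"
module load NiaEnv/2019b gnu-parallel

# Join newline-separated host names into the comma-separated list
# that the -S option expects.
join_hosts() { paste -sd, -; }
HOSTS=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | join_hosts)

# -S: nodes to run on; -j 2: two simultaneous jobs per node;
# --env: ship the exported function; --wd: same directory on every node.
parallel -S "$HOSTS" -j 2 --env myCommand --wd "$PWD" \
    myCommand ::: {10..15} ::: 5
```

With six values for the first argument, three nodes and -j 2, all six jobs can start at once; with more values, the remaining jobs wait for a free slot.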