Scheduling multiple similar jobs
Overview
Teaching: 15 min
Exercises: 10 min
Questions
How do we launch many similar jobs?
How can we monitor job arrays?
Objectives
Understand how to launch similar jobs using a job array.
We just saw an example of how parallelising a job to run over too many processes may eventually result in negative returns due to the time taken for all those processes to communicate with each other. However, sometimes we are in the lucky situation where the work we need to do can be separated into truly independent tasks, with no communication required between them. The classic example of this is a parameter search, where we run the same algorithm on a range of different inputs. In that case we can make use of a feature of the job scheduler known as a job array.
Job Arrays
Job arrays are a way to cleanly submit many similar jobs while only having to define and launch one submission script.
Here is an example submission script. Either copy and paste it into a file called example-job.sh,
or download it using:
$ wget https://aniabrown.github.io/hpc-carpentry-WHPC/files/simple_job_array_example.tar.gz
$ tar -xzf simple_job_array_example.tar.gz
$ cd simple_job_array_example
#!/bin/bash
# job configuration
#PBS -N job_array_example
#PBS -l select=1:ncpus=1
#PBS -l walltime=00:02:00
#PBS -J 1-4
# Change to the directory that the job was submitted from
# (remember this should be on the /work filesystem)
cd $PBS_O_WORKDIR
echo "Running job ${PBS_ARRAY_INDEX} of job array"
sleep 120
This is very similar to the scripts we’ve seen before, with two changes. The configuration parameter #PBS -J 1-4 tells the scheduler to launch four jobs, assigning each an index that ranges from one to four. Each job can read the index it has been assigned through the PBS variable PBS_ARRAY_INDEX.
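In a real workflow the index is usually used to pick out a different input or parameter for each job. As a minimal sketch (assuming a hypothetical file params.txt with one parameter value per line, which is not part of this example), each job could select its own line like this:
# Hypothetical example: pick the line of params.txt matching this job's index
PARAM=$(sed -n "${PBS_ARRAY_INDEX}p" params.txt)
echo "This job will run with parameter ${PARAM}"
We will see a concrete version of this idea in the exercise below, where the index selects which input file to process.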
We can launch the job array just as we would launch a single job:
$ qsub -A tc008 -q R1262266 example-job.sh
The status of the job array as a whole can be viewed using qstat -u $USER:
indy2-login0:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1266791[].indy2 ania     workq    job_array_     --   1   1     -- 00:00 B    --
We can also look at the individual jobs using qstat -t jobID[], e.g. qstat -t 1266791[]:
Job id            Name             User             Time Use S Queue
----------------- ---------------- ---------------- -------- - -----
1266791[].indy2-l job_array_examp  ania                    0 B workq
1266791[1].indy2- job_array_examp  ania             00:00:00 R workq
1266791[2].indy2- job_array_examp  ania             00:00:00 R workq
1266791[3].indy2- job_array_examp  ania             00:00:00 R workq
1266791[4].indy2- job_array_examp  ania             00:00:00 R workq
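If you realise something is wrong and need to cancel the jobs, running qdel on the job array's ID (including the square brackets) should delete the whole array at once, for example:
$ qdel 1266791[]
Individual sub-jobs can also be targeted by giving their index inside the brackets, e.g. 1266791[2].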
Remember, for all purposes other than ease of launching, these are completely separate jobs. They may get scheduled to run at different times, and each will generate its own output and error files:
$ ls
example_job.pbs job_array_example.e1266791.3 job_array_example.o1266791.2
job_array_example.e1266791.1 job_array_example.e1266791.4 job_array_example.o1266791.3
job_array_example.e1266791.2 job_array_example.o1266791.1 job_array_example.o1266791.4
Currently each job only prints out its array index, which is not very useful. Additionally, the output of these jobs is scattered across separate output files, which we will need to process somehow.
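If you just want a quick look at what each job printed, you can concatenate the standard-output files (the job ID in the file names will be different for your own submission):
$ cat job_array_example.o1266791.*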
Example with input/output files per job
Let’s take a look at a more realistic example. Download and untar the example files using:
$ wget https://aniabrown.github.io/hpc-carpentry-WHPC/files/job_array_example.tar.gz
$ tar -xzf job_array_example.tar.gz
$ cd job_array_example
$ ls
input_1.txt input_3.txt process_file.py summarise_outputs.py
input_2.txt input_4.txt submit_job_array.pbs
There are four input files, each containing a list of numbers. We can see an example using cat input_1.txt:
31587
16729
29533
25846
21477
6016
25138
30079
4120
12355
793
26439
31226
18139
4081
9797
32245
6563
15591
7784
The short Python script process_file.py takes one of these input files, finds the maximum number in the file and prints that value to an output file that we specify: slightly more useful than our first example.
The script takes two arguments, the input and output file names. For example:
$ module load anaconda/python3
$ python3 process_file.py input_1.txt output_1.txt
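As a quick check that the script behaves as expected, you could also process all four files one after another with a small shell loop (fine for this tiny example, although the point of the job array is to let the scheduler run the tasks in parallel):
# Serial loop for a quick local test; a job array does the same work in parallel
for i in 1 2 3 4; do
    python3 process_file.py input_${i}.txt output_${i}.txt
done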
Process all four input files using a job array
The example folder contains the same submission script as in the previous example. Can you add a line that runs the process_file.py script on each input file? Remember, submitting the job array submission script will launch four separate jobs, each with its own value of ${PBS_ARRAY_INDEX}.
Solution
#!/bin/bash
# job configuration
#PBS -N job_array_example
#PBS -l select=1:ncpus=1
#PBS -l walltime=00:00:30
#PBS -J 1-4
# Change to the directory that the job was submitted from
# (remember this should be on the /work filesystem)
cd $PBS_O_WORKDIR
echo "Running job ${PBS_ARRAY_INDEX} of job array"
module load anaconda/python3
python3 process_file.py input_${PBS_ARRAY_INDEX}.txt output_${PBS_ARRAY_INDEX}.txt
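Once all four jobs in the array have finished, each should have produced its own output file:
$ ls output_*.txt
output_1.txt output_2.txt output_3.txt output_4.txt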
In this example we create one output file per job in the job array, which is fairly typical. The last piece of work we need to do is to process those output files. Here there is a Python script, summarise_outputs.py, which reads through all the output files and collects the results into a summary table, summary_file.csv.
$ module load anaconda/python3
$ python3 summarise_outputs.py
$ cat summary_file.csv
job, max
1,32245
2,32244
3,29748
4,30474
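We don’t need to look inside summarise_outputs.py to follow what it does: assuming each output file holds just the maximum value written by process_file.py, a rough shell equivalent of the summary step would be:
# Hedged sketch: rebuild the summary table from the per-job output files
echo "job, max" > summary_file.csv
for i in 1 2 3 4; do
    echo "${i},$(cat output_${i}.txt)" >> summary_file.csv
done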
Responsible use
Warning: Job arrays give you a lot of power, so use it wisely! All the jobs in a job array go through the queuing system, so if you submit a job array that is too large, some of your jobs will eventually be held to let other users through. Even so, it is best not to submit thousands of jobs at once, and always remember to test on a small number of jobs first!
Key Points
Job arrays allow you to launch many jobs that are each assigned a different index value, using just one submission script.