Batch Processing

  1. Create a job script
  2. Submit the job
  3. (If necessary) Check the job status
  4. (If necessary) Cancel the job

A job script has the same format as a shell script. It consists of an option area, which specifies Slurm job submission options, and a user program area, which describes the program to be executed. The environment variables that are set automatically when a job is executed are described on a separate page.

An example job script is shown below to give an overview of its structure. Each part is explained in more detail in the following sections.

#!/bin/bash
#============ Slurm Options ===========
#SBATCH -p gr19999b # Specify the job queue (partition). Change this to the name of the queue you want to submit to.
#SBATCH -t 1:0:0  # Specify the elapsed time limit (this example specifies 1 hour)
#SBATCH --rsc p=4:t=8:c=8:m=8G # Specify requested resources
#SBATCH -o %x.%j.out # Specify the standard output file for the job
## %x is replaced by the job name and %j is replaced by the job ID.
#============ Shell Script ============

# (Optional) set -x prints each command as it is executed, which helps you track the progress of the job script.
set -x

# Environment variables such as the number of MPI processes and OMP_NUM_THREADS are set automatically based on the value of the --rsc option.
# If necessary, these can be overridden, within the allocated resources, by srun command arguments or environment variables.
srun ./a.out

## Notes on job scripts ##
# Lines beginning with '#', and the rest of a line after '#', are treated as comments. As an exception, lines beginning with #SBATCH are recognised as Slurm option specifications.
# When the job starts, the working directory is set to the directory from which the job was submitted.
# Environment variables set at job submission are inherited during job execution.

Sample job scripts for each execution type are available for reference.

Execution Type Sample File
Non-parallel (sequential) Download
Thread parallel Download
Process parallel (Intel MPI) Download
Hybrid parallel Download

Job submission options are specified in the Slurm Options area of the job script, on lines beginning with "#SBATCH".

Option Meaning Example
-p QUEUE Specify the queue (required). -p gr19999b
-t HOURS:MINUTES:SECONDS Set the upper limit of the elapsed time. -t 24:0:0
--rsc p=PROCS:t=THREADS:c=CORES:m=MEMORY or --rsc g=GPU Specify the requested resources (details are given on a separate page). --rsc p=4:t=8:c=8:m=8G or --rsc g=1
-o FILENAME Specify the file to which standard output is written. Refer to the Official Manual for the available special characters. -o result.out
-e FILENAME Specify the file to which standard error output is written. Refer to the Official Manual for the available special characters. -e result.err
-J JOBNAME Specify the job name. -J ReplaceJobName
--comment=COMMENT Specify a comment. --comment=ThisIsComment
-a ARRAY_SPEC Specify an array job (details are given on a separate page). -a 1-5
-d TYPE:JOBID Specify the order of job execution (job dependency; details are given on a separate page). -d afterok:999999
--no-requeue Specify that the job must never be requeued (automatically re-run). --no-requeue
--mail-user=MAILADDR Specify the e-mail address for notifications. --mail-user=bar@sample.com
--mail-type=TYPE Specify the events that trigger an e-mail notification: BEGIN, END, FAIL, REQUEUE, or ALL (comma-separated if more than one). --mail-type=BEGIN,END

For other options and further details, refer to the Official Manual. If necessary, also check the list of options that are not available when submitting jobs.
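As a sketch of how the -a option is typically used, the job script below runs the same program on five inputs. It assumes hypothetical input files named input_1.dat to input_5.dat and uses example --rsc values that you should adjust. Slurm sets SLURM_ARRAY_TASK_ID in each array task, and %A / %a in the -o file name expand to the array job ID and the task ID. The srun line runs only when SLURM_ARRAY_TASK_ID is set, so the helper function can be tried outside Slurm.

```bash
#!/bin/bash
#============ Slurm Options ===========
#SBATCH -p gr19999b               # Change to your own queue
#SBATCH -t 1:0:0                  # Elapsed time limit per array task
#SBATCH --rsc p=1:t=1:c=1:m=4G    # Example values only; adjust to your program
#SBATCH -a 1-5                    # Array task IDs 1, 2, 3, 4, 5
#SBATCH -o %x.%A_%a.out           # %A = array job ID, %a = array task ID
#============ Shell Script ============

# Map an array task ID to its input file (hypothetical naming scheme).
input_for_task() {
  printf 'input_%d.dat\n' "$1"
}

# Each array task processes one input; skipped when not run as an array task.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
  srun ./a.out "$(input_for_task "$SLURM_ARRAY_TASK_ID")"
fi
```

Submitting this script once with sbatch creates five tasks, each with its own output file.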

To execute a program on a compute node, you must launch it with the srun command in the job script, whether it is a sequential program or an MPI program.

Typical srun options are listed below. Refer to the Official Manual for other options and further details.

Option Function
-n PROCS Specify the number of processes to be started. If not specified, the value of p in the --rsc option is used.
-c CORES Specify the number of CPU cores to allocate per process. If not specified, the value of c in the --rsc option is used.
--ntasks-per-node=PROCS_PER_NODE Specify the number of processes per node. If not specified, processes are placed on as few nodes as possible.

To check the queues to which you can submit jobs, use the spartition command.

For each queue to which you can submit jobs, spartition displays the queue name, Rmin/Rstd/Rmax, and the default (DefTime) and maximum (MaxTime) elapsed time. Rmin/Rstd/Rmax are explained on a separate page.

$ spartition
Partition  State   Rmin  Rstd  Rmax    DefTime     MaxTime
gr19999g   UP         0    64    64   01:00:00  1-00:00:00
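The -t value in a job script must fit within the queue's MaxTime. As a sketch of how the two time notations compare, the hypothetical helper functions below convert a time in the [D-]H:M:S form used on this page (for example 1:0:0, 24:0:0, or 1-00:00:00) into seconds. Other forms accepted by Slurm, such as MM or D-HH, are deliberately not handled.

```bash
#!/bin/bash
# Convert a time in [D-]H:M:S form (e.g. 1:0:0, 1-00:00:00) to seconds.
# Only this form is handled; other Slurm time formats are not.
to_seconds() {
  local t=$1 days=0 h m s
  if [[ $t == *-* ]]; then
    days=${t%%-*}     # part before "-" is the number of days
    t=${t#*-}         # remainder is H:M:S
  fi
  IFS=: read -r h m s <<< "$t"
  # 10# forces base 10 so that values such as "08" are not read as octal
  echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Succeed (exit status 0) if the requested time $1 fits within the limit $2.
fits_within() {
  [ "$(to_seconds "$1")" -le "$(to_seconds "$2")" ]
}

# Usage: check a -t value against the MaxTime shown by spartition
#   fits_within 24:0:0 1-00:00:00 && echo "OK"
```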

To submit a job to the queue, use the sbatch command.

$ sbatch sample.sh
Submitted batch job 20
  • Give the job script file you created as the argument of the command.
  • The job script is submitted to the system, and the job ID is displayed.
  • Options specified on the sbatch command line override the corresponding options in the job script.

$ sbatch sample.sh
sbatch: cli_filter/accms_resource_req: convert_rsc_option: Updated the number of cores from c=10 to c=40 based on memory size request
Submitted batch job 21

In this example, the message indicates that the number of cores was changed from c=10 to c=40 based on the requested memory size. This is not an error, and the job is submitted normally.
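When one job must wait for another, the job ID returned by sbatch can be passed to the -d option. The sketch below assumes hypothetical job scripts (step1.sh, step2.sh, step3.sh) and uses sbatch's --parsable option, which prints only the job ID (followed by ";cluster" on multi-cluster systems) instead of the "Submitted batch job" line; only standard output is captured. SBATCH_CMD is a hypothetical hook that lets the function be dry-run without Slurm.

```bash
#!/bin/bash
# Submit job scripts in order; each job starts only after the previous one
# finishes successfully (-d afterok:JOBID). Prints the job IDs, one per line.
# SBATCH_CMD is a hypothetical override hook (defaults to sbatch).
submit_chain() {
  local prev="" id script
  for script in "$@"; do
    if [ -z "$prev" ]; then
      id=$(${SBATCH_CMD:-sbatch} --parsable "$script") || return 1
    else
      id=$(${SBATCH_CMD:-sbatch} --parsable -d "afterok:${prev}" "$script") || return 1
    fi
    id=${id%%;*}   # strip an optional ";cluster" suffix
    echo "$id"
    prev=$id
  done
}

# Usage (on a login node):
#   submit_chain step1.sh step2.sh step3.sh
```

If a job in the chain fails, the jobs after it may remain pending and can be removed with scancel.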

To display information on submitted jobs, use the squeue or sacct command.

The squeue command displays information about jobs currently registered in the queue.

$ squeue
JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
    1  gr19999b interact   b59999  R   0:33      1 no0001
  • Only jobs that have not yet finished (waiting or running) are displayed.
  • Please refer to the Official Manual for options.

If a job is waiting for execution, the reason is shown in the NODELIST(REASON) column. Typical reasons are listed below; for others, refer to the Official Manual.

REASON Meaning
Resources There are no available resources at this time.
QOSJobLimit The maximum number of concurrent executions has been reached.
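To see at a glance why jobs are waiting, the squeue output can be filtered for jobs in the PD (pending) state. The hypothetical function below parses the default squeue layout shown above: it assumes ST is the fifth column and that the reason is the trailing parenthesised text. For a more robust result, squeue can also format its own output directly, for example with squeue -t PENDING -o '%i %r'.

```bash
#!/bin/bash
# Read default-format squeue output on stdin and print "JOBID REASON"
# for each pending (ST = PD) job. Assumes the default column layout.
pending_reasons() {
  awk 'NR > 1 && $5 == "PD" {
         line = $0
         sub(/[ \t]+$/, "", line)              # drop trailing blanks
         if (match(line, /\([^()]*\)$/))       # trailing "(REASON)"
           print $1, substr(line, RSTART + 1, RLENGTH - 2)
       }'
}

# Usage:
#   squeue | pending_reasons
```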

The sacct command displays information about jobs from the accounting database, including past jobs. Please do not run it repeatedly in an automated loop, as this places a heavy load on the system.

$ sacct -X
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1               test.sh   gr19999b     b59999          8  COMPLETED      0:0
2               test.sh   gr19999b     b59999          8  COMPLETED      0:0
  • By default, the jobs executed on the day the command is issued are listed.

Option Meaning Examples
-j Display statistics for the specified job. sacct -j 1234556
-X Display only statistics related to the job assignment itself, without considering the job steps. sacct -X
-l Display all information about the job. sacct -l / sacct -Xl
-S Display jobs from the specified time onward. sacct -S 2022-10-01

For other options, please refer to the Official Manual.
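Because repeated sacct calls load the system, it is better to fetch the information once and process it locally. The sketch below counts today's jobs by state from a single sacct call, using sacct's -n (no header), -P (pipe-separated output) and -o (field selection) options. The count_states function name is hypothetical.

```bash
#!/bin/bash
# Count jobs per state from "JobID|State" lines on stdin
# (as produced by: sacct -X -n -P -o JobID,State).
# States such as "CANCELLED by 59999" are counted under their first word.
count_states() {
  awk -F'|' '{ split($2, w, " "); n[w[1]]++ }
             END { for (s in n) print s, n[s] }' | LC_ALL=C sort
}

# Usage (one sacct call, then local processing):
#   sacct -X -n -P -o JobID,State | count_states
```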

The qs command displays information about your jobs.

$ qs
 QUEUE     USER     JOBID          STATUS  PROC  CORE    MEM    ELAPSE(    limit)
 gr19999b  b59999   1              RUN        4     1  4570M  00:00:07( 01:00:00)
  • Information is displayed only during job execution.
Header Summary
QUEUE queue name
USER user name
JOBID job ID
STATUS Job status
PROC Number of processes
CORE Number of cores per process
MEM Amount of memory per process
ELAPSE Job elapsed time
limit Elapsed time limit of the job

To check the queue status of a group, use the qgroup command.

The qgroup command displays information on the group's queue as a whole and statistics for each user.

$ qgroup
 QUEUE    SYS |   RUN  PEND OTHER | ALLOC ( MIN/ STD/ MAX)
----------------------------------------------------------------
 gr19999b  B  |     0     0     0 |     0 (   0/224/224)

 QUEUE    USER     |   RUN(ALLOC)  PEND(REQUEST) OTHER(REQUEST)
----------------------------------------------------------------
 gr19999b b59999   |     1(    4)     0(     0)     0(     0)

With the -l option, per-job information is displayed in addition to the group's overall queue information and per-user statistics.

$ qgroup -l
 QUEUE    SYS |   RUN  PEND OTHER | ALLOC ( MIN/ STD/ MAX)
----------------------------------------------------------------
 gr19999b  B  |     0     0     0 |     0 (   0/224/224)

 QUEUE    USER     |   RUN(ALLOC)  PEND(REQUEST) OTHER(REQUEST)
----------------------------------------------------------------
 gr19999b b59999   |     1(    4)     0(     0)     0(     0)

 QUEUE    USER     JOBID    | STAT  SUBMIT_AT        | RSC:core | PROC CORE    MEM       ELAPSE
------------------------------------------------------------------------------------------------
 gr19999b b59999   1        | RUN   2024-06-06 16:51 |        4 |    4    1  4570M     01:00:00

Header Summary
QUEUE queue name
SYS system name
RUN,PEND,OTHER Number of running, pending, and other jobs
ALLOC Number of cores allocated
MIN Minimum guaranteed number of cores for the queue
STD Standard number of cores for the queue
MAX Maximum number of cores for the queue

Header Summary
RUN(ALLOC) Number of running jobs and allocated resources (in terms of number of cores)
PEND(REQUEST) Number of jobs waiting for execution and required resources (in terms of number of cores)
OTHER(REQUEST) Number of jobs in states other than the above and their requested resources (in terms of number of cores)

Header Summary
STAT Job Status
SUBMIT_AT Job submission date and time
RSC:core Amount of resources (in terms of number of cores)
PROC Number of processes
CORE Number of cores per process
MEM Amount of memory per process
ELAPSE Elapsed time limit for the job

To cancel a submitted job, use the scancel command.

$ scancel 20
  • Specify the job ID as an argument.
  • Please refer to the Official Manual for options.

Multiple jobs can also be cancelled at once by specifying a user name, a queue name, or a job state, as shown below.

## Cancel all jobs of the specified user.
$ scancel -u b59999

## Cancel your jobs in the specified queue.
$ scancel -p gr19999b

## Cancel your jobs in the specified state (here, pending).
$ scancel -t pending
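These filters can also be combined, in which case scancel cancels only the jobs that match all of them (for example, only your pending jobs in one queue). The wrapper below is a hypothetical convenience function; SCANCEL_CMD is a hypothetical hook that lets you print the command line with echo before anything is actually cancelled.

```bash
#!/bin/bash
# Cancel your own jobs in one queue, optionally only those in a given state.
#   $1: queue name (e.g. gr19999b)   $2: optional state (e.g. pending)
# SCANCEL_CMD is a hypothetical hook; set it to "echo" for a dry run.
cancel_my_jobs() {
  local queue=$1 state=${2:-}
  local args=(-u "${USER:-$(id -un)}" -p "$queue")
  if [ -n "$state" ]; then
    args+=(-t "$state")
  fi
  ${SCANCEL_CMD:-scancel} "${args[@]}"
}

# Dry run first, then cancel for real:
#   SCANCEL_CMD=echo cancel_my_jobs gr19999b pending
#   cancel_my_jobs gr19999b pending
```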