
Detailed documentation on most job options of the LoadLeveler [2] and on parallelization [3] is available at the SuperMUC LRZ website. Below, only details that are missing from or different from the LRZ documentation are given.

Queues

Find out about all existing queues with llclass. Note that most queues have a maximum job time of 48h. Detailed online info on a particular queue is available via

llclass -c ${class_name} -l

Note that a user has to be registered in order to submit a job to certain queues; e.g., the lrztest queue is available only to LRZ and IBM staff members.

The most important queues are

parallel Use this one for multicore and multinode jobs. The number of top dogs (jobs that partially drain the cluster to obtain the required resources) is set to 10.

serial Designed for single-core jobs.

shorttest One node is reserved for quick testing of job submission

fat The only way to access the fat node, which has more memory.

preempt One way to get more out of the cluster is to use preemptable jobs, which do not count towards your budget. These jobs are placed on idle nodes (not cores!) and can be killed at any moment if other, non-preemptable jobs require a CPU. Use at your own risk! Contact the staff if you want to use the preempt queue.

Job options

The following gives an abridged introduction to important keywords and options. A full description of all keywords is in the IBM LoadLeveler manual.

Examples of resource specifications

Many more details and examples of the resource specifications that LoadLeveler recognizes (cores, CPUs, MCMs, ...) and how they can be configured in your job options at C2PAP are given on a separate page of this wiki.

blocking, node

If you want to submit a parallel job with a number of MPI tasks to the cluster and it does not matter where the tasks run, then you have to specify the following two keywords in your submit script:

#@ blocking = unlimited
#@ total_tasks = N
...
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS

This means that LoadLeveler will try to dispatch the N tasks on the cluster in blocks of unlimited size, i.e., it will simply fill the available logical CPUs. Your tasks could run on one node or on multiple nodes.

If blocking is not specified, then node is used. Beware: the default value node = 1 assigns all tasks to a single node, which can run at most 32 tasks at the same time. Hence, if more than 32 tasks are required, either blocking = unlimited or node = K is needed for the job to start at all, where K is the number of nodes on which the tasks run.
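The node-based alternative can be sketched like this (the numbers are hypothetical: 64 tasks spread over K = 2 nodes, i.e. 32 per node):

```shell
#@ node = 2
#@ total_tasks = 64
```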

With multiple tasks, setting MP_TASK_AFFINITY is always required to properly allocate the jobs, even if you use only one thread per task!
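For instance, in the single-threaded case the affinity setting reduces to pinning one core per task (a minimal sketch using the same environment variables as the example above):

```shell
# Single-threaded case: one logical CPU per task, still pinned explicitly.
export OMP_NUM_THREADS=1
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS
echo "$MP_TASK_AFFINITY"   # prints core:1
```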

notification

#@ notification = always|error|start|never|complete

Specifies when mail is sent to the address given in the notify_user keyword:

  • always : Notify the user when the job begins, ends, or if it incurs error conditions.
  • error: Notify the user only if the job fails.
  • start: Notify the user only when the job begins.
  • never: Never notify the user.
  • complete: Notify the user only when the job ends. (Default)

#@ notify_user = email_address

Specifies the address to which mail is sent based on the notification keyword.

If you do not want to receive e-mail notifications, please ALWAYS specify #@ notification = never in your submit script.

requirements

The optional keyword #@ requirements allows you to specify on which nodes your job should run. On C2PAP, all nodes have identical hardware, so there is usually no choice to make. But suppose one of the nodes, say n005, is broken, and LoadLeveler keeps sending your jobs there even though they all abort immediately. Then you can outsmart LoadLeveler like this:

#@ requirements = (Machine != "n005")
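Conditions can also be combined with boolean operators, e.g. to avoid two nodes at once (both node names here are hypothetical):

```shell
#@ requirements = (Machine != "n005") && (Machine != "n007")
```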

resources

cpus
#@ resources = ConsumableCpus(1)

This is a required option specifying the number of threads (logical CPUs in LoadLeveler terminology) per task. For a serial job, you have one task and should choose ConsumableCpus(1). Increase the demands by increasing either the number of MPI tasks or the number of CPUs. For example,

#@ total_tasks = 5
#@ resources = ConsumableCpus(4)
...
export OMP_NUM_THREADS=4
export MP_SINGLE_THREAD=no
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS

poe my-mpi-openmp-job

would ask for 5x4 = 20 CPUs. A task is always confined to a single node; i.e., all four OpenMP threads per MPI task in the example run on the same node. Different tasks can run on one or more machines, depending on availability and your settings. For a more detailed explanation of how to reserve/allocate cores and CPUs, the difference between a core and a CPU in LoadLeveler, pinning tasks to cores, etc., check the examples page.
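As a quick sanity check, the total CPU request is simply total_tasks times ConsumableCpus; the arithmetic for the example above:

```shell
# 5 MPI tasks x 4 threads each = 20 logical CPUs requested in total
TOTAL_TASKS=5
CPUS_PER_TASK=4
echo "$(( TOTAL_TASKS * CPUS_PER_TASK ))"   # prints 20
```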

memory

By default, around 2 GB are available for a single task. If you know you need more, tell LoadLeveler, or else your job may be killed at runtime without warning if your task shares a node with other memory-hungry tasks. To ask for 20 GB in a serial task,

#@ resources = ConsumableCpus(1) ConsumableMemory(20 gb)
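ConsumableMemory is requested per task, so in a parallel job the total grows with the task count; a hypothetical sketch asking for 4 tasks with 5 GB each (20 GB overall):

```shell
#@ total_tasks = 4
#@ resources = ConsumableCpus(1) ConsumableMemory(5 gb)
```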

Example jobfiles

In the following example job files, replace my_project and my_user with your LRZ project name and your user name and save as my_job_file. If you don't know your project id, check here.

Submit the complete job file to the queueing system with

llsubmit my_job_file
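Once submitted, the usual LoadLeveler commands let you monitor or cancel the job (the job id is printed by llsubmit; my_job_id below is a placeholder):

```shell
llq -u $USER        # list your queued and running jobs
llcancel my_job_id  # cancel a job; use the id reported by llsubmit
```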

Serial

#!/bin/bash
#
### the name of your project 
#@ job_type = serial
#@ class = serial
#@ group =  my_project
#
###                   hh:mm:ss
#@ wall_clock_limit = 15:55:50
#
### Don't reserve a node unless you have to.
### One reason not to share could be that you want the full 64 GB of RAM.
#@ node_usage = shared
#@ resources = ConsumableCpus(1)
#
#@ job_name = eos-$(jobid)
#@ initialdir = $(home)/path/to/script
#
### Want to keep logs in the project directory. Try `echo $WORK` to see your directory
### Use $jobid to have separate logs for each run.
#@ output = /gpfs/work/my_project/my_user/path/to/log/$(jobid).out
#@ error  = /gpfs/work/my_project/my_user/path/to/log/$(jobid).err
#@ notification=error
#@ notify_user=my.name@lmu.de
#@ queue

 
### myScript.sh is in $(home)/path/to/script
./myScript.sh arg1 arg2

One node, multiple threads

#!/bin/bash
#
### the name of your project 
#@ job_type = parallel
#@ class = parallel
#@ group =  my_project
#
###                   hh:mm:ss
#@ wall_clock_limit = 15:55:50
#@ node = 1
#@ total_tasks = 1
#
### Don't reserve a node if you need only four threads
#@ node_usage = shared
#@ resources = ConsumableCpus(4)
#
#@ job_name = eos-$(jobid)
#@ initialdir = $(home)/path/to/script
#
### Want to keep logs in the project directory. Try `echo $WORK` to see your directory
### Use $jobid to have separate logs for each run.
#@ output = /gpfs/work/my_project/my_user/path/to/log/$(jobid).out
#@ error  = /gpfs/work/my_project/my_user/path/to/log/$(jobid).err
#@ notification=error
#@ notify_user=my.name@lmu.de
#@ queue

### Only need this if your code uses OpenMP
export OMP_NUM_THREADS=4
 
### myScript.sh is in $(home)/path/to/script
./myScript.sh arg1 arg2

Multiple nodes, multiple single-threaded tasks

#!/bin/bash
#
#@ job_type = parallel
#@ class = parallel
#@ group = my_project
#
#@ blocking = unlimited
#@ total_tasks = N
#@ node_usage = shared
#
#@ wall_clock_limit = 48:00:00
#@ resources = ConsumableCpus(1)
#@ notification = error
#@ notify_user = my.name@lmu.de
#@ output = /gpfs/work/my_project/my_user/path/to/log/$(jobid).out
#@ error  = /gpfs/work/my_project/my_user/path/to/log/$(jobid).err
#@ queue

export OMP_NUM_THREADS=1
export MP_TASK_AFFINITY=cpu:$OMP_NUM_THREADS

poe ./hello

Here we don't care where the tasks run, so if the cluster is full, this allows the job to start sooner. If you want to reserve full nodes, then replace

#@ blocking = unlimited
#@ total_tasks = N
#@ node_usage = shared

by

#@ total_tasks = N
#@ node = K

or

#@ total_tasks = N
#@ tasks_per_node = 32

and make sure that N / K <= 32. Of course, you can run fewer than 32 tasks per node if you need, e.g., 10 GB of RAM per task, but your budget will be charged for all 32 cores if you reserve the whole node!

Interactive jobs

For testing or graphical output, it may be necessary to run interactively, so that you see the output from a job running on a compute node immediately. Most of the above applies without change: save all LoadLeveler options in a file, but instead of submitting it with llsubmit, run the command directly with poe. A trivial example that lists the contents of the current directory from a compute node instead of a login node:

cat > LL_FILE <<EOD
#@ job_type = parallel
#@ group = my_project
#@ node = 1
#@ total_tasks = 1
#@ class = shorttest
#@ queue
EOD
poe ls -rmfile LL_FILE

Note that the job_type must be parallel even though there is no parallelism involved. Interactive jobs support only IBM MPI; Intel MPI (job_type = mpich) is not supported.
