Danzek::SGE
Notes:
- Scheduler conf: execd_params ENABLE_ADDGRP_KILL ...for Abaqus...
- SGE user/access_lists: only the primary Unix group is considered; secondary groups are ignored...
- Other topics: advance reservations; queue-draining.
These pages describe the SGE configuration on Danzek (The CSF): the approach we have taken and which features of SGE we have used to implement it.
1. Background: Scheduling Policy Requirements
- Contributions
- The cluster compute nodes come from funds contributed by research groups from the University. By means of the scheduling policy, usage by members of these research groups should reflect the level of contribution by their group.
- High-Priority and Low-Priority Queues
-
The original idea was to have both high-priority and low-priority queues on
each contributed compute node: only members of a contributor's research
group would have access to high-priority queues; all system users would have
access to low-priority queues. Low-priority queues would be subordinate.
For example:
#
# First high-priority queue:
#
qname             group-01.q
hostlist          @group-01
.
subordinate_list  low-priority.q=1
.
.

#
# Second high-priority queue:
#
qname             group-02.q
hostlist          @group-02
.
subordinate_list  low-priority.q=1
.
.

#
# Low-priority queue:
#
qname             low-priority.q
hostlist          @group-01,@group-02
.
subordinate_list  NONE
.
.
Notes:
- group-01.q and group-02.q are high-priority queues, each implemented on its own (mutually exclusive) group of nodes.
- One or more slots active on a node in @group-01 in group-01.q (similarly for group-02) will suspend all low-priority.q slots on that node and also suspend any job running in those low-priority slots.
- Fair-share Scheduling
-
The case for fair-share scheduling instead of the above high/low-priority
approach was made:
- The use of a low-priority queue can be frustrating to users: jobs can be left queued, or started but then suspended, indefinitely.
- With fair-share, a contributor's users can sometimes run (parallel) jobs across more slots than they purchased.
- While some potential contributors were very suspicious of fair-share, the decision was made to go ahead and review after some months, with high/low-priority queues as a fallback position.
2. Strategy 1/2: Abstract User Experience(!) Away from Queues
The strategy/requirements of the Danzek SGE configuration are summarised in this section. A dedicated, detailed section on each requirement, and the SGE functionality used to implement it, appears below.
2.1. Danzek Will Keep Changing/Growing
- Danzek is a growing and changing system; compute nodes are added at unpredictable times and these new nodes may be of a different specification to existing nodes.
- All compute nodes of a particular specification live in one (or more, mutually subordinate) queue(s). All compute nodes in a given queue are of the same specification.
- Hence Danzek's queue-space is an ever-changing place to live.
2.2. Minimise Change that Users See
To minimise the change that users experience:
- Each parallel code/application available on the system is documented to work with a given, fixed (in name at least) parallel environment (PE).
- Jobs should be submitted by specifying forced complex attributes, i.e., options such as -l short or -l fat, and/or a PE; users should not specify a particular queue. The scheduler will select a queue which satisfies the options and/or PE given. This underlying queue can, and probably will, change.
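For example (job-script name illustrative; the complexes and PEs available for each application/environment are those given in its documentation):

  # a short job: no queue is named, the forced complex steers it
  qsub -l short myjob.sh

  # an 8-slot SMP job on the fat nodes; again, no queue is named
  qsub -l fat -pe smp.pe 8 myjob.sh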
2.3. Hostgroups, Queues and PEs etc.
- There are several types of compute node — one SGE hostgroup for each type.
- On each hostgroup, one or more queues are hosted. If more than one, SGE subordination is used so that only one of the queues is active at once.
- On each queue, zero or more PEs are hosted.
- Some PEs are hosted on more than one queue. In this case the queues must be hosted on different hostgroups, owing to a "feature" of SGE. (We have seen an SMP job, started in smp.pe, given half its slots in one queue and half in another, both queues on the same host. The queues, being mutually subordinate, both suspend and the job never runs!)
- smp.pe should be moved back and forth between C6100-STD.q and C6100-STD-serial.q depending on what is in the queue, at the sysadmin's discretion (see the sketch after this list)! Indeed, it could be moved to some other queue/hostgroup entirely, for example R410-sevenday.q, should it come to exist.
- For users, we term each hostgroup an environment, e.g., Intel, 4GB/core, no IB, or AMD, 2GB/core, IB.
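As a sketch of the smp.pe shuffling mentioned above, the PE can be moved between queues with qconf's list-attribute operations (or by editing pe_list interactively with qconf -mq); queue names as above:

  # remove smp.pe from the pe_list of one queue...
  qconf -dattr queue pe_list smp.pe C6100-STD.q

  # ...and add it to the pe_list of another
  qconf -aattr queue pe_list smp.pe C6100-STD-serial.q

  # confirm
  qconf -sq C6100-STD-serial.q | grep pe_list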
3. Strategy 2/2: Fair-share Scheduling
- A contributor's CPU share should reflect their financial contribution
- Each contributor and members of their research group have accounts on Danzek. Integrated over a period of time, the total CPU usage of the group should reflect the size of the financial contribution to Danzek. For example, if a contributor gave 15k and the total invested to date is 450k, the group should receive 15/450 = 3.3% of the available CPU hours averaged over, say, a month or two (provided they submit sufficient jobs).
- A contributor's users should each get the same CPU share
- Within a research group, there should be a mechanism to prevent some individuals hogging CPU hours at the expense of other members of the group.
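Both requirements map onto SGE's share-tree policy. The sketch below is illustrative only: group names and share values are examples, and the node format should be checked against share_tree(5) and qconf -sstree output before use. One project node per contributing group carries shares proportional to the contribution, and a "default" user leaf under each project gives every member of that group an equal share (users are tied to their group's project via their default_project):

  # Enable share-tree tickets in the scheduler config (weight_tickets_share > 0):  qconf -msconf
  # Show / edit the share tree:  qconf -sstree  /  qconf -mstree
  #
  # Illustrative tree: group-01 contributed 15 units, group-02 contributed 30.
  id=0
  name=Root
  type=0
  shares=1
  childnodes=1,2
  id=1
  name=group-01
  type=1
  shares=15
  childnodes=3
  id=2
  name=group-02
  type=1
  shares=30
  childnodes=4
  id=3
  name=default
  type=0
  shares=1
  childnodes=NONE
  id=4
  name=default
  type=0
  shares=1
  childnodes=NONE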
4. Requirements: Bits and Pieces
- Ability to run serial, SMP, multi-host MPI and interactive jobs, dynamically allocating the number of slots for each, as appropriate
- Some compute nodes are reserved for MPI-based jobs of 64 or more processes, e.g., those which are Infiniband-connected. Others can be considered general-purpose nodes and are expected to run small MPI-based jobs, SMP jobs (e.g., OpenMP-based), plus serial and interactive (e.g., GUI-based) jobs. Given that we may want different resource quotas or other restrictions on different types of job, we create multiple cluster queues, e.g., one primarily aimed at serial jobs, one for multi-host jobs and another for interactive work. The number of slots allocated to each queue must adjust dynamically to match demand. This is achieved through mutual queue subordination and resource quotas.
- Multi-thread (SMP, OpenMP) jobs to run on one host(!)
-
Many codes/applications run multi-threaded, or run multi-process but do not
scale at all well beyond, say, eight or 12 processes. We require that
all such threads and/or processes run on one host. (Running an
eight-process job over two hosts, when it could run on one, would be nuts.)
This is easily achieved by implementing a
PE with allocation_rule set to $pe_slots (see the PE sketch at the end of this section).
- Multi-process (e.g., MPI) jobs to run on as few hosts as possible
-
It makes no sense for a 12-process job to run across two nodes (e.g., four
processes on one and eight on another), when it is possible to run all
processes on one node; also a 24-process job may well run faster across
two nodes than three. Furthermore, unnecessary fragmentation of jobs
tends to lead to further fragmentation. For this reason
we implement parallel environments in which jobs must use whole hosts only (again, see the PE sketch at the end of this section).
- Limit Users — Prevent Extreme Behaviour
- Though fair-share is implemented so that each group's usage, and each individual's usage within a group, will integrate over time to reflect the group's contribution, in the short term individuals can sometimes dominate queues, or large parts of the cluster. We can prevent this: see Limiting User Greed — Resource Quotas.
- Fluent
-
Things we must deal with:
- licensing handled automatically, with appropriate restrictions on the number of tokens used on this cluster, and no job to run without first ensuring sufficient licences are available
- parallel environment sets up...
- the job should check whether enough licences are available and, if not, requeue itself
- Abaqus
-
Things we must deal with:
- licensing handled automatically, with appropriate restrictions on the number of tokens used on this cluster, and no job to run without first ensuring sufficient licences are available
- parallel environment sets up mp_host_list and cpus automatically
- other than "-l abaqus", we don't want any complication for users
- Handle (GP)GPU-hosting compute-nodes in a sensible way
-
Things we must deal with:
- only one or two GPUs on each (12-core) host, so need one or two slots only (not 12);
- want interactive and batch, mutually-subordinate;
- want to be able to separate out 2070s from 2050s.
- Minimise User Submission Errors
- Server-side JSVs...
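The two PE requirements above (threads on one host; MPI jobs on whole hosts) translate into two kinds of parallel environment, sketched below. This is illustrative only: smp.pe is the name used elsewhere on these pages, mpi-12.pe is a hypothetical name, the fixed allocation rule assumes 12-core nodes, and the remaining attributes are left at typical values:

  # SMP PE: all of a job's slots come from a single host
  pe_name            smp.pe
  slots              9999
  allocation_rule    $pe_slots
  control_slaves     FALSE
  job_is_first_task  TRUE

  # MPI PE: each participating host contributes exactly 12 slots,
  # so jobs occupy whole 12-core hosts only
  pe_name            mpi-12.pe
  slots              9999
  allocation_rule    12
  control_slaves     TRUE
  job_is_first_task  TRUE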
5. Matching Jobs to Queues: PEs, Forced Complex Attributes (Hard Resource Lists) and Queue Sequence Numbers
6. Multiple Queues on a Host with Dynamic Allocation of Slots to Queues: Subordinate Queues and Resource Quotas
- Want minimal fragmentation of parallel jobs and as many serial jobs grouped together as possible, so change the node-scheduling algorithm to slots (fill up nodes rather than spread the load; see the scheduler sketch at the end of this section).
Examples:
- FAT and VFAT, with/without Interactive, all on One Hostgroup
- Serial and Parallel, with/without Interactive, all on One Hostgroup
SGE:: Add serial queue
----------------------
-- added C6100-STD-serial.q, a clone of C6100-STD.q, except that no PEs are associated with this serial queue, and:
-- these two queues operate on the same compute nodes with mutual exclusion, i.e.,
     subordinate_list C6100-STD-serial.q=1
   and
     subordinate_list C6100-STD.q=1
-- C6100-STD-serial.q: seq_no 0
   C6100-STD.q:        seq_no 1
   ...so serial jobs should usually start in C6100-STD-serial.q
-- it remains to group jobs together (cf. PE fill_up) rather than spread the load
   * see: _notes_sge/least_used_vs_fill_up.html
-- limit use of the serial queue:
     {
        name         C6100-STD-serial.q.rqs
        description  NONE
        enabled      TRUE
        limit        users {*} queues C6100-STD-serial.q to slots=48
     }
-- changed qtype of C6100-STD.q to NONE (from "BATCH INTERACTIVE") to eliminate serial jobs
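To get the fill-up behaviour noted above (group jobs onto as few nodes as possible rather than spread the load), the scheduler configuration can order hosts by slot usage instead of load average. A minimal sketch, assuming the stock sched_conf attributes; whether slots or -slots gives the desired fill-up ordering should be verified against the least_used_vs_fill_up notes referenced above:

  # qconf -msconf  (relevant lines only; illustrative)
  queue_sort_method    seqno
  load_formula         slots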
7. Forced Complex Attributes for Short, Fat and other Queues
Following http://gridengine.info/2006/06/26/creating-a-dedicated-short-queue we added a complex ("qconf -mc", then "qconf -sc" to confirm):

  #name   shortcut  type  relop  requestable  consumable  default  urgency
  short   short     BOOL  ==     FORCED       NO          0        0

and "C6100-STD-short.q", which has complex_values short=1 rather than "NONE" like all the others. And also "@C6100-STD-short", of course!

-- now we can use -l short
-- N.B. selecting C6100-STD-short.q withOUT "-l short" will NOT work!
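The FORCED requestable setting is what keeps ordinary jobs out of the short queue: only jobs which explicitly request the complex can be scheduled to a queue whose complex_values defines it. A brief usage sketch (job-script name illustrative):

  # fine: lands in C6100-STD-short.q (or any other queue offering short=1)
  qsub -l short myjob.sh

  # does NOT work: naming the queue without requesting the forced complex
  # means C6100-STD-short.q cannot accept the job
  qsub -q C6100-STD-short.q myjob.sh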