Danzek::SGE
Notes:
- Scheduler conf: execd_params ENABLE_ADDGRP_KILL ...for Abaqus...
- SGE user/access_lists: only the primary Unix group is considered; secondary groups are ignored...
- Other topics: advance reservations; queue-draining.
These pages describe the SGE configuration on Danzek (The CSF): the approach we have taken and which features of SGE we have used to implement it.
1. Background: Scheduling Policy Requirements
- Contributions
- The cluster compute nodes come from funds contributed by research groups from the University. By means of the scheduling policy, usage by members of these research groups should reflect the level of contribution by their group.
- High-Priority and Low-Priority Queues
-
The original idea was to have both high-priority and low-priority queues on
each contributed compute node: only members of a contributor's research
group would have access to high-priority queues; all system users would have
access to low-priority queues. Low-priority queues would be subordinate.
For example:
#
# First high-priority queue:
#
qname             group-01.q
hostlist          @group-01
.
subordinate_list  low-priority.q=1
.
.

#
# Second high-priority queue:
#
qname             group-02.q
hostlist          @group-02
.
subordinate_list  low-priority.q=1
.
.

#
# Low-priority queue:
#
qname             low-priority.q
hostlist          @group-01,@group-02
.
subordinate_list  NONE
.
.
Notes:
- group-01.q and group-02.q are high-priority queues, each implemented on its own (mutually exclusive) group of nodes.
- One or more slots active on a node in @group-01 in group-01.q (similarly for group-02) will suspend all low-priority.q slots on that node and also suspend any job running in those low-priority slots.
- Fair-share Scheduling
-
The case for fair-share scheduling instead of the above high/low-priority
approach was made:
- The use of a low-priority queue can be frustrating to users: jobs can be left queued, or started but then suspended, indefinitely.
- With fair-share, a contributor's users can sometimes run (parallel) jobs across more slots than they purchased.
- While some potential contributors were very suspicious of fair-share, the decision was made to go ahead and review after some months, with high/low-priority queues as a fallback position.
2. Strategy 1/2: Abstract User Experience(!) Away from Queues
The strategy/requirements of the Danzek SGE configuration are summarised in this section. A dedicated, detailed section on each requirement, and the SGE functionality used to implement it, appears below.
2.1. Danzek Will Keep Changing/Growing
- Danzek is a growing and changing system; compute nodes are added at unpredictable times and these new nodes may be of a different specification to existing nodes.
- All compute nodes of a particular specification live in one (or more, mutually subordinate) queue(s). All compute nodes in a given queue are of the same specification.
- Hence Danzek's queue-space is an ever-changing place to live.
2.2. Minimise Change that Users See
To minimise the change that users experience:
- Each parallel code/application available on the system is documented to work with a given, fixed (in name at least) parallel environment (PE).
- Jobs should be submitted by specifying forced complex attributes, i.e., options such as -l short or -l fat, and/or a PE; users should not specify a particular queue. The scheduler will select a queue which satisfies the options and/or PE given. This underlying queue can, and probably will, change.
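For example (job-script name illustrative; the complexes and PEs available for each application/environment are those given in its documentation):

  # a short job: no queue is named, the forced complex steers it
  qsub -l short myjob.sh

  # an 8-slot SMP job on the fat nodes; again, no queue is named
  qsub -l fat -pe smp.pe 8 myjob.sh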
2.3. Hostgroups, Queues and PEs etc.
- There are several types of compute node — one SGE hostgroup for each type.
- On each hostgroup, one or more queues are hosted. If more than one, SGE subordination is used so that only one of the queues is active at once.
- On each queue, zero or more PEs are hosted.
- Some PEs are hosted on more than one queue. In this case the queues must be hosted on different hostgroups, owing to a "feature" of SGE. (We have seen an SMP job, started in smp.pe, given half its slots in one queue and half in another, both queues on the same host. The queues, being mutually subordinate, both suspend and the job never runs!)
- smp.pe should be moved back and forth between C6100-STD.q and C6100-STD-serial.q depending on what is in the queue, at the sysadmin's discretion (see the sketch after this list)! Indeed, it could be moved to some other queue/hostgroup entirely, for example R410-sevenday.q, should it come to exist.
- For users, we term each hostgroup an environment, e.g., Intel, 4GB/core, no IB, or AMD, 2GB/core, IB.
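As a sketch of the smp.pe shuffling mentioned above, the PE can be moved between queues with qconf's list-attribute operations (or by editing pe_list interactively with qconf -mq); queue names as above:

  # remove smp.pe from the pe_list of one queue...
  qconf -dattr queue pe_list smp.pe C6100-STD.q

  # ...and add it to the pe_list of another
  qconf -aattr queue pe_list smp.pe C6100-STD-serial.q

  # confirm
  qconf -sq C6100-STD-serial.q | grep pe_list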
3. Strategy 2/2: Fair-share Scheduling
- A contributor's CPU share should reflect their financial contribution
- Each contributor and members of their research group have accounts on Danzek. Integrated over a period of time, the total CPU usage of the group should reflect the size of the financial contribution to Danzek. For example, if a contributor gave 15k and the total invested to date is 450k, the group should receive 15/450 = 3.3% of the available CPU hours averaged over, say, a month or two (provided they submit sufficient jobs).
- A contributor's users should each get the same CPU share
- Within a research group, there should be a mechanism to prevent some individuals hogging CPU hours at the expense of other members of the group.
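Both requirements map onto SGE's share-tree policy. The sketch below is illustrative only: group names and share values are examples, and the node format should be checked against share_tree(5) and qconf -sstree output before use. One project node per contributing group carries shares proportional to the contribution, and a "default" user leaf under each project gives every member of that group an equal share (users are tied to their group's project via their default_project):

  # Enable share-tree tickets in the scheduler config (weight_tickets_share > 0):  qconf -msconf
  # Show / edit the share tree:  qconf -sstree  /  qconf -mstree
  #
  # Illustrative tree: group-01 contributed 15 units, group-02 contributed 30.
  id=0
  name=Root
  type=0
  shares=1
  childnodes=1,2
  id=1
  name=group-01
  type=1
  shares=15
  childnodes=3
  id=2
  name=group-02
  type=1
  shares=30
  childnodes=4
  id=3
  name=default
  type=0
  shares=1
  childnodes=NONE
  id=4
  name=default
  type=0
  shares=1
  childnodes=NONE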
4. Requirements: Bits and Pieces
- Ability to run serial, SMP, multi-host MPI and interactive jobs, dynamically allocating the number of slots for each, as appropriate
- Some compute nodes are reserved for MPI-based jobs of 64 or more processes, e.g., those which are Infiniband-connected. Others can be considered general-purpose nodes and are expected to run small MPI-based jobs, SMP jobs (e.g., OpenMP-based), plus serial and interactive (e.g., GUI-based) jobs. Given that we may want different resource quotas or other restrictions on different types of job, we create multiple cluster queues, e.g., one primarily aimed at serial jobs, one for multi-host jobs and another for interactive work. The number of slots allocated to each queue must adjust dynamically to match demand. This is achieved through mutual queue subordination and resource quotas.
- Multi-thread (SMP, OpenMP) jobs to run on one host(!)
-
Many codes/applications run multi-threaded, or run multi-process but do not
scale at all well beyond, say, eight or 12 processes. We require that
all such threads and/or processes run on one host. (Running an
eight-process job over two hosts, when it could run on one, would be nuts.)
This is easily achieved by implementing a
PE with allocation_rule set to $pe_slots (see the PE sketch at the end of this section).
- Multi-process (e.g., MPI) jobs to run on as few hosts as possible
-
It makes no sense for a 12-process job to run across two nodes (e.g., four
processes on one and eight on another), when it is possible to run all
processes on one node; also a 24-process job may well run faster across
two nodes than three. Furthermore, unnecessary fragmentation of jobs
tends to lead to further fragmentation. For this reason
we implement parallel environments in which jobs must use whole hosts only (again, see the PE sketch at the end of this section).
- Limit Users — Prevent Extreme Behaviour
- Though fair-share is implemented so that each group's usage, and each individual's usage within a group, will integrate over time to reflect the group's contribution, in the short term individuals can sometimes dominate queues, or large parts of the cluster. We can prevent this: see Limiting User Greed — Resource Quotas.
- Fluent
-
Things we must deal with:
- licensing handled automatically, with appropriate restrictions on the number of tokens used on this cluster, and no job to run without first ensuring sufficient licences are available
- parallel environment sets up...
- the job should check whether enough licences are available and, if not, requeue itself
- Abaqus
-
Things we must deal with:
- licensing handled automatically, with appropriate restrictions on the number of tokens used on this cluster, and no job to run without first ensuring sufficient licences are available
- parallel environment sets up mp_host_list and cpus automatically
- other than "-l abaqus", we don't want any complication for users
- Handle (GP)GPU-hosting compute-nodes in a sensible way
-
Things we must deal with:
- only one or two GPUs on each (12-core) host, so need one or two slots only (not 12);
- want interactive and batch, mutually-subordinate;
- want to be able to separate out 2070s from 2050s.
- Minimise User Submission Errors
- Server-side JSVs...
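The two PE requirements above (threads on one host; MPI jobs on whole hosts) translate into two kinds of parallel environment, sketched below. This is illustrative only: smp.pe is the name used elsewhere on these pages, mpi-12.pe is a hypothetical name, the fixed allocation rule assumes 12-core nodes, and the remaining attributes are left at typical values:

  # SMP PE: all of a job's slots come from a single host
  pe_name            smp.pe
  slots              9999
  allocation_rule    $pe_slots
  control_slaves     FALSE
  job_is_first_task  TRUE

  # MPI PE: each participating host contributes exactly 12 slots,
  # so jobs occupy whole 12-core hosts only
  pe_name            mpi-12.pe
  slots              9999
  allocation_rule    12
  control_slaves     TRUE
  job_is_first_task  TRUE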
5. Matching Jobs to Queues: PEs, Forced Complex Attributes (Hard Resource Lists) and Queue Sequence Numbers
6. Multiple Queues on a Host with Dynamic Allocation of Slots to Queues: Subordinate Queues and Resource Quotas
- Want minimal fragmentation of parallel jobs and as many serial jobs grouped together as possible, so change the node-scheduling algorithm to slots (fill up nodes rather than spread the load; see the scheduler sketch at the end of this section).
Examples:
- FAT and VFAT, with/without Interactive, all on One Hostgroup
- Serial and Parallel, with/without Interactive, all on One Hostgroup
SGE:: Add serial queue
----------------------
-- added C6100-STD-serial.q, a clone of C6100-STD.q, except that no PEs are associated with this serial queue, and:
-- these two queues operate on the same compute nodes with mutual exclusion, i.e.,
     subordinate_list C6100-STD-serial.q=1
   and
     subordinate_list C6100-STD.q=1
-- C6100-STD-serial.q: seq_no 0
   C6100-STD.q:        seq_no 1
   ...so serial jobs should usually start in C6100-STD-serial.q
-- it remains to group jobs together (cf. PE fill_up) rather than spread the load
   * see: _notes_sge/least_used_vs_fill_up.html
-- limit use of the serial queue:
     {
        name         C6100-STD-serial.q.rqs
        description  NONE
        enabled      TRUE
        limit        users {*} queues C6100-STD-serial.q to slots=48
     }
-- changed qtype of C6100-STD.q to NONE (from "BATCH INTERACTIVE") to eliminate serial jobs
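To get the fill-up behaviour noted above (group jobs onto as few nodes as possible rather than spread the load), the scheduler configuration can order hosts by slot usage instead of load average. A minimal sketch, assuming the stock sched_conf attributes; whether slots or -slots gives the desired fill-up ordering should be verified against the least_used_vs_fill_up notes referenced above:

  # qconf -msconf  (relevant lines only; illustrative)
  queue_sort_method    seqno
  load_formula         slots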
7. Forced Complex Attributes for Short, Fat and other Queues
Following http://gridengine.info/2006/06/26/creating-a-dedicated-short-queue we added a complex ("qconf -mc", then "qconf -sc" to confirm):

  #name   shortcut  type  relop  requestable  consumable  default  urgency
  short   short     BOOL  ==     FORCED       NO          0        0

and "C6100-STD-short.q", which has complex_values short=1 rather than "NONE" like all the others. And also "@C6100-STD-short", of course!

-- now we can use -l short
-- N.B. selecting C6100-STD-short.q withOUT "-l short" will NOT work!
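The FORCED requestable setting is what keeps ordinary jobs out of the short queue: only jobs which explicitly request the complex can be scheduled to a queue whose complex_values defines it. A brief usage sketch (job-script name illustrative):

  # fine: lands in C6100-STD-short.q (or any other queue offering short=1)
  qsub -l short myjob.sh

  # does NOT work: naming the queue without requesting the forced complex
  # means C6100-STD-short.q cannot accept the job
  qsub -q C6100-STD-short.q myjob.sh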