UoM::RCS::Talby::Danzek::SGE



Parallel Environments, Host/Machine Files and Loose & Tight Integration of MPI

1. 

Miscellaneous Notes

 -- The prolog, jobscript, epilog and associated PE start/stop scripts are
    executed one after the other as child processes (or sub-shells).

 -- Therefore one cannot, e.g., export environment variables from a PE
    start procedure to the job itself.
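The second note can be demonstrated in a few lines of bash:

```shell
#!/bin/bash
# Sketch: a variable exported inside a sub-shell/child process (which is
# how PE start scripts run) is invisible to the parent and to siblings.
( export FROM_PE="hello" )          # child process sets and exports a variable
echo "FROM_PE in parent: '${FROM_PE:-unset}'"
```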

2. 

Parallel Environment Fields and Attributes

2.1. 

Start Proc Args and Stop Proc Args

  1. If no such procedures are required for a certain parallel environment, you can leave the fields empty.
  2. The first argument is usually the name of the start or stop procedure itself; the remaining parameters are command-line arguments to these procedures.
  3. A variety of special identifiers, which begin with a $ prefix, are available to pass internal runtime information to the procedures. The sge_pe man page contains a list and description of all available parameters. These are:
            $pe_hostfile
            $host 
            $job_owner
            $job_id
            $job_name
            $pe   
            $pe_slots
            $processors
            $queue
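As an illustration, a minimal start procedure might simply log the allocation it was given. The script name and argument order below are assumptions for the sketch; SGE expands the $-identifiers listed above before the procedure is invoked, so they arrive as ordinary positional arguments:

```shell
#!/bin/bash
# Hypothetical start procedure, configured (for example) as:
#   start_proc_args /opt/gridware/ge-local/log_pe_start.sh $pe_hostfile $host $job_id
# The $-identifiers are expanded by SGE before this script runs.

log_pe_start() {
    local pe_hostfile=$1   # path to the generated pe_hostfile
    local master_host=$2   # host on which the master task starts
    local job_id=$3        # numeric SGE job id
    echo "Starting PE for job $job_id on $master_host"
    # Print just the hostname and slot-count columns of the pe_hostfile.
    awk '{print $1, $2}' "$pe_hostfile"
}

# In the real script the last line would simply be: log_pe_start "$@"
```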

2.2. 

Allocation Rule

A positive integer
This specifies the number of processes which are run on each host. For example, for eight-core nodes, one could set this field to 8 so that only jobs of size 8, 16, 24, ... are accepted by the scheduler for this PE and all cores on each host are always used. This will:
  • minimise the number of hosts on which a job runs;
  • eliminate competition for on-host resources between this job and others.
$pe_slots
Use the special denominator $pe_slots to cause all the processes of a job to be allocated on a single host (SMP).
$fill_up
Fill each available host with as many of the job's processes as will fit before moving on to the next host.
$round_robin
Allocate the job's processes one at a time across the available hosts in turn until all slots are assigned, i.e., spread the job as thinly as possible.
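For a fixed positive-integer rule the arithmetic is straightforward; as a sketch (the slot numbers are illustrative only), a 64-slot job in a PE with allocation rule 8 lands on exactly 64/8 hosts:

```shell
# Sketch: host count implied by a fixed allocation_rule (illustrative numbers).
slots_requested=64     # e.g. qsub -pe fixed8.pe 64
allocation_rule=8      # eight processes per host, as in the example above
hosts_needed=$((slots_requested / allocation_rule))
echo "hosts needed: $hosts_needed"
```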

2.3. 

Urgency Slots

For jobs submitted with a slot range, this determines how a particular number of slots is chosen from that range for use in the resource-request-based priority contribution for numeric resources. In other words:

a job's priority depends partly on the resources requested, e.g., the number of slots; the Urgency Slots value determines how the number of slots is chosen for the priority calculation if a PE slot range has been specified.

One can specify:

min or max
Take the minimum or maximum value from the specified range.
avg
Take the average of the specified range.
An integer value
Use that fixed value, regardless of the range.
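As a sketch of the symbolic settings, for a job submitted with a slot range (the range 16-64 below is illustrative only):

```shell
# Sketch: the slot value fed into the urgency calculation for a job
# submitted with, e.g., "qsub -pe some.pe 16-64" (illustrative range).
range_min=16
range_max=64
urgency_min=$range_min                          # urgency_slots min
urgency_max=$range_max                          # urgency_slots max
urgency_avg=$(( (range_min + range_max) / 2 ))  # urgency_slots avg
echo "min=$urgency_min max=$urgency_max avg=$urgency_avg"
```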

2.4. 

Control Slaves

Specifies whether SGE generates (parallel) processes — child processes under sge_execd and sge_shepherd — or whether the PE creates its own processes:

  1. If control slaves is set, SGE generates the (parallel) processes.
  2. Full control over slave tasks by SGE is strongly preferable, e.g., to ensure reported accounting figures are correct and to allow (parallel) process control — see loose vs tight integration, below.

2.5. 

Job is first task

  1. Significant only if control slaves is set (TRUE).
  2. If Job is first task is set, the job script, or one of its child processes, acts as one of the parallel tasks of the parallel application. If Job is first task is unset, the job script initiates the parallel application but does not participate.
  3. For PVM, this field should usually be set (TRUE).
  4. For MPI, this field should not usually be set (FALSE).
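One practical consequence (a sketch; the helper name is invented for illustration) is the number of tasks the PE may start itself: if the job script counts as one of the parallel tasks, it occupies one of the granted slots, so only NSLOTS - 1 further tasks may be started; if it does not, the PE starts all NSLOTS tasks:

```shell
# Sketch: how many slave tasks the PE may start for a job granted
# $NSLOTS slots, depending on job_is_first_task. Helper name invented.
slave_tasks_allowed() {
    local nslots=$1 job_is_first_task=$2
    if [ "$job_is_first_task" = "TRUE" ]; then
        echo $((nslots - 1))   # the job script itself is one of the tasks
    else
        echo "$nslots"         # typical MPI: the PE/mpirun starts every task
    fi
}
```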

2.6. 

Accounting Summary

  1. Significant only if control slaves is set (TRUE), i.e., if accounting data is available via sge_execd and sge_shepherd.
  2. If set (TRUE), only a single accounting record is written to the accounting file, containing the accounting summary of the whole job including all slave tasks;
  3. if unset (FALSE), an individual accounting record is written for every slave process and for the master process.

3. 

Host/Machine Files — Ensuring the Processes Run Where SGE Says!

When a multi-process job is submitted to SGE a pe_hostfile is created which should be used to tell the parallel (e.g., MPI) job on which hosts to run and how many processes to start on each. It is easy to determine the pe_hostfile path and contents by adding a couple of lines to a qsub script, for example:

  #!/bin/bash

  #$ -cwd
  #$ -S /bin/bash
  #$ -pe orte-32.pe 64

  echo "PE_HOSTFILE:"
  echo $PE_HOSTFILE
  echo
  echo "cat PE_HOSTFILE:"
  cat $PE_HOSTFILE

  /opt/gridware/mpi/gcc/openmpi/1_4_3-ib/bin/mpirun -np $NSLOTS ./mynameis
which yields
  PE_HOSTFILE:
  /opt/gridware/ge/default/spool/node104/active_jobs/12565.1/pe_hostfile

  cat PE_HOSTFILE:
  node104.danzek.itservices.manchester.ac.uk 32 R815.q@node104.danzek.itservices.manchester.ac.uk UNDEFINED
  node105.danzek.itservices.manchester.ac.uk 32 R815.q@node105.danzek.itservices.manchester.ac.uk UNDEFINED
  
 Hello! From task            5  on host node103   
 Hello! From task            9  on host node103
 .                           .  .
 .                           .  .
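Each pe_hostfile line has four fields: hostname, slot count, queue instance and a processor range (UNDEFINED unless processor binding is in use). A quick sanity check (a sketch; the helper name is invented) is that the slot counts sum to $NSLOTS:

```shell
# Sketch: sum the slot counts (second column) of a pe_hostfile; for the
# job above this should equal $NSLOTS, i.e., 32 + 32 = 64.
total_slots() {
    awk '{sum += $2} END {print sum}' "$1"
}
```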

3.1. 

Example: The SGE pe_hostfile and OpenMPI

SGE and OpenMPI talk happily to each other: OpenMPI built with SGE support reads the pe_hostfile automatically, so there is nothing to do.

3.2. 

Example: The SGE pe_hostfile and HP-MPI

HP-MPI expects a different host/machine file format. A suitable file is easily created from the start procedure in the PE. For example:

  qconf -sp hpmpi-12.pe

  pe_name            hpmpi-12.pe
  slots              999
  user_lists         NONE
  xuser_lists        NONE
  start_proc_args    /opt/gridware/ge-local/pe_hostfile2hpmpimachinefile.sh
  stop_proc_args     /bin/true
  allocation_rule    12
  control_slaves     TRUE
  job_is_first_task  FALSE
  urgency_slots      min
  accounting_summary FALSE
where /opt/gridware/ge-local/pe_hostfile2hpmpimachinefile.sh contains
  #!/bin/bash

  # Convert the SGE pe_hostfile (hostname slots queue processor-range,
  # one line per host) into the one-hostname-per-slot format expected
  # by HP-MPI.
  MACHINEFILE="machinefile.$JOB_ID"

  while read host num rest; do
      for i in `seq 1 $num`; do
          echo $host >> $MACHINEFILE
      done
  done < $PE_HOSTFILE
 
and sample output looks like
  cat machinefile.44375

  node032.danzek.itservices.manchester.ac.uk
  node032.danzek.itservices.manchester.ac.uk
  node032.danzek.itservices.manchester.ac.uk
  .                                        .
  .                                        .
  node038.danzek.itservices.manchester.ac.uk
  node038.danzek.itservices.manchester.ac.uk
  .                                        .
  .                                        .

4. 

Tight Integration vs Loose Integration

When running a multi-process (e.g., MPI) job under SGE, we want reliable control of resources, full process control and correct accounting. Within Unix/Linux this is possible only if the job processes are part of a process hierarchy with SGE at the top.

Parallel processes can be started under SGE in two ways: the sge_execd daemon can start the processes; or they can be started from the PE. See Control Slaves, below.

5. 

Loose Integration

With loose integration the PE starts the slave processes itself, e.g., via plain rsh or ssh: the slaves are not part of the SGE process hierarchy, so SGE has no process control over them and their resource usage is missing from the reported accounting figures.

6. 

Tight Integration via RSH/SSH Wrappers

As stated above, using plain rsh/ssh does not yield tight integration. A special invocation of qrsh can be used to achieve what we want, like this:

  1. The qmaster sends the master process to its destination execution daemon;
  2. it also reserves the job's slots on the destination execution daemons for the slave processes — reserved only, not started.
  3. The PE uses
            qrsh -inherit
    to start slave processes on target hosts — using qrsh this way bypasses the scheduler (contrast use of qrsh for interactive and/or GUI-based work); the job is sent directly to the target host execution daemon (sge_execd).
  4. As a security precaution, execution daemons deny such job submissions by default, but those reserved by qmaster as above are told to allow them.
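The wrappers themselves are small. A minimal sketch in the spirit of $SGE_ROOT/mpi/rsh (simplified: real wrappers also filter out rsh options such as -n before building the command):

```shell
#!/bin/bash
# Sketch of an rsh wrapper: the MPI startup machinery calls this in
# place of plain rsh, and the slave start is rerouted through
# "qrsh -inherit" to the execution daemon that reserved the slot.

build_slave_cmd() {
    local host=$1; shift       # first argument: target host
    echo qrsh -inherit "$host" "$@"
}

# A real wrapper would finish with: exec $(build_slave_cmd "$@")
```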

Example

See $SGE_ROOT/mpi/startmpi.sh and $SGE_ROOT/rsh.