Parallel Environments, Host/Machine Files and Loose & Tight Integration of MPI
1. Miscellaneous Notes
- The prolog, jobscript, epilog and associated PE scripts are simply executed one after the other as child processes (or sub-shells); one therefore cannot, e.g., export environment variables from the PE scripts to the job itself. A common workaround is sketched below.
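The workaround is to pass settings through a file rather than the environment. A minimal sketch of my own (the variable name MY_SETTING is hypothetical, and the use of the per-job $TMPDIR, visible to both the PE start procedure and the jobscript on the master host, is an assumption rather than from the source):

# in the PE start procedure (the start_proc_args script):
echo "export MY_SETTING=value" > $TMPDIR/pe_env.sh   # MY_SETTING is hypothetical

# at the top of the jobscript:
. $TMPDIR/pe_env.sh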
2. Parallel Environment Fields and Attributes
2.1. Start Proc Args and Stop Proc Args
- If no such procedures are required for a certain parallel environment, you can leave the fields empty.
- The first argument is usually the name of the start or stop procedure itself; the remaining parameters are command-line arguments to these procedures.
- A variety of special identifiers, which begin with a $ prefix, are available to pass internal runtime information to the procedures. The sge_pe man page contains a list and description of all available parameters. These are:
  $pe_hostfile $host $job_owner $job_id $job_name $pe $pe_slots $processors $queue
  A sketch of their use follows.
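For illustration, a hedged sketch of wiring two of these identifiers into a PE's start and stop procedures — the script names and paths here are hypothetical, not from the source:

start_proc_args   /opt/gridware/ge-local/pe_start.sh $pe_hostfile $job_id
stop_proc_args    /opt/gridware/ge-local/pe_stop.sh $job_id

where the hypothetical /opt/gridware/ge-local/pe_start.sh might begin

#!/bin/bash
# $1 is the pe_hostfile path and $2 the job id, as wired up
# in start_proc_args above
hostfile=$1
jobid=$2
echo "setting up PE for job $jobid using hostfile $hostfile"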
2.2. Allocation Rule
- A positive integer
- This specifies the number of processes which are run on each host. For example, for eight-core nodes, one could set this field to 8 so that only jobs of size 8, 16, 24, ... are accepted by the scheduler for this PE and all cores on each host are always used. This will:
  - minimise the number of hosts on which a job runs;
  - eliminate competition for on-host resources between this job and others.
- $pe_slots
- Use the special denominator $pe_slots to cause all the processes of a job to be allocated on a single host (SMP).
- $fill_up
- Fill each host in turn up to the number of slots available on it before moving on to the next host.
- $round_robin
- Allocate slots one at a time across the available hosts in turn, i.e., spread the job over as many hosts as possible. An illustration of these rules follows.
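For illustration, here is how a 16-slot job might be distributed under each rule — the host names and free slot counts are hypothetical, not from the source:

allocation_rule 8 (hosts with eight free slots each):
  hostA 8
  hostB 8

allocation_rule $fill_up (hosts with, say, 8, 5 and 3 free slots):
  hostA 8
  hostB 5
  hostC 3

allocation_rule $round_robin (one slot per host in turn):
  hostA 4
  hostB 4
  hostC 4
  hostD 4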
2.3. Urgency Slots
For jobs submitted with a slot range, this determines how a particular number of slots is chosen from that range for use in the resource-request-based priority contribution for numeric resources. In simple terms: a job's priority depends partly on the resources requested, e.g., the number of slots; the Urgency Slots value determines which number of slots from a PE slot range is used in the priority calculation. One can specify:
- min or max
- Take the minimum or maximum value from the specified range.
- avg
- Take the average of the specified range.
- An integer value
- Use that fixed number of slots, regardless of the range.
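As a worked example (a hypothetical submission, not from the source): for a job submitted with

qsub -pe example.pe 4-16 myjob.sh

urgency_slots min uses 4 slots in the priority calculation, max uses 16 and avg uses (4+16)/2 = 10.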
2.4. Control Slaves
Specifies whether SGE generates (parallel) processes — child processes under sge_execd and sge_shepherd — or whether the PE creates its own processes:
- If control slaves is set, SGE generates the (parallel) processes.
- Full control over slave tasks by SGE is strongly preferable, e.g., to ensure reported accounting figures are correct and for (parallel) process control — see loose vs tight integration, below.
2.5. Job is first task
- Significant only if control slaves is set (TRUE).
- If Job is first task is set, the job script, or one of its child processes, acts as one of the parallel tasks of the parallel application. If Job is first task is unset, the job script initiates the parallel application but does not participate.
- For PVM, this field should usually be set (TRUE).
- For MPI, this field should not usually be set (FALSE).
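One practical consequence (a sketch of my own, based on the semantics above): the setting determines how many slave tasks the PE may start via qrsh -inherit, since with Job is first task set the jobscript itself counts as one of the tasks.

# in a PE start procedure or jobscript (sketch)
NTASKS=$NSLOTS             # job_is_first_task FALSE: all slots are slave tasks
# NTASKS=$(($NSLOTS - 1))  # job_is_first_task TRUE: the jobscript is one task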
2.6. Accounting Summary
- Significant only if control slaves is set (TRUE), i.e., if accounting data is available via sge_execd and sge_shepherd.
- If set, only a single accounting record is written to the accounting file and this contains the accounting summary of the whole job including all slave tasks;
- if unset (FALSE), an individual accounting record is written for every slave process and for the master process.
3. Host/Machine Files — Ensuring the Processes Run Where SGE Says!
When a multi-process job is submitted to SGE, a pe_hostfile is created which should be used to tell the parallel (e.g., MPI) job on which hosts to run and how many processes to start on each. It is easy to determine the pe_hostfile path and contents by adding a couple of lines to a qsub script, for example:
#!/bin/bash
#$ -cwd
#$ -S /bin/bash
#$ -pe orte-32.pe 64

echo "PE_HOSTFILE:"
echo $PE_HOSTFILE
echo
echo "cat PE_HOSTFILE:"
cat $PE_HOSTFILE

/opt/gridware/mpi/gcc/openmpi/1_4_3-ib/bin/mpirun -np $NSLOTS ./mynameis

which yields
PE_HOSTFILE:
/opt/gridware/ge/default/spool/node104/active_jobs/12565.1/pe_hostfile

cat PE_HOSTFILE:
node104.danzek.itservices.manchester.ac.uk 32 R815.q@node104.danzek.itservices.manchester.ac.uk UNDEFINED
node105.danzek.itservices.manchester.ac.uk 32 R815.q@node105.danzek.itservices.manchester.ac.uk UNDEFINED

Hello! From task 5 on host node103
Hello! From task 9 on host node103
. . .
. . .

Each line of the pe_hostfile gives a hostname, the number of slots allocated on that host, the queue instance and a processor range (UNDEFINED here).
3.1. Example: The SGE pe_hostfile and OpenMPI
SGE and OpenMPI talk happily to each other — OpenMPI built with Grid Engine support detects the SGE environment and reads the pe_hostfile automatically. There is nothing to do.
3.2. Example: The SGE pe_hostfile and HP-MPI
HP-MPI expects a different host/machine file format. A suitable file is easily created from the start procedure in the PE. For example:
qconf -sp hpmpi-12.pe

pe_name            hpmpi-12.pe
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridware/ge-local/pe_hostfile2hpmpimachinefile.sh
stop_proc_args     /bin/true
allocation_rule    12
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

where /opt/gridware/ge-local/pe_hostfile2hpmpimachinefile.sh is
#!/bin/bash
MACHINEFILE="machinefile.$JOB_ID"
for host in `cat $PE_HOSTFILE | awk '{print $1}'`; do
    num=`grep $host $PE_HOSTFILE | awk '{print $2}'`
    ## for i in {1..$num}; do   # brace expansion cannot use a variable
    for i in `seq 1 $num`; do
        echo $host >> $MACHINEFILE
    done
done

and sample output looks like
cat machinefile.44375

node032.danzek.itservices.manchester.ac.uk
node032.danzek.itservices.manchester.ac.uk
node032.danzek.itservices.manchester.ac.uk
. . . .
node038.danzek.itservices.manchester.ac.uk
node038.danzek.itservices.manchester.ac.uk
. . . .
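The jobscript can then point HP-MPI at the generated file. A minimal sketch — the hostfile flag and the program name ./myprog are assumptions; check the mpirun documentation for the exact option in your HP-MPI version:

# in the jobscript, after SGE has run the PE start procedure above
mpirun -hostfile machinefile.$JOB_ID -np $NSLOTS ./myprog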
4. Tight Integration vs Loose Integration
When running a multi-process (e.g., MPI) job under SGE, we want reliable control of resources, full process control and correct accounting. Within Unix/Linux this is possible only if the job processes are part of a process hierarchy with SGE at the top.
Parallel processes can be started under SGE in two ways: the sge_execd daemon can start the processes, or they can be started from the PE. See Control Slaves, above.
- If the processes are started by sge_execd and, in turn, sge_shepherd, tight-integration is achieved. The process hierarchy looks something like

   1234 ?  Sl  12:34 /opt/ge/bin/lx26-amd64/sge_execd
  12345 ?  S    0:00  \_ sge_shepherd-54321 -bg
  12354 ?  S    0:00      \_ mpirun prog...
- SGE can start, for example, OpenMPI and HP-MPI parallel processes under sge_execd and sge_shepherd without problem; tight-integration is easily achieved.
- However, there is a great variety of parallel jobs, e.g., client-server models, which SGE does not understand, but suitable hand-crafted PEs can be used to start the required parallel processes via rsh/ssh — this approach offers great flexibility.
- If the processes are started from the PE, e.g., using rsh/ssh, tight-integration is not achieved: processes started on remote hosts (compute nodes) by rsh/ssh are outside of SGE's context; it is likely that in the event of a crash, unwanted processes will be left behind.
- To handle this problem, an rsh/ssh wrapper around qrsh -inherit can be used — see Tight Integration via RSH/SSH Wrappers, below.
5. Loose Integration

With loose integration, slave processes are started by the PE via plain rsh/ssh and so run outside SGE's control: process control and accounting for the slave tasks are lost (see above).
6. Tight Integration via RSH/SSH Wrappers
As stated above, using plain rsh/ssh does not yield tight-integration. A special invocation of qrsh can be used to achieve what we want, like this:
- The qmaster sends the master process to its destination execution daemon;
- it also reserves the job's slots on the destination execution daemons for the slave processes — reserves only, not starts.
- The PE uses

  qrsh -inherit

  to start slave processes on target hosts — using qrsh this way bypasses the scheduler (contrast the use of qrsh for interactive and/or GUI-based work); the job is sent directly to the target host's execution daemon (sge_execd).
- As a security precaution, execution daemons deny such job submissions by default, but those reserved by the qmaster as above are told to allow them.
- Because the slave tasks are run through Grid Engine, the qmaster is able to track resource usage and control processes — tight-integration has been achieved.
- A common trick to make the implementation of the integration easier is to provide an rsh/ssh wrapper that translates rsh/ssh calls into qrsh calls — a minimal sketch follows.
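The following sketch is my own illustration of the idea, not the distributed wrapper (see the Example below for the real thing); the option handling is an assumption about what the MPI launcher passes:

#!/bin/sh
# rsh wrapper (sketch): the PE start procedure places this ahead of the
# real rsh in the PATH, so the MPI launcher's "rsh <host> <cmd>" becomes
# "qrsh -inherit <host> <cmd>" — the slave task is then started by the
# target host's sge_execd and remains under SGE's control.

# discard a leading -n option if the launcher passes one (assumption)
if [ "$1" = "-n" ]; then shift; fi

host=$1
shift
exec qrsh -inherit "$host" "$@"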
Example
See $SGE_ROOT/mpi/startmpi.sh and $SGE_ROOT/mpi/rsh.