Handling Abaqus: Starter Methods, CCRAs and JSVs

1. Notes

2. The Issues

PE Integration
We need SGE and Abaqus to talk to each other to ensure that the application starts the SGE-determined number of processes and starts them on the right compute nodes.
Stray Processes
We also want to ensure processes are tidied up correctly at the end — Abaqus is known to leave stray MPI processes behind under some circumstances under SGE. We don't want this!
Licensing
Abaqus uses FlexLM with floating network licences provided by the University licence servers:
  1. we don't want jobs to be scheduled, then fail to run because of a lack of licences;
  2. we don't want Danzek to hog all of the University's licences.

3. Our Approach

Background:

 ?? ABAQUS jobs use 3+n main abaqus tokens where n is the number of processors, 
    plus either standard or parallel tokens according to the type of analysis ??


Approach/Steps:

 -- assuming we need a dedicated PE and queue, use a JSV to ensure all Abaqus jobs
    request the right PE (the PE then determines the queue automatically)

 -- check the above "3+n" formula and copy our Fluent approach to checking
    licences are available and re-scheduling if required

 -- later consider "-l abaqus=16" or similar --- see Fluent notes
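
The JSV step might look roughly like the sketch below. It is only an outline under stated assumptions: the PE name is borrowed from the old-version notes further down, the helper is_abaqus_job is hypothetical, and spotting Abaqus jobs by the name of the job script is a guess at a workable rule, not something this page has settled.

```shell
#!/bin/bash
# JSV sketch: push Abaqus jobs into the dedicated PE.
# ASSUMPTIONS: the PE name and the name-based detection rule are illustrative.

ABAQUS_PE="abaqus-mpi.pe"

# Hypothetical helper: does this job script name look like an Abaqus job?
is_abaqus_job() {
    case "$(basename "$1")" in
        *abaqus*|*abq*) return 0 ;;
        *)              return 1 ;;
    esac
}

jsv_on_start() {
    return
}

jsv_on_verify() {
    # Note: a job submitted without any -pe request would also need
    # pe_min and pe_max set; that case is left out of this sketch.
    if is_abaqus_job "$(jsv_get_param CMDNAME)" &&
       [ "$(jsv_get_param pe_name)" != "$ABAQUS_PE" ]; then
        jsv_set_param pe_name "$ABAQUS_PE"
        jsv_correct "Abaqus job moved to $ABAQUS_PE"
    else
        jsv_accept "Job accepted"
    fi
    return
}

# Enter the JSV protocol loop only where the SGE helper library is installed
if [ -r "${SGE_ROOT:-}/util/resources/jsv/jsv_include.sh" ]; then
    . "${SGE_ROOT}/util/resources/jsv/jsv_include.sh"
    jsv_main
fi
```

Keeping the detection rule in its own function means it can be tried out on a login node without a running qmaster.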

4. Implementation: Licensing
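
Nothing is implemented here yet. As a starting point, the licence check from "Our Approach" could be sketched as below. Assumptions, all unconfirmed: the "3+n" token formula is the unverified one quoted above, the FlexLM feature name "abaqus" is a guess, and the function names are made up for illustration. The parser expects the usual "Users of <feature>: (Total of X licenses issued; Total of Y licenses in use)" summary line from lmutil lmstat.

```shell
#!/bin/bash
# Licence pre-check sketch (formula and feature name are ASSUMPTIONS).

# Tokens needed for n processors, using the unverified 3+n formula
tokens_needed() {
    echo $(( 3 + $1 ))
}

# Read lmstat output on stdin; print the number of free licences for feature $1
tokens_free() {
    awk -v feat="$1" '
        # Users of abaqus:  (Total of 100 licenses issued;  Total of 37 licenses in use)
        $1 == "Users" && $3 == (feat ":") { print $6 - $11; found = 1 }
        END { if (!found) print 0 }'
}

# In a job script one might then write (not run here):
#   free=$(lmutil lmstat -f abaqus -c "$LM_LICENSE_FILE" | tokens_free abaqus)
#   [ "$free" -ge "$(tokens_needed "$NSLOTS")" ] || exit 99
```

Exit status 99 asks SGE to re-queue the job, so a job that finds too few licences goes back to the pending list instead of failing.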

5. Stray Processes

Setting

  execd_params        ENABLE_ADDGRP_KILL=true
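(e.g., by using qconf -mconf) handles this for us: when a job ends, SGE uses the job's additional group ID to find and kill any processes left behind. (For details on this, see Scheduler Configuration.)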
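
To confirm the setting is in place, the global configuration can be searched for it. A minimal sketch follows; the function name addgrp_kill_enabled is hypothetical, and it only reads text on stdin so it can be tried without a cluster.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the configuration text on stdin contains an
# execd_params line that switches ENABLE_ADDGRP_KILL on.
addgrp_kill_enabled() {
    grep -Eiq '^[[:space:]]*execd_params[[:space:]].*ENABLE_ADDGRP_KILL=(true|1)'
}

# On the cluster (not run here):
#   qconf -sconf | addgrp_kill_enabled && echo "enabled"
```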

6. Implementation: PE Integration (Old Version)

  1. The material in this section about Abaqus environment files appears to be unnecessary or out of date.
  2. It is kept here for completeness only.
  3. A standard HP-MPI machinefile seems to be all that is required; see the next section.

What Abaqus requires — number of processes to start and where to start them
Abaqus is started with something like
  abaqus mp_mode=MPI cpus=$cpus job=my-abaqus-job
with the Abaqus environment file, e.g., abaqus_v6.env, containing a line equivalent to the standard MPI machinefile, viz.,
  mp_host_list=[[R1-07, 12], [R1-06, 12]]
Required environment — PE and PE Startup Script. . .
Here we build the required Abaqus environment file line from the SGE machine file:
  pe_name            abaqus-mpi.pe
  .                  .
  start_proc_args    /users/simonh/CLUSTER/sge-scripts/pe_hostfile2abaqusenv.sh
  stop_proc_args     /bin/true
  .
with /users/simonh/CLUSTER/sge-scripts/pe_hostfile2abaqusenv.sh:
  #!/bin/bash
  # PE start script: turn the SGE $PE_HOSTFILE into the values Abaqus needs.

  # Keep a copy of the SGE hostfile and an MPI-style machinefile for reference
  cat "$PE_HOSTFILE"                         > "pe_hostfile.$JOB_ID"
  awk '{print $1" slots="$2}' "$PE_HOSTFILE" > "machinefile.$JOB_ID"

  # Total slot count across all hosts
  CPUS=$(awk '{SUM += $2} END {print SUM}' "$PE_HOSTFILE")

  # Build "[[host1, n1], [host2, n2]]" in a single pass; a per-host grep
  # can match more than one line when host names share a prefix
  MP_HOST_LIST=$(awk '{printf "%s[%s, %s]", (NR > 1 ? ", " : ""), $1, $2}' "$PE_HOSTFILE")
  MP_HOST_LIST="[${MP_HOST_LIST}]"

  echo "$CPUS"         > "cpus.$JOB_ID"
  echo "$MP_HOST_LIST" > "mp_host_list.$JOB_ID"

  # Variables file, sourced later on behalf of the job
  ENV_FILE="abaqus_job_env_vars.sh.$JOB_ID"
  {
      echo "#"
      echo "export cpus=$CPUS"
      echo "export mp_host_list='$MP_HOST_LIST'"
      echo "#"
  } > "$ENV_FILE"
Simply pass environment variables from the PE startup script to Abaqus?
Having created our Abaqus job environment variables file above, we would like to pass its name to our job via an environment variable. It turns out that this is non-trivial:

It is not possible to export environment variables from the PE, as stated by Reuti: Yes, it's not possible. The prolog, jobscript, epilog and associated PE scripts are just executed one after the other as child-processes (or sub-shell). None knows anything what the other defined during their lifetime. It's like executing something in the command line, and anything defined in the started script will never show up in the superior shell at your prompt; it was executed as a child-process. This is different when you "source" something at the command prompt, which is like an "include", i.e. as it would have been typed on the command line.

Use a starter method instead — see below!
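
The difference Reuti describes is easy to reproduce at a prompt. The snippet below is a small illustration only; the variable names after_child and after_source and the temporary file are made up for the demonstration.

```shell
#!/bin/bash
unset cpus

# A child process cannot change its parent's environment...
bash -c 'export cpus=24'
after_child="${cpus:-unset}"
echo "after child:  cpus=$after_child"

# ...but a sourced file runs in the current shell, so its exports persist.
envfile=$(mktemp)
echo "export cpus=24" > "$envfile"
source "$envfile"
after_source="${cpus:-unset}"
echo "after source: cpus=$after_source"
rm -f "$envfile"
```

This is why the variables file has to be sourced by something that then becomes the job, rather than by the PE start script itself.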
. . .the queue. . .
  qname                 R410-abaqus.q
  hostlist              @R410
  seq_no                13
  .                     .
  pe_list               abaqus-mpi.pe
  .                     .
  starter_method        /users/simonh/CLUSTER/sge-scripts/qsm_hostfile2abaqusenv.sh
  .                     .
. . .associated starter method. . .
  • using this means users don't have to add "source ..." to their qsub script
/users/simonh/CLUSTER/sge-scripts/qsm_hostfile2abaqusenv.sh:
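  #!/bin/bash

  # Load cpus and mp_host_list as written by the PE start script, if present
  ENV_FILE="./abaqus_job_env_vars.sh.$JOB_ID"
  [ -r "$ENV_FILE" ] && source "$ENV_FILE"

  # Replace this shell with the job itself so it inherits the variables
  exec "$@"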
Example Qsub Script
    #!/bin/bash

    #$ -S /bin/bash
    #$ -cwd
    #$ -pe abaqus-mpi.pe 24
   
    # ...source abaqus_job_env_vars.sh.$JOB_ID...
    #     -- not required, as starter_method script handles this for us

    /bin/date
    /bin/hostname

    echo ""
    echo "Abaqus environment:"
    echo ""
    echo "mp_host_list="\"$mp_host_list\"
    echo "cpus        ="\"$cpus\"
    echo ""
    echo ":tnemnorivne suqabA"
    echo ""
    echo "abaqus mp_mode=MPI cpus=$cpus job=whatever"
    echo ""

7. Implementation: PE Integration (New Version)

Qsub Scripts
Serial:
     #!/bin/bash
     #$ -cwd
     #$ -V
     #$ -S /bin/bash

     abq6101 job=myabqjob input=myabqjob.inp cpus=$NSLOTS scratch=$HOME/scratch interactive
                      #
                      # N.B. "interactive", without which jobs silently fail.
                      #
Parallel:
     #!/bin/bash
     ### Use the current directory as the working directory
     #$ -cwd
     ### Inherit the user environment from the login node 
     #$ -V
     ### Request 4 cores in smp.pe
     #$ -pe smp.pe 4
     #$ -S /bin/bash

     abq6101 job=myabqjob input=myabqjob.inp cpus=$NSLOTS scratch=$HOME/scratch interactive
PEs
SMP:
  qconf -sp hp-mpi-smp.pe

  pe_name            hp-mpi-smp.pe
  slots              999
  user_lists         NONE
  xuser_lists        NONE
  start_proc_args    /opt/gridware/ge-local/pe_hostfile2hpmpimachinefile.sh
  stop_proc_args     /bin/true
  allocation_rule    $pe_slots
  control_slaves     TRUE
  job_is_first_task  FALSE
  urgency_slots      min
  accounting_summary FALSE
and
  qconf -sp hp-mpi-12.pe

  pe_name            hp-mpi-12.pe
  slots              999
  user_lists         NONE
  xuser_lists        NONE
  start_proc_args    /opt/gridware/ge-local/pe_hostfile2hpmpimachinefile.sh
  stop_proc_args     /bin/true
  allocation_rule    12
  control_slaves     TRUE
  job_is_first_task  FALSE
  urgency_slots      min
  accounting_summary FALSE
Script
where /opt/gridware/ge-local/pe_hostfile2hpmpimachinefile.sh is:
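  #!/bin/bash
  # PE start script: write an HP-MPI machinefile with one line per slot.

  MACHINEFILE="machinefile.$JOB_ID"

  ## PE_HOSTFILE=pe_hostfile.example

  # Each $PE_HOSTFILE line starts "host slots ..."; print the host once per slot.
  # A single awk pass avoids "grep $host" matching several hosts whose names
  # share a prefix, and ">" stops a re-run appending to an old machinefile.
  awk '{for (i = 0; i < $2; i++) print $1}' "$PE_HOSTFILE" > "$MACHINEFILE"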
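
The conversion can be sanity-checked outside SGE with a made-up hostfile. The sketch below is illustrative only: the function name hostfile_to_machinefile, the host names and the queue names are all invented, and it uses a single awk pass instead of the per-host loop so that host names sharing a prefix stay separate.

```shell
#!/bin/bash
# Expand "host slots ..." lines (the $PE_HOSTFILE layout) into one host name
# per slot. The function name is illustrative.
hostfile_to_machinefile() {
    awk '{for (i = 0; i < $2; i++) print $1}' "$1"
}

# Made-up sample: two hosts whose names share a prefix
sample=$(mktemp)
printf '%s\n' 'node1 2 all.q@node1 UNDEFINED' \
              'node10 3 all.q@node10 UNDEFINED' > "$sample"

machinefile=$(hostfile_to_machinefile "$sample")
echo "$machinefile"
rm -f "$sample"
```

For this sample the result should list node1 twice and node10 three times, five lines in all, matching the five slots granted.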