Handling Abaqus: Starter Methods, CCRAs and JSVs
1. |
Notes |
- Consumable complex attributes: -l abaqus
    http://wikis.sun.com/display/gridengine62u5/Defining+Consumable+Resources
    http://wikis.sun.com/display/gridengine62u5/Example+1+-+Floating+Software+License+Management
- job submission verifier — not trying to get around consuming some licences are we?
2. |
The Issues |
- PE Integration
- We need SGE and Abaqus to talk to each other to ensure that the application starts the SGE-determined number of processes and starts them on the right compute nodes.
- Stray Processes
- We also want to ensure processes are tidied up correctly at the end — Abaqus is known to leave stray MPI processes behind under some circumstances under SGE. We don't want this!
- Licensing
-
Abaqus uses FlexLM with floating network licences provided by the Uni licence
servers:
- we don't want jobs to be scheduled, then fail to run because of a lack of licences;
- we don't want Danzek to hog all of the University's licences.
3. |
Our Approach |
Background (unconfirmed): Abaqus jobs use 3+n main abaqus tokens, where n is the number of processors, plus either standard or parallel tokens according to the type of analysis.
Approach/Steps:
- Assuming we need a dedicated PE and queue, use a JSV to ensure all Abaqus jobs request the right ones (well, the PE; the PE will then auto-choose the queue).
- Check the "3+n" formula above, then copy our Fluent approach: check that licences are available and re-schedule if required.
- Later, consider "-l abaqus=16" or similar (see the Fluent notes).
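A minimal sketch of what a licence availability check might look like. It assumes the unverified "3+n" rule from the background note, assumes the FlexLM feature is called "abaqus" (the real feature name may differ), and parses a single lmstat-style summary line supplied as sample text rather than querying a real licence server. The function names are hypothetical.

```shell
#!/bin/bash
# Tokens needed for an n-processor job, ASSUMING the unverified "3+n" rule.
abaqus_tokens() {
    case "$1" in
        ''|0*|*[!0-9]*)
            echo "abaqus_tokens: need a positive integer" >&2
            return 1 ;;
    esac
    echo $(( 3 + $1 ))
}

# Free licences for a FlexLM feature, parsed from lmstat-style text on stdin.
# Expects a line such as:
#   Users of abaqus:  (Total of 100 licenses issued;  Total of 40 licenses in use)
# Exits non-zero if the feature is not found.
free_tokens() {
    awk -v feature="$1" '
        $1 == "Users" && $2 == "of" && $3 == (feature ":") {
            print $6 - $11      # issued minus in use
            found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Demo on sample text (illustrative numbers, not real licence data).
sample='Users of abaqus:  (Total of 100 licenses issued;  Total of 40 licenses in use)'
needed=$(abaqus_tokens 12)
free=$(printf '%s\n' "$sample" | free_tokens abaqus)
if [ "$free" -ge "$needed" ]; then
    echo "enough licences: need $needed, $free free"
else
    echo "too few licences: need $needed, $free free"
fi
```

On a real system the sample line would be replaced by the output of the site's lmstat command, and the "too few" branch would trigger whatever re-scheduling mechanism is chosen.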
4. |
Implementation: Licensing |
5. |
Stray Processes |
Setting execd_params ENABLE_ADDGRP_KILL=true (e.g., by using qconf -mconf) handles this for us. (For details on this, see Scheduler Configuration.)
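A quick way to confirm the setting is to scan the configuration text for it. The sketch below assumes the text comes from qconf -sconf and that the parameter appears on the execd_params line itself (a wrapped, backslash-continued line would be missed). The function name is hypothetical, and the demo runs on sample text rather than a live cluster.

```shell
#!/bin/bash
# Succeeds if the execd_params line on stdin contains ENABLE_ADDGRP_KILL=true.
addgrp_kill_enabled() {
    awk '
        $1 == "execd_params" && tolower($0) ~ /enable_addgrp_kill=true/ { ok = 1 }
        END { exit (ok ? 0 : 1) }
    '
}

# On a live cluster one would run:  qconf -sconf | addgrp_kill_enabled
# Demo on sample configuration text:
sample_conf='execd_params                 ENABLE_ADDGRP_KILL=true'
if printf '%s\n' "$sample_conf" | addgrp_kill_enabled; then
    echo "ENABLE_ADDGRP_KILL is set"
else
    echo "ENABLE_ADDGRP_KILL is not set"
fi
```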
6. |
Implementation: PE Integration — Old Version |
- The material in this section about Abaqus environment files appears to be unnecessary or out of date.
- We keep it here for completeness.
- It seems a standard HP-MPI machinefile is all that is required — see the next section!
- What Abaqus requires — number of processes to start and where to start them
-
Abaqus is started something like
    abaqus mp_mode=MPI cpus=$cpus job=my-abaqus-job
with the Abaqus environment file, e.g., abaqus_v6.env, containing a line equivalent to the standard MPI machinefile, viz.,
    mp_host_list=[[R1-07, 12], [R1-06, 12]]
- Required environment — PE and PE Startup Script. . .
-
Here we build the required Abaqus environment file line from the SGE machine file:
    pe_name            abaqus-mpi.pe
    .
    .
    start_proc_args    /users/simonh/CLUSTER/sge-scripts/pe_hostfile2abaqusenv.sh
    stop_proc_args     /bin/true
    .
with /users/simonh/CLUSTER/sge-scripts/pe_hostfile2abaqusenv.sh:
    #!/bin/bash

    cat $PE_HOSTFILE > pe_hostfile.$JOB_ID
    cat $PE_HOSTFILE | awk '{print $1" slots="$2}' > machinefile.$JOB_ID

    CPUS=$(cat machinefile.$JOB_ID | awk '{print $2}' | sed s/slots=// | awk '{SUM += $1} END {print SUM}')

    MP_HOST_LIST="["
    for HOST in `cat machinefile.$JOB_ID | awk '{print $1}'`; do
        SLOTS=$(grep $HOST machinefile.$JOB_ID | sed s/.*=//);
        MP_HOST_LIST="${MP_HOST_LIST}[$HOST, $SLOTS], "
    done
    MP_HOST_LIST=$(echo $MP_HOST_LIST | sed -e "s/,$/]/")

    echo $CPUS > cpus.$JOB_ID
    echo $MP_HOST_LIST > mp_host_list.$JOB_ID

    ENV_FILE=abaqus_job_env_vars.sh
    echo "#" >> $ENV_FILE.$JOB_ID
    echo "export cpus=$CPUS" >> $ENV_FILE.$JOB_ID
    echo "export mp_host_list='$MP_HOST_LIST'" >> $ENV_FILE.$JOB_ID
    echo "#" >> $ENV_FILE.$JOB_ID
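The same mp_host_list string and CPU count can also be produced in a single awk pass each, without the intermediate machinefile. This is only a sketch of an alternative, not the script in use: the function names are hypothetical, and the demo hostfile is made up in the usual SGE layout (hostname, slots, queue instance, processor range).

```shell
#!/bin/bash
# Build an Abaqus-style mp_host_list string from an SGE PE hostfile.
# Only the first two columns (hostname, slot count) of each line are used.
build_mp_host_list() {
    awk '
        { entries = entries sep "[" $1 ", " $2 "]"; sep = ", " }
        END { print "[" entries "]" }
    ' "$1"
}

# Sum the slot counts to get the value for cpus=.
count_cpus() {
    awk '{ total += $2 } END { print total + 0 }' "$1"
}

# Demo with a hypothetical two-host hostfile.
demo=$(mktemp)
printf '%s\n' 'R1-07 12 R410-abaqus.q@R1-07 UNDEFINED' \
              'R1-06 12 R410-abaqus.q@R1-06 UNDEFINED' > "$demo"
build_mp_host_list "$demo"
count_cpus "$demo"
rm -f "$demo"
```

In a PE start script the argument would be "$PE_HOSTFILE" rather than a temporary file.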
- Simply pass environment variables from the PE startup script to Abaqus?
-
Having created our Abaqus job environment variables file above, we would like
to pass its name to our job via an environment variable. It turns out that
this is non-trivial:
It is not possible to export environment variables from the PE, as stated by Reuti:
    "Yes, it's not possible. The prolog, jobscript, epilog and associated PE scripts are just executed one after the other as child-processes (or sub-shell). None knows anything what the other defined during their lifetime. It's like executing something in the command line, and anything defined in the started script will never show up in the superior shell at your prompt; it was executed as a child-process. This is different when you "source" something at the command prompt, which is like an "include", i.e. as it would have been typed on the command line."
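The difference between running a script as a child process and sourcing it can be seen in a few lines of shell. This is a standalone illustration, not part of the SGE setup; the variable name DEMO_CPUS is made up for the demo.

```shell
#!/bin/bash
# Start from a clean slate so the demo is not affected by the caller's environment.
unset DEMO_CPUS

# A tiny script that exports a variable.
tmp=$(mktemp)
echo 'export DEMO_CPUS=24' > "$tmp"

# Run it as a child process: the export disappears when the child exits.
sh "$tmp"
child_result=${DEMO_CPUS:-unset}

# Source it: the export happens in the current shell and persists.
. "$tmp"
sourced_result=${DEMO_CPUS:-unset}

rm -f "$tmp"
echo "after child run: $child_result"
echo "after sourcing:  $sourced_result"
```

The first line reports the variable as unset and the second reports its value, which is why a script that sources the file and then execs the job can deliver the variables where the PE start script cannot.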
Use a starter method instead (see below).
- . . .the queue. . .
-
    qname            R410-abaqus.q
    hostlist         @R410
    seq_no           13
    .
    .
    pe_list          abaqus-mpi.pe
    .
    .
    starter_method   /users/simonh/CLUSTER/sge-scripts/qsm_hostfile2abaqusenv.sh
    .
    .
- . . .associated starter method. . .
-
- using this means users don't have to add "source ..." to their qsub script
    #!/bin/bash
    source abaqus_job_env_vars.sh.$JOB_ID
    exec "$@"
- Example Qsub Script
-
    #!/bin/bash
    #$ -S /bin/bash
    #$ -cwd
    #$ -pe abaqus-mpi.pe 24

    # ...source abaqus_job_env_vars.sh.$JOB_ID...
    # -- not required, as starter_method script handles this for us

    /bin/date
    /bin/hostname
    echo ""
    echo "Abaqus environment:"
    echo ""
    echo "mp_host_list="\"$mp_host_list\"
    echo "cpus ="\"$cpus\"
    echo ""
    echo ":tnemnorivne suqabA"
    echo ""
    echo "abaqus mp_mode=MPI cpus=$cpus job=whatever"
    echo ""
7. |
Implementation: PE Integration — New Version |
- Qsub Scripts
-
Serial:
    #!/bin/bash
    #$ -cwd
    #$ -V
    #$ -S /bin/bash

    abq6101 job=myabqjob input=myabqjob.inp cpus=$NSLOTS scratch=$HOME/scratch interactive
    #
    # N.B. "interactive", without which jobs silently fail.
    #
Parallel:
    #!/bin/bash
    ### Use the current directory as the working directory
    #$ -cwd
    ### Inherit the user environment from the login node
    #$ -V
    ### Request 4 cores in smp.pe
    #$ -pe smp.pe 4
    #$ -S /bin/bash

    abq6101 job=myabqjob input=myabqjob.inp cpus=$NSLOTS scratch=$HOME/scratch interactive
- PEs
-
SMP:
    qconf -sp hp-mpi-smp.pe
    pe_name            hp-mpi-smp.pe
    slots              999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /opt/gridware/ge-local/pe_hostfile2hpmpimachinefile.sh
    stop_proc_args     /bin/true
    allocation_rule    $pe_slots
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE
and
    qconf -sp hp-mpi-12.pe
    pe_name            hp-mpi-12.pe
    slots              999
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /opt/gridware/ge-local/pe_hostfile2hpmpimachinefile.sh
    stop_proc_args     /bin/true
    allocation_rule    12
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE
- Script
-
where /opt/gridware/ge-local/pe_hostfile2hpmpimachinefile.sh is:
    #!/bin/bash

    MACHINEFILE="machinefile.$JOB_ID"
    ## PE_HOSTFILE=pe_hostfile.example

    for host in `cat $PE_HOSTFILE | awk '{print $1}'`; do
        num=`grep $host $PE_HOSTFILE | awk '{print $2}'`
        ## for i in {1..$num}; do
        for i in `seq 1 $num`; do
            echo $host >> $MACHINEFILE
        done
    done
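The expansion from "host slots" lines to one hostname per slot can also be written as a single awk pass. The sketch below is an alternative under assumptions, not the installed script: the function name and the demo host names are hypothetical. One possible advantage is that it avoids the grep lookup, which could match more than one line if one hostname is a substring of another (e.g. node1 and node10).

```shell
#!/bin/bash
# Print each hostname once per slot, reading "host slots ..." lines from a PE hostfile.
expand_hostfile() {
    awk '{ for (i = 0; i < $2; i++) print $1 }' "$1"
}

# Demo with a hypothetical hostfile in which one name is a prefix of another.
demo=$(mktemp)
printf '%s\n' 'node1 2 all.q@node1 UNDEFINED' \
              'node10 1 all.q@node10 UNDEFINED' > "$demo"
expand_hostfile "$demo"
rm -f "$demo"
```

In a PE start script this would be called as expand_hostfile "$PE_HOSTFILE" > "machinefile.$JOB_ID", which also overwrites rather than appends to any machinefile left over from an earlier run.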