StarCCM
| 1. | See Also | 
The HP-MPI tight-integration notes at wiki.gridengine.info.
| 2. | The Issues | 
- PE Integration
- We need SGE and StarCCM+ to talk to each other to ensure that the application starts the SGE-determined number of processes and starts them on the right compute nodes. We also want to ensure that processes are tidied up correctly at the end. StarCCM+ comes with its own implementation of MPI, HP's MPI.
- Licensing
- We have an unlimited number of licences for StarCCM+ (for MACE users only), so the complications that arose for Fluent do not exist for StarCCM+.
| 3. | A Further Problem | 
- All ok in R410.q
- With the PE set up as described below, StarCCM+ seemed to work perfectly with, for example, 24-process jobs on two 12-core nodes in the R410.q queue. This queue comprises 11 Dell R410 nodes which were part of RQ2 and were installed and configured by the Research Infrastructure team, rather than by Alces.
- But not in C6100-STD.q — what is the difference?
- Running the same job in a queue comprising nodes installed/configured by Alces, the jobs sometimes failed with:
    node064.danzek.itservices.manchester.ac.uk 12 C6100-STD.q@node064.danzek.itservices.manchester.ac.uk UNDEFINED
    node047.danzek.itservices.manchester.ac.uk 12 C6100-STD.q@node047.danzek.itservices.manchester.ac.uk UNDEFINED
    Starting local server: /opt/gridware/apps/binapps/starccm/5.04.006_02/starccm+5.04.006/star/bin/starccm+ -rsh ssh -np 24 -machinefile machinefile.42844 -server 100_layer.sim
    ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
    Host key verification failed.
    mpirun: Warning one more more remote shell commands exited with non-zero status, which may indicate a remote access problem.
    error: Server process ended unexpectedly (return code 255)
    mpirun: Warning one more more remote shell commands exited with non-zero status, which may indicate a remote access problem.
- Hypothesis and Proof
- Hypothesis: jobs run ok when the nodes they use have corresponding entries in ~/.ssh/known_hosts, and fail when they do not. Proof: after using pdsh -g nodes uptime to add all nodes to ~/.ssh/known_hosts, all jobs run ok.
- So what is going on?
- On the R410 nodes, /etc/ssh/ssh_known_hosts includes both hostnames and IP addresses; on the nodes installed/configured by Alces, it includes only hostnames. And it is the IPs that get added to one's personal known_hosts.
- Solution Attempt One
- Can we add -q to the -rsh ssh command, or use
    Host node*
        LogLevel QUIET
  in ~/.ssh/config? Did not seem to help: starccm+ is a script which includes other include-scripts, etc. Complicated. Running starccm with -verbose we see
    /opt/gridware/apps/binapps/starccm/5.04.006_02/starccm+5.04.006/mpi/hp2/linux-x86_64-2.2.5/2.03.01.00/bin/mpirun -f /tmp/mpi-simonh10859/machinefile.10859 -e MPI_ROOT=/opt/gridware/apps/binapps/starccm/5.04.006_02/starccm+5.04.006/mpi/hp2/linux-x86_64-2.2.5/2.03.01.00 -e MPI_NOBACKTRACE=1 -e MPI_FLAGS=%MPI_FLAGS -e MPI_TMPDIR=/tmp/mpi-simonh10859 -e MPI_REMSH="ssh"
  and hacking to get MPI_REMSH="ssh -q" gives errors, as one might expect.
- Solution Attempt Two
- The scripts seem to set StrictHostKeyChecking=yes. Trying StrictHostKeyChecking=no in ~/.ssh/config seems to override this and all is well.
- Solution
- Rather than asking all users to change their SSH config, we got Alces to add IPs to /etc/ssh/ssh_known_hosts on all their nodes.
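The known_hosts diagnosis above can be checked mechanically. The following is a minimal sketch rather than anything taken from the original setup: missing_known_hosts is a hypothetical helper that prints each name (hostname or IP) with no entry in a given known_hosts file. It assumes plain, unhashed entries, and the key and the 192.0.2.x addresses in the demo are placeholders, not real values for these nodes.

```shell
#!/bin/bash
# Print every name (hostname or IP) that has no entry in a known_hosts file.
# Assumption: entries are plain text, not hashed, so the first field of each
# line is a comma-separated list of names.
missing_known_hosts() {
    local known_hosts_file=$1 name
    shift
    for name in "$@"; do
        awk -v want="$name" '
            /^#/ { next }
            { n = split($1, names, ",")
              for (i = 1; i <= n; i++) if (names[i] == want) found = 1 }
            END { exit !found }
        ' "$known_hosts_file" || echo "$name"
    done
}

# Demo with a throwaway file: node064 is listed by name and IP, node047 by name only.
demo_file=$(mktemp)
cat > "$demo_file" <<'EOF'
node064.danzek.itservices.manchester.ac.uk,192.0.2.64 ssh-rsa AAAAplaceholder
node047.danzek.itservices.manchester.ac.uk ssh-rsa AAAAplaceholder
EOF
missing_known_hosts "$demo_file" \
    node064.danzek.itservices.manchester.ac.uk 192.0.2.64 \
    node047.danzek.itservices.manchester.ac.uk 192.0.2.47
rm -f "$demo_file"
```

In the demo only the placeholder IP for node047 is reported, which mirrors the situation on the Alces-installed nodes where hostnames were present but IPs were not. Pointed at /etc/ssh/ssh_known_hosts with the real node names and addresses, the same helper would list what still needs adding.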
| 4. | Our Approach | 
- We need to create a suitable machine file for HP MPI from that provided by SGE. We do this in the parallel environment. See below!
| 5. | Implementation: Licensing | 
The StarCCM+ environment module simply sets
  CDLMD_LICENSE_FILE="25050@lfarm1.eps.manchester.ac.uk:25050@lfarm2.eps.manchester.ac.uk:25050@lfarm3.eps.manchester.ac.uk"
and that is all that is required.
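When debugging a job it can help to see the licence servers one per line. This is a small sketch, assuming only that the variable holds a colon-separated list of port@host entries as shown above; list_licence_servers is a hypothetical helper name, not part of the module.

```shell
#!/bin/bash
# Split a colon-separated list of port@host licence servers, one per line.
list_licence_servers() {
    printf '%s\n' "$1" | tr ':' '\n'
}

# The value set by the environment module, copied from above.
servers="25050@lfarm1.eps.manchester.ac.uk:25050@lfarm2.eps.manchester.ac.uk:25050@lfarm3.eps.manchester.ac.uk"
list_licence_servers "$servers"
```

Inside a job script the argument would normally be "$CDLMD_LICENSE_FILE" itself, so empty output indicates that the module was not loaded.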
| 6. | Implementation: PE Integration | 
The PE, starccm-12.pe, is
  pe_name            starccm-12.pe
  slots              999
  user_lists         NONE
  xuser_lists        NONE
  start_proc_args    /opt/gridware/ge-local/pe_hostfile2starccmmachinefile.sh
  stop_proc_args     /bin/true
  allocation_rule    12
  control_slaves     FALSE
  job_is_first_task  FALSE
  urgency_slots      min
  accounting_summary FALSE
where /opt/gridware/ge-local/pe_hostfile2starccmmachinefile.sh
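  #!/bin/bash
  # Build an HP-MPI machinefile from the SGE PE hostfile: each hostfile line
  # starts with "host slots", and HP-MPI wants the host repeated once per slot.
  MACHINEFILE="machinefile.$JOB_ID"
  while read -r host num _; do
      # {1..$num} does not work here: bash performs brace expansion before
      # variable expansion, hence seq.
      for i in $(seq 1 "$num"); do
          echo "$host" >> "$MACHINEFILE"
      done
  done < "$PE_HOSTFILE"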
simply creates an HP-MPI format machinefile from that provided by SGE.
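To make the two formats concrete, here is a hedged sketch of the same conversion written as a single awk call. It is an alternative illustration, not the script installed on the cluster, and pe_hostfile_to_machinefile is a hypothetical name. The sample input reuses the two hostfile-style lines that open the failed job's output earlier on this page.

```shell
#!/bin/bash
# Expand SGE PE hostfile lines ("host slots queue processors") into one
# output line per slot, which is the layout HP-MPI expects.
pe_hostfile_to_machinefile() {
    awk '{ for (i = 0; i < $2; i++) print $1 }' "$1"
}

# Demo: a two-node, 24-slot allocation written to a throwaway file.
pe_hostfile=$(mktemp)
cat > "$pe_hostfile" <<'EOF'
node064.danzek.itservices.manchester.ac.uk 12 C6100-STD.q@node064.danzek.itservices.manchester.ac.uk UNDEFINED
node047.danzek.itservices.manchester.ac.uk 12 C6100-STD.q@node047.danzek.itservices.manchester.ac.uk UNDEFINED
EOF
# Count how often each host appears in the generated machinefile.
pe_hostfile_to_machinefile "$pe_hostfile" | sort | uniq -c
rm -f "$pe_hostfile"
```

Each host should be counted 12 times, giving the 24 lines that -np 24 needs.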
| 7. | Example Qsub Scripts | 
In this example, we select the required PE, starccm-12.pe, ensure the script knows about environment modules by sourcing the modules.sh file, then load the required module and call StarCCM+.
#!/bin/bash
#$ -pe starccm-12.pe 24
#$ -S /bin/bash
#$ -cwd
source /etc/profile.d/modules.sh
module load apps/binapps/starccm/5.04
starccm+ -verbose -batch -rsh ssh -np $NSLOTS -machinefile machinefile.$JOB_ID 100_layer.sim
In this example it is assumed that the required environment module has already been loaded. We must add the -V option to the script to ensure it inherits the environment, including that added by the module, from our command line.
#!/bin/bash
#$ -pe starccm-12.pe 24
#$ -S /bin/bash
#$ -cwd
#$ -V
    # ...ensure the StarCCM+ env. module is loaded before qsubbing this script...
starccm+ -verbose -batch -rsh ssh -np $NSLOTS -machinefile machinefile.$JOB_ID 100_layer.sim 
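Either script relies on the PE having written machinefile.$JOB_ID with one line per slot. A job could verify that before launching StarCCM+; the following is a minimal sketch under that assumption, and machinefile_matches_slots is a hypothetical helper rather than part of the setup described here.

```shell
#!/bin/bash
# Succeed only if the machinefile is readable and has exactly one line per slot.
machinefile_matches_slots() {
    local machinefile=$1 nslots=$2 lines
    [ -r "$machinefile" ] || return 1
    # The arithmetic expansion strips any padding that wc adds to the count.
    lines=$(( $(wc -l < "$machinefile") ))
    [ "$lines" -eq "$nslots" ]
}

# Possible use in a job script, just before the starccm+ line
# (JOB_ID and NSLOTS are set by SGE):
#
#   machinefile_matches_slots "machinefile.$JOB_ID" "$NSLOTS" || {
#       echo "machinefile.$JOB_ID does not list $NSLOTS slots" >&2
#       exit 1
#   }
```

Failing early like this gives a clearer message than an mpirun error if the start_proc_args script did not run or the file was written somewhere else.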
