ACE: Environment Issues

1. What Software

  /opt/gridware/apps/binapps/ace/2010.0

2. Required Environment

The apps/binapps/ace/2010.0 environment module sets up:

    ESI_HOME              =  /opt/gridware/apps/binapps/ace
    PYTHONHOME            =  $ESI_HOME/2010.0/UTILS
    MPI_ROOT              =  $ESI_HOME/2010.0/UTILS/hpmpi-2.03.01.00
    PAM_LMD_LICENSE_FILE  =  $ESI_HOME/2010.0/LICENSES_11.6/licenses/PAM_LICENSE
and prepends
    LD_LIBRARY_PATH  :  $ESI_HOME/2010.0/UTILS/lib
    PATH             :  $ESI_HOME/2010.0/UTILS/bin	
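As a quick sanity check, the exported variables can be verified from Python before launching the solver. A minimal sketch (the helper name is ours, not part of the ACE distribution):

```python
import os

def check_ace_env(environ):
    """Return the names of required ACE variables missing from environ.

    Illustrative helper, not part of the ACE distribution: checks the
    variables the apps/binapps/ace/2010.0 module is expected to set.
    """
    required = ('ESI_HOME', 'PYTHONHOME', 'MPI_ROOT', 'PAM_LMD_LICENSE_FILE')
    return [name for name in required if name not in environ]

missing = check_ace_env(os.environ)  # empty list once the module is loaded
```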

3. What is the issue?

After setting up the environment

  module load compilers/intel/fortran/11.1.064
  module load apps/binapps/ace/2010.0
and submitting all.sge
  #!/bin/bash

  #$ -cwd
  #$ -V
  #$ -pe hpmpi.pe 24    # ...creates a HP-MPI machine file from the SGE one...
  #$ -S /bin/bash

  dtf_decompose -z -file_out 5oct_firsttrack_no_preheat.DTF -dmp MASTER_5oct_firsttrack_no_preheat.DTF 1 24

      # ...Correct usage is: 
      #
      #    dtf_decompose [-version] [-metis | -cell_groups | -orig_topo | \
      #        -x | -y | -z | -wavefront] [-even] [-combined] [-keepFF] \
      #        [-w w1 w2...] [-file_out outFile.DTF] [-restart] inFile.DTF \
      #        sim# num_procs

  CFD-SOLVER -model 5oct_firsttrack_no_preheat.DTF -num=$NSLOTS \
      -hosts=machinefile.$JOB_ID  -sim=1 -nodecomp -verbose=3
While the dtf_decompose step runs fine, the CFD-SOLVER step fails:
  Unable to find the UTILS folder: 2010.0/UTILS on R1-10

Looks like the environment is not getting to the application.
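The symptom is easy to reproduce locally: a child process started with an empty environment sees no ESI_HOME, which is apparently what the solver processes on the compute nodes experience. A minimal sketch:

```python
import subprocess

# Launch a shell with an explicitly empty environment: ESI_HOME expands
# to nothing, mimicking what the solver sees on the remote compute nodes.
out = subprocess.check_output(['sh', '-c', 'echo "ESI_HOME=[$ESI_HOME]"'],
                              env={})
print(out.decode().strip())  # ESI_HOME=[]
```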

4. First Test

Place the module load commands within the SGE script, to test that SGE is behaving itself:

  #!/bin/bash

  #$ -cwd
  #$ -pe hpmpi.pe 24
  #$ -S /bin/bash

  source /etc/profile.d/modules.sh

  module load compilers/intel/fortran/11.1.064
  module load apps/binapps/ace/2010.0

  blah, blah...

Makes no difference.

5. A Solution

Putting the module load commands in each user's .bashrc works.
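This works because .bashrc is sourced by non-interactive remote shells on every node, so the variables exist wherever the solver lands. A sketch of the relevant lines (reusing the module commands from above):

```shell
# ~/.bashrc (sketch) -- make the ACE environment available to
# non-interactive remote shells on every compute node
source /etc/profile.d/modules.sh
module load compilers/intel/fortran/11.1.064
module load apps/binapps/ace/2010.0
```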

6. Investigating

CFD-SOLVER is a Python script. The error is issued by this code fragment:

    # loop over all the machines to verify settings
    for machine in machines:
      # verify if ESI_HOME is set properly
      cmd = [options['remoteShell'], machine, '-n', 'echo', '$ESI_HOME']
      rc, esiDir = getCmdOutput(cmd)
      if rc != success:
        problem = 'ESI_HOME not set on ' + machine + '\n'
        print problem
        sys.stderr.write(problem)
        testFailed = true
        break
      esiDir = esiDir[0].rstrip()

      # if ESI_HOME is set test if the UTILS folder exists on all the machines
      utilsDirPath = os.path.join(esiDir, utilsDir)
      cmd = [options['remoteShell'], machine, '-n', 'ls', quotePath(utilsDirPath)]
      rc, output = getCmdOutput(cmd)
      if rc != success:
        problem = 'Unable to find the UTILS folder: ' + utilsDirPath + ' on ' + machine + '\n'
        print problem
        sys.stderr.write(problem)
        testFailed = true
        break

Changing the Unable to find the UTILS folder line to

     problem = 'Unable to find the UTILS folder: ' + utilsDirPath \
         + ' on ' + machine + ' (//' + esiDir + '//' + utilsDir + '//)\n'
confirms that esiDir, i.e., ESI_HOME, is not being passed to the remote hosts.

Why not? The getCmdOutput function calls subprocess.Popen to run the remote shell, but does not hand the environment over explicitly. The Popen call can be modified to do the environment copy by adding

  sys_env = os.environ.copy()
and then passing
    env=sys_env
to the Popen call.
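Putting the pieces together, a getCmdOutput-style helper with the fix applied might look like this (a sketch; the real helper's signature and options handling may differ):

```python
import os
import subprocess

def getCmdOutput(cmd):
    """Run cmd and return (returncode, list of output lines).

    Sketch of the fixed helper: the caller's environment is copied and
    handed to the child process explicitly via Popen's env argument.
    """
    sys_env = os.environ.copy()  # note: copy(), with parentheses
    proc = subprocess.Popen(cmd,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            env=sys_env)
    output, _ = proc.communicate()
    return proc.returncode, output.decode().splitlines()

rc, lines = getCmdOutput(['sh', '-c', 'echo "$ESI_HOME"'])
```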