ACE: Environment Issues
1. What Software
/opt/gridware/apps/binapps/ace/2010.0
2. Required Environment
The apps/binapps/ace/2010.0 environment module sets up:
ESI_HOME = /opt/gridware/apps/binapps/ace
PYTHONHOME = $ESI_HOME/2010.0/UTILS
MPI_ROOT = $ESI_HOME/2010.0/UTILS/hpmpi-2.03.01.00
PAM_LMD_LICENSE_FILE = $ESI_HOME/2010.0/LICENSES_11.6/licenses/PAM_LICENSE
and prepends
LD_LIBRARY_PATH : $ESI_HOME/2010.0/UTILS/lib
PATH : $ESI_HOME/2010.0/UTILS/bin
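A quick way to see whether these settings are visible to a given process is to check for the four variables directly. The following is a minimal sketch, not part of the ACE distribution; the function name missing_ace_vars is made up for illustration, and the variable list is taken from the module description above.

```python
import os

# Variables the apps/binapps/ace/2010.0 module is described as setting.
EXPECTED_VARS = ["ESI_HOME", "PYTHONHOME", "MPI_ROOT", "PAM_LMD_LICENSE_FILE"]


def missing_ace_vars(env=None):
    """Return the expected variables that are unset or empty in env.

    env defaults to the environment of the current process.
    """
    if env is None:
        env = os.environ
    return [name for name in EXPECTED_VARS if not env.get(name)]


# Report on the current process; an empty list means all four are set.
print("Not set:", missing_ace_vars())
```

Run from inside a job script, this reports what the job itself sees, which is not necessarily what a remote shell on another node sees.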
3. What is the issue?
After setting up the environment
module load compilers/intel/fortran/11.1.064
module load apps/binapps/ace/2010.0
and submitting all.sge
#!/bin/bash
#$ -cwd
#$ -V
#$ -pe hpmpi.pe 24 # ...creates a HP-MPI machine file from the SGE one...
#$ -S /bin/bash
dtf_decompose -z -file_out 5oct_firsttrack_no_preheat.DTF -dmp MASTER_5oct_firsttrack_no_preheat.DTF 1 24
# ...Correct usage is:
#
# dtf_decompose [-version] [-metis | -cell_groups | -orig_topo | \
# -x | -y | -z | -wavefront] [-even] [-combined] [-keepFF] \
# [-w w1 w2...] [-file_out outFile.DTF] [-restart] inFile.DTF \
# sim# num_procs
CFD-SOLVER -model 5oct_firsttrack_no_preheat.DTF -num=$NSLOTS \
-hosts=machinefile.$JOB_ID -sim=1 -nodecomp -verbose=3
The dtf_decompose step runs fine, but CFD-SOLVER fails:
Unable to find the UTILS folder: 2010.0/UTILS on R1-10
Looks like the environment is not getting to the application.
4. First Test
Place the module load command within the SGE script — test that SGE is behaving itself:
#!/bin/bash
#$ -cwd
#$ -pe hpmpi.pe 24
#$ -S /bin/bash
source /etc/profile.d/modules.sh
module load compilers/intel/fortran/11.1.064
module load apps/binapps/ace/2010.0
blah, blah...
Makes no difference.
5. A Solution
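Put the module load commands in each user's .bashrc. This works presumably because the shell started on each remote host reads .bashrc, so ESI_HOME is set there regardless of what the job script exports.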
6. Investigating
CFD-SOLVER is a Python script. The error is issued by this code fragment:
# loop over all the machines to verify settings
for machine in machines:
    # verify if ESI_HOME is set properly
    cmd = [options['remoteShell'], machine, '-n', 'echo', '$ESI_HOME']
    rc, esiDir = getCmdOutput(cmd)
    if rc != success:
        problem = 'ESI_HOME not set on ' + machine + '\n'
        print problem
        sys.stderr.write(problem)
        testFailed = true
        break
    esiDir = esiDir[0].rstrip()
    # if ESI_HOME is set test if the UTILS folder exists on all the machines
    utilsDirPath = os.path.join(esiDir, utilsDir)
    cmd = [options['remoteShell'], machine, '-n', 'ls', quotePath(utilsDirPath)]
    rc, output = getCmdOutput(cmd)
    if rc != success:
        problem = 'Unable to find the UTILS folder: ' + utilsDirPath + ' on ' + machine + '\n'
        print problem
        sys.stderr.write(problem)
        testFailed = true
        break
Changing the line that builds the "Unable to find the UTILS folder" message to
problem = 'Unable to find the UTILS folder: ' + utilsDirPath \
+ ' on ' + machine + '(//' + esiDir + '//' + utilsDir + '//)\n'
confirms that esiDir is empty, i.e., ESI_HOME is not being passed to
the remote hosts.
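The original error text is already consistent with an empty ESI_HOME: joining an empty string with the UTILS sub-path yields exactly the relative path "2010.0/UTILS" that appears in the message. The sketch below reproduces this locally. It is an illustration only: a local sh stands in for the remote shell, the helper remote_echo_sim is hypothetical, and the value of utilsDir is assumed from the error message rather than read from the CFD-SOLVER source.

```python
import os
import subprocess


def remote_echo_sim(env):
    """Stand-in for `remoteShell machine -n echo $ESI_HOME`.

    A local sh plays the remote shell; env plays the environment that
    shell sees on the remote host. $ESI_HOME is expanded by that shell,
    not by the Python process that launches it.
    """
    proc = subprocess.Popen(["sh", "-c", "echo $ESI_HOME"],
                            stdout=subprocess.PIPE, env=env,
                            universal_newlines=True)
    out, _ = proc.communicate()
    return out.rstrip()


utils_dir = "2010.0/UTILS"  # assumed value of utilsDir, inferred from the error text

# Case 1: the "remote" shell has no ESI_HOME, as on the failing nodes.
env_without = {"PATH": os.environ.get("PATH", "/usr/bin:/bin")}
print(os.path.join(remote_echo_sim(env_without), utils_dir))
# prints: 2010.0/UTILS

# Case 2: the "remote" shell has ESI_HOME set, e.g. via .bashrc.
env_with = dict(env_without, ESI_HOME="/opt/gridware/apps/binapps/ace")
print(os.path.join(remote_echo_sim(env_with), utils_dir))
# prints: /opt/gridware/apps/binapps/ace/2010.0/UTILS
```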
Why not? The getCmdOutput function calls subprocess.Popen to start the remote shell, but the environment does not reach the remote hosts. The Popen call can be modified to pass a copy of the environment explicitly by using
sys_env = os.environ.copy()
and then adding
env=sys_env
to the Popen call.
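As a rough sketch of that change, the function below shows a Popen call with an explicit environment copy. It is not the vendor's getCmdOutput, whose body is not reproduced here; get_cmd_output and its return format are assumptions made for illustration. Two caveats apply. When env is omitted, Popen already inherits the calling process's environment. And env= only sets the environment of the process Popen starts locally (here, the remote-shell client), so whether the variables then arrive on the remote host depends on how that remote shell is configured.

```python
import os
import subprocess
import sys


def get_cmd_output(cmd, env=None):
    """Hypothetical stand-in for getCmdOutput: run cmd, return (rc, lines).

    env is handed to Popen explicitly; if none is given, a copy of the
    current environment is used.
    """
    if env is None:
        env = os.environ.copy()
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            env=env, universal_newlines=True)
    out, _ = proc.communicate()
    return proc.returncode, out.splitlines()


# Local demonstration: the child process sees what was placed in sys_env.
sys_env = os.environ.copy()
sys_env["ESI_HOME"] = "/opt/gridware/apps/binapps/ace"
child = [sys.executable, "-c", "import os; print(os.environ['ESI_HOME'])"]
rc, lines = get_cmd_output(child, env=sys_env)
print(rc, lines)
```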