ACE: Environment Issues
1. What Software
/opt/gridware/apps/binapps/ace/2010.0
2. Required Environment
The apps/binapps/ace/2010.0 environment module sets up:
    ESI_HOME             = /opt/gridware/apps/binapps/ace
    PYTHONHOME           = $ESI_HOME/2010.0/UTILS
    MPI_ROOT             = $ESI_HOME/2010.0/UTILS/hpmpi-2.03.01.00
    PAM_LMD_LICENSE_FILE = $ESI_HOME/2010.0/LICENSES_11.6/licenses/PAM_LICENSE

and prepends

    LD_LIBRARY_PATH : $ESI_HOME/2010.0/UTILS/lib
    PATH            : $ESI_HOME/2010.0/UTILS/bin
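The settings listed above can be checked from inside a job with a few lines of Python. This is a minimal sketch and not part of the ACE distribution: the helper name check_ace_env is made up for illustration, and the expected values are those listed above. Because the module sets PYTHONHOME, it is presumably safest to run it with the Python that ships under UTILS/bin, so the sketch avoids syntax specific to Python 3.

```python
import os

# Values the environment module is documented (above) to set.
ESI_HOME = "/opt/gridware/apps/binapps/ace"
EXPECTED = {
    "ESI_HOME": ESI_HOME,
    "PYTHONHOME": ESI_HOME + "/2010.0/UTILS",
    "MPI_ROOT": ESI_HOME + "/2010.0/UTILS/hpmpi-2.03.01.00",
    "PAM_LMD_LICENSE_FILE": ESI_HOME + "/2010.0/LICENSES_11.6/licenses/PAM_LICENSE",
}
# Directories the module is documented to prepend to the search paths.
EXPECTED_PATHS = {
    "LD_LIBRARY_PATH": ESI_HOME + "/2010.0/UTILS/lib",
    "PATH": ESI_HOME + "/2010.0/UTILS/bin",
}


def check_ace_env(env):
    """Return a list of human-readable problems found in env (a mapping)."""
    problems = []
    # Plain variables must match exactly.
    for name, value in EXPECTED.items():
        if env.get(name) != value:
            problems.append("%s is %r, expected %r" % (name, env.get(name), value))
    # Path-like variables only need to contain the expected directory.
    for name, directory in EXPECTED_PATHS.items():
        entries = env.get(name, "").split(os.pathsep)
        if directory not in entries:
            problems.append("%s does not contain %s" % (name, directory))
    return problems


if __name__ == "__main__":
    for problem in check_ace_env(os.environ):
        print(problem)
```

An empty result means the job's own environment matches the module. It says nothing about what other hosts in the parallel job see.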
3. What is the issue?
After setting up the environment

    module load compilers/intel/fortran/11.1.064
    module load apps/binapps/ace/2010.0

and submitting all.sge

    #!/bin/bash
    #$ -cwd
    #$ -V
    #$ -pe hpmpi.pe 24

    # ...creates a HP-MPI machine file from the SGE one...

    #$ -S /bin/bash

    dtf_decompose -z -file_out 5oct_firsttrack_no_preheat.DTF -dmp MASTER_5oct_firsttrack_no_preheat.DTF 1 24

    # ...Correct usage is:
    #
    #    dtf_decompose [-version] [-metis | -cell_groups | -orig_topo | \
    #        -x | -y | -z | -wavefront] [-even] [-combined] [-keepFF] \
    #        [-w w1 w2...] [-file_out outFile.DTF] [-restart] inFile.DTF \
    #        sim# num_procs

    CFD-SOLVER -model 5oct_firsttrack_no_preheat.DTF -num=$NSLOTS \
        -hosts=machinefile.$JOB_ID -sim=1 -nodecomp -verbose=3

the dtf_decompose step runs fine, but CFD-SOLVER fails:
Unable to find the UTILS folder: 2010.0/UTILS on R1-10
Looks like the environment is not getting to the application.
4. First Test
Place the module load commands within the SGE script, to test that SGE is behaving itself:

    #!/bin/bash
    #$ -cwd
    #$ -pe hpmpi.pe 24
    #$ -S /bin/bash

    source /etc/profile.d/modules.sh
    module load compilers/intel/fortran/11.1.064
    module load apps/binapps/ace/2010.0

    # ...rest of the script as before...
Makes no difference.
5. A Solution
Put the module load commands in each user's .bashrc.
6. Investigating
CFD-SOLVER is a Python script. The error is issued by this code fragment:
    # loop over all the machines to verify settings
    for machine in machines:
        # verify if ESI_HOME is set properly
        cmd = [options['remoteShell'], machine, '-n', 'echo', '$ESI_HOME']
        rc, esiDir = getCmdOutput(cmd)
        if rc != success:
            problem = 'ESI_HOME not set on ' + machine + '\n'
            print problem
            sys.stderr.write(problem)
            testFailed = true
            break
        esiDir = esiDir[0].rstrip()

        # if ESI_HOME is set test if the UTILS folder exists on all the machines
        utilsDirPath = os.path.join(esiDir, utilsDir)
        cmd = [options['remoteShell'], machine, '-n', 'ls', quotePath(utilsDirPath)]
        rc, output = getCmdOutput(cmd)
        if rc != success:
            problem = 'Unable to find the UTILS folder: ' + utilsDirPath + ' on ' + machine + '\n'
            print problem
            sys.stderr.write(problem)
            testFailed = true
            break
Changing the line that builds the "Unable to find the UTILS folder" message to

    problem = 'Unable to find the UTILS folder: ' + utilsDirPath \
        + ' on ' + machine + '(//' + esiDir + '//' + utilsDir + '//)\n'

confirms that esiDir, i.e. ESI_HOME, is not being passed to the remote hosts.
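The shape of the error message follows from how os.path.join treats an empty first component. The small illustration below assumes that utilsDir is '2010.0/UTILS'; this is inferred from the error message, not read from the script. The helper name utils_path is made up.

```python
import posixpath  # what os.path resolves to on Linux

# Assumed value of utilsDir, inferred from the error message.
utils_dir = "2010.0/UTILS"


def utils_path(esi_home):
    """Mimic utilsDirPath = os.path.join(esiDir, utilsDir)."""
    return posixpath.join(esi_home, utils_dir)


# ESI_HOME expands to an empty string on the remote host:
print(utils_path(""))
# -> 2010.0/UTILS

# ESI_HOME set as the module intends:
print(utils_path("/opt/gridware/apps/binapps/ace"))
# -> /opt/gridware/apps/binapps/ace/2010.0/UTILS
```

The first result is the relative path that appears in the failure on R1-10.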
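Why not? The getCmdOutput function calls subprocess.Popen to run the remote shell. Note that Popen does hand the calling process's environment to its child by default (when no env argument is given), so the variable is most likely lost one step later: the literal string $ESI_HOME is presumably expanded by the shell started on the remote host, and remote shells such as rsh or ssh do not forward the local environment by default. This is consistent with the .bashrc solution working. The environment given to Popen can still be made explicit by taking a copy with

    sys_env = os.environ.copy()

and then adding

    env=sys_env

to the Popen call, but this on its own only affects the local remote-shell process, not the shell on the remote host.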
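The following minimal sketch shows how an explicit environment can be passed to Popen. It uses a simplified stand-in for getCmdOutput: the real function's body is not reproduced here, so the name get_cmd_output and its return shape are guesses based on how it is called in the fragment above. It is written for Python 3, whereas the original script uses Python 2 print statements. The demo runs a local child process, not a remote shell, so it only shows what Popen itself hands to the child.

```python
import os
import subprocess
import sys


def get_cmd_output(cmd, env=None):
    """Run cmd and return (return_code, list_of_stdout_lines).

    Hypothetical stand-in for the script's getCmdOutput. With env=None,
    Popen lets the child inherit the caller's environment.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        env=env,
        universal_newlines=True,
    )
    output, _ = proc.communicate()
    return proc.returncode, output.splitlines()


if __name__ == "__main__":
    os.environ["ESI_HOME"] = "/opt/gridware/apps/binapps/ace"

    # Local child that reports the ESI_HOME it received.
    child = [sys.executable, "-c",
             "import os; print(os.environ.get('ESI_HOME', '<unset>'))"]

    # Explicit copy of the environment passed via env=.
    sys_env = os.environ.copy()
    print(get_cmd_output(child, env=sys_env))

    # No env argument at all, for comparison.
    print(get_cmd_output(child))
```

Both calls report the same value, because a child started without env inherits the caller's environment. Whether that value then reaches a remote host depends on the remote shell, which this demo does not exercise.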