Universes and Job Examples
Condor Universes and Job Examples
In this module:
- detail the most commonly used universes in Condor
- give example Condor submission scripts for each.
Universes
The Vanilla Universe
Most Widely and Easily Applicable Universe
- Limitations
  - Intended for programs which cannot be re-linked against the Condor libraries; also for shell scripts.
  - Jobs cannot be checkpointed or migrated; they can only be suspended or killed.
- I/O and File Transfer
  - No remote system calls:
    - a shared filesystem (NFS or AFS) is assumed by default;
    - file transfer can be explicitly specified instead.
Universes
Vanilla Example
Code:
PROGRAM hello
PRINT *, "Hello from GFortran and Condor 7.4.2"
END PROGRAM hello
Condor script:
executable = hello
universe = vanilla
requirements = (Memory > 900)
ShouldTransferFiles = IF_NEEDED
WhenToTransferOutput = ON_EXIT
#
# ...even though no output files other than STDOUT...
output = loop.$(Process).out
error = loop.$(Process).err
log = loop.log
queue 10
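The queue 10 statement creates ten jobs in a single cluster, distinguished by $(Process) values 0 to 9. The same macro can also be used to vary what each job does, for example by passing it on the command line; a minimal sketch (not from the original deck, and it assumes the program actually reads its arguments):
executable = hello
universe = vanilla
arguments = $(Process)
output = loop.$(Process).out
error = loop.$(Process).err
log = loop.log
queue 10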
Universes
Another Vanilla Example: File Transfer
Code fragment:
OPEN(UNIT=1,FILE='myfile.txt')
WRITE(UNIT=1, FMT=*) "Hello world"
CLOSE(UNIT=1)
Condor script:
executable = hello-2
universe = vanilla
requirements = (Memory > 900)
ShouldTransferFiles = IF_NEEDED
WhenToTransferOutput = ON_EXIT
transfer_output_files = myfile.txt
output = loop.$(Process).out
error = loop.$(Process).err
log = loop.log
queue 1
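If the job also reads input files that are not available via a shared filesystem, they can be listed explicitly with transfer_input_files; a minimal sketch (the file name input.txt is hypothetical, and the documented lower-case command names are used):
executable = hello-2
universe = vanilla
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
transfer_input_files = input.txt
transfer_output_files = myfile.txt
output = loop.$(Process).out
error = loop.$(Process).err
log = loop.log
queue 1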
Universes
The Standard Universe 1/2
Jobs can be checkpointed/migrated/restarted
- Checkpointing, Job Migration
  - Condor checkpoints a job at regular intervals, saving the state of the process (memory, CPU, I/O, etc.) to a file.
  - The process can be restarted exactly as if it had never stopped.
  - Jobs can be migrated to another machine, e.g., when the machine's owner returns.
- Remote System Calls; File Transfer
  - Access to I/O files is through remote system calls; transfer of these files does not take place.
  - Executable binaries and checkpoint files are transferred automatically as needed.
Universes
The Standard Universe 2/2
Re-linking required: the program must be re-linked against the Condor libraries (via condor_compile).
Example Submit File
executable = myjob
universe = standard
#
# ...no longer seems to default to standard...
output = loop.$(Process).out
error = loop.$(Process).err
log = loop.log
queue
Universes
The Java Universe
- Condor takes care of finding the JVM, setting the CLASSPATH, etc. (see the sketch below)
  - these will likely differ between machines in the pool!
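The original slide gives no submit file for this universe; a minimal Java universe sketch, assuming a class file Hello.class whose main class is Hello (for Java jobs the first argument names the class containing main):
universe = java
executable = Hello.class
arguments = Hello
output = hello.out
error = hello.err
log = hello.log
queue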
Universes
The Parallel Universe
- Runs, e.g., MPI jobs (supersedes the old MPI universe)
- Requires dedicated members of a pool: dedicated machines never vacate executing jobs, so this universe is not suitable for desktop machines.
Example Submit File
universe = parallel
executable = my_mpi_prog
log = my_mpi_prog.log
input = my_mpi_prog.data
output = my_mpi_prog.out.$(NODE)
error = my_mpi_prog.err.$(NODE)
machine_count = 4
queue
But why?
- Why not simply use a traditional batch system?