  
 
Condor Universes and Job Examples
In this module:
- detail the most commonly used universes in Condor
- give example Condor submission scripts for each.
  
  
Universes
The Vanilla Universe
Most Widely and Easily Applicable Universe
- Limitations
    - Intended for programs that cannot be re-linked with the Condor libraries; also suitable for shell scripts.
    - Jobs cannot be checkpointed or migrated; they can only be suspended, or killed and later restarted from the beginning.
 
- I/O and File Transfer
    - No remote system calls:
        - a shared filesystem (NFS or AFS) is assumed by default;
        - file transfer can be specified explicitly instead.
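The points above can be sketched as a submit file for a shell-script job that uses explicit file transfer instead of a shared filesystem. The names run.sh and params.txt are hypothetical placeholders. transfer_input_files lists extra files to ship to the execute machine, and the executable itself is transferred by default.

  # Hypothetical vanilla-universe job: a shell script plus one input file.
  # run.sh and params.txt are placeholder names.
  executable = run.sh
  universe   = vanilla
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  transfer_input_files    = params.txt
  output  = run.$(Process).out
  error   = run.$(Process).err
  log     = run.log
  queue

The script needs a #! interpreter line and execute permission, because Condor runs it directly as the job's executable.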
 
 
  
  
Vanilla Example
Code:
  PROGRAM hello
  PRINT *, "Hello from GFortran and Condor 7.4.2"
  END PROGRAM hello
Condor script:
  executable = hello
  universe   = vanilla
  requirements = (Memory > 900)
  ShouldTransferFiles   = IF_NEEDED
  WhenToTransferOutput  = ON_EXIT
  # File-transfer settings are given even though the job
  # writes no output files other than stdout.
  output  = loop.$(Process).out
  error   = loop.$(Process).err
  log     = loop.log
  queue 10
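Each of the ten processes queued above receives its own $(Process) value (0 through 9), which is why their output and error files do not collide. The same value can also be passed to the program as an argument. The sketch below assumes a hypothetical variant, hello-args, that reads its first command-line argument; the Fortran program above does not.

  # Hypothetical variant: hello-args reads its first argument (program not shown).
  executable = hello-args
  universe   = vanilla
  arguments  = $(Process)
  should_transfer_files   = IF_NEEDED
  when_to_transfer_output = ON_EXIT
  output  = args.$(Process).out
  error   = args.$(Process).err
  log     = args.log
  queue 10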
  
  
Another Vanilla Example: File Transfer
Code fragment:
  OPEN(UNIT=1,FILE='myfile.txt')
  WRITE(UNIT=1, FMT=*) "Hello world"
  CLOSE(UNIT=1)
Condor script:
  executable = hello-2
  universe   = vanilla
  requirements = (Memory > 900)
  ShouldTransferFiles   = IF_NEEDED
  WhenToTransferOutput  = ON_EXIT
  transfer_output_files = myfile.txt
  output  = loop.$(Process).out
  error   = loop.$(Process).err
  log     = loop.log
  queue 1
  
  
The Standard Universe 1/2
Jobs can be checkpointed/migrated/restarted
- Checkpointing, Job Migration
    - Condor checkpoints a job at regular intervals, saving the state of the process (memory, CPU state, I/O state, etc.) to a checkpoint file.
    - The process can later be restarted from its most recent checkpoint exactly as if it had never stopped.
    - Jobs can be migrated to another machine, e.g., when the machine's owner returns.
 
- Remote System Calls; File Transfer
    - The job accesses its I/O files through remote system calls to the submit machine, so these files are not transferred.
    - Executable binaries and checkpoint files are transferred automatically as needed.
 
  
  
The Standard Universe 2/2
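Re-linking with the Condor libraries is required: prefix the usual link command with condor_compile, e.g. condor_compile gcc -o myjob myjob.c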
Example Submit File
  executable  = myjob
  universe    = standard
    # Must be set explicitly: the default universe
    # is now vanilla, not standard.
  output    = loop.$(Process).out
  error     = loop.$(Process).err
  log       = loop.log
  queue
  
  
The Java Universe
- Condor takes care of finding the JVM, setting the CLASSPATH, etc.
    - these will likely differ between machines in the pool!
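This is a minimal Java universe submit file, following the pattern of the Java example in the Condor manual. Hello.class and its main class Hello are placeholders. In this universe the executable is the compiled .class file, and the first argument names the class whose main method should run. Condor locates the JVM configured on each execute machine.

  # Placeholder names: Hello.class was compiled beforehand with javac.
  universe   = java
  executable = Hello.class
  arguments  = Hello
  output     = Hello.out
  error      = Hello.err
  log        = Hello.log
  queue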
 
  
  
The Parallel Universe
- Runs, e.g., MPI jobs (supersedes the MPI universe).
- Requires dedicated members of a pool: dedicated machines never vacate executing jobs, so desktop machines are not suitable.
Example Submit File
  universe   = parallel
  executable = my_mpi_prog
  
  log    = my_mpi_prog.log
  input  = my_mpi_prog.data
  output = my_mpi_prog.out.$(NODE)
  error  = my_mpi_prog.err.$(NODE)
  machine_count = 4
  queue
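Some MPI implementations (MPICH, for example) need their own start-up step. For these, the approach described in the Condor manual is to run a wrapper script as the executable and pass the real MPI binary as an argument. The sketch below assumes such a wrapper named mp1script, modelled on the example scripts distributed with Condor; check your own installation for the script that matches your MPI. It reuses the my_mpi_prog binary from the example above.

  # Sketch only: mp1script is assumed to be the MPI start-up wrapper
  # shipped with your Condor installation; my_mpi_prog is the real binary.
  universe      = parallel
  executable    = mp1script
  arguments     = my_mpi_prog
  machine_count = 4
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  transfer_input_files    = my_mpi_prog
  log    = my_mpi_prog.log
  output = my_mpi_prog.out.$(NODE)
  error  = my_mpi_prog.err.$(NODE)
  queue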
But why?
- Why not simply use a traditional batch system?