
 


Condor Universes and Job Examples

This module:

  • details the most commonly used universes in Condor
    • Vanilla, Standard, ...
  • gives example Condor submission scripts for each.




Universes

The Vanilla Universe

Most Widely and Easily Applicable Universe

  • Runs any serial job

Limitations
  • Intended for programs which cannot be re-linked to the Condor libraries; also for shell scripts.
  • Jobs cannot be checkpointed/migrated — suspended or killed only.
IO and File Transfer
  • No remote system calls:
    • shared filesystem (NFS or AFS) assumed by default;
    • file transfer can be explicitly specified instead.
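
A minimal sketch of the explicit-transfer case, assuming a pool with no shared filesystem. The names my_prog and input.dat, and the my_prog.* output files, are hypothetical placeholders and do not come from the examples in this module:

  executable = my_prog
  universe   = vanilla

      #
      # ...always copy files to and from the execute machine,
      #    rather than relying on NFS or AFS...
  ShouldTransferFiles   = YES
  WhenToTransferOutput  = ON_EXIT

      #
      # ...hypothetical input file, sent along with the executable...
  transfer_input_files  = input.dat

  output  = my_prog.$(Process).out
  error   = my_prog.$(Process).err
  log     = my_prog.log

  queue

With IF_NEEDED in place of YES, Condor should transfer files only when the submit and execute machines do not share a filesystem domain.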




Universes

Vanilla Example

Code:

  PROGRAM hello
  PRINT *, "Hello from GFortran and Condor 7.4.2"
  END PROGRAM hello

Condor script:

  executable = hello
  universe   = vanilla

  requirements = (Memory > 900)

  ShouldTransferFiles   = IF_NEEDED
  WhenToTransferOutput  = ON_EXIT
      #
      # ...even though no output files other than STDOUT...

  output  = loop.$(Process).out
  error   = loop.$(Process).err
  log     = loop.log

  queue 10




Universes

Another Vanilla Example: File Transfer

Code fragment:

  OPEN(UNIT=1,FILE='myfile.txt')
  WRITE(UNIT=1, FMT=*) "Hello world"
  CLOSE(UNIT=1)

Condor script:

  executable = hello-2
  universe   = vanilla

  requirements = (Memory > 900)

  ShouldTransferFiles   = IF_NEEDED
  WhenToTransferOutput  = ON_EXIT

  transfer_output_files = myfile.txt

  output  = loop.$(Process).out
  error   = loop.$(Process).err
  log     = loop.log

  queue 1




Universes

The Standard Universe 1/2

Jobs can be checkpointed/migrated/restarted

Checkpointing, Job Migration
  • Condor checkpoints a job at regular intervals: it saves the state of the process (memory, CPU, IO, etc.) to a file.
  • Process can be restarted exactly as if it had never stopped.
  • Jobs can be migrated to another machine, e.g., when owner returns.
Remote System Calls; File Transfer
  • Access to IO files is through remote system calls, so these files are never transferred.
  • Executable binaries and checkpoint files are transferred automatically as needed.




Universes

The Standard Universe 2/2

Re-linking required

  • Typically no changes to source code are needed, but re-linking against Condor's libraries is required;
  • use condor_compile for this:
        condor_compile gcc -o myjob my_source.c
    
        condor_compile gfortran -o myjob my_source.f90


Example Submit File

  executable  = myjob
  universe    = standard
    #
    # ...no longer seems to default to standard...

  output    = loop.$(Process).out
  error     = loop.$(Process).err

  log       = loop.log

  queue




Universes

The Java Universe

  • Condor takes care of finding the JVM, setting CLASSPATH, etc.
    • these will likely be different on different machines in the pool!
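
A minimal sketch of a Java universe submit file, assuming a compiled class Hello whose main method lives in Hello.class. The class name and the hello.* file names are hypothetical:

  universe   = java
  executable = Hello.class
      #
      # ...the first argument names the class containing main()...
  arguments  = Hello

  output  = hello.$(Process).out
  error   = hello.$(Process).err
  log     = hello.log

  queue

No JVM path appears in the script; that choice is left to each execute machine.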




Universes

The Parallel Universe

  • Runs, e.g., MPI jobs (supersedes the MPI universe)
  • Requires dedicated members of a pool — dedicated machines never vacate executing jobs (not suitable for desktop machines).

Example Submit File

  universe   = parallel
  executable = my_mpi_prog
  
  log    = my_mpi_prog.log
  input  = my_mpi_prog.data
  output = my_mpi_prog.out.$(NODE)
  error  = my_mpi_prog.err.$(NODE)

  machine_count = 4

  queue

But why?

  • Why not simply use a traditional batch system?