Stuff


Spool Corruptions

If the SGE master daemon crashes for whatever reason, spool files can get corrupted. If that happens, the daemon may refuse to restart, or it may restart but leave SGE in a messed-up state (e.g., hosts missing, queues missing, hosts not configured as expected).

To diagnose, look in $SGE_ROOT/default/spool/messages. After a restart attempt it will often contain a sequence of warning/error messages, e.g.,

  .          .
  .          .
  .          .
  05/17/2012 14:09:00|  main|headnode1|E|error reading file: "/opt/gridware/ge/default/spool/qmaster/qinstances/C6100-STD-serial.q/node078.danzek.itservices.manch"
  05/17/2012 14:09:00|  main|headnode1|E|error reading file: "/opt/gridware/ge/default/spool/qmaster/qinstances/C6100-STD-serial.q/node060.danzek.itservices.manch"
  05/17/2012 14:09:00|  main|headnode1|E|error reading file: "/opt/gridware/ge/default/spool/qmaster/qinstances/C6100-STD-serial.q/node058.danzek.itservices.manch"
  05/17/2012 14:11:34|  main|headnode1|E|error reading file: "/opt/gridware/ge/default/spool/qmaster/exec_hosts/node049.danzek.itservices.manchester.ac.uk"
  05/17/2012 14:11:34|  main|headnode1|E|error reading file: "/opt/gridware/ge/default/spool/qmaster/exec_hosts/node010.danzek.itservices.manchester.ac.uk"
  05/17/2012 14:11:34|  main|headnode1|E|error reading file: "/opt/gridware/ge/default/spool/qmaster/exec_hosts/gpu210.danzek.itservices.manchester.ac.uk"
  .          .
  .          .
  .          .
  #
  # ...the first three are from the "previous" attempt to restart sge_qmaster;
  #    the latter three are from "this" attempt.
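
When the messages file is long, it can help to pull out just the "error reading file" entries and group them by timestamp, so that each restart attempt shows up as its own batch. The following is a minimal sketch of that idea, not an SGE tool: the function names are hypothetical, and the regular expression assumes the log line layout shown in the excerpt above (date and time, thread, host, level, message, separated by "|"). The path to the messages file is passed on the command line.

```python
import re
import sys
from collections import defaultdict

# Assumed layout, taken from the excerpt above:
#   MM/DD/YYYY HH:MM:SS|thread|host|E|error reading file: "path"
ERROR_LINE = re.compile(
    r'^\s*(?P<stamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|'
    r'[^|]*\|[^|]*\|E\|error reading file: "(?P<path>[^"]*)"'
)


def unreadable_spool_files(lines):
    """Group unreadable spool file paths by the timestamp that reported them."""
    by_stamp = defaultdict(list)
    for line in lines:
        match = ERROR_LINE.match(line)
        if match:
            by_stamp[match.group("stamp")].append(match.group("path"))
    return dict(by_stamp)


def main(messages_path):
    # errors="replace" so stray bytes in the log do not abort the scan
    with open(messages_path, errors="replace") as handle:
        grouped = unreadable_spool_files(handle)
    for stamp, paths in grouped.items():
        print(stamp)
        for path in paths:
            print("    " + path)


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Entries that share a timestamp are very likely from the same restart attempt, so the output gives a quick list of which qinstance and exec_host spool files each attempt failed to read. Paths are printed exactly as logged, including any that the log itself truncated.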