Spool Corruptions
If the SGE master daemon crashes for whatever reason, spool files can get corrupted. If that happens, the daemon may refuse to restart, or it may restart leaving SGE in a messed-up state (e.g., hosts missing, queues missing, hosts not configured as expected. . .).
To diagnose, look in $SGE_ROOT/default/spool/messages. On a restart there may well be a sequence of warning/error messages, e.g.,
. . .
. . .
05/17/2012 14:09:00| main|headnode1|E|error reading file: "/opt/gridware/ge/default/spool/qmaster/qinstances/C6100-STD-serial.q/node078.danzek.itservices.manch"
05/17/2012 14:09:00| main|headnode1|E|error reading file: "/opt/gridware/ge/default/spool/qmaster/qinstances/C6100-STD-serial.q/node060.danzek.itservices.manch"
05/17/2012 14:09:00| main|headnode1|E|error reading file: "/opt/gridware/ge/default/spool/qmaster/qinstances/C6100-STD-serial.q/node058.danzek.itservices.manch"
05/17/2012 14:11:34| main|headnode1|E|error reading file: "/opt/gridware/ge/default/spool/qmaster/exec_hosts/node049.danzek.itservices.manchester.ac.uk"
05/17/2012 14:11:34| main|headnode1|E|error reading file: "/opt/gridware/ge/default/spool/qmaster/exec_hosts/node010.danzek.itservices.manchester.ac.uk"
05/17/2012 14:11:34| main|headnode1|E|error reading file: "/opt/gridware/ge/default/spool/qmaster/exec_hosts/gpu210.danzek.itservices.manchester.ac.uk"
. . .
. . .
#
# ...the first three are from the "previous" attempt to restart sge_qmaster;
# the latter three are from "this" attempt.
- One assumes that the first error message in a sequence is "real", so look at node049.danzek.itservices.manchester.ac.uk. . .
- . . .and that the rest are false, cascaded errors (cf. compiler error messages).
- So fix up the file named in that first message (here, the exec_hosts entry for node049) and all should be well.
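
The "first error in a sequence" rule above can be sketched as a small script. This is only an illustration, not part of SGE: the function name first_errors, the regular expression, and the 60-second gap used to separate one restart attempt from the next are all assumptions, and the sample lines are the ones quoted above (truncated paths left as they appear).

```python
import re
from datetime import datetime, timedelta

# Assumed line layout, taken from the excerpt above:
#   MM/DD/YYYY HH:MM:SS| thread|host|E|error reading file: "path"
LINE_RE = re.compile(
    r'^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|[^|]*\|[^|]*\|E\|'
    r'error reading file: "([^"]+)"'
)


def first_errors(lines, max_gap=timedelta(seconds=60)):
    """Return (timestamp, path) for the first error of each burst.

    A new burst starts when the time since the previous matching error
    exceeds max_gap (an assumed threshold; tune it for your site).
    """
    firsts = []
    previous = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not an "error reading file" line
        stamp = datetime.strptime(match.group(1), "%m/%d/%Y %H:%M:%S")
        if previous is None or stamp - previous > max_gap:
            firsts.append((stamp, match.group(2)))
        previous = stamp
    return firsts


# The six messages quoted above, used as sample input.
PREFIX = "/opt/gridware/ge/default/spool/qmaster/"
SAMPLE = [
    '05/17/2012 14:09:00| main|headnode1|E|error reading file: "'
    + PREFIX + 'qinstances/C6100-STD-serial.q/node078.danzek.itservices.manch"',
    '05/17/2012 14:09:00| main|headnode1|E|error reading file: "'
    + PREFIX + 'qinstances/C6100-STD-serial.q/node060.danzek.itservices.manch"',
    '05/17/2012 14:09:00| main|headnode1|E|error reading file: "'
    + PREFIX + 'qinstances/C6100-STD-serial.q/node058.danzek.itservices.manch"',
    '05/17/2012 14:11:34| main|headnode1|E|error reading file: "'
    + PREFIX + 'exec_hosts/node049.danzek.itservices.manchester.ac.uk"',
    '05/17/2012 14:11:34| main|headnode1|E|error reading file: "'
    + PREFIX + 'exec_hosts/node010.danzek.itservices.manchester.ac.uk"',
    '05/17/2012 14:11:34| main|headnode1|E|error reading file: "'
    + PREFIX + 'exec_hosts/gpu210.danzek.itservices.manchester.ac.uk"',
]

# The last entry is the first error of the most recent restart attempt.
suspects = first_errors(SAMPLE)
latest_suspect = suspects[-1][1]
```

To run it against a real log, pass an open file instead of SAMPLE, e.g. first_errors(open(path)) with path set to the messages file mentioned above. The last entry returned is the file to inspect first.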