The CSF's Own Qstat
1. The Problem: Qstat Output on the CSF is Difficult to Interpret
- Why doesn't my qw job run?
- Which nodes/queues are empty?
2. Roll our own. . .
Since DRMAA v1 can be used to build qsub but not qstat, we build our own from scratch, using Perl. . .
It's quick-n-dirty: it relies on the output of qconf and qstat rather than an API, but that is OK. When DRMAA in SGE supports what we want, we will use it; the Perl modules will change, but there is no reason for the script itself to do so.
3. Example Output and Annotation
4. Perl Module Wrappers for qstat and qconf
4.1. SGE_Qconf.pm
- new() returns an object wrapping the information that qconf can obtain.
- new() grabs:
- a queue list and all attributes for each queue;
- a host-group list and determines which queues are hosted on which host-groups.
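As a rough illustration of the kind of parsing described above, here is a minimal sketch. It is not the actual SGE_Qconf.pm code: the sub names parse_queue_attributes and queues_by_hostgroup, the queue names and the sample text are all made up for the example. It assumes the input looks like the "attribute value" lines that qconf -sq prints, with long values continued by a trailing backslash, and that host-groups appear in the hostlist attribute as names starting with "@".

```perl
use strict;
use warnings;

# Parse "attribute value" lines in the style of `qconf -sq <queue>` output.
# Long values are continued with a trailing backslash, so join those first.
sub parse_queue_attributes {
    my ($text) = @_;
    $text =~ s/\\\n\s*//g;    # undo backslash line continuations
    my %attr;
    for my $line (split /\n/, $text) {
        next unless $line =~ /^(\S+)\s+(.*?)\s*$/;
        $attr{$1} = $2;
    }
    return \%attr;
}

# Given { queue name => attribute hash }, return { host-group => [queues] }.
# Only hostlist entries starting with "@" are treated as host-groups.
sub queues_by_hostgroup {
    my ($queues) = @_;
    my %hosted;
    for my $qname (sort keys %$queues) {
        my $hostlist = $queues->{$qname}{hostlist} // '';
        for my $entry (split /[\s,]+/, $hostlist) {
            push @{ $hosted{$entry} }, $qname if $entry =~ /^@/;
        }
    }
    return \%hosted;
}

# Made-up sample text standing in for real qconf output.
my $serial = <<'END';
qname                 serial.q
hostlist              @standard
pe_list               NONE
END

my $parallel = <<'END';
qname                 parallel.q
hostlist              @standard \
                      @intel_ib
pe_list               smp.pe orte.pe
END

my %queues;
for my $text ($serial, $parallel) {
    my $attr = parse_queue_attributes($text);
    $queues{ $attr->{qname} } = $attr;
}

my $hosted = queues_by_hostgroup(\%queues);
for my $hostgroup (sort keys %$hosted) {
    print "$hostgroup: @{ $hosted->{$hostgroup} }\n";
}
```

In a real wrapper the text would come from running qconf rather than from here-documents; keeping the parsing in subs that take plain text means the calling script need not change if the data source does.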
4.2. SGE_Qstat.pm
- new() grabs:
- basics of all running jobs from qstat -u "*";
- number of unused nodes in each queue.
job_matrix[PE][HRL][QUEUE] (slots) records the number of slots/cores in use for each combination of parallel environment, hard resource list (e.g., -l highmem) and queue name.
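One way to picture job_matrix is as a nested Perl hash keyed by PE, hard resource list and queue. The sketch below is an assumption about the shape, not the module's real code: the job records, the field names (pe, hrl, queue, slots) and the queue names are hypothetical, and the step that extracts them from qstat output is left out.

```perl
use strict;
use warnings;

# Hypothetical job records, as might be extracted from qstat output.
# "[serial]" stands for "no PE" and "--" for "no hard resource list",
# following the labels used in the script's report.
my @jobs = (
    { pe => 'smp.pe',   hrl => '--',      queue => 'parallel.q', slots => 12 },
    { pe => 'smp.pe',   hrl => '--',      queue => 'parallel.q', slots => 4  },
    { pe => '[serial]', hrl => 'highmem', queue => 'highmem.q',  slots => 1  },
    { pe => 'orte.pe',  hrl => '--',      queue => 'parallel.q', slots => 24 },
);

# Accumulate slots in use for each PE / resource-list / queue combination.
# Perl autovivifies the intermediate hashes on first use.
my %job_matrix;
for my $job (@jobs) {
    $job_matrix{ $job->{pe} }{ $job->{hrl} }{ $job->{queue} } += $job->{slots};
}

# Walk the matrix in sorted order and print one line per combination.
for my $pe (sort keys %job_matrix) {
    for my $hrl (sort keys %{ $job_matrix{$pe} }) {
        for my $queue (sort keys %{ $job_matrix{$pe}{$hrl} }) {
            printf "%-10s %-8s %-12s %d\n",
                $pe, $hrl, $queue, $job_matrix{$pe}{$hrl}{$queue};
        }
    }
}
```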
5. The Script Itself
The Model
- There are several types of compute node — one SGE hostgroup for each type.
- On each hostgroup, one or more queues are hosted. If more than one, SGE subordination is used so that only one of the queues is active at once.
- On each queue, zero or more PEs are hosted.
- Some PEs are hosted on more than one queue. In this case the queues must be hosted on different hostgroups, owing to a "feature" of SGE. (We have seen an SMP job in smp.pe start half in one queue and half in another, both queues on the same host. The queues, being mutually subordinate, both suspend and the job does not run!)
- For users, we term each hostgroup an environment, e.g., Intel, 4GB/core, no IB, or AMD, 2GB/core, IB.
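The constraint in the model above (a PE shared between queues must not have two of those queues on the same hostgroup) can be checked mechanically. The following is a minimal sketch under simplifying assumptions: each queue is taken to live on exactly one hostgroup, and the %queue table, its queue names and the sub name pe_conflicts are hypothetical, not taken from the real script.

```perl
use strict;
use warnings;

# Hypothetical model: each queue lives on one hostgroup and hosts some PEs.
my %queue = (
    'serial.q'   => { hostgroup => '@standard', pes => [] },
    'parallel.q' => { hostgroup => '@standard', pes => [ 'smp.pe', 'orte.pe' ] },
    'ib.q'       => { hostgroup => '@intel_ib', pes => [ 'orte-12-ib.pe', 'smp.pe' ] },
);

# Return one message per PE that is hosted by two or more queues
# on the same hostgroup.
sub pe_conflicts {
    my ($queues) = @_;
    my %seen;    # $seen{PE}{hostgroup} = [ queues hosting that PE there ]
    for my $qname (sort keys %$queues) {
        my $hostgroup = $queues->{$qname}{hostgroup};
        for my $pe (@{ $queues->{$qname}{pes} }) {
            push @{ $seen{$pe}{$hostgroup} }, $qname;
        }
    }
    my @conflicts;
    for my $pe (sort keys %seen) {
        for my $hostgroup (sort keys %{ $seen{$pe} }) {
            my @qs = @{ $seen{$pe}{$hostgroup} };
            push @conflicts, "$pe on $hostgroup: @qs" if @qs > 1;
        }
    }
    return @conflicts;
}

my @conflicts = pe_conflicts(\%queue);
print @conflicts ? map { "CONFLICT $_\n" } @conflicts : "no conflicts\n";
```

Here smp.pe is shared by parallel.q and ib.q, which is allowed because the two queues sit on different hostgroups. Adding a second queue on @standard that also hosts smp.pe would be reported.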
The Code
6. Example Output
prompt> csf-qstat --show-default
. . . . . . . . .
--------------------------------------------------------------------------------
Job Type: Standard (4 GB/core)
Details:
    PE                 Resource Flags    Slots Used
    [serial]           interactive       0
        Cores total: 660, used: 0, available: 0 (Num empty nodes: 0)
    [serial]           --                110
        Cores total: 660, used: 110, available: 10 (Num empty nodes: 0)
    orte-12.pe         --                0
    orte.pe            --                212
    smp.pe             --                305
    fluent-smp.pe      --                19
    starccm-12.pe      --                0
    hp-mpi-12.pe       --                0
    hp-mpi-smp.pe      --                0
        Cores total: 660, used: 536, available: 4 (Num empty nodes: 0)
--------------------------------------------------------------------------------
Job Type: Intel/IB (4 GB/core)
Details:
    PE                 Resource Flags    Slots Used
    orte-12-ib.pe      --                252
        Cores total: 252, used: 252, available: 0 (Num empty nodes: 0)
. . . . . . . . .