Resource Reservation: Preventing Large Parallel Job Starvation
1. |
More |
- Chris Dag's post entitled "Resource reservation prevents parallel job starvation"
- Grid Engine 6 Administration Guide: Administering the Scheduler
2. |
The Problem |
- A large parallel job may sit in the queue for a long time, and may never be scheduled, even though its priority is high (and growing). Smaller, lower-priority jobs barge past it whenever enough slots become free to run them, so the slots the large job needs are never all free at the same time.
- Nothing in fairshare, urgency, etc. comes to the aid of the large job.
- What is needed: block the lower-priority jobs completely until enough slots become available to run the larger job.
- You can use resource reservation to guarantee that resources are dedicated to jobs in job-priority order.
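The starvation described above can be reproduced with a toy model. The following is a minimal sketch, not Grid Engine's actual scheduling algorithm: the function `simulate` and all of its parameters are hypothetical, time advances in whole ticks, every small job needs one slot, and there is always another small job waiting.

```python
def simulate(total_slots, big_slots, small_runtime, ticks, reserve):
    """Return the tick at which the big job starts, or None if it never starts.

    Toy model (all names are illustrative assumptions):
    - every small job uses 1 slot and runs for `small_runtime` ticks
    - an endless supply of small jobs is waiting
    - the big job has the highest priority and needs `big_slots` slots at once
    """
    # Remaining run time of each running small job, staggered so that
    # slots are released a few at a time rather than all together.
    running = [1 + (i % small_runtime) for i in range(total_slots)]

    for tick in range(ticks):
        # Advance time by one tick; finished jobs release their slots.
        running = [t - 1 for t in running if t - 1 > 0]
        free = total_slots - len(running)

        # The highest-priority job is considered first.
        if free >= big_slots:
            return tick

        if not reserve:
            # No reservation: small jobs grab every free slot immediately.
            running += [small_runtime] * free
        # With a reservation, the free slots are held for the big job,
        # so nothing new starts and the cluster drains.

    return None
```

With 4 slots, a 4-slot job and 2-tick small jobs, `simulate(4, 4, 2, 20, reserve=False)` returns `None` (the big job never starts), while `simulate(4, 4, 2, 20, reserve=True)` returns `1`.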
3. |
A Solution: Resource Reservation |
qsub -R y <scriptfile>
qalter -R y <jobid>
4. |
Back Fill |
Backfilling enables a lower-priority job to use resources that are blocked by a resource reservation, but only if the job is runnable and its prospective run time is short enough that it will release the blocked resources before the reservation needs them.
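The backfill condition can be written down as a simple check. This is only a sketch under stated assumptions: the `Job` class, its fields, and `can_backfill` are hypothetical names, times are plain seconds, and the model ignores the case where a job uses only slots the reservation does not need. In Grid Engine the run time estimate would typically come from a run-time limit requested by the job, such as `-l h_rt`.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    slots: int      # number of slots the job needs
    runtime: float  # estimated run time in seconds (assumed to be known)


def can_backfill(job, free_slots, now, reservation_start):
    """Return True if `job` may run on slots held for a reservation.

    The job must fit into the currently free slots and must be expected
    to finish no later than the moment the reservation needs them.
    """
    fits = job.slots <= free_slots
    finishes_in_time = now + job.runtime <= reservation_start
    return fits and finishes_in_time
```

For example, if 24 slots are free and the reservation is expected to start in one hour, an 8-slot job with a 10-minute estimate passes the check, while the same job with a 2-hour estimate does not.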
5. |
Detecting/Monitoring/Spotting R-R Jobs which are in the Queue, Waiting |
- Why isn't my job running? There are empty nodes!
- The Problem
- If a larger job has an r-r, it will block a smaller job from jumping past it in the queue and grabbing compute nodes as they become free. For example, two compute nodes may be free on which a waiting 24-slot job could run, but that job will be blocked by a 48-slot r-r job. How is the owner of the 24-slot job to know?
- Use qstat -j <job-num>
- For example,
prompt>qstat -j 80063
  .
  .
  .
  .
reserve: y
  .
  .
N.B. If the job has no r-r, then the output is NOT reserve: n; there is no reserve line in the output at all.
- qstat -g c?
- No! Shows only advance reservations, not resource reservations.
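The `qstat -j` check can be scripted. The sketch below assumes the output format shown in the example above, i.e. a `reserve: y` line when the job has an r-r and no `reserve` line otherwise. The helper names `has_resource_reservation` and `job_has_reservation` are hypothetical; only `qstat -j <job-num>` itself comes from the text.

```python
import subprocess


def has_resource_reservation(qstat_output):
    """Return True if `qstat -j` output contains a `reserve: y` line.

    A job without an r-r has no `reserve` line at all, so a missing
    line is treated as "no reservation".
    """
    for line in qstat_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "reserve":
            return value.strip() == "y"
    return False


def job_has_reservation(job_id):
    """Run `qstat -j <job_id>` and inspect its output (requires Grid Engine)."""
    result = subprocess.run(
        ["qstat", "-j", str(job_id)],
        capture_output=True,
        text=True,
        check=True,
    )
    return has_resource_reservation(result.stdout)
```

On a submit host, `job_has_reservation(80063)` would then report whether job 80063 was submitted or altered with `-R y`.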