Introductions
An Introduction to Condor
Dr Simon Hood and Dr Jonathan Boyle
Research Computing Services
rcs@manchester.ac.uk
- The slides — always available at:
What is RCS?
Research Computing Services Support?
Research Computing Services, RCS
- Specialist part of IT Services.
- Contact Details
- What is Research Computing?
- Computing to support research! Examples:
- running complex simulations;
- performing vast parameter searches.
Research Computing Examples
- High Throughput Computing (HTC)
- Large amounts of computing power over a "long" time:
- Running long jobs!
- Running the same experiment many (1000s) times, with different inputs.
- High Performance Computing (HPC)
- Large amounts of computing power over a "short" time:
- Many CPUs used simultaneously to run complex models more quickly.
- RAM from many compute nodes used simultaneously to handle very big jobs.
- Data Analysis and Visualization
- Getting the information out of the vast quantities of data.
How Can RCS Help You? Free Stuff
Free Stuff!
- Provision of resources:
- Horace, Man1, Man2, Mace01, Redqueen. . .
- Condor pools; NW-Grid, NGS.
- Administration of HPC/HTC Clusters:
- Administer and support University, NW-Grid, NGS and some school and research group HPC clusters.
- Support and Training:
- Documentation — Web and Wiki.
- Courses!
- Usage of HPC/HTC (inc. Condor) clusters;
- application support.
How Can RCS Help You? In-Depth Support
In-depth support and collaborations
- Free dedicated short-term help
- Advice on parallelisation of code, or
- advanced use of HTC (inc. Condor).
- More in-depth help and collaborations
- Optimising code/models: scoping, estimate, coding; dedicated resources may require funding.
- Example: one year's dedicated effort extracting maximum performance.
Named resource/researchers on RCUK/EU etc. grants.
Other Related Courses
- Introduction to Condor:
- CPU-cycle scavenging and HPC cluster backfill;
- Web pages;
- Introduction to LaTeX:
- Other Courses:
- Introduction to OpenMP
- Introduction to MPI
- Fortran 95
- Matlab
- Image-Based Modelling
details and on-line booking. . .
This Course
Today's Course
- Three speakers
- Simon Hood (RCS) 10:00 – 12:00 approx.;
- Jonathan Boyle (RCS) and Ian Cottam (EPS)
This Course: Part One
Simon (AM)
- what Condor is;
- how to use it — simple cases;
- what Condor is good at and what it's not;
- what EPS- and RCS-backed Condor facilities are available to you.
This Course: Parts Two and Three
Jonathan (PM)
- Using Matlab with Condor
- Job control using DAGMan
- Job control and monitoring with BASH scripts
Ian Cottam
What is Condor?
From Wikipedia
- Condor is a high-throughput computing software framework for coarse-grained distributed parallelization of computationally-intensive tasks.
What is Condor, in English?
- Cycle-scavenging
It can farm out computational work to idle desktop computers.
- Runs on everything
Linux, Unix, Mac OS X, FreeBSD, and (even) MS Windows.
- It can work as a traditional batch system
It can manage workload (jobs) on a dedicated cluster of computers (Beowulf) in place of SGE/LSF/PBS...
- Glue
Can seamlessly integrate dedicated and other resources, e.g., Beowulfs, and (idle) teaching clusters and/or office desktop machines.
- All types of jobs
Can schedule serial and parallel jobs.
- Backfill
On traditional HPC clusters...
Condor Philosophy
[From www.cs.wisc.edu]
What is it good for?
Condor is Complementary to Traditional Batch Systems
- Good for backfill and using "spare" CPU cycles.
- Therefore, good for running jobs that can fill gaps flexibly.
- So, jobs which individually do not require great resources, e.g., RAM or disk space:
- can run "anywhere";
- can be checkpointed and migrated easily — requires re-linking.
- Large numbers of small jobs, e.g., parameter sweeps, are ideal.
Sometimes better to use Condor, sometimes SGE. . .
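A parameter sweep of the kind described above can be sketched as follows. This is a minimal sketch, not a definitive recipe: it assumes a vanilla-universe job, and the executable name `my_model` and the `sweep.*` file names are hypothetical. The Python helper only builds the text of a Condor submit description; the `$(Process)` macro takes the values 0 to N-1 for the N jobs created by `queue N`, so each job receives a different argument and writes to its own output file.

```python
def make_sweep_description(executable: str, n_jobs: int) -> str:
    """Build a submit description that queues n_jobs copies of one program.

    Each copy receives its own index, $(Process), as its only argument.
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    lines = [
        "universe   = vanilla",
        f"executable = {executable}",
        "arguments  = $(Process)",            # 0, 1, ..., n_jobs - 1
        "output     = sweep.$(Process).out",  # one stdout file per job
        "error      = sweep.$(Process).err",  # one stderr file per job
        "log        = sweep.log",             # shared Condor event log
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "my_model" is a hypothetical executable name used for illustration.
    print(make_sweep_description("my_model", 1000), end="")
```

The program itself is assumed to map the integer it receives onto the actual parameter values it should use.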
Traditional Condor Pools
CPU-scavenging
- Use otherwise wasted compute cycles from non-dedicated resources:
- individuals' office desktops;
- teaching/public clusters.
- Converts unused desktops into a distributed high-throughput computing (HTC) facility.
- Minimal effect on desktop users:
- Condor jobs start only after zero keyboard/mouse input for, say, 15 minutes;
- within seconds of keyboard/mouse input, Condor jobs are suspended.
- All machines in the pool can submit jobs; all will likely run jobs; symmetrical, peer-to-peer topology.
Features of Condor
- Condor machines are members of a pool.
- Members can be compute nodes, submit nodes, or both (traditionally both).
- Each pool has exactly one "head node": the collector/negotiator.
- Condor manages both resources (machines) and resource requests (jobs).
- Transparent checkpoint/restart and process migration (for some jobs).
- Manages large numbers of (small) jobs well.
Using Condor: Overview
How do I get computation done with Condor?
- Ensure your job is batch-ready (requires no user input, no GUI), just as for SGE/LSF/PBS...
- Choose a universe — much more later.
- Create a small text file which defines the Condor job (cf. qsub script).
- Submit the job!
- Monitor progress: output, error and log files.
- Sit back with a nice mug of tea and enjoy the free CPU cycles.
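The "small text file" step above can be sketched as follows. This is a minimal sketch under stated assumptions: a vanilla-universe job, a hypothetical executable `my_program` with a hypothetical argument `input.dat`, and hypothetical output, error and log file names. The Python helper only produces the text of the submit description; it does not talk to Condor.

```python
def make_submit_description(executable: str, arguments: str = "") -> str:
    """Return the text of a minimal Condor submit description."""
    lines = [
        "universe   = vanilla",
        f"executable = {executable}",
    ]
    if arguments:
        lines.append(f"arguments  = {arguments}")
    lines += [
        "output     = job.out",  # the job's standard output
        "error      = job.err",  # the job's standard error
        "log        = job.log",  # Condor's record of job events
        "queue",                 # queue one instance of the job
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(make_submit_description("my_program", "input.dat"), end="")
```

Saved to a file, say `job.sub` (again a hypothetical name), the description would be submitted with `condor_submit job.sub`; progress can then be followed with `condor_q` and through the output, error and log files.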
Command Summary
- condor_status
Display status of pool: number and type of machines; status of machines (owner/busy/idle); more...
- condor_submit
Queue jobs for execution under Condor.
- condor_q [-global]
Displays information about jobs in the Condor job queue; defaults to the local queue.
- condor_rm
Remove jobs from the Condor queue.
Command Help
condor_<command> -h|-help
# ...lists all command-line args...
Running a Job: Overview
In this module we look at the complete job cycle:
- Make it batch-ready
- Choose a Universe
- Create a submit file
- Submit the job
- Monitor your job's status
Universes and Job Examples
In this module we:
- detail the most commonly used universes in Condor;
- give example Condor submission scripts for each.
Data and File Transfer Summary
In this module. . .
- Summarise use of remote IO and shared filesystems in Condor.
- Outline how to explicitly transfer required input and output files.
Class Ads
In this module. . .
- What class ads are
- workstation resource ads
- job ads
- Class ad matching
- Debugging via -better-analyze
RCS Pools, Backfilling Dedicated HPC Systems
In this section we outline:
- how Condor can "backfill" traditional HPC clusters;
- Condor facilities offered by RCS.
Condor and Grid Computing
In this module:
- we define what we mean by grid;
- outline (only) how Condor can help with grid computing.
Installing Condor
In this module we outline:
- where to get the software from;
- how to set up a Linux machine to join a Condor pool;
- and how to set up a Condor pool from scratch.
Networking, Topology and Firewalls
Condor and Matlab
http://www.liv.ac.uk/e-science/condor/matlab/
Or simply use nodes with a shared filesystem?