SGE Notes
-- job is first task, control slaves...
-- loose integration
-- tight integration
-- openmpi
-- hp-mpi
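The two settings named above (job is first task, control slaves) are fields of a parallel environment (PE) definition. Below is a minimal sketch of how a tight and a loose PE might differ. The PE names, slot count and allocation rule are made up for illustration. The tight values follow what is commonly suggested for Open MPI, and the loose values are the usual defaults. The start/stop procedures are left as /bin/true placeholders; a real loose setup normally points them at scripts that build a machine file.

```python
# Sketch only: renders SGE parallel environment (PE) definitions as text.
# PE names, slot counts and the allocation rule are hypothetical examples.

def pe_definition(name, slots, tight):
    """Return a PE definition as 'key value' lines, one field per line."""
    fields = [
        ("pe_name", name),
        ("slots", str(slots)),
        ("user_lists", "NONE"),
        ("xuser_lists", "NONE"),
        ("start_proc_args", "/bin/true"),   # placeholder, see note above
        ("stop_proc_args", "/bin/true"),    # placeholder, see note above
        ("allocation_rule", "$fill_up"),
        # Tight: slave tasks are started under SGE control (qrsh -inherit),
        # so they are accounted for and cleaned up with the job.
        ("control_slaves", "TRUE" if tight else "FALSE"),
        # Tight (Open MPI style): the job script only launches mpirun
        # and is not itself counted as one of the parallel tasks.
        ("job_is_first_task", "FALSE" if tight else "TRUE"),
        ("urgency_slots", "min"),
    ]
    return "\n".join(f"{key:<19}{value}" for key, value in fields) + "\n"


tight_pe = pe_definition("openmpi", 64, tight=True)
loose_pe = pe_definition("mpi_loose", 64, tight=False)
print(tight_pe)
```

Text like this can be saved to a file and loaded with qconf -Ap <file>. Jobs then request the PE with qsub -pe <pe_name> <slots>, provided the PE has also been added to a queue's pe_list.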
Fair shares:
1. http://ait.web.psi.ch/services/linux/hpc/merlin3/sge/admin/
2. http://wikis.sun.com/display/gridengine62u3/How+to+Create+Project-Based+Share-Tree+Scheduling+With+QMON
   http://wikis.sun.com/display/gridengine62u3/Configuring+the+Share-Based+Policy#ConfiguringtheShare-BasedPolicy-ConfiguringtheShareTreePolicyWithQMON
-----------------
http://gridengine.sunsource.net/news/SGE62u5-announce.html
-- includes topology-aware stuff
-----------------
http://wiki.gridengine.info/wiki/index.php/Main_Page
-----------------
Grid Engine Portal -- http://gridengine.sunsource.net/gep/GEP_Intro.html

Users authenticate to a portal interface from anywhere on the internet via a browser and can then:
* Securely access and execute applications via a transparent interface to Grid Engine
* Monitor the status of jobs running in Grid Engine
* Securely upload input files to the Portal Server with the click of a button
* Securely download output files to a local workstation with the click of a button
* View X-windows based applications using VNC

Administrators can also remotely access the portal and perform administrative functions such as:
* Registering applications for use with the GEP in a matter of minutes
* Quickly building HTML interfaces to applications using templates that prompt users for input
* Monitoring Grid Engine usage and statistics
----------------
-- The battle: Globus vs LSF --- is there not a third way via SGE's SDM?
-- how is the multiclustering gonna work?
   -- requires common filesystems
   -- requires standard s/w stack
   -- not gonna work...
-- SGE is open source, licensed under the SISSL (Sun Industry Standards Source License), not the GPL
-- howtos http://gridengine.sunsource.net/howto/howto.html
-- drmaa api
-- ARCo accounting and reporting (MySQL or Oracle)
---------------
SDM http://blogs.sun.com/templedf/entry/service_domain_manager
supports:
-- cloud bursting
-- powers down idle and underutilized machines
-- not a metascheduler --- moves compute nodes from one cluster to another
http://wikis.sun.com/display/GridEngine/Using+SDM+With+the+Sun+Grid+Engine+Adapter
----------------
SGE on Campus
-- redqueen
-- mace01
-- man2
-- usto oran (MACE)
-- pacemaker (MHS)
-- templar (FLS)
-- agent (FLS)
-- epsilon (EPS)
-- Brian Blower's cluster (MHS)
-- terra (Duncan Irving, Earth Sciences)
-----------------
Topology Aware Scheduling
http://blogs.sun.com/templedf/entry/topology_aware_scheduling
------------------
Checkpointing
http://gridengine.sunsource.net/howto/checkpointing.html
-- integrates with Condor libraries/compiler
https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html
-- BLCR
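The checkpointing notes above point at transparent approaches (Condor libraries, BLCR). For contrast, here is a minimal sketch of the simplest alternative, application-level checkpointing, where the program saves and reloads its own state. It is not BLCR or Condor and calls nothing in SGE. The function name, the JSON state layout and the stop_after parameter are all invented for illustration.

```python
import json
import os
import tempfile


def run_with_checkpoints(total_steps, ckpt_path, stop_after=None):
    """Sum 0..total_steps-1, saving progress after every step.

    stop_after (hypothetical) simulates the job being killed or migrated
    after that many steps in this run.
    """
    state = {"next_step": 0, "total": 0}
    if os.path.exists(ckpt_path):
        # A restarted job picks up where the previous run left off.
        with open(ckpt_path) as fh:
            state = json.load(fh)

    steps_done = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and steps_done >= stop_after:
            return None  # interrupted before finishing
        state["total"] += state["next_step"]
        state["next_step"] += 1
        steps_done += 1
        # Write to a temp file and rename it, so that a kill in the
        # middle of a write cannot leave a corrupt checkpoint behind.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump(state, fh)
        os.replace(tmp_path, ckpt_path)
    return state["total"]


ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
first = run_with_checkpoints(10, ckpt, stop_after=4)  # "killed" after 4 steps
second = run_with_checkpoints(10, ckpt)               # resumes at step 4
print(first, second)
```

A job written this way would typically be submitted against a checkpointing environment with qsub -ckpt <name>, so that the scheduler is allowed to restart or migrate it. The environment itself has to be defined by the administrator.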