RHEL v6.2 and QLogic IB
The Problem
Upgrade some servers from SL/RHEL 5.5 to 6.2 and the QLogic IB stops working. Attempting to run an OpenMPI 1.6 job under SGE yields only
    R3-Atmos-09.2902Driver initialization failure on /dev/ipath (err=23)
    --------------------------------------------------------------------------
    PSM was unable to open an endpoint. Please make sure that the network link is active on the node and the hardware is functioning.

and the equally helpful

    R3-Atmos-09.2891ipath_userinit: mmap of rcvhdrq failed: Resource temporarily unavailable
    R3-Atmos-09.2895ipath_userinit: mmap of rcvhdrq failed: Resource temporarily unavailable

QLogic's CLI tools, such as fabric_info, show all is well hardware-wise.
Attempting some diagnostics by removing the PSM layer, for example

    mpirun --mca btl openib,self,sm --mca pml ^cm --mca mtl ^psm -np $NSLOTS ./mynameis

at least got the job to run, but
    --------------------------------------------------------------------------
    The OpenFabrics (openib) BTL failed to register memory in the driver.
    Please check /var/log/messages or dmesg for driver specific failure reason.
    The failure occured here:

      Local host:
      Device:      qib0
      Function:    openib_reg_mr()
      Errno says:  Cannot allocate memory
Background
All was very well with SL/RHEL 5.5. For example,
    mpirun --mca btl openib,self,sm --mca pml cm --mca mtl psm -np $NSLOTS ./mynameis

i.e., using the QLogic PSM layer, gave excellent results.
Solution
So how come the IB stack cannot allocate the memory it requires? SL/RHEL 6.2 sets lockable memory to just 64k by default (ulimit -l), but the QLogic installation script raises it to unlimited by changing

    /etc/security/limits.conf

So what gives?
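One way to confirm what is actually configured is to list the memlock entries that the limits file contains and compare them with what a login shell reports. A minimal sketch, assuming the standard "<domain> <type> <item> <value>" layout of limits.conf; the helper name memlock_entries is made up for illustration, and the exact lines the QLogic script writes are not reproduced here:

```shell
#!/bin/sh
# List active (non-comment) memlock entries in limits.conf-style files.
# Fields are: <domain> <type> <item> <value>, so the item is column 3.
# memlock_entries is a hypothetical helper name, not part of any package.
memlock_entries() {
    awk '$1 !~ /^#/ && $3 == "memlock"' "$@"
}
```

Running memlock_entries /etc/security/limits.conf followed by ulimit -l in a fresh login shell shows whether the file and the shell agree. If both say unlimited, the file itself is not the problem.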
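Turns out init-spawned processes are not subject to limits.conf: the file is applied by pam_limits when a PAM session is opened (at login, for instance), and a daemon started by init at boot never goes through PAM. Thus

    cat /proc/`ps auxw | grep sge_execd | grep -v grep | awk '{print $2}'`/limits | grep locked
    Max locked memory         65536                unlimited            bytes

So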
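    root> /etc/init.d/sge stop
    root> /etc/init.d/sge start

then

    cat /proc/`ps auxw | grep sge_execd | grep -v grep | awk '{print $2}'`/limits | grep locked
    Max locked memory         unlimited            unlimited            bytes

and all is well: restarted from a root login shell, sge_execd inherits that shell's limits, to which limits.conf was applied.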
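The ps | grep | awk pipeline above can be wrapped into a small helper, and the reason the restart works can be demonstrated directly, since resource limits are inherited from the parent process. A minimal sketch, assuming a Linux /proc filesystem and pgrep from procps; all three function names are hypothetical:

```shell
#!/bin/sh
# Print the soft and hard "Max locked memory" values from a
# /proc/<pid>/limits style file passed as $1.
memlock_limits() {
    awk '/^Max locked memory/ { print $4, $5 }' "$1"
}

# Report those values for every running process whose name is exactly $1.
# pgrep -x matches the exact process name, so no "grep -v grep" is needed.
report_memlock() {
    for pid in $(pgrep -x "$1"); do
        printf '%s (pid %s): %s\n' "$1" "$pid" \
            "$(memlock_limits "/proc/$pid/limits")"
    done
}

# Demonstrate inheritance: lower the soft limit inside a subshell, then ask
# a freshly started child shell what it sees. The subshell keeps the change
# from leaking into the calling shell.
inherited_memlock() {
    ( ulimit -S -l 0 && sh -c 'ulimit -S -l' )
}
```

Calling report_memlock sge_execd before and after the restart gives the same information as the two cat commands, and inherited_memlock shows the child reporting the lowered limit rather than anything from limits.conf. Note that a reboot starts sge_execd from init again, so the fix may not survive one. A possible workaround, offered here as an assumption rather than something tested in this write-up, is to raise the limit with ulimit -l unlimited inside the init script before the daemon is launched.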