UoM::RCS::Talby

RHEL v6.2 and QLogic IB

The Problem

After upgrading some servers from SL/RHEL 5.5 to 6.2, the QLogic IB stopped working. Attempting to run an OpenMPI 1.6 job under SGE yields only

R3-Atmos-09.2902Driver initialization failure on /dev/ipath (err=23)
--------------------------------------------------------------------------
PSM was unable to open an endpoint. Please make sure that the network link is
active on the node and the hardware is functioning.
and the equally helpful
R3-Atmos-09.2891ipath_userinit: mmap of rcvhdrq failed: Resource temporarily unavailable
R3-Atmos-09.2895ipath_userinit: mmap of rcvhdrq failed: Resource temporarily unavailable
QLogic's CLI tools, such as fabric_info, show that all is well hardware-wise.

Attempting some diagnostics by removing the PSM layer, for example with

  mpirun  --mca btl openib,self,sm  --mca pml ^cm  --mca mtl ^psm  -np $NSLOTS ./mynameis
at least got the job to run, but it reported
--------------------------------------------------------------------------
The OpenFabrics (openib) BTL failed to register memory in the driver.
Please check /var/log/messages or dmesg for driver specific failure
reason.
The failure occured here:

  Local host:    qib0
  Device:        openib_reg_mr
  Function:      Cannot allocate memory()
  Errno says:    

Background

All was very well with SL/RHEL 5.5. For example,

  mpirun  --mca btl openib,self,sm  --mca pml cm  --mca mtl psm  -np $NSLOTS ./mynameis
i.e., using the QLogic PSM layer, gave excellent results.

Solution

So why can the IB stack not allocate the memory it requires? SL/RHEL 6.2 sets lockable memory to just 64k by default (ulimit -l), but the QLogic installation script raises lockable memory to unlimited by changing

    /etc/security/limits.conf
So what gives?
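
Both halves of that puzzle can be inspected directly. The following is a minimal sketch, not the QLogic script's own logic: memlock_entries is a hypothetical helper name, and the example entry in the comment is an assumption about the usual limits.conf form rather than the exact lines the installer writes.

```shell
#!/bin/bash
# Print the active (non-comment) memlock entries of a limits.conf-style file.
# Each entry has the form: <domain> <type> <item> <value>
memlock_entries() {
    grep -v '^[[:space:]]*#' "$1" | grep -w memlock
}

# Entries such as "* soft memlock unlimited" are an assumed example of what an
# installer might add; the exact lines written by the QLogic script are not
# reproduced here.
memlock_entries /etc/security/limits.conf

# Limit of the current shell, in kbytes (or "unlimited").
ulimit -l
```

If the file lists unlimited and a fresh login shell agrees, the configuration itself is fine and the problem lies with whichever process did not pick it up.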

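It turns out that processes spawned by init at boot are not subject to limits.conf, which pam_limits applies only to sessions that pass through PAM. The running sge_execd therefore still has the 64k default, thus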

    grep locked /proc/$(pgrep -o sge_execd)/limits

    Max locked memory         65536            unlimited            bytes     
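
The same check can be wrapped up for reuse on other daemons. This is a rough sketch under the assumption that pgrep is available; soft_memlock and report_memlock are hypothetical helper names, not part of SGE or the QLogic tools.

```shell
#!/bin/bash
# Print the soft "Max locked memory" value from a /proc/<pid>/limits-style file.
# Field layout: Max locked memory <soft> <hard> <units>
soft_memlock() {
    awk '/^Max locked memory/ { print $4 }' "$1"
}

# Report the soft limit of every running process with exactly the given name.
report_memlock() {
    local pid
    for pid in $(pgrep -x "$1"); do
        printf '%s pid %s: %s\n' "$1" "$pid" "$(soft_memlock "/proc/$pid/limits")"
    done
}

report_memlock sge_execd

# Note the units: /proc reports bytes, while ulimit -l reports kbytes,
# so 65536 there corresponds to 64 here.
echo "this shell: $(ulimit -S -l)"
```

A mismatch between the daemon's value and the shell's value is the symptom described above.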
So restart SGE from a root login shell, which does carry the unlimited setting:
    root> /etc/init.d/sge stop
    root> /etc/init.d/sge start
then
    grep locked /proc/$(pgrep -o sge_execd)/limits

    Max locked memory         unlimited        unlimited            bytes     
and all is well.
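
Since a job inherits its limits from the daemon that launched it, the same condition can be checked from inside a job script before calling mpirun. A minimal sketch, assuming it is placed at the top of the SGE job script; memlock_unlimited is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the given `ulimit -l` value is "unlimited".
memlock_unlimited() {
    [ "$1" = "unlimited" ]
}

# Inside a job, ulimit -l shows what the launching daemon handed down.
if memlock_unlimited "$(ulimit -l)"; then
    echo "locked memory: unlimited"
else
    echo "locked memory limited to $(ulimit -l) kbytes; PSM/openib may fail" >&2
fi
```

This only reports the condition; it does not change the limit, which still has to be fixed on the sge_execd side as described above.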