NFSv3 and NFSv4 Weirdness
Symptoms: RHEL 6.2 nodes going weird; rpciod using lots of CPU for no reason
- Compute nodes (HP ProLiant) running RHEL 5.x with NFS mounts were fine for two years. Some have QLogic InfiniBand (IB).
- Upgrade to RHEL 6.2: some well-used nodes start behaving weirdly, with jobs failing for no apparent reason. The only obvious sign is that nodes with no jobs show significant load, which turns out to be rpciod consuming lots of CPU!
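A quick way to spot an affected node is to add up the CPU used by the rpciod kernel threads. The sketch below is only an illustration: the function name rpciod_cpu is made up here, and it assumes ps prints the CPU percentage in column 1 and the command name in column 2, as `ps -eo pcpu,comm` does.

```shell
# rpciod_cpu: read "pcpu comm" lines on stdin and print the total %CPU
# used by threads whose command name starts with "rpciod".
# (Hypothetical helper; the name and the output format are assumptions.)
rpciod_cpu() {
    awk '$2 ~ /^rpciod/ { sum += $1 } END { printf "%.1f\n", sum + 0 }'
}

# Typical use on a node:
#   ps -eo pcpu,comm | rpciod_cpu
```

A value well above zero on a node with no jobs matches the symptom described above.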
Fix for non-IB Nodes
- Upgrade to the latest RHEL kernel, 2.6.32-358.2.1.el6.x86_64, and all is well.
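To check whether a node already runs a kernel at or beyond a given release (for example 2.6.32-279.22.1, the minimum version Red Hat quotes for the rpciod fix in the IB notes on this page), version strings can be compared with GNU `sort -V`. This is a minimal sketch: the function name kernel_at_least is made up here, and it assumes a coreutils new enough to support `sort -V`.

```shell
# kernel_at_least CURRENT MINIMUM
# Succeeds (exit 0) when CURRENT sorts at or after MINIMUM in version order.
# (Hypothetical helper; relies on GNU sort's -V version ordering.)
kernel_at_least() {
    current=$1
    minimum=$2
    # The lower of the two versions comes out first after a version sort.
    lowest=$(printf '%s\n' "$minimum" "$current" | sort -V | head -n 1)
    [ "$lowest" = "$minimum" ]
}

# Typical use on a node:
#   kernel_at_least "$(uname -r)" 2.6.32-279.22.1 && echo "kernel has the fix"
```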
QLogic IB Nodes
- Upgrade to the latest RHEL kernel, 2.6.32-358.2.1.el6.x86_64, and find that the QLogic IB stack won't compile against that kernel, nor against 2.6.32-279.22.1, the minimum version with the rpciod fix (according to Red Hat).
- Brainwave: the Google results point to NFSv4 as the culprit, so remount all filesystems as NFSv3, keeping the original kernel. All well.
Mounting as NFSv3
Set the following (these options normally live in /etc/nfsmount.conf):
    Defaultvers=3
    Nfsvers=3
and, of course, remount all (or reboot).
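After remounting, it is worth confirming which NFS version each mount actually ended up with. The sketch below reads lines in /proc/mounts format and prints the mount point and version for every NFS mount. The function name nfs_mount_versions is made up here, and it assumes the version shows up as a vers= (or nfsvers=) entry in the mount options; where it does not, "unknown" is printed.

```shell
# nfs_mount_versions: read /proc/mounts-style lines on stdin and print
# "<mountpoint> <version>" for each nfs or nfs4 filesystem.
# (Hypothetical helper; assumes the options field carries vers= or nfsvers=.)
nfs_mount_versions() {
    awk '$3 == "nfs" || $3 == "nfs4" {
        ver = "unknown"
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++) {
            if (opts[i] ~ /^(nfs)?vers=/) {
                ver = opts[i]
                sub(/^(nfs)?vers=/, "", ver)
            }
        }
        print $2, ver
    }'
}

# Typical use on a node:
#   nfs_mount_versions < /proc/mounts
```

Any mount still reporting 4 has not picked up the new setting.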
Trouble at Mill
One fileserver refused to serve NFSv3, only NFSv4. Useful diagnostic:
- Pointing rpcinfo -p at the server revealed almost nothing registered; notably the nfs entry, which NFSv3 clients rely on, was missing!
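The same check can be scripted across several fileservers. The sketch below scans `rpcinfo -p` output for the nfs program (RPC program number 100003) at version 3. The function name has_nfs_v3 is made up here, and it assumes the usual column layout of program, version, protocol, port, service.

```shell
# has_nfs_v3: read `rpcinfo -p` output on stdin; succeed (exit 0) only if
# the nfs program (100003) is registered at version 3.
# (Hypothetical helper; assumes program is column 1 and version is column 2.)
has_nfs_v3() {
    awk '$1 == 100003 && $2 == 3 { found = 1 } END { exit !found }'
}

# Typical use, with "fileserver" standing in for the real host name:
#   rpcinfo -p fileserver | has_nfs_v3 || echo "no NFSv3 registered"
```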