Platform management plug-in for JBoss ON does not discover file system resources due to discovery TimeoutException when NFS ping takes too long
Environment
- Red Hat JBoss Operations Network (ON) 3.3
- Firewall on NFS server host is blocking RPC TCP port 111 with no response
Issue
- No file system resources are shown
- Discovery component for file system resource type is blacklisted
- Agent log reports the following warnings:
    WARN [InventoryManager.discovery-1] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- The discovery component for resource type [ResourceType[id=0, name=File System, plugin=Platforms, category=Service]] has been blacklisted
    WARN [InventoryManager.discovery-1] (rhq.core.pc.inventory.InventoryManager)- Discovery for Resources of [ResourceType[id=0, name=File System, plugin=Platforms, category=Service]] has been running for more than 300000 milliseconds. This may be a plugin bug.
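One way to check an agent log for these warnings is to search for the blacklist message and extract the resource type it names. The following is a minimal sketch that relies only on the message format shown above. The helper name `blacklisted_types` is hypothetical and is not part of JBoss ON.

```shell
#!/bin/sh
# Print each resource type whose discovery component was blacklisted,
# one per line, as "name (plugin)". Relies on the WARN message format
# shown above. blacklisted_types is a hypothetical helper name.
#   $1: path to an agent log file
blacklisted_types() {
    grep 'has been blacklisted' "$1" |
        sed -n 's/.*ResourceType\[id=[0-9]*, name=\([^,]*\), plugin=\([^,]*\),.*/\1 (\2)/p' |
        sort -u
}
```

Called with the path to the agent log (the location depends on the installation), this would report `File System (Platforms)` for the warnings above.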
Resolution
Reconfigure the firewall so that the RPC service can be used on both UDP and TCP port 111.
If the service will continue to be blocked, ensure that the firewall is using a REJECT to send an ICMP response to the JBoss ON agent instead of silently dropping or denying TCP requests sent to port 111.
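If the firewall on the NFS server host is managed with iptables (an assumption; this article does not name the firewall software), the two options above might look like the following sketch. The function names and the agent address passed to them are hypothetical. The functions only print candidate commands for review and do not change the firewall.

```shell
#!/bin/sh
# Sketch only: prints iptables commands for review; it does not run them.
# Assumes an iptables-managed firewall on the NFS server host.
# Function names are hypothetical.

# Option 1: allow the agent host to reach the RPC service on TCP and UDP port 111.
#   $1: address of the JBoss ON agent host
print_allow_rules() {
    agent_ip=$1
    for proto in tcp udp; do
        echo "iptables -I INPUT -s $agent_ip -p $proto --dport 111 -j ACCEPT"
    done
}

# Option 2: keep port 111 blocked, but answer at once instead of dropping.
# Without --reject-with, the REJECT target replies with ICMP port-unreachable.
# Any existing DROP rule for port 111 must come after these rules or be removed.
print_reject_rules() {
    agent_ip=$1
    for proto in tcp udp; do
        echo "iptables -I INPUT -s $agent_ip -p $proto --dport 111 -j REJECT"
    done
}
```

For example, `print_reject_rules 192.0.2.10` prints one REJECT rule for TCP and one for UDP, using a documentation-range address as a stand-in for the agent host.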
Root Cause
This issue is caused by the RPC ping request to the remote NFS host taking too long to execute. If one or more NFS file systems are being discovered and the total execution time exceeds 5 minutes, the discovery scan is aborted and the file system resource type is blacklisted. Once the type is blacklisted, no further attempts are made to scan for file systems on this platform.
Under normal conditions, the RPC ping should quickly report whether the NFS server is running, not running, or unreachable. However, if the network configuration prevents the RPC ping from completing, the platform plug-in waits for a socket timeout to occur. On most networks, this is 1 minute. Additional retries delay the thread further, so the discovery scan for the file system resources takes too long.
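The arithmetic behind the timeout can be sketched as follows. The helper counts the NFS entries in an mtab-format file, multiplies by an assumed wait per mount, and compares the result with the 300-second limit from the log message. It assumes the worst case in which every NFS server is unreachable, one 60-second socket timeout per mount, and no retries, so the real delay may be longer. The name `estimate_nfs_wait` is hypothetical.

```shell
#!/bin/sh
# Estimate how long file system discovery could block if every NFS server
# were unreachable. estimate_nfs_wait is a hypothetical helper name.
#   $1: mtab-format file (fields: device, mount point, type, ...)
#   $2: assumed wait per NFS mount in seconds (default 60, no retries)
estimate_nfs_wait() {
    mtab=$1
    per_mount=${2:-60}
    # Count entries whose file system type is nfs, nfs4, and so on.
    count=$(awk '$3 ~ /^nfs[0-9]*$/ { n++ } END { print n + 0 }' "$mtab")
    total=$((count * per_mount))
    echo "$count NFS mount(s), estimated wait ${total}s (limit 300s)"
    # Succeed only if the estimate stays under the 5 minute discovery limit.
    [ "$total" -lt 300 ]
}
```

Under these assumptions, `estimate_nfs_wait /etc/mtab` fails once five or more NFS mounts are present, because five 60-second timeouts already reach the 300-second limit.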
This issue has been captured in Red Hat Bugzilla 1205429 and will be addressed in a future release of the platform plug-in for JBoss ON.
Diagnostic Steps
- Review the agent log for an indication that the file system resource type has been blacklisted. The following messages are relevant. Note that the warning messages occur 5 minutes after the runtime discovery scan is logged:

    2015-03-24 20:57:28,103 INFO [InventoryManager.discovery-1] (rhq.core.pc.inventory.RuntimeDiscoveryExecutor)- Executing runtime discovery scan rooted at [platform]...
    2015-03-24 21:02:28,108 WARN [InventoryManager.discovery-1] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- The discovery component for resource type [ResourceType[id=0, name=File System, plugin=Platforms, category=Service]] has been blacklisted
    2015-03-24 21:02:28,109 WARN [InventoryManager.discovery-1] (rhq.core.pc.inventory.InventoryManager)- Discovery for Resources of [ResourceType[id=0, name=File System, plugin=Platforms, category=Service]] has been running for more than 300000 milliseconds. This may be a plugin bug.
    org.rhq.core.pc.inventory.TimeoutException: Call to [org.rhq.plugins.platform.FileSystemDiscoveryComponent.discoverResources()] with args [[org.rhq.core.pluginapi.inventory.ResourceDiscoveryContext@1f4d0999]] timed out. Invocation thread will be interrupted.
        at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ResourceDiscoveryComponentInvocationHandler.invokeInNewThread(DiscoveryComponentProxyFactory.java:256)
        at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ResourceDiscoveryComponentInvocationHandler.invoke(DiscoveryComponentProxyFactory.java:217)
        at com.sun.proxy.$Proxy43.discoverResources(Unknown Source)
        at org.rhq.core.pc.inventory.InventoryManager.invokeDiscoveryComponent(InventoryManager.java:385)
        ...
    Caused by: java.lang.Exception: Thread[ResourceDiscoveryComponent.invoker.daemon-1,5,main] with id [21] is hung. This exception contains its stack trace.
        at org.hyperic.sigar.RPC.ping(Native Method)
        at org.hyperic.sigar.NfsFileSystem.ping(NfsFileSystem.java:52)
        at org.hyperic.sigar.Sigar.getMountedFileSystemUsage(Sigar.java:707)
        ...
        at org.rhq.core.system.SigarAccessHandler.invoke(SigarAccessHandler.java:128)
        at com.sun.proxy.$Proxy42.getMountedFileSystemUsage(Unknown Source)
        at org.rhq.core.system.FileSystemInfo.refresh(FileSystemInfo.java:60)
        at org.rhq.core.system.FileSystemInfo.<init>(FileSystemInfo.java:43)
        at org.rhq.core.system.NativeSystemInfo.getFileSystems(NativeSystemInfo.java:325)
        at org.rhq.plugins.platform.FileSystemDiscoveryComponent.discoverResources(FileSystemDiscoveryComponent.java:62)
        ...

- How long does the RPC request take?

    time rpcinfo -T tcp <NFS_HOST> 100003

  If the request takes more than 5 seconds, this issue most likely applies. For example:

    time rpcinfo -T tcp nfs-server.example.com 100003
    rpcinfo: RPC: Port mapper failure - Timed out

    real 1m0.256s
    user 0m0.007s
    sys 0m0.015s

  In the above example, it took 1 minute to return the "Port mapper failure - Timed out" message.
- Is either TCP or UDP port 111 blocked on the NFS host machine? If so, is REJECT or DROP being used for packets destined for port 111? If DROP is being used, this issue applies.
- Does it take more than 5 minutes to retrieve disk space information from all mounted partitions on the JBoss ON agent host? The following commands use the agent's native library to retrieve disk space information from each mounted partition and display the time taken for each, along with the total time at the end:

    cd <RHQ_AGENT_HOME>
    time awk '{print $2}' /etc/mtab | sort | uniq | while read mnt; do echo ""; echo "Checking $mnt:"; time java -jar lib/sigar-*.jar df "$mnt"; done

  If the total time is close to or above 5 minutes, this issue applies.
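The timing checks above can also be wrapped in a small helper that reports whether a command exceeded a threshold. This is a minimal sketch: `timed_check` is a hypothetical name, the timing has whole-second resolution because it relies on `date +%s`, and the command's own output and exit status are ignored.

```shell
#!/bin/sh
# Run a command and report whether it took longer than a threshold.
# timed_check is a hypothetical helper name; timing uses whole seconds.
#   $1: threshold in seconds
#   remaining arguments: the command to run
timed_check() {
    limit=$1
    shift
    start=$(date +%s)
    # Only the duration matters here, so discard output and ignore failures.
    "$@" >/dev/null 2>&1 || true
    elapsed=$(( $(date +%s) - start ))
    if [ "$elapsed" -gt "$limit" ]; then
        echo "SLOW: '$*' took ${elapsed}s (threshold ${limit}s)"
        return 1
    fi
    echo "OK: '$*' took ${elapsed}s (threshold ${limit}s)"
}
```

For the RPC check, this could be invoked as `timed_check 5 rpcinfo -T tcp nfs-server.example.com 100003`. A SLOW result suggests that this issue applies.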