Platform management plug-in for JBoss ON does not discover file system resources due to discovery TimeoutException when NFS ping takes too long

Solution Verified - Updated

Environment

  • Red Hat JBoss Operations Network (ON) 3.3
  • Firewall on the NFS server host silently blocks RPC TCP port 111, sending no response to the client

Issue

  • No file system resources are shown

  • Discovery component for file system resource type is blacklisted

  • Agent log reports the following warnings:

      WARN  [InventoryManager.discovery-1] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- The discovery component for resource type [ResourceType[id=0, name=File System, plugin=Platforms, category=Service]] has been blacklisted
      WARN  [InventoryManager.discovery-1] (rhq.core.pc.inventory.InventoryManager)- Discovery for Resources of [ResourceType[id=0, name=File System, plugin=Platforms, category=Service]] has been running for more than 300000 milliseconds. This may be a plugin bug.
    

Resolution

Reconfigure the firewall so that the RPC service can be used on both UDP and TCP port 111.

If the service will continue to be blocked, ensure that the firewall is using a REJECT to send an ICMP response to the JBoss ON agent instead of silently dropping or denying TCP requests sent to port 111.
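
As an illustration only, the sketch below prints a pair of REJECT rules for review. It assumes the NFS server host uses iptables and filters inbound traffic in the INPUT chain; neither detail comes from this article, and the helper name `print_rpc_reject_rules` is hypothetical. It is a dry run and changes nothing on the host.

```shell
# Hypothetical helper: print (not apply) REJECT rules for the RPC portmapper port.
# Assumes iptables and the INPUT chain; adapt to the firewall actually in use.
print_rpc_reject_rules() {
    port=${1:-111}   # RPC portmapper port named in this article
    for proto in tcp udp; do
        # REJECT answers with an ICMP error so the client fails fast
        # instead of waiting for a socket timeout.
        echo "iptables -A INPUT -p $proto --dport $port -j REJECT --reject-with icmp-port-unreachable"
    done
}
```

Running `print_rpc_reject_rules` lists one rule per protocol. Because `-A` appends to the chain, any existing DROP rule for port 111 would still match first and would need to be removed or replaced before a rule like this has any effect.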

Root Cause

This issue is caused by the RPC ping request to the remote NFS host taking too long to execute. If one or more NFS file systems are being discovered and the total execution time exceeds 5 minutes, the discovery scan is aborted and the file system resource type is blacklisted. Once blacklisted, no further attempts are made to scan for file systems on this platform.

Under normal conditions, the RPC ping should quickly report whether the NFS server is running, not running, or unreachable. However, if the network configuration prevents the RPC ping from completing, the platform plug-in waits for a socket timeout to occur. On most networks, this is 1 minute. Additional retries delay the thread further, so the discovery scan for file system resources takes too long.
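
The arithmetic behind the timeout can be sketched as follows. The 60 second socket timeout and the 300 second discovery limit are taken from this article; the number of ping attempts per mount is not documented here, so it is a parameter you supply. The function names are hypothetical.

```shell
# Hypothetical helper: estimate total seconds spent waiting on RPC pings
# that never receive a response.
#   $1 = number of NFS mounts whose server silently drops the ping
#   $2 = ping attempts per mount (assumption: not stated in this article)
#   $3 = socket timeout in seconds (defaults to the 1 minute cited above)
estimate_ping_delay() {
    mounts=$1
    attempts=$2
    timeout_s=${3:-60}
    echo $(( mounts * attempts * timeout_s ))
}

# Succeeds when the estimated delay is above the 300 second (300000 ms) limit
# reported in the agent log.
exceeds_discovery_limit() {
    [ "$1" -gt 300 ]
}
```

For example, `estimate_ping_delay 3 2` prints 360, which `exceeds_discovery_limit` reports as over the limit. Under these assumptions, even a handful of unresponsive mounts is enough to get the resource type blacklisted.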

This issue has been captured in Red Hat Bugzilla 1205429 and will be addressed in a future release of the platform plug-in for JBoss ON.

Diagnostic Steps

  • Review the agent log for an indication that the file system resource type has been blacklisted. The following messages are relevant. Note that the warning messages occur 5 minutes after the runtime discovery scan is logged:

      2015-03-24 20:57:28,103 INFO  [InventoryManager.discovery-1] (rhq.core.pc.inventory.RuntimeDiscoveryExecutor)- Executing runtime discovery scan rooted at [platform]...
      2015-03-24 21:02:28,108 WARN  [InventoryManager.discovery-1] (rhq.core.pc.util.DiscoveryComponentProxyFactory)- The discovery component for resource type [ResourceType[id=0, name=File System, plugin=Platforms, category=Service]] has been blacklisted
      2015-03-24 21:02:28,109 WARN  [InventoryManager.discovery-1] (rhq.core.pc.inventory.InventoryManager)- Discovery for Resources of [ResourceType[id=0, name=File System, plugin=Platforms, category=Service]] has been running for more than 300000 milliseconds. This may be a plugin bug.
      org.rhq.core.pc.inventory.TimeoutException: Call to [org.rhq.plugins.platform.FileSystemDiscoveryComponent.discoverResources()] with args [[org.rhq.core.pluginapi.inventory.ResourceDiscoveryContext@1f4d0999]] timed out. Invocation thread will be interrupted.
          at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ResourceDiscoveryComponentInvocationHandler.invokeInNewThread(DiscoveryComponentProxyFactory.java:256)
          at org.rhq.core.pc.util.DiscoveryComponentProxyFactory$ResourceDiscoveryComponentInvocationHandler.invoke(DiscoveryComponentProxyFactory.java:217)
          at com.sun.proxy.$Proxy43.discoverResources(Unknown Source)
          at org.rhq.core.pc.inventory.InventoryManager.invokeDiscoveryComponent(InventoryManager.java:385)
          ...
      Caused by: java.lang.Exception: Thread[ResourceDiscoveryComponent.invoker.daemon-1,5,main] with id [21] is hung. This exception contains its stack trace.
          at org.hyperic.sigar.RPC.ping(Native Method)
          at org.hyperic.sigar.NfsFileSystem.ping(NfsFileSystem.java:52)
          at org.hyperic.sigar.Sigar.getMountedFileSystemUsage(Sigar.java:707)
          ...
          at org.rhq.core.system.SigarAccessHandler.invoke(SigarAccessHandler.java:128)
          at com.sun.proxy.$Proxy42.getMountedFileSystemUsage(Unknown Source)
          at org.rhq.core.system.FileSystemInfo.refresh(FileSystemInfo.java:60)
          at org.rhq.core.system.FileSystemInfo.<init>(FileSystemInfo.java:43)
          at org.rhq.core.system.NativeSystemInfo.getFileSystems(NativeSystemInfo.java:325)
          at org.rhq.plugins.platform.FileSystemDiscoveryComponent.discoverResources(FileSystemDiscoveryComponent.java:62)
          ...
    
  • How long does the RPC request take?

      time rpcinfo -T tcp <NFS_HOST> 100003
    

    If the request takes more than 5 seconds, this issue most likely applies. For example:

      time rpcinfo -T tcp nfs-server.example.com 100003
      rpcinfo: RPC: Port mapper failure - Timed out
    
      real    1m0.256s
      user    0m0.007s
      sys     0m0.015s
    

    In the above example it took 1 minute to return the Port mapper failure - Timed out message.

  • Is either TCP or UDP port 111 blocked on the NFS host machine? If so, is REJECT or DROP being used for packets destined for port 111?

    If DROP is being used, this issue applies.

  • Does it take more than 5 minutes to retrieve disk space information from all mounted partitions on the JBoss ON agent host? The following commands use the agent's native library to retrieve disk space information from each mounted partition and display the elapsed time for each, along with the total time at the end:

      cd <RHQ_AGENT_HOME>
      time awk '{print $2}' /etc/mtab | sort | uniq | while read mnt; do echo ""; echo "Checking $mnt:"; time java -jar lib/sigar-*.jar df "$mnt"; done
    

    If the total time is close to or above 5 minutes, this issue applies.
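
If the firewall rules on the NFS host cannot be inspected directly, the DROP versus REJECT question can be approximated from the agent host. The following is a minimal sketch, assuming `bash` and the coreutils `timeout` command are available; the helper name `classify_rpc_port` and the 5 second wait are illustrative choices, not part of JBoss ON.

```shell
# Hypothetical helper: classify how a host answers a TCP connect to a port.
#   $1 = host, $2 = port (default 111), $3 = seconds to wait (default 5)
classify_rpc_port() {
    host=$1
    port=${2:-111}
    wait_seconds=${3:-5}
    # bash's /dev/tcp pseudo-device attempts a TCP connection;
    # timeout exits with status 124 if the attempt does not finish in time.
    timeout "$wait_seconds" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
    status=$?
    case $status in
        0)   echo "open" ;;      # connection accepted
        124) echo "filtered" ;;  # no answer at all, consistent with DROP
        *)   echo "rejected" ;;  # immediate failure, e.g. REJECT or closed port
                                 # (name resolution errors also end up here)
    esac
}
```

For example, `classify_rpc_port nfs-server.example.com` printing `filtered` suggests that packets to port 111 are being dropped silently, which matches the condition described in this article.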


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.