Antivirus software causes failover, reboot, or timeout events in a Red Hat High Availability cluster

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 7 (with the High Availability Add-on)
  • Red Hat Enterprise Linux 8 (with the High Availability Add-on)
  • Red Hat Enterprise Linux for SAP HANA
  • Red Hat Enterprise Linux for SAP Solutions
  • An antivirus or security application, including but not limited to the following:
    • HelpSystems Antivirus
    • Symantec Endpoint Protection
    • Trend Micro Antivirus
    • VMware Carbon Black
    • Microsoft Defender for Endpoint / Microsoft Defender Advanced Threat Protection (MDATP)
    • SentinelOne
    • Tanium Endpoint management and Security platform
    • CrowdStrike
    • Nessus

Issue

  • Issues such as the following occurred in a pacemaker cluster running antivirus software, with no obvious root cause. (Note: This list of symptoms is not exhaustive, and an affected cluster need not have shown any particular one of them.)
    • Creating the cluster with pcs cluster setup fails with a timeout error.

    • Simple Linux utility commands hung for more than 120 seconds.

    • Cluster commands like pcs status and crm_mon --one-shot --inactive take an excessively long time to complete.

    • An IPaddr2 resource failed to stop, leading to a fence event.

    • Generating an sosreport took more than 30 minutes.

    • There are many OCF_TIMEOUT messages from pacemaker daemons in the log files.

    • A Totem token failure occurred, leading to fencing of the node.

    • corosync repeatedly logs "[TOTEM] Retransmit List" messages.

        Mar 28 06:15:03 Node02 corosync[2028]: [TOTEM ] Retransmit List: 2bb9
      
  • Which files and directories should be excluded from an antivirus scan?
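
To gauge how often the log-based symptoms listed above occur, the matching log lines can be counted. The following is a minimal sketch: count_symptoms is a hypothetical helper name, and /var/log/messages is an assumed log location (corosync and pacemaker may log elsewhere depending on their configuration). The search strings are taken from the symptoms listed above.

```shell
# Hypothetical helper: count log lines on stdin that mention each symptom.
count_symptoms() {
    awk '
        /Retransmit List/ { retransmits++ }
        /OCF_TIMEOUT/     { timeouts++ }
        END {
            printf "Retransmit List: %d\n", retransmits
            printf "OCF_TIMEOUT: %d\n", timeouts
        }
    '
}

# Assumed log location; adjust to where your cluster daemons log.
if [ -r /var/log/messages ]; then
    count_symptoms < /var/log/messages
fi
```

High counts only indicate that the symptoms are present. They do not by themselves show that the antivirus software is the cause.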

Resolution

If a cluster node is running antivirus software and behaving in an unexpected manner, Red Hat may require that you reproduce the issue with the antivirus software disabled. If disabling the antivirus software causes the issue to disappear, then you will need to do one of the following:

  • Disable the antivirus software permanently.

  • Work with the antivirus software vendor to troubleshoot and remediate the issue within the antivirus configuration. This may involve excluding from antivirus scans the files that the cluster depends on and that the antivirus software is interfering with, which could require some trial and error. The list of files to exclude may vary with the antivirus product and its configuration (because scan and protection behaviors and their impact differ), as well as with the cluster package versions and cluster configuration (e.g., which types of resources are present). A good starting point might be the requires list for cluster-related RPM packages. For example, to check the requires list for the resource-agents package:

      # rpm -q --requires resource-agents
      /bin/bash
      /bin/bash
      /bin/gawk
      /bin/mount
      /bin/ps
      /bin/sed
      /bin/sh
      /sbin/fsck
      /sbin/ip
      ...
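
The same check can be repeated across several cluster packages to build a list of candidate files. The following is a minimal sketch, not a vetted exclusion list: the package names in the loop are assumptions and should be adjusted to the packages actually installed, and requires_paths is a hypothetical helper name. It keeps only the absolute file paths from the rpm output and removes duplicates.

```shell
# Hypothetical helper: keep only absolute file paths from
# "rpm -q --requires" output on stdin, and drop duplicates.
requires_paths() {
    awk '/^\// && !seen[$0]++'
}

# Assumed package list; adjust it to match your cluster nodes.
for pkg in pacemaker corosync resource-agents; do
    # Skip packages that are not installed (or hosts without rpm).
    if rpm -q "$pkg" >/dev/null 2>&1; then
        rpm -q --requires "$pkg"
    fi
done | requires_paths
```

The resulting paths are only candidates. Which of them actually need to be excluded depends on the antivirus product and should be confirmed with the vendor.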
    

See also: Does ‘on-access’ scanning by Antivirus impact Red Hat Enterprise Linux system performance?

Additional notes

Please refer to the antivirus vendor's product documentation for more details on how to configure the antivirus software, including how to exclude files from scans.

Root Cause

There may be contention for file access between the antivirus software and other applications, such as pacemaker, dlm, or cluster resource agents. The cluster's attempt to access a required file may be delayed indefinitely while the antivirus software scans that file or its parent directory.

Diagnostic Steps

  1. If the issue involves intermittent cluster resource failures, add trace_ra=1 to the configuration of each affected resource. After a failure, check the trace logs in /var/lib/heartbeat/trace_ra to determine which command(s) within the resource agent hung or took an excessively long time to complete. (See also: How can I determine exactly what is happening with every operation on a resource in Pacemaker?)

     # pcs resource update <resource id> trace_ra=1
    
  2. If the issue is readily reproducible every time you run an affected command (e.g., pcs status), then run the command in debug mode or under strace.

     # pcs status --debug
     # strace -Tttvfs 1024 -o /tmp/strace.out /usr/sbin/crm_mon --one-shot --inactive
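
With the -T option, strace appends the time spent in each system call, in angle brackets, at the end of the line, so the slowest calls in /tmp/strace.out can be listed directly. The following is a minimal sketch, assuming the output format produced by the strace command above. slow_syscalls is a hypothetical helper name, and the 1-second default threshold is an arbitrary choice.

```shell
# Hypothetical helper: read strace -T output on stdin and print the
# calls that took longer than the threshold (in seconds, default 1),
# slowest first. Each output line is prefixed with the duration.
slow_syscalls() {
    awk -v t="${1:-1}" '
        # -T appends the elapsed time as <seconds> at the end of the line.
        match($0, /<[0-9]+\.[0-9]+>$/) {
            d = substr($0, RSTART + 1, RLENGTH - 2)
            if (d + 0 > t + 0) print d, $0
        }
    ' | sort -rn
}

# Example: calls in the trace above that took longer than 1 second.
if [ -r /tmp/strace.out ]; then
    slow_syscalls 1 < /tmp/strace.out
fi
```

File-related calls (such as open or read) that appear near the top of this list point to the paths worth raising with the antivirus vendor.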
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.