An antivirus software causes failover, reboot, or timeout events in a Red Hat High Availability cluster
Environment
- Red Hat Enterprise Linux 7 (with the High Availability Add-on)
- Red Hat Enterprise Linux 8 (with the High Availability Add-on)
- Red Hat Enterprise Linux for SAP HANA
- Red Hat Enterprise Linux for SAP Solutions
- An antivirus or security application, including but not limited to the following:
  - HelpSystems Antivirus
  - Symantec Endpoint Protection
  - Trend Micro Antivirus
  - VMware Carbon Black
  - Microsoft Defender for Endpoint / Microsoft Defender Advanced Threat Protection (MDATP)
  - SentinelOne
  - Tanium Endpoint management and Security platform
  - CrowdStrike
  - Nessus
Issue
- Issues such as the following occurred in a pacemaker cluster with antivirus software running, and there was no obvious root cause. (Note: This is not an exhaustive list of possible symptoms. It is not necessary that any or all of these specific issues occurred.)
  - Creating the cluster by running pcs cluster setup fails with a timeout error.
  - Simple Linux utility commands hung for more than 120 seconds.
  - Cluster commands like pcs status and crm_mon --one-shot --inactive take an excessively long time to complete.
  - An IPaddr2 resource failed to stop, leading to a fence event.
  - Generating an sosreport took more than 30 minutes.
  - There are many OCF_TIMEOUT messages from pacemaker daemons in the log files.
  - A totem token failure occurred, leading to fencing of the node.
  - corosync repeatedly logs "[TOTEM] Retransmit List" messages:

    Mar 28 06:15:03 Node02 corosync[2028]: [TOTEM ] Retransmit List: 2bb9

- Which files and directories should be excluded from an antivirus scan?
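To gauge how often the log-based symptoms above appear on a node, a small helper along these lines might be useful. It is only a sketch: the function name count_symptoms is hypothetical, it simply counts lines containing the two strings quoted in the list above, and the log file to feed it (for example /var/log/messages) is an assumption that depends on how logging is configured.

```shell
# Hypothetical helper: reads log text on stdin and prints how many lines
# mention the two symptom strings named in this article.
count_symptoms() {
    awk '
        /Retransmit List/ { r++ }   # corosync retransmit messages
        /OCF_TIMEOUT/     { t++ }   # pacemaker operation timeouts
        END { printf "retransmit=%d ocf_timeout=%d\n", r, t }
    '
}

# Example usage (the log path is an assumption; adjust for your system):
#   count_symptoms < /var/log/messages
```

A high count does not by itself prove that antivirus software is responsible; it only indicates that the symptoms described here are present.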
Resolution
If a cluster node is running antivirus software and behaving in an unexpected manner, Red Hat may require that you reproduce the issue with the antivirus software disabled. If disabling the antivirus software causes the issue to disappear, then you will need to do one of the following:
- Disable the antivirus software permanently.
- Work with the antivirus software vendor to troubleshoot and remediate the issue within the antivirus configuration. This may involve excluding from antivirus scans any files that the cluster depends upon, with which the antivirus software is interfering. This could require some trial and error. The list of files that need to be excluded may vary depending on antivirus product and antivirus configuration (due to varying scan/protection behaviors and impacts), as well as depending on cluster package versions and cluster configuration (e.g., what types of resources are present). A good starting point might be to check the requires list for cluster-related RPM packages. For example, to check the requires list for the resource-agents package:

    # rpm -q --requires resource-agents
    /bin/bash
    /bin/bash
    /bin/gawk
    /bin/mount
    /bin/ps
    /bin/sed
    /bin/sh
    /sbin/fsck
    /sbin/ip
    ...
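The requires check can be repeated across several packages and reduced to a de-duplicated list of file paths. The following is a minimal sketch, not a vendor-approved exclusion list: the function names requires_paths and list_candidate_exclusions are hypothetical, and the package names in the usage comment are assumptions about what is installed on the node.

```shell
# Hypothetical helper: keep only absolute file paths from
# `rpm -q --requires` output read on stdin, without duplicates.
requires_paths() {
    grep '^/' | sort -u
}

# Hypothetical helper: query each package given as an argument and print
# the combined list of required file paths. Packages that are not
# installed contribute no paths.
list_candidate_exclusions() {
    for pkg in "$@"; do
        rpm -q --requires "$pkg" 2>/dev/null
    done | requires_paths
}

# Example usage (package names are assumptions; adjust for your cluster):
#   list_candidate_exclusions pacemaker corosync pcs resource-agents
```

The result is only a starting point for the discussion with the antivirus vendor; library and capability dependencies are filtered out, and files the cluster reads at runtime (configuration, state directories) are not covered by the requires list.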
Related articles
Does ‘on-access’ scanning by Antivirus impact Red Hat Enterprise Linux system performance?
Additional notes
Please refer to the antivirus vendor's product documentation for more details on how to configure the antivirus, including how to exclude files from scans.
Root Cause
There may be contention for file accesses between the antivirus software and other applications, such as pacemaker, dlm, or cluster resource agents. The cluster's attempt to access required files may be delayed for an indefinite period while the antivirus scans that file or its parent directory.
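One rough way to observe this kind of delay is to time how long it takes to read specific files, once with the antivirus software enabled and once with it disabled. The sketch below assumes GNU date (for the %N nanosecond format) and uses a hypothetical function name, time_read; caching and system load also affect the numbers, so treat the output as an indication rather than proof.

```shell
# Hypothetical helper: print the wall-clock seconds needed to read each
# file named on the command line, one "seconds path" pair per line.
# Assumes GNU date, which supports %N (nanoseconds).
time_read() {
    for f in "$@"; do
        start=$(date +%s.%N)
        cat -- "$f" > /dev/null 2>&1
        end=$(date +%s.%N)
        awk -v s="$start" -v e="$end" -v f="$f" \
            'BEGIN { printf "%.3f %s\n", e - s, f }'
    done
}

# Example usage with files taken from the requires list shown earlier:
#   time_read /sbin/ip /bin/mount
```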
Diagnostic Steps
- If the issue is intermittent cluster resource failures, add trace_ra=1 to the configuration of each affected resource. After a failure, check the trace logs in /var/lib/heartbeat/trace_ra to determine which command(s) within the resource agent hung or took an excessively long time to complete. (See also: How can I determine exactly what is happening with every operation on a resource in Pacemaker?)

    # pcs resource update <resource id> trace_ra=1

- If the issue is readily reproducible every time you run an affected command (e.g., pcs status), then run the command in debug mode or under strace.

    # pcs status --debug
    # strace -Tttvfs 1024 -o /tmp/strace.out /usr/sbin/crm_mon --one-shot --inactive
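Because the strace command above uses -T, each completed system call in /tmp/strace.out ends with its duration in angle brackets. A sketch like the following could help surface the slowest calls, which often point at the files being delayed. The function name slow_syscalls and the default threshold of 1.0 second are assumptions chosen for illustration.

```shell
# Hypothetical helper: list system calls from an strace -T output file
# that took at least a given number of seconds, slowest first.
#   $1: strace output file
#   $2: threshold in seconds (defaults to 1.0)
slow_syscalls() {
    awk -v min="${2:-1.0}" '
        # With -T, completed calls end in "<seconds>", e.g. "<0.000123>".
        match($0, /<[0-9]+\.[0-9]+>$/) {
            t = substr($0, RSTART + 1, RLENGTH - 2)
            if (t + 0 >= min + 0) print t, $0
        }
    ' "$1" | sort -rn
}

# Example usage:
#   slow_syscalls /tmp/strace.out 0.5
```

Each output line is prefixed with the duration so the list sorts numerically; the rest of the line is the original strace record, including the path arguments that may be candidates for exclusion.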