"pengine: error: native_create_actions: Resource <name> (<class>:<provider>:<type>) is active on 2 nodes attempting recovery" seen when a node rejoins a RHEL High Availability cluster


Environment

  • Red Hat Enterprise Linux (RHEL) 6, 7, 8, and 9 with the High Availability Add-On

Issue

  • My resources restarted unexpectedly after a node rejoined, even though location constraints and resource-stickiness are set to prefer the node each resource was already running on, and the logs show "is active on 2 nodes" errors:
Dec 10 07:33:56 [11743] node1    pengine:    error: native_create_actions:    Resource myScript (lsb::ha_script) is active on 2 nodes attempting recovery
Dec 10 07:33:56 [11743] node1    pengine:  warning: native_create_actions:    See http://clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information.
Dec 10 07:33:56 [11743] node1    pengine:   notice: LogActions:       Restart myScript    (Started node1)
  • A node rejoined after fencing and the resource moved to it, even though it is configured to stick where it is

Resolution

If this log message occurs, investigate why pacemaker believes the resource is running on multiple cluster nodes.

Some possible causes and remediations (this is not an exhaustive list):

  • Disable any mechanism that starts the resource outside the control of the cluster resource manager, such as a filesystem mounted at boot via /etc/fstab, an LSB init script enabled at boot time, or a systemd service enabled at boot time.
  • If the resource is an lsb script, then ensure the script implements a status command that accurately returns whether the underlying resource is running or not.
  • If the resource is an IPaddr2, then check the network configuration to see if it is configured on a node directly.
  • If the resource is intended to run on multiple nodes simultaneously, then configure it as a clone resource.
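As a rough sketch of the first and last points above, a boot-time start can be disabled and a legitimately multi-node resource cloned as follows (the service name httpd and resource name my_resource are hypothetical placeholders):

```shell
# Stop the OS from starting the service outside of pacemaker's control
systemctl disable httpd.service   # RHEL 7 and later
# chkconfig httpd off             # RHEL 6 equivalent

# If the resource really should run on more than one node at once,
# configure it as a clone instead of starting it by hand:
pcs resource clone my_resource
```

These commands require a running cluster node and root privileges, so adapt them to the resource actually in question.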

A new value for the multiple-active meta option, called stop_unexpected, was added via the bugs listed below. It configures the action pacemaker takes when a resource that should be active on only one node is found active on multiple nodes.

Red Hat Enterprise Linux 8

  • The issue (bugzilla bug: 2062850) has been resolved with the errata RHEA-2022:4817 with the following package(s): pacemaker-2.0.5-9.el8_4.5 or later for RHEL 8.4.z.
  • The issue (bugzilla bug: 2062848) has been resolved with the errata RHEA-2022:5324 with the following package(s): pacemaker-2.1.2-4.el8_6.2 or later for RHEL 8.6.z.
  • The issue (bugzilla bug: 2036815) has been resolved with the errata RHBA-2022:7573 with the following package(s): pacemaker-2.1.4-5.el8 or later.

Red Hat Enterprise Linux 9

  • The issue (bugzilla bug: 2072108) has been resolved with the errata RHBA-2022:7937 with the following package(s): pacemaker-2.1.4-5.el9 or later.

Pacemaker's multiple-active resource meta-attribute now accepts the value stop_unexpected, which stops only the instances of a resource that are active where they should not be, leaving the expected instance running.

This meta-attribute was added because some services are not disrupted by extra instances and can remain running while the extra instances are stopped. This reduces downtime for a service that is wrongly detected as "multiple active". Use this option with care, as it can lead to unusual resource placement.

For example, we create a resource with the meta option set: meta multiple-active=stop_unexpected.

# pcs resource create d-01 ocf:pacemaker:Dummy meta multiple-active=stop_unexpected

In this example, the resource is running on rhel8-1.examplerh.com. We then start the resource on rhel8-2.examplerh.com outside of pacemaker's control and have pacemaker re-probe for resources:

# pcs resource debug-start d-01
# pcs resource refresh

When d-01 is started on rhel8-2.examplerh.com, pacemaker detects that multiple instances of the resource are active. It then stops only the instance that it believes should not be active (the one on rhel8-2.examplerh.com); the instance that was already active on the other cluster node is left running.

Other values can be set for multiple-active (the default stop_start, as well as stop_only and block); they are outlined in Pacemaker's documentation of resource meta-attributes.
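The meta-attribute can also be set or reviewed on an existing resource with pcs (d-01 is the example resource from above):

```shell
# Set multiple-active on an already-created resource
pcs resource meta d-01 multiple-active=stop_unexpected

# Review the resource's configured meta attributes
pcs resource config d-01   # "pcs resource show d-01" on older pcs versions
```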

Root Cause

When a node starts pacemaker, its resource manager runs a probe against the configured resources the node is eligible to run, to determine whether any of them are already started; the resource manager needs to know where all resources are running in order to manage them effectively. If this initial probe finds a resource running on the newly joined node in addition to the location where it was already running, and the resource is configured to run only once throughout the cluster, the resource manager must stop it in all places and restart it in the ideal location. This means that when a node joins, if it is already running any resource that is configured in the cluster, that resource may restart or move from where it was previously running.

If a resource is intended to be running on multiple nodes at the same time, then it should be configured as a clone resource, with the configuration reflecting how many times and where it should run. The resource manager must be the only mechanism by which any configured resource starts anywhere in the cluster.

In the case of lsb-class pacemaker resources, it is important that the script implement the status command. If it does not, and the script simply exits normally (with return code 0) when status is called, this issue will occur every time a node starts, because the initial probe of the script will always return a status indicating the resource is running.
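A minimal sketch of the status handling such an init script needs is below; the pidfile path /var/run/ha_script.pid is a hypothetical placeholder. Per the LSB specification, status must exit 0 when the service is running and 3 when it is stopped, so a script that always exits 0 makes pacemaker's probe believe the resource is active everywhere:

```shell
# Hypothetical pidfile written by the service's start action
PIDFILE=/var/run/ha_script.pid

status() {
  # Exit 0 only if the pidfile exists and the recorded PID is alive
  if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    echo "ha_script is running"
    return 0
  fi
  # LSB: 3 means "program is not running"
  echo "ha_script is stopped"
  return 3
}

case "$1" in
  status) status; exit $? ;;
esac
```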

Diagnostic Steps

There could be many reasons why pacemaker believes the resource is active (or not stopped) and logs that message. Some known checks (not an exhaustive list):

  • Look in chkconfig --list (RHEL 6) or systemctl list-unit-files --type=service (RHEL 7 and later) for any service that may start a resource that is configured in the cluster.
  • If the resource is a file system, check /etc/fstab for it.
  • If the resource is an IPaddr2, then check the network configuration to see if it is configured on a node directly.
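The checks above can be run directly on each node; the mount point /shared/data and address 192.168.1.100 below are hypothetical placeholders for the values of the actual Filesystem and IPaddr2 resources:

```shell
# Services enabled at boot that might start a cluster-managed resource
systemctl list-unit-files --type=service | grep enabled   # RHEL 7 and later
# chkconfig --list | grep ':on'                           # RHEL 6

# A Filesystem resource's mount point should not appear in /etc/fstab
grep '/shared/data' /etc/fstab

# An IPaddr2 resource's address should not be configured on an
# interface outside of pacemaker
ip addr show | grep '192.168.1.100'
```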

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.