Why was the cluster node fenced when the following was logged: "controld(dlm) ERROR: Uncontrolled lockspace exists, system must reboot. Executing suicide fencing"
Environment
- Red Hat Enterprise Linux Server 7, 8, and 9 (with the High Availability Add On)
pacemaker
Issue
A cluster node was rebooted after the following message was logged:
Nov 23 22:36:33 node42 controld(dlm)[13452]: ERROR: Uncontrolled lockspace exists, system must reboot. Executing suicide fencing
Resolution
When the controld resource agent detects an uncontrolled lockspace, the cluster requests that the affected node be fenced. An uncontrolled lockspace occurs when a node leaves the cluster ungracefully. Lockspaces are not cleaned up if the node leaves the cluster ungracefully (e.g., corosync crashes), or if the node is fenced by a storage-based stonith device but never rebooted. If the node rejoins the cluster before fencing or a reboot occurs and then tries to start controld, the resource agent detects the uncontrolled lockspaces.
If you have storage-based fencing configured, ensure the node has been rebooted (either manually or automatically) before it rejoins the cluster.
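Before letting a node rejoin, you can check manually for leftover lockspaces. The following is an illustrative sketch, assuming the `dlm_tool` utility (from the dlm package) is installed; it is not the exact logic the controld agent uses:

```shell
# List DLM lockspaces known to dlm_controld on this node.
# Empty output means no active lockspaces. Guarded so the command is
# skipped on systems where the dlm package is not installed.
if command -v dlm_tool >/dev/null; then
    dlm_tool ls
fi

# The kernel also exposes one directory per active lockspace; a
# non-empty listing while the cluster stack is stopped indicates an
# uncontrolled lockspace was left behind.
if [ -d /sys/kernel/dlm ]; then
    ls /sys/kernel/dlm
fi
```

If either check shows lockspaces while the cluster stack is stopped, reboot the node before starting the cluster on it again.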
Root Cause
The corosync process crashed on a cluster that had DLM lockspaces. This led to the cluster node ungracefully leaving the cluster and leaving lockspaces that were not removed.
Diagnostic Steps
- Check the /var/log/messages file for messages about uncontrolled lockspaces and for signs that the cluster node left the cluster abruptly.
In this example, corosync crashed; dlm_controld could no longer communicate with it and stopped suddenly.
Nov 23 22:36:08 node42 dlm_controld[22089]: 6280267 process_cluster_cfg cfg_dispatch 2
Nov 23 22:36:08 node42 dlm_controld[22089]: 6280267 cluster is down, exiting
Nov 23 22:36:08 node42 dlm_controld[22089]: 6280267 process_cluster quorum_dispatch 2
[....]
Nov 23 22:36:08 node42 systemd: Unit corosync.service entered failed state.
Nov 23 22:36:08 node42 kernel: dlm: closing connection to node 1
Nov 23 22:36:08 node42 kernel: dlm: closing connection to node 3
Nov 23 22:36:08 node42 kernel: dlm: closing connection to node 2
Nov 23 22:36:08 node42 systemd: corosync.service failed.
[.....]
Nov 23 22:36:08 node42 systemd: pacemaker.service holdoff time over, scheduling restart.
Nov 23 22:36:08 node42 systemd: Starting Corosync Cluster Engine...
[.....]
Nov 23 22:36:33 node42 controld(dlm)[13452]: ERROR: Uncontrolled lockspace exists, system must reboot. Executing suicide fencing
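To find these signatures quickly, a grep over the system log works. The patterns below are taken from the excerpt above; the sketch demonstrates them against a sample string so it is self-contained (on a live node, run the same grep against /var/log/messages):

```shell
# Two lines from the log excerpt above: dlm_controld exiting because
# corosync went away, and the suicide-fencing error from controld.
sample='Nov 23 22:36:08 node42 dlm_controld[22089]: 6280267 cluster is down, exiting
Nov 23 22:36:33 node42 controld(dlm)[13452]: ERROR: Uncontrolled lockspace exists, system must reboot. Executing suicide fencing'

# Count lines matching either signature; on a live node, replace the
# printf pipeline with: grep -cE '...' /var/log/messages
printf '%s\n' "$sample" | grep -cE 'cluster is down, exiting|Uncontrolled lockspace exists'
# → 2
```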