Creating a Persistent (Daemonized) Pacemaker Resource Agent
A persistent (daemonized) Pacemaker resource agent that maintains state data can detect failures asynchronously and inject a failure into Pacemaker immediately, without waiting for the next monitor interval. A persistent resource agent can also speed up cluster response time for services with a high state overhead: because the daemon keeps the state data available between calls, cluster actions such as start, stop, and monitor do not each have to gather that state separately.
This article provides an overview of the considerations for creating a persistent Pacemaker resource agent.
The implementation of a persistent Pacemaker resource agent requires two components:
- A persistent daemon to accept commands and return status to the caller
- A custom OCF (Open Cluster Framework) agent that passes calls to the daemon
These two components can be written as a single script, although this is not a requirement.
For full information on creating OCF resource agents and the OCF resource agent API, see:
- The OCF Resource Agent Developer’s Guide
- Open Clustering Framework Resource Agent API
The persistent daemon
The daemon that the agent uses needs only some means of accepting commands and returning a status to the caller. You can write the daemon in any language. Communication can use D-Bus, a REST API, a custom protocol over socket connections, or any other mechanism you prefer.
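As a rough illustration of "accepting commands and returning a status", the following minimal sketch uses only the Python 3 standard library. The line-based protocol, the command name `status`, the replies, and all function and class names are assumptions made up for this example; they are not part of Pacemaker or OCF.

```python
import socket
import socketserver
import threading


class CommandHandler(socketserver.StreamRequestHandler):
    """Handle one connection: read one command line, write one status line."""

    def handle(self):
        command = self.rfile.readline().strip().decode("ascii", "replace")
        # Hypothetical commands; a real daemon would consult its own state here.
        if command == "status":
            reply = "ok"
        else:
            reply = "unknown-command"
        self.wfile.write((reply + "\n").encode("ascii"))


def send_command(host, port, command, timeout=5.0):
    """Send one command to the daemon and return its one-line reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall((command + "\n").encode("ascii"))
        with conn.makefile("r", encoding="ascii") as reader:
            return reader.readline().strip()


def serve_in_background(host="127.0.0.1", port=0):
    """Start the server in a thread; port 0 lets the OS pick a free port."""
    server = socketserver.ThreadingTCPServer((host, port), CommandHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

In this sketch the agent side would call `send_command()` and interpret the reply; a production daemon would run in its own process and would probably add authentication and logging.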
The custom OCF agent
The custom OCF agent is a script like any other OCF script. It should accept the usual OCF commands:
- meta-data, validate-all, and (optionally) reload: The script should handle these directly, without needing to interact with the daemon in any way.
- start, stop: These commands should start and stop the daemon, wait for the daemon to finish starting or stopping, and return an appropriate OCF exit status. If you want to start and stop the daemon outside of Pacemaker, you could implement these commands as dummy operations, but that requires extra bookkeeping and increases the likelihood of issues that you need to address in your system operation.
- monitor and (optionally) promote, demote, and notify: The script should contact the daemon and send it an appropriate request, then translate the daemon's response into an OCF exit status. A monitor command could be as simple as ensuring the daemon is running, or it could involve some sort of health check.
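The translation step for monitor can be sketched as a small lookup. The exit status values 0, 1, and 7 are the standard OCF codes also used in the sample at the end of this article; the daemon replies ("ok", "stopped", "failed") and the `query_daemon` callable are hypothetical and stand in for whatever your daemon actually returns.

```python
# Standard OCF exit status codes used by this sketch
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7

# Hypothetical daemon replies; a real daemon defines its own vocabulary.
REPLY_TO_OCF = {
    "ok": OCF_SUCCESS,
    "stopped": OCF_NOT_RUNNING,
    "failed": OCF_ERR_GENERIC,
}


def monitor_status(query_daemon):
    """Translate a daemon reply into an OCF exit status for monitor.

    query_daemon is any callable that returns the daemon's reply as a string
    and raises OSError (for example ConnectionRefusedError) if the daemon
    cannot be reached.
    """
    try:
        reply = query_daemon()
    except OSError:
        # Nothing is answering, so report the daemon as not running.
        return OCF_NOT_RUNNING
    # Any reply this sketch does not recognize is reported as a generic error.
    return REPLY_TO_OCF.get(reply.strip().lower(), OCF_ERR_GENERIC)
```

Treating an unreachable daemon as "not running" mirrors the sample agent below, where a failed connection maps to the same exit status.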
The OCF script can define any parameters you require. For example, the script can define the following parameters:
- The path to the daemon executable
- The path to the daemon's configuration file
- Extra command-line arguments to pass when launching the daemon
- A D-Bus interface, REST URL, or port number where the daemon should listen for requests
- Authentication tokens
- Any additional parameters you may need
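Pacemaker passes resource parameters to the agent as environment variables named `OCF_RESKEY_` followed by the parameter name, as the sample agent below does for its port parameter. A minimal sketch of collecting several such parameters follows; the parameter names, default paths, and the `read_parameters` helper are hypothetical and chosen only for illustration.

```python
import os

# Hypothetical parameter names and defaults, for illustration only.
DEFAULTS = {
    "binary": "/usr/sbin/food",
    "config": "/etc/food.conf",
    "extra_args": "",
    "port": "9999",
}


def read_parameters(environ=None):
    """Collect resource parameters from OCF_RESKEY_* environment variables."""
    environ = os.environ if environ is None else environ
    params = {}
    for name, default in DEFAULTS.items():
        # Fall back to the default when the parameter was not configured.
        params[name] = environ.get("OCF_RESKEY_" + name, default)
    return params
```

Each parameter collected this way should also be declared in the agent's meta-data output and checked in validate-all.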
Injecting failures into Pacemaker
To inject a failure into Pacemaker, call crm_resource --fail --resource resource-id.
- It is more straightforward to call this crm_resource command from the daemon rather than from the agent.
- If you launch the daemon through the agent start action, the daemon can find the resource ID in its environment as OCF_RESOURCE_INSTANCE and use it to call the crm_resource command. If the agent's start action is a dummy operation, the agent will need to provide some other way to pass the resource name from the environment variable to the daemon.
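The daemon side of this can be sketched as follows, assuming the daemon was launched by the agent's start action and therefore inherited `OCF_RESOURCE_INSTANCE`. The helper names `fail_command` and `inject_failure` and the injectable `run` argument are inventions of this example; only the crm_resource options shown above come from this article.

```python
import os
import subprocess


def fail_command(resource_id):
    """Build the crm_resource command line that marks a resource as failed."""
    return ["crm_resource", "--fail", "--resource", resource_id]


def inject_failure(environ=None, run=subprocess.call):
    """Report an asynchronous failure of this daemon's resource to Pacemaker.

    Returns the exit status of crm_resource, or None if the resource ID is not
    available (for example, when the daemon was not started by the agent).
    """
    environ = os.environ if environ is None else environ
    resource_id = environ.get("OCF_RESOURCE_INSTANCE")
    if not resource_id:
        return None
    return run(fail_command(resource_id))
```

The daemon would call `inject_failure()` from whatever code path detects the problem, such as a watchdog timer or an error callback.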
Sample persistent daemon and associated OCF agent
The following example shows a persistent daemon and its associated OCF agent, using a Python script based on the Twisted programming framework. It includes both the server and agent in the same script, although that is not required. This example is specific to Twisted, but it is intended to provide a general idea of what a persistent daemon and agent contain.
#!/usr/bin/python
""" Example persistent daemon and OCF agent for pacemaker clusters """
from __future__ import print_function
__copyright__ = "Copyright 2020 the Pacemaker project contributors"
__license__ = "GNU General Public License version 2 or later (GPLv2+) WITHOUT ANY WARRANTY"
import io
import os
import sys
import signal
import subprocess
class OCFExit(object):
    """ Standard OCF exit status codes """
    OK = 0
    ERROR = 1
    INVALID_PARAM = 2
    UNIMPLEMENTED = 3
    NOT_INSTALLED = 5
    NOT_RUNNING = 7
# This daemon uses Twisted to set up a custom TCP protocol, for simplicity.
# A persistent resource daemon could use any means of communication,
# such as D-Bus, a REST API, etc.
try:
    from twisted.application import service, internet
    from twisted.internet import reactor, protocol
    from twisted.protocols import basic
except ImportError:
    sys.exit(OCFExit.NOT_INSTALLED)
# Default port for daemon to listen on
DEFAULT_PORT = 9999
# Path to twistd (for running as daemon)
TWISTD = "/usr/bin/twistd"
# Path to kill command (for signaling daemon to stop)
KILL = "/bin/kill"
# Where to store daemon process ID
PIDFILE = "/run/food.pid"
# This will have exit code for current monitor operation (if any)
monitor_exit_code = OCFExit.ERROR
def foo_port():
    """ Return port number used by the foo daemon, per the OCF environment """
    try:
        port = int(os.environ['OCF_RESKEY_port'])
    except (KeyError, ValueError):
        port = DEFAULT_PORT
    return port
#
# This sample code implements both the daemon and the OCF resource agent, for
# simplicity, but they could be implemented as separate applications.
#
# Since it is used as an agent, it is expected to be installed under
# /usr/lib/ocf/resource.d for some provider name.
#
#
# Server-side implementation of daemon protocol
#
class FooServerProtocol(basic.LineReceiver):
    """ An example custom daemon protocol over TCP """

    def lineReceived(self, line):
        """ Handle a line of input received on socket """
        # Trivial protocol: anything sent gets a reply with the resource ID.
        # (This is an example of how to get the resource ID, in case the daemon
        # wants to force an asynchronous failure of the resource with
        # "crm_resource --fail --resource <rsc-id>".)
        try:
            rsc = os.environ["OCF_RESOURCE_INSTANCE"]
        except KeyError:
            rsc = "unknown"
        # Twisted sends and receives bytes, so encode the resource ID
        self.sendLine(rsc.encode("utf-8") + b":[" + line + b"]")


class FooServerFactory(protocol.ServerFactory):
    def buildProtocol(self, addr):
        return FooServerProtocol()


class FooServerTCP(internet.TCPServer):
    def __init__(self):
        internet.TCPServer.__init__(self, foo_port(), FooServerFactory(),
                                    interface='127.0.0.1')
#
# Client-side implementation of daemon protocol
#
class FooClient(basic.LineReceiver):
    def connectionMade(self):
        """ Once connection is established, send a 'ping' to server """
        # Twisted sends and receives bytes
        self.sendLine(b"ping")

    def lineReceived(self, line):
        """ If a response is received, the server is alive """
        global monitor_exit_code
        monitor_exit_code = OCFExit.OK
        # We're done, disconnect from server
        self.transport.loseConnection()


class FooClientFactory(protocol.ClientFactory):
    def buildProtocol(self, addr):
        return FooClient()

    def clientConnectionFailed(self, connector, reason):
        """ If we couldn't connect, the server isn't running """
        global monitor_exit_code
        monitor_exit_code = OCFExit.NOT_RUNNING
        reactor.stop()

    def clientConnectionLost(self, connector, reason):
        """ Quit the main loop once we drop the connection """
        reactor.stop()
#
# Handlers for OCF resource agent actions
#
def metadata():
    """ Handle the OCF meta-data action by printing standard OCF meta-data """
    # This sample agent supports a single parameter "port".
    # A real agent could support others, such as the path to the
    # daemon executable, the path to the daemon's configuration file,
    # extra command-line arguments to pass when launching the daemon,
    # a DBus interface or REST URL for the daemon to listen on,
    # authentication tokens, etc.
    print("""<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="food" version="1.0">
  <version>1.0</version>
  <longdesc lang="en">
This is a sample OCF resource agent for interacting with a sample persistent
daemon.
  </longdesc>
  <shortdesc lang="en">Sample OCF daemon client</shortdesc>
  <parameters>
    <parameter name="port" unique="1">
      <longdesc lang="en">
Port number that foo daemon should listen on
      </longdesc>
      <shortdesc lang="en">Port number</shortdesc>
      <content type="string" default="%s" />
    </parameter>
  </parameters>
  <actions>
    <action name="start" timeout="10s" />
    <action name="stop" timeout="10s" />
    <action name="monitor" timeout="10s" interval="10s" depth="0"/>
    <action name="validate-all" timeout="10s" />
    <action name="meta-data" timeout="10s" />
  </actions>
</resource-agent>""" % (DEFAULT_PORT))
    return OCFExit.OK
def validate():
    """ Handle the OCF validate-all action by ensuring port is an integer """
    try:
        int(os.environ['OCF_RESKEY_port'])
    except KeyError:
        pass
    except ValueError:
        return OCFExit.INVALID_PARAM
    return OCFExit.OK
def start():
    """ Handle the OCF start action by running the daemon """
    exit_code = OCFExit.ERROR
    try:
        # Run the foo daemon by invoking twistd, which will import this file
        rc = subprocess.call([TWISTD, "--syslog", "--pidfile", PIDFILE,
                              "--python", os.path.realpath(__file__)])
        if rc == 0:
            # A real agent should wait for its daemon to be fully up and
            # operational, such that a monitor action would return success.
            exit_code = OCFExit.OK
    except OSError:
        pass
    return exit_code
def stop():
    """ Handle the OCF stop action by signaling the daemon """
    if not os.path.isfile(PIDFILE):
        # No PID file should mean daemon is not running, so nothing needed
        return OCFExit.OK

    # Read PID file
    try:
        with io.open(PIDFILE, "rt") as f:
            pid = int(f.readline())
    except (IOError, ValueError):
        return OCFExit.ERROR

    # Is process running?
    try:
        os.kill(pid, 0)
    except OSError:
        # No, it's not, so nothing is needed
        return OCFExit.OK

    # Tell daemon to shut down
    try:
        os.kill(pid, signal.SIGHUP)
    except OSError:
        return OCFExit.ERROR

    # A real agent should wait for the daemon to be fully stopped before
    # continuing, such that a monitor would return "not running".
    return OCFExit.OK
def monitor():
    """ Handle the OCF monitor action by pinging the daemon """
    reactor.connectTCP("127.0.0.1", foo_port(), FooClientFactory())
    reactor.run()
    return monitor_exit_code
def run_resource_agent():
    """ Handle command-line arguments and environment as an OCF resource agent """
    try:
        actions = {
            "meta-data": metadata,
            "validate-all": validate,
            "start": start,
            "stop": stop,
            "monitor": monitor,
        }
        try:
            action = actions[sys.argv[1]]
        except (IndexError, KeyError):
            return OCFExit.UNIMPLEMENTED
        return action()
    except:
        # Catch all uncaught exceptions, since we have to return an OCF code.
        # (A real agent would keep a log or other means of debugging.)
        return OCFExit.ERROR
if __name__ == '__main__':
    # When executed, act as an OCF resource agent
    sys.exit(run_resource_agent())
else:
    # When imported, provide an application object for use with twistd
    application = service.Application("Foo")
    # Named foo_service so that it does not shadow the imported service module
    foo_service = FooServerTCP()
    foo_service.setServiceParent(application)
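
The comments in the sample's start and stop handlers note that a real agent should wait until the daemon is fully up or fully stopped. One possible way to do that is to poll the daemon's TCP port, as in the sketch below. The helper names `port_is_open` and `wait_for` and the timeout values are assumptions for this example and are not part of the sample above.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for(condition, timeout=10.0, interval=0.2):
    """Poll condition() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A start handler could then return success only if `wait_for(lambda: port_is_open("127.0.0.1", foo_port()))` returns True, and a stop handler could wait on the negated condition. Whatever timeout you choose should stay below the action timeout advertised in the agent's meta-data, so that the agent returns its own exit status before Pacemaker gives up on the action.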