Resolving Fluentd journald File Locking Issues
Environment
- Red Hat OpenShift Container Platform
- 3.x
Issue
- Journal files kept open by fluentd on nodes, causing systems to run out of file handles or disk space.
- See bug 1664744
Resolution
- Create a reaper (a cronjob) that deletes a node's fluentd pod when /var/log disk usage exceeds a configurable threshold; the fluentd daemon set then restarts the pod. The default threshold is 75% (see the template below).
- NOTE: There is a single threshold for all nodes. Be sure all nodes are configured such that the normal and expected amount of /var/log disk usage stays below the threshold; otherwise the node's fluentd pod will be restarted continually.
- cronjob template:
apiVersion: v1
kind: Template
metadata:
  name: fluentd-reaper
objects:
- apiVersion: rbac.authorization.k8s.io/v1
  kind: Role
  metadata:
    name: fluentd-reaper
  rules:
  - apiGroups:
    - ""
    resources:
    - pods
    verbs:
    - delete
  - apiGroups:
    - ""
    resources:
    - pods/exec
    verbs:
    - create
- apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    name: fluentd-reaper
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: Role
    name: fluentd-reaper
  subjects:
  - kind: ServiceAccount
    name: aggregated-logging-fluentd
    namespace: ${LOGGING_NAMESPACE}
- apiVersion: batch/v1beta1
  kind: CronJob
  metadata:
    name: fluentd-reaper
    labels:
      provider: openshift
      logging-infra: fluentd-reaper
  spec:
    schedule: "${REAP_SCHEDULE}"
    jobTemplate:
      spec:
        template:
          metadata:
            labels:
              provider: openshift
              logging-infra: fluentd-reaper
          spec:
            serviceAccount: aggregated-logging-fluentd
            serviceAccountName: aggregated-logging-fluentd
            containers:
            - env:
              - name: REAP_THRESHOLD
                value: "${REAP_THRESHOLD_PERCENTAGE}"
              name: cli
              image: ${CLI_IMAGE}
              command: ["/bin/bash", "-c"]
              args:
              - echo "Checking fluentd pods for space issues on /var/log...";
                pods=$(oc get pods -l component=fluentd -o jsonpath={.items[*].metadata.name});
                for p in $pods; do
                  echo "Checking $p...";
                  if ! oc get pod $p | grep -q Running ; then
                    echo "Skipping $p as it is not in a Running state...";
                    continue;
                  fi;
                  space=$(oc exec -c fluentd-elasticsearch $p -- bash -c 'df --output=pcent /var/log | tail -1 | cut -d "%" -f1 | tr -d " "');
                  echo "Capacity $space";
                  if [ $space -gt ${REAP_THRESHOLD_PERCENTAGE} ] ; then
                    echo "Used capacity exceeds threshold. Deleting $p";
                    oc delete pod $p ;
                  fi;
                done;
              restartPolicy: OnFailure
parameters:
- name: CLI_IMAGE
  value: registry.access.redhat.com/openshift3/ose-cli:latest
  description: "The image used to run the reaper script"
- name: REAP_THRESHOLD_PERCENTAGE
  value: "75"
  description: "The maximum /var/log usage (percent) allowed before fluentd is restarted"
- name: REAP_SCHEDULE
  value: "*/30 * * * *"
  description: "The cron schedule on which to check disk capacity"
- name: LOGGING_NAMESPACE
  value: openshift-logging
  description: "The namespace in which logging is deployed"
- Save the template to a file (e.g. cron.yml) and create the objects with:
$ oc process -f cron.yml | oc apply -f -
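The core comparison the cronjob script performs can be exercised locally without a cluster. In the sketch below, should_reap is a hypothetical helper name; in the real job, the usage figure comes from df run inside each fluentd pod via oc exec.

```shell
#!/bin/bash
# Sketch of the reaper's threshold check (hypothetical helper name).
# Takes the df-reported usage percentage and the reap threshold; prints
# "reap" when the pod should be deleted, "ok" otherwise.
should_reap() {
  local used_pcent="$1" threshold="$2"
  # Mirrors the pod-side pipeline:
  #   df --output=pcent /var/log | tail -1 | cut -d "%" -f1 | tr -d " "
  used_pcent="${used_pcent%\%}"     # strip a trailing % if present
  used_pcent="${used_pcent// /}"    # strip spaces
  if [ "$used_pcent" -gt "$threshold" ]; then
    echo reap
  else
    echo ok
  fi
}
```

For example, should_reap " 82%" 75 prints reap, while should_reap "60%" 75 prints ok.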
Root Cause
The ruby systemd code does not call sd_journal_get_fd() immediately after sd_journal_open().
If a client calls sd_journal_get_fd() as soon as possible after calling sd_journal_open(), the window for leaking file descriptors is narrowed significantly, but not closed entirely. The only way to close that gap completely would be a change to the journal APIs so that sd_journal_open() creates the inotify FD at the time of the open.
As a result, restarting the process holding the lock (fluentd) is the only safe and sufficient solution, as fixing the journald APIs in this fashion could introduce a breaking change in RHEL API/ABI compatibility.
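To confirm on a node that a process is accumulating journal file handles, a check along the following lines can help. count_matching_fds is a hypothetical diagnostic helper, not part of the product; run it on the node against the fluentd process ID, and a steadily growing count for /var/log/journal indicates leaked descriptors.

```shell
#!/bin/bash
# Count open file descriptors of a process whose target path matches a
# pattern (hypothetical diagnostic; Linux-only, relies on /proc).
count_matching_fds() {
  local pid="$1" pattern="$2" count=0 link
  for fd in /proc/"$pid"/fd/*; do
    link=$(readlink "$fd" 2>/dev/null) || continue
    case "$link" in
      *"$pattern"*) count=$((count + 1)) ;;
    esac
  done
  echo "$count"
}
```

For example, count_matching_fds <fluentd-pid> /var/log/journal reports how many journal files that process currently holds open.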
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.