Resolving Fluentd journald File Locking Issues

Solution Unverified - Updated 14 Jun 2024

Environment

Red Hat OpenShift Container Platform
- 3.x

Issue

Journal files kept open by fluentd on nodes, causing systems to run out of file handles or disk space.
See bug This content is not included.1664744

Resolution

Create a reaper (cronjob) that causes a node's fluentd pod to be restarted (as a result of the daemon set for fluentd) when /var/log disk usage exceeds a configurable threshold. The default is 75% (see template below).
NOTE: There is a single threshold for all nodes. Be sure all nodes are configured such that the normal and expected amount of /var/log disk usage does not exceed the threshold otherwise the node's fluentd pod will continually be restarted.
cronjob template:

apiVersion: v1
kind: Template
metadata:
  name: fluentd-reaper
objects:
- apiVersion: rbac.authorization.k8s.io/v1
  kind: Role
  metadata:
    name: fluentd-reaper
  rules:
  - apiGroups:
    - ""
    resources:
    - pods
    verbs:
    - delete
  - apiGroups:
    - ""
    resources:
    - pods/exec
    verbs:
    - create
- apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    name: fluentd-reaper
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: Role
    name: fluentd-reaper
  subjects:
  - kind: ServiceAccount
    name: aggregated-logging-fluentd
    namespace: ${LOGGING_NAMESPACE}
- apiVersion: batch/v1beta1
  kind: CronJob
  metadata:
    name: fluentd-reaper
    labels:
       provider: openshift
       logging-infra: fluentd-reaper
  spec:
    schedule: "${REAP_SCHEDULE}"
    jobTemplate:
      spec:
        template:
          metadata:
            labels:
              provider: openshift
              logging-infra: fluentd-reaper
          spec:
            serviceAccount: aggregated-logging-fluentd
            serviceAccountName: aggregated-logging-fluentd
            containers:
            - env:
              - name: REAP_THRESHOLD
                value: "${REAP_THRESHOLD}"
              name: cli
              image: ${CLI_IMAGE}
              command: ["/bin/bash", "-c"]
              args:
                - echo "Checking fluentd pods for space issues on /var/log...";
                  pods=$(oc get pods -l component=fluentd -o jsonpath={.items[*].metadata.name});
                  for p in $pods; do
                    echo "Checking $p...";
                    if ! $(oc get pod $p | grep Running >> /dev/null)  ; then
                      echo "$p as its not in a Running state. Skipping...";
                      continue;
                    fi;
                    space=$(oc exec -c fluentd-elasticsearch $p -- bash -c 'df --output=pcent /var/log | tail -1 | cut -d "%" -f1 | tr -d " "');
                    echo "Capacity $space";
                    if [ $space -gt ${REAP_THRESHOLD_PERCENTAGE} ] ; then
                      echo "Used capacity exceeds threshold. Deleting $p";
                      oc delete pod $p ;
                    fi;
                  done;
            restartPolicy: OnFailure
parameters:
- name: CLI_IMAGE
  value: registry.access.redhat.com/openshift3/ose-cli:latest
  description: "The image to use to execute the reaper script"
- name: REAP_THRESHOLD_PERCENTAGE
  value: "75"
  description: "The max capacity to allow for /var/log before restarting fluentd"
- name: REAP_SCHEDULE
  value: "*/30 * * * *"
  description: "The schedule to check for low disk capacity"
- name: LOGGING_NAMESPACE
  value: openshift-logging
  description: "The schedule to check for low disk capacity"

You can create this using the following:

$ oc process -f cron.yml  | oc apply -f -

Root Cause

The ruby systemd code does not call sd_journal_get_fd() immediately after sd_journal_open().

If a client calls sd_journal_get_fd() as soon as possible after calling sd_journal_open() the window for leaking FDs is closed significantly, but not entirely. The only way to close that gap is with a change to the journal APIs where sd_journal_open() creates the inotify FD at the time of the open.

As a result closing the process holding the lock (fluentd) is the only save and suffiecent solutions as fixing journald api's in this fashion could result in a breaking change in the RHEL API/APB compatability.

SBR

Shift

Product(s)

Red Hat OpenShift Container Platform

Components

Fluentd

Category

Customize or extend

Tags

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.