Troubleshoot options for Data Grid pod crash

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (OCP)
    • 4.x
  • Red Hat Data Grid (RHDG)
    • 8.x

Issue

  • How to debug/troubleshoot DG pods when they crash?
  • What is the exit code of a pod in OCP 4?

Resolution

This solution covers DG 8 pod crashes (e.g. a pod exiting with code 3). For OCP node crashes, see the solution DG 8 operation in case of OCP nodes crashing; for investigating memory usage, see Investigating heap and off heap usage in DG 8 OCP 4.

There are several possible crash scenarios when using the DG Operator, and the difficult part is that the pod will vanish and the data will be gone (when set up as ephemeral).
The scenarios and the steps to debug them are below:

| Issue | Comment | How to debug it |
|---|---|---|
| pods on OOME* | the JVM process will generate a heap dump given HeapDumpOnOutOfMemoryError | set HeapDumpOnOutOfMemoryError and its path, then inspect the heap dump (disable ExitOnOutOfMemoryError) |
| pods removed by cgroups OOM | the container is killed by the cgroups OOM killer | verify and track its memory usage - see "memory limit reached" below |
| pods being evicted | kubelet deduces that the OCP node is under pressure | verify the memory usage of the OCP node |
| pods crash on JVM | the JVM will exit on the spot, create an hs_err file, and can execute one action if OnError is set | set OnError to an action whose result can be saved, or ErrorFile to set the crash file path (to be sure it crashed); verify that the SOS report shows a SIGABRT signal (this signal can come from within the JVM to abort on a fatal error condition) |
| memory limit reached | cgroups or system OOM killer | confirm via dmesg and verify statefulset/deployment logs; see OCP 3 - OOME memory size |
| cpu/cpu limit reached | CPU throttling will act | see CPU Throttling (lack of CPU will result in timeouts when executing CLI commands, but not a kill) and CPU throttling on Data Grid pods in OCP 4 |
| probes not responding | readiness or liveness probe failures can cause pod restarts | see kubelet logs and OCP events: oc get events -n <namespace>; for EAP see this |
| native issues | there is no native heap dump | use jcmd $PID VM.info to see native usage (from native memory tracking and shared libs) |
| cpu limit is too small | threads will keep returning timeouts on CLI commands, like stats | deploy with cpu 1+ |
| HPA issues | horizontal pod autoscaler issues, e.g. HPA can scale pods that have usage below a threshold | see OCP events: oc get events -n <namespace> |

*Note: a native OOME may not generate a heap dump via the HeapDumpOnOutOfMemoryError flag.
Also, lack of resources such as CPU will not cause a crash; instead, CLI commands like stats (via cli.sh -c) will return null and time out. Adding more CPUs prevents those scenarios.

Example

The example below sets GC logs to /tmp/gc.log, OnError (to detect crashes), and HeapDumpOnOutOfMemoryError (to detect OOME) via extraJvmOpts - both files go to the PV and are persisted - for generic troubleshooting use cases:

  spec:
...
    container:
      cpu: '2'
      extraJvmOpts: '-Xlog:gc*=info:file=/tmp/gc.log:time,level,tags,uptimemillis:filecount=10,filesize=1m
              -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/infinispan/server/data -XX:OnError="touch /opt/infinispan/server/data/crash.txt"'
      memory: 3Gi

The above will produce a heap dump - on an OOME event - in /opt/infinispan/server/data, and will create a crash file on a crash, therefore covering both situations (OOME and crash). The crash file is created via -XX:OnError="touch /opt/infinispan/server/data/crash.txt"; alternatively, -XX:ErrorFile with a path under /opt/infinispan/server/data writes the hs_err crash file itself to the PV.

Depending on the signal sent to it, OpenJDK's JVM will generate a crash file and a core dump, per the table below:

| Signal | OnError and crash file creation |
|---|---|
| SIGILL (4) | Yes: SIGILL (0x4) at pc=0x (sent by kill), pid=PID, tid=PID |
| SIGABRT (6) | No |
| SIGBUS (7) | Yes: SIGBUS (0x7) at pc=0x (sent by kill), pid=PID, tid=PID |
| SIGFPE (8) | Yes: SIGFPE (0x8) at pc=0x (sent by kill), pid=PID, tid=PID |
| SIGKILL (9) | No |
| SIGSEGV (11) | Yes: SIGSEGV (0xb) at pc=0x (sent by kill), pid=PID, tid=PID |

Persisting

Both files above (crash file and heap dump) can be written to the PV by setting their paths under /opt/infinispan/server/data; the data persists even if the pod crashes, since that path is backed by the data volume. If the heap dump is too big for the PV, the write is simply aborted and nothing is written to the PV. Also take into consideration that the customer is responsible for setting the right PV size if this is wanted.
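Since an oversized dump is silently aborted, it is worth checking beforehand that the PV can hold a full heap. A minimal sketch of that check, with hypothetical figures (on a live pod, the free space could come from oc exec $podname -- df -m /opt/infinispan/server/data):

```shell
# Hypothetical numbers, not read from a real pod.
heap_mb=3072          # roughly the heap the container settings imply (-Xmx)
pv_free_mb=2048       # e.g. free MB on /opt/infinispan/server/data

if [ "$pv_free_mb" -lt "$heap_mb" ]; then
  echo "PV too small for a full heap dump"
fi
```

With these example values the check reports the PV as too small, which is exactly the case where the dump would be aborted.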

Java's OOM vs Kernel OOM-Killer

A Java OOME is an exception handled inside Java, whereas the kernel's cgroup OOM killer is an outside agent. Consequently, doubling the Infinispan cluster resources (via spec.container.memory) will impact heap and off-heap sizes; for the OOM killer that is not the case - see the solution How to define memory and cpu resource via limits/requests on DG 8 Operator on this matter, particularly given there are two types of OOM kills: system ones (OCP nodes with high usage) and cgroups ones (a process inside the pod with high usage).

For the cgroups OOM kill, the kill won't happen exactly when the pod goes beyond its limit, but rather on the next allocation beyond the limit. That is why users see the cgroups OOM kill happening after the limit is already breached. Finally, the kill won't necessarily be a segfault: the error from malloc is ENOMEM, while a segfault is for illegal memory access, and allocating beyond the limit is not illegal.
So, for instance, a pod with a 6.5G memory limit gets a cgroups OOM kill at 6.65G (already beyond the limit):

rss:6655048KB rss_huge:26624KB mapped_file:4KB swap:0KB inactive_anon:0KB active_anon:6655032KB inactive_file:160KB active_file:28KB unevictable:0KB
[16840841.101964] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
...
[16840841.102159] Memory cgroup out of memory: Kill process 118132 (java) score 1759 or sacrifice child
[16840841.104477] Killed process 117061 (java), UID 1000, total-vm:10787896kB, anon-rss:6178796kB, file-rss:8652kB, shmem-rss:0k

Be aware not to set oom_score_adj to -1000, as that may make the process unkillable:

[3038012.673260] Out of memory and no killable processes... <------------------ not killable process
[3038012.673459] process invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=-1000

In other words, an OOM exception (handled inside the Java process) is not the same as an OOM kill by the kernel's cgroups.
The kernel imposes the containers' limits via cgroups, which can trigger an OOM kill when the process crosses its limits. The process making the triggering allocation might not be the exact offending process, since the trigger is the next allocation, which does not throw a segfault but an OOM kill, as below. Therefore, for those cases, verify the container limits/utilization:

[16840841.104477] Killed process 117061 (java), UID 1000, total-vm:10787896kB, anon-rss:6178796kB, file-rss:8652kB, shmem-rss:0k <----- see anon-rss for memory usage.
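For quick triage, the anon-rss figure can be extracted from such a kill line; a small sketch over the sample message above:

```shell
# Sample kill line from the section above; sed pulls out the anon-rss figure,
# the resident usage that counts against the memory limit.
line='[16840841.104477] Killed process 117061 (java), UID 1000, total-vm:10787896kB, anon-rss:6178796kB, file-rss:8652kB, shmem-rss:0kB'
anon_rss=$(echo "$line" | sed -n 's/.*anon-rss:\([0-9]*\)kB.*/\1/p')
echo "anon-rss: ${anon_rss} kB"
```

The same one-liner works directly on dmesg output from the OCP node.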

Lack of CPU

Lack of memory will cause OOME, but lack of CPU will not cause a crash; instead, the threads will take so long to execute (given the kernel grants them so little CPU time) that CLI commands like $ stats will return null and/or time out - for instance, with 100 millicores of CPU.
Adding more CPUs prevents those scenarios. Note that Java is inelastic: it sizes itself on the CPU limits, not the CPU requests. No production deployment should have fewer than one or two cores.
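The container-aware JVM derives its visible processor count from the cgroup CPU quota, roughly ceil(quota/period) with a minimum of one; a sketch of that arithmetic with hypothetical cgroup v1 values for a 100 millicore limit:

```shell
# Hypothetical cgroup v1 values: a 100m CPU limit is quota 10000us per 100000us period.
quota=10000
period=100000
# ceil(quota/period), minimum 1 - so 100m still reports 1 CPU to the JVM,
# but that "CPU" only gets a tenth of a core's time, hence the timeouts.
cpus=$(awk -v q="$quota" -v p="$period" 'BEGIN { n = int((q + p - 1) / p); if (n < 1) n = 1; print n }')
echo "JVM-visible CPUs: $cpus"
```

This is why a tiny CPU limit looks "fine" in thread counts yet still throttles every CLI call.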

Exit code

| Scenario | Exit code |
|---|---|
| HeapDumpOnOutOfMemoryError | 3 |
| ExitOnOutOfMemoryError | 3 |
| CrashOnOutOfMemoryError | 0 |
| kill -11 PID | 0 |
| Setting Xmx/Xms > container size (plus AlwaysPreTouch) | 137 |

The exit code 3 agrees with JDK-8257790; for more details on exit codes specifically, see the article Red Hat OpenJDK's Exit codes in podman/Openshift crio container.
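As a side note, the 137 in the last row follows the shell convention of 128 plus the signal number (137 = 128 + SIGKILL, the signature of an OOM kill). A quick local sketch of that convention with plain sh processes (not DG pods - container exit codes can differ, as the table above shows):

```shell
# Each child kills itself with a signal; the parent records the exit status.
kill9=$(sh -c 'kill -9 $$';  echo $?)   # 137 = 128 + 9  (SIGKILL, the OOM-kill code)
seg=$(sh -c 'kill -11 $$'; echo $?)     # 139 = 128 + 11 (SIGSEGV)
abrt=$(sh -c 'kill -6 $$';  echo $?)    # 134 = 128 + 6  (SIGABRT)
echo "SIGKILL=$kill9 SIGSEGV=$seg SIGABRT=$abrt"
```

So when a container status shows 137, the process was SIGKILLed - typically by the OOM killer or by kubelet's forced termination.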

Root Cause

This solution comes together with Red Hat build of OpenJDK container awareness for Kubernetes tuning, which specifically talks about container awareness and Kubernetes guidelines.
Other important aspects:

  • Set the Xmx for Infinispan CRs according to the use case; too-low values can cause OOME or make the OOM killer (cgroups or system) act as well
  • There are two types of OOM kills, cgroups and system - the cgroups kill happens if the heap goes above Xmx or if native memory goes above the off-heap allowance (total minus heap percentage). See the solution How to set QoS on DG 8 pods in OCP 4 and Red Hat build of OpenJDK container awareness for Kubernetes tuning.
  • A low infinispan.spec.container.cpu value can cause the JVM to provide just a few threads (given DG and the JVM calculate thread counts based on the CPU limits); it won't cause OOME, however;
  • Due to the merge of Netty threads into the non-blocking thread pool, DG 8.3.x will have more heap and off-heap consumption (since Netty uses native memory); see Differences between DG 8.3 vs DG 8.2. So higher heap/off-heap usage is expected.
  • Note that Xss is the thread stack size, whereas MaxJavaStackTraceDepth is the maximum depth allowed for a Java stack trace. So a really long exception stack trace is truncated at the default MaxJavaStackTraceDepth of 1024 frames.
  • MaxDirectMemorySize can be used to limit the off-heap direct memory size - the maximum amount of allocatable direct buffer memory
  • A readiness probe failure causes the pod to be isolated from the network; if the probe fails from the start and the pod state never changes to success, kubelet will send a signal to terminate it, therefore preventing it from coming up. See What does probe checks of JBoss EAP 7 on OpenShift ?.

Relevant flags for this matter:

| JVM flag | Impact |
|---|---|
| ErrorFileToStdout | Sends the crash report to the pod logs (oc logs) |
| ErrorFile | Sets the path for the crash file (hs_err) |
| HeapDumpPath | Given HeapDumpOnOutOfMemoryError is set, defines where the heap dump is written |
| CrashOnOutOfMemoryError | Crashes the JVM instead of throwing OOME |
| ExitOnOutOfMemoryError | Exits on OutOfMemoryError - to be removed for HeapDumpOnOutOfMemoryError to act |
| OnOutOfMemoryError | Action to run when an OOME happens |
| OnError | Action to run on a crash |
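As an illustration of combining these flags, a hedged CR fragment (assuming the same Infinispan CR schema as the earlier example) that sends the crash report to the pod logs instead of a file, while keeping heap dumps on the PV:

```yaml
spec:
  container:
    extraJvmOpts: '-XX:+ErrorFileToStdout -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/infinispan/server/data'
```

This variant suits clusters without a PV for crash files: the hs_err content lands in oc logs, where log aggregation can pick it up.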

Guidelines for customizing JVM flags

The solution Guidelines for customizing JVM flags in DG 8 images provides guidance on this topic.

Diagnostic Steps

Diagnosing problems:

| Event | How to track the problem |
|---|---|
| Java crash | Verify the crash file (hs_err) and/or the SIGABRT, which comes from the JVM itself |
| OOM kills | For the cgroup OOM killer, see the counter in memory.oom_control |
| probe termination | Usually denotes a stuck JVM (threads) or lack of CPU - verify the container exit details and the namespace's events |

Example OOM Killer - cgroups' oom killer:

oc exec $podname -- grep '^oom_kill ' /sys/fs/cgroup/memory/memory.oom_control
oom_kill 0
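A sketch of reading that counter programmatically, here over simulated memory.oom_control content (on a live pod the same fields come from the oc exec command above):

```shell
# Simulated cgroup v1 memory.oom_control content.
printf 'oom_kill_disable 0\nunder_oom 0\noom_kill 2\n' > /tmp/oom_control.sample

# A non-zero oom_kill counter means the cgroup OOM killer has already acted.
kills=$(awk '$1 == "oom_kill" {print $2}' /tmp/oom_control.sample)
echo "cgroup oom_kill count: $kills"
```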

This is explained in the OCP documentation: Diagnosing an OOM Kill.

dmesg

For investigating both OOMEs (system and cgroups) see dmesg logs from OCP node:

## get OCP nodes
$ oc get nodes
## go onto the node:
$ oc debug node/<node_name>
$ dmesg | grep -i oom

Example cgroup kill:

[424255.249763] Memory cgroup out of memory: Killed process 1322808 (node) total-vm:1725576kB, anon-rss:238168kB, file-rss:45272kB, shmem-rss:52kB, UID:1001040000 pgtables:6280kB oom_score_adj:-997

Interpretation: the kernel's cgroups OOM killer killed process 1322808, which was using 238168 kB of resident memory (anon-rss). The total-vm figure is just address space, not actual usage. This can happen if DG is using more memory than allocated in the container size.

Example oom kill:

[ 6338.084330] splunkd invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ 6338.087427] CPU: 0 PID: 38366 Comm: splunkd Not tainted 4.18.0-477.15.1.el8_8.x86_64 #1

Interpretation: splunkd tried to make an allocation and there was no memory to fulfill it, so it invoked the oom-killer. The process that invokes the oom-killer is generally unimportant; the important piece is the consuming process. The OOM kill is invoked by memory allocations themselves: there is no process monitoring allocations - whenever anything makes an allocation, the memory check runs for the cgroup/applicable pages it is in, and the oom-killer is invoked if the conditions to trigger it are met.

Emphasis: the process with the highest consumption is the one killed, not necessarily the one that crossed the limit. Take into consideration that the triggering allocation might not come from the exact offending process, since the trigger is the next allocation, which produces an OOM kill, not a segfault. Therefore, for those cases, verify the container limits/utilization; the limit applies across all processes in the cgroup, and any of them can trigger the OOM kill.
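Telling the two cases apart can be scripted: cgroup-limit kills carry the "Memory cgroup out of memory" prefix, while node-level events show only "invoked oom-killer". A sketch over the two sample lines from this section:

```shell
# The two sample dmesg lines from the examples above.
cat > /tmp/dmesg.sample <<'EOF'
[424255.249763] Memory cgroup out of memory: Killed process 1322808 (node) total-vm:1725576kB, anon-rss:238168kB
[ 6338.084330] splunkd invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
EOF

# Count only the cgroup-limit kills; the rest are node-level OOM events.
cgroup_kills=$(grep -c 'Memory cgroup out of memory' /tmp/dmesg.sample)
echo "cgroup kills in sample: $cgroup_kills"
```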

termination messages

By default, Kubernetes retrieves termination messages from /dev/termination-log, which can be seen via oc get pod --output=yaml (look at the message field). However, this can be changed to a different file via terminationMessagePath - which cannot be done in the DG 8 Operator.
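For orientation, the termination details surface under the container status in the pod YAML from oc get pod --output=yaml; an illustrative fragment with example values (not taken from a real pod):

```yaml
status:
  containerStatuses:
  - lastState:
      terminated:
        exitCode: 137
        reason: OOMKilled
        message: ...   # populated from /dev/termination-log, when the container wrote one
```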


This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.