Troubleshooting JBoss EAP 8/7 issues in OCP 3/4

Solution Verified - Updated

Environment

  • Red Hat JBoss Enterprise Application Platform (EAP)
    • 8.x
    • 7.x
  • Red Hat OpenShift Container Platform (OCP)
    • 4.x
    • 3.x

Issue

Troubleshooting JBoss EAP 7 issues in OCP 3.x and 4.x
Troubleshooting JBoss EAP 8 issues in OCP 4.x

Resolution

First of all, confirm the image is Red Hat's EAP image.
The steps below assume the image is EAP, and therefore supported by Red Hat:

First step

Collect the namespace's inspect, which is useful for gathering pod logs, deployment info, and services/routes:

$ oc adm inspect ns/$namespace

Note: there is no "get" in this command.

This will bring the pod logs, service, and route details. However, it won't include the EAP Operator's Custom Resources (CRs) or application details such as a JDR report; attach those to the case if requested.
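As a sketch of the collection step (the --dest-dir value and archive name are illustrative, not mandated by the tool), the inspect can be written to a known directory and compressed before attaching it to the case:

```shell
# collect the inspect for the namespace into a local directory
oc adm inspect ns/$namespace --dest-dir=inspect-output

# compress the directory so it can be attached to the support case
tar czf inspect-output.tar.gz inspect-output
```

The --dest-dir flag avoids the auto-generated inspect.local.* directory name, which makes the archive easier to identify on the case.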

Data Collection

Providing the inspect is the recommended approach, given it gathers the pod logs (server logs), pod YAML, deployment YAML, configmap YAML, build config YAML, and all other namespace-scoped resources.
However, if there is a problem with its collection, or no cluster-admin account is available to fetch it, the server log can be retrieved directly:

### server logs == pod logs for EAP 7.3/EAP 7.4:
    $ oc get pod          # list the pods to find the pod name, e.g. podname1
    $ oc logs podname1

On the other hand, GC logs need to be collected manually:

### gc logs:
$ ls /opt/eap/standalone/log 
audit.log  gc.log.0.current <----
$ oc cp $pod_name:/opt/eap/standalone/log/gc.log.0.current ./path/gc.log.0.current

For example, EAP 7.4/EAP 7.3 images do not write a server.log and instead send all logs to standard output, i.e. the pod logs:

  $ oc logs $podname
## example:
$ oc logs POD1
2022-11-29 00:22:47 Launching EAP Server
INFO Configuring JGroups cluster traffic encryption protocol to SYM_ENCRYPT.
WARN Detected missing JGroups encryption configuration, the communication within the cluster WILL NOT be encrypted.
INFO Configuring JGroups discovery protocol to dns.DNS_PING
INFO Using PicketBox SSL configuration.
INFO Access log is disabled, ignoring configuration.
INFO Server started in admin mode, CLI script executed during server boot.
INFO Running jboss-eap-7/eap74-openjdk11-runtime-openshift-rhel8 image, version 7.4.7
...

The inspect will include the pod logs, but it is useful to know that the server logs are the pod logs.

Deployment methods

There are several methods for deploying EAP 7: Deployment, DeploymentConfig (template), and the EAP 7 Operator. Each method has a YAML that defines the image/container placed in the pods.

For DeploymentConfig

Get the DeploymentConfig (in case the EAP application was deployed via template/image):

$ oc describe dc <name>

EAP Operator:

In case the application was deployed via the EAP operator, you will need the custom resources:

$ oc get WildFlyServer $name -o yaml > WildFlyServer.yaml

For JVM issues

To get a thread dump, use jstack or jcmd, which are included in the image. Additionally, VM.info can be useful for container/cgroups investigations and for native memory tracking, given it includes native details.
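As a sketch (the pod name is a placeholder, and VM.info assumes a JDK 11+ image such as the EAP 7.4 OpenJDK 11 images), VM.info can be collected with jcmd:

```shell
# find the Java PID inside the pod
POD=your-pod-name
PID=$(oc exec $POD -- ps aux | grep java | grep -v grep | awk '{print $2}')

# dump VM.info (flags, container/cgroup detection, native details) to a local file
oc exec $POD -- jcmd $PID VM.info > vm_info.txt
```

The resulting vm_info.txt can then be attached to the case alongside the inspect.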

thread dumps

EAP 7.3/EAP 7.4 have jstack and jcmd, so the below options can be used:

### using jstack:
POD=your-pod-name; PID=$(oc exec $POD -- ps aux | grep java | awk '{print $2}'); oc exec $POD -- bash -c "for x in {1..10}; do jstack -l $PID >> /opt/eap/standalone/tmp/jstack.out; sleep 2; done"; oc cp $POD:/opt/eap/standalone/tmp/jstack.out ./jstack.out

Executing the above will generate the output in /opt/eap/standalone/tmp/jstack.out and download it to the current local directory - example output:

$ oc rsh $podname
sh-4.2$ head /opt/eap/standalone/tmp/jstack.out 
2022-12-14 02:39:29
Full thread dump OpenJDK 64-Bit Server VM (25.312-b07 mixed mode):
"Attach Listener" #190 daemon prio=9 os_prio=0 tid=0x000055c705dd0000 nid=0x45e waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
   Locked ownable synchronizers:
        - None
"Thread-20 (org.apache.activemq.artemis.core.remoting.impl.invm.InVMConnector)" #189 prio=5 os_prio=0 tid=0x000055c707f71000 nid=0x447 waiting on condition [0x00007fde587a3000]

Otherwise, if the image does not have a JDK (and therefore no jmap/jstack tools), use kill -3 $PID with the PID of the Java process. Example:

$ oc exec $podname -- bash -c "for x in {1..10}; do kill -3 $PID; sleep 10; done" 

The above requires the pod name and the PID of the process, and the output goes straight to the pod logs. One option to capture threads without the PID is jcmd /opt/eap/jboss-modules.jar Thread.print; however, this generates only one thread dump at the console.
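Building on that, a sketch of collecting a series of thread dumps without knowing the PID, by passing the main class (jboss-modules.jar) to jcmd and redirecting to a local file (the pod name and counts are illustrative):

```shell
# take 10 thread dumps, 2 seconds apart, appending them to a local file
POD=your-pod-name
for x in $(seq 1 10); do
  oc exec $POD -- jcmd /opt/eap/jboss-modules.jar Thread.print >> threads.out
  sleep 2
done
```

A series of dumps taken a few seconds apart is usually more useful than a single one, since it shows whether threads are stuck or progressing.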

Heap dump

To get a heap dump, use jmap or jcmd:

POD=your-pod-name; oc exec $POD -- jmap -J-d64 -dump:format=b,file='/opt/eap/standalone/tmp/heap.hprof' $(oc exec $POD -- ps aux | grep java | awk '{print $2}'); oc cp $POD:/opt/eap/standalone/tmp/heap.hprof ./heap.hprof

The above assumes the image has jmap (i.e. a JDK image); otherwise the heap dump cannot be taken this way. Alternatives can be found here.
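If jmap is missing but jcmd is available, a sketch of an alternative using jcmd's GC.heap_dump command (pod name and paths are illustrative):

```shell
# find the Java PID, dump the heap inside the pod, then copy it locally
POD=your-pod-name
PID=$(oc exec $POD -- ps aux | grep java | grep -v grep | awk '{print $2}')
oc exec $POD -- jcmd $PID GC.heap_dump /opt/eap/standalone/tmp/heap.hprof
oc cp $POD:/opt/eap/standalone/tmp/heap.hprof ./heap.hprof
```

Note the target directory must be writable by the EAP user and have enough free space for a file roughly the size of the used heap.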

Setting Memory limits

Use one of the methods below, taking their scopes into consideration:

Mechanism          Scope                                           Reference
deployment-config  Limit of each pod                               Defined via request:limit - on Operator CR, or pod definition
limit-range        Limits allowed in each pod (i.e. in the         Defined via kind: LimitRange - reference on limit-range
                   deployment config)
resource-quota     Allowed sum of limits for all pods of the       Defined via kind: ResourceQuota
                   project

This means that if the limit-range allows 2GB RAM, one can set a 512MB, 1GB, or 2GB limit on the deployment config, but not a 3GB limit, because the limit-range takes precedence and will block it.
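As an illustrative sketch (the resource name and sizes below are hypothetical), a LimitRange capping each container in the namespace at 2GB would look like:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range        # hypothetical name
spec:
  limits:
    - type: Container
      max:
        memory: 2Gi            # no container may set a limit above this
      default:
        memory: 1Gi            # limit applied when the container sets none
```

With this in place, a deployment config requesting a 3Gi limit would be rejected at pod creation time.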

For network issues

To get a tcpdump, use nsenter from the node running the pod (make sure tcpdump and nsenter are available on it). For example, to list the pod's network interfaces from a node debug shell:

# nsenter -n -t $pid -- chroot /host ip a

How to use tcpdump inside OpenShift v4 Pod
How to use tcpdump or other system-level commands inside OpenShift v3 pod?
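A sketch of the OCP 4 flow described in the solutions above: debug the node running the pod, find the container's PID, then run tcpdump inside its network namespace (the container name, interface, and output path below are illustrative):

```shell
# from a node debug shell: oc debug node/<node-name>, then chroot /host
crictl ps --name <container-name>                      # find the container ID
PID=$(crictl inspect --output json <container-id> | jq .info.pid)

# capture traffic in the container's network namespace
nsenter -n -t $PID -- tcpdump -i eth0 -w /var/tmp/capture.pcap
```

Copy the resulting pcap off the node (e.g. with oc cp from the debug pod) before closing the debug session.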

Java's OOM vs Kernel OOM-Killer

As explained in the solution Troubleshoot options for Data Grid pod crash, a Java OOME is an exception handled inside Java, whereas the kernel's cgroup OOM-Killer is an outside agent. Consequently, doubling the Infinispan cluster resources (via spec.container.memory) will impact heap and off-heap sizes, but for the OOM-Killer that is not the case; see the solution How to define memory and CPU resource via limits/requests on DG 8 Operator on this matter, particularly given there are two types of OOM kills: system ones (OCP nodes with high usage) and cgroups ones (a process inside the pod with high usage).

For the cgroups OOM kill, the kill won't happen exactly when the pod goes beyond its limits, but rather on the next allocation beyond the limit. That is why users see the cgroups OOM kill happen after the limit is already breached. Finally, the kill won't necessarily be a segfault: the error from malloc is ENOMEM, not a segfault, which is reserved for illegal memory access, and allocating beyond the limit is not illegal.
For instance, a pod with a 6.5G memory limit gets a cgroups OOM kill at 6.65G (already beyond the limit):

rss:6655048KB rss_huge:26624KB mapped_file:4KB swap:0KB inactive_anon:0KB active_anon:6655032KB inactive_file:160KB active_file:28KB unevictable:0KB
[16840841.101964] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
...
[16840841.102159] Memory cgroup out of memory: Kill process 118132 (java) score 1759 or sacrifice child
[16840841.104477] Killed process 117061 (java), UID 1000, total-vm:10787896kB, anon-rss:6178796kB, file-rss:8652kB, shmem-rss:0k

Be aware not to set oom_score_adj to -1000, as that marks the process as not killable:

[3038012.673260] Out of memory and no killable processes... <------------------ not killable process
[3038012.673459] process invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=-1000

In other words, an OOM Exception (handled inside the Java process) is not the same as an OOM kill by the kernel's cgroup.
The kernel imposes the containers' limits via cgroups, which can trigger an OOM kill when the process crosses its limits. The killed process might not be the exact offender, given the kill is triggered on the next allocation; it does not throw a segfault but an OOM kill as below. For those cases, verify the container limits/utilization:

[16840841.104477] Killed process 117061 (java), UID 1000, total-vm:10787896kB, anon-rss:6178796kB, file-rss:8652kB, shmem-rss:0k <----- see anon-rss for memory usage.

Java Crash

For Java crashes, i.e. an abrupt exit of the Java process inside the container producing an hs_err file, there are a few options: usually, save the crash file via ErrorFile (to a persistent volume, so it can be investigated), or print the crash output to the console log via ErrorFileToStdout. Another option is to run a specific action via the OnError flag. After collecting, analyze the problem following Interpreting JVM crash file in OpenJDK/OracleJDK.
See the table below with the flags and its purposes:

Flag               Action/Purpose
ErrorFileToStdout  Send the crash output to the pod logs (oc logs)
ErrorFile          Set the path for the crash file (hs_err)
OnError            Run a given action when a crash occurs

Example of OnError flag

### custom resource:
spec:
  extraJvmOpts: -XX:OnError="touch /opt/infinispan/server/data/crash.txt"
### deployment:
JAVA_OPTS_APPEND: -XX:OnError="touch /opt/infinispan/server/data/crash.txt" -XX:ErrorFile=/opt/infinispan/server/data

Additional references

Troubleshooting OpenShift Container Platform 3.x: Middleware Containers
OCP Developer Guide

Root Cause

Namespace's Inspect

On OCP 4, the namespace's inspect will contain all the pod logs from the namespace, plus the service, statefulset, and configmap YAMLs. So it can be used to compare the service/pod/statefulset labels properly.

About Must-Gather (mg) and sosreport

For application issues usually, the inspect and application-specific data (thread dump, heap dump, VM.info, gc logs) should be enough to troubleshoot most of the use cases.
However, in some instances where the OCP cluster has issues, the must-gather can be useful:

$ oc adm must-gather

This is not application specific; it applies when the OCP cluster is misbehaving for all (or several) applications.
must-gather pulls cluster and namespace objects and logs for analysis from every project that is managed by OpenShift (openshift-* namespaces) and provides a top-level view of the cluster's current state.

Finally, in case the OCP node is experiencing cgroups/system OOM kills, the sosreport can be useful for dmesg investigation: which pod/container is having crashes/OOM kills.
sosreport pulls journal logs, config files, container logs and kernel information from a specific node in the cluster, which can be useful for understanding discrepant behavior that affects only one or a few host nodes and not the entire cluster.
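A sketch of collecting a sosreport from an OCP 4 node via a debug pod and toolbox (the node name is a placeholder, and the crio plugin options are the ones commonly suggested for container investigations):

```shell
oc debug node/<node-name>            # start a debug pod on the node
chroot /host                         # switch to the host filesystem
toolbox                              # launch the support tools container
sosreport -k crio.all=on -k crio.logs=on   # generate the report, including CRI-O data
```

The generated archive path is printed at the end of the run; copy it off the node and attach it to the case.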

Issues:

Issue                                                          What to collect
Problem with deploying or building EAP 7 app in OCP 4          Inspect
Verifying JVM settings, VM flags, container details            VM.info
Problem application being slow or OOM                          Inspect and thread dumps/gc logs
Problem gc collection/heap dump/high CPU inside the container  Inspect and thread dumps/gc logs
Problem with OCP Nodes crashing/OOM Kills in the pods          Sosreport
Problem with OCP networking components                         Must-gather
Problem with probes killing the pods                           Increase CPU, collect thread dumps, delay probes (can be removed in any case)

For SIGTERM/SIGKILL, see the solution JBossAS process (...) received TERM signal in OpenShift.

Scenario 1: OutOfMemoryError

“2021-07-23 14:16:10
...
org.eclipse.jetty.server.HttpChannel-/account-query java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached”

Scenario 2: Modify JVM options

See methods vs scope table above, and consider the following options:
Option 1: Update the template
Option 2: Update the DeploymentConfig
Option 3: Use oc set env to modify the DeploymentConfig

Use JAVA_OPTS_APPEND environment variable or GC_CONTAINER_OPTIONS environment variable.
See solution How to modify JVM options (JAVA_OPTS) for JBoss EAP in Openshift and How to change JVM memory options using Red Hat JBoss EAP image for Openshift.
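For example (the resource name and JVM options below are illustrative), JAVA_OPTS_APPEND can be set on an existing DeploymentConfig without editing its YAML:

```shell
# append JVM options; the EAP launch script picks up JAVA_OPTS_APPEND at start-up
oc set env dc/eap-app JAVA_OPTS_APPEND="-Xms512m -Xmx1024m"

# verify the variable was applied to the DeploymentConfig
oc set env dc/eap-app --list | grep JAVA_OPTS_APPEND
```

Changing the environment variable triggers a new rollout of the DeploymentConfig, so the pods restart with the new options.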

Scenario 3: OOME Killer

Verify the deployment configuration and the flags JAVA_INITIAL_MEM_RATIO and JAVA_MAX_MEM_RATIO, which set the percentage of the container memory used for the heap.
See solution Java Pod OOME Killer after migration from OCP 3.9 to OCP 3.11
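A sketch of the relevant fragment in the pod/deployment spec (the values are illustrative): with JAVA_MAX_MEM_RATIO=50 and a 2Gi container limit, the maximum heap would be around 1Gi, leaving headroom for metaspace, threads, and other native memory:

```yaml
env:
  - name: JAVA_INITIAL_MEM_RATIO
    value: "25"    # initial heap: 25% of the container memory limit
  - name: JAVA_MAX_MEM_RATIO
    value: "50"    # maximum heap: 50% of the container memory limit
```

Setting the max ratio too high is a common cause of cgroups OOM kills, since the non-heap memory pushes the total above the container limit.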

Scenario 4: Where are the server.logs in EAP 7.3/EAP 7.4?

EAP 7.3/7.4 doesn't have a server.log; the server log goes to the pod logs, since the standard output goes to the console. So the server logs are the pod logs: oc logs $podname. This is expected: our products are moving toward writing their logs directly to the pod logs rather than to a separate file.
Example default EAP 7.3 configuration:

### default EAP 7.3 configuration:
        <subsystem xmlns="urn:jboss:domain:logging:8.0">
            <console-handler name="CONSOLE">
                <formatter>
                    <named-formatter name="COLOR-PATTERN"/>
                </formatter>
            </console-handler>
...
            <root-logger>
                <level name="INFO"/>
                <handlers>
                    <handler name="CONSOLE"/>
                </handlers>
            </root-logger>

Scenario 5: Clustering issues

EAP 7 with JGroups is not clustering even though the service's name is correct (FQDN or just the service name).
For clustering issues, see EAP 7 image clustering in OCP 4.

Scenario 6: Set GC logs for each pod (or each pod's name) in EAP deployed via DC

EAP 7 can be deployed via DeploymentConfig, which can read the pod name and pod namespace. To set GC logs per pod name:

env:
  - name: POD_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.name
  - name: JAVA_OPTS_APPEND
    value: >-
      -Xlog:gc*:file=/opt/eap/standalone/log/$POD_NAME.log:time,uptimemillis:filecount=5,filesize=3M

Scenario 7: OCP Cluster issues

In case the OCP cluster is misbehaving, collecting the must-gather can be useful: $ oc adm must-gather
This includes - but is not limited to - networking issues in the cluster for example.

Diagnostic Steps

  1. For pod logs: oc logs $podname - but don't send pod logs, prefer sending the inspect file.
  2. For getting deploymentConfig: oc get dc; for getting the deployment yaml file: oc get deployment
  3. For clustering issues, see EAP 7 image clustering in OCP 4.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.