Troubleshooting JBoss EAP 8/7 issues in OCP 3/4
Environment
- Red Hat JBoss Enterprise Application Platform (EAP)
- 8.x
- 7.x
- Red Hat OpenShift Container Platform (OCP)
- 4.x
- 3.x
Issue
Troubleshooting JBoss EAP 7 issues in OCP 3.x and 4.x
Troubleshooting JBoss EAP 8 issues in OCP 4.x
Resolution
First of all, confirm the image is Red Hat's EAP image.
The steps below assume the image is EAP, and therefore supported by Red Hat:
First step
Collect the namespace's inspect, which gathers pod logs, deployment info, and services/routes:
$ oc adm inspect ns/$namespace
Note the command is oc adm inspect - there is no get here.
This will bring the pod logs and the service and route details. However, it will not include the EAP Operator Custom Resources (CRs) nor application-level data such as a JDR report; attach those to the case if requested.
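As a sketch, assuming `$namespace` holds the project name, the inspect can be collected into a named directory (`--dest-dir` is a standard `oc adm inspect` flag) and compressed before attaching it to the case:

```shell
# Collect the inspect into a named directory ($namespace is a placeholder)
oc adm inspect ns/$namespace --dest-dir=inspect.$namespace

# Compress the directory before attaching it to the support case
tar czf inspect.$namespace.tar.gz inspect.$namespace
```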
Data Collection
Providing the inspect is the recommended approach, given it gathers the pod logs (server logs), pod YAML, deployment YAML, ConfigMap YAML, build config YAML, and all other namespace-scoped resources.
However, if there is a problem with its collection, or no admin account is available to fetch it, the server log can be retrieved via:
### server logs == pod logs for EAP 7.3/EAP 7.4:
$ oc get pod
podname1
$ oc logs $podname1
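When several pods are involved, a small loop (a sketch; run it in the target namespace) saves each pod's log to its own file:

```shell
# Save the log of every pod in the current namespace to an individual file
for pod in $(oc get pod -o name); do
  # $pod looks like "pod/<name>"; strip the prefix for the file name
  oc logs "$pod" > "${pod#pod/}.log"
done
```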
GC logs, on the other hand, need to be collected manually:
### gc logs:
$ ls /opt/eap/standalone/log
audit.log gc.log.0.current <----
$ oc cp $pod_name:/opt/eap/standalone/log/gc.log.0.current ./path/gc.log.0.current
For example, the EAP 7.4/EAP 7.3 images do not write a server.log; instead, all logs go to standard output, i.e. the pod logs:
$ oc logs $podname
## example:
$ oc logs POD1
2022-11-29 00:22:47 Launching EAP Server
INFO Configuring JGroups cluster traffic encryption protocol to SYM_ENCRYPT.
WARN Detected missing JGroups encryption configuration, the communication within the cluster WILL NOT be encrypted.
INFO Configuring JGroups discovery protocol to dns.DNS_PING
INFO Using PicketBox SSL configuration.
INFO Access log is disabled, ignoring configuration.
INFO Server started in admin mode, CLI script executed during server boot.
INFO Running jboss-eap-7/eap74-openjdk11-runtime-openshift-rhel8 image, version 7.4.7
...
The inspect should already include the pod logs, but it is useful to know that the server logs are the pod logs.
Deployment methods:
There are several methods for deploying EAP 7: Deployment, DeploymentConfig (template), and the EAP 7 Operator. Each method has a YAML definition that determines the image/container inserted in the pods.
For DeploymentConfig
Get the DeploymentConfig (in case the EAP application was deployed via template/image):
$ oc describe dc <name>
EAP Operator:
In case the application was deployed via the EAP operator, you will need the custom resources:
$ oc get WildFlyServer $name -o yaml > WildFlyServer.yaml
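If there are several CRs, a sketch to export every WildFlyServer in the namespace into one file (the output file name is illustrative):

```shell
# Export all WildFlyServer custom resources in the current namespace
for cr in $(oc get wildflyserver -o name); do
  oc get "$cr" -o yaml >> wildflyserver-crs.yaml
  echo '---' >> wildflyserver-crs.yaml   # YAML document separator
done
```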
For JVM issues
To get a thread dump, use jstack or jcmd, which are included in the image. Finally, jcmd's VM.info can be useful for container/cgroup investigations, but also for native memory tracking, given it includes native-memory details.
thread dumps
EAP 7.3/EAP 7.4 have jstack and jcmd, so the below options can be used:
### using jstack:
POD=your-pod-name; PID=$(oc exec $POD -- ps aux | grep java | awk '{print $2}'); oc exec $POD -- bash -c "for x in {1..10}; do jstack -l $PID >> /opt/eap/standalone/tmp/jstack.out; sleep 2; done"; oc cp $POD:/opt/eap/standalone/tmp/jstack.out ./jstack.out
Executing the above generates the output in /opt/eap/standalone/tmp/jstack.out and downloads it to the current local directory - example output:
$ oc rsh $podname
sh-4.2$ head /opt/eap/standalone/tmp/jstack.out
2022-12-14 02:39:29
Full thread dump OpenJDK 64-Bit Server VM (25.312-b07 mixed mode):
"Attach Listener" #190 daemon prio=9 os_prio=0 tid=0x000055c705dd0000 nid=0x45e waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE
Locked ownable synchronizers:
- None
"Thread-20 (org.apache.activemq.artemis.core.remoting.impl.invm.InVMConnector)" #189 prio=5 os_prio=0 tid=0x000055c707f71000 nid=0x447 waiting on condition [0x00007fde587a3000]
Otherwise, in case the image does not have a JDK (and therefore no jmap/jstack tools), use kill -3 $PID with the PID of the Java process. Example:
$ oc exec $podname -- bash -c "for x in {1..10}; do kill -3 $PID; sleep 10; done"
The above requires the pod name and the PID of the process, and the output goes straight to the pod logs. One option to capture threads without knowing the PID is jcmd /opt/eap/jboss-modules.jar Thread.print; however, this generates only one thread dump at the console.
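A sketch combining that jcmd form with a loop, so a series of dumps is captured without knowing the PID (the /tmp output path is an assumption; adjust as needed):

```shell
POD=your-pod-name   # placeholder
# Ten thread dumps, 5 seconds apart, appended to one file inside the pod;
# jcmd addresses the JVM by its main JAR instead of by PID
oc exec $POD -- bash -c 'for x in {1..10}; do jcmd /opt/eap/jboss-modules.jar Thread.print >> /tmp/threads.out; sleep 5; done'
# Download the result locally
oc cp $POD:/tmp/threads.out ./threads.out
```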
Heap dump
To get a heap dump, use jmap or jcmd:
POD=your-pod-name; oc exec $POD -- jmap -J-d64 -dump:format=b,file='/opt/eap/standalone/tmp/heap.hprof' $(oc exec $POD -- ps aux | grep java | awk '{print $2}'); oc rsync $POD:/opt/eap/standalone/tmp/heap.hprof .
The above assumes the image has jmap (i.e. a JDK image); otherwise the heap dump cannot be taken this way. Alternatives can be found here.
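One such alternative on a JDK image is `jcmd GC.heap_dump` (a standard jcmd command); the file path below is illustrative:

```shell
POD=your-pod-name   # placeholder
# Resolve the Java PID inside the pod (PID is column 2 of ps aux)
PID=$(oc exec $POD -- ps aux | grep '[j]ava' | awk '{print $2}')
# GC.heap_dump writes an hprof file inside the pod
oc exec $POD -- jcmd $PID GC.heap_dump /opt/eap/standalone/tmp/heap.hprof
# Download it locally
oc cp $POD:/opt/eap/standalone/tmp/heap.hprof ./heap.hprof
```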
Setting Memory limits
Use one of the methods below, taking their scopes into consideration:
| Mechanism | Scope | Reference |
|---|---|---|
| deployment-config | Limit of each pod | Defined via resources requests/limits on the Operator CR or pod definition |
| limit-range | Limits allowed for each pod (i.e. in the deployment config) | Defined via kind: LimitRange |
| resource-quota | Allowed sum of limits for all pods of the project | Defined via kind: ResourceQuota |
This means that if the limit-range allows 2GB RAM, one can set a 512MB, 1GB, or 2GB limit on the deployment config, but not a 3GB limit, because the limit-range takes precedence and will block it.
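Before raising a limit, it can help to check what the project enforces - a sketch, to be run in the target namespace:

```shell
# Per-pod/container constraints (kind: LimitRange)
oc get limitrange -o yaml

# Sum of limits across the whole project (kind: ResourceQuota), with usage vs hard limits
oc describe resourcequota
```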
For network issues
To get a tcpdump, use nsenter from the node (make sure that tcpdump and nsenter are available there):
# nsenter -n -t $pid -- chroot /host ip a
How to use tcpdump inside OpenShift v4 Pod
How to use tcpdump or other system-level commands inside OpenShift v3 pod?
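Following those solutions, the usual pattern on OCP 4 is to debug the node, resolve the container's PID with `crictl`, and run `tcpdump` inside its network namespace - a sketch, where `$container_id`, the interface, and the output path are placeholders:

```shell
# From: oc debug node/<node>, then: chroot /host
# Resolve the PID of the target container
pid=$(crictl inspect --output json $container_id | jq '.info.pid')
# Capture traffic from that container's network namespace
nsenter -n -t $pid -- tcpdump -nn -i any -w /tmp/$container_id.pcap
```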
Java's OOM vs Kernel OOM-Killer
As explained in the solution Troubleshoot options for Data Grid pod crash, a Java OOME is an exception handled inside Java, whereas the kernel's cgroup OOM Killer is an outside agent. Consequently, doubling the Infinispan cluster resources (via spec.container.memory) will impact heap and off-heap sizes, but that is not the case for the OOM Killer; see the solution How to define memory and CPU resource via limits/requests on DG 8 Operator on this matter - particularly given there are two types of OOM kills: system ones (OCP nodes with high usage) and cgroup ones (a process inside the pod with high usage).
For the cgroup OOM kill, the kill won't happen exactly when the pod goes beyond its limits, but rather on the next allocation beyond the limit. That is why users see the cgroup OOM kill happening after the limit has already been breached. Finally, the kill won't necessarily be a segfault: the error from malloc is ENOMEM, whereas a segfault is for illegal operations, and allocating beyond the limits is not illegal.
So, for instance, a pod with a 6.5GB memory limit gets a cgroup OOM kill at 6.65GB (already beyond the limit):
rss:6655048KB rss_huge:26624KB mapped_file:4KB swap:0KB inactive_anon:0KB active_anon:6655032KB inactive_file:160KB active_file:28KB unevictable:0KB
[16840841.101964] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
...
[16840841.102159] Memory cgroup out of memory: Kill process 118132 (java) score 1759 or sacrifice child
[16840841.104477] Killed process 117061 (java), UID 1000, total-vm:10787896kB, anon-rss:6178796kB, file-rss:8652kB, shmem-rss:0k
Be aware not to set oom_score_adj to -1000, which marks the process as not killable:
[3038012.673260] Out of memory and no killable processes... <------------------ not killable process
[3038012.673459] process invoked oom-killer: gfp_mask=0x6000c0(GFP_KERNEL), order=0, oom_score_adj=-1000
In other words, an OOM Exception (handled inside the Java process) is not the same as an OOM kill by the kernel's cgroup.
The kernel imposes the containers' limits via cgroups, which can trigger an OOM kill when the process crosses its limits. The killed process might not be the exact offender, given the trigger is the next allocation; it does not throw a segfault but an OOM kill as below. Therefore, for those cases, verify the container limits/utilization:
[16840841.104477] Killed process 117061 (java), UID 1000, total-vm:10787896kB, anon-rss:6178796kB, file-rss:8652kB, shmem-rss:0k <----- see anon-rss for memory usage.
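To compare usage against the cgroup limit from inside the pod, the memory cgroup files can be read directly (cgroup v1 paths shown; cgroup v2 uses /sys/fs/cgroup/memory.max and memory.current instead):

```shell
# Inside the pod (oc rsh $podname):
cat /sys/fs/cgroup/memory/memory.limit_in_bytes   # the enforced limit
cat /sys/fs/cgroup/memory/memory.usage_in_bytes   # current usage
# Breakdown (rss, mapped_file, ...), matching the fields seen in the dmesg lines above
grep -E 'rss|mapped_file' /sys/fs/cgroup/memory/memory.stat
```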
Java Crash
For Java crashes, i.e. an abrupt termination of the Java process inside the container that creates an hs_err file, there are a few options: usually, save the crash file via ErrorFile (to a persistent volume, to be investigated), or verify the crash output on the console log via ErrorFileToStdout. Another option is to set a specific action via the OnError flag. After collecting, verify the problem following Interpreting JVM crash file in OpenJDK/OracleJDK.
See the table below with the flags and their purposes:
| Flag | Action/Purpose |
|---|---|
| ErrorFileToStdout | Sends the crash output to the pod logs (oc logs) |
| ErrorFile | Sets the path for the crash file (hs_err) |
| OnError | Runs a command when the JVM crashes |
Example of OnError flag
### custom resource:
spec:
extraJvmOpts: -XX:OnError="touch /opt/infinispan/server/data/crash.txt"
### deployment:
JAVA_OPTS_APPEND: -XX:OnError="touch /opt/infinispan/server/data/crash.txt" -XX:ErrorFile=/opt/infinispan/server/data
Additional references
Troubleshooting OpenShift Container Platform 3.x: Middleware Containers
OCP Developer Guide
Root Cause
Namespace's Inspect
On OCP 4, the namespace's inspect will have all the pod logs from the namespace, plus the service, statefulset, and configmap YAML. So it can be used to compare the service/pod/statefulset labels properly.
About Must-Gather (mg) and sosreport
For application issues, the inspect and application-specific data (thread dump, heap dump, VM.info, GC logs) are usually enough to troubleshoot most use cases.
However, in some instances where the OCP cluster has issues, the must-gather can be useful:
$ oc adm must-gather
This is not application specific; it is for when the OCP cluster is misbehaving for all (or several) applications.
must-gather pulls cluster and namespace objects and logs for analysis from every project that is managed by openshift (openshift-* namespaces) and provides a top-level view of the cluster's current state
Finally, in case the OCP node is experiencing cgroup/system issues, the sosreport can be useful for dmesg investigation: which pod/container is having crashes/OOM kills.
sosreport pulls journal logs, config files, container logs and kernel information from a specific node in the cluster, which can be useful for understanding discrepant behavior that affects only one or a few host nodes and not the entire cluster.
Issues:
| Issue | What to collect |
|---|---|
| Problem with deploying or building EAP 7 app in OCP 4 | Inspect |
| Verifying JVM settings, VM flags, container details | VM.info |
| Problem application being slow or OOM | Inspect and thread dumps/gc logs |
| Problem gc collection/heap dump/high CPU inside the container | Inspect and thread dumps/gc logs |
| Problem with OCP Nodes crashing/OOM Kills in the pods | Sosreport |
| Problem with OCP networking components | Must-gather |
| Problem with probes killing the pods | Increase CPU, collect thread dumps, delay the probes (they can be removed in any case) |
For SIGTERM/SIGKILL, see the solution JBossAS process (...) received TERM signal in OpenShift.
Scenario 1: OutOfMemoryError
"2021-07-23 14:16:10
...
org.eclipse.jetty.server.HttpChannel-/account-query java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached"
Scenario 2: Modify JVM options
See the mechanisms-vs-scope table above, and consider the following options:
Option 1: Update the template
Option 2: Update the DeploymentConfig
Option 3: Use oc set env to modify the DeploymentConfig
Use the JAVA_OPTS_APPEND environment variable or the GC_CONTAINER_OPTIONS environment variable.
See solution How to modify JVM options (JAVA_OPTS) for JBoss EAP in Openshift and How to change JVM memory options using Red Hat JBoss EAP image for Openshift.
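For example, with `oc set env` (`$name` is a placeholder for the DeploymentConfig name; the values are illustrative):

```shell
# Append explicit JVM options (triggers a new rollout of the DeploymentConfig)
oc set env dc/$name JAVA_OPTS_APPEND='-Xms512m -Xmx1024m'

# Or size the heap as a percentage of the container memory limit
oc set env dc/$name JAVA_INITIAL_MEM_RATIO=25 JAVA_MAX_MEM_RATIO=50
```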
Scenario 3: OOME Killer
Verify the Deployment Configuration and the flags JAVA_INITIAL_MEM_RATIO and JAVA_MAX_MEM_RATIO, which set the percentage of container memory used for the heap.
See solution Java Pod OOME Killer after migration from OCP 3.9 to OCP 3.11
Scenario 4: Where are the server.logs in EAP 7.3/EAP 7.4?
EAP 7.3/7.4 does not write a server.log; the server log goes to the pod logs, as the standard output goes to the console. So the server.log content is the pod logs: oc logs $podname. This is expected: our products are moving towards writing their logs directly to the pod logs rather than to a separate file.
Example default EAP 7.3 configuration:
### default EAP 7.3 configuration::
<subsystem xmlns="urn:jboss:domain:logging:8.0">
<console-handler name="CONSOLE">
<formatter>
<named-formatter name="COLOR-PATTERN"/>
</formatter>
</console-handler>
...
<root-logger>
<level name="INFO"/>
<handlers>
<handler name="CONSOLE"/>
</handlers>
</root-logger>
Scenario 5: Clustering issues
EAP 7 with JGroups is not clustering even though the service's name is correct (FQDN or just the service name).
For clustering issues, see EAP 7 image clustering in OCP 4.
Scenario 6: Set GC logs for each pod (or each pod's name) in EAP deployed via DC
EAP 7 can be deployed via a DeploymentConfig, which can read the pod name and pod namespace via the downward API. To set GC logs named after each pod:
env:
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: JAVA_OPTS_APPEND
value: >-
-Xlog:gc*:file=/opt/eap/standalone/log/$POD_NAME.log:time,uptimemillis:filecount=5,filesize=3M
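After the pods roll out with the variable above, a quick check (a sketch) that each pod writes its own GC log file:

```shell
# List the log directory of each pod; expect one <pod-name>.log per pod
for pod in $(oc get pod -o name); do
  echo "== ${pod#pod/}"
  oc exec "${pod#pod/}" -- ls /opt/eap/standalone/log
done
```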
Scenario 7: OCP Cluster issues
In case the OCP cluster is misbehaving, collecting the must-gather can be useful: $ oc adm must-gather
This includes, but is not limited to, networking issues in the cluster.
Diagnostic Steps
- For pod logs: oc logs $podname - but don't send pod logs, prefer sending the inspect file.
- For getting the DeploymentConfig: oc get dc; for getting the deployment YAML file: oc get deployment
- For clustering issues, see EAP 7 image clustering in OCP 4.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.