Recommended preparation for typical JBoss troubleshooting data points

Production environments running JBoss EAP and its layered products may experience sudden critical issues that are fairly common but require certain data points beyond a typical server log. These include slowness, full hangs or unresponsiveness, high CPU, OutOfMemoryErrors, and crashes.

In preparation for such unexpected issues requiring root cause analysis, we recommend pre-emptively enabling the logging suggested in this article and verifying that the suggested data points can be generated before any such issue occurs. Without that preparation, a full RCA is often not possible on a first occurrence, and the issue must then recur after the additional data collection is in place. By preparing to capture these data points before any such issue arises, we can best ensure RCAs are possible more consistently on a first occurrence.

Data for slowness, unresponsiveness, or high CPU

If something is slow, unresponsive, or consuming high CPU, it is unlikely to be producing any logging, so a server log provides little insight into such an issue. The following additional data points, captured from and generated during such performance issues, can provide more insight.

GC logging

Garbage collection pauses are often a culprit in slowness, unresponsiveness, and high CPU. Refer to How do I enable Java garbage collection logging? for more details on enabling this. For Java 8 and earlier, GC logging can be enabled via JVM options like the following:

-verbose:gc -Xloggc:/desired/path/for/gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationStoppedTime

For Java 11, this can be enabled with JVM options like the following:

-Xlog:gc*:file=/desired/path/gc.log:time,uptimemillis:filecount=5,filesize=3M
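For a standalone server, a common place to add such flags is standalone.conf. The snippet below is a sketch only; the exact JAVA_OPTS line and log path are assumptions for illustration:

```shell
# Sketch: appending the Java 11 GC logging flags to JAVA_OPTS, as would be
# done in standalone.conf (path and existing JAVA_OPTS content are assumptions)
JAVA_OPTS="$JAVA_OPTS -Xlog:gc*:file=/desired/path/gc.log:time,uptimemillis:filecount=5,filesize=3M"
```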

You may analyze such GC logs with garbagecat per How do I analyze Java garbage collection logging?
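For a quick first look before a full garbagecat analysis, the total stopped time can be summed directly from a Java 8 gc.log produced with -XX:+PrintGCApplicationStoppedTime. The two log lines below are synthetic samples for illustration; point the awk at your real gc.log instead:

```shell
# Create a synthetic two-line sample of -XX:+PrintGCApplicationStoppedTime output
cat > /tmp/sample-gc.log <<'EOF'
1.234: Total time for which application threads were stopped: 0.1000000 seconds, Stopping threads took: 0.0000200 seconds
2.345: Total time for which application threads were stopped: 0.2000000 seconds, Stopping threads took: 0.0000300 seconds
EOF
# Sum the seconds value following each "stopped:" token
awk '{ for (i = 1; i <= NF; i++) if ($i == "stopped:") total += $(i + 1) }
     END { printf "total stopped: %.3f s\n", total }' /tmp/sample-gc.log
```

This gives a rough sense of cumulative pause impact; garbagecat remains the recommended tool for proper analysis.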

Access logs

Access logs provide a helpful historical record of request traffic, and with the %T and %I fields they include response time and thread name information to correlate with other logs or dumps. EAP 6 access logs can be enabled in the web subsystem:

        <subsystem xmlns="urn:jboss:domain:web:1.1" default-virtual-server="default-host" native="false">
            <connector name="http" protocol="HTTP/1.1" scheme="http" socket-binding="http"/>
            <virtual-server name="default-host" enable-welcome-root="true">
               <access-log pattern="%h %l %u %A %v %t %r %s %b %S %I %T"/> <!-- added -->
            </virtual-server>
        </subsystem>

EAP 7 access logs can be enabled in the undertow subsystem's server element. Also, ensure record-request-start-time is enabled on your undertow listeners:

    <server name="default-server">
        <http-listener name="default" socket-binding="http" redirect-socket="https" record-request-start-time="true"/>
        <host name="default-host" alias="localhost">
            <location name="/" handler="welcome-content"/>
            <access-log pattern="%h %l %u %t &quot;%r&quot; %s %b %S %I %T"/> <!-- added -->
            <filter-ref name="server-header"/>
            <filter-ref name="x-powered-by-header"/>
        </host>
    </server>
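With a pattern ending in %I %T as configured above, the slowest requests can be surfaced with a simple sort on the last field (%T, response time in seconds). The sample file below is synthetic for illustration; run the pipeline against your real access log:

```shell
# Synthetic access log lines matching a pattern ending in %I %T
cat > /tmp/sample-access.log <<'EOF'
10.0.0.1 - - [01/Jan/2024:00:00:01 +0000] "GET /fast HTTP/1.1" 200 12 abc default-task-1 0.005
10.0.0.2 - - [01/Jan/2024:00:00:02 +0000] "GET /slow HTTP/1.1" 200 99 def default-task-2 2.317
EOF
# Prefix each line with its last field (%T) and sort numerically, slowest first
awk '{ print $NF, $0 }' /tmp/sample-access.log | sort -rn | head -n 10
```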

Thread dumps

Thread dumps provide a snapshot of the current call stack of each thread in the JVM, which can reveal stall points or long-running operations that are otherwise not visible in logging. They can be written to a desired file with your JDK's jstack command, which should be run as the same user running the JBoss process:

jstack -l $JBOSS_PID > jstack.out

An alternative for Unix environments is sending a kill -3 signal to the process, which causes the JVM to produce a thread dump to its console output. The console output therefore needs to be redirected to a file at startup, and kill -3 will then produce each dump in that redirected console output file:

# ./standalone.sh > console.out &
# kill -3 $JBOSS_PID

Note it is best to capture several thread dumps over time to observe any change or progression in the threads (or lack thereof), along with corresponding CPU utilization data at the same moments to see exactly which threads are using the CPU. Scripts and commands are available to generate that more easily:

An easy single command to have on hand to capture thread dumps and CPU snapshots on Linux is the following:

$ for y in {1..20}; do echo $y >&2; for pid in $(ps -C java --noheader -o pid); do jstack -l $pid >> jstack-$pid.out; top -b -n 1 -H -p $pid >> high-cpu-$pid.out; done; top -b -n 1 -H >> high-cpu.out; sleep 10; done

If JBoss stalls, it is not possible to see where without a thread dump; if JBoss consumes increased CPU, it is not possible to identify and confirm the problem CPU consumers without thread-level CPU snapshots. Being prepared to capture these data points is therefore paramount, and we recommend having these scripts and commands in place and tested before any issue occurs to ensure they behave properly and capture data successfully. You may review such thread dumps with the analysis tools suggested by How do I analyze a Java thread dump or javacore?.
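When reviewing the collected jstack output, a quick summary of thread states can help spot pile-ups (for example, many BLOCKED threads contending on one monitor). The sample file below is synthetic; run the grep against the jstack-&lt;pid&gt;.out files collected above:

```shell
# Synthetic fragment of jstack output for illustration
cat > /tmp/jstack-sample.out <<'EOF'
"default task-1" #32 prio=5 tid=0x1 nid=0x2 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
"default task-2" #33 prio=5 tid=0x3 nid=0x4 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
"default task-3" #34 prio=5 tid=0x5 nid=0x6 runnable
   java.lang.Thread.State: RUNNABLE
EOF
# Count threads per state, most common state first
grep -h 'java.lang.Thread.State:' /tmp/jstack-sample.out | sort | uniq -c | sort -rn
```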

Stuck thread detection

Performance issues most often involve investigating request processing and the associated threads. A helpful tool for this is stuck thread detection, which automatically logs a trace of any individual request thread running longer than a configured threshold. This can capture traces on the very first occurrence of request slowness, including intermittent slowness that is difficult to catch with reactive thread dumps:

/subsystem=undertow/configuration=filter/expression-filter=stuck:add(expression="blocking; stuck-thread-detector(10)")
/subsystem=undertow/server=default-server/host=default-host/filter-ref=stuck:add

Data for OutOfMemoryErrors

An OOME may occur unexpectedly from heap, perm, or metaspace exhaustion. GC logs mentioned above are also a helpful reference for this, but the primary data point needed is a heap dump from the OOME. The best method to create this would be to use the following JVM flags to create the heap dump automatically when an OOME occurs:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath="/desired/directory"

We recommend using the Eclipse Memory Analyzer Tool to review such heap dumps per How do I analyze a Java heap dump?

If you experience a 'GC overhead limit exceeded' OOME, or near heap exhaustion that does not quite reach an OOME, a heap dump will not be created automatically, but one is still required to investigate the primary heap consumers. In that case, you may create a heap dump manually with your JDK's jmap command, which should also be executed as the same user running the JBoss process. We recommend ensuring this command is in place and functioning before any such issue occurs:

jmap -dump:format=b,file=heap.hprof &lt;JAVA_PID&gt;

If you see an 'unable to create new native thread' OOME, note this is not a heap or perm/metaspace exhaustion concern, so a heap dump is neither very helpful nor automatically generated here. This is typically an issue of the JVM's thread count reaching the user's nproc limit, so capturing thread dumps as noted above is helpful to check for any thread leak.
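A quick way to check how close the JVM is to that limit is to compare the process's current thread count against the user's nproc limit. JBOSS_PID is an assumption here; the current shell's PID stands in so the commands run standalone, so substitute the real JBoss process id:

```shell
# Sketch: compare JVM thread count to the user's nproc limit
# (JBOSS_PID is an assumption; using the current shell's PID for illustration)
JBOSS_PID=$$
threads=$(ps -o nlwp= -p "$JBOSS_PID" | tr -d ' ')   # nlwp = number of threads
echo "threads: $threads"
echo "nproc limit: $(ulimit -u)"
```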

Data for a JVM crash

If the JBoss process is dying unexpectedly, you may simply be seeing unexpected graceful shutdowns. But if the process dies without any graceful shutdown activity, it is likely the JVM crashed from a fatal error or was forcefully killed.

For such a crash, the JVM will typically produce a fatal error log summarizing the issue. This is a file typically named something like hs_err_&lt;pid&gt;.log or java_error&lt;pid&gt;.log, located in user.dir, the process's current working directory (typically $EAP_HOME or $EAP_HOME/bin by default).

jinfo can be used to determine user.dir:

$ jinfo <pid> | grep "user.dir"
user.dir=/path/to/dir
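Once the working directory is known, the likely locations can be searched for a fatal error log. EAP_HOME is an assumption below; a temporary directory with a synthetic file stands in so the command runs standalone, so point it at your real $EAP_HOME and $EAP_HOME/bin:

```shell
# Sketch: search the likely working directories for a fatal error log
# (EAP_HOME is an assumption; a temp dir and synthetic file are used here)
EAP_HOME=$(mktemp -d)
touch "$EAP_HOME/hs_err_12345.log"   # synthetic example file for illustration
find "$EAP_HOME" -maxdepth 2 \( -name 'hs_err_*.log' -o -name 'java_error*.log' \)
```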

The fatal error log location can be customized with the following JVM option:

-XX:ErrorFile=/desired/path/for/hs_err_pid%p.log

A full core dump from the crash may also be helpful. For RHEL, you may install the gdb package (which provides the gcore command) and add the following JVM option to produce a core automatically with a crash:

# For standalone, add to your standalone.conf's JAVA_OPTS
-XX:OnError=\"gcore -o /desired/path/for/corefile %p\"
# For domain, add to the JVM element used from your domain or host xml
<option value="-XX:OnError=gcore -o /desired/path/for/corefile %p"/>

Finally, install the JDK debug symbols. Debug symbols provide more detailed information in the fatal error log and core dump that is critical for analyzing such issues.
