Java application periodic high latency / processing times due to NUMA page reclaim on RHEL
Environment
- Red Hat Enterprise Linux 5.4
- kernel 2.6.18-164.11.1.el5.x86_64
- CPU / memory
- 24 CPUs total, 6 cores
- 16 GB RAM, 8 GB swap
- 2 Node NUMA system, with 8GB RAM on each NUMA node
- JBoss (running in its own JVM), jbossas, jboss-messaging
- JBoss interfaces with Oracle via local TCP (port 1521)
- Web application (running in its own JVM)
- JSF Based Web application (TCP / HTTP 1.1) using RichFaces & a4j components.
- Oracle Version: 11gR1 11.1.0.7
- Running with AMM, which forbids the use of HugePages
- Veritas VCS, VxVM, VxDMP
Issue
- JBoss server periodically consuming high CPU and experiencing pauses.
- Periodic (1 out of 100) garbage collections take an excessive amount of system time.
- Java-based web application experiences periodic (approximately 5 times out of 100) slow application response times.
- Application response is < 100ms 95% of the time; the other 5%, response may take up to 100 seconds.
- Unresponsiveness is seen across several processes (JBoss, Oracle, etc.), and slowness appears to be system-wide.
- Periodically, short-lived processes such as 'uname', 'grep', and 'perl' take an exceptionally long time to execute, and all of them spend most of that time in system (kernel) CPU time.
- Oracle responds to JBoss calls in less than 1s 90% of the time, but occasionally Oracle takes 30-40s and may exceed the 60s query timeout, resulting in Oracle error ORA-01013.
Resolution
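- Adding vm.zone_reclaim_mode = 0 to /etc/sysctl.conf and running "sysctl -p" to load the setting disabled zone reclaim. (Note that "sysctl -a" only lists the current values; it does not apply /etc/sysctl.conf.)
- This resolved the periodic high system CPU usage in various processes, and application response times became much more predictable.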
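The resolution above can be sketched as a small shell helper. This is only an illustration, not the procedure the original engineers ran: the function name set_sysctl_entry is hypothetical, and it assumes a POSIX shell with mktemp, grep, and sed available. It rewrites the given file so that the key appears exactly once, which avoids leaving an older "vm.zone_reclaim_mode = 1" line that could override the new one.

```shell
#!/bin/sh
# Hypothetical helper (not part of any Red Hat tool): make sure a
# sysctl-style config file contains exactly one "KEY = VALUE" line.
# Usage: set_sysctl_entry FILE KEY VALUE
set_sysctl_entry() {
    conf=$1
    key=$2
    value=$3
    touch "$conf" || return 1
    tmp=$(mktemp) || return 1
    # Escape dots so "vm.zone_reclaim_mode" is matched literally.
    key_re=$(printf '%s' "$key" | sed 's/[.]/\\./g')
    # Keep every line except existing assignments of this key.
    grep -v "^[[:space:]]*${key_re}[[:space:]]*=" "$conf" > "$tmp"
    # Append the desired setting.
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Overwrite in place so ownership and permissions are preserved.
    cat "$tmp" > "$conf"
    rm -f "$tmp"
}

# Example, run as root:
#   set_sysctl_entry /etc/sysctl.conf vm.zone_reclaim_mode 0
#   sysctl -p                              # load /etc/sysctl.conf
#   cat /proc/sys/vm/zone_reclaim_mode     # verify the running value
```

To change the running value without editing the file, "sysctl -w vm.zone_reclaim_mode=0" has the same immediate effect but does not persist across reboots.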
Root Cause
- vm.zone_reclaim_mode was set to '1' by the kernel because a 2-node NUMA topology was detected at boot.
- As a result, when the local NUMA node ran low on free memory, processes went into page reclaim on the local node instead of allocating memory from the other node, even though that node still had free memory.
- For a file-based workload such as a database, file server, or web server, zone_reclaim_mode should be set to 0.
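To check which reclaim behaviour is active on a system, the current value can be decoded. The sketch below assumes the bit meanings given in the kernel's vm sysctl documentation (1 = zone reclaim on, 2 = zone reclaim writes dirty pages out, 4 = zone reclaim swaps pages); the function name describe_zone_reclaim_mode is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print a human-readable description of a
# vm.zone_reclaim_mode value (a bitmask, per the kernel vm sysctl docs).
describe_zone_reclaim_mode() {
    mode=$1
    if [ "$mode" -eq 0 ]; then
        echo "zone reclaim off"
        return 0
    fi
    if [ $((mode & 1)) -ne 0 ]; then echo "zone reclaim on"; fi
    if [ $((mode & 2)) -ne 0 ]; then echo "zone reclaim writes dirty pages out"; fi
    if [ $((mode & 4)) -ne 0 ]; then echo "zone reclaim swaps pages"; fi
    return 0
}

# Example:
#   describe_zone_reclaim_mode "$(cat /proc/sys/vm/zone_reclaim_mode)"
```

On the system described here, the value was 1, so only the first bit was set.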
Diagnostic Steps
- Run "numactl --hardware" and check for at least 2 nodes with a node distance > 20, and for one node having much less free memory than the other:
    available: 2 nodes (0-1)
    node 0 size: 8035 MB
    node 0 free: 408 MB
    node 1 size: 8080 MB
    node 1 free: 3606 MB
    node distances:
    node   0   1
      0:  10  21
      1:  21  10
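The numactl check can be condensed into a quick summary. The following is a rough sketch, assuming the "node N size:" / "node N free:" line format shown in the sample output and an awk implementation on the PATH; summarize_numa_hardware is a hypothetical name, not a numactl feature.

```shell
#!/bin/sh
# Hypothetical helper: read "numactl --hardware" output on stdin and
# print free memory per node plus the largest node distance.
summarize_numa_hardware() {
    awk '
        /^node [0-9]+ size:/ {
            size[$2] = $4
            if ($2 + 0 > maxnode) maxnode = $2 + 0
        }
        /^node [0-9]+ free:/ { free[$2] = $4 }
        # Rows of the distance matrix look like "  0:  10  21".
        /^ *[0-9]+:/ {
            for (i = 2; i <= NF; i++)
                if ($i + 0 > maxdist) maxdist = $i + 0
        }
        END {
            for (n = 0; n <= maxnode; n++)
                if (n in size)
                    printf "node %d: %d of %d MB free (%d%%)\n", n, free[n], size[n], 100 * free[n] / size[n]
            printf "max node distance: %d\n", maxdist
        }'
}

# Example:
#   numactl --hardware | summarize_numa_hardware
```

A large gap between the per-node free percentages, combined with a maximum distance above 20, matches the pattern described in this step.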
- Run the following simple script, which will send 'sysrq-t' output to /var/log/messages, approximately every 5s:
# while true; do sleep 5; echo t > /proc/sysrq-trigger; done
- Analyze the backtraces of 'D' state and 'R' state processes.
- 'R' state process analysis shows most running processes in a 'zone_reclaim() ... isolate_lru_pages()' backtrace, similar to:
Jul 22 03:43:37 linux-s1 kernel: monitor R running task 0 31787 31709 (NOTLB)
Jul 22 03:43:37 linux-s1 kernel: ffff81035fff1a58 0000000000000020 0000000000000020 0000000000000000
Jul 22 03:43:37 linux-s1 kernel: 0000000000000020 0000000000000000 0000000000000000 0000000000000020
Jul 22 03:43:37 linux-s1 kernel: 0000000000000000 0000000000000001 ffff8103c52d9a50 0000000000000020
Jul 22 03:43:37 linux-s1 kernel: Call Trace:
Jul 22 03:43:37 linux-s1 kernel: [] isolate_lru_pages+0x98/0xbf
Jul 22 03:43:37 linux-s1 kernel: [] __pagevec_release+0x19/0x22
Jul 22 03:43:37 linux-s1 kernel: [] shrink_active_list+0x4b4/0x4c4
Jul 22 03:43:37 linux-s1 kernel: [] shrink_zone+0xf7/0x15d
Jul 22 03:43:37 linux-s1 kernel: [] zone_reclaim+0x1cc/0x292
Jul 22 03:43:37 linux-s1 kernel: [] zone_reclaim+0x1cc/0x292
Jul 22 03:43:37 linux-s1 kernel: [] get_page_from_freelist+0xbf/0x43a
Jul 22 03:43:37 linux-s1 kernel: [] __alloc_pages+0x65/0x2ce
Jul 22 03:43:37 linux-s1 kernel: [] do_wp_page+0x4b7/0x8dc
Jul 22 03:43:37 linux-s1 kernel: [] filemap_nopage+0x193/0x360
Jul 22 03:43:37 linux-s1 kernel: [] __handle_mm_fault+0xed4/0xf99
Jul 22 03:43:37 linux-s1 kernel: [] math_state_restore+0x23/0x4c
Jul 22 03:43:37 linux-s1 kernel: [] error_exit+0x0/0x84
Jul 22 03:43:37 linux-s1 kernel: [] do_page_fault+0x4cb/0x830
Jul 22 03:43:37 linux-s1 kernel: [] sys_rt_sigreturn+0x283/0x356
Jul 22 03:43:37 linux-s1 kernel: [] sys_rt_sigreturn+0x323/0x356
Jul 22 03:43:37 linux-s1 kernel: [] error_exit+0x0/0x84
- Run zone_reclaim.stp, which records and prints the frequency of processes calling zone_reclaim() and prints any process that spends longer in zone_reclaim() than a specified threshold (1s by default).
- Look for a high rate of processes calling zone_reclaim(), with some (such as 'grep') calling it thousands of times in a 5s period.
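The backtrace analysis of the sysrq-t output can be partly automated. The sketch below is not the zone_reclaim.stp script; it is a simple, hypothetical awk-based counter (count_zone_reclaim_traces) that assumes each log entry is on its own line and that every backtrace begins with a "Call Trace:" line, as in the sample above.

```shell
#!/bin/sh
# Hypothetical helper: read sysrq-t output (e.g. /var/log/messages) on
# stdin and count how many backtraces contain a zone_reclaim() frame.
count_zone_reclaim_traces() {
    awk '
        # Close out the trace currently being read, if any.
        function finish_trace() {
            if (in_trace) {
                total++
                if (hit) reclaim++
            }
        }
        /Call Trace:/ { finish_trace(); in_trace = 1; hit = 0; next }
        # A frame such as "zone_reclaim+0x1cc/0x292" marks the trace.
        in_trace && /zone_reclaim[+]/ { hit = 1 }
        END {
            finish_trace()
            printf "%d of %d backtraces include zone_reclaim()\n", reclaim, total
        }'
}

# Example:
#   count_zone_reclaim_traces < /var/log/messages
```

A high proportion of traces passing through zone_reclaim() is consistent with the root cause described in this article; the counter does not distinguish 'R' state from 'D' state tasks, so the traces still need to be reviewed by hand.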
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.