Java application periodic high latency / processing times due to NUMA page reclaim on RHEL

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 5.4                      
    • kernel 2.6.18-164.11.1.el5.x86_64
  • CPU / memory          
    • 24 CPUs total, 6 cores
    • 16 GB RAM, 8 GB swap
    • 2-node NUMA system with 8 GB RAM on each NUMA node
  • JBoss (running in its own JVM), jbossas, jboss-messaging
    • JBoss interfaces with Oracle via local TCP (port 1521)
  • Web application (running in its own JVM)             
    • JSF-based web application (TCP / HTTP 1.1) using RichFaces and a4j components.
  • Oracle Version: 11gR1 11.1.0.7          
    • Running with AMM (Automatic Memory Management), which precludes the use of HugePages
  • Veritas VCS, VxVM, VxDMP

Issue

  • JBoss server periodically consumes high CPU and experiences pauses.
    • Periodic (1 out of 100) garbage collections take an excessive amount of system time.
  • Java-based web application experiences periodic (approximately 5 out of 100) slow application response times.
    • Application response is < 100ms 95% of the time; the other 5% of the time, a response may take up to 100 seconds.
    • Unresponsiveness is seen across several processes (JBoss, Oracle, etc.), and slowness appears to be system-wide.
  • Periodically, processes such as 'uname', 'grep', and 'perl' take an exceptionally long time to execute, and all seem to consume an exceptional amount of system time.
  • Oracle responds to JBoss calls in less than 1s 90% of the time, but occasionally takes 30-40s and may exceed the 60s query timeout, resulting in Oracle error ORA-01013.

Resolution

  • Adding vm.zone_reclaim_mode = 0 to /etc/sysctl.conf and running "sysctl -p" disabled zone_reclaim.
    • This resolved the periodic high system CPU in various processes, and application response times became much more predictable.
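As a minimal sketch, the change above can be applied to a running system without a reboot (run as root); the commands below only assume the standard sysctl interface:

```shell
# Show the current mode; '1' means zone reclaim is enabled
cat /proc/sys/vm/zone_reclaim_mode

# Disable zone reclaim immediately on the running system
sysctl -w vm.zone_reclaim_mode=0

# Persist the setting across reboots, then re-load /etc/sysctl.conf
echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf
sysctl -p
```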

Root Cause

  • vm.zone_reclaim_mode was automatically set to '1' at boot because a 2-node NUMA topology was detected.
  • Unfortunately, this caused processes to stall in page reclaim on the local NUMA node instead of allocating memory from the other node.
  • For a file-based workload such as a database, file server, or web server, zone_reclaim_mode should be set to 0.

Diagnostic Steps

  • Run "numactl --hardware" and observe at least 2 nodes with an internode distance > 20, and one node with much less free memory than the other:
    available: 2 nodes (0-1)
    node 0 size: 8035 MB
    node 0 free: 408 MB
    node 1 size: 8080 MB
    node 1 free: 3606 MB
    node distances:
    node    0    1
      0:   10   21
      1:   21   10
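When capturing this output over time, a short parsing sketch can flag a node running low on free memory; the 'numactl.out' capture file and the 512 MB threshold below are illustrative assumptions, not part of the original diagnosis:

```shell
# numactl.out stands in for a saved 'numactl --hardware' capture
# (hypothetical file; in practice redirect the real command's output)
cat > numactl.out <<'EOF'
node 0 free: 408 MB
node 1 free: 3606 MB
EOF

# Flag any node whose free memory falls below an illustrative 512 MB threshold;
# on 'node 0 free: 408 MB' lines, $2 is the node ID and $4 the free MB
awk '/free:/ { if ($4 + 0 < 512) print "node " $2 " low on free memory: " $4 " MB" }' numactl.out
```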
    
  • Set up sysrq via https://access.redhat.com/kb/docs/DOC-2024
  • Run the following simple script, which will send 'sysrq-t' output to /var/log/messages, approximately every 5s:
    # while (true); do sleep 5; echo 't' > /proc/sysrq-trigger ;done
    
    • Analyze the backtraces of 'D' state and 'R' state processes.
    • 'R' state process analysis shows most running processes in a 'zone_reclaim() ... isolate_lru_pages()' backtrace, similar to:
      Jul 22 03:43:37 linux-s1 kernel: monitor       R  running task       0 31787  31709                     (NOTLB)
      Jul 22 03:43:37 linux-s1 kernel:  ffff81035fff1a58 0000000000000020 0000000000000020 0000000000000000
      Jul 22 03:43:37 linux-s1 kernel:  0000000000000020 0000000000000000 0000000000000000 0000000000000020
      Jul 22 03:43:37 linux-s1 kernel:  0000000000000000 0000000000000001 ffff8103c52d9a50 0000000000000020
      Jul 22 03:43:37 linux-s1 kernel: Call Trace:
      Jul 22 03:43:37 linux-s1 kernel:  [] isolate_lru_pages+0x98/0xbf
      Jul 22 03:43:37 linux-s1 kernel:  [] __pagevec_release+0x19/0x22
      Jul 22 03:43:37 linux-s1 kernel:  [] shrink_active_list+0x4b4/0x4c4
      Jul 22 03:43:37 linux-s1 kernel:  [] shrink_zone+0xf7/0x15d
      Jul 22 03:43:37 linux-s1 kernel:  [] zone_reclaim+0x1cc/0x292
      Jul 22 03:43:37 linux-s1 kernel:  [] zone_reclaim+0x1cc/0x292
      Jul 22 03:43:37 linux-s1 kernel:  [] get_page_from_freelist+0xbf/0x43a
      Jul 22 03:43:37 linux-s1 kernel:  [] __alloc_pages+0x65/0x2ce
      Jul 22 03:43:37 linux-s1 kernel:  [] do_wp_page+0x4b7/0x8dc
      Jul 22 03:43:37 linux-s1 kernel:  [] filemap_nopage+0x193/0x360
      Jul 22 03:43:37 linux-s1 kernel:  [] __handle_mm_fault+0xed4/0xf99
      Jul 22 03:43:37 linux-s1 kernel:  [] math_state_restore+0x23/0x4c
      Jul 22 03:43:37 linux-s1 kernel:  [] error_exit+0x0/0x84
      Jul 22 03:43:37 linux-s1 kernel:  [] do_page_fault+0x4cb/0x830
      Jul 22 03:43:37 linux-s1 kernel:  [] sys_rt_sigreturn+0x283/0x356
      Jul 22 03:43:37 linux-s1 kernel:  [] sys_rt_sigreturn+0x323/0x356
      Jul 22 03:43:37 linux-s1 kernel:  [] error_exit+0x0/0x84
      
  • Run the zone_reclaim.stp SystemTap script: it records and prints the frequency of processes calling zone_reclaim(), and prints any process that remains in zone_reclaim() longer than a specified threshold (1s by default).
  • Look for a high rate of processes calling zone_reclaim(), some (such as 'grep') calling zone_reclaim() thousands of times in a 5s period.
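To quantify that rate, the zone_reclaim() frames in the captured sysrq-t output can be tallied per task. A minimal sketch, assuming 'messages.sample' stands in for an extract of /var/log/messages (in practice, point the pipeline at the real log):

```shell
# messages.sample is a hypothetical two-dump extract of /var/log/messages
cat > messages.sample <<'EOF'
Jul 22 03:43:37 linux-s1 kernel: grep          R  running task       0 31787  31709 (NOTLB)
Jul 22 03:43:37 linux-s1 kernel:  [] zone_reclaim+0x1cc/0x292
Jul 22 03:43:42 linux-s1 kernel: grep          R  running task       0 31787  31709 (NOTLB)
Jul 22 03:43:42 linux-s1 kernel:  [] zone_reclaim+0x1cc/0x292
EOF

# Remember the task name ($6) from each 'running task' header line;
# when a zone_reclaim frame follows, credit that task, then print totals
awk '/running task/ { task = $6 }
     /zone_reclaim\+/ { count[task]++ }
     END { for (t in count) print count[t], t }' messages.sample
```

A process such as 'grep' appearing with a high count across many 5s dumps is the signature described above.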
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.