Java app contention on sun.net.www.http.KeepAliveCache lock

Solution Verified - Updated

Environment

  • Java
  • OpenJDK

Issue

  • Our Java application faces slowness, and thread dumps show threads contending on a sun.net.www.http.KeepAliveCache lock:
 at sun.net.www.http.KeepAliveCache.put(KeepAliveCache.java:80)
 - waiting to lock <0x00000005c0027448> (a sun.net.www.http.KeepAliveCache)
 at sun.net.www.protocol.https.HttpsClient.putInKeepAliveCache(HttpsClient.java:659)
 at sun.net.www.http.HttpClient.finished(HttpClient.java:395)
 at sun.net.www.http.KeepAliveStream.close(KeepAliveStream.java:97)
 at sun.net.www.MeteredStream.justRead(MeteredStream.java:93)
 at sun.net.www.MeteredStream.read(MeteredStream.java:135)
 - locked <0x000000062e959300> (a sun.net.www.http.KeepAliveStream)
 at java.io.FilterInputStream.read(FilterInputStream.java:133)
 at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3456)
 at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
 at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
 at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
 - locked <0x000000062eaa9138> (a java.io.InputStreamReader)
 at java.io.InputStreamReader.read(InputStreamReader.java:184)
  • The lock owner is a Keep-Alive-Timer thread, like the one below, that is stuck in a socket read during an SSLSocket close call:
"Keep-Alive-Timer" #12345 daemon prio=8 os_prio=0 tid=0x0 nid=0x3a567 runnable [0x0]
    java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at java.net.SocketInputStream.read(SocketInputStream.java:224)
	at sun.security.ssl.SSLSocketInputRecord.deplete(SSLSocketInputRecord.java:488)
	at sun.security.ssl.SSLSocketImpl$AppInputStream.readLockedDeplete(SSLSocketImpl.java:1109)
	at sun.security.ssl.SSLSocketImpl$AppInputStream.deplete(SSLSocketImpl.java:1083)
	at sun.security.ssl.SSLSocketImpl$AppInputStream.access$100(SSLSocketImpl.java:832)
	at sun.security.ssl.SSLSocketImpl.bruteForceCloseInput(SSLSocketImpl.java:709)
	at sun.security.ssl.SSLSocketImpl.duplexCloseOutput(SSLSocketImpl.java:659)
	at sun.security.ssl.SSLSocketImpl.close(SSLSocketImpl.java:560)
	at sun.net.www.http.HttpClient.closeServer(HttpClient.java:1057)
	at sun.net.www.http.KeepAliveCache.run(KeepAliveCache.java:194)
	- locked <0x00000007de05b280> (a sun.net.www.http.ClientVector)
	- locked <0x0000000674d44a30> (a sun.net.www.http.KeepAliveCache)
	at java.lang.Thread.run(Thread.java:750)

Resolution

  • Update to OpenJDK 1.8.0_422+, 11.0.21+, or 17.0.7+
  • Note that with such network behavior, the Keep-Alive-Timer itself could still experience delays closing keepalive connections as described above. With the fix in those JDK updates, however, those background close operations no longer block other threads.
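As a quick sanity check, the sketch below (an illustrative helper, not part of any Red Hat or JDK tooling) compares a java.version string against the minimum fixed update for its release line, using the versions listed above:

```java
// Sketch: check whether a JDK version string includes the KeepAliveCache fix.
// Assumes the fixed versions listed above: 8u422, 11.0.21, 17.0.7.
public class KeepAliveFixCheck {

    // Returns true if `version` (e.g. "1.8.0_422" or "11.0.21") is at or
    // above the first fixed update for its release line.
    static boolean hasFix(String version) {
        if (version.startsWith("1.8.0_")) {
            // 8u-style version: parse the update number after the underscore.
            int update = Integer.parseInt(
                    version.substring("1.8.0_".length()).split("-")[0]);
            return update >= 422;
        }
        // Modern version scheme: feature.interim.update, e.g. 17.0.7
        String[] parts = version.split("[.+-]");
        int feature = Integer.parseInt(parts[0]);
        int update = parts.length > 2 ? Integer.parseInt(parts[2]) : 0;
        if (feature == 11) return update >= 21;
        if (feature == 17) return update >= 7;
        return feature > 17; // newer release lines already carry the fix
    }

    public static void main(String[] args) {
        String running = System.getProperty("java.version");
        System.out.println(running + " has fix: " + hasFix(running));
    }
}
```

Running this on the affected JVM shows whether an update is still needed.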

Root Cause

  • The blocked threads are application threads creating a new connection through sun.net.www.http.HttpClient and trying to insert that connection into the cache of managed keepalive connections. That keepalive cache is held, however, by the Keep-Alive-Timer background thread, which walks the cache, times out old connections, and closes them. For HTTPS connections, that close operation tries to exchange graceful TLS shutdown messages before destroying the socket.
  • A delay in receiving the backend's response to a TLS shutdown message delays the removal of connections from the keepalive cache, which in turn blocks any request-sending threads trying to insert a connection into that cache. The length of the delay depends on the network behavior of the backend destination in question, so this issue can vary across environments and with the load and volume of the connections involved.
  • A likely contributor is network behavior that silently drops a long-idle connection, without sending any FIN/RST, before the connection has idled long enough to be removed from the keepalive cache. When the Keep-Alive-Timer then tries to exchange its TLS shutdown messages on such a dropped connection, they disappear into the void without ever getting a timely response.
    This resulting KeepAliveCache contention is a known issue that has since been reported and addressed upstream; the fixed versions are listed in the Resolution above.
  • You could also avoid any concerns from the behavior of reused keepalive connections by disabling keepalive entirely, with the following property added to your JVM options (note the property name is case-sensitive):
-Dhttp.keepAlive=false
  • Note that every outgoing request from your app would then have to open a new connection. Note also that the idle time a keepalive connection persists is determined first by the backend destination's response and any keepalive timeout value it provides, per https://docs.oracle.com/javase/8/docs/api/java/net/doc-files/net-properties.html:
http.keepAlive.time.server and http.keepAlive.time.proxy

These properties modify the behavior of the HTTP keepalive cache in the case where the server (or proxy) has not specified a keepalive time. If the property is set in this case, then idle connections will be closed after the specified number of seconds. If the property is set, and the server does specify a keepalive time in a "Keep-Alive" response header, then the time specified by the server is used. If the property is not set and also the server does not specify a keepalive time, then connections are kept alive for an implementation defined time, assuming http.keepAlive is true.
  • So if you want these idle keepalive connections to persist for shorter durations, first check and adjust any keepalive timeout returned by the backend destination. If the backend destination does not provide a keepalive timeout, you can use the http.keepAlive.time.server property to adjust the default keepalive duration on the JVM side.
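The property-based mitigations above can be sketched in code as follows. This is an illustrative example, not a required configuration; the 5-second timeout is an arbitrary placeholder, and the properties should be set early, before the application issues its first HTTP request (setting them on the command line with -D is equivalent and safer):

```java
// Sketch: disable HTTP keepalive, or cap its idle time, from application code.
// Equivalent to -Dhttp.keepAlive=false etc. on the command line; must run
// before the first HTTP request so the JDK HTTP client picks the values up.
public class KeepAliveSettings {
    public static void main(String[] args) {
        // Option 1: disable connection reuse entirely (every request opens
        // a new connection).
        System.setProperty("http.keepAlive", "false");

        // Option 2: keep reuse, but cap the idle time (in seconds) used when
        // the server sends no Keep-Alive timeout header. These properties are
        // available in recent JDK updates, per the Oracle doc quoted above.
        System.setProperty("http.keepAlive.time.server", "5");
        System.setProperty("http.keepAlive.time.proxy", "5");

        System.out.println("http.keepAlive=" + System.getProperty("http.keepAlive"));
    }
}
```

A server-provided Keep-Alive timeout header still takes precedence over the time.server/time.proxy values, as described in the quoted documentation.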

Diagnostic Steps

  • If you want to debug the network activity itself, you can capture tcpdumps from both the Java application host and the backend destination for further review.
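A capture on the application host might look like the sketch below. The interface name, backend host, and port are placeholders to substitute for your environment; the script only prints the command rather than running it, since an actual capture requires root privileges:

```shell
# Sketch: capture traffic between the app and a backend to inspect keepalive
# connection closes (FIN/RST, TLS close_notify). Placeholders: adjust the
# interface, backend host, and port for your environment.
IFACE=eth0
BACKEND_HOST=backend.example.com
BACKEND_PORT=443

# -s 0: capture full packets; -w: write a pcap for offline review
# (e.g. in Wireshark).
CMD="tcpdump -i $IFACE -s 0 -w /tmp/keepalive.pcap host $BACKEND_HOST and port $BACKEND_PORT"

# Printed rather than executed here, since tcpdump needs root privileges.
echo "$CMD"
```

In the resulting pcap, look for whether the backend ever answers the TLS close_notify from the Keep-Alive-Timer, or whether idle connections are dropped with no FIN/RST at all.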

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.