Caches and cache entries are found in only one pod of a DG 8 cluster on OCP 4
Environment
- Red Hat OpenShift Container Platform (OCP)
  - 4.x
- Red Hat Data Grid (RHDG)
  - 8.x
  - Operator 8.x (prior to Operator 8.4.5)
Issue
In an EAP-to-DG scenario, EAP connects through the internal service, but the externalization stores all data in only one pod. In the listing below, only pod 3 holds the counter.war cache:
### pod 0
[dg-cluster-nyc-both-0-9259@dg-cluster-nyc-both//containers/default]> ls caches
configMap-cache-01
___script_cache
configMap-cache-02
### pod 1
[dg-cluster-nyc-both-1-34395@dg-cluster-nyc-both//containers/default]> ls caches
configMap-cache-01
___script_cache
configMap-cache-02
### pod 2
[dg-cluster-nyc-both-2-59327@dg-cluster-nyc-both//containers/default]> ls caches
___script_cache
configMap-cache-02
configMap-cache-01
### pod 3
[dg-cluster-nyc-both-3-25482@dg-cluster-nyc-both//containers/default]> ls caches
configMap-cache-02
counter.war <-----------------------------------------------------------------
___script_cache
configMap-cache-01
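The per-pod comparison above can also be done mechanically. The following is a minimal sketch, assuming the `ls caches` output of each pod has been saved to one text file per pod; the function name and the file names are hypothetical, and cache names are assumed to contain no spaces.

```shell
# Hypothetical helper: print every cache name that is NOT present in all
# of the given listing files (one file per pod, one cache name per line).
caches_not_on_all_pods() {
  n=$#
  # De-duplicate each listing, then count in how many listings each name appears.
  for f in "$@"; do
    sort -u "$f"
  done | sort | uniq -c | awk -v n="$n" '$1 < n { print $2 }'
}

# Example usage with hypothetical files holding the listings shown above:
# caches_not_on_all_pods pod0.txt pod1.txt pod2.txt pod3.txt
```

With the four listings shown above, the only name reported would be counter.war, because it appears on pod 3 alone.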
Resolution
Update to Red Hat Data Grid Operator version 8.4.5 or later.
Root Cause
The cluster is not forming, so each pod behaves as an independent, single-member cluster:
19:52:58,334 INFO (main) [org.infinispan.CLUSTER] ISPN000094: Received new cluster view for channel dg-cluster-nyc-both: [dg-cluster-nyc-both-0-9259|0] (1) [dg-cluster-nyc-both-0-9259]
This is caused by a bug in the Operator, which creates the ping service without spec.publishNotReadyAddresses:
$ oc get svc dg-cluster-nyc-both -o yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: infinispan-service
    clusterName: dg-cluster-nyc-both
    infinispan_cr: dg-cluster-nyc-both
  name: dg-cluster-nyc-both
  namespace: dg-test-nyc
spec:
  clusterIP: ...
  clusterIPs:
  ...
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: infinispan
    port: 11222
    protocol: TCP
    targetPort: 11222
  selector:
    app: infinispan-pod
    clusterName: dg-cluster-nyc-both
  sessionAffinity: None
  type: ClusterIP
  publishNotReadyAddresses: true <------------------------ missing
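To check whether a given Service carries the field, its YAML can be inspected for `publishNotReadyAddresses: true`. The following is a minimal sketch; the function name is hypothetical, and the service and namespace names in the usage comment are taken from the example above and will differ in other environments.

```shell
# Hypothetical helper: read a Service definition (YAML) on stdin and report
# whether it contains "publishNotReadyAddresses: true".
check_publish_not_ready() {
  if grep -Eq '^[[:space:]]*publishNotReadyAddresses:[[:space:]]*true[[:space:]]*$'; then
    echo "present"
  else
    echo "missing"
  fi
}

# Example usage against a live cluster (names are from the example above):
# oc get svc dg-cluster-nyc-both -n dg-test-nyc -o yaml | check_publish_not_ready
```

A result of "missing" on an affected Operator version would be consistent with the root cause described here; it is not proof on its own, so the pod logs should still be checked.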
Fix tracking by deployment model:
| Deployment | Jira |
|---|---|
| Operator | JDG-5986 |
| Helm charts | JDG-5988 |
Diagnostic Steps
- Verify the cluster details in the DG pod logs. The cluster view should list every pod.
- Verify the cache presence (cache configuration) on each DG pod.
- curl uses only one IP address. To list all DNS records for the service, use dig, nslookup, or a similar tool.
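For the first step, the member count can be read from the ISPN000094 cluster view message in the pod logs. The following is a minimal sketch that extracts the number in parentheses from the most recent such message; the function name is hypothetical, and the pod name in the usage comment is an assumption based on the node names shown in this article.

```shell
# Hypothetical helper: read DG server log lines on stdin and print the member
# count of the last "ISPN000094: Received new cluster view" message.
cluster_view_size() {
  sed -n 's/.*ISPN000094:.*\] (\([0-9][0-9]*\)) \[.*/\1/p' | tail -n 1
}

# Example usage (pod name is an assumption; adjust to your environment):
# oc logs dg-cluster-nyc-both-0 -n dg-test-nyc | cluster_view_size
```

If every pod reports a view of size 1 while several pods are running, the pods have not joined a common cluster.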
Example of the error on the EAP side:
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: dg-cluster-nyc.dg-test-nyc.svc.cluster.local/127.0.0.123:11222
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
Track the IP of the internal service (which tunnels to the IP of the DG pod itself):
$ oc get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
dg-cluster-nyc ClusterIP 127.0.0.123 <none> 11222/TCP 133m
Retry:
02:09:20,290 INFO [org.infinispan.HOTROD] (HotRod-client-async-pool-1) ISPN004006: Server sent new topology view (id=1144929326, age=0) containing 1 addresses: [127.0.0.123:11222]
02:09:20,291 INFO [org.infinispan.HOTROD] (HotRod-client-async-pool-1) ISPN004016: Server not in cluster anymore(127.0.0.123:11222), removing from the pool.
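The ISPN004006 message above reports how many server addresses the Hot Rod client received in its topology view. The following is a minimal sketch that extracts that number from EAP log lines; the function name and the log file name in the usage comment are hypothetical.

```shell
# Hypothetical helper: read EAP log lines on stdin and print the address count
# from the last "ISPN004006: Server sent new topology view" message.
hotrod_topology_size() {
  sed -n 's/.*ISPN004006:.*containing \([0-9][0-9]*\) addresses.*/\1/p' | tail -n 1
}

# Example usage (file name is hypothetical):
# hotrod_topology_size < server.log
```

A count of 1 while several DG pods are running is consistent with the client seeing only a single-member cluster.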
When the pod behind the connection is down and max_retries is set to zero, the operation fails as shown below, where 127.0.0.1 (left side) is the internal service's IP and 127.1.1.1 (right side) is the EAP pod's IP.
01:42:04,263 WARN [org.infinispan.HOTROD] (HotRod-client-async-pool-1) ISPN004098: Closing connection [id: 0x0d7b5035, L:/127.0.0.1:34158 ! R:/127.1.1.1:34158:11222] due to transport error: java.net.SocketTimeoutException: PutOperation{counter.war, key=[B0x010403002D9801028A01270A2560C801..[50], value=[B0x01040300599801EA078A01520A1A9801..[94], flags=6, connection=/127.1.1.1:11222} timed out after 60000 ms at org.infinispan.client.hotrod@11.0.17.Final-redhat-00001//org.infinispan.client.hotrod.impl.operations.HotRodOperation.run(HotRodOperation.java:186)
Or connection refused:
$ oc get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
dg-cluster-nyc-one ClusterIP 127.0.0.1 <none> 11222/TCP 30m <-------------------
...
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: dg-cluster-nyc-one.dg-test-nyc.svc.cluster.local/127.0.0.1:11222 <-------------------
Caused by: java.net.ConnectException: Connection refused