Caches and cache entries can only be found in one pod of the cluster with DG 8 on OCP 4

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (OCP)
    • 4.x
  • Red Hat Data Grid (RHDG)
    • 8.x
    • Operator 8.x (prior to Operator 8.4.5)

Issue

Scenario: EAP externalizes its data to DG over the internal service, but all externalized data ends up in only one pod:

### pod 0
[dg-cluster-nyc-both-0-9259@dg-cluster-nyc-both//containers/default]> ls caches
configMap-cache-01                                                                                                                                                                                                 
___script_cache                                                                                                                                                                                                    
configMap-cache-02  
### pod 1
[dg-cluster-nyc-both-1-34395@dg-cluster-nyc-both//containers/default]> ls caches
configMap-cache-01                                                                                                                                                                                                 
___script_cache                                                                                                                                                                                                    
configMap-cache-02  
### pod 2
[dg-cluster-nyc-both-2-59327@dg-cluster-nyc-both//containers/default]> ls caches
___script_cache                                                                                                                                                                                                    
configMap-cache-02                                                                                                                                                                                                 
configMap-cache-01   
### pod 3
[dg-cluster-nyc-both-3-25482@dg-cluster-nyc-both//containers/default]> ls caches
configMap-cache-02                                                                                                                                                                                                 
counter.war             <----- present only on this pod
___script_cache                                                                                                                                                                                                    
configMap-cache-01 

Resolution

Update to Red Hat Data Grid Operator version 8.4.5 or later.

Root Cause

The cluster is not forming, so each pod behaves as an independent instance:

19:52:58,334 INFO (main) [org.infinispan.CLUSTER] ISPN000094: Received new cluster view for channel dg-cluster-nyc-both: [dg-cluster-nyc-both-0-9259|0] (1) [dg-cluster-nyc-both-0-9259]
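A healthy n-pod cluster logs a view containing n members; here the view lists a single member. A quick way to check is to count the members in the ISPN000094 line. A minimal sketch (the parsing helper below is hypothetical, not part of any Red Hat tooling):

```python
import re

def view_members(log_line):
    # ISPN000094 prints the view size "(n)" followed by the member
    # list "[member1, member2, ...]" at the end of the line
    match = re.search(r'\(\d+\)\s+\[([^\]]*)\]\s*$', log_line)
    return [m.strip() for m in match.group(1).split(',')] if match else []

line = ("ISPN000094: Received new cluster view for channel dg-cluster-nyc-both: "
        "[dg-cluster-nyc-both-0-9259|0] (1) [dg-cluster-nyc-both-0-9259]")
print(view_members(line))  # a single member: the cluster did not form
```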

This is caused by a bug in the Operator, which creates the ping service without spec.publishNotReadyAddresses:

$ oc get svc dg-cluster-nyc-both -o yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: infinispan-service
    clusterName: dg-cluster-nyc-bot
    infinispan_cr: dg-cluster-nyc-both
  name: dg-cluster-nyc-both
  namespace: dg-test-nyc
spec:
  clusterIP: ...
  clusterIPs:
  ...
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: infinispan
    port: 11222
    protocol: TCP
    targetPort: 11222
  selector:
    app: infinispan-pod
    clusterName: dg-cluster-nyc-both
  sessionAffinity: None
  type: ClusterIP
  publishNotReadyAddresses: true <------------------------ missing
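publishNotReadyAddresses is the standard Kubernetes Service field that makes DNS publish pod addresses before the pods report ready, which JGroups discovery needs while the cluster is still forming. On affected versions, a temporary workaround is to add the field to the service spec manually (a sketch; note the operator may revert manual edits on reconciliation):

```yaml
spec:
  publishNotReadyAddresses: true
```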

Deployment model and the Jira tracking its fix:

Deployment    Jira
Operator      JDG-5986
Helm charts   JDG-5988

Diagnostic Steps

  1. Verify the cluster view in the DG pod logs.
  2. Verify the cache presence (cache configuration) on each DG pod.
  3. curl connects to only one IP. To see the full list of DNS records behind the service, use dig/nslookup/etc.
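Step 3 can be illustrated with a short Python sketch; the cluster-internal service name in the comment is hypothetical. The snippet enumerates every address a DNS name resolves to, which is what dig/nslookup show and a single curl request does not:

```python
import socket

def resolve_all(host, port=11222):
    """List every distinct IP address published behind a DNS name.

    For a headless (ping) service, each DG pod should appear as its
    own A record; curl picks only one of them per request.
    """
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

# Inside the cluster (hypothetical service name):
#   resolve_all("dg-cluster-nyc-both-ping.dg-test-nyc.svc.cluster.local")
print(resolve_all("localhost"))
```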

Example of EAP issue:

Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: dg-cluster-nyc.dg-test-nyc.svc.cluster.local/127.0.0.123:11222
Caused by: java.net.ConnectException: Connection refused
at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)

Track the IP from the internal service (which tunnels to the DG pod's own IP):

$ oc get svc
NAME                                             TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)           AGE
dg-cluster-nyc                                   ClusterIP      127.0.0.123   <none>                                                                    11222/TCP         133m

Retry:

02:09:20,290 INFO [org.infinispan.HOTROD] (HotRod-client-async-pool-1) ISPN004006: Server sent new topology view (id=1144929326, age=0) containing 1 addresses: [127.0.0.123:11222]
02:09:20,291 INFO [org.infinispan.HOTROD] (HotRod-client-async-pool-1) ISPN004016: Server not in cluster anymore(127.0.0.123:11222), removing from the pool.

When the pod backing the connection is down and max_retries is set to zero, the client gets a connection refused/timeout as below, where 127.0.0.1 (left side) is the internal service's IP and the right side is the EAP pod's IP (127.1.1.1).

01:42:04,263 WARN [org.infinispan.HOTROD] (HotRod-client-async-pool-1) ISPN004098: Closing connection [id: 0x0d7b5035, L:/127.0.0.1:34158 ! R:/127.1.1.1:34158:11222] due to transport error: java.net.SocketTimeoutException: PutOperation{counter.war, key=[B0x010403002D9801028A01270A2560C801..[50], value=[B0x01040300599801EA078A01520A1A9801..[94], flags=6, connection=/127.1.1.1:11222} timed out after 60000 ms at org.infinispan.client.hotrod@11.0.17.Final-redhat-00001//org.infinispan.client.hotrod.impl.operations.HotRodOperation.run(HotRodOperation.java:186)
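The retry behavior on the EAP side is controlled by the Hot Rod client's max_retries setting; a sketch of the relevant hotrod-client.properties entries (values are illustrative, not recommendations):

```properties
# Number of times an operation is retried on another server before failing
infinispan.client.hotrod.max_retries = 3
# Timeouts (ms) that produce the SocketTimeoutException shown above
infinispan.client.hotrod.connect_timeout = 60000
infinispan.client.hotrod.socket_timeout = 60000
```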

Or connection refused:

$ oc get svc
NAME                                             TYPE           CLUSTER-IP       EXTERNAL-IP                                                               PORT(S)           AGE
dg-cluster-nyc-one                               ClusterIP      127.0.0.1    <none>                                                                    11222/TCP         30m <-------------------
...
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: dg-cluster-nyc-one.dg-test-nyc.svc.cluster.local/127.0.0.1.:11222 <-------------------
Caused by: java.net.ConnectException: Connection refused

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.