Features, limitations, and known issues with DG Operator 8.3.x
Environment
- Red Hat Data Grid (RHDG)
- 8.x
- DG Operator 8.3.x vs 8.2.x
Issue
What are the known issues with DG Operator 8.3.x?
Resolution
As the main recommendation, do not install an old version of the DG Operator; install the latest. The following issues were diagnosed on DG Operator 8.3.x:
DG Operator 8.3.6:
| Issue | Comment | Action | Version | Jira |
|---|---|---|---|---|
| Xsite migration | There are known issues with xsite migration | For those cases start a new namespace with the latest DG Operator version | 8.3.x | Issue 1558; Issue 1554 |
| Upgrade will be required when installing a DG Operator | Harmless issue | Accept the upgrade for the same version (or accept via oc edit ip $ipname - example see script_approve_ip) | 8.3.x | - |
| Cache doesn’t attach and doesn’t get ready | The cache objects are created even if the cache has issues | Check the DG Operator logs for the cache creation; if the cache template is wrong, fix the template | 8.x | - |
| Upgrade didn’t appear on DG op 8.3.5 | The upgrade might not be available to DG Op 8.3.6 | Refresh the console; might need to re-install it | 8.3.5 | - |
| Cache created via CM (briefly) disappears | When a cache is set via CM it briefly disappears on the upgrade | No action; the cache will re-appear | 8.3.6 | - |
| Console stuck with a certain number of pods | The console gets stuck in a certain number of pods | Upgrade | 8.3.x (affected) - 8.3.7 (fixed) | JDG-5405 |
| Subsequent Cache CR reconciliations fail with Cache Service | The cache service keeps looping | Avoid deprecated spec.service.type: Cache | 8.3.x (affected) - 8.3.7 (to be fixed) | JDG-5427 |
| DG Operator 8.2.x does not have UFC protocol | Can be an issue on Xsite async | Upgrade to DG Operator 8.3.x | - | - |
| Batch creation fails | Template creation fails when the template is passed directly to oc apply | Process templates with oc process; apply plain YAML declarations with oc apply | all | - |
| Cache creation with memory max-size="1GB" and global state | Infinispan issue: configuring memory max-size as a non-byte value causes a failure on CacheManager restart | Add the B at the end (binary max size) | DG 8.3.x | JDG-5442; ISPN-13997 |
| Xsite replication configuration doesn't work even though it is correct | All configuration is correct but replication still doesn't work | Scale the pods to 0 and then scale up again | - | - |
| DG Operator 8.x does not have GC log | No GC logs are set by default on the image | Set via extraJvmOpts flags: extraJvmOpts: '-Xlog:gc*=info:file=/tmp/gc.log:time,level,tags,uptimemillis:filecount=10,filesize=1m' | DG 8.x | JDG-5461 |
| DG Operator 8.3.x does not have server.log in /server/log dir | DG Operator 8.2.x had server/log/server.log | Provide a custom log4j.xml in a ConfigMap specified via spec.configMapName | - | - |
| DG Operator restarts | {"error": "leader election lost"} | increase resources for Operator pod | Affected DG 8.3.x | - |
| DG Xsite GR pod crashes and xsite is affected | After the GR pod goes down, Xsite stops communicating (the issue is not consistent) | Increase the limit-range of the GR pod to avoid restarts | Affected DG 8.3.x | JGRP-2634 and issues/1665 - to be addressed in DG 8.4.x |
| IllegalArgumentException: No enum constant org.infinispan.security.AuthorizationPermission.READ,WRITE | The Operator has the "admin" role, but this might not be sufficient | Use the admin role on the cache definition | (fixed) DG Op 8.3.7 | JDG-5458 |
| Persistence definition using the tag <queries> breaks the config listener | Having persistence defined with the queries tag will break the configuration | Define the Cache CR spec.template in the DG console | DG 8.3.x | ISPN-14226 / JDG-5650 |
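The GC log workaround from the table above can be sketched as an Infinispan CR fragment. This is a minimal sketch, assuming the flags are passed through spec.container.extraJvmOpts; the cluster name example-infinispan and the replica count are hypothetical placeholders, not values from this solution.

```yaml
apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: example-infinispan   # hypothetical cluster name
spec:
  replicas: 2                # hypothetical cluster size
  container:
    # JDK unified logging: write GC events to /tmp/gc.log,
    # rotating across 10 files of 1 MB each
    extraJvmOpts: "-Xlog:gc*=info:file=/tmp/gc.log:time,level,tags,uptimemillis:filecount=10,filesize=1m"
```

Because /tmp inside the pod is not persistent in this sketch, the log should be copied out before the pod restarts if it is needed for analysis.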
DG Operator 8.3.7:
| Issue | Comment | Action | Version | Jira |
|---|---|---|---|---|
| LoadBalancer not properly exposed | Web console won't work | Use NodePort or Route | Operator 8.3.7 | JDG-5527 - Loadbalancer not exposed correctly |
| Valid configuration throws NPE | Valid config will throw NPE | Workaround: disable config-listener pod | Operator 8.3.7/6 | JDG-5531 - YamlConfigurationReader throws NPE for valid cache configuration |
| CacheConfigurationException: java.security.KeyStoreException: ELY02035: KeyStore type could not be detected | Corrupt/invalid keystore | Check whether the keystore is corrupted | Operator 8.3.x | N/A |
DG Operator 8.3.8:
| Issue | Comment | Action | Version | Jira |
|---|---|---|---|---|
| Upgrade to 8.3.8 fails | Upgrading from DG 8.3.7 to DG 8.3.8 gets stuck | Skip the upgrade | DG op 8.3.7 to 8.3.8 | OLM upgrade issue |
| Default Anti-affinity strategy configuration with the Operator is not valid | Default configuration will contain an r. before the Kubernetes node label | No action | DG 8.3.8 | Under investigation |
For more issues, see solution Issues with Data Grid Operator 8.3.8.
DG Operator 8.3.x vs 8.2.x: comparative
| Issue | DG Operator version | Fix/Description |
|---|---|---|
| server.log | Data Grid Operator 8.3.x | server.log is not in /server/log because file-based logging was disabled by default by Operator 1363 - example template cluster-dg-01-log.txt |
| heap dump tool | Data Grid Operator 8.3.x | There is no jcmd on DG 8.3.x because ubi:openjdk-11-runtimes doesn't include it. Alternatives: Generating JFR and Alternatives for creating heap dump in a DG 8 even without the JDK. |
| thread dumps | Data Grid Operator 8.3.x | there is no jcmd on DG 8.3.x - alternatives: kill -3 |
| gc log absence | DG 8.3.x | Bug - Server Image overwrites JVM defaults from server distribution |
| UFC absence | Data Grid Operator 8.2.x | use DG Operator 8.3.x. |
| ConfigMap doesn't work on DG op 8.2.x | Data Grid Operator 8.2.x | use DG Operator 8.3.x (feature added on DG op 8.3.x) |
| Permissions change: additional ServiceAccount permissions were added for the config listener | DG Operator 8.3.x | N/A |
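Restoring a server.log file through spec.configMapName, as mentioned in the tables above, could look roughly like the following. This is a sketch under assumptions: the ConfigMap name dg-custom-logging, the cluster name, the log pattern, and the use of the infinispan.server.log.path system property for the log directory are illustrative and should be checked against the server image in use.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dg-custom-logging          # hypothetical ConfigMap name
data:
  log4j.xml: |
    <?xml version="1.0" encoding="UTF-8"?>
    <Configuration name="ServerConfig" monitorInterval="60" shutdownHook="disable">
      <Appenders>
        <!-- Keep console output so that oc logs still works -->
        <Console name="STDOUT">
          <PatternLayout pattern="%d{HH:mm:ss,SSS} %-5p (%t) [%c] %m%throwable%n"/>
        </Console>
        <!-- Assumption: the server sets infinispan.server.log.path -->
        <File name="FILE" fileName="${sys:infinispan.server.log.path}/server.log">
          <PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss,SSS} %-5p (%t) [%c] %m%throwable%n"/>
        </File>
      </Appenders>
      <Loggers>
        <Root level="INFO">
          <AppenderRef ref="STDOUT"/>
          <AppenderRef ref="FILE"/>
        </Root>
      </Loggers>
    </Configuration>
---
apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: example-infinispan         # hypothetical cluster name
spec:
  replicas: 2                      # hypothetical cluster size
  configMapName: dg-custom-logging # points the Operator at the ConfigMap above
```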
Using a different version of DG 8 and DG Operator
By setting spec.image in the Custom Resource, one can override the default DG 8 version, because the image is then fetched directly from the given location. This is unsupported, however, given that the DG Operator already ships with a specific DG 8 version.
It can still be useful, for instance, to add jcmd on top of the default image, in which case the DG 8 image is otherwise still the default one.
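As an illustration of the override described above, a minimal sketch of an Infinispan CR with spec.image set follows. The registry path and tag are hypothetical; they stand for a custom image built from the default DG 8 image with jcmd added.

```yaml
apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: example-infinispan   # hypothetical cluster name
spec:
  replicas: 2                # hypothetical cluster size
  # Unsupported override: replaces the DG 8 image the Operator would select.
  # The image reference below is a made-up placeholder.
  image: registry.example.com/custom/datagrid-8-jcmd:latest
```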
Cache Creation [warning]
At this point, there is no warning about a wrongly configured cache: the Cache object is created and has a status, but the misconfigured cache never turns "Ready". Therefore verify the cache template, the namespace, the Infinispan cluster name, and the configuration set there.
Also, when creating caches and batches, make sure to process templates (oc process) and to apply plain YAML declarations (oc apply).
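The difference between a template and a plain YAML declaration can be sketched as follows. This is a hedged example: the template name, the cache name example-cache, and the distributed cache settings are hypothetical, and only the CLUSTER_NAME parameter mirrors the placeholder used elsewhere in this solution.

```yaml
apiVersion: template.openshift.io/v1
kind: Template
metadata:
  name: dg-cache-template            # hypothetical template name
parameters:
  - name: CLUSTER_NAME
    description: Name of the Infinispan CR that owns the cache
    required: true
objects:
  - apiVersion: infinispan.org/v2alpha1
    kind: Cache
    metadata:
      name: example-cache            # hypothetical Cache CR name
    spec:
      # Must match the Infinispan CR name in the same namespace
      clusterName: ${CLUSTER_NAME}
      name: example-cache
      template: |
        distributedCache:
          mode: "SYNC"
          statistics: "true"
```

A file like this contains an unresolved ${CLUSTER_NAME}, so it needs to go through oc process first, for example `oc process -f <file> -p CLUSTER_NAME=<cluster> | oc apply -f -`. Only the Cache object on its own, with a literal cluster name, is suitable for a direct oc apply.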
Invalidation cache will fail
Although the cache creation succeeds in the DG CLI, creating an invalidation cache through the Operator (via a Cache YAML) will fail, since there is no valid use case for an invalidation cache on a server inside an OCP environment. See more details in Data Grid Invalidation cache on server/client mode. The cache creation below will fail, and it is expected to fail:
spec:
  clusterName: ${CLUSTER_NAME}
  name: operator-cache-98
  template: |
    invalidationCache:
      locking:
        acquireTimeout: "15000"
        concurrencyLevel: "1000"
        striping: "false"
Cross-site change
As explained in Gossip Router pod in DG 8 OCP 4, DG Operator 8.3.x uses a Gossip Router pod. Only one Gossip Router pod is needed to operate nominally; if one of the sites crashes, the other site's Gossip Router is used via the Tunnel protocol. The following diagram shows the Tunnel protocol usage:
cluster1Node -> cluster1Master -> GossipRouter -> cluster2Master -> cluster2OtherNode
which is the same as:
cluster1Node -> cluster1Relay1 -> GossipRouter -> cluster1Relay2 -> cluster2OtherNode
Memory requirement changes:
The default memory requirement was increased from 512Mi to 1Gi, as the old value left only a very small amount of memory usable as a cache once the overhead of the server and the JVM was taken into account. See Operator 1386.
Deprecated type Cache
As explained in Deprecated service type Cache in DG 8 in OCP 4, the spec.service.type value Cache is deprecated and should be avoided; use type DataGrid instead. However, Cache is still the default value of spec.service.type, which maintains backwards compatibility on upgrades from other versions. Also note that the Cache type uses SerialGC as its garbage collector, same as the Gossip Router pod.
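Since Cache remains the default, the DataGrid service type has to be requested explicitly. A minimal sketch, with a hypothetical cluster name and replica count:

```yaml
apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: example-infinispan   # hypothetical cluster name
spec:
  replicas: 2                # hypothetical cluster size
  service:
    # Omitting this field falls back to the deprecated Cache type
    type: DataGrid
```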
Console Web
For the console to be accessible, one needs to explicitly expose the service via a Route or LoadBalancer, regardless of which service type is used. NodePort leads to continuous loading:
expose:
  type: LoadBalancer
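The Route variant can be sketched in the same way. The cluster name and hostname are hypothetical placeholders, and setting spec.expose.host explicitly is an assumption made here for illustration.

```yaml
apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: example-infinispan              # hypothetical cluster name
spec:
  replicas: 2                           # hypothetical cluster size
  expose:
    type: Route
    # Hypothetical hostname for the console Route
    host: datagrid.apps.example.com
```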
Diagnostic Steps
For diagnosing issues:
| Scenario | Solution |
|---|---|
| For OCP node crash - and its impact | DG 8 operation in case of OCP nodes crashing |
| For pod crashes/oome | Troubleshoot options for Data Grid pod crash |
| For JFR investigations on DG 8.3 | Generating and analyzing JFR in Data Grid 8 in OCP |
| About jcmd usage | Alternatives for creating heap dump in a DG 8 even without the JDK |
| Using custom DG configuration | Using custom configuration in DG 8 via Operator |
| Project cannot be deleted/Cache CR still hangs | How to delete all DG 8's objects in OCP 4? |
| How to interpret DG statistics | Interpreting Data Access statistics in DG 8 |
| How to describe CR fields | DG 8 Operator explain or describe fields |
| Migrate/Upgrade to DG 8.3 | How to migrate from DG 8.1/8.2 to DG 8.3 Operator |
| Config Listener pod in DG 8 OCP 4 | Config Listener pod in DG 8 OCP 4 |
| Multiple versions of DG Operator | Data Grid 8 Operators coexistence in a same Openshift cluster |
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.