How does an unshared CacheStore behave in an RHDG clustered environment?
Environment
- Red Hat Data Grid (RHDG)
Issue
- What is loaded when a clustered cache with an underlying file-store is started?
- I see stale entries on some nodes in a replicated cache
- How do I shut down and restart a clustered cache with unshared cache stores?
Resolution
Local stores, such as the file-based stores SingleFileStore, LevelDbStore, and RocksDbStore, must not be shared. Other store types might use a different underlying persistence and can also be configured with shared=false.
With preload=true every node loads the entries from its persistent store at startup, but only uses them if the node is the primary owner.
This takes time, the instance is not available while loading, and stale entries might be loaded needlessly.
preload=false avoids wasting time reading the data from the persistence.
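As a sketch, an unshared local file store might be declared as follows. The cache name and store path are placeholders, and the exact element and attribute names depend on the RHDG/Infinispan schema version in use:

```xml
<distributed-cache name="example-cache">
    <persistence>
        <!-- Local, unshared store: every node keeps its own copy of the data.
             preload="false" avoids reading the whole store at startup. -->
        <file-store shared="false"
                    preload="false"
                    purge="false"
                    path="example-store"/>
    </persistence>
</distributed-cache>
```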
If the cache is clustered (distributed or replicated) and there are multiple nodes in the cluster, the unshared persistence will cause problems if not handled correctly.
Up to version 7 there is no automatic mechanism to handle such a scenario.
Cluster (aka graceful) shutdown - RHDG 7+
How a complete restart of a clustered cache can be handled depends on the cache.
With RHDG 7+ the cluster can be stopped gracefully; see the Cluster Shutdown chapter.
All cluster nodes stop client access and rebalancing, persist their local data, and go down.
The restart can be done in any order, but all nodes that were active during the "shutdown cluster" operation are mandatory to get back to a healthy state!
Nodes that were not part of the cluster before are not able to start.
Currently there is an issue if the starting cluster is accessed by clients.
For get requests, not all entries might be available yet; for put requests, the update might be lost after all nodes are up and running, as it changes the state-transfer state.
This can end up in partial inconsistency.
This is tracked by JDG-3967 and will be fixed in RHDG 8.4+.
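With RHDG 8 the graceful shutdown can be triggered via the CLI or the REST API. A sketch of both, assuming a server listening on the default endpoint port 11222 with placeholder credentials:

```shell
# CLI: connect to any cluster node, then stop the whole cluster
bin/cli.sh
[disconnected]> connect http://127.0.0.1:11222
[node-1@cluster//containers/default]> shutdown cluster

# REST: the equivalent call against the cluster resource
curl -X POST -u admin:password \
  "http://127.0.0.1:11222/rest/v2/cluster?action=stop"
```

Both variants persist the data of every node before shutting the cluster down, so any restart order is safe as long as all of these nodes come back.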
Individual shutdown of nodes
- In case of replicated caches the nodes can go down in any order, and the last one holds the master data. This node needs to be started first, and all others should start with an empty store to prevent stale entries.
Note that the last remaining node is still accepting incoming requests.
To start with an empty store it is possible to set purge=true or to remove the cache's DAT files from the server's data directory.
- In case of distributed caches the nodes need to be shut down one by one, and each subsequent shutdown must be delayed until rehashing is complete, to not lose data.
Finally the last node holds the complete cache persistence, and this node needs to be started first.
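Purging on startup can be enabled on the store declaration, for example (a sketch; cache name and path are placeholders, attribute names per the Infinispan/RHDG schema):

```xml
<replicated-cache name="example-cache">
    <persistence>
        <!-- purge="true" empties this node's store at startup, so a node
             started after the master-data node cannot surface stale entries -->
        <file-store shared="false"
                    purge="true"
                    path="example-store"/>
    </persistence>
</replicated-cache>
```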
Warning: this is why we do not recommend this approach; a cluster should use the shutdown cluster command, available via CLI and REST, to handle a complete cluster shutdown.
Note: a local, file-based store can be used to increase startup speed if the data can be retrieved from another persistence or recalculated. If the amount of data is huge or a reliable persistence is needed, you should consider a different cache store approach.
Hint: this applies to an environment that is managed individually; if managed by the OpenShift Operator, the behavior is different.
Root Cause
A cache with a local store, like a file-store or any other store with shared=false, loads the entries from the store first. Depending on the type, the store is first read to hold the keys in memory (FileStore).
If preload=true the values are also loaded from the store; in case of a SingleFileStore this is a second, random access to the file.
After this is complete and the node has joined the cluster, state transfer updates the local cache from the running cluster with changed and new entries.
Because of this, the stale entries are available on this instance only, as they are not pushed to other nodes!
Also, in case of a full cluster restart, the first started node is responsible for the new cluster state, and ONLY the entries stored there are used for rebalancing; this will cause inconsistency within the cluster after restart if the shutdown and start order is wrong!
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.