RADOS Block Device (RBD) Replica-2 with both disks in the same zone - Developer preview OpenShift Data Foundation 4.16
CAUTION - Read this carefully before proceeding:
This is a hack to achieve this goal and is in no way a supported configuration. It involves disabling reconciliation of the cephblockpools, which can cause bigger issues that might be hard to recover from. More importantly, for this to work, there must be at least one extra usable disk in each zone before trying to enable it.
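Before starting, confirm that spare disks are actually available in every zone. On deployments backed by local-storage PVs, a rough check (a sketch; adjust to your environment) is to list unbound PVs:
`oc get pv | grep Available`
On dynamically provisioned deployments (such as the gp3-csi device sets used in the example below), new volumes are created on demand, so no pre-existing PVs are required.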
Overview
This article covers a scenario in which storage must be allocated within a designated zone so that applications strictly access OSDs in their own zone, eliminating inter-zone IO and replication traffic. The data still needs to be replicated within the zone, because at least replica-2 is required to protect against a disk or host failure.
Configuring replica-2 with both disks in the same zone
Procedure
- Install OpenShift Data Foundation.
- Create a storagecluster and wait for it to be ready.
- Patch the storagecluster to enable replica-1. This creates the extra pools, storageclass, and OSDs needed for this scenario.
`oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/managedResources/cephNonResilientPools/enable", "value": true }]'`
Wait for the storagecluster to become ready again.
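A quick way to confirm readiness is to check the phase in the storagecluster status:
`oc get storagecluster ocs-storagecluster -n openshift-storage -o jsonpath='{.status.phase}'`
This should print Ready before you continue.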
- Disable reconciliation of the cephblockpools.
`oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/managedResources/cephBlockPools/reconcileStrategy", "value": "ignore" }]'`
- Edit all the replica-1 cephblockpools created for each failure domain and set replicated.size to 2 (a loop covering all pools is shown after the example).
For example:
`oc patch cephblockpool ocs-storagecluster-cephblockpool-us-east-1a -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicated/size", "value": 2 }]'`
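To cover every failure-domain pool in one go, the patch can be looped over the zone names. A minimal sketch, assuming the three zones used in this example:
for zone in us-east-1a us-east-1b us-east-1c; do
  oc patch cephblockpool ocs-storagecluster-cephblockpool-$zone -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicated/size", "value": 2 }]'
done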
- Repeat the above step for all the cephblockpools, or use the loop shown above.
- Edit all the cephblockpools to set the failure domain to osd (again, a loop follows the example).
For example:
`oc patch cephblockpool ocs-storagecluster-cephblockpool-us-east-1a -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/failureDomain", "value": "osd" }]'`
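The same loop pattern applies here; a sketch, assuming the same zone names:
for zone in us-east-1a us-east-1b us-east-1c; do
  oc patch cephblockpool ocs-storagecluster-cephblockpool-$zone -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/failureDomain", "value": "osd" }]'
done
Afterwards, all pools can be checked at once with `oc get cephblockpool -n openshift-storage -o custom-columns=NAME:.metadata.name,SIZE:.spec.replicated.size,FAILUREDOMAIN:.spec.failureDomain`.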
- Repeat the above step for all the cephblockpools.
- Increase the number of OSDs per failure domain to 2.
`oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/managedResources/cephNonResilientPools/count", "value": 2 }]'`
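The operator then adds one more OSD in each failure domain. The new pods can be watched as they come up (app=rook-ceph-osd is the label Rook applies to OSD pods):
`oc get pods -n openshift-storage -l app=rook-ceph-osd -w`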
Result
New OSD prepare pods and OSD pods are created:
~ $ oc get pods | grep osd
rook-ceph-osd-0-79fd56bf96-kw8bs 2/2 Running 0 6h26m
rook-ceph-osd-1-7c67574bb9-b446d 2/2 Running 0 6h26m
rook-ceph-osd-2-54f5d8cbb7-6vwlj 2/2 Running 0 6h26m
rook-ceph-osd-3-86b88f6656-q76j2 2/2 Running 0 5h20m
rook-ceph-osd-4-685fd8bd86-fldd8 2/2 Running 0 5h20m
rook-ceph-osd-5-65669d6445-7hpr4 2/2 Running 0 5h20m
rook-ceph-osd-6-7d67dbb8df-qprxh 2/2 Running 0 4h58m
rook-ceph-osd-7-7454fb654f-vqt62 2/2 Running 0 4h58m
rook-ceph-osd-8-b9dbf7ccf-znb5j 2/2 Running 0 4h58m
rook-ceph-osd-9-557bdc8db6-j6gh7 2/2 Running 0 22s
rook-ceph-osd-prepare-567c8a55e618dc355ddb04cfbf7de15b-57gzk 0/1 Completed 0 6h26m
rook-ceph-osd-prepare-693e88a014d9ac263328b45815581cce-fgfjb 0/1 Completed 0 41s
rook-ceph-osd-prepare-dd690e88c9a4a7702c8af06178f70936-q6744 0/1 Completed 0 6h26m
rook-ceph-osd-prepare-f33ba9f1dc57226ca278831d4dfcd327-qvrxh 0/1 Completed 0 6h26m
rook-ceph-osd-prepare-us-east-1a-data-0k4c9x-txg46 0/1 Completed 0 5h20m
rook-ceph-osd-prepare-us-east-1a-data-16wvgx-c4l8w 0/1 Completed 0 4h58m
rook-ceph-osd-prepare-us-east-1b-data-05h78r-4672f 0/1 Completed 0 5h20m
rook-ceph-osd-prepare-us-east-1b-data-1vck42-ntdxb 0/1 Completed 0 4h58m
rook-ceph-osd-prepare-us-east-1c-data-0wnzxh-7zx9b 0/1 Completed 0 5h20m
rook-ceph-osd-prepare-us-east-1c-data-14jkm5-wn2xs 0/1 Completed 0 4h58m
New pools with replica-2
sh-5.1$ ceph osd pool ls detail | grep block
pool 1 'ocs-storagecluster-cephblockpool' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 141 lfor 0/0/32 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.49 application rbd
pool 5 'ocs-storagecluster-cephblockpool-us-east-1b' replicated size 2 min_size 1 crush_rule 14 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 137 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 6 'ocs-storagecluster-cephblockpool-us-east-1c' replicated size 2 min_size 1 crush_rule 15 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 139 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 7 'ocs-storagecluster-cephblockpool-us-east-1a' replicated size 2 min_size 1 crush_rule 13 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 135 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
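Because the failure domain of these pools was changed to osd, each pool's CRUSH rule should now pick two OSDs within a single zone instead of spreading across zones. The rule can be inspected from the toolbox; a sketch, assuming the rule name matches the pool name (otherwise, use the crush_rule id from the listing above):
`ceph osd crush rule dump ocs-storagecluster-cephblockpool-us-east-1a`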
New OSD tree
sh-5.1$ ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 18.00000 - 18 TiB 594 MiB 358 MiB 0 B 236 MiB 18 TiB 0.00 1.00 - root default
-5 18.00000 - 18 TiB 594 MiB 358 MiB 0 B 236 MiB 18 TiB 0.00 1.00 - region us-east-1
-4 6.00000 - 6 TiB 195 MiB 119 MiB 0 B 75 MiB 6.0 TiB 0.00 0.98 - zone us-east-1a
-3 2.00000 - 2 TiB 90 MiB 59 MiB 0 B 32 MiB 2.0 TiB 0.00 1.37 - host ocs-deviceset-gp3-csi-2-data-0fjpth
0 ssd 2.00000 1.00000 2 TiB 90 MiB 59 MiB 0 B 32 MiB 2.0 TiB 0.00 1.37 35 up osd.0
-46 2.00000 - 2 TiB 49 MiB 31 MiB 0 B 17 MiB 2.0 TiB 0.00 0.74 - host us-east-1a-data-0k4c9x
4 us-east-1a 2.00000 1.00000 2 TiB 49 MiB 31 MiB 0 B 17 MiB 2.0 TiB 0.00 0.74 41 up osd.4
-66 2.00000 - 2 TiB 55 MiB 29 MiB 0 B 27 MiB 2.0 TiB 0.00 0.84 - host us-east-1a-data-16wvgx
8 us-east-1a 2.00000 1.00000 2 TiB 55 MiB 29 MiB 0 B 27 MiB 2.0 TiB 0.00 0.84 38 up osd.8
-10 6.00000 - 6 TiB 189 MiB 119 MiB 0 B 70 MiB 6.0 TiB 0.00 0.96 - zone us-east-1b
-9 2.00000 - 2 TiB 93 MiB 54 MiB 0 B 39 MiB 2.0 TiB 0.00 1.40 - host ocs-deviceset-gp3-csi-1-data-0578zr
1 ssd 2.00000 1.00000 2 TiB 93 MiB 54 MiB 0 B 39 MiB 2.0 TiB 0.00 1.40 45 up osd.1
-41 2.00000 - 2 TiB 54 MiB 37 MiB 0 B 17 MiB 2.0 TiB 0.00 0.81 - host us-east-1b-data-05h78r
3 us-east-1b 2.00000 1.00000 2 TiB 54 MiB 37 MiB 0 B 17 MiB 2.0 TiB 0.00 0.81 39 up osd.3
-61 2.00000 - 2 TiB 43 MiB 29 MiB 0 B 15 MiB 2.0 TiB 0.00 0.65 - host us-east-1b-data-1vck42
7 us-east-1b 2.00000 1.00000 2 TiB 43 MiB 29 MiB 0 B 15 MiB 2.0 TiB 0.00 0.65 31 up osd.7
-14 6.00000 - 6 TiB 210 MiB 119 MiB 0 B 90 MiB 6.0 TiB 0.00 1.06 - zone us-east-1c
-13 2.00000 - 2 TiB 77 MiB 36 MiB 0 B 41 MiB 2.0 TiB 0.00 1.17 - host ocs-deviceset-gp3-csi-0-data-0flnqg
2 ssd 2.00000 1.00000 2 TiB 77 MiB 36 MiB 0 B 41 MiB 2.0 TiB 0.00 1.17 33 up osd.2
-51 2.00000 - 2 TiB 66 MiB 38 MiB 0 B 28 MiB 2.0 TiB 0.00 1.01 - host us-east-1c-data-0wnzxh
5 us-east-1c 2.00000 1.00000 2 TiB 66 MiB 38 MiB 0 B 28 MiB 2.0 TiB 0.00 1.01 42 up osd.5
-56 2.00000 - 2 TiB 66 MiB 45 MiB 0 B 21 MiB 2.0 TiB 0.00 1.00 - host us-east-1c-data-14jkm5
6 us-east-1c 2.00000 1.00000 2 TiB 66 MiB 45 MiB 0 B 21 MiB 2.0 TiB 0.00 1.00 40 up osd.6
TOTAL 18 TiB 594 MiB 358 MiB 0 B 236 MiB 18 TiB 0.00
MIN/MAX VAR: 0.65/1.40 STDDEV: 0
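Note the CLASS column: the new OSDs report their zone name (for example, us-east-1a) as their CRUSH device class, which is how each per-zone pool is steered to its own OSDs. The classes can be listed from the toolbox:
`ceph osd crush class ls`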
Using the storageclass with the pools
Procedure
- Prepare a test Namespace.
~ % oc create ns test
~ % oc project test
~ % oc create sa test
~ % oc adm policy add-scc-to-user -z test privileged
- Create a PVC.
~ % cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: non-resilient-rbd-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-ceph-non-resilient-rbd
EOF
The PVC stays in the Pending state, waiting for a consumer:
~ % oc get pvc | grep non-resilient-rbd-pvc
non-resilient-rbd-pvc Pending ocs-storagecluster-ceph-non-resilient-rbd 16s
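The Pending state is expected: the non-resilient storageclass uses the WaitForFirstConsumer volume binding mode, so no volume is provisioned until a pod using the PVC is scheduled and its zone is known. This can be confirmed with:
`oc get sc ocs-storagecluster-ceph-non-resilient-rbd -o jsonpath='{.volumeBindingMode}'`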
- Create a pod to consume the PVC.
~ % cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: task-pv-pod
spec:
  nodeSelector:
    # Change according to the required failure domain
    topology.kubernetes.io/zone: us-east-1a
  volumes:
    - name: task-pv-storage
      persistentVolumeClaim:
        claimName: non-resilient-rbd-pvc
  containers:
    - name: task-pv-container
      image: nginx
      ports:
        - containerPort: 80
          name: "http-server"
      volumeMounts:
        - mountPath: "/usr/share/nginx/html"
          name: task-pv-storage
      securityContext:
        privileged: true
EOF
~ % oc get pods | grep task
task-pv-pod 1/1 Running 0 57s
- Check the status of the PVC and the newly created PV.
~ % oc get pvc | grep non-resilient-rbd-pvc
non-resilient-rbd-pvc Bound pvc-d1a26969-a15c-4e60-8f76-847d1fdd6041 1Gi RWO ocs-storagecluster-ceph-non-resilient-rbd 3m49s
~ % oc get pv | grep non-resilient-rbd-pvc
pvc-d1a26969-a15c-4e60-8f76-847d1fdd6041 1Gi RWO Delete Bound openshift-storage/non-resilient-rbd-pvc ocs-storagecluster-ceph-non-resilient-rbd 2m18s
- From the toolbox pod, identify which pool holds the RBD image for the new PV.
sh-4.4$ rbd ls ocs-storagecluster-cephblockpool-us-east-1a
csi-vol-937ee9e9-2f99-11ed-b2d0-0a580a83001a
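If the toolbox pod is not already running, it can be enabled by patching the OCSInitialization resource and then opening a shell in the resulting deployment:
`oc patch ocsinitializations.ocs.openshift.io ocsinit -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/enableCephTools", "value": true }]'`
`oc rsh -n openshift-storage deploy/rook-ceph-tools`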
- Match this imageName to the imageName in the PV created earlier.
~ % oc get pv pvc-d1a26969-a15c-4e60-8f76-847d1fdd6041 -o=jsonpath='{.spec.csi.volumeAttributes}' | jq
{
"clusterID": "openshift-storage",
"imageFeatures": "layering,deep-flatten,exclusive-lock,object-map,fast-diff",
"imageFormat": "2",
"imageName": "csi-vol-937ee9e9-2f99-11ed-b2d0-0a580a83001a",
"journalPool": "ocs-storagecluster-cephblockpool",
"pool": "ocs-storagecluster-cephblockpool-us-east-1a",
"storage.kubernetes.io/csiProvisionerIdentity": "1662652453217-8081-openshift-storage.rbd.csi.ceph.com",
"topologyConstrainedPools": "[\n {\n \"poolName\": \"ocs-storagecluster-cephblockpool-us-east-1a\",\n \"domainSegments\": [\n {\n \"domainLabel\": \"zone\",\n \"value\": \"us-east-1a\"\n }\n ]\n },\n {\n \"poolName\": \"ocs-storagecluster-cephblockpool-us-east-1b\",\n \"domainSegments\": [\n {\n \"domainLabel\": \"zone\",\n \"value\": \"us-east-1b\"\n }\n ]\n },\n {\n \"poolName\": \"ocs-storagecluster-cephblockpool-us-east-1c\",\n \"domainSegments\": [\n {\n \"domainLabel\": \"zone\",\n \"value\": \"us-east-1c\"\n }\n ]\n }\n]"
}
Verifying the replication with data
- Create random text data.
~ $ oc rsh task-pv-pod
# cd /usr/share/nginx/html
# ls
lost+found
# tr -dc "A-Za-z 0-9" < /dev/urandom | fold -w100 | head -n 100000000 > file.txt
# ls -lh
total 9.5G
-rw-r--r--. 1 root 1000700000 9.5G Apr 15 18:37 file.txt
drwxrws---. 2 root 1000700000 16K Apr 15 18:31 lost+found
- Check the replication.
The 9.5 GiB of data that was added is replicated inside the same zone, on osd.5 and osd.6:
sh-5.1$ ceph osd df tree
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
-1 18.00000 - 18 TiB 19 GiB 19 GiB 0 B 307 MiB 18 TiB 0.11 1.00 - root default
-5 18.00000 - 18 TiB 19 GiB 19 GiB 0 B 307 MiB 18 TiB 0.11 1.00 - region us-east-1
-4 6.00000 - 6 TiB 19 GiB 19 GiB 0 B 189 MiB 6.0 TiB 0.31 2.95 - zone us-east-1a
-3 2.00000 - 2 TiB 76 MiB 43 MiB 0 B 33 MiB 2.0 TiB 0.00 0.03 - host ocs-deviceset-gp3-csi-1-data-0f2pz7
1 ssd 2.00000 1.00000 2 TiB 76 MiB 43 MiB 0 B 33 MiB 2.0 TiB 0.00 0.03 36 up osd.1
-51 2.00000 - 2 TiB 9.5 GiB 9.5 GiB 0 B 82 MiB 2.0 TiB 0.47 4.42 - host us-east-1a-data-0kkr4m
5 us-east-1a 2.00000 1.00000 2 TiB 9.5 GiB 9.5 GiB 0 B 82 MiB 2.0 TiB 0.47 4.42 42 up osd.5
-56 2.00000 - 2 TiB 9.5 GiB 9.4 GiB 0 B 74 MiB 2.0 TiB 0.46 4.41 - host us-east-1a-data-189c4g
6 us-east-1a 2.00000 1.00000 2 TiB 9.5 GiB 9.4 GiB 0 B 74 MiB 2.0 TiB 0.46 4.41 37 up osd.6
-10 6.00000 - 6 TiB 150 MiB 92 MiB 0 B 58 MiB 6.0 TiB 0.00 0.02 - zone us-east-1b
-9 2.00000 - 2 TiB 62 MiB 30 MiB 0 B 33 MiB 2.0 TiB 0.00 0.03 - host ocs-deviceset-gp3-csi-2-data-02msmj
2 ssd 2.00000 1.00000 2 TiB 62 MiB 30 MiB 0 B 33 MiB 2.0 TiB 0.00 0.03 39 up osd.2
-46 2.00000 - 2 TiB 35 MiB 19 MiB 0 B 16 MiB 2.0 TiB 0.00 0.02 - host us-east-1b-data-0zdwlv
3 us-east-1b 2.00000 1.00000 2 TiB 35 MiB 19 MiB 0 B 16 MiB 2.0 TiB 0.00 0.02 39 up osd.3
-61 2.00000 - 2 TiB 53 MiB 44 MiB 0 B 8.8 MiB 2.0 TiB 0.00 0.02 - host us-east-1b-data-1nncsl
7 us-east-1b 2.00000 1.00000 2 TiB 53 MiB 44 MiB 0 B 8.8 MiB 2.0 TiB 0.00 0.02 38 up osd.7
-14 6.00000 - 6 TiB 152 MiB 91 MiB 0 B 61 MiB 6.0 TiB 0.00 0.02 - zone us-east-1c
-13 2.00000 - 2 TiB 60 MiB 27 MiB 0 B 32 MiB 2.0 TiB 0.00 0.03 - host ocs-deviceset-gp3-csi-0-data-0llcxl
0 ssd 2.00000 1.00000 2 TiB 60 MiB 27 MiB 0 B 32 MiB 2.0 TiB 0.00 0.03 31 up osd.0
-41 2.00000 - 2 TiB 44 MiB 29 MiB 0 B 15 MiB 2.0 TiB 0.00 0.02 - host us-east-1c-data-087jsr
4 us-east-1c 2.00000 1.00000 2 TiB 44 MiB 29 MiB 0 B 15 MiB 2.0 TiB 0.00 0.02 50 up osd.4
-66 2.00000 - 2 TiB 48 MiB 35 MiB 0 B 13 MiB 2.0 TiB 0.00 0.02 - host us-east-1c-data-182pqz
8 us-east-1c 2.00000 1.00000 2 TiB 48 MiB 35 MiB 0 B 13 MiB 2.0 TiB 0.00 0.02 33 up osd.8
TOTAL 18 TiB 19 GiB 19 GiB 0 B 307 MiB 18 TiB 0.11
MIN/MAX VAR: 0.02/4.42 STDDEV: 0.19
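To double-check placement at the PG level, list the placement groups of a zone pool from the toolbox; both OSDs in each ACTING set should belong to the same zone:
`ceph pg ls-by-pool ocs-storagecluster-cephblockpool-us-east-1a`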