RADOS Block Device (RBD) Replica-2 with both disks in the same zone - Developer preview OpenShift Data Foundation 4.16

CAUTION - Read this carefully before proceeding:
This is a workaround, not a supported configuration in any way. It involves disabling reconciliation of the cephblockpools, which can cause bigger issues that might be hard to recover from. More importantly, for this to work there must be at least one extra usable disk in each zone before trying to enable it.

Overview

This article covers a scenario in which storage allocation must be confined to a designated zone, so that applications strictly access OSDs within their respective zones and inter-zone I/O and replication traffic are eliminated. The data still needs to be replicated within the zone, because at least replica-2 is required to protect against a disk or host failure.

Configuring replica-2 with both disks in the same zone

Procedure

  1. Install OpenShift Data Foundation.
  2. Create a storagecluster and wait for the storagecluster to be ready.
  3. Patch the storagecluster to enable replica-1. This creates the extra pools, storageclass, and OSDs needed for this scenario.
`oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/managedResources/cephNonResilientPools/enable", "value": true }]'`

    Wait for the storagecluster to become ready again.
  4. Disable reconciliation of the cephblockpools.

`oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/managedResources/cephBlockPools/reconcileStrategy", "value": "ignore" }]'`
  5. Edit each replica-1 cephblockpool created per failure domain to set replicated.size to 2.
    For example:
`oc patch cephblockpool ocs-storagecluster-cephblockpool-us-east-1a -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/replicated/size", "value": 2 }]'`
    Repeat this step for all the cephblockpools.
  6. Edit each cephblockpool to set the failure domain to osd.
    For example:
`oc patch cephblockpool ocs-storagecluster-cephblockpool-us-east-1a -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/failureDomain", "value": "osd" }]'`
    Repeat this step for all the cephblockpools.
  7. Increase the number of OSDs per failure domain to 2.
`oc patch storagecluster ocs-storagecluster -n openshift-storage --type json --patch '[{ "op": "replace", "path": "/spec/managedResources/cephNonResilientPools/count", "value": 2 }]'`
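The per-pool edits above (replicated.size and failureDomain) can also be applied to every zone pool in one pass. A minimal sketch; the zone names are taken from this example cluster, so adjust them to your own failure domains. The `echo` makes this a dry run that only prints the commands; remove it to apply them.

```shell
# Sketch: build the combined patch once, then apply it to every per-zone pool.
# Zone names are an assumption from this example cluster; adjust to yours.
# "echo" makes this a dry run that prints the commands; remove it to apply.
PATCH='[{"op": "replace", "path": "/spec/replicated/size", "value": 2}, {"op": "replace", "path": "/spec/failureDomain", "value": "osd"}]'
for zone in us-east-1a us-east-1b us-east-1c; do
  echo oc patch cephblockpool "ocs-storagecluster-cephblockpool-${zone}" \
    -n openshift-storage --type json --patch "${PATCH}"
done
```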

Result

New OSD prepare pods and OSD pods

~ $ oc get pods | grep osd
rook-ceph-osd-0-79fd56bf96-kw8bs                                  2/2     Running     0          6h26m
rook-ceph-osd-1-7c67574bb9-b446d                                  2/2     Running     0          6h26m
rook-ceph-osd-2-54f5d8cbb7-6vwlj                                  2/2     Running     0          6h26m
rook-ceph-osd-3-86b88f6656-q76j2                                  2/2     Running     0          5h20m
rook-ceph-osd-4-685fd8bd86-fldd8                                  2/2     Running     0          5h20m
rook-ceph-osd-5-65669d6445-7hpr4                                  2/2     Running     0          5h20m
rook-ceph-osd-6-7d67dbb8df-qprxh                                  2/2     Running     0          4h58m
rook-ceph-osd-7-7454fb654f-vqt62                                  2/2     Running     0          4h58m
rook-ceph-osd-8-b9dbf7ccf-znb5j                                   2/2     Running     0          4h58m
rook-ceph-osd-9-557bdc8db6-j6gh7                                  2/2     Running     0          22s
rook-ceph-osd-prepare-567c8a55e618dc355ddb04cfbf7de15b-57gzk      0/1     Completed   0          6h26m
rook-ceph-osd-prepare-693e88a014d9ac263328b45815581cce-fgfjb      0/1     Completed   0          41s
rook-ceph-osd-prepare-dd690e88c9a4a7702c8af06178f70936-q6744      0/1     Completed   0          6h26m
rook-ceph-osd-prepare-f33ba9f1dc57226ca278831d4dfcd327-qvrxh      0/1     Completed   0          6h26m
rook-ceph-osd-prepare-us-east-1a-data-0k4c9x-txg46                0/1     Completed   0          5h20m
rook-ceph-osd-prepare-us-east-1a-data-16wvgx-c4l8w                0/1     Completed   0          4h58m
rook-ceph-osd-prepare-us-east-1b-data-05h78r-4672f                0/1     Completed   0          5h20m
rook-ceph-osd-prepare-us-east-1b-data-1vck42-ntdxb                0/1     Completed   0          4h58m
rook-ceph-osd-prepare-us-east-1c-data-0wnzxh-7zx9b                0/1     Completed   0          5h20m
rook-ceph-osd-prepare-us-east-1c-data-14jkm5-wn2xs                0/1     Completed   0          4h58m

New pools with replica-2

sh-5.1$ ceph osd pool ls detail | grep block
pool 1 'ocs-storagecluster-cephblockpool' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 141 lfor 0/0/32 flags hashpspool,selfmanaged_snaps stripe_width 0 target_size_ratio 0.49 application rbd
pool 5 'ocs-storagecluster-cephblockpool-us-east-1b' replicated size 2 min_size 1 crush_rule 14 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 137 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 6 'ocs-storagecluster-cephblockpool-us-east-1c' replicated size 2 min_size 1 crush_rule 15 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 139 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 7 'ocs-storagecluster-cephblockpool-us-east-1a' replicated size 2 min_size 1 crush_rule 13 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 135 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd

New OSD tree

sh-5.1$ ceph osd df tree
ID   CLASS       WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP  META     AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME                                           
 -1              18.00000         -  18 TiB  594 MiB  358 MiB   0 B  236 MiB   18 TiB  0.00  1.00    -          root default                                        
 -5              18.00000         -  18 TiB  594 MiB  358 MiB   0 B  236 MiB   18 TiB  0.00  1.00    -              region us-east-1                                
 -4               6.00000         -   6 TiB  195 MiB  119 MiB   0 B   75 MiB  6.0 TiB  0.00  0.98    -                  zone us-east-1a                             
 -3               2.00000         -   2 TiB   90 MiB   59 MiB   0 B   32 MiB  2.0 TiB  0.00  1.37    -                      host ocs-deviceset-gp3-csi-2-data-0fjpth
  0         ssd   2.00000   1.00000   2 TiB   90 MiB   59 MiB   0 B   32 MiB  2.0 TiB  0.00  1.37   35      up                  osd.0                               
-46               2.00000         -   2 TiB   49 MiB   31 MiB   0 B   17 MiB  2.0 TiB  0.00  0.74    -                      host us-east-1a-data-0k4c9x             
  4  us-east-1a   2.00000   1.00000   2 TiB   49 MiB   31 MiB   0 B   17 MiB  2.0 TiB  0.00  0.74   41      up                  osd.4                               
-66               2.00000         -   2 TiB   55 MiB   29 MiB   0 B   27 MiB  2.0 TiB  0.00  0.84    -                      host us-east-1a-data-16wvgx             
  8  us-east-1a   2.00000   1.00000   2 TiB   55 MiB   29 MiB   0 B   27 MiB  2.0 TiB  0.00  0.84   38      up                  osd.8                               
-10               6.00000         -   6 TiB  189 MiB  119 MiB   0 B   70 MiB  6.0 TiB  0.00  0.96    -                  zone us-east-1b                             
 -9               2.00000         -   2 TiB   93 MiB   54 MiB   0 B   39 MiB  2.0 TiB  0.00  1.40    -                      host ocs-deviceset-gp3-csi-1-data-0578zr
  1         ssd   2.00000   1.00000   2 TiB   93 MiB   54 MiB   0 B   39 MiB  2.0 TiB  0.00  1.40   45      up                  osd.1                               
-41               2.00000         -   2 TiB   54 MiB   37 MiB   0 B   17 MiB  2.0 TiB  0.00  0.81    -                      host us-east-1b-data-05h78r             
  3  us-east-1b   2.00000   1.00000   2 TiB   54 MiB   37 MiB   0 B   17 MiB  2.0 TiB  0.00  0.81   39      up                  osd.3                               
-61               2.00000         -   2 TiB   43 MiB   29 MiB   0 B   15 MiB  2.0 TiB  0.00  0.65    -                      host us-east-1b-data-1vck42             
  7  us-east-1b   2.00000   1.00000   2 TiB   43 MiB   29 MiB   0 B   15 MiB  2.0 TiB  0.00  0.65   31      up                  osd.7                               
-14               6.00000         -   6 TiB  210 MiB  119 MiB   0 B   90 MiB  6.0 TiB  0.00  1.06    -                  zone us-east-1c                             
-13               2.00000         -   2 TiB   77 MiB   36 MiB   0 B   41 MiB  2.0 TiB  0.00  1.17    -                      host ocs-deviceset-gp3-csi-0-data-0flnqg
  2         ssd   2.00000   1.00000   2 TiB   77 MiB   36 MiB   0 B   41 MiB  2.0 TiB  0.00  1.17   33      up                  osd.2                               
-51               2.00000         -   2 TiB   66 MiB   38 MiB   0 B   28 MiB  2.0 TiB  0.00  1.01    -                      host us-east-1c-data-0wnzxh             
  5  us-east-1c   2.00000   1.00000   2 TiB   66 MiB   38 MiB   0 B   28 MiB  2.0 TiB  0.00  1.01   42      up                  osd.5                               
-56               2.00000         -   2 TiB   66 MiB   45 MiB   0 B   21 MiB  2.0 TiB  0.00  1.00    -                      host us-east-1c-data-14jkm5             
  6  us-east-1c   2.00000   1.00000   2 TiB   66 MiB   45 MiB   0 B   21 MiB  2.0 TiB  0.00  1.00   40      up                  osd.6                               
                              TOTAL  18 TiB  594 MiB  358 MiB   0 B  236 MiB   18 TiB  0.00                                                                         
MIN/MAX VAR: 0.65/1.40  STDDEV: 0

Using the storageclass with the pools

Procedure

  1. Prepare a test Namespace.
 ~ % oc create ns test

 ~ % oc project test

 ~ % oc create sa test

 ~ % oc adm policy add-scc-to-user -z test privileged
  2. Create a PVC.
~ % cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: non-resilient-rbd-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-ceph-non-resilient-rbd
EOF

The PVC stays in the Pending state, waiting for a consumer.

 ~ % oc get pvc | grep non-resilient-rbd-pvc
non-resilient-rbd-pvc             Pending                                                                        ocs-storagecluster-ceph-non-resilient-rbd   16s
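The Pending state is expected here: the non-resilient storageclass defers provisioning until a pod using the PVC is scheduled, so that the pod's zone is known before a pool is chosen. A quick check (a sketch, assuming the default storageclass name created above; expected to print WaitForFirstConsumer):

```shell
# The topology-constrained storageclass should defer volume binding.
oc get storageclass ocs-storagecluster-ceph-non-resilient-rbd \
  -o jsonpath='{.volumeBindingMode}{"\n"}'
```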
  3. Create a pod to consume the PVC.
~ % cat <<EOF | oc create -f -             
apiVersion: v1
kind: Pod
metadata:
  name: task-pv-pod
spec:
  nodeSelector:
    #Change according to failure domain
    topology.kubernetes.io/zone: us-east-1a
  volumes:
    - name: task-pv-storage
      persistentVolumeClaim:
        claimName: non-resilient-rbd-pvc
  containers:
    - name: task-pv-container
      image: nginx
      ports:
        - containerPort: 80
          name: "http-server"
      volumeMounts:
        - mountPath: "/usr/share/nginx/html"
          name: task-pv-storage
      securityContext:
        privileged: true
EOF
 ~ % oc get pods | grep task
task-pv-pod                                                       1/1     Running     0          57s
  4. Check the status of the PVC and the newly created PV.
 ~ % oc get pvc | grep non-resilient-rbd-pvc
non-resilient-rbd-pvc             Bound    pvc-d1a26969-a15c-4e60-8f76-847d1fdd6041   1Gi        RWO            ocs-storagecluster-ceph-non-resilient-rbd   3m49s

 ~ % oc get pv | grep non-resilient-rbd-pvc
pvc-d1a26969-a15c-4e60-8f76-847d1fdd6041   1Gi        RWO            Delete           Bound    openshift-storage/non-resilient-rbd-pvc             ocs-storagecluster-ceph-non-resilient-rbd            2m18s
  5. Use the toolbox pod to identify which pool holds the RBD image for the new PV.
sh-4.4$ rbd ls ocs-storagecluster-cephblockpool-us-east-1a
csi-vol-937ee9e9-2f99-11ed-b2d0-0a580a83001a
  6. Match this imageName to the imageName in the PV created earlier.
 ~ % oc get pv pvc-d1a26969-a15c-4e60-8f76-847d1fdd6041 -o=jsonpath='{.spec.csi.volumeAttributes}' | jq
{
  "clusterID": "openshift-storage",
  "imageFeatures": "layering,deep-flatten,exclusive-lock,object-map,fast-diff",
  "imageFormat": "2",
  "imageName": "csi-vol-937ee9e9-2f99-11ed-b2d0-0a580a83001a",
  "journalPool": "ocs-storagecluster-cephblockpool",
  "pool": "ocs-storagecluster-cephblockpool-us-east-1a",
  "storage.kubernetes.io/csiProvisionerIdentity": "1662652453217-8081-openshift-storage.rbd.csi.ceph.com",
  "topologyConstrainedPools": "[\n  {\n    \"poolName\": \"ocs-storagecluster-cephblockpool-us-east-1a\",\n    \"domainSegments\": [\n      {\n        \"domainLabel\": \"zone\",\n        \"value\": \"us-east-1a\"\n      }\n    ]\n  },\n  {\n    \"poolName\": \"ocs-storagecluster-cephblockpool-us-east-1b\",\n    \"domainSegments\": [\n      {\n        \"domainLabel\": \"zone\",\n        \"value\": \"us-east-1b\"\n      }\n    ]\n  },\n  {\n    \"poolName\": \"ocs-storagecluster-cephblockpool-us-east-1c\",\n    \"domainSegments\": [\n      {\n        \"domainLabel\": \"zone\",\n        \"value\": \"us-east-1c\"\n      }\n    ]\n  }\n]"
}
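The topologyConstrainedPools attribute above is an embedded JSON string, which is hard to read in escaped form. It can be extracted and pretty-printed on its own; a sketch, using the PV name from this example:

```shell
# Extract just the embedded topology JSON and pretty-print it with jq.
oc get pv pvc-d1a26969-a15c-4e60-8f76-847d1fdd6041 \
  -o=jsonpath='{.spec.csi.volumeAttributes.topologyConstrainedPools}' | jq
```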

Verifying the replication with data

  1. Create random text data.
~ $ oc rsh task-pv-pod
# cd /usr/share/nginx/html
# ls
lost+found
# tr -dc "A-Za-z 0-9" < /dev/urandom | fold -w100|head -n 100000000 >file.txt   
# ls -lh
total 9.5G
-rw-r--r--. 1 root 1000700000 9.5G Apr 15 18:37 file.txt
drwxrws---. 2 root 1000700000  16K Apr 15 18:31 lost+found
  2. Check the replication.
    The 9.5 GiB of data that was added is replicated within the same zone, on osd.5 and osd.6.
sh-5.1$ ceph osd df tree
ID   CLASS       WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP  META     AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME                                           
 -1              18.00000         -  18 TiB   19 GiB   19 GiB   0 B  307 MiB   18 TiB  0.11  1.00    -          root default                                        
 -5              18.00000         -  18 TiB   19 GiB   19 GiB   0 B  307 MiB   18 TiB  0.11  1.00    -              region us-east-1                                
 -4               6.00000         -   6 TiB   19 GiB   19 GiB   0 B  189 MiB  6.0 TiB  0.31  2.95    -                  zone us-east-1a                             
 -3               2.00000         -   2 TiB   76 MiB   43 MiB   0 B   33 MiB  2.0 TiB  0.00  0.03    -                      host ocs-deviceset-gp3-csi-1-data-0f2pz7
  1         ssd   2.00000   1.00000   2 TiB   76 MiB   43 MiB   0 B   33 MiB  2.0 TiB  0.00  0.03   36      up                  osd.1                               
-51               2.00000         -   2 TiB  9.5 GiB  9.5 GiB   0 B   82 MiB  2.0 TiB  0.47  4.42    -                      host us-east-1a-data-0kkr4m             
  5  us-east-1a   2.00000   1.00000   2 TiB  9.5 GiB  9.5 GiB   0 B   82 MiB  2.0 TiB  0.47  4.42   42      up                  osd.5                               
-56               2.00000         -   2 TiB  9.5 GiB  9.4 GiB   0 B   74 MiB  2.0 TiB  0.46  4.41    -                      host us-east-1a-data-189c4g             
  6  us-east-1a   2.00000   1.00000   2 TiB  9.5 GiB  9.4 GiB   0 B   74 MiB  2.0 TiB  0.46  4.41   37      up                  osd.6                               
-10               6.00000         -   6 TiB  150 MiB   92 MiB   0 B   58 MiB  6.0 TiB  0.00  0.02    -                  zone us-east-1b                             
 -9               2.00000         -   2 TiB   62 MiB   30 MiB   0 B   33 MiB  2.0 TiB  0.00  0.03    -                      host ocs-deviceset-gp3-csi-2-data-02msmj
  2         ssd   2.00000   1.00000   2 TiB   62 MiB   30 MiB   0 B   33 MiB  2.0 TiB  0.00  0.03   39      up                  osd.2                               
-46               2.00000         -   2 TiB   35 MiB   19 MiB   0 B   16 MiB  2.0 TiB  0.00  0.02    -                      host us-east-1b-data-0zdwlv             
  3  us-east-1b   2.00000   1.00000   2 TiB   35 MiB   19 MiB   0 B   16 MiB  2.0 TiB  0.00  0.02   39      up                  osd.3                               
-61               2.00000         -   2 TiB   53 MiB   44 MiB   0 B  8.8 MiB  2.0 TiB  0.00  0.02    -                      host us-east-1b-data-1nncsl             
  7  us-east-1b   2.00000   1.00000   2 TiB   53 MiB   44 MiB   0 B  8.8 MiB  2.0 TiB  0.00  0.02   38      up                  osd.7                               
-14               6.00000         -   6 TiB  152 MiB   91 MiB   0 B   61 MiB  6.0 TiB  0.00  0.02    -                  zone us-east-1c                             
-13               2.00000         -   2 TiB   60 MiB   27 MiB   0 B   32 MiB  2.0 TiB  0.00  0.03    -                      host ocs-deviceset-gp3-csi-0-data-0llcxl
  0         ssd   2.00000   1.00000   2 TiB   60 MiB   27 MiB   0 B   32 MiB  2.0 TiB  0.00  0.03   31      up                  osd.0                               
-41               2.00000         -   2 TiB   44 MiB   29 MiB   0 B   15 MiB  2.0 TiB  0.00  0.02    -                      host us-east-1c-data-087jsr             
  4  us-east-1c   2.00000   1.00000   2 TiB   44 MiB   29 MiB   0 B   15 MiB  2.0 TiB  0.00  0.02   50      up                  osd.4                               
-66               2.00000         -   2 TiB   48 MiB   35 MiB   0 B   13 MiB  2.0 TiB  0.00  0.02    -                      host us-east-1c-data-182pqz             
  8  us-east-1c   2.00000   1.00000   2 TiB   48 MiB   35 MiB   0 B   13 MiB  2.0 TiB  0.00  0.02   33      up                  osd.8                               
                              TOTAL  18 TiB   19 GiB   19 GiB   0 B  307 MiB   18 TiB  0.11                                                                         
MIN/MAX VAR: 0.02/4.42  STDDEV: 0.19
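The placement can also be confirmed at the object level from the toolbox pod: every RADOS object of the image maps to a placement group, and `ceph osd map` reports its acting OSDs. A sketch, assuming the pool and image names from the example above; the actual object prefix (`rbd_data.<id>`) must be read from the `rbd info` output first:

```shell
# From the toolbox pod: find the image's object prefix, then map one of its
# objects to a placement group and its acting OSDs.
# Pool/image names are from the example above; substitute your own.
rbd info ocs-storagecluster-cephblockpool-us-east-1a/csi-vol-937ee9e9-2f99-11ed-b2d0-0a580a83001a | grep block_name_prefix
# Using the reported prefix (rbd_data.<id>):
ceph osd map ocs-storagecluster-cephblockpool-us-east-1a rbd_data.<id>.0000000000000000
```

Both OSDs in the reported acting set should belong to the same zone (here, osd.5 and osd.6 in us-east-1a).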