[ceph-ansible] RHCS 4.3 installation fails while executing the command "ceph mgr dump"

Solution Verified - Updated 13 Jun 2024

Environment

Red Hat Ceph Storage 4.3.
podman-4.2.

Issue

The Ansible playbook fails after retrying the task "wait for all mgr to be up" during RHCS 4.3 installation.
The Ansible task "wait for all mgr to be up" fails while capturing the output of "ceph mgr dump".
RHCS 4.3 installation fails while executing the command "ceph mgr dump".

Resolution

The Red Hat Engineering team is aware of this issue, which is tracked in Bug 2162781.
- As a workaround, set the proper SELinux context on the mgr directory of every mgr node.
  After the context is set, ceph-mgr starts automatically on the corresponding node.
  Apply this workaround on all mgr nodes.
  For example:
```
# chcon -R system_u:object_r:container_file_t:s0 /var/lib/ceph/mgr/ceph-$(hostname -s)
```
- Then re-run the Ansible playbook to continue the deployment.
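
Because the workaround has to be repeated on every mgr node, a small loop can save some typing. The following is only a minimal sketch and not part of the documented procedure. It assumes root SSH access from the admin node and that each SSH name matches the node's short hostname (the value hostname -s returns there). The node names mon1, mon2 and mon3 are placeholders, and MGR_NODES, DRY_RUN, relabel_cmd and relabel_all are hypothetical names introduced for illustration. By default it only prints the commands it would run.

```shell
# Hypothetical list of mgr nodes; replace with the hosts of your inventory.
MGR_NODES="${MGR_NODES:-mon1 mon2 mon3}"
# SELinux context used by the workaround in this article.
CONTEXT="system_u:object_r:container_file_t:s0"

# Print the relabel command for one node ($1 = short hostname of that node).
relabel_cmd() {
    printf 'chcon -R %s /var/lib/ceph/mgr/ceph-%s\n' "$CONTEXT" "$1"
}

# Print (DRY_RUN=1, the default) or run over SSH (DRY_RUN=0) the command
# for every node listed in MGR_NODES.
relabel_all() {
    for node in $MGR_NODES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            printf '%s: %s\n' "$node" "$(relabel_cmd "$node")"
        else
            ssh "root@$node" "$(relabel_cmd "$node")"
        fi
    done
}
```

Call relabel_all once to review the printed commands, then call it again with DRY_RUN=0 to apply them.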

Root Cause

With the updated version of podman (podman-4.2), the wrong SELinux context, system_u:object_r:var_lib_t:s0, is applied to the mgr keyring file. With this label, the ceph-mgr container is denied access to the keyring and fails to start.

Diagnostic Steps

The Ansible playbook fails after retrying the task "wait for all mgr to be up".

In the ceph-ansible logs, TASK [ceph-mgr : wait for all mgr to be up] fails.

For example:

      2023-09-01 10:57:56,972 p=65792 u=admin n=ansible | TASK [ceph-mgr : wait for all mgr to be up] **************************************************************************************************************************************************
      2023-09-01 10:57:56,972 p=65792 u=admin n=ansible | Friday 01 September 2023  10:57:56 -0400 (0:00:00.021)       0:10:30.894 ******
      2023-09-01 11:00:44,366 p=65792 u=admin n=ansible | fatal: [mon3 -> mon1]: FAILED! => changed=false
        attempts: 30
        cmd:
        - podman
        - exec
        - ceph-mon-mon1
        - ceph
        - --cluster
        - ceph
        - mgr
        - dump
        - -f
        - json
        delta: '0:00:00.416016'
        end: '2023-09-01 11:00:44.351865'
        rc: 0
        start: '2023-09-01 11:00:43.935849'
        stderr: ''
        stderr_lines: <omitted>
        stdout: |2-
      
          {"epoch":1,"active_gid":0,"active_name":"","active_addrs":{"addrvec":[]},"active_addr":":/0","active_change":"0.000000","available":false,"standbys":[],"modules":["iostat","restful"],"available_modules":[],"services":{},"always_on_modules":{"nautilus":["balancer","crash","devicehealth","orchestrator_cli","progress","rbd_support","status","volumes"]}}
        stdout_lines: <omitted>
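
In the output above, "available" is false, "active_name" is empty and "standbys" is an empty list, which means no mgr daemon has registered with the cluster. To repeat the same check by hand, the flag can be pulled out of the JSON. This is a minimal sketch: mgr_available is a hypothetical helper name, and plain text matching is used only to avoid extra dependencies, so a real JSON parser would be more robust.

```shell
# Read the output of "ceph mgr dump -f json" on stdin and print the value of
# the top-level "available" flag ("true" or "false"). Prints nothing if the
# flag is not found. The closing quote in the pattern keeps it from matching
# "available_modules".
mgr_available() {
    grep -oE '"available":[[:space:]]*(true|false)' | sed -n '1s/.*:[[:space:]]*//p'
}

# Example on a mon node, using the same command the failing task runs:
# podman exec ceph-mon-$(hostname -s) ceph --cluster ceph mgr dump -f json | mgr_available
```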

ceph -s shows mgr: no daemons active.

For example:

# podman exec -it ceph-mon-$(hostname -s) ceph -s | grep mgr
    mgr: no daemons active

The mgr containers are not starting:
```
  # podman ps | grep mgr
  # 
```

On all nodes, the mgr services fail to start and remain in the activating (auto-restart) state.

For example:

[root@mon2 ~]# systemctl --type=service | grep mgr
  ceph-mgr@mon2.service                                 loaded activating auto-restart Ceph Manager

Check the logs of the ceph-mgr service.

For example:

[root@mon2 ~]# journalctl -u ceph-mgr@$(hostname -s).service --no-pager | tail -n 15
Sep 01 13:01:19 mon2 systemd[1]: ceph-mgr@mon2.service: Main process exited, code=exited, status=1/FAILURE
Sep 01 13:01:19 mon2 systemd[1]: ceph-mgr@mon2.service: Failed with result 'exit-code'.
Sep 01 13:01:30 mon2 systemd[1]: ceph-mgr@mon2.service: Service RestartSec=10s expired, scheduling restart.
Sep 01 13:01:30 mon2 systemd[1]: ceph-mgr@mon2.service: Scheduled restart job, restart counter is at 1.
Sep 01 13:01:30 mon2 systemd[1]: Stopped Ceph Manager.
Sep 01 13:01:30 mon2 systemd[1]: Starting Ceph Manager...
Sep 01 13:01:30 mon2 podman[139628]: Error: no container with name or ID "ceph-mgr-mon2" found: no such container
Sep 01 13:01:30 mon2 podman[139638]: Error: no container with name or ID "ceph-mgr-mon2" found: no such container
Sep 01 13:01:30 mon2 podman[139648]: 
Sep 01 13:01:30 mon2 podman[139648]: 93b38edb31a97c4aee28f67443d7eec6cbe870af21c5e18a11168f7cc60b2686
Sep 01 13:01:30 mon2 systemd[1]: Started Ceph Manager.
Sep 01 13:01:30 mon2 ceph-mgr-mon2[139658]: find: '/var/lib/ceph/mgr/ceph-mon2/keyring': Permission denied                   
Sep 01 13:01:30 mon2 ceph-mgr-mon2[139658]: chown: cannot access '/var/lib/ceph/mgr/ceph-mon2/keyring': Permission denied
Sep 01 13:01:30 mon2 systemd[1]: ceph-mgr@mon2.service: Main process exited, code=exited, status=1/FAILURE
Sep 01 13:01:30 mon2 systemd[1]: ceph-mgr@mon2.service: Failed with result 'exit-code'.

The errors above show that the ceph-mgr container is denied access to the mgr keyring (Permission denied).
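
When the journal is long, it can help to list only the paths that were refused. The filter below is a minimal sketch that assumes the message format shown in the log above (a single-quoted absolute path followed by ": Permission denied"). denied_paths is a hypothetical helper name.

```shell
# Read journal lines on stdin and print each distinct path that was
# reported as "Permission denied".
denied_paths() {
    sed -n "s|.*'\(/[^']*\)': Permission denied.*|\1|p" | sort -u
}

# Example on a mgr node:
# journalctl -u "ceph-mgr@$(hostname -s).service" --no-pager | denied_paths
#
# If auditd is running, "ausearch -m avc -ts recent" may also show the
# matching SELinux denial.
```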

Check the ownership, mode, and SELinux context of the mgr keyring and its directory.

For example:

[root@mon2 ~]# ls -lZd /var/lib/ceph/mgr/
drwxr-xr-x. 3 167 167 system_u:object_r:container_file_t:s0 23 Sep  1 10:52 /var/lib/ceph/mgr/

[root@mon2 ~]# ls -lZ /var/lib/ceph/mgr/*
total 4
-rw-------. 1 167 167 system_u:object_r:var_lib_t:s0 135 Sep  1 10:57 keyring      <<---
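
The listing above shows the keyring labelled var_lib_t while its parent directory carries container_file_t. To spot every mislabelled entry at once, the contexts can be filtered by type. This is a minimal sketch that assumes GNU find (whose %Z format prints the SELinux context) and paths without spaces. EXPECTED_TYPE and mislabelled are hypothetical names.

```shell
# SELinux type the mgr files are expected to carry.
EXPECTED_TYPE="container_file_t"

# Read "<context> <path>" lines on stdin and print the paths whose SELinux
# type (third colon-separated field of the context) differs from
# EXPECTED_TYPE.
mislabelled() {
    awk -v want="$EXPECTED_TYPE" '{ split($1, c, ":"); if (c[3] != want) print $2 }'
}

# Example on a mgr node:
# find /var/lib/ceph/mgr -printf '%Z %p\n' | mislabelled
```

An empty result means every entry under the directory already has the expected type.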

Apply the proper SELinux context to the mgr keyring, then check whether the service starts automatically.

For example:

[root@mon2 ~]# chcon  system_u:object_r:container_file_t:s0 -R /var/lib/ceph/mgr/ceph-mon2/

[root@mon2 ~]# podman ps | grep mgr
1f131508fa4f  registry.redhat.io/rhceph/rhceph-4-rhel8:4-57              28 seconds ago  Up 28 seconds ago              ceph-mgr-mon2

[root@mon2 ~]# ls -lZ /var/lib/ceph/mgr/ceph-mon2/
total 4
-rw-------. 1 167 167 system_u:object_r:container_file_t:s0 135 Sep  1 10:57 keyring
 
[root@mon2 ~]# systemctl --type=service | grep mgr
  ceph-mgr@mon2.service                                 loaded active running Ceph Manager

SBR

Ceph

Product(s)

Red Hat Ceph Storage

Category

Install

Tags

Ceph
