Master Article for Red Hat Ceph Storage Troubleshooting and Recommended configurations
Environment
- Red Hat Ceph Storage 1.3.z
- Red Hat Ceph Storage 2.y
- Red Hat Ceph Storage 3.y
- Red Hat Ceph Storage 4.y
Issue
- What are some of the recommended practices while creating a Red Hat Ceph Storage Cluster?
- What are some common issues seen with Red Hat Ceph Storage Cluster?
Resolution
NOTE: This article is obsolete as of Red Hat Ceph Storage 5 and will no longer be updated. Please see the official RHCS documentation, specifically the Troubleshooting and Architecture guides, which contain the information from this knowledge base article.
- This article is an accumulation of various resources as well as suggested solutions which can help in the successful implementation and maintenance of a Red Hat Ceph Storage (RHCS) Cluster.
Red Hat Ceph Storage versions
- What are the Red Hat Ceph Storage releases and corresponding Ceph package versions?
- Red Hat Ceph Storage Life Cycle
- Red Hat Ceph Storage is currently available in the following versions:
- Red Hat Ceph Storage 1.3 (End of support life 6/20/2018)
- Red Hat Ceph Storage 2 (End of Support life 8/22/2019)
- Red Hat Ceph Storage 3 (End of support life 2/28/2021)
- Red Hat Ceph Storage 4 (End of support life 1/21/2023)
Official Documentation
- Red Hat Ceph Storage Official Documentation
- Make sure the hardware is supported; see the Red Hat Ceph Storage Hardware Guide.
- If not, check the Special considerations [Support Exceptions] section.
- Plan the cluster using the Red Hat Ceph Storage Strategies Guide
Red Hat Ceph Storage Supportability
NOTE: Red Hat only supports directly attached storage for OSD disks.
Red Hat Ceph Storage: Supported configurations
RHCS is supported on the following Operating Systems:
- Red Hat Ceph Storage 1.3.x (End of Life)
  - Red Hat Enterprise Linux (7.1, 7.2, and 7.3)
  - Ubuntu 14.04 (with Cloud Archive)
- Red Hat Ceph Storage 2.y (End of Life)
  - Red Hat Enterprise Linux (7.2 - 7.7)
  - Ubuntu 16.04 (with Cloud Archive)
  - For specific minor versions, use the Red Hat Ceph Storage 2 Compatibility Guide
- Red Hat Ceph Storage 3.y
  - Red Hat Enterprise Linux (7.4)
  - Ubuntu 16.04 (with Cloud Archive)
  - For specific minor versions, use the Red Hat Ceph Storage 3 Compatibility Guide
- Red Hat Ceph Storage 4.y
  - Red Hat Enterprise Linux (7.7, 8.1)
  - For specific minor versions, use the Red Hat Ceph Storage 4 Compatibility Guide
Red Hat supports bare metal deployments in the following configurations:
- Dedicated OSD machines (Limited Availability for hyper-converged OpenStack with RHOSP Compute nodes; please contact Red Hat support for more details).
- Dedicated Monitor machines.
- Dedicated Ceph Object Gateway machines.
- RHCS/Ceph Monitors co-located with OpenStack Controller nodes [provided Ceph is integrated with OpenStack]
- Clusters with over 250 OSDs will need a support exception if running co-located mon services.
- Dedicated iSCSI Gateway.
Starting with RHCS 3.0, the following services are supported in containerized deployments:
- OSD: A single instance of a containerized scale-out daemon (listed below) can be co-located on each OSD host.
- MON: With OSD in a containerized environment.
- RGW: With OSD in a containerized environment.
- MDS: With OSD in a containerized environment.
Special considerations [Support Exceptions]
IMPORTANT: A Support Exception is required for the following configurations.
- Co-located OSDs and Monitors.
- OSDs and Monitors in a virtualized environment (VM guests).
- Any storage which is not directly attached.
- Co-resident applications, i.e. third-party or other applications running on OSD and MON nodes.
- Stretched clusters.
- OpenStack clusters with co-located mon/controller process, if more than 250 OSDs are to be used.
Suggestions for optimal performance
For high performance and streamlined operations, Red Hat suggests the following:
- All nodes in the cluster (OSD, MON, RGW, CephFS, and RBD client nodes) should run the same RHCS version and package level.
- All nodes in the cluster should use identical hardware. Using the same hardware for Ceph storage pools helps achieve a consistent performance profile.
- It is possible to run a Ceph cluster at a small scale with a minimum of 3 OSDs, i.e. one per OSD node with a failure domain of 'host', but performance improves with a larger number of OSDs, since more OSDs can dramatically improve recovery and backfill operations.
  NOTE: Failure domains address the possibility of concurrent failures. Failure domains consist of leaf nodes (hosts and OSDs) on different shelves, racks, power supplies, controllers, and/or physical locations. It is desirable to ensure that data replicas (or coding chunks) are placed on leaf nodes in different failure domains.
  The default failure domain in RHCS is 'host' and the default replica count is '3', which means each replica goes to a separate OSD host node, so the failure of one leg (a single host machine in this case) won't affect access to the data. RHCS supports failure domains configured across hosts, chassis, racks, rows, data centres and more. Ideally, a properly implemented cluster places each object's replicas on devices in different failure domains. Read more on failure domains in the RHCS Storage Strategies Guide.
- Red Hat recommends using at least three Ceph monitors on separate hosts. The number of monitors in an RHCS cluster should always be an odd number in order to form a quorum; 3 or 5 is the suggested number of MONs, with 5 monitors recommended for larger Ceph deployments.
  NOTE: The MON nodes should not host Ceph OSD daemons. As of RHCS 3.0, a monitor can run inside a container alongside Ceph OSD daemons in containerized environments.
- NTP should be configured on all the nodes in the cluster. This ensures that the nodes keep the same date/time and don't drift apart. A time drift may prevent the monitors from agreeing on the state of the cluster, which means that clients lose access to data until quorum is re-established and the monitors agree on the state of the cluster.
- An Administration node (or Ceph-Ansible node) is required for a standalone RHCS cluster. The Ceph-Ansible node is used for the following:
  a. Kicking off the RHCS cluster installation using 'ceph-ansible'.
  b. Maintaining central management of the Ceph configuration.
  c. Pushing configuration overrides in the 'all.yml' file to the cluster nodes (OSD/MON nodes).
  d. Creating keyrings used for authentication by Ceph.
  e. Hosting Ceph-Metrics or the Ceph-Dashboard GUI.
  f. Adding new OSD nodes, OSD disks, MON nodes, etc. to an existing cluster using 'ceph-ansible'.
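The odd-monitor-count guidance above follows from simple majority arithmetic. The following is an illustrative Python sketch (not part of any Ceph tooling) showing why an even monitor count adds no extra fault tolerance:

```python
def quorum_size(num_mons: int) -> int:
    # Ceph monitors use a Paxos variant: a quorum is a strict majority.
    return num_mons // 2 + 1

def failures_tolerated(num_mons: int) -> int:
    # Monitors that can fail while the remainder still forms a quorum.
    return num_mons - quorum_size(num_mons)

for n in (3, 4, 5, 6):
    print(f"{n} MONs: quorum={quorum_size(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

Note that 4 monitors tolerate no more failures than 3, and 6 no more than 5, which is why 3 or 5 monitors are the suggested counts.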
Host recommendations
1. CPU
- OSD nodes should have a reasonable amount of processing power, since the OSD processes run the storage cluster service, calculate data placements with the CRUSH rules, replicate data, maintain their copy of the cluster maps, etc.
- OSD nodes taking part in erasure-coded pools need more processing power than their replicated-pool counterparts due to the additional overhead of the coding calculations.
- Monitors usually don't require the same CPU power as OSD nodes, but it's better to have a similar hardware configuration across all the nodes in the cluster.
2. Memory
- Ceph monitors serve the cluster maps to all clients and distribute the maps to other monitor nodes and OSD nodes, so they should be capable of serving their data quickly. Hence, the MON nodes should have plenty of RAM (e.g., 1GB of RAM per daemon instance). This is not inclusive of the operating system's usage.
- OSDs require more RAM than MON nodes due to resource-intensive operations such as object recovery, replication, and soft/deep scrubbing. Plan a baseline of 16GB RAM per host, with additional RAM per OSD: at least 2GB per Filestore OSD daemon and 5GB per Bluestore OSD daemon. The more the better.
- Due to the behavior of Linux kernel page caching, more memory doesn't mean wasted memory. Any extra memory is used to cache recently read data and helps improve performance.
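As a quick sanity check of the RAM guidance above (16GB host baseline plus 2GB per Filestore OSD or 5GB per Bluestore OSD), here is a minimal sketch; the function name mirrors this article's guidance and is not an official sizing tool:

```python
def osd_node_ram_gb(num_osds: int, objectstore: str = "bluestore") -> int:
    # 16GB baseline per host, plus per-OSD overhead from the guidance above.
    per_osd_gb = {"filestore": 2, "bluestore": 5}
    return 16 + num_osds * per_osd_gb[objectstore]

print(osd_node_ram_gb(12))               # 12 Bluestore OSDs -> 76
print(osd_node_ram_gb(12, "filestore"))  # 12 Filestore OSDs -> 40
```

For example, a host carrying 12 Bluestore OSDs should be sized at 76GB of RAM or more.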
3. Storage
- Careful planning is needed when deciding the type of storage used in the cluster. Simultaneous OS operations, as well as simultaneous read and write requests from multiple daemons against a single drive, can slow performance considerably.
- Red Hat recommends creating pools and defining CRUSH hierarchies such that the OSD hardware within a pool is identical, i.e. the cluster nodes should have disks that are identical in:
  - Disk controllers
  - Disk size
  - Disk RPMs
  - Seek times
  - I/O
  - Network throughput
  - Journal configuration
- Using the same hardware within a pool provides a consistent performance profile, simplifies provisioning and streamlines troubleshooting.
- The operating system and OSD writes should not interfere with each other; hence the operating system, the OSD data, and the OSD journals should be on separate disks.
  IMPORTANT: It is not suggested to run multiple spinning-HDD-backed OSDs on a single disk. Use a separate disk per OSD on spinning HDDs.
- Red Hat does not recommend Ceph OSDs on RAID arrays. A degraded RAID will adversely affect OSD performance. Ceph already handles replication, and using RAID striping reduces your available storage capacity.
- SAS controllers: If the chassis can support more drives than a single controller can handle, Red Hat recommends installing multiple controllers. However, we recommend against more than 24 lanes of SAS connectivity per chassis due to PCI Express limits and network connectivity issues.
- Write cache: Disable write-caching on your storage drives. Ceph aims for data safety and write guarantees, and hence needs the data written to the disks rather than held in the disk cache.
- Filesystem: Red Hat supports XFS as the filesystem for the OSD disks. Ceph aims for consistency and hence uses the journaling feature of XFS.
  The XFS filesystem doesn't journal simultaneously with the write operation (compared to btrfs), so writing an object once implies two write operations per OSD per write: one write for the journal and one write for the OSD data. Hence Red Hat recommends using SSDs for journals to overcome the write-twice journaling penalty. Selecting an SSD requires evaluating acceptable IOPS and some important additional considerations.
NOTE:
- RHCS 2.x has the 'BlueStore' interface for OSDs (as a Technology Preview), which aims to eliminate the file system layer from the cluster.
- OSD journals: Red Hat recommends using SSDs for OSD journals. SSDs are expensive, so partitioning a single SSD into multiple journal partitions is recommended. The established formula for the OSD-to-journal ratio is:
      Journal count = (SSD sequential write speed) / (spinning disk sequential write speed)
  The above formula usually yields around 4 to 5 spinning drives per SSD journal drive, i.e. a single SSD disk can carry the journals of around 4 to 5 OSDs.
  There are a few important performance considerations for journals and SSDs:
  - Write-intensive semantics: Journaling involves write-intensive semantics, so ensure that the SSD you choose to deploy performs equal to or better than a hard disk drive when writing data. Inexpensive SSDs may introduce write latency even as they accelerate access time, because high-performance hard drives can sometimes write as fast as, or faster than, some of the more economical SSDs on the market.
  - Sequential writes: When you store multiple journals on one SSD, you must also consider its sequential write limitations, since it may be handling writes to multiple OSD journals simultaneously.
  - Partition alignment: A common problem with SSD performance is that administrators partition drives as a recommended practice but overlook proper partition alignment, which can cause SSDs to transfer data much more slowly. Ensure that SSD partitions are properly aligned.
  Red Hat recommends a 10GB journal size per OSD.
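The OSD-to-journal ratio formula above can be expressed directly. The drive speeds below are illustrative assumptions, not vendor figures:

```python
def journals_per_ssd(ssd_seq_write_mbps: float, hdd_seq_write_mbps: float) -> int:
    # Journal count = SSD sequential write speed / HDD sequential write speed,
    # rounded down so the SSD is never oversubscribed.
    return int(ssd_seq_write_mbps // hdd_seq_write_mbps)

# Assumed example speeds: a 500 MB/s SSD fronting 110 MB/s spinning disks.
print(journals_per_ssd(500, 110))  # -> 4
```

With those assumed speeds, one SSD comfortably carries the journals of 4 spinning-disk OSDs, matching the 4-to-5 rule of thumb above.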
Network Recommendations
- Red Hat supports two networks for an RHCS/Ceph cluster: a public (front-side) network and a cluster (back-end) network.
- The public network handles client traffic towards the MONs and OSDs, while the cluster network handles communication between OSDs, such as heartbeats, replication, backfill and recovery traffic.
- The public network and cluster network should be on separate NICs.
- Consider the bandwidth requirements of the cluster network. For example, in the case of a drive failure, replicating 1TB of data across a 1Gbps network takes 3 hours, and 3TB (a typical drive configuration) takes 9 hours. By contrast, with a 10Gbps network, the replication times would be 20 minutes and 1 hour respectively. Administrators will prefer that a cluster recovers as quickly as possible.
- For network optimization, Red Hat recommends jumbo frames for a better CPU/bandwidth ratio, as well as a non-blocking network switch back-plane. Validate that all hardware in the network path supports jumbo frames.
- An RHCS cluster may be deployed across geographic regions; however, this is NOT RECOMMENDED UNLESS a dedicated network connection is used between the data centres. Ceph prefers consistency and acknowledges writes synchronously. Using the internet between geographically separate data centres introduces significant write latency which adversely affects cluster performance, and the MONs may end up kicking the latency-affected OSDs out of the cluster. A maximum of 1ms latency is allowed across stretched clusters, and engaging Red Hat Consulting is recommended for these types of configurations.
- Multiple IP addresses and subnets for the public and cluster networks can be specified in the Ceph configuration file, as follows:
      public_network = {ip-address}/{netmask}
      cluster_network = {ip-address}/{netmask}
  Example:
      public_network = 172.16.0.0/24
      cluster_network = 10.0.0.0/24
- IPv6 addresses are supported, but the following setting must be present in the Ceph configuration file to enable the support:
      ms_bind_ipv6 = true
- Monitors use port 6789 by default. Ensure that this port is open on each monitor host. If a different port needs to be used, open that port and specify the IP address and port in the Ceph configuration. For example:
      [mon.monname]
      host = {hostname}
      mon_addr = {ip-address}:{port}
- Each 'ceph-osd' daemon on an OSD node may use up to three ports, beginning at port 6800:
  - One for talking to clients and monitors.
  - One for sending data to other OSDs (replication, backfill and recovery).
  - One for heartbeat traffic between OSDs.
- At least three ports per OSD should be opened on each OSD node, beginning at port 6800, to ensure that the OSDs can peer. The port for talking to monitors and clients must be open on the public (front-side) network. The ports for sending data to other OSDs and for heartbeats must be open on the cluster (back-side) network.
- If a port range other than 6800:7300 needs to be used for the Ceph daemons, adjust the following settings in the Ceph configuration file:
      ms_bind_port_min = {min-port-num}
      ms_bind_port_max = {max-port-num}
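The recovery-time figures in the bandwidth discussion above can be reproduced with back-of-the-envelope arithmetic. The 80% link-efficiency factor below is an assumption made here to account for protocol overhead; it is not an official Ceph figure:

```python
def replication_hours(data_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    # Time to push data_tb terabytes over a single link at link_gbps,
    # derated by an assumed protocol-overhead efficiency factor.
    bits_to_move = data_tb * 1e12 * 8
    usable_bits_per_sec = link_gbps * 1e9 * efficiency
    return bits_to_move / usable_bits_per_sec / 3600

print(round(replication_hours(1, 1), 1))   # ~2.8h, roughly the 3 hours cited for 1TB at 1Gbps
print(round(replication_hours(3, 1), 1))   # ~8.3h for 3TB at 1Gbps
print(round(replication_hours(1, 10), 2))  # ~0.28h (~17 minutes) for 1TB at 10Gbps
```

The exact numbers depend on the efficiency assumed, but the order-of-magnitude difference between 1Gbps and 10Gbps links is the point: a faster cluster network shortens the window during which the cluster is degraded.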
Minimum recommended hardware configurations
OSD node
- Processor - 1 x 64-bit x86-64 quad-core
- RAM - 16GB baseline per OSD node, with an additional 2GB RAM per Filestore OSD and 5GB RAM per Bluestore OSD
- Disk - 1 storage drive per OSD daemon (apart from the OS disk)
- Journal - 1 x SSD partition per daemon (optional)
- Network - 2 x 1Gb Ethernet NICs (one each for the public and cluster network)
Monitor nodes
- Processor - 1 x 64-bit x86-64 quad-core
- RAM - 1 GB for the monitor process (apart from the system usage)
- Disk Space - 10 GB for the monitor daemon (apart from the OS space), SSD disk for monitor data is recommended.
- Network - 2 x 1Gb Ethernet NICs (one each for the public and cluster network)
Ceph-radosgw node
- Processor - 1 x 64-bit x86-64 quad-core
- RAM - 1 GB per daemon
- Disk Space - 5 GB per daemon
- Network - 1 x 1Gb Ethernet NIC
Ceph-mds node
- Processor - 1 x 64-bit x86-64 quad-core
- RAM - 1GB per daemon; this is highly dependent on the configured MDS cache size.
- Disk Space - 1MB per daemon; varies based upon the configured logging level.
- Network - 2 x 1Gb Ethernet NICs
Highly Referenced KCS Articles
Red Hat Ceph Storage and Upstream Ceph versions
What are the Red Hat Ceph releases and corresponding Ceph package versions?
Supported Configurations
- Red Hat Ceph Storage: Supported configurations
- What is the support scope for Red Hat Enterprise Linux Kernel Ceph RBD client (RBD Driver) without a Red Hat Ceph Storage Subscription?
Upgrade Procedures
Upstream Ceph to Red Hat Ceph Storage migration support matrix
CRUSH
- How to edit a CRUSH map and upload it back to the Ceph cluster?
- The CRUSH map got over-written to the default after a node reboot, why?
- When adding failure domains to a CRUSH map, data movement is seen even when new failure domains have no weight set, why?
- Adding OSDs with an initial CRUSH weight of 0 causes 'ceph df' output to report invalid MAX AVAIL on pools
RADOS
- Ceph peering process stalls when an OSD is down, and the cluster won't recover to a healthy state, why?
- How to migrate a pool from using 'straw' to 'straw2'?
Monitors (MONs)
- Red Hat Ceph Storage monitor failure scenarios
- Why doesn't the RHCS/Ceph monitor tunable "mon_client_hung_interval" work as expected?
- How to download, edit, and upload a MON map in a Ceph cluster?
- When does an election happen for a Ceph monitor?
- How to handle an IP address change of a monitor in a Ceph cluster?
OSD
- Ceph OSD weights and related configurations
- How to check the ceph version on all the OSD nodes in an RHCS cluster?
- How to prevent a Ceph cluster from automatically replicating data to other OSDs, while removing one of the OSDs manually?
- OSD throttling and backfilling
- OSD reboots every few minutes with FAILED assert(clone_size.count(clone))
- How to get the same OSD name after repairing a failed OSD disk?
- What is the concept of CRUSH weight and OSD weight?
- How to map a pool to a dedicated set of OSDs?
- Why does an RHCS/Ceph cluster report the status of "HEALTH_WARN x near full osd(s)"?
- What are various full and near full ratios in Ceph?
- The distribution of data among OSDs is uneven, Why?
Placement Groups (PG)
- Red Hat Access Labs - Placement Group calculator
- What are the possible Placement Group states in an RHCS/Ceph Cluster
- Ceph: How do I increase Placement Group (PG) count in a Ceph Cluster
- Incomplete placement groups
- How to recover incomplete Placement Groups in a Red Hat Ceph Storage cluster?
- A Ceph cluster shows a status of 'HEALTH_WARN' with the message "pool 'pool_name' has too few pgs", why?
- A 'ceph health detail' shows the placement groups in incomplete state, why?
- A Ceph cluster shows a status of 'HEALTH_WARN' warning with the message "too many PGs per OSD", why?
- A degraded ceph cluster (on Firefly) stops recovering and gets stuck degraded PGs after an OSD goes down, why?
- Changing pg_num on a pool that uses a custom CRUSH ruleset, doesn't change the total PGs in the cluster, why?
- Ceph Scrubbing And Its Parameters
Rados Block Device (RBD)
- Virtual machine hangs when transferring large amounts of data to RBD disk
- How to enable RBD caching in a Red Hat Ceph Storage cluster?
- What are the steps to migrate RBD images from one ceph pool to another?
- How to check which clients are using which RBD images?
OpenStack with Rados Block Device (RBD)
- How to configure OpenStack Glance, Nova, and Cinder storage to use Ceph RBD as the backend?
- How to increase debugging level for RBD logs in OpenStack services like Glance, Cinder and Nova?
- How to boot OpenStack VMs/Instances from choice of Ceph pools?
- NOVA resize for Cinder Volume Backed instances, with Ceph as backend
- Ceph deployed by OpenStack Director has "min_size = 1" configured in pools
Rados Gateway (RGW)
- Rados Gateway degraded write performance after Red Hat Ceph Storage 1.3 upgrade
- Python boto code for using S3 APIs with Ceph RadosGW
- Why does RHCS/Ceph Rados Gateway return an error message while trying to delete objects from a bucket using the '--lazy-remove' option?
- How to set ACLs on buckets in Ceph radosgw?
- Is there an ideal value for the Rados Gateway bucket Sharding in RHCS/Ceph?
CephFS
NOTE:
- CephFS is shipped in RHCS 2.0 as a Technology Preview and is not supported in production environments.
- CephFS is fully supported from RHCS 3.0 onwards.
Erasure Coded pools
- An erasure coded pool with "4+2" chunks created on a ruleset with four OSD nodes, ends up moving the cluster status to HEALTH_WARN, why?
- What are the best practices for Erasure-coded pools in Red Hat Ceph Storage?
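The "4+2 on four nodes" question above comes down to chunk placement: an erasure-coded pool needs at least k+m distinct failure domains (hosts, with the default CRUSH failure domain) to place every chunk of each object. A minimal illustrative check, not a Ceph API:

```python
def ec_pool_is_placeable(k: int, m: int, failure_domains: int) -> bool:
    # Each object is split into k data chunks plus m coding chunks,
    # and every chunk must land in a distinct failure domain.
    return failure_domains >= k + m

print(ec_pool_is_placeable(4, 2, 4))  # False: 6 chunks cannot spread over 4 hosts
print(ec_pool_is_placeable(4, 2, 6))  # True
```

This is why a "4+2" profile on a four-node cluster leaves placement groups unable to activate fully and drives the cluster status to HEALTH_WARN.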
Performance issues
Generic
- What information should be collected while debugging/opening a Red Hat Ceph Storage case?
- How to list all the pools in an RHCS cluster?
- How to troubleshoot "slow requests" in a Ceph cluster?
- How to manipulate objects in an RHCS cluster using 'ceph-objectstore-tool'?
- How to map an object to its respective Placement Groups and OSD disks, in RHCS?
- How to pause/stop re-balance in Ceph?
- How to change a Ceph configuration dynamically?
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.