Master Article for Red Hat Ceph Storage Troubleshooting and Recommended configurations
Environment
- Red Hat Ceph Storage 1.3.z
- Red Hat Ceph Storage 2.y
- Red Hat Ceph Storage 3.y
- Red Hat Ceph Storage 4.y
Issue
- What are some of the recommended practices while creating a Red Hat Ceph Storage Cluster?
- What are some common issues seen with Red Hat Ceph Storage Cluster?
Resolution
NOTE: This article is obsolete as of Red Hat Ceph Storage 5 and will no longer be updated. Please see the official RHCS documentation, specifically the Troubleshooting and Architecture guides, which contain the information from this knowledge base article.
- This article is an accumulation of various resources as well as suggested solutions which can help in the successful implementation and maintenance of a Red Hat Ceph Storage (RHCS) Cluster.
Red Hat Ceph Storage versions
- What are the Red Hat Ceph Storage releases and corresponding Ceph package versions?
- Red Hat Ceph Storage Life Cycle
- Red Hat Ceph Storage is currently available in the following versions:
- Red Hat Ceph Storage 1.3 (End of support life 6/20/2018)
- Red Hat Ceph Storage 2 (End of Support life 8/22/2019)
- Red Hat Ceph Storage 3 (End of support life 2/28/2021)
- Red Hat Ceph Storage 4 (End of support life 1/21/2023)
Official Documentation
- Red Hat Ceph Storage Official Documentation
- Make sure the hardware is supported; see the Red Hat Ceph Storage Hardware Guide.
- If not, check the Special considerations [Support Exceptions] section.
- Plan the cluster using the Red Hat Ceph Storage Strategies Guide
Red Hat Ceph Storage Supportability
NOTE: Red Hat only supports directly attached storage for OSD disks.
Red Hat Ceph Storage: Supported configurations
RHCS is supported on the following Operating Systems:
- Red Hat Ceph Storage 1.3.x (End of Life)
  - Red Hat Enterprise Linux (7.1, 7.2, and 7.3)
  - Ubuntu 14.04 (with Cloud Archive)
- Red Hat Ceph Storage 2.y (End of Life)
  - Red Hat Enterprise Linux (7.2 - 7.7)
  - Ubuntu 16.04 (with Cloud Archive)
  - For specific minor versions, use the Red Hat Ceph Storage 2 Compatibility Guide
- Red Hat Ceph Storage 3.y
  - Red Hat Enterprise Linux (7.4)
  - Ubuntu 16.04 (with Cloud Archive)
  - For specific minor versions, use the Red Hat Ceph Storage 3 Compatibility Guide
- Red Hat Ceph Storage 4.y
  - Red Hat Enterprise Linux (7.7, 8.1)
  - For specific minor versions, use the Red Hat Ceph Storage 4 Compatibility Guide
Red Hat supports bare metal deployments in the following configurations:
- Dedicated OSD machines (Limited Availability for hyper-converged OpenStack with RHOSP Compute nodes; please contact Red Hat support for more details).
- Dedicated Monitor machines.
- Dedicated Ceph Object Gateway machines.
- RHCS/Ceph Monitors co-located with OpenStack Controller nodes [provided Ceph is integrated with OpenStack]
- Clusters with over 250 OSDs will need a support exception if running co-located mon services.
- Dedicated iSCSI Gateway.
Starting with RHCS 3.0, the following services are supported in containerized deployments:
- OSD: A single instance of a containerized scale-out daemon (listed below) can be co-located on each OSD host.
- MON: With OSD in a containerized environment.
- RGW: With OSD in a containerized environment.
- MDS: With OSD in a containerized environment.
Special considerations [Support Exceptions]
IMPORTANT: A Support Exception is required for the following configurations.
- Co-located OSDs and Monitors.
- OSDs and Monitors in a virtualized environment (VM guests).
- Any storage which is not directly attached.
- Co-resident applications, i.e. third-party or other applications running on OSD and MON nodes.
- Stretched clusters.
- OpenStack clusters with co-located mon/controller process, if more than 250 OSDs are to be used.
Suggestions for optimal performance
For high performance and streamlined operations, Red Hat suggests the following:
- All nodes in the cluster (OSD, MON, RGW, CephFS, and RBD client nodes) should run the same RHCS version and package level.
- All nodes in the cluster should use identical hardware. Using the same hardware for Ceph storage pools helps achieve a consistent performance profile.
- It is possible to run a Ceph cluster at a small scale with a minimum of 3 OSDs, i.e. one per OSD node with a failure domain of 'host', but performance improves with a larger number of OSDs, since more OSDs can dramatically improve recovery and backfill operations.
  NOTE: Failure domains address the possibility of concurrent failures. Failure domains consist of leaf nodes (hosts and OSDs) on different shelves, racks, power supplies, controllers, and/or physical locations. It is desirable to ensure that data replicas (or coding chunks) are placed on leaf nodes in different failure domains.
  The default failure domain in RHCS is 'host' and the default replica count is '3', which means each replica goes to a separate OSD host node, so the failure of one leg (a single host machine in this case) won't affect access to the data. RHCS supports failure domains configured across hosts, chassis, racks, rows, data centres and more. Ideally, a properly implemented cluster places each object's replicas on devices in different failure domains. Read more on failure domains in the RHCS Storage Strategies Guide.
- Red Hat recommends using at least three Ceph monitors on separate hosts. The number of monitors in an RHCS cluster should always be an odd number in order to form a quorum; 3 or 5 is the suggested number of MONs, with 5 monitors recommended for larger Ceph deployments.
  NOTE: The MON nodes should not host Ceph OSD daemons. As of RHCS 3.0, a monitor can run inside a container alongside Ceph OSD daemons in containerized environments.
- NTP should be configured on all the nodes in the cluster. This ensures that the nodes keep the same date/time and don't drift apart. A time drift may prevent the monitors from agreeing on the state of the cluster, which means that clients lose access to data until quorum is re-established and the monitors agree on the state of the cluster.
- An Administration node (or Ceph-Ansible node) is required for a standalone RHCS cluster. The Ceph-Ansible node is used for the following:
  a. Kicking off the RHCS cluster installation using 'ceph-ansible'.
  b. Maintaining central management of the Ceph configuration.
  c. Pushing configuration overrides in the 'all.yml' file to the cluster nodes (OSD/MON nodes).
  d. Creating keyrings used for authentication by Ceph.
  e. Hosting Ceph-Metrics or the Ceph-Dashboard GUI.
  f. Adding new OSD nodes, OSD disks, MON nodes, etc. to an existing cluster using 'ceph-ansible'.
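The odd-monitor-count guidance above follows from simple majority arithmetic. The following is an illustrative Python sketch (not part of any Ceph tooling) showing why an even monitor count adds no extra fault tolerance:

```python
def quorum_size(num_mons: int) -> int:
    # Ceph monitors use a Paxos variant: a quorum is a strict majority.
    return num_mons // 2 + 1

def failures_tolerated(num_mons: int) -> int:
    # Monitors that can fail while the remainder still forms a quorum.
    return num_mons - quorum_size(num_mons)

for n in (3, 4, 5, 6):
    print(f"{n} MONs: quorum={quorum_size(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

Note that 4 monitors tolerate no more failures than 3, and 6 no more than 5, which is why 3 or 5 monitors are the suggested counts.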
Host recommendations
1. CPU
- OSD nodes should have a reasonable amount of processing power, since the OSD processes run the storage cluster service, calculate data placements with the CRUSH rules, replicate data, maintain their copy of the cluster maps, etc.
- OSD nodes taking part in erasure-coded pools need more processing power than their replicated-pool counterparts due to the additional overhead of the coding calculations.
- Monitors usually don't require the same CPU power as OSD nodes, but it's better to have a similar hardware configuration across all the nodes in the cluster.
2. Memory
- Ceph monitors serve the cluster maps to all clients and distribute the maps to other monitor nodes and OSD nodes, so they should be capable of serving their data quickly. Hence, the MON nodes should have plenty of RAM (e.g., 1GB of RAM per daemon instance). This is not inclusive of the operating system's usage.
- OSDs require more RAM than MON nodes due to resource-intensive operations such as object recovery, replication, and soft/deep scrubbing. Plan a baseline of 16GB RAM per host, with additional RAM per OSD: at least 2GB per Filestore OSD daemon and 5GB per Bluestore OSD daemon. The more the better.
- Due to the behavior of Linux kernel page caching, more memory doesn't mean wasted memory. Any extra memory is used to cache recently read data and helps improve performance.
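As a quick sanity check of the RAM guidance above (16GB host baseline plus 2GB per Filestore OSD or 5GB per Bluestore OSD), here is a minimal sketch; the function name mirrors this article's guidance and is not an official sizing tool:

```python
def osd_node_ram_gb(num_osds: int, objectstore: str = "bluestore") -> int:
    # 16GB baseline per host, plus per-OSD overhead from the guidance above.
    per_osd_gb = {"filestore": 2, "bluestore": 5}
    return 16 + num_osds * per_osd_gb[objectstore]

print(osd_node_ram_gb(12))               # 12 Bluestore OSDs -> 76
print(osd_node_ram_gb(12, "filestore"))  # 12 Filestore OSDs -> 40
```

For example, a host carrying 12 Bluestore OSDs should be sized at 76GB of RAM or more.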
3. Storage
- Careful planning is needed when deciding the type of storage used in the cluster. Simultaneous OS operations, as well as simultaneous read and write requests from multiple daemons against a single drive, can slow performance considerably.
- Red Hat recommends creating pools and defining CRUSH hierarchies such that the OSD hardware within a pool is identical, i.e. the cluster nodes should have disks that are identical in:
  - Disk controllers
  - Disk size
  - Disk RPMs
  - Seek times
  - I/O
  - Network throughput
  - Journal configuration
- Using the same hardware within a pool provides a consistent performance profile, simplifies provisioning and streamlines troubleshooting.
- The operating system and OSD writes should not interfere with each other; hence the operating system, the OSD data, and the OSD journals should be on separate disks.
  IMPORTANT: It is not suggested to run multiple spinning-HDD-backed OSDs on a single disk. Use a separate disk per OSD on spinning HDDs.
- Red Hat does not recommend Ceph OSDs on RAID arrays. A degraded RAID will adversely affect OSD performance. Ceph already handles replication, and using RAID striping reduces your available storage capacity.
- SAS controllers: If the chassis can support more drives than a single controller can handle, Red Hat recommends installing multiple controllers. However, we recommend against more than 24 lanes of SAS connectivity per chassis due to PCI Express limits and network connectivity issues.
- Write cache: Disable write-caching on your storage drives. Ceph aims for data safety and write guarantees, and hence needs the data written to the disks rather than held in the disk cache.
- Filesystem: Red Hat supports XFS as the filesystem for the OSD disks. Ceph aims for consistency and hence uses the journaling feature of XFS.
  The XFS filesystem doesn't journal simultaneously with the write operation (compared to btrfs), so writing an object once implies two write operations per OSD per write: one write for the journal and one write for the OSD data. Hence Red Hat recommends using SSDs for journals to overcome the write-twice journaling penalty. Selecting an SSD requires evaluating acceptable IOPS and some important additional considerations.
NOTE:
- RHCS 2.x has the 'BlueStore' interface for OSDs (as a Technology Preview), which aims to eliminate the file system layer from the cluster.
- OSD journals: Red Hat recommends using SSDs for OSD journals. SSDs are expensive, so partitioning a single SSD into multiple journal partitions is recommended. The established formula for the OSD-to-journal ratio is:
      Journal count = (SSD sequential write speed) / (spinning disk sequential write speed)
  The above formula usually yields around 4 to 5 spinning drives per SSD journal drive, i.e. a single SSD disk can carry the journals of around 4 to 5 OSDs.
  There are a few important performance considerations for journals and SSDs:
  - Write-intensive semantics: Journaling involves write-intensive semantics, so ensure that the SSD you choose to deploy performs equal to or better than a hard disk drive when writing data. Inexpensive SSDs may introduce write latency even as they accelerate access time, because high-performance hard drives can sometimes write as fast as, or faster than, some of the more economical SSDs on the market.
  - Sequential writes: When you store multiple journals on one SSD, you must also consider its sequential write limitations, since it may be handling writes to multiple OSD journals simultaneously.
  - Partition alignment: A common problem with SSD performance is that administrators partition drives as a recommended practice but overlook proper partition alignment, which can cause SSDs to transfer data much more slowly. Ensure that SSD partitions are properly aligned.
  Red Hat recommends a 10GB journal size per OSD.
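The OSD-to-journal ratio formula above can be expressed directly. The drive speeds below are illustrative assumptions, not vendor figures:

```python
def journals_per_ssd(ssd_seq_write_mbps: float, hdd_seq_write_mbps: float) -> int:
    # Journal count = SSD sequential write speed / HDD sequential write speed,
    # rounded down so the SSD is never oversubscribed.
    return int(ssd_seq_write_mbps // hdd_seq_write_mbps)

# Assumed example speeds: a 500 MB/s SSD fronting 110 MB/s spinning disks.
print(journals_per_ssd(500, 110))  # -> 4
```

With those assumed speeds, one SSD comfortably carries the journals of 4 spinning-disk OSDs, matching the 4-to-5 rule of thumb above.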
Network Recommendations
- Red Hat supports two networks for an RHCS/Ceph cluster: a public (front-side) network and a cluster (back-end) network.
- The public network handles client traffic towards the MONs and OSDs, while the cluster network handles communication between OSDs, such as heartbeats, replication, backfill and recovery traffic.
- The public network and cluster network should be on separate NICs.
- Consider the bandwidth requirements of the cluster network. For example, in the case of a drive failure, replicating 1TB of data across a 1Gbps network takes 3 hours, and 3TB (a typical drive configuration) takes 9 hours. By contrast, with a 10Gbps network, the replication times would be 20 minutes and 1 hour respectively. Administrators will prefer that a cluster recovers as quickly as possible.
- For network optimization, Red Hat recommends jumbo frames for a better CPU/bandwidth ratio, as well as a non-blocking network switch back-plane. Validate that all hardware in the network path supports jumbo frames.
- An RHCS cluster may be deployed across geographic regions; however, this is NOT RECOMMENDED UNLESS a dedicated network connection is used between the data centres. Ceph prefers consistency and acknowledges writes synchronously. Using the internet between geographically separate data centres introduces significant write latency which adversely affects cluster performance, and the MONs may end up kicking the latency-affected OSDs out of the cluster. A maximum of 1ms latency is allowed across stretched clusters, and engaging Red Hat Consulting is recommended for these types of configurations.
- Multiple IP addresses and subnets for the public and cluster networks can be specified in the Ceph configuration file, as follows:
      public_network = {ip-address}/{netmask}
      cluster_network = {ip-address}/{netmask}
  Example:
      public_network = 172.16.0.0/24
      cluster_network = 10.0.0.0/24
- IPv6 addresses are supported, but the following setting must be present in the Ceph configuration file to enable the support:
      ms_bind_ipv6 = true
- Monitors use port 6789 by default. Ensure that this port is open on each monitor host. If a different port needs to be used, open that port and specify the IP address and port in the Ceph configuration. For example:
      [mon.monname]
      host = {hostname}
      mon_addr = {ip-address}:{port}
- Each 'ceph-osd' daemon on an OSD node may use up to three ports, beginning at port 6800:
  - One for talking to clients and monitors.
  - One for sending data to other OSDs (replication, backfill and recovery).
  - One for heartbeat traffic between OSDs.
- At least three ports per OSD should be opened on each OSD node, beginning at port 6800, to ensure that the OSDs can peer. The port for talking to monitors and clients must be open on the public (front-side) network. The ports for sending data to other OSDs and for heartbeats must be open on the cluster (back-side) network.
- If a port range other than 6800:7300 needs to be used for the Ceph daemons, adjust the following settings in the Ceph configuration file:
      ms_bind_port_min = {min-port-num}
      ms_bind_port_max = {max-port-num}
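The recovery-time figures in the bandwidth discussion above can be reproduced with back-of-the-envelope arithmetic. The 80% link-efficiency factor below is an assumption made here to account for protocol overhead; it is not an official Ceph figure:

```python
def replication_hours(data_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    # Time to push data_tb terabytes over a single link at link_gbps,
    # derated by an assumed protocol-overhead efficiency factor.
    bits_to_move = data_tb * 1e12 * 8
    usable_bits_per_sec = link_gbps * 1e9 * efficiency
    return bits_to_move / usable_bits_per_sec / 3600

print(round(replication_hours(1, 1), 1))   # ~2.8h, roughly the 3 hours cited for 1TB at 1Gbps
print(round(replication_hours(3, 1), 1))   # ~8.3h for 3TB at 1Gbps
print(round(replication_hours(1, 10), 2))  # ~0.28h (~17 minutes) for 1TB at 10Gbps
```

The exact numbers depend on the efficiency assumed, but the order-of-magnitude difference between 1Gbps and 10Gbps links is the point: a faster cluster network shortens the window during which the cluster is degraded.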
Minimum recommended hardware configurations
OSD node
- Processor - 1 x 64-bit x86-64 quad-core
- RAM - 16GB baseline per OSD node, with an additional 2GB RAM per Filestore OSD and 5GB RAM per Bluestore OSD
- Disk - 1 storage drive per OSD daemon (apart from the OS disk)
- Journal - 1 x SSD partition per daemon (optional)
- Network - 2 x 1Gb Ethernet NICs (one each for the public and cluster network)
Monitor nodes
- Processor - 1 x 64-bit x86-64 quad-core
- RAM - 1 GB for the monitor process (apart from the system usage)
- Disk Space - 10 GB for the monitor daemon (apart from the OS space), SSD disk for monitor data is recommended.
- Network - 2 x 1Gb Ethernet NICs (one each for the public and cluster network)
Ceph-radosgw node
- Processor - 1 x 64-bit x86-64 quad-core
- RAM - 1 GB per daemon
- Disk Space - 5 GB per daemon
- Network - 1 x 1Gb Ethernet NIC
Ceph-mds node
- Processor - 1 x 64-bit x86-64 quad-core
- RAM - 1GB per daemon; this is highly dependent on the configured MDS cache size.
- Disk Space - 1MB per daemon; varies based upon the configured logging level.
- Network - 2 x 1Gb Ethernet NICs
Highly Referenced KCS Articles
Red Hat Ceph Storage and Upstream Ceph versions
What are the Red Hat Ceph releases and corresponding Ceph package versions?
Supported Configurations
- Red Hat Ceph Storage: Supported configurations
- What is the support scope for Red Hat Enterprise Linux Kernel Ceph RBD client (RBD Driver) without a Red Hat Ceph Storage Subscription?
Upgrade Procedures
Upstream Ceph to Red Hat Ceph Storage migration support matrix
CRUSH
- How to edit a CRUSH map and upload it back to the Ceph cluster?
- The CRUSH map got over-written to the default after a node reboot, why?
- When adding failure domains to a CRUSH map, data movement is seen even when new failure domains have no weight set, why?
- Adding OSDs with an initial CRUSH weight of 0 causes 'ceph df' output to report invalid MAX AVAIL on pools
RADOS
- Ceph peering process stalls when an OSD is down, and the cluster won't recover to a healthy state, why?
- How to migrate a pool from using 'straw' to 'straw2'?
Monitors (MONs)
- Red Hat Ceph Storage monitor failure scenarios
- Why doesn't the RHCS/Ceph monitor tunable "mon_client_hung_interval" work as expected?
- How to download, edit, and upload a MON map in a Ceph cluster?
- When does an election happen for a Ceph monitor?
- How to handle an IP address change of a monitor in a Ceph cluster?
OSD
- Ceph OSD weights and related configurations
- How to check the ceph version on all the OSD nodes in an RHCS cluster?
- How to prevent a Ceph cluster from automatically replicating data to other OSDs, while removing one of the OSDs manually?
- OSD throttling and backfilling
- OSD reboots every few minutes with FAILED assert(clone_size.count(clone))
- How to get the same OSD name after repairing a failed OSD disk?
- What is the concept of CRUSH weight and OSD weight?
- How to map a pool to a dedicated set of OSDs?
- Why does an RHCS/Ceph cluster report the status of "HEALTH_WARN x near full osd(s)"?
- What are various full and near full ratios in Ceph?
- The distribution of data among OSDs is uneven, Why?
Placement Groups (PG)
- Red Hat Access Labs - Placement Group calculator
- What are the possible Placement Group states in an RHCS/Ceph Cluster
- Ceph: How do I increase Placement Group (PG) count in a Ceph Cluster
- Incomplete placement groups
- How to recover incomplete Placement Groups in a Red Hat Ceph Storage cluster?
- A Ceph cluster shows a status of 'HEALTH_WARN' with the message "pool 'pool_name' has too few pgs", why?
- A 'ceph health detail' shows the placement groups in incomplete state, why?
- A Ceph cluster shows a status of 'HEALTH_WARN' warning with the message "too many PGs per OSD", why?
- A degraded ceph cluster (on Firefly) stops recovering and gets stuck degraded PGs after an OSD goes down, why?
- Changing pg_num on a pool that uses a custom CRUSH ruleset, doesn't change the total PGs in the cluster, why?
- Ceph Scrubbing And Its Parameters
Rados Block Device (RBD)
- Virtual machine hangs when transferring large amounts of data to RBD disk
- How to enable RBD caching in a Red Hat Ceph Storage cluster?
- What are the steps to migrate RBD images from one ceph pool to another?
- How to check which clients are using which RBD images?
OpenStack with Rados Block Device (RBD)
- How to configure OpenStack Glance, Nova, and Cinder storage to use Ceph RBD as the backend?
- How to increase debugging level for RBD logs in OpenStack services like Glance, Cinder and Nova?
- How to boot OpenStack VMs/Instances from choice of Ceph pools?
- NOVA resize for Cinder Volume Backed instances, with Ceph as backend
- Ceph deployed by OpenStack Director has "min_size = 1" configured in pools
Rados Gateway (RGW)
- Rados Gateway degraded write performance after Red Hat Ceph Storage 1.3 upgrade
- Python boto code for using S3 APIs with Ceph RadosGW
- Why does RHCS/Ceph Rados Gateway return an error message while trying to delete objects from a bucket using the '--lazy-remove' option?
- How to set ACLs on buckets in Ceph radosgw?
- Is there an ideal value for the Rados Gateway bucket Sharding in RHCS/Ceph?
CephFS
NOTE:
- CephFS is shipped in RHCS 2.0 as a Technology Preview and is not supported in production environments.
- CephFS is fully supported from RHCS 3.0 onwards.
Erasure Coded pools
- An erasure coded pool with "4+2" chunks created on a ruleset with four OSD nodes, ends up moving the cluster status to HEALTH_WARN, why?
- What are the best practices for Erasure-coded pools in Red Hat Ceph Storage?
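The "4+2 on four nodes" question above comes down to chunk placement: an erasure-coded pool needs at least k+m distinct failure domains (hosts, with the default CRUSH failure domain) to place every chunk of each object. A minimal illustrative check, not a Ceph API:

```python
def ec_pool_is_placeable(k: int, m: int, failure_domains: int) -> bool:
    # Each object is split into k data chunks plus m coding chunks,
    # and every chunk must land in a distinct failure domain.
    return failure_domains >= k + m

print(ec_pool_is_placeable(4, 2, 4))  # False: 6 chunks cannot spread over 4 hosts
print(ec_pool_is_placeable(4, 2, 6))  # True
```

This is why a "4+2" profile on a four-node cluster leaves placement groups unable to activate fully and drives the cluster status to HEALTH_WARN.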
Performance issues
Generic
- What information should be collected while debugging/opening a Red Hat Ceph Storage case?
- How to list all the pools in an RHCS cluster?
- How to troubleshoot "slow requests" in a Ceph cluster?
- How to manipulate objects in an RHCS cluster using 'ceph-objectstore-tool'?
- How to map an object to its respective Placement Groups and OSD disks, in RHCS?
- How to pause/stop re-balance in Ceph?
- How to change a Ceph configuration dynamically?
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.