Verifying performance capabilities of a Ceph storage cluster

Intro

This process is designed to help you understand the performance capabilities of the hardware and environment underlying your Ceph Storage cluster. At the completion of this collection, you should be able to set a reasonable expectation of the performance achievable within Ceph.

NOTE: This is not a step-by-step HowTo for every command, but is instead meant to provide guidance on what is needed to understand the environment's performance.

Need Help?

If you'd like assistance with this process, please reach out to your Account or Sales team to secure a Performance Review engagement with our Storage Consulting practice. Note that Red Hat Support is staffed to handle break/fix cases for our customers, and does not have the staff availability to perform a deep-dive performance review of an environment. To ensure the best possible experience with Red Hat Ceph Storage, we urge you to work with our Storage Consulting practice.

Data Collection

To understand the capabilities of your environment, each segment must be analyzed individually.

  1. Physical disk performance -- no, dd is not a valid tool for testing disks.

    • In this step, all disks are individually tested with fio to identify the physical disk's performance capabilities

    • These tests should be performed on all disks associated with OSD operation, including FileStore data, journal, and BlueStore WAL/DB disks

    • For the disk IO tests, it's important to use block sizes similar to the workload you anticipate in the environment.

      • Spinning disks are especially sensitive to the block size of the tests: larger block sizes ( 4m ) result in much higher throughput, and smaller block sizes ( 4k ) result in much higher IOPs rates.
      • This is completely expected, and at a minimum both extremes should be tested.
      • Tests at your expected op size should be added as well, in power-of-2 increments ( 4k, 8k, 16k, 32k ... 128k ... 512k, 1m, 2m, 4m )
    • Note that different operation types should be tested as well ( Sequential Read, Random Read, Write, Read/Write mix, etc )

    • Some recommended fio options: --ioengine=libaio --direct=1 --gtod_reduce=1 --max-jobs=16 --iodepth=32

      • --ioengine=libaio == Linux native asynchronous I/O
      • --gtod_reduce=1 == Reduces gettimeofday() calls to about 0.4% of normal, removing timing overhead not associated with actual disk performance
      • --max-jobs=16 == Set the maximum allowed number of jobs (threads/processes) to support
      • --iodepth=32 == Number of I/O units to keep in flight against the file
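The per-disk tests above can be sketched as a loop over devices, block sizes, and op types. This is a minimal illustration, not a prescribed script: the device list and runtime are assumptions to adjust for your environment, and the commands are only printed (remove the `echo` to execute). Write tests destroy data on the target device, so only run them against disks that hold nothing you need.

```shell
#!/bin/sh
# Sketch: generate one fio invocation per disk / block size / op type.
# DEVICES and RUNTIME are hypothetical -- adjust for your environment.
DEVICES="/dev/sdb /dev/sdc"   # ASSUMPTION: your OSD/journal devices
RUNTIME=600                   # seconds, per the duration guidance below

gen_fio_cmds() {
  for dev in $DEVICES; do
    for bs in 4k 64k 1m 4m; do
      for rw in read randread write randwrite; do
        # "echo" makes this a dry run; drop it to actually test the disk.
        # WARNING: write tests against a raw device are destructive.
        echo fio --name="$(basename "$dev")-${bs}-${rw}" \
          --filename="$dev" --rw="$rw" --bs="$bs" \
          --ioengine=libaio --direct=1 --gtod_reduce=1 \
          --max-jobs=16 --iodepth=32 \
          --runtime="$RUNTIME" --time_based --output-format=json
      done
    done
  done
}
gen_fio_cmds
```

With 2 devices, 4 block sizes, and 4 op types, this emits 32 test commands; each one's JSON output feeds the spreadsheet described under Analysis and Conclusions.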
  2. Aggregate physical disk performance.

    • In this step, all disks on a given node are tested with 'fio' in parallel.
    • This parallel test will help identify performance bottlenecks of the RAID controller or PCIe bus
    • Like in step 1, various op sizes and op types should be tested.
    • Be sure to perform identical tests to those from step 1 to ensure you're able to make valid assessments of aggregate bottlenecks
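The aggregate test differs from step 1 only in that every disk on the node is exercised at once. A minimal sketch, assuming a hypothetical device list and again printing rather than executing the commands: each per-disk job is put in the background with `&`, and `wait` blocks until all of them finish.

```shell
#!/bin/sh
# Sketch: run the same fio test against every disk on the node in parallel.
# The device list is an assumption; remove "echo" to run for real.
run_parallel_fio() {
  bs="$1"; rw="$2"
  for dev in /dev/sdb /dev/sdc /dev/sdd; do
    # "&" launches each disk's test concurrently, so the RAID controller
    # and PCIe bus see the combined load of all disks at once.
    echo fio --name="agg-$(basename "$dev")" --filename="$dev" \
      --rw="$rw" --bs="$bs" --ioengine=libaio --direct=1 \
      --gtod_reduce=1 --iodepth=32 --runtime=600 --time_based &
  done
  wait   # block until every per-disk test has completed
}
run_parallel_fio 4m read
```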
  3. Client mode Network Performance -- ping is not a network performance testing tool

    • This step consists of setting up multiple remote iperf3 servers, then running a local iperf3 client ( per server ) in parallel
    • The purpose is to identify that the full aggregate throughput configured for the node is achievable for received traffic.
    • While link aggregation and MTU seem easy to configure, it's very common to miss one option on one port and cause a world of performance issues.
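The client-mode test can be sketched as one iperf3 client per remote server, all launched in parallel. The peer addresses, port, and stream count here are hypothetical placeholders; the commands are printed rather than executed, and `-t 600` matches the duration guidance below.

```shell
#!/bin/sh
# Sketch: one iperf3 client per remote iperf3 server, run concurrently.
# SERVERS is a hypothetical list of peer node IPs -- substitute your own.
SERVERS="192.168.1.11 192.168.1.12 192.168.1.13"

client_cmds() {
  for ip in $SERVERS; do
    # Remove "echo" to run; "&" starts all clients at the same time so the
    # node's full aggregate receive path is exercised at once.
    echo iperf3 -c "$ip" -p 5201 -t 600 &
  done
  wait
}
client_cmds
```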
  4. Server mode Network Performance

    • This step requires running multiple local iperf3 servers (each on a different port), then running a remote iperf3 client (one per remote node) in parallel
    • The purpose is to identify that the full aggregate throughput configured for the node is achievable for transmitted traffic.
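The server side is the mirror image: several local iperf3 servers, each bound to its own port, one per expected remote client. A minimal sketch (ports assumed to start at the iperf3 default of 5201; commands printed, not executed):

```shell
#!/bin/sh
# Sketch: start N local iperf3 servers, each on a successive port.
# Remote clients then each connect to a different port in parallel.
server_cmds() {
  n="$1"
  port=5201       # iperf3's default port; increment for each extra server
  i=0
  while [ "$i" -lt "$n" ]; do
    # -D daemonizes the server; remove "echo" to actually start it.
    echo iperf3 -s -p "$port" -D
    port=$((port + 1))
    i=$((i + 1))
  done
}
server_cmds 3
```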

A few things to note

  • Each of the steps above must be completed for each disk and each node in the environment. Similar/"identical" hardware does not necessarily perform the same under the stress of heavy workloads.

  • It's important that the network testing ( items 3 & 4 above ):

    • Be performed between your OSD nodes and Clients, as well as OSD node to OSD node
    • Be performed on ALL network paths, including both Ceph Cluster network and Ceph Public networks
  • Longer tests are better and more indicative of actual performance

    • 600 seconds is a good compromise between expediency and accuracy of the results; anything shorter than 300 seconds should not be considered useful.
    • Short tests may show artificially good or bad results based on inefficiencies or advantages of caching or network buffers throughout the environment.

Analysis and Conclusions

At the end of the testing, the following processing needs to be done:

  1. Populate a spreadsheet with the following metrics for individual disk performance:

    Hostname | Devpath | OpSize | OpType | Throughput | IOPs | Average Latency
  2. Calculate Throughput SUM, IOPs SUM and Average Latency for each host at each OpSize/OpType combination

    • This will provide you with the maximum available performance of your spinning disks per host
  3. Calculate Throughput SUM, IOPs SUM and Average Latency for the whole cluster based on individual disk tests

    • This will provide you with the maximum available performance of your spinning disks for the entire cluster
  4. Populate a spreadsheet with the following metrics for Aggregate disk performance:

    Hostname | Devpath | OpSize | OpType | Throughput | IOPs | Average Latency
  5. Calculate Throughput SUM, IOPs SUM and Average Latency for each host at each OpSize/OpType combination

    • This will provide you with the maximum available performance of your spinning disks as limited by your RAID controller or PCIe bus
  6. Calculate Throughput SUM, IOPs SUM and Average Latency for the whole cluster based on aggregate disk tests

    • This will provide you with the maximum available performance of your spinning disks as limited by your RAID controller or PCIe bus for the entire cluster
  7. Compare the Aggregate disk performance values to those of the Individual disk performance values

    • Any difference in performance here beyond normal sample deviation reflects the bottleneck imposed by your RAID Controller or PCIe bus
    • For calculating total cluster throughput and IOPs capabilities, the lowest performing values should be used
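The per-host sums and averages in steps 2 through 7 are straightforward to compute from a CSV export of the spreadsheet. A minimal sketch with awk; the CSV layout mirrors the columns above, and the sample rows are invented purely for illustration:

```shell
#!/bin/sh
# Sketch: aggregate per-disk fio results into per-host sums per
# OpSize/OpType combination. Sample data is fabricated for illustration.
cat > /tmp/disk_results.csv <<'EOF'
hostname,devpath,opsize,optype,throughput_mbps,iops,avg_lat_ms
osd1,/dev/sdb,4k,randread,3.1,795,1.2
osd1,/dev/sdc,4k,randread,3.0,770,1.3
osd2,/dev/sdb,4k,randread,3.2,810,1.1
EOF

awk -F, 'NR > 1 {
  k = $1 "," $3 "," $4              # key: hostname,opsize,optype
  tput[k] += $5                     # SUM of throughput
  iops[k] += $6                     # SUM of IOPs
  lat[k]  += $7; n[k]++             # accumulate latency for the average
}
END {
  for (k in n)
    printf "%s sum_tput=%.1f sum_iops=%d avg_lat=%.2f\n", \
      k, tput[k], iops[k], lat[k] / n[k]
}' /tmp/disk_results.csv > /tmp/host_sums.txt

cat /tmp/host_sums.txt
```

Summing the per-host lines again (dropping the hostname from the key) yields the whole-cluster figures of steps 3 and 6.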
  8. Populate a spreadsheet with the following metrics for Client mode Network Performance:

    Hostname | Path | Aggregate Bitrate | Average Latency
    • Note that 'Path' is to track which network segment is under test:
      • OSD to OSD Cluster network, OSD to OSD Public network, OSD to Client Public network, OSD to MON public network
    • Note too the Cwnd (TCP congestion window) value reported by iperf3 -- if this value remains low or fluctuates wildly, you should investigate the network for stability
  9. Populate a spreadsheet with the following metrics for Server mode Network Performance:

    Hostname | Path | Aggregate Bitrate | Average Latency
    • Note that 'Path' is to track which network segment is under test:
      • OSD to OSD Cluster network, OSD to OSD Public network, OSD to Client Public network, OSD to MON public network
    • Note too the Cwnd (TCP congestion window) value reported by iperf3 -- if this value remains low or fluctuates wildly, you should investigate the network for stability
  10. Compare maximum throughput of the network to the maximum throughputs of your disks ( individual and aggregate tests )

    • If the disks are able to sustain 40gbit/s of transfer ( from step 4 ), but you are only able to sustain 20gbit/s on the network layer, then the network is your bottleneck
  11. For read IO, Cluster IOPs are roughly 1:1 to Client IOPs

    • For a well sized and highly parallel workload, your Ceph cluster should be able to achieve close to the total read IOPs that the disk/HBA or Network will allow
  12. For write IO, there is 'write amplification' of IOPs which occurs based on your config:

    • Replicated pools and on-disk journals: Each Client IOP will result in roughly 3x amplification per copy

      • pool size=3, ~9x IOPs amplification
    • Replicated pools and dedicated journals: Each Client IOP will result in roughly 1.5x to 2x amplification per copy

      • pool size=3, ~4.5x to ~6x IOPs amplification
    • EC Pools with on-disk journals: Each Client IOP will result in roughly 3x amplification PER K+M

      • pool with EC profile 8+4 == 12, 36x IOPs amplification
    • EC Pools with dedicated journals: Each Client IOP will result in roughly 1.5x to 2x amplification PER K+M

      • pool with EC profile 8+4 == 12, 18x to 24x amplification
    • Why the variation in write amplification:

      • Each write to the backing store results in a metadata update as well as a write of the client IO
      • However, due to write batching, multiple client IOPs are commonly combined into a smaller number of on-disk write IOPs in Ceph
      • This can result in a varied amplification depending on workload
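The rules of thumb above reduce to a single multiplication: copies (pool size for replicated pools, K+M for EC pools) times a per-copy factor (~3 for on-disk journals, ~1.5 to 2 for dedicated journals). A small sketch of that estimate:

```shell
#!/bin/sh
# Sketch: rough write-amplification estimate from the rules of thumb above.
#   copies   = pool size (replicated) or K+M (erasure coded)
#   per_copy = ~3 for on-disk journals, ~1.5-2 for dedicated journals
amplification() {  # usage: amplification <copies> <per_copy>
  awk -v c="$1" -v p="$2" 'BEGIN { printf "%.1f\n", c * p }'
}

amplification 3 3      # replica 3, on-disk journal    -> 9.0
amplification 3 1.5    # replica 3, dedicated journal  -> 4.5
amplification 12 3     # EC 8+4 (K+M=12), on-disk      -> 36.0
```

Dividing your measured cluster-wide disk IOPs capability by this factor gives a rough ceiling on the client write IOPs the cluster can serve.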