What data needs to be collected to find a performance bottleneck in RHGS?

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux 6
  • Red Hat Enterprise Linux 7
  • Red Hat Gluster Storage

Issue

  • Unable to determine why gluster cluster performance is poor.
  • Why is I/O performance so slow on the client node?
  • Why should we collect gluster volume profile data?

Resolution

To determine the exact bottleneck, and which component between the server and the client is responsible for the poor write performance, collect the following data for investigation:

  1. Gluster volume profile data
    Refer to the section Collecting volume profile information.
    The profile data for the volume should be collected at 10-15 minute intervals; if the test is short (less than 5-10 minutes), collect it every 2 minutes. Take volume profile data before the performance test, at each interval during the test, and again after the performance test completes.
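The collection described above can be sketched as the following loop, assuming a hypothetical volume name testvol; drop the sleep interval to 120 seconds for short tests:

```shell
# Enable profiling on the volume (testvol is a placeholder name)
gluster volume profile testvol start

# Take one snapshot now (before the test), then one every 10 minutes;
# stop the loop once the final post-test snapshot has been taken.
for i in 1 2 3 4 5 6; do
    gluster volume profile testvol info \
        > profile_$(hostname)_$(date +%F_%H-%M-%S).log
    sleep 600
done

# Disable profiling once the test run is finished
gluster volume profile testvol stop
```

Each snapshot lands in a separate timestamped file, which makes it easy to compare per-interval FOP latencies later.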

  2. Collect network bandwidth data
    Refer to the solution to check network bandwidth performance between the gluster server nodes and the client nodes. This will show whether the network between the server and client nodes is suboptimal and contributing to the performance problem.
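As an illustration (this is a sketch, not necessarily the exact method in the referenced solution), raw TCP bandwidth between a server node and a client can be measured with iperf3; server1.example.com is a placeholder hostname:

```shell
# On the gluster server node: run iperf3 in server mode
iperf3 -s

# On the client node: 30-second test with 4 parallel streams
# against the server node (placeholder hostname)
iperf3 -c server1.example.com -t 30 -P 4
```

Run the test in both directions if reads and writes are both affected, since duplex paths can perform differently.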

  3. Collect storage disk performance numbers
    The backend storage backing the gluster bricks may itself be the bottleneck, unable to handle the workload generated by the gluster brick processes. To determine whether backend storage performance is indeed poor, we need iostat numbers for the LVM or physical disks that provide storage to the bricks. Collect iostat data for the complete performance test run (starting 2 minutes before the test, during the test, and for 2 minutes after it):

    # iostat -x 1 > iostat_$(hostname)_$(date +%F_%H-%M-%S)
    

    Collect iostat data from all gluster server nodes involved.

  4. Collect system resource performance data
    Collect vmstat/sar data during the performance test run:

    # vmstat 1 > vmstat_$(hostname)_$(date +%F_%H-%M-%S)
    
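For the sar portion, a typical invocation (assuming the sysstat package is installed) that records CPU, memory, and block device activity once per second is:

```shell
# CPU (-u), memory (-r) and block device (-d) statistics,
# 1-second samples, written to a timestamped file
sar -u -r -d 1 > sar_$(hostname)_$(date +%F_%H-%M-%S)
```

Without a count argument sar samples continuously, so stop it once the test run has finished.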
  5. Strace of performance tool
    We may also request an strace of the brick process for the bricks where the test file is being created.

    # strace -f -T -o <file-path> <command>
    
  6. Workload
    For a single-threaded test, use dd; otherwise choose fio or iozone performance tests. Perform at least 3 test runs, collecting the independent sets of performance data described above for each run.
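For example (paths and sizes are placeholders; /mnt/glusterfs is assumed to be the client-side mount point):

```shell
# Single-threaded sequential write with dd: 1 GiB file, flushed to
# stable storage before dd reports its throughput
dd if=/dev/zero of=/mnt/glusterfs/ddtest.img bs=1M count=1024 conv=fdatasync

# Multi-threaded sequential write with fio: 4 jobs of 1 GiB each,
# direct I/O to bypass the client page cache
fio --name=seqwrite --directory=/mnt/glusterfs --rw=write \
    --bs=1M --size=1G --numjobs=4 --direct=1 --group_reporting
```

Keep the block size and file size identical across the 3 runs so the per-run numbers are comparable.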

  7. Client side volume profile (FUSE protocol)
    The client side volume profile data is also required if the gluster volume is mounted over native FUSE. Refer to the solution to collect the client side volume profile.
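One common way to capture client-side profile data on a FUSE mount, shown here as a sketch (the referenced solution is authoritative), is to enable the io-stats counters on the volume and then trigger a dump through an extended attribute on the mount point; testvol and /mnt/glusterfs are placeholders:

```shell
# On a server node: enable latency measurement and FOP counters
gluster volume set testvol diagnostics.latency-measurement on
gluster volume set testvol diagnostics.count-fop-hits on

# On the client: dump the io-stats counters for the FUSE mount.
# The attribute value names the output file (exact dump location
# varies by gluster version).
setfattr -n trusted.io-stats-dump -v /tmp/client_profile.txt /mnt/glusterfs
```

Trigger the dump at the same intervals used for the server-side volume profile so the two data sets line up.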

Root Cause

The bottleneck could be at any layer: the network, the gluster stack, or the backend storage that provides storage to the bricks. It is therefore difficult to pinpoint the exact performance problem without supporting data for analysis.

Diagnostic Steps

  • iostat data indicating an I/O bottleneck at the brick backend storage (note the sustained 100 %util and high await/avgqu-sz on the brick device):
# iostat -x 1 /dev/mapper/RHS_vg-lvbrick
Linux 3.10.0-514.el7.x86_64 (glusterc3-node2)   06/26/2018      _x86_64_        (2 CPU)
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00   130.00    0.00    0.00    0.00   0.00 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00   16.08   66.83    0.00   16.58

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-6              0.00     0.00    5.00  120.00    56.00  7480.50   120.58   221.86  117.17 2929.20    0.00   8.00 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.51    0.00    2.53   96.97    0.00    0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00   246.00    0.00    0.00    0.00   0.00 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.58    0.00    5.58   87.82    0.00    1.02

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00   246.00    0.00    0.00    0.00   0.00 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    1.53   48.47    0.00   50.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
dm-6              0.00     0.00    0.00    0.00     0.00     0.00     0.00   246.00    0.00    0.00    0.00   0.00 100.00
  • vmstat data showing heavy I/O wait (the wa column) during the test:
# vmstat 1 100
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0 198292  84156     88 1400796    0    0     8   824    0    0  5  5 90  0  0
 0  2 198292  79076     88 1406540    0    0     0 63532  854  591  1 16  2 81  0
 0  2 198292  68776     88 1417480    0    0     4    44  560  509  2  5  0 93  1
 0  2 198292  64288     88 1422160    0    0     0     0  342  333  0  3  0 97  0
 0  2 198292  75420     88 1411536    0    0     0     0  264  282  0  1  0 99  0
 0  2 198292  73024     88 1413988    0    0     0     0  283  294  0  0  0 99  0
 1  2 198292  72404     88 1414560    0    0     0     0  219  248  1  0  0 99  0
 0  2 198292  75852     88 1410756    0    0     4    15  440  521  2  3 36 60  0
 0  2 198292  75852     88 1410772    0    0     0     0  281  306  1  0 50 50  1

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.