Specific debug data that needs to be collected for GlusterFS to help troubleshooting

Purpose

  1. This article covers the commands, log locations, and other debug data that need to be collected for each GlusterFS component to help troubleshoot customer issues, while also making it easier to engage our engineering team if required.

  2. The intent is to reduce the overall number of iterations to effectively resolve a case.

General information that needs to be captured for any support case

  1. sosreports from all Gluster nodes and any problematic client nodes. Please make sure to upgrade to the latest available sos package before running the tool. Further instructions on how to collect a sosreport can be found in KCS document 3592.
  2. Specify the access protocols (native Glusterfs/nfs/SMB) in use by the client.
  3. Details of the file operations being performed, plus any scripts being used (if it's possible to share them), in case we need to reproduce the observed problem locally.
  4. As a quick reference, the Red Hat Gluster Storage Administration Guide provides a summary of the basic logs to be gathered in section 12.2.
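
A quick way to confirm point 2 above (which access protocol a client is using) is to inspect the client's mount table. This is a generic sketch, not a Red Hat tool:

```shell
# Identify gluster-related mounts and their protocol on a client node.
# fuse.glusterfs = native client, nfs/nfs4 = gNFS or NFS-Ganesha, cifs = SMB.
PROTOS=$(mount | grep -Ei 'glusterfs|type nfs|type cifs' \
         || echo "no gluster-related mounts found")
echo "$PROTOS"
```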

Generic Gluster Components / Translators

AFR (Automatic File Replication)

To effectively troubleshoot AFR related problems, the following additional information is required:

  1. Extended attributes of the files/directories showing the undesirable behavior:

    • If it's a file, collect the output of the below commands from all the bricks in the subvolume where the file is placed.

    • If it's a directory, collect the output of the below commands from all the bricks in the volume.

            # getfattr -d -m . -e hex <brick-path>/<entry-path>
            # stat <brick-path>/<entry-path>
      
  2. In the sosreports collected, make sure the following information is available:

    • Latest client logs, stored at /var/log/glusterfs/<mount-point>.log.
    • If the problem is observed in an OCS deployment, the client logs can be collected following the steps in KCS document 4047151.
    • Latest self-healing daemon logs, stored at /var/log/glusterfs/glustershd.log on the Gluster server side.
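
The xattr/stat collection above can be scripted per node. A minimal sketch, where the brick roots and entry path are hypothetical placeholders to be replaced:

```shell
#!/bin/sh
# Collect getfattr/stat output for one entry from every local brick (sketch).
BRICKS="/bricks/brick1 /bricks/brick2"   # assumption: replace with this node's brick roots
ENTRY="dir1/file.txt"                    # assumption: affected entry, relative to the brick
OUT=/tmp/afr-debug-$(hostname)
mkdir -p "$OUT"
for b in $BRICKS; do
    tag=$(echo "$b" | tr '/' '_')        # make a filename-safe tag from the brick path
    getfattr -d -m . -e hex "$b/$ENTRY" > "$OUT/xattr$tag.txt" 2>&1 || true
    stat "$b/$ENTRY"                    > "$OUT/stat$tag.txt"  2>&1 || true
done
tar -czf "$OUT.tar.gz" -C /tmp "$(basename "$OUT")"
```

Attach the resulting tarball from each brick node to the case.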

DHT (Distributed Hash Table)

  1. Please gather the stat output and the extended attributes of the entries showing issues from all the bricks making up the subvolume where the affected file is placed:

         # getfattr -d -m . -e hex <brick-path>/<entry-path>
         # stat <brick-path>/<entry-path>
    
  2. As DHT is a client-side translator, the following information is additionally required:

    • For an NFS client, the /var/log/glusterfs/nfs.log on the node whose NFS server is used to mount the volume.
    • /var/log/glusterfs/<mountpath>.log for FUSE mounts.
    • If the problem is observed in an OCS deployment, the client logs can be collected following the steps in KCS document 4047151.
    • /var/log/glusterfs/<volname>-rebalance.log from the Gluster nodes, if the issue is related to a rebalance operation.
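
For rebalance-related DHT issues, a quick first pass is to pull the error and warning lines out of the rebalance log. A sketch, assuming a hypothetical volume name of myvol (Gluster log lines carry a severity letter, E/W/I, after the timestamp, which is what the grep keys on):

```shell
# Extract the most recent error (" E ") and warning (" W ") lines from a rebalance log.
LOG=/var/log/glusterfs/myvol-rebalance.log    # assumption: volume is called "myvol"
if [ -f "$LOG" ]; then
    ERRORS=$(grep -E ' E | W ' "$LOG" | tail -50)
    [ -n "$ERRORS" ] || ERRORS="no error/warning lines found"
else
    ERRORS="rebalance log not present on this node"
fi
echo "$ERRORS"
```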

Geo-Replication

  1. Master node

    • The logs are located at /var/log/glusterfs/geo-replication/<master-vol> as the base directory.
    • The worker and monitor logs are located at basedir/<slavevol>.log. There's one file per monitor. All the gsync workers on this node log to this file.
    • libgfchangelog / agent logs are located at basedir/changes.log. There's one log file per local brick.
    • Master mount log: basedir/*gluster.log. This is the master volume client log file. Please note there will be one log file per master brick.
  2. Slave node.

    • The base directory is /var/log/glusterfs/geo-replication-slaves/
    • The gsyncd log is located at basedir/.log
    • The slave mount log is located at basedir/*gluster.log. Please note there will be one log file per master brick.
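
All of the geo-replication logs above can be bundled per node in one step. A minimal sketch using the base directories this article lists:

```shell
#!/bin/sh
# Copy the geo-replication log trees (master and slave side) into one tarball.
OUT=/tmp/georep-logs-$(hostname)
mkdir -p "$OUT"
cp -r /var/log/glusterfs/geo-replication        "$OUT"/ 2>/dev/null || true
cp -r /var/log/glusterfs/geo-replication-slaves "$OUT"/ 2>/dev/null || true
tar -czf "$OUT.tar.gz" -C /tmp "$(basename "$OUT")"
```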

Quota

  • /var/log/glusterfs/quotad.log: log of the quota daemon running on each node.
  • /var/log/glusterfs/quota-crawl.log: whenever quota is enabled, a file system crawl is performed and the corresponding log is stored in this file.
  • /var/log/glusterfs/quota-mount-VOLNAME.log: an auxiliary FUSE client is mounted in /VOLNAME, and the corresponding client logs are found in this file.
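
These quota logs, plus the currently configured limits, can be gathered in one step. A sketch, where myvol is a placeholder volume name:

```shell
#!/bin/sh
# Bundle quota daemon/crawl logs and the configured limits for one volume (sketch).
VOL=myvol                                  # assumption: replace with the real volume name
OUT=/tmp/quota-debug-$(hostname)
mkdir -p "$OUT"
cp /var/log/glusterfs/quotad.log             "$OUT"/ 2>/dev/null || true
cp /var/log/glusterfs/quota-crawl.log        "$OUT"/ 2>/dev/null || true
cp "/var/log/glusterfs/quota-mount-$VOL.log" "$OUT"/ 2>/dev/null || true
gluster volume quota "$VOL" list > "$OUT/quota-list.txt" 2>&1 || true
tar -czf "$OUT.tar.gz" -C /tmp "$(basename "$OUT")"
```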

SMB/CIFS

  1. SMB server:

    • A tarball of the directory /var/log/samba, collected after the problem is observed.

    • The file /var/log/log.ctdb for CTDB-related problems.

    • The output of following commands on all the nodes running the Samba server:

        # testparm -s /etc/samba/smb.conf
        # smbstatus
        # ctdb status
        # ctdb ip
      
    • A tcpdump collected while the problem is being noticed. Instructions on how to use this tool can be found in KCS document How to capture network packets with tcpdump?

  2. Client side.
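
The server-side command outputs listed under point 1 can be captured per node in one pass. A minimal sketch:

```shell
#!/bin/sh
# Capture Samba/CTDB state from one node into a tarball (sketch).
OUT=/tmp/smb-debug-$(hostname)
mkdir -p "$OUT"
testparm -s /etc/samba/smb.conf > "$OUT/testparm.txt"    2>&1 || true
smbstatus                       > "$OUT/smbstatus.txt"   2>&1 || true
ctdb status                     > "$OUT/ctdb-status.txt" 2>&1 || true
ctdb ip                         > "$OUT/ctdb-ip.txt"     2>&1 || true
tar -czf "$OUT.tar.gz" -C /tmp "$(basename "$OUT")"
```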

gNFS

  1. gNFS server

    • File /var/log/glusterfs/nfs.log from the gNFS servers.
    • gluster v status <volname> output for the volumes exported by gNFS.
    • gluster v info <volname> output for the volumes exported by gNFS.
    • rpcinfo -p
    • showmount -e
  2. gNFS client

    • /var/log/messages
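
The gNFS server-side items above can be gathered per node with a short script. A sketch, where myvol is a placeholder volume name:

```shell
#!/bin/sh
# Gather gNFS server-side state and logs into one tarball (sketch).
VOL=myvol                                  # assumption: replace with the exported volume
OUT=/tmp/gnfs-debug-$(hostname)
mkdir -p "$OUT"
gluster v status "$VOL" > "$OUT/vol-status.txt" 2>&1 || true
gluster v info "$VOL"   > "$OUT/vol-info.txt"   2>&1 || true
rpcinfo -p              > "$OUT/rpcinfo.txt"    2>&1 || true
showmount -e            > "$OUT/showmount.txt"  2>&1 || true
cp /var/log/glusterfs/nfs.log "$OUT"/ 2>/dev/null || true
tar -czf "$OUT.tar.gz" -C /tmp "$(basename "$OUT")"
```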

NFS-Ganesha

  1. Server side

    • Most of the information required should be collected by the sosreport utility. In some cases, it might be necessary to manually get a tarball of the contents of /var/log/ganesha from all the nodes, since the sosreport might miss this information.

    • A tcpdump collected while the problem is being observed. Instructions on how to use this tool can be found in KCS document How to capture network packets with tcpdump?

  2. Client side.

RHHI / RHV

  • In addition to the sosreports of the Gluster nodes, also collect a sosreport of the Hosted Engine (HE).

  • A log-collector report. The instructions on how to get this information are posted in KCS document How to collect logs in RHEV 3 and RHV 4.

Snapshots

  1. Output of the following snapshot commands:

     # gluster snapshot info
     # gluster snapshot status
     # gluster snapshot config
     # lvs -a (from all the nodes)
    
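
These outputs can be collected into a single bundle per node. A sketch:

```shell
#!/bin/sh
# Gather snapshot state and LVM layout from one node (sketch; run on every node).
OUT=/tmp/snap-debug-$(hostname)
mkdir -p "$OUT"
gluster snapshot info   > "$OUT/snap-info.txt"   2>&1 || true
gluster snapshot status > "$OUT/snap-status.txt" 2>&1 || true
gluster snapshot config > "$OUT/snap-config.txt" 2>&1 || true
lvs -a                  > "$OUT/lvs.txt"         2>&1 || true
tar -czf "$OUT.tar.gz" -C /tmp "$(basename "$OUT")"
```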

Glusterd & CLI

  • The sosreport of the Gluster nodes should collect all the needed information. Make sure the following files are present and cover the timestamp when the problem was observed:

    • /var/log/glusterfs/cmd_history.log from all the nodes.
    • /var/log/glusterfs/cli.log from all the nodes.
    • /var/log/glusterfs/glusterd.log from all the nodes.
  • Depending on the nature of the problem observed, we might need the output of a tcpdump command. Instructions on how to use this tool can be found in KCS document How to capture network packets with tcpdump?

Performance

The information that needs to be collected to troubleshoot a performance case is compiled in KCS article 3363601.

Erasure Coding

  • Extended attributes of the file(s) showing the undesirable behavior, along with stat outputs:

      # getfattr -d -m . -e hex <brick-path>/<path-to-the-file>
      # stat <brick-path>/<path-to-the-file>
    

Bitrot Detection

  1. Output of gluster vol status VOLNAME

  2. Extended attributes of the file(s) showing the undesirable behavior:

     # getfattr -d -m . -e hex <brick-path>/<path-to-the-file>
    
  3. Brick logs available at /var/log/glusterfs/bricks/

Tiering

  1. Get statistics:

     # gluster volume rebalance <volname> tier status
    
  2. Get tier logs available at /var/log/glusterfs/<volname>-tier.log

  3. Look for I/O failures in /var/log/glusterfs/nfs.log (NFS mounted) or /var/log/glusterfs/mnt.log (on the client). Look for the words "ESTALE" and "invalid".

  4. Dump the database and send it back to engineering for troubleshooting:

     # echo "select * from gf_flink_tb;" | sqlite3 /<brick_path>/.glusterfs/<volname>/.db
     # echo "select * from gf_file_tb;" | sqlite3 /<brick_path>/.glusterfs/<volname>/.db
    
  5. Confirm the rebalance daemon is running

     # pgrep -a glusterfs|grep rebal
    

Web Admin

To troubleshoot this component, please refer to KCS document 4457041, Basic Troubleshooting and Data Gathering for Red Hat Gluster Storage Web Admin Console.

Information Required for Common Use-Cases

Gluster process crash

  • Which process crashed? Was it OOM killed or was it an actual crash?
  • If it's a process crash, the Gluster nodes should be running the abrt tool, which automatically collects all the data needed to analyze the corefile dumped.
  • Run the command abrt-cli ls and check if there's an entry matching the Gluster process that crashed and the date the event occurred. If so, please provide us with a tarball of the directory listed by this command.
  • If it's an OOM kill, please provide us with a few statedumps of the corresponding Gluster process that was killed. For instructions on how to gather this information, refer to section 19.8 of the Gluster Administration Guide.

Gluster process hang/fail to respond

  • In addition to the sosreports, capture a statedump of the hanging process. For instructions on how to get this data, refer to section 19.8 of the Gluster Administration Guide.

  • If possible, attach gdb to the process and get a coredump:

      gdb -p <pid>
      (gdb) gcore

  • The output file will be in the working directory, with the name core.<pid>

High Memory Usage

  • Make sure the memory size of the process is increasing. Check the RSS column in the output of ps auxwww and verify that the memory used is growing.
  • If that's the case, collect an initial statedump of the process to set a baseline. After the memory has increased by a few hundred MBs, collect a second statedump to compare. For instructions on how to get this data, refer to section 19.8 of the Gluster Administration Guide.
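
To confirm growth before collecting statedumps, the RSS of the suspect process can be sampled over time. A sketch; the PID is passed as the first argument and, purely so the loop is runnable as written, falls back to the current shell's PID:

```shell
# Sample the RSS (in KiB) of a process a few times to confirm it is growing.
PID=${1:-$$}    # assumption: pass the gluster process PID as the first argument
SAMPLES=$(i=0; while [ $i -lt 3 ]; do ps -o rss= -p "$PID"; sleep 1; i=$((i+1)); done)
echo "$SAMPLES"
```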

Application operation fails on the mount point

  • Collect the client mount logs, available at /var/log/glusterfs/<mount-point>.log

  • In addition, please collect an strace of the application to understand the nature of the failure:

      strace -ff -T -s500 -v -f -y -tt -o /tmp/strace-$(hostname)-$(date +"%Y-%m-%d-%H-%M-%S") -p <application PID>
    

Rebalance failure

  • Check the rebalance logs for any error messages.
  • Rebalance requires all child subvols to be available. If there are any CHILD_DOWN messages in the rebalance log, find out which subvolumes were down and get the sosreports for those nodes.
    • Rebalance also connects to glusterd and fails if it is down on any nodes. Start any glusterd instances that are down and retry rebalance.
  • For any other error, please provide sosreports from all the gluster nodes.

Split-brain and self-heal errors

  • For split-brains, please consult section 11.5 of the Gluster Administration Guide.
  • For entries stuck in the output of the gluster v heal VOLNAME info command, check the extended attributes of these entries. The required commands are described in the AFR section above. Verify whether there's a clear source and a sink.
    • If the problem concerns data or metadata heal, check if the parent directory is present in all bricks of the replica.
    • Try launching index heal by running gluster v heal VOLNAME and see if there is anything unusual in the brick and glustershd logs. Temporarily change the client-log-level to DEBUG if additional information is required. For instructions, consult section 12.4 of the Gluster Administration Guide.
    • Try restarting the self-healing daemon process by running gluster v start VOLNAME force. As a further step, collect a statedump of the self-healing daemon process by sending a SIGUSR1 signal to its PID: kill -SIGUSR1 <pid>.
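
The SIGUSR1 statedump trigger mentioned above can be wrapped in a small guard so it is only attempted when the daemon is found. A sketch that prints the command instead of sending the signal (the pgrep pattern is an assumption; glustershd runs as a glusterfs process):

```shell
# Locate the self-heal daemon and print the statedump trigger command (sketch).
PID=$(pgrep -f glustershd 2>/dev/null | head -1)
if [ -n "$PID" ]; then
    MSG="run: kill -SIGUSR1 $PID   (statedump lands under /var/run/gluster/)"
else
    MSG="glustershd not running on this node"
fi
echo "$MSG"
```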