Specific debug data that needs to be collected for GlusterFS to help troubleshooting
Purpose

This article covers the commands, log locations, and other debug data that need to be collected for each GlusterFS component to help troubleshoot customer issues, while also making it easier to engage our engineering team if required.

The intent is to reduce the overall number of iterations needed to effectively resolve a case.
General information that needs to be captured for any support case
- sosreports from all Gluster nodes and from problematic client nodes, if any. Please make sure to upgrade to the latest available sos package before running the tool. Further instructions on how to collect a sosreport can be found in KCS document 3592.
- Specify the access protocols (native GlusterFS/NFS/SMB) in use by the client.
- Details of the file operations being performed, plus any scripts being used (if it is possible to share them), in case we need to reproduce the observed problem locally.
- As a quick reference, the Red Hat Gluster Storage Administration Guide provides a summary of the basic logs to gather in section 12.2.
Generic Gluster Components / Translators
AFR (Automated File Replication)
To effectively troubleshoot AFR related problems, the following additional information is required:
- Extended attributes of the files/directories which show the undesirable behavior:
  - If it's a file, collect the output of the commands below from all the bricks in the subvolume where the file is placed.
  - If it's a directory, collect the output of the commands below from all the bricks in the volume.

# getfattr -d -m . -e hex <brick-path>/<entry-path>
# stat <brick-path>/<entry-path>
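To avoid missing a brick when gathering these attributes, the collection can be scripted. The sketch below only prints the commands to run (the brick paths and entry name are hypothetical placeholders, not real paths from your deployment):

```shell
# Print (dry-run) the getfattr/stat commands to run for one affected entry
# on every brick of the subvolume. Nothing is executed by this function.
afr_debug_cmds() {
    entry="$1"
    shift
    for brick in "$@"; do
        echo "getfattr -d -m . -e hex $brick/$entry"
        echo "stat $brick/$entry"
    done
}

# Example with three hypothetical bricks of a replica-3 subvolume:
afr_debug_cmds "dir1/file1" /rhgs/brick1 /rhgs/brick2 /rhgs/brick3
```

Run the printed commands on the nodes hosting each brick and attach the output to the case.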
- In the sosreports collected, make sure the following information is available:
  - Latest client logs stored at /var/log/glusterfs/mount-point.log.
  - If the problem is observed in an OCS deployment, the client logs can be collected following the steps in KCS document 4047151.
  - Latest self-healing daemon logs stored at /var/log/glusterfs/glustershd.log on the Gluster server side.
DHT (Distributed Hash Table)
- Please gather a stat output and also the extended attributes of the entries showing issues, from all the bricks making up the subvolume where the affected file is placed:

# getfattr -d -m . -e hex <brick-path>/<entry-path>
# stat <brick-path>/<entry-path>

- As DHT is a client-side translator, the following information is additionally required:
  - For an NFS client, /var/log/glusterfs/nfs.log on the node whose NFS server is used to mount the volume.
  - For FUSE mounts, /var/log/glusterfs/<mountpath>.log.
  - If the problem is observed in an OCS deployment, the client logs can be collected following the steps in KCS document 4047151.
  - /var/log/glusterfs/<volname>-rebalance.log from the Gluster nodes, if the issue is related to a rebalance operation.
Geo-Replication
- Master node
  - The base log directory is /var/log/glusterfs/geo-replication/<master-vol>.
  - The worker and monitor logs are located at basedir/<slavevol>.log. There is one file per monitor; all the gsync workers on this node log to this file.
  - The libgfchangelog / agent logs are located at basedir/changes.log. There is one log file per local brick.
  - The master mount log is located at basedir/*gluster.log. This is the master volume client log file. Please note there will be one log file per master brick.
- Slave node
  - The base directory is /var/log/glusterfs/geo-replication-slaves/.
  - The gsyncd log is located at basedir/.log.
  - The slave mount log is located at basedir/*gluster.log. Please note there will be one log file per master brick.
Quota
- /var/log/glusterfs/quotad.log: log of the quota daemons running on each node.
- /var/log/glusterfs/quota-crawl.log: whenever quota is enabled, a file system crawl is performed and the corresponding log is stored in this file.
- /var/log/glusterfs/quota-mount-VOLNAME.log: an auxiliary FUSE client mounts the VOLNAME volume and the corresponding client logs are found in this file.
SMB/CIFS
- SMB server:
  - A tarball of the directory /var/log/samba, collected after the problem is observed.
  - The file /var/log/log.ctdb, for CTDB-related problems.
  - The output of the following commands on all the nodes running the Samba server:

# testparm -s /etc/samba/smb.conf
# smbstatus
# ctdb status
# ctdb ip

  - A tcpdump collected while the problem is being noticed. Instructions on how to use this tool can be found in KCS document "How to capture network packets with tcpdump?".
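Gathering the command outputs above on every Samba node can be scripted; a minimal sketch follows (the output directory name is an example, and each command's output, or its error if the command fails, is saved to its own file):

```shell
# Collect the Samba/CTDB diagnostic outputs into one directory per node.
# Failures (e.g. a command missing on this node) are recorded in the file too.
smb_debug_collect() {
    outdir="$1"
    mkdir -p "$outdir"
    for cmd in "testparm -s /etc/samba/smb.conf" "smbstatus" "ctdb status" "ctdb ip"; do
        # Build a safe file name from the command line
        fname=$(echo "$cmd" | tr ' /' '__')
        $cmd > "$outdir/$fname.out" 2>&1
    done
}

# Example invocation:
# smb_debug_collect "/tmp/smb-debug-$(hostname)"
```

Attach a tarball of the resulting directory from each node to the case.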
- Client side:
  - A screenshot showing the error observed in Windows.
  - A network capture while the issue is being observed. Wireshark is an open-source tool that can be used for this purpose.
gNFS
- gNFS server
  - The file /var/log/glusterfs/nfs.log from the gNFS servers.
  - The output of gluster v status <volname> for the volumes exported by gNFS.
  - The output of gluster v info <volname> for the volumes exported by gNFS.
  - The output of rpcinfo -p.
  - The output of showmount -e.
- gNFS client
  - /var/log/messages
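The server-side commands can be printed for a given volume so none are forgotten; a small dry-run sketch (the volume name is a placeholder):

```shell
# Print (dry-run) the gNFS server-side commands to collect for one volume.
gnfs_debug_cmds() {
    vol="$1"
    echo "gluster v status $vol"
    echo "gluster v info $vol"
    echo "rpcinfo -p"
    echo "showmount -e"
}

# Example for a hypothetical volume name:
gnfs_debug_cmds myvol
```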
NFS-Ganesha
- Server side
  - Most of the information required should be collected by the sosreport utility. In some cases, it might be necessary to manually get a tarball of the contents of /var/log/ganesha from all the nodes, since the sosreport might miss this information.
  - A tcpdump collected while the problem is being observed. Instructions on how to use this tool can be found in KCS document "How to capture network packets with tcpdump?".
- Client side
  - The information required to troubleshoot a client-side issue is covered in KCS document "What Information is Required to Debug an NFS-Ganesha Client Issue?".
RHHI / RHV
- In addition to the sosreports of the Gluster nodes, also collect a sosreport of the Hosted Engine (HE).
- A log-collector report. The instructions on how to get this information are posted in KCS document "How to collect logs in RHEV 3 and RHV 4".
Snapshots
- Output of the following snapshot commands:

# gluster snapshot info
# gluster snapshot status
# gluster snapshot config
# lvs -a   (from all the nodes)
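These outputs can be captured into a single file per node with a short script; a minimal sketch (the output file name is an example, and any command errors are captured alongside the output):

```shell
# Capture the snapshot-related command outputs into one file per node.
OUT=/tmp/snap-debug-$(hostname).txt
{
    echo "### gluster snapshot info";   gluster snapshot info 2>&1
    echo "### gluster snapshot status"; gluster snapshot status 2>&1
    echo "### gluster snapshot config"; gluster snapshot config 2>&1
    echo "### lvs -a";                  lvs -a 2>&1
} > "$OUT"
```

Remember that lvs -a must be collected from all the nodes, since snapshot LVs are local to each brick host.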
Glusterd & CLI
- The sosreport of the Gluster nodes should collect all the needed information. Make sure the following files are present and cover the timestamp when the problem was observed:
  - /var/log/glusterfs/cmd_history.log from all the nodes.
  - /var/log/glusterfs/cli.log from all the nodes.
  - /var/log/glusterfs/glusterd.log from all the nodes.
- Depending on the nature of the problem observed, we might need the output of a tcpdump command. Instructions on how to use this tool can be found in KCS document "How to capture network packets with tcpdump?".
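For glusterd/CLI problems specifically, the capture usually needs to include the management traffic; glusterd listens on TCP port 24007 by default. The sketch below only prints a suggested invocation (run it as root while reproducing the issue; the capture file name is an example):

```shell
# Print (dry-run) a tcpdump invocation filtered to the glusterd
# management port (TCP 24007 by default).
glusterd_capture_cmd() {
    echo "tcpdump -i any -s 0 -w /tmp/glusterd-$(hostname).pcap tcp port 24007"
}

glusterd_capture_cmd
```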
Performance
The information that needs to be collected to troubleshoot a performance case is compiled in KCS article 3363601.
Erasure Coding
- Extended attributes of the file(s) which show the undesirable behavior, along with stat outputs:

# getfattr -d -m . -e hex <brick-path>/<path-to-the-file>
# stat <path-to-the-file>
Bitrot Detection
- Output of gluster vol status VOLNAME.
- Extended attributes of the file(s) which show the undesirable behavior:

# getfattr -d -m . -e hex <brick-path>/<path-to-the-file>

- Brick logs available at /var/log/glusterfs/bricks/.
Tiering
- Get statistics:

# gluster volume rebalance <volname> tier status

- Get the tier logs available at /var/log/glusterfs/<volname>-tier.log.
- Look for I/O failures in /var/log/glusterfs/nfs.log (NFS mounted) or /var/log/glusterfs/mnt.log (on the client). Look for the words "ESTALE" and "invalid".
- Dump the database and send it back to engineering for troubleshooting:

# echo "select * from gf_flink_tb;" | sqlite3 /<brick_path>/.glusterfs/<volname>/.db
# echo "select * from gf_file_tb;" | sqlite3 /<brick_path>/.glusterfs/<volname>/.db

- Confirm the rebalance daemon is running:

# pgrep -a glusterfs | grep rebal
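Since both tables must be dumped for every brick, the sqlite3 command lines can be generated with a small dry-run sketch (the volume name and brick paths are placeholders):

```shell
# Print (dry-run) the sqlite3 dump commands for both tiering tables on
# every brick path passed in. Nothing is executed by this function.
tier_db_dump_cmds() {
    vol="$1"
    shift
    for brick in "$@"; do
        for tbl in gf_flink_tb gf_file_tb; do
            echo "echo \"select * from $tbl;\" | sqlite3 $brick/.glusterfs/$vol/.db"
        done
    done
}

# Example with two hypothetical bricks:
tier_db_dump_cmds myvol /rhgs/brick1 /rhgs/brick2
```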
Web Admin
To troubleshoot this component, please refer to KCS document 4457041 Basic Troubleshooting and Data Gathering for Red Hat Gluster Storage Web Admin Console
Information Required for Common Use-Cases
Gluster process crash
- Which process crashed? Was it OOM killed or was it an actual crash?
- If it's a process crash, the Gluster nodes should be running the abrt tool, which automatically collects all the data needed to analyze the corefile dumped.
- Run the command abrt-cli ls and check if there's an entry matching the Gluster process that crashed and the date this event occurred. If so, please provide us with a tarball of the Directory listed by this command.
- If it's an OOM kill, please provide us with a few statedumps of the corresponding Gluster process that was killed. For instructions on how to gather this information, refer to section 19.8 of the Gluster Administration Guide.
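Several statedumps spaced out in time are more useful than one, since they let us compare allocations. The sketch below only prints the commands (the volume name and count are placeholders); by default the dumps typically land under /var/run/gluster:

```shell
# Print (dry-run) a series of volume statedump commands; run each one
# with some minutes in between so memory growth is visible across dumps.
statedump_cmds() {
    vol="$1"
    n="$2"
    i=1
    while [ "$i" -le "$n" ]; do
        echo "gluster volume statedump $vol   # dump $i of $n"
        i=$((i + 1))
    done
}

# Example: three dumps for a hypothetical volume:
statedump_cmds myvol 3
```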
Gluster process hang/fail to respond
- In addition to the sosreports, capture a statedump of the hanging process. For instructions on how to get this data, refer to section 19.8 of the Gluster Administration Guide.
- If possible, attach gdb to the process and get a coredump:

# gdb -p <pid>
(gdb) gcore

- The output file will be in the working directory, with a file name of the form core.<pid>.
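If an interactive gdb session is not practical on the node, the same coredump can be taken non-interactively with gdb's batch mode. The sketch below only prints the suggested command (the pid and output path are placeholders):

```shell
# Print (dry-run) a non-interactive gdb invocation that writes a core
# file for the given pid without opening an interactive session.
gcore_cmd() {
    pid="$1"
    echo "gdb -p $pid -batch -ex 'gcore /tmp/core.$pid'"
}

# Example with a hypothetical pid:
gcore_cmd 1234
```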
High Memory Usage
- Make sure the memory size of the process is increasing. Check the RSS column in the output of ps auxwww and verify that the memory used is growing.
- If that's the case, collect an initial statedump of the process to set a baseline. After the memory has increased by a few hundred MBs, collect a second statedump to compare. For instructions on how to get this data, refer to section 19.8 of the Gluster Administration Guide.
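To confirm the RSS is actually growing before taking the statedumps, the value can be sampled over time; a minimal sketch (the pid, sample count, and interval are placeholders):

```shell
# Sample a process's RSS (in KB, as reported by ps) a fixed number of
# times, printing one value per line.
sample_rss() {
    pid="$1"
    count="$2"
    interval="$3"
    i=0
    while [ "$i" -lt "$count" ]; do
        ps -o rss= -p "$pid"
        i=$((i + 1))
        sleep "$interval"
    done
}

# Example: three samples of this shell's own RSS, one second apart:
sample_rss $$ 3 1
```

If consecutive samples keep increasing over a sustained period, proceed with the baseline and follow-up statedumps as described above.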
Application operation fails on the mount point
- Collect the client mount logs, available at /var/log/glusterfs/mount-point.log.
- In addition, please collect an strace of the application to understand the nature of the failure:

# strace -ff -T -s500 -v -f -y -tt -o /tmp/strace-$(hostname)-$(date +"%Y-%m-%d-%H-%M-%S") -p <application PID>
Rebalance failure
- Check the rebalance logs for any error messages.
- Rebalance requires all child subvolumes to be available. If there are any CHILD_DOWN messages in the rebalance log, find out which subvolumes were down and get the sosreports for those nodes.
- Rebalance also connects to glusterd and fails if it is down on any node. Start any glusterd instances that are down and retry the rebalance.
- For any other error, please provide sosreports from all the gluster nodes.
Split brain and self-heal errors
- For split-brains, please consult section 11.5 of the Gluster Administration Guide
- For entries stuck in the output of the gluster v heal VOLNAME info command, check the extended attributes of these entries. The required commands are described in the AFR section above. Verify whether there's a clear source and a sink.
- If the problem concerns data or metadata heal, check if the parent directory is present in all bricks of the replica.
- Try launching index heal by running gluster v heal VOLNAME and see if there is anything unusual in the brick and glustershd logs. Temporarily change the client-log-level to DEBUG if additional information is required. For instructions, consult section 12.4 of the Gluster Administration Guide.
- Try restarting the self-healing daemon process by running gluster v start VOLNAME force. As a further step, collect a statedump of the self-healing daemon process by sending a USR1 signal to its pid: kill -SIGUSR1 <pid>.
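The statedump step can be summarized as follows; the sketch only prints the commands (the pid is a placeholder), and the dump files typically appear under /var/run/gluster with names of the form glusterdump.<pid>.dump.<timestamp>:

```shell
# Print (dry-run) the steps to take a statedump of the self-heal daemon
# via the USR1 signal. Nothing is executed by this function.
shd_statedump_cmds() {
    echo "pgrep -f glustershd"
    echo "kill -SIGUSR1 <pid>"
    echo "ls -l /var/run/gluster/   # look for the newest glusterdump.<pid>.dump.* file"
}

shd_statedump_cmds
```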