Network debugging on Red Hat Enterprise Linux
There is usually no definitive answer as to why network performance is degraded. Network performance varies considerably depending on many variables. This article outlines initial information that can be gathered to give a case a head start when the problem is not understood.
- Table of Contents
- Items to consider
- Tests and information to gather
- monitor.sh script notes
- monitor.sh source (copy&paste)
- Packet captures
- General recommendations
Items to consider
- Do I have basic connectivity? Is the link up? Is the router, if there is one in the path to the target system, responsive to pings? Is the target system responsive to pings?
- Is the application doing anything else like disk I/O? Disk I/O can severely impact network throughput due to I/O bottlenecks, dirty pages running out, etc. It is better to measure network throughput using tools like iperf or netperf, as they are not dependent on disk I/O. If the problem is FTP performance, use iperf or netperf to verify whether or not the network is the bottleneck.
- Why is my scp or sftp throughput low? The scp command is not a good indicator of network speed. It is susceptible to both slow disk I/O and the overhead of single-threaded encryption. Do not use it as a gauge of network speed.
- What is the LAN topology? There may be 10Gbps cards installed in some systems, but traffic may be routed via a 1Gbps link. The network is only as fast as the slowest component, and components may be shared with other competing traffic.
- What effect on the network are the other machines having? An old-fashioned way to check the state of a network is to monitor the LEDs on the switch ports. If one port is very busy, trace it back to a machine and investigate.
- Is this a virtual machine on a hypervisor? Check the configuration of the hypervisor. Run an iperf test from a bare metal machine to the hypervisor. If it is also slow, then troubleshooting outside of the VM is required.
- If all of the above are ruled out, then a tuning exercise needs to be carried out to determine why the machine is slow in processing network input and output. There are many ways to do this; please contact Red Hat Support.
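The basic connectivity questions at the top of this list can be checked with a few commands. The following is a minimal sketch and not part of the original article: the function name default_gateway is hypothetical, and it assumes the usual "default via GATEWAY dev ..." line format printed by ip route show.

```shell
#!/bin/bash
# Hypothetical helper: print the first default gateway found in
# "ip route show" output supplied on stdin.
default_gateway() {
    awk '$1 == "default" && $2 == "via" { print $3; exit }'
}

# Example usage (left as comments so that sourcing this file sends no traffic):
#   ip link show                          # is the link up?
#   gw=$(ip route show | default_gateway)
#   ping -c 3 "$gw"                       # is the router responsive?
#   ping -c 3 TARGET-IP                   # is the target system responsive?
```

TARGET-IP is a placeholder for the system being tested. A host with no default route (for example, one that only talks to its local subnet) prints nothing, in which case only the target ping applies.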
Tests and information to gather
- Confirm network throughput between systems using iperf3, netperf, or another tool that does not use encryption or generate disk I/O. Try the same tests between a particular system and others on the network.
- If performance is degraded for a specific machine, provide the output from iperf3 or netperf.
  Increase the buffer size on both client and server to 4 MiB:
  # sysctl -w net.core.wmem_max=4194304
  # sysctl -w net.core.rmem_max=4194304
  Execute on the server machine:
  # iperf3 -f M -s
  Execute on the client machine:
  # iperf3 -l 1M -w 4M -f M -t 60 -c SERVER-IP
- At the very least, gather the following before and after the throughput test as preliminary data:
  # netstat -s
  # ss -noemitaup
  # cat /proc/net/snmp
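Before and after data is easiest to read as a difference. The sketch below is an illustration only, not part of the original article: the function names snmp_flatten and counter_delta are hypothetical. It relies on /proc/net/snmp being laid out as pairs of lines, a header line of counter names followed by a line of values.

```shell
#!/bin/bash
# Hypothetical helpers for comparing /proc/net/snmp before and after a test.

# Flatten the header/value line pairs ("Tcp: InSegs ..." then "Tcp: 100 ...")
# into one "Proto:Counter value" pair per line.
snmp_flatten() {
    awk '
        NR % 2 == 1 { for (i = 1; i <= NF; i++) name[i] = $i; next }
        { for (i = 2; i <= NF; i++) print name[1] name[i], $i }
    ' "$1"
}

# Print every counter whose value changed between two flattened files,
# together with the size of the change.
counter_delta() {
    awk '
        FNR == NR { before[$1] = $2; next }
        ($1 in before) && $2 != before[$1] { print $1, $2 - before[$1] }
    ' "$1" "$2"
}

# Example usage (comments only):
#   snmp_flatten /proc/net/snmp > before.txt
#   ... run the iperf3 test ...
#   snmp_flatten /proc/net/snmp > after.txt
#   counter_delta before.txt after.txt
```

Counters such as Tcp:RetransSegs that grow during the test are a useful starting point, but the raw before and after output should still be attached to the case.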
monitor.sh script notes
- Use the following script to gather snapshots of network statistics at intervals while the networking issue occurs. Please check that enough disk space exists if the script is going to be run for a long time.
  - To create the script, copy the lines from "#!/bin/bash ... # monitor.sh begins here" through to the end of the line "# monitor.sh ends here" and paste them into a file called monitor.sh. Then run chmod +x monitor.sh to allow the script to be invoked.
- This script should work on all supported versions of RHEL.
  - RHEL5 Note: Remove the Usage notes section (from USAGE until EOM).
- When the test is finished, stop the script, then run the following command to compress the data:
  $ tar cvzf net_stats_$(hostname)-$(date +"%Y-%m-%d-%H-%M-%S").tgz *network_stats_*
  and upload the compressed *.tgz file to the case.
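To judge whether enough disk space exists, a rough upper bound can be estimated from the size of a single collection. This is a hedged sketch, not part of the original article: the function names estimate_kb and avail_kb are hypothetical. It relies on the fact that monitor.sh writes one directory per hour and removes the directory from the same hour on the previous day, so about one day of data is kept at a time.

```shell
#!/bin/bash
# Hypothetical helpers for a rough disk space check before a long run.

# Upper bound, in KiB, for one day of data:
#   $1 = measured size of one collection in KiB, $2 = delay in seconds.
# Collections cannot run more often than once per delay period.
estimate_kb() {
    echo $(( $1 * (86400 / $2) ))
}

# Print the "Available" column from "df -Pk" output supplied on stdin.
avail_kb() {
    awk 'NR == 2 { print $4 }'
}

# Example usage (comments only):
#   ./monitor.sh -i 1                      # one collection
#   du -sk ./*network_stats_*              # size of that collection in KiB
#   estimate_kb SIZE-IN-KIB 5              # worst case for the default delay
#   df -Pk . | avail_kb                    # space available in this directory
```

The first collection of each hour also writes one-off files such as sysctl.txt, so a size measured from a single run overestimates the per-collection cost; treat the result as a ceiling rather than a prediction.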
monitor.sh source (copy&paste)
#!/bin/bash
# monitor.sh begins here
# Save this script as monitor.sh
# Allocate read write execute permissions: chmod +rwx monitor.sh
# Help available with: ./monitor.sh -h
# License: Creative Commons Zero - https://creativecommons.org/publicdomain/zero/1.0/
# Requirements: bash coreutils findutils gawk procps-ng iproute (EL8+ iproute-tc) net-tools numactl sysstat
# shellcheck disable=SC2129,SC2181
VERSION=63
USAGE=$(cat <<-EOM
Usage: monitor.sh [-d DELAY] [-i ITERATIONS] [-p] [-n] [-h]
This script collects data relevant to network debugging.
Valid parameters are all optional.
-d DELAY
Specifies a delay between collections. Default is 5 seconds.
Examples:
./monitor.sh -d 10 # 10 seconds
./monitor.sh -d 2 # 2 seconds
-i ITERATIONS
Specifies the number of collections. Default is to run forever.
Examples:
./monitor.sh -i 10 # 10 iterations
./monitor.sh -i 2 # 2 iterations
-p
Disables process collection in "ss", except when SS_OPTS used.
Default is process collection enabled when SS_OPTS not provided.
Example:
./monitor.sh -p
-n
Disables collection of per netns netstat
-h
Displays this help message.
Example:
./monitor.sh -h
Options can be combined.
Example:
./monitor.sh -d 10 -i 360 # run every 10 secs, for an hour
This script recognizes an environment variable SS_OPTS which will
override the script's default command line switches when running
the 'ss' utility.
Example:
env SS_OPTS="-pantoOemi sport = :22" bash monitor.sh
EOM
)
## defaults
DELAY=5
ITERATIONS=-1
DEF_SS_OPTS="-noemitaup"
DEF_SS_OPTS_NOP="-noemitau"
NETNS=1
## Check PATH because when run from cron /usr/sbin often isn't in PATH
## ip & ss come from /usr/sbin/
if ! command -v ss >/dev/null 2>&1; then
    if [ ! -x /usr/sbin/ss ] ; then
        echo "ss binary not found, check if the iproute rpm is installed."
    else
        # ss exists but was not found via PATH, so /usr/sbin is missing from it
        PATH=/usr/sbin:$PATH
    fi
fi
#
# Add single line socket output if iproute2 is version 5.2 or newer
#
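# Compare major and minor numerically so that 6.0, 6.1, 7.0 etc. also qualify;
# adding 0 forces non-numeric legacy versions (e.g. "ss130716") to evaluate as 0.
hasO=$(ss -v | sed -e 's/^.*iproute2-//' | awk -F. '{ if ($1+0 > 5 || ($1+0 == 5 && $2+0 >= 2)) { print "1" } else { print "0" } }')
if [ "$hasO" -eq "1" ] ; then
    DEF_SS_OPTS="-noemitaupO"
    DEF_SS_OPTS_NOP="-noemitauO"
fi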
## option parsing
REAL_SS_OPTS=${SS_OPTS:-$DEF_SS_OPTS}
while getopts ":d:i:pnh" OPT; do
    case "$OPT" in
        "d")
            # something was passed, check it's a positive integer
            if [ "$OPTARG" -eq "$OPTARG" ] 2>/dev/null && [ "$OPTARG" -gt 0 ] 2>/dev/null; then
                DELAY="$OPTARG"
            else
                echo "ERROR: $OPTARG not a valid option for delay. Run 'monitor.sh -h' for help."
                exit 1
            fi
            ;;
        "i")
            # something was passed, check it's a positive integer
            if [ "$OPTARG" -eq "$OPTARG" ] 2>/dev/null && [ "$OPTARG" -gt 0 ] 2>/dev/null; then
                ITERATIONS="$OPTARG"
            else
                echo "ERROR: $OPTARG not a valid option for iterations. Run 'monitor.sh -h' for help."
                exit 1
            fi
            ;;
        "p")
            REAL_SS_OPTS=${SS_OPTS:-$DEF_SS_OPTS_NOP}
            ;;
        "n")
            NETNS=0
            ;;
        "h")
            echo "$USAGE"
            exit 0
            ;;
        ":")
            echo "ERROR: -$OPTARG requires an argument. Run 'monitor.sh -h' for help."
            exit 1
            ;;
        "?")
            echo "ERROR: -$OPTARG is not a valid option. Run 'monitor.sh -h' for help."
            exit 1
            ;;
    esac
done
#
# Removed default addition of -S for ss options due to
# https://bugzilla.redhat.com/show_bug.cgi?id=1982804
# which causes ss coredump in RHEL8.0 - RHEL8.4. when there
# are active SCTP associations
#
#if [ -z "$SS_OPTS" ] ; then
# if ! ss -S 2>&1 | grep -q "invalid option"; then
# REAL_SS_OPTS+="S"
# fi
#fi
## reporting
if [ "$ITERATIONS" -gt 0 ]; then
    echo "Running network monitoring with $DELAY second delay for $ITERATIONS iterations."
else
    echo "Running network monitoring with $DELAY second delay. Press Ctrl+c to stop..."
fi
## one-time commands
# shellcheck disable=SC2207
MQDEVS=( $(tc qdisc show | awk '/^qdisc mq/{print $(NF-1)}') )
## data collection loop
while [ "$ITERATIONS" != 0 ]; do
    # start timer in background
    eval sleep "$DELAY" &
    now=$(date +%Y_%m_%d_%H)
    then=$(date --date="yesterday" +%Y_%m_%d_%H)
    rm -rf "$HOSTNAME-network_stats_$then"
    mkdir -p "$HOSTNAME-network_stats_$now"
    if ! [ -e "$HOSTNAME-network_stats_$now/version.txt" ]; then
        echo "===== $(date +"%F %T.%N%:z (%Z)") =====" > "$HOSTNAME-network_stats_$now/version.txt"
        echo "This output created with monitor.sh version $VERSION" >> "$HOSTNAME-network_stats_$now/version.txt"
        echo "See https://access.redhat.com/articles/1311173" >> "$HOSTNAME-network_stats_$now/version.txt"
        echo "Delay: $DELAY" >> "$HOSTNAME-network_stats_$now/version.txt"
        echo "Iterations: $ITERATIONS" >> "$HOSTNAME-network_stats_$now/version.txt"
        echo "SS_OPTS: $REAL_SS_OPTS" >> "$HOSTNAME-network_stats_$now/version.txt"
    fi
    if ! [ -e "$HOSTNAME-network_stats_$now/sysctl.txt" ]; then
        echo "===== $(date +"%F %T.%N%:z (%Z)") =====" > "$HOSTNAME-network_stats_$now/sysctl.txt"
        sysctl -a 2>/dev/null >> "$HOSTNAME-network_stats_$now/sysctl.txt"
    fi
    if ! [ -e "$HOSTNAME-network_stats_$now/ip-address.txt" ]; then
        echo "===== $(date +"%F %T.%N%:z (%Z)") =====" > "$HOSTNAME-network_stats_$now/ip-address.txt"
        ip -d address list >> "$HOSTNAME-network_stats_$now/ip-address.txt"
    fi
    if ! [ -e "$HOSTNAME-network_stats_$now/ip-route.txt" ]; then
        echo "===== $(date +"%F %T.%N%:z (%Z)") =====" > "$HOSTNAME-network_stats_$now/ip-route.txt"
        ip route show table all >> "$HOSTNAME-network_stats_$now/ip-route.txt"
    fi
    if ! [ -e "$HOSTNAME-network_stats_$now/uname.txt" ]; then
        echo "===== $(date +"%F %T.%N%:z (%Z)") =====" > "$HOSTNAME-network_stats_$now/uname.txt"
        uname -a >> "$HOSTNAME-network_stats_$now/uname.txt"
    fi
    if ! [ -e "$HOSTNAME-network_stats_$now/uptime.txt" ]; then
        echo "===== $(date +"%F %T.%N%:z (%Z)") =====" > "$HOSTNAME-network_stats_$now/uptime.txt"
        uptime >> "$HOSTNAME-network_stats_$now/uptime.txt"
    fi
    if [ $NETNS -eq 1 ] && ! [ -e "$HOSTNAME-network_stats_$now/netns.txt" ]; then
        echo "===== $(date +"%F %T.%N%:z (%Z)") =====" > "$HOSTNAME-network_stats_$now/netns.txt"
        ip netns ls >> "$HOSTNAME-network_stats_$now/netns.txt"
    fi
    echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/ip_neigh"
    ip neigh show >> "$HOSTNAME-network_stats_$now/ip_neigh"
    echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/ip_maddr"
    ip maddr show >> "$HOSTNAME-network_stats_$now/ip_maddr"
    echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/igmp"
    cat /proc/net/igmp >> "$HOSTNAME-network_stats_$now/igmp"
    echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/tc_qdisc"
    tc -s qdisc >> "$HOSTNAME-network_stats_$now/tc_qdisc"
    if [ "${#MQDEVS[@]}" -gt 0 ]; then
        for MQDEV in "${MQDEVS[@]}"; do
            echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/tc_class_$MQDEV"
            tc -s class show dev "$MQDEV" >> "$HOSTNAME-network_stats_$now/tc_class_$MQDEV"
        done
    fi
    echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/netstat"
    netstat -s >> "$HOSTNAME-network_stats_$now/netstat"
    if [ $NETNS -eq 1 ] ; then
        for ns in $(ip netns ls | awk '{print $1}'); do
            echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/netns_$ns"
            ip netns exec "$ns" netstat -s >> "$HOSTNAME-network_stats_$now/netns_$ns"
        done
    fi
    echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/nstat"
    nstat -az >> "$HOSTNAME-network_stats_$now/nstat"
    echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/ss"
    eval "ss $REAL_SS_OPTS" >> "$HOSTNAME-network_stats_$now/ss"
    echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/interrupts"
    cat /proc/interrupts >> "$HOSTNAME-network_stats_$now/interrupts"
    echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/softnet_stat"
    cat /proc/net/softnet_stat >> "$HOSTNAME-network_stats_$now/softnet_stat"
    echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/vmstat"
    cat /proc/vmstat >> "$HOSTNAME-network_stats_$now/vmstat"
    echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/ps"
    ps -alfFe >> "$HOSTNAME-network_stats_$now/ps"
    echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/mpstat"
    eval mpstat -A "$DELAY" 1 2>/dev/null >> "$HOSTNAME-network_stats_$now/mpstat" &
    echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/top"
    top -c -b -n1 >> "$HOSTNAME-network_stats_$now/top"
    echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/numastat"
    numastat 2>/dev/null >> "$HOSTNAME-network_stats_$now/numastat"
    if [ -e /proc/softirqs ]; then
        echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/softirqs"
        cat /proc/softirqs >> "$HOSTNAME-network_stats_$now/softirqs"
    fi
    echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/sockstat"
    cat /proc/net/sockstat >> "$HOSTNAME-network_stats_$now/sockstat"
    if [ -e /proc/net/sockstat6 ]; then
        echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/sockstat6"
        cat /proc/net/sockstat6 >> "$HOSTNAME-network_stats_$now/sockstat6"
    fi
    echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/netdev"
    cat /proc/net/dev >> "$HOSTNAME-network_stats_$now/netdev"
    for DEV in $(ip a l | grep mtu | awk '{print $2}' | awk -F "[:@]" '{print $1}'); do
        echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/ethtool_$DEV"
        ethtool -S "$DEV" >> "$HOSTNAME-network_stats_$now/ethtool_$DEV" 2>/dev/null
    done
    tmp_file=$(mktemp)
    find /sys/devices -type f | grep -E '/net/.*/statistics' | xargs grep . > "${tmp_file}"
    for DEV in $(ip a l | grep mtu | awk '{print $2}' | awk -F "[:@]" '{print $1}'); do
        echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/sys_statistics_$DEV"
        # Match the device's own statistics directory literally, so that
        # e.g. eth1 does not also collect the counters of eth10.
        awk -F "/" -v pat="/net/${DEV}/statistics/" 'index($0, pat) {print $NF}' >> "$HOSTNAME-network_stats_$now/sys_statistics_$DEV" < "${tmp_file}"
    done
    rm -f "${tmp_file}"
    if [ -e /proc/net/sctp ]; then
        echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/sctp-assocs"
        cat /proc/net/sctp/assocs >> "$HOSTNAME-network_stats_$now/sctp-assocs"
        echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/sctp-remaddr"
        cat /proc/net/sctp/remaddr >> "$HOSTNAME-network_stats_$now/sctp-remaddr"
        echo "===== $(date +"%F %T.%N%:z (%Z)") =====" >> "$HOSTNAME-network_stats_$now/sctp-snmp"
        cat /proc/net/sctp/snmp >> "$HOSTNAME-network_stats_$now/sctp-snmp"
    fi
    if [ "$ITERATIONS" -gt 0 ]; then (( ITERATIONS-=1 )); fi
    # Wait till background jobs are finished
    wait
done
#
# monitor.sh ends here
To the extent possible under law, Red Hat, Inc. has dedicated all copyright to this software to the public domain worldwide, pursuant to the CC0 Public Domain Dedication. This software is distributed without any warranty. See http://creativecommons.org/publicdomain/zero/1.0/.
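Each file written by monitor.sh is a series of snapshots separated by "===== timestamp =====" header lines. The following sketch, which is not part of the original script, shows one way to pull out the latest snapshot and decode it. The function names last_snapshot and decode_softnet are hypothetical. The decoder assumes the standard /proc/net/softnet_stat layout, in which the first three hexadecimal columns are packets processed, packets dropped, and time_squeeze, with one row per CPU.

```shell
#!/bin/bash
# Hypothetical helpers for reading monitor.sh output files.

# Print the most recent snapshot in a file, i.e. the lines that follow
# the last "===== timestamp =====" header.
last_snapshot() {
    awk '
        /^===== .* =====$/ { n = 0; next }
        { line[++n] = $0 }
        END { for (i = 1; i <= n; i++) print line[i] }
    ' "$1"
}

# Convert the first three hexadecimal columns of softnet_stat rows on stdin
# to decimal. Rows are numbered in file order; on systems with offline CPUs
# the row number may not equal the CPU number.
decode_softnet() {
    row=0
    while read -r processed dropped squeezed rest; do
        printf 'row%d processed=%d dropped=%d time_squeeze=%d\n' \
            "$row" "0x$processed" "0x$dropped" "0x$squeezed"
        row=$((row + 1))
    done
}

# Example usage (comments only):
#   last_snapshot "$HOSTNAME-network_stats_2024_01_01_00/softnet_stat" | decode_softnet
```

The directory name in the usage comment is only an example of the naming pattern the script produces. Non-zero and growing dropped or time_squeeze values between snapshots are generally worth mentioning when the data is uploaded.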
Packet captures
- Perform a packet capture using tcpdump on both ends while the test is under way. Be sure to provide the IP addresses of the machines relative to each pcap.
- For networks that have multiple components such as virtual machines, please reproduce the issue and perform a packet capture at each device between the VM and the remote host (if possible, simultaneously). For example:
    eth0 ---- vnet0 --- br0 --- eth0 ------ switch ------ eth1
  [-- VM --][------- hypervisor -------][-- physical --][-- remote --]
  On VM:
  # tcpdump -i eth0 -w /tmp/vm_$(hostname)-$(date +"%Y-%m-%d-%H-%M-%S").pcap
  On Hypervisor:
  # tcpdump -i vnet0 -w /tmp/vnet_$(hostname)-$(date +"%Y-%m-%d-%H-%M-%S").pcap
  # tcpdump -i br0 -w /tmp/br0_$(hostname)-$(date +"%Y-%m-%d-%H-%M-%S").pcap
  On switch: mirror switch port and perform a packet capture
  On remote host:
  # tcpdump -i eth1 -w /tmp/eth1_remote.pcap
- If the packet capture output is large and the problem is related to a specific area like UDP performance, you may be able to use tcpdump filtering options such as:
  # tcpdump -i ethX udp port 2222
  The filter syntax is described in man pcap-filter.
- Please do not use the -i any argument with tcpdump. This option creates a "cooked capture" which omits some information. There are times when this is warranted, but unless specifically requested, please specify a device such as -i eth0 or -i bond0 or -i br0.
- For further information on tcpdump see https://access.redhat.com/solutions/8787
- Packet captures on fast networks can be large in file size.
- If an issue timestamp or traffic port number is provided, this can save a lot of time. Please provide as much information about network communication as possible.
- Always compress (e.g. gzip *.pcap) packet capture files before transfer off-site.
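When several interfaces need to be captured at once, it helps to generate consistently named commands. The sketch below is an illustration only: the function name capture_cmd is hypothetical, it prints a tcpdump command line rather than running it, and it uses uname -n in place of hostname so that it has no extra dependencies. It follows the file naming pattern used in the examples above and refuses -i any.

```shell
#!/bin/bash
# Hypothetical helper: print a tcpdump command line for one interface,
# using a /tmp/IFACE_HOST-TIMESTAMP.pcap file name, and refuse "any".
capture_cmd() {
    iface=$1
    if [ -z "$iface" ] || [ "$iface" = "any" ]; then
        echo "specify a real device such as eth0, bond0 or br0" >&2
        return 1
    fi
    printf 'tcpdump -i %s -w /tmp/%s_%s-%s.pcap\n' \
        "$iface" "$iface" "$(uname -n)" "$(date +%Y-%m-%d-%H-%M-%S)"
}

# Example usage (comments only):
#   capture_cmd vnet0      # review the printed command, then run it as root
#   capture_cmd br0
#   gzip /tmp/*.pcap       # compress once the captures have been stopped
```

Printing the command instead of executing it keeps the person running the capture in control of when it starts and stops, which matters when captures on several devices need to overlap.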
General recommendations
- Please provide a clear description of exactly what the problem is.
- Take a sosreport from before and after the test on each system and upload with a file name that matches the system and includes before and after.
- If the networking configuration of a system is complex (bonding, VLANs, or bridges), run the plotnetcfg utility and provide the output.
- If possible, please provide a network diagram. These help to determine how the RHEL system fits into the network.