How to diagnose fence_vmware_rest resource timeout event with tcpdump?
Environment
- Red Hat Enterprise Linux 7, 8 & 9
- High Availability Pacemaker Cluster Setup
- fence_vmware_soap or fence_vmware_rest
Issue
- How to diagnose why fence resource reported a timeout event?
May 17 12:23:47 node1 pacemaker-fenced [2780] (child_timeout_callback) warning: fence_vmware_rest_monitor_1 process (PID 563752) timed out
May 17 12:23:47 node1 pacemaker-fenced [2780] (operation_finished) warning: fence_vmware_rest_monitor_1[563752] timed out after 120000ms
Resolution
If similar events are recorded and reviewed as detailed in the Diagnostic Steps, it clearly indicates an issue at the VMware level. The fence agent's connection with the vCenter remained in the ESTABLISHED state for the entire 120s, and the stack shows that the fence_vmware_rest process was waiting for data to arrive on the socket before it could proceed.
If the recovery operation triggered by the cluster succeeds with no configuration changes to the fence device, that also points to an intermittent issue on the VMware side. It is recommended to engage the VMware vendor to review the logs for the same time window.
As a quick starting point:
- Validate whether DNS queries are reaching their maximum limit (if an FQDN for the vCenter is used in the fence configuration)
- Validate the VMware certificates.
- Validate whether AD is reporting any issues (if an AD user is used as the fence user)
The above points are only a starting point at the VMware level and do not guarantee that the issue will be found in one of these three areas. The VMware vendor should be engaged for further assistance.
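The first two validation points can be checked quickly from the cluster node itself. A minimal sketch follows; the `check_vcenter_endpoint` function and the `vcenter.example.com` hostname are hypothetical, not part of any Red Hat tooling, and should be adapted to the hostname used in the fence configuration:

```shell
#!/bin/sh
# Quick client-side validation of the vCenter endpoint (a sketch, not a
# supported tool). The hostname is a placeholder; substitute the value
# used in the fence configuration.
check_vcenter_endpoint() {
    host="$1"; port="${2:-443}"
    # DNS: confirm the FQDN resolves (getent uses the same NSS path as the agent)
    getent hosts "$host" || { echo "DNS lookup failed for $host" >&2; return 1; }
    # TLS: fetch the server certificate and print its subject and validity window
    echo | openssl s_client -connect "${host}:${port}" -servername "$host" 2>/dev/null \
        | openssl x509 -noout -dates -subject
}
# Example: check_vcenter_endpoint vcenter.example.com 443
```

A certificate whose `notAfter` date has passed, or an FQDN that resolves slowly or inconsistently, would both be worth raising with the VMware vendor.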
Diagnostic Steps
For simplicity, the following steps and troubleshooting are focused on the fence_vmware_rest fence agent.
- The first step is to increase the monitor operation timeout value from the default of 20s to a higher value:
# pcs resource update <fence_ID> pcmk_monitor_timeout=120s op monitor timeout=120s
- Configure the ha-resourcemon.sh script on all the cluster nodes to gather granular system resource utilization and process details:
- Configure a rolling tcpdump capture on all the nodes, referring to the steps in the following article:
- The script pcapLogwatch.sh can be used with:
  - LOG-FILE-NAME as /var/log/messages
  - TRIGGER-STRING as timed out after 120000ms
- Ensure that the previous logs are removed from LOG-FILE-NAME so that the script pcapLogwatch.sh does not exit instantaneously. This is because the script reads the entire LOG-FILE-NAME file for the TRIGGER-STRING.
Log Analysis
Once a fence timeout event is reported, the following log sets are required:
- It is always better to collect an sosreport from all the nodes; otherwise, an sosreport from the affected node should suffice.
- The ha-resourcemon.sh logs.
- The TCP dump file.
While reviewing the sosreport:
- The cluster status will report the event under Failed Resource Actions:
Failed Resource Actions:
* fence_rest_res_monitor_60000 on node2 'OCF_TIMEOUT' (198): call=45, status='Timed Out', exitreason='', last-rc-change='2023-05-17 12:21:47 -04:00', queued=0ms, exec=120006ms
- The exec value confirms that the monitor operation executed for 120s (120006ms).
- Reviewing the logs gathered in the sosreport, the following events are logged:
May 17 12:23:47 node2 pacemaker-fenced [2780] (child_timeout_callback) warning: fence_vmware_rest_monitor_1 process (PID 563752) timed out
May 17 12:23:47 node2 pacemaker-fenced [2780] (operation_finished) warning: fence_vmware_rest_monitor_1[563752] timed out after 120000ms
May 17 12:23:47 node2 pacemaker-fenced [2780] (log_operation) notice: Operation 'monitor' [563752] using fence_rest_res returned -62 (Timer expired)
- From the above logs, the PID of the fence monitor operation is 563752. The monitor operation timed out at 12:23:47, which means it was triggered 2 minutes earlier, at 12:21:47.
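The two-minute offset can be verified directly; a small sketch using GNU date (the default on RHEL):

```shell
#!/bin/sh
# The monitor start time is the timeout event time minus the
# 120000ms (120s) monitor timeout.
timeout_at="2023-05-17 12:23:47"
date -d "$timeout_at 120 seconds ago" '+%F %T'   # → 2023-05-17 12:21:47
```

This matches the STARTED column (12:21:47) seen for PID 563752 in the ps-wchan samples below.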
- With the ha-resourcemon.sh data, the ps-wchan file shows that the fence_vmware_rest process was idle for the entire 120s:
Wed May 17 12:22:02 EDT 2023
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME WCHAN COMMAND
root 563752 0.5 0.0 231968 20308 ? S 12:21:47 00:00:00 - /usr/libexec/platform-python -tt /usr/sbin/fence_vmware_rest
Wed May 17 12:22:22 EDT 2023
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME WCHAN COMMAND
root 563752 0.2 0.0 231968 20308 ? S 12:21:47 00:00:00 - /usr/libexec/platform-python -tt /usr/sbin/fence_vmware_rest
Wed May 17 12:22:42 EDT 2023
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME WCHAN COMMAND
root 563752 0.1 0.0 231968 20308 ? S 12:21:47 00:00:00 - /usr/libexec/platform-python -tt /usr/sbin/fence_vmware_rest
Wed May 17 12:23:02 EDT 2023
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME WCHAN COMMAND
root 563752 0.0 0.0 231968 20308 ? S 12:21:47 00:00:00 - /usr/libexec/platform-python -tt /usr/sbin/fence_vmware_rest
Wed May 17 12:23:22 EDT 2023
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME WCHAN COMMAND
root 563752 0.0 0.0 231968 20308 ? S 12:21:47 00:00:00 - /usr/libexec/platform-python -tt /usr/sbin/fence_vmware_rest
Wed May 17 12:23:42 EDT 2023
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME WCHAN COMMAND
root 563752 0.0 0.0 231968 20308 ? S 12:21:47 00:00:00 - /usr/libexec/platform-python -tt /usr/sbin/fence_vmware_rest
Wed May 17 12:24:02 EDT 2023
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME WCHAN COMMAND
root 566837 0.4 0.0 231968 20332 ? S 12:23:47 00:00:00 - /usr/libexec/platform-python -tt /usr/sbin/fence_vmware_rest
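A listing like the above can be reproduced with a small sampler. This is a minimal sketch of the kind of data ha-resourcemon.sh is assumed to collect (the script's actual implementation may differ):

```shell
#!/bin/sh
# Periodically record the fence agent's process state and wait channel.
# A WCHAN of '-' combined with %CPU decaying toward 0.0 indicates the
# process is sleeping in userspace, waiting for data on its socket.
sample_fence_wchan() {
    pattern="${1:-fence_vmware}"
    date
    # Print the header line plus any process matching the pattern
    # (the sed command itself may also match; that line is harmless noise)
    ps axo user,pid,pcpu,pmem,vsz,rss,tty,stat,start,time,wchan:20,command \
        | sed -n "1p;/${pattern}/p"
}
# Example rolling collection, one sample every 20 seconds:
# while :; do sample_fence_wchan >> /var/log/ps-wchan.log; sleep 20; done
```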
- With the netstat details gathered, the connection with the vCenter shows the ESTABLISHED state, which means communication from the VM to its vCenter was working for the entire period:
Wed May 17 12:22:02 EDT 2023
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name Timer
tcp 0 0 10.10.10.10:43170 10.20.20.20:443 ESTABLISHED 0 14451262 563752/platform-pyt off (0.00/0/0)
...
Wed May 17 12:22:22 EDT 2023
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name Timer
tcp 0 0 10.10.10.10:43170 10.20.20.20:443 ESTABLISHED 0 14451262 563752/platform-pyt off (0.00/0/0)
...
Wed May 17 12:22:42 EDT 2023
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name Timer
tcp 0 0 10.10.10.10:43170 10.20.20.20:443 ESTABLISHED 0 14451262 563752/platform-pyt off (0.00/0/0)
...
Wed May 17 12:23:02 EDT 2023
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name Timer
tcp 0 0 10.10.10.10:43170 10.20.20.20:443 ESTABLISHED 0 14451262 563752/platform-pyt off (0.00/0/0)
...
Wed May 17 12:23:22 EDT 2023
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name Timer
tcp 0 0 10.10.10.10:43170 10.20.20.20:443 ESTABLISHED 0 14451262 563752/platform-pyt off (0.00/0/0)
...
Wed May 17 12:23:42 EDT 2023
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name Timer
tcp 0 0 10.10.10.10:43170 10.20.20.20:443 ESTABLISHED 0 14451262 563752/platform-pyt off (0.00/0/0)
...
Wed May 17 12:24:02 EDT 2023
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name Timer
tcp 0 0 10.10.10.10:52284 10.20.20.20:443 ESTABLISHED 0 14456986 566837/platform-pyt off (0.00/0/0)
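Snapshots like the above can also be taken with ss (the modern replacement for netstat); a sketch, assuming the vCenter is reached on port 443:

```shell
#!/bin/sh
# Snapshot established TCP connections to port 443, including the owning
# PID and socket timers -- the same fields sampled above with netstat.
# Run as root to see process names for all sockets.
date
ss -tnpo state established '( dport = :443 )'
```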
- Now reviewing the TCP dump, it can be seen that after [a] there was no exchange of data between the VM and the vCenter:
612981 2023-05-17 12:21:48.057651 10.10.10.10 → 10.20.20.20 TCP 74 43170 → 443 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=1356901162 TSecr=0 WS=128
612982 2023-05-17 12:21:48.057831 10.20.20.20 → 10.10.10.10 TCP 66 443 → 43170 [SYN, ACK] Seq=0 Ack=1 Win=29200 Len=0 MSS=1460 SACK_PERM=1 WS=256
612983 2023-05-17 12:21:48.057850 10.10.10.10 → 10.20.20.20 TCP 54 43170 → 443 [ACK] Seq=1 Ack=1 Win=29312 Len=0
612984 2023-05-17 12:21:48.063934 10.10.10.10 → 10.20.20.20 TLSv1 571 Client Hello
612985 2023-05-17 12:21:48.064125 10.20.20.20 → 10.10.10.10 TCP 60 443 → 43170 [ACK] Seq=1 Ack=518 Win=30464 Len=0
612986 2023-05-17 12:21:48.066779 10.20.20.20 → 10.10.10.10 TLSv1.2 1514 Server Hello, Certificate
612987 2023-05-17 12:21:48.066786 10.10.10.10 → 10.20.20.20 TCP 54 43170 → 443 [ACK] Seq=518 Ack=1461 Win=32128 Len=0
612988 2023-05-17 12:21:48.066795 10.20.20.20 → 10.10.10.10 TLSv1.2 84 Server Key Exchange, Server Hello Done
612989 2023-05-17 12:21:48.066799 10.10.10.10 → 10.20.20.20 TCP 54 43170 → 443 [ACK] Seq=518 Ack=1491 Win=32128 Len=0
612990 2023-05-17 12:21:48.067252 10.10.10.10 → 10.20.20.20 TLSv1.2 180 Client Key Exchange, Change Cipher Spec, Encrypted Handshake Message
612991 2023-05-17 12:21:48.067968 10.20.20.20 → 10.10.10.10 TLSv1.2 105 Change Cipher Spec, Encrypted Handshake Message
612992 2023-05-17 12:21:48.068077 10.10.10.10 → 10.20.20.20 TLSv1.2 476 Application Data
612993 2023-05-17 12:21:48.111651 10.20.20.20 → 10.10.10.10 TCP 60 443 → 43170 [ACK] Seq=1542 Ack=1066 Win=31488 Len=0 <<=--- [a]
628793 2023-05-17 12:23:47.989333 10.10.10.10 → 10.20.20.20 TCP 54 43170 → 443 [FIN, ACK] Seq=1066 Ack=1542 Win=32128 Len=0
628794 2023-05-17 12:23:47.989752 10.20.20.20 → 10.10.10.10 TCP 60 443 → 43170 [FIN, ACK] Seq=1542 Ack=1067 Win=31488 Len=0
628795 2023-05-17 12:23:47.989783 10.10.10.10 → 10.20.20.20 TCP 54 43170 → 443 [ACK] Seq=1067 Ack=1543 Win=32128 Len=0
- Comparing with the TCP dump stream of the last successful connection between the VM and the vCenter, there is continuous traffic/data exchange between the VM and the vCenter starting at [b]:
602378 2023-05-17 12:20:47.727099 10.10.10.10 → 10.20.20.20 TCP 74 43104 → 443 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=1356840832 TSecr=0 WS=128
602379 2023-05-17 12:20:47.727226 10.20.20.20 → 10.10.10.10 TCP 66 443 → 43104 [SYN, ACK] Seq=0 Ack=1 Win=29200 Len=0 MSS=1460 SACK_PERM=1 WS=256
602380 2023-05-17 12:20:47.727248 10.10.10.10 → 10.20.20.20 TCP 54 43104 → 443 [ACK] Seq=1 Ack=1 Win=29312 Len=0
602381 2023-05-17 12:20:47.733738 10.10.10.10 → 10.20.20.20 TLSv1 571 Client Hello
602382 2023-05-17 12:20:47.733883 10.20.20.20 → 10.10.10.10 TCP 60 443 → 43104 [ACK] Seq=1 Ack=518 Win=30464 Len=0
602383 2023-05-17 12:20:47.737063 10.20.20.20 → 10.10.10.10 TLSv1.2 1514 Server Hello, Certificate
602384 2023-05-17 12:20:47.737072 10.10.10.10 → 10.20.20.20 TCP 54 43104 → 443 [ACK] Seq=518 Ack=1461 Win=32128 Len=0
602385 2023-05-17 12:20:47.737082 10.20.20.20 → 10.10.10.10 TLSv1.2 84 Server Key Exchange, Server Hello Done
602386 2023-05-17 12:20:47.737086 10.10.10.10 → 10.20.20.20 TCP 54 43104 → 443 [ACK] Seq=518 Ack=1491 Win=32128 Len=0
602387 2023-05-17 12:20:47.737591 10.10.10.10 → 10.20.20.20 TLSv1.2 180 Client Key Exchange, Change Cipher Spec, Encrypted Handshake Message
602388 2023-05-17 12:20:47.738220 10.20.20.20 → 10.10.10.10 TLSv1.2 105 Change Cipher Spec, Encrypted Handshake Message
602389 2023-05-17 12:20:47.738333 10.10.10.10 → 10.20.20.20 TLSv1.2 476 Application Data
602390 2023-05-17 12:20:47.779669 10.20.20.20 → 10.10.10.10 TCP 60 443 → 43104 [ACK] Seq=1542 Ack=1066 Win=31488 Len=0
602443 2023-05-17 12:20:47.912997 10.20.20.20 → 10.10.10.10 TLSv1.2 540 Application Data <<<===------ [b]
602444 2023-05-17 12:20:47.913803 10.10.10.10 → 10.20.20.20 TLSv1.2 452 Application Data
602445 2023-05-17 12:20:47.913988 10.20.20.20 → 10.10.10.10 TCP 60 443 → 43104 [ACK] Seq=2028 Ack=1464 Win=32512 Len=0
602452 2023-05-17 12:20:47.970378 10.20.20.20 → 10.10.10.10 TCP 1514 [TCP segment of a reassembled PDU]
602453 2023-05-17 12:20:47.970411 10.20.20.20 → 10.10.10.10 TCP 1514 443 → 43104 [ACK] Seq=3488 Ack=1464 Win=32512 Len=1460 [TCP segment of a reassembled PDU]
602454 2023-05-17 12:20:47.970429 10.10.10.10 → 10.20.20.20 TCP 54 43104 → 443 [ACK] Seq=1464 Ack=4948 Win=40960 Len=0
602455 2023-05-17 12:20:47.970439 10.20.20.20 → 10.10.10.10 TCP 1514 443 → 43104 [ACK] Seq=4948 Ack=1464 Win=32512 Len=1460 [TCP segment of a reassembled PDU]
602456 2023-05-17 12:20:47.970444 10.20.20.20 → 10.10.10.10 TLSv1.2 132 Application Data
602457 2023-05-17 12:20:47.970448 10.10.10.10 → 10.20.20.20 TCP 54 43104 → 443 [ACK] Seq=1464 Ack=6486 Win=43904 Len=0
602458 2023-05-17 12:20:47.971100 10.10.10.10 → 10.20.20.20 TLSv1.2 467 Application Data
602459 2023-05-17 12:20:47.971213 10.20.20.20 → 10.10.10.10 TCP 60 443 → 43104 [ACK] Seq=6486 Ack=1877 Win=33536 Len=0
602460 2023-05-17 12:20:47.972138 10.20.20.20 → 10.10.10.10 TLSv1.2 207 Application Data
602461 2023-05-17 12:20:47.972230 10.10.10.10 → 10.20.20.20 TLSv1.2 85 Encrypted Alert
602462 2023-05-17 12:20:47.972363 10.20.20.20 → 10.10.10.10 TCP 60 443 → 43104 [FIN, ACK] Seq=6639 Ack=1908 Win=33536 Len=0
602463 2023-05-17 12:20:47.972896 10.10.10.10 → 10.20.20.20 TCP 54 43104 → 443 [FIN, ACK] Seq=1908 Ack=6640 Win=46720 Len=0
602464 2023-05-17 12:20:47.973076 10.20.20.20 → 10.10.10.10 TCP 60 443 → 43104 [ACK] Seq=6640 Ack=1909 Win=33536 Len=0
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.