Mellanox ConnectX-5 internal error when removing PF from the bond or disabling SR-IOV
Environment
- Red Hat OpenStack Platform 13.0 (Red Hat Enterprise Linux 7)
- Red Hat OpenStack Platform 16.1 (Red Hat Enterprise Linux 8)
- Red Hat OpenStack Platform 16.2 (Red Hat Enterprise Linux 8)
Issue
When using Mellanox ConnectX-5 adapter cards (mlx5 driver) in VF LAG mode, and while at least one VF of either PF is still bound or attached to a VM, an internal firmware error might occur when performing any of the following:
- Removing the PF from the bond (using ifdown, ip link, or any other method)
- Attempting to disable SR-IOV
Resolution
Reboot the node to restore the bond.
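To avoid triggering the error in the first place, the driver's own warning in the logs below ("Make sure all VFs are unbound prior to VF LAG activation or deactivation") points at the preventive step: detach the VFs from their VMs, then unbind them from the driver before removing the PF from the bond or disabling SR-IOV. The following is a minimal sketch using the standard sysfs unbind interface; the helper name is ours, and the PCI address in the usage example is taken from the logs in this article — substitute your own PF address:

```shell
# Sketch only: unbind every VF of a PF before deactivating VF LAG.
# Assumes the VFs have already been detached from their VMs.
# The optional second argument overrides the sysfs root (useful for testing).
unbind_vfs() {
    pf=$1
    sysfs=${2:-/sys}
    for vf in "$sysfs/bus/pci/devices/$pf"/virtfn*; do
        [ -e "$vf" ] || continue
        # Each virtfnN symlink points at the VF's own PCI device node;
        # writing that address to the driver's unbind file detaches it.
        basename "$(readlink -f "$vf")" > "$sysfs/bus/pci/drivers/mlx5_core/unbind"
    done
}

# Example (on the compute node, PF address from the logs below):
#   unbind_vfs 0000:5e:00.0
```

Only once all VFs are unbound should the PF be removed from the bond or SR-IOV disabled.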
Root Cause
You might encounter this known issue when using Mellanox ConnectX-5 adapter cards with the virtual function (VF) link aggregation group (LAG) configuration in an OVS hardware offload deployment with SR-IOV in switchdev mode.
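To confirm a node is actually running in the switchdev eswitch mode this configuration requires, the PF can be queried with devlink. The PCI address below is illustrative; take the real one from lspci:

```shell
# PCI address is illustrative; substitute your PF's address.
devlink dev eswitch show pci/0000:5e:00.0
# An OVS hardware-offload deployment reports "mode switchdev" here;
# "mode legacy" means switchdev is not enabled on this PF.
```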
Diagnostic Steps
- Establish a ping test to a VM attached to a VF or PF that belongs to the bond.
- Check the tc qdiscs (each device should have an ingress qdisc):
[root@overcloud-computeovshwoffload-0 ~]# ip link | grep lxbond
20: ens2f0: <BROADCAST,MULTICAST,PROMISC,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master lxbond state UP mode DEFAULT group default qlen 1000
31: ens2f1: <BROADCAST,MULTICAST,PROMISC,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master lxbond state UP mode DEFAULT group default qlen 1000
40: lxbond: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000
[root@overcloud-computeovshwoffload-0 ~]# tc qdisc show | grep ingress
qdisc ingress ffff: dev ens1f1 parent ffff:fff1 ----------------
qdisc ingress ffff: dev ens2f0 parent ffff:fff1 ingress_block 40 ----------------
qdisc ingress ffff: dev ens2f0_3 parent ffff:fff1 ----------------
qdisc ingress ffff: dev ens2f1 parent ffff:fff1 ingress_block 40 ----------------
qdisc ingress ffff: dev lxbond parent ffff:fff1 ingress_block 40 ----------------
qdisc ingress ffff: dev genev_sys_6081 parent ffff:fff1 ----------------
- Bring down the bond:
[root@overcloud-computeovshwoffload-0 ~]# ifdown lxbond
- Check the logs:
[699999.866619] mlx5_core 0000:5e:00.0: mlx5_cmd_check:772:(pid 307452): DESTROY_LAG(0x843) op_mod(0x0) failed, status bad system state(0x4), syndrome (0xad20d2)
[699999.889414] lxbond: (slave ens2f0): Releasing backup interface
[700000.036386] lxbond: (slave ens2f0): the permanent HWaddr of slave - 98:03:9b:9d:79:00 - is still in use by bond - set the HWaddr of slave to a different address to avoid conflicts
[700000.036388] device ens2f0 left promiscuous mode
[700000.036405] mlx5_core 0000:5e:00.0: mlx5_deactivate_lag:225:(pid 307452): Failed to deactivate VF LAG; driver restart required
Make sure all VFs are unbound prior to VF LAG activation or deactivation
[700000.036528] lxbond: (slave ens2f1): making interface the new active one
[700000.267419] device ens2f1 entered promiscuous mode
[700000.500566] lxbond: (slave ens2f1): Releasing backup interface
[700000.500570] device ens2f1 left promiscuous mode
[700000.693502] device lxbond left promiscuous mode
- Bring the bond back up:
[root@overcloud-computeovshwoffload-0 ~]# ifup lxbond
- Check the logs:
[700292.395317] mlx5_core 0000:5e:00.0: mlx5_cmd_check:772:(pid 307792): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0xd1a4f2)
[700292.403023] 8021q: adding VLAN 0 to HW filter on device lxbond
[700292.560890] mlx5_core 0000:5e:00.0: mlx5_create_lag:178:(pid 307792): Failed to create LAG (-22)
[700292.560892] mlx5_core 0000:5e:00.0: mlx5_activate_lag:198:(pid 307792): Failed to activate VF LAG
Make sure all VFs are unbound prior to VF LAG activation or deactivation
[700295.037843] device lxbond entered promiscuous mode
[700295.037854] device ens2f0 entered promiscuous mode
- Notice the "Failed to activate VF LAG" message.
- Check the tc qdiscs again (notice that ens2f1 no longer has a qdisc):
[root@overcloud-computeovshwoffload-0 ~]# tc qdisc show | grep ingress
qdisc ingress ffff: dev ens1f1 parent ffff:fff1 ----------------
qdisc ingress ffff: dev ens2f0 parent ffff:fff1 ingress_block 40 ----------------
qdisc ingress ffff: dev ens2f0_3 parent ffff:fff1 ----------------
qdisc ingress ffff: dev lxbond parent ffff:fff1 ingress_block 40 ----------------
qdisc ingress ffff: dev genev_sys_6081 parent ffff:fff1 ----------------
qdisc ingress ffff: dev tapb9084e26-08 parent ffff:fff1 ----------------
qdisc ingress ffff: dev tapd226045a-d0 parent ffff:fff1 ----------------
Conclusion
The ping traffic established before bringing down the bond was restored after bringing it back up. However, according to tc, the other slave interface, ens2f1, did not rejoin the bond.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.