Virtual machines cannot communicate over network bridges


Environment

  • Red Hat Enterprise Linux 7
  • Red Hat Enterprise Linux 6
  • KVM or RHV/RHEV virtualization
  • Linux bridge

Issue

  • KVM or RHEV virtual machines cannot communicate over network bridges
  • We have two RHEL 6.5 hosts, each running one KVM guest. Each host's second NIC has an address in the 192.168.1.x range, and each host also has a bridge interface in that range. The guests use a virtio NIC configured with the bridge as its source. The KVM guests cannot ping each other; each guest can ping only the host on which it resides. The hosts can ping all four addresses: the two hosts and the two KVM guests.
  • When we start the avahi-daemon service on the RHEL guest, pings to the guest's IP address begin showing packet loss after a while.
  • When our Windows VMs start IP Multicast, they become unreachable on the network.

Resolution

Remove any network loop, or remove any device reflecting traffic back to the hypervisor from the external network.

Network Loop

A subtle network loop can be created when multiple interfaces are "bonded", "bundled", or "aggregated" into one larger link, but only on one end of the bundle. Broadcast traffic can then leave via one slave/channel of the bundle and re-enter via another slave, arriving back on the same logical interface and creating a network loop.

In this instance, the proper resolution is to remove the network loop.

Reflected traffic

As described in the Root Cause, the bridge learns where a MAC address is located by looking at received traffic. When the bridge receives traffic from the VM's MAC coming in the external interface, the bridge thinks the VM is located outside the hypervisor instead of inside the hypervisor and traffic to the VM fails.

This is commonly seen as duplicated packets on the bridge, one arriving rapidly after the other. The second packet may have a VSS-Monitoring Ethernet trailer appended to the Ethernet frame.

In this instance, the proper resolution is to stop the external network device from reflecting traffic back to the hypervisor.

Workaround

The bridge device can be configured to act like a "hub" or "repeater" instead of a "switch".

With this configuration, the bridge will forward every packet to every port, instead of forwarding traffic for the VM to the wrong port.

This is achieved by changing each bridge's ageing parameter to 0 with the command:

brctl setageing <brname> 0

This change can be persisted across reboots by editing the /etc/sysconfig/network-scripts/ifcfg-<brname> file and adding the line:

BRIDGING_OPTS="ageing_time=0"
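As an illustration, a persistent configuration for a bridge named br0 (an example name and example addressing; adapt to your environment) might look like:

```shell
# /etc/sysconfig/network-scripts/ifcfg-br0  (example file; values are illustrative)
DEVICE=br0
TYPE=Bridge
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.1.10
NETMASK=255.255.255.0
BRIDGING_OPTS="ageing_time=0"
```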

An alternative workaround is to create static ARP entries on the Virtual Machine with arp -s. A given virtual machine needs static ARP entries for all network devices and systems it is expected to communicate with.
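As a sketch of that workaround, entries can be added at runtime with arp -s, or listed in /etc/ethers and loaded in one step with arp -f /etc/ethers. The addresses below are hypothetical examples, not values from this case:

```shell
# /etc/ethers -- example static MAC/IP entries for the guest (hypothetical values)
# Load with: arp -f /etc/ethers
52:54:00:aa:aa:bb  192.168.1.20    # the remote VM this guest must reach
00:12:34:00:00:01  192.168.1.1     # the default gateway
```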

There is a different known issue with HP blade systems and their Emulex network interfaces, caused by the NIC firmware and SR-IOV.

There is a different known issue where only multicast traffic fails between VMs, caused by a change in the bridge driver.

Root Cause

The bridge is a software implementation of a Layer 2 switch. Being a device which operates at Layer 2 of the OSI Model, it works on Data Link Layer addressing, or "MAC Addresses". A layer 2 switch follows a few simple rules:

  • If the destination MAC is the broadcast address (FF:FF:FF:FF:FF:FF), send the traffic out every switchport, except the switchport where it came from
  • If the source MAC is not in my Forwarding Database (aka Forwarding Information Base or MAC table) then add the source address and its source port to the FDB
  • If the destination MAC is in my Forwarding Database, forward the traffic out that switchport
  • If the destination MAC is not in my Forwarding Database, forward the traffic out every switchport
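The rules above can be sketched as a toy model in bash (illustrative port and MAC names only; this is not the kernel bridge implementation):

```shell
#!/bin/bash
# Toy model of the forwarding rules above -- not the kernel bridge code.
# The FDB maps a source MAC to the port it was learned on.
declare -A fdb

handle_frame() {    # handle_frame <ingress_port> <src_mac> <dst_mac>
    local in_port=$1 src=$2 dst=$3
    fdb[$src]=$in_port                                   # learn the source MAC
    if [[ $dst == "FF:FF:FF:FF:FF:FF" || -z ${fdb[$dst]} ]]; then
        echo "flood out every port except $in_port"      # broadcast or unknown
    else
        echo "forward out port ${fdb[$dst]}"             # known destination
    fi
}

handle_frame vnet0 52:54:00:aa:aa:aa FF:FF:FF:FF:FF:FF   # VM's ARP request
handle_frame eth0  00:34:56:00:00:00 52:54:00:aa:aa:aa   # reply from remote system
```

Running this prints a flood decision for the broadcast, then a unicast forward back out vnet0, matching the ideal learning sequence this section goes on to describe.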

A switch's forwarding database typically looks something like this:

 +------+-------------------+
 | Port | MAC Address       |
 +------+-------------------+
 | 1    | AB:CD:EF:00:01:02 |
 | 2    | FE:DC:BA:99:88:77 |
 +------+-------------------+

If that switch got traffic for MAC "AB:CD:EF:00:01:02" it would send the traffic out port 1, likewise traffic for MAC "FE:DC:BA:99:88:77" would go out port 2. If that switch got traffic for any other MAC, the traffic would be sent out all ports.

Consider the setup where there are two hypervisors, and a VM on each hypervisor:

VM -- vnet -- bridge -- eth ------------network------------ eth -- bridge -- vnet -- VM

With working traffic, a given bridge should have a FDB like:

 Port  | MAC Address
 ------+-------------
 vnetX | Local VM's eth0
 vnetY | Local VM's eth1
 ethX  | Remote VM's eth0
 ethX  | Remote VM's eth1

So let's look at how that FDB is ideally built:

  • The FDB is blank
  • The VM ARPs out with its own Source MAC, and a destination MAC of the broadcast
  • The bridge receives that ARP request, and adds the VM's MAC address to the FDB as "Local VM MAC reachable via vnetX"
  • The bridge forwards that ARP request out every port, except the vnetX port where it came from
  • The remote VM replies, with a destination MAC of the VM, and its own source MAC
  • The bridge receives that reply, and adds the remote VM's MAC address to the FDB as "Remote VM MAC reachable via ethX"
  • The bridge has this destination MAC in the FDB, so forwards the reply down the vnetX interface

However, that's not happening on the systems we have here. What IS happening is:

  • The FDB is blank
  • The VM ARPs out with its own Source MAC, and a destination MAC of the broadcast
  • The bridge receives that ARP request, and adds the VM's MAC address to the FDB as "this MAC reachable via vnetX"
  • The bridge forwards that ARP request out every port, except the vnetX port where it came from

Here's where we trip up:

  • The blade switch forwards the broadcast BACK to the bridge
  • The bridge has now received traffic, with the VM's MAC as the source MAC, coming in eth0
  • The bridge updates the FDB that "Local VM MAC reachable via eth0"

Now the reply traffic cannot work:

  • The remote VM replies, with a destination MAC of the VM, and its own source MAC
  • The bridge receives that reply, and adds the remote VM's MAC address to the FDB as "Remote VM MAC reachable via ethX"
  • The bridge has this destination MAC in the FDB (as per the three points above), so forwards the reply down the ethX interface
  • The Local VM never receives the reply.
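The failure sequence above can be condensed into the same toy-model terms (a bash sketch with illustrative names, not the kernel implementation):

```shell
#!/bin/bash
# Toy model of the failure above: the reflected broadcast overwrites the
# local VM's FDB entry, so the reply is forwarded out the physical port.
declare -A fdb
learn() { fdb[$2]=$1; }        # learn <ingress_port> <src_mac>

learn vnet0 52:54:00:aa:aa:aa  # 1. VM's ARP request is learned on vnet0
learn eth0  52:54:00:aa:aa:aa  # 2. the switch reflects the ARP back in eth0
learn eth0  00:34:56:00:00:00  # 3. the remote VM's reply arrives on eth0

# The reply's destination MAC now (wrongly) maps to the physical port:
echo "reply to 52:54:00:aa:aa:aa forwarded out ${fdb[52:54:00:aa:aa:aa]}"
```

The final line prints eth0 rather than vnet0, which is exactly why the local VM never sees the reply.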

We currently believe the network switch is sending broadcast traffic back to the interface where it came from.

Diagnostic Steps

The issue can be confirmed by viewing the MAC address table of the bridge when the issue is observed.

The following is "good" output: the virtual machine is reachable via the vnet interface:

# brctl showmacs br0
port no mac addr                is local?       ageing timer
  1     00:12:34:00:00:01       yes                0.00    ## physical ethernet interface
  1     00:34:56:00:00:00       no                62.43    ## another system elsewhere on the network
  2     fe:54:00:aa:aa:aa       yes                0.00    ## the vnet interface to the VM
  2     52:54:00:aa:aa:aa       no                45.21    ## the virtual machine reachable via the vnet port

The following is "bad" output: the virtual machine is reachable via the physical ethernet interface:

# brctl showmacs br0
port no mac addr                is local?       ageing timer
  1     00:12:34:00:00:01       yes                0.00
  1     00:34:56:00:00:00       no                62.43
  2     fe:54:00:aa:aa:aa       yes                0.00
  1     52:54:00:aa:aa:aa       no                45.21    ## the virtual machine now reachable via the physical port
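If the showmacs output is saved to a variable or file, the bad state can be spotted with a short awk filter. This is a hypothetical helper: it assumes the physical interface is bridge port 1 and that the guests use the default KVM OUI 52:54:00 (adjust both to your environment):

```shell
# Saved copy of the "bad" showmacs output above (for demonstration);
# on a live system, pipe "brctl showmacs br0" into the same awk filter.
brctl_output='port no mac addr                is local?       ageing timer
  1     00:12:34:00:00:01       yes                0.00
  2     fe:54:00:aa:aa:aa       yes                0.00
  1     52:54:00:aa:aa:aa       no                45.21'

# Flag non-local guest MACs (52:54:00 OUI) learned on the physical port (port 1):
echo "$brctl_output" | awk '$1 == 1 && $2 ~ /^52:54:00/ && $3 == "no" {print "reflected:", $2}'
```

Any line this prints is a guest MAC that the bridge currently believes lives outside the hypervisor.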

An instrumented kernel was built which monitors changes in the bridge's forwarding database. We can see here where the bridge thinks the local VM is reachable out vnetX, then almost straight away, it receives traffic which says the local VM is reachable via the ethX:

Mar 11 07:57:41 hypervisor kernel: br_fdb_update, dev = vnet1, mac = 52:54:00:aa:aa:aa, port_no = 2
Mar 11 07:57:41 hypervisor kernel: br_fdb_update, dev =  eth2, mac = 52:54:00:aa:aa:aa, port_no = 1

During another instance where this was seen, a packet capture was performed on the external interface in both incoming and outgoing directions:

Note: This uses a later version of tcpdump than the one shipped with RHEL 6, which allows filtering by packet direction. This matters because it lets us detect reflected packets by monitoring incoming traffic: if a packet with the VM's source MAC address arrives on the physical interface, this condition is confirmed.

# tcpdump -Q in -i bond1 -w bond1.in.pcap
# tcpdump -Q out -i bond1 -w bond1.out.pcap

Shows traffic leaving the hypervisor from the VM:

$ tshark -nr bond1.out.pcap "ip.src eq 10.0.0.8"
 68  15.709067 10.0.0.8  224.0.0.251   IGMPv2 50 Membership Report group 224.0.0.251
211  20.459975 10.0.0.8  224.0.0.251   IGMPv2 50 Membership Report group 224.0.0.251
223  24.125831 10.0.0.8  224.0.0.251   IGMPv2 50 Membership Report group 224.0.0.251
236  31.478999 10.0.0.8  224.0.0.251   IGMPv2 50 Membership Report group 224.0.0.251

And also traffic from the VM coming back into the hypervisor from the external network:

 $ tshark -nr bond1.in.pcap "ip.src eq 10.0.0.8"
 737  16.187322 10.0.0.8  224.0.0.251   IGMPv2 60 Membership Report group 224.0.0.251
 968  20.940013 10.0.0.8  224.0.0.251   IGMPv2 60 Membership Report group 224.0.0.251
1229  24.605441 10.0.0.8  224.0.0.251   IGMPv2 60 Membership Report group 224.0.0.251
1792  31.957237 10.0.0.8  224.0.0.251   IGMPv2 60 Membership Report group 224.0.0.251

This is incorrect: the VM's traffic cannot legitimately originate from outside the hypervisor. It must be returned by something external, most likely a network loop or a network device which reflects or broadcasts the traffic it receives.

On the hypervisor, bridge monitor fdb will show the VM's MAC being remapped to the external interface when the looped packet arrives:

# bridge monitor fdb
11:22:33:44:55:66 dev vnet0 master baremetal
11:22:33:44:55:66 dev bond0 master baremetal

Bridge monitoring can be started at boot (for instance, when a bastion host boots) with a simple systemd service:

cat > /usr/local/bin/bridge-monitor.sh <<EOF
#!/bin/bash
bridge monitor fdb
EOF
chmod +x /usr/local/bin/bridge-monitor.sh

cat > /etc/systemd/system/bridge-monitor.service <<EOF
[Unit]
Description=Bridge monitor
Wants=network-online.target
After=network-online.target

[Service]
Restart=on-failure
TimeoutStopSec=70
ExecStart=/usr/local/bin/bridge-monitor.sh

[Install]
WantedBy=multi-user.target default.target
EOF
systemctl daemon-reload
systemctl enable --now bridge-monitor.service