How are the values for the different policies in the "xmit_hash_policy" bonding parameter calculated?

Solution Verified - Updated

Environment

  • Red Hat Enterprise Linux
  • Bonding driver providing link aggregation in Mode 2 (balance-xor) or Mode 4 (802.3ad aka LACP) or Mode 5 (balance-tlb) or Mode 6 (balance-alb)

Issue

  • How are the values for the different policies in the xmit_hash_policy bonding parameter calculated?
  • We need to understand the practical implementation of the logic/math behind the load balancing algorithms.
  • How are the algorithms employed for each of the six policies: layer2, layer2+3, layer3+4, encap2+3, encap3+4 and vlan+srcmac?
  • What formula is used to compute the network bonding hashing policies?
  • What are the different hash policies in network bonding and how are they configured?

Resolution

Configuration

The xmit_hash_policy load balancing parameter can be used with mode=2, mode=4, mode=5 and mode=6. However, with mode=5 and mode=6 it is applied only if tlb_dynamic_lb=0 has been set.

For example, to configure bondX as mode=2 (balance-xor) with xmit_hash_policy=layer2+3:

### if using network service we can modify BONDING_OPTS in ifcfg-bondX to:
BONDING_OPTS="miimon=100 mode=2 xmit_hash_policy=layer2+3"

### if using NetworkManager we can use (bondX is the connection name):
# nmcli con modify bondX bond.options "miimon=100,mode=2,xmit_hash_policy=layer2+3"
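
After the bond is active, the policy in effect can be read back from sysfs. The following is a minimal sketch, assuming the bond is named bond0 and that the kernel exposes the value in /sys/class/net/bond0/bonding/xmit_hash_policy in the form "name number" (for example "layer2+3 2"). The helper names are illustrative only.

```python
from pathlib import Path


def parse_xmit_hash_policy(text):
    """Split a sysfs value such as 'layer2+3 2' into (name, number)."""
    name, number = text.split()
    return name, int(number)


def read_xmit_hash_policy(bond="bond0"):
    """Read the active policy of a bond from sysfs (the bond name is an assumption)."""
    path = Path("/sys/class/net") / bond / "bonding" / "xmit_hash_policy"
    return parse_xmit_hash_policy(path.read_text())


if __name__ == "__main__":
    try:
        print(read_xmit_hash_policy("bond0"))
    except OSError as err:
        # No such bond on this system, or the bonding module is not loaded.
        print(f"could not read policy: {err}")
```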

Complete configuration for bonding devices is discussed at:

layer2

The layer2 policy uses the XOR of source and destination MAC addresses and ethernet protocol type.

The calculation is:

  hash = source MAC XOR destination MAC XOR packet type ID
  slave number = hash modulo slave count
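
The calculation above can be sketched in a few lines of Python. This is an illustration only, not the kernel implementation: it follows the bond_eth_hash excerpt quoted under Diagnostic Steps in taking just the last byte of each MAC address, and the function names are hypothetical.

```python
def mac_last_byte(mac):
    """Return the last octet of a MAC address written as aa:bb:cc:dd:ee:ff."""
    return int(mac.split(":")[5], 16)


def layer2_hash(src_mac, dst_mac, ether_type):
    """XOR the last MAC octets with the packet type ID (EtherType)."""
    return mac_last_byte(src_mac) ^ mac_last_byte(dst_mac) ^ ether_type


def pick_slave(hash_value, slave_count):
    """Map a hash value onto a slave index."""
    return hash_value % slave_count


if __name__ == "__main__":
    # One IPv4 frame (packet type ID 0x0800) on a bond with two slaves.
    h = layer2_hash("00:1b:21:74:b6:39", "00:1a:22:12:34:59", 0x0800)
    print(hex(h), "-> slave", pick_slave(h, 2))
```

Because only the MAC addresses and the packet type feed the hash, every frame sent to the same peer (or to the same gateway MAC) yields the same slave index.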

This algorithm will place all traffic to a particular network peer on the same slave.

If network traffic is between this system and multiple other systems in the same broadcast domain, this is a good algorithm.

If network traffic is mostly between this system and multiple other systems behind a default gateway, another algorithm should be considered.

This algorithm is 802.3ad compliant.

This is the default policy if no configuration is provided.

layer2+3

The layer2+3 policy uses the XOR of source and destination MAC addresses and IP addresses.

The calculation is:

  hash = source MAC XOR destination MAC XOR packet type ID
  hash = hash XOR source IP XOR destination IP
  hash = hash XOR (hash RSHIFT 16)
  hash = hash XOR (hash RSHIFT 8)
  slave number = hash modulo slave count
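
As a rough sketch of these steps, assuming IPv4 addresses written in dotted-quad form and, as in the bond_eth_hash excerpt under Diagnostic Steps, only the last byte of each MAC address (the function name is hypothetical):

```python
import ipaddress


def layer23_hash(src_mac, dst_mac, ether_type, src_ip, dst_ip):
    """Layer 2 hash, mixed with the IPv4 addresses and folded down."""
    h = int(src_mac.split(":")[5], 16) ^ int(dst_mac.split(":")[5], 16) ^ ether_type
    h ^= int(ipaddress.IPv4Address(src_ip)) ^ int(ipaddress.IPv4Address(dst_ip))
    h ^= h >> 16   # fold the upper 16 bits into the lower 16 bits
    h ^= h >> 8    # fold again so the low byte depends on every input byte
    return h


if __name__ == "__main__":
    h = layer23_hash("00:1b:21:74:b6:39", "00:1e:c1:07:45:1a", 0x0800,
                     "169.254.92.64", "192.168.100.24")
    print(hex(h), "-> slave", h % 2)
```

The two folding steps matter when the slave count is small: without them only the lowest bits of the last address byte would influence the result.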

This algorithm will place all traffic to a particular IP address on the same slave.

If network traffic between this system and multiple other systems goes through a default gateway, this is a good algorithm.

If network traffic is mostly between this system and one other system, another algorithm should be considered.

For non-IP traffic, the formula is the same as for the layer2 transmit policy.

This algorithm is 802.3ad compliant.

layer3+4

The layer3+4 policy uses the XOR of source and destination ports and IP addresses.

The calculation is:

  hash = source port and destination port, concatenated as in the header (32 bits)
  hash = hash XOR source IP XOR destination IP
  hash = hash XOR (hash RSHIFT 16)
  hash = hash XOR (hash RSHIFT 8)
  hash = hash RSHIFT 1
  slave number = hash modulo slave count
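
The same steps as a short Python sketch, assuming an unfragmented IPv4 TCP or UDP packet so that both ports are available. The function name is hypothetical and the code is an illustration of the formula, not the kernel implementation.

```python
import ipaddress


def layer34_hash(src_ip, dst_ip, src_port, dst_port):
    """Ports concatenated as in the header, mixed with the IPv4 addresses."""
    h = (src_port << 16) | dst_port          # source port in the upper 16 bits
    h ^= int(ipaddress.IPv4Address(src_ip)) ^ int(ipaddress.IPv4Address(dst_ip))
    h ^= h >> 16
    h ^= h >> 8
    return h >> 1                            # drop the lowest bit


if __name__ == "__main__":
    h = layer34_hash("169.254.92.64", "192.168.100.24", 12243, 42424)
    print(hex(h), "-> slave", h % 2)
```

The final right shift discards the lowest bit, which the kernel source comments as a way to deal with the common pattern of even port numbers.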

If network traffic between this system and another system uses the same IPs but multiple ports, this algorithm is a good choice.

For non-IP traffic, the formula is the same as for the layer2 transmit policy.

This algorithm is not 802.3ad compliant.

For fragmented TCP or UDP packets and all other IP protocol traffic, the source and destination port information is omitted. This policy is intended to mimic the behavior of certain switches, notably Cisco switches with PFC2 as well as some Foundry and IBM products.

A single TCP or UDP conversation containing both fragmented and unfragmented packets may see traffic balanced across two interfaces, which may result in out-of-order delivery. Most traffic types will not meet these criteria, as TCP rarely fragments traffic, and most UDP traffic is not involved in extended conversations. Other implementations of 802.3ad may or may not tolerate this noncompliance.

encap2+3

This policy uses the same formula as layer2+3 but it relies on skb_flow_dissect to obtain the header fields which might result in the use of inner headers if an encapsulation protocol is used.

This will improve the performance for tunnel users because the packets will be distributed according to the encapsulated flows.

encap3+4

This policy uses the same formula as layer3+4 but it relies on skb_flow_dissect to obtain the header fields which might result in the use of inner headers if an encapsulation protocol is used.

This will improve the performance for tunnel users because the packets will be distributed according to the encapsulated flows.

vlan+srcmac

The vlan+srcmac policy uses the XOR of vlan ID and source MAC vendor and source MAC dev.

The calculation is:

  hash = (vlan ID) XOR (source MAC vendor) XOR (source MAC dev)
  slave number = hash modulo slave count
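
A minimal sketch of this calculation, following the bond_vlan_srcmac_hash excerpt quoted under Diagnostic Steps: the first three bytes of the source MAC form the vendor part, the last three the device part, and the VLAN ID is mixed in only when a tag is present. The function name and the use of None for "no VLAN tag" are illustrative choices.

```python
def vlan_srcmac_hash(src_mac, vlan_id=None):
    """XOR of the vendor half and device half of the source MAC, plus the VLAN ID."""
    octets = bytes(int(part, 16) for part in src_mac.split(":"))
    vendor = int.from_bytes(octets[:3], "big")   # first three bytes (OUI)
    dev = int.from_bytes(octets[3:], "big")      # last three bytes
    h = vendor ^ dev
    if vlan_id is not None:                      # untagged frames skip the VLAN ID
        h ^= vlan_id
    return h


if __name__ == "__main__":
    for vlan in (100, 101):
        h = vlan_srcmac_hash("00:1b:21:74:b6:39", vlan)
        print("VLAN", vlan, hex(h), "-> slave", h % 2)
```

Neither the destination nor any IP or port information is used, so all traffic from one source MAC on one VLAN stays on one slave.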

This policy uses a very rudimentary vlan ID and source mac hash to load-balance traffic per-vlan, with failover should one leg fail.

The intended use case is for a bond shared by multiple virtual machines, all configured to use their own vlan, to give lacp-like functionality without requiring lacp-capable switching hardware.

This feature is available from RHEL 8.4 or kernel-4.18.0-305.el8 onwards.

Single Stream

For traffic where the primary use is a single large Layer 4 stream, such as a single NFS mount, or single iSCSI target/initiator, or other persistent single TCP/UDP connection, this traffic cannot be load balanced.

If a single persistent stream is required to go faster, faster network interfaces and network infrastructure must be used.
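
The reason is visible in the hash itself: a single stream always presents the same addresses and ports, so it always produces the same slave index. The sketch below illustrates this with the layer3+4 formula from this article, assuming IPv4 and two slaves. The function names and the sample port range are hypothetical.

```python
import ipaddress
from collections import Counter


def layer34_slave(src_ip, dst_ip, src_port, dst_port, slave_count):
    """Slave index chosen by the layer3+4 formula for one flow."""
    h = (src_port << 16) | dst_port
    h ^= int(ipaddress.IPv4Address(src_ip)) ^ int(ipaddress.IPv4Address(dst_ip))
    h ^= h >> 16
    h ^= h >> 8
    return (h >> 1) % slave_count


def distribution(flows, slave_count):
    """Count how many packets land on each slave."""
    return Counter(layer34_slave(*flow, slave_count) for flow in flows)


if __name__ == "__main__":
    # 1000 packets of one stream versus one packet each of 1000 streams.
    single = [("169.254.92.64", "192.168.100.24", 12243, 42424)] * 1000
    many = [("169.254.92.64", "192.168.100.24", port, 42424)
            for port in range(12000, 13000)]
    print("single stream:", dict(distribution(single, 2)))
    print("many streams: ", dict(distribution(many, 2)))
```

Only the second case can spread across both slaves, because only there does an input to the hash change from packet to packet.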

Diagnostic Steps

The relevant code that deals with the hash policies is:

5.14.0-284.11.1.el9/drivers/net/bonding/bond_main.c

The following xmit policies are available:
#define BOND_XMIT_POLICY_LAYER2         0 /* layer 2 (MAC only), default */
#define BOND_XMIT_POLICY_LAYER34        1 /* layer 3+4 (IP ^ (TCP || UDP)) */
#define BOND_XMIT_POLICY_LAYER23        2 /* layer 2+3 (IP ^ MAC) */
#define BOND_XMIT_POLICY_ENCAP23        3 /* encapsulated layer 2+3 */
#define BOND_XMIT_POLICY_ENCAP34        4 /* encapsulated layer 3+4 */
#define BOND_XMIT_POLICY_VLAN_SRCMAC    5 /* vlan + source MAC */

/**
 * bond_xmit_hash - generate a hash value based on the xmit policy
 * @bond: bonding device
 * @skb: buffer to use for headers
 *
 * This function will extract the necessary headers from the skb buffer and use
 * them to generate a hash based on the xmit_policy set in the bonding device
 */
u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
{
        if (bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP34 &&
            skb->l4_hash)
                return skb->hash;

        return __bond_xmit_hash(bond, skb, skb->data, skb->protocol,
                                skb_mac_offset(skb), skb_network_offset(skb),
                                skb_headlen(skb));
}


/* Generate hash based on xmit policy. If @skb is given it is used to linearize
 * the data as required, but this function can be used without it if the data is
 * known to be linear (e.g. with xdp_buff).
 */
static u32 __bond_xmit_hash(struct bonding *bond, struct sk_buff *skb, const void *data,
                            __be16 l2_proto, int mhoff, int nhoff, int hlen)
{
        struct flow_keys flow;
        u32 hash;

        if (bond->params.xmit_policy == BOND_XMIT_POLICY_VLAN_SRCMAC)
                return bond_vlan_srcmac_hash(skb, data, mhoff, hlen);

        if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER2 ||
            !bond_flow_dissect(bond, skb, data, l2_proto, nhoff, hlen, &flow))
                return bond_eth_hash(skb, data, mhoff, hlen);

        if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER23 ||
            bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP23) {
                hash = bond_eth_hash(skb, data, mhoff, hlen);
        } else {
                if (flow.icmp.id)
                        memcpy(&hash, &flow.icmp, sizeof(hash));
                else
                        memcpy(&hash, &flow.ports.ports, sizeof(hash));
        }

        return bond_ip_hash(hash, &flow, bond->params.xmit_policy);
}


/* L2 hash helper */
static inline u32 bond_eth_hash(struct sk_buff *skb, const void *data, int mhoff, int hlen)
{
        struct ethhdr *ep;

        data = bond_pull_data(skb, data, hlen, mhoff + sizeof(struct ethhdr));
        if (!data)
                return 0;

        ep = (struct ethhdr *)(data + mhoff);
        return ep->h_dest[5] ^ ep->h_source[5] ^ ep->h_proto;
}

static u32 bond_ip_hash(u32 hash, struct flow_keys *flow, int xmit_policy)
{
        hash ^= (__force u32)flow_get_u32_dst(flow) ^
                (__force u32)flow_get_u32_src(flow);
        hash ^= (hash >> 16);
        hash ^= (hash >> 8);

        /* discard lowest hash bit to deal with the common even ports pattern */
        if (xmit_policy == BOND_XMIT_POLICY_LAYER34 ||
                xmit_policy == BOND_XMIT_POLICY_ENCAP34)
                return hash >> 1;

        return hash;
}

static u32 bond_vlan_srcmac_hash(struct sk_buff *skb, const void *data, int mhoff, int hlen)
{
        struct ethhdr *mac_hdr;
        u32 srcmac_vendor = 0, srcmac_dev = 0;
        u16 vlan;
        int i;

        data = bond_pull_data(skb, data, hlen, mhoff + sizeof(struct ethhdr));
        if (!data)
                return 0;
        mac_hdr = (struct ethhdr *)(data + mhoff);

        for (i = 0; i < 3; i++)
                srcmac_vendor = (srcmac_vendor << 8) | mac_hdr->h_source[i];

        for (i = 3; i < ETH_ALEN; i++)
                srcmac_dev = (srcmac_dev << 8) | mac_hdr->h_source[i];

        if (!skb_vlan_tag_present(skb))
                return srcmac_vendor ^ srcmac_dev;

        vlan = skb_vlan_tag_get(skb);

        return vlan ^ srcmac_vendor ^ srcmac_dev;
}

Here, the flow structure (struct flow_keys), filled in by bond_flow_dissect, carries the packet header information that was actually parsed, such as the IP addresses and ports.

The BOND_XMIT_POLICY_ENCAP23 and BOND_XMIT_POLICY_ENCAP34 policies work like the normal layer2+3 and layer3+4 xmit policies, but they can parse an encapsulated packet and take the header fields used for hashing from the inner headers.
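
One detail of the excerpt is worth keeping in mind when comparing it with a hand calculation: the helpers XOR the header fields as they are stored in the packet (network byte order) and do not convert them to host byte order first. On a little-endian host the intermediate values can therefore differ from a calculation done on the human-readable hexadecimal form. The sketch below illustrates this for the layer3+4 formula, assuming IPv4. The function names are hypothetical and the code is not a reproduction of kernel behaviour.

```python
import socket
import struct


def as_host_u32(packed, byteorder):
    """Interpret four network-order bytes as a u32 on a host of the given byte order."""
    return int.from_bytes(packed, byteorder)


def layer34_hash(src_ip, dst_ip, src_port, dst_port, byteorder):
    """layer3+4 formula applied to the raw header bytes."""
    ports = struct.pack("!HH", src_port, dst_port)   # bytes as they appear on the wire
    h = as_host_u32(ports, byteorder)
    h ^= as_host_u32(socket.inet_aton(src_ip), byteorder)
    h ^= as_host_u32(socket.inet_aton(dst_ip), byteorder)
    h ^= h >> 16
    h ^= h >> 8
    return h >> 1


if __name__ == "__main__":
    flow = ("169.254.92.64", "192.168.100.24", 12243, 42424)
    print("human-readable order:", hex(layer34_hash(*flow, "big")))
    print("little-endian host:  ", hex(layer34_hash(*flow, "little")))
```

The algorithm is the same in both cases; only the numeric interpretation of the inputs changes. A hand calculation on the readable form is enough to understand the policy, but it may not predict the exact slave chosen on a given machine.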

The following shows the hash computation used to select the interface for sending data, for each xmit_hash_policy:

Assumed Topology
----------------
Server

    bond0
    MAC: 00:1b:21:74:b6:39
    IP : 169.254.92.64 = 0xA9FE5C40
    UDP: 12243         = 0x2FD3
    packet ID          = 0x0800   (considering IPv4)

    NIC_Count = 2
    NIC0 assigned # value: 0
    NIC1 assigned # value: 1

Destination

    Client1
        MAC: 00:1a:22:12:34:59
        IP : 192.168.1.10  = 0xC0A8010A
        UDP: 42424         = 0xA5B8

    Client2
        MAC: 00:1e:c1:07:45:1A
        IP : 192.168.100.24 = 0xC0A86418
        UDP: 42424          = 0xA5B8
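
The hexadecimal operands used in the calculations that follow can be derived from the dotted-quad and decimal forms. A small sketch using Python's standard ipaddress module (the helper names are illustrative):

```python
import ipaddress


def ip_hex(addr):
    """Format an IPv4 address as the 32-bit hexadecimal value used in the formulas."""
    return f"0x{int(ipaddress.IPv4Address(addr)):08X}"


def port_hex(port):
    """Format a TCP or UDP port as a 16-bit hexadecimal value."""
    return f"0x{port:04X}"


if __name__ == "__main__":
    print("server IP  :", ip_hex("169.254.92.64"))
    print("client2 IP :", ip_hex("192.168.100.24"))
    print("server port:", port_hex(12243))
    print("client port:", port_hex(42424))
```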


Policy Behaviour
----------------
1. layer2:

        Hash = ( SRC_MAC[5] ^ DST_MAC[5] ^ packet ID ) % NIC_Count

        Server --> Client1
        Hash = ((0x0039 ^ 0x0059) ^ 0x0800) % 2 = 0 ---> send packet through NIC0

        Server --> Client2
        Hash = ((0x0039 ^ 0x001A) ^ 0x0800)  % 2 = 1 ---> send packet through NIC1

2. layer2+3:

              hash = source MAC XOR destination MAC XOR packet type ID
              hash = hash XOR source IP XOR destination IP
              hash = hash XOR (hash RSHIFT 16)
              hash = hash XOR (hash RSHIFT 8)
              slave number = hash modulo slave count
              
              Server --> Client1
              
              hash = ((0x0039 ^ 0x0059) ^ 0x0800) = 0x0860
              hash =  0x0860  ^ ( 0xA9FE5C40 ^ 0xC0A8010A ) = 0x6956552A
              hash =  0x6956552A ^ (0x6956552A >> 16) = 0x69563C7C
              hash =  0x69563C7C ^ (0x69563C7C >> 8)  = 0x693F6A40
              slave number = 0x693F6A40 % 2 = 0 ---> send packet through NIC0

              Server --> Client2
              
              hash = ((0x0039 ^ 0x001A) ^ 0x0800) = 0x0823
              hash =  0x0823  ^ ( 0xA9FE5C40 ^ 0xC0A86418 )  = 0x6956307B
              hash =  0x6956307B ^ (0x6956307B >> 16) = 0x6956592D
              hash =  0x6956592D ^ (0x6956592D >> 8)  = 0x693F0F74
              slave number = 0x693F0F74 % 2 = 0 ---> send packet through NIC0

3. layer3+4:

              hash = source port and destination port, concatenated as in the header (32 bits)
              hash = hash XOR source IP XOR destination IP
              hash = hash XOR (hash RSHIFT 16)
              hash = hash XOR (hash RSHIFT 8)
              hash = hash RSHIFT 1
              slave number = hash modulo slave count

              Server --> Client1

              hash = (0x2FD3 , 0xA5B8) = 0x2FD3A5B8
              hash = 0x2FD3A5B8 ^ ( 0xA9FE5C40 ^ 0xC0A8010A ) = 0x4685F8F2
              hash = 0x4685F8F2 ^ (0x4685F8F2 >> 16) = 0x4685BE77
              hash = 0x4685BE77 ^ (0x4685BE77 >> 8)  = 0x46C33BC9
              hash = 0x46C33BC9 >> 1 = 0x23619DE4
              slave number = 0x23619DE4 % 2 = 0 ---> send packet through NIC0 

              Server --> Client2

              hash = (0x2FD3 , 0xA5B8) = 0x2FD3A5B8 
              hash = 0x2FD3A5B8 ^ ( 0xA9FE5C40 ^ 0xC0A86418 ) = 0x46859DE0
              hash = 0x46859DE0 ^ (0x46859DE0 >> 16) = 0x4685DB65
              hash = 0x4685DB65 ^ (0x4685DB65 >> 8)  = 0x46C35EBE
              hash = 0x46C35EBE >> 1 = 0x2361AF5F
              slave number = 0x2361AF5F % 2 = 1 ---> send packet through NIC1  


4. vlan+srcmac

    Consider a bond that has VLAN interfaces with VLAN IDs 100 and 101

              hash = (vlan ID) XOR (source MAC vendor) XOR (source MAC dev)
              slave number = hash modulo slave count

              Server with VLAN 100 (0x64) --> Client1

              hash = 0x64 ^ 0x001B21 ^ 0x74B639 = 0x74AD7C
              slave number = 0x74AD7C % 2 = 0 ---> send packet through NIC0

              Server with VLAN 101 (0x65) --> Client1

              hash = 0x65 ^ 0x001B21 ^ 0x74B639 = 0x74AD7D
              slave number = 0x74AD7D % 2 = 1 ---> send packet through NIC1
