OpenShift Data Foundation - Multus prerequisite validation tool

Version support

This article is supported for OpenShift Data Foundation versions 4.14 through 4.19. For version 4.20 and above, please refer to official OpenShift Data Foundation documentation.

Background

Multus CNI is a container network interface that provides a pluggable application programming interface to configure network interfaces in Linux containers. It is considered a meta-plug-in: a CNI plug-in that can run other CNI plug-ins. This tool should be used to validate the OpenShift configuration, NetworkAttachmentDefinitions, and underlying network compatibility before installing ODF.

This is an interactive tool that supports in-the-field debugging and resolution of common configuration issues that affect Multus clusters. It runs a validation test that determines whether the current NetworkAttachmentDefinitions and system configuration will support OpenShift Data Foundation with Multus.

It is a fairly long-running test. It starts up a web server and many clients to verify that Multus network communication works properly.

It does not perform any load testing. Networks that cannot support high volumes of Ceph traffic may still encounter runtime issues. This may be particularly noticeable under high I/O load or during OSD rebalancing (e.g., after a node or disk failure, or during an ODF upgrade). Therefore, we still recommend network load testing to ensure that the base network configuration meets user requirements. To learn more about OSD rebalancing, refer to the Ceph documentation at docs.ceph.com.

Usage

Run this tool from an OpenShift administrator shell after Multus NetworkAttachmentDefinitions are configured and the ODF operator is installed but before an ODF StorageCluster is installed.

Download the odf-multus-validation-tool archive attached to this article. The tool provides extensive help text. The most up-to-date usage information is available by running the tool:

$ ./rook multus validation run -h

The tool supports a configuration file that sets the number of test daemons on different types of nodes. The config file can be used to test CSI+Ceph+OSD placement on storage nodes while simultaneously testing only CSI placement on non-storage nodes. The tool has built-in config file examples with commented documentation to help you get started more quickly.

$ ./rook multus validation config -h

By default, the test creates 95 pods per node. If the NAD network does not have enough IP addresses for them, the test fails with "error adding container to network "your_NAD_name": error at storage engine: Could not allocate IP in range:". The following example runs the Multus validation tool with a single NAD configured for the cluster network and a reduced number of test pods:

$ ./rook multus validation run --cluster-network your_NAD_name --daemons-per-node 5
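
Before choosing a pod count, it can help to estimate how many addresses a run would request. The following is a minimal sketch, not part of the tool. It assumes that every test pod consumes one address from the NAD range and that the range is a plain IPv4 subnet. The helper names, the node count (6), and the /24 prefix are hypothetical values for illustration, and the IPAM plug-in in use may reserve additional addresses.

```shell
#!/bin/sh
# Rough IP budget for a validation run.
# Assumption: each test pod takes exactly one address from the NAD range.

# required_ips NODES PODS_PER_NODE -> addresses the test would request
required_ips() {
  echo $(( $1 * $2 ))
}

# usable_ips PREFIX_LEN -> host addresses in an IPv4 subnet of that prefix
# length, excluding the network and broadcast addresses
usable_ips() {
  echo $(( (1 << (32 - $1)) - 2 ))
}

# Hypothetical cluster: 6 nodes and a /24 NAD range
nodes=6
prefix=24
for pods in 95 5; do
  need=$(required_ips "$nodes" "$pods")
  have=$(usable_ips "$prefix")
  if [ "$need" -le "$have" ]; then
    verdict="fits"
  else
    verdict="does not fit"
  fi
  echo "$pods pods/node on $nodes nodes: need $need, /$prefix offers $have: $verdict"
done
```

With these example values, 95 pods per node would request 570 addresses against 254 available, while 5 pods per node would request 30.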

Troubleshooting

If the test fails, it will suggest some things to check based on the failure condition. This will help resolve common configuration issues quickly.

General troubleshooting

To diagnose issues with the tool itself, add the --log-level DEBUG flag to get additional details.

If pods aren’t starting

  • Problem with NAD configuration
  • oc describe pod usually shows the error as an event
  • Some NICs don’t support enough virtual ports/VLANs for the additional MAC addresses
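
The pod checks above can be sketched as a few shell helpers around standard oc commands. This is an illustrative sketch only: the helper names are hypothetical, and the namespace argument must be whichever namespace the validation test pods run in (this article does not state it, so the openshift-storage value in the usage comments is an assumption).

```shell
#!/bin/sh
# Hypothetical helpers for inspecting test pods that fail to start.
# The first argument is always the namespace holding the test pods.

# List pods that are not in the Running phase, with node placement.
list_unready_pods() {
  oc get pods -n "$1" --field-selector='status.phase!=Running' -o wide
}

# Show the full description of one pod; failures to attach the pod to a
# Multus network normally appear in the Events section at the end.
describe_pod() {
  oc describe pod -n "$1" "$2"
}

# Show namespace events, oldest first, to see errors across all test pods.
recent_events() {
  oc get events -n "$1" --sort-by=.lastTimestamp
}

# Usage (namespace and pod name are placeholders):
#   list_unready_pods openshift-storage
#   describe_pod openshift-storage <pod-name>
#   recent_events openshift-storage
```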

If pods are starting but not communicating

  • Network design/configuration may contain errors
  • Network switch may be blocking sub-interface MAC addresses/IPs (check promiscuous mode settings)
  • Switch firewalling
  • System or NAD Linux networking configurations may be blocking traffic
  • An SOS report is the next place to look for useful information
    • ip_netns_exec_*_address_* and ip_netns_exec_*_route_* files capture addresses and routes from the container network namespaces
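
To locate those files in an extracted SOS report, a search by file name is usually enough. The sketch below assumes the report has already been unpacked into a directory. The function name and the example path are hypothetical, and the search covers the whole tree rather than assuming where the files are stored.

```shell
#!/bin/sh
# list_netns_files SOS_DIR
# Print the per-namespace address and route captures found anywhere under
# an extracted SOS report directory, sorted by path.
list_netns_files() {
  find "$1" -type f \
    \( -name 'ip_netns_exec_*_address_*' -o -name 'ip_netns_exec_*_route_*' \) \
    | sort
}

# Usage (the path is a placeholder for your extracted report):
#   list_netns_files ./sosreport-node1 | while read -r f; do
#     echo "== $f"; cat "$f"
#   done
```

Comparing the address and route output of a test pod that communicates with one that does not is often the quickest way to spot a missing route or an address outside the expected range.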

Getting help

If you cannot resolve the issue with the troubleshooting steps above, collect the information described below before asking for help.

The tool will list resources that should be collected into an archive file in this case. For OpenShift and ODF, the following items should contain the information needed for further debugging:

  • Network diagrams
  • ODF must-gather
  • OCP must-gather
  • OCP must-gather network logs (how-to)
  • SOS reports from all nodes on which test pods are running (https://access.redhat.com/solutions/3530881)

Multus is highly integrated among various product layers, and issues are most likely to be related to configuration or the physical hardware environment rather than product bugs. Initial investigation at the engineering layer will involve ODF, OpenShift, and possibly RHEL networking experts to rule out configuration or environment.

  • General Multus bugs/help: OCP Bugs Jira: https://issues.redhat.com/projects/OCPBUGSM
  • ODF Multus bugs: ODF Bugzilla: https://bugzilla.redhat.com/
  • Follow the customer escalation process if required