OpenShift update stuck after migrating to OVN-Kubernetes with Trident CSI

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform
    • 4.12.x
    • 4.13.x
    • 4.14.x
    • 4.15.x
    • 4.16.x

Issue

  • After migrating an OpenShift cluster from OpenShift SDN to OVN-Kubernetes, the cluster update process has become stuck.
  • The Trident CSI driver can no longer establish connections.
  • Error message: connection.go:173: Still connecting to unix:///plugin/csi.sock
  • This error occurs on all nodes, preventing the use of PVCs and PVs.

Resolution

Below are three potential solutions and links to their respective implementations. Select the one that best suits your situation.

  1. Configure OVN to use kernel routing table:

Change how OVN-Kubernetes routes egress traffic so that it is handled directly by the node's kernel routing table instead of through the gateway router interface. This is achieved by setting the routingViaHost and ipForwarding fields under .spec.defaultNetwork.ovnKubernetesConfig.gatewayConfig in the Cluster Network Operator configuration.

This setting is intended for highly specialized installations and applications that rely on manually configured routes in the kernel routing table. You won't receive the performance benefits of OVN's offloading and routing features, because egress traffic is processed by the host networking stack.

Refer to this specific configuration in this article.
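As a sketch, the gateway configuration can be changed with a merge patch such as the following; `ipForwarding: Global` is only available on newer 4.14+ releases, so verify the field is supported on your version before applying:

```shell
# Route egress traffic through the node's kernel routing table
# ("local gateway" mode) instead of the OVN gateway router interface.
oc patch network.operator.openshift.io cluster --type=merge \
  -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"gatewayConfig":{"routingViaHost":true,"ipForwarding":"Global"}}}}}'

# Confirm the setting was accepted by the Cluster Network Operator:
oc get network.operator.openshift.io cluster \
  -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.gatewayConfig}'
```

The Cluster Network Operator rolls the change out to the ovnkube pods; expect a brief disruption while they restart.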

  2. Use Multus for additional interfaces in the Trident controller:

Consider using Multus to create additional network interfaces (network-attachment-definitions) connected to the NetApp network for storage-related traffic. These interfaces are managed by Multus, bypassing the OVN network, similar to trident-node-linux pods that use hostNetwork.

An example of this configuration can be found in this article.

More information on Multus and multiple networks can be found here.
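One common pattern, shown here purely as a hypothetical sketch, is a macvlan NetworkAttachmentDefinition bound to the storage NIC; the interface name (`ens224`), the namespace, and the CIDR are placeholders you must replace with your environment's values:

```shell
# Hypothetical example: a macvlan attachment on the dedicated storage NIC,
# using the whereabouts IPAM plugin for address assignment.
cat <<'EOF' | oc apply -f -
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: storage-net
  namespace: trident
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens224",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.50.0/24"
      }
    }
EOF
```

The Trident controller pod is then attached to this network with the annotation `k8s.v1.cni.cncf.io/networks: trident/storage-net`, so storage traffic bypasses the OVN cluster network.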

  3. Set the Trident controller to use hostNetwork:

With this option, the Trident controller is attached to the node's local network interfaces, just like the pods of the trident-node-linux DaemonSet deployed on the nodes.

The implementation of this solution can be found in this article.
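A minimal sketch of this change is a patch on the controller Deployment; note that if the trident-operator manages the Deployment it may revert a direct patch, so check your Trident version's documentation for the supported way to set this:

```shell
# Hypothetical sketch: run the Trident controller on the host network.
# dnsPolicy must be adjusted so cluster DNS still resolves.
oc patch deployment trident-controller -n trident --type=merge \
  -p '{"spec":{"template":{"spec":{"hostNetwork":true,"dnsPolicy":"ClusterFirstWithHostNet"}}}}'
```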

Root Cause

Migrating to OVN changes the cluster network and routing within OCP and nodes. OVN creates an L3 gateway router on each node, routing all cluster network egress traffic through this gateway interface. Consequently, other interfaces and OS routes are not involved in the traffic.

This issue can arise if the storage network is accessible only over a non-primary interface, such as when nodes have a dedicated network interface for Layer 2 storage networks (e.g., NFS), sometimes configured using the nmstate operator.
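To confirm this is the situation on an affected node, you can ask the kernel which interface it would use to reach the storage backend; the node name and backend IP below are placeholders:

```shell
# From a debug shell on the node, check the kernel route to the
# storage backend (replace <node-name> and the example IP).
oc debug node/<node-name> -- chroot /host ip route get 192.168.50.10
```

If the reply shows the dedicated storage interface, traffic that OVN sends out via the gateway router instead will never reach the backend, matching the timeouts seen in the Trident logs.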

Diagnostic Steps

  1. Confirm Trident pods status:
    Check the status of the Trident pods to ensure they are running properly.

       $ oc get pods -n trident
    
       NAME                                 READY   STATUS             RESTARTS   AGE
       trident-controller-8d88f76f7-24g7k   5/6     CrashLoopBackOff   23         2h
       trident-node-linux-2tr99             1/2     Running            0          2h
       trident-node-linux-4t4f7             1/2     Running            0          2h
       trident-node-linux-5rtbt             1/2     Running            0          2h
       trident-node-linux-9qbsf             1/2     Running            0          2h
       trident-node-linux-mtlp9             1/2     Running            0          2h
       trident-node-linux-p27xz             1/2     Running            0          2h
       trident-node-linux-t2z5s             1/2     Running            0          2h
       trident-node-linux-w6k5c             1/2     Running            0          2h
       trident-node-linux-z98vx             1/2     Running            0          2h
       trident-operator-6856db8579-pvm2p    1/1     Running            0          2h
    
  2. Verify Trident node logs:
    Check the logs of the trident-node-<xxx> containers to confirm they cannot communicate with the Trident controller.

       $ oc logs trident-node-linux-4t4f7 -n trident -c trident-main | grep error | tail -3
    
       2023-11-08T13:06:17.143663936Z time="2023-11-08T13:06:17Z" level=warning msg="Could not update Trident controller with node registration, will retry." error="could not add CSI node" increment=2m11.637589421s logLayer=csi_frontend requestID=<request id> requestSource=Internal workflow="plugin=activate"
       2023-11-08T13:08:58.782541791Z time="2023-11-08T13:08:58Z" level=warning msg="Could not update Trident controller with node registration, will retry." error="could not log into the Trident CSI Controller: error communicating with Trident CSI Controller; Put \"https://<ip>:34571/trident/v1/node/<node>\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" increment=1m51.912536847s logLayer=csi_frontend requestID=<request id> requestSource=Internal workflow="plugin=activate"
    
  3. Confirm Trident controller initialization:
    Check the logs of the trident-controller pod to verify that the storage driver cannot be initialized due to connectivity errors.

    2023-11-08T13:06:35.835478689Z time="2023-11-08T13:06:35Z" level=error msg="Could not initialize storage driver." error="error initializing ontap-nas driver: could not create Data ONTAP API client: error creating ONTAP API client: error reading SVM details: Post \"https://<ip>/servlets/netapp.servlets.admin.XMLrequest_filer\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)" logLayer=core requestID=<request id> requestSource=Internal workflow="core=bootstrap"
    

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.