How to deploy NVIDIA GRID drivers on an OpenShift cluster

Solution Unverified - Updated

Environment

  • OpenShift Container Platform 4.10
  • OpenShift Virtualization 4.10

Issue

  • I want to use NVIDIA vGPUs with OpenShift Virtualization

Resolution

Important: The steps below must be performed in the order given. Following them in a different order can leave the system in a non-working state.

Host preparation

  • Verify that the virtualization extension and the IOMMU extension (Intel VT-d or AMD IOMMU) are enabled in the BIOS.

Adding kernel arguments to enable the IOMMU

  • Create the following MachineConfig object to add the IOMMU kernel arguments to the hosts:
kind: MachineConfig
apiVersion: machineconfiguration.openshift.io/v1
metadata:
  name: 100-worker-iommu
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - intel_iommu=on   # use amd_iommu=on on AMD hosts
    - iommu=pt
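
  • Apply the MachineConfig and, once the Machine Config Operator has rebooted the workers, verify the kernel command line. The commands below are a sketch; the file name and node name are examples:

oc create -f 100-worker-iommu.yaml

# Wait for the worker pool to pick up the change
oc wait mcp/worker --for=condition=Updated --timeout=30m

# Spot-check the kernel command line on one node
oc debug node/worker-0 -- chroot /host cat /proc/cmdline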

Download the NVIDIA GRID Host Driver

  • Download the NVIDIA vGPU host driver package (for example, NVIDIA-Linux-x86_64-470.82-vgpu-kvm.run) from the NVIDIA Licensing Portal.

Building and Deploying the NVIDIA GRID driver with a container

Requirements

  • Make sure you have access to a private registry. It will be needed to hold the driver container image built below.
    • Note: Do not use a public registry, as redistribution of this image is forbidden.
  • Use the repository referenced by this article to build and deploy the NVIDIA driver container.
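
For orientation, the driver container build typically looks like the sketch below: the Driver Toolkit image serves as the base, and the NVIDIA installer binary is copied in and later run against the node's kernel. This is illustrative only; the authoritative Dockerfile is in the referenced repository:

# Illustrative sketch -- the real Dockerfile lives in the referenced repository.
# Base image: the Driver Toolkit matching the cluster version (see next section).
FROM quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:[example sha256]

ARG NVIDIA_INSTALLER_BINARY
ENV NVIDIA_INSTALLER_BINARY=${NVIDIA_INSTALLER_BINARY:-NVIDIA-Linux-x86_64-470.82-vgpu-kvm.run}

# The vGPU host driver .run file must sit next to the Dockerfile; it cannot be
# redistributed, which is why the resulting image must stay in a private registry.
COPY ${NVIDIA_INSTALLER_BINARY} /root/${NVIDIA_INSTALLER_BINARY}
RUN chmod +x /root/${NVIDIA_INSTALLER_BINARY}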

Obtaining the driver toolkit base image

  • Get a reference to the OpenShift Driver Toolkit image that matches the target cluster version. This image contains the kernel packages needed to build the NVIDIA GRID driver.
$ oc adm release info --image-for=driver-toolkit

quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:[example sha256]

Note: The driver-toolkit image changes with every OpenShift version. The driver container image must be rebuilt after each OpenShift upgrade.
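
If you need the toolkit image for a release other than the one the cluster is currently running, the release payload can be queried directly; the version number below is an example:

# Resolve the driver-toolkit image for an explicit OpenShift release
oc adm release info 4.10.20 --image-for=driver-toolkit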

Configure

podman build --build-arg NVIDIA_INSTALLER_BINARY=NVIDIA-Linux-x86_64-470.82-vgpu-kvm.run -t [registry_url]/ocp-nvidia-vgpu-installer .
podman push [registry_url]/ocp-nvidia-vgpu-installer

Deployment

  • Update the 1000-drivercontainer.yaml to point to the container image.
  • Use a node label to deploy the driver only on the relevant nodes:
oc label nodes worker-0 hasGpu=true
  • Update the 1000-drivercontainer.yaml with the chosen label.
    • A more advanced configuration is possible if the Node Feature Discovery Operator is deployed in the cluster. To target all nodes that have an NVIDIA GPU card installed and are schedulable by KubeVirt, update the node selector to:
nodeSelector:
    feature.node.kubernetes.io/pci-0302_10de.present: "true"
    kubevirt.io/schedulable: "true"
  • Apply the 1000-drivercontainer.yaml to the cluster:
oc create -f 1000-drivercontainer.yaml
  • Validate that the simple-kmod-driver-container-XXXXX pod is in the Running state:
oc get pods -A | grep "simple-kmod-driver-container"
  • Validate that mdev_bus exists on the nodes with a GPU (replace worker-node-XYZ with your node name):
oc debug node/worker-node-XYZ
...

Starting pod/worker-node-XYZ-debug
To use host binaries, run `chroot /host`
Pod IP: 10.1.156.17
If you don't see a command prompt, try pressing enter.

sh-4.4# ls /sys/class | grep mdev_bus
mdev_bus
  • After completing these steps, follow the official documentation to use the newly created vGPU devices with your VMs.
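
Putting the steps above together, the relevant parts of 1000-drivercontainer.yaml end up looking roughly like the sketch below. The kind, names, and image are assumptions inferred from this article; the file in the referenced repository is authoritative:

# Sketch of the driver container workload -- names and image are illustrative
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: simple-kmod-driver-container
spec:
  selector:
    matchLabels:
      app: simple-kmod-driver-container
  template:
    metadata:
      labels:
        app: simple-kmod-driver-container
    spec:
      nodeSelector:
        hasGpu: "true"          # the label applied with 'oc label nodes'
      containers:
        - name: driver
          image: "[registry_url]/ocp-nvidia-vgpu-installer"
          securityContext:
            privileged: true    # required to load kernel modules on the host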

Root Cause

  • Configuring NVIDIA vGPUs requires the NVIDIA GRID drivers.
  • NVIDIA does not provide an Operator that delivers the GRID drivers.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.