How to deploy NVIDIA GRID drivers on an OpenShift cluster

Solution Unverified - Updated

Environment

  • OpenShift Container Platform 4.10
  • OpenShift Virtualization 4.10

Issue

  • I want to use NVIDIA vGPUs with OpenShift Virtualization

Resolution

Important: The steps below must be performed in the order given. Following them in a different order can leave the system in a non-working state.

Host preparation

  • Verify that the virtualization extension and the IOMMU extension (Intel VT-d or AMD IOMMU) are enabled in the BIOS.

Adding kernel arguments to enable the IOMMU

  • Create the following MachineConfig object to add the IOMMU kernel arguments to the hosts:
kind: MachineConfig
apiVersion: machineconfiguration.openshift.io/v1
metadata:
  name: 100-worker-iommu
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - intel_iommu=on   # use amd_iommu=on on AMD hosts
    - iommu=pt
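
  • Apply the MachineConfig and, once the Machine Config Operator has rebooted the workers, verify the kernel command line. The commands below are a sketch; the file name and node name are examples:

oc create -f 100-worker-iommu.yaml

# Wait for the worker pool to pick up the change
oc wait mcp/worker --for=condition=Updated --timeout=30m

# Spot-check the kernel command line on one node
oc debug node/worker-0 -- chroot /host cat /proc/cmdline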

Download the NVIDIA GRID Host Driver

  • Download the NVIDIA vGPU host driver package (for example, NVIDIA-Linux-x86_64-470.82-vgpu-kvm.run) from the NVIDIA Licensing Portal.

Building and Deploying the NVIDIA GRID driver with a container

Requirements

  • Make sure you have access to a private registry. It will be needed to hold the driver container image built below.
    • Note: Do not use a public registry, as redistribution of this image is forbidden.
  • Use the repository referenced by this article to build and deploy the NVIDIA driver container.
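
For orientation, the driver container build typically looks like the sketch below: the Driver Toolkit image serves as the base, and the NVIDIA installer binary is copied in and later run against the node's kernel. This is illustrative only; the authoritative Dockerfile is in the referenced repository:

# Illustrative sketch -- the real Dockerfile lives in the referenced repository.
# Base image: the Driver Toolkit matching the cluster version (see next section).
FROM quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:[example sha256]

ARG NVIDIA_INSTALLER_BINARY
ENV NVIDIA_INSTALLER_BINARY=${NVIDIA_INSTALLER_BINARY:-NVIDIA-Linux-x86_64-470.82-vgpu-kvm.run}

# The vGPU host driver .run file must sit next to the Dockerfile; it cannot be
# redistributed, which is why the resulting image must stay in a private registry.
COPY ${NVIDIA_INSTALLER_BINARY} /root/${NVIDIA_INSTALLER_BINARY}
RUN chmod +x /root/${NVIDIA_INSTALLER_BINARY}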

Obtaining the driver toolkit base image

  • Get a reference to the OpenShift Driver Toolkit image that matches the target cluster version. This image contains the kernel packages needed to build the NVIDIA GRID driver.
$ oc adm release info --image-for=driver-toolkit

quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:[example sha256]

Note: The driver-toolkit image changes with every OpenShift version. The driver container image must be rebuilt after each OpenShift upgrade.
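
If you need the toolkit image for a release other than the one the cluster is currently running, the release payload can be queried directly; the version number below is an example:

# Resolve the driver-toolkit image for an explicit OpenShift release
oc adm release info 4.10.20 --image-for=driver-toolkit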

Configure

podman build --build-arg NVIDIA_INSTALLER_BINARY=NVIDIA-Linux-x86_64-470.82-vgpu-kvm.run -t [registry_url]/ocp-nvidia-vgpu-installer .
podman push [registry_url]/ocp-nvidia-vgpu-installer

Deployment

  • Update the 1000-drivercontainer.yaml to point to the container image.
  • Use a node label to deploy the driver only on the relevant nodes:
oc label nodes worker-0 hasGpu=true
  • Update the 1000-drivercontainer.yaml with the chosen label.
    • A more advanced configuration is possible if the Node Feature Discovery Operator is deployed in the cluster. To target all nodes that have an NVIDIA GPU card installed and are schedulable by KubeVirt, update the node selector to:
nodeSelector:
    feature.node.kubernetes.io/pci-0302_10de.present: "true"
    kubevirt.io/schedulable: "true"
  • Apply the 1000-drivercontainer.yaml to the cluster:
oc create -f 1000-drivercontainer.yaml
  • Validate that the simple-kmod-driver-container-XXXXX pod is in the Running state:
oc get pods -A | grep "simple-kmod-driver-container"
  • Validate that mdev_bus exists on the nodes with a GPU (replace worker-node-XYZ with your node name):
oc debug node/worker-node-XYZ
...

Starting pod/worker-node-XYZ-debug
To use host binaries, run `chroot /host`
Pod IP: 10.1.156.17
If you don't see a command prompt, try pressing enter.

sh-4.4# ls /sys/class | grep mdev_bus
mdev_bus
  • After completing these steps, follow the official documentation to use the newly created vGPU devices with your VMs.
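
Putting the steps above together, the relevant parts of 1000-drivercontainer.yaml end up looking roughly like the sketch below. The kind, names, and image are assumptions inferred from this article; the file in the referenced repository is authoritative:

# Sketch of the driver container workload -- names and image are illustrative
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: simple-kmod-driver-container
spec:
  selector:
    matchLabels:
      app: simple-kmod-driver-container
  template:
    metadata:
      labels:
        app: simple-kmod-driver-container
    spec:
      nodeSelector:
        hasGpu: "true"          # the label applied with 'oc label nodes'
      containers:
        - name: driver
          image: "[registry_url]/ocp-nvidia-vgpu-installer"
          securityContext:
            privileged: true    # required to load kernel modules on the host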

Root Cause

  • Configuring NVIDIA vGPUs requires the NVIDIA GRID drivers.
  • NVIDIA does not provide an Operator that delivers the GRID drivers.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.