How to deploy NVIDIA GRID drivers on an OpenShift cluster
Environment
- OpenShift Container Platform 4.10
- OpenShift Virtualization 4.10
Issue
- I want to use NVIDIA vGPUs with OpenShift Virtualization
Resolution
Important: The steps below must be performed in the order given. Performing them in a different order might leave the system in a non-functional state.
Host preparation
- Verify that the virtualization extension and the IOMMU extension (Intel VT-d or AMD IOMMU) are enabled in the BIOS.
Adding kernel arguments to enable the IOMMU
- Create the following MachineConfig object to add the IOMMU options to the host kernel arguments:
kind: MachineConfig
apiVersion: machineconfiguration.openshift.io/v1
metadata:
  name: 100-worker-iommu
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - intel_iommu=on  # use amd_iommu=on instead on AMD hosts
    - iommu=pt
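Once the MachineConfig has been rolled out (the Machine Config Operator normally reboots the affected worker nodes to apply kernel arguments), it can be useful to confirm that the arguments actually reached the kernel command line. The following is a minimal sketch, not part of the original procedure: `has_iommu_args` is a hypothetical helper name, and it only inspects a command-line string passed to it.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if the given kernel command line contains
# an IOMMU enable flag (intel_iommu=on or amd_iommu=on) AND iommu=pt;
# return 1 otherwise.
has_iommu_args() {
    # Pad with spaces so that each argument can be matched as a whole word.
    cmdline=" $1 "
    case "$cmdline" in
        *" intel_iommu=on "*|*" amd_iommu=on "*) ;;
        *) return 1 ;;
    esac
    case "$cmdline" in
        *" iommu=pt "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On a node (for example from an `oc debug node/...` session), the helper could be called as `has_iommu_args "$(cat /proc/cmdline)" && echo "IOMMU arguments present"`.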
Download the NVIDIA GRID Host Driver
- The generic GRID Linux installer for vGPU should be obtained from the NVIDIA Licensing portal. Please follow the instructions provided by NVIDIA to obtain the GRID driver.
- This example refers to the vGPU 13.0 GRID driver and uses the NVIDIA-Linux-x86_64-470.82-vgpu-kvm.run binary as the driver installer.
Building and Deploying the NVIDIA GRID driver with a container
Requirements
- Make sure you have access to a private registry. It is needed to hold the driver container image that you build.
- Note: Do not use a public registry, because redistributing this image is forbidden.
- Please use the following repository to build and deploy the NVIDIA driver container.
Obtaining the driver toolkit base image
- Get a reference to an image of the OpenShift Driver-Container Toolkit that matches the target cluster version. This image contains the kernel sources needed to build the NVIDIA GRID driver.
$ oc adm release info --image-for=driver-toolkit
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:[example sha256]
Note: The driver-toolkit image may change with every version of OpenShift. The driver container image must be rebuilt against the new driver-toolkit image after each OpenShift upgrade.
Configure
- Update the FROM in the Dockerfile with the relevant driver toolkit image.
- Build the container image and push it to your private registry:
podman build --build-arg NVIDIA_INSTALLER_BINARY=NVIDIA-Linux-x86_64-470.82-vgpu-kvm.run -t [registry_url]/ocp-nvidia-vgpu-installer .
podman push [registry_url]/ocp-nvidia-vgpu-installer
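Because the driver toolkit image changes between OpenShift versions, the FROM update may be worth scripting. The sketch below is illustrative only and makes assumptions about a Dockerfile whose layout is not shown here: `set_base_image` is a hypothetical helper that rewrites the first `FROM` line of a file and drops anything else on that line (such as an `AS <stage>` suffix). Check the result against the actual Dockerfile before building.

```shell
#!/bin/sh
# Hypothetical helper: replace the image on the first FROM line of a
# Dockerfile with the given image reference. Later FROM lines and all
# other lines are left unchanged.
set_base_image() {
    dockerfile=$1
    image=$2
    tmp=$(mktemp) || return 1
    # Write the edited copy to a temporary file, then move it into place.
    awk -v img="$image" '
        !replaced && /^FROM[ \t]/ { print "FROM " img; replaced = 1; next }
        { print }
    ' "$dockerfile" > "$tmp" && mv "$tmp" "$dockerfile"
}
```

A possible invocation, reusing the command shown earlier, is `set_base_image Dockerfile "$(oc adm release info --image-for=driver-toolkit)"`.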
Deployment
- Update the 1000-drivercontainer.yaml to point to the container image.
- Use a specific node label to deploy the driver only on the nodes where it should be installed:
oc label nodes worker-0 hasGpu=true
- Update the 1000-drivercontainer.yaml with the chosen label.
- A more advanced configuration is possible if the Node Feature Discovery Operator is deployed in your cluster. To target all nodes that have NVIDIA GPU cards installed and are running KubeVirt, update the node selector as follows (label values are strings, so they must be quoted):
nodeSelector:
  feature.node.kubernetes.io/pci-0302_10de.present: "true"
  kubevirt.io/schedulable: "true"
- Apply the 1000-drivercontainer.yaml to the cluster:
oc create -f 1000-drivercontainer.yaml
- Validate that the simple-kmod-driver-container-XXXXX pod is in the Running state:
oc get pods -A | grep "simple-kmod-driver-container"
- Validate that mdev_bus exists on the nodes with the GPU (replace worker-node-XYZ with your node name):
oc debug node/worker-node-XYZ
...
Starting pod/worker-node-XYZ-debug
To use host binaries, run `chroot /host`
Pod IP: 10.1.156.17
If you don't see a command prompt, try pressing enter.
sh-4.4# ls /sys/class | grep mdev_bus
mdev_bus
- After completing the steps, please follow the official documentation for using the newly created vGPU devices with your VM.
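The mdev_bus check above only confirms that the class directory exists. As a hedged extension, the sketch below also lists the mediated device types that each parent device advertises, relying on the standard kernel sysfs layout `/sys/class/mdev_bus/<device>/mdev_supported_types/`. The helper name `list_mdev_types` and its optional argument (an alternative sysfs class root, mainly useful for testing) are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "<parent device> <mdev type>" for every
# mediated device type found under <sysfs class root>/mdev_bus.
# Returns 1 if mdev_bus does not exist at all.
list_mdev_types() {
    sysfs_class=${1:-/sys/class}
    [ -d "$sysfs_class/mdev_bus" ] || return 1
    for dev in "$sysfs_class"/mdev_bus/*; do
        # Skip entries without a supported-types directory
        # (this also covers an unmatched glob).
        [ -d "$dev/mdev_supported_types" ] || continue
        for t in "$dev"/mdev_supported_types/*; do
            [ -e "$t" ] || continue
            echo "$(basename "$dev") $(basename "$t")"
        done
    done
    return 0
}
```

Run inside the node debug shell, `list_mdev_types` with no argument reads from `/sys/class`. An empty listing with a zero exit status would mean that mdev_bus exists but no parent device reports supported types.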
Root Cause
- Configuring NVIDIA vGPUs requires the NVIDIA GRID drivers.
- NVIDIA does not provide an Operator to deliver the GRID drivers.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.