Workaround for model deployment failure when using hardware profiles

Solution Verified

Environment

OpenShift AI 2.23
OpenShift AI 2.24

Issue

Model deployments that use hardware profiles fail because the Red Hat OpenShift AI Operator does not inject the tolerations, nodeSelector, or identifiers from the hardware profile into the underlying InferenceService when the InferenceService resource is created manually. As a result, the model deployment pods cannot be scheduled to suitable nodes and the deployment never reaches a ready state. Workbenches that use the same hardware profile deploy successfully.
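
To check whether a deployment is affected, you can inspect the scheduling fields on the InferenceService. The following is a minimal sketch, not part of the original solution: the function name show_isvc_scheduling is hypothetical, and the field paths are the same ones that the script in the Resolution section patches. Empty output for a field suggests that nothing was injected for it.

```shell
#!/bin/bash
# Hypothetical helper: print the scheduling-related fields that are
# currently set on an InferenceService. Assumes that you are logged in with oc.
show_isvc_scheduling() {
  local name="$1" namespace="$2"
  local field

  # nodeSelector and tolerations are set directly on the predictor.
  for field in nodeSelector tolerations; do
    printf '%s: ' "${field}"
    oc get inferenceservice "${name}" -n "${namespace}" \
      -o jsonpath="{.spec.predictor.${field}}"
    printf '\n'
  done

  # Resource requests and limits are set on the predictor model.
  printf 'resources: '
  oc get inferenceservice "${name}" -n "${namespace}" \
    -o jsonpath='{.spec.predictor.model.resources}'
  printf '\n'
}
```

For example, run show_isvc_scheduling with the InferenceService name and namespace as arguments.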

Resolution

To work around this issue, run the following script to manually inject the tolerations, nodeSelector, and identifiers from the hardware profile into the underlying InferenceService. The script requires the oc and jq command-line tools.

Replace the HARDWARE_PROFILE_NAME, HARDWARE_PROFILE_NAMESPACE, ISVC_NAME, and ISVC_NAMESPACE values for your environment.

#!/bin/bash
# This script manually injects a HardwareProfile nodeSelector, tolerations, and identifiers into an InferenceService at the correct path.

set -e

HARDWARE_PROFILE_NAME="<HardwareProfile .metadata.name>"
HARDWARE_PROFILE_NAMESPACE="<HardwareProfile .metadata.namespace>"
ISVC_NAME="<InferenceService .metadata.name>"
ISVC_NAMESPACE="<InferenceService .metadata.namespace>"

# Extract nodeSelector from HardwareProfile
NODE_SELECTOR=$(oc get hardwareprofiles.infrastructure.opendatahub.io "${HARDWARE_PROFILE_NAME}" -n "${HARDWARE_PROFILE_NAMESPACE}" \
  -o jsonpath='{.spec.scheduling.node.nodeSelector}')

# Extract tolerations from HardwareProfile
TOLERATIONS=$(oc get hardwareprofiles.infrastructure.opendatahub.io "${HARDWARE_PROFILE_NAME}" -n "${HARDWARE_PROFILE_NAMESPACE}" \
  -o jsonpath='{.spec.scheduling.node.tolerations}')

# Extract identifiers (resources) from HardwareProfile
IDENTIFIERS=$(oc get hardwareprofiles.infrastructure.opendatahub.io "${HARDWARE_PROFILE_NAME}" -n "${HARDWARE_PROFILE_NAMESPACE}" \
  -o jsonpath='{.spec.identifiers}')

# Patch the InferenceService with the nodeSelector, if the HardwareProfile defines one
if [ -n "${NODE_SELECTOR}" ] && [ "${NODE_SELECTOR}" != "{}" ]; then
  oc patch inferenceservice "${ISVC_NAME}" -n "${ISVC_NAMESPACE}" --type=merge -p "{\"spec\":{\"predictor\":{\"nodeSelector\":${NODE_SELECTOR}}}}"
fi

# Patch the InferenceService with the tolerations, if the HardwareProfile defines any
if [ -n "${TOLERATIONS}" ] && [ "${TOLERATIONS}" != "null" ]; then
  oc patch inferenceservice "${ISVC_NAME}" -n "${ISVC_NAMESPACE}" --type=merge -p "{\"spec\":{\"predictor\":{\"tolerations\":${TOLERATIONS}}}}"
fi

# Patch the InferenceService with resources built from the identifiers, if the HardwareProfile defines any
if [ -n "${IDENTIFIERS}" ] && [ "${IDENTIFIERS}" != "null" ] && [ "${IDENTIFIERS}" != "[]" ]; then
  # Map each identifier's defaultCount to a resource request and,
  # when maxCount is specified, its maxCount to a resource limit.
  RESOURCES_PATCH=$(printf '%s' "${IDENTIFIERS}" | jq -c '{
    spec: {predictor: {model: {resources: {
      requests: (map({key: .identifier, value: (.defaultCount | tostring)}) | from_entries),
      limits: (map(select(.maxCount != null) | {key: .identifier, value: (.maxCount | tostring)}) | from_entries)
    }}}}
  }')

  oc patch inferenceservice "${ISVC_NAME}" -n "${ISVC_NAMESPACE}" --type=merge -p "${RESOURCES_PATCH}"
fi
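
After the script completes, you might want to confirm that the InferenceService becomes ready and that its pods are scheduled to the expected node. The following is a sketch under stated assumptions, not part of the original solution: the function name verify_isvc_ready and the 300s default timeout are hypothetical, and the pod selector assumes that KServe labels predictor pods with serving.kserve.io/inferenceservice.

```shell
#!/bin/bash
# Hypothetical helper: wait for a patched InferenceService to become ready,
# then show where its pods were scheduled.
verify_isvc_ready() {
  local name="$1" namespace="$2" timeout="${3:-300s}"

  # Block until the InferenceService reports the Ready condition or the timeout expires.
  oc wait --for=condition=Ready "inferenceservice/${name}" \
    -n "${namespace}" --timeout="${timeout}"

  # List the predictor pods together with the node that each one runs on.
  # Assumption: KServe sets the serving.kserve.io/inferenceservice label on these pods.
  oc get pods -n "${namespace}" \
    -l "serving.kserve.io/inferenceservice=${name}" -o wide
}
```

For example, run verify_isvc_ready with the same values that you used for ISVC_NAME and ISVC_NAMESPACE.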

Root Cause

The Red Hat OpenShift AI Operator does not inject the tolerations, nodeSelector, or identifiers from the hardware profile into the underlying InferenceService when the InferenceService resource is created manually.

Diagnostic Steps

  1. Create a hardware profile that defines a nodeSelector and tolerations.
  2. Apply the corresponding label and taint to the GPU node so that the node matches the hardware profile.
  3. Deploy the model and verify whether a GPU is used. The model deployment fails and reports that no nodes are available.
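
Steps 2 and 3 above can be sketched with oc commands as follows. This is an illustrative sketch, not part of the original solution: the function names are hypothetical, the node name, label, and taint are placeholders that you supply, and the pod selector assumes that KServe labels predictor pods with serving.kserve.io/inferenceservice.

```shell
#!/bin/bash
# Step 2 (hypothetical helper): label and taint the GPU node so that it
# matches the nodeSelector and tolerations in the hardware profile.
# label has the form key=value; taint has the form key=value:effect.
prepare_gpu_node() {
  local node="$1" label="$2" taint="$3"
  oc label node "${node}" "${label}" --overwrite
  oc adm taint nodes "${node}" "${taint}" --overwrite
}

# Step 3 (hypothetical helper): show scheduling failures in the namespace
# and the predictor pods with the nodes that they are assigned to.
show_scheduling_failures() {
  local isvc="$1" namespace="$2"
  oc get events -n "${namespace}" --field-selector reason=FailedScheduling
  oc get pods -n "${namespace}" \
    -l "serving.kserve.io/inferenceservice=${isvc}" -o wide
}
```

On an affected deployment, the predictor pod is expected to stay in the Pending state with no node assigned.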

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.