Migrating from vLLM InferenceService to LLMInferenceService on Red Hat OpenShift AI

Updated

1. Overview

This article describes how to migrate existing vLLM-based InferenceService deployments to the new LLMInferenceService resource on Red Hat OpenShift AI.

LLMInferenceService (serving.kserve.io/v1alpha2, shortName: llmisvc) is a purpose-built custom resource for deploying and managing Large Language Model (LLM) workloads. It replaces the pattern of using InferenceService with a ServingRuntime for vLLM deployments, providing first-class support for LLM-specific concerns such as model URI schemes, parallelism strategies, disaggregated prefill/decode serving, and LLM-aware request scheduling.

Important: InferenceService deployments remain fully supported. This migration is only relevant for InferenceService deployments using vLLM serving runtimes.

Note: This guide applies to Red Hat OpenShift AI 3.4 and later. Earlier versions have different behavior for the args field on LLMInferenceService.

This article covers:

  • Key differences between InferenceService and LLMInferenceService
  • A before/after YAML comparison
  • Step-by-step field mapping
  • The migration procedure (zero-downtime and minimal options)
  • How the LLMInferenceServiceConfig replaces ServingRuntime
  • How to select accelerator-specific configurations
  • Known issues and considerations

2. Prerequisites

Before creating an LLMInferenceService, ensure your cluster meets the prerequisites for deploying LLMs with llm-d on Red Hat OpenShift AI. See the official documentation.

3. Key Differences at a Glance

The following table compares key differences between the InferenceService and LLMInferenceService custom resources.

ConceptInferenceServiceLLMInferenceService
API versionserving.kserve.io/v1beta1serving.kserve.io/v1alpha2
Kind / shortNameInferenceService / isvcLLMInferenceService / llmisvc
Model locationspec.predictor.model.storageUrispec.model.uri
Model formatspec.predictor.model.modelFormat.nameNot needed (LLM-only)
Runtime templateServingRuntimeLLMInferenceServiceConfig via spec.baseRefs
Container imageFrom ServingRuntimeFrom LLMInferenceServiceConfig defaults
vLLM argumentsspec.predictor.model.argsContainer args on the main container
Static replicasspec.predictor.minReplicasspec.replicas
AutoscalingminReplicas / maxReplicas / scaleTarget / scaleMetric (Horizontal Pod Autoscaler (HPA) or Kubernetes Event-Driven Autoscaling (KEDA))spec.scaling with vLLM Workload Autoscaler (WVA) layer on top of HPA or KEDA
NetworkingKnative serverless or RawDeploymentspec.router (Gateway API)
AuthenticationOff by default (opt-in)On by default
Multi-nodespec.predictor.workerSpecspec.parallelism + spec.worker (LeaderWorkerSet (LWS))
Disaggregated prefill/decodeNot supportedspec.prefill section
LoRA adaptersNot first-classspec.model.lora
Canary trafficcanaryTrafficPercentGateway API traffic splitting
Logger / Batcherspec.predictor.logger, batcherNot in LLMInferenceService

4. Before and After

The following examples demonstrate what your YAML will look like before and after the migration.

4.1 InferenceService (before)

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen2-7b
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    model:
      modelFormat:
        name: vLLM
      storageUri: hf://Qwen/Qwen2.5-7B-Instruct
      args:
        - --backend
        - vllm
        - --max-model-len
        - "4096"
      resources:
        limits:
          nvidia.com/gpu: "1"
          memory: 32Gi
        requests:
          nvidia.com/gpu: "1"
          memory: 16Gi

4.2 LLMInferenceService (after)

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: qwen2-7b
spec:
  model:
    uri: hf://Qwen/Qwen2.5-7B-Instruct
  replicas: 3
  router:
    scheduler: {}
    route: {}
    gateway: {}
  template:
    containers:
      - name: main
        args:
          - --max-model-len
          - "4096"
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: 32Gi
          requests:
            nvidia.com/gpu: "1"
            memory: 16Gi

4.3 What changed

InferenceServiceLLMInferenceServiceWhy
modelFormat.name: vLLMRemovedLLMInferenceService is LLM-only; no format selection needed
--backend vllmRemovedThe default config template already runs vllm serve
--max-model-len 4096Pass via container argsvLLM flags are now passed as container args, not model args
storageUrimodel.uriSame URI schemes supported (hf://, s3://, pvc://)
No image specifiedNo image specifiedBoth inherit from their respective template (ServingRuntime vs LLMInferenceServiceConfig)
minReplicas: 1, maxReplicas: 3replicas: 3Static replicas; for autoscaling use spec.scaling instead
Knative networkingrouter: { scheduler: {}, route: {}, gateway: {} }Gateway API with LLM-aware request load balancing

4.4 Associated ServingRuntime

With InferenceService, the container image, command, and runtime configuration are defined in a separate ServingRuntime resource referenced by model format. Here's a typical vLLM ServingRuntime that would accompany the InferenceService in 4.1:

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-cuda-runtime
spec:
  supportedModelFormats:
    - name: vLLM
      autoSelect: true
  containers:
    - name: kserve-container
      image: quay.io/modh/vllm:rhoai-3.4
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - --port=8080
        - --model=/mnt/models
      ports:
        - containerPort: 8080
          protocol: TCP

With LLMInferenceService, this role is filled by LLMInferenceServiceConfig. The well-known configs (e.g. kserve-config-llm-template) ship with the Red Hat OpenShift AI installation and are auto-applied by the controller based on deployment pattern. You do not need to create or maintain your own equivalent of this ServingRuntime unless you have custom requirements - see Section 7: LLMInferenceServiceConfig: Replacing ServingRuntime.

Mapping:

ServingRuntime fieldLLMInferenceServiceConfig equivalent
spec.containers[].imagespec.template.containers[name=main].image
spec.containers[].commandspec.template.containers[name=main].command
spec.containers[].argsspec.template.containers[name=main].args
spec.supportedModelFormatsNot needed - LLMInferenceService is LLM-only

5. ServingRuntime vs LLMInferenceServiceConfig

If you are coming from InferenceService, you are used to a deployment model where the container image, entrypoint command, and runtime configuration live in a separate ServingRuntime resource. Your InferenceService then references it indirectly through the modelFormat field. Understanding how LLMInferenceServiceConfig replaces this pattern is key to a smooth migration.

5.1 How ServingRuntime works with InferenceService

With InferenceService, you create and maintain a ServingRuntime that defines:

  • The container image (e.g. quay.io/modh/vllm:rhoai-3.4)
  • The entrypoint command and default arguments
  • The supported model formats (e.g. vLLM, huggingface)
  • Port configuration

Your InferenceService selects a ServingRuntime by matching on modelFormat.name. The controller finds the ServingRuntime whose supportedModelFormats matches, and uses it to build the inference pod. You are responsible for creating, versioning, and updating the ServingRuntime - including keeping the vLLM image up to date.

5.2 How LLMInferenceServiceConfig works with LLMInferenceService

With LLMInferenceService, the equivalent of a ServingRuntime is an LLMInferenceServiceConfig. The key differences are:

  • Pre-installed defaults: Well-known configs ship with the Red Hat OpenShift AI installation and are maintained by the platform. You do not need to create or version your own runtime template for standard deployments.
  • Automatic selection: The controller selects the appropriate config based on your deployment pattern (single-node, multi-node, disaggregated) rather than matching on a model format field.
  • Reusable presets: Configs can be shared globally when created in the RHOAI system namespace (redhat-ods-applications), or scoped to a specific namespace when created in a user namespace.
  • Explicit override via baseRefs: If you need a non-default config (e.g. a custom image or accelerator-specific settings), you reference it by name through spec.baseRefs rather than through model format matching.
  • Layered merging: Multiple configs can be composed together. The controller merges well-known defaults, then baseRefs entries in order, then your LLMInferenceService spec on top. See Section 8 for details.

5.3 Side-by-side comparison

AspectServingRuntimeLLMInferenceServiceConfig
Who creates itYou create and maintain itShips with Red Hat OpenShift AI; you only create one for custom requirements
How it's selectedImplicitly via modelFormat matchingAutomatically by deployment pattern, or explicitly via baseRefs
ScopeNamespace-scoped (ServingRuntime)Namespace-scoped, or global when in the RHOAI system namespace
Image managementYou are responsible for updating the imagePlatform-managed defaults; override via baseRefs if needed
ComposabilityOne ServingRuntime per deploymentMultiple configs merged in order (defaults + baseRefs + service spec)
What it containsContainer image, command, args, ports, supported formatsSame as LLMInferenceService spec (except model is optional)
Model format fieldRequired (supportedModelFormats)Not needed - LLMInferenceService is LLM-only

5.4 What this means for your migration

For most migrations, you do not need to create an LLMInferenceServiceConfig at all. The platform-managed defaults handle the standard vLLM deployment. Your migration steps are:

  • Delete your ServingRuntime (or stop maintaining it) - the well-known configs replace it.
  • Remove the modelFormat field from your deployment YAML - format matching is no longer used.
  • If you were using a custom image, command, or environment variables in your ServingRuntime, create an LLMInferenceServiceConfig with those overrides and reference it via baseRefs. See Section 8 for creating custom configs.
  • If you were using a standard ServingRuntime provided by Red Hat OpenShift AI (e.g. the default vLLM runtime), no action is needed - the well-known config is the equivalent.

Tip: To see which LLMInferenceServiceConfig resources are available in your namespace, run:

oc get llminferenceserviceconfigs

6. Step-by-Step Migration

The following steps walk through each custom resource field that needs to change when translating your InferenceService YAML to an LLMInferenceService YAML. Work through these to produce your new YAML file, then follow Section 7: Migration Procedure to deploy it.

6.1 Change API version and Kind

Before:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService

After:

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService

Both v1alpha1 and v1alpha2 are served. Use v1alpha2 for new deployments.

6.2 Migrate model specification

Before:

spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      storageUri: hf://Qwen/Qwen2.5-7B-Instruct

After:

spec:
  model:
    uri: hf://Qwen/Qwen2.5-7B-Instruct
    name: Qwen/Qwen2.5-7B-Instruct  # defaults to metadata.name if omitted

Drop the modelFormat field entirely. The model.name field controls the model name in request parameters (e.g. the model field in OpenAI-compatible API calls).

6.3 Drop ServingRuntime references

There is no servingRuntime field on LLMInferenceService. The controller automatically applies well-known LLMInferenceServiceConfig templates based on the deployment pattern (single-node, multi-node, disaggregated). To use a non-default config, see Selecting an Accelerator Configuration.

6.4 Migrate static replicas

Before:

spec:
  predictor:
    minReplicas: 1    # when minReplicas == maxReplicas (static)
    maxReplicas: 1

After:

spec:
  replicas: 3

6.5 Migrate autoscaling (Technology Preview)

Important: Autoscaling on LLMInferenceService (via WVA) is a Technology Preview feature in Red Hat OpenShift AI 3.4. If you have production autoscaling requirements, consider remaining on InferenceService until WVA reaches General Availability.

Before:

spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    scaleTarget: 70
    scaleMetric: cpu

After:

spec:
  scaling:
    minReplicas: 1
    maxReplicas: 5
    wva:
      hpa: {}    # or keda: {} for KEDA-based scaling

Note: replicas and scaling are mutually exclusive - the API server rejects resources with both set.

6.6 Migrate container resources

Tip: When translating, carry over all metadata.annotations from your InferenceService to your LLMInferenceService. Some annotations are consumed by the Red Hat OpenShift AI operator to inject additional configuration (e.g. hardware profile resource requests), and preserving them keeps those behaviors working.

Before:

spec:
  predictor:
    model:
      resources:
        limits:
          nvidia.com/gpu: "1"

After:

spec:
  template:
    containers:
      - name: main    # must be "main", not "kserve-container"
        resources:
          limits:
            nvidia.com/gpu: "1"

The container name must be main. The well-known configs and controller use this name for matching during strategic merge.

6.7 Migrate runtime arguments

The base LLMInferenceServiceConfig template places the vLLM launch logic (including default flags like --served-model-name, --port, and Transport Layer Security (TLS) configuration) in the container's command field. The args field is left empty, so any arguments you specify on your LLMInferenceService append to the vLLM command rather than replacing the defaults.

Before:

spec:
  predictor:
    model:
      args:
        - --backend
        - vllm
        - --max-model-len
        - "4096"

After:

spec:
  template:
    containers:
      - name: main
        args:
          - --max-model-len
          - "4096"
  • Remove --backend vllm - the base config template already runs vllm serve
  • User-specified args are forwarded to vLLM as positional parameters via the bash wrapper in the base template
  • If the same flag appears in both the base template defaults and your args, your value takes precedence (standard vLLM CLI behavior)

Note: The VLLM_ADDITIONAL_ARGS environment variable is also supported as an alternative mechanism for passing additional vLLM flags. However, using the args field directly is the recommended approach as it follows standard Kubernetes conventions and keeps your configuration declarative.

6.8 Configure networking

Before - serving.kserve.io/deploymentMode: RawDeployment mode (default in RHOAI 3.x)

After - Gateway API:

spec:
  router:
    scheduler: {}  # deploys the Endpoint Picker for LLM-aware scheduling
    route: {}      # creates a managed HTTPRoute
    gateway: {}    # uses the default ingress gateway

Empty objects {} trigger controller-managed defaults. Omit the scheduler if you do not need LLM-aware request routing.

Important: LLMInferenceService uses Gateway API for external access, not OpenShift Routes. At present, Gateway services need to be of type LoadBalancer, so users need an external load balancer to expose the service. This is a change from InferenceService where Knative or RawDeployment handled external routing automatically.

7. Migration Procedure

The following steps take you through the migration process: applying the new resource that you created in Step 6, verifying health, cutting traffic, and cleaning up the old resource.

InferenceService and LLMInferenceService are different CRDs with different Group/Version/Kind identifiers. There is no in-place conversion - you cannot oc edit one into the other. Both CRDs coexist; migration is opt-in.

Important: If you include a servingRuntime field in your LLMInferenceService YAML, running oc apply (without --validate=strict) silently drops the unknown field, and the controller falls back to the default config with no error. Always apply with --validate=strict during migration to catch stale fields.

Important: During side-by-side migration, use a different metadata.name for the LLMInferenceService (e.g. qwen2-7b-llmisvc) to avoid Domain Name System (DNS)/service name conflicts while both resources coexist. Once the old InferenceService is deleted, you can keep the new name or recreate with the original if needed.

  1. Translate your InferenceService YAML using this guide.
  2. Create the LLMInferenceService alongside the existing InferenceService:
oc apply -f <your-llmisvc.yaml> --validate=strict
  1. Verify health:
oc get llmisvc <name>

Wait for Ready=True.

  1. Cut over traffic - update DNS, Gateway HTTPRoute, or consumers to point to the new endpoint.
  2. Delete the old InferenceService:
oc delete inferenceservice <name>

7.2 Option B: Delete and recreate migration (with downtime)

  1. Delete the InferenceService:
oc delete inferenceservice <name>
  1. Apply the LLMInferenceService:
oc apply -f <your-llmisvc.yaml> --validate=strict
  1. Wait for Ready=True:
oc get llmisvc <name>

Note: Platform features that integrate specifically with LLMInferenceService (e.g. Red Hat OpenShift AI dashboard integrations) will only work with LLMInferenceService resources. Existing InferenceService deployments must be migrated to access these features.

8. LLMInferenceServiceConfig: Replacing ServingRuntime

With InferenceService, you create a ServingRuntime that defines the container image, command, and environment, then reference it by model format.

With LLMInferenceService, this role is filled by LLMInferenceServiceConfig - a template resource with the same spec as LLMInferenceService (except model is optional). LLMInferenceServiceConfig provides reusable presets that can be shared globally when created in the RHOAI system namespace (redhat-ods-applications), or scoped to a specific namespace when created in a user namespace.

Deployment patternAuto-applied config
Single-nodekserve-config-llm-template
Disaggregated decodekserve-config-llm-decode-template
Disaggregated prefillkserve-config-llm-prefill-template
Multi-node data parallelkserve-config-llm-worker-data-parallel
Scheduler (Endpoint Picker Plugin (EPP))kserve-config-llm-scheduler
HTTPRoutekserve-config-llm-router-route

Note: This table lists the most common configs. Additional configs exist for combined patterns (e.g. disaggregated + multi-node). Run oc get llminferenceserviceconfigs to see all configs available in your namespace.

8.1 How config merging works

The controller merges configurations in this order (last wins):

  1. Well-known defaults - auto-selected based on deployment pattern
  2. User baseRefs - in order; last entry overrides earlier ones
  3. Service spec - fields on the LLMInferenceService itself (highest precedence)

Configs are looked up in the service namespace first, then the kserve system namespace.

The merge uses Kubernetes strategic merge patch semantics:

  • Containers merge by name (the main container in your override merges with main in the base)
  • Environment variables on a container replace as a block - they do not merge individually by name

8.2 Creating a custom config

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: my-custom-vllm
  namespace: my-namespace
spec:
  template:
    containers:
      - name: main
        image: my-registry/my-vllm:v1.0

Reference it via baseRefs:

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: my-model
  namespace: my-namespace
spec:
  baseRefs:
    - name: my-custom-vllm
  model:
    uri: hf://my-org/my-model
  replicas: 1
  router:
    route: {}
    gateway: {}

The controller applies kserve-config-llm-template first (well-known default), then merges my-custom-vllm on top (overriding only the image), then applies any fields from the service spec itself.

9. Selecting an Accelerator Configuration

Accelerator configs are LLMInferenceServiceConfig resources that override the container image - and optionally environment variables or pod-level fields - for specific hardware (e.g. NVIDIA CUDA, AMD ROCm, IBM Spyre).

Reference the accelerator config via baseRefs:

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: my-llm
spec:
  baseRefs:
    - name: kserve-config-llm-template-nvidia-cuda
  model:
    uri: hf://meta-llama/Llama-4-Scout-17B-16E-Instruct
  replicas: 1
  router:
    scheduler: {}
    route: {}
    gateway: {}
  template:
    containers:
      - name: main
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: 64Gi

Important: baseRefs is the only mechanism for selecting an accelerator config. There is no servingRuntime or accelerator field on LLMInferenceService.

The available accelerator configs are installed by the cluster administrator and ship with the Red Hat OpenShift AI installation. To see which configs are available in your namespace:

oc get llminferenceserviceconfigs

Tip: Do not override command or args in accelerator configs. The base template places the vLLM launch logic in command and leaves args free for user configuration. Overriding either in an accelerator config silently discards user-specified vLLM flags.

10. Known issues and considerations

10.1 servingRuntime field does not exist

LLMInferenceService has no servingRuntime field. If you include it in your YAML, oc apply (without --validate=strict) silently drops the unknown field and the controller falls back to the default config with no error.

Fix: Use baseRefs to reference an LLMInferenceServiceConfig. Always apply with oc apply --validate=strict during migration to catch stale fields.

10.2 Container name must be main

The well-known configs and controller expect the inference container to be named main, not kserve-container (which InferenceService uses). Resource limits, env var overrides, and probe customizations must target containers[name=main].

10.3 No in-place conversion

These are different CRDs - you must delete the InferenceService and create a new LLMInferenceService. See Migration Procedure.

10.4 --backend vllm is no longer needed

The base config template already runs vllm serve. Passing --backend vllm as an argument causes a startup error. Remove it.

10.5 replicas and scaling are mutually exclusive

The API server rejects resources with both spec.replicas and spec.scaling set. Choose one: static replicas or WVA-based autoscaling.

10.6 Do not override command or args in accelerator configs

The base template uses a bash wrapper that conditionally sources accelerator-specific setup scripts and forwards all arguments via "$@". Overriding command or args in a config silently discards user-specified vLLM flags.

10.7 No Knative dependency

LLMInferenceService uses Gateway API for networking. The serverless vs rawDeployment deployment mode distinction does not apply.

10.8 LeaderWorkerSet is required for multi-node only

If you set spec.worker, the LWS operator must be installed. Single-node deployments use standard Kubernetes Deployments and do not require LWS. The controller checks for LWS CRD availability at startup. On Red Hat OpenShift, install LWS via Software Catalog.

10.9 Logger, Batcher, and canaryTrafficPercent are not available

These InferenceService features are not part of LLMInferenceService. Use Gateway API mechanisms for traffic management.

10.10 Environment variable merge semantics

When overriding a container via template or baseRefs, environment variables merge at the container level via strategic merge patch. If both the base config and your override define env vars, the override's env list replaces the base's entirely - individual environment variables are not merged by name.

10.11 Both API versions work

Both the v1alpha1 and v1alpha2 API versions are served. v1alpha2 is the storage version and recommended for new deployments. The API server handles conversion transparently.

11. Next Steps

Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

Category
Article Type