Migrating from vLLM InferenceService to LLMInferenceService on Red Hat OpenShift AI

Updated 21 May 2026

1. Overview

This article describes how to migrate existing vLLM-based InferenceService deployments to the new LLMInferenceService resource on Red Hat OpenShift AI.

LLMInferenceService (serving.kserve.io/v1alpha2, shortName: llmisvc) is a purpose-built custom resource for deploying and managing Large Language Model (LLM) workloads. It replaces the pattern of using InferenceService with a ServingRuntime for vLLM deployments, providing first-class support for LLM-specific concerns such as model URI schemes, parallelism strategies, disaggregated prefill/decode serving, and LLM-aware request scheduling.

Important: InferenceService deployments remain fully supported. This migration is only relevant for InferenceService deployments using vLLM serving runtimes.

Note: This guide applies to Red Hat OpenShift AI 3.4 and later. Earlier versions have different behavior for the args field on LLMInferenceService.

This article covers:

Key differences between InferenceService and LLMInferenceService
A before/after YAML comparison
Step-by-step field mapping
The migration procedure (zero-downtime and delete-and-recreate options)
How the LLMInferenceServiceConfig replaces ServingRuntime
How to select accelerator-specific configurations
Known issues and considerations

2. Prerequisites

Before creating an LLMInferenceService, ensure your cluster meets the prerequisites for deploying LLMs with llm-d on Red Hat OpenShift AI. See the official documentation.

3. Key Differences at a Glance

The following table compares key differences between the InferenceService and LLMInferenceService custom resources.

Concept	InferenceService	LLMInferenceService
API version	`serving.kserve.io/v1beta1`	`serving.kserve.io/v1alpha2`
Kind / shortName	`InferenceService` / `isvc`	`LLMInferenceService` / `llmisvc`
Model location	`spec.predictor.model.storageUri`	`spec.model.uri`
Model format	`spec.predictor.model.modelFormat.name`	Not needed (LLM-only)
Runtime template	`ServingRuntime`	`LLMInferenceServiceConfig` via `spec.baseRefs`
Container image	From `ServingRuntime`	From `LLMInferenceServiceConfig` defaults
vLLM arguments	`spec.predictor.model.args`	Container `args` on the `main` container
Static replicas	`spec.predictor.minReplicas`	`spec.replicas`
Autoscaling	`minReplicas` / `maxReplicas` / `scaleTarget` / `scaleMetric` (Horizontal Pod Autoscaler (HPA) or Kubernetes Event-Driven Autoscaling (KEDA))	`spec.scaling` with the Workload Variant Autoscaler (WVA) layered on top of HPA or KEDA
Networking	Knative serverless or RawDeployment	`spec.router` (Gateway API)
Authentication	Off by default (opt-in)	On by default
Multi-node	`spec.predictor.workerSpec`	`spec.parallelism` + `spec.worker` (LeaderWorkerSet (LWS))
Disaggregated prefill/decode	Not supported	`spec.prefill` section
LoRA adapters	Not first-class	`spec.model.lora`
Canary traffic	`canaryTrafficPercent`	Gateway API traffic splitting
Logger / Batcher	`spec.predictor.logger`, `batcher`	Not in `LLMInferenceService`

4. Before and After

The following examples demonstrate what your YAML will look like before and after the migration.

4.1 InferenceService (before)

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen2-7b
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    model:
      modelFormat:
        name: vLLM
      storageUri: hf://Qwen/Qwen2.5-7B-Instruct
      args:
        - --backend
        - vllm
        - --max-model-len
        - "4096"
      resources:
        limits:
          nvidia.com/gpu: "1"
          memory: 32Gi
        requests:
          nvidia.com/gpu: "1"
          memory: 16Gi

4.2 LLMInferenceService (after)

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: qwen2-7b
spec:
  model:
    uri: hf://Qwen/Qwen2.5-7B-Instruct
  replicas: 3
  router:
    scheduler: {}
    route: {}
    gateway: {}
  template:
    containers:
      - name: main
        args:
          - --max-model-len
          - "4096"
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: 32Gi
          requests:
            nvidia.com/gpu: "1"
            memory: 16Gi

4.3 What changed

InferenceService	LLMInferenceService	Why
`modelFormat.name: vLLM`	Removed	`LLMInferenceService` is LLM-only; no format selection needed
`--backend vllm`	Removed	The default config template already runs `vllm serve`
`--max-model-len 4096`	Pass via container `args`	vLLM flags are now passed as container `args`, not model `args`
`storageUri`	`model.uri`	Same URI schemes supported (`hf://`, `s3://`, `pvc://`)
No image specified	No image specified	Both inherit from their respective template (`ServingRuntime` vs `LLMInferenceServiceConfig`)
`minReplicas: 1`, `maxReplicas: 3`	`replicas: 3`	Pinned to a static count here (the former maximum); for autoscaling use `spec.scaling` instead
Knative networking	`router: { scheduler: {}, route: {}, gateway: {} }`	Gateway API with LLM-aware request load balancing

4.4 Associated ServingRuntime

With InferenceService, the container image, command, and runtime configuration are defined in a separate ServingRuntime resource referenced by model format. Here's a typical vLLM ServingRuntime that would accompany the InferenceService in 4.1:

apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-cuda-runtime
spec:
  supportedModelFormats:
    - name: vLLM
      autoSelect: true
  containers:
    - name: kserve-container
      image: quay.io/modh/vllm:rhoai-3.4
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - --port=8080
        - --model=/mnt/models
      ports:
        - containerPort: 8080
          protocol: TCP

With LLMInferenceService, this role is filled by LLMInferenceServiceConfig. The well-known configs (e.g. kserve-config-llm-template) ship with the Red Hat OpenShift AI installation and are auto-applied by the controller based on the deployment pattern. You do not need to create or maintain your own equivalent of this ServingRuntime unless you have custom requirements - see Section 8: LLMInferenceServiceConfig: Replacing ServingRuntime.

Mapping:

ServingRuntime field	LLMInferenceServiceConfig equivalent
`spec.containers[].image`	`spec.template.containers[name=main].image`
`spec.containers[].command`	`spec.template.containers[name=main].command`
`spec.containers[].args`	`spec.template.containers[name=main].args`
`spec.supportedModelFormats`	Not needed - `LLMInferenceService` is LLM-only

5. ServingRuntime vs LLMInferenceServiceConfig

If you are coming from InferenceService, you are used to a deployment model where the container image, entrypoint command, and runtime configuration live in a separate ServingRuntime resource. Your InferenceService then references it indirectly through the modelFormat field. Understanding how LLMInferenceServiceConfig replaces this pattern is key to a smooth migration.

5.1 How ServingRuntime works with InferenceService

With InferenceService, you create and maintain a ServingRuntime that defines:

The container image (e.g. quay.io/modh/vllm:rhoai-3.4)
The entrypoint command and default arguments
The supported model formats (e.g. vLLM, huggingface)
Port configuration

Your InferenceService selects a ServingRuntime by matching on modelFormat.name. The controller finds the ServingRuntime whose supportedModelFormats matches, and uses it to build the inference pod. You are responsible for creating, versioning, and updating the ServingRuntime - including keeping the vLLM image up to date.

5.2 How LLMInferenceServiceConfig works with LLMInferenceService

With LLMInferenceService, the equivalent of a ServingRuntime is an LLMInferenceServiceConfig. The key differences are:

Pre-installed defaults: Well-known configs ship with the Red Hat OpenShift AI installation and are maintained by the platform. You do not need to create or version your own runtime template for standard deployments.
Automatic selection: The controller selects the appropriate config based on your deployment pattern (single-node, multi-node, disaggregated) rather than matching on a model format field.
Reusable presets: Configs can be shared globally when created in the RHOAI system namespace (redhat-ods-applications), or scoped to a specific namespace when created in a user namespace.
Explicit override via baseRefs: If you need a non-default config (e.g. a custom image or accelerator-specific settings), you reference it by name through spec.baseRefs rather than through model format matching.
Layered merging: Multiple configs can be composed together. The controller merges well-known defaults, then baseRefs entries in order, then your LLMInferenceService spec on top. See Section 8 for details.

5.3 Side-by-side comparison

Aspect	ServingRuntime	LLMInferenceServiceConfig
Who creates it	You create and maintain it	Ships with Red Hat OpenShift AI; you only create one for custom requirements
How it's selected	Implicitly via `modelFormat` matching	Automatically by deployment pattern, or explicitly via `baseRefs`
Scope	Namespace-scoped (`ServingRuntime`)	Namespace-scoped, or global when in the RHOAI system namespace
Image management	You are responsible for updating the image	Platform-managed defaults; override via `baseRefs` if needed
Composability	One `ServingRuntime` per deployment	Multiple configs merged in order (defaults + `baseRefs` + service spec)
What it contains	Container image, command, args, ports, supported formats	Same as `LLMInferenceService` spec (except `model` is optional)
Model format field	Required (`supportedModelFormats`)	Not needed - `LLMInferenceService` is LLM-only

5.4 What this means for your migration

For most migrations, you do not need to create an LLMInferenceServiceConfig at all. The platform-managed defaults handle the standard vLLM deployment. Your migration steps are:

Once no InferenceService still depends on it, delete your ServingRuntime (or stop maintaining it) - the well-known configs replace it.
Remove the modelFormat field from your deployment YAML - format matching is no longer used.
If you were using a custom image, command, or environment variables in your ServingRuntime, create an LLMInferenceServiceConfig with those overrides and reference it via baseRefs. See Section 8 for creating custom configs.
If you were using a standard ServingRuntime provided by Red Hat OpenShift AI (e.g. the default vLLM runtime), no action is needed - the well-known config is the equivalent.

Tip: To see which LLMInferenceServiceConfig resources are available in your namespace, run:

oc get llminferenceserviceconfigs

6. Step-by-Step Migration

The following steps walk through each custom resource field that needs to change when translating your InferenceService YAML to an LLMInferenceService YAML. Work through these to produce your new YAML file, then follow Section 7: Migration Procedure to deploy it.

6.1 Change API version and Kind

Before:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService

After:

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService

Both v1alpha1 and v1alpha2 are served. Use v1alpha2 for new deployments.

6.2 Migrate model specification

Before:

spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      storageUri: hf://Qwen/Qwen2.5-7B-Instruct

After:

spec:
  model:
    uri: hf://Qwen/Qwen2.5-7B-Instruct
    name: Qwen/Qwen2.5-7B-Instruct  # defaults to metadata.name if omitted

Drop the modelFormat field entirely. The model.name field controls the model name in request parameters (e.g. the model field in OpenAI-compatible API calls).
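
The same mapping applies to the other URI schemes listed in Section 4.3. As a minimal sketch, assuming the model weights already sit on a PersistentVolumeClaim, a pvc:// source might look like the following. The claim name my-models-pvc, the path, and the served model name are hypothetical placeholders, not values from this article.

```yaml
spec:
  model:
    # Assumed form: pvc://<claim-name>/<path-inside-the-volume>
    uri: pvc://my-models-pvc/qwen2.5-7b-instruct   # hypothetical claim and path
    name: qwen2.5-7b-instruct                      # hypothetical; the name clients send in the "model" field
```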

6.3 Drop ServingRuntime references

There is no servingRuntime field on LLMInferenceService. The controller automatically applies well-known LLMInferenceServiceConfig templates based on the deployment pattern (single-node, multi-node, disaggregated). To use a non-default config, see Section 9: Selecting an Accelerator Configuration.

6.4 Migrate static replicas

Before:

spec:
  predictor:
    minReplicas: 1    # when minReplicas == maxReplicas (static)
    maxReplicas: 1

After:

spec:
  replicas: 1

6.5 Migrate autoscaling (Technology Preview)

Important: Autoscaling on LLMInferenceService (via WVA) is a Technology Preview feature in Red Hat OpenShift AI 3.4. If you have production autoscaling requirements, consider remaining on InferenceService until WVA reaches General Availability.

Before:

spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    scaleTarget: 70
    scaleMetric: cpu

After:

spec:
  scaling:
    minReplicas: 1
    maxReplicas: 5
    wva:
      hpa: {}    # or keda: {} for KEDA-based scaling

Note: replicas and scaling are mutually exclusive - the API server rejects resources with both set.
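
For orientation, here is a sketch of a complete manifest that uses spec.scaling instead of spec.replicas, with the KEDA variant mentioned in the comment above. It reuses the Qwen example from Section 4 and assumes KEDA is installed on the cluster; treat it as illustrative rather than a tested configuration.

```yaml
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: qwen2-7b
spec:
  model:
    uri: hf://Qwen/Qwen2.5-7B-Instruct
  # No spec.replicas here: replicas and scaling are mutually exclusive
  scaling:
    minReplicas: 1
    maxReplicas: 5
    wva:
      keda: {}    # assumes KEDA is available; use hpa: {} for HPA-based scaling
  router:
    scheduler: {}
    route: {}
    gateway: {}
  template:
    containers:
      - name: main
        resources:
          limits:
            nvidia.com/gpu: "1"
```

Note that the scaleTarget and scaleMetric values from the InferenceService have no direct counterpart in this sketch.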

6.6 Migrate container resources

Tip: When translating, carry over all metadata.annotations from your InferenceService to your LLMInferenceService. Some annotations are consumed by the Red Hat OpenShift AI operator to inject additional configuration (e.g. hardware profile resource requests), and preserving them keeps those behaviors working.

Before:

spec:
  predictor:
    model:
      resources:
        limits:
          nvidia.com/gpu: "1"

After:

spec:
  template:
    containers:
      - name: main    # must be "main", not "kserve-container"
        resources:
          limits:
            nvidia.com/gpu: "1"

The container name must be main. The well-known configs and controller use this name for matching during strategic merge.
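
Putting the tip and the container rename together, a translated manifest might look like the sketch below. The annotation key and value are hypothetical stand-ins for whatever annotations your InferenceService carries; copy the real ones from your existing resource.

```yaml
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: qwen2-7b
  annotations:
    # Hypothetical placeholder: copy the actual annotations from your InferenceService
    example.com/copied-from-isvc: "true"
spec:
  model:
    uri: hf://Qwen/Qwen2.5-7B-Instruct
  replicas: 1
  template:
    containers:
      - name: main    # must match the container name used by the base template
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: 32Gi
          requests:
            nvidia.com/gpu: "1"
            memory: 16Gi
```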

6.7 Migrate runtime arguments

The base LLMInferenceServiceConfig template places the vLLM launch logic (including default flags like --served-model-name, --port, and Transport Layer Security (TLS) configuration) in the container's command field. The args field is left empty, so any arguments you specify on your LLMInferenceService append to the vLLM command rather than replacing the defaults.

Before:

spec:
  predictor:
    model:
      args:
        - --backend
        - vllm
        - --max-model-len
        - "4096"

After:

spec:
  template:
    containers:
      - name: main
        args:
          - --max-model-len
          - "4096"

Remove --backend vllm - the base config template already runs vllm serve
User-specified args are forwarded to vLLM as positional parameters via the bash wrapper in the base template
If the same flag appears in both the base template defaults and your args, your value takes precedence (standard vLLM CLI behavior)

Note: The VLLM_ADDITIONAL_ARGS environment variable is also supported as an alternative mechanism for passing additional vLLM flags. However, using the args field directly is the recommended approach as it follows standard Kubernetes conventions and keeps your configuration declarative.
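
As a sketch of how several flags might be passed together, the fragment below adds two more vLLM options to the example above. --gpu-memory-utilization and --enable-prefix-caching are ordinary vLLM CLI flags chosen for illustration, and the values are arbitrary; check them against the vLLM version in your image.

```yaml
spec:
  template:
    containers:
      - name: main
        args:
          - --max-model-len
          - "4096"                    # quoted so YAML keeps it a string
          - --gpu-memory-utilization
          - "0.9"                     # illustrative value
          - --enable-prefix-caching   # boolean flag, takes no value
```

Every element of args must be a string, which is why the numeric values are quoted.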

6.8 Configure networking

Before - RawDeployment mode, selected with the serving.kserve.io/deploymentMode: RawDeployment annotation (default in RHOAI 3.x)

After - Gateway API:

spec:
  router:
    scheduler: {}  # deploys the Endpoint Picker for LLM-aware scheduling
    route: {}      # creates a managed HTTPRoute
    gateway: {}    # uses the default ingress gateway

Empty objects {} trigger controller-managed defaults. Omit the scheduler if you do not need LLM-aware request routing.
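
If LLM-aware scheduling is not needed, the router section can be reduced as in this sketch, which keeps the managed HTTPRoute and the default gateway and leaves out the scheduler:

```yaml
spec:
  router:
    route: {}      # controller-managed HTTPRoute
    gateway: {}    # default ingress gateway
    # scheduler omitted: no Endpoint Picker is deployed
```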

Important: LLMInferenceService uses Gateway API for external access, not OpenShift Routes. At present, Gateway services need to be of type LoadBalancer, so users need an external load balancer to expose the service. This is a change from InferenceService where Knative or RawDeployment handled external routing automatically.

7. Migration Procedure

The following steps take you through the migration process: applying the new resource that you created in Section 6, verifying health, cutting over traffic, and cleaning up the old resource.

InferenceService and LLMInferenceService are different CRDs with different Group/Version/Kind identifiers. There is no in-place conversion - you cannot oc edit one into the other. Both CRDs coexist; migration is opt-in.

Important: If you include a servingRuntime field in your LLMInferenceService YAML, running oc apply (without --validate=strict) silently drops the unknown field, and the controller falls back to the default config with no error. Always apply with --validate=strict during migration to catch stale fields.

7.1 Option A: Zero-downtime side-by-side migration (Recommended)

Important: During side-by-side migration, use a different metadata.name for the LLMInferenceService (e.g. qwen2-7b-llmisvc) to avoid Domain Name System (DNS)/service name conflicts while both resources coexist. Once the old InferenceService is deleted, you can keep the new name or recreate with the original if needed.

Translate your InferenceService YAML using this guide.
Create the LLMInferenceService alongside the existing InferenceService:

oc apply -f <your-llmisvc.yaml> --validate=strict

Verify health:

oc get llmisvc <name>

Wait for Ready=True.

Cut over traffic - update DNS, Gateway HTTPRoute, or consumers to point to the new endpoint.
Delete the old InferenceService:

oc delete inferenceservice <name>
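
For the side-by-side option, the translated manifest from Section 4.2 could be adapted as sketched below. The -llmisvc suffix follows the naming suggestion above. Setting model.name explicitly is an assumption about what your clients send: because model.name defaults to metadata.name, renaming the resource would otherwise change the model name that clients must use in requests.

```yaml
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: qwen2-7b-llmisvc    # differs from the existing InferenceService name
spec:
  model:
    uri: hf://Qwen/Qwen2.5-7B-Instruct
    name: qwen2-7b          # assumed: the model name existing clients already use
  replicas: 1
  router:
    scheduler: {}
    route: {}
    gateway: {}
  template:
    containers:
      - name: main
        args:
          - --max-model-len
          - "4096"
        resources:
          limits:
            nvidia.com/gpu: "1"
```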

7.2 Option B: Delete and recreate migration (with downtime)

Delete the InferenceService:

oc delete inferenceservice <name>

Apply the LLMInferenceService:

oc apply -f <your-llmisvc.yaml> --validate=strict

Wait for Ready=True:

oc get llmisvc <name>

Note: Platform features that integrate specifically with LLMInferenceService (e.g. Red Hat OpenShift AI dashboard integrations) will only work with LLMInferenceService resources. Existing InferenceService deployments must be migrated to access these features.

8. LLMInferenceServiceConfig: Replacing ServingRuntime

With InferenceService, you create a ServingRuntime that defines the container image, command, and environment, then reference it by model format.

With LLMInferenceService, this role is filled by LLMInferenceServiceConfig - a template resource with the same spec as LLMInferenceService (except model is optional). LLMInferenceServiceConfig provides reusable presets that can be shared globally when created in the RHOAI system namespace (redhat-ods-applications), or scoped to a specific namespace when created in a user namespace.

Deployment pattern	Auto-applied config
Single-node	`kserve-config-llm-template`
Disaggregated decode	`kserve-config-llm-decode-template`
Disaggregated prefill	`kserve-config-llm-prefill-template`
Multi-node data parallel	`kserve-config-llm-worker-data-parallel`
Scheduler (Endpoint Picker Plugin (EPP))	`kserve-config-llm-scheduler`
HTTPRoute	`kserve-config-llm-router-route`

Note: This table lists the most common configs. Additional configs exist for combined patterns (e.g. disaggregated + multi-node). Run oc get llminferenceserviceconfigs to see all configs available in your namespace.

8.1 How config merging works

The controller merges configurations in this order (last wins):

Well-known defaults - auto-selected based on deployment pattern
User baseRefs - in order; last entry overrides earlier ones
Service spec - fields on the LLMInferenceService itself (highest precedence)

Configs are looked up in the service's namespace first, then in the system namespace (redhat-ods-applications on Red Hat OpenShift AI).

The merge uses Kubernetes strategic merge patch semantics:

Containers merge by name (the main container in your override merges with main in the base)
Environment variables on a container replace as a block - they do not merge individually by name
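
To illustrate the environment-variable rule, consider the sketch below. Both resources are hypothetical (my-base-env is not a shipped config), and HF_HOME and VLLM_LOGGING_LEVEL are used only as example variables. Because the env list is replaced as a block, the service restates HF_HOME even though it only wants to change the logging level.

```yaml
# Hypothetical base config; not one of the well-known templates
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: my-base-env
spec:
  template:
    containers:
      - name: main
        env:
          - name: HF_HOME
            value: /tmp/hf
          - name: VLLM_LOGGING_LEVEL
            value: INFO
---
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: my-model
spec:
  baseRefs:
    - name: my-base-env
  model:
    uri: hf://my-org/my-model
  replicas: 1
  template:
    containers:
      - name: main            # matched with main in the base by name
        env:                  # replaces the base env list as a whole
          - name: HF_HOME     # restated; it would otherwise be dropped
            value: /tmp/hf
          - name: VLLM_LOGGING_LEVEL
            value: DEBUG
```

The same reasoning would apply to any variables set by the well-known defaults, so it is worth inspecting the base config before overriding env.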

8.2 Creating a custom config

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: my-custom-vllm
  namespace: my-namespace
spec:
  template:
    containers:
      - name: main
        image: my-registry/my-vllm:v1.0

Reference it via baseRefs:

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: my-model
  namespace: my-namespace
spec:
  baseRefs:
    - name: my-custom-vllm
  model:
    uri: hf://my-org/my-model
  replicas: 1
  router:
    route: {}
    gateway: {}

The controller applies kserve-config-llm-template first (well-known default), then merges my-custom-vllm on top (overriding only the image), then applies any fields from the service spec itself.
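
Several configs can be listed in baseRefs. The sketch below combines the accelerator config named in Section 9 with the custom config above; whether these two particular configs compose cleanly on your cluster is an assumption. Under the merge order described in 8.1, the image from my-custom-vllm would win because it is listed last.

```yaml
spec:
  baseRefs:
    - name: kserve-config-llm-template-nvidia-cuda   # applied first
    - name: my-custom-vllm                           # applied second; overrides overlapping fields
  model:
    uri: hf://my-org/my-model
  replicas: 1
```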

9. Selecting an Accelerator Configuration

Accelerator configs are LLMInferenceServiceConfig resources that override the container image - and optionally environment variables or pod-level fields - for specific hardware (e.g. NVIDIA CUDA, AMD ROCm, IBM Spyre).

Reference the accelerator config via baseRefs:

apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: my-llm
spec:
  baseRefs:
    - name: kserve-config-llm-template-nvidia-cuda
  model:
    uri: hf://meta-llama/Llama-4-Scout-17B-16E-Instruct
  replicas: 1
  router:
    scheduler: {}
    route: {}
    gateway: {}
  template:
    containers:
      - name: main
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: 64Gi

Important: baseRefs is the only mechanism for selecting an accelerator config. There is no servingRuntime or accelerator field on LLMInferenceService.

The available accelerator configs are installed by the cluster administrator and ship with the Red Hat OpenShift AI installation. To see which configs are available in your namespace:

oc get llminferenceserviceconfigs

Tip: Do not override command or args in accelerator configs. The base template places the vLLM launch logic in command and leaves args free for user configuration. Overriding either in an accelerator config silently discards user-specified vLLM flags.
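
If you need an accelerator config of your own, a minimal sketch might override only the image and a pod-level field, leaving command and args untouched as the tip advises. The config name, image reference, and toleration below are hypothetical, and placing tolerations directly under spec.template assumes the template follows the Kubernetes pod spec layout.

```yaml
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: my-accelerator-config        # hypothetical name
  namespace: my-namespace
spec:
  template:
    tolerations:                     # assumed pod-level field; adjust to your nodes
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    containers:
      - name: main
        image: my-registry/my-vllm-accel:v1.0   # hypothetical image
        # no command or args here, so user-specified vLLM flags still reach vLLM
```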

10. Known issues and considerations

10.1 servingRuntime field does not exist

LLMInferenceService has no servingRuntime field. If you include it in your YAML, oc apply (without --validate=strict) silently drops the unknown field and the controller falls back to the default config with no error.

Fix: Use baseRefs to reference an LLMInferenceServiceConfig. Always apply with oc apply --validate=strict during migration to catch stale fields.
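
As a sketch of the fix, the fragment below shows a stale field left over from an InferenceService-style manifest and its baseRefs replacement. The runtime name comes from the ServingRuntime example in Section 4.4 and the config name from Section 9; substitute the config that matches your hardware.

```yaml
spec:
  # servingRuntime: vllm-cuda-runtime   # stale: not a field on LLMInferenceService; remove it
  baseRefs:
    - name: kserve-config-llm-template-nvidia-cuda
  model:
    uri: hf://Qwen/Qwen2.5-7B-Instruct
```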

10.2 Container name must be main

The well-known configs and controller expect the inference container to be named main, not kserve-container (which InferenceService uses). Resource limits, env var overrides, and probe customizations must target containers[name=main].

10.3 No in-place conversion

These are different CRDs - you cannot convert one into the other. You must create a new LLMInferenceService and delete the old InferenceService, either side by side or in sequence. See Section 7: Migration Procedure.

10.4 --backend vllm is no longer needed

The base config template already runs vllm serve. Passing --backend vllm as an argument causes a startup error. Remove it.

10.5 replicas and scaling are mutually exclusive

The API server rejects resources with both spec.replicas and spec.scaling set. Choose one: static replicas or WVA-based autoscaling.

10.6 Do not override command or args in accelerator configs

The base template uses a bash wrapper that conditionally sources accelerator-specific setup scripts and forwards all arguments via "$@". Overriding command or args in a config silently discards user-specified vLLM flags.

10.7 No Knative dependency

LLMInferenceService uses Gateway API for networking. The serverless vs rawDeployment deployment mode distinction does not apply.

10.8 LeaderWorkerSet is required for multi-node only

If you set spec.worker, the LWS operator must be installed. Single-node deployments use standard Kubernetes Deployments and do not require LWS. The controller checks for LWS CRD availability at startup. On Red Hat OpenShift, install LWS from the Software Catalog.

10.9 Logger, Batcher, and canaryTrafficPercent are not available

These InferenceService features are not part of LLMInferenceService. Use Gateway API mechanisms for traffic management.

10.10 Environment variable merge semantics

When you override a container via template or baseRefs, containers are matched by name during the strategic merge, but the env list is replaced as a whole. If both the base config and your override define env vars, the override's env list replaces the base's entirely - individual environment variables are not merged by name. Restate any base variables you still need in your override.

10.11 Both API versions work

Both the v1alpha1 and v1alpha2 API versions are served. v1alpha2 is the storage version and recommended for new deployments. The API server handles conversion transparently.

11. Next Steps

Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

Review the Red Hat OpenShift AI documentation for deploying models with LLMInferenceService: Deploying models on Red Hat OpenShift AI
See the sample deployments in the kserve repository under docs/samples/llmisvc/ for CPU, single-node GPU, disaggregated, and multi-node deployment examples
