Migrating from vLLM InferenceService to LLMInferenceService on Red Hat OpenShift AI
1. Overview
This article describes how to migrate existing vLLM-based InferenceService deployments to the new LLMInferenceService resource on Red Hat OpenShift AI.
LLMInferenceService (serving.kserve.io/v1alpha2, shortName: llmisvc) is a purpose-built custom resource for deploying and managing Large Language Model (LLM) workloads. It replaces the pattern of using InferenceService with a ServingRuntime for vLLM deployments, providing first-class support for LLM-specific concerns such as model URI schemes, parallelism strategies, disaggregated prefill/decode serving, and LLM-aware request scheduling.
Important: InferenceService deployments remain fully supported. This migration is only relevant for InferenceService deployments using vLLM serving runtimes.
Note: This guide applies to Red Hat OpenShift AI 3.4 and later. Earlier versions have different behavior for the args field on LLMInferenceService.
This article covers:
- Key differences between InferenceService and LLMInferenceService
- A before/after YAML comparison
- Step-by-step field mapping
- The migration procedure (zero-downtime and delete-and-recreate options)
- How the LLMInferenceServiceConfig replaces ServingRuntime
- How to select accelerator-specific configurations
- Known issues and considerations
2. Prerequisites
Before creating an LLMInferenceService, ensure your cluster meets the prerequisites for deploying LLMs with llm-d on Red Hat OpenShift AI. See the official documentation.
3. Key Differences at a Glance
The following table compares key differences between the InferenceService and LLMInferenceService custom resources.
| Concept | InferenceService | LLMInferenceService |
|---|---|---|
| API version | serving.kserve.io/v1beta1 | serving.kserve.io/v1alpha2 |
| Kind / shortName | InferenceService / isvc | LLMInferenceService / llmisvc |
| Model location | spec.predictor.model.storageUri | spec.model.uri |
| Model format | spec.predictor.model.modelFormat.name | Not needed (LLM-only) |
| Runtime template | ServingRuntime | LLMInferenceServiceConfig via spec.baseRefs |
| Container image | From ServingRuntime | From LLMInferenceServiceConfig defaults |
| vLLM arguments | spec.predictor.model.args | Container args on the main container |
| Static replicas | spec.predictor.minReplicas | spec.replicas |
| Autoscaling | minReplicas / maxReplicas / scaleTarget / scaleMetric (Horizontal Pod Autoscaler (HPA) or Kubernetes Event-Driven Autoscaling (KEDA)) | spec.scaling with the Workload Variant Autoscaler (WVA) layered on top of HPA or KEDA |
| Networking | Knative serverless or RawDeployment | spec.router (Gateway API) |
| Authentication | Off by default (opt-in) | On by default |
| Multi-node | spec.predictor.workerSpec | spec.parallelism + spec.worker (LeaderWorkerSet (LWS)) |
| Disaggregated prefill/decode | Not supported | spec.prefill section |
| LoRA adapters | Not first-class | spec.model.lora |
| Canary traffic | canaryTrafficPercent | Gateway API traffic splitting |
| Logger / Batcher | spec.predictor.logger, batcher | Not in LLMInferenceService |
4. Before and After
The following examples demonstrate what your YAML will look like before and after the migration.
4.1 InferenceService (before)
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: qwen2-7b
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 3
    model:
      modelFormat:
        name: vLLM
      storageUri: hf://Qwen/Qwen2.5-7B-Instruct
      args:
        - --backend
        - vllm
        - --max-model-len
        - "4096"
      resources:
        limits:
          nvidia.com/gpu: "1"
          memory: 32Gi
        requests:
          nvidia.com/gpu: "1"
          memory: 16Gi
4.2 LLMInferenceService (after)
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: qwen2-7b
spec:
  model:
    uri: hf://Qwen/Qwen2.5-7B-Instruct
  replicas: 3
  router:
    scheduler: {}
    route: {}
    gateway: {}
  template:
    containers:
      - name: main
        args:
          - --max-model-len
          - "4096"
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: 32Gi
          requests:
            nvidia.com/gpu: "1"
            memory: 16Gi
4.3 What changed
| InferenceService | LLMInferenceService | Why |
|---|---|---|
| modelFormat.name: vLLM | Removed | LLMInferenceService is LLM-only; no format selection needed |
| --backend vllm | Removed | The default config template already runs vllm serve |
| --max-model-len 4096 | Pass via container args | vLLM flags are now passed as container args, not model args |
| storageUri | model.uri | Same URI schemes supported (hf://, s3://, pvc://) |
| No image specified | No image specified | Both inherit from their respective template (ServingRuntime vs LLMInferenceServiceConfig) |
| minReplicas: 1, maxReplicas: 3 | replicas: 3 | Static replicas; for autoscaling use spec.scaling instead |
| Knative networking | router: { scheduler: {}, route: {}, gateway: {} } | Gateway API with LLM-aware request load balancing |
4.4 Associated ServingRuntime
With InferenceService, the container image, command, and runtime configuration are defined in a separate ServingRuntime resource referenced by model format. Here's a typical vLLM ServingRuntime that would accompany the InferenceService in 4.1:
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-cuda-runtime
spec:
  supportedModelFormats:
    - name: vLLM
      autoSelect: true
  containers:
    - name: kserve-container
      image: quay.io/modh/vllm:rhoai-3.4
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - --port=8080
        - --model=/mnt/models
      ports:
        - containerPort: 8080
          protocol: TCP
With LLMInferenceService, this role is filled by LLMInferenceServiceConfig. The well-known configs (e.g. kserve-config-llm-template) ship with the Red Hat OpenShift AI installation and are auto-applied by the controller based on the deployment pattern. You do not need to create or maintain your own equivalent of this ServingRuntime unless you have custom requirements; see Section 8: LLMInferenceServiceConfig: Replacing ServingRuntime.
Mapping:
| ServingRuntime field | LLMInferenceServiceConfig equivalent |
|---|---|
| spec.containers[].image | spec.template.containers[name=main].image |
| spec.containers[].command | spec.template.containers[name=main].command |
| spec.containers[].args | spec.template.containers[name=main].args |
| spec.supportedModelFormats | Not needed - LLMInferenceService is LLM-only |
5. ServingRuntime vs LLMInferenceServiceConfig
If you are coming from InferenceService, you are used to a deployment model where the container image, entrypoint command, and runtime configuration live in a separate ServingRuntime resource. Your InferenceService then references it indirectly through the modelFormat field. Understanding how LLMInferenceServiceConfig replaces this pattern is key to a smooth migration.
5.1 How ServingRuntime works with InferenceService
With InferenceService, you create and maintain a ServingRuntime that defines:
- The container image (e.g. quay.io/modh/vllm:rhoai-3.4)
- The entrypoint command and default arguments
- The supported model formats (e.g. vLLM, huggingface)
- Port configuration
Your InferenceService selects a ServingRuntime by matching on modelFormat.name. The controller finds the ServingRuntime whose supportedModelFormats matches, and uses it to build the inference pod. You are responsible for creating, versioning, and updating the ServingRuntime - including keeping the vLLM image up to date.
5.2 How LLMInferenceServiceConfig works with LLMInferenceService
With LLMInferenceService, the equivalent of a ServingRuntime is an LLMInferenceServiceConfig. The key differences are:
- Pre-installed defaults: Well-known configs ship with the Red Hat OpenShift AI installation and are maintained by the platform. You do not need to create or version your own runtime template for standard deployments.
- Automatic selection: The controller selects the appropriate config based on your deployment pattern (single-node, multi-node, disaggregated) rather than matching on a model format field.
- Reusable presets: Configs can be shared globally when created in the RHOAI system namespace (redhat-ods-applications), or scoped to a specific namespace when created in a user namespace.
- Explicit override via baseRefs: If you need a non-default config (e.g. a custom image or accelerator-specific settings), you reference it by name through spec.baseRefs rather than through model format matching.
- Layered merging: Multiple configs can be composed together. The controller merges well-known defaults, then baseRefs entries in order, then your LLMInferenceService spec on top. See Section 8 for details.
5.3 Side-by-side comparison
| Aspect | ServingRuntime | LLMInferenceServiceConfig |
|---|---|---|
| Who creates it | You create and maintain it | Ships with Red Hat OpenShift AI; you only create one for custom requirements |
| How it's selected | Implicitly via modelFormat matching | Automatically by deployment pattern, or explicitly via baseRefs |
| Scope | Namespace-scoped (ServingRuntime) | Namespace-scoped, or global when in the RHOAI system namespace |
| Image management | You are responsible for updating the image | Platform-managed defaults; override via baseRefs if needed |
| Composability | One ServingRuntime per deployment | Multiple configs merged in order (defaults + baseRefs + service spec) |
| What it contains | Container image, command, args, ports, supported formats | Same as LLMInferenceService spec (except model is optional) |
| Model format field | Required (supportedModelFormats) | Not needed - LLMInferenceService is LLM-only |
5.4 What this means for your migration
For most migrations, you do not need to create an LLMInferenceServiceConfig at all. The platform-managed defaults handle the standard vLLM deployment. Your migration steps are:
- Delete your ServingRuntime (or stop maintaining it) - the well-known configs replace it.
- Remove the modelFormat field from your deployment YAML - format matching is no longer used.
- If you were using a custom image, command, or environment variables in your ServingRuntime, create an LLMInferenceServiceConfig with those overrides and reference it via baseRefs. See Section 8 for creating custom configs.
- If you were using a standard ServingRuntime provided by Red Hat OpenShift AI (e.g. the default vLLM runtime), no action is needed - the well-known config is the equivalent.
Tip: To see which LLMInferenceServiceConfig resources are available in your namespace, run:
oc get llminferenceserviceconfigs
6. Step-by-Step Migration
The following steps walk through each custom resource field that needs to change when translating your InferenceService YAML to an LLMInferenceService YAML. Work through these to produce your new YAML file, then follow Section 7: Migration Procedure to deploy it.
6.1 Change API version and Kind
Before:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
After:
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
Both v1alpha1 and v1alpha2 are served. Use v1alpha2 for new deployments.
6.2 Migrate model specification
Before:
spec:
  predictor:
    model:
      modelFormat:
        name: vLLM
      storageUri: hf://Qwen/Qwen2.5-7B-Instruct
After:
spec:
  model:
    uri: hf://Qwen/Qwen2.5-7B-Instruct
    name: Qwen/Qwen2.5-7B-Instruct # defaults to metadata.name if omitted
Drop the modelFormat field entirely. The model.name field controls the model name in request parameters (e.g. the model field in OpenAI-compatible API calls).
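Because model.uri accepts the same URI schemes as storageUri (hf://, s3://, pvc://), a model stored on a PersistentVolumeClaim might be referenced as in the sketch below. The claim name, path, and served model name are hypothetical placeholders, not values from a real cluster.

```yaml
# Sketch only: the PVC name and path are placeholders (assumptions).
spec:
  model:
    uri: pvc://my-models-pvc/qwen2.5-7b-instruct   # hypothetical claim name and path
    name: qwen2.5-7b-instruct                      # name clients send in the "model" request field
```

Clients would then pass qwen2.5-7b-instruct as the model value in OpenAI-compatible requests.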
6.3 Drop ServingRuntime references
There is no servingRuntime field on LLMInferenceService. The controller automatically applies well-known LLMInferenceServiceConfig templates based on the deployment pattern (single-node, multi-node, disaggregated). To use a non-default config, see Selecting an Accelerator Configuration.
6.4 Migrate static replicas
Before:
spec:
  predictor:
    minReplicas: 1 # when minReplicas == maxReplicas (static)
    maxReplicas: 1
After:
spec:
  replicas: 1
6.5 Migrate autoscaling (Technology Preview)
Important: Autoscaling on LLMInferenceService (via WVA) is a Technology Preview feature in Red Hat OpenShift AI 3.4. If you have production autoscaling requirements, consider remaining on InferenceService until WVA reaches General Availability.
Before:
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    scaleTarget: 70
    scaleMetric: cpu
After:
spec:
  scaling:
    minReplicas: 1
    maxReplicas: 5
    wva:
      hpa: {} # or keda: {} for KEDA-based scaling
Note: replicas and scaling are mutually exclusive - the API server rejects resources with both set.
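For KEDA-based scaling, the same block would presumably swap the hpa key for keda, as the inline comment in the example suggests. This is a minimal sketch that assumes the empty object triggers controller-managed defaults in the same way as hpa: {}. spec.replicas is deliberately left out because it cannot be combined with spec.scaling.

```yaml
spec:
  # replicas: 3   # do not set: mutually exclusive with scaling
  scaling:
    minReplicas: 1
    maxReplicas: 5
    wva:
      keda: {}    # assumed to select KEDA instead of HPA underneath WVA
```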
6.6 Migrate container resources
Tip: When translating, carry over all metadata.annotations from your InferenceService to your LLMInferenceService. Some annotations are consumed by the Red Hat OpenShift AI operator to inject additional configuration (e.g. hardware profile resource requests), and preserving them keeps those behaviors working.
Before:
spec:
  predictor:
    model:
      resources:
        limits:
          nvidia.com/gpu: "1"
After:
spec:
  template:
    containers:
      - name: main # must be "main", not "kserve-container"
        resources:
          limits:
            nvidia.com/gpu: "1"
The container name must be main. The well-known configs and controller use this name for matching during strategic merge.
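Putting the annotation tip and the resource mapping together, a translated resource might look like the following sketch. The annotation key is a made-up placeholder standing in for whatever annotations your InferenceService already carries; copy your real ones rather than this one.

```yaml
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: qwen2-7b
  annotations:
    example.com/copied-from-isvc: "true"   # placeholder; carry over your existing annotations
spec:
  model:
    uri: hf://Qwen/Qwen2.5-7B-Instruct
  template:
    containers:
      - name: main                         # required container name
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: 32Gi
          requests:
            nvidia.com/gpu: "1"
            memory: 16Gi
```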
6.7 Migrate runtime arguments
The base LLMInferenceServiceConfig template places the vLLM launch logic (including default flags like --served-model-name, --port, and Transport Layer Security (TLS) configuration) in the container's command field. The args field is left empty, so any arguments you specify on your LLMInferenceService append to the vLLM command rather than replacing the defaults.
Before:
spec:
  predictor:
    model:
      args:
        - --backend
        - vllm
        - --max-model-len
        - "4096"
After:
spec:
  template:
    containers:
      - name: main
        args:
          - --max-model-len
          - "4096"
- Remove --backend vllm; the base config template already runs vllm serve.
- User-specified args are forwarded to vLLM as positional parameters via the bash wrapper in the base template.
- If the same flag appears in both the base template defaults and your args, your value takes precedence (standard vLLM CLI behavior).
Note: The VLLM_ADDITIONAL_ARGS environment variable is also supported as an alternative mechanism for passing additional vLLM flags. However, using the args field directly is the recommended approach as it follows standard Kubernetes conventions and keeps your configuration declarative.
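As an illustration of appending several flags, the sketch below adds --gpu-memory-utilization alongside --max-model-len. Both are standard vLLM flags, but the values are arbitrary examples rather than recommendations. Numeric values are quoted so that YAML passes them as strings.

```yaml
spec:
  template:
    containers:
      - name: main
        args:
          - --max-model-len
          - "4096"
          - --gpu-memory-utilization
          - "0.90"     # example value only
```

Each flag and its value are separate list items, matching how the original example passes --max-model-len.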
6.8 Configure networking
Before: the InferenceService runs in RawDeployment mode, set through the serving.kserve.io/deploymentMode: RawDeployment annotation (the default in RHOAI 3.x).
After - Gateway API:
spec:
  router:
    scheduler: {} # deploys the Endpoint Picker for LLM-aware scheduling
    route: {} # creates a managed HTTPRoute
    gateway: {} # uses the default ingress gateway
Empty objects {} trigger controller-managed defaults. Omit the scheduler if you do not need LLM-aware request routing.
Important: LLMInferenceService uses Gateway API for external access, not OpenShift Routes. At present, Gateway services need to be of type LoadBalancer, so users need an external load balancer to expose the service. This is a change from InferenceService where Knative or RawDeployment handled external routing automatically.
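If LLM-aware scheduling is not needed, the router block can drop the scheduler key, as described above. A minimal sketch that still relies on controller-managed defaults for the route and gateway:

```yaml
spec:
  router:
    route: {}     # managed HTTPRoute
    gateway: {}   # default ingress gateway
    # scheduler omitted: no Endpoint Picker is deployed
```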
7. Migration Procedure
The following steps take you through the migration process: applying the new resource that you created in Section 6, verifying health, cutting over traffic, and cleaning up the old resource.
InferenceService and LLMInferenceService are different CRDs with different Group/Version/Kind identifiers. There is no in-place conversion - you cannot oc edit one into the other. Both CRDs coexist; migration is opt-in.
Important: If you include a servingRuntime field in your LLMInferenceService YAML, running oc apply (without --validate=strict) silently drops the unknown field, and the controller falls back to the default config with no error. Always apply with --validate=strict during migration to catch stale fields.
7.1 Option A: Zero-downtime side-by-side migration (Recommended)
Important: During side-by-side migration, use a different metadata.name for the LLMInferenceService (e.g. qwen2-7b-llmisvc) to avoid Domain Name System (DNS)/service name conflicts while both resources coexist. Once the old InferenceService is deleted, you can keep the new name or recreate with the original if needed.
- Translate your InferenceService YAML using this guide.
- Create the LLMInferenceService alongside the existing InferenceService:
oc apply -f <your-llmisvc.yaml> --validate=strict
- Verify health:
oc get llmisvc <name>
Wait for Ready=True.
- Cut over traffic - update DNS, Gateway HTTPRoute, or consumers to point to the new endpoint.
- Delete the old InferenceService:
oc delete inferenceservice <name>
7.2 Option B: Delete and recreate migration (with downtime)
- Delete the InferenceService:
oc delete inferenceservice <name>
- Apply the LLMInferenceService:
oc apply -f <your-llmisvc.yaml> --validate=strict
- Wait for Ready=True:
oc get llmisvc <name>
Note: Platform features that integrate specifically with LLMInferenceService (e.g. Red Hat OpenShift AI dashboard integrations) will only work with LLMInferenceService resources. Existing InferenceService deployments must be migrated to access these features.
8. LLMInferenceServiceConfig: Replacing ServingRuntime
With InferenceService, you create a ServingRuntime that defines the container image, command, and environment, then reference it by model format.
With LLMInferenceService, this role is filled by LLMInferenceServiceConfig - a template resource with the same spec as LLMInferenceService (except model is optional). LLMInferenceServiceConfig provides reusable presets that can be shared globally when created in the RHOAI system namespace (redhat-ods-applications), or scoped to a specific namespace when created in a user namespace.
| Deployment pattern | Auto-applied config |
|---|---|
| Single-node | kserve-config-llm-template |
| Disaggregated decode | kserve-config-llm-decode-template |
| Disaggregated prefill | kserve-config-llm-prefill-template |
| Multi-node data parallel | kserve-config-llm-worker-data-parallel |
| Scheduler (Endpoint Picker Plugin (EPP)) | kserve-config-llm-scheduler |
| HTTPRoute | kserve-config-llm-router-route |
Note: This table lists the most common configs. Additional configs exist for combined patterns (e.g. disaggregated + multi-node). Run oc get llminferenceserviceconfigs to see all configs available in your namespace.
8.1 How config merging works
The controller merges configurations in this order (last wins):
- Well-known defaults - auto-selected based on deployment pattern
- User baseRefs - in order; last entry overrides earlier ones
- Service spec - fields on the LLMInferenceService itself (highest precedence)
Configs are looked up in the service namespace first, then in the KServe system namespace (redhat-ods-applications on Red Hat OpenShift AI).
The merge uses Kubernetes strategic merge patch semantics:
- Containers merge by name (the main container in your override merges with main in the base)
- Environment variables on a container replace as a block - they do not merge individually by name
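To make the container-by-name rule concrete, the sketch below shows a simplified base, an override, and the result that the rules above imply. The base shown here is illustrative only: it is not the actual content of kserve-config-llm-template, the image is a placeholder, and the command is elided.

```yaml
# 1) Simplified base (illustrative, not the real well-known config)
template:
  containers:
    - name: main
      image: registry.example.com/vllm:base   # placeholder image
      command: ["/bin/bash", "-c", "..."]     # launch logic elided
---
# 2) Override from the LLMInferenceService spec
template:
  containers:
    - name: main
      args:
        - --max-model-len
        - "4096"
---
# 3) Effective result implied by merge-by-name: one "main" container, fields combined
template:
  containers:
    - name: main
      image: registry.example.com/vllm:base
      command: ["/bin/bash", "-c", "..."]
      args:
        - --max-model-len
        - "4096"
```

Had the override used a different container name, the rule suggests it would be treated as a separate container rather than merged into main.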
8.2 Creating a custom config
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: my-custom-vllm
  namespace: my-namespace
spec:
  template:
    containers:
      - name: main
        image: my-registry/my-vllm:v1.0
Reference it via baseRefs:
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: my-model
  namespace: my-namespace
spec:
  baseRefs:
    - name: my-custom-vllm
  model:
    uri: hf://my-org/my-model
  replicas: 1
  router:
    route: {}
    gateway: {}
The controller applies kserve-config-llm-template first (well-known default), then merges my-custom-vllm on top (overriding only the image), then applies any fields from the service spec itself.
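Because baseRefs entries are merged in order, more than one config can be listed. The following sketch layers the custom config from this section on top of an accelerator config. It assumes both configs exist in the service namespace or in the system namespace.

```yaml
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: my-model
  namespace: my-namespace
spec:
  baseRefs:
    - name: kserve-config-llm-template-nvidia-cuda   # applied first
    - name: my-custom-vllm                           # applied second; wins on conflicts (e.g. image)
  model:
    uri: hf://my-org/my-model
  replicas: 1
  router:
    route: {}
    gateway: {}
```

With this ordering, the image from my-custom-vllm would replace the one set by the accelerator config, since the last entry overrides earlier ones.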
9. Selecting an Accelerator Configuration
Accelerator configs are LLMInferenceServiceConfig resources that override the container image - and optionally environment variables or pod-level fields - for specific hardware (e.g. NVIDIA CUDA, AMD ROCm, IBM Spyre).
Reference the accelerator config via baseRefs:
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: my-llm
spec:
  baseRefs:
    - name: kserve-config-llm-template-nvidia-cuda
  model:
    uri: hf://meta-llama/Llama-4-Scout-17B-16E-Instruct
  replicas: 1
  router:
    scheduler: {}
    route: {}
    gateway: {}
  template:
    containers:
      - name: main
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: 64Gi
Important: baseRefs is the only mechanism for selecting an accelerator config. There is no servingRuntime or accelerator field on LLMInferenceService.
The available accelerator configs are installed by the cluster administrator and ship with the Red Hat OpenShift AI installation. To see which configs are available in your namespace:
oc get llminferenceserviceconfigs
Tip: Do not override command or args in accelerator configs. The base template places the vLLM launch logic in command and leaves args free for user configuration. Overriding either in an accelerator config silently discards user-specified vLLM flags.
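A custom accelerator config that follows this advice would override the image and, if needed, pod-level fields, while leaving command and args untouched. The sketch below is hypothetical: the config name, image, and node label are placeholders, and it assumes spec.template accepts standard pod-level fields such as nodeSelector.

```yaml
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: my-accelerator-config                   # hypothetical name
  namespace: my-namespace
spec:
  template:
    nodeSelector:
      example.com/accelerator: my-gpu           # placeholder label (assumption)
    containers:
      - name: main
        image: my-registry/my-vllm-accel:v1.0   # placeholder image
        # no command or args here, so user-specified vLLM flags still reach vLLM
```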
10. Known issues and considerations
10.1 servingRuntime field does not exist
LLMInferenceService has no servingRuntime field. If you include it in your YAML, oc apply (without --validate=strict) silently drops the unknown field and the controller falls back to the default config with no error.
Fix: Use baseRefs to reference an LLMInferenceServiceConfig. Always apply with oc apply --validate=strict during migration to catch stale fields.
10.2 Container name must be main
The well-known configs and controller expect the inference container to be named main, not kserve-container (which InferenceService uses). Resource limits, env var overrides, and probe customizations must target containers[name=main].
10.3 No in-place conversion
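These are different CRDs, so an InferenceService cannot be converted in place. You must create a new LLMInferenceService and delete the InferenceService, either after cutting traffic over (side-by-side) or beforehand (delete and recreate). See Section 7: Migration Procedure.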
10.4 --backend vllm is no longer needed
The base config template already runs vllm serve. Passing --backend vllm as an argument causes a startup error. Remove it.
10.5 replicas and scaling are mutually exclusive
The API server rejects resources with both spec.replicas and spec.scaling set. Choose one: static replicas or WVA-based autoscaling.
10.6 Do not override command or args in accelerator configs
The base template uses a bash wrapper that conditionally sources accelerator-specific setup scripts and forwards all arguments via "$@". Overriding command or args in a config silently discards user-specified vLLM flags.
10.7 No Knative dependency
LLMInferenceService uses Gateway API for networking. The serverless vs rawDeployment deployment mode distinction does not apply.
10.8 LeaderWorkerSet is required for multi-node only
If you set spec.worker, the LWS operator must be installed. Single-node deployments use standard Kubernetes Deployments and do not require LWS. The controller checks for LWS CRD availability at startup. On Red Hat OpenShift, install LWS via Software Catalog.
10.9 Logger, Batcher, and canaryTrafficPercent are not available
These InferenceService features are not part of LLMInferenceService. Use Gateway API mechanisms for traffic management.
10.10 Environment variable merge semantics
When overriding a container via template or baseRefs, environment variables merge at the container level via strategic merge patch. If both the base config and your override define env vars, the override's env list replaces the base's entirely - individual environment variables are not merged by name.
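In practice this means an override that sets any env entry should restate every base variable it still needs. The sketch below uses invented variable names to show the effect described above; check the actual base config in your cluster before relying on it.

```yaml
# Base config (illustrative variable names)
template:
  containers:
    - name: main
      env:
        - name: BASE_VAR_A
          value: "1"
        - name: BASE_VAR_B
          value: "2"
---
# Override that adds EXTRA_VAR and wants to keep BASE_VAR_A
template:
  containers:
    - name: main
      env:
        - name: BASE_VAR_A      # restated, otherwise it would be dropped
          value: "1"
        - name: EXTRA_VAR
          value: "x"
# Result per the rule above: env holds BASE_VAR_A and EXTRA_VAR only; BASE_VAR_B is gone
```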
10.11 Both API versions work
Both the v1alpha1 and v1alpha2 API versions are served. v1alpha2 is the storage version and recommended for new deployments. The API server handles conversion transparently.
11. Next Steps
Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.
- Review the Red Hat OpenShift AI documentation for deploying models with LLMInferenceService: Deploying models on Red Hat OpenShift AI
- See the sample deployments in the kserve repository under docs/samples/llmisvc/ for CPU, single-node GPU, disaggregated, and multi-node deployment examples