Distributed Inference with llm-d: Released Component Versions
Released Component Versions
| Release type | llm-d upstream version | RHOAI version | RHAIIS version | Date |
|---|---|---|---|---|
| General Availability (GA) | 0.4 | RHOAI 3.3 | RHAIIS 3.3 | March 2026 |
| General Availability (GA) | 0.4 | RHOAI 3.2 | RHAIIS 3.2.5 | January 20, 2026 |
| General Availability (GA) | 0.3 | RHOAI 3.0 | RHAIIS 3.2.2 | November 13, 2025 |
| Tech Preview (TP) | 0.2 | RHOAI 2.25 | RHAIIS 3.2.2 | October 23, 2025 |
Component-level checklist
| Component Level | TP | GA | Comments |
|---|---|---|---|
| OpenShift | 4.19.9+ | 4.20+ | |
API Compatibility
Supported API Endpoints
We support the OpenAI-compatible Chat Completions endpoints as the stable interface:

- `/v1/chat/completions`
- `/v1/completions`
**Note:** Per-request token usage (`prompt_tokens`, `completion_tokens`) is returned in the `usage` field for text inputs.
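As a minimal sketch, the per-request `usage` accounting can be read directly from a Chat Completions response body. The JSON below is illustrative (not captured from a real deployment); the field names follow the OpenAI-compatible schema served at `/v1/chat/completions`.

```python
import json

# Illustrative Chat Completions response body (hypothetical values);
# the "usage" object carries the per-request token accounting.
raw = """
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Hello!"},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15}
}
"""

response = json.loads(raw)
usage = response["usage"]
# Per-request token counts for metering or chargeback at the gateway layer.
print(usage["prompt_tokens"], usage["completion_tokens"])
```

The same `usage` object is present for both `/v1/chat/completions` and `/v1/completions` responses with text inputs.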
Out of Scope
The following are not supported due to architectural boundaries and should be handled at the AI gateway layer (for example, a Model-as-a-Service layer):

- Anthropic Messages API
- OpenAI Responses API
- Provider-specific APIs
GA RHOAI 3.3
Supported configuration(s):
Note:
- Wide Expert-Parallelism multi-node: Developer Preview
- Wide Expert-Parallelism on Blackwell B200: not available by default, but can be provided as a Tech Preview
- Multi-node on GB200: not supported
Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.
NVIDIA: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE | High-speed NVMe SSDs. |
AMD: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |
Mixed Accelerator Architectures
Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.
| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |
GA RHOAI 3.2
Supported configuration(s):
Note:
- Wide Expert-Parallelism multi-node: Developer Preview
- Wide Expert-Parallelism on Blackwell B200: not available by default, but can be provided as a Tech Preview
- Multi-node on GB200: not supported
Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.
NVIDIA: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE | High-speed NVMe SSDs. |
AMD: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |
Mixed Accelerator Architectures
Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.
| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |
GA RHOAI 3.0
Supported configuration(s):
Note:
- Wide Expert-Parallelism multi-node: Developer Preview
- Wide Expert-Parallelism on Blackwell B200: not available by default, but can be provided as a Tech Preview
- Multi-node on GB200: not supported
Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.
NVIDIA: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE | High-speed NVMe SSDs. |
AMD: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |
Mixed Accelerator Architectures
Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.
| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |
Tech Preview - RHOAI 2.25
Supported configuration:
Note: Wide Expert-Parallelism (WEP) multi-node support is included in this Tech Preview, but it may not function as expected and is not yet stable.
Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.
NVIDIA: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE | High-speed NVMe SSDs. |
AMD: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |
Mixed Accelerator Architectures
Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.
| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |