Distributed Inference with llm-d: Release Components Version

Released Components Version

| Release | llm-d upstream version | RHOAI version | RHAIIS version | Date |
|---|---|---|---|---|
| General Availability (GA) | 0.4 | RHOAI 3.3 | RHAIIS 3.3 | March 2026 |
| General Availability (GA) | 0.4 | RHOAI 3.2 | RHAIIS 3.2.5 | January 20, 2026 |
| General Availability (GA) | 0.3 | RHOAI 3.0 | RHAIIS 3.2.2 | November 13, 2025 |
| Tech Preview (TP) | 0.2 | RHOAI 2.25 | RHAIIS 3.2.2 | October 23, 2025 |

Components level checklist

| Component | TP | GA | Comments |
|---|---|---|---|
| OpenShift | 4.19.9+ | 4.20+ | |

API Compatibility

Supported API Endpoints


We support the OpenAI-compatible Completions and Chat Completions endpoints as the stable interface:

  • `/v1/chat/completions`
  • `/v1/completions`
**Note:** Per-request token usage (`prompt_tokens`, `completion_tokens`) is returned in the `usage` field for text inputs.
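As a minimal illustration, the sketch below sends a request to the `/v1/chat/completions` endpoint and reads the per-request token counts from the `usage` field. The base URL, model name, and request content are placeholders, not values from this document; substitute the endpoint and model served by your own llm-d deployment.

```python
import requests

# Placeholder values: point these at your own llm-d inference endpoint.
BASE_URL = "http://llm-d-inference-gateway.example.com"  # hypothetical gateway URL
MODEL = "meta-llama/Llama-3.1-8B-Instruct"               # hypothetical model name

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Summarize what llm-d does in one sentence."}],
    "max_tokens": 128,
}

resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
body = resp.json()

# Per-request token accounting is returned in the `usage` field for text inputs.
usage = body.get("usage", {})
print(body["choices"][0]["message"]["content"])
print(f"prompt_tokens={usage.get('prompt_tokens')}, "
      f"completion_tokens={usage.get('completion_tokens')}")
```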

Out of Scope


The following are not supported due to architectural boundaries and should be handled at the AI gateway layer (for example, a Model-as-a-Service layer):

  • Anthropic Messages API
  • OpenAI Responses API
  • Provider-specific APIs

GA RHOAI 3.3

Supported configuration(s):

Note:

  • Wide Expert-Parallelism multi-node: Developer Preview
  • Wide Expert-Parallelism on Blackwell B200: Not available, but can be provided as a Tech Preview
  • Multi-node on GB200: Not supported

Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.

NVIDIA: Hardware & Accelerator Matrix

| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | Local SSD (NVMe recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | High-speed NVMe SSDs |

AMD: Hardware & Accelerator Matrix

| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not applicable |

Mixed Accelerator Architectures

Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.

| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |
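
Because P/D Disaggregation and Wide Expert Parallelism require a single GPU generation, it can help to confirm which generations are present in the cluster before enabling those paths. The sketch below is a minimal example using the Kubernetes Python client; it assumes GPU nodes carry the NVIDIA GPU Feature Discovery label `nvidia.com/gpu.product`, which depends on your GPU Operator configuration and is not defined by this document.

```python
from collections import Counter
from kubernetes import client, config

# Load credentials from the local kubeconfig (use load_incluster_config() inside a pod).
config.load_kube_config()
v1 = client.CoreV1Api()

# Assumed label set by NVIDIA GPU Feature Discovery; adjust if your cluster labels differ.
GPU_PRODUCT_LABEL = "nvidia.com/gpu.product"

products = Counter()
for node in v1.list_node().items:
    product = (node.metadata.labels or {}).get(GPU_PRODUCT_LABEL)
    if product:
        products[product] += 1

print("GPU generations in the cluster:")
for product, count in products.items():
    print(f"  {product}: {count} node(s)")

if len(products) > 1:
    print("Mixed GPU generations detected: only Intelligent Inference Scheduling is supported.")
```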

GA RHOAI 3.2

Supported configuration(s):

Note:

  • Wide Expert-Parallelism multi-node: Developer Preview
  • Wide Expert-Parallelism on Blackwell B200: Not available, but can be provided as a Tech Preview
  • Multi-node on GB200: Not supported

Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.

NVIDIA: Hardware & Accelerator Matrix

| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | Local SSD (NVMe recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | High-speed NVMe SSDs |

AMD: Hardware & Accelerator Matrix

| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not applicable |

Mixed Accelerator Architectures

Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.

| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |

GA RHOAI 3.0

Supported configuration(s):

Note:

  • Wide Expert-Parallelism multi-node: Developer Preview
  • Wide Expert-Parallelism on Blackwell B200: Not available, but can be provided as a Tech Preview
  • Multi-node on GB200: Not supported

Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.

NVIDIA: Hardware & Accelerator Matrix

| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | Local SSD (NVMe recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | High-speed NVMe SSDs |

AMD: Hardware & Accelerator Matrix

| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not applicable |

Mixed Accelerator Architectures

Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.

| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |

Tech Preview - RHOAI 2.25

Supported configuration:

Note: Wide Expert-Parallelism (WEP) multi-node support is included in this Tech Preview, but it may not function as expected and is not yet stable.

Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.

NVIDIA: Hardware & Accelerator Matrix

| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC fabric with RDMA (InfiniBand or RoCE) | Local SSD (NVMe recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200, GB200 NVL72 | HPC fabric with RDMA (InfiniBand or RoCE) | High-speed NVMe SSDs |

AMD: Hardware & Accelerator Matrix

| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not applicable |

Mixed Accelerator Architectures

Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.

| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |