Distributed Inference with llm-d: Released Component Versions
Released Component Versions
| Release type | llm-d upstream version | RHOAI version | RHAIIS version | Date |
|---|---|---|---|---|
| General Availability (GA) | 0.4 | RHOAI 3.3 | RHAIIS 3.3 | March 2026 |
| General Availability (GA) | 0.4 | RHOAI 3.2 | RHAIIS 3.2.5 | January 20, 2026 |
| General Availability (GA) | 0.3 | RHOAI 3.0 | RHAIIS 3.2.2 | November 13, 2025 |
| Tech Preview (TP) | 0.2 | RHOAI 2.25 | RHAIIS 3.2.2 | October 23, 2025 |
Component-level checklist
| Component Level | TP | GA | Comments |
|---|---|---|---|
| OpenShift | 4.19.9+ | 4.20+ | |
API Compatibility
Supported API Endpoints
We support the OpenAI-compatible Chat Completions endpoints as the stable interface:

- `/v1/chat/completions`
- `/v1/completions`
**Note:** Per-request token usage (`prompt_tokens`, `completion_tokens`) is returned in the `usage` field for text inputs.
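As a minimal sketch, the per-request `usage` accounting can be read directly from a Chat Completions response body. The JSON below is illustrative (not captured from a real deployment); the field names follow the OpenAI-compatible schema served at `/v1/chat/completions`.

```python
import json

# Illustrative Chat Completions response body (hypothetical values);
# the "usage" object carries the per-request token accounting.
raw = """
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Hello!"},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15}
}
"""

response = json.loads(raw)
usage = response["usage"]
# Per-request token counts for metering or chargeback at the gateway layer.
print(usage["prompt_tokens"], usage["completion_tokens"])
```

The same `usage` object is present for both `/v1/chat/completions` and `/v1/completions` responses with text inputs.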
Out of Scope
The following are not supported due to architectural boundaries and should be handled at the AI gateway layer (for example, a Model-as-a-Service layer):

- Anthropic Messages API
- OpenAI Responses API
- Provider-specific APIs
GA RHOAI 3.3
Supported configuration(s):
Note:
- Wide Expert-Parallelism multi-node: Developer Preview
- Wide Expert-Parallelism on Blackwell B200: not available by default, but can be provided as a Tech Preview
- Multi-node on GB200: not supported
Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.
NVIDIA: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE | High-speed NVMe SSDs. |
AMD: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |
Mixed Accelerator Architectures
Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.
| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |
GA RHOAI 3.2
Supported configuration(s):
Note:
- Wide Expert-Parallelism multi-node: Developer Preview
- Wide Expert-Parallelism on Blackwell B200: not available by default, but can be provided as a Tech Preview
- Multi-node on GB200: not supported
Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.
NVIDIA: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE | High-speed NVMe SSDs. |
AMD: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |
Mixed Accelerator Architectures
Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.
| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |
GA RHOAI 3.0
Supported configuration(s):
Note:
- Wide Expert-Parallelism multi-node: Developer Preview
- Wide Expert-Parallelism on Blackwell B200: not available by default, but can be provided as a Tech Preview
- Multi-node on GB200: not supported
Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.
NVIDIA: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE | High-speed NVMe SSDs. |
AMD: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |
Mixed Accelerator Architectures
Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.
| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |
Tech Preview - RHOAI 2.25
Supported configuration:
Note: Wide Expert-Parallelism (WEP) multi-node support is included in this Tech Preview, but it may not function as expected and is not yet stable.
Hardware and Accelerator support per "Well-Lit" paths for distributed inference with llm-d
A well-lit path is a documented, tested, and benchmarked deployment pattern that reduces adoption risk and maintenance cost.
NVIDIA: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended NVIDIA Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | H100, H200, B200, A100 | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| P/D Disaggregation | Separate prefill and decode compute stages. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | H100, H200, B200, A100 | PCIe 5+ | Not Applicable |
| Wide Expert Parallelism (WEP) | Distribute MoE models across many GPUs. | H100, H200, B200 | HPC Fabric with RDMA • InfiniBand • RoCE | High-speed NVMe SSDs. |
AMD: Hardware & Accelerator Matrix
| Well-Lit Path | Primary Goal | Recommended AMD Hardware | Networking/Interconnect Requirement | Storage |
|---|---|---|---|---|
| Intelligent Inference Scheduling | Route requests to the most optimal GPU. | MI300X | Standard DC Ethernet (25/100 GbE) | Local SSD (NVMe Recommended) |
| KV Cache Management (Local CPU Offload) | Increase throughput by offloading KV cache to CPU RAM. | MI300X | PCIe 5+ | Not Applicable |
Mixed Accelerator Architectures
Mixed accelerator refers to combining different hardware generations from the same vendor (e.g., NVIDIA H200 and B200) within the same inference cluster.
| Feature | Mixed Accelerator Support |
|---|---|
| Intelligent Inference Scheduling | Supported |
| P/D Disaggregation | Not supported |
| Wide Expert Parallelism | Not supported |