Release notes
Highlights of what is new and what has changed with this Red Hat AI Inference Server release
Abstract
Chapter 1. Red Hat AI Inference Server release notes
Red Hat AI Inference Server provides developers and IT organizations with a scalable inference platform for deploying and customizing AI models on secure, scalable resources with minimal configuration and resource usage.
These release notes document new features, enhancements, bug fixes, known issues, and deprecated functionality for each Red Hat AI Inference Server release. Security advisories and asynchronous errata updates are published separately as container images become available.
Chapter 2. Version 3.2.5 release notes
Red Hat AI Inference Server 3.2.5 provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, Google TPU, and IBM Spyre AI accelerators with multi-architecture support for s390x (IBM Z) and ppc64le (IBM Power).
The following container images are Generally Available (GA) from registry.redhat.io:
- registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.5
- registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.5
- registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.5 (s390x and ppc64le)
- registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.5
The following container images are Technology Preview features:
- registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.5
- registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.5 (x86)
The registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.5 and registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.5 (x86) containers are Technology Preview features only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.
2.1. Early access AI Inference Server images
To facilitate customer testing of new models, early access fast release Red Hat AI Inference Server images are available in near-upstream preview builds. Fast release container images are not functionally complete or production-ready, have minimal productization, and are not supported by Red Hat in any way.
You can find available fast release images in the Red Hat ecosystem catalog.
2.2. New Red Hat AI Inference Server developer features
Red Hat AI Inference Server 3.2.5 packages the upstream vLLM v0.11.2 release. You can review the complete list of updates in the upstream vLLM v0.11.2 release notes.
- PyTorch 2.9.0, CUDA 12.9.1 updates
- NVIDIA CUDA has been updated with PyTorch 2.9.0, enabling Inductor partitioning and enabling multiple fixes in graph-partition rules and compile-cache integration.
- Batch-invariant torch.compile
- Generalized batch-invariant support across attention and MoE model backends, with explicit support for DeepGEMM and FlashInfer on NVIDIA Hopper and Blackwell AI accelerators.
- Robust async scheduling
- Fixed several correctness and stability issues in async scheduling, especially when combined with chunked prefill, structured outputs, priority scheduling, MTP, DeepEP, or Dynamic Compressing Prompts (DCP) processing. The --async-scheduling option will be enabled by default in a future release.
- Stronger scheduler + KV ecosystem
- The scheduler is now more robust with KV connectors, prefix caching, and multi-node deployments.
- Anthropic API support
- Added support for the /v1/messages API endpoint. You can now use vllm serve with Anthropic-compatible clients.
- AI accelerator hardware updates
- IBM Spyre support for IBM Power and IBM Z is now Generally Available.
Note
- Single-host deployments for IBM Spyre AI accelerators on IBM Z and IBM Power are supported for RHEL AI 9.6 only.
- Cluster deployments for IBM Spyre AI accelerators on IBM Z are supported as part of Red Hat OpenShift AI version 3.0+ only.
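The new Anthropic-compatible endpoint can be exercised with a minimal request once a model is being served. The following sketch assumes a vllm serve instance listening on localhost:8000; the model name is illustrative:

```shell
# Send an Anthropic Messages API request to a running vllm serve instance.
# Host, port, and model name below are illustrative.
curl -s http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ibm-granite/granite-3.1-8b-instruct",
        "max_tokens": 128,
        "messages": [
          {"role": "user", "content": "Summarize what vLLM does in one sentence."}
        ]
      }'
```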
2.3. New Red Hat AI Model Optimization Toolkit developer features
Red Hat AI Model Optimization Toolkit 3.2.5 packages the upstream LLM Compressor v0.8.1 release. This is unchanged from the Red Hat AI Inference Server 3.2.3 and 3.2.4 releases. See the Version 3.2.3 release notes for more information.
2.4. Known issues
The FlashInfer kernel sampler was disabled by default in Red Hat AI Inference Server 3.2.3 to address non-deterministic behavior and correctness errors in model output.
This change affects sampling behavior when using FlashInfer top-p and top-k sampling methods. If required, you can enable FlashInfer by setting the VLLM_USE_FLASHINFER_SAMPLER environment variable at runtime:
VLLM_USE_FLASHINFER_SAMPLER=1
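For container deployments, the environment variable can be passed at startup. The following is a sketch only, assuming a CUDA host with CDI-configured GPUs and that the image's default entrypoint accepts vllm serve arguments; the model name is illustrative:

```shell
# Start the CUDA container with the FlashInfer sampler re-enabled.
# GPU device flag and model name are illustrative.
podman run --rm -it \
  --device nvidia.com/gpu=all \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  -p 8000:8000 \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.5 \
  --model ibm-granite/granite-3.1-8b-instruct
```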
AMD ROCm AI accelerators do not support inference serving encoder-decoder models when using the vLLM v1 inference engine.
Encoder-decoder model architectures cause NotImplementedError failures with AMD ROCm accelerators. ROCm attention backends support decoder-only attention only.
Affected models include, but are not limited to, the following:
- Speech-to-text Whisper models, for example openai/whisper-large-v3-turbo and mistralai/Voxtral-Mini-3B-2507
- Vision-language models, for example microsoft/Phi-3.5-vision-instruct
- Translation models, for example T5, BART, MarianMT
- Any models using cross-attention or an encoder-decoder architecture
Inference fails for MP3 and M4A file formats. When querying audio models with these file formats, the system returns a "format not recognized" error:
{"error":{"message":"Error opening <_io.BytesIO object at 0x7fc052c821b0>: Format not recognised.","type":"Internal Server Error","param":null,"code":500}}
This issue affects audio transcription models such as openai/whisper-large-v3 and mistralai/Voxtral-Small-24B-2507. To work around this issue, convert audio files to WAV format before processing.
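One way to perform the conversion is with ffmpeg, if it is available on the host. The file names are illustrative; 16 kHz mono is a common input rate for Whisper-style models:

```shell
# Convert an unsupported audio file to WAV before sending it for transcription.
# -ar sets the sample rate, -ac the channel count.
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
```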
Jemalloc consumes more memory than glibc when deploying models on IBM Spyre AI accelerators.
When deploying models with jemalloc as the memory allocator, overall memory usage is significantly higher than when using glibc. In testing, jemalloc increased memory consumption by more than 50% compared to glibc. To work around this issue, disable jemalloc by unsetting the LD_PRELOAD environment variable so that the system uses glibc as the memory allocator instead.
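The workaround can be applied in the shell that launches the model server, for example:

```shell
# Disable jemalloc by clearing LD_PRELOAD before starting the model server,
# so that the default glibc allocator is used instead.
unset LD_PRELOAD
```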
- On IBM Z systems with FIPS mode enabled, Red Hat AI Inference Server fails to start when the IBM Spyre platform plugin is in use. A _hashlib.UnsupportedDigestmodError error is shown in the model startup logs. This issue occurs in Red Hat AI Inference Server 3.2.5 with the IBM Spyre plugin on IBM Z, which uses vLLM v0.11.0. The issue is fixed in vLLM v0.11.1 and will be included in a future version of Red Hat AI Inference Server.
Chapter 3. Version 3.2.4 release notes
Red Hat AI Inference Server 3.2.4 provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, Google TPU, and IBM Spyre AI accelerators. The following container images are Generally Available (GA) from registry.redhat.io:
- registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.4
- registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.4
- registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.4
- registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.4
The following container image is a Technology Preview feature:
- registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.4
Important
The rhaiis/vllm-tpu-rhel9:3.2.4 container is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
To facilitate customer testing of new models, early access fast release Red Hat AI Inference Server images are available in near-upstream preview builds. Fast release container images are not functionally complete or production-ready, have minimal productization, and are not supported by Red Hat in any way.
You can find available fast release images in the Red Hat ecosystem catalog.
The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.
3.1. New Red Hat AI Inference Server developer features
Red Hat AI Inference Server 3.2.4 packages the upstream vLLM v0.11.0 release. This is unchanged from the Red Hat AI Inference Server 3.2.3 release. See the Version 3.2.3 release notes for more information.
3.2. New Red Hat AI Model Optimization Toolkit developer features
Red Hat AI Model Optimization Toolkit 3.2.4 packages the upstream LLM Compressor v0.8.1 release. This is unchanged from the Red Hat AI Inference Server 3.2.3 release. See the Version 3.2.3 release notes for more information.
3.3. Known issues
The FlashInfer kernel sampler was disabled by default in Red Hat AI Inference Server 3.2.3 to address non-deterministic behavior and correctness errors in model output.
This change affects sampling behavior when using FlashInfer top-p and top-k sampling methods. If required, you can enable FlashInfer by setting the VLLM_USE_FLASHINFER_SAMPLER environment variable at runtime:
VLLM_USE_FLASHINFER_SAMPLER=1
AMD ROCm AI accelerators do not support inference serving encoder-decoder models when using the vLLM v1 inference engine.
Encoder-decoder model architectures cause NotImplementedError failures with AMD ROCm accelerators. ROCm attention backends support decoder-only attention only.
Affected models include, but are not limited to, the following:
- Speech-to-text Whisper models, for example openai/whisper-large-v3-turbo and mistralai/Voxtral-Mini-3B-2507
- Vision-language models, for example microsoft/Phi-3.5-vision-instruct
- Translation models, for example T5, BART, MarianMT
- Any models using cross-attention or an encoder-decoder architecture
Chapter 4. Version 3.2.3 release notes
Red Hat AI Inference Server 3.2.3 provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, Google TPU, and IBM Spyre AI accelerators. The following container images are Generally Available (GA) from registry.redhat.io:
- registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.3
- registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.3
- registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.3
The following container images are Technology Preview features:
- registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.3
- registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.3
The rhaiis/vllm-tpu-rhel9:3.2.3 and rhaiis/vllm-spyre-rhel9:3.2.3 containers are Technology Preview features only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
To facilitate customer testing of new models, early access fast release Red Hat AI Inference Server images are now available in near-upstream preview builds. Fast release container images are not functionally complete or production-ready, have minimal productization, and are not supported by Red Hat in any way.
You can find available fast release images in the Red Hat ecosystem catalog.
The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.
4.1. New vLLM developer features
Red Hat AI Inference Server 3.2.3 packages the upstream vLLM v0.11.0 release. You can review the complete list of updates in the upstream vLLM v0.11.0 release notes.
The release completes the removal of the vLLM V0 engine. V1 is now the only inference engine in vLLM.
The FULL_AND_PIECEWISE mode is now the CUDA graph mode default. This provides better performance for most models, particularly fine-grained MoEs, while preserving compatibility with existing models supporting only PIECEWISE mode.
- Inference engine updates
- Added KV cache offloading with CPU offload and LRU cache management.
- Added new vLLM V1 engine features including prompt embeddings, sharded state loading, and sliding window attention.
- Added pipeline parallel and variable hidden size support to the hybrid allocator.
- Extended async scheduling to support uniprocessor execution.
- Removed tokenizer groups and added multimodal caching in shared memory as part of architecture changes.
- Improved attention with hybrid SSM/Attention and FlashAttention 3 for ViT.
- Achieved multiple Triton and RoPE kernel speedups, with speculative decoding now 8 times faster.
- Optimized LoRA weight loading.
- Changed the CUDA graph mode default to FULL_AND_PIECEWISE and disabled the standalone compile feature in the Inductor.
- Added integrated CUDA graph Inductor partition for torch.compile.
- Model support
- Added support for new architectures including DeepSeek-V3.2-Exp, Qwen3-VL, Qwen3-Next, OLMo3, LongCat-Flash, Dots OCR, Ling2.0, and CWM.
- Added RADIO encoder and transformer backend support for encoder-only models.
- Enabled new tasks including BERT NER/token classification and multimodal pooling tasks.
- Added data parallelism for InternVL, Qwen2-VL, and Qwen3-VL.
- Implemented EAGLE3 speculative decoding for MiniCPM3 and GPT-OSS.
- Added new features including Qwen3-VL text-only mode, EVS video pruning, Mamba2 quantization, MRoPE and YaRN, and LongCat-Flash-Chat tools.
- Delivered performance optimizations across GLM, Qwen, and LongCat series.
- Added SeedOSS reason parser for reasoning tasks.
- AI accelerator hardware updates
- NVIDIA: Added FP8 FlashInfer decoding and BF16 fused MoE for NVIDIA Hopper and Blackwell AI accelerators.
- AMD: Added MI300X tuning for GLM-4.5.
- Enabled DeepGEMM by default, providing a 5.5% throughput gain for model serving.
- Performance improvements
- Introduced dual-batch overlap (DBO) as an overlapping compute mechanism for higher throughput.
- Enhanced data parallelism with the new torchrun launcher, Ray placement groups, and Triton DP/EP kernels.
- Reduced EPLB overhead and added static placement.
- Added KV metrics and latent dimension support for disaggregated serving.
- Optimized MoE with shared expert overlap optimization, SiLU kernel, and Allgather/ReduceScatter backend.
- Updated distributed NCCL symmetric memory performance resulting in a 3-4% throughput improvement.
- New quantization options
- Enhanced FP8 with per-token-group quantization, hardware acceleration, and a paged attention update.
- Added FP4 support for dense NVFP4 models and large Llama/Gemma variants.
- Updated W4A8 to perform faster preprocessing.
- Added blocked FP8 support for MoE models in compressed tensors.
- API and front-end improvements
- Enhanced OpenAI compatibility with full-token logprobs, reasoning event streaming, MCP tools, and better error handling.
- Improved multimodal support with UUID caching and updated image path formats.
- Added XML parser for Qwen3-Coder and Hermes token format for tool calling.
- Added a new --enable-logging flag and improved help output in the command line interface.
- Enhanced configuration with speculative engine args, NVTX profiling, and backward compatibility fixes.
- Cleaned up metrics outputs and added KV cache units in GiB.
- Removed misleading quantization warning to improve UX.
- Dependency updates
- Upgraded PyTorch to 2.8 for CUDA and ROCm, FlashInfer to 0.3.1, and CUDA to version 13.
- Enforced C++17 globally across builds.
- Replaced xm.mark_step with torch_xla.sync for Google TPU.
- Security updates
- Fixed advisory GHSA-wr9h-g72x-mwhm.
- vLLM V0 engine deprecation is complete
- Removed AsyncLLMEngine, LLMEngine, MQLLMEngine, attention backends, encoder-decoder, samplers, the LoRA interface, and hybrid model support.
- Removed legacy attention classes, the multimodal registry, compilation fallbacks, and default args from the old system during clean-up.
4.2. New Red Hat AI Model Optimization Toolkit developer features
Red Hat AI Model Optimization Toolkit 3.2.3 is now generally available (GA).
Red Hat AI Model Optimization Toolkit 3.2.3 packages the upstream LLM Compressor v0.8.1 release.
The registry.redhat.io/rhaiis/model-opt-cuda-rhel9 container image packages LLM Compressor v0.8.1 separately in its own runtime image, shipped as a second container image alongside the primary registry.redhat.io/rhaiis/vllm-cuda-rhel9 container image. This reduces the coupling between vLLM and LLM Compressor, streamlining model compression and inference serving workflows.
You can review the complete list of updates in the upstream llm-compressor v0.8.1 release notes.
- Support for multiple modifiers in oneshot compression runs
LLM Compressor now supports using multiple modifiers in oneshot compression runs.
You can apply multiple modifiers across model layers, including different modifiers, such as AWQ and GPTQ, applied to specific submodules for W4A16 quantization, all within a single oneshot call and with only pass-through calibration data.
- Quantization and calibration support for Qwen3 models
Quantization and calibration support for Qwen3 models has been added to LLM Compressor.
An updated Qwen3NextSparseMoeBlock modeling definition has been added to temporarily update the MoE block during calibration, ensuring that all of the experts see data and are calibrated appropriately. This allows all experts to have calibrated scales while ensuring that only the gated activation values are used.
FP8 and NVFP4 quantization examples have been added for the Qwen3-Next-80B-A3B-Instruct model.
- FP8 quantization support for Qwen3 VL MoE models
- LLM Compressor now supports quantization for Qwen3 VL MoE models. You can now use data-free pathways such as FP8 channel-wise and block-wise quantization. Pathways that require data, such as W4A16 and NVFP4, are planned for a future release.
- Transforms support for non-full-size rotation sizes
You can now set a transform_block_size field in the transform-based modifier classes SpinQuantModifier and QuIPModifier. You can configure transforms of variable size with this field, and you no longer need to restrict Hadamards to match the size of the weight.
It is typically beneficial to set the Hadamard block size to match the quantization group size. Examples have been updated to show how to use this field when applying the QuIPModifier.
- Improved accuracy recovery by updating W4A16 schemes to use actorder weight by default
- The GPTQModifier class now uses weight activation ordering by default. Weight or "static" activation ordering has been shown to significantly improve accuracy recovery with no additional cost at runtime.
- Re-enabled support for W8A8 INT8 decompression
- W8A8 INT8 decompression and model generation has been re-enabled in LLM Compressor.
- Updated ignore lists in example recipes to capture all vision components
- Ignore lists in example recipes were updated to correctly capture all vision components. Previously, some vision components such as model.vision_tower were not being caught, causing downstream issues when serving models with vLLM.
- Deprecated and removed unittest.TestCase
- The unittest.TestCase test cases have been removed and replaced with standardized pytest test definitions.
4.3. Known issues
The FlashInfer kernel sampler is disabled by default in Red Hat AI Inference Server to address non-deterministic behavior and correctness errors in model output. This change affects sampling behavior when using FlashInfer top-p and top-k sampling methods.
If required, you can enable FlashInfer by setting the VLLM_USE_FLASHINFER_SAMPLER environment variable at runtime:
VLLM_USE_FLASHINFER_SAMPLER=1
- When serving a model, using the --async-scheduling flag produces incorrect output for preemption and other modes.
- BART support is temporarily removed in vLLM v0.11.0 as part of the finalization of the vLLM V0 engine deprecation. It will be reinstated in a future release.
The aiter Python package is disabled by default in registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.3.
To enable aiter, configure the following Red Hat AI Inference Server runtime environment variables:
VLLM_ROCM_USE_AITER=1
VLLM_ROCM_USE_AITER_RMSNORM=0
VLLM_ROCM_USE_AITER_MHA=0
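For container deployments, these variables can be passed at startup. The following is a sketch only, assuming ROCm device nodes are exposed to the container and that the image's default entrypoint accepts vllm serve arguments; the model name is illustrative:

```shell
# Start the ROCm container with aiter enabled via the documented runtime variables.
# Device flags and model name are illustrative.
podman run --rm -it \
  --device /dev/kfd --device /dev/dri \
  -e VLLM_ROCM_USE_AITER=1 \
  -e VLLM_ROCM_USE_AITER_RMSNORM=0 \
  -e VLLM_ROCM_USE_AITER_MHA=0 \
  -p 8000:8000 \
  registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.3 \
  --model ibm-granite/granite-3.1-8b-instruct
```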
AMD ROCm AI accelerators do not support inference serving encoder-decoder models when using the vLLM v1 inference engine.
Encoder-decoder model architectures cause NotImplementedError failures with AMD ROCm accelerators. ROCm attention backends support decoder-only attention only.
Affected models include, but are not limited to, the following:
- Speech-to-text Whisper models, for example openai/whisper-large-v3-turbo and mistralai/Voxtral-Mini-3B-2507
- Vision-language models, for example microsoft/Phi-3.5-vision-instruct
- Translation models, for example T5, BART, MarianMT
- Any models using cross-attention or an encoder-decoder architecture
Chapter 5. Version 3.2.2 release notes
The Red Hat AI Inference Server 3.2.2 release provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, Google TPU, and IBM Spyre AI accelerators. The container images are available from registry.redhat.io:
- registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.2
- registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.2
- registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.2
- registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.2
This release also includes a new rhaiis/model-opt-cuda-rhel9:3.2.2 container. This new toolkit is called Red Hat AI Model Optimization Toolkit.
The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.
5.1. New vLLM developer features
Red Hat AI Inference Server 3.2.2 packages the upstream vLLM v0.10.1.1 release.
You can review the complete list of updates in the upstream vLLM v0.10.1.1 release notes.
- Inference engine updates
- CUDA graph performance: Full CUDA graph support with separate attention routines, FA2 and FlashInfer compatibility
- Attention system improvements: Multiple attention metadata builders per KV cache, tree attention backend for v1 engine
- Speculative decoding: N-gram speculative decoding with single KMP token proposal algorithm.
- Configuration improvements: Model loader plugin system, rate limiting with bucket algorithm
- Performance improvements
- Improved startup time: enhanced headless models for pooling in the Transformers backend
- NVIDIA Blackwell/SM100 optimizations: CutlassMLA as default backend, FlashInfer MoE per-tensor scale FP8 support
- NVIDIA RTX PRO 6000 (SM120): Block FP8 quantization and CUTLASS NVFP4 4-bit weights/activations
- AMD ROCm enhancements: Flash Attention backend for Qwen-VL models, optimized kernel performance for small batch sizes
- Memory and throughput: Improved efficiency through reduced memory copying, fused RMSNorm kernels, faster multimodal hashing for repeated image prompts, and multithreaded async input loading
- Parallelization and MoE: Faster guided decoding, better expert sharding for MoE, expanded fused kernel support for top-k softmax, and fused MoE support for nomic-embed-text-v2-moe
- Hardware and kernels: Fixed ARM CPU builds without BF16, improved Machete on memory-bound tasks, added a FlashInfer TRT-LLM prefill kernel, sped up the CUDA reshape_and_cache_flash kernel, and enabled CPU transfer in NixlConnector
- Specialized CUDA kernels: GPT-OSS activation functions implemented, faster RLHF weight loading
- New quantization options
- Added MXFP4/bias support in Marlin and NVFP4 GEMM backends, introduced dynamic 4-bit CPU quantization with Kleidiai, and expanded model support with BitsAndBytes for MoE and Gemma3n compatibility.
- API and frontend improvements
- Added OpenAI API Unix socket support and better error alignment, new reward model interface and chunked input processing, multi-key and custom config support, plus HermesToolParser and multi-turn benchmarking.
- Dependency updates
- FlashInfer v0.3.1: now optional via pip install vllm[flashinfer]
- Mamba SSM 2.2.5: removed from core dependencies
- Docker: Precompiled wheel support for easier containerized deployment
- Python: OpenAI dependency bumped for API compatibility
- Various dependency optimizations: Dropped xformers for Mistral models, added DeepGEMM deprecation warnings
- V0 deprecation breaking changes
- V0 deprecation: Continued cleanup of legacy engine components including removal of multi-step scheduling
- CLI updates: Various flag updates and deprecated argument removals as part of V0 engine cleanup
- Quantization: Removed AQLM quantization support - users should migrate to alternative methods
- Tool calling support for gpt-oss models
Red Hat AI Inference Server now supports calling built-in tools directly in gpt-oss models. Tool calling uses the Chat Completions and Responses APIs, both of which can carry function-calling capabilities for gpt-oss models. For more information, see Tool use.
NoteTool calling for gpt-oss models is supported on NVIDIA CUDA AI accelerators only.
5.2. New Red Hat AI Model Optimization Toolkit developer features
Red Hat AI Model Optimization Toolkit 3.2.2 packages the upstream LLM Compressor v0.7.1 release.
You can review the complete list of updates in the upstream Content from github.com is not included.llm-compressor v0.7.1 release notes.
- New Red Hat AI Model Optimization Toolkit container
- The rhaiis/model-opt-cuda-rhel9 container image packages LLM Compressor v0.7.1 separately in its own runtime image, shipped as a second container image alongside the primary rhaiis/vllm-cuda-rhel9 container image. This reduces the coupling between vLLM and LLM Compressor, streamlining model compression and inference serving workflows.
- Introducing transforms
- Red Hat AI Model Optimization Toolkit now supports transforms. With transforms, you can inject additional matrix operations within a model for the purposes of increasing the accuracy recovery as a result of quantization.
- Applying multiple compressors to a single model
- Red Hat AI Model Optimization Toolkit now supports applying multiple compressors to a single model. This extends support for non-uniform quantization recipes, such as combining NVFP4 and FP8 quantization.
- Support for DeepSeekV3-style block FP8 quantization
- You can now apply DeepSeekV3-style block FP8 quantization during model compression, a technique designed to further compress large language models for more efficient inference.
- Mixture of Experts support
- Red Hat AI Model Optimization Toolkit now includes enhanced general Mixture of Experts (MoE) calibration support, including support for MoEs with NVFP4 quantization.
- Llama4 quantization
- Llama4 quantization is now supported in Red Hat AI Model Optimization Toolkit.
- Simplified and updated Recipe classes
- The Recipe system has been streamlined by merging multiple classes into one unified Recipe class. Modifier creation, lifecycle management, and parsing are now simpler. Serialization and deserialization are improved.
- Configurable Observer arguments
- Observer arguments can now be configured as a dict through the observer_kwargs quantization argument, which can be set through oneshot recipes.
5.3. Anonymous statistics collection
Anonymous Red Hat AI Inference Server 3.2.2 usage statistics are now sent to Red Hat. Model consumption and usage stats are collected and stored centrally via Red Hat Observatorium.
5.4. Known issues
- The gpt-oss language model family is supported in Red Hat AI Inference Server 3.2.2 for NVIDIA CUDA AI accelerators only.
- Red Hat AI Inference Server 3.2.2 includes RPMs provided by IBM to support the IBM Spyre AIU. The RPMs in the 3.2.2 release are pre-GA and are not GPG signed. IBM does not sign pre-GA RPMs.
Chapter 6. Version 3.2.1 release notes
The Red Hat AI Inference Server 3.2.1 release provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, and Google TPU AI accelerators. The container images are available from registry.redhat.io:
- registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.1
- registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.1
- registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.1
Red Hat AI Inference Server 3.2.1 packages the upstream vLLM v0.10.0 release.
You can review the complete list of updates in the upstream vLLM v0.10.0 release notes.
The Red Hat AI Inference Server 3.2.1 release does not package LLM Compressor. Pull the earlier 3.2.0 container image to use LLM Compressor with AI Inference Server.
The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.
6.1. New models enabled
Red Hat AI Inference Server 3.2.1 expands capabilities by enabling the following newly validated models in vLLM v0.10.0:
- Llama 4 with EAGLE support
- EXAONE 4.0
- Microsoft Phi‑4‑mini‑flash‑reasoning
- Hunyuan V1 Dense + A13B, including reasoning and tool-parsing abilities
- Ling mixture-of-experts (MoE) models
- JinaVL Reranker
- Nemotron‑Nano‑VL‑8B‑V1
- Arcee
- Voxtral
6.2. New developer features
- Inference engine updates
- V0 engine cleanup - removed legacy CPU/XPU/TPU V0 backends.
- Experimental asynchronous scheduling can be enabled by using the --async-scheduling flag to overlap engine core scheduling with the GPU runner for improved inference throughput.
- Reduced startup time for CUDA graphs by calling gc.freeze before capture.
- Performance improvements
- 48% request duration reduction by using micro-batch tokenization for concurrent requests
- Added fused MLA QKV and strided layernorm.
- Added Triton causal-conv1d for Mamba models.
- New quantization options
- MXFP4 quantization for Mixture of Experts models.
- BNB (Bits and Bytes) support for Mixtral models.
- Hardware-specific quantization improvements.
- Expanded model support
- Llama 4 with EAGLE speculative decoding support.
- EXAONE 4.0 and Microsoft Phi-4-mini model families.
- Hunyuan V1 Dense and Ling MoE architectures.
- OpenAI compatibility
- Added new OpenAI Responses API implementation.
- Added tool calling with required choice and $defs.
- Dependency updates
- Red Hat AI Inference Server Google TPU container image uses PyTorch 2.9.0 nightly.
- NVIDIA CUDA uses PyTorch 2.7.1.
- AMD ROCm remains on PyTorch 2.7.0.
- FlashInfer library is updated to v0.2.8rc1.
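The experimental asynchronous scheduling described above is enabled with a single flag on the serve command. A minimal sketch, with an illustrative model name:

```shell
# Serve a model with experimental asynchronous scheduling enabled,
# overlapping engine core scheduling with the GPU runner.
vllm serve ibm-granite/granite-3.1-8b-instruct --async-scheduling
```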
6.3. Known issues
In Red Hat AI Inference Server model deployments in OpenShift Container Platform 4.19 with CoreOS 9.6, ROCm driver 6.4.2, and multiple ROCm AI accelerators, model deployment fails. This issue does not occur with CoreOS 9.4 paired with the matching ROCm driver 6.4.2 version.
To work around this ROCm driver issue, ensure that you deploy compatible OpenShift Container Platform and ROCm driver versions:
Table 6.1. Supported OpenShift Container Platform and ROCm driver versions
| OpenShift Container Platform version | ROCm driver version |
|---|---|
| 4.17 | 6.4.2 |
| 4.17 | 6.3.4 |
Chapter 7. Version 3.2.0 release notes
The Red Hat AI Inference Server 3.2.0 release provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA and AMD ROCm AI accelerators. The container images are available from registry.redhat.io:
- registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0
- registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.0
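The images above can be pulled with podman after authenticating to the Red Hat registry. A minimal sketch, assuming a valid Red Hat Customer Portal account:

```shell
# Log in to the Red Hat registry (prompts for Customer Portal credentials),
# then pull the CUDA and ROCm images for the 3.2.0 release.
podman login registry.redhat.io
podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0
podman pull registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.0
```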
With Red Hat AI Inference Server, you can serve and inference models with higher performance, lower cost, and enterprise-grade stability and security. Red Hat AI Inference Server is built on the upstream, open source vLLM software project.
New versions of vLLM and LLM Compressor are included in this release:
- vLLM v0.9.2 (400+ upstream commits since vLLM v0.9.0.1)
- LLM Compressor v0.8.1
The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.
Table 7.1. AI accelerator performance highlights
| Feature | Benefit | Supported GPUs |
|---|---|---|
| Blackwell support | Runs on NVIDIA B200 compute capability 10.0 GPUs with FP8 kernels and full CUDA Graph acceleration | NVIDIA Blackwell |
| FP8 KV-cache on ROCm | Roughly twice as large context windows with no accuracy loss | All AMD GPUs |
| Skinny GEMMs | Roughly 10% lower inference latency | AMD MI300X |
| Full CUDA Graph mode | 6–8% improved average Time Per Output Token (TPOT) for small models. | NVIDIA A100 and H100 |
| Auto FP16 fallback | Stable runs on pre-Ampere cards without manual flags, for example, NVIDIA T4 GPUs | Older NVIDIA GPUs |
7.1. New hardware enabled
Table 7.2. AI accelerator performance highlights
| Feature | Benefit | Supported GPUs |
|---|---|---|
| Blackwell compute capability 12.0 | Runs on NVIDIA RTX PRO 6000 Blackwell Server Edition supporting W8A8/FP8 kernels and related tuning | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| ROCm improvements | Full‑graph capture for TritonAttention, quick All‑Reduce, and chunked pre‑fill | AMD ROCm |
7.2. New models enabled
Red Hat AI Inference Server 3.2.0 expands capabilities by enabling the following models added in vLLM v0.9.1:
- LoRA support for InternVL
- Magistral
- MiniCPM with EAGLE support
- NemotronH
The following models were added in vLLM v0.9.0:
- dots1
- Ernie 4.5
- FalconH1
- Gemma‑3
- GLM‑4.1 V
- GPT‑2 for Sequence Classification
- Granite 4
- Keye‑VL‑8B‑Preview
- LlamaGuard4
- MiMo-7B
- MiniMax-M1
- MiniMax-VL-01
- Ovis 1.6, Ovis 2
- Phi‑tiny‑MoE‑instruct
- Qwen 3 Embedding & Reranker
- Slim-MoE
- Tarsier 2
- Tencent HunYuan‑MoE‑V1
7.3. New developer features
- Improved scheduler performance
- The vLLM scheduler API CachedRequestData class has been updated, resulting in improved performance for object and cached sampler-ID stores.
- CUDA graph execution
- CUDA graph execution is now available for all FlashAttention-3 (FA3) and FlashMLA paths, including prefix‑caching.
- New live CUDA graph capture progress bar makes debugging easier.
- Scheduling
- Priority scheduling is now implemented in the vLLM V1 engine.
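Priority scheduling in the V1 engine is selected through the scheduling policy at serve time. A sketch of enabling it, assuming the upstream vLLM --scheduling-policy flag; the model name is a placeholder:

```shell
# Hypothetical invocation: enable priority scheduling in the V1 engine.
# With the priority policy, clients attach an integer priority to requests
# and the scheduler orders them accordingly (see the upstream vLLM docs).
podman run --rm -p 8000:8000 \
  --device nvidia.com/gpu=all \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0 \
  --model RedHatAI/Llama-3.1-8B-Instruct-FP8-dynamic \
  --scheduling-policy priority
```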
Chapter 8. Asynchronous errata updates
Security, bug fix, and enhancement updates for Red Hat AI Inference Server are released as asynchronous errata through the Red Hat Network. All Red Hat AI Inference Server errata are available on the Red Hat Customer Portal. See the Red Hat AI Inference Server Life Cycle for more information about asynchronous errata.
You can enable errata notifications in your Red Hat Customer Portal account settings.
You must register your hosts and configure them to consume Red Hat AI Inference Server entitlements from the Red Hat Customer Portal for the errata notification emails to be generated.