Release notes
Highlights of what is new and what has changed with this Red Hat AI Inference Server release
Abstract
Chapter 1. Red Hat AI Inference Server release notes
Red Hat AI Inference Server provides developers and IT organizations with a scalable inference platform for deploying and customizing AI models on secure, scalable resources with minimal configuration and resource usage.
These release notes document new features, enhancements, bug fixes, known issues, and deprecated functionality for each Red Hat AI Inference Server release. Security advisories and asynchronous errata updates are published separately as container images become available.
Chapter 2. Version 3.2.5 release notes
Red Hat AI Inference Server 3.2.5 provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, Google TPU, and IBM Spyre AI accelerators with multi-architecture support for s390x (IBM Z) and ppc64le (IBM Power).
The following container images are Generally Available (GA) from registry.redhat.io:
- registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.5
- registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.5
- registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.5 (s390x and ppc64le)
- registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.5
The following container images are Technology Preview features:
- registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.5
- registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.5 (x86)
The registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.5 and registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.5 (x86) containers are Technology Preview features only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.
2.1. Early access AI Inference Server images
To facilitate customer testing of new models, early access fast release Red Hat AI Inference Server images are available in near-upstream preview builds. Fast release container images are not functionally complete or production-ready, have minimal productization, and are not supported by Red Hat in any way.
You can find available fast release images in the Red Hat ecosystem catalog.
2.2. New Red Hat AI Inference Server developer features
Red Hat AI Inference Server 3.2.5 packages the upstream vLLM v0.11.2 release. You can review the complete list of updates in the upstream vLLM v0.11.2 release notes.
- PyTorch 2.9.0, CUDA 12.9.1 updates
- NVIDIA CUDA has been updated with PyTorch 2.9.0, enabling Inductor partitioning and enabling multiple fixes in graph-partition rules and compile-cache integration.
- Batch-invariant torch.compile
- Generalized batch-invariant support across attention and MoE model backends, with explicit support for DeepGEMM and FlashInfer on NVIDIA Hopper and Blackwell AI accelerators.
- Robust async scheduling
- Fixed several correctness and stability issues in async scheduling, especially when combined with chunked prefill, structured outputs, priority scheduling, MTP, DeepEP, or Dynamic Compressing Prompts (DCP) processing. The --async-scheduling option will be enabled by default in a future release.
- Stronger scheduler + KV ecosystem
- The scheduler is now more robust with KV connectors, prefix caching, and multi-node deployments.
- Anthropic API support
- Added support for the /v1/messages API endpoint. You can now use vllm serve with Anthropic-compatible clients.
- AI accelerator hardware updates
- IBM Spyre support for IBM Power and IBM Z is now Generally Available.
Note
- Single-host deployments for IBM Spyre AI accelerators on IBM Z and IBM Power are supported for RHEL AI 9.6 only.
- Cluster deployments for IBM Spyre AI accelerators on IBM Z are supported as part of Red Hat OpenShift AI version 3.0+ only.
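The new Anthropic-compatible endpoint can be exercised with a minimal request once a model is being served. The following sketch assumes a vllm serve instance listening on localhost:8000; the model name is illustrative:

```shell
# Send an Anthropic Messages API request to a running vllm serve instance.
# Host, port, and model name below are illustrative.
curl -s http://localhost:8000/v1/messages \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ibm-granite/granite-3.1-8b-instruct",
        "max_tokens": 128,
        "messages": [
          {"role": "user", "content": "Summarize what vLLM does in one sentence."}
        ]
      }'
```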
2.3. New Red Hat AI Model Optimization Toolkit developer features
Red Hat AI Model Optimization Toolkit 3.2.5 packages the upstream LLM Compressor v0.8.1 release. This is unchanged from the Red Hat AI Inference Server 3.2.3 and 3.2.4 releases. See the Version 3.2.3 release notes for more information.
2.4. Known issues
The FlashInfer kernel sampler was disabled by default in Red Hat AI Inference Server 3.2.3 to address non-deterministic behavior and correctness errors in model output.
This change affects sampling behavior when using FlashInfer top-p and top-k sampling methods. If required, you can enable FlashInfer by setting the VLLM_USE_FLASHINFER_SAMPLER environment variable at runtime:
VLLM_USE_FLASHINFER_SAMPLER=1
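For container deployments, the environment variable can be passed at startup. The following is a sketch only, assuming a CUDA host with CDI-configured GPUs and that the image's default entrypoint accepts vllm serve arguments; the model name is illustrative:

```shell
# Start the CUDA container with the FlashInfer sampler re-enabled.
# GPU device flag and model name are illustrative.
podman run --rm -it \
  --device nvidia.com/gpu=all \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  -p 8000:8000 \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.5 \
  --model ibm-granite/granite-3.1-8b-instruct
```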
AMD ROCm AI accelerators do not support inference serving encoder-decoder models when using the vLLM v1 inference engine.
Encoder-decoder model architectures cause NotImplementedError failures with AMD ROCm accelerators. ROCm attention backends support decoder-only attention only.
Affected models include, but are not limited to, the following:
- Speech-to-text Whisper models, for example openai/whisper-large-v3-turbo and mistralai/Voxtral-Mini-3B-2507
- Vision-language models, for example microsoft/Phi-3.5-vision-instruct
- Translation models, for example T5, BART, MarianMT
- Any models using cross-attention or an encoder-decoder architecture
Inference fails for MP3 and M4A file formats. When querying audio models with these file formats, the system returns a "format not recognized" error:
{"error":{"message":"Error opening <_io.BytesIO object at 0x7fc052c821b0>: Format not recognised.","type":"Internal Server Error","param":null,"code":500}}
This issue affects audio transcription models such as openai/whisper-large-v3 and mistralai/Voxtral-Small-24B-2507. To work around this issue, convert audio files to WAV format before processing.
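One way to perform the conversion is with ffmpeg, if it is available on the host. The file names are illustrative; 16 kHz mono is a common input rate for Whisper-style models:

```shell
# Convert an unsupported audio file to WAV before sending it for transcription.
# -ar sets the sample rate, -ac the channel count.
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
```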
Jemalloc consumes more memory than glibc when deploying models on IBM Spyre AI accelerators.
When deploying models with jemalloc as the memory allocator, overall memory usage is significantly higher than when using glibc. In testing, jemalloc increased memory consumption by more than 50% compared to glibc. To work around this issue, disable jemalloc by unsetting the LD_PRELOAD environment variable so that the system uses glibc as the memory allocator instead.
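The workaround can be applied in the shell that launches the model server, for example:

```shell
# Disable jemalloc by clearing LD_PRELOAD before starting the model server,
# so that the default glibc allocator is used instead.
unset LD_PRELOAD
```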
- On IBM Z systems with FIPS mode enabled, Red Hat AI Inference Server fails to start when the IBM Spyre platform plugin is in use. A _hashlib.UnsupportedDigestmodError error is shown in the model startup logs. This issue occurs in Red Hat AI Inference Server 3.2.5 with the IBM Spyre plugin on IBM Z, which uses vLLM v0.11.0. The issue is fixed in vLLM v0.11.1 and will be included in a future version of Red Hat AI Inference Server.
Chapter 3. Version 3.2.4 release notes
Red Hat AI Inference Server 3.2.4 provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, Google TPU, and IBM Spyre AI accelerators. The following container images are Generally Available (GA) from registry.redhat.io:
- registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.4
- registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.4
- registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.4
- registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.4
The following container image is a Technology Preview feature:
- registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.4
Important
The rhaiis/vllm-tpu-rhel9:3.2.4 container is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
To facilitate customer testing of new models, early access fast release Red Hat AI Inference Server images are available in near-upstream preview builds. Fast release container images are not functionally complete or production-ready, have minimal productization, and are not supported by Red Hat in any way.
You can find available fast release images in the Red Hat ecosystem catalog.
The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.
3.1. New Red Hat AI Inference Server developer features
Red Hat AI Inference Server 3.2.4 packages the upstream vLLM v0.11.0 release. This is unchanged from the Red Hat AI Inference Server 3.2.3 release. See the Version 3.2.3 release notes for more information.
3.2. New Red Hat AI Model Optimization Toolkit developer features
Red Hat AI Model Optimization Toolkit 3.2.4 packages the upstream LLM Compressor v0.8.1 release. This is unchanged from the Red Hat AI Inference Server 3.2.3 release. See the Version 3.2.3 release notes for more information.
3.3. Known issues
The FlashInfer kernel sampler was disabled by default in Red Hat AI Inference Server 3.2.3 to address non-deterministic behavior and correctness errors in model output.
This change affects sampling behavior when using FlashInfer top-p and top-k sampling methods. If required, you can enable FlashInfer by setting the VLLM_USE_FLASHINFER_SAMPLER environment variable at runtime:
VLLM_USE_FLASHINFER_SAMPLER=1
AMD ROCm AI accelerators do not support inference serving encoder-decoder models when using the vLLM v1 inference engine.
Encoder-decoder model architectures cause NotImplementedError failures with AMD ROCm accelerators. ROCm attention backends support decoder-only attention only.
Affected models include, but are not limited to, the following:
- Speech-to-text Whisper models, for example openai/whisper-large-v3-turbo and mistralai/Voxtral-Mini-3B-2507
- Vision-language models, for example microsoft/Phi-3.5-vision-instruct
- Translation models, for example T5, BART, MarianMT
- Any models using cross-attention or an encoder-decoder architecture
Chapter 4. Version 3.2.3 release notes
Red Hat AI Inference Server 3.2.3 provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, Google TPU, and IBM Spyre AI accelerators. The following container images are Generally Available (GA) from registry.redhat.io:
- registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.3
- registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.3
- registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.3
The following container images are Technology Preview features:
- registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.3
- registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.3
The rhaiis/vllm-tpu-rhel9:3.2.3 and rhaiis/vllm-spyre-rhel9:3.2.3 containers are Technology Preview features only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
To facilitate customer testing of new models, early access fast release Red Hat AI Inference Server images are now available in near-upstream preview builds. Fast release container images are not functionally complete or production-ready, have minimal productization, and are not supported by Red Hat in any way.
You can find available fast release images in the Red Hat ecosystem catalog.
The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.
4.1. New vLLM developer features
Red Hat AI Inference Server 3.2.3 packages the upstream vLLM v0.11.0 release. You can review the complete list of updates in the upstream vLLM v0.11.0 release notes.
The release completes the removal of the vLLM V0 engine. V1 is now the only inference engine in vLLM.
The FULL_AND_PIECEWISE mode is now the CUDA graph mode default. This provides better performance for most models, particularly fine-grained MoEs, while preserving compatibility with existing models supporting only PIECEWISE mode.
- Inference engine updates
- Added KV cache offloading with CPU offload and LRU cache management.
- Added new vLLM V1 engine features including prompt embeddings, sharded state loading, and sliding window attention.
- Added pipeline parallel and variable hidden size support to the hybrid allocator.
- Extended async scheduling to support uniprocessor execution.
- Removed tokenizer groups and added multimodal caching in shared memory as part of architecture changes.
- Improved attention with hybrid SSM/Attention and FlashAttention 3 for ViT.
- Achieved multiple Triton and RoPE kernel speedups, with speculative decoding now 8 times faster.
- Optimized LoRA weight loading.
- Changed the CUDA graph mode default to FULL_AND_PIECEWISE and disabled the standalone compile feature in the Inductor.
- Added integrated CUDA graph Inductor partition for torch.compile.
- Model support
- Added support for new architectures including DeepSeek-V3.2-Exp, Qwen3-VL, Qwen3-Next, OLMo3, LongCat-Flash, Dots OCR, Ling2.0, and CWM.
- Added RADIO encoder and transformer backend support for encoder-only models.
- Enabled new tasks including BERT NER/token classification and multimodal pooling tasks.
- Added data parallelism for InternVL, Qwen2-VL, and Qwen3-VL.
- Implemented EAGLE3 speculative decoding for MiniCPM3 and GPT-OSS.
- Added new features including Qwen3-VL text-only mode, EVS video pruning, Mamba2 quantization, MRoPE and YaRN, and LongCat-Flash-Chat tools.
- Delivered performance optimizations across GLM, Qwen, and LongCat series.
- Added SeedOSS reason parser for reasoning tasks.
- AI accelerator hardware updates
- NVIDIA: Added FP8 FlashInfer decoding and BF16 fused MoE for NVIDIA Hopper and Blackwell AI accelerators.
- AMD: Added MI300X tuning for GLM-4.5.
- Enabled DeepGEMM by default, providing a 5.5% throughput gain for model serving.
- Performance improvements
- Introduced dual-batch overlap (DBO) as an overlapping compute mechanism for higher throughput.
- Enhanced data parallelism with the new torchrun launcher, Ray placement groups, and Triton DP/EP kernels.
- Reduced EPLB overhead and added static placement.
- Added KV metrics and latent dimension support for disaggregated serving.
- Optimized MoE with shared expert overlap optimization, SiLU kernel, and Allgather/ReduceScatter backend.
- Updated distributed NCCL symmetric memory performance resulting in a 3-4% throughput improvement.
- New quantization options
- Enhanced FP8 with per-token-group quantization, hardware acceleration, and a paged attention update.
- Added FP4 support for dense NVFP4 models and large Llama/Gemma variants.
- Updated W4A8 to perform faster preprocessing.
- Added blocked FP8 support for MoE models in compressed tensors.
- API and front-end improvements
- Enhanced OpenAI compatibility with full-token logprobs, reasoning event streaming, MCP tools, and better error handling.
- Improved multimodal support with UUID caching and updated image path formats.
- Added XML parser for Qwen3-Coder and Hermes token format for tool calling.
- Added a new --enable-logging flag and improved help output in the command line interface.
- Enhanced configuration with speculative engine args, NVTX profiling, and backward compatibility fixes.
- Cleaned up metrics outputs and added KV cache units in GiB.
- Removed misleading quantization warning to improve UX.
- Dependency updates
- Upgraded PyTorch to 2.8 for CUDA and ROCm, FlashInfer to 0.3.1, and CUDA to version 13.
- Enforced C++17 globally across builds.
- Replaced xm.mark_step with torch_xla.sync for Google TPU.
- Security updates
- Fixed advisory GHSA-wr9h-g72x-mwhm.
- vLLM V0 engine deprecation is complete
- Removed AsyncLLMEngine, LLMEngine, MQLLMEngine, attention backends, encoder-decoder, samplers, the LoRA interface, and hybrid model support.
- Removed legacy attention classes, the multimodal registry, compilation fallbacks, and default args from the old system during clean-up.
4.2. New Red Hat AI Model Optimization Toolkit developer features
Red Hat AI Model Optimization Toolkit 3.2.3 is now generally available (GA).
Red Hat AI Model Optimization Toolkit 3.2.3 packages the upstream LLM Compressor v0.8.1 release.
The registry.redhat.io/rhaiis/model-opt-cuda-rhel9 container image packages LLM Compressor v0.8.1 separately in its own runtime image, shipped as a second container image alongside the primary registry.redhat.io/rhaiis/vllm-cuda-rhel9 container image. This reduces the coupling between vLLM and LLM Compressor, streamlining model compression and inference serving workflows.
You can review the complete list of updates in the upstream llm-compressor v0.8.1 release notes.
- Support for multiple modifiers in oneshot compression runs
LLM Compressor now supports using multiple modifiers in oneshot compression runs.
You can apply multiple modifiers across model layers, including different modifiers, such as AWQ and GPTQ, applied to specific submodules for W4A16 quantization, all within a single oneshot call and with only pass-through calibration data.
- Quantization and calibration support for Qwen3 models
Quantization and calibration support for Qwen3 models has been added to LLM Compressor.
An updated Qwen3NextSparseMoeBlock modeling definition has been added to temporarily update the MoE block during calibration, ensuring that all of the experts see data and are calibrated appropriately. This allows all experts to have calibrated scales while ensuring that only the gated activation values are used.
FP8 and NVFP4 quantization examples have been added for the Qwen3-Next-80B-A3B-Instruct model.
- FP8 quantization support for Qwen3 VL MoE models
- LLM Compressor now supports quantization for Qwen3 VL MoE models. You can now use data-free pathways such as FP8 channel-wise and block-wise quantization. Pathways that require data, such as W4A16 and NVFP4, are planned for a future release.
- Transforms support for non-full-size rotation sizes
You can now set a transform_block_size field in the transform-based modifier classes SpinQuantModifier and QuIPModifier. You can configure transforms of variable size with this field, and you no longer need to restrict Hadamards to match the size of the weight.
It is typically beneficial to set the Hadamard block size to match the quantization group size. Examples have been updated to show how to use this field when applying the QuIPModifier.
- Improved accuracy recovery by updating W4A16 schemes to use actorder weight by default
- The GPTQModifier class now uses weight activation ordering by default. Weight or "static" activation ordering has been shown to significantly improve accuracy recovery with no additional cost at runtime.
- Re-enabled support for W8A8 INT8 decompression
- W8A8 INT8 decompression and model generation has been re-enabled in LLM Compressor.
- Updated ignore lists in example recipes to capture all vision components
- Ignore lists in example recipes were updated to correctly capture all vision components. Previously, some vision components such as model.vision_tower were not being caught, causing downstream issues when serving models with vLLM.
- Deprecated and removed unittest.TestCase
- The unittest.TestCase test cases have been removed and replaced with standardized pytest test definitions.
4.3. Known issues
The FlashInfer kernel sampler is disabled by default in Red Hat AI Inference Server to address non-deterministic behavior and correctness errors in model output. This change affects sampling behavior when using FlashInfer top-p and top-k sampling methods.
If required, you can enable FlashInfer by setting the VLLM_USE_FLASHINFER_SAMPLER environment variable at runtime:
VLLM_USE_FLASHINFER_SAMPLER=1
- When serving a model, using the --async-scheduling flag produces incorrect output for preemption and other modes.
- BART support is temporarily removed in vLLM v0.11.0 as part of the finalization of the vLLM V0 engine deprecation. It will be reinstated in a future release.
The aiter Python package is disabled by default in registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.3.
To enable aiter, configure the following Red Hat AI Inference Server runtime environment variables:
VLLM_ROCM_USE_AITER=1
VLLM_ROCM_USE_AITER_RMSNORM=0
VLLM_ROCM_USE_AITER_MHA=0
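For container deployments, these variables can be passed at startup. The following is a sketch only, assuming ROCm device nodes are exposed to the container and that the image's default entrypoint accepts vllm serve arguments; the model name is illustrative:

```shell
# Start the ROCm container with aiter enabled via the documented runtime variables.
# Device flags and model name are illustrative.
podman run --rm -it \
  --device /dev/kfd --device /dev/dri \
  -e VLLM_ROCM_USE_AITER=1 \
  -e VLLM_ROCM_USE_AITER_RMSNORM=0 \
  -e VLLM_ROCM_USE_AITER_MHA=0 \
  -p 8000:8000 \
  registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.3 \
  --model ibm-granite/granite-3.1-8b-instruct
```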
AMD ROCm AI accelerators do not support inference serving encoder-decoder models when using the vLLM v1 inference engine.
Encoder-decoder model architectures cause NotImplementedError failures with AMD ROCm accelerators. ROCm attention backends support decoder-only attention only.
Affected models include, but are not limited to, the following:
- Speech-to-text Whisper models, for example openai/whisper-large-v3-turbo and mistralai/Voxtral-Mini-3B-2507
- Vision-language models, for example microsoft/Phi-3.5-vision-instruct
- Translation models, for example T5, BART, MarianMT
- Any models using cross-attention or an encoder-decoder architecture
Chapter 5. Version 3.2.2 release notes
The Red Hat AI Inference Server 3.2.2 release provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, Google TPU, and IBM Spyre AI accelerators. The container images are available from registry.redhat.io:
- registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.2
- registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.2
- registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.2
- registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.2
This release also includes a new rhaiis/model-opt-cuda-rhel9:3.2.2 container. This new toolkit is called Red Hat AI Model Optimization Toolkit.
The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.
5.1. New vLLM developer features
Red Hat AI Inference Server 3.2.2 packages the upstream vLLM v0.10.1.1 release.
You can review the complete list of updates in the upstream vLLM v0.10.1.1 release notes.
- Inference engine updates
- CUDA graph performance: Full CUDA graph support with separate attention routines, FA2 and FlashInfer compatibility
- Attention system improvements: Multiple attention metadata builders per KV cache, tree attention backend for v1 engine
- Speculative decoding: N-gram speculative decoding with single KMP token proposal algorithm.
- Configuration improvements: Model loader plugin system, rate limiting with bucket algorithm
- Performance improvements
- Improved startup time: enhanced headless models for pooling in the Transformers backend
- NVIDIA Blackwell/SM100 optimizations: CutlassMLA as default backend, FlashInfer MoE per-tensor scale FP8 support
- NVIDIA RTX PRO 6000 (SM120): Block FP8 quantization and CUTLASS NVFP4 4-bit weights/activations
- AMD ROCm enhancements: Flash Attention backend for Qwen-VL models, optimized kernel performance for small batch sizes
- Memory and throughput: Improved efficiency through reduced memory copying, fused RMSNorm kernels, faster multimodal hashing for repeated image prompts, and multithreaded async input loading
- Parallelization and MoE: Faster guided decoding, better expert sharding for MoE, expanded fused kernel support for top-k softmax, and fused MoE support for nomic-embed-text-v2-moe
- Hardware and kernels: Fixed ARM CPU builds without BF16, improved Machete on memory-bound tasks, added a FlashInfer TRT-LLM prefill kernel, sped up the CUDA reshape_and_cache_flash kernel, and enabled CPU transfer in NixlConnector
- Specialized CUDA kernels: GPT-OSS activation functions implemented, faster RLHF weight loading
- New quantization options
- Added MXFP4/bias support in Marlin and NVFP4 GEMM backends, introduced dynamic 4-bit CPU quantization with Kleidiai, and expanded model support with BitsAndBytes for MoE and Gemma3n compatibility.
- API and frontend improvements
- Added OpenAI API Unix socket support and better error alignment, new reward model interface and chunked input processing, multi-key and custom config support, plus HermesToolParser and multi-turn benchmarking.
- Dependency updates
- FlashInfer v0.3.1: now optional via pip install vllm[flashinfer]
- Mamba SSM 2.2.5: removed from core dependencies
- Docker: Precompiled wheel support for easier containerized deployment
- Python: OpenAI dependency bumped for API compatibility
- Various dependency optimizations: Dropped xformers for Mistral models, added DeepGEMM deprecation warnings
- V0 deprecation breaking changes
- V0 deprecation: Continued cleanup of legacy engine components including removal of multi-step scheduling
- CLI updates: Various flag updates and deprecated argument removals as part of V0 engine cleanup
- Quantization: Removed AQLM quantization support - users should migrate to alternative methods
- Tool calling support for gpt-oss models
Red Hat AI Inference Server now supports calling built-in tools directly in gpt-oss models. Tool calling uses the Chat Completions and Responses APIs, both of which can carry function-calling capabilities for gpt-oss models. For more information, see Tool use.
NoteTool calling for gpt-oss models is supported on NVIDIA CUDA AI accelerators only.
5.2. New Red Hat AI Model Optimization Toolkit developer features
Red Hat AI Model Optimization Toolkit 3.2.2 packages the upstream LLM Compressor v0.7.1 release.
You can review the complete list of updates in the upstream Content from github.com is not included.llm-compressor v0.7.1 release notes.
- New Red Hat AI Model Optimization Toolkit container
- The rhaiis/model-opt-cuda-rhel9 container image packages LLM Compressor v0.7.1 separately in its own runtime image, shipped as a second container image alongside the primary rhaiis/vllm-cuda-rhel9 container image. This reduces the coupling between vLLM and LLM Compressor, streamlining model compression and inference serving workflows.
- Introducing transforms
- Red Hat AI Model Optimization Toolkit now supports transforms. With transforms, you can inject additional matrix operations within a model for the purposes of increasing the accuracy recovery as a result of quantization.
- Applying multiple compressors to a single model
- Red Hat AI Model Optimization Toolkit now supports applying multiple compressors to a single model. This extends support for non-uniform quantization recipes, such as combining NVFP4 and FP8 quantization.
- Support for DeepSeekV3-style block FP8 quantization
- You can now apply DeepSeekV3-style block FP8 quantization during model compression, a technique designed to further compress large language models for more efficient inference.
- Mixture of Experts support
- Red Hat AI Model Optimization Toolkit now includes enhanced general Mixture of Experts (MoE) calibration support, including support for MoEs with NVFP4 quantization.
- Llama4 quantization
- Llama4 quantization is now supported in Red Hat AI Model Optimization Toolkit.
- Simplified and updated Recipe classes
- The Recipe system has been streamlined by merging multiple classes into one unified Recipe class. Modifier creation, lifecycle management, and parsing are now simpler. Serialization and deserialization are improved.
- Configurable Observer arguments
- Observer arguments can now be configured as a dict through the observer_kwargs quantization argument, which can be set through oneshot recipes.
5.3. Anonymous statistics collection
Anonymous Red Hat AI Inference Server 3.2.2 usage statistics are now sent to Red Hat. Model consumption and usage stats are collected and stored centrally via Red Hat Observatorium.
5.4. Known issues
- The gpt-oss language model family is supported in Red Hat AI Inference Server 3.2.2 for NVIDIA CUDA AI accelerators only.
- Red Hat AI Inference Server 3.2.2 includes RPMs provided by IBM to support the IBM Spyre AIU. The RPMs in the 3.2.2 release are pre-GA and are not GPG signed. IBM does not sign pre-GA RPMs.
Chapter 6. Version 3.2.1 release notes
The Red Hat AI Inference Server 3.2.1 release provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, and Google TPU AI accelerators. The container images are available from registry.redhat.io:
- registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.1
- registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.1
- registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.1
Red Hat AI Inference Server 3.2.1 packages the upstream vLLM v0.10.0 release.
You can review the complete list of updates in the upstream vLLM v0.10.0 release notes.
The Red Hat AI Inference Server 3.2.1 release does not package LLM Compressor. Pull the earlier 3.2.0 container image to use LLM Compressor with AI Inference Server.
The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.
6.1. New models enabled
Red Hat AI Inference Server 3.2.1 expands capabilities by enabling the following newly validated models in vLLM v0.10.0:
- Llama 4 with EAGLE support
- EXAONE 4.0
- Microsoft Phi‑4‑mini‑flash‑reasoning
- Hunyuan V1 Dense + A13B, including reasoning and tool-parsing abilities
- Ling mixture-of-experts (MoE) models
- JinaVL Reranker
- Nemotron‑Nano‑VL‑8B‑V1
- Arcee
- Voxtral
6.2. New developer features
- Inference engine updates
- V0 engine cleanup - removed legacy CPU/XPU/TPU V0 backends.
- Experimental asynchronous scheduling can be enabled by using the --async-scheduling flag to overlap engine core scheduling with the GPU runner for improved inference throughput.
- Reduced startup time for CUDA graphs by calling gc.freeze before capture.
- Performance improvements
- 48% request duration reduction by using micro-batch tokenization for concurrent requests
- Added fused MLA QKV and strided layernorm.
- Added Triton causal-conv1d for Mamba models.
- New quantization options
- MXFP4 quantization for Mixture of Experts models.
- BNB (Bits and Bytes) support for Mixtral models.
- Hardware-specific quantization improvements.
- Expanded model support
- Llama 4 with EAGLE speculative decoding support.
- EXAONE 4.0 and Microsoft Phi-4-mini model families.
- Hunyuan V1 Dense and Ling MoE architectures.
- OpenAI compatibility
- Added new OpenAI Responses API implementation.
- Added tool calling with required choice and $defs.
- Dependency updates
- Red Hat AI Inference Server Google TPU container image uses PyTorch 2.9.0 nightly.
- NVIDIA CUDA uses PyTorch 2.7.1.
- AMD ROCm remains on PyTorch 2.7.0.
- FlashInfer library is updated to v0.2.8rc1.
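The experimental asynchronous scheduling described above is enabled with a single flag on the serve command. A minimal sketch, with an illustrative model name:

```shell
# Serve a model with experimental asynchronous scheduling enabled,
# overlapping engine core scheduling with the GPU runner.
vllm serve ibm-granite/granite-3.1-8b-instruct --async-scheduling
```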
6.3. Known issues
In Red Hat AI Inference Server model deployments in OpenShift Container Platform 4.19 with CoreOS 9.6, ROCm driver 6.4.2, and multiple ROCm AI accelerators, model deployment fails. This issue does not occur with CoreOS 9.4 paired with the matching ROCm driver 6.4.2 version.
To work around this ROCm driver issue, ensure that you deploy compatible OpenShift Container Platform and ROCm driver versions:
Table 6.1. Supported OpenShift Container Platform and ROCm driver versions
| OpenShift Container Platform version | ROCm driver version |
|---|---|
| 4.17 | 6.4.2 |
| 4.17 | 6.3.4 |
Chapter 7. Version 3.2.0 release notes
The Red Hat AI Inference Server 3.2.0 release provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA and AMD ROCm AI accelerators. The container images are available from registry.redhat.io:
- registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0
- registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.0
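The images above can be pulled with podman after authenticating to the Red Hat registry. A minimal sketch, assuming a valid Red Hat Customer Portal account:

```shell
# Log in to the Red Hat registry (prompts for Customer Portal credentials),
# then pull the CUDA and ROCm images for the 3.2.0 release.
podman login registry.redhat.io
podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0
podman pull registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.0
```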
With Red Hat AI Inference Server, you can serve and inference models with higher performance, lower cost, and enterprise-grade stability and security. Red Hat AI Inference Server is built on the upstream, open source vLLM software project.
New versions of vLLM and LLM Compressor are included in this release:
- vLLM v0.9.2 (400+ upstream commits since vLLM v0.9.0.1)
- LLM Compressor v0.8.1
The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.
Table 7.1. AI accelerator performance highlights
| Feature | Benefit | Supported GPUs |
|---|---|---|
| Blackwell support | Runs on NVIDIA B200 compute capability 10.0 GPUs with FP8 kernels and full CUDA Graph acceleration | NVIDIA Blackwell |
| FP8 KV-cache on ROCm | Roughly twice as large context windows with no accuracy loss | All AMD GPUs |
| Skinny GEMMs | Roughly 10% lower inference latency | AMD MI300X |
| Full CUDA Graph mode | 6–8% improved average Time Per Output Token (TPOT) for small models. | NVIDIA A100 and H100 |
| Auto FP16 fallback | Stable runs on pre-Ampere cards without manual flags, for example, NVIDIA T4 GPUs | Older NVIDIA GPUs |
7.1. New hardware enabled
Table 7.2. AI accelerator performance highlights
| Feature | Benefit | Supported GPUs |
|---|---|---|
| Blackwell compute capability 12.0 | Runs on NVIDIA RTX PRO 6000 Blackwell Server Edition supporting W8A8/FP8 kernels and related tuning | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| ROCm improvements | Full‑graph capture for TritonAttention, quick All‑Reduce, and chunked pre‑fill | AMD ROCm |
7.2. New models enabled
Red Hat AI Inference Server 3.2.0 expands capabilities by enabling the following models added in vLLM v0.9.1:
- LoRA support for InternVL
- Magistral
- MiniCPM with EAGLE support
- NemotronH
The following models were added in vLLM v0.9.0:
- dots1
- Ernie 4.5
- FalconH1
- Gemma‑3
- GLM‑4.1 V
- GPT‑2 for Sequence Classification
- Granite 4
- Keye‑VL‑8B‑Preview
- LlamaGuard4
- MiMo-7B
- MiniMax-M1
- MiniMax-VL-01
- Ovis 1.6, Ovis 2
- Phi‑tiny‑MoE‑instruct
- Qwen 3 Embedding & Reranker
- Slim-MoE
- Tarsier 2
- Tencent HunYuan‑MoE‑V1
7.3. New developer features
- Improved scheduler performance
- The vLLM scheduler API CachedRequestData class has been updated, resulting in improved performance for object and cached sampler-ID stores.
- CUDA graph execution
- CUDA graph execution is now available for all FlashAttention-3 (FA3) and FlashMLA paths, including prefix‑caching.
- New live CUDA graph capture progress bar makes debugging easier.
- Scheduling
- Priority scheduling is now implemented in the vLLM V1 engine.
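Priority scheduling in the V1 engine is selected through the scheduling policy at serve time. A sketch of enabling it, assuming the upstream vLLM --scheduling-policy flag; the model name is a placeholder:

```shell
# Hypothetical invocation: enable priority scheduling in the V1 engine.
# With the priority policy, clients attach an integer priority to requests
# and the scheduler orders them accordingly (see the upstream vLLM docs).
podman run --rm -p 8000:8000 \
  --device nvidia.com/gpu=all \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.0 \
  --model RedHatAI/Llama-3.1-8B-Instruct-FP8-dynamic \
  --scheduling-policy priority
```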
Chapter 8. Asynchronous errata updates
Security, bug fix, and enhancement updates for Red Hat AI Inference Server are released as asynchronous errata through the Red Hat Network. All Red Hat AI Inference Server errata are available on the Red Hat Customer Portal. See the Red Hat AI Inference Server Life Cycle for more information about asynchronous errata.
You can enable errata notifications in your Red Hat Customer Portal account settings.
You must register your hosts and configure them to consume Red Hat AI Inference Server entitlements from the Red Hat Customer Portal for the errata notification emails to be generated.