3.4 Early Access (EA1) Release Notes

Red Hat AI Inference Server 3.4

Highlights of what is new and what has changed with this Red Hat AI Inference Server release

Red Hat AI Documentation Team

Abstract

The release notes for Red Hat AI Inference Server summarize all new features and enhancements, notable technical changes, major corrections from the previous version, and any known bugs upon general availability.

Chapter 1. Red Hat AI Inference Server release notes

Red Hat AI Inference Server provides developers and IT organizations with a scalable inference platform for deploying and customizing AI models on secure infrastructure with minimal configuration and resource usage.

These release notes document new features, enhancements, bug fixes, known issues, and deprecated functionality for each Red Hat AI Inference Server release. Security advisories and asynchronous errata updates are published separately as container images become available.

Chapter 2. Version 3.4.0-ea.1 release notes

Red Hat AI Inference Server 3.4.0-ea.1 provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, Google TPU, and IBM Spyre AI accelerators with multi-architecture support for s390x (IBM Z) and ppc64le (IBM Power).

Important

Red Hat AI Inference Server 3.4.0-ea.1 is an Early Access release. Early Access releases are not supported by Red Hat in any way and are not functionally complete or production-ready. Do not use Early Access releases for production or business-critical workloads. Use Early Access releases to test upcoming product features in advance of their possible inclusion in a Red Hat product offering, and to test functionality and provide feedback during the development process. These features might not have any documentation, are subject to change or removal at any time, and testing is limited. Red Hat might provide ways to submit feedback on Early Access features without an associated SLA.

The following container images are available as early access releases from registry.redhat.io:

  • registry.redhat.io/rhaii-early-access/vllm-cuda-rhel9:3.4.0-ea.1
  • registry.redhat.io/rhaii-early-access/vllm-rocm-rhel9:3.4.0-ea.1
  • registry.redhat.io/rhaii-early-access/vllm-spyre-rhel9:3.4.0-ea.1 (s390x, ppc64le, x86_64)
  • registry.redhat.io/rhaii-early-access/model-opt-cuda-rhel9:3.4.0-ea.1

The following container images are Technology Preview features:

  • registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.1
  • registry.redhat.io/rhaii-early-access/vllm-neuron-rhel9:3.4.0-ea.1
  • registry.redhat.io/rhaii-early-access/vllm-cpu-rhel9:3.4.0-ea.1

Important

Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.

Note

The following Technology Preview container images bundle different upstream vLLM versions:

  • vllm-tpu-rhel9:3.4.0-ea.1 bundles vLLM v0.13.0.
  • vllm-neuron-rhel9:3.4.0-ea.1 bundles vLLM v0.11.0.

2.1. New Red Hat AI Inference Server developer features

Red Hat AI Inference Server 3.4.0-ea.1 packages the upstream vLLM v0.14.1 release. You can review the complete list of updates in the upstream vLLM v0.14.1 release notes. vLLM v0.14.1 is a patch release on top of v0.14.0 adding security and memory leak fixes.

TPU backend migration to tpu-inference plugin (Technology Preview)
The TPU-enabled container image (vllm-tpu-rhel9) uses the upstream vLLM TPU backend powered by the tpu-inference hardware plugin, replacing the deprecated PyTorch/XLA-based integration. API endpoints and model serving behavior remain unchanged. Container deployment flags have been updated; see the updated deployment procedure for details. Users upgrading from 3.3.0 must update their container deployment commands. The PJRT_DEVICE=TPU environment variable and /dev/vfio device mounts are no longer required.
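
For example, a container start command no longer needs to export PJRT_DEVICE=TPU or mount VFIO devices. The sketch below assumes a podman-based deployment; the model name and remaining options are illustrative, not prescriptive.

```shell
# 3.3.0 and earlier required, for example:
#   podman run --env PJRT_DEVICE=TPU --device /dev/vfio/vfio ...
# In 3.4.0-ea.1 the tpu-inference plugin discovers the TPU directly,
# so both the environment variable and the device mounts can be dropped.
podman run --rm -it \
  registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.1 \
  --model Qwen/Qwen2.5-1.5B-Instruct
```
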
Asynchronous scheduling enabled by default
Asynchronous scheduling now overlaps engine core scheduling with GPU execution, improving throughput without manual configuration. It works with speculative decoding and structured outputs. If you experience issues, you can disable it with the --no-async-scheduling CLI argument. Asynchronous scheduling is not available for pipeline-parallel, CPU backend, or non-MTP/Eagle speculative decoding configurations.
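
Because the feature is on by default, opting out is an explicit flag at serve time. A minimal sketch, with an illustrative model name:

```shell
# Asynchronous scheduling is enabled by default in 3.4.0-ea.1.
# Disable it only if you observe regressions in your workload:
vllm serve Qwen/Qwen2.5-1.5B-Instruct --no-async-scheduling
```
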
gRPC server entrypoint
A new gRPC server entrypoint provides a binary protocol with HTTP/2 multiplexing as an alternative to the REST API. The gRPC protocol can improve performance for high-throughput inference serving deployments.
Auto-context length fitting
The new --max-model-len auto CLI argument automatically adjusts the context length to fit the available GPU memory. This prevents out-of-memory errors during model loading and simplifies deployment configuration.
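
A minimal usage sketch, assuming an illustrative model name:

```shell
# Let vLLM size the context window to fit available GPU memory
# instead of hand-tuning a numeric --max-model-len value:
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len auto
```
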
Various new model support features
  • Added support for Grok-2, LFM2-VL, MiMo-V2-Flash, openPangu MoE, IQuestCoder, Nemotron Parse 1.1, GLM-ASR, Isaac vision models, Kanana-1.5-v-3b, and K-EXAONE-236B-A23B model architectures
  • Added LoRA multimodal tower and connector support for LLaVA, BLIP2, PaliGemma, Pixtral, GLM4-V, and vision LoRA processor caching
  • Added Qwen3-VL reranking, DeepSeek v3.2 chat prefix completion, GLM-4.5/4.7 thinking control, and video timestamp support
  • Added tool calling enhancements, including FunctionGemma and GLM-4.7 parsers
  • Added reasoning_effort parameter for supported models
Performance improvements
  • CUTLASS MoE optimizations deliver up to 5.3% throughput gain and up to 10.8% time to first token (TTFT) improvement depending on workload
  • Fused RoPE and MLA KV-cache write optimization for DeepSeek-style models
  • GDN attention decode speedup for Qwen3-Next models
  • Sliding window attention optimization
  • FlashInfer DeepGEMM SM90 improvements
  • Non-uniform memory access (NUMA) interleaved memory support for multi-socket systems
Quantization advances
  • Marlin support extended to Turing (sm75) architecture
  • Added Quark int4-fp8 w4a8 MoE support
  • Added MXFP4 W4A16 support for dense models
  • Added ModelOpt FP8 variants including FP8_PER_CHANNEL_PER_TOKEN and FP8_PB_WO
  • Improved KV cache quantization handling
AI accelerator and platform hardware updates
  • NVIDIA:

    • Added RTX PRO 4500 Blackwell Server Edition GPU support
    • Added Mixture of Experts (MoE) kernel configurations for the B300 Blackwell and SM103 architectures, support for which was introduced in Red Hat AI Inference Server 3.3.0
  • AMD: Added AITER RMSNorm fusion, moriio connector, and xgrammar upstream support
  • Intel: Added XPU FP8 streaming quantization and custom workers
  • CPU: Added support for head sizes 80 and 112
IBM Spyre accelerator feature updates
  • Added chunked prefill and prefix caching support for IBM Spyre accelerators. When using pre-compiled model caches, the supported chunk lengths are:

    • 1024 for x86 and Power architectures
    • 512 for IBM Z (s390x) architecture

      This feature improves inference performance by enabling reuse of previously computed prefix states during model execution.
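
A deployment sketch for the feature above. The chunked-prefill and prefix-caching flags are standard vLLM options; whether the Spyre plugin requires them to be set explicitly, and the model name used here, are assumptions.

```shell
# Enable chunked prefill and prefix caching with a chunk length that
# matches the pre-compiled model cache for the target architecture
# (1024 on x86 and Power, 512 on s390x):
vllm serve ibm-granite/granite-3.1-8b-instruct \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --max-num-batched-tokens 1024
```
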

Large-scale serving updates
  • Added Extended Dual-Batch Overlap (XBO) implementation for improved large-scale serving
  • Added NVIDIA Inference Xfer Library (NIXL) asymmetric tensor parallelism, supporting different tensor-parallel-size values for prefill and decode
  • Added heterogeneous BlockSize and KV layout support
  • Added LMCache KV cache registration
API and compatibility changes
  • PyTorch 2.9.1 is now required; wheels are compiled by default against CUDA 12.9 (cu129)
  • Deprecated quantization schemes have been removed
  • The compressed-tensors dependency has been updated to version 0.13.0
  • Speculative decoding now rejects unsupported sampling parameters instead of silently ignoring them. Existing speculative decoding configurations that use unsupported parameters will fail after upgrading
  • Added model inspection view via the VLLM_LOG_MODEL_INSPECTION=1 environment variable to examine modules, attention backends, and quantization
  • Added /server_info endpoint for environment information
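
The inspection variable and the new endpoint can be combined when debugging a deployment. The model name and the default port 8000 are assumptions in this sketch:

```shell
# Log modules, attention backends, and quantization details at startup:
VLLM_LOG_MODEL_INSPECTION=1 vllm serve Qwen/Qwen2.5-1.5B-Instruct &

# Once the server is up, query the new environment endpoint:
curl http://localhost:8000/server_info
```
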
Security fixes
  • Prevented token leakage in crash logs
  • Enforced weights_only=True in torch.load for safer model loading

2.2. New Red Hat AI Model Optimization Toolkit developer features

Red Hat AI Model Optimization Toolkit 3.4.0-ea.1 packages the upstream LLM Compressor v0.9.0.2 release. You can review the complete list of updates in the upstream LLM Compressor v0.9.0.2 release notes.

LLM Compressor v0.9.0.2 is a maintenance release that updates the Pillow dependency upper bound to version 12.1.1 for compatibility with newer versions of the image processing library. No new features or breaking changes are included in this release.

2.3. Distributed Inference with llm-d (Developer Preview)

Red Hat AI Inference Server 3.4.0-ea.1 introduces Distributed Inference with llm-d, enabling scalable, high-performance AI workloads on external Kubernetes services.

Distributed Inference with llm-d is a distributed inference framework for running Generative AI (GenAI) models across multiple cluster nodes on supported Kubernetes platforms.

You can deploy Distributed Inference with llm-d on Azure Kubernetes Service (AKS), CoreWeave Kubernetes Service (CKS), or OpenShift Container Platform using Helm charts. For more information, see Red Hat AI Inference on managed Kubernetes.

Important

Distributed Inference with llm-d is a Developer Preview feature only. Developer Preview features are not supported by Red Hat in any way and are not functionally complete or production-ready. Do not use Developer Preview features for production or business-critical workloads. Developer Preview features provide early access to upcoming product features in advance of their possible inclusion in a Red Hat product offering, enabling customers to test functionality and provide feedback during the development process. These features might not have any documentation, are subject to change or removal at any time, and testing is limited. Red Hat might provide ways to submit feedback on Developer Preview features without an associated SLA.

2.4. Known issues

  • The FlashInfer kernel sampler was disabled by default in Red Hat AI Inference Server 3.2.3 to address non-deterministic behavior and correctness errors in model output.

    This change affects sampling behavior when using FlashInfer top-p and top-k sampling methods. If required, you can enable FlashInfer by setting the VLLM_USE_FLASHINFER_SAMPLER environment variable at runtime:

    VLLM_USE_FLASHINFER_SAMPLER=1
  • AMD ROCm AI accelerators do not support inference serving encoder-decoder models when using the vLLM v1 inference engine.

    Encoder-decoder model architectures cause NotImplementedError failures with AMD ROCm accelerators because ROCm attention backends support only decoder-only attention.

    Affected models include, but are not limited to, the following:

    • Speech-to-text Whisper models, for example openai/whisper-large-v3-turbo and mistralai/Voxtral-Mini-3B-2507
    • Vision-language models, for example microsoft/Phi-3.5-vision-instruct
    • Translation models, for example T5, BART, MarianMT
    • Any models using cross-attention or an encoder-decoder architecture
  • Inference fails for MP3 and M4A file formats. When querying audio models with these file formats, the system returns a "format not recognized" error.

    {"error":{"message":"Error opening <_io.BytesIO object at 0x7fc052c821b0>: Format not recognised.","type":"Internal Server Error","param":null,"code":500}}

    This issue affects audio transcription models such as openai/whisper-large-v3 and mistralai/Voxtral-Small-24B-2507. To work around this issue, convert audio files to WAV format before processing.

  • Jemalloc consumes more memory than glibc when deploying models on IBM Spyre AI accelerators.

    When deploying models with jemalloc as the memory allocator, overall memory usage is significantly higher than when using glibc. In testing, jemalloc increased memory consumption by more than 50% compared to glibc. To work around this issue, disable jemalloc by unsetting the LD_PRELOAD environment variable so the system uses glibc as the memory allocator instead.

  • GPT-OSS model produces empty or gibberish responses when using multiple GPUs.

    When deploying the GPT-OSS model with tensor parallelism greater than 1, the model produces empty or incorrect output. This issue is related to the Triton attention kernel. To work around this issue, use the --no-enable-prefix-caching CLI argument when running the model.

  • Google TPU inference supports only a limited set of model architectures.

    Native TPU inference currently supports a subset of model architectures. Unsupported models might fail to load or might result in reduced performance. Additionally, native TPU inference support is not available for some Google models.

    For the list of models with native TPU support, see vLLM TPU Recommended Models and Features.
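
As noted in the audio known issue above, the workaround is to convert MP3 or M4A files to WAV before sending them to the server. One way to do this, assuming ffmpeg is available; the 16 kHz mono settings are an assumption based on what Whisper-family models are trained on:

```shell
# Convert an MP3 to 16 kHz mono WAV before sending it to the
# transcription endpoint:
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
```
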

Chapter 3. Asynchronous errata updates

Security, bug fix, and enhancement updates for Red Hat AI Inference Server are released as asynchronous errata through the Red Hat Network. All Red Hat AI Inference Server errata are available on the Red Hat Customer Portal. See the Red Hat AI Inference Server Life Cycle for more information about asynchronous errata.

You can enable errata notifications in your Red Hat Customer Portal account settings.

Note

You must register your hosts and configure them to consume Red Hat AI Inference Server entitlements on the Red Hat Customer Portal for the errata notification emails to be generated.

Legal Notice

Copyright © Red Hat.
Except as otherwise noted below, the text of and illustrations in this documentation are licensed by Red Hat under the Creative Commons Attribution-ShareAlike 3.0 Unported license. If you distribute this document or an adaptation of it, you must provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, the Red Hat logo, JBoss, Hibernate, and RHCE are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
XFS is a trademark or registered trademark of Hewlett Packard Enterprise Development LP or its subsidiaries in the United States and other countries.
The OpenStack® Word Mark and OpenStack logo are trademarks or registered trademarks of the Linux Foundation, used under license.
All other trademarks are the property of their respective owners.