Getting started

Red Hat Enterprise Linux AI 3.4

Getting started with Red Hat Enterprise Linux AI

Red Hat RHEL AI Documentation Team

Abstract

Learn how to use Red Hat Enterprise Linux AI for model serving and inferencing.

Preface

Red Hat Enterprise Linux AI is a bootc container image that optimizes serving and inferencing with Large Language Models (LLMs) on your platform of choice. Using RHEL AI, you can serve and inference models in a way that boosts performance while reducing costs.

Important

Before you deploy Red Hat Enterprise Linux AI, review Supported product and hardware configurations.

Red Hat Enterprise Linux AI supports NVIDIA CUDA and AMD ROCm AI accelerators only.

Chapter 1. About RHEL AI

Red Hat Enterprise Linux AI is a portable bootc image built on Red Hat Enterprise Linux (RHEL) that you can use for inference serving large language models (LLMs) in the cloud or on bare metal. RHEL AI leverages the upstream vLLM project, which provides state-of-the-art inferencing and model compression features. RHEL AI is validated and certified as part of the Red Hat AI portfolio.

Red Hat Enterprise Linux AI integrates the following Red Hat AI features:

Red Hat AI Inference Server
Run your choice of models across AI accelerators and Linux environments.
Red Hat AI Model Optimization Toolkit
Compress models to optimize AI accelerators and compute, reducing compute costs while maintaining high model accuracy.
Pre-optimized validated models
With Red Hat Enterprise Linux AI, you have access to a collection of near-upstream optimized models ready for inference deployment with support for vLLM and validated hardware. Models are available in multiple formats, including Hugging Face models, ModelCar container images, and OCI artifact images.

Chapter 2. Model serving options for RHEL AI

When deploying models with Red Hat Enterprise Linux AI and Red Hat AI Inference Server, you can choose from different approaches to supply a model for inference serving. Understanding the differences between each approach helps you select the right one for your deployment scenario.

2.1. Hugging Face models

Hugging Face models are the recommended approach for RHEL AI deployments that use the Red Hat AI Inference Server systemd Quadlet service. With this approach, you can either download models directly from Hugging Face Hub at runtime, or pre-download models to the local file system for offline serving.

Use Hugging Face models in the following scenarios:

  • You are deploying RHEL AI using the Red Hat AI Inference Server systemd Quadlet service.
  • You want to serve models from Hugging Face Hub (online mode) or from a local directory (offline mode).
  • You want to use Red Hat AI Model Optimization Toolkit to quantize and compress models before serving.
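
If you choose to pre-download a model instead of letting the service download it at runtime, you can script the download with the huggingface_hub Python library. The following is a minimal sketch only; the model ID and target directory are illustrative, and the supported CLI-based procedure is described in Downloading a model from Hugging Face before running Red Hat AI Inference Server:

    import os
    from huggingface_hub import snapshot_download

    # Minimal sketch: download the model weights and tokenizer files for offline serving.
    # The model ID and local directory are examples only; adjust them for your deployment.
    snapshot_download(
        repo_id="RedHatAI/granite-3.3-8b-instruct",
        local_dir="/var/lib/rhaiis/models/red-hat-ai-granite-3.3-8b-instruct",
        token=os.environ.get("HF_TOKEN"),
    )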

2.2. ModelCar container images

ModelCar container images are OCI-compliant container images that package language models as standard container images. You can pull ModelCar images from registry.redhat.io using podman pull and mount them directly into the Red Hat AI Inference Server vLLM container by using the --mount type=image option.

Note

For a list of available ModelCar container images, see ModelCar container images.

Use ModelCar container images in the following scenarios:

  • You are running Red Hat AI Inference Server with Podman directly, outside of the RHEL AI Quadlet service.
  • You want to package and distribute models as container images.
  • You want a container-native workflow for model distribution and versioning.

For example, to pull and run a ModelCar image with Podman:

  • Pull the ModelCar image:

    $ podman pull registry.redhat.io/rhelai1/modelcar-granite-8b-code-instruct:1.4
  • Inference serve the ModelCar container image:

    $ podman run --rm -it \
      --device nvidia.com/gpu=all \
      --security-opt=label=disable \
      --shm-size=4g \
      --userns=keep-id:uid=1001 \
      -p 8000:8000 \
      -e HF_HUB_OFFLINE=1 \
      -e TRANSFORMERS_OFFLINE=1 \
      --mount type=image,source=registry.redhat.io/rhelai1/modelcar-granite-8b-code-instruct:1.4,destination=/model \
      registry.redhat.io/rhaii-early-access/vllm-cuda-rhel9:3.4.0-ea.1 \
      --model /model/models \
      --port 8000
    Note

    The --device nvidia.com/gpu=all option is specific to NVIDIA GPUs. If you are using AMD ROCm GPUs, use --device /dev/kfd --device /dev/dri instead and set --tensor-parallel-size to match the number of available GPUs. For the complete AMD ROCm deployment procedure, see Serving and inferencing with Podman using AMD ROCm AI accelerators.

2.3. OCI artifact images

OCI artifact images use the OCI artifact specification to distribute model weights as container registry artifacts rather than as container images. OCI artifact images are distinct from ModelCar container images and require different tooling.

Important

OCI artifact images are designed for use with Red Hat OpenShift AI on OpenShift Container Platform, where the model serving infrastructure handles artifact retrieval natively. If you are using Red Hat AI Inference Server with Podman on RHEL AI, use either Hugging Face models or ModelCar container images instead.

Use OCI artifact images in the following scenarios:

  • You are deploying models on OpenShift Container Platform with Red Hat OpenShift AI.
  • Your deployment platform supports OCI artifact retrieval natively.

Chapter 3. Product and version compatibility

The following table lists the supported software versions for Red Hat Enterprise Linux AI.

Table 3.1. Product and version compatibility

Red Hat Enterprise Linux AI version | vLLM core version | LLM Compressor version
3.4.0-ea.1                          | v0.14.1           | v0.9.0.2
3.3                                 | v0.13.0           | v0.9.0.1
3.2                                 | v0.11.2           | v0.8.1
3.0                                 | v0.11.0           | v0.8.1

Note

Red Hat Enterprise Linux AI skips version 3.1 to align with the Red Hat AI product release schedule.

Chapter 4. Reviewing Red Hat AI Inference Server Python packages

You can review the Python packages installed in the Red Hat AI Inference Server container image by running the container with Podman and reviewing the pip list output.

Prerequisites

  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.

Procedure

  1. Run the Red Hat AI Inference Server container image with the pip list command to view all installed Python packages. For example:

    $ podman run --rm --entrypoint=/bin/bash \
      registry.redhat.io/rhaii-early-access/vllm-cuda-rhel9:3.4.0-ea.1 \
      -c "pip list"
  2. To view detailed information about a specific package, run the Podman command with pip show <package_name>. For example:

    $ podman run --rm --entrypoint=/bin/bash \
      registry.redhat.io/rhaii-early-access/vllm-cuda-rhel9:3.4.0-ea.1 \
      -c "pip show vllm"

    Example output

    Name: vllm
    Version: v0.14.1
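
As an alternative to pip, you can enumerate the installed packages programmatically from a Python shell inside the container. The following is a minimal sketch that uses only the Python standard library:

    from importlib.metadata import distributions

    # Minimal sketch: collect every installed distribution and print name and version,
    # sorted by package name.
    packages = {
        dist.metadata["Name"]: dist.version
        for dist in distributions()
        if dist.metadata["Name"]
    }
    for name in sorted(packages, key=str.lower):
        print(name, packages[name])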

Chapter 5. Enabling the Red Hat AI Inference Server systemd Quadlet service

You can enable the Red Hat AI Inference Server systemd Quadlet service to inference serve language models with NVIDIA CUDA or AMD ROCm AI accelerators on your RHEL AI instance. After you configure the service, the service automatically starts on system boot.

Prerequisites

Note

You do not need to create cache or model folders for Red Hat AI Inference Server or Red Hat AI Model Optimization Toolkit. On first boot, the following folders are created with the correct permissions for model serving:

/var/lib/rhaiis/cache
/var/lib/rhaiis/models

Procedure

  1. Open a shell prompt on the RHEL AI server.
  2. Review the images that are shipped with Red Hat Enterprise Linux AI. For example, run the following command:

    [cloud-user@localhost ~]$ podman images

    A list of shipped images is returned.

  3. Make a copy of the example configuration file:

    [cloud-user@localhost ~]$ sudo cp /etc/containers/systemd/rhaiis.container.d/install.conf.example /etc/containers/systemd/rhaiis.container.d/install.conf
  4. Edit the configuration file and update with the required parameters:

    [cloud-user@localhost ~]$ sudo vi /etc/containers/systemd/rhaiis.container.d/install.conf
    [Container]
    # Set to 1 to run in offline mode and disable model downloading at runtime.
    # Default value is 0.
    Environment=HF_HUB_OFFLINE=0
    
    # Update with the required authentication token for downloading models from Hugging Face.
    Environment=HUGGING_FACE_HUB_TOKEN=<YOUR_HUGGING_FACE_HUB_TOKEN>
    
    # Set to 1 to disable vLLM usage statistics collection. Default value is 0.
    # Environment=VLLM_NO_USAGE_STATS=1
    
    # Configure the vLLM server arguments
    Exec=--model meta-llama/Llama-3.2-1B-Instruct \
         --tensor-parallel-size 1 \
         --max-model-len 4096
    
    PublishPort=8000:8000
    ShmSize=4G
    
    [Install]
    WantedBy=multi-user.target

    Use the following table to understand the required parameters to set:

    Table 5.1. Red Hat AI Inference Server configuration parameters

    HF_HUB_OFFLINE: Set to 1 to run in offline mode and disable model downloading at runtime. Default value is 0.

    HUGGING_FACE_HUB_TOKEN: Required authentication token for downloading models from Hugging Face.

    VLLM_NO_USAGE_STATS: Set to 1 to disable vLLM usage statistics collection. Default value is 0.

    --model: vLLM server argument for the model identifier or local path to the model to serve, for example, meta-llama/Llama-3.2-1B-Instruct or /opt/app-root/src/models/<MODEL_NAME>.

    --tensor-parallel-size: Number of AI accelerators to use for tensor parallelism when serving the model. Default value is 1.

    --max-model-len: Maximum model length (context size). This depends on available AI accelerator memory. The default value is 131072, but lower values such as 4096 might be better for accelerators with less memory.

    Note

    See vLLM server arguments for the complete list of server arguments that you can configure.

  5. Reload the systemd configuration:

    [cloud-user@localhost ~]$ sudo systemctl daemon-reload
  6. Start the Red Hat AI Inference Server systemd service. The [Install] section of the Quadlet configuration ensures that the service also starts automatically at boot:

    [cloud-user@localhost ~]$ sudo systemctl start rhaiis

Verification

  1. Check the service status:

    [cloud-user@localhost ~]$ sudo systemctl status rhaiis

    Example output

    ● rhaiis.service - Red Hat AI Inference Server (vLLM)
         Loaded: loaded (/etc/containers/systemd/rhaiis.container; generated)
         Active: active (running) since Wed 2025-11-12 12:19:01 UTC; 1min 22s ago
           Docs: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_ai/
        Process: 2391 ExecStartPre=/usr/libexec/rhaiis/check-lib.sh (code=exited, status=0/SUCCESS)

  2. Monitor the service logs to verify the model is loaded and vLLM server is running:

    [cloud-user@localhost ~]$ sudo podman logs -f rhaiis

    Example output

    (APIServer pid=1) INFO:     Started server process [1]
    (APIServer pid=1) INFO:     Waiting for application startup.
    (APIServer pid=1) INFO:     Application startup complete.

  3. Test the inference server API:

    [cloud-user@localhost ~]$ curl -X POST -H "Content-Type: application/json" -d '{
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' http://localhost:8000/v1/completions | jq

    Example output

    {
      "id": "cmpl-81f99f3c28d34f99a4c2d154d6bac822",
      "object": "text_completion",
      "created": 1762952825,
      "model": "RedHatAI/granite-3.3-8b-instruct",
      "choices": [
        {
          "index": 0,
          "text": "\n\nThe capital of France is Paris.",
          "logprobs": null,
          "finish_reason": "stop",
          "stop_reason": null,
          "token_ids": null,
          "prompt_logprobs": null,
          "prompt_token_ids": null
        }
      ],
      "service_tier": null,
      "system_fingerprint": null,
      "usage": {
        "prompt_tokens": 7,
        "total_tokens": 18,
        "completion_tokens": 11,
        "prompt_tokens_details": null
      },
      "kv_transfer_params": null
    }
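
If you prefer a scripted check over curl, the following is a minimal Python sketch that sends the same completion request using only the standard library. The endpoint and payload mirror the curl example above:

    import json
    import urllib.request

    # Minimal sketch: send the same completion request that the curl example sends.
    payload = {"prompt": "What is the capital of France?", "max_tokens": 50}
    request = urllib.request.Request(
        "http://localhost:8000/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

    # Send the request and pretty-print the JSON response.
    with urllib.request.urlopen(request) as response:
        print(json.dumps(json.load(response), indent=2))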

Chapter 6. Downloading a model from Hugging Face before running Red Hat AI Inference Server

You can download a model from Hugging Face Hub before starting the Red Hat AI Inference Server service and then serve the model in offline mode. This approach is useful when you want the model files available on the local file system before the service starts, or when you run in environments with restricted internet access.

Prerequisites

Procedure

  1. Open a shell prompt on the RHEL AI server.
  2. Stop the Red Hat AI Inference Server service:

    [cloud-user@localhost ~]$ sudo systemctl stop rhaiis
  3. Open a command prompt inside the Red Hat AI Inference Server container:

    [cloud-user@localhost ~]$ sudo podman run -it --rm \
      -e HF_TOKEN=<YOUR_HUGGING_FACE_HUB_TOKEN> \
      -v /var/lib/rhaiis/cache:/opt/app-root/src/.cache:Z \
      -v /var/lib/rhaiis/models:/opt/app-root/src/models:Z \
      --entrypoint /bin/bash \
      registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.4.0-ea.1
    Note

    You use the sudo command because the download writes to directories owned by the root group.

  4. Inside the container, set HF_HUB_OFFLINE to 0. Run the following command:

    (app-root) /opt/app-root$ export HF_HUB_OFFLINE=0
  5. Download the model to the default directory. For example:

    (app-root) /opt/app-root$ hf download RedHatAI/granite-3.3-8b-instruct \
      --local-dir /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct \
      --token $HF_TOKEN
    Note

    The rhaiis/vllm-cuda-rhel9, rhaiis/model-opt-cuda-rhel9, and rhaiis/vllm-rocm-rhel9 containers have the same version of the Hugging Face CLI available.

  6. Exit the container:

    exit
  7. Edit the Red Hat AI Inference Server configuration file to use the downloaded model in offline mode:

    [cloud-user@localhost ~]$ sudo vi /etc/containers/systemd/rhaiis.container.d/install.conf

    Update the configuration to enable offline mode and use the local model path:

    [Container]
    # Set to 1 to run in offline mode and disable model downloading at runtime
    Environment=HF_HUB_OFFLINE=1
    
    # Token is not required when running in offline mode with a local model
    # Environment=HUGGING_FACE_HUB_TOKEN=
    
    # Configure vLLM to use the locally downloaded model
    Exec=--model /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct \
         --tensor-parallel-size 1 \
         --served-model-name RedHatAI/granite-3.3-8b-instruct \
         --max-model-len 4096
    
    PublishPort=8000:8000
    ShmSize=4G
    
    [Install]
    WantedBy=multi-user.target
    Note

    When you set the model location, you must set the location to the folder that is mapped inside the Red Hat AI Inference Server container, /opt/app-root/src/models/.

  8. Reload the systemd configuration:

    [cloud-user@localhost ~]$ sudo systemctl daemon-reload
  9. Start the Red Hat AI Inference Server service:

    [cloud-user@localhost ~]$ sudo systemctl start rhaiis

Verification

  1. Monitor the service logs to verify the vLLM server is using the local model:

    [cloud-user@localhost ~]$ sudo podman logs -f rhaiis

    Example output

    (APIServer pid=1) INFO 11-12 14:05:33 [utils.py:233] non-default args: {'model': '/opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct', 'max_model_len': 4096, 'served_model_name': ['RedHatAI/granite-3.3-8b-instruct']}

  2. Test the inference server API:

    [cloud-user@localhost ~]$ curl -X POST -H "Content-Type: application/json" -d '{
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' http://localhost:8000/v1/completions | jq

    Example output

    {
      "id": "cmpl-f3e12cc62bee438c86af676332f8fe55",
      "object": "text_completion",
      "created": 1762956836,
      "model": "RedHatAI/granite-3.3-8b-instruct",
      "choices": [
        {
          "index": 0,
          "text": "\n\nThe capital of France is Paris.",
          "logprobs": null,
          "finish_reason": "stop",
          "stop_reason": null,
          "token_ids": null,
          "prompt_logprobs": null,
          "prompt_token_ids": null
        }
      ],
      "service_tier": null,
      "system_fingerprint": null,
      "usage": {
        "prompt_tokens": 7,
        "total_tokens": 18,
        "completion_tokens": 11,
        "prompt_tokens_details": null
      },
      "kv_transfer_params": null
    }
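
To confirm that the server reports the model name that you configured with --served-model-name, you can also query the model listing endpoint of the OpenAI-compatible API, which vLLM typically exposes at /v1/models. A minimal Python sketch using only the standard library:

    import json
    import urllib.request

    # Minimal sketch: list the models that the inference server is currently serving.
    with urllib.request.urlopen("http://localhost:8000/v1/models") as response:
        models = json.load(response)

    # Each "id" should match the configured --served-model-name,
    # for example RedHatAI/granite-3.3-8b-instruct.
    for model in models["data"]:
        print(model["id"])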

Chapter 7. Compressing language models with Red Hat AI Model Optimization Toolkit

Quantize and compress large language models with llm-compressor compression recipes by using Red Hat AI Model Optimization Toolkit.

Prerequisites

  • You have deployed a Red Hat Enterprise Linux AI instance with NVIDIA CUDA AI accelerators installed.
  • You are logged in as a user with sudo access.
  • You have access to the registry.redhat.io image registry and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
Note

This example compression procedure is based on the llama3_example.py compression recipe, which uses the meta-llama/Meta-Llama-3-8B-Instruct model. To use that model, you must request access from the meta-llama/Meta-Llama-3-8B-Instruct Hugging Face page.

Procedure

  1. Open a shell prompt on the RHEL AI server.
  2. Stop the Red Hat AI Inference Server service:

    [cloud-user@localhost ~]$ sudo systemctl stop rhaiis
  3. Create a working directory:

    [cloud-user@localhost ~]$ mkdir -p model-opt
  4. Change permissions on the project folder and enter the folder:

    [cloud-user@localhost ~]$ chmod 775 model-opt && cd model-opt
  5. Add the compression recipe Python script. For example, create the following example.py file that compresses the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model in quantized FP8 format:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier
    from llmcompressor.utils import dispatch_for_generation
    
    import os
    
    MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    
    # Configure the quantization algorithm and scheme
    recipe = QuantizationModifier(
        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
    )
    
    # Create log directory in a writable location
    LOG_DIR = "./sparse_logs"
    os.makedirs(LOG_DIR, exist_ok=True)
    
    # Apply quantization
    oneshot(model=model, recipe=recipe)
    
    # Confirm quantized model looks OK
    print("========== SAMPLE GENERATION ==============")
    dispatch_for_generation(model)
    input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
        model.device
    )
    output = model.generate(input_ids, max_new_tokens=20)
    print(tokenizer.decode(output[0]))
    print("==========================================")
    
    # Save to disk in compressed-tensors format
    SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-Dynamic"
    model.save_pretrained(SAVE_DIR)
    tokenizer.save_pretrained(SAVE_DIR)
  6. Export your Hugging Face token:

    $ export HF=<YOUR_HUGGING_FACE_TOKEN>
  7. Run the compression recipe using the Red Hat AI Model Optimization Toolkit container:

    [cloud-user@localhost ~]$ sudo podman run -it \
      -v ~/model-opt:/opt/app-root/model-opt:z \
      -v /var/lib/rhaiis/models:/opt/app-root/models:z \
      --device nvidia.com/gpu=all \
      --workdir /opt/app-root/model-opt \
      -e HF_HOME=/opt/app-root/models \
      -e HF_TOKEN=$HF \
      --entrypoint python \
      registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.4.0-ea.1 \
      example.py

Verification

  • Monitor the compression run for successful completion and any error messages. The quantization process outputs progress information and saves the compressed model to the model-opt working directory on the host.

    Example output

    2025-11-12T21:09:20.276558+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
    Compressing model: 154it [00:02, 59.18it/s]
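
You can also confirm that the saved model carries a quantization configuration by inspecting its config.json file. The following is a minimal Python sketch; it assumes the compressed model was saved to the host directory ~/model-opt/TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic, as in the example recipe, and that the compressed-tensors format records the scheme under the quantization_config key:

    import json
    from pathlib import Path

    # Minimal sketch: path to the compressed model that the example recipe saves (host side).
    model_dir = Path.home() / "model-opt" / "TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic"

    # compressed-tensors checkpoints record the quantization scheme in config.json.
    config = json.loads((model_dir / "config.json").read_text())
    print(json.dumps(config.get("quantization_config", {}), indent=2))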

Legal Notice

Copyright © Red Hat.
Except as otherwise noted below, the text of and illustrations in this documentation are licensed by Red Hat under the Creative Commons Attribution-Share Alike 3.0 Unported license. If you distribute this document or an adaptation of it, you must provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, the Red Hat logo, JBoss, Hibernate, and RHCE are trademarks or registered trademarks of Red Hat, LLC. or its subsidiaries in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
XFS is a trademark or registered trademark of Hewlett Packard Enterprise Development LP or its subsidiaries in the United States and other countries.
The OpenStack® Word Mark and OpenStack logo are trademarks or registered trademarks of the Linux Foundation, used under license.
All other trademarks are the property of their respective owners.