Getting started

Red Hat Enterprise Linux AI 3.4

Getting started with Red Hat Enterprise Linux AI

Red Hat RHEL AI Documentation Team

Abstract

Learn how to use Red Hat Enterprise Linux AI for model serving and inferencing.

Preface

Red Hat Enterprise Linux AI is a bootc container image that optimizes serving and inferencing with Large Language Models (LLMs) on your platform of choice. Using RHEL AI, you can serve and inference models in a way that boosts performance while reducing costs.

Important

Before you deploy Red Hat Enterprise Linux AI, review Supported product and hardware configurations.

Red Hat Enterprise Linux AI supports NVIDIA CUDA and AMD ROCm AI accelerators only.

Chapter 1. About RHEL AI

Red Hat Enterprise Linux AI is a portable bootc image built on Red Hat Enterprise Linux (RHEL) that you can use for inference serving large language models (LLMs) in the cloud or on bare metal. RHEL AI leverages the upstream vLLM project, which provides state-of-the-art inferencing and model compression features. RHEL AI is validated and certified as part of the Red Hat AI portfolio.

Red Hat Enterprise Linux AI integrates the following Red Hat AI features:

Red Hat AI Inference Server
Run your choice of models across AI accelerators and Linux environments.
Red Hat AI Model Optimization Toolkit
Compress models to optimize AI accelerators and compute, reducing compute costs while maintaining high model accuracy.
Pre-optimized validated models
With Red Hat Enterprise Linux AI, you have access to a collection of near-upstream optimized models ready for inference deployment with support for vLLM and validated hardware. Models are available in multiple formats, including Hugging Face models, ModelCar container images, and OCI artifact images.

Chapter 2. Model serving options for RHEL AI

When deploying models with Red Hat Enterprise Linux AI and Red Hat AI Inference Server, you can choose from different approaches to supply a model for inference serving. Understanding the differences between each approach helps you select the right one for your deployment scenario.

2.1. Hugging Face models

Hugging Face models are the recommended approach for RHEL AI deployments that use the Red Hat AI Inference Server systemd Quadlet service. With this approach, you can either download models directly from Hugging Face Hub at runtime, or pre-download models to the local file system for offline serving.

Use Hugging Face models in the following scenarios:

  • You are deploying RHEL AI using the Red Hat AI Inference Server systemd Quadlet service.
  • You want to serve models from Hugging Face Hub (online mode) or from a local directory (offline mode).
  • You want to use Red Hat AI Model Optimization Toolkit to quantize and compress models before serving.
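
If you choose to pre-download a model instead of letting the service download it at runtime, you can script the download with the huggingface_hub Python library. The following is a minimal sketch only; the model ID and target directory are illustrative, and the supported CLI-based procedure is described in Downloading a model from Hugging Face before running Red Hat AI Inference Server:

    import os
    from huggingface_hub import snapshot_download

    # Minimal sketch: download the model weights and tokenizer files for offline serving.
    # The model ID and local directory are examples only; adjust them for your deployment.
    snapshot_download(
        repo_id="RedHatAI/granite-3.3-8b-instruct",
        local_dir="/var/lib/rhaiis/models/red-hat-ai-granite-3.3-8b-instruct",
        token=os.environ.get("HF_TOKEN"),
    )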

2.2. ModelCar container images

ModelCar container images are OCI-compliant container images that package language models as standard container images. You can pull ModelCar images from registry.redhat.io using podman pull and mount them directly into the Red Hat AI Inference Server vLLM container by using the --mount type=image option.

Note

For a list of available ModelCar container images, see ModelCar container images.

Use ModelCar container images in the following scenarios:

  • You are running Red Hat AI Inference Server with Podman directly, outside of the RHEL AI Quadlet service.
  • You want to package and distribute models as container images.
  • You want a container-native workflow for model distribution and versioning.

For example, to pull and run a ModelCar image with Podman:

  • Pull the ModelCar image:

    $ podman pull registry.redhat.io/rhelai1/modelcar-granite-8b-code-instruct:1.4
  • Inference serve the ModelCar container image:

    $ podman run --rm -it \
      --device nvidia.com/gpu=all \
      --security-opt=label=disable \
      --shm-size=4g \
      --userns=keep-id:uid=1001 \
      -p 8000:8000 \
      -e HF_HUB_OFFLINE=1 \
      -e TRANSFORMERS_OFFLINE=1 \
      --mount type=image,source=registry.redhat.io/rhelai1/modelcar-granite-8b-code-instruct:1.4,destination=/model \
      registry.redhat.io/rhaii-early-access/vllm-cuda-rhel9:3.4.0-ea.1 \
      --model /model/models \
      --port 8000
    Note

    The --device nvidia.com/gpu=all option is specific to NVIDIA GPUs. If you are using AMD ROCm GPUs, use --device /dev/kfd --device /dev/dri instead and set --tensor-parallel-size to match the number of available GPUs. For the complete AMD ROCm deployment procedure, see Serving and inferencing with Podman using AMD ROCm AI accelerators.

2.3. OCI artifact images

OCI artifact images use the OCI artifact specification to distribute model weights as container registry artifacts rather than as container images. OCI artifact images are distinct from ModelCar container images and require different tooling.

Important

OCI artifact images are designed for use with Red Hat OpenShift AI on OpenShift Container Platform, where the model serving infrastructure handles artifact retrieval natively. If you are using Red Hat AI Inference Server with Podman on RHEL AI, use either Hugging Face models or ModelCar container images instead.

Use OCI artifact images in the following scenarios:

  • You are deploying models on OpenShift Container Platform with Red Hat OpenShift AI.
  • Your deployment platform supports OCI artifact retrieval natively.

Chapter 3. Product and version compatibility

The following table lists the supported software versions for Red Hat Enterprise Linux AI.

Table 3.1. Product and version compatibility

Red Hat Enterprise Linux AI version | vLLM core version | LLM Compressor version
3.4.0-ea.1                          | v0.14.1           | v0.9.0.2
3.3                                 | v0.13.0           | v0.9.0.1
3.2                                 | v0.11.2           | v0.8.1
3.0                                 | v0.11.0           | v0.8.1

Note

Red Hat Enterprise Linux AI skips version 3.1 to align with the Red Hat AI product release schedule.

Chapter 4. Reviewing Red Hat AI Inference Server Python packages

You can review the Python packages installed in the Red Hat AI Inference Server container image by running the container with Podman and reviewing the pip list output.

Prerequisites

  • You are logged in as a user with sudo access.
  • You have access to registry.redhat.io and have logged in.

Procedure

  1. Run the Red Hat AI Inference Server container image with the pip list command to view all installed Python packages. For example:

    $ podman run --rm --entrypoint=/bin/bash \
      registry.redhat.io/rhaii-early-access/vllm-cuda-rhel9:3.4.0-ea.1 \
      -c "pip list"
  2. To view detailed information about a specific package, run the Podman command with pip show <package_name>. For example:

    $ podman run --rm --entrypoint=/bin/bash \
      registry.redhat.io/rhaii-early-access/vllm-cuda-rhel9:3.4.0-ea.1 \
      -c "pip show vllm"

    Example output

    Name: vllm
    Version: v0.14.1
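
As an alternative to pip, you can enumerate the installed packages programmatically from a Python shell inside the container. The following is a minimal sketch that uses only the Python standard library:

    from importlib.metadata import distributions

    # Minimal sketch: collect every installed distribution and print name and version,
    # sorted by package name.
    packages = {
        dist.metadata["Name"]: dist.version
        for dist in distributions()
        if dist.metadata["Name"]
    }
    for name in sorted(packages, key=str.lower):
        print(name, packages[name])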

Chapter 5. Enabling the Red Hat AI Inference Server systemd Quadlet service

You can enable the Red Hat AI Inference Server systemd Quadlet service to inference serve language models with NVIDIA CUDA or AMD ROCm AI accelerators on your RHEL AI instance. After you configure the service, the service automatically starts on system boot.

Prerequisites

Note

You do not need to create cache or model folders for Red Hat AI Inference Server or Red Hat AI Model Optimization Toolkit. On first boot, the following folders are created with the correct permissions for model serving:

/var/lib/rhaiis/cache
/var/lib/rhaiis/models

Procedure

  1. Open a shell prompt on the RHEL AI server.
  2. Review the images that are shipped with Red Hat Enterprise Linux AI. For example, run the following command:

    [cloud-user@localhost ~]$ podman images

    A list of shipped images is returned.

  3. Make a copy of the example configuration file:

    [cloud-user@localhost ~]$ sudo cp /etc/containers/systemd/rhaiis.container.d/install.conf.example /etc/containers/systemd/rhaiis.container.d/install.conf
  4. Edit the configuration file and update with the required parameters:

    [cloud-user@localhost ~]$ sudo vi /etc/containers/systemd/rhaiis.container.d/install.conf
    [Container]
    # Set to 1 to run in offline mode and disable model downloading at runtime.
    # Default value is 0.
    Environment=HF_HUB_OFFLINE=0
    
    # Update with the required authentication token for downloading models from Hugging Face.
    Environment=HUGGING_FACE_HUB_TOKEN=<YOUR_HUGGING_FACE_HUB_TOKEN>
    
    # Set to 1 to disable vLLM usage statistics collection. Default value is 0.
    # Environment=VLLM_NO_USAGE_STATS=1
    
    # Configure the vLLM server arguments
    Exec=--model meta-llama/Llama-3.2-1B-Instruct \
         --tensor-parallel-size 1 \
         --max-model-len 4096
    
    PublishPort=8000:8000
    ShmSize=4G
    
    [Install]
    WantedBy=multi-user.target

    Use the following table to understand the required parameters to set:

    Table 5.1. Red Hat AI Inference Server configuration parameters

    HF_HUB_OFFLINE: Set to 1 to run in offline mode and disable model downloading at runtime. Default value is 0.

    HUGGING_FACE_HUB_TOKEN: Required authentication token for downloading models from Hugging Face.

    VLLM_NO_USAGE_STATS: Set to 1 to disable vLLM usage statistics collection. Default value is 0.

    --model: vLLM server argument for the model identifier or local path to the model to serve, for example, meta-llama/Llama-3.2-1B-Instruct or /opt/app-root/src/models/<MODEL_NAME>.

    --tensor-parallel-size: Number of AI accelerators to use for tensor parallelism when serving the model. Default value is 1.

    --max-model-len: Maximum model length (context size). This depends on available AI accelerator memory. The default value is 131072, but lower values such as 4096 might be better for accelerators with less memory.

    Note

    See vLLM server arguments for the complete list of server arguments that you can configure.

  5. Reload the systemd configuration:

    [cloud-user@localhost ~]$ sudo systemctl daemon-reload
  6. Start the Red Hat AI Inference Server systemd service. The [Install] section of the Quadlet configuration ensures that the service also starts automatically at boot:

    [cloud-user@localhost ~]$ sudo systemctl start rhaiis

Verification

  1. Check the service status:

    [cloud-user@localhost ~]$ sudo systemctl status rhaiis

    Example output

    ● rhaiis.service - Red Hat AI Inference Server (vLLM)
         Loaded: loaded (/etc/containers/systemd/rhaiis.container; generated)
         Active: active (running) since Wed 2025-11-12 12:19:01 UTC; 1min 22s ago
           Docs: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_ai/
        Process: 2391 ExecStartPre=/usr/libexec/rhaiis/check-lib.sh (code=exited, status=0/SUCCESS)

  2. Monitor the service logs to verify the model is loaded and vLLM server is running:

    [cloud-user@localhost ~]$ sudo podman logs -f rhaiis

    Example output

    (APIServer pid=1) INFO:     Started server process [1]
    (APIServer pid=1) INFO:     Waiting for application startup.
    (APIServer pid=1) INFO:     Application startup complete.

  3. Test the inference server API:

    [cloud-user@localhost ~]$ curl -X POST -H "Content-Type: application/json" -d '{
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' http://localhost:8000/v1/completions | jq

    Example output

    {
      "id": "cmpl-81f99f3c28d34f99a4c2d154d6bac822",
      "object": "text_completion",
      "created": 1762952825,
      "model": "RedHatAI/granite-3.3-8b-instruct",
      "choices": [
        {
          "index": 0,
          "text": "\n\nThe capital of France is Paris.",
          "logprobs": null,
          "finish_reason": "stop",
          "stop_reason": null,
          "token_ids": null,
          "prompt_logprobs": null,
          "prompt_token_ids": null
        }
      ],
      "service_tier": null,
      "system_fingerprint": null,
      "usage": {
        "prompt_tokens": 7,
        "total_tokens": 18,
        "completion_tokens": 11,
        "prompt_tokens_details": null
      },
      "kv_transfer_params": null
    }
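
If you prefer a scripted check over curl, the following is a minimal Python sketch that sends the same completion request using only the standard library. The endpoint and payload mirror the curl example above:

    import json
    import urllib.request

    # Minimal sketch: send the same completion request that the curl example sends.
    payload = {"prompt": "What is the capital of France?", "max_tokens": 50}
    request = urllib.request.Request(
        "http://localhost:8000/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

    # Send the request and pretty-print the JSON response.
    with urllib.request.urlopen(request) as response:
        print(json.dumps(json.load(response), indent=2))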

Chapter 6. Downloading a model from Hugging Face before running Red Hat AI Inference Server

You can download a model from Hugging Face Hub before starting the Red Hat AI Inference Server service and then serve the model in offline mode. This approach is useful when you want the model files available on the local file system before the service starts, or when you run in environments with restricted internet access.

Prerequisites

Procedure

  1. Open a shell prompt on the RHEL AI server.
  2. Stop the Red Hat AI Inference Server service:

    [cloud-user@localhost ~]$ sudo systemctl stop rhaiis
  3. Open a command prompt inside the Red Hat AI Inference Server container:

    [cloud-user@localhost ~]$ sudo podman run -it --rm \
      -e HF_TOKEN=<YOUR_HUGGING_FACE_HUB_TOKEN> \
      -v /var/lib/rhaiis/cache:/opt/app-root/src/.cache:Z \
      -v /var/lib/rhaiis/models:/opt/app-root/src/models:Z \
      --entrypoint /bin/bash \
      registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.4.0-ea.1
    Note

    You use the sudo command because the download writes to directories owned by the root group.

  4. Inside the container, set HF_HUB_OFFLINE to 0. Run the following command:

    (app-root) /opt/app-root$ export HF_HUB_OFFLINE=0
  5. Download the model to the default directory. For example:

    (app-root) /opt/app-root$ hf download RedHatAI/granite-3.3-8b-instruct \
      --local-dir /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct \
      --token $HF_TOKEN
    Note

    The rhaiis/vllm-cuda-rhel9, rhaiis/model-opt-cuda-rhel9, and rhaiis/vllm-rocm-rhel9 containers have the same version of the Hugging Face CLI available.

  6. Exit the container:

    exit
  7. Edit the Red Hat AI Inference Server configuration file to use the downloaded model in offline mode:

    [cloud-user@localhost ~]$ sudo vi /etc/containers/systemd/rhaiis.container.d/install.conf

    Update the configuration to enable offline mode and use the local model path:

    [Container]
    # Set to 1 to run in offline mode and disable model downloading at runtime
    Environment=HF_HUB_OFFLINE=1
    
    # Token is not required when running in offline mode with a local model
    # Environment=HUGGING_FACE_HUB_TOKEN=
    
    # Configure vLLM to use the locally downloaded model
    Exec=--model /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct \
         --tensor-parallel-size 1 \
         --served-model-name RedHatAI/granite-3.3-8b-instruct \
         --max-model-len 4096
    
    PublishPort=8000:8000
    ShmSize=4G
    
    [Install]
    WantedBy=multi-user.target
    Note

    When you set the model location, you must set the location to the folder that is mapped inside the Red Hat AI Inference Server container, /opt/app-root/src/models/.

  8. Reload the systemd configuration:

    [cloud-user@localhost ~]$ sudo systemctl daemon-reload
  9. Start the Red Hat AI Inference Server service:

    [cloud-user@localhost ~]$ sudo systemctl start rhaiis

Verification

  1. Monitor the service logs to verify the vLLM server is using the local model:

    [cloud-user@localhost ~]$ sudo podman logs -f rhaiis

    Example output

    (APIServer pid=1) INFO 11-12 14:05:33 [utils.py:233] non-default args: {'model': '/opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct', 'max_model_len': 4096, 'served_model_name': ['RedHatAI/granite-3.3-8b-instruct']}

  2. Test the inference server API:

    [cloud-user@localhost ~]$ curl -X POST -H "Content-Type: application/json" -d '{
        "prompt": "What is the capital of France?",
        "max_tokens": 50
    }' http://localhost:8000/v1/completions | jq

    Example output

    {
      "id": "cmpl-f3e12cc62bee438c86af676332f8fe55",
      "object": "text_completion",
      "created": 1762956836,
      "model": "RedHatAI/granite-3.3-8b-instruct",
      "choices": [
        {
          "index": 0,
          "text": "\n\nThe capital of France is Paris.",
          "logprobs": null,
          "finish_reason": "stop",
          "stop_reason": null,
          "token_ids": null,
          "prompt_logprobs": null,
          "prompt_token_ids": null
        }
      ],
      "service_tier": null,
      "system_fingerprint": null,
      "usage": {
        "prompt_tokens": 7,
        "total_tokens": 18,
        "completion_tokens": 11,
        "prompt_tokens_details": null
      },
      "kv_transfer_params": null
    }
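
To confirm that the server reports the model name that you configured with --served-model-name, you can also query the model listing endpoint of the OpenAI-compatible API, which vLLM typically exposes at /v1/models. A minimal Python sketch using only the standard library:

    import json
    import urllib.request

    # Minimal sketch: list the models that the inference server is currently serving.
    with urllib.request.urlopen("http://localhost:8000/v1/models") as response:
        models = json.load(response)

    # Each "id" should match the configured --served-model-name,
    # for example RedHatAI/granite-3.3-8b-instruct.
    for model in models["data"]:
        print(model["id"])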

Chapter 7. Compressing language models with Red Hat AI Model Optimization Toolkit

Quantize and compress large language models with llm-compressor compression recipes by using Red Hat AI Model Optimization Toolkit.

Prerequisites

  • You have deployed a Red Hat Enterprise Linux AI instance with NVIDIA CUDA AI accelerators installed.
  • You are logged in as a user with sudo access.
  • You have access to the registry.redhat.io image registry and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
Note

This example compression procedure is based on the llama3_example.py compression recipe, which uses the meta-llama/Meta-Llama-3-8B-Instruct model. To use that model, you must request access from the meta-llama/Meta-Llama-3-8B-Instruct Hugging Face page.

Procedure

  1. Open a shell prompt on the RHEL AI server.
  2. Stop the Red Hat AI Inference Server service:

    [cloud-user@localhost ~]$ sudo systemctl stop rhaiis
  3. Create a working directory:

    [cloud-user@localhost ~]$ mkdir -p model-opt
  4. Change permissions on the project folder and enter the folder:

    [cloud-user@localhost ~]$ chmod 775 model-opt && cd model-opt
  5. Add the compression recipe Python script. For example, create the following example.py file that compresses the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model in quantized FP8 format:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier
    from llmcompressor.utils import dispatch_for_generation
    
    import os
    
    MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    
    # Configure the quantization algorithm and scheme
    recipe = QuantizationModifier(
        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
    )
    
    # Create log directory in a writable location
    LOG_DIR = "./sparse_logs"
    os.makedirs(LOG_DIR, exist_ok=True)
    
    # Apply quantization
    oneshot(model=model, recipe=recipe)
    
    # Confirm quantized model looks OK
    print("========== SAMPLE GENERATION ==============")
    dispatch_for_generation(model)
    input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
        model.device
    )
    output = model.generate(input_ids, max_new_tokens=20)
    print(tokenizer.decode(output[0]))
    print("==========================================")
    
    # Save to disk in compressed-tensors format
    SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-Dynamic"
    model.save_pretrained(SAVE_DIR)
    tokenizer.save_pretrained(SAVE_DIR)
  6. Export your Hugging Face token:

    $ export HF=<YOUR_HUGGING_FACE_TOKEN>
  7. Run the compression recipe using the Red Hat AI Model Optimization Toolkit container:

    [cloud-user@localhost ~]$ sudo podman run -it \
      -v ~/model-opt:/opt/app-root/model-opt:z \
      -v /var/lib/rhaiis/models:/opt/app-root/models:z \
      --device nvidia.com/gpu=all \
      --workdir /opt/app-root/model-opt \
      -e HF_HOME=/opt/app-root/models \
      -e HF_TOKEN=$HF \
      --entrypoint python \
      registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.4.0-ea.1 \
      example.py

Verification

  • Monitor the compression run for successful completion and any error messages. The quantization process outputs progress information and saves the compressed model to the model-opt working directory on the host.

    Example output

    2025-11-12T21:09:20.276558+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
    Compressing model: 154it [00:02, 59.18it/s]
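
You can also confirm that the saved model carries a quantization configuration by inspecting its config.json file. The following is a minimal Python sketch; it assumes the compressed model was saved to the host directory ~/model-opt/TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic, as in the example recipe, and that the compressed-tensors format records the scheme under the quantization_config key:

    import json
    from pathlib import Path

    # Minimal sketch: path to the compressed model that the example recipe saves (host side).
    model_dir = Path.home() / "model-opt" / "TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic"

    # compressed-tensors checkpoints record the quantization scheme in config.json.
    config = json.loads((model_dir / "config.json").read_text())
    print(json.dumps(config.get("quantization_config", {}), indent=2))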

Legal Notice

Copyright © Red Hat.
Except as otherwise noted below, the text of and illustrations in this documentation are licensed by Red Hat under the Creative Commons Attribution-Share Alike 3.0 Unported license. If you distribute this document or an adaptation of it, you must provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, the Red Hat logo, JBoss, Hibernate, and RHCE are trademarks or registered trademarks of Red Hat, LLC. or its subsidiaries in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
XFS is a trademark or registered trademark of Hewlett Packard Enterprise Development LP or its subsidiaries in the United States and other countries.
The OpenStack® Word Mark and OpenStack logo are trademarks or registered trademarks of the Linux Foundation, used under license.
All other trademarks are the property of their respective owners.