Getting started
Getting started with Red Hat Enterprise Linux AI
Preface
Red Hat Enterprise Linux AI is a bootc container image that optimizes serving and inferencing with Large Language Models (LLMs) on your platform of choice. Using RHEL AI, you can serve and inference models in a way that boosts performance while reducing costs.
Before you deploy Red Hat Enterprise Linux AI, review Supported product and hardware configurations.
Red Hat Enterprise Linux AI supports NVIDIA CUDA and AMD ROCm AI accelerators only.
Chapter 1. About RHEL AI
Red Hat Enterprise Linux AI is a portable bootc image built on Red Hat Enterprise Linux (RHEL) that you can use for inference serving large language models (LLMs) in the cloud or on bare metal. RHEL AI leverages the upstream vLLM project, which provides state-of-the-art inferencing and model compression features. RHEL AI is validated and certified as part of the Red Hat AI portfolio.
Red Hat Enterprise Linux AI integrates the following Red Hat AI features:
- Red Hat AI Inference Server
- Run your choice of models across AI accelerators and Linux environments.
- Red Hat AI Model Optimization Toolkit
- Compress models to optimize AI accelerators and compute, reducing compute costs while maintaining high model accuracy.
- Pre-optimized validated models
- With Red Hat Enterprise Linux AI, you have access to a collection of near-upstream optimized models ready for inference deployment with support for vLLM and validated hardware. Models are available in multiple formats, including Hugging Face models, ModelCar container images, and OCI artifact images.
Chapter 2. Model serving options for RHEL AI
When deploying models with Red Hat Enterprise Linux AI and Red Hat AI Inference Server, you can choose from different approaches to supply a model for inference serving. Understanding the differences between each approach helps you select the right one for your deployment scenario.
2.1. Hugging Face models
Hugging Face models are the recommended approach for RHEL AI deployments that use the Red Hat AI Inference Server systemd Quadlet service. With this approach, you can either download models directly from Hugging Face Hub at runtime, or pre-download models to the local file system for offline serving.
Use Hugging Face models in the following scenarios:
- You are deploying RHEL AI using the Red Hat AI Inference Server systemd Quadlet service.
- You want to serve models from Hugging Face Hub (online mode) or from a local directory (offline mode).
- You want to use Red Hat AI Model Optimization Toolkit to quantize and compress models before serving.
2.2. ModelCar container images
ModelCar container images are OCI-compliant container images that package language models as standard container images. You can pull ModelCar images from registry.redhat.io using podman pull and mount them directly into the Red Hat AI Inference Server vLLM container by using the --mount type=image option.
For a list of available ModelCar container images, see ModelCar container images.
Use ModelCar container images in the following scenarios:
- You are running Red Hat AI Inference Server with Podman directly, outside of the RHEL AI Quadlet service.
- You want to package and distribute models as container images.
- You want a container-native workflow for model distribution and versioning.
For example, to pull and run a ModelCar image with Podman:
Pull the ModelCar image:
$ podman pull registry.redhat.io/rhelai1/modelcar-granite-8b-code-instruct:1.4
Inference serve the ModelCar container image:
$ podman run --rm -it \
    --device nvidia.com/gpu=all \
    --security-opt=label=disable \
    --shm-size=4g \
    --userns=keep-id:uid=1001 \
    -p 8000:8000 \
    -e HF_HUB_OFFLINE=1 \
    -e TRANSFORMERS_OFFLINE=1 \
    --mount type=image,source=registry.redhat.io/rhelai1/modelcar-granite-8b-code-instruct:1.4,destination=/model \
    registry.redhat.io/rhaii-early-access/vllm-cuda-rhel9:3.4.0-ea.1 \
    --model /model/models \
    --port 8000
Note: The --device nvidia.com/gpu=all option is specific to NVIDIA GPUs. If you are using AMD ROCm GPUs, use --device /dev/kfd --device /dev/dri instead and set --tensor-parallel-size to match the number of available GPUs. For the complete AMD ROCm deployment procedure, see Serving and inferencing with Podman using AMD ROCm AI accelerators.
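To pick a value for --tensor-parallel-size on an AMD ROCm host, you need the number of visible GPUs. The following minimal sketch counts DRM render nodes from Python; the /dev/dri/renderD* pattern is a common heuristic, not an official interface, and it can also match integrated GPUs, so verify the result with a tool such as rocm-smi:

```python
import glob

def rocm_gpu_count() -> int:
    """Count DRM render nodes, a rough proxy for visible AMD GPUs.

    Heuristic only: integrated GPUs also expose render nodes, so
    cross-check the result with rocm-smi before serving.
    """
    return len(glob.glob("/dev/dri/renderD*"))

if __name__ == "__main__":
    # Fall back to 1 when no render nodes are visible (e.g. in a VM).
    print(f"--tensor-parallel-size {rocm_gpu_count() or 1}")
```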
2.3. OCI artifact images
OCI artifact images use the OCI artifact specification to distribute model weights as container registry artifacts rather than as container images. OCI artifact images are distinct from ModelCar container images and require different tooling.
OCI artifact images are designed for use with Red Hat OpenShift AI on OpenShift Container Platform, where the model serving infrastructure handles artifact retrieval natively. If you are using Red Hat AI Inference Server with Podman on RHEL AI, use either Hugging Face models or ModelCar container images instead.
Use OCI artifact images in the following scenarios:
- You are deploying models on OpenShift Container Platform with Red Hat OpenShift AI.
- Your deployment platform supports OCI artifact retrieval natively.
Chapter 3. Product and version compatibility
The following table lists the supported software versions for Red Hat Enterprise Linux AI.
Table 3.1. Product and version compatibility
| Red Hat Enterprise Linux AI version | vLLM core version | LLM Compressor version |
|---|---|---|
| 3.4.0-ea.1 | v0.14.1 | v0.9.0.2 |
| 3.3 | v0.13.0 | v0.9.0.1 |
| 3.2 | v0.11.2 | v0.8.1 |
| 3.0 | v0.11.0 | v0.8.1 |
Red Hat Enterprise Linux AI skips version 3.1 to align with the Red Hat AI product release schedule.
Chapter 4. Reviewing Red Hat AI Inference Server Python packages
You can review the Python packages installed in the Red Hat AI Inference Server container image by running the container with Podman and reviewing the output of the pip list command.
Prerequisites
- You are logged in as a user with sudo access.
- You have access to registry.redhat.io and have logged in.
Procedure
Run the Red Hat Enterprise Linux AI container image with the pip list command to view all installed Python packages. For example:

$ podman run --rm --entrypoint=/bin/bash \
    registry.redhat.io/rhaii-early-access/vllm-cuda-rhel9:3.4.0-ea.1 \
    -c "pip list"
To view detailed information about a specific package, run the Podman command with pip show <package_name>. For example:

$ podman run --rm --entrypoint=/bin/bash \
    registry.redhat.io/rhaii-early-access/vllm-cuda-rhel9:3.4.0-ea.1 \
    -c "pip show vllm"
Example output
Name: vllm
Version: v0.14.1
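The same package information is available programmatically from Python's standard library, which can be handy in scripts that run inside the container. A minimal sketch (importlib.metadata has been in the standard library since Python 3.8; the vllm package name is taken from the pip show output above):

```python
from importlib import metadata

def package_version(name: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"

if __name__ == "__main__":
    # Inside the container this prints the vllm version shown by pip show.
    print("vllm:", package_version("vllm"))
```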
Chapter 5. Enabling the Red Hat AI Inference Server systemd Quadlet service
You can enable the Red Hat AI Inference Server systemd Quadlet service to inference serve language models with NVIDIA CUDA or AMD ROCm AI accelerators on your RHEL AI instance. After you configure the service, the service automatically starts on system boot.
Prerequisites
- You have deployed a Red Hat Enterprise Linux AI instance with NVIDIA CUDA or AMD ROCm AI accelerators installed.
- You are logged in as a user with sudo access.
- You have a Hugging Face access token. You can obtain a token from Hugging Face settings.
You do not need to create cache or model folders for Red Hat AI Inference Server or Red Hat AI Model Optimization Toolkit. On first boot, the following folders are created with the correct permissions for model serving:
/var/lib/rhaiis/cache
/var/lib/rhaiis/models
Procedure
- Open a shell prompt on the RHEL AI server.
Review the images that are shipped with Red Hat Enterprise Linux AI. For example, run the following command:
[cloud-user@localhost ~]$ podman images
A list of shipped images is returned.
Make a copy of the example configuration file:
[cloud-user@localhost ~]$ sudo cp /etc/containers/systemd/rhaiis.container.d/install.conf.example /etc/containers/systemd/rhaiis.container.d/install.conf
Edit the configuration file and update with the required parameters:
[cloud-user@localhost ~]$ sudo vi /etc/containers/systemd/rhaiis.container.d/install.conf
[Container]
# Set to 1 to run in offline mode and disable model downloading at runtime.
# Default value is 0.
Environment=HF_HUB_OFFLINE=0
# Update with the required authentication token for downloading models from Hugging Face.
Environment=HUGGING_FACE_HUB_TOKEN=<YOUR_HUGGING_FACE_HUB_TOKEN>
# Set to 1 to disable vLLM usage statistics collection. Default value is 0.
# Environment=VLLM_NO_USAGE_STATS=1
# Configure the vLLM server arguments
Exec=--model meta-llama/Llama-3.2-1B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 4096
PublishPort=8000:8000
ShmSize=4G

[Install]
WantedBy=multi-user.target

Use the following table to understand the required parameters to set:
Table 5.1. Red Hat AI Inference Server configuration parameters
| Parameter | Description |
|---|---|
| HF_HUB_OFFLINE | Set to 1 to run in offline mode and disable model downloading at runtime. Default value is 0. |
| HUGGING_FACE_HUB_TOKEN | Required authentication token for downloading models from Hugging Face. |
| VLLM_NO_USAGE_STATS | Set to 1 to disable vLLM usage statistics collection. Default value is 0. |
| --model | vLLM server argument for the model identifier or local path to the model to serve, for example, meta-llama/Llama-3.2-1B-Instruct or /opt/app-root/src/models/<MODEL_NAME>. |
| --tensor-parallel-size | Number of AI accelerators to use for tensor parallelism when serving the model. Default value is 1. |
| --max-model-len | Maximum model length (context size). This depends on available AI accelerator memory. The default value is 131072, but lower values such as 4096 might be better for accelerators with less memory. |
Note: See vLLM server arguments for the complete list of server arguments that you can configure.
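Most of the accelerator memory cost of a large --max-model-len comes from the KV cache. The following back-of-the-envelope sketch estimates the KV cache size of one sequence at full context length; the default figures (16 layers, 8 KV heads, head dimension 64, 2-byte elements) are assumed illustrative values in the style of a small Llama model, so check your model's config.json for the real numbers:

```python
def kv_cache_bytes(max_len: int, layers: int = 16, kv_heads: int = 8,
                   head_dim: int = 64, dtype_bytes: int = 2) -> int:
    """Estimate KV cache bytes for one sequence at full context length.

    The leading 2 accounts for storing both keys and values per layer.
    Defaults are illustrative small-Llama-style values (assumed).
    """
    return 2 * layers * kv_heads * head_dim * dtype_bytes * max_len

if __name__ == "__main__":
    # Compare the vLLM default context length with the lower value
    # used in the example configuration above.
    for n in (4096, 131072):
        gib = kv_cache_bytes(n) / 2**30
        print(f"max-model-len {n}: ~{gib:.2f} GiB KV cache per sequence")
```

Under these assumptions a single 131072-token sequence needs roughly 32 times the KV cache memory of a 4096-token one, which is why lowering --max-model-len helps on accelerators with less memory.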
Reload the systemd configuration:
[cloud-user@localhost ~]$ sudo systemctl daemon-reload
Start the Red Hat AI Inference Server systemd service. The WantedBy=multi-user.target setting in the Quadlet configuration also starts the service automatically on boot:

[cloud-user@localhost ~]$ sudo systemctl start rhaiis
Verification
Check the service status:
[cloud-user@localhost ~]$ sudo systemctl status rhaiis
Example output
● rhaiis.service - Red Hat AI Inference Server (vLLM)
     Loaded: loaded (/etc/containers/systemd/rhaiis.container; generated)
     Active: active (running) since Wed 2025-11-12 12:19:01 UTC; 1min 22s ago
       Docs: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux_ai/
    Process: 2391 ExecStartPre=/usr/libexec/rhaiis/check-lib.sh (code=exited, status=0/SUCCESS)

Monitor the service logs to verify that the model is loaded and the vLLM server is running:
[cloud-user@localhost ~]$ sudo podman logs -f rhaiis
Example output
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
Test the inference server API:
[cloud-user@localhost ~]$ curl -X POST -H "Content-Type: application/json" \
    -d '{
      "prompt": "What is the capital of France?",
      "max_tokens": 50
    }' \
    http://localhost:8000/v1/completions | jq

Example output
{
  "id": "cmpl-81f99f3c28d34f99a4c2d154d6bac822",
  "object": "text_completion",
  "created": 1762952825,
  "model": "RedHatAI/granite-3.3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "text": "\n\nThe capital of France is Paris.",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 18,
    "completion_tokens": 11,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
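When scripting against the /v1/completions endpoint, the useful fields are the generated text and the token usage. A minimal parsing sketch using only the standard library; the SAMPLE payload is an abridged copy of the response shown above, and extract_completion is a hypothetical helper name, not part of vLLM:

```python
import json

# Abridged copy of the /v1/completions response shown above.
SAMPLE = """
{
  "model": "RedHatAI/granite-3.3-8b-instruct",
  "choices": [{"index": 0, "text": "\\n\\nThe capital of France is Paris.",
               "finish_reason": "stop"}],
  "usage": {"prompt_tokens": 7, "total_tokens": 18, "completion_tokens": 11}
}
"""

def extract_completion(raw: str) -> tuple[str, int]:
    """Return the generated text and the completion token count."""
    data = json.loads(raw)
    return data["choices"][0]["text"].strip(), data["usage"]["completion_tokens"]

if __name__ == "__main__":
    text, tokens = extract_completion(SAMPLE)
    print(f"{text!r} ({tokens} completion tokens)")
```

In a real client you would feed the raw body returned by curl or urllib into extract_completion instead of the embedded sample.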
Chapter 6. Downloading a model from Hugging Face before running Red Hat AI Inference Server
You can download a model from Hugging Face Hub to the local file system before starting the Red Hat AI Inference Server service, and then serve it in offline mode. This approach is useful when you want the model available locally before the service starts, or when running in environments with restricted internet access.
Prerequisites
- You are logged in as a user with sudo access.
- You have a Hugging Face Hub token. You can obtain a token from Hugging Face settings.
- You have enabled the Red Hat AI Inference Server systemd Quadlet service.
Procedure
- Open a shell prompt on the RHEL AI server.
Stop the Red Hat AI Inference Server service:
[cloud-user@localhost ~]$ sudo systemctl stop rhaiis
Open a command prompt inside the Red Hat AI Inference Server container:
[cloud-user@localhost ~]$ sudo podman run -it --rm \
    -e HF_TOKEN=<YOUR_HUGGING_FACE_HUB_TOKEN> \
    -v /var/lib/rhaiis/cache:/opt/app-root/src/.cache:Z \
    -v /var/lib/rhaiis/models:/opt/app-root/src/models:Z \
    --entrypoint /bin/bash \
    registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.4.0-ea.1
Note: You use the sudo command because the download writes to directories owned by the root group.

Inside the container, set HF_HUB_OFFLINE to 0. Run the following command:

(app-root) /opt/app-root$ export HF_HUB_OFFLINE=0
Download the model to the default directory. For example:
(app-root) /opt/app-root$ hf download RedHatAI/granite-3.3-8b-instruct \
    --local-dir /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct \
    --token $HF_TOKEN
Note: The rhaiis/vllm-cuda-rhel9, rhaiis/model-opt-cuda-rhel9, and rhaiis/vllm-rocm-rhel9 containers have the same version of the Hugging Face CLI available.

Exit the container:
exit
Edit the Red Hat AI Inference Server configuration file to use the downloaded model in offline mode:
[cloud-user@localhost ~]$ sudo vi /etc/containers/systemd/rhaiis.container.d/install.conf
Update the configuration to enable offline mode and use the local model path:
[Container]
# Set to 1 to run in offline mode and disable model downloading at runtime
Environment=HF_HUB_OFFLINE=1
# Token is not required when running in offline mode with a local model
# Environment=HUGGING_FACE_HUB_TOKEN=
# Configure vLLM to use the locally downloaded model
Exec=--model /opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct \
  --tensor-parallel-size 1 \
  --served-model-name RedHatAI/granite-3.3-8b-instruct \
  --max-model-len 4096
PublishPort=8000:8000
ShmSize=4G

[Install]
WantedBy=multi-user.target

Note: When you set the model location, you must set the location to the folder that is mapped inside the Red Hat AI Inference Server container, /opt/app-root/src/models/.
[cloud-user@localhost ~]$ sudo systemctl daemon-reload
Start the Red Hat AI Inference Server service:
[cloud-user@localhost ~]$ sudo systemctl start rhaiis
Verification
Monitor the service logs to verify the vLLM server is using the local model:
[cloud-user@localhost ~]$ sudo podman logs -f rhaiis
Example output
(APIServer pid=1) INFO 11-12 14:05:33 [utils.py:233] non-default args: {'model': '/opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct', 'max_model_len': 4096, 'served_model_name': ['RedHatAI/granite-3.3-8b-instruct']}

Test the inference server API:
[cloud-user@localhost ~]$ curl -X POST -H "Content-Type: application/json" \
    -d '{
      "prompt": "What is the capital of France?",
      "max_tokens": 50
    }' \
    http://localhost:8000/v1/completions | jq

Example output
{
  "id": "cmpl-f3e12cc62bee438c86af676332f8fe55",
  "object": "text_completion",
  "created": 1762956836,
  "model": "RedHatAI/granite-3.3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "text": "\n\nThe capital of France is Paris.",
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null,
      "token_ids": null,
      "prompt_logprobs": null,
      "prompt_token_ids": null
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 18,
    "completion_tokens": 11,
    "prompt_tokens_details": null
  },
  "kv_transfer_params": null
}
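Note that the model field in the response reports the --served-model-name value (RedHatAI/granite-3.3-8b-instruct), not the local path passed to --model. A small sketch of a client-side sanity check for this; the response fragment is taken from the output above, and served_name_ok is a hypothetical helper name:

```python
import json

# Fragment of the /v1/completions response shown above.
RESPONSE_FRAGMENT = '{"model": "RedHatAI/granite-3.3-8b-instruct"}'

def served_name_ok(raw: str, expected: str) -> bool:
    """Check that the server reports the expected public model name."""
    return json.loads(raw)["model"] == expected

if __name__ == "__main__":
    assert served_name_ok(RESPONSE_FRAGMENT, "RedHatAI/granite-3.3-8b-instruct")
    # The local filesystem path must not leak into client-visible responses.
    assert not served_name_ok(
        RESPONSE_FRAGMENT,
        "/opt/app-root/src/models/red-hat-ai-granite-3.3-8b-instruct",
    )
    print("served model name verified")
```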
Chapter 7. Compressing language models with Red Hat AI Model Optimization Toolkit
Quantize and compress large language models with llm-compressor compression recipes by using Red Hat AI Model Optimization Toolkit.
Prerequisites
- You have deployed a Red Hat Enterprise Linux AI instance with NVIDIA CUDA AI accelerators installed.
- You are logged in as a user with sudo access.
- You have access to the registry.redhat.io image registry and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
This example compression procedure uses the meta-llama/Meta-Llama-3-8B-Instruct model with the llama3_example.py compression recipe. To use this model, you must request access from the meta-llama/Meta-Llama-3-8B-Instruct Hugging Face page.
Procedure
- Open a shell prompt on the RHEL AI server.
Stop the Red Hat AI Inference Server service:
[cloud-user@localhost ~]$ sudo systemctl stop rhaiis
Create a working directory:
[cloud-user@localhost ~]$ mkdir -p model-opt
Change permissions on the project folder and enter the folder:
[cloud-user@localhost ~]$ chmod 775 model-opt && cd model-opt
Add the compression recipe Python script. For example, create the following example.py file that compresses the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model in quantized FP8 format:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation
import os

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Configure the quantization algorithm and scheme
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

# Create log directory in a writable location
LOG_DIR = "./sparse_logs"
os.makedirs(LOG_DIR, exist_ok=True)

# Apply quantization
oneshot(model=model, recipe=recipe)

# Confirm quantized model looks OK
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=20)
print(tokenizer.decode(output[0]))
print("==========================================")

# Save to disk in compressed-tensors format
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)

Export your Hugging Face token:
$ export HF=<YOUR_HUGGING_FACE_TOKEN>
Run the compression recipe using the Red Hat AI Model Optimization Toolkit container:
[cloud-user@localhost ~]$ sudo podman run -it \
    -v ./model-opt:/opt/app-root/model-opt:z \
    -v /var/lib/rhaiis/models:/opt/app-root/models:z \
    --device nvidia.com/gpu=all \
    --workdir /opt/app-root/model-opt \
    -e HF_HOME=/opt/app-root/models \
    -e HF_TOKEN=$HF \
    --entrypoint python \
    registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.4.0-ea.1 \
    example.py
Verification
Monitor the compression run for successful completion and any error messages. The quantization process outputs progress information and saves the compressed model to the ./model-opt folder.

Example output

2025-11-12T21:09:20.276558+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 154it [00:02, 59.18it/s]
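After the run completes, the saved model directory is in compressed-tensors format, where quantization details are recorded in config.json. A minimal sketch of a post-run check; the exact layout of quantization_config is an llm-compressor implementation detail, so this only tests for the key's presence, and the demo uses a stand-in config.json rather than a real model directory:

```python
import json
import tempfile
from pathlib import Path

def is_compressed(model_dir: str) -> bool:
    """Return True if the model's config.json declares a quantization_config."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return "quantization_config" in config

# Demo with a stand-in config.json; on the server you would instead point
# is_compressed() at ./model-opt/TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "config.json").write_text(
        json.dumps({"model_type": "llama", "quantization_config": {}})
    )
    print(is_compressed(demo_dir))
```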