Choose a validated model for reliable serving
Red Hat AI validated models
Chapter 1. About Red Hat AI validated models
Red Hat AI validated models have been tested and verified to work correctly across supported hardware and product configurations. These models are available as Hugging Face downloads, as OCI artifact images, and as ModelCar container images. Platform-specific validated models are also available for IBM Spyre on IBM Power and IBM Z systems.
If you are using AI Inference Server as part of a RHEL AI deployment, use OCI artifact images.
If you are using AI Inference Server as part of an OpenShift AI deployment, use ModelCar images.
Red Hat uses GuideLLM for performance benchmarking and Language Model Evaluation Harness for accuracy evaluations.
Explore the Red Hat AI validated models collections on Hugging Face.
AMD GPUs support FP8 (W8A8) and GGUF quantized model variants only. For more information, see Supported hardware.
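AI Inference Server serves validated models through an OpenAI-compatible API. The following minimal sketch shows how a client might query a model that is already being served; the endpoint URL and the model identifier are placeholders to adjust for your own deployment, not values defined by this document.

```python
from openai import OpenAI

# Placeholder endpoint and credentials: point these at your running
# AI Inference Server instance.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    # Hypothetical model ID; use the validated model you actually deployed.
    model="RedHatAI/Llama-3.3-70B-Instruct-FP8-dynamic",
    messages=[{"role": "user", "content": "Summarize what a validated model is."}],
)
print(response.choices[0].message.content)
```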
Chapter 2. Red Hat AI validated models - February 2026
The following models, available from Red Hat AI on Hugging Face, are validated for use with Red Hat AI Inference Server.
Table 2.1. Red Hat AI validated models - February 2026 collection
| Model | Quantized variants |
|---|---|
| granite-4.0-h-small | FP8 |
| granite-4.0-h-tiny | FP8 |
| Ministral-3-14B-Instruct-2512 | None |
| Phi-4-reasoning | FP8 |
| Qwen3-Next-80B-A3B-Instruct | INT4 |
| Qwen3-VL-235B-A22B-Instruct-NVFP4 | None |
Chapter 3. Red Hat AI validated models - January 2026
The following models, available from Red Hat AI on Hugging Face, are validated for use with Red Hat AI Inference Server.
Table 3.1. Red Hat AI validated models - January 2026 collection
| Model | Quantized variants |
|---|---|
| Apertus-8B-Instruct-2509 | FP8 |
| Mistral-Large-3-675B-Instruct-2512 | None |
| Mistral-Large-3-675B-Instruct-2512-NVFP4 | None |
| NVIDIA-Nemotron-3-Nano-30B-A3B | FP8 |
Chapter 4. NVFP4 Models
The following models, available from Red Hat AI on Hugging Face, are validated for use with Red Hat AI Inference Server.
Table 4.1. NVFP4 Models collection
| Model | Quantized variants |
|---|---|
| Mistral-Large-3-675B-Instruct-2512-NVFP4 | None |
| Qwen3-VL-235B-A22B-Instruct-NVFP4 | None |
Chapter 5. Red Hat AI validated models - October 2025 collection
The following models, available from Red Hat AI on Hugging Face, are validated for use with Red Hat AI Inference Server.
Table 5.1. Red Hat AI validated models - October 2025 collection
| Model | Quantized variants |
|---|---|
| gpt-oss-120b | None |
| gpt-oss-20b | None |
| NVIDIA-Nemotron-Nano-9B-v2 | INT4, FP8 |
| Qwen3-Coder-480B-A35B-Instruct | FP8 |
| Voxtral-Mini-3B-2507 | FP8 |
| whisper-large-v3-turbo | INT4 |
Chapter 6. Validated models on Hugging Face - September 2025 collection
The following models, available from Red Hat AI on Hugging Face, are validated for use with Red Hat AI Inference Server.
Table 6.1. Red Hat AI validated models - September 2025 collection
| Model | Quantized variants |
|---|---|
| DeepSeek-R1-0528 | INT4 |
| gemma-3n-E4B-it | FP8 |
| Kimi-K2-Instruct | INT4 |
| Qwen3-8B | FP8 |
Chapter 7. Validated models on Hugging Face - May 2025 collection
The following models, available from Red Hat AI on Hugging Face, are validated for use with Red Hat AI Inference Server.
Table 7.1. Red Hat AI validated models - May 2025 collection
| Model | Quantized variants |
|---|---|
| gemma-2-9b-it | FP8 |
| granite-3.1-8b-base | INT4 |
| granite-3.1-8b-instruct | INT4, INT8, FP8 |
| Llama-3.1-8B-Instruct | None |
| Llama-3.1-Nemotron-70B-Instruct-HF | FP8 |
| Llama-3.3-70B-Instruct | INT4, INT8, FP8 |
| Llama-4-Maverick-17B-128E-Instruct | FP8 |
| Llama-4-Scout-17B-16E-Instruct | INT4, FP8 |
| Meta-Llama-3.1-8B-Instruct | INT4, INT8, FP8 |
| Mistral-Small-24B-Instruct-2501 | INT4, INT8, FP8 |
| Mistral-Small-3.1-24B-Instruct-2503 | INT4, INT8, FP8 |
| Mixtral-8x7B-Instruct-v0.1 | None |
| phi-4 | INT4, INT8, FP8 |
| Qwen2.5-7B-Instruct | INT4, INT8, FP8 |
Chapter 8. Validated OCI artifact model container images
The following table lists validated OCI artifact model container images available from the Red Hat container registry, including baseline and quantized variants for each supported model.
Table 8.1. Validated OCI artifact model container images
| Model | Quantized variants |
|---|---|
| llama-4-scout-17b-16e-instruct | INT4, FP8 |
| llama-4-maverick-17b-128e-instruct | FP8 |
| mistral-small-3-1-24b-instruct-2503 | INT4, INT8, FP8 |
| llama-3-3-70b-instruct | INT4, INT8, FP8 |
| llama-3-1-8b-instruct | INT4, INT8, FP8 |
| granite-3-1-8b-instruct | INT4, INT8, FP8 |
| phi-4 | INT4, INT8, FP8 |
| qwen2-5-7b-instruct | INT4, INT8, FP8 |
| mistral-small-24b-instruct-2501 | INT4, INT8, FP8 |
| mixtral-8x7b-instruct-v0-1 | None |
| granite-3-1-8b-base | INT4 (baseline currently unavailable) |
| granite-3-1-8b-starter-v2 | None |
| llama-3-1-nemotron-70b-instruct-hf | FP8 |
| gemma-2-9b-it | FP8 |
| deepseek-r1-0528 | INT4 (baseline currently unavailable) |
| qwen3-8b | FP8 (baseline currently unavailable) |
| kimi-k2-instruct | INT4 (baseline currently unavailable) |
| gemma-3n-e4b-it | FP8 (baseline currently unavailable) |
| gpt-oss-120b | None |
| gpt-oss-20b | None |
| qwen3-coder-480b-a35b-instruct | FP8 (baseline currently unavailable) |
| whisper-large-v3-turbo | INT4 (baseline currently unavailable) |
| voxtral-mini-3b-2507 | FP8 (baseline currently unavailable) |
| nvidia-nemotron-nano-9b-v2 | FP8 (baseline currently unavailable) |
Chapter 9. Validated Red Hat AI ModelCar container images
The following table lists validated ModelCar container images, including baseline and quantized variants for each supported model.
Table 9.1. Validated Red Hat AI ModelCar container images
| Model | Quantized variants |
|---|---|
| llama-4-scout-17b-16e-instruct | INT4, FP8 |
| llama-4-maverick-17b-128e-instruct | FP8 |
| mistral-small-3-1-24b-instruct-2503 | INT4, INT8, FP8 |
| llama-3-3-70b-instruct | INT4, INT8, FP8 |
| llama-3-1-8b-instruct | INT4, INT8, FP8 |
| granite-3-1-8b-instruct | INT4, INT8, FP8 |
| phi-4 | INT4, INT8, FP8 |
| qwen2-5-7b-instruct | INT4, INT8, FP8 |
| mistral-small-24b-instruct-2501 | INT4, INT8, FP8 |
| mixtral-8x7b-instruct-v0-1 | None |
| granite-3-1-8b-base | INT4 (baseline currently unavailable) |
| granite-3-1-8b-starter-v2 | None |
| llama-3-1-nemotron-70b-instruct-hf | FP8 |
| gemma-2-9b-it | FP8 |
| deepseek-r1-0528 | INT4 (baseline currently unavailable) |
| qwen3-8b | FP8 (baseline currently unavailable) |
| kimi-k2-instruct | INT4 (baseline currently unavailable) |
| gemma-3n-e4b-it | FP8 |
| gpt-oss-120b | None |
| gpt-oss-20b | None |
| qwen3-coder-480b-a35b-instruct | FP8 (baseline currently unavailable) |
| whisper-large-v3-turbo | INT4 (baseline currently unavailable) |
| voxtral-mini-3b-2507 | FP8 (baseline currently unavailable) |
| nvidia-nemotron-nano-9b-v2 | FP8 (baseline currently unavailable) |
| phi-4-reasoning | FP8 (baseline currently unavailable) |
| qwen3-vl-235b-a22b-instruct-nvfp4 | None |
| qwen3-next-80b-a3b-instruct | INT4 (baseline currently unavailable) |
| granite-4-0-h-tiny | FP8 |
| granite-4-0-h-small | FP8 |
| mistral-large-3-675b-instruct-2512 | None |
| mistral-large-3-675b-instruct-2512-nvfp4 | None |
| apertus-8b-instruct-2509 | FP8 (baseline currently unavailable) |
| nvidia-nemotron-3-nano-30b-a3b | FP8 (baseline currently unavailable) |
| ministral-3-14b-instruct-2512 | None |
Chapter 10. Validated models for x86_64 CPU inference serving
The following large language models have been validated for use with Red Hat AI Inference Server on x86_64 CPUs with AVX2 instruction set support.
x86_64 CPU inference is best suited for smaller models, typically under 3 billion parameters, that can run efficiently without GPU acceleration. Performance depends on your CPU specifications, available system RAM, and model size. For larger models or production workloads that require high throughput, consider GPU acceleration.
Inference serving on x86_64 CPUs is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
Table 10.1. Validated models for inferencing with x86_64 CPU
| Model | Hugging Face model card | Number of parameters |
|---|---|---|
| TinyLlama-1.1B-Chat-v1.0 | TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 1.1B |
| Llama-3.2-1B-Instruct | meta-llama/Llama-3.2-1B-Instruct | 1B |
| granite-3.2-2b-instruct | ibm-granite/granite-3.2-2b-instruct | 2B |
| TinyLlama-1.1B-Chat-v1.0-pruned2.4 | RedHatAI/TinyLlama-1.1B-Chat-v1.0-pruned2.4 | 1.1B (pruned) |
| TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds | RedHatAI/TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds | 1.1B (pruned + quantized) |
| opt-125m | facebook/opt-125m | 125M |
| Qwen2-0.5B-Instruct-AWQ | Qwen/Qwen2-0.5B-Instruct-AWQ | 0.5B |
Quantization formats that require GPU-specific kernels, such as the Marlin format, are not supported for CPU inference. Use the AWQ or GPTQ quantization formats, which are compatible with CPU execution.
The following table provides general guidance for approximate system RAM requirements based on model size:
Table 10.2. Memory requirements for inference serving with x86_64 CPU
| Model size | Minimum RAM | Recommended RAM |
|---|---|---|
| 125M - 500M | 8 GB | 16 GB |
| 500M - 1B | 16 GB | 32 GB |
| 1B - 3B | 32 GB | 64 GB |
Actual memory usage depends on the model architecture, context length, and batch size. Increase the VLLM_CPU_KVCACHE_SPACE environment variable (value in GiB) to allocate more memory for the key-value cache when using longer context lengths.
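As an illustration of these settings, the following minimal sketch runs a small model from Table 10.1 on a CPU with the offline API of vLLM, the inference engine that AI Inference Server is built on. It assumes a Python environment with the vllm package installed; the KV cache size of 8 GiB is an example value, not a recommendation.

```python
import os

# Reserve extra memory for the key-value cache before vLLM initializes.
# The value is in GiB; 8 is an illustrative choice for longer contexts.
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "8"

from vllm import LLM, SamplingParams

# A small CPU-friendly model from Table 10.1. An AWQ-quantized model such as
# Qwen/Qwen2-0.5B-Instruct-AWQ could be used instead, per the note above.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", dtype="bfloat16")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain what a key-value cache does."], params)
print(outputs[0].outputs[0].text)
```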
Chapter 11. Validated models for use with IBM Power and IBM Spyre AI accelerators
The following large language models are supported for IBM Power systems with IBM Spyre AI accelerators.
IBM Spyre AI accelerator cards support FP16 format model weights only. For compatible models, the Red Hat AI Inference Server inference engine automatically converts weights to FP16 at startup. No additional configuration is needed.
Table 11.1. IBM Granite models for use with IBM Spyre AI accelerators
| Model | Hugging Face model card |
|---|---|
| granite-3.3-8b-instruct | ibm-granite/granite-3.3-8b-instruct |
| granite-embedding-30m-english | ibm-granite/granite-embedding-30m-english |
| granite-embedding-107m-multilingual | ibm-granite/granite-embedding-107m-multilingual |
| granite-embedding-125m-english | ibm-granite/granite-embedding-125m-english |
| granite-embedding-278m-multilingual | ibm-granite/granite-embedding-278m-multilingual |
Table 11.2. Reranker models for use with IBM Spyre AI accelerators
| Model | Hugging Face model card |
|---|---|
| bge-reranker-v2-m3 | BAAI/bge-reranker-v2-m3 |
Pre-built IBM Granite models run with the specific Python packages that are included in the Red Hat AI Inference Server Spyre container image. The models are tied to fixed configurations for Spyre card count, batch size, and input/output context sizes.
Updating or replacing Python packages in the Red Hat AI Inference Server Spyre container image is not supported.
Chapter 12. Validated models for use with IBM Z and IBM Spyre AI accelerators
The following large language models are supported for IBM Z systems with IBM Spyre AI accelerators.
IBM Spyre AI accelerator cards support FP16 format model weights only. For compatible models, the Red Hat AI Inference Server inference engine automatically converts weights to FP16 at startup. No additional configuration is needed.
Table 12.1. Decoder models for use with IBM Spyre AI accelerators
| Model | Hugging Face model card |
|---|---|
| granite-3.3-8b-instruct | ibm-granite/granite-3.3-8b-instruct |
Pre-built IBM Granite models run with the specific Python packages that are included in the Red Hat AI Inference Server Spyre container image. The models are tied to fixed configurations for Spyre card count, batch size, and input/output context sizes.
Updating or replacing Python packages in the Red Hat AI Inference Server Spyre container image is not supported.
Chapter 13. Validated models for geospatial inference with TerraTorch
The following IBM and NASA Prithvi geospatial foundation models are validated for use with AI Inference Server and TerraTorch.
Prithvi-EO-2.0 models use the Vision Transformer (ViT) architecture and require TerraTorch as the model implementation backend. These models accept GeoTIFF imagery as input and return segmentation predictions.
Table 13.1. Prithvi geospatial models for use with TerraTorch
| Model | Use case | Hugging Face model card | Validated on |
|---|---|---|---|
| Prithvi-EO-2.0-300M-TL-Sen1Floods11 | Flood detection and mapping | Prithvi-EO-2.0-300M-TL-Sen1Floods11 | RHAIIS 3.3 |
| Prithvi-EO-2.0-300M-BurnScars | Burn scar detection | Prithvi-EO-2.0-300M-BurnScars | RHAIIS 3.3 |
Explore the IBM and NASA geospatial models collection on Hugging Face.
Prithvi geospatial models are validated for use with NVIDIA CUDA AI accelerators only.
These models require specific vLLM server arguments to function correctly. You must include --skip-tokenizer-init, --enforce-eager, and --enable-mm-embeds when starting the inference server.
For the complete list of required server arguments, see TerraTorch configuration options for geospatial model serving and Serving TerraTorch Models with vLLM.
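A minimal launch sketch follows, assuming the vllm CLI from the AI Inference Server image is on the PATH. The model identifier is a hypothetical example, and additional arguments from the references above, such as selecting the TerraTorch model implementation, may also be required.

```python
import subprocess

# Hypothetical Prithvi model ID; substitute the model you deploy from Table 13.1.
model = "ibm-nasa-geospatial/Prithvi-EO-2.0-300M-TL-Sen1Floods11"

# Start the vLLM-based server with the three required arguments named above.
subprocess.run(
    [
        "vllm", "serve", model,
        "--skip-tokenizer-init",
        "--enforce-eager",
        "--enable-mm-embeds",
    ],
    check=True,
)
```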