Build AI/Agentic Applications with OGX

Red Hat OpenShift AI Self-Managed 3.5

Working with OGX in Red Hat OpenShift AI Self-Managed

Abstract

As a cluster administrator, you can use the OGX Operator in Red Hat OpenShift AI.

Chapter 1. Llama Stack to OGX migration

Starting in OpenShift AI version 3.5EA1, Llama Stack is renamed to OGX. The rename introduces breaking changes for any applications created with the Llama Stack Operator.

The following tables show the naming changes for components, environment variables, and status fields.

Table 1.1. Name Mapping

| Component | Previous name | New name |
| --- | --- | --- |
| API Group | llamastack.io | ogx.io |
| API Version | v1alpha1 | v1beta1 |
| Kind | LlamaStackDistribution | OGXServer |
| Plural | llamastackdistributions | ogxservers |
| Short Name | llsd | ogxserver |
| Container Name | llama-stack | ogx |
| App Label | app: llama-stack | app: ogx |
| Managed-by | llama-stack-operator | ogx-operator |
| Watch Label | llamastack.io/watch: "true" | ogx.io/watch: "true" |
| Mount Path | /.llama | /.ogx |
| Leader Election ID | 81d5736e.llamastack.io | 54e06e98.ogx.io |

Table 1.2. Environment Variables

| Previous name | New name | Additional details |
| --- | --- | --- |
| LLS_PORT | OGX_PORT | Container port for the server |
| LLS_WORKERS | OGX_WORKERS | Number of uvicorn worker processes |
| LLAMA_STACK_CONFIG | OGX_CONFIG | Path to the server config file |
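
When many manifests still carry the previous variable names, the renames in Table 1.2 can be applied mechanically. The following Python sketch is illustrative only and is not part of OGX: the rename_env_vars helper is a hypothetical name, and it assumes that the environment is a list of name/value entries, as in a container spec.

```python
# Renames taken from Table 1.2; the helper itself is a hypothetical example.
ENV_RENAMES = {
    "LLS_PORT": "OGX_PORT",
    "LLS_WORKERS": "OGX_WORKERS",
    "LLAMA_STACK_CONFIG": "OGX_CONFIG",
}


def rename_env_vars(env):
    """Return a copy of a container env list with legacy names replaced."""
    renamed = []
    for entry in env:
        updated = dict(entry)  # copy so the caller's data is not modified
        updated["name"] = ENV_RENAMES.get(entry["name"], entry["name"])
        renamed.append(updated)
    return renamed


old_env = [
    {"name": "LLS_PORT", "value": "8321"},
    {"name": "MY_VAR", "value": "hello"},
]
new_env = rename_env_vars(old_env)
print(new_env)
```

Variables that are not listed in the table, such as MY_VAR, pass through unchanged.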

Table 1.3. Status Field Changes

| Old path | New path |
| --- | --- |
| .status.version.llamaStackServerVersion | .status.version.serverVersion |
| .status.routeURL | .status.externalURL |
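
Scripts that read these status fields might need to handle both layouts during a migration. The following Python sketch shows one possible approach, assuming the resource has already been loaded as a dictionary (for example, from oc get -o json output). The read_status and lookup helpers and the example URLs are hypothetical.

```python
# Path pairs taken from Table 1.3: (new path, previous path).
STATUS_PATHS = {
    "server_version": (
        "status.version.serverVersion",
        "status.version.llamaStackServerVersion",
    ),
    "external_url": ("status.externalURL", "status.routeURL"),
}


def lookup(resource, dotted_path):
    """Walk a dotted path through nested dicts; return None if a key is missing."""
    node = resource
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def read_status(resource, field):
    """Read a status field, preferring the new path over the previous one."""
    new_path, old_path = STATUS_PATHS[field]
    value = lookup(resource, new_path)
    return value if value is not None else lookup(resource, old_path)


# Placeholder resources standing in for real cluster objects.
legacy = {"status": {"routeURL": "https://old.example.com"}}
migrated = {"status": {"externalURL": "https://new.example.com"}}
print(read_status(legacy, "external_url"))
print(read_status(migrated, "external_url"))
```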

The following YAML examples show the specification changes in OGXServer CRs for workload and network configurations.

Workload Configuration

Previous workload configuration (flat on spec):

spec:
  replicas: 2
  server:
    distribution:
      name: rh-dev
    containerSpec:
      env:
        - name: MY_VAR
          value: "hello"
    storage:
      size: "20Gi"

New workload configuration (grouped under spec.workload):

spec:
  distribution:
    name: rh-dev
  workload:
    replicas: 2
    storage:
      size: "20Gi"
    overrides:
      env:
        - name: MY_VAR
          value: "hello"
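
The regrouping shown in these two examples can be sketched as a small Python transformation. This is an illustration that covers only the fields that appear in the examples; the migrate_workload_spec helper is a hypothetical name, and any other fields in a real spec would need to be handled separately.

```python
import copy


def migrate_workload_spec(old_spec):
    """Regroup a legacy flat spec into the spec.workload layout."""
    server = old_spec.get("server", {})
    new_spec = {}
    workload = {}

    # spec.server.distribution moves to spec.distribution
    if "distribution" in server:
        new_spec["distribution"] = copy.deepcopy(server["distribution"])
    # spec.replicas moves to spec.workload.replicas
    if "replicas" in old_spec:
        workload["replicas"] = old_spec["replicas"]
    # spec.server.storage moves to spec.workload.storage
    if "storage" in server:
        workload["storage"] = copy.deepcopy(server["storage"])
    # spec.server.containerSpec.env moves to spec.workload.overrides.env
    env = server.get("containerSpec", {}).get("env")
    if env:
        workload["overrides"] = {"env": copy.deepcopy(env)}

    if workload:
        new_spec["workload"] = workload
    return new_spec


old_spec = {
    "replicas": 2,
    "server": {
        "distribution": {"name": "rh-dev"},
        "containerSpec": {"env": [{"name": "MY_VAR", "value": "hello"}]},
        "storage": {"size": "20Gi"},
    },
}
migrated_spec = migrate_workload_spec(old_spec)
print(migrated_spec)
```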

Network Configurations

Previous network configurations (spec.network):

spec:
  network:
    exposeRoute: true
    allowedFrom:
      namespaces: ["my-app"]
      labels: ["team=frontend"]

New network configurations (spec.network):

spec:
  network:
    externalAccess:
      enabled: true
    policy:
      enabled: true
      ingress:
        - from:
            - namespaceSelector:
                matchLabels:
                  kubernetes.io/metadata.name: my-app
            - namespaceSelector:
                matchLabels:
                  team: frontend
          ports:
            - protocol: TCP
              port: 8321
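
The translation between the two network layouts can be sketched in Python as follows. The mapping is inferred from the two examples only: it assumes that exposeRoute corresponds to externalAccess.enabled, that each allowed namespace becomes a selector on the kubernetes.io/metadata.name label, and that each key=value label becomes a namespace label selector. The migrate_network_spec helper is a hypothetical name.

```python
def migrate_network_spec(old_network, port=8321):
    """Translate legacy exposeRoute/allowedFrom settings into the new layout."""
    new_network = {
        "externalAccess": {"enabled": bool(old_network.get("exposeRoute", False))}
    }

    allowed = old_network.get("allowedFrom", {})
    peers = []
    # Each allowed namespace is matched by its name label.
    for namespace in allowed.get("namespaces", []):
        peers.append(
            {"namespaceSelector": {
                "matchLabels": {"kubernetes.io/metadata.name": namespace}}}
        )
    # Each "key=value" label becomes a namespace label selector.
    for label in allowed.get("labels", []):
        key, _, value = label.partition("=")
        peers.append({"namespaceSelector": {"matchLabels": {key: value}}})

    if peers:
        new_network["policy"] = {
            "enabled": True,
            "ingress": [
                {"from": peers, "ports": [{"protocol": "TCP", "port": port}]}
            ],
        }
    return new_network


old_network = {
    "exposeRoute": True,
    "allowedFrom": {"namespaces": ["my-app"], "labels": ["team=frontend"]},
}
new_network = migrate_network_spec(old_network)
print(new_network)
```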

1.1. Migrating from Llama Stack to OGX

To migrate to the newly named ogx-operator, you must remove the Llama Stack Operator and create new OGXServer custom resources (CRs).

Prerequisites

  • You have the Llama Stack Operator installed on your OpenShift AI cluster.
  • You have custom LlamaStackDistribution applications.
  • You have cluster administrator permissions.
  • You have installed the OpenShift CLI (oc).

Procedure

  1. Remove the Llama Stack Operator from your environment by setting the component spec:

    $ dsc.spec.components.lls = "Removed"
  2. Install the new OGX operator by setting the component spec:

    $ dsc.spec.components.ogx = "Managed"
  3. Create the OGXServer custom resource (CR).

    apiVersion: ogx.io/v1beta1
    kind: OGXServer
    metadata:
      name: my-server
    spec:
      distribution:
        name: rh-dev
      workload:
        replicas: 1
        storage:
          size: "20Gi"
        overrides:
          env:
            - name: OLLAMA_INFERENCE_MODEL
              value: "llama3.2:1b"
            - name: OLLAMA_URL
              value: "http://ollama-server-service.ollama-dist.svc.cluster.local:11434"
  4. Apply the OGXServer CR to the cluster:

    $ oc apply -f ogxserver.yaml

Verification

  1. Verify the deployment with the following commands:

    # Check the new CRD is registered
    $ oc get crd ogxservers.ogx.io
    
    # List OGXServer resources
    $ oc get ogxserver
    
    # Check conditions for adoption status
    $ oc get ogxserver my-server -o jsonpath='{.status.conditions}'
    
    # Verify the server is ready
    $ oc get ogxserver my-server -o jsonpath='{.status.phase}'
  2. After you verify the new OGXServer, clean up the legacy resources.

    1. Remove the LlamaStackDistribution CRs:

      $ oc delete llamastackdistribution <old-llsd-name> -n <namespace>
  3. (Optional) Adopt the existing PVC.

    1. To preserve existing data by adopting the PVC from the old LlamaStackDistribution, set the annotations parameter similar to the following:

      metadata:
        annotations:
          ogx.io/adopt-storage: "<old-llsd-name>"

      The operator strips the old ownerRef from the PVC and labels it for discovery. The adopted PVC intentionally has no ownerReference to the OGXServer.

  4. (Optional) Adopt the existing Service and Ingress.

    1. To preserve ClusterIP and external endpoints, set the annotations similar to the following:

      metadata:
        annotations:
          ogx.io/adopt-storage: "<old-llsd-name>"
          ogx.io/adopt-networking: "<old-llsd-name>"

      The operator adopts the orphaned Service and Ingress, replaces the Service selectors with the new pod labels (app: ogx and app.kubernetes.io/instance: <name>), and sets ownerReferences.
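
If you generate OGXServer manifests from a script, the adoption annotations can be added conditionally. The following Python sketch builds a manifest similar to the one in the procedure. The build_ogxserver helper and the old-llsd name are hypothetical, and the sketch assumes that networking adoption is requested together with storage adoption, as in the preceding example.

```python
import json


def build_ogxserver(name, distribution, adopt_from=None, adopt_networking=False):
    """Build an OGXServer manifest as a dict, optionally with adoption annotations."""
    metadata = {"name": name}
    annotations = {}
    if adopt_from:
        # Name of the old LlamaStackDistribution whose PVC is adopted
        annotations["ogx.io/adopt-storage"] = adopt_from
        if adopt_networking:
            annotations["ogx.io/adopt-networking"] = adopt_from
    if annotations:
        metadata["annotations"] = annotations
    return {
        "apiVersion": "ogx.io/v1beta1",
        "kind": "OGXServer",
        "metadata": metadata,
        "spec": {
            "distribution": {"name": distribution},
            "workload": {"replicas": 1, "storage": {"size": "20Gi"}},
        },
    }


manifest = build_ogxserver(
    "my-server", "rh-dev", adopt_from="old-llsd", adopt_networking=True
)
# JSON is valid YAML, so this output can be saved to a file and applied with oc apply -f.
print(json.dumps(manifest, indent=2))
```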

Chapter 2. Overview of OGX

OGX is a unified AI runtime environment designed to simplify the deployment and management of generative AI workloads on OpenShift AI. OGX integrates model inference, embedding generation, vector storage, and retrieval services into a single stack that is optimized for retrieval-augmented generation (RAG) and agent-based AI workflows. In OpenShift, the OGX Operator manages the deployment lifecycle of these components, ensuring scalability, consistency, and integration with OpenShift AI projects.

OGX concepts

  • OGX Operator: Installs and manages OGX server instances in OpenShift AI, handling lifecycle operations such as deployment, scaling, and updates.
  • The run.yaml file: Defines which APIs are enabled and how backend providers are configured for an OGX server. Red Hat ships a default run.yaml that supports common deployment scenarios. You can provide a custom run.yaml to enable advanced workflows or integrate additional providers.
  • OGXServer custom resource: Declares the runtime configuration for an OGX server, including model providers, embedding configuration, vector storage, and persistence settings.

OpenShift AI ships with an OGX Distribution that runs the OGX server in a containerized environment. For the OGX Operator version included in this release of OpenShift AI, see Supported Configurations for 3.x.

Important

OGX integration is currently available in Red Hat OpenShift AI 3.5 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production.

These features provide early access to upcoming product capabilities, enabling customers to test functionality and provide feedback during development.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

OGX includes the following core components:

  • Integration with OpenShift AI: Uses the OGXServer custom resource to simplify configuration and deployment of AI workloads.
  • Inference model connections: Acts as a proxy between OGX APIs and model inference servers, such as vLLM deployments.
  • Embedding generation: Generates vector embeddings used for retrieval. In OpenShift AI 3.2, remote embedding models are the recommended and default option for production deployments. Inline embedding models remain available for development and testing scenarios.
  • Vector storage: Stores and indexes embeddings by using supported vector databases, such as Milvus or PostgreSQL with the pgvector extension.
  • Metadata persistence: Stores vector store metadata, file references, and configuration state. In OpenShift AI 3.2, PostgreSQL is the default backend for production-grade deployments.
  • Retrieval workflows: Manages ingestion, chunking, embedding, and similarity search to support RAG workflows.
  • Agentic workflows: Enables agent-based interactions through supported APIs, such as OpenAI-compatible Responses and Chat Completions.

For information about deploying OGX in OpenShift AI, see Deploying a RAG stack in a project.

Note

The OGX Operator is not currently supported on the IBM Z platform.

OGX is supported on IBM Power (ppc64le) with limited functionality:

  • The GenAI playground is currently unavailable on the IBM Power architecture.
  • Although milvus-lite and PostgreSQL with the pgvector extension are listed as supported vector stores, they are not currently available on the IBM Power (ppc64le) architecture.

2.1. OGX APIs

You can use the following APIs from OGX for AI actions.

2.1.1. Supported OGX APIs in OpenShift AI

2.1.1.1. Dataset_IO API

  • Endpoint: /v1alpha/datasetio.
  • Providers: All dataset_io backends deployed through OpenShift AI.
  • Support level: Technology Preview.

The Dataset_IO API manages the input and output of datasets and their content.

2.1.1.2. Inference API

  • Endpoint: /v1alpha/inference.
  • Providers: All inference backends deployed through OpenShift AI.
  • Support level: Developer Preview.

Warning

The majority of the Inference API is deprecated. Inference providers now use the Completions and Chat Completions APIs.

The Inference API enables conversational, message-based interactions with models served by OGX in OpenShift AI.

2.1.1.3. Tool Runtime API

  • Endpoint: /v1/tool-runtime.
  • Providers: All tool runtime backends deployed through OpenShift AI.
  • Support level: Developer Preview.

The Tool Runtime API allows a model to dynamically call a tool at runtime.

2.1.1.4. Vector_IO API

  • Endpoint: /v1/vector-io.
  • Providers: All vector_io backends deployed through OpenShift AI.
  • Support level: Developer Preview.

The Vector_IO API allows you to manage and query vector embeddings, which are numeric representations of data.

2.2. OpenAI-compatible APIs in OGX

OpenShift AI includes an OGX component that exposes OpenAI-compatible APIs. These APIs enable you to reuse existing OpenAI SDKs, tools, and workflows directly within your OpenShift environment, without changing your client code. This compatibility layer supports retrieval-augmented generation (RAG), inference, and embedding workloads by using OpenAI-compatible endpoints, schemas, and authentication patterns.

This compatibility layer has the following capabilities:

  • Standardized endpoints: REST API paths align with OpenAI specifications.
  • Schema parity: Request and response fields follow OpenAI data structures.

Important

When you connect OpenAI SDKs or third-party tools to OpenShift AI, you must update the client configuration to use your deployment’s OGX route as the base_url.

When you use OpenAI-compatible SDKs or send raw HTTP requests to OGX, always include the /v1 path suffix in the base URL so that requests are routed to the OpenAI-compatible API surface exposed by OGX.

For example: http://ogx-service:8321/v1

Using the service endpoint without /v1 results in request failures.

These endpoints are exposed under the OpenAI compatibility layer and are distinct from the native OGX APIs.

2.2.1. Supported OpenAI-compatible APIs in OpenShift AI

Before running the following examples, ensure you have:

  • The OpenAI Python SDK installed: pip install -q openai rich
  • A configured client pointing to your OGX endpoint
  • Model IDs from your deployment (see the Models API section)

from openai import OpenAI
import rich

# This example uses an OGX server deployed in OpenShift AI.
# After all pods associated with the OGXServer are running, build the base_url
# from the OGX service hostname. The OpenAI SDK requires the /v1 suffix.
base_url = "http://ogx-distribution-service.my-project.svc.cluster.local:8321/v1"

client = OpenAI(
    api_key="your-ogx-key",
    base_url=base_url
)

For more information, see Deploying an OGX server.

2.2.1.1. Models API

  • Endpoint: /v1/models.
  • Providers: All model-serving back ends configured within OpenShift AI.
  • Support level: Technology Preview.

The Models API lists and retrieves available model resources from the OGX deployment running on OpenShift AI. By using the Models API, you can enumerate models, view their capabilities, and verify deployment status through a standardized OpenAI-compatible interface.

Example code in Python:

# List models available in the ogx server
models = client.models.list()
rich.print(models)

# Select the first LLM and first embedding model
model_id = next(m for m in models if m.custom_metadata["model_type"] == "llm").id
embedding_model_id = (
    em := next(m for m in models if m.custom_metadata["model_type"] == "embedding")
).id
embedding_dimension = em.custom_metadata["embedding_dimension"]

2.2.1.2. Chat Completions API

  • Endpoint: /v1/chat/completions.
  • Providers: All inference back ends deployed through OpenShift AI.
  • Support level: Technology Preview.

The Chat Completions API enables conversational, message-based interactions with models served by OGX in OpenShift AI.

Example code in Python:

# Test chat completion functionality with a simple question
response = client.chat.completions.create(
    model=model_id,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    temperature=0,
)
# Optional verification check
assert len(response.choices) > 0, "No response after basic inference on ogx server"
content = response.choices[0].message.content
rich.print(content)

2.2.1.3. Completions API

  • Endpoint: /v1/completions.
  • Providers: All inference back ends managed by OpenShift AI.
  • Support level: Technology Preview.

The Completions API supports single-turn text generation and prompt completion.

Example code in Python:

# Test completion functionality with a simple question
response = client.completions.create(
    model=model_id,
    prompt="Answer with one word only: What is the capital of France?",
    max_tokens=64,
    temperature=0.1
)
# Optional verification check
assert len(response.choices) > 0, "No response after basic inference on ogx server"
content = response.choices[0].text
rich.print(content)

2.2.1.4. Embeddings API

  • Endpoint: /v1/embeddings.
  • Providers: All embedding models enabled in OpenShift AI.

The Embeddings API generates numerical embeddings for text or documents that can be used in downstream semantic search or RAG applications.

Example code in Python:

# Create text embeddings
response = client.embeddings.create(
    input="Your text string goes here",
    model=embedding_model_id
)
embedding = response.data[0].embedding
rich.print(embedding[:5] + ["..."] + embedding[-5:])
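
To illustrate how embeddings support semantic search, the following sketch ranks documents by cosine similarity to a query vector. It is a generic illustration, not an OGX API: the helper names are hypothetical and the short vectors stand in for real embeddings returned by the Embeddings API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document indices ordered from most to least similar."""
    scores = [cosine_similarity(query_vec, vec) for vec in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real embeddings.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
ranking = rank_by_similarity(query, docs)
print(ranking)
```

In a RAG deployment, the vector store typically performs this kind of comparison for you; the sketch only shows the underlying idea.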

2.2.1.5. Files API

  • Endpoint: /v1/files.
  • Providers: File system-based file storage provider for managing files and documents stored locally in your cluster.
  • Support level: Technology Preview.

The Files API manages file uploads for use in embedding and retrieval workflows.

Example code in Python:

import requests
from rich import print
from rich.rule import Rule
import time

# -----------------------------
# Download the PDF from url
# -----------------------------
print(Rule("[bold cyan]Downloading PDF[/bold cyan]"))

# We'll use IBM 2025-Q4 report to test RAG, as models don't have that info
pdf_url = "https://www.ibm.com/downloads/documents/us-en/1550f7eea8c0ded6"
filename = "ibm-Q4-2025-4q25-press-release.pdf"
title = "IBM-4Q25-Earnings-Press-Release"

print("📥 Fetching PDF from URL...")
response = requests.get(pdf_url, timeout=60)  # fail instead of hanging indefinitely
response.raise_for_status()
print("✅ PDF fetched successfully")

print(f"💾 Saving PDF as [bold]{filename}[/bold]...")
with open(filename, "wb") as f:
    f.write(response.content)
print(f"✅ Downloaded and saved: [green]{filename}[/green]")

# -----------------------------
# Upload the PDF
# -----------------------------
print(Rule("[bold cyan]Uploading File[/bold cyan]"))

print("☁️ Uploading file to Files API...")
with open(filename, "rb") as f:
    file_info = client.files.create(
        file=(filename, f),
        purpose="assistants"
    )

print("✅ File uploaded successfully")
print(file_info)

# -----------------------------
# Create vector store
# -----------------------------
print(Rule("[bold cyan]Creating Vector Store[/bold cyan]"))

provider_id = "milvus-remote"

print("🧠 Creating vector store with Milvus provider...")
vector_store = client.vector_stores.create(
    name="test_vector_store",
    extra_body={
        "embedding_model": embedding_model_id,
        "embedding_dimension": embedding_dimension,
        "provider_id": provider_id,
    },
)

print("✅ Vector store created")
print(vector_store)

# -----------------------------
# Add file to vector store
# -----------------------------
print(Rule("[bold cyan]Indexing File[/bold cyan]"))

print("📎 Adding uploaded file to vector store...")
vector_store_file = client.vector_stores.files.create(
    vector_store_id=vector_store.id,
    file_id=file_info.id,
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 700,
            "chunk_overlap_tokens": 100,
        }
    },
    attributes={
        "title": title,
    },
)

print("✅ File added to vector store")
print(vector_store_file)

# -----------------------------
# Verify file is completed
# -----------------------------
print(Rule("[bold cyan]Waiting until file status is complete[/bold cyan]"))


# Wait for file processing to complete
print("Waiting for file processing to complete...")
max_wait_time = 300  # 5 minutes
start_time = time.time()

while time.time() - start_time < max_wait_time:
    files = client.vector_stores.files.list(vector_store_id=vector_store.id)
    if files.data:
        file_status = files.data[0].status
        print(f"File status: {file_status}")
        if file_status == "completed":
            print("✅ File processing completed!")
            break
        elif file_status == "failed":
            print("✗ File processing failed!")
            break
    time.sleep(5)
else:
    print("⚠ Timeout waiting for file processing")

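# Report the final file status
files = client.vector_stores.files.list(vector_store_id=vector_store.id)
if files.data:
    final_status = files.data[0].status
    print(f"\nFinal file status: {final_status}")
    print(f"File details: {files.data[0]}")
else:
    final_status = None
    print("No files found in vector store")

# Report success only if the file was actually processed
if final_status == "completed":
    print(Rule("[bold green]All tasks completed successfully ✔[/bold green]"))
else:
    print(Rule("[bold red]File processing did not complete[/bold red]"))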

2.2.1.6. Vector Stores API

  • Endpoint: /v1/vector_stores.
  • Providers: Remote vector store providers configured in OpenShift AI.
  • Support level: Technology Preview.

The Vector Stores API manages the creation, configuration, and lifecycle of vector store resources in OGX. Through this API, you can create new vector stores, list existing ones, delete unused stores, and query their metadata, all using OpenAI-compatible request and response formats.
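
As an illustration of the list and delete operations, the following sketch removes every vector store with a given name. It assumes an OpenAI SDK client configured as shown earlier and uses the SDK's vector_stores.list and vector_stores.delete calls. The delete_vector_stores_named helper is a hypothetical name, and whether every provider supports deletion is not confirmed here.

```python
def delete_vector_stores_named(client, name):
    """Delete every vector store whose name matches; return the deleted IDs."""
    # Collect matches first so that deletion does not interfere with listing.
    matches = [store for store in client.vector_stores.list() if store.name == name]
    deleted = []
    for store in matches:
        client.vector_stores.delete(store.id)
        deleted.append(store.id)
    return deleted


# Example usage with the client created earlier:
# deleted_ids = delete_vector_stores_named(client, "test_vector_store")
# print(deleted_ids)
```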

2.2.1.7. Vector Store Files API

  • Endpoint: /v1/vector_stores/{vector_store_id}/files.
  • Providers: Local inline provider configured for file storage and retrieval.
  • Support level: Developer Preview.

The Vector Store Files API implements the OpenAI Vector Store Files interface and manages the association between document files and vector stores used for RAG workflows.

2.2.1.8. Responses API

  • Endpoint: /v1/responses.
  • Providers: All agents, inference, and vector providers configured in OpenShift AI.
  • Support level: Developer Preview.

The Responses API generates model outputs by combining inference, file search, and tool-calling capabilities through a single OpenAI-compatible endpoint. It is particularly useful for retrieval-augmented generation (RAG) workflows that rely on the file_search tool to retrieve context from vector stores.

Example code in Python:

from rich import print
from rich.table import Table

system_instructions = """You are a financial document analysis assistant specialized in quarterly earnings reports, annual filings, press releases, and earnings call transcripts.
You are designed to answer questions in a concise and professional manner.
Answer questions strictly using only the provided documents.
Base every answer strictly on the retrieved document content and cite the relevant section or excerpt ID.
Do not use outside knowledge.
Do not guess, infer missing data, or fabricate numbers.
If the answer is not found in the retrieved content, reply: "I couldn't find relevant information in the available files or my own knowledge."
Be concise, precise, and factual."""

examples = [
    {
        "input_query": "What do you know about IBM earnings in Q4, 2025? Summarize in one sentence",
        "expected_answer": "IBM reported strong fourth-quarter results with revenue rising 12% to $19.7 billion, driven by double-digit growth in its Software and Infrastructure segments and a generative AI book of business that has now surpassed $12.5 billion"
    },
    {
        "input_query": "What was the total value of IBM's generative AI book of business as reported in the fourth quarter of 2025?",
        "expected_answer": "IBM reported that its generative AI book of business now stands at more than $12.5 billion."
    },
    {
        "input_query": "What was IBM's reported free cash flow for the full year of 2025?",
        "expected_answer": (
            "IBM reported a full-year free cash flow of $14.7 billion, which was an increase of $2.0 billion year-over-year"
        )
    },
    {
        "input_query": "How did the Software segment perform in terms of revenue during the fourth quarter of 2025?",
        "expected_answer": (
            "The Software segment generated $9.0 billion in revenue, representing an increase of 14 percent (or 11 percent at constant currency)"
        )
    },
]

# Use the Responses API to build a table that compares answers generated
# without and with the vector store
table = Table(
    title="Answer Comparison (With vs Without Vector Store)",
    show_lines=True,
)

table.add_column("Question", style="cyan", no_wrap=False)
table.add_column("Expected Answer", style="magenta", no_wrap=False)
table.add_column("Answer (No Vector Store)", style="yellow", no_wrap=False)
table.add_column("Answer (With Vector Store)", style="green", no_wrap=False)

for example in examples:
    question = example["input_query"]
    expected_answer = example["expected_answer"]

    # Ask question without vector_store
    response_no_vs = client.responses.create(
        model=model_id,
        input=question,
        instructions=system_instructions,
    )
    answer_no_vs = response_no_vs.output_text.strip()

    # Ask question with vector_store
    response_vs = client.responses.create(
        model=model_id,
        input=question,
        instructions=system_instructions,
        tools=[
            {
                "type": "file_search",
                "vector_store_ids": [vector_store.id],
            }
        ],
    )
    answer_vs = response_vs.output_text.strip()

    table.add_row(
        question,
        expected_answer,
        answer_no_vs,
        answer_vs,
    )

# The loop above issues several Responses API queries, so the table can take a while to appear
print(table)

Note

The Responses API is an experimental feature that is still under active development in OpenShift AI. While the API is already functional and suitable for evaluation, some endpoints and parameters remain under implementation and might change in future releases. This API is provided for testing and feedback purposes only and is not recommended for production use.

2.2.1.9. Conversations API

  • Endpoint: /v1/conversations.
  • Providers: All agents and inference providers configured in OpenShift AI.
  • Support level: Technology Preview.

The Conversations API enables multi-turn, context-aware chats by managing server-side conversation state. Instead of manually passing previous_response_id between Responses API calls, you can create a conversation that automatically accumulates message history across multiple turns. This simplifies building AI applications where each turn in the conversation can reference context from all previous turns.

The Conversations API provides the following operations:

  • Create a conversation: POST /v1/conversations - Creates a new conversation container with optional metadata.
  • Retrieve a conversation: GET /v1/conversations/{id} - Retrieves a conversation by ID.
  • Update a conversation: POST /v1/conversations/{id} - Updates a conversation’s metadata.
  • Delete a conversation: DELETE /v1/conversations/{id} - Removes a conversation and its history.
  • Create conversation items: POST /v1/conversations/{id}/items - Adds items to a conversation.
  • List conversation items: GET /v1/conversations/{id}/items - Retrieves all messages stored in a conversation.
  • Retrieve a conversation item: GET /v1/conversations/{id}/items/{item_id} - Retrieves a specific item.
  • Delete a conversation item: DELETE /v1/conversations/{id}/items/{item_id} - Removes an item from a conversation.

To use a conversation with the Responses API, pass the conversation parameter instead of previous_response_id when calling /v1/responses.

Example code in Python:

model_id = "your-model-id"

# Step 1: Create a conversation
conversation = client.conversations.create(
    metadata={"topic": "pet-care", "user": "demo-user"}
)
conversation_id = conversation.id

# Step 2: Send messages using the Responses API with conversation_id
# Turn 1
response1 = client.responses.create(
    model=model_id,
    input="I have a rabbit. What is its living quarters called?",
    conversation=conversation_id,
    store=True,  # Persist each response as a conversation item
    instructions="You are a helpful assistant. Keep responses brief.",
)
print(response1.output_text)

# Turn 2: The response can use context from Turn 1
response2 = client.responses.create(
    model=model_id,
    input="I also have a dog. What are its living quarters called?",
    conversation=conversation_id,
    store=True,
)
print(response2.output_text)

# Turn 3: The response can use context from previous turns
response3 = client.responses.create(
    model=model_id,
    input="List the living quarters I need for all my pets.",
    conversation=conversation_id,
    store=True,
)
print(response3.output_text)

# Step 3: List all messages in the conversation
items = client.conversations.items.list(conversation_id, order="asc")
for item in items.data:
    print(f"{item.role}: {item.content}")

# Step 4: Clean up
client.conversations.delete(conversation_id)

Note

The Conversations API is a Technology Preview feature in OpenShift AI. While functional and suitable for evaluation, some endpoints and parameters might change in future releases. This API is not recommended for production use.

2.2.2. OpenAI compatibility for RAG APIs in OGX

OpenShift AI supports OpenAI-compatible request and response schemas for OGX retrieval-augmented generation (RAG) workflows. This compatibility allows you to use OpenAI clients, tools, and schemas with OGX for managing files, vector stores, and executing RAG queries through the Responses API.

OpenAI compatibility enables the following capabilities:

  • You can use OpenAI SDKs and tools with OGX by pointing the client to the OGX OpenAI-compatible API path.
  • You can manage files and vector stores by using OpenAI-compatible endpoints and invoke RAG workflows by using the Responses API with the file_search tool.

When configuring clients, the required base_url depends on the SDK that you use:

  • OpenAI SDKs: When you use an OpenAI-compatible SDK (for example, the OpenAI Python client), you must include the /v1 path suffix in the base URL. For example:

    http://ogx-service:8321/v1
  • OGX SDK (ogx_client): When you use the native OGX SDK, set the base URL to the OGX service endpoint without the /v1 suffix. The SDK automatically appends the correct API paths. For example:

    http://ogx-service:8321

Important

When you use OpenAI-compatible SDKs or send raw HTTP requests to OGX, always include the /v1 path suffix in the base URL.

Using the service endpoint without /v1 results in request failures.
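
The base URL rules in this section can be captured in a small helper so that client code does not drop or duplicate the /v1 suffix. The following Python sketch is illustrative: the client_base_url function and its sdk argument values are hypothetical names, not part of either SDK.

```python
def client_base_url(service_url, sdk="openai"):
    """Return the base URL for the given SDK, with /v1 present at most once."""
    url = service_url.rstrip("/")
    # Strip any /v1 suffixes that are already present.
    while url.endswith("/v1"):
        url = url[: -len("/v1")].rstrip("/")
    if sdk == "openai":
        return url + "/v1"  # OpenAI-compatible SDKs need the /v1 suffix
    if sdk == "ogx":
        return url  # the native SDK appends API paths itself
    raise ValueError(f"unknown sdk: {sdk!r}")


openai_url = client_base_url("http://ogx-service:8321")
native_url = client_base_url("http://ogx-service:8321/v1/", sdk="ogx")
print(openai_url)
print(native_url)
```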

2.3. OpenAI-compatible file citation annotations

OGX supports OpenAI-compatible file citation annotations in Responses API outputs when using the file_search tool. These annotations enable applications to trace generated responses back to source documents without requiring changes to existing OpenAI client code.

2.3.1. OpenAI-compatible file citation annotations in OGX

OpenShift AI provides OpenAI-compatible file citation annotations in Responses API outputs when using retrieval-augmented generation (RAG) with the file_search tool. These annotations enable applications to trace generated responses back to the source files used during retrieval without requiring changes to existing OpenAI client code. When you use the Responses API with the file_search tool, OGX returns citation metadata that references the source file used to generate the response. Annotations are enabled by default.

Citation annotations have the following characteristics:

  • They follow the same response structure defined by OpenAI.
  • They appear in the annotations field of output_text response content.
  • They identify the source file by ID and filename.
  • They provide document-level attribution.

This feature improves transparency for RAG workflows while maintaining schema compatibility with OpenAI request and response formats.

In OpenShift AI, the following annotation capabilities are supported:

  • Annotations are returned only through the Responses API.
  • Annotations are returned only when using the file_search tool.
  • The file_citation annotation type is supported.
  • Attribution is provided at the document level.

2.3.2. Viewing file citation annotations in Responses API output

When you query ingested content by using the file_search tool with the Responses API, OGX returns OpenAI-compatible file_citation annotations. These annotations identify the source files used during retrieval.

Prerequisites

  • You have deployed an OGX server.
  • You have configured at least one inference model.
  • You have created a vector store and ingested content into it.
  • You can successfully execute a RAG query by using the file_search tool, as described in Querying ingested content in a Llama model.
  • You have access to a client environment, such as a Jupyter notebook or an OpenAI SDK client, that is correctly configured to send authenticated requests to the OGX server.

Note

This procedure requires that content has already been ingested into a vector store. If no content is available, RAG queries return empty or non-contextual responses.

Procedure

  1. In a Jupyter notebook cell or other configured client environment, run a RAG query by using the file_search tool.

    response = client.responses.create(
        model=model_id,
        input=query,
        instructions=system_instructions,
        tools=[
            {
                "type": "file_search",
                "vector_store_ids": [vector_store_id],
            }
        ],
    )
  2. Inspect the full response object rather than only the output_text property.

    response.output
  3. Access the annotations array.

    annotations = response.output[0].content[0].annotations
    print(annotations)
  4. Review the file_citation annotation fields.

    Example output:

    [
      {
        "type": "file_citation",
        "file_id": "file-57610eaac6364459bfefae60377837b7",
        "filename": "redbankfinancial_about.pdf",
        "index": 139
      }
    ]

Each file_citation annotation includes the following fields:

  • file_id: The identifier of the retrieved file.
  • filename: The name of the source file.
  • index: The index of the cited file in the list of files.

Multiple annotations can reference the same index position.
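
As a minimal sketch of reading these fields, the following Python helper walks a response that has already been converted to plain dictionaries and returns each cited file once. The function name cited_files and the sample data are illustrative assumptions, not part of OGX. If your client returns Pydantic-style objects, you would first convert the response to dictionaries.

```python
def cited_files(output):
    """Return (file_id, filename) pairs for each file_citation, deduplicated in first-seen order."""
    seen = {}
    for item in output:
        # Some output items (for example, tool call records) carry no content list.
        for part in item.get("content") or []:
            for ann in part.get("annotations") or []:
                if ann.get("type") == "file_citation":
                    seen.setdefault(ann["file_id"], ann["filename"])
    return list(seen.items())


# Sample data shaped like the example output above.
# The second, repeated annotation is hypothetical and only demonstrates deduplication.
citation = {
    "type": "file_citation",
    "file_id": "file-57610eaac6364459bfefae60377837b7",
    "filename": "redbankfinancial_about.pdf",
    "index": 139,
}
sample_output = [
    {
        "content": [
            {
                "type": "output_text",
                "text": "Example generated response.",
                "annotations": [citation, dict(citation)],
            }
        ]
    }
]

print(cited_files(sample_output))
```

Because attribution is at the document level, collapsing repeated annotations to one entry per file is usually enough to display a source list.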

Optional: Using the OpenAI-compatible HTTP endpoint

If you use raw HTTP requests or an OpenAI SDK, send requests to the following endpoint:

/v1/responses

Ensure that your base URL includes the /v1 path suffix, as described in OpenAI compatibility for RAG APIs in OGX.

Note

The accuracy and consistency of citation annotations depend on the capabilities of the underlying language model. Smaller or less capable models might produce less precise attributions, even when retrieval is functioning correctly. If citation results are incomplete or inconsistent, verify the model configuration and consider using a larger or more capable model.

Optional: Using the OpenAI-compatible endpoint

When you use an OpenAI SDK, configure the client base_url to include the /v1 path suffix. The SDK automatically appends the appropriate endpoint path, such as /responses.

For example:

http://ogx-service:8321/v1

When you send raw HTTP requests, include both the /v1 path suffix and the /responses endpoint in the full request URL.

For example:

http://ogx-service:8321/v1/responses

Ensure that /v1 is included only once in the base URL. Do not append /v1 multiple times.

For more information, see OpenAI compatibility for RAG APIs in OGX.
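
The base URL rules above can be sketched as a small Python helper. The function names are hypothetical, and the host name is the example value used in this section. The helper only manipulates strings and does not contact the server.

```python
def normalize_base_url(url):
    """Return the base URL with exactly one trailing /v1 segment and no trailing slash."""
    url = url.rstrip("/")
    # Remove any repeated /v1 suffixes, then add a single one back.
    while url.endswith("/v1"):
        url = url[: -len("/v1")].rstrip("/")
    return url + "/v1"


def responses_url(base_url):
    """Return the full request URL for raw HTTP calls to the Responses API."""
    return normalize_base_url(base_url) + "/responses"


print(normalize_base_url("http://ogx-service:8321"))
print(normalize_base_url("http://ogx-service:8321/v1/v1/"))
print(responses_url("http://ogx-service:8321/v1"))
```

Pass the result of normalize_base_url as the SDK base_url, and use responses_url only when you build raw HTTP requests yourself.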

Verification

  • The response includes an annotations array under output[].content[].
  • Each annotation has "type": "file_citation".
  • The file_id and filename correspond to files stored in the specified vector store.
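
The three checks above can be automated with a short Python sketch. It assumes that the response has been converted to plain dictionaries and that you have built known_files, a mapping of file ID to filename for the files in your vector store. Both the function name and that mapping are illustrative assumptions.

```python
def verify_citations(output, known_files):
    """Return a list of problems; an empty list means all three checks passed."""
    problems = []
    annotations = [
        ann
        for item in output
        for part in item.get("content") or []
        for ann in part.get("annotations") or []
    ]
    # Check 1: an annotations array with entries exists under output[].content[].
    if not annotations:
        problems.append("no annotations found under output[].content[]")
    for ann in annotations:
        # Check 2: every annotation is a file_citation.
        if ann.get("type") != "file_citation":
            problems.append(f"unexpected annotation type: {ann.get('type')!r}")
        # Check 3: file_id and filename match a file in the vector store.
        elif known_files.get(ann.get("file_id")) != ann.get("filename"):
            problems.append(f"file not found in vector store: {ann.get('file_id')!r}")
    return problems


known_files = {"file-57610eaac6364459bfefae60377837b7": "redbankfinancial_about.pdf"}
output = [{"content": [{"type": "output_text", "text": "Example.", "annotations": [
    {"type": "file_citation", "file_id": "file-57610eaac6364459bfefae60377837b7",
     "filename": "redbankfinancial_about.pdf", "index": 139}]}]}]
print(verify_citations(output, known_files))
```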

2.3.3. File citation annotation reference

This reference describes the file_citation annotation type returned by OGX through the OpenAI-compatible Responses API.

2.3.3.1. Annotation location

Annotations are returned in the annotations field of output_text content items within the output[].content[] structure of the Responses API response.

"output": [
  {
    "content": [
      {
        "type": "output_text",
        "text": "Example generated response.",
        "annotations": [ ... ]
      }
    ]
  }
]

2.3.3.2. Supported annotation type

In OpenShift AI, OGX returns the file_citation annotation type when using the file_search tool.

URL citation annotations

The url_citation type is defined in the OpenAI schema but is not produced by OGX in OpenShift AI 3.3.

2.3.3.3. File citation fields

The file_citation annotation includes the following fields:

Field     Type     Description
type      string   Always file_citation
file_id   string   Identifier of the source file used during retrieval
filename  string   Name of the source file
index     integer  Index of the cited file in the list of files

2.3.3.4. Annotation behavior

  • Attribution is provided at the document level.
  • Multiple annotations can reference the same index position.
  • Chunk-level and token-level attribution are not supported.
  • Annotations follow the OpenAI response schema without modification.

2.4. OGX API provider support

You can use OGX to enable various provider APIs and providers in OpenShift AI. The following table lists the supported providers included in OpenShift AI, how to enable each provider, whether it is supported in disconnected environments, and its current support status.

Warning

The support status of the OGX API providers has shifted between Technology Preview and Developer Preview across OpenShift AI versions.

Provider API  Providers                       How to Enable                                                        Disconnected support  Support status
Agents        inline::meta-reference          Enabled by default                                                   Yes                   Developer Preview
Dataset_IO    inline::localfs                 Enabled by default                                                   Yes                   Technology Preview
Dataset_IO    remote::huggingface             Enabled by default                                                   No                    Technology Preview
Files         inline::localfs                 Enabled by default                                                   No                    Technology Preview
Files         remote::s3                      Set the ENABLE_S3 environment variable to "true"                     Yes                   Developer Preview
Inference     remote::vllm                    Set the VLLM_URL environment variable                                Yes                   Technology Preview
Inference     inline::sentence-transformers   Set the ENABLE_SENTENCE_TRANSFORMERS environment variable to "true"  Yes                   Technology Preview
Inference     remote::azure                   Set the AZURE_API_KEY environment variable                           No                    Technology Preview
Inference     remote::bedrock                 Set the AWS_ACCESS_KEY_ID environment variable                       No                    Technology Preview
Inference     remote::openai                  Set the OPENAI_API_KEY environment variable                          No                    Technology Preview
Inference     remote::vertexai                Set the VERTEX_AI_PROJECT environment variable                       No                    Technology Preview
Inference     remote::watsonx                 Set the WATSONX_API_KEY environment variable                         No                    Technology Preview
Tool_Runtime  remote::model-context-protocol  Enabled by default                                                   No                    Developer Preview
Tool_Runtime  inline::rag-runtime             Enabled by default                                                   No                    Developer Preview
Tool_Runtime  remote::brave-search            Enabled by default                                                   No                    Developer Preview
Tool_Runtime  remote::tavily-search           Enabled by default                                                   No                    Developer Preview
Vector_IO     inline::faiss                   Set the ENABLE_FAISS environment variable                            No                    Technology Preview
Vector_IO     inline::milvus                  Set the ENABLE_INLINE_MILVUS environment variable to "true"          Yes                   Technology Preview
Vector_IO     remote::milvus                  Set the MILVUS_ENDPOINT environment variable                         Yes                   Technology Preview
Vector_IO     remote::pgvector                Set the ENABLE_PGVECTOR environment variable                         Yes                   Technology Preview
Vector_IO     remote::qdrant                  Set the ENABLE_QDRANT environment variable                           Yes                   Technology Preview

Note

The Responses API is accessible from the Agents provider API.
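
When you plan a server configuration, it can help to check which enablement variables are still unset for the providers you intend to use. The following Python sketch transcribes the enablement variables from the table above into a dictionary. The function name missing_env and the sample environment are illustrative assumptions, and the check only tests whether a variable has a non-empty value, not whether the value is valid.

```python
# Enablement variables transcribed from the table above. Providers that are
# enabled by default need no variable and are omitted.
ENABLEMENT_ENV = {
    "remote::s3": "ENABLE_S3",
    "remote::vllm": "VLLM_URL",
    "inline::sentence-transformers": "ENABLE_SENTENCE_TRANSFORMERS",
    "remote::azure": "AZURE_API_KEY",
    "remote::bedrock": "AWS_ACCESS_KEY_ID",
    "remote::openai": "OPENAI_API_KEY",
    "remote::vertexai": "VERTEX_AI_PROJECT",
    "remote::watsonx": "WATSONX_API_KEY",
    "inline::faiss": "ENABLE_FAISS",
    "inline::milvus": "ENABLE_INLINE_MILVUS",
    "remote::milvus": "MILVUS_ENDPOINT",
    "remote::pgvector": "ENABLE_PGVECTOR",
    "remote::qdrant": "ENABLE_QDRANT",
}


def missing_env(providers, env):
    """Return {provider: variable} for requested providers whose variable is unset or empty."""
    return {
        p: ENABLEMENT_ENV[p]
        for p in providers
        if p in ENABLEMENT_ENV and not env.get(ENABLEMENT_ENV[p])
    }


# Hypothetical planned environment for a server that should use vLLM and pgvector.
planned_env = {"VLLM_URL": "https://llama32-3b.ogx.svc.cluster.local/v1"}
wanted = ["remote::vllm", "remote::pgvector", "inline::rag-runtime"]
print(missing_env(wanted, planned_env))
```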

Chapter 3. Activating the OGX Operator

You can activate the OGX Operator on your OpenShift cluster by setting its managementState to Managed in the OpenShift AI Operator DataScienceCluster custom resource (CR). This setting enables Llama-based model serving without reinstalling or directly editing Operator subscriptions. You can edit the CR in the OpenShift web console or by using the OpenShift CLI (oc).

Note

As an alternative to following the steps in this procedure, you can activate the OGX Operator from the OpenShift CLI (oc) by running the following command:

$ oc patch datasciencecluster <name> --type=merge -p '{"spec":{"components":{"ogx":{"managementState":"Managed"}}}}'

Replace <name> with your DataScienceCluster name, for example, default-dsc.
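
The patch payload is plain JSON and must reach oc as a single argument, so it needs shell quoting. As a hedged sketch, the following Python helper builds the payload and prints a correctly quoted command without running it. The function name dsc_patch_command is a hypothetical helper, not part of any Red Hat tooling.

```python
import json
import shlex


def dsc_patch_command(dsc_name, state="Managed"):
    """Build a shell-safe oc patch command that sets the ogx component managementState."""
    patch = {"spec": {"components": {"ogx": {"managementState": state}}}}
    payload = json.dumps(patch, separators=(",", ":"))
    args = ["oc", "patch", "datasciencecluster", dsc_name, "--type=merge", "-p", payload]
    # shlex.quote wraps the JSON payload in single quotes so the shell passes it unchanged.
    return " ".join(shlex.quote(arg) for arg in args)


print(dsc_patch_command("default-dsc"))
```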

Prerequisites

Note

On IBM Power (ppc64le) architectures, CPU-only deployments are fully supported.

Procedure

  1. Log in to the OpenShift web console as a cluster administrator.
  2. In the Administrator perspective, click Ecosystem → Installed Operators.
  3. Click the Red Hat OpenShift AI Operator to open its details.
  4. Click the Data Science Cluster tab.
  5. On the DataScienceClusters page, click the default-dsc object.
  6. Click the YAML tab.

    An embedded YAML editor opens, displaying the configuration for the DataScienceCluster custom resource.

  7. In the YAML editor, locate the spec.components section. If the ogx field does not exist, add it. Then, set the managementState field to Managed:

    spec:
      components:
        ogx:
          managementState: Managed
  8. Click Save to apply your changes.

Verification

After you activate the OGX Operator, verify that it is running in your cluster:

  1. In the OpenShift web console, click Workloads → Pods.
  2. From the Project list, select the redhat-ods-applications namespace.
  3. Confirm that a pod with the label name=ogx-k8s-operator is displayed and has a status of Running.

Chapter 4. Deploying an OGX server

OGX allows you to create and deploy a server that enables various APIs for accessing AI services in your OpenShift AI cluster. You can create an OGXServer custom resource for your desired use cases. OGX stores its metadata in PostgreSQL, and you are responsible for provisioning and managing the PostgreSQL instance. The PostgreSQL database can be deployed in-cluster or hosted externally, as long as it is reachable from the cluster network.

The included procedure provides an example OGXServer CR that deploys an OGX server that enables the following setup:

  • A connection to a vLLM inference service with a llama32-3b model.
  • A connection to a remote vector database.
  • Allocated persistent storage.
  • Orchestration endpoints.

Prerequisites

  • You have installed OpenShift 4.19 or newer.
  • You have logged in to Red Hat OpenShift AI.
  • You have cluster administrator privileges for your OpenShift cluster.
  • You have activated the OGX Operator in your cluster.
  • You have access to a PostgreSQL version 14 or later instance that is reachable from the OpenShift cluster network.
  • You have PostgreSQL credentials for that instance that allow OGX to create the database and tables.
  • You know the PostgreSQL hostname and database port to use for the POSTGRES_HOST and POSTGRES_PORT environment variables.
  • You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster.

Procedure

  1. In the OpenShift web console, select Administrator → Quick Create (quick create icon) → Import YAML, and create a CR similar to the following example ogx-custom-server.yaml file:

    Example ogx-custom-server.yaml

    apiVersion: ogx.io/v1beta1
    kind: OGXServer
    metadata:
      name: ogx-custom-server
      namespace: <project-name> # Replace with your OpenShift project
    spec:
      distribution:
        name: rh-dev
      workload:
        replicas: 1
        overrides:
          env:
            - name: VLLM_URL
              value: 'https://llama32-3b.ogx.svc.cluster.local/v1'
            - name: INFERENCE_MODEL
              value: llama32-3b
            - name: VLLM_TLS_VERIFY
              value: 'false'
            - name: POSTGRES_HOST
              value: <postgres-host>
            - name: POSTGRES_PORT
              value: '<postgres-port>' # Default PostgreSQL port is 5432
            - name: POSTGRES_DB
              value: ogx
            - name: POSTGRES_USER
              value: ogx
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: password
                  name: postgres-secret # (1)
          name: ogx
          port: 8321
        distribution:
          name: 'rh-dev'
        storage:
          size: 20Gi
          mountPath: <custom-mount-path> # Defaults to /opt/app-root/src/.ogx/distributions/rh/

    (1) Create the secret in the same namespace as the OGXServer resource. Avoid placing passwords directly on the command line, as they can be stored in shell history. Instead, create a file that contains only the database password and use that file to create the secret, or create the secret by using the OpenShift web console.

    For example:

    $ oc create secret generic postgres-secret --from-file=password=pg-password.txt -n <project-name>
    $ rm -f pg-password.txt

    For more information about creating and managing Secrets, see Providing sensitive data to pods by using secrets.

    Ensure that the file pg-password.txt contains only the database password and is deleted after the secret is created.

    OGX automatically creates the metadata database specified by the POSTGRES_DB environment variable if it does not already exist, provided that the PostgreSQL user has sufficient privileges.
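
Before you apply the CR, it can help to confirm that the PostgreSQL endpoint you plan to use for POSTGRES_HOST and POSTGRES_PORT accepts TCP connections. The following Python sketch is a hypothetical helper that uses only the standard library. It checks network reachability, not credentials or database privileges, and it is meaningful only when run from a location with the same network view as the OGX server, such as a pod in the same project.

```python
import os
import socket


def postgres_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Falls back to localhost and the default PostgreSQL port if the variables are unset.
    host = os.environ.get("POSTGRES_HOST", "localhost")
    port = os.environ.get("POSTGRES_PORT", "5432")
    print(f"{host}:{port} reachable: {postgres_reachable(host, port)}")
```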

Verification

  1. Check that the custom resource was created with the following command:

    $ oc get ogxserver -n <project-name>
  2. Check the running pods with the following command:

    $ oc get pods -n <project-name> | grep ogx-custom-server
  3. Check the logs with the following command:

    $ oc logs -n <project-name> -l app=ogx

    Example output

    INFO: Started server process
    INFO: Waiting for application startup.
    INFO: Application startup complete.
    INFO: Uvicorn running on http://['::', '0.0.0.0']:8321

Chapter 5. OGX application examples

Use the following examples to deploy and configure OGX applications on OpenShift AI. These examples include deploying a RAG stack, evaluating RAG systems, and configuring authentication and availability options.

The following documentation includes example workflows:

  • Deploying a RAG stack in a data science project
  • Evaluating RAG systems with OGX
  • Using supported vector stores with OGX
  • Using external object storage for files
  • Configuring OGX with OAuth authentication

5.1. Deploying a RAG stack in a project

Important

This feature is currently available in Red Hat OpenShift AI 3.5 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

As an OpenShift cluster administrator, you can deploy a Retrieval-Augmented Generation (RAG) stack in OpenShift AI. This stack provides the infrastructure, including LLM inference, vector storage, and retrieval services that data scientists and AI engineers use to build conversational workflows in their projects.

To deploy the RAG stack in a project, complete the following tasks:

  • Activate the OGX Operator in OpenShift AI.
  • Enable GPU support on the OpenShift cluster. This task includes installing the required NVIDIA Operators.
  • Deploy an inference model, for example, the llama-3.2-3b-instruct model. This task includes creating a storage connection and configuring GPU allocation.
  • Create an OGXServer instance to enable RAG functionality. This action deploys OGX alongside a Milvus vector store and connects both components to the inference model.
  • Ingest domain data into the configured vector store by running Docling in an AI pipeline or Jupyter notebook. This process keeps the embeddings synchronized with the source data.
  • Expose and secure the model endpoints.

5.1.1. Overview of RAG

Retrieval-augmented generation (RAG) in OpenShift AI enhances large language models (LLMs) by integrating domain-specific data sources directly into the model’s context. Domain-specific data sources can be structured data, such as relational database tables, or unstructured data, such as PDF documents.

RAG indexes content and builds an embedding store that data scientists and AI engineers can query. When data scientists or AI engineers pose a question to a RAG chatbot, the RAG pipeline retrieves the most relevant pieces of data, passes them to the LLM as context, and generates a response that reflects both the prompt and the retrieved content.

By implementing RAG, data scientists and AI engineers can obtain tailored, accurate, and verifiable answers to complex queries based on their own datasets within a project.
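
The retrieve-then-generate flow described above can be illustrated with a toy Python sketch. It is not how OGX implements retrieval: the three-dimensional vectors, the sample chunks, and the function names are all assumptions chosen to keep the example self-contained. A real deployment uses an embedding model and a vector database.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, store, k=2):
    """Return the k chunk texts whose embeddings are most similar to the query."""
    ranked = sorted(store, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return [c["text"] for c in ranked[:k]]


def build_prompt(question, chunks):
    """Combine the retrieved chunks and the user question into one prompt."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Toy embedding store with made-up vectors.
store = [
    {"text": "Refunds are processed within 5 days.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Offices are closed on public holidays.", "embedding": [0.0, 0.2, 0.9]},
    {"text": "Refund requests need an order number.", "embedding": [0.8, 0.3, 0.1]},
]
query_vec = [1.0, 0.2, 0.0]  # hypothetical embedding of the question below
print(build_prompt("How do refunds work?", retrieve(query_vec, store)))
```

The prompt that reaches the LLM contains only the two refund-related chunks, which is what lets the answer reflect the retrieved content rather than the model's general knowledge.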

5.1.1.1. Audience for RAG

The target audience for RAG is practitioners who build data-grounded conversational AI applications using OpenShift AI infrastructure.

For Data Scientists
Data scientists can use RAG to prototype and validate models that answer natural-language queries against data sources without managing low-level embedding pipelines or vector stores. They can focus on creating prompts and evaluating model outputs instead of building retrieval infrastructure.
For MLOps Engineers
MLOps engineers typically deploy and operate RAG pipelines in production. Within OpenShift AI, they manage LLM endpoints, monitor performance, and ensure that both retrieval and generation scale reliably. RAG decouples vector store maintenance from the serving layer, enabling MLOps engineers to apply CI/CD workflows to data ingestion and model deployment alike.
For Data Engineers
Data engineers build workflows to load data into storage that OpenShift AI indexes. They keep embeddings in sync with source systems, such as S3 buckets or relational tables, to ensure that chatbot responses are accurate.
For AI Engineers
AI engineers architect RAG chatbots by defining prompt templates, retrieval methods, and fallback logic. They configure agents and add domain-specific tools, such as OpenShift job triggers, enabling rapid iteration.

5.1.2. Overview of vector databases

Vector databases are a core component of retrieval-augmented generation (RAG) in OpenShift AI. They store and index vector embeddings that represent the semantic meaning of text or other data. When integrated with OGX, vector databases enable applications to retrieve relevant context and combine it with large language model (LLM) inference.

Vector databases provide the following capabilities:

  • Store vector embeddings generated by embedding models.
  • Support efficient similarity search to retrieve semantically related content.
  • Enable RAG workflows by supplying the LLM with contextually relevant data.

In OpenShift AI, vector databases are configured and managed through the OGX Operator as part of an OGXServer resource. PostgreSQL is the default and recommended metadata store for OGX, supporting production-ready persistence, concurrency, and scalability.

The following vector database options are supported in OpenShift AI:

  • Remote Milvus: Runs as a standalone vector database service, either within the cluster or as an external managed deployment. This option is suitable for large-scale or production-grade RAG workloads that require high availability, horizontal scalability, and isolation from the OGX server. In OpenShift environments, Milvus typically requires an accompanying etcd service for coordination. For more information, see Providing redundancy with etcd.
  • Remote PostgreSQL with pgvector: Provides a production-ready vector database option that integrates vector similarity search directly into PostgreSQL. This option is well suited for environments that already operate PostgreSQL and require durable storage, transactional consistency, and centralized management. pgvector enables OGX to store embeddings and perform similarity search without deploying a separate vector database service.

Consider the following guidance when choosing a vector database for your RAG workloads:

  • Use Remote Milvus when you require large-scale vector indexing and high-throughput similarity search.
  • Use PostgreSQL with pgvector when you want production-ready persistence and integration with existing PostgreSQL-based data platforms.

SQLite-based storage is no longer recommended for production deployments. PostgreSQL-based backends provide improved reliability, concurrency, and scalability as OGX moves toward general availability.

5.1.2.1. Overview of Milvus vector databases

Milvus is an open source vector database designed for high-performance similarity search across large volumes of embedding data. In OpenShift AI, Milvus is supported as a vector store provider for OGX and enables retrieval-augmented generation (RAG) workloads that require efficient vector indexing, scalable search, and durable storage.

Production-grade OGX deployments default to PostgreSQL for metadata persistence. When Milvus is used as the vector store, PostgreSQL is typically used for OGX metadata, while Milvus manages vector indexes and similarity search.

Milvus vector databases provide the following capabilities in OpenShift AI:

  • High-performance similarity search using Approximate Nearest Neighbor (ANN) algorithms
  • Efficient indexing and query optimization for dense embeddings
  • Persistent storage of vector data
  • Integration with OGX through an OpenAI-compatible Vector Stores API

In a typical RAG workflow in OpenShift AI, the following responsibilities are separated:

  • Embedding generation: Embeddings are generated by the configured embedding provider. Remote embedding models are the recommended and default option for production deployments.
  • Vector storage and retrieval: Milvus stores embedding vectors and performs similarity search operations.
  • Metadata persistence: OGX stores vector store metadata, file references, and configuration state using PostgreSQL in production deployments.
  • OGX server: Coordinates ingestion, retrieval, and model inference through a unified API surface.

In OpenShift AI, Milvus can be used in the following operational modes:

  • Remote Milvus: Runs as a standalone service within your OpenShift project or as an external managed Milvus deployment. Remote Milvus is recommended for production-grade RAG workloads.

A remote Milvus deployment typically includes the following components:

  • A Milvus service that exposes a gRPC endpoint (port 19530) for client traffic
  • An etcd service that Milvus uses for metadata coordination, collection state, and index management
  • Persistent storage for durable vector data

Milvus requires a dedicated etcd instance for metadata coordination, even when running in standalone mode. Do not use the OpenShift control plane etcd for this purpose. For more information about etcd, see Providing redundancy with etcd.

Important

You must deploy a dedicated etcd service for Milvus or connect Milvus to an external etcd instance. Do not share the OpenShift control plane etcd with application workloads.

Use Remote Milvus when you require scalable vector search, high-performance retrieval, and integration with production-grade OGX deployments in OpenShift AI.

For instructions on deploying Milvus as a remote vector database, see Deploying a remote Milvus vector database.

5.1.2.2. Overview of Qdrant vector databases

Qdrant is an open source vector database optimized for high-performance similarity search and advanced filtering. In OpenShift AI, Qdrant is supported as a remote vector store provider for OGX and can be used in retrieval-augmented generation (RAG) workloads that require efficient vector indexing and durable storage.

When used with OGX in OpenShift AI, Qdrant provides:

  • High-performance similarity search using Hierarchical Navigable Small World (HNSW) indexing
  • Filtering based on stored metadata during vector search
  • Persistent storage of vector data
  • Integration through the OpenAI-compatible Vector Stores API

In a RAG workflow:

  • Embeddings are generated by the configured embedding provider.
  • Qdrant stores embedding vectors and performs similarity search.
  • OGX manages ingestion, retrieval, and model inference through a unified API.

In OpenShift AI, you must deploy Qdrant as a remote service, either within your OpenShift project or as an externally managed deployment.

Note

Inline Qdrant is not supported. To use Qdrant with OGX in OpenShift AI, deploy Qdrant as a remote service.

A typical remote deployment includes:

  • A Qdrant service exposing HTTP (port 6333) and gRPC (port 6334) endpoints
  • Persistent storage for vector data
  • Optional API key authentication

For deployment and configuration instructions, see Using Qdrant in OGX.

5.1.2.3. Overview of pgvector vector databases

pgvector is an open source PostgreSQL extension that enables vector similarity search on embedding data stored in relational tables. In OpenShift AI, PostgreSQL with the pgvector extension is supported as a remote vector database provider for the OGX Operator. pgvector supports retrieval-augmented generation (RAG) workflows that require persistent vector storage while integrating with existing PostgreSQL environments.

pgvector vector databases provide the following capabilities in OpenShift AI:

  • Storage of vector embeddings in PostgreSQL tables.
  • Similarity search across embeddings by using pgvector distance metrics.
  • Persistent storage of vectors alongside structured relational data.
  • Integration with existing PostgreSQL security and operational tooling.

In a typical retrieval-augmented generation workflow in OpenShift AI, your application uses the following components:

  • Inference provider: Generates embeddings and model responses.
  • Vector store provider: Stores embeddings and performs similarity search. When you use pgvector, PostgreSQL provides this capability as a remote vector store.
  • File storage provider: Stores the source files that are ingested into vector stores.
  • OGX server: Provides a unified API surface, including an OpenAI-compatible Vector Stores API.

When you ingest content, OGX splits source material into chunks, generates embeddings, and stores them in PostgreSQL through the pgvector extension. When you query a vector store, OGX performs similarity search and returns the most relevant chunks for use in prompts.
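
To make the chunking step concrete, the following Python sketch splits text into fixed-size character windows with overlap. This section does not specify the chunking strategy or parameters that OGX uses, so the character-window approach, the function name chunk_text, and the default sizes are assumptions for illustration only.

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into windows of `size` characters that share `overlap` characters with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining text is already covered by the previous window's overlap.
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]


print(chunk_text("abcdefghij", size=4, overlap=2))
```

Overlap keeps a sentence that straddles a window boundary retrievable from at least one chunk, at the cost of storing some text twice.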

In OpenShift AI, pgvector is used in the following operational mode:

  • Remote PostgreSQL with pgvector, which runs as a standalone PostgreSQL database service accessed by the OGX server. This mode is suitable for development and production workloads that require persistent storage and integration with existing PostgreSQL infrastructure.

When you deploy PostgreSQL with the pgvector extension, you typically manage the following components:

  • Secrets for PostgreSQL connection credentials.
  • Persistent storage for durable database data.
  • A PostgreSQL service that exposes a network endpoint.

PostgreSQL with pgvector does not require an external coordination service. Vector data, indexes, and metadata are stored directly in PostgreSQL tables and managed through standard database mechanisms.

Use PostgreSQL with pgvector when you require persistent vector storage and want to integrate vector search into existing PostgreSQL-based data platforms within OpenShift AI. For deployment instructions, see Deploying a PostgreSQL instance with pgvector.

5.1.3. Deploying a Llama model with KServe

To use OGX and retrieval-augmented generation (RAG) workloads in OpenShift AI, you must deploy a Llama model with a vLLM model server and configure KServe in KServe RawDeployment mode.

Important

When deploying models using KServe on IBM Power (ppc64le), ensure that you use only supported parameters for the model configuration.

Half (FP16) precision is not currently supported on this architecture. Attempting to use FP16 may result in a NotImplementedError: "rotary_embedding_impl" not implemented for 'Half' error.

Prerequisites

  • You have installed OpenShift 4.19 or newer.
  • You have logged in to Red Hat OpenShift AI.
  • You have cluster administrator privileges for your OpenShift cluster.
  • You have activated the OGX Operator.
  • You have installed KServe.
  • You have enabled the model serving platform. For more information about enabling the model serving platform, see Enabling the model serving platform.
  • You can access the model serving platform in the dashboard configuration. For more information about setting dashboard configuration options, see Customizing the dashboard.
  • You have enabled GPU support in OpenShift AI, including installing the Node Feature Discovery Operator and NVIDIA GPU Operator. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs.
  • You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster.

  • You have created a project.
  • The vLLM serving runtime is installed and available in your environment.
  • You have created a storage connection for your model that uses the URI - v1 connection type. This storage connection must define the location of your Llama 3.2 model artifacts, for example, oci://quay.io/redhat-ai-services/modelcar-catalog:llama-3.2-3b-instruct. For more information about creating storage connections, see Adding a connection to your project.
Procedure

These steps are only supported in OpenShift AI versions 2.19 and later.

  1. In the OpenShift AI dashboard, navigate to the project details page and click the Deployments tab.
  2. In the Model serving platform tile, click Select model.
  3. Click the Deploy model button.

    The Deploy model dialog opens.

  4. Configure the deployment properties for your model:

    1. In the Model deployment name field, enter a unique name for your deployment.
    2. In the Serving runtime field, select vLLM NVIDIA GPU serving runtime for KServe from the drop-down list.
    3. In the Deployment mode field, select KServe RawDeployment from the drop-down list.
    4. Set Number of model server replicas to deploy to 1.
    5. In the Model server size field, select Custom from the drop-down list.

      • Set CPUs requested to 1 core.
      • Set Memory requested to 10 GiB.
      • Set CPU limit to 2 cores.
      • Set Memory limit to 14 GiB.
      • Set Accelerator to NVIDIA GPUs.
      • Set Accelerator count to 1.
    6. From the Connection type, select a relevant data connection from the drop-down list.
  5. In the Additional serving runtime arguments field, specify the following recommended arguments:

    --dtype=half
    --max-model-len=20000
    --gpu-memory-utilization=0.95
    --enable-chunked-prefill
    --enable-auto-tool-choice
    --tool-call-parser=llama3_json
    --chat-template=/app/data/template/tool_chat_template_llama3.2_json.jinja
  6. Click Deploy.

    Note

    Model deployment can take several minutes, especially for the first model that is deployed on the cluster. Initial deployment might take more than 10 minutes while the relevant images download.

Verification

  1. Verify that the kserve-controller-manager and odh-model-controller pods are running:

    1. Open a new terminal window.
    2. Log in to your OpenShift cluster from the CLI:
    3. In the upper-right corner of the OpenShift web console, click your user name and select Copy login command.
    4. After you have logged in, click Display token.
    5. Copy the Log in with this token command and paste it in the OpenShift CLI (oc).

      $ oc login --token=<token> --server=<openshift_cluster_url>
    6. Enter the following command to verify that the kserve-controller-manager and odh-model-controller pods are running:

      $ oc get pods -n redhat-ods-applications | grep -E 'kserve-controller-manager|odh-model-controller'
    7. Confirm that you see output similar to the following example:

      kserve-controller-manager-7c865c9c9f-xyz12   1/1     Running   0          4m21s
      odh-model-controller-7b7d5fd9cc-wxy34        1/1     Running   0          3m55s
    8. If you do not see either of the kserve-controller-manager and odh-model-controller pods, there could be a problem with your deployment. In addition, if the pods appear in the list, but their Status is not set to Running, check the pod logs for errors:

      $ oc logs <pod-name> -n redhat-ods-applications
    9. Check the status of the inference service:

      $ oc get inferenceservice -n <project name>
      $ oc get pods -n <project name> | grep llama
      • The deployment automatically creates the following resources:

        • A ServingRuntime resource.
        • An InferenceService resource, a Deployment, a pod, and a service pointing to the pod.
      • Verify that the server is running. For example:

        $ oc logs llama-32-3b-instruct-predictor-77f6574f76-8nl4r -n <project name>

        Check for output similar to the following example log:

        INFO     2025-05-15 11:23:52,750 __main__:498 server: Listening on ['::', '0.0.0.0']:8321
        INFO:     Started server process [1]
        INFO:     Waiting for application startup.
        INFO     2025-05-15 11:23:52,765 __main__:151 server: Starting up
        INFO:     Application startup complete.
        INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
      • The deployed model displays in the Deployments tab on the project details page for the project it was deployed under.
  2. If you see a ConvertTritonGPUToLLVM error in the pod logs when querying the /v1/chat/completions API, and the vLLM server restarts or returns a 500 Internal Server error, apply the following workaround:

    Before deploying the model, remove the --enable-chunked-prefill argument from the Additional serving runtime arguments field in the deployment dialog.

    The error output is similar to the following:

    /opt/vllm/lib64/python3.12/site-packages/vllm/attention/ops/prefix_prefill.py:36:0: error: Failures have been detected while processing an MLIR pass pipeline
    /opt/vllm/lib64/python3.12/site-packages/vllm/attention/ops/prefix_prefill.py:36:0: note: Pipeline failed while executing [`ConvertTritonGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
    INFO:     10.129.2.8:0 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error

5.1.4. Testing your vLLM model endpoints

To verify that your deployed Llama 3.2 model is accessible externally, ensure that your vLLM model server is exposed as a network endpoint. You can then test access to the model from outside both the OpenShift cluster and the OpenShift AI interface.

Important

If you selected Make deployed models available through an external route during deployment, your vLLM model endpoint is already accessible outside the cluster. You do not need to manually expose the model server. Manually exposing vLLM model endpoints, for example, by using oc expose, creates an unsecured route unless you configure authentication. Avoid exposing endpoints without security controls to prevent unauthorized access.

Prerequisites

  • You have cluster administrator privileges for your OpenShift cluster.
  • You have logged in to Red Hat OpenShift AI.
  • You have activated the OGX Operator in OpenShift AI.
  • You have deployed an inference model, for example, the llama-3.2-3b-instruct model.
  • You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster.

Procedure

  1. Open a new terminal window and log in to your OpenShift cluster from the CLI:

    1. In the upper-right corner of the OpenShift web console, click your user name and select Copy login command.
    2. After you have logged in, click Display token.
    3. Copy the Log in with this token command and paste it in the OpenShift CLI (oc).

      $ oc login --token=<token> --server=<openshift_cluster_url>
  2. If you enabled Require token authentication during model deployment, retrieve your token:

    $ export MODEL_TOKEN=$(oc get secret default-name-llama-32-3b-instruct-sa -n <project name> --template='{{ .data.token }}' | base64 -d)
  3. Obtain your model endpoint URL:

    • If you enabled Make deployed models available through an external route during model deployment, click Endpoint details on the Deployments page in the OpenShift AI dashboard to obtain your model endpoint URL.
    • If you did not enable Require token authentication during model deployment, you can also retrieve the endpoint URL from the CLI:

      $ export MODEL_ENDPOINT="https://$(oc get route llama-32-3b-instruct -n <project name> --template='{{ .spec.host }}')"
  4. Test the endpoint with a sample chat completion request:

    • If you did not enable Require token authentication during model deployment, enter a chat completion request. For example:

      $ curl -X POST $MODEL_ENDPOINT/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{
       "model": "llama-32-3b-instruct",
       "messages": [
         {
           "role": "user",
           "content": "Hello"
         }
       ]
      }'
    • If you enabled Require token authentication during model deployment, include a token in your request. For example:

      $ curl -s -k $MODEL_ENDPOINT/v1/chat/completions \
      --header "Authorization: Bearer $MODEL_TOKEN" \
      --header 'Content-Type: application/json' \
      -d '{
        "model": "llama-32-3b-instruct",
        "messages": [
          {
            "role": "user",
            "content": "can you tell me a funny joke?"
          }
        ]
      }' | jq .
      Note

      The -k flag disables SSL verification and should only be used in test environments or with self-signed certificates.

Verification

Confirm that you received a JSON response containing a chat completion. For example:

{
  "id": "chatcmpl-05d24b91b08a4b78b0e084d4cc91dd7e",
  "object": "chat.completion",
  "created": 1747279170,
  "model": "llama-32-3b-instruct",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "reasoning_content": null,
      "content": "Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?",
      "tool_calls": []
    },
    "logprobs": null,
    "finish_reason": "stop",
    "stop_reason": null
  }],
  "usage": {
    "prompt_tokens": 37,
    "total_tokens": 62,
    "completion_tokens": 25,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

If you do not receive a response similar to the example, verify that the endpoint URL and token are correct, and ensure your model deployment is running.
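
The same request can also be sent from Python. The following is a minimal sketch that uses only the standard library. The helper names build_chat_request and extract_reply are illustrative and are not part of any vLLM or OGX SDK. The sketch assumes that the endpoint serves the /v1/chat/completions path used in the curl examples.

```python
import json
import urllib.request


def build_chat_request(endpoint, model, prompt, token=None):
    """Build a POST request for the /v1/chat/completions path."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Only needed when Require token authentication was enabled.
        headers["Authorization"] = f"Bearer {token}"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    url = endpoint.rstrip("/") + "/v1/chat/completions"
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def extract_reply(completion):
    """Return the assistant text from a chat completion response dict."""
    return completion["choices"][0]["message"]["content"]


# Example usage (placeholder values; requires network access to the endpoint):
# req = build_chat_request("https://<model-endpoint>", "llama-32-3b-instruct",
#                          "Hello", token="<token>")
# with urllib.request.urlopen(req) as resp:
#     print(extract_reply(json.load(resp)))
```

Separating request construction from the network call makes it possible to inspect the URL, headers, and body before anything is sent.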

5.1.5. Deploying a remote Milvus vector database

To use Milvus as a remote vector database provider for OGX in OpenShift AI, you must deploy Milvus and its required etcd service in your OpenShift project. This procedure shows how to deploy Milvus in standalone mode without the Milvus Operator.

Note

The following example configuration is intended for testing or evaluation environments. For production-grade deployments, see https://milvus.io/docs in the Milvus documentation.

Prerequisites

  • You have installed OpenShift 4.19 or newer.
  • You have enabled GPU support in OpenShift AI. This includes installing the Node Feature Discovery Operator and NVIDIA GPU Operator. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs.
  • You have cluster administrator privileges for your OpenShift cluster.
  • You are logged in to Red Hat OpenShift AI.
  • You have a StorageClass available that can provision persistent volumes.
  • You created a root password to secure your Milvus service.
  • You have deployed an inference model with vLLM, for example, the llama-3.2-3b-instruct model, and you have selected Make deployed models available through an external route and Require token authentication during model deployment.
  • You have the correct inference model identifier, for example, llama-3-2-3b.
  • You have the model endpoint URL, ending with /v1, such as https://llama-32-3b-instruct-predictor:8443/v1.
  • You have the API token required to access the model endpoint.
  • You have installed the OpenShift command line interface (oc) as described in Installing the OpenShift CLI.

Procedure

  1. In the OpenShift console, click the Quick Create icon and then click the Import YAML option.
  2. Verify that your project is the selected project.
  3. In the Import YAML editor, paste the following manifest and click Create:

    apiVersion: v1
    kind: Secret
    metadata:
      name: milvus-secret
    type: Opaque
    stringData:
      root-password: "MyStr0ngP@ssw0rd"
    ---
    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: milvus-pvc
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      volumeMode: Filesystem
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: etcd-deployment
      labels:
        app: etcd
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: etcd
      strategy:
        type: Recreate
      template:
        metadata:
          labels:
            app: etcd
        spec:
          containers:
            - name: etcd
              image: quay.io/coreos/etcd:v3.5.5
              command:
                - etcd
                - --advertise-client-urls=http://127.0.0.1:2379
                - --listen-client-urls=http://0.0.0.0:2379
                - --data-dir=/etcd
              ports:
                - containerPort: 2379
              volumeMounts:
                - name: etcd-data
                  mountPath: /etcd
              env:
                - name: ETCD_AUTO_COMPACTION_MODE
                  value: revision
                - name: ETCD_AUTO_COMPACTION_RETENTION
                  value: "1000"
                - name: ETCD_QUOTA_BACKEND_BYTES
                  value: "4294967296"
                - name: ETCD_SNAPSHOT_COUNT
                  value: "50000"
          volumes:
            - name: etcd-data
              emptyDir: {}
          restartPolicy: Always
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: etcd-service
    spec:
      ports:
        - port: 2379
          targetPort: 2379
      selector:
        app: etcd
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: milvus-standalone
      name: milvus-standalone
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: milvus-standalone
      strategy:
        type: Recreate
      template:
        metadata:
          labels:
            app: milvus-standalone
        spec:
          containers:
            - name: milvus-standalone
              image: milvusdb/milvus:v2.6.0
              args: ["milvus", "run", "standalone"]
              env:
                - name: DEPLOY_MODE
                  value: standalone
                - name: ETCD_ENDPOINTS
                  value: etcd-service:2379
                - name: COMMON_STORAGETYPE
                  value: local
                - name: MILVUS_ROOT_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: milvus-secret
                      key: root-password
              livenessProbe:
                exec:
                  command: ["curl", "-f", "http://localhost:9091/healthz"]
                initialDelaySeconds: 90
                periodSeconds: 30
                timeoutSeconds: 20
                failureThreshold: 5
              ports:
                - containerPort: 19530
                  protocol: TCP
                - containerPort: 9091
                  protocol: TCP
              volumeMounts:
                - name: milvus-data
                  mountPath: /var/lib/milvus
          restartPolicy: Always
          volumes:
            - name: milvus-data
              persistentVolumeClaim:
                claimName: milvus-pvc
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: milvus-service
    spec:
      selector:
        app: milvus-standalone
      ports:
        - name: grpc
          port: 19530
          targetPort: 19530
        - name: http
          port: 9091
          targetPort: 9091
    Note
    • Use the gRPC port (19530) for the MILVUS_ENDPOINT setting in OGX.
    • The HTTP port (9091) is reserved for health checks.
    • If you deploy Milvus in a different namespace, use the fully qualified service name in your OGX configuration. For example: http://milvus-service.<namespace>.svc.cluster.local:19530
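
The endpoint rules in the preceding note can be sketched as a small helper. This is an illustration only: the function name milvus_endpoint and the vector-db namespace are hypothetical, and the http:// scheme follows the fully qualified example in the note.

```python
GRPC_PORT = 19530    # client traffic (gRPC)
HEALTH_PORT = 9091   # health checks only


def milvus_endpoint(namespace=None, service="milvus-service", port=GRPC_PORT):
    """Return the Milvus endpoint URL to use in the OGX configuration."""
    if port == HEALTH_PORT:
        raise ValueError("port 9091 is reserved for health checks; use 19530")
    # Clients in the same namespace can use the short service name.
    if namespace is None:
        host = service
    else:
        host = f"{service}.{namespace}.svc.cluster.local"
    return f"http://{host}:{port}"


print(milvus_endpoint())             # http://milvus-service:19530
print(milvus_endpoint("vector-db"))  # http://milvus-service.vector-db.svc.cluster.local:19530
```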

Verification

  1. In the OpenShift web console, click Workloads → Deployments.
  2. Verify that both etcd-deployment and milvus-standalone show a status of 1 of 1 pods available.
  3. Click Pods in the navigation panel and confirm that pods for both deployments are Running.
  4. Click the milvus-standalone pod name, then select the Logs tab.
  5. Verify that Milvus reports a healthy startup with output similar to:

    Milvus Standalone is ready to serve ...
    Listening on 0.0.0.0:19530 (gRPC)
  6. Click Networking → Services and confirm that the milvus-service and etcd-service resources exist and are exposed on ports 19530 and 2379, respectively.
  7. (Optional) Click Pods → milvus-standalone → Terminal and run the following health check:

    curl http://localhost:9091/healthz

    A response of {"status": "healthy"} confirms that Milvus is running correctly.

5.1.6. Deploying an OGXServer instance

You can deploy OGX with retrieval-augmented generation (RAG) by pairing it with a vLLM-served Llama 3.2 model. This module provides the following deployment examples of the OGXServer custom resource (CR):

  • Example A: Remote Milvus (external service)
  • Example B: Remote PostgreSQL with pgvector (external service, remote embeddings)

Prerequisites

  • You have installed OpenShift 4.19 or newer.
  • You have enabled GPU support in OpenShift AI. This includes installing the Node Feature Discovery Operator and NVIDIA GPU Operator. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs.
  • You have cluster administrator privileges for your OpenShift cluster.
  • You are logged in to Red Hat OpenShift AI.
  • You have activated the OGX Operator in OpenShift AI.
  • You have deployed an inference model with vLLM (for example, llama-3.2-3b-instruct) and selected Make deployed models available through an external route and Require token authentication during model deployment. In addition, in Add custom runtime arguments, you have added --enable-auto-tool-choice.
  • You have the correct inference model identifier, for example, llama-3-2-3b.
  • You have the model endpoint URL ending with /v1, for example, https://llama-32-3b-instruct-predictor:8443/v1.
  • You have the API token required to access the model endpoint.
  • You have installed the PostgreSQL Operator version 14 or later and configured a PostgreSQL database for OGX metadata storage. For more information, see the documentation for "Deploying a OGX server".
  • You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster.

Procedure

  1. Open a new terminal window and log in to your OpenShift cluster from the CLI:

    In the upper-right corner of the OpenShift web console, click your user name and select Copy login command. After you have logged in, click Display token. Copy the Log in with this token command and paste it in the OpenShift CLI (oc).

    $ oc login --token=<token> --server=<openshift_cluster_url>
  2. Create a secret that contains the inference model and the remote embeddings environment variables:

    # Remote LLM
    export INFERENCE_MODEL="llama-3-2-3b"
    export VLLM_URL="https://llama-32-3b-instruct-predictor:8443/v1"
    export VLLM_TLS_VERIFY="false"   # Use "true" in production
    export VLLM_API_TOKEN="<token identifier>"
    export VLLM_MAX_TOKENS=16384
    
    # Remote embedding configuration
    export EMBEDDING_MODEL="nomic-embed-text-v1-5"
    export EMBEDDING_PROVIDER_MODEL_ID="nomic-embed-text-v1-5"
    export VLLM_EMBEDDING_URL="<embedding-endpoint>/v1"
    export VLLM_EMBEDDING_API_TOKEN="<embedding-token>"
    export VLLM_EMBEDDING_MAX_TOKENS=8192
    export VLLM_EMBEDDING_TLS_VERIFY="true"
    
    oc create secret generic ogx-secret -n <project-name> \
      --from-literal=INFERENCE_MODEL="$INFERENCE_MODEL" \
      --from-literal=VLLM_URL="$VLLM_URL" \
      --from-literal=VLLM_TLS_VERIFY="$VLLM_TLS_VERIFY" \
      --from-literal=VLLM_API_TOKEN="$VLLM_API_TOKEN" \
      --from-literal=VLLM_MAX_TOKENS="$VLLM_MAX_TOKENS" \
      --from-literal=EMBEDDING_MODEL="$EMBEDDING_MODEL" \
      --from-literal=EMBEDDING_PROVIDER_MODEL_ID="$EMBEDDING_PROVIDER_MODEL_ID" \
      --from-literal=VLLM_EMBEDDING_URL="$VLLM_EMBEDDING_URL" \
      --from-literal=VLLM_EMBEDDING_TLS_VERIFY="$VLLM_EMBEDDING_TLS_VERIFY" \
      --from-literal=VLLM_EMBEDDING_API_TOKEN="$VLLM_EMBEDDING_API_TOKEN" \
      --from-literal=VLLM_EMBEDDING_MAX_TOKENS="$VLLM_EMBEDDING_MAX_TOKENS"
  3. Choose one of the following deployment examples:
Important

To use OGX in a disconnected environment and to enable embeddings for deployments, you must use the remote::vllm provider to set up a vLLM instance that serves an embedding model, for example, the ibm-granite/granite-embedding-125m-english model.
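
Before you create the ogx-secret in step 2, it can help to check that every variable is set and that no placeholder was left unfilled. The following sketch is an optional pre-flight check, not part of OGX. The check_settings helper is hypothetical, and the /v1 suffix rule is taken from the prerequisites of this procedure.

```python
import re

REQUIRED_KEYS = [
    "INFERENCE_MODEL", "VLLM_URL", "VLLM_TLS_VERIFY", "VLLM_API_TOKEN",
    "VLLM_MAX_TOKENS", "EMBEDDING_MODEL", "EMBEDDING_PROVIDER_MODEL_ID",
    "VLLM_EMBEDDING_URL", "VLLM_EMBEDDING_TLS_VERIFY",
    "VLLM_EMBEDDING_API_TOKEN", "VLLM_EMBEDDING_MAX_TOKENS",
]


def check_settings(env):
    """Return a list of problems found in a mapping of variable names to values."""
    problems = []
    for key in REQUIRED_KEYS:
        value = env.get(key, "")
        if not value:
            problems.append(f"{key} is not set")
        elif re.search(r"<[^<>]+>", value):
            # Values such as "<token identifier>" were copied without editing.
            problems.append(f"{key} still contains a placeholder")
    for key in ("VLLM_URL", "VLLM_EMBEDDING_URL"):
        value = env.get(key, "")
        if value and not value.rstrip("/").endswith("/v1"):
            problems.append(f"{key} should end with /v1")
    return problems


# Example usage against the current shell environment:
# import os
# for problem in check_settings(os.environ):
#     print(problem)
```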

5.1.6.1. Example A: OGXServer with Remote Milvus

Use this example for production-grade or large datasets with an external Milvus service.

  1. Create the Milvus connection secret:

    # Required: gRPC endpoint on port 19530
    export MILVUS_ENDPOINT="tcp://milvus-service:19530"
    export MILVUS_TOKEN="<milvus-root-or-user-token>"
    export MILVUS_CONSISTENCY_LEVEL="Bounded"   # Optional; choose per your deployment
    
    oc create secret generic milvus-secret \
      --from-literal=MILVUS_ENDPOINT="$MILVUS_ENDPOINT" \
      --from-literal=MILVUS_TOKEN="$MILVUS_TOKEN" \
      --from-literal=MILVUS_CONSISTENCY_LEVEL="$MILVUS_CONSISTENCY_LEVEL"
    Important

    Use the gRPC port 19530 for MILVUS_ENDPOINT. Ports such as 9091 are typically used for health checks and are not valid for client traffic.

  2. In the OpenShift web console, select Administrator → Quick Create → Import YAML, and create a CR similar to the following:

    apiVersion: ogx.io/v1beta1
    kind: OGXServer
    metadata:
      name: ogx-upgrade-test
    spec:
      distribution:
        name: rh-dev
      workload:
        replicas: 1
        overrides:
          env:
            - name: VLLM_URL
              valueFrom:
                secretKeyRef:
                  key: VLLM_URL
                  name: ogx-secret
            - name: VLLM_TLS_VERIFY
              valueFrom:
                secretKeyRef:
                  key: VLLM_TLS_VERIFY
                  name: ogx-secret
            - name: VLLM_API_TOKEN
              valueFrom:
                secretKeyRef:
                  key: VLLM_API_TOKEN
                  name: ogx-secret
            - name: EMBEDDING_MODEL
              valueFrom:
                secretKeyRef:
                  key: EMBEDDING_MODEL
                  name: ogx-secret
            - name: VLLM_EMBEDDING_URL
              valueFrom:
                secretKeyRef:
                  key: VLLM_EMBEDDING_URL
                  name: ogx-secret
            - name: VLLM_EMBEDDING_API_TOKEN
              valueFrom:
                secretKeyRef:
                  key: VLLM_EMBEDDING_API_TOKEN
                  name: ogx-secret
            - name: VLLM_EMBEDDING_TLS_VERIFY
              valueFrom:
                secretKeyRef:
                  key: VLLM_EMBEDDING_TLS_VERIFY
                  name: ogx-secret
            - name: EMBEDDING_PROVIDER_MODEL_ID
              value: "<embedding-provider-model-id>"
            - name: POSTGRES_HOST
              value: "<postgres-host>"
            - name: POSTGRES_PORT
              value: "5432"
            - name: POSTGRES_DB
              value: "ogx_metadata"
            - name: POSTGRES_USER
              value: "ogx"
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: password
                  name: <secret-name>
            - name: ENABLE_PGVECTOR
              value: "true"
            - name: PGVECTOR_HOST
              value: <postgres-host>
            - name: PGVECTOR_PORT
              value: "5432"
            - name: PGVECTOR_DB
              value: "pgvector"
            - name: PGVECTOR_USER
              valueFrom:
                secretKeyRef:
                  key: pgvector-user
                  name: <secret-name>
            - name: PGVECTOR_PASSWORD
              valueFrom:
                secretKeyRef:
                  key: pgvector-password
                  name: <secret-name>
        storage:
          size: 5Gi
    Note

    The rh-dev value is an internal image reference. When you create the OGXServer custom resource, the OpenShift AI Operator automatically resolves rh-dev to the container image in the appropriate registry. This internal image reference allows the underlying image to update without requiring changes to your custom resource.

5.1.6.2. Example B: OGXServer with Remote PostgreSQL with pgvector

Use this example when you want to use a PostgreSQL database with the pgvector extension as the vector store backend. This configuration enables the pgvector provider and reads connection values from a secret. This example uses remote embeddings.

  1. Create the pgvector connection secret:

    export PGVECTOR_HOST="<pgvector-hostname>"
    export PGVECTOR_PORT="5432"
    export PGVECTOR_DB="<pgvector-database>"
    export PGVECTOR_USER="<pgvector-username>"
    export PGVECTOR_PASSWORD="<pgvector-password>"
    
    oc create secret generic pgvector-connection -n <project-name> \
      --from-literal=PGVECTOR_HOST="$PGVECTOR_HOST" \
      --from-literal=PGVECTOR_PORT="$PGVECTOR_PORT" \
      --from-literal=PGVECTOR_DB="$PGVECTOR_DB" \
      --from-literal=PGVECTOR_USER="$PGVECTOR_USER" \
      --from-literal=PGVECTOR_PASSWORD="$PGVECTOR_PASSWORD"
  2. In the OpenShift web console, select Administrator → Quick Create → Import YAML, and create a custom resource similar to the following:

    apiVersion: ogx.io/v1beta1
    kind: OGXServer
    metadata:
      name: lsd-llama-pgvector-remote
    spec:
      replicas: 1
      server:
        containerSpec:
          resources:
            requests:
              cpu: "250m"
              memory: "500Mi"
            limits:
              cpu: 4
              memory: "12Gi"
          env:
            # PostgreSQL metadata store (required in OpenShift AI 3.2)
            - name: POSTGRES_HOST
              value: <postgres-host>
            - name: POSTGRES_PORT
              value: "5432"
            - name: POSTGRES_DB
              value: <postgres-database>
            - name: POSTGRES_USER
              value: <postgres-username>
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: <postgres-secret-name>
                  key: <postgres-password-key>
    
            # Remote LLM configuration
            - name: INFERENCE_MODEL
              valueFrom:
                secretKeyRef:
                  name: ogx-secret
                  key: INFERENCE_MODEL
            - name: VLLM_URL
              valueFrom:
                secretKeyRef:
                  name: ogx-secret
                  key: VLLM_URL
            - name: VLLM_TLS_VERIFY
              valueFrom:
                secretKeyRef:
                  name: ogx-secret
                  key: VLLM_TLS_VERIFY
            - name: VLLM_API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: ogx-secret
                  key: VLLM_API_TOKEN
            - name: VLLM_MAX_TOKENS
              valueFrom:
                secretKeyRef:
                  name: ogx-secret
                  key: VLLM_MAX_TOKENS
    
            # Remote embedding configuration
            - name: EMBEDDING_MODEL
              valueFrom:
                secretKeyRef:
                  name: ogx-secret
                  key: EMBEDDING_MODEL
            - name: EMBEDDING_PROVIDER_MODEL_ID
              valueFrom:
                secretKeyRef:
                  name: ogx-secret
                  key: EMBEDDING_PROVIDER_MODEL_ID
            - name: VLLM_EMBEDDING_URL
              valueFrom:
                secretKeyRef:
                  name: ogx-secret
                  key: VLLM_EMBEDDING_URL
            - name: VLLM_EMBEDDING_TLS_VERIFY
              valueFrom:
                secretKeyRef:
                  name: ogx-secret
                  key: VLLM_EMBEDDING_TLS_VERIFY
            - name: VLLM_EMBEDDING_API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: ogx-secret
                  key: VLLM_EMBEDDING_API_TOKEN
            - name: VLLM_EMBEDDING_MAX_TOKENS
              valueFrom:
                secretKeyRef:
                  name: ogx-secret
                  key: VLLM_EMBEDDING_MAX_TOKENS
    
            # Enable and configure pgvector provider
            - name: ENABLE_PGVECTOR
              value: "true"
            - name: PGVECTOR_HOST
              valueFrom:
                secretKeyRef:
                  name: pgvector-connection
                  key: PGVECTOR_HOST
            - name: PGVECTOR_PORT
              valueFrom:
                secretKeyRef:
                  name: pgvector-connection
                  key: PGVECTOR_PORT
            - name: PGVECTOR_DB
              valueFrom:
                secretKeyRef:
                  name: pgvector-connection
                  key: PGVECTOR_DB
            - name: PGVECTOR_USER
              valueFrom:
                secretKeyRef:
                  name: pgvector-connection
                  key: PGVECTOR_USER
            - name: PGVECTOR_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: pgvector-connection
                  key: PGVECTOR_PASSWORD
    
            - name: FMS_ORCHESTRATOR_URL
              value: "http://localhost"
          name: ogx
          port: 8321
        distribution:
          name: rh-dev
    Note

    The rh-dev value is an internal image reference. When you create the OGXServer custom resource, the OpenShift AI Operator automatically resolves rh-dev to the container image in the appropriate registry. This internal image reference allows the underlying image to update without requiring changes to your custom resource.

  3. Click Create.
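
The env lists in the custom resources above repeat the same secretKeyRef pattern for every variable. The following sketch shows one way to generate those entries instead of typing them by hand. The secret_env and to_yaml helpers are hypothetical, and the output is meant to be pasted into the env list and reviewed, not applied automatically.

```python
def secret_env(keys, secret_name):
    """Return env entries that read each key from the named secret."""
    return [
        {"name": key,
         "valueFrom": {"secretKeyRef": {"name": secret_name, "key": key}}}
        for key in keys
    ]


def to_yaml(entries, indent=8):
    """Render the entries in the same layout as the env lists above."""
    pad = " " * indent
    lines = []
    for entry in entries:
        ref = entry["valueFrom"]["secretKeyRef"]
        lines += [
            f"{pad}- name: {entry['name']}",
            f"{pad}  valueFrom:",
            f"{pad}    secretKeyRef:",
            f"{pad}      name: {ref['name']}",
            f"{pad}      key: {ref['key']}",
        ]
    return "\n".join(lines)


PGVECTOR_KEYS = [
    "PGVECTOR_HOST", "PGVECTOR_PORT", "PGVECTOR_DB",
    "PGVECTOR_USER", "PGVECTOR_PASSWORD",
]
print(to_yaml(secret_env(PGVECTOR_KEYS, "pgvector-connection")))
```

Generating the entries from one list of keys keeps the variable name and the secret key identical, which is the convention that the examples in this section follow.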

Verification

  • In the left-hand navigation, click Workloads → Pods and verify that the OGX pod is running in the correct namespace.
  • To verify that the OGX Server is running, click the pod name and select the Logs tab. Look for output similar to the following:

    INFO     2025-05-15 11:23:52,750 __main__:498 server: Listening on ['::', '0.0.0.0']:8321
    INFO:     Started server process [1]
    INFO:     Waiting for application startup.
    INFO     2025-05-15 11:23:52,765 __main__:151 server: Starting up
    INFO:     Application startup complete.
    INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
Tip

If you switch between vector store configurations, delete the existing pod to ensure the new environment variables and backing store are picked up cleanly.
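
If you save the pod log to a file, the startup check from the Verification section can be scripted. This is a minimal sketch: the server_started helper is hypothetical, and the marker strings are taken from the example log output in this section, so they might differ in other server versions.

```python
STARTUP_MARKERS = ("Application startup complete.", "Uvicorn running on")


def server_started(log_text):
    """Return True when every startup marker appears in the log text."""
    return all(marker in log_text for marker in STARTUP_MARKERS)


sample_log = """\
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
"""
print(server_started(sample_log))  # True
```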

5.1.7. Ingesting content into a Llama model

You can quickly customize and prototype retrievable content by uploading a document and adding it to a vector store from inside a Jupyter notebook. This approach avoids building a separate ingestion pipeline. By using the OGX SDK, you can ingest documents into a vector store and enable retrieval-augmented generation (RAG) workflows.

Prerequisites

  • You have installed OpenShift 4.19 or newer.
  • You have deployed a Llama 3.2 model with a vLLM model server.
  • You have created an OGXServer instance.
  • You have configured a PostgreSQL database for OGX metadata storage.
  • You have configured an embedding model:

    • Recommended: You have configured a remote embedding model by using environment variables in the OGXServer.
  • You have created a workbench within a project.
  • You have opened a Jupyter notebook and it is running in your workbench environment.
  • You have installed ogx_client version 1.0.0 or later in your workbench environment.
  • You have installed requests in your workbench environment. This is required for downloading example documents.
  • If you use a remote vector store or remote embedding model, your environment has network access to those services through OpenShift.

Procedure

  1. In a new notebook cell, install the client:

    %pip install ogx_client
  2. Install the requests library if it is not already available:

    %pip install requests
  3. Import OGXClient and create a client instance:

    from ogx_client import OGXClient
    client = OGXClient(base_url="<ogx-base-url>")
  4. List the available models:

    models = client.models.list()
  5. Verify that the list includes:

    • At least one LLM model.
    • At least one embedding model.

      [Model(identifier='llama-32-3b-instruct', model_type='llm', provider_id='vllm-inference'),
       Model(identifier='nomic-embed-text-v1-5', model_type='embedding', metadata={'embedding_dimension': 768})]
  6. Select one LLM and one embedding model:

    model_id = next(m.identifier for m in models if m.model_type == "llm")
    
    embedding_model = next(m for m in models if m.model_type == "embedding")
    embedding_model_id = embedding_model.identifier
    embedding_dimension = int(embedding_model.metadata["embedding_dimension"])
  7. (Optional) Create a vector store. Skip this step if you already have one.

    Note

    Provider IDs can differ between interfaces. In the Python SDK, you typically use the provider name directly (for example, provider_id: "pgvector"). In some CLI tools and examples, remote providers might use a prefixed identifier (for example, --vector-db-provider-id remote-pgvector). Use the provider ID format that matches the interface you are using.

Example 5.1. Option 1: Remote Milvus (recommended for production)

vector_store = client.vector_stores.create(
    name="my_remote_milvus",
    extra_body={
        "embedding_model": embedding_model_id,
        "embedding_dimension": embedding_dimension,
        "provider_id": "milvus-remote",
    },
)
vector_store_id = vector_store.id
Note

Ensure your OGXServer is configured with MILVUS_ENDPOINT and MILVUS_TOKEN.

Example 5.2. Option 2: Remote PostgreSQL with pgvector

vector_store = client.vector_stores.create(
    name="my_pgvector_store",
    extra_body={
        "embedding_model": embedding_model_id,
        "embedding_dimension": embedding_dimension,
        "provider_id": "pgvector",
    },
)
vector_store_id = vector_store.id
Note

Ensure that the pgvector provider is enabled in your OGXServer and that the PostgreSQL instance has the pgvector extension installed.

  8. If you already have a vector store, set its identifier:

    # vector_store_id = "<existing-vector-store-id>"
  9. Download a PDF, upload it to OGX, and add it to your vector store:

    import requests
    
    pdf_url = "https://www.federalreserve.gov/aboutthefed/files/quarterly-report-20250822.pdf"
    filename = "quarterly-report-20250822.pdf"
    
    response = requests.get(pdf_url)
    response.raise_for_status()
    
    with open(filename, "wb") as f:
        f.write(response.content)
    
    with open(filename, "rb") as f:
        file_info = client.files.create(
            file=(filename, f),
            purpose="assistants",
        )
    
    vector_store_file = client.vector_stores.files.create(
        vector_store_id=vector_store_id,
        file_id=file_info.id,
        chunking_strategy={
            "type": "static",
            "static": {
                "max_chunk_size_tokens": 800,
                "chunk_overlap_tokens": 400,
            },
        },
    )
    
    print(vector_store_file)

Verification

  • The call to client.vector_stores.files.create() succeeds and returns metadata for the ingested file.
  • The vector store contains indexed chunks associated with the uploaded document.
  • Subsequent RAG queries can retrieve content from the vector store.
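
The chunking_strategy values used during ingestion determine roughly how many chunks a document produces. The following sketch assumes a plain sliding window over tokens, in which each chunk starts max_chunk_size_tokens minus chunk_overlap_tokens after the previous one. The actual OGX chunker might tokenize or split differently, so treat the result as an estimate. The chunk_windows helper is hypothetical.

```python
def chunk_windows(total_tokens, max_chunk_size_tokens=800, chunk_overlap_tokens=400):
    """Return (start, end) token offsets for an assumed sliding-window chunker."""
    if max_chunk_size_tokens <= 0:
        raise ValueError("max_chunk_size_tokens must be positive")
    if not 0 <= chunk_overlap_tokens < max_chunk_size_tokens:
        raise ValueError("overlap must be non-negative and smaller than the chunk size")
    stride = max_chunk_size_tokens - chunk_overlap_tokens
    windows = []
    start = 0
    while start < total_tokens:
        end = min(start + max_chunk_size_tokens, total_tokens)
        windows.append((start, end))
        if end == total_tokens:
            break  # the last window already reaches the end of the document
        start += stride
    return windows


# A 2000-token document with the settings used in this procedure:
print(chunk_windows(2000))
# [(0, 800), (400, 1200), (800, 1600), (1200, 2000)]
```

With an overlap of half the chunk size, most tokens appear in two chunks, which increases index size in exchange for better recall near chunk boundaries.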

5.1.8. Querying ingested content in a Llama model

You can use the OGX SDK in your Jupyter notebook to query ingested content by running retrieval-augmented generation (RAG) queries on content stored in your vector store. You can perform one-off lookups without setting up a separate retrieval service.

Prerequisites

  • You have installed OpenShift 4.19 or newer.
  • You have enabled GPU support in OpenShift AI. This includes installing the Node Feature Discovery Operator and NVIDIA GPU Operator. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs.
  • If you are using GPU acceleration, you have at least one NVIDIA GPU available.
  • You have activated the OGX Operator in OpenShift AI.
  • You have deployed an inference model, for example, the llama-3.2-3b-instruct model.
  • You have created an OGXServer instance with:

    • PostgreSQL configured as the metadata store.
    • An embedding model configured, preferably as a remote embedding provider.
  • You have created a workbench within a project and opened a running Jupyter notebook.
  • You have installed ogx_client version 1.0.0 or later in your workbench environment.
  • You have already ingested content into a vector store.
Note

This procedure requires that content has already been ingested into a vector store. If no content is available, RAG queries return empty or non-contextual responses.

Procedure

  1. In a new notebook cell, install the client:

    %pip install -q ogx_client
  2. Import OgxClient:

    from ogx_client import OgxClient
  3. Create a client instance:

    # Use the OGX service or route URL that is reachable from the workbench.
    # Do not append /v1 when using ogx_client.
    client = OgxClient(base_url="<ogx-base-url>")
  4. List available models:

    models = client.models.list()
  5. Select an LLM. If you plan to register a new vector store, also capture an embedding model:

    model_id = next(m.identifier for m in models if m.model_type == "llm")
    
    embedding = next((m for m in models if m.model_type == "embedding"), None)
    if embedding:
        embedding_model_id = embedding.identifier
        embedding_dimension = int(embedding.metadata.get("embedding_dimension", 768))
  6. If you do not already have a vector store ID, register a vector store (choose one):

Example 5.3. Option 1: Remote Milvus (recommended for production)

vector_store = client.vector_stores.create(
    name="my_remote_milvus",
    extra_body={
        "embedding_model": embedding_model_id,
        "embedding_dimension": embedding_dimension,
        "provider_id": "milvus-remote",
    },
)
vector_store_id = vector_store.id
Note

Ensure your OGXServer sets MILVUS_ENDPOINT (gRPC port 19530) and MILVUS_TOKEN.

Example 5.4. Option 2: Remote PostgreSQL with pgvector

vector_store = client.vector_stores.create(
    name="my_pgvector_store",
    extra_body={
        "embedding_model": embedding_model_id,
        "embedding_dimension": embedding_dimension,
        "provider_id": "pgvector",
    },
)
vector_store_id = vector_store.id
Note

Ensure the pgvector provider is enabled in your OGXServer and that the PostgreSQL instance has the pgvector extension installed. This option is suitable for production-grade RAG workloads that require durability and concurrency.

  7. If you already have a vector store, set its identifier:

    # vector_store_id = "<existing-vector-store-id>"
  8. Query without using a vector store:

    system_instructions = """You are a precise and reliable AI assistant.
    Use retrieved context when it is available.
    If nothing relevant is found, say so clearly."""
    
    query = "How do you do great work?"
    
    response = client.responses.create(
        model=model_id,
        input=query,
        instructions=system_instructions,
    )
    
    print(response.output_text)
  9. Query by using the Responses API with file search:

    response = client.responses.create(
        model=model_id,
        input=query,
        instructions=system_instructions,
        tools=[
            {
                "type": "file_search",
                "vector_store_ids": [vector_store_id],
            }
        ],
    )
    
    print(response.output_text)
Note

When you include the file_search tool with vector_store_ids, OGX retrieves relevant chunks from the specified vector store and provides them to the model as context for the response.
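
The two queries in this procedure differ only in the tools argument. The following minimal sketch shows one way to build the request arguments once and add file_search only when a vector store ID is available. The build_response_kwargs helper and the example-llm and vs_example values are hypothetical and are not part of the OGX SDK.

```python
def build_response_kwargs(model_id, query, instructions, vector_store_id=None):
    """Return keyword arguments for a Responses API call (hypothetical helper)."""
    kwargs = {
        "model": model_id,
        "input": query,
        "instructions": instructions,
    }
    # Attach the file_search tool only when a vector store is available.
    if vector_store_id:
        kwargs["tools"] = [
            {"type": "file_search", "vector_store_ids": [vector_store_id]}
        ]
    return kwargs


# Placeholder values for illustration only.
plain = build_response_kwargs("example-llm", "How do you do great work?", "Be precise.")
rag = build_response_kwargs(
    "example-llm", "How do you do great work?", "Be precise.", vector_store_id="vs_example"
)
print("tools" in plain)
print(rag["tools"])
```

In a notebook, you could pass the result to the client, for example client.responses.create(**rag).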

Verification

  • The first query returns a response that is generated without retrieval, and the file_search query returns a context-aware response that draws on content from the vector store.
  • No errors appear, confirming successful retrieval and model execution.
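
If no model of the expected type is registered, the next() calls in the model selection step raise StopIteration, which can be hard to interpret in a notebook. The following minimal sketch shows a more defensive selection. The pick_models helper is hypothetical, and the SimpleNamespace objects are stand-ins for the model entries that client.models.list() returns. The sketch assumes only the identifier, model_type, and metadata attributes used earlier in this procedure.

```python
from types import SimpleNamespace


def pick_models(models, default_dimension=768):
    """Return (llm_id, embedding_id, embedding_dimension) with clear errors."""
    llm = next((m for m in models if m.model_type == "llm"), None)
    if llm is None:
        raise LookupError("No model with model_type 'llm' is registered")

    embedding = next((m for m in models if m.model_type == "embedding"), None)
    if embedding is None:
        # An embedding model is needed only when you register a new vector store.
        return llm.identifier, None, None

    dimension = int(embedding.metadata.get("embedding_dimension", default_dimension))
    return llm.identifier, embedding.identifier, dimension


# Stand-in model entries for illustration only.
sample_models = [
    SimpleNamespace(
        identifier="example-embedder",
        model_type="embedding",
        metadata={"embedding_dimension": "384"},
    ),
    SimpleNamespace(identifier="example-llm", model_type="llm", metadata={}),
]
print(pick_models(sample_models))
```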

5.1.9. Preparing documents with Docling for OGX retrieval

You can transform your source documents with a Docling-enabled pipeline and ingest the output into an OGX vector store by using the OGX SDK. This modular approach separates document preparation from ingestion while still enabling an end-to-end, retrieval-augmented generation (RAG) workflow.

The pipeline registers a vector store and downloads the source PDFs, then splits them for parallel processing and converts each batch to Markdown with Docling. It generates embeddings from the Markdown and stores them in the vector store, making the documents searchable through OGX.

Prerequisites

  • You have installed OpenShift 4.19 or newer.
  • You have enabled GPU support in OpenShift AI. This includes installing the Node Feature Discovery Operator and NVIDIA GPU Operator. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs.
  • You have logged in to the OpenShift web console.
  • You have a project and access to pipelines in the OpenShift AI dashboard.
  • You have created and configured a pipeline server within the project that contains your workbench.
  • You have activated the OGX Operator in OpenShift AI.
  • You have deployed an inference model, for example, the llama-3.2-3b-instruct model.
  • You have configured an OGX deployment by creating an OGXServer instance to enable RAG functionality.
  • You have created a workbench within a project.
  • You have opened a Jupyter notebook and it is running in your workbench environment.
  • You have installed the ogx-client version 0.3.1 or later in your workbench environment.
  • You have installed local object storage buckets and created connections, as described in Adding a connection to your project.
  • You have compiled to YAML a pipeline that includes a Docling transform, either one of the RAG demo samples or your own custom pipeline.
  • Your project quota allows between 500 millicores (0.5 CPU) and 4 CPU cores for the pipeline run.
  • Your project quota allows from 2 GiB up to 6 GiB of RAM for the pipeline run.
  • If you are using GPU acceleration, you have at least one NVIDIA GPU available.

Procedure

  1. In a new notebook cell, install the client:

    %pip install -q ogx-client
  2. In a new notebook cell, import OgxClient:

    from ogx_client import OgxClient
  3. In a new notebook cell, assign your deployment endpoint to the base_url parameter to create an OgxClient instance:

    client = OgxClient(base_url="http://<ogx-service>:8321")
    Note

    OgxClient requires the service root without the /v1 path suffix. For example, use http://ogx-service:8321.

    The /v1 suffix is required only when you use OpenAI-compatible SDKs or send raw HTTP requests to the OpenAI-compatible API surface.

  4. List the available models:

    models = client.models.list()
  5. Select the first LLM and the first embedding model:

    model_id = next(m.identifier for m in models if m.model_type == "llm")
    embedding_model = next(m for m in models if m.model_type == "embedding")
    embedding_model_id = embedding_model.identifier
    embedding_dimension = int(embedding_model.metadata.get("embedding_dimension", 768))
  6. Register a vector store (choose one option). Skip this step if your pipeline registers the store automatically.

Example 5.5. Remote Milvus

vector_store_name = "my_remote_db"
vector_store = client.vector_stores.create(
    name=vector_store_name,
    extra_body={
        "embedding_model": embedding_model_id,
        "embedding_dimension": embedding_dimension,
        "provider_id": "milvus-remote",  # remote Milvus provider
    },
)
vector_store_id = vector_store.id
print(f"Registered remote Milvus DB: {vector_store_id}")
Note

Ensure that your OGXServer sets MILVUS_ENDPOINT (gRPC port 19530) and MILVUS_TOKEN.

Important

If you are using the sample Docling pipeline from the RAG demo repository, the pipeline registers the vector store automatically and you can skip the previous step. If you are using your own pipeline, you must register the vector store yourself.

  7. In the OpenShift web console, import the YAML file containing your Docling pipeline into your project, as described in Importing a pipeline.
  8. Create a pipeline run to execute your Docling pipeline, as described in Executing a pipeline run. The pipeline run inserts your PDF documents into the vector store. If you run the Docling pipeline from the RAG demo samples repository, you can optionally customize the following parameters before starting the pipeline run:

    • base_url: The base URL to fetch PDF files from.
    • pdf_filenames: A comma-separated list of PDF filenames to download and convert.
    • num_workers: The number of parallel workers.
    • vector_store_id: The vector store identifier.
    • service_url: The Milvus service URL.
    • embed_model_id: The embedding model to use.
    • max_tokens: The maximum tokens for each chunk.
    • use_gpu: Enable or disable GPU acceleration.
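
The note about the /v1 path suffix in the client creation step can be captured in a small helper. The following minimal sketch normalizes a URL for use with OgxClient and derives the form that OpenAI-compatible SDKs expect. The service_root and openai_compatible_url function names are hypothetical, and http://ogx-service:8321 is the example address used in this procedure.

```python
from urllib.parse import urlsplit, urlunsplit


def service_root(url):
    """Return the URL without a trailing /v1 segment, as OgxClient expects."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if path.endswith("/v1"):
        path = path[: -len("/v1")]
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def openai_compatible_url(url):
    """Return the URL with a single /v1 suffix, for OpenAI-compatible SDKs."""
    return service_root(url) + "/v1"


print(service_root("http://ogx-service:8321/v1/"))
print(openai_compatible_url("http://ogx-service:8321"))
```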

Verification

  1. In your Jupyter notebook, query the LLM with a question that relates to the ingested content:

    system_instructions = """You are a precise and reliable AI assistant.
    Use retrieved context when it is available.
    If nothing relevant is found in the available files, say so clearly."""
    
    prompt = "What can you tell me about the birth of word processing?"
    
    # Query using the Responses API with file search
    response = client.responses.create(
        model=model_id,
        input=prompt,
        instructions=system_instructions,
        tools=[
            {
                "type": "file_search",
                "vector_store_ids": [vector_store_id],
            }
        ],
    )
    
    print("Answer (with vector stores):")
    print(response.output_text)
  2. Query chunks from the vector store:

    query_result = client.vector_io.query(
        vector_store_id=vector_store_id,
        query="word processing",
    )
    print(query_result)
  • The pipeline run completes successfully in your project.
  • Document embeddings are stored in the vector store and are available for retrieval.
  • No errors or warnings appear in the pipeline logs or your notebook output.

5.1.10. About OGX search types

OGX supports keyword, vector, and hybrid search modes for retrieving context in retrieval-augmented generation (RAG) workloads. Each mode offers different tradeoffs in precision, recall, semantic depth, and computational cost.
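
To illustrate how a hybrid mode can combine keyword and vector results, the following minimal sketch applies reciprocal rank fusion (RRF), a widely used technique for merging ranked lists. This is an illustration only. It does not describe how OGX or a specific vector store provider fuses results, and the document IDs and the constant k=60 are assumptions chosen for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents ranked near the top of any list receive a larger share.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by document ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a keyword search and a vector search.
keyword_hits = ["doc-a", "doc-b", "doc-c"]
vector_hits = ["doc-c", "doc-a", "doc-d"]
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

Documents that appear in both lists, such as doc-a and doc-c in this example, rank ahead of documents that appear in only one list.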

5.1.10.1. Supported search modes

5.2. Evaluating RAG systems with OGX

You can use the evaluation providers that OGX exposes to measure and improve the quality of your Retrieval-Augmented Generation (RAG) workloads in OpenShift AI. This section introduces RAG evaluation providers, describes how to use Ragas with OGX, shows how to benchmark embedding models with BEIR, and helps you choose the right provider for your use case.

5.2.1. Understanding RAG evaluation providers

OGX supports pluggable evaluation providers that measure the quality and performance of Retrieval-Augmented Generation (RAG) pipelines. Evaluation providers assess how accurately, faithfully, and relevantly the generated responses align with the retrieved context and the original user query. Each provider implements its own metrics and evaluation methodology. You can enable a specific provider through the configuration of the OGXServer custom resource.

OpenShift AI supports the following evaluation providers:

  • Ragas: A lightweight, Python-based framework that evaluates factuality, contextual grounding, and response relevance.
  • AutoRAG: Automatically optimizes RAG configurations for your documents. For more information, see AutoRAG overview.
  • TrustyAI: A Red Hat framework that evaluates explainability, fairness, and reliability of model outputs.

Evaluation providers operate independently of model serving and retrieval components. You can run evaluations asynchronously and aggregate results for quality tracking over time.

5.2.2. Using Ragas with OGX

You can use the Ragas (Retrieval-Augmented Generation Assessment) evaluation provider with OGX to measure the quality of your Retrieval-Augmented Generation (RAG) workflows in OpenShift AI. Ragas integrates with the OGX evaluation API to compute metrics such as faithfulness, answer relevancy, and context precision for your RAG workloads.

OGX exposes evaluation providers as part of its API surface. When you configure Ragas as a provider, the OGX server sends RAG inputs and outputs to Ragas and records the resulting metrics for later analysis.

Ragas evaluation with OGX in OpenShift AI supports the following deployment modes:

  • Inline provider for development and small-scale experiments.
  • Remote provider for production-scale evaluations that run as AI pipelines in OpenShift AI.

You choose the mode that best fits your workflow:

  • Use the inline provider when you want fast, low-overhead evaluation while you iterate on prompts, retrieval configuration, or model choices.
  • Use the remote provider when you need to evaluate large datasets, integrate with CI/CD pipelines, or run repeated benchmarks at scale.
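
To give a sense of what a metric such as context precision measures, the following minimal sketch computes a rank-weighted precision score from binary relevance labels for the retrieved chunks. It assumes that the labels are already available. Ragas derives its judgments with a model, so treat this as an illustration of the idea rather than the Ragas implementation. The context_precision function name is hypothetical.

```python
def context_precision(relevance):
    """Average precision@k over the positions that hold a relevant chunk."""
    hits = 0
    weighted = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at position k, counted only at relevant positions.
            weighted += hits / k
    return weighted / hits if hits else 0.0


# Labels for three retrieved chunks, in rank order: relevant, not relevant, relevant.
print(round(context_precision([1, 0, 1]), 4))
```

The score is higher when relevant chunks appear earlier in the retrieved list.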

For more information, see Evaluating RAG systems with Ragas.

5.3. Using PostgreSQL in OGX

PostgreSQL is a dependency for OGX deployments in OpenShift AI, where it serves as the mandatory metadata storage backend for supported vector storage configurations. Additionally, you can configure PostgreSQL as a remote vector database provider by enabling the pgvector extension.

In OpenShift AI, PostgreSQL serves the following roles in OGX deployments:

  • Required metadata storage for OGX APIs and orchestration services.
  • An optional remote vector database when the pgvector provider is enabled.

Depending on your deployment requirements, these roles can be fulfilled by the same PostgreSQL instance or separate instances. For example, you might use a single instance for development and testing environments, and separate instances for production deployments that require independent scaling or isolation.

Important

The procedures provide basic configuration suitable for development and testing. Production deployments require additional planning, including the following considerations:

  • High availability and replication
  • Backup and disaster recovery
  • Security hardening and encryption
  • Performance tuning and monitoring

5.3.1. Understanding PostgreSQL in OGX

5.3.1.1. Understanding OGX metadata storage

In OpenShift AI, OGX requires PostgreSQL as a metadata storage backend to persist state and configuration data across multiple components. Metadata storage provides durable persistence for vector stores, file management, agent state, conversation history, and other OGX services.

PostgreSQL is required as the metadata storage backend for all OGX deployments in OpenShift AI.

5.3.1.1.1. Role of metadata storage in OGX

OGX components require persistent storage beyond in-memory data structures. Without metadata storage, component state would be lost on pod restarts or application failures.

OGX uses metadata storage to persist:

  • Vector store metadata, such as collection identifiers and document mappings.
  • File metadata, including file locations, identifiers, and attributes.
  • Agent state and conversation history.
  • Dataset configurations and batch processing state.
  • Model registry information and prompt templates.

This persistent storage allows OGX to maintain operational state across pod restarts, rescheduling, and application updates.

5.3.1.1.2. PostgreSQL metadata storage backends

OGX uses PostgreSQL to store multiple categories of metadata, including vector store metadata, file records, agent state, conversation history, and configuration data. These data types have different storage characteristics but are managed automatically within a single PostgreSQL instance.

Important

PostgreSQL version 14 or later is required for all OGX deployments, including development, testing, and production environments.

If validation errors occur, confirm that the deployed OGX image version matches the configuration schema referenced by your run.yaml.

OGX does not provision or manage the PostgreSQL instance used for metadata storage. You must deploy and manage the PostgreSQL database and supply its connection details when deploying OGX.
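
Because you manage the PostgreSQL instance yourself, it can help to confirm the version requirement before you deploy OGX. The following minimal sketch checks a version string against the minimum major version of 14. It assumes that you have already obtained the string, for example from the output of SHOW server_version; in any PostgreSQL client. The function names are hypothetical.

```python
import re

MIN_MAJOR_VERSION = 14  # Minimum PostgreSQL major version stated in this section.


def postgres_major(server_version):
    """Extract the major version number from a server_version string."""
    match = re.match(r"\s*(\d+)", server_version)
    if not match:
        raise ValueError(f"Unrecognized version string: {server_version!r}")
    return int(match.group(1))


def meets_minimum(server_version, minimum=MIN_MAJOR_VERSION):
    """Return True if the server version satisfies the minimum major version."""
    return postgres_major(server_version) >= minimum


# Example version strings for illustration only.
print(meets_minimum("16.3 (Debian 16.3-1.pgdg120+1)"))
print(meets_minimum("13.14"))
```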

5.3.2. Deploying and Configuring PostgreSQL

5.3.2.1. Deploying a PostgreSQL instance with pgvector

You can connect OGX in OpenShift AI to an existing PostgreSQL instance that has the pgvector extension enabled. For development or evaluation, you can also deploy a PostgreSQL instance with the pgvector extension directly in your OpenShift project by creating Kubernetes resources through the OpenShift web console. This procedure focuses on deploying PostgreSQL with the pgvector extension for use as a remote vector store. It does not cover preparing a PostgreSQL database for use as OGX metadata storage.

Prerequisites

  • You have installed OpenShift 4.19 or newer.
  • You have permissions to create resources in a project in your OpenShift cluster.
  • You have PostgreSQL connection details available, including the database name, user name, and password.
  • If you plan to deploy PostgreSQL in-cluster, you have a StorageClass that can provision persistent volumes.
  • If you are using an existing PostgreSQL instance, the pgvector extension is installed and enabled on the target database.

Procedure

  1. Log in to the OpenShift web console.
  2. Select the project where you want to deploy the PostgreSQL instance.
  3. Click the Quick Create icon, and then click Import YAML.
  4. Verify that the correct project is selected.
  5. Copy the following YAML, replace the placeholder values, and paste it into the YAML editor.

    Important

    This example deploys a standalone PostgreSQL service with the pgvector extension enabled.

    OGX does not automatically use this database. To use this PostgreSQL instance as a vector store, you must explicitly configure the pgvector provider in an OGXServer.

    This example is intended for development or evaluation purposes. For production deployments, review and adapt the configuration to meet your organization’s security, availability, backup, and lifecycle requirements.

    Example PostgreSQL deployment with pgvector (development or evaluation)

    apiVersion: v1
    kind: Secret
    metadata:
      name: <pgvector-postgresql-credentials-secret>
    type: Opaque
    stringData:
      POSTGRES_DB: "<database-name>"
      POSTGRES_USER: "<database-username>"
      POSTGRES_PASSWORD: "<database-password>"
    
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: <pgvector-postgresql-pvc>
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: <storage-size>
    
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: <pgvector-postgresql-deployment>
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: <pgvector-postgresql-app-label>
      template:
        metadata:
          labels:
            app: <pgvector-postgresql-app-label>
        spec:
          containers:
          - name: postgres
            image: pgvector/pgvector:pg16
            ports:
            - name: postgres
              containerPort: 5432
            env:
            - name: POSTGRES_DB
              valueFrom:
                secretKeyRef:
                  name: <pgvector-postgresql-credentials-secret>
                  key: POSTGRES_DB
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: <pgvector-postgresql-credentials-secret>
                  key: POSTGRES_USER
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: <pgvector-postgresql-credentials-secret>
                  key: POSTGRES_PASSWORD
            volumeMounts:
            - name: pgdata
              mountPath: /var/lib/postgresql/data
    
            # Exec probes run pg_isready to confirm that PostgreSQL accepts connections.
            readinessProbe:
              exec:
                command:
                - /bin/sh
                - -c
                - pg_isready -h 127.0.0.1 -U "$POSTGRES_USER" -d "$POSTGRES_DB"
              initialDelaySeconds: 10
              periodSeconds: 10
              timeoutSeconds: 5
              failureThreshold: 6
            livenessProbe:
              exec:
                command:
                - /bin/sh
                - -c
                - pg_isready -h 127.0.0.1 -U "$POSTGRES_USER" -d "$POSTGRES_DB"
              initialDelaySeconds: 30
              periodSeconds: 20
              timeoutSeconds: 5
              failureThreshold: 6
    
            # Create the pgvector extension after PostgreSQL is actually accepting SQL.
            lifecycle:
              postStart:
                exec:
                  command:
                  - /bin/sh
                  - -c
                  - |
                    set -e
                    echo "Waiting for PostgreSQL to be ready before enabling pgvector..."
                    until PGPASSWORD="$POSTGRES_PASSWORD" psql -h 127.0.0.1 -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "SELECT 1" >/dev/null 2>&1; do
                      sleep 2
                    done
                    PGPASSWORD="$POSTGRES_PASSWORD" psql -h 127.0.0.1 -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "CREATE EXTENSION IF NOT EXISTS vector;"
    
          volumes:
          - name: pgdata
            persistentVolumeClaim:
              claimName: <pgvector-postgresql-pvc>
    
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: <pgvector-postgresql-service>
    spec:
      selector:
        app: <pgvector-postgresql-app-label>
      ports:
      - name: postgres
        port: 5432
        targetPort: 5432
      type: ClusterIP

  6. Click Create.

Verification

  1. Navigate to Networking → Services.
  2. Confirm that the PostgreSQL Service is listed and exposes port 5432.
  3. Navigate to Workloads → Pods.
  4. Confirm that the PostgreSQL pod is running.
Note

This procedure verifies only that PostgreSQL with pgvector is deployed and reachable within the project. It does not verify integration with OGX.
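
If you also want to confirm reachability from a workbench in the same project, the following minimal sketch opens a TCP connection to the Service port. It checks only that the port accepts connections, not that PostgreSQL accepts SQL or that pgvector is enabled. The can_connect helper is hypothetical, and the host name in the comment is the placeholder Service name from the example YAML.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call from a notebook cell (replace the placeholder with your Service name):
# can_connect("<pgvector-postgresql-service>", 5432)
```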

5.3.2.2. Configuring the pgvector remote provider in OGX

To use PostgreSQL with the pgvector extension as a remote vector store, configure pgvector in your existing OGXServer and provide PostgreSQL connection details as environment variables. Ensure that your OGXServer already includes the PostgreSQL metadata storage configuration. This setup enables retrieval-augmented generation (RAG) workflows in OpenShift AI by using PostgreSQL-based vector storage.

Prerequisites

  • You have installed and enabled the OGX Operator in OpenShift AI.
  • You have a PostgreSQL database with the pgvector extension enabled. OGX uses PostgreSQL for two purposes: metadata storage and the optional pgvector remote vector store. You can use a single PostgreSQL instance for both roles or deploy separate instances.
  • You have the PostgreSQL connection details, including the host name, port number, database name, user name, and password.
  • You have permissions to create Secrets and edit custom resources in your project.

Procedure

  1. In the OpenShift web console, switch to the Administrator perspective.
  2. Create a Secret that stores the PostgreSQL connection details.

    1. Ensure that the correct project is selected.
    2. Click Workloads → Secrets.
    3. Click Create → From YAML.
    4. Paste the following YAML, update the placeholder values, and then click Create.

      Example Secret for pgvector connection details

      apiVersion: v1
      kind: Secret
      metadata:
        name: pgvector-connection
      type: Opaque
      stringData:
        PGVECTOR_HOST: "<pgvector-hostname>"
        PGVECTOR_PORT: "<pgvector-port>"
        PGVECTOR_DB: "<database-name>"
        PGVECTOR_USER: "<database-username>"
        PGVECTOR_PASSWORD: "<database-password>"

      Important

      The pgvector provider is not enabled automatically.

      You must explicitly enable pgvector and supply its connection details through environment variables in your OGXServer.

      In OpenShift AI, the pgvector provider is enabled when the ENABLE_PGVECTOR environment variable is set.

  3. Update your OGXServer custom resource to enable pgvector and reference the Secret.

    1. Select the OGX Operator.
    2. Click the OGXServer tab.
    3. Select your OGXServer resource.
    4. Click YAML.
    5. Update the resource to include the following fields, and then click Save.

      Before you enable pgvector, deploy an OGX server and configure the PostgreSQL metadata store.

For more information, see Deploying an OGX server.

Then update your existing OGXServer to add the pgvector configuration shown in the following example. The example shows only the additional environment variables required to enable the pgvector provider.

Example OGXServer configuration for pgvector

apiVersion: ogx.io/v1beta1
kind: OGXServer
metadata:
  name: ogx
spec:
  distribution:
    name: rh-dev
  workload:
    overrides:
      env:
        - name: ENABLE_PGVECTOR
          value: "true"
        - name: PGVECTOR_HOST
          valueFrom:
            secretKeyRef:
              name: pgvector-connection
              key: PGVECTOR_HOST
        - name: PGVECTOR_PORT
          valueFrom:
            secretKeyRef:
              name: pgvector-connection
              key: PGVECTOR_PORT
        - name: PGVECTOR_DB
          valueFrom:
            secretKeyRef:
              name: pgvector-connection
              key: PGVECTOR_DB
        - name: PGVECTOR_USER
          valueFrom:
            secretKeyRef:
              name: pgvector-connection
              key: PGVECTOR_USER
        - name: PGVECTOR_PASSWORD
          valueFrom:
            secretKeyRef:
              name: pgvector-connection
              key: PGVECTOR_PASSWORD

Verification

  1. Click Workloads → Pods.
  2. Confirm that the OGX pod restarts and reaches the Running state.
  3. Open the pod logs and confirm that the server starts successfully and initializes the pgvector provider without errors.
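
If the provider fails to initialize, a common first check is whether every expected environment variable reached the container. The following minimal sketch reports which of the variables from the example configuration are unset or empty. It assumes that you run it in an environment that has the same variables as the OGX container, and the missing_pgvector_settings helper is hypothetical.

```python
import os

# Variable names taken from the example OGXServer configuration.
PGVECTOR_VARS = (
    "ENABLE_PGVECTOR",
    "PGVECTOR_HOST",
    "PGVECTOR_PORT",
    "PGVECTOR_DB",
    "PGVECTOR_USER",
    "PGVECTOR_PASSWORD",
)


def missing_pgvector_settings(env=None):
    """Return the names of pgvector variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in PGVECTOR_VARS if not env.get(name)]


# Sample environment for illustration only; the host name is a placeholder.
sample_env = {
    "ENABLE_PGVECTOR": "true",
    "PGVECTOR_HOST": "pgvector.example.svc",
    "PGVECTOR_PORT": "5432",
}
print(missing_pgvector_settings(sample_env))
```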

5.4. Using Qdrant in OGX

Qdrant is a supported remote vector store provider for OGX in OpenShift AI. You can deploy Qdrant in your OpenShift project or connect to an existing Qdrant instance, and configure OGX to use Qdrant for retrieval-augmented generation (RAG) workloads.

To use Qdrant with OGX, complete the following tasks:

  • Review how Qdrant integrates with OGX.
  • Deploy a Qdrant instance or connect to an existing deployment.
  • Configure your OGXServer to use Qdrant as the vector store provider.
  • Perform vector operations through the OpenAI-compatible Vector Stores API.

5.4.1. Overview of Qdrant vector databases

Qdrant is an open source vector database optimized for high-performance similarity search and advanced filtering. In OpenShift AI, Qdrant is supported as a remote vector store provider for OGX and can be used in retrieval-augmented generation (RAG) workloads that require efficient vector indexing and durable storage.

When used with OGX in OpenShift AI, Qdrant provides:

  • High-performance similarity search using Hierarchical Navigable Small World (HNSW) indexing
  • Filtering based on stored metadata during vector search
  • Persistent storage of vector data
  • Integration through the OpenAI-compatible Vector Stores API

In a RAG workflow:

  • Embeddings are generated by the configured embedding provider.
  • Qdrant stores embedding vectors and performs similarity search.
  • OGX manages ingestion, retrieval, and model inference through a unified API.
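
To make the similarity search step concrete, the following minimal sketch ranks stored vectors by cosine similarity to a query vector by using an exhaustive comparison. Qdrant uses HNSW indexing to approximate this kind of search efficiently, and the distance metric depends on the collection configuration, so this is an illustration of the concept only. The vectors, document IDs, and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, vectors, k=2):
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = sorted(
        ((cosine_similarity(query, vector), doc_id) for doc_id, vector in vectors.items()),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:k]]


# Tiny two-dimensional embeddings for illustration; real embeddings have hundreds of dimensions.
stored = {"doc-a": [1.0, 0.0], "doc-b": [0.0, 1.0], "doc-c": [0.7, 0.7]}
print(top_k([0.9, 0.1], stored))
```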

In OpenShift AI, you must deploy Qdrant as a remote service, either within your OpenShift project or as an externally managed deployment.

Note

Inline Qdrant is not supported. To use Qdrant with OGX in OpenShift AI, deploy Qdrant as a remote service.

A typical remote deployment includes:

  • A Qdrant service exposing HTTP (port 6333) and gRPC (port 6334) endpoints
  • Persistent storage for vector data
  • Optional API key authentication

For deployment and configuration instructions, see Using Qdrant in OGX.

5.4.2. Deploying a Qdrant vector database

You can connect OGX in OpenShift AI to an existing Qdrant instance or deploy a Qdrant vector database in your OpenShift project. For development or evaluation purposes, you can deploy Qdrant by creating Kubernetes resources in the OpenShift web console.

Prerequisites

  • You have installed OpenShift 4.19 or later.
  • You have permission to create resources in a project.
  • A StorageClass is available that can provision a PersistentVolume for the PersistentVolumeClaim used by this deployment.

    Note

    This example uses a single PersistentVolumeClaim. If your cluster uses dynamic provisioning, the StorageClass provisions the required PersistentVolume automatically.

  • Optional: You have an API key for Qdrant authentication. If your Qdrant instance does not require authentication, remove the Secret and the QDRANT__SERVICE__API_KEY environment variable from the deployment example.

Procedure

  1. Log in to the OpenShift web console.
  2. From the Project list, select the project where you want to deploy Qdrant.
  3. Click Import YAML.
  4. Paste the following YAML:

    Important

    This example deploys a standalone Qdrant service for development or evaluation. For production deployments, review and adapt the configuration to meet your organization’s security, availability, backup, and lifecycle requirements.

    apiVersion: v1
    kind: Secret
    metadata:
      name: <qdrant_credentials_secret>
    type: Opaque
    stringData:
      QDRANT_API_KEY: "<api_key>"
    
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: <qdrant_pvc>
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: <storage_size>
    
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: <qdrant_deployment>
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: <qdrant_app_label>
      template:
        metadata:
          labels:
            app: <qdrant_app_label>
        spec:
          containers:
          - name: qdrant
            image: qdrant/qdrant:v1.12.0
            ports:
            - name: http
              containerPort: 6333
            - name: grpc
              containerPort: 6334
            env:
            - name: QDRANT__SERVICE__API_KEY
              valueFrom:
                secretKeyRef:
                  name: <qdrant_credentials_secret>
                  key: QDRANT_API_KEY
            volumeMounts:
            - name: qdrant-storage
              mountPath: /qdrant/storage
            - name: qdrant-storage
              mountPath: /qdrant/snapshots
              subPath: snapshots
            readinessProbe:
              httpGet:
                path: /readyz
                port: 6333
              initialDelaySeconds: 5
              periodSeconds: 10
            livenessProbe:
              httpGet:
                path: /healthz
                port: 6333
              initialDelaySeconds: 10
              periodSeconds: 20
          volumes:
          - name: qdrant-storage
            persistentVolumeClaim:
              claimName: <qdrant_pvc>
    
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: <qdrant_service>
    spec:
      selector:
        app: <qdrant_app_label>
      ports:
      - name: http
        port: 6333
        targetPort: 6333
      - name: grpc
        port: 6334
        targetPort: 6334
      type: ClusterIP
    Note

    If your Qdrant instance does not require authentication, remove the Secret and the QDRANT__SERVICE__API_KEY environment variable from the Deployment configuration.

  5. Replace the placeholder values as follows:

    • <qdrant_credentials_secret>: A name for the Secret that stores the Qdrant API key, for example qdrant-credentials.
    • <api_key>: An API key for authenticating with Qdrant. If authentication is not required, remove the Secret and the QDRANT__SERVICE__API_KEY environment variable from the Deployment.
    • <qdrant_pvc>: A name for the PersistentVolumeClaim, for example qdrant-pvc.
    • <storage_size>: The storage capacity to request, for example 10Gi.
    • <qdrant_deployment>: A name for the Deployment, for example qdrant.
    • <qdrant_app_label>: A label for the application, for example qdrant.
    • <qdrant_service>: A name for the Service, for example qdrant-service.
  6. Click Create.

Verification

  • The Qdrant Service is present in the project and exposes ports 6333 (HTTP) and 6334 (gRPC). You can confirm this on the Networking → Services page in the OpenShift web console.
  • The Qdrant pod reaches the Running state. You can confirm this on the Workloads → Pods page in the OpenShift web console.
Note

This verification confirms only that Qdrant is deployed and reachable within the project. To use this Qdrant instance with OGX, configure the Qdrant provider in an OGXServer.
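The OGX configuration that follows needs the URL of this Service. As a minimal sketch, the helper below builds that URL from the Service name. The function name qdrant_service_url is hypothetical, the fully qualified form assumes the default cluster DNS suffix svc.cluster.local, and port 6333 is the HTTP port defined in the Service above.

```shell
# Build the HTTP URL of an in-cluster Qdrant Service.
# Arguments: <service_name> [namespace] [port]
# Assumption: the cluster uses the default DNS suffix svc.cluster.local.
qdrant_service_url() {
  svc="$1"
  ns="${2:-}"
  port="${3:-6333}"
  if [ -z "$svc" ]; then
    echo "usage: qdrant_service_url <service_name> [namespace] [port]" >&2
    return 1
  fi
  if [ -n "$ns" ]; then
    # Fully qualified name, for clients running in a different project.
    printf 'http://%s.%s.svc.cluster.local:%s\n' "$svc" "$ns" "$port"
  else
    # Short name, resolvable only from pods in the same project.
    printf 'http://%s:%s\n' "$svc" "$port"
  fi
}

# Example: qdrant_service_url qdrant-service
```

The short form matches the http://qdrant-service:6333 example used for the QDRANT_URL value. Access from another project can still be blocked by network policies, so treat the fully qualified form as a starting point.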

5.4.3. Configuring the Qdrant remote provider in OGX

To use Qdrant as a remote vector store, configure your OGXServer resource with the connection details for your Qdrant service. This configuration enables OGX to store and retrieve embedding vectors using Qdrant in OpenShift AI.

Prerequisites

  • You have installed and enabled the OGX Operator in OpenShift AI.
  • You have a running Qdrant instance that is accessible from your OpenShift cluster.
  • You have the Qdrant connection details, including the service URL and, if required, an API key.
  • You have permission to create Secrets and modify custom resources in your project.

Procedure

  1. In the OpenShift web console, switch to the Administrator perspective.
  2. Create a Secret that stores the Qdrant connection details used by OGX. This Secret must contain the URL of the Qdrant service and, if required, the API key.

    Note

    If you deployed Qdrant by using the procedure in Deploying a Qdrant vector database, create this Secret separately for the OGX configuration. The Secret created during the Qdrant deployment does not contain the QDRANT_URL value required by the OGX provider.

    1. From the Project list, select the project where the OGXServer resource is deployed.
    2. Click Workloads → Secrets.
    3. Click Create → From YAML.
    4. Paste the following YAML:

      apiVersion: v1
      kind: Secret
      metadata:
        name: qdrant-connection
      type: Opaque
      stringData:
        QDRANT_URL: "<qdrant_url>"
        QDRANT_API_KEY: "<api_key>"
    5. Replace the placeholder values as follows:

      • <qdrant_url>: The full URL to the Qdrant service, for example http://qdrant-service:6333. For in-cluster deployments, use the Service name and port. For external deployments, use the external URL.
      • <api_key>: The API key for authenticating with Qdrant. If authentication is not enabled for your Qdrant instance, remove the QDRANT_API_KEY entry from both the Secret and the env section in the OGXServer configuration.
    6. Click Create.
  3. Update your OGXServer custom resource to reference the Secret and supply the required environment variables.

    1. Click Operators → Installed Operators.
    2. Select the OGX Operator.
    3. Click the OGXServer tab.
    4. Select your OGXServer resource.
    5. Click YAML.
    6. Update the resource to include the following fields.

      Note

      The environment variable names and configuration fields used by the Qdrant provider can vary depending on the OGX version included with OpenShift AI. Before applying this configuration, verify that the variables and fields match the supported versions listed in Supported Configurations for 3.x.

      apiVersion: ogx.io/v1beta1
      kind: OGXServer
      metadata:
        name: ogx
      spec:
        server:
          containerSpec:
            env:
              - name: ENABLE_QDRANT
                value: "true"
              - name: QDRANT_URL
                valueFrom:
                  secretKeyRef:
                    name: qdrant-connection
                    key: QDRANT_URL
              - name: QDRANT_API_KEY
                valueFrom:
                  secretKeyRef:
                    name: qdrant-connection
                    key: QDRANT_API_KEY
    7. Click Save.

Verification

  • The OGX pod reaches the Running state. You can confirm this on the Workloads → Pods page in the OpenShift web console.
  • The pod logs show that the Qdrant provider initializes successfully and does not report connection errors.
  • Vector operations executed through the OGX API complete successfully, confirming that OGX can communicate with Qdrant.

    For information about performing vector operations, see Performing vector operations with Qdrant.

5.4.4. Performing vector operations with Qdrant

After configuring Qdrant as the vector store provider in OGX, you can perform vector operations by using the OpenAI-compatible Vector Stores API exposed by OGX. These operations include creating vector stores, adding documents, performing similarity search, and deleting vector stores. You interact with the OGX API rather than connecting directly to Qdrant. OGX manages collection creation, embedding generation, and query execution on your behalf.

Prerequisites

  • You have installed and enabled the OGX Operator in OpenShift AI.
  • You have configured Qdrant as the vector store provider in your OGXServer.
  • You have an embedding model available through a configured inference provider.
  • You have network access to the OGX API endpoint.
  • You have installed the jq command-line utility.

    For installation instructions, see the jq documentation at jqlang.org.

  • You have the curl command-line tool installed.

Procedure

  1. Determine how you will access the OGX API.

    You can access the API from within the cluster or from outside the cluster.

    • In-cluster access: Run the curl commands from a pod in the same project, or from a workbench that has network access to the OGX Service.
    • External access: Expose the OGX Service by creating a Route, and then use the Route URL from your local workstation.

      For this procedure, set OGX_URL to the service or route root URL without the /v1 suffix. The example commands append /v1 as part of the endpoint path.

      For more information about API compatibility and base URL requirements, see OpenAI compatibility for RAG APIs in OGX.

      Example base URL for in-cluster access

      OGX_URL="http://ogx-service:8321"

      Example base URL for external access through a Route

      OGX_URL="https://ogx-route.example.com"

  2. Create a vector store and capture its ID.

    CREATE_RESPONSE=$(curl -s -X POST "${OGX_URL}/v1/vector_stores" \
      -H "Content-Type: application/json" \
      -d '{
        "name": "my-rag-store",
        "embedding_model": "vllm/ibm-granite/granite-embedding-125m-english",
        "embedding_dimension": 768,
        "provider_id": "qdrant-remote"
      }')
    
    VECTOR_STORE_ID=$(echo "$CREATE_RESPONSE" | jq -r '.id')
    echo "Vector store ID: ${VECTOR_STORE_ID}"

    Ensure that the VECTOR_STORE_ID variable contains a valid value before continuing.
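The check on VECTOR_STORE_ID can be scripted. The following is a minimal sketch, not part of the OGX tooling: the function name require_id is hypothetical. It relies on the fact that jq -r prints the literal string null when the .id field is missing, which typically happens when the API returns an error object instead of a vector store.

```shell
# Return success only if the value looks like a usable ID.
# Both the empty string and the literal "null" (printed by `jq -r`
# for a missing field) are treated as failures.
# Arguments: <variable_name> <value>
require_id() {
  name="$1"
  value="$2"
  case "$value" in
    ""|null)
      echo "error: ${name} is empty; inspect the API response before continuing" >&2
      return 1
      ;;
  esac
  return 0
}

# Usage after the create request:
# require_id VECTOR_STORE_ID "${VECTOR_STORE_ID}" || echo "${CREATE_RESPONSE}"
```

Printing CREATE_RESPONSE on failure shows the error body returned by the API, which usually identifies a missing embedding model or an unknown provider_id.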

5.4.4.1. Add files to a vector store

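Upload a file through the /v1/files endpoint, and then attach it to the vector store for ingestion. OGX automatically splits the content into chunks, generates embeddings, and stores them in Qdrant.

Example using curl

# Upload the file through the Files API and capture its ID.
FILE_RESPONSE=$(curl -s -X POST "${OGX_URL}/v1/files" \
  -F "file=@/path/to/document.pdf" \
  -F "purpose=assistants")

FILE_ID=$(echo "$FILE_RESPONSE" | jq -r '.id')
echo "File ID: ${FILE_ID}"

# Attach the uploaded file to the vector store for ingestion.
curl -s -X POST "${OGX_URL}/v1/vector_stores/${VECTOR_STORE_ID}/files" \
  -H "Content-Type: application/json" \
  -d "{\"file_id\": \"${FILE_ID}\"}"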

5.4.4.2. Query a vector store

Perform similarity search to retrieve relevant content from the vector store. The search query is converted into an embedding and compared with stored vectors in Qdrant.

Example using curl

curl -X POST "${OGX_URL}/v1/vector_stores/${VECTOR_STORE_ID}/search" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is retrieval-augmented generation?",
    "max_results": 5
  }'

5.4.4.3. Delete a vector store

Delete a vector store when it is no longer required. This removes the vector store and its associated data from Qdrant.

Example using curl

curl -X DELETE "${OGX_URL}/v1/vector_stores/${VECTOR_STORE_ID}"

Verification

  • Creating a vector store returns a valid vector store ID.
  • File uploads complete successfully and are accepted by the API.
  • Search queries return results from the ingested content.

5.5. Using external S3-compatible storage for the Files API

You can configure OpenShift AI to use an external S3-compatible object storage service as the backend for the OGX OpenAI-compatible /v1/files endpoint. This configuration enables file upload, storage, and retrieval for retrieval-augmented generation (RAG) and document-based workflows by using existing enterprise object storage infrastructure.

5.5.1. External S3-compatible provider for the /v1/files endpoint

The OGX Files API supports two providers in OpenShift AI: the default inline::localfs provider, which stores files on the local file system of the OGX pod, and the remote::s3 provider, which stores file content in an external S3-compatible object storage service. Use the remote::s3 provider when you require scalable, durable storage for file content that is independent of the OGX pod lifecycle, or when you must integrate with enterprise-managed storage platforms.

Support level: Developer Preview.

The remote::s3 provider stores file content in an S3 bucket. File metadata, such as the file ID, filename, purpose, size, and timestamps, is stored in the PostgreSQL metadata store that OGX requires in OpenShift AI. The Files API metadata is managed automatically within the same PostgreSQL instance that holds metadata for the other OGX APIs. No additional database configuration is required.

The provider works with any object storage system that exposes an S3-compatible API, including the following examples:

  • Amazon S3
  • MinIO
  • Ceph Object Gateway (RGW)
  • Oracle Cloud Infrastructure (OCI) Object Storage, through the S3 Compatibility API

Because OGX interacts with the storage system only through the S3 API, any storage technology that implements that API is compatible.

Benefits

Using external S3-compatible storage for the /v1/files endpoint provides the following advantages:

  • Reuse existing, approved object storage services and governance controls for file content.
  • Scale file content storage independently of OGX compute resources.
  • Persist file content across OGX pod restarts and rescheduling events.
  • Centralize file content across multiple AI applications and clusters.
  • Meet compliance requirements by integrating with enterprise-managed storage platforms.

Compatibility and constraints

The remote::s3 provider has the following compatibility characteristics and constraints:

  • The provider is compatible with the OpenAI /v1/files API.
  • The inline::localfs provider remains supported and requires no configuration changes for existing deployments.
  • Migration from the inline::localfs provider to the remote::s3 provider is not performed automatically. Files stored by one provider are not available through the other.
  • The external storage system must implement a compatible S3 API. Provider-specific features that deviate from the S3 specification are not supported.
  • The provider uses the same AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables as the remote::bedrock inference provider. If you use both providers and require different credentials for each, use IAM roles for service accounts or a non-AWS S3-compatible backend.
  • Some S3-compatible backends require additional client-side environment variables, such as checksum-handling settings. Consult the documentation for your S3-compatible backend for the configuration that it requires.

For information about the limitations of the remote::s3 provider, see Limitations of the external S3-compatible files provider.

5.5.2. Creating secrets for the external S3-compatible files provider

To authenticate OGX to an external S3-compatible object storage service by using access keys, create a Kubernetes secret that contains the credentials.

If you intend to use IAM roles for service accounts (IRSA) or another short-lived credential mechanism, you can skip this procedure and configure the role binding for your OGX service account instead.

Prerequisites

  • You have access to the project where your OGX resources are deployed.
  • You have the access key ID and secret access key for your S3-compatible storage backend, or you have configured an IAM role for the OGX service account.

Procedure

  1. Log in to your OpenShift cluster from the CLI:

    $ oc login --token=<token> --server=<openshift_cluster_url>
  2. Create a secret that contains the S3 credentials:

    $ oc create secret generic s3-files-credentials \
      --from-literal=AWS_ACCESS_KEY_ID=<access_key_id> \
      --from-literal=AWS_SECRET_ACCESS_KEY=<secret_access_key> \
      -n <project>

    In the previous command, replace <access_key_id> with the access key ID for your S3-compatible storage backend. Replace <secret_access_key> with the secret access key for your S3-compatible storage backend. Replace <project> with the name of the project where the OGXServer resource is deployed.

Verification

  • The S3 credentials secret exists in the project. You can confirm this by running the following command:

    $ oc get secret s3-files-credentials -n <project>
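If you manage resources declaratively, the same secret can be expressed as a manifest. The sketch below prints a Secret equivalent to the oc create secret generic command above. The function name render_s3_secret is hypothetical, and the sketch assumes that the credential values contain no double quotes or backslashes, which would need YAML escaping.

```shell
# Print a Secret manifest equivalent to the `oc create secret generic` command.
# Arguments: <access_key_id> <secret_access_key> [secret_name]
# Assumption: the values contain no double quotes or backslashes.
render_s3_secret() {
  key_id="$1"
  secret_key="$2"
  secret_name="${3:-s3-files-credentials}"
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: ${secret_name}
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "${key_id}"
  AWS_SECRET_ACCESS_KEY: "${secret_key}"
EOF
}

# Example (not run here):
# render_s3_secret "<access_key_id>" "<secret_access_key>" | oc apply -n <project> -f -
```

Piping the output directly to oc apply avoids writing the credentials to a file on disk.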

5.5.3. Configuring the external S3-compatible provider for the /v1/files endpoint

To use external S3-compatible object storage as the backend for the OGX /v1/files endpoint, update your OGXServer custom resource (CR) to enable the remote::s3 provider and supply the required configuration through environment variables.

Prerequisites

  • You have deployed an OGX server and configured a PostgreSQL database for OGX metadata storage. For more information, see Deploying an OGX server. Metadata for the remote::s3 provider is stored in this same PostgreSQL instance automatically.
  • You have created a secret that contains your S3 credentials. For more information, see Creating secrets for the external S3-compatible files provider.
  • If your S3 endpoint uses a TLS certificate signed by a private certificate authority (CA), you have configured the OGX server to trust that CA. For more information, see Configuring a CA bundle for OGX.
  • You have the S3 configuration details, including the external S3 endpoint URL, the bucket name, and the region.
  • The S3 bucket exists. Bucket creation is the responsibility of a storage administrator. For development environments where automatic bucket creation is acceptable, see the optional S3_AUTO_CREATE_BUCKET field in the configuration example.
  • You have permission to edit custom resources in your project.

Procedure

  1. Log in to the OpenShift web console as a cluster administrator.
  2. From the Project list, select the project that contains your OGXServer CR.
  3. Update your OGXServer CR to enable the remote::s3 files provider and reference the secret that you created.

    1. Click Home → Search.
    2. From the Resources list, search for OGXServer and select it. The cluster also exposes an OGXOperator resource, which is an internal OpenShift AI resource that is managed by the Red Hat OpenShift AI Operator. Do not select OGXOperator.
    3. From the list of OGXServer instances, click the name of the instance that you want to update.
    4. Click the YAML tab.
    5. Update the resource to include the following fields, and then click Save:

      apiVersion: ogx.io/v1beta1
      kind: OGXServer
      metadata:
        name: my-ogx
        namespace: my-ogx-namespace
      spec:
        server:
          containerSpec:
            env:
              - name: ENABLE_S3  # (1)
                value: "true"
              - name: S3_BUCKET_NAME  # (2)
                value: "<bucket_name>"
              - name: AWS_DEFAULT_REGION  # (3)
                value: "<region>"
              - name: S3_ENDPOINT_URL  # (4)
                value: "<s3_endpoint_url>"
              - name: S3_AUTO_CREATE_BUCKET  # (5)
                value: "false"
              - name: AWS_ACCESS_KEY_ID  # (6)
                valueFrom:
                  secretKeyRef:
                    name: s3-files-credentials
                    key: AWS_ACCESS_KEY_ID
              - name: AWS_SECRET_ACCESS_KEY
                valueFrom:
                  secretKeyRef:
                    name: s3-files-credentials
                    key: AWS_SECRET_ACCESS_KEY
          name: ogx
          port: 8321
        distribution:
          name: rh-dev
      (1) Enables the remote::s3 files provider.
      (2) Specifies the name of the S3 bucket where files are stored. On AWS S3, bucket names must be globally unique.
      (3) Specifies the region for the S3 bucket, for example, us-east-1. For non-AWS backends, set this value to match the region configuration of your backend, if it requires one. The default of us-east-1 is appropriate for many S3-compatible backends, including MinIO.
      (4) Optional: Specifies the S3 endpoint URL for S3-compatible backends other than AWS S3. Omit this field when using AWS S3. Set this field for MinIO, Ceph Object Gateway, OCI Object Storage, or other S3-compatible backends.
      (5) Optional: When set to "true", allows the provider to create the bucket if it does not exist. Requires the s3:CreateBucket IAM permission. Leave unset or set to "false" in production environments.
      (6) Specifies the S3 credentials, sourced from the secret that you created. Omit AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY if the OGX service account uses an IAM role.
Note

For production deployments, Red Hat recommends authenticating to AWS S3 by using IAM roles for service accounts (IRSA) rather than static access keys. When you use an IAM role, omit the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables, and bind the IAM role to the service account used by the OGX pod.

Note

The remote::s3 provider reads the same AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables as the remote::bedrock inference provider. If you have configured both providers and require different credentials for each, use IAM roles for service accounts or a non-AWS S3-compatible backend.

Verification

  • The OGX pod restarts and reaches the Running state.
  • The pod logs show the resolved OGX configuration containing a Files provider entry with provider_id: s3 and provider_type: remote::s3. You can also confirm provider registration by sending a GET request to the /v1/providers endpoint of the OGX API and verifying that an entry with "api": "files" and "provider_id": "s3" appears in the response.
  • If the configured bucket does not exist and S3_AUTO_CREATE_BUCKET is set to false, the OGX pod enters CrashLoopBackOff. The pod logs include the following error: RuntimeError: S3 bucket '<bucket_name>' does not exist. Either create the bucket manually or set 'auto_create_bucket: true' in your configuration.
  • File operations against the /v1/files endpoint complete successfully, confirming that OGX can communicate with the S3-compatible backend. For more information, see Using the /v1/files endpoint with external S3-compatible storage.
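When the pod enters CrashLoopBackOff, the missing-bucket case can be picked out of the logs mechanically. The following is a minimal sketch that assumes the error text matches the message quoted in the verification list. The function name missing_bucket_from_logs is hypothetical.

```shell
# Read pod log text on stdin and print the name of the first bucket that
# the remote::s3 provider reports as missing. Returns 1 if none is found.
missing_bucket_from_logs() {
  bucket=$(sed -n "s/.*RuntimeError: S3 bucket '\([^']*\)' does not exist.*/\1/p" | head -n 1)
  [ -n "$bucket" ] || return 1
  printf '%s\n' "$bucket"
}

# Example (not run here); --previous reads the logs of the crashed container:
# oc logs <pod_name> -n <project> --previous | missing_bucket_from_logs
```

If the function prints a bucket name, create that bucket or correct the S3_BUCKET_NAME value. If it prints nothing, the restart has a different cause and the full log needs to be read.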

5.5.4. Using the /v1/files endpoint with external S3-compatible storage

After you configure the remote::s3 provider, you can manage files by using the OpenAI-compatible /v1/files endpoint. File content is stored in your S3 bucket, and file metadata is stored in the OGX PostgreSQL metadata store.

Prerequisites

  • You have configured the remote::s3 provider for the /v1/files endpoint. For more information, see Configuring the external S3-compatible provider for the /v1/files endpoint.
  • The OGX pod is running and the remote::s3 provider has initialized successfully.
  • You have the curl command-line tool installed in the environment from which you run the procedure.
  • You have the service or route URL for the OGX API endpoint, without the /v1 suffix.
  • You have the API token required to access the OGX endpoint, if your deployment requires authentication.

Procedure

  1. Open a terminal session in an environment that can reach the OGX API. The OGX service is reachable in the following ways:

    In-cluster access

    Run the commands from a pod in the same project as the OGX service, or from a workbench that has network access to the OGX service. The OGX Operator names the service <distribution-name>-service. Use that name with port 8321, for example:

    OGX_URL="http://ogx-service:8321"
    External access

    Expose the OGX service by creating a route, and then run the commands from your local workstation. Use the route URL, for example:

    OGX_URL="https://ogx-route.example.com"

    For more information about the correct base URL format and how to find the URL for your deployment, see OpenAI compatibility for RAG APIs in OGX.

  2. Set the OGX_URL environment variable to the URL of your OGX service or route, without the /v1 suffix. The example commands in this procedure append /v1 as part of the endpoint path.
  3. If your deployment requires authentication, set the OGX_TOKEN environment variable to your API token:

    $ export OGX_TOKEN="<api_token>"
  4. Upload a file. The purpose form field is required:

    $ curl -X POST \
      "${OGX_URL}/v1/files" \
      -H "Authorization: Bearer ${OGX_TOKEN}" \
      -F purpose="assistants" \
      -F file="@<path_to_file>"

    In the previous command, replace <path_to_file> with the local path to the file that you want to upload.

    The output is similar to the following example:

    {
      "object": "file",
      "id": "file-53b3fd75ca2c421c9a292ac63ff924ce",
      "bytes": 16,
      "created_at": 1778608944,
      "expires_at": null,
      "filename": "test.txt",
      "purpose": "assistants"
    }

    Note the value of the id field. You use this identifier in subsequent operations to refer to the file. The expires_at field is null for files that do not expire, which is the default behavior for files uploaded with purpose="assistants".

  5. List files:

    $ curl -X GET \
      "${OGX_URL}/v1/files" \
      -H "Authorization: Bearer ${OGX_TOKEN}"

    The output is similar to the following example:

    {
      "data": [
        {
          "object": "file",
          "id": "file-53b3fd75ca2c421c9a292ac63ff924ce",
          "bytes": 16,
          "created_at": 1778608944,
          "expires_at": null,
          "filename": "test.txt",
          "purpose": "assistants"
        }
      ],
      "has_more": false,
      "first_id": "file-53b3fd75ca2c421c9a292ac63ff924ce",
      "last_id": "file-53b3fd75ca2c421c9a292ac63ff924ce",
      "object": "list"
    }

    When no files are present, the data array is empty and the first_id and last_id fields contain empty strings.

  6. Retrieve file metadata:

    $ curl -X GET \
      "${OGX_URL}/v1/files/<file_id>" \
      -H "Authorization: Bearer ${OGX_TOKEN}"

    In the previous command, replace <file_id> with the identifier of the file, which is returned by the upload or list operation.

    The output is similar to the following example:

    {
      "object": "file",
      "id": "file-53b3fd75ca2c421c9a292ac63ff924ce",
      "bytes": 16,
      "created_at": 1778608944,
      "expires_at": null,
      "filename": "test.txt",
      "purpose": "assistants"
    }
  7. Retrieve file content:

    $ curl -X GET \
      "${OGX_URL}/v1/files/<file_id>/content" \
      -H "Authorization: Bearer ${OGX_TOKEN}" \
      -o <output_path>

    In the previous command, replace <output_path> with the local path where the downloaded file content is saved.

    The endpoint returns the raw bytes of the file. The command does not produce console output when you use the -o option. Confirm that the file was downloaded successfully by checking that the output file exists and has the expected size:

    $ ls -l <output_path>

    The reported file size matches the bytes value returned by the metadata operation.

  8. Delete a file:

    $ curl -X DELETE \
      "${OGX_URL}/v1/files/<file_id>" \
      -H "Authorization: Bearer ${OGX_TOKEN}"

    The output is similar to the following example:

    {
      "id": "file-53b3fd75ca2c421c9a292ac63ff924ce",
      "object": "file",
      "deleted": true
    }

    The deleted field is set to true, confirming that OGX has removed both the file metadata and the underlying S3 object.

Verification

  • Uploaded files appear in the response from the list operation, with an entry in the data array whose id field matches the id returned by the upload operation, and whose filename, bytes, and purpose fields match the values that you supplied at upload time.
  • Delete operations return a response with "deleted": true, and a subsequent list operation does not include the deleted file in the data array. The corresponding object is also removed from the underlying S3 bucket.
  • The OGX pod logs record each operation as an HTTP access entry. For example, a successful upload appears in the logs as "POST /v1/files HTTP/1.1" 200.
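The size comparison from the file content step can also be scripted instead of reading the ls -l output by eye. This is a minimal sketch. The function name verify_download_size is hypothetical, and the expected size is the bytes value that you copy from the file metadata response.

```shell
# Compare the size of a downloaded file with the `bytes` value from the
# file metadata. Arguments: <output_path> <expected_bytes>
verify_download_size() {
  file_path="$1"
  expected="$2"
  if [ ! -f "$file_path" ]; then
    echo "error: ${file_path} does not exist" >&2
    return 1
  fi
  # Arithmetic expansion strips the padding that some wc implementations add.
  actual=$(( $(wc -c < "$file_path") ))
  if [ "$actual" -ne "$expected" ]; then
    echo "error: size mismatch: got ${actual} bytes, expected ${expected}" >&2
    return 1
  fi
  echo "ok: ${file_path} is ${actual} bytes"
}

# Example: verify_download_size <output_path> 16
```

A size match does not prove that the content is identical, but a mismatch is a quick signal that the download was truncated or that an error body was saved in place of the file.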

Additional resources

  • To make uploaded files available for retrieval-augmented generation (RAG) workflows, you must associate the file with a vector store by using the /v1/vector_stores/{vector_store_id}/files endpoint. For more information, see Ingesting content into a Llama model.

5.5.5. About the IAM policy for external S3-compatible files providers

The remote::s3 provider requires a minimum set of permissions on the S3 bucket that it uses. Grant only the permissions required for /v1/files operations, and avoid using credentials that provide access to multiple buckets or accounts. The following examples show least-privilege IAM policies that you can use as a starting point for AWS S3 deployments. Adapt the policies to your S3-compatible backend’s access control mechanism as required.

Required permissions

The remote::s3 provider requires the following permissions on the bucket:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket_name>",
        "arn:aws:s3:::<bucket_name>/*"
      ]
    }
  ]
}

In the previous policy, replace <bucket_name> with the name of the S3 bucket that the remote::s3 provider uses.

Additional permission for automatic bucket creation

If S3_AUTO_CREATE_BUCKET is set to true, the provider also requires the s3:CreateBucket permission. Red Hat recommends pre-creating the bucket administratively rather than granting this additional permission to the workload.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket",
        "s3:CreateBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket_name>",
        "arn:aws:s3:::<bucket_name>/*"
      ]
    }
  ]
}
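To avoid editing the placeholder by hand in two places, the policy can be generated from the bucket name. The sketch below prints the same statements as the two policies above. The function name render_files_policy and its optional second argument are hypothetical conveniences, not part of any AWS or OGX tooling.

```shell
# Print the least-privilege policy for one bucket.
# Arguments: <bucket_name> [allow_create]
# Pass "true" as the second argument to add s3:CreateBucket.
render_files_policy() {
  bucket="$1"
  allow_create="${2:-false}"
  if [ -z "$bucket" ]; then
    echo "usage: render_files_policy <bucket_name> [true|false]" >&2
    return 1
  fi
  actions='"s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"'
  if [ "$allow_create" = "true" ]; then
    actions="${actions}, \"s3:CreateBucket\""
  fi
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [${actions}],
      "Resource": [
        "arn:aws:s3:::${bucket}",
        "arn:aws:s3:::${bucket}/*"
      ]
    }
  ]
}
EOF
}

# Example: render_files_policy my-files > policy.json
```

Leaving the second argument unset yields the policy without s3:CreateBucket, which matches the recommendation to pre-create the bucket.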

Additional security guidance

When you configure the remote::s3 provider, apply the following recommended security practices:

  • Store access credentials in Kubernetes secrets, and restrict access to the project where the OGXServer resource is deployed.
  • Use TLS to secure communication with the S3 endpoint. For S3 endpoints that use private CAs, configure the OGX server to trust the CA by using the operator’s TLS configuration. For more information, see Configuring a CA bundle for OGX.
  • For production deployments, use IAM roles for service accounts (IRSA) instead of static access keys. IAM roles provide short-lived credentials and remove the need to store long-lived secrets in the cluster.
  • Rotate access keys regularly when static credentials are used.
  • Apply a bucket policy that restricts access to the specific principal that runs the OGX workload.
  • OGX enforces access policies at the file metadata layer. File visibility and access through the /v1/files API is governed by the access policies that you configure on the OGX server, in addition to the IAM permissions and bucket policies that you configure on the S3 backend.
  • The remote::s3 provider does not enable server-side encryption on uploaded objects. If you require encryption at rest, configure default server-side encryption on the S3 bucket.

5.5.6. Limitations of the external S3-compatible files provider

The following limitations apply to the remote::s3 provider for the OGX /v1/files endpoint in OpenShift AI.

File expiration
By default, uploaded files do not expire. To set a per-file expiration, specify the expires_after field at upload time. Files that are uploaded with the batch purpose expire 30 days after upload.
Server-side encryption
The provider does not enable server-side encryption on uploaded objects. If you require encryption at rest, configure default server-side encryption on the S3 bucket at the storage layer.
AWS session tokens
AWS session tokens are not supported. The provider accepts only long-lived access keys or IAM roles for service accounts (IRSA).
S3 key prefixes
The provider does not support organizing files under an S3 key prefix. All file objects are stored at the root of the bucket. To isolate files for different workloads, use separate buckets.
Multipart uploads
The provider does not support S3 multipart uploads. All files are uploaded as single objects. Very large files might fail to upload, take longer to upload than they would with multipart upload, or require additional pod memory.
Streaming downloads
The provider loads file content into memory before returning it to the client. Very large file downloads can require significant pod memory.
S3 addressing style
The provider uses the AWS SDK’s default S3 addressing style. Some S3-compatible backends, such as on-premises Ceph deployments without virtual-host DNS configuration, require path-style addressing. For these backends, configure path-style addressing on the backend or in the network layer rather than on the provider.
Automatic bucket creation
By default, the provider expects the S3 bucket to exist before the provider starts. If the bucket does not exist and S3_AUTO_CREATE_BUCKET is not set to true, the OGX server logs an error that names the missing bucket. Red Hat recommends pre-creating the bucket administratively rather than enabling automatic bucket creation.
Shared AWS credentials
The remote::s3 provider reads the same AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables as the remote::bedrock inference provider. If you use both providers and require different credentials for each, use IAM roles for service accounts or a non-AWS S3-compatible backend.

5.6. Configuring OGX with OAuth authentication

You can configure OGX to use role-based access control (RBAC) for model access with OAuth authentication on OpenShift AI. The following example shows how to configure OGX so that all authenticated users can access a vLLM model, while only specific users can access an OpenAI model. This example uses Keycloak to issue and validate tokens.

This procedure assumes that the Keycloak server is available at https://my-keycloak-server.com.

Important

When you access OGX APIs, the required base URL depends on the client that you use.

  • For OpenAI-compatible clients or raw HTTP requests, include the /v1 path suffix in the base URL.

    For example, http://ogx-service:8321/v1

  • For the OGXClient SDK, do not include the /v1 path suffix in the base URL.

    For example, http://ogx-service:8321

If you use an incorrect base URL, requests fail.
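
The base URL rule above can be captured in a small helper. The following Python sketch is illustrative only: the base_url_for function and its client_kind values are hypothetical names introduced for this example and are not part of any OGX SDK.

```python
def base_url_for(client_kind: str, server_url: str) -> str:
    """Return the base URL that a given client type expects.

    client_kind is "openai" for OpenAI-compatible clients or raw HTTP
    requests, and "sdk" for the OGXClient SDK (hypothetical labels).
    """
    root = server_url.rstrip("/")
    # Normalize input that already ends in /v1 so the suffix is never doubled.
    if root.endswith("/v1"):
        root = root[: -len("/v1")]
    if client_kind == "openai":
        return root + "/v1"
    if client_kind == "sdk":
        return root
    raise ValueError(f"unknown client kind: {client_kind!r}")


print(base_url_for("openai", "http://ogx-service:8321"))
print(base_url_for("sdk", "http://ogx-service:8321/v1"))
```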

Prerequisites

  • You have installed OpenShift 4.19 or later.
  • You have logged in to Red Hat OpenShift AI.
  • You have cluster-admin privileges for your OpenShift cluster.
  • You have a Keycloak instance configured with the following settings:

    • Realm: ogx-demo
    • Client: ogx with direct access grants enabled
    • Role: inference_max
    • A protocol mapper that adds realm roles to the access token under the ogx_roles claim
    • Two test users:

      • user1 with no assigned roles
      • user2 assigned the inference_max role
  • You have saved the Keycloak client secret for token requests.
  • Your Keycloak server is reachable at https://my-keycloak-server.com.
  • You have installed the OpenShift CLI (oc) as described in the documentation for your cluster.

Procedure

  1. To configure OGX to use role-based access control (RBAC) for model access, view and verify the OAuth provider token structure.

    1. Generate a Keycloak test token by running the following command:

      $ curl -d client_id=ogx -d client_secret=YOUR_CLIENT_SECRET -d username=user1 -d password=user-password -d grant_type=password https://my-keycloak-server.com/realms/ogx-demo/protocol/openid-connect/token | jq -r .access_token > test.token
    2. View the token claims by running the following command:

      $ cat test.token | cut -d . -f 2 | base64 -d 2>/dev/null | jq .

    Example token structure from Keycloak for a user that is assigned the inference_max role

    {
      "iss": "https://my-keycloak-server.com/realms/ogx-demo",
      "aud": "account",
      "sub": "761cdc99-80e5-4506-9b9e-26a67a8566f7",
      "preferred_username": "user2",
      "ogx_roles": [
        "inference_max"
      ]
    }

  2. Update your existing run.yaml file to add the OAuth parameters.

    Example OAuth parameters in the run.yaml file

    server:
      port: 8321
      auth:
        provider_config:
          type: "oauth2_token"
          jwks:
            uri: "https://my-keycloak-server.com/realms/ogx-demo/protocol/openid-connect/certs" 1
            key_recheck_period: 3600
          issuer: "https://my-keycloak-server.com/realms/ogx-demo" 2
          audience: "account"
          verify_tls: true
          claims_mapping:
            ogx_roles: "roles" 3
        access_policy:
          - permit: 4
              actions: [read]
              resource: model::vllm-inference/llama-3-2-3b
            description: Allow all authenticated users to access the Llama 3.2 model
          - permit: 5
              actions: [read]
              resource: model::openai/gpt-4o-mini
            when: user with inference_max in roles
            description: Allow only users with the inference_max role to access OpenAI models

    1 2
    Specify your Keycloak host and realm in the URL.
    3
    Maps the ogx_roles claim from the token to the roles field.
    4
    Allows all authenticated users to read the vllm-inference/llama-3-2-3b model.
    5
    Restricts the openai/gpt-4o-mini model to users with the inference_max role.
  3. Create a ConfigMap that uses the updated run.yaml configuration by running the following command:

    $ oc create configmap ogx-custom-config --from-file=run.yaml=run.yaml -n redhat-ods-operator
  4. Create an ogx-server.yaml file with the following content:

    apiVersion: ogx.io/v1beta1
    kind: OGXServer
    metadata:
      name: ogx-server
      namespace: redhat-ods-operator
    spec:
      distribution:
        name: rh-dev
      workload:
        replicas: 1
        overrides:
          env:
            # vLLM provider configuration
            - name: VLLM_URL
              value: "https://your-vllm-service:8000/v1"
            - name: VLLM_API_TOKEN
              value: "your-vllm-token"
            - name: VLLM_TLS_VERIFY
              value: "false"
            # OpenAI provider configuration
            - name: OPENAI_API_KEY
              value: "your-openai-api-key"
            - name: OPENAI_BASE_URL
              value: "https://api.openai.com/v1"
      userConfig:
        configMapName: ogx-custom-config
        configMapNamespace: redhat-ods-operator
  5. Apply the distribution by running the following command:

    $ oc apply -f ogx-server.yaml
  6. Wait for the distribution to be ready by running the following command:

    $ oc wait --for=jsonpath='{.status.phase}'=Ready ogxserver/ogx-server -n redhat-ods-operator --timeout=300s
  7. Generate OAuth tokens for each user account to authenticate API requests.

    1. To request a basic access token and save it to a user1.token file, run the following command:

      $ curl -d client_id=ogx \
        -d client_secret=YOUR_CLIENT_SECRET \
        -d username=user1 \
        -d password=user1-password \
        -d grant_type=password \
        https://my-keycloak-server.com/realms/ogx-demo/protocol/openid-connect/token \
        | jq -r .access_token > user1.token
    2. To request a token for the privileged user and save it to a user2.token file, run the following command:

      $ curl -d client_id=ogx \
        -d client_secret=YOUR_CLIENT_SECRET \
        -d username=user2 \
        -d password=user2-password \
        -d grant_type=password \
        https://my-keycloak-server.com/realms/ogx-demo/protocol/openid-connect/token \
        | jq -r .access_token > user2.token
    3. Verify the token claims by running the following command:

      $ cat user2.token | cut -d . -f 2 | base64 -d 2>/dev/null | jq .
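
JWT segments are base64url-encoded without padding, so a plain base64 decoder can reject or truncate some tokens. As an alternative way to inspect the claims, the following Python sketch restores the padding before decoding. It is a minimal inspection aid that does not verify the token signature, and the sample token that it builds is fabricated for illustration.

```python
import base64
import json


def decode_jwt_claims(token: str) -> dict:
    """Decode the claims (payload) segment of a JWT without verifying it."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload = parts[1]
    # JWT segments are base64url-encoded with the padding removed; restore it.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def _b64url(data: dict) -> str:
    """Encode a dict the way a JWT segment is encoded (test helper)."""
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Fabricated, unsigned sample token that mimics the claims used in this procedure.
sample_token = ".".join([
    _b64url({"alg": "none"}),
    _b64url({"preferred_username": "user2", "ogx_roles": ["inference_max"]}),
    "signature",
])
print(decode_jwt_claims(sample_token))
```

To inspect a real token, pass the file content instead, for example decode_jwt_claims(open("user2.token").read()).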

Verification

  1. Set the OGX service URL:

    $ export OGX_HOST="http://<ogx-host>:8321"
  2. Verify basic access for user1, who has no privileged roles.

    Load the token:

    $ USER1_TOKEN=$(cat user1.token)

    Confirm that user1 can access the vLLM-served model:

    $ curl -s -o /dev/null -w "%{http_code}\n" \
      -X POST "${OGX_HOST}/v1/openai/chat/completions" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ${USER1_TOKEN}" \
      -d '{"model":"vllm-inference/llama-3-2-3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'

    Expected result: HTTP 200.

    Confirm that user1 is denied access to the restricted OpenAI model:

    $ curl -s -o /dev/null -w "%{http_code}\n" \
      -X POST "${OGX_HOST}/v1/openai/chat/completions" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ${USER1_TOKEN}" \
      -d '{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'

    Expected result: HTTP 403.

  3. Verify privileged access for user2, who is assigned the inference_max role.

    Load the token:

    $ USER2_TOKEN=$(cat user2.token)

    Confirm that user2 can access both models:

    $ curl -s -o /dev/null -w "%{http_code}\n" \
      -X POST "${OGX_HOST}/v1/openai/chat/completions" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ${USER2_TOKEN}" \
      -d '{"model":"vllm-inference/llama-3-2-3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'
    $ curl -s -o /dev/null -w "%{http_code}\n" \
      -X POST "${OGX_HOST}/v1/openai/chat/completions" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ${USER2_TOKEN}" \
      -d '{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'

    Expected result: HTTP 200 for both requests.

  4. Verify that requests without a Bearer token are denied.

    $ curl -s -o /dev/null -w "%{http_code}\n" \
      -X POST "${OGX_HOST}/v1/openai/chat/completions" \
      -H "Content-Type: application/json" \
      -d '{"model":"vllm-inference/llama-3-2-3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'

    Expected result: HTTP 401.
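
The expected results in the verification steps can be summarized as a decision table. The following Python sketch is a simplified model of the two access_policy rules from the example run.yaml file. It is not the policy engine that OGX uses, it covers only the two models from this example, and the expected_status function is a hypothetical name.

```python
from typing import List, Optional

# Role that is required to read each model in the example policy.
# None means that any authenticated user is permitted.
REQUIRED_ROLE = {
    "vllm-inference/llama-3-2-3b": None,
    "openai/gpt-4o-mini": "inference_max",
}


def expected_status(roles: Optional[List[str]], model: str) -> int:
    """Return the HTTP status that the verification steps expect.

    roles is None for a request without a Bearer token; otherwise it is the
    list of roles mapped from the ogx_roles claim.
    """
    if roles is None:
        return 401  # no token: the request is unauthenticated
    required = REQUIRED_ROLE[model]
    if required is None or required in roles:
        return 200  # a permit rule matches
    return 403  # authenticated, but no permit rule matches


callers = {"user1": [], "user2": ["inference_max"], "no token": None}
for caller, roles in callers.items():
    for model in REQUIRED_ROLE:
        print(caller, model, expected_status(roles, model))
```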

5.7. Configuring attribute-based access control (ABAC) on your OGX server

OGX supports OAuth 2.0/OIDC authentication with attribute-based access control (ABAC) for multi-tenant isolation. ABAC access policies evaluate attributes of the user and of the requested resource. When ABAC is enabled, users can access only the resources that they own, and system resources are readable by all authenticated users.

The following procedure describes how to enable attribute-based access control (ABAC) policies in your OGX distribution.

Prerequisites

  • You have installed OpenShift 4.19 or later.
  • You have logged in to Red Hat OpenShift AI.
  • You have cluster administrator privileges for your OpenShift cluster.
  • You have access to an OAuth 2.0/OIDC identity provider, for example, a Keycloak provider.

Procedure

  1. Set the AUTH_* environment variables in the OGXServer custom resource (CR). For example:

    Example OGXServer CR

    spec:
      workload:
        replicas: 1
        overrides:
          env:
    ...
            - name: AUTH_ISSUER
              value: https://keycloak-redhat-ods-applications.apps.rosa.<user-cluster>.gm8d.p3.openshiftapps.com/realms/ogx-demo
            - name: AUTH_JWKS_URI
              value: http://keycloak:8080/realms/ogx-demo/protocol/openid-connect/certs

    The server.auth section of the config.yaml file reads these environment variables. The following configuration uses OAuth2 token validation:

    server:
      auth:
        provider_config:
          type: ${env.AUTH_ISSUER:+oauth2_token}
          audience: ${env.AUTH_AUDIENCE:=ogx}
          issuer: ${env.AUTH_ISSUER:=}
          jwks:
            uri: ${env.AUTH_JWKS_URI:=}
            key_recheck_period: ${env.AUTH_JWKS_RECHECK_PERIOD:=3600}
          verify_tls: ${env.AUTH_VERIFY_TLS:=true}

    Table 5.1. Environment variables reference

    Variable                   Description                                                              Default
    AUTH_ISSUER                OpenID Connect (OIDC) issuer URL. If unset, authentication is disabled.  None
    AUTH_AUDIENCE              Expected token audience.                                                 ogx
    AUTH_JWKS_URI              JSON Web Key Set (JWKS) endpoint for token validation.                   None
    AUTH_JWKS_RECHECK_PERIOD   How often, in seconds, to refresh JWKS keys.                             3600
    AUTH_VERIFY_TLS            Whether to verify TLS when fetching JWKS.                                true

  2. Include a valid JWT bearer token in each client request. For example:

    $ curl -H "Authorization: Bearer <token>" \
      https://ogx.example.com/v1/models
  3. Review the default access policy that ships with the OGX distribution:

    access_policy:
      - permit:
          actions: [read]
        when: resource is unowned
        description: "All users can read system resources"
      - permit:
          actions: [create]
        description: "Authenticated users can create resources"
      - permit:
          actions: [read, update, delete]
        when: user is owner
        description: "Owners can manage their own resources"

    You can modify these rules or add custom rules to control access to resources.

    The default policy results in the following behavior:

    • System resources are readable by all authenticated users: Resources without an owner, such as models, shields, and benchmarks that are registered in the configuration, can be read by any authenticated user.
    • Any authenticated user can create resources: Users can create their own vector databases, files, datasets, conversations, and other resources.
    • Users can manage only their own resources: Read, update, and delete operations on owned resources are restricted to the resource owner.

    This access policy applies to user-created resources, including vector databases, files, datasets, conversations, responses, and agents. System resources that are registered in the config.yaml file do not have an owner and are readable by all authenticated users.
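
The effect of the three default rules can be sketched as a small decision function. The following Python example is a simplified model under stated assumptions: it assumes that a request is denied when no rule permits it, it treats every caller as already authenticated, and the Resource class and is_permitted function are hypothetical names that are not part of OGX.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    name: str
    owner: Optional[str] = None  # None models a system (unowned) resource


def is_permitted(user: str, action: str, resource: Resource) -> bool:
    """Check an authenticated user's action against the three default rules."""
    # Rule 1: all users can read unowned (system) resources.
    if action == "read" and resource.owner is None:
        return True
    # Rule 2: authenticated users can create resources.
    if action == "create":
        return True
    # Rule 3: owners can read, update, and delete their own resources.
    if action in {"read", "update", "delete"} and resource.owner == user:
        return True
    # Assumption of this sketch: no matching rule means access is denied.
    return False


system_model = Resource("vllm-inference/llama-3-2-3b")
alice_store = Resource("alice-vector-db", owner="alice")

print(is_permitted("bob", "read", system_model))    # system resource
print(is_permitted("bob", "read", alice_store))     # owned by another user
print(is_permitted("alice", "delete", alice_store)) # owned by the caller
```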

5.8. About using self-signed certificates with OGX

You can configure an OGXServer custom resource (CR) to trust certificates that are issued by self-signed or private Certificate Authorities (CAs). This configuration enables the OGX server to establish secure TLS connections to external inference, embedding, or vector store providers.

To configure a custom CA bundle, you reference a config map that contains the CA certificates from the spec.server.tlsConfig.caBundle field of the CR. The OGX Operator validates the certificates, mounts a concatenated bundle into the OGX server pod, and sets the SSL_CERT_FILE environment variable so that TLS clients in the server trust the bundle automatically.

Important

When you configure or change the CA bundle for an OGXServer CR, the OGX Operator restarts the OGX server pod so that the new certificates take effect. Plan for a brief service interruption when you apply or update the CA bundle on an OGXServer CR that is serving production traffic.

For the procedure and the OGX Operator processing details, see Configuring a CA bundle for OGX in Installing and uninstalling Red Hat OpenShift AI.
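
To make the idea of a concatenated bundle concrete, the following Python sketch joins the PEM certificate blocks from several config map keys into one bundle. It is an illustration under stated assumptions and not the OGX Operator implementation: it checks only PEM framing rather than certificate validity, and the key names and certificate bodies are placeholders.

```python
import re

# Matches one PEM-framed certificate block, including its BEGIN and END lines.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.+?-----END CERTIFICATE-----",
    re.DOTALL,
)


def build_ca_bundle(config_map_data: dict) -> str:
    """Concatenate every PEM certificate block found in the config map values."""
    blocks = []
    for key in sorted(config_map_data):
        found = PEM_CERT.findall(config_map_data[key])
        if not found:
            raise ValueError(f"key {key!r} contains no PEM certificate block")
        blocks.extend(block.strip() for block in found)
    return "\n".join(blocks) + "\n"


# Placeholder block; a real certificate body is base64-encoded DER data.
fake_cert = "-----BEGIN CERTIFICATE-----\nPLACEHOLDER\n-----END CERTIFICATE-----\n"
bundle = build_ca_bundle({
    "root-ca.crt": fake_cert,
    "intermediate-ca.crt": fake_cert + fake_cert,
})
print(bundle.count("BEGIN CERTIFICATE"))
```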

5.9. Enabling high availability and autoscaling for OGX

You can configure OGX servers to remain available if a pod restarts, an application crashes, or node maintenance occurs. You can also enable autoscaling to adjust server capacity automatically based on resource usage. This procedure shows how to configure high availability and autoscaling for OGX server pods by using the OGXServer custom resource.

Prerequisites

  • You have installed OpenShift 4.19 or later.
  • You have logged in to Red Hat OpenShift AI.
  • You have cluster administrator privileges for your OpenShift cluster.
  • You have activated the OGX Operator in OpenShift AI. For more information, see Activating the OGX Operator.
  • You have installed the OpenShift CLI (oc) as described in the documentation for your cluster.

Procedure

  1. To enable high availability for your OGX server, add the following parameters to your OGXServer CR:

    spec:
      replicas: 2 1
      server:
        podDisruptionBudget:
          maxUnavailable: 1 2
        topologySpreadConstraints: 3
          - maxSkew: 1 4
            topologyKey: topology.kubernetes.io/zone 5
            whenUnsatisfiable: ScheduleAnyway 6
            labelSelector:
              matchLabels:
                app.kubernetes.io/instance: ogxserver-sample 7
    1
    Runs two OGX pods for high availability.
    2
    Specifies voluntary disruption tolerance. This configuration keeps at least one server pod available during voluntary disruptions.
    3
    Specifies how matching pods are spread across the cluster topology.
    4
    Instructs the scheduler to minimize replica imbalance across zones. With two replicas, the scheduler attempts to place one pod per zone.
    5
    Uses the node zone label as the failure domain for pod spreading.
    6
    Allows scheduling to proceed even if spread constraints cannot be fully satisfied.
    7
    Ensures that only pods from the same application instance are considered when calculating spread.
  2. To enable autoscaling for your OGX server, add the following parameters to your OGXServer CR:

    spec:
      server:
        autoscaling: 1
          minReplicas: 1 2
          maxReplicas: 5 3
          targetCPUUtilizationPercentage: 75 4
          targetMemoryUtilizationPercentage: 70 5
    1
    Configures a HorizontalPodAutoscaler (HPA) for the server pods.
    2
    Specifies the minimum number of replicas maintained by the HPA.
    3
    Specifies the maximum number of replicas maintained by the HPA.
    4
    Enables CPU-based scaling with a target average utilization of 75%.
    5
    Enables memory-based scaling with a target average utilization of 70%.
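
To see how the autoscaling values interact, the following Python sketch applies the replica calculation that the Kubernetes documentation describes for the HorizontalPodAutoscaler: the desired count is the current count scaled by the ratio of observed to target utilization, rounded up, and the largest proposal across metrics is used. The sketch uses the minimum, maximum, and target values from the example above. It is a simplification that ignores the HPA tolerance band, stabilization windows, and pod readiness handling, and the function names are hypothetical.

```python
import math

MIN_REPLICAS = 1
MAX_REPLICAS = 5
TARGET_CPU = 75     # targetCPUUtilizationPercentage
TARGET_MEMORY = 70  # targetMemoryUtilizationPercentage


def proposal(current_replicas: int, observed: float, target: float) -> int:
    """Replica count proposed for one metric: ceil(current * observed / target)."""
    return math.ceil(current_replicas * observed / target)


def desired_replicas(current_replicas: int, cpu: float, memory: float) -> int:
    """Take the largest per-metric proposal and clamp it to the configured range."""
    largest = max(
        proposal(current_replicas, cpu, TARGET_CPU),
        proposal(current_replicas, memory, TARGET_MEMORY),
    )
    return max(MIN_REPLICAS, min(MAX_REPLICAS, largest))


# Two replicas at 90% average CPU and 50% average memory utilization.
print(desired_replicas(2, cpu=90, memory=50))
```

Because the largest proposal is used, a workload that is busy on either CPU or memory scales out, and it scales in only when both metrics are below their targets.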

Legal Notice

Copyright © Red Hat.
Except as otherwise noted below, the text of and illustrations in this documentation are licensed by Red Hat under the Creative Commons Attribution–Share Alike 3.0 Unported license. If you distribute this document or an adaptation of it, you must provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, the Red Hat logo, JBoss, Hibernate, and RHCE are trademarks or registered trademarks of Red Hat, LLC. or its subsidiaries in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
XFS is a trademark or registered trademark of Hewlett Packard Enterprise Development LP or its subsidiaries in the United States and other countries.
The OpenStack® Word Mark and OpenStack logo are trademarks or registered trademarks of the Linux Foundation, used under license.
All other trademarks are the property of their respective owners.