Managing OpenShift AI

Red Hat OpenShift AI Self-Managed 2.16

Cluster administrator tasks for managing OpenShift AI

Abstract

As an OpenShift cluster administrator, manage OpenShift AI users and groups, the dashboard interface and applications, deployment resources, accelerators, distributed workloads, and data backup.

Preface

As an OpenShift cluster administrator, you can manage the following Red Hat OpenShift AI resources:

Users and groups
The dashboard interface, including the visibility of navigation menu options
Applications that show in the dashboard
Custom deployment resources that are related to the Red Hat OpenShift AI Operator, for example, CPU and memory limits and requests
Accelerators
Distributed workloads
Data backup

Chapter 1. Managing users and groups

Users with cluster administrator access to OpenShift can add, modify, and remove user permissions for Red Hat OpenShift AI.

1.1. Overview of user types and permissions

Table 1.1 describes the Red Hat OpenShift AI user types.

Table 1.1. User types

User Type	Permissions
Users	Machine learning operations (MLOps) engineers and data scientists can access and use individual components of Red Hat OpenShift AI, such as workbenches and data science pipelines. See also Accessing the OpenShift AI dashboard.
Administrators	In addition to the actions permitted to users, administrators can perform these actions: Configure Red Hat OpenShift AI settings. Access and manage notebook servers. Access and manage data science pipeline applications for any data science project.

By default, all OpenShift users have access to Red Hat OpenShift AI. In addition, users in the OpenShift administrator group (cluster admins) automatically have administrator access in OpenShift AI.

Optionally, if you want to restrict access to your OpenShift AI deployment to specific users or groups, you can create user groups for users and administrators.

If you decide to restrict access, and you already have groups defined in your configured identity provider, you can add these groups to your OpenShift AI deployment. If you decide to use groups without adding these groups from an identity provider, you must create the groups in OpenShift and then add users to them.
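
If you create the groups in OpenShift rather than adding them from an identity provider, the `oc adm groups new` command is one way to do it from the command line. The following is a minimal sketch, assuming that you are logged in with `oc` as a user with the cluster-admin role. The helper name `create_rhoai_groups` is hypothetical, and rhoai-admins and rhoai-users are the example group names used in this guide.

```shell
# Hypothetical helper: create each OpenShift group named on the command line.
# Assumes an active `oc` session with the cluster-admin role.
create_rhoai_groups() {
  for group in "$@"; do
    # `oc adm groups new` creates an empty group; users are added separately.
    oc adm groups new "$group"
  done
}

# Example usage (group names are the examples used in this guide):
#   create_rhoai_groups rhoai-admins rhoai-users
```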

Some operations relevant to OpenShift AI require the cluster-admin role, including the following:

Adding users to the OpenShift AI user and administrator groups, if you are using groups.
Removing users from the OpenShift AI user and administrator groups, if you are using groups.
Managing custom environment and storage configuration for users in OpenShift, such as Jupyter notebook resources, ConfigMaps, and persistent volume claims (PVCs).

Important

Although users of OpenShift AI and its components are authenticated through OpenShift, session management is separate from authentication. This means that logging out of OpenShift or OpenShift AI does not affect a logged-in Jupyter session running on those platforms. It also means that when a user’s permissions change, that user must log out of all current sessions for the changes to take effect.

1.2. Viewing OpenShift AI users

If you have defined OpenShift AI user groups, you can view the users that belong to these groups.

Prerequisites

The Red Hat OpenShift AI user group, administrator group, or both exist.
You have the cluster-admin role in OpenShift.
You have configured a supported identity provider for OpenShift.

Procedure

In the OpenShift web console, click User Management → Groups.
Click the name of the group containing the users that you want to view.
- For administrative users, click the name of your administrator group, for example, rhoai-admins.
- For normal users, click the name of your user group, for example, rhoai-users.
  The Group details page for the group appears.

Verification

In the Users section for the relevant group, you can view the users who have permission to access Red Hat OpenShift AI.
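
The same membership information can be read from the command line. The following sketch assumes an active `oc` session with permission to read groups. The function name `list_group_users` is hypothetical.

```shell
# Hypothetical helper: print the user names that belong to an OpenShift group.
# The Group resource stores its members in the top-level `users` field.
list_group_users() {
  oc get group "$1" -o jsonpath='{.users[*]}'
}

# Example usage:
#   list_group_users rhoai-admins
#   list_group_users rhoai-users
```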

1.3. Adding users to OpenShift AI user groups

By default, all OpenShift users have access to Red Hat OpenShift AI.

Optionally, you can restrict user access to your OpenShift AI instance by defining user groups. You must grant users permission to access Red Hat OpenShift AI by adding user accounts to the Red Hat OpenShift AI user group, administrator group, or both. You can either use the default group name, or specify a group name that already exists in your identity provider.

The user group provides the user with access to product components in the Red Hat OpenShift AI dashboard, such as data science pipelines, and associated services, such as Jupyter. By default, users in the user group have access to data science pipeline applications within data science projects that they created.

The administrator group provides the user with access to developer and administrator functions in the Red Hat OpenShift AI dashboard, such as data science pipelines, and associated services, such as Jupyter. Users in the administrator group can configure data science pipeline applications in the OpenShift AI dashboard for any data science project.

If you restrict access by using user groups, users that are not in the OpenShift AI user group or administrator group cannot view the dashboard or use associated services, such as Jupyter. They also cannot access the Cluster settings page.

Important

If you are using LDAP as your identity provider, you need to configure LDAP syncing to OpenShift. For more information, see Syncing LDAP groups.

Follow the steps in this section to add users to your OpenShift AI administrator and user groups.

Note: You can add users in OpenShift AI but you must manage the user lists in the OpenShift web console.

Prerequisites

You have configured a supported identity provider for OpenShift.
You are assigned the cluster-admin role in OpenShift.
You have defined an administrator group and user group for OpenShift AI.

Procedure

In the OpenShift web console, click User Management → Groups.
Click the name of the group you want to add users to.
- For administrative users, click the administrator group, for example, rhoai-admins.
- For normal users, click the user group, for example, rhoai-users.
  The Group details page for that group appears.
Click Actions → Add Users.
The Add Users dialog appears.
In the Users field, enter the relevant user name to add to the group.
Click Save.

Verification

Click the Details tab for each group and confirm that the Users section contains the user names that you added.
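
As a command-line alternative to the web console steps, `oc adm groups add-users` adds one or more users to an existing group. This is a sketch only: the function name `add_users_to_group` and the user names in the usage comment are hypothetical, and it assumes that you are logged in with `oc` as a user with the cluster-admin role.

```shell
# Hypothetical helper: add one or more users to an existing OpenShift group.
# First argument: group name. Remaining arguments: user names.
add_users_to_group() {
  group="$1"
  shift
  oc adm groups add-users "$group" "$@"
}

# Example usage (user names are placeholders):
#   add_users_to_group rhoai-users user1 user2
```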

1.4. Selecting OpenShift AI administrator and user groups

By default, all users authenticated in OpenShift can access OpenShift AI.

Also by default, users with cluster-admin permissions are OpenShift AI administrators. A cluster admin is a superuser that can perform any action in any project in the OpenShift cluster. When the cluster-admin role is bound to a user with a local binding, that user has full control over quota and every action on every resource in the project.

After a cluster admin user defines additional administrator and user groups in OpenShift, you can add those groups to OpenShift AI by selecting them in the OpenShift AI dashboard.

Prerequisites

You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
The groups that you want to select as administrator and user groups for OpenShift AI already exist in OpenShift. For more information, see Managing users and groups.

Procedure

From the OpenShift AI dashboard, click Settings → User management.
Select your OpenShift AI administrator groups: Under Data science administrator groups, click the text box and select an OpenShift group. Repeat this process to define multiple administrator groups.
Select your OpenShift AI user groups: Under Data science user groups, click the text box and select an OpenShift group. Repeat this process to define multiple user groups.
Important
The system:authenticated setting allows all users authenticated in OpenShift to access OpenShift AI.
Click Save changes.

Verification

Administrator users can successfully log in to OpenShift AI and have access to the Settings navigation menu.
Non-administrator users can successfully log in to OpenShift AI. They can also access and use individual components, such as projects and workbenches.

1.5. Deleting users

1.5.1. About deleting users and their resources

If you have administrator access to OpenShift, you can revoke a user’s access to Jupyter and delete the user’s resources from Red Hat OpenShift AI.

Deleting a user and the user’s resources involves the following tasks:

Before you delete a user from OpenShift AI, it is good practice to back up the data on your persistent volume claims (PVCs).
Stop notebook servers owned by the user.
Revoke user access to Jupyter.
Remove the user from the allowed group in your OpenShift identity provider.
After you delete a user, delete their associated configuration files from OpenShift.

1.5.2. Stopping notebook servers owned by other users

OpenShift AI administrators can stop notebook servers that are owned by other users to reduce resource consumption on the cluster, or as part of removing a user and their resources from the cluster.

Prerequisites

You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
You have launched the Jupyter application, as described in Starting a Jupyter notebook server.
The notebook server that you want to stop is running.

Procedure

On the page that opens when you launch Jupyter, click the Administration tab.
Stop one or more servers.
- If you want to stop one or more specific servers, perform the following actions:
  1. In the Users section, locate the user that the notebook server belongs to.
  2. To stop the notebook server, perform one of the following actions:
    Click the action menu (⋮) beside the relevant user and select Stop server.
    Click View server beside the relevant user and then click Stop notebook server.
    The Stop server dialog box appears.
  3. Click Stop server.
- If you want to stop all servers, perform the following actions:
  1. Click the Stop all servers button.
  2. Click OK to confirm stopping all servers.

Verification

The Stop server link beside each server changes to a Start server link when the notebook server has stopped.

1.5.3. Revoking user access to Jupyter

You can revoke a user’s access to Jupyter by removing the user from the OpenShift AI user groups that define access to OpenShift AI. When you remove a user from the user groups, the user is prevented from accessing the OpenShift AI dashboard and from using associated services that consume resources in your cluster.

Important

Follow these steps only if you have implemented OpenShift AI user groups to restrict access to OpenShift AI. To completely remove a user from OpenShift AI, you must remove them from the allowed group in your OpenShift identity provider.

Prerequisites

You have stopped any notebook servers owned by the user you want to delete.
You are assigned the cluster-admin role in OpenShift.
You are using OpenShift AI user groups, and the user is part of the user group, administrator group, or both.

Procedure

In the OpenShift web console, click User Management → Groups.
Click the name of the group that you want to remove the user from.
- For administrative users, click the name of your administrator group, for example, rhoai-admins.
- For non-administrator users, click the name of your user group, for example, rhoai-users.
The Group details page for the group appears.
In the Users section on the Details tab, locate the user that you want to remove.
Click the action menu (⋮) beside the user that you want to remove and click Remove user.

Verification

Check the Users section on the Details tab and confirm that the user that you removed is not visible.
In the rhods-notebooks project, check under Workloads → Pods and ensure that there is no notebook server pod for this user. If you see a pod named jupyter-nb-<username>-* for the user that you have removed, delete that pod to ensure that the deleted user is not consuming resources on the cluster.
In the OpenShift AI dashboard, check the list of data science projects. Delete any projects that belong to the user.
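
The group removal and the pod check can also be sketched with `oc`. The function name `revoke_user_access` is hypothetical. The sketch assumes an active cluster-admin `oc` session and assumes that the user name appears unmodified in the notebook pod name, which might not hold for user names that contain special characters.

```shell
# Hypothetical helper: remove a user from the given groups, then list any
# notebook server pods that still exist for that user.
# First argument: user name. Remaining arguments: group names.
revoke_user_access() {
  user="$1"
  shift
  for group in "$@"; do
    oc adm groups remove-users "$group" "$user"
  done
  # Notebook pods are named jupyter-nb-<username>-*; print any that remain.
  # `|| true` keeps the function from failing when no pod matches.
  oc get pods -n rhods-notebooks -o name | grep -F "jupyter-nb-${user}-" || true
}

# Example usage (user name is a placeholder):
#   revoke_user_access user1 rhoai-users rhoai-admins
```

Any pod that the function prints can then be deleted, as described in the verification steps.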

1.5.4. Backing up storage data

It is a best practice to back up the data on your persistent volume claims (PVCs) regularly.

Backing up your data is particularly important before you delete a user and before you uninstall OpenShift AI, as all PVCs are deleted when OpenShift AI is uninstalled.

See the documentation for your cluster platform for more information about backing up your PVCs.

Additional resources

Understanding persistent storage

1.5.5. Cleaning up after deleting users

After you remove a user’s access to Red Hat OpenShift AI or Jupyter, you must also delete the configuration files for the user from OpenShift. Red Hat recommends that you back up the user’s data before removing their configuration files.

Prerequisites

(Optional) If you want to completely remove the user’s access to OpenShift AI, you have removed their credentials from your identity provider.
You have revoked the user’s access to Jupyter.
You have backed up the user’s storage data.
You have logged in to the OpenShift web console as a user with the cluster-admin role.

Procedure

Delete the user’s persistent volume claim (PVC).
1. Click Storage → PersistentVolumeClaims.
2. If it is not already selected, select the rhods-notebooks project from the project list.
3. Locate the jupyter-nb-<username> PVC.
  Replace <username> with the relevant user name.
4. Click the action menu (⋮) and select Delete PersistentVolumeClaim from the list.
  The Delete PersistentVolumeClaim dialog appears.
5. Inspect the dialog and confirm that you are deleting the correct PVC.
6. Click Delete.
Delete the user’s ConfigMap.
1. Click Workloads → ConfigMaps.
2. If it is not already selected, select the rhods-notebooks project from the project list.
3. Locate the jupyterhub-singleuser-profile-<username> ConfigMap.
  Replace <username> with the relevant user name.
4. Click the action menu (⋮) and select Delete ConfigMap from the list.
  The Delete ConfigMap dialog appears.
5. Inspect the dialog and confirm that you are deleting the correct ConfigMap.
6. Click Delete.

Verification

The user cannot access Jupyter any more, and sees an "Access permission needed" message if they try.
The user’s single-user profile, persistent volume claim (PVC), and ConfigMap are not visible in OpenShift.
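
The two deletions in this procedure map onto `oc delete` commands. The following sketch uses the resource names given in the procedure. The function name `cleanup_user_resources` is hypothetical, and the sketch assumes that the user name appears unmodified in the resource names. Back up the user’s data first, because deleting the PVC is not reversible.

```shell
# Hypothetical helper: delete the PVC and ConfigMap that belong to a user
# in the rhods-notebooks project. Assumes a cluster-admin `oc` session.
cleanup_user_resources() {
  user="$1"
  # Deleting the PVC removes the user's notebook storage.
  oc delete pvc "jupyter-nb-${user}" -n rhods-notebooks
  oc delete configmap "jupyterhub-singleuser-profile-${user}" -n rhods-notebooks
}

# Example usage (user name is a placeholder):
#   cleanup_user_resources user1
```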

Chapter 2. Creating custom workbench images

Red Hat OpenShift AI includes a selection of default workbench images that a data scientist can select when they create or edit a workbench.

In addition, you can import a custom workbench image, for example, if you want to add libraries that data scientists often use, or if your data scientists require a specific version of a library that is different from the version provided in a default image. Custom workbench images are also useful if your data scientists require operating system packages or applications because they cannot install them directly in their running environment (data scientist users do not have root access, which is needed for those operations).

A custom workbench image is simply a container image. You build one as you would build any standard container image, by using a Containerfile (or Dockerfile). You start from an existing image (the FROM instruction), and then add your required elements.

You have the following options for creating a custom workbench image:

Start from one of the default images, as described in Creating a custom image from a default OpenShift AI image.
Create your own image by following the guidelines for making it compatible with OpenShift AI, as described in Creating a custom image from your own image.

Important

Red Hat supports adding custom workbench images to your deployment of OpenShift AI, ensuring that they are available for selection when creating a workbench. However, Red Hat does not support the contents of your custom workbench image. That is, if your custom workbench image is available for selection during workbench creation, but does not create a usable workbench, Red Hat does not provide support to fix your custom workbench image.

Additional resources

For a list of the OpenShift AI default workbench images and their preinstalled packages, see Supported Configurations.

2.1. Creating a custom image from a default OpenShift AI image

After Red Hat OpenShift AI is installed on a cluster, you can find the default workbench images in the OpenShift console, under Builds → ImageStreams for the redhat-ods-applications project.

You can create a custom image by adding OS packages or applications to a default OpenShift AI image.

Prerequisites

You know which default image you want to use as the base for your custom image.
See Supported Configurations for a list of the OpenShift AI default workbench images and their preinstalled packages.
You have cluster-admin access to the OpenShift console for the cluster where OpenShift AI is installed.

Procedure

Obtain the location of the default image that you want to use as the base for your custom image.
1. In the OpenShift console, select Builds → ImageStreams.
2. Select the redhat-ods-applications project.
3. From the list of installed imagestreams, click the name of the image that you want to use as the base for your custom image. For example, click pytorch.
4. On the ImageStream details page, click YAML.
5. In the spec:tags section, find the tag for the version of the image that you want to use.
  The location of the original image is shown in the tag’s from:name section, for example:
  name: 'quay.io/modh/odh-pytorch-notebook@sha256:b68e0192abf7d…'
6. Copy this location for use in your custom image.
Create a standard Containerfile or Dockerfile.
For the FROM instruction, specify the base image location that you copied in Step 1, for example:
FROM quay.io/modh/odh-pytorch-notebook@sha256:b68e0…
Optional: Install OS packages:
1. Switch to USER 0 (USER 0 is required to install OS packages).
2. Install the packages.
3. Switch back to USER 1001.
  The following example creates a custom workbench image that adds Java to the default PyTorch image:
```
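# Start from the default PyTorch workbench image.
FROM quay.io/modh/odh-pytorch-notebook@sha256:b68e0…

# Switch to root; installing OS packages requires USER 0.
USER 0

RUN INSTALL_PKGS="java-11-openjdk java-11-openjdk-devel" && \
    dnf install -y --setopt=tsflags=nodocs $INSTALL_PKGS && \
    dnf -y clean all --enablerepo='*'

# Switch back to the default non-root user.
USER 1001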
```
Optional: Add Python packages:
1. Specify USER 1001.
2. Copy the requirements.txt file.
3. Install the packages.
  The following example installs packages from the requirements.txt file in the default PyTorch image:
```
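FROM quay.io/modh/odh-pytorch-notebook@sha256:b68e0…

# Install Python packages as the non-root user.
USER 1001

COPY requirements.txt ./requirements.txt

# --no-cache-dir keeps the pip download cache out of the image layer.
RUN pip install --no-cache-dir -r requirements.txt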
```
Build the image file. For example, you can use podman build locally where the image file is located and then push the image to a registry that is accessible to OpenShift AI:
```
$ podman build -t my-registry/my-custom-image:0.0.1 .
$ podman push my-registry/my-custom-image:0.0.1
```
Alternatively, you can leverage OpenShift’s image build capabilities by using BuildConfig.
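
Two parts of this procedure can be sketched from the command line: looking up the source location of a default image, and building in the cluster with a binary BuildConfig. The function names are hypothetical. The sketch assumes an active `oc` session, that the build file in the current directory is named Dockerfile (the default that the Docker build strategy looks for), and that the resulting image stream in the current project is reachable from OpenShift AI.

```shell
# Hypothetical helper: print each tag of a default image stream together with
# the location of the image that the tag points to (spec.tags[].from.name).
list_imagestream_sources() {
  oc get imagestream "$1" -n redhat-ods-applications \
    -o jsonpath='{range .spec.tags[*]}{.name}{"\t"}{.from.name}{"\n"}{end}'
}

# Hypothetical helper: build the image in the cluster instead of using podman.
build_image_in_cluster() {
  name="$1"
  # Create a BuildConfig that uses the Docker strategy and expects the
  # build context to be uploaded from the local machine.
  oc new-build --name="$name" --binary --strategy=docker
  # Upload the current directory as the build context and follow the build log.
  oc start-build "$name" --from-dir=. --follow
}

# Example usage:
#   list_imagestream_sources pytorch
#   build_image_in_cluster my-custom-image
```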

2.2. Creating a custom image from your own image

You can build your own custom image. However, you must make sure that your image is compatible with OpenShift and OpenShift AI.

Additional resources

General Container image guidelines section in the OpenShift Container Platform Images documentation.
Red Hat Universal Base Image: https://catalog.redhat.com/software/base-images
Red Hat Ecosystem Catalog: https://catalog.redhat.com/

2.2.1. Basic guidelines for creating your own workbench image

The following basic guidelines provide information to consider when you build your own custom workbench image.

Designing your image to run with USER 1001

In OpenShift, your container will run with a random UID and a GID of 0. Make sure that your image is compatible with these user and group requirements, especially if you need write access to directories. Best practice is to design your image to run with USER 1001.

Avoid placing artifacts in $HOME

The persistent volume attached to the workbench will be mounted on /opt/app-root/src. This location is also the location of $HOME. Therefore, do not put any files or other resources directly in $HOME because they won’t be visible after the workbench is deployed (and the persistent volume is mounted).

Specifying the API endpoint

OpenShift readiness and liveness probes will query the /api endpoint. For a Jupyter IDE, this is the default endpoint. For other IDEs, you must implement the /api endpoint.

2.2.2. Advanced guidelines for creating your own workbench image

The following guidelines provide information to consider when you build your own custom workbench image.

Minimizing image size

A workbench image uses a "layered" file system. Every time you use a COPY or a RUN command in your workbench image file, a new layer is created. Artifacts are not deleted. When you remove an artifact, for example, a file, it is "masked" in the next layer. Therefore, consider the following guidelines when you create your workbench image file.

Avoid using the dnf update command.
- If you start from an image that is constantly updated, such as ubi9/python-39 from the Red Hat Catalog, you might not need to use the dnf update command. This command fetches new metadata, updates files that might not have impact, and increases the workbench image size.
- Point to a newer version of your base image rather than performing a dnf update on an older version.
Group RUN commands. Chain your commands by adding && \ at the end of each line.
If you must compile code (such as a library or an application) to include in your custom image, implement multi-stage builds so that you avoid including the build artifacts in your final image. That is, compile the library or application in an intermediate image and then copy the result to your final image, leaving behind build artifacts that you do not want included.

Setting access to files and directories

Set the ownership of files and folders to 1001:0 (user "default", group "0"), for example:
```
COPY --chown=1001:0 os-packages.txt ./
```
On OpenShift, every container is in a standard namespace (unless you modify security). The container runs with a user that has a random user ID (uid) and with a group ID (gid) of 0. Therefore, all folders that you want to write to - and all the files you want to (temporarily) modify - in your image must be accessible by the user that has the random user ID (uid). Alternatively, you can set access to any user, as shown in the following example:
```
COPY --chmod=775 os-packages.txt ./
```
Build your image with /opt/app-root/src as the default location for the data that you want persisted, for example:
```
WORKDIR /opt/app-root/src
```
When a user launches a workbench from the OpenShift AI Applications → Enabled page, the "personal" volume of the user is mounted at /opt/app-root/src. Because this location is not configurable, when you build your custom image, you must specify this default location for persisted data.
Fix permissions to support pip (the package manager for Python packages) in OpenShift environments. Add the following instruction to your custom image (if needed, change python3.9 to the Python version that you are using):
```
RUN chmod -R g+w /opt/app-root/lib/python3.9/site-packages && \
    fix-permissions /opt/app-root -P
```
A service within your workbench image must answer at ${NB_PREFIX}/api, otherwise the OpenShift liveness/readiness probes fail and delete the pod for the workbench image.
The NB_PREFIX environment variable specifies the URL path where the container is expected to be listening.
The following is an example of an Nginx configuration:
```
location = ${NB_PREFIX}/api {
	return 302  /healthz;
	access_log  off;
}
```
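
Before you import the image, it can help to check the endpoint from inside a running container. The following sketch uses `curl` to request ${NB_PREFIX}/api and treats any status from 200 to 399 as success, which is the range that Kubernetes HTTP probes accept. The function name `check_probe_endpoint` and the port passed to it are assumptions; use the port that your own service listens on.

```shell
# Hypothetical helper: request ${NB_PREFIX}/api on the local port and report
# whether the HTTP status would satisfy a liveness or readiness probe.
# First argument: the port that the workbench service listens on (assumption).
check_probe_endpoint() {
  port="$1"
  # -s silences progress output, -o /dev/null discards the body, and
  # -w '%{http_code}' prints only the status code. Redirects are not followed.
  code="$(curl -s -o /dev/null -w '%{http_code}' "http://127.0.0.1:${port}${NB_PREFIX}/api")"
  echo "HTTP ${code}"
  # HTTP probes succeed for any status code from 200 to 399.
  [ "$code" -ge 200 ] && [ "$code" -lt 400 ]
}

# Example usage (8888 is a placeholder port):
#   check_probe_endpoint 8888 && echo "probe would pass"
```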

For idle culling to work, the ${NB_PREFIX}/api/kernels URL must return a specifically formatted JSON payload.

The following is an example of an Nginx configuration:
```
location = ${NB_PREFIX}/api/kernels {
	return 302 $custom_scheme://$http_host/api/kernels/;
	access_log  off;
}

location ${NB_PREFIX}/api/kernels/ {
	return 302 $custom_scheme://$http_host/api/kernels/;
	access_log  off;
}

location /api/kernels/ {
	index access.cgi;
	fastcgi_index access.cgi;
	gzip  off;
	access_log  off;
}
```
The returned JSON payload should be:
```
{"id":"rstudio","name":"rstudio","last_activity":(time in ISO8601 format),"execution_state":"busy","connections": 1}
```
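
As an illustration of what a script such as access.cgi might emit, the following sketch prints a payload in the format shown above. It is not the implementation used by any Red Hat image. The function names are hypothetical, the timestamp is written as a quoted ISO 8601 string in UTC (an assumption about the expected format), and the hard-coded "busy" state with the current time reports the workbench as always active. A real script would derive both values from actual user activity.

```shell
# Hypothetical helper: print the JSON payload for ${NB_PREFIX}/api/kernels.
kernels_payload() {
  # Current time in ISO 8601 format (UTC), for example 2024-01-31T12:00:00Z.
  last_activity="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf '{"id":"rstudio","name":"rstudio","last_activity":"%s","execution_state":"busy","connections": 1}\n' \
    "$last_activity"
}

# Hypothetical helper: wrap the payload in a minimal CGI response
# (one header line, a blank line, then the body).
emit_cgi_response() {
  printf 'Content-Type: application/json\n\n'
  kernels_payload
}

# Example usage inside a CGI script:
#   emit_cgi_response
```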

Enabling CodeReady Builder (CRB) and Extra Packages for Enterprise Linux (EPEL)

CRB and EPEL are repositories that provide packages which are absent from a standard Red Hat Enterprise Linux (RHEL) or Universal Base Image (UBI) installation. They are useful and required for installing some software, for example, RStudio.

On UBI9 images, CRB is enabled by default. To enable EPEL on UBI9-based images, run the following command:
```
RUN yum install -y https://download.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
```

To enable CRB and EPEL on CentOS Stream 9-based images, run the following command:
```
RUN yum install -y yum-utils && \
    yum-config-manager --enable crb && \
    yum install -y https://download.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
```

Adding Elyra compatibility

Support for data science pipelines V2 (provided with the odh-elyra package) is available in Red Hat OpenShift AI version 2.9 and later. Previous versions of OpenShift AI support data science pipelines V1 (provided with the elyra package).

If you want your custom image to support data science pipelines V2, you must address the following requirements:

Include the odh-elyra package (not the elyra package) to support data science pipelines V2, for example:
```
USER 1001

RUN pip install odh-elyra
```
If you want to include the data science pipeline configuration automatically, as a runtime configuration, add an annotation when you import a custom workbench image.

2.3. Enabling custom images in OpenShift AI

By default, all OpenShift AI administrators can import custom workbench images by selecting the Settings → Notebook images navigation option in the OpenShift AI dashboard.

If the Settings → Notebook images option is not available, check the following settings, depending on which navigation element does not appear in the dashboard:

The Settings menu does not appear in the OpenShift AI navigation bar
The visibility of the OpenShift AI dashboard Settings menu is determined by your user permissions. By default, the Settings menu is available to OpenShift AI administrator users (users that are members of the rhoai-admins group). Users with the OpenShift cluster-admin role are automatically added to the rhoai-admins group and are granted administrator access in OpenShift AI.
For more information about user permissions, see Managing users and groups.
The Notebook images menu option does not appear under the Settings menu
The visibility of the Notebook images menu option is controlled in the dashboard configuration, by the value of the dashboardConfig: disableBYONImageStream option. It is set to false (the Notebook images menu option is visible) by default.
You need cluster-admin permissions to edit the dashboard configuration.
For more information about setting dashboard configuration options, see Customizing the dashboard.
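
To check the current value of a dashboard option without opening the web console, you can read it from the OdhDashboardConfig resource. This is a sketch: the function name `get_dashboard_option` is hypothetical, and it assumes an active `oc` session with permission to read the odh-dashboard-config instance in the redhat-ods-applications project.

```shell
# Hypothetical helper: print the value of one option under spec.dashboardConfig
# in the odh-dashboard-config instance.
get_dashboard_option() {
  oc get odhdashboardconfig odh-dashboard-config -n redhat-ods-applications \
    -o jsonpath="{.spec.dashboardConfig.$1}"
}

# Example usage:
#   get_dashboard_option disableBYONImageStream
```

If the option has never been set explicitly, the command might print nothing, in which case the default described above applies.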

2.4. Importing a custom workbench image

In addition to workbench images provided and supported by Red Hat and independent software vendors (ISVs), you can import custom workbench images that cater to your project’s specific requirements.

You must import a custom workbench image so that your OpenShift AI users (data scientists) can access it when they create a project workbench.

Prerequisites

You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
Your custom image exists in an image registry that is accessible to OpenShift AI.
The Settings → Notebook images dashboard navigation menu option is enabled, as described in Enabling custom images in OpenShift AI.
If you want to associate an accelerator with the custom image that you want to import, you know the accelerator’s identifier - the unique string that identifies the hardware accelerator.

Procedure

From the OpenShift AI dashboard, click Settings → Notebook images.
The Notebook images page appears. Previously imported images are displayed. To enable or disable a previously imported image, on the row containing the relevant image, click the toggle in the Enable column.
Optional: If you want to associate an accelerator and you have not already created an accelerator profile, click Create profile on the row containing the image and complete the relevant fields. If the image does not contain an accelerator identifier, you must manually configure one before creating an associated accelerator profile.
Click Import new image. Alternatively, if no previously imported images were found, click Import image.
The Import Notebook images dialog appears.
In the Image location field, enter the URL of the repository containing the image. For example: quay.io/my-repo/my-image:tag, quay.io/my-repo/my-image@sha256:xxxxxxxxxxxxx, or docker.io/my-repo/my-image:tag.
In the Name field, enter an appropriate name for the image.
Optional: In the Description field, enter a description for the image.
Optional: From the Accelerator identifier list, select an identifier to set its accelerator as recommended with the image. If the image contains only one accelerator identifier, the identifier name displays by default.
Optional: Add software to the image. After the import has completed, the software is added to the image’s metadata and displayed on the Jupyter server creation page.
1. Click the Software tab.
2. Click the Add software button.
3. Click the Edit icon.
4. Enter the Software name.
5. Enter the software Version.
6. Click the Confirm icon to confirm your entry.
7. To add additional software, click Add software, complete the relevant fields, and confirm your entry.
Optional: Add packages to the notebook image. After the import has completed, the packages are added to the image’s metadata and displayed on the Jupyter server creation page.
1. Click the Packages tab.
2. Click the Add package button.
3. Click the Edit icon.
4. Enter the Package name. For example, if you want to include data science pipeline V2 automatically, as a runtime configuration, type odh-elyra.
5. Enter the package Version. For example, type 3.16.7.
6. Click the Confirm icon to confirm your entry.
7. To add an additional package, click Add package, complete the relevant fields, and confirm your entry.
Click Import.

Verification

The image that you imported is displayed in the table on the Notebook images page.
Your custom image is available for selection when a user creates a workbench.

Chapter 3. Customizing the dashboard

The OpenShift AI dashboard provides features that are designed to work for most scenarios. These features are configured in the OdhDashboardConfig custom resource (CR) file.

To see a description of the options in the OpenShift AI dashboard configuration file, see Dashboard configuration options.

As an administrator, you can customize the interface of the dashboard, for example to show or hide some of the dashboard navigation menu options. To change the default settings of the dashboard, edit the OdhDashboardConfig custom resource (CR) file as described in Editing the dashboard configuration file.

3.1. Editing the dashboard configuration file

As an administrator, you can customize the interface of the dashboard by editing the dashboard configuration file.

Prerequisites

You have cluster administrator privileges for your OpenShift cluster.

Procedure

Log in to the OpenShift console as a cluster administrator.
In the Administrator perspective, click Home → API Explorer.
In the search bar, enter OdhDashboardConfig to filter by kind.
Click the OdhDashboardConfig custom resource (CR) to open the resource details page.
Select the redhat-ods-applications project from the Project list.
Click the Instances tab.
Click the odh-dashboard-config instance to open the details page.
Click the YAML tab.
Edit the values of the options that you want to change.
Click Save to apply your changes and then click Reload to synchronize your changes to the cluster.
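
For orientation, the following fragment sketches how edited options might look in the YAML tab. It is a minimal sketch, not the complete resource: the two options are taken from Dashboard configuration options, the values are chosen only for illustration, and the rest of the spec section is omitted.

```yaml
# Fragment of the odh-dashboard-config instance (illustrative values only)
spec:
  dashboardConfig:
    disableProjectSharing: true   # prevent users from sharing data science projects
    disableSupport: true          # hide the Support option under the Help icon
```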

Verification

3.2. Dashboard configuration options

The OpenShift AI dashboard includes a set of core features enabled by default that are designed to work for most scenarios. Administrators can configure the OpenShift AI dashboard from the OdhDashboardConfig custom resource (CR) in OpenShift.

Table 3.1. Dashboard feature configuration options

Feature	Default	Description
`dashboardConfig: disableAcceleratorProfiles`	`false`	Shows the Settings → Accelerator profiles option in the dashboard navigation menu. To hide this menu option, set the value to `true`.
`dashboardConfig: disableBYONImageStream`	`false`	Shows the Settings → Notebook images option in the dashboard navigation menu. To hide this menu option, set the value to `true`.
`dashboardConfig: disableClusterManager`	`false`	Shows the Settings → Cluster settings option in the dashboard navigation menu. To hide this menu option, set the value to `true`.
`dashboardConfig: disableCustomServingRuntimes`	`false`	Shows the Settings → Serving runtimes option in the dashboard navigation menu. To hide this menu option, set the value to `true`.
`dashboardConfig: disableDistributedWorkloads`	`false`	Shows the Distributed Workload Metrics option in the dashboard navigation menu. To hide this menu option, set the value to `true`.
`dashboardConfig: disableHome`	`false`	Shows the Home option in the dashboard navigation menu. To hide this menu option, set the value to `true`.
`dashboardConfig: disableInfo`	`false`	On the Applications → Explore page, when a user clicks on an application tile, an information panel opens with more details about the application. To disable the information panel for all applications on the Applications → Explore page, set the value to `true`.
`dashboardConfig: disableISVBadges`	`false`	Shows the label on a tile that indicates whether the application is “Red Hat managed”, “Partner managed”, or “Self-managed”. To hide these labels, set the value to `true`.
`dashboardConfig: disableKServe`	`false`	Enables the ability to select KServe as a model-serving platform. To disable this ability, set the value to `true`.
`dashboardConfig: disableKServeAuth`	`false`	Enables the ability to use authentication with KServe. To disable this ability, set the value to `true`.
`dashboardConfig: disableKServeMetrics`	`false`	Enables the ability to view KServe metrics. To disable this ability, set the value to `true`.
`dashboardConfig: disableModelMesh`	`false`	Enables the ability to select ModelMesh as a model-serving platform. To disable this ability, set the value to `true`.
`dashboardConfig: disableModelRegistry`	`false`	Shows the Model Registry option and the Settings → Model registry settings option in the dashboard navigation menu. To hide these menu options, set the value to `true`.
`dashboardConfig: disableModelServing`	`false`	Shows the Model Serving option in the dashboard navigation menu and in the list of components for the data science projects. To hide Model Serving from the dashboard navigation menu and from the list of components for data science projects, set the value to `true`.
`dashboardConfig: disableNIMModelServing`	`true`	Disables the ability to select NVIDIA NIM as a model-serving platform. To enable this ability, set the value to `false`.
`dashboardConfig: disablePerformanceMetrics`	`false`	Shows the Endpoint Performance tab on the Model Serving page. To hide this tab, set the value to `true`.
`dashboardConfig: disablePipelines`	`false`	Shows the Data Science Pipelines option in the dashboard navigation menu. To hide this menu option, set the value to `true`.
`dashboardConfig: disableProjects`	`false`	Shows the Data Science Projects option in the dashboard navigation menu. To hide this menu option, set the value to `true`.
`dashboardConfig: disableProjectSharing`	`false`	Allows users to share access to their data science projects with other users. To prevent users from sharing data science projects, set the value to `true`.
`dashboardConfig: disableServingRuntimeParams`	`false`	Shows the Configuration parameters section in the Deploy model dialog and the Edit model dialog when using the single-model serving platform. To hide this section, set the value to `true`.
`dashboardConfig: disableStorageClasses`	`false`	Shows the Settings → Storage classes option in the dashboard navigation menu. To hide this menu option, set the value to `true`.
`dashboardConfig: disableSupport`	`false`	Shows the Support menu option when a user clicks the Help icon in the dashboard toolbar. To hide this menu option, set the value to `true`.
`dashboardConfig: disableTracking`	`false`	Allows Red Hat to collect data about OpenShift AI usage in your cluster. To disable data collection, set the value to `true`. You can also set this option in the OpenShift AI dashboard interface from the Settings → Cluster settings navigation menu.
`dashboardConfig: disableTrustyBiasMetrics`	`false`	Shows the Model Bias tab on the Model Serving page. To hide this tab, set the value to `true`.
`dashboardConfig: disableUserManagement`	`false`	Shows the Settings → User management option in the dashboard navigation menu. To hide this menu option, set the value to `true`.
`dashboardConfig: enablement`	`true`	Enables OpenShift AI administrators to add applications to the OpenShift AI dashboard Applications → Enabled page. To disable this ability, set the value to `false`.
`notebookController: enabled`	`true`	Controls the Notebook Controller options, such as whether it is enabled in the dashboard and which parts are visible.
`notebookSizes`		Allows you to customize names and resources for notebooks. The Kubernetes-style sizes are shown in the drop-down menu that appears when launching a workbench with the Notebook Controller. Note: These sizes must follow conventions. For example, requests must be smaller than limits.
`modelServerSizes`		Allows you to customize names and resources for model servers.
`groupsConfig`		Controls access to dashboard features, such as the Notebook server control panel for allowed users and the cluster settings user interface for OpenShift AI administrators.
`templateOrder`		Specifies the order of custom Serving Runtime templates. When the user creates a new template, it is added to this list.
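
As a rough illustration of the notebookSizes option, the following fragment sketches one size entry. The entry layout (a name plus Kubernetes-style resources with requests and limits) and all values are assumptions for illustration, so check them against the odh-dashboard-config instance in your cluster before relying on them. Note that each request is no larger than the corresponding limit.

```yaml
# Hypothetical notebookSizes entry; the name and values are illustrative
spec:
  notebookSizes:
    - name: Small
      resources:
        requests:
          cpu: '1'
          memory: 8Gi
        limits:
          cpu: '2'
          memory: 8Gi
```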

Chapter 4. Managing applications that show in the dashboard

4.1. Adding an application to the dashboard

If you have installed an application in your OpenShift cluster, you can add a tile for that application to the OpenShift AI dashboard (the Applications → Enabled page) to make it accessible for OpenShift AI users.

Prerequisites

You have cluster administrator privileges for your OpenShift cluster.
The dashboard configuration enablement option is set to true (the default). Note that a cluster administrator can disable this ability as described in Preventing users from adding applications to the dashboard.

Procedure

Log in to the OpenShift console as a cluster administrator.
In the Administrator perspective, click Home → API Explorer.
On the API Explorer page, search for the OdhApplication kind.
Click the OdhApplication kind to open the resource details page.
On the OdhApplication details page, select the redhat-ods-applications project from the Project list.
Click the Instances tab.
Click Create OdhApplication.

On the Create OdhApplication page, copy the following code and paste it into the YAML editor.

apiVersion: dashboard.opendatahub.io/v1
kind: OdhApplication
metadata:
  name: examplename
  namespace: redhat-ods-applications
  labels:
    app: odh-dashboard
    app.kubernetes.io/part-of: odh-dashboard
spec:
  enable:
    validationConfigMap: examplename-enable
  img: >-
    <svg width="24" height="25" viewBox="0 0 24 25" fill="none" xmlns="http://www.w3.org/2000/svg">
    <path d="path data" fill="#ee0000"/>
    </svg>
  getStartedLink: 'https://example.org/docs/quickstart.html'
  route: exampleroutename
  routeNamespace: examplenamespace
  displayName: Example Name
  kfdefApplications: []
  support: third party support
  csvName: ''
  provider: example
  docsLink: 'https://example.org/docs/index.html'
  quickStart: ''
  getStartedMarkDown: >-
    # Example

    Enter text for the information panel.

  description: >-
    Enter summary text for the tile.
  category: Self-managed | Partner managed | Red Hat managed

Modify the parameters in the code for your application.
Tip
To see example YAML files, click Home → API Explorer, select OdhApplication, click the Instances tab, select an instance, and then click the YAML tab.
Click Create. The application details page appears.
Log in to OpenShift AI.
In the left menu, click Applications → Explore.
Locate the new tile for your application and click it.
In the information pane for the application, click Enable.

Verification

In the left menu of the OpenShift AI dashboard, click Applications → Enabled and verify that your application is available.

4.2. Preventing users from adding applications to the dashboard

By default, OpenShift AI administrators can add applications to the OpenShift AI dashboard Applications → Enabled page.

As a cluster administrator, you can disable the ability for OpenShift AI administrators to add applications to the dashboard.

Note: The Jupyter tile is enabled by default. To disable it, see Hiding the default Jupyter application.

Prerequisite

You have cluster administrator privileges for your OpenShift cluster.

Procedure

Log in to the OpenShift console as a cluster administrator.
Open the dashboard configuration file:
1. In the Administrator perspective, click Home → API Explorer.
2. In the search bar, enter OdhDashboardConfig to filter by kind.
3. Click the OdhDashboardConfig custom resource (CR) to open the resource details page.
4. Select the redhat-ods-applications project from the Project list.
5. Click the Instances tab.
6. Click the odh-dashboard-config instance to open the details page.
7. Click the YAML tab.
In the spec:dashboardConfig section, set the value of enablement to false to disable the ability for dashboard users to add applications to the dashboard.
Click Save to apply your changes and then click Reload to make sure that your changes are synced to the cluster.
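
The change in the YAML tab can be sketched as the following fragment. Only the relevant key is shown; the other options in the dashboardConfig section are omitted.

```yaml
# Fragment of the odh-dashboard-config instance
spec:
  dashboardConfig:
    enablement: false   # OpenShift AI administrators can no longer add applications
```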

Verification

Open the OpenShift AI dashboard Applications → Enabled page.

4.3. Disabling applications connected to OpenShift AI

You can disable applications and components so that they do not appear on the OpenShift AI dashboard when you no longer want to use them, for example, when data scientists no longer use an application or when the application license expires.

Disabling unused applications allows your data scientists to manually remove these application tiles from their OpenShift AI dashboard so that they can focus on the applications that they are most likely to use. See Removing disabled applications from the dashboard for more information about manually removing application tiles.

Prerequisites

You have logged in to the OpenShift web console.
You are part of the cluster-admins user group in OpenShift.
You have installed or configured the service on your OpenShift cluster.
The application or component that you want to disable is enabled and appears on the Enabled page.

Procedure

In the OpenShift web console, switch to the Administrator perspective.
Switch to the redhat-ods-applications project.
Click Operators → Installed Operators.
Click on the Operator that you want to uninstall. You can enter a keyword into the Filter by name field to help you find the Operator faster.
Delete any Operator resources or instances by using the tabs in the Operator interface.
During installation, some Operators require the administrator to create resources or start process instances using tabs in the Operator interface. These must be deleted before the Operator can uninstall correctly.
On the Operator Details page, click the Actions drop-down menu and select Uninstall Operator.
An Uninstall Operator? dialog box is displayed.
Select Uninstall to uninstall the Operator, Operator deployments, and pods. After this is complete, the Operator stops running and no longer receives updates.

Important

Removing an Operator does not remove any custom resource definitions or managed resources for the Operator. Custom resource definitions and managed resources still exist and must be cleaned up manually. Any applications deployed by your Operator and any configured off-cluster resources continue to run and must be cleaned up manually.

Verification

The Operator is uninstalled from its target clusters.
The Operator no longer appears on the Installed Operators page.
The disabled application is no longer available for your data scientists to use, and is marked as Disabled on the Enabled page of the OpenShift AI dashboard. This action may take a few minutes to occur following the removal of the Operator.

4.4. Showing or hiding information about enabled applications

If you have installed another application in your OpenShift cluster, you can add a tile for that application to the OpenShift AI dashboard (the Applications → Enabled page) to make it accessible for OpenShift AI users.

Prerequisites

You have cluster administrator privileges for your OpenShift cluster.

Procedure

Log in to the OpenShift console as a cluster administrator.
In the Administrator perspective, click Home → API Explorer.
On the API Explorer page, search for the OdhApplication kind.
Click the OdhApplication kind to open the resource details page.
On the OdhApplication details page, select the redhat-ods-applications project from the Project list.
Click the Instances tab.
Click Create OdhApplication.

On the Create OdhApplication page, copy the following code and paste it into the YAML editor.

apiVersion: dashboard.opendatahub.io/v1
kind: OdhApplication
metadata:
  name: examplename
  namespace: redhat-ods-applications
  labels:
    app: odh-dashboard
    app.kubernetes.io/part-of: odh-dashboard
spec:
  enable:
    validationConfigMap: examplename-enable
  img: >-
    <svg width="24" height="25" viewBox="0 0 24 25" fill="none" xmlns="http://www.w3.org/2000/svg">
    <path d="path data" fill="#ee0000"/>
    </svg>
  getStartedLink: 'https://example.org/docs/quickstart.html'
  route: exampleroutename
  routeNamespace: examplenamespace
  displayName: Example Name
  kfdefApplications: []
  support: third party support
  csvName: ''
  provider: example
  docsLink: 'https://example.org/docs/index.html'
  quickStart: ''
  getStartedMarkDown: >-
    # Example

    Enter text for the information panel.

  description: >-
    Enter summary text for the tile.
  category: Self-managed | Partner managed | Red Hat managed

Modify the parameters in the code for your application.
Tip
To see example YAML files, click Home → API Explorer, select OdhApplication, click the Instances tab, select an instance, and then click the YAML tab.
Click Create. The application details page appears.
Log in to OpenShift AI.
In the left menu, click Applications → Explore.
Locate the new tile for your application and click it.
In the information pane for the application, click Enable.

Verification

In the left menu of the OpenShift AI dashboard, click Applications → Enabled and verify that your application is available.

4.5. Hiding the default Jupyter application

The OpenShift AI dashboard includes Jupyter as an enabled application by default.

To hide the Jupyter tile from the list of Enabled applications, edit the dashboard configuration file.

Prerequisite

You have cluster administrator privileges for your OpenShift cluster.

Procedure

Log in to the OpenShift console as a cluster administrator.
Open the dashboard configuration file:
1. In the Administrator perspective, click Home → API Explorer.
2. In the search bar, enter OdhDashboardConfig to filter by kind.
3. Click the OdhDashboardConfig custom resource (CR) to open the resource details page.
4. Select the redhat-ods-applications project from the Project list.
5. Click the Instances tab.
6. Click the odh-dashboard-config instance to open the details page.
7. Click the YAML tab.
In the spec:notebookController section, set the value of enabled to false to hide the Jupyter tile from the list of Enabled applications.
Click Save to apply your changes and then click Reload to make sure that your changes are synced to the cluster.
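
The change in the YAML tab can be sketched as the following fragment. Only the relevant key is shown; any other settings in the notebookController section are omitted.

```yaml
# Fragment of the odh-dashboard-config instance
spec:
  notebookController:
    enabled: false   # hides the Jupyter tile from the list of Enabled applications
```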

Verification

In the OpenShift AI dashboard, select Applications → Enabled. You should not see the Jupyter tile.

Chapter 5. Allocating additional resources to OpenShift AI users

As a cluster administrator, you can allocate additional resources to a cluster to support compute-intensive data science work. This support includes increasing the number of nodes in the cluster and changing the cluster’s allocated machine pool.

For more information about allocating additional resources to an OpenShift cluster, see Manually scaling a compute machine set.

Chapter 6. Customizing component deployment resources

6.1. Overview of component resource customization

You can customize deployment resources that are related to the Red Hat OpenShift AI Operator, for example, CPU and memory limits and requests. For resource customizations to persist without being overwritten by the Operator, the opendatahub.io/managed: true annotation must not be present in the YAML file for the component deployment. This annotation is absent by default.

The following table shows the deployment names for each component in the redhat-ods-applications namespace.

Important

Components denoted with (Technology Preview) in this table are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using Technology Preview features in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Component	Deployment names
CodeFlare	codeflare-operator-manager
KServe	kserve-controller-manager odh-model-controller
Ray	kuberay-operator
Kueue	kueue-controller-manager
Workbenches	notebook-controller-deployment odh-notebook-controller-manager
Dashboard	rhods-dashboard
Model serving	modelmesh-controller odh-model-controller
Model registry (Technology Preview)	model-registry-operator-controller-manager
Data science pipelines	data-science-pipelines-operator-controller-manager
Training Operator	kubeflow-training-operator

6.2. Customizing component resources

You can customize component deployment resources by updating the .spec.template.spec.containers.resources section of the YAML file for the component deployment.

Prerequisites

You have cluster administrator privileges for your OpenShift cluster.

Procedure

Log in to the OpenShift console as a cluster administrator.
In the Administrator perspective, click Workloads → Deployments.
From the Project drop-down list, select redhat-ods-applications.
In the Name column, click the name of the deployment for the component that you want to customize resources for.
Note
For more information about the deployment names for each component, see Overview of component resource customization.
On the Deployment details page that appears, click the YAML tab.
Find the .spec.template.spec.containers.resources section.

Update the value of the resource that you want to customize. For example, to update the memory limit to 500Mi, make the following change:

containers:
        - resources:
            limits:
                cpu: '2'
                memory: 500Mi
            requests:
                cpu: '1'
                memory: 250Mi  # illustrative value; a request must not exceed its limit

Click Save.
Click Reload.

Verification

6.3. Disabling component resource customization

You can disable customization of component deployment resources, and restore default values, by adding the opendatahub.io/managed: true annotation to the YAML file for the component deployment.

Important

After you manually add the opendatahub.io/managed: true annotation to the YAML file for a component deployment, manually removing the annotation or setting its value to false might cause unexpected cluster issues.

To remove the annotation from a deployment, use the steps described in Re-enabling component resource customization.

Prerequisites

You have cluster administrator privileges for your OpenShift cluster.

Procedure

Log in to the OpenShift console as a cluster administrator.
In the Administrator perspective, click Workloads → Deployments.
From the Project drop-down list, select redhat-ods-applications.
In the Name column, click the name of the deployment for the component to which you want to add the annotation.
Note
For more information about the deployment names for each component, see Overview of component resource customization.
On the Deployment details page that appears, click the YAML tab.
Find the metadata.annotations: section.
Add the opendatahub.io/managed: true annotation.
```
metadata:
  annotations:
    opendatahub.io/managed: "true"  # annotation values must be strings, so quote the value
```
Click Save.
Click Reload.

Verification

The opendatahub.io/managed: true annotation appears in the YAML file for the component deployment.

6.4. Re-enabling component resource customization

You can re-enable customization of component deployment resources after manually disabling it.

Important

After you add the opendatahub.io/managed: true annotation to the YAML file for a component deployment, manually removing the annotation or setting its value to false might cause unexpected cluster issues.

To remove the annotation from a deployment, use the following steps to delete the deployment. The controller pod for the deployment will automatically redeploy with the default settings.

Prerequisites

You have cluster administrator privileges for your OpenShift cluster.

Procedure

Log in to the OpenShift console as a cluster administrator.
In the Administrator perspective, click Workloads → Deployments.
From the Project drop-down list, select redhat-ods-applications.
In the Name column, click the name of the deployment for the component for which you want to remove the annotation.
Click the Options menu (⋮).
Click Delete Deployment.

Verification

The controller pod for the deployment automatically redeploys with the default settings.

Chapter 7. Enabling accelerators

7.1. Enabling NVIDIA GPUs

Before you can use NVIDIA GPUs in OpenShift AI, you must install the NVIDIA GPU Operator.

Important

If you are using OpenShift AI in a disconnected self-managed environment, see Enabling NVIDIA GPUs instead.

Prerequisites

You have logged in to your OpenShift cluster.
You have the cluster-admin role in your OpenShift cluster.
You have installed an NVIDIA GPU and confirmed that it is detected in your environment.

Procedure

To enable GPU support on an OpenShift cluster, follow the instructions in NVIDIA GPU Operator on Red Hat OpenShift Container Platform in the NVIDIA documentation.
Important
After you install the Node Feature Discovery (NFD) Operator, you must create an instance of NodeFeatureDiscovery. In addition, after you install the NVIDIA GPU Operator, you must create a ClusterPolicy and populate it with default values.
Delete the migration-gpu-status ConfigMap.
1. In the OpenShift web console, switch to the Administrator perspective.
2. Set the Project to All Projects or redhat-ods-applications to ensure you can see the appropriate ConfigMap.
3. Search for the migration-gpu-status ConfigMap.
4. Click the action menu (⋮) and select Delete ConfigMap from the list.
  The Delete ConfigMap dialog appears.
5. Inspect the dialog and confirm that you are deleting the correct ConfigMap.
6. Click Delete.
Restart the dashboard replicaset.
1. In the OpenShift web console, switch to the Administrator perspective.
2. Click Workloads → Deployments.
3. Set the Project to All Projects or redhat-ods-applications to ensure you can see the appropriate deployment.
4. Search for the rhods-dashboard deployment.
5. Click the action menu (⋮) and select Restart Rollout from the list.
6. Wait until the Status column indicates that all pods in the rollout have fully restarted.

Verification

The reset migration-gpu-status instance is present on the Instances tab on the AcceleratorProfile custom resource definition (CRD) details page.
From the Administrator perspective, go to the Operators → Installed Operators page. Confirm that the following Operators appear:
- NVIDIA GPU
- Node Feature Discovery (NFD)
- Kernel Module Management (KMM)

The GPU is correctly detected a few minutes after full installation of the Node Feature Discovery (NFD) and NVIDIA GPU Operators. The OpenShift command line interface (CLI) displays the appropriate output for the GPU worker node. For example:

# Expected output when the GPU is detected properly
oc describe node <node name>
...
Capacity:
  cpu:                4
  ephemeral-storage:  313981932Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16076568Ki
  nvidia.com/gpu:     1
  pods:               250
Allocatable:
  cpu:                3920m
  ephemeral-storage:  288292006229
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             12828440Ki
  nvidia.com/gpu:     1
  pods:               250

Note

In OpenShift AI 2.16, Red Hat supports the use of accelerators within the same cluster only. Red Hat does not support remote direct memory access (RDMA) between accelerators, or the use of accelerators across a network, for example, by using technology such as NVIDIA GPUDirect or NVLink.

After installing the NVIDIA GPU Operator, create an accelerator profile as described in Working with accelerator profiles.
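
As a rough illustration of what such a profile might contain, the following sketch defines an accelerator profile for the nvidia.com/gpu resource shown in the node output above. The profile name, display name, and toleration are hypothetical, and the field names reflect the AcceleratorProfile custom resource as commonly used with the dashboard; verify them against the CRD in your cluster before applying anything.

```yaml
# Minimal sketch of an accelerator profile; names and toleration are assumptions
apiVersion: dashboard.opendatahub.io/v1
kind: AcceleratorProfile
metadata:
  name: nvidia-gpu                  # hypothetical profile name
  namespace: redhat-ods-applications
spec:
  displayName: NVIDIA GPU           # label shown to users in the dashboard
  enabled: true
  identifier: nvidia.com/gpu        # resource name reported under node Capacity
  tolerations:                      # only needed if your GPU nodes are tainted
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
```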

7.2. Intel Gaudi AI Accelerator integration

To accelerate your high-performance deep learning models, you can integrate Intel Gaudi AI accelerators into OpenShift AI. This integration enables your data scientists to use Gaudi libraries and software associated with Intel Gaudi AI accelerators through custom-configured workbench instances.

Intel Gaudi AI accelerators offer optimized performance for deep learning workloads, with the latest Gaudi 3 devices providing significant improvements in training speed and energy efficiency. These accelerators are suitable for enterprises running machine learning and AI applications on OpenShift AI.

Before you can enable Intel Gaudi AI accelerators in OpenShift AI, you must complete the following steps:

Install the latest version of the Intel Gaudi AI Accelerator Operator from OperatorHub.
Create and configure a custom workbench image for Intel Gaudi AI accelerators. A prebuilt workbench image for Gaudi accelerators is not included in OpenShift AI.
Manually define and configure an accelerator profile for each Intel Gaudi AI device in your environment.

OpenShift AI supports Intel Gaudi devices up to Intel Gaudi 3. The Intel Gaudi 3 accelerators, in particular, offer the following benefits:

Improved training throughput: Reduce the time required to train large models by using advanced tensor processing cores and increased memory bandwidth.
Energy efficiency: Lower power consumption while maintaining high performance, reducing operational costs for large-scale deployments.
Scalable architecture: Scale across multiple nodes for distributed training configurations.

Your OpenShift platform must support EC2 DL1 instances to use Intel Gaudi AI accelerators in an Amazon EC2 DL1 instance. You can use Intel Gaudi AI accelerators in workbench instances or model serving after you enable the accelerators, create a custom workbench image, and configure the accelerator profile.

To identify the Intel Gaudi AI accelerators present in your deployment, use the lspci utility. For more information, see lspci(8) - Linux man page.

Important

The presence of Intel Gaudi AI accelerators in your deployment, as indicated by the lspci utility, does not guarantee that the devices are ready to use. You must ensure that all installation and configuration steps are completed successfully.

7.2.1. Enabling Intel Gaudi AI accelerators

Before you can use Intel Gaudi AI accelerators in OpenShift AI, you must install the required dependencies, deploy the Intel Gaudi AI Accelerator Operator, and configure the environment.

Prerequisites

You have logged in to OpenShift.
You have the cluster-admin role in OpenShift.
You have installed your Intel Gaudi accelerator and confirmed that it is detected in your environment.
Your OpenShift environment supports EC2 DL1 instances if you are running on Amazon Web Services (AWS).
You have installed the OpenShift command-line interface (CLI).

Procedure

Install the latest version of the Intel Gaudi AI Accelerator Operator, as described in Intel Gaudi AI Operator OpenShift installation.
By default, OpenShift sets a per-pod PID limit of 4096. If your workload requires more processing power, such as when you use multiple Gaudi accelerators or when using vLLM with Ray, you must manually increase the per-pod PID limit to avoid Resource temporarily unavailable errors. These errors occur due to PID exhaustion. Red Hat recommends setting this limit to 32768, although values over 20000 are sufficient.
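1. Run the following command to label the machine config pool that contains the node. The machineConfigPoolSelector field in the KubeletConfig resource matches labels on MachineConfigPool objects, not on nodes:
```
oc label machineconfigpool <machine_config_pool_name> custom-kubelet=set-pod-pid-limit-kubelet
```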
2. Optional: To prevent workload distribution on the affected node, you can mark the node as unschedulable and then drain it in preparation for maintenance. For more information, see Understanding how to evacuate pods on nodes.
3. Create a KubeletConfig resource file named custom-kubelet-pidslimit.yaml.
4. Populate the file with the following YAML code. Set the podPidsLimit value to 32768:
```
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-kubelet-pidslimit
spec:
  kubeletConfig:
    podPidsLimit: 32768
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: set-pod-pid-limit-kubelet
```
5. Apply the configuration:
```
oc apply -f custom-kubelet-pidslimit.yaml
```
  This operation causes the node to reboot. For more information, see Understanding node rebooting.
6. Optional: If you previously marked the node as unschedulable, you can allow scheduling again after the node reboots.
Create a custom workbench image for Intel Gaudi AI accelerators, as described in Creating custom workbench images.
After installing the Intel Gaudi AI Accelerator Operator, create an accelerator profile, as described in Working with accelerator profiles.

Verification

From the Administrator perspective, go to the Operators → Installed Operators page. Confirm that the following Operators appear:

Intel Gaudi AI Accelerator
Node Feature Discovery (NFD)
Kernel Module Management (KMM)

Chapter 8. Managing distributed workloads

8.1. Overview of Kueue resources

Cluster administrators can configure Kueue objects (such as resource flavors, cluster queues, and local queues) to manage distributed workload resources across multiple nodes in an OpenShift cluster.

Note

In OpenShift AI 2.16, Red Hat does not support shared cohorts.

8.1.1. Resource flavor

The Kueue ResourceFlavor object describes the resource variations that are available in a cluster.

Resources in a cluster can be homogeneous or heterogeneous:

Homogeneous resources are identical across the cluster: same node type, CPUs, memory, accelerators, and so on.
Heterogeneous resources have variations across the cluster.

If a cluster has homogeneous resources, or if it is not necessary to manage separate quotas for different flavors of a resource, a cluster administrator can create an empty ResourceFlavor object named default-flavor, without any labels or taints, as follows:

Empty Kueue resource flavor for homogeneous resources

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor

If a cluster has heterogeneous resources, cluster administrators can define a different resource flavor for each variation in the resources available. Example variations include different CPUs, different memory, or different accelerators. If a cluster has multiple types of accelerator, cluster administrators can set up a resource flavor for each accelerator type. Cluster administrators can then associate the resource flavors with cluster nodes by using labels, taints, and tolerations, as shown in the following example.

Example Kueue resource flavor for heterogeneous resources

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "spot"
spec:
  nodeLabels:
    instance-type: spot
  nodeTaints:
  - effect: NoSchedule
    key: spot
    value: "true"
  tolerations:
  - key: "spot-taint"
    operator: "Exists"
    effect: "NoSchedule"

Make sure that each resource flavor has the correct label selectors and taint tolerations so that workloads run on the expected nodes.
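
To make the matching between taints and tolerations concrete, the following Python sketch applies a simplified version of the Kubernetes matching rules to a single taint and toleration pair. It is illustrative only: the Taint and Toleration classes and the tolerates function are hypothetical helpers, not part of Kueue or any OpenShift API, and the sketch does not cover every case that the scheduler handles.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # an empty key is valid only with operator "Exists"
    operator: str = "Equal"  # Kubernetes treats "Equal" as the default operator
    value: str = ""
    effect: str = ""         # an empty effect matches every taint effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint (simplified rules)."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if not toleration.key and toleration.operator != "Exists":
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    return toleration.operator == "Equal" and toleration.value == taint.value


spot = Taint(key="spot", effect="NoSchedule", value="true")
# Same key and effect: the taint is tolerated.
print(tolerates(Toleration(key="spot", operator="Exists", effect="NoSchedule"), spot))
# Different key: the taint is not tolerated.
print(tolerates(Toleration(key="gpu", operator="Exists", effect="NoSchedule"), spot))
```

A check like this can help you spot a key or effect that is spelled differently in a taint and in the toleration that is meant to match it.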

See the example configurations provided in Example Kueue resource configurations.

For more information about configuring resource flavors, see Resource Flavor in the Kueue documentation.

8.1.2. Cluster queue

The Kueue ClusterQueue object manages a pool of cluster resources such as pods, CPUs, memory, and accelerators. A cluster can have multiple cluster queues, and each cluster queue can reference multiple resource flavors.

Cluster administrators can configure a cluster queue to define the resource flavors that the queue manages, and assign a quota for each resource in each resource flavor.

The following example configures a cluster queue to assign a quota of 9 CPUs, 36 GiB memory, 5 pods, and 5 NVIDIA GPUs.

Example cluster queue

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "pods", "nvidia.com/gpu"]
    flavors:
    - name: "default-flavor"
      resources:
      - name: "cpu"
        nominalQuota: 9
      - name: "memory"
        nominalQuota: 36Gi
      - name: "pods"
        nominalQuota: 5
      - name: "nvidia.com/gpu"
        nominalQuota: '5'

A cluster administrator should notify the consumers of a cluster queue about the quota limits for that cluster queue. The cluster queue starts a distributed workload only if the total required resources are within these quota limits. If the sum of the requests for a resource in a distributed workload is greater than the specified quota for that resource in the cluster queue, the cluster queue does not admit the distributed workload.
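
The admission rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Kueue implementation: it sums the requests of all pods in one workload and compares each total with the nominal quota, and it ignores resources that are already in use by other admitted workloads. The function names parse_quantity and exceeded_resources are hypothetical, and only a subset of the Kubernetes quantity suffixes is handled.

```python
from decimal import Decimal

# Subset of the Kubernetes quantity suffixes; enough for CPU and memory values.
SUFFIXES = {
    "Ki": Decimal(2) ** 10,
    "Mi": Decimal(2) ** 20,
    "Gi": Decimal(2) ** 30,
    "Ti": Decimal(2) ** 40,
    "m": Decimal("0.001"),  # millicores, for example "500m" CPU
}


def parse_quantity(value):
    """Convert a quantity such as 9, "500m", or "36Gi" to a Decimal."""
    text = str(value).strip()
    for suffix, factor in SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


def exceeded_resources(pod_requests, nominal_quota):
    """Return {resource: (requested, quota)} for every resource over quota."""
    totals = {}
    if "pods" in nominal_quota:
        totals["pods"] = Decimal(len(pod_requests))
    for requests in pod_requests:
        for name, amount in requests.items():
            totals[name] = totals.get(name, Decimal(0)) + parse_quantity(amount)
    exceeded = {}
    for name, total in totals.items():
        # Assumption: a requested resource without a quota entry has quota 0.
        quota = parse_quantity(nominal_quota.get(name, 0))
        if total > quota:
            exceeded[name] = (total, quota)
    return exceeded


# Quota values from the example cluster queue.
quota = {"cpu": 9, "memory": "36Gi", "pods": 5, "nvidia.com/gpu": "5"}
pod = {"cpu": 2, "memory": "8Gi", "nvidia.com/gpu": 1}
print(exceeded_resources([pod] * 3, quota))          # three pods fit the quota
print(sorted(exceeded_resources([pod] * 6, quota)))  # six pods exceed it
```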

See the example configurations provided in Example Kueue resource configurations.

For more information about configuring cluster queues, see Cluster Queue in the Kueue documentation.

8.1.3. Local queue

The Kueue LocalQueue object groups closely related distributed workloads in a project. Cluster administrators can configure local queues to specify the project name and the associated cluster queue. Each local queue then grants access to the resources that its specified cluster queue manages. A cluster administrator can optionally define one local queue in a project as the default local queue for that project.

When configuring a distributed workload, the user specifies the local queue name. If a cluster administrator configured a default local queue, the user can omit the local queue specification from the distributed workload code.

Kueue allocates the resources for a distributed workload from the cluster queue that is associated with the local queue, if the total requested resources are within the quota limits specified in that cluster queue.

The following example configures a local queue called team-a-queue for the team-a project, and specifies cluster-queue as the associated cluster queue.

Example local queue

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: team-a
  name: team-a-queue
  annotations:
    kueue.x-k8s.io/default-queue: "true"
spec:
  clusterQueue: cluster-queue

In this example, the kueue.x-k8s.io/default-queue: "true" annotation defines this local queue as the default local queue for the team-a project. If a user submits a distributed workload in the team-a project and that distributed workload does not specify a local queue in the cluster configuration, Kueue automatically routes the distributed workload to the team-a-queue local queue. The distributed workload can then access the resources that the cluster-queue cluster queue manages.
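
The lookup that this paragraph describes can be sketched in Python as follows. The sketch is illustrative only and is not the Kueue or CodeFlare SDK implementation; the resolve_local_queue function and the shape of its input (a list of dictionaries that mirror LocalQueue metadata in one project) are assumptions made for this example.

```python
DEFAULT_QUEUE_ANNOTATION = "kueue.x-k8s.io/default-queue"


def resolve_local_queue(local_queues, requested=None):
    """Pick the local queue for a workload in one project.

    local_queues: list of dicts with a "name" key and optional "annotations".
    requested: local queue name from the cluster configuration, if any.
    """
    if requested is not None:
        if requested not in [queue["name"] for queue in local_queues]:
            raise LookupError(f"local queue {requested!r} does not exist in this project")
        return requested
    for queue in local_queues:
        if queue.get("annotations", {}).get(DEFAULT_QUEUE_ANNOTATION) == "true":
            return queue["name"]
    raise LookupError("no default local queue is defined in this project")


team_a_queues = [
    {"name": "team-a-queue", "annotations": {DEFAULT_QUEUE_ANNOTATION: "true"}},
    {"name": "batch-queue"},  # hypothetical second queue without the annotation
]
print(resolve_local_queue(team_a_queues))                 # falls back to the default queue
print(resolve_local_queue(team_a_queues, "batch-queue"))  # explicit name is used as given
```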

For more information about configuring local queues, see Local Queue in the Kueue documentation.

8.2. Example Kueue resource configurations

These examples show how to configure Kueue resource flavors and cluster queues.

Note

In OpenShift AI 2.16, Red Hat does not support shared cohorts.

8.2.1. NVIDIA GPUs without shared cohort

8.2.1.1. NVIDIA RTX A400 GPU resource flavor

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "A400-node"
spec:
  nodeLabels:
    instance-type: nvidia-a400-node
  tolerations:
  - key: "HasGPU"
    operator: "Exists"
    effect: "NoSchedule"

8.2.1.2. NVIDIA RTX A1000 GPU resource flavor

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "A1000-node"
spec:
  nodeLabels:
    instance-type: nvidia-a1000-node
  tolerations:
  - key: "HasGPU"
    operator: "Exists"
    effect: "NoSchedule"

8.2.1.3. NVIDIA RTX A400 GPU cluster queue

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "A400-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "A400-node"
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 2

8.2.1.4. NVIDIA RTX A1000 GPU cluster queue

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "A1000-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "A1000-node"
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 2

8.2.2. NVIDIA GPUs and AMD GPUs without shared cohort

8.2.2.1. AMD GPU resource flavor

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "amd-node"
spec:
  nodeLabels:
    instance-type: amd-node
  tolerations:
  - key: "HasGPU"
    operator: "Exists"
    effect: "NoSchedule"

8.2.2.2. NVIDIA GPU resource flavor

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "nvidia-node"
spec:
  nodeLabels:
    instance-type: nvidia-node
  tolerations:
  - key: "HasGPU"
    operator: "Exists"
    effect: "NoSchedule"

8.2.2.3. AMD GPU cluster queue

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "team-a-amd-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "amd.com/gpu"]
    flavors:
    - name: "amd-node"
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "amd.com/gpu"
        nominalQuota: 2

8.2.2.4. NVIDIA GPU cluster queue

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "team-a-nvidia-queue"
spec:
  namespaceSelector: {} # match all.
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
    flavors:
    - name: "nvidia-node"
      resources:
      - name: "cpu"
        nominalQuota: 16
      - name: "memory"
        nominalQuota: 64Gi
      - name: "nvidia.com/gpu"
        nominalQuota: 2
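
Each flavor and queue pair in these examples follows the same pattern, so the manifests can be generated rather than written by hand. The following Python sketch is one possible approach, not a Red Hat tool: the resource_flavor and cluster_queue helpers are hypothetical names, and the output is JSON, a format that oc apply -f also accepts.

```python
import json

API_VERSION = "kueue.x-k8s.io/v1beta1"


def resource_flavor(name, instance_type, toleration_key="HasGPU"):
    """Build a ResourceFlavor shaped like the examples in this section."""
    return {
        "apiVersion": API_VERSION,
        "kind": "ResourceFlavor",
        "metadata": {"name": name},
        "spec": {
            "nodeLabels": {"instance-type": instance_type},
            "tolerations": [
                {"key": toleration_key, "operator": "Exists", "effect": "NoSchedule"}
            ],
        },
    }


def cluster_queue(name, flavor_name, quotas):
    """Build a ClusterQueue with one resource group and one flavor.

    quotas maps each resource name to its nominal quota. The same mapping
    feeds coveredResources and the resources list, so the two stay in step.
    """
    return {
        "apiVersion": API_VERSION,
        "kind": "ClusterQueue",
        "metadata": {"name": name},
        "spec": {
            "namespaceSelector": {},  # match all
            "resourceGroups": [
                {
                    "coveredResources": list(quotas),
                    "flavors": [
                        {
                            "name": flavor_name,
                            "resources": [
                                {"name": resource, "nominalQuota": quota}
                                for resource, quota in quotas.items()
                            ],
                        }
                    ],
                }
            ],
        },
    }


amd_quotas = {"cpu": 16, "memory": "64Gi", "amd.com/gpu": 2}
print(json.dumps(resource_flavor("amd-node", "amd-node"), indent=2))
print(json.dumps(cluster_queue("team-a-amd-queue", "amd-node", amd_quotas), indent=2))
```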

8.2.3. Additional resources

Resource Flavor in the Kueue documentation
Cluster Queue in the Kueue documentation

8.3. Configuring quota management for distributed workloads

Configure quotas for distributed workloads on a cluster, so that you can share resources between several data science projects.

Prerequisites

You have logged in to OpenShift with the cluster-admin role.
You have downloaded and installed the OpenShift command-line interface (CLI). See Installing the OpenShift CLI.
You have installed the required distributed workloads components as described in Installing the distributed workloads components (for disconnected environments, see Installing the distributed workloads components).
You have created a data science project that contains a workbench, and the workbench is running a default notebook image that contains the CodeFlare SDK, for example, the Standard Data Science notebook. For information about how to create a project, see Creating a data science project.
You have sufficient resources. In addition to the base OpenShift AI resources, you need 1.6 vCPU and 2 GiB memory to deploy the distributed workloads infrastructure.
The resources are physically available in the cluster.
Note
In OpenShift AI 2.16, Red Hat supports only a single cluster queue per cluster (that is, homogeneous clusters), and only empty resource flavors. For more information about Kueue resources, see Overview of Kueue resources.
If you want to use graphics processing units (GPUs), you have enabled GPU support in OpenShift AI. If you use NVIDIA GPUs, see Enabling NVIDIA GPUs. If you use AMD GPUs, see AMD GPU integration.
Note
In OpenShift AI 2.16, Red Hat supports only NVIDIA and AMD GPU accelerators for distributed workloads.

Procedure

In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:
```
$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
```
Create an empty Kueue resource flavor, as follows:
1. Create a file called default_flavor.yaml and populate it with the following content:
  Empty Kueue resource flavor

  apiVersion: kueue.x-k8s.io/v1beta1
  kind: ResourceFlavor
  metadata:
    name: default-flavor
2. Apply the configuration to create the default-flavor object:
```
$ oc apply -f default_flavor.yaml
```
Create a cluster queue to manage the empty Kueue resource flavor, as follows:
1. Create a file called cluster_queue.yaml and populate it with the following content:
  Example cluster queue

  apiVersion: kueue.x-k8s.io/v1beta1
  kind: ClusterQueue
  metadata:
    name: "cluster-queue"
  spec:
    namespaceSelector: {} # match all.
    resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"] # If you use AMD GPUs, substitute "nvidia.com/gpu" with "amd.com/gpu"
      flavors:
      - name: "default-flavor"
        resources:
        - name: "cpu"
          nominalQuota: 9
        - name: "memory"
          nominalQuota: 36Gi
        - name: "nvidia.com/gpu" # If you use AMD GPUs, substitute "nvidia.com/gpu" with "amd.com/gpu"
          nominalQuota: 5
2. Replace the example quota values (9 CPUs, 36 GiB memory, and 5 NVIDIA GPUs) with the appropriate values for your cluster queue. The cluster queue will start a distributed workload only if the total required resources are within these quota limits.
  You must specify a quota for each resource that the user can request, even if the requested value is 0, by updating the spec.resourceGroups section as follows:
  - Include the resource name in the coveredResources list.
  - Specify the resource name and nominalQuota in the flavors.resources section, even if the nominalQuota value is 0.
3. Apply the configuration to create the cluster-queue object:
```
$ oc apply -f cluster_queue.yaml
```
Create a local queue that points to your cluster queue, as follows:
1. Create a file called local_queue.yaml and populate it with the following content:
  Example local queue

  apiVersion: kueue.x-k8s.io/v1beta1
  kind: LocalQueue
  metadata:
    namespace: test
    name: local-queue-test
    annotations:
      kueue.x-k8s.io/default-queue: 'true'
  spec:
    clusterQueue: cluster-queue
  The kueue.x-k8s.io/default-queue: 'true' annotation defines this queue as the default queue. Distributed workloads are submitted to this queue if no local_queue value is specified in the ClusterConfiguration section of the data science pipeline or Jupyter notebook or Microsoft Visual Studio Code file.
2. Update the namespace value to specify the same namespace as in the ClusterConfiguration section that creates the Ray cluster.
3. Optional: Update the name value accordingly.
4. Apply the configuration to create the local-queue object:
```
$ oc apply -f local_queue.yaml
```
  The cluster queue allocates the resources to run distributed workloads in the local queue.
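
The rule from the cluster queue step, that every resource in the coveredResources list also needs a nominalQuota entry in each flavor, can be checked before you apply the file. The following Python sketch assumes that the cluster queue definition has already been loaded into a dictionary; the missing_quotas function is a hypothetical helper, not part of Kueue.

```python
def missing_quotas(cluster_queue):
    """Return (flavor, resource) pairs that are covered but lack a nominalQuota."""
    missing = []
    for group in cluster_queue["spec"]["resourceGroups"]:
        covered = group.get("coveredResources", [])
        for flavor in group.get("flavors", []):
            defined = {
                resource["name"]
                for resource in flavor.get("resources", [])
                if "nominalQuota" in resource  # a quota of 0 still counts as defined
            }
            missing += [(flavor["name"], name) for name in covered if name not in defined]
    return missing


queue = {
    "spec": {
        "resourceGroups": [
            {
                "coveredResources": ["cpu", "memory", "nvidia.com/gpu"],
                "flavors": [
                    {
                        "name": "default-flavor",
                        "resources": [
                            {"name": "cpu", "nominalQuota": 9},
                            {"name": "memory", "nominalQuota": "36Gi"},
                            # The GPU entry is deliberately left out here.
                        ],
                    }
                ],
            }
        ]
    }
}
print(missing_quotas(queue))  # reports the GPU resource for default-flavor
```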

Verification

Check the status of the local queue in a project, as follows:

$ oc get -n <project-name> localqueues

Additional resources

Kueue documentation

8.4. Configuring the CodeFlare Operator

If you want to change the default configuration of the CodeFlare Operator for distributed workloads in OpenShift AI, you can edit the associated config map.

Prerequisites

You have logged in to OpenShift with the cluster-admin role.
You have installed the required distributed workloads components as described in Installing the distributed workloads components (for disconnected environments, see Installing the distributed workloads components).

Procedure

In the OpenShift console, click Workloads → ConfigMaps.
From the Project list, select redhat-ods-applications.
Search for the codeflare-operator-config config map, and click the config map name to open the ConfigMap details page.
Click the YAML tab to show the config map specifications.
In the data:config.yaml:kuberay section, you can edit the following entries:
ingressDomain
This configuration option is empty (ingressDomain: "") by default. Do not change this option unless the Ingress Controller is not running on OpenShift. OpenShift AI uses this value to generate the dashboard and client routes for every Ray Cluster, as shown in the following examples:
Example dashboard and client routes

ray-dashboard-<clustername>-<namespace>.<your.ingress.domain>
ray-client-<clustername>-<namespace>.<your.ingress.domain>
mTLSEnabled
This configuration option is enabled (mTLSEnabled: true) by default. When this option is enabled, the Ray Cluster pods create certificates that are used for mutual Transport Layer Security (mTLS), a form of mutual authentication, between Ray Cluster nodes. When this option is enabled, Ray clients cannot connect to the Ray head node unless they download the generated certificates from the ca-secret-<cluster_name> secret, generate the necessary certificates for mTLS communication, and then set the required Ray environment variables. Users must then re-initialize the Ray clients to apply the changes. The CodeFlare SDK provides the following functions to simplify the authentication process for Ray clients:
Example Ray client authentication code

from codeflare_sdk import generate_cert
generate_cert.generate_tls_cert(cluster.config.name, cluster.config.namespace)
generate_cert.export_env(cluster.config.name, cluster.config.namespace)
ray.init(cluster.cluster_uri())
rayDashboardOAuthEnabled
This configuration option is enabled (rayDashboardOAuthEnabled: true) by default. When this option is enabled, OpenShift AI places an OpenShift OAuth proxy in front of the Ray Cluster head node. Users must then authenticate by using their OpenShift cluster login credentials when accessing the Ray Dashboard through the browser. If users want to access the Ray Dashboard in another way (for example, by using the Ray JobSubmissionClient class), they must set an authorization header as part of their request, as shown in the following example:
Example authorization header

{Authorization: "Bearer <your-openshift-token>"}
To save your changes, click Save.
To apply your changes, delete the pod:
1. Click Workloads → Pods.
2. Find the codeflare-operator-manager-<pod-id> pod.
3. Click the options menu (⋮) for that pod, and then click Delete Pod. The pod restarts with your changes applied.
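
The route pattern for the ingressDomain option and the authorization header for the rayDashboardOAuthEnabled option can both be expressed as small Python helpers. This is a sketch for illustration only: the function names ray_route_hosts and auth_headers are hypothetical, and the cluster name, project, domain, and token in the usage lines are placeholder values.

```python
def ray_route_hosts(cluster_name, namespace, ingress_domain):
    """Build the dashboard and client route hosts from the documented pattern."""
    suffix = f"{cluster_name}-{namespace}.{ingress_domain}"
    return {
        "dashboard": f"ray-dashboard-{suffix}",
        "client": f"ray-client-{suffix}",
    }


def auth_headers(token):
    """Build the authorization header for requests that bypass the browser login."""
    return {"Authorization": f"Bearer {token}"}


hosts = ray_route_hosts("raytest", "team-a", "apps.example.com")
print(hosts["dashboard"])
print(hosts["client"])
print(auth_headers("<your-openshift-token>"))
```

The headers dictionary can then be supplied to the client that sends the request, for example when constructing the Ray JobSubmissionClient class, assuming that your Ray version accepts a headers argument.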

Verification

Check the status of the codeflare-operator-manager pod, as follows:

In the OpenShift console, click Workloads → Deployments.
Search for the codeflare-operator-manager deployment, and then click the deployment name to open the deployment details page.
Click the Pods tab. When the status of the codeflare-operator-manager-<pod-id> pod is Running, the pod is ready to use. To see more information about the pod, click the pod name to open the pod details page, and then click the Logs tab.

8.5. Troubleshooting common problems with distributed workloads for administrators

If your users are experiencing errors in Red Hat OpenShift AI relating to distributed workloads, read this section to understand what could be causing the problem, and how to resolve the problem.

If the problem is not documented here or in the release notes, contact Red Hat Support.

8.5.1. A user’s Ray cluster is in a suspended state

Problem

The resource quota specified in the cluster queue configuration might be insufficient, or the resource flavor might not yet be created.

Diagnosis

The user’s Ray cluster head pod or worker pods remain in a suspended state. Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for the suspended state, as shown in the following example:

status:
 conditions:
   - lastTransitionTime: '2024-05-29T13:05:09Z'
     message: 'couldn''t assign flavors to pod set small-group-jobtest12: insufficient quota for nvidia.com/gpu in flavor default-flavor in ClusterQueue'

Resolution

Check whether the resource flavor is created, as follows:
1. In the OpenShift console, select the user’s project from the Project list.
2. Click Home → Search, and from the Resources list, select ResourceFlavor.
3. If necessary, create the resource flavor.
Check the cluster queue configuration in the user’s code, to ensure that the resources that they requested are within the limits defined for the project.
If necessary, increase the resource quota.

For information about configuring resource flavors and quotas, see Configuring quota management for distributed workloads.
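
If you export the Workload resource as JSON, you can extract the status.conditions.message values with a few lines of Python. The following sketch is illustrative only: the condition_messages and quota_messages functions are hypothetical helpers, the sample dictionary reuses the message from the diagnosis example, and the "insufficient quota" text match is an assumption based on that single message.

```python
def condition_messages(workload):
    """Return the non-empty status.conditions[].message values of a Workload."""
    conditions = workload.get("status", {}).get("conditions", [])
    return [condition["message"] for condition in conditions if condition.get("message")]


def quota_messages(workload):
    """Return only the messages that report an insufficient quota."""
    return [message for message in condition_messages(workload) if "insufficient quota" in message]


sample_workload = {
    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2024-05-29T13:05:09Z",
                "message": (
                    "couldn't assign flavors to pod set small-group-jobtest12: "
                    "insufficient quota for nvidia.com/gpu in flavor default-flavor "
                    "in ClusterQueue"
                ),
            }
        ]
    }
}

for message in quota_messages(sample_workload):
    print(message)
```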

8.5.2. A user’s Ray cluster is in a failed state

Problem

The user might have insufficient resources.

Diagnosis

The user’s Ray cluster head pod or worker pods are not running. When a Ray cluster is created, it initially enters a failed state. This failed state usually resolves after the reconciliation process completes and the Ray cluster pods are running.

Resolution

If the failed state persists, complete the following steps:

In the OpenShift console, select the user’s project from the Project list.
Click Workloads → Pods.
Click the user’s pod name to open the pod details page.
Click the Events tab, and review the pod events to identify the cause of the problem.
Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for the failed state.

8.5.3. A user receives a failed to call webhook error message for the CodeFlare Operator

Problem

After the user runs the cluster.up() command, the following error is shown:

ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.ray.openshift.ai\": failed to call webhook: Post \"https://codeflare-operator-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"codeflare-operator-webhook-service\""}]},"code":500}

Diagnosis

The CodeFlare Operator pod might not be running.

Resolution

In the OpenShift console, select the user’s project from the Project list.
Click Workloads → Pods.
Verify that the CodeFlare Operator pod is running. If necessary, restart the CodeFlare Operator pod.
Review the logs for the CodeFlare Operator pod to verify that the webhook server is serving, as shown in the following example:
```
INFO	controller-runtime.webhook	  Serving webhook server	{"host": "", "port": 9443}
```

8.5.4. A user receives a failed to call webhook error message for Kueue

Problem

After the user runs the cluster.up() command, the following error is shown:

ApiException: (500)
Reason: Internal Server Error
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\"","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"mraycluster.kb.io\": failed to call webhook: Post \"https://kueue-webhook-service.redhat-ods-applications.svc:443/mutate-ray-io-v1-raycluster?timeout=10s\": no endpoints available for service \"kueue-webhook-service\""}]},"code":500}

Diagnosis

The Kueue pod might not be running.

Resolution

In the OpenShift console, select the user’s project from the Project list.
Click Workloads → Pods.
Verify that the Kueue pod is running. If necessary, restart the Kueue pod.

Review the logs for the Kueue pod to verify that the webhook server is serving, as shown in the following example:

{"level":"info","ts":"2024-06-24T14:36:24.255137871Z","logger":"controller-runtime.webhook","caller":"webhook/server.go:242","msg":"Serving webhook server","host":"","port":9443}

8.5.5. A user’s Ray cluster does not start

Problem

After the user runs the cluster.up() command, when they run either the cluster.details() command or the cluster.status() command, the Ray cluster status remains as Starting instead of changing to Ready. No pods are created.

Diagnosis

Check the status of the Workloads resource that is created with the RayCluster resource. The status.conditions.message field provides the reason for remaining in the Starting state. Similarly, check the status.conditions.message field for the RayCluster resource.

Resolution

In the OpenShift console, select the user’s project from the Project list.
Click Workloads → Pods.
Verify that the KubeRay pod is running. If necessary, restart the KubeRay pod.
Review the logs for the KubeRay pod to identify errors.

8.5.6. A user receives a Default Local Queue … not found error message

Problem

After the user runs the cluster.up() command, the following error is shown:

Default Local Queue with kueue.x-k8s.io/default-queue: true annotation not found please create a default Local Queue or provide the local_queue name in Cluster Configuration.

Diagnosis

No default local queue is defined, and a local queue is not specified in the cluster configuration.

Resolution

Check whether a local queue exists in the user’s project, as follows:
1. In the OpenShift console, select the user’s project from the Project list.
2. Click Home → Search, and from the Resources list, select LocalQueue.
3. If no local queues are found, create a local queue.
4. Provide the user with the details of the local queues in their project, and advise them to add a local queue to their cluster configuration.
Define a default local queue.
For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.

8.5.7. A user receives a local_queue provided does not exist error message

Problem

After the user runs the cluster.up() command, the following error is shown:

local_queue provided does not exist or is not in this namespace. Please provide the correct local_queue name in Cluster Configuration.

Diagnosis

An incorrect value is specified for the local queue in the cluster configuration, or an incorrect default local queue is defined. The specified local queue either does not exist, or exists in a different namespace.

Resolution

In the OpenShift console, select the user’s project from the Project list.
Click Home → Search, and from the Resources list, select LocalQueue.
Resolve the problem in one of the following ways:
  - If no local queues are found, create a local queue.
  - If one or more local queues are found, provide the user with the details of the local queues in their project. Advise the user to ensure that they spelled the local queue name correctly in their cluster configuration, and that the namespace value in the cluster configuration matches their project name. If the user does not specify a namespace value in the cluster configuration, the Ray cluster is created in the current project.
Define a default local queue.
For information about creating a local queue and defining a default local queue, see Configuring quota management for distributed workloads.

8.5.8. A user cannot create a Ray cluster or submit jobs

Problem

After the user runs the cluster.up() command, an error similar to the following text is shown:

RuntimeError: Failed to get RayCluster CustomResourceDefinition: (403)
Reason: Forbidden
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rayclusters.ray.io is forbidden: User \"system:serviceaccount:regularuser-project:regularuser-workbench\" cannot list resource \"rayclusters\" in API group \"ray.io\" in the namespace \"regularuser-project\"","reason":"Forbidden","details":{"group":"ray.io","kind":"rayclusters"},"code":403}

Diagnosis

The correct OpenShift login credentials are not specified in the TokenAuthentication section of the user’s notebook code.

Resolution

Advise the user to identify and specify the correct OpenShift login credentials as follows:
1. In the OpenShift console header, click your username and click Copy login command.
2. In the new tab that opens, log in as the user whose credentials you want to use.
3. Click Display Token.
4. From the Log in with this token section, copy the token and server values.
5. Specify the copied token and server values in your notebook code as follows:
```
from codeflare_sdk import TokenAuthentication

auth = TokenAuthentication(
    token = "<token>",
    server = "<server>",
    skip_tls=False
)
auth.login()
```
Verify that the user has the correct permissions and is part of the rhoai-users group.

8.5.9. The user’s pod provisioned by Kueue is terminated before the user’s image is pulled

Problem

Kueue waits for a period of time before marking a workload as ready, to enable all of the workload pods to become provisioned and running. By default, Kueue waits for 5 minutes. If the pod image is very large and is still being pulled after the 5-minute waiting period elapses, Kueue fails the workload and terminates the related pods.

Diagnosis

In the OpenShift console, select the user’s project from the Project list.
Click Workloads → Pods.
Click the user’s pod name to open the pod details page.
Click the Events tab, and review the pod events to check whether the image pull completed successfully.

Resolution

If the pod takes more than 5 minutes to pull the image, resolve the problem in one of the following ways:

Add an OnFailure restart policy for resources that are managed by Kueue.
In the redhat-ods-applications namespace, edit the kueue-manager-config ConfigMap to set a custom timeout for the waitForPodsReady property. For more information about this configuration option, see Enabling waitForPodsReady in the Kueue documentation.

Chapter 9. Backing up data

9.1. Backing up storage data

It is a best practice to back up the data on your persistent volume claims (PVCs) regularly.

Backing up your data is particularly important before you delete a user and before you uninstall OpenShift AI, as all PVCs are deleted when OpenShift AI is uninstalled.

See the documentation for your cluster platform for more information about backing up your PVCs.

Additional resources

Understanding persistent storage

Chapter 10. Viewing logs and audit records

As a cluster administrator, you can use the OpenShift AI Operator logger to monitor and troubleshoot issues. You can also use OpenShift audit records to review a history of changes made to the OpenShift AI Operator configuration.

10.1. Configuring the OpenShift AI Operator logger

You can change the log level for OpenShift AI Operator components by setting the .spec.devFlags.logmode flag for the DSCInitialization (DSCI) custom resource during runtime. If you do not set a logmode value, the logger uses the INFO log level by default.

The log level that you set with .spec.devFlags.logmode applies to all components, not just those in a Managed state.

The following table shows the available log levels:

Log level	Stacktrace level	Verbosity	Output	Timestamp type
`devel` or `development`	WARN	INFO	Console	Epoch timestamps
`""` (or no `logmode` value set)	ERROR	INFO	JSON	Human-readable timestamps
`prod` or `production`	ERROR	INFO	JSON	Human-readable timestamps

Logs that are set to devel or development are generated in a plain-text console format. Logs that are set to prod or production, or that have no level set, are generated in JSON format.

Prerequisites

You have admin access to the DSCInitialization resources in the OpenShift cluster.
You installed the OpenShift command line interface (oc) as described in Installing the OpenShift CLI.

Procedure

Log in to OpenShift as a cluster administrator.
Click Operators → Installed Operators and then click the Red Hat OpenShift AI Operator.
Click the DSC Initialization tab.
Click the default-dsci object.
Click the YAML tab.

In the spec section, update the .spec.devFlags.logmode flag with the log level that you want to set.

apiVersion: dscinitialization.opendatahub.io/v1
kind: DSCInitialization
metadata:
  name: default-dsci
spec:
  devFlags:
    logmode: development

Click Save.

You can also configure the log level from the OpenShift CLI by using the following command with the logmode value set to the log level that you want.

oc patch dsci default-dsci -p '{"spec":{"devFlags":{"logmode":"development"}}}' --type=merge

Verification

If you set the component log level to devel or development, logs generate more frequently and include logs at WARN level and above.
If you set the component log level to prod or production, or do not set a log level, logs generate less frequently and include logs at ERROR level or above.
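
If you script the log level change, you can validate the logmode value and build the merge patch before passing it to the oc patch command. The following Python sketch mirrors the log level table and the patch shown in the procedure; the LOG_MODES mapping and the logmode_patch function are hypothetical names used only for this illustration.

```python
import json

# Settings taken from the log level table in this section.
DEVEL = {"stacktrace": "WARN", "verbosity": "INFO", "output": "Console", "timestamps": "Epoch"}
PROD = {"stacktrace": "ERROR", "verbosity": "INFO", "output": "JSON", "timestamps": "Human-readable"}

LOG_MODES = {
    "devel": DEVEL,
    "development": DEVEL,
    "": PROD,  # no logmode value set
    "prod": PROD,
    "production": PROD,
}


def logmode_patch(logmode):
    """Return the JSON merge patch body for the given logmode value."""
    if logmode not in LOG_MODES:
        raise ValueError(f"unsupported logmode: {logmode!r}")
    patch = {"spec": {"devFlags": {"logmode": logmode}}}
    return json.dumps(patch, separators=(",", ":"))


print(logmode_patch("development"))
print(LOG_MODES["development"]["output"])
```

The string that logmode_patch returns can be used as the value of the -p option in the oc patch command shown in the procedure.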

10.1.1. Viewing the OpenShift AI Operator log

Run the following command:

oc get pods -l name=rhods-operator -o name -n redhat-ods-operator |  xargs -I {} oc logs -f {} -n redhat-ods-operator

The operator pod log opens.

You can also view the operator pod log in the OpenShift console, under Workloads → Deployments → Pods → redhat-ods-operator → Logs.

10.2. Viewing audit records

Cluster administrators can use OpenShift auditing to see changes made to the OpenShift AI Operator configuration by reviewing modifications to the DataScienceCluster (DSC) and DSCInitialization (DSCI) custom resources. Audit logging is enabled by default in standard OpenShift cluster configurations. For more information, see Viewing audit logs in the OpenShift documentation.

Note

In Red Hat OpenShift Service on Amazon Web Services with hosted control planes (ROSA HCP), audit logging is disabled by default because the Elasticsearch log store does not provide secure storage for audit logs. To send the audit logs to Amazon CloudWatch, see Forwarding logs to Amazon CloudWatch.

The following example shows how to use the OpenShift audit logs to see the history of changes that users made to the DSC and DSCI custom resources.

Prerequisites

You have cluster administrator privileges for your OpenShift cluster.
You installed the OpenShift command line interface (oc) as described in Installing the OpenShift CLI.

Procedure

In a terminal window, if you are not already logged in to your OpenShift cluster as a cluster administrator, log in to the OpenShift CLI as shown in the following example:
```
$ oc login <openshift_cluster_url> -u <admin_username> -p <password>
```
To access the full content of the changed custom resources, set the OpenShift audit log policy to WriteRequestBodies or a more comprehensive profile. For more information, see About audit log policy profiles.

Fetch the audit log files that are available for the relevant control plane nodes. For example:

oc adm node-logs --role=master --path=kube-apiserver/ \
  | awk '{ print $1 }' | sort -u \
  | while read node ; do
      oc adm node-logs $node --path=kube-apiserver/audit.log < /dev/null
    done \
  | grep opendatahub > /tmp/kube-apiserver-audit-opendatahub.log

Search the files for the DSC and DSCI custom resources. For example:

jq 'select((.objectRef.apiGroup == "dscinitialization.opendatahub.io"
                or .objectRef.apiGroup == "datasciencecluster.opendatahub.io")
              and .user.username != "system:serviceaccount:redhat-ods-operator:redhat-ods-operator-controller-manager"
              and .verb != "get" and .verb != "watch" and .verb != "list")' < /tmp/kube-apiserver-audit-opendatahub.log
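
The audit policy step in the procedure above can also be performed from the CLI. The following sketch assumes that the profile is set in the spec.audit.profile field of the APIServer resource named cluster, as described in About audit log policy profiles. The set_audit_profile function name is hypothetical, and the default argument is the WriteRequestBodies profile named in the procedure.

```shell
#!/usr/bin/env bash
# Hypothetical helper: select an audit log policy profile on the cluster
# APIServer resource. Assumes the profile lives at .spec.audit.profile.
set_audit_profile() {
  local profile="${1:-WriteRequestBodies}"
  oc patch apiserver cluster --type=merge \
    -p "{\"spec\":{\"audit\":{\"profile\":\"${profile}\"}}}"
}

# Example usage (requires oc and cluster administrator privileges):
#   set_audit_profile WriteRequestBodies
```

Check the audit policy documentation for your OpenShift version before you change the profile, because the available profiles and their effects are defined there and not in this sketch.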

Verification

The commands return relevant log entries.
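
To scan the returned entries more quickly, you can condense the output of the jq search to one line per event. This sketch assumes the standard Kubernetes audit event fields requestReceivedTimestamp, user.username, verb, objectRef.resource, and objectRef.name. The summarize_audit_events function name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read audit events (JSON objects) from stdin and print
# one tab-separated line per event: timestamp, user, verb, resource, name.
# Missing fields are printed as "-".
summarize_audit_events() {
  jq -r '[
      (.requestReceivedTimestamp // "-"),
      (.user.username // "-"),
      (.verb // "-"),
      (.objectRef.resource // "-"),
      (.objectRef.name // "-")
    ] | @tsv'
}

# Example usage: append the helper to the jq search from the procedure:
#   jq 'select(...)' < /tmp/kube-apiserver-audit-opendatahub.log | summarize_audit_events
```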

Tip

To configure the log retention time, see the following resources:

OpenShift 4.14 to 4.16: Configuring log retention time in Elasticsearch or Enabling stream-based retention with Loki
OpenShift 4.17: Enabling stream-based retention with Loki

Legal Notice

The text of and illustrations in this document are licensed by Red Hat under a Creative Commons Attribution–Share Alike 3.0 Unported license ("CC-BY-SA"). An explanation of CC-BY-SA is available at http://creativecommons.org/licenses/by-sa/3.0/. In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must provide the URL for the original version.

Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.

Red Hat, Red Hat Enterprise Linux, the Shadowman logo, the Red Hat logo, JBoss, OpenShift, Fedora, the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries.

Linux® is the registered trademark of Linus Torvalds in the United States and other countries.

Java® is a registered trademark of Oracle and/or its affiliates.

XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries.

MySQL® is a registered trademark of MySQL AB in the United States, the European Union and other countries.

Node.js® is an official trademark of Joyent. Red Hat is not formally related to or endorsed by the official Joyent Node.js open source or commercial project.

The OpenStack® Word Mark and OpenStack logo are either registered trademarks/service marks or trademarks/service marks of the OpenStack Foundation, in the United States and other countries and are used with the OpenStack Foundation's permission. We are not affiliated with, endorsed or sponsored by the OpenStack Foundation, or the OpenStack community.

All other trademarks are the property of their respective owners.
