Working with MLflow
Work with MLflow from Red Hat OpenShift AI Self-Managed
Abstract
Chapter 1. About MLflow
Red Hat OpenShift AI deploys a single shared MLflow instance through the MLflow operator. With this deployment, you can use MLflow workspaces and Kubernetes-backed authorization. For more information, see MLflow Workspaces.
A project is a namespace that has its own set of objects, policies, constraints, and service accounts. Each Red Hat OpenShift Container Platform project maps to an MLflow workspace in a one-to-one relationship. MLflow provides logical isolation of experiments, runs, registered models, prompts, datasets, and traces per workspace.
OpenShift AI manages the workspace lifecycle outside of MLflow. The MLflow API does not create, update, or delete workspaces. Instead, you manage workspaces through the OpenShift AI dashboard or the Red Hat OpenShift command-line interface (CLI). When you manage projects through these tools, the corresponding MLflow workspace becomes available automatically.
Every MLflow API request relies on Kubernetes role-based access control (RBAC) for authorization. The MLflow server submits a SelfSubjectAccessReview against pseudo-resources in the mlflow.kubeflow.org API group, using the caller’s bearer token and the target project namespace. Users who already have the Red Hat OpenShift admin, edit, or view roles on a project automatically receive the corresponding MLflow permissions.
Chapter 2. Install and configure MLflow
To track machine learning experiments in OpenShift AI, use the Red Hat OpenShift AI Operator to deploy a cluster-scoped mlflow instance.
Prerequisites
- You have cluster administrator privileges for your OpenShift cluster.
- You have installed the OpenShift Container Platform
- You have permission to patch the DataScienceCluster resource and apply MLflow custom resources.
- For production-oriented deployments, you have created the required secrets for database credentials and S3-compatible storage.
Procedure
The MLflow resource is cluster scoped. Therefore, you can install only one instance of mlflow in the cluster. The Red Hat OpenShift AI Operator creates the MLflow resource in the redhat-ods-applications namespace.
- Log in to the OpenShift cluster by using the CLI.
- Enable the MLflow Operator component by patching the DataScienceCluster object:
  $ oc patch datasciencecluster default-dsc \
      --type=merge \
      -p '{"spec":{"components":{"mlflowoperator":{"managementState":"Managed"}}}}'
- Create an MLflow custom resource (CR). Choose the configuration that matches your environment:
  - Minimal development or test deployment
    Uses SQLite for the backend store and a persistent volume claim (PVC) for file-based artifact storage.
    apiVersion: mlflow.opendatahub.io/v1
    kind: MLflow
    metadata:
      name: mlflow
      namespace: redhat-ods-applications
    spec:
      storage:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 10Gi
      backendStoreUri: "sqlite:////mlflow/mlflow.db"
      artifactsDestination: "file:///mlflow/artifacts"
      serveArtifacts: true
  - Production-oriented deployment
    Uses PostgreSQL for metadata and S3-compatible object storage for artifacts.
    apiVersion: mlflow.opendatahub.io/v1
    kind: MLflow
    metadata:
      name: mlflow
      namespace: redhat-ods-applications
    spec:
      replicas: 2
      backendStoreUriFrom:
        name: mlflow-db-credentials
        key: backend-store-uri
      artifactsDestination: "s3://my-mlflow-bucket/artifacts"
      serveArtifacts: true
      envFrom:
        - secretRef:
            name: aws_access_key_id=<your_aws_access_key> aws_secret_access_key=<your_aws_secret_access_key> aws_session_token=<your_aws_session_token>
Table 2.1. MLflow Deployment and Artifact Configuration
| Parameter or Feature | Configuration Details |
|---|---|
| replicas | Sets the number of MLflow pods. This is the main scaling setting for production-oriented deployments. |
| registryStoreUri | Defaults to the value of backendStoreUri if not specified. In practice, the same PostgreSQL database is typically used for both. |
| serveArtifacts | When enabled, clients log and retrieve artifacts through the MLflow server REST API by using the mlflow-artifacts:/ proxy URI scheme. If defaultArtifactRoot is not specified while this is enabled, it defaults to mlflow-artifacts:/. |
| defaultArtifactRoot | Defines the directory for storing artifacts for each new experiment. Do not set this to a direct-storage URI (for example, s3://) when serveArtifacts is enabled; doing so causes clients to bypass the proxy and attempt direct storage access. |
| Artifact Storage Path | If no per-project override is configured, artifacts are stored under <default_artifact_root>/workspaces/<project_name>. File-based artifact storage requires serveArtifacts: true. |
Verification
- In the OpenShift AI dashboard navigation bar, click the Applications menu.
- Confirm that the MLflow UI is displayed in the list.
- Open the UI, select your workspace, and verify that you can compare runs and visualize metrics.
Alternatively, click Start Demo, which loads demo data to compare runs and visualize metrics.
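The default artifact location rule in Table 2.1 can be illustrated with a short Python sketch. The helper name default_workspace_artifact_path is hypothetical and is not part of MLflow; it only mirrors the <default_artifact_root>/workspaces/<project_name> pattern and assumes that no per-project override is configured.

```python
def default_workspace_artifact_path(default_artifact_root: str, project_name: str) -> str:
    """Build the per-project artifact path described in Table 2.1.

    Hypothetical helper for illustration only; the MLflow server
    resolves the real path itself.
    """
    if not project_name:
        raise ValueError("project_name must not be empty")
    # Drop a trailing slash so that a root such as "mlflow-artifacts:/"
    # does not produce a doubled separator.
    root = default_artifact_root.rstrip("/")
    return f"{root}/workspaces/{project_name}"


if __name__ == "__main__":
    print(default_workspace_artifact_path("mlflow-artifacts:/", "team-a"))
    print(default_workspace_artifact_path("s3://my-mlflow-bucket/artifacts", "team-a"))
```

The project name team-a and the bucket name are placeholders taken from examples elsewhere in this document.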
Additional resources
- To view all available configuration options for the MLflow custom resource, run the oc explain mlflow.spec command.
2.1. MLflow and MLflowConfig configuration parameters
The MLflow deployment on Red Hat OpenShift AI relies on two custom resources: MLflow and MLflowConfig. These resources configure metadata storage, artifact destinations, and project-specific overrides.
The MLflow resource is cluster-scoped and must be named mlflow. This resource defines the global state of the MLflow deployment, including the database backend, artifact storage, replica count, and TLS configuration.
Table 2.2. Key fields in the MLflow resource
| Field | Description |
|---|---|
| | The number of MLflow pods. The default value is 1. High-availability deployments require remote storage because PVC-backed storage does not support multiple concurrent writers. |
| | The tracking metadata database URI. Supported values include |
| | A secret reference for the backend database URI. This field is used when the URI contains credentials. |
| | Specifies the compute resources for the MLflow container. |
The MLflowConfig resource is namespace-scoped and must be named mlflow. Project owners can use this resource to override default artifact storage settings for their specific projects.
Table 2.3. Key fields in the MLflowConfig resource
| Field | Description |
|---|---|
| | The name of a secret in the project that contains S3-compatible connection credentials and bucket information. The value must be |
| | An optional relative path appended to the bucket root from the secret. For example, if the bucket is |
- To view all available configuration fields for a resource, use the oc explain <resource>.spec command.
2.2. Aggregate cluster roles
MLflow uses aggregate cluster roles to control access to MLflow resources in Red Hat OpenShift AI. These roles determine which MLflow operations a user can perform based on their existing role bindings.
Standard bindings for view, edit, and admin roles do not grant identical access to all MLflow resources. While mlflow resources are cluster-scoped, mlflowconfig resources are namespace-scoped. Effective access depends on whether permissions are granted at the cluster level or the namespace level.
The mlflow-view ClusterRole aggregates into the standard OpenShift AI view, edit, and admin roles. User access depends on the scope of the binding set at the cluster level.
This role grants read-only access to the following MLflow resources:
- get, list, and watch permissions for the mlflow and mlflowconfigs Custom Resource Definitions (CRDs)
- get and list permissions for MLflow pseudo-resources, such as datasets, experiments, and registeredmodels
- get permissions for MLflow status
The mlflow-edit ClusterRole aggregates into the standard OpenShift AI edit and admin roles. User access depends on the scope of the binding set at the cluster level.
This role grants write access to the following MLflow resources:
- create, read, update, and delete permissions for the mlflow CRD
- create, delete, patch, and update permissions for the mlflowconfigs CRD
- create, update, and delete permissions for MLflow pseudo-resources
- update permissions for finalizers
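As a rough illustration of how the two aggregate roles combine, the following Python sketch computes the verbs that a standard role ends up with on MLflow pseudo-resources. The function name is hypothetical, and the verb sets are copied from the lists above; the sketch ignores the CRD and finalizer permissions.

```python
# Verbs on MLflow pseudo-resources granted by each aggregate ClusterRole,
# as listed in this section.
MLFLOW_VIEW_VERBS = {"get", "list"}
MLFLOW_EDIT_VERBS = {"create", "update", "delete"}

# Standard roles that each MLflow ClusterRole aggregates into.
AGGREGATES_INTO = {
    "mlflow-view": {"view", "edit", "admin"},
    "mlflow-edit": {"edit", "admin"},
}


def pseudo_resource_verbs(standard_role: str) -> set[str]:
    """Return the pseudo-resource verbs a standard role inherits (illustrative)."""
    verbs: set[str] = set()
    if standard_role in AGGREGATES_INTO["mlflow-view"]:
        verbs |= MLFLOW_VIEW_VERBS
    if standard_role in AGGREGATES_INTO["mlflow-edit"]:
        verbs |= MLFLOW_EDIT_VERBS
    return verbs


if __name__ == "__main__":
    for role in ("view", "edit", "admin"):
        print(role, sorted(pseudo_resource_verbs(role)))
```

Under these assumptions, a user with only the view role can read experiments but cannot create or delete them, while edit and admin users inherit both sets.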
The mlflow-integration ClusterRole grants service accounts data-plane access to MLflow without requiring full edit or delete privileges. When you bind this role with a RoleBinding, it provides namespace-scoped access. This ClusterRole does not grant delete or access to the control-plane resources.
This role is intended for integration-focused tasks and grants the following access:
- get, list, create, and update permissions for the experiments, datasets, and registeredmodels authorization-plugin pseudo-resources
- Read access to MLflow resources and statuses
- Permissions to interact with the MLflow tracking API and model registry
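A minimal sketch of binding the integration role to a service account in one project follows. It builds a standard Kubernetes RoleBinding manifest as a Python dictionary and prints it as JSON. The function name, the binding name, and the example service account are hypothetical, and the ClusterRole name is passed as a parameter because the installed name on your cluster might differ from the short name used in this section.

```python
import json


def integration_role_binding(namespace: str, service_account: str,
                             cluster_role: str = "mlflow-integration") -> dict:
    """Build a namespace-scoped RoleBinding to the integration ClusterRole.

    The default cluster_role value is an assumption; confirm the installed
    name with `oc get clusterrole` before applying the manifest.
    """
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {
            # Hypothetical naming scheme for the binding.
            "name": f"{service_account}-mlflow-integration",
            "namespace": namespace,
        },
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "ClusterRole",
            "name": cluster_role,
        },
        "subjects": [
            {"kind": "ServiceAccount", "name": service_account, "namespace": namespace}
        ],
    }


if __name__ == "__main__":
    print(json.dumps(integration_role_binding("team-a", "training-job"), indent=2))
```

Because the binding is a RoleBinding rather than a ClusterRoleBinding, the service account receives the role's permissions only in the named project. The printed JSON can be piped to oc apply -f - for review on a test cluster.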
2.3. RBAC model for MLflow API usage
The MLflow server authorizes every API request by using a Kubernetes SelfSubjectAccessReview. The server uses the caller’s bearer token to determine whether the caller has permission to perform a specific verb on an MLflow pseudo-resource within the target project namespace.
The authorization checks use the mlflow.kubeflow.org API group. These pseudo-resources are used solely for role-based access control (RBAC) policy evaluation. Pseudo-resources are not actual Kubernetes resources or Custom Resource Definitions (CRDs); they do not exist as objects on the cluster. You cannot create, list, or inspect pseudo-resources by using the Kubernetes API or command-line tools such as kubectl.
Access is granted through standard Kubernetes Role, ClusterRole, RoleBinding, and ClusterRoleBinding objects.
The workspace name serves as the Kubernetes namespace for access checks. The MLflow API access is project-scoped.
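To make the authorization check more concrete, the following Python sketch builds the kind of SelfSubjectAccessReview body that such a check could use. The apiVersion, kind, and resourceAttributes field names come from the Kubernetes authorization.k8s.io/v1 API. The function name is hypothetical, and the exact request that the MLflow server sends is an assumption.

```python
import json
from typing import Optional


def self_subject_access_review(namespace: str, resource: str, verb: str,
                               name: Optional[str] = None) -> dict:
    """Build a SelfSubjectAccessReview for an MLflow pseudo-resource (illustrative)."""
    attributes = {
        "group": "mlflow.kubeflow.org",  # API group used for the checks
        "resource": resource,            # for example "experiments"
        "verb": verb,                    # for example "get" or "update"
        "namespace": namespace,          # the workspace (project) name
    }
    if name is not None:
        # Optional resource name, for resource-name granular policies.
        attributes["name"] = name
    return {
        "apiVersion": "authorization.k8s.io/v1",
        "kind": "SelfSubjectAccessReview",
        "spec": {"resourceAttributes": attributes},
    }


if __name__ == "__main__":
    review = self_subject_access_review("team-a", "experiments", "update")
    print(json.dumps(review, indent=2))
```

You can submit a review like this yourself with oc create -f - -o yaml and inspect status.allowed to see how your current token would be evaluated for a given project.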
The following table outlines primary pseudo-resources:
Table 2.4. MLflow pseudo-resources
| Pseudo-resource | Controls access to |
|---|---|
| | Experiments, runs, traces, artifacts, logged models, scorers, and related tracking operations. |
| | Registered models, model versions, and prompts. |
| | Evaluation datasets and related dataset operations. |
Role assignments for MLflow API usage
The Red Hat OpenShift AI admin, edit, and view roles include the permissions required for MLflow API authorization checks through aggregate ClusterRoles. These roles allow users to interact with MLflow components within their assigned namespaces.
Workloads or agents that require MLflow API access without using broad human-facing roles can use built-in integration roles. For example, a dedicated mlflow-integration ClusterRole is available for service accounts that need to perform automated tracking or model registry tasks.
Resource-name granularity
The MLflow Kubernetes authorization plugin allows you to assign permissions with resourceName granularity. For example, you can restrict an agent so that it sends traces to only a specific experiment by defining the following attributes in the RBAC policy:
- resourceName: The name of the specific experiment
- resource: experiments
- apiGroup: mlflow.kubeflow.org
- verbs: get, list, update
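The attributes above map onto a standard Kubernetes Role rule. The following Python sketch assembles such a Role as a dictionary and prints it as JSON. The function name, the role name trace-writer, and the experiment name are hypothetical; the rule fields (apiGroups, resources, resourceNames, verbs) are the standard RBAC rule fields.

```python
import json


def experiment_scoped_role(namespace: str, experiment_name: str,
                           role_name: str = "trace-writer") -> dict:
    """Build a Role limited to a single experiment (illustrative sketch)."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": ["mlflow.kubeflow.org"],
                "resources": ["experiments"],
                # Restrict the rule to one named experiment.
                "resourceNames": [experiment_name],
                "verbs": ["get", "list", "update"],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(experiment_scoped_role("team-a", "agent-traces"), indent=2))
```

The Role takes effect only after you bind it to the agent's service account with a RoleBinding in the same project.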
Chapter 3. Install and authenticate the MLflow SDK
Install the MLflow SDK and configure authentication for your OpenShift cluster to track machine learning experiments in Red Hat OpenShift AI.
Prerequisites
- You have access to an OpenShift cluster.
- You have installed the OpenShift CLI (oc) and are able to access MLflow in your OpenShift AI project or MLflow workspace.
- For the automated authentication method, you have a service account with the appropriate permissions or are running locally.
Procedure
- Install the MLflow SDK:
pip install "mlflow[kubernetes]>=3.11"
- Authenticate the SDK by using one of the following methods:
  - To use automated authentication, enable the kubernetes-namespaced client-side authentication plugin.
    Note: This plugin reads credentials from the mounted service account token when running in a pod, or from the active kubeconfig context when running on a workstation. The MLflow workspace is automatically configured by reading either the service account’s namespace or the active kubeconfig namespace.
    Enter the following commands:
    export MLFLOW_TRACKING_URI="https://<dashboard-url>/mlflow"
    export MLFLOW_TRACKING_AUTH=kubernetes-namespaced
  - To authenticate manually, export your tracking token and project workspace as environment variables. Enter the following commands:
    export MLFLOW_TRACKING_URI="https://<dashboard-url>/mlflow"
    export MLFLOW_TRACKING_TOKEN="$(oc whoami --show-token)"
    export MLFLOW_WORKSPACE="<project-name>"
  Important: This configuration example is not recommended for production environments.
- If your OpenShift cluster does not use trusted TLS certificates, enter the following command to disable TLS verification:
  export MLFLOW_TRACKING_INSECURE_TLS=true
Verification
Use the following command to verify connectivity:
python -c "import mlflow; print(mlflow.list_workspaces())"
This command lists the workspaces that you can access on the OpenShift cluster.
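Before you run the connectivity check, it can help to confirm that the environment matches one of the two authentication methods in this procedure. The following Python sketch does that by inspecting the environment variables named above. The function name detect_auth_method is hypothetical and is not part of the MLflow SDK.

```python
import os
from typing import Mapping


def detect_auth_method(env: Mapping[str, str]) -> str:
    """Return "automated" or "manual" based on the variables that are set.

    Raises ValueError when the configuration is incomplete.
    """
    if not env.get("MLFLOW_TRACKING_URI"):
        raise ValueError("MLFLOW_TRACKING_URI is not set")
    if env.get("MLFLOW_TRACKING_AUTH") == "kubernetes-namespaced":
        # The plugin resolves the token and workspace on its own.
        return "automated"
    missing = [key for key in ("MLFLOW_TRACKING_TOKEN", "MLFLOW_WORKSPACE")
               if not env.get(key)]
    if missing:
        raise ValueError("missing for manual authentication: " + ", ".join(missing))
    return "manual"


if __name__ == "__main__":
    try:
        print("authentication method:", detect_auth_method(os.environ))
    except ValueError as exc:
        print("configuration problem:", exc)
```

The check only looks at variable names and values; it does not contact the tracking server or validate the token.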
3.1. Configuring the MLflow SDK for a local workstation
When you run the MLflow SDK on a local workstation, the authentication plugin uses the active kubeconfig context. The plugin uses the namespace from the kubeconfig context as the workspace and resolves the authentication token from your kubeconfig credentials, including exec-based authentication providers used by the oc login command.
Prerequisites
- You have installed the MLflow SDK.
- You have an active kubeconfig context.
Procedure
- Set the following environment variables to configure the tracking URI and authentication method:
  export MLFLOW_TRACKING_URI="https://<dashboard-url>/mlflow"
  export MLFLOW_TRACKING_AUTH=kubernetes-namespaced
- Optional: If you prefer to set the token and workspace manually, export the following variables:
  export MLFLOW_TRACKING_URI="https://<dashboard-url>/mlflow"
  export MLFLOW_TRACKING_TOKEN="$(oc whoami --show-token)"
  export MLFLOW_WORKSPACE="<project-name>"
3.2. Configuring MLflow SDK environment variables for pods
When you run a pod, the Kubernetes authentication plugin uses the mounted service account token and namespace to automatically set the workspace.
You do not have to call mlflow.set_workspace() when you enable the authentication plugin. The plugin derives the workspace from the pod’s service account namespace. You can override the workspace explicitly if you need to target a different project, provided that your service account has the necessary RBAC permissions in that project.
Prerequisites
You have installed the MLflow SDK.
Procedure
Set the following environment variables in your pod configuration:
export MLFLOW_TRACKING_URI="https://<dashboard-url>/mlflow"
export MLFLOW_TRACKING_AUTH=kubernetes-namespaced
Verification
Run a Python script using the following code to confirm that the MLflow SDK successfully connects to the tracking server and logs data:
import mlflow

mlflow.set_experiment("demo-experiment")
with mlflow.start_run():
    mlflow.log_param("framework", "pytorch")
    mlflow.log_metric("accuracy", 0.95)
3.3. Upstream MLflow SDK reference
Additional resources
For more information about the upstream MLflow SDK, see the following resources:
- MLflow tracking
- MLflow Tracking API reference
- MLflow Model Registry
- Getting started with MLflow workspaces
- MLflow workspace providers
- MLflow 3 migration guide
- MLflow Python API reference
3.4. MLflow SDK troubleshooting reference
If you encounter errors when working with experiments or artifacts in the MLflow SDK, use the following information to resolve common issues and error messages.
If your issue is not described here, contact Red Hat support.
Common issues
- 403 or permission denied
- Problem
- The active project is missing the required role-based access control (RBAC) permissions.
- Resolution
- Verify that your user or service account has the necessary role binding in the active project.
- Workspace not found
- Problem
- The SDK cannot locate the workspace because the project name is incorrect, the namespace is filtered, or no workspace was selected.
- Resolution
- Verify that your project name is correct and that the namespace is not restricted. Ensure that you have selected a workspace in your MLflow environment settings by using one of the following methods:
  - The MLFLOW_TRACKING_AUTH=kubernetes-namespaced environment variable.
  - The MLFLOW_WORKSPACE=<workspace_name> environment variable.
  - The mlflow.set_workspace("team-a") Python function.
- Artifact override is not applied
- Problem
- The MLflowConfig resource is missing, has the wrong name, or exists in the wrong project. Alternatively, the associated secret does not exist.
- Resolution
- Ensure that an MLflowConfig resource exists and is named mlflow. Verify that the mlflow-artifact-connection secret is present in the namespace.
- Kubernetes-namespaced authentication plugin cannot resolve credentials
- Problem
- The authentication plugin is missing a service account token (when running in-cluster) or an active kubeconfig context (when running locally).
- Resolution
- If running in-cluster, ensure the service account has a valid token. If running locally, verify that your kubeconfig context is active and points to the correct cluster and project by running oc project.
- Artifact writes go to the default storage location
- Problem
- The MLflowConfig resource does not exist in the active project, or the artifactRootSecret is invalid.
- Resolution
- Create the MLflowConfig resource in your active project and verify that the artifactRootSecret contains the correct connection credentials.
- JSON decode error: Expecting value
- Problem
- The Red Hat OpenShift AI authentication token is missing or invalid. Consequently, OpenShift AI receives an HTML response from MLflow instead of the expected JSON, causing OpenShift AI to prompt for a login.
- Resolution
- Log in to the OpenShift cluster by running the oc login command. Ensure that your environment uses the MLFLOW_TRACKING_AUTH=kubernetes-namespaced environment variable to authenticate requests.
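The "JSON decode error: Expecting value" case can be reproduced and diagnosed with a few lines of Python. The sketch below assumes that you already have the raw response body as a string; the function name parse_tracking_response is hypothetical and is not part of the MLflow SDK.

```python
import json


def parse_tracking_response(body: str):
    """Parse a response body as JSON and explain the HTML login-page case."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        start = body.lstrip().lower()
        if start.startswith(("<!doctype html", "<html")):
            # An HTML page where JSON was expected usually means that the
            # request was not authenticated and a login page came back.
            raise RuntimeError(
                "received an HTML page instead of JSON; the token is probably "
                "missing or invalid, so log in again with `oc login`"
            ) from exc
        raise


if __name__ == "__main__":
    print(parse_tracking_response('{"experiments": []}'))
    try:
        parse_tracking_response("<!DOCTYPE html><html><body>Log in</body></html>")
    except RuntimeError as exc:
        print("diagnosis:", exc)
```

Any body that is neither valid JSON nor an HTML page re-raises the original json.JSONDecodeError unchanged.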
3.5. MLflow version compatibility
The following information describes the compatible versions of MLflow and Red Hat OpenShift AI 3.4 GA.
Table 3.1. MLflow version compatibility and configuration
| Item | Description |
|---|---|
| Deployed MLflow server version | |
| Required MLflow SDK version | 3.11 or later |
| Authentication plugin name | |
| Environment variable | |
The following command installs the compatible version of the MLflow SDK:
pip install "mlflow[kubernetes]>=3.11"
MLflow SDK version 3.11 and later includes the kubernetes-namespaced authentication plugin by default.
3.6. MLflow storage and database compatibility
The following table lists the MLflow storage and database configurations. The configuration settings vary according to your environment, for example production, development, or testing.
Table 3.2. Supported storage and database options
| Storage area | Supported options |
|---|---|
| Artifact storage | S3 compatible object storage for production. File system for development and testing. |
| Database | PostgreSQL for production. SQLite for development and testing. |
| Artifact repository plugins | S3 and file. |
3.7. Tracking experiments with MLflow SDK
Use the MLflow software development kit (SDK) to log and track machine learning experiments. With the SDK, you can record parameters, metrics, and artifacts to a centralized tracking server for later analysis.
Prerequisites
- You have installed the MLflow SDK version 3.11 or later.
- You have access to a Red Hat OpenShift AI data science project that has MLflow permissions configured.
- You have configured a tracking URI and authentication. For more information, see MLflow Kubernetes Authentication.
Procedure
- In your notebook, log experiments, parameters, metrics, and artifacts by using the MLflow SDK:
  When the kubernetes-namespaced authentication plugin is configured, the tracking URI and workspace are resolved automatically.
  import random
  import time

  import mlflow

  mlflow.set_experiment("demo-experiment")

  with mlflow.start_run(run_name="demo-run"):
      mlflow.log_param("model_type", "baseline")
      mlflow.log_param("feature_count", 3)
      for step in range(5):
          mlflow.log_metric("accuracy", 0.8 + random.random() * 0.2, step=step)
          mlflow.log_metric("loss", 0.5 - random.random() * 0.2, step=step)
          time.sleep(0.2)
- Optional: To target a project other than the one associated with your credentials, set the workspace explicitly:
  mlflow.set_workspace("<project-name>")
Troubleshooting
If you encounter errors when using the MLflow SDK to manage experiments or artifacts, see Troubleshooting MLflow SDK.
3.8. Configure project-specific S3 artifact storage
By default, all projects use the artifact storage configured in the MLflow resource. You can override the artifact storage for a specific project by creating an MLflowConfig resource and a connection in that project. After you create these resources, MLflow resolves the artifact root from the override for any new experiments and runs that you create in that project. However, MLflow does not serve artifacts when you configure a per-project override. The client accesses the S3 bucket directly, so the client must have valid S3 credentials.
The MLflowConfig resource is namespace-scoped and must be named mlflow. It points to an S3 compatible object storage connection that holds the credentials and bucket information for the project.
Prerequisites
- You have installed MLflow.
- You have an S3-compatible object storage bucket with credentials.
Procedure
- In your project, add a connection of type S3 compatible object storage. Set the resource name of the connection to mlflow-artifact-connection. Provide the following connection details:
  - Access key: The S3 access key ID.
  - Secret key: The S3 secret access key.
  - Endpoint: The S3 compatible endpoint URL.
  - Region: The S3 region. Optional if your storage provider does not require it.
  - Bucket: The S3 bucket name.
  If you are using the connections API instead of the dashboard, create a Secret with the annotation opendatahub.io/connection-type-protocol: "s3" and name it mlflow-artifact-connection. Setting AWS_DEFAULT_REGION is optional. The required keys are:
  - AWS_ACCESS_KEY_ID
  - AWS_SECRET_ACCESS_KEY
  - AWS_S3_BUCKET
  - AWS_S3_ENDPOINT
- Create an MLflowConfig resource named mlflow in the same project:
  apiVersion: mlflow.kubeflow.org/v1
  kind: MLflowConfig
  metadata:
    name: mlflow
  spec:
    artifactRootSecret: mlflow-artifact-connection
    artifactRootPath: mlflow-artifacts
  artifactRootSecret must be mlflow-artifact-connection. The Custom Resource Definition (CRD) enforces this validation.
  Note: artifactRootPath is an optional relative path that the system appends to the bucket root from the secret. For example, if the bucket is ds-team-bucket and artifactRootPath is mlflow-artifacts, the resolved artifact root becomes s3://ds-team-bucket/mlflow-artifacts. The path must be relative, must not use backslashes, and must not contain path traversal segments such as the .. parent-directory segment.
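The artifactRootPath rules in the note can be sketched as a small Python check. This is an illustration of the stated constraints only, not the validation code that the CRD or the Operator runs. The function name resolve_artifact_root is hypothetical, and returning the bare bucket URI when no path is set is an assumption.

```python
from pathlib import PurePosixPath


def resolve_artifact_root(bucket: str, artifact_root_path: str = "") -> str:
    """Combine a bucket and an optional relative path into an s3:// URI."""
    if not artifact_root_path:
        # Assumption: with no path, the artifact root is the bucket root.
        return f"s3://{bucket}"
    if "\\" in artifact_root_path:
        raise ValueError("artifactRootPath must not use backslashes")
    path = PurePosixPath(artifact_root_path)
    if path.is_absolute():
        raise ValueError("artifactRootPath must be relative")
    if ".." in path.parts:
        raise ValueError("artifactRootPath must not contain path traversal")
    return f"s3://{bucket}/{path}"


if __name__ == "__main__":
    print(resolve_artifact_root("ds-team-bucket", "mlflow-artifacts"))
```

PurePosixPath keeps .. segments in its parts tuple instead of collapsing them, which is why a membership test is enough to detect traversal here.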
Chapter 4. Track experiments with MLflow in workbenches
You can track machine learning experiments in workbench notebooks with the MLflow SDK. OpenShift AI provides automatic MLflow SDK configuration for workbenches, which removes the need to manually set tracking URIs, configure authentication, or manage RBAC permissions.
To use MLflow experiment tracking, a cluster administrator must first enable the MLflow Operator component, and then a data scientist can enable and use the integration in individual workbenches.
4.1. MLflow workbench integration
You can enable automatic MLflow SDK configuration in your workbenches by annotating notebook resources with the opendatahub.io/mlflow-instance annotation. When this annotation is present, OpenShift AI automatically injects MLflow environment variables and provisions the required RBAC permissions, so that you can track experiments without manual SDK configuration.
The MLflow workbench integration removes the need to manually set tracking URIs, configure authentication tokens, or create Kubernetes RBAC resources. When an administrator enables the MLflow Operator component in the DataScienceCluster and creates an MLflow custom resource (CR), the platform handles the remaining configuration at the workbench level.
4.1.1. How the integration works
The integration uses two mechanisms in the notebook controller to configure MLflow access:
- Environment variable injection
- When a workbench notebook resource has the opendatahub.io/mlflow-instance annotation, the notebook controller’s mutating webhook injects three environment variables into the notebook container before the pod starts:
  - MLFLOW_TRACKING_URI: The URL of the MLflow tracking server, constructed from the GatewayAPI hostname and the MLflow instance name.
  - MLFLOW_K8S_INTEGRATION: Set to true to enable Kubernetes service account token authentication with the MLflow server.
  - MLFLOW_TRACKING_AUTH: Set to kubernetes-namespaced to configure namespace-scoped authentication.
  These variables allow the MLflow Python SDK to connect to the tracking server without additional configuration in your notebook code.
- RBAC provisioning
- The notebook controller’s reconciler creates a namespace-scoped RoleBinding resource named <notebook_name>-mlflow that grants the workbench service account the permissions defined in the mlflow-operator-mlflow-integration ClusterRole resource. The RoleBinding resource uses a controller owner reference to the notebook resource, so it is automatically deleted when the notebook is deleted.
4.1.2. Configuration methods
You can enable MLflow integration for workbenches in two ways:
- Dashboard-managed
- If you use the OpenShift AI dashboard with the Dashboard, MLflow, and Workbenches components set to Managed in the DataScienceCluster, the dashboard automatically adds the opendatahub.io/mlflow-instance annotation when you create or update a workbench. No manual configuration is required.
- Manual
- If you manage workbenches through GitOps, Helm charts, or direct YAML manifests without the OpenShift AI dashboard, you must manually add the opendatahub.io/mlflow-instance annotation to the notebook resource.
4.1.3. Annotation lifecycle
The opendatahub.io/mlflow-instance annotation controls the MLflow integration for each workbench:
- Adding the annotation
- You can add the annotation to a stopped notebook or include it when you create a new notebook. The annotation value must be the name of the MLflow instance CR. Environment variables are injected on the next pod start, and the RoleBinding resource is created by the reconciler.
- Removing the annotation
- You must stop the workbench before removing the annotation. The RoleBinding resource is deleted immediately when the annotation is removed, but environment variables in the running pod persist until restart. This mismatch can cause authentication failures if the workbench continues to send requests to the MLflow tracking server after the RoleBinding resource is removed. A validating webhook checks that the kubeflow-resource-stopped annotation is present on the notebook resource before it allows removal of the MLflow annotation. The webhook validates the annotation state, not the actual pod state. If you try to remove the annotation without first stopping the workbench through the dashboard or by setting the stop annotation, the API request is rejected with a webhook error.
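The removal rule can be summarized as a small decision function. The following Python sketch is an interpretation of the behavior described above, not the webhook's source code. The function name is hypothetical, and checking the stop annotation on the existing object (rather than on the updated one) is an assumption.

```python
from typing import Mapping

MLFLOW_ANNOTATION = "opendatahub.io/mlflow-instance"
STOPPED_ANNOTATION = "kubeflow-resource-stopped"


def mlflow_annotation_removal_allowed(old: Mapping[str, str],
                                      new: Mapping[str, str]) -> bool:
    """Decide whether an annotation update should be admitted (illustrative).

    `old` and `new` are the notebook annotations before and after the update.
    """
    removing = MLFLOW_ANNOTATION in old and MLFLOW_ANNOTATION not in new
    if not removing:
        # Updates that keep or add the MLflow annotation are not restricted here.
        return True
    # Only the annotation state is checked, not the actual pod state.
    return STOPPED_ANNOTATION in old


if __name__ == "__main__":
    running = {MLFLOW_ANNOTATION: "mlflow"}
    stopped = {MLFLOW_ANNOTATION: "mlflow", STOPPED_ANNOTATION: "2025-01-01T00:00:00Z"}
    print(mlflow_annotation_removal_allowed(running, {}))
    print(mlflow_annotation_removal_allowed(stopped, {STOPPED_ANNOTATION: stopped[STOPPED_ANNOTATION]}))
```

The timestamp value used for the stop annotation is a placeholder; the sketch only tests for the presence of the key.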
4.1.4. Non-blocking behavior
MLflow integration failures do not block workbench admission. If the tracking URI cannot be determined, for example because the GatewayAPI hostname is not yet available, the webhook skips the MLFLOW_TRACKING_URI injection but still allows the notebook to start. In this case, MLFLOW_K8S_INTEGRATION and MLFLOW_TRACKING_AUTH are still injected, which creates a degraded state in which the MLflow SDK is configured for Kubernetes authentication but has no tracking server to connect to. If MLflow operations such as mlflow.set_experiment() fail with a ConnectionError, verify that the MLFLOW_TRACKING_URI environment variable is set in your workbench pod.
Similarly, if the mlflow-operator-mlflow-integration ClusterRole resource does not yet exist, the reconciler requeues the RoleBinding resource creation with a warning event every 30 seconds until the ClusterRole resource becomes available.
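The degraded state described above follows directly from how the variables are injected. The following Python sketch models that logic under stated assumptions: the function name is hypothetical, and the URL layout https://<gateway_host>/<instance_name> is a guess based on the https://<dashboard-url>/mlflow examples in this document, not the controller's documented format.

```python
from typing import Mapping, Optional

MLFLOW_ANNOTATION = "opendatahub.io/mlflow-instance"


def injected_env(annotations: Mapping[str, str],
                 gateway_host: Optional[str]) -> dict:
    """Return the environment variables a webhook like this might inject."""
    instance = annotations.get(MLFLOW_ANNOTATION)
    if not instance:
        # Without the annotation, nothing is injected.
        return {}
    env = {
        "MLFLOW_K8S_INTEGRATION": "true",
        "MLFLOW_TRACKING_AUTH": "kubernetes-namespaced",
    }
    if gateway_host:
        # Assumed URL layout; see the lead-in paragraph.
        env["MLFLOW_TRACKING_URI"] = f"https://{gateway_host}/{instance}"
    # If the hostname is unknown, the URI is skipped and the notebook
    # still starts, which is the degraded state described above.
    return env


if __name__ == "__main__":
    annotated = {MLFLOW_ANNOTATION: "mlflow"}
    print(injected_env(annotated, "gateway.apps.example.com"))
    print(injected_env(annotated, None))
```

A quick way to detect the degraded state from inside a workbench is to check whether MLFLOW_TRACKING_AUTH is set while MLFLOW_TRACKING_URI is missing.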
4.1.5. Limitations
Artifact-serving configuration with S3 connection types is not supported. You can log parameters, metrics, and tags, but mlflow.log_artifact() functionality that relies on an S3-backed artifact store requires additional configuration that is outside the scope of the automatic integration.
4.2. Enable the MLflow Operator component
You can enable the MLflow Operator as a managed component in the DataScienceCluster object so that the MLflow tracking server and workbench integration are available on the platform. When the MLflow Operator component is enabled, workbench notebooks can use automatic MLflow SDK configuration.
Prerequisites
- The Red Hat OpenShift AI Operator is installed on your OpenShift cluster.
- A DataScienceCluster object exists.
- You have cluster administrator privileges.
- You have installed the OpenShift CLI (oc).
Procedure
- Log in to your OpenShift cluster as a cluster administrator:
  $ oc login --token=<token> --server=<openshift_cluster_url>
- Identify the name of the DataScienceCluster object:
  $ oc get datasciencecluster
- Edit the DataScienceCluster object to enable the MLflow Operator component:
  $ oc edit datasciencecluster <dsc_name>
- In the spec.components section, add or update the mlflowoperator field and set managementState to Managed:
  spec:
    components:
      # ...
      mlflowoperator:
        managementState: Managed
      # ...
  Note: The default managementState for the mlflowoperator component is Removed. You must explicitly set it to Managed to enable MLflow on the platform.
- Save and close the editor.
  When you set mlflowoperator to Managed, the Red Hat OpenShift AI Operator automatically configures the notebook controller with MLFLOW_ENABLED=true and sets the appropriate GATEWAY_URL for the cluster.
- Create an MLflow CR to deploy a tracking server instance. The MLflow CR structure and namespace requirements depend on the MLflow Operator version. For more information, consult the MLflow Operator documentation for the CR specification.
Verification
- Confirm that the MLflow Operator pod is running:
  $ oc get pods -n redhat-ods-applications -l app=mlflow-operator --field-selector status.phase=Running
  The output shows one or more MLflow Operator pods in Running status. If the command returns no results, verify that the mlflowoperator component managementState is set to Managed and wait for the Operator pods to start.
- Confirm that the mlflowoperator component status is true in the DataScienceCluster object:
  $ oc get datasciencecluster <dsc_name> -o jsonpath='{.status.installedComponents.mlflowoperator}'
  The expected output is true.
4.2.1. MLflow dashboard feature flag deprecation
Starting with Red Hat OpenShift AI 3.4, the mlflow field in the OdhDashboardConfig custom resource (CR) is deprecated. MLflow availability in the dashboard is now determined by the mlflowoperator component state in the DataScienceCluster object, and the dashboard feature flag is no longer required.
4.2.1.1. Deprecated field
The mlflow field in spec.dashboardConfig of the OdhDashboardConfig CR is deprecated and no longer controls MLflow visibility in the dashboard. You do not need to set this field to enable MLflow. The field has no effect on MLflow functionality.
4.2.1.2. Current behavior
MLflow features in the OpenShift AI dashboard are automatically available when the mlflowoperator component is set to Managed in the DataScienceCluster object. No additional dashboard configuration is required.
4.2.1.3. mlflowPipelines field
The mlflowPipelines field in the OdhDashboardConfig CR is not deprecated and remains active. This field controls whether pipeline run tables display the MLflow experiment column. Do not confuse the deprecated mlflow field with the active mlflowPipelines field.
Table 4.1. Dashboard feature flag status
| Field | Status | Description |
|---|---|---|
| mlflow | Deprecated | In earlier releases, this field controlled MLflow visibility in the dashboard. No longer required; MLflow is enabled automatically by the mlflowoperator component. |
| mlflowPipelines | Active | Controls whether the MLflow experiment column is displayed in pipeline run tables. |
4.3. Enable MLflow integration for a workbench
You can enable automatic MLflow SDK configuration for a workbench by adding the opendatahub.io/mlflow-instance annotation to the notebook resource. When the workbench starts, the platform injects MLflow environment variables and provisions the required RBAC resources.
When you create or update a workbench, the OpenShift AI dashboard automatically adds the MLflow annotation. For this to occur, the following components in the DataScienceCluster object must be set to Managed:
- Dashboard
- MLflow
- Workbenches
The following procedure is for environments where you manage workbenches through GitOps, Helm charts, or direct YAML manifests without the OpenShift AI dashboard.
Prerequisites
- A cluster administrator has enabled the MLflow Operator component in the DataScienceCluster object. For more information, see Enable the MLflow Operator component.
- An MLflow CR has been created in the same namespace as the workbench.
- The workbench is stopped.
- You have installed the OpenShift CLI (oc).
Procedure
Log in to your OpenShift cluster:
$ oc login --token=__<token>__ --server=__<openshift_cluster_url>__
Add the opendatahub.io/mlflow-instance annotation to the notebook resource.
In the following command, replace __<notebook_name>__ with the name of your workbench notebook and __<mlflow_instance_name>__ with the name of your MLflow instance CR:
$ oc annotate notebook -n __<namespace>__ __<notebook_name>__ opendatahub.io/mlflow-instance=__<mlflow_instance_name>__
If the MLflow instance name is mlflow, the tracking URI path is /mlflow. If the instance name is different, the path is /mlflow-__<mlflow_instance_name>__.
Alternatively, you can add the annotation directly in the notebook resource YAML:
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: __<notebook_name>__
  namespace: __<namespace>__
  annotations:
    opendatahub.io/mlflow-instance: "__<mlflow_instance_name>__"
spec:
  # ...
- Start or restart the workbench so that the environment variables are injected into the new pod.
Verification
Verify that the MLflow environment variables are present in the running workbench pod.
First, find the pod name for your notebook:
$ oc get pods -n __<namespace>__ -l notebook-name=__<notebook_name>__
Then verify that the MLflow environment variables are present, where __<notebook_pod_name>__ is the pod name from the previous command and __<notebook_name>__ is the notebook resource name:
$ oc exec -n __<namespace>__ __<notebook_pod_name>__ -c __<notebook_name>__ -- env | grep MLFLOW
The expected output shows the following environment variables:
MLFLOW_TRACKING_URI=https://<gateway_hostname>/mlflow
MLFLOW_K8S_INTEGRATION=true
MLFLOW_TRACKING_AUTH=kubernetes-namespaced
The <gateway_hostname> value is cluster-specific and matches the GatewayAPI public endpoint for your cluster.
Verify that the MLflow RoleBinding resource exists in the namespace:
$ oc get rolebinding -n __<namespace>__ __<notebook_name>__-mlflow
The output shows a RoleBinding resource referencing the mlflow-operator-mlflow-integration ClusterRole resource.
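If you automate this verification, you can check the captured output of the env | grep MLFLOW command for the three expected variable names. The following Python sketch is one possible approach. The function name and the gateway.example.com hostname are hypothetical; the variable names come from the expected output above.

```python
EXPECTED_VARS = (
    "MLFLOW_TRACKING_URI",
    "MLFLOW_K8S_INTEGRATION",
    "MLFLOW_TRACKING_AUTH",
)


def missing_mlflow_vars(env_output: str) -> list[str]:
    """Return the expected MLflow variable names that are absent from env output."""
    present = set()
    for line in env_output.splitlines():
        name, sep, _value = line.partition("=")
        if sep:  # skip lines that are not NAME=value pairs
            present.add(name.strip())
    return [name for name in EXPECTED_VARS if name not in present]


# Hypothetical output captured from the oc exec command.
sample_output = """\
MLFLOW_TRACKING_URI=https://gateway.example.com/mlflow
MLFLOW_K8S_INTEGRATION=true
MLFLOW_TRACKING_AUTH=kubernetes-namespaced
"""
print(missing_mlflow_vars(sample_output))  # an empty list means all three are set
```

A non-empty result lists the variables that were not injected, which usually means that the workbench was not restarted after the annotation was added.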
4.4. Use the MLflow SDK in a workbench notebook
After MLflow integration is enabled for your workbench, you can use the MLflow Python SDK to create experiments, log parameters and metrics, and track runs. The required environment variables are already configured, so the SDK connects to the tracking server automatically.
Prerequisites
- A cluster administrator has enabled the MLflow Operator component in the DataScienceCluster custom resource (CR). For more information, see Enable the MLflow Operator component.
- An MLflow CR has been created in the namespace.
- The workbench notebook resource has the opendatahub.io/mlflow-instance annotation. For more information, see Enable MLflow integration for a workbench.
- The workbench is started and the MLflow environment variables are injected.
- The mlflow Python package is installed in the workbench. Starting with OpenShift AI 3.4, all workbench images except the minimal image include the MLflow SDK. For older images or custom images, install the package by running pip install mlflow in a notebook cell.
Note
If you need only basic tracking functionality, you can install the mlflow-skinny package instead for a smaller footprint. The mlflow-skinny package contains the core tracking and logging features but does not include model serving integrations or the full MLflow CLI.
Procedure
- Open a notebook in the workbench.
Verify that the MLflow environment variables are available:
import os
print(os.environ.get("MLFLOW_TRACKING_URI"))
print(os.environ.get("MLFLOW_K8S_INTEGRATION"))
print(os.environ.get("MLFLOW_TRACKING_AUTH"))
All three variables display their configured values.
Import the MLflow SDK and set an experiment:
import mlflow
mlflow.set_experiment("my-experiment")
The set_experiment() call returns an Experiment object. When the call is the last statement in a notebook cell, the cell output shows the object, including the experiment ID. If you see a ConnectionError, verify that the MLFLOW_TRACKING_URI environment variable is set correctly by re-running the check in step 2.
Start a run and log parameters and metrics:
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("epochs", 10)
    mlflow.log_metric("accuracy", 0.95)
    mlflow.log_metric("loss", 0.05)
You can log any number of parameters, metrics, and tags within a run. If start_run() raises a ConnectionError or an authentication or authorization error, verify that the MLflow tracking server is running and that the workbench RoleBinding resource exists.
- View the experiment results in the MLflow UI, which is accessible from the OpenShift AI dashboard.
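Real training code often keeps hyperparameters in a nested configuration and produces metrics once per epoch. The following Python sketch shows one way to log both in a single run. The helper names, the configuration, and the metric values are hypothetical. The sketch uses the mlflow.log_params() and mlflow.log_metrics() batch calls, and it imports the SDK inside the logging function so that the flattening helper can be tried without a tracking server.

```python
def flatten_params(config: dict, prefix: str = "") -> dict:
    """Flatten a nested dict into dotted keys, for example "optimizer.name"."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_params(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def log_training_run(config: dict, history: list[dict]) -> None:
    """Log flattened parameters once, then one set of metrics per epoch."""
    import mlflow  # imported here so flatten_params works without the SDK

    with mlflow.start_run():
        mlflow.log_params(flatten_params(config))
        for epoch, metrics in enumerate(history):
            # The step argument lets the MLflow UI chart each metric over epochs.
            mlflow.log_metrics(metrics, step=epoch)


# Hypothetical configuration and per-epoch metrics.
config = {"epochs": 2, "optimizer": {"name": "sgd", "learning_rate": 0.01}}
history = [{"accuracy": 0.90, "loss": 0.20}, {"accuracy": 0.95, "loss": 0.05}]
print(flatten_params(config))
# In a workbench with MLflow integration enabled, call:
# log_training_run(config, history)
```

Logging metrics with a step value is a design choice that produces a chart per metric in the run details page instead of a single final value.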
Verification
- Confirm that the experiment and run are displayed in the MLflow UI. Navigate to the MLflow tracking server URL in your browser or through the OpenShift AI dashboard and verify that the logged parameters and metrics are visible.
Artifact-serving configuration with S3 connection types is not supported. You can log parameters, metrics, and tags, but mlflow.log_artifact() functionality that relies on an S3-backed artifact store requires additional configuration. For more information, see MLflow workbench integration.
4.5. Disable MLflow integration for a workbench
You can disable MLflow integration for a workbench by removing the opendatahub.io/mlflow-instance annotation from the notebook resource. You must stop the workbench before removing the annotation.
When you remove the annotation, the platform deletes the associated RoleBinding resource immediately, but environment variables in the running pod persist until the pod restarts. If the workbench is running when you remove the annotation, the MLflow SDK continues to send requests to the tracking server without a valid RoleBinding resource in place, which causes authorization failures.
Prerequisites
- Your workbench has the opendatahub.io/mlflow-instance annotation.
- The workbench is stopped. The validating webhook requires the kubeflow-resource-stopped annotation to be present on the notebook resource before it allows removal of the MLflow annotation.
- You have installed the OpenShift CLI (oc).
Procedure
Log in to your OpenShift cluster:
$ oc login --token=__<token>__ --server=__<openshift_cluster_url>__
Verify that the workbench pod has stopped:
$ oc get pods -n __<namespace>__ -l notebook-name=__<notebook_name>__
The command produces no output when the pod has stopped.
Remove the opendatahub.io/mlflow-instance annotation from the notebook resource.
In the following command, replace __<notebook_name>__ with the name of your workbench notebook:
$ oc annotate notebook -n __<namespace>__ __<notebook_name>__ opendatahub.io/mlflow-instance-
The trailing dash (-) removes the annotation.
Verification
Start the workbench and verify that the MLflow environment variables are no longer present.
First, find the pod name for your notebook:
$ oc get pods -n __<namespace>__ -l notebook-name=__<notebook_name>__
Then verify that the MLflow environment variables are no longer present, where __<notebook_pod_name>__ is the pod name from the previous command and __<notebook_name>__ is the notebook resource name:
$ oc exec -n __<namespace>__ __<notebook_pod_name>__ -c __<notebook_name>__ -- env | grep MLFLOW
The command produces no output, confirming that the MLflow variables have been removed.
Verify that the MLflow RoleBinding resource has been removed:
$ oc get rolebinding -n __<namespace>__ __<notebook_name>__-mlflow
The expected output is:
Error from server (NotFound): rolebindings.rbac.authorization.k8s.io "__<notebook_name>__-mlflow" not found
Troubleshooting
If the annotation removal is rejected with a webhook error, see Resolve MLflow annotation removal rejection.
4.6. Resolve MLflow annotation removal rejection
If you try to remove the opendatahub.io/mlflow-instance annotation from a running workbench, the API request is rejected by a validating webhook. You can resolve this issue by stopping the workbench before removing the annotation.
The validating webhook checks that the kubeflow-resource-stopped annotation is present on the notebook resource before it allows removal of the MLflow annotation. The webhook validates the annotation state, not the actual pod state. If the pod was stopped without setting the stop annotation, for example by deleting the pod directly, the webhook still rejects the annotation removal.
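The rule described above can be modeled in a few lines of Python. This sketch is not the webhook implementation, and the function name is hypothetical. It only illustrates that the decision depends on the presence of the kubeflow-resource-stopped annotation key, not on its value or on the pod state.

```python
STOP_ANNOTATION = "kubeflow-resource-stopped"


def can_remove_mlflow_annotation(annotations: dict) -> bool:
    """Model the rule: removal is allowed only if the stop annotation key exists."""
    # Only key presence is checked; the value and the pod state are ignored.
    return STOP_ANNOTATION in annotations


# Workbench stopped by deleting the pod directly: no stop annotation, so rejected.
print(can_remove_mlflow_annotation({"opendatahub.io/mlflow-instance": "mlflow"}))

# Workbench stopped through the stop annotation: allowed, whatever the value is.
print(can_remove_mlflow_annotation({
    "opendatahub.io/mlflow-instance": "mlflow",
    "kubeflow-resource-stopped": "true",
}))
```

The first call models the case where the pod is gone but the request is still rejected, which is why the procedure that follows applies the stop annotation first.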
Prerequisites
- The workbench has the opendatahub.io/mlflow-instance annotation.
- You received a webhook error when trying to remove the annotation.
- You have installed the OpenShift CLI (oc).
Procedure
Stop the workbench. You can stop the workbench from the OpenShift AI dashboard or by applying the stop annotation:
$ oc annotate notebook -n __<namespace>__ __<notebook_name>__ kubeflow-resource-stopped=true
The value of the kubeflow-resource-stopped annotation is not significant for this purpose. The validating webhook checks only for the presence of the annotation key.
Wait for the notebook pod to stop fully:
$ oc get pods -n __<namespace>__ -l notebook-name=__<notebook_name>__
The command produces no output when the pod has stopped.
Remove the MLflow annotation:
$ oc annotate notebook -n __<namespace>__ __<notebook_name>__ opendatahub.io/mlflow-instance-
Verification
Confirm that the annotation has been removed:
$ oc get notebook -n __<namespace>__ __<notebook_name>__ -o jsonpath='{.metadata.annotations.opendatahub\.io/mlflow-instance}'
The command produces no output, confirming the annotation is removed.
4.7. MLflow workbench environment variables and annotations
When MLflow integration is enabled for a workbench, the notebook controller injects environment variables and creates RBAC resources automatically. You can use this reference to understand the annotation, environment variables, and RoleBinding resource that the platform manages on your behalf.
4.7.1. Annotation
Table 4.2. MLflow instance annotation
| Annotation | Description |
|---|---|
| opendatahub.io/mlflow-instance | Specifies the name of the MLflow instance CR to connect to. When this annotation is present and non-empty on a notebook resource, the notebook controller enables MLflow integration for the workbench. The annotation value determines the path segment in the tracking URI. |
Table 4.3. Annotation value to tracking URI path mapping
| Annotation value | Tracking URI path |
|---|---|
| mlflow | /mlflow |
| __<mlflow_instance_name>__ | /mlflow-__<mlflow_instance_name>__ |
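The mapping from annotation value to tracking URI can be sketched in Python as follows. The function names and the gateway.example.com hostname are hypothetical. The path rule follows the description in Enable MLflow integration for a workbench: the instance name mlflow maps to /mlflow, and any other name maps to /mlflow-__<mlflow_instance_name>__.

```python
def tracking_uri_path(instance_name: str) -> str:
    """Return the tracking URI path for an MLflow instance name."""
    if not instance_name:
        # The integration is enabled only for a non-empty annotation value.
        raise ValueError("the annotation value must be non-empty")
    return "/mlflow" if instance_name == "mlflow" else f"/mlflow-{instance_name}"


def tracking_uri(gateway_hostname: str, instance_name: str) -> str:
    """Combine the gateway hostname and the instance path into a full URI."""
    return f"https://{gateway_hostname}{tracking_uri_path(instance_name)}"


print(tracking_uri("gateway.example.com", "mlflow"))
print(tracking_uri("gateway.example.com", "team-a"))
```

You can compare the result with the MLFLOW_TRACKING_URI value in a running workbench pod to confirm that the workbench points at the intended instance.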
4.7.2. Environment variables
The following environment variables are injected into the notebook container when the opendatahub.io/mlflow-instance annotation is present.
Table 4.4. MLflow environment variables
| Variable | Value | Description |
|---|---|---|
| MLFLOW_TRACKING_URI | https://<gateway_hostname>/mlflow | Specifies the URL of the MLflow tracking server. The hostname is derived from the GatewayAPI public endpoint. If the instance name is mlflow, the path is /mlflow; otherwise, the path is /mlflow-__<mlflow_instance_name>__. |
| MLFLOW_K8S_INTEGRATION | true | Specifies that the MLflow client uses Kubernetes service account token authentication. |
| MLFLOW_TRACKING_AUTH | kubernetes-namespaced | Specifies the authentication method for the MLflow client. This value configures namespace-scoped authentication using the workbench service account. |
4.7.3. RoleBinding
The notebook controller creates a namespace-scoped RoleBinding resource for each annotated workbench.
Table 4.5. MLflow RoleBinding details
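| Property | Value |
|---|---|
| Name | __<notebook_name>__-mlflow |
| Namespace | Same as the notebook resource namespace |
| Subject | The workbench service account |
| RoleRef | mlflow-operator-mlflow-integration ClusterRole resource |
| Controller owner reference | Set to the notebook resource, so the RoleBinding resource is deleted when the notebook resource is deleted. |
| Labels | |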
4.7.4. ClusterRole availability
The RoleBinding resource references the mlflow-operator-mlflow-integration ClusterRole resource, which is created by the MLflow Operator. If this ClusterRole resource does not yet exist when the reconciler attempts to create the RoleBinding resource, the reconciler requeues the request with a warning event every 30 seconds until the ClusterRole resource becomes available. These events are generated only when a notebook resource has the opendatahub.io/mlflow-instance annotation, but the ClusterRole resource does not yet exist. If no annotated notebooks exist, no events are generated. You can observe these events by running the following command, where __<namespace>__ is the namespace containing your workbench notebook:
$ oc get events -n __<namespace>__ --field-selector reason=MLflowClusterRolePending
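The requeue behavior can be pictured with a small Python model. This is not the controller code, and the ReconcileResult type and reconcile() function are hypothetical. The 30-second interval and the MLflowClusterRolePending event reason are taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional

REQUEUE_SECONDS = 30
PENDING_REASON = "MLflowClusterRolePending"


@dataclass
class ReconcileResult:
    create_role_binding: bool
    requeue_after: Optional[int] = None   # seconds until the next attempt
    warning_reason: Optional[str] = None  # reason of the emitted warning event


def reconcile(annotated: bool, cluster_role_exists: bool) -> ReconcileResult:
    """Model one reconcile pass for a notebook resource."""
    if not annotated:
        # Notebooks without the MLflow annotation never generate events.
        return ReconcileResult(create_role_binding=False)
    if not cluster_role_exists:
        # Wait for the MLflow Operator to create the ClusterRole, then retry.
        return ReconcileResult(False, REQUEUE_SECONDS, PENDING_REASON)
    return ReconcileResult(create_role_binding=True)


print(reconcile(annotated=True, cluster_role_exists=False))
```

In this model, a steady stream of MLflowClusterRolePending events for an annotated notebook indicates that the MLflow Operator has not yet created the ClusterRole resource.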
4.7.5. Controller environment variables
The Red Hat OpenShift AI Operator automatically configures the following environment variables on the notebook controller when the mlflowoperator component is set to Managed in the DataScienceCluster object. You do not need to set these variables manually.
Table 4.6. Controller environment variables
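| Variable | Description |
|---|---|
| MLFLOW_ENABLED | Set to true when the mlflowoperator component is set to Managed. |
| GATEWAY_URL | Set based on the cluster’s Gateway configuration. The notebook controller uses this value to construct the MLFLOW_TRACKING_URI value. |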
Chapter 5. Track and compare MLflow experiments
When you train or tune models, you need to understand what combination of parameters, data, and code produces the best results. You can use the embedded MLflow experiment tracking page in the OpenShift AI dashboard to view, compare, and analyze your experiment runs without leaving the dashboard.
The experiment tracking page appears under Develop & train > Experiments (MLflow) after you enable the MLflow Operator component in the DataScienceCluster resource and create an MLflow custom resource. The page displays the same experiment and run data as the standalone MLflow UI, integrated directly into the dashboard with native light and dark mode support.
Experiments organize your runs into logical groups. Each run captures the parameters, metrics, and artifacts from a single training or evaluation execution. You can log runs to MLflow from a workbench by using the MLflow SDK, which is pre-installed in the Data Science workbench image.
From the experiment tracking page, you can perform the following tasks:
- Browse experiments and runs
- View all experiments in your project and drill into individual experiments to see their associated runs. Each run displays its logged parameters, metrics, and system-level resource usage.
- Compare runs
- Select multiple runs from one or more experiments and compare them side by side. The comparison view provides visualizations such as parallel coordinates plots, scatter plots, box plots, and contour plots with configurable axes. You can toggle Show differences only to focus on the parameters and metrics that vary across runs.
- Create experiments
Create new experiments to organize runs for a specific model, dataset, or tuning strategy. Experiment names must be unique within the project.
Within an experiment, a workflow type toggle controls which tabs are visible. Select GenAI to see Overview, Traces, and Sessions tabs, or select Model training to see Runs, Models, and Traces tabs.
You log metrics and create runs through the MLflow SDK in a workbench, not through the dashboard UI. The dashboard provides a read-only view of the run data that the SDK records, along with tools for comparing runs.
5.1. View MLflow experiments and runs
After you log training or evaluation runs by using the MLflow SDK, you can view experiment results and run details in the OpenShift AI dashboard. The Experiments page displays all experiments in your project and lets you drill into individual runs.
Prerequisites
- The MLflow Operator component is enabled in the DataScienceCluster resource.
- An MLflow custom resource has been created.
- You have logged one or more runs to an MLflow experiment from a workbench.
Procedure
- From the OpenShift AI dashboard, select the project that contains the MLflow experiments from the Project list.
Click Develop & train → Experiments (MLflow).
The Experiments list displays all MLflow experiments in your project.
Click an experiment name to view its associated runs.
The Runs table displays each run with its parameters, metrics, and status.
Click a run name to view its details.
The details page displays the following information:
- Parameters
- The hyperparameters and configuration values that you logged for the run.
- Metrics
- The performance metrics that you logged during the run, displayed as interactive charts that you can zoom and pan.
- System metrics
- Resource usage data such as CPU and memory consumption during the run.
Optional: To create a new experiment, click Create.
- Enter a unique name for the experiment.
- Click Create.
Verification
- The Experiments list displays your experiments and the Runs table shows the expected number of runs.
- The details page displays the parameters and metrics that you logged by using the MLflow SDK.
5.2. Compare MLflow runs
When you have several runs across one or more experiments, you can compare them side by side to identify which parameters and metrics differ. This helps you select the best-performing configuration.
Prerequisites
- The MLflow Operator component is enabled in the DataScienceCluster resource.
- An MLflow custom resource has been created.
- You have logged two or more runs to MLflow experiments from a workbench.
Procedure
- From the OpenShift AI dashboard, select the project that contains the MLflow experiments from the Project list.
- Click Develop & train → Experiments (MLflow).
- Click an experiment name to view its runs.
- Select the checkbox next to each run that you want to compare.
Click Compare.
The comparison view displays the selected runs with their parameters and metrics in a tabular layout.
- Optional: To focus on values that vary between runs, select Show differences only.
Optional: To visualize the comparison, select a chart type from the visualization panel.
The following chart types are available:
- Parallel coordinates plot
- Displays all selected runs as lines across parallel axes, one for each metric or parameter. Use this view to identify patterns and tradeoffs across many dimensions at once.
- Scatter plot
- Plots two metrics or parameters against each other across the selected runs. Use the X axis and Y axis dropdowns to select the axes.
- Box plot
- Displays the distribution of a metric across the selected runs.
- Contour plot
- Displays the density of data points across two parameters or metrics as a contour map. Use the X axis, Y axis, and Z axis dropdowns to configure the axes.
Optional: To compare runs across different experiments, return to the Experiments list.
- Select runs from another experiment.
- Add the selected runs to the comparison.
Verification
- The comparison view displays all selected runs with their parameters and metrics.
- Chart visualizations reflect the selected axes and chart type.
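To make the effect of Show differences only concrete, the following Python sketch filters a set of runs down to the parameters and metrics whose values are not identical across runs. It is an illustration of the idea, not the dashboard implementation. The function name, run names, and values are hypothetical.

```python
def differing_keys(runs: dict[str, dict]) -> list[str]:
    """Return parameter or metric names whose values differ across runs."""
    all_keys = sorted({key for values in runs.values() for key in values})
    differing = []
    for key in all_keys:
        # A key that is missing from a run counts as a difference.
        seen = {repr(values.get(key)) for values in runs.values()}
        if len(seen) > 1:
            differing.append(key)
    return differing


# Hypothetical logged values for two runs.
runs = {
    "run-a": {"learning_rate": 0.01, "epochs": 10, "accuracy": 0.95},
    "run-b": {"learning_rate": 0.10, "epochs": 10, "accuracy": 0.91},
}
print(differing_keys(runs))
```

In this example, epochs is filtered out because both runs use the same value, which leaves only the fields that can explain the difference in accuracy.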