Installing packages from a PyPI mirror fails in Data Science Pipelines on a disconnected installation

Solution Unverified - Updated

Environment

  • Red Hat OpenShift Data Science
    • Version: < 2.5

Issue

When a pipeline in Data Science Pipelines needs to install Python packages in a disconnected environment, the installation fails because pip tries to reach external URLs instead of the internal PyPI mirror.

Resolution

There are 3 possible methods to work around this issue.

  1. Specify the URL of the PyPI package in the pipeline code.
preprocess_data_step = components.create_component_from_func(
   func=preprocess_data,
   base_image="registry.redhat.io/ubi8/python-39",
   packages_to_install=['https://pypi-notebook.apps.hnalla.dev.datahub.redhat.com/packages/urllib3-2.0.7.tar.gz'])
  2. Inject an environment variable in the pipeline component to specify the PyPI mirror.
from kfp import components, dsl
from kubernetes.client.models import V1EnvVar

ingest_data_step = components.create_component_from_func(
   func=ingest_data,
   base_image="registry.redhat.io/ubi8/python-39",
   packages_to_install=['pandas'])
…
@dsl.pipeline(name="my-pipeline")
def my_pipeline():
   # pip reads PIP_INDEX_URL as its --index-url option
   idx_url = V1EnvVar(name='PIP_INDEX_URL', value='https://pypi-notebook.apps.hnalla.dev.datahub.redhat.com')

   ingest_and_process_task = ingest_data_step().add_env_variable(idx_url)

Note: For PyPI mirrors with self-signed certificates, an extra environment variable, PIP_TRUSTED_HOST, must be injected so that pip skips certificate validation for the mirror host. The full example covering both cases is shown below.

ingest_data_step = components.create_component_from_func(
   func=ingest_data,
   base_image="registry.redhat.io/ubi8/python-39",
   packages_to_install=['pandas'])
preprocess_data_step = components.create_component_from_func(
   func=preprocess_data,
   base_image="registry.redhat.io/ubi8/python-39",
   packages_to_install=['https://pypi-notebook.apps.hnalla.dev.datahub.redhat.com/packages/urllib3-2.0.7.tar.gz'])
…
@dsl.pipeline(name="my-pipeline")
def my_pipeline():
   idx_url = V1EnvVar(name='PIP_INDEX_URL', value='https://pypi-notebook.apps.hnalla.dev.datahub.redhat.com')
   trt_host = V1EnvVar(name='PIP_TRUSTED_HOST', value='pypi-notebook.apps.hnalla.dev.datahub.redhat.com')

   ingest_and_process_task = ingest_data_step().add_env_variable(idx_url).add_env_variable(trt_host)
   preprocess_data_task = preprocess_data_step(ingest_and_process_task.output).add_env_variable(trt_host)

For PyPI mirrors whose certificates are signed by a certificate authority (CA), replace PIP_TRUSTED_HOST with the PIP_CERT environment variable, set to the path of the certificate file.

  • This requires a secret containing the certificate file. The example below shows such a secret.
kind: Secret
apiVersion: v1
metadata:
  name: pypi-mirror-secret
data:
  ca.crt: >-
    ### INSERT BASE64-ENCODED CERTIFICATE CONTENTS HERE ###
type: Opaque
  • The secret must be mounted in the pipeline component. The snippet below mounts the secret as a volume and points PIP_CERT at the mounted file.
    # The index URL is read from the pip-conf secret described under "Special case" below
    idx_url = V1EnvVar(name='PIP_INDEX_URL',
        value_from=V1EnvVarSource(
            secret_key_ref=V1SecretKeySelector(
                key="index-url", name="pip-conf")))
    # Path of the certificate inside the container, matching the mount path below
    pypi_cert = V1EnvVar(name='PIP_CERT', value='/certs/ca.crt')
    vol = V1Volume(name='pypi-certs',
        secret=V1SecretVolumeSource(secret_name='pypi-mirror-secret'))
    vol_mount = V1VolumeMount(name='pypi-certs', mount_path='/certs')

    ingest_and_process_task = (ingest_data_step()
        .add_env_variable(idx_url)
        .add_env_variable(pypi_cert)
        .add_volume_mount(vol_mount)
        .add_volume(vol))
  3. Build a runtime image to use in the pipeline code. This method involves building a new image from the UBI Python image, adding a pip.conf file under /etc/, and pushing the image to the disconnected image registry. The pip.conf file should look like this:
[global]
index-url = https://artifactory-artifactory.apps.rmartine.dev.datahub.redhat.com/artifactory/api/pypi/disconnected-pypi/simple
trusted-host = artifactory-artifactory.apps.rmartine.dev.datahub.redhat.com

And the Dockerfile should look like this.

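FROM registry.redhat.io/ubi8/python-39

# Switch to root to write the system-wide pip configuration
USER root
COPY pip.conf /etc/

# Return to the non-root user for running the pipeline code
USER 1001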
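
The pip.conf used in the third method can also be generated instead of written by hand. The following is a minimal sketch using only the Python standard library; it is not part of the original solution. The helper name `render_pip_conf` and the host `mirror.example.internal` are hypothetical placeholders, and the trusted-host entry is only appropriate for mirrors with self-signed certificates.

```python
import configparser
import io
from urllib.parse import urlsplit


def render_pip_conf(index_url: str, trust_host: bool = True) -> str:
    """Return pip.conf text with a [global] section for the given mirror URL.

    Hypothetical helper for illustration only.
    """
    host = urlsplit(index_url).hostname
    if host is None:
        raise ValueError(f"no host found in index URL: {index_url!r}")

    # Interpolation is disabled so that '%' characters in the URL are kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser["global"] = {"index-url": index_url}
    if trust_host:
        # Only needed when the mirror certificate cannot be validated.
        parser["global"]["trusted-host"] = host

    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder URL; replace with the address of the internal mirror.
    print(render_pip_conf("https://mirror.example.internal/simple"))
```

The output can be saved as pip.conf next to the Dockerfile before building the image.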

Special case: PyPI mirrors with authentication

For PyPI mirrors that require authentication, the best option is to create a key/value secret and reference it from the environment variable configuration in the pipeline code. The example below shows the secret definition.

kind: Secret
apiVersion: v1
metadata:
 name: pip-conf
stringData:
 index-url: https://admin:password@pypi-notebook.apps.hnalla.dev.datahub.redhat.com
 trusted-host: pypi-notebook.apps.hnalla.dev.datahub.redhat.com
type: Opaque
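
Kubernetes expects values placed under a Secret's `data` field to be base64-encoded, while the `stringData` field accepts plain text. If `data` is used, the values could be encoded along the lines of the sketch below. The helper name `encode_secret_data` and the example values are hypothetical and only serve as an illustration.

```python
import base64


def encode_secret_data(values: dict[str, str]) -> dict[str, str]:
    """Base64-encode each value so it can be placed under a Secret's `data` field.

    Hypothetical helper for illustration only.
    """
    return {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }


if __name__ == "__main__":
    # Placeholder values; replace with the real mirror URL and host.
    encoded = encode_secret_data({
        "index-url": "https://admin:password@mirror.example.internal",
        "trusted-host": "mirror.example.internal",
    })
    for key, value in encoded.items():
        print(f"{key}: {value}")
```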

And the pipeline code should look like this.

from kubernetes.client.models import V1EnvVar, V1EnvVarSource, V1SecretKeySelector
…
@dsl.pipeline(name="my-pipeline")
def my_pipeline():
   idx_url = V1EnvVar(name='PIP_INDEX_URL',
       value_from=V1EnvVarSource(
           secret_key_ref=V1SecretKeySelector(
               key="index-url", name="pip-conf")))
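   trt_host = V1EnvVar(name='PIP_TRUSTED_HOST',
       value_from=V1EnvVarSource(
           secret_key_ref=V1SecretKeySelector(
               key="trusted-host", name="pip-conf")))

   # Attach both variables to the component that installs packages
   ingest_and_process_task = ingest_data_step().add_env_variable(idx_url).add_env_variable(trt_host)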
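
When the user name or password contains characters such as `@`, `:` or `/`, they have to be percent-encoded before being embedded in the index URL, otherwise the URL is parsed incorrectly. The following is a minimal sketch of building such a URL with the Python standard library. The helper name `index_url_with_credentials`, the host and the credentials are hypothetical, and IPv6 literal hosts are not handled.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def index_url_with_credentials(index_url: str, username: str, password: str) -> str:
    """Embed percent-encoded credentials into an index URL.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(index_url)
    if not parts.hostname:
        raise ValueError(f"no host found in index URL: {index_url!r}")

    # Rebuild the network location without any credentials already present.
    netloc = parts.hostname
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"

    # safe="" makes quote() encode every reserved character, including '/' and ':'.
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    return urlunsplit(
        (parts.scheme, f"{userinfo}@{netloc}", parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    # Placeholder mirror and credentials.
    print(index_url_with_credentials(
        "https://mirror.example.internal/simple", "admin", "p@ss:w/rd"))
```

The resulting string is what would be stored under the `index-url` key of the secret.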

Root Cause

The pip command needs specific configuration to use a private PyPI mirror, and this configuration is not yet parameterized in the Data Science Pipelines installation.
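
The environment variables used in this solution follow pip's general convention: each long command-line option can be set through a variable named PIP_ followed by the option name in upper case, with dashes replaced by underscores. The sketch below illustrates that mapping; the function name `pip_env_var` is hypothetical and not part of pip.

```python
def pip_env_var(option: str) -> str:
    """Map a pip long option name to the environment variable pip reads for it.

    Hypothetical helper for illustration only.
    """
    name = option.lstrip("-")
    if not name:
        raise ValueError("option name must not be empty")
    return "PIP_" + name.replace("-", "_").upper()


if __name__ == "__main__":
    # The three options used in this solution.
    for option in ("--index-url", "--trusted-host", "--cert"):
        print(option, "->", pip_env_var(option))
```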
