Installing packages from a PyPI mirror fails on Data Science Pipelines in a disconnected installation
Environment
- Red Hat OpenShift Data Science
- Version: < 2.5
Issue
When a pipeline in Data Science Pipelines needs to install Python packages in a disconnected environment, the run fails because pip tries to resolve the packages from external URLs.
Resolution
There are three possible methods to work around this issue.
- Specify the URL of the PyPI package in the pipeline code.
from kfp import components

# Install the package directly from its URL on the mirror.
preprocess_data_step = components.create_component_from_func(
    func=preprocess_data,
    base_image="registry.redhat.io/ubi8/python-39",
    packages_to_install=['https://pypi-notebook.apps.hnalla.dev.datahub.redhat.com/packages/urllib3-2.0.7.tar.gz'])
- Inject an environment variable into the pipeline component to specify the PyPI mirror.
from kfp import components, dsl
from kubernetes.client.models import V1EnvVar

ingest_data_step = components.create_component_from_func(
    func=ingest_data,
    base_image="registry.redhat.io/ubi8/python-39",
    packages_to_install=['pandas'])
…
@dsl.pipeline(name="my-pipeline")
def my_pipeline():
    # PIP_INDEX_URL points pip at the mirror instead of the public index.
    idx_url = V1EnvVar(name='PIP_INDEX_URL', value='https://pypi-notebook.apps.hnalla.dev.datahub.redhat.com')
    ingest_and_process_task = ingest_data_step().add_env_variable(idx_url)
Note: For PyPI mirrors with self-signed certificates, an extra environment variable must be injected to disable certificate trust validation for the mirror host. The full example, covering both a direct package URL and an index URL, is shown below.
ingest_data_step = components.create_component_from_func(
    func=ingest_data,
    base_image="registry.redhat.io/ubi8/python-39",
    packages_to_install=['pandas'])
preprocess_data_step = components.create_component_from_func(
    func=preprocess_data,
    base_image="registry.redhat.io/ubi8/python-39",
    packages_to_install=['https://pypi-notebook.apps.hnalla.dev.datahub.redhat.com/packages/urllib3-2.0.7.tar.gz'])
…
@dsl.pipeline(name="my-pipeline")
def my_pipeline():
    idx_url = V1EnvVar(name='PIP_INDEX_URL', value='https://pypi-notebook.apps.hnalla.dev.datahub.redhat.com')
    trt_host = V1EnvVar(name='PIP_TRUSTED_HOST', value='pypi-notebook.apps.hnalla.dev.datahub.redhat.com')
    ingest_and_process_task = ingest_data_step().add_env_variable(idx_url).add_env_variable(trt_host)
    # This step installs from a direct URL, so it only needs the trusted host.
    preprocess_data_task = preprocess_data_step(ingest_and_process_task.output).add_env_variable(trt_host)
For PyPI mirrors with signed certificates, replace PIP_TRUSTED_HOST with the PIP_CERT environment variable, set to the path of the certificate file.
- This requires a secret containing the certificate file. The example below shows such a secret.
kind: Secret
apiVersion: v1
metadata:
  name: pypi-mirror-secret
stringData:
  ca.crt: |
    ### INSERT CERTIFICATE CONTENTS HERE ###
type: Opaque
- The secret must be mounted in the pipeline component. The example below shows a code snippet that mounts the secret as a volume.
from kubernetes.client.models import V1SecretVolumeSource, V1Volume, V1VolumeMount

idx_url = V1EnvVar(name='PIP_INDEX_URL',
                   value_from=V1EnvVarSource(
                       secret_key_ref=V1SecretKeySelector(
                           key="index-url", name="pip-conf")))
# PIP_CERT must match the mount path plus the key name in the secret.
pypi_cert = V1EnvVar(name='PIP_CERT', value='/certs/ca.crt')
vol = V1Volume(name='pypi-certs',
               secret=V1SecretVolumeSource(secret_name='pypi-mirror-secret'))
vol_mount = V1VolumeMount(name='pypi-certs', mount_path='/certs')
ingest_and_process_task = (ingest_data_step()
                           .add_env_variable(idx_url)
                           .add_env_variable(pypi_cert)
                           .add_volume_mount(vol_mount)
                           .add_volume(vol))
- Build a runtime image to use in the pipeline code. This solution involves building a new image from the UBI Python image, adding a pip.conf file under /etc/, and pushing the image to the image registry of the disconnected environment. An example pip.conf file looks like this:
[global]
index-url = https://artifactory-artifactory.apps.rmartine.dev.datahub.redhat.com/artifactory/api/pypi/disconnected-pypi/simple
trusted-host = artifactory-artifactory.apps.rmartine.dev.datahub.redhat.com
And the Dockerfile should look like this.
FROM registry.redhat.io/ubi8/python-39
USER root
ADD pip.conf /etc/
USER 1001
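The pip.conf shown above can also be generated by a script, which may help when the same image is rebuilt for several mirrors. The following is a minimal sketch that uses only the Python standard library; the function name render_pip_conf and its parameters are illustrative assumptions and are not part of pip or Data Science Pipelines.

```python
import configparser
import io
from urllib.parse import urlparse


def render_pip_conf(index_url: str, trust_host: bool = True) -> str:
    """Return pip.conf text with a [global] section for the given mirror.

    Hypothetical helper for illustration only.
    """
    host = urlparse(index_url).hostname
    if not host:
        raise ValueError(f"no host name found in index URL: {index_url!r}")

    # Interpolation is disabled so that '%' characters in URLs are kept as-is.
    config = configparser.ConfigParser(interpolation=None)
    config["global"] = {"index-url": index_url}
    if trust_host:
        # trusted-host takes the host name, without scheme or path.
        config["global"]["trusted-host"] = host

    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_pip_conf(
        "https://artifactory-artifactory.apps.rmartine.dev.datahub.redhat.com"
        "/artifactory/api/pypi/disconnected-pypi/simple"))
```

Writing the returned text to a pip.conf file next to the Dockerfile would let the ADD instruction pick it up during the image build.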
Special case: PyPI mirrors with authentication
For PyPI mirrors that require authentication, the best option is to create a key/value secret and reference it from the environment variable configuration in the pipeline code. The example below is the secret definition.
kind: Secret
apiVersion: v1
metadata:
  name: pip-conf
stringData:
  index-url: https://admin:password@pypi-notebook.apps.hnalla.dev.datahub.redhat.com
  trusted-host: pypi-notebook.apps.hnalla.dev.datahub.redhat.com
type: Opaque
And the pipeline code should look like this.
from kubernetes.client.models import V1EnvVar, V1EnvVarSource, V1SecretKeySelector
…
@dsl.pipeline(name="my-pipeline")
def my_pipeline():
    idx_url = V1EnvVar(name='PIP_INDEX_URL',
                       value_from=V1EnvVarSource(
                           secret_key_ref=V1SecretKeySelector(
                               key="index-url", name="pip-conf")))
    trt_host = V1EnvVar(name='PIP_TRUSTED_HOST',
                        value_from=V1EnvVarSource(
                            secret_key_ref=V1SecretKeySelector(
                                key="trusted-host", name="pip-conf")))
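Two details are easy to miss when preparing the secret values. Credentials embedded in an index URL need percent-encoding if they contain characters such as @ or /, and values stored under a Secret's data field must be base64-encoded (the stringData field accepts plain text). The sketch below illustrates both with the Python standard library; the function names authenticated_index_url and secret_data_value are hypothetical helpers, not part of any Red Hat or Kubernetes tooling.

```python
import base64
from urllib.parse import quote, urlsplit, urlunsplit


def authenticated_index_url(index_url: str, username: str, password: str) -> str:
    """Embed percent-encoded credentials into a mirror index URL."""
    parts = urlsplit(index_url)
    if not parts.hostname:
        raise ValueError(f"no host name found in index URL: {index_url!r}")

    # safe="" also encodes '/' and '@', which would otherwise break the URL.
    user = quote(username, safe="")
    secret = quote(password, safe="")

    host = parts.hostname
    if parts.port:
        host = f"{host}:{parts.port}"

    netloc = f"{user}:{secret}@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def secret_data_value(value: str) -> str:
    """Base64-encode a plain-text value for use under a Secret's data field."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    url = authenticated_index_url(
        "https://pypi-notebook.apps.hnalla.dev.datahub.redhat.com", "admin", "p@ss/word")
    print(url)
    print(secret_data_value(url))
```

The encoded form is only needed when the secret is written with the data field; the resulting environment variable in the pipeline step carries the decoded URL either way.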
Root Cause
The pip command needs special configuration to use private PyPI mirrors, and this configuration is not yet parameterized in the Data Science Pipelines installation.
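The environment variables used in this article work because pip reads any long command-line option from a variable named PIP_ followed by the option name in upper case, with dashes replaced by underscores. The sketch below shows that mapping; pip_env_var and pip_environment are hypothetical helper names introduced here for illustration.

```python
def pip_env_var(option: str) -> str:
    """Map a pip long option such as '--index-url' to its environment variable name."""
    name = option.lstrip("-")
    if not name:
        raise ValueError("option name must not be empty")
    return "PIP_" + name.upper().replace("-", "_")


def pip_environment(options: dict) -> dict:
    """Translate a mapping of pip long options into environment variables."""
    return {pip_env_var(option): value for option, value in options.items()}


if __name__ == "__main__":
    env = pip_environment({
        "--index-url": "https://pypi-notebook.apps.hnalla.dev.datahub.redhat.com",
        "--trusted-host": "pypi-notebook.apps.hnalla.dev.datahub.redhat.com",
    })
    for key, value in env.items():
        print(f"{key}={value}")
```

The same rule gives PIP_CERT for the --cert option, so other pip options could be injected into a pipeline step in the same way.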