Troubleshooting Guide: SAP Edge Integration Cell on OpenShift

Issue Documentation Template

When adding new issues to this guide, use this template:

### Issue: [Brief Description]

**Symptom**: What the user sees (error messages, unexpected behavior)
**Environment**: OpenShift version, Service Mesh version, specific conditions
**Root Cause**: Why this happens (if known)
**Solution**:
1. Step-by-step fix
2. Commands to run
3. Expected outputs

**Prevention**: How to avoid this in the future
**Validation**: How to confirm the fix worked
**Related Issues**: Links to similar problems
**Date Added**: YYYY-MM-DD
**Reporter**: [Name/Team]

ELM Deployment Issues

Issue: Image Replication Fails with Manifest Format Error - Quay Registry Compatibility

Symptom:

Action [Create Image Replications (...)] failed: Failed to create image replication - status [ERROR]
error message: cannot copy image <image>: Uploading manifest failed, attempted the following formats:
- application/vnd.oci.image.manifest.v1+json (manifest invalid)
- application/vnd.docker.distribution.manifest.v2+json (Unknown media type during manifest conversion: "application/vnd.docker.image.rootfs.diff.tar.gzip")
- application/vnd.docker.distribution.manifest.v1+prettyjws (Unknown media type during manifest conversion: "application/vnd.docker.image.rootfs.diff.tar.gzip")
- application/vnd.oci.image.index.v1+json (Unsupported conversion type)
- application/vnd.docker.distribution.manifest.list.v2+json (Unsupported conversion type)

Environment:

  • EIC version 8.36.2
  • OpenShift 4.20
  • Local Quay container registry (version < 3.12)
  • Source registry:

Root Cause: Quay registry versions below 3.12 cannot handle certain manifest formats used by the source SAP container images, causing manifest conversion failures during image replication.

Solution:

  1. Check your current Quay version:

       # If using Quay Operator
       oc get quayregistry -n quay-enterprise -o yaml | grep "desiredVersion\|currentVersion"
    
       # Or check Quay pod logs for version info
       oc logs -n quay-enterprise deployment/quay-registry-quay-app | grep -i version
    
  2. Upgrade Quay to version 3.12 or higher:

    For Quay Operator-managed deployments, the Quay version follows the Operator version, so move the Operator subscription to a newer channel:

    # The subscription name may differ; list subscriptions with:
    #   oc get subscription -n openshift-operators
    oc patch subscription quay-operator -n openshift-operators --type merge -p '{"spec":{"channel":"stable-3.12"}}'
    

    For standalone Quay deployments:

    • Follow Red Hat Quay upgrade documentation
    • Ensure backup of registry data before upgrade
    • Plan for potential downtime during upgrade
  3. Verify the upgrade:

       # Check that all Quay pods are running with new version
       oc get pods -n quay-enterprise
    
       # Verify Quay UI shows correct version
       # Access Quay web interface and check version in footer
    
  4. Retry the ELM deployment:

    • Once Quay is upgraded to 3.12+, retry the failed deployment step
    • The image replication should now succeed

Alternative Workarounds (if an immediate upgrade is not possible):

  1. Manual image copy with skopeo:

    # Install skopeo if not available
    # Use skopeo to copy images with format conversion
    skopeo copy --format v2s2 \
      docker://dockersrv.cdn.repositories.cloud.sap/com.sap.it.img/edge-ssb-operator:1.10.0 \
      docker://your-quay-registry.com/namespace/edge-ssb-operator:1.10.0
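
    After the copy completes, you can confirm what manifest format the target registry actually stored. A minimal sketch (the image path is the placeholder from the command above; adjust for your registry, and note the guard so the snippet is a no-op where skopeo is absent):

    ```shell
    #!/bin/bash
    # Hypothetical target image path; substitute your registry, namespace, and tag.
    IMAGE="docker://your-quay-registry.com/namespace/edge-ssb-operator:1.10.0"

    # Fetch the raw manifest and print its mediaType; a Docker schema2 type
    # (application/vnd.docker.distribution.manifest.v2+json) confirms the
    # --format v2s2 conversion took effect.
    if command -v skopeo >/dev/null 2>&1; then
      skopeo inspect --raw "$IMAGE" 2>/dev/null | grep -o '"mediaType": *"[^"]*"' | head -1 || true
    else
      echo "skopeo not installed; skipping manifest check"
    fi
    ```
    
    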
    

Prevention:

  • Always verify container registry compatibility before ELM deployment
  • Maintain Quay registry at version 3.12 or higher
  • Test image replication in non-production environment first

Validation:

# Verify successful image replication
oc get imagereplication -n edgelm

# Check that images are available in target registry
# Through Quay UI or API: GET /api/v1/repository/{namespace}/{repository}/tag/
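
Building on the API endpoint noted above, a sketch of checking tag availability from the command line (host, organization, and repository names are placeholders; an OAuth token is assumed in QUAY_TOKEN for the live call):

```shell
#!/bin/bash
# Hypothetical registry coordinates; substitute your own values.
QUAY_HOST="your-quay-registry.com"
ORG="namespace"
REPO="edge-ssb-operator"

# Tag-listing endpoint: GET /api/v1/repository/{namespace}/{repository}/tag/
TAG_URL="https://${QUAY_HOST}/api/v1/repository/${ORG}/${REPO}/tag/"
echo "$TAG_URL"

# Uncomment to query a live registry with a bearer token:
# curl -s -H "Authorization: Bearer ${QUAY_TOKEN}" "$TAG_URL"
```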

Related Issues:

  • Any OCI/Docker manifest compatibility issues with container registries
  • Registry upgrade planning and procedures

Date Added: 2025-10-29

Resolution Notes: This issue was initially suspected to be a problem with the source image build process, but investigation confirmed it was purely a registry compatibility issue. Upgrading Quay to version 3.12+ completely resolved the problem.

Issue: Solace Pod Fails to Start with NetApp NFS Storage Backend

Symptom:

# Solace pod stuck in pending or error state
oc get pods -n edge-icell-services | grep solace
solace-message-broker-xxx   0/1     Error    0          5m

# Pod events show storage-related errors
oc describe pod solace-message-broker-xxx -n edge-icell-services
Events:
  Warning  FailedMount  persistentvolume-controller  MountVolume.SetUp failed for volume "pvc-xxx" : mount failed: exit status 1
  Warning  FailedMount  kubelet  Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[data token]: timed out waiting for the condition

Environment:

  • SAP Edge Integration Cell deployment
  • NetApp NFS as storage backend for PVCs
  • OpenShift cluster with NFS-based StorageClass configured as default

Root Cause: The Solace message broker component within Edge Integration Cell does not support NFS storage due to performance and file locking requirements. NFS lacks the I/O characteristics and file system semantics required for Solace's persistent storage operations.

Solution:

  1. Configure alternative storage backend (Recommended):

    Option A: Use iSCSI with NetApp ONTAP:

    # Install the NetApp Trident operator, then define an iSCSI backend.
    # TridentBackendConfig takes credentials from a referenced Secret,
    # not inline username/password fields.
    apiVersion: trident.netapp.io/v1
    kind: TridentBackendConfig
    metadata:
      name: backend-ontap-san
      namespace: trident
    spec:
      version: 1
      storageDriverName: ontap-san
      managementLIF: "192.168.1.100"
      svm: "svm_iscsi"
      sanType: "iscsi"
      credentials:
        name: backend-ontap-san-secret
    

    Option B: Use NVMe with NetApp ONTAP (if supported):

    # Configure an NVMe/TCP backend
    # Ensure OpenShift nodes have NVMe tools (nvme-cli) installed
    apiVersion: trident.netapp.io/v1
    kind: TridentBackendConfig
    metadata:
      name: backend-ontap-nvme
      namespace: trident
    spec:
      version: 1
      storageDriverName: ontap-san
      managementLIF: "192.168.1.100"
      svm: "svm_nvme"
      sanType: "nvme"
      credentials:
        name: backend-ontap-nvme-secret
    
  2. Create appropriate StorageClass:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: netapp-block-storage
      annotations:
        storageclass.kubernetes.io/is-default-class: "false"
    provisioner: csi.trident.netapp.io
    parameters:
      backendType: "ontap-san"
      fsType: "ext4"
    allowVolumeExpansion: true
    volumeBindingMode: Immediate
    
  3. Update Edge Integration Cell configuration:

    # Specify the block storage class in EIC deployment configuration
    # This may require updating the EIC configuration in ELM UI
    # or modifying the deployment manifests to use the correct StorageClass
    
  4. Verify storage configuration:

       # Check available storage classes
       oc get storageclass
    
       # Verify the block storage class is available
       oc describe storageclass netapp-block-storage
    
       # Check Trident backend status
       oc get tridentbackendconfig -n trident
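
    Before retrying the EIC deployment, the new StorageClass can be sanity-checked with a throwaway PVC. A minimal sketch (the claim name and namespace are placeholders):

    ```yaml
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: storage-smoke-test
      namespace: default
    spec:
      accessModes:
        - ReadWriteOnce
      storageClassName: netapp-block-storage
      resources:
        requests:
          storage: 1Gi
    ```

    Apply it with `oc apply -f pvc.yaml`, confirm it reaches `Bound` via `oc get pvc storage-smoke-test -n default`, then delete it.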
    

Alternative Workarounds:

  1. Use local storage (for testing/development only):

    # Local storage class for non-production environments only.
    # kubernetes.io/no-provisioner does not provision volumes dynamically,
    # so matching local PersistentVolumes must be created by hand.
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: local-storage
    provisioner: kubernetes.io/no-provisioner
    volumeBindingMode: WaitForFirstConsumer
    
  2. Use other supported storage providers:

    • Red Hat OpenShift Data Foundation (ODF)
    • VMware vSphere CSI
    • Amazon EBS (for AWS deployments)
    • Azure Disk (for Azure deployments)

Prevention:

  • Always use block storage for Edge Integration Cell deployments
  • Review storage requirements in SAP Edge Integration Cell documentation before deployment
  • Test storage performance with Solace requirements in non-production environment
  • Follow Red Hat's storage recommendations for OpenShift workloads

Validation:

# Verify Solace pods are running successfully
oc get pods -n edge-icell-services | grep solace

# Check PVC is bound to block storage
oc get pvc -n edge-icell-services
oc describe pvc <solace-pvc-name> -n edge-icell-services | grep -A 5 "StorageClass"

# Verify storage backend type
oc get pv <pv-name> -o yaml | grep -A 10 "csi:"

Related Issues:

  • Performance issues with database workloads on NFS storage
  • File locking problems with message broker components
  • Storage class configuration and selection

Date Added: 2025-10-29
Reporter: SAP Edge Integration Team

Resolution Notes: Using NFS storage is NOT recommended for configuring the Message Service Storage Class when deploying SAP Edge Integration Cell. The Solace component within Edge Integration Cell specifically advises against NFS usage due to performance and file system requirements. Block storage backends such as iSCSI or NVMe should be used instead.

Issue: "Permission Denied" on ODF CephFS RWX Shared Volumes

Symptom: SAP EIC pods (edge-api, worker, edc, etc.) sharing a CephFS RWX volume receive "Permission Denied" errors when accessing /mnt/diagnostics/ and /mnt/dumps/. This causes the Diagnostic Task feature to fail.

Environment: OCP 4.x with ODF CephFS, pods using privileged SCC with seLinuxContext: RunAsAny

Root Cause: SELinux MCS label conflicts when multiple pods share a CephFS RWX volume. The CephFS CSI driver relabels the volume with each pod's MCS label during mount, causing access conflicts.

Solution: See Red Hat KB 7137220 for detailed resolution steps including:

  • Creating a StorageClass with kernelMountOptions (new deployments)
  • Patching existing PersistentVolumes (existing deployments)
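
A sketch of the StorageClass approach for new deployments (all parameter values are illustrative; confirm the exact cluster IDs, secret names, and SELinux context string against KB 7137220 and your ODF installation):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ocs-storagecluster-cephfs-selinux
provisioner: openshift-storage.cephfs.csi.ceph.com
parameters:
  clusterID: openshift-storage
  fsName: ocs-storagecluster-cephfilesystem
  # Mount every pod with the same SELinux context so the CSI driver
  # does not relabel the shared RWX volume per pod:
  kernelMountOptions: 'context="system_u:object_r:container_file_t:s0"'
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
reclaimPolicy: Delete
allowVolumeExpansion: true
```

PVCs for the shared diagnostics volumes would then reference this class so all pods mount with a single shared MCS label.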

Date Added: 2026-01-29


General Troubleshooting Procedures

Debug Information Collection

When encountering issues, collect this information:

# Cluster information
oc version
oc get nodes
oc get clusterversion

# Operator status
oc get csv -n openshift-operators

# Service Mesh 3.x status (OSSM3)
# OSSM3 uses Istio custom resources instead of SMCP/SMMR (OSSM2).
oc get istio -n istio-system
oc describe istio/default -n istio-system
oc get pods -n istio-system
oc get pods -n istio-cni 2>/dev/null || true

# Application namespaces
oc get namespaces | grep -E "(edgelm|edge-icell)"
oc get pods --all-namespaces | grep -E "(edgelm|edge-icell)"

# RBAC status
oc get clusterroles | grep edgelm
oc get clusterrolebindings | grep edgelm
oc get serviceaccounts -n edgelm

# Recent events (sorted oldest to newest; tail shows the most recent)
oc get events --sort-by='.lastTimestamp' | tail -20

Log Collection

# Service Mesh operator logs (OSSM3)
# Operator pod/deployment names differ by installation. Find the pod first, then fetch logs:
oc get pods -n openshift-operators | grep -i "servicemesh\\|istio"
oc logs -n openshift-operators <operator-pod-name> --tail=100

# Control plane logs
oc get pods -n istio-system | grep -i istiod
oc logs -n istio-system <istiod-pod-name> --tail=100

# Application logs (if pods exist)
oc logs -n edgelm --selector app=edgelm --tail=50
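
For support cases, a must-gather bundle collects cluster-wide diagnostics in one step. A sketch (the destination directory is a placeholder, and the guard makes the snippet a no-op where oc is unavailable):

```shell
#!/bin/bash
# Destination for the diagnostic bundle; adjust as needed.
DEST="./must-gather-eic"

# oc adm must-gather captures node, operator, and control-plane state
# into the destination directory for attachment to a support case.
if command -v oc >/dev/null 2>&1; then
  oc adm must-gather --dest-dir="$DEST" || true
else
  echo "oc not installed; run this from a workstation with cluster access"
fi
```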

Validation and Testing Procedures

Complete Setup Validation

Run this comprehensive check after completing the setup:

#!/bin/bash
echo "=== Namespace Validation ==="
oc get namespaces | grep -E "(edgelm|edge-icell|istio-)"

echo "=== Service Mesh Validation ==="
oc get istio -n istio-system
oc get pods -n istio-system --no-headers | wc -l

echo "=== RBAC Validation ==="
oc get clusterroles | grep edgelm | wc -l
oc get serviceaccount edgelm -n edgelm

echo "=== Authentication Validation ==="
oc --kubeconfig=edgelm-kubeconfig auth can-i list pods -n edgelm

echo "=== Overall Health Check ==="
oc get pods --all-namespaces | grep -E "(Error|CrashLoop|Pending)"

Expected outputs:

  • 6 namespaces created
  • Istio control plane resource exists and is "Ready" (or reports healthy status)
  • Multiple istio-system pods running
  • Several edgelm clusterroles found
  • Service account exists
  • Authentication test returns "yes"
  • No pods in error states