Troubleshooting Guide for SAP Data Intelligence 3 on OpenShift Container Platform 4

1. Troubleshooting installation, upgrade and restore problems

1.1. Unassigned privileged security context

If no pods, replica sets, or stateful sets are coming up and you see an event similar to the one below, you need to assign the privileged security context constraint (SCC) to the corresponding service account.

   # oc get events | grep securityContext
   1m          32m          23        diagnostics-elasticsearch-5b5465ffb.156926cccbf56887                          ReplicaSet                                                                            Warning   FailedCreate             replicaset-controller                  Error creating: pods "diagnostics-elasticsearch-5b5465ffb-" is forbidden: unable to validate against any security context constraint: [spec.initContainers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed spec.initContainers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed spec.initContainers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]

Copy the name in the fourth column (the event name - diagnostics-elasticsearch-5b5465ffb.156926cccbf56887) and determine its corresponding service account name.

   # eventname="diagnostics-elasticsearch-5b5465ffb.156926cccbf56887"
   # oc get -o go-template=$'{{with .spec.template.spec.serviceAccountName}}{{.}}{{else}}default{{end}}\n' \
       "$(oc get events "${eventname}" -o jsonpath='{.involvedObject.kind}/{.involvedObject.name}{"\n"}')"
   sdi-elasticsearch

The obtained service account (sdi-elasticsearch) now needs to be assigned the privileged SCC:

   # oc adm policy add-scc-to-user privileged -z sdi-elasticsearch

The pod should then come up on its own, provided this was the only problem.
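
When several components are affected, the event names can be collected in one pass. The following is a minimal sketch, not part of the product tooling: the helper name scc_denied_events is hypothetical, and it assumes the event name is in the fourth column of the oc get events output, as in the example above.

```shell
# Print the names of events that report an SCC validation failure.
# Reads `oc get events` output on stdin; assumes the event name is in
# the fourth column, as in the example output above.
scc_denied_events() {
    awk '/unable to validate against any security context constraint/ { print $4 }' | sort -u
}
```

For example:

   # oc get events | scc_denied_events

Each printed name can then be used as the eventname in the commands above.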

1.2. No default storage class set

If pods are failing because PVCs are not being bound, the problem may be that the default storage class has not been set and no storage class was specified to the installer.

   # oc get pods
   NAME                                                  READY     STATUS    RESTARTS   AGE
   hana-0                                                0/1       Pending   0          45m
   vora-consul-0                                         0/1       Pending   0          45m
   vora-consul-1                                         0/1       Pending   0          45m
   vora-consul-2                                         0/1       Pending   0          45m


   # oc describe pvc data-hana-0
   Name:          data-hana-0
   Namespace:     sdi
   StorageClass:
   Status:        Pending
   Volume:
   Labels:        app=vora
                  datahub.sap.com/app=hana
                  vora-component=hana
   Annotations:   <none>
   Finalizers:    [kubernetes.io/pvc-protection]
   Capacity:
   Access Modes:
   Events:
     Type    Reason         Age                  From                         Message
     ----    ------         ----                 ----                         -------
     Normal  FailedBinding  47s (x126 over 30m)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set

To fix this, either set a default StorageClass (see the OpenShift documentation for 4.12 / 4.10) or provide the storage class name to the installer.
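
To check quickly whether a default storage class exists, the output of oc get storageclass can be filtered. This is a sketch under the assumption that the default class is marked with "(default)" right after its name in that output; the helper name default_storage_class is hypothetical.

```shell
# Print the name of the default storage class, if there is one.
# Reads `oc get storageclass` output on stdin, where the default class
# is marked with "(default)" right after its name.
default_storage_class() {
    awk 'NR > 1 && $2 == "(default)" { print $1 }'
}
```

For example:

   # oc get storageclass | default_storage_class

An empty output suggests that no default is set. In that case, a class can be marked as default with the storageclass.kubernetes.io/is-default-class annotation, or its name can be passed to the installer.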

1.3. vsystem-app pods are unable to load

If you have SELinux in enforcing mode, you may see the pods launched by vsystem crash-looping because of the container named vsystem-iptables, as shown below:

   # oc get pods
   NAME                                                          READY     STATUS             RESTARTS   AGE
   auditlog-59b4757cb9-ccgwh                                     1/1       Running            0          40m
   datahub-app-db-gzmtb-67cd6c56b8-9sm2v                         2/3       CrashLoopBackOff   11         34m
   datahub-app-db-tlwkg-5b5b54955b-bb67k                         2/3       CrashLoopBackOff   10         30m
   ...
   internal-comm-secret-gen-nd7d2                                0/1       Completed          0          36m
   license-management-gjh4r-749f4bd745-wdtpr                     2/3       CrashLoopBackOff   11         35m
   shared-k98sh-7b8f4bf547-2j5gr                                 2/3       CrashLoopBackOff   4          2m
   ...
   vora-tx-lock-manager-7c57965d6c-rlhhn                         2/2       Running            3          40m
   voraadapter-lsvhq-94cc5c564-57cx2                             2/3       CrashLoopBackOff   11         32m
   voraadapter-qkzrx-7575dcf977-8x9bt                            2/3       CrashLoopBackOff   11         35m
   vsystem-5898b475dc-s6dnt                                      2/2       Running            0          37m

When you inspect one of those pods, you can see an error message similar to the one below:

   # oc logs voraadapter-lsvhq-94cc5c564-57cx2 -c vsystem-iptables
   2018-12-06 11:45:16.463220|+0000|INFO |Execute: iptables -N VSYSTEM-AGENT-PREROUTING -t nat||vsystem|1|execRule|iptables.go(56)
   2018-12-06 11:45:16.465087|+0000|INFO |Output: iptables: Chain already exists.||vsystem|1|execRule|iptables.go(62)
   Error: exited with status: 1
   Usage:
     vsystem iptables [flags]


   Flags:
     -h, --help               help for iptables
         --no-wait            Exit immediately after applying the rules and don't wait for SIGTERM/SIGINT.
         --rule stringSlice   IPTables rule which should be applied. All rules must be specified as string and without the iptables command.

In the audit log on the node where the pod was scheduled, you should be able to find an AVC denial similar to the following. On RHCOS nodes, you may need to inspect the output of the dmesg command instead.

   # grep 'denied.*iptab' /var/log/audit/audit.log
   type=AVC msg=audit(1544115868.568:15632): avc:  denied  { module_request } for  pid=54200 comm="iptables" kmod="ipt_REDIRECT" scontext=system_u:system_r:container_t:s0:c826,c909 tcontext=system_u:system_r:kernel_t:s0 tclass=system permissive=0
   ...
   # # on RHCOS
   # dmesg | grep denied

To fix this, the ipt_REDIRECT kernel module needs to be loaded. Please refer to Pre-load needed kernel modules.
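
To see which kernel modules were requested and denied, the kmod field can be extracted from the AVC records. The sketch below assumes the record format shown above; the helper name denied_kmods is hypothetical.

```shell
# Print the kernel modules whose loading was denied by SELinux.
# Reads audit.log or dmesg lines on stdin and extracts the kmod="..."
# value from records that mention module_request.
denied_kmods() {
    sed -n '/module_request/ s/.*kmod="\([^"]*\)".*/\1/p' | sort -u
}
```

For example:

   # grep denied /var/log/audit/audit.log | denied_kmods
   # # on RHCOS
   # dmesg | grep denied | denied_kmods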

1.4. Unable to initialize License Manager

The installation may fail with the following error:

   2019-07-22T15:07:29+0000 [INFO] Initializing system tenant...
   2019-07-22T15:07:29+0000 [INFO] Initializing License Manager in system tenant...2019-07-22T15:07:29+0000 [ERROR] Couldn't start License Manager!
   The response: {"status":500,"code":{"component":"router","value":8},"message":"Internal Server Error: see logs for more info"}Error: http status code 500 Internal Server Error (500)
   2019-07-22T15:07:29+0000 [ERROR] Failed to initialize vSystem, will retry in 30 sec...

In the log of the license management pod, you can find an error as shown below:

   # oc logs deploy/license-management-l4rvh
   Found 2 pods, using pod/license-management-l4rvh-74595f8c9b-flgz9
   + iptables -D PREROUTING -t nat -j VSYSTEM-AGENT-PREROUTING
   + true
   + iptables -F VSYSTEM-AGENT-PREROUTING -t nat
   + true
   + iptables -X VSYSTEM-AGENT-PREROUTING -t nat
   + true
   + iptables -N VSYSTEM-AGENT-PREROUTING -t nat
   iptables v1.6.2: can't initialize iptables table `nat': Permission denied
   Perhaps iptables or your kernel needs to be upgraded.

This means that the vsystem-iptables container in the pod lacks permission to manipulate iptables. Please make sure to pre-load the needed kernel modules.

1.5. Unable to start diagnostics prometheus node exporter pods

During an installation or upgrade, the Node Exporter pods may keep restarting:

   # oc get pods  | grep node-exporter
   diagnostics-prometheus-node-exporter-5rkm8                        0/1       CrashLoopBackOff   6          8m
   diagnostics-prometheus-node-exporter-hsww5                        0/1       CrashLoopBackOff   6          8m
   diagnostics-prometheus-node-exporter-jxxpn                        0/1       CrashLoopBackOff   6          8m
   diagnostics-prometheus-node-exporter-rbw82                        0/1       CrashLoopBackOff   7          8m
   diagnostics-prometheus-node-exporter-s2jsz                        0/1       CrashLoopBackOff   6          8m

A possible reason is that the resource limits set on the pods are too low. To address this post-installation, you can patch the DaemonSet in the SDI namespace as shown below:

   # oc patch -p '{"spec": {"template": {"spec": {"containers": [
       { "name": "diagnostics-prometheus-node-exporter",
         "resources": {"limits": {"cpu": "200m", "memory": "100M"}}
       }]}}}}' ds/diagnostics-prometheus-node-exporter
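
If different limits need to be tried, the patch document can be generated from parameters. This is only a sketch: the helper name node_exporter_limits_patch is hypothetical, and the values passed to it are up to the administrator.

```shell
# Print a DaemonSet patch that sets the Node Exporter resource limits.
# Arguments: CPU limit and memory limit, for example 200m and 100M.
node_exporter_limits_patch() {
    printf '{"spec": {"template": {"spec": {"containers": [{"name": "diagnostics-prometheus-node-exporter", "resources": {"limits": {"cpu": "%s", "memory": "%s"}}}]}}}}\n' "$1" "$2"
}
```

For example:

   # oc patch -n "${SDI_NAMESPACE:-sdi}" -p "$(node_exporter_limits_patch 200m 100M)" \
       ds/diagnostics-prometheus-node-exporter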

To address this during the installation (using any installation method), add the following parameters:

   -e=vora-diagnostics.resources.prometheusNodeExporter.resources.limits.cpu=200m
   -e=vora-diagnostics.resources.prometheusNodeExporter.resources.limits.memory=100M

1.6. Builds are failing in the Pipeline Modeler

If the graph builds hang in the Pending state or fail completely, you may find the following pod not coming up in the sdi namespace because its image cannot be pulled from the registry:

   # oc get pods | grep vflow
   datahub.post-actions.validations.validate-vflow-9s25l             0/1     Completed          0          14h
   vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2                  0/1     ImagePullBackOff   0          21s
   vflow-graph-9958667ba5554dceb67e9ec3aa6a1bbb-com-sap-demo-dljzk   1/1     Running            0          94m
   # oc describe pod/vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2 | sed -n '/^Events:/,$p'
   Events:
     Type     Reason     Age                From                    Message
     ----     ------     ----               ----                    -------
     Normal   Scheduled  30s                default-scheduler       Successfully assigned sdi/vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2 to sdi-moworker3
     Normal   BackOff    20s (x2 over 21s)  kubelet, sdi-moworker3  Back-off pulling image "container-image-registry-sdi-observer.apps.morrisville.ocp.vslen/sdi3modeler-blue/vora/vflow-node-f87b598586d430f955b09991fc1173f716be17b9:3.0.23-com.sap.sles.base-20200617-174600"
     Warning  Failed     20s (x2 over 21s)  kubelet, sdi-moworker3  Error: ImagePullBackOff
     Normal   Pulling    6s (x2 over 21s)   kubelet, sdi-moworker3  Pulling image "container-image-registry-sdi-observer.apps.morrisville.ocp.vslen/sdi3modeler-blue/vora/vflow-node-f87b598586d430f955b09991fc1173f716be17b9:3.0.23-com.sap.sles.base-20200617-174600"
     Warning  Failed     6s (x2 over 21s)   kubelet, sdi-moworker3  Failed to pull image "container-image-registry-sdi-observer.apps.morrisville.ocp.vslen/sdi3modeler-blue/vora/vflow-node-f87b598586d430f955b09991fc1173f716be17b9:3.0.23-com.sap.sles.base-20200617-174600": rpc error: code = Unknown desc = Error reading manifest 3.0.23-com.sap.sles.base-20200617-174600 in container-image-registry-sdi-observer.apps.morrisville.ocp.vslen/sdi3modeler-blue/vora/vflow-node-f87b598586d430f955b09991fc1173f716be17b9: unauthorized: authentication required
     Warning  Failed     6s (x2 over 21s)   kubelet, sdi-moworker3  Error: ErrImagePull

To fix this, link the secret for the modeler's registry to the service account associated with the failed pod; in this case, the default one.

   # oc get -n "${SDI_NAMESPACE:-sdi}" -o jsonpath='{.spec.serviceAccountName}{"\n"}' \
       pod/vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2
   default
   # oc create secret -n "${SDI_NAMESPACE:-sdi}" docker-registry sdi-registry-pull-secret \
       --docker-server=container-image-registry-sdi-observer.apps.morrisville.ocp.vslen \
       --docker-username=user-n5137x --docker-password=ec8srNF5Pf1vXlPTRLagEjRRr4Vo3nIW
   # oc secrets link -n "${SDI_NAMESPACE:-sdi}" --for=pull default sdi-registry-pull-secret
   # oc delete -n "${SDI_NAMESPACE:-sdi}" pod/vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2

Also, please make sure to restart the Pipeline Modeler and any failing graph builds in the affected tenant.
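
The value for --docker-server can be taken from the image reference reported in the pull failure events. A minimal sketch, assuming the reference starts with the registry host as in the events above; the helper name image_registry_host is hypothetical.

```shell
# Print the registry host part of an image reference, that is,
# everything before the first "/".
image_registry_host() {
    printf '%s\n' "${1%%/*}"
}
```

For the image in the example events, this prints container-image-registry-sdi-observer.apps.morrisville.ocp.vslen. A reference without a registry host would yield its first path component instead, so the result should be checked before use.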

1.7. Failure of container

If pods fail with an error similar to the one below, the containers are most likely not allowed to run under the desired UID.

   # oc get pods
   NAME                                READY   STATUS             RESTARTS   AGE
   datahub.checks.checkpoint-m82tj     0/1     Completed          0          12m
   vora-textanalysis-6c9789756-pdxzd   0/1     CrashLoopBackOff   6          9m18s
   # oc logs vora-textanalysis-6c9789756-pdxzd
   Traceback (most recent call last):
     File "/dqp/scripts/start_service.py", line 413, in <module>
       sys.exit(Main().run())
     File "/dqp/scripts/start_service.py", line 238, in run
       **global_run_args)
     File "/dqp/python/dqp_services/services/textanalysis.py", line 20, in run
       trace_dir = utils.get_trace_dir(global_trace_dir, self.config)
     File "/dqp/python/dqp_utils.py", line 90, in get_trace_dir
       return get_dir(global_trace_dir, conf.trace_dir)
     File "/dqp/python/dqp_utils.py", line 85, in get_dir
       makedirs(config_value)
     File "/usr/lib64/python2.7/os.py", line 157, in makedirs
       mkdir(name, mode)
   OSError: [Errno 13] Permission denied: 'textanalysis'

To remedy that, be sure to apply all the oc adm policy add-scc-to-* commands from the project setup section. The one that has not been applied in this case is:

   # oc adm policy add-scc-to-group anyuid "system:serviceaccounts:$(oc project -q)"

1.8. Failure of jobs during installation or upgrade

If the installation jobs are failing with the following error, either the anyuid security context constraint has not been applied or the cluster is too old.

   # oc logs solution-reconcile-vsolution-vsystem-ui-3.0.9-vnnbf
   Error: mkdir /.vsystem: permission denied.
   2020-03-05T15:51:18+0000 [WARN] Could not login to vSystem!
   2020-03-05T15:51:23+0000 [INFO] Retrying...
   Error: mkdir /.vsystem: permission denied.
   2020-03-05T15:51:23+0000 [WARN] Could not login to vSystem!
   2020-03-05T15:51:28+0000 [INFO] Retrying...
   Error: mkdir /.vsystem: permission denied.
   ...
   2020-03-05T15:52:13+0000 [ERROR] Timeout while waiting to login to vSystem...

The reason is that the vctl binary in the containers determines the HOME directory of its user from /etc/passwd. When the container does not run with the desired UID, the value is incorrectly set to /. The binary then lacks permission to write to the root directory.

To remedy that, make sure that the following conditions are met:

  1. Make sure that you are running OpenShift cluster 4.2.32 or newer.
  2. Make sure that the anyuid SCC has been applied to the SDI namespace. To verify this, make sure the service accounts group of the SDI namespace (system:serviceaccounts:<namespace>) is listed in the output of the following command:
      # oc get -o json scc/anyuid | jq -r '.groups[]'
       system:cluster-admins
       system:serviceaccounts:sdi

When the jobs are rerun, the anyuid SCC will be assigned to them:

      # oc get pods -n "${SDI_NAMESPACE:-sdi}" -o json | jq -r '.items[] | select((.metadata.ownerReferences // []) |
           any(.kind == "Job")) | "\(.metadata.name)\t\(.metadata.annotations["openshift.io/scc"])"' | column -t
       datahub.voracluster-start-1d3ffe-287c16-d7h7t                    anyuid
       datahub.voracluster-start-b3312c-287c16-j6g7p                    anyuid
       datahub.voracluster-stop-5a6771-6d14f3-nnzkf                     anyuid
       ...
       strategy-reconcile-strat-system-3.0.34-3.0.34-pzn79              anyuid
       tenant-reconcile-default-3.0.34-wjlfs                            anyuid
       tenant-reconcile-system-3.0.34-gf7r4                             anyuid
       vora-config-init-qw9vc                                           anyuid
       vora-dlog-admin-f6rfg                                            anyuid
  3. Additionally, please make sure that all the other oc adm policy add-scc-to-* commands listed in the project setup have been applied to the same $SDI_NAMESPACE.
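
The first two conditions above can be checked with small helpers. This is a sketch only: the function names version_ge and has_sa_group are hypothetical, and version_ge relies on sort -V, which is assumed to be available (GNU coreutils).

```shell
# Succeed if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (GNU coreutils) for version ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Succeed if the service accounts group of namespace $1 is listed.
# Reads the group list (one group per line) on stdin.
has_sa_group() {
    grep -qxF -- "system:serviceaccounts:$1"
}
```

For example:

   # version_ge "$(oc get clusterversion version -o jsonpath='{.status.desired.version}')" 4.2.32 \
       && echo "cluster version OK"
   # oc get -o json scc/anyuid | jq -r '.groups[]' | has_sa_group "${SDI_NAMESPACE:-sdi}" \
       && echo "anyuid applied"

A version comparison is used rather than a string comparison because, for instance, 4.10 is newer than 4.2.32 although it sorts earlier as plain text.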

1.9. vsystem-vrep is unable to export NFS on RHCOS

If vsystem-vrep-0 pod fails with the following error, it means it is unable to start an NFS server on top of overlayfs.

   # oc logs -n ocpsdi1 vsystem-vrep-0 vsystem-vrep
   2020-07-13 15:46:05.054171|+0000|INFO |Starting vSystem version 2002.1.15-0528, buildtime 2020-05-28T18:5856, gitcommit ||vsystem|1|main|server.go(107)
   2020-07-13 15:46:05.054239|+0000|INFO |Starting Kernel NFS Server||vrep|1|Start|server.go(83)
   2020-07-13 15:46:05.108868|+0000|INFO |Serving liveness probe at ":8739"||vsystem|9|func2|server.go(149)
   2020-07-13 15:46:10.303625|+0000|WARN |no backup or restore credentials mounted, not doing backup and restore||vsystem|1|NewRcloneBackupRestore|backup_restore.go(76)
   2020-07-13 15:46:10.311488|+0000|INFO |vRep components are initialised successfully||vsystem|1|main|server.go(249)
   2020-07-13 15:46:10.311617|+0000|ERROR|cannot parse duration from "SOLUTION_LAYER_CLEANUP_DELAY" env variable: time: invalid duration ||vsystem|16|CleanUpSolutionLayersJob|manager.go(351)
   2020-07-13 15:46:10.311719|+0000|INFO |Background task for cleaning up solution layers will be triggered every 12h0m0s||vsystem|16|CleanUpSolutionLayersJob|manager.go(358)
   2020-07-13 15:46:10.312402|+0000|INFO |Recreating volume mounts||vsystem|1|RemountVolumes|volume_service.go(339)
   2020-07-13 15:46:10.319334|+0000|ERROR|error re-loading NFS exports: exit status 1
   exportfs: /exports does not support NFS export||vrep|1|AddExportsEntry|server.go(162)
   2020-07-13 15:46:10.319991|+0000|FATAL|Error creating runtime volume: error exporting directory for runtime data via NFS: export error||vsystem|1|Fail|termination.go(22)

There are two solutions to the problem. Both of them result in an additional volume mounted at /exports, which is the root directory of all exports.

1.10. Kaniko unable to push images to a registry

Symptoms

  • Kaniko is enabled in SDI (mandatory on OpenShift 4).
  • Registry is secured by TLS with a self-signed certificate.
  • Other SDI and OpenShift components can use the registry without issues.
  • The pipeline modeler crashes with a traceback preceded by the following error:
      # oc logs -f -c vflow  "$(oc get pods -o name \
         -l vsystem.datahub.sap.com/template=pipeline-modeler | head -n 1)" | grep 'push permissions'
       error checking push permissions -- make sure you entered the correct tag name, and that you are authenticated correctly, and try again: checking push permission for "container-image-registry-miminar-sdi-observer.apps.sydney.example.com/vora/vflow-node-f87b598586d430f955b09991fc1173f716be17b9:3.0.27-com.sap.sles.base-20201001-102714": BLOB_UPLOAD_UNKNOWN: blob upload unknown to registry

Resolution

The root cause has not been identified yet. To work around it, the modeler shall be configured to use an insecure registry accessible via plain HTTP (without TLS) and requiring no authentication. Such a registry can be provisioned with SDI Observer. If the existing registry is provisioned by SDI Observer, one can modify it in such a way that it requires no authentication, as shown below:

  1. Initiate an update of SDI Observer.
  2. Re-configure sdi-observer for no authentication:
      # oc set env -n "${NAMESPACE:-sdi-observer}" SDI_REGISTRY_AUTHENTICATION=none dc/sdi-observer
  3. Wait until the registry gets re-deployed.
  4. Verify that the registry is running and that neither REGISTRY_AUTH_HTPASSWD_REALM nor REGISTRY_AUTH_HTPASSWD_PATH is present in the output of the following command:
      # oc set env -n "${NAMESPACE:-sdi-observer}" --list dc/container-image-registry
       REGISTRY_HTTP_SECRET=mOjuXMvQnyvktGLeqpgs5f7nQNAiNMEE
  5. Note the registry service address, which can be determined as shown below:
      # # <service-name>.<namespace>.svc.<cluster-domain>:<service-port>
       # oc project "${NAMESPACE:-sdi-observer}"
       # printf "$(oc get -o jsonpath='{.metadata.name}.{.metadata.namespace}.svc.%s:{.spec.ports[0].port}' \
               svc container-image-registry)\n" \
           "$(oc get dnses.operator.openshift.io/default -o jsonpath='{.status.clusterDomain}')"
       container-image-registry.sdi-observer.svc.cluster.local:5000
  6. Verify that the service is responsive over plain HTTP from inside the OpenShift cluster and requires no authentication:
      # registry_url=http://container-image-registry.sdi-observer.svc.cluster.local:5000
       # oc rsh -n openshift-authentication "$(oc get pods -n openshift-authentication | \
           awk '/oauth-openshift.*Running/ {print $1; exit}')" curl -I "$registry_url"
       HTTP/1.1 200 OK
       Content-Length: 2
       Content-Type: application/json; charset=utf-8
       Docker-Distribution-Api-Version: reg

Note: The service URL is not reachable from outside of the OpenShift cluster.

  7. For each SDI tenant using the registry:

    1. Log in to the tenant as an administrator and open System Management.

    2. View Application Configuration and Secrets.

       (Figure: Access Application Configuration and Secrets)

    3. Set the following properties to the registry address:

      • Modeler: Base registry for pulling images
      • Modeler: Docker registry for Modeler images

    4. Unset the following properties:

      • Modeler: Name of the vSystem secret containing the credentials for Docker registry
      • Modeler: Docker image pull secret for Modeler

       The end result should look like this:

       (Figure: Modified registry parameters for Modeler)

    5. Return to "Applications" in System Management and select Modeler.

    6. Delete all the instances.

    7. Create a new instance with the plus button.

    8. Access the instance to verify that it is working.
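
The verification that the registry no longer uses htpasswd authentication can also be scripted. The following sketch assumes the NAME=value format printed by oc set env --list, as shown above; the helper name registry_auth_disabled is hypothetical.

```shell
# Succeed if no htpasswd authentication variables are present.
# Reads `oc set env --list` output (NAME=value lines) on stdin.
registry_auth_disabled() {
    ! grep -qE '^REGISTRY_AUTH_HTPASSWD_(REALM|PATH)='
}
```

For example:

   # oc set env -n "${NAMESPACE:-sdi-observer}" --list dc/container-image-registry | \
       registry_auth_disabled && echo "authentication disabled"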

1.11. Failure of SLCBridge pod to deploy

If the initialization phase of Software Lifecycle Container Bridge fails with an error like the one below, you are probably running SLCB version 1.1.53 configured to push to a registry requiring basic authentication.

   *************************************************
   * Executing Step WaitForK8s SLCBridgePod Failed *
   *************************************************


     Execution of step WaitForK8s SLCBridgePod failed
     Synchronizing Deployment slcbridgebase failed (pod "slcbridgebase-5bcd7946f4-t6vfr" failed) [1.116647047s]
     .
     Choose "Retry" to retry the step.
     Choose "Rollback" to undo the steps done so far.
     Choose "Cancel" to cancel deployment immediately.


   # oc logs -n sap-slcbridge -c slcbridge -l run=slcbridge --tail=13
   ----------------------------
   Code: 401
   Scheme: basic
   "realm": "basic-realm"
   {"errors":[{"code":"UNAUTHORIZED","message":"authentication required","detail":null}]}
   ----------------------------
   2020-09-29T11:49:33.346Z        INFO    images/registry.go:182  Access check of registry "container-image-registry-sdi-observer.apps.sydney.example.com" returned AuthNeedBasic
   2020-09-29T11:49:33.346Z        INFO    slp/server.go:199       Shutting down server
   2020-09-29T11:49:33.347Z        INFO    hsm/hsm.go:125  Context closed
   2020-09-29T11:49:33.347Z        INFO    hsm/state.go:56 Received Cancel
   2020-09-29T11:49:33.347Z        DEBUG   hsm/hsm.go:118  Leaving event loop
   2020-09-29T11:49:33.347Z        INFO    slp/server.go:208       Server shutdown complete
   2020-09-29T11:49:33.347Z        INFO    slcbridge/master.go:64  could not authenticate at registry SLP_BRIDGE_REPOSITORY container-image-registry-sdi-observer.apps.sydney.example.com
   2020-09-29T11:49:33.348Z        INFO    globals/goroutines.go:63        Shutdown complete (exit status 1).

More information can be found in SAP Note #2589449.

To fix this, please download the latest SLCB version (newer than 1.1.53) as described in SAP Note #2589449.

1.12. Failure of Kibana pod to start

When the Kibana pod is stuck in the CrashLoopBackOff status and the following error shows up in its log, you need to delete the existing index.

   # oc logs -n "${SDI_NAMESPACE:-sdi}" -c diagnostics-kibana -l datahub.sap.com/app-component=kibana --tail=5
   {"type":"log","@timestamp":"2020-10-07T14:40:23Z","tags":["status","plugin:ui_metric@7.3.0-SNAPSHOT","info"],"pid":1,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
   {"type":"log","@timestamp":"2020-10-07T14:40:23Z","tags":["status","plugin:visualizations@7.3.0-SNAPSHOT","info"],"pid":1,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
   {"type":"log","@timestamp":"2020-10-07T14:40:23Z","tags":["status","plugin:elasticsearch@7.3.0-SNAPSHOT","info"],"pid":1,"state":"green","message":"Status changed from yellow to green - Ready","prevState":"yellow","prevMsg":"Waiting for Elasticsearch"}
   {"type":"log","@timestamp":"2020-10-07T14:40:23Z","tags":["info","migrations"],"pid":1,"message":"Creating index .kibana_1."}
   {"type":"log","@timestamp":"2020-10-07T14:40:23Z","tags":["warning","migrations"],"pid":1,"message":"Another Kibana instance appears to be migrating the index. Waiting for that migration to complete. If no other Kibana instance is attempting migrations, you can get past this message by deleting index .kibana_1 and restarting Kibana."}

Please note the name of the index in the last warning message. In this case, it is .kibana_1. Run the following commands, with the proper index name at the end of the curl URL, to delete the index and then the Kibana pod.

   # oc exec -n "${SDI_NAMESPACE:-sdi}" -it diagnostics-elasticsearch-0 -c diagnostics-elasticsearch \
       -- curl -X DELETE 'http://localhost:9200/.kibana_1'
   # oc delete pod -n "${SDI_NAMESPACE:-sdi}" -l datahub.sap.com/app-component=kibana

The Kibana pod will be respawned and should be running within a few minutes, as long as the diagnostics pods it depends on are running as well.
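
The index name can also be extracted from the log instead of being read by eye. A minimal sketch, assuming the wording of the warning message shown above; the helper name kibana_stuck_index is hypothetical.

```shell
# Print the index name mentioned in Kibana's migration warning.
# Reads Kibana log lines on stdin.
kibana_stuck_index() {
    sed -n 's/.*deleting index \([^ ]*\) and restarting Kibana.*/\1/p' | sort -u
}
```

For example:

   # oc logs -n "${SDI_NAMESPACE:-sdi}" -c diagnostics-kibana \
       -l datahub.sap.com/app-component=kibana --tail=5 | kibana_stuck_index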

1.13. Fluentd pods are unable to access container logs

If you see the following errors, fluentd cannot access container logs on the hosts.

  • Error from SLC Bridge:
      2021-01-26T08:28:49.810Z  INFO  cmd/cmd.go:243  1> DataHub/kub-slcbridge/default [Pending]
       2021-01-26T08:28:49.810Z  INFO  cmd/cmd.go:243  1> └── Diagnostic/kub-slcbridge/default [Failed]  [Start Time:  2021-01-25 14:26:03 +0000 UTC]
       2021-01-26T08:28:49.811Z  INFO  cmd/cmd.go:243  1>     └── DiagnosticDeployment/kub-slcbridge/default [Failed]  [Start Time:  2021-01-25 14:26:29 +0000 UTC]
       2021-01-26T08:28:49.811Z  INFO  cmd/cmd.go:243  1>
       2021-01-26T08:28:55.989Z  INFO  cmd/cmd.go:243  1> DataHub/kub-slcbridge/default [Pending]
       2021-01-26T08:28:55.989Z  INFO  cmd/cmd.go:243  1> └── Diagnostic/kub-slcbridge/default [Failed]  [Start Time:  2021-01-25 14:26:03 +0000 UTC]
       2021-01-26T08:28:55.989Z  INFO  cmd/cmd.go:243  1>     └── DiagnosticDeployment/kub-slcbridge/default [Failed]  [Start Time:  2021-01-25 14:26:29 +0000 UTC]
  • Fluentd pod description:
      # oc describe pod diagnostics-fluentd-bb9j7
       Name:           diagnostics-fluentd-bb9j7
       …
         Warning  FailedMount  6m35s                 kubelet, compute-4  Unable to attach or mount volumes: unmounted volumes=[varlibdockercontainers], unattached volumes=[vartmp kub-slcbridge-fluentd-token-k5c9n settings varlog varlibdockercontainers]: timed out waiting for the condition
         Warning  FailedMount  2m1s (x2 over 4m19s)  kubelet, compute-4  Unable to attach or mount volumes: unmounted volumes=[varlibdockercontainers], unattached volumes=[varlibdockercontainers vartmp kub-slcbridge-fluentd-token-k5c9n settings varlog]: timed out waiting for the condition
         Warning  FailedMount  23s (x12 over 8m37s)  kubelet, compute-4  MountVolume.SetUp failed for volume "varlibdockercontainers" : hostPath type check failed: /var/lib/docker/containers is not a directory
  • Log from one of the pods:
      # oc logs $(oc get pods -o name -l datahub.sap.com/app-component=fluentd | head -n 1) | tail -n 20
         2019-04-15 18:53:24 +0000 [error]: unexpected error error="Permission denied @ rb_sysopen - /var/log/es-containers-sdh25-mortal-garfish.log.pos"
         2019-04-15 18:53:24 +0000 [error]: suppressed same stacktrace
         2019-04-15 18:53:25 +0000 [warn]: '@' is the system reserved prefix. It works in the nested configuration for now but it will be rejected: @timestamp
         2019-04-15 18:53:26 +0000 [error]: unexpected error error_class=Errno::EACCES error="Permission denied @ rb_sysopen - /var/log/es-containers-sdh25-mortal-garfish.log.pos"
         2019-04-15 18:53:26 +0000 [error]: /usr/lib64/ruby/gems/2.5.0/gems/fluentd-0.14.8/lib/fluent/plugin/in_tail.rb:151:in `initialize'
         2019-04-15 18:53:26 +0000 [error]: /usr/lib64/ruby/gems/2.5.0/gems/fluentd-0.14.8/lib/fluent/plugin/in_tail.rb:151:in `open'
       ...

Those errors are fixed automatically by SDI Observer; please make sure it is running and can access the SDI_NAMESPACE.

One can also apply a fix manually with the following commands:

   # oc -n "${SDI_NAMESPACE:-sdi}" patch dh default --type='json' -p='[
       { "op": "replace"
       , "path": "/spec/diagnostic/fluentd/varlibdockercontainers"
       , "value":"/var/log/pods" }]'
   # oc -n "${SDI_NAMESPACE:-sdi}" patch ds/diagnostics-fluentd -p '{"spec":{"template":{"spec":{
       "containers": [{"name":"diagnostics-fluentd", "securityContext":{"privileged": true}}]}}}}'

1.14. Failure of validation during the SDI installation

If the following error message is displayed at the end of SDI installation, it means that the pipeline modeler cannot communicate with the configured registry:

   ************************************
   * Executing Step Validation Failed *
   ************************************




   Execution of step Validation failed
   execution failed: status 1, error: time="2021-10-18T13:49:13Z" level=error msg="Job execution failed for job:
   datahub.post-actions.validations.validate-vflow, job failed with reason BackoffLimitExceeded:Job has reached the specified
   backoff limit"
   time="2021-10-18T13:49:13Z" level=error msg="Running script post-actions.validations.validate-vflow...Failed!"
   time="2021-10-18T13:49:13Z" level=error msg="Error: job failed with reason BackoffLimitExceeded:Job has reached the
   specified backoff limit"
   time="2021-10-18T13:50:00Z" level=error msg="Running post-actions/validations...Failed!"
   time="2021-10-18T13:50:00Z" level=fatal msg="Failed: there are failed scripts: post-actions.validations.validate-vflow"
   .
   Choose "Retry" to retry the step.
   Choose "Abort" to abort the SLC Bridge and return to the "Welcome" dialog.
   Choose "Cancel" to cancel the SLC Bridge immediately.


     Choose action Retry(r)/Abort(a)/<F1> for help: n

Often, it is due to the registry's CA certificate not being imported properly.

Verification

To verify that the certificate is correct, perform the following steps. If any of the steps fails, the certificate must be reconfigured.

  1. From the Management host, get the configured certificate from the SDI namespace:
      # oc get -n "${SDI_NAMESPACE:-sdi}" secret/cmcertificates \
               -o jsonpath='{.data.cert}' | base64 -d >cmcertificates.crt
  2. Verify the connection to the registry and its trustworthiness:
      # curl --cacert cmcertificates.crt -I https://<configured-registry-for-pipeline-modeler>/v2/

Example output for a trusted registry:

      HTTP/1.1 200 OK
       Content-Length: 2
       Content-Type: application/json; charset=utf-8
       Docker-Distribution-Api-Version: registry/2.0
       Date: Mon, 18 Oct 2021 14:55:37 GMT
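
When the check is scripted, the status code can be taken from the first header line. This is a sketch with a hypothetical helper name, http_status; it assumes the header format shown above.

```shell
# Print the HTTP status code from the first line of response headers.
# Reads `curl -I` output on stdin.
http_status() {
    awk 'NR == 1 { print $2 }'
}
```

For example:

   # curl --cacert cmcertificates.crt -sI https://<configured-registry-for-pipeline-modeler>/v2/ | http_status

If the certificate is not trusted, curl is expected to fail before printing any headers, so the output stays empty.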

Resolution

Update the trusted CA certificates.

  1. Ensure the registry is trusted with the correct CA bundle file:
      # curl --cacert correct-ca-bundle.crt -I https://<configured-registry-for-pipeline-modeler>/v2/
  2. (Optional) Update the secret directly or indirectly.
  • Directly as shown below:
          # oc create -n "${SDI_NAMESPACE:-sdi}" secret generic cmcertificates \
                   --from-file=cert=correct-ca-bundle.crt --dry-run=client -o json | \
               oc apply -f -
  • Indirectly using SDI Observer:

    1. Update the `run-observer-template.sh` for `CABUNDLE_PATH=./correct-ca-bundle.crt` and `INJECT_CABUNDLE=true`.
    2. Re-run the `run-observer-template.sh` script.
    
  3. Follow the Manage Certificates guide for your SDI release (3.3, 3.2, or 3.1) to import the correct-ca-bundle.crt via SDI Connection Management.
  4. Re-run the validation in the Software Lifecycle Bridge.
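
Before re-running the validation, it can be useful to confirm that the secret really holds the corrected bundle. The following is a minimal sketch under that assumption; secret_matches_bundle is a hypothetical helper name, not an SDI command.

```shell
# Sketch: confirm that the cmcertificates secret holds the corrected bundle.
# secret_matches_bundle is a hypothetical helper; $1 is the local bundle file.
secret_matches_bundle() {
    local bundle="${1:-correct-ca-bundle.crt}"
    # Decode the stored certificate and compare it byte for byte with the file.
    # The return status is that of cmp: 0 if identical, non-zero otherwise.
    oc get -n "${SDI_NAMESPACE:-sdi}" secret/cmcertificates \
        -o jsonpath='{.data.cert}' | base64 -d | cmp -s - "$bundle"
}
```

A non-zero return status means that the secret content differs from the local file, or that the secret could not be read.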

1.15. Crashing of Vora components

Symptoms

  • After a restoration from backup, Vora pods keep crashing and restarting:
      NAME                                   READY   STATUS      RESTARTS   AGE   IP             NODE            NOMINATED NODE   READINESS GATES
       vora-disk-0                            1/2     Running     3          37m   10.131.1.194   sdi-siworker1   <none>           <none>
       vora-relational-86b67c64b6-gn4pp       1/2     Running     7          37m   10.131.1.190   sdi-siworker1   <none>           <none>
       vora-tx-coordinator-5c6b45bb7b-qqnpn   1/2     Running     7          37m   10.130.2.249   sdi-siworker2   <none>           <none>
  • vora-disk-0 pod does not produce any output:
      # oc logs -c disk -f vora-disk-0
  • Local connection to vora-tx-coordinator cannot be established.

Resolution

The resolution is described in SAP Note 2918288 - SAP Data Intelligence Backup and Restore Note, in the section d030326, DIBUGS-11651, 2021-11-09 "Restoration fails with error in Vora disk engine".

1.16. Failure of image mirroring to Quay

During image mirroring to a local Quay registry, the upload of a blob may fail with the error message below. This is a known bug on the Quay side and will be addressed in future versions.

writing blob: initiating layer upload to /v2/sdimorrisville/com.sap.datahub.linuxx86_64/datahub-operator-installer-base/blobs/uploads/ in quay.apps.cluster.example.com: unauthorized: access to the requested resource is not authorized

Please retry the image mirroring until all the SAP images are successfully mirrored.
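
The retries can be automated with a small wrapper. The following is a minimal sketch: retry is a hypothetical helper, and the mirroring command itself is not shown here because it depends on how the images are mirrored in your environment.

```shell
# Sketch: re-run a command until it succeeds or the attempts are exhausted.
# retry is a hypothetical helper name.
# Usage: retry <max-attempts> <delay-seconds> <command> [<args>...]
retry() {
    local max="$1" delay="$2" attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after ${attempt} attempts" >&2
            return 1
        fi
        echo "attempt ${attempt} failed, retrying in ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

For example, retry 5 30 followed by your mirroring command runs it up to five times with a pause of 30 seconds between attempts.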

1.17. Failure of SLC Bridge init with Quay

As of SLC Bridge 1.1.71, the pull secrets are not created on the OpenShift side as long as a docker authentication file on the Management host contains the Quay registry.

Symptoms

  • SLC Bridge init fails with an error like the one below:
      *************************************************
       * Executing Step WaitForK8s SLCBridgePod Failed *
       *************************************************


       Execution of step WaitForK8s SLCBridgePod failed
       Synchronizing Deployment slcbridgebase failed (pod "slcbridgebase-6f985dcb87-6plql" failed) [517.957512ms]
       .
       Choose "Retry" to retry the step.
       Choose "Abort" to abort the SLC Bridge and return to the "Welcome" dialog.
       Choose "Cancel" to cancel the SLC Bridge immediately.


         Choose action Retry(r)/Abort(a)/<F1> for help: r
  • Pod images in the sap-slcbridge namespace cannot be pulled:
      # oc describe pod | sed -n '/^Events:/,$p'
       Events:
         Type     Reason          Age                From               Message
         ----     ------          ----               ----               -------
         Normal   Scheduled       61s                default-scheduler  Successfully assigned slcb-test/slcbridgebase-6f985dcb87-6plql to leworker1.cluster.example.com
         Normal   AddedInterface  61s                multus             Add eth0 [10.131.0.113/23] from openshift-sdn
         Warning  Failed          43s (x2 over 60s)  kubelet            Failed to pull image "quay.apps.cluster.example.com/sdi3/com.sap.sl.cbpod/nginx-sidecar:1.1.71": rpc error: code = Unknown desc = Error reading manifest 1.1.71 in quay.apps.cluster.example.com/sdi3/com.sap.sl.cbpod/nginx-sidecar: unauthorized: access to the requested resource is not authorized
         Warning  Failed          43s (x2 over 60s)  kubelet            Error: ErrImagePull
         Normal   Pulling         43s (x2 over 60s)  kubelet            Pulling image "quay.apps.cluster.example.com/sdi3/com.sap.sl.cbpod/slcbridgebase:1.1.71"
         Warning  Failed          43s (x2 over 60s)  kubelet            Failed to pull image "quay.apps.cluster.example.com/sdi3/com.sap.sl.cbpod/slcbridgebase:1.1.71": rpc error: code = Unknown desc = Error reading manifest 1.1.71 in quay.apps.cluster.example.com/sdi3/com.sap.sl.cbpod/slcbridgebase: unauthorized: access to the requested resource is not authorized
         Warning  Failed          43s (x2 over 60s)  kubelet            Error: ErrImagePull
         Normal   BackOff         31s (x3 over 60s)  kubelet            Back-off pulling image "quay.apps.cluster.example.com/sdi3/com.sap.sl.cbpod/nginx-sidecar:1.1.71"
         Warning  Failed          31s (x3 over 60s)  kubelet            Error: ImagePullBackOff
         Normal   BackOff         31s (x3 over 60s)  kubelet            Back-off pulling image "quay.apps.cluster.example.com/sdi3/com.sap.sl.cbpod/slcbridgebase:1.1.71"
         Warning  Failed          31s (x3 over 60s)  kubelet            Error: ImagePullBackOff
         Normal   Pulling         20s (x3 over 60s)  kubelet            Pulling image "quay.apps.cluster.example.com/sdi3/com.sap.sl.cbpod/nginx-sidecar:1.1.71"
  • No pull secret is present in the sap-slcbridge namespace:
      # oc get secret -n slcb-test | grep 'NAME\|pull-secret'
       NAME                        TYPE                                  DATA   AGE

Resolution

Please update SLC Bridge to the latest version (at least 1.1.73).
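
To check whether a given SLC Bridge version meets this minimum, a version-aware comparison can be used. The following is a minimal sketch that assumes GNU sort with the -V option is available; version_at_least is a hypothetical helper, and the version to test must be supplied by you (for example, the image tag shown in the events above).

```shell
# Sketch: succeed if version $1 is greater than or equal to version $2.
# version_at_least is a hypothetical helper; it relies on GNU sort -V.
version_at_least() {
    # After a version sort, the smaller version comes first. The check
    # passes when the required version ($2) is that smallest entry.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

For example, version_at_least 1.1.71 1.1.73 fails, which indicates that an update is required.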

1.18. Failure of SLC Bridge image to be pulled from Quay

As of SLC Bridge 1.1.72, the bridge container fails to authenticate to Quay during the registry test.

Symptoms

  • SLC Bridge init fails with an error like the one below:
      *************************************************
       * Executing Step WaitForK8s SLCBridgePod Failed *
       *************************************************


       Execution of step WaitForK8s SLCBridgePod failed
       Synchronizing Deployment slcbridgebase failed (pod "slcbridgebase-6f985dcb87-6plql" failed) [517.957512ms]
       .
       Choose "Retry" to retry the step.
       Choose "Abort" to abort the SLC Bridge and return to the "Welcome" dialog.
       Choose "Cancel" to cancel the SLC Bridge immediately.


         Choose action Retry(r)/Abort(a)/<F1> for help: r
  • The container slcbridgebase is crashing.
      # kubectl get pods -n sap-slcbridge
       NAME                             READY   STATUS             RESTARTS   AGE
       slcbridgebase-8488d65d67-tqk7f   0/2     CrashLoopBackOff   25         31m
  • Its log contains the following error message:
      # oc logs -l app=slcbridge -c sidecar -n sap-slcbridge | grep unauthorized | head -n 1
       2022-02-11T15:19:29.792Z        WARN    images/registrycheck.go:57      Copying image memtarball:/canary.tar failed: trying to reuse blob sha256:8e0a91696253bb936c9603caed888f624af04b6eb335265a6e7a66e07bd23b51 at destination: checking whether a blob sha256:8e0a91696253bb936c9603caed888f624af04b6eb335265a6e7a66e07bd23b51 exists in quay.apps.lenbarehat.ocp.vslen/sdi172test/com.sap.sl.cbpod/canary: unauthorized: authentication required

Resolution

Please update SLC Bridge to the latest version (at least 1.1.73).

1.19. Failure of PVC expansion

Pod diagnostics-prometheus-server-0 failed to start after the expansion of its persistent volume claim.

Symptoms

  • Diagnostics-prometheus-server crashes and generates the following error:
      # kubectl logs diagnostics-prometheus-server-0 -c diagnostics-prometheus-server
       ts=2023-05-17T06:32:22.510Z caller=main.go:525 level=info msg="Starting Prometheus" version="(version=2.35.0, branch=non-git, revision=non-git)"
       ts=2023-05-17T06:32:22.510Z caller=main.go:530 level=info build_context="(go=go1.17.8, user=root@09fd96d4e0c7, date=20220818-10:05:28)"
       ts=2023-05-17T06:32:22.511Z caller=main.go:531 level=info host_details="(Linux 4.18.0-305.62.1.el8_4.x86_64 #1 SMP Thu Aug 11 12:07:27 EDT 2022 x86_64 diagnostics-prometheus-server-0 (none))"
       ts=2023-05-17T06:32:22.511Z caller=main.go:532 level=info fd_limits="(soft=1048576, hard=1048576)"
       ts=2023-05-17T06:32:22.511Z caller=main.go:533 level=info vm_limits="(soft=unlimited, hard=unlimited)"
       unexpected fault address 0x7f48189f9000
       fatal error: fault
       [signal SIGBUS: bus error code=0x2 addr=0x7f48189f9000 pc=0x46bca2]

Resolution

  1. Verify that the PVC was expanded successfully by running the following command:
     # oc describe pvc storage-diagnostics-prometheus-server-0 -n di-ml

Now, check the output for any errors or warnings related to PVC expansion. Here is an example:

   ❯ oc describe pvc storage-diagnostics-prometheus-server-0
   Name:          storage-diagnostics-prometheus-server-0
   ...
   Events:
     Type     Reason                      Age   From                                                 Message
     ----     ------                      ----  ----                                                 -------
     Normal   Resizing                    67m   external-resizer openshift-storage.rbd.csi.ceph.com  External resizer is resizing volume pvc-48e165e9-64e9-4188-b84d-7884f7597fcb
     Warning  ExternalExpanding           67m   volume_expand                                        Ignoring the PVC: didn't find a plugin capable of expanding the volume; waiting for an external controller to process this PVC.
     Normal   FileSystemResizeRequired    67m   external-resizer openshift-storage.rbd.csi.ceph.com  Require file system resize of volume on node
     Normal   FileSystemResizeSuccessful  67m   kubelet                                              MountVolume.NodeExpandVolume succeeded for volume "pvc-48e165e9-64e9-4188-b84d-7884f7597fcb" morrisville-s7vxz-worker-fcxk2
  2. Rebind the PVC to the stateful set by scaling the stateful set diagnostics-prometheus-server down to zero replicas. The SDI operator then scales it back up automatically, which may take some time to complete.
     # oc scale sts diagnostics-prometheus-server --replicas=0
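
In addition to reading the events, the requested and the actual size of the PVC can be compared directly. The following is a minimal sketch: pvc_resize_complete is a hypothetical helper, the namespace default of sdi is an assumption (the example above uses di-ml), and the comparison is a plain string match that assumes both sizes are reported in the same unit.

```shell
# Sketch: compare the requested size of a PVC with its actual capacity.
# pvc_resize_complete is a hypothetical helper; $1 is the PVC name.
pvc_resize_complete() {
    local pvc="$1" ns="${SDI_NAMESPACE:-sdi}" requested actual
    requested="$(oc get pvc "$pvc" -n "$ns" \
        -o jsonpath='{.spec.resources.requests.storage}')"
    actual="$(oc get pvc "$pvc" -n "$ns" \
        -o jsonpath='{.status.capacity.storage}')"
    echo "requested=${requested} actual=${actual}"
    # The resize is considered finished once both values agree.
    [ -n "$requested" ] && [ "$requested" = "$actual" ]
}
```

For example, SDI_NAMESPACE=di-ml pvc_resize_complete storage-diagnostics-prometheus-server-0 prints both sizes and returns non-zero while they still differ.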

2. Troubleshooting SDI runtime problems

2.1. 504 Gateway Time-out

If you experience 504 Gateway Time-out errors when accessing SDI services exposed via OpenShift's Ingress Controller (as routes), the most likely causes are the following:

  1. SDI components accessed for the first time on a per tenant and per user basis require a new pod to be started, which takes a considerable amount of time.
  2. The default time-out for server connection configured on the load balancers is usually too small to tolerate containers being pulled, initialized and started.

To amend that, make sure to do the following:

  1. Set the "haproxy.router.openshift.io/timeout" annotation to "2m" on the vsystem route as shown below (assuming the route is named vsystem):
      # oc annotate -n "${SDI_NAMESPACE:-sdi}" route/vsystem haproxy.router.openshift.io/timeout=2m

This results in the following haproxy settings being applied to the ingress router and the route in question:

      # oc rsh -n openshift-ingress $(oc get pods -o name -n openshift-ingress | \
               awk '/\/router-default/ {print;exit}') cat /var/lib/haproxy/conf/haproxy.config | \
           awk 'BEGIN { p=0 }
               /^backend.*:'"${SDI_NAMESPACE:-sdi}:vsystem"'/ { p=1 }
               { if (p) { print; if ($0 ~ /^\s*$/) {exit} } }'
       Defaulting container name to router.
       Use 'oc describe pod/router-default-6655556d4b-7xpsw -n openshift-ingress' to see all of the containers in this pod.
       backend be_secure:sdi:vsystem
         mode http
         option redispatch
         option forwardfor
         balance leastconn
         timeout server  2m
  2. Set the same server timeout (2 minutes) on the external load balancer forwarding traffic to OpenShift's Ingress routers; the following is an example configuration for haproxy:
      frontend                                    https
           bind                                    *:443
           mode                                    tcp
           option                                  tcplog
           timeout     server                      2m
           tcp-request inspect-delay               5s
           tcp-request content accept              if { req_ssl_hello_type 1 }


           use_backend sydney-router-https         if { req_ssl_sni -m end -i apps.sydney.example.com }
           use_backend melbourne-router-https      if { req_ssl_sni -m end -i apps.melbourne.example.com }
           use_backend registry-https              if { req_ssl_sni -m end -i registry.example.com }


       backend         sydney-router-https
           balance     source
           server      compute1                     compute1.sydney.example.com:443     check
           server      compute2                     compute2.sydney.example.com:443     check
           server      compute3                     compute3.sydney.example.com:443     check


       backend         melbourne-router-https
           ....
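
To confirm that the annotation from the first step is in place, it can be read back from the route. The following is a minimal sketch; route_timeout is a hypothetical helper, and the route name vsystem is the same assumption as above.

```shell
# Sketch: print the timeout annotation currently set on a route.
# route_timeout is a hypothetical helper; $1 is the route name.
route_timeout() {
    local route="${1:-vsystem}"
    # Dots inside the annotation key must be escaped in a jsonpath expression.
    oc get -n "${SDI_NAMESPACE:-sdi}" "route/${route}" \
        -o jsonpath='{.metadata.annotations.haproxy\.router\.openshift\.io/timeout}{"\n"}'
}
```

If the function prints an empty line, the annotation is not set on the route.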

2.2. HANA backup pod is unable to pull an image from an authenticated registry

If the configured container image registry requires authentication, HANA backup jobs might fail, as shown in the following example:

   # oc get pods | grep backup-hana
   default-chq28a9-backup-hana-sjqph                                 0/2     ImagePullBackOff   0          15h
   default-hfiew1i-backup-hana-zv8g2                                 0/2     ImagePullBackOff   0          38h
   default-m21kt3d-backup-hana-zw7w4                                 0/2     ImagePullBackOff   0          39h
   default-w29xv3w-backup-hana-dzlvn                                 0/2     ImagePullBackOff   0          15h


   # oc describe pod default-hfiew1i-backup-hana-zv8g2 | tail -n 6
     Warning  Failed          12h (x5 over 12h)       kubelet            Error: ImagePullBackOff
     Warning  Failed          12h (x3 over 12h)       kubelet            Failed to pull image "sdi-registry.apps.shanghai.ocp.vslen/com.sap.datahub.linuxx86_64/hana:2010.22.0": rpc error: code = Unknown desc = Error reading manifest 2010.22.0 in sdi-registry.apps.shanghai.ocp.vslen/com.sap.datahub.linuxx86_64/hana: unauthorized: authentication required
     Warning  Failed          12h (x3 over 12h)       kubelet            Error: ErrImagePull
     Normal   Pulling         99m (x129 over 12h)     kubelet            Pulling image "sdi-registry.apps.shanghai.ocp.vslen/com.sap.datahub.linuxx86_64/hana:2010.22.0"
     Warning  Failed          49m (x3010 over 12h)    kubelet            Error: ImagePullBackOff
     Normal   BackOff         4m21s (x3212 over 12h)  kubelet            Back-off pulling image "sdi-registry.apps.shanghai.ocp.vslen/com.sap.datahub.linuxx86_64/hana:2010.22.0"

Resolution

There are two ways to resolve this:

  • The recommended approach is to update SDI Observer to version 0.1.9 or newer.

  • A manual alternative fix is to execute the following:

  1. Determine the currently configured image pull secret:
          # oc get -n "${SDI_NAMESPACE:-sdi}" vc/vora -o jsonpath='{.spec.docker.imagePullSecret}{"\n"}'
           slp-docker-registry-pull-secret
  2. Link the secret with the default service account:
          # oc secret link --for=pull default slp-docker-registry-pull-secret
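
The two manual steps can be combined so that the secret name does not have to be copied by hand. The following is a minimal sketch: link_backup_pull_secret is a hypothetical helper, it uses the oc secrets link spelling from the OpenShift CLI documentation, and it assumes that the default service account lives in the SDI namespace.

```shell
# Sketch: look up the configured pull secret and link it to the default
# service account. link_backup_pull_secret is a hypothetical helper name.
link_backup_pull_secret() {
    local ns="${SDI_NAMESPACE:-sdi}" secret
    # Step 1: determine the currently configured image pull secret.
    secret="$(oc get -n "$ns" vc/vora \
        -o jsonpath='{.spec.docker.imagePullSecret}')"
    if [ -z "$secret" ]; then
        echo "no image pull secret is configured on vc/vora" >&2
        return 1
    fi
    # Step 2: allow the default service account to pull images with it.
    oc secrets link -n "$ns" --for=pull default "$secret"
}
```

The function stops without changing anything if no pull secret is configured.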

3. Troubleshooting SDI Observer (implemented with a bash script) problems

3.1. Failure of build due to repository outage

If the build of SDI Observer or SDI Registry fails with an error similar to the one below, the chosen Fedora repository mirror is probably temporarily down:

   # oc logs -n "${NAMESPACE:-sdi-observer}" -f bc/sdi-observer
   Extra Packages for Enterprise Linux Modular 8 - 448  B/s |  16 kB     00:36
   Failed to download metadata for repo 'epel-modular'
   Error: Failed to download metadata for repo 'epel-modular'
   subprocess exited with status 1
   subprocess exited with status 1
   error: build error: error building at STEP "RUN dnf install -y   https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm &&   dnf install -y parallel procps-ng bc git httpd-tools && dnf clean all -y": exit status 1

Please try to start the build again after a minute or two as shown below:

   # oc start-build -n "${NAMESPACE:-sdi-observer}" -F bc/sdi-observer