RHOCP 4 initContainer in CrashLoopBackOff on pod with Service Mesh sidecar injected
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
- Red Hat OpenShift Service Mesh (OSSM) 2
Issue
- Pod with istio-proxy sidecar injected fails to initialize due to a failed initContainer, when the initContainer needs to communicate with some service in a different project.
- Pod with istio-proxy sidecar injected fails to initialize due to a failed initContainer, when the initContainer needs to communicate with a service inside or outside the 'Mesh' via its FQDN.
- The initContainer in a Service Mesh pod is not able to resolve or connect to any service:
  Could not resolve host: kubernetes.default.svc.cluster.local Closing connection 0
- The initContainer starts before the sidecar, so the NetworkPolicies block any egress traffic from the mesh.
Resolution
A permanent solution is introduced with Service Mesh 2.6 and requires OpenShift 4.16 at minimum, as reported in the Service Mesh 2.6 Release Notes.
- Enable the Native sidecar in the ServiceMeshControlPlane, for example:
oc -n istio-system patch smcp basic --type=merge --patch '{"spec":{"runtime":{"components":{"pilot":{"container":{"env":{"ENABLE_NATIVE_SIDECARS":"true"}}}}}}}'
- Restart the application pods.
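With native sidecars enabled, the injected istio-proxy moves from the regular containers list into initContainers with restartPolicy: Always, so the proxy starts (and stays running) before any application initContainer runs. The fragment below is an illustrative sketch, not output from a live cluster; the container names are taken from the example deployment later in this article:

```
# Illustrative pod spec after enabling ENABLE_NATIVE_SIDECARS
spec:
  initContainers:
  - name: istio-proxy          # injected as a restartable (native) sidecar
    restartPolicy: Always      # keeps the proxy running for the whole pod lifetime
  - name: check-approval       # application initContainer now runs with mesh networking
  containers:
  - name: django-psql-persistent
```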
Workarounds
- Use the traffic.sidecar.istio.io/excludeOutboundPorts or traffic.sidecar.istio.io/excludeOutboundIPRanges annotations on the pod to allow the traffic. See the example with excludeOutboundPorts.
- Note: this workaround is not valid for DNS traffic.
- Execute the initContainer as user 1337. See the example with runAsUser: 1337.
- Note: this option must be used with care and only as a last resort when it is not possible to use IP addresses, since the pod will run with more privileges.
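The excludeOutboundIPRanges variant works the same way as excludeOutboundPorts, but excludes CIDR ranges instead of ports from the sidecar redirection. A hypothetical fragment; the CIDR below is only an example, use the service network of your own cluster:

```
metadata:
  annotations:
    sidecar.istio.io/inject: "true"
    # Hypothetical example: exclude a whole service CIDR from sidecar redirection
    traffic.sidecar.istio.io/excludeOutboundIPRanges: "172.32.0.0/16"
```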
Example with excludeOutboundPorts
If a deployment that injects the 'istio-proxy' sidecar has an initContainer that must run before the application container and that needs, for any reason, to establish a connection to some other service in the same project or another project, there are some details to take into consideration to avoid the initContainer failing and, with it, the pod failing to initialize.
In the example below we will go through some of these considerations before running a deployment. This example runs a simple Django with PostgreSQL web application, where an initContainer was configured on the Django DeploymentConfig which needs to get a valid response from a 'TCP' server service running in another project.
$ oc get all,ep -n django-project
NAME READY STATUS RESTARTS AGE
pod/django-psql-persistent-8-rbkj4 2/2 Running 0 94m
pod/postgresql-2-7kfnm 2/2 Running 0 126m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/django-psql-persistent ClusterIP 172.32.112.108 <none> 8080/TCP 5h12m
service/postgresql ClusterIP 172.32.136.38 <none> 5432/TCP 5h12m
NAME REVISION DESIRED CURRENT TRIGGERED BY
deploymentconfig.apps.openshift.io/django-psql-persistent 10 1 0 config,image(django-psql-persistent:latest)
deploymentconfig.apps.openshift.io/postgresql 2 1 1 config,image(postgresql:12-el8)
NAME TYPE FROM LATEST
buildconfig.build.openshift.io/django-psql-persistent Source Git 1
NAME TYPE FROM STATUS STARTED DURATION
build.build.openshift.io/django-psql-persistent-1 Source Git@064993e Complete 5 hours ago 6m18s
NAME ENDPOINTS AGE
endpoints/django-psql-persistent 10.132.0.146:8080 5h12m
endpoints/postgresql 10.135.0.174:5432 5h12m
$ oc get dc/django-psql-persistent -o yaml
[...]
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    name: django-psql-persistent
  strategy:
    activeDeadlineSeconds: 21600
    recreateParams:
      timeoutSeconds: 600
    resources: {}
    type: Recreate
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "true"
        traffic.sidecar.istio.io/excludeOutboundPorts: "9000"
      creationTimestamp: null
      labels:
        name: django-psql-persistent
      name: django-psql-persistent
    spec:
      containers:
      - env:
        image: image-registry.openshift-image-registry.svc:5000/test-istio/django-psql-persistent@sha256:0f90bc03677c364ceed6d4ed71d0664f22a954a47fe4b35cab90f348d5fb4acd
        imagePullPolicy: IfNotPresent
        name: django-psql-persistent
        ports:
        - containerPort: 8080
          protocol: TCP
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - /bin/bash
        - -c
        - |
          set -euo pipefail
          CONNECTION=$(echo "world" | nc tcp-echo.approve-init-containers.svc.cluster.local 9000)
          if [[ ${CONNECTION} == "hello world" ]]; then
            echo "Connection succeeded, continue with the deployment"
            exit 0
          else
            echo "Connection failed. Stop and review what is wrong"
            exit 1
          fi
        image: registry.redhat.io/rhel7/rhel-tools
        imagePullPolicy: IfNotPresent
        name: check-approval
[...]
The 'TCP' service is a simple 'tcp-echo' server that receives connections via 'TCP' and responds with "hello " followed by the client request.
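The server's behaviour can be sketched locally with a small bash stand-in; tcp_echo below is a hypothetical function for illustration, not the real service (the real server listens on TCP port 9000):

```shell
#!/bin/bash
# Local stand-in for the tcp-echo service: it prepends "hello " to the request.
tcp_echo() {
  local request
  read -r request
  echo "hello ${request}"
}

echo "world" | tcp_echo   # prints: hello world
```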
$ oc get all,ep -n approve-init-containers
NAME READY STATUS RESTARTS AGE
pod/tcp-echo-567c9458b8-z86qs 2/2 Running 0 174m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/tcp-echo ClusterIP 172.32.248.178 <none> 9000/TCP,9001/TCP 8h
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/tcp-echo 1/1 1 1 8h
NAME ENDPOINTS AGE
endpoints/tcp-echo 10.132.0.131:9001,10.132.0.131:9000 8h
In order to have this working, some special extra settings are needed:
template:
  metadata:
    annotations:
      sidecar.istio.io/inject: "true"
      traffic.sidecar.istio.io/excludeOutboundPorts: "9000" --> this annotation must be present, excluding the outbound port used by the `initContainer`, to avoid traffic loss. In this case port 9000, which is the port of the tcp-echo service.
  spec:
    initContainers:
    - command:
      - /bin/bash
      - -c
      - |
        set -euo pipefail
        CONNECTION=$(echo "world" | nc 172.32.248.178 9000) --> the connection made by the `initContainer` must go to the service ClusterIP and not the service hostname
        if [[ ${CONNECTION} == "hello world" ]]; then
          echo "Connection succeeded, continue with the deployment"
          exit 0
        else
          echo "Connection failed. Stop and review what is wrong"
          exit 1
        fi
With this, when the pod is scheduled, the initContainer will start and finish its task, and then the pod will initialize, starting the application and 'istio-proxy' containers:
$ oc get pods
NAME READY STATUS RESTARTS AGE
django-psql-persistent-12-pn8f4 2/2 Running 1 (84s ago) 2m6s
postgresql-2-9dttg 2/2 Running 0 2m1s
$ oc logs -c check-approval django-psql-persistent-12-pn8f4
Connection succeeded, continue with the deployment
And from the 'tcp-echo' server the connection can also be seen:
$ oc logs -c tcp-echo tcp-echo-567c9458b8-sck4k
listening on [::]:9002, prefix: hello
listening on [::]:9000, prefix: hello
listening on [::]:9001, prefix: hello
request: world
response: hello world
$ oc logs -c istio-proxy tcp-echo-567c9458b8-sck4k
[2022-01-19T20:52:09.370Z] "- - -" 0 - - - "-" 6 12 36 - "-" "-" "-" "-" "127.0.0.1:9000" inbound|9000|| 127.0.0.1:60570 10.132.0.102:9000 10.132.0.103:52380 - -
If the initContainer is failing because of DNS errors when accessing the service hostname, the current workaround is to make sure containers use the service ClusterIP, as in the example script above.
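To fail fast when a hostname is configured instead of a ClusterIP, the check script could validate its target first. A sketch with an illustrative TARGET variable; this guard is not part of the original script, and the IP is the example ClusterIP from above:

```shell
#!/bin/bash
# Guard sketch: refuse hostnames, since DNS is unavailable to the initContainer.
TARGET="172.32.248.178"   # illustrative ClusterIP of the tcp-echo service
if [[ ! ${TARGET} =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]]; then
  echo "TARGET must be a ClusterIP, not a hostname: ${TARGET}" >&2
  exit 1
fi
echo "using ClusterIP ${TARGET}"
```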
Example with runAsUser: 1337
In cases where it is not possible at all to use an IP address, the solution is to run the initContainer with the istio-proxy UID, in addition to setting the traffic.sidecar.istio.io/excludeOutboundPorts: <port> annotation, so that istio-cni does not interfere with the initContainer network connections. For this to work the pod must run with either the nonroot or anyuid SCC:
$ oc adm policy add-scc-to-user nonroot -z <my-serviceaccount-name>
Then add the user "1337" to the initContainer section:
template:
  metadata:
    annotations:
      sidecar.istio.io/inject: "true"
      traffic.sidecar.istio.io/excludeOutboundPorts: "9000"
    [...]
  spec:
    containers:
    [...]
    initContainers:
    - command:
      - /bin/bash
      - -c
      - |
        set -euo pipefail
        CONNECTION=$(echo "world" | nc tcp-echo-server.mycloud.example.com 9000)
        if [[ ${CONNECTION} == "hello world" ]]; then
          echo "Connection succeeded, continue with the deployment"
          exit 0
        else
          echo "Connection failed. Stop and review what is wrong"
          exit 1
        fi
      image: registry.redhat.io/rhel7/rhel-tools
      imagePullPolicy: IfNotPresent
      name: check-approval
      securityContext:
        runAsUser: 1337 --> add the securityContext.runAsUser to the container
Once the pod starts, the initContainer will run as user 1337 and bypass all the istio-cni settings used for the istio-proxy sidecar.
Root Cause
Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.
- Red Hat OpenShift Service Mesh uses Istio CNI, which configures the network of the envoy proxy, including the iptables rules that capture application traffic.
- This plugin is also used to connect to the OCP cluster network, including Multus CNI.
- As stated by upstream Istio, this might create issues with initContainers, since the pod will not start until the initContainer finishes. Because the pod does not start, the istio-proxy also waits to start, but Istio CNI initializes all the networking for the envoy proxy to take over as soon as the pod is scheduled and the Istio control plane receives information that a new istio-proxy will be injected, causing possible network connectivity failures in the initContainer.
Diagnostic Steps
Check logs on the initContainer and the remote service pod where the connection is being made:
$ oc logs -c <initContainer_name> <some-pod> -n <some-project>
$ oc logs -c <container-name> <some-remote-service-pod> -n <some-project>
Recheck the configuration of the pod being deployed to make sure no annotation is missing, and check how the initContainer is configured to make its connections.
In case of DNS issues there are several ways to check:
- Check container logs when a static `initContainer` is configured:
$ oc logs -c <initContainer-name> <some-pod-name>
- In case of dynamic `initContainers` injection, containers in the pod will start and DNS errors might be seen in the 'istio-proxy' sidecar:
$ oc logs -c istio-proxy <some-pod-name>
A simple sleep initContainer can be configured to test DNS issues:
$ oc edit dc/django-psql-persistent
[...] # add a sleep initContainer
    initContainers:
    - command:
      - /bin/sleep
      - 3650d
      image: curlimages/curl
      imagePullPolicy: IfNotPresent
      name: sleep
[...]
$ oc rsh -c sleep <pod-name>
~ $ nslookup <service-name>.<project-name>.svc.cluster.local
~ $ echo "world" | nc <service-name>.<project-name>.svc.cluster.local 9000
The above should give different results when sidecar.istio.io/inject is set to true or false. When the sidecar is not injected, everything works correctly.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.