Infrastructure Nodes in OpenShift 4

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP) - 4

Issue

Infrastructure nodes allow customers to isolate infrastructure workloads for two primary purposes:

  1. to prevent incurring billing costs against subscription counts and
  2. to separate maintenance and management.

This solution complements the official documentation on creating Infrastructure nodes in OpenShift 4. In addition, there is an OpenShift Commons video describing this whole process: OpenShift Commons: Everything about Infra nodes.

Infrastructure nodes host only infrastructure components, such as the default router, the integrated container image registry, and the components for cluster metrics and monitoring. These infrastructure machines are not counted toward the total number of subscriptions that are required to run the environment.

To resolve the first problem, all that is needed is a node label applied to a particular node, a set of nodes, or the Machines and MachineSet that manage them. Red Hat subscription vCPU counts omit any vCPU reported by a node labeled node-role.kubernetes.io/infra: "", so you will not be charged for these resources by Red Hat. See How to confirm infra nodes not included in subscription cost in OpenShift Cluster Manager? to confirm your vCPU count reports correctly after applying the configuration changes in this article.
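As a quick sanity check (a sketch only; adjust to your environment), you can list the nodes carrying the infra label and the vCPU capacity each one reports:

```shell
# List nodes currently labeled as infrastructure nodes
oc get nodes -l node-role.kubernetes.io/infra=

# Show the vCPU capacity each infra node reports
oc get nodes -l node-role.kubernetes.io/infra= \
  -o custom-columns=NAME:.metadata.name,CPU:.status.capacity.cpu
```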

To resolve the second problem we need to schedule infrastructure workloads specifically to infrastructure nodes and also to prevent other workloads from being scheduled on infrastructure nodes. There are two strategies for accomplishing this that we will go into later.

You may ask why infrastructure workloads are different from those running on the control plane. At a minimum, an OpenShift cluster contains 2 worker nodes in addition to 3 control plane nodes. While control plane components critical to cluster operability are isolated on the masters, some infrastructure workloads still run by default on the worker nodes - the same nodes on which cluster users deploy their applications.

Note: To know which workloads can run on infrastructure nodes, check the "Red Hat OpenShift control plane and infrastructure nodes" section in the OpenShift sizing and subscription guide for enterprise Kubernetes.

Planning node changes around any nodes hosting these infrastructure components should not be taken lightly and, in general, should be handled separately from changes to nodes running normal application workloads.

Resolution

Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

Note: in OSD and ROSA it is not possible to create MachineSets/MachineConfigPools. Also, the infra nodes in OSD and ROSA are managed by Red Hat, and customer workloads cannot run there. Refer to Create and configure MachineSets/MachinePools in OSD and ROSA to manage MachineSets using the OCM MachinePools.

Isolating Infrastructure Nodes

Applying a specific node selector to all infrastructure components will guarantee that they will be scheduled on nodes with that label. See more details on node selectors in placing pods on specific nodes using node selectors, and about node labels in understanding how to update labels on nodes.

Our node label and matching selector for infrastructure components will be node-role.kubernetes.io/infra: "".

To prevent other workloads from also being scheduled on those infrastructure nodes, we need one of two solutions:

  • Apply a taint to the infrastructure nodes and tolerations to the desired infrastructure workloads.
    OR
  • Apply a completely separate label to your other nodes and matching node selector to your other workloads such that they are mutually exclusive from infrastructure nodes.

TIP: To ensure High Availability (HA) each cluster should have three Infrastructure nodes, ideally across availability zones. See more details about rebooting nodes running critical infrastructure.

TIP: Review the infrastructure node sizing suggestions.

About the "worker" role and the MachineConfigPool

By default all nodes except for masters will be labeled with node-role.kubernetes.io/worker: "". We will be adding node-role.kubernetes.io/infra: "" to infrastructure nodes.

The MachineConfigOperator reconciles any available MachineConfigs defined to match a specific selector in a MachineConfigPool. All custom MCP objects descend from the parent worker pool, as documented in custom pools. You do not need a custom pool unless you actually need a specific set of nodes to receive a different set of MachineConfigs.

You do not need a custom MachineConfigPool or MachineConfig for Infrastructure nodes to work correctly: it is not a strict requirement for any of the problems addressed here (node labeling, scheduling specific workloads, isolation, preventing other workloads from being scheduled). You can, however, use one if you find it useful, and the link above includes an example of how to create one. For example, if for some other reason all your infrastructure nodes should receive a particular MachineConfig that differs from your worker nodes, you would use a MachineConfigPool to ensure that MachineConfig applies to the nodes in your pool.

However, if you want to remove the existing worker role from your infra nodes, you will need an MCP to ensure that all the nodes upgrade correctly. This is because the worker MCP is responsible for updating and upgrading the nodes, and it finds them by looking for this node-role label. If you remove the label, you must have a MachineConfigPool that can find your infra nodes by the infra node-role label instead. Previously this was not the case and removing the worker label could have caused issues in OCP <= 4.3.

This infra MCP definition below will find all MachineConfigs labeled both "worker" and "infra" and it will apply them to any Machines or Nodes that have the "infra" role label. In this manner, you will ensure that your infra nodes can upgrade without the "worker" role label.

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""
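Assuming the definition above is saved to a file (the name infra-mcp.yaml below is only an example), it can be applied and checked like this:

```shell
# Create the infra MachineConfigPool (file name is illustrative)
oc apply -f infra-mcp.yaml

# Verify the pool exists and picked up the infra-labeled nodes
oc get mcp infra
```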

Configuring Infrastructure Nodes Using Node Selectors and Taints and Tolerations

Applying a taint to the infrastructure nodes, and a toleration for that taint to all infrastructure components, guarantees that only those resources are scheduled on the Infrastructure nodes. Taints prevent workloads without a matching toleration from running on particular nodes. However, some workloads, such as daemonsets, still need to be scheduled on these nodes; those workloads need a universal toleration. There was an outstanding issue where taints caused problems for some infrastructure daemonset components that lacked a universal toleration, but it has been resolved since RHBA-2020:3180 with 4.3.31, RHBA-2020:2786 with 4.4.11, and RHBA-2020:240 with 4.5.1. See Critical DaemonSets Missing Universal Toleration for further information.

With MachineSets

If your cluster was installed using MachineSets to manage your Machines and Nodes, then you can use MachineSets to define your infrastructure nodes as well. See the official documentation: Creating Infrastructure Machinesets.

An example MachineSet with the required nodeSelector and taints applied might look like this:

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: <infrastructureID> 
  name: <infrastructureID>-infra-<zone> 
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <infrastructureID> 
      machine.openshift.io/cluster-api-machineset: <infrastructureID>-infra-<zone> 
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: <infrastructureID> 
        machine.openshift.io/cluster-api-machine-role: infra 
        machine.openshift.io/cluster-api-machine-type: infra 
        machine.openshift.io/cluster-api-machineset: <infrastructureID>-infra-<zone> 
    spec:
      metadata:
        labels:
          node-role.kubernetes.io/infra: ""
          node-role.kubernetes.io: infra
      taints:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        value: reserved
      - effect: NoExecute
        key: node-role.kubernetes.io/infra
        value: reserved
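The `<infrastructureID>` placeholder is the cluster's infrastructure name. One way to look it up, and then apply and verify the MachineSet (the file name below is illustrative):

```shell
# Look up the infrastructure ID used in the placeholders above
oc get infrastructure cluster -o jsonpath='{.status.infrastructureName}'

# Apply the MachineSet after substituting the placeholders
oc apply -f infra-machineset.yaml

# Verify the MachineSet and the Machines it creates
oc get machineset -n openshift-machine-api
oc get machines -n openshift-machine-api | grep infra
```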

Without MachineSets

If you are not using the MachineSet API to manage your nodes, labels and taints are applied manually to each node:

Label it:

oc label node <node-name> node-role.kubernetes.io/infra=
oc label node <node-name> node-role.kubernetes.io=infra

Taint it:

oc adm taint nodes -l node-role.kubernetes.io/infra node-role.kubernetes.io/infra=reserved:NoSchedule node-role.kubernetes.io/infra=reserved:NoExecute
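To confirm the labels and taints landed as expected, a quick check (a sketch; column syntax may need adjusting for your oc version):

```shell
# Show the taints on all infra-labeled nodes
oc get nodes -l node-role.kubernetes.io/infra= \
  -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```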

Moving Components to the Infrastructure Nodes

To move components to the infrastructure nodes, they must now have the infra Node Selector and a Toleration for the Taint assigned to the infrastructure nodes.
The following is an example, taken from the IngressController default, of what should be included in each respective resource spec to apply the node selector and toleration:

spec:
  nodePlacement:
    nodeSelector:
      matchLabels:
        node-role.kubernetes.io/infra: ""
    tolerations:
    - effect: NoSchedule
      key: node-role.kubernetes.io/infra
      value: reserved
    - effect: NoExecute
      key: node-role.kubernetes.io/infra
      value: reserved

Router

To move the router, the following patch on the IngressController will add both the node selector and Toleration:

oc patch ingresscontroller/default -n openshift-ingress-operator --type=merge -p '{"spec":{"nodePlacement": {"nodeSelector": {"matchLabels": {"node-role.kubernetes.io/infra": ""}},"tolerations": [{"effect":"NoSchedule","key": "node-role.kubernetes.io/infra","value": "reserved"},{"effect":"NoExecute","key": "node-role.kubernetes.io/infra","value": "reserved"}]}}}'

TIP: The router is configured by default to have only 2 replicas, but with 3 infrastructure nodes the following patch is required to scale to 3 routers.

oc patch ingresscontroller/default -n openshift-ingress-operator --type=merge -p '{"spec":{"replicas": 3}}'
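To confirm the router pods were rescheduled (and scaled) onto the infra nodes, check which nodes they are running on:

```shell
# The NODE column should show your infra nodes
oc get pods -n openshift-ingress -o wide
```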

Registry

To move the registry, apply the following patch to the config/cluster object.

oc patch configs.imageregistry.operator.openshift.io/cluster --type=merge -p '{"spec":{"nodeSelector": {"node-role.kubernetes.io/infra": ""},"tolerations": [{"effect":"NoSchedule","key": "node-role.kubernetes.io/infra","value": "reserved"},{"effect":"NoExecute","key": "node-role.kubernetes.io/infra","value": "reserved"}]}}'
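As with the router, you can verify the registry pod landed on an infra node:

```shell
# The NODE column should show an infra node for the image-registry pod
oc get pods -n openshift-image-registry -o wide
```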

Monitoring

Prometheus, Grafana, and AlertManager comprise the default monitoring stack. To move these components, create a ConfigMap with the required node selectors and tolerations.

Define the ConfigMap as the cluster-monitoring-configmap.yaml file with the following:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |+
    alertmanagerMain:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoSchedule
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoExecute
    prometheusK8s:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoSchedule
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoExecute
    prometheusOperator:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoSchedule
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoExecute
    grafana:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoSchedule
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoExecute
    k8sPrometheusAdapter:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoSchedule
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoExecute
    kubeStateMetrics:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoSchedule
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoExecute
    telemeterClient:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoSchedule
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoExecute
    openshiftStateMetrics:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoSchedule
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoExecute
    thanosQuerier:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoSchedule
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoExecute

Then apply it to the cluster:

oc create -f cluster-monitoring-configmap.yaml
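The monitoring pods are recreated as the operator reconciles the new configuration; you can watch them reschedule onto the infra nodes:

```shell
# The NODE column should progressively show infra nodes
oc get pods -n openshift-monitoring -o wide
```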

Logging

Logging components can also be moved to the infrastructure nodes. Additional information about moving the logging resources can be found in the OCP Documentation.

Configuring Infrastructure Nodes Using only Node Selectors

If you prefer to avoid taints entirely, you can achieve isolation through scheduling behavior alone. Schedule your own workloads onto specific non-infrastructure nodes by applying a distinct label to those other nodes, and then either update the default scheduler to select that label or request it explicitly per project namespace.

With MachineSets

In a MachineSet-enabled installation, add an app-specific node label by modifying the MachineSet and labeling the existing nodes manually. Modify the MachineSet so that any new Machines receive the label, but also label the current Nodes, because the Machine API does not dynamically update Machines or Nodes when the MachineSet that originally created them changes. By default the cluster may already have a "worker" MachineSet that can be repurposed for your app label.

$ oc patch machineset $MACHINESET_APP --type=merge -p '{"spec":{"template":{"spec":{"metadata":{"labels":{"node-role.kubernetes.io/app":""}}}}}}'
$ oc patch machineset $MACHINESET_INFRA  --type=merge -p '{"spec":{"template":{"spec":{"metadata":{"labels":{"node-role.kubernetes.io/infra":""}}}}}}'
$ oc patch machineset $MACHINESET_INFRA  --type=merge -p '{"spec":{"template":{"spec":{"metadata":{"labels":{"node-role.kubernetes.io":"infra"}}}}}}'
$ oc label node <node-name> node-role.kubernetes.io/app=""
$ oc label node <node-name> node-role.kubernetes.io/infra=""
$ oc label node <node-name> node-role.kubernetes.io=infra

Without MachineSets

Worker nodes can be designated as infra nodes or app nodes through labeling.

  1. Add a label to the worker node(s) you wish to act as app node(s):
$ oc label node <node-name> node-role.kubernetes.io/app=""
  2. Add a label to the worker node(s) you wish to act as infra node(s):
$ oc label node <node-name> node-role.kubernetes.io/infra=""
$ oc label node <node-name> node-role.kubernetes.io=infra
  3. Check that the applicable nodes now have the infra and app roles. Note that the worker role should remain.
$ oc get nodes

Change the Scheduling Behavior for App Workloads

Create a default node selector so that pods without their own nodeSelector are assigned to a subset of nodes, in this example the app-labeled nodes.
The defaultNodeSelector to deploy pods on app nodes by default would look like:

defaultNodeSelector: node-role.kubernetes.io/app=

This could also be performed with the following patch command:

$ oc patch scheduler cluster --type=merge -p '{"spec":{"defaultNodeSelector":"node-role.kubernetes.io/app="}}'
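You can confirm the scheduler picked up the new default:

```shell
# Print the defaultNodeSelector currently set on the cluster scheduler
oc get scheduler cluster -o jsonpath='{.spec.defaultNodeSelector}'
```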

Note: Prior to a bugfix in 4.6.1, oc debug node did not work on master or infra nodes after changing the defaultNodeSelector from the installation default. This issue is described in RHBZ #1812813 and oc debug node Fails When a Default nodeSelector is Defined.

Note: when changing the default node selector on the scheduler, some projects need an explicit, empty project node selector so their DaemonSets can schedule. For example, the openshift-logging project (and, in OpenShift 4.18 to 4.18.12, the openshift-cluster-olm-operator, openshift-catalogd, and openshift-operator-controller projects, due to a bug described in Upgrade to OpenShift 4.18 fails due to OLM operator scheduling issue). Review any other DaemonSets for the behavior you want, and annotate their namespaces, for example:

$ oc annotate namespace openshift-logging openshift.io/node-selector=
$ oc annotate namespace openshift-cluster-olm-operator openshift-catalogd openshift-operator-controller openshift.io/node-selector=

As an alternative to changing the default scheduler for all workloads on the cluster, you can add this annotation to your namespace to control the scheduling behavior of all resources in that namespace. See How to configure project node selector in OpenShift 4 for project node selector details.

$ oc annotate namespace $PROJECT openshift.io/node-selector=node-role.kubernetes.io/app=

Change the Scheduling Behavior for Infra Workloads

Move infrastructure resources to the newly labeled infra nodes. This looks exactly the same as above, without the tolerations spec. See further documentation about moving resources to infrastructure machine sets.

Root Cause

Infrastructure nodes allow customers to isolate infrastructure workloads, but are not included in the OCP 4 default installation.

Diagnostic Steps

In a default OCP 4 installation, only the master and worker MCPs exist, and likewise only the master and worker roles:

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-xxxxxxxxxx   True      False      False      3              3                   3                     0                      8h
worker   rendered-worker-yyyyyyyyyy   True      False      False      3              3                   3                     0                      8h

$ oc get nodes
NAME                                               STATUS   ROLES    AGE    VERSION
master-0.lab.example.com   Ready    master   8h   v1.19.0+b00ba52
master-1.lab.example.com   Ready    master   8h   v1.19.0+b00ba52
master-2.lab.example.com   Ready    master   8h   v1.19.0+b00ba52
worker-0.lab.example.com   Ready    worker   8h   v1.19.0+b00ba52
worker-1.lab.example.com   Ready    worker   8h   v1.19.0+b00ba52
worker-2.lab.example.com   Ready    worker   8h   v1.19.0+b00ba52

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.