429 response errors and duplicated logs when upgrading to Logging v6 in RHOCP 4

Solution Verified - Updated

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4
  • Red Hat OpenShift Logging (RHOL)
    • 5
    • 6
  • Vector
  • Fluentd

Issue

  • After upgrading to Logging v6, HTTP 429 response codes are observed
  • When upgrading from Logging v5 to v6, duplicated logs are seen in the log store
  • After the upgrade to Logging v6, Elasticsearch is overwhelmed and duplicated logs are visible
  • How can the collector be prevented from re-reading the old logs still available on the nodes?
  • After migrating from Logging v5 to v6, Rate Limit and Tenant Rate Limit errors are observed in Loki

Resolution

1. If the collector type in Logging v5 was Vector

Follow the steps in the Red Hat Knowledge Article "How to transition the collectors and the default log store from Red Hat OpenShift Logging 5 to 6", paying special attention to the section "Move the Vector checkpoints for the clusterLogging CR instance".

2. If the collector type in Logging v5 was Fluentd

  1. Before starting the migration to Logging v6, migrate the log collector from Fluentd to Vector by following the Red Hat Knowledge Article "Migrating the log collector from Fluentd to Vector reducing the number of logs duplicated in RHOCP 4".
  2. Follow the steps in the Red Hat Knowledge Article "How to transition the collectors and the default log store from Red Hat OpenShift Logging 5 to 6", paying special attention to the section "Move the Vector checkpoints for the clusterLogging CR instance".

Root Cause

There are two scenarios:

1. If the collector type in RHOL v5 was Fluentd

Fluentd uses a pos file that records the list of log files opened for reading and the last position read in each of them. Vector cannot interpret the format of these pos files, so it cannot continue reading from the last position read by Fluentd.

As a result, when the collector type is migrated to Vector, Vector starts reading all the logs available on the nodes from the beginning.
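
To illustrate the mismatch, the following sketch parses one line in the layout commonly used by the Fluentd in_tail pos file (path, hexadecimal offset and hexadecimal inode, separated by tabs) and shows that the same text cannot be read as a JSON checkpoint document. The sample line, the names PosEntry, parse_pos_line and looks_like_json_checkpoints, and the assumption that Vector keeps its checkpoints as JSON with a "checkpoints" list are illustrative only and are not taken from this solution.

```python
import json
from typing import NamedTuple


class PosEntry(NamedTuple):
    path: str
    offset: int
    inode: int


def parse_pos_line(line: str) -> PosEntry:
    """Parse one tab-separated pos-file line: path, hex offset, hex inode."""
    path, offset_hex, inode_hex = line.rstrip("\n").split("\t")
    return PosEntry(path, int(offset_hex, 16), int(inode_hex, 16))


def looks_like_json_checkpoints(text: str) -> bool:
    """Return True only if text is a JSON object holding a 'checkpoints' list."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and isinstance(data.get("checkpoints"), list)


# Hypothetical sample line, not copied from a real node.
SAMPLE_POS = "/var/log/pods/example/app/0.log\t00000000000004d2\t000000000000162e\n"

entry = parse_pos_line(SAMPLE_POS)
print(entry)                                    # path plus decoded offset and inode
print(looks_like_json_checkpoints(SAMPLE_POS))  # the pos line is not JSON
```

The pos entry is keyed by file path and a byte offset, and a reader expecting a JSON document rejects it, which is why the position information is lost when the collector type changes.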

2. If the collector type in RHOL v5 was Vector

The Vector checkpoints in Logging v5 are stored in a different path than in Logging v6, so the Logging v6 collector does not find them and starts reading the logs from the beginning. For more details, read the section "Move the Vector checkpoints for the clusterLogging CR instance" of the Red Hat Knowledge Article "How to transition the collectors and the default log store from Red Hat OpenShift Logging 5 to 6".

Diagnostic Steps

After migrating from Logging v5 to Logging v6, observe the following in the receivers:

  1. HTTP 429 response errors
  2. HTTP 500 Internal Server Error responses
  3. If Loki is used as the log store, also Rate Limit and Tenant Rate Limit errors
  4. The receiver may become overwhelmed, turning unresponsive or returning 500 responses because it cannot handle the load
  5. Duplicated logs with old dates in the log store
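
One rough way to confirm the duplicates is to export a sample of records from the log store and count the entries that share the same identifying fields. The sketch below assumes the records are already loaded as dictionaries; the field names timestamp, pod and message and the function find_duplicates are hypothetical and must be adapted to the actual record schema.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Sequence, Tuple


def find_duplicates(
    records: Iterable[Mapping[str, str]],
    key_fields: Sequence[str] = ("timestamp", "pod", "message"),
) -> Dict[Tuple[str, ...], int]:
    """Return {key_tuple: count} for every key seen more than once."""
    counts = Counter(tuple(r.get(f, "") for f in key_fields) for r in records)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical records; the field names are assumptions, not the real schema.
sample = [
    {"timestamp": "2024-01-01T00:00:00Z", "pod": "app-1", "message": "started"},
    {"timestamp": "2024-01-01T00:00:00Z", "pod": "app-1", "message": "started"},
    {"timestamp": "2024-01-01T00:00:05Z", "pod": "app-1", "message": "ready"},
]

dups = find_duplicates(sample)
print(dups)
```

Records that appear more than once with timestamps older than the migration date are consistent with the collector having re-read the log files from the beginning.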
