Troubleshooting

Red Hat Edge Manager 1.2

Diagnose and fix common problems with Red Hat Edge Manager

Red Hat Edge Manager Documentation Team

Abstract

This document provides troubleshooting information for Red Hat Edge Manager.

Preface

Diagnose common issues, interpret errors, and collect logs for Red Hat Edge Manager.

Chapter 1. Troubleshooting Red Hat Edge Manager

When working with devices in Red Hat Edge Manager, troubleshooting begins with interpreting the structured status messages provided by the device. By identifying the specific phase and component where a failure occurred, you can quickly determine whether an issue is caused by local resource constraints, network connectivity, or configuration errors.

1.1. Troubleshooting dependency synchronization

Use device status, events, and the CLI to diagnose why upstream configuration changes are not reaching devices or why sync probes report failures.

Prerequisites

  • You can run the Flight Control CLI and are logged in to the Red Hat Edge Manager service.
  • The device or fleet references configuration through gitRef, httpRef, or secretRef.

Upstream change detected but device stays out of date

  1. Confirm that a new template version exists for the fleet:

    flightctl get templateversions --fleetname <fleet_name>

    Look for a sync-driven name such as v1-8ebebaf8 in addition to the spec-driven v1.

  2. Check device update status:

    flightctl get device/<device_name> -o yaml

    Review status.updated.status and fleet controller annotations such as fleet-controller/templateVersion.

  3. List recent events on the device:

    flightctl get events --field-selector 'involvedObject.kind=Device,involvedObject.name=<device_name>'

    Verify that DependencyChangeDetected appears after the upstream change. Inspect structured event details:

    flightctl get events --field-selector 'involvedObject.kind=Device,involvedObject.name=<device_name>,reason=DependencyChangeDetected' -o yaml

    The event details include resourceKey and fingerprint values for the upstream resource that changed.

If a template version exists but the device remains OutOfDate, investigate agent connectivity and update errors. See the Device update status and update state section and the Troubleshooting device error codes section.
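The event queries in step 3 differ only in their field selector. As a minimal sketch, the selector can be built by a small helper; the function name device_event_selector and the device name in the example are illustrative and are not part of the flightctl CLI.

```shell
# Build the --field-selector value used in step 3.
# $1: device name
# $2: optional event reason, for example DependencyChangeDetected
device_event_selector() {
    selector="involvedObject.kind=Device,involvedObject.name=$1"
    if [ -n "${2:-}" ]; then
        selector="$selector,reason=$2"
    fi
    printf '%s\n' "$selector"
}

# Example (requires a logged-in flightctl CLI; my-device is a placeholder):
#   flightctl get events --field-selector "$(device_event_selector my-device DependencyChangeDetected)" -o yaml
```

Omitting the second argument yields the broader selector that lists all recent events on the device.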

DependencySyncProbeFailed events

A DependencySyncProbeFailed warning indicates that Red Hat Edge Manager could not reach or authenticate to an upstream dependency during a sync cycle.

Common causes include:

  • Invalid or expired credentials on the Repository resource
  • Network or firewall rules blocking Git, HTTP, or the Kubernetes API
  • HTTP endpoints that are unreachable or return server errors

Confirm that the repository is reachable:

flightctl get repository/<name>

The ACCESSIBLE condition should be True.

The event message is sanitized and does not include secrets. Fix the underlying repository or secret access, then wait for the next poll cycle or trigger a specification reconciliation if needed.

Inspect probe failure details:

flightctl get events --field-selector 'reason=DependencySyncProbeFailed' -o yaml

Secret informer disconnect or not running

Secret synchronization requires an in-cluster Red Hat Edge Manager deployment and a running flightctl-periodic service with permission to watch labeled secrets.

  1. Confirm Red Hat Edge Manager is installed in the same cluster as the secrets.
  2. Verify the secret label: flightctl.io/sync-<release_namespace>: "true".
  3. Check periodic service logs for informer or watch errors:

    oc logs -n <release_namespace> deployment/flightctl-periodic --tail=200
  4. If you use Prometheus, check that flightctl_dependency_sync_informer_connected is 1. A value of 0 indicates the informer is not connected.
  5. Restart the flightctl-periodic deployment if the informer failed to start after RBAC or configuration changes.

For RBAC and labeling requirements, see Configuring synchronization in the Additional resources section.
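The metric check in step 4 can be scripted against any saved metrics output. The following sketch assumes the samples are available in the Prometheus text exposition format without trailing timestamps; how you obtain them (for example, from a saved scrape) depends on your monitoring setup and is not covered here. The function name informer_state is illustrative.

```shell
# Reads Prometheus text-format metrics on stdin and reports the informer state.
# Exit status: 0 = connected, 1 = not connected, 2 = metric not found.
informer_state() {
    awk '
        # Match the metric with or without a label set; comment lines start with "#" and are skipped.
        /^flightctl_dependency_sync_informer_connected[{ ]/ {
            found = 1
            # Assumption: the sample value is the last field on the line.
            if ($NF + 0 != 1) down = 1
        }
        END {
            if (!found) { print "metric not found"; exit 2 }
            if (down)   { print "not connected";    exit 1 }
            print "connected"
        }
    '
}

# Example: informer_state < metrics.txt
```

If several series are present, the sketch reports "not connected" as soon as any of them is not 1, which errs on the side of prompting an investigation.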

Secret changes not detected

Verify all of the following:

  • Red Hat Edge Manager is deployed in-cluster (secret informers are not started for off-cluster installations).
  • The secret has the label flightctl.io/sync-<release_namespace>: "true".
  • The flightctl-periodic deployment can list and watch secrets in the secret namespace (cluster-wide RBAC or namespace Role as required).
  • The flightctl-worker service account can read the secret when rendering devices.

For labeling and RBAC, see Configuring synchronization in the Additional resources section.
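To check the label requirement quickly, you can list every secret that carries the sync label. This is a sketch only: the helper name sync_label_selector is illustrative, and the oc commands in the comments assume you have permission to list secrets in the relevant namespaces.

```shell
# Print the label selector for secrets marked for synchronization.
# $1: release namespace of the Red Hat Edge Manager deployment
sync_label_selector() {
    printf 'flightctl.io/sync-%s=true\n' "$1"
}

# Example (requires cluster access):
#   oc get secrets --all-namespaces -l "$(sync_label_selector <release_namespace>)"
#   oc get secret <secret_name> -n <secret_namespace> --show-labels
```

A secret that is missing from the first listing does not carry the expected label for that release namespace.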

HTTP resources not updating

  • Confirm the Repository resource is accessible and ACCESSIBLE is True: flightctl get repository/<name>.
  • Periodic sync uses conditional HEAD requests when the server returns ETag or Last-Modified headers. Endpoints without those headers are not actively probed. A sha256: fingerprint is recorded when configuration is rendered, not by the sync probe.
  • Ensure the HTTP server returns ETag or Last-Modified if you need automatic change detection between device renders. To verify response headers:

    curl -I <endpoint-url>
  • Parameterized suffix values are resolved per device; only devices whose resolved URL content changed are updated.
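The header check above can be turned into a pass/fail test. The following sketch reads response headers from standard input, so it works with the output of curl -sI or with saved headers; the function name http_validators is illustrative.

```shell
# Reads HTTP response headers on stdin and prints the cache validators found.
# Exit status: 0 if ETag or Last-Modified is present, 1 otherwise.
http_validators() {
    # Strip carriage returns, keep only the two validator headers, and
    # normalize the header names to lowercase (header names are case-insensitive).
    found=$(tr -d '\r' \
        | grep -i -E '^(etag|last-modified):' \
        | cut -d: -f1 \
        | tr '[:upper:]' '[:lower:]' \
        | sort -u) || true
    if [ -n "$found" ]; then
        printf '%s\n' "$found"
        return 0
    fi
    echo "none"
    return 1
}

# Example: curl -sI <endpoint-url> | http_validators
```

A result of "none" means the endpoint offers nothing for the periodic sync to probe, so changes are picked up only when configuration is next rendered.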

Git commits not applied

  • Verify the fleet or device references the correct targetRevision and path.
  • Allow up to one polling interval (default 15 minutes) after pushing a commit before expecting a new template version.
  • Ensure the Repository is accessible from the Red Hat Edge Manager service and that credentials are valid.

Duplicate or ambiguous configuration provider names

Each config entry name in a device or fleet template must be unique. If two providers share a name, status.dependencySync.configRefs can be ambiguous and fleet reconciliation can fail.

Rename configuration providers so each name is unique, then re-apply the fleet or device specification.
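Before re-applying a specification, you can check a list of provider names for duplicates. This sketch assumes you have already extracted the config entry names from the fleet or device specification, one per line; the extraction step and the function name duplicate_names are not part of the product.

```shell
# Reads config provider names on stdin, one per line, and prints each
# name that occurs more than once.
# Exit status: 0 if all names are unique, 1 if duplicates were found.
duplicate_names() {
    dups=$(sort | uniq -d)
    if [ -n "$dups" ]; then
        printf '%s\n' "$dups"
        return 1
    fi
    return 0
}

# Example: duplicate_names < provider-names.txt
```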

1.2. Troubleshooting device error codes

To improve security and performance, Red Hat Edge Manager uses structured error codes in device status responses. These codes replace verbose system logs with categorized, actionable summaries, ensuring sensitive data (like credentials) is never exposed in the API or UI.

Error message anatomy

Every error message follows a standardized format, limited to 250 characters, to help you quickly pinpoint the phase, component, and specific cause of a failure.

The error message format is as follows:

[timestamp] While <Phase>, <Component> failed [for "<Element>"]: <Category> issue - <STATUS_CODE>

Field        Description                                           Examples
Phase        The stage of the operation where the error occurred.  Preparing, ApplyingUpdate, Rebooting, RollingBack
Component    The specific system area affected.                    os, config, applications, systemd
Element      The specific resource (file, service, or image).      /etc/app.conf, fleet-agent.service, quay.io/app
Category     The functional area of the failure.                   Network, Security, Resource
Status Code  The standardized gRPC-based error code.               UNAVAILABLE, PERMISSION_DENIED, INTERNAL
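Because the message format is fixed, its fields can be split out mechanically. The following sketch uses sed with extended regular expressions. It assumes the timestamp, when present, is enclosed in square brackets as shown in the format line, and it treats both the timestamp and the element as optional. The function name parse_device_error is illustrative.

```shell
# Reads one device status message on stdin and prints its fields as key=value lines.
# Messages that do not match the documented format produce no output.
parse_device_error() {
    sed -n -E 's/^(\[([^]]*)] )?While ([^,]+), ([^ ]+) failed( for "([^"]*)")?: ([^ ]+) issue - ([A-Z_]+).*$/timestamp=\2\
phase=\3\
component=\4\
element=\6\
category=\7\
code=\8/p'
}

# Example with a made-up message assembled from the field examples:
#   printf '%s\n' 'While Preparing, os failed: Network issue - UNAVAILABLE' | parse_device_error
```

The optional fields are printed with an empty value when the message omits them, so the output always has the same six keys.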

Error reference & resolution

Use the table below to identify the root cause of a status code and the recommended next steps for resolution.

Category       Status Code                             Common Causes                                 Recommended Action

Network        UNAVAILABLE / DEADLINE_EXCEEDED         DNS failure, registry unreachable, or         Check device internet connectivity and firewall
                                                       connection timeout. Image non-existent or     rules for registry access. Verify the image
                                                       inaccessible due to registry permissions.     name/tag and registry-level access permissions.

Security       PERMISSION_DENIED / UNAUTHENTICATED     Invalid credentials, expired tokens, or       Verify registry credentials and ensure the
                                                       insufficient permissions.                     device identity is valid.

Configuration  INVALID_ARGUMENT / FAILED_PRECONDITION  Syntax errors in YAML/JSON or missing         Validate your configuration spec against the
                                                       mandatory fields. Invalid element, token,     schema.
                                                       or path format.

Filesystem     NOT_FOUND / ALREADY_EXISTS              Missing files, directory conflicts, or path   Verify the existence of required local
                                                       errors.                                       resources or mount points.

Resource       RESOURCE_EXHAUSTED                      Disk full, Out of Memory (OOM), or CPU        Check device telemetry for disk usage and
                                                       throttling.                                   memory pressure.

System         INTERNAL / UNKNOWN                      Unexpected system faults or unclassified      See Deep Dive Debugging below to correlate
                                                       errors.                                       with journal logs.
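The reference table can also be expressed as a lookup for use in scripts that triage many devices. This is a sketch that mirrors the table; the function name status_code_hint and the shortened wording of each action are illustrative.

```shell
# Print the category and a short recommended action for a status code.
# Exit status: 0 for a code listed in the reference table, 1 otherwise.
status_code_hint() {
    case "$1" in
        UNAVAILABLE|DEADLINE_EXCEEDED)
            echo "Network: check connectivity, firewall rules, image name/tag, and registry access" ;;
        PERMISSION_DENIED|UNAUTHENTICATED)
            echo "Security: verify registry credentials and device identity" ;;
        INVALID_ARGUMENT|FAILED_PRECONDITION)
            echo "Configuration: validate the configuration spec against the schema" ;;
        NOT_FOUND|ALREADY_EXISTS)
            echo "Filesystem: verify required local resources and mount points" ;;
        RESOURCE_EXHAUSTED)
            echo "Resource: check disk usage and memory pressure" ;;
        INTERNAL|UNKNOWN)
            echo "System: correlate with the device journal" ;;
        *)
            echo "Unrecognized status code: $1"
            return 1 ;;
    esac
}

# Example: status_code_hint RESOURCE_EXHAUSTED
```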

Rollback and failed OS updates

If an OS update fails, the device automatically rolls back to the previous version. The phase may appear as RollingBack; when rollback completes, the update condition reason is Error. The device does not retry the failed version automatically. For how to recognize a rollback and what to do next, see Troubleshooting OS update rollback in the Additional resources section.

Deep dive debugging

While API status responses are sanitized for security, full error details—including stack traces and raw Go error chains—are preserved in the local device journal.

If you encounter an UNKNOWN or INTERNAL error, or if the status message is truncated, you can map the status code to the detailed log:

  1. Retrieve the device status, making sure to note the timestamp and component from the message field.

    flightctl get device/<device_name> -o yaml
  2. Access the device logs.

    If you use the Flight Control web console, open the device Logs tab and retrieve logs with filters that include the failure time. For steps, see Viewing, streaming, and downloading device logs in the web console in the Additional resources section.

    If you have shell access on the device, search the local journal for the corresponding error context to see the unredacted failure:

    journalctl -u flightctl-agent.service | grep "failed to reload systemd daemon"

Note

API responses are limited to 250 characters. For the full diagnostic context—including raw Go error strings and detailed stack traces—refer to the local logs on the device.
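When you know the timestamp from the status message, narrowing the journal to a short window around it is often faster than searching for a specific string. The following sketch wraps the standard journalctl options --since, --until, and --no-pager. The function name and the JOURNALCTL override variable are illustrative, and the timestamps in the example are placeholders.

```shell
# Show flightctl-agent journal entries between two timestamps.
# $1, $2: start and end in a format journalctl accepts, for example "2025-01-01 10:00:00"
# JOURNALCTL can be set to another command for a dry run; it defaults to journalctl.
agent_journal_between() {
    "${JOURNALCTL:-journalctl}" -u flightctl-agent.service --since "$1" --until "$2" --no-pager
}

# Example (run on the device):
#   agent_journal_between "2025-01-01 10:00:00" "2025-01-01 10:05:00"
```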

1.3. Troubleshooting OS update rollback

Recognize when a device has rolled back after a failed OS update and what to do next.

When an OS update fails, Red Hat Edge Manager uses greenboot to automatically roll back the device to the previous working OS version.

Recognizing a rollback or failed update

Check the device status to see whether an update failed and the device rolled back:

  1. Retrieve the device status:

    flightctl get device/<device_name> -o yaml
  2. In the output, check:

    • status.updated.status: After a rollback, the device is typically OutOfDate (the device is running the previous OS version, not the version that was requested).
    • status.conditions: Look for the Updating condition. If the condition’s reason is Error, the update failed and the device has rolled back to the pre-update OS and configuration. If the reason is RollingBack, the agent was in the process of rolling back when it last reported.

The status.updated.info field may contain a short message about the last state transition.
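The two checks above can be combined into a single interpretation. The following sketch takes the two values as arguments, so you read them from the YAML output yourself; the function name rollback_state and the wording of its messages are illustrative.

```shell
# Interpret the update state of a device.
# $1: value of status.updated.status, for example OutOfDate
# $2: reason of the Updating condition, for example Error
rollback_state() {
    case "$2" in
        Error)
            echo "update failed; device rolled back to the previous OS and configuration" ;;
        RollingBack)
            echo "rollback in progress when the agent last reported" ;;
        *)
            if [ "$1" = "OutOfDate" ]; then
                echo "out of date; no rollback indicated by the Updating condition"
            else
                echo "no rollback indicated"
            fi ;;
    esac
}

# Example: rollback_state OutOfDate Error
```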

Viewing greenboot and rollback logs

When troubleshooting a rollback, the most useful logs are from greenboot itself. On the device, use these commands to view them:

  1. To view the greenboot health check results, run:

    sudo journalctl -o cat -u greenboot-healthcheck.service

    The following example shows journal output typical of a failed greenboot health check. Use it to pattern-match what you see on a device:

    Running Required Health Check Scripts...
    [20_check_flightctl_agent.sh] INFO: === flightctl-agent greenboot health check started ===
    [20_check_flightctl_agent.sh] INFO: GRUB boot variables:
    boot_success=0
    boot_counter=2
    ...
    time="..." level=error msg="health: Service check failed: service is not enabled (state: disabled)"
    [20_check_flightctl_agent.sh] ERROR: flightctl-agent health check failed
  2. To view pre-rollback diagnostic output (scripts that run before rollback), run:

    sudo journalctl -o cat -u redboot-task-runner.service
  3. To quickly check whether the last boot was declared successful by greenboot, inspect the GRUB environment on the device:

    sudo grub2-editenv - list | grep ^boot_success

    A value of boot_success=1 means greenboot declared the boot healthy. A value of 0 means either health checks are still running or the boot was declared failed.
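The boot_success check in step 3 can be scripted when you inspect several devices. The following sketch reads the output of grub2-editenv - list from standard input; the function name boot_health and the wording of its results are illustrative.

```shell
# Reads GRUB environment variables on stdin and interprets boot_success.
# Exit status: 0 if boot_success is 0 or 1, 1 if the variable is missing or has another value.
boot_health() {
    value=$(awk -F= '$1 == "boot_success" { print $2; exit }')
    case "$value" in
        1) echo "healthy" ;;
        0) echo "pending or failed" ;;
        *) echo "boot_success not set"; return 1 ;;
    esac
}

# Example (run on the device):
#   sudo grub2-editenv - list | boot_health
```

A result of "pending or failed" cannot distinguish between health checks that are still running and a boot that was declared failed, so check the greenboot journal output as well.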

Enabling persistent journal storage

By default, the systemd journal service stores data in the volatile /run/log/journal directory, which does not persist across reboots. To retain greenboot and agent logs for post-rollback analysis, enable persistent storage.

  1. Create the journal configuration directory:

    sudo mkdir -p /etc/systemd/journald.conf.d
  2. Create the configuration file:

    cat <<EOF | sudo tee /etc/systemd/journald.conf.d/flightctl.conf >/dev/null
    [Journal]
    Storage=persistent
    SystemMaxUse=1G
    RuntimeMaxUse=1G
    EOF
  3. Edit the configuration file values for your size requirements. For example, adjust SystemMaxUse and RuntimeMaxUse in /etc/systemd/journald.conf.d/flightctl.conf.
  4. Restart the journal service to apply the configuration:

    sudo systemctl restart systemd-journald

Post-rollback recovery and diagnostics

  • Verify the device is running: The device should be online and running the previous OS version. Confirm that status.summary.status is Online or Degraded and that status.os.image matches the previous (working) image.
  • Investigate the failure: Start with the device status message and the device logs. In the Flight Control web console, open the device Logs tab to view or stream Agent or System logs (see Viewing, streaming, and downloading device logs in the web console in the Additional resources section). For rollback-specific messages, prefer the greenboot journal output described in Viewing greenboot and rollback logs. If you have shell access, you can also check the agent journal on the device (for example, journalctl -u flightctl-agent.service). Common causes include health check failures after reboot, network or registry issues, and resource constraints. See Troubleshooting device error codes for error categories and recommended actions.
  • Fix and try a new version: Address the underlying issue (for example, fix the OS image or configuration, or resolve network or resource problems). When ready, update the device spec to a new OS image version or a corrected image so the agent can attempt an update again.

    Note

    The agent does not retry a failed version. It marks the failed version and skips it in future reconciliation. Pushing the same OS image again without change will not trigger a retry; you must push a new image version (different digest).

When to escalate

Consider escalating or opening a support case if:

  • The device does not come back online after a rollback.
  • Rollbacks happen repeatedly for the same or different OS versions.
  • The device status remains in RollingBack or Error for an extended period with no recovery.
  • You need to force a retry of a previously failed version and the product does not provide a supported way to do so.

1.4. Generating a device log bundle

Use the integrated flightctl-must-gather script directly on the device to generate a comprehensive bundle of diagnostic logs. This log bundle, in a standard .tar format, provides the necessary data to debug the device agent and assists in efficient troubleshooting and bug reporting.

To view or download recent journal lines from the Flight Control web console without SSH, see Viewing, streaming, and downloading device logs in the web console.

Procedure

  1. Run the following command on the device and include the resulting .tar file in the bug report:

    sudo flightctl-must-gather

    Note

    Copying the .tar file off the device requires an SSH connection.

1.5. Viewing a device’s effective target configuration

The device manifest returned by the flightctl get device command contains only references to external configuration and secret objects. The service replaces those references with the actual configuration and secret data only when the device agent queries the service.

This protects potentially sensitive data but makes faulty configurations harder to troubleshoot. For this reason, a user can be authorized to query the effective configuration as the service renders it for the agent.

Procedure

  • To query the effective configuration, use the following command:

    flightctl get device/<device_name> --rendered | jq

Legal Notice

Copyright © Red Hat.
Except as otherwise noted below, the text of and illustrations in this documentation are licensed by Red Hat under the Creative Commons Attribution–Share Alike 3.0 Unported license. If you distribute this document or an adaptation of it, you must provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, the Red Hat logo, JBoss, Hibernate, and RHCE are trademarks or registered trademarks of Red Hat, LLC. or its subsidiaries in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
XFS is a trademark or registered trademark of Hewlett Packard Enterprise Development LP or its subsidiaries in the United States and other countries.
The OpenStack® Word Mark and OpenStack logo are trademarks or registered trademarks of the Linux Foundation, used under license.
All other trademarks are the property of their respective owners.