How to deal with journal errors in qpid broker?

Solution Verified - Updated

Environment

  • MRG Messaging 2.*
    • (legacy) store module used

Issue

  • Our qpid broker raised a journal error (JERR__UNEXPRESPONSE, RHM_IORES_EMPTY, RHM_IORES_BUSY or so).

  • What does each particular error type mean?

  • What is the process to generally recover from a journal error?

  • What logs, configuration and command outputs need to be gathered to help troubleshoot the problem?

Resolution

What does each particular error type mean?

See Root cause for details. It is recommended to understand what error one hit as that suggests what type of behaviour might trigger the error.

How to recover from a journal error

Generally, some errors are self-recoverable (like store is full) while some aren't. In all cases, it is recommended to backup journal files of the affected queue. For default configuration, it means backing up directory (just one fits the mask below for any queue):

/var/lib/qpid/rhn/jrnl/*/<queueName>

Such backup prevents potential data loss and allows Red Hat to investigate the root cause of the failure (see below).

If the error is self-recoverable (meaning it happened once but is not regularly repeating), the journal can be safely used further. For not self-recoverable errors occurring e.g. on every read or write access of the journal (and preventing usage of the queue), it is recommended to follow this procedure:

  • If the queue is not empty and contains messages that can't be lost, try to stop producers and consume messages. If stopping producers isn't possible, then try the below steps (that might fail for too damaged journal):
    • Start another broker, create there the same (durable) queue.
    • Stop it and replace its journal files by the backed up journal files.
    • Start it again and let consumer(s) to dequeue messages from this broker.
  • Once the content of the journal is not important, simply delete and re-create the queue (with its journal) and all bindings again.

Some journal errors are recoverable even by restarting the broker, allowing the store to recover during initial journals read. If this is successful, then there is a good (although not guaranteed) chance that normal activity can resume. However, if the journal records themselves have somehow become corrupted or bad, then recovery may also fail, data in the journal are lost and one has to manually delete the journal first in order to even start the broker up. Therefore it is up to individual decision whether to try restarting the broker in order to possibly recover the journal.

What information to gather for further troubleshoot

It is highly recommend to provide to Red Hat support below information in order to troubleshoot a journal problem. Note that many these problems need building accurate reproducer in order to even understand the real root cause. First several questions can (but not need to) be answered by providing sosreport:

  • What filesystem (ext4 or so) is used for the journal store?
  • Is the machine bare metal or virtual? If virtual, please provide type of hypervisor.
  • broker version:

rpm -qa | grep qpid

Answers to below questions have to be provided in addition to potential sosreport. The sooner after the error are the data provided, the better, otherwise some further activity on the journal can hide the original symptoms of the problem:

  • broker logs, regardless of verbosity set in past. Store module itself does not log too much, therefore enabling more verbose logging after the error occurred does not help. While logs preceding the error are relevant - ideally of at least debug+ level
  • If the error is repeating, describe what behaviour (like any enqueue attempt) triggers an error log.
  • if possible (i.e. due to security reasons), provide current content of the journal (there is one directory every time for the below mask) - as soon as the error is seen:

/var/lib/qpidd/rhn/jrnl/*/<queueName>

  • Provide queue provisioning and statistics, by:
    • either issuing proper QMF method (like get_queue_details.cpp in this KCS solution)
    • or doing so via qpid-tool:
      • run qpid-tool and after prompt appears, wait 10 seconds
      • type list queue
      • in the output, find the queue name and check its ID in the left-most column
      • type show <ID> for the ID of the queue, and store whole output
  • What is average and maximal message content size in this queue?
  • What is usual size of message headers?
  • What is average and maximal queue depth?
  • Are messages enqueued solely to this queue? Or are (some or all?) sent via an exchange that enqueues them also to another queue(s)?
  • If also another queue(s) can enqueue same message, provide their configuration (qpid-config queues).
  • How many producers can concurrently enqueue to this queue?
  • How many consumers can concurrently dequeue from this queue?
  • Are there any browsers of the queue? If so, run they concurrently with some consumer(s)?

How to react on some particular journal errors?

RHM_IORES_ENQCAPTHRESH

The journal reached its capacity. Dequeues are allowed now only. To prevent it, create it with bigger capacity. See This content is not included.manual for details.

If the queue has just few messages while the error is seen, let follow this solution.

JERR_MAP_LOCKED

See this solution for details.

Root Cause

List of journal errors and their descriptions, as taken from store module source code. Most of them are a symptom of a store module bug where it is required to gather information (see Resolution) to investigate further:

  • Possible return states from journal read and write operations:
        RHM_IORES_SUCCESS = 0,  ///< Success: IO operation completed noramlly.
        RHM_IORES_PAGE_AIOWAIT, ///< IO operation suspended - next page is waiting for AIO.
        RHM_IORES_FILE_AIOWAIT, ///< IO operation suspended - next file is waiting for AIO.
        RHM_IORES_EMPTY,        ///< During read operations, nothing further is available to read.
        RHM_IORES_RCINVALID,    ///< Read page cache is invalid (ie obsolete or uninitialized)
        RHM_IORES_ENQCAPTHRESH, ///< Enqueue capacity threshold (limit) reached.
        RHM_IORES_FULL,         ///< During write operations, the journal files are full.
        RHM_IORES_BUSY,         ///< Another blocking operation is in progress.
        RHM_IORES_TXPENDING,    ///< Operation blocked by pending transaction.
        RHM_IORES_NOTIMPL       ///< Function is not yet implemented.
  • Internal journal errors:
        // generic errors
        static const u_int32_t JERR__MALLOC;            ///< Buffer memory allocation failed
        static const u_int32_t JERR__UNDERFLOW;         ///< Underflow error
        static const u_int32_t JERR__NINIT;             ///< Operation on uninitialized class
        static const u_int32_t JERR__AIO;               ///< AIO failure
        static const u_int32_t JERR__FILEIO;            ///< File read or write failure
        static const u_int32_t JERR__RTCLOCK;           ///< Reading real-time clock failed
        static const u_int32_t JERR__PTHREAD;           ///< pthread failure
        static const u_int32_t JERR__TIMEOUT;           ///< Timeout waiting for an event
        static const u_int32_t JERR__UNEXPRESPONSE;     ///< Unexpected response to call or event
        static const u_int32_t JERR__RECNFOUND;         ///< Record not found
        static const u_int32_t JERR__NOTIMPL;           ///< Not implemented

        // class jcntl
        static const u_int32_t JERR_JCNTL_STOPPED;      ///< Operation on stopped journal
        static const u_int32_t JERR_JCNTL_READONLY;     ///< Write operation on read-only journal
        static const u_int32_t JERR_JCNTL_AIOCMPLWAIT;  ///< Timeout waiting for AIOs to complete
        static const u_int32_t JERR_JCNTL_UNKNOWNMAGIC; ///< Found record with unknown magic
        static const u_int32_t JERR_JCNTL_NOTRECOVERED; ///< Req' recover() to be called first
        static const u_int32_t JERR_JCNTL_RECOVERJFULL; ///< Journal data files full, cannot write
        static const u_int32_t JERR_JCNTL_OWIMISMATCH;  ///< OWI change found in unexpected location

        // class jdir
        static const u_int32_t JERR_JDIR_NOTDIR;        ///< Exists but is not a directory
        static const u_int32_t JERR_JDIR_MKDIR;         ///< Directory creation failed
        static const u_int32_t JERR_JDIR_OPENDIR;       ///< Directory open failed
        static const u_int32_t JERR_JDIR_READDIR;       ///< Directory read failed
        static const u_int32_t JERR_JDIR_CLOSEDIR;      ///< Directory close failed
        static const u_int32_t JERR_JDIR_RMDIR;         ///< Directory delete failed
        static const u_int32_t JERR_JDIR_NOSUCHFILE;    ///< File does not exist
        static const u_int32_t JERR_JDIR_FMOVE;         ///< File move failed
        static const u_int32_t JERR_JDIR_STAT;          ///< File stat failed
        static const u_int32_t JERR_JDIR_UNLINK;        ///< File delete failed
        static const u_int32_t JERR_JDIR_BADFTYPE;      ///< Bad or unknown file type (stat mode)

        // class fcntl
        static const u_int32_t JERR_FCNTL_OPENWR;       ///< Unable to open file for write
        static const u_int32_t JERR_FCNTL_WRITE;        ///< Unable to write to file
        static const u_int32_t JERR_FCNTL_CLOSE;        ///< File close failed
        static const u_int32_t JERR_FCNTL_FILEOFFSOVFL; ///< Increased offset past file size
        static const u_int32_t JERR_FCNTL_CMPLOFFSOVFL; ///< Increased cmpl offs past subm offs
        static const u_int32_t JERR_FCNTL_RDOFFSOVFL;   ///< Increased read offs past write offs

        // class lfmgr
        static const u_int32_t JERR_LFMGR_BADAEFNUMLIM; ///< Bad auto-expand file number limit
        static const u_int32_t JERR_LFMGR_AEFNUMLIMIT;  ///< Exceeded auto-expand file number limit
        static const u_int32_t JERR_LFMGR_AEDISABLED;   ///< Attempted to expand with auto-expand disabled

        // class rrfc
        static const u_int32_t JERR_RRFC_OPENRD;        ///< Unable to open file for read

        // class jrec, enq_rec, deq_rec, txn_rec
        static const u_int32_t JERR_JREC_BADRECHDR;     ///< Invalid data record header
        static const u_int32_t JERR_JREC_BADRECTAIL;    ///< Invalid data record tail

        // class wmgr
        static const u_int32_t JERR_WMGR_BADPGSTATE;    ///< Page buffer in illegal state.
        static const u_int32_t JERR_WMGR_BADDTOKSTATE;  ///< Data token in illegal state.
        static const u_int32_t JERR_WMGR_ENQDISCONT;    ///< Enq. new dtok when previous part compl.
        static const u_int32_t JERR_WMGR_DEQDISCONT;    ///< Deq. new dtok when previous part compl.
        static const u_int32_t JERR_WMGR_DEQRIDNOTENQ;  ///< Deq. rid not enqueued

        // class rmgr
        static const u_int32_t JERR_RMGR_UNKNOWNMAGIC;  ///< Found record with unknown magic
        static const u_int32_t JERR_RMGR_RIDMISMATCH;   ///< RID mismatch between rec and dtok
        //static const u_int32_t JERR_RMGR_FIDMISMATCH;   ///< FID mismatch between emap and rrfc
        static const u_int32_t JERR_RMGR_ENQSTATE;      ///< Attempted read when wstate not ENQ
        static const u_int32_t JERR_RMGR_BADRECTYPE;    ///< Attempted op on incorrect rec type

        // class data_tok
        static const u_int32_t JERR_DTOK_ILLEGALSTATE;  ///< Attempted to change to illegal state
//         static const u_int32_t JERR_DTOK_RIDNOTSET;     ///< Record ID not set

        // class enq_map, txn_map
        static const u_int32_t JERR_MAP_DUPLICATE;      ///< Attempted to insert using duplicate key
        static const u_int32_t JERR_MAP_NOTFOUND;       ///< Key not found in map
        static const u_int32_t JERR_MAP_LOCKED;         ///< rid locked by pending txn

        // class jinf
        static const u_int32_t JERR_JINF_CVALIDFAIL;    ///< Compatibility validation failure
        static const u_int32_t JERR_JINF_NOVALUESTR;    ///< No value attr found in jinf file
        static const u_int32_t JERR_JINF_BADVALUESTR;   ///< Bad format for value attr in jinf file
        static const u_int32_t JERR_JINF_JDATEMPTY;     ///< Journal data files empty
        static const u_int32_t JERR_JINF_TOOMANYFILES;  ///< Too many journal data files
        static const u_int32_t JERR_JINF_INVALIDFHDR;   ///< Invalid file header
        static const u_int32_t JERR_JINF_STAT;          ///< Error while trying to stat a file
        static const u_int32_t JERR_JINF_NOTREGFILE;    ///< Target file is not a regular file
        static const u_int32_t JERR_JINF_BADFILESIZE;   ///< File is of incorrect or unexpected size
        static const u_int32_t JERR_JINF_OWIBAD;        ///< OWI inconsistent (>1 transition in non-ae journal)
        static const u_int32_t JERR_JINF_ZEROLENFILE;   ///< Journal info file is zero length (empty).

        // Negative returns for some functions
        static const int32_t AIO_TIMEOUT;               ///< Timeout waiting for AIO return
        static const int32_t LOCK_TAKEN;                ///< Attempted to take lock, but it was taken by another thread
SBR
Components
Category

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.