Sending RHV monitoring data to a remote Elasticsearch instance


Overview

This article is for administrators who have an existing Elasticsearch, Logstash, Kibana (ELK) stack for monitoring their infrastructure. You can monitor your RHV infrastructure using your ELK dashboards. Use the Elasticsearch resources at www.elastic.co to install and configure your ELK stack.

This article describes how to enable sending collected metrics and logs to a standalone Elasticsearch instance for monitoring the Red Hat Virtualization environment (RHV 4.3 or higher).
The procedure includes:

  • Deploying collectd and rsyslog on hosts
  • Importing sample Kibana dashboards
  • Using Kibana Dashboards
  • Metrics Schema

Prerequisites

  • An Elasticsearch instance with Kibana

Setup Procedure

  1. Log in to the Red Hat Virtualization Manager machine using SSH.
  2. Copy config.yml.example to create a new config.yml:
  # cp /etc/ovirt-engine-metrics/config.yml.example /etc/ovirt-engine-metrics/config.yml.d/config.yml
  3. Optionally, edit the ovirt_env_name and elasticsearch_host parameters (see Notes below for definitions) in the config.yml,
    and add the following variable:

    use_omelasticsearch_cert: false

    When using certificates, set use_omelasticsearch_cert to true.

  4. Save the file. These parameters are mandatory and are documented inside the *.yml file.
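As an illustration, a minimal config.yml might look like the following (the hostname is a placeholder, and use_omelasticsearch_cert is shown with its non-certificate setting):

```yaml
# /etc/ovirt-engine-metrics/config.yml.d/config.yml
ovirt_env_name: engine
elasticsearch_host: elasticsearch.example.com
use_omelasticsearch_cert: false
```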

  5. In the Elasticsearch Kibana portal, go to Home > Management > Stack Management, select Kibana > Index Patterns, and add the following index patterns:

    project.ovirt-metrics-<ovirt_env_name>*
    project.ovirt-logs-<ovirt_env_name>*

    Where <ovirt_env_name> is the name you defined in

    /etc/ovirt-engine-metrics/config.yml.d/config.yml

    • The default <ovirt_env_name> value is engine.
    • The dynamic index templates should be set to automatically add the date to the index name based on the document date.
  6. Click Refresh for each of the new index patterns.

  7. Deploy collectd and rsyslog on the hosts:

     # /usr/share/ovirt-engine-metrics/setup/ansible/configure_ovirt_machines_for_metrics.sh
    

Note: the configure_ovirt_machines_for_metrics.sh script runs an Ansible role that includes linux-system-roles (see Administration and configuration tasks using System Roles in RHEL) and uses it to deploy and configure rsyslog on the host. rsyslog collects metrics from collectd and sends them to Elasticsearch.

  8. Import the pre-defined dashboard examples to Kibana; see Using Kibana Dashboards below.

    Note: the dashboard examples are only available after completing the procedure for deploying collectd and rsyslog.

  9. When prompted to select an index to use for each dashboard, select the following indices from the drop-down list:

    project.ovirt-metrics-<ovirt_env_name>

    project.ovirt-logs-<ovirt_env_name>

Expected result

Metrics and log data are saved to Elasticsearch and visible in the Kibana dashboard.
The engine.log is collected from the Manager machine, and the vdsm.log is collected from the hosts.

NOTES

  • To test whether data is received by Elasticsearch, run the following command on the Elasticsearch host:

    curl -X GET "localhost:9200/_cat/indices/*?v&s=index&pretty"
    

    The results are displayed in a table that includes a column for the value docs.count.
    If docs.count = 0, this indicates that no data has reached Elasticsearch.
    For example:

    [root@hostname-e-k ~]# curl -X GET "localhost:9200/_cat/indices/*?v&s=index&pretty"
    | health|status| index                                   | uuid       |pri|rep| docs.count|docs.deleted| store.size |
    |-------|------|-----------------------------------------|------------|---|---|----------:|-----------:|-----------:|
    | yellow| open | .kibana                                 | KRSclvC2Sx | 1 | 1 |        50 |          2 |       78kb |
    | yellow| open | project.ovirt-logs-engine..2020.09.21   | czuFHjhngf | 5 | 1 |      1504 |          0 |      936kb |
    | yellow| open | project.ovirt-logs-engine..2020.09.22   | UbTi-LNhrt | 5 | 1 |      2151 |          0 |    908.2kb |
    | yellow| open | project.ovirt-logs-engine..2020.09.23   | 7mAomrgDQKi| 5 | 1 |      2155 |          0 |    928.1kb |
    | yellow| open | project.ovirt-logs-engine..2020.09.24   | tvuC1nyiQy | 5 | 1 |      1525 |          0 |        1mb |
    | yellow| open | project.ovirt-metrics-engine..2020.09.21| EyAAaY2hSq | 5 | 1 |     25551 |          0 |   1009.3mb |
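The _cat/indices output can also be checked programmatically. The following sketch (the helper name empty_indices is illustrative, not part of Elasticsearch or RHV) parses the tabular output and reports indices whose docs.count is zero:

```python
def empty_indices(cat_output):
    """Parse the text output of `GET _cat/indices/*?v&s=index` and
    return the names of indices whose docs.count column is 0."""
    lines = cat_output.strip().splitlines()
    header = lines[0].split()
    idx_col = header.index("index")
    count_col = header.index("docs.count")
    empty = []
    for line in lines[1:]:
        fields = line.split()
        if int(fields[count_col]) == 0:
            empty.append(fields[idx_col])
    return empty
```

You could feed it the output of the curl command above; any index it returns has received no documents.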
    
  • If you are using basic HTTP authentication, the Elasticsearch password needs to be specified and encrypted:

  1. Copy the example file /etc/ovirt-engine-metrics/secure_vars.yaml.example to create /etc/ovirt-engine-metrics/config.yml.d/secure_vars.yaml:

    cp /etc/ovirt-engine-metrics/secure_vars.yaml.example /etc/ovirt-engine-metrics/config.yml.d/secure_vars.yaml
    
  2. Encrypt the file and set the vault password:

    ansible-vault encrypt /etc/ovirt-engine-metrics/config.yml.d/secure_vars.yaml
    
  3. Specify the logging_elasticsearch_password in the encrypted file. If there are multiple Elasticsearch servers, they must all share a single password.

    ansible-vault edit /etc/ovirt-engine-metrics/config.yml.d/secure_vars.yaml
    

Note: The default user name is elastic. If you want to use a different user name, you must add the uid parameter to the config.yml file.

  4. Deploy collectd and rsyslog on the hosts:

     /usr/share/ovirt-engine-metrics/setup/ansible/configure_ovirt_machines_for_metrics.sh --ask-vault-pass
    
  • If you are not using HTTPS, add the following variables to the config.yml file:

    rsyslog_elasticsearch_usehttps_metrics: off
    rsyslog_elasticsearch_usehttps_logs: off
    
  • If certificates are required to communicate with Elasticsearch, you need to specify their location.

    Default certificate locations:

    # Where to find the SSL CA certificate used to communicate with Elasticsearch:

    rsyslog_elasticsearch_ca_cert_path: '/etc/pki/tls/certs/elasticsearch_ca_cert.pem'
    See also tls.cacert in the rsyslog documentation at www.rsyslog.com.

    # Where to find the SSL client certificate used to communicate with Elasticsearch:

    rsyslog_elasticsearch_client_cert_path: '/etc/pki/tls/certs/elasticsearch_client_cert.pem'
    See also tls.mycert in the rsyslog documentation at www.rsyslog.com.

    # Where to find the SSL client key used to communicate with Elasticsearch:

    rsyslog_elasticsearch_client_key_path: '/etc/pki/tls/private/elasticsearch_client_key.pem'
    See also tls.myprivkey in the rsyslog documentation at www.rsyslog.com.

  • ovirt_env_name: environment name, used to identify data collected in a single central store, sent from more than one oVirt engine.
    Use the following convention:

    • Include only alphanumeric characters and hyphens ("-").
    • Name cannot begin with a hyphen or a number, or end with a hyphen.
    • Maximum of 49 characters. Wildcard patterns (e.g. ovirt-metrics*) cannot be used.
  • elasticsearch_host: address or hostname (FQDN) of the Elasticsearch server host. The value can be a single host or a list of hosts.
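As an illustration, elasticsearch_host could plausibly take either form (hostnames are placeholders; the list form assumes standard YAML list syntax):

```yaml
# A single Elasticsearch host:
elasticsearch_host: elasticsearch.example.com

# Or a list of hosts:
elasticsearch_host:
  - es1.example.com
  - es2.example.com
```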

USING KIBANA DASHBOARDS

A dashboard displays a set of saved visualizations. Dashboards have the advantage of enabling you to quickly access a wide range of metrics while offering the flexibility of changing them to match your individual needs.

You can use the Dashboard tab to create your own dashboards. Alternatively, Red Hat provides the following dashboard examples, which you can import into Kibana and use as is or customize to suit your specific needs:

  • System dashboard
  • Hosts dashboard
  • VMs dashboard

Importing Dashboard Examples
Note: The dashboard examples are only available after completing the procedure for Deploying collectd and rsyslog.

  1. Copy the /etc/ovirt-engine-metrics/dashboards-examples directory from the Manager virtual machine to your local machine.

  2. Log in to the Kibana console using the URL (https://kibana.example.com) that you recorded during the setup process. Use the default admin user, and the password you defined during setup.

  3. Open Kibana and click the Management tab.

  4. Click the Saved Objects tab.

  5. Click Import and import Searches from your local copy of /etc/ovirt-engine-metrics/dashboards-examples.

  6. Click Import and import Visualizations.
    Note: If you see an error message while importing the visualizations, check your hosts to ensure that collectd and rsyslog are running without errors.

  7. Click Import and import Dashboards.
    Note: If you are logged in as the admin user, you may see a message regarding missing index patterns while importing the visualizations.
    Select the project.* index pattern instead.

  8. In the Elasticsearch Kibana portal, go to Home > Management > Stack Management, select Kibana > Index Patterns, and add the following index patterns:

    project.ovirt-metrics-<ovirt_env_name>*
    project.ovirt-logs-<ovirt_env_name>*

  9. Click Refresh for each of the new index patterns.

The imported dashboards are now stored in the system.

Loading Saved Dashboards

Once you have created and saved a dashboard, or imported Red Hat’s sample dashboards, you can display them in the Dashboard tab:

  1. Click the Dashboards tab to display a list of saved dashboards.
  2. Click a saved dashboard to load it.

Metrics Schema

The following sections describe the metrics that are available from the Field menu when creating visualizations.

NOTE:
All metric values are collected at 10 second intervals.

Aggregation Metrics

The Aggregation metric aggregates several values into one using aggregation functions such as sum, average, min, and max. It is used to provide a combined value for average and total CPU statistics.

The following table describes the aggregation metrics reported by the Aggregation plugin.

| Metric Name | collectd.type_instance | Description |
|---|---|---|
| collectd.aggregation.percent | interrupt / user / wait / nice / softirq / system / idle / steal | The average and total CPU usage, as an aggregated percentage, for each of the collectd.type_instance states. |

Additional Values

  • collectd.plugin: Aggregation
  • collectd.type_instance: cpu-average / cpu-sum
  • collectd.plugin_instance:
  • collectd.type: percent
  • ovirt.entity: host
  • ovirt.cluster.name.raw: The cluster's name
  • ovirt.engine_fqdn.raw: The {engine-name}'s FQDN
  • hostname: The host's FQDN
  • ipaddr4: IP address
  • interval: 10
  • collectd.dstypes: Gauge

CPU Metrics

CPU metrics display the amount of time spent by the hosts' CPUs, as a percentage.

The following table describes CPU metrics as reported by the CPU plugin.

| Metric Name | collectd.type_instance | Description |
|---|---|---|
| collectd.cpu.percent | interrupt / user / wait / nice / softirq / system / idle / steal | The percentage of time spent, per CPU, in each of the collectd.type_instance states. |

Additional Values

  • collectd.plugin: CPU
  • collectd.plugin_instance: The CPU's number
  • collectd.type: percent
  • ovirt.entity: host
  • ovirt.cluster.name.raw: The cluster's name
  • ovirt.engine_fqdn.raw: The {engine-name}'s FQDN
  • hostname: The host's FQDN
  • ipaddr4: IP address
  • interval: 10
  • collectd.dstypes: Gauge

CPU Load Average Metrics

CPU load represents CPU contention, that is, the average number of schedulable processes at any given time. This is reported as an average value for all CPU cores on the host. Each CPU core can only execute one process at a time. Therefore, a CPU load average above 1.0 indicates that the CPUs have more work than they can perform, and the system is overloaded.

CPU load is reported over short term (last one minute), medium term (last five minutes) and long term (last fifteen minutes). While it is normal for a host's short term load average to exceed 1.0 (for a single CPU), sustained load average above 1.0 on a host may indicate a problem.

On multi-processor systems, the load is relative to the number of processor cores available. The "100% utilization" mark is 1.00 on a single-core, 2.00 on a dual-core, 4.00 on a quad-core system.

Red Hat recommends looking at CPU load in conjunction with CPU Metrics.

The following table describes the CPU load metrics reported by the Load plugin.

| Metric Name | Description |
|---|---|
| collectd.load.load.longterm | The average number of schedulable processes per CPU core over the last 15 minutes. A value above 1.0 indicates the system was overloaded during the last 15 minutes. |
| collectd.load.load.midterm | The average number of schedulable processes per CPU core over the last five minutes. A value above 1.0 indicates the system was overloaded during the last five minutes. |
| collectd.load.load.shortterm | The average number of schedulable processes per CPU core over the last minute. A value above 1.0 indicates the system was overloaded during the last minute. |
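The per-core normalization described above can be sketched as follows (normalized_load is an illustrative helper, not a collectd function):

```python
def normalized_load(load_avg, cpu_cores):
    """Express a load average as a fraction of total CPU capacity:
    1.0 means every core is fully busy; values above 1.0 mean overload."""
    return load_avg / cpu_cores

# On a quad-core host, a load average of 3.00 is 75% of capacity,
# while the same load on a dual-core host (150%) indicates overload.
```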

Additional Values

  • collectd.plugin: Load
  • collectd.type: load
  • collectd.type_instance: None
  • collectd.plugin_instance: None
  • ovirt.entity: host
  • ovirt.cluster.name.raw: The cluster's name
  • ovirt.engine_fqdn.raw: The {engine-name}'s FQDN
  • hostname: The host's FQDN
  • ipaddr4: IP address
  • interval: 10
  • collectd.dstypes: Gauge

Disk Consumption Metrics

Disk consumption (DF) metrics enable you to monitor metrics about disk consumption, such as the used, reserved, and free space for each mounted file system.

The following table describes the disk consumption metrics reported by the DF plugin.

| Metric Name | Description |
|---|---|
| collectd.df.df_complex | The amount of free, used, and reserved disk space, in bytes, on this file system. |
| collectd.df.percent_bytes | The amount of free, used, and reserved disk space, as a percentage of total disk space, on this file system. |

Additional Values

  • collectd.plugin: DF
  • collectd.type_instance: free, used, reserved
  • collectd.plugin_instance: A mounted partition
  • ovirt.entity: host
  • ovirt.cluster.name.raw: The cluster's name
  • ovirt.engine_fqdn.raw: The {engine-name}'s FQDN
  • hostname: The host's FQDN
  • ipaddr4: IP address
  • interval: 10
  • collectd.dstypes: Gauge

Disk Operation Metrics

Disk operation metrics are reported per physical disk on the host, and per partition.

The following table describes the disk operation metrics reported by the Disk plugin.

| Metric Name | Description | collectd.dstypes |
|---|---|---|
| collectd.disk.disk_ops.read | The number of disk read operations. | Derive |
| collectd.disk.disk_ops.write | The number of disk write operations. | Derive |
| collectd.disk.disk_merged.read | The number of disk reads that have been merged into single physical disk access operations; that is, the number of instances in which one physical disk access served multiple disk reads. The higher the number, the better. | Derive |
| collectd.disk.disk_merged.write | The number of disk writes that were merged into single physical disk access operations; that is, the number of instances in which one physical disk access served multiple write operations. The higher the number, the better. | Derive |
| collectd.disk.disk_time.read | The average amount of time a read operation took, in milliseconds. | Derive |
| collectd.disk.disk_time.write | The average amount of time a write operation took, in milliseconds. | Derive |
| collectd.disk.pending_operations | The queue size of pending I/O operations. | Gauge |
| collectd.disk.disk_io_time.io_time | The time spent doing I/O, in milliseconds. This can be used as a device load percentage, where 1 second of I/O time per second represents a 100% load. | Derive |
| collectd.disk.disk_io_time.weighted_io_time | A measure of both I/O completion time and the backlog that may be accumulating. | Derive |
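The device-load interpretation of disk_io_time.io_time can be sketched as follows (device_load_percent is an illustrative helper; it assumes the io_time delta is taken over the 10-second collection interval):

```python
def device_load_percent(io_time_ms_delta, interval_s=10.0):
    """io_time counts milliseconds spent doing I/O. Dividing the
    per-interval delta by the elapsed milliseconds gives the device
    load: 1 second of I/O per second of wall clock is a 100% load."""
    return 100.0 * io_time_ms_delta / (interval_s * 1000.0)
```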

Additional Values

  • collectd.plugin: Disk
  • collectd.type_instance: None
  • collectd.plugin_instance: The disk's name
  • ovirt.entity: host
  • ovirt.cluster.name.raw: The cluster's name
  • ovirt.engine_fqdn.raw: The {engine-name}'s FQDN
  • hostname: The host's FQDN
  • ipaddr4: IP address
  • interval: 10

Entropy Metrics

Entropy metrics display the available entropy pool size on the host. Entropy is important for generating random numbers, which are used for encryption, authorization, and similar tasks.

The following table describes the entropy metrics reported by the Entropy plugin.

| Metric Name | Description |
|---|---|
| collectd.entropy.entropy | The entropy pool size, in bits, on the host. |

Additional Values

  • collectd.plugin: Entropy
  • collectd.type_instance: None
  • collectd.plugin_instance: None
  • ovirt.entity: host
  • ovirt.cluster.name.raw: The cluster's name
  • ovirt.engine_fqdn.raw: The {engine-name}'s FQDN
  • hostname: The host's FQDN
  • ipaddr4: IP address
  • interval: 10
  • collectd.dstypes: Gauge

Network Interface Metrics

The following types of metrics are reported from physical and virtual network interfaces on the host:

  • Bytes (octets) transmitted and received (total, or per second)
  • Packets transmitted and received (total, or per second)
  • Interface errors (total, or per second)

The following table describes the network interface metrics reported by the Interface plugin.

| collectd.type | Metric Name | Description |
|---|---|---|
| if_octets | collectd.interface.if_octets.rx | A count of the bytes received by the interface. Viewed as Rate/sec, it provides the current traffic level on the interface in bytes/sec; viewed as a cumulative count (Max), it provides the total bytes received. Because this metric is a cumulative counter, its value periodically restarts from zero when the maximum possible value of the counter is exceeded. |
| if_octets | collectd.interface.if_octets.tx | A count of the bytes transmitted by the interface. Viewed as Rate/sec, it provides the current traffic level on the interface in bytes/sec; viewed as a cumulative count (Max), it provides the total bytes transmitted. Because this metric is a cumulative counter, its value periodically restarts from zero when the maximum possible value of the counter is exceeded. |
| if_packets | collectd.interface.if_packets.rx | A count of the packets received by the interface. Viewed as Rate/sec, it provides the current traffic level on the interface in packets/sec; viewed as a cumulative count (Max), it provides the total packets received. Because this metric is a cumulative counter, its value periodically restarts from zero when the maximum possible value of the counter is exceeded. |
| if_packets | collectd.interface.if_packets.tx | A count of the packets transmitted by the interface. Viewed as Rate/sec, it provides the current traffic level on the interface in packets/sec; viewed as a cumulative count (Max), it provides the total packets transmitted. Because this metric is a cumulative counter, its value periodically restarts from zero when the maximum possible value of the counter is exceeded. |
| if_errors | collectd.interface.if_errors.rx | A count of errors received on the interface. The Rate/sec rollup provides the current rate of errors received in errors/sec; the Max rollup provides the total number of errors received since the beginning. Because this is a cumulative counter, its value periodically restarts from zero when the maximum possible value of the counter is exceeded. |
| if_errors | collectd.interface.if_errors.tx | A count of errors transmitted on the interface. The Rate/sec rollup provides the current rate of errors transmitted in errors/sec; the Max rollup provides the total number of errors transmitted since the beginning. Because this is a cumulative counter, its value periodically restarts from zero when the maximum possible value of the counter is exceeded. |
| if_dropped | collectd.interface.if_dropped.rx | A count of dropped packets received on the interface. The Rate/sec rollup provides the current rate of dropped packets received in packets/sec; the Max rollup provides the total number of dropped packets received since the beginning. Because this is a cumulative counter, its value periodically restarts from zero when the maximum possible value of the counter is exceeded. |
| if_dropped | collectd.interface.if_dropped.tx | A count of dropped packets transmitted on the interface. The Rate/sec rollup provides the current rate of dropped packets transmitted in packets/sec; the Max rollup provides the total number of dropped packets transmitted since the beginning. Because this is a cumulative counter, its value periodically restarts from zero when the maximum possible value of the counter is exceeded. |

Additional Values

  • collectd.plugin: Interface
  • collectd.type_instance: None
  • collectd.plugin_instance: The network's name
  • ovirt.entity: host
  • ovirt.cluster.name.raw: The cluster's name
  • ovirt.engine_fqdn.raw: The {engine-name}'s FQDN
  • hostname: The host's FQDN
  • ipaddr4: IP address
  • interval: 10
  • collectd.dstypes: Derive

Memory Metrics

Metrics collected about memory usage.

The following table describes the memory usage metrics reported by the Memory plugin.

| Metric Name | collectd.type | collectd.type_instance | Description |
|---|---|---|---|
| collectd.memory.memory | memory | used | The total amount of memory used. |
| | | cached | The amount of memory used for caching disk data for reads, memory-mapped files, or tmpfs data. |
| | | buffered | The amount of memory used for buffering, mostly for I/O operations. |
| | | slab_recl | The amount of reclaimable memory used for slab kernel allocations. |
| | | slab_unrecl | The amount of unreclaimable memory used for slab kernel allocations. |
| collectd.memory.percent | percent | used | The total amount of memory used, as a percentage. |
| | | free | The total amount of unused memory, as a percentage. |
| | | cached | The amount of memory used for caching disk data for reads, memory-mapped files, or tmpfs data, as a percentage. |
| | | buffered | The amount of memory used for buffering I/O operations, as a percentage. |
| | | slab_recl | The amount of reclaimable memory used for slab kernel allocations, as a percentage. |
| | | slab_unrecl | The amount of unreclaimable memory used for slab kernel allocations, as a percentage. |

Additional Values

  • collectd.plugin: Memory
  • collectd.plugin_instance: None
  • ovirt.entity: Host
  • ovirt.cluster.name.raw: The cluster's name
  • ovirt.engine_fqdn.raw: The {engine-name}'s FQDN
  • hostname: The host's FQDN
  • ipaddr4: IP address
  • interval: 10
  • collectd.dstypes: Gauge

NFS Metrics

NFS metrics enable you to analyze the use of NFS procedures.

The following table describes the NFS metrics reported by the NFS plugin.

| Metric Name | collectd.type_instance | Description |
|---|---|---|
| collectd.nfs.nfs_procedure | null / getattr / lookup / access / readlink / read / write / create / mkdir / symlink / mknod / rename / readdir / remove / link / fsstat / fsinfo / readdirplus / pathconf / rmdir / commit / compound / reserved / access / close / delegpurge / putfh / putpubfh / putrootfh / renew / restorefh / savefh / secinfo / setattr / setclientid / setcltid_confirm / verify / open / openattr / open_confirm / exchange_id / create_session / destroy_session / bind_conn_to_session / delegreturn / getattr / getfh / lock / lockt / locku / lookupp / open_downgrade / nverify / release_lockowner / backchannel_ctl / free_stateid / get_dir_delegation / getdeviceinfo / getdevicelist / layoutcommit / layoutget / layoutreturn / secinfo_no_name / sequence / set_ssv / test_stateid / want_delegation / destroy_clientid / reclaim_complete | The number of processes per collectd.type_instance state. |

Additional Values

  • collectd.plugin: NFS
  • collectd.plugin_instance: File system + server or client (for example: v3client)
  • collectd.type: nfs_procedure
  • ovirt.entity: host
  • ovirt.cluster.name.raw: The cluster's name
  • ovirt.engine_fqdn.raw: The {engine-name}'s FQDN
  • hostname: The host's FQDN
  • ipaddr4: IP address
  • interval: 10
  • collectd.dstypes: Derive

PostgreSQL Metrics

PostgreSQL data collected by executing SQL statements on a PostgreSQL database.

The following table describes the PostgreSQL metrics reported by the PostgreSQL plugin.

| Metric Name | collectd.type_instance | Description |
|---|---|---|
| collectd.postgresql.pg_numbackends | N/A | How many server processes this database is using. |
| collectd.postgresql.pg_n_tup_g | live | The number of live rows in the database. |
| | dead | The number of dead rows in the database. Rows that are deleted or obsoleted by an update are not physically removed from their table; they remain present as dead rows until a VACUUM is performed. |
| collectd.postgresql.pg_n_tup_c | del | The number of delete operations. |
| | upd | The number of update operations. |
| | hot_upd | The number of update operations that have been performed without requiring an index update. |
| | ins | The number of insert operations. |
| collectd.postgresql.pg_xact | num_deadlocks | The number of deadlocks that have been detected by the database. Deadlocks are caused by two or more competing actions that are unable to finish because each is waiting for the other's resources to be unlocked. |
| collectd.postgresql.pg_db_size | N/A | The size of the database on disk, in bytes. |
| collectd.postgresql.pg_blks | heap_read | How many disk blocks have been read. |
| | heap_hit | How many read operations were served from the buffer in memory, so that a disk read was not necessary. This only includes hits in the PostgreSQL buffer cache, not the operating system's file system cache. |
| | idx_read | How many disk blocks have been read by index access operations. |
| | idx_hit | How many index access operations have been served from the buffer in memory. |
| | toast_read | How many disk blocks have been read on TOAST tables. |
| | toast_hit | How many TOAST table reads have been served from the buffer in memory. |
| | tidx_read | How many disk blocks have been read by index access operations on TOAST tables. |

Additional Values

  • collectd.plugin: Postgresql
  • collectd.plugin_instance: Database's Name
  • ovirt.entity: engine
  • ovirt.cluster.name.raw: The cluster's name
  • ovirt.engine_fqdn.raw: The {engine-name}'s FQDN
  • hostname: The host's FQDN
  • ipaddr4: IP address
  • interval: 10
  • collectd.dstypes: Gauge

Process Metrics

The following table describes the process metrics reported by the Processes plugin.

| Metric Name | collectd.type | Description | collectd.dstypes |
|---|---|---|---|
| collectd.processes.ps_state | ps_state | The number of processes in each state. | Gauge |
| collectd.processes.ps_disk_ops.read | ps_disk_ops | The process's I/O read operations. | Derive |
| collectd.processes.ps_disk_ops.write | ps_disk_ops | The process's I/O write operations. | Derive |
| collectd.processes.ps_vm | ps_vm | The total amount of memory, including swap. | Gauge |
| collectd.processes.ps_rss | ps_rss | The amount of physical memory assigned to the process. | Gauge |
| collectd.processes.ps_data | ps_data | | Gauge |
| collectd.processes.ps_code | ps_code | | Gauge |
| collectd.processes.ps_stacksize | ps_stacksize | | Gauge |
| collectd.processes.ps_cputime.syst | ps_cputime | The amount of time spent by the matching processes in kernel mode. The values are scaled to microseconds per second to match collectd's numbers. | Derive |
| collectd.processes.ps_cputime.user | ps_cputime | The amount of time spent by the matching processes in user mode. The values are scaled to microseconds per second. | Derive |
| collectd.processes.ps_count.processes | ps_count | The number of processes for the defined process. | Gauge |
| collectd.processes.ps_count.threads | ps_count | The number of threads for the defined process. | Gauge |
| collectd.processes.ps_pagefaults.majflt | ps_pagefaults | The number of major page faults caused by the process. | Derive |
| collectd.processes.ps_pagefaults.minflt | ps_pagefaults | The number of minor page faults caused by the process. | Derive |
| collectd.processes.ps_disk_octets.write | ps_disk_octets | The process's I/O write operations, in transferred bytes. | Derive |
| collectd.processes.ps_disk_octets.read | ps_disk_octets | The process's I/O read operations, in transferred bytes. | Derive |
| collectd.processes.fork_rate | fork_rate | The system's fork rate. | Derive |

Additional Values

  • collectd.plugin: Processes
  • collectd.type_instance: N/A (except for collectd.processes.ps_state: running / zombies / stopped / paging / blocked / sleeping)
  • ovirt.entity: host
  • ovirt.cluster.name.raw: The cluster's name
  • ovirt.engine_fqdn.raw: The {engine-name}'s FQDN
  • hostname: The host's FQDN
  • ipaddr4: IP address
  • interval: 10

StatsD Metrics

The following table describes the StatsD metrics reported by the StatsD plugin.

| Metric Name | collectd.type | collectd.type_instance | Description |
|---|---|---|---|
| collectd.statsd.host_storage | host_storage | storage uuid | The latency for writing to the storage domain. |
| collectd.statsd.vm_balloon_cur | vm_balloon_cur | N/A | The current amount of memory available to the guest virtual machine (in KB). |
| collectd.statsd.vm_balloon_max | vm_balloon_max | N/A | The maximum amount of memory available to the guest virtual machine (in KB). |
| collectd.statsd.vm_balloon_min | vm_balloon_min | N/A | The minimum amount of memory guaranteed to the guest virtual machine (in KB). |
| collectd.statsd.vm_balloon_target | vm_balloon_target | N/A | The amount of memory requested (in KB). |
| collectd.statsd.vm_cpu_sys | vm_cpu_sys | N/A | The ratio of non-guest virtual machine CPU time to total CPU time spent by QEMU. |
| collectd.statsd.vm_cpu_usage | vm_cpu_usage | N/A | The total CPU usage since the virtual machine started (in ns). |
| collectd.statsd.vm_cpu_user | vm_cpu_user | N/A | The ratio of guest virtual machine CPU time to total CPU time spent by QEMU. |
| collectd.statsd.vm_disk_apparent_size | vm_disk_apparent_size | disk name | The size of the disk (in bytes). |
| collectd.statsd.vm_disk_flush_latency | vm_disk_flush_latency | disk name | The virtual disk's flush latency (in seconds). |
| collectd.statsd.vm_disk_read_bytes | vm_disk_read_bytes | disk name | The read rate from disk (in bytes per second). |
| collectd.statsd.vm_disk_read_latency | vm_disk_read_latency | disk name | The virtual disk's read latency (in seconds). |
| collectd.statsd.vm_disk_read_ops | vm_disk_read_ops | disk name | The number of read operations since the virtual machine was started. |
| collectd.statsd.vm_disk_read_rate | vm_disk_read_rate | disk name | The virtual machine's read activity rate (in bytes per second). |
| collectd.statsd.vm_disk_true_size | vm_disk_true_size | disk name | The amount of underlying storage allocated (in bytes). |
| collectd.statsd.vm_disk_write_latency | vm_disk_write_latency | disk name | The virtual disk's write latency (in seconds). |
| collectd.statsd.vm_disk_write_ops | vm_disk_write_ops | disk name | The number of write operations since the virtual machine was started. |
| collectd.statsd.vm_disk_write_rate | vm_disk_write_rate | disk name | The virtual machine's write activity rate (in bytes per second). |
| collectd.statsd.vm_nic_rx_bytes | vm_nic_rx_bytes | network name | The total number of incoming bytes. |
| collectd.statsd.vm_nic_rx_dropped | vm_nic_rx_dropped | network name | The number of incoming packets that were dropped. |
| collectd.statsd.vm_nic_rx_errors | vm_nic_rx_errors | network name | The number of incoming packets that contained errors. |
| collectd.statsd.vm_nic_speed | vm_nic_speed | network name | The interface speed (in Mbps). |
| collectd.statsd.vm_nic_tx_bytes | vm_nic_tx_bytes | network name | The total number of outgoing bytes. |
| collectd.statsd.vm_nic_tx_dropped | vm_nic_tx_dropped | network name | The number of outgoing packets that were dropped. |
| collectd.statsd.vm_nic_tx_errors | vm_nic_tx_errors | network name | The number of outgoing packets that contained errors. |

Additional Values

  • collectd.plugin: StatsD
  • collectd.plugin_instance: The virtual machine's name (except for collectd.statsd.host_storage=N/A)
  • ovirt.entity: vm (except for collectd.statsd.host_storage=host)
  • ovirt.cluster.name.raw: The cluster's name
  • ovirt.engine_fqdn.raw: The {engine-name}'s FQDN
  • hostname: The host's FQDN
  • ipaddr4: IP address
  • interval: 10
  • collectd.dstypes: Gauge

Swap Metrics

Swap metrics enable you to view the amount of memory currently written onto the hard disk, in bytes, according to available, used, and cached swap space.

The following table describes the Swap metrics reported by the Swap plugin.

| Metric Name | collectd.type | collectd.type_instance | collectd.dstypes | Description |
|---|---|---|---|---|
| collectd.swap.swap | swap | used / free / cached | Gauge | The used, available, and cached swap space (in bytes). |
| collectd.swap.swap_io | swap_io | in / out | Derive | The number of swap pages written and read per second. |
| collectd.swap.percent | percent | used / free / cached | Gauge | The percentage of used, available, and cached swap space. |

Additional Fields

  • collectd.plugin: Swap
  • collectd.plugin_instance: None
  • ovirt.entity: host or {engine-name}
  • ovirt.cluster.name.raw: The cluster's name
  • ovirt.engine_fqdn.raw: The {engine-name}'s FQDN
  • hostname: The host's FQDN
  • ipaddr4: IP address
  • interval: 10

Virtual Machine Metrics

The following table describes the virtual machine metrics reported by the Virt plugin.

| Metric Name | collectd.type | collectd.type_instance | collectd.dstypes |
|---|---|---|---|
| collectd.virt.ps_cputime.syst | ps_cputime.syst | N/A | Derive |
| collectd.virt.percent | percent | virt_cpu_total | Gauge |
| collectd.virt.ps_cputime.user | ps_cputime.user | N/A | Derive |
| collectd.virt.virt_cpu_total | virt_cpu_total | CPU number | Derive |
| collectd.virt.virt_vcpu | virt_vcpu | CPU number | Derive |
| collectd.virt.disk_octets.read | disk_octets.read | disk name | Gauge |
| collectd.virt.disk_ops.read | disk_ops.read | disk name | Gauge |
| collectd.virt.disk_octets.write | disk_octets.write | disk name | Gauge |
| collectd.virt.disk_ops.write | disk_ops.write | disk name | Gauge |
| collectd.virt.if_octets.rx | if_octets.rx | network name | Derive |
| collectd.virt.if_dropped.rx | if_dropped.rx | network name | Derive |
| collectd.virt.if_errors.rx | if_errors.rx | network name | Derive |
| collectd.virt.if_octets.tx | if_octets.tx | network name | Derive |
| collectd.virt.if_dropped.tx | if_dropped.tx | network name | Derive |
| collectd.virt.if_errors.tx | if_errors.tx | network name | Derive |
| collectd.virt.if_packets.rx | if_packets.rx | network name | Derive |
| collectd.virt.if_packets.tx | if_packets.tx | network name | Derive |
| collectd.virt.memory | memory | rss / total / actual_balloon / available / unused / usable / last_update / major_fault / minor_fault / swap_in / swap_out | Gauge |
| collectd.virt.total_requests | total_requests | flush-DISK | Derive |
| collectd.virt.total_time_in_ms | total_time_in_ms | flush-DISK | Derive |

Additional Values

  • collectd.plugin: virt
  • collectd.plugin_instance: The virtual machine's name
  • ovirt.entity: vm
  • ovirt.cluster.name.raw: The cluster's name
  • ovirt.engine_fqdn.raw: The {engine-name}'s FQDN
  • hostname: The host's FQDN
  • ipaddr4: IP address
  • interval: 10

Gauge and Derive Data Source Types

Each metric includes a collectd.dstypes value that defines the data source's type:

  • Gauge: A gauge value is simply stored as-is and is used for values that may increase or decrease, such as the amount of memory used.

  • Derive: These data sources assume that the change of the value is interesting, i.e., the derivative. Such data sources are very common for events that can be counted, for example the number of disk read operations. The total number of disk read operations is not interesting, but rather the change since the value was last read. The value is therefore converted to a rate using the following formula:

      rate = (value(new) - value(old)) / (time(new) - time(old))

NOTE:
If value(new) is less than value(old), the resulting rate will be negative. If the minimum value is set to zero, such data points are discarded.
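A minimal sketch of the Derive conversion, including the minimum-value discard described in the note (derive_rate is an illustrative name, not a collectd function):

```python
def derive_rate(new_value, old_value, new_time, old_time, min_value=0):
    """Convert two readings of a Derive counter into a rate.
    Returns None (the data point is discarded) when the computed rate
    falls below min_value, e.g. after a counter reset."""
    rate = (new_value - old_value) / (new_time - old_time)
    if rate < min_value:
        return None
    return rate

# 100 -> 150 disk reads over 10 seconds is 5 reads/sec; a counter
# reset (150 -> 10) would yield a negative rate and be discarded.
```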

