Consul

Plugin: go.d.plugin Module: consul

Overview

This collector monitors key metrics of Consul Agents: transaction timings, leadership changes, memory usage and more.

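It periodically sends HTTP requests to the Consul REST API.

It uses the endpoints that report autopilot health status, checks, agent configuration, agent metrics, and LAN coordinates (listed with their required ACLs under Prerequisites).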

This collector is supported on all platforms.

This collector supports collecting metrics from multiple instances of this integration, including remote instances.

Default Behavior

Auto-Detection

This collector discovers instances running on the local host that provide metrics on port 8500.

On startup, it tries to collect metrics from:

  • http://localhost:8500
  • http://127.0.0.1:8500

Limits

The default configuration for this integration does not impose any limits on data collection.

Performance Impact

The default configuration for this integration is not expected to impose a significant performance impact on the system.

Setup

Prerequisites

Enable Prometheus telemetry

Enable telemetry on your Consul agent by setting prometheus_retention_time in the agent's telemetry configuration to a value greater than 0. The default of 0 leaves Prometheus metrics disabled.

Add required ACLs to Token

Required only if authentication is enabled.

| ACL | Endpoint |
|---|---|
| operator:read | autopilot health status |
| node:read | checks |
| agent:read | configuration, metrics, and lan coordinates |

Configuration

File

The configuration file name for this integration is go.d/consul.conf.

You can edit the configuration file using the edit-config script from the Netdata config directory.

cd /etc/netdata 2>/dev/null || cd /opt/netdata/etc/netdata
sudo ./edit-config go.d/consul.conf

Options

The following options can be defined globally: update_every, autodetection_retry.

| Name | Description | Default | Required |
|---|---|---|---|
| update_every | Data collection frequency. | 1 | False |
| autodetection_retry | Recheck interval in seconds. Zero means no recheck will be scheduled. | 0 | False |
| url | Server URL. | http://localhost:8500 | True |
| acl_token | ACL token used in every request. | | False |
| max_checks | Checks processing/charting limit. | | False |
| max_filter | Checks processing/charting filter. Uses simple patterns. | | False |
| username | Username for basic HTTP authentication. | | False |
| password | Password for basic HTTP authentication. | | False |
| proxy_url | Proxy URL. | | False |
| proxy_username | Username for proxy basic HTTP authentication. | | False |
| proxy_password | Password for proxy basic HTTP authentication. | | False |
| timeout | HTTP request timeout. | 1 | False |
| method | HTTP request method. | GET | False |
| body | HTTP request body. | | False |
| headers | HTTP request headers. | | False |
| not_follow_redirects | Redirect handling policy. Controls whether the client follows redirects. | False | False |
| tls_skip_verify | Server certificate chain and hostname validation policy. Controls whether the client performs this check. | False | False |
| tls_ca | Certification authority that the client uses when verifying the server’s certificates. | | False |
| tls_cert | Client tls certificate. | | False |
| tls_key | Client tls key. | | False |
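
To illustrate how the global options relate to per-job options, here is a minimal sketch of a go.d/consul.conf. The specific values (5, 60, 2) are arbitrary assumptions chosen for illustration, not recommendations.

```yaml
# Global options: apply to every job unless a job overrides them.
update_every: 5          # collect every 5 seconds instead of the default 1
autodetection_retry: 60  # recheck a failed job every 60 seconds (0 = never)

jobs:
  - name: local
    url: http://127.0.0.1:8500
    timeout: 2           # HTTP request timeout in seconds (default 1)
```

A longer timeout may help when the Consul agent is slow to answer, at the cost of a later failure report.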

Examples

Basic

An example configuration.

jobs:
  - name: local
    url: http://127.0.0.1:8500
    acl_token: "ec15675e-2999-d789-832e-8c4794daa8d7"

Basic HTTP auth

Local server with basic HTTP authentication.

jobs:
  - name: local
    url: http://127.0.0.1:8500
    acl_token: "ec15675e-2999-d789-832e-8c4794daa8d7"
    username: foo
    password: bar
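
If the agent is served over HTTPS, the TLS options from the table above can be combined as in the following sketch. It assumes the agent listens for HTTPS on port 8501 and that the certificate files exist at the placeholder paths shown; both are assumptions, so adjust them to your deployment.

```yaml
jobs:
  - name: local_tls
    url: https://127.0.0.1:8501          # assumed HTTPS port
    acl_token: "ec15675e-2999-d789-832e-8c4794daa8d7"
    tls_ca: /path/to/ca.pem              # placeholder: CA used to verify the server
    tls_cert: /path/to/client.pem        # placeholder: client certificate
    tls_key: /path/to/client-key.pem     # placeholder: client key
    # tls_skip_verify: true              # disables verification; for testing only
```

Providing tls_ca keeps certificate verification enabled, which is generally preferable to setting tls_skip_verify.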

Multi-instance

Note: When you define multiple jobs, their names must be unique.

Collecting metrics from local and remote instances.

jobs:
  - name: local
    url: http://127.0.0.1:8500
    acl_token: "ec15675e-2999-d789-832e-8c4794daa8d7"

  - name: remote
    url: http://203.0.113.10:8500
    acl_token: "ada7f751-f654-8872-7f93-498e799158b6"
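
When a remote instance is reachable only through an HTTP proxy, the proxy options from the table above might be used as sketched here. The proxy address and credentials are hypothetical placeholders.

```yaml
jobs:
  - name: remote_via_proxy
    url: http://203.0.113.10:8500
    acl_token: "ada7f751-f654-8872-7f93-498e799158b6"
    proxy_url: http://proxy.example.com:3128  # hypothetical proxy address
    proxy_username: proxyuser                 # hypothetical; omit if the proxy needs no auth
    proxy_password: proxypass                 # hypothetical
```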

Metrics

Metrics grouped by scope.

The scope defines the instance that the metric belongs to. An instance is uniquely identified by a set of labels.

The set of metrics depends on the Consul Agent mode.

Per Consul instance

These metrics refer to the entire monitored application.

This scope has no labels.

Metrics:

| Metric | Dimensions | Unit | Leader | Follower | Client |
|---|---|---|---|---|---|
| consul.client_rpc_requests_rate | rpc | requests/s | | | |
| consul.client_rpc_requests_exceeded_rate | exceeded | requests/s | | | |
| consul.client_rpc_requests_failed_rate | failed | requests/s | | | |
| consul.memory_allocated | allocated | bytes | | | |
| consul.memory_sys | sys | bytes | | | |
| consul.gc_pause_time | gc_pause | seconds | | | |
| consul.kvs_apply_time | quantile_0.5, quantile_0.9, quantile_0.99 | ms | | | |
| consul.kvs_apply_operations_rate | kvs_apply | ops/s | | | |
| consul.txn_apply_time | quantile_0.5, quantile_0.9, quantile_0.99 | ms | | | |
| consul.txn_apply_operations_rate | txn_apply | ops/s | | | |
| consul.autopilot_health_status | healthy, unhealthy | status | | | |
| consul.autopilot_failure_tolerance | failure_tolerance | servers | | | |
| consul.autopilot_server_health_status | healthy, unhealthy | status | | | |
| consul.autopilot_server_stable_time | stable | seconds | | | |
| consul.autopilot_server_serf_status | active, failed, left, none | status | | | |
| consul.autopilot_server_voter_status | voter, not_voter | status | | | |
| consul.network_lan_rtt | min, max, avg | ms | | | |
| consul.raft_commit_time | quantile_0.5, quantile_0.9, quantile_0.99 | ms | | | |
| consul.raft_commits_rate | commits | commits/s | | | |
| consul.raft_leader_last_contact_time | quantile_0.5, quantile_0.9, quantile_0.99 | ms | | | |
| consul.raft_leader_oldest_log_age | oldest_log_age | seconds | | | |
| consul.raft_follower_last_contact_leader_time | leader_last_contact | ms | | | |
| consul.raft_rpc_install_snapshot_time | quantile_0.5, quantile_0.9, quantile_0.99 | ms | | | |
| consul.raft_leader_elections_rate | leader | elections/s | | | |
| consul.raft_leadership_transitions_rate | leadership | transitions/s | | | |
| consul.server_leadership_status | leader, not_leader | status | | | |
| consul.raft_thread_main_saturation_perc | quantile_0.5, quantile_0.9, quantile_0.99 | percentage | | | |
| consul.raft_thread_fsm_saturation_perc | quantile_0.5, quantile_0.9, quantile_0.99 | percentage | | | |
| consul.raft_fsm_last_restore_duration | last_restore_duration | ms | | | |
| consul.raft_boltdb_freelist_bytes | freelist | bytes | | | |
| consul.raft_boltdb_logs_per_batch_rate | written | logs/s | | | |
| consul.raft_boltdb_store_logs_time | quantile_0.5, quantile_0.9, quantile_0.99 | ms | | | |
| consul.license_expiration_time | license_expiration | seconds | | | |

Per node check

Metrics about checks at the node level.

Labels:

| Label | Description |
|---|---|
| datacenter | Datacenter Identifier |
| node_name | The node’s name |
| check_name | The check’s name |

Metrics:

| Metric | Dimensions | Unit | Leader | Follower | Client |
|---|---|---|---|---|---|
| consul.node_health_check_status | passing, maintenance, warning, critical | status | | | |

Per service check

Metrics about checks at the service level.

Labels:

| Label | Description |
|---|---|
| datacenter | Datacenter Identifier |
| node_name | The node’s name |
| check_name | The check’s name |
| service_name | The service’s name |

Metrics:

| Metric | Dimensions | Unit | Leader | Follower | Client |
|---|---|---|---|---|---|
| consul.service_health_check_status | passing, maintenance, warning, critical | status | | | |

Alerts

The following alerts are available:

| Alert name | On metric | Description |
|---|---|---|
| consul_node_health_check_status | consul.node_health_check_status | node health check ${label:check_name} has failed on server ${label:node_name} datacenter ${label:datacenter} |
| consul_service_health_check_status | consul.service_health_check_status | service health check ${label:check_name} for service ${label:service_name} has failed on server ${label:node_name} datacenter ${label:datacenter} |
| consul_client_rpc_requests_exceeded | consul.client_rpc_requests_exceeded_rate | number of rate-limited RPC requests made by server ${label:node_name} datacenter ${label:datacenter} |
| consul_client_rpc_requests_failed | consul.client_rpc_requests_failed_rate | number of failed RPC requests made by server ${label:node_name} datacenter ${label:datacenter} |
| consul_gc_pause_time | consul.gc_pause_time | time spent in stop-the-world garbage collection pauses on server ${label:node_name} datacenter ${label:datacenter} |
| consul_autopilot_health_status | consul.autopilot_health_status | datacenter ${label:datacenter} cluster is unhealthy as reported by server ${label:node_name} |
| consul_autopilot_server_health_status | consul.autopilot_server_health_status | server ${label:node_name} from datacenter ${label:datacenter} is unhealthy |
| consul_raft_leader_last_contact_time | consul.raft_leader_last_contact_time | median time elapsed since leader server ${label:node_name} datacenter ${label:datacenter} was last able to contact the follower nodes |
| consul_raft_leadership_transitions | consul.raft_leadership_transitions_rate | there has been a leadership change and server ${label:node_name} datacenter ${label:datacenter} has become the leader |
| consul_raft_thread_main_saturation | consul.raft_thread_main_saturation_perc | average saturation of the main Raft goroutine on server ${label:node_name} datacenter ${label:datacenter} |
| consul_raft_thread_fsm_saturation | consul.raft_thread_fsm_saturation_perc | average saturation of the FSM Raft goroutine on server ${label:node_name} datacenter ${label:datacenter} |
| consul_license_expiration_time | consul.license_expiration_time | Consul Enterprise licence expiration time on node ${label:node_name} datacenter ${label:datacenter} |

Troubleshooting

Debug Mode

To troubleshoot issues with the consul collector, run the go.d.plugin with the debug option enabled. The output should give you clues as to why the collector isn’t working.

  • Navigate to the plugins.d directory, usually at /usr/libexec/netdata/plugins.d/. If that’s not the case on your system, open netdata.conf and look for the plugins setting under [directories].

    cd /usr/libexec/netdata/plugins.d/
    
  • Switch to the netdata user.

    sudo -u netdata -s
    
  • Run the go.d.plugin to debug the collector:

    ./go.d.plugin -d -m consul
    
