Elasticsearch monitoring with Netdata

What is Elasticsearch?

Elasticsearch is an open-source distributed search and analytics engine built on Apache Lucene. It provides a flexible, powerful, and scalable search solution that lets users quickly search and analyze large datasets and derive insights from their data.

Monitoring Elasticsearch with Netdata

To monitor Elasticsearch with Netdata, the only prerequisites are that both Elasticsearch and Netdata are installed on your system.

Netdata auto-discovers hundreds of services, and for those it doesn't, enabling manual discovery is a one-line configuration change. For more information on configuring Netdata for Elasticsearch monitoring, please read the collector documentation.
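As a sketch of what manual discovery looks like, a minimal job for the go.d Elasticsearch collector might resemble the following (the file path, URL, and available options depend on your Netdata version and deployment, so treat this as an assumption and check the collector documentation):

```yaml
# /etc/netdata/go.d/elasticsearch.conf
# Minimal manual-discovery job; adjust the URL (and add credentials/TLS
# options if needed) for your own Elasticsearch deployment.
jobs:
  - name: local
    url: http://127.0.0.1:9200
```

After editing the file, restart the Netdata agent so the collector picks up the new job.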

You should now see the Elasticsearch section on the Overview tab in Netdata Cloud, already populated with charts for all the metrics you care about.

Netdata has a public demo space (no login required) where you can explore different monitoring use-cases and get a feel for Netdata.

What Elasticsearch metrics are important to monitor - and why?

Node Indices Indexing

The rate at which documents are indexed into Elasticsearch. This metric can help identify potential issues related to indexing throughput and indexing time. If the number of indexing operations per second is consistently low, it could indicate an issue with the data being indexed or the indexing process. Monitoring the node_indices_indexing_time metric can help identify any sudden or prolonged increases in indexing time.

Node Indices Search

The rate at which search requests are executed against Elasticsearch. This metric helps identify potential issues related to search throughput and search time. If the number of search operations per second is consistently low, it could indicate an issue with the data being searched or the search process. Monitoring the node_indices_search_time metric can help identify any sudden or prolonged increases in search time.
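Netdata computes these rates for you, but it can help to see where the numbers come from. The sketch below (an illustration, not Netdata's implementation) derives an operations-per-second rate from two samples of Elasticsearch's `/_nodes/stats` API, whose counters such as `indexing.index_total` and `search.query_total` are cumulative:

```python
import json
from urllib.request import urlopen

def node_stats(base="http://127.0.0.1:9200"):
    # Fetch indexing/search counters from a local, unauthenticated node
    # (an assumption; add auth and TLS handling for real clusters).
    with urlopen(base + "/_nodes/stats/indices") as resp:
        return json.load(resp)

def ops_rate(prev, curr, interval_s, section, counter):
    """Operations per second for a cumulative counter, e.g.
    section='indexing', counter='index_total' for indexing rate, or
    section='search', counter='query_total' for search rate, summed
    over every node in two samples taken interval_s seconds apart."""
    def total(sample):
        return sum(node["indices"][section][counter]
                   for node in sample["nodes"].values())
    return (total(curr) - total(prev)) / interval_s
```

For example, taking two samples ten seconds apart and calling `ops_rate(prev, curr, 10, "indexing", "index_total")` yields documents indexed per second.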

Node Indices Refresh

The rate at which refresh operations occur in Elasticsearch. A refresh makes recently indexed documents visible to search. This metric helps identify potential issues related to refresh throughput and refresh time. If the number of refresh operations per second is consistently low, it could indicate an issue with the data being refreshed or the refresh process. Monitoring the node_indices_refresh_time metric can help identify any sudden or prolonged increases in refresh time.

Node Indices Flush

The rate at which flush operations occur in Elasticsearch. A flush commits in-memory index data to disk and clears the transaction log. This metric helps identify potential issues related to flush throughput and flush time. If flush operations are unusually frequent or consistently slow, it could indicate a problem with the indexing process or heavy write load. Monitoring the node_indices_flush_time metric can help identify any sudden or prolonged increases in flush time.

Node Indices Fielddata Memory Usage

The amount of memory used by the fielddata cache in Elasticsearch. This metric helps identify potential issues related to memory usage. If the amount of memory used is consistently high, it could indicate an issue with the data being stored in the fielddata cache or the size of the fielddata cache.

Node Indices Fielddata Evictions

The rate at which entries are evicted from the fielddata cache in Elasticsearch. This metric helps identify potential pressure on the fielddata cache. If the number of evictions per second is consistently high, it could indicate an issue with the data being stored in the fielddata cache or the size of the fielddata cache.
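Both fielddata metrics come from the same stats endpoint. As a hedged sketch (field names are from `/_nodes/stats/indices/fielddata` as of recent Elasticsearch versions and may differ in yours), here is how the cache memory and cumulative eviction counters can be extracted per node:

```python
import json
from urllib.request import urlopen

def parse_fielddata(stats):
    """Extract per-node fielddata cache memory and cumulative evictions
    from a parsed /_nodes/stats/indices/fielddata response body."""
    out = {}
    for node_id, node in stats["nodes"].items():
        fd = node["indices"]["fielddata"]
        out[node.get("name", node_id)] = {
            "memory_bytes": fd["memory_size_in_bytes"],
            "evictions": fd["evictions"],
        }
    return out

def fetch_fielddata(base="http://127.0.0.1:9200"):
    # Assumes a local, unauthenticated node; add auth for real clusters.
    with urlopen(base + "/_nodes/stats/indices/fielddata") as resp:
        return parse_fielddata(json.load(resp))
```

Because `evictions` is cumulative, the evictions-per-second rate is the difference between two samples divided by the sampling interval, as in the earlier rate example.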

Node Indices Segments Count

The total number of segments in Elasticsearch. This metric helps identify potential issues related to the number of segments in the index. If the number of segments is consistently high, it could indicate an issue with the indexing process or the size of the index.

Node Indices Segments Memory Usage Total

The total amount of memory used by all segments in Elasticsearch. This metric helps identify potential issues related to memory usage. If the amount of memory used is consistently high, it could indicate an issue with the data being stored in the segment or the size of the segment.

Node Indices Segments Memory Usage

The amount of memory used by specific segments in Elasticsearch. This metric helps identify potential issues related to memory usage. If the amount of memory used by the terms, stored_fields, term_vectors, norms, points, doc_values, index_writer, version_map, and fixed_bit_set segments is consistently high, it could indicate an issue with the data being stored in those segments or the size of those segments.

Node Indices Translog Operations

The total number of operations in the translog in Elasticsearch. This metric helps identify potential issues related to the size of the translog. If the number of operations is consistently high, it could indicate an issue with the indexing process or the size of the index.

Node Indices Translog Size

The total size of the translog in Elasticsearch. This metric helps identify potential issues related to the size of the translog. If the size of the translog is consistently high, it could indicate an issue with the indexing process or the size of the index.
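The segment and translog figures above all appear under the `indices` section of `/_nodes/stats`. The following sketch pulls them out per node; note that the exact field set varies across Elasticsearch versions (for instance, `segments.memory_in_bytes` is no longer meaningful in newer 8.x releases), so treat the field names as assumptions to verify against your cluster:

```python
def parse_segments_translog(stats):
    """Per-node segment count/memory and translog size, from a parsed
    /_nodes/stats/indices response body. Field names assume a
    pre-8.x-style stats layout and may differ in your version."""
    out = {}
    for node_id, node in stats["nodes"].items():
        idx = node["indices"]
        out[node.get("name", node_id)] = {
            "segments": idx["segments"]["count"],
            "segments_memory_bytes": idx["segments"]["memory_in_bytes"],
            "translog_operations": idx["translog"]["operations"],
            "translog_bytes": idx["translog"]["size_in_bytes"],
        }
    return out
```

A steadily growing segment count with flat memory, or a translog that never shrinks, is exactly the kind of pattern worth alerting on.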

Node File Descriptors

Node File Descriptors (node_file_descriptors) measures the number of open file descriptors on a node. This metric is important because it helps detect when a node is approaching its file descriptor limit, which can cause operations to fail and lead to system instability and performance issues. Monitoring it helps identify potential issues before they become a problem.
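What usually matters is not the raw count but the headroom against the per-process limit. Both values are reported in the `process` section of `/_nodes/stats`; a minimal sketch of turning them into a percentage:

```python
def fd_headroom(stats):
    """Open vs. maximum file descriptors per node, from a parsed
    /_nodes/stats/process response body."""
    out = {}
    for node_id, node in stats["nodes"].items():
        proc = node["process"]
        open_fds = proc["open_file_descriptors"]
        max_fds = proc["max_file_descriptors"]
        out[node.get("name", node_id)] = {
            "open": open_fds,
            "max": max_fds,
            # Percentage of the descriptor limit currently in use.
            "used_pct": 100.0 * open_fds / max_fds,
        }
    return out
```

Alerting when `used_pct` crosses, say, 85% gives you time to raise the limit or investigate leaks before operations start failing.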

Node JVM Heap

Node JVM Heap (node_jvm_heap) is the percentage of the total Java Virtual Machine (JVM) heap that is currently in use. This metric is important to monitor as it helps to identify potential memory issues and helps to determine if the JVM is running efficiently. If the JVM memory is not being used efficiently, it can lead to a decrease in performance or even crashes.

Node JVM Heap Bytes

Node JVM Heap Bytes (node_jvm_heap_bytes) is the amount of memory used and committed by the JVM. This metric is important to monitor as it helps to identify how efficiently the JVM is using its resources. If the JVM is not using its memory efficiently, it can lead to a decrease in performance or even crashes.

Node JVM Buffer Pools Count

Node JVM Buffer Pools Count (node_jvm_buffer_pools_count) measures the number of direct and mapped buffer pools that are currently in use. This metric is important to monitor as it helps to identify any potential memory leaks and helps to determine if the JVM is running efficiently. If the JVM memory is not being used efficiently, it can lead to a decrease in performance or even crashes.

Node JVM Buffer Pool Direct Memory

Node JVM Buffer Pool Direct Memory (node_jvm_buffer_pool_direct_memory) is the amount of memory used and committed by the direct buffer pools. This metric is important to monitor as it helps to identify any potential memory leaks and helps to determine if the JVM is running efficiently. If the JVM memory is not being used efficiently, it can lead to a decrease in performance or even crashes.

Node JVM Buffer Pool Mapped Memory

Node JVM Buffer Pool Mapped Memory (node_jvm_buffer_pool_mapped_memory) is the amount of memory used and committed by the mapped buffer pools. This metric is important to monitor as it helps to identify any potential memory leaks and helps to determine if the JVM is running efficiently. If the JVM memory is not being used efficiently, it can lead to a decrease in performance or even crashes.

Node JVM GC Count

Node JVM GC Count (node_jvm_gc_count) measures the number of young and old garbage collections that occur per second. This metric is important to monitor as it helps to identify any potential memory leaks and helps to determine if the JVM is running efficiently. If the JVM is not performing garbage collections efficiently, it can lead to a decrease in performance or even crashes.

Node JVM GC Time

Node JVM GC Time (node_jvm_gc_time) measures the amount of time it takes for the JVM to perform a garbage collection. This metric is important to monitor as it helps to identify any potential memory leaks and helps to determine if the JVM is running efficiently. If the JVM is not performing garbage collections efficiently, it can lead to a decrease in performance or even crashes.
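All of the JVM metrics above come from the `jvm` section of `/_nodes/stats`. The GC counters are cumulative, so rates are again computed as differences between samples. A sketch of extracting the headline values (field names from recent Elasticsearch versions; verify against yours):

```python
def jvm_summary(stats):
    """Heap usage and cumulative GC counters per node, from a parsed
    /_nodes/stats/jvm response body."""
    out = {}
    for node_id, node in stats["nodes"].items():
        jvm = node["jvm"]
        gc = jvm["gc"]["collectors"]
        out[node.get("name", node_id)] = {
            "heap_used_pct": jvm["mem"]["heap_used_percent"],
            # Cumulative counts/times; diff two samples for a rate.
            "young_gc_count": gc["young"]["collection_count"],
            "young_gc_ms": gc["young"]["collection_time_in_millis"],
            "old_gc_count": gc["old"]["collection_count"],
            "old_gc_ms": gc["old"]["collection_time_in_millis"],
        }
    return out
```

A heap that hovers above ~85% with frequent old-generation collections is the classic precursor to long GC pauses and eventual out-of-memory errors.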

Node Thread Pool Queued

Node Thread Pool Queued (node_thread_pool_queued) measures the number of threads queued for execution in each thread pool. This metric is important to monitor as it helps to identify any potential resource contention issues and helps to determine if the JVM is running efficiently. If the JVM is not using its resources efficiently, it can lead to a decrease in performance or even crashes.

Node Thread Pool Rejected

Node Thread Pool Rejected (node_thread_pool_rejected) measures the number of threads that have been rejected by each thread pool. This metric is important to monitor as it helps to identify any potential resource contention issues and helps to determine if the JVM is running efficiently. If the JVM is not using its resources efficiently, it can lead to a decrease in performance or even crashes.
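Queued and rejected counts are reported per thread pool (write, search, and so on) in the `thread_pool` section of `/_nodes/stats`; `rejected` is cumulative. A sketch of collecting both for every pool on every node:

```python
def thread_pool_pressure(stats):
    """Queued and cumulative rejected task counts for every thread pool
    on every node, from a parsed /_nodes/stats/thread_pool response body."""
    out = {}
    for node_id, node in stats["nodes"].items():
        out[node.get("name", node_id)] = {
            pool: {"queued": s["queue"], "rejected": s["rejected"]}
            for pool, s in node["thread_pool"].items()
        }
    return out
```

Any nonzero growth in `rejected` for the write or search pools means clients are receiving errors and is worth an immediate alert.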

Cluster Communication Packets

Cluster communication packets are the packets of data exchanged between nodes in an Elasticsearch cluster. Monitoring these packets can help identify network issues related to Elasticsearch operations and help identify bottlenecks. It is important to monitor both received and sent cluster communication packets as each will provide different insights.

Cluster Communication

Cluster communication is the amount of data exchanged between nodes in an Elasticsearch cluster. Monitoring this metric can help identify network issues related to Elasticsearch operations and help identify bottlenecks. It is important to monitor both received and sent cluster communication as each will provide different insights.

HTTP Connections

HTTP connections are the number of open HTTP connections to the Elasticsearch cluster. When this number is high, it can indicate that the cluster is overburdened and needs to be scaled out. Monitoring this metric on a regular basis can help identify increasing resource utilization and enable proactive scaling out of the cluster.

Breakers Trips

Breaker trips are the number of times per second that individual circuit breakers trip. Circuit breakers protect the cluster from memory-related problems and are configured to trip when certain thresholds are reached. Monitoring these trips can help identify memory-related issues before the cluster becomes unusable.
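Each breaker (parent, fielddata, request, and others) exposes a cumulative `tripped` counter in the `breakers` section of `/_nodes/stats`. A sketch of reading them per node:

```python
def breaker_trips(stats):
    """Cumulative trip counts per circuit breaker per node, from a
    parsed /_nodes/stats/breaker response body."""
    out = {}
    for node_id, node in stats["nodes"].items():
        out[node.get("name", node_id)] = {
            name: b["tripped"] for name, b in node["breakers"].items()
        }
    return out
```

Since the counters are cumulative, a per-second trip rate (as charted by Netdata) is the difference between two samples divided by the sampling interval; any increase at all usually warrants investigation.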

Cluster Health Status

Cluster health status is the overall health of the Elasticsearch cluster. It is represented by a status code that can be either green, yellow, or red. Monitoring this metric can help identify potential issues in the cluster, such as unbalanced shards or replica shards, before they become more serious.

Cluster Number of Nodes

Cluster number of nodes is the total number of nodes in the Elasticsearch cluster, including both data nodes and coordinating nodes. Monitoring this metric can help identify when the cluster needs to be scaled out or in, depending on the cluster size.

Cluster Shards Count

Cluster shards count is the total number of primary and replica shards in the cluster. Monitoring this metric can help identify unassigned shards and enable proactive rebalancing of the cluster.

Cluster Pending Tasks

Cluster pending tasks is the number of tasks that are currently pending and waiting to be processed. Monitoring this metric can help identify potential bottlenecks in the cluster and enable proactive scaling out of the cluster.

Cluster Number of In-Flight Fetch

Cluster number of in-flight fetch is the number of in-flight fetch requests that are currently being processed by the cluster. Monitoring this metric can help identify potential bottlenecks in the cluster and enable proactive scaling out of the cluster.
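Most of the cluster-level metrics above come from a single endpoint, `/_cluster/health`. A minimal sketch of pulling the headline fields (assuming a local, unauthenticated node):

```python
import json
from urllib.request import urlopen

def health_summary(health):
    """Pick the headline fields out of a parsed /_cluster/health
    response body."""
    return {
        "status": health["status"],  # green / yellow / red
        "nodes": health["number_of_nodes"],
        "active_shards": health["active_shards"],
        "unassigned_shards": health["unassigned_shards"],
        "pending_tasks": health["number_of_pending_tasks"],
        "in_flight_fetch": health["number_of_in_flight_fetch"],
    }

def fetch_health(base="http://127.0.0.1:9200"):
    # Assumes a local, unauthenticated node; add auth for real clusters.
    with urlopen(base + "/_cluster/health") as resp:
        return health_summary(json.load(resp))
```

A sensible baseline alerting policy is: anything other than "green", any nonzero `unassigned_shards`, or a sustained backlog of `pending_tasks`.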

Cluster Indices Count

Cluster indices count is the number of indices in the cluster. Monitoring this metric can help identify when the cluster is being overburdened and needs to be scaled out.

Cluster Indices Shards Count

Cluster indices shards count is the total number of primary and replica shards per index in the cluster. Monitoring this metric can help identify unassigned shards and enable proactive rebalancing of the cluster.

Cluster Indices Docs Count

Cluster indices docs count is the number of documents stored in each index in the cluster. Monitoring this metric can help identify when the cluster is being overburdened and needs to be scaled out.

Cluster Indices Store Size

Cluster indices store size is the size of the data stored in each index in the cluster. Monitoring this metric can help identify when the cluster is being overburdened and needs to be scaled out.

Cluster Indices Query Cache

Cluster indices query cache is the number of cache hits and misses per index in the cluster. Monitoring this metric shows how effectively the query cache is being used; a consistently low hit ratio may indicate that the cache is undersized or that the workload's queries are not cacheable.
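The per-index metrics above (document counts, store size, query-cache activity) are all available from the `/_stats` API. As a sketch (field names from recent Elasticsearch versions; verify against yours), here is how they can be summarized per index, including a derived cache hit ratio:

```python
def index_overview(stats):
    """Docs, store size, and query-cache hit ratio per index, from a
    parsed /_stats response body."""
    out = {}
    for name, idx in stats["indices"].items():
        pri = idx["primaries"]
        qc = idx["total"]["query_cache"]
        lookups = qc["hit_count"] + qc["miss_count"]
        out[name] = {
            "docs": pri["docs"]["count"],
            "store_bytes": pri["store"]["size_in_bytes"],
            # None when the cache has not been consulted yet.
            "query_cache_hit_pct":
                100.0 * qc["hit_count"] / lookups if lookups else None,
        }
    return out
```

Tracking these per index, rather than only cluster-wide, makes it much easier to spot a single runaway index behind a cluster-level trend.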

Cluster Nodes by Role Count

Cluster nodes by role count is the number of nodes with specific roles in the cluster. Monitoring this metric can help identify when the cluster needs to be scaled out or in, depending on the roles assigned to the nodes.

Node Index Health

The current health of an Elasticsearch index. This metric can provide valuable insights into the overall performance of the index. The metric is represented as either a “green”, “yellow” or “red” status, indicating the current status of the index. A “green” status indicates that the index is healthy and functioning as expected. A “yellow” status indicates that the index is still functioning but may be experiencing some performance issues. Finally, a “red” status indicates that the index is not functioning as expected and needs to be addressed immediately.

Monitoring this metric can help prevent any potential downtime or performance issues with the index. It can also help identify potential issues before they become more serious. By monitoring the node_index_health metric, users can quickly identify and address any issues that may arise with the index.

Node Index Shards Count

The number of shards in an Elasticsearch index. Shards are used to store and manage data within an index. By monitoring this metric, users can ensure that the index is properly configured and the number of shards is adequate for the size of the data. If the number of shards is too low, then the index could suffer from poor performance or even timeouts. If the number of shards is too high, then it could cause the index to become over-provisioned, resulting in wasted resources.

Monitoring the number of shards in an index can help identify any potential issues with the index and can help prevent any issues from becoming more serious.

Node Index Docs Count

The number of documents stored in an Elasticsearch index. This metric is used to determine the size of the index and can provide valuable insights into the performance of the index. If the number of documents is too high, then the index could suffer from poor performance or even timeouts. If the number of documents is too low, then it could indicate that the index is not being used to its full potential.

Monitoring this metric can help identify any potential issues with the index and can help prevent any issues from becoming more serious.

Node Index Store Size

The size of the data stored in an Elasticsearch index. This metric is used to determine the amount of data stored in the index and can provide valuable insights into the performance of the index. If the store size is too large, then the index could suffer from poor performance or even timeouts. If the store size is too small, then it could indicate that the index is not being used to its full potential.

Monitoring this metric can help identify any potential issues with the index and can help prevent any issues from becoming more serious. It can also help identify any inefficient use of resources, allowing users to make more informed decisions about how to optimize their use of the index.
