HDFS monitoring with Netdata

What is HDFS?

Apache Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. It is designed to run on commodity hardware and to store data reliably even when that hardware fails. HDFS is highly fault-tolerant, provides high-throughput access to application data, and is suitable for applications with large data sets.

Monitoring HDFS with Netdata

To monitor HDFS with Netdata, you need both HDFS and Netdata installed on your system.

Netdata auto-discovers hundreds of services. For those it does not discover, turning on manual discovery is a one-line configuration change. For more information on configuring Netdata for HDFS monitoring, please read the collector documentation.

You should now see the HDFS section on the Overview tab in Netdata Cloud already populated with charts about all the metrics you care about.

Netdata has a public demo space (no login required) where you can explore different monitoring use-cases and get a feel for Netdata.

What HDFS metrics are important to monitor - and why?

Heap Memory

Heap Memory is the amount of memory allocated to the Java Virtual Machine (JVM) for running HDFS. Monitoring Heap Memory is important for ensuring that the JVM has enough memory to function properly. If the JVM does not have enough memory, it can cause out of memory errors or poor performance.
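
As a rough illustration of how heap pressure can be expressed, the sketch below turns a used/maximum pair into a percentage. The function name and sample values are hypothetical. The inputs are assumed to correspond to the MemHeapUsedM and MemHeapMaxM fields that Hadoop's JvmMetrics bean is commonly documented to expose; this article does not name them.

```python
def heap_used_percent(used_mb: float, max_mb: float) -> float:
    """Return heap in use as a percentage of the maximum heap size."""
    if max_mb <= 0:
        raise ValueError("max_mb must be positive")
    return 100.0 * used_mb / max_mb


# Hypothetical reading: 1536 MB used out of a 2048 MB maximum heap.
print(heap_used_percent(1536, 2048))  # 75.0
```

A value that stays close to 100 after garbage collection runs is usually a stronger warning sign than a brief spike.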

GC Count Total

GC Count Total, or Garbage Collection Count Total, is a metric that shows the total number of garbage collection events during the current time window. Garbage collection is the process of reclaiming memory that is no longer being used. Monitoring this metric is important for ensuring that the JVM is able to reclaim memory efficiently and is not running out of memory.

GC Time Total

GC Time Total, or Garbage Collection Time Total, is a metric that shows the total amount of time spent on garbage collection during the current time window. Monitoring this metric is important because time spent in garbage collection is time the JVM is not serving requests, so long or frequent collections degrade performance.
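
The two GC metrics are most useful when read together. The sketch below assumes the JVM reports cumulative counters (as Hadoop's GcCount and GcTimeMillis are commonly documented to do) and derives per-interval values from two consecutive samples. The type, function, and field names are hypothetical.

```python
from typing import NamedTuple


class GcSample(NamedTuple):
    count: int    # cumulative number of GC events (assumed counter)
    time_ms: int  # cumulative time spent in GC, in milliseconds (assumed counter)


def gc_delta(prev: GcSample, curr: GcSample, interval_s: float) -> dict:
    """Derive per-interval GC activity from two cumulative samples."""
    events = curr.count - prev.count
    spent_ms = curr.time_ms - prev.time_ms
    if events < 0 or spent_ms < 0:
        # Counters went backwards, which suggests a JVM restart:
        # treat the current values as the activity since the restart.
        events, spent_ms = curr.count, curr.time_ms
    return {
        "events": events,
        "gc_time_ms": spent_ms,
        "avg_pause_ms": spent_ms / events if events else 0.0,
        "gc_time_percent": 100.0 * spent_ms / (interval_s * 1000.0),
    }


# Hypothetical samples taken 10 seconds apart.
result = gc_delta(GcSample(100, 5000), GcSample(104, 5600), interval_s=10)
print(result)
```

With these sample values the interval contains 4 collections totalling 600 ms, which is an average of 150 ms per collection and 6% of wall-clock time.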

GC Threshold

GC Threshold is a metric that shows how many times garbage collection pauses exceeded the configured info and warn thresholds. This metric can be used to alert administrators to potential memory issues.

Threads

Threads is a metric that shows the number of active threads in the JVM. Monitoring this metric is important for ensuring that the JVM is able to process tasks efficiently and that there are no threading issues.

Logs Total

Logs Total is a metric that shows the number of log events generated during the current time window, broken down by severity (info, warn, error, and fatal). Monitoring this metric is important because a sudden rise in error or fatal events is often the first visible sign of a problem.

RPC Bandwidth

RPC Bandwidth is a metric that shows the amount of data transferred during Remote Procedure Calls (RPCs). Monitoring this metric is important for ensuring that the system is able to send and receive data efficiently.

RPC Calls

RPC Calls is a metric that shows the number of RPCs made during the current time window. Monitoring this metric is important for understanding the request load on the system and for spotting unusual spikes or drops in traffic.

Open Connections

Open Connections is a metric that shows the number of open connections to the HDFS cluster. Monitoring this metric is important for ensuring that the system is able to process requests efficiently and that the system is not overloaded.

Call Queue Length

Call Queue Length is a metric that shows the length of the call queue. Monitoring this metric is important for ensuring that the system is able to process requests efficiently and that the system is not overloaded.

Avg Queue Time

Avg Queue Time is a metric that shows the average time spent in the call queue. Monitoring this metric is important for ensuring that the system is able to process requests efficiently and that the system is not overloaded.

Avg Processing Time

Avg Processing Time is a metric that shows the average time spent processing requests. Monitoring this metric is important for ensuring that the system is able to process requests efficiently and that the system is not overloaded.
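
Queue time and processing time are easier to interpret side by side. The sketch below adds them into a total per-call latency and reports how much of it was spent waiting in the queue. The function name and sample values are hypothetical, and reading a high queue share as a sign of overload is a rule of thumb rather than a Netdata or Hadoop definition.

```python
from typing import Tuple


def rpc_latency_breakdown(avg_queue_ms: float, avg_processing_ms: float) -> Tuple[float, float]:
    """Return (total latency in ms, fraction of that latency spent queued)."""
    total_ms = avg_queue_ms + avg_processing_ms
    queue_share = avg_queue_ms / total_ms if total_ms > 0 else 0.0
    return total_ms, queue_share


# Hypothetical reading: calls wait 6 ms in the queue and take 2 ms to process.
total_ms, queue_share = rpc_latency_breakdown(6.0, 2.0)
print(total_ms, queue_share)  # 8.0 0.75
```

When most of the latency comes from queueing rather than processing, requests are arriving faster than the handlers can pick them up.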

Capacity

Capacity is a metric that shows the amount of disk space available and used in the HDFS cluster. Monitoring this metric is important for ensuring that the system has enough disk space to perform its tasks.

Used Capacity

Used Capacity is a metric that shows the amount of disk space used for HDFS and non-HDFS operations. Monitoring this metric is important for ensuring that the system has enough disk space to perform its tasks.
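
The capacity figures can be reduced to percentages of the total for quick comparison. This is a minimal sketch with a hypothetical function name and made-up values; it assumes all four inputs are in the same unit (for example bytes) and that the total is the configured cluster capacity.

```python
def capacity_breakdown(total: int, dfs_used: int, non_dfs_used: int, remaining: int) -> dict:
    """Express each capacity figure as a percentage of total capacity."""
    if total <= 0:
        raise ValueError("total must be positive")

    def percent(value: int) -> float:
        return round(100.0 * value / total, 2)

    return {
        "dfs_used_percent": percent(dfs_used),
        "non_dfs_used_percent": percent(non_dfs_used),
        "remaining_percent": percent(remaining),
    }


# Hypothetical cluster: 1000 units total, 600 used by HDFS, 150 by other data.
print(capacity_breakdown(total=1000, dfs_used=600, non_dfs_used=150, remaining=250))
```

Tracking non-HDFS usage separately helps explain cases where free space shrinks even though the amount of HDFS data is stable.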

Load

Load is a metric that shows the load on the HDFS cluster. Monitoring this metric is important for ensuring that the system is able to process requests efficiently and that the system is not overloaded.

Volume Failures Total

Volume Failures Total is a metric that shows the number of storage volume failures during the current time window. Monitoring this metric is important for ensuring that the system is able to access data stored on the storage volumes efficiently.

Files Total

Files Total is a metric that shows the total number of files stored in the HDFS cluster. Monitoring this metric is important because the NameNode keeps the file system namespace in memory, so a growing file count increases its heap usage.

Blocks Total

Blocks Total is a metric that shows the total number of data blocks stored in the HDFS cluster. Monitoring this metric is important because every block is also tracked in NameNode memory, and the count grows with both the volume of data and the number of small files.

Blocks

Blocks is a metric that shows the number of corrupt, missing, and under-replicated data blocks. Monitoring this metric is important for ensuring that the system is able to access data stored in the system efficiently and that data is not lost or corrupted.
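
One way to turn the three block counters into a single status is sketched below. The function name and the severity mapping are illustrative assumptions, not Netdata's alert definitions: corrupt or missing blocks are treated as critical because data may be unreadable, while under-replicated blocks are treated as a warning because HDFS can normally re-replicate them.

```python
def block_health(corrupt: int, missing: int, under_replicated: int) -> str:
    """Map block counters to a coarse status string (illustrative policy)."""
    if corrupt > 0 or missing > 0:
        return "critical"
    if under_replicated > 0:
        return "warning"
    return "ok"


# Hypothetical readings.
print(block_health(corrupt=0, missing=0, under_replicated=12))  # warning
print(block_health(corrupt=1, missing=0, under_replicated=0))   # critical
```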

Data Nodes

Data Nodes is a metric that shows the number of live, dead, and stale DataNodes in the HDFS cluster. Monitoring this metric is important for ensuring that the system is able to access data stored in the system efficiently and that DataNodes are running correctly.
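
The DataNode counts can be summarized in a similar way. The sketch below uses a hypothetical function name and sample values, and it assumes that stale DataNodes are still counted among the live ones, so the cluster size is taken as live plus dead.

```python
def datanode_summary(live: int, dead: int, stale: int) -> dict:
    """Summarize DataNode availability from live, dead, and stale counts."""
    total = live + dead  # assumption: stale nodes are a subset of live nodes
    return {
        "total": total,
        "live_percent": 100.0 * live / total if total else 0.0,
        "all_healthy": dead == 0 and stale == 0,
    }


# Hypothetical cluster: 18 live nodes (1 of them stale) and 2 dead nodes.
print(datanode_summary(live=18, dead=2, stale=1))
```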

Datanode Capacity

Datanode Capacity is a metric that shows the amount of disk space available and used on DataNodes. Monitoring this metric is important for ensuring that DataNodes have enough disk space to perform their tasks.

Datanode Used Capacity

Datanode Used Capacity is a metric that shows the amount of disk space used for HDFS and non-HDFS operations on DataNodes. Monitoring this metric is important for ensuring that DataNodes have enough disk space to perform their tasks.

Datanode Failed Volumes

Datanode Failed Volumes is a metric that shows the number of failed storage volumes on DataNodes. Monitoring this metric is important for ensuring that the system is able to access data stored on the storage volumes efficiently.

Datanode Bandwidth

Datanode Bandwidth is a metric that shows the amount of data read and written by DataNodes. Monitoring this metric is important for ensuring that DataNodes are able to send and receive data efficiently.
