RiakKV monitoring with Netdata

What is RiakKV?

RiakKV is an open-source distributed NoSQL database designed to provide high availability and scalability. It is the key-value member of the Riak database family, originally developed by Basho Technologies and released under the Apache 2.0 license, and it provides a robust, reliable data store that is easy to deploy and manage.

Monitoring RiakKV with Netdata

The prerequisites for monitoring RiakKV with Netdata are to have RiakKV and Netdata installed on your system.

Netdata auto-discovers hundreds of services, and for those it doesn't, turning on manual discovery is a one-line configuration. For more information on configuring Netdata for RiakKV monitoring, please read the collector documentation.
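As an illustration, the collector only needs to know where Riak's HTTP stats endpoint is reachable. A minimal sketch of a collector job, assuming Riak's default HTTP port (8098) and a stats endpoint at /stats, could look like the following; the exact configuration file and job name depend on your Netdata version, so treat this as a template and check the collector documentation for specifics:

    # Collector job definition (file name and module depend on your Netdata version).
    # "local" is an arbitrary job name; the URL assumes Riak's default HTTP listener.
    local:
      url: 'http://127.0.0.1:8098/stats'

After editing the configuration, restart the Netdata Agent so the new job is picked up.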

You should now see the RiakKV section on the Overview tab in Netdata Cloud already populated with charts about all the metrics you care about.

Netdata has a public demo space (no login required) where you can explore different monitoring use-cases and get a feel for Netdata.

What RiakKV metrics are important to monitor - and why?

Throughput

KV Operations

KV operations are the key-value operations used to get data from and put data into RiakKV, and they are how applications store and retrieve data from the distributed key-value store. Monitoring KV operations can provide insights into the performance and availability of RiakKV through metrics such as successful gets, successful puts, and operation times.
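To make this concrete, here is a small Python sketch that issues a put and a get against Riak's HTTP API. The host, port, bucket, and key are illustrative assumptions (Riak's default HTTP listener is on port 8098), and operations like these are what end up counted in the node's get/put throughput charts:

    # Minimal put/get against Riak's HTTP API.
    # The bucket "fruits" and key "apple" are illustrative names only.
    import requests

    url = "http://127.0.0.1:8098/buckets/fruits/keys/apple"

    # Store a small JSON value under the key (a KV put).
    requests.put(url, json={"color": "red"}, timeout=5).raise_for_status()

    # Read it back (a KV get) and print the stored document.
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    print(response.json())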

Data Type Updates

Data type updates in RiakKV include counters, sets, and maps. Monitoring these data types can provide insight into the performance of the system and help identify any potential issues. Counters can be monitored to track the number of times a particular operation has been called, while sets and maps can be monitored to track the number of elements stored in them.

Search Queries

Search queries are used to query data that RiakKV has indexed with Riak Search. Monitoring search queries can provide insights into query performance, such as the number of query requests, query times, and query results. This can be used to identify potential performance issues or to optimize query times.
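For reference, in Riak 2.x these queries go through the Solr-compatible HTTP endpoint exposed by Riak Search. The sketch below assumes an index named famous already exists and is attached to a bucket; the index name and query field are illustrative assumptions:

    # Query Riak Search's Solr-compatible endpoint (Riak 2.x / Yokozuna).
    # The index name "famous" and the field "color_s" are illustrative assumptions.
    import requests

    response = requests.get(
        "http://127.0.0.1:8098/search/query/famous",
        params={"q": "color_s:red", "wt": "json"},
        timeout=5,
    )
    response.raise_for_status()
    print("matching documents:", response.json()["response"]["numFound"])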

Search Documents

Search documents are documents that have been indexed by Riak Search. Monitoring the number of documents indexed can provide insights into the performance of the system, and can be used to identify potential issues with indexing or to optimize the indexing process.

Strong Consistency Operations

Strong consistency operations are operations that are guaranteed to be consistent across the distributed key-value store. Monitoring these operations can provide insights into the performance and availability of the system. For example, monitoring the number of successful gets and puts can be used to identify any potential performance issues or to optimize the read/write throughput.

Latency

KV latency

KV latency is a measure of the time it takes for a key-value store to get and put data. It is measured by looking at the mean, median, 95th, 99th, and 100th percentile of the data operation. This metric is important to monitor because it provides an indication of how well the system is performing when it comes to retrieving and storing data. If the latency is too high, then the system may not be able to keep up with the demand or may be experiencing some sort of bottleneck. Monitoring this metric can help identify and address any issues with the system before they become too severe.
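These percentiles come from Riak's own timing statistics, so you can also read them directly from the node's stats endpoint. The sketch below assumes the stat names Riak typically reports for the get path (node_get_fsm_time_mean, node_get_fsm_time_95, and so on, with values in microseconds); verify the names against your version's /stats output:

    # Print GET-path latency percentiles straight from Riak's stats endpoint.
    # Stat names and the microsecond unit are assumptions to verify against /stats.
    import requests

    stats = requests.get("http://127.0.0.1:8098/stats", timeout=5).json()

    for pct in ("mean", "median", "95", "99", "100"):
        value_us = stats.get(f"node_get_fsm_time_{pct}", 0)
        print(f"get latency {pct}: {value_us / 1000:.2f} ms")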

Data type latency

Data type latency measures the time it takes for an operation to complete on various data types, such as counters, sets, and maps. It is measured by looking at the mean, median, 95th, 99th, and 100th percentile of the operation. This metric is important to monitor because it can help identify any bottlenecks or other performance issues with the system. If certain data types are taking longer than expected to complete operations, then this could be an indication of a problem that needs to be addressed.

Search latency

Search latency measures the time it takes for a search query to complete. It is measured by looking at the median, min, max, and 95th and 99th percentile of the query. This metric is important to monitor because it can help identify any performance issues with the system. If searches are taking too long to complete, then this could be an indication of a problem that needs to be addressed.

Strong consistency latency

Strong consistency latency measures the time it takes for a read or write operation to complete with strong consistency. It is measured by looking at the mean, median, 95th, 99th, and 100th percentile of the operation. This metric is important to monitor because it can help identify any performance issues with the system when it comes to read and write operations. If strong consistency operations take too long to complete, then this could be an indication of a problem that needs to be addressed.

Erlang VM

Processes

In the Erlang VM that Riak runs on, processes are lightweight tasks scheduled by the VM rather than operating-system processes. They can be created, paused, resumed, terminated, and monitored. Monitoring processes is important to ensure that the system is running smoothly and to identify any potential issues that could arise.

Processes can be monitored to make sure they are running as expected, are not taking too long to complete, are not consuming too many resources, and are not causing any system instability. If processes are not monitored, it can lead to decreased system performance, system crashes, or unexpected errors.

Processes.allocated

Processes.allocated is a metric which measures the total amount of memory allocated to Erlang processes running in the Riak node's VM. This metric can be used to identify any potential issues with memory consumption, as processes that are consuming too much memory can lead to system instability or decreased performance.

Monitoring processes.allocated can help identify any processes that are consuming too much memory, and can help identify potential memory leaks or other issues that could be causing stability or performance issues.

Processes.used

Processes.used is a metric which measures the amount of memory actually being used by Erlang processes running in the Riak node's VM. This metric can be used to identify any potential issues with memory consumption, as processes that are using too much memory can lead to system instability or decreased performance.

Monitoring processes.used can help identify any processes that are using too much memory, and can help identify potential memory leaks or other issues that could be causing stability or performance issues.
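Both values are available from Riak's stats endpoint, commonly under the names memory_processes (allocated) and memory_processes_used (used); confirm the names for your version. A quick sketch for checking how much of the allocated process memory is actually in use:

    # Compare memory allocated to Erlang processes with memory actually in use.
    # The stat names are assumptions to confirm against your node's /stats output.
    import requests

    stats = requests.get("http://127.0.0.1:8098/stats", timeout=5).json()

    allocated = stats.get("memory_processes", 0)
    used = stats.get("memory_processes_used", 0)

    if allocated:
        print(f"process memory: {used / 1024 ** 2:.1f} MiB used of "
              f"{allocated / 1024 ** 2:.1f} MiB allocated ({100 * used / allocated:.0f}%)")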

Health

Siblings encountered in KV operations

Siblings encountered in KV operations is a metric that measures the number of siblings encountered during a specified period of time (e.g. the past minute). Siblings are multiple conflicting versions of the same object, created when concurrent writes cannot be automatically reconciled. If too many siblings accumulate, objects grow larger and reads slow down, so monitoring this metric can help detect conflict-heavy workloads before they become a problem.

Object size in KV operations

Object size in KV operations is a metric that measures the size of objects being retrieved from the KV store during a specified period of time (e.g. the past minute). By monitoring this metric, it is possible to detect if there is an issue with the size of the objects being retrieved, which can lead to performance issues. This metric can also be used to detect if there is an issue with the object itself (e.g. corrupt data).
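Both of these health signals can be spot-checked directly against the stats endpoint. The sketch below assumes the percentile stat names Riak usually exposes for the get path (node_get_fsm_siblings_99 and node_get_fsm_objsize_99) and uses thresholds that are purely illustrative; tune them to your workload:

    # Rough health check on sibling counts and object sizes seen by GET operations.
    # Stat names and thresholds are illustrative assumptions; verify against /stats.
    import requests

    stats = requests.get("http://127.0.0.1:8098/stats", timeout=5).json()

    siblings_p99 = stats.get("node_get_fsm_siblings_99", 0)
    objsize_p99 = stats.get("node_get_fsm_objsize_99", 0)  # bytes

    if siblings_p99 > 5:
        print(f"warning: 99th percentile sibling count is {siblings_p99}")
    if objsize_p99 > 1024 * 1024:
        print(f"warning: 99th percentile object size is {objsize_p99 / 1024:.0f} KiB")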

Message queue length

Message queue length is a metric that measures the number of unprocessed messages in the queue during a specified period of time (e.g. the past minute). This metric is important to monitor in order to ensure that the system is not experiencing any bottlenecks due to message queue length. If too many messages are left in the queue, it can lead to performance issues and even data loss. Monitoring this metric can help detect and prevent these issues.

Index operations encountered by Search

Index operations encountered by Search is a metric that measures the number of index operations encountered by the search service during a specified period of time (e.g. the past minute). This metric is important to monitor in order to ensure that the system is not experiencing any issues with index operations, such as slow response times or errors. Monitoring this metric can help detect and prevent these issues.

Protocol buffer connections

Protocol buffer connections is a metric that measures the number of active protocol buffer connections during a specified period of time (e.g. the past minute). This metric is important to monitor in order to ensure that the system is not experiencing any issues with protocol buffer connections, such as slow response times or errors. Monitoring this metric can help detect and prevent these issues.

Repair operations coordinated by this node

Repair operations coordinated by this node is a metric that measures the number of repair operations coordinated by this node during a specified period of time (e.g. the past minute). This metric is important to monitor in order to ensure that the system is not experiencing any issues with repair operations, such as slow response times or errors. Monitoring this metric can help detect and prevent these issues.

Active finite state machines by kind

Active finite state machines by kind is a metric that measures the number of active finite state machines by kind (e.g. get, put, secondary_index, list_keys) during a specified period of time (e.g. the past minute). This metric is important to monitor in order to ensure that the system is not experiencing any issues with finite state machines, such as slow response times or errors. Monitoring this metric can help detect and prevent these issues.

Rejected finite state machines

Rejected finite state machines is a metric that measures the number of rejected finite state machines (e.g. get, put) during a specified period of time (e.g. the past minute). This metric is important to monitor in order to ensure that the system is not experiencing any issues with rejected finite state machines, such as slow response times or errors. Monitoring this metric can help detect and prevent these issues.

Number of writes to Search failed due to bad data format by reason

Number of writes to Search failed due to bad data format by reason is a metric that measures the number of failed writes to the search service due to badly formatted data, broken down by reason (e.g. bad_entry, extract_fail), during a specified period of time (e.g. the past minute). This metric is important to monitor in order to ensure that documents are being indexed successfully and that applications are not sending data Search cannot parse. Monitoring this metric can help detect and prevent these issues.
