Cassandra monitoring with Netdata

What is Cassandra?

Cassandra is an open-source, distributed, wide-column NoSQL database management system written in Java. Cassandra was originally developed by Avinash Lakshmanan and Prashant Malik at Facebook and then released as open source, eventually becoming part of the Apache project.

Cassandra is a NoSQL database - NoSQL (also known as “not only SQL”) databases do not require data to be stored in tabular format. They provide flexible schemas and scale easily with large amounts of data and high user loads.

Cassandra also offers some key advantages over other NoSQL databases:

For these reasons Cassandra is used by large organizations such as Apple, Netflix, Facebook and others.

How to monitor Cassandra performance?

When using Cassandra in production it becomes crucial to quickly detect any issues or problems (including but not limited to read/write latency, errors and exceptions) that may arise and rectify them as soon as possible.

To achieve this, thorough monitoring of Cassandra is essential!

Cassandra exposes metrics via JMX (Java Management Extensions) and there are a few different ways in which you can access them including nodetool, Jconsole or a JMX integration. While nodetool and Jconsole are very useful tools and the right choice if you just want a quick view of what’s happening right now - a more comprehensive JMX integration is the way to go for detailed troubleshooting. Netdata uses the Prometheus JMX integration to collect Cassandra metrics.

What metrics to monitor?

There are hundreds of possible metrics that can be collected from Cassandra - and it can get a bit overwhelming. So let’s try and keep things simple, by going through the most important metrics that will help you to monitor the performance of your Cassandra cluster.

Throughput

Monitoring the throughput of a Cassandra cluster in terms of the read and write requests received is crucial to understand overall performance and activity levels. This information should also guide you when it comes to choosing the right compaction strategy - which may vary, depending on whether your workload is read-heavy or write-heavy.

If your data is modeled properly, Cassandra offers near linear scalability.

1_r2pJJZxKNktYmRN5mi5tOA

Source: Benchmarking Cassandra Scalability (Netflix Tech Blog)

Latency

Latency often acts as the canary in the coal mine and monitoring latency gives you an early warning about upcoming performance bottlenecks or a shift in usage patterns. Latency can be impacted by disk access, network latency or replication configuration.

Latency is measured in a couple of different ways. Latency across reads and writes are measured as a histogram with percentile bins of 50th, 75th, 95th, 98th, 99th, 99.9th so you understand the latency distribution across time. Cassandra uses a histogram with an exponentially decaying reservoir which is representative (roughly) of the last 5 minutes of data. The total latency (summed across all requests) is also measured and presented in a different chart.

Consistently high latency or even occasional and infrequent spikes in latency could point to systemic issues with the cluster such as:

Cache

Cassandra provides built-in efficient caching functionality through the key cache and row cache. Key caching is enabled by default and holds the location of keys in memory per column family. It is recommended for most common scenarios and a high key cache utilization is desirable. If the key cache hit ratio is consistently < 80% or cache misses are consistently seen, consider increasing the key cache size.

Row cache, unlike the key cache, is not enabled by default and stores the entire contents of the row in memory and is intended for more specialized use-cases. For example, if you have a small subset of data that gets access frequently, and with each access you need almost all of the columns returned using a row cache would be a good fit. For these specialized use-cases row cache can bring about very significant gains in efficiency and performance.

Disk usage

Monitoring disk usage levels and patterns is key for Cassandra - as it is for other data stores. It is recommended to budget for free disk space at all times so that there is always available disk space for Cassandra to perform operations which temporarily use up additional disk space, such as compaction. How much free disk space should be maintained depends on the compaction strategy, but 30% is generally considered a reasonable default.

Screenshot 2022-10-27 232349

Compaction

In Cassandra, writes are written to the commit log and to the active Memtable. Memtables are later flushed to disk, to a file called SSTable. Compaction is the background process by which Cassandra reconciles copies of data spread across different SSTables. Compaction is crucial for improving read performance and enables Cassandra to store fewer SSTables.

Picking the right compaction strategy based on the workload will ensure the best performance for both querying and for compaction itself.

The different compaction strategies that Cassandra uses are:

Compaction performance can be understood by monitoring the rate of completed compaction tasks and pending compaction tasks. A growing queue of pending compaction tasks means the Cassandra cluster is struggling to keep up with the workload.

Screenshot 2022-10-27 232316

Thread pools

Cassandra, being based on Staged Event Driven Architecture (SEDA) separates different tasks in stages. Each stage has a queue and a thread pool. If these queues are filled up it could indicate potential performance issues.

JVM runtime

Cassandra is a Java application and utilizes the JVM runtime. There are of course a multitude of JVM metrics available but monitoring the memory usage and the garbage collection stats are of particular importance for Cassandra.

ParNew (young-generation) garbage collections occur relatively often. All application threads pause while ParNew garbage collection happens, so keep a close eye on ParNew latency as any significant increase here will considerably impact Cassandra’s performance.

ConcurrentMarkSweep or CMS (old-generation) garbage collection also temporarily stops application threads, but it does so intermittently. If CMS latency is consistently high it could mean your cluster is running out of memory and more nodes may need to be added to the cluster.

Screenshot 2022-10-27 232424

Screenshot 2022-10-27 232435

Errors

It is crucial to monitor Cassandra’s own error and exception metrics. Possibly the most important one is the rate of unavailable exceptions, which could indicate that there are one or more nodes which have gone down.

Monitoring Cassandra with Netdata

Let us use the cassandra-stress tool to generate some workload with the following command:

cassandra-stress mixed duration=15m -rate threads=6

On the Netdata Cloud UI, navigate to the Cassandra section on the menu you see on the right side of the Overview tab. Clicking on it will expand the different sections into which Cassandra metrics are organized.

image

Clicking on the Cassandra section will also bring up the summary overview which presents 4 of the key Cassandra performance indicators:

This helps you to understand at a glance if there’s something seriously wrong with your Cassandra cluster that requires further troubleshooting or not.

image

To get to the rest of the charts you can either scroll down or click on the section you are interested in. Let’s walk through some of the metrics and see how these charts look.

Since we used a single cassandra-stress command to initiate both reads and writes you can see that a similar pattern is followed for incoming requests.

image

When it comes to latency, it is measured in a couple of different ways. Latency across reads and writes are measured as a histogram with percentile bins of 50th, 75th, 95th, 98th, 99th, 99.9th to give you a good understanding of the latency distribution. Cassandra uses a histogram with an exponentially decaying reservoir which is representative (roughly) of the last 5 minutes of data. The total latency (summed across all requests) is also measured and presented in a different chart (not pictured here).

image image

It’s always a good idea to keep an eye on how the cache is doing. Cassandra offers both the default key cache and an optional row cache. In the example below you can see that the row cache is not in use while the key cache has a pretty good hit ratio of 85%.

image

Along with the key cache hit ratio, you can also monitor the utilization of the key cache itself and understand if you need to increase the allocated cache size or not. An underutilized key cache is also worth notice and might point to a change you need to make in how the workloads are managed.

image

The live disk space used up by Cassandra is another metric you can keep an eye on - and it is good practice to make sense of this chart in comparison to the built-in disk space usage chart under the mount points section.

image

A quick look at the compaction metrics tells us that things are in the green - the pending compaction tasks do not pile up and are quickly processed.

image

JVM related metrics such as the JVM memory used and garbage collection metrics are grouped under the JVM runtime section of Cassandra.

image

And last but not least we have the crucial section on potential errors and exceptions that are happening on this Cassandra cluster. During the benchmarking run we executed earlier we can see that the dropped message rate has spiked, and there was a temporary spike in requests being timed out as well.

image image

There are other error and exception charts (unavailable exceptions, storage exceptions, failures etc.) which are not shown here since those errors or exceptions were not triggered during this particular benchmarking run.

Get Netdata

Sign up for free

Want to see a demonstration of Netdata for multiple use cases?

Go to Live Demo