NVMe monitoring with Netdata

What is NVMe?

NVMe (Non-Volatile Memory Express) is an open logical-device interface specification for accessing a computer’s non-volatile storage media, usually attached via the PCI Express (PCIe) bus. NVMe allows host hardware and software to fully exploit the levels of parallelism possible in modern SSDs. As a result, NVMe reduces I/O overhead and brings various performance improvements relative to previous logical-device interfaces, including multiple long command queues and reduced latency.

Monitoring NVMe with Netdata

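To monitor NVMe with Netdata, the netdata user must be able to run the nvme command (from nvme-cli) as root without a password. A sudoers entry such as the following grants that permission: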

netdata ALL=(root) NOPASSWD: /usr/sbin/nvme

For more information please read the NVMe collector documentation.
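
As a rough illustration of what the sudoers rule makes possible, the sketch below runs nvme-cli through sudo and pulls a few health fields out of its JSON output. It is not how the Netdata collector itself is implemented. The device path /dev/nvme0, the function names, and the JSON key names are assumptions; key names in particular can differ between nvme-cli versions.

```python
import json
import subprocess


def parse_smart_log(raw_json: str) -> dict:
    """Pick a few health fields out of `nvme smart-log` JSON output."""
    data = json.loads(raw_json)
    # Key names are assumptions based on commonly seen nvme-cli output;
    # some versions use "percent_used", others "percentage_used".
    used = data.get("percent_used", data.get("percentage_used"))
    return {
        "critical_warning": data.get("critical_warning"),
        "percentage_used": used,
        "avail_spare": data.get("avail_spare"),
        "media_errors": data.get("media_errors"),
    }


def read_smart_log(device: str = "/dev/nvme0") -> dict:
    """Run nvme-cli via sudo without prompting (-n), as the rule above allows."""
    result = subprocess.run(
        ["sudo", "-n", "/usr/sbin/nvme", "smart-log", device,
         "--output-format=json"],
        capture_output=True,
        text=True,
        check=True,  # raise if sudo or nvme fails
    )
    return parse_smart_log(result.stdout)
```

Calling read_smart_log() only works on a host with an NVMe device and the sudoers entry in place; parse_smart_log() can be exercised on any saved JSON output.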

What NVMe metrics are important to monitor - and why?

Endurance

Endurance indicates the consumed lifetime of the device, based on actual usage and the manufacturer’s prediction of NVM life. A value of 100 indicates that the estimated endurance of the device has been consumed, but it does not necessarily indicate a device failure. The value can be greater than 100 if you use the storage beyond its planned lifetime. If the endurance value is approaching 100, you should consider replacing the disk. The remaining estimated lifetime can be thought of as (100 - endurance).
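
The (100 - endurance) estimate can be sketched in a few lines of Python. The function names and the replacement threshold of 90 are hypothetical choices for illustration, not values taken from Netdata or from any vendor.

```python
def remaining_life_percent(endurance_used: int) -> int:
    """Estimated remaining lifetime, floored at 0 because endurance can exceed 100."""
    return max(0, 100 - endurance_used)


def should_plan_replacement(endurance_used: int, threshold: int = 90) -> bool:
    """True once consumed endurance reaches the (arbitrary, assumed) threshold."""
    return endurance_used >= threshold
```

For example, a device reporting an endurance value of 3 has an estimated 97 percent of its rated life remaining.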

In the example shown here, we can see a new disk with a lot of life left to live.

image

In most cases of SSD failure, it isn’t endurance that’s the culprit. Firmware issues, media failures and hardware issues are all more likely (in that order) to be the root cause.

Spare Capacity

This metric measures the amount of spare capacity available on the device and is represented as a percentage. A lower percentage indicates that the device is running low on spare capacity.

SSDs reserve a pool of internal spare capacity, called spare blocks, that is used to replace blocks that have reached their write operation limit. After all of the spare blocks have been used, the next block that reaches its limit causes the disk to fail.
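
A minimal sketch of a spare-capacity check follows. It assumes you have both the available spare percentage and the device's own available spare threshold (the NVMe SMART log reports both); the function name is hypothetical.

```python
def spare_is_low(avail_spare_pct: int, spare_threshold_pct: int) -> bool:
    """True when available spare has dropped below the device-reported threshold."""
    return avail_spare_pct < spare_threshold_pct
```

A device reporting 100 percent available spare against a threshold of 10 percent is healthy by this measure; one reporting 5 percent is not.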

IO Transferred

This metric measures the total amount of data read from and written to the device. It helps you understand the load profile of the system, including whether it is read-heavy or write-heavy, and it is useful when troubleshooting potential performance issues.
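
If you work with the raw SMART counters rather than Netdata's chart, the totals are reported in data units, where the NVMe specification defines one unit as 1,000 blocks of 512 bytes. The sketch below converts units to bytes and computes the share of writes as a rough read-heavy versus write-heavy indicator. The function names are hypothetical.

```python
BYTES_PER_DATA_UNIT = 512 * 1000  # one NVMe data unit = 1,000 x 512-byte blocks


def data_units_to_bytes(units: int) -> int:
    """Convert a raw SMART data-unit counter to bytes."""
    return units * BYTES_PER_DATA_UNIT


def write_share(units_read: int, units_written: int) -> float:
    """Fraction of transferred data that was written (0.0 when nothing was transferred)."""
    total = units_read + units_written
    if total == 0:
        return 0.0
    return units_written / total
```

A write share well above 0.5 suggests a write-heavy workload, which is also the kind of load that consumes endurance fastest.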

Power Cycles

This metric measures the number of times this host has been rebooted or the device has been woken up after sleep. A high number of power cycles does not affect the device’s life expectancy.

Power On Time

Power-on time is the length of time the device has been supplied with power (in other words, powered on).

Critical Warnings

This metric measures the number of critical warnings that have occurred. The status of each warning indicates which problem needs to be addressed.

In the example shown here, we do not yet have any critical warnings, but if any do occur they will show up on this chart.

image
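
In the raw SMART log, the critical warning is a single bit field in which each bit flags one condition. The sketch below decodes such a value into readable labels. The bit meanings follow the NVMe specification; the label wording and the function name are my own and do not match Netdata's chart dimension names.

```python
# Bit positions as defined for the Critical Warning field of the NVMe SMART log.
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media in read-only mode",
    4: "volatile memory backup device failed",
    5: "persistent memory region read-only or unreliable",
}


def decode_critical_warning(value: int) -> list[str]:
    """Return the label of every warning bit that is set in `value`."""
    return [
        label
        for bit, label in CRITICAL_WARNING_BITS.items()
        if value & (1 << bit)
    ]
```

A value of 0 decodes to an empty list, which corresponds to the healthy state shown in the example chart.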

Unsafe Shutdowns

This metric measures the number of times the device has lost power (for example, in a power outage) without a shutdown notification being sent first. Depending on the NVMe device you are using, unsafe shutdowns can cause data corruption and shorten the lifespan of the device.

Media Errors

This metric measures the number of occurrences where the controller detected an unrecovered data integrity error. Errors such as uncorrectable ECC, CRC checksum failure, or LBA tag mismatch are included in this counter.

Error Log Entries

This metric measures the number of entries in the Error Information Log. While error log entries may indicate problems that need to be addressed, an increase in the number of entries is not by itself an indicator of any failure condition.

Temperature

image

Thermal management transitions

The thermal management transitions metrics measure the rate of temperature transitions of specific components on the device. These metrics can be helpful in troubleshooting temperature-related issues.

Troubleshooting

Netdata comes with built-in alerts for many monitoring use cases, including NVMe monitoring. By default, an alert is triggered if the number of critical warnings is non-zero. If you would like to change the thresholds for this alert, or create your own alert for another metric, follow the instructions in the Netdata alert configuration documentation.

 template: nvme_device_critical_warnings_state
 families: *
       on: nvme.device_critical_warnings_state
    class: Errors
     type: System
component: Disk
   lookup: max -30s unaligned
    units: state
    every: 10s
     crit: $this != nan AND $this != 0
    delay: down 5m multiplier 1.5 max 2h
     info: NVMe device $label:device has critical warnings
       to: sysadmin
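
To make the lookup and crit lines above concrete, the following Python sketch models their logic: take the maximum value seen in the last 30 seconds, then raise a critical alert if that value is a number other than zero. This is a simplified model for illustration, not Netdata's implementation; the function names and the (timestamp, value) sample format are assumptions.

```python
import math


def lookup_max(samples, now, window=30):
    """Maximum value among samples in the last `window` seconds, or NaN if none.

    `samples` is a list of (timestamp_seconds, value) pairs.
    """
    recent = [
        value
        for timestamp, value in samples
        if now - window < timestamp <= now and not math.isnan(value)
    ]
    return max(recent) if recent else math.nan


def is_critical(this):
    """Mirror of `crit: $this != nan AND $this != 0`."""
    return not math.isnan(this) and this != 0
```

Because the lookup uses max, a warning that was set at any point within the window keeps the alert critical even if the most recent sample is back to zero.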

You can also rely on other troubleshooting and data exploration features, such as Anomaly Advisor and Metric Correlation, to make sense of your NVMe metrics and to understand which stressors or variables may have influenced them.
