NVMe is an open, logical-device interface specification for accessing a computer’s non-volatile storage media, usually attached via the PCI Express (PCIe) bus. NVMe allows host hardware and software to fully exploit the levels of parallelism possible in modern SSDs. As a result, NVMe reduces I/O overhead and brings various performance improvements relative to previous logical-device interfaces, including multiple long command queues and reduced latency.
The prerequisites for monitoring NVMe with Netdata are that you have nvme-cli installed and that the netdata user can run nvme as root without a password. You can do this by adding the netdata user to the /etc/sudoers file (use which nvme to find the full path to the nvme binary):
netdata ALL=(root) NOPASSWD: /usr/sbin/nvme
For more information please read the NVMe collector documentation.
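The sudoers step above can be sketched in a few lines of Python. This is only an illustration: the helper name sudoers_line is hypothetical, and the script just prints the entry so you can review it and add it yourself (ideally with visudo) rather than editing /etc/sudoers directly.

```python
import shutil


def sudoers_line(nvme_path, user="netdata"):
    """Build a sudoers entry that lets `user` run the nvme binary as root
    without a password. `sudoers_line` is a hypothetical helper name."""
    return f"{user} ALL=(root) NOPASSWD: {nvme_path}"


# Equivalent of running `which nvme`: look the binary up on PATH.
path = shutil.which("nvme")
if path is None:
    print("nvme not found on PATH; install nvme-cli first")
else:
    print(sudoers_line(path))
```

On a system where nvme lives in /usr/sbin, the printed entry matches the one shown above.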
Endurance indicates the consumed lifetime of the device based on the actual usage and the manufacturer’s prediction of NVM life. A value of 100 indicates that the estimated endurance of the device has been consumed, but may not indicate a device failure. The value can be greater than 100 if you use the storage beyond its planned lifetime. If the endurance value is approaching 100, you should consider replacing the disk. The remaining estimated lifetime can be thought of as (100 - endurance).
In the example shown here, we can see a new disk with a lot of life left to live.
In most cases of SSD failure, it isn’t endurance that’s the culprit. Firmware issues, media failures and hardware issues are all more likely (in that order) to be the root cause.
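The endurance arithmetic described earlier in this section can be written as a minimal sketch. The function names and the replacement threshold of 90 are assumptions made for illustration; the only rule taken from the text is remaining = 100 - endurance. Because endurance can exceed 100, the sketch floors the result at 0.

```python
def remaining_life_percent(endurance_used):
    """Estimated remaining life as (100 - endurance), floored at 0
    because the endurance value may exceed 100."""
    return max(0, 100 - endurance_used)


def should_consider_replacement(endurance_used, threshold=90):
    """True once endurance reaches `threshold`. The default of 90 is an
    illustrative assumption, not a vendor or Netdata recommendation."""
    return endurance_used >= threshold


for used in (3, 95, 120):
    print(used, remaining_life_percent(used), should_consider_replacement(used))
```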
This metric measures the amount of spare capacity available on the device and is represented as a percentage. A lower percentage indicates that the device is running low on spare capacity.
SSDs reserve internal spare capacity, called spare blocks, that can be used to replace blocks that have reached their write operation limit. After all of the spare blocks have been used, the next block that reaches its limit causes the disk to fail.
This metric measures the total amount of data read from and written to the device. It can be helpful in understanding the load profile of the system and whether it is write or read heavy as well as for troubleshooting potential performance issues.
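One rough way to label a load profile from the read and written totals is sketched below. The function name and the 2x ratio that separates a "heavy" profile from a balanced one are arbitrary assumptions chosen for illustration; pick a ratio that makes sense for your workload.

```python
def load_profile(bytes_read, bytes_written, heavy_ratio=2.0):
    """Label a device as read-heavy, write-heavy, balanced, or idle.
    `heavy_ratio` is an illustrative assumption: one direction must move at
    least this many times more data than the other to count as heavy."""
    if bytes_read == 0 and bytes_written == 0:
        return "idle"
    if bytes_written >= heavy_ratio * bytes_read:
        return "write-heavy"
    if bytes_read >= heavy_ratio * bytes_written:
        return "read-heavy"
    return "balanced"


print(load_profile(bytes_read=100, bytes_written=300))
```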
This metric measures the number of times this host has been rebooted or the device has been woken up after sleep. A high number of power cycles does not affect the device’s life expectancy.
Power-on time is the length of time the device has been supplied with power (in other words, powered on).
This metric measures the number of critical warnings that occurred. The status of each warning indicates which problem needs to be addressed.
In the example shown here, we do not yet have any critical warnings but if any do pop up they will show up on this chart.
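If you read the raw critical warning value yourself, each warning corresponds to one bit. The sketch below decodes that bitmask; the bit assignments follow our reading of the Critical Warning field in the NVMe SMART / Health Information log and should be checked against the specification revision your device implements. The names CRITICAL_WARNING_BITS and decode_critical_warning are hypothetical.

```python
# Bit positions in the Critical Warning byte. Treat this table as an
# assumption to verify against the NVMe specification for your device.
CRITICAL_WARNING_BITS = {
    0: "available spare capacity below threshold",
    1: "temperature outside threshold",
    2: "NVM subsystem reliability degraded",
    3: "media placed in read-only mode",
    4: "volatile memory backup device failed",
    5: "persistent memory region read-only or unreliable",
}


def decode_critical_warning(value):
    """Return the descriptions of all warning bits set in `value`."""
    return [
        name
        for bit, name in sorted(CRITICAL_WARNING_BITS.items())
        if value & (1 << bit)
    ]


print(decode_critical_warning(0b000101))
```

A value of 0 decodes to an empty list, which corresponds to the healthy state.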
This metric measures the number of times the device has been shut down (power outage) without a shutdown notification being sent. Depending on the NVMe device you are using, unsafe shutdowns can cause data corruption and shorten the lifespan of the device.
This metric measures the number of occurrences where the controller detected an unrecovered data integrity error. Errors such as uncorrectable ECC, CRC checksum failure, or LBA tag mismatch are included in this counter.
This metric measures the number of entries in the Error Information Log. While Error log entries may indicate problems that need to be addressed, an increase in the number of records is not by itself an indicator of any failure condition.
Warning composite temperature time: the time the device has been operating above the Warning Composite Temperature Threshold (WCTEMP) and below the Critical Composite Temperature Threshold (CCTEMP).
Critical composite temperature time: the time the device has been operating above the Critical Composite Temperature Threshold (CCTEMP).
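To make the two temperature bands concrete, here is a small sketch that classifies a single composite temperature reading. The function name and the sample thresholds are hypothetical, and treating a reading exactly equal to a threshold as already inside the higher band is an assumption about boundary handling.

```python
def temperature_band(temp, wctemp, cctemp):
    """Classify a composite temperature reading against WCTEMP and CCTEMP.
    All three values must use the same unit. Readings equal to a threshold
    are counted in the higher band (an assumption)."""
    if temp >= cctemp:
        return "critical"
    if temp >= wctemp:
        return "warning"
    return "normal"


# Hypothetical thresholds in degrees Celsius, for illustration only.
for reading in (40, 75, 85):
    print(reading, temperature_band(reading, wctemp=70, cctemp=80))
```

Time accumulated in the "warning" band corresponds to the first counter above, and time in the "critical" band to the second.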
The thermal management transitions metrics measure the rate of temperature transitions of specific components on the device. These metrics can be helpful in troubleshooting temperature-related issues.
Thermal management temp1 transitions: the number of times the controller has entered lower active power states or performed vendor-specific thermal management actions, while minimizing performance impact, to attempt to lower the Composite Temperature due to the host-managed thermal management feature.
Thermal management temp2 transitions: the number of times the controller has entered lower active power states or performed vendor-specific thermal management actions, regardless of the impact on performance (e.g., heavy throttling), to attempt to lower the Composite Temperature due to the host-managed thermal management feature.
Thermal management temp1 time: the amount of time the controller has spent in lower active power states or performing vendor-specific thermal management actions, while minimizing performance impact, to attempt to lower the Composite Temperature due to the host-managed thermal management feature.
Thermal management temp2 time: the amount of time the controller has spent in lower active power states or performing vendor-specific thermal management actions, regardless of the impact on performance (e.g., heavy throttling), to attempt to lower the Composite Temperature due to the host-managed thermal management feature.
Netdata comes with built-in alerts for many monitoring use cases, including NVMe monitoring. By default, an alert is triggered if the number of critical warnings is non-zero. If you would like to update the thresholds for this alert or create your own alert for another metric, please follow the instructions here.
template: nvme_device_critical_warnings_state
families: *
on: nvme.device_critical_warnings_state
class: Errors
type: System
component: Disk
lookup: max -30s unaligned
units: state
every: 10s
crit: $this != nan AND $this != 0
delay: down 5m multiplier 1.5 max 2h
info: NVMe device $label:device has critical warnings
to: sysadmin
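To show what the lookup and crit lines above express, here is a rough Python model of the same logic: take the maximum of the samples from the last 30 seconds, then raise a critical alert if that value is a number and is not zero. This is not how Netdata evaluates alerts internally; the function name is hypothetical, and the handling of missing samples is an assumption.

```python
import math


def is_critical(samples):
    """Model of `lookup: max -30s` followed by
    `crit: $this != nan AND $this != 0`.
    `samples` holds the chart values from the last 30 seconds; NaN entries
    stand for missing data and are ignored (an assumption)."""
    valid = [s for s in samples if not math.isnan(s)]
    this = max(valid) if valid else math.nan
    return (not math.isnan(this)) and this != 0


print(is_critical([0, 0, 0]))  # no warning seen in the window
print(is_critical([0, 1, 0]))  # a warning was set at some point in the window
```

Using the maximum over the window means a warning that appears only briefly is still caught on the next evaluation.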
You can also rely on other troubleshooting and data exploration features such as Anomaly Advisor and Metric Correlation to make sense of your NVMe metrics and try to understand what stressors or variables may have influenced them.