Ceph

Plugin: go.d.plugin
Module: ceph

Overview

This collector monitors the overall health and performance of your Ceph clusters. It gathers key metrics for the entire cluster, individual pools, and OSDs.

It collects metrics by periodically issuing HTTP GET requests to the Ceph Manager REST API:

/api/monitor (queried only once, to obtain the Ceph cluster ID, fsid)
/api/health/minimal
/api/osd
/api/pool?stats=true
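
When checking firewall rules or proxy settings, it can help to see the full URLs these requests resolve to. The following is a minimal sketch, not the collector's implementation: the function names `ceph_api_urls` and `probe_ceph_api` are hypothetical, and the probe sends no credentials, so it only indicates whether each endpoint is reachable, not whether a collection job would be authorized.

```shell
#!/bin/sh
# Print the four Ceph Manager API URLs listed above for a given base URL.
# $1: base URL of the Ceph Manager (a trailing slash is tolerated).
ceph_api_urls() {
    base="${1%/}"
    for path in /api/monitor /api/health/minimal /api/osd '/api/pool?stats=true'; do
        printf '%s%s\n' "$base" "$path"
    done
}

# Print the HTTP status code returned by each endpoint (hypothetical helper).
# No credentials are sent: this checks reachability only.
# -k skips certificate verification; --max-time bounds each request to 2 seconds.
probe_ceph_api() {
    ceph_api_urls "$1" | while read -r url; do
        code=$(curl -k -s -o /dev/null --max-time 2 -w '%{http_code}' "$url")
        printf '%s %s\n' "$code" "$url"
    done
}
```

After sourcing the snippet, `probe_ceph_api https://127.0.0.1:8443` prints one line per endpoint with the status code curl received.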

This collector is only supported on the following platforms:

Linux

This collector supports collecting metrics from multiple instances of this integration, including remote instances.

Default Behavior

Auto-Detection

The collector can automatically detect Ceph Manager instances running on:

localhost that are listening on port 8443
within Docker containers

Note that the Ceph REST API requires a username and password. While Netdata can automatically detect Ceph Manager instances and create data collection jobs, these jobs will fail unless you provide the necessary credentials.

Limits

The default configuration for this integration does not impose any limits on data collection.

Performance Impact

The default configuration for this integration is not expected to impose a significant performance impact on the system.

Setup

Prerequisites

No action required.

Configuration

File

The configuration file name for this integration is go.d/ceph.conf.

You can edit the configuration file using the edit-config script from the Netdata config directory.

cd /etc/netdata 2>/dev/null || cd /opt/netdata/etc/netdata
sudo ./edit-config go.d/ceph.conf

Options

The following options can be defined globally: update_every.

Name	Description	Default	Required
update_every	Data collection interval, in seconds.	1	no
autodetection_retry	Recheck interval in seconds. Zero means no recheck will be scheduled.	0	no
url	The URL of the Ceph Manager API.	https://127.0.0.1:8443	yes
timeout	HTTP request timeout, in seconds.	2	no
username	Username for basic HTTP authentication.		yes
password	Password for basic HTTP authentication.		yes
proxy_url	Proxy URL.		no
proxy_username	Username for proxy basic HTTP authentication.		no
proxy_password	Password for proxy basic HTTP authentication.		no
method	HTTP request method.	GET	no
body	HTTP request body.		no
headers	HTTP request headers.		no
not_follow_redirects	Redirect handling policy. Controls whether the client follows redirects.	no	no
tls_skip_verify	Server certificate chain and hostname validation policy. Controls whether the client performs this check.	yes	no
tls_ca	Certification authority that the client uses when verifying the server’s certificates.		no
tls_cert	Client TLS certificate.		no
tls_key	Client TLS key.		no

Examples

Basic

A basic example configuration.

jobs:
  - name: local
    url: https://127.0.0.1:8443
    username: user
    password: pass

Multi-instance

Note: When you define multiple jobs, their names must be unique.

Collecting metrics from local and remote instances.

jobs:
  - name: local
    url: https://127.0.0.1:8443
    username: user
    password: pass

  - name: remote
    url: https://192.0.2.1:8443
    username: user
    password: pass
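
Because job names must be unique, a quick check of the file before restarting Netdata can catch accidental duplicates. This is a rough sketch that assumes each job is written as `- name: <value>` on its own line, as in the examples above; the function name `duplicate_job_names` is hypothetical and not part of Netdata.

```shell
#!/bin/sh
# Print every job name that appears more than once in a go.d job file.
# $1: path to the configuration file, for example go.d/ceph.conf.
duplicate_job_names() {
    # Keep only the value after "- name:", then list the repeated values.
    sed -n 's/^[[:space:]]*-[[:space:]]*name:[[:space:]]*//p' "$1" | sort | uniq -d
}
```

Running `duplicate_job_names /etc/netdata/go.d/ceph.conf` prints nothing when all names are unique.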

Metrics

Metrics are grouped by scope.

The scope defines the instance that the metric belongs to. An instance is uniquely identified by a set of labels.

Per cluster

These metrics refer to the entire Ceph cluster.

Labels:

Label	Description
fsid	A unique identifier of the cluster.

Metrics:

Metric	Dimensions	Unit
ceph.cluster_status	ok, err, warn	status
ceph.cluster_hosts_count	hosts	hosts
ceph.cluster_monitors_count	monitors	monitors
ceph.cluster_osds_count	osds	osds
ceph.cluster_osds_by_status_count	up, down, in, out	status
ceph.cluster_managers_count	active, standby	managers
ceph.cluster_object_gateways_count	object	gateways
ceph.cluster_iscsi_gateways_count	iscsi	gateways
ceph.cluster_iscsi_gateways_by_status_count	up, down	gateways
ceph.cluster_physical_capacity_utilization	utilization	percent
ceph.cluster_physical_capacity_usage	avail, used	bytes
ceph.cluster_objects_count	objects	objects
ceph.cluster_objects_by_status_distribution	healthy, misplaced, degraded, unfound	percent
ceph.cluster_pools_count	pools	pools
ceph.cluster_pgs_count	pgs	pgs
ceph.cluster_pgs_by_status_count	clean, working, warning, unknown	pgs
ceph.cluster_pgs_per_osd_count	per_osd	pgs
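
To relate the usage chart (bytes) to the utilization chart (percent), the sketch below derives a percentage from the `used` and `avail` dimensions. It assumes utilization is `used / (used + avail)`; the page does not state how the collector computes it, so treat the formula and the hypothetical function name `utilization_percent` as illustrative only.

```shell
#!/bin/sh
# Compute used / (used + avail) as a percentage with two decimals.
# $1: used bytes, $2: available bytes. Assumed formula, see the note above.
utilization_percent() {
    awk -v used="$1" -v avail="$2" 'BEGIN {
        total = used + avail
        if (total <= 0) { print "0.00"; exit }   # avoid division by zero
        printf "%.2f\n", used * 100 / total
    }'
}
```

For example, `utilization_percent 25 75` treats 25 bytes used out of 100 total.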

Per osd

These metrics refer to the Object Storage Daemon (OSD).

Labels:

Label	Description
fsid	A unique identifier of the cluster.
osd_uuid	OSD UUID.
osd_name	OSD name.
device_class	OSD device class.

Metrics:

Metric	Dimensions	Unit
ceph.osd_status	up, down, in, out	status
ceph.osd_space_usage	avail, used	bytes
ceph.osd_io	read, written	bytes/s
ceph.osd_iops	read, write	ops/s
ceph.osd_latency	commit, apply	milliseconds

Per pool

These metrics refer to an individual pool.

Labels:

Label	Description
fsid	A unique identifier of the cluster.
pool_name	Pool name.

Metrics:

Metric	Dimensions	Unit
ceph.pool_space_utilization	utilization	percent
ceph.pool_space_usage	avail, used	bytes
ceph.pool_objects_count	object	objects
ceph.pool_io	read, written	bytes/s
ceph.pool_iops	read, write	ops/s

Alerts

The following alerts are available:

Alert name	On metric	Description
ceph_cluster_physical_capacity_utilization	ceph.cluster_physical_capacity_utilization	Ceph cluster ${label:fsid} disk space utilization

Troubleshooting

Debug Mode

Important: Debug mode is not supported for data collection jobs created via the UI using the Dyncfg feature.

To troubleshoot issues with the ceph collector, run the go.d.plugin with the debug option enabled. The output should give you clues as to why the collector isn’t working.

Navigate to the plugins.d directory, usually at /usr/libexec/netdata/plugins.d/. If that’s not the case on your system, open netdata.conf and look for the plugins setting under [directories].
```
cd /usr/libexec/netdata/plugins.d/
```
Switch to the netdata user.
```
sudo -u netdata -s
```
Run the go.d.plugin to debug the collector:
```
./go.d.plugin -d -m ceph
```

Getting Logs

If you’re encountering problems with the ceph collector, follow these steps to retrieve logs and identify potential issues:

Run the command specific to your system (systemd, non-systemd, or Docker container).
Examine the output for any warnings or error messages that might indicate issues. These messages should provide clues about the root cause of the problem.

System with systemd

Use the following command to view logs generated since the last Netdata service restart:

journalctl _SYSTEMD_INVOCATION_ID="$(systemctl show --value --property=InvocationID netdata)" --namespace=netdata --grep ceph

System without systemd

Locate the collector log file, typically at /var/log/netdata/collector.log, and use grep to filter for the collector’s name:

grep ceph /var/log/netdata/collector.log

Note: This method shows logs from all restarts. Focus on the latest entries for troubleshooting current issues.
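
Since the log file accumulates entries across restarts, limiting the output to the most recent matches is often enough. A small sketch, assuming the default log path mentioned above; the function name `latest_ceph_logs` and the default of 50 lines are arbitrary choices for illustration.

```shell
#!/bin/sh
# Show the most recent log lines that mention the ceph collector.
# $1: log file path (defaults to /var/log/netdata/collector.log).
# $2: number of matching lines to keep (defaults to 50).
latest_ceph_logs() {
    log_file="${1:-/var/log/netdata/collector.log}"
    count="${2:-50}"
    grep ceph "$log_file" | tail -n "$count"
}
```

Calling `latest_ceph_logs` with no arguments reads the default file; pass a different path or line count as needed.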

Docker Container

If your Netdata runs in a Docker container named “netdata” (replace if different), use this command:

docker logs netdata 2>&1 | grep ceph
