Netdata vs Prometheus: A 2025 Performance Analysis

A Deep Dive into Performance, Efficiency, and Scalability at 4.6M Metrics per Second

When it comes to infrastructure monitoring, performance, scalability, and efficiency are critical considerations. In this blog post, we revisit two widely adopted open-source monitoring solutions: Netdata and Prometheus. Both tools have introduced notable improvements in their latest versions, emphasizing scalability and enhanced efficiency.

In our previous analysis, we explored key differences between these systems, focusing on resource consumption and data retention. This follow-up expands on that comparison by subjecting both tools to a significantly larger workload. With the number of monitored nodes increased to 1000, containers to 80k, and metrics ingestion reaching 4.6 million metrics per second, we examine how each system performs under these demanding conditions, focusing on CPU utilization, memory requirements, disk I/O, network usage, and data retention during data ingestion.

Our goal is to present an objective comparison, highlighting the strengths and limitations of both solutions. By providing detailed insights into their performance and scalability, this analysis aims to help users make informed decisions based on their specific infrastructure needs.

TL;DR

  • 4.6M metrics/second on a single node
    Both Netdata v2.2 and Prometheus v3.1 were tested at high ingestion rates on identical hardware. Default configurations, no clustering, no tuning.

  • CPU & Memory

    • Netdata used ~9 cores and 47 GiB peak RSS; Prometheus used ~15 cores and 383 GiB peak RSS.
    • Prometheus required 500 GiB of RAM to remain stable.
  • Retention

    • With 1 TiB of storage, Prometheus retained ~2 hours of per-second data.
    • Again with 1 TiB of storage, Netdata retained ~1.25 days of per-second data and up to almost 3 months with automatic downsampling tiers.
  • Disk & Network Usage

    • Prometheus disk I/O averaged ~147 MiB/s (WAL & compaction) vs. Netdata’s ~4.7 MiB/s.
    • Prometheus consumed ~515 Mbps; Netdata used ~448 Mbps, aided by ZSTD compression.
  • Query Performance

    • Netdata’s tiered engine consistently outperformed Prometheus in large queries (up to 22× faster).
    • Netdata preserved 100% of expected data vs. ~93.7% for Prometheus due to scrape stalls.

Read on for the detailed methodology, metrics, and full comparison.

A reality check: Netdata vs Prometheus at Scale

For most Prometheus users, handling 4–5 million samples per second might seem like an extreme workload reserved for niche, highly optimized setups. However, for Netdata users, this scale is routine—a direct consequence of its design philosophy and ease of use.

Netdata’s “It Just Works” Philosophy

  • Simplicity at Scale: Netdata makes it remarkably easy to configure and operate monitoring for thousands of nodes, even for users with limited technical expertise. Setting up a couple of Netdata Parents and configuring thousands of agents to push metrics to them is straightforward, requiring minimal effort.
  • Transparent Performance: Most Netdata users don’t need to understand the technical details or implications of handling millions of samples per second. The system is designed to “just work” as long as sufficient server resources are provided.

Contrast with Prometheus

  • Unrealistic Expectations for Prometheus: For Prometheus users, achieving similar scale often demands significant engineering expertise, tuning, and additional tools (e.g., Thanos or Cortex) to handle distributed workloads. Even then, the overhead and complexity can make this scale prohibitive for many organizations.
  • Operational Complexity: Prometheus’s design places the burden of scalability on the user, making high-scale setups uncommon except in environments with dedicated observability teams and substantial resources.

A Common Scenario for Netdata

Given that Netdata collects 2,000+ metrics per node, even monitoring 800–1,000 nodes naturally results in millions of samples per second. This isn’t an outlier—it’s business as usual for Netdata users, especially those managing large infrastructures. While this scale might sound extreme or unrealistic to Prometheus users, it reflects typical workloads for Netdata deployments.

This distinction highlights one of Netdata’s core strengths: effortless scalability for real-world use cases. What might seem like a daunting challenge in other monitoring systems is routine in Netdata, offering users a level of scalability that is both accessible and transparent.

Test Environment

For this updated comparison, we scaled up the workload to assess how each tool handles even more demanding data ingestion rates. We monitored 1000 nodes, each collecting system and application metrics every second, totaling 4.6 million metrics per second. This setup simulates a high-scale infrastructure environment to see how both systems perform under stress.

Hardware Setup

The test was conducted on a dual AMD EPYC 7713 64-Core Processor system, designed to provide sufficient computational power for running both Netdata and Prometheus. Each system was hosted on a separate virtual machine (VM) on the same physical host, ensuring identical hardware conditions and minimizing external variability.

Initially, both VMs were configured identically with 200 GiB of RAM each. However, Prometheus v3.1 encountered Out-Of-Memory (OOM) events shortly after starting, prompting incremental increases in its memory allocation. After multiple iterations, we found 500 GiB of RAM to be sufficient for stable operation.

  • VM Configuration:
    • Netdata VM: 52 CPU cores, 200 GiB of RAM, 4 TiB of storage on a dedicated SSD.
    • Prometheus VM: 52 CPU cores, 500 GiB of RAM, 4 TiB of storage on a dedicated SSD.
    • Both VMs were placed on separate NUMA nodes, each pinned to its own node’s CPUs and memory, to avoid resource contention and cross-node memory access overhead (a pinning sketch follows this list).
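For readers reproducing a similar isolation setup, the sketch below shows one generic way to pin a workload (or a VM’s hypervisor process) to a single NUMA node using numactl. This is an illustration only, not the exact mechanism used in this test; with libvirt/KVM the same effect is typically achieved through the domain’s vCPU pinning and numatune settings.

# Inspect the host's NUMA topology: nodes, their CPUs and local memory.
numactl --hardware

# Start a workload bound to NUMA node 0: it runs only on node 0's CPUs and
# allocates memory only from node 0's local RAM (illustrative command).
numactl --cpunodebind=0 --membind=0 /path/to/workload

# Check where the memory of an already-running process actually lives
# (e.g. the VM's QEMU process; the process name is an assumption).
numastat -p "$(pidof qemu-system-x86_64)"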

The physical host server had the following specifications:

  • Physical Server Configuration:
    • 2x AMD EPYC 7713 64-Core Processors (256 threads in total)
    • 1 TiB of RAM
    • 2x 4 TB SSD storage (a dedicated SSD for each VM) and an NVMe disk for the operating system.

The 1000 monitored nodes transmitted their metrics to the centralized monitoring VMs through 2 bonded 20 Gbps network cards, ensuring that network bandwidth was not a limiting factor during the test. Network utilization was actively monitored to evaluate how each platform managed the heavy data influx.

This setup provided an isolated environment to assess the performance, focusing on resource usage and scalability under identical conditions.

Data Collection

For this test we needed exactly the same data on both systems and per-second resolution. Since Netdata agents export their metrics in the OpenMetrics format, we used 1000 Netdata agents as the data source for both systems. Netdata used its streaming protocol to receive data, while Prometheus scraped metrics exposed in Prometheus format (OpenMetrics).

Some users suggested using Prometheus’s remote-write capability to push metrics to Prometheus. However, we decided against this approach as it would diverge from Prometheus’s default design of scraping metrics. Instead, 1000 Netdata agents served as the data source for both systems.

  • Prometheus scraped metrics from the Netdata agents using the OpenMetrics protocol.
  • Netdata received metrics directly through its streaming protocol.
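For illustration, the Prometheus side of this setup might look roughly like the sketch below. It assumes the agents expose their metrics at Netdata’s documented /api/v1/allmetrics endpoint on port 19999; the host names and the exact interval are placeholders, not the configuration used in this test.

# Verify that an agent exposes Prometheus/OpenMetrics output (placeholder host name):
curl -s 'http://node-0001:19999/api/v1/allmetrics?format=prometheus' | head

# Hypothetical job definition to merge under the scrape_configs: section of prometheus.yml:
cat <<'EOF'
  - job_name: 'netdata'
    scrape_interval: 1s                 # per-second collection, as in this test
    metrics_path: '/api/v1/allmetrics'
    params:
      format: [prometheus]              # ask Netdata for Prometheus-format output
    static_configs:
      - targets: ['node-0001:19999', 'node-0002:19999']   # ... one entry per agent
EOF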

It is important to note that this setup inherently favors Prometheus by limiting the number of data sources. Fewer scrape targets mean less overhead than in typical real-world deployments, where Prometheus often monitors multiple exporters per node, significantly increasing the number of scrape targets.

Netdata, by contrast, is used exactly this way in production, with multiple agents pushing metrics to a centralized Parent using its streaming protocol.
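As a rough sketch of that production pattern, streaming is configured in stream.conf on both sides. The Parent address and API key below are placeholders, not the values used in this test.

# On each child agent: point it to the Parent and give it an API key (placeholders).
cat > /etc/netdata/stream.conf <<'EOF'
[stream]
    enabled = yes
    destination = lab-parent2:19999
    api key = 11111111-2222-3333-4444-555555555555
EOF

# On the Parent: accept children that present this API key.
cat >> /etc/netdata/stream.conf <<'EOF'
[11111111-2222-3333-4444-555555555555]
    enabled = yes
EOF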

Configuration

Both systems were configured with default settings, with the following adjustments made for fairness:

Both systems were configured to use 1 TiB of disk storage for retaining metrics.

For Netdata, the storage was divided into three tiers of ~333 GiB each, leveraging its default tiering mechanism. This feature improves retention efficiency and query performance, although it requires additional resources. Since tiering is a recommended configuration for Netdata, it was kept enabled.

Netdata has two additional features enabled by default that would have been unfair to keep enabled for this comparison: Machine Learning (ML) and Health (alerts).

By default, Netdata agents train Machine Learning (ML) models at the edge. This means ML training occurs at the data-collecting agents, and the resulting ML models, along with anomaly detection information, are propagated to the Netdata Parent. However, due to the scale of this setup and the total resources required, ML was disabled at the 1000 Netdata agents.

If ML had been enabled at the Netdata Parent under test, it would have been responsible for training ML models for all 1000 agents, introducing additional workload that could skew the results. Although ML in Netdata is optimized and lightweight, this additional processing would have unfairly impacted Netdata’s performance metrics. Therefore, ML was also disabled at the Netdata Parent.

Even with ML disabled, the Netdata Parent performs the same underlying work, as the protocol transfers anomaly information regardless of ML training. This is because ML in Netdata adds just a single anomaly bit to the metric samples, which is stored in the database without additional overhead.

Netdata ships with a comprehensive set of predefined alerts, while Prometheus does not include any alerts by default. To ensure a fair comparison, alerting was disabled in Netdata.
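Put together, the Netdata-side adjustments described above amount to a few lines in netdata.conf. The fragment below is indicative only; the exact option names, especially for per-tier disk space, vary between Netdata versions, so consult the configuration reference for the release you run.

# Indicative netdata.conf fragment (option names may differ between versions):
cat <<'EOF'
[db]
    mode = dbengine
    storage tiers = 3        # tier 0 per-second, tier 1 per-minute, tier 2 per-hour
    # ~333 GiB of disk space was allocated to each tier (per-tier size options omitted)

[ml]
    enabled = no             # disabled for this test to keep the comparison fair

[health]
    enabled = no             # Netdata ships with predefined alerts; disabled here
EOF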

No other configuration changes were made to either system. Netdata is known for its scale-out, distributed design using multiple independent Netdata Parents and Parent Clusters which all contribute data to the observability platform. Prometheus is also often used in a distributed fashion (e.g., multiple Prometheus servers, or with Thanos/Cortex/Mimir, etc.). We are also aware that Prometheus allows tuning chunk sizes, compaction intervals and concurrency. However, our intention is to test the default single node ingestion performance without any tuning or specialized configuration on either system.

Both systems were also expected to store full-cardinality data, without any aggregation across metrics or dimensions. In the case of Netdata’s database tiers, the aggregation happens across time, while the high-resolution data is also stored as received. So, both Prometheus and Netdata store the original high-resolution data.

The Comparison

We analyze the performance of Netdata v2.2 and Prometheus v3.1 under a demanding workload of 4.6 million metrics per second collected from 1000 nodes. By examining key resource metrics such as CPU utilization, memory consumption, disk I/O, data retention, and network bandwidth, we aim to provide a clear and objective view of how each system performs at scale.

While both tools handled the workload, their performance characteristics varied significantly, reflecting differences in their architectural design and resource optimization strategies. The following analysis highlights these distinctions and provides insights into the trade-offs between the two systems, helping users better understand which solution may suit their infrastructure needs.

CPU Utilization

The CPU utilization highlights significant differences in how each system handles the large-scale metric ingestion. The CPU usage reported by systemd services is illustrated below:

Image: CPU utilization from the systemd services. 1 CPU core = 100%. Netdata is the blue line. Prometheus is the brown line.

|                           | Netdata v2.2 | Prometheus v3.1 |
|---------------------------|--------------|-----------------|
| average cores             | 9.36         | 14.76           |
| cores per million metrics | 2.03         | 3.21            |

Prometheus exhibited intermittent scraping delays, a behavior also observed in Prometheus v2. During these delays, Prometheus temporarily stopped scraping metrics, resulting in a sharp drop in CPU utilization to 1.1 cores (110%). This pause lasted approximately 10 seconds and was accompanied by a drop in network bandwidth:

Image: Prometheus CPU utilization drops for 10 seconds, indicating a temporary pause in metric scraping.

This CPU drop aligns with a similar dip in network bandwidth, where Prometheus also shows a sharp decline for those 10 seconds:

Image: Network bandwidth for Prometheus drops to near zero during the same 10-second period.

These short freezes lower the average CPU utilization of Prometheus but come at the cost of data loss. During these periods, Prometheus’s internal mechanics prevent it from collecting metrics on time. This could be a critical consideration in environments where continuous metric collection is required.

CPU utilization vs. our previous comparison

  • Netdata: In the latest version (v2.2), Netdata introduces gorilla compression for its database, which slightly increased its overall CPU usage—by about 10%, from 1.8 to 2.03 CPU cores per million metrics per second. However, this increase in CPU consumption is more than offset by a significant reduction in memory usage. As we’ll explore below, Netdata v2.2 is able to handle 4.6 million metrics per second with slightly more memory than Netdata v1.43 needed for just 2.7 million metrics per second. This improvement is possible thanks to advanced compression techniques that allow Netdata to store metric samples more efficiently in memory.

    Additionally, Netdata v2.2 improved its threading system to reduce context switches. Instead of allocating a separate thread per data source, Netdata now uses a thread per CPU core, with each core managing multiple data sources. This provides a smoother experience and makes the system more efficient, especially under heavy loads.

  • Prometheus: While Prometheus v3.1 developers claim significant progress in improving CPU utilization at scale compared to Prometheus v2, our tests show that the CPU requirements have increased by 15%. The system now consumes 3.21 CPU cores per million metrics per second, up from 2.8 cores in Prometheus v2. This additional CPU load can be attributed to new processing tasks in Prometheus v3, such as string interning, a feature that Netdata has long included.

Netdata demonstrated lower CPU usage, averaging 2.03 cores per million metrics per second, compared to Prometheus’s 3.21 cores. Additionally, Netdata’s consistent performance without interruptions contrasts with Prometheus’s intermittent scraping pauses, which could result in data loss in high-scale environments. While Prometheus v3.1 includes enhancements, its CPU requirements have increased compared to earlier versions.

Memory Requirements

Our tests revealed that Prometheus v3.1 requires 500 GiB of RAM to handle this workload, despite claims from its developers that memory efficiency has improved in v3. This represents a significant increase over Prometheus v2, which required half as much memory per million metrics per second.

This substantial memory requirement poses a challenge for scalability, as it becomes increasingly difficult to find servers with sufficient RAM to support larger deployments.

The memory usage reported by CGROUPS memory.stat provides a breakdown of Prometheus’s memory consumption, including categories like anonymous memory, file mappings, and kernel caches. Here is the reported memory usage for both systems:

Image: systemd services memory usage (from memory.stat): Netdata is the blue line, Prometheus is the brown line.

  • Prometheus: Utilizes all 500 GiB of RAM allocated to it, with a significant portion attributed to file mappings.

  • Netdata: Consumes about half of its 200 GiB VM, demonstrating much higher efficiency. As we will see below, the primary driver of Netdata’s memory usage is the operating system’s memory cache.

The detailed CGROUPS breakdown for Prometheus and Netdata highlights this difference further:

Prometheus Memory Breakdown from memory.stat (CGROUPS)

Image: Prometheus systemd service memory breakdown.

Prometheus’s memory usage reaches 500 GiB, with a large portion attributed to file mappings. This indicates that Prometheus relies heavily on memory-mapped files for managing its write-ahead log (WAL) and other storage-related tasks.

Netdata Memory Breakdown from memory.stat (CGROUPS)

Image: Netdata systemd service memory breakdown.

Netdata’s total memory usage is 120 GiB, with much of it used for caching by the operating system. Netdata’s architecture avoids extensive memory mapping, focusing instead on efficient use of available memory.

Memory Usage from memory.current (CGROUPS)

The memory.current metric from CGROUPS provides a snapshot of resident memory usage (RSS)—the actual physical memory in use by the system. This view excludes file mappings and other non-resident memory regions, providing a clearer picture of physical memory usage:

Image: systemd services RAM consumption (from memory.current): Netdata is the blue line, Prometheus is the brown line.

  • Netdata: Peaks at 72.8 GiB of resident memory.

  • Prometheus: Peaks at 207 GiB, which is significantly lower than the 500 GiB reported in memory.stat, but still highlights its high memory usage compared to Netdata.

Process-Level Memory from VmRSS (/proc/{PID}/status)

To further investigate Prometheus’s memory usage, we analyzed the resident set size (RSS) of individual processes, as reported by VmRSS in /proc/{PID}/status. This metric represents the physical memory actively used by a process and excludes non-resident memory and caches.
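For reference, the three memory views used in this section can be read directly on a host with cgroup v2, assuming the services are named netdata.service and prometheus.service (adjust paths and names to your setup):

# Full cgroup breakdown: anonymous memory, file mappings, kernel caches, etc.
cat /sys/fs/cgroup/system.slice/netdata.service/memory.stat
cat /sys/fs/cgroup/system.slice/prometheus.service/memory.stat

# Total memory currently charged to each service's cgroup.
cat /sys/fs/cgroup/system.slice/netdata.service/memory.current
cat /sys/fs/cgroup/system.slice/prometheus.service/memory.current

# Resident set size (physical memory) of each process.
grep VmRSS /proc/"$(pidof prometheus)"/status
grep VmRSS /proc/"$(pidof netdata | awk '{print $1}')"/status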

Image: apps.plugin chart for Netdata (blue) and Prometheus (red) memory usage, excluding shared resources.

Since each application was alone in its VM, all shared memory it allocated should be accounted to it (there is no other process to share it with). VmRSS (Resident Set Size) excludes memory caches and only includes physical memory actively allocated to and directly used by the process. This makes it a more accurate measure of the actual memory requirements of each process.

  • Prometheus: Peaks at 383 GiB of physical memory usage. This aligns with our earlier observations during testing, where Prometheus required 500 GiB of RAM to operate stably. It also explains the OOM events when Prometheus was run with only 400 GiB of RAM.
  • Netdata: Peaks at 47.1 GiB, maintaining a consistent and low memory footprint.

Memory utilization vs. our previous comparison

When we compare this data to our previous tests using apps.plugin, we see some interesting trends:

  • Netdata: At 4.6 million metrics per second, Netdata uses 47.1 GiB, compared to 45.1 GiB when handling 2.7 million metrics per second in the previous comparison. Despite almost doubling the workload, memory usage grew by only 4%, largely thanks to the introduction of Gorilla compression, which significantly optimized Netdata’s memory footprint at the cost of some additional CPU usage.

  • Prometheus: At 4.6 million metrics per second, Prometheus spikes at 383 GiB, compared to 88.8 GiB with 2.7 million metrics per second in the previous test. Prometheus v3.1 has doubled its memory requirements per million metrics per second compared to Prometheus v2, which poses a significant scalability challenge, particularly in large environments.

In summary, Netdata continues to show a massive advantage in memory efficiency, handling large volumes of data with minimal memory usage.

Disk I/O Requirements

When it comes to disk I/O, while both systems need disk throughput to handle large volumes of metrics, the patterns and intensity of their disk usage differ significantly.

Image: systemd services chart for Netdata (blue) and Prometheus (brown) Disk I/O

|               | Average Disk I/O (MiB/s) | Read Rate (MiB/s) | Write Rate (MiB/s) |
|---------------|--------------------------|-------------------|--------------------|
| Netdata v2.2  | 4.7                      | 0.0002            | 4.7                |
| Prometheus v3 | 147.3                    | 67.5              | 79.7               |

As the charts illustrate, Netdata is far more linear in its disk usage. Its disk writes are modest, averaging 4.7 MiB/s. Importantly, Netdata does not frequently read its data files from disk unless required for queries, reducing overall disk activity. This efficient I/O behavior is a result of its continuous rolling database design, where data is written directly to its final storage location.

In contrast, Prometheus exhibits much higher disk activity, with an average of 147.3 MiB/s in total I/O. This includes 67.5 MiB/s of reads and 79.7 MiB/s of writes. Prometheus continuously rearranges its data on disk as part of its Write-Ahead Logging (WAL) and data compaction processes. These processes are necessary for maintaining consistency and durability, but they come at the cost of significantly higher disk usage.

Key Differences:

  • Netdata performs far fewer disk reads and writes, averaging just 4.7 MiB/s with minimal read activity, making it highly efficient when handling large amounts of metrics. The system’s design ensures that data is written directly to its storage without reorganization or reading, leading to less strain on the disk subsystem.

  • Prometheus, on the other hand, has much more intensive disk usage, with an average of 147.3 MiB/s, which is 31 times higher than Netdata’s. Prometheus’s design relies on a Write-Ahead Log (WAL) to ensure reliability and to batch writes efficiently before compacting data into immutable blocks. This process inherently increases disk I/O and typically RAM usage because blocks and index data must be periodically reorganized. For high-frequency ingestion at multi-million series scales, these overheads can become pronounced, as seen in the test. While Prometheus v3.1 may have introduced improvements, it’s clear that compaction and indexing are still resource-intensive.
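The per-service disk I/O shown above can be cross-checked on a cgroup v2 host: io.stat reports cumulative bytes read (rbytes) and written (wbytes) per block device for each service (service names assumed as below):

# Cumulative per-device read/write bytes for each service's cgroup.
cat /sys/fs/cgroup/system.slice/netdata.service/io.stat
cat /sys/fs/cgroup/system.slice/prometheus.service/io.stat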

Network Requirements

Image: Network Traffic for Netdata (blue) and Prometheus (yellow)

|               | Average traffic (Mbit/s) |
|---------------|--------------------------|
| Netdata v2.2  | 448                      |
| Prometheus v3 | 515                      |

This difference in network requirements can be attributed to the distinct protocols used by each system.

Netdata, which uses its own streaming protocol, is designed to be lightweight and optimized for low-latency, real-time data transfer. The protocol focuses on sending compact data packets with minimal overhead, making it efficient for high-frequency metric ingestion without placing a heavy load on the network. This efficiency is further aided by ZSTD compression, which reduces the data size before transmission, leading to lower network consumption overall. Keep in mind that when clustering Netdata Parents, the bandwidth required between them is half of the bandwidth between Netdata children and parents. The reason is that Parents have the ability to combine multiple independent data collections into larger packets, reducing the number of packets transferred and improving compression efficiency.

On the other hand, Prometheus uses the OpenMetrics protocol, which is more “chatty” compared to Netdata’s streaming protocol. This means that Prometheus’s network traffic involves more verbose data packets, with additional metadata included in each scrape request. While Gzip/Deflate compression is used to optimize network usage, it doesn’t fully offset the higher transmission cost due to the verbosity of the OpenMetrics format. As a result, Prometheus ends up consuming more bandwidth during the same data collection process.

Disk Space and Retention

When comparing Netdata v2.2 and Prometheus v3.1, their approaches to disk space usage and data retention highlight key architectural differences. Both systems were configured to use 1 TiB of disk space, but their methods for writing, organizing, and managing data vary significantly.

Image: Disk Space for Netdata (green) and Prometheus (red)

Using the du command, the disk usage for both systems was confirmed as follows:

# du -s -h /opt/prometheus/
705G	/opt/prometheus/

# du -s -h /opt/netdata/
782G	/opt/netdata/
  • Prometheus: Uses 705 GiB of its allocated 1 TiB, but this usage fluctuates significantly over time, ranging from 600 GiB to 1 TiB as it continuously reorganizes its data.
  • Netdata: Uses 782 GiB, with a steady and predictable disk usage pattern. While its current usage is below 1 TiB, this is because its tier 2 storage is not yet fully utilized (as we will see below, it would need roughly another two months to fill up, and the system has been running for only 24 days). Even so, Netdata keeps more data on disk at any given moment.

These differences in disk usage reflect each system’s approach to data management and retention.

Prometheus Disk Usage and Retention

Prometheus appends data to its Write-Ahead Log (WAL) in real-time. Over time, it compacts this data into fixed-duration blocks (typically two-hour blocks) that are written to its database. This process involves:

  1. Real-Time Data: Continuously appending to the WAL.
  2. Compaction: Periodically converting WAL data into blocks, which are stored in the time-series database.
  3. Reorganization: Merging, compacting, and reorganizing blocks over time to optimize storage.

This compaction and reorganization process causes Prometheus’s disk usage to fluctuate between 600 GiB and 1 TiB, with older data being overwritten once the allocated disk space is full.

Retention-wise, Prometheus v3 was able to retain only 2 hours and 16 minutes of per-second data with 1 TiB of storage:

Image: Prometheus' retention on 1 TiB of storage

While Prometheus’s approach is designed for efficient querying and indexing, it results in relatively short retention periods for high-resolution data.

Netdata Disk Usage and Retention

Netdata takes a different approach to data storage. Instead of using a WAL and block compaction, Netdata writes metrics directly to data files in real time. This allows for incremental, consistent disk usage without sudden fluctuations. The process involves:

  1. Direct Writes: Metrics are written directly to storage tiers in their final format.
  2. File Rotation: When storage reaches its allocated capacity, the oldest file is deleted to make room for new data.
  3. Steady Usage: Disk usage remains stable and predictable, growing incrementally as higher storage tiers are filled.

Currently, Netdata’s tier 2 storage is not yet fully utilized, which explains why it hasn’t reached the 1 TiB limit. Retention on Netdata is significantly higher compared to Prometheus, as shown below:

  • Per-second data: Retained for 1 day and 8 hours.
  • Per-minute data: Retained for 9 days and 12 hours.
  • Per-hour data: Retained for more than 2 months and 20 days.

Image: Netdata’s retention on 1 TiB of storage

Netdata’s three-tier storage model enables this extended retention:

  • Tier 0: Stores raw, high-resolution (per-second) metrics.
  • Tier 1: Downsamples the data to per-minute resolution.
  • Tier 2: Downsamples further, to per-hour resolution, for long-term storage.

This tiered architecture balances performance and storage efficiency, allowing Netdata to retain significantly more data for longer periods without frequent reorganization. It also provides significantly faster queries over long time-frames: all tiers are updated concurrently during data collection, so all tiers are available to queries in parallel.

If we extrapolate Netdata’s Tier 0 (333 GiB) to match the 1 TiB of storage allocated to Prometheus, Netdata would retain 4 full days of per-second data—40 times more than Prometheus, which stores just 2 hours and 16 minutes.
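The arithmetic behind this extrapolation is straightforward, assuming retention scales linearly with the disk space given to tier 0:

# 333 GiB of tier 0 held ~1 day and 8 hours (~1.33 days) of per-second data.
# Scaling tier 0 to the full 1 TiB:
echo "scale=2; 1.33 * (1024 / 333)" | bc    # ~4 days of per-second retention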

Netdata’s tiered storage approach provides better long-term retention without sacrificing recent data.

Query Performance (system.ram)

Testing query performance was outside the scope of this test, but we were still very interested to see how both systems perform.

So, we tried a simple test:

Query the average of system.ram across all nodes over the last 2 hours, grouped by dimension (4 dimensions in total), returning 120 per-minute points.

This query is the equivalent of asking “what is the average memory utilization of my 1000 nodes?”, and it is big: with 1000 nodes, it results in 4,000 time-series being queried.

Prometheus

# for x in {1..10}; do time curl -sS 'http://10.10.11.21:9090/api/v1/query' --data-urlencode 'query=avg_over_time(netdata_system_ram{dimension=~".+"}[2h:60s])' --data-urlencode 'time='"$(date +%s)" >/dev/null; done
0.01s user 0.00s system 0% cpu 1.789 total
0.00s user 0.00s system 0% cpu 1.825 total
0.00s user 0.00s system 0% cpu 1.784 total
0.00s user 0.01s system 0% cpu 1.792 total
0.00s user 0.00s system 0% cpu 1.839 total
0.00s user 0.01s system 0% cpu 1.876 total
0.00s user 0.00s system 0% cpu 1.834 total
0.00s user 0.01s system 0% cpu 1.831 total
0.00s user 0.00s system 0% cpu 1.877 total
0.00s user 0.00s system 0% cpu 1.829 total

Prometheus needs, on average, 1.83 seconds for this query.

Netdata

# for x in {1..10}; do time curl -sS 'http://lab-parent2:19999/api/v3/data?scope_contexts=system.ram&before=0&after=-7200&points=120' >/dev/null ; done
0.00s user 0.00s system 4% cpu 0.116 total
0.00s user 0.00s system 4% cpu 0.106 total
0.00s user 0.01s system 5% cpu 0.110 total
0.01s user 0.00s system 5% cpu 0.105 total
0.00s user 0.00s system 4% cpu 0.102 total
0.00s user 0.00s system 5% cpu 0.097 total
0.00s user 0.00s system 5% cpu 0.099 total
0.00s user 0.00s system 5% cpu 0.099 total
0.00s user 0.01s system 4% cpu 0.126 total
0.00s user 0.00s system 4% cpu 0.103 total

Netdata averages 0.11 seconds for this query.

Of course, Netdata automatically used its tier 1 (per-minute) data, since that tier fully satisfies this query. So, we also tried selecting tier 0 (per-second) explicitly:

# for x in {1..10}; do time curl -sS 'http://lab-parent2:19999/api/v3/data?scope_contexts=system.ram&before=0&after=-7200&points=120&tier=0' >/dev/null ; done
0.00s user 0.00s system 0% cpu 1.066 total
0.00s user 0.01s system 0% cpu 1.058 total
0.00s user 0.01s system 0% cpu 1.045 total
0.00s user 0.01s system 0% cpu 1.047 total
0.01s user 0.00s system 0% cpu 1.053 total
0.00s user 0.01s system 0% cpu 1.054 total
0.00s user 0.00s system 0% cpu 1.051 total
0.01s user 0.00s system 0% cpu 1.045 total
0.01s user 0.00s system 0% cpu 1.042 total
0.01s user 0.00s system 0% cpu 1.035 total

Netdata averages 1.05 seconds when using tier 0 data.

So, when querying high resolution data Netdata appears to be 43% faster than Prometheus (or Prometheus to be 74% slower than Netdata).

Concurrent Queries

We also tested running the same queries in parallel instead of sequentially (by simply adding & at the end of each curl command; see the sketch after the findings below). This ran 10 queries in parallel on each system.

Here are our findings:

  • Prometheus timings increased by 15%, from 1.83 to 2.1s.
  • Netdata timings increased by 9%, from 1.05s to 1.15s.
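For completeness, the parallel variant is a small change to the loops used above; a sketch for the Netdata tier-0 query (same host as before):

# Launch 10 identical queries in the background and wait for all of them to finish.
for x in {1..10}; do
  time curl -sS 'http://lab-parent2:19999/api/v3/data?scope_contexts=system.ram&before=0&after=-7200&points=120&tier=0' >/dev/null &
done
wait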

Query Performance (cgroup.mem)

We could not resist trying cgroup.mem, the memory metric of the monitored containers. With 80k containers, each with 6 dimensions, this results in about 480k time-series to be queried.

This is the equivalent of asking “What is the average memory utilization of my 80k containers?”.

Let’s put this into perspective: querying 480k time-series over the last 2 hours at per-second resolution (7,200 samples per time-series) evaluates close to 3.5 billion samples.

Let’s see how both systems perform at this scale.

Prometheus

# for x in {1..3}; do time curl -sS 'http://10.10.11.21:9090/api/v1/query' --data-urlencode 'query=avg_over_time(netdata_cgroup_mem{dimension=~".+"}[2h:60s])' --data-urlencode 'time='"$(date +%s)" --data-urlencode 'timeout=1200s' >/dev/null; done
0.00s user 0.01s system 0% cpu 3:46.45 total
0.01s user 0.02s system 0% cpu 3:39.11 total
0.01s user 0.00s system 0% cpu 3:33.16 total

Prometheus needs about 3.7 minutes to complete this query.

To verify the number of samples Prometheus evaluates, we ran the above query without the subquery (without :60s), and Prometheus logged this in its query.log:

{
  "time": "2025-01-27T17:46:30.002806966Z",
  "level": "INFO",
  "source": "file.go:64",
  "msg": "promql query logged",
  "params": {
    "end": "2025-01-27T17:42:09.000Z",
    "query": "avg_over_time(netdata_cgroup_mem{dimension=~\".+\"}[2h])",
    "start": "2025-01-27T17:42:09.000Z",
    "step": 0
  },
  "stats": {
    "timings": {
      "evalTotalTime": 260.230298718,
      "resultSortTime": 0,
      "queryPreparationTime": 1.20294712,
      "innerEvalTime": 258.985866302,
      "execQueueTime": 0.000207658,
      "execTotalTime": 260.230405067
    },
    "samples": {
      "totalQueryableSamples": 3105070136,
      "peakSamples": 468322
    }
  },
  "spanID": "0000000000000000",
  "httpRequest": {
    "clientIP": "10.20.4.205",
    "method": "POST",
    "path": "/api/v1/query"
  }
}

So, Prometheus evaluated 3,105,070,136 samples (3.1 billion samples).
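For anyone reproducing this, Prometheus does not write a query log by default; it is enabled through the query_log_file setting in the global section of prometheus.yml, followed by a configuration reload. A hedged sketch (the log path is a placeholder):

# Add to the global: section of prometheus.yml:
#   global:
#     query_log_file: /opt/prometheus/query.log

# Then reload the running Prometheus without restarting it:
kill -HUP "$(pidof prometheus)"
# or, if Prometheus was started with --web.enable-lifecycle:
# curl -X POST http://10.10.11.21:9090/-/reload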

Netdata

# for x in {1..3}; do time curl -sS 'http://lab-parent2:19999/api/v3/data?scope_contexts=cgroup.mem&before=0&after=-7200&points=120' >/dev/null ; done
curl -sS  > /dev/null  0.00s user 0.01s system 0% cpu 9.987 total
curl -sS  > /dev/null  0.00s user 0.01s system 0% cpu 9.488 total
curl -sS  > /dev/null  0.00s user 0.01s system 0% cpu 9.432 total

Netdata managed to finish this query in about 9.6 seconds, using its tier 1 (per minute) storage.

And here is the same running against tier 0:

# for x in {1..3}; do time curl -sS 'http://lab-parent2:19999/api/v3/data?scope_contexts=cgroup.mem&before=0&after=-7200&points=120&tier=0' >/dev/null ; done 
0.00s user 0.02s system 0% cpu 1:29.94 total
0.01s user 0.01s system 0% cpu 1:27.71 total
0.01s user 0.00s system 0% cpu 1:27.69 total

We ran the same query again, and the response returned by Netdata includes statistics about the processing performed:

    "per_tier": [
      {
        "tier": 0,
        "queries": 461766,
        "points": 3324690162,
        "update_every": 1,
        "first_entry": 1737902489,
        "last_entry": 1737999349
      },
      {
        "tier": 1,
        "queries": 0,
        "points": 0,
        "update_every": 60,
        "first_entry": 1737278340,
        "last_entry": 1737999300
      },
      {
        "tier": 2,
        "queries": 0,
        "points": 0,
        "update_every": 3600,
        "first_entry": 1735563600,
        "last_entry": 1737997200
      }
    ],
    "timings": {
      "prep_ms": 830.285,
      "query_ms": 86141.256,
      "output_ms": 789.156,
      "total_ms": 87760.697,
      "cloud_ms": 87760.697
    }

So Netdata queried 462k time-series in tier 0 and aggregated 3,324,690,162 samples, in 87,760 ms (1 minute and 28 seconds), at a rate of about 38 million samples/s. This is also the single-thread query performance of Netdata, when data are cached in memory.

The above comparison also showcases the query optimization multiple tiers provide when querying longer time-frames. Query responses on higher tiers are 9 times faster compared to lower tiers (or 22 times faster compared to Prometheus). Netdata automatically selects the right tier based on query parameters, making this optimization transparent for users.

Given that Netdata found 3.32 billion samples while Prometheus found only 3.1 billion, Prometheus did not actually scrape 458 of the 7,200 seconds in the last 2 hours, missing 6.3% of the data it was expected to collect.

Restart Delay

Since we had to restart the systems several times, we found out that:

  • Netdata at this scale restarts in 24 minutes (time for the last node to finish replication and start streaming fresh data to the Netdata Parent). During startup Netdata verifies all data files, calculates available retention for all tiers, indexes and populates all metadata, and then waits for Netdata agents to connect to it. When Netdata children connect, they replicate the data missed while the Parent was offline and finally they start streaming fresh data. The entire process completed in 24 minutes for all 1000 nodes and once finished the Parent does not have any missing data, even for the duration it was offline.

  • Prometheus at this scale restarts in 44 minutes. Most of this time is spent replaying its WAL. The OpenMetrics protocol does not provide the ability to retrieve past data from the exporters, so data for the period the Prometheus server was offline is missing.

Summary

For users looking to choose the right monitoring tool, this comparison highlights the core strengths and trade-offs between Netdata and Prometheus:

  • Netdata demonstrates significant efficiency, scalability, and retention advantages, handling high workloads with significantly lower resource consumption. It’s an ideal choice for organizations prioritizing fully automated, real-time monitoring and long-term retention. Netdata’s efficiency can also translate to significant cost savings in cloud environments.

  • Prometheus, with its robust ecosystem and mature query capabilities, offers flexibility and deep integration options but requires substantially more resources for comparable workloads.

When choosing between Netdata and Prometheus, it’s essential to consider their distinct focus and target audiences:

  • Netdata is an operations-first monitoring tool, designed to empower operations teams with machine learning for anomaly detection, fully automated dashboards, powerful alerting, and specialized tools for troubleshooting and logs analysis, all with real-time granularity. It requires minimal tuning, simplifying day-to-day operations. Netdata is designed for monitoring large infrastructures at scale, and it provides unparalleled cost efficiency by keeping data close to the edge and utilizing resources that are already available and spare.

  • Prometheus, by contrast, is a developer-first observability framework, offering the flexibility to build custom solutions. It’s ideal for teams with the resources to design tailored monitoring systems, leveraging Prometheus’s ecosystem, integrations, and its query language for in-depth customization. But if longer high-resolution retention is a must, operational complexity goes up significantly (e.g., Thanos, Cortex, or Mimir).

The table below summarizes the performance and resource metrics, giving a clear overview to help you make an informed decision based on your infrastructure needs.

|                                                  | Netdata Parent | Prometheus |
|--------------------------------------------------|----------------|------------|
| Version | v2.1.0-198-gc7a58fe37 (nightly of Jan 18, 2025) | 3.1.0 (branch: HEAD, revision: 7086161a93b262aa0949dbf2aba15a5a7b13e0a3) |
| Configuration (changes to the defaults) | 1 TiB storage in 3 tiers; ML disabled; Health disabled | 1 TiB storage; per-second data collection |
| Hardware (VMs on the same physical server) | 52 CPU cores, 200 GB RAM, 4 TB SSD | 52 CPU cores, 500 GB RAM, 4 TB SSD |
| Metrics collected (approximately, concurrently) | 4.6 million per second | 4.6 million per second |
| CPU Utilization (average) | 9.36 CPU cores (-37%) | 14.76 CPU cores (+58%) |
| Memory Consumption (peak) | 47.1 GiB (-88%) | 383 GiB (+813%) |
| Network Bandwidth | 448 Mbps (-13%) | 515 Mbps (+15%) |
| Disk I/O | no reads, 4.7 MiB/s writes (-97%) | 67.5 MiB/s reads, 79.7 MiB/s writes (+3134%) |
| Disk Footprint | 1 TiB | 1 TiB |
| Metrics Retention | 1.25 days (per-sec), 9.5 days (per-min), 80+ days (per-hour); +4000% or 40x (per-sec) | 2 hours and 16 minutes (per-sec); -97.5% (per-sec) |
| Unique time-series on disk | 5.5 million | 4.6 million |
| Bytes per sample on disk (per-sec tier) | 0.77 bytes/sample (-70% or 3.4x) | 2.6 bytes/sample (+337%) |
| Query Performance (per-sec tier) | 43% faster than Prometheus (22x faster when using higher tiers) | 74% slower than Netdata |
| Collection Accuracy | 100% | 93.7% (missed 6.3% of the expected samples) |
| Potential data loss (network issues, maintenance, etc.) | No (missing samples are replicated from the source on reconnection) | Yes (missing samples are filled from adjacent ones at query time) |
| Clustering | Yes (active-active Parents) | No (not natively; possible with more tools) |

In this comparison, we’ve analyzed Netdata v2.2 and Prometheus v3.1 across several key performance metrics, including CPU utilization, memory consumption, disk I/O, network bandwidth, data retention, and storage efficiency. While Netdata consistently demonstrated superior efficiency, particularly in resource usage and retention capabilities, it’s important to acknowledge that Prometheus is a mature tool with a large and vibrant community.

Ultimately, both systems serve different needs and have their own strengths, and we do not take the results of this comparison as a dismissal of Prometheus’s capabilities. Instead, we aim to provide users with a clear understanding of where each tool excels, allowing them to make an informed decision based on their infrastructure requirements.

From an observability standpoint, the test underscores that “best tool” depends strongly on the specific needs around data granularity, retention duration, and organizational tolerance for hardware costs and complexity. Netdata’s advances in compression, lower overhead, and tiered retention show it can handle large environments extremely efficiently. Prometheus remains a powerful, well-established solution with extensive community support and a rich ecosystem (which of course Netdata is also using), but in high-scale scenarios, it may call for significantly higher investments or architectural workarounds. Of course, advanced Prometheus setups using Thanos or custom logic can mitigate some of the issues we observed, albeit at higher complexity.

Both tools are valuable depending on your goals:

  • Operational teams often prioritize out-of-the-box efficiency, minimal resource footprints, and immediate insight with real-time granularity—where Netdata shines.

  • Developer-focused teams that need more customization, broad integration, and advanced query logic may accept Prometheus’s resource overhead because it natively fits into their existing open-source tooling.

As always with observability, there’s no one-size-fits-all solution. Careful evaluation of your scale, retention needs, and infrastructure budget will guide you to the right choice—or a hybrid architecture that leverages the strengths of both.

We’re proud of Netdata’s advancements and will continue to innovate, but we also respect and appreciate the important work being done by other teams. The world of monitoring tools is diverse and constantly evolving, and our goal is always to provide the best solution for users, regardless of the competition.

Disclaimers

  1. Single-instance approach: Both Netdata and Prometheus were tested in a single-node configuration. While this showcases raw ingestion capacity and default behaviors, many real-world deployments of each tool use multi-node or distributed architectures.

  2. Default configurations: We used each tool “out of the box,” with minimal tuning or specialized settings. Advanced tuning could yield different performance outcomes for both systems.

  3. No horizontal scaling: Neither system was clustered or scaled out during these tests. Netdata has native support for horizontal scaling, and Prometheus can be used with tools like Thanos/Cortex/Mimir to distribute workloads differently.

  4. Very high ingestion rates: Our test environment ingests 4.6 million metrics per second, an unusually large volume for most observability platforms. For Netdata, however, multi-million samples/s ingestion per Netdata Parent is quite a common setup when monitoring large infrastructures.

Discover More