Accurate Process Monitoring with Netdata

Understand why tracking cumulative resource consumption is crucial for accurate process monitoring.

Accurate Process Monitoring with Netdata: Why Tracking Cumulative Resource Consumption Matters

Tracking the cumulative resource consumption of processes, including short-lived and exited children, is a rare feature in monitoring tools – and it’s one of the standout capabilities Netdata offers.

Most mainstream solutions, like Datadog’s process monitoring and Prometheus’s Node Exporter, focus on active processes and only collect metrics per PID. Even specialized process monitors (top, htop, etc.) face the same limitations. They capture snapshots of currently running processes and often rely on fixed sampling intervals, which are too slow to catch very short-lived tasks. This approach falls short when trying to capture the resource footprint of dynamic applications, shell scripts, and complex process hierarchies.

Let’s break down some approaches in current solutions and see why they miss the mark.

Sampling Intervals: Increasing the frequency of data collection can occasionally capture short-lived processes, but it’s inconsistent, especially for very brief processes (like those in shell scripts) that start and exit between collection intervals. Even with one-second sampling, commands that finish within milliseconds, like cat, sed, or awk, may never be seen.

Kernel-Based Monitoring Solutions: eBPF-based tools (such as BCC) can, in principle, track every system call and process event, including those from short-lived children. However, they require custom configuration to capture cumulative CPU usage across child processes, and they lack native aggregation over time, so users must piece together their own solution.

Custom Prometheus Exporters with Node Exporter: Node Exporter collects system-level metrics but lacks the capability to aggregate CPU usage for exited children or frequently exiting processes, such as those in shell scripts. Custom exporters could theoretically help but are still limited by collection intervals and miss cumulative aggregation for exited children.

Advanced APMs like Dynatrace and New Relic: These solutions focus on application-level resource tracking, often for specific stacks (Java, .NET), but don’t natively support comprehensive process hierarchies or cumulative child process aggregation for system-wide tracking. They’re effective for managed applications but fall short for unstructured or shell-script-based tasks.

Utilities like psacct and acct: These Linux utilities capture resource usage of exited processes but are built for system auditing, not real-time monitoring. They don’t offer real-time dashboards or easy-to-access aggregated metrics and aren’t suitable for scaling across distributed environments.

Based on current industry capabilities, no monitoring solution can accurately analyze the total CPU consumption of a system across all applications and process groups that actually use it – except Netdata.

Let’s explore why cumulative consumption of exited processes matters and how Netdata’s apps.plugin captures the full picture.

Cumulative Consumption of Exited Processes

Consider this example:

#!/bin/bash
cat file | awk '...' | sed '...'

This simple script creates the following process tree:

bash my-script.sh  # Parent Process
    ├── cat (exits in 2ms)
    ├── awk (exits in 3ms)
    └── sed (exits in 5ms)

Each child process here plays a role in transforming or handling data, but they exit almost instantly. Monitoring tools focused on per-PID metrics would likely miss these processes and give no indication they ever ran. This is where cumulative resource tracking becomes essential.
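To see why a per-PID sampler tends to miss such processes, here is a minimal sketch that models snapshots taken at fixed intervals. The durations come from the process tree above; the function name, the start offset, and the one-second interval are illustrative assumptions, not measurements from any real tool.

```python
def seen_by_sampler(start_ms, duration_ms, interval_ms):
    """True if a snapshot taken at t = 0, interval, 2*interval, ... lands
    inside the process lifetime [start_ms, start_ms + duration_ms)."""
    # First snapshot at or after the process start (integer ceiling division).
    first_snapshot_ms = -(-start_ms // interval_ms) * interval_ms
    return first_snapshot_ms < start_ms + duration_ms


# Durations (ms) taken from the process tree above.
commands = {"cat": 2, "awk": 3, "sed": 5}
START_MS = 250      # hypothetical: the pipeline starts 250 ms after a snapshot
INTERVAL_MS = 1000  # one-second sampling

seen = {name: seen_by_sampler(START_MS, d, INTERVAL_MS) for name, d in commands.items()}

# Chance of catching each command if its start time were uniformly random.
catch_chance = {name: min(1.0, d / INTERVAL_MS) for name, d in commands.items()}

print(seen)
print(catch_chance)
```

Under these assumptions none of the three commands is ever observed, and even with a random start time each one has well under a 1% chance of overlapping a snapshot.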

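On Linux and FreeBSD, when a process exits and its parent collects its exit status, the kernel adds the child’s total CPU time and page faults to the parent’s counters for exited children (on Linux, the cutime, cstime, cminflt, and cmajflt fields of /proc/<pid>/stat). In this example, each child process reports its resource usage to bash.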

But it gets tricky…

If bash itself exits, its own total CPU and page faults, together with the CPU and page faults of all its exited children, are reported back to the process that ran bash.
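This propagation can be observed from user space. The sketch below is not how apps.plugin collects data; it only uses the standard getrusage(RUSAGE_CHILDREN) interface (Unix-only, via Python’s resource module) to show that CPU burned by a grandchild is reported to us once the intermediate child has waited for it and exited. The workload and helper name are made up for illustration.

```python
import resource
import subprocess
import sys

# Hypothetical CPU-burning workload run by the grandchild.
BURN = "sum(i * i for i in range(3_000_000))"


def children_cpu_seconds():
    """User + system CPU of all descendants that exited and were waited for."""
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return usage.ru_utime + usage.ru_stime


before = children_cpu_seconds()

# The child spawns the grandchild, waits for it, then exits itself.
child_code = (
    "import subprocess, sys; "
    f"subprocess.run([sys.executable, '-c', {BURN!r}], check=True)"
)
subprocess.run([sys.executable, "-c", child_code], check=True)

after = children_cpu_seconds()
print(f"CPU reported back by exited descendants: {after - before:.3f}s")
```

Neither the child nor the grandchild exists anymore when the second reading is taken, yet their CPU time shows up in our own counters, which is the same mechanism that makes bash accountable for cat, awk, and sed.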

Let’s now add time into the equation. When monitoring processes over time, you can only use this information if you account for how much of each exited process’s usage has already been counted in earlier iterations, so that only the remainder is added to the parent.

In our example, if bash was running for 1 minute and we monitor it once every 10 seconds, most of its total CPU and page faults have already been reported by the time it exits. To find the fraction that remains, we need to keep track of all processes over time and do some math between iterations, so that only the usage since the last iteration is accumulated to the parent when bash exits.

As scripts or applications spawn additional commands or scripts, which themselves spawn children, the complexity increases. Tracking cumulative resource consumption across parent-child relationships, especially for processes that span multiple intervals, requires precise accounting.
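The arithmetic can be illustrated with a small worked example. This is a simplified sketch, not Netdata’s actual implementation: the function name and all the millisecond figures are invented, and a real collector has to do this for every process and every counter on each iteration.

```python
def remainder_on_exit(final_total_ms, last_sampled_total_ms):
    """CPU of an exited process that earlier iterations have not counted yet."""
    return max(0, final_total_ms - last_sampled_total_ms)


# Hypothetical cumulative CPU (ms) seen for bash at t = 10s, 20s, ..., 50s.
samples_ms = [800, 1500, 2300, 3100, 4000]
# Hypothetical total that the kernel hands to the parent when bash exits
# shortly before the t = 60s iteration.
final_total_ms = 4600

# What the monitor already charted: the per-iteration increases.
deltas_ms = [b - a for a, b in zip([0] + samples_ms, samples_ms)]
already_counted_ms = sum(deltas_ms)

# Only this part may be added to the parent, otherwise CPU is double-counted.
remainder_ms = remainder_on_exit(final_total_ms, samples_ms[-1])

print(deltas_ms, already_counted_ms, remainder_ms)
```

With these numbers, 4000 ms were already charted across the five iterations, so only the remaining 600 ms are credited to the parent; adding the full 4600 ms would count most of the work twice.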

Why Monitoring by PID Alone Falls Short

For many use cases, per-PID monitoring is limited and impractical.

As a user, I need to know the total resource consumption of the things I run. I am not really interested in the resource consumption of each cat or sed command, unless I have a way to aggregate all of these into something that has meaning for me. Tracking PID and PPID (Parent Process ID) is a solution only if, in the end, I get proper aggregation on what I consider important.

Furthermore, cardinality explodes. Many small, short-lived commands, spawned thousands of times within a short period, just add noise to a problem that is already hard to solve.

And if we add the cumulative utilization of exited children into the picture, things get really messy, since suddenly we have utilization that is not directly related to each PID, but refers to its child processes, which ran as different PIDs and probably under different UIDs and GIDs.

By focusing solely on PIDs, monitoring tools can’t accurately capture the resource usage of tasks involving frequently exiting children, rendering PID-based monitoring ineffective for these cases.

Netdata’s `apps.plugin`: Aggregating CPU Consumption Across Process Trees

Netdata’s apps.plugin was designed to address these limitations from the beginning. With its configuration file apps_groups.conf, users could define processes of interest to track and aggregate, including the cumulative utilization of their exited children.

Maintaining a comprehensive stock apps_groups.conf has always been a challenge. We tagged hundreds of processes, but as usual with Netdata, users expected things to work out of the box, and they were missing this little detail: they needed to define their own processes of interest.

In Netdata v2 we decided to change that. So, instead of aggregating per PID or “processes of interest”, we decided that every “top” process in the process tree is important and needs to be monitored.

To decide which processes are the “top” ones, we take a simple approach:

All processes with a PPID of 1
All processes that have been spawned by a process manager
All processes that have been tagged by apps_groups.conf
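The three rules above can be sketched as a walk up the process tree. This is only an illustration under stated assumptions: the process table, the manager and tag sets, and the choice of the nearest matching ancestor are all hypothetical, and the sketch sums CPU of live processes only, whereas apps.plugin also folds in the usage of exited children.

```python
from collections import defaultdict

# Hypothetical process table: pid -> (ppid, name, cpu_ms).
PROCESSES = {
    1:   (0,   "systemd",   50),
    100: (1,   "cron",      5),
    200: (100, "backup.sh", 20),
    201: (200, "tar",       900),
    202: (200, "gzip",      700),
    300: (1,   "nginx",     300),
    301: (300, "nginx",     1200),
    400: (1,   "sshd",      10),
    401: (400, "bash",      15),
    402: (401, "rsync",     400),
}
PROCESS_MANAGERS = {"systemd", "cron"}  # assumed list of process managers
TAGGED = {"rsync"}                      # assumed to be tagged in apps_groups.conf


def is_top(pid):
    """Apply the three rules: PPID 1, spawned by a manager, or tagged."""
    ppid, name, _ = PROCESSES[pid]
    parent = PROCESSES.get(ppid)
    spawned_by_manager = parent is not None and parent[1] in PROCESS_MANAGERS
    return ppid == 1 or spawned_by_manager or name in TAGGED


def top_of(pid):
    """Nearest ancestor (or the process itself) that qualifies as "top"."""
    current = pid
    while not is_top(current):
        ppid = PROCESSES[current][0]
        if ppid not in PROCESSES:
            return current  # root of the tree (e.g. PID 1) stands for itself
        current = ppid
    return current


totals_ms = defaultdict(int)
for pid, (_, _, cpu_ms) in PROCESSES.items():
    totals_ms[PROCESSES[top_of(pid)][1]] += cpu_ms

print(dict(totals_ms))
```

In this made-up table, tar and gzip are charged to backup.sh because cron is a process manager, while rsync stands on its own because it is tagged, even though it runs under an sshd session.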

So, we added the ability for users to configure process managers. This is a simpler task to solve, since there are only a few process managers and the stock configuration shipped with Netdata already includes most of them.

Processing by tree seems more natural for aggregating the cumulative utilization of the exited children and provides a result that is closer to “the things I run” as a user.

In this screenshot we can clearly see the aggregated CPU utilization of all cron jobs and backup scripts, which otherwise would be impossible to track!

Conclusion: Accurate Process Monitoring Made Easy with Netdata

The way we monitor processes matters. Traditional PID-based monitoring can leave massive gaps in visibility, especially when tracking applications with short-lived processes or complex hierarchies. Netdata’s approach with apps.plugin fills these gaps by focusing on the entire process tree and aggregating cumulative resource usage – including exited child processes.

Netdata provides users with meaningful, grouped insights into the resource consumption of their applications and scripts. Instead of cluttered dashboards filled with individual cat or awk processes, you get a clear, aggregated view of CPU, memory, and other resources by application. This approach lets you monitor “the things you run” with accuracy, precision, and minimal setup.

If you’ve been frustrated with the limitations of PID-based monitoring, give Netdata a try. It’s free, open-source, and brings visibility and context to system resource usage in a way that’s intuitive, powerful, and built for real-world needs.

Costa Tsaousis
