At Netdata, our ultimate goal is to make it simpler and easier for everyone to understand and manage the technology that runs the world.
To accomplish this, we innovate across all aspects of monitoring and troubleshooting, experimenting to find the mix of attributes and features that empowers organizations of all sizes with real-time insights, actionable data, and unparalleled visibility and control over their IT infrastructure.
Our guiding principles are:
- Accessibility to monitoring: Ensure everyone has access to monitoring:
  - Zero configuration and zero preparation: Utilize auto-detection whenever possible, with fully automated dashboarding and alerting for packaged applications and operating systems.
  - Minimal learning curve: So that everyone can use our ecosystem with minimal prior experience, irrespective of skill level.
  - Cost efficiency: Minimal total cost of ownership is a central tenet of our design, ensuring that the cost of the monitoring infrastructure represents only a fraction of the overall cost of the technology being monitored.
  - Open and free, forever, for everyone: Open source as much as possible, and build a business model that can sustain offering all monitoring features free of charge, forever, to everyone, so that the world can benefit from our work even when paying for monitoring is unreasonable.
- High fidelity insights: Infrastructure today is mostly virtualized. In such an environment, past common practices are inadequate, and high resolution data is mandatory for revealing the inefficiencies this virtualized world introduces. Our solution should scale while collecting everything from every possible source, at the highest possible resolution and in real time, as a standard.
- Empower through collective wisdom: Gather, aggregate and re-share all community knowledge, experience and best practices.
- Utilize cutting edge technology: Apply A.I. and machine learning to simplify troubleshooting.
- Infinite scalability through a simple modular design: Allow our ecosystem to thrive in infrastructures of all sizes, from tiny lab IoT devices to complex production environments of extreme scale.
Visualization, Dashboards and Troubleshooting
One of the key goals of Netdata is to eliminate the need for users to learn a new query language, or to understand the layout of the underlying metrics, in order to create dashboards. The metadata we attach to metrics, combined with a powerful UI, should allow users to filter, slice and dice the data the way they see fit, directly from the prebuilt dashboards, without dealing with the internal representation of metrics or the exact query parameters. We have already made a lot of progress on this; however, there are a few issues we still need to address:
- Many users find the prebuilt dashboards of Netdata overwhelming: a lot of information, presented flat, in an infinitely scrolling dashboard whose charts all look very similar to each other. We believe we can simplify the default dashboards for these users by providing summary mini-dashboards on each section, giving the user an overview of what information is available.
- Discoverability of metrics needs to be improved, especially since we collect many thousands of them, so that users can quickly find the metrics, charts and sections they are interested in.
- Customizability of dashboards can also be improved drastically. Users should be able to create custom dashboards on the fly, drag and drop charts into existing and new dashboards while troubleshooting, and customize the TOC of the prebuilt dashboards.
- More powerful filtering, slicing and dicing. The basic functionality is already implemented in Netdata Cloud dashboards, but we believe we can significantly improve the user experience.
- More chart types, such as histograms, heat maps, tables and single-number charts, plus the ability to drill down in a single view.
- Allow users to annotate charts with system events from the feed (service restarts, backups, maintenance windows, etc.) or even user-defined annotations (deployment timestamps, etc.).
Another important aspect is helping users think critically about the charts and metrics they see. We can do this by providing additional context that helps them reason about what a chart shows, and whether it is good, bad, or requires action:
- Overlay the anomaly rate. We store the anomaly rate of metrics in the database, and we can query past data for the anomaly rate at the time it was collected. Automatically overlaying the anomaly rate on charts will let users treat the ML engine as an advisor on every single chart (see the query sketch after this list).
- Overlay alert info. Netdata alerts are already correlated to charts, but this information is not yet visualized on the dashboards.
- Overlay trend information. Automatically compare the visible time frame of a chart with the same time yesterday, last week, or the peak and bottom of the day, allowing users to quickly evaluate what they see on charts against past data.
- Use the highlighted time frame on charts to provide statistical information about that time frame.
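As a sketch of what such an overlay would query: the agent's data API can return either the raw values or the stored anomaly rate for the same window. This assumes a local agent on the default port 19999 and the system.cpu chart; the anomaly-bit option is how the anomaly rate recorded at collection time is requested.

```sh
# Raw values for the last 60 seconds of system.cpu, as JSON:
curl 'http://localhost:19999/api/v1/data?chart=system.cpu&after=-60&format=json'

# The same window, but returning the per-point anomaly rate the ML engine
# stored at collection time, instead of the raw values:
curl 'http://localhost:19999/api/v1/data?chart=system.cpu&after=-60&format=json&options=anomaly-bit'
```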
For almost all monitoring solutions today, troubleshooting is a process of validating assumptions: to troubleshoot an issue, users have to speculate on what could be wrong and, for each assumption, go through a process to validate or discard it.
We have already made some progress on improving this by implementing “Metrics Correlations”, a feature that scans all the metrics in Netdata databases to find how they correlate with each other over a given time frame, and “Anomaly Advisor”, a feature that uses machine learning to identify whether metrics behave anomalously, and which metrics behave anomalously at the same time. Both features provide an ordered list of metrics, scored by their correlation with the anomaly in question, offering all the data for that much-needed aha moment.
By iterating on and improving the mechanics and user experience of these features, we believe we can flip the process and turn troubleshooting and root cause analysis into a data-driven approach, instead of the long, painful and mostly inaccurate process of validating assumptions one by one.
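A rough sketch of the underlying query, assuming the agent's metric correlations endpoint and its window parameters: a highlighted window is compared against a baseline window, and the response scores every metric by how differently it behaved. The timestamps below are illustrative unix epochs.

```sh
# Score all metrics by how much the highlighted window (after..before)
# diverges from the baseline window (baseline_after..baseline_before):
curl 'http://localhost:19999/api/v1/metric_correlations?baseline_after=1700000000&baseline_before=1700000300&after=1700000300&before=1700000600'
```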
Netdata Architecture
At the heart of the Netdata architecture is the open-source database engine (DBENGINE), which stores all the data Netdata collects. This database has proven to scale linearly, running efficiently everywhere from tiny environments (like IoT) to large centralization points managing hundreds of GB of data and millions of concurrently collected metrics.
Netdata Cloud utilizes Netdata agents as a distributed database, combining their collective power into a dynamic, powerful and infinitely scalable database for monitoring. This decentralization, and the collective utilization of resources that are already available and to spare, allows Netdata to scale far better than other monitoring solutions, at a fraction of the cost.
Netdata supports centralization points (Netdata Parents, using the streaming feature) for high availability and for ephemeral setups like Kubernetes. But unlike other solutions, Netdata does not require a single centralization point: users can have as many as needed, and Netdata Cloud seamlessly combines all available centralization points into one distributed database. Recently this functionality has been improved significantly with the addition of active-active clustering and replication.
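As an illustration of how a centralization point is wired today, a minimal stream.conf sketch; the hostname and the API key (a placeholder UUID) are examples only.

```
# On the child (the node shipping its metrics):
[stream]
    enabled = yes
    destination = parent.example.com:19999
    api key = 11111111-2222-3333-4444-555555555555

# On the parent (the centralization point), accept that key:
[11111111-2222-3333-4444-555555555555]
    enabled = yes
```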
There is, however, more room for improvement in the following areas:
- Provide comprehensive deployment guides to help users roll out Netdata under different setups.
- Switch streaming to a binary protocol, to lower bandwidth requirements between Netdata agents and improve the scalability of Parents.
- Replication works amazingly fast, but it is relatively new and we still need to monitor and improve its robustness.
- The communication protocol between Netdata Cloud and very busy Parents can be further optimized for bandwidth utilization, to lower the cost of ownership in these extreme cases.
- When working via Netdata Cloud, chart refreshes suffer increased latency, mainly due to the round-trip network delay. We believe we should re-architect the edge of Netdata Cloud (the point where agents connect) and provide installations of it on every continent and in major countries, to minimize this latency and deliver a fast, fluent experience closer to that of the agent dashboard, which connects directly to agents on the local LAN.
- Simplify adding nodes to Netdata Cloud, eliminating the claiming process and replacing the agent dashboard with the Cloud Overview one.
Machine Learning and its importance for Monitoring
Machine learning can help alleviate the drudgery involved in monitoring and troubleshooting, making the troubleshooting experience shorter, more impactful and even enjoyable. Netdata’s objective is to empower the Netdata community to troubleshoot with clarity and purpose, make data-informed decisions and tell powerful and meaningful stories with their data.
Netdata’s Anomaly Advisor detects anomalies across all metrics of all nodes, out of the box. Anomaly detection is baked into the solution itself, not bolted on as an add-on as is often seen elsewhere. This comes with several advantages:
Optimized, lightweight ML running as close to the edge as possible enables improved privacy, reduced latency, increased accuracy and savings on network bandwidth.
Whether a metric, a family of metrics, or a node is behaving anomalously now, or behaved anomalously in the past, is information that is always available on demand and does not need to be calculated at query time.
Currently, the anomaly rate information is available only if and when the user explicitly looks for it, either by opening the Anomaly Advisor tab or by clicking the anomaly icon of a specific chart. It is not automatically presented in an intuitive manner that aids troubleshooting. The onus is, for now, on the user to take action; this is something we hope to improve.
Next Steps
The next logical step is to be more proactive with the insights we are able to derive, nudging users toward helpful insights rather than waiting to be prompted.
- Machine Learning should reveal useful insights to users in a consumable and shareable manner. New users may not intuitively realize that they have to take an extra step to trigger the ML to do something; surfacing this information proactively will ensure that the ML is not just available but actively aiding the user. Of course, users will not be online and active on Netdata all the time, and there will be situations where the “interesting” insight happened when nobody was observing. These “incidents” should be captured and presented to the user in a way that is shareable.
- Use Machine Learning to improve the usefulness of alerts and notifications, and to reduce notification fatigue.
- Offer opinionated suggestions, when possible, to aid the user in their troubleshooting journey.
Alerts and Notifications
The alerts engine is distributed: it runs on the Netdata agents, and each agent has its own configuration. Netdata already deduplicates alerts automatically when the same metric arrives from multiple child/parent nodes.
Dispatching of alert notifications is performed both by the agent and centrally by the Cloud. The agent offers more notification methods today, but we believe Netdata’s future lies in managing alert dispatching centrally, eventually moving the notification functionality from the agent to the Cloud.
Today, we cannot provide composite alerts centrally (i.e. alerts that examine metrics across nodes). This is going to be tricky, mainly because Netdata is distributed and we would like to avoid querying all the nodes all the time to trigger a composite alert.
The following is a list of items we plan to improve:
- Central management. Alert configurations should be administered centrally and deployed automatically to the whole infrastructure.
- Personalization of alerts. Alert management should give people with different roles finer control over the alerts they receive, helping them deal with alert fatigue. We also have to add some frequently requested features like scheduled silencing, maintenance windows, acknowledgements, etc.
- More alert notification methods:
  - Migrate the most common notification methods from the agent to the Cloud.
  - Build a Netdata mobile app to make alerts available on users’ mobile devices (for free).
  - Provide daily / weekly / monthly digests.
- Team coordination, post-mortem analysis, incident management. Provide a better process for post-mortem analysis and a more collaborative way to respond to incidents, along with a better experience for troubleshooting through alerts.
- Make alert configuration easier and more straightforward, to help users build their own SLIs, SLOs and SLAs (today’s agent-side syntax is sketched below).
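For context, an agent-side alert definition today looks roughly like this health.d sketch; the alarm name, dimension and thresholds below are illustrative:

```
 alarm: ram_usage_high
    on: system.ram
lookup: average -1m percentage of used
 units: %
 every: 10s
  warn: $this > 80
  crit: $this > 90
  info: RAM utilization is high
```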
Data Collection
One of the key principles of Netdata is automated visualization: the ability to provide meaningful dashboards without any user configuration. To achieve this, we attach metadata to the collected metrics, giving the visualization engine the information it needs to group metrics together and present them as ready-to-use dashboards, out of the box. This metadata enrichment is probably the most important aspect contributing to Netdata’s easy setup and fast deployment.
Netdata collects thousands of metrics out of the box from hundreds of different data sources, and comes with hundreds of collectors developed in-house in collaboration with our open source community. Netdata is able to collect metrics using any protocol and any methodology, using both pull (Netdata asks for metrics) and push (Netdata waits for metrics to come) architectures. We can also gather metrics from any Prometheus endpoint that uses the OpenMetrics exposition format.
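For example, pointing the bundled go.d Prometheus collector at an OpenMetrics endpoint takes only a few lines of configuration; the job name and URL below are placeholders:

```yaml
# go.d/prometheus.conf
jobs:
  - name: my_app                          # placeholder job name
    url: http://127.0.0.1:8080/metrics    # placeholder endpoint
```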
With the introduction of the “functions” feature, our users can run anything on their infrastructure that can be executed at runtime and visualize the output on the Netdata dashboard. These functions can be thought of as specific routines that bring additional information to help you troubleshoot, or that even trigger an action on the node itself.
As of today, we do not support collecting logs or traces for the purpose of monitoring or troubleshooting.
These are the key data collection improvements we are currently focused on:
- Make Netdata fully OpenTelemetry compatible and more closely aligned with current best practices and standards: consistent naming of metrics, natural units, data collection using double-precision numbers, etc. This may require breaking backwards compatibility at some point, but it will allow Netdata to interoperate better with other open source and commercial solutions.
- Standardize and document the metadata required to provide fully automated dashboards and alarms. This is crucial for out-of-the-box functionality and for significantly improving accessibility to monitoring. We have made a lot of progress in this area, but the result of our work is neither standardized nor documented yet.
- Enhance the ability to collect metrics from Prometheus endpoints, so that metrics collected this way become first-class citizens and enjoy all the benefits of other Netdata metrics (including summary dashboards, meaningful organization, automated alerts and more). The key objective is to find ways to automatically enrich the collected data with the metadata Netdata requires to provide fully automated dashboards and alarms for Prometheus metrics.
- Introduce an extensible functions framework, making it trivial for users to add new functions and making Netdata more powerful for troubleshooting and more.
- Add the ability to monitor logs. We avoid competing directly with great companies that already do this well, like Elastic. Yet we should provide a few basic functions, like searching logs for a keyword or a time frame, and extracting key events from logs to annotate charts and alerts. This will be a fundamental shift that opens up many possibilities, including correlating metric data with log information.
Open Ecosystem
Netdata is a proponent of the “big tent” philosophy when it comes to monitoring: we acknowledge that our users will have other tools they rely on as part of their monitoring stack, and we aim to be as interoperable as possible with them.
We focus on what we currently do well while also providing as much easy interoperability as possible with other important tools in the ecosystem.
Import from multiple data sources:
- Netdata can import data from many different sources, in effect enabling the user to monitor practically anything: Prometheus endpoints using the OpenMetrics exposition format, and, through the Pandas collector, CSVs, REST APIs and multiple other formats.
Export data from Netdata:
- Netdata supports exporting metric data to open time-series databases such as InfluxDB, OpenTSDB and TimescaleDB for longer-term storage and post-processing.
- Using the Prometheus remote write exporting connector, Netdata metrics can be exported to more than 20 external storage providers for long-term archiving and further analysis.
- Users can also query Netdata directly from the shell using curl.
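For example, assuming a local agent on the default port 19999:

```sh
# List the charts available on a local agent:
curl 'http://localhost:19999/api/v1/charts'

# Pull a snapshot of all metrics in the Prometheus exposition format:
curl 'http://localhost:19999/api/v1/allmetrics?format=prometheus'
```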
While we are always looking for ideas on making the monitoring and troubleshooting journey simpler and more powerful for the user, we are also excited to hear about the cool ideas and projects that users build using data exported from Netdata.
Integrating with 3rd party monitoring tools
- The Netdata data source plugin for Grafana brings the troubleshooting capabilities of Netdata into Grafana, making them more widely available. Key capabilities provided by this plugin include:
  - Real-time monitoring with single-second granularity.
  - Installation and out-of-the-box integrations available in seconds from one line of code.
  - Thousands of metrics from across your entire infrastructure, with insightful metadata associated with them.
  - Access to our fresh machine learning metrics (anomaly rates), exposing these capabilities at the edge.
- The Netdata data source plugin connects directly to our Netdata Cloud APIs, meaning that you will need your nodes (hosts) connected to Netdata Cloud in order for them to be exposed through the plugin.
Our commitment to Open Source
Open source software has democratized technology in ways that were once unimaginable. It has enabled developers to create powerful and reliable software, often for free. At Netdata, we love open source because it aligns with our values of collaboration, innovation, and transparency. By embracing open source, we can harness the power of the community to solve complex problems and create products that have a positive impact on the world.
The heart of our software and the key building block of the Netdata ecosystem, the Netdata Agent, is and always will be open-source. It is licensed under GPL-v3+. The Netdata Agent includes the database, the query engine, the health engine and the machine learning engine which enable high fidelity monitoring and the ability to troubleshoot complex performance problems.
The Netdata Agent is our gift to the world! And we love community contributions!