DNSdist monitoring with Netdata

What is DNSdist?

DNSdist is an open-source DNS load balancer and traffic filter. It improves DNS performance and security by distributing incoming queries across multiple backend DNS servers (authoritative or recursive) and by filtering out malicious traffic. DNSdist can also be used for DNS traffic management, caching, and related services.

Monitoring DNSdist with Netdata

To monitor DNSdist with Netdata, you need both DNSdist and Netdata installed on your system.

Netdata auto-discovers hundreds of services, and for those it doesn't, turning on manual discovery is a one-line configuration. For more information on configuring Netdata for DNSdist monitoring, please read the collector documentation.
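The collector reads its statistics from DNSdist's built-in web server. As a minimal sketch for checking that this endpoint is reachable before configuring the collector, the Python below assumes the web server listens on 127.0.0.1:8083, serves statistics at /jsonstat?command=stats, and accepts an API key in the X-API-Key header. Those values and all helper names are assumptions for illustration, so check them against your DNSdist configuration and the collector documentation.

```python
import json
import urllib.request

# Assumed values: adjust them to match your DNSdist web server settings.
DNSDIST_URL = "http://127.0.0.1:8083"
API_KEY = "your-api-key"  # hypothetical placeholder


def build_stats_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build a GET request for the (assumed) DNSdist statistics endpoint."""
    url = base_url.rstrip("/") + "/jsonstat?command=stats"
    return urllib.request.Request(url, headers={"X-API-Key": api_key})


def parse_stats(body: bytes) -> dict:
    """Decode a JSON object and keep only its numeric entries."""
    raw = json.loads(body)
    return {
        name: value
        for name, value in raw.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


def fetch_stats(base_url: str = DNSDIST_URL, api_key: str = API_KEY,
                timeout: float = 2.0) -> dict:
    """Fetch and parse the statistics; raises on connection or HTTP errors."""
    request = build_stats_request(base_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_stats(response.read())
```

Calling fetch_stats() on the host that runs DNSdist should return a dictionary of counters. A connection error or an HTTP 401 response usually means the web server address or API key differs from what was assumed here.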

You should now see the DNSdist section on the Overview tab in Netdata Cloud already populated with charts about all the metrics you care about.

Netdata has a public demo space (no login required) where you can explore different monitoring use cases and get a feel for Netdata.

What DNSdist metrics are important to monitor - and why?

Queries (all, recursive, empty)

The DNSdist queries metric measures the number of DNS queries received by the DNS server. It is important to monitor this metric because it shows how busy your DNS server is and how well it is handling the workload. For example, a sustained rise in all queries well above the usual baseline may suggest that the server is overloaded and unable to handle the incoming requests. Similarly, an unusually high number of recursive queries could indicate malicious activity such as a DDoS attack or DNS amplification.

It is also important to monitor the empty queries metric, which should remain low. Empty queries are those that do not contain any question data and are usually sent by malicious actors in an attempt to overwhelm the DNS server.
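Query counts are easier to judge as rates and proportions than as raw totals. The sketch below assumes you have two snapshots of cumulative counters taken a known number of seconds apart; the dimension names (all, recursive, empty) follow the chart above, and the function names are hypothetical.

```python
def per_second_rates(previous: dict, current: dict, interval_seconds: float) -> dict:
    """Turn two snapshots of cumulative counters into per-second rates."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    rates = {}
    for name, now in current.items():
        before = previous.get(name)
        if before is None or now < before:
            # Missing sample or counter reset (for example after a restart):
            # skip this dimension for the current interval.
            continue
        rates[name] = (now - before) / interval_seconds
    return rates


def share(part: float, whole: float) -> float:
    """Return part / whole, or 0.0 when there is no traffic at all."""
    return part / whole if whole > 0 else 0.0
```

With the rates in hand, share(rates["empty"], rates["all"]) gives the fraction of empty queries, which is a steadier signal to alert on than the absolute count.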

Queries Dropped (rule_drop, dynamic_blocked, no_policy, non_queries)

The queries dropped metric measures the number of queries that DNSdist dropped, broken down by reason. The rule_drop dimension counts queries dropped by explicit rules defined in the DNSdist configuration. The dynamic_blocked dimension counts queries dropped because of a dynamic block, that is, a temporary block added at runtime. The no_policy dimension counts queries dropped because no backend server was available to handle them. The non_queries dimension counts packets that were dropped because they were not valid queries.

The queries dropped metric is important to monitor as it can help to detect malicious activity such as DDoS attacks or DNS amplification. It can also help to identify misconfigured DNSdist rules or policies that may be inadvertently dropping valid queries.
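To tell an attack from a misconfiguration, it helps to see which drop reason dominates. The following sketch takes per-reason drop counts keyed by the dimension names listed above and reports each reason's share of all drops; the function names are hypothetical.

```python
from typing import Optional

DROP_REASONS = ("rule_drop", "dynamic_blocked", "no_policy", "non_queries")


def drop_breakdown(drops: dict) -> dict:
    """Return each drop reason's share of all dropped queries."""
    total = sum(drops.get(reason, 0) for reason in DROP_REASONS)
    if total == 0:
        return {reason: 0.0 for reason in DROP_REASONS}
    return {reason: drops.get(reason, 0) / total for reason in DROP_REASONS}


def dominant_reason(drops: dict) -> Optional[str]:
    """Return the reason with the largest share, or None if nothing was dropped."""
    breakdown = drop_breakdown(drops)
    reason = max(breakdown, key=breakdown.get)
    return reason if breakdown[reason] > 0 else None
```

A breakdown dominated by dynamic_blocked points toward abusive clients being blocked, while one dominated by rule_drop is a prompt to review the configured rules.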

Packets Dropped (acl)

The packets dropped metric measures the number of packets that were dropped because of the access control list (ACL) configured in DNSdist. The ACL controls which clients can access the DNS server and, when configured correctly, can help to prevent malicious actors from overloading the server with requests. It is important to monitor this metric as it can help to detect malicious activity such as DDoS attacks or DNS amplification.

Answers (self_answered, nxdomain, refused, trunc_failures)

The answers metric measures the number of DNS answers sent by the server. The self_answered dimension counts responses that DNSdist generated itself instead of forwarding the query to a backend. The nxdomain dimension counts responses sent with an NXDOMAIN error. The refused dimension counts responses sent with a REFUSED error. The trunc_failures dimension counts errors encountered while truncating a response.

Monitoring this metric can help to identify misconfigured DNSdist rules or policies that may be inadvertently sending incorrect responses. Additionally, it can help to detect malicious activity such as DNS spoofing or cache poisoning.

Backend Responses

The backend responses metric measures the number of responses received per second from backend servers in response to DNS queries. This metric is important to monitor as it indicates how well the DNS service is performing in terms of responding to requests. If the number of responses is low relative to the number of queries forwarded, it could indicate that the backend servers are not responding quickly enough or that there is an issue with the DNS service itself.

Backend commerrors

The backend commerrors metric measures the number of backend server communication errors per second. This metric is important to monitor as it indicates whether there are any issues with the backend server communication. High levels of communication errors can indicate that the backend servers are not responding in a timely manner or that there are issues with the network configuration.

Backend errors (timeouts, servfail, non_compliant)

The backend errors metric measures the number of timeouts, SERVFAIL responses, and non-compliant responses from backend servers per second. This metric is important to monitor as it indicates whether the backend servers are responding correctly to requests. High levels of timeouts, SERVFAIL, or non-compliant responses can indicate that there is an issue with the request or response from the backend servers. It can also indicate that the backend servers are not configured correctly.

Cache (hits, misses)

The cache metric measures the number of cache hits and misses per second. This metric is important to monitor as it indicates how well the DNS service is performing in terms of caching requests. High levels of cache misses can indicate that the service is not efficiently caching requests and responses, which can lead to higher response times and increased load on the backend servers.
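Hits and misses are most useful when combined into a hit ratio. A minimal sketch, assuming you have the hit and miss counts for the same time window (the function name is hypothetical):

```python
from typing import Optional


def cache_hit_ratio(hits: float, misses: float) -> Optional[float]:
    """Return hits / (hits + misses), or None when there were no cache lookups."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total
```

Returning None for an idle window keeps "no lookups" distinct from "every lookup missed", which would otherwise both appear as 0.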

Server CPU (system_state, user_state)

The server CPU metric measures the CPU time that DNSdist spends in the system and user states. This metric is important to monitor as it indicates how much of the system's resources are being used by the DNS service. High levels of system or user CPU usage can indicate that the service is overloaded or that there are potential issues with the configuration that need to be addressed. Normal value ranges for this metric would be 0-100%, with higher values indicating higher CPU usage.

Server Memory Usage

The server memory usage metric measures the amount of memory used by the DNS service. This metric is important to monitor as it indicates how much of the system's resources are being used by the DNS service. High levels of memory usage can indicate that the service is using too many resources or that there are potential issues with the configuration that need to be addressed.

Query Latency (1ms, 10ms, 50ms, 100ms, 1sec, slow)

The query latency metric counts the queries answered within each latency bucket: under 1 ms, 1 to 10 ms, 10 to 50 ms, 50 to 100 ms, 100 ms to 1 second, and slower than 1 second (slow). This metric is important to monitor as it indicates how quickly the DNS service is responding to requests. A shift of queries toward the slower buckets can indicate that the backend servers are overloaded or that there are other issues with the service.
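One way to read the latency chart is to turn it into a distribution. The sketch below assumes each dimension holds the number of queries answered within that latency bucket, that the 1sec bucket covers 100 ms to 1 second, and that slow covers everything above 1 second; the function names are hypothetical.

```python
LATENCY_BUCKETS = ("1ms", "10ms", "50ms", "100ms", "1sec", "slow")


def latency_distribution(counts: dict) -> dict:
    """Return the fraction of queries that fell into each latency bucket."""
    total = sum(counts.get(bucket, 0) for bucket in LATENCY_BUCKETS)
    if total == 0:
        return {bucket: 0.0 for bucket in LATENCY_BUCKETS}
    return {bucket: counts.get(bucket, 0) / total for bucket in LATENCY_BUCKETS}


def share_slower_than_100ms(counts: dict) -> float:
    """Fraction of queries answered in more than 100 ms (assumed bucket bounds)."""
    distribution = latency_distribution(counts)
    return distribution["1sec"] + distribution["slow"]
```

Tracking share_slower_than_100ms over time gives a single number that rises whenever queries move into the two slowest buckets.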

Query Latency Average (100, 1k, 10k, 1000k)

The query latency average metric measures the average response latency over the last 100, 1,000, 10,000, and 1,000,000 queries. This metric is important to monitor as it indicates how well the DNS service is performing in terms of responding to requests. Comparing the short windows with the long ones shows whether a latency increase is a brief spike or a sustained trend. High response times can indicate that the backend servers are overloaded or that there are other issues with the service.
