PostgreSQL is a popular open source object-relational database system designed to work for a wide range of workloads from single machines to data warehouses to web services with many concurrent users. PostgreSQL runs on all major operating systems and is used by teams and organizations across the world, including Netdata.
If you are using PostgreSQL in production, it is crucial that you monitor it for potential issues. And the more comprehensive the monitoring the better!
PostgreSQL, just like any other service, impacts the resources of the host on which it is installed. This means that host metrics such as CPU, memory, disk space and network both influence and are influenced by PostgreSQL performance. These basic host metrics are often treated as a first step before embarking on a deeper troubleshooting journey into more advanced, database-specific metrics.
PostgreSQL automatically aggregates many of the key metrics internally, so you just need to query predefined statistics views such as pg_stat_database, pg_stat_user_tables and pg_stat_bgwriter to understand what’s going on with your databases. However, there are plenty of other important metrics that need to be collected explicitly.
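For example, if you want to peek at one of these views by hand, a minimal query (run from psql or any SQL client, assuming a user allowed to read the statistics views) might look like this:

```sql
-- A quick look at per-database activity counters from pg_stat_database.
-- Column names are from the standard PostgreSQL statistics views.
SELECT datname,        -- database name
       numbackends,    -- connections currently open to this database
       xact_commit,    -- committed transactions since the stats were reset
       xact_rollback,  -- rolled-back transactions
       blks_read,      -- blocks read from disk
       blks_hit        -- blocks found in the shared buffer cache
FROM pg_stat_database
WHERE datname IS NOT NULL;
```

A collector such as Netdata’s effectively runs queries like this for you on a schedule, which is what turns one-off snapshots into continuous monitoring.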
Comprehensive monitoring also means collecting these metrics continuously, not just as a one-off check. Database issues can occur at any time, and it is important to have enough monitoring data available for early detection, for understanding what triggered the issue, and for deciding how best to resolve it.
Netdata is a fundamentally faster, easier way to monitor your infrastructure and anything that’s part of it. Including PostgreSQL!
So let’s first walk through how to set up Netdata monitoring (spoiler: it’s really easy) and then dive deeper into the world of PostgreSQL metrics – what they are, why they matter, and what to do with them. Or feel free to use the links below to jump to the section you need!
For the purpose of this guide, we will presume that you already have a working PostgreSQL cluster. The following prerequisites should be met in order for Netdata to monitor your PostgreSQL cluster:
If these requirements are met, installing Netdata is as easy as it gets; just follow these steps to sign up for a free Netdata account and install the open source agent.
Netdata automatically discovers the PostgreSQL service and starts monitoring it with the default PostgreSQL user. You should now see a PostgreSQL section on the Overview tab that’s already populated with charts about all the metrics you care about!
And as long as you are using the default Postgres user, that’s it!
If you’d like to configure a different user or modify any other configuration, the PostgreSQL collector reference documentation has got you covered.
The metrics that Netdata collects are organized into subsections within the PostgreSQL section for easier navigation. Each metric is represented by a composite chart that aggregates the data across multiple nodes/databases/tables/etc.
Netdata collects 100+ PostgreSQL metrics and visualizes these across 60+ different composite charts. See the reference documentation for the full list of metrics.
Clicking on the PostgreSQL section takes you to the summary dashboard, which gives you a quick overview of your database cluster. Scrolling down lets you explore PostgreSQL metrics to your heart’s content - you can also click on a particular section to get where you want more quickly.
Each composite chart (see the example below) can be grouped by dimension, database, schema, table, or other group-by options. You can also filter the chart by a particular condition (a database or a table, for example). You also have the option to change the default aggregation and interpolation methods, change the chart type, add the chart to a custom dashboard or view it in full screen.
A connection is an established line of communication between a client and the PostgreSQL server. Each connection adds to the load on the PostgreSQL server.
Observing the number of connections to your database over time, and how many of them are currently in use, can help you optimize your application’s system architecture and determine the right settings to maximize your throughput while limiting overhead.
To guard against running out of memory or overloading the database, the max_connections parameter (default = 100) defines the maximum number of concurrent connections to the database server. A separate parameter, superuser_reserved_connections (default = 3), defines the quota for superuser connections (so that superusers can connect even if all other connection slots are blocked).
Total connection utilization across all databases is measured as a percentage of (max_connections - superuser_reserved_connections). If the utilization is 100%, then no more new connections will be accepted (superuser connections will still be accepted if superuser quota is available).
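If you want to sanity-check this utilization figure directly on the server, a rough equivalent query (assuming PostgreSQL 10 or later for the backend_type column) is:

```sql
-- Client connections as a percentage of the slots available to regular users,
-- i.e. (max_connections - superuser_reserved_connections).
SELECT count(*) * 100.0 /
       (current_setting('max_connections')::int
        - current_setting('superuser_reserved_connections')::int) AS utilization_pct
FROM pg_stat_activity
WHERE backend_type = 'client backend';  -- ignore background workers (PostgreSQL 10+)
```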
Connection usage across all databases. The maximum number of concurrent connections available to regular users is (max_connections - superuser_reserved_connections). As a general rule, if you need more than 200 connections it is advisable to use connection pooling.
Connection count visualized per current state of the connection. The state of a connection can be idle, active, disabled, idle in transaction, idle in transaction (aborted) and fastpath function call.
This information helps you better manage your connection count. The state of a connection can be:
- active: the backend is executing a query.
- idle: the backend is waiting for a new client command.
- idle in transaction: the backend is inside a transaction but is not currently executing a query.
- idle in transaction (aborted): similar to idle in transaction, except that one of the statements in the transaction caused an error.
- fastpath function call: the backend is executing a fast-path function.
- disabled: reported when track_activities is disabled for this backend.
If you see a high count of idle connections, it is worth exploring a connection pooling solution such as PgBouncer. If you find that you have some stale transactions hanging around for days, hours, or even just a few minutes, you may want to set defaults that end them: Postgres provides statement_timeout to abort statements that run for too long, and idle_in_transaction_session_timeout to end sessions that sit idle inside an open transaction.
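As a sketch of what such defaults could look like (the database name mydb and the timeout values below are placeholders, not recommendations):

```sql
-- Abort any statement that runs longer than 60 seconds in this database.
ALTER DATABASE mydb SET statement_timeout = '60s';

-- End sessions that sit idle inside an open transaction for more than 10 minutes.
ALTER DATABASE mydb SET idle_in_transaction_session_timeout = '10min';
```

New sessions pick these settings up when they connect; existing sessions keep their current values until they reconnect.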
Connection utilization per database. Utilization is measured as a percentage of CONNECTION LIMIT per database (if set) or max_connections (if CONNECTION LIMIT is not set).
Represents the current number of connections per database.
This is a histogram of transaction durations. The bins are specified as consecutive, non-overlapping intervals, and the value is the number of observed transactions that fall into each interval.
The commit ratio is the percentage of committed transactions (xact_commit) out of all completed transactions (xact_commit plus xact_rollback). It indicates the share of successful operations. Changes made by a transaction may be committed or aborted (rolled back). Single queries that fail outside of explicit transactions are also counted as rollbacks. So, in general, the commit ratio lets us estimate the amount of errors in a particular database.
Values closer to 100 mean that your database is healthy and has very few errors. If the commit ratio is below 90, it is a good idea to investigate further by enabling additional logs and analyzing them to build a list of the most common errors.
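To compute the same ratio straight from pg_stat_database (the formula mirrors the description above):

```sql
-- Commit ratio per database: committed transactions as a percentage of all
-- completed transactions (commits + rollbacks).
SELECT datname,
       round(100.0 * xact_commit / nullif(xact_commit + xact_rollback, 0), 2) AS commit_ratio
FROM pg_stat_database
WHERE datname IS NOT NULL;
```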
The actual count of transactions per second, per database.
This is a histogram of active query execution times. The bins are specified as consecutive, non-overlapping intervals, and the value is the number of observed active queries that fall into each interval.
Use this chart to identify whether you have a slow query problem - if you consistently see values in the bins representing longer durations, it is worth enabling slow query logging to investigate further.
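One way to enable slow query logging is via log_min_duration_statement; the 500 ms threshold below is just an example and should be tuned to your workload:

```sql
-- Log every statement that takes longer than 500 ms to complete.
ALTER SYSTEM SET log_min_duration_statement = '500ms';

-- Reload the configuration so the change takes effect without a restart.
SELECT pg_reload_conf();
```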
The rows fetched ratio is the percentage of rows that contain data needed to execute the query (rows fetched), out of the total number of rows scanned (rows returned). A high value indicates that the database is executing queries efficiently, while a low value indicates that the database is performing extra work by scanning a large number of rows that aren’t required to process the query.
If this value is consistently and significantly low then it is likely to be due to missing indexes or inefficient queries. Consider creating indexes on frequently accessed columns to improve this ratio.
Monitoring the blocks read from disk or found in cache helps you understand if a disk read was required or if the data was found in the buffer cache and a read was prevented.
Monitoring the number of rows inserted, updated or deleted helps you understand how write-heavy your current workload is and allows you to tune your database to optimize your application’s performance.
Number of temporary files created by queries. Complex queries may require more memory than is available (specified by work_mem). When this happens, Postgres reverts to using temporary files that are stored on disk but only exist for the duration of the request; after the request returns, the temporary files are deleted. Temp file usage also typically indicates long-running queries, which are likely to be slow to return anyway and may cause slowdowns in your application.
Amount of data written to temporary files by queries. The best way to reduce temp files and their impact is to increase your query efficiency and/or decrease your query complexity. You can do this by rewriting queries so they sort or join less data, adding indexes so that less data has to be scanned, or increasing work_mem so that more of the work fits in memory.
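To check which databases are spilling to disk, and what the current per-operation memory budget is, you can look at pg_stat_database directly (a one-off check; the same counters are what a collector tracks continuously):

```sql
-- Temporary file activity per database since the statistics were last reset.
SELECT datname,
       temp_files,
       pg_size_pretty(temp_bytes) AS temp_bytes
FROM pg_stat_database
WHERE datname IS NOT NULL
ORDER BY temp_bytes DESC;

-- The memory a single sort/hash operation may use before spilling to disk.
SHOW work_mem;
```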
Write queries throughput. If you see a large number of updated and deleted rows, keep an eye on the number of dead rows, as a high percentage of dead rows can slow down your queries.
Percentage of HOT (Heap Only Tuple) updated rows. HOT updates are much more efficient than ordinary updates: fewer write operations, fewer WAL writes, less work for vacuum, and increased read efficiency (which helps limit table and index bloat).
Number of HOT (Heap Only Tuple) updated rows.
Number of scans initiated on this table. If you see that your database regularly performs more sequential scans over time, you can improve its performance by creating an index on data that is frequently accessed.
Number of live rows fetched by scans.
PostgreSQL uses a shared buffer cache to store frequently accessed data in memory, and avoid slower disk reads. This metric represents the miss ratio (percentage of time PostgreSQL had to do disk reads vs memory buffer reads) and the closer to zero it is the better.
If you are seeing performance issues, consider increasing the shared_buffers size or tuning effective_cache_size.
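A quick way to approximate this miss ratio by hand, using the blks_hit and blks_read counters from pg_stat_database:

```sql
-- Buffer cache miss ratio per database: disk reads as a percentage of all reads.
SELECT datname,
       round(100.0 * blks_read / nullif(blks_hit + blks_read, 0), 2) AS miss_ratio_pct
FROM pg_stat_database
WHERE datname IS NOT NULL;
```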
Amount of data read from shared buffer cache or from disk per database.
The cache miss ratio measured per table, per database, is useful for identifying inefficient tables. As with the database-level buffer cache miss ratio, the lower the better.
Amount of data read from shared buffer cache or from disk per table.
Represents what percentage of index reads are from cache (memory) vs disk. The lower the better.
Amount of data read during index reads from shared buffer cache or from disk.
Actual on-disk usage of the database’s data directory and any associated tablespaces.
Actual on-disk size of table(s).
Actual on-disk size of index(es).
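If you want to check these sizes ad hoc, PostgreSQL’s size functions report the same kind of information (my_table below is a placeholder for one of your own tables):

```sql
-- On-disk size of the current database.
SELECT pg_size_pretty(pg_database_size(current_database())) AS database_size;

-- Size of one table split into heap/TOAST vs. its indexes.
SELECT pg_size_pretty(pg_table_size('my_table'))   AS table_size,
       pg_size_pretty(pg_indexes_size('my_table')) AS indexes_size;
```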
PostgreSQL uses Multi-Version Concurrency Control (MVCC) to maintain ACID compliance, and this approach is characterized by the generation of dead tuples. Over time, the accumulation of these dead rows increases both table and index sizes; this growth is known as bloat.
This chart represents the estimated percentage of bloat in the table. It is normal for tables that are updated frequently to have a small to moderate amount of bloat.
Disk space that was used by the table and is available for reuse by the database but has not been reclaimed.
Bloat lowers performance by increasing the amount of table space. Recovering the space is normally managed by the autovacuum daemon or by tuning the autovacuum table storage parameters. Alternatively, a more aggressive approach is to “repack” the bloated relations using pg_repack.
This chart represents the estimated percentage of bloat in table indexes. It is normal for tables that are updated frequently to have a small to moderate amount of bloat.
Disk space that was used by the index and is available for reuse by the database but has not been reclaimed. Bloat slows down your database and eats up more storage than needed. To recover the space from indexes, recreate them using the REINDEX command.
Rate of deadlocks detected per second, per database. A deadlock occurs when two or more transactions each hold locks that the others need, so none of them can proceed. PostgreSQL checks for a deadlock whenever a transaction has been waiting on a lock for longer than deadlock_timeout, and resolves it by aborting one of the transactions involved.
Locks held per database, categorized by type of lock being held. Some of these lock modes are acquired by PostgreSQL automatically before statement execution, while others are provided to be used by applications. All lock modes acquired in a transaction are held for the duration of the transaction.
Locks awaited by database, categorized by type of lock being awaited.
A non-zero value indicates that some transaction is currently waiting to acquire a lock, which implies that another transaction is holding a conflicting lock mode on the same lockable object.
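To see which sessions are waiting right now and which sessions are blocking them, a handy ad-hoc query (assuming PostgreSQL 9.6 or later for pg_blocking_pids and wait_event_type) is:

```sql
-- Sessions currently blocked on a lock, together with the PIDs blocking them.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```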
PostgreSQL databases require periodic maintenance known as vacuuming. For many installations, it is sufficient to let vacuuming be performed by the autovacuum daemon. For more information see The Autovacuum Daemon.
Count of vacuum and analyze workers, by type.
Time elapsed since this table was last manually vacuumed (not counting VACUUM FULL).
Time elapsed since this table was last automatically vacuumed.
Time elapsed since this table was last manually analyzed.
Time elapsed since this table was last analyzed by the autovacuum daemon.
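If you want to spot-check a specific table, the underlying timestamps are exposed per table by pg_stat_user_tables:

```sql
-- When each user table was last vacuumed and analyzed (manually or by autovacuum).
SELECT relname,
       last_vacuum,
       last_autovacuum,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
ORDER BY greatest(last_vacuum, last_autovacuum) NULLS FIRST;
```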
Number of checkpoints that have been performed. Checkpoints are periodic maintenance operations the database performs to make sure that everything it’s been caching in memory has been synchronized with the disk. Ideally checkpoints should be time-driven (scheduled) as opposed to load-driven (requested).
Checkpoint timing information. An important indicator of how well checkpoint I/O is performing is the amount of time taken to sync files to disk.
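The cumulative checkpoint counters and timings live in pg_stat_bgwriter (note that on newer PostgreSQL releases some of these columns have moved to the pg_stat_checkpointer view):

```sql
-- Scheduled vs. requested checkpoints, and time spent writing and syncing files
-- during checkpoints (in milliseconds), since the statistics were last reset.
SELECT checkpoints_timed,
       checkpoints_req,
       checkpoint_write_time,
       checkpoint_sync_time
FROM pg_stat_bgwriter;
```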
The number of times the background writer stopped a cleaning scan because it had written too many buffers (exceeding the value of bgwriter_lru_maxpages).
Amount of data flushed from memory to disk.
Allocated and re-allocated buffers. If a backend process requests data, it is either found in a block in the shared buffer cache or the block has to be allocated (read from disk). This metric represents the rate of occurrences where the block had to be allocated.
Number of times a backend had to execute its own fsync call (normally the background writer handles those even when the backend does its own write). Any values above zero can indicate problems with storage when the fsync queue is completely filled.
Represents the value of the oldest transaction ID. If for some reason autovacuum fails to clear old XIDs from a table, the system will begin to emit warning messages when the database’s oldest XIDs reach eleven million transactions from the wraparound point.
To bring the highest XID age value back down, one option is to force a vacuum on the entire database cluster. If this is not possible, vacuuming the specific tables where the problem is concentrated is another option.
This is a critical metric: it indicates how close the database is to transaction ID exhaustion and wraparound. If exhaustion is ever reached, the database could be forced to shut down. There is a small safety margin, since the chart’s upper boundary is exactly 2 billion, which is slightly below the actual value at which exhaustion occurs. Regardless, this metric hitting 100% should be acted upon immediately. For more information see Preventing Transaction ID Wraparound Failures.
This metric tells you how close the database’s highest transaction ID age has come to autovacuum_freeze_max_age. autovacuum_freeze_max_age is a user-configurable parameter (default value = 200 million); when any table’s highest transaction ID age reaches it, a higher-priority autovacuum kicks in on that table.
This is especially important to watch on systems where transaction ID age has not been monitored before.
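To see how each database is doing against both the wraparound point and autovacuum_freeze_max_age, you can check the age of datfrozenxid yourself:

```sql
-- Oldest transaction ID age per database, alongside the emergency autovacuum threshold.
SELECT datname,
       age(datfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::int AS autovacuum_freeze_max_age
FROM pg_database
ORDER BY xid_age DESC;
```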
The ratio of dead rows to live rows measured per database, per table.
An increase in dead rows indicates a problem with VACUUM processes, which can slow down your queries.
The total count of rows (live and dead) measured per database, per table.
When you do an UPDATE or DELETE, the row is not actually physically deleted. For a DELETE, the database simply marks the row as unavailable for future transactions, and for UPDATE, under the hood it is a combined INSERT then DELETE, where the previous version of the row is marked unavailable.
Count of table columns that are always NULL - in other words, a count of table columns that may be useless and unnecessary. This metric is measured per database and per table. Once you identify a table with always-NULL columns, you will have to query the table manually to identify which columns those are.
Count of indexes by usage status. An index is considered unused if no scans have been initiated on that index. Use this metric to monitor if you consistently have unused indexes.
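A quick way to list candidate unused indexes yourself (keep in mind that idx_scan counts are reset together with the statistics, and that some indexes exist to enforce constraints rather than to speed up reads):

```sql
-- Indexes that have never been scanned since the statistics were last reset.
SELECT schemaname,
       relname       AS table_name,
       indexrelname  AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```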
Replication lag measured in seconds per lag type. This metric is measured per standby instance.
Replication lag measured in bytes per lag type. This metric is measured per standby instance.
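On the primary, pg_stat_replication exposes both the time-based and the byte-based view of lag (the *_lag columns require PostgreSQL 10 or later):

```sql
-- Per-standby replication lag, in time and in bytes behind the primary's current WAL position.
SELECT application_name,
       write_lag,
       flush_lag,
       replay_lag,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```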
Replication slot files. For more information see Replication Slots.
Number of queries canceled due to conflicts with recovery on standby servers. To minimize query cancellations caused by cleanup records, consider configuring hot_standby_feedback.
This view will only contain information on standby servers, since conflicts do not occur on master servers.
Statistics about queries canceled due to various types of conflicts on standby servers.
Write-Ahead Logging (WAL) ensures data integrity by ensuring that changes to data files (where tables and indexes reside) are written only after log records describing the changes have been flushed to permanent storage.
Data written to WAL per second.
Number of WAL segment files stored in the pg_wal directory under the data directory.
Statistics about WAL file archiving.
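PostgreSQL exposes these counters in pg_stat_archiver; a non-zero and growing failed_count is worth investigating:

```sql
-- WAL archiving health: how many segments were archived or failed, and when.
SELECT archived_count,
       last_archived_wal,
       last_archived_time,
       failed_count,
       last_failed_wal,
       last_failed_time
FROM pg_stat_archiver;
```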
The count of PostgreSQL databases being monitored by this Netdata job.
The count of different catalog relations.
The size of different catalog relations.
The time elapsed since the Postgres process was started.
Want a personalised demo of Netdata for your use case?