Monitoring & Alerting Checklist (2026)

1. Tell us where to send it
Your name and work email, nothing more.
2. Check your inbox
Your checklist arrives in seconds, not days.
3. Use it with your team
Editable and ready to share. Make it your own.

A peek inside

See exactly what you're getting

Free PDF

Spotsaas · 2026

✓ What to monitor, by layer

✓ Alert thresholds and severity

✓ Make alerts actionable, not noisy

What Is Monitoring & Alerting?

The Monitoring & Alerting Checklist is the set of signals a production DBA actually watches — saturation, workload health, durability and replication, and the slow-burn problems that page you at 3am if you do not catch them at noon. It is organized by layer, with each metric meant to carry a threshold and an owner, so your first warning of trouble is a graph rather than an outage. The checklist's stance is that monitoring is only useful when every signal is tied to a meaningful threshold and a documented response.

The four monitoring layers map to how databases actually fail. Saturation covers the checklist's database adaptation of the four golden signals: CPU and run-queue length; memory and buffer/cache hit ratio; disk free space, IOPS, and queue depth; and active connections versus the configured maximum. Sustained pressure on any of these means you are one spike from queueing. Workload health covers slow-query rate and p95/p99 latency, lock waits and deadlocks, transaction throughput and rollback ratio, and long-running or idle-in-transaction sessions that hold locks. Durability and replication covers replication lag against your RPO, last successful backup and restore-test times, WAL/redo generation and archive backlog, and replica failover readiness.

The fourth layer, maintenance and growth, catches the quiet killers: table and index bloat, transaction-ID wraparound age in PostgreSQL (a warning sign once autovacuum-to-prevent-wraparound starts firing, and an emergency if the age keeps climbing after that), unused indexes that slow writes, and database growth rate against the provisioned ceiling. The checklist pairs these with an alert-thresholds-and-severity table and a 'make alerts actionable, not noisy' Q&A that insists every alert have an owner and a runbook link, that you alert on user-felt symptoms rather than raw resource numbers, and that you alert on trends rather than only static thresholds.

What Monitoring & Alerting Is Used For

Teams use the monitoring checklist to build observability that warns them before users feel pain, and to keep their alerting signal-rich rather than noisy. The concrete jobs it does:

✓ Covering the four golden signals of saturation, each with a threshold: CPU and run-queue; memory and cache hit ratio; disk free space, IOPS, and queue depth; and connection utilization against max_connections.
✓ Tracking workload health — slow-query rate and p95/p99 latency, lock waits and deadlock count, transaction throughput and rollback ratio, and long-running or idle-in-transaction sessions that hold locks and block cleanup.
✓ Watching durability and replication — replication lag in seconds and bytes against your RPO, last successful backup and restore-test times, WAL/redo generation and archive backlog, and replica failover readiness.
✓ Catching slow-burn maintenance issues — table and index bloat, PostgreSQL transaction-ID wraparound age, unused indexes, and database growth rate versus the provisioned ceiling.
✓ Setting severity-appropriate alert thresholds from the included table, so paging events are reserved for things a human must act on now.
✓ Making alerts actionable — ensuring every alert has an owner and a runbook link, so an alert is a call to action rather than noise that trains on-call to ignore the channel.
✓ Alerting on symptoms and trends — paging on p99 latency and error rate rather than raw CPU, and on rate-of-change (disk gaining 10 points in an hour) rather than only static thresholds.
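
The severity and ownership items above can be sketched as a small rule table. This is a minimal illustration, not the checklist's own table: the `AlertRule` class, the threshold values, the `dba-team` owner, and the `runbooks.example.com` URL are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertRule:
    """One row of a hypothetical thresholds-and-severity table."""
    metric: str
    warn_at: float   # open a ticket at or above this value
    page_at: float   # page a human at or above this value
    owner: str       # team or person responsible for the response
    runbook: str     # link to the documented response

    def severity(self, value: float) -> Optional[str]:
        """Return 'page', 'warn', or None for a sampled value."""
        if value >= self.page_at:
            return "page"
        if value >= self.warn_at:
            return "warn"
        return None


def unactionable(rules: list[AlertRule]) -> list[str]:
    """List metrics whose rule lacks an owner or a runbook link."""
    return [r.metric for r in rules
            if not r.owner.strip() or not r.runbook.strip()]


# Illustrative values only; the real numbers belong in your own table.
RULES = [
    AlertRule("connection_utilization_pct", 80.0, 95.0, "dba-team",
              "https://runbooks.example.com/connections"),
    AlertRule("disk_used_pct", 70.0, 90.0, "", ""),
]
```

Running `unactionable(RULES)` in a CI check is one way to stop an alert from shipping without an owner and a runbook.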

Who Uses Monitoring & Alerting

Database monitoring is owned by the people who keep production healthy and who get paged when it is not, so the checklist is written for operators first.

Database administrators (DBAs): They define the thresholds for saturation, replication lag, and bloat, and they own the maintenance signals, such as wraparound age and dead-tuple ratio, that only a DBA tends to watch.

Site reliability engineers (SREs): They build the alerting on p95/p99 latency and error rate, tune it for signal over noise, and ensure every alert maps to a runbook so on-call can act immediately.

On-call engineers: They are the ones the checklist protects. Well-designed alerts with owners and runbooks turn a 3am page into a clear action rather than a panicked investigation from scratch.

Platform and DevOps engineers: They wire the database metrics into the observability stack and dashboards, and connect connection-pool and IOPS saturation signals to the autoscaling or pooling that responds to them.

Engineering managers: They review the rollback ratio, latency tails, and growth trends as leading indicators of reliability and capacity risk, using the checklist to confirm coverage is complete.

Monitoring & Alerting: Context & Good to Know

The core philosophy of this checklist is that good monitoring catches problems while you can still act calmly. Disk at 70% is fine; disk that gained ten points in an hour will be full by morning — so the checklist pushes rate-of-change alerts over static thresholds, because the goal is to be warned at noon about the thing that would page you at 3am. That distinction between a leading indicator and a lagging one runs through every layer, from saturation trends to backup-age tracking.
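
The disk example above comes down to simple arithmetic, sketched here as a linear projection. The function names and the 12-hour horizon are assumptions for illustration; real growth is rarely perfectly linear, so treat the result as a rough estimate.

```python
from typing import Optional


def hours_until_full(used_pct_then: float, used_pct_now: float,
                     hours_between: float) -> Optional[float]:
    """Linear projection of hours until the disk reaches 100%.

    Returns None when usage is flat or shrinking.
    """
    if hours_between <= 0:
        raise ValueError("hours_between must be positive")
    rate = (used_pct_now - used_pct_then) / hours_between  # points per hour
    if rate <= 0:
        return None
    return (100.0 - used_pct_now) / rate


def should_alert(used_pct_then: float, used_pct_now: float,
                 hours_between: float, horizon_hours: float = 12.0) -> bool:
    """Alert when the projected time to full falls inside the horizon."""
    eta = hours_until_full(used_pct_then, used_pct_now, hours_between)
    return eta is not None and eta <= horizon_hours
```

A disk that went from 60% to 70% in one hour projects to full in three hours, so it alerts. A disk that has sat at 70% does not, even though a static 70% threshold would treat the two identically.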

Alert fatigue is the failure mode the checklist works hardest to prevent. An alert with no documented response is just noise that trains the on-call to ignore the channel, so the 'actionable, not noisy' Q&A insists every alert have an owner and a runbook link, and that you page on symptoms users feel — p99 latency and error rate — rather than raw resource numbers. High CPU with fine latency is not an incident; treating it as one erodes trust in the whole alerting system.
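
One way to encode "page on symptoms, not resources" is shown below. It is a sketch under stated assumptions: the `Snapshot` fields and all three limits are hypothetical and should be replaced with your own service-level objectives.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    p99_latency_ms: float
    error_rate: float  # fraction of requests failing, 0.0 to 1.0
    cpu_pct: float


# Hypothetical limits; tune them to your own objectives.
P99_LIMIT_MS = 500.0
ERROR_RATE_LIMIT = 0.01
CPU_NOTE_PCT = 80.0


def decide(s: Snapshot) -> str:
    """Return 'page', 'ticket', or 'ok' for one snapshot."""
    # User-felt symptoms are the only conditions that page.
    if s.p99_latency_ms > P99_LIMIT_MS or s.error_rate > ERROR_RATE_LIMIT:
        return "page"
    # High CPU with healthy latency is worth a look, not a wake-up call.
    if s.cpu_pct > CPU_NOTE_PCT:
        return "ticket"
    return "ok"
```

With these limits, 95% CPU alongside a 120 ms p99 yields a ticket, while a 900 ms p99 pages even at 40% CPU.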

Some of the most dangerous database problems are the quiet, slow-burn ones that this checklist deliberately surfaces. PostgreSQL transaction-ID wraparound is the clearest case: if the oldest unfrozen transaction ID keeps aging after autovacuum-to-prevent-wraparound has started firing, the database can eventually refuse writes to protect itself, an avoidable emergency if you watch the age metric. Table and index bloat silently wastes I/O and drives unnecessary VACUUM load. An untested backup is a hope, not a backup. These are precisely the signals that do not announce themselves until it is too late, which is why they belong on a checklist.

Spotsaas includes this checklist in its database-management resources because operational observability is a real differentiator between database engines and how teams run them. Whether a team operates PostgreSQL, MySQL, MongoDB, or a managed engine like Amazon Aurora that exposes many of these metrics out of the box, the discipline of tying every signal to a threshold, an owner, and a runbook is what separates a database that pages you with warnings from one that pages you with outages.

✓ Independent · vendors can't pay to rank

Built on verified data, not vendor spin

Every Spotsaas resource draws on the SpotScore — a blend of verified review ratings, review volume, and feature depth across 89 database management software tools. Refreshed regularly; data as of June 2026.

FAQ

Questions, answered

What are the four golden signals for database monitoring?

Adapted to databases, they are saturation across CPU and run-queue, memory and buffer/cache hit ratio, disk (free space, IOPS, queue depth), and connection utilization against max_connections. Sustained pressure on any of these — say CPU above 80% or connections above 80% of max — means you are one traffic spike away from queueing or hard refusals, so each should carry a threshold and an owner.
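
Two of these saturation ratios are easy to compute from raw counters, as the sketch below shows. The function names are illustrative. The 80% default comes from the example above, and the hit and read counts are assumed to come from whatever buffer statistics your engine exposes.

```python
from typing import Optional


def connection_utilization(active: int, max_connections: int) -> float:
    """Fraction of the connection limit currently in use."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return active / max_connections


def cache_hit_ratio(blocks_hit: int, blocks_read: int) -> Optional[float]:
    """Share of block requests served from cache; None if there was no traffic."""
    total = blocks_hit + blocks_read
    if total == 0:
        return None
    return blocks_hit / total


def connections_saturated(active: int, max_connections: int,
                          threshold: float = 0.80) -> bool:
    """True when utilization is above the threshold (80% by default)."""
    return connection_utilization(active, max_connections) > threshold
```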

What is the most important database metric to monitor?

There is no single one, but the highest-value signals are p95/p99 query latency and error rate, because they reflect what users actually feel, and replication lag, because it directly threatens your RPO. The checklist's guidance is to page on user-felt symptoms and durability risks, and use resource metrics like CPU as supporting diagnosis rather than standalone alerts.
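
For readers who have not computed a tail latency by hand, here is a minimal nearest-rank percentile. It is one of several common percentile definitions, and monitoring systems often use interpolated or histogram-based estimates instead, so treat it as an illustration rather than the method any particular tool uses.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

Calling `percentile(latencies_ms, 99)` on a window of query latencies gives the p99 value to compare against a latency objective.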

Why alert on trends instead of static thresholds?

Because a static threshold tells you when you have already hit a wall, while a rate-of-change alert warns you while you can still act calmly. Disk at 70% is fine; disk that gained ten points in an hour will be full by morning. Trend-based alerts catch slow-burn problems — disk growth, replication lag creep, connection ramp — early enough to prevent the 3am page.

What is replication lag and why monitor it?

Replication lag is how far behind a replica is from the primary, measured in seconds and bytes. It matters because it directly threatens your RPO: if the primary fails while lag is high, any writes the replica has not yet received are lost on failover. It also matters because a replica too far behind cannot reliably serve reads. The checklist says to alert before lag exceeds your RPO, not after.
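
Alerting "before, not after" can be expressed as fractions of the RPO. In this sketch the 50% warning and 80% paging fractions are assumptions chosen for illustration, not values from the checklist.

```python
def lag_severity(lag_seconds: float, rpo_seconds: float,
                 warn_fraction: float = 0.5,
                 page_fraction: float = 0.8) -> str:
    """Classify replication lag relative to the recovery point objective."""
    if rpo_seconds <= 0:
        raise ValueError("rpo_seconds must be positive")
    ratio = lag_seconds / rpo_seconds
    if ratio >= page_fraction:
        return "page"   # close to breaching the RPO
    if ratio >= warn_fraction:
        return "warn"   # trending toward the RPO
    return "ok"
```

With a 300-second RPO, 150 seconds of lag warns and 250 seconds pages, both well before the objective itself is breached.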

What is transaction-ID wraparound and why is it dangerous?

In PostgreSQL, transaction IDs are finite and must be periodically frozen by VACUUM; if the oldest unfrozen transaction's age grows too large, autovacuum-to-prevent-wraparound fires and, if unaddressed, the database can refuse writes to protect itself. Monitoring wraparound age turns a potential protective shutdown into a routine maintenance task done well in advance.
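
The wraparound age can be read with a short catalog query and classified as below. The query string and the 200 million default for `autovacuum_freeze_max_age` reflect standard PostgreSQL. The 50% paging fraction and the function names are assumptions for illustration.

```python
# Per-database age of the oldest unfrozen transaction ID in PostgreSQL.
XID_AGE_SQL = "SELECT datname, age(datfrozenxid) AS xid_age FROM pg_database;"

XID_SPACE = 2 ** 31                   # roughly 2.1 billion usable transaction IDs
DEFAULT_FREEZE_MAX_AGE = 200_000_000  # default autovacuum_freeze_max_age


def pct_of_xid_space(xid_age: int) -> float:
    """How much of the usable XID space the current age has consumed."""
    return 100.0 * xid_age / XID_SPACE


def xid_status(xid_age: int,
               freeze_max_age: int = DEFAULT_FREEZE_MAX_AGE,
               page_fraction: float = 0.5) -> str:
    """Classify wraparound age (page_fraction is a hypothetical choice)."""
    if xid_age >= XID_SPACE * page_fraction:
        return "page"   # far past the point where autovacuum should have caught up
    if xid_age >= freeze_max_age:
        return "warn"   # anti-wraparound autovacuum should now be running
    return "ok"
```

Feeding the `xid_age` column from `XID_AGE_SQL` into `xid_status` turns the age into a routine signal long before the database would stop accepting writes.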

How do you avoid alert fatigue?

Give every alert an owner and a runbook link, and page only on things a human must act on now. Alert on symptoms users feel — p99 latency, error rate — rather than raw resource numbers, since high CPU with fine latency is not an incident. An alert with no documented response just trains on-call to ignore the channel, so prune anything that fires without a clear action.

What is an example of database software with built-in monitoring?

PostgreSQL exposes pg_stat_* views and supports tools like pg_stat_statements; MySQL has the Performance Schema; Oracle Database offers AWR and Oracle Enterprise Manager; and managed services like Amazon Aurora surface saturation, replication, and workload metrics through their consoles. The checklist's layers — saturation, workload, durability, maintenance — apply across all of them regardless of the specific tooling.

What does a rising rollback ratio indicate?

A climbing rollback ratio — the share of transactions that roll back rather than commit — usually points to contention (transactions failing on lock conflicts or deadlocks) or application errors causing aborts. It is a workload-health signal worth watching alongside deadlock count and lock waits, because a rising deadlock rate often signals a transaction-ordering bug rather than simple load.
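
The ratio is best computed over an interval from two samples of cumulative counters, as sketched below. In PostgreSQL the `xact_commit` and `xact_rollback` columns of `pg_stat_database` are one such source. The function name and the handling of counter resets are assumptions.

```python
from typing import Optional


def rollback_ratio(commits_then: int, rollbacks_then: int,
                   commits_now: int, rollbacks_now: int) -> Optional[float]:
    """Share of transactions that rolled back between two counter samples.

    Returns None when there was no traffic or the counters were reset.
    """
    commits = commits_now - commits_then
    rollbacks = rollbacks_now - rollbacks_then
    if commits < 0 or rollbacks < 0:
        return None  # counters went backwards, for example after a stats reset
    total = commits + rollbacks
    if total == 0:
        return None
    return rollbacks / total
```

Comparing each interval's ratio against the previous ones gives the trend, which is more informative than any single reading.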

Why monitor backup and restore-test times as metrics?

Because an untested backup is a hope, not a backup. Tracking the last successful backup time catches a silently failing backup job, and tracking the last successful restore-test time ensures the backups are actually recoverable. The checklist treats both as durability signals, since the worst time to discover a broken backup is during a real recovery.
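
Treating both timestamps as metrics can look like the sketch below. The 26-hour and 30-day limits are hypothetical defaults that assume a daily backup and a monthly restore test; adjust them to your own schedule.

```python
from datetime import datetime, timedelta, timezone


def stale_signals(now: datetime,
                  last_backup: datetime,
                  last_restore_test: datetime,
                  max_backup_age: timedelta = timedelta(hours=26),
                  max_restore_test_age: timedelta = timedelta(days=30)) -> list[str]:
    """Return which durability signals are older than their allowed age."""
    stale = []
    if now - last_backup > max_backup_age:
        stale.append("backup")
    if now - last_restore_test > max_restore_test_age:
        stale.append("restore_test")
    return stale


# Example: the backup is fresh, but the last restore test was 45 days ago.
now = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
result = stale_signals(now, now - timedelta(hours=2), now - timedelta(days=45))
```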
