What it is
The Monitoring & Alerting Checklist is the set of signals a production DBA actually watches — saturation, workload health, durability and replication, and the slow-burn problems that page you at 3am if you do not catch them at noon. It is organized by layer, with each metric meant to carry a threshold and an owner, so your first warning of trouble is a graph rather than an outage. The checklist's stance is that monitoring is only useful when every signal is tied to a meaningful threshold and a documented response.
The four monitoring layers map to how databases actually fail. Saturation covers what the checklist calls the four golden signals: CPU and run-queue length; memory and buffer/cache hit ratio; disk free space, IOPS, and queue depth; and active connections versus the configured maximum. Sustained pressure on any of them means you are one spike from queueing. Workload health covers slow-query rate and p95/p99 latency, lock waits and deadlocks, transaction throughput and rollback ratio, and long-running or idle-in-transaction sessions that hold locks. Durability and replication covers replication lag against your RPO, last successful backup and restore-test times, WAL/redo generation and archive backlog, and replica failover readiness.
The fourth layer, maintenance and growth, catches the quiet killers: table and index bloat, transaction-ID wraparound age in PostgreSQL (an emergency if the age keeps climbing after anti-wraparound autovacuum has started, because PostgreSQL eventually refuses new write transactions), unused indexes that slow writes, and database growth rate against the provisioned ceiling. The checklist pairs these with an alert-thresholds-and-severity table and a 'make alerts actionable, not noisy' Q&A that insists every alert have an owner and a runbook link, that you alert on user-felt symptoms rather than raw resource numbers, and that you alert on trends rather than only static thresholds.
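The owner-and-runbook rule is easy to enforce mechanically. The following is a minimal sketch, not the checklist's own tooling: the `AlertRule` fields, the severity labels, the metric names, the thresholds, and the runbook URL are all hypothetical and only illustrate the idea of linting alert definitions before they ship.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AlertRule:
    """One alert definition. All field names here are illustrative assumptions."""
    name: str
    metric: str
    threshold: float
    severity: str          # e.g. "page" or "ticket" (hypothetical levels)
    owner: str = ""
    runbook_url: str = ""


def lint_rules(rules: List[AlertRule]) -> List[str]:
    """Return one message per rule that lacks an owner or a runbook link."""
    problems = []
    for rule in rules:
        if not rule.owner.strip():
            problems.append(f"{rule.name}: missing owner")
        if not rule.runbook_url.strip():
            problems.append(f"{rule.name}: missing runbook link")
    return problems


# Hypothetical rules: the first is complete, the second is not.
rules = [
    AlertRule("replica-lag", "replication_lag_seconds", 30.0, "page",
              owner="dba-oncall",
              runbook_url="https://runbooks.example/replica-lag"),
    AlertRule("disk-free", "disk_free_percent", 15.0, "ticket"),
]
problems = lint_rules(rules)
for line in problems:
    print(line)
```

Running a check like this in CI would keep an alert without a documented response from ever reaching the on-call channel.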
What it's used for
Teams use the monitoring checklist to build observability that warns them before users feel pain, and to keep their alerting signal-rich rather than noisy. The concrete jobs it does:
- ✓ Covering the four golden signals of saturation: CPU and run-queue length; memory and cache hit ratio; disk free space, IOPS, and queue depth; and connection utilization against max_connections, each with a threshold.
- ✓ Tracking workload health — slow-query rate and p95/p99 latency, lock waits and deadlock count, transaction throughput and rollback ratio, and long-running or idle-in-transaction sessions that hold locks and block cleanup.
- ✓ Watching durability and replication — replication lag in seconds and bytes against your RPO, last successful backup and restore-test times, WAL/redo generation and archive backlog, and replica failover readiness.
- ✓ Catching slow-burn maintenance issues — table and index bloat, PostgreSQL transaction-ID wraparound age, unused indexes, and database growth rate versus the provisioned ceiling.
- ✓ Setting severity-appropriate alert thresholds from the included table, so paging events are reserved for things a human must act on now.
- ✓ Making alerts actionable — ensuring every alert has an owner and a runbook link, so an alert is a call to action rather than noise that trains on-call to ignore the channel.
- ✓ Alerting on symptoms and trends: paging on p99 latency and error rate rather than raw CPU, and on rate-of-change (disk usage gaining 10 percentage points in an hour) rather than only static thresholds.
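The durability items in the list above reduce to a few comparisons against budgets you set yourself. The sketch below assumes you already collect replication lag and the timestamps of the last successful backup and restore test; the function name and the default age limits (26 hours and 30 days) are hypothetical placeholders, not values from the checklist.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def durability_findings(
    lag_seconds: float,
    rpo_seconds: float,
    last_backup: datetime,
    last_restore_test: datetime,
    now: datetime,
    max_backup_age: timedelta = timedelta(hours=26),       # assumed budget
    max_restore_test_age: timedelta = timedelta(days=30),  # assumed budget
) -> List[str]:
    """Compare durability signals against their budgets and list violations."""
    findings = []
    if lag_seconds > rpo_seconds:
        findings.append("replication lag exceeds RPO")
    if now - last_backup > max_backup_age:
        findings.append("last successful backup is too old")
    if now - last_restore_test > max_restore_test_age:
        findings.append("restore test is overdue")
    return findings


# Example with a fixed clock so the result is reproducible.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
findings = durability_findings(
    lag_seconds=45.0,
    rpo_seconds=60.0,
    last_backup=now - timedelta(hours=30),
    last_restore_test=now - timedelta(days=7),
    now=now,
)
print(findings)
```

Here replication lag is inside the RPO and the restore test is recent, so only the stale backup is reported.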
Who uses it
Database monitoring is owned by the people who keep production healthy and who get paged when it is not, so the checklist is written for operators first.
Context & good to know
The core philosophy of this checklist is that good monitoring catches problems while you can still act calmly. Disk at 70% is fine; disk that gained ten percentage points in an hour will be full within hours. The checklist therefore pushes rate-of-change alerts in addition to static thresholds, because the goal is to be warned at noon about the thing that would page you at 3am. That distinction between a leading indicator and a lagging one runs through every layer, from saturation trends to backup-age tracking.
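The disk example can be worked through as a small projection. This is a sketch under a strong assumption, namely that growth stays linear between two samples; the function names and the 24-hour warning horizon are hypothetical.

```python
from typing import Optional


def hours_until_full(earlier_pct: float, later_pct: float,
                     hours_between: float) -> Optional[float]:
    """Linearly extrapolate disk usage; return None if usage is not growing."""
    if hours_between <= 0:
        raise ValueError("hours_between must be positive")
    growth_per_hour = (later_pct - earlier_pct) / hours_between
    if growth_per_hour <= 0:
        return None
    return (100.0 - later_pct) / growth_per_hour


def should_warn(earlier_pct: float, later_pct: float, hours_between: float,
                horizon_hours: float = 24.0) -> bool:
    """Warn when the projected time to full falls inside the horizon."""
    eta = hours_until_full(earlier_pct, later_pct, hours_between)
    return eta is not None and eta <= horizon_hours


# Disk went from 60% to 70% in one hour: 30 points left at 10 points per hour.
print(hours_until_full(60.0, 70.0, 1.0))  # 3.0
# Disk steady at 70%: no projected fill time, so no warning.
print(should_warn(70.0, 70.0, 1.0))       # False
```

A static 80% threshold would stay silent in the first case, while the projection flags it three hours ahead.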
Alert fatigue is the failure mode the checklist works hardest to prevent. An alert with no documented response is just noise that trains the on-call to ignore the channel, so the 'actionable, not noisy' Q&A insists every alert have an owner and a runbook link, and that you page on symptoms users feel — p99 latency and error rate — rather than raw resource numbers. High CPU with fine latency is not an incident; treating it as one erodes trust in the whole alerting system.
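One way to encode "page on symptoms, not resources" is to let only user-felt signals reach the pager. The sketch below is illustrative: the `Snapshot` fields, the threshold defaults, and the choice to route high CPU to a non-paging ticket are assumptions, not rules taken from the checklist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """A point-in-time reading. Field names are hypothetical."""
    p99_latency_ms: float
    error_rate: float    # fraction of failed requests, 0.0 to 1.0
    cpu_percent: float


def classify(s: Snapshot,
             latency_slo_ms: float = 500.0,   # assumed latency objective
             error_budget: float = 0.01,      # assumed tolerable error rate
             cpu_watch: float = 90.0) -> str:
    """Page only on symptoms users feel; raw CPU alone never pages."""
    if s.p99_latency_ms > latency_slo_ms or s.error_rate > error_budget:
        return "page"
    if s.cpu_percent > cpu_watch:
        return "ticket"  # worth a look in working hours, not an incident
    return "ok"


# High CPU with healthy latency and error rate does not page.
print(classify(Snapshot(p99_latency_ms=120.0, error_rate=0.001, cpu_percent=97.0)))
# Slow p99 pages even though CPU looks idle.
print(classify(Snapshot(p99_latency_ms=900.0, error_rate=0.001, cpu_percent=40.0)))
```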
Some of the most dangerous database problems are the quiet, slow-burn ones that this checklist deliberately surfaces. PostgreSQL transaction-ID wraparound is the clearest case: anti-wraparound autovacuum is the safety net, but if it cannot freeze old rows fast enough and the XID age keeps climbing, PostgreSQL eventually refuses new write transactions to protect existing data, an avoidable emergency if you watch the age metric. Table and index bloat silently wastes I/O and drives unnecessary VACUUM load. An untested backup is a hope, not a backup. These are precisely the signals that do not announce themselves until it is too late, which is why they belong on a checklist.
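Watching the age metric can be sketched as follows. The SQL uses PostgreSQL's `age(datfrozenxid)` on `pg_database`; the Python classifier, its name, and the 50% and 75% cut-offs are hypothetical choices for illustration, so tune them to your own autovacuum settings.

```python
# Query to run against PostgreSQL; it reports the oldest unfrozen XID age per database.
XID_AGE_QUERY = """
SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
"""

# Transaction IDs are compared within a window of 2**31 (about 2.1 billion).
XID_SPACE_LIMIT = 2 ** 31


def wraparound_status(xid_age: int,
                      warn_fraction: float = 0.5,    # assumed cut-off
                      page_fraction: float = 0.75) -> str:
    """Classify an XID age by how much of the wraparound window it has used."""
    used = xid_age / XID_SPACE_LIMIT
    if used >= page_fraction:
        return "page"
    if used >= warn_fraction:
        return "warn"
    return "ok"


# Sample ages as they might come back from the query above.
for age in (150_000_000, 1_200_000_000, 1_800_000_000):
    print(age, wraparound_status(age))
```

The function takes the age as a plain integer, so it can be tested without a database connection and wired to whatever client library you already use.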
Spotsaas includes this checklist in its database-management resources because operational observability is a real differentiator between database engines and how teams run them. Whether a team operates PostgreSQL, MySQL, MongoDB, or a managed engine like Amazon Aurora that exposes many of these metrics out of the box, the discipline of tying every signal to a threshold, an owner, and a runbook is what separates a database that pages you with warnings from one that pages you with outages.