
When Autovacuum Becomes a Performance Risk

PostgreSQL's autovacuum daemon is one of the database's most important subsystems, responsible for reclaiming storage occupied by dead tuples, updating planner statistics, and preventing transaction ID wraparound. Despite its importance, autovacuum configuration is one of the most frequently neglected areas we encounter in database security assessments. The default settings are designed for modest workloads and small tables. In high-transaction environments—systems processing thousands of writes per second across tables with tens of millions of rows—these defaults create a cascade of availability and performance risks that, left unaddressed, can culminate in forced database shutdown.

The Throttle That Starves Itself

The most common misconfiguration we observe involves autovacuum_vacuum_cost_delay and its companion autovacuum_vacuum_cost_limit. These parameters throttle vacuum I/O to prevent it from competing with production queries. With the default delay of 2 milliseconds and an effective cost limit of 200, a vacuum worker pauses for 2 ms after accruing roughly 200 cost units—about 200 buffer hits, or 1.6 MB of 8 kB pages. On a table accumulating dead tuples faster than vacuum can process them at this throttled rate, the table bloats continuously. We have observed tables where the dead tuple count exceeded the live tuple count by a factor of ten, inflating a table from 2 GB to over 20 GB. Sequential scans that had completed in milliseconds took seconds, and index scans degraded because the indexes themselves still referenced dead tuples that had not been vacuumed. For write-heavy tables, we typically recommend reducing autovacuum_vacuum_cost_delay to 0 or 1 millisecond and raising the cost limit to 1000 or higher, applied as per-table storage parameters rather than global settings to avoid over-vacuuming idle tables.
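A per-table override along these lines is one way to apply that recommendation; the table name here is hypothetical, and the values are the ones discussed above:

```sql
-- Hypothetical write-heavy table. Storage parameters override the
-- global autovacuum_* settings for this table only, so idle tables
-- keep the gentler defaults.
ALTER TABLE orders SET (
    autovacuum_vacuum_cost_delay = 1,    -- milliseconds; 0 disables throttling entirely
    autovacuum_vacuum_cost_limit = 1000  -- cost units accrued before each pause
);
```

The change takes effect for the next autovacuum pass on the table; no restart is required.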

The second critical parameter is autovacuum_vacuum_scale_factor, which determines what fraction of a table must consist of dead tuples before vacuum is triggered. The default of 0.2 (20%) is reasonable for a table with 10,000 rows—vacuum triggers after 2,000 dead tuples. But on a table with 100 million rows, vacuum does not trigger until 20 million rows are dead. On tables of this scale, a fixed-threshold approach using autovacuum_vacuum_threshold with a near-zero scale factor is far more appropriate. We commonly recommend autovacuum_vacuum_scale_factor = 0.01 or even 0.005 combined with autovacuum_vacuum_threshold = 1000 for large tables, ensuring vacuum begins working before bloat becomes measurable.
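Autovacuum fires when dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × reltuples, so the two parameters are tuned together. A sketch for a hypothetical 100-million-row table, using the figures above:

```sql
-- Trigger point becomes 1000 + 0.005 * 100,000,000 = 501,000 dead tuples,
-- versus 20,000,000 under the default scale factor of 0.2.
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor = 0.005,
    autovacuum_vacuum_threshold    = 1000
);
```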

Transaction ID Wraparound: The Non-Negotiable Threat

The most severe risk from inadequate vacuum configuration is transaction ID wraparound. PostgreSQL uses 32-bit transaction IDs with a modular arithmetic visibility scheme. Each transaction consumes one ID, and the system can only distinguish approximately two billion transactions into the past. If a table is not vacuumed before the oldest unfrozen transaction ID approaches the wraparound boundary, PostgreSQL will refuse to accept new transactions and shut down to prevent data corruption. This is not a graceful degradation—it is a hard stop that requires manual intervention to resolve. We monitor for this risk by querying pg_stat_user_tables for the last_autovacuum and n_dead_tup columns, and cross-referencing with the age(relfrozenxid) value from pg_class. Any table where the frozen transaction age exceeds 500 million should trigger an immediate alert. At one billion, the situation is urgent. The autovacuum_freeze_max_age parameter (default 200 million) controls when anti-wraparound vacuum is forced, but on systems where regular vacuum is chronically behind, even this safety net may fire too late if the vacuum workers are all busy with other tables.
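One possible form of the monitoring query described above, combining dead-tuple counts and last-vacuum times from pg_stat_user_tables with frozen-XID age from pg_class (the 500 million alert threshold is the one we recommend, not a PostgreSQL default):

```sql
-- Per-table wraparound and bloat check, worst offenders first.
SELECT c.relname,
       s.n_dead_tup,
       s.last_autovacuum,
       age(c.relfrozenxid)             AS xid_age,
       age(c.relfrozenxid) > 500000000 AS needs_alert
FROM pg_class c
JOIN pg_stat_user_tables s ON s.relid = c.oid
ORDER BY age(c.relfrozenxid) DESC;
```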

Our standard recommendation for any PostgreSQL deployment handling more than a few hundred transactions per second is to establish continuous monitoring of three metrics: dead tuple ratio per table (from pg_stat_user_tables), frozen transaction age per table (from pg_class), and autovacuum worker utilisation (from pg_stat_activity where backend_type = 'autovacuum worker'). These should feed into the same alerting infrastructure as CPU and memory monitoring, because an autovacuum that cannot keep pace is as much an availability threat as a saturated disk. The security dimension is direct: table bloat causes query timeouts, which cause application-layer retries, which amplify the load, which accelerates bloat further. This positive feedback loop is indistinguishable from a denial-of-service attack in its effects, and it originates from a configuration that shipped as the default.
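The third metric can be sampled with a query like the following; comparing the row count against autovacuum_max_workers (default 3) shows whether the worker pool is saturated, and long-running entries indicate workers stuck on large tables:

```sql
-- Currently active autovacuum workers; the query column names
-- the table each worker is processing.
SELECT pid,
       query,
       xact_start,
       now() - xact_start AS running_for
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';
```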