
PostgreSQL backup notes

Some of my very rough notes from reading the PostgreSQL backup and restore documentation section nearly cover-to-cover. It’s one of the most detailed pieces of documentation I’ve ever read, so this acts as a very high-level summary. Most of the content is useful general context for data backups of any kind.

I’m using the /data directory to signify the data storage directory. The actual location depends on the chosen configuration, however.

Backups

Three main types of backup strategies:

  • SQL dump/load (stop-the-world)
  • Backup /data directory (stop-the-world)
  • Continuous Archiving

The first two typically need lots of extra space on the database server to store the backup before you can upload it to some off-site storage.

“stop-the-world” in this context is not official nomenclature. In these strategies, the database most likely needs to be shut down at some point.

Continuous archive-based backups are used for a leader-follower setup, or even delta backups: a base backup, plus subsequent data as deltas, which together can be used to restore the entire database. Shipping the archives to remote storage is supported by the database itself and is quite simple, so this is the best strategy if the database servers are space-constrained.

SQL Dump/Load

This is the easiest strategy. pg_dump takes a backup, while the pg_restore command consumes the output of that backup. This is the simplest strategy to cron-ify, without external dependencies: take a backup, upload the files to remote storage, test the backup on a different machine, and do this every night.

  • pg_dump saves the database into a .sql script of statements. Requires enough space to hold both the database and the backup script.

  • File sizes might be limited by the kernel/OS, so that’s something to watch out for when deciding to use this.

  • Restoring from the pg_dump output might also need some configuration tweaking around connection timeouts: too low, and the database might close the connection before the entire script has run.
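A hedged sketch of the cron-ified version of this strategy; the database name, paths, and remote bucket below are all placeholders, and the upload step assumes some CLI uploader is available:

```shell
# Nightly dump-and-ship sketch; every name here is a placeholder.
DB="app"
STAMP=$(date +%Y-%m-%d)
OUT="/backups/${DB}-${STAMP}.dump"

# The custom format is compressed, and is what pg_restore consumes.
pg_dump --format=custom --file="$OUT" "$DB"

# Ship it off-site; swap in whatever storage tooling you use.
aws s3 cp "$OUT" "s3://example-backups/postgres/"
```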

Copy /data directory

PostgreSQL’s directory layout is straight-forward—once you get to know it. Most of the data is put in one directory, and this includes the two main components needed for any future restores: the data files, and the temporary append-only log files. If the database is shut down, you’re free to copy over the data directory to another machine, and start off from it. Configuration files typically aren’t placed in the data directory, so they might need to be copied as well.

These strategies rely on the file-system layout of PostgreSQL, or on features provided by the file system itself.

Two routes here: frozen snapshots of the file system, or using tools like rsync, tar etc.

frozen snapshots

  • If the underlying file system supports atomic volume snapshots (btrfs, zfs, Apple’s APFS, for example), one can snapshot the entire data directory. There are lots of caveats around how well the snapshot mechanism is implemented.

  • The backup can be taken without stopping the server. During restore, this strategy requires replaying the logs, as there might be some commits that weren’t yet flushed from the append-only log into the data files.
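As a sketch of the snapshot route on ZFS (the dataset and host names are invented; this assumes the data directory and the WAL live on the same dataset, so one snapshot captures both atomically):

```shell
# Snapshot the dataset backing /data while the server keeps running.
zfs snapshot tank/pgdata@nightly

# Stream the snapshot to another machine for safe keeping.
zfs send tank/pgdata@nightly | ssh backup-host zfs recv backuppool/pgdata
```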

rsync, tar, et al.

  • rsync, gzip, or tar the data directory. These utilities don’t take consistent snapshots of the disk, so it’s best to shut down the server first. Shutting down the server forces a full flush of the data to disk.

  • An example two-step process with rsync:

# first pass, while the server is still running (backup-host is a placeholder)
rsync -a /data/ backup-host:/backups/data/
# shut down the server, then copy over only what changed
pg_ctl stop -D /data
rsync -a --checksum /data/ backup-host:/backups/data/

What’s interesting is that this two-step process is similar to the one used in the online backups section. If we stretch it enough, this is like a crude delta-backup process: the first rsync is a base backup that contains the data committed up to that point, and the second rsync takes the delta and copies that over.

Continuous Archiving

This system can be used to set up a replicated system, consisting of a leader and potentially multiple followers. Data from the leader is pushed somewhere, and each follower can pull it from there. Where this data is stored is customisable. There are many ways to set up replication in PostgreSQL, and the documentation for it is exhaustive. The archiving piece deals with the first half: taking the backup and pushing it somewhere.

This strategy piggy-backs on the fact that the WAL may be used to replay and restore a database. There are many caveats and configuration tweaks around how long the WAL files are retained, their size, and their naming. It’s best to ship the log files to an external storage service as and when they are created. Rather than doing this manually via rsync et al., PostgreSQL provides a hook: the archive_command setting in the configuration, which takes a shell command.
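The minimal example from the PostgreSQL docs copies each finished segment into an archive directory; %p expands to the path of the WAL file, and %f to just its file name. The directory here is a placeholder, and a remote-storage setup would swap cp for an upload tool:

```
# postgresql.conf fragment
archive_mode = on
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
```

The test ! -f guard is what makes the command refuse to overwrite an already-archived file, and its non-zero exit doubles as the failure signal that triggers a retry.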

  • WAL logs should be secured in transit and in remote storage, because they contain the actual data. (That goes for the main database too, fwiw.)

  • archive_command should exit with code 0. Otherwise the command gets retried, the pg_wal directory may eventually fill up, and that can cause the server to crash!

  • archive_command should be designed to ensure it doesn’t overwrite existing files on the remote system.

  • Missing WAL logs in the archive might hamper future restores, so regular base backups help keep the error surface area small.

  • An old base backup plus too many WAL logs to replay increases the restore time. It’s important to do the maths here to figure out how much downtime you might need, and to tweak the base backup frequency and WAL file size accordingly.
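A back-of-the-envelope version of that maths, with all numbers invented: a 50 GB base backup copied at 100 MB/s, plus 2,000 WAL segments replayed at 50 segments a second:

```shell
BASE_MB=$((50 * 1024))   # size of the base backup, in MB (invented)
COPY_RATE_MB_S=100       # network copy rate, MB/s (invented)
WAL_SEGMENTS=2000        # segments accumulated since the base backup
REPLAY_RATE_SEG_S=50     # segments replayed per second (invented)

COPY_SECS=$((BASE_MB / COPY_RATE_MB_S))
REPLAY_SECS=$((WAL_SEGMENTS / REPLAY_RATE_SEG_S))
echo "estimated downtime: $((COPY_SECS + REPLAY_SECS)) seconds"
```

Here that’s 512 + 40 seconds, roughly nine minutes of downtime: more frequent base backups mostly shrink the replay term, while a faster pipe shrinks the copy term.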

General mechanism:

  • One base backup as a starting point
  • Continuous deltas in the form of the append-only log (WAL) files

The base backup marks the point where the backup would start (checkpoint).

base backup

Two ways to take a base backup:

  • Use the pg_basebackup command from an external machine (or the same machine, with a different data directory setting), providing the connection info to connect to the leader.

    • Multiple commands can be run from multiple machines, but might depend on replication slots configuration on the leader.

    • Might use one or two connections depending on the variant of backup used: copy the WAL logs at the end (one connection) or stream them in parallel (two).

    • Does not run if the target /data directory is not empty.

  • A two-step process via rsync. PostgreSQL provides two SQL functions for signalling to the server that the user is taking a backup, and that a checkpoint has to be created: pg_start_backup and pg_stop_backup.

SELECT pg_start_backup('some_label');
rsync -a /data/ backup-host:/backups/base/
SELECT pg_stop_backup();
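For the first route, a single pg_basebackup invocation might look like this (host and user names are placeholders; --wal-method=stream is the two-connection variant that streams the WAL in parallel):

```shell
pg_basebackup --host=leader.internal --username=replicator \
    --pgdata=/data --wal-method=stream --progress
```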

Restore

stop-the-world restores

Data dumps taken with pg_dump, or the file-system copies mentioned above, can be restored with pg_restore or by just starting the server, respectively. Needless to say, this strategy causes either data loss or downtime, depending on the operations chosen.

  • If a system has a simple pg_dump cron job that ships the archive to remote storage, then when the leader crashes or dies, the downtime is the sum of the time to detect the failure, copy the archive over to the new machine, and complete pg_restore.

  • If the cron job is configured to run at a certain time of day, everything written to the leader between that run and the crash is a potential loss in data.

  • When the leader crashes or dies but you still have access to the physical data disks, recovery using a file-system snapshot is possible, and it may potentially recover all the data up till the last commit. Because this recovery also has the WAL files handy, the replay will make sure as much data as possible is recovered.
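The restore half of the simple pg_dump cron job is roughly the mirror image; the database name and file path here are placeholders:

```shell
# Create an empty database, then load the custom-format dump into it.
createdb app
pg_restore --dbname=app /backups/app-latest.dump
```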

Continuous Archive restores

If the system is set up with continuous archiving, it may be possible to recover all the data. Restore time depends on how fast the base backup archive and WAL logs can be copied over to the new server, and on the WAL replay itself.
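The replay is driven by restore_command, the mirror image of archive_command: it copies segments back from the archive as the replay asks for them. The docs’ minimal fragment (the archive path is a placeholder; this lives in recovery.conf on older releases, or in postgresql.conf alongside a recovery.signal file on newer ones):

```
restore_command = 'cp /mnt/server/archivedir/%f %p'
```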

Replication

There are many ways to do this, too, depending on the underlying infra: shared disk (two machines accessing the same disk), file system replication (a write to one drive is mirrored to a different machine atomically), side-car middlewares that execute a given statement on multiple machines simultaneously, or even application-level middlewares that do this. Streaming/Point-in-time replication is one preferred approach that can piggy back on the continuous archive backup strategy.

The streaming/point-in-time replication strategy ships WAL logs from the leader using archive_command, and replays them continuously on a follower.

  • Note that streaming replication is possible without archive_command, provided the data-ingestion throughput never exceeds the rate at which the follower can stream the logs directly from the leader and apply them locally (this also depends on the network latency).

    If the follower is not able to keep up with the logs, the logs on the leader might get recycled, and the follower will keep waiting for the now-non-existent WAL file. Force-starting the follower in that case will result in data loss.
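One way around the recycling problem, sketched as a follower-side configuration fragment (host, user, and slot names are invented; on newer releases an empty standby.signal file marks the server as a follower). A replication slot makes the leader retain WAL until this follower has consumed it, at the cost of disk space on the leader, while restore_command acts as the fallback to the archive:

```
primary_conninfo = 'host=leader.internal user=replicator'
primary_slot_name = 'follower_1'
restore_command = 'cp /mnt/server/archivedir/%f %p'
```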