The argument for spending time setting up database backups is similar to the argument for paying for insurance: it seems like a waste at the outset and is easy to put off, but when things go wrong, you will wish you had it.
This article describes a way to set up scheduled rotating encrypted backups of a PostgreSQL database to AWS S3, in two typical deployment environments:
- a virtual machine scheduled using Linux cron
- a Kubernetes cluster scheduled using CronJob
You can find a GitHub repository containing the script, template environment variables, and Kubernetes config here: cpcwood/psql-backup-s3
Backup Strategy
A database backup and restore strategy should meet a few criteria:
- Separate remote storage - One of the main reasons for database backups is the mitigation of media failure, natural disaster, and user error. It tends to be challenging and costly to recover data in these circumstances, so backups must be kept off-site in a backup destination with reliable storage. A common cloud backup destination is Amazon's AWS S3 service.
- Automation - Backups should be made automatically and at regular intervals. Most systems have a built-in job scheduler, such as Linux's cron and Kubernetes' CronJob.
- Encryption - Database backups are usually dumps of SQL, used to rebuild the database structure and data. Since these dumps are typically plain text, they should be encrypted while stored in the backup destination to prevent sensitive data from being leaked. GnuPG (GPG) is a commonly used implementation of the OpenPGP standard and can encrypt database backups.
- Rotation - Off-site storage has a cost, so keep the number of stored backups to a cost-effective level. One way of doing this is backup rotation. While there are plenty of different backup rotation schemes, a 'first in, first out' scheme, where the oldest backups are removed as new ones are added, is effective and easy to script for S3 storage.
Setup AWS
Create an AWS S3 Bucket
Create a private Amazon AWS S3 bucket to store your database backups: AWS 'create bucket' guide.
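If you prefer the command line, the bucket can also be created and locked down with the AWS CLI; a sketch with placeholder bucket name and region:
aws s3 mb s3://<bucket-name> --region <region>
aws s3api put-public-access-block --bucket <bucket-name> --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true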
Create IAM User
Create an IAM user in your AWS account with access to the S3 bucket created above: AWS 'create user' guide
The script requires list, put, and delete access on the S3 bucket, so the S3 policy JSON attached to the IAM user might look like:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket-name>/*",
        "arn:aws:s3:::<bucket-name>"
      ]
    }
  ]
}
Make sure to download or keep hold of the new user's security credentials so you can add them to the backup script environment later.
Create PGP Keys
On a separate (ideally air-gapped) machine, install GPG so encryption keys can be generated:
apt install gnupg
Then create a pair of public and private encryption keys. Using public-key cryptography means only the public key needs to be present on the server, which helps prevent the database backups from being compromised if the server's environment variables are leaked.
- Generate a keypair using your email for ID:
gpg --gen-key
- Export the public key:
gpg --armor --export <your-email>
- Export the secret key and move it to secure storage, as shown below.
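One way to export the secret key, assuming the same email ID used above (the output file name is just an example):
gpg --armor --export-secret-keys <your-email> > psql-backup-secret-key.asc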
Create Backup Script
Script Requirements
The backup program will need to do the following:
- Check required config is present
- Create a temporary backup directory
- Create database backup
- Compress backup
- Encrypt backup
- Upload backup to off-site storage (AWS S3)
- Remove expired backups from off-site storage
- Cleanup
Write the Script
Since the program needs to interact with several other CLIs, such as pg_dump and aws-cli, a shell script is a good fit. Make a shell script to meet the above requirements, for example:
vim ~/psql-backup-s3.sh
#! /bin/sh
# PSQL Database Backup to AWS S3
echo "Starting PSQL Database Backup..."
# Ensure all required environment variables are present
if [ -z "$GPG_KEY" ] || \
[ -z "$GPG_KEY_ID" ] || \
[ -z "$POSTGRES_PASSWORD" ] || \
[ -z "$POSTGRES_USER" ] || \
[ -z "$POSTGRES_HOST" ] || \
[ -z "$POSTGRES_DB" ] || \
[ -z "$AWS_ACCESS_KEY_ID" ] || \
[ -z "$AWS_SECRET_ACCESS_KEY" ] || \
[ -z "$AWS_DEFAULT_REGION" ] || \
[ -z "$S3_BUCKET" ]; then
>&2 echo 'Required variable unset, database backup failed'
exit 1
fi
# Make sure required binaries are in path (YMMV)
export PATH=/usr/local/bin:$PATH
# Import gpg public key from env
echo "$GPG_KEY" | gpg --batch --import
# Create backup params
backup_dir=$(mktemp -d)
backup_name=$POSTGRES_DB'--'$(date +%d'-'%m'-'%Y'--'%H'-'%M'-'%S).sql.bz2.gpg
backup_path="$backup_dir/$backup_name"
# Create, compress, and encrypt the backup
PGPASSWORD=$POSTGRES_PASSWORD pg_dump -d "$POSTGRES_DB" -U "$POSTGRES_USER" -h "$POSTGRES_HOST" | bzip2 | gpg --batch --recipient "$GPG_KEY_ID" --trust-model always --encrypt --output "$backup_path"
# Check backup created
if [ ! -e "$backup_path" ]; then
  echo 'Backup file not found'
  exit 1
fi
# Push backup to S3
aws s3 cp "$backup_path" "s3://$S3_BUCKET"
status=$?
# Remove tmp backup path
rm -rf "$backup_dir"
# Indicate if backup was successful
if [ $status -eq 0 ]; then
  echo "PSQL database backup: '$backup_name' completed to '$S3_BUCKET'"
  # Remove expired backups from S3
  if [ "$ROTATION_PERIOD" != "" ]; then
    aws s3 ls "$S3_BUCKET" --recursive | while read -r line; do
      stringdate=$(echo "$line" | awk '{print $1" "$2}')
      filedate=$(date -d"$stringdate" +%s)
      olderthan=$(date -d"-$ROTATION_PERIOD days" +%s)
      if [ "$filedate" -lt "$olderthan" ]; then
        filetoremove=$(echo "$line" | awk '{$1=$2=$3=""; print $0}' | sed 's/^[ \t]*//')
        if [ "$filetoremove" != "" ]; then
          aws s3 rm "s3://$S3_BUCKET/$filetoremove"
        fi
      fi
    done
  fi
else
  echo "PSQL database backup: '$backup_name' failed"
  exit 1
fi
Deploy
Depending on your deployment setup, there will be different ways to deploy the script to run the backup job regularly.
Here are the setup methods for two typical deployment types:
- Traditional VM - Linux cron
- Kubernetes - CronJob
Traditional VM
Copy Script to Machine
Copy the script to the virtual machine (VM) into a suitable directory and make sure it is executable for the user who will be running it:
chmod 744 ./psql-backup-s3/psql-backup-s3.sh
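If you are copying the script over from your local machine, scp is one option; a hypothetical example, assuming a ~/psql-backup-s3 directory on the VM:
scp ./psql-backup-s3.sh <user>@<vm-host>:~/psql-backup-s3/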
Install Dependencies
Install the script dependencies on the VM:
- GPG - Install GPG to encrypt backup files:
apt install gnupg
- AWS-CLI - Install the AWS CLI tool to transfer backups to AWS S3: see the AWS guide
- date - Ensure date is the GNU coreutils implementation of date, which is not included by default on minimal distributions such as Alpine Linux (BusyBox):
apt install coreutils
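The script also calls pg_dump, so if the PostgreSQL client tools are not already present on the VM, install them too (the package name shown is for Debian/Ubuntu):
apt install postgresql-client
A quick sanity check that the binaries resolve as expected:
gpg --version
aws --version
date --version
pg_dump --version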
Linux cron
cron is a time-based job scheduler available on most Linux systems, which runs processes at scheduled times.
Config
The backup script gets its configuration from environment variables. The variables required are:
export ROTATION_PERIOD= # days to keep backups (omit to disable backup rotation)
export POSTGRES_USER=
export POSTGRES_PASSWORD=
export POSTGRES_HOST=
export POSTGRES_DB= # name of database to backup
export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export AWS_DEFAULT_REGION=
export S3_BUCKET=
export GPG_KEY_ID= # email id used in gpg key generation
export GPG_KEY= # exported public gpg key
cron jobs do not inherit the same environment as a job run from the command line; instead, their default environment is taken from /etc/environment (read more about why in this IBM article). Therefore, the environment variables required for the backup script need to be loaded in the job definition itself.
One way to do this is to source a shell script using the "dot" command in the crontab.
First, create a script exporting the above environment variables on the VM.
vim ~/psql-backup-s3/psql-backup-s3.env
Since it contains credentials, ensure that only the crontab user can access it:
chmod 700 ~/psql-backup-s3/psql-backup-s3.env
It can then be sourced before the backup job in the crontab, as shown below.
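Before adding the cron entry, it is worth running the backup once by hand with the same environment to catch any misconfiguration early (paths assume the layout used in the crontab example below):
. $HOME/psql-backup-s3/psql-backup-s3.env && $HOME/psql-backup-s3/psql-backup-s3.sh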
Create the cron Job
Add a new cron job using crontab. The job should periodically load the environment variables and then run the backup script. For example, to run the backup daily at 3.30 am:
crontab -e
30 3 * * * . $HOME/psql-backup-s3/psql-backup-s3.env && $HOME/psql-backup-s3/psql-backup-s3.sh 2>&1 | logger -t psql-backup-s3
For more info on setting up a job using crontab, check out Ubuntu's guide here. Crontab guru can be helpful for defining schedules.
Note: When setting up, it can be useful to set the job schedule to a short interval, such as */2 * * * *, so you can check for any misconfiguration or errors.
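Once the job has fired, you can confirm it ran and inspect its output. Since the crontab entry above tags the output with logger -t psql-backup-s3, the messages end up in the system log; where exactly depends on your distribution:
journalctl -t psql-backup-s3
or, on systems that log to a file:
grep psql-backup-s3 /var/log/syslog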
Kubernetes CronJob
Kubernetes CronJob is a built-in feature which allows jobs to be run in containers periodically.
Config
Create a Kubernetes Secret object to store the sensitive credentials for the backup script. For example:
---
apiVersion: v1
kind: Secret
metadata:
  name: psql-backup-s3
data:
  POSTGRES_USER:
  POSTGRES_PASSWORD:
  AWS_ACCESS_KEY_ID:
  AWS_SECRET_ACCESS_KEY:
---
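Note that the values under data in a Secret must be base64-encoded (alternatively, use stringData to provide plain values and let Kubernetes encode them). For example, to encode a value:
echo -n '<value>' | base64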
Create a Kubernetes ConfigMap object to store the non-sensitive configuration details for the script. For example:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: psql-backup-s3
data:
  ROTATION_PERIOD: # days to keep backups (omit to disable backup rotation)
  POSTGRES_HOST: # postgres k8s service hostname
  POSTGRES_DB: # name of database to backup
  AWS_DEFAULT_REGION:
  S3_BUCKET:
  GPG_KEY_ID: # email id used in gpg key generation
  GPG_KEY: # exported public gpg key
---
Make sure to apply the newly created secret and config objects to your cluster in the correct namespace.
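For example, assuming the objects above are saved as psql-backup-s3-secret.yml and psql-backup-s3-config.yml (hypothetical file names):
kubectl apply -f psql-backup-s3-secret.yml -f psql-backup-s3-config.yml -n <namespace>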
Create the CronJob
Create a Kubernetes CronJob object to run the backup job on a schedule. The container image used in the job must include the required dependencies and the backup script. An example Dockerfile can be found in the GitHub repository cpcwood/psql-backup-s3, and the corresponding image can be pulled from Docker Hub under the repository cpcwood/psql-backup-s3. For example, to run the backup daily at 3.30 am:
---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: psql-backup-s3
spec:
  schedule: "30 3 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: psql-backup-s3-container
            image: cpcwood/psql-backup-s3:latest
            envFrom:
            - configMapRef:
                name: psql-backup-s3
            env:
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: psql-backup-s3
                  key: POSTGRES_USER
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: psql-backup-s3
                  key: POSTGRES_PASSWORD
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: psql-backup-s3
                  key: AWS_ACCESS_KEY_ID
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: psql-backup-s3
                  key: AWS_SECRET_ACCESS_KEY
          restartPolicy: Never
---
Note: When setting up, it can be useful to set the job schedule to a short interval, such as */3 * * * *, so you can check for any misconfiguration or errors.
Apply the CronJob object to your cluster.
The job should now run the backup script periodically as scheduled in the object definition.
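To confirm the schedule is registered and to inspect the output of a run, the usual kubectl commands work (the job name below is a placeholder, taken from kubectl get jobs):
kubectl get cronjob psql-backup-s3
kubectl get jobs
kubectl logs job/<job-name>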
Restore
To restore a backup:
- Download the encrypted database dump from AWS S3
- Copy it to the machine containing the private GPG key
- Decrypt the downloaded file using gpg:
gpg --output <decrypted file name>.sql.bz2 --decrypt <downloaded file name>.sql.bz2.gpg
- Move the decrypted file to the server hosting the PostgreSQL database
- Decompress it using bzip2:
bzip2 -d <decrypted file name>.sql.bz2
- Restore the database dump using the psql command, as sketched below; for details, see the documentation on backup dumps for your version of PostgreSQL.
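A minimal restore sketch, assuming the default plain-format dump produced by pg_dump and an existing (empty) target database; connection details are placeholders:
psql -U <user> -h <host> -d <database> -f <decrypted file name>.sql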
Conclusion
The backup script will now be scheduled to run at regular intervals, as defined by your crontab or CronJob.
There are plenty of additional features which could be added to the script in the future, such as success notifications using AWS SNS, or a more complex backup rotation scheme such as grandfather-father-son.
Footnote
Drop me a message through the contact section of this site or via LinkedIn if you have any comments, suggested changes, or see any bugs.