Wal-e and the gotcha: how I nearly lost 50% of my backups…

We all know, often from bad experience, the classic adage: “an untested backup is not a backup”. This small story will illustrate that saying, highlight a big wal-e gotcha, and remind you to test your backups.

TL;DR: wal-e has a “problem” with multiple clusters.

Reminder: what is a PostgreSQL backup?

This question seems silly, but it is actually very important, and the answer is worth repeating.

A backup is a known good copy of a given dataset.

Streaming replication can save you from a server burning (if your replica is not in the same rack), a datacenter room burning (if your replica is not in the same room, lucky you) or a whole datacenter burning (if your replica is in a different datacenter, you’re richer and luckier than most). But it won’t save you from yourself, or from your dear human companions who can and will make mistakes.

You could also do a full pg_dump every day and archive it. It will allow you to restore to a good state, but you could end up losing a lot of good data added since your last backup. (But keep doing logical backups, they have many benefits.)

Instead, archiving the WAL, with regular base backups, will give you the full power of PITR (point-in-time recovery) and a safe, “locked” backup of your whole system that you should be able to rely on. Someone drops a table? You can go back in time to just before the table was dropped…

Since this can be complicated to maintain, especially with tricky storage systems like object stores (Amazon S3 for instance), there are a lot of tools to automate this setup. Here, I will focus on wal-e.

It was a sunny Friday

In my current company, we manage quite a few PostgreSQL clusters. Most systems (Debian, of course) host two clusters, with different PostgreSQL versions. Everything is backed up, of course, using wal-e to S3 (for reasons outside my jurisdiction), plus logical backups.

Since I had a busy week, I decided to spend my Friday on trivial tasks. We had recently migrated one application to PostgreSQL 9.6, ahead of migrating every application to it, and I had not double-checked that every backup was in place. So, before migrating more applications, a check was due. And indeed, a non-critical test environment was not backed up. Not a critical mistake, but a mistake nevertheless.

So I went to the server, launched our wal-e setup (remember, if you have more than one machine, automate things) for the 9.6 cluster, and checked that the WAL was being sent… All was fine. But WALs without a base backup are useless, so I decided to immediately push a base backup.

Patience is not always a virtue

And I launched the wal-e backup-push command. This command requires a few ugly environment variables for the S3 configuration, and the PGDATA folder of the cluster. wal-e has the huge problem of not being chatty. Not at all. So for at least 15 seconds, it said nothing, and I saw no new object in S3. So… why not check what it was doing?
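The invocation looked roughly like this. A minimal sketch assuming an envdir-style setup; the bucket name, key values and paths below are placeholders, not our real configuration:

```shell
# Build a throw-away envdir folder; each file becomes one environment
# variable when the command is run through envdir (values are placeholders).
ENVDIR="$(mktemp -d)/env"
mkdir -p "$ENVDIR"
echo "s3://my-backup-bucket/pg/9.6-main" > "$ENVDIR/WALE_S3_PREFIX"
echo "my-access-key"                     > "$ENVDIR/AWS_ACCESS_KEY_ID"
echo "my-secret-key"                     > "$ENVDIR/AWS_SECRET_ACCESS_KEY"

# The actual push, commented out here since it needs wal-e and S3 access:
# envdir "$ENVDIR" wal-e backup-push /var/lib/postgresql/9.6/main
```

Note that nothing in there names the cluster except the PGDATA path given to backup-push, which is exactly where the trouble starts.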

I did a quick ps faux on the server, and I saw that wal-e had launched an SQL command using psql, a COPY, and was waiting for it… And I saw a COPY operation on my 9.4 cluster, and none on the 9.6 cluster.

“Silly me, I made a typo!”

Ctrl+C, and… wait… No, I did not? What is wrong here?

The first drop of sweat appeared.

I checked my PGDATA folder. /var/lib/postgresql/9.6/main could not be the path of my 9.4 cluster, no way.

So I looked at the psql command launched by wal-e, and I did not see any flag specifying the port of my 9.6 cluster.

Did I forget a configuration file?

No. If you read the whole wal-e documentation, it never specifies anywhere which port to use to connect to the cluster. Never. And we never specified it, since that information is right there in PGDATA.

Wait, does this mean…

Now, the red panic button was armed and ready. I quickly opened the Amazon S3 console and went to our production backups, looking for the backups of our new 9.6 clusters, the ones running on non-standard ports. And the extended_version.txt file scared me:

PostgreSQL 9.4.10 on x86_64-unknown-linux-gnu...

And then I pushed the panic button…

Our restore tests did not yet include these shiny new 9.6 clusters, so we got absolutely no warning about this. So remember: test your backups, right after you create a cluster, when everything should be up and running…

Fixing the issue

I don’t know how you set up your wal-e backups, but we use envdir and a per-cluster folder containing the wal-e settings, which are then pushed into the environment of the wal-e commands. The really easy fix was simply to write the cluster port into a new PGPORT file in each cluster folder, drop the old useless backups (and WALs), and push new backups. And as soon as possible, because this meant that production did not have all the backups it needed. (Don’t rely on only one backup system.)
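Concretely, the fix boiled down to one extra file per cluster folder. A sketch, assuming an envdir layout like ours; the folder path and port number are illustrative:

```shell
# Stand-in for a per-cluster settings folder, e.g. /etc/wal-e.d/9.6-main/env
ENVDIR="$(mktemp -d)"

# One file per variable: libpq honours PGPORT, so psql (and therefore
# wal-e) will connect to the 9.6 cluster instead of the default port 5432.
echo "5433" > "$ENVDIR/PGPORT"

# Then push a fresh base backup (commented out: requires wal-e and S3):
# envdir "$ENVDIR" wal-e backup-push /var/lib/postgresql/9.6/main
```

This works because wal-e simply spawns psql, and psql, like every libpq client, picks up PGPORT from its environment.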

After a few hours, the panic button was disarmed, and it was time to take a closer look at wal-e itself.

Wal-e, what are you doing buddy ?

I went to the wal-e documentation (its GitHub page) and checked again. No reference to the PostgreSQL port anywhere.

I downloaded the latest source code and grepped… No reference there either! The psql-launching code was simply not able to specify a port to connect to… This is a really, really big gotcha. wal-e could have read the PostgreSQL port from postmaster.pid in PGDATA, but it did not.
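For reference, the port really is sitting right there: since PostgreSQL 9.1, the fourth line of postmaster.pid holds the port number. A sketch with a simulated PGDATA (all the values below are fake):

```shell
# Simulate a running cluster's postmaster.pid; since PostgreSQL 9.1 its
# first lines are: PID, data directory, start time, port, socket dir, ...
PGDATA="$(mktemp -d)"
printf '12345\n%s\n1488900000\n5433\n/var/run/postgresql\n' "$PGDATA" \
  > "$PGDATA/postmaster.pid"

# The port is the fourth line, there for the taking:
PORT="$(sed -n '4p' "$PGDATA/postmaster.pid")"
echo "$PORT"   # prints 5433
```

Anything that already knows PGDATA, as wal-e does, could recover the port this way instead of silently connecting to the default one.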

So I wrote a patch and sent it upstream.

https://github.com/wal-e/wal-e/pull/340

I hope it will be merged soon, or that the documentation will be updated to point out this trap waiting for its next victim.

Until then, I wrote this blog post to warn as many people as possible. Test your backups for every cluster, and if you run several clusters on the same machine with wal-e, make sure you have configured it properly.