Wal-e and the gotcha, how I nearly lost 50% of my backups…

We all know, often from a bad experience, the classical “an untested backup is not a backup”. This small story will illustrate that situation, highlight a big wal-e gotcha and remind you to test your backups.

TL;DR : wal-e has a “problem” with multiple clusters.

Reminder : what is a PostgreSQL backup ?

This question seems silly, but it is very important, and the answers are worth repeating.

A backup is a known good copy of a given dataset.

Streaming replication can save you from a server burning (if your replica is not in the same rack), a datacenter room burning (if your replica is not in the same room, lucky you) or a datacenter burning (if your replica is in a different datacenter, you’re richer and luckier than most). But it won’t save you from yourself or your dear human companions, who can and will make mistakes.

You could also, every day, do a full pg_dump and archive it. It will allow you to restore to a good state, but you could end up losing a lot of good data added since your last backup. (But keep doing logical backups, they have many benefits)

Instead, backing up the WAL, with regular base backups, will give you the full power of PITR and give you a safe, “locked” backup of your whole system, that you should be able to rely on. Someone drops a table, you can go back in time before the table was dropped…

Since this can be complicated to maintain, especially with tricky storage systems like object storages (amazon S3 for instance), there are a lot of tools to automate this setup. Here, I will focus on wal-e.

It was a sunny Friday

In my current company, we manage quite a few PostgreSQL clusters. Most systems (Debian of course) have two clusters, with different PostgreSQL versions. Everything is backed up of course, using wal-e, in S3 (for reasons out of my jurisdiction), and with logical backups.

Since I had a busy week, I decided to spend my Friday on trivial tasks. We recently had a migration to PostgreSQL 9.6 for one application, before migrating every application to it, and I did not double-check that every backup was set up. So, before migrating more applications, a check was due. And indeed, a non-critical test environment was not backed up: not a critical mistake, but a mistake nevertheless.

So I went on the server, launched the wal-e setup (remember, if you have more than one machine, automate things) for the 9.6 cluster, checked that the WAL was being sent… All was fine. But WALs without a base backup are useless, so I decided to immediately push a basebackup.

Patience is not always a virtue

And I launched the wal-e backup-push command. This command requires a few ugly environment variables for S3 configuration, and the PGDATA folder for the cluster. wal-e has the huge problem of not being chatty. Not at all. So for at least 15 seconds, it said nothing, and I saw no new object on S3. So… Why not check what it is doing ?

I did a small ps faux on the server. And I saw that wal-e launched an SQL command using psql, a COPY, and was waiting for it… And I saw on my 9.4 cluster a COPY operation, and none on the 9.6 cluster.

“Silly me, I made a typo !”

Ctrl+C, and… wait… No, I did not ? What is wrong here ?

The first sweat drop happened.

I checked my PGDATA folder. /var/lib/postgresql/9.6/main could not be the path of my 9.4 cluster, no way.

So I looked at the psql command launched by wal-e, and I did not see any flag to specify the port of my 9.6 cluster.

Did I forget a configuration file ?

No. If you look at the whole wal-e documentation, it never specifies anywhere the port to use to connect to the cluster. Never. And we never specified it, since that information is right in PGDATA.

Wait, does this mean…

Now, the red panic button was armed and ready. I quickly opened the amazon S3 console, and went to our production backups. I went looking for the backups of our new 9.6 clusters, running on non-standard ports. And the extended_version.txt file scared me :

PostgreSQL 9.4.10 on x86_64-unknown-linux-gnu...

And then I pushed on the panic button…

Our restore tests did not yet include these shiny new 9.6 clusters, so we got absolutely no warning about this. So remember: test your backups. Right after you create a cluster, everything should be up and running, restore tests included…
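A cheap sanity check could have caught this mismatch. The sketch below assumes you have already fetched the content of a backup's extended_version.txt as a string; the function names are hypothetical and it does not replace a real restore test.

```python
import re


def backup_major_version(extended_version: str) -> str:
    """Extract the major version from an extended_version.txt string.

    "PostgreSQL 9.4.10 on ..." gives "9.4"; "PostgreSQL 10.3 on ..." gives "10".
    """
    match = re.match(r"PostgreSQL (\d+)(?:\.(\d+))?", extended_version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % extended_version)
    first, second = match.groups()
    # Before PostgreSQL 10, the major version had two components (9.4, 9.6).
    if int(first) >= 10:
        return first
    return "%s.%s" % (first, second)


def backup_matches_cluster(extended_version: str, expected_major: str) -> bool:
    """True when the backup was taken from the cluster version we expect."""
    return backup_major_version(extended_version) == expected_major
```

With the string found in our bucket, backup_matches_cluster("PostgreSQL 9.4.10 on x86_64-unknown-linux-gnu...", "9.6") is False, which is exactly the alarm we were missing.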

Fixing the issue

I don’t know how you set up your wal-e backups, but we use envdir and a per-cluster folder containing the wal-e settings, which are then pushed into the environment of the wal-e commands. The really easy fix was simply to write down the cluster port in a new PGPORT file in each cluster folder, drop the old useless backups (and WALs), and push new backups. And as soon as possible, because this implied that production did not have all the backups it needed. (Don’t rely on only one backup system)
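To avoid forgetting the PGPORT file on the next cluster, the per-cluster folders can be checked automatically. This is a minimal sketch that mimics how envdir reads a directory (one file per variable, first line is the value); the helper names are made up for the illustration, and the handling of empty files is simplified.

```python
import os


def load_envdir(directory: str) -> dict:
    """Read an envdir-style directory into a dict of variable name to value."""
    env = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path) as handle:
            # Only the first line counts; trailing whitespace is dropped.
            first_line = handle.readline().rstrip()
        # Simplification: an empty file is treated as "variable not set".
        if first_line:
            env[name] = first_line
    return env


def missing_settings(directory: str, required=("PGPORT",)) -> list:
    """Return the required variable names that the folder does not define."""
    env = load_envdir(directory)
    return [name for name in required if name not in env]
```

Running missing_settings on every cluster folder and failing loudly on a non-empty result is enough to catch a cluster configured without its port.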

After a few hours, the panic button was disarmed, and it was time for a check of wal-e itself.

Wal-e, what are you doing buddy ?

I went to the wal-e documentation (its github page) and checked again. No reference to the PostgreSQL port anywhere.

I downloaded the latest source code, and grepped… No reference there either ! The psql launching code was just not able to specify a port to connect to… This is a really really big gotcha. While it could read the PostgreSQL port from postmaster.pid in PGDATA, it did not.
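Reading the port from postmaster.pid is not hard. The sketch below is only an illustration of the idea, not the patch sent upstream: it relies on the file layout used since PostgreSQL 9.1 (PID, data directory, start time, port, ...), and the file only exists while the cluster is running.

```python
import os


def port_from_postmaster_pid(pgdata: str) -> int:
    """Return the port of the running cluster whose data directory is pgdata."""
    with open(os.path.join(pgdata, "postmaster.pid")) as handle:
        lines = handle.read().splitlines()
    # Line 1: PID, line 2: data directory, line 3: start time, line 4: port.
    return int(lines[3])
```

The returned value could then be exported as PGPORT before launching psql, so that the connection always targets the cluster that owns PGDATA.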

So I wrote a patch and sent it upstream.


I hope it will be merged soon, or that the documentation will be modified to show this trap waiting for its next victim.

Until then, I wrote this blog post to warn as many people as possible. Test your backups for every cluster, and if you have several clusters on the same machine with wal-e running, make sure you configured it properly.

Using Row Level Security with a Django application


Row Level Security is one of the lesser known great additions to PostgreSQL 9.5. The documentation about it is of PostgreSQL documentation quality of course, but there is a lack of online examples and usages that do not rely on distinct PostgreSQL users for multi-tenant websites. I recently built a proof of concept using RLS to secure access to a specific table in a multi-site Django application, and I will explain here how to do it. It is a very simple trick, but it can improve your security a lot, especially in bigger applications where auditing the whole source code can become tedious.

What is Row Level Security ?

Row Level Security, aka RLS, is a great PostgreSQL feature that allows you, for each (user, table, action) combination, to specify additional rules that restrict access to the rows. The PostgreSQL documentation, and most online documentation, show that feature with SQL users. A table with an “owner” column would be automatically filtered on the criterion owner = current_user. It’s very efficient (the criterion is not applied after the query but pushed into the query, and thus can use indexes if needed), and for a very low maintenance cost you can have security pushed down to the lowest level of your stack.

With RLS, suddenly customer A can no longer steal from customer B, even if there is a SQL injection in your server: the attacker would also need a security issue or misconfiguration on the PostgreSQL side. Of course, you should keep checks in your code, better safe than sorry, but it’s the extra security belt that could save you from the next “data leaks” headlines 🙂

Limitation of the common RLS demonstrations and workaround

Usually, most RLS demos will apply restrictions per SQL user. It’s, by far, the easiest and safest way to do it, because it will be very hard to alter the data used to apply the rules. But it’s not going to work very well for most, if not all, web applications. Web applications seldom use different SQL users per authenticated web user. Sometimes one SQL user is used per “site” (we will see that a bit later), but even that is not the most common deployment strategy. Web applications, with their very short connection lives, rely on poolers like pgbouncer to keep connections to the SQL server alive and save a lot of time for each HTTP request. Doing that for tens or hundreds of different SQL users would be complicated and likely to be expensive.

But there is absolutely nothing in PostgreSQL preventing you from using any function in a policy. We could even have a policy applying restrictions based on the time or the day of the week (a “no work can be done outside work hours” policy). The real “complexity” will be forwarding the needed information from your application to the policy. And PostgreSQL has a solution for that too (new motto of the day ?) : SET SESSION…

So let’s build a small django application and combine all that together to have our security belt against code mistakes.

Our first policy : business_hours

First, let’s see how to create a simple RLS policy, based on my previous business hours idea.

-- First step, create a table...
test=# CREATE TABLE work_queue(id serial primary key, task text);
test=# GRANT SELECT ON work_queue TO employee;
test=# INSERT INTO work_queue(task) VALUES ('long and complicated task');

We are superuser here (notice the # in prompt) and we put a new task in our work_queue. And our employee can read the queue. But he could do that even out of business hours, and we don’t want that. So let’s write a policy and enable it :

test=# CREATE POLICY business_hours ON work_queue FOR ALL TO employee USING (extract('hour' from now()) between 8 and 20);
test=# ALTER TABLE work_queue ENABLE ROW LEVEL SECURITY;

Policies are quite readable : on table work_queue, for all operations (SELECT, INSERT, UPDATE, DELETE) to user employee, filter rows using the time of day. Note that policies, of course, don’t apply to super users…

And now, if our employee selects in the table:

test=> SELECT NOW(); SELECT * FROM work_queue;
              now
-------------------------------
 2017-03-21 23:15:28.560327+01
(1 row)

 id | task
----+------
(0 rows)

And if he comes back tomorrow :

test=> SELECT NOW(); SELECT * FROM work_queue;
              now
-------------------------------
 2017-03-22 10:20:53.382093+01
(1 row)

 id |           task
----+---------------------------
  1 | long and complicated task
(1 row)
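One detail of that policy is easy to miss: BETWEEN is inclusive on both ends, so the rows stay visible during the whole 20:00 hour. The following sketch restates the predicate in Python to make the boundaries explicit. The function name is made up, and it looks at the hour of the datetime it is given, whereas PostgreSQL uses the session time zone.

```python
from datetime import datetime


def within_business_hours(moment: datetime) -> bool:
    """Python equivalent of: extract('hour' from now()) between 8 and 20."""
    # Inclusive on both ends: any time from 08:00:00 to 20:59:59 passes.
    return 8 <= moment.hour <= 20
```

If the intent is to close the queue at 20:00 sharp, the upper bound in the policy should be 19 instead.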

We wrote our first security policy, congrats! Now, let’s start working with Django.

Django and RLS

Understanding the need

In the Django world, a single server, a single process can handle several “sites”. Let’s say you, tinycorp, are selling a SaaS software to your customers megacorp1 and megacorp2. They both want their logo to appear on the login page. So you won’t be able to have a simple app.tinycorp.com domain, otherwise you would have no way to customize the look. You will instead use one vhost per customer, megacorp1.tinycorp.com and megacorp2.tinycorp.com. That’s what Django calls a site.

So here we will work on preventing a leaking page, used by an attacker against megacorp2, from leaking data from megacorp1 too.

What to inject, and how ?

Django stores in memory, for each request, the current site. It’s done in the first steps after a browser connects to the server. This is a simple integer, referencing the django_site table. So we have to do the following SQL call for each request:

SET SESSION django.site TO $id$;

How can we do that without altering every view in the application? It’s easy, most Python/web developers will know the answer: we must write a middleware.

In the Python world, a middleware is a function (or a class that defines what is needed to behave like a function) that is called for each request. The request and the “next” middleware are given as parameters, the response must be returned. For instance, you could have an AuthMiddleware, a CSRFProtectionMiddleware, a WafMiddleware and then your application. This is a very nice mechanism, efficient and simple.
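To make the mechanism concrete without any framework, here is a minimal sketch of a middleware chain. Requests are plain dicts and the response is a string; these stand-ins, the two middlewares and the host-to-site mapping are all invented for the illustration.

```python
def logging_middleware(get_response):
    def middleware(request):
        request.setdefault("trace", []).append("logging: before")
        response = get_response(request)  # call the next layer
        request["trace"].append("logging: after")
        return response
    return middleware


def site_middleware(get_response):
    def middleware(request):
        # Hypothetical lookup: derive the site id from the requested host.
        request["site_id"] = 2 if request["host"].startswith("megacorp2.") else 1
        request.setdefault("trace", []).append("site")
        return get_response(request)
    return middleware


def view(request):
    return "hello site %d" % request["site_id"]


def build_chain(final_view, factories):
    """Wrap the view so that the first factory becomes the outermost layer."""
    handler = final_view
    for factory in reversed(factories):
        handler = factory(handler)
    return handler


handler = build_chain(view, [logging_middleware, site_middleware])
```

Calling handler({"host": "megacorp2.tinycorp.com"}) goes through the logging layer, then the site layer, then the view, and the response travels back out in the reverse order.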

Since this post starts being quite long, I will just post here a working Django Middleware. The code should be obvious to most Python developers.

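from django.contrib.sites.shortcuts import get_current_site
from django.db import connection


class RlsMiddleware(object):
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        current_site = get_current_site(request)
        # Tell PostgreSQL which site this request belongs to.
        with connection.cursor() as cursor:
            cursor.execute('SET SESSION "django.site" = %s;' % int(current_site.id))
        response = self.get_response(request)
        # Reset to a value that matches no site before the connection is reused.
        with connection.cursor() as cursor:
            cursor.execute('SET SESSION "django.site" = -1;')
        return response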
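The middleware above builds its SET statement with string formatting and resets the variable after the response. Two small hardening steps are worth considering: forcing the value through int() so nothing but a number can reach the SQL text, and resetting the variable in a finally block so an exception cannot leave a pooled connection tagged with the previous site. The sketch below is framework-free; execute is assumed to be any callable accepting a SQL string (cursor.execute for instance), and both helper names are hypothetical.

```python
from contextlib import contextmanager


def set_site_statement(site_id) -> str:
    # int() raises ValueError on anything that is not a plain number.
    return 'SET SESSION "django.site" = %d;' % int(site_id)


@contextmanager
def site_scope(execute, site_id):
    """Run a block with django.site set, and always reset it afterwards."""
    execute(set_site_statement(site_id))
    try:
        yield
    finally:
        # -1 matches no row in django_site, so nothing is visible by default.
        execute(set_site_statement(-1))
```

Inside a middleware, the call to get_response would simply be placed in a with site_scope(cursor.execute, current_site.id): block.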

And now, we only have to enable RLS on the tables and restrict the Django SQL user.

ALTER TABLE my_table ENABLE ROW LEVEL SECURITY;
CREATE POLICY "check_django_site" ON my_table
    FOR ALL
    TO "django-user"
    USING ((site_id = (current_setting('django.site'))::integer));

And… that’s all folks! Nothing else is needed, you only have to apply such a security for each table.

Drawbacks and further improvements…

The main drawback of that solution, compared to restrictions based on the SQL user, is that you trust your SQL user to honestly request the proper site. It is far weaker security than distinct SQL users, but like any security, you must do a threat analysis. The threat here is poorly coded views forgetting a check, and some partial SQL injections. Full SQL injections, allowing any SQL query, could still hurt you badly since they could change the django.site variable. But, considering the low cost of that extra security… Why not?

Another drawback is that you must have a separate SQL user for your console Django commands, or celery tasks or whatever you do in your Django application with no user interaction. But it is almost always better to do that anyway, so it’s more like pushing a good practice down your throat 🙂

If you want, you can dig deeper. Much, much deeper. You could set a django.user variable that contains the current user id, and push down most access rules in the database. Again, you are protecting yourself against some programming or logic errors, a full SQL injection would work around that easily…

Thank you all for reading that long explanation, and I hope it will be of use to someone… Especially if I have an account on your website, I don’t want to be in a pwned list please !

Modern C++ stored procedure wrapper

In an application following an intelligent database design, calls to stored procedures happen very often and thus must be done with as little boilerplate as possible.
Usually frameworks abstracting calls to the database are just ORMs, ignoring completely stored procedures, making the database stupid and moving all the logic in the application.

A year ago, I read on planet PostgreSQL (http://tech.zalando.com/posts/zalando-stored-procedure-wrapper-part-i.html) about a simple system built using Java and the Java annotation and reflection system.
A stored procedure can be called using a few lines of interface :

interface BasicExample {
    long computeProduct(@SProcParam int a, @SProcParam int b);
}

Recently, I started planning the development, in my spare time, of a C++/Qt5 application using a PostgreSQL database, and I realized I had no easy way to call stored procedures. Doing a proper database for the application would thus be a huge pain from a C++ point of view, with database calls scattered in the middle of the application… Since my C++ skills needed an update (C++11 and C++14 have been out in the wild for a few years and I never had an opportunity to use the new features they bring), I figured this would be the best time to do it.

C++ does not have (yet… C++17, I have faith in you) the attributes and introspection used in the stored procedure wrapper of Zalando. Instead, C++ has a great compilation-time processing system through the templates. Templates are not just meant for implementing generics, they are a Turing-complete meta-programming language. You can really do a lot of things using them. A lot. For instance, a tuple type working just like a Python tuple, to store a few values of different types side by side. Or implementing a compile-time mathematical function. C++11 and C++14 brought variadic templates, auto and a few other tools that seemed very powerful and could yield great solutions for my problem.

After a few hours of hacking, I had my first interesting results :

SqlBindingMapper<QDateTime> get_now("now");
qDebug() << get_now();

With a few lines to implement the database connection (using QtSql so far, because I plan to write Qt5 applications with it), these two lines are enough to call NOW() in the database and map the result to a QDateTime, the Qt Date-Time representation object.

Of course, returning a single value from an argument-less function, that’s not really interesting. Let’s sum two digits.

SqlBindingMapper<int, int, int> summer("sum");
qDebug() << summer(1, 2);

And this will display 3.

So that’s for one returned record with one field. What about calling generate_series ?

SqlBindingMapper<QList<int>, int, int> generateSeries("generate_series");
for (auto i: generateSeries(1, 10))
    qDebug() << i;

Now, what about the following composite type :

CREATE TYPE card AS (value integer, suit text);
CREATE FUNCTION test_card (a card ) RETURNS integer LANGUAGE SQL AS 
$function$ SELECT $1.value; $function$;

Calling that function in C++ is only requiring you to use std::tuple :

SqlBindingMapper<int, std::tuple<int, QString>> testCard("test_card");
int value = testCard(std::make_tuple(1, "test"));
qDebug() << value;

Qt QObject introspection is also supported and during the FOSDEM I hacked support for arrays (ok, Qt vectors, but STL vectors are as easy to support) :

SqlBindingMapper<int, QVector<int>, int> array_length("array_length");
QVector<int> data;
data << 1 << 2;
qDebug() << "Our dims are :" << array_length(data, 1);


How does all this work behind the scenes ? SqlBindingMapper is a template class that takes a variadic number of parameters, the first one being the return type. It then implements the operator () returning the specified return type and taking the specified parameters. A query is then built (at runtime so far, but this could evolve) with placeholders and the appropriate casting, still using templates. The placeholders are then filled, and after coming back from the database, a SqlQueryResultMapper<T> instance maps the rows to the required objects.
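For readers who do not speak template, the overall flow can be approximated in a few lines of Python. This is only an analogy, not a port of the C++ code: the class name, the exact shape of the generated SQL, the "?" placeholders and the execute callable are all assumptions made for the illustration, and the type work done here at runtime is done by the compiler in the C++ version.

```python
class StoredProcedure:
    """Callable wrapper that builds a parameterised call to a stored procedure."""

    def __init__(self, name, *param_types):
        self.name = name
        self.param_types = param_types  # SQL type names used for the casts

    def build_query(self) -> str:
        # One placeholder per parameter, each with an explicit cast.
        placeholders = ", ".join("?::%s" % t for t in self.param_types)
        return "SELECT * FROM %s(%s)" % (self.name, placeholders)

    def __call__(self, execute, *args):
        if len(args) != len(self.param_types):
            raise TypeError("%s expects %d arguments, got %d"
                            % (self.name, len(self.param_types), len(args)))
        # execute is any callable taking (query, arguments) and returning rows.
        return execute(self.build_query(), args)
```

Here StoredProcedure("sum", "integer", "integer").build_query() yields SELECT * FROM sum(?::integer, ?::integer), and calling the object hands that query and the arguments to whatever database layer is plugged in as execute.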

So far, the code is a crude hack, done for recreational purpose and for my own education. If there is any interest in a proper explanation of the code and people wanting to use it in real serious projects, I would be happy to help or write more documentation of course.

Right now, the code packaging sucks. It’s only 3 headers you’d have to copy from the example on github : https://github.com/pinaraf/StoredProq/ (pg_types.h, queryresult.h and sqlmapper.h)

I hope you had fun reading this !