JIT: so you want to be faster than an interpreter on modern CPUs…

Hi

Since my previous blog entry about JIT compiler for PostgreSQL, sadly not much happened due to a lack of time, but still some things were done (biggest improvement was the port to ARM64, a few optimizations, implementing more opcodes…). But I am often asking myself how to really beat the interpreter… And on “modern” CPUs, with a well written interpreter, that’s far more complicated than many would imagine. So in order to explain all this and show how I am planning to improve performance (possibly of the interpreter itself too, thus making this endeavor self-defeating), let’s first talk about…

The magics of OoO execution and super-scalar CPUs

If you already know about all the topics mentioned in this title, feel free to jump to the next section. Note that the following section is over-simplified to make the concepts more accessible.

I am writing this blog post on a Zen 2+ CPU. If I upgraded to a Zen 3 CPU, same motherboard, same memory, I would get an advertised 25% performance jump in single thread benchmarks while the CPU frequency would be only 2% higher. Why such a discrepancy?

Since the 90s and the Pentium-class CPUs, x86 has followed RISC CPUs in the super-scalar era. Instead of running one instruction per cycle, when conditions are right, several instructions can be executed at the same time. Let’s consider the following pseudo-code:

f(a, b, c, d):
  X = a + b
  Y = c + d
  Z1 = 2 * X
  Z2 = 2 * Y
  Z = Z1 + Z2
  return Z

X and Y can be calculated at the same time. The CPU can execute these on two integer units, fetch the results and store them. The only issue is the computation of Z: everything must be done before this step, making it impossible for the CPU to go further without waiting for the previous results. But now, what if the code was written as follow:

f(a, b, c, d):
  X = a + b
  Z1 = 2 * X
  Y = c + d
  Z2 = 2 * Y
  Z = Z1 + Z2
  return Z

Every step would require waiting for the previous one, slowing down the CPU terribly. Hence the most important technique used to implement superscalar CPUs: out-of-order execution. The CPU will fetch the instructions, dispatch them in several instruction queues, and resolve dependencies to compute Y before computing Z1 in order to have it ready sooner. The CPU is spending less time idling, thus the whole thing is faster. But, alas, what would happen with the following function?

f(a, b, c, d):
  X = a + b
  Y = c + d
  if X > Y:
    Z = b + d
  else:
    Z = a + c
  return Z

Should the CPU wait for X and Y before deciding which Z to compute? Here is the biggest trick: it will try its luck and compute something anyway. This way, if its bet was right, a lot of time will be saved, otherwise the mistake result will be dropped and the proper computation will be done instead. This is called branch prediction, it has been the source of many fun security issues (hello meltdown), but the performance benefits are so huge that one would never consider disabling this.

How does this impact interpreters

Most interpreters will operate on an intermediate representation, using opcodes instead of directly executing from an AST or similar. So you could use the following main loop for an interpreter.

step = steps[0]
while 1:
  switch step.opcode:
    case A:
      // code for opcode A
    case B:
      // code for opcode B
    case C:
      // code for opcode C
    ...
  step++

This is how many, many interpreters were written. But this has a terrible drawback at least when compiled that way: it has branches all over the place from a single starting point (most if not all optimizing compilers will generate a jump table to optimize the dispatch, but this will still jump from the same point). The CPU will have a hard time predicting the right jump, and is thus losing a lot of performance. If this was the only way an interpreter could be written, generating a function by stitching the code together would save a lot of time, likely giving a more than 10% performance improvement. If one look at Python, removing this switch made the interpreter 15 to 20% faster. Many project, including PostgreSQL, use this same technique, called “computed gotos”. After a first pass to fill in “label” targets in each step, the execution would be

step = steps[0]
goto step->label;
OPCODE_A:
  // code for opcode A
  step++; goto step->label;
OPCODE_B:
  // code for opcode B
  step++; goto step->label;
OPCODE_C:
  // code for opcode C
  step++; goto step->label;
...

When running a short sequence of operations in a loop, the jumps will be far more predictable, making the branch predictor’s job easier, and thus improving the speed of the interpreter.

Now that we have a very basic understanding of modern CPUs and the insane level of optimization they reach, let’s talk about fighting the PostgreSQL interpreter on performance.

Optimizing a simple “SELECT a FROM table WHERE a = 42”

I will not discuss optimizing the tuple deforming part (aka. going from on-disc structure to the “C” structure used by the code), this will be a topic for a future blog post when I implement it in my compiler.

As you may know, PostgreSQL has a very complete type system with operators overloading. Even this simple query ends up being a call to int4eq, a strict function that will perform the comparison.

#define PG_GETARG_DATUM(n)    (fcinfo->args[n].value)
#define PG_GETARG_INT32(n)    DatumGetInt32(PG_GETARG_DATUM(n))

static inline int32
DatumGetInt32(Datum X)
{
        return (int32) X;..
}

Datum
int4eq(PG_FUNCTION_ARGS)
{
        int32           arg1 = PG_GETARG_INT32(0);
        int32           arg2 = PG_GETARG_INT32(1);

        PG_RETURN_BOOL(arg1 == arg2);
}

Since it is a strict function, PostgreSQL must check that the arguments are not null, otherwise the function is not called and the result will be null.

If you execute a very basic query like the one in the title, PostgreSQL will have the following opcodes:

EEOP_SCAN_FETCHSOME
EEOP_SCAN_VAR
EEOP_FUNCEXPR_STRICT_2
EEOP_QUAL
EEOP_DONE_RETURN

The EEOP_FUNCEXPR_STRICT_2 will perform the null check, and then call the function.

If we unroll all the opcodes in real C code, we end up with the following:

// SCAN_FETCHSOME
CheckOpSlotCompatibility(op, scanslot);

slot_getsomeattrs(scanslot, op->d.fetch.last_var);

op++; goto op->code;

// SCAN_VAR
int			attnum = op->d.var.attnum;

/* See EEOP_INNER_VAR comments */

Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
*op->resvalue = scanslot->tts_values[attnum];
*op->resnull = scanslot->tts_isnull[attnum];

op++; goto op->code;

// FUNCEXPR_STRICT_2
FunctionCallInfo fcinfo = op->d.func.fcinfo_data;
NullableDatum *args = fcinfo->args;

Assert(op->d.func.nargs == 2);

/* strict function, so check for NULL args */
if (args[0].isnull || args[1].isnull)
    *op->resnull = true;
else
{
    Datum		d;

    fcinfo->isnull = false;
    d = op->d.func.fn_addr(fcinfo);
    *op->resvalue = d;
    *op->resnull = fcinfo->isnull;
}

op++; goto op->code;

// QUAL
/* simplified version of BOOL_AND_STEP for use by ExecQual() */

/* If argument (also result) is false or null ... */
if (*op->resnull ||
    !DatumGetBool(*op->resvalue))
{
    /* ... bail out early, returning FALSE */
    *op->resnull = false;
    *op->resvalue = BoolGetDatum(false);
    EEO_JUMP(op->d.qualexpr.jumpdone);
}

/*
 * Otherwise, leave the TRUE value in place, in case this is the
 * last qual.  Then, TRUE is the correct answer.
 */

op++; goto op->code;

// DONE_RETURN
*isnull = state->resnull;
return state->resvalue;

We can already spot one optimization: why do we check the two arguments, including our constant, against null? It will never change for the entire run of this query and thus each comparison is going to use an ALU, and branch depending on that comparison. But of course the CPU will notice the corresponding branch pattern, and will thus be able to remain active and feed its other units.

What is the real cost of such a pointless comparison? For this purpose, I’ve broken a PostgreSQL instance and replaced all FUNCEXPR_STRICT with a check on one argument only, and one with no STRICT check (do not try this at home!). Doing 10 times a simple SELECT * FROM demo WHERE a = 42 on a 100 million rows table, with no index, here are the two perf results:

// non-broken PostgreSQL, the query took about 124ms per run
 Performance counter stats for process id '1757113':

          1,175.93 msec task-clock:u                     #    0.133 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
                80      page-faults:u                    #   68.031 /sec                      
    13,377,719,062      instructions:u                   #    2.92  insn per cycle            
                                                  #    0.01  stalled cycles per insn   
     4,583,676,400      cycles:u                         #    3.898 GHz                       
        87,108,713      stalled-cycles-frontend:u        #    1.90% frontend cycles idle      
     2,322,262,677      branches:u                       #    1.975 G/sec                     
         1,577,035      branch-misses:u                  #    0.07% of all branches           

// only 1 argument checked, the query took about 117ms per run
 Performance counter stats for process id '1760731':

          1,137.86 msec task-clock:u                     #    0.137 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
               480      page-faults:u                    #  421.845 /sec                      
    13,238,656,204      instructions:u                   #    2.99  insn per cycle            
                                                  #    0.01  stalled cycles per insn   
     4,429,234,344      cycles:u                         #    3.893 GHz                       
        80,398,623      stalled-cycles-frontend:u        #    1.82% frontend cycles idle      
     2,277,385,482      branches:u                       #    2.001 G/sec                     
         1,513,041      branch-misses:u                  #    0.07% of all branches           

// broken PostgreSQL, the query took about 115ms per run
 Performance counter stats for process id '1758298':

          1,134.46 msec task-clock:u                     #    0.132 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
               484      page-faults:u                    #  426.634 /sec                      
    13,199,914,639      instructions:u                   #    2.99  insn per cycle            
                                                  #    0.01  stalled cycles per insn   
     4,416,014,805      cycles:u                         #    3.893 GHz                       
        80,840,330      stalled-cycles-frontend:u        #    1.83% frontend cycles idle      
     2,248,257,937      branches:u                       #    1.982 G/sec                     
         1,497,270      branch-misses:u                  #    0.07% of all branches           

So, even if this is not the optimization of the century, it’s not that expensive to make, so… why not do it? (Patch coming to pgsql-hackers soon)

But a better optimization is to go all-in on inlining. Indeed, instead of jumping through a pointer to the int4eq code (again, something that the CPU will optimize a lot), one could have a special opcode for this quite common operation.

// Code of the opcodes unrolled, merged back and hand-optimized

// SCAN_FETCHSOME
CheckOpSlotCompatibility(op, scanslot);

slot_getsomeattrs(scanslot, op->d.fetch.last_var);

op++;

// SCAN_VAR
int			attnum = op->d.var.attnum;

/* See EEOP_INNER_VAR comments */

Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
*op->resvalue = scanslot->tts_values[attnum];
*op->resnull = scanslot->tts_isnull[attnum];

op++;

// FUNCEXPR_STRICT_INT4EQ
FunctionCallInfo fcinfo = op->d.func.fcinfo_data;
NullableDatum *args = fcinfo->args;


/* strict function, so check for NULL args */
if (args[0].isnull || args[1].isnull)
    *op->resnull = true;
else
{
    *op->resvalue = ((int32) args[0].value == (int32) args[1].value);
    *op->resnull = false;
}

op++;

// QUAL
/* simplified version of BOOL_AND_STEP for use by ExecQual() */

/* If argument (also result) is false or null ... */
if (*op->resnull ||
    !DatumGetBool(*op->resvalue))
{
    /* ... bail out early, returning FALSE */
    *op->resnull = false;
    *op->resvalue = BoolGetDatum(false);
    EEO_JUMP(op->d.qualexpr.jumpdone);
}

/*
 * Otherwise, leave the TRUE value in place, in case this is the
 * last qual.  Then, TRUE is the correct answer.
 */

op++;

// DONE_RETURN
*isnull = state->resnull;
return state->resvalue;

With this change alone (but keeping the two null checks, so there are still optimizations possible), we end up with the following perf results.

// broken PostgreSQL, the query took about 114ms per run

          1,125.33 msec task-clock:u                     #    0.143 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
               456      page-faults:u                    #  405.215 /sec                      
    12,941,745,741      instructions:u                   #    3.01  insn per cycle            
                                                  #    0.01  stalled cycles per insn     (71.33%)
     4,305,023,693      cycles:u                         #    3.826 GHz                         (71.38%)
        85,985,959      stalled-cycles-frontend:u        #    2.00% frontend cycles idle        (71.65%)
     2,211,080,124      branches:u                       #    1.965 G/sec                       (71.71%)
         1,596,208      branch-misses:u                  #    0.07% of all branches             (71.27%)

Let’s sum up these results.

PostgreSQLUnpatchedOnly 1 NULL checkNo NULL checkTwo NULL checks, inlined int4eq
Average time (ms)127117 (-7.9%)115 (-9.5%)114 (-10.3%)
Instructions13,377,719,06213,238,656,204 (-1.1%)13,199,914,639 (-1.4%)12,941,745,741 (-3.3%)
Cycles4,583,676,4004,429,234,344 (-3.4%)4,416,014,805 (-3.7%)4,305,023,693 (-6.1%)
Branches2,322,262,6772,277,385,482 (-2%)2,248,257,937 (-3.2%)2,211,080,124 (-4.8%)
Summary of performances along three changes

The biggest change comes, quite obviously, from inlining the int4eq call. Why is it that much better? Because it reduces by quite a lot the number of instructions to run, and it removes a call to an address stored in memory. And this is again an optimization I could do on my JIT compiler that can also be done on the interpreter with the same benefits. The biggest issue here is that you must keep the number of opcodes within (unspecified) limits: too many opcodes could make the compiler job far worse.

Fighting the wrong fight against the interpreter in my JIT compiler

Well. At first, I thought the elimination of null checks could not be implemented easily in the interpreter. The first draft in my compiler was certainly invalid, but gave me interesting numbers (around 5%, as seen above) and made me want to go ahead. And I realized that implementing it cleanly in the interpreter was far easier than implementing it in my JIT compiler …

Then I went with optimizing another common case, the call to int4eq, and, well… One could also add an opcode for that in the interpreter, and thus the performance gain of the JIT compiler are going to be minimal compared to the interpreter.

Modern CPUs don’t make my job easy here. Most of the cost of an interpreter is taken away by the branch predictor and the other optimizations implemented in silicon. So is all hope lost, am I to declare the interpreter the winner against the limitations of the copy-patch method I have available for my JIT?

PostgreSQLUnpatchedOnly 1 NULL checkNo NULL checkTwo NULL checks, inlined int4eqCopyjit, optimization in development
Average time (ms)127117 (-7.9%)115 (-9.5%)114 (-10.3%)98 (-23%)
Instructions13,377,719,06213,238,656,204 (-1,1%)13,199,914,639 (-1,4%)12,941,745,741 (-3,3%)12,013,081,667 (-11%)
Cycles4,583,676,4004,429,234,344 (-3,4%)4,416,014,805 (-3,7%)4,305,023,693 (-6,1%)3,701,112,703 (-20%)
Branches2,322,262,6772,277,385,482 (-2%)2,248,257,937 (-3,2%)2,211,080,124 (-4,8%)2,028,224,528 (-13%)
Look, some fun ahead!

Of course not, see you in the next post to discuss the biggest interpreter bottleneck!

PS: help welcome. Last year I managed to spend some time working on this during my work time. Since then I’ve changed job, and can hardly get some time on this. I also tried to get some sponsoring to work on this and present at future PostgreSQL conferences, to no luck :/

If you can help in any way on this project, feel free to reach me (code contribution, sponsoring, missions, job offers, nudge nudge wink wink). Since I’ve been alone on this, a lot of things are dibbles on scratch paper, I benchmark code and stuff in my head when life gives me some boring time but testing it for real is of course far better. I have some travel planned soon so I hope for next part to be released before next year, with interesting results since my experiences have been as successful as anticipated.

Leave a comment

Your email address will not be published. Required fields are marked *