Tag: JIT

  • JIT, episode III: warp speed ahead

    Previously…

    In our first JIT episode, we discussed how we could, using copy-patch, easily create a JIT compiler for PostgreSQL, with a slight improvement in performance compared to the PostgreSQL interpreter.

    In our second episode, I talked about the performance wall and how hard it is to get a real leap in performance over the interpreter. But it ended with a positive outlook: a nice performance jump that I was preparing at that moment…

    Beating the interpreter, let’s get back to the basics

    The interpreter will run each opcode for every record it has to process. Anything it does per record that could be done only once is better done once, obviously. And this is where a JIT can beat it: the JIT compiler can apply optimizations that, in the interpreter, would require checks at each opcode, checks that would be self-defeating. For instance, I mentioned creating inlined opcodes for common function calls like int4eq: in the interpreter, replacing the indirect call to int4eq with a comparison of the function pointer followed by an inlined version would be silly, since the comparison itself already wastes a lot of time.
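
    To make this concrete, here is a sketch of what such a per-record check would look like in the interpreter (illustrative only, reusing the names from PostgreSQL's interpreter code as shown later on this page):

    /* Hypothetical interpreter-side "inlining" of int4eq: the pointer
     * comparison below runs for every single record, so it pays the very
     * cost it was supposed to remove. */
    if (op->d.func.fn_addr == int4eq)
    {
        /* inlined fast path */
        *op->resvalue = BoolGetDatum(DatumGetInt32(args[0].value) ==
                                     DatumGetInt32(args[1].value));
        *op->resnull = false;
    }
    else
    {
        /* generic indirect call */
        fcinfo->isnull = false;
        *op->resvalue = op->d.func.fn_addr(fcinfo);
        *op->resnull = fcinfo->isnull;
    }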

    So, what can’t the interpreter do? It sure can’t easily remove indirect calls, but this is a 1% performance gain, 2% at most. You won’t make headlines with that, right? Well, when in doubt, look at the past…

    A short story about belief oriented programming…

    A decade ago, I worked at a small company where I heard the weirdest thing ever regarding system performance: “our application is slower when built in 64 bits mode because the bigger pointer size makes it slower”. I didn’t buy this, spent two days digging into the code, and found it was the opposite: 64 bits brought such a performance improvement that the entire system collapsed on a mutex protecting a core structure of the application… Removing the mutex made the application fly in both 32 and 64 bits, with 64 bits obviously beating 32 bits.

    But why is 64 bits faster? We are talking database here, so let’s have a look at a table, shall we?

    http://users.atw.hu/instlatx64/AuthenticAMD/AuthenticAMD0870F10_K17_Matisse7_InstLatX64.txt (I know uBlock doesn’t like this domain, but this text document there is good, I promise)

    On my CPU, loading a 64 bits value in a register requires twice the time it takes to load a 32 bits value. So sure 64 bits must be slower than 32 bits!

    Except the switch from 32 to 64 bits also fixed one of the biggest issues with x86: the lack of registers. x86 never grew past its 16 bits roots and had 8 general purpose registers, few compared to PowerPC (32), SPARC (31), or ARM (15). When AMD introduced 64 bits in the x86 world, they doubled the number of registers, from a ridiculous 8 to an acceptable 16. And from this came a huge performance boost.

    What you need to know about memory and registers

    Memory = slow

    Registers = fast

    Ok, more seriously, I will not start writing about this. Even if it is getting old and outdated, Ulrich Drepper’s paper “What Every Programmer Should Know About Memory” is still a great read if you’re interested in the topic. The only thing that matters for us is that, even with a lot of cache, writing to memory is slower than writing to a register. If I look at some measurements for my Zen 2 CPU, a comparison between two registers takes less than a cycle (0.33 cycles, it seems), but if data has to be loaded from the L1 cache you can add 4 cycles, 12 cycles from L2, and 38 from L3. Way, way slower: 12 to 115 times slower.

    Registers are used automatically by your compiler. When you write a function, it will figure out which variables to move to registers and when, and if you don’t have enough registers for your entire function, it will spill values to the stack as needed. If you are interested in this, there are many fun register allocation algorithms and many Wikipedia pages covering the topic.

    How does this impact our previous example

    Let’s look at one of the most basic opcodes, EEOP_SCAN_VAR, which takes a value from a scan slot in order to use it later.

    EEO_CASE(EEOP_SCAN_VAR)
    {
    	int			attnum = op->d.var.attnum;
    
    	*op->resvalue = scanslot->tts_values[attnum];
    	*op->resnull = scanslot->tts_isnull[attnum];
    
    	EEO_NEXT();
    }

    This is indeed a memory write. Could the interpreter get rid of it? Well, I think it could, but it would be a major undertaking. If we had a variable, stored in a register by the compiler, we could store the value there, sure, and a next step could fetch it from that place, but what if the next step needs another value instead… Then we would have to spill the value back to memory, and checking for this at each step is going to kill performance. It may be possible to rewrite the entire interpreter to match a register-based VM, but I cannot be sure it would be worth it.

    And this is the path to beating the interpreter: we can check many things before running the opcodes, trace memory accesses, and use registers as much as possible.

    Applying this to copyjit

    The great benefit of copy-patch is that you (almost) don’t write assembly code. Porting it to ARM64 required me to learn about ARM64-specific relocations and how to encode immediate values in some ARM opcodes, but nothing more.

    But a big downside is that you don’t write assembly code 🙂

    And, well, if you don’t write the assembly code, you don’t control register allocation. But there is a simple way around this; let’s talk a bit about calling conventions.

    When function A is called, how do you pass parameters to it? If you learned some x86 assembly at school, you will answer “on the stack” and win a free ticket to an assembly refresher course 🙂

    When AMD64 was introduced, the SysV calling convention took over and completely changed the way functions are called: the first six integer or pointer parameters are passed through general purpose registers, and the first eight floating point parameters are passed through FP registers.
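
    As an illustration (register names from the SysV AMD64 ABI; a sketch, not code from copyjit):

    /* Under the SysV AMD64 calling convention, the six integer arguments
     * of this function arrive in registers, not on the stack:
     * a -> rdi, b -> rsi, c -> rdx, d -> rcx, e -> r8, f -> r9.
     * A seventh integer argument would go on the stack. */
    long sum6(long a, long b, long c, long d, long e, long f);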

    Each opcode is defined as a function with three parameters (matching the function signature expected by PostgreSQL). Under the SysV calling convention, that leaves us three argument registers that the compiler will keep alive across opcode calls, and spill automatically if needed. An alternative would have been to use the preserve_none calling convention, but for this first version I did not need it (and I still have many calls to PostgreSQL functions that will use the SysV calling convention anyway).

    But three registers means… two values only. Sadly we transitioned from 32 to 64 bits, not to 65 bits… 65 bits would have given us one bit to represent NULL/NOT NULL values: 0 would not have been NULL, and 1 + NULL would be NULL! But we will not rewrite history here, so we are going to use one register as a set of null flags, one bit per value register (yes, we are wasting 62 bits here).

    Our opcode functions thus get three new parameters: char nullFlags, intptr_t reg0, intptr_t reg1. Jumping to the next opcode means passing these values along. Great, we keep values in registers; now, what about using them?
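
    Expanded, a stencil signature would look something like this (a sketch; the exact declaration in copyjit may differ):

    /* The three parameters PostgreSQL expects, plus our three "virtual
     * registers". Under SysV, all six are passed in registers, so
     * nullFlags, reg0 and reg1 stay in registers from one opcode to
     * the next. */
    Datum stencil_EEOP_XXX(struct ExprState *expression,
                           struct ExprContext *econtext,
                           bool *isNull,
                           char nullFlags,   /* one null bit per value register */
                           intptr_t reg0,    /* value register 0 */
                           intptr_t reg1);   /* value register 1 */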

    As a reminder, here are the opcodes we are dealing with for our previous “SELECT * FROM demo WHERE a = 42”.

    // SCAN_FETCHSOME
    CheckOpSlotCompatibility(op, scanslot);
    
    slot_getsomeattrs(scanslot, op->d.fetch.last_var);
    
    op++;
    
    // SCAN_VAR
    int			attnum = op->d.var.attnum;
    
    /* See EEOP_INNER_VAR comments */
    
    Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
    *op->resvalue = scanslot->tts_values[attnum];
    *op->resnull = scanslot->tts_isnull[attnum];
    
    op++;
    
    // FUNCEXPR_STRICT_2
    FunctionCallInfo fcinfo = op->d.func.fcinfo_data;
    NullableDatum *args = fcinfo->args;
    
    Assert(op->d.func.nargs == 2);
    
    /* strict function, so check for NULL args */
    if (args[0].isnull || args[1].isnull)
        *op->resnull = true;
    else
    {
        Datum		d;
    
        fcinfo->isnull = false;
        d = op->d.func.fn_addr(fcinfo);
        *op->resvalue = d;
        *op->resnull = fcinfo->isnull;
    }
    
    op++;
    
    // QUAL
    /* simplified version of BOOL_AND_STEP for use by ExecQual() */
    
    /* If argument (also result) is false or null ... */
    if (*op->resnull ||
        !DatumGetBool(*op->resvalue))
    {
        /* ... bail out early, returning FALSE */
        *op->resnull = false;
        *op->resvalue = BoolGetDatum(false);
        EEO_JUMP(op->d.qualexpr.jumpdone);
    }
    
    /*
     * Otherwise, leave the TRUE value in place, in case this is the
     * last qual.  Then, TRUE is the correct answer.
     */
    
    op++;
    
    // DONE_RETURN
    *isnull = state->resnull;
    return state->resvalue;

    This code doesn’t use our registers. So I rewrote every opcode implementation to use the registers instead of writing to memory.

    // SCAN_FETCHSOME
    CheckOpSlotCompatibility(op, scanslot);
    
    slot_getsomeattrs(scanslot, op->d.fetch.last_var);
    
    op++;
    
    // SCAN_VAR
    int			attnum = op->d.var.attnum;
    
    /* See EEOP_INNER_VAR comments */
    
    Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
    reg0 = scanslot->tts_values[attnum];
    if (scanslot->tts_isnull[attnum])
      nullFlags |= 1;
    else
      nullFlags &= ~1;
    
    op++;
    
    // FUNCEXPR_STRICT_2
    /* strict function, so check for NULL args */
    if (nullFlags)
        nullFlags |= 1;
    else
    {
        reg0 = (reg0 == reg1);
    }
    
    op++;
    
    // QUAL
    /* simplified version of BOOL_AND_STEP for use by ExecQual() */
    
    /* If argument (also result) is false or null ... */
    if ((nullFlags & 1) || !DatumGetBool(reg0))
    {
        /* ... bail out early, returning FALSE */
        nullFlags &= ~1;
        reg0 = BoolGetDatum(false);
        EEO_JUMP(op->d.qualexpr.jumpdone);
    }
    
    /*
     * Otherwise, leave the TRUE value in place, in case this is the
     * last qual.  Then, TRUE is the correct answer.
     */
    
    op++;
    
    // DONE_RETURN
    *isnull = (nullFlags & 1);
    return reg0;

    In this version, all memory accesses have been replaced with register accesses, hurray!

    But this only works for a simple query like this one. Once we start having more variables to store, we will need a spilling mechanism, a way to swap registers…

    Another issue appears when you call, for instance, a non-inlined function. The EEOP_FUNCEXPR opcode is defined as:

    EEO_CASE(EEOP_FUNCEXPR)
    {
        FunctionCallInfo fcinfo = op->d.func.fcinfo_data;
        Datum		d;
    
        fcinfo->isnull = false;
        d = op->d.func.fn_addr(fcinfo);
        *op->resvalue = d;
        *op->resnull = fcinfo->isnull;
    
        EEO_NEXT();
    }

    Parameters are fed through the fcinfo_data structure. In the usual interpreter execution, the other opcodes write directly into this structure.

    It means that we must track every memory access of the opcodes and make sure that an opcode expecting a value at some memory location never reads a location we didn’t write to because the value was still sitting in a register.
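
    Concretely, before such an opcode the compiler has to emit a spill, something like this (a sketch; copyjit’s actual spill code may differ):

    /* Write our register contents back to the memory locations a generic
     * EEOP_FUNCEXPR will read its arguments from. */
    fcinfo->args[0].value  = (Datum) reg0;
    fcinfo->args[0].isnull = (nullFlags & (1 << 0)) != 0;
    fcinfo->args[1].value  = (Datum) reg1;
    fcinfo->args[1].isnull = (nullFlags & (1 << 1)) != 0;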

    Implementing this in copyjit

    I started with a small experiment, a “variabilizer”, that would look at each opcode and figure out, from each memory access (read/write), all the variables used in a run, their lifetimes… It can even detect constants stored in memory (memory that is only read, never written).

    I then refactored a lot of the compiler code over the past weeks. I started by moving the specialized opcode definitions and dispatch into the stencil library only, removing any special case I had in the compiler part. This required #defining a way for the C code in stencils.c to generate more C code in the built-stencils.h file through the stencil-builder.py script. Fun, but complicated and hairy stuff.

    After that, I started rewriting the stencil signatures and several opcodes to use registers, and wrote a “contract” for each opcode, defining what is expected in each register, what will be written to each register, and what is going to be read from and written to memory.

    With all these changes, here is what the FUNCEXPR_STRICT opcode optimized for int4eq looks like.

    /// Variant with int4eq inlined
    STENCIL(EEOP_FUNCEXPR_STRICT__1)
    {
    	if (nullFlags & 3) {
    		// Make sure reg0 is marked as null
    		nullFlags |= (1 << 0);
    	} else {
    		reg0 = (DatumGetInt32(reg0) == DatumGetInt32(reg1));
    		nullFlags &= ~(1 << 0);
    	}
    	goto_next;
    }
    SELECTOR(EEOP_FUNCEXPR_STRICT__1, op->d.func.fn_addr == &int4eq)
    BEGIN_REGISTER_CONTRACT(EEOP_FUNCEXPR_STRICT__1)
    EXPECT(0, op.d.func.fcinfo_data->args[0].isnull, 
              op.d.func.fcinfo_data->args[0].value)
    EXPECT(1, op.d.func.fcinfo_data->args[1].isnull, 
              op.d.func.fcinfo_data->args[1].value)
    WRITE(0, op.resnull, op.resvalue)
    END_REGISTER_CONTRACT

    More metadata than actual code… But if that’s what it takes to get a good performance boost, then here we go.

    After ingesting that, the compiler can fill the registers with the proper values when needed. Another big issue, which I’m not covering here, is that doing this requires some minimal control flow analysis. For my simple benchmark this is not a problem, and the code is getting ready for a wider range of queries, but I did not want to cover that and preferred focusing on the register work…

    Results

    Well… This is the optimization I mentioned in the previous article. So, on our stupid benchmark, running 10 times a simple SELECT * FROM demo WHERE a = 42 on a 10 million row table…

    PostgreSQL              No JIT            LLVM JIT                 Copyjit
    Average time (ms)       120               106 (-12%)               101 (-15%)
    Compilation time (ms)   0                 19                       0.06
    Instructions            13,350,766,209    10,643,820,667 (-21%)    12,769,013,536 (-5%)
    Cycles                  4,660,821,596     4,005,881,863 (-14%)     3,924,602,439 (-16%)
    Branches                2,322,470,659     1,798,221,785 (-23%)     2,031,456,214 (-13%)

    As you can see, this is exactly what we expected: fewer instructions, sure, but that is not what gave us the performance boost here. What changed is the number of cycles, because instructions now use registers instead of memory accesses, saving several cycles per instruction.

    The LLVM JIT achieves about the same run time here, but it takes some time to generate the bitcode (less than 1 ms), then several ms to analyze it, optimize it, and finally translate it to machine code. And this makes the LLVM JIT slower here than copyjit, while copyjit still has room for improvement (I have yet to look at tuple deforming).

    See you in the next one, I think we already know what the topic will be… Well, after I finish porting every opcode to these new metadata, testing more stuff, and likely figuring out some more optimizations along the way…

    PS: as said previously, help is welcome, the code is FOSS as usual, on GitHub, and I would gladly accept any sponsoring, mission, anything that could give me more time to work on this…

  • JIT: so you want to be faster than an interpreter on modern CPUs…

    Hi

    Since my previous blog entry about my JIT compiler for PostgreSQL, sadly not much has happened due to a lack of time, but some things were done (the biggest improvement was the port to ARM64, plus a few optimizations and more opcodes implemented…). But I often ask myself how to really beat the interpreter… And on “modern” CPUs, with a well written interpreter, that’s far more complicated than many would imagine. So, in order to explain all this and show how I am planning to improve performance (possibly of the interpreter itself too, thus making this endeavor self-defeating), let’s first talk about…

    The magics of OoO execution and super-scalar CPUs

    If you already know about all the topics mentioned in this title, feel free to jump to the next section. Note that the following section is over-simplified to make the concepts more accessible.

    I am writing this blog post on a Zen 2+ CPU. If I upgraded to a Zen 3 CPU, same motherboard, same memory, I would get an advertised 25% performance jump in single thread benchmarks while the CPU frequency would be only 2% higher. Why such a discrepancy?

    Since the 90s and the Pentium-class CPUs, x86 has followed the RISC CPUs into the superscalar era. Instead of running one instruction per cycle, when conditions are right, several instructions can be executed at the same time. Let’s consider the following pseudo-code:

    f(a, b, c, d):
      X = a + b
      Y = c + d
      Z1 = 2 * X
      Z2 = 2 * Y
      Z = Z1 + Z2
      return Z

    X and Y can be calculated at the same time. The CPU can execute these on two integer units, fetch the results and store them. The only issue is the computation of Z: everything must be done before this step, making it impossible for the CPU to go further without waiting for the previous results. But now, what if the code was written as follows:

    f(a, b, c, d):
      X = a + b
      Z1 = 2 * X
      Y = c + d
      Z2 = 2 * Y
      Z = Z1 + Z2
      return Z

    Every step would require waiting for the previous one, slowing down the CPU terribly. Hence the most important technique used to implement superscalar CPUs: out-of-order execution. The CPU will fetch the instructions, dispatch them into several instruction queues, and resolve dependencies in order to compute Y before Z1 and have it ready sooner. The CPU spends less time idling, and thus the whole thing is faster. But, alas, what would happen with the following function?

    f(a, b, c, d):
      X = a + b
      Y = c + d
      if X > Y:
        Z = b + d
      else:
        Z = a + c
      return Z

    Should the CPU wait for X and Y before deciding which Z to compute? Here is the biggest trick: it will try its luck and compute something anyway. If its bet was right, a lot of time is saved; otherwise the mispredicted result is dropped and the proper computation is done instead. This is called branch prediction. It has been the source of many fun security issues (hello, Spectre), but the performance benefits are so huge that nobody would ever consider disabling it.

    How does this impact interpreters

    Most interpreters operate on an intermediate representation, using opcodes, instead of executing directly from an AST or similar. So you could use the following main loop for an interpreter:

    step = steps[0]
    while 1:
      switch step.opcode:
        case A:
          // code for opcode A
        case B:
          // code for opcode B
        case C:
          // code for opcode C
        ...
      step++

    This is how many, many interpreters were written. But it has a terrible drawback, at least when compiled that way: it branches all over the place from a single starting point (most if not all optimizing compilers will generate a jump table to optimize the dispatch, but execution will still jump from that single point). The CPU will have a hard time predicting the right jump, and thus loses a lot of performance. If this were the only way an interpreter could be written, generating a function by stitching the code together would save a lot of time, likely giving more than a 10% performance improvement. If one looks at Python, removing this switch made the interpreter 15 to 20% faster. Many projects, including PostgreSQL, use the same technique, called “computed gotos”. After a first pass to fill in the “label” targets in each step, the execution becomes:

    step = steps[0]
    goto step->label;
    OPCODE_A:
      // code for opcode A
      step++; goto step->label;
    OPCODE_B:
      // code for opcode B
      step++; goto step->label;
    OPCODE_C:
      // code for opcode C
      step++; goto step->label;
    ...
    

    When running a short sequence of operations in a loop, the jumps will be far more predictable, making the branch predictor’s job easier, and thus improving the speed of the interpreter.

    Now that we have a very basic understanding of modern CPUs and the insane level of optimization they reach, let’s talk about fighting the PostgreSQL interpreter on performance.

    Optimizing a simple “SELECT a FROM table WHERE a = 42”

    I will not discuss optimizing the tuple deforming part (aka going from the on-disk structure to the “C” structure used by the code); this will be a topic for a future blog post, when I implement it in my compiler.

    As you may know, PostgreSQL has a very complete type system with operator overloading. Even this simple query ends up being a call to int4eq, a strict function that will perform the comparison.

    #define PG_GETARG_DATUM(n)    (fcinfo->args[n].value)
    #define PG_GETARG_INT32(n)    DatumGetInt32(PG_GETARG_DATUM(n))
    
    static inline int32
    DatumGetInt32(Datum X)
    {
            return (int32) X;
    }
    
    Datum
    int4eq(PG_FUNCTION_ARGS)
    {
            int32           arg1 = PG_GETARG_INT32(0);
            int32           arg2 = PG_GETARG_INT32(1);
    
            PG_RETURN_BOOL(arg1 == arg2);
    }

    Since it is a strict function, PostgreSQL must check that the arguments are not null; otherwise the function is not called and the result is null.

    If you execute a very basic query like the one in the title, PostgreSQL will have the following opcodes:

    EEOP_SCAN_FETCHSOME
    EEOP_SCAN_VAR
    EEOP_FUNCEXPR_STRICT_2
    EEOP_QUAL
    EEOP_DONE_RETURN

    The EEOP_FUNCEXPR_STRICT_2 opcode will perform the null checks, and then call the function.

    If we unroll all the opcodes in real C code, we end up with the following:

    // SCAN_FETCHSOME
    CheckOpSlotCompatibility(op, scanslot);
    
    slot_getsomeattrs(scanslot, op->d.fetch.last_var);
    
    op++; goto op->code;
    
    // SCAN_VAR
    int			attnum = op->d.var.attnum;
    
    /* See EEOP_INNER_VAR comments */
    
    Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
    *op->resvalue = scanslot->tts_values[attnum];
    *op->resnull = scanslot->tts_isnull[attnum];
    
    op++; goto op->code;
    
    // FUNCEXPR_STRICT_2
    FunctionCallInfo fcinfo = op->d.func.fcinfo_data;
    NullableDatum *args = fcinfo->args;
    
    Assert(op->d.func.nargs == 2);
    
    /* strict function, so check for NULL args */
    if (args[0].isnull || args[1].isnull)
        *op->resnull = true;
    else
    {
        Datum		d;
    
        fcinfo->isnull = false;
        d = op->d.func.fn_addr(fcinfo);
        *op->resvalue = d;
        *op->resnull = fcinfo->isnull;
    }
    
    op++; goto op->code;
    
    // QUAL
    /* simplified version of BOOL_AND_STEP for use by ExecQual() */
    
    /* If argument (also result) is false or null ... */
    if (*op->resnull ||
        !DatumGetBool(*op->resvalue))
    {
        /* ... bail out early, returning FALSE */
        *op->resnull = false;
        *op->resvalue = BoolGetDatum(false);
        EEO_JUMP(op->d.qualexpr.jumpdone);
    }
    
    /*
     * Otherwise, leave the TRUE value in place, in case this is the
     * last qual.  Then, TRUE is the correct answer.
     */
    
    op++; goto op->code;
    
    // DONE_RETURN
    *isnull = state->resnull;
    return state->resvalue;

    We can already spot one optimization: why do we check both arguments against null, including our constant? The constant will never change for the entire run of this query, and yet each check is going to use an ALU and branch on the result of the comparison. But of course the CPU will notice the corresponding branch pattern, and will thus be able to remain active and feed its other units.
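
    With a constant, known non-NULL second argument, the strict check could shrink to something like this (a sketch of the idea, not the actual patch):

    /* FUNCEXPR_STRICT variant checking a single argument, assuming the
     * other one is a constant that is known to never be NULL. */
    if (args[0].isnull)
        *op->resnull = true;
    else
    {
        fcinfo->isnull = false;
        *op->resvalue = op->d.func.fn_addr(fcinfo);
        *op->resnull = fcinfo->isnull;
    }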

    What is the real cost of such a pointless comparison? To measure it, I broke a PostgreSQL instance, replacing all FUNCEXPR_STRICT occurrences with a variant checking one argument only, and another with no STRICT check at all (do not try this at home!). Running 10 times a simple SELECT * FROM demo WHERE a = 42 on a 10 million row table, with no index, here are the perf results:

    // non-broken PostgreSQL, the query took about 124ms per run
     Performance counter stats for process id '1757113':
    
              1,175.93 msec task-clock:u                     #    0.133 CPUs utilized             
                     0      context-switches:u               #    0.000 /sec                      
                     0      cpu-migrations:u                 #    0.000 /sec                      
                    80      page-faults:u                    #   68.031 /sec                      
        13,377,719,062      instructions:u                   #    2.92  insn per cycle            
                                                      #    0.01  stalled cycles per insn   
         4,583,676,400      cycles:u                         #    3.898 GHz                       
            87,108,713      stalled-cycles-frontend:u        #    1.90% frontend cycles idle      
         2,322,262,677      branches:u                       #    1.975 G/sec                     
             1,577,035      branch-misses:u                  #    0.07% of all branches           
    
    // only 1 argument checked, the query took about 117ms per run
     Performance counter stats for process id '1760731':
    
              1,137.86 msec task-clock:u                     #    0.137 CPUs utilized             
                     0      context-switches:u               #    0.000 /sec                      
                     0      cpu-migrations:u                 #    0.000 /sec                      
                   480      page-faults:u                    #  421.845 /sec                      
        13,238,656,204      instructions:u                   #    2.99  insn per cycle            
                                                      #    0.01  stalled cycles per insn   
         4,429,234,344      cycles:u                         #    3.893 GHz                       
            80,398,623      stalled-cycles-frontend:u        #    1.82% frontend cycles idle      
         2,277,385,482      branches:u                       #    2.001 G/sec                     
             1,513,041      branch-misses:u                  #    0.07% of all branches           
    
    // broken PostgreSQL, the query took about 115ms per run
     Performance counter stats for process id '1758298':
    
              1,134.46 msec task-clock:u                     #    0.132 CPUs utilized             
                     0      context-switches:u               #    0.000 /sec                      
                     0      cpu-migrations:u                 #    0.000 /sec                      
                   484      page-faults:u                    #  426.634 /sec                      
        13,199,914,639      instructions:u                   #    2.99  insn per cycle            
                                                      #    0.01  stalled cycles per insn   
         4,416,014,805      cycles:u                         #    3.893 GHz                       
            80,840,330      stalled-cycles-frontend:u        #    1.83% frontend cycles idle      
         2,248,257,937      branches:u                       #    1.982 G/sec                     
             1,497,270      branch-misses:u                  #    0.07% of all branches           
    

    So, even if this is not the optimization of the century, it’s not that expensive to make, so… why not do it? (Patch coming to pgsql-hackers soon)

    But a better optimization is to go all-in on inlining. Indeed, instead of jumping through a pointer to the int4eq code (again, something that the CPU will optimize a lot), one could have a special opcode for this quite common operation.

    // Code of the opcodes unrolled, merged back and hand-optimized
    
    // SCAN_FETCHSOME
    CheckOpSlotCompatibility(op, scanslot);
    
    slot_getsomeattrs(scanslot, op->d.fetch.last_var);
    
    op++;
    
    // SCAN_VAR
    int			attnum = op->d.var.attnum;
    
    /* See EEOP_INNER_VAR comments */
    
    Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
    *op->resvalue = scanslot->tts_values[attnum];
    *op->resnull = scanslot->tts_isnull[attnum];
    
    op++;
    
    // FUNCEXPR_STRICT_INT4EQ
    FunctionCallInfo fcinfo = op->d.func.fcinfo_data;
    NullableDatum *args = fcinfo->args;
    
    
    /* strict function, so check for NULL args */
    if (args[0].isnull || args[1].isnull)
        *op->resnull = true;
    else
    {
        *op->resvalue = ((int32) args[0].value == (int32) args[1].value);
        *op->resnull = false;
    }
    
    op++;
    
    // QUAL
    /* simplified version of BOOL_AND_STEP for use by ExecQual() */
    
    /* If argument (also result) is false or null ... */
    if (*op->resnull ||
        !DatumGetBool(*op->resvalue))
    {
        /* ... bail out early, returning FALSE */
        *op->resnull = false;
        *op->resvalue = BoolGetDatum(false);
        EEO_JUMP(op->d.qualexpr.jumpdone);
    }
    
    /*
     * Otherwise, leave the TRUE value in place, in case this is the
     * last qual.  Then, TRUE is the correct answer.
     */
    
    op++;
    
    // DONE_RETURN
    *isnull = state->resnull;
    return state->resvalue;
    

    With this change alone (but keeping the two null checks, so there are still optimizations possible), we end up with the following perf results.

    // broken PostgreSQL, the query took about 114ms per run
    
              1,125.33 msec task-clock:u                     #    0.143 CPUs utilized             
                     0      context-switches:u               #    0.000 /sec                      
                     0      cpu-migrations:u                 #    0.000 /sec                      
                   456      page-faults:u                    #  405.215 /sec                      
        12,941,745,741      instructions:u                   #    3.01  insn per cycle            
                                                      #    0.01  stalled cycles per insn     (71.33%)
         4,305,023,693      cycles:u                         #    3.826 GHz                         (71.38%)
            85,985,959      stalled-cycles-frontend:u        #    2.00% frontend cycles idle        (71.65%)
         2,211,080,124      branches:u                       #    1.965 G/sec                       (71.71%)
             1,596,208      branch-misses:u                  #    0.07% of all branches             (71.27%)

    Let’s sum up these results.

    PostgreSQL          Unpatched         Only 1 NULL check         No NULL check             Two NULL checks, inlined int4eq
    Average time (ms)   127               117 (-7.9%)               115 (-9.5%)               114 (-10.3%)
    Instructions        13,377,719,062    13,238,656,204 (-1.1%)    13,199,914,639 (-1.4%)    12,941,745,741 (-3.3%)
    Cycles              4,583,676,400     4,429,234,344 (-3.4%)     4,416,014,805 (-3.7%)     4,305,023,693 (-6.1%)
    Branches            2,322,262,677     2,277,385,482 (-2%)       2,248,257,937 (-3.2%)     2,211,080,124 (-4.8%)
    Summary of performance across the three changes

    The biggest change comes, quite obviously, from inlining the int4eq call. Why is it that much better? Because it reduces the number of instructions to run by quite a lot, and it removes a call through an address stored in memory. And this is again an optimization I could do in my JIT compiler that can also be done in the interpreter, with the same benefits. The biggest issue here is that you must keep the number of opcodes within (unspecified) limits: too many opcodes could make the compiler’s job far worse.

    Fighting the wrong fight against the interpreter in my JIT compiler

    Well. At first, I thought the elimination of null checks could not easily be implemented in the interpreter. The first draft in my compiler was certainly invalid, but it gave me interesting numbers (around 5%, as seen above) and made me want to go ahead. And then I realized that implementing it cleanly in the interpreter was far easier than implementing it in my JIT compiler…

    Then I went on to optimize another common case, the call to int4eq, and, well… One could also add an opcode for that in the interpreter, and thus the performance gains of the JIT compiler over the interpreter are going to be minimal.

    Modern CPUs don’t make my job easy here. Most of the cost of an interpreter is taken away by the branch predictor and the other optimizations implemented in silicon. So is all hope lost? Am I to declare the interpreter the winner against the limitations of the copy-patch method I have available for my JIT?

    PostgreSQL          Unpatched         Only 1 NULL check         No NULL check             Two NULL checks, inlined int4eq    Copyjit, optimization in development
    Average time (ms)   127               117 (-7.9%)               115 (-9.5%)               114 (-10.3%)                       98 (-23%)
    Instructions        13,377,719,062    13,238,656,204 (-1.1%)    13,199,914,639 (-1.4%)    12,941,745,741 (-3.3%)             12,013,081,667 (-11%)
    Cycles              4,583,676,400     4,429,234,344 (-3.4%)     4,416,014,805 (-3.7%)     4,305,023,693 (-6.1%)              3,701,112,703 (-20%)
    Branches            2,322,262,677     2,277,385,482 (-2%)       2,248,257,937 (-3.2%)     2,211,080,124 (-4.8%)              2,028,224,528 (-13%)
    Look, some fun ahead!

    Of course not, see you in the next post to discuss the biggest interpreter bottleneck!

    PS: help welcome. Last year I managed to spend some time working on this during my work time. Since then I’ve changed jobs, and can hardly get any time for this. I also tried to get some sponsoring to work on this and present it at future PostgreSQL conferences, with no luck :/

    If you can help in any way with this project, feel free to reach out (code contributions, sponsoring, missions, job offers, nudge nudge wink wink). Since I’ve been alone on this, a lot of things are scribbles on scratch paper; I benchmark code and stuff in my head when life gives me some boring time, but testing it for real is of course far better. I have some travel planned soon, so I hope for the next part to be released before next year, with interesting results, since my experiments have been as successful as anticipated.

  • Look ma, I wrote a new JIT compiler for PostgreSQL

    Sometimes, I don’t know why I do things. This is one of those times. A few months ago, Python 3.13 got its JIT engine, built with a new JIT compiler construction methodology (copy-and-patch, cf. the research paper). After reading the paper, I was sold and I just had to try it with PostgreSQL. And what a fun ride it’s been so far. This blog post will not cover everything, and I prefer other communication methods, but I would like to introduce pg-copyjit, the latest and shiniest way to destroy and segfault speed up your PostgreSQL server.

    Before going any further, a mandatory warning: all code produced here is experimental. Please. I want to hear reports from you, like “oh, it’s fun”, “oh, I got this performance boost”, “hey, maybe this could be done”, but not “hey, your extension cost me hours of downtime on my business critical application”. Anyway, in its current state it is for professional hackers; I hope you know better than to trust experimental code on a production server.

    In the beginning, there was no JIT, and then came the LLVM JIT compiler

    In a PostgreSQL release a long time ago, in a galaxy far far away, Andres Freund introduced the PostgreSQL world to the magics of JIT compilation, using LLVM. They married and there was much rejoicing. Alas, darkness there was in the bright castle, for LLVM is a very very demanding husband.

    LLVM is a great compilation framework. Its optimizer produces very good and efficient code, and Andres went further than anybody else would have thought and tried in order to squeeze the last microsecond of performance out of his JIT compiler. This is wonderful work, and I don’t know how to express my love for the madness this kind of dedication to performance is. But LLVM has a big downside: it’s not built for JIT compilation. At least not in the way PostgreSQL uses it: the LLVM optimizer is very expensive, but not using it may be worse than no compilation at all. And in order to compile only the good stuff, the queries that can enjoy the performance boost, the typical query cost estimation is used. And that’s the PostgreSQL downside that makes the whole thing almost impossible: costs in PostgreSQL are not designed to mean anything. They are meant to be compared to each other, but say nothing about the real execution time. A query costing 100 may run in 1 second, while another costing 1000 may run in 100 milliseconds. It’s not a bug, it’s a design decision. That’s why a lot of people (including me) end up turning off the JIT compiler: most if not all queries on my production system will not gain enough from the performance boost to compensate for the LLVM optimizer cost. If I can run the query 10 ms faster but it needed 50 ms to be optimized, it’s a pure loss.

    There is one way to make the LLVM JIT compiler more usable, but I fear it’s going to take years to be implemented: being able to cache and reuse compiled queries. I will not dig further into that topic in this post, but trust me, it’s not going to be a small feat to achieve.

    And in 2021, copy-and-patch was described…

    So, what can we do? We need fast enough code, generated the fastest way possible. Fast enough code means at least a bit faster than the current interpreter… But writing a compiler is painful, and writing several code generators (for different ISAs, for instance) is even worse…

    This is where the innovation of copy-and-patch comes into play and saves the day.

    With copy-patch, you write stencils in C. These stencils are functions with holes, and they are compiled by your typical clang compiler (gcc support pending, too complicated to explain here). Then, when you want to compile something, you stitch stencils together, fill in the holes, and jump straight into your brand new “compiled” function.

    And this is it. This is the magic of copy-and-patch. You only copy the stencils into a new memory area, patch the holes, and voilà.

    Of course, you can go further. You can figure out what computation can be done at compilation time, you can split loops into several stencils to unroll them, you can merge several stencils together to optimize them in one go (creating kinds of meta-stencils…)

    This paper caught the eyes of the Faster CPython team, who implemented it in CPython 3.13, and this is when more people (including me) discovered it.

    Bringing copy-and-patch to PostgreSQL

    So, what does it take to build a new JIT engine in PostgreSQL? Hopefully, not that much, otherwise I would likely not be blogging about this.

    When JIT compilation was introduced, it was suggested on hackers to make LLVM a plugin, allowing future extensions to bring other JIT compilers. Back then, I was quite skeptical of this idea (but never expressed that opinion, I did not want to be wrong later), and it turned out I proved myself wrong… The interface is really simple: your .so only needs to provide a single _PG_jit_provider_init function, which initializes three callbacks, named compile_expr, release_context and reset_after_error. The main one is obviously compile_expr. You get one ExprState* parameter, a pointer to an expression, made of opcodes. Then it’s “only” a matter of compiling the opcodes together in any way you want, marking the built code as executable, and changing the evalfunc to point to this code instead of the PostgreSQL interpreter. This is easy, and you get an automatic fallback to the PostgreSQL interpreter if you encounter any opcode you’ve not implemented yet.
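
    The skeleton of such a provider looks roughly like this (the callback names are mine; the entry point and the three slots are what PostgreSQL expects):

    #include "postgres.h"
    #include "jit/jit.h"

    PG_MODULE_MAGIC;

    static bool
    copyjit_compile_expr(ExprState *state)
    {
        /* Compile state->steps here and set state->evalfunc on success.
         * Returning false falls back to the PostgreSQL interpreter. */
        return false;
    }

    static void copyjit_release_context(JitContext *context) { }
    static void copyjit_reset_after_error(void) { }

    void
    _PG_jit_provider_init(JitProviderCallbacks *cb)
    {
        cb->compile_expr = copyjit_compile_expr;
        cb->release_context = copyjit_release_context;
        cb->reset_after_error = copyjit_reset_after_error;
    }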

    The copy-and-patch algorithm (implemented with only a few small optimizations so far) is so easy I can explain it here. For each opcode, the compiler looks into the stencil collection. If the opcode has a stencil, the stencil is appended to the “built” code; otherwise the compilation stops and the PostgreSQL interpreter will kick in. After appending a stencil, each of its holes is patched with the required value.

    For instance, let’s consider this basic unoptimized stencil, for the opcode CONST.

    Datum stencil_EEOP_CONST (struct ExprState *expression, struct ExprContext *econtext, bool *isNull)
    {
        *op.resnull = op.d.constval.isnull;
        *op.resvalue = op.d.constval.value;
    
        NEXT_OP();
    }

    op is declared as extern ExprEvalStep op; (and NEXT_OP is a bit harder to explain, I won’t dig into it here). When building this into a single .o file, the compiler will leave a hole in the assembly code where the address of op will have to be inserted (using a relocation). When the stencil collection is built, this information is kept and used by the JIT compiler to patch in the address of the current opcode structure and get working code.

    The build process for the stencils is quite fun; not complicated, but fun. The first step is to build the stencils into a single .o file, then extract the assembly code and relocations from this .o file into C-usable structures that the JIT compiler will link against.
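
    Put together, the compiler core boils down to something like this (a simplified, hypothetical sketch: Stencil, stencil_for_opcode and patch_hole are made-up names standing in for the real structures; allocating the buffer and marking it executable are not shown):

    size_t  code_len = 0;   /* bytes emitted so far into 'code' */

    for (int i = 0; i < state->steps_len; i++)
    {
        ExprEvalStep *op = &state->steps[i];
        const Stencil *st = stencil_for_opcode(ExecEvalStepOp(state, op));

        if (st == NULL)
            return false;        /* unknown opcode: let the interpreter run it */

        memcpy(code + code_len, st->bytes, st->len);   /* copy the stencil body */
        for (int h = 0; h < st->nholes; h++)           /* patch each hole */
            patch_hole(code + code_len, &st->holes[h], op);
        code_len += st->len;
    }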

    And that’s about all there is.

    At first, I was extracting the assembly code manually. That way, I managed to get the three opcodes needed for SELECT 42; to work. And there was much joy. After this first proof of concept (and, I guess, some disturbed looks a few days ago at PgDay.Paris when people saw me happy with being able to run SELECT 42, that may have sounded weird), I wrote a DirtyPython (unofficial variant) script to automate the assembly code extraction, and in a few hours I implemented function calls, single table queries, more complicated data types, and introduced a few optimizations…

    Current state

    It works on my computer with PostgreSQL 16. It should be fine with older releases. It only supports AMD64 because that’s what I have, and I cannot target everything at once. Later I will add ARM64, and I would love to have some time to add support for some interesting targets like POWER64 or S390x (these may require some compiler patches, sadly, and access to such computers, nudge nudge wink wink)…

    Performance-wise, well, keeping in mind that I’ve spent almost no time optimizing it, the results are great. Code generation is done in a few hundred microseconds, making it usable even for short queries, where LLVM is simply out of the game. On a simple SELECT 42; query, running with no JIT takes 0.3 ms, copyjit requires 0.6 ms, LLVM with no optimizations goes to 1.6 ms, and optimizing LLVM requires 6.6 ms. Sure, LLVM can create really fast code, but the whole idea here is to quickly generate fast enough code, so comparing both tools doesn’t make much sense.

    But still, you are all waiting for a benchmark, so here we go: benchmarking two queries on a simple non-indexed 90k row table. This benchmark was done on a laptop and my trust in such a benchmark setup is moderate at best; a proper benchmark will be done later on a desktop computer without any kind of thermal envelope shenanigans. And I have not optimized my compiler yet; it’s still quite stupid and there are a lot of things that can and must be done.

    Query                                        Min/max (ms)     Median (ms) / stdev
    select * from b; — no JIT                    10.340/14.046    10.652 / 0.515
    select * from b; — JIT                       10.326/14.613    10.614 / 0.780
    select i, j from b where i < 10; — no JIT    3.348/4.070      3.7333 / 0.073
    select i, j from b where i < 10; — JIT       3.210/4.701      3.519 / 0.107
    Stupid benchmark on a laptop running non-optimized code, don’t trust these…

    As you can see, even in the current unfinished state, as soon as there is CPU work to do (here, the where clause), performance relative to the interpreter gets better. It’s only logical, and what is important here is that even though the JIT is an extra, slightly time-consuming step, it takes so little time that even these queries can go a few percent faster.

    Note that even though I’ve implemented only a small handful of opcodes, I can run any query on my server: the JIT engine will just complain loudly about unimplemented opcodes and let the interpreter run the query…

    For the more curious, the code is dumped on GitHub. I say dumped because I focus only on the code, not on the clarity of my git history nor on wrapping it in a nice paper with flying colors and pretty flowers; that’s what you do when the code is done, and this one isn’t yet… If you want to build it, the build-stencils.sh file must be run manually first. But again, I do not document it yet because I simply cannot provide any support for the code in its current state.

    TODO…

    This is a proof of concept. I’ve not worked on making it easy to build, or on making it possible to package… The build scripts are Debian and PostgreSQL 16 specific. And, well, to be honest, at this point I don’t care much and it does not trouble me: my focus is on implementing more opcodes and searching for optimizations.

    I really hope I will reach a point where I can safely package this and deploy it on my production servers. This way, I’ll keep using the LLVM JIT on the server that can use it (a GIS server where queries are worth the optimization) and use this JIT on my web-application databases, where short query times are a must-have and the LLVM optimizations end up being counter-productive.

    I am also dead serious about porting this to other architectures. I love the old days of Alpha, Itanium, SPARC, M68k and other different architectures. I am not going to use those kinds of systems, but I miss the diversity, and I really don’t want to be part of the monoculture issue here.

    Thanks

    First, huge thanks to my current day-job employer, Entr’ouvert. We are a small French SaaS company, free-software focused, and my colleagues simply let me toy with this between tickets and other DBA or sysadmin tasks.

    I would like to thank my DBA friends for supporting me and motivating me into doing this (won’t give their names, they know who they are). BTW: use PoWA, great tool, tell your friends…

    Also, a quick question: people suggested I should go to PGConf.dev to show this, but it’s too late for the schedule and, since I live in France, I did not intend to go there. If you think it’s important or worth it, please, please, say so (comments below, or my email is p@this.domain), otherwise see you at future European PG events 🙂