JIT, episode III: warp speed ahead

Previously…

In the first JIT episode, we saw how copy-patch makes it easy to build a JIT compiler for PostgreSQL, with a slight performance improvement over the PostgreSQL interpreter.

In the second episode, I talked about the performance wall and how hard it is to get a real leap in performance over the interpreter. But it ended on a positive note: a nice performance jump that I was preparing at the time…

Beating the interpreter: let's get back to basics

The interpreter runs each opcode for every record it has to process. Anything it does per record that could be done only once is obviously better done once, and this is where a JIT can beat it. The JIT compiler can apply optimizations that would require per-opcode checks in the interpreter, checks that would defeat their own purpose there. For instance, I mentioned creating inlined opcodes for common function calls like int4eq: having the interpreter replace the indirect call to int4eq with a comparison of the function pointer followed by an inlined version would be silly, since the comparison alone already wastes a lot of time.

So, what can’t the interpreter do? It sure can’t easily remove indirect calls, but that is a 1% performance gain, 2% at most. You won’t make the headlines with that, right? Well, when in doubt, look at the past…

A short story about belief-oriented programming…

A decade ago, I worked at a small company where I heard the weirdest claim ever about system performance: “our application is slower when built in 64-bit mode because the bigger pointer size makes it slower”. I didn’t buy it, spent two days digging into the code, and found the opposite: the 64-bit build brought such a performance improvement that the entire system collapsed on a mutex protecting a core structure of the application… Removing the mutex made the application fly in both 32-bit and 64-bit modes, with 64-bit obviously beating 32-bit.

But why is 64-bit faster? We are talking databases here, so let’s have a look at a table, shall we?

http://users.atw.hu/instlatx64/AuthenticAMD/AuthenticAMD0870F10_K17_Matisse7_InstLatX64.txt (I know uBlock doesn’t like this domain, but this text document there is good, I promise)

On my CPU, loading a 64-bit value into a register takes twice as long as loading a 32-bit value. So surely 64-bit must be slower than 32-bit!

Except that the switch from 32 to 64 bits also fixed one of the biggest issues with x86: the lack of registers. x86 never grew out of its 16-bit roots and had only 8 general-purpose registers, few compared to PowerPC (32), SPARC (31), or ARM (15). When AMD brought 64 bits to the x86 world, they doubled the number of registers, from a ridiculous 8 to an acceptable 16. And from this came a huge performance boost.

What you need to know about memory and registers

Memory = slow

Registers = fast

OK, more seriously, I will not start writing about all of this. Even if it is getting old and somewhat outdated, Ulrich Drepper’s paper “What Every Programmer Should Know About Memory” is still a great read if you’re interested in the topic. The only thing that matters for us is that, even with a lot of cache, writing to memory is slower than writing to a register. Looking at some measurements for my Zen 2 CPU, a comparison between two registers takes less than a cycle (0.33 cycles it seems), but if data has to be loaded from the L1 cache you can add 4 cycles, 12 cycles from L2 and 38 from L3. Way, way slower: 12 to 115 times slower.

Registers are used automatically by your compiler. When you write a function, it figures out which variables to keep in registers and when, and if there are not enough registers for your entire function, it spills them to the stack as needed. If you are interested in this, there are many fun register allocation algorithms and many Wikipedia pages covering the topic.
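
To make this concrete, here is a tiny, purely illustrative C example (not from PostgreSQL or copyjit): with optimizations on, a compiler will typically keep the accumulator of the first function in a register for the whole loop, while the second function forces it through memory on every iteration, which is pretty much what the interpreter does for each opcode result.

/* Illustration only, not PostgreSQL code. */
long sum_in_register(const long *values, long n)
{
	long	sum = 0;			/* typically kept in a register */

	for (long i = 0; i < n; i++)
		sum += values[i];		/* one load per element, no store */

	return sum;
}

long sum_through_memory(const long *values, long n, long *out)
{
	*out = 0;

	for (long i = 0; i < n; i++)
		*out += values[i];		/* must be written back to memory on every
								 * iteration, since *out may alias values[] */

	return *out;
}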

How does this impact our previous example?

Let’s look at one of the most basic opcodes, EEOP_SCAN_VAR, which takes a value from a scan slot so it can be used later.

EEO_CASE(EEOP_SCAN_VAR)
{
	int			attnum = op->d.var.attnum;

	*op->resvalue = scanslot->tts_values[attnum];
	*op->resnull = scanslot->tts_isnull[attnum];

	EEO_NEXT();
}

This is indeed a memory write. Could the interpreter get rid of it? Well, I think it could, but it would be a major undertaking. If we had a variable, kept in a register by the compiler, we could store the value there, and a later step could fetch it from that place; but what if the next step needs another value instead? Then we would have to spill the value back to memory, and checking for this at each step would kill performance. It may be possible to rewrite the entire interpreter as a register-based VM, but I cannot be sure it would be worth it.

And this is the path to beating the interpreter: we can check many things before running the opcodes, trace memory accesses, and use registers as much as possible.

Applying this to copyjit

The great benefit of copy-patch is that you (almost) don’t write assembly code. Porting it to arm64 only required me to learn about ARM64-specific relocations and how to encode immediate values in some ARM opcodes, but nothing more.

But a big downside is that you don’t write assembly code 🙂

And, well, if you don’t write the assembly code, you don’t control register allocation. But there is a simple way around this; let’s talk a bit about calling conventions.

When a function is called, how do you pass parameters to it? If you learned some x86 assembly at school, you will answer “on the stack” and win a free ticket to an assembly refresher course 🙂

When AMD64 was introduced, the System V (SysV) calling convention took over and completely changed the way functions are called: the first six integer or pointer parameters are passed through general-purpose registers, and the first eight floating-point parameters are passed through SSE registers.
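
As an illustration (the function below is hypothetical, the register assignment is the standard System V AMD64 one):

/* Hypothetical function, shown only for the register mapping: under the
 * System V AMD64 ABI its integer arguments arrive in
 *   a -> rdi, b -> rsi, c -> rdx, d -> rcx, e -> r8, f -> r9
 * and a seventh integer argument would go on the stack. */
long example(long a, long b, long c, long d, long e, long f);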

Each opcode is defined as a function with three parameters (matching the function signature PostgreSQL expects for expression evaluation). Staying within the SysV calling convention, that leaves us three more argument registers that the compiler will keep live across the opcode calls and spill automatically if needed. An alternative would have been to use the preserve_none calling convention, but for this first version I did not need it (and I still have many calls to PostgreSQL functions that will use the SysV calling convention anyway).

But three registers means… only two values. Sadly we transitioned from 32 to 64 bits, not to 65 bits… A 65th bit would have given us one bit to represent NULL/NOT NULL: 0 would not have been NULL, and 1 + NULL would be NULL! But we will not rewrite history here; instead we are going to use one register as a set of null flags, one bit per value register (so we are wasting 62 bits there).

Our opcode functions thus get three new parameters: char nullFlags, intptr_t reg0, intptr_t reg1. Jumping to the next opcode requires passing these values along. Great, we keep values in registers across opcodes; now, what about actually using them?
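
The real signatures live in the copyjit stencil library; here is only a rough sketch of the idea, with names of my own invention, assuming PostgreSQL’s usual expression evaluation parameters (ExprState, ExprContext, bool *isnull) for the first three:

/* Sketch only, not the actual copyjit code. */
typedef Datum (*jit_step_fn) (ExprState *state,
							  ExprContext *econtext,
							  bool *isnull,
							  char nullFlags,	/* bit n set => regN is NULL */
							  intptr_t reg0,	/* first value register */
							  intptr_t reg1);	/* second value register */

/* Each stencil ends by calling the next opcode with the same six arguments,
 * so nullFlags, reg0 and reg1 stay in SysV argument registers from one
 * opcode to the next. */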

As a reminder, here are the opcodes we are dealing with for our previous “SELECT * FROM demo WHERE a = 42”.

// SCAN_FETCHSOME
CheckOpSlotCompatibility(op, scanslot);

slot_getsomeattrs(scanslot, op->d.fetch.last_var);

op++;

// SCAN_VAR
int			attnum = op->d.var.attnum;

/* See EEOP_INNER_VAR comments */

Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
*op->resvalue = scanslot->tts_values[attnum];
*op->resnull = scanslot->tts_isnull[attnum];

op++;

// FUNCEXPR_STRICT_2
FunctionCallInfo fcinfo = op->d.func.fcinfo_data;
NullableDatum *args = fcinfo->args;

Assert(op->d.func.nargs == 2);

/* strict function, so check for NULL args */
if (args[0].isnull || args[1].isnull)
    *op->resnull = true;
else
{
    Datum		d;

    fcinfo->isnull = false;
    d = op->d.func.fn_addr(fcinfo);
    *op->resvalue = d;
    *op->resnull = fcinfo->isnull;
}

op++;

// QUAL
/* simplified version of BOOL_AND_STEP for use by ExecQual() */

/* If argument (also result) is false or null ... */
if (*op->resnull ||
    !DatumGetBool(*op->resvalue))
{
    /* ... bail out early, returning FALSE */
    *op->resnull = false;
    *op->resvalue = BoolGetDatum(false);
    EEO_JUMP(op->d.qualexpr.jumpdone);
}

/*
 * Otherwise, leave the TRUE value in place, in case this is the
 * last qual.  Then, TRUE is the correct answer.
 */

op++;

// DONE_RETURN
*isnull = state->resnull;
return state->resvalue;

This code doesn’t use our registers. So I rewrote every opcode implementation to use the registers instead of writing to memory.

// SCAN_FETCHSOME
CheckOpSlotCompatibility(op, scanslot);

slot_getsomeattrs(scanslot, op->d.fetch.last_var);

op++;

// SCAN_VAR
int			attnum = op->d.var.attnum;

/* See EEOP_INNER_VAR comments */

Assert(attnum >= 0 && attnum < scanslot->tts_nvalid);
reg0 = scanslot->tts_values[attnum];
if (scanslot->tts_isnull[attnum])
  nullFlags |= 1;
else
  nullFlags &= ~1;

op++;

// FUNCEXPR_STRICT_2
/* strict function, so check for NULL args */
if (nullFlags & 3)
    nullFlags |= 1;
else
{
    reg0 = (reg0 == reg1);
}

op++;

// QUAL
/* simplified version of BOOL_AND_STEP for use by ExecQual() */

/* If argument (also result) is false or null ... */
if ((nullFlags & 1) || !DatumGetBool(reg0))
{
    /* ... bail out early, returning FALSE */
    nullFlags &= ~1;
    reg0 = BoolGetDatum(false);
    EEO_JUMP(op->d.qualexpr.jumpdone);
}

/*
 * Otherwise, leave the TRUE value in place, in case this is the
 * last qual.  Then, TRUE is the correct answer.
 */

op++;

// DONE_RETURN
*isnull = (nullFlags & 1);
return reg0;

In this version, all memory accesses have been replaced with register accesses, hurray!

But this will work only for a simple query like this one. Once we start having more variables to store, we will need a spilling mechanism, a way to swap registers…

Another issue appears when you call, for instance, a non-inlined function. The EEOP_FUNCEXPR opcode is defined as:

EEO_CASE(EEOP_FUNCEXPR)
{
    FunctionCallInfo fcinfo = op->d.func.fcinfo_data;
    Datum		d;

    fcinfo->isnull = false;
    d = op->d.func.fn_addr(fcinfo);
    *op->resvalue = d;
    *op->resnull = fcinfo->isnull;

    EEO_NEXT();
}

Parameters are fed through the fcinfo_data structure, and in the usual interpreter execution the other opcodes write directly into this structure.

This means that we must track all memory accesses across opcodes and make sure that any read an opcode implementation expects to make will not hit a memory location we never wrote to, because the value now lives in a register.

Implementing this in copyjit

I started with a small experiment, a “variabilizer”, that looks at each opcode and, through every memory access (read/write), figures out all the variables used in a run, their lifetimes… It can even detect constants stored in memory (a memory location that is only ever read from).
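
I won’t detail the variabilizer’s internals here, but conceptually it boils down to recording something like this for every memory location touched by the opcode stream (a made-up sketch, not the actual copyjit structures):

/* Made-up sketch, not the real copyjit data structure. */
typedef struct MemLocUsage
{
	uintptr_t	address;		/* memory location touched by the opcodes */
	int			first_write;	/* index of the first opcode writing it, -1 if none */
	int			last_read;		/* index of the last opcode reading it */
	bool		read_only;		/* never written: a constant we could bake in */
	int			reg;			/* value register assigned to it, -1 if kept in memory */
} MemLocUsage;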

I then refactored a lot of the compiler code over the past weeks. I started by moving the specialized opcode definitions and dispatch into the stencil library only, removing every special case I had in the compiler part. This required #defining a way for the C code in stencils.c to generate more C code in the built-stencils.h file through the stencil-builder.py script. Fun, but complicated and hairy stuff.

After that, I started rewriting the stencil signature and several opcodes to use registers instead, and wrote a “contract” for each opcode, defining what is expected in each register, what will be written to each register, and what is going to be read from or written to memory.

With all these changes, here is what the FUNCEXPR_STRICT opcode optimized for int4eq looks like.

/// Variant with int4eq inlined
STENCIL(EEOP_FUNCEXPR_STRICT__1)
{
	if (nullFlags & 3) {
		// Make sure reg0 is marked as null
		nullFlags |= (1 << 0);
	} else {
		reg0 = (DatumGetInt32(reg0) == DatumGetInt32(reg1));
		nullFlags &= ~(1 << 0);
	}
	goto_next;
}
SELECTOR(EEOP_FUNCEXPR_STRICT__1, op->d.func.fn_addr == &int4eq)
BEGIN_REGISTER_CONTRACT(EEOP_FUNCEXPR_STRICT__1)
EXPECT(0, op.d.func.fcinfo_data->args[0].isnull, 
          op.d.func.fcinfo_data->args[0].value)
EXPECT(1, op.d.func.fcinfo_data->args[1].isnull, 
          op.d.func.fcinfo_data->args[1].value)
WRITE(0, op.resnull, op.resvalue)
END_REGISTER_CONTRACT

More metadata than actual code… But if that’s what it takes to get a good performance boost, then here we go.

With that metadata ingested, the compiler can fill the registers with the proper values when needed. Another big issue that I’m not covering here is that doing this requires some minimal control-flow analysis. For my simple benchmark this is not a problem, and the code is getting ready for a wider range of queries, but I did not want to cover this and preferred focusing on the register work…

Results

Well… This is the optimization I mentioned in the previous article. So, on our stupid benchmark, running a simple SELECT * FROM demo WHERE a = 42 ten times on a 10-million-row table…

PostgreSQL              No JIT             LLVM JIT                 Copyjit
Average time (ms)       120                106 (-12%)               101 (-15%)
Compilation time (ms)   0                  19                       0.06
Instructions            13,350,766,209     10,643,820,667 (-21%)    12,769,013,536 (-5%)
Cycles                  4,660,821,596      4,005,881,863 (-14%)     3,924,602,439 (-16%)
Branches                2,322,470,659      1,798,221,785 (-23%)     2,031,456,214 (-13%)

As you can see, this is exactly what we expected: fewer instructions, sure, but that is not what gave us the performance boost here. What changed is the number of cycles, because the same instruction now uses a register instead of a memory access, saving several cycles per instruction.

The LLVM JIT achieves about the same run time here, but it takes some time to generate the bitcode (less than 1 ms), then several milliseconds to analyze it, optimize it, and finally translate it to machine code. This makes the LLVM JIT slower here than copyjit, while copyjit still has room for improvement (I have yet to look at tuple deforming).

See you in the next one, I think we already know what the topic will be… Well, after I finish porting every opcode to this new metadata, test more stuff, and likely figure out some more optimizations along the way…

PS: as said previously, help is welcome, the code is FOSS as usual on GitHub, and I would gladly accept any sponsoring, mission, or anything else that could give me more time to work on this…
