Patchpoints used for inline caches and pointless reloads

Hi All,

I am observing something i suspect is a misbehaviour of the register
allocator which impacts the performance of patchpoints. This occurs in
the context of an abstract machine which in some places uses inline
caches. The problematic code looks like this:

entry: ; Initialize the abstract machine
  %db = call create_big_seldom_used_database()
  ; do a lot of things which increases register pressure and spills %db
  br label %main_execution_loop;

main_execution_loop:
  ; We do instruction dispatch here
  ...

opcode_a:
  %name0 = ...
  ; Use the database to look up %name0 and then overwrite the patchpoint
  ; with a direct call
  tail call anyregcc void (i64, i32, i8*, i32, ...)*
    @llvm.experimental.patchpoint.void(i64 4711, i32 16, @lookup_and_patch, i32 0,
                                       some_type %name0, some_type %db)
  ...

  %name1 = ...
  ; Use the database to look up %name1 and then overwrite the patchpoint
  ; with a direct call
  tail call anyregcc void (i64, i32, i8*, i32, ...)*
    @llvm.experimental.patchpoint.void(i64 4711, i32 16, @lookup_and_patch, i32 0,
                                       some_type %name1, some_type %db)
  ...

  br label %main_execution_loop;

If I run this through llc (for x86_64) I will frequently see, especially
if I have two cache lookups in the same basic block or low register
pressure, that %db is loaded from the stack and into a register. The
generated code looks like this:

reload %db into reg0
; %name0 is in reg1
call lookup_and_patch(reg1, reg0)
; shadow
; %name1 is in reg2
call lookup_and_patch(reg1, reg0)
; shadow

This is a performance problem as, although the calls to
lookup_and_patch() are overwritten, we will always pay for the, now
useless, load of %db into reg0. If I wanted the arguments to
lookup_and_patch() in registers I would not have used the anyregcc
calling convention. In this toy example lookup_and_patch() only refers a
single variable, but in my real application it uses multiple values
(most of them spilled) and the slowdown is quite noticeable (overheads
of up to 600% for some opcodes).

To me this looks like the register allocator is too eager to load values
which are only used by anyregcc patchpoints into registers, or is this
the intended behavior of anyregcc patchpoints?

I would be grateful for suggestions of how I could modify the register
allocator (RAGreedy) to avoid reloading values when they are only used
by instructions which are anyregcc patchpoints. During the last two
weeks I have made a couple unsuccessful attempts at that and could
really use some pointers from someone who understands it.

Attached is a the smallest example I have managed to find which shows
the problem.

Regards,

--Frej

bug.ll (808 Bytes)

Just to make sure we’re on the same page, let me try to rephrase your problem:
“After the initial call is patched out of existence, the register spilling which was reasonable for the initial call becomes a performance liability for the patched in assembly.”

Is that a correct restatement?

If so, I’d suggest you look into not using the anyregcc convention. My understand is that that convention is intended for calls where values are desired in registers over the call. This is not what you want.

If you used the standard C convention, the calling convention would place the first X (where X is 6 I think) arguments in registers and the rest on the stack. If you pad your actual arguments with undef such that the arguments that are used by your patch mechanism are after that threshold, you might see what you’re looking for. However, I suspect you’d still see a move from the stack to a register, and then back to a different slot due the calling convention adjustments.

What I think you really want is a parameter attribute that we don’t have. You want to be able to tag a particular parameter as being “anylocation” (analogous to the anyreg of anyregcc). In concept, a “anystack” notion is similar to what we do for gc.statepoints (and, I think, excess arguments on patchpoints?). That’s still not quite as strong as what you want though…

If you do decide to pursue the “anylocation” parameter attribute, definitely keep me informed. This is analogous to something we want for gc.statepoints as well.

p.s. Without knowing what DB actually is, it really looks like you might be able to accomplish the same thing using the “ID” field of the patchpoint. Have you looked into that?

Philip

Philip Reames <listmail@philipreames.com> writes:

Just to make sure we're on the same page, let me try to rephrase your
problem:
"After the initial call is patched out of existence, the register
spilling which was reasonable for the initial call becomes a
performance liability for the patched in assembly."

Is that a correct restatement?

That is correct.

If so, I'd suggest you look into not using the anyregcc convention. My
understand is that that convention is intended for calls where values
are desired in registers over the call. This is not what you want.

If you used the standard C convention, the calling convention would
place the first X (where X is 6 I think) arguments in registers and
the rest on the stack. If you pad your actual arguments with undef
such that the arguments that are used by your patch mechanism are
after that threshold, you might see what you're looking for. However,
I suspect you'd still see a move from the stack to a register, and
then back to a different slot due the calling convention adjustments.

Or even worse, if among the input to the patch mechanism are arguments
to the function containing the patch-point, I will probably get a copy
from the caller's stack frame to the callee's. Also as the patch
mechanism looks at values which are "hot" I still want to access them
where they are (hopefully in registers) and not force them into
stack-slots, or into registers dictated by the calling convention.

What I think you really want is a parameter attribute that we don't
have. You want to be able to tag a particular parameter as being
"anylocation" (analogous to the anyreg of anyregcc). In concept, a
"anystack" notion is similar to what we do for gc.statepoints (and, I
think, excess arguments on patchpoints?). That's still not quite as
strong as what you want though...

Since I sent the original email I have been working on a pass inspired
by LiveDebugVariables. Before register allocation it notes the virtual
register for each non-lowered patchpoint operand and then replaces the
operand with an immediate. During register allocation the pass tracks
the original virtual register and when register allocation is complete,
the operands are replaced with a physical register or a stack slot. That
way the arguments to the patchpoint does not influence register
allocation at all, i.e. when the patchpoint call is patched out it will
be as if it never existed.

Instead of treating all non-lowered operands like this, I can simply
limit it to parameters marked with an "anylocation" parameter
attribute. That is a good suggestion for extending the patchpoint
behavior.

If you do decide to pursue the "anylocation" parameter attribute,
definitely keep me informed. This is analogous to something we want
for gc.statepoints as well.

I'll be happy to send you my current patch off-list if you are
interested. I'll definitely add you as a reviewer when I try to push it
upstream. Can you tell me more about what it is you want for
gc.statepoints?

p.s. Without knowing what DB actually is, it really looks like you
might be able to accomplish the same thing using the "ID" field of the
patchpoint. Have you looked into that?

Sure, an extra indirection solves all problems in CS :slight_smile: But it is not
that simple if, for example, values the patch mechanism needs are passed
as arguments to function containing the patchpoint.

Cheers,

--Frej

Philip Reames <listmail@philipreames.com> writes:

Just to make sure we're on the same page, let me try to rephrase your
problem:
"After the initial call is patched out of existence, the register
spilling which was reasonable for the initial call becomes a
performance liability for the patched in assembly."

Is that a correct restatement?

That is correct.

*trim for length*

What I think you really want is a parameter attribute that we don't
have. You want to be able to tag a particular parameter as being
"anylocation" (analogous to the anyreg of anyregcc). In concept, a
"anystack" notion is similar to what we do for gc.statepoints (and, I
think, excess arguments on patchpoints?). That's still not quite as
strong as what you want though...

Since I sent the original email I have been working on a pass inspired
by LiveDebugVariables. Before register allocation it notes the virtual
register for each non-lowered patchpoint operand and then replaces the
operand with an immediate. During register allocation the pass tracks
the original virtual register and when register allocation is complete,
the operands are replaced with a physical register or a stack slot. That
way the arguments to the patchpoint does not influence register
allocation at all, i.e. when the patchpoint call is patched out it will
be as if it never existed.

Oh, cool. Do you have code you could share? I'd really like to see how this approach works out. This is *exactly* the behavior we'd like to have by default for gc.statepoint. I hadn't thought of this approach before.

Instead of treating all non-lowered operands like this, I can simply
limit it to parameters marked with an "anylocation" parameter
attribute. That is a good suggestion for extending the patchpoint
behavior.

Just to be clear, I'm not set on having the parameter attribute. If you can make everything work and get Andy to sign off on the behavior change for patchpoint, that's also fine by me.

If you do decide to pursue the "anylocation" parameter attribute,
definitely keep me informed. This is analogous to something we want
for gc.statepoints as well.

I'll be happy to send you my current patch off-list if you are
interested. I'll definitely add you as a reviewer when I try to push it
upstream. Can you tell me more about what it is you want for
gc.statepoints?

Please do.

Essentially, we'd like exactly what you're describing. Register allocation should ignore uses in the gc.statepoint. The gc.statepoint needs to be updated to know where a particular virtual register got placed, but that's it. We're perfectly happy with values being placed in registers or stack slots.

(Well, mostly. Putting gc references in registers needs to be optional since many runtimes won't support it.)

p.s. Without knowing what DB actually is, it really looks like you
might be able to accomplish the same thing using the "ID" field of the
patchpoint. Have you looked into that?

Sure, an extra indirection solves all problems in CS :slight_smile: But it is not
that simple if, for example, values the patch mechanism needs are passed
as arguments to function containing the patchpoint.

Figured there's be a complication somewhere. Personally, I'm glad the simple solution doesn't work for you; it sounds like you're solving a problem I'd really like to see solved. :slight_smile:

Hi All,

I am observing something i suspect is a misbehaviour of the register
allocator which impacts the performance of patchpoints. This occurs in
the context of an abstract machine which in some places uses inline
caches. The problematic code looks like this:

entry: ; Initialize the abstract machine
%db = call create_big_seldom_used_database()
; do a lot of things which increases register pressure and spills %db
br label %main_execution_loop;

main_execution_loop:
; We do instruction dispatch here

opcode_a:
%name0 = …
; Use the database to look up %name0 and then overwrite the patchpoint
; with a direct call
tail call anyregcc void (i64, i32, i8*, i32, …)*
@llvm.experimental.patchpoint.void(i64 4711, i32 16, @lookup_and_patch, i32 0,
some_type %name0, some_type %db)

%name1 = …
; Use the database to look up %name1 and then overwrite the patchpoint
; with a direct call
tail call anyregcc void (i64, i32, i8*, i32, …)*
@llvm.experimental.patchpoint.void(i64 4711, i32 16, @lookup_and_patch, i32 0,
some_type %name1, some_type %db)

br label %main_execution_loop;

If I run this through llc (for x86_64) I will frequently see, especially
if I have two cache lookups in the same basic block or low register
pressure, that %db is loaded from the stack and into a register. The
generated code looks like this:

reload %db into reg0
; %name0 is in reg1
call lookup_and_patch(reg1, reg0)
; shadow
; %name1 is in reg2
call lookup_and_patch(reg1, reg0)
; shadow

This is a performance problem as, although the calls to
lookup_and_patch() are overwritten, we will always pay for the, now
useless, load of %db into reg0. If I wanted the arguments to
lookup_and_patch() in registers I would not have used the anyregcc
calling convention. In this toy example lookup_and_patch() only refers a
single variable, but in my real application it uses multiple values
(most of them spilled) and the slowdown is quite noticeable (overheads
of up to 600% for some opcodes).

To me this looks like the register allocator is too eager to load values
which are only used by anyregcc patchpoints into registers, or is this
the intended behavior of anyregcc patchpoints?

The calling convention affects the patchpoint call arguments only.

Patchpoint operands that can be in any location should be in the set of “live values”, not call arguments. You’ve done this correctly.

For other cases where those live values should be forced into registers, they should be promoted to call arguments and anyregcc should be used.

But since you’re not using the calling convention, you don’t need to specify anyregcc. The only affect it will have is to clobber %rax, which you probably don’t want. And the purpose of that convention is really the opposite of what you want anyway.

You’re right, this is a register allocator performance bug. It is too aggressive at optimizing local live ranges, and ignores the fact that these uses can be spilled for free.

I would be grateful for suggestions of how I could modify the register
allocator (RAGreedy) to avoid reloading values when they are only used
by instructions which are anyregcc patchpoints. During the last two
weeks I have made a couple unsuccessful attempts at that and could
really use some pointers from someone who understands it.

See http://llvm.org/PR22832 for more information.

-Andy