Hash Table Virtual Calls with Conflict Resolution Stubs

Hi everyone,

I'm the developer of the Loci programming language (
http://loci-lang.org ), for which the compiler is a front-end for
LLVM. I would like to say that LLVM has been extremely useful for the
development of the compiler and so thank you everyone for building
this amazing system.

---- Virtual Method Calls ----

While most aspects of the language map well onto LLVM IR, it seems
non-trivial to generate the virtual method calls in LLVM. Loci uses a
concept called 'structural typing' [1], which basically means classes
don't have to explicitly implement interfaces; if a class has the
methods required by the interface then the class -> interface cast is
valid. A hash table is therefore used to represent the virtual method
table of an object, in which a virtual method call indexes the hash
table with a hash of the method name and then calls the function
pointer it finds in the corresponding slot.

Clearly there can be clashes in the hash table where two (or more)
method names map to the same hash table slot. There are various
approaches for conflict resolution but the cheapest seems to be
setting a hidden value (such as a register) on the caller side to a
larger integer hash value (a large enough space in which collisions
are assumed not to occur) and then generating a 'conflict resolution
stub', which does something like:

    conflict_resolution_stub:
           cmp eax, <METHOD 1 HASH>
           jne .method_2
    .method_1:
           jmp method_1
    .method_2:
           jmp method_2

So in this case 'eax' is being used for conflict resolution and a
simple comparison can distinguish the two methods. The nice aspect of
this approach is that collisions aren't expected to occur in most
cases so the method function can be directly placed in the vtable and
the hidden value will simply be ignored in those cases. Hence in
general the only overhead versus a C++ virtual call is setting the
hidden value, which is essentially a negligible cost. For more detail,
the basic elements of the virtual call mechanism are described in
'Invokeinterface Considered Harmless' [2] (albeit for Java).

---- Encoding calls in LLVM IR ----

Encoding this information in LLVM IR seems to be very difficult,
albeit I've noticed some recent developments (such as musttail) that
could be useful for this. There was a previous discussion of this
issue in 2011 [3] and the conclusion seemed to be that the mechanism
above couldn't be encoded in LLVM IR.

It is possible to generate the necessary inline assembly, but then
LLVM has no knowledge of the function calls and hence no inlining
occurs and all the optimisations hit this boundary. Ideally it would
be possible to generate safe LLVM IR that could be understood by
existing optimisation passes.

So I'm wondering:

* Is there anything in LLVM IR that could help?
* If this currently can't be encoded in LLVM IR, what changes do you
think would be necessary to support this?

I understand that LLVM may not have complete support for this use case
right now; I'd be very happy to do the necessary work and submit any
relevant patches for review. I am very keen *not* to fork LLVM but
instead to try to make upstream changes that are as broadly useful as
possible. I've added some notes below about what approach might be
taken and I'd be interested to know what you think.

Kind regards,
Stephen Cross

[1] Structural Typing — Loci Programming Language
[2] https://www.research.ibm.com/people/d/dgrove/papers/oopsla01.pdf
[3] '[LLVMdev] Compile function with limited set of registers? Jump to' - MARC

---- Notes about potential approaches ----

In terms of the approach, there are at least two parts to this problem:

* Setting/retrieving the hidden value
* Forwarding function arguments

For the former, two vague ideas I had were adding a new calling
convention (it might be useful if external projects could use hooks to
add in their own calling conventions) or modifying inreg to allow
specifying a particular register (albeit encoding target-specific
information like this isn't ideal). Another approach could be adding a
parameter attribute (e.g. 'hidden' or 'outofband') that puts the
parameter in some target-specific location. The 'llvm.read_register'
and 'llvm.write_register' intrinsics would only be useful if they
could guarantee the relevant register wasn't going to be overwritten
in the meantime (and they'd also need to support more registers than
currently).

As for forwarding arguments, it seems like the combination of
'musttail' and varargs come close to satisfying this. The only issue
is that it's not possible to forward a return value.

For the former, two vague ideas I had were adding a new calling
convention (it might be useful if external projects could use hooks to
add in their own calling conventions) or modifying inreg to allow
specifying a particular register (albeit encoding target-specific
information like this isn't ideal).

Assuming you have full control over your environment, I'd not start by
writing a new calling convention. I'd just have:

  ...
  %dest = load_from_itable()
  call i32 %dest(i32 METHOD_ID, rest of the arguments ...)
  ...

and have the conflict resolution stub be:

  define i32 @conflict_redo(i32 %method_id, arg0, arg1, arg2) {
    dispatch based on %method_id
  }

with the normal x86 calling conventions. Later, as an *optimization*
I'd consider a custom calling convention for the %method_id argument.
Not using %rax for %method_id frees it up for use as %dest, for instance,
so using %rdi for %method_id may not actually be a bad idea in
practice.

If you have to inter-operate with legacy code and are not flexible on
how %method_id is propagated (i.e. it _has_ to be in %eax) then a
custom calling convention that puts the first argument in %eax is
probably the best way to go.

As for forwarding arguments, it seems like the combination of
'musttail' and varargs come close to satisfying this. The only issue
is that it's not possible to forward a return value.

I'm not sure I understand what you mean by "forward a return value",
but running the following:

declare i32 @f(i32, i64)

define i32 @conflict_resolution_stub(i32 %t, i64 %arg0) {
  %v = musttail call i32 @f(i32 %t, i64 %arg0)
  ret i32 %v
}

through `llc -O3` gives (trimmed output):

_conflict_resolution_stub: ## @conflict_resolution_stub
pushq %rax
popq %rax
jmp _f ## TAILCALL

I'm not sure what the pushq and popq %rax is here, looks like a
codegen glitch, but you do get the jmp _f as expected.

-- Sanjoy

Hi Sanjoy,

Thanks for your response.

Assuming you have full control over your environment [...]

Just to answer this, I do have control over the calling convention, so comments
about the implementation choices are definitely in scope. Having said that, I'm
keen not to design the calling convention in order to satisfy current compiler
implementation limitations.

Not using %rax for %method_id frees it up for use as %dest, for instance,
so using %rdi for %method_id may not actually be a bad idea in
practice.

Yes, using eax was just an example. The problem with using rdi is that the code
then has to shift all the values up one register, as well as inducing
an unnecessary
stack allocation. On an individual basis this is a minor cost, but
this is a core
language feature which may be invoked a huge number of times in a program.

After your mention of eax, I investigated other registers and discovered that a
viable solution already appears to exist in LLVM in the form of the
'nest' parameter
attribute. On x86_64 this passes a pointer in r10, which satisfies the use case
extremely well.

I've therefore realised that a very similar use case to this is
already well known
to LLVM in the form of nested functions and the static chain pointer.
As far as I
can tell, the registers used are:

* x86 : ecx
* x86_64 : r10
* ARM (32-bit and 64-bit) : r0/x0 (seems to treat it as a normal argument)

(I haven't investigated the other architectures yet.)

So x86 and x86_64 are excellent. I'm curious about the implementation of nested
functions on ARM, since they seem to be treating the static chain pointer as a
normal argument (or is it just that 'nest' is ignored on ARM?). As mentioned
above this is very costly because it means all the arguments have to be shifted
by one position; it would be more efficient if a callee-save register was used.

In any case, I expect to make a proposal at some point soon for either extending
'nest' or adding a new attribute (I'm thinking it could be 'outofband') that
allows passing an arbitrary-typed argument via an out-of-band
mechanism. Following
the established norms for nested functions, this would use ecx on x86,
r10 on x86_64
and a callee-save register on ARM. What do you think?

I'm not sure I understand what you mean by "forward a return value",

My thinking was that the i32 return value might affect the correctness
of the call,
since we may in fact be returning significantly different types (e.g. a large
struct). Having said that, given that we're calling with the correct prototype
and returning the value in a function with the correct prototype it
seems hard to
imagine a code generator choking on this.

Presumably optimisation passes might still have trouble with this, but
it looks like
the new 'thunk' function attribute is designed for this case. Could
anyone explain
in detail the purpose of the 'thunk' attribute and what led to its introduction?
(The IR documentation is slightly vague.)

but running the following:

Yes, this is right and thanks for pointing this out. In an actual conflict
resolution stub we need to pass arbitrary arguments of unknown type,
whereas I've
found casting non-varargs functions sometimes leads LLVM to zero all
the arguments.
Fortunately it looks like LLVM 3.6+ allows forwarding varargs which means this
works smoothly.

Kind regards,
Stephen

Hi Stephen,

Have you looked at how Objective-C implements this? There are at least four Objective-C runtimes that are open source, which all solve this problem in slightly different ways. The GCC, GNUstep, Apple Legacy, Apple Modern, and ObjFW runtimes are all supported by clang, so you can take a look at the IR generation easily as well.

I’d be happy to answer questions about the GNUstep runtime off list,

David

Hello,

So I've done a bit of digging in GCC and for the static chain pointer
it seems to be
using [1]:

* x86: ecx
* x86_64: r10
* ARM: ip (r12)
* AArch64: x18
* PowerPC: r11

The choices here are notable: on ARM, GCC uses ip (r12), the
'intra-procedure-call
scratch register', which can technically be used by veneers generated
by the linker,
but the nature of nested functions means this doesn't occur; I believe it also
works for my use case.

On AArch64, GCC uses x18, the 'platform register', which appears to be unused by
Linux and therefore available as a scratch register. However it *is*
used by iOS [2]
and GCC appears to not have chosen a static chain register for iOS
AArch64 yet [3].
I'm raising this issue on the GCC mailing list.

Currently LLVM supports x86 (ecx) and x86_64 (r10), but as far as I can tell the
other backends don't support 'nest' and hence treat the parameter as
if it didn't
have the attribute; it looks like the PowerPC backend supports the trampoline
intrinsics but its nested parameter isn't passed in the same way as
GCC (though I
haven't yet investigated this target in detail).

I'm putting together patches for ARM and AArch64 that implement 'nest' in their
respective backends to match the choices made by GCC and writing tests
to verify the
registers used are correct (btw LLVM has a great test system). On the
backends that
don't support 'nest' I'd suggest that they should report an error (or
at least some
form of notice) since they're not using an out-of-band mechanism, and
I'd be happy
to write tests to check this. I would also be interested in making the
documentation
for 'nest' slightly clearer. Does this sound sensible?

As for renaming/generalising 'nest', I'm going to leave this for now
but probably
raise it a later time. I do think it'd be cleaner to have something
like 'outofband'
that supports arbitrary parameter types, so e.g. the Loci frontend
doesn't have to
generate different IR for each target (32-bit targets will pass a
pointer to a i64
value, whereas 64-bit targets pass the value directly). This would
replace 'nest'
but have identical behaviour (and register selection) for the case where the
parameter is a pointer.

Have you looked at how Objective-C implements this? There are at least four Objective-C runtimes that are open source, which all solve this problem in slightly different ways. The GCC, GNUstep, Apple Legacy, Apple Modern, and ObjFW runtimes are all supported by clang, so you can take a look at the IR generation easily as well.

I haven't looked at Objective-C. As far as I can tell, Objective-C
doesn't face the
same problem as shown here because blocks essentially pass their scope
as a normal
pointer argument (I ran an Objective-C program through Clang to check
this). Calls
to protocol methods appear to use objc_msg_lookup and then just
directly call the
returned function pointer.

Does that sound correct? (Or did I miss the point you were making?)

Kind regards,
Stephen

[1] https://github.com/gcc-mirror/gcc/blob/7c62dfbbcd3699efcbbadc9fb3aa14f23a123add/gcc/testsuite/gcc.dg/cwsc1.c
[2] Apple Developer Documentation
[3] https://github.com/gcc-mirror/gcc/blob/7c62dfbbcd3699efcbbadc9fb3aa14f23a123add/libffi/src/aarch64/ffitarget.h#L66

(GCC links are to specific commits so they don't go out-of-date.)