Custom calling convention & ARM target

Hello.

For our project needs we implemented a custom calling convention. The
main goals are to pass function arguments in registers and always use
tailcall optimization for calls to functions with our CC when
applicable.
Function arguments are always pointers and the maximum number of
arguments is 5. No frame pointer register is in use for this CC. No varargs.
Finally, there are no callee-saved registers.
This approach has worked successfully on x86.

For ARM we are having trouble with the LR register.
The problem is that when a function using our CC returns,
the existing LLVM machinery emits a 'mov pc, lr' instruction, which looks fine.
The expectation is that we would return to the function that first called
into our CC (that first call is not a tail call).
But at that point the LR register contains an invalid value
because it wasn't preserved.
So the question is how to preserve LR register in the best way? My
current idea is to write a MachineFunctionPass which would add LR
register spill instruction to stack or some other memory and add LR
reload instruction on return.

Does this seem like a reasonable approach? Thank you for your
time and consideration.

Kind Regards,

Alexander Mitin

> For ARM we are having trouble with the LR register.
> The problem is that when a function using our CC returns,
> the existing LLVM machinery emits a 'mov pc, lr' instruction, which looks fine.

It's actually pretty suspicious. You'd only realistically use that
sequence on a very, very old CPU which puts you deep into barely
tested territory on LLVM. I'd look into setting your triple and target
to something more recent.

Triple should probably be "arm-linux-gnueabi" at least, or maybe
"arm-none-eabi" if you're targeting bare metal; the CPU would probably
be OK at default for either of those, but otherwise would normally be
something implementing at least ARMv6 (arm1176jzf-s in RPi), probably
ARMv7 (something starting with "cortex").

> So the question is how to preserve LR register in the best way? My
> current idea is to write a MachineFunctionPass which would add LR
> register spill instruction to stack or some other memory and add LR
> reload instruction on return.

The backend should preserve LR through to the return instruction
automatically since it's a fundamental part of any calling convention
on ARM. I had a quick look and couldn't even see a way to break it
while tweaking purely calling convention knobs, so I suspect your CPU
issue above is to blame.

If fixing that doesn't resolve the issue, could you tell us which bits
of the calling convention you have customized for ARM? It'd hopefully
help to narrow down where things might be going wrong, because I'm
pretty perplexed.

Cheers.

Tim.

Hi Tim,

Thank you for your reply.
Actually, I have already played with various target triples, including what
sys::getProcessTriple() returns when compiling on
a Raspberry Pi 3 device.
Yes, changing the triple to armv7-unknown-linux-gnueabi changes the
emitted return instruction to 'bx lr'. But this is not the issue.
Let me describe it based on an example I prepared to demonstrate the problem.
LLVM already contains the GHC calling convention (aka cc10), and that CC
is very similar to what we are trying to implement. The difference
is that our CC has a simpler argument specification (only pointers)
and could have a prologue/epilogue.
I wrote a simple example, which is a sort of interpreter implemented as
threaded code.
The C version of it is here:
https://gist.github.com/amitin/7df4fbb806c0b48eb5bcaf614e5d93cd#file-test-c
There are three handler functions which invoke each other sequentially
(the order doesn't matter actually). A starter function initializes
and runs handler functions. It runs until a terminator function is
encountered which returns the execution flow to the starter function.
Note that the code is really simplified for you to get the idea.
However it works and could be compiled using cmake script which I
included in the same gist.
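
For readers without the gist at hand, the shape of such a threaded-code interpreter can be sketched roughly as follows. This is not the code from the gist: the `VM` struct, the `dispatch` helper and the `trace` array are hypothetical names made up for illustration, and plain C does not guarantee that the final call in each handler becomes a branch.

```c
#include <stddef.h>

/* Hypothetical VM state; these names are illustrative, not the gist's. */
typedef struct VM VM;
typedef void (*Handler)(VM *vm);

struct VM {
    const Handler *ip;  /* next handler in the threaded program */
    int trace[8];       /* records which handlers ran, in order */
    size_t count;       /* number of entries used in trace */
};

/* Fetch the next handler and transfer control to it. Every handler ends
   with this call, so an optimizing compiler may turn it into a branch. */
static void dispatch(VM *vm) {
    Handler next = *vm->ip++;
    next(vm);
}

static void handler0(VM *vm) { vm->trace[vm->count++] = 0; dispatch(vm); }
static void handler1(VM *vm) { vm->trace[vm->count++] = 1; dispatch(vm); }
static void handler2(VM *vm) { vm->trace[vm->count++] = 2; dispatch(vm); }

/* The terminator does not dispatch; returning from it is what hands
   control back to the starter. */
static void terminator(VM *vm) { (void)vm; }

/* The starter sets up the program and makes the first (non-tail) call.
   Returns how many handlers ran before the terminator was reached. */
size_t starter(VM *vm) {
    static const Handler program[] = { handler0, handler1, handler2, terminator };
    vm->ip = program;
    vm->count = 0;
    dispatch(vm);
    return vm->count;
}
```

With the handlers compiled as ordinary C functions, each one returns through its caller; the interesting case in this thread is when every `dispatch` becomes a branch and only the terminator performs a real return.
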
I compiled it into LLVM IR using clang -S -O3 -emit-llvm
--target=armv7-unknown-linux-gnueabihf test.c
Then I modified the resulting IR in order to use cc10 for handler
functions and simplified it a bit, see
https://gist.github.com/amitin/7df4fbb806c0b48eb5bcaf614e5d93cd#file-test-ll
Next, I compiled it into an asm file using the command 'llc test.ll':
https://gist.github.com/amitin/7df4fbb806c0b48eb5bcaf614e5d93cd#file-test-s
LLVM is really good at tail call elimination, so it replaced all calls
between handlers with plain branches.
Now, getting back to the problem, note that the handlers call 'puts', so the LR
register gets changed.
See https://gist.github.com/amitin/7df4fbb806c0b48eb5bcaf614e5d93cd#file-test-s-L31
Thus, when the execution flow reaches 'terminatorFunc' it will branch
to an unknown location.
See https://gist.github.com/amitin/7df4fbb806c0b48eb5bcaf614e5d93cd#file-test-s-L73

The same IR code works fine for x86_64. You can verify it by changing
the triple to x86_64 (uncomment the test.ll:5 line and comment
test.ll:4 then compile it with llc).
I should mention that I tried all of the above using LLVM 8.0.

Thank you kindly for any insights you can provide.

Hi,

> Now, getting back to the problem, note that the handlers call 'puts', so the LR
> register gets changed.
> See https://gist.github.com/amitin/7df4fbb806c0b48eb5bcaf614e5d93cd#file-test-s-L31

Sure, but this is completely normal, ARM's BL instruction will
*always* change LR. If LLVM wasn't used to dealing with this issue
then no code would ever work on ARM.

What normally happens is that the call gets marked as clobbering LR
which triggers ARMFrameLowering.cpp to save it in the prologue and
restore it in the epilogue.
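
To make the effect concrete, here is a toy model of a link register, written in C. It is only a sketch under simplifying assumptions: the five-instruction "machine", the `Insn` type and the `run` function are invented for this illustration and do not correspond to real ARM semantics or to anything in LLVM. A run that never reaches `OP_HALT` within the step budget stands in for "branched to an unknown location".

```c
#include <stddef.h>

/* Hypothetical mini instruction set, just enough to model LR. */
enum Op { OP_BL, OP_BX_LR, OP_PUSH_LR, OP_POP_LR, OP_HALT };

typedef struct {
    enum Op op;
    int target;  /* branch target for OP_BL, unused otherwise */
} Insn;

/* Runs code from index 0. Returns the index of the HALT that was reached,
   or -1 if max_steps instructions executed without reaching one. */
static int run(const Insn *code, int max_steps) {
    int pc = 0, lr = -1, sp = 0;
    int stack[16];
    for (int step = 0; step < max_steps; ++step) {
        Insn in = code[pc];
        switch (in.op) {
        case OP_BL:      lr = pc + 1; pc = in.target; break; /* always overwrites LR */
        case OP_BX_LR:   pc = lr; break;
        case OP_PUSH_LR: stack[sp++] = lr; pc++; break;
        case OP_POP_LR:  lr = stack[--sp]; pc++; break;
        case OP_HALT:    return pc;
        }
    }
    return -1;
}

/* Handler calls a helper without saving LR, then tries to return. */
int run_without_save(void) {
    static const Insn code[] = {
        { OP_BL, 2 },     /* 0: starter calls handler           */
        { OP_HALT, 0 },   /* 1: intended return point           */
        { OP_BL, 4 },     /* 2: handler calls helper, LR := 3   */
        { OP_BX_LR, 0 },  /* 3: handler "returns" to itself     */
        { OP_BX_LR, 0 },  /* 4: helper returns                  */
    };
    return run(code, 100);
}

/* Same program, but the handler saves and restores LR around the call. */
int run_with_save(void) {
    static const Insn code[] = {
        { OP_BL, 2 },      /* 0: starter calls handler          */
        { OP_HALT, 0 },    /* 1: intended return point          */
        { OP_PUSH_LR, 0 }, /* 2: prologue: save LR              */
        { OP_BL, 6 },      /* 3: handler calls helper           */
        { OP_POP_LR, 0 },  /* 4: epilogue: restore LR           */
        { OP_BX_LR, 0 },   /* 5: handler returns to starter     */
        { OP_BX_LR, 0 },   /* 6: helper returns                 */
    };
    return run(code, 100);
}
```

In this model `run_without_save()` never gets back to the starter, while `run_with_save()` halts at index 1, which is the save/restore pattern described above.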

You mention above that you've patterned your changes on the GHC
convention, which suppresses prologue and epilogue. That's probably
where I'd start to look for the problem. But don't trust what's there
already: I'm not sure how functional the GHC convention is on ARM, it
seems like it could only work if it guaranteed *every* call was a tail
call.

> The same IR code works fine for x86_64. You can verify it by changing
> the triple to x86_64 (uncomment the test.ll:5 line and comment
> test.ll:4 then compile it with llc).

x86_64 has a completely different call/return sequence that
automatically involves the stack so that's not surprising.

Cheers.

Tim.

Hi.

> What normally happens is that the call gets marked as clobbering LR
> which triggers ARMFrameLowering.cpp to save it in the prologue and
> restore it in the epilogue.

Do you mean the callee should save/restore LR?
If so, then this is not as good as it could be.
Let me clarify things a bit more.
We don't want it to look like this:
handlerFunc0: ; our CC
  push {lr}
  ; do something
  pop {lr}
  b handlerFuncX ; branch to the next handler

We need it like this:
starterFunc: ; cdecl or whatever standard CC
  adr lr, label1 ; address of the return point
  push {lr} ; or store lr somewhere else
  bl handlerFuncX ; or even just b handlerFuncX
label1:
  ; do something on return
  ; ..
terminatorFunc: ; our CC
  pop {lr} ; or reload lr from some other memory
  bx lr
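
As a loose analogy only, the "save the return point once in the starter, reload it in the terminator" scheme resembles what setjmp/longjmp do in portable C. The sketch below is not how the CC would be implemented in LLVM; the function names and the `steps` counter are hypothetical, and `return_point` merely plays the role of the stored LR.

```c
#include <setjmp.h>

static jmp_buf return_point;  /* stands in for the saved LR */
static int steps;             /* counts handlers that ran */

/* The terminator reloads the saved return point and jumps to it. */
static void terminator(void) { longjmp(return_point, 1); }

/* Handlers never save or restore anything; they just pass control on. */
static void handler1(void) { steps++; terminator(); }
static void handler0(void) { steps++; handler1(); }

/* The starter records the return point once, then enters the handlers. */
int starter_longjmp(void) {
    steps = 0;
    if (setjmp(return_point) == 0) {
        handler0();  /* does not return normally */
    }
    /* Reached via longjmp from the terminator. */
    return steps;
}
```

Here `starter_longjmp()` returns 2: both handlers ran and control came back to the starter without any handler preserving a return address.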

Also this code would look good:
handlerFunc0: ; our CC
  ; do something
  push {lr}
  bl someFuncWithStandardCC
  pop {lr}
  ; do something else
  b handlerFuncX ; branch to the next handler

I wasn't expecting that it would work out-of-the-box and I don't think
it's a bug in LLVM :)
I'm just looking for guidance on the best way to implement it.

> You mention above that you've patterned your changes on the GHC
> convention, which suppresses prologue and epilogue. That's probably
> where I'd start to look for the problem. But don't trust what's there
> already: I'm not sure how functional the GHC convention is on ARM, it
> seems like it could only work if it guaranteed *every* call was a tail
> call.

In our implementation we don't suppress the prologue/epilogue. However,
no callee-saved registers are allowed, to reduce memory-register traffic.
This CC is very specific and was never intended to be used widely.
And you are correct that it would only work if every
call were guaranteed to be a tail call, and that is exactly how the interpreter of
our VA Smalltalk virtual machine works. It's much faster than any
implementation in C.

Kind Regards,
Alexander Mitin