__attribute__((apple_abi)): targeting Apple/ARM64 ABI from Linux (and others)

Hello everyone,

I made a quick patch to clang/llvm to introduce an "apple_abi" function attribute
(https://github.com/aguinet/llvm-project/commit/c4905ded3afb3182435df30e527955031cb0d098),
to be able to compile functions for the Apple ARM64 ABI when targeting other ARM64 OSes
(e.g. Linux). This can be seen as the Apple version of the already existing "ms_abi"
attribute.

In this mail, I will describe why we would want to do such a thing, the current
implementation and some remaining questions I have about this (like "isn't this a terrible
idea").

Motivation

Hi,

For the record, I've spent a nontrivial amount of time on the ARM64 version of Wine, and back in the day started out by implementing the ms_abi attribute for aarch64 just to get the handling of printf-like functions right - dealing with (to some extent) most of the same issues you're dealing with here.

(Also, as a side comment: the existing names "win64cc", CC_Win64 and "IsWin64", used in a number of places, are a bit misnamed in the current scope. In the original, x86-only context (where 32- and 64-bit code generation is mostly shared), the C calling convention is similar on x86_32 and differences only arise on x86_64, so naming it "Win64" was probably quite neat. Within AArch64, though, it's a bit redundant - and if a similar distinction were needed on ARM (e.g. if an explicit Windows calling convention were needed), reusing the existing "win64cc" would be even more out of place...)

In one of our other attempts to make all this mess easier to handle, we adapted the https://github.com/shinh/maloader project (that will be open source if all of this works) to load ARM64 MachO under Linux and run the final binary using qemu-user. This can be seen as a very light version of wine [3] for iOS.

[3] What I say here isn't entirely true, as darlinghq moved away from this "wine" model (which can be seen very basically as make a loader for the targeted architecture, create wrappers for system libraries and run all of this in userland). For those interested in more information, I recommend reading the article in http://blog.darlinghq.org/2017/02/the-mach-o-transition-darling-in-past-5.html

I would say this isn't entirely accurate regarding how wine works - maybe it was the case for other thinner win32 binary loaders that have existed though.

Wine never (at least not in the last 20 years afaik) just translated calls between the windows and host environment. Wine consists of a mostly full reimplementation of all the supported Windows APIs, and these only occasionally call down to the host libc and host's native APIs. It's true that Wine used to build its modules as native ELF (or MachO) binaries - but they weren't just plain ELF .so's; internally they contain most of the PE DLL data structures as well, so that they run and interact with other modules using the normal DLL import/export mechanisms.

But lately this has been taken even further, and now most modules can be built as real DLLs as well - linking against wine's msvcrt/ucrt instead of the host libc, etc. For higher level components that only interact with other DLLs, this is mostly straightforward, but for lower level components that actually do need to call the native host environment, they have been split into a native ELF/MachO component (which links against whatever system libraries it needs to use), and the bulk of the code as either a real DLL or as a DLL wrapped in ELF/MachO. This requires having a suitable cross compiler available (but with clang being multi-targeting, that should be trivially available).

So that sounds very much like the same approach that Darling is taking, except that Darling doesn't maintain support for building the emulated components as ELF, only as native MachO. And Darling has the benefit of being able to build Apple's open sourced code, instead of having to reimplement it all based on the public interfaces.

In any case - even if the bulk of the code is built as the emulated platform's native binaries (DLL or MachO), I guess there's a need for interaction at some layer (even if the interface might be quite thin), so having support for something like this sounds sensible to me.

And being able to interact with code built for a different ABI on a per-function level also sounds very sensible to me. So I don't think this is a bad idea.

BTW, for running Windows code on Linux, one constant stumbling block has been the use of the x18 register. On Linux, this register is normally free to use by any function, but on Windows, it is supposed to remain constant (pointing at a thread specific data structure), with various workarounds being used to retain it.

For the Darwin case, x18 is reserved (so compiler generated code doesn't use it, similar to windows), but AFAIK nothing really uses it. Earlier, the Darwin kernel used to overwrite the x18 register to 1 on context switch, just to make sure that no code kept relying on it retaining its value, but this doesn't seem to be the case any longer. As no code actually uses it, it shouldn't be any problem for your usecase.

The current implementation & questions

The current implementation introduces the CC_AArch64_Apple calling convention, to enforce the usage of Apple's CC when necessary. This has mainly been inspired by how CC_Win64 works.

There are I think at least these limitations:

* this supposes that the original targeted CC is Apple ARM64 AAPCS. In its current form,
there is no way to support for instance vector calls (see for instance
https://github.com/aguinet/llvm-project/commit/c4905ded3afb3182435df30e527955031cb0d098#diff-f124368bac3e5d7be20450aa83b166daR218)

I'm not familiar with the vector calling convention here - but if that's used, the function (on the C level) already has a suitable attribute specifying the non-standard calling convention? Wouldn't that end up lowered into the right thing here as well?

Or is it a case where there's a generic "vector" calling convention which turns into different things depending on whether targeting linux or darwin? In that case, you'd probably need to add a separate attribute and calling conventions, like apple_vector and sysv_vector (or whatever to call the default), to allow specifying the intent more exactly.

For windows on i386, there's actually at least 4 different calling conventions being used; cdecl (the default for C code), stdcall, fastcall and vectorcall. As those names aren't associated with anything else on other platforms, you can use e.g. __attribute__((fastcall)) on any platform.

My questions would be:
* the fact that we can't target Apple's vector calls ABI shows that having one
CC_AArch64Apple (as CC_Win64 exists) calling convention might not be the right
implementation of this "apple_abi" attribute. Has someone better suggestions?

It doesn't sound too bad to me, but as naming things is one of the hardest things, one could also think of other, less generic names (as the attribute "apple_abi" or whatever it is, doesn't per se imply any specific ABI, but just is the apple default C calling convention) - but "apple_c_default" also is ugly.

* For variadic functions (which are among the functions that have different ABIs), GCC and Clang have __builtin_ms_va_list. My understanding is that we should have the Apple equivalent, but I'm not sure I completely understand what's at stake here. Said differently, is this builtin used to make sure we use the va_list type of the Apple ABI, should the need arise to forward it to another function that uses the Apple ABI?

Exactly. In your example, you're implementing printf, so you're receiving variadic arguments on the stack, boiling them down to a (linux native) va_list and passing them to a linux native vprintf. If you'd be implementing and wrapping the darwin vprintf on the other hand, you'd need to declare it to be receiving a __builtin_apple_va_list.
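As a point of comparison, here is a minimal sketch of the same pattern with the existing ms_abi support, since __builtin_apple_va_list does not exist yet. It assumes GCC or Clang targeting x86-64, where __attribute__((ms_abi)), __builtin_ms_va_list, __builtin_ms_va_start and __builtin_ms_va_end are available; on any other target the macros fall back to the plain <stdarg.h> types so the file still builds. The names FOREIGN_ABI, foreign_va_list, foreign_vsum and foreign_sum are made up for this illustration.

```c
#include <assert.h>
#include <stdarg.h>

#if defined(__x86_64__) && defined(__GNUC__) && !defined(_WIN32)
/* Foreign (Microsoft x64) convention on a SysV host. */
#define FOREIGN_ABI __attribute__((ms_abi))
typedef __builtin_ms_va_list foreign_va_list;
#define foreign_va_start(ap, last) __builtin_ms_va_start(ap, last)
#define foreign_va_end(ap) __builtin_ms_va_end(ap)
#else
/* Fallback so the sketch compiles elsewhere: no foreign ABI involved. */
#define FOREIGN_ABI
typedef va_list foreign_va_list;
#define foreign_va_start(ap, last) va_start(ap, last)
#define foreign_va_end(ap) va_end(ap)
#endif

/* The "v" variant receives a va_list of the foreign ABI's type, so it can
 * be called from foreign-ABI code that already owns such a list. */
static FOREIGN_ABI int foreign_vsum(int count, foreign_va_list ap)
{
    int total = 0;
    assert(count >= 0);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);
    return total;
}

/* The variadic entry point uses the foreign convention for its own
 * arguments and builds a foreign va_list to forward them. */
FOREIGN_ABI int foreign_sum(int count, ...)
{
    foreign_va_list ap;
    foreign_va_start(ap, count);
    int total = foreign_vsum(count, ap);
    foreign_va_end(ap);
    return total;
}
```

An Apple equivalent would presumably follow the same shape, with the va_list type matching what Darwin callers and callees expect.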

Example with printf

For now, we manage to compile this simple example for iOS/arm64:

#include <stdio.h>

int main(int argc, char** argv)
{
    printf("number of args: %d, argv: %s, %s, %s\n", argc, argv[0], argv[1], argv[2]);
    return 0;
}

and run it under the combo maloader/qemu-user under Linux/x64, using this wrapper for printf:

#include <stdarg.h>
#include <stdio.h>

__attribute__((apple_abi)) int darwin_aarch64_printf(const char* format, ...)
{
    va_list args;
    va_start(args, format);
    const int ret = vprintf(format, args);
    va_end(args);
    return ret;
}

The fact that va_start/va_end works by using the Linux ABI from a function whose arguments use the Apple ABI seems completely magical to me, so if someone knows why this works I would also be interested!

I think this might be a borderline case that I wasn't entirely sure would work right, but apparently does. (Or maybe the code really is flexible enough to systematically handle such mixed cases?)

The calling convention attribute indicates how and where the variadic arguments are laid out on the stack, but these are then collected into a linux native va_list, which is passed to the linux native vprintf function that interprets them accordingly.

FWIW, if you want to experiment with how variadic functions and va_list behave on different platforms, you can try e.g. this test snippet:

void vararg(int a, ...);
void call_vararg(void) {
         vararg(7, 8, 9, 10.0, 11, 12.0, 13);
}

void other(__builtin_va_list ap);
void receive_vararg(int a, ...) {
         __builtin_va_list ap;
         __builtin_va_start(ap, a);
         other(ap);
         __builtin_va_end(ap);
}

int use_vararg(__builtin_va_list *ap) {
         return __builtin_va_arg(*ap, int);
}

Compiling this with e.g. "clang -target {aarch64-windows,aarch64-linux-gnu,arm64-apple-darwin} -S -O2 -o - test.c" lets you have a look at what they end up like. E.g. use_vararg is identical between darwin and windows, while call_vararg is kind of similar between linux and windows (except windows passes all variadic args in GPRs), and receive_vararg is pretty different between all of them.

Is this a terrible idea?

Building these "ABI wrappers" using an "apple_abi" attribute seemed a good idea at the beginning, but this already raises some concerns (see above), and I'd be willing to hear any arguments that show that this is actually a bad idea.

It's certainly more sustainable and durable to provide full, proper implementations of the target, like Darling and Wine do, but even then, being able to build a function taking arguments with a foreign calling convention does sound sensible and useful to me.

Depending on exactly where you draw the line between "emulated"/foreign executables and native host system, you might not have any variadic functions in the border interface layer, and then you might get away without such support in the compiler, but to me, it sounds like a useful thing to have in any case.

// Martin

Hi Adrien,

* this supposes that the original targeted CC is Apple ARM64 AAPCS. In its current form,
there is no way to support for instance vector calls (see for instance
https://github.com/aguinet/llvm-project/commit/c4905ded3afb3182435df30e527955031cb0d098#diff-f124368bac3e5d7be20450aa83b166daR218)

I'm afraid I don't understand this point.

* the fact that we can't target Apple's vector calls ABI shows that having one
CC_AArch64Apple (as CC_Win64 exists) calling convention might not be the right
implementation of this "apple_abi" attribute. Has someone better suggestions?

Needing two calling conventions seems really odd to me, unless it's
for genuinely different ABI slices (arm64 vs arm64e or arm64_32 for
example), and even there I'm not sure.

The fact that va_start/va_end works by using the Linux ABI from a function whose arguments
use the Apple ABI seems completely magical to me, so if someone knows why this works I
would also be interested!

It's a series of coincidences conspiring together, I think. Linux's
varargs ABI doesn't change from the normal one, so functions have to
store all GPRs and vector registers that might contain arguments (as
well as where stack args start), and va_list describes where they were
stored:

typedef struct {
  void *stack;
  void *gr_top;
  void *vr_top;
  int gr_offs;
  int vr_offs;
} va_list;

This is what you're getting with your "va_list" declaration. While the
Darwin one is just a double pointer, but conceptually

typedef struct {
  void *stack;
} va_list;

because all anonymous args go on the stack there on Darwin.

That means when you call (Darwin's) va_start in your vprintf function
it "correctly" initializes the first field of that struct, leaving the
rest garbage. The gr_offs and vr_offs fields decide whether to use
gr_top/vr_top or stack to actually get the argument, and in this case
if gr_offs happens to be >= 0 it'll "correctly" use the stack to
retrieve everything. I'm guessing that happens to be the case for
simple programs (quite possibly the stack is still zero-initialized if
this is a trivial test-case).

You're also getting very lucky in that a Darwin varargs function
changes how much of the stack each argument uses, bringing it in line
with the normal AAPCS (otherwise the entire forwarding enterprise
would be doomed and you'd have to implement significant chunks of
vprintf to repack the arguments).
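To make the stack-usage point concrete, here is a simplified sketch that compares two placement rules for stacked arguments: a packed rule, where each argument sits at its natural alignment, and a slot rule, where each argument starts on an 8-byte boundary as described above for anonymous arguments. It is a model only: it assumes scalar arguments of at most 8 bytes whose alignment equals their size, and the helper names (align_up, place_packed, place_slot) are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to a multiple of align (align must be a power of two). */
static size_t align_up(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/* Packed rule: the argument is placed at its natural alignment.
 * Returns the offset chosen and advances *next past the argument. */
static size_t place_packed(size_t *next, size_t size)
{
    size_t at = align_up(*next, size);
    *next = at + size;
    return at;
}

/* Slot rule: every argument starts on an 8-byte boundary, which is what
 * a va_arg loop stepping through 8-byte slots expects. */
static size_t place_slot(size_t *next, size_t size)
{
    size_t at = align_up(*next, 8);
    *next = at + size;
    return at;
}
```

With two 4-byte arguments followed by an 8-byte one, the packed rule yields offsets 0, 4 and 8, while the slot rule yields 0, 8 and 16. Only the second layout can be walked by code that assumes one argument per 8-byte slot.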

So, at a high level what you'll *want* to do to correctly forward from
Darwin to Linux is make sure that always happens: initialize gr_offs
and vr_offs to 0 to begin with so only the stack is available (I'd
also set the *_top fields to NULL for good measure). Take the time to
be grateful you're not trying to go the other way, too!
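The initialization suggested above can be sketched with a portable model of the Linux structure. This is not the real compiler lowering: model_va_list mirrors the layout quoted earlier, and model_va_start_stack_only and model_va_arg_int are hypothetical helpers that only handle int arguments, each assumed to occupy one 8-byte stack slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of the AAPCS64 (Linux) va_list layout quoted above. */
typedef struct {
    void *stack;   /* next stacked argument */
    void *gr_top;  /* end of the saved general-register area */
    void *vr_top;  /* end of the saved vector-register area */
    int gr_offs;   /* negative offset from gr_top; >= 0 means "use stack" */
    int vr_offs;   /* negative offset from vr_top; >= 0 means "use stack" */
} model_va_list;

/* Hypothetical va_start for a caller that put every anonymous argument
 * on the stack: no register save areas, so both offsets start at 0. */
static void model_va_start_stack_only(model_va_list *ap, void *first_stack_arg)
{
    assert(ap != NULL);
    ap->stack = first_stack_arg;
    ap->gr_top = NULL;
    ap->vr_top = NULL;
    ap->gr_offs = 0;
    ap->vr_offs = 0;
}

/* Simplified va_arg(ap, int): take the register save area while gr_offs
 * is negative, otherwise read the next 8-byte stack slot. */
static int model_va_arg_int(model_va_list *ap)
{
    int value;
    if (ap->gr_offs < 0) {
        memcpy(&value, (char *)ap->gr_top + ap->gr_offs, sizeof value);
        ap->gr_offs += 8;
        return value;
    }
    memcpy(&value, ap->stack, sizeof value);
    ap->stack = (char *)ap->stack + 8;
    return value;
}
```

Because gr_offs is set explicitly rather than left as whatever happened to be on the stack, the reader always takes the stack path and never dereferences the NULL gr_top.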

Now, back to your previous question...

* For variadic functions (which are among the functions that have different ABIs), GCC and
Clang have __builtin_ms_va_list. My understanding is that we should have the Apple
equivalent, but I'm not sure I completely understand what's at stake here. Said
differently, is this builtin used to make sure we use the va_list type of the Apple ABI,
should the need arise to forward it to another function that uses the Apple ABI?

That, together with __builtin_ms_va_arg and __builtin_ms_va_start, are
for if you have a Linux-side function that wants to make use of a
va_list or anonymous args coming from Darwin code in a relatively
agnostic way. I think what you're doing (here at least) is so
intimately tied to bridging the two ABIs that using it would just be a
fig-leaf.

Cheers.

Tim.

Hello Martin,

Thanks for your very detailed answer. Comments below.

Hi,

For the record, I've spent a nontrivial amount of time on the ARM64
version of Wine, and back in the day started out by implementing the
ms_abi attribute for aarch64 just to get the handling of printf-like
functions right - dealing with (to some extent) most of the same issues
you're dealing with here.

Interesting, and thanks for your work on the Wine/ARM64 port!

[3] What I say here isn't entirely true, as darlinghq moved away from
this "wine" model (which can be seen very basically as make a loader
for the targeted architecture, create wrappers for system libraries
and run all of this in userland). For those interested in more
information, I recommend reading the article in
http://blog.darlinghq.org/2017/02/the-mach-o-transition-darling-in-past-5.html

I would say this isn't entirely accurate regarding how wine works -
maybe it was the case for other thinner win32 binary loaders that have
existed though.

Wine never (at least not in the last 20 years afaik) just translated
calls between the windows and host environment. Wine consists of a
mostly full reimplementation of all the supported Windows APIs, and
these only occasionally call down to the host libc and host's native
APIs. It's true that Wine used to build its modules as native ELF (or
MachO) binaries - but they weren't just plain ELF .so's; internally they
contain most of the PE DLL data structures as well, so that they run and
interact with other modules using the normal DLL import/export mechanisms.

I do agree on this, and my comment has been a (failed) attempt at trying
to summarize wine in one sentence...

So that sounds very much like the same approach that Darling is taking,
except that Darling doesn't maintain support for building the emulated
components as ELF, only as native MachO. And Darling has the benefit of
being able to build Apple's open sourced code, instead of having to
reimplement it all based on the public interfaces.

In any case - even if the bulk of the code is built as the emulated
platform's native binaries (DLL or MachO), I guess there's a need for
interaction at some layer (even if the interface might be quite thin),
so having support for something like this sounds sensible to me.

And being able to interact with code built for a different ABI on a
per-function level also sounds very sensible to me. So I don't think
this is a bad idea.

Okay, so I guess I will continue down this rabbit hole a little bit more :)

BTW, for running Windows code on Linux, one constant stumbling block has
been the use of the x18 register. On Linux, this register is normally
free to use by any function, but on Windows, it is supposed to remain
constant (pointing at a thread specific data structure), with various
workarounds being used to retain it.

For the Darwin case, x18 is reserved (so compiler generated code doesn't
use it, similar to windows), but AFAIK nothing really uses it. Earlier,
the Darwin kernel used to overwrite the x18 register to 1 on context
switch, just to make sure that no code kept relying on it retaining its
value, but this doesn't seem to be the case any longer. As no code
actually uses it, it shouldn't be any problem for your usecase.

Interesting to know indeed. And TBH I'm glad I don't have to deal with
that problem in this usecase...

The current implementation & questions

The current implementation introduces the CC_AArch64_Apple calling
convention, to enforce the usage of Apple's CC when necessary. This
has mainly been inspired by how CC_Win64 works.

There are I think at least these limitations:

* this supposes that the original targeted CC is Apple ARM64 AAPCS. In its current form,
there is no way to support for instance vector calls (see for instance
https://github.com/aguinet/llvm-project/commit/c4905ded3afb3182435df30e527955031cb0d098#diff-f124368bac3e5d7be20450aa83b166daR218)

I'm not familiar with the vector calling convention here - but if that's
used, the function (on the C level) already has a suitable attribute
specifying the non-standard calling convention? Wouldn't that end up
lowered into the right thing here as well?

Let's say a user wants to target the Apple "aapcs-vfp" calling
convention from a Linux/ARM64 binary. He would for instance want to use
that combination:

__attribute__((apple_abi)) __attribute__((pcs("aapcs-vfp"))) void foo(...)

In our current implementation, that would not work because we would try
to set up two different LLVM calling conventions on the same function.

Or is it a case where there's a generic "vector" calling convention
which turns into different things depending on whether targeting linux or
darwin?

That's my understanding reading for instance
https://llvm.org/doxygen/AArch64RegisterInfo_8cpp_source.html#l00149

In that case, you'd probably need to add a separate attribute and
calling conventions, like apple_vector and sysv_vector (or whatever to
call the default), to allow specifying the intent more exactly.

On the LLVM level I guess yes, but maybe we might keep this simple on
the clang level by allowing the combination above?

For windows on i386, there's actually at least 4 different calling
conventions being used; cdecl (the default for C code), stdcall,
fastcall and vectorcall. As those names aren't associated with anything
else on other platforms, you can use e.g. __attribute__((fastcall)) on
any platform.

Okay, that works in this case indeed.

My questions would be:
* the fact that we can't target Apple's vector calls ABI shows that having one
CC_AArch64Apple (as CC_Win64 exists) calling convention might not be the right
implementation of this "apple_abi" attribute. Has someone better suggestions?

It doesn't sound too bad to me, but as naming things is one of the
hardest things, one could also think of other, less generic names (as
the attribute "apple_abi" or whatever it is, doesn't per se imply any
specific ABI, but just is the apple default C calling convention) - but
"apple_c_default" also is ugly.

Cf. above, allowing the combination of attributes might be a viable
solution.

* For variadic functions (which are among the functions that have
different ABIs), GCC and Clang have __builtin_ms_va_list. My
understanding is that we should have the Apple equivalent, but I'm not
sure I completely understand what's at stake here. Said differently,
is this builtin used to make sure we use the va_list type of the Apple
ABI, should the need arise to forward it to another function that uses
the Apple ABI?

Exactly. In your example, you're implementing printf, so you're
receiving variadic arguments on the stack, boiling them down to a (linux
native) va_list and passing them to a linux native vprintf. If you'd be
implementing and wrapping the darwin vprintf on the other hand, you'd
need to declare it to be receiving a __builtin_apple_va_list.

Okay thanks! So I'll add this to the todo list.

The fact that va_start/va_end works by using the Linux ABI from a
function whose arguments use the Apple ABI seems completely magical to
me, so if someone knows why this works I would also be interested!

I think this might be a borderline case that I wasn't entirely sure
would work right, but apparently does. (Or maybe the code really is
flexible enough to systematically handle such mixed cases?)

Tim Northover described what seems to happen in another answer, and so
it looks like it works mostly by luck.

The calling convention attribute indicates how and where the variadic
arguments are laid out on the stack, but these are then collected into a
linux native va_list, which is passed to the linux native vprintf
function that interprets them accordingly.

FWIW, if you want to experiment with how variadic functions and va_list
behave on different platforms, you can try e.g. this test snippet:

void vararg(int a, ...);
void call_vararg(void) {
         vararg(7, 8, 9, 10.0, 11, 12.0, 13);
}

void other(__builtin_va_list ap);
void receive_vararg(int a, ...) {
         __builtin_va_list ap;
         __builtin_va_start(ap, a);
         other(ap);
         __builtin_va_end(ap);
}

int use_vararg(__builtin_va_list *ap) {
         return __builtin_va_arg(*ap, int);
}

Compiling this with e.g. "clang -target
{aarch64-windows,aarch64-linux-gnu,arm64-apple-darwin} -S -O2 -o -
test.c" lets you have a look at what they end up like. E.g. use_vararg
is identical between darwin and windows, while call_vararg is kind of
similar between linux and windows (except windows passes all variadic
args in GPRs), and receive_vararg is pretty different between all of them.

Thanks a lot for this tip. I will have a closer look at it.

Is this a terrible idea?

Building these "ABI wrappers" using an "apple_abi" attribute seemed a
good idea at the beginning, but this already raises some concerns (see
above), and I'd be willing to hear any arguments that show that this
is actually a bad idea.

It's certainly more sustainable and durable to provide full, proper
implementations of the target, like Darling and Wine do, but even then,
being able to build a function taking arguments with a foreign calling
convention does sound sensible and useful to me.

Okay, fair enough :)

Depending on exactly where you draw the line between "emulated"/foreign
executables and native host system, you might not have any variadic
functions in the border interface layer, and then you might get away
without such support in the compiler, but to me, it sounds like a useful
thing to have in any case.

Our test cases use very few libc/libSystem functions, but some of them
are indeed from the "printf" family, used to output interesting information,
so I think it's worth the effort to support them. The goal indeed isn't
to go through a full implementation of the targeted system.

Hello Tim,

Thanks for the details you provided! Answers & comments below.

* this supposes that the original targeted CC is Apple ARM64 AAPCS. In its current form,
there is no way to support for instance vector calls (see for instance
https://github.com/aguinet/llvm-project/commit/c4905ded3afb3182435df30e527955031cb0d098#diff-f124368bac3e5d7be20450aa83b166daR218)

I'm afraid I don't understand this point.

ARM64 defines two calling conventions: aapcs and aapcs-vfp
(https://developer.arm.com/documentation/dui0491/i/Compiler-specific-Features/--attribute----pcs--calling-convention-----function-attribute).

Using __attribute__((apple_abi)) in its current form would only allow to
target aapcs from a foreign OS, not aapcs-vfp.

* the fact that we can't target Apple's vector calls ABI shows that having one
CC_AArch64Apple (as CC_Win64 exists) calling convention might not be the right
implementation of this "apple_abi" attribute. Has someone better suggestions?

Needing two calling conventions seems really odd to me, unless it's
for genuinely different ABI slices (arm64 vs arm64e or arm64_32 for
example), and even there I'm not sure.

See above. The idea would be to also have, for instance,
CC_AArch64Apple_VFP (even if I'm really not fond of this). It could
also end up as a non-supported case.

The fact that va_start/va_end works by using the Linux ABI from a function whose arguments
use the Apple ABI seems completely magical to me, so if someone knows why this works I
would also be interested!

It's a series of coincidences conspiring together, I think. Linux's
varargs ABI doesn't change from the normal one, so functions have to
store all GPRs and vector registers that might contain arguments (as
well as where stack args start), and va_list describes where they were
stored:

typedef struct {
  void *stack;
  void *gr_top;
  void *vr_top;
  int gr_offs;
  int vr_offs;
} va_list;

This is what you're getting with your "va_list" declaration. While the
Darwin one is just a double pointer, but conceptually

typedef struct {
  void *stack;
} va_list;

because all anonymous args go on the stack there on Darwin.

That means when you call (Darwin's) va_start in your vprintf function

It's Linux's va_start no?

it "correctly" initializes the first field of that struct, leaving the
rest garbage. The gr_offs and vr_offs fields decide whether to use
gr_top/vr_top or stack to actually get the argument, and in this case
if gr_offs happens to be >= 0 it'll "correctly" use the stack to
retrieve everything. I'm guessing that happens to be the case for
simple programs (quite possibly the stack is still zero-initialized if
this is a trivial test-case).

Okay, got it. So the right way to lower this va_start would be to set the rest of
the structure to zero when va_start is called from a function which has the Apple ABI (while
targeting Linux)? (actually answered below)

You're also getting very lucky in that a Darwin varargs function
changes how much of the stack each argument uses, bringing it in line
with the normal AAPCS (otherwise the entire forwarding enterprise
would be doomed and you'd have to implement significant chunks of
vprintf to repack the arguments).

Indeed!

So, at a high level what you'll *want* to do to correctly forward from
Darwin to Linux is make sure that always happens: initialize gr_offs
and vr_offs to 0 to begin with so only the stack is available (I'd
also set the *_top fields to NULL for good measure).

Okay that seems to answer my question just above.

Take the time to be grateful you're not trying to go the other way, too!

Yes :)

* For variadic functions (which are among the functions that have different ABIs), GCC and
Clang have __builtin_ms_va_list. My understanding is that we should have the Apple
equivalent, but I'm not sure I completely understand what's at stake here. Said
differently, is this builtin used to make sure we use the va_list type of the Apple ABI,
should the need arise to forward it to another function that uses the Apple ABI?

That, together with __builtin_ms_va_arg and __builtin_ms_va_start, are
for if you have a Linux-side function that wants to make use of a
va_list or anonymous args coming from Darwin code in a relatively
agnostic way. I think what you're doing (here at least) is so
intimately tied to bridging the two ABIs that using it would just be a
fig-leaf.

Okay, thanks for the confirmation.

Regards