RFC: Implementing the Swift calling convention in LLVM and Clang

Hi, all.

Swift uses a non-standard calling convention on its supported platforms. Implementing this calling convention requires support from LLVM and (to a lesser degree) Clang. If necessary, we’re willing to keep that support in “private” branches of LLVM and Clang, but we feel it would be better to introduce it in trunk, both to (1) minimize the differences between our branches and trunk and (2) allow other language implementations to take advantage of that support.

We don’t expect this to be particularly controversial, at least in the abstract, since LLVM already includes support for a number of variant, language-specific calling conventions. Some of Swift's variations are more invasive than those existing conventions, however, so we want to make sure the community is on board before we start landing patches or sending them out for review.

Here’s a brief technical summary of the convention:

In general, the calling convention lowers onto an existing C calling convention; let’s call this the “intermediary convention”. The intermediary convention is not necessarily the target platform’s standard C convention; for example, we intend to use a VFP convention on iOS ARM targets. Aggregate arguments and results are translated to sequences of scalar types (possibly just an indirect argument/sret pointer) and, for the most part, passed and returned using the intermediary convention’s rules for a function with that signature. For example, if struct A expands to the sequence [i32,float,i32], a function type like (A, Int64) -> Bool would be lowered basically like the C function type bool(*)(int32_t, float, int32_t, int64_t).
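To make that expansion concrete, here is a minimal C sketch (all names hypothetical, not from the actual implementation) of what the flattening looks like: the lowered callee takes the fields of struct A as independent scalars, and a call site simply explodes the aggregate.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical Swift-side type: struct A expands to [i32, float, i32]. */
struct A { int32_t x; float y; int32_t z; };

/* The Swift function type (A, Int64) -> Bool, lowered roughly like the C
 * signature bool(int32_t, float, int32_t, int64_t): each field of A
 * becomes an independent scalar argument. */
static bool lowered(int32_t a_x, float a_y, int32_t a_z, int64_t n) {
    return (int64_t)(a_x + a_z) + (int64_t)a_y == n;
}

/* What the frontend's call-site expansion looks like conceptually. */
static bool call_with_aggregate(struct A a, int64_t n) {
    return lowered(a.x, a.y, a.z, n);
}
```

The backend then just applies the intermediary convention's ordinary rules to the flattened signature, with no aggregate-specific logic involved.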

There are four general points of deviation from the intermediary convention:

  - We sometimes want to return more values in registers than the convention normally does, and we want to be able to use both integer and floating-point registers. For example, we want to return a value of struct A, above, purely in registers. For the most part, I don’t think this is a problem to layer on to an existing IR convention: C frontends will generally use explicit sret arguments when the convention requires them, and so the Swift lowering will produce result types that don’t have legal interpretations as direct results under the C convention. But we can use a different IR convention if it’s necessary to disambiguate Swift’s desired treatment from the target's normal attempts to retroactively match the C convention.

  - We sometimes have both direct results and indirect results. It would be nice to take advantage of the sret convention even in the presence of direct results on targets that do use a different (profitable) ABI treatment for it. I don’t know how well-supported this is in LLVM.

  - We want a special “context” treatment for a certain argument. A pointer-sized value is passed in an integer register; the same value should be present in that register after the call. In some cases, the caller may pass a context argument to a function that doesn’t expect one, and this should not trigger undefined behavior. Both of these rules suggest that the context argument be passed in a register which is normally callee-save.

  - We want a special “error” treatment for a certain argument/result. A pointer-sized value is passed in an integer register; a different value may be present in that register after the call. Much like the context treatment, the caller may use the error treatment with a function that doesn’t expect it; this should not trigger undefined behavior, and the existing value should be left in place. Like the context treatment, this suggests that the error value be passed and returned in a register which is normally callee-save.
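As a rough C model of the error treatment described above (purely an illustration with hypothetical names; the real convention keeps the value in a callee-save register with no memory traffic): the caller zeroes a slot before the call, a throwing callee may overwrite it, and the caller tests it afterward.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct Error Error;   /* opaque stand-in for a Swift error object */

/* A potentially-throwing callee: sets the error slot on failure. */
static int64_t may_throw(int64_t n, Error **error_slot) {
    if (n < 0) {
        *error_slot = (Error *)(intptr_t)1;  /* stand-in for a real error */
        return 0;
    }
    return n * 2;
}

/* Caller-side protocol: initialize the slot to null, call, then test it. */
static int call_and_check(int64_t n, int64_t *result) {
    Error *err = NULL;
    *result = may_throw(n, &err);
    return err == NULL;   /* 1 = no error was set */
}
```

A callee that knows nothing about the slot corresponds to a function that never touches the register: because the register is callee-save, the caller's null value is still there after the call, which is exactly the "didn't throw" answer.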

Here’s a brief summary of the expected code impact for this.

The Clang impact is relatively minor; it is focused on allowing the Swift runtime to define functions that use the convention. It adds a new calling convention attribute, a few new parameter attributes constrained to that calling convention, and some relatively non-invasive call lowering code in IR generation.

The LLVM impact is somewhat larger.

Three things in the convention require a possible change to IR:

  - Using sret together with a direct result may or may not “just work”. I certainly don’t see a reason why it shouldn’t work in the middle-end. Obviously, some targets can’t support it, but we can avoid doing this on those targets.

  - Opting in to the two argument treatments requires new parameter attributes. We discussed using separate calling conventions; unfortunately, error and context arguments can appear either separately or together, so we’d really need several new conventions for all the valid combinations. Furthermore, calling a context-free function with an ignored context argument could turn into a call to a function using a mismatched calling convention, which LLVM IR generally treats as undefined behavior. Also, it wasn’t obvious that just a calling convention would be sufficient for the error treatment; see the next bullet.

  - The “error” treatment requires some way to (1) pass and receive the value in the caller and (2) receive and change the value in the callee. The best way we could think of to represent this was to pretend that the argument is actually passed indirectly; the value is “passed” by storing to the pointer and “received” by loading from it. To simplify backend lowering, we require the argument to be a special kind of swifterror alloca that can only be loaded, stored, and passed as a swifterror argument; in the callee, swifterror arguments have similar restrictions. This ends up being fairly invasive in the backend, unfortunately.

The convention also requires a few changes to the targets that support the convention, to deal with the context and error treatments and to return more values in registers.

Anyway, I would appreciate your thoughts.

John.

It's probably worth noting that the likely diffs involved can be
inferred from the "upstream-with-swift" branch at
"git@github.com:apple/swift-llvm.git", which is a fairly regularly
merged copy of trunk with the swift changes.

I'm steering well clear of the policy decision.

Tim.

Hi, all.
  - We sometimes want to return more values in registers than the convention normally does, and we want to be able to use both integer and floating-point registers. For example, we want to return a value of struct A, above, purely in registers. For the most part, I don’t think this is a problem to layer on to an existing IR convention: C frontends will generally use explicit sret arguments when the convention requires them, and so the Swift lowering will produce result types that don’t have legal interpretations as direct results under the C convention. But we can use a different IR convention if it’s necessary to disambiguate Swift’s desired treatment from the target's normal attempts to retroactively match the C convention.

Is this a back-end decision, or do you expect the front-end to tell
the back-end (via annotation) which parameters will be in regs? Unless
you also have back-end patches, I don't think the latter is going to
work well. For example, the ARM back-end has a huge section related to
passing structures in registers, which conforms to the ARM EABI, not
necessarily your Swift ABI.

Not to mention that this creates the versioning problem, where two
different LLVM releases can produce slightly different PCS register
usage (due to new features or bugs), and thus require re-compilation
of all libraries. This, however, is not a problem for your current
request, just a comment.

  - We sometimes have both direct results and indirect results. It would be nice to take advantage of the sret convention even in the presence of direct results on targets that do use a different (profitable) ABI treatment for it. I don’t know how well-supported this is in LLVM.

I'm not sure what you mean by direct or indirect results here. But if
this is a language feature, as long as the IR semantics is correct, I
don't see any problem.

  - We want a special “context” treatment for a certain argument. A pointer-sized value is passed in an integer register; the same value should be present in that register after the call. In some cases, the caller may pass a context argument to a function that doesn’t expect one, and this should not trigger undefined behavior. Both of these rules suggest that the context argument be passed in a register which is normally callee-save.

I think it's going to be harder to get all opts to behave in the way
you want them to. And may also require back-end changes to make sure
those registers are saved in the right frame, or reserved from
register allocation, or popped back after the call, etc.

The Clang impact is relatively minor; it is focused on allowing the Swift runtime to define functions that use the convention. It adds a new calling convention attribute, a few new parameter attributes constrained to that calling convention, and some relatively un-invasive call lowering code in IR generation.

This sounds like a normal change to support language perks, no big
deal. But I'm not a Clang expert, nor I've seen the code.

  - Using sret together with a direct result may or may not “just work". I certainly don’t see a reason why it shouldn’t work in the middle-end. Obviously, some targets can’t support it, but we can avoid doing this on those targets.

All sret problems I've seen were back-end related (ABI conformance).
But I wasn't paying attention to the middle-end.

  - Opting in to the two argument treatments requires new parameter attributes. We discussed using separate calling conventions; unfortunately, error and context arguments can appear either separately or together, so we’d really need several new conventions for all the valid combinations. Furthermore, calling a context-free function with an ignored context argument could turn into a call to a function using a mismatched calling convention, which LLVM IR generally treats as undefined behavior. Also, it wasn’t obvious that just a calling convention would be sufficient for the error treatment; see the next bullet.

Why not treat context and error like C's default arguments? Or like
named arguments in Python?

Surely the front-end can easily re-order the arguments (according to
some ABI) and make sure every function that may be called with
context/error has it as the last arguments, and default them to null.
You can then later do an inter-procedural pass to clean it up for all
static functions that are never called with those arguments, etc.

  - The “error” treatment requires some way to (1) pass and receive the value in the caller and (2) receive and change the value in the callee. The best way we could think of to represent this was to pretend that the argument is actually passed indirectly; the value is “passed” by storing to the pointer and “received” by loading from it. To simplify backend lowering, we require the argument to be a special kind of swifterror alloca that can only be loaded, stored, and passed as a swifterror argument; in the callee, swifterror arguments have similar restrictions. This ends up being fairly invasive in the backend, unfortunately.

I think this logic is too high-level for the back-end to deal with.
This looks like a simple run of the mill pointer argument that can be
null (and is by default), but if it's not, the callee can change the
object pointed by but not the pointer itself, ie, "void foo(exception
* const Error = null)". I don't understand why you need this argument
to be of a special kind of SDNode.

cheers,
--renato

I’ve done the “environment passed in a callee-save register” thing before, just by using the C compiler’s ability to reserve a register and map a particular C global variable to it.

As you say, there is then no problem when you call code (let’s call it “library code”) which doesn’t expect it. The library code just automatically saves and restores that register if it needs it.

However – and probably you’ve thought of this – there is a problem with callbacks from library code that doesn’t know about the environment argument. The library might save the environment register, put something else there, and then call back to your code that expects the environment to be set up. Boom!

This callback might be a function that you passed explicitly as an argument, a function pointed to by a global hook, or a virtual function of an object you passed (derived from a base class that the library knows about).

Any such callbacks need to either 1) not use the environment register, or 2) set up the environment register from somewhere else before using it or calling other code that uses it, or 3) be a wrapper/thunk that sets up the environment register before calling the real function.

I’ve done the “environment passed in a callee-save register” thing before, just by using the C compiler’s ability to reserve a register and map a particular C global variable to it.

As you say, there is then no problem when you call code (let’s call it “library code”) which doesn’t expect it. The library code just automatically saves and restores that register if it needs it.

However – and probably you’ve thought of this – there is a problem with callbacks from library code that doesn’t know about the environment argument. The library might save the environment register, put something else there, and then call back to your code that expects the environment to be set up. Boom!

Yes, that’s a well-known problem with trying to reserve a register for a ubiquitous environment. That’s not what we’re doing here, though. The error result is more like a special argument / result to the function, inasmuch as it’s only actually required that the value be in that register at the call boundary.

Swift uses a non-zero-cost exceptions scheme for its primary error handling; this result is used to indicate whether (and what) a function throws. The basic idea is that the callee sets the register to either null, meaning it didn’t throw, or an error value, meaning it did; except actually it’s better for code size if the caller sets the register to null on entry.

The considerations on the choice of register are as follows:

  1. We want to be able to freely convert an opaque function value that’s known not to throw to a function that can. The idea here is that the caller initializes the register to null. If the callee is dynamically potentially-throwing, it expects the register to be null on entry and may or may not set it to be some other value. If the callee is not dynamically potentially-throwing, it leaves the register alone because it considers it to be callee-save. So there’s a hard requirement that whatever register we choose be considered callee-save by the Swift convention.

  2. Swift code frequently calls C code. It’s good for performance and code size if a function doesn’t have to save and restore the error-result register just because it’s calling a C function. So we really want it to be callee-save in the C convention, too. This also means that we don’t have to worry about the dynamic linker messing us up.

  3. We don’t want to penalize other Swift functions by claiming an argument/result register that they would otherwise use. Of course, they wouldn’t normally use a callee-save register.

John.

Hi, all.
- We sometimes want to return more values in registers than the convention normally does, and we want to be able to use both integer and floating-point registers. For example, we want to return a value of struct A, above, purely in registers. For the most part, I don’t think this is a problem to layer on to an existing IR convention: C frontends will generally use explicit sret arguments when the convention requires them, and so the Swift lowering will produce result types that don’t have legal interpretations as direct results under the C convention. But we can use a different IR convention if it’s necessary to disambiguate Swift’s desired treatment from the target's normal attempts to retroactively match the C convention.

Is this a back-end decision, or do you expect the front-end to tell
the back-end (via annotation) which parameters will be in regs? Unless
you also have back-end patches, I don't think the latter is going to
work well. For example, the ARM back-end has a huge section related to
passing structures in registers, which conforms to the ARM EABI, not
necessarily your Swift ABI.

Not to mention that this creates the versioning problem, where two
different LLVM releases can produce slightly different PCS register
usage (due to new features or bugs), and thus require re-compilation
of all libraries. This, however, is not a problem for your current
request, just a comment.

The frontend will not tell the backend explicitly which parameters will be
in registers; it will just pass a bunch of independent scalar values, and
the backend will assign them to registers or the stack as appropriate.

Our intent is to completely bypass all of the passing-structures-in-registers
code in the backend by simply not exposing the backend to any parameters
of aggregate type. The frontend will turn a struct into (say) an i32, a float,
and an i8; if the first two get passed in registers and the last gets passed
on the stack, so be it.

The only difficulty with this plan is that, when we have multiple results, we
don’t have a choice but to return a struct type. To the extent that backends
try to infer that the function actually needs to be sret, instead of just trying
to find a way to return all the components of the struct type in appropriate
registers, that will be sub-optimal for us. If that’s a pervasive problem, then
we probably just need to introduce a swift calling convention in LLVM.
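For illustration, here is what a multiple-result function looks like in C terms (hypothetical names): the components come back as a small struct of scalars, and the intent is that the backend returns each field in a register rather than inferring an sret convention for the struct.

```c
#include <stdint.h>

/* Two logical results, packaged as a struct of scalars. Under the Swift
 * lowering the hope is that quot and rem each land in a return register,
 * not that the struct as a whole is forced into memory. */
struct TwoResults { int64_t quot; int64_t rem; };

static struct TwoResults divmod(int64_t a, int64_t b) {
    struct TwoResults r = { a / b, a % b };
    return r;
}
```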

- We sometimes have both direct results and indirect results. It would be nice to take advantage of the sret convention even in the presence of direct results on targets that do use a different (profitable) ABI treatment for it. I don’t know how well-supported this is in LLVM.

I'm not sure what you mean by direct or indirect results here. But if
this is a language feature, as long as the IR semantics is correct, I
don't see any problem.

A direct result is something that’s returned in registers. An indirect
result is something that’s returned by storing it in an implicit out-parameter.
I would like to be able to form calls like this:

  %temp = alloca %my_big_struct_type
  call i32 @my_swift_function(sret %my_big_struct_type* %temp)

This doesn’t normally happen today in LLVM IR because when C frontends
use an sret result, they set the direct IR result to void.

Like I said, I don’t think this is a serious problem, but I wanted to float the idea
before assuming that.

- We want a special “context” treatment for a certain argument. A pointer-sized value is passed in an integer register; the same value should be present in that register after the call. In some cases, the caller may pass a context argument to a function that doesn’t expect one, and this should not trigger undefined behavior. Both of these rules suggest that the context argument be passed in a register which is normally callee-save.

I think it's going to be harder to get all opts to behave in the way
you want them to. And may also require back-end changes to make sure
those registers are saved in the right frame, or reserved from
register allocation, or popped back after the call, etc.

I don’t expect the optimizer to be a problem, but I just realized that the main
reason is something I didn’t talk about in my first post. See below.

That this will require some support from the backend is a given.

The Clang impact is relatively minor; it is focused on allowing the Swift runtime to define functions that use the convention. It adds a new calling convention attribute, a few new parameter attributes constrained to that calling convention, and some relatively un-invasive call lowering code in IR generation.

This sounds like a normal change to support language perks, no big
deal. But I'm not a Clang expert, nor I've seen the code.

- Using sret together with a direct result may or may not “just work". I certainly don’t see a reason why it shouldn’t work in the middle-end. Obviously, some targets can’t support it, but we can avoid doing this on those targets.

All sret problems I've seen were back-end related (ABI conformance).
But I wasn't paying attention to the middle-end.

- Opting in to the two argument treatments requires new parameter attributes. We discussed using separate calling conventions; unfortunately, error and context arguments can appear either separately or together, so we’d really need several new conventions for all the valid combinations. Furthermore, calling a context-free function with an ignored context argument could turn into a call to a function using a mismatched calling convention, which LLVM IR generally treats as undefined behavior. Also, it wasn’t obvious that just a calling convention would be sufficient for the error treatment; see the next bullet.

Why not treat context and error like C's default arguments? Or like
named arguments in Python?

Surely the front-end can easily re-order the arguments (according to
some ABI) and make sure every function that may be called with
context/error has it as the last arguments, and default them to null.
You can then later do an inter-procedural pass to clean it up for all
static functions that are never called with those arguments, etc.

Oh, sorry, I forgot to talk about that. Yes, the frontend already rearranges
these arguments to the end, which means the optimizer’s default behavior
of silently dropping extra call arguments ends up doing the right thing.

I’m reluctant to say that the convention always requires these arguments.
If we have to do that, we can, but I’d rather not; it would involve generating
a lot of unnecessary IR and would probably create unnecessary
code-generation differences, and I don’t think it would be sufficient for
error results anyway.

- The “error” treatment requires some way to (1) pass and receive the value in the caller and (2) receive and change the value in the callee. The best way we could think of to represent this was to pretend that the argument is actually passed indirectly; the value is “passed” by storing to the pointer and “received” by loading from it. To simplify backend lowering, we require the argument to be a special kind of swifterror alloca that can only be loaded, stored, and passed as a swifterror argument; in the callee, swifterror arguments have similar restrictions. This ends up being fairly invasive in the backend, unfortunately.

I think this logic is too high-level for the back-end to deal with.
This looks like a simple run of the mill pointer argument that can be
null (and is by default), but if it's not, the callee can change the
object pointed by but not the pointer itself, ie, "void foo(exception
* const Error = null)". I don't understand why you need this argument
to be of a special kind of SDNode.

We don’t want checking or setting the error result to actually involve memory
access.

An alternative to the pseudo-indirect-result approach would be to model
the result as an explicit result. That would really mess up the IR, though.
The ability to call a non-throwing function as a throwing function means
we’d have to provide this extra explicit result on every single function with
the Swift convention, because the optimizer is definitely not going to
gracefully handle result-type mismatches; so even a function as simple as
  func foo() -> Int32
would have to be lowered into IR as
  define { i32, i8* } @foo(i8*)
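Spelled out in C terms (hypothetical names), that rejected explicit-result lowering would thread the error pointer in and out of every function, even one that can never throw:

```c
#include <stddef.h>
#include <stdint.h>

/* Every Swift-convention function would pair its real result with the
 * error pointer, corresponding to the { i32, i8* } @foo(i8*) IR above. */
struct FooResult { int32_t value; void *error; };

static struct FooResult foo(void *error_in) {
    /* Non-throwing function: propagate the incoming error value unchanged. */
    struct FooResult r = { 42, error_in };
    return r;
}
```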

John.

Also, just a quick question. I’m happy to continue to talk about the actual
design and implementation of LLVM IR on this point, and I’d be happy to
put out the actual patch we’re initially proposing. Obviously, all of this code
needs to go through the normal LLVM/Clang code review processes. But
before we continue with that, I just want to clarify one important point: assuming
that the actual implementation ends up satisfying your technical requirements,
do you have any objections to the general idea of supporting the Swift CC
in mainline LLVM?

John.

I personally don't. I think we should treat Swift as any other
language that we support, and if we can't use existing mechanisms in
the back-end to lower Swift, then we need to expand the back-end to
support that.

That being said, if the Swift support starts to bit-rot (if, for
instance, Apple stops supporting it in the future), it will be harder
to clean up the back-end from its CC. But that, IMHO, is a very
far-fetched future and a small price to pay.

cheers,
--renato

Okay, thank you. Back to technical discussion. :)

John.

The Swift calling model also seems to be quite generally useful. I can imagine that VMKit would have used it, if it had been available then.

My only concern is that the implicit contract between the front and back ends, with regard to calling convention, is already complex, already largely undocumented, and already difficult to infer even if you have the relevant platform ABI document in front of you. It is badly in need of some cleanup and I wonder if the desire to minimise diffs for Swift might provide Apple with some incentive to spend a little bit of engineering effort on it?

David

That's a very good point. We don't have that many chances of
refactoring largely forgotten and undocumented code.

--renato

The frontend will not tell the backend explicitly which parameters will be
in registers; it will just pass a bunch of independent scalar values, and
the backend will assign them to registers or the stack as appropriate.

I'm assuming you already have code in the back-end that does that in
the way you want, as you said earlier you may want to use variable
number of registers for PCS.

Our intent is to completely bypass all of the passing-structures-in-registers
code in the backend by simply not exposing the backend to any parameters
of aggregate type. The frontend will turn a struct into (say) an i32, a float,
and an i8; if the first two get passed in registers and the last gets passed
on the stack, so be it.

How do you differentiate the @foo's below?

struct A { i32, float };
struct B { float, i32 };

define @foo (A, i32) -> @foo(i32, float, i32);

and

define @foo (i32, B) -> @foo(i32, float, i32);

The only difficulty with this plan is that, when we have multiple results, we
don’t have a choice but to return a struct type. To the extent that backends
try to infer that the function actually needs to be sret, instead of just trying
to find a way to return all the components of the struct type in appropriate
registers, that will be sub-optimal for us. If that’s a pervasive problem, then
we probably just need to introduce a swift calling convention in LLVM.

Oh, yeah, some back-ends will fiddle with struct return. Not all
languages have single-value-return restrictions, but I think that ship
has sailed already for IR.

That's another reason to try and pass all by pointer at the end of the
parameter list, instead of receive as an argument and return.

A direct result is something that’s returned in registers. An indirect
result is something that’s returned by storing it in an implicit out-parameter.

Oh, I see. In that case, any assumption on the variable would have to
be invalidated, maybe use global volatile variables, or special
built-ins, so that no optimisation tries to get away with it. But that
would mess up your optimal code, especially if they have to get passed
in registers.

Oh, sorry, I forgot to talk about that. Yes, the frontend already rearranges
these arguments to the end, which means the optimizer’s default behavior
of silently dropping extra call arguments ends up doing the right thing.

Excellent!

I’m reluctant to say that the convention always requires these arguments.
If we have to do that, we can, but I’d rather not; it would involve generating
a lot of unnecessary IR and would probably create unnecessary
code-generation differences, and I don’t think it would be sufficient for
error results anyway.

This should be ok for internal functions, but maybe not for global /
public interfaces. The ARM ABI has specific behaviour guarantees for
public interfaces (like large alignment) that would be prohibitively
bad for all functions, but ok for public ones.

If all hell breaks loose, you could enforce that for public interfaces only.

We don’t want checking or setting the error result to actually involve memory
access.

And even though most of those access could be optimised away, there's
no guarantee.

Another option would be to have a special built-in to recognise
context/error variables, and plug in a late IR pass to clean up
everything. But I'd only recommend that if we can't find another way
around.

The ability to call a non-throwing function as a throwing function means
we’d have to provide this extra explicit result on every single function with
the Swift convention, because the optimizer is definitely not going to
gracefully handle result-type mismatches; so even a function as simple as
  func foo() -> Int32
would have to be lowered into IR as
  define { i32, i8* } @foo(i8*)

Indeed, very messy.

I'm going on a tangent, here, may be all rubbish, but...

C++ handles exception handling with the exception being thrown
allocated in library code, not the program. If, like C++, Swift can
only handle one exception at a time, why can't the error variable be a
global?

The ARM back-end accepts the -arm-reserve-r9 option, and others seem to
have similar options, so you could use that to force your global
variable to live on the platform register.

That way, all your error handling built-ins deal with that global
variable, which the back-end knows is on registers. You will need a
special DAG node, but I'm assuming you already have/want one. You also
drop any problem with arguments and PCS, at least for the error part.

cheers,
--renato

I have to say that, while I completely agree with you, I also deliberately made an effort in the design of our lowering to avoid as many of those existing complexities as I could. :) So I’m not sure we’d be an ideal vehicle for cleaning up the C lowering model. I’m also wary about turning this project — already somewhat complex — into a massive undertaking, which I’m afraid that changing general CC lowering rules would be. Furthermore, I’m concerned that anything we did here would just turn into an *extra* dimension of complexity for the backend, rather than replacing the current complexity, because it’s not obvious that targets would be able to simply drop their existing ad-hoc interpretation rules. But if you have concrete ideas about this, maybe we can find a way to work them in.

The basic tension in CC lowering is between wanting simple cases to just work without further annotations and the need to cover the full gamut of special-case ABI rules. If we didn’t care about the former, we could just require every call and the function to completely describe the ABI to use — "argument 1 is in R0, argument 2 is in R12, argument 3 is at offset 48 on the stack, and we need 64 bytes on the stack and it has to be 16-byte-aligned at the call”. But dealing with that level of generality at every single call boundary would be a huge pain for backends, and we’d still need special code for things like varargs. So instead we’ve evolved all these informal protocols between frontends and backends. The informal protocols are… annoying, but I think the bigger problem is that they’re undocumented, and it’s unclear to everybody involved what’s supposed to happen when you go outside them. So the first step, I think, would just be to document as many of those informal, target-specific protocols as we can, and then from there maybe we can find commonalities that can be usefully generalized.

John.


There are four general points of deviation from the intermediary
convention:

  - We sometimes want to return more values in registers than the
convention normally does, and we want to be able to use both integer and
floating-point registers. For example, we want to return a value of struct
A, above, purely in registers. For the most part, I don’t think this is a
problem to layer on to an existing IR convention: C frontends will
generally use explicit sret arguments when the convention requires them,
and so the Swift lowering will produce result types that don’t have legal
interpretations as direct results under the C convention. But we can use a
different IR convention if it’s necessary to disambiguate Swift’s desired
treatment from the target's normal attempts to retroactively match the C
convention.

You're suggesting that backends shouldn't try to turn returns of {i32,
float, i32} into sret automatically if the C ABI would require that struct
to be returned indirectly. I know there are many users of LLVM out there
that wish that LLVM would just follow the C ABI for them in "simple" cases
like this, even though in general it's a lost cause. I think if you hide
this new behavior under your own swiftcc then we can keep those people
happy, ish.

  - We sometimes have both direct results and indirect results. It would

be nice to take advantage of the sret convention even in the presence of
direct results on targets that do use a different (profitable) ABI
treatment for it. I don’t know how well-supported this is in LLVM.

LLVM insists that sret functions be void because the C convention requires
the sret pointer to be returned in the normal return register. X86 Sys V
requires this, though LLVM does not leverage it, and was non-conforming for
most of its life. I don't see why Swift would need to use the 'sret'
attribute for indirect results, though, if it doesn't need to conform to
that part of the x86 convention. Am I missing something profitable about
reusing our sret support?

  - We want a special “context” treatment for a certain argument. A
pointer-sized value is passed in an integer register; the same value should
be present in that register after the call. In some cases, the caller may
pass a context argument to a function that doesn’t expect one, and this
should not trigger undefined behavior. Both of these rules suggest that
the context argument be passed in a register which is normally callee-save.

As discussed later, these arguments would come last. I thought it was
already legal to call a C function with too many arguments without invoking
UB, so I think we have to keep this working in LLVM anyway.

  - The “error” treatment requires some way to (1) pass and receive the
value in the caller and (2) receive and change the value in the callee.
The best way we could think of to represent this was to pretend that the
argument is actually passed indirectly; the value is “passed” by storing to
the pointer and “received” by loading from it. To simplify backend
lowering, we require the argument to be a special kind of swifterror alloca
that can only be loaded, stored, and passed as a swifterror argument; in
the callee, swifterror arguments have similar restrictions. This ends up
being fairly invasive in the backend, unfortunately.
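A rough C analogue may help picture that model (hypothetical names; the real representation is at the LLVM IR level, with the slot promoted to a register by the swifterror machinery): the error is "passed" by storing through a pointer and "received" by loading from it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical error type; Swift's runtime uses its own representation. */
typedef struct Error { int code; } Error;

/* Callee: "receives" the current error by loading through the slot and
 * "changes" it by storing a new value. */
static int divide(int a, int b, Error **error_slot) {
    static Error div_by_zero = { 1 };
    if (b == 0) {
        *error_slot = &div_by_zero;   /* throw: store through the slot */
        return 0;
    }
    return a / b;
}

/* Caller: materializes the slot (the swifterror "alloca"), calls, and
 * checks the slot afterwards. */
static int checked_divide(int a, int b, int *out) {
    Error *err = NULL;
    int r = divide(a, b, &err);
    if (err != NULL)
        return err->code;             /* propagate the error code */
    *out = r;
    return 0;
}
```

In the actual convention the slot is meant never to touch memory: both sides pin it to a register, which is exactly why the backend lowering is invasive.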

This seems unfortunate. I guess you've already rejected returning an FCA. I
wonder if we should ever go back to the world of "only calls can produce
multiple values" as a special case, since that's what really happens at the
MI level. I wonder if operand bundles or tokens could help solve this
problem.

The frontend will not tell the backend explicitly which parameters will be
in registers; it will just pass a bunch of independent scalar values, and
the backend will assign them to registers or the stack as appropriate.

I'm assuming you already have code in the back-end that does that in
the way you want, as you said earlier you may want to use variable
number of registers for PCS.

Our intent is to completely bypass all of the passing-structures-in-registers
code in the backend by simply not exposing the backend to any parameters
of aggregate type. The frontend will turn a struct into (say) an i32, a float,
and an i8; if the first two get passed in registers and the last gets passed
on the stack, so be it.

How do you differentiate the @foo's below?

struct A { i32, float };
struct B { float, i32 };

define @foo (A, i32) -> @foo(i32, float, i32);

and

define @foo (i32, B) -> @foo(i32, float, i32);

We don’t need to. We don't use the intermediary convention’s rules for aggregates.
The Swift rule for aggregate arguments is literally “if it’s too complex according to
<foo>, pass it indirectly; otherwise, expand it into a sequence of scalar values and
pass them separately”. If that means it’s partially passed in registers and partially
on the stack, that’s okay; we might need to re-assemble it in the callee, but the
first part of the rule limits how expensive that can ever get.
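As a concrete C sketch of that rule (hypothetical helper names; the real expansion happens in IRGen), a sufficiently simple struct is exploded into its scalar fields at the call boundary and re-assembled in the callee only where it is actually needed:

```c
#include <assert.h>

struct A { int i0; float f; int i1; };   /* expands to [i32, float, i32] */

/* Lowered callee for (A, Int64) -> Bool: takes the exploded scalars plus
 * the i64, roughly like bool(*)(int32_t, float, int32_t, int64_t). */
static int lowered_callee(int a_i0, float a_f, int a_i1, long long x) {
    /* Re-assemble the aggregate only if the body actually needs it. */
    struct A a = { a_i0, a_f, a_i1 };
    return (a.i0 + a.i1 + (int)x) > 0;
}

/* Caller-side expansion: pass the fields as independent scalar values. */
static int call_with_struct(struct A a, long long x) {
    return lowered_callee(a.i0, a.f, a.i1, x);
}
```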

The only difficulty with this plan is that, when we have multiple results, we
don’t have a choice but to return a struct type. To the extent that backends
try to infer that the function actually needs to be sret, instead of just trying
to find a way to return all the components of the struct type in appropriate
registers, that will be sub-optimal for us. If that’s a pervasive problem, then
we probably just need to introduce a swift calling convention in LLVM.

Oh, yeah, some back-ends will fiddle with struct return. Not all
languages have single-value-return restrictions, but I think that ship
has sailed already for IR.

That's another reason to try to pass everything by pointer at the end of the
parameter list, instead of receiving it as an argument and returning it.

That’s pretty sub-optimal compared to just returning in registers. Also, most
backends do have the ability to return small structs in multiple registers already.

A direct result is something that’s returned in registers. An indirect
result is something that’s returned by storing it in an implicit out-parameter.

Oh, I see. In that case, any assumption on the variable would have to
be invalidated; maybe use global volatile variables, or special
built-ins, so that no optimisation tries to get away with it. But that
would mess up your optimal code, especially if they have to be passed
in registers.

I don’t understand what you mean here. The out-parameter is still explicit in
LLVM IR. Nothing about this is novel, except that C frontends generally won’t
combine indirect results with direct results. Worst case, if pervasive LLVM
assumptions prevent us from combining the sret attribute with a direct result,
we just won’t use the sret attribute.

Oh, sorry, I forgot to talk about that. Yes, the frontend already rearranges
these arguments to the end, which means the optimizer’s default behavior
of silently dropping extra call arguments ends up doing the right thing.

Excellent!

I’m reluctant to say that the convention always requires these arguments.
If we have to do that, we can, but I’d rather not; it would involve generating
a lot of unnecessary IR and would probably create unnecessary
code-generation differences, and I don’t think it would be sufficient for
error results anyway.

This should be ok for internal functions, but maybe not for global /
public interfaces. The ARM ABI has specific behaviour guarantees for
public interfaces (like large alignment) that would be prohibitively
bad for all functions, but ok for public ones.

If all hell breaks loose, you could enforce that for public interfaces only.

We don’t want checking or setting the error result to actually involve memory
access.

And even though most of those accesses could be optimised away, there's
no guarantee.

Right. The backend isn’t great about removing memory operations that survive to it.

Another option would be to have a special built-in to recognise
context/error variables, and plug in a late IR pass to clean up
everything. But I'd only recommend that if we can't find another way
around.

The ability to call a non-throwing function as a throwing function means
we’d have to provide this extra explicit result on every single function with
the Swift convention, because the optimizer is definitely not going to
gracefully handle result-type mismatches; so even a function as simple as
func foo() -> Int32
would have to be lowered into IR as
define { i32, i8* } @foo(i8*)
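In C terms, the lowered shape would look roughly like the following (hypothetical sketch; the actual IR returns a first-class aggregate, not a C struct):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C analogue of `define { i32, i8* } @foo(i8*)`: every
 * Swift-convention function would carry an error slot in its result and a
 * context argument, even if its body uses neither. */
typedef struct FooResult {
    int32_t value;
    void   *error;    /* NULL means no error was thrown */
} FooResult;

/* func foo() -> Int32, with the implicit context and error result. */
static FooResult foo(void *context) {
    (void)context;    /* the non-throwing body ignores both extras */
    return (FooResult){ 42, NULL };
}
```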

Indeed, very messy.

I'm going on a tangent, here, may be all rubbish, but...

C++ implements exception handling with the thrown exception
allocated in library code, not the program. If, like C++, Swift can
only handle one exception at a time, why can't the error variable be a
global?

The ARM back-end accepts the -arm-reserve-r9 option, and others seem to
have similar options, so you could use that to force your global
variable to live on the platform register.

That way, all your error handling built-ins deal with that global
variable, which the back-end knows is on registers. You will need a
special DAG node, but I'm assuming you already have/want one. You also
drop any problem with arguments and PCS, at least for the error part.

Swift does not run in an independent environment; it has to interact with
existing C code. That existing code does not reserve any registers globally
for this use. Even if that were feasible, we don’t actually want to steal a
register globally from all the C code on the system that probably never
interacts with Swift.

John.

Yes, that may be best.

For most platforms, it’s not profitable. On some platforms, there’s a register reserved for the sret argument; it would be nice to take advantage of that for several reasons, including just being nicer to existing tools (debuggers, etc.) in common cases.

Formally, no, it’s UB to call (non-variadic, of course) C functions with extra arguments. But you’re right, it does generally work at runtime, and LLVM doesn’t get in the way.

See one of my recent responses to Renato. It’s really hard to make that work cleanly with the ability to call functions that lack the result.

I agree that the error-result stuff is the most intrinsically awkward part of our current patch, and I would love to find alternatives.

John.

I consider it completely reasonable for Clang to support this calling
convention and the associated attributes, especially given the minor
impact you describe above.

Hi Tian,

Thanks for the writeup, it sounds very interesting! Please find some questions/comments inline:

Proposal for function vectorization and loop vectorization with function calls

Intel Corporation (3/2/2016)

This is a proposal for initial work towards a Clang and LLVM implementation of
vectorizing a function annotated with OpenMP 4.5's "#pragma omp declare simd"
(a SIMD-enabled function) and its associated clauses, based on the VectorABI
[2]. On the caller side, we propose to improve LLVM's LoopVectorizer so that
code calling a SIMD-enabled function can be vectorized. On the callee side, we
propose to add Clang FE support for the "#pragma omp declare simd" syntax and
a new pass to transform the SIMD-enabled function body into a SIMD loop. This
newly created loop can then be fed to LLVM's LoopVectorizer (or its future
enhancement) for vectorization. This work leverages LLVM's existing
LoopVectorizer.

Problem Statement

Currently, if a loop calls a user-defined function or a 3rd party library
function, the loop can't be vectorized unless the function is inlined. In the
example below the LoopVectorizer fails to vectorize the k loop due to its
function call to "dowork" because "dowork" is an external function. Note that
inlining the "dowork" function may result in vectorization for some of the
cases, but that is not a generally applicable solution. Also, there may be
reasons why compiler may not (or can't) inline the "dowork" function call.
Therefore, there is value in being able to vectorize the loop with a call to
"dowork" function in it.

#include <stdio.h>
extern float dowork(float *a, int k);

float a[4096];
int main()
{ int k;
#pragma clang loop vectorize(enable)
  for (k = 0; k < 4096; k++) {
    a[k] = k * 0.5;
    a[k] = dowork(a, k);
  }
  printf("passed %f\n", a[1024]);
}

I think it should be possible to vectorize such a loop even without the OpenMP clauses. We just need to gather a vector value from several scalar calls, and the vectorizer already knows how to do that; we just need not to bail out early. Dealing with calls is tricky, but in this case we have the pragma, so we can assume it is fine. What do you think, would it make sense to start from here?

sh-4.1$ clang -c -O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize
                    -Rpass-analysis=loop-vectorize loopvec.c
loopvec.c:15:12: remark: loop not vectorized: call instruction cannot be
     vectorized [-Rpass-analysis]
   a[k] = dowork(a, k);
          ^
loopvec.c:13:3: remark: loop not vectorized: use -Rpass-analysis=loop-vectorize
     for more info (Force=true) [-Rpass-missed=loop-vectorize]
for (k = 0; k < 4096; k++) {
^
loopvec.c:13:3: warning: loop not vectorized: failed explicitly specified
               loop vectorization [-Wpass-failed]
1 warning generated.

New Vectorization Functionality

New functionalities and enhancements are proposed to address the issues
stated above, which include: a) vectorizing a function annotated by the
programmer using OpenMP* SIMD extensions; b) enhancing LLVM's LoopVectorizer
to vectorize a loop containing a call to a SIMD-enabled function.

For example, when writing:

#include <stdio.h>

#pragma omp declare simd uniform(a) linear(k)
extern float dowork(float *a, int k);

float a[4096];
int main()
{ int k;
#pragma clang loop vectorize(enable)
  for (k = 0; k < 4096; k++) {
    a[k] = k * 0.5;
    a[k] = dowork(a, k);
  }
  printf("passed %f\n", a[1024]);
}

the programmer asserts that
a) there will be a vector version of "dowork" available for the compiler to
    use (link with, with appropriate signature, explained below) when
    vectorizing the k loop; and that
b) no loop-carried backward dependencies are introduced by the "dowork"
    call that prevent the vectorization of the k loop.

The expected vector loop (shown as pseudo code, ignoring leftover iterations)
resulting from LLVM's LoopVectorizer is

... ...
vectorized_for (k = 0; k < 4096; k += VL) {
   a[k:VL] = {k, k+1, ..., k+VL-1} * 0.5;
   a[k:VL] = _ZGVbN4ul_dowork(a, k);
}
... ...

In this example "_ZGVbN4ul_dowork" is a special name mangling where:
_ZGV is a prefix based on the C/C++ name-mangling rule suggested by the GCC
    community,
'b' indicates "xmm" (assume we vectorize here to 128-bit xmm vector registers),
'4' is the VL (assume we vectorize here for length 4),
'N' indicates that the function is vectorized without a mask ('M' indicates
    that the function is vectorized with a mask),
'u' indicates that the first parameter has the "uniform" property,
'l' indicates that the second parameter has the "linear" property.

More details (including the name-mangling scheme) can be found in
reference [2].
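To make the scheme concrete, a hypothetical helper can assemble such a prefix from the fields described above (a sketch only; consult the VectorABI document [2] for the authoritative grammar):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build a Vector ABI mangling prefix: "_ZGV" + isa + mask + VL + the
 * per-parameter codes + '_'. `param_codes` is e.g. "ul" for a uniform
 * first parameter and a linear second parameter. */
static void vector_abi_prefix(char *out, size_t n, char isa, int masked,
                              int vl, const char *param_codes) {
    snprintf(out, n, "_ZGV%c%c%d%s_", isa, masked ? 'M' : 'N', vl,
             param_codes);
}
```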

References

1. OpenMP SIMD language extensions:
http://www.openmp.org/mp-documents/openmp-4.5.pdf

2. VectorABI documentation:
https://www.cilkplus.org/sites/default/files/open_specifications/Intel-ABI-Vector-Function-2012-v0.9.5.pdf
https://sourceware.org/glibc/wiki/libmvec?action=AttachFile&do=view&target=VectorABI.txt

[[Note: VectorABI was reviewed on the X86-64 System V Application Binary
       Interface mailing list. The discussion is recorded at
       https://groups.google.com/forum/#!topic/x86-64-abi/LmppCfN1rZ4 ]]

3. The first paper on SIMD extensions and implementations:
"Compiling C/C++ SIMD Extensions for Function and Loop Vectorization on
Multicore-SIMD Processors" by Xinmin Tian, Hideki Saito, Milind Girkar,
Serguei Preis, Sergey Kozhukhov, et al., IPDPS Workshops 2012, pages 2349-2358.
[[Note: the first implementation and the paper predate the finalization of
       VectorABI with the GCC community and Red Hat. The latest VectorABI
       version for OpenMP 4.5 is ready to be published.]]

Proposed Implementation

1. The Clang FE parses "#pragma omp declare simd [clauses]" and generates a
  mangled name including these prefixes as the vector signature. The
  mangled-name prefixes are recorded as function attributes in an LLVM
  function attribute group. Note that several mangled names may be associated
  with the same function, corresponding to several desired vectorized
  versions. The Clang FE generates function attributes for all vector
  variants expected to be generated by the back-end. E.g.,

  #pragma omp declare simd uniform(a) linear(k)
  float dowork(float *a, int k)
  {
     a[k] = sinf(a[k]) + 9.8f;
     return a[k];
  }

  define __stdcall f32 @_dowork(f32* %a, i32 %k) #0
  ... ...
  attributes #0 = { nounwind uwtable "_ZGVbM4ul_" "_ZGVbN4ul_" ...}

2. A new vector function generation pass is introduced to generate vector
  variants of the original scalar function based on VectorABI (see [2, 3]).
  For example, one vector variant is generated for "_ZGVbN4ul_" attribute
  as follows (pseudo code):

  define __stdcall <4 x f32> @_ZGVbN4ul_dowork(f32* %a, i32 %k) #0
  {
    #pragma clang loop vectorize(enable)
    for (int %t = %k; %t < %k + 4; %t++) {
      %a[t] = sinf(%a[t]) + 9.8f;
    }
    vec_load xmm0, %a[k:VL]
    return xmm0;
  }

Am I getting it right that you're going to emit declarations for all possible vector types, and then implement only the used ones? If not, how does the frontend know which vector width to use? If the dowork function and its caller are in different modules, how does the compiler communicate which vector widths are needed?

  The body of the function is wrapped inside a loop having VL iterations,
  which correspond to the vector lanes.

  The LLVM LoopVectorizer will vectorize the generated %t loop, expected
  to produce the following vectorized code eliminating the loop (pseudo code):

  define __stdcall <4 x f32> @_ZGVbN4ul_dowork(f32* %a, i32 %k) #0
  {
    vec_load xmm1, %a[k: VL]
    xmm2 = call __svml_sinf(xmm1)
    xmm0 = vec_add xmm2, [9.8f, 9.8f, 9.8f, 9.8f]
    store %a[k:VL], xmm0
    return xmm0;
  }

  [[Note: Vectorizer support for the Short Vector Math Library (SVML)
          functions will be a separate proposal. ]]
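The wrapping in step 2 can be modeled in plain C (a hypothetical sketch; a stand-in body replaces sinf to keep it dependency-free): the scalar body runs once per lane inside a VL-iteration loop, which the LoopVectorizer is then expected to flatten into straight-line vector code.

```c
#include <assert.h>

#define VL 4   /* vector length for the _ZGVbN4ul_ variant */

/* Stand-in for the scalar body a[t] = sinf(a[t]) + 9.8f. */
static void dowork_scalar_body(float *a, int t) {
    a[t] = a[t] * 0.5f + 9.8f;
}

/* Vector variant modeled as a lane loop; `out` stands in for the
 * <4 x f32> return value. */
static void dowork_v4(float *a, int k, float out[VL]) {
    for (int lane = 0; lane < VL; ++lane) {
        dowork_scalar_body(a, k + lane);
        out[lane] = a[k + lane];
    }
}
```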

The Loop Vectorizer already supports math functions and math-function libraries. You might just need to expand this support to SVML (i.e., add tables of correspondence between scalar and vector function variants).

3. The LLVM LoopVectorizer is enhanced to
  a) identify loops with calls that have been annotated with
     "#pragma omp declare simd" by checking function attribute groups;
  b) analyze each call instruction and its parameters in the loop, to
     determine if each parameter has the following properties:
       * uniform
       * linear + stride
       * vector
       * aligned
       * called inside a conditional branch or not
         ... ...
     Based on these properties, the signature of the vectorized call is
     generated; and
  c) perform signature matching to obtain the suitable vector variant
     among the signatures available for the called function. If no such
     signature is found, the call cannot be vectorized.

  Note that a similar enhancement can and should be made also to LLVM's
  SLP vectorizer.
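Step (c) amounts to a string lookup among the recorded attributes; a minimal hypothetical sketch:

```c
#include <assert.h>
#include <string.h>

/* Return 1 if the wanted variant signature appears among the mangled-name
 * attributes recorded on the scalar function, 0 otherwise. */
static int has_vector_variant(const char *const attrs[], int n,
                              const char *wanted) {
    for (int i = 0; i < n; ++i)
        if (strcmp(attrs[i], wanted) == 0)
            return 1;
    return 0;
}
```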

  For example:

  #pragma omp declare simd uniform(a) linear(k)
  extern float dowork(float *a, int k);

  ... ...
  #pragma clang loop vectorize(enable)
  for (k = 0; k < 4096; k++) {
    a[k] = k * 0.5;
    a[k] = dowork(a, k);
  }
  ... ...

  Step a: "dowork" function is marked as SIMD-enabled function
          attributes #0 = { nounwind uwtable "_ZGVbM4ul_" "_ZGVbN4ul_" ...}

  Step b: 1) 'a' is uniform, as it is the base address of array 'a';
          2) 'k' is linear, as 'k' is the induction variable with stride=1;
          3) SIMD "dowork" is called unconditionally in the candidate k loop;
          4) it is compiled for SSE4.1 with the vector length VL=4.
          Based on these properties, the signature is "_ZGVbN4ul_".

  [[Note: a conditional call in the loop needs masking support; the
          implementation details can be found in references [1][2][3]. ]]

  Step c: Check if the signature "_ZGVbN4ul_" exists in function attribute #0;
          if yes the suitable vectorized version is found and will be linked
          with.

  The below loop is expected to be produced by the LoopVectorizer:
  ... ...
  vectorized_for (k = 0; k < 4096; k += 4) {
    a[k:4] = {k, k+1, k+2, k+3} * 0.5;
    a[k:4] = _ZGVbN4ul_dowork(a, k);
  }
  ... ...

[[Note: Vectorizer support for the Short Vector Math Library (SVML) functions
       will be a separate proposal. ]]

GCC and ICC Compatibility

With this proposal the callee function and the loop containing a call to it
can each be compiled and vectorized by a different compiler, including
Clang+LLVM with its LoopVectorizer as outlined above, GCC and ICC. The
vectorized loop will then be linked with the vectorized callee function.
Of course, each of these compilers can also be used to compile both the loop
and the callee function.

Current Implementation Status and Plan

1. The Clang FE work is done by the Intel Clang FE team according to #1. Note:
  the Clang FE parsing patch is implemented and under community review
  (http://reviews.llvm.org/D10599). In general, the review feedback from the
  Clang community is very positive.

2. A new pass for function vectorization is implemented to support #2 and is
  being prepared for LLVM community review.

3. Work is in progress to teach LLVM's LoopVectorizer to vectorize a loop
  with user-defined function calls according to #3.

Call for Action

1. Please review this proposal and provide constructive feedback on its
  direction and key ideas.

2. Feel free to ask any technical questions related to this proposal and
  to read the associated references.

3. Help is also highly welcome and appreciated in the development and
  upstreaming process.

Again, thanks for writing it up. I think this would be a valuable improvement of the vectorizer and I'm looking forward to further discussion and/or patches!

Best regards,
Michael

Hi Michael. Thanks for your feedback and questions/comments. See below.

I think it should be possible to vectorize such a loop even without the OpenMP clauses. We just need to gather a vector value from several scalar calls, and the vectorizer already knows how to do that; we just need not to bail out early. Dealing with calls is tricky, but in this case we have the pragma, so we can assume it is fine. What do you think, would it make sense to start from here?

Yes, we can vectorize this loop by calling the scalar code VL times as an emulation of SIMD execution. We need this functionality anyway as a fallback for vectorizing the loop when no SIMD version of dowork exists. E.g.:

#pragma clang loop vectorize(enable)
for (k = 0; k < 4096; k++) {
    a[k] = k * 0.5;
    a[k] = dowork(a, k);
}

==>

Vectorized_for (k = 0; k < 4096; k += VL) { // assume VL = 4; no vector version of dowork exists
    a[k:VL] = {k, k+1, k+2, k+3} * 0.5; // broadcast 0.5 to a SIMD register, vector-multiply with {k, k+1, k+2, k+3}, vector-store to a[k:VL]
    t0 = dowork(a, k);   // emulate SIMD execution with scalar calls
    t1 = dowork(a, k+1);
    t2 = dowork(a, k+2);
    t3 = dowork(a, k+3);
    a[k:VL] = {t0, t1, t2, t3}; // SIMD store
}
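The fallback above can be written out in C (with a hypothetical stand-in body for dowork; the real callee is an external function):

```c
#include <assert.h>

#define VL 4

/* Stand-in scalar callee; the real dowork is external. */
static float dowork(float *a, int k) {
    return a[k] + 1.0f;
}

/* Scalarized emulation: VL scalar calls, results gathered and stored. */
static void loop_scalarized(float *a, int n) {
    for (int k = 0; k + VL <= n; k += VL) {
        for (int lane = 0; lane < VL; ++lane)   /* a[k:VL] = {k,...} * 0.5 */
            a[k + lane] = (float)(k + lane) * 0.5f;
        float t[VL];
        for (int lane = 0; lane < VL; ++lane)   /* emulate SIMD with calls */
            t[lane] = dowork(a, k + lane);
        for (int lane = 0; lane < VL; ++lane)   /* gather + SIMD store */
            a[k + lane] = t[lane];
    }
}
```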

Am I getting it right that you're going to emit declarations for all possible vector types, and then implement only the used ones? If not, how does the frontend know which vector width to use? If the dowork function and its caller are in different modules, how does the compiler communicate which vector widths are needed?

Yes, you are right in general; that is defined by the VectorABI used by GCC and ICC. E.g., GCC generates 7 versions by default for x86: scalar, SSE (masked, unmasked), AVX (masked, unmasked), and AVX2 (masked, unmasked). There are several options we can use to reduce the number of versions we need to generate, w.r.t. compile time and code size. We can provide detailed info.

The Loop Vectorizer already supports math functions and math-function libraries. You might just need to expand this support to SVML (i.e., add tables of correspondence between scalar and vector function variants).

Correct, that is the Step 3 in the doc we are working on.

Again, thanks for writing it up. I think this would be a valuable improvement of the vectorizer and I'm looking forward to further discussion and/or patches!

Thanks for the positive feedback! We are also looking forward to further discussion and sending patches with help from you and other LLVM community members.

Thanks,
Xinmin