RFC: Implementing the Swift calling convention in LLVM and Clang

Hi Tian,

Hi Michael. Thanks for your feedback and questions/comments. See below.

I think it should be possible to vectorize such a loop even without OpenMP clauses. We just need to gather a vector value from several scalar calls, and the vectorizer already knows how to do that; we just need to not bail out early. Dealing with calls is tricky, but in this case we have the pragma, so we can assume it should be fine. What do you think, does it make sense to start from here?

Yes, we can vectorize this loop by calling the scalar code VL times as an emulation of SIMD execution if a SIMD version does not exist to start with. See the example below; we need this functionality anyway as a fallback for vectorizing this loop when no SIMD version of dowork exists. E.g.

#pragma clang loop vectorize(enable)
for (k = 0; k < 4096; k++) {
   a[k] = k * 0.5;
   a[k] = dowork(a, k);
}

==>

vectorized_for (k = 0; k < 4096; k += VL) { // assume VL = 4; no vector version of dowork exists.
   a[k:VL] = {k, k+1, k+2, k+3} * 0.5; // broadcast 0.5 to a SIMD register, vector multiply with {k, k+1, k+2, k+3}, vector store to a[k:VL]
   t0 = dowork(a, k);   // emulate SIMD execution with scalar calls
   t1 = dowork(a, k+1);
   t2 = dowork(a, k+2);
   t3 = dowork(a, k+3);
   a[k:VL] = {t0, t1, t2, t3}; // SIMD store
}

Yes, that’s what I meant.

Am I getting it right that you're going to emit declarations for all possible vector types, and then implement only the used ones? If not, how does the frontend know which vector width to use? If the dowork function and its caller are in different modules, how does the compiler communicate which vector widths are needed?

Yes, you are right in general; that is defined by the vector function ABI used by GCC and ICC. E.g., GCC generates 7 versions by default for x86: scalar, SSE (masked, unmasked), AVX (masked, unmasked), AVX2 (masked, unmasked).

How does it play with other architectures? Should it be described in more general terms, like vector/element width? I realize that you might be mostly concerned about x86, but this feature looks pretty generic, so I think it should be kept target-independent.

There are several optimization options to reduce the number of versions we need to generate, with respect to compile time and code size. We can provide more detailed info.

I’ll be interested in looking into this, as I find this part the most challenging in this changeset (other parts look to me like clear improvements of what we have now).

Thanks,
Michael

I’ll be interested in looking into this, as I find this part the most challenging in this changeset (other parts look to me like clear improvements of what we have now).

Right, this is the challenging part. We will work closely with you to get a good implementation. Thanks.

How does it play with other architectures? Should it be described in more general terms, like vector/element width? I realize that you might be mostly concerned about x86, but this feature looks pretty generic, so I think it should be kept target-independent.

Yes, our experience is mainly with x86; we will work with you on the generic support.

Xinmin

Great, thanks.

Since the response so far has been very positive on the idea, I think it’s probably time to start sending out patches for review. Manman will be leading that on the LLVM side, since she did most of the work there. On the Clang side, I’ll land what I have and then progressively work on it in trunk.

John.

We don’t need to. We don't use the intermediary convention’s rules for aggregates.
The Swift rule for aggregate arguments is literally “if it’s too complex according to
<foo>, pass it indirectly; otherwise, expand it into a sequence of scalar values and
pass them separately”. If that means it’s partially passed in registers and partially
on the stack, that’s okay; we might need to re-assemble it in the callee, but the
first part of the rule limits how expensive that can ever get.

Right. My worry is, then, how this plays out with ARM's AAPCS.

As you said below, you *have* to interoperate with C code, so you will
*have* to interoperate with AAPCS on ARM.

AAPCS's rules on aggregates are not simple, but they also allow part
of it in registers, part on the stack. I'm guessing you won't have the
same exact rules, but similar ones, which may prove harder to
implement than the former.

That’s pretty sub-optimal compared to just returning in registers. Also, most
backends do have the ability to return small structs in multiple registers already.

Yes, but not all of them can return more than two, which may constrain
you if you have both error and context values in a function call, in
addition to the return value.

I don’t understand what you mean here. The out-parameter is still explicit in
LLVM IR. Nothing about this is novel, except that C frontends generally won’t
combine indirect results with direct results.

Sorry, I had understood this, but your reply (for some reason) made me
think it was a hidden contract, not an explicit argument. Ignore me,
then. :slight_smile:

Right. The backend isn’t great about removing memory operations that survive to it.

Precisely!

Swift does not run in an independent environment; it has to interact with
existing C code. That existing code does not reserve any registers globally
for this use. Even if that were feasible, we don’t actually want to steal a
register globally from all the C code on the system that probably never
interacts with Swift.

So, as Reid said, usage of built-ins might help you here.

Relying on LLVM's ability to not mess up your fiddling with variable
arguments seems unstable. Adding specific attributes to functions or
arguments seems too invasive. So a solution would be to add a built-in
in the beginning of the function to mark those arguments as special.

Instead of alloca %a + load -> store + return, you could have
llvm.swift.error.load(%a) -> llvm.swift.error.return(%a), which
survives most middle-end passes intact, and a late pass then changes
the function to return a composite type, either a structure or a
larger type, that will be lowered into more than one register.

This makes sure error propagation won't be optimised away, and that
you can receive the error in any register (or even on the stack), but will
always return it in the same registers (e.g. on ARM, R1 for i32, R2+R3
for i64, etc.).
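A rough sketch of what that could look like in IR (the intrinsics below are hypothetical, matching the names suggested above; they do not exist in LLVM):

```llvm
; hypothetical intrinsics, for illustration only
declare i8* @llvm.swift.error.load(i8**)
declare void @llvm.swift.error.return(i8*)

define i32 @callee(i8** %errslot) {
entry:
  %err = call i8* @llvm.swift.error.load(i8** %errslot)
  ; ... body may replace %err with a new error value ...
  call void @llvm.swift.error.return(i8* %err)
  ret i32 0
}
; a late pass would then rewrite @callee to return a composite such as
; { i32, i8* }, lowered into more than one register
```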

I understand this might be far off what you guys did, and I'm not
trying to re-write history, just brainstorming a bit.

IMO, both David and Richard are right. This is likely not a huge deal
for the CC code, but we'd be silly not to take this opportunity to
make it less fragile overall.

cheers,
--renato

Hi David [Majnemer], Richard [Smith],

Front-end wise, the biggest change in this proposal is introduction of
new mangling for vector functions.

May I ask you to look at the mangling part (sections 1 and 2 in the
"Proposed Implementation" chapter) and review it?

(Obviously, others who are concerned with how mangling is done in
Clang are welcome to chime in as well!)

Yours,
Andrey

We don’t need to. We don't use the intermediary convention’s rules for aggregates.
The Swift rule for aggregate arguments is literally “if it’s too complex according to
<foo>, pass it indirectly; otherwise, expand it into a sequence of scalar values and
pass them separately”. If that means it’s partially passed in registers and partially
on the stack, that’s okay; we might need to re-assemble it in the callee, but the
first part of the rule limits how expensive that can ever get.

Right. My worry is, then, how this plays out with ARM's AAPCS.

As you said below, you *have* to interoperate with C code, so you will
*have* to interoperate with AAPCS on ARM.

I’m not sure of your point here. We don’t use the Swift CC to call C functions.
It does not matter, at all, whether the frontend lowering of an aggregate under
the Swift CC resembles the frontend lowering of the same aggregate under AAPCS.

I brought up interoperation with C code as a counterpoint to the idea of globally
reserving a register.

AAPCS's rules on aggregates are not simple, but they also allow part
of it in registers, part on the stack. I'm guessing you won't have the
same exact rules, but similar ones, which may prove harder to
implement than the former.

That’s pretty sub-optimal compared to just returning in registers. Also, most
backends do have the ability to return small structs in multiple registers already.

Yes, but not all of them can return more than two, which may constrain
you if you have both error and context values in a function call, in
addition to the return value.

We do actually use a different swiftcc calling convention in IR. I don’t see any
serious interop problems here. The “intermediary” convention is just the original
basis of swiftcc on the target.

I don’t understand what you mean here. The out-parameter is still explicit in
LLVM IR. Nothing about this is novel, except that C frontends generally won’t
combine indirect results with direct results.

Sorry, I had understood this, but your reply (for some reason) made me
think it was a hidden contract, not an explicit argument. Ignore me,
then. :slight_smile:

Right. The backend isn’t great about removing memory operations that survive to it.

Precisely!

Swift does not run in an independent environment; it has to interact with
existing C code. That existing code does not reserve any registers globally
for this use. Even if that were feasible, we don’t actually want to steal a
register globally from all the C code on the system that probably never
interacts with Swift.

So, as Reid said, usage of built-ins might help you here.

Relying on LLVM's ability to not mess up your fiddling with variable
arguments seems unstable. Adding specific attributes to functions or
arguments seems too invasive.

I’m not sure why you say that. We already do have parameter ABI override
attributes with target-specific behavior in LLVM IR: sret and inreg.

I can understand being uneasy with adding new swiftcc-specific attributes, though.
It would be reasonable to make this more general. Attributes can be parameterized;
maybe we could just say something like abi(“context”), and leave it to the CC to
interpret that?

Having that sort of ability might make some special cases easier for C lowering,
too, come to think of it. Imagine an x86 ABI that — based on type information
otherwise erased by the conversion to LLVM IR — sometimes returns a float in
an SSE register and sometimes on the x86 stack. It would be very awkward to
express that today, but some sort of abi(“x87”) attribute would make it easy.

So a solution would be to add a built-in
in the beginning of the function to mark those arguments as special.

Instead of alloca %a + load -> store + return, you could have
llvm.swift.error.load(%a) -> llvm.swift.error.return(%a), which
survives most middle-end passes intact, and a late pass then changes
the function to return a composite type, either a structure or a
larger type, that will be lowered into more than one register.

This makes sure error propagation won't be optimised away, and that
you can receive the error in any register (or even on the stack), but will
always return it in the same registers (e.g. on ARM, R1 for i32, R2+R3
for i64, etc.).

I understand this might be far off what you guys did, and I'm not
trying to re-write history, just brainstorming a bit.

IMO, both David and Richard are right. This is likely not a huge deal
for the CC code, but we'd be silly not to take this opportunity to
make it less fragile overall.

The lowering required for this would be very similar to the lowering that Manman’s
patch does for swift-error: the backend basically does special value
propagation. The main difference is that it’s completely opaque to the middle-end
by default instead of looking like a load or store that ordinary memory optimizations
can handle. That seems like a loss, since those optimizations would actually do
the right thing.

John.

I’m not sure of your point here. We don’t use the Swift CC to call C functions.
It does not matter, at all, whether the frontend lowering of an aggregate under
the Swift CC resembles the frontend lowering of the same aggregate under AAPCS.

Right, ignore me, then.

I’m not sure why you say that. We already do have parameter ABI override
attributes with target-specific behavior in LLVM IR: sret and inreg.

Their meaning is somewhat confused and hard-coded in the back-end. I
once wanted to use inreg for lowering register-based divmod in
SelectionDAG, but ended up implementing custom lowering in the ARM
back-end because inreg wasn't used correctly. It's possible that now
it's better, but you'll always be at the mercy of what the back-end
does with the attributes, especially in custom lowering.

Also, for different back-ends, "inreg" means different things. If the
PCS allows multiple argument/return registers, then sret inreg is
possible for a structure with up to X/Y words, where X and Y are
different for different targets and could very well be zero.

Example, in a machine with *two* PCS registers:

i64 @foo (i32)

returning in registers becomes: sret { i32, i32 } @foo (inreg i32)

then you add your error: sret { i32, i32, i8* } @foo (inreg i32, inreg i8*)

You can fit the two arguments in registers, but you can't fit the
result + error in your sret.

Targets will have to deal with that in the DAG, if you don't do that
in IR. The ARM target would put the error pointer in the stack, which
is not where you want it to go.

You'd probably need a way to mark portions of your sret as *must be
inreg* and others to be "nice to be inreg", so that you can spill the
result and not the error, if that's what you want.

Having that sort of ability might make some special cases easier for C lowering,
too, come to think of it. Imagine an x86 ABI that — based on type information
otherwise erased by the conversion to LLVM IR — sometimes returns a float in
an SSE register and sometimes on the x86 stack. It would be very awkward to
express that today, but some sort of abi(“x87”) attribute would make it easy.

If this is kept in the Swift PCS only, and if the compiler always agrees on
which registers you're using, that's ok.

But if you call a C function, or a new version of LLVM decides to use
a different register, you'll have run-time problems.

That's why ARM has different standards for hard and soft float, which
cannot mix.

cheers,
--renato

I think it should be possible to vectorize such a loop even without OpenMP clauses.

Note that we may still need/use some #pragma to guarantee that

b) no loop-carried backward dependencies are introduced by the "dowork"
   call that would prevent vectorization of the k loop.

in order to vectorize the loop even if the calls themselves are to remain scalar and in order, unless we can prove that no such dependencies exist between the call and other instructions in the loop. But that holds independent of this proposal.

Ayal.

I’m not sure of your point here. We don’t use the Swift CC to call C functions.
It does not matter, at all, whether the frontend lowering of an aggregate under
the Swift CC resembles the frontend lowering of the same aggregate under AAPCS.

Right, ignore me, then.

I’m not sure why you say that. We already do have parameter ABI override
attributes with target-specific behavior in LLVM IR: sret and inreg.

Their meaning is somewhat confused and hard-coded in the back-end. I
once wanted to use inreg for lowering register-based divmod in
SelectionDAG, but ended up implementing custom lowering in the ARM
back-end because inreg wasn't used correctly. It's possible that now
it's better, but you'll always be at the mercy of what the back-end
does with the attributes, especially in custom lowering.

Also, for different back-ends, "inreg" means different things. If the
PCS allows multiple argument/return registers, then sret inreg is
possible for a structure with up to X/Y words, where X and Y are
different for different targets and could very well be zero.

Example, in a machine with *two* PCS registers:

i64 @foo (i32)

returning in registers becomes: sret { i32, i32 } @foo (inreg i32)

then you add your error: sret { i32, i32, i8* } @foo (inreg i32, inreg i8*)

You can fit the two arguments in registers, but you can't fit the
result + error in your sret.

Targets will have to deal with that in the DAG, if you don't do that
in IR. The ARM target would put the error pointer in the stack, which
is not where you want it to go.

You'd probably need a way to mark portions of your sret as *must be
inreg* and others to be "nice to be inreg", so that you can spill the
result and not the error, if that's what you want.

Right, this is one very good reason I would prefer to keep the error-result
modelled as a parameter rather than mixing it in with the return value.

Also, recall that the error-result is supposed to be assigned to a register
that isn’t normally used for return values (or arguments, for that matter).

Having that sort of ability might make some special cases easier for C lowering,
too, come to think of it. Imagine an x86 ABI that — based on type information
otherwise erased by the conversion to LLVM IR — sometimes returns a float in
an SSE register and sometimes on the x86 stack. It would be very awkward to
express that today, but some sort of abi(“x87”) attribute would make it easy.

If this is kept in the Swift PCS only, and if the compiler always agrees on
which registers you're using, that's ok.

But if you call a C function, or a new version of LLVM decides to use
a different register, you'll have run-time problems.

A new version of LLVM really can’t just decide to use a different register
once there’s an agreed interpretation. It sounds like the problem you were
running into with “inreg” was that the ARM backend didn’t have a stable meaning
for it, probably because the ARM target doesn’t allow the frontend features
(regparm/sseregparm) that inreg is designed for. But there are targets — i386,
chiefly — where inreg has a defined, stable meaning precisely because regparm
has a defined, stable meaning. It seems to me that an abi(“context”) attribute
would be more like the latter than the former: any target that supports swiftcc
would also have to assign a stable meaning for abi(“context”).

John.

To be absolutely clear: I’m not suggesting that merging the Swift CC should be conditional on Apple fixing all of the associated ugliness in all of the calling convention logic. I support merging the Swift CC, but I also think that it is going to add some complexity to an already complex part of LLVM and think that it would be good if it could come along with a plan for reducing that complexity.

For the current logic, there are two interrelated issues:

- The C ABI defines how to map things to registers / stack slots.

- Other language ABI documents (including C++) are typically defined in terms of lowering to the platform’s C calling convention. Even when the core language is not, the C FFI usually is.

There are a few smaller issues, such as the complexity required for each pass to work out what the return value of a call / invoke instruction is (is it the return value, is it a load of some alloca that is passed via an sret argument?).

There are two separable parts of this problem:

- What does the representation of a call with a known set of C types look like in LLVM?

- What are the APIs that we use for constructing a function that has these calls?

Clang already has APIs to abstract a lot of this. Given a C type and a set of LLVM values that represent these C values, it can deconstruct the values into the relevant LLVM types and, on the callee side, reassemble LLVM values that correspond to the C types. It’s been proposed a few times before to have some kind of ABIBuilder class that would encapsulate this behaviour, probably pulling some code out of clang. It would then be the responsibility of backend maintainers to ensure that the ABIBuilder is kept in sync with any changes to how they represent their ABI in IR. It would probably also help to have some introspection APIs of the same form (e.g. for getting the return value).

David

Right, this is one very good reason I would prefer to keep the error-result
modelled as a parameter rather than mixing it in with the return value.

Ok, so we're on the same page here.

Also, recall that the error-result is supposed to be assigned to a register
that isn’t normally used for return values (or arguments, for that matter).

This looks more complicated, though.

The back-end knows how to lower the standard PCS, so you'll have to teach
it that this particular argument violates that agreement in a very
predictable fashion, i.e. another ABI.

If the non-PCS register you use is always the same (say the platform
register), then you'll have to save/restore whenever you cross the
boundaries between using/not-using (ex. between C and Swift
functions). This sounds hard to get right.

One way to know would be to identify all calls in IR that have
different number of parameters, and do the save/restore there.
Example:

define @foo() {
...
call @bar(i32, i8*)
...
}

define @bar(i32)

You'd need to change the frame lowering code to identify the
difference and, instead of bailing out, create the additional spills.

A new version of LLVM really can’t just decide to use a different register
once there’s an agreed interpretation.

Good, so there will be a defined ABI.

It originally sounded like you could choose "any register", but it
seems you're actually going to define the exact behaviour in all
supported platforms.

It seems to me that an abi(“context”) attribute
would be more like the latter than the former: any target that supports swiftcc
would also have to assign a stable meaning for abi(“context”).

Makes sense.

cheers,
--renato

Right, this is one very good reason I would prefer to keep the error-result
modelled as a parameter rather than mixing it in with the return value.

Ok, so we're on the same page here.

Also, recall that the error-result is supposed to be assigned to a register
that isn’t normally used for return values (or arguments, for that matter).

This looks more complicated, though.

The back-end knows how to lower the standard PCS, so you'll have to teach
it that this particular argument violates that agreement in a very
predictable fashion, i.e. another ABI.

We are using a different swiftcc convention in IR already, and we are fine with
locking the error-result treatment to that CC.

If the non-PCS register you use is always the same (say the platform
register), then you'll have to save/restore whenever you cross the
boundaries between using/not-using (ex. between C and Swift
functions). This sounds hard to get right.

One way to know would be to identify all calls in IR that have
different number of parameters, and do the save/restore there.
Example:

define @foo() {
...
call @bar(i32, i8*)
...
}

define @bar(i32)

You'd need to change the frame lowering code to identify the
difference and, instead of bailing out, create the additional spills.

I don’t think we can make this depend on statically recognizing when we’re
passing extra arguments. That’s why, in our current implementation, whether
or not the register is treated as an ordinary callee-save register or the magic
error result is based on whether there’s an argument to the call (or function
on the callee side) with that specific parameter attribute.

A new version of LLVM really can’t just decide to use a different register
once there’s an agreed interpretation.

Good, so there will be a defined ABI.

It originally sounded like you could choose "any register", but it
seems you're actually going to define the exact behaviour in all
supported platforms.

Right.

John.

We are using a different swiftcc convention in IR already, and we are fine with
locking the error-result treatment to that CC.

Makes sense.

I don’t think we can make this depend on statically recognizing when we’re
passing extra arguments. That’s why, in our current implementation, whether
or not the register is treated as an ordinary callee-save register or the magic
error result is based on whether there’s an argument to the call (or function
on the callee side) with that specific parameter attribute.

Right, and you set it up even if the caller doesn't use the error
argument, which is expected.

I think all my questions were answered, and I'm happy with it. Thanks
for the time! :slight_smile:

I'll look into Manman's patch soon, but it seems quite straightforward.
No changes on the ARM side at all so far.

Thanks!
--renato

PS: Nice to see you're using X86 like ARM, not the other way around... :slight_smile:

Pinging David and Richard!

Yours,
Andrey