[RFC] arm64_32: upstreaming ILP32 support for AArch64

As you may have noticed, we released a 64-bit S4 chip that runs an ILP32
variant of the AArch64 ABI, and now we'd like to upstream that work.
I've pushed preliminary patches to
https://github.com/TNorthover/llvm-project/pull/1/commits (arm64_32
branch in that repo) to accompany this RFC. The changes divide fairly
neatly into three categories.

First, there's AArch64 ILP32 support, which should be fairly easy to
adapt to the ELF (or COFF) world and be generally useful. This
involved changing some generic code in ways I'll discuss below.

Then there's the specific ABI we chose, which isn't quite the same as
AAPCS since it was designed in conjunction with armv7k so that IR
could be compiled to be compatible with arm64_32. Since people do use
third-party compilers based on LLVM, having it upstream is expected to
be a good thing.

Finally we have a few passes that translate the necessarily
platform-specific parts of armv7k IR to arm64_32: things like NEON
intrinsic calls, and workarounds for certain assumptions the Swift
compiler made about C++ parameter passing. These aren't quite so
obviously useful to everyone, but could serve as examples in future
and (since they're self-contained IR passes) are likely to be low
maintenance. However, we'd understand if the community doesn't want
this burden anyway.

Most of the target-specific changes are fairly straightforward, but I
think I should explain the changes made to generic CodeGen.

From: llvm-dev <llvm-dev-bounces@lists.llvm.org> On Behalf Of Tim Northover via llvm-dev
Sent: Thursday, January 31, 2019 7:06 AM
To: LLVM Developers Mailing List <llvm-dev@lists.llvm.org>
Subject: [EXT] [llvm-dev] [RFC] arm64_32: upstreaming ILP32 support for AArch64

Comments inline.

AArch64 ILP32 Addressing Modes
------------------------------

CodeGenPrepare:
---------------

We teach CodeGenPrepare to sink GEPs as GEPs, and preserve the
inbounds marker. This is the only way they can possibly be exposed to
SDAG at the basic block level.

Isn't addr-sink-using-gep already a thing?

Pointers are still 64-bits, tricked ya!
---------------------------------------

The next question was how to expose these GEPs to the SDAG.

I first considered adding an ISD::GEP or an "inbounds" flag to
ISD::ADD. These would solve the first issue above, but not the second.

So the proposed solution is to allow pointers to have different
in-memory and in-DAG types, and specifically keep an i64 pointer in
the DAG on arm64_32. This immediately guarantees that (valid) pointers
will have their high bits zeroed, and just by creating the DAG we make
explicit the sign-extensions described by GEP semantics.

Addressing-modes can then be used with no change to the actual C++
code in AArch64 that selects them.

There are two possible disadvantages though. First, since pointers are
64 bits, they will consume 64-bit spill slots and potentially bloat
the stack. It's unclear how much of an issue that is in practice.

Second is the intrusiveness. On the plus side it's less intrusive than
ISD::GEP would be, but it still involves changes in some fairly
obscure bits of DAG -- often found when things broke rather than by
careful planning.

Did you consider modeling this with address spaces? LLVM already has robust support for address spaces with different pointer sizes, and you probably want to expose support for 64-bit pointers anyway.

Arrays
------

We're translating armv7k bitcode to arm64_32, and the result has to be
compatible with code that is compiled directly to arm64_32.

The biggest barrier here was small structs. They generally get passed
in registers, possibly with alignment requirements.

    struct { int arr[2]; } goes in [rN,rN+1] or in xN.
    struct { uint64_t val; } goes in [rN,rN+1] (starting even), or in xN.

So we need a way to signal in IR that two values should be combined
into a single x-register when compiled for arm64_32. We chose LLVM
arrays for the job. So, unlike all other targets, the following two
functions will behave differently in arm64_32:

    void @foo([2 x i32] %x0) ; Two i32s combined into 64-bit x0 register
    void @bar(i32 %w0, i32 %w1) ; First i32 in w0, second in w1

I'm not sure I follow the difference between [2 x i32] and i64: if they both go into a single register, why do you need both? Or is this necessary to support your automatic translation pass?

-Eli

Hi Eli,

Thanks for the comments.

> We teach CodeGenPrepare to sink GEPs as GEPs, and preserve the
> inbounds marker. This is the only way they can possibly be exposed to
> SDAG at the basic block level.

> Isn't addr-sink-using-gep already a thing?

Yes, I'm not sure why I wrote that (maybe I saw the new
addrSinkUsingGEPs in a patch and misremembered). It looks like what I
actually did was attempt to decouple the logic. It's currently based
on useAA, which seems to be an orthogonal question to me, so I added a
new virtual function hook. I'm now suspicious of the logic there too,
though. I'll inspect it further before uploading anything for review.

> Second is the intrusiveness. On the plus side it's less intrusive than
> ISD::GEP would be, but it still involves changes in some fairly
> obscure bits of DAG -- often found when things broke rather than by
> careful planning.

> Did you consider modeling this with address spaces? LLVM already has robust support for address spaces with different pointer sizes,

I have to say I didn't, but I don't think it would solve the problem.
Alternate address-spaces still have just one pointer size per space as
far as I'm aware. If that's 64 bits we get efficient CodeGen, but
loading or storing a pointer clobbers more data than it should; if
that's 32 bits then we get poor CodeGen.

> and you probably want to expose support for 64-bit pointers anyway.

It's a possibility, though no-one has asked for it yet. The biggest
request we've actually had is for signed 32-bit pointers so that both
TTBR0 and TTBR1 regions can be used. I could see a pretty strong
argument for exposing unsigned pointers via a different address-space
in that regime (for use in user_addr_t in kernel code), though you'd
have to be pretty disciplined to make it work I think.

> I'm not sure I follow the difference between [2 x i32] and i64: if they both go into a single register, why do you need both? Or is this necessary to support your automatic translation pass?

Yep, it's entirely because we need to support code generated for
armv7k. On that platform [2 x i32] and i64 have different alignment
requirements on the stack; [2 x i32] would be used for struct {
int32_t val[2]; }, i64 would be used for struct { int64_t val; }. But
because AArch64 AAPCS puts more data in registers, some of these args
generated for the stack go in registers on arm64_32.

Cheers.

Tim.

I was thinking of a model something like this: 32-bit pointers are addrspace 0, 64-bit pointers are addrspace 1. ISD::LOAD/STORE in addrspace 0 are not legal: they're custom-lowered to operations in addrspace 1. (An addrspacecast from 0 to 1 is just zero-extension.) At that point, since the cast from 32 bits to 64 bits is explicitly represented, we can optimize it in the DAG or IR. For example, we can transform a load of an inbounds gep in addrspace 0 into a load of an inbounds gep in addrspace 1.

I don't know that this ends up being easier to implement overall, but the model is closer to what the hardware actually supports, and it involves fewer changes to target-independent code.

-Eli

+1

This is basically what we do for one address space on AMDGPU

-Matt

That would have to be an IR-level pass I think; otherwise the default
MVT for any J. Random Pointer Value is still i32, leading to the same
efficiency issues when you eventually use that on a load/store.

With a pass, within a function you ought to be able to promote all
uses of addrspace(0) to addrspace(1), leaving (as you say)
addrspacecasts at opaque sources and sinks (loads, stores, args,
return, ...). Structs containing pointers would be (very?) messy. And
you'd probably want it earlyish to recombine things.

I do like LLVM passes as a solution for most problems, and it ought
to give a big head start to GlobalISel implementation too. I'll
definitely give it a go as an alternative next week.

Cheers.

Tim.

Don't suppose you could tell me which it is? No worries if you don't
remember, but it might save me a few minutes in unfamiliar code if you
do.

Cheers.

Tim.

The current implementation is not what it should be, since it’s currently done in the selector (see Expand32BitAddress) for CONSTANT_ADDRESS_32BIT, which just needs a zext to CONSTANT_ADDRESS. I’ve been meaning to move this into the lowering for the load.

-Matt

> Alternate address-spaces still have just one pointer size per space as
> far as I'm aware. If that's 64 bits we get efficient CodeGen, but
> loading or storing a pointer clobbers more data than it should; if
> that's 32 bits then we get poor CodeGen.

> I was thinking of a model something like this: 32-bit pointers are addrspace 0, 64-bit pointers are addrspace 1. ISD::LOAD/STORE in addrspace 0 are not legal: they're custom-lowered to operations in addrspace 1. (An addrspacecast from 0 to 1 is just zero-extension.) At that point, since the cast from 32 bits to 64 bits is explicitly represented, we can optimize it in the DAG or IR. For example, we can transform a load of an inbounds gep in addrspace 0 into a load of an inbounds gep in addrspace 1.

> That would have to be an IR-level pass I think; otherwise the default
> MVT for any J. Random Pointer Value is still i32, leading to the same
> efficiency issues when you eventually use that on a load/store.

I don’t see why this would need to be an IR pass. There aren’t all that many places left using the default argument to the various pointer functions, and those can mostly be fixed. iPTR is hopelessly broken on the tablegen side, but you wouldn’t get to that point with this.

> With a pass, within a function you ought to be able to promote all
> uses of addrspace(0) to addrspace(1), leaving (as you say)
> addrspacecasts at opaque sources and sinks (loads, stores, args,
> return, ...). Structs containing pointers would be (very?) messy. And
> you'd probably want it earlyish to recombine things.

You can set the ABI alignment of the 32-bit pointer to 8 bytes in the data layout for struct-layout purposes.

-Matt

> I don’t see why this would need to be an IR pass. There aren’t all that many places left using the default argument to the various pointer functions, and those can mostly be fixed. iPTR is hopelessly broken on the tablegen side, but you wouldn’t get to that point with this.

The difficulty I'm seeing is that we need GEP to be lowered to i64
arithmetic, but that happens in SelectionDAGBuilder before the target
has any real opportunity to override anything. Once the GEP has been
converted to DAG, the critical information is already gone and we just
have i32 ADD/MUL trees.

The two options I see for making that happen favourably are an IR pass
or deep surgery on Clang, which seems even less appealing.

> With a pass, within a function you ought to be able to promote all
> uses of addrspace(0) to addrspace(1), leaving (as you say)
> addrspacecasts at opaque sources and sinks (loads, stores, args,
> return, ...). Structs containing pointers would be (very?) messy. And
> you'd probably want it earlyish to recombine things.

> You can set the ABI alignment of the 32-bit pointer to 8 bytes in the data layout for struct-layout purposes.

I was more thinking in terms of the pass converting all value
representations of pointers to addrspace(1). That means that when a
struct gets loaded or stored directly it needs to be repacked.
Completely tractable, but not pretty.

Also, we couldn't do that anyway because the ABI is now very much set
in stone (actually has been in that regard since the very first watch
came out -- we translate bitcode for armv7k to arm64_32 which is
hopelessly doomed if the DataLayouts don't match).

And thanks for the pointers on AMD; I'll take a look at those properly
and see what we can learn.

Cheers.

Tim.

Oh right, you don’t have the addrspace in the input.

I have long wanted a way for targets to take over GEP expansion, which might help you. We’ll need that for non-integral pointer support anyway.

-Matt

> Oh right, you don’t have the addrspace in the input.

Input to what? Even if it's available it's wrong without a fixup pass.
Still, custom override for GEP as you talk about later could overcome
the problem...

> I have long wanted a way for targets to take over GEP expansion, which might help you. We’ll need that for non-integral pointer support anyway.

It has potential. If I could override GEP generation to use i64
arithmetic followed by a trunc (and use custom load/store lowering)
then it'd be a battle between DAGCombiner trying to eliminate
redundant casts (good!), and narrowing arithmetic surrounded by casts
(bad!).

I don't know which would win right now or how we could influence that,
but I am a bit worried that the semantics of the DAG would no longer
actually represent our requirements. In a pure i64-value-pointer world
there is no question: truncating to i32 is not legitimate.

The main non-integral pointer project I'm aware of is CHERI, and for
personal reasons I'm all in favour of making its job easier. Do you
have others in mind, or any other opinions in how a target should
override GEP generation? If some particular variety of arm64_32
upstreaming could provide a stepping stone for other uses that would
be a definite argument in its favour.

Cheers.

Tim.

AMDGPU basically has 128-bit fat pointers for lots of intrinsics. We currently hack around this by using <4 x i32>, which isn’t really valid since LLVM assumes memory access is through a pointer type. We would like to have 128-bit fat pointers, but only the low 48-bits really behave as an integer. We want to do 64-bit adds on the low bits, and preserve the high metadata bits.

I haven’t looked too closely at how to actually implement this, but some form of target control of GEP lowering is definitely needed for this. I imagine this looks something like a ptr_add instruction, and most of the GEP complexity is turned into a simple byte offset from this.

-Matt

Unfortunately, having thought some more about this approach I don't
think it'll work for arm64_32.

The barrier is that each instruction is expanded in isolation so it
has to conform to a type interface; in this case it would be that
pointer SDValues are i32. So no matter how cunning our expansion is
there would have to be a trunc/zext pair between (say) a GEP expansion
and its use in a store instruction. That's a real masking operation
and prevents the addressing-modes being used. Using trunc/anyext
instead would be foldable, but it's unsound.

If instead we went for an i64 interface, I think the result would be
pretty much equivalent to the implementation I already have (hooks on
load/store/GEP/icmp in SelectionDAGBuilder, i.e. the places where it
matters that pointers are really 32-bit).

I'm having some success with the IR-level pass though. Not finished
yet, and I have no idea how the backend will cope with the new IR, but
it's mostly transforming things as I'd expect.

Cheers.

Tim.

Hi again,

> I don't know that this ends up being easier to implement overall, but the model is closer to what the hardware actually supports, and it involves fewer changes to target-independent code.

I've now got something largely working via an IR-level lowering
pass (pushed to GitHub as
https://github.com/TNorthover/llvm-project/tree/arm64_32-arch-pass,
please excuse any artefacts of incompleteness). I feel like it's
rapidly approaching an unpalatability horizon though. Most issues stem
from the fact that not all pointers are visible or controllable in the
IR:

  + FrameIndices: you can't change an alloca's address-space since it's
fixed by the DataLayout, so they get through to the DAG as i32s,
significantly complicating the addressing-mode logic.
  + ConstantPool accesses are automatically put into addrspace(0).
  + BlockAddress is similar.
  + Some intrinsics are not polymorphic on pointer type, and adapting
those that are is messy.
  + Returns demoted to x8-indirect are always implemented by stores in
addrspace(0).

I don't think any of these are truly insurmountable, but they do mean
that the backend would have to cope with both i32 and i64 pointers in
fairly ad-hoc ways, and add a lot of complexity to the approach. I
think it's reached the point where the added complexity in AArch64 has
outweighed the benefits to SelectionDAG so I'm inclined to stick with
the original approach for now.

Cheers.

Tim.

Like you say, I'm pretty sure the problems you mentioned are solvable. And you don't actually have to solve every possible inefficiency to get a usable result; it's not the end of the world if we emit an unnecessary zero extension somewhere.

But maybe modifying the DAG to allow pointers in 64-bit registers which correspond to 32-bit values in memory isn't too horrible. It's probably not even that much code; we don't synthesize very many pointer load/store operations in SelectionDAG. I'm mostly worried that you'll continue to discover new issues forever because nobody else has a target that differs in that particular dimension. Maybe we'll get ILP32 ABIs for more targets that will use this functionality in the future, though.

-Eli

> Like you say, I'm pretty sure the problems you mentioned are solvable. And you don't actually have to solve every possible inefficiency to get a usable result; it's not the end of the world if we emit an unnecessary zero extension somewhere.

True, though they'd have to be pretty small edge cases. And the
biggest job (intrinsics) might well be the most important since both
NEON and prefetch tend to be used in performance-critical code.

> I'm mostly worried that you'll continue to discover new issues forever because nobody else has a target that differs in that particular dimension.

I'd actually be more worried with the pass-based version. From what I
remember only the icmp and not-inbounds GEP were really surprising in
the original attempt. This time around I was hitting a lot more
edge-cases that need special handling, a trend that I suspect would
continue.

Cheers.

Tim.