[global-isel] Type-independence of load/store

Hi Jakob,

Sounds like a really exciting topic; I'd love to be involved in
implementation. I've not really had time to think about the
implications of the larger picture, but one detail did strike me on
the first read-through:

> On the other hand, when types are not used to select register banks, it
> becomes really difficult to explain the difference between load i32 and load
> f32. The hardware doesn't care either, it simply knows how to load 32 bits
> into a given register.

> We can use a three-level hierarchical type system to
> better describe this:

That may be something we want to be flexible about. I know we don't
support big-endian ARM at the moment, but its NEON load/store
instructions do take an interest in more than just the total number of
bits I think.

        vst1.16 {d0}, [r0]
        vst1.64 {d0}, [r0]
would give byte-wise layouts of
        1 0 3 2 5 4 7 6
        7 6 5 4 3 2 1 0

Other big-endian targets may have similar issues, but I know virtually
nothing about them.

Of course, if those three categories are just helpful mental models
(perhaps with convenience functions) then there's likely no issue. We
probably shouldn't go around discarding the "irrelevant" information
though.

Cheers.

Tim.

> Sounds like a really exciting topic; I'd love to be involved in
> implementation.

We need all the volunteers we can get. ;-)

> > On the other hand, when types are not used to select register banks, it
> > becomes really difficult to explain the difference between load i32 and load
> > f32. The hardware doesn't care either, it simply knows how to load 32 bits
> > into a given register.
>
> > We can use a three-level hierarchical type system to
> > better describe this:
>
> That may be something we want to be flexible about. I know we don't
> support big-endian ARM at the moment, but its NEON load/store
> instructions do take an interest in more than just the total number of
> bits I think.
>
>        vst1.16 {d0}, [r0]
>        vst1.64 {d0}, [r0]
> would give byte-wise layouts of
>        1 0 3 2 5 4 7 6
>        7 6 5 4 3 2 1 0
>
> Other big-endian targets may have similar issues, but I know virtually
> nothing about them.

ARM’s is an interesting implementation of big-endian vectors. AFAIK, other architectures go all in and use both big-endian lanes and elements. That makes the problem go away, and you only need one load instruction.

Note that LLVM IR requires a bitcast to be equivalent to storing one type and loading the other, and it seems that this would turn a bitcast into a kind of shuffle.
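
As a rough illustration of why the bitcast would become a shuffle, the sketch below models "store as one element size, load as another" on a 64-bit register in big-endian mode. The helper names are hypothetical, and the model assumes register bytes numbered from the least significant byte (0) and lane 0 at the lowest address; it is not taken from any existing LLVM code.

```python
def swap_within_lanes(data, elem_bits):
    """Reverse the bytes inside each element of the given width.

    Under the assumptions above, this one permutation models both a
    big-endian element-wise store (register -> memory) and the matching
    load (memory -> register), because it is its own inverse.
    """
    elem_bytes = elem_bits // 8
    out = []
    for start in range(0, len(data), elem_bytes):
        out.extend(reversed(data[start:start + elem_bytes]))
    return out


def bitcast_via_memory(reg_bytes, from_bits, to_bits):
    """Store with one element size, then reload with another."""
    memory = swap_within_lanes(reg_bytes, from_bits)  # e.g. vst1.16
    return swap_within_lanes(memory, to_bits)         # e.g. vld1.64


d0 = list(range(8))
print(bitcast_via_memory(d0, 16, 16))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(bitcast_via_memory(d0, 16, 64))  # [6, 7, 4, 5, 2, 3, 0, 1]
```

With matching element sizes the round trip is the identity. Going from 16-bit to 64-bit elements reverses the order of the four 16-bit lanes in the register, so under this model the bitcast is a lane shuffle rather than a no-op.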

I think Dan has opinions on this particular topic.

Thanks,
/jakob

> ARM’s is an interesting implementation of big-endian vectors.
> AFAIK, other architectures go all in and use both big-endian
> lanes and elements. That makes the problem go away, and you
> only need one load instruction.

Hmm, I suppose the "cost" is that any instruction referring to lanes
has to behave differently under big and little endian conditions. Not
an issue if you only support one, of course.

> Note that LLVM IR requires a bitcast to be equivalent to storing one
> type and loading the other, and it seems that this would turn a
> bitcast into a kind of shuffle.

Interesting. We'll have some fun if we ever try that, I think!

Tim.

> Other big-endian targets may have similar issues, but I know virtually
> nothing about them.

> ARM's is an interesting implementation of big-endian vectors. AFAIK, other
> architectures go all in and use both big-endian lanes and elements. That
> makes the problem go away, and you only need one load instruction.

The recently published MIPS SIMD Architecture (MSA) has the same issue for big-endian vectors. There's a small non-functional benefit to accounting for this in little-endian too. For little-endian mode, the emitted code is a bit easier to understand if the 'correct' loads and stores are used.

Daniel Sanders
Leading Software Design Engineer, MIPS Processor IP
Imagination Technologies Limited
www.imgtec.com

AltiVec is an implementation of big-endian vectors that doesn’t require multiple load instructions or shuffling bitcasts. See section 4.2 of http://www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf

I can’t tell if MIPS and ARM are doing the same thing, or if they need different models. I don’t think either has ever been attempted in LLVM. I suspect that some tinkering is required at the IR level as well to make it work.

But it seems like we’ll probably need to allow the vector shape to influence load/store instruction selection.

Thanks,
/jakob

I believe we are doing the same thing for normal loads and stores. In big-endian mode, our st.h stores using the byte-order 10325476 and our st.d stores 76543210. This is the same as the vst1.16 and vst1.64 that Tim described.

I'm actually working on MSA at the moment. I've just started upstreaming my work now that the spec has been published.

I've had two main problems with my implementation of MSA so far. The first is that the type system is rather awkward and doesn't cope very well with multiple types in the same register class. I ended up splitting up my register classes according to the number of bits in the element.
The second is that on MIPS32 with MSA the v2i64 type is legal but i64 is not. I'm finding that the legalizer and DAG combiner sometimes generate SelectionDAG nodes with illegal types (e.g. a build_vector containing i64s).
I believe both of these issues would be solved using the proposed global-isel.