proposal to add MVT::vAny type

The ARM Neon load, store and shuffle operations that I've been implementing recently with LLVM intrinsics do not care about the distinction between vectors with i32 and f32 elements -- only the size matters. But, because we have only MVT::fAny and MVT::iAny types, I've been having to define separate intrinsics for the operations with floating-point vector elements. It didn't bother me when there were only a few intrinsics like this, but now there are more, and I realized this weekend that I still need to add more for the load/store lane operations.

I had been thinking about trying to bitcast my way out of this, but it struck me that it would make a lot more sense to have a new MVT::vAny type that TableGen would match to any vector type. That would more accurately reflect the type constraints on these intrinsics.
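To make the duplication concrete, here is a rough sketch at the IR level (the intrinsic names are made up for illustration):

  ; today: one intrinsic via iAny for the integer elements and a second
  ; via fAny for the float elements, even though both just move bits
  declare <4 x i32>   @llvm.arm.neon.vfooi.v4i32(i8*)
  declare <4 x float> @llvm.arm.neon.vfoof.v4f32(i8*)

  ; with vAny: a single intrinsic, overloaded across all vector types
  declare <4 x i32>   @llvm.arm.neon.vfoo.v4i32(i8*)
  declare <4 x float> @llvm.arm.neon.vfoo.v4f32(i8*)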

Since these "*Any" types are confined to TableGen, it seems like it should be pretty easy to add another one. I looked at the places that use iAny and fAny, and they seem straightforward to extend to handle a new vAny type. Does this seem like a good idea? Any objections?

I'd like to get the Neon intrinsics finalized before the 2.6 release, since it may be harder to change them later.

Hi Bob,

An alternative would be to model the operations as regular shuffle, load, and store operators, combined to describe the actual instructions. This would make them easier for target-independent code to understand.

Dan

> The ARM Neon load, store and shuffle operations that I've been
> implementing recently with LLVM intrinsics do not care about the
> distinction between vectors with i32 and f32 elements -- only the size
> matters. But, because we have only MVT::fAny and MVT::iAny types,
> I've been having to define separate intrinsics for the operations with
> floating-point vector elements. It didn't bother me when there were
> only a few intrinsics like this, but now there are more, and I
> realized this weekend that I still need to add more for the load/store
> lane operations.

Hi Bob,

I really do think that bitcast is the right way to go here. I ran into a couple of similar problems when bringing up the altivec port. For example, at one time we'd get "all zero vectors" of different MVTs, which would not be CSEd.

The fix for this was to be really disciplined about what types to make things in, and use bitcasts to convert the types when appropriate. For example, the all zeros vector is now only created as a <4 x i32> (IIRC) and bitcasted to the desired type. If this is impacting intrinsics, it seems that the front-end could do a similar thing to canonicalize the intrinsics early. As you know, we do prefer to have as few intrinsics as possible.
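For example (a sketch, with a made-up intrinsic name), the front-end could canonicalize the float form onto the integer intrinsic:

  ; bitcast the float operands so one integer intrinsic serves both
  ; element types; this is only valid when the operation just moves bits
  %a32 = bitcast <4 x float> %a to <4 x i32>
  %b32 = bitcast <4 x float> %b to <4 x i32>
  %r32 = call <4 x i32> @llvm.arm.neon.vfoo.v4i32(<4 x i32> %a32, <4 x i32> %b32)
  ; a bitcast on the result recovers the float vector
  %r = bitcast <4 x i32> %r32 to <4 x float>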

Can you describe a bit more about what vAny would do for you, maybe with an example? I'm sorry that I don't know much at all about neon...

-Chris

Hi Bob,

> An alternative would be to model the operations as regular shuffle,
> load, and store operators, combined to describe the actual
> instructions. This would make them easier for target-independent code
> to understand.

Yes, I have tried to do that as much as possible. There are still a number of operations where we've ended up using intrinsics, for varying reasons.

For example, I had been planning to have the front-end translate the VTRN, VZIP, and VUZP builtins to vector shuffles, since that is exactly what they are. But, after discussing it with Evan, I changed these to intrinsics because we couldn't figure out a good way to handle them as shuffles. They take two vector operands and shuffle them in place, producing two vector results. I had been translating these to shuffles that produced double-wide vectors, e.g., shuffle two <8 x i8> vectors producing one <16 x i8> vector. That made it hard for the optimizer to deal with the results, since they are really two separate vectors, and some simple experiments made me think we won't get very good code from that approach.
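Concretely, the double-wide form looked roughly like this for VTRN on two <8 x i8> vectors (a sketch, with the mask written out):

  ; one shuffle producing both "transposed" results glued into a single
  ; double-wide vector: lanes 0-7 are result 0, lanes 8-15 are result 1
  %wide = shufflevector <8 x i8> %a, <8 x i8> %b,
      <16 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14,
                  i32 1, i32 9, i32 3, i32 11, i32 5, i32 13, i32 7, i32 15>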

The load/store-multiple operations with element (de)interleaving also worked out best as intrinsics.
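For example, a 2-way interleaved load returns two vectors at once, roughly like this (signature simplified for illustration):

  ; loads 8 consecutive i32 elements and de-interleaves them into two
  ; <4 x i32> results, returned together as a first-class aggregate
  declare { <4 x i32>, <4 x i32> } @llvm.arm.neon.vld2.v4i32(i8*)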

Maybe we can talk about these in person if you want the gory details.

> I really do think that bitcast is the right way to go here. I ran
> into a couple of similar problems when bringing up the altivec port.
> For example, at one time we'd get "all zero vectors" of different
> MVTs, which would not be CSEd.
>
> The fix for this was to be really disciplined about what types to make
> things in, and use bitcasts to convert the types when appropriate.
> For example, the all zeros vector is now only created as a <4 x i32>
> (IIRC) and bitcasted to the desired type.

Yes, I used that approach, at least to some extent, for Neon. There may be more to do to make sure things are getting CSEd the way we want.

> If this is impacting
> intrinsics, it seems that the front-end could do a similar thing to
> canonicalize the intrinsics early. As you know, we do prefer to have
> as few intrinsics as possible.

That is exactly what I'm trying to accomplish here (fewer intrinsics). I think I can do it with bitcasts, though.

> Can you describe a bit more about what vAny would do for you, maybe
> with an example? I'm sorry that I don't know much at all about neon...

It doesn't do anything fundamental. It just seems like a better fit. Neon has vectors of both integers and floats. Currently my choices for describing the type constraints for a Neon intrinsic are iAny or fAny, but those also allow scalars. vAny would more accurately indicate that only vector types are allowed, and it would also avoid the need for bitcasting.

It sounds like it is not a popular idea, so I'll let it rest.

SSE is suffering from the same issue. We end up with a lot of def : Pat patterns. I am not sure whether having vAny will solve it, though: we don't want to match just any vector type, but rather any vector type of a given size. It would be nice if we could use PatFrag to specify type-matching code for dagisel.

Evan

Yes, these sound very reasonable to keep as intrinsics. We'd just want one intrinsic for each semantic operation, not one for each type. Using bitcasts to canonicalize to one (non-type-parametric) intrinsic seems like the right approach.

-Chris

So I tried adding bitcasts today, and it turned out to be harder than I expected. The intrinsics return first-class aggregates containing values with overloaded types. If I bitcast the source operands and use the v*i32 intrinsic instead of a v*f32 intrinsic, the type of the result is completely different from the expected aggregate type. I would have to extract the individual elements, bitcast them back to floating-point values, and then insert them into a new aggregate with a type that matches what the front-end is expecting. It is messy.
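For example, for a VTRN-style operation on <4 x float> (a sketch; the exact signature is simplified), the bitcast route would look like:

  ; use the integer intrinsic on bitcasted operands...
  %a32 = bitcast <4 x float> %a to <4 x i32>
  %b32 = bitcast <4 x float> %b to <4 x i32>
  %r = call { <4 x i32>, <4 x i32> } @llvm.arm.neon.vtrn.v4i32(<4 x i32> %a32, <4 x i32> %b32)
  ; ...then take the aggregate apart, bitcast each piece, and rebuild
  ; an aggregate of the type the front-end expects
  %r0 = extractvalue { <4 x i32>, <4 x i32> } %r, 0
  %r1 = extractvalue { <4 x i32>, <4 x i32> } %r, 1
  %f0 = bitcast <4 x i32> %r0 to <4 x float>
  %f1 = bitcast <4 x i32> %r1 to <4 x float>
  %res0 = insertvalue { <4 x float>, <4 x float> } undef, <4 x float> %f0, 0
  %res  = insertvalue { <4 x float>, <4 x float> } %res0, <4 x float> %f1, 1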

I talked about this with Chris and he agreed that adding vAny would not be so bad. I discovered this afternoon that there are a number of additional intrinsics I can remove with this change (beyond what I was originally thinking of). These are all operations where the behavior differs between i32 and f32 element types so bitcasting is not an option. That is good!
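For instance, an absolute-value style operation (the names here are just illustrative): integer and float absolute value are genuinely different computations, so bitcasting the float form onto the integer intrinsic would change the semantics, but a single vAny-overloaded intrinsic can cover both, with the element type selecting the instruction:

  declare <4 x i32>   @llvm.arm.neon.vabs.v4i32(<4 x i32>)    ; integer abs
  declare <4 x float> @llvm.arm.neon.vabs.v4f32(<4 x float>)  ; float abs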

Even better, Chris showed me a way to implement VZIP, VUZP, and VTRN as shuffles instead of intrinsics....
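The rough idea, as I understand it, is to emit each result as its own same-width shuffle instead of one double-wide one (a sketch for VZIP on <8 x i8>):

  ; result 0: interleave the low halves of %a and %b
  %lo = shufflevector <8 x i8> %a, <8 x i8> %b,
      <8 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11>
  ; result 1: interleave the high halves
  %hi = shufflevector <8 x i8> %a, <8 x i8> %b,
      <8 x i32> <i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>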