Vectors in structures

Second question:

I was checking NEON instructions this week and the vector types seem
to be inside structures. If vector types are considered proper types
in LLVM, why pack them inside structures?

Because that is what ARM has specified? They define the vector types that are used with their NEON intrinsics as "containerized vectors". Perhaps someone on the list from ARM can explain why they did it that way.

The extra structures are irrelevant at the llvm IR level and below. The NEON intrinsics in llvm use plain old vector types.

If you're using llvm-gcc, you can define the ARM_NEON_GCC_COMPATIBILITY preprocessor macro, and it will switch to a version of the NEON intrinsics that use plain vector types instead of the containerized vectors. For clang, we are planning to do something similar (without requiring the macro) by overloading the intrinsic functions to take either type of arguments, but that is not yet implemented.
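
For example, a minimal sketch of how that might look with llvm-gcc (assuming the macro only needs to be visible before including arm_neon.h, e.g. defined in the source or with -D on the command line):

  #define ARM_NEON_GCC_COMPATIBILITY   /* or -DARM_NEON_GCC_COMPATIBILITY */
  #include <arm_neon.h>

  /* In this mode the NEON user types are plain vector types, so the
     intrinsic arguments are not wrapped in structs. */
  int8x8_t add_bytes(int8x8_t a, int8x8_t b) {
      return vadd_s8(a, b);
  }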

That results in a lot of boilerplate code for converting and copying
the values (about 20 lines of IR) just to call a NEON instruction
that, in the end, will be converted into three instructions:

VLDR + {whatever} + VSTR

If the load and store are normally performed by one operation (I
assume it's the same on Intel and others), why bother with the
structure passing instead of just using load/store for vector types?

As you noted, the struct wrappers produce a lot of extra code but it should all be optimized away. If you see a case where that is not happening, please file a bug report.

Also, the extra struct { [i8 x 8] } for memcpy also seems redundant.
If you're explicitly saying you want NEON (or whatever vector
instructions), why bother with compatibility?

I don't know what you're referring to here. Can you give an example?

Because that is what ARM has specified? They define the vector types that are used with their NEON intrinsics as "containerized vectors". Perhaps someone on the list from ARM can explain why they did it that way.

That's ok, but why do you need to do that in the IR? I mean, in the
end, the boilerplate will be optimized away and all that's left will
be the vector instruction, either compiled or JITed.

As you noted, the struct wrappers produce a lot of extra code but it should all be optimized away. If you see a case where that is not happening, please file a bug report.

So far so good: all operations I've tried with Clang are correctly
generated as a load+op+store triple.

The intrinsics are defined as ordinary C functions in <arm_neon.h>. They use the containerized vector types. So, you've got C code using structures, and at some point we want to remove those structures and expose the underlying vector types. We rely on llvm's SROA optimizations to do that. If you're suggesting that the front-end should optimize away the structures before even generating the llvm IR, that is definitely possible. It would require more code in the front-end. As long as SROA succeeds in optimizing away the cruft, why does it matter? I suppose there might be some effect on compile-time, but I'd be surprised if it is significant.
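
To illustrate, here is a rough hypothetical sketch (the type and function names are made up, not taken from clang's actual arm_neon.h) of the shape of a containerized intrinsic wrapper; the struct copy-in/copy-out around the underlying vector operation is exactly the cruft that SROA is relied on to strip:

  /* Hypothetical sketch only; real headers differ. */
  typedef signed char __sketch_v8i8 __attribute__((vector_size(8)));
  typedef struct { __sketch_v8i8 val; } sketch_int8x8_t;

  static inline sketch_int8x8_t sketch_vadd_s8(sketch_int8x8_t a,
                                               sketch_int8x8_t b) {
      sketch_int8x8_t r;
      r.val = a.val + b.val;   /* the real header calls a NEON builtin here */
      return r;                /* SROA removes the struct wrapping and copies */
  }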

I see your point, and I'm not concerned with compilation time. All
that code just casts structures to vectors, or something like that,
and gets optimized away automatically.

However, Clang is already doing a lot of work in the front-end: the
operations come out correct (adds, intrinsics), even though in
arm_neon.h the function calls are transformed into a series of similar
functions with slightly different parameters.

It means that Clang is, at least, recognizing the correct functions
and transforming them into the appropriate instructions. Why not go a
step further and minimize what needs optimizing in the back-end?

But as you said, in GCC compatibility mode it's pure vector types, so
that's good enough. And I agree that it's not necessary; I was just
curious... Thanks! :wink:

Bob Wilson writes:

> I was checking NEON instructions this week and the vector types seem
> to be inside structures. If vector types are considered proper types
> in LLVM, why pack them inside structures?

Because that is what ARM has specified? They define the vector types
that are used with their NEON intrinsics as "containerized vectors".
Perhaps someone on the list from ARM can explain why they did it that
way.

"Containerized Vector" in the ARM AAPCS refers to fundamental data
types (machine types), it's the class of machine types comprising
the 64-bit and 128-bit NEON machine types.

The AAPCS defines how to pass containerized vectors and it defines
how the NEON user types map on to them. Also it defines how to
mangle the NEON user types. So it defines how to use NEON user
types in a public binary interface.

It also says that arm_neon.h "defines a set of internal structures
that describe the short vector types" which I guess could be read
as saying they are packed inside structures - but I don't think this
is the intention and it doesn't match implementations.
The arm_neon.h implementation in the ARM compiler defines the user
types in terms of C structs called __simd64_int8_t etc. and the
mangling originated as an artifact of this. But the C structs aren't
wrapped vectors; they wrap double or a pair of doubles, to get the
size and alignment. Their only purpose is to be recognized by name
by the front end and turned into a native register type.
In gcc's arm_neon.h the user types aren't structs at all; they're
defined using vector_size, and the mangling is done as a special case.
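
Roughly, as a sketch (illustrative only; neither header is being quoted exactly):

  #include <stdint.h>

  #ifdef ARMCC_STYLE
  /* ARM compiler style: a struct wrapping a double purely to get the
     size and alignment; the front end recognizes __simd64_int8_t by
     name and turns it into a native 64-bit NEON register type. */
  typedef struct __simd64_int8_t { double _x; } __simd64_int8_t;
  typedef __simd64_int8_t int8x8_t;
  #else
  /* gcc style: the user type is a genuine vector type, no struct at
     all; the C++ mangling is handled as a special case instead. */
  typedef int8_t int8x8_t __attribute__((vector_size(8)));
  #endif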

So I think there's no need to wrap these types in LLVM.

Al

They are defined as structures. The table in A.2 defines the exact
structure names. There is a requirement to mangle them as those
structures in A.2.1. The fields of the structure may be different in
this implementation, but the net effect here is that llvm-gcc and
clang avoid having to magically recognize NEON types and substitute
the proper mangling for them the way GCC does.

deep

Right. The contents of the struct don't matter -- the spec is pretty clear about that -- so llvm uses vector types instead of doubles, but your spec definitely shows them being defined as structs.

Beyond that, if you want any sort of cross-compiler portability, you don't want to write code for GCC's implementation. GCC lets you freely intermix vector types, or at least integer vector types, as long as they have the same total size. Since ARM's definition says they are structs, if you want portable NEON code, you have to assume that your intrinsic arguments are compatible based on struct type compatibility, i.e., they have to match exactly, even down to signed vs. unsigned element types. This is a huge hassle. If you take code written for GCC, you typically end up inserting vreinterpret calls all over the place. This was such a problem for llvm-gcc that we had to implement an optional GCC-compatibility mode, and we're planning to do something similar for clang using overloaded intrinsics.
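
For instance, a small sketch (assuming <arm_neon.h> is available): code that GCC accepts via implicit conversion needs an explicit vreinterpret once the types are structs:

  #include <arm_neon.h>

  int8x8_t add_as_signed(uint8x8_t u) {
      /* GCC-style vectors would allow vadd_s8(u, u) directly; with
         struct-typed NEON types the argument types must match exactly. */
      int8x8_t s = vreinterpret_s8_u8(u);
      return vadd_s8(s, s);
  }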

They are defined as structures. The table in A.2 defines the exact
structure names. There is a requirement to mangle them as those
structures in A.2.1.

The mangling requirement doesn't require you to meet it in any
particular way as long as you end up with the right strings.
I.e. the mangling requirement places no requirements at all on
the implementation, outside of mangled names.

The controversial statement is where A.2 requires the user types to
map on to the __simd64 structures. But that still isn't an argument
for wrapping. This sentence might be significant:

"The structures have 64-bit alignment and map directly onto the
containerized vector fundamental data types."

So the structures map _directly_ on to the vector types - not on to
wrappers around the vector types.

The fields of the structure may be different in
this implementation, but the net effect here is that llvm-gcc and
clang avoid having to magically recognize NEON types and substitute
the proper mangling for them the way GCC does.

But mangling routines are self-contained and have to deal with this
sort of target issue anyway, e.g. gcc for ARM deals with 16-bit
floats as well as NEON types. Saving 20 or so lines in a mangling
routine makes no sense as an argument for a particular implementation
strategy for arm_neon.h and a front end, let alone for inflating
bitcode files with lots of wrapping around vector operations.

Al

Bob Wilson writes:

Right. The contents of the struct don't matter -- the spec is pretty
clear about that -- so llvm uses vector types instead of doubles, but
your spec definitely shows them being defined as structs.

It _says_ they are defined as structs - but it doesn't show them
in use, i.e. it doesn't show how a user of the NEON intrinsics is
supposed to be able to make any use of the fact that the types
are defined as structs rather than some completely opaque type.
All it does is explain why the mangling rule looks the way it does.

And whatever the effect of them being structs on the NEON intrinsics
programmer might be, it surely wouldn't prevent them being lowered
to native types when the struct-ness no longer mattered. And I'd
have thought that would be in the front end.

Beyond that, if you want any sort of cross-compiler portability, you
don't want to write code for GCC's implementation. GCC lets you freely
intermix vector types, or at least integer vector types, as long as
they have the same total size.

Yes, other problem cases might be

  int16x4_t x = { 1, 2, 3, 4 }; // gcc only?

  struct float4: float32x4_t { ... }; // armcc only?

We ought to be more specific about the portable subset, and give
more guidance on potential portability issues. Probably that would
start with a common specification for the NEON intrinsics, independent
of any given ARM or GNU compiler release.

Al

Bob Wilson writes:

Right. The contents of the struct don't matter -- the spec is pretty
clear about that -- so llvm uses vector types instead of doubles, but
your spec definitely shows them being defined as structs.

It _says_ they are defined as structs - but it doesn't show them
in use, i.e. it doesn't show how a user of the NEON intrinsics is
supposed to be able to make any use of the fact that the types
are defined as structs rather than some completely opaque type.
All it does is explain why the mangling rule looks the way it does.

We can deal with the mangling rule in other ways if necessary. I don't read the mangling rule as necessarily implying anything about the actual data types.

The fact remains that ARM's documentation defines these types as structs. A user should NOT be able to take advantage of that -- the types are intentionally opaque.

And whatever the effect of them being structs on the NEON intrinsics
programmer might be, it surely wouldn't prevent them being lowered
to native types when the struct-ness no longer mattered. And I'd
have thought that would be in the front end.

Where that lowering is done is an implementation detail of the compiler, isn't it?

The big question is in regard to "whatever the effect of them being structs". As I pointed out, if you follow ARM's approach of making them structs, that has big implications for what types you can use for intrinsic arguments. Most of the vreinterpret intrinsics are only needed if you define the NEON types as structs. If they are GCC-style vectors, you can omit most of the vreinterpret calls.

Beyond that, if you want any sort of cross-compiler portability, you
don't want to write code for GCC's implementation. GCC lets you freely
intermix vector types, or at least integer vector types, as long as
they have the same total size.

Yes, other problem cases might be

int16x4_t x = { 1, 2, 3, 4 }; // gcc only?

struct float4: float32x4_t { ... }; // armcc only?

Yes, definitely. These are the sorts of things that caused us to define a GCC backward-compatibility option.

We ought to be more specific about the portable subset, and give
more guidance on potential portability issues. Probably that would
start with a common specification for the NEON intrinsics, independent
of any given ARM or GNU compiler release.

That would be great. My experience has been that using the struct types as defined by ARM is a big nuisance, so if you can find a way to relax your specification to allow other implementations, that would be most welcome.

They are defined as structures. The table in A.2 defines the exact
structure names. There is a requirement to mangle them as those
structures in A.2.1.

The mangling requirement doesn't require you to meet it in any
particular way as long as you end up with the right strings.
I.e. the mangling requirement places no requirements at all on
the implementation, outside of mangled names.

The controversial statement is where A.2 requires the user types to
map on to the __simd64 structures. But that still isn't an argument
for wrapping. This sentence might be significant:

"The structures have 64-bit alignment and map directly onto the
containerized vector fundamental data types."

So the structures map _directly_ on to the vector types - not on to
wrappers around the vector types.

But regardless they are still structures, right? What does it mean for them to map onto other types? Is the parser supposed to treat them as if they _were_ those other types? If so, I think you need to define a type system for those fundamental vector types. I had read those statements to say something about the data types used in the generated code.

The fields of the structure may be different in
this implementation, but the net effect here is that llvm-gcc and
clang avoid having to magically recognize NEON types and substitute
the proper mangling for them the way GCC does.

But mangling routines are self-contained and have to deal with this
sort of target issue anyway, e.g. gcc for ARM deals with 16-bit
floats as well as NEON types. Saving 20 or so lines in a mangling
routine makes no sense as an argument for a particular implementation
strategy for arm_neon.h and a front end, let alone for inflating
bitcode files with lots of wrapping around vector operations.

Agreed. The mangling is a side issue.

Hi Bob,

Just tested with plain vectors and LLVM's back-end seems to get them
all right. Plain instructions (add, mul, sub) and intrinsics (adds,
subs, muls). Haven't seen any compatibility issues so far, not to
mention that the IR is a fifth of what Clang generates... :wink:

I'm not sure what you mean by this. The llvm intrinsics and built-in vector operations use plain vectors regardless of the front-end. The structures are only relevant for things like argument passing and copying -- you can't do anything else with them. Can you post an example of the 5X IR code size that you're seeing with clang? I'd like to understand the issue that you're seeing.

I mean that I could remove all the structure boilerplate and it still
works, plus you don't have to define any types (since LLVM uses the
vector types), as per the discussion we're having about needing to use
structures. Make that 2x smaller; I had a special case that was not a
fair comparison.

But I recently found out that the polyNxN_t vector type can destroy
everything, as it appears to LLVM as <8 x i8> and is identical to an
intNxN_t for base instructions, so an "icmp eq <8 x i8>" always becomes
VCEQ.I8 and never VCEQ.P8, even though that's what Clang generates.

Putting them into structures doesn't help because the type names are
irrelevant; all the globals end up as %struct.__simd64_int8_t:

%struct.__simd64_int8_t = type { <8 x i8> }
%struct.__simd64_poly8_t = type { <8 x i8> }
%struct.__simd64_uint8_t = type { <8 x i8> }

@u8d = common global %struct.__simd64_int8_t zeroinitializer, align 8
@i8d = common global %struct.__simd64_int8_t zeroinitializer, align 8
@p8d = common global %struct.__simd64_int8_t zeroinitializer, align 8

The difference between uint8x8 and int8x8 is done via 'nsw' (which,
unless it's really generating a trap value, is a misleading tag), but
there's nothing that will flag this type as poly8x8.

When I try to compile this with Clang:

=== comp.c ===
#define __ARM_NEON__
#include <arm_neon.h>

int8x8_t i8d;
uint8x8_t u8d;
poly8x8_t p8d;

void vceq() {
  u8d = vceq_s8(i8d, i8d);
  u8d = vceq_p8(p8d, p8d);
}
=== end ===

It generates exactly the same instruction for both calls:

$ clang -ccc-host-triple armv7a-none-eabi -ccc-gcc-name arm-none-eabi-gcc \
    -mfloat-abi=hard -w -S comp.c -o - | grep vceq
  .globl vceq
  .type vceq,%function
vceq:
  vceq.i8 d0, d1, d0
  vceq.i8 d0, d1, d0
  .size vceq, .Ltmp0-vceq

Isn't that a case that calls for an intrinsic?

Support for NEON intrinsics in clang is not complete. Poly types in general are known to be an issue, and the vceq_p8 in your example definitely needs an intrinsic. It should work with llvm-gcc.

Can you clarify ARM's position on those structure types? It sounds like you are advocating that we get rid of them. The only reason we've been using them in llvm-gcc and clang is for compatibility with ARM's specifications and with ARM's RVCT compiler. If ARM does not care about those things, I'd love to remove the struct wrappers from llvm.

As Al said earlier, you definitely don't need the structures for
compatibility with armcc.

As far as the LLVM back-end is concerned, with or without structures,
the instruction selection works a treat and generates correct NEON
instructions. If the final object has the correct instructions and
follows ARM ABIs, there is no point in keeping IR compatibility.

I also noticed that Clang's arm_neon.h is completely different from
armcc's, another non-compatible choice that has no impact on the final
object code generated.

As far as I can see, there is no gain in adding the wrapping
structures to the vector types.

I'll add the intrinsic for VCEQ.P8 locally and test. If that works,
I'll be sending patches to NEON.td for all the ambiguities I find...

Can you clarify ARM's position on those structure types? It sounds like you are advocating that we get rid of them. The only reason we've been using them in llvm-gcc and clang is for compatibility with ARM's specifications and with ARM's RVCT compiler. If ARM does not care about those things, I'd love to remove the struct wrappers from llvm.

As Al said earlier, you definitely don't need the structures for
compatibility with armcc.

An implementation, such as in GCC, that does not use structures is compatible with ARM's specification in only one direction. GCC will accept any code written for RVCT, but not the other way around. And, as Al pointed out, there are also compatibility issues with how you can initialize vectors. (In fact, if you stick to the documented interfaces, the only way you can initialize a vector to an arbitrary value is by loading from memory.)
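
For example, a sketch that sticks to the documented intrinsics and initializes a vector by loading it from memory:

  #include <arm_neon.h>
  #include <stdint.h>

  int8x8_t make_vector(void) {
      /* A brace initializer on int8x8_t is a GCC extension; loading
         with vld1_s8 also works when the NEON types are structs. */
      static const int8_t values[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
      return vld1_s8(values);
  }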

Can we get an official position from ARM on this?

As far as the LLVM back-end is concerned, with or without structures,
the instruction selection works a treat and generates correct NEON
instructions. If the final object has the correct instructions and
follows ARM ABIs, there is no point in keeping IR compatibility.

We care about llvm IR compatibility across releases, but we can auto-upgrade old bitcode files to work with newer releases of llvm. I don't think this is relevant to the question of structs vs. no-structs.

I also noticed that Clang's arm_neon.h is completely different from
armcc's, another non-compatible choice that has no impact on the final
object code generated.

Each compiler can have its own version of arm_neon.h. llvm-gcc's is quite different from clang's. That is an internal implementation issue.

As far as I can see, there is no gain in adding the wrapping
structures to the vector types.

I'll add the intrinsic for VCEQ.P8 locally and test. If that works,
I'll be sending patches to NEON.td for all the ambiguities I find...

Wait a minute.... VCEQ does not have a special polynomial version. There is only VCEQ.I8. What I said about support for polynomial types in Clang is still true, but for this particular case, there is no difference between vceq_s8, vceq_u8, and vceq_p8 (aside from the types of the intrinsic arguments).

An implementation, such as in GCC, that does not use structures is compatible with ARM's specification in only one direction. GCC will accept any code written for RVCT, but not the other way around. And, as Al pointed out, there are also compatibility issues with how you can initialize vectors. (In fact, if you stick to the documented interfaces, the only way you can initialize a vector to an arbitrary value is by loading from memory.)

Hi Bob,

Can you clarify what compatibility problems you had with GCC? And
whether using structures in Clang made it work with armcc?

Is it just a source code compatibility issue?

Can we get an official position from ARM on this?

I really don't know what you want here. I can't tell you that it will
be safe to remove the structures from Clang, since I don't know enough
about the vector types (and all the other back-ends that use them) or
about the problems you had with gcc/armcc compatibility. Maybe, because
of the way vectors are implemented in LLVM, there is no other way...
maybe not.

Wait a minute.... VCEQ does not have a special polynomial version. There is only VCEQ.I8. What I said about support for polynomial types in Clang is still true, but for this particular case, there is no difference between vceq_s8, vceq_u8, and vceq_p8 (aside from the types of the intrinsic arguments).

Sorry, bad example... (and a wrong copy&paste when generating the test) :wink:

An implementation, such as in GCC, that does not use structures is compatible with ARM's specification in only one direction. GCC will accept any code written for RVCT, but not the other way around. And, as Al pointed out, there are also compatibility issues with how you can initialize vectors. (In fact, if you stick to the documented interfaces, the only way you can initialize a vector to an arbitrary value is by loading from memory.)

Hi Bob,

Can you clarify what compatibility problems you had with GCC? And
whether using structures in Clang made it work with armcc?

Is it just a source code compatibility issue?

Yes, there are multiple issues but they all involve source compatibility.
Here is an example:

#include <arm_neon.h>
uint32x2_t test(int32x2_t x) { return vadd_u32(x, x); }

This works fine with GCC because int32x2_t and uint32x2_t are built-in vector types and can be implicitly converted. It is not valid if those types are defined as structs, because C/C++ do not allow distinct struct types to be implicitly converted just because they happen to have the same size. To get this to compile when the NEON types are structs, you need to add vreinterpret intrinsics:

#include <arm_neon.h>
uint32x2_t test(int32x2_t x) {
  uint32x2_t ux = vreinterpret_u32_s32(x);
  return vadd_u32(ux, ux);
}

I do not have access to ARM's compiler(s) but I'm assuming that the first example will not compile because vadd_u32 expects arguments of type uint32x2_t. Using structs in llvm does not "fix" a compatibility problem, but it helps our users write NEON code that will work with ARM's compiler.

Can we get an official position from ARM on this?

I really don't know what you want here. I can't tell you that it will
be safe to remove the structures from Clang, since I don't know enough
about the vector types (and all the other back-ends that use them) or
about the problems you had with gcc/armcc compatibility. Maybe, because
of the way vectors are implemented in LLVM, there is no other way...
maybe not.

We have gone to some lengths to make llvm match ARM's specifications and to help our users write code that will be portable enough to work with your compiler. If we don't have to worry about compatibility with ARM's specifications and ARM's compilers, we can drop the struct wrappers and make life easier for ourselves. I am getting requests that we do that regardless of ARM's opinion, but I've resisted based on the notion that portability and compatibility are worth fighting for. It is pretty ironic and frustrating to me to hear that even people at ARM think these wrapper structs are a bad idea. I would still prefer to have ARM publish specifications and guidelines that make it possible to write portable NEON code. Do you guys even care about that?