NEON intrinsics preventing redundant load optimization?

Hi all,

I’m not sure if this is the right list, so apologies if not.

Doing some profiling I noticed that some of my hand-tuned matrix multiply code using NEON intrinsics was much slower when called through a C++ template wrapper than when calling the intrinsics function directly. It turned out clang/LLVM was unable to eliminate a temporary, even though the case seemed quite straightforward. Unfortunately, a load directly after a NEON store is bad news on many ARM cores: the wrapped version, which stores to a temporary and then loads and stores again to the final location, was almost 4x slower than the direct version without the temporary.

I'm using the clang in the latest Xcode + iOS SDK: Apple LLVM version 6.0 (clang-600.0.56) (based on LLVM 3.5svn)

Here's a simplified test case:

struct vec4
{
  float data[4];
};

vec4 operator* (vec4& a, vec4& b)
{
  vec4 result;
  for(int i = 0; i < 4; ++i)
    result.data[i] = a.data[i] * b.data[i];

  return result;
}

void TestVec4Multiply(vec4& a, vec4& b, vec4& result)
{
  result = a * b;
}

With -O3 the loop gets vectorized and the code generated looks optimal:

__Z16TestVec4MultiplyR4vec4S0_S0_:
@ BB#0:
  vld1.32 {d16, d17}, [r1]
  vld1.32 {d18, d19}, [r0]
  vmul.f32 q8, q9, q8
  vst1.32 {d16, d17}, [r2]
  bx lr

However, if I replace the operator* with a NEON intrinsic implementation (I know the vectorizer produced optimal code in this case anyway, but that wasn't true in my real situation), the temporary "result" is kept in the generated code for the test function and triggers the penalty of a load directly after a NEON store.

#include <arm_neon.h>

vec4 operator* (vec4& a, vec4& b)
{
  vec4 result;
  
  float32x4_t result_data = vmulq_f32(vld1q_f32(a.data), vld1q_f32(b.data));
  vst1q_f32(result.data, result_data);

  return result;
}

__Z16TestVec4MultiplyR4vec4S0_S0_:
@ BB#0:
  sub sp, #16
  vld1.32 {d16, d17}, [r1]
  vld1.32 {d18, d19}, [r0]
  mov r0, sp
  vmul.f32 q8, q9, q8
  vst1.32 {d16, d17}, [r0]
  vld1.32 {d16, d17}, [r0]
  vst1.32 {d16, d17}, [r2]
  add sp, #16
  bx lr

Is there something about the use of intrinsics that prevents the compiler optimizing out the redundant store on the stack? Is there any hope for this improving in the future, or anything I can do now to improve the generated code?

Thanks,

Simon

From: "Simon Taylor" <simontaylor1@ntlworld.com>
To: llvmdev@cs.uiuc.edu
Sent: Sunday, December 7, 2014 1:15:51 PM
Subject: [LLVMdev] NEON intrinsics preventing redundant load optimization?

Hi all,

I’m not sure if this is the right list, so apologies if not.

This is not a bad place ;-)

Is there something about the use of intrinsics that prevents the compiler optimizing out the redundant store on the stack?

I recommend filing a bug report so that someone can look at this in detail. You can do this at llvm.org/bugs -- select "libraries" as the product, and then "Scalar Optimizations" as the component (that's probably right, and we can always change it if it turns out the problem lies elsewhere).

In the meantime, I recommend trying to pass by value, instead of by reference, in your multiplication operator. It is hard to say without looking at the code in detail, but it is easier for the compiler to analyze:
  vec4 operator* (vec4 a, vec4 b)
than to analyze:
  vec4 operator* (vec4& a, vec4& b)
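
For example, the intrinsics version of operator* with by-value parameters would look something like this (a sketch of just the signature change, nothing else altered):

vec4 operator* (vec4 a, vec4 b)
{
  vec4 result;
  // Same body as before; only the parameters are now passed by value
  vst1q_f32(result.data, vmulq_f32(vld1q_f32(a.data), vld1q_f32(b.data)));
  return result;
}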

-Hal

If I had to guess, I'd say the intrinsic got in the way of recognising
the pattern. vmulq_f32 got correctly lowered to IR as "fmul", but
vld1q_f32 is still kept as an intrinsic, so register allocators and
schedulers get confused and, when lowering to assembly, you're left
with garbage around it.

Creating a bug for this is probably the best thing to do, since this
is a common pattern that needs looking into to produce optimal code.

cheers,
--renato

Thanks for the responses. I’ve filed bug #21778 for this:
http://llvm.org/bugs/show_bug.cgi?id=21778

I've also tried replacing the vst1.32 store with setting the data[i] elements individually with vgetq_lane, which gets at least the single-multiply case back to optimal code. There's still an unneeded temporary when doing res = a * b * c, though. Anyway, let's continue this on the bug tracker :)
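
Roughly like this (a sketch of what I mean; the lane index has to be a compile-time constant):

vec4 operator* (vec4& a, vec4& b)
{
  vec4 result;
  float32x4_t r = vmulq_f32(vld1q_f32(a.data), vld1q_f32(b.data));
  // Write each lane out individually instead of using vst1q_f32
  result.data[0] = vgetq_lane_f32(r, 0);
  result.data[1] = vgetq_lane_f32(r, 1);
  result.data[2] = vgetq_lane_f32(r, 2);
  result.data[3] = vgetq_lane_f32(r, 3);
  return result;
}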

Simon

FWIW, with top of tree clang, I get the same (good) code for both of the implementations of operator* in the original email. That appears to be a fairly recent improvement, though I haven’t bisected it down or anything.

-Jim

Thanks for the note. I’m building a recent checkout now to see if I see the same behaviour.

Looking at the -emit-llvm output from the Xcode build shows "load <4 x float>* %1, align 4, !tbaa !3" for the C implementation, where redundant temporaries are successfully eliminated. With the intrinsics, the IR still contains "tail call <4 x float> @llvm.arm.neon.vld1.v4f32(i8* %1, i32 4)".

Perhaps this is due to NEON supporting interleaved loads, where an array of structs is de-interleaved so that each element ends up in its own vector register. I suspect that isn't very common across architectures, so it wouldn't surprise me if there were no IR instruction for the interleaved cases (vld[234].*). It seems vld1.* and vst1.* do have direct IR representations, though.
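
For example (not part of the test case, just to illustrate the interleaved form), vld2q_f32 loading an array of x,y pairs de-interleaves them, so the x values end up in one register and the y values in another:

float xy[8] = { 0.f, 1.f, 2.f, 3.f, 4.f, 5.f, 6.f, 7.f };  // interleaved x,y pairs
float32x4x2_t v = vld2q_f32(xy);  // v.val[0] = {0,2,4,6}, v.val[1] = {1,3,5,7}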

It's great news if this is fixed in the current tip, but in the short term (for App Store builds using the official toolchain) are there any LLVM-specific extensions to initialise a float32x4_t that will get lowered to the "load <4 x float>* %1" form? Or is that more a question for the clang folks?

Simon

Things have definitely improved with the top of tree, and as Jim reported the examples I started with now generate the expected code. The NEON load/stores still appear in the IR rather than the load <4 x float> from the auto-vectorized C.

There’s an additional example in the bug report [http://llvm.org/bugs/show_bug.cgi?id=21778] that tests chained multiply (ie res = a * b * c). In this case the current top of tree clang still has one redundant temporary.

I've managed to replace the load/store intrinsics with pointer dereferences (along with a typedef to get the alignment correct). This generates exactly the same IR and asm as the auto-vectorized C version (both at -O3), and works with the toolchain in the latest Xcode. Are there any concerns about doing this?

typedef float32x4_t __attribute__((aligned(4))) f32x4_align4_t;

vec4 operator* (const vec4& a, const vec4& b)
{
  vec4 result;

  float32x4_t a_data = *((f32x4_align4_t*)a.data);
  float32x4_t b_data = *((f32x4_align4_t*)b.data);

  float32x4_t result_data = vmulq_f32(a_data, b_data);

  *((f32x4_align4_t*)result.data) = result_data;

  return result;
}

My view is that you should only use intrinsics where the language has
no semantics for the operation. Since that's not the case here, using
pointers is probably the best way anyway.

There is still the "bug" where the load/store intrinsics don't map to
simple pointer references, but since you found a better work-around,
that has lower priority now.

I changed the bug to reflect that.

cheers,
--renato

I think dereferencing pointers is explicitly discouraged in the
documentation for portability reasons. It may well have issues on
wrong-endian targets.

Tim.

The ARM ACLE docs recommend against the GCC extension that allows an initializer list because of potential endianness issues:
float32x4_t values = {1, 2, 3, 4};
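
For what it's worth, the endian-safe way to build that constant appears to be to go through an array and vld1q_f32, e.g.:

const float init[4] = { 1, 2, 3, 4 };
float32x4_t values = vld1q_f32(init);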

I don't recall seeing anything about pointer dereferencing, but it may have the same issues. I'm a bit hazy on endianness issues with NEON anyway (in terms of element numbering, casts between types, etc.), but it seems like all the smartphone platform ABIs are defined to be little-endian, so I haven't spent too much time worrying about it.

Simon

Tim is right, this is a potential danger, but no more than other
endianness or type-size issues. If you're writing portable code, I
assume you'll already be mindful of those.

This is why I said it's still a problem, but not a critical one. Maybe
adding a comment to your code explaining the issue will help you in
the future to move it back to NEON loads/stores once this is fixed.

cheers,
--renato

Hi all,

Sorry for arriving late to the party. First, some context:

vld1 is not the same as a pointer dereference. The alignment requirements are different (which I saw you hacked around in your testcase using __attribute__((aligned(4)))), and in big-endian environments they do totally different things (VLD1 does element-wise byte swapping, while a pointer dereference byteswaps the entire 128-bit number).

While pointer dereference does work just as well (and better, given this defect) as VLD1, it is explicitly not supported. The ACLE mandates that there are only certain ways to legitimately "create" a vector object: vcreate, vcombine, vreinterpret and vload. NEON intrinsic types don't exist in memory (memory is modelled as a sequence of scalars, as in the C model). For this reason, Renato, I don't think we should advise people to work around the API, as who knows what problems that will cause later.
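
To make that concrete, a sketch of the sanctioned creation routes (my own example, not wording from the ACLE itself):

float lo_data[2] = { 1.0f, 2.0f };

float32x2_t lo = vld1_f32(lo_data);                   // vld1 (load)
float32x2_t hi = vcreate_f32(0x4080000040400000ULL);  // vcreate from the bit pattern of {3.0f, 4.0f}
float32x4_t v  = vcombine_f32(lo, hi);                // vcombine two D-sized vectors into a Q-sized one
uint32x4_t  u  = vreinterpretq_u32_f32(v);            // vreinterpret to another vector type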

The reason above is why we map vld1q_f32() to a NEON intrinsic instead of a generic IR load. Looking at your testcase, even with tip-of-trunk clang we generate redundant loads and stores:

vld1.32 {d16, d17}, [r1]
vld1.32 {d18, d19}, [r0]
mov r0, sp
vmul.f32 q8, q9, q8
vst1.32 {d16, d17}, [r0]
vld1.64 {d16, d17}, [r0:128]
vst1.32 {d16, d17}, [r2]

Whereas for AArch64, we don’t (and neither do we for the chained multiply case):

ldr q0, [x0]
ldr q1, [x1]
fmul v0.4s, v0.4s, v1.4s
str q0, [x2]
ret

So this is handled, and I think there’s something wrong/missing in the optimizer for AArch32. This is a legitimate bug and should be fixed (even if a workaround is required in the interim!)

Cheers,

James

For this reason Renato I don't think we should advise people to work around
the API, as who knows what problems that will cause later.

I stand corrected (twice). But we changed the subject a bit, so things
got different midway through.

There were only two pieces of code: full C, with pointers, sometimes
vectorized; and full NEON, with extra loads. The mix between the two
came later, as a quick fix. My comment on that, I agree, was misplaced,
and both you and Tim are correct in asking for it not to be used as a
final solution. But as an interim solution, with care around alignment,
endianness and the rest, it is "fine".

So this is handled, and I think there's something wrong/missing in the
optimizer for AArch32. This is a legitimate bug and should be fixed (even if
a workaround is required in the interim!)

It is legitimate, and that's why I created the bug in the first place.
The only difference is that it's now lower priority since it:

1. is a performance issue, not a conformance one
2. has a workaround that is not too ugly

I now agree that the second part is not as strong as I judged (while
drinking Savoy wine and eating Gruyere cheese). Feel free to revamp
the priority.

cheers,
--renato

After a bit more testing of the pointer-dereferencing workaround in my real code (which has some layers of templates and inline functions), I've decided against using it in practice.

GCC produced incorrect code unless -fno-strict-aliasing was specified. It’s probably entirely entitled to do that, so it just seems too flaky to recommend in any case.

Simon

OK, I raised the priority back to Normal in the bug, since the workaround wasn't good enough.

Cheers,
Renato

Sorry, looks like a false alarm; I was missing a set of brackets in my macro definitions for vld/vst and they weren’t rebuilt properly after being fixed. This does then generate correct code with both GCC and LLVM, and in LLVM all the redundant load/stores are correctly removed.

I've still sufficiently scared myself about aliasing, and GCC still suffers from some additional temporaries, so I'm going to use a lazy-evaluation template approach instead. That loses the natural support for chained multiplies that the simple return-by-value offers, but I have more confidence in the correctness of the approach than in the workaround I previously suggested.
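
As a rough illustration of the idea (a hypothetical sketch, not my actual template code): operator* returns a lightweight proxy that keeps the product in a NEON register, so nothing touches memory until the final assignment:

#include <arm_neon.h>

struct vec4_product;    // proxy holding an intermediate product in a register

struct vec4
{
  float data[4];
  vec4& operator= (const vec4_product& p);   // defined below
};

struct vec4_product
{
  float32x4_t value;
};

inline vec4_product operator* (const vec4& a, const vec4& b)
{
  vec4_product p;
  p.value = vmulq_f32(vld1q_f32(a.data), vld1q_f32(b.data));
  return p;
}

inline vec4_product operator* (const vec4_product& a, const vec4& b)
{
  // Chained multiplies stay in registers; no temporary vec4 on the stack
  vec4_product p;
  p.value = vmulq_f32(a.value, vld1q_f32(b.data));
  return p;
}

inline vec4& vec4::operator= (const vec4_product& p)
{
  vst1q_f32(data, p.value);   // the only store, at the end of the chain
  return *this;
}

With something like that, res = a * b * c does the loads and multiplies entirely in registers and only performs a single store into res.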

Simon

While pointer dereference does work just as well (and better, given this
defect) as VLD1 it is explicitly *not supported*.

I'm actually less convinced of this now. I'd argue that you shouldn't
*create* vectors like that (e.g. by casting a pointer to another type
and loading), but the ACLE can't really dictate what you do with those
vectors after they've been created, and passing around pointers to
them (or storing them in an array, as in int8x8xN_t and so on) is
entirely valid. I do hope we haven't broken that assumption with our
big-endian model...

Cheers.

Tim.

Hi Tim,

I do hope we haven’t broken that assumption with our
big-endian model…

I should rephrase. I’m concerned solely with creating vectors by dereferencing casted pointers to scalars (and vice-versa):

uint32_t *g;
uint32x4_t f = *(uint32x4_t *)g;

What you're referring to works perfectly fine in big-endian mode and should be supported - they're a type like any other. The important difference is the implicit reinterpret_cast, which can make things go wrong fast in big-endian mode (I haven't quite wrapped my head around whether there is an easy scenario in which things could break…)

Cheers,

James

To correct my previous correction, this does indeed cause all the important instructions to get stripped with the Android NDK's GCC 4.8 when optimisation is set to -O2. With -O3, for whatever reason, this doesn't happen. So I'd definitely agree with the general consensus: only create vectors with the official intrinsics and steer well clear of my nasty, hacky attempt at a workaround.

Simon