[arm, aarch64] Alignment checking in interleaved access pass

Hi,

As a follow-up to patch D23646, I’m trying to figure out whether there should be an alignment check and what the correct approach is.

Some background:
For stores, the pass turns:

%i.vec = shuffle <8 x i32> %v0, <8 x i32> %v1,
<0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11>
store <12 x i32> %i.vec, <12 x i32>* %ptr
Into:

%sub.v0 = shuffle <8 x i32> %v0, <8 x i32> %v1, <0, 1, 2, 3>
%sub.v1 = shuffle <8 x i32> %v0, <8 x i32> %v1, <4, 5, 6, 7>
%sub.v2 = shuffle <8 x i32> %v0, <8 x i32> %v1, <8, 9, 10, 11>
call void llvm.aarch64.neon.st3(%sub.v0, %sub.v1, %sub.v2, %ptr)

The purpose of the above patch is to enable more general patterns such as turning:

%i.vec = shuffle <32 x i32> %v0, <32 x i32> %v1,
<4, 32, 16, 5, 33, 17, 6, 34, 18, 7, 35, 19>
store <12 x i32> %i.vec, <12 x i32>* %ptr
Into:

%sub.v0 = shuffle <32 x i32> %v0, <32 x i32> %v1, <4, 5, 6, 7>
%sub.v1 = shuffle <32 x i32> %v0, <32 x i32> %v1, <32, 33, 34, 35>
%sub.v2 = shuffle <32 x i32> %v0, <32 x i32> %v1, <16, 17, 18, 19>
call void llvm.aarch64.neon.st3(%sub.v0, %sub.v1, %sub.v2, %ptr)

The question I’m trying to get answered is whether there should have been an alignment check for the original pass and, similarly, whether there should be an expanded one for the more general pattern.
In the example above, I was looking to check whether the data at positions 4, 16, and 32 is aligned, but I cannot get a clear picture of the impact on performance (hence the side question below).
Also, some preliminary alignment checks I added break some ARM tests (and not their AArch64 counterparts). The cause is getting “not fast” back from allowsMisalignedMemoryAccesses, which checks hasV7Ops.
I’d appreciate some guidance on how to best address and analyze this.
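For reference, here is a minimal sketch (mine, not code from the patch) of the kind of gate such a check could add, written against the circa-2016 TargetLowering API; the exact signature and header location of allowsMisalignedMemoryAccesses vary across LLVM releases:

  // Sketch only: ask the target whether a misaligned store of one
  // de-interleaved sub-vector (e.g. the <4 x i32> pieces above) would be
  // fast, and skip the shuffle -> stN rewrite otherwise.
  #include "llvm/CodeGen/ValueTypes.h"
  #include "llvm/IR/Instructions.h"
  #include "llvm/Target/TargetLowering.h" // moved under llvm/CodeGen/ in newer trees

  using namespace llvm;

  static bool misalignedSubVecStoreIsFast(const TargetLowering &TLI,
                                          const StoreInst &SI, EVT SubVecVT) {
    bool Fast = false;
    bool Allowed = TLI.allowsMisalignedMemoryAccesses(
        SubVecVT, SI.getPointerAddressSpace(), SI.getAlignment(), &Fast);
    // The failing ARM tests correspond to "allowed, but not fast" here.
    return Allowed && Fast;
  }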

Side question for Tim and other ARM folks: could I get a recommendation on reading material for performance tuning for the different ARM architectures?

Thank you,
Alina

All,

Gentle reminder in the hopes of getting some answers to the questions in the original email.

Thank you,
Alina

> The question I'm trying to get answered is whether there should have been
> an alignment check for the original pass and, similarly, whether there
> should be an expanded one for the more general pattern.

Hi Alina,

IIRC, the initial implementation was very simple and straightforward,
aiming to make use of the VLDn instructions on ARM/AArch64 NEON.

Your patterns allow simple vector instructions in the trivial case,
but not in the cases where VLDn would make a difference.

The examples were:

for (i..N)
  out[i] = in[i] * Factor; // R
  out[i+1] = in[i+1] * Factor; // G
  out[i+2] = in[i+2] * Factor; // B

This pattern is easily vectorised on most platforms, since the loads, muls
and stores are the exact same operations, which can be combined.

for (i..N)
  out[i] = in[i] * FactorR; // R
  out[i+1] = in[i+1] * FactorG; // G
  out[i+2] = in[i+2] * FactorB; // B

This can still be vectorised easily, since the Factor vector can easily be
constructed.

for (i..N)
  out[i] = in[i] + FactorR; // R
  out[i+1] = in[i+1] - FactorG; // G
  out[i+2] = in[i+2] * FactorB; // B

Now it gets complicated, because the operations are not the same. In
this case, VLDn helps, because you shuffle [0, 1, 2, 3, 4, 5] -> VADD
[0, 3] + VSUB [1, 4] + VMUL [2, 5].
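To make that concrete, here is a standalone NEON-intrinsics sketch (my illustration, not code from this thread; the Factor values are placeholders) of that mixed add/sub/mul case, where VLD3/VST3 de-interleave the channels so each one gets its own vector op:

  #include <arm_neon.h>

  void rgb_mixed(const int32_t *in, int32_t *out, int n,
                 int32_t FactorR, int32_t FactorG, int32_t FactorB) {
    for (int i = 0; i + 11 < n; i += 12) {      // 4 pixels * 3 channels
      int32x4x3_t px = vld3q_s32(in + i);       // VLD3: de-interleave R, G, B
      px.val[0] = vaddq_s32(px.val[0], vdupq_n_s32(FactorR)); // R: add
      px.val[1] = vsubq_s32(px.val[1], vdupq_n_s32(FactorG)); // G: sub
      px.val[2] = vmulq_s32(px.val[2], vdupq_n_s32(FactorB)); // B: mul
      vst3q_s32(out + i, px);                   // VST3: re-interleave and store
    }
  }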

Your case seems to be more like:

for (i..N)
  out[i] = in[i] * FactorR; // R
  out[i+4] = in[i+4] * FactorG; // G
  out[i+8] = in[i+8] * FactorB; // B

Here VLDn won't help, but re-shuffling the vectors, as in the second case
above, will.

Even this case:

for (i..N)
  out[i] = in[i] + FactorR; // R
  out[i+4] = in[i+4] - FactorG; // G
  out[i+8] = in[i+8] * FactorB; // B

can work, if the ranges are not overlapping. So, [0, 4, 8] would work
on a 4-way vector, but [0, 2, 4] would only work on a 2-way vector.

> In the example above, I was looking to check whether the data at positions
> 4, 16, and 32 is aligned, but I cannot get a clear picture of the impact on
> performance

On modern ARM and AArch64, misaligned loads are not a problem. This is
true at least from A15 onwards, possibly A9 (James may know better).

If your ranges overlap, you may be forced to reduce the vectorisation
factor, thus reducing performance, but the vectoriser should be able
to pick that up from the cost analysis pass (2-way vs 4-way).

> Also, some preliminary alignment checks I added break some ARM tests (and
> not their AArch64 counterparts). The cause is getting "not fast" back from
> allowsMisalignedMemoryAccesses, which checks hasV7Ops.

What do you mean by "break"? Bad codegen? Slower code?

> Side question for Tim and other ARM folks: could I get a recommendation on
> reading material for performance tuning for the different ARM architectures?

ARM has a list of manuals on each core, including optimisation guides:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.set.cortexa/index.html

cheers,
--renato

Hi Renato,

Thank you for the answers!

First, let me clarify a couple of things and give some context.

The patch is looking at VSTn rather than VLDn (it seems to be somewhat harder to get the “right” patterns for stores; the pass is already doing a good job for loads).

The examples you gave come mostly from loop vectorization, which, as I understand it, was the reason for adding the interleaved access pass. I’m looking at a different use case. The code in question is generated by a DSL (Halide), which emits LLVM bitcode directly. The computations do originally come from loops, but they are pre-processed by Halide before it generates vector code as bitcode. The patterns I’m targeting are, AFAIK, not generated by the loop vectorization or SLP passes.

Now, for ARM architectures, Halide is currently generating explicit VSTn intrinsics with some of the patterns I described, and I found no reason why Halide shouldn’t generate a single shuffle followed by a generic vector store and rely on the interleaved access pass to generate the right intrinsic.
Performance-wise, it is worth using the VSTns in the scenarios they encounter; it’s mostly a question of where they get generated.

The alignment question is orthogonal to the patch up for review. There was no alignment check before, and I didn’t have enough background on the architectures to conclude whether one was needed. I added a simple check to see whether it would make any difference, and some of the ARM tests started failing. The breakage just meant that the interleaved access pass was no longer replacing the “shuffle + store” with the “vstn”, so the checks were failing.

If the alignment is not an issue, it simplifies things.

Also, thank you for the reference!

Best,
Alina

> Now, for ARM architectures, Halide is currently generating explicit VSTn
> intrinsics with some of the patterns I described, and I found no reason why
> Halide shouldn't generate a single shuffle followed by a generic vector
> store and rely on the interleaved access pass to generate the right intrinsic.

IIRC, the shuffle that unscrambles the interleaved pattern <0, 4, 8, 1
...> -> <0, 1, 2 ...> is how the strided algorithm works, so that the
back-end can match the patterns and emit a VSTn / STn because the
access is "sequential" in an n-hop way.

If the shuffle doesn't present itself in that way, the back-end
doesn't match and you end up with a long list of VMOVs.

Also, the vectorisers today make sure that the sequence is continuous
and contiguous, which doesn't seem to be a hard requirement for
Halide. I don't think there's a better reason than "we haven't thought
about the general case yet".

One way to test the back-end pattern matching is to emit textual IR
and manually change it, removing the intrinsics, or changing the
shuffles and see what happens after `opt`.

> Performance-wise, it is worth using the VSTns in the scenarios they
> encounter; it's mostly a question of where they get generated.

I'm confused. AFAIK, VSTn is AArch32 while STn is AArch64, and for the
one lane variant, the latency and throughput seem to be identical.

> If the alignment is not an issue, it simplifies things.

Except in rare cases (which are not pertinent to the case at hand),
ARMv8 handles unaligned loads and stores without penalties.

cheers,
--renato

>> Now, for ARM architectures, Halide is currently generating explicit VSTn
>> intrinsics with some of the patterns I described, and I found no reason why
>> Halide shouldn't generate a single shuffle followed by a generic vector
>> store and rely on the interleaved access pass to generate the right
>> intrinsic.

> IIRC, the shuffle that unscrambles the interleaved pattern <0, 4, 8, 1
> ...> -> <0, 1, 2 ...> is how the strided algorithm works, so that the
> back-end can match the patterns and emit a VSTn / STn because the
> access is "sequential" in an n-hop way.
>
> If the shuffle doesn't present itself in that way, the back-end
> doesn't match and you end up with a long list of VMOVs.

That's what the patch addresses. It generalizes the algorithm to accept a
more generic interleaved pattern and emit a VSTn/STn.
The access no longer needs to be sequential in an n-hop way; it can be any
pattern amenable to the use of the store intrinsic:
[x, y, ..., z, x+1, y+1, ..., z+1, x+2, y+2, ..., z+2, ...]
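As an illustration (a sketch of mine, not the code in D23646), a check for that generalized pattern could look like this: split the mask into groups of Factor elements and require each of the Factor "columns" to advance by exactly one from group to group:

  #include <vector>

  // Returns true if Mask has the shape [x, y, ..., z, x+1, y+1, ..., z+1, ...]
  // for some Factor starting indices x, y, ..., z.  This accepts both the
  // classic <0, 4, 8, 1, 5, 9, ...> masks and the more general ones, e.g.
  // <4, 32, 16, 5, 33, 17, ...>.
  static bool isGeneralizedInterleaveMask(const std::vector<int> &Mask,
                                          unsigned Factor) {
    if (Factor < 2 || Mask.empty() || Mask.size() % Factor != 0)
      return false;
    unsigned NumGroups = Mask.size() / Factor; // elements per de-interleaved field
    for (unsigned Lane = 0; Lane < Factor; ++Lane)
      for (unsigned Group = 1; Group < NumGroups; ++Group)
        if (Mask[Group * Factor + Lane] != Mask[Lane] + (int)Group)
          return false;
    return true; // a real implementation would also need to check types, overlap, etc.
  }

For the example from the patch, <4, 32, 16, 5, 33, 17, 6, 34, 18, 7, 35, 19> with Factor = 3 passes, giving fields that start at indices 4, 32 and 16.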

> Also, the vectorisers today make sure that the sequence is continuous
> and contiguous, which doesn't seem to be a hard requirement for
> Halide. I don't think there's a better reason than "we haven't thought
> about the general case yet".

That's right, perhaps because Halide is not a regular vectorizer, which
opens up new cases.
To give a bit more insight, here's a simple example where the data is still
continuous, [0 .. 32), but needs to be split across multiple VSTns/STns.
This is what Halide generates for aarch64:

  %uglygep242243 = bitcast i8* %uglygep242 to <16 x i32>*

  %114 = shufflevector <16 x i32> %112, <16 x i32> %113, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %115 = shufflevector <16 x i32> %112, <16 x i32> %113, <4 x i32> <i32 8, i32 9, i32 10, i32 11>
  %116 = shufflevector <16 x i32> %112, <16 x i32> %113, <4 x i32> <i32 16, i32 17, i32 18, i32 19>
  %117 = shufflevector <16 x i32> %112, <16 x i32> %113, <4 x i32> <i32 24, i32 25, i32 26, i32 27>
  %118 = bitcast <16 x i32>* %uglygep242243 to <4 x i32>*
  call void @llvm.aarch64.neon.st4.v4i32.p0v4i32(<4 x i32> %114, <4 x i32> %115, <4 x i32> %116, <4 x i32> %117, <4 x i32>* %118)
  %scevgep241 = getelementptr <16 x i32>, <16 x i32>* %uglygep242243, i64 1

  %119 = shufflevector <16 x i32> %112, <16 x i32> %113, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %120 = shufflevector <16 x i32> %112, <16 x i32> %113, <4 x i32> <i32 12, i32 13, i32 14, i32 15>
  %121 = shufflevector <16 x i32> %112, <16 x i32> %113, <4 x i32> <i32 20, i32 21, i32 22, i32 23>
  %122 = shufflevector <16 x i32> %112, <16 x i32> %113, <4 x i32> <i32 28, i32 29, i32 30, i32 31>
  %123 = bitcast <16 x i32>* %scevgep241 to <4 x i32>*
  call void @llvm.aarch64.neon.st4.v4i32.p0v4i32(<4 x i32> %119, <4 x i32> %120, <4 x i32> %121, <4 x i32> %122, <4 x i32>* %123)

IMO, it makes sense to have Halide generate this instead:

  %114 = shufflevector <16 x i32> %112, <16 x i32> %113, <16 x i32> <i32 0, i32 8, i32 16, i32 24, i32 1, i32 9, i32 17, i32 25, i32 2, i32 10, i32 18, i32 26, i32 3, i32 11, i32 19, i32 27>
  store <16 x i32> %114, <16 x i32>* %sunkaddr262
  %115 = shufflevector <16 x i32> %112, <16 x i32> %113, <16 x i32> <i32 4, i32 12, i32 20, i32 28, i32 5, i32 13, i32 21, i32 29, i32 6, i32 14, i32 22, i32 30, i32 7, i32 15, i32 23, i32 31>
  store <16 x i32> %115, <16 x i32>* %scevgep241

With the changes from the patch, this translates to the code above, and it
is arch-independent.

AFAIK there isn't an LLVM pass going one step further to do such splits if
given a single store.
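If such a pass existed, the core of the split would just be cutting the one wide interleave mask into register-sized chunks; here is a rough sketch (hypothetical, not an existing LLVM transform) of that bookkeeping, using plain C++ containers:

  #include <vector>

  // Cut a wide factor-F interleave mask into chunks of Factor * LanesPerReg
  // indices; each chunk is itself an interleave mask that a single stN can
  // handle, with the store address advancing by one chunk per intrinsic.
  // For the example above: Factor = 4, LanesPerReg = 4, so the 32-element
  // mask <0, 8, 16, 24, 1, 9, 17, 25, ...> splits into the two 16-element
  // masks of the two shuffle+store pairs.
  static std::vector<std::vector<int>>
  splitInterleaveMask(const std::vector<int> &Mask, unsigned Factor,
                      unsigned LanesPerReg) {
    std::vector<std::vector<int>> Chunks;
    const unsigned ChunkLen = Factor * LanesPerReg; // e.g. 4 * 4 = 16
    for (unsigned Base = 0; Base + ChunkLen <= Mask.size(); Base += ChunkLen)
      Chunks.emplace_back(Mask.begin() + Base, Mask.begin() + Base + ChunkLen);
    return Chunks;
  }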

But, you're right that in general, the data is not necessarily continuous
and contiguous in Halide, and I think the pass should address those cases
as well.

> One way to test the back-end pattern matching is to emit textual IR
> and manually change it, removing the intrinsics, or changing the
> shuffles and see what happens after `opt`.

Yes, I did that with some of the code generated by Halide; it's what led to
patch D23646 to extend the patterns. The new code being generated is the
"expected" one.
Also, benchmarking some of their apps showed that LLVM's pass (after the
patch) does the job as well as the custom code generation they were using
before. (Note that Halide's code generation was written before the
interleaved access pass was added, so it made sense at the time.)

>> Performance-wise, it is worth using the VSTns in the scenarios they
>> encounter; it's mostly a question of where they get generated.
>
> I'm confused. AFAIK, VSTn is AArch32 while STn is AArch64, and for the
> one lane variant, the latency and throughput seem to be identical.

That's right; the aim is to have LLVM generate VSTn/STn depending on the
arch.

Best,
Alina

> IMO, it makes sense to have Halide generate this instead:
>
> %114 = shufflevector <16 x i32> %112, <16 x i32> %113, <16 x i32> <i32 0, i32 8, i32 16, i32 24, i32 1, i32 9, i32 17, i32 25, i32 2, i32 10, i32 18, i32 26, i32 3, i32 11, i32 19, i32 27>
> store <16 x i32> %114, <16 x i32>* %sunkaddr262
> %115 = shufflevector <16 x i32> %112, <16 x i32> %113, <16 x i32> <i32 4, i32 12, i32 20, i32 28, i32 5, i32 13, i32 21, i32 29, i32 6, i32 14, i32 22, i32 30, i32 7, i32 15, i32 23, i32 31>
> store <16 x i32> %115, <16 x i32>* %scevgep241
>
> With the changes from the patch, this translates to the code above, and it
> is arch-independent.

Right, this makes sense.

This should generate 2 VST4/ST4, which together will be contiguous,
but not individually.

> Yes, I did that with some of the code generated by Halide; it's what led to
> patch D23646 to extend the patterns. The new code being generated is the
> "expected" one.

I have added some comments on the review, but I think overall it makes
sense, and it's a much simpler patch than I was expecting to find working
all the way to the end. :)

> Also, benchmarking some of their apps showed that LLVM's pass (after the
> patch) does the job as well as the custom code generation they were using
> before. (Note that Halide's code generation was written before the
> interleaved access pass was added, so it made sense at the time.)

Nice!

cheers,
--renato

>> IMO, it makes sense to have Halide generate this instead:
>>
>> %114 = shufflevector <16 x i32> %112, <16 x i32> %113, <16 x i32> <i32 0, i32 8, i32 16, i32 24, i32 1, i32 9, i32 17, i32 25, i32 2, i32 10, i32 18, i32 26, i32 3, i32 11, i32 19, i32 27>
>> store <16 x i32> %114, <16 x i32>* %sunkaddr262
>> %115 = shufflevector <16 x i32> %112, <16 x i32> %113, <16 x i32> <i32 4, i32 12, i32 20, i32 28, i32 5, i32 13, i32 21, i32 29, i32 6, i32 14, i32 22, i32 30, i32 7, i32 15, i32 23, i32 31>
>> store <16 x i32> %115, <16 x i32>* %scevgep241
>>
>> With the changes from the patch, this translates to the code above, and it
>> is arch-independent.

> Right, this makes sense.
>
> This should generate 2 VST4/ST4, which together will be contiguous,
> but not individually.

>> Yes, I did that with some of the code generated by Halide; it's what led to
>> patch D23646 to extend the patterns. The new code being generated is the
>> "expected" one.

> I have added some comments on the review, but I think overall it makes
> sense, and it's a much simpler patch than I was expecting to find working
> all the way to the end. :)

Thanks a lot for looking at the patch, much appreciated! :)
Updated and followed up on it.

Alina