[AArch64] Target-specific loop idiom recognition

Hi all,

I’d like to ask if there are any thoughts on creating an AArch64-specific pass similar to LoopIdiomRecognize, specifically for loops with early exits, which currently prevent both SLP and loop vectorisation from handling them.

We have identified some loops that we know at least a few workloads would benefit from if we were able to use SVE’s unique predication features. One example is with the following loop:

while (i != max_len) {
    if (a[i] != b[i])
        break;
    ++i;
}

This is similar to a memcmp, but slightly different: instead of returning the difference between the values of the first pair of bytes which did not match, it returns the index of the first mismatch. We observe a significant performance improvement by replacing this with a specialised predicated SVE loop. We believe this would generally be beneficial; the pattern appears several times in the xz benchmark and in the LLVM test suite (7zip).
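For reference, a complete scalar version of the idiom (with a hypothetical helper name, not code from any patch) that returns the index of the first mismatch rather than a byte difference might look like:

```c
#include <stddef.h>

/* Hypothetical reference implementation: returns the index of the
 * first mismatching byte, or max_len if the buffers are equal over
 * the whole range. Unlike memcmp, the difference between the
 * mismatching bytes is not returned. */
static size_t first_mismatch(const unsigned char *a,
                             const unsigned char *b,
                             size_t max_len) {
    size_t i = 0;
    while (i != max_len) {
        if (a[i] != b[i])
            break;
        ++i;
    }
    return i;
}
```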

The LoopIdiomRecognize pass seems like a good candidate for matching the loop; however, as the transform depends on the target having the right features to optimise this efficiently, I don’t think it would be an acceptable place to implement it. We also would not be able to lower to a memcmp call, as mentioned above. Since a HexagonLoopIdiomRecognition pass already exists, we are considering adding an AArch64-specific pass, which we could also extend to similar loops such as std::find, a more common idiom which we know is used heavily by the xalancbmk benchmark.

I’d like to check first if there are any concerns with this approach and any comments would be appreciated.

Thanks,
Kerry


RISC-V “V” extension also has fault-only-first loads; we probably want to share code if possible. (And on many targets, you can do related transforms by ensuring vector loads never cross a page boundary.)

Do we need to integrate this into the vectorizer? In the simplest cases, like the one you’ve shown, it’s not that hard to directly pattern-match. But in general there are other operations in the loop which need to be vectorized.

Hi Eli,

For the loops we’ve identified, the trip counts are low, and this guided us towards an efficient model where we only enter the vector loop when memory accesses don’t cross page boundaries, otherwise falling back on a scalar loop. I don’t think we would be able to use transforms such as peeling, as there are multiple loads in the loops we’re looking at, each of which would need to be aligned to a vector multiple.

The loop that we would be emitting from this new pass includes some small page checks to ensure it won’t fault, but these are target-specific and so LoopVectorize is probably not the right place for this? We’ve done some investigations into instead using first-faulting loads for loops such as the one described above and while it did result in a slight performance improvement on Neoverse V1, it was not as significant and there was little improvement observed on SVE2 hardware.
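The kind of page check being described could be sketched as follows, assuming a fixed 4 KiB page size purely for illustration (the real pass would need to query the actual page size for the target):

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size for illustration only; target-specific in
 * practice. */
#define PAGE_SIZE 4096u

/* Sketch of the runtime check: a speculative vector load of `vl`
 * bytes starting at `p` cannot fault beyond the valid data if it
 * does not cross into the next page. */
static bool load_stays_in_page(const void *p, unsigned vl) {
    uintptr_t off = (uintptr_t)p & (PAGE_SIZE - 1);
    return off + vl <= PAGE_SIZE;
}
```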

We could integrate this into LoopVectorize with first-faulting loads, but we’d still need further work to make it optimal, as current SVE hardware suggests the performance gain would be poor.

Thanks,
Kerry

There’s a similar proposal [RFC] Vectorization of loops with Data Dependent Exits.
I do think such transforms belong in the vectorizer, so unifying the two RFCs is quite important.

efficient model where we only enter the vector loop when memory accesses don’t cross page boundaries

Oh, I thought “unique predication features” was referring to first-faulting loads; if you don’t need those, it’s not clear to me what’s “unique”. What exactly is “unique” isn’t that important to the overall discussion, though.

I don’t think we would be able to use transforms such as peeling iterations as there are multiple loads in the loops we’re looking at which would need to be aligned to a vector multiple.

You can probably handle loops with two loads without too much overhead? Align one pointer, then split the iteration where the second pointer would cross a page boundary into two iterations. You can make this work with either masked loads, or some shuffling. Unless I’m misunderstanding something…
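As an illustration of the splitting idea, a hypothetical helper could compute how many lanes remain before the second pointer reaches the next page boundary, so that one vector iteration can be split into two partial ones at that point (4 KiB page size assumed for illustration):

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for illustration only. */
#define PAGE_SIZE 4096u

/* Hypothetical helper: number of byte lanes that can be processed
 * from `p` before crossing into the next page, capped at the vector
 * length `vl`. An iteration would be split into `head` lanes and
 * `vl - head` lanes, handled with masked loads or shuffles. */
static size_t lanes_before_page_end(const void *p, size_t vl) {
    size_t to_boundary = PAGE_SIZE - ((uintptr_t)p & (PAGE_SIZE - 1));
    return to_boundary < vl ? to_boundary : vl;
}
```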

The loop that we would be emitting from this new pass includes some small page checks to ensure it won’t fault, but these are target-specific and so LoopVectorize is probably not the right place for this?

The question here is really figuring out what we can generalize, and what we can’t.

If the transform is essentially vectorization, leveraging LoopVectorize seems okay. It’s heavily target-driven anyway, based on the functionality available on each target. I don’t see a problem adding some target-specific bits to the runtime check, if that’s what it takes.

If you can’t treat it as vectorization, adding a separate pass might be okay. But we should start with the assumption that other targets can take advantage of this code somehow.

I guess if you’re not sure what specific bits other targets can take advantage of, you could start with a target-specific pass, with the expectation it will be made target-independent by someone else who needs similar functionality on another target.

I agree LoopIdiomRecognize is a bad place for the proposed transform, but not really for the reasons you’re raising. The issue is that LoopIdiomRecognize is an early “canonicalizing” pass, but this transform should be done in a late “lowering” pass, around the same time we do vectorization. The fact that target-specific checks are required isn’t really that relevant; “lowering” passes frequently leverage TargetTransformInfo to figure out what’s available/profitable on a target.

I second @nikolaypanchenko and @efriedma-quic that the “correct” long-term solution would be to support the mentioned pattern in the vectorizer. I believe there shouldn’t be any technically unsolvable problems with emitting runtime boundary checks (even target-specific ones) as part of vectorization.
That said, it may still make sense to add a separate “pattern matching” pass as a short-term solution, because the “correct” solution will take significant time to implement. On the other hand, if we align our efforts and work together we can finish the implementation much faster. @kmclaughlin-arm, do you or anybody else have the interest/opportunity to join the “[RFC] Vectorization of loops with Data Dependent Exits” initiative?

Thank you @efriedma-quic & @ebrevnov for your feedback on my question, I’m glad there would be some support for the idea of creating a separate pass for this in the short term until the vectoriser is able to handle this type of loop pattern.

After some conversations based on your replies, we believe there is a way that I can change the pass we’ve worked on downstream to make it more generic, so that other targets can also benefit from it more easily. I will try these before sharing any patches.

Thanks,
Kerry

It indeed looks like a vectorisation issue, but starting with an AArch64-specific pass and generalising it later may not be such a bad idea: some heuristics are involved (such as small trip counts), and there are also differences between SVE and SVE2 hardware. Getting more experience with this in a separate pass first might be easier.

After some conversations based on your replies, we believe there is a way that I can change the pass we’ve worked on downstream to make it more generic, so that other targets can also benefit from it more easily. I will try these before sharing any patches.

If you already have a prototype ready, could you share it before making it more target-independent? Maybe it helps with discussing this further.

Hi @sjoerdmeijer,

The prototype isn’t quite ready; it will need a bit of reworking to combine everything we’ve tried into one pass. I’m happy to share it once I have a patch ready, though, if that will help!

Sure, sounds good.

And to support the idea of target-specific (SVE2) loop idiom recognition a little bit more, I just thought of some more examples:

Not sure how generic this all is, and a target-specific loop idiom pass sounds like a good playground for this.

Hi all,

I have shared a work in progress patch for reference to help with discussing this:
D158291 [PoC][WIP] Add an AArch64 specific pass for loop idiom recognition.

The patch adds an intrinsic (@llvm.experimental.cttz.elts) which counts the trailing zero elements in a vector and has a generic lowering. The pass introduced is still called AArch64LoopIdiomRecognize, as that is how I began implementing it, but there is nothing that specifically prevents other targets from using it. The checks to ensure memory accesses don’t cross page boundaries will, however, require knowing the page size for the particular target.
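To illustrate the intended semantics of the intrinsic, here is a scalar model (illustrative only; the real intrinsic operates on a vector mask):

```c
#include <stdbool.h>
#include <stddef.h>

/* Scalar model of @llvm.experimental.cttz.elts: counts the trailing
 * (lowest-indexed) false elements of a predicate, i.e. the index of
 * the first true element, or n if no element is set. */
static size_t cttz_elts_model(const bool *mask, size_t n) {
    size_t i = 0;
    while (i < n && !mask[i])
        ++i;
    return i;
}
```

Combined with a vector compare producing the mismatch predicate, this gives exactly the first-mismatch index the loop above computes.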

The way that’s implemented seems extremely targeted towards the exact loop in question… but I guess that’s fine as a starting point.

cttz.elts seems useful. The generic lowering is probably not what you want on most targets (on x86 the most efficient lowering is probably bitcast+cttz), but the implementation seems like a reasonable starting point.
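The bitcast+cttz idea can be modelled in scalar code: pack the <N x i1> mask into an integer and count trailing zero bits, saturating at the element count when no bit is set (illustrative sketch only, assuming n <= 64):

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of lowering cttz.elts as bitcast + cttz: build an integer
 * with bit i set iff mask element i is set, then count trailing zero
 * bits, returning n when the mask is all-false. Assumes n <= 64. */
static unsigned cttz_elts_via_bits(const bool *mask, unsigned n) {
    uint64_t bits = 0;
    for (unsigned i = 0; i < n; ++i)
        if (mask[i])
            bits |= (uint64_t)1 << i;
    if (bits == 0)
        return n;
    unsigned c = 0;
    while (!(bits & 1)) {
        bits >>= 1;
        ++c;
    }
    return c;
}
```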