Is there any LLVM NEON intrinsic that maps to the vmla.f32 instruction?

Hi all,

Everything is in the title: I would like to force generation of the vmla.f32 instruction for scalar operations on Cortex-A9, so is there an LLVM NEON intrinsic available for that?

Thanks for your answers

Best Regards

Seb

Hi Sebastien,

LLVM doesn't use intrinsics when there is a clear way of representing the
same thing in standard IR. In the case of VMLA, it is generated from a
pattern:

%mul = fmul <N x type> %a, %b
%sum = fadd <N x type> %mul, %c

So, if you generate FAdd(FMul(a, b), c), you'll probably get a VMLA.
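
For the scalar f32 case you're asking about, a minimal function that should
match that pattern looks like this (function and value names are illustrative):

define float @mla(float %a, float %b, float %acc) {
entry:
  %mul = fmul float %a, %b
  %sum = fadd float %mul, %acc
  ret float %sum
}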

It's not common, but also not impossible, that the two instructions will be
reordered or even removed. So you need to make sure the intermediate result
is not used elsewhere (or you'll probably get VMUL/VADD), that the final
result is used (or the whole sequence will be removed), and keep the body of
the function/basic block small (to avoid reordering).
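
For example, a sketch of the kind of thing that blocks the fusion, because the
intermediate result has an extra use (names illustrative):

define float @mla_blocked(float %a, float %b, float %acc, float* %p) {
entry:
  %mul = fmul float %a, %b
  store float %mul, float* %p   ; extra use of the intermediate result
  %sum = fadd float %mul, %acc  ; likely stays as VMUL followed by VADD
  ret float %sum
}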

cheers,
--renato

Hi Renato,

Thanks for the answer, it confirms what I was suspecting. My problem is that this behavior is controlled by vmlx-forwarding on Cortex-A9, and despite asking on this list, I couldn't get a clear understanding of what this option is meant for.

So here are my new questions:

Why is vmlx-forwarding enabled by default for Cortex-A9? Is it to guarantee correctness, or for performance purposes? I've done some experiments, and DISABLING vmlx-forwarding for Cortex-A9 leads to generation of more vmla/vmls .f32 instructions and significantly improves some benchmarks. I've not run into a case where it significantly degrades performance or gives incorrect results.

Thus my goal is to have my front-end generate LLVM NEON intrinsics that map to vmla/vmls .f32 when I think it is appropriate, rather than relying on disabling/enabling vmlx-forwarding.

Best Regards

Seb

Why is vmlx-forwarding enabled by default for Cortex-A9? Is it to
guarantee correctness, or for performance purposes? I've done some
experiments, and DISABLING vmlx-forwarding for Cortex-A9 leads to generation
of more vmla/vmls .f32 instructions and significantly improves some
benchmarks. I've not run into a case where it significantly degrades
performance or gives incorrect results.

I believe this is what you're looking for:

http://article.gmane.org/gmane.comp.compilers.llvm.cvs/90709

Performance only, but if you're seeing regressions, I'm interested to know
which benchmarks and by how much they're regressing/improving.

****

Thus my goal is to have my front-end generate LLVM NEON intrinsics that
map to vmla/vmls .f32 when I think it is appropriate, rather than relying
on disabling/enabling vmlx-forwarding.

In that case, you must disable the pass when you call the back-end.
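
For example, if you drive the back-end through llc, you can switch the
subtarget feature off on the command line (file names illustrative):

llc -march=arm -mcpu=cortex-a9 -mattr=-vmlx-forwarding input.ll -o output.s

Programmatically, the equivalent is to include "-vmlx-forwarding" in the
feature string when you create the target machine.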

cheers,
--renato

Hi Renato,

Indeed, the problem is with generation of vmla.f64. The affected benchmark is MILC from the SPEC 2006 suite, and disabling vmlx-forwarding gives a 10% speed-up on the complete benchmark execution! So it is worth a try. Now going back to vmla generation through LLVM intrinsic usage: I've looked at the .td file, and it seems to me that when there is a “pattern” to generate an instruction, no intrinsic is defined to generate it, correct?

Is it possible, for an instruction that is generated through a “pattern”, to also add an LLVM intrinsic? My goal here is not to rely on LLVM to generate VMLA, but rather to have my front-end generate a call to a VMLA intrinsic I would have defined, when it thinks it's appropriate to generate one.

Hope that’s clear.

Thanks for your answer

Seb

Indeed, the problem is with generation of vmla.f64. The affected benchmark is
MILC from the SPEC 2006 suite, and disabling vmlx-forwarding gives a 10%
speed-up on the complete benchmark execution! So it is worth a try.

Hi Sebastien,

Indeed, worth having a look. Including Bob Wilson (who introduced the code
in the first place, and is a connoisseur of NEON in LLVM) to see if he has
a better idea of the problem.

Now going back to vmla generation through LLVM intrinsic usage: I've looked
at the .td file, and it seems to me that when there is a “pattern” to
generate an instruction, no intrinsic is defined to generate it, correct?

Correct.
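
For illustration, the f64 case in ARMInstrVFP.td is selected by a pattern of
roughly this shape (a simplified sketch; the real pattern carries additional
predicates, e.g. UseFPVMLx):

def : Pat<(fadd (fmul DPR:$a, DPR:$b), DPR:$acc),
          (VMLAD DPR:$acc, DPR:$a, DPR:$b)>;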

Is it possible, for an instruction that is generated through a “pattern”, to
also add an LLVM intrinsic? My goal here is not to rely on LLVM to generate
VMLA, but rather to have my front-end generate a call to a VMLA intrinsic I
would have defined, when it thinks it's appropriate to generate one.

No, and I'm not sure we should have one.

I understand why you want one, but that's too much back-end knowledge for a
front-end, and any pass that can transform a mul/add pair into an intrinsic
call can also transform it directly into VMLA or VMUL+VADD. In this case,
disabling the optimization is probably the best course of action.

In your compiler, if you prefer to leave it always disabled, you should set
that when creating the Target.

If we find that this optimization produces worse code in more cases than
not, then we should leave it disabled by default and let the user enable it
when necessary. I'll let Bob follow up on that, since I don't know what
benchmarks he used.

cheers,
--renato

Hi,

If we find that this optimization produces worse code in more cases than not, then we should leave it disabled by default and let the user enable it when necessary. I'll let Bob follow up on that, since I don't know what benchmarks he used.

Note that it may well be the case that the most “generally performant” default varies between different ARM cores as well as between various types of code. It would certainly be worthwhile to try benchmarking on different cores, as theoretical discussion of which code sequence is better is often at odds with empirically observed results.

Cheers,

Dave

Hi David,

Your point is correct. I guess the vmlx-forwarding attribute then needs to be revisited, since it is on by default for Cortex-A9. So far, on the codes I'm looking at (not only MILC), it is always a win to disable it for Cortex-A9. In any case, I'm speaking about scalar (not vector) FP (not integer) operations; that's why I would like a way to selectively enable it for that kind of operation only.

Best Regards

Seb

In theory, the backend should choose the best instructions for the selected target processor. VMLA is not always the best choice. Lang Hames did some measurements a while back to come up with the current behavior, but I don’t remember exactly what he found. CC’ing Lang.

Hi Bob, Seb, Renato,

My VMLA performance work was on Swift, rather than Cortex-A9.

Sebastien - is vmlx-forwarding really the only variable you changed between your tests?

As far as I can see, the VMLx-forwarding attribute only exists to gate the application of one DAG combine optimization: PerformVMULCombine in ARMISelLowering.cpp, which turns (A + B) * C into (A * C) + (B * C). This combine only ever triggers when vmlx-forwarding is on. I'd usually expect this to increase vmla formation, rather than decrease it, but under some circumstances (e.g. when the (A * C) and (B * C) expressions have existing uses) it might block their formation.
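
In IR terms, the combine rewrites roughly this (types and value names
illustrative):

; before: (A + B) * C
%sum = add <4 x i32> %A, %B
%res = mul <4 x i32> %sum, %C

; after: (A * C) + (B * C) - each mul/add pair can now fold into a vmla
%ac = mul <4 x i32> %A, %C
%bc = mul <4 x i32> %B, %C
%res = add <4 x i32> %ac, %bc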

If you want to narrow the conditions for when PerformVMULCombine applies, please feel free. Please don’t remove the dependence of this optimization on vmlx-forwarding though - we don’t want it applying to targets that don’t have that feature.

Regards,
Lang.

Hi all,

Sorry for my naïve question, but what is Swift?

Yes, vmlx-forwarding is the only variable I changed in my tests.

I did the experiment on another popular FP benchmark and observed a 14% speed-up just by disabling vmlx-forwarding.

Best Regards

Seb

Sorry for my naïve question, but what is Swift?

It’s a complicated area. There’s the standard Cortex-a9 design from ARM, Swift is the CPU that Apple that’s used in their latest products that is significantly modified from a basic ARM design and then there’s the next generation Cortex-a15 design from ARM. Each of them handles the same instruction set, but the implementation detaiis of each mean that different instruction sequences may perform better on each.

Cheers,

Dave

Understood.

Same architecture, different micro-architecture (implementation). Could it be that vmlx-forwarding makes sense for Swift and not for the ARM Cortex-A9 implementation? It is enabled by default when -mcpu=cortex-a9 is used, but the tests I have made show significant improvements when it is disabled for Cortex-A9 (STEricsson Nova platform).

Best Regards

Seb

Hi Sebastien,

The optimization does make sense for Cortex-A9. I remember reviewing the
patch myself, and the A9 documentation clearly states the delays involved
between VMLAs and that this was a solution.

However, due to micro-architecture differences (as David explained), it may
interfere with other non-Swift steps (or the lack of Swift steps) and
produce worse code. It's not uncommon to see "if (isSwift())" around the
code generation or optimization passes.

I haven't done any benchmarking on that particular issue, but if you can
show that the performance regressions occur on more than one Cortex-A9 core
(ST, TI), then I'd be inclined to suggest enabling VMLx-forwarding by
default only on Swift.

cheers,
--renato

I did the initial work on vmla formation. The default settings for Cortex-A8 / A9 are due to micro-architecture differences (I believe the A8 TRM talks about vmla hazards) and extensive testing. That said, given the limitations of the current pre-RA scheduling pass, it's likely the use of vmla can cause regressions.

I'm not opposed to changing the setting for A9. However, it's not a good idea to base the decision on one benchmark. I'd like to see, at a minimum, performance data for the entire LLVM test suite.

Evan

I'm not opposed to changing the setting for A9.

At least until we identify what the problem is and how to fix it, in case
it's another pass messing up the patterns.

However, it's not a good idea to base the decision on one benchmark. I'd
like to see, at a minimum, performance data for the entire LLVM test suite.

Absolutely.

--renato

If this helps with your decision: there are at least two benchmarks for which disabling vmlx-forwarding makes a significant difference.
If I get lucky, I may be able to run on a Panda board by next week and have more info to share.
Best Regards
Seb

If this helps with your decision: there are at least two benchmarks for
which disabling vmlx-forwarding makes a significant difference.

I think Evan's point was that we should base this decision on visible and
comprehensible benchmarks, such as the test-suite.

If I get lucky, I may be able to run on a Panda board by next week and have
more info to share.

That'd be great, thanks!

--renato

Hi Sebastien,

How many extra vmlas did you see in 433.milc due to disabling -vmlx-forwarding?
As I mentioned earlier, I saw only two additional integer vmlx instructions when I tested.

Could you send me your 433.milc compile setup (OS, flags, compiler version, etc.)? I'd like to try to reproduce your results.

Cheers,
Lang.

Hi Lang,

I'm speaking about 64-bit FP vmla. Find attached to this e-mail a .ll file that exhibits the problem encountered in MILC.
I've built LLVM (trunk & 3.2) on an x86-64 Ubuntu 10.04 LTS system.
Try
llc -march=arm -mcpu=cortex-a9 vmlx_ex.ll -o vmlx_ex.s
and
llc -march=arm -mcpu=cortex-a9 -mattr=-vmlx-forwarding vmlx_ex.ll -o vmlx_ex.s

You should see the difference, and trust me, it makes a significant difference in performance - at least on my platform - on MILC and other FP-intensive code.

Best Regards
Seb

vmlx_ex.ll (12.1 KB)