NEON vector instructions and the fast math IR flags

Hi,

I was recently looking into the translation of LLVM-IR vector instructions to ARM NEON assembly. Specifically, when this is legal to do and when we need to be careful.

I attached a very simple test case:

define <4 x float> @fooP(<4 x float> %A, <4 x float> %B) {
  %C = fmul <4 x float> %A, %B
  ret <4 x float> %C
}

If fooP is compiled with “llc -march=arm -mattr=+vfp3,+neon”, LLVM happily uses ARM NEON instructions to implement the vector multiply. This is obviously the fastest code we can generate, but on the other hand we lose precision compared to non-NEON code (NEON flushes denormals to zero).

As LLVM now has support for IR-level fast-math flags [1], I am wondering whether it would make sense to only create NEON instructions if the relevant fast-math flags are set at the IR level.

The reason behind my question is that at the moment the only way to get IEEE 754 floating point operations on ARM is to fully disable NEON. However, NEON can be safely used for integer computations as well as for LLVM-IR instructions with the appropriate fast math flags. The attached test case contains an example of a floating point operation that requires IEEE 754 compliance, a floating point operation that does not require IEEE 754 as well as an integer computation. It is a perfect mixed use case, where we really do not want to globally disable NEON.
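
To illustrate the mix, here is a rough sketch along those lines (only a sketch, not the actual attachment; the function names are made up): @strict_fp must stay IEEE 754 compliant, @relaxed_fp carries fast-math flags, and @int_mul is pure integer work.

; Sketch only: illustrative stand-in for the attached test case.
define <4 x float> @strict_fp(<4 x float> %A, <4 x float> %B) {
  ; No fast-math flags: denormals must not be flushed to zero.
  %C = fmul <4 x float> %A, %B
  ret <4 x float> %C
}

define <4 x float> @relaxed_fp(<4 x float> %A, <4 x float> %B) {
  ; 'fast' allows relaxed semantics, so NEON is fine here.
  %C = fmul fast <4 x float> %A, %B
  ret <4 x float> %C
}

define <4 x i32> @int_mul(<4 x i32> %A, <4 x i32> %B) {
  ; Integer arithmetic is always safe on NEON.
  %C = mul <4 x i32> %A, %B
  ret <4 x i32> %C
}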

I understand that some users do not require IEEE 754-compliant floating point behavior (clang on Darwin?), which means they would probably not need this change. However, it should also not hurt them performance-wise, as such users would probably set the relevant global fast-math flags to reduce the precision requirements, such that NEON instructions would be chosen anyway.

I am very interested in opinions on the general topic as well as how to actually implement this in the ARM target.

All the best,

Tobias

[1] http://llvm.org/docs/LangRef.html#fast-math-flags

neon-floating-point-precision.ll (1.16 KB)

Darwin uses NEON for floating point, but does *not* (and should not) globally enable fast math flags. Use of NEON for FP needs to remain achievable without globally setting the fast math flags. Fast math may reasonably imply NEON, but the opposite direction is not accurate.

That said, I don't think anyone would object to making VFP codegen available under non-Darwin triples. It's just a matter of making it happen.

-Owen

Hi Owen,

ARMSubtarget::resetSubtargetFeatures(StringRef CPU, StringRef FS) has a
check to see if the target is Darwin or if UnsafeMath is enabled to set
UseNEONForSinglePrecisionFP, but only for A5 and A8, where this was a
problem. Maybe I was too conservative in my fix.

Tobi,

The march=arm option would default to ARMv4, while mattr=+neon would force
NEON, but I'm not sure it would default to A8, which would be a weird
combination of ARM7TDMI+NEON.

There are two things to know at this point:

1. When the execution gets to resetSubtargetFeatures, what CPU it has
detected for your arguments. You may also have to look at ARM.td to see if
the detected CPU has the feature "FeatureNEONForFP" in its description.

2. If the CPU is correct (Cortex-A*), and it's neither A5 nor A8, do we
still want to generate single-precision float on NEON when non-Darwin and
safe math? I don't think so. Possibly, that condition should be extended to
ignore the CPU you're using and *only* emit NEON SP-FP when either Darwin
or UnsafeMath is on.

cheers,
--renato

Hi Owen, hi Renato,

thanks for your replies.

Darwin uses NEON for floating point, but does *not* (and should not)
globally enable fast math flags. Use of NEON for FP needs to remain
achievable without globally setting the fast math flags. Fast math may
reasonably imply NEON, but the opposite direction is not accurate.

Good point. Fast math is probably too strong a requirement. I need to look into the ways in which NEON does not comply with IEEE 754. For now the only difference I see is that it may flush denormals to zero.

That said, I don't think anyone would object to making VFP codegen
available under non-Darwin triples. It's just a matter of making it happen.

I see.

Tobi,

The march=arm option would default to ARMv4, while mattr=+neon would force
NEON, but I'm not sure it would default to A8, which would be a weird
combination of ARM7TDMI+NEON.

There are two things to know at this point:

1. When the execution gets to resetSubtargetFeatures, what CPU it has
detected for your arguments. You may also have to look at ARM.td to see if
the detected CPU has the feature "FeatureNEONForFP" in its description.

2. If the CPU is correct (Cortex-A*), and it's neither A5 nor A8, do we
still want to generate single-precision float on NEON when non-Darwin and
safe math? I don't think so. Possibly, that condition should be extended to
ignore the CPU you're using and *only* emit NEON SP-FP when either Darwin
or UnsafeMath is on.

Renato:

When to set which subtarget feature is a policy decision that I honestly don't have an opinion on for clang. The best is probably to mirror the gcc behavior on Linux targets. My current goal is to understand the implications of certain features and to make sure a tool using the LLVM back-ends can actually implement any policy it likes.

I just looked again at the +neonfp flag. Compiling with and without the +neonfp flag seems to affect only scalar types in the attached test case. If, e.g., the LLVM vectorizer introduces vector instructions at the LLVM-IR level, floating point vectors still yield NEON assembly even when compiled with "-mattr=+neon,-neonfp". Is this expected?
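
To make the scalar/vector distinction concrete, this is roughly the shape of IR I am comparing (a minimal sketch, not the attached file):

; Sketch only. What I observe: the scalar multiply is the one affected by
; +neonfp/-neonfp, while the vector multiply is lowered to NEON either way.
define float @scalar_mul(float %a, float %b) {
  %c = fmul float %a, %b
  ret float %c
}

define <4 x float> @vector_mul(<4 x float> %A, <4 x float> %B) {
  %C = fmul <4 x float> %A, %B
  ret <4 x float> %C
}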

Cheers,
Tobias

neon-floating-point-precision-2.ll (1.3 KB)

Darwin uses NEON for floating point, but does *not* (and should not)
globally enable fast math flags. Use of NEON for FP needs to remain
achievable without globally setting the fast math flags. Fast math may
reasonably imply NEON, but the opposite direction is not accurate.

Good point. Fast math is probably too strong a requirement. I need to
look into the ways in which NEON does not comply with IEEE 754. For now
the only difference I see is that it may flush denormals to zero.

Yes, I've gone on record before as saying that fast-math enables far too
many different things for it to be "the canonical switch" for just about
any transformation. Rather, it should be what I think it is in gcc, which
is effectively a short-cut for invoking several individual math-option
flags.

[snip]

I just looked again at the +neonfp flag. Compiling with and without the
+neonfp flag seems to affect only scalar types in the attached test case.
If, e.g., the LLVM vectorizer introduces vector instructions at the
LLVM-IR level, floating point vectors still yield NEON assembly even when
compiled with "-mattr=+neon,-neonfp". Is this expected?

I'm virtually certain that's a problem since there are codebases out there
which use that to effectively specify "integer neon but use VFP for floats".
If the vectorizer is producing neon floating point from scalar code
in the presence of that flag then it's a (minor) issue waiting to happen.

Cheers,
Dave

When to set which subtarget feature is a policy decision that I honestly
don't have an opinion on for clang. The best is probably to mirror the gcc
behavior on Linux targets.

Not really, since GCC has no special behaviour for Darwin, AFAIK.

My change will only generate SP-FP on NEON for A5 and A8 and only if it's
Darwin or UnsafeMath is on, which seems not to be the case for you, so I
don't think the problem is in that area. It's possible that some passes are
not consulting that flag when generating NEON SP-FP. If that's true, this
is definitely a bug.

When I changed that, for VMUL.f32, it worked (ie. generated VFP
instruction), but it might not be taking the same path your code is.

I just looked again at the +neonfp flag. Compiling with and without the
+neonfp flag seems to affect only scalar types in the attached test case.
If, e.g., the LLVM vectorizer introduces vector instructions at the
LLVM-IR level, floating point vectors still yield NEON assembly even when
compiled with "-mattr=+neon,-neonfp". Is this expected?

No, vectorizers should honour FP contracts. This is probably a bug, too.

Please file both bugs on Bugzilla, attaching the relevant IR and a way to
reproduce to each one, and I'll have a look at them.

cheers,
--renato

>I just looked again at the +neonfp flag. Compiling with and without the
>+neonfp flag seems to affect only scalar types in the attached test case.
>If, e.g., the LLVM vectorizer introduces vector instructions at the
>LLVM-IR level, floating point vectors still yield NEON assembly even when
>compiled with "-mattr=+neon,-neonfp". Is this expected?

I'm virtually certain that's a problem since there are codebases out there
which use that to effectively specify "integer neon but use VFP for floats".
If the vectorizer is producing neon floating point from scalar code
in the presence of that flag then it's a (minor) issue waiting to happen.

That flag doesn't really do what it is described as doing. It specifies that NEON instructions should be used for *scalar* arithmetic. It tries to avoid using VFP instructions and will promote scalar ops to vector ops. This is to try to gain performance on cores where switching between the VFP and NEON pipelines is punished.

Also, -neonfp does nothing. It is not a ternary flag (do nothing, force-on, force-off) - it is either active or inactive. +neonfp forces some transformation, -neonfp disables forcing that transformation. -neonfp doesn't imply any transformations itself.

Also, -neonfp does nothing. It is not a ternary flag (do nothing, force-on,
force-off) - it is either active or inactive. +neonfp forces some
transformation, -neonfp disables forcing that transformation. -neonfp
doesn't imply any transformations itself.

Ah, my misremembering.

When to set which subtarget feature is a policy decision that I honestly don't have an opinion on for clang. The best is probably to mirror the gcc behavior on Linux targets.

Not really, since GCC has no special behaviour for Darwin, AFAIK.

My change will only generate SP-FP on NEON for A5 and A8 and only if it's Darwin or UnsafeMath is on, which seems not to be the case for you, so I don't think the problem is in that area. It's possible that some passes are not consulting that flag when generating NEON SP-FP. If that's true, this is definitely a bug.

When I changed that, for VMUL.f32, it worked (ie. generated VFP instruction), but it might not be taking the same path your code is.

I just looked again at the +neonfp flag. Compiling with and without the +neonfp flag seems to affect only scalar types in the attached test case. If, e.g., the LLVM vectorizer introduces vector instructions at the LLVM-IR level, floating point vectors still yield NEON assembly even when compiled with "-mattr=+neon,-neonfp". Is this expected?

No, vectorizers should honour FP contracts. This is probably a bug, too.

Please file both bugs on Bugzilla, attaching the relevant IR and a way to reproduce to each one, and I'll have a look at them.

It is not the vectorizer that is the issue, it is the ARM backend that currently translates vectorized floating point IR to NEON instructions (it should scalarize it if desired to do so - i.e. if people care about denormals). To fix this issue one would have to fix the backend: i.e. not declare v4f32 et al. as legal (under a flag). As to making this predicated on fast math flags on operations (something like no-denormals - I don’t think we have that in the IR yet - we only have no NaNs, no infs, no signed zeros, etc.) I believe this would be a lot harder, because I suspect you would have to custom lower all the operations.

It is not the vectorizer that is the issue, it is the ARM backend that
currently translates vectorized floating point IR to NEON instructions (it
should scalarize it if desired to do so - i.e. if people care about
denormals).

Hi Arnold,

Can't the vectorizer simply not generate the v4f32 vectors in the first
place, with that flag disabled?

To fix this issue one would have to fix the backend: i.e. not declare
v4f32 et al. as legal (under a flag). As to making this predicated on fast
math flags on operations (something like no-denormals - I don’t think we
have that in the IR yet - we only have no NaNs, no infs, no signed zeros,
etc.) I believe this would be a lot harder, because I suspect you would
have to custom lower all the operations.

This is one way of solving it, and maybe we will have to implement it
anyway (for hand-coded IR or external front-ends).

However, that still doesn't solve the original issue. When the vectorizer
analyses the cost of the new loop, it takes into account that you now have
four operations (v4f32) instead of one, which is clearly profitable, but if
we know that the back-end will scalarize, then it's no longer profitable,
and can quite possibly hurt performance.

I think we need both solutions.

cheers,
--renato

It is not the vectorizer that is the issue, it is the ARM backend that currently translates vectorized floating point IR to NEON instructions (it should scalarize it if desired to do so - i.e. if people care about denormals).

Hi Arnold,

Can't the vectorizer simply not generate the v4f32 vectors in the first place, with that flag disabled?

No, vectorized floating point IR and non-vectorized floating point IR are semantically the same with respect to the end result - it is the backend that has to make sure that this is the case (scalarize if desired). The vectorizer is not the only one that could produce vectorized IR.

The vectorizer has two parts: legality and cost. It is legal to generate LLVM IR with vectors because they are semantically the same. The cost model should inform the vectorizer that it is a bad idea on ARM (after the backend has been fixed) because it will be scalarized (dependent on flags).

(I took the liberty of calling vectorized IR and scalar IR semantically the same; of course, this only applies if you look at the execution, not the individual instructions.)
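
To spell out the equivalence and what "scalarize" means here, a hand-written sketch (the function names are made up):

; Sketch only: the two functions below are semantically equivalent with
; respect to the end result. A backend that cannot give the vector form the
; required semantics has to emit something like the unrolled form.
define <4 x float> @vector_form(<4 x float> %A, <4 x float> %B) {
  %C = fmul <4 x float> %A, %B
  ret <4 x float> %C
}

define <4 x float> @scalarized_form(<4 x float> %A, <4 x float> %B) {
  %a0 = extractelement <4 x float> %A, i32 0
  %b0 = extractelement <4 x float> %B, i32 0
  %c0 = fmul float %a0, %b0
  %r0 = insertelement <4 x float> undef, float %c0, i32 0
  %a1 = extractelement <4 x float> %A, i32 1
  %b1 = extractelement <4 x float> %B, i32 1
  %c1 = fmul float %a1, %b1
  %r1 = insertelement <4 x float> %r0, float %c1, i32 1
  %a2 = extractelement <4 x float> %A, i32 2
  %b2 = extractelement <4 x float> %B, i32 2
  %c2 = fmul float %a2, %b2
  %r2 = insertelement <4 x float> %r1, float %c2, i32 2
  %a3 = extractelement <4 x float> %A, i32 3
  %b3 = extractelement <4 x float> %B, i32 3
  %c3 = fmul float %a3, %b3
  %r3 = insertelement <4 x float> %r2, float %c3, i32 3
  ret <4 x float> %r3
}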

We don’t want to encode backend knowledge into the vectorizer (i.e. don’t vectorize type X because the backend does not support it). The only way to get this result is indirectly via the cost model but the backend must still support vectorized IR (it is part of the language) via scalarization.

(You can of course assign UMAX cost for all floating point vector types in the cost model for ARM and get the desired result - this won’t solve the problem if somebody else writes the vectorized LLVM IR though)

We don’t want to encode backend knowledge into the vectorizer (i.e. don’t
vectorize type X because the backend does not support it).

We already do, via the cost table. This case is no different. It might not
be the best choice, but it is how the cost table has been built over the
last few months.

The only way to get this result is indirectly via the cost model but the
backend must still support vectorized IR (it is part of the language) via
scalarization.

Absolutely! There are two problems to solve: increase the cost for SPFP
when UseNEONForSinglePrecisionFP is false, so that vectorizers don't
generate such code, and legalize correctly in the backend, for vector code
that does not respect that flag.

(You can of course assign UMAX cost for all floating point vector types in
the cost model for ARM and get the desired result - this won’t solve the
problem if somebody else writes the vectorized LLVM IR though)

I wouldn't use UMAX, since the idea is not to forbid, but to tell how
expensive it is. But it would be a big number, yes. :wink:

cheers,
--renato

Renato, I think we agree.

We don’t want to encode backend knowledge into the vectorizer (i.e. don’t vectorize type X because the backend does not support it).

We already do, via the cost table. This case is no different. It might not be the best choice, but it is how the cost table has been built over the last few months.

Using the cost model to communicate that the backend will generate wrong code is an abuse (in my opinion, this is not what the cost model is for). This is what I meant by encoding backend knowledge. Of course, we use the cost model to tell us how expensive an operation might be, but we should not use it as an indicator of how wrong it will be ;). (Which is what we would do if we gave a v4f32 operation a high cost because the backend generates instructions that flush denormals to zero.)

What I wanted to say is that even if you give v4f32 a high cost you still have to solve the real problem in the ARM backend.

The only way to get this result is indirectly via the cost model but the backend must still support vectorized IR (it is part of the language) via scalarization.

Absolutely! There are two problems to solve: increase the cost for SPFP when UseNEONForSinglePrecisionFP is false, so that vectorizers don't generate such code, and legalize correctly in the backend, for vector code that does not respect that flag.

(You can of course assign UMAX cost for all floating point vector types in the cost model for ARM and get the desired result - this won’t solve the problem if somebody else writes the vectorized LLVM IR though)

I wouldn't use UMAX, since the idea is not to forbid, but to tell how expensive it is. But it would be a big number, yes. :wink:

I was referring to the case where you are abusing the cost model to forbid vectorized v4f32 IR (which I thought you were proposing).

What I am suggesting is that (if you care about denormals):

* the ARM backend has to be fixed to scalarize floating point vector operations (behind a flag)
* the ARM target transform (cost) model has to correctly reflect that

What one could also do (but I don’t think it is a good idea) is to just give floating point vector operations a max cost. You might run into unforeseen problems, including that other clients are generating vectorized LLVM IR.

(This makes me wonder whether we clamp the cost computation at TYPE_MAX.) :slight_smile:

Yup. What I had in mind, too. This is why I asked Tobi to create two bugs,
and we would fix them accordingly. :wink:

cheers,
--renato

Thanks for that explanation. I think it illustrates the situation well.

For programs that have mixed precision requirements for floating point operations we probably need to do this according to the fast math flags.
Until we get there, a good first step would probably be to provide a global option similar to -enable-no-infs-fp-math that specifies if denormals should be allowed or not. This would allow the user to specify the precision requirements without the need to alter the feature flags of a specific piece of hardware.

Tobi

Done.

1) Fix the ARM target to only introduce NEON if valid

llvm.org/PR16274

2) Fix the vectorizer cost function

llvm.org/PR16275

Thanks for your insights.

Cheers,
Tobi

For programs that have mixed precision requirements for floating point
operations we probably need to do this according to the fast math flags.
Until we get there, a good first step would probably be to provide a
global option similar to -enable-no-infs-fp-math that specifies if
denormals should be allowed or not. This would allow the user to specify
the precision requirements without the need to alter the feature flags
of a specific piece of hardware.

Hi, sorry for coming in late on this. Firstly, I think what you mean is "if denormals should be required to be preserved or not". (Apart from anything else, it's possible to move data between standard CPU, SIMD CPU and GPU, so denormals that occur in one part of the system can show up in other parts even if some of those parts flush them to zero.) Clearly this implies that you can't use NEON instructions since they are specified not to preserve denormals.

Secondly, I think it would be helpful to at least try to map out which "optimizations" are going to be covered by a per-instruction IR flag, just in order to get a clearer idea of whether the global stuff is the right model. (Amongst other things, I'm interested in DSLs where the likelihood of knowing something about the "ideal requirements" for operations that will be transformed into LLVM IR is higher than for manually written C/Fortran.)

Cheers,
Dave


> For programs that have mixed precision requirements for floating point
> operations we probably need to do this according to the fast math flags.
> Until we get there, a good first step would probably be to provide a
> global option similar to -enable-no-infs-fp-math that specifies if
> denormals should be allowed or not. This would allow the user to specify
> the precision requirements without the need to alter the feature
> flags of a specific piece of hardware.

Hi, sorry for coming in late on this. Firstly, I think what you mean is "if denormals should be required to be preserved or not". (Apart from anything else, it's possible to move data between standard CPU, SIMD CPU and GPU, so denormals that occur in one part of the system can show up in other parts even if some of those parts flush them to zero.) Clearly this implies that you can't use NEON instructions since they are specified not to preserve denormals.

True.

Secondly, I think it would be helpful to at least try to map out which "optimizations" are going to be covered by a per-instruction IR flag, just in order to get a clearer idea of whether the global stuff is the right model. (Amongst other things, I'm interested in DSLs where the likelihood of knowing something about the "ideal requirements" for operations that will be transformed into LLVM IR is higher than for manually written C/Fortran.)

Sorry, I did not get this sentence. Would you mind rephrasing it?

At the moment I am mainly concerned with the code generation aspect. Optimizations on LLVM-IR can already reason per instruction about
several floating point precision flags. Doing this during code generation is apparently difficult, as we would have to decide per
instruction if we can legally lower it to NEON or not.

Tobi

Secondly, I think it would be helpful to at least try to map out which "optimizations" are going to be covered by a per-instruction IR flag, just in order to get a clearer idea of whether the global stuff is the right model. (Amongst other things, I'm interested in DSLs where the likelihood of knowing something about the "ideal requirements" for operations that will be transformed into LLVM IR is higher than for manually written C/Fortran.)

Sorry, I did not get this sentence. Would you mind rephrasing it?

At the moment I am mainly concerned with the code generation aspect.
Optimizations on LLVM-IR can already reason per instruction about
several floating point precision flags. Doing this during code
generation is apparently difficult, as we would have to decide per
instruction if we can legally lower it to NEON or not.

I was being a bit unclear. I was thinking that in general it's often the case that one doesn't have a single opinion about a given aspect of floating point, but that it matters in some cases (e.g., denormals are very useful in code which will be doing a division by a subtraction, whereas in many areas denormals are not a concern). So I was thinking it would be helpful to figure out how this would work on a per-IR-instruction basis first, even if, for manpower reasons, a simple global flag was going to be implemented first. However, having talked to someone who does some numerical coding, it seems quite rare to have code which is actually understood and annotated to that level of detail, so I think as a practical matter this isn't worth pursuing since it would be unlikely to be used.
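
For reference, the "division by a subtraction" case in IR form (a hand-written sketch):

; Sketch only. With IEEE 754 gradual underflow, %d is non-zero whenever
; %x != %y, so %q stays finite for unequal inputs; if denormals are flushed
; to zero, %d can become 0.0 for nearby but unequal inputs and %q becomes
; +/-infinity.
define float @recip_diff(float %x, float %y) {
  %d = fsub float %x, %y
  %q = fdiv float 1.000000e+00, %d
  ret float %q
}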

Cheers,
Dave
