> If I have my FMA intrinsics story straight now (thanks for the explanation, Hal!), I think it raises another question about IR canonicalization (and may affect the proposed revision to IR FMF):
>
> define float @foo(float %a, float %b, float %c) {
>   %mul = fmul fast float %a, %b ; using 'fast' because there is no 'fma' flag
>   %add = fadd fast float %mul, %c
>   ret float %add
> }
>
> Should this be:
>
> define float @goo(float %a, float %b, float %c) {
>   %maybe.fma = call fast float @llvm.fmuladd.f32(float %a, float %b, float %c)
>   ret float %maybe.fma
> }
> declare float @llvm.fmuladd.f32(float %a, float %b, float %c)
From: "Hal J. Finkel via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Sanjay Patel" <spatel@rotateright.com>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Saturday, November 19, 2016 10:58:27 AM
Subject: Re: [llvm-dev] FMA canonicalization in IR
>
> If I have my FMA intrinsics story straight now (thanks for the
> explanation, Hal!), I think it raises another question about IR
> canonicalization (and may affect the proposed revision to IR FMF):
No, I think that we specifically don't want to canonicalize to fmuladd at the IR level at all. If the backend has the freedom to form FMAs as it sees fit, then we should delay the decision until whatever point the backend finds most appropriate. Some backends, for example, form FMAs using the MachineCombiner pass, which considers critical path, latency, throughput, etc. in order to find the best fusion opportunities. We only use fmuladd when required to restrict the backend to certain choices due to source-language semantics.
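[To make the distinction concrete, here is a rough sketch of the three IR forms, per my reading of the LangRef semantics, with %a/%b/%c as in the example above:

  ; Separate operations: never fused as written, but the 'fast' flags give
  ; the backend (e.g. the MachineCombiner) the freedom to form an FMA
  ; where it is profitable.
  %mul = fmul fast float %a, %b
  %add = fadd fast float %mul, %c

  ; fmuladd: the backend may emit either a fused FMA or a separate
  ; multiply and add -- this encodes source-language permission to fuse
  ; (e.g. FP_CONTRACT ON in C) without requiring it.
  %r1 = call float @llvm.fmuladd.f32(float %a, float %b, float %c)

  ; fma: always a single fused multiply-add with one rounding step.
  %r2 = call float @llvm.fma.f32(float %a, float %b, float %c)
]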
I'll also add that, in general, we canonicalize in order to enable other transformations (and reduce the number of input forms those transformations need to match in order to be effective). Forming @llvm.fmuladd at the IR level does not seem to further this goal. Did you have something in mind that this canonicalization would help?
From: "Sanjay Patel" <spatel@rotateright.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Saturday, November 19, 2016 10:40:27 PM
Subject: Re: [llvm-dev] FMA canonicalization in IR
> The potential advantage I was considering would be more accurate cost modeling in the vectorizer, inliner, etc. Like min/max, this is another case where the sum of the IR parts is greater than the actual cost.
This is indeed a problem, but it is a much larger problem than just FMAs (as you note). Our cost-modeling interfaces should be extended to handle instruction patterns -- I don't see any other way of solving this in general.
> Beyond that, it seems odd to me that we'd choose the longer IR expression of something that could be represented in a minimal form.
My fear is that, by forming the FMAs earlier than necessary, you'll just end up limiting opportunities for CSE, reassociation, etc. without any corresponding benefit.
> I know we make practical concessions in IR based on backend deficiencies, but in this case I think the fix would be easy - if we're in contract=fast mode, just split all of these intrinsics at DAG creation time and let the DAG or other passes behave exactly like they do today to fuse them back together again?
This is a good point; we could do this in fp-contract=fast mode.
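[A hypothetical sketch of that approach: under fp-contract=fast, DAG creation would expand

  %r = call fast float @llvm.fmuladd.f32(float %a, float %b, float %c)

into the equivalent of

  %mul = fmul fast float %a, %b
  %r   = fadd fast float %mul, %c

and the existing combines (DAGCombiner, MachineCombiner) would then be free to re-fuse the pair into an FMA where profitable, exactly as they handle plain fmul/fadd today.]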
From: "Sanjay Patel" <spatel@rotateright.com>
To: "Mehdi Amini" <mehdi.amini@apple.com>
Cc: "Hal Finkel" <hfinkel@anl.gov>, "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Sunday, November 20, 2016 9:01:36 AM
Subject: Re: [llvm-dev] FMA canonicalization in IR
> Hi Mehdi,
>
> I can't think of any (and I'm away from my dev machine, so I can't check). If you're concerned about inhibiting transforms by introducing intrinsics (as Hal also mentioned), I agree.
>
> However, I see fmuladd as a special case - we already use these intrinsics in contract=on mode, so we should already be required to handle these as "first class" ops in the cost model and other passes. If we're not, I think that would be a bug.
I don't think it is a matter of handling them as "first class" or otherwise. The intrinsics specifically represent a tradeoff: specific add/multiply pairs that we're permitted to fuse by source-language rules. Should I perform a CSE, reassociation, etc. that would require splitting apart a fmuladd intrinsic? I suspect not. Doing so would lose information. On the other hand, if we're free to fuse later as we see fit, then we probably should perform the CSE and then fuse, if possible, later. So the question is really: do we want to teach IR-level optimizations to split apart the fmuladd intrinsics when they'd otherwise block transformations? I suspect the answer is no, and I think that's the assumption embedded in the use of an intrinsic in the design (although I'm certainly open to being convinced otherwise).
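[A small sketch of the kind of conflict described here, where a second use of the product makes the split form more CSE-friendly; the names and the extra fdiv use are illustrative:

  ; Split form: %mul is a single CSE'd value shared by both uses.
  %mul = fmul fast float %a, %b
  %add = fadd fast float %mul, %c
  %div = fdiv float %mul, %d

  ; Fused form: the product is hidden inside the intrinsic, so CSE must
  ; either keep a duplicate multiply or split the fmuladd apart (losing
  ; the recorded permission to fuse).
  %r    = call float @llvm.fmuladd.f32(float %a, float %b, float %c)
  %mul2 = fmul float %a, %b
  %div2 = fdiv float %mul2, %d
]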
> This is indeed a problem, but it is a much larger problem than just FMAs (as you note). Our cost-modeling interfaces should be extended to handle instruction patterns -- I don't see any other way of solving this in general.
This proposal - cost model instruction patterns, not just instructions - keeps coming up in a number of contexts. We've seen a number of proposals recently to add intrinsics at various places in the pipeline to get around this limitation. Investing in infrastructure to solve this problem via the cost model seems like a generally useful path forward which would benefit far more than FMA.