Is such a transformation best done as a (target-specific) DAG combine?

A similar instcombine already exists for the X/C -> X*(1/C) case (see the CvtFDivConstToReciprocal function in InstCombineMulDivRem.cpp), but I don't believe the above can be done as an instcombine, as it creates a new instruction (in addition to replacing the original). Also, I only want to perform the transformation when there are multiple uses of 1/Y (as in my test case); otherwise the transformation replaces an fdiv with an fdiv+fmul pair, which I doubt would be profitable.

FWIW, I’m also pretty sure this combine requires -fast-math.
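For concreteness, the shape of the transform being discussed can be sketched numerically in Python (the function names and values here are illustrative, not from any patch):

```python
import math

def divide_each(xs, y):
    # Baseline: one fdiv per element.
    return [x / y for x in xs]

def divide_each_recip(xs, y):
    # After the transform: one fdiv to form 1/y, then one fmul per
    # element. This is only a likely win when 1/y has multiple uses;
    # with a single use it trades one fdiv for an fdiv+fmul pair.
    recip = 1.0 / y
    return [x * recip for x in xs]

a = divide_each([1.0, 2.0, 3.0], 7.0)
b = divide_each_recip([1.0, 2.0, 3.0], 7.0)
# The results agree only approximately, which is why fast-math is needed.
assert all(math.isclose(p, q, rel_tol=1e-12) for p, q in zip(a, b))
```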


Perhaps only partially relevant, but I have a fast-math-only target-specific DAG combine in the PowerPC backend which rewrites X/Y -> X*(1/Y), because we can use the fast reciprocal estimate plus Newton iteration for 1/Y instead of a full divide.
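The reciprocal-estimate-plus-refinement approach mentioned here can be sketched as follows; the starting estimate and iteration count are illustrative (a real hardware estimate instruction supplies a rough guess with a few correct bits, and a fixed small number of iterations recovers near-full precision):

```python
def refine_recip(y, estimate, iterations=3):
    # Newton-Raphson for f(r) = 1/r - y gives the update r' = r*(2 - y*r).
    # The relative error squares on each iteration, so a rough estimate
    # converges in a handful of steps without ever issuing a divide.
    r = estimate
    for _ in range(iterations):
        r = r * (2.0 - y * r)
    return r

# A 10%-accurate starting guess for 1/3 converges quickly:
r = refine_recip(3.0, 0.3)
assert abs(r - 1.0 / 3.0) < 1e-7
```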

This is a great transform to do, but you’re right that it’s only safe under fast-math. This is particularly interesting when the original divisor is a constant so you can materialize the reciprocal at compile-time. You’re right that in either case, this optimization should only kick in when there is more than one divide instruction that will be changed to a mul.

I don’t have a strong preference for instcombine vs. dagcombine, though I lean slightly towards the latter, since there we’ll have more target information available in case we want to apply a more complicated cost function for some targets.


I did a few transformations in Instruction *InstCombiner::visitFDiv() in an attempt to remove some divs.
I may have missed this case. If you need to implement this rule, it is better done in InstCombine than in a DAG combine.
Doing such a transform early exposes the redundancy of 1/Y, which has a positive impact on neighboring code,
while the DAG combine is a bit blind to that.

You should be very careful: the reciprocal is a very imprecise transformation. Sometimes you will see a big difference
with and without this transform.
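The imprecision comes from double rounding: 1/Y is rounded once, and the multiply rounds again. A small Python check (illustrative, not from the thread) shows the two forms already disagree for some small integer operands in IEEE double:

```python
# x / y rounds once; x * (1.0 / y) rounds 1.0/y and then the product,
# so the two can differ by an ulp even with exact double arithmetic.
diffs = [(x, y) for x in range(1, 50) for y in range(1, 50)
         if x / y != x * (1.0 / y)]
assert diffs  # the rewrite is not value-preserving in general
```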

I remember now why I didn’t implement this rule in InstCombine: it adds one instruction. So
this transform should be driven by a redundancy eliminator if you care about code size.

I believe the advice that Owen gave me was that if it’s simple enough, we should do it in both places. (Feel free to chime in if I’m misquoting you, Owen.)

Performing the transformation early as an InstCombine can expose additional opportunities to optimize code. The purpose of the redundant DAGCombine would be to catch the case where the optimization is exposed by another DAG combine. And as Jim also pointed out, the DAG combine has the benefit of target-specific knowledge.

I agree, this will require quite a bit of testing.

Yes, I believe I commented on that concern in my original email (perhaps not for this reason, however). Ben pointed out that GVN should be able to clean things up when there are multiple instances of 1/Y.

Thanks for the input, Jim. I’m somewhat torn because I have the same preference as you (i.e., a DAG combine), but Ben seems to have already proposed a good solution as an InstCombine with a little cleanup. Perhaps I’ll implement both and then do some analysis.

Yes, it is precise if Y is a constant and 1/Y is exact; in that case the transformation is performed even without "-fast-math".
What I meant is the case where 1/Y is not exact. I measured the results before: among all the fast-math transformations,
the reciprocal is the most annoying one because it is very imprecise. I argued at the time that the reciprocal should be
ranked as the most relaxed transform, and the numerics guru Steve seemed to agree with me.

I vaguely remember we were not quite happy with the arcp relaxation rank, because it is not very relaxed while the reciprocal is very imprecise. I guess we are better off using unsafe-math to guard such a transform?


It can be worthwhile to do this even in the case where there is only a single divide since 1/Y might be loop invariant, and could then be hoisted out later by LICM. You just need to be able to fold it back together when there is only a single use, and that use is not inside a more deeply nested loop.
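The loop-invariant case can be sketched as follows (hypothetical code, assuming the rewrite plus LICM): the divide executes once per outer iteration rather than once per element.

```python
import math

def scale_rows(matrix, ys):
    # Each row is divided by its own ys[i]. After the fdiv -> fmul
    # rewrite, 1.0/y is invariant with respect to the inner loop, so
    # computing it once per row mirrors what LICM would achieve.
    out = []
    for row, y in zip(matrix, ys):
        recip = 1.0 / y                       # hoisted: one divide per row
        out.append([x * recip for x in row])  # only fmuls in the hot loop
    return out

scaled = scale_rows([[1.0, 2.0], [3.0, 4.0]], [2.0, 4.0])
assert all(math.isclose(a, b)
           for ra, rb in zip(scaled, [[0.5, 1.0], [0.75, 1.0]])
           for a, b in zip(ra, rb))
```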


Ben’s patch does exactly this, so perhaps that is the right approach.

Just to be clear of what is being proposed (which I rather like):

1) Canonical form is to use the reciprocal when allowed (by the fast math
flags, whichever we decide are appropriate).
2) The backend folds a single-use reciprocal into a direct divide.

Did I get it right? If so, I think this is a really nice way to capture all
of the potential benefits of forming reciprocals without pessimizing code
where it isn't helpful.
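Point #2 could look something like the following sketch over a toy SSA-like IR (entirely hypothetical; the real fold would live in the DAG combiner and work on SDNodes, not tuples):

```python
def fold_single_use_recips(instrs):
    # instrs: list of (name, op, [operands]); operands are value names
    # (strings) or float constants. For brevity only `fmul x, r` is
    # matched; a full version would also handle the commuted form.
    uses = {}
    for _, _, ops in instrs:
        for o in ops:
            if isinstance(o, str):
                uses[o] = uses.get(o, 0) + 1
    defs = {name: (op, ops) for name, op, ops in instrs}
    folded = set()
    out = []
    for name, op, ops in instrs:
        if op == "fmul" and isinstance(ops[1], str):
            d = defs.get(ops[1])
            if d and d[0] == "fdiv" and d[1][0] == 1.0 and uses[ops[1]] == 1:
                # z = fmul x, r  with  r = fdiv 1.0, y  ==>  z = fdiv x, y
                folded.add(ops[1])
                out.append((name, "fdiv", [ops[0], d[1][1]]))
                continue
        out.append((name, op, ops))
    # Drop the now-dead single-use reciprocal definitions.
    return [inst for inst in out if inst[0] not in folded]
```

With multiple uses of the reciprocal the list comes back unchanged, which matches the "only fold single-use reciprocals" rule in point #2.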

For point #2, wouldn’t that be somewhat orthogonal to this discussion, since it has (and needs) no knowledge that an IR-level transformation happened? Also, reciprocal-multiply will be the preferred option for many (most) backends if the IR says to do that. But I suppose some backend might want to be allowed to do the reverse transformation, if permitted by the fast-math flags in the IR or fast-math mode in the SelectionDAG.