RFC: Moving DAG heuristic-based transforms to MI passes

All llvm-devs,

We're going to introduce the new possible implementation for such optimizations as reciprocal estimation instead of fdiv. In short it's a replacement of fdiv instruction (which is very expensive in most of CPUs) with alternative sequence of instructions which is usually cheaper but has appropriate precision (see genReciprocalDiv in lib/Target/X86/X86InstrInfo.cpp for details). There are other similar optimizations like usage of rsqrt, etc. but at the moment we're dealing with recip estimation only - see https://reviews.llvm.org/D26855 for details.

The current version of optimization is done at DAG Combiner level when we don't know the exact target instructions which will be used by CodeGen. As result we don't know the real cost of the alternative sequence and can't compare that cost with the cost of the single fdiv. As result the decision to select an alternative sequence (made on compiler options only) could be wrong because modern CPUs introduce very cheap fdiv and we should use it directly.

We suggest to move the implementation from DAG heuristics to MI-scheduler-based transformations (Machine Combiner). At that time we know exact target instructions and are able to use scheduler-based cost model. This knowledge allows as to select proper code sequence for final target code generation.

A possible disadvantage of the new implementation is compile time increasing (as discussed in D26855), but we expect to make improvements in that area. For the initial change (reciprocal transform), any difference is limited to fast-math compilations.

Any objections, suggestion, comments?

All llvm-devs,

We're going to introduce the new possible implementation for such optimizations as reciprocal estimation instead of fdiv. In short it's a replacement of fdiv instruction (which is very expensive in most of CPUs) with alternative sequence of instructions which is usually cheaper but has appropriate precision (see genReciprocalDiv in lib/Target/X86/X86InstrInfo.cpp for details). There are other similar optimizations like usage of rsqrt, etc. but at the moment we're dealing with recip estimation only - see https://reviews.llvm.org/D26855 for details.

The current version of optimization is done at DAG Combiner level when we don't know the exact target instructions which will be used by CodeGen. As result we don't know the real cost of the alternative sequence and can't compare that cost with the cost of the single fdiv. As result the decision to select an alternative sequence (made on compiler options only) could be wrong because modern CPUs introduce very cheap fdiv and we should use it directly.

We suggest to move the implementation from DAG heuristics to MI-scheduler-based transformations (Machine Combiner). At that time we know exact target instructions and are able to use scheduler-based cost model. This knowledge allows as to select proper code sequence for final target code generation.

A possible disadvantage of the new implementation is compile time increasing (as discussed in D26855), but we expect to make improvements in that area. For the initial change (reciprocal transform), any difference is limited to fast-math compilations.

Any objections, suggestion, comments?

Are you asking whether is okay to commit the change first and then look at the MachineCombiner's worst-case performance in followup? In general, I think that moving to using the MachineCombiner for these kinds of transformations, where there are complex tradeoffs between latency, throughput, etc., is the right direction.

  -Hal

In fact to commit the change before dealing with worst-case performance is a good idea because here we have 2 different issues. But the main idea of this RFC is an attempt to show the better approach to to these kinds of transformations and to suggest to use this approach in the future.

At the same time, I'm trying to explain that this patch is not the performance one because the generated code is almost identical to what we have just now. It is a suggestion to change the strategy in such transformations elaborating. If the community accept this new strategy we're ready to introduce new similar transformations, automate the framework, etc. But of course it will be themes for new RFCs and discussions.

In fact to commit the change before dealing with worst-case performance is a good idea because here we have 2 different issues. But the main idea of this RFC is an attempt to show the better approach to to these kinds of transformations and to suggest to use this approach in the future.

At the same time, I'm trying to explain that this patch is not the performance one because the generated code is almost identical to what we have just now. It is a suggestion to change the strategy in such transformations elaborating. If the community accept this new strategy we're ready to introduce new similar transformations, automate the framework, etc. But of course it will be themes for new RFCs and discussions.

I'm not sure that your RFC is detailed enough to really get feedback from those not already attuned to the situation. As someone who is, I'm in favor of moving forward with this. We're already using the MachineCombiner in many targets (AArch64, PowerPC, and already X86) to perform these kinds of transformations, and while we need work on the infrastructure in the near term (fixing MachineCombiner worst-case performance issues, TableGen integration, etc.), this is important for increasing code quality in cases where there are non-trivial latency/throughput tradeoffs.

  -Hal

Does it mean I have LGTM for https://reviews.llvm.org/D2685?

And does it mean I should open and fix a new PR related to "worst-case performance issues"?