RFC: Adding a 'no-overflow' keyword to the 'sdiv'/'udiv' instructions

Introduction:

We would like to add a new keyword, ‘no-overflow’, to the ‘sdiv’ and ‘udiv’ instructions.

This is the updated solution devised in the discussion: http://lists.llvm.org/pipermail/llvm-dev/2017-October/118257.html

The proposed keyword:

“nof” stands for ‘no-overflow’

Syntax:

<result> = sdiv nof <ty> <op1>, <op2> ; yields ty:result

<result> = udiv nof <ty> <op1>, <op2> ; yields ty:result

Overview:

If the keyword is present, the compiler may assume that the denominator is never zero and, for sdiv, that the division MIN_INT / -1 never occurs. Violating either assumption is undefined behavior.

If the keyword is not present, division by zero or MIN_INT / -1 returns a poison value.
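
The two flavors can be modeled in C as a rough sketch. The names div_result, sdiv_is_safe, sdiv_plain and sdiv_nof are hypothetical and exist only for illustration, and representing poison as a boolean flag is a simplification of LLVM's actual poison rules:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical model: a value plus a flag saying "this is poison". */
typedef struct {
    int value;
    bool poison;
} div_result;

/* True when a / b neither divides by zero nor overflows. */
bool sdiv_is_safe(int a, int b) {
    return b != 0 && !(a == INT_MIN && b == -1);
}

/* Models sdiv without 'nof': never traps, yields poison on bad inputs. */
div_result sdiv_plain(int a, int b) {
    div_result r = {0, true};
    if (sdiv_is_safe(a, b)) {
        r.value = a / b;
        r.poison = false;
    }
    return r;
}

/* Models sdiv with 'nof': bad inputs are undefined behavior, so the
   caller must rule them out. The assert only documents that contract. */
int sdiv_nof(int a, int b) {
    assert(sdiv_is_safe(a, b));
    return a / b;
}
```

In this model a caller of sdiv_plain has to avoid using a result whose poison flag is set, while a caller of sdiv_nof has to prove the inputs are safe before the call.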

Motivation:

Currently, if the loop vectorizer decides to vectorize a loop that contains a predicated integer division, it vectorizes the loop body and scalarizes the predicated division into a sequence of branches that guard scalar division operations. In some cases the generated code is not very efficient. Speculating the divides with the current vector sdiv instruction is not an option because of the danger of integer divide-by-zero.

There are two ways to ensure the safety of “vector div under condition”. One way is to use the same condition as the scalar execution; the current serialization approach and the previous masked integer div intrinsic proposal (http://lists.llvm.org/pipermail/llvm-dev/2017-October/118257.html) follow this idea. The second way is to check the actual divisor, regardless of the original condition; the ‘no-overflow’ keyword follows this idea. If the original code has possible div-by-zero behavior, for example, the latter approach will end up hiding it by taking advantage of the undefined behavior.

With the addition of the ‘nof’ keyword, Clang will lower C/C++ division to ‘nof’ div IR, since that keeps the same semantics.

If the vectorizer decides to vectorize a predicated div, it can do so by widening the datatype of the div and dropping the ‘nof’ keyword, which no longer holds (because one of the predicated lanes may have a zero divisor).

Leaving the keyword off the widened div allows codegen to lower the instruction as a vector instruction while ensuring that lanes that may have zero values do not trigger a trap.
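
As a rough illustration of that transformation, here is a scalar C sketch in which a 4-element array stands in for a vector register. The function names are hypothetical, and div_no_trap is only a stand-in for one lane of a widened div without 'nof': it returns 0 for lanes that would otherwise trap, which is just one permissible choice for a poison result.

```c
#include <limits.h>
#include <stdbool.h>

enum { LANES = 4 };

/* Reference: the original predicated loop, dividing only where the mask is set. */
void div_predicated(const int *a, const int *b, const bool *mask, int *out) {
    for (int i = 0; i < LANES; ++i)
        if (mask[i])
            out[i] = a[i] / b[i];
}

/* Hypothetical stand-in for one lane of a division without 'nof'. */
int div_no_trap(int a, int b) {
    if (b == 0 || (a == INT_MIN && b == -1))
        return 0; /* models poison: any value would do */
    return a / b;
}

/* Speculated form: divide every lane, then keep only the masked results. */
void div_speculated(const int *a, const int *b, const bool *mask, int *out) {
    int q[LANES];
    for (int i = 0; i < LANES; ++i)
        q[i] = div_no_trap(a[i], b[i]);   /* widened div, no 'nof' */
    for (int i = 0; i < LANES; ++i)
        out[i] = mask[i] ? q[i] : out[i]; /* select on the original predicate */
}
```

Both functions leave the same values in out as long as every lane with its mask bit set is safe to divide, which is what the original scalar code already required.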

Implementation considerations:

Initially, all targets can scalarize vector sdiv/udiv instructions into scalar ones with ‘nof’ by using guards for each lane:

%r = sdiv <4 x i32> %a, %b can be lowered to:

(assuming %a = <i32 %a.0, i32 %a.1, i32 %a.2, i32 %a.3>, %b = <i32 %b.0, i32 %b.1, i32 %b.2, i32 %b.3> and %r = <i32 %r.0, i32 %r.1, i32 %r.2, i32 %r.3>)

If CheckSafety(%a.0,%b.0):

%r.0 = sdiv nof i32 %a.0, %b.0

If CheckSafety(%a.1,%b.1):

%r.1 = sdiv nof i32 %a.1, %b.1

If CheckSafety(%a.2,%b.2):

%r.2 = sdiv nof i32 %a.2, %b.2

If CheckSafety(%a.3,%b.3):

%r.3 = sdiv nof i32 %a.3, %b.3

CheckSafety(a,b): (of sdiv)

b != 0 && (b != -1 || a != MIN_INT)

CheckSafety(a,b): (of udiv)

b != 0
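
A minimal C sketch of this guarded lowering, assuming 32-bit int lanes. The function names are hypothetical. The signed check used here is b != 0 && (b != -1 || a != INT_MIN), which rejects a zero divisor and the single overflowing pair INT_MIN / -1:

```c
#include <limits.h>
#include <stdbool.h>

/* CheckSafety for sdiv: no zero divisor and no INT_MIN / -1 overflow. */
bool check_safety_sdiv(int a, int b) {
    return b != 0 && (b != -1 || a != INT_MIN);
}

/* CheckSafety for udiv: only a zero divisor is a problem. */
bool check_safety_udiv(unsigned a, unsigned b) {
    (void)a; /* the dividend does not matter for unsigned division */
    return b != 0;
}

/* Guarded per-lane lowering of a 4 x i32 sdiv. Lanes that fail the
   check are left untouched, since their result is poison anyway. */
void sdiv_v4_scalarized(const int a[4], const int b[4], int r[4]) {
    for (int i = 0; i < 4; ++i)
        if (check_safety_sdiv(a[i], b[i]))
            r[i] = a[i] / b[i]; /* corresponds to 'sdiv nof' */
}
```

Each guarded division corresponds to one 'sdiv nof' in the lowering above; a udiv version would have the same shape with check_safety_udiv as the guard.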

Changes in LangRef.rst of udiv/sdiv Instructions:

Your proposal is essentially to introduce division instructions that cannot trigger UB, but return poison instead. On ISAs like x86 that means that these instructions have to be lowered with guards around them.
You also propose to change clang to always emit these non-UB-triggering instructions. Is this only for vector operations or also for scalar ones? What's the performance impact of all those extra guards?

Also, if your ISA has vector instructions that don't trigger UB on e.g. division by zero, why don't you rely on this target-specific information in the vectorizer instead? I mean, you would still need to add the attribute you are proposing, but you wouldn't change clang.

Nuno

> Your proposal is essentially to introduce division instructions that cannot trigger UB, but return poison instead. On ISAs like x86 that means that these instructions have to be lowered with guards around them.
> You also propose to change clang to always emit these non-UB-triggering instructions. Is this only for vector operations or also for scalar ones? What's the performance impact of all those extra guards?

Just to comment here, I think this really is worth measuring. The results aren't easy to predict. Given the optimizer may be able to frequently discharge the guard via known-bits or constant-ranges, machine-licm can move the divide to the use (thus removing the need for the guard), and the vector forms can be done as a ptest/br, the results might be nowhere near as bad as it might first seem. Particularly not after some targeted tuning work.

If someone wanted to get really fancy, there's also room for implicit fault detection and code patching based healing schemes here that I don't think have been well explored. (Or at least, I'm not aware of it.)

> Also, if your ISA has vector instructions that don't trigger UB on e.g. division by zero, why don't you rely on this target-specific information in the vectorizer instead? I mean, you would still need to add the attribute you are proposing, but you wouldn't change clang.

I thought we generally tried to avoid emitting target specific intrinsics in the vectorizer?

The aarch64 (integer) division instruction returns 0 on division by 0. So this would lower nicely without extra instructions.

  • Matthias

> Your proposal is essentially to introduce division instructions that cannot trigger UB, but return poison instead. On ISAs like x86 that means that these instructions have to be lowered with guards around them.

We return a poison value when this flag is not present. And yes, on x86 we can simulate this by putting a guard around the div instruction (to make sure no trap can be raised), but there is another way to implement it on x86, using fp div, once that becomes profitable.

> You also propose to change clang to always emit these non-UB-triggering instructions.

Clang's behavior should stay the same as it is today, which means undefined behavior in case of divide-by-zero/overflow (currently this raises a trap on x86). To keep this behavior, Clang should generate div instructions with the flag set by default (i.e. div nof ...).

> Is this only for vector operations or also for scalar ones?
> What's the performance impact of all those extra guards?

The keyword is for both scalar and vector div instructions.
Of course these guards may hurt performance, but the keyword allows the compiler to hoist/speculate div instructions, which can be profitable in some cases, for example in vectorization.

> Also, if your ISA has vector instructions that don't trigger UB on e.g.
> division by zero, why don't you rely on this target-specific information in the vectorizer instead? I mean, you would still need to add the attribute you are proposing, but you wouldn't change clang.

I am not sure that I got your point here, but we need this in order to know that the code may have a div-by-zero that did not come from the original code (speculation). So once the vectorizer tries to widen a predicated div instruction, it should delete the 'nof' flag if it exists. Each target should choose how to lower this instruction, and generating the widened div will still be controlled by the vectorizer cost model.

>> Also, if your ISA has vector instructions that don't trigger UB on
>> e.g. division by zero, why don't you rely on this target-specific
>> information in the vectorizer instead? I mean, you would still need
>> to add the attribute you are proposing, but you wouldn't change clang.

> I thought we generally tried to avoid emitting target specific intrinsics in the vectorizer?

Just to comment here: in our first RFC we introduced a generic masked div intrinsic (see http://lists.llvm.org/pipermail/llvm-dev/2017-October/118257.html ), and we have since concluded that representing this as an IR flag would be a better idea.
Of course we try to avoid any target-specific intrinsics in the vectorizer.