Proposal: add intrinsics for safe division

Hi,

I’d like to propose extending the LLVM IR intrinsics set with new intrinsics for safe division. There are already intrinsics for detecting overflow errors, like sadd.with.overflow, and the intrinsics I’m proposing would augment that set.

The new intrinsics will return a structure with two elements according to the following rules:

  1. safe.[us]div(x,0) = safe.[us]rem(x,0) = {0, 1}
  2. safe.sdiv(min<T>, -1) = safe.srem(min<T>, -1) = {min<T>, 1}, where min<T> is the minimal value of the type (e.g. INT_MIN for i32)
  3. In all other cases: safe.op(x,y) = {x op y, 0}, where op is sdiv, udiv, srem, or urem
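
The declarations would look something like this (shown for i32 only; a sketch following the with.overflow naming convention, with one overload per integer type):

  declare {i32, i1} @llvm.safe.sdiv.i32(i32, i32)
  declare {i32, i1} @llvm.safe.udiv.i32(i32, i32)
  declare {i32, i1} @llvm.safe.srem.i32(i32, i32)
  declare {i32, i1} @llvm.safe.urem.i32(i32, i32)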

These intrinsics would be used in much the same way as the arith.with.overflow intrinsics. For instance:

%res = call {i32, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
%div = extractvalue {i32, i1} %res, 0
%bit = extractvalue {i32, i1} %res, 1
br i1 %bit, label %trap, label %normal

Now a few words about their implementation in LLVM. Though the new intrinsics look quite similar to the with.overflow ones, there are significant differences. One of them is that during lowering we need to create control flow for the new intrinsics, whereas for the existing ones it is sufficient to simply compute the overflow flag. The control flow is needed to guard the division operation, which could otherwise cause undefined behavior.
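
For illustration, the straightforward guarded lowering might produce a CFG roughly like the following (a sketch, not the exact code from the patch; %a and %b are the operands):

entry:
  %divbyzero = icmp eq i32 %b, 0
  br i1 %divbyzero, label %special, label %checkovf
checkovf:
  %ismin = icmp eq i32 %a, -2147483648    ; INT_MIN for i32
  %isneg1 = icmp eq i32 %b, -1
  %ovf = and i1 %ismin, %isneg1
  br i1 %ovf, label %special, label %dodiv
dodiv:
  %q = sdiv i32 %a, %b
  br label %join
special:
  %sq = select i1 %divbyzero, i32 0, i32 -2147483648
  br label %join
join:
  %quot = phi i32 [ %q, %dodiv ], [ %sq, %special ]
  %flag = phi i1 [ false, %dodiv ], [ true, %special ]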

The existing intrinsics are lowered in the backend, during the legalization steps. Doing the same for the new ones would require a more complicated implementation, because of the need to create new control flow, and it would have to be done in every backend.

The other alternative is to lower the new intrinsics in the CodeGenPrepare pass. That approach looks more convenient to me, because it gives us a single implementation for all targets in one place, and it’s easier to introduce control flow at that point.

The patch below implements the second alternative. Along with a straightforward lowering (which is valid and could be used as a base on all platforms), some simple optimizations are performed during the lowering (these, I think, are also easier to implement in CodeGenPrepare than on DAGs):

  1. We don’t generate code for unused parts of the result structure.
  2. If the div instruction on the given platform already behaves exactly as the intrinsic requires (as is the case on ARM64), we don’t guard the div instruction. As a result, we can avoid branches entirely if the second element of the result structure is unused.
  3. The most likely users of the result structure are extractvalue instructions. With that in mind, we try to propagate the results; in most cases that lets us get rid of all the corresponding extractvalue instructions (see the example below).
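
As a hypothetical illustration of points 2 and 3: on a target whose div instruction already has the required semantics, a call whose flag is unused can fold down to a plain divide:

  ; before lowering
  %res = call {i32, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
  %div = extractvalue {i32, i1} %res, 0

  ; after lowering on such a target: no guard, no extractvalue
  %div = sdiv i32 %a, %b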

Attached are two patches: the first one with the described implementation and tests, and the second one with the corresponding documentation changes.

The first patch has already landed in trunk, but the topic is still open, and any suggestions are welcome.

1-safediv-intrinsics.patch (28.1 KB)

2-safediv-intrinsics-docs.patch (6.78 KB)

Why not also support 8-bit vectors?

I’d be interested in these intrinsics also supporting integer vector types with the same semantics, and in having the vectorization code understand scalar/vector safe division the same way it understands scalar/vector division.
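
For illustration, a vector overload might look something like this (a hypothetical signature; vector support is not in the posted patches):

  %res = call {<4 x i32>, <4 x i1>} @llvm.safe.sdiv.v4i32(<4 x i32> %a, <4 x i32> %b)
  %div = extractvalue {<4 x i32>, <4 x i1>} %res, 0
  %bit = extractvalue {<4 x i32>, <4 x i1>} %res, 1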

FWIW, we did something in PNaCl, but only to guarantee that integer division/modulo by zero reliably traps on all architectures.
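
Such a guard might look roughly like this (a sketch, not PNaCl’s actual output):

  %iszero = icmp eq i32 %b, 0
  br i1 %iszero, label %trap, label %divide
trap:
  call void @llvm.trap()
  unreachable
divide:
  %q = sdiv i32 %a, %b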

Sounds good, JF. I suggest that we start by making scalar safe division work and then move on to supporting vectors.

Sgtm

Hi Michael,

I’d like to propose extending the LLVM IR intrinsics set with new intrinsics for
safe division. There are already intrinsics for detecting overflow errors, like
sadd.with.overflow, and the intrinsics I’m proposing would augment that set.

The new intrinsics will return a structure with two elements according to
the following rules:

safe.[us]div(x,0) = safe.[us]rem(x,0) = {0, 1}
safe.sdiv(min<T>, -1) = safe.srem(min<T>, -1) = {min<T>, 1}
In other cases: safe.op(x,y) = {x op y, 0}, where op is sdiv, udiv, srem, or
urem

These intrinsics would be used in much the same way as the
arith.with.overflow intrinsics. For instance:
      %res = call {i32, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
      %div = extractvalue {i32, i1} %res, 0
      %bit = extractvalue {i32, i1} %res, 1
      br i1 %bit, label %trap, label %normal

I notice that the patch is still in ToT even though we're now having a
discussion on whether it should go in. It should be reverted.

That said, I don't see how this is anything except an optimization
intrinsic, not something necessary for correct behavior in any
language. That is, you can do what the PNaCl people are doing and emit
branches instead. Since this will only happen on a single target and
not on all targets, why not just make this a target intrinsic? I don't
buy the argument that it biases the target-independent IR, since any
pass that uses TTI at the IR level does the same.

-eric

The sdiv operation in LLVM IR only makes sense for C and its very direct relatives. The amount of control flow necessary to represent a safe division for any other language is ghastly. (a/b) becomes something like (b != 0 ? ((a != INT_MIN || b != -1) ? a / b : INT_MIN) : 0). It’s more useful to the optimizer to see this as the single operation that it really is.

Also, this isn’t about just a single target. Any target that has to emulate integer division (like a lot of embedded systems would do) would benefit from this since the division emulation can most naturally be written to provide the “safe” semantics rather than LLVM’s default semantics (i.e. undefined behavior on x/0 and INT_MIN/-1).

I think adding these intrinsics to the IR would be a great help for non-C clients of LLVM.

-Filip

This argument makes sense to me. It would be hard for the arm64 backend to
pattern match all this control flow back into a single instruction after
optimizations.

Are there ISAs out there that produce different results without trapping
for these boundary conditions? Should we leave the quotient result
unspecified in the b == 0 || (a == INT_MIN && b == -1) cases? If we do
that, we require frontends to check the "overflow" bit if they want
deterministic cross-platform behavior.

I agree with all of Filip’s points above. I’m in support of including these intrinsics in their current form. I have two suggestions, but both of these can be incremental improvements:

1) On many targets, it might make sense to lower these much earlier than CodeGenPrepare. By lowering them that late, you lose a lot of potential to have branches eliminated. If there’s no special hardware support that makes those conditions cheap, getting rid of them early is probably the best choice.

2) Extending Reid’s suggestion (in a separate response): could we define the intrinsics to take the (constant) values to be used in the boundary conditions? llvm.safe.div(x, y, undef, undef) would give Reid’s desired semantics; llvm.safe.div(x, y, 0, min) would give the originally proposed semantics. This would allow a variety of language semantics with a shared implementation.

Philip
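
For concreteness, the parameterized form might be called like this (a hypothetical signature, not part of the posted patch; the last two operands are the quotients returned for x/0 and min<T>/-1 respectively):

  %r1 = call {i32, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b, i32 undef, i32 undef)    ; Reid's unspecified-quotient semantics
  %r2 = call {i32, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b, i32 0, i32 -2147483648)  ; the originally proposed semantics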

The sdiv operation in LLVM IR only makes sense for C and its very direct
relatives. The amount of control flow necessary to represent a safe
division for any other language is ghastly. (a/b) becomes something like (b
!= 0 ? ((a != INT_MIN || b != -1) ? a / b : INT_MIN) : 0). It's more useful
to the optimizer to see this as the single operation that it really is.

This argument makes sense to me. It would be hard for the arm64 backend to
pattern match all this control flow back into a single instruction after
optimizations.

No objections on this point here either - I agree that this makes
things more verbose for generating IR and that pattern matching it
back might be difficult (CVP should be able to do it, however...)

Are there ISAs out there that produce different results without trapping for
these boundary conditions? Should we leave the quotient result unspecified
in the b == 0 || (a == INT_MIN && b == -1) cases? If we do that, we require
frontends to check the "overflow" bit if they want deterministic
cross-platform behavior.

Agreed. Perhaps we want to say "undefined" or "returns the same value",
or are there other semantics?

-eric

As an example of a language that has similar but different semantics
for divide, I give you Go: it is required to trap in the runtime on
divide by zero, but required to return the type's min value for min/-1.

Thoughts?

-eric

That’s a good question. I believe that all ISAs take one of the following two paths:

  • Trap on x/0 and INT_MIN/-1 like x86.

  • Return 0 for x/0 and INT_MIN for INT_MIN/-1 like ARM.

Also, here’s what I know about what different languages do:

C/C++/etc: undef on x/0 and INT_MIN/-1, so they would use LLVM’s existing div instruction.

JavaScript: optimizing compilers will look for places where people write things like (a/b)|0 (in JavaScript, x|0 means “truncate to int32”), and they will do an integer division if the inputs were already represented as integers. x/0 produces NaN, and NaN|0 produces 0; INT_MIN/-1 produces -INT_MIN (which is representable as a double), and then (-INT_MIN)|0 produces INT_MIN. So the ARM64 semantics match optimized JavaScript semantics. There is also the case where you write a/b in JavaScript and we have inferred that a and b are integers, but there is no |0. In that case we want to detect the x/0 and INT_MIN/-1 cases so that we can promote the result to double.

Java: x/0 throws an exception but INT_MIN/-1 produces INT_MIN.

Ruby and probably a lot of others: provide BigInt semantics but have fast paths for small ints. So, these will throw an exception on x/0 and they will inflate to a BigInt on INT_MIN/-1.

I don’t know about other languages.

Looking at this, it makes me slightly prefer to generalize these intrinsics a bit. Instead of returning an i1 flag for all overflow cases, you could return an i2 status enum: 0 for neither overflow condition, 1 for x/0, and 2 for INT_MIN/-1. Then you get the following matrix for how Java and JavaScript would map to this and how it would be lowered (a sketch of the resulting IR follows the list):

  • Java on x86: use safediv and check if the status is 1. If it’s 1, then throw. This lowers to a bunch of branches fairly late in the backend.

  • JavaScript on x86 with |0: use safediv and ignore the status. Also lowers to a bunch of branches, which is as good as it gets.

  • JavaScript on x86 without |0: use safediv and deoptimize if status != 0. Also a bunch of branches.

  • Ruby on x86: use safediv and detect both status==1 and status==2. This lowers to a bunch of branches.

  • Java on ARM: use safediv as on x86, but now it lowers to just one branch, to detect x/0, and none for the other case. Note that without the safediv intrinsic, you wouldn’t be able to get this down to one branch unless you used a target intrinsic.

  • JavaScript on ARM with |0: use safediv, ignore the status, and get no branches; it maps directly to the ARM instruction.

  • JavaScript on ARM without |0: use safediv and test the status, so you get a bunch of branches as before.

  • Ruby on ARM: this lowers to a bunch of branches as on x86.
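
Concretely, a Java-style status check against this i2 variant might look like the following IR (a hypothetical sketch of the generalized intrinsic):

  %res = call {i32, i2} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
  %div = extractvalue {i32, i2} %res, 0
  %status = extractvalue {i32, i2} %res, 1
  %divbyzero = icmp eq i2 %status, 1    ; 1 => x/0; 2 => INT_MIN/-1, which Java lets through
  br i1 %divbyzero, label %throw, label %continue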

So, it seems that for JavaScript and Java on ARM - and probably any environment with emulated division - these intrinsics would result in tighter code without requiring target intrinsics. For all other language/hardware combinations, the main benefit here is to materialize the gnarly control flow later and hopefully reveal more optimization opportunities. That seems like a pretty good sweet spot for the systems that I know about.

-Filip

Thanks for the writeup! It's very helpful.

The sdiv operation in LLVM IR only makes sense for C and its very direct
relatives. The amount of control flow necessary to represent a safe
division for any other language is ghastly. (a/b) becomes something like
(b != 0 ? ((a != INT_MIN || b != -1) ? a / b : INT_MIN) : 0). It's more
useful to the optimizer to see this as the single operation that it really
is.

This argument makes sense to me. It would be hard for the arm64 backend to
pattern match all this control flow back into a single instruction after
optimizations.

I want to soften this statement a bit: while ISel can't look across basic
blocks, we could probably pattern match away the branches in the
SelectionDAGBuilder, because it can see the original use-def chain for the
result of the division. This would take a little work, but I can see how it
would work, and I've only touched SelectionDAGBuilder a few times.

So, it seems that for JavaScript and Java on ARM - and probably any
environment with emulated division, these intrinsics would result in
tighter code without requiring target intrinsics. For all other
language/hardware combinations, the main benefit here is to materialize the
gnarly control flow later and so hopefully reveal more optimization
opportunities. That seems like a pretty good sweet-spot for the systems
that I know about.

If we're going to emit the control flow anyway, then we actually want to
expose the control flow to the mid-level optimizers as early as possible.
For example, on a non-ARM64 ISA, if the user checked that the divisor was
zero already, we can eliminate the zero check branch. We could also
unswitch this loop:

int b = ...;
for (int i = 0; i < n; ++i)
  c[i] = a[i] / b;

Lowering this with branches produces:

for (int i = 0; i < n; ++i) {
  if (b == 0) c[i] = 0;
  else if (b == -1 && a[i] == INT_MIN) c[i] = INT_MIN;
  else c[i] = a[i] / b;
}

b is loop invariant, and we could unswitch this on non-AArch64:

if (b == 0)
  for (int i = 0; i < n; ++i) c[i] = 0;
else if (b == -1)
  for (int i = 0; i < n; ++i) c[i] = (a[i] == INT_MIN) ? INT_MIN : a[i] / b;
else
  for (int i = 0; i < n; ++i) c[i] = a[i] / b;

Note that the middle branch goes away for unsigned division. It seems to
me that the branches are better all around, at least for other ISAs.

As for generality, every language except asm.js (or
JS-with-i32-coercion-hints) has to branch on the "overflow" bit of the
intrinsic anyway.

There's also the discussion of iload and istore, where the consensus was
that we'd rather have explicit null checks and branches.

I'm also not excited by the prospect of teaching all the mid-level
optimizers about the arithmetic properties of safe.[us]div, which has been
a big drawback to the other overflow checked arithmetic intrinsics.

So, in conclusion, I think we should at least *try* to pattern match out
the branches in SDAGBuilder for ARM64 before giving up and adding this
intrinsic, or something like it. Does that sound reasonable?

Not if you want to move code around. Consider a loop that is too large to unswitch but that contains a “safe division” expressed as branches and that division is loop invariant. What would happen? My guess is that the optimizer would have a hard time with it.

Sure, but that would be definitely suboptimal for JS on ARM64. So if the claim is that ARM64 will optimize this by pattern matching in the backend then this is a clear example of where such pattern matching would be rather hard: an earlier optimization could inhibit our ability to do this effectively.

Heh, I wouldn’t call it asm.js. asm.js adopted something that was already a de-facto idiom.

But yes - most people want to be told if the denominator was zero but want INT_MIN/-1 to flow through gracefully and have defined behavior.

Lol. Since you brought that up, I figure I might as well throw this out: probably some VMs will want a guaranteed trap on divide-by-zero as a free way of throwing the exception. I actually think that the safediv intrinsic would ultimately make this easier. The client would generate code like “if (extractvalue (safediv (…), 1) == 1) trap”, and that could be pattern-matched to mean “emit code that deterministically traps on divide by zero”. I don’t think we have the vocabulary necessary to express this in IR. But I think that matching this in the backend would be easier than matching “if (b == 0) then trap else do a division”, since the division could be moved around independently of the branch.
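
In IR, that pattern would look roughly like this (a sketch using the proposed intrinsic together with the existing @llvm.trap intrinsic; a backend could recognize this whole shape and emit a single deterministically trapping divide):

  %res = call {i32, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
  %bit = extractvalue {i32, i1} %res, 1
  br i1 %bit, label %trap, label %continue
trap:
  call void @llvm.trap()
  unreachable
continue:
  %div = extractvalue {i32, i1} %res, 0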

My fear is that since the control flow is complex, the matching would get really nutty. For example, you could be clever for a/b and say “b + 1 >_unsigned 1 ? a / b : slow path”. I actually did that on some VM at some point. Then there are all of the ways that the control flow could get messed up by optimization phases. If you ask the frontend that generates LLVM IR to do this on its own, then you’ll end up with a lot of patterns on your hands. It would take some effort to canonicalize control flow of this complexity to the point where the pattern matching would be reliable.

I think it’s better if we give frontends that need this some way of expressing it to LLVM explicitly. Note that this doesn’t preclude LLVM from lowering the intrinsic as early as we deem appropriate. But then the choice to do so is within LLVM, rather than being delegated to the client, who is less likely to know about specifics like what LLVM can or can’t pattern match.
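
For reference, the unsigned-compare trick mentioned above might be written in IR like this (a sketch; the comparison is true iff b is neither 0 nor -1, so the fast path never hits either boundary case):

  %t = add i32 %b, 1
  %fast = icmp ugt i32 %t, 1    ; b+1 >u 1  <=>  b != 0 && b != -1
  br i1 %fast, label %fastdiv, label %slowpath
fastdiv:
  %q = sdiv i32 %a, %b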

In short, I agree with your observations that these intrinsics are not an obvious slam-dunk compared to making the explicit control flow, but I think that the intrinsics do give enough flexibility on the LLVM side that it would be great if front-ends used them rather than rolling the control flow themselves.

-Filip

In short, I agree with your observations that these intrinsics are not an
obvious slam-dunk compared to making the explicit control flow, but I think
that the intrinsics do give enough flexibility on the LLVM side that it
would be great if front-ends used them rather than rolling the control flow
themselves.

The problem is that then we have two problems: all targets (except
ARM64) have to lower the intrinsic as the first thing they do (giving
us a TTI pass as the first thing in the pipeline) to take advantage of
the information later during optimization, _and_ we have to do all of
the work of optimizing the intrinsic as well. That leaves us with
split optimization efforts, plus the pain of maintaining an intrinsic
that's useful for a single target.

I really think that this is just solidifying my position that the
intrinsic is a bad idea and that this should be done as a later
optimization.

-eric

Does anyone have performance data on ARM for the early vs late lowering approaches? Having data would make the discussion much more concrete about the tradeoffs of the two approaches.

To put it another way, how much do we lose performance-wise to missed pattern matching?

Philip

There were no performance or optimization numbers provided with the
patch or proposal :)

-eric

Agreed. That was my point. :)

Philip

The goal of the intrinsic wasn't stated clearly.

This intrinsic isn't particularly necessary for any specific frontend
or architecture. I think the LLVM community would benefit from a
canonical way to represent well-defined integer division. We don't
need to add an IR instruction, because it's fine for most optimization
passes to ignore these operations. A target independent intrinsic
works well as long as:

- It cleanly captures the language level semantics.

- It facilitates mid-level optimization.

- It naturally lowers into ideal code both on architectures that trap
  (x86) and those that don't (arm).

To summarize my understanding of the concerns:

(1) The semantics of the safe.div intrinsic need to be useful for the
Language/ISA Matrix that LLVMers care about.

At the canonical IR level, the intrinsic is useful because it eliminates
control flow merges and represents divide-by-zero and/or signed overflow
in a canonical form:

    %res = call {i32, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
    %bit = extractvalue {i32, i1} %res, 1
    br i1 %bit, label %trap, label %continue
trap:
    call ...
    unreachable

continue:
    %div = extractvalue {i32, i1} %res, 0

The original proposal fails to achieve this because the common case of
Java/Go would require a check in the slow path to differentiate
divide-by-zero from signed overflow. That should be fixed by
generalizing the intrinsic so that the two conditions are distinct:

    %res = call {i32, i1, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
    %div0 = extractvalue {i32, i1, i1} %res, 1
    br i1 %div0, label %trap, label %checkovf

checkovf:
    %ovf = extractvalue {i32, i1, i1} %res, 2
    br i1 %ovf, label %trap, label %continue

trap:
    call ...
    unreachable

continue:
    %div = extractvalue {i32, i1, i1} %res, 0

...or some variation of the above. I don't have a personal stake in this.

(2) The safe.div intrinsic inhibits generic code motion and Correlated
Value Prop based optimization.

This goes both ways.

CVP could miss out on cases unless we teach it about the semantics of
the intrinsics. Correct me if I'm wrong, but I don't think this would
actually be too difficult.

OTOH, with the intrinsics, it would be easy to write a simple
optimization pass that hoists and combines checks along the domtree.

After divide-by-zero optimization, you would have something like:

    %res = call {i32, i1, i1} @llvm.safe.sdiv.i32(i32 %a, i32 %b)
    %div0 = extractvalue {i32, i1, i1} %res, 1
    br i1 %div0, label %trap, label %continue

trap:
    # No call here!
    unreachable

continue:
    %div = extractvalue {i32, i1, i1} %res, 0

And the branch to trap just goes away at some point.

Now considering Reid's LICM example:

for (int i = 0; i < n; ++i) {
  if (b == 0) c[i] = 0;
  else if (b == -1 && a[i] == INT_MIN) c[i] = INT_MIN;
  else c[i] = a[i] / b;
}

Simply put, if we want to unswitch this code for x86, then we need a
TTI pass to lower the intrinsic. On the other hand, I believe it is
more important to eliminate the control flow within the loop to aid
loop analysis and other optimizations. So we still want the front end
to emit the intrinsic, we just may eventually want it lowered earlier
than CGP. I don't think this issue has any bearing on the intrinsic's
LangRef spec.

There was some comparison made to iload/istore, which I don't
follow:
- subsuming loads and stores into another instruction is really scary
  considering how much logic we have for analyzing the side effects of
  memory access.
- there is no benefit to IR optimization from the iload/istore instruction.
- it is easy to detect null checked load/stores in IR at any point.
- it is very rare that a platform would want to resume from trapping load/store.

(3) The safe.div intrinsic is a target-specific codegen optimization.

Regardless of the target, we want to eliminate control flow merges in
all cases, and completely eliminate control flow when we don't have
trapping semantics. To do this, we need an intrinsic at the canonical
IR level. Clearly it should be a target-independent intrinsic, since
*some* optimizations/analyses want to be aware of it.

Even ignoring the rationale above, supporting codegen with a
target-specific intrinsic would minimally require the DAG builder to
match this:
"(b != 0 ? ((a != INT_MIN || b != -1) ? a / b : INT_MIN) : 0)"

after all optimizations have had their way with it. A big problem here
is that there are actually many different forms of this expression and
surrounding control flow, depending on the frontend's integer division
semantics. We would have to recognize all the variations of checking for
divide-by-zero and/or overflow, trapping and/or producing a certain
constant, branching or selecting values. This is obviously terrible
from an engineering standpoint.

-Andy

Hi,

Thanks to everybody for the very informative feedback! Due to the difference in time zones, I usually miss the most active part of the discussion, but I’ll try to cover the raised questions here and share my view on them.

Generally speaking, we are trying to find answers to the following two questions:
    o Are these intrinsics useful in general?
    o If yes, when to lower them to actual IR/DAG?

In a nutshell, my answers are:
    o Yes, they are useful, and for all targets, not only for ARM64 (where they are especially useful).
    o After loop optimizations but before lowering IR to the DAG. Currently that happens in CodeGenPrepare, but we might lower them before that if there is any need for it.

I'll try to explain why I think so. To begin, let's consider a simple expression div = x/y in different languages and write the corresponding IR using the intrinsics (here I omit types for conciseness, and assume the result type of safe.sdiv is {i32, i1, i1}):

*** C/C-like ***
  %div = sdiv %x, %y

*** Java, Go ***
  %divr = call %safe.sdiv(%x, %y)
  %div = extractvalue %divr, 0
  %divbyzero_bit = extractvalue %divr, 1
  br i1 %divbyzero_bit, label %throw, label %normal

*** JavaScript, with “|0” ***
  %divr = call %safe.sdiv(%x, %y)
  %div = extractvalue %divr, 0

*** JavaScript without “|0”, Ruby and maybe others ***
  %divr = call %safe.sdiv(%x, %y)
  %div = extractvalue %divr, 0
  %divbyzero_bit = extractvalue %divr, 1
  %ovf_bit = extractvalue %divr, 2
  %exceptional_case_bit = or %ovf_bit, %divbyzero_bit
  br i1 %exceptional_case_bit, label %handle_exceptional_case, label %normal

Now let’s write the IR for the same expression without the intrinsics:

*** C/C-like ***
  %div = sdiv %x, %y

*** Java, Go ***
  %divbyzero_bit = cmp %y, 0
  br i1 %divbyzero_bit, label %throw, label %normal
normal:
  %div = sdiv %x, %y

*** JavaScript, with “|0” ***
  %divbyzero_bit = cmp %y, 0
  br i1 %divbyzero_bit, label %division_by_zero, label %chkovf
division_by_zero:
  br label %end
chkovf:
  %tmin_bit = cmp %x, min<T>
  %minus_one_bit = cmp %y, -1
  %ovf_bit = and %tmin_bit, %minus_one_bit
  br i1 %ovf_bit, label %ovf, label %normal
ovf:
  br label %end
normal:
  %div.0 = sdiv %x, %y
  br label %end
end:
  %div = phi [%div.0, %normal], [min<T>, %ovf], [0, %division_by_zero]

*** JavaScript without “|0”, Ruby and maybe others ***
  %divbyzero_bit = cmp %y, 0
  br i1 %divbyzero_bit, label %throw, label %chkovf
chkovf:
  %tmin_bit = cmp %x, min<T>
  %minus_one_bit = cmp %y, -1
  %ovf_bit = and %tmin_bit, %minus_one_bit
  br i1 %ovf_bit, label %ovf, label %normal
ovf:
  ; Special handling of min<T>/-1: converting to BigInt, double, or something else
  unreachable
normal:
  %div.0 = sdiv %x, %y

As one can see, if the intrinsic is lowered to a CFG (e.g. on x86), there is neither loss nor win: thanks to the implemented optimizations, the final code would be roughly the same as with an explicitly written CFG.

If the intrinsic can be lowered directly to a single instruction (e.g. on ARM64, when we don’t care about the divz/ovf flags), its use is definitely beneficial - that is the case for <JavaScript, with “|0”>.

That might make one think that the intrinsics are useful only for ARM64 and in general aren’t that useful at all. But in fact, even when the intrinsic can’t be lowered to a single instruction, keeping the control flow ‘inside’ it until lowering might be beneficial. For instance, loop passes work much better on loops without control flow; IndVars might be a good example of such a pass. And it seems much easier to vectorize a loop with a call to one of these intrinsics than a loop with explicitly created control flow (and that only gets worse when there are several calls to the intrinsics in the same loop).
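
For instance, with hypothetical vector overloads (as requested earlier in the thread; they are not part of this patch), a vectorized loop body could stay entirely branch-free:

  %vres = call {<4 x i32>, <4 x i1>} @llvm.safe.sdiv.v4i32(<4 x i32> %va, <4 x i32> %vb)
  %vdiv = extractvalue {<4 x i32>, <4 x i1>} %vres, 0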

In some cases we might want to lower these intrinsics earlier than in CGP - as Reid mentioned, that could help us hoist checks out of loops. That’s true, but we need to keep in mind the possible disadvantages of such early lowering (see the previous paragraph). In general, though, I don’t see any problems with allowing earlier lowering.

As for extending the result structure to carry one more flag (either an i2 instead of the i1, or one more i1 flag) - that seems fine, and both options, {iN, i2} and {iN, i1, i1}, look good to me.

Now, why not lower these intrinsics even later? For almost all targets we want to produce very similar CFGs, and I don’t see any benefit in duplicating very similar code across the different backends. Even on ARM64 we need control flow in some cases (e.g. for Java), so we would win nothing by making the lowering completely target-specific.

I hope that covers the raised concerns, at least to some extent. Does that sound reasonable enough? If so, I’ll prepare and post here an updated version of the patch.

Thanks,
Michael

I am very much in favor of having a div instruction with well-defined div-by-zero and overflow behavior. The undefined behavior on certain input values of LLVM’s division operations has been a major pain point for us in Julia, because adding the extra branches just kills performance, and we know that there is an x86 instruction that does just what we want.

Anyway, this was brought up briefly above, but I want to bring back the discussion of a div instruction that reliably traps, as that seems to me to be one of the only ways to get adequate performance in the common case. This can probably be done, as suggested above, by matching on the trap condition of this intrinsic, but I think while we’re at it we should tackle that case as well.

Thanks,
Keno