A question about everyone's favorite constructs: NSW and NUW

So, I know there are a lot of mixed feelings about NSW and NUW, but I’m running into a common pattern where they really help:

void f(char *array, int i) {
  // do some stuff...
  g(array[i++]);

  // do some more stuff...
  g(array[i++]);

  // more of the same
}

So, this kind of code comes up pretty frequently. Unrolled loops with an int IV[1], encoding, decoding, compression, etc. What we’d like to do is take the sequence of N increments and turn them into N addressing modes with a constant offset. Then add a small constant at the end and test the flags if we’re in a loop, etc. This is much faster than clobbering registers to do LEA or inc or add for each one (and yea, we do LEA and use lots of registers currently).

The problem is that this is being compiled in 64-bit mode. So the index is 32-bits and gets sign extended prior to the GEP. So we end up with a lot of repetitions of:

  %a = add nsw i32 %i, N              ; for some small immediate integer N
  %b = sext i32 %a to i64
  %ptr = gep inbounds i8* %base, i64 %b

When we try to match this pattern in the backend to an x86 addressing mode we fail because we can’t look through the sext. This is particularly frustrating because we should be able to look through the sext due to the NSW bit on the add, but we don’t preserve NSW constraints into the DAG.
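To spell out what the nsw bit licenses here (just a sketch; the widened value names are illustrative):

  %i.64 = sext i32 %i to i64
  %a.64 = add nsw i64 %i.64, N                 ; same value as %b above, but only because the i32 add is nsw
  %ptr.2 = gep inbounds i8* %base, i64 %a.64   ; N can now fold straight into the addressing mode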

How can we fix this? I see the following options:

  1. Canonicalize the other way in the IR. This could take a few different forms:
    1a) Widen math (where NSW or NUW allows) which is only used by an *ext feeding a GEP index.
    1b) Widen all math which is only used by an *ext that targets a legal integer size.

  2. Add the NSW (and NUW) bits to the DAG, and query them in the address mode matching to match across (sext (op nsw …)) and (zext (op nuw …))

  3. Add a complete hack to CodeGenPrepare (or similar) which does #1 as a non-canonical step in the IR.

I dislike #3 intensely – we really want to be able to combine on this stuff in a canonical way, so I want the pattern to either always be (gep … (sext (op nsw …))) or (gep … (op nsw (sext …))). Also, it feels like quite a hack around the lack of fidelity in the backend.

I’m perfectly fine with either #1 or #2. Doing #1 would require a lot of benchmarking to make sure the different canonicalization doesn’t regress lots of things, and probably a few fixes where it does regress things. These are likely to be easy to find by looking for patterns that match sext.

Doing #2 requires plumbing these flags through the DAG. I don’t yet know how hard that would be, but it doesn’t sound too bad. The current approach I would probably take is to add an SDNodeProperty for NSW and NUW, and nodes for operations with those flags. But that means generalizing everything that currently only looks at ADD. There is probably a better way to express this in the DAG, so if this is the direction folks think LLVM should go, point me at any ideas you have about how to best implement it.

Currently, I would naively go after #1 just because I know exactly how to do that. But it doesn’t feel quite right and I wonder if we should really invest the time in making sure the backend is aware of the NSW and NUW flags.

-Chandler

[1]: Note that we could fix this inside of loop bodies when it’s an induction variable by widening the induction variable. But that doesn’t fix it in the general case.

From: "Chandler Carruth" <chandlerc@gmail.com>
To: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Saturday, January 4, 2014 7:29:07 PM
Subject: [LLVMdev] A question about everyone's favorite constructs: NSW and NUW

So, I know there are a lot of mixed feelings about NSW and NUW, but
I'm running into a common pattern where they really help:

void f(char *array, int i) {
// do some stuff...
g(array[i++]);

// do some more stuff...
g(array[i++]);

// more of the same
}

So, this kind of code comes up pretty frequently. Unrolled loops with
an int IV[1], encoding, decoding, compression, etc. What we'd like
to do is take the sequence of N increments and turn them into N
addressing modes with a constant offset. Then add a small constant
at the end and test the flags if we're in a loop, etc. This is *much
faster* than clobbering registers to do LEA or inc or add for each
one (and yea, we do LEA and use lots of registers currently).

The problem is that this is being compiled in 64-bit mode. So the
index is 32-bits and gets sign extended prior to the GEP. So we end
up with a lot of repetitions of:

%a = add nsw i32 %i, N # for some immediate small integer N
%b = sext i32 %i to i64
%ptr = gep inbounds i8* %base, i64 %b

When we try to match this pattern in the backend to an x86 addressing
mode we fail because we can't look through the sext. This is
particularly frustrating because we *should* be able to look through
the sext due to the NSW bit on the add, but we don't preserve NSW
constraints into the DAG.

How can we fix this? I see the following options:

1) Canonicalize the other way in the IR. This could take a few
different forms:
1a) Widen math (where NSW or NUW allows) which is only used by an
*ext feeding a GEP index.
1b) Widen all math which is only used by an *ext that targets a legal
integer size.

I'm in favor of having NSW/NUW information available at the SDAG level (and even at the MI level), but for other reasons. For this problem, I vote for solution 1b. My rationale is that this would solve a problem that I currently have in the PowerPC backend for 64-bit targets in a general way:

On 64-bit PPC targets, return values are (zero or sign) extended to 64-bits. Here's a simple example that demonstrates the issue:

define signext i32 @foo(i1 zeroext %a) #0 {
entry:
  %cond = select i1 %a, i32 7, i32 12
  ret i32 %cond
}

compiles into something like this:

        ...
        li r4, 7
        li r3, 12
        isel r3, r4, r3, 1
        rldicl r3, r3, 0, 32 <-- this is a completely unnecessary extension
        blr

Obviously, this affects cases other than when the extension is part of the return value (although the return value case is a pretty obvious code wart).

One possible solution to this that I've thought about (although not yet experimented with) is 'promoting' all 32-bit operations to 64-bit ones up front. That, however, might lead to a bunch of unnecessary extensions of its own. I think that the 1b solution would nicely solve this problem without introducing unnecessary extensions.
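For example, blind promotion would turn all-32-bit code like this (a sketch with made-up values):

  %x = add i32 %a, %b

into:

  %a.64 = sext i32 %a to i64        ; extensions introduced just to feed the wider op
  %b.64 = sext i32 %b to i64
  %x.64 = add i64 %a.64, %b.64

whereas 1b would leave it alone, because %x is not used by an *ext.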

One issue that might come up is that 64-bit operations might be more expensive (either in latency or in throughput) than the corresponding 32-bit ones (I'm thinking specifically about multiplication and division here, but if we're spilling a lot, there could also be stack-size and bandwidth effects), and we might want some target input on that.

In short, I would not be so hard on #3 ;) -- this is a legitimate problem that might not have a nice target-cost-independent solution, but I suspect that the benefits of using 1b as the new canonical form probably outweigh the negative side effects.

-Hal

> From: "Chandler Carruth" <chandlerc@gmail.com>
> To: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
> Sent: Saturday, January 4, 2014 7:29:07 PM
> Subject: [LLVMdev] A question about everyone's favorite constructs: NSW
and NUW
>
>
>
> So, I know there are a lot of mixed feelings about NSW and NUW, but
> I'm running into a common pattern where they really help:
>
>
> void f(char *array, int i) {
> // do some stuff...
> g(array[i++]);
>
>
> // do some more stuff...
> g(array[i++]);
>
>
> // more of the same
> }
>
>
> So, this kind of code comes up pretty frequently. Unrolled loops with
> an int IV[1], encoding, decoding, compression, etc. What we'd like
> to do is take the sequence of N increments and turn them into N
> addressing modes with a constant offset. Then add a small constant
> at the end and test the flags if we're in a loop, etc. This is *much
> faster* than clobbering registers to do LEA or inc or add for each
> one (and yea, we do LEA and use lots of registers currently).
>
>
> The problem is that this is being compiled in 64-bit mode. So the
> index is 32-bits and gets sign extended prior to the GEP. So we end
> up with a lot of repetitions of:
>
>
> %a = add nsw i32 %i, N # for some immediate small integer N
> %b = sext i32 %i to i64
> %ptr = gep inbounds i8* %base, i64 %b
>
>
> When we try to match this pattern in the backend to an x86 addressing
> mode we fail because we can't look through the sext. This is
> particularly frustrating because we *should* be able to look through
> the sext due to the NSW bit on the add, but we don't preserve NSW
> constraints into the DAG.
>
>
> How can we fix this? I see the following options:
>
>
> 1) Canonicalize the other way in the IR. This could take a few
> different forms:
> 1a) Widen math (where NSW or NUW allows) which is only used by an
> *ext feeding a GEP index.
> 1b) Widen all math which is only used by an *ext that targets a legal
> integer size.

> I'm in favor of having NSW/NUW information available at the SDAG level
> (and even at the MI level), but for other reasons.

Cool. Could you suggest how best to implement this (especially in the
SDAG)? I don't like my ideas thus far.

> For this problem, I vote for solution 1b. My rationale is that this would
> solve a problem that I currently have in the PowerPC backend for 64-bit
> targets in a general way: [...]
>
> One possible solution to this that I've thought about (although not yet
> experimented with) is 'promoting' all 32-bit operations to 64-bit ones up
> front. That, however, might lead to a bunch of unnecessary extensions of
> its own. I think that the 1b solution would nicely solve this problem
> without introducing unnecessary extensions.

It feels like there are way better solutions to this. Why can't you just
fix this by looking for a sext used by ret in a function which is already
signext? It seems like you could match this in the SDAG and fix it very
directly?

> One issue that might come up is that 64-bit operations might be more
> expensive (either in latency or in throughput) than the corresponding
> 32-bit ones (I'm thinking specifically about multiplication and division
> here, but if we're spilling a lot, there could also be stack-size and
> bandwidth effects), and we might want some target input on that.

Oh, it's worse than this. I'm no longer liking #1 at all.

If we hoist the extension over math which has flags allowing us to do so
safely, we actually lose significant information. The wider the type on
which we have the NSW flag, the less information we have about the known
value range of the inputs. While this may not have a significant impact in
many benchmarks, it seems like a very worrisome change to make when picking
the canonical form because we essentially say that we will *never* be able
to preserve this information even if some day it matters a lot. =/
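As a tiny sketch of the loss (illustrative values only):

  %a = add nsw i32 %x, 1        ; the type alone tells every user that %a fits in 32 bits
  %a.64 = add nsw i64 %x.64, 1  ; nsw now only says the i64 add doesn't wrap; the fact that
                                ; the value still fits in 32 bits has to be rediscovered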

I'm increasingly interested in getting NSW to reach the SDAG so that we can
actually do the extension when we know we have a fast 64-bit operation
available in the form of an addressing mode.

Chandler Carruth wrote:
> [...]
> 1) Canonicalize the other way in the IR. This could take a few different forms:
> 1a) Widen math (where NSW or NUW allows) which is only used by an *ext feeding a GEP index.
> 1b) Widen all math which is only used by an *ext that targets a legal integer size.
>
> 2) Add the NSW (and NUW) bits to the DAG, and query them in the address mode matching to match across (sext (op nsw ...)) and (zext (op nuw ...))
>
> 3) Add a complete hack to CodeGenPrepare (or similar) which does #1 as a non-canonical step in the IR.
Here are some of my current perspectives that are *loosely* related to the issue. All FWIW. There's no need to debate them in this thread other than as they apply to the problem at hand.

- I agree that promotion to 64-bit loses information and should only be done when target information tells us that it's good for codegen.

- I personally have no problem using "indvars" as a codegen prepare pass for any noncanonical target-specific transforms that require SCEV. The only canonicalization that it usually accomplishes is replacement of loop exit values. The sign/zero extend elimination (widening) and LFTR are really codegen optimizations that should be deferred until the last round of loop opts. I don't see it as being any different in this respect than LSR and vectorization. Unfortunately we rerun canonical InstCombine after indvars, which can undo any target-specific goodness. But when that becomes a serious problem we should just fix that too. If we really need to replace loop exit values earlier, before other canonical passes, we can split up indvars.

- I know we have considered adding NSW to SelectionDAG in the past and I'm not opposed to it in principle, but it was difficult to justify adding the complexity. I don't remember all the arguments against it now. There are some obvious issues: (1) it's not clear how to cleanly implement it. (2) I would be concerned about relying on preservation of such a brittle attribute so late. It seems easy for target-specific DAG combines to either drop or misuse the NSW flag. (3) Determining where to widen values may require global analysis.

- I know that the problem of determining when values should be promoted to a wider type and which uses to replace doesn't have a general solution that is practical to implement. We can easily end up with more register pressure and extra copies that the coalescer is not smart about removing or optimizing. It seems wise to tackle a narrow subset of the problem (e.g. codegen of specific patterns that match addressing modes). Quentin probably has more insight here.

> I dislike #3 intensely -- we really want to be able to combine on this stuff in a canonical way, so I want the pattern to either always be (gep ... (sext (op nsw ...))) or (gep ... (op nsw (sext ...))). Also, it feels like quite a hack around the lack of fidelity in the backend.

I understand why this sort of hack is so terrible early in the IR pass pipeline (especially before inlining). But I think it is already the norm for anyone seriously optimizing for a specific target to manipulate the IR (here I do welcome others to confirm or refute this point). Since we can't do anything to change that, let's just declare that those codegen IR transforms are part of the "backend" and be done. Naturally, the more a target customizes codegen preparation, the less DAGCombine can be reused.

That said, if NSW can be added to SelectionDAG without too many issues, then it may be a moot point.

-Andy

Andy wrote:
> I understand why this sort of hack is so terrible early in the IR pass pipeline (especially before inlining). But I think it is already the norm for anyone seriously optimizing for a specific target to manipulate the IR (here I do welcome others to confirm or refute this point). Since we can't do anything to change that, let's just declare that those codegen IR transforms are part of the "backend" and be done. Naturally, the more a target customizes codegen preparation, the less DAGCombine can be reused.

Yes. That’s CodeGenPrepare in a nutshell, really. I don’t think anyone claims CGP is a good thing, merely that with our existing infrastructure, it’s a necessary and effective thing.

-Jim

Hi,

Jim wrote:
> Yes. That’s CodeGenPrepare in a nutshell, really. I don’t think anyone claims CGP is a good thing, merely that with our existing infrastructure, it’s a necessary and effective thing.

I agree with Andy and Jim.
Actually, I have prototyped a compiler that does exactly this kind of promotion in CodeGenPrepare.
Basically, I have updated the addressing mode matcher so that it moves a sext that is in the way of an addressing mode (i.e., it promotes the operand of the sext, let us call this operand def, if it is legal to do so, and sign extends the operands of def). When the matcher does not manage to absorb more computation after promoting def, it can revert the promotion.
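Roughly, on the IR from the start of this thread, the rewrite is (a sketch only; I picked a concrete offset of 4 and illustrative value names):

  ; before: the sext hides the add from the addressing mode matcher
  %a = add nsw i32 %i, 4
  %b = sext i32 %a to i64
  %p = getelementptr inbounds i8* %base, i64 %b
  %v = load i8* %p

  ; after promoting the operand of the sext (def is the add):
  %i.64 = sext i32 %i to i64
  %a.64 = add nsw i64 %i.64, 4
  %p.2 = getelementptr inbounds i8* %base, i64 %a.64
  %v.2 = load i8* %p.2              ; the matcher can now fold %base + %i.64 + 4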

I am currently benchmarking this solution and I’ll update this thread with the results.

Cheers,
-Quentin

Very cool. Could you share the patch? I can also run some benchmarks.

Sure, but be aware that this is a prototype (not structured, not commented) :). In particular, it is missing the rollback if the promotion does not help.

-Quentin

sext_x86.patch (11.7 KB)

From: "Chandler Carruth" <chandlerc@gmail.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Chandler Carruth" <chandlerc@gmail.com>, "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Monday, January 6, 2014 7:44:30 PM
Subject: Re: [LLVMdev] A question about everyone's favorite constructs: NSW and NUW

> From: "Chandler Carruth" < chandlerc@gmail.com >
> To: "LLVM Developers Mailing List" < llvmdev@cs.uiuc.edu >
> Sent: Saturday, January 4, 2014 7:29:07 PM
> Subject: [LLVMdev] A question about everyone's favorite constructs:
> NSW and NUW
>
>
>
> So, I know there are a lot of mixed feelings about NSW and NUW, but
> I'm running into a common pattern where they really help:
>
>
> void f(char *array, int i) {
> // do some stuff...
> g(array[i++]);
>
>
> // do some more stuff...
> g(array[i++]);
>
>
> // more of the same
> }
>
>
> So, this kind of code comes up pretty frequently. Unrolled loops
> with
> an int IV[1], encoding, decoding, compression, etc. What we'd like
> to do is take the sequence of N increments and turn them into N
> addressing modes with a constant offset. Then add a small constant
> at the end and test the flags if we're in a loop, etc. This is
> *much
> faster* than clobbering registers to do LEA or inc or add for each
> one (and yea, we do LEA and use lots of registers currently).
>
>
> The problem is that this is being compiled in 64-bit mode. So the
> index is 32-bits and gets sign extended prior to the GEP. So we end
> up with a lot of repetitions of:
>
>
> %a = add nsw i32 %i, N # for some immediate small integer N
> %b = sext i32 %i to i64
> %ptr = gep inbounds i8* %base, i64 %b
>
>
> When we try to match this pattern in the backend to an x86
> addressing
> mode we fail because we can't look through the sext. This is
> particularly frustrating because we *should* be able to look
> through
> the sext due to the NSW bit on the add, but we don't preserve NSW
> constraints into the DAG.
>
>
> How can we fix this? I see the following options:
>
>
> 1) Canonicalize the other way in the IR. This could take a few
> different forms:
> 1a) Widen math (where NSW or NUW allows) which is only used by an
> *ext feeding a GEP index.
> 1b) Widen all math which is only used by an *ext that targets a
> legal
> integer size.

> > I'm in favor of having NSW/NUW information available at the SDAG
> > level (and even at the MI level), but for other reasons.
>
> Cool. Could you suggest how best to implement this (especially in the
> SDAG)? I don't like my ideas thus far.

I'm not sure that I have anything but the straightforward solution: Add a flags variable (and a few methods) to BinarySDNode; then go through all of the SDAG code with a fine-toothed comb, propagating the flags where appropriate.

If we're only concerned about operations that are directly represented in the IR, we could store a Value * to the original instruction (as we do for MMO). While this has the benefit of allowing for some global analysis within the SDAG, it might be too limiting.

> > For this problem, I vote for solution 1b. My rationale is that this
> > would solve a problem that I currently have in the PowerPC backend
> > for 64-bit targets in a general way: [...]
> >
> > One possible solution to this that I've thought about (although not
> > yet experimented with) is 'promoting' all 32-bit operations to
> > 64-bit ones up front. That, however, might lead to a bunch of
> > unnecessary extensions of its own. I think that the 1b solution
> > would nicely solve this problem without introducing unnecessary
> > extensions.
>
> It feels like there are way better solutions to this. Why can't you
> just fix this by looking for a sext used by ret in a function which
> is already signext? It seems like you could match this in the SDAG
> and fix it very directly?

I can do this, and will (I can test the # of sign bits or known masked bits and eliminate the extension in a fairly general way). Unfortunately, this only solves the problem in a single-BB sense (and, thus, can be defeated by an only-slightly-more-complicated example). A more general solution will need to involve an IR-level transformation as well.

> > One issue that might come up is that 64-bit operations might be
> > more expensive (either in latency or in throughput) than the
> > corresponding 32-bit ones [...]
>
> Oh, it's worse than this. I'm no longer liking #1 at all.
>
> If we hoist the extension over math which has flags allowing us to do
> so safely, we actually lose significant information. The wider the
> type on which we have the NSW flag, the less information we have
> about the known value range of the inputs. [...]

Good point.

-Hal

Hi Chandler,

Here is an updated patch that I plan to submit for review.
In particular, it has the rollback mechanism for when the addressing mode is not profitable.

Could you run some benchmarks with it?

Thanks,

-Quentin

sext_with_restore_x86.patch (47.2 KB)

I’ve not yet looked at the code, but just to give you somewhat raw numbers:

Across all of our benchmarks there was no overall change outside of the noise. There is one benchmark that may have been improved, but I’m not 100% certain it’ll be reproducible as opposed to a fluke run.

While this isn’t fantastic news, on the flip side, nothing regressed, so that’s a really good sign. Also, no size regressions to speak of (0.02%? yea, no one will care).

Hi Chandler,

> I’ve not yet looked at the code, but just to give you somewhat raw numbers:

Thanks a lot for your measurements!

> Across all of our benchmarks there was no overall change outside of the noise. There is one benchmark that may have been improved, but I’m not 100% certain it’ll be reproducible as opposed to a fluke run.

That was expected, as this patch is intended to help code that runs on 64-bit architectures but is not 64-bit friendly in its array accesses. I believe most of your code base is not in that shape :).

> While this isn’t fantastic news, on the flip side, nothing regressed, so that’s a really good sign. Also, no size regressions to speak of (0.02%? yea, no one will care).

That’s good news, because on our side we are seeing good performance improvements in the LLVM test-suite and some external benchmarks like SPEC.

From a performance stance I think this patch looks fine if you guys are seeing significant improvements.

I am still investigating our numbers because we are seeing some regressions that I need to confirm (so far, I have only seen false positives).

Anyway, thanks again!
-Quentin