[RFC] Adding range metadata to array subscripts.

Hi everyone,

tl;dr: I would like to teach clang to output range metadata so that LLVM can do better alias analysis. I have a proposal as D99248 (clang part) and D99247 (llvm part). But there are other possible options that I’m detailing below.

Consider the following code, adapted from brotli:


struct Histogram {

int values[256];

int total;

};

Histogram DoIt(const int* image, int size) {

Histogram histogram;

for (int i = 0; i < size; ++i) {

++histogram.values[image[i]]; // (A)

++histogram.total; // (B)

}

return histogram;

}

In this code, the compiler does not know anything about the values of images[i], so it assumes that 256 is a possible value for it. In that case, (A) would change the value of histogram.total, so (B) has to load, add one and store [godbolt].

Fortunately, C/C++ has a rule that it is invalid (actually, UB) to use values to form a pointer to total and dereference it. What valid C/C++ code is allowed to do with values is:

  • Form any pointer in [values, values + 256].
  • Form and dereference any pointer in [values, values + 256)

Note that the LLVM memory model is much laxer than that of C/C++. It has no notion of types. In particular, given an LLVM aggregate definition:

%struct.S = type { [42 x i32], i32, i32 }

It is perfectly valid to use an address derived from a GEP(0,0,%i) [gep reference] representing indexing into the [42 x i32] array to load the i32 member at index 2. It is also valid for %i to be 43 (though not 44 if an inbound GEP is used).
So clang has to give LLVM more information about the C/C++ rules.

IR representation:
LLVM has several ways of representing ranges of values:

  • !range metadata can be attached to integer call and load instructions to indicate the allowed range of values of the result. LLVM’s ValueTracking provides a function for querying the range for any llvm::Variable.
  • The llvm.assume intrinsic takes a boolean condition that can also be used by ValueTracking to infer range of values.
  • The inrange attribute of GEP can be used to indicate C-like semantics for the structure field marked with the inrange attribute. It can only be used for GEP constantexprs (ie.e. GEPs defined inline), but not for standalone GEPs defining instructions. relevant discussion.

Alternatives:
(1) Annotate each array subscript index value with a range, e.g.:

%i = i64 …
%ri = call i64 @llvm.annotation.i64(%index), !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32 %ri
...
!0 = !{i64 0, i64 42}

(2) (variant of 1) relax the constraint that !range metadata can only be set on call and load instructions, and set the !range metadata on the index expression. We still need annotations for function parameters though:

%i = i64 … , !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32 %i
...
!0 = !{i64 0, i64 42}

This is slightly more compact.

(3) Same as (1), with llvm.assume. This feels inferior to annotations.
(4) Extend inrange to non-constantexprs GEPs. It is unclear how this will interfere with optimizations.

On the clang side:
The clang part is quite trivial as the infrastructure is already in place to emit dynamic ubsan guards: D99248

On the LLVM Side:
Alternatives:
(A) - (annotation or assume options) Simply enable scev-aa which knows how to handle value ranges in general. IIUC it’s not enabled in clang because it has issues with invalidation when code changes, and is therefore not cacheable. This makes it too slow to be practical.
(B) - (annotation or assume options) Teach BasicAA to honor !range metadata (D99247)
(C) - (inrange option) Teach BasicAA to honor inrange attributes of GEP.

I was leaning towards (1) and (B) because:

  • BasicAA already has basic support for value range analysis (zero/nonzero), this is a small and natural extension.
  • The BasicAA improvement already benefits some existing code (as evidenced by the test changes in D99247)
  • Using range metadata rather than the inrange attribute means that BasicAA will automatically benefit from improvements in value tracking in the future.

Opinions ?

In theory, inrange should be perfect for this job. If inrange was extended to non-const geps, I think it would work for all cases you care about, right? It’s also the most compact representation, so I would prefer that solution. Better than adding a new call instruction for every single pointer arithmetic instruction.

In terms of BasicAA, supporting inrange shouldn’t be any different from !range metadata I think.

Just my 2c.

Nuno

In theory, inrange should be perfect for this job. If inrange was extended to non-const geps, I think it would work for all cases you care about, right?

Agreed, but inrange was not extended to non-const geps. I’m not sure exactly why. From what I understand of the previous discussions, there were non-trivial issues with supporting inrange for non-const geps, though it’s not exactly clear to me what these were. Peter, what was the blocker for that ?

It’s also the most compact representation, so I would prefer that solution. Better than adding a new call instruction for every single pointer arithmetic instruction.

In terms of BasicAA, supporting inrange shouldn’t be any different from !range metadata I think.

Yes, for the purpose of BasicAA inrange is a special case of range, where the range happens to be the size of the underlying inrange field.

I really like encoding more (range) information in the IR,
more thoughts inlined.

Hi everyone,

tl;dr: I would like to teach clang to output range metadata so that LLVM
can do better alias analysis. I have a proposal as D99248
<https://reviews.llvm.org/D99248> (clang part) and D99247
<https://reviews.llvm.org/D99247> (llvm part). But there are other possible
options that I'm detailing below.

Consider the following code, adapted from brotli
<https://en.wikipedia.org/wiki/Brotli>:


struct Histogram {

   int values[256];

   int total;

};

Histogram DoIt(const int* image, int size) {

   Histogram histogram;

   for (int i = 0; i < size; ++i) {

     ++histogram.values[image[i]];  // (A)

     ++histogram.total;             // (B)

   }

   return histogram;

}

In this code, the compiler does not know anything about the values of
images[i], so it assumes that 256 is a possible value for it. In that case,
(A) would change the value of histogram.total, so (B) has to load, add one
and store [godbolt <https://godbolt.org/z/KxE343>].

Fortunately, C/C++ has a rule that it is invalid (actually, UB) to use
values to form a pointer to total and dereference it. What valid C/C++ code
is allowed to do with values is:
  - Form any pointer in [values, values + 256].
  - Form and dereference any pointer in [values, values + 256)

Note that the LLVM memory model is much laxer than that of C/C++. It has no
notion of types. In particular, given an LLVM aggregate definition:

%struct.S = type { [42 x i32], i32, i32 }

It is perfectly valid to use an address derived from a GEP(0,0,%i) [gep
reference] representing indexing into the [42 x i32] array to load the i32
member at index 2. It is also valid for %i to be 43 (though not 44 if an
inbound GEP is used).
So clang has to give LLVM more information about the C/C++ rules.

*IR representation:*
LLVM has several ways of representing ranges of values:
  - *!range* metadata can be attached to integer call and load instructions
to indicate the allowed range of values of the result. LLVM's ValueTracking
provides a function for querying the range for any llvm::Variable.
  - The *llvm.assume* intrinsic takes a boolean condition that can also be
used by ValueTracking to infer range of values.
  - The *inrange* attribute of GEP can be used to indicate C-like semantics
for the structure field marked with the inrange attribute. It can only be
used for GEP constantexprs (ie.e. GEPs defined inline), but not for
standalone GEPs defining instructions. relevant discussion
<https://reviews.llvm.org/D22793?id=65626#inline-194653>.

Alternatives:
*(1) *Annotate each array subscript index value with a range, e.g.:

%i = i64 …
%ri =  call i64 @llvm.annotation.i64(%index), !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32
%ri
...
!0 = !{i64 0, i64 42}

*(2) *(variant of 1) relax the constraint that !range metadata can only be
set on call and load instructions, and set the !range metadata on the index
expression. We still need annotations for function parameters though:

%i = i64 … , !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32
%i
...
!0 = !{i64 0, i64 42}

This is slightly more compact.

*(3)* Same as (1), with llvm.assume. This feels inferior to annotations.
*(4)* Extend inrange to non-constantexprs GEPs. It is unclear how this will
interfere with optimizations.

I would very much like not to introduce another way to encode
assumptions other than `llvm.assume`. If you want to avoid the extra
instructions, use `llvm.assume(i1 true) ["range"(%val, %lb, %ub)]`,
which is in line with our move towards operand bundle use.

SCEV should be thought about this (as well), unsure what the problem
is you describe below. If BasicAA needs to know, sure.

~ Johannes

I really like encoding more (range) information in the IR,
more thoughts inlined.

Hi everyone,

tl;dr: I would like to teach clang to output range metadata so that LLVM
can do better alias analysis. I have a proposal as D99248
<https://reviews.llvm.org/D99248> (clang part) and D99247
<https://reviews.llvm.org/D99247> (llvm part). But there are other possible
options that I’m detailing below.

Consider the following code, adapted from brotli
<https://en.wikipedia.org/wiki/Brotli>:


struct Histogram {

int values[256];

int total;

};

Histogram DoIt(const int* image, int size) {

Histogram histogram;

for (int i = 0; i < size; ++i) {

++histogram.values[image[i]]; // (A)

++histogram.total; // (B)

}

return histogram;

}

In this code, the compiler does not know anything about the values of
images[i], so it assumes that 256 is a possible value for it. In that case,
(A) would change the value of histogram.total, so (B) has to load, add one
and store [godbolt <https://godbolt.org/z/KxE343>].

Fortunately, C/C++ has a rule that it is invalid (actually, UB) to use
values to form a pointer to total and dereference it. What valid C/C++ code
is allowed to do with values is:

  • Form any pointer in [values, values + 256].
  • Form and dereference any pointer in [values, values + 256)

Note that the LLVM memory model is much laxer than that of C/C++. It has no
notion of types. In particular, given an LLVM aggregate definition:

%struct.S = type { [42 x i32], i32, i32 }

It is perfectly valid to use an address derived from a GEP(0,0,%i) [gep
reference] representing indexing into the [42 x i32] array to load the i32
member at index 2. It is also valid for %i to be 43 (though not 44 if an
inbound GEP is used).
So clang has to give LLVM more information about the C/C++ rules.

IR representation:
LLVM has several ways of representing ranges of values:

  • !range metadata can be attached to integer call and load instructions
    to indicate the allowed range of values of the result. LLVM’s ValueTracking
    provides a function for querying the range for any llvm::Variable.
  • The llvm.assume intrinsic takes a boolean condition that can also be
    used by ValueTracking to infer range of values.
  • The inrange attribute of GEP can be used to indicate C-like semantics
    for the structure field marked with the inrange attribute. It can only be
    used for GEP constantexprs (ie.e. GEPs defined inline), but not for
    standalone GEPs defining instructions. relevant discussion
    <https://reviews.llvm.org/D22793?id=65626#inline-194653>.

Alternatives:
*(1) *Annotate each array subscript index value with a range, e.g.:

%i = i64 …
%ri = call i64 @llvm.annotation.i64(%index), !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32
%ri
...
!0 = !{i64 0, i64 42}

*(2) *(variant of 1) relax the constraint that !range metadata can only be
set on call and load instructions, and set the !range metadata on the index
expression. We still need annotations for function parameters though:

%i = i64 … , !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32
%i
...
!0 = !{i64 0, i64 42}

This is slightly more compact.

(3) Same as (1), with llvm.assume. This feels inferior to annotations.
(4) Extend inrange to non-constantexprs GEPs. It is unclear how this will
interfere with optimizations.

I would very much like not to introduce another way to encode
assumptions other than llvm.assume. If you want to avoid the extra
instructions, use llvm.assume(i1 true) ["range"(%val, %lb, %ub)],
which is in line with our move towards operand bundle use.

Thanks, I did not know about that. I’ve just tried it but it appears that tags have to be attribute names, and !range is not a valid attribute, it’s a metadata node. Is there a way to encode this ?

SCEV should be thought about this (as well), unsure what the problem
is you describe below. If BasicAA needs to know, sure.

scev-aa already knows how to use range information. If we add a !range metadata in clang right now and use SCEV, there is nothing to do on the LLVM side. My point was that scev-aa is not enabled in the default pipeline, so we might as well teach BasicAA about this cheap case.

Actually I think it makes sense to teach BasicAA about range information anyway (D99247) given that it could already be useful in cases like:


define dso_local void @DoIt(%struct.Histogram* noalias nocapture sret(%struct.Histogram) align 4 %0, i32 %1, **i8 zeroext %2**) local_unnamed_addr #0 {
...
**%6 = zext i8 %2 to i64**
%7 = getelementptr inbounds %struct.Histogram, %struct.Histogram* %0, i64 0, i32 0, **i64 %6**

Where ValueTracking could easily deduce that %6 is in [0,255].

I really like encoding more (range) information in the IR,
more thoughts inlined.

Hi everyone,

tl;dr: I would like to teach clang to output range metadata so that LLVM
can do better alias analysis. I have a proposal as D99248
<https://reviews.llvm.org/D99248> (clang part) and D99247
<https://reviews.llvm.org/D99247> (llvm part). But there are other

possible

options that I'm detailing below.

Consider the following code, adapted from brotli
<https://en.wikipedia.org/wiki/Brotli>:


struct Histogram {

    int values[256];

    int total;

};

Histogram DoIt(const int* image, int size) {

    Histogram histogram;

    for (int i = 0; i < size; ++i) {

      ++histogram.values[image[i]];  // (A)

      ++histogram.total;             // (B)

    }

    return histogram;

}

In this code, the compiler does not know anything about the values of
images[i], so it assumes that 256 is a possible value for it. In that

case,

(A) would change the value of histogram.total, so (B) has to load, add

one

and store [godbolt <https://godbolt.org/z/KxE343>].

Fortunately, C/C++ has a rule that it is invalid (actually, UB) to use
values to form a pointer to total and dereference it. What valid C/C++

code

is allowed to do with values is:
   - Form any pointer in [values, values + 256].
   - Form and dereference any pointer in [values, values + 256)

Note that the LLVM memory model is much laxer than that of C/C++. It has

no

notion of types. In particular, given an LLVM aggregate definition:

%struct.S = type { [42 x i32], i32, i32 }

It is perfectly valid to use an address derived from a GEP(0,0,%i) [gep
reference] representing indexing into the [42 x i32] array to load the

i32

member at index 2. It is also valid for %i to be 43 (though not 44 if an
inbound GEP is used).
So clang has to give LLVM more information about the C/C++ rules.

*IR representation:*
LLVM has several ways of representing ranges of values:
   - *!range* metadata can be attached to integer call and load

instructions

to indicate the allowed range of values of the result. LLVM's

ValueTracking

provides a function for querying the range for any llvm::Variable.
   - The *llvm.assume* intrinsic takes a boolean condition that can also

be

used by ValueTracking to infer range of values.
   - The *inrange* attribute of GEP can be used to indicate C-like

semantics

for the structure field marked with the inrange attribute. It can only be
used for GEP constantexprs (ie.e. GEPs defined inline), but not for
standalone GEPs defining instructions. relevant discussion
<https://reviews.llvm.org/D22793?id=65626#inline-194653>.

Alternatives:
*(1) *Annotate each array subscript index value with a range, e.g.:

%i = i64 …
%ri =  call i64 @llvm.annotation.i64(%index), !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0,

i32

%ri

!0 = !{i64 0, i64 42}

\*\(2\) \*\(variant of 1\) relax the constraint that \!range metadata can only

be

set on call and load instructions, and set the !range metadata on the

index

expression. We still need annotations for function parameters though:

%i = i64 … , !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0,

i32

%i

!0 = !{i64 0, i64 42}

This is slightly more compact\.

\*\(3\)\* Same as \(1\), with llvm\.assume\. This feels inferior to annotations\.
\*\(4\)\* Extend inrange to non\-constantexprs GEPs\. It is unclear how this

will

interfere with optimizations.

I would very much like not to introduce another way to encode
assumptions other than `llvm.assume`. If you want to avoid the extra
instructions, use `llvm.assume(i1 true) ["range"(%val, %lb, %ub)]`,
which is in line with our move towards operand bundle use.

Thanks, I did not know about that. I've just tried it but it appears that
tags have to be attribute names, and `!range` is not a valid attribute,
it's a metadata node. Is there a way to encode this ?

We need to get rid of that assertion. There are other non-attributes
to be used in assume operand bundles in the (near) future, so the this
work has to be done anyway.

SCEV should be thought about this (as well), unsure what the problem
is you describe below. If BasicAA needs to know, sure.

scev-aa already knows how to use range information. If we add a !range
metadata in clang right now and use SCEV, there is nothing to do on the
LLVM side. My point was that scev-aa is not enabled in the default
pipeline, so we might as well teach BasicAA about this cheap case.

Actually I think it makes sense to teach BasicAA about range information
anyway (D99247) given that it could already be useful in cases like:

define dso_local void @DoIt(%struct.Histogram* noalias nocapture sret(
%struct.Histogram) align 4 %0, i32 %1, *i8 zeroext %2*) local_unnamed_addr
#0 {
...
*%6 = zext i8 %2 to i64*
%7 = getelementptr inbounds %struct.Histogram, %struct.Histogram* %0, i64 0
, i32 0, *i64 %6*

Where ValueTracking could easily deduce that %6 is in [0,255].

Sounds reasonable.

~ Johannes

+1 on trying to use assume, rather than adding another way.

But are value ranges special for assumes, so that we need to handle them in a bundle? Is that just so we can easier skip ‘artificial’ assume users?

Cheers,
Florian

In theory, inrange should be perfect for this job. If inrange was extended to non-const geps, I think it would work for all cases you care about, right?

Agreed, but inrange was not extended to non-const geps. I’m not sure exactly why. From what I understand of the previous discussions, there were non-trivial issues with supporting inrange for non-const geps, though it’s not exactly clear to me what these were. Peter, what was the blocker for that ?

I don’t recall any non-trivial issues. Reading the previous discussion, the only issues were mechanical ones, i.e. we would need an IR representation of inrange for non-constant GEPs together with printing/parsing/bitcode support.

Peter

I really like encoding more (range) information in the IR,
more thoughts inlined.

Hi everyone,

tl;dr: I would like to teach clang to output range metadata so that LLVM
can do better alias analysis. I have a proposal as D99248
<https://reviews.llvm.org/D99248> (clang part) and D99247
<https://reviews.llvm.org/D99247> (llvm part). But there are other

possible

options that I'm detailing below.

Consider the following code, adapted from brotli
<https://en.wikipedia.org/wiki/Brotli>:


struct Histogram {

    int values[256];

    int total;

};

Histogram DoIt(const int* image, int size) {

    Histogram histogram;

    for (int i = 0; i < size; ++i) {

      ++histogram.values[image[i]];  // (A)

      ++histogram.total;             // (B)

    }

    return histogram;

}

In this code, the compiler does not know anything about the values of
images[i], so it assumes that 256 is a possible value for it. In that

case,

(A) would change the value of histogram.total, so (B) has to load, add

one

and store [godbolt <https://godbolt.org/z/KxE343>].

Fortunately, C/C++ has a rule that it is invalid (actually, UB) to use
values to form a pointer to total and dereference it. What valid C/C++

code

is allowed to do with values is:
   - Form any pointer in [values, values + 256].
   - Form and dereference any pointer in [values, values + 256)

Note that the LLVM memory model is much laxer than that of C/C++. It has

no

notion of types. In particular, given an LLVM aggregate definition:

%struct.S = type { [42 x i32], i32, i32 }

It is perfectly valid to use an address derived from a GEP(0,0,%i) [gep
reference] representing indexing into the [42 x i32] array to load the

i32

member at index 2. It is also valid for %i to be 43 (though not 44 if an
inbound GEP is used).
So clang has to give LLVM more information about the C/C++ rules.

*IR representation:*
LLVM has several ways of representing ranges of values:
   - *!range* metadata can be attached to integer call and load

instructions

to indicate the allowed range of values of the result. LLVM's

ValueTracking

provides a function for querying the range for any llvm::Variable.
   - The *llvm.assume* intrinsic takes a boolean condition that can also

be

used by ValueTracking to infer range of values.
   - The *inrange* attribute of GEP can be used to indicate C-like

semantics

for the structure field marked with the inrange attribute. It can only be
used for GEP constantexprs (ie.e. GEPs defined inline), but not for
standalone GEPs defining instructions. relevant discussion
<https://reviews.llvm.org/D22793?id=65626#inline-194653>.

Alternatives:
*(1) *Annotate each array subscript index value with a range, e.g.:

%i = i64 …
%ri =  call i64 @llvm.annotation.i64(%index), !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0,

i32

%ri

!0 = !{i64 0, i64 42}

\*\(2\) \*\(variant of 1\) relax the constraint that \!range metadata can only

be

set on call and load instructions, and set the !range metadata on the

index

expression. We still need annotations for function parameters though:

%i = i64 … , !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0,

i32

%i

!0 = !{i64 0, i64 42}

This is slightly more compact\.

\*\(3\)\* Same as \(1\), with llvm\.assume\. This feels inferior to annotations\.
\*\(4\)\* Extend inrange to non\-constantexprs GEPs\. It is unclear how this

will

interfere with optimizations.

I would very much like not to introduce another way to encode
assumptions other than `llvm.assume`. If you want to avoid the extra
instructions, use `llvm.assume(i1 true) ["range"(%val, %lb, %ub)]`,
which is in line with our move towards operand bundle use.

Thanks, I did not know about that. I've just tried it but it appears that
tags have to be attribute names, and `!range` is not a valid attribute,
it's a metadata node. Is there a way to encode this ?

We need to get rid of that assertion. There are other non-attributes
to be used in assume operand bundles in the (near) future, so the this
work has to be done anyway.

+1 on trying to use assume, rather than adding another way.

But are value ranges special for assumes, so that we need to handle them in a bundle? Is that just so we can easier skip ‘artificial’ assume users?

It would make users explicit and we will have non-attribute bundles anyway.
I find it also "conceptually nicer", would you prefer explicit instructions?

~ Johannes

To summarize here is how each proposal would look like, along with a list of pros and cons for each.

%struct.S = type { i32, [2 x i32], i32 }

inrange
define void @with_inrange(%struct.S* %s, i32 %in_array) {
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1, inrange i32 %in_array
%gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
ret void
}

(Note that this is not currently valid IR, we need to make inrange work on standalone GEP instructions)
pros:

  • compact, no extra instructions
    cons:
  • The range information is local to the GEP instruction, and cannot be easily used for other purposes, e.g. SCEV (the ValueTracker is not aware of it)

annotation with range metadata
define void @with_annotation(%struct.S* %s, i32 %index) {
%in_array = call i32 @llvm.annotation.i32(%index), !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1, i32 %in_array
%gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
ret void
}

pros:

  • a single added instruction
  • the ValueTracker already knows about range metadata
  • deriving the range in ValueTracker involves a single metadata lookup.
  • The range information will be available for other purposes (e.g. SCEV).
    cons:
  • one extra instruction
  • llvm.annotation is not widely used

assume
define void @with_assume(%struct.S* %s, i32 %in_array) {
%cmp1 = icmp sge i32 %in_array, 0
%cmp2 = icmp slt i32 %in_array, 2
%cmp = and i1 %cmp1, %cmp2
call void @llvm.assume(i1 %cmp)
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1, i32 %in_array
%gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
ret void
}

pros:

  • The range information will be available for other purposes (e.g. SCEV).

cons:

  • several extra instructions

  • the ValueTracker cannot currently derive a range from this

  • Deriving the range can be costly

assume with bundle
define void @with_assume_bundle(%struct.S* %s, i32 %in_array) {
call void @llvm.assume(i1 true) [“range”(i32 %in_array, i32 0, i32 2)]
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1, i32 %in_array
%gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
ret void
}

(Note that this is not currently valid IR, we need to make bundle tags work for non-attributes)

pros:

  • The range information will be available for other purposes (e.g. SCEV).
  • ? (see below)

cons:

  • one extra instruction

  • ? (see below)

@johannes For the last option, I’m also not sure what becomes of the “range” tag in the assume bundle once we remove the assertion ? Does it become attached to the value (%in_array), in which case we have the advantages of range metadata (option 2), or is this still attached to the assume, with the disadvantages of option 3 ?

In theory, inrange should be perfect for this job. If inrange was extended to non-const geps, I think it would work for all cases you care about, right? It’s also the most compact representation, so I would prefer that solution. Better than adding a new call instruction for every single pointer arithmetic instruction.

In terms of BasicAA, supporting inrange shouldn’t be any different from !range metadata I think.

So I tried implementing this on top of inrange, and I found out that you can only have one inrange index, which doesn’t allow us to represent access into multidimensional arrays : https://reviews.llvm.org/D99341

In theory, inrange should be perfect for this job. If inrange was extended to non-const geps, I think it would work for all cases you care about, right? It’s also the most compact representation, so I would prefer that solution. Better than adding a new call instruction for every single pointer arithmetic instruction.

In terms of BasicAA, supporting inrange shouldn’t be any different from !range metadata I think.

So I tried implementing this on top of `inrange`, and I found out that you can only have one `inrange` index, which doesn't allow us to represent access into multidimensional arrays : https://reviews.llvm.org/D99341

Are you aware of upcoming Opaque Pointers?
There won't be any multi-dimensional GEP's after that.

Just my 2c.

Nuno

From: Clement Courbet
Sent: 24 March 2021 09:15
To: llvm-dev <llvm-dev@lists.llvm.org>
Subject: [llvm-dev] [RFC] Adding range metadata to array subscripts.

Hi everyone,

tl;dr: I would like to teach clang to output range metadata so that LLVM can do better alias analysis. I have a proposal as D99248 (clang part) and D99247 (llvm part). But there are other possible options that I'm detailing below.

Consider the following code, adapted from brotli:


struct Histogram {

  int values[256];

  int total;

};

Histogram DoIt(const int* image, int size) {

  Histogram histogram;

  for (int i = 0; i < size; ++i) {

    ++histogram.values[image[i]];  // (A)

    ++histogram.total;             // (B)

  }

  return histogram;

}

In this code, the compiler does not know anything about the values of images[i], so it assumes that 256 is a possible value for it. In that case, (A) would change the value of histogram.total, so (B) has to load, add one and store [godbolt].

Fortunately, C/C++ has a rule that it is invalid (actually, UB) to use values to form a pointer to total and dereference it. What valid C/C++ code is allowed to do with values is:

- Form any pointer in [values, values + 256].
- Form and dereference any pointer in [values, values + 256)

Note that the LLVM memory model is much laxer than that of C/C++. It has no notion of types. In particular, given an LLVM aggregate definition:

%struct.S = type { [42 x i32], i32, i32 }

It is perfectly valid to use an address derived from a GEP(0,0,%i) [gep reference] representing indexing into the [42 x i32] array to load the i32 member at index 2. It is also valid for %i to be 43 (though not 44 if an inbound GEP is used).

So clang has to give LLVM more information about the C/C++ rules.

IR representation:
LLVM has several ways of representing ranges of values:
- !range metadata can be attached to integer call and load instructions to indicate the allowed range of values of the result. LLVM's ValueTracking provides a function for querying the range for any llvm::Variable.
- The llvm.assume intrinsic takes a boolean condition that can also be used by ValueTracking to infer range of values.
- The inrange attribute of GEP can be used to indicate C-like semantics for the structure field marked with the inrange attribute. It can only be used for GEP constantexprs (ie.e. GEPs defined inline), but not for standalone GEPs defining instructions. relevant discussion.

Alternatives:
(1) Annotate each array subscript index value with a range, e.g.:

%i = i64 …
%ri =  call i64 @llvm.annotation.i64(%index), !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32 %ri
...
!0 = !{i64 0, i64 42}

(2) (variant of 1) relax the constraint that !range metadata can only be set on call and load instructions, and set the !range metadata on the index expression. We still need annotations for function parameters though:

%i = i64 … , !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0, i32 %i
...
!0 = !{i64 0, i64 42}

This is slightly more compact.

(3) Same as (1), with llvm.assume. This feels inferior to annotations.
(4) Extend inrange to non-constantexprs GEPs. It is unclear how this will interfere with optimizations.

On the clang side:

The clang part is quite trivial as the infrastructure is already in place to emit dynamic ubsan guards: D99248

On the LLVM Side:

Alternatives:
(A) - (annotation or assume options) Simply enable scev-aa which knows how to handle value ranges in general. IIUC it's not enabled in clang because it has issues with invalidation when code changes, and is therefore not cacheable. This makes it too slow to be practical.
(B) - (annotation or assume options) Teach BasicAA to honor !range metadata (D99247)
(C) - (inrange option) Teach BasicAA to honor inrange attributes of GEP.

I was leaning towards (1) and (B) because:

- BasicAA already has basic support for value range analysis (zero/nonzero), this is a small and natural extension.

- The BasicAA improvement already benefits some existing code (as evidenced by the test changes in D99247)

- Using range metadata rather than the `inrange` attribute means that BasicAA will automatically benefit from improvements in value tracking in the future.

Opinions ?

Roman

In theory, inrange should be perfect for this job. If inrange was extended to non-const geps, I think it would work for all cases you care about, right? It’s also the most compact representation, so I would prefer that solution. Better than adding a new call instruction for every single pointer arithmetic instruction.

In terms of BasicAA, supporting inrange shouldn’t be any different from !range metadata I think.

So I tried implementing this on top of inrange, and I found out that you can only have one inrange index, which doesn’t allow us to represent access into multidimensional arrays : https://reviews.llvm.org/D99341
Are you aware of upcoming Opaque Pointers?
There won’t be any multi-dimensional GEP’s after that.

I’m not familiar with that ? Does that mean that any N-dimensional GEP will become a chain of N 1-dimensional GEPs ?

Hi,

there might be some overlap with this proposal:
https://lists.llvm.org/pipermail/llvm-dev/2019-July/134063.html

Michael

I think this is only for the internal representation. According to the
reference (https://llvm.org/docs/LangRef.html#id230), it is allowed
for each dimension.

Michael

To summarize here is how each proposal would look like, along with a list
of pros and cons for each.

%struct.S = type { i32, [2 x i32], i32 }

*inrange*
define void @with_inrange(%struct.S* %s, i32 %in_array) {
   %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1,
inrange i32 %in_array
   %gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
   ret void
}

(Note that this is not currently valid IR, we need to make `inrange` work
on standalone GEP instructions)
pros:
  - compact, no extra instructions
cons:
  - The range information is local to the GEP instruction, and cannot be
easily used for other purposes, e.g. SCEV (the ValueTracker is not aware of
it)

This seems fine to me, SCEV and others can learn about this.
I assume we'll want the range `llvm.assume` anyway but that doesn't
necessarily preclude this.

*annotation with range metadata*
define void @with_annotation(%struct.S* %s, i32 %index) {
   %in_array = call i32 @llvm.annotation.i32(%index), !range !0
   %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1,
i32 %in_array
   %gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
   ret void
}

pros:
  - a single added instruction
  - the ValueTracker already knows about range metadata
  - deriving the range in ValueTracker involves a single metadata lookup.
  - The range information will be available for other purposes (e.g. SCEV).
cons:
  - one extra instruction
  - llvm.annotation is not widely used

This is my least favorite. It introduces a completely new concept
for something that should be attached to the GEP or assume.

*assume*
define void @with_assume(%struct.S* %s, i32 %in_array) {
   %cmp1 = icmp sge i32 %in_array, 0
   %cmp2 = icmp slt i32 %in_array, 2
   %cmp = and i1 %cmp1, %cmp2
   call void @llvm.assume(i1 %cmp)
   %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1,
i32 %in_array
   %gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
   ret void
}

pros:
  - The range information will be available for other purposes (e.g. SCEV).
cons:
  - several extra instructions
  - the ValueTracker cannot currently derive a range from this
  - Deriving the range can be costly
<https://github.com/llvm/llvm-project/blob/dd388ba3e0b0a5f06565d0bcb6e1aebb5daac065/llvm/lib/Analysis/ValueTracking.cpp#L638>

This is the old assume way and OK. I don't think the cost of
looking at assumptions is a good argument because we should
come up with solutions there anyway. The extra instructions
and uses are the real downside here.

*assume with bundle*
define void @with_assume_bundle(%struct.S* %s, i32 %in_array) {
   call void @llvm.assume(i1 true) ["range"(i32 %in_array, i32 0, i32 2)]
   %gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 1,
i32 %in_array
   %gep2 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0
   ret void
}

(Note that this is not currently valid IR, we need to make bundle tags work
for non-attributes)
pros:
  - The range information will be available for other purposes (e.g. SCEV).
  - ? (see below)
cons:
  - one extra instruction
  - ? (see below)

@johannes For the last option, I'm also not sure what becomes of the
"range" tag in the assume bundle once we remove the assertion ? Does it
become attached to the value (%in_array), in which case we have the
advantages of range metadata (option 2), or is this still attached to the
assume, with the disadvantages of option 3 ?

So, as I said before, we'll have to make bundles work for non-attribute
tags anyway, that should not be a negative point here. The extra instruction
is there but we already know how to combine multiple `llvm.assume`s into one.

The tag of the bundle describes the kind of assumption. In general, it could
be pretty much anything, the user needs to know how to interpret the arguments.

All arguments of a bundle are "attached" to the tag, think of the tag as a
function name of an uninterpreted function. The "range" one could look like this

void range(int v, int lb, int ub) {
   __builtin_assume(lb <= v && v < ub);
}

Later I'm confident we even want such explicit assumption functions, so you can
write the following, given the "range" definition above is in the module.
`call void @llvm.assume(i1 true) ["assumption_fn"(@range, i32 %in_array, i32 0, i32 2)]`
For more justification see: https://lists.llvm.org/pipermail/llvm-dev/2019-December/137632.html

I'm not sure I understand the questions properly. It basically is range metadata attached
to an arbitrary value for a fixed program point. Could you elaborate on your comments wrt.
the connection to option 2 and 3, please.

~ Johannes

One disadvantage of using a bundle (or !range metadata) is that we treat ranges for certain values in a special way and differently to how we treat range information expressed by the user e.g. via conditions (or builtin assume). This means we have to handle multiple variants across the codebase, which can lead to situations where only one or the other is handled, which in turn can lead to surprising results (of the form: why does a transformation apply if information provided in a certain way, but does not apply of the equivalent info is provided in a different way).
Using instruction potentially also allows us to specify more complex ranges, in relation to other values.

But I realize that there are some practical consideration that make the instruction approach less appealing and I am all in favor of the more pragmatic & practical solution to start with.

Cheers,
Florian

I really like encoding more (range) information in the IR,
more thoughts inlined.

Hi everyone,

tl;dr: I would like to teach clang to output range metadata so that LLVM
can do better alias analysis. I have a proposal as D99248
<https://reviews.llvm.org/D99248> (clang part) and D99247
<https://reviews.llvm.org/D99247> (llvm part). But there are other

possible

options that I'm detailing below.

Consider the following code, adapted from brotli
<https://en.wikipedia.org/wiki/Brotli>:


struct Histogram {

    int values[256];

    int total;

};

Histogram DoIt(const int* image, int size) {

    Histogram histogram;

    for (int i = 0; i < size; ++i) {

      ++histogram.values[image[i]];  // (A)

      ++histogram.total;             // (B)

    }

    return histogram;

}

In this code, the compiler does not know anything about the values of
images[i], so it assumes that 256 is a possible value for it. In that

case,

(A) would change the value of histogram.total, so (B) has to load, add

one

and store [godbolt <https://godbolt.org/z/KxE343>].

Fortunately, C/C++ has a rule that it is invalid (actually, UB) to use
values to form a pointer to total and dereference it. What valid C/C++

code

is allowed to do with values is:
   - Form any pointer in [values, values + 256].
   - Form and dereference any pointer in [values, values + 256)

Note that the LLVM memory model is much laxer than that of C/C++. It has

no

notion of types. In particular, given an LLVM aggregate definition:

%struct.S = type { [42 x i32], i32, i32 }

It is perfectly valid to use an address derived from a GEP(0,0,%i) [gep
reference] representing indexing into the [42 x i32] array to load the

i32

member at index 2. It is also valid for %i to be 43 (though not 44 if an
inbound GEP is used).
So clang has to give LLVM more information about the C/C++ rules.

*IR representation:*
LLVM has several ways of representing ranges of values:
   - *!range* metadata can be attached to integer call and load

instructions

to indicate the allowed range of values of the result. LLVM's

ValueTracking

provides a function for querying the range for any llvm::Variable.
   - The *llvm.assume* intrinsic takes a boolean condition that can also

be

used by ValueTracking to infer range of values.
   - The *inrange* attribute of GEP can be used to indicate C-like

semantics

for the structure field marked with the inrange attribute. It can only be
used for GEP constantexprs (ie.e. GEPs defined inline), but not for
standalone GEPs defining instructions. relevant discussion
<https://reviews.llvm.org/D22793?id=65626#inline-194653>.

Alternatives:
*(1) *Annotate each array subscript index value with a range, e.g.:

%i = i64 …
%ri =  call i64 @llvm.annotation.i64(%index), !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0,

i32

%ri

!0 = !{i64 0, i64 42}

\*\(2\) \*\(variant of 1\) relax the constraint that \!range metadata can only

be

set on call and load instructions, and set the !range metadata on the

index

expression. We still need annotations for function parameters though:

%i = i64 … , !range !0
%gep1 = getelementptr inbounds %struct.S, %struct.S* %s, i64 0, i32 0,

i32

%i

!0 = !{i64 0, i64 42}

This is slightly more compact\.

\*\(3\)\* Same as \(1\), with llvm\.assume\. This feels inferior to annotations\.
\*\(4\)\* Extend inrange to non\-constantexprs GEPs\. It is unclear how this

will

interfere with optimizations.

I would very much like not to introduce another way to encode
assumptions other than `llvm.assume`. If you want to avoid the extra
instructions, use `llvm.assume(i1 true) ["range"(%val, %lb, %ub)]`,
which is in line with our move towards operand bundle use.

Thanks, I did not know about that. I've just tried it but it appears that
tags have to be attribute names, and `!range` is not a valid attribute,
it's a metadata node. Is there a way to encode this ?

We need to get rid of that assertion. There are other non-attributes
to be used in assume operand bundles in the (near) future, so the this
work has to be done anyway.

+1 on trying to use assume, rather than adding another way.

But are value ranges special for assumes, so that we need to handle them in a bundle? Is that just so we can easier skip ‘artificial’ assume users?

It would make users explicit and we will have non-attribute bundles anyway.
I find it also "conceptually nicer", would you prefer explicit instructions?

One disadvantage of using a bundle (or !range metadata) is that we treat ranges for certain values in a special way and differently to how we treat range information expressed by the user e.g. via conditions (or builtin assume).

I don't think this is necessarily accurate. We can, and already do (https://godbolt.org/z/MaMEb1Koo),
generate bundles from conditions. If we can interpret a condition, why could we not rewrite it into a bundle?
I'm not sure why this is any different form other normalization we do. (And bundles have benefits over implicit
instruction encodings, for example use tracking and #instructions.)

  This means we have to handle multiple variants across the codebase, which can lead to situations where only one or the other is handled, which in turn can lead to surprising results (of the form: why does a transformation apply if information provided in a certain way, but does not apply of the equivalent info is provided in a different way).
Using instruction potentially also allows us to specify more complex ranges, in relation to other values.

I don't see how bundles would restrict us in any way. I mean, if we want to express property XYZ for %v and %q,
`llvm.assume(i1 true) ["XYZ"(%v, %q)]`, makes it really easy and it is arguably as generic as you want it to be.

But I realize that there are some practical consideration that make the instruction approach less appealing and I am all in favor of the more pragmatic & practical solution to start with.

I think bundles, and more generic assumptions, are what we need in the future. I still believe we should use them
to encode information in assertions [0], among other things, without running into the risk of having side-effects
that influence the compilation.

~ Johannes

[0] https://lists.llvm.org/pipermail/llvm-dev/2019-December/137632.html

Hi,

char* test_fill(int size) {
  char *test1 = malloc(size)
  for (n = 0; n <= size; n++) {
    test1[n] = 'A';
  }
}

Would it be worth making the "range" information a little richer and
be able to use algebraic expressions as well as numeric ranges.
Note: the above example code has an off by one overflow, and it would
be helpful if one could catch that at compile time.
In this case, it could catch that n must be less than size, and not
less than or equal to size.
Thus putting the range value on the test1 pointer as being from
address of test1 to test1 + (size - 1)

This can only be achieved if algebraic expressions are used for
ranges, and not just constant values.
Actual use cases can get much more complicated with for example,
non-contiguous ranges. e.g. 0,1,4,5 ok, but 2,3,6,7 not ok.

Another useful thing to catch at compile time, would be a warning that
a pointer is being dereferenced, and we were not able to apply a range
expression to it. I.e. warn about unbounded dereferences.

I think it would be useful to at least consider how we would capture
this more complex range information/metadata in LLVM IR.

Kind Regards

James

Hi,

char* test_fill(int size) {
   char *test1 = malloc(size)
   for (n = 0; n <= size; n++) {
     test1[n] = 'A';
   }
}

Would it be worth making the "range" information a little richer and
be able to use algebraic expressions as well as numeric ranges.
Note: the above example code has an off by one overflow, and it would
be helpful if one could catch that at compile time.
In this case, it could catch that n must be less than size, and not
less than or equal to size.
Thus putting the range value on the test1 pointer as being from
address of test1 to test1 + (size - 1)

This can only be achieved if algebraic expressions are used for
ranges, and not just constant values.
Actual use cases can get much more complicated with for example,
non-contiguous ranges. e.g. 0,1,4,5 ok, but 2,3,6,7 not ok.

Another useful thing to catch at compile time, would be a warning that
a pointer is being dereferenced, and we were not able to apply a range
expression to it. I.e. warn about unbounded dereferences.

I think it would be useful to at least consider how we would capture
this more complex range information/metadata in LLVM IR.

I think what you want is the max object extend attribute, formerly
known as max object size when we only wanted to track an upper bound
next revision shall also include the lower one:
https://reviews.llvm.org/D87975

If we allow values instead of only constants you can "properly"
generate warnings, using SCEV to determine the range of `n` above.

That said, in operand bundles we can generally allow non-constant
values, e.g., `["range"(%p, i32 0, i32 %N)]`

~ Johannes