load widening conflicts with AddressSanitizer

Hello,

We’ve just got a bug report from Mozilla folks about AddressSanitizer false positive with -O2.
Turns out there is a conflict between load widening and AddressSanitizer.

Simple reproducer:

% cat load_widening.c && echo ========= && clang  -O2  -c  load_widening.c -flto && llvm-dis load_widening.o && cat load_widening.o.ll 
void init(char *);
int foo() {
  char a[22];
  init(a);
  return a[16] + a[21];
}

Hello,

We’ve just got a bug report from Mozilla folks about AddressSanitizer false positive with -O2.
Turns out there is a conflict between load widening and AddressSanitizer.

Simple reproducer:

% cat load_widening.c && echo ========= && clang  -O2  -c  load_widening.c -flto && llvm-dis load_widening.o && cat load_widening.o.ll 
void init(char *);
int foo() {
  char a[22];
  init(a);
  return a[16] + a[21];
}
=========
; ModuleID = 'load_widening.o'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define i32 @foo() nounwind uwtable {
entry:
  %a = alloca [22 x i8], align 16
  %arraydecay = getelementptr inbounds [22 x i8]* %a, i64 0, i64 0
  call void @init(i8* %arraydecay) nounwind
  %arrayidx = getelementptr inbounds [22 x i8]* %a, i64 0, i64 16
  %0 = bitcast i8* %arrayidx to i64*
  %1 = load i64* %0, align 16 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  %2 = trunc i64 %1 to i32
  %sext = shl i32 %2, 24
  %conv = ashr exact i32 %sext, 24
  %3 = lshr i64 %1, 16
  %.tr = trunc i64 %3 to i32
  %sext3 = ashr i32 %.tr, 24
  %add = add nsw i32 %sext3, %conv
  ret i32 %add
}

Here, the load widening replaces two 1-byte loads with one 8-byte load which partially goes out of bounds.
Since the array is 16-byte aligned, this transformation should never cause problems in regular compilation,
but it causes AddressSanitizer false positives because the generated load is in fact out of bounds.

SAFECode would have the same problem on this code as it now checks for loads and stores that “fall off” the beginning or end of a memory object.

Do we consider the above transformation legal?

I would argue that it should not be legal. We don’t actually know what comes after the 22 byte object. Is it another memory object? A memory-mapped I/O device? Unmapped memory? Padded junk space? Reading memory-mapped I/O could have nasty side effects, and accessing unmapped memory could cause the program to fault even though it was written correctly as the source-language level.

While some may consider these sorts of scenarios to be unlikely, consider the possibility that the alloca is transformed into a global variable or heap allocation. That would be a legitimate transform and makes the above scenarios more likely.

– John T.

Hello,

We’ve just got a bug report from Mozilla folks about AddressSanitizer false positive with -O2.
Turns out there is a conflict between load widening and AddressSanitizer.

Simple reproducer:

% cat load_widening.c && echo ========= && clang  -O2  -c  load_widening.c -flto && llvm-dis load_widening.o && cat load_widening.o.ll 
void init(char *);
int foo() {
  char a[22];
  init(a);
  return a[16] + a[21];
}
=========
; ModuleID = 'load_widening.o'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define i32 @foo() nounwind uwtable {
entry:
  %a = alloca [22 x i8], align 16
  %arraydecay = getelementptr inbounds [22 x i8]* %a, i64 0, i64 0
  call void @init(i8* %arraydecay) nounwind
  %arrayidx = getelementptr inbounds [22 x i8]* %a, i64 0, i64 16
  %0 = bitcast i8* %arrayidx to i64*
  %1 = load i64* %0, align 16 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  %2 = trunc i64 %1 to i32
  %sext = shl i32 %2, 24
  %conv = ashr exact i32 %sext, 24
  %3 = lshr i64 %1, 16
  %.tr = trunc i64 %3 to i32
  %sext3 = ashr i32 %.tr, 24
  %add = add nsw i32 %sext3, %conv
  ret i32 %add
}

Here, the load widening replaces two 1-byte loads with one 8-byte load which partially goes out of bounds.
Since the array is 16-byte aligned, this transformation should never cause problems in regular compilation,
but it causes AddressSanitizer false positives because the generated load is in fact out of bounds.

SAFECode would have the same problem on this code as it now checks for loads and stores that “fall off” the beginning or end of a memory object.

Do we consider the above transformation legal?

I would argue that it should not be legal. We don’t actually know what comes after the 22 byte object. Is it another memory object? A memory-mapped I/O device? Unmapped memory? Padded junk space? Reading memory-mapped I/O could have nasty side effects, and accessing unmapped memory could cause the program to fault even though it was written correctly as the source-language level.

Sorry. Typo. I meant to write “at the source-language level.”

– John T.

Having the load hit unmapped memory is impossible on common
architectures given the alignment we're talking about here. And if
memory-mapped IO comes after the memory object, the object itself also
has some sort of unusual semantics, so it should be using volatile
loads anyway.

That said, LLVM isn't actually keeping track of the "page size" (or
equivalent), so the optimizers can't actually prove this will happen.

-Eli

Hello,

We’ve just got a bug report from Mozilla folks about AddressSanitizer false
positive with -O2.
Turns out there is a conflict between load widening and AddressSanitizer.

Simple reproducer:

% cat load_widening.c && echo ========= && clang -O2 -c load_widening.c
-flto && llvm-dis load_widening.o && cat load_widening.o.ll
void init(char *);
int foo() {
char a[22];
init(a);
return a[16] a[21];
}

; ModuleID = ‘load_widening.o’
target datalayout =
“e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128”
target triple = “x86_64-unknown-linux-gnu”

define i32 @foo() nounwind uwtable {
entry:
%a = alloca [22 x i8], align 16
%arraydecay = getelementptr inbounds [22 x i8]* %a, i64 0, i64 0
call void @init(i8* %arraydecay) nounwind
%arrayidx = getelementptr inbounds [22 x i8]* %a, i64 0, i64 16
%0 = bitcast i8* %arrayidx to i64*
%1 = load i64* %0, align 16 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
%2 = trunc i64 %1 to i32
%sext = shl i32 %2, 24
%conv = ashr exact i32 %sext, 24
%3 = lshr i64 %1, 16
%.tr = trunc i64 %3 to i32
%sext3 = ashr i32 %.tr, 24
%add = add nsw i32 %sext3, %conv
ret i32 %add
}

Here, the load widening replaces two 1-byte loads with one 8-byte load which
partially goes out of bounds.
Since the array is 16-byte aligned, this transformation should never cause
problems in regular compilation,
but it causes AddressSanitizer false positives because the generated load
is in fact out of bounds.

SAFECode would have the same problem on this code as it now checks for loads
and stores that “fall off” the beginning or end of a memory object.

Do we consider the above transformation legal?

I would argue that it should not be legal. We don’t actually know what
comes after the 22 byte object. Is it another memory object? A
memory-mapped I/O device? Unmapped memory? Padded junk space? Reading
memory-mapped I/O could have nasty side effects, and accessing unmapped
memory could cause the program to fault even though it was written correctly
as the source-language level.

While some may consider these sorts of scenarios to be unlikely, consider
the possibility that the alloca is transformed into a global variable or
heap allocation. That would be a legitimate transform and makes the above
scenarios more likely.

Having the load hit unmapped memory is impossible on common
architectures given the alignment we’re talking about here. And if
memory-mapped IO comes after the memory object, the object itself also
has some sort of unusual semantics, so it should be using volatile
loads anyway.

Would would be the right way to disable load widening when AddressSanitizer (or SAFECode) is enabled?

–kcc

Do we consider the above transformation legal?

Yes, the transformation is perfectly legal for the normal compiler.

I would argue that it should not be legal. We don’t actually know what
comes after the 22 byte object. Is it another memory object? A
memory-mapped I/O device? Unmapped memory? Padded junk space? Reading
memory-mapped I/O could have nasty side effects, and accessing unmapped
memory could cause the program to fault even though it was written correctly
as the source-language level.

Device memory accesses need to be done with volatile. This can’t cause a paging problem (e.g. causing an additional page fault where none existed before) on systems that use power-of-two sized pages.

Having the load hit unmapped memory is impossible on common

architectures given the alignment we’re talking about here. And if
memory-mapped IO comes after the memory object, the object itself also
has some sort of unusual semantics, so it should be using volatile
loads anyway.

Would would be the right way to disable load widening when AddressSanitizer (or SAFECode) is enabled?

This is a good question. Would it be possible for ASan to do its instrumentation earlier? I supposed we could add a “do not widen” metadata hint on load instructions or something like that.

-Chris

This is a good question. Would it be possible for ASan to do its instrumentation earlier?

It would be possible but undesirable.
First, asan blows up the IR and running asan early will increase the compile-time.
Second, asan greatly benefits from all optimizations running before it because it needs to instrument less memory accesses.
It actually benefits from load widening too: in the test case above asan instruments only one load instead of two.

In this case, we have an array of 22 bytes which is 16-aligned.
I suspect that load widening changed the alignment of alloca instruction to make the transformation legal. Right?
Can we change the load widening algorithm to also change the size of alloca instruction to be dividable by 16?
This will solve the problem, at least the variant I observe now.

–kcc

This is a good question. Would it be possible for ASan to do its instrumentation earlier?

It would be possible but undesirable.
First, asan blows up the IR and running asan early will increase the compile-time.
Second, asan greatly benefits from all optimizations running before it because it needs to instrument less memory accesses.
It actually benefits from load widening too: in the test case above asan instruments only one load instead of two.

You certainly wouldn’t want to run asan before mem2reg/SRoA, but after that, the benefits are probably small. I’d guess that there is some non-zero value to exposing the code generated by asan to the optimizer. Have you looked at that at all?

In this case, we have an array of 22 bytes which is 16-aligned.
I suspect that load widening changed the alignment of alloca instruction to make the transformation legal. Right?
Can we change the load widening algorithm to also change the size of alloca instruction to be dividable by 16?
This will solve the problem, at least the variant I observe now.

I believe it is 16-byte aligned based on ABI requirements for x86-64, though you’re right that the optimizer will increase alignment in other cases. In any case, we don’t want to increase the size of the object, because that would prevent packing some other data in after it. For example, a 2-byte aligned 10 byte object can be placed after it in memory if we keep it 22-bytes in size.

-Chris

Do we consider the above transformation legal?

Yes, the transformation is perfectly legal for the normal compiler.

So how do you guarantee that the behavior is predictable regardless of hardware platform if you don’t define what the behavior should be?

I would argue that it should not be legal. We don’t actually know what
comes after the 22 byte object. Is it another memory object? A
memory-mapped I/O device? Unmapped memory? Padded junk space? Reading
memory-mapped I/O could have nasty side effects, and accessing unmapped
memory could cause the program to fault even though it was written correctly
as the source-language level.

Device memory accesses need to be done with volatile. This can’t cause a paging problem (e.g. causing an additional page fault where none existed before) on systems that use power-of-two sized pages.

I think people are misunderstanding my point about I/O memory. I wasn’t saying that the alloca is supposed to access I/O memory; I was saying that it is possible for I/O memory to be located contiguously after the memory object should the memory object be the last object on its memory page.

Now, after thinking about it, I realize why that can’t happen if the memory is aligned to a 16-byte boundary on most architectures. However, does load-widening actually check that the memory is 16-byte aligned? If not, then I don’t see why this can’t be a problem with memory allocated at arbitrary alignments. Furthermore, you’re making assumptions about the underlying MMU. What if you have a funky architecture that someone is porting LLVM to, or someone is using x86-32 segments in an interesting way?

If load-widening searches back through the def-use chains to check that the memory is aligned, then fixing the transform to always perform defined behavior seems easy enough: we check that the alloca is of the right alignment, and then we boost the allocated size if necessary. Since we’ll never increase the size by more than 8 bytes, this seems reasonable.

Moreover, I don’t really understand the rationale for allowing a transform to introduce undefined behavior into programs that exhibit no undefined behavior. It’s a source of subtle problems in the future when architectures change or someone does something unconventional, and it trips up memory safety tools and static analysis tools that are working correctly. The only reason I see for tolerating it is if fixing the problem would be detrimental to performance. If Kostya or I or someone else fixes load-widening so that it doesn’t introduce undefined behavior, is it going to really hurt performance?

– John T.

This is a good question. Would it be possible for ASan to do its instrumentation earlier?

It would be possible but undesirable.
First, asan blows up the IR and running asan early will increase the compile-time.
Second, asan greatly benefits from all optimizations running before it because it needs to instrument less memory accesses.
It actually benefits from load widening too: in the test case above asan instruments only one load instead of two.

You certainly wouldn’t want to run asan before mem2reg/SRoA, but after that, the benefits are probably small. I’d guess that there is some non-zero value to exposing the code generated by asan to the optimizer. Have you looked at that at all?

In this case, we have an array of 22 bytes which is 16-aligned.
I suspect that load widening changed the alignment of alloca instruction to make the transformation legal. Right?
Can we change the load widening algorithm to also change the size of alloca instruction to be dividable by 16?
This will solve the problem, at least the variant I observe now.

I believe it is 16-byte aligned based on ABI requirements for x86-64, though you’re right that the optimizer will increase alignment in other cases. In any case, we don’t want to increase the size of the object, because that would prevent packing some other data in after it. For example, a 2-byte aligned 10 byte object can be placed after it in memory if we keep it 22-bytes in size.

It seems that another option would be to check that both the alignment and size of the memory object permit safe load-widening and to not perform it on memory objects such as this 22-byte object.

Do you think such unsafe cases occur often in practice?

– John T.

Do we consider the above transformation legal?

Yes, the transformation is perfectly legal for the normal compiler.

So how do you guarantee that the behavior is predictable regardless of hardware platform if you don’t define what the behavior should be?

I’m not sure what you mean. What isn’t defined?

I would argue that it should not be legal. We don’t actually know what
comes after the 22 byte object. Is it another memory object? A
memory-mapped I/O device? Unmapped memory? Padded junk space? Reading
memory-mapped I/O could have nasty side effects, and accessing unmapped
memory could cause the program to fault even though it was written correctly
as the source-language level.

Device memory accesses need to be done with volatile. This can’t cause a paging problem (e.g. causing an additional page fault where none existed before) on systems that use power-of-two sized pages.

I think people are misunderstanding my point about I/O memory. I wasn’t saying that the alloca is supposed to access I/O memory; I was saying that it is possible for I/O memory to be located contiguously after the memory object should the memory object be the last object on its memory page.

There is no way for this transformation to introduce a page spanning load.

Now, after thinking about it, I realize why that can’t happen if the memory is aligned to a 16-byte boundary on most architectures. However, does load-widening actually check that the memory is 16-byte aligned?

Yes.

What if you have a funky architecture that someone is porting LLVM to, or someone is using x86-32 segments in an interesting way?

We’ll burn that bridge when we get to it :wink:

Moreover, I don’t really understand the rationale for allowing a transform to introduce undefined behavior into programs that exhibit no undefined behavior.

There is no undefined behavior here. This is exactly analogous to the code you get for bitfield accesses. If you have an uninitialized struct and start storing into its fields (to initialize it) you get a series of “load + mask + or + store” operations. These are loading and touching “undefined” bits in a completely defined way.

-Chris

I'm not sure what "unsafe" means in this situation, it's clearly safe. It also clearly happens in practice if it's causing you guys heartburn. :slight_smile:

-Chris

This is a good question. Would it be possible for ASan to do its instrumentation earlier?

It would be possible but undesirable.
First, asan blows up the IR and running asan early will increase the compile-time.
Second, asan greatly benefits from all optimizations running before it because it needs to instrument less memory accesses.
It actually benefits from load widening too: in the test case above asan instruments only one load instead of two.

You certainly wouldn’t want to run asan before mem2reg/SRoA, but after that, the benefits are probably small. I’d guess that there is some non-zero value to exposing the code generated by asan to the optimizer. Have you looked at that at all?

This is the usual phase ordering problem.
If asan is done early, you get all of the optimizations cleaning up after asan.
If you run asan late, you instrument a cleaner IR.
Asan should run after the loop invariant loads are hoisted up, common subexpressions eliminated, loads widened/combined, dead stores eliminated, memsets merged, etc.
Otherwise the optimizer will have a hard time optimizing the original code and the instrumentation.

I have not looked into placing asan somewhere else in LLVM (this is in my TODO list somewhere at the bottom).
But I had to place asan early in gcc (before the loop optimizations) and it was a disaster (4x compile-time slowdown, 20% poorer run-time performance).
This is not necessary relevant to LLVM of course.

In this case, we have an array of 22 bytes which is 16-aligned.
I suspect that load widening changed the alignment of alloca instruction to make the transformation legal. Right?
Can we change the load widening algorithm to also change the size of alloca instruction to be dividable by 16?
This will solve the problem, at least the variant I observe now.

I believe it is 16-byte aligned based on ABI requirements for x86-64, though you’re right that the optimizer will increase alignment in other cases. In any case, we don’t want to increase the size of the object, because that would prevent packing some other data in after it. For example, a 2-byte aligned 10 byte object can be placed after it in memory if we keep it 22-bytes in size.

ok.

I wonder if the load widening can attach some metadata to the stack objects, accesses to which it has modified?
Then asan will increase the alloca size as appropriate (it does it anyway).

Why you don’t like the idea to disable or restrict the widening when asan is on?

–kcc

In this case, we have an array of 22 bytes which is 16-aligned.
I suspect that load widening changed the alignment of alloca instruction to make the transformation legal. Right?
Can we change the load widening algorithm to also change the size of alloca instruction to be dividable by 16?
This will solve the problem, at least the variant I observe now.

I believe it is 16-byte aligned based on ABI requirements for x86-64, though you’re right that the optimizer will increase alignment in other cases. In any case, we don’t want to increase the size of the object, because that would prevent packing some other data in after it. For example, a 2-byte aligned 10 byte object can be placed after it in memory if we keep it 22-bytes in size.

ok.

I wonder if the load widening can attach some metadata to the stack objects, accesses to which it has modified?
Then asan will increase the alloca size as appropriate (it does it anyway).

This doesn’t seem like a general solution. What if there is some heap allocated buffer proven to be 16 byte aligned, but 22 bytes long? Load widening may occur, but you won’t be able to fudge the size.

Why you don’t like the idea to disable or restrict the widening when asan is on?

I agree this seems reasonable.

Reid

In this case, we have an array of 22 bytes which is 16-aligned.
I suspect that load widening changed the alignment of alloca instruction to make the transformation legal. Right?
Can we change the load widening algorithm to also change the size of alloca instruction to be dividable by 16?
This will solve the problem, at least the variant I observe now.

I believe it is 16-byte aligned based on ABI requirements for x86-64, though you’re right that the optimizer will increase alignment in other cases. In any case, we don’t want to increase the size of the object, because that would prevent packing some other data in after it. For example, a 2-byte aligned 10 byte object can be placed after it in memory if we keep it 22-bytes in size.

ok.

I wonder if the load widening can attach some metadata to the stack objects, accesses to which it has modified?
Then asan will increase the alloca size as appropriate (it does it anyway).

This doesn’t seem like a general solution. What if there is some heap allocated buffer proven to be 16 byte aligned, but 22 bytes long? Load widening may occur, but you won’t be able to fudge the size.

A 16-byte aligned heap buffer of 22 bytes is a very unlikely thing (still possible, so I agree with you)

Why you don’t like the idea to disable or restrict the widening when asan is on?

I agree this seems reasonable.

Btw, having a flag that would restrict widening will be helpful for tools like Valgrind/Memcheck and Dr.Memory.

Currently, in order to run Valgrind we have to build programs with “gcc -O1 -disable-some-more-stuff” because of a similar issue.
Valgrind does not catch stack buffer overflows so it only applies to heap.

–kcc

Do we consider the above transformation legal?

Yes, the transformation is perfectly legal for the normal compiler.

So how do you guarantee that the behavior is predictable regardless of hardware platform if you don’t define what the behavior should be?

I’m not sure what you mean. What isn’t defined?

The alloca in question allocates 22 bytes. The 64-bit load in Kostya’s original email is accessing two additional bytes past the end of the alloca (i.e., it is accessing array “elements” a[22] and a[23]). Accessing that memory with a read or write is undefined behavior. The program could fault, read zeros, read arbitrary bit patterns, etc.

In other words, the compiler is transforming this:

return a[16] + a[21];

into something like this:

unsigned long * p = &(a[16]);
unsigned long v = *p; // This accesses memory locations a[22] and a[23]; doing so is undefined behavior
(do some bit fiddling to extra a[16] and a[17] from v)

The original code is memory safe and exhibits defined behavior. You can do whatever crazy, semantics-preserving optimization you want, run it on any crazy architecture you want, and it’ll always exhibit the same behavior.

The optimized code exhibits undefined behavior. On most systems, it just reads garbage data that is ignored by the compiler, but that’s really just a side-effect of how most OS’s and architectures do things. If you do some crazy transforms or run on some obscure architecture, the optimized code may break.

[snip]

What if you have a funky architecture that someone is porting LLVM to, or someone is using x86-32 segments in an interesting way?

We’ll burn that bridge when we get to it :wink:

ASAN got burnt; SAFECode probably got burnt, too. If we work around it, some poor researcher or developer may get burnt by it, too, and spend some time figuring out why his correct-looking program is not acting properly. In other words, you’re burning someone else’s bridge.

Granted, perhaps the benefits of an incorrect optimization may outweigh the loss of using LLVM on novel systems, but are you sure that making the optimization work properly is going to be so detrimental?

Moreover, I don’t really understand the rationale for allowing a transform to introduce undefined behavior into programs that exhibit no undefined behavior.

There is no undefined behavior here. This is exactly analogous to the code you get for bitfield accesses. If you have an uninitialized struct and start storing into its fields (to initialize it) you get a series of “load + mask + or + store” operations. These are loading and touching “undefined” bits in a completely defined way.

I’ll agree that they’re both undefined behavior, but I don’t think they fall within the same category. The bit-mask initializing issue is a compromise you made because there either isn’t an alternative way to do it that has defined behavior, or such an alternative is too expensive and difficult to implement and/or use.

This appears to be a different case. Fixing the optimization looks simple enough to me (am I missing something?), and I’m not convinced that fixing it would hurt performance (although since I haven’t run an experiment, that is conjecture).

So, perhaps I should ask this: if someone took the time to fix the transform so that it checks both the alignment and the allocation size and measures the resulting performance change, how much would performance need to suffer before the cure was deemed worse than the disease?

– John T.

I'm not opposed to disabling this transformation when asan is on, we just need a clean way to express this in the IR.

-Chris

Do we consider the above transformation legal?

Yes, the transformation is perfectly legal for the normal compiler.

So how do you guarantee that the behavior is predictable regardless of hardware platform if you don’t define what the behavior should be?

I’m not sure what you mean. What isn’t defined?

The alloca in question allocates 22 bytes. The 64-bit load in Kostya’s original email is accessing two additional bytes past the end of the alloca (i.e., it is accessing array “elements” a[22] and a[23]). Accessing that memory with a read or write is undefined behavior. The program could fault, read zeros, read arbitrary bit patterns, etc.

John, I think that you are missing that these operations are fully defined by LLVM IR. I’m not sure what languages rules you are drawing these rules from, but they are not the rules of IR.

Doing this inside a compiler (the way we do) also is not invalid according to the C notions of undefined behavior, as it has the “as if” rule. I agree that doing this at the source level would be invalid.

Again, I’m not opposed to having a way to disable these transformations, we just need a clean way to express it.

-Chris

I'm not opposed to disabling this transformation when asan is on, we just need a clean way to express this in the IR.

Could clang be aware of asan being on and introduce a "please don't
widen" metadata on local variable accesses?

Alternatively, could asan analyze the IL to see which bytes are used?

-Chris

Cheers,
Rafael

Do we consider the above transformation legal?

Yes, the transformation is perfectly legal for the normal compiler.

So how do you guarantee that the behavior is predictable regardless of hardware platform if you don’t define what the behavior should be?

I’m not sure what you mean. What isn’t defined?

The alloca in question allocates 22 bytes. The 64-bit load in Kostya’s original email is accessing two additional bytes past the end of the alloca (i.e., it is accessing array “elements” a[22] and a[23]). Accessing that memory with a read or write is undefined behavior. The program could fault, read zeros, read arbitrary bit patterns, etc.

John, I think that you are missing that these operations are fully defined by LLVM IR. I’m not sure what languages rules you are drawing these rules from, but they are not the rules of IR.

I apologize for mixing C and LLVM notation.

If you want to distinguish between C and LLVM semantics, then I think load-widening in this particular case has two problems:

  1. The load-widening transform is not guaranteed to preserve the semantics of the original C program unless the OS and hardware fulfill certain assumptions. The original C program performs in-bounds memory accesses at a[16] and a[21]. The transformed equivalent does the same thing except that it also reads two bytes outside the bounds of the array a. The transformed program is only equivalent if the OS and hardware fulfill certain assumptions (which, fortunately, they do on common systems currently in use).

  2. The load-widening transform introduces behavior that, as far as I know, is undefined at the LLVM IR level. It takes an LLVM program that loads two values from the alloca’ed memory and changes it to be a single load that accesses the same memory locations plus two out-of-bounds locations. Again, if the OS and hardware fulfill certain assumptions, then the fault behavior and computation behavior of the LLVM IR before and after the optimization is the same, but if those assumptions don’t hold, then the transformed program may act differently than the original.

At the LLVM IR level, performing a load or store in which part of the accessed memory is within bounds and part of it is out of bounds is undefined, correct? At the very least, I would think that there would not be a defined behavior for such a load or store.

Am I making sense now? Is there something I’m misunderstanding here?

Doing this inside a compiler (the way we do) also is not invalid according to the C notions of undefined behavior, as it has the “as if” rule. I agree that doing this at the source level would be invalid.

Again, I’m not opposed to having a way to disable these transformations, we just need a clean way to express it.

Having a list of which optimizations are safe to run and which ones are not can become tedious. I’d rather fix the optimization so that it always preserves the semantics of the program unless there’s a very compelling reason not to do so (e.g., significant performance loss).

In this case, it seems easy: instead of requiring that “align 16” is sufficient for doing load widening, require that the memory must be marked “align 16” and the allocation size is a multiple of 8. That will force load-widening to only occur when it is safe. There’s probably other ways to fix it as well (e.g., have load-widening widen to the largest size that meets the above two conditions).

– John T.