@llvm.memcpy not honoring volatile?

The following IR, with the volatile parameter set to true,

call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 1 %0, i8* align 1 %1, i64 7, i1 true)

generates the following asm:

movl (%rsi), %eax
movl 3(%rsi), %ecx
movl %ecx, 3(%rdi)
movl %eax, (%rdi)

It performs an overlapping read/write, which - I believe - violates volatile semantics.
Full example here: https://godbolt.org/z/P_rjBT
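
(In case the link rots: a minimal module reproducing this looks roughly as follows. The wrapper function is my reconstruction; the memcpy call is the one above.)

define void @f(i8* %0, i8* %1) {
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 1 %0, i8* align 1 %1, i64 7, i1 true)
  ret void
}

declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture writeonly, i8* nocapture readonly, i64, i1)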

Is this a bug or am I misunderstanding volatile in this context?

In general, the LangRef specification of “volatile” doesn’t specify how any particular memory access will be lowered. This includes the number of memory accesses, the number of bytes accessed, and whether the accesses overlap. That’s because it’s different for every target, depending on the available instructions. The only universal guarantee is that volatile operations are emitted in source order relative to other volatile operations. (See http://llvm.org/docs/LangRef.html#volatile-memory-accesses .)

For all in-tree targets, where possible, loads and stores with an appropriate scalar type and alignment will be lowered to a single instruction. Whether this applies to any particular combination of type/alignment depends on the subtarget. For memcpy, or for loads and stores which can’t be lowered to a single instruction, we don’t promise to generate any particular sequence.
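
(To illustrate with a sketch of my own, not from the thread: a naturally aligned volatile i32 load has a single-instruction lowering on x86-64, so it gets one; a 7-byte memcpy has no single-instruction equivalent, so no analogous guarantee is possible.)

define i32 @read32(i32* %p) {
  %v = load volatile i32, i32* %p, align 4   ; a single movl on x86-64
  ret i32 %v
}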

I don’t see any particular reason to guarantee that a volatile memcpy will access each byte exactly once. How is that useful?

-Eli

I agree it's probably not that useful, but I think the non-duplicating property of volatile is ingrained strongly enough that viewing a memcpy as a single load and store to each unit (in an unspecified order) should be legitimate; so I think this actually is a bug.

As the documentation says though, it's unwise to depend on the behaviour of a volatile memcpy.

Cheers.

Tim.

Thx for the explanation, Eli and Tim.

My understanding of volatile was that you may get a different value every time you read, and as such overlapping reads may be a bug.

Now, since the behaviour of a volatile memcpy is not guaranteed, and since clang does not let you use it directly anyway, I would like to challenge its existence.
Is there a known reason for keeping the volatile argument in @llvm.memcpy?

The primary reason I don’t want to provide any guarantees for what instructions are used to implement volatile memcpy is that it would forbid lowering a volatile memcpy to a library call.

clang uses a volatile memcpy for struct assignment in C. For example, “void f(volatile struct S *p) { p[0] = p[1]; }”. It’s not really that useful, but it’s been done that way since before clang was written.
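
(To make that concrete: for such an assignment clang emits IR of roughly the following shape - a hand-written approximation with simplified names, assuming purely for illustration that S is 7 bytes.)

%struct.S = type { [7 x i8] }   ; assumed layout, for illustration only

define void @f(%struct.S* %p) {
  %src = getelementptr inbounds %struct.S, %struct.S* %p, i64 1
  %dstp = bitcast %struct.S* %p to i8*
  %srcp = bitcast %struct.S* %src to i8*
  ; the trailing i1 true is the volatile flag
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 1 %dstp, i8* align 1 %srcp, i64 7, i1 true)
  ret void
}

declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture writeonly, i8* nocapture readonly, i64, i1)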

-Eli

clang uses a volatile memcpy for struct assignment in C. For example, “void f(volatile struct S *p) { p[0] = p[1]; }”. It’s not really that useful, but it’s been done that way since before clang was written.

Thx. I guess my next question is: from the code-gen perspective, is there a difference between the volatile and non-volatile versions

void f(volatile struct S *p) { p[0] = p[1]; }

and

void f(struct S *p) { p[0] = p[1]; }

Clang could lower the volatile copy to a regular copy. And since memcpy has side effects, it would still maintain “that volatile operations are emitted in source order relative to other volatile operations.”
I tried x86_64 gcc, x86_64 clang, ARM gcc, they all generate the exact same code.

One data point: I ran check-llvm with a tweaked SelectionDAG::getMemcpy that always sets isVol to false.
Only one test fails, llvm/test/CodeGen/X86/stack-align.ll, and I think it’s a false positive.

I agree with Tim, this seems like a bug. My expectation is that volatile touches each memory location exactly once, unless absolutely impossible (e.g. bitfields on most ISAs).

I agree, this is a bug.

John

I spent some time reading the C standard:

5.1.2.3 Program execution
2. Accessing a volatile object, modifying an object, modifying a file, or calling a function that does any of those operations are all side effects, which are changes in the state of the execution environment…

6.7.3 Type qualifiers
8. An object that has volatile-qualified type may be modified in ways unknown to the implementation or have other unknown side effects. Therefore any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine, as described in 5.1.2.3. Furthermore, at every sequence point the value last stored in the object shall agree with that prescribed by the abstract machine, except as modified by the unknown factors mentioned previously. What constitutes an access to an object that has volatile-qualified type is implementation-defined.

My intuition is that it’s unreasonable to do the overlap because it may create side effects in the middle of the read. At the same time, “accessing a volatile object” is implementation-defined…
As such, “volatile” does not provide a lot of guarantees. I guess you have to resort to assembly if you want to really control what’s going on.

gcc, icc, msvc do not implement the overlap trick. They either read the memory region sequentially or use rep movs.
Let’s note that rep movs does not provide any guarantees on the memory access patterns either.

Besides, I’m not even sure that volatile in the context of the @llvm.memcpy intrinsic really relates to the C volatile semantics.

Now, I see several ways to move forward:

  1. Do nothing and call it implementation-defined.
  2. Fix the code that generates the loads/stores for the isVolatile case so no overlap can occur.
  3. Remove the “volatile” argument from @llvm.memcpy, @llvm.memmove and @llvm.memset and generate the code in the front end using volatile loads/stores.

Option 3 is probably controversial and involves a lot of changes; it would move some complexity from the backend to frontends.
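
To make option 3 concrete: the 7-byte copy from the original example could be expanded by the front end into non-overlapping 4+2+1 volatile accesses, something like the following hand-written sketch (not existing clang output):

define void @copy7(i8* %dst, i8* %src) {
  ; 4-byte chunk at offset 0
  %s32 = bitcast i8* %src to i32*
  %d32 = bitcast i8* %dst to i32*
  %a = load volatile i32, i32* %s32, align 1
  store volatile i32 %a, i32* %d32, align 1
  ; 2-byte chunk at offset 4
  %s4 = getelementptr inbounds i8, i8* %src, i64 4
  %d4 = getelementptr inbounds i8, i8* %dst, i64 4
  %s16 = bitcast i8* %s4 to i16*
  %d16 = bitcast i8* %d4 to i16*
  %b = load volatile i16, i16* %s16, align 1
  store volatile i16 %b, i16* %d16, align 1
  ; final byte at offset 6
  %s6 = getelementptr inbounds i8, i8* %src, i64 6
  %d6 = getelementptr inbounds i8, i8* %dst, i64 6
  %c = load volatile i8, i8* %s6, align 1
  store volatile i8 %c, i8* %d6, align 1
  ret void
}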

I’m interested in thoughts from other developers here.


Volatile isn’t specified in any decent normative way. There’s substantial documentation of intent, but C and C++ standards are lacking. We should therefore implement something that’s sensible and works kind-of as expected.

I’ve documented specification and intent here: wg21.link/P1152R0
If you want to follow-on (without all that documentation): wg21.link/P1152R1

I think we want option 2: keep volatile memcpy, and implement it as touching each byte exactly once. That’s unlikely to be particularly useful for direct-to-hardware uses, but it behaves intuitively enough that I think it’s desirable.

Thx for the reply.
I’ve created https://bugs.llvm.org/show_bug.cgi?id=42254 and will work on a fix.

As Eli pointed out, that precludes lowering a volatile memcpy into a call to the memcpy library function. The usual “memcpy” library function may well use the same overlapping-memory trick, and there is no “volatile_memcpy” libc function which would provide a guarantee of not touching bytes multiple times. Perhaps it’s okay to just always emit an inline loop instead of falling back to a memcpy call.
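
Such an inline loop could look roughly like this in IR - a byte-at-a-time sketch of my own that touches each byte exactly once (real codegen would presumably prefer wider non-overlapping chunks where the target allows):

define void @vcopy(i8* %dst, i8* %src, i64 %n) {
entry:
  %empty = icmp eq i64 %n, 0
  br i1 %empty, label %done, label %loop

loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %sp = getelementptr inbounds i8, i8* %src, i64 %i
  %dp = getelementptr inbounds i8, i8* %dst, i64 %i
  %b = load volatile i8, i8* %sp, align 1
  store volatile i8 %b, i8* %dp, align 1
  %i.next = add nuw i64 %i, 1
  %again = icmp ult i64 %i.next, %n
  br i1 %again, label %loop, label %done

done:
  ret void
}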

But, possibly option 3 would be better. Maybe it’s better to force people/compiler-frontends to emit the raw load/store operations, so that it’s more clear exactly what semantics are desired.

The fundamental issue to me is that for reasonable usages of volatile, the operand size and number of memory instructions generated for a given operation actually matters. Certainly, this is a somewhat unfortunate situation, since the C standard explicitly doesn’t forbid implementing any volatile access with smaller memory operations. (Which, among other issues, allows tearing as your wg21 doc nicely points out.) Nevertheless, it is an important property – required by POSIX for accesses of a volatile sig_atomic_t, even – and is a property which LLVM/Clang does provide when dealing with volatile accesses of target-specific appropriate sizes and alignments.

But, what does that mean for volatile memcpy? What size should it use? Always a byte-by-byte copy? May it do larger-sized reads/writes as well? Must it do so? Does it have to read/write the data in order? Or can it do so in reverse order? Can it use the CPU’s block-copy instructions (e.g. rep movsb on x86) which may sometimes cause effectively-arbitrarily-sized memory-ops, in arbitrary order, in hardware?

If we’re going to keep volatile memcpy support, possibly those other questions ought to be answered too?

I dunno…

I think we want option 2: keep volatile memcpy, and implement it as touching each byte exactly once. That’s unlikely to be particularly useful for direct-to-hardware uses, but it behaves intuitively enough that I think it’s desirable.

As Eli pointed out, that precludes lowering a volatile memcpy into a call to the memcpy library function. The usual “memcpy” library function may well use the same overlapping-memory trick, and there is no “volatile_memcpy” libc function which would provide a guarantee of not touching bytes multiple times. Perhaps it’s okay to just always emit an inline loop instead of falling back to a memcpy call.

In which circumstances does this matter?

But, possibly option 3 would be better. Maybe it’s better to force people/compiler-frontends to emit the raw load/store operations, so that it’s more clear exactly what semantics are desired.

The fundamental issue to me is that for reasonable usages of volatile, the operand size and number of memory instructions generated for a given operation actually matters. Certainly, this is a somewhat unfortunate situation, since the C standard explicitly doesn’t forbid implementing any volatile access with smaller memory operations. (Which, among other issues, allows tearing as your wg21 doc nicely points out.) Nevertheless, it is an important property – required by POSIX for accesses of a volatile sig_atomic_t, even – and is a property which LLVM/Clang does provide when dealing with volatile accesses of target-specific appropriate sizes and alignments.

But, what does that mean for volatile memcpy? What size should it use?

Any size that makes sense to HW.

Always a byte-by-byte copy?

It can.

May it do larger-sized reads/writes as well?

Any size, but no larger than memcpy’s specified size parameter.

Must it do so?

No, but it has to be sensible (whatever that means).

Does it have to read/write the data in order?

No.

Or can it do so in reverse order?

Yes.

Can it use the CPU’s block-copy instructions (e.g. rep movsb on x86) which may sometimes cause effectively-arbitrarily-sized memory-ops, in arbitrary order, in hardware?

Sure.

If we’re going to keep volatile memcpy support, possibly those other questions ought to be answered too?

Paul McKenney has a follow-on paper (linked from R2 of mine) which addresses some of your questions, I think. LLVM can do what it wants for now since there’s no standard, but there’s likely to be one eventually and we probably should match what it’s likely to be.

I think we want option 2: keep volatile memcpy, and implement it as touching each byte exactly once. That’s unlikely to be particularly useful for direct-to-hardware uses, but it behaves intuitively enough that I think it’s desirable.

As Eli pointed out, that precludes lowering a volatile memcpy into a call to the memcpy library function. The usual “memcpy” library function may well use the same overlapping-memory trick, and there is no “volatile_memcpy” libc function which would provide a guarantee of not touching bytes multiple times. Perhaps it’s okay to just always emit an inline loop instead of falling back to a memcpy call.

In which circumstances does this matter?

If it’s problematic to touch a byte multiple times when emitting inlined instructions for a “volatile memcpy”, surely it’s also problematic to emit a library function call which does the same thing?

But – I don’t know of any realistic circumstance where either one would be important to actual users. Someone would need to have a situation where doing 2 overlapping 4-byte writes to implement a 7-byte memcpy is problematic, but where it doesn’t matter to them what permutation of non-overlapping memory read/write sizes is used – and furthermore, where the order doesn’t matter. That seems extremely unlikely to ever be the case.

Paul McKenney has a follow-on paper (linked from R2 of mine) which addresses some of your questions, I think. LLVM can do what it wants for now since there’s no standard, but there’s likely to be one eventually and we probably should match what it’s likely to be.

I agree, Paul’s paper describes the actually-required (vs C-standard-required) semantics for volatile loads and stores today – that they must use non-tearing operations for sizes/alignments where the hardware provides such. (IMO, any usage of volatile where that cannot be done is extremely questionable). Of course, that doesn’t say anything about memcpy, since volatile memcpy isn’t part of C, just part of LLVM.

Always a byte-by-byte copy?

It can.

So, why does llvm even provide a volatile memcpy intrinsic? One possible answer is that it was needed in order to implement volatile aggregate copies generated by the C frontend. So, given the real-world requirement to use single instructions where possible… what about this code:

struct X { int n; };
void foo(volatile struct X *n) {
  n[0] = n[1];
}

Clang implements it by creating a volatile llvm.memcpy call, which currently is generally lowered to a 32-bit read/write. Maybe it should be required to always emit a 32-bit read/write instruction, just as if you were directly operating on a ‘volatile int *n’? (Assuming a 32-bit platform which has such instructions, of course.)
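
Concretely, that requirement would mean lowering foo exactly as if it were a direct volatile int load/store pair - a hand-written sketch of the desired result, not what LLVM produces today:

%struct.X = type { i32 }

define void @foo(%struct.X* %n) {
  %src = getelementptr inbounds %struct.X, %struct.X* %n, i64 1, i32 0
  %dst = getelementptr inbounds %struct.X, %struct.X* %n, i64 0, i32 0
  %v = load volatile i32, i32* %src, align 4   ; one 32-bit read
  store volatile i32 %v, i32* %dst, align 4    ; one 32-bit write
  ret void
}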

Or – maybe memcpy is actually not a reasonable thing to use for copying a volatile struct at all. Perhaps a volatile struct copy should do volatile element-wise copies of each fundamentally-typed field? That might make some sense. (But…unions?).

I think we want option 2: keep volatile memcpy, and implement it as touching each byte exactly once. That’s unlikely to be particularly useful for direct-to-hardware uses, but it behaves intuitively enough that I think it’s desirable.

As Eli pointed out, that precludes lowering a volatile memcpy into a call to the memcpy library function. The usual “memcpy” library function may well use the same overlapping-memory trick, and there is no “volatile_memcpy” libc function which would provide a guarantee of not touching bytes multiple times. Perhaps it’s okay to just always emit an inline loop instead of falling back to a memcpy call.

In which circumstances does this matter?

If it’s problematic to touch a byte multiple times when emitting inlined instructions for a “volatile memcpy”, surely it’s also problematic to emit a library function call which does the same thing?

But – I don’t know of any realistic circumstance where either one would be important to actual users. Someone would need to have a situation where doing 2 overlapping 4-byte writes to implement a 7-byte memcpy is problematic, but where it doesn’t matter to them what permutation of non-overlapping memory read/write sizes is used – and furthermore, where the order doesn’t matter. That seems extremely unlikely to ever be the case.

Agreed, but that’s the stated intended behavior of volatile. Makes sense for hardware, weird otherwise, but we don’t need to do it any other way. I could construct a case where volatile is used in a signal handler, and where a partial result with overlap breaks expectations, but… I agree it’s unlikely.

Paul McKenney has a follow-on paper (linked from R2 of mine) which addresses some of your questions, I think. LLVM can do what it wants for now since there’s no standard, but there’s likely to be one eventually and we probably should match what it’s likely to be.

I agree, Paul’s paper describes the actually-required (vs C-standard-required) semantics for volatile loads and stores today – that they must use non-tearing operations for sizes/alignments where the hardware provides such. (IMO, any usage of volatile where that cannot be done is extremely questionable). Of course, that doesn’t say anything about memcpy, since volatile memcpy isn’t part of C, just part of LLVM.

Indeed. After we standardize Paul’s paper I expect to also do something like volatile memcpy based on it.

Always a byte-by-byte copy?

It can.

So, why does llvm even provide a volatile memcpy intrinsic? One possible answer is that it was needed in order to implement volatile aggregate copies generated by the C frontend. So, given the real-world requirement to use single instructions where possible… what about this code:

struct X { int n; };
void foo(volatile struct X *n) {
  n[0] = n[1];
}

Clang implements it by creating a volatile llvm.memcpy call, which currently is generally lowered to a 32-bit read/write. Maybe it should be required to always emit a 32-bit read/write instruction, just as if you were directly operating on a ‘volatile int *n’? (Assuming a 32-bit platform which has such instructions, of course.)

In general, volatile instructions should behave as expected. Of course, we can disagree on expectations :wink:
Interpreting a non-specification is wonderful.

Or – maybe memcpy is actually not a reasonable thing to use for copying a volatile struct at all. Perhaps a volatile struct copy should do volatile element-wise copies of each fundamentally-typed field? That might make some sense. (But…unions?).

I think copying a volatile struct is the unreasonable part, but that’s not super relevant to how we implement this silly thing :slight_smile:
Don’t get me started on volatile unions (and bitfields).