SROA and volatile memcpy/memset

Hi,

I have a customer testcase where SROA splits a volatile memcpy and we end up generating bad code[1]. While this looks like a bug, simply preventing SROA from splitting volatile memory intrinsics causes SROA's basictest.ll to fail. Moreover, the handling of volatile memory transfers appears to have been done with some intent.

What are the design decisions in SROA regarding handling of volatile memcpy/memset?

[1] In our applications, volatile objects are in most cases used to communicate with external devices. Both the address and the transfer size must match what the device expects, so breaking volatile memcpy/memset into smaller pieces must be done very carefully, if at all. Target-independent transformations generally aren't expected to have that knowledge, so it would be preferable that they leave such memory traffic alone.

-Krzysztof

There is no such thing as a volatile memcpy or memset in standard ISO C,
so what exactly are you doing and why do you expect it to work that way?

Joerg

The motivating example has an aggregate copy where the aggregate is volatile, followed by a store to one of its members. (This does not have anything to do with devices.) SROA expanded this into a series of volatile loads and stores, which cannot be coalesced back into fewer instructions. This is clearly worse than doing the copy and then the member overwrite.

--- test.c ---
typedef struct {
   volatile unsigned int value;
} atomic_word_t;

typedef union {
   struct {
     unsigned char state;
     unsigned char priority;
   };
   atomic_word_t atomic;
   unsigned int full;
} mystruct_t;

mystruct_t a;

unsigned int foo(void) {
   mystruct_t x;
   mystruct_t y;

   x.full = a.atomic.value;
   y = x;
   y.priority = 7;

   return y.full;
}

But a copy and overwrite would violate the volatile rules?

Joerg

The copy and overwrite were both in the original code. SROA tries to optimize the overwrite away, but in doing so, it breaks up the memcpy into several instructions. Now, because they are all volatile, we can't do much to optimize them.

-Krzysztof

From: "Krzysztof Parzyszek via llvm-dev" <llvm-dev@lists.llvm.org>
To: llvm-dev@lists.llvm.org
Sent: Tuesday, November 10, 2015 1:22:57 PM
Subject: Re: [llvm-dev] SROA and volatile memcpy/memset

SROA seems to be doing a number of things here. What if we prevented SROA from splitting volatile accesses into multiple slices? There may be a significant difference between that and something like this test (test/Transforms/SROA/basictest.ll):

define i32 @test6() {
; CHECK-LABEL: @test6(
; CHECK: alloca i32
; CHECK-NEXT: store volatile i32
; CHECK-NEXT: load i32, i32*
; CHECK-NEXT: ret i32

entry:
  %a = alloca [4 x i8]
  %ptr = getelementptr [4 x i8], [4 x i8]* %a, i32 0, i32 0
  call void @llvm.memset.p0i8.i32(i8* %ptr, i8 42, i32 4, i32 1, i1 true)
  %iptr = bitcast i8* %ptr to i32*
  %val = load i32, i32* %iptr
  ret i32 %val
}

-Hal

Yes, that would work.

-Krzysztof

So, here is the model that LLVM is using: a volatile memcpy is lowered to a loop of loads and stores of indeterminate width. As such, splitting a memcpy is always valid.

If we want a very specific load and store width for volatile accesses, I think the frontend should generate concrete loads and stores of a type with that width. Ultimately, memcpy is a pretty bad model for specific-width accesses; it is best at handling indeterminate-sized accesses, which is exactly what doesn't make sense for device-backed volatile accesses.

Yeah, the remark about devices I made in my post was a result of a "last-minute" thought to add some rationale. It doesn't actually apply to SROA, since there are no devices that are mapped to the stack, which is what SROA is interested in.

The concern with the testcase I attached is really about performance. Would it be reasonable to control the splitting in SROA via TTI?

-Krzysztof

From: "Krzysztof Parzyszek" <kparzysz@codeaurora.org>
To: "Chandler Carruth" <chandlerc@gmail.com>, "Hal Finkel" <hfinkel@anl.gov>
Cc: llvm-dev@lists.llvm.org
Sent: Wednesday, November 11, 2015 9:34:01 AM
Subject: Re: [llvm-dev] SROA and volatile memcpy/memset

How so?

-Hal

I’m pretty sure volatile access voids your performance warranty…

I assume the issue is that the loads and stores aren’t combined late in the back end because we propagate the volatile? I think the fix for performance is “don’t use volatile”. I’m sure you’ve looked at that option, but we’ll need a lot more context on what problem you’re actually hitting to provide more realistic options.

I think TTI is a very bad fit here – target customization would really hurt the entire canonicalization efforts of the middle end…

I'm not sure which part you are referring to. The "volatileness" of the structure in question does not place the same restrictions on how we can access it as a device access would. The broken-up loads and stores are legal in the sense that they won't cause any hardware issues; however, they take longer to execute because the resulting instructions are marked volatile and thus "non-optimizable".

-Krzysztof

From: "Krzysztof Parzyszek" <kparzysz@codeaurora.org>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: llvm-dev@lists.llvm.org, "Chandler Carruth" <chandlerc@gmail.com>
Sent: Wednesday, November 11, 2015 9:54:54 AM
Subject: Re: [llvm-dev] SROA and volatile memcpy/memset

How would TTI be used? DataLayout, for example, already has information on legal/preferred integer sizes. Would that help?

-Hal

> I'm pretty sure volatile access voids your performance warranty....
>
> I assume the issue is that the loads and stores aren't combined late in
> the back end because we propagate the volatile? I think the fix for
> performance is "don't use volatile". I'm sure you've looked at that
> option, but we'll need a lot more context on what problem you're
> actually hitting to provide more realistic options.

The testcase is a synthetic scenario that a customer gave us. I don't have much more insight into it than what I can get by just looking at it. I can try to find out more about the origin of it.

> I think TTI is a very bad fit here -- target customization would really
> hurt the entire canonicalization efforts of the middle end....

I see. Hmm.

-Krzysztof

I was thinking about having some information about the preferred treatment of volatile memory intrinsics. The problem in this case is that the aggregate is a union, which would make it harder to decide how the accesses should be formed.

-Krzysztof

The customer has a workaround that avoids this issue, so this is no longer a problem.

Thanks for all the input.

-Krzysztof