SROA and volatile memcpy/memset

Hi,

I have a customer testcase where SROA splits a volatile memcpy and we end up generating bad code[1]. While this looks like a bug, simply preventing SROA from splitting volatile memory intrinsics causes SROA's basictest.ll to fail. Moreover, the handling of volatile memory transfers appears to have been done with some intent.

What are the design decisions in SROA regarding handling of volatile memcpy/memset?

[1] In our applications, volatile objects are in most cases used to communicate with external devices. Both the address and the transfer size must match what the device expects, so breaking volatile memcpy/memset into smaller pieces must be done very carefully, if at all. Target-independent transformations generally aren't expected to have that knowledge, so it would be preferable that they leave such memory traffic alone.

-Krzysztof

There is no such thing as a volatile memcpy or memset in standard ISO C,
so what exactly are you doing and why do you expect it to work that way?

Joerg

The motivating example has an aggregate copy where the aggregate is volatile, followed by a store to one of its members. (This does not have anything to do with devices.) SROA expanded this into a series of volatile loads and stores, which cannot be coalesced back into fewer instructions. This is clearly worse than doing the copy and then the member overwrite.

--- test.c ---
typedef struct {
   volatile unsigned int value;
} atomic_word_t;

typedef union {
   struct {
     unsigned char state;
     unsigned char priority;
   };
   atomic_word_t atomic;
   unsigned int full;
} mystruct_t;

mystruct_t a;

unsigned int foo(void) {
   mystruct_t x;
   mystruct_t y;

   x.full = a.atomic.value;
   y = x;
   y.priority = 7;

   return y.full;
}

But a copy and overwrite would violate the volatile rules?

Joerg

The copy and overwrite were both in the original code. SROA tries to optimize the overwrite away, but in doing so, it breaks up the memcpy into several instructions. Now, because they are all volatile, we can't do much to optimize them.

-Krzysztof

From: "Krzysztof Parzyszek via llvm-dev" <llvm-dev@lists.llvm.org>
To: llvm-dev@lists.llvm.org
Sent: Tuesday, November 10, 2015 1:22:57 PM
Subject: Re: [llvm-dev] SROA and volatile memcpy/memset

SROA seems to be doing a number of things here. What if we prevented SROA from splitting volatile accesses into multiple slices? There may be a significant difference between that and something like this test (test/Transforms/SROA/basictest.ll):

define i32 @test6() {
; CHECK-LABEL: @test6(
; CHECK: alloca i32
; CHECK-NEXT: store volatile i32
; CHECK-NEXT: load i32, i32*
; CHECK-NEXT: ret i32

entry:
  %a = alloca [4 x i8]
  %ptr = getelementptr [4 x i8], [4 x i8]* %a, i32 0, i32 0
  call void @llvm.memset.p0i8.i32(i8* %ptr, i8 42, i32 4, i32 1, i1 true)
  %iptr = bitcast i8* %ptr to i32*
  %val = load i32, i32* %iptr
  ret i32 %val
}

-Hal

Yes, that would work.

-Krzysztof

So, here is the model that LLVM is using: a volatile memcpy is lowered to a loop of loads and stores of indeterminate width. As such, splitting a memcpy is always valid.

If we want a very specific load and store width for volatile accesses, I think the frontend should generate concrete loads and stores of a type with that width. Ultimately, memcpy is a pretty bad model for specific-width accesses; it is best at handling indeterminate-sized accesses, which is exactly what doesn't make sense for device-backed volatile accesses.

Yeah, the remark about devices I made in my post was a result of a "last-minute" thought to add some rationale. It doesn't actually apply to SROA, since there are no devices that are mapped to the stack, which is what SROA is interested in.

The concern with the testcase I attached is really about performance. Would it be reasonable to control the splitting in SROA via TTI?

-Krzysztof

From: "Krzysztof Parzyszek" <kparzysz@codeaurora.org>
To: "Chandler Carruth" <chandlerc@gmail.com>, "Hal Finkel" <hfinkel@anl.gov>
Cc: llvm-dev@lists.llvm.org
Sent: Wednesday, November 11, 2015 9:34:01 AM
Subject: Re: [llvm-dev] SROA and volatile memcpy/memset

How so?

-Hal

I’m pretty sure volatile access voids your performance warranty…

I assume the issue is that the loads and stores aren’t combined late in the back end because we propagate the volatile? I think the fix for performance is “don’t use volatile”. I’m sure you’ve looked at that option, but we’ll need a lot more context on what problem you’re actually hitting to provide more realistic options.

I think TTI is a very bad fit here – target customization would really hurt the entire canonicalization efforts of the middle end…

I'm not sure which part you are referring to. The "volatileness" of the structure in question does not place the same restrictions on how we can access it as a device access would. The broken-up loads and stores are legal in the sense that they won't cause any hardware issues; however, they take longer to execute because the resulting instructions are marked volatile and thus "non-optimizable".

-Krzysztof

From: "Krzysztof Parzyszek" <kparzysz@codeaurora.org>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: llvm-dev@lists.llvm.org, "Chandler Carruth" <chandlerc@gmail.com>
Sent: Wednesday, November 11, 2015 9:54:54 AM
Subject: Re: [llvm-dev] SROA and volatile memcpy/memset

How would TTI be used? DataLayout, for example, already has information on legal/preferred integer sizes. Would that help?

-Hal

> I'm pretty sure volatile access voids your performance warranty....
>
> I assume the issue is that the loads and stores aren't combined late in
> the back end because we propagate the volatile? I think the fix for
> performance is "don't use volatile". I'm sure you've looked at that
> option, but we'll need a lot more context on what problem you're
> actually hitting to provide more realistic options.

The testcase is a synthetic scenario that a customer gave us. I don't have much more insight into it than what I can get by just looking at it. I can try to find out more about the origin of it.

> I think TTI is a very bad fit here -- target customization would really
> hurt the entire canonicalization efforts of the middle end....

I see. Hmm.

-Krzysztof

I was thinking about having some information about the preferred treatment of volatile memory intrinsics. The problem in this case is that the aggregate is a union, which would make it harder to decide how the accesses should be formed.

-Krzysztof

The customer has a workaround that avoids this issue, so this is no longer a problem.

Thanks for all the input.

-Krzysztof