Dead store elimination in the backend for -ftrivial-auto-var-init

Hi folks,

When compiling the attached example with -ftrivial-auto-var-init=zero:

$ clang -no-integrated-as -mno-sse -m64 -mstack-alignment=8 -O2
-ftrivial-auto-var-init=zero
-enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang
-g -o ipt.ll -c ipt.i -w -S -emit-llvm

, Clang generates an initialization memset() call for |acpar| in the IR:

    %0 = bitcast %struct.xt_action_param* %acpar to i8*, !dbg !27
    call void @llvm.memset.p0i8.i64(i8* nonnull align 8 %0, i8 0, i64
40, i1 false), !dbg !28

Clang only splits memsets into series of stores under certain
conditions, so it's hard to remove redundant stores on the IR level
(even with the recent DSE patches: ⚙ D61879 WIP: Prototype of DSE optimizations for -ftrivial-auto-var-init).
In the resulting assembly, however, the memset is lowered into a
series of MOVQ instructions (see the attached ipt.s file), of which at
least the stores to 24(%rsp) and 16(%rsp) are redundant.

Given that at MCInst level we already know if a memset is split into a
sequence of stores, can we do a better job by making the backend
delete these redundant stores for us?

Alex

ipt.i (461 Bytes)

ipt.s (1.84 KB)

Hi Alexander,

The code doesn’t compile. Could you send a godbolt.org link that shows the issue?

Thanks,

JF

Hi Alexander,

The code doesn’t compile. Could you send a godbolt.org link that shows the issue?

Sorry about that, here's the link: https://godbolt.org/z/-PinQP

Lines 4 to 8 are initializing |acpar|.
If I'm understanding correctly, the store to 8(%rsp) at line 7 can be
removed because of the store at line 35.
Same for store to 24(%rsp), which is later overwritten at lines 11 and 16.

Hi Alexander,

The code doesn’t compile. Could you send a godbolt.org link that shows the issue?

Sorry about that, here's the link: https://godbolt.org/z/-PinQP

Lines 4 to 8 are initializing |acpar|.
If I'm understanding correctly, the store to 8(%rsp) at line 7 can be
removed because of the store at line 35.

These are in different basic blocks. IIRC there are a bunch of cases where DCE just fails when there’s any control flow. I think this is what you’re seeing here.

Same for store to 24(%rsp), which is later overwritten at lines 11 and 16.

I’m not sure what we should do here… If it were a constant store we’d fold it in, but here both stores are variables so it’s not clear that we really should be doing much in MemCpyOptimizer.
The problem I see here is: there’s a small memset, followed by variable stores to the same location. DCE’ing requires realizing that the memset is small enough that the backend will expand it, at which point DCE will kick in. MemCpyOptimizer doesn’t know what “small” is, partly because it’s based on alignment and optimization level. Maybe it can be a target hook (“what would you do with this memset?”), that MemCpyOptimizer uses when it sees a memset followed by variable stores to the same location? We already fold in constant stores (replace the m() call by one)… but again this isn’t clearly a win: if you enable SSE you’ll get wide stores, followed by stores of smaller size. With SSE I’m not sure I’d want us to remove the MOVAPS instructions.

>
>>
>> Hi Alexander,
>>
>> The code doesn’t compile. Could you send a godbolt.org link that shows the issue?
> Sorry about that, here's the link: https://godbolt.org/z/-PinQP
>
> Lines 4 to 8 are initializing |acpar|.
> If I'm understanding correctly, the store to 8(%rsp) at line 7 can be
> removed because of the store at line 35.

These are in different basic blocks. IIRC there are a bunch of cases where DCE just fails when there’s any control flow. I think this is what you’re seeing here.

I think Vitaly's patches (https://reviews.llvm.org/D61879) should
allow DSE to work across basic blocks. But, as you mentioned, the
problem is that we only have a memset at IR level.

> Same for store to 24(%rsp), which is later overwritten at lines 11 and 16.

I’m not sure what we should do here… If it were a constant store we’d fold it in, but here both stores are variables so it’s not clear that we really should be doing much in MemCpyOptimizer.
The problem I see here is: there’s a small memset, followed by variable stores to the same location. DCE’ing requires realizing that the memset is small enough that the backend will expand it, at which point DCE will kick in. MemCpyOptimizer doesn’t know what “small” is, partly because it’s based on alignment and optimization level. Maybe it can be a target hook (“what would you do with this memset?”), that MemCpyOptimizer uses when it sees a memset followed by variable stores to the same location? We already fold in constant stores (replace the m() call by one)… but again this isn’t clearly a win: if you enable SSE you’ll get wide stores, followed by stores of smaller size. With SSE I’m not sure I’d want us to remove the MOVAPS instructions.

Yes, this depends very much on the backend flags, therefore it might
be too early to lower that memset before DSE.
I was thinking about some kind of a peephole optimization, that
deletes stores to the same memory location (at least for %rsp-based
addresses) in the backend.
I suspect AArch64 has something like this already, at least we don't
see these redundant stores there.

On a first look case like this should nor be a problem. The tail of the memset here is unused because it’s replaced immediately without reads . So it can be trimmed. I will try the patch and let you know.

There are two problems:

  1. padding after union and call to q(), without LTO we can’t remove that store.
  2. shortcut which I have which ignores all instructions q() . this assume that memset to acpar.match, acpar.matchinfo also useful which is not true. I should be able to improve this case.

I’ve updated the patch to produce this code:

.text
.file “auto-var-init-dse.c”
.globl ipt_do_table # – Begin function ipt_do_table
.p2align 4, 0x90
.type ipt_do_table,@function
ipt_do_table: # @ipt_do_table
.cfi_startproc

%bb.0: # %entry

pushq %rbx
.cfi_def_cfa_offset 16
subq $40, %rsp
.cfi_def_cfa_offset 56
.cfi_offset %rbx, -16
movl $0, 4(%rsp)
movl $0, 33(%rsp)
movl $0, 36(%rsp)
xorl %eax, %eax
callq m

These movl can’t be removed at it’s paddings never initialized here, and we don’t know if q() will read them.
However these cases relatively rare, on Chromium I see less than 0.01% binary size improvement.

I've updated the patch to produce this code:

  .text
  .file "auto-var-init-dse.c"
  .globl ipt_do_table # -- Begin function ipt_do_table
  .p2align 4, 0x90
  .type ipt_do_table,@function
ipt_do_table: # @ipt_do_table
  .cfi_startproc
# %bb.0: # %entry
  pushq %rbx
  .cfi_def_cfa_offset 16
  subq $40, %rsp
  .cfi_def_cfa_offset 56
  .cfi_offset %rbx, -16
  movl $0, 4(%rsp)
  movl $0, 33(%rsp)
  movl $0, 36(%rsp)
  xorl %eax, %eax
  callq m
....

Great!
I'll try to give these patches a shot this week.

These movl can't be removed at it's paddings never initialized here, and we don't know if q() will read them.
However these cases relatively rare, on Chromium I see less than 0.01% binary size improvement.

There's less than 50 functions consuming >1% performance in the
kernel, but saving a couple of instructions here and there can give a
notable performance boost.