Byte-wide stores aren't coalesced if interspersed with other stores

Hi,

I have, in postgres, a piece of IR that, after inlining and constant
propagation, boils (when cooked on really high heat) down to the
following (also attached for your convenience):

source_filename = "pg"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

define void @evalexpr_0_0(i8* align 8 noalias, i32* align 8 noalias) {
entry:
  %a01 = getelementptr i8, i8* %0, i16 0
  store i8 0, i8* %a01

  ; in the real case this also loads data
  %b01 = getelementptr i32, i32* %1, i16 0
  store i32 0, i32* %b01

  %a02 = getelementptr i8, i8* %0, i16 1
  store i8 0, i8* %a02

  ; in the real case this also loads data
  %b02 = getelementptr i32, i32* %1, i16 1
  store i32 0, i32* %b02

  ; in the real case this also loads data
  %a03 = getelementptr i8, i8* %0, i16 2
  store i8 0, i8* %a03

  ; in the real case this also loads data
  %b03 = getelementptr i32, i32* %1, i16 2
  store i32 0, i32* %b03

  %a04 = getelementptr i8, i8* %0, i16 3
  store i8 0, i8* %a04

  ; in the real case this also loads data
  %b04 = getelementptr i32, i32* %1, i16 3
  store i32 0, i32* %b04

  ret void
}

I expected LLVM to be able to coalesce the i8 stores into a single i32
store (or into a number of i64 stores in my actual case). But it turns
out it doesn't really do so.
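
For concreteness, this is roughly the end result I was hoping for (a
hand-written sketch, not compiler output; the function name is made up):

; hand-written sketch of the hoped-for result: the four i8 stores folded
; into one i32 store, the four i32 stores folded into one <4 x i32> store
define void @evalexpr_0_0_expected(i8* align 8 noalias, i32* align 8 noalias) {
entry:
  %bytes = bitcast i8* %0 to i32*
  store i32 0, i32* %bytes, align 8
  %vals = bitcast i32* %1 to <4 x i32>*
  store <4 x i32> zeroinitializer, <4 x i32>* %vals, align 8
  ret void
}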

In postgres' current use the optimization pipeline doesn't contain the
SLP vectorizer (mostly because enabling it via PassManagerBuilder isn't
exposed to C). In that configuration, optimization doesn't yield
anything interesting with:

opt -mcpu=native -disable-slp-vectorization -O3 -S /tmp/combine.ll

With SLP enabled, this turns out a bit better:

define void @evalexpr_0_0(i8* noalias nocapture align 8, i32* noalias nocapture align 8) local_unnamed_addr #0 {
entry:
  store i8 0, i8* %0, align 8
  %a02 = getelementptr i8, i8* %0, i64 1
  store i8 0, i8* %a02, align 1
  %a03 = getelementptr i8, i8* %0, i64 2
  store i8 0, i8* %a03, align 2
  %a04 = getelementptr i8, i8* %0, i64 3
  store i8 0, i8* %a04, align 1
  %2 = bitcast i32* %1 to <4 x i32>*
  store <4 x i32> zeroinitializer, <4 x i32>* %2, align 8
  ret void
}

but note that the i8 stores *still* haven't been coalesced, although
without the interspersed stores, llc/lowering is able to do so.
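
To illustrate that last point, here is a hand-written variant with only
the byte stores (the function name is made up); this is the shape for
which llc/lowering does manage the merge:

; the four byte-wide stores from the test case, without the interleaved
; i32 stores; lowering is able to merge these into a single 4-byte store
define void @only_byte_stores(i8* align 8 noalias %p) {
entry:
  store i8 0, i8* %p, align 8
  %p1 = getelementptr i8, i8* %p, i64 1
  store i8 0, i8* %p1, align 1
  %p2 = getelementptr i8, i8* %p, i64 2
  store i8 0, i8* %p2, align 2
  %p3 = getelementptr i8, i8* %p, i64 3
  store i8 0, i8* %p3, align 1
  ret void
}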

If I run another round of opt on it, the "MemCpy Optimization" pass
then manages to optimize this on the IR level as well:

*** IR Dump After Global Value Numbering ***
; Function Attrs: norecurse nounwind
define void @evalexpr_0_0(i8* noalias nocapture align 8, i32* noalias nocapture align 8) local_unnamed_addr #0 {
entry:
  store i8 0, i8* %0, align 8
  %a02 = getelementptr i8, i8* %0, i64 1
  store i8 0, i8* %a02, align 1
  %a03 = getelementptr i8, i8* %0, i64 2
  store i8 0, i8* %a03, align 2
  %a04 = getelementptr i8, i8* %0, i64 3
  store i8 0, i8* %a04, align 1
  %2 = bitcast i32* %1 to <4 x i32>*
  store <4 x i32> zeroinitializer, <4 x i32>* %2, align 8
  ret void
}
*** IR Dump After MemCpy Optimization ***
; Function Attrs: norecurse nounwind
define void @evalexpr_0_0(i8* noalias nocapture align 8, i32* noalias nocapture align 8) local_unnamed_addr #0 {
entry:
  %a02 = getelementptr i8, i8* %0, i64 1
  %a03 = getelementptr i8, i8* %0, i64 2
  %a04 = getelementptr i8, i8* %0, i64 3
  %2 = bitcast i32* %1 to <4 x i32>*
  call void @llvm.memset.p0i8.i64(i8* align 8 %0, i8 0, i64 4, i1 false)
  store <4 x i32> zeroinitializer, <4 x i32>* %2, align 8
  ret void
}

which later gets turned into a normal i32 store:

*** IR Dump After Combine redundant instructions ***
; Function Attrs: norecurse nounwind
define void @evalexpr_0_0(i8* noalias nocapture align 8, i32* noalias nocapture align 8) local_unnamed_addr #0 {
entry:
  %2 = bitcast i32* %1 to <4 x i32>*
  %3 = bitcast i8* %0 to i32*
  store i32 0, i32* %3, align 8
  store <4 x i32> zeroinitializer, <4 x i32>* %2, align 8
  ret void
}

So, here we finally come to my question: Is it really expected that,
unless largely independent optimizations (SLP in this case) happen to
move instructions *within the same basic block* out of the way, these
stores don't get coalesced? And that, even then, they only get coalesced
if either the optimization pipeline is run again or instruction
selection manages to do it?

On IRC Roman Lebedev pointed out https://reviews.llvm.org/D48725
([SLP] Vectorize bit-parallel operations with SWAR), which might
address this indirectly. But I'm somewhat doubtful that that's the most
straightforward way to optimize this kind of code?

Greetings,

Andres Freund

combine.ll (982 Bytes)

Hi,

D48725 doesn't help here, but it turns out that
https://reviews.llvm.org/D30703 can kinda somewhat help, by adding a redundant
  %i32ptr = bitcast i8* %0 to i32*
  store i32 0, i32* %i32ptr

at the start. Then dse-partial-store-merging does its magic and
optimizes the sub-stores away. But it's fairly ugly to have to manually
add superfluous stores at the right granularity (a larger llvm.memset
doesn't work).

gcc, since 7, detects such cases in its "new" -fstore-merging pass.

- Andres

Andres:

FWIW, codegen will do the merge if you turn on global alias analysis for it “-combiner-global-alias-analysis”. That said, we should be able to do this merging earlier.

-Nirav

Hi,

Interesting. That does *something* for my real case, but certainly not
as much as I'd expected, nor as much as I can get
dse-partial-store-merging to do if I emit a "superfluous" earlier store
(one which encompasses all of the narrow stores), allowing it to do its
job.

In the case at hand, with a manual 64-bit store (this is on a 64-bit
target), LLVM then combines the 8 byte-wide stores into one.

Without -combiner-global-alias-analysis it generates:

        movb $0, 1(%rdx)
        movl 4(%rsi,%rdi), %ebx
        movq %rbx, 8(%rcx)
        movb $0, 2(%rdx)
        movl 8(%rsi,%rdi), %ebx
        movq %rbx, 16(%rcx)
        movb $0, 3(%rdx)
        movl 12(%rsi,%rdi), %ebx
        movq %rbx, 24(%rcx)
        movb $0, 4(%rdx)
        movq 16(%rsi,%rdi), %rbx
        movq %rbx, 32(%rcx)
        movb $0, 5(%rdx)
        movq 24(%rsi,%rdi), %rbx
        movq %rbx, 40(%rcx)
        movb $0, 6(%rdx)
        movq 32(%rsi,%rdi), %rbx
        movq %rbx, 48(%rcx)
        movb $0, 7(%rdx)
        movq 40(%rsi,%rdi), %rsi

where (%rdi) is the array of 1-byte values (where I hope to get stores
combined), which is guaranteed to be 8-byte aligned.

With -combiner-global-alias-analysis it generates:

  movw $0, (%rsi)
  movl (%rcx,%rdi), %ebx
  movq %rbx, (%rdx)
  movl 4(%rcx,%rdi), %ebx
  movl 8(%rcx,%rdi), %r8d
  movq %rbx, 8(%rdx)
  movl $0, 2(%rsi)
  movq %r8, 16(%rdx)
  movl 12(%rcx,%rdi), %ebx
  movq %rbx, 24(%rdx)
  movq 16(%rcx,%rdi), %rbx
  movq %rbx, 32(%rdx)
  movq 24(%rcx,%rdi), %rbx
  movq %rbx, 40(%rdx)
  movb $0, 6(%rsi)
  movq 32(%rcx,%rdi), %rbx
  movq %rbx, 48(%rdx)
  movb $0, 7(%rsi)

where (%rsi) is the array of 1-byte values. So the zeroing is done with
2-, 4-, 1- and 1-byte stores. Huh?

Whereas, if I emit a superfluous 8-byte store beforehand, it becomes:
        movq $0, (%rsi)
        movl (%rcx,%rdi), %ebx
        movq %rbx, (%rdx)
        movl 4(%rcx,%rdi), %ebx
        movq %rbx, 8(%rdx)
        movl 8(%rcx,%rdi), %ebx
        movq %rbx, 16(%rdx)
        movl 12(%rcx,%rdi), %ebx
        movq %rbx, 24(%rdx)
        movq 16(%rcx,%rdi), %rbx
        movq %rbx, 32(%rdx)
        movq 24(%rcx,%rdi), %rbx
        movq %rbx, 40(%rdx)
        movq 32(%rcx,%rdi), %rbx
        movq %rbx, 48(%rdx)
        movq 40(%rcx,%rdi), %rcx

so the zeroing is just a single 8-byte store.
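
For reference, the manually emitted covering store is just the 64-bit
variant of the trick shown earlier (a hand-written sketch; @zero_nulls
and %nulls are made-up names, the real IR is emitted via IRBuilder):

define void @zero_nulls(i8* align 8 %nulls) {
entry:
  ; one superfluous 8-byte store covering the eight byte-wide stores the
  ; real code performs; with it, codegen emits the single movq $0 above
  %wide = bitcast i8* %nulls to i64*
  store i64 0, i64* %wide, align 8
  ret void
}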

I've attached the two test files (which unfortunately are somewhat
messy):
24703.1.bc - file without the "superfluous" store
25256.0.bc - file with the "superfluous" store

The workflow I have, emulating the current pipeline, is:

opt -O3 -disable-slp-vectorization -S < /srv/dev/pgdev-dev/25256.0.bc | llc -O3 [-combiner-global-alias-analysis]

Note that the problem can also occur without -disable-slp-vectorization;
it just requires a larger example.

Greetings,

Andres Freund

24703.1.bc (12.6 KB)

25256.0.bc (12 KB)

Hmm. This looks like the backend conservatively giving up early on merging. It looks like you’re running clang 5.02. There have been some improvements to the backend’s memory aliasing and store merging that have landed since. Can you check if this is fixed in a newer version?

-Nirav

Hi,

It's trunk at r341868. The clang version numbers come in from postgres'
JIT infrastructure inlining SQL operators/functions, which are just
taken from postgres' C code. I probably shouldn't have mixed versions
in this installation, but it shouldn't affect the result, as that's only
for the available_externally functions. The rest of the IR is generated
by directly emitting IR via IRBuilder.

- Andres

Hmm, we have memory operations to the same location that are not
getting the same addressing metadata associated with them, similar to
bugs.llvm.org/pr38241. Presumably there's a pass that's dropping that
information on the floor.

-Nirav