Missed optimization - spill/load generated instead of reg-to-reg move (and two other questions)

Hello all!

I was looking through the results of disassembling a heavily-used short function
in the program I’m working on, and ended up wondering why LLVM was generating
that assembly and what changes would be necessary to improve the code. I asked
on #llvm, but it seems that the people with the necessary expertise weren’t
around.

Here is a condensed version of the code: https://godbolt.org/g/ec5cP7

My main question concerns assembly lines 37/38 and 59/60, where xmm0 is spilled
to the stack, only to be immediately reloaded into xmm1. Google tells me that
there is a register-to-register mov instruction for the xmmn registers, so I
found it odd that LLVM missed what looks like an easy optimization. tstellar on
#llvm pointed me towards using -debug-only=regalloc with llc to see what LLVM is
thinking (regalloc log here, since I’m not sure what’s considered “too large”
for mailing lists: 0), and it seemed to me like the load/store were
introduced separately, and llc never looked at them at the same time, and so
never realized that they could be folded. Is that what is happening? I know
little about compilers, so I wouldn’t be surprised if I were wrong.

The other two questions are tangential, so please let me know if I should ask
them somewhere else.

On assembly lines 24 and 46, I think the vtable pointer for the Quad object is
being reloaded every iteration of the loop. nbjoerg on #llvm said that’s due to
the possibility of placement new being used somewhere inside the called
function, which makes sense to me. Is there a way to indicate to LLVM that this
will not happen? I tried [[gnu::pure]], since the function doesn’t write to
externally-visible memory, but the vtable pointer reload remained.

Finally, I’m inclined to say that this routine should be vectorizable, since
it’s essentially just an accumulate, but Clang can’t prove that GetLocalValue
doesn’t have side effects that will affect later iterations. Is this correct,
and if so, are there any hints I can give Clang besides just manually
parallelizing it with #pragma omp or something?

I do intend on changing this loop to something a bit less messy, but it’ll be
part of a larger refactoring, so it’s still a ways off.

Thanks!

Alex

I don’t have time to dig into this in detail, but you’re heading in the right direction if you’re looking at regalloc tracing. This vaguely looks like something related to phi lowering, so you might want to check what the MIR looks like immediately before regalloc as well. I don’t know that we have anything like this, but we totally should if we don’t. You’re more likely to get a useful answer if you send this separately to cfe-dev though. The clang frontend devs don’t tend to read emails apparently about register allocation. :slight_smile: If you want to assist with the devirt directly, you could capture the member pointer on the first iteration, then reuse. This does require that all Nodes in your array are the exact same type though! To clarify, is the the GetLocalValue in your example? Or some more complicated version? It depends a lot on what IPO can conclude about the function. You can also manually annotate the initial IR, but I don’t know how to do that from clang.Â

Hello all!

I was looking through the results of disassembling a heavily-used short function
in the program I’m working on, and ended up wondering why LLVM was generating
that assembly and what changes would be necessary to improve the code. I asked
on LLVM Project, but it seems that the people with the necessary expertise weren’t
around.

Here is a condensed version of the code: https://godbolt.org/g/ec5cP7

My main question concerns assembly lines 37/38 and 59/60, where xmm0 is spilled
to the stack, only to be immediately reloaded into xmm1. Google tells me that
there is a register-to-register mov instruction for the xmmn registers, so I
found it odd that LLVM missed what looks like an easy optimization. tstellar on
LLVM Project pointed me towards using -debug-only=regalloc with llc to see what LLVM is
thinking (regalloc log here, since I’m not sure what’s considered “too large”
for mailing lists: [0]), and it seemed to me like the load/store were
introduced separately, and llc never looked at them at the same time, and so
never realized that they could be folded. Is that what is happening? I know
little about compilers, so I wouldn’t be surprised if I were wrong.

I don’t have time to dig into this in detail, but you’re heading in the right direction if you’re looking at regalloc tracing. This vaguely looks like something related to phi lowering, so you might want to check what the MIR looks like immediately before regalloc as well.

Apparently it’s the greedy register allocation pass that results in the
spill/reload? If I’m reading the MIR files right, before the pass the MIR is

; …
%36:fr64 = MULSDrr %36, %32
%41:fr64 = ADDSDrr %41, %36
%40:gr64_nosp = INC64r %40, implicit-def dead $eflats
; …

And after:

; …
%36:fr64 = MULSDrm %36, %stack.2, 1, $noreg, 0, $noreg :: (load 8 from %stack.2)
%47:fr64 = MOVSDrm %stack.0, 1, $noreg, 0, $noreg :: (load 8 from %stack.0)
%47:fr64 = ADDSDrr %47, %36
MOVSDmr %stack.0, 1, $noreg, 0, $noreg, %47 :: (store 8 into %stack.0)
%45:fr64 = MOVSDrm %stack.0, 1, $noreg, 0, $noreg :: (load 8 from %stack.0)
%40:gr64_nosp = INC64r %40, implicit-def dead $eflags
; …

So I guess that’s a good place to start. Unfortunately, I have no clue whether
the fix should be in RegAllocGreedy.cpp or if it should be taken care of
elsewhere.

The other two questions are tangential, so please let me know if I should ask
them somewhere else.

On assembly lines 24 and 46, I think the vtable pointer for the Quad object is
being reloaded every iteration of the loop. nbjoerg on LLVM Project said that’s due to
the possibility of placement new being used somewhere inside the called
function, which makes sense to me. Is there a way to indicate to LLVM that this
will not happen? I tried [[gnu::pure]], since the function doesn’t write to
externally-visible memory, but the vtable pointer reload remained.

I don’t know that we have anything like this, but we totally should if we don’t. You’re more likely to get a useful answer if you send this separately to cfe-dev though. The clang frontend devs don’t tend to read emails apparently about register allocation. :slight_smile:

Alright, I’ll head on over to cfe-dev to see if that functionality is
available/implemented.

If you want to assist with the devirt directly, you could capture the member pointer on the first iteration, then reuse. This does require that all Nodes in your array are the exact same type though!

Just to ensure I’m reading you right, are you talking about something like:

auto f = &Rect::GetLocalValue;
for (…) {
// …
const double localVal = (this->*f)(i, local, gauss);
}

?

If so, I’ve tried something along those lines already, and the vtable pointer
reload remained inside the loop. I’ve also tried capturing this into a local
pointer, and capturing into a lambda. None of those got Clang to hoist the
vtable pointer load out of the loop. I wouldn’t be surprised if I were barking
up the wrong tree, though.

Finally, I’m inclined to say that this routine should be vectorizable, since
it’s essentially just an accumulate, but Clang can’t prove that GetLocalValue
doesn’t have side effects that will affect later iterations. Is this correct,
and if so, are there any hints I can give Clang besides just manually
parallelizing it with #pragma omp or something?

To clarify, is the the GetLocalValue in your example? Or some more complicated version? It depends a lot on what IPO can conclude about the function. You can also manually annotate the initial IR, but I don’t know how to do that from clang.

I was wondering if GetValue() was vectorizable. The version on godbolt is
basically the same as the original, and the generated assembly is more or less
the same, modulo member offsets. GetLocalValue is a “true” pure function, in
that it does some math based only on its arguments, but it seems that’s not
getting picked up, even with -flto. Maybe it’s the virtual-ness preventing
deeper analysis?