Memset/memcpy: user control of loop-idiom recognizer

Hi,

In feedback from game studios a common issue is the replacement of
loops with calls to memcpy/memset. These loops are often
hand-optimised and highly efficient, and the developers strongly want
a way to control the compiler (i.e. leave my loop alone).

The culprit is of course the loop-idiom recognizer. This replaces any
loop that looks like a memset/memcpy with a call. This affects loops
with both variable and constant trip counts. The question is, does
this make sense in all cases? Also, should the compiler provide a way
to turn it off for certain types of loop, or to control it on
individual loops? The standard answer is to use -fno-builtin, but this
does not provide fine-grained control (e.g. we may want the loop-idiom
recognizer to transform constant-count loops but not variable-count loops).

As an example, it could be argued that replacing constant-count loops
always makes sense: the compiler knows how big the memset/memcpy is
and can make an accurate decision. For small sizes the memcpy/memset
will be expanded inline, while larger sizes will remain a call, but at
those sizes the call overhead is negligible.
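For illustration, this is the kind of constant-trip-count loop the recognizer targets; the function name and size are made up for the example:

```c
#include <stddef.h>

// A constant-trip-count loop: the loop-idiom recognizer can turn this
// into memset(dst, 0, 64), and because the size is a known small
// constant the lowering can expand the memset inline.
void clear_block(unsigned char *dst) {
    for (size_t i = 0; i < 64; ++i)
        dst[i] = 0;
}
```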

On the other hand, the compiler knows very little about variable-count
loops (the loop could be copying 10 bytes or 10 Mbytes; the compiler
cannot tell). The compiler will still replace the loop with a call,
but because the size is variable the call will not be expanded inline.
In this case small sizes may see significant overhead compared to the
original loop. The game studio examples all fall into this category.
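By contrast, a variable-trip-count version of the same idiom (again a made-up example) has no size information for the compiler to exploit:

```c
#include <stddef.h>

// A variable-trip-count loop: the recognizer replaces it with
// memset(dst, 0, n), but since n is unknown at compile time the
// memset is lowered to an out-of-line library call, which can cost
// more than the original loop when n is small.
void clear_n(unsigned char *dst, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = 0;
}
```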

The loop-idiom recognizer also has no notion of "quality" - it always
assumes that replacing the loop makes sense. While that might be true
for a naive byte copy, some of the examples we've seen have been
carefully tuned.

So, to summarise, we feel there is sufficient justification to add
some form of user control. Rather than propose a specific solution,
we would prefer to start a discussion and gather opinions. To begin,
how do people feel about:

- A switch to disable the loop-idiom recognizer completely?

- A switch to disable the loop-idiom recognizer for loops with a variable trip count?

- A switch to disable the loop-idiom recognizer for loops with a constant
trip count (we can't see this being much use)?

- Per-function control of the loop-idiom recognizer (which must work with LTO)?

Thanks for any feedback!
Rob.

What if we had a pragma or attribute that lowered down to metadata indicating that the variable length trip count was small?

Then backends could choose to lower short memcpys to an inlined, slightly widened loop. For example, ‘rep movsq’ on x86_64.

That seems nice from the compiler perspective, since it preserves the canonical form and we get the same kind of information from profiling. Then again, I can imagine most game dev users just want control and don’t want to change their code.

I doubt that. If anything, it means the lowering of the intrinsic is
bad, not that the transformation should not happen.

Joerg

>> In feedback from game studios a common issue is the replacement of
>> loops with calls to memcpy/memset. These loops are often
>> hand-optimised and highly efficient, and the developers strongly want
>> a way to control the compiler (i.e. leave my loop alone).
>
> I doubt that. If anything, it means the lowering of the intrinsic is
> bad, not that the transformation should not happen.
>
> Joerg

Yes, that's why I talked about variable and constant trip-counts. For
constant loops there generally isn't a problem, as they can be lowered
inline (if small). Variable loops, however, get expanded into a
library call.

Rob.

I like this general idea. Here's another possibility... We actually
already have such a construct in the form of the expect builtins. One
way to structure this would be:

    if (__builtin_expect(N < SmallSize, 1)) {
      // small loop here
    } else {
      // memcpy here
      // or unreachable if you're really brave
    }

I could see us failing to exploit this of course. :-)
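The expect-builtin idea can be written out as a compilable sketch; the function name and the SMALL_SIZE threshold here are hypothetical, not from the thread:

```c
#include <stddef.h>
#include <string.h>

enum { SMALL_SIZE = 32 };  /* hypothetical threshold, for illustration */

/* Hint that n is usually small, so the compiler can keep the byte
 * loop on the hot path and only reach the memcpy call in the
 * predicted-unlikely large case. */
void copy_guarded(unsigned char *dst, const unsigned char *src, size_t n) {
    if (__builtin_expect(n < SMALL_SIZE, 1)) {
        for (size_t i = 0; i < n; ++i)
            dst[i] = src[i];
    } else {
        memcpy(dst, src, n);
    }
}
```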

>>> In feedback from game studios a common issue is the replacement of
>>> loops with calls to memcpy/memset. These loops are often
>>> hand-optimised and highly efficient, and the developers strongly want
>>> a way to control the compiler (i.e. leave my loop alone).
>>
>> I doubt that. If anything, it means the lowering of the intrinsic is
>> bad, not that the transformation should not happen.
>>
>> Joerg
>
> Yes, that's why I talked about variable and constant trip-counts. For
> constant loops there generally isn't a problem, as they can be lowered
> inline (if small). Variable loops, however, get expanded into a
> library call.

So the biggest problem is that you don’t want a call and would prefer to have inline memcpy code everywhere or something else? If the memcpy isn’t being lowered efficiently I’m curious as to what isn’t being lowered well.

-eric

I'd agree. On x86-64, however, memcpy is difficult. Some recent profiling shows around a 50% performance difference between Sandy Bridge, Ivy Bridge and Haswell across various SSE-based approaches, with different versions performing very differently on each (no idea what the variation is like between AMD chips).

Lowering memcpy in LLVM is particularly horrible as it's done in three different places, only one of which has anything that's a bit like a cost model.

We can often generate a very efficient memcpy loop in the back end if we know that the data being copied is strongly aligned. For x86-64 (and our architecture), if the data is 256-bit aligned and known to be a multiple of 256 bits (or, even better, a multiple of a known multiple of 256 bits) then we can generate something that is likely to be significantly faster than a call to memcpy, but we often lose this information by the time we are doing the lowering.
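A hypothetical sketch of the aligned case described above: when both pointers are 32-byte (256-bit) aligned and the length is a multiple of 32 bytes, the copy needs no misaligned prologue or epilogue. The function and its preconditions are illustrative, not an actual LLVM lowering:

```c
#include <stddef.h>
#include <string.h>

/* Preconditions (assumed, not checked): nbytes % 32 == 0, dst and src
 * are 32-byte aligned, and the regions do not overlap. Each fixed-size
 * 32-byte memcpy is small enough that a backend can expand it inline
 * as wide aligned moves rather than a library call. */
void copy_aligned_256(unsigned char *dst, const unsigned char *src,
                      size_t nbytes) {
    for (size_t i = 0; i < nbytes; i += 32)
        memcpy(dst + i, src + i, 32);
}
```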

The interface for target-specific lowering of memcpy is horribly convoluted (and assumes that memcpy is always in AS 0, even though the intrinsic supports multiple address spaces, but that's a different issue) and so some cleanup would make it possible to exploit some of this information a bit better. Ideally, I'd see it moved entirely to the back end (or a single flag saying 'expand this in the IR, I don't care about optimising it yet'), rather than having the back end trying to provide SelectionDAG with some things that it sometimes uses.

David

In feedback from game studios a common issue is the replacement of
loops with calls to memcpy/memset. These loops are often
hand-optimised, and highly-efficient and the developers strongly want
a way to control the compiler (i.e. leave my loop alone).

I doubt that. If anything, it means the lowering of the intrinsic is
bad, not that the transformation should not happen.

Joerg

Yes, that’s why I talked about variable and constant trip-counts. For
constant loops there generally isn’t a problem, as they can be lowered
inline (if small). Variable loops, however, get expanded into a
library call.

So the biggest problem is that you don’t want a call and would prefer to have inline memcpy code everywhere or something else? If the memcpy isn’t being lowered efficiently I’m curious as to what isn’t being lowered well.

Our C library amplifies this problem by being in a dynamic library, so the call has additional overhead, which for small trip counts swamps the copy/set.

Certainly, the lowering can be better across the many cases as discussed elsewhere in this thread.

Game developers expect precise control and are surprised by this canonicalization. They also don’t have the compiler’s frame of reference as a basis for understanding issues like this.

Alex

What surprises me is that a loop should ever result in more efficient
code given that it needs pretty much the same set of constraints.

Joerg

> Our C library amplifies this problem by being in a dynamic library, so the
> call has additional overhead, which for small trip counts swamps the
> copy/set.

I can't imagine we're the only platform (now or in the future) that
has comparatively slow library calls. We had discussed some sort of
platform flag (has slow library calls), but this would be too late to
affect the loop-idiom recognizer. However, it could affect lowering.
Following on from Reid's earlier idea to lower short memcpys to an
inlined, slightly widened loop, could we expand into a guarded loop
for small sizes and a call otherwise?

> Game developers expect precise control and are surprised by this
> canonicalization. They also don't have the compiler's frame of reference as
> a basis for understanding issues like this.

Unfortunately this issue has now been noticed. Whether or not we can
"get away" with fixing the performance issue without giving them the
control remains to be seen...

Rob.

>> In feedback from game studios a common issue is the replacement of
>> loops with calls to memcpy/memset. These loops are often
>> hand-optimised, and highly-efficient and the developers strongly want
>> a way to control the compiler (i.e. leave my loop alone).
>
> I doubt that. If anything, it means the lowering of the intrinsic is
> bad, not that the transformation should not happen.
>
> Joerg

> Yes, that's why I talked about variable and constant trip-counts. For
> constant loops there generally isn't a problem, as they can be lowered
> inline (if small). Variable loops, however, get expanded into a
> library call.
>
> So the biggest problem is that you don't want a call and would prefer to
> have inline memcpy code everywhere or something else? If the memcpy isn't
> being lowered efficiently I'm curious as to what isn't being lowered well.
>
> Our C library amplifies this problem by being in a dynamic library, so the
> call has additional overhead, which for small trip counts swamps the
> copy/set.
>
> Certainly, the lowering can be better across the many cases as discussed
> elsewhere in this thread.

It's also worth mentioning that when the loop-idiom recognizer is
disabled, the loop vectorizer steps in and will vectorize the loop.

Rob.

> Hi,
>
> In feedback from game studios a common issue is the replacement of
> loops with calls to memcpy/memset. These loops are often
> hand-optimised and highly efficient, and the developers strongly want
> a way to control the compiler (i.e. leave my loop alone).

Please provide examples of such "hand-optimised, and highly-efficient"
routines and test cases (and execution conditions) that demonstrate a
performance improvement.

-- Sean Silva

I think the bug is not that we are recognising that the loop is memcpy, it's that we're then generating an inefficient memcpy. We do this for a variety of reasons, some of which apply elsewhere. One issue I hit a few months ago was that the vectoriser doesn't notice whether unaligned loads and stores are supported, so will happily replace two adjacent i32 align 4 loads followed by two adjacent i32 align 4 stores with an i64 align 4 load followed by an i64 align 4 store, which more than doubles the number of instructions that the back end emits.

We expand memcpy and friends in several different places (in the IR in at least one place, then in SelectionDAG, and then again in the back end, as I recall - I remember playing whack-a-bug with this for a while as the lowering was differently broken for our target in each place). In SelectionDAG, we're dealing with a single basic block, so we can't construct the loop. In the back end we've already lost a lot of high-level type information that would make this easier.

I'd be in favour of consolidating the memcpy / memset / memmove expansion into an IR pass that would take a cost model from the target.

David

This sounds like a cop-out, but we can't share customer code (even if
we could produce a small runnable example). But this is all getting
beside the point. I discussed performance issues to try to justify
why the user should have control. That was probably a mistake, as it
has sidetracked the conversation. The blunt fact is that game
developers don't like their loops being replaced and they want user
control. The real conversation I wanted was about what form this
user control should take. To be honest, I am surprised at the level
of resistance to giving users *any* control over their codegen.

This doesn't really make sense. They're writing code in a high(ish)-level language and passing it to a compiler. The processor doesn't understand C, so the compiler *must* replace it with something. Whether that's a call to a library routine, a scalar loop, a vector loop, or a completely unrolled sequence depends on heuristics in the compiler.

If they want full control of the machine instructions that are generated, then there is a mechanism for doing this: inline assembly.

The complaint can't be that it's not generating the same code; it's that the compiler is generating something with performance characteristics that are difficult to reason about from the input code. That's always a danger when you use an optimising compiler, but this case looks like a pretty extreme example.

David

> The blunt fact is that game
> developers don't like their loops being replaced and they want user
> control. The real conversation I wanted was what form should this
> user control take.

> This doesn't really make sense. They're writing code in a high(ish)-level language and passing it to a compiler. The processor doesn't understand C, so the compiler *must* replace it with something. Whether that's a call to a library routine, a scalar loop, a vector loop, or a completely unrolled sequence depends on heuristics in the compiler.

Yes, but why isn't the user allowed any control over what the compiler does?

> If they want full control of the machine instructions that are generated, then there is a mechanism for doing this: inline assembly.

Of course they could do this, but this is like having to make your
deli sandwich yourself instead of telling the guy to "hold the mayo".

Rob.

If you want to maintain a custom branch of clang with an additional option added, no one would object or care. If you were to submit a patch to add such a flag, it might even be accepted.

So far, the discussion has focused on what the compiler is doing wrong in this case. You have requested a workaround for what is clearly a compiler optimization bug. Before agreeing to support the workaround, considering how hard the underlying issue would be to fix is clearly the right approach.

Having said all of that, the existing push/pop optimization scopes (a gcc extension) should either already work for what you're trying to do with the workaround, or could be relatively easy to adapt. If there's an -OX setting that excludes the optimization you consider problematic, try:
#pragma GCC optimize("-OX")

You could also try the clang::optnone function attribute.
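For concreteness, the optnone attribute can be applied per function; a minimal sketch (the function itself is illustrative):

```c
#include <stddef.h>

/* Clang's optnone attribute (also spelled [[clang::optnone]]) disables
 * optimization for this one function, so the loop below is left as
 * written rather than being recognized and turned into a memcpy call.
 * GCC does not support optnone and will warn that it is ignored. */
__attribute__((optnone))
void copy_loop(unsigned char *dst, const unsigned char *src, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}
```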

Philip

>> Our C library amplifies this problem by being in a dynamic library, so the
>> call has additional overhead, which for small trip counts swamps the
>> copy/set.
>
> I can't imagine we're the only platform (now or in the future) that
> has comparatively slow library calls. We had discussed some sort of
> platform flag (has slow library calls) but this would be too late to
> affect the loop-idiom. However, it could affect lowering. Following
> on from Reid's earlier idea to lower short memcpys to an inlined,
> slightly widened loop, we could expand into a guarded loop for small
> values and a call?

> I think the bug is not that we are recognising that the loop is memcpy, it's that we're then generating an inefficient memcpy. We do this for a variety of reasons, some of which apply elsewhere. One issue I hit a few months ago was that the vectoriser doesn't notice whether unaligned loads and stores are supported, so will happily replace two adjacent i32 align 4 loads followed by two adjacent i32 align 4 stores with an i64 align 4 load followed by an i64 align 4 store, which more than doubles the number of instructions that the back end emits.
>
> We expand memcpy and friends in several different places (in the IR in at least one place, then in SelectionDAG, and then again in the back end, as I recall - I remember playing whack-a-bug with this for a while as the lowering was differently broken for our target in each place). In SelectionDAG, we're dealing with a single basic block, so we can't construct the loop. In the back end we've already lost a lot of high-level type information that would make this easier.
>
> I'd be in favour of consolidating the memcpy / memset / memmove expansion into an IR pass that would take a cost model from the target.

+1

It sounds like we might also be losing information about alignment in the loop-idiom recognizer, or at least not using it when we lower.

There are a large number of ways to lose information when translating
loops into memset/memcpy calls; alignment is one of them. As
previously mentioned, loop trip count is another. Access size is yet
another: the loop may have originally been using int64_t-sized copies.
This has a definite impact on what the best memset/memcpy expansion
is, because the loop effectively knows that it is always writing a
multiple of 8 bytes, and does so in 8-byte chunks. So the fact that
the byte count has some specific value property (like the lower 3 bits
always being 0 - another reason for having known bits and known bit
values :-)) should affect the lowering of such loops/calls, but
probably doesn't.
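An illustrative example of this point (the function is made up): a loop that copies in int64_t-sized chunks inherently guarantees that the byte count is a multiple of 8 and that both pointers are 8-byte aligned, and rewriting it as memcpy(dst, src, nwords * 8) discards both guarantees.

```c
#include <stdint.h>
#include <stddef.h>

/* Copying in int64_t-sized chunks: the element type tells the
 * compiler the total size is a multiple of 8 bytes and the accesses
 * are 8-byte aligned - facts a byte-count memcpy call cannot convey. */
void copy_words(int64_t *dst, const int64_t *src, size_t nwords) {
    for (size_t i = 0; i < nwords; ++i)
        dst[i] = src[i];
}
```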

Database folks often write their own copy routines for use in specific
circumstances, as do OSes, for example when they know they are
clearing or copying exactly page-sized regions on page-size
boundaries. These have very special implementations, including some
that use non-temporal hints so as not to pollute the cache.

It is also worth pointing out that most loops have a very specific behavior in the case of overlaps that is well-defined, and that memcpy does not.
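The overlap point can be demonstrated with a short (made-up) example: an ascending byte-copy loop has well-defined behavior even when the destination overlaps the source (it smears the leading bytes forward), whereas calling memcpy on overlapping regions is undefined behavior; only memmove is safe there.

```c
#include <stddef.h>

/* Well-defined on overlapping buffers: each iteration reads a byte
 * that earlier iterations may already have written, so a forward
 * overlap replicates the leading bytes. Substituting memcpy here is
 * only valid if the compiler can prove the regions do not overlap. */
void smear_forward(unsigned char *dst, const unsigned char *src, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}
```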

There are definitely good reasons why various knowledgeable users would not want a compiler to perform such a transform on at least some of their loops.

Kevin Smith

The LoadCombine pass suffers from the same problem: it produces
unaligned loads, and that's why it's not enabled by default.

I'm currently working on that problem (trying to combine stores too),
and I suppose that once it's solved we will be able to draw useful
conclusions about how to make the rest of the LLVM passes mentioned
in this thread alignment-aware.

I don't know a lot about the vectorizer yet, but I don't think that
discovering the alignment of adjacent loads/stores should be one of
its main tasks. By enabling a properly working LoadStoreCombine pass
we could probably eliminate the alignment problem in the vectorizer
altogether.

Regards,
Vasileios Kalintiris