Memcpy expansion: InstCombine vs SelectionDAG

Hi all,

We currently have code in InstCombine that tries to expand memcpy / memmove intrinsics that are copying something that fits in a single register to a load and store in the IR. We then have other code in SelectionDAG that expands small memcpy intrinsics to sequences of loads and stores. The InstCombine one is useful, as it exposes more optimisation opportunities, but unfortunately it is not semantically equivalent to the SelectionDAG expansion.

We're seeing this in our back end, where a 32-bit aligned 64-bit memcpy generates efficient code when expanded near the back end (2 loads, 2 stores), but quite suboptimal code when the load is the result of InstCombine expansion. In the second case, it becomes two loads, a shift and an or (to construct the 64-bit value) and then a
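In C terms, the mismatch looks roughly like this (an illustrative sketch; the function names are made up, and the 32-bit-store tail reflects how our target legalises unaligned i64 stores):

```c
#include <stdint.h>

/* Sketch of the InstCombine-expanded copy as our back end ends up lowering
 * it: two aligned 32-bit loads are merged into a 64-bit value with a shift
 * and an or, only to be split apart again for the 32-bit stores. */
static void copy8_via_i64(uint32_t *dst, const uint32_t *src) {
    uint64_t lo = src[0];
    uint64_t hi = src[1];
    uint64_t full = lo | (hi << 32);   /* construct the 64-bit value */
    dst[0] = (uint32_t)full;           /* ...then take it apart again */
    dst[1] = (uint32_t)(full >> 32);
}

/* The lowering we get when the memcpy survives to SelectionDAG. */
static void copy8_direct(uint32_t *dst, const uint32_t *src) {
    dst[0] = src[0];                   /* 2 loads, 2 stores */
    dst[1] = src[1];
}
```

Both functions copy the same 8 bytes; the first just takes a detour through a 64-bit register.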

This simple program shows the problem:

struct A {
  int a, b;
} global;

int mv(const struct A *b)
{
  global = *b;
  return 1;
}

Compiled with clang at -O1 or above, the function becomes:

define i32 @mv(%struct.A* nocapture readonly %b) #0 {
entry:
  %0 = bitcast %struct.A* %b to i64*
  %1 = load i64* %0, align 4
  store i64 %1, i64* bitcast (%struct.A* @global to i64*), align 4
  ret i32 1
}

The back end now pattern matches on the load, but doesn't know that the load is only ever used for a corresponding store. I can think of several ways of fixing this, but the ideal one is probably for InstCombine to know whether unaligned loads and stores are expensive for this target and, if so, to emit a short sequence of loads and stores at the desired alignment. This code in InstCombine could actually benefit a lot from some target knowledge, as the transform may be applicable on some systems to vector or other types (for example, I bet there are a lot of C++ structures that are copied around a lot and would fit happily in an AVX register). As a small extension, if the operation can be more efficient when the alignment is better, and either the source or destination is on the stack, then it may make sense to increase the alignment of the alloca.
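At the source level, the alloca-alignment extension would look something like the following (an illustrative sketch; `sum_copy` and the explicit `alignas` are mine, and the real change would raise the alignment of the alloca in the IR rather than in C):

```c
#include <stdalign.h>
#include <stdint.h>
#include <string.h>

struct A { int a, b; };

/* Over-aligning the stack copy means the destination side of the 8-byte
 * memcpy can use a single aligned 64-bit store on targets where unaligned
 * 64-bit accesses are expensive; the source side may still need 2 loads. */
static uint64_t sum_copy(const struct A *src) {
    alignas(8) struct A local;   /* alignment raised from 4 to 8 */
    memcpy(&local, src, sizeof local);
    return (uint64_t)local.a + (uint64_t)local.b;
}
```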

A somewhat hacky alternative that would address this case would be to have a codegen prepare pass that tries to turn load / store sequences back into memcpy / memmove intrinsics (although the aliasing information present in the decision to use memmove may be lost from the load / store sequence), which we could then expand more efficiently in SelectionDAG.

Do we have a sensible way of exposing this kind of target-specific information to optimisers other than the vector cost model?

David

Hi David,

Hi all,

We currently have code in InstCombine that tries to expand memcpy / memmove intrinsics that are copying something that fits in a single register to a load and store in the IR. We then have other code in SelectionDAG that expands small memcpy intrinsics to sequences of loads and stores. The InstCombine one is useful, as it exposes more optimisation opportunities, but unfortunately it is not semantically equivalent to the SelectionDAG expansion.

Could you be more specific?
In particular, what is the problem?

We're seeing this in our back end, where a 32-bit aligned 64-bit memcpy generates efficient code when expanded near the back end (2 loads, 2 stores), but quite suboptimal code when the load is the result of InstCombine expansion. In the second case, it becomes two loads, a shift and an or (to construct the 64-bit value) and then a

We are missing the interesting bits here :).
What is next?
A 64-bit store?

This simple program shows the problem:

struct A {
  int a, b;
} global;

int mv(const struct A *b)
{
  global = *b;
  return 1;
}

Compiled with clang at -O1 or above, the function becomes:

define i32 @mv(%struct.A* nocapture readonly %b) #0 {
entry:
  %0 = bitcast %struct.A* %b to i64*
  %1 = load i64* %0, align 4
  store i64 %1, i64* bitcast (%struct.A* @global to i64*), align 4
  ret i32 1
}

The back end now pattern matches on the load, but doesn't know that the load is only ever used for a corresponding store.

I am not sure I follow what you are saying.
In this example, the load (%1) is used only once and in the same basic block. You should have all the information you need in SelectionDAG.
That said, as you may know, SelectionDAG is a bottom-up approach. Thus, the store is matched before the load gets a chance to be matched.

It is hard to give direction on how to fix that since we do not know anything about the target. In particular, is i64 a legal type or not?
Based on your problem description, it is not clear since we are missing what the 2 loads are feeding but it sounds like they are feeding a 64-bit store.

In short, what is the best lowering you would expect for your target?
1. 2 loads - 2 stores?
2. A memcpy intrinsic?

For #1, I am wondering if we could extend the slicing that takes place during DAG combine (look for SliceUpLoad in DAGCombiner) to slice the store part too if that is the problem.

I can think of several ways of fixing this, but the ideal one is probably for InstCombine to know whether unaligned loads and stores are expensive for this target and emit a short sequence of loads and stores at the desired alignment if so. This code in InstCombine could actually benefit a lot from some target knowledge, as this transform may be applicable on some systems to vector or other types (for example, I bet there are a lot of C++ structures that are copied around a lot that would fit happily in an AVX register).

As far as I can tell, InstCombine is target independent and its purpose is to produce a canonical form. In that case, memcpy/memmove of chunks that are less than 8 bytes are canonicalized to a 64-bit load/store.

As a small extension, if the operation can be more efficient if the alignment is better, and either the source or destination is on the stack, then it may make sense to increase the alignment of the alloca.

A somewhat hacky alternative that would address this case would be to have a codegen prepare pass that tries to turn load / store sequences back into memcpy / memmove intrinsics (although the aliasing information present in the decision to use memmove may be lost from the load / store sequence), which we could then expand more efficiently in SelectionDAG.

Do we have a sensible way of exposing this kind of target-specific information to optimisers other than the vector cost model?

We usually use the TargetLowering class, you can look into the CodeGenPrepare pass :).

Hope that helps.

-Quentin

Hi David,

Hi all,

We currently have code in InstCombine that tries to expand memcpy / memmove intrinsics that are copying something that fits in a single register to a load and store in the IR. We then have other code in SelectionDAG that expands small memcpy intrinsics to sequences of loads and stores. The InstCombine one is useful, as it exposes more optimisation opportunities, but unfortunately it is not semantically equivalent to the SelectionDAG expansion.

Could you be more specific?
In particular, what is the problem?

That, for the reasons outlined in my mail, we generate suboptimal code for small structure copies on architectures where unaligned loads and stores are expensive.

We're seeing this in our back end, where a 32-bit aligned 64-bit memcpy generates efficient code when expanded near the back end (2 loads, 2 stores), but quite suboptimal code when the load is the result of InstCombine expansion. In the second case, it becomes two loads, a shift and an or (to construct the 64-bit value) and then a

We are missing the interesting bits here :).
What is next?
A 64-bit store?

Then two 32-bit stores (because unaligned 64-bit stores are not permitted).

This simple program shows the problem:

struct A {
  int a, b;
} global;

int mv(const struct A *b)
{
  global = *b;
  return 1;
}

Compiled with clang at -O1 or above, the function becomes:

define i32 @mv(%struct.A* nocapture readonly %b) #0 {
entry:
  %0 = bitcast %struct.A* %b to i64*
  %1 = load i64* %0, align 4
  store i64 %1, i64* bitcast (%struct.A* @global to i64*), align 4
  ret i32 1
}

The back end now pattern matches on the load, but doesn't know that the load is only ever used for a corresponding store.

I am not sure I follow what you are saying.
In this example, the load (%1) is used only once and in the same basic block. You should have all the information you need in SelectionDAG.

Yes, I'd expect SelectionDAG to be able to do this.

That said, as you may know, SelectionDAG is a bottom-up approach. Thus, the store is matched before the load gets a chance to be matched.

It is hard to give direction on how to fix that since we do not know anything about the target. In particular, is i64 a legal type or not?

Yes, i64 is a legal type, but unaligned loads and stores are not supported by the architecture and so SelectionDAG will expand them to a more complex sequence.

Based on your problem description, it is not clear since we are missing what the 2 loads are feeding but it sounds like they are feeding a 64-bit store.

In short, what is the best lowering you would expect for your target?
1. 2 loads - 2 stores?
2. A memcpy intrinsic?

A memcpy intrinsic will expand to 2 loads, 2 stores, if the alignment is less than 8, so they are equivalent.

For #1, I am wondering if we could extend the slicing that takes place during DAG combine (look for SliceUpLoad in DAGCombiner) to slice the store part too if that is the problem.

I can think of several ways of fixing this, but the ideal one is probably for InstCombine to know whether unaligned loads and stores are expensive for this target and emit a short sequence of loads and stores at the desired alignment if so. This code in InstCombine could actually benefit a lot from some target knowledge, as this transform may be applicable on some systems to vector or other types (for example, I bet there are a lot of C++ structures that are copied around a lot that would fit happily in an AVX register).

As far as I can tell, InstCombine is target independent and its purpose is to produce a canonical form. In that case, memcpy/memmove of chunks that are less than 8 bytes are canonicalized to a 64-bit load/store.

The canonical form, however, is inefficient for targets where unaligned 64-bit stores are expensive. The same applies to unaligned memory operations of other widths.

SelectionDAG already has sensible code for handling unaligned memcpy intrinsics, delegating to the target the choice of the most efficient set of loads and stores. If InstCombine is replacing the memcpy with a load + store that is wider than something that the target can handle for the desired alignment, then this is a problem, because we then lack information about how to sensibly lower it.

As a small extension, if the operation can be more efficient if the alignment is better, and either the source or destination is on the stack, then it may make sense to increase the alignment of the alloca.

A somewhat hacky alternative that would address this case would be to have a codegen prepare pass that tries to turn load / store sequences back into memcpy / memmove intrinsics (although the aliasing information present in the decision to use memmove may be lost from the load / store sequence), which we could then expand more efficiently in SelectionDAG.

Do we have a sensible way of exposing this kind of target-specific information to optimisers other than the vector cost model?

We usually use the TargetLowering class, you can look into the CodeGenPrepare pass :).

Unless I miss something, TargetLowering is not available to InstCombine.

David

-Quentin

Hi David,

Hi all,

We currently have code in InstCombine that tries to expand memcpy / memmove intrinsics that are copying something that fits in a single register to a load and store in the IR. We then have other code in SelectionDAG that expands small memcpy intrinsics to sequences of loads and stores. The InstCombine one is useful, as it exposes more optimisation opportunities, but unfortunately it is not semantically equivalent to the SelectionDAG expansion.

Could you be more specific?
In particular, what is the problem?

That, for the reasons outlined in my mail, we generate suboptimal code for small structure copies on architectures where unaligned loads and stores are expensive.

Ok, I see. InstCombine doing something different than SelectionDAG is expected: one canonicalizes, the other lowers.

We’re seeing this in our back end, where a 32-bit aligned 64-bit memcpy generates efficient code when expanded near the back end (2 loads, 2 stores), but quite suboptimal code when the load is the result of InstCombine expansion. In the second case, it becomes two loads, a shift and an or (to construct the 64-bit value) and then a

We are missing the interesting bits here :).
What is next?
A 64-bit store?

Then two 32-bit stores (because unaligned 64-bit stores are not permitted).

Thanks, that helps.

To clarify, you end up with something like:

/* load the 32-bit parts */
low = load i32 base
high = load i32 base + 4bytes
/* build up the 64-bit value */
high64 = zext i32 high to i64
low64 = zext i32 low to i64
shiftHigh64 = shl i64 high64, 32
full = or i64 low64, shiftHigh64
/* break down the 64-bit value into 32-bit values */
newLow = trunc i64 full to i32
newHigh64 = lshr i64 full, 32
newHigh = trunc i64 newHigh64 to i32
/* store the 32-bit values */
store i32 newHigh, base2 + 4bytes
store i32 newLow, base2

Is that correct?

If not, could you give the complete sequence?

From your description, there is something missing to break down the newly created 64-bit value into the values that feed the stores.

Assuming this is the right code sequence, I guess ISel creates the full i64 type because it is legal to do so. However, in that case, it would have been better to just stick to two 32-bit slices.
Like I previously said, this looks like a case where the SliceUpLoad DAG combine could be extended to handle it.

Would it be possible to share the input of isel (-view-isel-dags)?

This simple program shows the problem:

struct A {
  int a, b;
} global;

int mv(const struct A *b)
{
  global = *b;
  return 1;
}

Compiled with clang at -O1 or above, the function becomes:

define i32 @mv(%struct.A* nocapture readonly %b) #0 {
entry:
  %0 = bitcast %struct.A* %b to i64*
  %1 = load i64* %0, align 4
  store i64 %1, i64* bitcast (%struct.A* @global to i64*), align 4
  ret i32 1
}

The back end now pattern matches on the load, but doesn’t know that the load is only ever used for a corresponding store.

I am not sure I follow what you are saying.
In this example, the load (%1) is used only once and in the same basic block. You should have all the information you need in SelectionDAG.

Yes, I’d expect SelectionDAG to be able to do this.

That said, as you may know, SelectionDAG is a bottom-up approach. Thus, the store is matched before the load gets a chance to be matched.

It is hard to give direction on how to fix that since we do not know anything about the target. In particular, is i64 a legal type or not?

Yes, i64 is a legal type, but unaligned loads and stores are not supported by the architecture and so SelectionDAG will expand them to a more complex sequence.

Based on your problem description, it is not clear since we are missing what the 2 loads are feeding but it sounds like they are feeding a 64-bit store.

In short, what is the best lowering you would expect for your target?

  1. 2 loads - 2 stores?
  2. A memcpy intrinsic?

A memcpy intrinsic will expand to 2 loads, 2 stores, if the alignment is less than 8, so they are equivalent.

Okay, but what you are looking for is 2 loads and 2 stores in that case; let us just skip the intrinsic step if we can :).

For #1, I am wondering if we could extend the slicing that takes place during DAG combine (look for SliceUpLoad in DAGCombiner) to slice the store part too if that is the problem.

I can think of several ways of fixing this, but the ideal one is probably for InstCombine to know whether unaligned loads and stores are expensive for this target and emit a short sequence of loads and stores at the desired alignment if so. This code in InstCombine could actually benefit a lot from some target knowledge, as this transform may be applicable on some systems to vector or other types (for example, I bet there are a lot of C++ structures that are copied around a lot that would fit happily in an AVX register).

As far as I can tell, InstCombine is target independent and its purpose is to produce a canonical form. In that case, memcpy/memmove of chunks that are less than 8 bytes are canonicalized to a 64-bit load/store.

The canonical form, however, is inefficient for targets where unaligned 64-bit stores are expensive. The same applies to unaligned memory operations of other widths.

SelectionDAG already has sensible code for handling unaligned memcpy intrinsics, delegating to the target the choice of the most efficient set of loads and stores. If InstCombine is replacing the memcpy with a load + store that is wider than something that the target can handle for the desired alignment, then this is a problem, because we then lack information about how to sensibly lower it.

Sure, I am still not convinced fixing InstCombine is the right thing to do here. At this point (but I may be wrong), I believe it is the fact that i64 is a legal type that makes you end up generating such a bad sequence. I need more input to give you directions for fixing that; see my proposal.

As a small extension, if the operation can be more efficient if the alignment is better, and either the source or destination is on the stack, then it may make sense to increase the alignment of the alloca.

A somewhat hacky alternative that would address this case would be to have a codegen prepare pass that tries to turn load / store sequences back into memcpy / memmove intrinsics (although the aliasing information present in the decision to use memmove may be lost from the load / store sequence), which we could then expand more efficiently in SelectionDAG.

Do we have a sensible way of exposing this kind of target-specific information to optimisers other than the vector cost model?

We usually use the TargetLowering class, you can look into the CodeGenPrepare pass :).

Unless I miss something, TargetLowering is not available to InstCombine.

No, you are right. InstCombine is not supposed to make target-specific decisions, but CodeGenPrepare is.

Cheers,
-Quentin