[RFC] Poor code generation for paired load

Hi,

I am investigating a poor code generation on x86-64 involving a 64-bits structure with two 32-bits fields (in the attached examples float, but similar behavior is exposed with i32, and we can probably generalize that to smaller types too).
The root cause of the problem is in SROA, although I am not sure we should fix something there. That is why I need your advices.

** Problem **

64-bits structures are usually loaded as one chunk of bits and fields are extracted from this chunk.
Although this may be generally better than loading each field on its own, this can lead to poor code generation when the operations extracting the fields are more expensive than a load or when “fancy” loads are available.

More generally, this may happen for smaller size too.

** Example **

  1. %chunk64 = load i64

  2. %field1trunced = trunc i64 %chunk64 to i32 // <— build field1 from chunk

  3. %field1float = bitcast i32 field1trunced to float // <— build field1 from chunk

  4. %field2shifted = lshr i64 %chunk64, 32 // <— build field2 from chunk

  5. %field2trunced = trunc i64 %field2shifter to i32 // <— build field2 from chunk

  6. %field2 = bitcast i32 %field2trunced to float // <— build field2 from chunk

Scenario #1:
Floating point registers are on another register bank and register bank moves are almost as expensive as loads (instructions 3. and 6.).
Cost: ldi64 + 2 int_to_fp vs. 2 ldfloat

Scenario #2
Paired loads are available on the target. Truncate and shift instructions are useless (instructions 2., 4., and 5.).
Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair

** To Reproduce **

Here is a way to reproduce the poor code generation for x86-64.

opt -sroa current_input.ll -S -o - | llc -O3 -o -

You will see 2 vmovd and 1 shrq that can be avoided as illustrated with the next command.

Here is a nicer code produced by modifying the input so that SROA generates friendlier code for this case.

opt -sroa mod_input.ll -S -o - | llc -O3 -o -

Basically the difference between both inputs is that memcpy has not been expanded in mod_input.ll (instcombine normally replaces it). Thus, SROA inserts its own loads to get rid of the memcpy instead of extracting the values from the 64-bits loads.

** Advices Required **

SROA generates this extract-fields-from-chunk-of-bits thing.
However, like I said, I do not think this is generally a bad thing.

Would it make sense to rewrite the definitions of the involved slices so that SROA breaks them apart when they are loads (and under certain circumstance)?

More generally, do you think there is something we should do in SROA for this?

Currently, 32-bits targets (e.g., armv7s) do not suffer this because the legalization of types in selection DAG split the 64-bits loads.

Should we do something similar for 64-bits targets with the proper target hooks?
If yes, what hooks?

Thanks for your help.

Cheers,

-Quentin

current_input.ll (1.91 KB)

mod_input.ll (1.97 KB)

Hi,

I am investigating a poor code generation on x86-64 involving a 64-bits
structure with two 32-bits fields (in the attached examples float, but
similar behavior is exposed with i32, and we can probably generalize that to
smaller types too).
The root cause of the problem is in SROA, although I am not sure we should
fix something there. That is why I need your advices.

** Problem **

64-bits structures are usually loaded as one chunk of bits and fields are
extracted from this chunk.
Although this may be generally better than loading each field on its own,
this can lead to poor code generation when the operations extracting the
fields are more expensive than a load or when “fancy” loads are available.

More generally, this may happen for smaller size too.

** Example **

1. %chunk64 = load i64
2. %field1trunced = trunc i64 %chunk64 to i32 // <— build field1
from chunk
3. %field1float = bitcast i32 field1trunced to float // <— build
field1 from chunk
4. %field2shifted = lshr i64 %chunk64, 32 // <— build field2
from chunk
5. %field2trunced = trunc i64 %field2shifter to i32 // <— build field2
from chunk
6. %field2 = bitcast i32 %field2trunced to float // <— build
field2 from chunk

Scenario #1:
Floating point registers are on another register bank and register bank
moves are almost as expensive as loads (instructions 3. and 6.).
Cost: ldi64 + 2 int_to_fp vs. 2 ldfloat

Scenario #2
Paired loads are available on the target. Truncate and shift instructions
are useless (instructions 2., 4., and 5.).
Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair

** To Reproduce **

Here is a way to reproduce the poor code generation for x86-64.

opt -sroa current_input.ll -S -o - | llc -O3 -o -

You will see 2 vmovd and 1 shrq that can be avoided as illustrated with the
next command.

Here is a nicer code produced by modifying the input so that SROA generates
friendlier code for this case.

opt -sroa mod_input.ll -S -o - | llc -O3 -o -

Basically the difference between both inputs is that memcpy has not been
expanded in mod_input.ll (instcombine normally replaces it). Thus, SROA
inserts its own loads to get rid of the memcpy instead of extracting the
values from the 64-bits loads.

** Advices Required **

SROA generates this extract-fields-from-chunk-of-bits thing.
However, like I said, I do not think this is generally a bad thing.

Would it make sense to rewrite the definitions of the involved slices so
that SROA breaks them apart when they are loads (and under certain
circumstance)?

More generally, do you think there is something we should do in SROA for
this?

The load that you want to split is actually the i64 load from memory
with SROA knows nothing about. So the transform you want isn't really
related to what SROA does.

Currently, 32-bits targets (e.g., armv7s) do not suffer this because the
legalization of types in selection DAG split the 64-bits loads.

Should we do something similar for 64-bits targets with the proper target
hooks?
If yes, what hooks?

Hmm... detecting the pattern in IR isn't particularly hard; see
InstCombiner::SliceUpIllegalIntegerPHI for an example of code which
detects a similar sort of pattern. You might want to consider adding
something in instcombine.

I'm trying to think if there's some scenario where you wouldn't want
to rewrite the load into two loads, but I'm having trouble coming up
with anything: two scalar loads at a constant offset from each other
are pretty easy to detect for the sorts of passes that like to mess
with loads. So we probably just want to declare that two load
instructions is the canonical form for loading two floats which are
next to each other in memory, and do this transform on IR for all
targets.

-Eli

Hi Eli,

Thanks for the feedbacks.

Hi,

I am investigating a poor code generation on x86-64 involving a 64-bits
structure with two 32-bits fields (in the attached examples float, but
similar behavior is exposed with i32, and we can probably generalize that to
smaller types too).
The root cause of the problem is in SROA, although I am not sure we should
fix something there. That is why I need your advices.

** Problem **

64-bits structures are usually loaded as one chunk of bits and fields are
extracted from this chunk.
Although this may be generally better than loading each field on its own,
this can lead to poor code generation when the operations extracting the
fields are more expensive than a load or when “fancy” loads are available.

More generally, this may happen for smaller size too.

** Example **

  1. %chunk64 = load i64
  2. %field1trunced = trunc i64 %chunk64 to i32 // <— build field1
    from chunk
  3. %field1float = bitcast i32 field1trunced to float // <— build
    field1 from chunk
  4. %field2shifted = lshr i64 %chunk64, 32 // <— build field2
    from chunk
  5. %field2trunced = trunc i64 %field2shifter to i32 // <— build field2
    from chunk
  6. %field2 = bitcast i32 %field2trunced to float // <— build
    field2 from chunk

Scenario #1:
Floating point registers are on another register bank and register bank
moves are almost as expensive as loads (instructions 3. and 6.).
Cost: ldi64 + 2 int_to_fp vs. 2 ldfloat

Scenario #2
Paired loads are available on the target. Truncate and shift instructions
are useless (instructions 2., 4., and 5.).
Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair

** To Reproduce **

Here is a way to reproduce the poor code generation for x86-64.

opt -sroa current_input.ll -S -o - | llc -O3 -o -

You will see 2 vmovd and 1 shrq that can be avoided as illustrated with the
next command.

Here is a nicer code produced by modifying the input so that SROA generates
friendlier code for this case.

opt -sroa mod_input.ll -S -o - | llc -O3 -o -

Basically the difference between both inputs is that memcpy has not been
expanded in mod_input.ll (instcombine normally replaces it). Thus, SROA
inserts its own loads to get rid of the memcpy instead of extracting the
values from the 64-bits loads.

** Advices Required **

SROA generates this extract-fields-from-chunk-of-bits thing.
However, like I said, I do not think this is generally a bad thing.

Would it make sense to rewrite the definitions of the involved slices so
that SROA breaks them apart when they are loads (and under certain
circumstance)?

More generally, do you think there is something we should do in SROA for
this?

The load that you want to split is actually the i64 load from memory
with SROA knows nothing about. So the transform you want isn’t really
related to what SROA does.

Agreed.

Currently, 32-bits targets (e.g., armv7s) do not suffer this because the
legalization of types in selection DAG split the 64-bits loads.

Should we do something similar for 64-bits targets with the proper target
hooks?
If yes, what hooks?

Hmm… detecting the pattern in IR isn’t particularly hard; see
InstCombiner::SliceUpIllegalIntegerPHI for an example of code which
detects a similar sort of pattern. You might want to consider adding
something in instcombine.

Thanks for the direction, I will have a look.

I’m trying to think if there’s some scenario where you wouldn’t want
to rewrite the load into two loads, but I’m having trouble coming up
with anything: two scalar loads at a constant offset from each other
are pretty easy to detect for the sorts of passes that like to mess
with loads. So we probably just want to declare that two load
instructions is the canonical form for loading two floats which are
next to each other in memory, and do this transform on IR for all
targets.

It is more general than floating point types. For instance, if you replace float by i32 (and fadd by add, of course :)) in the example I had in my initial email, the produced code is also better for x86-64 when you split the loads.
Thus, do you think we should split all loads and rely on later passes to combine them, if needed for the current target?

Thanks,

-Quentin

Right... replace "float" with "arithmetic type".

You probably want to be careful you don't end up generating
overlapping loads, and larger loads are generally better if we're not
doing arithmetic, but otherwise this seems like a good idea. That
said, I'd suggest writing it up and starting a new thread on llvmdev.

-Eli

Hi Eli,

Thanks for the feedbacks.

Hi,

I am investigating a poor code generation on x86-64 involving a 64-bits
structure with two 32-bits fields (in the attached examples float, but
similar behavior is exposed with i32, and we can probably generalize that to
smaller types too).
The root cause of the problem is in SROA, although I am not sure we should
fix something there. That is why I need your advices.

** Problem **

64-bits structures are usually loaded as one chunk of bits and fields are
extracted from this chunk.
Although this may be generally better than loading each field on its own,
this can lead to poor code generation when the operations extracting the
fields are more expensive than a load or when “fancy” loads are available.

More generally, this may happen for smaller size too.

** Example **

  1. %chunk64 = load i64
  2. %field1trunced = trunc i64 %chunk64 to i32 // <— build field1
    from chunk
  3. %field1float = bitcast i32 field1trunced to float // <— build
    field1 from chunk
  4. %field2shifted = lshr i64 %chunk64, 32 // <— build field2
    from chunk
  5. %field2trunced = trunc i64 %field2shifter to i32 // <— build field2
    from chunk
  6. %field2 = bitcast i32 %field2trunced to float // <— build
    field2 from chunk

Scenario #1:
Floating point registers are on another register bank and register bank
moves are almost as expensive as loads (instructions 3. and 6.).
Cost: ldi64 + 2 int_to_fp vs. 2 ldfloat

Scenario #2
Paired loads are available on the target. Truncate and shift instructions
are useless (instructions 2., 4., and 5.).
Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair

** To Reproduce **

Here is a way to reproduce the poor code generation for x86-64.

opt -sroa current_input.ll -S -o - | llc -O3 -o -

You will see 2 vmovd and 1 shrq that can be avoided as illustrated with the
next command.

Here is a nicer code produced by modifying the input so that SROA generates
friendlier code for this case.

opt -sroa mod_input.ll -S -o - | llc -O3 -o -

Basically the difference between both inputs is that memcpy has not been
expanded in mod_input.ll (instcombine normally replaces it). Thus, SROA
inserts its own loads to get rid of the memcpy instead of extracting the
values from the 64-bits loads.

** Advices Required **

SROA generates this extract-fields-from-chunk-of-bits thing.
However, like I said, I do not think this is generally a bad thing.

Would it make sense to rewrite the definitions of the involved slices so
that SROA breaks them apart when they are loads (and under certain
circumstance)?

More generally, do you think there is something we should do in SROA for
this?

The load that you want to split is actually the i64 load from memory
with SROA knows nothing about. So the transform you want isn’t really
related to what SROA does.

Agreed.

Currently, 32-bits targets (e.g., armv7s) do not suffer this because the
legalization of types in selection DAG split the 64-bits loads.

Should we do something similar for 64-bits targets with the proper target
hooks?
If yes, what hooks?

Hmm… detecting the pattern in IR isn’t particularly hard; see
InstCombiner::SliceUpIllegalIntegerPHI for an example of code which
detects a similar sort of pattern. You might want to consider adding
something in instcombine.

Thanks for the direction, I will have a look.

I’m trying to think if there’s some scenario where you wouldn’t want
to rewrite the load into two loads, but I’m having trouble coming up
with anything: two scalar loads at a constant offset from each other
are pretty easy to detect for the sorts of passes that like to mess
with loads. So we probably just want to declare that two load
instructions is the canonical form for loading two floats which are
next to each other in memory, and do this transform on IR for all
targets.

It is more general than floating point types. For instance, if you replace
float by i32 (and fadd by add, of course :)) in the example I had in my
initial email, the produced code is also better for x86-64 when you split
the loads.
Thus, do you think we should split all loads and rely on later passes to
combine them, if needed for the current target?

Right… replace “float” with “arithmetic type”.

You probably want to be careful you don’t end up generating
overlapping loads, and larger loads are generally better if we’re not
doing arithmetic, but otherwise this seems like a good idea. That
said, I’d suggest writing it up and starting a new thread on llvmddv.

Thanks Eli.

Will do.

-Quentin