I am investigating poor code generation on x86-64 involving a 64-bit structure with two 32-bit fields (float in the attached examples, but i32 exposes similar behavior, and this can probably be generalized to smaller types too).
The root cause of the problem is in SROA, although I am not sure we should fix anything there. That is why I need your advice.
** Problem **
64-bit structures are usually loaded as one chunk of bits, and the fields are extracted from this chunk.
Although this may generally be better than loading each field on its own, it can lead to poor code generation when the operations extracting the fields are more expensive than a load, or when “fancy” loads are available.
More generally, this may happen for smaller sizes too.
** Example **
1. %chunk64 = load i64
2. %field1trunced = trunc i64 %chunk64 to i32 // <-- build field1 from chunk
3. %field1float = bitcast i32 %field1trunced to float // <-- build field1 from chunk
4. %field2shifted = lshr i64 %chunk64, 32 // <-- build field2 from chunk
5. %field2trunced = trunc i64 %field2shifted to i32 // <-- build field2 from chunk
6. %field2 = bitcast i32 %field2trunced to float // <-- build field2 from chunk
Floating-point registers are in another register bank, and register bank moves are almost as expensive as loads (instructions 3. and 6.).
Cost: ldi64 + 2 int_to_fp vs. 2 ldfloat
Paired loads are available on the target. Truncate and shift instructions are useless (instructions 2., 4., and 5.).
Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair
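To make the two access patterns concrete, here is a rough C-level sketch of the example above. It is not taken from the attached files: struct Pair and the function names are made up for illustration, and it assumes a little-endian layout (as on x86-64) so that field1 sits in the low 32 bits of the chunk. It only shows that both patterns yield the same values; what the backend emits for each is a separate matter.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 64-bit structure with two 32-bit fields. */
struct Pair {
    float field1;
    float field2;
};

/* Reinterpret 32 bits as a float (models "bitcast i32 ... to float"). */
float bits_to_float(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Chunk-based access: one 64-bit load, then trunc / lshr / bitcast.
   Assumes little-endian byte order. */
void load_via_chunk(const struct Pair *p, float *f1, float *f2) {
    uint64_t chunk;
    assert(sizeof *p == sizeof chunk);
    memcpy(&chunk, p, sizeof chunk);              /* load i64 */
    *f1 = bits_to_float((uint32_t)chunk);         /* trunc + bitcast */
    *f2 = bits_to_float((uint32_t)(chunk >> 32)); /* lshr + trunc + bitcast */
}

/* Per-field access: two independent 32-bit float loads. */
void load_per_field(const struct Pair *p, float *f1, float *f2) {
    *f1 = p->field1;
    *f2 = p->field2;
}
```

The first function corresponds to instructions 1. to 6.; the second is the "2 ldfloat" alternative from the cost comparison.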
** To Reproduce **
Here is a way to reproduce the poor code generation for x86-64.
opt -sroa current_input.ll -S -o - | llc -O3 -o -
You will see 2 vmovd and 1 shrq that can be avoided, as illustrated by the next command.
Here is nicer code, produced by modifying the input so that SROA generates friendlier code for this case.
opt -sroa mod_input.ll -S -o - | llc -O3 -o -
Basically, the difference between the two inputs is that the memcpy has not been expanded in mod_input.ll (instcombine normally replaces it). Thus, SROA inserts its own loads to get rid of the memcpy instead of extracting the values from the 64-bit loads.
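As a loose C-level model of the two input shapes (again with made-up names, and not derived from the attached .ll files), the copy can either stay a memcpy of the whole structure or be expressed as a single 64-bit integer load and store. A C compiler is free to lower both functions identically, so this only illustrates the semantics of the two forms, not the resulting code generation.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 64-bit structure with two 32-bit float fields. */
struct FloatPair {
    float x;
    float y;
};

/* Copy kept as a memcpy of the structure: the fields of tmp can then be
   read with their own 32-bit loads (the mod_input.ll shape). */
float sum_after_memcpy(const struct FloatPair *src) {
    struct FloatPair tmp;
    memcpy(&tmp, src, sizeof tmp);
    return tmp.x + tmp.y;
}

/* Copy expressed as one 64-bit integer load followed by one 64-bit store
   (the current_input.ll shape, after the memcpy has been expanded). */
float sum_after_i64_copy(const struct FloatPair *src) {
    struct FloatPair tmp;
    uint64_t chunk;
    memcpy(&chunk, src, sizeof chunk);  /* models: load i64 */
    memcpy(&tmp, &chunk, sizeof tmp);   /* models: store i64 */
    return tmp.x + tmp.y;
}
```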
** Advice Required **
SROA generates this extract-fields-from-chunk-of-bits thing.
However, like I said, I do not think this is generally a bad thing.
Would it make sense to rewrite the definitions of the involved slices so that SROA breaks them apart when they are loads (and under certain circumstances)?
More generally, do you think there is something we should do in SROA for this?
Currently, 32-bit targets (e.g., armv7s) do not suffer from this because type legalization in the selection DAG splits the 64-bit loads.
Should we do something similar for 64-bit targets with the proper target hooks?
If yes, what hooks?
Thanks for your help.
current_input.ll (1.91 KB)
mod_input.ll (1.97 KB)