Where could we perform (sudo-?)SROA with non-constant offsets?

LebedevRI · November 16, 2022, 9:42pm

First, consider: (example 0)

#include <cstdlib>
#include <cstring>
#include <algorithm>

void sink(char*);

constexpr int size = 4;

void entry(char* input, int length, int position) {
    int max_bytes = length - position;
    int bytes = std::min(size, max_bytes);
    char res[size] = {};
    memcpy(res, input + position, bytes);
    sink(res);
}

This function has to perform dynamically-sized, but bound, memcpy,
which may or may not be good, given particular use case: Compiler Explorer

Now, let’s look at another variant (not strictly identical): (example 1)

#include <cstdlib>
#include <cstring>
#include <algorithm>

void sink(char*);

constexpr int size = 4;

void entry(char* input, int length, int position) {
    int last_pos = length - size;
    int clamped_pos = std::min(position, last_pos);
    char tmp[2 * size] = {};
    memcpy(tmp, &input + clamped_pos, size);
    int num_leading_padding_bytes = std::min(size, position - clamped_pos);
    char res[size] = {};
    memcpy(res, tmp + num_leading_padding_bytes, size);
    sink(res);
}

Here, both memory loads are statically-sized.
Under some external preconditions, that are not relevant here,
the examples are equivalent.

Problem is, the second memcpy loads from a non-constant offset into tmp,
SROA does not deal with non-constant offsets, so we end up with tmp
not being promoted into a register: Compiler Explorer

So while this may or may not already be better than the original variant,
this is still not great.

But the transformation is pretty obvious here: (example 2)

Compiler Explorer
alive2 proof: ~~https://alive2.llvm.org/ce/z/Y-n5qy~~ err, wrong proof, got hit by Contents of memory input to fns not checked · Issue #647 · AliveToolkit/alive2 · GitHub / Comparison of bytes of allocas given to function arguments · Issue #435 · AliveToolkit/alive2 · GitHub again.
I.e. instead of storing the loaded value and loading from some offset into the tmp,
just lshr (at least for little-endian), the originally loaded value.

And we finally get near-optimal assembly: Compiler Explorer

Question: is there any pass that would be the best place
to perform (example 1) → (example 2) transformation?
Should the SROA be taught about non-constant offsets?

I don’t expect this to be a common pattern, but when it does happen,
it may be performance-sensitive, so it would be good to deal with this.

Roman.

nlopes · November 16, 2022, 9:59pm

It sounds like an interesting trick! But alloca must fit in one register.
Alternatively, one could use vectors and maybe it pays off to use shuffles to emulate the load?
But only if this pattern is common, which I have no clue.

LebedevRI · November 16, 2022, 10:12pm

Yeah, i was hoping to improve this hotpath: https://github.com/darktable-org/rawspeed/blob/6be00ea43b92c876692593436f9edbbf70d4c3d4/src/librawspeed/io/BitStream.h#L145-L173
It works, just without the perf impact i was wishing for, at least without this opt

Indeed, there is probably some profitability heuristics needed for this.
I think i’m mainly interested in i32/i64 and ideally i128,
and for i128, the codegen is indeed not quite nice: Compiler Explorer

Hard to say. Depending on how you look at it, every optimization
is for some pattern that is uncommon in some codebase

Topic		Replies	Views
Non-uniform SROA behavior for two similar locals IR & Optimizations	0	69	April 3, 2024
SROA and volatile memcpy/memset LLVM Dev List Archives	15	86	November 24, 2015
LLVM struct, alloca, SROA and the entry basic block LLVM Dev List Archives	8	205	September 9, 2015
wrong code generation for memcpy function in SROA optimization pass LLVM Dev List Archives	1	117	November 25, 2013
[RFC] Correct implementation of memcpy with metadata IR & Optimizations core , rfc , llvm	5	976	June 20, 2023

Where could we perform (sudo-?)SROA with non-constant offsets?

Related topics