Where could we perform (sudo-?)SROA with non-constant offsets?

First, consider: (example 0)

#include <cstdlib>
#include <cstring>
#include <algorithm>

void sink(char*);

constexpr int size = 4;

void entry(char* input, int length, int position) {
    int max_bytes = length - position;
    int bytes = std::min(size, max_bytes);
    char res[size] = {};
    memcpy(res, input + position, bytes);
    sink(res);
}

This function has to perform dynamically-sized, but bound, memcpy,
which may or may not be good, given particular use case: Compiler Explorer

Now, let’s look at another variant (not strictly identical): (example 1)

#include <cstdlib>
#include <cstring>
#include <algorithm>

void sink(char*);

constexpr int size = 4;

void entry(char* input, int length, int position) {
    int last_pos = length - size;
    int clamped_pos = std::min(position, last_pos);
    char tmp[2 * size] = {};
    memcpy(tmp, &input + clamped_pos, size);
    int num_leading_padding_bytes = std::min(size, position - clamped_pos);
    char res[size] = {};
    memcpy(res, tmp + num_leading_padding_bytes, size);
    sink(res);
}

Here, both memory loads are statically-sized.
Under some external preconditions, that are not relevant here,
the examples are equivalent.

Problem is, the second memcpy loads from a non-constant offset into tmp,
SROA does not deal with non-constant offsets, so we end up with tmp
not being promoted into a register: Compiler Explorer

So while this may or may not already be better than the original variant,
this is still not great.

But the transformation is pretty obvious here: (example 2)

And we finally get near-optimal assembly: Compiler Explorer

Question: is there any pass that would be the best place
to perform (example 1) → (example 2) transformation?
Should the SROA be taught about non-constant offsets?

I don’t expect this to be a common pattern, but when it does happen,
it may be performance-sensitive, so it would be good to deal with this.

Roman.

It sounds like an interesting trick! But alloca must fit in one register.
Alternatively, one could use vectors and maybe it pays off to use shuffles to emulate the load?
But only if this pattern is common, which I have no clue.

Yeah, i was hoping to improve this hotpath: https://github.com/darktable-org/rawspeed/blob/6be00ea43b92c876692593436f9edbbf70d4c3d4/src/librawspeed/io/BitStream.h#L145-L173
It works, just without the perf impact i was wishing for, at least without this opt :slight_smile:

Indeed, there is probably some profitability heuristics needed for this.
I think i’m mainly interested in i32/i64 and ideally i128,
and for i128, the codegen is indeed not quite nice: Compiler Explorer

Hard to say. Depending on how you look at it, every optimization
is for some pattern that is uncommon in some codebase :slight_smile: