First, consider: (example 0)
#include <cstdlib>
#include <cstring>
#include <algorithm>
void sink(char*);
constexpr int size = 4;
void entry(char* input, int length, int position) {
int max_bytes = length - position;
int bytes = std::min(size, max_bytes);
char res[size] = {};
memcpy(res, input + position, bytes);
sink(res);
}
This function has to perform dynamically-sized, but bound, memcpy
,
which may or may not be good, given particular use case: Compiler Explorer
Now, let’s look at another variant (not strictly identical): (example 1)
#include <cstdlib>
#include <cstring>
#include <algorithm>
void sink(char*);
constexpr int size = 4;
void entry(char* input, int length, int position) {
int last_pos = length - size;
int clamped_pos = std::min(position, last_pos);
char tmp[2 * size] = {};
memcpy(tmp, &input + clamped_pos, size);
int num_leading_padding_bytes = std::min(size, position - clamped_pos);
char res[size] = {};
memcpy(res, tmp + num_leading_padding_bytes, size);
sink(res);
}
Here, both memory loads are statically-sized.
Under some external preconditions, that are not relevant here,
the examples are equivalent.
Problem is, the second memcpy
loads from a non-constant offset into tmp
,
SROA does not deal with non-constant offsets, so we end up with tmp
not being promoted into a register: Compiler Explorer
So while this may or may not already be better than the original variant,
this is still not great.
But the transformation is pretty obvious here: (example 2)
- Compiler Explorer
- alive2 proof: ~~https://alive2.llvm.org/ce/z/Y-n5qy~~ err, wrong proof, got hit by Contents of memory input to fns not checked · Issue #647 · AliveToolkit/alive2 · GitHub / Comparison of bytes of allocas given to function arguments · Issue #435 · AliveToolkit/alive2 · GitHub again.
I.e. instead of storing the loaded value and loading from some offset into thetmp
,
justlshr
(at least for little-endian), the originally loaded value.
And we finally get near-optimal assembly: Compiler Explorer
Question: is there any pass that would be the best place
to perform (example 1) → (example 2) transformation?
Should the SROA be taught about non-constant offsets?
I don’t expect this to be a common pattern, but when it does happen,
it may be performance-sensitive, so it would be good to deal with this.
Roman.