Missed optimization opportunity with piecewise load shift-or'd together?

The following piece of IR is a fixed point for opt -std-compile-opts/-O3:
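In source terms, IR like this typically comes from a byte-by-byte read along the following lines (a minimal C sketch assuming little-endian data in the buffer; read_u32_le is an invented name, not the original code):

#include <stdint.h>

/* Portable byte-by-byte read of a little-endian 32-bit value.  In IR this
 * becomes four i8 loads that are zero-extended, shifted, and or'd together
 * instead of a single i32 load. */
static uint32_t read_u32_le(const unsigned char *buf) {
    return  (uint32_t)buf[0]
          | ((uint32_t)buf[1] << 8)
          | ((uint32_t)buf[2] << 16)
          | ((uint32_t)buf[3] << 24);
}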

My guess is that this is a missed optimization, but in real life, all projects I have worked on fix this in the C or C++ code using macros that change what instructions are used based on the target platform and its endianness.
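A rough sketch of that kind of macro-based workaround, under the usual assumptions (READ_U32_LE and read_u32_direct are invented names; __BYTE_ORDER__ and __builtin_bswap32 are GCC/Clang-specific):

#include <stdint.h>
#include <string.h>

/* One (possibly unaligned) 32-bit load via memcpy; compilers lower this
 * fixed-size memcpy to a plain load on targets that allow it. */
static inline uint32_t read_u32_direct(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Choose the implementation based on the target's byte order. */
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#  define READ_U32_LE(p) read_u32_direct(p)
#else
#  define READ_U32_LE(p) __builtin_bswap32(read_u32_direct(p))
#endif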
James

One reason for writing code like this, i.e. explicitly spelling out
the accesses to the individual bytes, would be to allow compile-time
evaluation of the fragment in the D programming language, where
arbitrarily reinterpreting memory is not supported (although
integer->integer pointer casts might be supported at some point).

Would a patch adding the capability to lower this to InstCombine or
similar have a chance of being accepted, or would that be considered
too rare a special case to be worth the added complexity?

David

> My guess is that this is a missed optimization, but in real life,
> all projects I have worked on fix this in the C or C++ code using
> macros that change what instructions are used based on the target
> platform and its endianness.

> One reason for writing code like this, i.e. explicitly spelling out
> the accesses to the individual bytes, would be to allow compile-time
> evaluation of the fragment in the D programming language, where
> arbitrarily reinterpreting memory is not supported (although
> integer->integer pointer casts might be supported at some point).
>
> Would a patch adding the capability to lower this to InstCombine or
> similar have a chance of being accepted, or would that be considered
> too rare a special case to be worth the added complexity?

I think that a patch for this would be great; I've seen plenty of real-life deserialization code that looks like this.

FWIW, some patterns like this (byte swapping, for example) are matched during CodeGen (see DAGCombiner::visitOR in lib/CodeGen/SelectionDAG/DAGCombiner.cpp), but there is no reason that this pattern cannot be recognized and canonicalized early in the IR (and I think that it should be).
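For reference, the byte-swap case above is the familiar open-coded swap; whether a particular form is actually folded into a single byte-swap instruction depends on the matcher and the target, so treat this as an illustrative sketch only (bswap32_manual is an invented name):

#include <stdint.h>

/* Open-coded 32-bit byte swap: a tree of masked shifts or'd together,
 * which the or-combining logic in CodeGen can fold into one bswap-style
 * instruction on targets that have one. */
static uint32_t bswap32_manual(uint32_t x) {
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) <<  8) |
           ((x & 0x00FF0000u) >>  8) |
           ((x & 0xFF000000u) >> 24);
}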

-Hal

I wrote up this optimization as an LLVM IR pass last month, actually:
https://code.google.com/p/foster/source/browse/compiler/llvm/passes/BitcastLoadRecognizer.cpp

It recognizes trees of `or' operations where the leaves are of the form (buf[v+c] << (8 * c * sizeof(buf[0]))).

There are a few improvements needed to make it fit for general consumption; it assumes (without checking) that it’s targeting a little-endian architecture, and it doesn’t propagate alignment or inbounds information from loads. It also does not recognize/generate byte swaps, though it would be a pretty trivial extension to do so. It might also be nice to have the dual optimization to coalesce stores rather than loads.
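As a rough illustration of that last point, the store-side dual of the pattern is the byte-by-byte write below (write_u32_le is an invented name); a store-coalescing counterpart would rewrite the four byte stores as one wider store, plus a byte swap on big-endian targets:

#include <stdint.h>

/* Piecewise little-endian store of a 32-bit value: the dual of the
 * byte-assembly load pattern discussed in this thread. */
static void write_u32_le(unsigned char *buf, uint32_t v) {
    buf[0] = (unsigned char)(v         & 0xFF);
    buf[1] = (unsigned char)((v >>  8) & 0xFF);
    buf[2] = (unsigned char)((v >> 16) & 0xFF);
    buf[3] = (unsigned char)((v >> 24) & 0xFF);
}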

If anyone wants to use this as a starting point for a patch, feel free!

> I wrote up this optimization as an LLVM IR pass last month, actually:
> https://code.google.com/p/foster/source/browse/compiler/llvm/passes/BitcastLoadRecognizer.cpp

Neat, thanks for sharing! FWIW, I think that in terms of an implementation in LLVM itself, putting this inside of InstCombine makes the most sense.

-Hal