Hello all,
I was pointed to this discussion regarding my recent commit in D129497 [flang] Lower TRANSPOSE without using runtime.
I would like to explain why I took the route of a special FIR lowering for the TRANSPOSE intrinsic rather than inlining it later in a separate pass.
My implementation follows the ideas described in llvm-project/ArrayComposition.md at main · llvm/llvm-project · GitHub, i.e. lowering the TRANSPOSE intrinsic becomes a "simple" reversal of the iteration space during the lowering of its argument expression (element (i, j) of the result is element (j, i) of the argument). This produces much nicer FIR from the beginning in cases where the argument expression is not simple and would otherwise require a temporary array just for passing the expression value to the TRANSPOSE runtime call.
For example:
A = TRANSPOSE(B + C)
With the runtime call, B + C needs to be computed into a temporary array, which implies explicit fir.allocmem/fir.freemem and the following execution flow:
TEMP = fir.allocmem
fir.do_loop {
  // run over elements of B and C, add them and store into TEMP
}
// embox TEMP into TEMP_BOX
fir.call @_FortranATranspose(RESULT_BOX, TEMP_BOX, ...)
fir.do_loop {
  // run over elements of RESULT_BOX and store them into A
}
fir.freemem TEMP
fir.freemem RESULT_BOX.Addr
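For contrast, here is a rough sketch (not the exact FIR produced by the patch, just the shape of it) of what the optimized lowering boils down to: the argument expression is evaluated with its iteration space reversed, so the whole assignment becomes a single loop nest with no temporary and no runtime call:
fir.do_loop {
  // for each element (i, j) of A: load B(j, i) and C(j, i),
  // add them and store the sum directly into A(i, j)
}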
While I agree with the idea of keeping the FIR representation high level, I cannot say that what we get from the lowering here is a high-level representation. Maybe the lack of a high-level representation here is a problem in itself, but it is not TRANSPOSE specific.
I did consider the path of lowering the TRANSPOSE call later, after the ArrayValueCopy pass (though I think we will have to do it even later if we want to make inlining decisions based on the array sizes, which implies some constant propagation). I found that the current MLIR passes used in Flang are not able to do anything about these loops, whereas I expected something like loop fusion followed by elimination of the temporary allocation and its uses. In particular, the affine promote/demote passes failed for these loops, and I think the FIR operations stuck between the loop nests will be a problem for loop fusion.
So it seemed reasonable to me to follow the ArrayComposition direction at least for TRANSPOSE. I agree that it is worth looking into late inlining of the runtime calls (and, actually, this was my very first idea as well), but in the meantime it is also worth showcasing what kind of performance improvements may be achieved with proper handling of Fortran intrinsics; I guess my implementation is one such example. I tried to do it in a non-intrusive way, e.g. there is an engineering option -opt-transpose=true/false controlling the "optimized" TRANSPOSE lowering, so we can experiment with it further.
I did enable it by default, and I am about to upload changes that disable it for -O0 (the runtime TRANSPOSE implementation has a nice feature of reporting out-of-memory conditions to the user, whereas the "optimized" TRANSPOSE lowering will just operate on a null pointer returned by fir.allocmem).
My work was motivated by SPEC CPU 2000 178.galgel, which has TRANSPOSE hotspots in syshtN.f90 and sysnsN.f90.
I used the flang-new driver with the following options on Icelake (Gold 6338): -ffixed-form -march=native -O2
Results:
Default (optimized TRANSPOSE lowering): 81.09 seconds
-Xflang -mllvm -Xflang -opt-transpose=false: 481.33 seconds
It seems like a good result and we should see what it will cost to achieve the same speed-up with late inlining.
Side notes about the Fortran runtime working on arrays:
As I said, I totally agree that inlining runtime implementations that process arrays is worth it, especially for small sizes. This will enable further optimizations, e.g. after full unrolling. In particular, Polyhedron/induct2_11 suffers from lots of 3-element dot_product computations.
At the same time, during my investigations I found that the current runtime implementation of the array iterator confuses the MLIR/LLVM optimizations quite a lot. I am talking about the interaction between IncrementSubscripts and SubscriptByteOffset.
To get to the next array element we use IncrementSubscripts, which is basically a loop with an early exit that in most cases increments just one subscript value by 1:
for (int j{0}; j < raw_.rank; ++j) {
  int k{permutation ? permutation[j] : j};
  const Dimension &dim{GetDimension(k)};
  if (subscript[k]++ < dim.UpperBound()) {
    return true;
  }
  subscript[k] = dim.LowerBound();
}
Then to get the element value we iterate through all subscripts and compute the element’s offset:
std::size_t SubscriptByteOffset(
    int dim, SubscriptValue subscriptValue) const {
  const Dimension &dimension{GetDimension(dim)};
  return (subscriptValue - dimension.LowerBound()) * dimension.ByteStride();
}
...
std::size_t offset{0};
for (int j{0}; j < raw_.rank; ++j) {
  offset += SubscriptByteOffset(j, subscript[j]);
}
return offset;
So even if we just increment one subscript by 1, and even if the array rank (which is a runtime value) is 1, we compute (subscriptValue - dimension.LowerBound()) * dimension.ByteStride() for every subscript, whereas we could have just added dimension.ByteStride() to the current value of the offset. I used a recent clang build for building the runtime library, but it failed to apply any strength reduction here.
Just a simple example of how serious this can be: I specialized SubscriptsToByteOffset for rank 1 just to allow clang to optimize the runtime code better:
std::size_t SubscriptsToByteOffset(const SubscriptValue subscript[]) const {
  std::size_t offset{0};
#if 1
  if (raw_.rank == 1) {
    offset += SubscriptByteOffset(0, subscript[0]);
  } else {
#endif
    for (int j{0}; j < raw_.rank; ++j) {
      offset += SubscriptByteOffset(j, subscript[j]);
    }
#if 1
  }
#endif
  return offset;
}
This gave a 1.46x reduction in retired instructions in Polyhedron/gas_dyn2_11, which uses DoTotalReduction quite heavily.
As long as we are not going to inline all the runtime calls, I think we also need to focus on the performance of the runtime implementation itself. Regarding this particular issue, I think we need to implement a more efficient iterator over all elements and get rid of the redundant operations that compute the same offset values over and over again. We may also get additional improvements from specializing the reduction-related templates for ranks 1 and 2 (the most common ones).
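To illustrate what I mean by a more efficient iterator, here is a hypothetical sketch (the function name IncrementAndUpdateOffset and the cached offset_ member are invented for the example; this is not the current runtime API) of an increment that updates the byte offset incrementally instead of recomputing it from all subscripts:

// Hypothetical sketch: keep a running byte offset for the current element
// instead of recomputing it from every subscript on each step.
std::size_t offset_{0}; // assumed to hold the offset of the current element

bool IncrementAndUpdateOffset(SubscriptValue subscript[]) {
  for (int j{0}; j < raw_.rank; ++j) {
    const Dimension &dim{GetDimension(j)};
    if (subscript[j]++ < dim.UpperBound()) {
      // Common path: one subscript advanced by 1, so the element offset
      // grows by exactly that dimension's byte stride (a single add).
      offset_ += dim.ByteStride();
      return true;
    }
    // This dimension wrapped around: move it back to its lower bound and
    // carry the increment into the next dimension.
    offset_ -= (dim.UpperBound() - dim.LowerBound()) * dim.ByteStride();
    subscript[j] = dim.LowerBound();
  }
  return false; // all subscripts wrapped, iteration is complete
}

With something like this, the rank-1 case degenerates to one compare and one add per element, which is much easier for the compiler to vectorize or at least keep out of the memory-bound path.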