Hello all,
I was pointed to this discussion regarding my recent commit in D129497 [flang] Lower TRANSPOSE without using runtime.
I would like to explain why I took the route of a special FIR lowering for the TRANSPOSE intrinsic rather than inlining it later in a separate pass.
My implementation follows the ideas described in llvm-project/ArrayComposition.md at main · llvm/llvm-project · GitHub, i.e. lowering the TRANSPOSE intrinsic becomes a "simple" reversal of the iteration space during the lowering of its argument expression (element (i, j) of the result is element (j, i) of the argument). This produces much nicer FIR from the beginning in cases where the argument expression is not simple and would otherwise require a temporary array just for passing the expression value to the TRANSPOSE runtime call.
For example:
A = TRANSPOSE(B + C)
With the runtime call, B + C needs to be computed into a temporary array, which implies explicit fir.allocmem/fir.freemem and the following execution flow:
TEMP = fir.allocmem
fir.do_loop {
  // run over elements of B and C, add them and store into TEMP
}
// embox TEMP into TEMP_BOX
fir.call @_FortranATranspose(RESULT_BOX, TEMP_BOX, ...)
fir.do_loop {
  // run over elements of RESULT_BOX and store them into A
}
fir.freemem TEMP
fir.freemem RESULT_BOX.Addr
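For contrast, here is a rough sketch (not the exact FIR produced by the patch, just the shape of it) of what the optimized lowering boils down to: the argument expression is evaluated with its iteration space reversed, so the whole assignment becomes a single loop nest with no temporary and no runtime call:
fir.do_loop {
  // for each element (i, j) of A: load B(j, i) and C(j, i),
  // add them and store the sum directly into A(i, j)
}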
While I agree with the idea of keeping the FIR representation high level, I cannot say that what we get from the lowering here is a high-level representation. Maybe the lack of a high-level representation here is a problem in itself, but it is not TRANSPOSE specific.
I did consider the path of lowering the TRANSPOSE call later, after the ArrayValueCopy pass (though I think we will have to do it even later if we want to make inlining decisions based on the array sizes, which implies some constant propagation). I found that the current MLIR passes used in Flang are not able to do anything about these loops, whereas I expected something like loop fusion followed by elimination of the temporary allocation and its uses. In particular, the affine promote/demote passes failed for these loops, and I think the FIR operations stuck between the loop nests will be a problem for loop fusion.
So it seemed reasonable to me to follow the ArrayComposition direction at least for TRANSPOSE. I agree that it is worth looking into late inlining of the runtime calls (and, actually, this was my very first idea as well), but in the meantime it is also worth showcasing what kind of performance improvements may be achieved with proper handling of Fortran intrinsics; I guess my implementation is one such example. I tried to do it in a non-intrusive way, e.g. there is an engineering option -opt-transpose=true/false controlling the "optimized" TRANSPOSE lowering, so we can experiment with it further.
I did enable it by default, and I am about to upload changes that disable it for -O0 (the runtime TRANSPOSE implementation has a nice feature of reporting out-of-memory conditions to the user, whereas the "optimized" TRANSPOSE lowering will just operate on a null pointer returned by fir.allocmem).
My work was motivated by SPEC CPU 2000 178.galgel, which has TRANSPOSE hotspots in syshtN.f90 and sysnsN.f90.
I used the flang-new driver with the following options on Icelake (Gold 6338): -ffixed-form -march=native -O2
Results:
Default (optimized TRANSPOSE lowering): 81.09 seconds
-Xflang -mllvm -Xflang -opt-transpose=false: 481.33 seconds
It seems like a good result and we should see what it will cost to achieve the same speed-up with late inlining.
Side notes about the Fortran runtime working on arrays:
As I said, I totally agree that inlining runtime implementations that process arrays is worth it, especially for small sizes. This will enable further optimizations, e.g. after full unrolling. In particular, Polyhedron/induct2_11 suffers from lots of 3-element dot_product computations.
At the same time, during my investigations I found that the current runtime implementation of the array iterator confuses the MLIR/LLVM optimizations quite a lot. I am talking about the interaction between IncrementSubscripts and SubscriptByteOffset.
To get to the next array element we use IncrementSubscripts, which is basically a loop with an early exit that in most cases increments just one subscript value by 1:
for (int j{0}; j < raw_.rank; ++j) {
  int k{permutation ? permutation[j] : j};
  const Dimension &dim{GetDimension(k)};
  if (subscript[k]++ < dim.UpperBound()) {
    return true;
  }
  subscript[k] = dim.LowerBound();
}
Then to get the element value we iterate through all subscripts and compute the element’s offset:
std::size_t SubscriptByteOffset(
    int dim, SubscriptValue subscriptValue) const {
  const Dimension &dimension{GetDimension(dim)};
  return (subscriptValue - dimension.LowerBound()) * dimension.ByteStride();
}
...
std::size_t offset{0};
for (int j{0}; j < raw_.rank; ++j) {
  offset += SubscriptByteOffset(j, subscript[j]);
}
return offset;
So even if we just increment one subscript by 1, and even if the array rank (which is a runtime value) is 1, we compute (subscriptValue - dimension.LowerBound()) * dimension.ByteStride() for every subscript, whereas we could have just added dimension.ByteStride() to the current value of the offset. I used a recent clang build for building the runtime library, but it failed to apply any strength reduction here.
Just a simple example of how serious this can be: I specialized SubscriptsToByteOffset for rank 1 just to allow clang to optimize the runtime code better:
std::size_t SubscriptsToByteOffset(const SubscriptValue subscript[]) const {
  std::size_t offset{0};
#if 1
  if (raw_.rank == 1) {
    offset += SubscriptByteOffset(0, subscript[0]);
  } else {
#endif
    for (int j{0}; j < raw_.rank; ++j) {
      offset += SubscriptByteOffset(j, subscript[j]);
    }
#if 1
  }
#endif
  return offset;
}
This gave a 1.46x reduction in retired instructions in Polyhedron/gas_dyn2_11, which uses DoTotalReduction quite heavily.
As long as we are not going to inline all the runtime calls, I think we also need to focus on the performance of the runtime implementation itself. Regarding this particular issue, I think we need to implement a more efficient iterator over all elements and get rid of the redundant operations that compute the same offset values over and over again. We may also get additional improvements from specializing the reduction-related templates for ranks 1 and 2 (the most common ones).
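To illustrate what I mean by a more efficient iterator, here is a hypothetical sketch (the function name IncrementAndUpdateOffset and the cached offset_ member are invented for the example; this is not the current runtime API) of an increment that updates the byte offset incrementally instead of recomputing it from all subscripts:

// Hypothetical sketch: keep a running byte offset for the current element
// instead of recomputing it from every subscript on each step.
std::size_t offset_{0}; // assumed to hold the offset of the current element

bool IncrementAndUpdateOffset(SubscriptValue subscript[]) {
  for (int j{0}; j < raw_.rank; ++j) {
    const Dimension &dim{GetDimension(j)};
    if (subscript[j]++ < dim.UpperBound()) {
      // Common path: one subscript advanced by 1, so the element offset
      // grows by exactly that dimension's byte stride (a single add).
      offset_ += dim.ByteStride();
      return true;
    }
    // This dimension wrapped around: move it back to its lower bound and
    // carry the increment into the next dimension.
    offset_ -= (dim.UpperBound() - dim.LowerBound()) * dim.ByteStride();
    subscript[j] = dim.LowerBound();
  }
  return false; // all subscripts wrapped, iteration is complete
}

With something like this, the rank-1 case degenerates to one compare and one add per element, which is much easier for the compiler to vectorize or at least keep out of the memory-bound path.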