[RFC][OpenMP] Parallelize the possible intrinsic functions in the workshare construct

Problem:

Context

With Ivan’s PR, the intrinsic calls end up enclosed within a single construct, so they execute on only one thread.

Consider an example:

!$omp parallel workshare
  s = sum(x)
!$omp end parallel workshare

This gets lowered to:

omp.parallel {
  omp.single nowait {
    // [...]
    %13 = fir.call @_FortranASumInteger4(%9, %10, %c5_i32, %11, %12)
  }
}

Details

The OpenMP specification says:
OpenMP 5.2: 11.4 Workshare Construct
[…]
Evaluation of transformational array intrinsic functions may be freely subdivided into any number of units of work.
The transformational array intrinsic functions are MATMUL, DOT_PRODUCT, SUM, PRODUCT, MAXVAL, MINVAL, COUNT, ANY, ALL, SPREAD, PACK, UNPACK, RESHAPE, TRANSPOSE, EOSHIFT, CSHIFT, MINLOC, and MAXLOC.

So, to get better performance, these transformational intrinsics should be parallelized rather than executed on a single thread.

Solution:

One solution is to replace the intrinsic call with an inline implementation inside the body of the construct, using the WorkshareLowering pass. To experiment, I picked one intrinsic and enclosed its operations in a wsloop, which shares the iterations among the threads of the team.
In this PR, I picked the SUM intrinsic and added a patch to generate the wsloop.

For the example above, the generated MLIR is:

omp.parallel {
  // [...]
  omp.wsloop {
    omp.loop_nest {{.*}} {
      // [...]
      %[[SUM:.*]] = arith.addf {{.*}}
      // [...]
    }
  }
}

Also, see the patch for the full generated MLIR.
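Conceptually, this lowering is equivalent to the hand-written OpenMP reduction loop sketched below (a minimal illustration, assuming a reduction-based lowering of SUM over a 1-D array; this is not the actual compiler output):

s = 0
! Each thread accumulates a private partial sum over its share of the
! iterations; the partial sums are combined at the end of the loop.
!$omp parallel do reduction(+:s)
do i = 1, size(x)
  s = s + x(i)
end do
!$omp end parallel do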

The same approach then needs to be implemented for all the other intrinsics mentioned in the specification.

Community Replies in the PR

Idea suggestion:

From Ivan R. Ivanov,

I think there are two main approaches possible here, and they can coexist:

  • Have alternative intrinsic implementations in their own runtime library
  • Code gen the intrinsics at the IR level

See details here: link

Reviewers’ comments:

From Tom Eccles,

I don’t have a strong opinion about whether this should be done in flang codegen or in a runtime library. Some drive-by thoughts:
[…]
Overall I don’t know what the right approach is. Maybe we should do both as Ivan said. I think if we do decide to implement several intrinsics this way, it should go in its own pass because the code could quickly become very long.
Comment link

From Michael Klemm,

Regarding the initial question: I’m ambivalent on the direction (runtime entry point vs. code-gen). I will say though that if we go with the runtime entry point, we will tie the Fortran runtime to the OpenMP runtime, which may not be desired in all cases.
Comment link

Analysis of reference compilers’ intrinsic handling

Intel compiler (ifx)

Based on analysis of the assembly generated by ifx, all the operations appear to be enclosed within a single construct. The output contains no calls to runtime functions; all the operations appear to be performed inline.

GNU (gfortran)

Overall analysis
  • If the return type is a scalar, the whole operation is enclosed within a single construct
  • If dim is specified, a runtime call is used and the resultant value is assigned using a for construct. Some runtime calls are placed inside a single construct, while others are not.
Keys:

| Abbreviation | Description |
|---|---|
| Runtime call | Calls a runtime function |
| Assigns result (for loop) | Uses !$omp for to assign the resultant output |
| Single construct | All the operations are enclosed inside a single construct |
| No Single construct | No single construct is used; the operations are performed directly under the parallel construct |
| For construct | All the operations are enclosed inside a for construct |
Analysis:

| Intrinsics | Case 1 | Case 2 |
|---|---|---|
| MATMUL | For 1D output: Single construct, Runtime call | For 2D output: No Single construct, Runtime call, Assigns result using for loop |
| DOT_PRODUCT | For 1D: No Single construct, Assigns result using for loop | - |
| SUM & PRODUCT | For scalar output: Single construct | For array output with dim argument: For construct |
| MAXVAL, MINVAL, ANY, ALL | For scalar output: Single construct | For array output: Single construct, Runtime call |
| COUNT | For scalar output: Single construct | For array output: No Single construct, Runtime call, Assigns result using for loop |
| SPREAD | For all cases: Single construct, Runtime call | - |
| PACK | For 1D array: No Single construct, Runtime call, Assigns result using for loop | For 2D array: Single construct, Runtime call |
| UNPACK | For all cases: No Single construct, Runtime call, Assigns result using for loop | - |
| RESHAPE | For all cases: Single construct, Runtime call | - |
| TRANSPOSE | For all cases: Single construct | - |
| EOSHIFT & CSHIFT | For all cases: No Single construct, Runtime call, Assigns result using for loop | - |
| MINLOC & MAXLOC | For scalar output: Single construct | For array output: No Single construct, Runtime call, Assigns result using for loop |
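
As a concrete illustration of the SUM & PRODUCT row above, the two cases correspond to source like the following (a minimal hypothetical example; the variable names are mine):

!$omp parallel workshare
  s = sum(a)         ! scalar output: gfortran keeps this inside a single construct
  r = sum(b, dim=1)  ! dim specified: gfortran performs the operation inside a for construct
!$omp end parallel workshare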

Any further ideas or suggestions would be really helpful for the implementation.

Thank you

First of all, how does this work together with function inlining? We already have code that simplifies/inlines some variants of these functions. Particularly for small cases, the overhead of calling a function is significant compared to the code executed in the function itself.

Second, I think in many cases, splitting the work across multiple cores can make the performance worse rather than better, as memory fetches across a wider range of addresses will just make memory bandwidth limits more obvious.

I’m absolutely not opposed to the idea in general, but it would need to be correctly tuned so that it does not make some cases worse rather than better.

I agree with Mats that this would need significant effort towards benchmarking and tuning to make it advantageous. It isn’t obvious to me what the costs of using multiple threads vs a single construct (plus synchronization) will be in different cases.

Expanding on Mats’ point about inlining, we have seen significant performance gains in some benchmarks from the inline implementations of some intrinsic functions in the HLFIR OptimizedBufferization pass.

In addition to Mats’ points:

  • I’m not sure if 1D vs 2D output is the right criterion for whether to multi-thread. It would be better to consider the overall input size
  • If any of these are worth doing I think the reduction intrinsics map most conveniently onto OpenMP. I don’t think scalar outputs should prevent these from being multithreaded (at least for very large contiguous inputs)
  • I think a high performance multithreaded matmul implementation sounds like a difficult undertaking. I worry that a naive implementation would worsen the existing memory access locality issues. If you don’t have a good implementation to hand already (or some other strong motivation) I think it would be a better pay off to just make our scalar implementation more efficient for common cases. Currently it is a pretty naive matmul without any tricks to improve memory locality. Even just providing an option to use BLAS routines where possible could be an improvement.
  • Overall I think it would be better to approach this with particular applications or benchmarks in mind and then see which optimizations actually look helpful. One of the reasons I haven’t done much work on optimizing the single threaded runtime transformation functions is that (other than spec2000) I haven’t seen real Fortran code which spends much time inside of those intrinsic functions. Usually we have more pressing problems elsewhere. Of course large differences can be shown using microbenchmarks.

Finally, have you considered doing these implementations using upstream MLIR dialects? I believe there is a lot of work going on upstream we could utilize.

This paper might be of interest: https://arxiv.org/pdf/2409.18824. It has some measurements comparing our runtime library to single- and multi-threaded implementations using the affine and linalg upstream MLIR dialects.

With the latest comments on this matter, it seems like going with single for now could be a sensible choice. I see several arguments in favor of this:

  • it’s the simplest solution for now
  • it is correct from an OpenMP specification POV, although it may or may not give the best possible performance.
  • It may cover “most” (gut feeling, no hard data on the number of such constructs) applications that use workshare for rather simple stuff and do not necessarily use complex array statements (this is partly because of the long and weird history that workshare has had in most Fortran compilers)