Currently, the common sub-expression elimination MLIR pass only avoids crossing nested region boundaries of IsolatedFromAbove
operations while checking for existing known values. However, reading/writing a value defined on a parent region might be costly for some operations and it would be a better choice to re-calculate the common expression inside. Or there may be some cases where crossing some operation’s boundary means some constants are not the same as in the parent regions, even if they are based on the same values.
One instance of such a behavior being possible to override can be seen for constant hoisting by the OperationFolder
class. In that case, constants are inserted in the first parent region that is part of an IsolatedFromAbove
operation, unless there is an earlier parent operation for which a DialectFoldInterface
has been registered that returns true
for a call to shouldMaterializeInto(Region*)
. This allows dialects overriding the default behavior of the canonicalization pass and prevent constants from being hoisted out of certain regions.
In D159212 I propose adding a similar mechanism to CSE, so that the elimination of common sub-expressions between the region(s) of a given operation and any parents can be prevented without forcing that operation to be IsolatedFromAbove
, with the restrictions that would come from that.
We stumbled across this need while working on the OpenMP dialect adding support for target offload for LLVM-Flang. The omp.target
operation, which contains a single region, represents code to be executed by a target device which can be different from the host device and with its own memory space. It can use values defined in parent regions, for which they need to be mapped to the target device memory. This mapping may involve sending data from host device (CPU) to target device (e.g. GPU) and vice versa, which can be expensive. Additionally, calculations related to indexing of an array cannot be done in the host device and then passed on to the target device, because the same array can be placed in different addresses for host and
device.
Below is an example showing these problems which we encountered and that could be solved by preventing CSE across omp.target
boundaries. This is the pre-CSE MLIR representation, with comments showing the Fortran they were lowered from:
// subroutine f()
func.func @_QPf() {
// integer :: a(10)
%0 = fir.alloca !fir.array<10xi32>
// integer :: b(10)
%1 = fir.alloca !fir.array<10xi32>
// b(1) = 100
%c0_i64 = arith.constant 0 : i64
%c100_i32 = arith.constant 100 : i32
%2 = fir.coordinate_of %1, %c0_i64 : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
fir.store %c100_i32 to %2 : !fir.ref<i32>
// !$omp target map(from: a) map(to: b)
omp.target map((from -> %0 : !fir.ref<!fir.array<10xi32>>), (to -> %1 : !fir.ref<!fir.array<10xi32>>)) {
// a(1) = b(1)
%c0_i64_0 = arith.constant 0 : i64
%3 = fir.coordinate_of %1, %c0_i64_0 : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
%4 = fir.load %3 : !fir.ref<i32>
%5 = fir.coordinate_of %0, %c0_i64_0 : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
fir.store %4 to %5 : !fir.ref<i32>
omp.terminator
}
return
}
And this is what happens after the CSE pass runs. The arith.constant 0 : i64
constants are consolidated and also the result of fir.coordinate_of %1, %c0_i64
is reused outside and inside of the omp.target
region. The first case could result in the need for copying the zero value from host device to target device and the second case can result in a wrong calculation of the address for the fir.load
operation:
func.func @_QPf() {
%0 = fir.alloca !fir.array<10xi32>
%1 = fir.alloca !fir.array<10xi32>
%c0_i64 = arith.constant 0 : i64
%c100_i32 = arith.constant 100 : i32
%2 = fir.coordinate_of %1, %c0_i64 : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
fir.store %c100_i32 to %2 : !fir.ref<i32>
omp.target map((from -> %0 : !fir.ref<!fir.array<10xi32>>), (to -> %1 : !fir.ref<!fir.array<10xi32>>)) {
// Loading from host device address
%3 = fir.load %2 : !fir.ref<i32>
// Implicit map of %c0_i64
%4 = fir.coordinate_of %0, %c0_i64 : (!fir.ref<!fir.array<10xi32>>, i64) -> !fir.ref<i32>
fir.store %3 to %4 : !fir.ref<i32>
omp.terminator
}
return
}
So I wanted to ask if the approach I proposed in my patch would be a good way to address this issue, or rather if there are preferred approaches, maybe involving the creation of another trait, extending an existing interface rather than creating a new one, or modifying the omp.target
operation definition to be IsolatedFromAbove
and dealing with mapped variables differently.