Comparison of 2 schemes to implement OpenMP 5.0 declare mapper codegen

Hi,

Alexey Bataev and I (Lingda Li) would like to draw your attention to an ongoing discussion of 2 schemes for implementing declare mapper in OpenMP 5.0. The detailed discussion can be found at https://reviews.llvm.org/D59474

Scheme 1 (the one I have implemented in https://reviews.llvm.org/D59474):
The detailed design can be found at https://github.com/lingda-li/public-sharing/blob/master/mapper_runtime_design.pptx
For each mapper function, the compiler generates a function like this:

void <type>.mapper(void *base, void *begin, size_t size, int64_t type) {
  // Allocate space for an array section first.
  if (size > 1 && !type.IsDelete)
    <push>(base, begin, size*sizeof(Ty), clearToFrom(type));

  // Map members.
  for (unsigned i = 0; i < size; i++) {
    // For each component specified by this mapper:
    for (auto c : components) {
      ...; // code to generate c.arg_base, c.arg_begin, c.arg_size, c.arg_type
      if (c.hasMapper())
        (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
      else
        <push>(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
    }
  }
  // Delete the array section.
  if (size > 1 && type.IsDelete)
    <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
}

This function is passed to the OpenMP runtime, and the runtime will call this function to finish the data mapping.
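
To make the shape of scheme 1 concrete, here is a minimal hand-written C++ sketch of what such a generated mapper could look like for a struct with one pointer member. Everything in it is an assumption for illustration: the type Vec, the names Vec_mapper and push_component (standing in for <push>), the MapEntry record, and the MAP_* bit values are hypothetical and are not the real libomptarget interface. Real codegen would also set further map-type bits that are left out here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical map-type bits, for this sketch only.
constexpr int64_t MAP_TO = 0x1, MAP_FROM = 0x2, MAP_DELETE = 0x8;

// One mapping request as the runtime would record it (hypothetical layout).
struct MapEntry { void *base; void *begin; size_t size; int64_t type; };

static std::vector<MapEntry> g_entries;

// Stand-in for the <push> runtime hook: it only records the request.
static void push_component(void *base, void *begin, size_t size, int64_t type) {
  g_entries.push_back(MapEntry{base, begin, size, type});
}

// Drop the to/from bits so the array section is only allocated or deleted.
static int64_t clearToFrom(int64_t type) { return type & ~(MAP_TO | MAP_FROM); }

struct Vec { int len; double *data; };

// Roughly what could be emitted for:
//   #pragma omp declare mapper(Vec v) map(v, v.data[0:v.len])
void Vec_mapper(void *base, void *begin, size_t size, int64_t type) {
  Vec *elems = static_cast<Vec *>(begin);
  const bool isDelete = (type & MAP_DELETE) != 0;

  // Allocate space for the whole array section first.
  if (size > 1 && !isDelete)
    push_component(base, begin, size * sizeof(Vec), clearToFrom(type));

  // Map the components of every element; neither has a nested mapper here.
  for (size_t i = 0; i < size; ++i) {
    Vec &v = elems[i];
    push_component(&v, &v, sizeof(Vec), type);
    push_component(&v.data, v.data,
                   static_cast<size_t>(v.len) * sizeof(double), type);
  }

  // Delete the array section last.
  if (size > 1 && isDelete)
    push_component(base, begin, size * sizeof(Vec), clearToFrom(type));
}

// Runs the mapper on a three-element array and returns the number of pushes.
size_t scheme1_demo(int64_t type) {
  double a[2] = {}, b[3] = {}, c[4] = {};
  Vec vs[3] = {{2, a}, {3, b}, {4, c}};
  g_entries.clear();
  Vec_mapper(vs, vs, 3, type);
  return g_entries.size();
}
```

In this sketch the mapper pushes one entry for the array section plus two entries per element, and all of the address and size computation lives in compiler-generated code, so no component array has to be materialized.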

Scheme 2 (which Alexey proposes):
Alexey proposes moving parts of the mapper function above into the OpenMP runtime, so that the compiler generates the code below:

void <type>.mapper(void *base, void *begin, size_t size, int64_t type) {
  ...; // code to generate arg_base, arg_begin, arg_size, arg_type, arg_mapper.
  auto sub_components[] = {...}; // fill in generated begin, base, ...
  __tgt_mapper(base, begin, size, type, sub_components);
}

__tgt_mapper is a runtime function, as shown below:

void __tgt_mapper(void *base, void *begin, size_t size, int64_t type, auto components[]) {
  // Allocate space for an array section first.
  if (size > 1 && !type.IsDelete)
    <push>(base, begin, size*sizeof(Ty), clearToFrom(type));

  // Map members.
  for (unsigned i = 0; i < size; i++) {
    // For each component specified by this mapper:
    for (auto c : components) {
      if (c.hasMapper())
        (*c.Mapper())(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
      else
        <push>(c.arg_base, c.arg_begin, c.arg_size, c.arg_type);
    }
  }
  // Delete the array section.
  if (size > 1 && type.IsDelete)
    <push>(base, begin, size*sizeof(Ty), clearToFrom(type));
}
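
For comparison, the following C++ sketch shows one way the scheme 2 split could look for a struct with one pointer member. It is a sketch under stated assumptions, not the proposed interface: Component, tgt_mapper_sketch, push_component, Vec, and the MAP_* values are hypothetical names. Two further simplifications are assumed: the element size is passed as an extra parameter because a generic runtime function cannot evaluate sizeof(Ty), and the compiler-built array already lists the components of every array element, so the runtime walks it once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical map-type bits, for this sketch only.
constexpr int64_t MAP_TO = 0x1, MAP_FROM = 0x2, MAP_DELETE = 0x8;

using MapperFn = void (*)(void *, void *, size_t, int64_t);

// One precomputed component handed from generated code to the runtime.
struct Component {
  void *base; void *begin; size_t size; int64_t type;
  MapperFn mapper; // nullptr when the component has no nested mapper
};

static std::vector<Component> g_pushed;

// Stand-in for the <push> runtime hook: it only records the request.
static void push_component(void *base, void *begin, size_t size, int64_t type) {
  g_pushed.push_back(Component{base, begin, size, type, nullptr});
}

static int64_t clearToFrom(int64_t type) { return type & ~(MAP_TO | MAP_FROM); }

// Generic runtime side (hypothetical): it knows nothing about the mapped type.
void tgt_mapper_sketch(void *base, void *begin, size_t size, size_t elem_size,
                       int64_t type, const std::vector<Component> &components) {
  const bool isDelete = (type & MAP_DELETE) != 0;
  if (size > 1 && !isDelete)
    push_component(base, begin, size * elem_size, clearToFrom(type));
  for (const Component &c : components) {
    if (c.mapper)
      c.mapper(c.base, c.begin, c.size, c.type);
    else
      push_component(c.base, c.begin, c.size, c.type);
  }
  if (size > 1 && isDelete)
    push_component(base, begin, size * elem_size, clearToFrom(type));
}

struct Vec { int len; double *data; };

// Compiler-generated side: only fills the component array and forwards it.
void Vec_mapper(void *base, void *begin, size_t size, int64_t type) {
  Vec *elems = static_cast<Vec *>(begin);
  std::vector<Component> subs; // heap storage that grows with the array size
  subs.reserve(2 * size);
  for (size_t i = 0; i < size; ++i) {
    Vec &v = elems[i];
    subs.push_back(Component{&v, &v, sizeof(Vec), type, nullptr});
    subs.push_back(Component{&v.data, v.data,
                             static_cast<size_t>(v.len) * sizeof(double),
                             type, nullptr});
  }
  tgt_mapper_sketch(base, begin, size, sizeof(Vec), type, subs);
}

// Runs the mapper on a three-element array and returns the number of pushes.
size_t scheme2_demo(int64_t type) {
  double a[2] = {}, b[3] = {}, c[4] = {};
  Vec vs[3] = {{2, a}, {3, b}, {4, c}};
  g_pushed.clear();
  Vec_mapper(vs, vs, 3, type);
  return g_pushed.size();
}
```

The generated function is shorter in this variant, but the component array has to be built in memory before the runtime can consume it, and its length is proportional to the number of array elements.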

Comparison:
Why choose scheme 1 over scheme 2:

  1. In scheme 2, the compiler needs to generate all map types and pass them to __tgt_mapper through sub_components. But in this case, the compiler won’t be able to generate the correct MEMBER_OF field in the map type. As a result, the runtime has to fix it up using the mechanism we already have here: __tgt_mapper_num_components. This not only increases complexity but also means the runtime has to modify the map types again after they are generated, which hurts data locality. In scheme 1, the map type is generated once by the compiler, so data locality is very good.
  2. In scheme 2, sub_components includes all components that should be mapped. When mapping an array, there are many components to map, so memory for sub_components has to be allocated on the heap. This adds a memory management burden and uses memory inefficiently.
  3. In scheme 1, nested mapper functions can be inlined. The compiler can then optimize the mapper function further, e.g., by eliminating redundant computation or unrolling loops, and thus potentially achieve better performance. These optimizations are not possible in scheme 2.
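
To illustrate the MEMBER_OF issue in point 1, the sketch below assumes the encoding libomptarget uses as far as I know: the upper 16 bits of the 64-bit map type hold the 1-based position of the parent entry in the argument list. The helper names setMemberOf, getMemberOf, and rebaseMemberOf are hypothetical, and the sketch uses uint64_t rather than int64_t to keep the bit shifts well defined. It only shows why a position-dependent field has to be rewritten when a component list is placed at an offset that is known only at run time; it is not the actual fix-up code.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: the upper 16 bits of the map type hold the MEMBER_OF field.
constexpr int kMemberOfShift = 48;
constexpr uint64_t kMemberOfMask = 0xffff000000000000ULL;

// Store the parent position (zero-based) as a 1-based index in the field.
inline uint64_t setMemberOf(uint64_t type, uint64_t parentPos) {
  return (type & ~kMemberOfMask) | ((parentPos + 1) << kMemberOfShift);
}

// Returns 0 when the entry has no parent, otherwise the 1-based parent index.
inline uint64_t getMemberOf(uint64_t type) {
  return (type & kMemberOfMask) >> kMemberOfShift;
}

// Hypothetical fix-up: an index that was numbered relative to the start of
// one mapper's component list is shifted by the number of entries that
// already precede that list in the overall argument list.
inline uint64_t rebaseMemberOf(uint64_t type, uint64_t entriesBefore) {
  const uint64_t member = getMemberOf(type);
  if (member == 0)
    return type; // not a member of anything, nothing to patch
  return (type & ~kMemberOfMask) | ((member + entriesBefore) << kMemberOfShift);
}
```

Under these assumptions, every component that carries a MEMBER_OF index has its map type read and rewritten once more after it was first generated, which is the extra pass point 1 refers to.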

Why choose scheme 2 over scheme 1:

  1. Less code in the mapper function codegen (I doubt this because the codegen function of scheme 1 takes fewer than 200 lines of code)

We would appreciate it if you could share your opinions.

Thanks,
Lingda Li