I don’t think this works today. There’s some state in omp_data.cu that appears to be shared by every target offloading invocation, so I expect kernels launched concurrently from multiple host threads to step on each other.
Further, I think this should work. It seems like a reasonable use case.
Shall we fix up the deviceRTL (and the compiler side of reductions, and possibly elsewhere) to support multiple simultaneous invocations of the OpenMP target runtime?
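
For concreteness, the use case I have in mind looks roughly like the sketch below: several host threads each launch their own target region containing a reduction. This is a hypothetical reproducer and not taken from an existing test; the function names `run_concurrent_reductions` and `expected_sum` are made up for illustration. I have not confirmed that it fails on any particular build. It only shows the pattern that would exercise any state shared between target invocations. If the compiler is run without offloading (or without OpenMP at all), the target regions simply fall back to the host.

```cpp
#include <vector>

// Hypothetical reproducer: every host thread launches its own target region,
// and each region performs a reduction. If the device runtime keeps state
// that is shared across launches, concurrent launches could interfere.
std::vector<long> run_concurrent_reductions(int num_host_threads, int n) {
  std::vector<long> results(num_host_threads, 0);
  long *out = results.data();

  // One loop iteration per host thread; each iteration offloads independently.
  #pragma omp parallel for num_threads(num_host_threads)
  for (int h = 0; h < num_host_threads; ++h) {
    long sum = 0;
    #pragma omp target teams distribute parallel for reduction(+ : sum) map(tofrom : sum)
    for (int i = 0; i < n; ++i)
      sum += static_cast<long>(i) * (h + 1);
    out[h] = sum;  // distinct element per host thread, so no host-side race
  }
  return results;
}

// Closed form of the reduction above: (h + 1) * (0 + 1 + ... + (n - 1)).
long expected_sum(int h, int n) {
  return static_cast<long>(h + 1) * n * (n - 1) / 2;
}
```

Comparing each entry of `run_concurrent_reductions(4, 1000)` against `expected_sum(h, 1000)` gives a simple pass/fail check. A mismatch seen only when the launches overlap would point at shared runtime state rather than at the reduction itself.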