[libomptarget] Data corruption on target nowait with more than 4 threads

Hello everyone,

I’m seeing data corruption when using the generic-elf plugin with the program below (blocked matrix multiplication). I tested it with three builds: the release branches “release/11.x” and “release/12.x”, and the main branch. I observed the following behavior:

  • release/11.x & main: the program works correctly with up to 4 OpenMP threads (OMP_NUM_THREADS=4), but with any higher number the result of the operation becomes incorrect. I believe the problem may also occur with 2-4 threads, only less often (none of 500 executions showed it);
  • release/12.x: the program crashes with a segfault inside a function called “__kmp_push_task” from the OpenMP runtime, regardless of the number of threads.

The program was compiled with the following command after setting the environment variables to point to the correct clang build:

“clang++ -fopenmp -fopenmp-targets=x86_64-pc-linux-gnu BlockMatMul.cpp”

Does anyone know if this is a known problem (e.g. multiple mappings happening in parallel)? What about the segfault in “__kmp_push_task”?

Thanks for the help,
Guilherme Valarini

Here is the program (sorry, I could not come up with a smaller example to post here). I have dumped the task graph built by OpenMP in dot/graphviz form, and it seems to be correct, with the intended dependencies found in the function “BlockMatMul_TargetNowait”.
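The pattern under discussion can be sketched as follows. This is a minimal illustration and not the BlockMatMul.cpp from this post: the names blockMatMulNowait and countMismatches, the sizes N and BS, and the contiguous block layout are all assumptions. Each block product is issued as a target nowait task; the depend clauses serialize updates to the same block of C, while blocks of A and B may be mapped by several tasks at once.

```cpp
// Hypothetical sketch, NOT the original BlockMatMul.cpp.
#include <cassert>
#include <vector>

constexpr int N = 64;       // matrix dimension (assumed)
constexpr int BS = 16;      // block dimension (assumed)
constexpr int NB = N / BS;  // blocks per row/column
static_assert(N % BS == 0, "N must be a multiple of BS");

// Blocks are stored contiguously; block (i, j) holds BS*BS floats.
static float *block(float *m, int i, int j) {
  return m + (i * NB + j) * BS * BS;
}

// C += A * B, one asynchronous target task per block product.
void blockMatMulNowait(float *A, float *B, float *C) {
  assert(A && B && C);
  #pragma omp parallel
  #pragma omp single
  {
    for (int i = 0; i < NB; ++i)
      for (int j = 0; j < NB; ++j)
        for (int k = 0; k < NB; ++k) {
          float *a = block(A, i, k);
          float *b = block(B, k, j);
          float *c = block(C, i, j);
          // Tasks updating the same C block are ordered by depend(inout);
          // tasks on different C blocks may run and map data concurrently.
          #pragma omp target nowait \
              depend(in: a[0], b[0]) depend(inout: c[0]) \
              map(to: a[0:BS*BS], b[0:BS*BS]) map(tofrom: c[0:BS*BS])
          for (int x = 0; x < BS; ++x)
            for (int y = 0; y < BS; ++y) {
              float sum = c[x * BS + y];
              for (int z = 0; z < BS; ++z)
                sum += a[x * BS + z] * b[z * BS + y];
              c[x * BS + y] = sum;
            }
        }
    // Wait for every target task generated above.
    #pragma omp taskwait
  }
}

// With A filled with 1 and B with 2, every element of C must equal 2 * N.
int countMismatches() {
  std::vector<float> A(N * N, 1.0f), B(N * N, 2.0f), C(N * N, 0.0f);
  blockMatMulNowait(A.data(), B.data(), C.data());
  int bad = 0;
  for (float v : C)
    if (v != 2.0f * N) ++bad;
  return bad;
}
```

A main that returns a nonzero status when countMismatches() is nonzero would let the binary be run repeatedly to look for intermittent failures. Without -fopenmp the pragmas are ignored and the code runs sequentially, which gives a baseline for comparison.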

Hi Guilherme,

We do have some bugs on the x86_64-pc-linux-gnu target. Not all existing test cases in libomptarget pass (IIRC, three stable failures and one random failure). Therefore, data races or corruption on this target are to be expected.

Regards,
Shilei

Hi all,

I took a deeper look at the problems mentioned by Guilherme, and here are a few observations:

(1) The corruption of the BlockMatMul result happens not only with the x86_64-pc-linux-gnu target but also with nvptx64-nvidia-cuda. So the problem seems to come from the target-agnostic part of libomptarget rather than from the x86 plugin specifically. Note that the problem does not always appear, so you might need to run the program multiple times. Reducing the number of OpenMP threads sometimes helps to reproduce the problem with the CUDA plugin.

export OMP_NUM_THREADS=2

clang++ -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda BlockMatMul.cpp -o blockmatmul

for i in {1..100}; do ./blockmatmul || break; done

(2) The segfault in __kmp_push_task only happens with the x86_64-pc-linux-gnu target, but it comes from a regression in libomp that seems to have been introduced with the support for hidden helper tasks in the RTL: the crash occurs because the task_team pointer is NULL at that point. Maybe you have an idea of the best way to solve it.

Best regards,
Hervé

Thanks. I opened a bug to track the issue. https://bugs.llvm.org/show_bug.cgi?id=49334

Regards,
Shilei

Dear Shilei,

Thank you, I saw you already submitted a fix for the segfault.

You said in the PR that the test passes with the NVPTX target, but I was wondering whether you ran it several times: on my computer it fails sometimes, but not always, with the NVPTX target (even with your patch), which makes it a bit complicated to reproduce.

Regards,
Hervé

Okay, thanks for the information. Would you please update the bug tracker?

Regards,
Shilei

Sorry, it took me some time, but here it is: https://bugs.llvm.org/show_bug.cgi?id=49940

Regards,
Hervé

Thanks for the report. I’ll investigate it once I finish my current project.

Regards,
Shilei