Enabling OMPT by default

Dear all,

OpenMP 5.0 will introduce OMPT, the OpenMP tools interface. The current implementation in the LLVM/OpenMP runtime is at the level of OpenMP TR6, as published in November (with some known issues, where we have patches under review).
Currently, this feature is deactivated by default (LIBOMP_OMPT_SUPPORT).
According to overhead measurements performed by Intel and on some of our machines, this feature introduces a worst-case runtime overhead of ~4%; in most cases there is essentially no overhead.
As I understand it, some of the build bots run the tests with the feature enabled. We continuously test the changes locally with different compilers (icc/gcc/clang), but only on the x86_64 architecture.
Further, this feature was tested on ppc64.

How should we proceed to enable this feature by default, and get it tested on other architectures?
For the 6.0 release, I would like to enable OMPT at least for the supported platforms x86/x86_64 and ppc64.

We would also appreciate collaboration with vendors/users of other platforms, to ensure the code is running there.

Typically, only the architecture-specific offset needs to be added.

Looking forward to your feedback,

Joachim

The ARM folks should chip in. This should clearly be tested on AArch64 as well!

-- Jim

Jim Cownie <james.h.cownie@intel.com>
SSG/DPD/TCAR (Technical Computing, Analyzers, and Runtimes)
Tel: +44 117 9071438

Adding Paul, maybe he can help here. Though switching the default for x86/x86_64 and ppc64 doesn't mean it has to work on AArch64 for the 6.0 release; it will just stay deactivated, as for MIPS(64).

Jonas

Does it pass the tests? I doubt it, because it needs modifications, as Joachim pointed out (I had to do the same for ppc64).

Jonas

Nah, it blocks forever on nested_lwt.c when I started it on a 96-core ARMv8.1 machine. BTW, I added the missing part of callback.h for the tests like this:

diff --git a/runtime/test/ompt/callback.h b/runtime/test/ompt/callback.h
index 8481fb4..9a560aa 100755
--- a/runtime/test/ompt/callback.h
+++ b/runtime/test/ompt/callback.h
@@ -105,6 +105,11 @@ ompt_label_##id:
  #define print_possible_return_addresses(addr) \
    printf("%" PRIu64 ": current_address=%p\n", ompt_get_thread_data()->value, \
           ((char *)addr) - 8)
+#elif KMP_ARCH_AARCH64
+// On AArch64 the NOP instruction is 4 bytes long.
+#define print_possible_return_addresses(addr) \
+ printf("%" PRIu64 ": current_address=%p\n", ompt_get_thread_data()->value, \
+ ((char *)addr) - 4)
  #else
  #error Unsupported target architecture, cannot determine address offset!
  #endif

> Nah, it blocks forever on nested_lwt.c when I started it on a 96-core ARMv8.1 machine.

Can you get us a stack trace? I think we don't have access to an ARM machine to test ourselves but we might be able to guess based on the implementation.

> BTW, the missing part of callback.h for tests I added like this:
>
> diff --git a/runtime/test/ompt/callback.h b/runtime/test/ompt/callback.h
> index 8481fb4..9a560aa 100755
> --- a/runtime/test/ompt/callback.h
> +++ b/runtime/test/ompt/callback.h
> @@ -105,6 +105,11 @@ ompt_label_##id:
>  #define print_possible_return_addresses(addr) \
>    printf("%" PRIu64 ": current_address=%p\n", ompt_get_thread_data()->value, \
>           ((char *)addr) - 8)
> +#elif KMP_ARCH_AARCH64
> +// On AArch64 the NOP instruction is 4 bytes long.
> +#define print_possible_return_addresses(addr) \
> + printf("%" PRIu64 ": current_address=%p\n", ompt_get_thread_data()->value, \
> + ((char *)addr) - 4)
>  #else
>  #error Unsupported target architecture, cannot determine address offset!
>  #endif

Interesting that we don't have a load instruction at all for AArch64. Sigh, that's why we have to customize it for every architecture...

Probably my fault, there is an uninitialized use of "condition":

https://github.com/llvm-mirror/openmp/blob/master/runtime/test/ompt/parallel/nested_lwt.c#L11

Please initialize to 0 and test again.

Thanks
Joachim

Yeah, you're right, initializing this variable to zero fixes the failing test case. However, this uncovers a flaw in my snippet: some test cases failed because an additional str instruction was added after the nop:

     libomp :: ompt/misc/control_tool.c
     libomp :: ompt/synchronization/taskwait.c
     libomp :: ompt/synchronization/test_lock.c
     libomp :: ompt/synchronization/test_nest_lock_parallel.c

Will look at this later as I'm waiting for my flight now.

Ok, this seems to be similar to x86: you can add multiple offset prints, and the test will match any of the printed addresses.

- Joachim

Is there a significant barrier to building both OMPT=on and OMPT=off binaries and LD_PRELOAD-ing the former when the user actually wants it? This is basically what happens with Linux packages that don't ship debug symbols by default (users who want them install the dbg package) and with some MPI libraries.

Jeff
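A minimal sketch of the two-build approach Jeff describes (the directory layout, application name, and resulting library path are hypothetical; LIBOMP_OMPT_SUPPORT is the real CMake option mentioned above):

```shell
# Build the runtime twice from the same sources, once with OMPT and once without.
mkdir build-ompt  && (cd build-ompt  && cmake -DLIBOMP_OMPT_SUPPORT=ON  ../openmp && make)
mkdir build-plain && (cd build-plain && cmake -DLIBOMP_OMPT_SUPPORT=OFF ../openmp && make)

# Ship the plain library as the default; tool users opt in by preloading the OMPT build.
LD_PRELOAD=$PWD/build-ompt/runtime/src/libomp.so ./my_omp_app
```

Since both libraries export the same interface, the preload simply substitutes the instrumented runtime at load time.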

> I think we don't have access to an ARM machine to test ourselves

There was someone on Twitter the other day running OpenMP code on his Raspberry Pi (using "Raspbian",
I assume, since he had a Linux environment), so it may be possible to add ARM to your testing environment
at little monetary cost. (Of course the time cost to get it all working may be non-trivial, but the
hardware cost seems under €100.)

That would be a 32bit ARM, not AARCH64, but still better than nothing.

-- Jim


The Banana Pi M64 comes with a 64-bit ARM core. They are ~100 €/$. Probably Jonas can set up the testing once he is back 😀

-Joachim

We could certainly do that, but if ARM (or some related companies) are interested in testing, they should donate the hardware, or we should reuse the existing LLVM build slaves. (In the same way, it would be good for Intel to set up testing with their proprietary compiler - I'll send a message about that after the holidays in January.)

Jonas

> (In the same way it would be good for Intel to setup testing with their proprietary compiler - I'll send a message about that after the holidays in January).

1) We do *lots* of testing internally with our compiler :-)
2) I believe that there is nothing that prevents anyone who wants to from setting up public testing
   of the LLVM runtime with Intel's commercial compiler, since there are free compiler
   licenses available for contributors to open-source projects and this should surely qualify:
   Free Intel® Software Development Tools

-- Jim


Sure, but IMO it would be better to have them run on each commit in LLVM. This also saves you the effort of finding out which changes broke functionality with your compiler ;-) In addition, it makes contributors more comfortable because they know their change gets automatic testing, so they can fix problems right away or revert the change until they find out what's happening.
And I agree that anyone *could* set up the bot, but I believe it's in Intel's interest, and you certainly have hardware that could be used for that. I don't see why the community should pay for something that is easy to do for a larger company whose employees also work on the software.

Jonas

Yeah, you're right again. With the following change:

+#define print_possible_return_addresses(addr) \
+ printf("%" PRIu64 ": current_address=%p or %p\n", ompt_get_thread_data()->value, \
+ ((char *)addr) - 4, ((char *)addr) - 8)

...I can see only ompt/tasks/explicit_task.c failing from time to time, but it seems to be unrelated to the printed-address issue:

runtime/test/ompt/tasks/explicit_task.c:94:12: error: expected string not found in input
  // CHECK: {{^}}[[THREAD_ID]]: ompt_event_barrier_end: parallel_id={{[0-9]+}}, task_id=[[IMPLICIT_TASK_ID]]
            ^
<stdin>:53:1: note: scanning from here

> Yeah, you're right again, with the following change:
>
> +#define print_possible_return_addresses(addr) \
> + printf("%" PRIu64 ": current_address=%p or %p\n", ompt_get_thread_data()->value, \
> + ((char *)addr) - 4, ((char *)addr) - 8)

Cool, can you put up a patch for this?

> ...I can see only ompt/tasks/explicit_task.c failing from time to time, but it seems to be unrelated to the printed-address issue:
>
> runtime/test/ompt/tasks/explicit_task.c:94:12: error: expected string not found in input
> // CHECK: {{^}}[[THREAD_ID]]: ompt_event_barrier_end: parallel_id={{[0-9]+}}, task_id=[[IMPLICIT_TASK_ID]]
>            ^
> <stdin>:53:1: note: scanning from here

Do you have a chance to get the full output when the checks fail? (I usually run the test directly, save the output temporarily, and pass it to FileCheck to have the output at hand if that fails.)
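On the command line, that workflow might look roughly like this (the compile flags and paths are illustrative; the RUN lines in each test file show the exact invocation lit uses):

```shell
# Build and run the test once, saving the raw callback output.
clang -fopenmp explicit_task.c -o explicit_task
./explicit_task > explicit_task.out 2>&1

# Re-run the CHECK lines against the saved output as often as needed,
# without re-executing the (possibly flaky) test itself.
FileCheck runtime/test/ompt/tasks/explicit_task.c < explicit_task.out
```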

replies inlined below:

> Yeah, you're right again, with the following change:
>
> +#define print_possible_return_addresses(addr) \
> + printf("%" PRIu64 ": current_address=%p or %p\n", ompt_get_thread_data()->value, \
> + ((char *)addr) - 4, ((char *)addr) - 8)
>
> Cool, can you put up a patch for this?

Done, https://reviews.llvm.org/D41482

> ...I can see only ompt/tasks/explicit_task.c failing from time to time, but it seems to be unrelated to the printed-address issue:
>
> runtime/test/ompt/tasks/explicit_task.c:94:12: error: expected string not found in input
> // CHECK: {{^}}[[THREAD_ID]]: ompt_event_barrier_end: parallel_id={{[0-9]+}}, task_id=[[IMPLICIT_TASK_ID]]
>            ^
> <stdin>:53:1: note: scanning from here
>
> Do you have a chance to get the full output when the checks fail? (I usually run the test directly, save the output temporarily, and pass it to FileCheck to have the output at hand if that fails.)

Now when I run it in isolation, it points to a different line, but the issue seems the same:

$ cat explicit_task.c.tmp.out | $HOME/llvm/build-shared-release/bin/FileCheck $HOME/openmp/runtime/test/ompt/tasks/explicit_task.c
$HOME/openmp/runtime/test/ompt/tasks/explicit_task.c:76:12: error: expected string not found in input
  // CHECK: {{^}}[[THREAD_ID:[0-9]+]]: ompt_event_implicit_task_begin: parallel_id=[[PARALLEL_ID]], task_id=[[IMPLICIT_TASK_ID:[0-9]+]]
            ^
<stdin>:50:86: note: scanning from here
281474976710657: ompt_event_implicit_task_end: parallel_id=0, task_id=281474976710661, team_size=2, thread_num=0
               ^
<stdin>:50:86: note: with variable "PARALLEL_ID" equal to "281474976710660"
281474976710657: ompt_event_implicit_task_end: parallel_id=0, task_id=281474976710661, team_size=2, thread_num=0
               ^
<stdin>:55:5: note: possible intended match here
562949953421313: ompt_event_implicit_task_end: parallel_id=0, task_id=562949953421314, team_size=0, thread_num=1
     ^

...And the full output is:

0: NULL_POINTER=(nil)
281474976710657: ompt_event_thread_begin: thread_type=ompt_thread_initial=1, thread_id=281474976710657
281474976710657: ompt_event_task_create: parent_task_id=0, parent_task_frame.exit=(nil), parent_task_frame.reenter=(nil), new_task_id=281474976710658, codeptr_ra=(nil), task_type=ompt_task_initial=1, has_dependences=no
281474976710657: __builtin_frame_address(0)=0xffffc0a5d2a0
281474976710657: ompt_event_parallel_begin: parent_task_id=281474976710658, parent_task_frame.exit=(nil), parent_task_frame.reenter=0xffffc0a5d2a0, parallel_id=281474976710660, requested_team_size=2, codeptr_ra=0x402fac, invoker=2
281474976710657: ompt_event_implicit_task_begin: parallel_id=281474976710660, task_id=281474976710661, team_size=2, thread_num=0
281474976710657: __builtin_frame_address(1)=0xffffc0a5cec0
281474976710657: task level 0: parallel_id=281474976710660, task_id=281474976710661, exit_frame=0xffffc0a5cec0, reenter_frame=(nil)
281474976710657: task level 1: parallel_id=281474976710659, task_id=281474976710658, exit_frame=(nil), reenter_frame=0xffffc0a5d2a0
281474976710657: __builtin_frame_address(0)=0xffffc0a5cea0
281474976710657: ompt_event_master_begin: parallel_id=281474976710660, task_id=281474976710661, codeptr_ra=0x403078
281474976710657: task level 0: parallel_id=281474976710660, task_id=281474976710661, exit_frame=0xffffc0a5cec0, reenter_frame=(nil)
281474976710657: ompt_event_task_create: parent_task_id=281474976710661, parent_task_frame.exit=0xffffc0a5cec0, parent_task_frame.reenter=0xffffc0a5cea0, new_task_id=281474976710662, codeptr_ra=0x4030e8, task_type=ompt_task_explicit=4, has_dependences=no
281474976710657: fuzzy_address=0x402f or 0x4030 (0x4030f0)
562949953421313: ompt_event_thread_begin: thread_type=ompt_thread_worker=2, thread_id=562949953421313
562949953421313: ompt_event_implicit_task_begin: parallel_id=281474976710660, task_id=562949953421314, team_size=2, thread_num=1
562949953421313: __builtin_frame_address(1)=0xffff8f929610
562949953421313: task level 0: parallel_id=281474976710660, task_id=562949953421314, exit_frame=0xffff8f929610, reenter_frame=(nil)
562949953421313: task level 1: parallel_id=281474976710659, task_id=281474976710658, exit_frame=(nil), reenter_frame=0xffffc0a5d2a0
562949953421313: __builtin_frame_address(0)=0xffff8f9295f0
562949953421313: ompt_event_barrier_begin: parallel_id=281474976710660, task_id=562949953421314, codeptr_ra=0x403174
562949953421313: task level 0: parallel_id=281474976710660, task_id=562949953421314, exit_frame=0xffff8f929610, reenter_frame=0xffff8f9295f0
562949953421313: ompt_event_wait_barrier_begin: parallel_id=281474976710660, task_id=562949953421314, codeptr_ra=0x403174
562949953421313: ompt_event_task_schedule: first_task_id=562949953421314, second_task_id=281474976710662, prior_task_status=ompt_task_others=4
562949953421313: __builtin_frame_address(1)=0xffff8f929260
562949953421313: task level 0: parallel_id=281474976710660, task_id=281474976710662, exit_frame=0xffff8f929260, reenter_frame=(nil)
562949953421313: task level 1: parallel_id=281474976710660, task_id=562949953421314, exit_frame=0xffff8f929610, reenter_frame=0xffff8f9295f0
562949953421313: task level 2: parallel_id=281474976710659, task_id=281474976710658, exit_frame=(nil), reenter_frame=0xffffc0a5d2a0
562949953421313: ompt_event_task_schedule: first_task_id=281474976710662, second_task_id=562949953421314, prior_task_status=ompt_task_complete=1
562949953421313: ompt_event_task_end: task_id=281474976710662
281474976710657: task level 0: parallel_id=281474976710660, task_id=281474976710661, exit_frame=0xffffc0a5cec0, reenter_frame=(nil)
281474976710657: ompt_event_master_end: parallel_id=281474976710660, task_id=281474976710661, codeptr_ra=0x403160
281474976710657: ompt_event_barrier_begin: parallel_id=281474976710660, task_id=281474976710661, codeptr_ra=0x403174
281474976710657: task level 0: parallel_id=281474976710660, task_id=281474976710661, exit_frame=0xffffc0a5cec0, reenter_frame=0xffffc0a5cea0
281474976710657: ompt_event_wait_barrier_begin: parallel_id=281474976710660, task_id=281474976710661, codeptr_ra=0x403174
562949953421313: ompt_event_wait_barrier_end: parallel_id=281474976710660, task_id=562949953421314, codeptr_ra=0x403174
281474976710657: ompt_event_wait_barrier_end: parallel_id=281474976710660, task_id=281474976710661, codeptr_ra=0x403174
562949953421313: ompt_event_barrier_end: parallel_id=281474976710660, task_id=562949953421314, codeptr_ra=0x403174
562949953421313: task level 0: parallel_id=281474976710660, task_id=562949953421314, exit_frame=0xffff8f929610, reenter_frame=(nil)
281474976710657: ompt_event_barrier_end: parallel_id=281474976710660, task_id=281474976710661, codeptr_ra=0x403174
281474976710657: task level 0: parallel_id=281474976710660, task_id=281474976710661, exit_frame=0xffffc0a5cec0, reenter_frame=(nil)
562949953421313: ompt_event_barrier_begin: parallel_id=281474976710660, task_id=562949953421314, codeptr_ra=(nil)
562949953421313: task level 0: parallel_id=281474976710660, task_id=562949953421314, exit_frame=(nil), reenter_frame=(nil)
562949953421313: ompt_event_wait_barrier_begin: parallel_id=281474976710660, task_id=562949953421314, codeptr_ra=(nil)
281474976710657: ompt_event_barrier_begin: parallel_id=281474976710660, task_id=281474976710661, codeptr_ra=0x402fac
281474976710657: task level 0: parallel_id=281474976710660, task_id=281474976710661, exit_frame=(nil), reenter_frame=(nil)
281474976710657: ompt_event_wait_barrier_begin: parallel_id=281474976710660, task_id=281474976710661, codeptr_ra=0x402fac
281474976710657: ompt_event_wait_barrier_end: parallel_id=281474976710660, task_id=281474976710661, codeptr_ra=0x402fac
281474976710657: ompt_event_barrier_end: parallel_id=281474976710660, task_id=281474976710661, codeptr_ra=0x402fac
281474976710657: ompt_event_implicit_task_end: parallel_id=0, task_id=281474976710661, team_size=2, thread_num=0
281474976710657: ompt_event_parallel_end: parallel_id=281474976710660, task_id=281474976710658, invoker=2, codeptr_ra=0x402fac
281474976710657: ompt_event_thread_end: thread_id=281474976710657
562949953421313: ompt_event_wait_barrier_end: parallel_id=0, task_id=562949953421314, codeptr_ra=(nil)
562949953421313: ompt_event_barrier_end: parallel_id=0, task_id=562949953421314, codeptr_ra=(nil)
562949953421313: ompt_event_implicit_task_end: parallel_id=0, task_id=562949953421314, team_size=0, thread_num=1
562949953421313: ompt_event_idle_begin:
562949953421313: ompt_event_idle_end:
562949953421313: ompt_event_thread_end: thread_id=562949953421313
0: ompt_event_runtime_shutdown

The test sorts the output by thread (sort --numeric-sort --stable), so unfortunately this output doesn't show the original error that you have been seeing :-(

Unfortunately it occurs very rarely...