Is libomp open to adding an arch-specific barrier implementation?

Hello, list.

I'm Tomohiro Misono, a software engineer at Fujitsu.

I'm new to the libomp community, and today I'd like to ask a question about libomp's development policy.
In short, as the title says, is libomp open to adding an arch (CPU)-specific barrier implementation?
I have the hardware-assisted barrier of the Fujitsu A64FX in mind.

Below is more detailed background.

The A64FX processor[*] (an HPC processor used in the supercomputer Fugaku) has a hardware-assisted
barrier that uses architecture-specific registers. This mechanism can synchronize threads
within an L2-sharing domain. Although Fujitsu has its own OpenMP runtime library
implementation that supports this barrier, we are now considering whether it can also be supported in an
open library (i.e. libomp). Based on my research, I think the barrier could be supported in
libomp by adding a new barrier type that only works on a specific architecture, but is this approach OK
for the community?

[*] Specifications: https://github.com/fujitsu/A64FX

Note that the code we have at this point cannot easily be incorporated into libomp, so new
development from scratch is required. Also, a kernel driver must be loaded to access the
registers (please see below). I just want to know whether this plan is feasible in the first place before
starting development.

Some notes on a possible implementation:
- A64FX's hardware barrier can only synchronize within an L2-sharing domain. Therefore the
  conventional software barrier (i.e. the flag class) is still needed for synchronization across L2 domains.
  So, a possible implementation would have some similarity to the hierarchical barrier (only leaves can
  use the hardware barrier). I think extending the current hierarchical barrier code would become messy, and
  introducing a new barrier type is better.
- In the optimal case (i.e. a barrier within one L2 domain), there is no need to use the software barrier at all.
  Currently task execution is mainly coupled with the flag class, and this needs to be addressed somehow.
- In order to use the hardware barrier, each thread must be bound to a specific core and cannot be
  moved. If this condition is not met, the library has to fall back to the software barrier.
  I think this restriction implies the hardware barrier cannot be used at fork_barrier.
- Last but not least, a Linux kernel driver is needed to access the barrier registers on A64FX.
  We are willing to open the driver code too (but it has not been accepted by the Linux kernel community at this point).
  The ultimate goal is to define a user-kernel interface that is as general as possible, so that the code can be
  reused for both libomp and the kernel driver if other hardware-assisted barrier implementations emerge,
  but this is a challenging problem.
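
To make the constraints in the notes above concrete, here is a minimal sketch of how a runtime might decide between the hardware barrier, the software barrier, or a combination of both. It is only an illustration under stated assumptions: `ThreadPlacement`, `BarrierPlan`, and `choose_barrier_plan` are hypothetical names that exist neither in libomp nor in the RFC, and the real decision would also have to account for tasking and the kernel driver being available.

```cpp
#include <vector>

// Hypothetical description of one thread in a team (not a libomp type).
struct ThreadPlacement {
  bool pinned_to_core;  // true if the thread is bound to exactly one core
  int l2_domain;        // index of the L2-sharing domain that core belongs to
};

// Hypothetical outcome of the decision.
enum class BarrierPlan {
  SoftwareOnly,         // fall back to the existing flag-based barrier
  HardwareOnly,         // the whole team sits inside one L2 domain
  HardwareThenSoftware  // hardware inside each domain, software across domains
};

// Decide which plan is usable for a team, following the notes above:
// any unpinned thread forces the software fallback, and a team spanning
// several L2 domains still needs a software step across domains.
BarrierPlan choose_barrier_plan(const std::vector<ThreadPlacement>& team) {
  if (team.empty()) {
    return BarrierPlan::SoftwareOnly;
  }
  bool single_domain = true;
  for (const ThreadPlacement& t : team) {
    if (!t.pinned_to_core) {
      return BarrierPlan::SoftwareOnly;
    }
    if (t.l2_domain != team.front().l2_domain) {
      single_domain = false;
    }
  }
  return single_domain ? BarrierPlan::HardwareOnly
                       : BarrierPlan::HardwareThenSoftware;
}
```

For example, a team of pinned threads that all report the same `l2_domain` yields `HardwareOnly`, while a single unpinned thread anywhere in the team yields `SoftwareOnly`.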

I'd appreciate any comments.

Regards,
Tomohiro

Hi, thanks for the quick response. I will try to write acceptable code.
Also, if anyone has comments on the overall approach, I'd like to hear them.

Best Regards,
Tomohiro

If anyone is interested in this, I posted an RFC version: D122646 [OpenMP][RFC] libomp: Introduce hardware assisted barrier support for A64FX


@AndreyChurbanov @RichBarton-Arm Any thoughts?

@shiltian FYI

As a general principle, I think that we should accept things like this, even though they are very machine specific. If we do not, that leads to private vendor forks, which are a bad thing.
Since such code will only be compiled for the relevant architecture(s), the only overheads it has for others are:

  • the cost of skipping it at compile time,
  • the cost of moving it backwards and forwards when pushing and pulling the code

On the other hand, having it there is useful as a potential educational resource, and should improve the performance of LLVM on the relevant machine.

(Which isn’t a comment on the particular code, of course, just saying that LLVM should be open to having implementation-specific code in it.)

– Jim
James Cownie <jcownie@gmail.com>
Mob: +44 780 637 7146


I concur with Jim. Once we don’t expect any impact on code execution on other architectures, it is safe to add such code to the repo.


Once we don’t expect any impact on code execution on other architectures, it is safe to add such code to the repo.

Thanks for the comments.

The barrier’s main logic is in kmp_barrier.cpp, and each barrier’s release/gather function is selected in a switch statement. So, there should be no overhead there.
In addition, the hard barrier logic needs some setup in kmp_runtime.cpp (like the dist barrier), and therefore some if statements are executed even for other barriers if we don’t exclude them with ifdef at compile time, but I don’t think this has any visible impact.

(By the way, the current code does not build on non-arm64 architectures; I will fix that.)
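
The dispatch described above can be sketched roughly as follows. This is not the code in kmp_barrier.cpp: `BarrierPattern` and `select_gather` are hypothetical stand-ins, and the function returns the name of the algorithm it would run instead of calling a real gather routine. It only shows how an arch-specific case can be excluded at compile time so other targets never see it.

```cpp
#include <string>

// Hypothetical barrier patterns; illustrative names, not libomp's own enum.
enum class BarrierPattern { Linear, Tree, Hyper, Hierarchical, Dist, Hard };

// Stand-in for selecting a gather routine. The Hard case is compiled only
// on AArch64; elsewhere a request for it falls through to the default.
std::string select_gather(BarrierPattern pattern) {
  switch (pattern) {
#if defined(__aarch64__)
  case BarrierPattern::Hard:
    return "hard";
#endif
  case BarrierPattern::Tree:
    return "tree";
  case BarrierPattern::Hyper:
    return "hyper";
  case BarrierPattern::Hierarchical:
    return "hierarchical";
  case BarrierPattern::Dist:
    return "dist";
  default:
    return "linear";  // Linear, plus Hard on targets without the guard
  }
}
```

With this shape, the existing patterns pay only for one extra case label on the guarded architecture and nothing at all on other targets.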
