[RFC] OpenMP offload infrastructure

Hello everybody!

I would like to present a proposal for the implementation of OpenMP
offloading in LLVM. It was created by a number of authors, mostly
covers the runtime part, and stays at a very high level. I believe it
will be good to have input from the community at this early stage,
before moving deeper into the details.

The driver part is intentionally left untouched, since we have no
clear vision of how one could use a 3rd-party compiler for target code
generation and incorporate its results into the final host link phase.
I hope to hear more from you on this.

I invite you to take part in the discussion of the document.
Criticism, proposals, updates - all are welcome!

Thank you,
Sergey Ostanevich
Open Source Compilers
Intel Corporation

offload-proposal.pdf (669 KB)

I didn’t see SPIR discussed anywhere.

This isn't OpenCL, and depending on OpenCL for OpenMP may not really make sense. While I have my own opinions - if you feel strongly that it will help enable higher performance somewhere, please list those reasons.

Storing LLVM IR in the fat binary may have the same performance issues mentioned below. The fat binary discussed in the proposal has a provision for storing the ISA/LLVM IR. My point is that instead of LLVM IR it should be something like SPIR.

Ok - so let's see some data.

#1 Benchmarks showing at least SPIR dgemm/sgemm performance
#2 Some logical explanation why all the extra work for SPIR when LLVM IR is native

Basically, something besides an opinion or "it's shiny" - some solid technical reason.

I hate to repeat myself, but again: why on earth would a closed-source solution be preferred over LLVM IR?

Sergey [et.al], thanks for putting this proposal together. Overall, this looks like a pretty solid approach to providing relatively hardware agnostic omp target functionality. I had several comments/questions as summarized below:

Pros:
- We [local colleagues and myself] like the concise target API. We’re big fans of KISS development principles.
- We believe this provides a good basis for future work in heterogeneous OMP support

Comments/Questions:
- There doesn’t seem to be any mention of how mutable each runtime function is with respect to its target execution region. The core OMP spec document notes in several places that certain user-visible runtime calls have “implementation defined” behavior depending upon where/how they’re used. For example, what happens if the host runtime issues a __tgt_target_data_update() while the target is currently executing (__tgt_rtl_run_target_region() )? Is this implementation defined? I’m certainly ok with that answer, but I believe we need to explicitly state what the behavior is.

- I noticed that Alexandre Eichenberger was one of the authors. Has he mentioned any support/compatibility with the profiling interfaces he (JMC, et.al.) proposed? How does one integrate the proposed profiling runtime logic with a target region (specifically the dispatch & data movement interfaces)? This would be very handy.

- I don’t see any mention of an interface to query the physical details of a device. I know this strays a bit from the notion of portability, but it would be nice to have a simple interface (similar to ‘omp_get_max_threads’). I stop short of querying information as detailed as provided by hwloc, but it would be nice for the user to have the ability to query the targets and see which ones are appropriate for execution. This would essentially provide you the ability to build different implementations of a kernel and make a runtime decision on which one to execute. E.g.:
if (/* target of some specific type present */) {
    /* use the omp target interface */
} else {
    /* use the normal worksharing or tasking interfaces */
}

(I realize this is more of an OMP spec question)
  
- It would be nice to define a runtime and/or environment mechanism that permits the user to enable/disable specific targets. For example, if a system had four GPUs, but you only wanted to enable two, it would be convenient to do so using an environment variable. I realize that one could do this using actual runtime calls in the code with some amount of intelligence, but this somewhat defeats the purpose of portability. Again, this is more related to the 4.x spec, but it does have implications in the lower-level runtime.

cheers
john

Hi John,

Thank you for the comments. I am addressing some of them below.

Regards,
Samuel

Samuel, thanks for the response. I have a few short responses below.

cheers
john

John D. Leidel
Software Compiler Development Manager
Micron Technology, Inc.
jleidel@micron.com
office: 972-521-5271
cell: 214-578-8510

Hi John,

Thank you for the comments. I am addressing some of them below.

Regards,
Samuel

Sergey [et.al], thanks for putting this proposal together. Overall, this looks like a pretty solid approach to providing relatively hardware agnostic omp target functionality. I had several comments/questions as summarized below:

Pros:
- We [local colleagues and myself] like the concise target API. We’re big fans of KISS development principles.
- We believe this provides a good basis for future work in heterogeneous OMP support

Comments/Questions:
- There doesn’t seem to be any mention of how mutable each runtime function is with respect to its target execution region. The core OMP spec document notes in several places that certain user-visible runtime calls have “implementation defined” behavior depending upon where/how they’re used. For example, what happens if the host runtime issues a __tgt_target_data_update() while the target is currently executing (__tgt_rtl_run_target_region() )? Is this implementation defined? I’m certainly ok with that answer, but I believe we need to explicitly state what the behavior is.

In my view, the user-visible OpenMP calls that apply to target regions depend on the state kept in libtarget.so, and are therefore device-type independent. What is device dependent is how the OpenMP terminology is mapped. For example, get_num_teams() would operate on top of the state kept in libtarget.so, but how the device interprets a team is device dependent and decided by the target-dependent runtime.

A different issue is how the RTL implementation of calls that are common to target and host (i.e. the kmpc_ calls) should be done. I think it is a good idea to have some flexibility in the codegen to tune the generation of these calls if the default interface is not suitable for a given target. But in general, the kmpc_ library implementation should be known to the toolchain of that target so it can properly drive the linking.

About the specific example you mentioned: if I understand it correctly, following the current version of the spec, __tgt_rtl_run_target_region() has to be blocking, so libtarget.so would have to wait for the update to be issued. The actions in libtarget.so would have to be sequential, exactly as the code generation expects. If for some reason these constraints change in future specs, both the code generation and the libtarget.so implementation would have to be made consistent.

I’ll echo back my understanding of your statements. OpenMP user calls are well defined and apply to the calling target region (which is what I expect from my interactions with the subcommittee). The ‘kmpc_*’ library call implementations (especially the ability to generate these calls) are “implementation defined.” I believe this is the best path to allow for high-performance implementations on orthogonal target architectures. Finally, your statements regarding my example indicate that a target execution region is instantiated sequentially (e.g., blocking) with respect to the calling construct (thread/task). This was also my assumption; I wanted to make sure others interpreted the document/spec the same way.

- I noticed that Alexandre Eichenberger was one of the authors. Has he mentioned any support/compatibility with the profiling interfaces he (JMC, et.al.) proposed? How does one integrate the proposed profiling runtime logic with a target region (specifically the dispatch & data movement interfaces)? This would be very handy.

- I don’t see any mention of an interface to query the physical details of a device. I know this strays a bit from the notion of portability, but it would be nice to have a simple interface (similar to ‘omp_get_max_threads’). I stop short of querying information as detailed as provided by hwloc, but it would be nice for the user to have the ability to query the targets and see which ones are appropriate for execution. This would essentially provide you the ability to build different implementations of a kernel and make a runtime decision on which one to execute. E.g.:
if (/* target of some specific type present */) {
    /* use the omp target interface */
} else {
    /* use the normal worksharing or tasking interfaces */
}

(I realize this is more of an OMP spec question)

I agree this is more of an OMP spec issue. The fact that we are addressing different device types is already an extension to the spec, which poses some issues. One of them, somewhat related to this, is how device ids are mapped to device types. Should this depend on flags passed to the compiler (e.g. omptargets=A,B with ids 0-1 assigned to A and 2-3 to B, given that the runtime identified two devices of each type in the system), or should it depend on the environment? In the current proposal, libtarget.so abstracts a single target made of several targets; do we want to let the user prioritize which exact device to use? Should this be decided at compile time or at runtime?

If my memory serves me, the original OMP 4.0 spec for target execution/data regions is based upon the notion of a host + 1 target, rather than a host + `N` different targets. This was definitely the right decision as we have to crawl before we can walk. You bring up an interesting point with respect to the priority of the devices discovered/dispatched. I fear this is a rather complex issue (although an interesting one).

I believe I should just extend the section regarding the target code.
It can be literally anything that the target RTL can support. For Xeon
Phi it will be a Linux executable - best for performance, code size,
portability, etc.
For any OpenCL-compatible system this can be SPIR, if you wish. For a
proprietary DSP it can be something else. But it is your (or a
vendor's) responsibility to provide an RTL that is capable of
translating this to the target.

Sergos

I need to talk this over internally, but for Xeon Phi we may be willing to contribute some (a lot?) of the code as open source.

Specifically - we have a tiny, scalable "OS" that we wrote to replace the onboard Linux that is uploaded to the Xeon Phi.

The benefits appear to be:
1) Less overhead (both in init times and in resident footprint on the card)
2) Less impact from scheduling and large-Linux-kernel problems on the card
3) Easier to hack if you want to research and test something
4) Exposes an interface that makes it appear more similar in design to a GPU. (This mostly impacts runtime design, but may also make it easier to support the target/data OMP4 clauses)

Not to blame your memory, John. :-)

p.14 lines 19-21:
An implementation may support other target devices. If supported, one
or more devices are available to the host device for offloading code
and data.

Regards,
Sergos