[RFC] Delayed target-specific diagnostics when compiling for devices.

Hi, many OpenMP users run into trouble when they try to compile real-world OpenMP applications that use offloading constructs.

Problem Description


For example, consider this abstract code:

void target_unused() {
  int a;
  __asm__("" : "=a"(a)); // "=a" is an x86-specific output constraint
}

void target_used() {
  int a;
  __asm__("" : "=a"(a)); // same constraint, but used in a target region
}

void foo() {
  target_unused();
#pragma omp target
  target_used();
}

Assume we are going to compile this code on an x86_64 host to run on an
NVIDIA NVPTX64 device. When we compile this code for the host,
everything is fine. But when we compile the same code for the NVPTX64
target, we get the following error messages:

11:13: error: invalid output constraint '=a' in asm
20:13: error: invalid output constraint '=a' in asm

But, actually, we should see only one error message, the second one,
for the function `target_used()`, which is actually used in the target
region. The other function, `target_unused()`, is used only on the
host, and we should not produce an error message for it when we
compile the code for the device.

The main problem with these functions is that they are not marked
explicitly as device functions, as is required in CUDA.
In OpenMP, it is not required to mark them explicitly as
device-only or host-device functions. They can be marked
implicitly, depending on whether they are used in target-based
constructs (possibly indirectly, through a chain of calls) or not.

Do you mean that the implicit declare-target feature, which marks
functions to be generated for the target based on transitive usage, is
not implemented in the frontend? My understanding is that this feature
was designed to be implementable in the frontend. In that case, the
frontend can validate the inline assembly as it does now based on the
current code-generation target.

Thanks again,

Hal

Yes, it is implemented in the frontend. No, it is not implemented in the Sema analysis; it is implemented in the CodeGen part of the frontend. And it is almost impossible to implement it in Sema: it requires recursive analysis of the functions used (not only those called directly, but also indirectly). The better place to implement it is the CodeGen part of the frontend.

Best regards,
Alexey Bataev

> Yes, it is implemented in the frontend. No, it is not implemented in the Sema analysis; it is implemented in the CodeGen part of the frontend. And it is almost impossible to implement it in Sema: it requires recursive analysis of the functions used (not only those called directly, but also indirectly). The better place to implement it is the CodeGen part of the frontend.

Your proposal makes sense to me. And, I agree, just moving the logic to
CodeGen seems like the most-straightforward solution. I don't see any
conceptual problem with inline assembly being checked a bit later in the
pipeline (it is, after all, quite target specific).

-Hal

We generally prefer to diagnose things in Sema rather than IRGen, and one
reason is that we want these diagnostics to show up in clients like IDEs.

Obviously, ASM constraint validation is a pretty minor diagnostic, but I
don't see what naturally limits this to just that. In fact, hasn't this
come up before?

If this is really just specific to ASM constraint validation, I think we
can find a way to delay the diagnosis of those to IRGen conditionally.
Otherwise, I think you need to find a way to suppress diagnostics from
functions that you don't care about.

John.

Didn’t we already go through this for CUDA? It has all the same issues. I see some comments about Sema::CUDADeferredDiags. I would look at that and try to generalize it.

This is not only for asm; we need to delay all target-specific diagnostics.
I’m not saying that we need to delay the host diagnostics, only the diagnostics for the device compilation.
As for CUDA, it is a little bit different. In CUDA the programmer must explicitly mark the device functions, while in OpenMP it is done implicitly. Thus, we cannot reuse the solution used for CUDA.

Best regards,
Alexey Bataev

> As for CUDA, it is a little bit different. In CUDA the programmer must explicitly mark the device functions, while in OpenMP it is done implicitly. Thus, we cannot reuse the solution used for CUDA.

Do you mean that because the OpenMP programmer does not explicitly mark device functions, the general approach taken to these sorts of errors in CUDA is inapplicable?

Yes, right. OpenMP does not require explicit marking of the device functions. It supports implicit target functions.

Best regards,
Alexey Bataev

> Do you mean that because the OpenMP programmer does not explicitly mark device functions, the general approach taken to these sorts of errors in CUDA is inapplicable?

> Yes, right. OpenMP does not require explicit marking of the device functions. It supports implicit target functions.

Would you be willing to elaborate on how the fact that OpenMP has implicit device functions means that the general approach taken by CUDA to this problem is inapplicable to OpenMP?

(For example, I’d have naively thought that CUDA host device functions are very similar to an OpenMP implicit-device function, in basically all respects other than the fact that the CUDA functions have an explicit attribute.)

All it means is that you can't just use the solution used for CUDA "off the shelf". The basic idea of associating diagnostics with the current function and then emitting those diagnostics later when you realize that you have to emit that function is still completely applicable.

John.

__host__ __device__ functions are still device functions, which means they must be emitted when you compile for the device. You know that the user marked those functions as device functions. In OpenMP, you cannot say before the codegen phase whether a function is used on the device or not. We should not emit all available functions, only those which are used (implicitly or explicitly, directly or indirectly) in the target regions.

Best regards,
Alexey Bataev

I don't see why you couldn't do that analysis in Sema.

John.

> host device functions are still device functions and it means that they must be emitted when you compile for the device

That is not the case for templated or inline host device functions. They explicitly are not emitted for host/device unless they are called from a host/device context. CUDA code relies heavily on this fact. As a result, you are allowed to do “host-only” things from a host device function so long as it’s not codegen’ed for device. Similarly, you can do “device-only” things from a host device function so long as it’s not codegen’ed for host.

The notion of “deferred diagnostics” in clang’s CUDA support is explicitly there to handle the case when we do not know whether or not a host device function must be emitted for host or device and so we don’t know whether or not to raise an error when you do a “wrong-side” thing (i.e. you’re compiling for device and you did a host-only thing, or you’re compiling for host and you did a device-only thing).

Because currently this kind of analysis is implemented in CodeGen, and CodeGen decides which functions should be emitted and which should not.
To implement it in Sema, we would need to reimplement almost everything from CodeGen, because we would need to analyze all the statements in all functions. That would significantly increase compilation time.

Best regards,
Alexey Bataev

Could you reference such functions in an initializer? Could you call them indirectly? Could you take their addresses? Or can they only be called directly from other functions?

Best regards,
Alexey Bataev

Yes, I thought about this. But we need to delay the diagnostic until the CodeGen phase. What I need is a way to associate the diagnostic with the function so that the diagnostic is available in CodeGen.

Also, we need to postpone the diagnostics not only for functions but, for example, for some types. For example, the __float128 type is not supported by CUDA. We can run into something like `typedef __float128 SomeOtherType` (say, in some system header files) and get an error diagnostic when we compile for the device. Though this type is not actually used in the device code, the diagnostic is still emitted; we need to delay it too and emit it only if the type is used in the device code.

This is exactly what I mean. We are not going to rewrite the compiler to
delay arbitrary diagnostics until IRGen. You need to figure out a way to
do this that doesn't require that, and it probably starts with taking
advantage of the deferred-until-function-use infrastructure put in place
for CUDA.

John.