[RFC] __has_builtin behavior on offloading targets

Hi all,

Motivation

For non-offloading single-phase compilation builds, the behavior of __has_builtin is straightforward: if the builtin can be used on the current target, language and environment, return true.

However, this gets complicated when two-phase compilation comes into play. Two-phase compilation is hidden from the user as an implementation detail, and as a consequence, __has_builtin returns true if the builtin can be used on either the host target or the offloading target.

If we have something like:

void foo() {
#pragma omp target
  {
#if __has_builtin(__builtin_ia32_pause)
    __builtin_ia32_pause();
#endif
  }
}

When compiled with OpenMP offloading where the offloading device is a GPU and the host is x86, __has_builtin always returns true, leading to the following error:

error: cannot compile this builtin function yet

This behavior reduces the usefulness of __has_builtin.

Note that GCC also implements __has_builtin, and its behavior today matches Clang's. It is unclear whether the GCC behavior is intentional; it is being discussed here.

There are also some relevant PRs: here, here, and here.

Finally, some clients rely on the current behavior of __has_builtin, such as CUDA on Arm and __cpuidex.

Proposal

I propose we deprecate __has_builtin and introduce a new function-like macro, tentatively named __can_use_builtin, which returns true only if the builtin can actually be used on the current target, language, and environment being used for compilation. This means that for offload targets, the macro may return different values in the host and offload-target compiles.

In the above example, __can_use_builtin would evaluate to true in the host compile but false in the target compile.
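Concretely, the earlier example could be rewritten against the proposed macro. This is a sketch of the proposed semantics only; __can_use_builtin does not exist in any shipping compiler:

```c
void foo() {
#pragma omp target
  {
    /* Proposed macro (not yet implemented): true only if the builtin is
       usable by the target of the *current* compilation phase. */
#if __can_use_builtin(__builtin_ia32_pause)
    __builtin_ia32_pause();  /* emitted only during the x86 host compile */
#endif
  }
}
```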

Next Steps

Assuming community consensus, I will first implement __can_use_builtin per the above design and document it with descriptive, relevant examples (a prototype implementation is available here).

Then, I will deprecate __has_builtin and potentially remove it in a future Clang release.

Thanks,
Nick

I personally don’t think we should deprecate __has_builtin. It’s been around for a long time and works properly on non-offloading targets. Offloading targets being weird doesn’t seem like a good enough reason to change behavior here. Personally, I think we should just make __has_builtin behave correctly for offloading targets and fix the one Arm case that depends on it. It does change behavior, so we’d need to document that.

1 Like

Totally agree that deprecating __has_builtin because of this issue seems entirely unreasonable; it’s irrelevant to the 99% of code which isn’t doing an offloading compilation.

4 Likes

Sure, the natural alternative approach is to document that its behavior may be unexpected when used with offloading and to suggest __can_use_builtin.
Thanks for the feedback.

I would prefer not even having can_use_builtin. To me, __has_builtin should work as advertised. This is mostly just working around library code not respecting offloading compilations by guarding things like typedefs behind __has_builtin.

I agree with Joseph that this is right in the ideal world; my concern is that we might break other people relying on the current behavior and won’t find out until we ship a new compiler version.

If we are convinced that’s unlikely, I’m happy to go forward with that approach.

It sounds like our OpenMP implementors decided to give __has_builtin questionable semantics in target mode and now regret that decision. That’s unfortunate, but it’s a self-created problem, and we’re not going to throw out the feature for everyone else.

The immediate question that comes to mind for me is whether we can just revert the special treatment of __has_builtin. How far is OpenMP willing to take this desire to hide two-phase compilation from the user?

  • If the second-phase compiler actually gets preprocessed tokens, __can_use_builtin will not work.
  • If the second-phase compiler reprocesses the translation unit from scratch, OpenMP is not actually trying very hard to maintain the illusion. It’s weird to say that __has_builtin needs special behavior when e.g. the target predefined macros do not. It seems much more reasonable to just acknowledge that the preprocessor is running twice and can have different expansions in each phase.

This is pervasive across all offloading languages, not just OpenMP (SYCL, CUDA, and HIP as well). The general hack in these offloading languages is that we pretend incompatible architectures are the same, so we try to parse the same tokens on both sides so that things like structs have the same offsets.

I would prefer we just do this correctly, but it’s observable to the user: we originally made __has_builtin work as you’d expect, without checking the auxiliary triple, but that broke a CUDA-on-Arm case where they were guarding some typedefs behind a __has_builtin check. This led to one compilation seeing the type and the other not. Personally, I feel that permitting this was self-inflicted. Stuff like this can probably be worked around at the user level with enough hacks, since we already provide macros to break out of the ‘unified’ compile hack, like __HIP_DEVICE_COMPILE__ or __CUDA_ARCH__ being set.

I’m fairly certain that that’s just an easy way of thinking about it and that you aren’t actually trying to parse the same tokens on both sides. If you did, you’d have overwhelming incompatibilities, because you really need system headers to pick the correct type for basic typedefs like size_t and int32_t for the actual target, and that almost always means token differences. And if someone writes #if __ARM__ in the middle of a struct, there’s no way that you’re magically making that work when one target or the other isn’t ARM-based.

I’m pretty sure that the real story is much less interesting, and that all these offloading modes are actually just relying on code being written in a reasonably portable manner, such that the compiler will end up using compatible types (at least in the places it matters for the offloading interface) after all stages of processing. That is totally fine, and we should just acknowledge it instead of talking ourselves into more complicated ideas that inevitably get us into trouble.

I agree that this is something that just should not work.

Currently, you get x87 floats on the GPU if you use long double (see Compiler Explorer), among other fun stuff. We copy most of the type definitions from the auxiliary triple, as far as I’m aware: llvm-project/clang/lib/Basic/Targets/NVPTX.cpp at 27901cec0e76d2cbf648b3b63d5b2fec1d46bb9c · llvm/llvm-project · GitHub

Well, I stand at least partially corrected, as far as offloading to NVPTX and SPIR goes. I guess it works as long as the propagation only goes that way and the offloaded program is completely self-contained. I believe the project does support other offloading targets, though, and this proposal wasn’t actually brought up as target-specific.

So it sounds like the consensus so far is to change the __has_builtin implementation to only consider the current target, and fix any code that breaks because of that.

Are there any concerns with this approach from others?

2 Likes

The Clang area team looked at this RFC as one that potentially needed our help to identify next steps, but given the last update from @sarnex , it seems to me like consensus is forming, and we probably don’t need to set up a call.

Speaking for myself and not the clang area team, I think it would be the wrong tradeoff to deprecate the __has_builtin API. The majority of users are not using offloading and should not be subject to that API churn. __has_builtin is a feature test. Code should be written to have fallback paths for when builtins aren’t available.

2 Likes

My primary concern is compatibility with GCC. GCC and Clang both currently implement the same behavior and this would introduce an incompatibility. Are we comfortable moving forward even if GCC does not wish to do the same?

I’m not sure how strong a consideration compatibility with GCC should be in this particular case. Unless I’m mistaken, GCC doesn’t have builtin support for CUDA, HIP, SYCL, or other offloading languages that Clang supports, so the compatibility concern is probably limited to OpenMP. How does GCC handle use of target-specific builtins in its OpenMP implementation? Perhaps GCC maintainers would be interested in making a similar change.

That’s this bugzilla ticket: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118882 which hasn’t garnered any responses yet.

1 Like

This is the first formal RFC I’ve done, so I’m not sure what the next steps are here.

It seems like there are no major objections to the current approach (change __has_builtin to check the current target only and diverge from GCC); however, Aaron’s latest question doesn’t seem fully answered by the community.

I’m not sure at what point I can move forward with an implementation of the proposal.

First off, I’d try to fix the failing CUDA case; I think you already had a PR for that. Then try to reach out to the GCC people, but honestly I don’t know how many people are still working on offloading in GCC, so you might not get an answer. Either way, after fixing the existing failure, just try to land the original patch and see if anyone complains.

Sure, I think that was blocked waiting for someone to answer a question; I’ll ping them.