clang-cl's <intrin.h>, _tzcnt_u32, and compatibility with MSVC's <intrin.h>

[While I filed this as https://llvm.org/bugs/show_bug.cgi?id=30506,
Paul Robinson suggested in the bug that it might be worth bringing to
cfe-dev for wider discussion.]

Firefox's copy of FFmpeg includes <intrin.h> on MSVC:

https://dxr.mozilla.org/mozilla-central/source/media/ffvpx/libavutil/x86/intmath.h#26

and MSVC's <intrin.h> (after including several other headers) declares
_tzcnt_u32. clang-cl declares _tzcnt_u32 in <bmiintrin.h>, but
<bmiintrin.h> is only included from <intrin.h> if an appropriate
target CPU is detected:

https://github.com/llvm-mirror/clang/blob/master/lib/Headers/immintrin.h#L82

even though _tzcnt_u32 (or rather, its underlying implementation
function, __tzcnt_u32) is explicitly declared to be available
everywhere:

https://github.com/llvm-mirror/clang/blob/master/lib/Headers/bmiintrin.h#L284

The net result is that Firefox's copy of FFmpeg doesn't compile with
clang-cl because of this issue. Upstream has the same code, so I
assume the same is true there:

https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/x86/intmath.h#L26

AFAICT, this behavior was changed by only including <bmiitrin.h> if
__BMI__ is defined in:

http://reviews.llvm.org/D20291

with the desirable goal of reducing compile time. But this change
also broke things like _tzcnt_u32 being available.

What is the right thing to do here? Should <bmiintrin.h> be
unconditionally included, or should something else be done?

Thanks,
-Nathan

I think these intrinsics (it's really just tzcnt though?) that are
useful also for non-BMI targets should probably be moved out of the
BMI-specific header.

+thakis and rnk if they have any thoughts on this.

We’ve been here before:
https://llvm.org/bugs/show_bug.cgi?id=25544

http://llvm.org/viewvc/llvm-project?view=revision&revision=253358

https://bugs.chromium.org/p/chromium/issues/detail?id=556735

Nico’s change to avoid including bmiintrin.h if _MSC_VER is set to reduce compile time brought the problem back.

Basically, Intel’s immintrin.h interface is too big. It’s windows.h all over again. It significantly slows down builds of simple projects that include basic STL headers like . Nico was able to speed up the overall build time of Chromium by at least 10% (ask him for specifics) by adding all these ‘#if !defined(_MSC_VER)’ checks to immintrin.h. However, for compatibility, I think we may need intrin.h to include most of that stuff by default.

I added a bunch of Intel folks to basically ask the question: Can we please go back to the old days of mmintrin.h, xmmintrin.h, then emmintrin.h?

If not, we will have to consider adding an evil configuration macro, like WIN32_LEAN_AND_MEAN for windows.h, that opts out of all this extra immintrin.h functionality to speed up compilation.

Forwarding your mail to Elad who is working now on intrinsics. Please take into account a possible delay in Elad’s response, we are in local holidays this week.

  • Elena

Hi All,

As mentioned in past discussions, the introduction of AVX512 intrinsics had a severe impact on the compile time.

The extreme case happens on windows where the mere inclusion of system headers like causes a hit in the compile time of applications that don’t even make use of any intrinsics.

http://lists.llvm.org/pipermail/cfe-dev/2016-May/048837.html
http://lists.llvm.org/pipermail/cfe-dev/2016-June/049460.html

During these discussions Chandler Carruth indicated that usage of clang modules is probably the best alternative to mitigate this problem:

http://lists.llvm.org/pipermail/cfe-dev/2016-June/049513.html

Based on this feedback, Intel has started working on this proposal. Initial Internal measurements indicated that this direction is promising.

For example, the compile time of a single translation unit that includes all of the intel intrinsic header files would reduce by 90% starting from the 2nd compilation (by using cached pre-compiled modules). In general, we saw that compilation of projects that have more than one translation unit that includes x86intrin.h/immintrin.h will get big compile time reductions starting from the 1st compilation.

The biggest advantage of this approach is that it scales and would help also applications that do rely on intrinsics and hence will not be able to avoid the compile time overhead.

I (Intel) would like to add a Clang option that enables the modules feature exclusively for the Intel intrinsics header files, and set this option to be turned on by default. This allows clang to cache a pre-compiled (actually pre-parsed) module of all the headers included by x86intrin.h/immintrin.h, and any following compilation of #include<x86intrin.h>/<immintrin.h> directives will skip the parsing phase and save precious compile time. Since ‘clang modules’ and the regular ‘pre-processor include mechanism’ are not 100% compatible, I started by cleaning some small Intel intrinsics module bugs exposed by this work:

https://reviews.llvm.org/D23871, https://reviews.llvm.org/D24825, https://reviews.llvm.org/D24752 (This one is still in review).

I plan to upload the actual patch that adds the option I mentioned above somewhere next week hopefully. Once this patch is uploaded, you could experiment with it and see if it actually mitigates the problem on your end and provide us with valuable feedback on this direction.

Thanks,

Elad

I uploaded to review the patch that adds the option for enabling the modules feature exclusively for the intrinsic header files: https://reviews.llvm.org/D25337
You can try it out by adding the –fexclusive-builtin-modules command line option. (all the different x86 *intrin.h files should be included by default since the #if guards already contain a __has_feature(modules) condition).

Thanks, Elad

I've been following the review below and its successor, D25767, but
I'm unable to make the proposed solution (or its equivalent) work. If
I specify:

-Xclang -fimplicit-module-maps -Xclang -fmodule-map-file=[builtin .modulemap]

I still get an error about an unresolved __tzcnt_u32 symbol. If I
remove the "-Xclang -fimplicit-module-maps" from the above, I get the
same error. If I try instead:

-Xclang -fmodules -Xclang -fmodule-map-file=[builtin .modulemap]

I get peculiar errors about x86intrin.h and the _Builtin_intrinsics
module not being available and implicit use of module files being
disabled. (Same goes for _Builtin_stddef_max_align_t and
__stddef_max_align_t.h.) Adding "-Xclang -fimplicit-module-maps"
gives me the same error.

I'm not sure what's going wrong, or if there's been some sort of
mistake in generating the module map. Help?

Thanks,
-Nathan

I haven't been following this thread very closely since the topic
shifted towards using modules for the intrinsics headers.

Is the underlying problem we're trying to solve still how to expose
_tzcnt_u32? Isn't it enough to just make it available in intrin.h
instead of (or addition to) bmiintrin.h? It's special in that it
really doesn't require a BMI-enabled target to be useful, and is
probably one of very few instructions with this property?

- Hans

Hi Nathan,

I've been following the review below and its successor, D25767, but
I'm unable to make the proposed solution (or its equivalent) work. If
I specify:

-Xclang -fimplicit-module-maps -Xclang -fmodule-map-file=[builtin
.modulemap]

I still get an error about an unresolved __tzcnt_u32 symbol. If I
remove the "-Xclang -fimplicit-module-maps" from the above, I get the
same error. If I try instead:

-Xclang -fmodules -Xclang -fmodule-map-file=[builtin .modulemap]

I get peculiar errors about x86intrin.h and the _Builtin_intrinsics
module not being available and implicit use of module files being
disabled. (Same goes for _Builtin_stddef_max_align_t and
__stddef_max_align_t.h.) Adding "-Xclang -fimplicit-module-maps"
gives me the same error.

I'm not sure what's going wrong, or if there's been some sort of
mistake in generating the module map. Help?

The equivalent command line using clang-cl should specify:
-Xclang -fmodules -Xclang -fmodules-cache-path=[modules cache dir] -Xclang -fmodule-map-file=[builtin .modulemap]
Let me know if this still doesn't work.

Thanks, Elad

Exposing _tzcnt_u32 in intrin.h seems like the easiest solution to me,
yes. Modules may also solve this problem elegantly, but requiring
people to modify command-lines just for MSVC compat in this instance
seems like a unnecessary hurdle.

-Nathan

This does indeed work, thanks for the pointer. But why? The
documentation for -fmodules-cache-path indicates that "Clang will
select a system-appropriate default" if this option is not provided.
So explicitly providing it shouldn't make any difference, right? Or
did the system-appropriate default perhaps have stale information in
it?

-Nathan

This does indeed work, thanks for the pointer. But why? The documentation for -fmodules-cache-path indicates that "Clang will select a system-appropriate default" if this option is not provided.
So explicitly providing it shouldn't make any difference, right? Or did the system-appropriate default perhaps have stale information in it?

If I'm not mistaken, the modules flags are not exposed to clang-cl driver and only available on the regular clang driver.
When you use clang (and not clang-cl), passing "-fmodules" to the driver will indeed get clang to select a system-appropriate default if "- fmodules-cache-path " is not provided. (You can try "clang -### -fmodules input.c" and see the commands the driver generates).
The clang-cl driver on the other hand, doesn't know what "-fmodules" is. So you are left with doing the Driver's job on your own in this case, choosing the proper options and passing them directly to 'clang cc1' using -Xclang. So in your example, simply no cache dir was provided to 'clang cc1'.

Thanks, Elad