The intrinsics headers (especially avx512) are too big. What to do about it?

Hi,

on Windows, C++ system headers like e.g. end up pulling in intrin.h. clang’s intrinsic headers are very large.

If you take a cc file containing just #include <string> and run that through the preprocessor with cl /P test.cc and clang-cl /P test.cc, the test.I file generated by clang-cl is 1.7MB while the one created by cl.exe is 0.7MB. This is solely due to clang’s intrin.h expanding to way more stuff.

The biggest offenders are avx512vlintrin.h, avx512fintrin.h, avx512vlbwintrin.h which add up to 657kB already. Before r239883, we only included avx headers if AVX512F etc was defined. This is currently never the case in practice. Later (r243394 r243402 r243406 and more), the avx headers got much bigger.

Parsing all this code takes time – removing the avx512 includes from immintrin.h locally makes compiling a file containing just the header 0.25s faster (!), and building all of v8 gets 6% faster, just from not including the avx512 headers.

What can we do about this? Since avx512 is new, maybe they could be not part of immintrin.h? Or we could re-introduce

#if !__has_feature(modules) && defined(AVX512BW)

include guards in immintrin.h. This would give us a speed win immediately without drawbacks as far as I can see, but in a few years when people start compiling with /arch:avx512 that’d go away again. (Then again, by then, modules are hopefully commonly available. cl.exe doesn’t have an /arch:avx512 switch yet, so this is probably several years away from happening.)

Comments? Is it feasible to require that people who want to use avx512 include a new header instead of immintrin.h? Else, does anyone have a better idea other than reintroducing the #ifdefs, augmented with the module check?

Thanks,
Nico

I think our approach to the mmintrin headers doesn't scale. We're
creating the windows.h of intel intrinsics in immintrin.h.

When they were first created, a large percentage of the intrinsics
were mapping from hyper-specific instruction names to generic vector
math operations like this:

static __inline__ __m128 __DEFAULT_FN_ATTRS
_mm_add_ps(__m128 __a, __m128 __b) { return __a + __b; }

This made a lot of sense at the time, because we could just write come
C and not worry about teaching clang and LLVM about every Intel
intrinsic under the sun.

From looking at the avx512 headers, it seems this is no longer the

case. Now we are mostly mapping from _mm_* intrinsic to
__builtin_ia32_ function.

If this continues to be the case going forward, then I think we should
make the _mm* intrinsics into compiler builtins like the
__builtin_ia32 functions. It also avoids the need for those ugly
forwarding macros for intrinsics that take arguments that must be
constant.

The _mm_* builtins should only be available if the user includes
<immintrin.h>. We can replace the contents of that file with a pragma
that just says "enable all intel intrinsics".

From: "Reid Kleckner via cfe-dev" <cfe-dev@lists.llvm.org>
To: "Nico Weber" <thakis@chromium.org>, "David Majnemer" <majnemer@google.com>
Cc: "Elena Demikhovsky" <elena.demikhovsky@intel.com>, "cfe-dev" <cfe-dev@lists.llvm.org>, "asaf badouh"
<asaf.badouh@intel.com>, "Michael zuckerman" <Michael.zuckerman@intel.com>
Sent: Thursday, May 12, 2016 6:10:06 PM
Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are too big. What to do about it?

I think our approach to the mmintrin headers doesn't scale. We're
creating the windows.h of intel intrinsics in immintrin.h.

When they were first created, a large percentage of the intrinsics
were mapping from hyper-specific instruction names to generic vector
math operations like this:

static __inline__ __m128 __DEFAULT_FN_ATTRS
_mm_add_ps(__m128 __a, __m128 __b) { return __a + __b; }

This made a lot of sense at the time, because we could just write
come
C and not worry about teaching clang and LLVM about every Intel
intrinsic under the sun.

From looking at the avx512 headers, it seems this is no longer the
case. Now we are mostly mapping from _mm_* intrinsic to
__builtin_ia32_ function.

If this continues to be the case going forward,

Indeed. It is not clear to me, however, that this situation is desirable. We had a general policy that our intrinsics headers should generate generic IR whenever possible, and if we've strayed from that, we should discuss that first.

-Hal

From: cfe-dev [mailto:cfe-dev-bounces@lists.llvm.org] On Behalf Of Hal
Finkel via cfe-dev
Sent: Thursday, May 12, 2016 4:14 PM
To: Reid Kleckner
Cc: asaf badouh; David Majnemer; Michael zuckerman; cfe-dev; Elena
Demikhovsky
Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are too
big. What to do about it?

> From: "Reid Kleckner via cfe-dev" <cfe-dev@lists.llvm.org>
> To: "Nico Weber" <thakis@chromium.org>, "David Majnemer"
<majnemer@google.com>
> Cc: "Elena Demikhovsky" <elena.demikhovsky@intel.com>, "cfe-dev" <cfe-
dev@lists.llvm.org>, "asaf badouh"
> <asaf.badouh@intel.com>, "Michael zuckerman"
<Michael.zuckerman@intel.com>
> Sent: Thursday, May 12, 2016 6:10:06 PM
> Subject: Re: [cfe-dev] The intrinsics headers (especially avx512) are
too big. What to do about it?
>
> I think our approach to the mmintrin headers doesn't scale. We're
> creating the windows.h of intel intrinsics in immintrin.h.
>
> When they were first created, a large percentage of the intrinsics
> were mapping from hyper-specific instruction names to generic vector
> math operations like this:
>
> static __inline__ __m128 __DEFAULT_FN_ATTRS
> _mm_add_ps(__m128 __a, __m128 __b) { return __a + __b; }
>
> This made a lot of sense at the time, because we could just write
> come
> C and not worry about teaching clang and LLVM about every Intel
> intrinsic under the sun.
>
> From looking at the avx512 headers, it seems this is no longer the
> case. Now we are mostly mapping from _mm_* intrinsic to
> __builtin_ia32_ function.
>
> If this continues to be the case going forward,

Indeed. It is not clear to me, however, that this situation is desirable.
We had a general policy that our intrinsics headers should generate
generic IR whenever possible, and if we've strayed from that, we should
discuss that first.

If you look at the history of some of the headers, they used to map the
intrinsic function names to builtins. As codegen got smarter over time,
many of these were converted to generic C, and the builtins could go away.
A *lot* of the intrinsics didn't start out as generic C.

(I personally spent probably months of my life merging the evolution of
the intrinsics and builtins and tablegen instruction definitions for a
variety of X86 instruction subsets into our local tree.)

-Hal

> then I think we
> should
> make the _mm* intrinsics into compiler builtins like the
> __builtin_ia32 functions. It also avoids the need for those ugly
> forwarding macros for intrinsics that take arguments that must be
> constant.

If you do that, then there is less motivation to make codegen smarter.
--paulr

Sorry if this is a stupid question, but do the windows intrinsic headers actually contain the same contents as clang’s? (e.g. maybe the windows ones don’t cover all the ISA’s that clang’s do).

– Sean Silva

This old discussion may cover some of this as well? I also thought I
remember something more recent around this..
http://clang-developers.42468.n3.nabble.com/PROPOSAL-Reintroduce-guards-for-Intel-intrinsic-headers-td4046979.html

A couple of points:

  1. Definitely agree with Hal that these intrinsics really shouldn’t be mapping to builtins. This is something I’m pretty frustrated about the direction of AVX-512 support in Clang and LLVM. We really need generic vector IR to lower cleanly into these instructions.

  2. Reid, you specifically advocated for not having the set of intrinsics available based on particular feature sets. ;] But I agree there seems to be a scalability problem here.

  3. I think a lot of the scalability problem is that very basic, non-vector code patterns, require Intrin.h on Windows and pull in ALL the vector intrinsics. =/ It’d be really good to try to fix that.

  4. AVX-512 has made this incredibly worse than any previous ISA extension. It used to be we had the product of (operation * operand-type) intrinsics. This is already pretty bad. Now we have (operation * operand-type * 4) because we have 4 masking variants. So it seems Intel has just made a really unfortunate API choice by forcing every permutation of these things to get a different name and thus a different intrinsic in a header file. =/ And sadly that too is probably too late to walk back.

I wonder if we could at least initially address this by providing very limited “builtin” modules for truly builtin headers that don’t touch any system headers, and actually always use the modules approach for these headers, right out of the box.

Sorry if this is a stupid question, but do the windows intrinsic headers actually contain the same contents as clang’s?

As far as I can tell (from looking at https://msdn.microsoft.com/en-us/library/hh977023.aspx and comparing to clang’s headers), yes. MSVC doesn’t have the avx512 intrinsics yet, but that’s probably only because they’re new.

I had hoped that I could not include all of x86intrin.h in intrin.h, but that page says “The intrin.h header includes both immintrin.h and ammintrin.h for simplicity.”

This old discussion may cover some of this as well?

Ah thanks, yes, sounds like there are reasons for not putting the includes back behind ifdefs. The thread doesn’t really mention the reasons, and since clang doesn’t implement full multiversioning yet I’m unable to guess at the reasons – but it sounds like people don’t want to re-add the arch ifdefs. Ok, I’ll send a patch to add them back ifdef _MSC_VER only – there should be no drawback to that, and it stops the bleeding in the case where it’s worst (with Microsoft headers).

Going forward, we’ll have to teach clang more about at least some intrinsics for #pragma intrin (PR19898), which might end up helping for this too.

I also reached out to STL at Microsoft, he said he’ll try to look into including an “intrin0.h” header in the next major version of MSVC which would only declare a small set of intrinsics instead of all of them (no promises, of course).

People working on avx512, I’d be curious to hear your perspective on this, as well as your reply to Chandler’s points.

Thanks,
Nico

Sorry if this is a stupid question, but do the windows intrinsic headers actually contain the same contents as clang’s?

As far as I can tell (from looking at https://msdn.microsoft.com/en-us/library/hh977023.aspx and comparing to clang’s headers), yes. MSVC doesn’t have the avx512 intrinsics yet, but that’s probably only because they’re new.

I had hoped that I could not include all of x86intrin.h in intrin.h, but that page says “The intrin.h header includes both immintrin.h and ammintrin.h for simplicity.”

This old discussion may cover some of this as well?

Ah thanks, yes, sounds like there are reasons for not putting the includes back behind ifdefs. The thread doesn’t really mention the reasons, and since clang doesn’t implement full multiversioning yet I’m unable to guess at the reasons – but it sounds like people don’t want to re-add the arch ifdefs. Ok, I’ll send a patch to add them back ifdef _MSC_VER only – there should be no drawback to that, and it stops the bleeding in the case where it’s worst (with Microsoft headers).

It implements enough multiversioning to make it worthwhile to have them, I don’t know what you’re confused about here. Why do we want to make this platform specific? If you wanted to match MSVC I guess you could just turn them off for windows as an alternate solution?

-eric

> Sorry if this is a stupid question, but do the windows intrinsic
headers actually contain the same contents as clang's?

As far as I can tell (from looking at
https://msdn.microsoft.com/en-us/library/hh977023.aspx and comparing to
clang's headers), yes. MSVC doesn't have the avx512 intrinsics yet, but
that's probably only because they're new.

I had hoped that I could not include all of x86intrin.h in intrin.h,
but that page says "The intrin.h header includes both immintrin.h and
ammintrin.h for simplicity."

> This old discussion may cover some of this as well?

Ah thanks, yes, sounds like there are reasons for not putting the
includes back behind ifdefs. The thread doesn't really mention the reasons,
and since clang doesn't implement full multiversioning yet I'm unable to
guess at the reasons -- but it sounds like people don't want to re-add the
arch ifdefs. Ok, I'll send a patch to add them back ifdef _MSC_VER only --
there should be no drawback to that, and it stops the bleeding in the case
where it's worst (with Microsoft headers).

It implements enough multiversioning to make it worthwhile to have them, I
don't know what you're confused about here.

I didn't mean to question this point, I just don't understand it. Can you
give an example where it's useful? I'm sure there is one, I just can't
think of one.

Why do we want to make this platform specific? If you wanted to match MSVC
I guess you could just turn them off for windows as an alternate solution?

Yes, that's what I meant with the _MSC_VER check.

Sure, the programmer has to write their own dispatch, but it’ll allow you to include variously target optimized versions of the same function in the same file.

The equivalent linux side of things is:

void my_avx_function() attribute((target(“avx”)))
void my_nonavx_function()

if (__builtin_cpu_supports(“avx”))
my_avx_function()
else
my_nonavx_function()

and you can keep both implementations in the same file and don’t have to worry about things like command line options being different and causing all sorts of haywire.

I meant just turn off the avx512 headers, not sure if that’s what you meant. We should probably look at the lexing and parsing code to see what can be sped up here.

-eric

> Sorry if this is a stupid question, but do the windows intrinsic
headers actually contain the same contents as clang's?

As far as I can tell (from looking at
https://msdn.microsoft.com/en-us/library/hh977023.aspx and comparing
to clang's headers), yes. MSVC doesn't have the avx512 intrinsics yet, but
that's probably only because they're new.

I had hoped that I could not include all of x86intrin.h in intrin.h,
but that page says "The intrin.h header includes both immintrin.h and
ammintrin.h for simplicity."

> This old discussion may cover some of this as well?

Ah thanks, yes, sounds like there are reasons for not putting the
includes back behind ifdefs. The thread doesn't really mention the reasons,
and since clang doesn't implement full multiversioning yet I'm unable to
guess at the reasons -- but it sounds like people don't want to re-add the
arch ifdefs. Ok, I'll send a patch to add them back ifdef _MSC_VER only --
there should be no drawback to that, and it stops the bleeding in the case
where it's worst (with Microsoft headers).

It implements enough multiversioning to make it worthwhile to have them,
I don't know what you're confused about here.

I didn't mean to question this point, I just don't understand it. Can you
give an example where it's useful? I'm sure there is one, I just can't
think of one.

Sure, the programmer has to write their own dispatch, but it'll allow you
to include variously target optimized versions of the same function in the
same file.

The equivalent linux side of things is:

void my_avx_function() __attribute__((__target__("avx")))
void my_nonavx_function()

...

if (__builtin_cpu_supports("avx"))
   my_avx_function()
else
   my_nonavx_function()

and you can keep both implementations in the same file and don't have to
worry about things like command line options being different and causing
all sorts of haywire.

Ah, thanks, I didn't know that worked :slight_smile: (I played with it a bit and found
PR27779)

Why do we want to make this platform specific? If you wanted to match
MSVC I guess you could just turn them off for windows as an alternate
solution?

Yes, that's what I meant with the _MSC_VER check.

I meant just turn off the avx512 headers, not sure if that's what you
meant. We should probably look at the lexing and parsing code to see what
can be sped up here.

I turned it off for all headers for now in r269675. I agree that we should
look at lexing and parsing speed, and also consider things like
tablegen'ing intinsics. Until then, making most compiles faster seems like
a better tradeoff on Windows.

As said above, I'm curious to hear from the people working on avx512 :slight_smile:

Indeed. It is not clear to me, however, that this situation is desirable. We

   >had a general policy that our intrinsics headers should generate generic IR
   >whenever possible, and if we've strayed from that, we should discuss that
   >first.

Let's take a look at this intrinsic:

static __inline__ __m512i __DEFAULT_FN_ATTRS
_mm512_mask_add_epi64 (__m512i __W, __mmask8 __U, __m512i __A, __m512i __B)
{
  return (__m512i) __builtin_ia32_paddq512_mask ((__v8di) __A,
             (__v8di) __B,
             (__v8di) __W,
             (__mmask8) __U);
}

The IR that should be generated:
%C = add <8 x double> %B, %A
%res = select <8 x i1> %mask, <8 x double> %C, %W

If we parse __builtin_ia32_paddq512_mask in CGBuiltin.cpp and generate IR there, will it help?

(Please do not consider my question as a general Intel solution. I just want to understand the problem.)

Thanks.

- Elena

The bit I care most about is that adding `#include <intrin.h>` shouldn't
add megabytes of stuff to my translation unit.

Hve you discussed making immintrin.h more modular? It looks like many more
avx512 builtins keep landing, making this problem bigger and bigger. It'd
be good if I only had to pay for this if I explicitly included an avx512.h,
and even then it'd be nice if that wasn't one huge header, but several
smaller ones, so I only have to pay compile time for the bits I need.

We are still trying to find a suitable solution.

Keeping declarations only inside header files will save compile time.

In this case the implementation will be hidden inside clang.

Can somebody help me to estimate impact and complexity of this solution?

Thank you.

  • Elena

So have clang magically emit the generated code based on the intrinsic header? That’ll be a lot of typing, but ultimately shouldn’t be terrible. You’ll effectively turn the mm interface into __builtin as far as automatic recognition etc and I’m not sure we’d want to do that sort of thing.

-eric

I’m not sure we’d want to do that sort of thing.

Do you have any other suggestion?

  • Elena

I agree, that it’s not desirable thing in Clang, however it seems to me the lesser evil.

The long compilation time will hurt many projects, I’m not sure if it’s reasonable even for the ones trying to use AVX intrinsics intentionally.

Another issue is, that the “mm” prefixed identifiers are not reserved for the compiler - the C99 spec says that identifiers that start with two underscores or an underscore and a capital letter are reserved. So, Clang should recognize the “mm” prefix, check if the right header was indeed included, and then try to identify the x86 intrinsic. If this fails, the identifier should be considered a standard identifier.

Of course, there is the case of x86 intrinsics that should be compiled to pure LLVM IR, rather than LLVM intrinsic calls. I see two possible solutions:

  1. Make CGBuiltin.cpp/EmitX86BuiltinExpr generate the IR using the IR builder. This approach might be less intuitive, and may become very long as we change more and more intrinsics to pure LLVM IR

  2. Leave the intrinsics implemented in C language in the header, rather than making these “mm” builtins. Then, again, as we move more and more intrinsics to C representation, the header might get big and heavy again.

I think in the short term I would prefer the 2nd solution as for its simplicity. Any other ideas to overcome this issue?

Thanks

Guy Benyei

image001.png

As I said up the thread, I think the right way to solve this is with modules.

We have the infrastructure in clang to lazily load things like the intrinsics headers in a very efficient way. All we are missing is:

  1. The ability to enable this by default exclusively for the intrinsic headers. (or more generally for any subset of the builtin headers where we would like this behavior…)
  2. To build the actual module files for the builtin headers (much the way we generate some of them) as part of the build system

So far I’ve not seen any suggestions that really seem superior to this…

image001.png

image001.png

Modules could be a good solution for this issue, however I’m a bit concerned about some technical issues with keeping the modules in sync with the Clang compiler.

Building the actual module files as part of the build system doesn’t seem really hard, but still need to think about the location it should be installed (with the headers?), and need to make sure, that for any Clang we run we can find the suiting module files, or fall back to the headers approach.

We could add the module as some kind of resource to the executable. This way it couldn’t get out of sync. I would also add some way to disable this optimization entirely.

What do you think?

image001.png