[PROPOSAL] Reintroduce guards for Intel intrinsic headers

I've run into some code which no longer compiles because of two recent changes:

  41885d3 Update the intel intrinsic headers to use the target attribute support.
  695aff1 Use a define for per-file function attributes for the Intel intrinsic headers.

Specifically, one project defines its own SSE4.1 emulation routines when the real intrinsics aren't available. This is a problem because they've reused the names of the intrinsics. E.g;

#ifndef __SSE4_1__
#define _mm_extract_epi8(a_, ndx) ({ ... })
static inline __m128i _mm_blendv_epi8(__m128i a, __m128i b, __m128i mask) { ... }
...
#endif

SSE4.1 intrinsics now leak into the project when it's being compiled for targets without SSE4.1 support. Compilation fails with "error: redefinition ...".

When these changes were initially being discussed, I think our stance was that we shouldn't support code like this [1]. However, we should reconsider for the sake of avoiding breakage. AFAICT, we would need to revert just two types of changes:

In lib/Headers/__wmmintrin_aes.h:

-#if defined (__SSE4_2__) || defined (__SSE4_1__)
#include <smmintrin.h>
-#endif

In lib/Headers/smmintrin.h:

-#ifndef __SSE4_1__
-#error "SSE4.1 instruction set not enabled"
-#else

I don't see any downsides to reintroducing these guards. If everyone's OK with this, I can mail a patch in. The alternative is to have clients rewrite their emulation layers like this:

#ifdef __SSE4_1__
#define compat_mm_extract_epi8 _mm_extract_epi8
static inline __m128i combat_mm_blendv_epi8(__m128i a, __m128i b, __m128i mask) __attribute__((__target__(("sse4.1")))) {
  return _mm_blendv_epi8(a, b, mask);
}
...
#else /* OK, no native SSE 4.1. Define our own. */
#define compat_mm_extract_epi8(a_, ndx) ({ ... })
static inline __m128i compat_mm_blendv_epi8(__m128i a, __m128i b, __m128i mask) { ... }
...
#endif

... and then replace all calls to intrinsics with calls to the new compatibility routines. This seems like a lot of tedious work, and I'd love to help people avoid it :).

Let me know what you think!

vedant

[1] http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20150615/131192.html

I’m opposed to this. Going forward, I would really like target intrinsics to be available regardless of the current feature set, so users don’t need hacks like these.

I see two ways to do this with different tradeoffs:

  1. Diagnose missing target attributes when calling the intel intrinsics. I was surprised to find that we don’t already do this.
  2. We could support some automatic transfer of the target attribute to the caller when calling these intrinsics, but I worry that this is too confusing.

Implicitly setting a target attribute may block inlining that the user expected to happen, for example. Alternatively, there may be a dynamic cpuid check in the same function between SSE2 and AVX variants of the same algorithm, and now the SSE2 loop will unexpectedly use AVX instructions.

So we should probably settle with telling the user to add -msseNN or atribute((target((“sseNN”)))).

I’m opposed to this. Going forward, I would really like target intrinsics to be available regardless of the current feature set, so users don’t need hacks like these.

Agreed.

I see two ways to do this with different tradeoffs:

  1. Diagnose missing target attributes when calling the intel intrinsics. I was surprised to find that we don’t already do this.

Sorry. This is on my list of things to do.

  1. We could support some automatic transfer of the target attribute to the caller when calling these intrinsics, but I worry that this is too confusing.

We could, but it’s probably better to leave it as is.

-eric

I don’t see any downsides to reintroducing these guards.

Then you weren’t really paying attention to the point of removing them :slight_smile:

The idea is so that the headers can be used, with appropriate target attributes, for any code.

You could check whether or not the routines are defined and then define them. You could perhaps rewrite the code to use target attributes as well, but that’s quite a bit of work.

What kind of file is this? Keep in mind that things in the global namespace prefixed with an underscore is a reserved name for implementers as well. That would make this code not standards compliant as well.

-eric

I’m opposed to this. Going forward, I would really like target intrinsics to be available regardless of the current feature set, so users don’t need hacks like these.

Agreed.

I see two ways to do this with different tradeoffs:

  1. Diagnose missing target attributes when calling the intel intrinsics. I was surprised to find that we don’t already do this.

Sorry. This is on my list of things to do.

FWIW the bug on the warning is:

https://llvm.org/bugs/show_bug.cgi?id=24125

and it really is on my list. :slight_smile:

-eric

The actual C++ rules are that any name with double underscores is reserved,
and names beginning with an underscore followed by a capital letter. So,
the Intel intrinsics are *not* in the implementer's namespace, but that was
probably a mistake.

I'm sympathetic to users who are probably implementing a compatibility
layer here and don't want to write their own intrinsic wrappers, but I
think the right tradeoff is probably to fix the code.

I was pretty sure that what you’re quoting is in any namespace, but in the global namespace it’s what I said? I’d have to double check.

Agreed here.

-eric

I don't see any downsides to reintroducing these guards.

Then you weren't really paying attention to the point of removing them :slight_smile:

The idea is so that the headers can be used, with appropriate target attributes, for any code.

Right, I thought about this but wasn't sure if there were benefits to having symbols available for an unsupported target.

I.e, is there some reason a project might want to include the header for SSE4 intrinsics if it can't use any of those symbols?

What kind of file is this? Keep in mind that things in the global namespace prefixed with an underscore is a reserved name for implementers as well. That would make this code not standards compliant as well.

It's a utility header in a C project.

I'm sympathetic to users who are probably implementing a compatibility layer here and don't want to write their own intrinsic wrappers, but I think the right tradeoff is probably to fix the code.

Ok. It seems like the consensus is that Eric's patch does the right thing -- and I actually agree with that.

I'm just not 100% convinced that removing the header guards was necessary (which, I admit, could just be due to a lack of understanding on my part).

I checked with gcc trunk and they've taken the same approach, so at least it'll all be consistent.

vedant

What kind of file is this? Keep in mind that things in the global
namespace prefixed with an underscore is a reserved name for implementers
as well. That would make this code not standards compliant as well.

The actual C++ rules are that any name with double underscores is
reserved, and names beginning with an underscore followed by a capital
letter. So, the Intel intrinsics are *not* in the implementer's namespace,
but that was probably a mistake.

There's a second clause in 17.6.4.3.2\1:

"Each name that begins with an underscore is reserved to the implementation
for use as a name in the global namespace" (the other names you mentioned
are reserved for any use - so they can be used for macros, etc too - these
one's cannot)

I don’t see any downsides to reintroducing these guards.

Then you weren’t really paying attention to the point of removing them :slight_smile:

The idea is so that the headers can be used, with appropriate target attributes, for any code.

Right, I thought about this but wasn’t sure if there were benefits to having symbols available for an unsupported target.

I.e, is there some reason a project might want to include the header for SSE4 intrinsics if it can’t use any of those symbols?

I put a code snippet for something to do in the commit, but the general idea is that you can compile a function for a specific target with subtarget features and use the target attribute to add subtarget features and you’ll want to be able to use the intrinsics at the same time. It won’t work if you block them at the preprocessor level.

I’m just not 100% convinced that removing the header guards was necessary (which, I admit, could just be due to a lack of understanding on my part).

Did the above help?

-eric

What kind of file is this? Keep in mind that things in the global
namespace prefixed with an underscore is a reserved name for implementers
as well. That would make this code not standards compliant as well.

The actual C++ rules are that any name with double underscores is
reserved, and names beginning with an underscore followed by a capital
letter.

Also, any name starting with an underscore _in the global namespace_.
[global.names]/2, "Each name that begins with an underscore is reserved to
the implementation for use as a name in the global namespace."

So, the Intel intrinsics are *not* in the implementer's namespace, but
that was probably a mistake.

Given [global.names]/2, they are in the implementer's namespace (presumably
as intended).

-- James

>
>> I don't see any downsides to reintroducing these guards.
>
> Then you weren't really paying attention to the point of removing them
:slight_smile:
>
> The idea is so that the headers can be used, with appropriate target
attributes, for any code.

Right, I thought about this but wasn't sure if there were benefits to
having symbols available for an unsupported target.

I.e, is there some reason a project might want to include the header for
SSE4 intrinsics if it can't use any of those symbols?

I put a code snippet for something to do in the commit, but the general
idea is that you can compile a function for a specific target with
subtarget features and use the target attribute to add subtarget features
and you'll want to be able to use the intrinsics at the same time. It won't
work if you block them at the preprocessor level.

Sorry if this is a stupid question, but would it make sense to gate this
behind a flag? Breaking user code is bad, bad news. This target attribute
stuff is pretty niche, so it would make sense to have it be opt-in.

Or is this how GCC/ICC have always done it? I would expect user code to not
be breaking if that were the case though.

-- Sean Silva

This is already non-standards compliant code :slight_smile:

I realize that seems like an easy handwave here, but in this case I don’t really want to support someone redefining things in our headers this way and expect it to work.

This code is likely Apple Internal and so wouldn’t be compiled with gcc or icc.

gcc uses a pragma for this sort of thing. I’m using an attribute. icc has something slightly different, but the same general idea.

-eric

I don't see any downsides to reintroducing these guards.

Then you weren't really paying attention to the point of removing them :slight_smile:

The idea is so that the headers can be used, with appropriate target attributes, for any code.

Right, I thought about this but wasn't sure if there were benefits to having symbols available for an unsupported target.

I.e, is there some reason a project might want to include the header for SSE4 intrinsics if it can't use any of those symbols?

I put a code snippet for something to do in the commit, but the general idea is that you can compile a function for a specific target with subtarget features and use the target attribute to add subtarget features and you'll want to be able to use the intrinsics at the same time. It won't work if you block them at the preprocessor level.

Sorry if this is a stupid question, but would it make sense to gate this behind a flag? Breaking user code is bad, bad news. This target attribute stuff is pretty niche, so it would make sense to have it be opt-in.

Or is this how GCC/ICC have always done it? I would expect user code to not be breaking if that were the case though.

ICC considers all intrinsics to be available in any environment, regardless of the compiler switches. So, you can be generating code assuming you are targeting a plain old Pentium Processor and use AVX512 intriniscs (for example). We specifically choose this model, and consider that if such an intrinsic is used, it implies that the user has done the necessary checks to ensure that the processor they are running on, supports the intrinsics in use in whatever context that happens to be. And we can propogate that “cpu assertion” around a bit in the control-flow-graph. We felt this model was necessary to allow code to be developed which did the following:

if (some condition that is only set for a certain processor) {
   // Use processor specific intrinsics to support optimized code for some algorithm based on the dynamic check above.
}

My understanding is that neither gcc nor clang/LLVM really support this type of usage. Because in order to use say an AVX intrinsic,
the whole module that the intrinsic is used in needs to be compiled with the compiler switch targeting AVX, potentially allowing AVX instructions outside the “safe” area. Or for some intrinsics (if you get around of ifdef guards), you simply will lower them to a set of instructions that implements a similar operation, but without using the wider vector the user really intended.

Kevin B. Smith

-- Sean Silva

I'm just not 100% convinced that removing the header guards was necessary (which, I admit, could just be due to a lack of understanding on my part).

Did the above help?

-eric

ICC considers all intrinsics to be available in any environment, regardless of the compiler switches. So, you can be generating code assuming you are targeting a plain old Pentium Processor and use AVX512 intriniscs (for example). We specifically choose this model, and consider that if such an intrinsic is used, it implies that the user has done the necessary checks to ensure that the processor they are running on, supports the intrinsics in use in whatever context that happens to be. And we can propogate that “cpu assertion” around a bit in the control-flow-graph. We felt this model was necessary to allow code to be developed which did the following:

if (some condition that is only set for a certain processor) {

// Use processor specific intrinsics to support optimized code for some algorithm based on the dynamic check above.

}

My understanding is that neither gcc nor clang/LLVM really support this type of usage. Because in order to use say an AVX intrinsic,

the whole module that the intrinsic is used in needs to be compiled with the compiler switch targeting AVX, potentially allowing AVX instructions outside the “safe” area. Or for some intrinsics (if you get around of ifdef guards), you simply will lower them to a set of instructions that implements a similar operation, but without using the wider vector the user really intended.

We (recently) and gcc (in 4.4?) moved away from this and more to a method that you’ve got. We both do it on a function by function basis rather than in code blocks.

-eric

ICC considers all intrinsics to be available in any environment, regardless of the compiler switches. So, you can be generating code assuming you are targeting a plain old Pentium Processor and use AVX512 intriniscs (for example). We specifically choose this model, and consider that if such an intrinsic is used, it implies that the user has done the necessary checks to ensure that the processor they are running on, supports the intrinsics in use in whatever context that happens to be. And we can propogate that “cpu assertion” around a bit in the control-flow-graph. We felt this model was necessary to allow code to be developed which did the following:

if (some condition that is only set for a certain processor) {
   // Use processor specific intrinsics to support optimized code for some algorithm based on the dynamic check above.
}

My understanding is that neither gcc nor clang/LLVM really support this type of usage. Because in order to use say an AVX intrinsic,
the whole module that the intrinsic is used in needs to be compiled with the compiler switch targeting AVX, potentially allowing AVX instructions outside the “safe” area. Or for some intrinsics (if you get around of ifdef guards), you simply will lower them to a set of instructions that implements a similar operation, but without using the wider vector the user really intended.

We (recently) and gcc (in 4.4?) moved away from this and more to a method that you've got. We both do it on a function by function basis rather than in code blocks.

-eric

I noticed the changes that you had made to the header files Eric. I like the change :relaxed: and added flexibility it allows for the user.

Kevin B. Smith

-- Sean Silva

I'm just not 100% convinced that removing the header guards was necessary (which, I admit, could just be due to a lack of understanding on my part).

Did the above help?

-eric

Ah ok, I think I understand. If we want the extra granularity, we can't block off some of the symbols in the preprocessor because some function could need them.

Sean's suggestion of putting this behind a flag sounds nice, but the details are hairy. We might have to provide a separate set of headers for people who want the feature guards.. and it's not clear whether the flag would be gcc-compatible.

vedant

I'm opposed to this. Going forward, I would really like target intrinsics to be available regardless of the current feature set, so users don't need hacks like these.

I see two ways to do this with different tradeoffs:
1. Diagnose missing target attributes when calling the intel intrinsics. I was surprised to find that we don't already do this.
2. We could support some automatic transfer of the target attribute to the caller when calling these intrinsics, but I worry that this is too confusing.

Regarding automatic transfer of the target attribute. It seems like something like this:

static __inline __m256
__attribute__((__always_inline__, __nodebug__))
_mm256_add_ps(__m256 __a, __m256 __b)
{
  __builtin_assume(__has_feature(FEATURE_AVX));
  return __a + __b;
}

might be a good way to represent this. It has the nice property that the __builtin_assume only applies at the exact point in execution where
it occurs. You could probably add something like this as a small addition to what Eric has already done with the target attribute, or you could try to make inlining of something with the target attribute automatically apply a target __builtin_assume property.

The difficulty is defining how the semantics of the different target properties have to be retained during optimization and code generation.
I’m sure this difficulty is why the granularity that Eric is working on is at the routine level.

Kevin

Implicitly setting a target attribute may block inlining that the user expected to happen, for example. Alternatively, there may be a dynamic cpuid check in the same function between SSE2 and AVX variants of the same algorithm, and now the SSE2 loop will unexpectedly use AVX instructions.

So we should probably settle with telling the user to add -msseNN or __atribute__((target(("sseNN")))).

I've run into some code which no longer compiles because of two recent changes:

  41885d3 Update the intel intrinsic headers to use the target attribute support.
  695aff1 Use a define for per-file function attributes for the Intel intrinsic headers.

Specifically, one project defines its own SSE4.1 emulation routines when the real intrinsics aren't available. This is a problem because they've reused the names of the intrinsics. E.g;

#ifndef __SSE4_1__
#define _mm_extract_epi8(a_, ndx) ({ ... })
static inline __m128i _mm_blendv_epi8(__m128i a, __m128i b, __m128i mask) { ... }
...
#endif

SSE4.1 intrinsics now leak into the project when it's being compiled for targets without SSE4.1 support. Compilation fails with "error: redefinition ...".

When these changes were initially being discussed, I think our stance was that we shouldn't support code like this [1]. However, we should reconsider for the sake of avoiding breakage. AFAICT, we would need to revert just two types of changes:

In lib/Headers/__wmmintrin_aes.h:

-#if defined (__SSE4_2__) || defined (__SSE4_1__)
#include <smmintrin.h>
-#endif

In lib/Headers/smmintrin.h:

-#ifndef __SSE4_1__
-#error "SSE4.1 instruction set not enabled"
-#else

I don't see any downsides to reintroducing these guards. If everyone's OK with this, I can mail a patch in. The alternative is to have clients rewrite their emulation layers like this:

#ifdef __SSE4_1__
#define compat_mm_extract_epi8 _mm_extract_epi8
static inline __m128i combat_mm_blendv_epi8(__m128i a, __m128i b, __m128i mask) __attribute__((__target__(("sse4.1")))) {
  return _mm_blendv_epi8(a, b, mask);
}
...
#else /* OK, no native SSE 4.1. Define our own. */
#define compat_mm_extract_epi8(a_, ndx) ({ ... })
static inline __m128i compat_mm_blendv_epi8(__m128i a, __m128i b, __m128i mask) { ... }
...
#endif

... and then replace all calls to intrinsics with calls to the new compatibility routines. This seems like a lot of tedious work, and I'd love to help people avoid it :).

Let me know what you think!

vedant

[1] http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20150615/131192.html

The problem that we reported in PR24125 is fundamentally that for intrinsics implemented as macros (rather than inline functions) the symptom for “you didn’t set the right target” is a backend crash. For those intrinsics there’s no function to attach the attribute to. I was thinking about re-introducing the #ifdefs for those cases, so we’d be going back to the “undefined identifier” diagnostic from the frontend. But I’d be happier with some other solution that worked more smoothly for macros.

–paulr

The macro intrinsics are pretty gnarly. I'd love it if we could come up
with a principled solution to the general problem of always_inline
functions that need to propagate constant parameters into their bodies.

Barring a solution to the general problem, we could at least address
PR24125 by having all the macros call an artificial empty inline function
with __attribute__((target)).