[RFC] Design for AVX10 feature support

Background

Intel just disclosed AVX10 [spec, technical paper]. The TL;DR is that AVX10 is a vector ISA evolution that includes all the capabilities and features of the Intel AVX-512 ISA in a converged version that can run on both E-cores and P-cores.

To enable this feature in LLVM, we face some unique challenges and design choices compared to traditional ISA enablement. AVX10 is also a peculiar ISA in that no new instructions are introduced in the initial version; the work looks more like a re-organization of the AVX512 instructions in LLVM than the introduction of anything brand new.

Considering the large amount of re-organization work and the different preferences developers and users may have for each proposal, we are requesting comments before we complete the final implementation. We also welcome collaboration from the community on the re-organization.

Major challenges

One of the philosophies of AVX10 design is to provide distinct 256-bit and 512-bit capabilities for E-cores and P-cores respectively. But the current organization of AVX512 instructions assumes 512-bit registers are always usable.

The other one is that AVX10 is a full set of all the current AVX512 features (except for a few deprecated KNL features and AVX512_VP2INTERSECT), which means we have ~700 instructions, ~5000 intrinsics, and hundreds of places in C++ code checking predicates that might be affected.

New Options in Clang

  • -mavx10.x
    The initial version is AVX10.1, which includes all instructions in the major AVX512 features. These features cannot be enabled or disabled separately.
    Future versions will be numbered avx10.2, avx10.3, etc. All instructions in an earlier AVX10 version will be inherited by future versions.

  • -mavx10.x-256/-mavx10.x-512
    The default vector size is 256-bit if neither the -256 nor the -512 suffix is specified. These suffixes override the vector size to the given value; only the values 256 and 512 are supported in AVX10.
    If more than one of these options appears on the command line, the last vector size overrides the previous ones and the compiler will emit a warning about the usage.

Ambiguity when mixing AVX10 and AVX512 options

Because AVX10 and AVX512 control 512-bit instructions differently, there are two major ambiguities when AVX10 and AVX512 options are mixed.

  • -mavx10.1-256/512 + -mno-avx512xxx
    The combination can be interpreted either as invalid or as disabling the instructions that overlap between AVX10 and AVX512XXX.

  • -mavx10.1-256 + -mavx512xxx
    There’s no real target that supports only 256-bit AVX10 plus some AVX512 features. Since AVX10 is a converged ISA, the combination can be interpreted as keeping all instructions at 256-bit, promoting them all to 512-bit, or promoting only the overlapping instructions to 512-bit.

Our internal consensus from the GCC and LLVM support teams is that users are not encouraged to mix AVX10 and AVX512 options on the command line, and the compiler will emit a warning for these ambiguous cases, while the code generation might differ based on the implementation.

In GCC’s RFC, they chose to ignore -mno-avx512xxx in case 1) and to generate 512-bit instructions for -mavx512xxx in case 2).

In LLVM, the behavior depends on which design choice we select.

Design choices

1. Make AVX10 imply all related AVX512 features and exclude all ZMM and 64-bit mask instructions if the avx10-512bit feature is not set

Since we use a global avx10-512bit flag for all 512-bit related instructions, the legacy AVX512 features will be affected by the flag as well. In short, we cannot control 512-bit instructions for legacy AVX512 features when they are used together with AVX10.

In this choice, we make AVX10 dominate the AVX512 options in the driver. In other words, when used with AVX10 options, AVX512 options will always be ignored with a warning.

Pros: Easy to implement and very small code change

Cons: Cannot be consistent with GCC’s behavior

2. Split all existing AVX512 features into 2 parts: Scalar+Vec128+Vec256 and Vec512 only and make AVX10.1-256 imply the former

In this choice, we can make AVX10 and AVX512 independent, which means AVX10 and AVX512 options can be used together to turn related instructions on or off. For example, -mavx10-256 -mavx512bw will turn on all 256-bit instructions of AVX10 and the 512-bit instructions of AVX512BW, while -mavx10-512 -mno-avx512bw will turn off all instructions in AVX512BW and in features that imply AVX512BW. We can also match GCC’s behavior through modifications in the driver.

Pros: Can emulate GCC’s behavior when AVX10 is used with AVX512 options

Cons: A lot of effort to refactor all code involving AVX512 predicates

3. Add “ || hasAVX10()” logic to all current AVX512 predicates, excluding the 512-bit related parts, and still make AVX10-512 imply all related AVX512 features

In this choice, we will behave the same as GCC, i.e., -mno-avx512xxx will be ignored and 512-bit instructions will be generated for -mavx512xxx when they are used together with, e.g., -mavx10.1-256.

It also requires a lot of refactoring effort. Besides, there’s an obstacle: we cannot make intrinsics check for both AVX10 and AVX512 features. According to the front-end folks, it’s not easy to modify the front-end to do so. The workaround is to change all definitions to macros, which also needs a lot of refactoring effort.

Preference

I prefer design 1, because it requires less effort and less destructive change in the code, while inconsistency with GCC in ambiguous scenarios is not a problem. Here is the RFC patch: D157485

I may have missed some potential problems in the above evaluation. Please raise concerns or give your suggestions. Thanks in advance!

cc @RKSimon, @topperc, @e-kud, @nwg, @efriedma-quic


I need to think about this further, but having different behaviour from gcc is going to cause a lot of issues and user confusion.

There’s no discussion yet about the AVX512 headers and the defines / function attributes they use - how do you intend to adjust these?

I think the difference is minor enough in design 1. The only difference so far is that when specifying e.g., -mavx10(-256) -mavx512dq, GCC may generate 512-bit AVX512DQ instructions and allow use of its intrinsics while LLVM doesn’t.
But that options combination doesn’t make sense in reality. We won’t have a target that only supports 256-bit AVX10 while supporting AVX512DQ.
And GCC’s behaviour is also problematic if we see AVX10 instructions as a whole, i.e., 512-bit instructions should be enabled/disabled together rather than via sub-features.

What’s the discussion you are seeking? For design choice 1, there’s no need to modify the headers. The only work is to add a verifier in the FE to make sure 512-bit intrinsics are not usable for AVX10-256.
For design choice 2, we need to split the function attributes in the headers too, but it won’t be a big task since we define the attributes with macros. There are only 2~3 lines of modification in each header file.
For design choice 3, the AVX512 headers are an obstacle, which I discussed in the proposal.

I think the difference is minor enough in design 1. The only difference so far is that when specifying e.g., -mavx10(-256) -mavx512dq, GCC may generate 512-bit AVX512DQ instructions and allow use of its intrinsics while LLVM doesn’t.

Is it possible someone would want 256-bit for all operations except maybe a specific kernel?
How would something like function-specific target attributes work? (for example xxhash).

Since AVX10.1-512 is essentially a collection of existing AVX512 feature bits under a new name and with a new CPUID enumeration mechanism, and AVX10.1-256 is just “that, except without the actual 512 part”, ISTM there should be some way for users to compile code that will run on already-shipped AVX-512 processors as well as on future “AVX10.1-256” processors (which I guess are predicted to be the more common variant in the future).

Internally to LLVM, I think you could add a new predicate to all the 512-bit instructions for “has ZMM registers”. E.g. for vbmi2 instructions, the 512-bit instruction definitions in the tablegen would check hasZMM() && hasVBMI2(), while the 128/256-bit ones check hasVLX() && hasVBMI2() (unchanged).

And we can then expose that to the command-line: let users write something like: “-mavx512vl -mno-avx-zmm” [with the latter flag being newly-invented, defaulting to “-mavx-zmm”, but only having an effect if any avx512-based features are also enabled].

Once that works, the new AVX10.1-256 and AVX10.1-512 become simply a collection of other already-existing flags. Thinking of it that way might make the interaction between command-line options less mysterious.

So, we could say that -mavx10.1-512 is effectively equivalent to -mavx512f -mavx512cd -mavx512bw -mavx512dq -mavx512vbmi -mavx512ifma -mavx512vnni -mavx512bf16 -mavx512vpopcntdq -mavx512vbmi2 -mavx512bitalg -mavx512fp16, and that -mavx10.1-256 is equivalent to the same list, but with -mno-avx-zmm as well.

With that understanding, -mavx10.x-yyy -mno-avx512f is just as reasonable as -mavx512bf16 -mno-avx512f – in both cases, the end result is everything gets disabled.

There’s a choice to make, though: should all -mavx512xxx flags implicitly re-enable -mavx-zmm, or should they leave it untouched if it’s explicitly specified elsewhere on the command-line? Which way is chosen correspondingly implies either that -mavx10.1-256 -mavx512vbmi is equivalent to -mavx10.1-512 or else that it’s equivalent to -mavx10.1-256.

The latter (that -mno-avx-zmm/-mavx-zmm is unaffected by other -mavx512-xxx flags) seems potentially cleaner?


I don’t find anything special in your example, though I think I got your point.
The function-specific target attributes should be warned about as well when compiling with AVX10, especially AVX10-256.
Suppose you have a function using attribute avx512dq and you use -mavx10-256 on the command line. What code gen do you expect? What target can the code run on? Isn’t it only runnable on the non-existent target that only supports 256-bit AVX10 and AVX512DQ?

I don’t think so. A binary compiled with “AVX10.1-256” cannot run on 512-bit processors prior to Sapphire Rapids. And we want users to completely forget the AVX512XXX things in the future. We even warn when -mavx512xxx is used together with -mavx10.

hasZMM() is just another name for hasAVX10_512BIT().

The target is to control AVX10-256 and AVX10-512. There’s no value in controlling the legacy -mavx512xxx options. We add control for AVX10 because -256 and -512 are distinct targets. On the contrary, there’s no real target that matches -mavx512vl -mno-avx-zmm, and you can always use -mavx512vl -mprefer-vector-width=256 for the same behavior.

If we ignore the change to the legacy -mavx512xxx options, the proposal here is identical to the design 1) I proposed. It won’t close the minor gap with GCC either, while needing more code change. So I don’t think we should go this way.

Well from earlier

“The other one is that AVX10 is a full set of all the current AVX512 features (except for a few deprecated KNL features and AVX512_VP2INTERSECT), which means we have ~700 instructions, ~5000 intrinsics, and hundreds of places in C++ code checking predicates that might be affected.”

So while a user might be compiling a project with AVX10 and specify 256-bit, they might also be including libraries that have hand-tuned portions. I don’t think it’s really fair for us to potentially de-optimize those routines (or just fail to compile them) because of the AVX10 flag. Using XXHASH as an example again, which is used by thousands of projects (including LLVM), it seems reasonable to expect that by enabling AVX10 when compiling LLVM, I wouldn’t be unduly slowing down any portion of the code.

Yes, but unless I misunderstand, the main reason it cannot run on earlier architectures is because AVX10.1-256 includes the “FP16” instructions, which aren’t implemented before Sapphire Rapids? Did I miss some further fundamental incompatibility?

I mean, sure, maybe everyone would like to pretend AVX512 never happened, but it’s shipping on a bunch of hardware already, and AVX10 isn’t yet…

Unless I’m incorrect about the technical feasibility, it really seems like it’s desirable to be able to compile a binary which can run on Intel Ice Lake, Intel Tiger Lake, Intel Rocket Lake, AMD Zen 4, Intel Sapphire Rapids, AND whatever future AVX10.1-256 CPUs.

Won’t there be such a target, when 256-bit AVX10 CPUs are released?

and you can always use -mavx512vl -mprefer-vector-width=256 for the same behavior.

That’s not the same thing. -mprefer-vector-width tells LLVM to prefer to use 256-bit operations for autovectorization, but it can still emit ZMM instructions.

What are the hand-tuned portions you refer to? The function target attribute features are appended to the command-line features rather than limiting them. Considering AVX10 is a superset of the AVX512 features, you neither lose the ability to tune for a specific feature, nor lose the tuning if it’s available for AVX10.
The only problem is when you manually call 512-bit intrinsics in such a case. As I discussed above, there’s no such target in reality. You will get a runtime crash anyway.

It’s probably true for skylake-server, but “earlier architectures” doesn’t include skylake-server itself. See the feature inheritance relationship below:


And we have already optimized general operations with FP16 instructions like vmovw, which makes binaries built with AVX10 not runnable on skylake-server and earlier architectures.

No, the desired behavior is not to run on old architectures but on future ones.
The AVX512 options can still be used for old architectures or even new AVX10.1-512 CPUs, but using them for new CPUs is not encouraged.

Not as far as I can tell. A 256-bit AVX10 CPU won’t enumerate any AVX512 feature in its CPUID.

I feel like we’re not communicating well. Certainly code compiled with -mavx10.1-256 cannot be expected to run on older processors. That’s typical with a new generation of CPU adding new command-line flags to enable support.

But the unusual situation here is that newer processors will drop a feature that’s supported by processors today (the 512-bit ZMM registers that are part of AVX512), yet keep the rest of the AVX512 instructions which were introduced at the same time, without a separate feature-bit.

What I’m trying to say is there ought to be a way to compile code which runs both on an N-generation-old CPU (e.g. a CPU that supports only AVX512F, AVX512CD, AVX512BW, AVX512DQ, AVX512VL extensions, but not all the rest), and be forward-compatible to future CPUs which support AVX10.1-256 (and have no 512 bit registers).

In order for that to work, there needs to be some way to specify this intersection of features to the compiler – but there is no such way today, or in your proposal. It has to be possible to explicitly specify which AVX512 extensions to enable (in order to retain compatibility with older CPU generations), yet at the same time not enable the use of ZMM registers (to retain compatibility with newer AVX10-256 generations).

It seems to me that with appropriate compiler support, “AVX512 restricted to 256-bit-registers” ought to be a generally useful target ABI: widely supported on current hardware, and compatible with a presumed future of widely-deployed 256-bit-only CPUs.

On the contrary, there’s no real target that matches -mavx512vl -mno-avx-zmm

Won’t there be such a target, when 256-bit AVX10 CPUs are released?

Not as far as I can tell. A 256-bit AVX10 CPU won’t enumerate any AVX512 feature in its CPUID.

Certainly such a CPU cannot enable AVX512 feature bits in the CPUID, because the AVX512 CPUID feature bits are defined to mean that the CPU does support the 512-bit versions of the instructions, and this CPU does not.

But the CPU will support the 128-bit and 256-bit instructions from AVX512-F AVX512-VL AVX512-CD AVX512-BW AVX512-DQ AVX512-VBMI AVX512-IFMA AVX512-VNNI AVX512-BF16 AVX512-VPOPCNTDQ AVX512-VBMI2 AVX512-BITALG and AVX512-FP16.

So if you create a binary which restricts itself to only 256-bit AVX512 instructions from some subset of the above, it should run on current hardware which supports the chosen AVX512 features, and also on a future AVX10.1-256 CPU, right? If you want to do runtime feature-detection for “256-bit VPOPCNTDQ support”, you just have to check AVX512-VPOPCNTDQ || (AVX10.1 && AVX10 bitwidth >= 256).

Some option to compile for “these AVX512 features, or AVX10.1-256” would definitely be very nice, as it’d allow having a single future-proof featureful build target for nearly all hardware with any SIMD ISA past AVX2, if utilizing 512-bit vectors isn’t important (leading to three main targets for reasonably covering all of x86_64 - SSE, AVX2, and this AVX512-or-AVX10.1-256).

Thanks for the explanation! This is a scenario we didn’t consider before. I think it is a bit tangential to the AVX10 design and impractical from the HW’s perspective. E.g., dynamic dispatch requires the AVX512 features to be enumerated, so it will never be dispatched on AVX10.x-256 targets.

But I think the requirement sounds reasonable and we can mitigate it to some extent at the SW level. E.g., we can provide a standalone tool, or integrate one into the compiler, to detect whether a binary can run on AVX10.x-256 targets. Or, furthermore, we can provide an option like -mavx10-compatible. I expect it would be systematic work rather than allowing -mno-avx-zmm to be arbitrarily used with AVX512 features.

Speaking of “supports only AVX512F, AVX512CD, AVX512BW, AVX512DQ, AVX512VL extensions, but not all the rest”, I have an idea that we can extend x86-64-v4 to x86-64-v4-256. I think it would not only solve the requirement here but also solve the dilemma of how to define x86-64-v5 for AVX10 and future targets. WDYT?

In a word, I think the requirement you raised would not affect the proposals for AVX10 here. We can follow up based on the AVX10 implementations.

Thanks! That helps me understand @jyknight’s requirement better.
I think we can at least develop a standalone tool to help users evaluate whether a binary can run on AVX10.1-256.
Furthermore, we can help users migrate their code by providing options like -mavx10-compatible.

@jyknight @dzaima FYI, after syncing with the GCC folks, we agreed to start by supporting -m[no-]evex512 [D159250], which should meet your expectations. Please take a look and comment if you have other ideas, thanks!


Thanks for the work to enable this.

I see that there’s a new __EVEX512__ define - would it be possible to also add a __EVEX256__ or similar define? At the moment, there doesn’t seem to be any straightforward way, via defines, to detect that the compiler supports AVX10.1/256.

Thanks! That’s a valuable suggestion. Given that AVX10.1/512 always enables AVX10.1/256, __EVEX256__ would be set for AVX10.1/512 too.
So you have to use #if defined(__EVEX256__) && !defined(__EVEX512__) to detect AVX10.1/256. Is that what you are looking for?

Yes, that sounds sensible. A __EVEX_MAX_WIDTH__ or similar define would also work.

I don’t really mind how it’s exactly done, just that, at the moment, the only way to detect AVX10.1/256 support is by checking the compiler version. Will have to see what GCC does though (from what I can tell, the patch there hasn’t landed yet).

So I just want to confirm. If you enable these features:
avx512f
avx512vl
avx512dq
avx512cd
avx512bw

and pass in -mno-evex512

The compiled code should work on all AVX-512 CPUs as well as any future AVX10.1/256 CPUs?