[RFC] Expose user provided vector function for auto-vectorization.

Hi All,

[I'm only subscribed to digest, so the reply doesn't look great, sorry about that]

The second component is a tool that other parts of LLVM (for example, the loop vectorizer) can use to query the availability of the vector function: the SVFS I described in the original post of the RFC, which is based on interpreting the `vector-variant` attribute.
The final component is the one that seems to have generated most of the controversies discussed in the thread, and for which I decided to move away from `declare variant`.

Where will the mapping between parameter positions be stored? Using the example from https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-vector-variant:

float MyAdd(float* a, int b) { return *a + b; }
__declspec(vector_variant(implements(MyAdd(float *a, int b)),
                          linear(a), vectorlength(8),
                          nomask, processor(core_2nd_gen_avx)))
__m256 __regcall MyAddVec(float* v_a, __m128i v_b1, __m128i v_b2)

We need to somehow communicate which lanes of the widened "b" map to the b1 parameter and which go to b2. If we only cared about a single ABI (like the one mandated by OpenMP) then such things could be put into TTI, but what about other ABIs? Should we encode this explicitly in the annotation too?

Best Regards,
Andrei

Hi Andrei,

We need to somehow communicate which lanes of the widened "b" map to the b1 parameter and which go to b2. If we only cared about a single ABI (like the one mandated by OpenMP) then such things could be put into TTI, but what about other ABIs? Should we encode this explicitly in the annotation too?

I think that the mapping between a scalar parameter and the corresponding vector parameter(s) - there can be more than one - should be handled by the Vector Function ABI, when a vector function ABI is defined.

I am working on a new proposal; I’ll keep you posted.

I think that the requirements of (1) being a user feature and (2) being based on a standard (OpenMP) imply that a contract between the scalar functions and the vector functions must be stipulated in some document, that document being a vector function ABI for the target.

I am crafting the attribute so that it makes it explicit that we are using OpenMP and we are expecting a Vector Function ABI.

Kind regards,

Francesco

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

Hi Francesco,

I am crafting the attribute so that it makes it explicit that we are using OpenMP and we are expecting a Vector Function ABI.

I just thought that another option would be to force the FE to always emit a "logically"-widened alwaysinline wrapper for the vector function that does the argument processing according to the ABI inside (we need that info in the FE anyway). That way the vectorizer pass won't need to care about the tricky processing, and we will (possibly) get somewhat easier-to-understand IR after the vectorizer.

Is that something that might work?

Thanks,
Andrei

Hello!

I don’t know, I am not sure I understand your request.

What is a `"logically"-widened alwaysinline wrapper for the vector function`? Can you provide an example? Also, what is the `tricky processing` you are referring to that the vectorizer should care about?

Kind regards,

Francesco

From: Francesco Petrogalli <Francesco.Petrogalli@arm.com>
Sent: Monday, June 10, 2019 09:09
To: Elovikov, Andrei <andrei.elovikov@intel.com>
Cc: llvm-dev@lists.llvm.org; Saito, Hideki <hideki.saito@intel.com>
Subject: Re: [RFC] Expose user provided vector function for auto-vectorization.

What is a `"logically"-widened alwaysinline wrapper for the vector function`? Can you provide an example? Also, what is the `tricky processing` you are referring to that the vectorizer should care about?

For the case mentioned earlier:

float MyAdd(float* a, int b) { return *a + b; }
__declspec(vector_variant(implements(MyAdd(float *a, int b)),
                         linear(a), vectorlength(8),
                         nomask, processor(core_2nd_gen_avx)))
__m256 __regcall MyAddVec(float* v_a, __m128i v_b1, __m128i v_b2)

If FE emitted

define <8 x float> @MyAddVec.abi_wrapper(float* %v_a, <8 x i32> %v_b) alwaysinline {
  ;; Not sure about the exact values in the mask parameter.
  %v_b1 = shufflevector <8 x i32> %v_b, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %v_b2 = shufflevector <8 x i32> %v_b, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %ret = call <8 x float> @MyAddVec(float* %v_a, <4 x i32> %v_b1, <4 x i32> %v_b2)
  ret <8 x float> %ret
}

Then the vectorizer wouldn't need to deal with splitting the vector version of %b into two arguments, and the vector attribute would only describe the kind of processing that is specific to the vectorizer, not the lowering/ABI part.

Note that I don't insist on this approach; it's just an alternative to the "hardcoded" usage of OMP's Vector Function ABI.

Thanks,
Andrei

I see, thank you for the clear explanation.

Then the vectorizer wouldn't need to deal with splitting the vector version of %b into two arguments, and the vector attribute would only describe the kind of processing that is specific to the vectorizer, not the lowering/ABI part.

Why would you want to split the input into 2 parameters at the C level? Is it because, for that particular core, the 256-bit-wide vectors are only for FP data and not for integers?

I would have bet that the signature of `MyAddVec` would be something along the lines of `__m256 __regcall MyAddVec(float* v_a, __m256i v_b)`, but with this I might just be showing my ignorance of the Intel ABI...

Note that I don't insist on this approach; it's just an alternative to the "hardcoded" usage of OMP's Vector Function ABI.

To me the “hardcoded” way has the advantage of being portable and of being able to provide useful information to the user at compile time, along the lines of “For scalar version `foo` with such a clang_declare_variant_simd attribute, the expected signature is `whatever signature the ABI requests`”.

Also, at the IR level, wouldn’t it be better for the vectorizer or the TLI/TTI to build the wrapper itself if needed, instead of having the frontend do it?

Francesco

Why would you want to split the input into 2 parameters at the C level? Is it because, for that particular core, the 256-bit-wide vectors are only for FP data and not for integers?

I think yes: 256-bit integer arithmetic needs AVX2, not AVX. That particular mapping is described in section 2.4, "Element Data Type to Vector Data Type Mapping", of https://software.intel.com/sites/default/files/managed/b4/c8/Intel-Vector-Function-ABI.pdf.

Also, at the IR level, wouldn’t it be better for the vectorizer or the TLI/TTI to build the wrapper itself if needed, instead of having the frontend do it?

There might be some misunderstanding here. I suggested using the wrapper so that the vectorizer wouldn't have to deal with the mapping at all. I thought of it as another way of abstracting the ABI (the first way being to implement it in TTI).

Thanks,
Andrei

If the user function is only used in a non-autovectorization context, the wrapper generated by the FE will be left unused.

I think it is better to generate such a wrapper “on demand”.

This doesn’t mean that the logic of the mapping of the vector function ABI needs to be exposed to the vectorizer, I agree with you. It may well be exposed at the SVFS level (please refer to previous iterations of the RFC for a description of the SVFS, or wait for the upcoming new draft).

Thank you for your comments, Andrei; this was a very useful exercise for me!

Let me know if you have any other questions.

Francesco