[RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

Dear all,

I have just created a couple of differential reviews to enable the
vectorisation of loops that contain function calls to routines marked with
"#pragma omp declare simd".

They can be (re)viewed here:

* https://reviews.llvm.org/D27249
  
* https://reviews.llvm.org/D27250

The current implementation allows the loop vectorizer to generate vector
code for a source file such as:

  #pragma omp declare simd
  double f(double x);

  void aaa(double *x, double *y, int N) {
    for (int i = 0; i < N; ++i) {
      x[i] = f(y[i]);
    }
  }

by invoking clang with arguments:

  $> clang -fopenmp -c -O3 file.c […]

This functionality should provide a nice interface for vector library
developers: it can be used to inform the loop vectorizer that an external
library provides vector implementations of the scalar functions in the
loops. All that is needed is to mark the function declarations in the
library's header file with "#pragma omp declare simd" and to generate the
associated symbols in the library's object file according to the name
scheme of the vector ABI (see notes below).

I am interested in any feedback/suggestion/review the community might have
regarding this behaviour.

Below you will find a description of the implementation and some notes.

Thanks,

Francesco

Hi Francesco,

Good to know you are working on support for this feature. I assume you know the RFC below. The VectorABI mangling we proposed was approved by David M. from Google, the Clang C++ FE name mangling owner, and the Clang FE support was committed to trunk by Alexey.

“Proposal for function vectorization and loop vectorization with function calls”, March 2, 2016. Intel Corp. http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html.

Matt submitted a patch to generate vector variants for function definitions, not just function declarations. You may want to take a look. Ayal's RFC will also be needed to support vectorization of function bodies in general.

I agree that we should have an option -fopenmp-simd to enable SIMD only; both GCC and ICC have similar options.

I would suggest we sync up on this work so we don't duplicate the effort.

Thanks,
Xinmin

Hi Xinmin,

Thank you for your email.

I have been catching up with the content of your proposal, and I have
some questions/remarks below that I’d like to discuss with you - see
the final section in the proposal.

I have specifically added Alexey B. to the mail so we can move our
conversation from phabricator to the mailing list.

Before we start, I just want to mention that the initial idea of using
llvm::FunctionType for vector function generation and matching has
been proposed by a colleague, Paul Walker, when we first tried out
supporting this on AArch64 on an internal version of llvm. I received
some input also from Amara Emerson.

In our case we had a slightly different problem to solve: we wanted the
vectorizer to support a rich set of vector math routines provided by an
external library. We managed to do this by adding the pragma to the
(scalar) function declarations in the header file provided with the
library and, as shown by the patches I have submitted, by generating
vector function signatures that the vectorizer can search for in the
TargetLibraryInfo.

Here is an updated version of the proposal. Please let me know what
you think, and if you have any solution we could use for the final
section.

# RFC for "pragma omp declare simd"

High-level components:

A) Global variable generator (clang FE)
B) Parameter descriptors (as new enumerations in llvm::Attribute)
C) TLII methods and fields for the multimap (llvm middle-end)

## Workflow

Example user input, with a declaration and definition:

    #pragma omp declare simd
    #pragma omp declare simd uniform(y)
    extern double pow(double x, double y);

    #pragma omp declare simd
    #pragma omp declare simd linear(x:2)
    float foo(float x) {....}

    /// code using both functions

### Step 1

The compiler FE processes the declaration and the definition and
generates a list of globals as follows:

    @prefix_vector_pow1_midfix_pow_postfix = external global
                                             <4 x double>(<4 x double>,
                                                          <4 x double>)
    @prefix_vector_pow2_midfix_pow_postfix = external global
                                             <4 x double>(<4 x double>,
                                                          double)
    @prefix_vector_foo1_midfix_foo_postfix = external global
                                             <8 x float>(<8 x float>,
                                                         <8 x float>)
    @prefix_vector_foo2_midfix_foo_postfix = external global
                                             <8 x float>(<8 x float>,
                                                         <8 x float> #0)
    ...
    attribute #0 = {linear = 2}

Notes about step 1:

1. The mapping scalar name <-> vector name is in the
   prefix/midfix/postfix mangled name of the global variable.
2. The example shows only the set of possible vector functions for a
   sizeof(<4 x double>) vector extension. If multiple vector extensions
   coexist on the same target (e.g. NEON 64-bit and NEON 128-bit, or SSE
   and AVX512), the front end takes care of generating each of the
   associated functions (as it is done now).
3. Vector function parameters are rendered using the same
   Characteristic Data Type (CDT) rule already in the compiler FE.
4. Uniform parameters are rendered with the original scalar type.
5. Linear parameters are rendered with vectors using the same
   CDT-generated vector length, and decorated with proper
   attributes. I think we could extend the llvm::Attribute enumeration
   by adding the following:
   - linear: numeric, specifies the step
   - linear_var: numeric, specifies the position of the uniform variable holding the step
   - linear_uval[_var]: numeric, as before, but for the "uval" modifier (both constant and variable step)
   - linear_val[_var]: numeric, as before, but for the "val" modifier
   - linear_ref[_var]: numeric, for the "ref" modifier

   For example, "attribute #0 = {linear = 2}" says that the vector of
   the associated parameter in the function signature has a linear
   step of 2.
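To make note 5 concrete, here is a small C++ sketch of how the proposed linear attribute kinds and their numeric payload could be modelled. The enum and struct names are purely illustrative; they are not existing llvm::Attribute enumerators:

```cpp
#include <cassert>

// Illustrative enumeration of the proposed linear-clause attribute kinds;
// these are NOT existing llvm::Attribute enumerators.
enum class LinearKind {
  Linear,        // linear(i) or linear(i:step): constant step
  LinearVar,     // linear(i:c) uniform(c): step held in a uniform parameter
  LinearVal,     // linear(val(i):step)
  LinearValVar,  // linear(val(i):c) uniform(c)
  LinearUVal,    // linear(uval(i):step)
  LinearUValVar, // linear(uval(i):c) uniform(c)
  LinearRef,     // linear(ref(i):step)
  LinearRefVar   // linear(ref(i):c) uniform(c)
};

// One record per linear parameter: its position in the signature, the
// kind of linearity, and either the constant step or the position of the
// uniform parameter holding the step (for the *Var kinds).
struct LinearParamInfo {
  unsigned ParamPos;
  LinearKind Kind;
  int StepOrPos;
};

// "attribute #0 = {linear = 2}" on the first parameter, i.e. linear(x:2):
LinearParamInfo makeLinearStep2() {
  return {0u, LinearKind::Linear, 2};
}
```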

### Step 2

The compiler FE invokes a TLII method in BackendUtils.cpp that
populates a multimap in the TLII by checking the globals created in the
previous step.

Each global is processed, demangling the [pre/mid/post]fix name and
generating a mapping in the TLII as follows:

    struct VectorFnInfo {
       std::string Name;
       FunctionType *Signature;
    };
    std::multimap<std::string, VectorFnInfo> VFInfo;

For the initial example, the multimap in the TLI is populated as follows:

    "pow" -> [(vector_pow1, <4 x double>(<4 x double>, <4 x double>)),
              (vector_pow2, <4 x double>(<4 x double>, double))]

    "foo" -> [(vector_foo1, <8 x float>(<8 x float>, <8 x float>)),
              (vector_foo2, <8 x float>(<8 x float>, <8 x float> #0))]

Notes about step 2:

Given that the external globals the FE has generated are removed
_before_ the vectorizer kicks in, I am not sure whether the
"attribute #0" needed for one of the parameters is still present at
this point. If not, I think we could enrich "VectorFnInfo" as follows:

    struct VectorFnInfo {
       std::string Name;
       FunctionType *Signature;
       std::map<unsigned, llvm::Attribute> Attrs;
    };

The field "Attrs" maps the position of a parameter to the
corresponding llvm::Attribute present in the global variable.

I have added this note for the sake of completeness. I *think* we
won't need this additional Attrs field: I have already shown in
the llvm patch I submitted that the function type "survives" after the
global gets removed, so I don't see why the parameter attribute shouldn't
survive too (famous last words?).
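The Step-2 population can be sketched in plain C++. To keep the sketch self-contained, llvm::FunctionType is replaced here by a textual signature, and the demangler for the hypothetical prefix/midfix/postfix scheme is a toy helper:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Simplified stand-ins for the proposed TLII structures: the LLVM
// FunctionType pointer is replaced by a textual signature for illustration.
struct VectorFnInfo {
  std::string Name;       // mangled vector name, e.g. "vector_pow1"
  std::string Signature;  // e.g. "<4 x double>(<4 x double>, <4 x double>)"
};

using VFInfoMap = std::multimap<std::string, VectorFnInfo>;

// Toy demangler for the hypothetical scheme used in the examples above:
// "prefix_<vectorname>_midfix_<scalarname>_postfix".
bool demangle(const std::string &Global, std::string &Vector,
              std::string &Scalar) {
  const std::string Pre = "prefix_", Mid = "_midfix_", Post = "_postfix";
  if (Global.compare(0, Pre.size(), Pre) != 0)
    return false;
  std::size_t M = Global.find(Mid), P = Global.rfind(Post);
  if (M == std::string::npos || P == std::string::npos || P < M)
    return false;
  Vector = Global.substr(Pre.size(), M - Pre.size());
  Scalar = Global.substr(M + Mid.size(), P - M - Mid.size());
  return true;
}

// Step-2 population: one multimap entry per external global.
void addVectorVariant(VFInfoMap &VFInfo, const std::string &Global,
                      const std::string &Signature) {
  std::string Vec, Scalar;
  if (demangle(Global, Vec, Scalar))
    VFInfo.insert({Scalar, {Vec, Signature}});
}
```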

### Step 3

This step happens in the LoopVectorizer. The InnerLoopVectorizer
queries the TargetLibraryInfo looking for a vectorized version of the
function by scalar name and function signature with the following method:

    TargetLibraryInfo::isFunctionVectorizable(std::string ScalarName, FunctionType *FTy);

This is done in a way similar to what my current llvm patch does: the
loop vectorizer makes up the function signature it needs and looks for
it in the TLI. If a match is found, vectorization is possible. Right
now the compiler is not aware of uniform/linear function attributes,
but it can still refer to them in a target-agnostic way, by using
scalar signatures for the uniform parameters and llvm::Attributes for
the linear ones.

Notice that the vector name is not used here at all, which is good: any
architecture can come up with its own name mangling for vector
functions without breaking the vectorizer's ability to vectorize
the same code under the new name mangling.
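The Step-3 lookup can be sketched in the same simplified style. This is just a sketch: the real query would be the proposed TargetLibraryInfo::isFunctionVectorizable(ScalarName, FTy), reduced here to a string comparison on a stand-alone multimap:

```cpp
#include <cassert>
#include <map>
#include <string>

// Textual stand-in for the (vector name, LLVM FunctionType) pair held by
// the TLII multimap in this proposal.
struct VectorFnInfo {
  std::string Name;
  std::string Signature;
};

// Mirrors the proposed isFunctionVectorizable query: the vectorizer makes
// up the signature it needs and searches the variants registered under
// the scalar name. Returns the vector name on a match, or an empty string
// when vectorization is not possible.
std::string findVectorVariant(
    const std::multimap<std::string, VectorFnInfo> &VFInfo,
    const std::string &ScalarName, const std::string &WantedSig) {
  auto Range = VFInfo.equal_range(ScalarName);
  for (auto It = Range.first; It != Range.second; ++It)
    if (It->second.Signature == WantedSig)
      return It->second.Name;  // match found: vectorization is possible
  return "";
}
```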

## External libraries vs user provided code

The example with "pow" and "foo" I provided before shows a
function declaration and a function definition. Although the TLII
mechanism I have described seems valid only for the former case,
I think it is also valid for the latter. In the case of a
function definition, the compiler would also have to generate the body
of the vector function, but the external global variable could still
be used to inform the TLII of such a function. The fact that the vector
function needed by the vectorizer is in some module instead of in an
external library doesn't seem to make much difference at compile
time to me.

# Some final notes (call for ideas!)

There is one level of target dependence that I still have to sort out,
and for this I need input from the community and in particular from
the Intel folks.

I will start with this example:

    #pragma omp declare simd
    float foo(float x);

In the case of NEON, this would generate 2 globals, one for vectors
holding 2 floats and one for vectors holding 4 floats, corresponding
to NEON 64-bit and 128-bit respectively. This means that the
vectorizer has a unique function it can choose from the list the
TLI provides.

This is not the case on Intel, for example when this code generates
vector names for both AVX and AVX2. The register width for these
architecture extensions is the same, so all the TLI has is a mapping
between the scalar name and two (vector_name, function_type) pairs
whose elements differ only in the vector_name string.

This breaks the target independence of the vectorizer, as it would
require it to parse the vector_name to be able to choose between the
AVX or the AVX2 implementation.

Now, to make this work one would have to encode the SSE/SSE2/AVX/AVX2
information in the VectorFnInfo structure. Does anybody have an idea of how
best to do it? For the sake of keeping the vectorizer target
independent, I would like to avoid encoding this piece of information
in the VectorFnInfo struct. I have seen that in your code you are
generating SSE/AVX/AVX2/AVX512 vector functions; how do you plan to
choose between them in the vectorizer? I could not find how you
planned to solve this problem in your proposal, or have I just missed
it?

Is there a way to do this in the TLII? The function type of the vector
function could use the "target-features" attribute of function
definitions, but how could the vectorizer decide which one to use?

Anyway, that's it. Your feedback will be much appreciated.

Cheers,
Francesco

Hi Francesco,

As you stated in the RFC, when vectorizing a scalar function (e.g. when using omp declare simd), one needs to incorporate attributes into the resulting vectorized function.
These attributes describe a) the behavior of the function, e.g. maskable or not, and b) the type of the parameters, e.g. scalar or linear or any other option.

As this list is extensive, it is only logical to use the existing ICC and GCC VectorABI infrastructure, which already covers all of these options, as stated in Xinmin's RFC [http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html].
Moreover, when considering other compilers such as GCC, I see that the resulting assembly actually does incorporate this exact infrastructure. So if we wish to link different parts of the program built with clang and GCC, we'll need to adhere to the same name mangling/ABI. Please see the result below, obtained after compiling an omp declare simd function with GCC.
Lastly, please note that two of the three components of the implementation have already been committed or submitted, and both adhere to the name mangling proposed in Xinmin's RFC: A) committed - the FE portion by Alexey [https://reviews.llvm.org/rL264853], which generates mangled names in the manner described by Xinmin's RFC (see below); B) submitted - the callee side by Matt [https://reviews.llvm.org/D22792], which uses these mangled names; and C) the caller side, which is covered by this patch.

In order to mitigate the needed effort and possible issues when implementing, I believe it is best to follow the name mangling proposed in Xinmin's RFC. What do you think?

GCC Example

Hi Saher,

Thanks for your email. Please see my comments below.

> Hi Francesco,
>
> As you stated in the RFC, when vectorizing a scalar function (e.g. when
> using omp declare simd), one needs to incorporate attributes to the
> resulting vectorized-function.
> These attributes describe a) the behavior of the function, e.g. mask-able
> or not, and b) the type of the parameters, e.g. scalar or linear or any
> other option.

No, the attributes are needed only for the different linear clauses, not
for (not)inbranch or uniform, as follows:

1. Regular vector parameter -> rendered as a vector in the function
signature.
2. Linear parameter -> rendered as a vector in the function signature,
plus an attribute describing the kind of linearity, attached to the vector
function signature parameter:
   linear(i) -> linear = 1
   linear(i:2) -> linear = 2
   linear(i:c) uniform(c) -> linear_var = X, X being the position of the c
parameter in the original function
   linear(ref(i):3) -> linear_ref = 3
   linear(ref(i):c) uniform(c) -> linear_ref_var = X, X as before
   linear(val(i):3) -> linear_val = 3
   linear(val(i):c) uniform(c) -> linear_val_var = X, X as before
   linear(uval(i):3) -> linear_uval = 3
   linear(uval(i):c) uniform(c) -> linear_uval_var = X, X as before
This means that I am asking to have eight additional attributes, each with
an explicit meaning.
3. Uniform parameters -> kept as scalar parameters in the vector function
signature.

I have a solution also for the (not)inbranch clause: just add an
additional vector parameter representing the vector predicate.

Here are some more examples for the cases I haven't considered in the RFC:

#pragma omp declare simd inbranch
int doit(float);

The vector signature for a 128-bit vector (the CDT here is int) is:
<4 x i32> (<4 x i32>, <4 x float>) (notice the additional mask parameter)
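The CDT rule used above can be captured in a couple of lines. This is a sketch under the assumption that the mask is modelled as one extra vector parameter; the helper names are made up:

```cpp
#include <cassert>

// Characteristic Data Type rule: vector length = register width / CDT width.
unsigned vectorLength(unsigned RegisterBits, unsigned CDTBytes) {
  return RegisterBits / (8 * CDTBytes);
}

// With "inbranch", the vector signature carries one extra mask parameter
// in addition to the widened scalar parameters.
unsigned numVectorParams(unsigned NumScalarParams, bool InBranch) {
  return NumScalarParams + (InBranch ? 1u : 0u);
}
```

For `int doit(float)` on a 128-bit vector unit the CDT is int, so `vectorLength(128, 4)` yields 4 lanes, and the inbranch variant takes `numVectorParams(1, true)` parameters (the mask plus the widened float argument).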

> As this list is extensive, it is only logical to use an existing
> infrastructure of ICC and GCC vectorABI which already covers all of these
> options as stated in Xinmin's RFC
> [http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html].

The list consists of eight attributes. They will be present in the IR
only if used, and no more than one per directive. This is
smaller than the list of strings carrying all the possible names for the
architecture.

> Moreover, when considering other compilers such as GCC, I do see that the
> resulting assembly actually does incorporate this exact infrastructure.

I haven't changed the ABI. The ABI produced (for Intel) is still the same.
You can check this in test/OpenMP/declare_simd_no_definition.c, lines 24-34
of https://reviews.llvm.org/D27250.

All I want to achieve here is that the vectorizer uses OpenMP declare simd
information in a target-independent way.

> So if we wish to link different parts of the program using clang and GCC
> we'll need to adhere to the same name mangling/ABI. Please see the below

Definitely. As I said, the name mangling I am producing is the same on x86.

> result after compiling an omp declare simd function using GCC.
> Lastly, please note the two out of the three components of the
> implementation have already been committed or submitted, and both are
> adhering the name mangling proposed by Xinmin's RFC. A) committed - the
> FE portion by Alexey [https://reviews.llvm.org/rL264853], it generates
> mangled names in the manner described by Xinmin's RFC, See below B)
> Submitted - the callee side by Matt [https://reviews.llvm.org/D22792], it
> uses these mangled names. and C) caller which is covered by this patch.

Yes, I understand. Nevertheless, I think that this approach is too
X86-centric.

Matt's patch could be simplified with the method I am proposing, because
within the same target architecture his patch would only have to look at
the part of the name that relates to the particular vector extension
(like "b", "c", "e", "d" for X86).
Other architectures will take care of their own vector extension token in
the naming. This is a possible solution to the "call for ideas" I added
at the end of my last email.
I believe that this could be solved by attaching the already available
"target-features" attribute to the function signature. I will update the
proposal with these bits.
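A hedged sketch of that target-features idea: pick, among the registered variants, one whose required features are all present in the subtarget being compiled for. The record layout and feature strings below are illustrative, not the actual TLII interface:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Illustrative record: a vector variant plus the subtarget features its
// implementation requires (mirroring a "target-features" string).
struct Candidate {
  std::string VectorName;
  std::set<std::string> RequiredFeatures;
};

// Select the first candidate whose required features are all available
// in the subtarget the module is being compiled for.
std::string selectVariant(const std::vector<Candidate> &Cands,
                          const std::set<std::string> &Available) {
  for (const Candidate &C : Cands) {
    bool Usable = true;
    for (const std::string &F : C.RequiredFeatures)
      if (Available.count(F) == 0) {
        Usable = false;
        break;
      }
    if (Usable)
      return C.VectorName;
  }
  return "";  // no variant can run on this subtarget
}
```

With candidates ordered from most to least specialised, the query stays target independent: the vectorizer only compares feature sets, never parses vector names.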

Also, you mentioned GCC. GCC has already gone beyond the name mangling you
are proposing, as you can see here (short URL https://goo.gl/WIfzQo):
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/omp-simd-clone.c;h=df140d40446df013d25bcde6f13af82bf00c8f47;hb=HEAD#l331

This GCC code handles all the possible combinations of the linear
clauses, which have been extended in OpenMP 4.5. The idea of using
attributes would abstract this away from the name mangling.

> In order to mitigate the needed effort and possible issues when
> implementing, I believe it is best to follow the name mangling proposed
> in Xinmin's RFC. What do you think?

I think I have stated my disagreement about the name mangling in this
email. As I said, the name mangling works for X86, but not for other
architectures. I believe that each architecture is free to choose whatever
name mangling it prefers. My example patches make the vectorizer
independent of the architectural name mangling, which I think is a good
approach as it could be shared by all architectures, avoiding the need to
create custom code in the vectorizer - other than the target-features
attribute for the function signature.

Cheers,

Francesco

Hi Francesco, a bit more information: the GCC veclib is also implemented based on the GCC VectorABI for declare simd.

For name mangling, we have to follow certain rules of C/C++ (e.g. the prefix needs to be _ZGV ....). David Majnemer is the owner and stakeholder for approval of name mangling for Clang and LLVM. Also, we need to pay attention to GCC compatibility. I would suggest you look into how the GCC VectorABI can be extended to support your architecture.

Thanks,
Xinmin

Hi Xinmin,

I have only begun to review this proposal, and like yours, I think this is
a really important feature to get in.

I agree with you on the name mangling needed for C++, as well as
compatibility with GCC, but according to Francesco, there are some
problems that those two alone don't solve.

I'm still unsure how the simplistic mangling we have today will work
around the multiple versions we could have with NEON (and in the
future, SVE) without polluting the mangling quite a lot (have you seen
arm_neon.h?).

So, we may get away with it for now with some basic support and the
current style, but this should grow into a more flexible scheme.

About the current IR form, I don't particularly like how they're tied
up together, but other than having multiple global functions defined
(something like weak linkage?), I don't have a better idea right now.

Francesco,

Maybe the best thing to do right now would be to try and fit NEON
alternatives in this mangling scheme and see how it goes. If anything,
it'll give us an idea on what's broken, and hopefully, how to fix it.

cheers,
--renato

Hi Xinmin,

I have updated the clang patch using the standard name mangling you
suggested - I was not fully aware of the C++ mangling convention "_ZGV".

I am using "D" for 64-bit NEON and "Q" for 128-bit NEON, which makes NEON
vector symbols look as follows:

_ZGVQN2v__Z1fd
_ZGVDN2v__Z1ff
_ZGVQN4v__Z1ff

Here "Q" means NEON 128-bit and "D" means NEON 64-bit.
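For illustration, here is a toy parser for such symbols, assuming the `_ZGV<isa><mask><lanes><params>_<scalarname>` layout of the vector function ABI. Error handling is minimal and the field names are made up:

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Fields recovered from a mangled vector symbol; names are made up.
struct VariantInfo {
  char ISA;            // e.g. 'Q' (NEON 128-bit) or 'D' (NEON 64-bit) here
  bool Masked;         // 'M' = masked variant, 'N' = unmasked
  unsigned Lanes;      // number of lanes
  std::string Scalar;  // mangled name of the scalar function
};

// Toy parser for "_ZGV<isa><mask><lanes><params>_<scalarname>".
bool parseVariant(const std::string &Sym, VariantInfo &Out) {
  const std::string Pre = "_ZGV";
  if (Sym.compare(0, Pre.size(), Pre) != 0 || Sym.size() < Pre.size() + 4)
    return false;
  std::size_t I = Pre.size();
  Out.ISA = Sym[I++];
  char Mask = Sym[I++];
  if (Mask != 'M' && Mask != 'N')
    return false;
  Out.Masked = (Mask == 'M');
  Out.Lanes = 0;
  while (I < Sym.size() && std::isdigit(static_cast<unsigned char>(Sym[I])))
    Out.Lanes = Out.Lanes * 10 + static_cast<unsigned>(Sym[I++] - '0');
  // The parameter tokens run up to the '_' that precedes the scalar name.
  std::size_t Sep = Sym.find('_', I);
  if (Sep == std::string::npos)
    return false;
  Out.Scalar = Sym.substr(Sep + 1);
  return true;
}
```

Note that, as argued earlier in the thread, the vectorizer itself never needs to do this parsing; only per-target code that materialises the symbols would.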

Please notice that although I have changed the name mangling in clang [1],
there has been no need to update the related llvm patch [2], as the
vectorisation process is _independent_ of the name mangling.

Regards,

Francesco

[1] https://reviews.llvm.org/D27250
[2] https://reviews.llvm.org/D27249, The only update was a bug fix in the
copy constructor of the TLII and in the return value of the TLII::mangle()
method. None of the underlying scalar/vector function matching algorithms
have been touched.

On 12 December 2016 at 13:44, Francesco Petrogalli wrote:

> I am using "D" for 64-bit NEON and "Q" for 128-bit NEON, which makes NEON
> vector symbols look as follows:
>
> _ZGVQN2v__Z1fd
> _ZGVDN2v__Z1ff
> _ZGVQN4v__Z1ff

Hi Francesco,

The ARM AAPCS (A.2.1) says:

"For C++ the mangled name for parameters is as though the equivalent
type name was used."

Clang is already able to mangle NEON vectors of any length
(CXXNameMangler::mangleNeonVectorType), you should use that, as this
is very likely to be compatible with other compilers as well.

cheers,
--renato

Hi Renato,

The “u”, “v”, “l” mangling of parameters has already been agreed
internally at ARM with the gcc team, and as Xinmin has pointed out, this
is going to guarantee compatibility with GCC.

Thanks,

Francesco

Ah, right. Sounds good, then.

cheers,
--renato

Xinmin,

Allow me to share a couple of comments about what Renato is saying.

> I'm still unsure how the simplistic mangling we have today will work
> around the multiple versions we could have with NEON (and in the
> future, SVE) without polluting the mangling quite a lot (have you seen
> arm_neon.h?).

Reconstructing the vector parameter types from the name mangling works for
fixed-width vector architectures, including NEON.
For SVE, the alternative method I am proposing of using IR types will make
it easier to handle width-agnostic vector function types.

With SVE we could have multiple versions of the same function for
different vector lengths, plus a totally width-agnostic version that would
work on any SVE implementation. All of this information could potentially
be used by the compiler, and I see an advantage in having it encoded in IR
structures (FunctionType and VectorType) instead of strings, as that would
make the information directly accessible to other parts of the compiler.
There is a proposal on the ML for extending the IR vector type to support
width-agnostic vectors. Whatever the final shape of such vectors, I
suspect it would be easier to handle multiple width-agnostic versions of
functions by classifying them with IR types.

> So, we may get away with it for now with some basic support and the
> current style, but this should grow into a more flexible scheme.
>
> About the current IR form, I don't particularly like how they're tied
> up together, but other than having multiple global functions defined
> (something like weak linkage?), I don't have a better idea right now.

I am not sure I understand here. In my patch, all I am doing is "vector
symbol awareness generation". There are no globals generated in the
final object file; it is just the TargetLibraryInfoImpl that is being
populated with the info needed by the vectorizer.

The information needs to be serialised into IR, so that a multi-step
compilation (clang->llc) picks up the details. Otherwise, we'd be
vectorising when coming from Clang and not when passing through
opt/llc.

Admittedly, opt/llc are not user tools, but we should try to not have
any hidden knowledge that can't be inferred again by the debug tools
(or testing becomes impossible).

So, you either add metadata or attributes, or you create multiple weak
global functions that get destroyed by the linker if unused.

Makes sense?

cheers,
--renato

Francesco, thanks for updating the patch.

GCC used b, c, d; you used Q for ARM 128-bit, which seems fine. For D (64-bit), do you have to use it, or can you find another letter to avoid future conflict/confusion if GCC ever needs D vs. d? Is the GCC community OK with these letters for ARM compatibility?

Thanks,
Xinmin

Thanks Renato. Per the latest email from Francesco, it seems the current mangling mechanism works for ARM as well, except that we need to use a different arch "letter" for NEON 64-bit and NEON 128-bit.

Cheers
Xinmin

From: "Renato Golin via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Francesco Petrogalli" <Francesco.Petrogalli@arm.com>
Cc: "a bataev" <a.bataev@hotmail.com>, llvm-dev@lists.llvm.org, "Matt Masten" <matt.masten@intel.com>, "nd"
<nd@arm.com>
Sent: Monday, December 12, 2016 11:05:46 AM
Subject: Re: [llvm-dev] [RFC] Enable "#pragma omp declare simd" in the LoopVectorizer

> I am not sure I understand here. In my patch, all I am doing is
> “vector
> symbol awareness generation”. There are no globals that are
> generated in
> the final object file, it is just the TargetLibraryInfoImpl that is
> being
> populated with the info needed by the vectorizer.

The information needs to be serialised into IR, so that a multi-step
compilation (clang->llc) picks up the details. Otherwise, we'd be
vectorising when coming from Clang and not when passing through
opt/llc.

FWIW, this is the situation we have now with function vectorization (both Clang and opt end up calling TLI->addVectorizableFunctionsFromVecLib based on command-line arguments, etc.). We probably should move this into some function attribute.

-Hal