[Proposal] function attribute to reduce emission of vzeroupper instructions

Hi all,

I would like to find out whether anyone will find it useful to add an x86-

specific calling convention for reducing emission of vzeroupper instructions.

Current implementation:

vzeroupper is inserted to any functions that use AVX instructions. The

insertion points are:

  1. before a call instruction;

  2. before a return instruction;

Background:

vzeroupper is an AVX instruction; it is inserted to avoid performance penalty

when transitioning between x86 AVX mode and legacy SSE mode, e.g., when an

AVX function calls a SSE function. However, vzeroupper is a slow instruction; it

adds to register pressure and hurts performance for AVX-to-AVX calls.

My proposal:

  1. (LLVM part) Add an x86-specific calling convention to the LLVM IR which

specifies that an external function will be compiled with AVX support and its

function definition does not use any legacy SSE instructions, e.g.,

declare x86_avxcc i32 @foo()

  1. (Clang part) Add a function attribute to the clang front-end which specifies

this calling convention, e.g.,

extern int foo() attribute((avx));

Function definitions in a translation unit compiled with -mavx architecture will

implicitly have this attribute.

Benefits:

No vzeroupper is needed before calling a function with this avx attribute, e.g.,

extern int foo() attribute((avx));

void bar() {

// some AVX instruction

// no vzeroupper is needed before the call instruction

foo();

// still needs a vzeroupper before the return instruction

}

Reference:

A few months ago, I submitted a proposal for improving vzeroupper optimization

strategy by changing the default code-emission strategy. The proposal was rejected

on the ground that it would cause problems for existing operating systems.

http://lists.cs.uiuc.edu/pipermail/llvmdev/2013-September/065720.html

I would suggest using metadata instead. The reasons are:

* It could be applied to functions with different calling conventions.
For example, on windows we would probably want to do this to thiscall
(methods) too.
* It the metadata is dropped, we would just produced slower but still
correct code (calls vzeroupper).

Cheers,
Rafael

Maybe a target-specific attribute instead? It would still apply to all
CCs, but would never be dropped.

Maybe a target-specific attribute instead? It would still apply to all CCs,
but would never be dropped.

That would work too, yes. I proposed metadata because it looks like it
can be dropped, but that is not a big issue. I would be OK with an
attribute too if that is more convenient or we want to make sure it is
kept.

Cheers,
Rafael

Hi Rafael and Reid,

To clarify,

With a target-specific attribute, the LLVM IR representation will be something like this:
  declare i32 @foo() "x86_avx"="true"

With a target-specific metadata, the IR will be something like this:
   declare i32 @foo() !1
   ...
   !1 = metadata !{metadata !"x86_avx"}

If a backend does not understand this attribute or this metadata, it will have no effect on code generation.

If this is what you mean, then I believe either approach will work for me.

Do you have any opinion on the clang part of the proposal?

I plan to take the next two weeks off from work, so I probably will respond to emails only
sporadically. I hope you have a happy holiday there too,

- Gao.

Hi Rafael and Reid,

To clarify,

With a target-specific attribute, the LLVM IR representation will be something like this:
  declare i32 @foo() "x86_avx"="true"

With a target-specific metadata, the IR will be something like this:
   declare i32 @foo() !1
   ...
   !1 = metadata !{metadata !"x86_avx"}

If a backend does not understand this attribute or this metadata, it will have no effect on code generation.

If this is what you mean, then I believe either approach will work for me.

cool.

Do you have any opinion on the clang part of the proposal?

Just an observation that this should probably be an attribute on
function types, since I assume you want to be able to put it in
function types. Aaron can probably confirm if that is the right way to
do it.

I plan to take the next two weeks off from work, so I probably will respond to emails only
sporadically. I hope you have a happy holiday there too,

Thanks! Enjoy time off!

- Gao.

Cheers,
Rafael

Hi all,

I would like to find out whether anyone will find it useful to add an x86-

specific calling convention for reducing emission of vzeroupper
instructions.

Current implementation:

vzeroupper is inserted to any functions that use AVX instructions. The

insertion points are:

1) before a call instruction;

2) before a return instruction;

Background:

vzeroupper is an AVX instruction; it is inserted to avoid performance
penalty

when transitioning between x86 AVX mode and legacy SSE mode, e.g., when an

AVX function calls a SSE function. However, vzeroupper is a slow
instruction; it

adds to register pressure and hurts performance for AVX-to-AVX calls.

My proposal:

1) (LLVM part) Add an x86-specific calling convention to the LLVM IR which

specifies that an external function will be compiled with AVX support and
its

function definition does not use any legacy SSE instructions, e.g.,

  declare x86_avxcc i32 @foo()

2) (Clang part) Add a function attribute to the clang front-end which
specifies

this calling convention, e.g.,

  extern int foo() __attribute__((avx));

In general, I'm not too keen on adding more calling conventions unless
there's a really powerful need for one from an ABI perspective. This
sounds more like an optimization than an ABI need. What's more, I
worry (a little bit) about confusion that could be caused with the
__vectorcall calling convention (which we do not currently support,
but will need to at some point for MSVC compatibility).

What should happen with this code?

int foo() __attribute__((avx));

void bar(int (*fp)()) {
  int i = fp();
}

void baz(void) {
  bar(foo);
}

Based on your description, this code is valid, but not as performant
as it could be. The vzeroupper would be inserted before fp() is
called, but there's no incompatibility happening. So I guess this
feels more like a regular function attribute than a calling
convention.

Function definitions in a translation unit compiled with -mavx architecture
will

implicitly have this attribute.

Can you safely do that? What about code that does uses inline assembly
to use legacy SSE instructions in a TU compiled with -mavx, for
instance?

~Aaron

In general, I'm not too keen on adding more calling conventions unless
there's a really powerful need for one from an ABI perspective. This
sounds more like an optimization than an ABI need.

I think that is the case.

What's more, I
worry (a little bit) about confusion that could be caused with the
__vectorcall calling convention (which we do not currently support,
but will need to at some point for MSVC compatibility).

What does the __vectorcall does?

What should happen with this code?

int foo() __attribute__((avx));

void bar(int (*fp)()) {
  int i = fp();
}

void baz(void) {
  bar(foo);
}

Based on your description, this code is valid, but not as performant
as it could be. The vzeroupper would be inserted before fp() is
called, but there's no incompatibility happening. So I guess this
feels more like a regular function attribute than a calling
convention.

It is not a calling convention. The issue is more if it is a type or a
decl attribute. Given that putting the attributes on the function
decls is the simplest and should cover most of the cases, I think we
can probably start with that and revisit if we still see too many
vzeroupper being inserted. What do you think?

Function definitions in a translation unit compiled with -mavx architecture
will

implicitly have this attribute.

Can you safely do that? What about code that does uses inline assembly
to use legacy SSE instructions in a TU compiled with -mavx, for
instance?

I think it would take a performance penalty, but I don't expect that
to be common.

Cheers,
Rafael

In general, I'm not too keen on adding more calling conventions unless
there's a really powerful need for one from an ABI perspective. This
sounds more like an optimization than an ABI need.

I think that is the case.

What's more, I
worry (a little bit) about confusion that could be caused with the
__vectorcall calling convention (which we do not currently support,
but will need to at some point for MSVC compatibility).

What does the __vectorcall does?

http://msdn.microsoft.com/en-us/library/dn375768.aspx

It's different than the proposed attribute, but still relates to SIMD
instruction optimizations.

What should happen with this code?

int foo() __attribute__((avx));

void bar(int (*fp)()) {
  int i = fp();
}

void baz(void) {
  bar(foo);
}

Based on your description, this code is valid, but not as performant
as it could be. The vzeroupper would be inserted before fp() is
called, but there's no incompatibility happening. So I guess this
feels more like a regular function attribute than a calling
convention.

It is not a calling convention. The issue is more if it is a type or a
decl attribute. Given that putting the attributes on the function
decls is the simplest and should cover most of the cases, I think we
can probably start with that and revisit if we still see too many
vzeroupper being inserted. What do you think?

That seems reasonable to me.

Function definitions in a translation unit compiled with -mavx architecture
will

implicitly have this attribute.

Can you safely do that? What about code that does uses inline assembly
to use legacy SSE instructions in a TU compiled with -mavx, for
instance?

I think it would take a performance penalty, but I don't expect that
to be common.

Hmm, I was worried about the situation where:

extern int foo(); // compiled without -mavx

void bar() { // compiled in a TU with -mavx
  ...
  // no vzeroupper is inserted before the call instruction because it is
  // implicit due to -mavx
  foo();
  ...
}

I'm not certain whether this sort of pattern could cause problems or
not. If there's no way for it to be problematic, then implicitly
attaching the attribute is reasonable enough. It does mean we're
straying farther from the as-written attributes for the function, but
that's just an unfortunate situation we're already in today and
wouldn't block this feature.

~Aaron

Hi Aaron,
Many thanks for your feedback!

I do not have any opinion right now on how this attribute should interact with
the __vectorcall calling convention. I will need to revisit it later.

Regarding the implicit attachment of this attribute, my intention is to only
imply the avx attribute on function definitions. Since the backend can see what
instructions are being generated in the callee, it should be able to make smart
decisions on whether to emit a vzeroupper before the call instruction. In the
above example, foo() would not implicitly carry the avx attribute because the
compiler sees only its declaration.

- Gao

Hi Aaron,
Many thanks for your feedback!

I do not have any opinion right now on how this attribute should interact with
the __vectorcall calling convention. I will need to revisit it later.

Regarding the implicit attachment of this attribute, my intention is to only
imply the avx attribute on function definitions. Since the backend can see what
instructions are being generated in the callee, it should be able to make smart
decisions on whether to emit a vzeroupper before the call instruction. In the
above example, foo() would not implicitly carry the avx attribute because the
compiler sees only its declaration.

to be clear, in the example

extern void foo(__m256i x);
extern int bar();
void zed(__m256i x) {
  foo(x);
  bar();
  foo(x);
}

we currently produce a vzeroupper before the call to bar and one
before the return. If the attribute is implicitly added only to
definitions, it will only be added to zed. In which case, wouldn't we
need to keep both vzeroupper instructions? The one before bar is
needed because it is missing the attribute. The one before the return
is needed because we don't know what might call zed.

This looks a bit like how -fvisibility=hidden is implemented. That
implementation is extremely complex, so it would be nice to know that
this will not grow in that direction. The issues there are

* The visibility applies to types. This exists so that it is possible
for all data in a class to have a particular visibility. Do you
envision this attribute also being applied to class types to mean "the
entire class is avx"?
* There are complex rules about when a implicit visibility attribute
is active and how to merge implicit and explicit visibilities.

Do you think that with the attribute as proposed the number of
vzeroupper will be low enough (maybe with LTO) that we wouldn't need
to extend it further?

Cheers,
Rafael