Proposal to improve vzeroupper optimization strategy

Hi all,

I would like to make a proposal about changing the optimization strategy
regarding when to insert a vzeroupper instruction in the x86 backend.

Current implementation:
vzeroupper is inserted into any function that uses AVX instructions. The
insertion points are:
1) before a call instruction;
2) before a return instruction;
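
To make the current placement concrete, here is a rough sketch in C (illustrative only; the comments describe the assumed codegen, not actual compiler output):

#include <immintrin.h>

extern void g(void);
extern float out[8];

void f(const float *a, const float *b) {
  __m256 s = _mm256_add_ps(_mm256_loadu_ps(a), _mm256_loadu_ps(b));
  _mm256_storeu_ps(out, s);  /* the YMM upper halves are now "dirty" */
  g();                       /* vzeroupper is emitted before this call */
}                            /* and again before the return */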

Rationale:
vzeroupper is an AVX instruction; it is inserted to avoid the performance
penalty incurred when switching between x86 AVX mode and SSE mode, e.g., when
an AVX function calls an SSE function.

My proposal:
Default to not inserting a vzeroupper instruction unless a function uses
legacy SSE instructions. By a legacy SSE instruction, I mean any vector
instruction that lacks the v- prefix and writes an XMM register but not a YMM
register (see the encoding sketch below). If a legacy SSE instruction is
spotted, then insert a vzeroupper instruction:
1) before a call instruction;
2) before a return instruction;
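
For reference, the distinction is about encoding, not source code; the same intrinsic can compile to either form. A minimal sketch (the flag behaviour in the comments matches mainstream compilers, but verify for your toolchain):

#include <xmmintrin.h>

/* Built without AVX:  addps  xmm0, xmm1        - legacy SSE; leaves the
                                                  YMM upper bits untouched.
   Built with -mavx:   vaddps xmm0, xmm0, xmm1  - VEX-encoded; zeroes
                                                  bits 255:128 of ymm0.   */
__m128 add4(__m128 a, __m128 b) {
  return _mm_add_ps(a, b);
}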

Explanation:
If the application and all libraries are compiled with the same toolchain,
then with this proposal, a function can assume that incoming AVX registers
have their top 128 bits either explicitly specified or zeroed. Assuming that
legacy SSE instructions will seldom be generated, it should be rare to have to
emit the vzeroupper instruction, which is itself slow.

Possible problem:
This proposal is biased towards the situation where all applications and
libraries are compiled with the same toolchain. If it is common to mix and
match applications built with different toolchains, this approach might lead
to situations where a vzeroupper instruction is missing when calling from an
LLVM-compiled AVX function into a foreign-compiled SSE function, hence a
transition penalty. One possible way around this issue is to add a function
attribute which specifies whether the caller and callee follow the same
convention. e.g.,
extern int foo(void) __attribute__((nolegacy));
would declare an external function that does not use legacy SSE instructions.
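
To sketch how call sites might then be treated (the nolegacy attribute is hypothetical, as above):

extern int foo(void) __attribute__((nolegacy)); /* promises no legacy SSE */
extern int bar(void);                           /* unknown foreign callee */

void caller(void) {
  foo(); /* no vzeroupper needed before this call */
  bar(); /* vzeroupper would still be emitted here, conservatively */
}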

Any thoughts?
- Gao.

Great idea. I reported this problem before and am glad to see someone trying to tackle it.

cheers.

Great! Glad to see you are working on this.

This is essentially equivalent to "don't insert vzeroupper anywhere", as
far as I can tell. (The case of SSE instructions without a v-prefixed
equivalent is rare enough that we can separate it from this discussion.)

The reason we need vzeroupper in the first place is because we can't assume
other functions won't use legacy SSE instructions; for example, on most
systems, calling sin() will use legacy SSE instructions. I mean, if you
can make some unusual guarantee about your platform, it might make sense to
disable vzeroupper generation in general, but it simply doesn't make sense
on most platforms.
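
To illustrate the scenario (a sketch; it assumes sin() comes from a system libm built without AVX, as described above):

#include <math.h>

double sum_then_sin(const double *a, int n) {
  double s = 0.0;
  for (int i = 0; i < n; ++i) /* may be auto-vectorized with YMM registers */
    s += a[i];
  /* If the loop left the YMM upper halves dirty and no vzeroupper was
     emitted, the legacy SSE code inside sin() pays a transition penalty. */
  return sin(s);
}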

If you want a mechanism to disable vzeroupper generation for particular
function calls, that might make sense...

-Eli

Hi Manny,
Thanks! You said that you reported this problem before; do you know whether there is an
existing LLVM bugzilla ticket for this issue?
- Gao.


Hi Gao,

Eli is right. In many cases the OS is not compiled with AVX support but the application is. In other words, AVX code calling non-AVX code is very common.

Thanks,
Nadav

Hi Eli,

Thanks for the feedback. Please see below.
- Gao.

From: Eli Friedman [mailto:eli.friedman@gmail.com]
Sent: Thursday, September 19, 2013 12:31 PM
To: Gao, Yunzhong
Cc: llvmdev@cs.uiuc.edu
Subject: Re: [LLVMdev] Proposal to improve vzeroupper optimization strategy

> This is essentially equivalent to "don't insert vzeroupper anywhere", as
> far as I can tell. (The case of SSE instructions without a v-prefixed
> equivalent is rare enough that we can separate it from this discussion.)

So will you be interested in a patch that disables vzeroupper by default?

A patch which adds a switch/LLVM IR function attribute to disable
vzeroupper would be fine. A patch that disables vzeroupper on your
platform would be fine (assuming the target triple is distinguishable).
Turning off vzeroupper by default on all platforms is not fine.

I implemented this possibly over-engineered solution in our local tree to work
around some bad instruction selection issues in the LLVM backend. When
benchmarking our game code, I noticed that sometimes legacy SSE instructions
were selected despite the existence of AVX equivalents, in which case the
vzeroupper instruction was needed. And it is much easier to detect the
existence of a vzeroupper instruction than to detect each individual legacy
SSE instruction.

The instruction selection issues were later fixed in our tree (patches to be
submitted later), at least for the handful of games I tested. So a simple
change to just disable vzeroupper by default would be acceptable to us as
well.

> The reason we need vzeroupper in the first place is because we can't assume
> other functions won't use legacy SSE instructions; for example, on most
> systems, calling sin() will use legacy SSE instructions. I mean, if you can
> make some unusual guarantee about your platform, it might make sense to
> disable vzeroupper generation in general, but it simply doesn't make sense
> on most platforms.

I am confused by this point. By "most systems," do you have in mind a platform
where the sin() function was compiled by gcc but the application code was
compiled by clang?

On OS X, for example, AVX is not enabled by default, so the sin() function
uses legacy SSE instructions. Users can still turn on AVX in their
applications.

-Eli

Is it realistic to worry about performance of vectorized code that does PIC calls into a non-vectorized sin() in libc? Maybe there’s an example other than sin() that is more realistic?

– Sean Silva

Hey Sean,

> Is it realistic to worry about performance of vectorized code that does PIC
> calls into a non-vectorized sin() in libc? Maybe there's an example other
> than sin() that is more realistic?
>
> -- Sean Silva


On our systems, there are several libraries that are not compiled for a
particular target by default. The reason is that we support many targets and
choose the lowest common denominator for packaging reasons. Also, we have no
control over how user libraries are compiled (compiler or target).

Let's also note that the offending legacy SSE call does not need to be found
within vectorized code. It just has to occur after vectorized code to incur
the transition penalty. For example:

void kung() {
  ... vectorized VEX.256 code ...

  ... lots of scalar VEX.128 code ...

  while(x < y) {
    ... vzeroupper ...
    ... call to legacy SSE function ...
    x++;
  }
}

On a side note, in such a situation we found it most profitable to hoist the
vzeroupper out of the loop, so that it is executed only once, as sketched
below. Going even further, we sacrifice a good amount of compile time to find
near-optimal vzeroupper placement, with a noticeable impact on performance.
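
For illustration, the hoisted placement would look like this (same pseudocode style as the example above):

void kung() {
  ... vectorized VEX.256 code ...

  ... lots of scalar VEX.128 code ...

  ... vzeroupper ...  /* hoisted: executed once rather than per iteration */
  while(x < y) {
    ... call to legacy SSE function ...
    x++;
  }
}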

Hope that helps,
Cameron