Enabling the vectorizer for -Os

Hi,

I would like to start a discussion about enabling the loop vectorizer by default for -Os. The loop vectorizer can accelerate many workloads and enabling it for -Os and -O2 has obvious performance benefits. At the same time the loop vectorizer can increase the code size for two reasons. First, to vectorize some loops we have to keep the original loop around in order to handle the last few iterations. Second, on x86 and possibly other targets, the encoding of vector instructions takes more space.

The loop vectorizer is already aware of the ‘optsize’ attribute and it does not vectorize loops which require that we keep the scalar tail. It also does not unroll loops when optimizing for size. It is not obvious but there are many cases in which this conservative kind of vectorization is profitable. The loop vectorizer does not try to estimate the encoding size of instructions and this is one reason for code growth.

I measured the effects of vectorization on performance and binary size using -Os. I measured the performance on a Sandybridge and compiled our test suite using -mavx -f(no)-vectorize -Os. As you can see in the attached data there are many workloads that benefit from vectorization. Not as much as vectorizing with -O3, but still a good number of programs. At the same time the code growth is minimal. Most workloads are unaffected and the total code growth for the entire test suite is 0.89%. Almost all of the code growth comes from the TSVC test suite which contains a large number of large vectorizable loops. I did not measure the compile time in this batch but I expect to see an increase in compile time in vectorizable loops because of the time we spend in codegen.

I am interested in hearing more opinions and discussing more measurements by other people.

Nadav


VectorizationOsSize.pdf (66 KB)

VectorizationOsPerf.pdf (59 KB)

I would like to start a discussion about enabling the loop vectorizer by
default for -Os. The loop vectorizer can accelerate many workloads and
enabling it for -Os and -O2 has obvious performance benefits.

Hi Nadav,

As it stands, O2 is very similar to O3, with a few more aggressive
optimizations running on top, including the vectorizers. I think this is a
good rationale: at O3, I expect the compiler to throw all it's got at the
problem, while O2 is somewhat more conservative, and people normally use it
when they want more stability of the code and results (regarding FP,
undefined behaviour, etc). I also use it for finding bugs in the compiler
that are introduced by O3, and making the two levels more similar won't
help that either. I'm yet to see a good reason to enable the vectorizer by
default at O2.

Code size is a different matter, though. I agree that vectorized code can
be as small as (if not smaller than) scalar code and much more efficient,
so there is a clear win in turning it on by default under those
circumstances. But there are catches that we need to make sure are well
understood before we do so.

First, to vectorize some loops we have to keep the original loop around in
order to handle the last few iterations.

Or the runtime condition under which the loop can be vectorized may not
hold, in which case you have to run the original.

Second, on x86 and possibly other targets, the encoding of vector
instructions takes more space.

This may be a problem, and maybe the solution is to build a "SizeCostTable"
and do the same as we did for the CostTable. Most of the targets would just
return 1, but some should override and guess.

However, on ARM, NEON and VFP instructions are 32-bit (either a word or
two half-words), but Thumb instructions can be 16-bit or 32-bit. So you
don't just have to model how big the vector instructions will be, but also
how big the scalar instructions would be, and not all Thumb instructions
are the same size, which makes matters much harder.

In that sense, possibly the SizeCostTable would have to default to 2
(half-words) for most targets, and *also* manipulate scalar code, not just
vector, in a special way.

I measured the effects of vectorization on performance and binary size
using -Os. I measured the performance on a Sandybridge and compiled our
test suite using -mavx -f(no)-vectorize -Os. As you can see in the
attached data there are many workloads that benefit from vectorization.
Not as much as vectorizing with -O3, but still a good number of programs.
At the same time the code growth is minimal.

Would be good to get performance improvements *and* size increase
side-by-side in Perf.

Also, our test-suite is famous for being noisy, so I'd run it at least 20x
each and compare the averages (keeping an eye on the std.dev) to check
whether the results are meaningful.

Again, would be good to have that kind of analysis in Perf, and only warn
if the increase/decrease is statistically meaningful.

Most workloads are unaffected and the total code growth for the entire test
suite is 0.89%. Almost all of the code growth comes from the TSVC test
suite which contains a large number of large vectorizable loops. I did not
measure the compile time in this batch but I expect to see an increase in
compile time in vectorizable loops because of the time we spend in codegen.

I was expecting small growth because of how conservative our vectorizer is.
Less than 1% is acceptable, in my view. For ultimate code size, users
should use -Oz, which should never have any vectorizer enabled by default
anyway.

A few considerations on embedded systems:

* A 66% increase in size on an embedded system is not cheap. But LLVM
hasn't been focusing on that use case so far, and we still have -Oz, which
does a pretty good job at compressing code (compared to -O3), so even if we
do have existing embedded users shaving off bytes, the change in their
build system would be minimal.
* Most embedded chips have no vector units, at most single-precision FP
units or the like, so vectorization isn't going to be a hit for those
architectures anyway.

So, in a nutshell, I agree that -Os could have the vectorizer enabled by
default, but I'm yet to see a good reason to do that on -O2.

cheers,
--renato

I would like to start a discussion about enabling the loop vectorizer by default for -Os. The loop vectorizer can accelerate many workloads and enabling it for -Os and -O2 has obvious performance benefits.

Hi Nadav,

As it stands, O2 is very similar to O3, with a few more aggressive optimizations running on top, including the vectorizers. I think this is a good rationale: at O3, I expect the compiler to throw all it’s got at the problem, while O2 is somewhat more conservative, and people normally use it when they want more stability of the code and results (regarding FP, undefined behaviour, etc). I also use it for finding bugs in the compiler that are introduced by O3, and making the two levels more similar won’t help that either. I’m yet to see a good reason to enable the vectorizer by default at O2.

Just to note that I think a lot of people used to the switches from gcc may be coming in with different “historical expectations”. At least recently (at least the past 5 years), O2 has in practice been “optimizations that are straightforward enough that they do achieve speed-ups” while O3 tends to be “more aggressive optimizations which potentially could cause speed-ups, but don’t understand the context/trade-offs well enough, so they often don’t result in a speed-up”. (I’ve very rarely had O3 optimization, rather than some program-specific subset of the options, achieve any non-noise-level speed-up over O2 with gcc/g++.) I know it’s been said that llvm/clang should aim for “validated” O2/O3 settings that actually do result in better performance, but then I imagine so did gcc… From what I’ve been seeing I haven’t seen any instability of code or results from using the vectorizer. (Mind you, I deliberately try to write code that avoids letting chips with “80-bit intermediate floating point values” use them, precisely because they can make things more vulnerable to minor compilation changes.)

Under that view, if the LLVM vectorizer was well enough understood I would think it would be good to include at O2. However, I suspect that the effects from having effectively two versions of each loop around are probably conflicting enough that it’s a better decision to make O3 be the level at which it is blanket enabled.

Cheers,

Dave

(I've very rarely had O3 optimization, rather than some program-specific
subset of the options, achieve any non-noise-level speed-up over O2 with
gcc/g++.)

Hi David,

You surely remember this:

http://plasma.cs.umass.edu/emery/stabilizer

"We find that, while -O2 has a significant impact relative to -O1, the
performance impact of -O3 over -O2 optimizations is indistinguishable from
random noise."

Under that view, if the LLVM vectorizer was well enough understood I would
think it would be good to include at O2. However, I suspect that the
effects from having effectively two versions of each loop around are
probably conflicting enough that it's a better decision to make O3 be the
level at which it is blanket enabled.

My view of O3 is that it *only* regards how aggressively you want to
optimize your code. Some special cases are proven to run faster on O3,
mostly benchmark improvements that feed compiler engineers, and on those
grounds O3 can be noticeable if you're allowed to be more aggressive than
usual. This is why I mentioned FP-safety, undefined behaviour,
vectorization, etc.

I don't expect O3 results to be faster than O2 results on average, but on
specific cases where you know that the potential disaster is acceptable,
it should be fine to assume O3. Most people, though, use O3 (or O9!) in
the expectation that it will always be better. It not being worse than O2
doesn't help, either. :wink:

I don't think it's *wrong* to put auto-vec on O2, I just don't think
that's its place, that's all. The potential to change results is there.

cheers,
--renato

(I’ve very rarely had O3 optimization, rather than some program-specific subset of the options, achieve any non-noise-level speed-up over O2 with gcc/g++.)

[snip]

> “We find that, while -O2 has a significant impact relative to -O1, the performance impact of -O3 over -O2 optimizations is indistinguishable from random noise.”

That’s something I remember well, but there’s an obvious question lurking in there: is this because the transformations that apply at O3, while they count as “aggressive”, never actually transform to faster code, or are they things which are capable of optimizing when used in the right places, but we don’t do well at deciding where that is? I don’t have any actual evidence, but I’m inclined towards thinking it’s more likely the second (and occasionally, having looked at gcc assembly, it can be seen to have done things like loop unrolling in the most unlikely-to-be-profitable places). So, to simplify a lot, the difference between O2 and O3 (at least on gcc) might well be the difference between “guaranteed wins only” and “add some transforms whose optimization effects we don’t predict well”. At least from some mailing lists I’ve read, other people share that view of the optimization flags in practice, rather than one of aggressiveness or stability. Maybe they shouldn’t have this “interpretation” in LLVM/clang; I’m just pointing out what some people might expect from previous experience.

Under that view, if the LLVM vectorizer was well enough understood I would think it would be good to include at O2. However, I suspect that the effects from having effectively two versions of each loop around are probably conflicting enough that it’s a better decision to make O3 be the level at which it is blanket enabled.

> My view of O3 is that it only regards how aggressive you want to optimize your code. Some special cases are proven to run faster on O3, mostly benchmark improvements that feed compiler engineers, and on those grounds, O3 can be noticeable if you’re allowed to be more aggressive than usual. This is why I mentioned FP-safety, undefined behaviour, vectorization, etc.

Again, I can see this as a logical position; I’ve just never actually encountered differences in FP-safety or undefined behaviour between O2 and O3. Likewise I haven’t really seen any instability or undefined behaviour from the vectorizer. (Sorry if I’m sounding a bit pedantic; I’ve been doing a lot of performance testing/exploration recently, so I’ve been knee-deep in the difference between “I’m sure it must be the case that…” expectations and what experimentation reveals is actually happening.)

> I don’t expect O3 results to be faster than O2 results on average, but on specific cases where you know that the potential disaster is acceptable, it should be fine to assume O3. Most people, though, use O3 (or O9!) in the expectation that it will always be better. It not being worse than O2 doesn’t help, either. :wink:

Again, my experience is that I haven’t seen any “semantic” disasters from O3, just that mostly it doesn’t help much: sometimes it speeds execution up relative to O2, sometimes it slows execution down relative to O2, and it certainly increases compile time. It sounds like you’ve had a wilder ride than me and seen more cases where O3 has actually changed observable behaviour.

> I don’t think it’s wrong to put auto-vec on O2, I just don’t think that’s its place, that’s all. The potential to change results is there.

This is what I’d like to know about: what specific potential to change results have you seen in the vectorizer?

Cheers,

Dave

No changes, just conceptual. AFAIK, the difference between the passes on O2
and O3 is minimal (looking at the code where this is chosen) and they
don't seem to be particularly amazing to warrant their special place in On
land.

If the argument for having auto-vec on O2 is that O3 makes no difference,
then why have O3 in the first place? Why not make O3 an alias to O2 and
solve all problems?

cheers,
--renato

> No changes, just conceptual. AFAIK, the difference between the passes on O2 and O3 is minimal (looking at the code where this is chosen) and they don’t seem to be particularly amazing to warrant their special place in On land.

> If the argument for having auto-vec on O2 is that O3 makes no difference, then why have O3 in the first place? Why not make O3 an alias to O2 and solve all problems?

I think I’m managing to express myself unclearly again. For me, the practical definition of “O2” is “do transformations which are pretty much guaranteed to actually be optimizations” rather than “do all optimizations which don’t carry a risk of disaster”. In which case, the argument for or against vectorizing at O2 is whether it’s “pretty much guaranteed to actually be an optimization” or not, rather than whether it’s an aggressive optimization or not. I would say the argument for auto-vec on O2 isn’t that O3 makes no difference; it’s whether the intrinsic properties of the auto-vec pass fit the criteria one uses for enabling passes at O2. I think you were suggesting that “aggressive” transforms don’t belong in O2 and auto-vec is “aggressive”, while I tend to think of simplicity/performance-reliability as the criteria for O2, and it’s unclear if auto-vec fits that.

Cheers,

Dave

Neat, I like this conservative approach to vectorization. It seems like if
it's good enough for -Os it should be good enough for -O2. I thought the
main objections against vectorization at -O2 centered around code bloat and
regressions of hot but short loops. If these heuristics address those
concerns and compile time doesn't suffer too much, it seems reasonable to
enable at -O2.

My poorly informed 2 cents.

Hi,

Thanks for the feedback. I think that we agree that vectorization on -Os can benefit many programs. Regarding -O2 vs -O3, maybe we should set a higher cost threshold for O2 to increase the likelihood of improving the performance? We have very few regressions on -O3 as is, and with better cost models I believe that we can bring them close to zero, so I am not sure if it can help that much. Renato, I prefer not to estimate the encoding size of instructions. We know that vector instructions take more space to encode. Will knowing the exact number help us in making a better decision? I don’t think so. On modern processors when running vectorizable loops, the code size of the vector instructions is almost never the bottleneck.

Thanks,
Nadav

Hi,

Thanks for the feedback. I think that we agree that vectorization on -Os
can benefit many programs. Regarding -O2 vs -O3, maybe we should set a
higher cost threshold for O2 to increase the likelihood of improving the
performance? We have very few regressions on -O3 as is and with better
cost models I believe that we can bring them close to zero, so I am not
sure if it can help that much. Renato, I prefer not to estimate the
encoding size of instructions. We know that vector instructions take more
space to encode. Will knowing the exact number help us in making a better
decision? I don’t think so. On modern processors when running vectorizable
loops, the code size of the vector instructions is almost never the
bottleneck.

You're talking about -Os, where the user has explicitly asked the
compiler to optimize the code size. Saying that the code size isn't a
speed bottleneck seems to miss the point.

I'm not sure that's a fair characterization. In Xcode, for example, -Os is the default setting.
My understanding is that -Os is intended to be optimized-without-sacrificing-code-size. -Oz is where we're being explicitly mandated to prefer code size over all else.

--Owen

Hi,

Thanks for the feedback. I think that we agree that vectorization on -Os
can benefit many programs. Regarding -O2 vs -O3, maybe we should set a
higher cost threshold for O2 to increase the likelihood of improving the
performance? We have very few regressions on -O3 as is and with better
cost models I believe that we can bring them close to zero, so I am not
sure if it can help that much. Renato, I prefer not to estimate the
encoding size of instructions. We know that vector instructions take more
space to encode. Will knowing the exact number help us in making a better
decision? I don't think so. On modern processors when running vectorizable
loops, the code size of the vector instructions is almost never the
bottleneck.

You're talking about -Os, where the user has explicitly asked the
compiler to optimize the code size. Saying that the code size isn't a
speed bottleneck seems to miss the point.

Just to check: reading Nadav's original paragraph, he appears to be talking
about O2 at this point, where the user (in my understanding) only cares
about size indirectly, in terms of whether it affects performance. Now,
having said that, I don't actually have a feeling for whether vectorizable
code size affects performance noticeably or not. My suspicion is that in
C-family-like languages there's so much other faffing around in the
instructions that any change is probably lost in the noise. However, for
LLVM IR generated directly it might be noticeable; I really don't know.

But if it's a "performance reliable" optimization (as it seems to be) then I
think there's a good case for putting vectorization into the O2 opts.

Cheers,
Dave

Hi,

Thanks for the feedback. I think that we agree that vectorization on -Os
can benefit many programs.

FWIW, I don't yet agree.

Your tables show many programs growing in code size by over 20%. While
there are associated performance improvements, it isn't clear that this is
a good tradeoff. Historically, optimizations which win as a direct result
of growing code size have *not* been an acceptable tradeoff in -Os.

From Owen's email, a characterization I agree with:

"My understanding is that -Os is intended to be
optimized-without-sacrificing-code-size."

The way I would phrase the difference between -Os and -Oz is similar: with
-Os we don't *grow* the code size significantly even if it gives
significant performance gains, whereas with -Oz we *shrink* the code size
even if it means significant performance loss.

Neither of these concepts for -Os would seem to argue for running the
vectorizer given the numbers you posted.

Regarding -O2 vs -O3, maybe we should set a higher cost threshold for O2
to increase the likelihood of improving the performance ? We have very few
regressions on -O3 as is and with better cost models I believe that we can
bring them close to zero, so I am not sure if it can help that much.
Renato, I prefer not to estimate the encoding size of instructions. We know
that vector instructions take more space to encode. Will knowing the exact
number help us in making a better decision ? I don’t think so. On modern
processors when running vectorizable loops, the code size of the vector
instructions is almost never the bottleneck.

That has specifically not been my experience when dealing with
significantly larger and more complex application benchmarks.

The tradeoffs you show in your numbers for -Os are actually exactly what I
would expect for -O2: a willingness to grow code size (and compilation
time) in order to get performance improvements. A quick eye-balling of the
two tables seemed to show most of the size growth had associated
performance growth. This, to me, is a good early indicator that the mode
the vectorizer is running in for your -Os numbers is what we should look
at enabling for -O2.

That said, I would like to see benchmarks from a more diverse set of
applications than the nightly test suite. ;] I don't have a lot of faith in
it being representative. I'm willing to contribute some that I care about
(given enough time to collect the data), but I'd really like for other
folks with larger codebases and applications to measure code size and
performance artifacts as well.

In order to do this, and to ensure we are all measuring the same thing, I
think it would be useful to have flag sets in Clang that correspond to the
various modes you are proposing. I think they are:

1) -Os + minimal-vectorize (no unrolling, etc)
2) -O2 + minimal-vectorize
3) -O2 + -fvectorize (I think? maybe you have a more specific flag here?)

Does that make sense to you and others?
-Chandler

PS: As a side note, I would personally really like to revisit my proposal
to write down what we mean for each optimization level as precisely as we
can (acknowledging that this is not very precise; it will always be a
judgement call). I think it would help these discussions stay on track.

PS: As a side note, I would personally really like to revisit my proposal to write down what we mean for each optimization level as precisely as we can (acknowledging that this is not very precise; it will always be a judgement call). I think it would help these discussions stay on track.

I think that would be good; I had the points in the previous email discussion you initiated in mind when interpreting O2 vs O3. The one big thing that I think isn’t borne in mind quite enough is that we need not only a fairly clear idea of what ought to be at each level, but when assigning things we need to run “tests” (even throwaway, informal ones) to verify that actual real-world behaviour does match the intuition. For example, I don’t think the “intent” of gcc’s levels is actually that different from what you proposed; the issue is that the behaviour of the actual flags assigned doesn’t match the intent. (I still never cease to be amazed when I write a simple “pseudo-benchmark” to generate some numbers to calibrate a performance issue, run it on an actual machine, and the actual performance is reversed from what I’d expect.)

Cheers,

Dave

+1, having (even a vague) agreed definition of what the different optimization levels mean would be good.

+1 too. Most of the discussion on this thread is related to the
uncertainties in that definition.

cheers,
--renato

Will knowing the exact number help us in making a better decision? I
don’t think so. On modern processors when running vectorizable loops, the
code size of the vector instructions is almost never the bottleneck.

I'd make a slightly different point: being able to estimate the number of
UOPs will make a big difference if it allows you to fit your loop in the
loop stream detector.

So I'd agree that estimating x86 encoded code size doesn't matter that much
for performance (though I$ pressure is a big issue for many codebases, but
I assume you're talking about tight vectorizable kernels), but estimating
UOPs does matter a great deal.

Hi Chandler,

I am glad that you mentioned it. There are only three benchmarks that grew by over 1%, and only one benchmark that grew by over 20%: the TSVC workloads. TSVC is an HPC benchmark and it is irrelevant for the -Os/-O2 discussion. If you ignore TSVC you will notice that the code growth due to vectorization is 0.01%.

0.01% code growth for everything except TSVC sounds pretty good to me. I would be willing to accept 0.01% code growth to gain 2% on gzip and 9% on RC4.

I am constantly benchmarking the compiler and I am aware of a small number of regressions on -O3 using the vectorizer. If you have a different experience then please share your numbers.

I am looking forward to seeing your contributions to the nightly test suite. I would also like to see other people benchmark their applications.

Thanks,
Nadav

The test-suite is not a good representation of the general case, but I
don't think we can reason based on unknown results.

Chandler, can you enable the vectorizer on the examples you gave and
produce a simple size vs. performance comparison?

cheers,
--renato

These look like really awesome results :slight_smile:

I am using clang/LLVM to JIT some code, and intuitively our workloads should benefit a lot from vectorization. Is there a way to apply this optimization to JIT-generated code?

Regards,
– Priyendra