[RFC] Non-Temporal hints from Loop Vectorizer

Hello all,

I've been wondering why Clang doesn't generate non-temporal stores when
compiling the STREAM benchmark [1] and therefore doesn't yield optimal
results.

It turned out that the Loop Vectorizer correctly vectorizes the arithmetic
operations and also merges the loads and stores into vector operations.
However, it doesn't add the '!nontemporal' metadata that would be needed for
maximal bandwidth on X86.
I briefly looked into this: for non-temporal memory instructions to work, the
memory addresses would also have to be aligned to the vector width, which
currently isn't the case either.

To summarize, the following things would be needed to give non-temporal
hints:
1) Ensure correct alignment of merged vector memory instructions
This could be implemented by executing the first (scalar) loop iterations
until the addresses for loads and stores are aligned, similar to what already
happens for the remainder of the loop. The larger alignment would also allow
aligned vector instructions instead of the currently unaligned ones.

2) Give non-temporal hints when different array elements are only used once
per loop iteration
We probably need to analyze the different loads and stores per loop iteration
for this (a rough sketch of the resulting code follows below).
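
For concreteness, here is a rough, hand-written sketch of what the vectorizer
would effectively have to produce for the triad kernel (untested; AVX
intrinsics, and the function name and the peel/fence placement are only meant
as illustration):

/* Sketch of a non-temporal STREAM triad, a[j] = b[j] + scalar*c[j].
 * In IR, the vector store would instead carry !nontemporal !{i32 1}
 * metadata and be lowered by the X86 backend. */
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

void triad_nt(double *restrict a, const double *restrict b,
              const double *restrict c, double scalar, size_t n)
{
    size_t j = 0;

    /* 1) Scalar prologue: peel iterations until the store address is
     *    32-byte aligned, as _mm256_stream_pd (MOVNTPD) requires. */
    while (j < n && ((uintptr_t)&a[j] & 31) != 0) {
        a[j] = b[j] + scalar * c[j];
        ++j;
    }

    /* 2) Vector body with non-temporal stores; the loads stay unaligned
     *    because only the store address was aligned by the prologue. */
    __m256d vs = _mm256_set1_pd(scalar);
    for (; j + 4 <= n; j += 4) {
        __m256d vb = _mm256_loadu_pd(&b[j]);
        __m256d vc = _mm256_loadu_pd(&c[j]);
        _mm256_stream_pd(&a[j], _mm256_add_pd(vb, _mm256_mul_pd(vs, vc)));
    }
    _mm_sfence(); /* order the weakly-ordered NT stores before later accesses */

    /* 3) Scalar remainder, as the vectorizer already emits today. */
    for (; j < n; ++j)
        a[j] = b[j] + scalar * c[j];
}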

Any thoughts or any ongoing work that I'm missing?

Thanks,
Jonas

[1] https://www.cs.virginia.edu/stream/

You probably also want to ensure that you stay in the loop long enough, i.e. have some sort of a dynamic trip-count check or PGO data indicating this.

You essentially want to ensure that reads after the loop would not be hitting in the cache even with regular stores. (If you are writing a large area in the loop, a large percentage of those lines are already evicted by the time you exit the loop.)
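
A minimal sketch of the kind of guard I mean (the threshold value and both callee names are hypothetical, purely for illustration; a real cut-off would come from a cache model or profile data):

#include <stddef.h>

/* Hypothetical cut-off: only take the non-temporal path when the data
 * written clearly exceeds the last-level cache (~8 MiB assumed here). */
#define NT_STORE_THRESHOLD_BYTES ((size_t)8 << 20)

void triad_nt(double *restrict, const double *restrict,
              const double *restrict, double, size_t);    /* NT version from above  */
void triad_plain(double *restrict, const double *restrict,
                 const double *restrict, double, size_t); /* ordinary vectorized one */

void triad_dispatch(double *restrict a, const double *restrict b,
                    const double *restrict c, double scalar, size_t n)
{
    /* Dynamic trip-count check: fall back to regular stores for small n. */
    if (n * sizeof(double) >= NT_STORE_THRESHOLD_BYTES)
        triad_nt(a, b, c, scalar, n);
    else
        triad_plain(a, b, c, scalar, n);
}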

Adam

From: "Adam Nemet via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Jonas Hahnfeld" <Hahnfeld@itc.rwth-aachen.de>
Cc: "llvm-dev (llvm-dev@lists.llvm.org)" <llvm-dev@lists.llvm.org>
Sent: Tuesday, May 3, 2016 12:21:07 PM
Subject: Re: [llvm-dev] [RFC] Non-Temporal hints from Loop Vectorizer

> You probably also want to ensure that you stay in the loop long enough, i.e. have some sort of a dynamic trip-count check or PGO data indicating this.

This sounds right. Also, I'll point out that LLVM essentially does not have a memory-hierarchy model based on which such decisions could be made. Work in this area would be welcome.

-Hal

Non-temporal hints are extremely dangerous to use in practice IMO; they hurt far more than they help in real programs. Compilers really should not be using them without the programmer’s knowledge, even if it helps some microbenchmark, because in real programs, the only one who really knows the memory access patterns is the programmer. Unless you can be certain the output of a function or loop is literally bigger than the entire cache and won’t be used again for a long time, forcibly evicting it from cache tends to be a very costly mistake.

—escha

Also, MichaelZ introduced builtins last year to manually force the generation of non-temporal loads and stores: __builtin_nontemporal_load/store. I believe these are documented.
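
For reference, a minimal (untested) example using those builtins; they attach the '!nontemporal' metadata, and whether the backend then emits actual non-temporal instructions depends on the target and on vectorization (non-temporal loads in particular only map to real hardware support in narrow cases):

/* clang -O2: the stores below should lower to MOVNT* instructions on x86
 * once the loop is vectorized. */
void nt_copy(double *restrict dst, const double *restrict src, long n)
{
    for (long i = 0; i < n; ++i) {
        double v = __builtin_nontemporal_load(&src[i]);
        __builtin_nontemporal_store(v, &dst[i]);
    }
}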

Steve Canon is on vacation, so I'm going to quote, word for word, his take on the compiler autogenerating non-temporal hints:

"nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope nope n” — Steve Canon

—escha

Agreed with the other replies on this thread. I'd also suggest looking at my RFC, which I still have to implement:

https://groups.google.com/d/topic/llvm-dev/ZJ8SVCJPpcc/discussion

Expanding on my initial reaction:

x86 NT stores do have a valid use case, but it is limited, and a compiler absolutely should not be generating them without either (a) an explicit request from the programmer, or (b) a reasonably precise memory model that includes the caches of the explicitly targeted machine, because they are only beneficial if you can prove that the address being stored to is not in cache, and would not be present in cache at next use even if a normal store were used.

For very simple synthetic benchmarks with large loop trip counts (and by “large” I really mean “staggeringly huge”, given modern L3 sizes), it’s possible to statically know that these criteria are satisfied without a full cache model, but those cases are pretty limited and not obviously beneficial outside of benchmarks.

Weighed against this small benefit, there are two big things that can go wrong with NT stores, which make me very wary of auto-generating them:

- they break some of the usual ordering guarantees (see the sketch below).
- the performance is catastrophically bad (significantly worse than normal stores) when the address is present in cache.
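
To make the first point concrete, a small hypothetical producer sketch: NT stores are weakly ordered with respect to ordinary stores, so you need an explicit sfence before publishing a flag, or another core that sees the flag may still observe stale data:

#include <immintrin.h>
#include <stdatomic.h>

/* Hypothetical producer: fill a (32-byte aligned) buffer with an NT store,
 * then publish it to other threads. */
void publish(double *buf, __m256d v, _Atomic int *ready)
{
    _mm256_stream_pd(buf, v);  /* weakly-ordered non-temporal store */
    _mm_sfence();              /* without this, the release store below does
                                  not order the NT store ahead of the flag */
    atomic_store_explicit(ready, 1, memory_order_release);
}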

My experience is that while there are indeed a few beneficial real-world uses of these instructions, if they aren’t deployed extremely carefully they cause more problems than they solve, and LLVM doesn’t at present have the infrastructure that would be needed to accurately inform their use.

– Steve

That was my conclusion too.

I actually thought about autogenerating NT hints too, but before diving into implementing that I carried out some experiments on benchmarks like SPEC: I tried to model what the compiler would do there by replacing ordinary stores in candidate loops with __builtin_nontemporal_store. I didn't see any gains at the time (I don't remember whether I saw losses, but I can see how we could get them easily), so I dropped the idea.

Michael

> x86 NT stores do have a valid use case, but it is limited, and a compiler absolutely should not be generating them without either (a) an explicit request from the programmer, or (b) a reasonably precise memory model that includes the caches of the explicitly targeted machine, because they are only beneficial if you can prove that the address being stored to is not in cache, and would not be present in cache at next use even if a normal store were used.

and profile information. The loop trip count is important in this space, too; Adam pointed this out earlier, I think.

> Weighed against this small benefit, there are two big things that can go wrong with NT stores, which make me very wary of auto-generating them:
>
> - they break some of the usual ordering guarantees.
> - the performance is catastrophically bad (significantly worse than normal stores) when the address is present in cache.

Agreed. To balance that view, though, the gains can be impressively high (under all the caveats pointed out in this thread), too.

> I actually thought about autogenerating NT hints too, but before diving into implementing that I carried out some experiments on benchmarks like SPEC: I tried to model what the compiler would do there by replacing ordinary stores in candidate loops with __builtin_nontemporal_store. I didn't see any gains at the time (I don't remember whether I saw losses, but I can see how we could get them easily), so I dropped the idea.

There were losses in the ~10% range overall from applying NT stores in some of the hot benchmarks.