RFC: Proposal for a New Pass for Write-Allocate Evasion

We are considering creating an optimization pass to improve memory throughput using the write-allocate evasion technique.

Background

Write-allocate evasion is a technique that omits the operation of reading a cache line from memory for store instructions when writing the entire cache line. [1] demonstrates that Grace can do this automatically, while Sapphire Rapids and Genoa can do it using non-temporal store instructions.

Therefore, we consider it useful to have a feature that recognizes store instructions where write-allocate evasion can be applied and adds non-temporal metadata.

The Intel compiler provides an equivalent feature with the option -qopt-streaming-stores.

Design

Stores with continuous addresses to arrays that do not alias with others in loops can potentially perform write-allocate evasion.

The performance improvement rate can be calculated as follows. Let N be the number of streams of the target stores, M be the number of streams of other stores, and L be the number of streams of loads. The amount of memory transfer is 2*(N+M)+L without write-allocate evasion and N+2M+L with it. Therefore, (2*(N+M)+L)/(N+2M+L) becomes the ideal performance improvement rate. Non-temporal metadata should be added when this value exceeds a threshold.

Since write-allocate evasion may lead to performance degradation due to factors such as the inability to prefetch, it is considered necessary to enable this feature via command-line options or pragmas.

Alternatives

The following interfaces exist for generating non-temporal stores:

  • The OpenMP directive omp simd nontemporal()
  • __builtin_nontemporal_store()

These require users to specify the targets directly, which requires detailed knowledge of write-allocate evasion. Therefore, we believe there is value in a feature that automatically detects stores where write-allocate evasion can be applied.

In addition, the nontemporal directive currently does not actually generate non-temporal stores. (nontemporal instructions not generated for #pragma omp simd nontemporal · Issue #55757 · llvm/llvm-project · GitHub)

References

[1] Microarchitectural comparison and in-core modeling of state-of-the-art CPUs: Grace, Sapphire Rapids, and Genoa

We would appreciate any comments or feedback. Thank you.

Minor warning on non-temporal stores: they interact weirdly with the atomic memory model on x86. So you might need to insert mfence instructions depending on what you’re doing.

In general, I don’t like making an optimization permanently guarded by a compiler flag; if the optimization isn’t reliable enough for everyone to use, it’s probably not reliable enough for anyone to use.

@efriedma-quic Thank you for the advice!

So you might need to insert mfence instructions depending on what you’re doing.

From the description of optimization outside atomic, I interpreted that adding a !nontemporal to a regular store is not problematic because it does not guarantee any concurrency. Is this correct?

In the case of atomic stores, consideration of ordering is necessary, but according to the definition of the store instruction, !nontemporal seem to be ignored if marked as atomic. However, it might be better not to apply it conservatively to loops where atomic instructions exist.

if the optimization isn’t reliable enough for everyone to use, it’s probably not reliable enough for anyone to use.

If the above interpretation regarding the concurrency model is correct, I believe we can trust the correctness of the compilation.
The reason I think control via a flag is necessary is that applying this optimization under conditions other than being memory-bound is likely to degrade performance, and whether a program becomes memory-bound is usually unknown at compile time.

See nontemporal stores behave incorrectly in their interaction with concurrency primitives · Issue #64521 · llvm/llvm-project · GitHub for further discussion of the atomics issue.


Can we use runtime loop versioning to figure out if we become memory-bound? Like, if the trip count is greater than 10000, or something like that? Or do we need to be concerned about memory access patterns outside the loop itself?

If we expect that we only want to apply this transform to specific loops, should this be a loop pragma instead of a flag?