[RFC] pragma for nontemporal load/store

Currently, clang supports emitting nontemporal load/store by built-in functions. A typical use case is like:

#if __has_builtin(__builtin_nontemporal_store)
__builtin_nontemporal_store(func(x[i]), &y[i]);
#else
y[i] = func(x[i]);
#endif

The disadvantage of this approach is that users have to write different code for clang and for other compilers, which hurts readability and conciseness.

I would like to introduce a pragma to control the emission of nontemporal load/store in clang. The syntax is

#pragma clang nontemporal [load|store|load store]

which applies to the next statement or block. Clang will then emit nontemporal loads, stores, or both for any load/store instructions that occur in that statement or block.

This pragma allows users to write cleaner code using non-temporal load/store.

Thanks to Leopold Grinberg for the initial proposal.

@AaronBallman @Artem-B

Can you tell me a bit more about the intended use case for such a pragma where it would provide sufficient benefit over explicit use of builtins? Is there a specific use case that prompted this RFC?

TBH, I’m not convinced that it will be all that useful. In my experience, tinkering with non-temporal stores is a fine-grained tool that should not be applied wholesale; employing it per individual memory operation with a builtin is appropriate.

I’m a bit puzzled by the typical use case above. One does not really need to conditionally preprocess each load/store. A common pattern is to do it once: create a function/macro to do the job, and fall back to a regular load/store if cache control is not available. E.g.

#if PLATFORM_SUPPORTS_CACHE_CONTROL
inline void store_nontemporal(ptr, value) {
  __platform_specific_builtin_to_do_the_magic(ptr, value);
}
#else
inline void store_nontemporal(ptr, value) {
  *ptr = value;
}
#endif

void user_function() {
  store_nontemporal(ptr, value);
}

NVIDIA’s ptxas does have an option to force cache modifier on all generic and global AS loads/stores, but it’s rarely used (at least I’ve only encountered a handful of cases and in most of them explicit source code changes eventually did a better job).

A few things to consider. (Caveat – I’m assessing the proposal mostly from the CUDA/NVIDIA GPU user’s point of view. HIP/AMDGPU considerations may be different.)

There’s more than one kind of cache-related knob to control. Nontemporal access may be the one most commonly available across hardware architectures, but it’s just one of the cache-control modes.
E.g. NVIDIA’s PTX has multiple cache levels, cache operations, and eviction priorities:

ld.global{.cop}.nc{.level::cache_hint}.type                 d, [a]{, cache-policy};
ld.global{.cop}.nc{.level::cache_hint}.vec.type             d, [a]{, cache-policy};

ld.global.nc{.level::eviction_priority}{.level::cache_hint}.type      d, [a]{, cache-policy};
ld.global.nc{.level::eviction_priority}{.level::cache_hint}.vec.type  d, [a]{, cache-policy};

.cop  =                     { .ca, .cg, .cs };     // cache operation
.level::eviction_priority = { .L1::evict_normal, .L1::evict_unchanged,
                              .L1::evict_first, .L1::evict_last, .L1::no_allocate};
.level::cache_hint =        { .L2::cache_hint };
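To illustrate how this finer-grained control is already exposed per operation, CUDA provides cache-hint load/store intrinsics such as `__ldcs`/`__stcs` (streaming, i.e. “cache streaming, likely accessed once”). A pragma offering only a single generic “nontemporal” mode could not express these distinctions. A minimal device-code sketch:

```cuda
// Per-operation cache-hint intrinsics in CUDA (sketch; requires sm_32+).
// __ldcs / __stcs map to ld.global.cs / st.global.cs in PTX.
__global__ void scale(float *out, const float *in, float k, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = __ldcs(&in[i]);  // streaming load: evict-first behavior
    __stcs(&out[i], v * k);    // streaming store
  }
}
```

Related intrinsics (`__ldcg`, `__stcg`, `__stwt`, …) select other cache operations from the table above, which is exactly the kind of per-operation choice a one-size-fits-all pragma would flatten.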

Providing generic “nontemporal” access would be neither here nor there – it would provide a kind of nontemporal access, but one would still need explicit builtins for more fine-grained control. What kind of cache control is available is target-dependent, so it’s probably not well suited to being specified via a pragma. Target-specific builtins will still be the main tool for this kind of job.

Applying the pragma to a single statement does not make a whole lot of sense to me.
For starters, it would just add noise – one could use an appropriate function instead and end up with more concise code.

If the pragma is applied to a function call and the function is inlined, should the pragma apply to all loads/stores done by that function? Should it apply to stack or other temporary loads/stores? Probably not.

Then there’s the question of which loads/stores the pragma applies to, address-space-wise. Not all address spaces provide cacheability-control knobs. AFAICT, on NVIDIA GPUs only .global memory has them, though one can use them through generic pointers if the user guarantees that those pointers do point to global memory. When code operates on a pointer in the generic AS (the default for CUDA), the compiler may not know which AS it is actually touching, so the pragma would have to be conservative and skip many operations that the user could have made nontemporal explicitly, since the user may be in a position to guarantee the pointers refer to the right kind of memory. So the effectiveness of such a pragma would be somewhat limited in practice. Considering that this kind of data-flow micromanagement is usually employed in the hot path, explicit use of builtins would be a better choice.

In short, I can see how such a pragma could provide some benefit in some cases, but its overall utility looks questionable to me. IMO a pragma is too blunt a tool for this job.


I like the function approach – it’s much more generic, and it doesn’t really add user code.

The proposed pragma is no more portable than the builtin; people will still have to write different code for clang and for other compilers. Also, there’s no commonly implemented __has_pragma facility to test whether a pragma is supported, unlike __has_builtin for builtins. So I’m not certain the pragma is an improvement over what’s already supported in some regards.