How to Generate Non-Coherent Loads on a GPU Backend?

Hello. I am currently developing an MLIR-based codegen for a GPU backend. My goal is to generate PTX code akin to “ld.global.nc”, but the current conversion from “memref.load” to “llvm.load” only seems to yield plain “ld.global” code.

Could anyone advise me on the best approach to accomplish this?

First you need to figure out what the LLVM API to achieve this is: is there a first-class IR construct to express it? Is there an intrinsic? Or do you need inline PTX asm?
Then we move up to the NVVM/LLVM dialects in MLIR and look into how to express it: is there something that could generalize in the LLVM dialect (doubtful if you didn’t find a first-class solution in LLVM)? Should we add an operation to the NVVM dialect? How does the problem of memory coherency generalize beyond this particular load?
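For reference, on the CUDA C++ side the first-class construct is the `__ldg()` intrinsic (available on sm_35 and later), which nvcc lowers to `ld.global.nc`. A minimal sketch (the kernel name and shapes here are invented for illustration):

```cuda
// __ldg requests a non-coherent (read-only cache) load from global memory.
__global__ void scale(float *out, const float *__restrict__ in, float k, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = k * __ldg(&in[i]);  // typically lowered to ld.global.nc.f32
}
```

This suggests looking at how clang/nvcc lower `__ldg` and `__restrict__` to LLVM IR as a starting point for what the MLIR pipeline would need to emit.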

Thanks for your reply.
I found a .ll test for ld.global.nc codegen: llvm/test/CodeGen/NVPTX/ldg-invariant.ll
It seems it is possible to generate this code properly through LLVM. I will try to look into it later.
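The pattern in that test can be boiled down to a small sketch: a load from the global address space (addrspace(1)) tagged with `!invariant.load` metadata, which lets the NVPTX backend select `ld.global.nc`:

```llvm
; Sketch after the pattern in ldg-invariant.ll.
define float @ld_nc(ptr addrspace(1) %p) {
  %v = load float, ptr addrspace(1) %p, !invariant.load !0
  ret float %v
}
!0 = !{}
```

Note both conditions matter: without the global address space or without the metadata, the backend falls back to a plain `ld.global`.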

Also, you may want to check whether ld.global.nc actually buys you anything performance-wise. We did some benchmarking of ld.global vs ld.global.nc vs ld.const on some workloads, and the conclusion was that, on sm_75, there was no observable difference between ld.global / LDG.E.SYS and ld.global.nc / LDG.E.CONSTANT.SYS. Furthermore, it was a significant pessimization to place the right-hand matrix (n128k128) in constant memory via a __constant__ variable and then load from it with the ld.const PTX / LDC SASS instruction.

IIRC, in the past ld.global.nc used to benefit from a separate data path and the read-only texture cache. These days the data path is unified. E.g. the NVIDIA Ampere GPU Architecture Tuning Guide says:

Like Volta, the NVIDIA Ampere GPU architecture combines the functionality of the L1 and texture caches into a unified L1/Texture cache which acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp. Another benefit of its union with shared memory, similar to Volta L1 is improvement in terms of both latency and bandwidth.

The PTX docs also say that it may be beneficial on some architectures, but do not specify which ones: PTX ISA 8.3


Really appreciate your response. In my observations, in some memory-bound cases (I guess your benchmark was mainly compute-bound?), nc mode can achieve slightly better performance (about 3–5% on V100/A100). nvcc decides automatically whether to use nc; the current mlir->nvvm->ptx pipeline may lose this potential improvement.

I find a .ll code for ld.global.nc codegen: llvm/test/CodeGen/NVPTX/ldg-invariant.ll

!invariant.load seems like an option, but I’m not sure whether MLIR currently generates it. If not, feel free to add support for it. With !invariant.load you can control each load separately, at a finer granularity.

Alternatively, you can add the llvm.noalias and llvm.readonly attributes to the function parameter for which you want non-coherent loads. In that case, all loads through that parameter become non-coherent. (FWIW, llvm.noalias alone might yield more performance than the non-coherent cache.)

See an example below:
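(The original example appears to have been lost; the following is a hypothetical reconstruction, assuming the standard memref-to-LLVM lowering forwards these argument attributes to the LLVM function signature as `noalias` / `readonly`.)

```mlir
// Loads through %arg0 can then be selected as ld.global.nc by the
// NVPTX backend; %arg1 only gets the aliasing guarantee.
func.func @kernel(%arg0: memref<1024xf32> {llvm.noalias, llvm.readonly},
                  %arg1: memref<1024xf32> {llvm.noalias}) {
  %c0 = arith.constant 0 : index
  %v = memref.load %arg0[%c0] : memref<1024xf32>
  memref.store %v, %arg1[%c0] : memref<1024xf32>
  return
}
```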

In my observations, in some memory-bound cases(guess your benchmark is mainly compute-bound cases?), nc mode can achieve a slightly better performance(about 3~5%, V100/A100).

My experience is similar to yours. I saw up to 5% improvement, but also slowdowns, so it is not always a win. Here are my earlier benchmarks from the IREE compiler
