Hello. I am currently developing an MLIR-based codegen for a GPU backend. My goal is to generate PTX that uses “ld.global.nc”, but the current conversion from “memref.load” to “llvm.load” only seems to yield plain “ld.global” instructions.
Could anyone advise me on the best approach to accomplish this?
First you need to figure out the LLVM API to achieve this: is there a first-class IR construct to express it? Is there an intrinsic? Or do you need inline PTX asm?
Then we move up to the NVVM/LLVM dialects in MLIR and look into how to express this: is there something that could generalize in the LLVM dialect (doubtful if you didn’t find a first-class solution in LLVM)? Should we add an operation to the NVVM dialect? How does the problem of memory coherency generalize beyond this particular load?
Also, you may want to check whether ld.global.nc actually buys you anything performance-wise. We did some benchmarking of ld.global vs ld.global.nc vs ld.const on some workloads, and the conclusion was: “[on sm_75] I don't see any difference between using ld.global / LDG.E.SYS and ld.global.nc / LDG.E.CONSTANT.SYS. Furthermore, it's a significant pessimization to place the right-hand matrix (n128k128) in constant memory via a __constant__ variable and then load from it with the ld.const PTX / LDC SASS instruction.”
Like Volta, the NVIDIA Ampere GPU architecture combines the functionality of the L1 and texture caches into a unified L1/texture cache, which acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp. Another benefit of its union with shared memory, similar to the Volta L1, is an improvement in both latency and bandwidth.
Really appreciate your response. In my observations, in some memory-bound cases (I guess your benchmark covered mainly compute-bound cases?), nc mode can achieve slightly better performance (about 3–5% on V100/A100). nvcc automatically decides whether to use nc, while the current mlir->nvvm->ptx pipeline may lose this potential improvement.
I found a .ll test case for ld.global.nc codegen: llvm/test/CodeGen/NVPTX/ldg-invariant.ll
!invariant.load seems like an option, but I’m not sure MLIR currently generates it. If not, feel free to add support for it. With !invariant.load you can control each load individually, at finer granularity.
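For reference, here is a minimal LLVM IR sketch of the per-load approach, modeled on the ldg-invariant.ll test above (the function name and types are made up for illustration; addrspace(1) is the NVPTX global address space):

```llvm
; The !invariant.load metadata promises the loaded memory never changes
; while this function executes, which lets the NVPTX backend select
; ld.global.nc for this specific load.
define void @copy_one(ptr addrspace(1) %in, ptr addrspace(1) %out) {
  %v = load float, ptr addrspace(1) %in, !invariant.load !0
  store float %v, ptr addrspace(1) %out
  ret void
}

!0 = !{}
```

You can check the effect by running the IR through llc with an NVPTX target (e.g. -march=nvptx64) and looking for ld.global.nc in the output.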
Alternatively, you can add the llvm.noalias and llvm.readonly attributes to the function parameter where you want non-coherent loads. In this case, all loads from that parameter become non-coherent. (FWIW, llvm.noalias alone might potentially yield more performance than the non-coherent cache.)
See an example below:
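A minimal LLVM IR sketch of the attribute-based approach (function name and types are hypothetical; in the MLIR LLVM dialect these would be the llvm.noalias / llvm.readonly argument attributes, which translate to the noalias / readonly attributes shown here):

```llvm
; noalias + readonly on %in tell the backend the pointed-to data cannot
; be written through any alias for the duration of the call, so every
; load from %in may be lowered to ld.global.nc.
define void @copy_all(ptr addrspace(1) noalias readonly %in,
                      ptr addrspace(1) %out) {
  %v = load float, ptr addrspace(1) %in
  store float %v, ptr addrspace(1) %out
  ret void
}
```

Unlike !invariant.load, this is coarse-grained: it applies to every load through that parameter rather than to a single load instruction.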