How to Generate Non-Coherent Loads on a GPU Backend?

Hello. I am currently developing an MLIR-based codegen for a GPU backend. My goal is to generate PTX code akin to “ld.global.nc”, but the current conversion from “memref.load” to “llvm.load” only seems to yield plain “ld.global” code.

Could anyone advise me on the best approach to accomplish this?

First you need to figure out what the LLVM API to achieve this is: is there a first-class IR construct to express it? Is there an intrinsic? Or do you need inline PTX asm?
Then we move up to the NVVM/LLVM dialects in MLIR and look into how to express it: is there something that could generalize in the LLVM dialect (doubtful if you didn’t find a first-class solution in LLVM)? Should we add an operation to the NVVM dialect? How does the problem of memory coherency generalize beyond this particular load?
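For reference, on the CUDA C++ side the first-class construct is the `__ldg()` intrinsic (available on sm_35 and later), which nvcc lowers to `ld.global.nc`. A minimal sketch (the kernel name and shapes here are invented for illustration):

```cuda
// __ldg requests a non-coherent (read-only cache) load from global memory.
__global__ void scale(float *out, const float *__restrict__ in, float k, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = k * __ldg(&in[i]);  // typically lowered to ld.global.nc.f32
}
```

This suggests looking at how clang/nvcc lower `__ldg` and `__restrict__` to LLVM IR as a starting point for what the MLIR pipeline would need to emit.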

Thanks for your reply.
I found a .ll test for ld.global.nc codegen: llvm/test/CodeGen/NVPTX/ldg-invariant.ll
It seems it is possible to generate this code properly through LLVM. I will try to look into it later.
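The pattern in that test can be boiled down to a small sketch: a load from the global address space (addrspace(1)) tagged with `!invariant.load` metadata, which lets the NVPTX backend select `ld.global.nc`:

```llvm
; Sketch after the pattern in ldg-invariant.ll.
define float @ld_nc(ptr addrspace(1) %p) {
  %v = load float, ptr addrspace(1) %p, !invariant.load !0
  ret float %v
}
!0 = !{}
```

Note both conditions matter: without the global address space or without the metadata, the backend falls back to a plain `ld.global`.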

Also, you may want to check whether ld.global.nc actually buys you anything performance-wise. We did some benchmarking of ld.global vs ld.global.nc vs ld.const on some workloads, and the conclusion was that, on sm_75, there was no observable difference between ld.global / LDG.E.SYS and ld.global.nc / LDG.E.CONSTANT.SYS. Furthermore, it was a significant pessimization to place the right-hand matrix (n128k128) in constant memory via a __constant__ variable and then load from it with the ld.const PTX / LDC SASS instruction.

IIRC, in the past ld.global.nc used to benefit from a separate data path and the read-only texture cache. These days the data path is unified. E.g. the NVIDIA Ampere GPU Architecture Tuning Guide says:

Like Volta, the NVIDIA Ampere GPU architecture combines the functionality of the L1 and texture caches into a unified L1/Texture cache which acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp. Another benefit of its union with shared memory, similar to Volta L1 is improvement in terms of both latency and bandwidth.

The PTX docs also say that it may be beneficial on some architectures, but do not specify which ones: PTX ISA 8.3


Really appreciate your response. In my observations, in some memory-bound cases (I guess your benchmark was mainly compute-bound?), nc mode can achieve slightly better performance (about 3–5% on V100/A100). nvcc decides automatically whether to use nc; the current mlir->nvvm->ptx pipeline may lose this potential improvement.

I find a .ll code for ld.global.nc codegen: llvm/test/CodeGen/NVPTX/ldg-invariant.ll

!invariant.load seems like an option, but I’m not sure whether MLIR currently generates it. If not, feel free to add support for it. With !invariant.load you can control each load separately, at a finer granularity.

Alternatively, you can add the llvm.noalias and llvm.readonly attributes to the function parameter for which you want non-coherent loads. In that case, all loads through that parameter become non-coherent. (FWIW, llvm.noalias alone might yield more performance than the non-coherent cache.)

See an example below:
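(The original example appears to have been lost; the following is a hypothetical reconstruction, assuming the standard memref-to-LLVM lowering forwards these argument attributes to the LLVM function signature as `noalias` / `readonly`.)

```mlir
// Loads through %arg0 can then be selected as ld.global.nc by the
// NVPTX backend; %arg1 only gets the aliasing guarantee.
func.func @kernel(%arg0: memref<1024xf32> {llvm.noalias, llvm.readonly},
                  %arg1: memref<1024xf32> {llvm.noalias}) {
  %c0 = arith.constant 0 : index
  %v = memref.load %arg0[%c0] : memref<1024xf32>
  memref.store %v, %arg1[%c0] : memref<1024xf32>
  return
}
```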

In my observations, in some memory-bound cases(guess your benchmark is mainly compute-bound cases?), nc mode can achieve a slightly better performance(about 3~5%, V100/A100).

My experience is similar to yours. I saw up to 5% improvement, but also slowdowns, so it is not always a win. Here are my earlier benchmarks from the IREE compiler
