How to build an MLIR gpu-codegen project?

I hope to generate GPU Tensor Core code through MLIR. I found a promising project, https://github.com/mcl-csa/llvm-project-public, but it has no README. I have tried to build it many times and failed. Can you tell me how to build it? Thank you. These are my build parameters.

cmake -G Ninja ../llvm -DLLVM_ENABLE_PROJECTS=mlir \
   -DLLVM_BUILD_EXAMPLES=ON \
   -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" \
   -DCMAKE_BUILD_TYPE=Release \
   -DLLVM_ENABLE_ASSERTIONS=ON \
   -DCMAKE_C_COMPILER=clang \
   -DCMAKE_CXX_COMPILER=clang++ \
   -DLLVM_ENABLE_LLD=ON \
   -DMLIR_ENABLE_CUDA_RUNNER=ON

Without more information about how / why it fails, it’ll be hard to diagnose.

This repo seems like a fork from 6 months ago without custom changes, so it should build the same way as the instructions we have in LLVM.
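
For reference, the standard recipe from the MLIR getting-started page (https://mlir.llvm.org/getting_started/) looks roughly like this; substitute the fork's URL and branch for the official repo if that's what you're building:

```sh
# Standard MLIR build, per https://mlir.llvm.org/getting_started/.
# Swap in the mcl-csa/llvm-project-public URL/branch as needed.
git clone https://github.com/llvm/llvm-project.git
mkdir llvm-project/build && cd llvm-project/build
cmake -G Ninja ../llvm \
   -DLLVM_ENABLE_PROJECTS=mlir \
   -DLLVM_BUILD_EXAMPLES=ON \
   -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" \
   -DCMAKE_BUILD_TYPE=Release \
   -DLLVM_ENABLE_ASSERTIONS=ON
ninja check-mlir   # builds everything and runs the MLIR test suite
```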

Thank you for your reply. I will try your suggestion.

I found the project's build file on GitHub and tried for a week, but I still cannot pass the tests. Below is the corresponding information. I have tried updating the NVIDIA graphics driver and the CUDA version several times, but it never helped. My GPU is an RTX 3060 with driver version 526.98, and I am using CUDA 11.6 on Ubuntu under WSL.

>cmake -G Ninja ../llvm -DLLVM_ENABLE_PROJECTS="mlir" \
   -DLLVM_BUILD_EXAMPLES=ON \
   -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" \
   -DCMAKE_BUILD_TYPE=Release \
   -DLLVM_ENABLE_ASSERTIONS=ON \
   -DCMAKE_C_COMPILER=clang \
   -DCMAKE_CXX_COMPILER=clang++ \
   -DLLVM_ENABLE_LLD=ON \
   -DBUILD_SHARED_LIBS=ON \
   -DMLIR_ENABLE_CUDA_RUNNER=ON \
   -DMLIR_INCLUDE_INTEGRATION_TESTS=ON \
   -DMLIR_RUN_CUDA_TENSOR_CORE_TESTS=ON \
   -DLLVM_CCACHE_BUILD=OFF \
   -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
>ninja check-mlir
error information:
'cuEventCreate(&event, CU_EVENT_DEFAULT)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
'cuEventRecord(event, stream)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
'cuStreamSynchronize(stream)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
'cuStreamDestroy(stream)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
'cuModuleUnload(module)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
Failed Tests (14):
MLIR :: Integration/GPU/CUDA/TensorCore/wmma-matmul-f16.mlir
MLIR :: Integration/GPU/CUDA/TensorCore/wmma-matmul-f32.mlir
MLIR :: Integration/GPU/CUDA/all-reduce-and.mlir
MLIR :: Integration/GPU/CUDA/all-reduce-max.mlir
MLIR :: Integration/GPU/CUDA/all-reduce-min.mlir
MLIR :: Integration/GPU/CUDA/all-reduce-op.mlir
MLIR :: Integration/GPU/CUDA/all-reduce-or.mlir
MLIR :: Integration/GPU/CUDA/all-reduce-region.mlir
MLIR :: Integration/GPU/CUDA/all-reduce-xor.mlir
MLIR :: Integration/GPU/CUDA/async.mlir
MLIR :: Integration/GPU/CUDA/gpu-to-cubin.mlir
MLIR :: Integration/GPU/CUDA/multiple-all-reduce.mlir
MLIR :: Integration/GPU/CUDA/shuffle.mlir
MLIR :: Integration/GPU/CUDA/two-modules.mlir

> machine information

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Fri_Dec_17_18:16:03_PST_2021
Cuda compilation tools, release 11.6, V11.6.55
Build cuda_11.6.r11.6/compiler.30794723_0

| NVIDIA-SMI 525.60.02 Driver Version: 526.98 CUDA Version: 12.0|

A few things:

  1. It'd be better to post any further questions about that repo at https://github.com/mcl-csa/llvm-project-public/issues
  2. It isn’t clear which branch of the repo/commit you are trying to build. The one to use is:
    https://github.com/mcl-csa/llvm-project-public/tree/gpu-codegen-upstream.
  3. The above branch is based on an LLVM/MLIR revision from more than 1.5 years ago – so it's meant only for "as is" use (not for further development), and only to accompany/understand the research paper that's based on it. It should work fine with CUDA 11.6 (see the sanity-check sketch right after this list).
  4. It’s not maintained and won’t be updated further.
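
As a quick sanity check for point 3: CUDA_ERROR_ILLEGAL_ADDRESS coming out of a plain cuEventCreate usually points at a broken driver/toolkit setup (here, CUDA under WSL) rather than at anything MLIR-specific. A minimal driver-API program along these lines (check.cu is just a hypothetical file name; build with e.g. `nvcc check.cu -lcuda`) should run cleanly before suspecting the compiler:

```c
// Minimal CUDA driver-API smoke test (sketch; the file name check.cu is
// arbitrary). It exercises the same cuEventCreate call that the MLIR
// CUDA runner reports as failing.
#include <cuda.h>
#include <stdio.h>

#define CHECK(call)                                          \
  do {                                                       \
    CUresult err = (call);                                   \
    if (err != CUDA_SUCCESS) {                               \
      const char *name = NULL;                               \
      cuGetErrorName(err, &name);                            \
      fprintf(stderr, "%s failed: %s\n", #call, name);       \
      return 1;                                              \
    }                                                        \
  } while (0)

int main(void) {
  CUdevice dev;
  CUcontext ctx;
  CUevent event;
  CHECK(cuInit(0));
  CHECK(cuDeviceGet(&dev, 0));
  CHECK(cuCtxCreate(&ctx, 0, dev));
  CHECK(cuEventCreate(&event, CU_EVENT_DEFAULT)); // the call failing in check-mlir
  CHECK(cuEventDestroy(event));
  CHECK(cuCtxDestroy(ctx));
  printf("CUDA driver API OK\n");
  return 0;
}
```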

If you are interested in trying out and developing MLIR code generation targeting GPU tensor cores, I’d recommend using the official LLVM/MLIR git trunk (latest) – the test cases that generate/use gpu.mma ops in test/ are a good place to start.
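
For example, the TensorCore integration tests under mlir/test/Integration/GPU/CUDA/TensorCore/ chain a pipeline roughly like the sketch below. The pass names match the LLVM 13-era branch discussed here and may have changed on trunk, and the sm_70 chip and library paths are assumptions, so check the RUN: lines of the tests at your revision:

```sh
# Sketch of the wmma test pipeline (run from the build directory;
# pass names and paths may differ on your revision).
bin/mlir-opt ../mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f32.mlir \
    -gpu-kernel-outlining \
  | bin/mlir-opt -pass-pipeline='gpu.module(strip-debuginfo,convert-gpu-to-nvvm,gpu-to-cubin{chip=sm_70})' \
  | bin/mlir-opt -gpu-to-llvm \
  | bin/mlir-cpu-runner \
      --shared-libs=lib/libmlir_cuda_runtime.so \
      --shared-libs=lib/libmlir_runner_utils.so \
      --entry-point-result=void
```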

Hi everyone,

I'm also interested in this project and in how to apply the same ideas to the current trunk.
One transformation that seems to be missing upstream is collapsing two nested 1D affine.parallel loops into a single 2D loop. For instance, from

func.func private @matmul(%arg0: memref<16x8xf32>, %arg1: memref<8x8xf32>, %arg2: memref<16x8xf32>) {
    affine.parallel (%arg3) = (0) to (16) {
      affine.parallel (%arg4) = (0) to (8) {
        affine.for %arg5 = 0 to 8 {
          %0 = affine.load %arg0[%arg3, %arg5] : memref<16x8xf32>
          %1 = affine.load %arg1[%arg5, %arg4] : memref<8x8xf32>
          %2 = affine.load %arg2[%arg3, %arg4] : memref<16x8xf32>
          %3 = arith.mulf %0, %1 : f32
          %4 = arith.addf %3, %2 : f32
          affine.store %4, %arg2[%arg3, %arg4] : memref<16x8xf32>
        }
      }
    }
    return
}

to

func.func private @matmul(%arg0: memref<16x8xf32>, %arg1: memref<8x8xf32>, %arg2: memref<16x8xf32>) {
    affine.parallel (%arg3, %arg4) = (0, 0) to (16, 8) {
      affine.for %arg5 = 0 to 8 {
        %0 = affine.load %arg0[%arg3, %arg5] : memref<16x8xf32>
        %1 = affine.load %arg1[%arg5, %arg4] : memref<8x8xf32>
        %2 = affine.load %arg2[%arg3, %arg4] : memref<16x8xf32>
        %3 = arith.mulf %0, %1 : f32
        %4 = arith.addf %3, %2 : f32
        affine.store %4, %arg2[%arg3, %arg4] : memref<16x8xf32>
      }
    }
    return
}

I'm re-implementing the bits needed to do this in my project, but I was wondering whether I'm missing a path that is already implemented (apart from going through linalg to scf.parallel) and, if not, whether it would be useful to upstream such a transformation.

The code for this is available at https://github.com/mcl-csa/llvm-project-public/tree/gpu-codegen-upstream (see the coalesce affine.parallel utility) under the Apache 2 license, so please feel free to reuse it. Yes, this could be contributed to the official repo.

Thanks! It is the gpu-codegen-upstream branch.
After switching to a new hardware environment (an NVIDIA V100 with CUDA 10.3), the problems mentioned above went away.
But when I compiled and built according to the GitHub workflow, I ran into a new problem.
As far as I understand, it occurs when the GPU module is finally converted. Below are the error report and the related trace information. I don't know how to deal with it. Is there any way to track down bugs in the mlir-opt tool? (A narrowing sketch follows after the trace below.) I haven't modified any source code.

../../../build-mlir/bin/mlir-opt matmul_opt_final.mlir -mlir-print-stacktrace-on-diagnostic \
    -pass-pipeline='gpu.module(convert-gpu-to-nvvm{index-bitwidth=32},gpu-to-cubin{chip=sm_70 max-reg-per-thread=255 cu-jit-opt-level=4})' \
    -gpu-to-llvm \
  | nsys profile --force-overwrite true -o gpu ../../../build-mlir/bin/mlir-cpu-runner -O3 ../../../build-mlir/lib --entry-point-result=void \
  > full_pipe.out 2> dump_.txt
**Error1:**
matmul_opt_final.mlir:92:3: error: cuLinkAddData( linkState, CUjitInputType::CU_JIT_INPUT_PTX, const_cast<void *>(static_cast<const void *>(isa.c_str())), isa.length(), kernelName.c_str(), 2, extraJitOptions, extraJitOptionsVals ) 
failed with error code a PTX JIT compilation failed[ptxas application ptx input, line 1096; 
fatal   : Parsing error near '-': syntax error
ptxas fatal   : Ptx assembly aborted due to errors]
  gpu.module @main_kernel {
**Error2:**
matmul_opt_final.mlir:483:3: error: cuLinkAddData( linkState, CUjitInputType::CU_JIT_INPUT_PTX, const_cast<void *>(static_cast<const void *>(isa.c_str())), isa.length(), kernelName.c_str(), 2, extraJitOptions, extraJitOptionsVals ) 
failed with error code a PTX JIT compilation failed[ptxas application ptx input, line 192; 
fatal   : Parsing error near '-': syntax error
ptxas fatal   : Ptx assembly aborted due to errors]
  gpu.module @initC_kernel {
**Error1 trace information:**
matmul_opt_final.mlir:92:3: note: diagnostic emitted with trace:
 #0 0x00007f13cc0b2973 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/rice/mlir/mlir-project/llvm-project-public-gpu-codegen-upstream/build-mlir/bin/../lib/libLLVMSupport.so.13git+0x196973)
 #1 0x00007f13cc63ff43 emitDiag(mlir::Location, mlir::DiagnosticSeverity, llvm::Twine const&) Diagnostics.cpp:0:0
 #2 0x00007f13cc63fdf1 mlir::emitError(mlir::Location, llvm::Twine const&) (/home/rice/mlir/mlir-project/llvm-project-public-gpu-codegen-upstream/build-mlir/bin/../lib/libMLIRIR.so.13git+0x8bdf1)
 #3 0x00007f13d50963ab emitCudaError(llvm::Twine const&, char const*, cudaError_enum, mlir::Location) SerializeToCubin.cpp:0:0
 #4 0x00007f13d5095ed7 (anonymous namespace)::SerializeToCubinPass::serializeISA(std::string const&) SerializeToCubin.cpp:0:0
 #5 0x00007f13d509356b mlir::gpu::SerializeToBlobPass::runOnOperation() (/home/rice/mlir/mlir-project/llvm-project-public-gpu-codegen-upstream/build-mlir/bin/../lib/libMLIRGPUTransforms.so.13git+0x2456b)
 #6 0x00007f13cfd44998 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) (/home/rice/mlir/mlir-project/llvm-project-public-gpu-codegen-upstream/build-mlir/bin/../lib/libMLIRPass.so.13git+0xe998)
 #7 0x00007f13cfd44ea3 mlir::detail::OpToOpPassAdaptor::runPipeline(llvm::iterator_range<llvm::pointee_iterator<std::unique_ptr<mlir::Pass, std::default_delete<mlir::Pass> >*, mlir::Pass> >, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) (/home/rice/mlir/mlir-project/llvm-project-public-gpu-codegen-upstream/build-mlir/bin/../lib/libMLIRPass.so.13git+0xeea3)
 #8 0x00007f13cfd49ef2 mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool)::$_8::operator()(llvm::MutableArrayRef<mlir::OpPassManager>) const Pass.cpp:0:0
 #9 0x00007f13cfd45f4f mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool) (/home/rice/mlir/mlir-project/llvm-project-public-gpu-codegen-upstream/build-mlir/bin/../lib/libMLIRPass.so.13git+0xff4f)
#10 0x00007f13cfd449d0 mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) (/home/rice/mlir/mlir-project/llvm-project-public-gpu-codegen-upstream/build-mlir/bin/../lib/libMLIRPass.so.13git+0xe9d0)
#11 0x00007f13cfd44ea3 mlir::detail::OpToOpPassAdaptor::runPipeline(llvm::iterator_range<llvm::pointee_iterator<std::unique_ptr<mlir::Pass, std::default_delete<mlir::Pass> >*, mlir::Pass> >, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int, mlir::PassInstrumentor*, mlir::PassInstrumentation::PipelineParentInfo const*) (/home/rice/mlir/mlir-project/llvm-project-public-gpu-codegen-upstream/build-mlir/bin/../lib/libMLIRPass.so.13git+0xeea3)
#12 0x00007f13cfd46c8c mlir::PassManager::run(mlir::Operation*) (/home/rice/mlir/mlir-project/llvm-project-public-gpu-codegen-upstream/build-mlir/bin/../lib/libMLIRPass.so.13git+0x10c8c)
#13 0x00007f13dad12a60 performActions(llvm::raw_ostream&, bool, bool, llvm::SourceMgr&, mlir::MLIRContext*, mlir::PassPipelineCLParser const&) MlirOptMain.cpp:0:0
#14 0x00007f13dad10cf9 processBuffer(llvm::raw_ostream&, std::unique_ptr<llvm::MemoryBuffer, std::default_delete<llvm::MemoryBuffer> >, bool, bool, bool, bool, mlir::PassPipelineCLParser const&, mlir::DialectRegistry&) MlirOptMain.cpp:0:0
#15 0x00007f13dad1167f mlir::MlirOptMain(int, char**, llvm::StringRef, mlir::DialectRegistry&, bool) (/home/rice/mlir/mlir-project/llvm-project-public-gpu-codegen-upstream/build-mlir/bin/../lib/libMLIROptLib.so.13git+0x467f)
#16 0x000000000040a06b main (../../../build-mlir/bin/mlir-opt+0x40a06b)
#17 0x00007f13cb350555 __libc_start_main (/lib64/libc.so.6+0x22555)
#18 0x0000000000409d39 _start (../../../build-mlir/bin/mlir-opt+0x409d39)
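
On the question of tracking this down: one sketch (assuming standard mlir-opt IR-printing flags and the build tree from the command above) is to split the failing pipeline, print the IR between passes, and then run the gpu-to-cubin serialization step in isolation, so the NVVM IR that ptxas rejects can be inspected directly:

```sh
# Stop before serialization and dump the IR after each pass;
# the output file names here are only examples.
../../../build-mlir/bin/mlir-opt matmul_opt_final.mlir \
    -pass-pipeline='gpu.module(convert-gpu-to-nvvm{index-bitwidth=32})' \
    -mlir-print-ir-after-all \
    -o after_nvvm.mlir 2> ir_dump.txt

# Then run only the cubin serialization on the result to isolate the
# ptxas "Parsing error near '-'" failure.
../../../build-mlir/bin/mlir-opt after_nvvm.mlir \
    -pass-pipeline='gpu.module(gpu-to-cubin{chip=sm_70})'
```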