Lowering of DmaStartOp and DmaWaitOp

Hi,

I am interested in leveraging the affine dialect passes to perform tiling, vectorization, and DMA call insertion. I successfully tiled, vectorized, inserted the DMA calls, and lowered all dialects to standard.

Now I need to lower it to the LLVM dialect, but I discovered that there is no lowering for the DMA ops. I am interested in implementing it for a specific target. Looking at the StandardToLLVM conversion, I can see that such a lowering could easily be implemented there.

However, it does not look like the right thing to do. Since DMA calls become target-dependent in the LLVM dialect and then in LLVM IR, the lowering method might become very large, or need multiple implementations, one for each target.

I saw a discussion about splitting the standard dialect. Do you think that is the correct direction for the DMA ops? Should they be separated into their own dialect and lowered to the LLVM dialect from there? Or is it OK to lower them directly from the standard dialect, so that all that needs to be done is to split their lowering into a separate file?

Would really appreciate some guidance in this matter.

Could you just lower the dma ops to calls to functions that implement those semantics for your target? The std dialect splitting itself may not solve the problem for you and even with the std dialect, you can always lower them separately from (and before performing) the std to llvm lowering. Do you have LLVM intrinsics for the dma ops on your target?
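To illustrate the suggestion of lowering the DMA ops to calls, here is a minimal sketch written as a toy model in plain C++, not against the MLIR API. The `Op` struct, `lowerDmaToCalls`, and the runtime symbols `mytarget_dma_start` and `mytarget_dma_wait` are hypothetical names chosen for illustration; in a real pipeline this would presumably be a rewrite pattern that runs before the std-to-LLVM lowering.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for an operation: a name, its operands, and (for calls) a callee.
struct Op {
  std::string name;
  std::vector<std::string> operands;
  std::string callee; // only meaningful when name == "call"
};

// Hypothetical runtime entry points; a real target would supply its own symbols.
inline std::string runtimeSymbolFor(const std::string &opName) {
  if (opName == "std.dma_start")
    return "mytarget_dma_start";
  if (opName == "std.dma_wait")
    return "mytarget_dma_wait";
  return ""; // not a DMA op
}

// Replace each DMA op with a call that forwards the same operands.
// All other ops are copied through unchanged.
inline std::vector<Op> lowerDmaToCalls(const std::vector<Op> &ops) {
  std::vector<Op> result;
  result.reserve(ops.size());
  for (const Op &op : ops) {
    std::string symbol = runtimeSymbolFor(op.name);
    if (symbol.empty()) {
      result.push_back(op);
      continue;
    }
    result.push_back(Op{"call", op.operands, symbol});
  }
  assert(result.size() == ops.size() && "the rewrite is one-to-one");
  return result;
}
```

Because the DMA ops are gone after this step, the remaining ops can go through the unmodified std-to-LLVM lowering.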

Thanks for the reply, Uday.

Yes, that is my first plan. I will lower them to our target runtime call. The next step would be to model the runtime calls as LLVM intrinsics.

So, would it be OK to guard my lowering code with a check for our target, and return failure() otherwise?

I’m not sure I understand your strategy. I assume you’d just have a target-specific custom pipeline. You could just have a target-specific lowering pattern to convert those DMA ops to calls. MLIR doesn’t have the notion of a TargetTransformInfo yet.

I think my strategy was not clear because I am pretty new to MLIR.

My target's runtime calls will be added to LLVM as intrinsics, independently of my work on lowering the std.dma_start/wait ops. As an intermediate step, and for testing only, I would first lower the ops to external calls. Later, once the intrinsics have been merged, the lowering will emit the intrinsics instead of those calls.

I guess my real question is: how do I make the lowering target-specific without a TargetTransformInfo feature? Is there something in MLIR that is used for that, or would I use something like std::conditional?

I still suspect this is really the smaller / less important part of the problem, and one that you can temporarily get around in many ways. For example, put together a custom “target” pipeline that instantiates its passes with the right information based on your target.
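One way to picture such a custom pipeline is the toy sketch below, in plain C++ with no MLIR dependency. The `Target` enum, `TargetOptions`, `optionsFor`, and `buildPipeline` are all hypothetical; the pass names are plain strings, with `convert-std-mydma-to-llvm` being the hypothetical pass named elsewhere in this thread. The point is only that the target decision is made once, when the pipeline is assembled, instead of as a guard inside each lowering pattern.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical target identifiers; this is not an MLIR type.
enum class Target { Generic, MyAccelerator };

// Target facts that the passes need, gathered in one place.
struct TargetOptions {
  bool hasDmaRuntime;
  std::string dmaStartSymbol;
  std::string dmaWaitSymbol;
};

inline TargetOptions optionsFor(Target target) {
  switch (target) {
  case Target::MyAccelerator:
    return TargetOptions{true, "mytarget_dma_start", "mytarget_dma_wait"};
  case Target::Generic:
    break;
  }
  return TargetOptions{false, "", ""};
}

// Assemble the ordered list of pass names for one target.
// Only the final lowering step is modeled here.
inline std::vector<std::string> buildPipeline(Target target) {
  TargetOptions options = optionsFor(target);
  std::vector<std::string> passes;
  if (options.hasDmaRuntime)
    passes.push_back("convert-std-mydma-to-llvm"); // DMA-aware lowering
  else
    passes.push_back("convert-std-to-llvm"); // stock lowering, no DMA support
  assert(!passes.empty());
  return passes;
}
```

With this shape, a target that has no DMA runtime simply never schedules the DMA-aware pass, so the patterns themselves need no target check.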

You will need to implement the lowering in a modular way regardless of the dialect splitting process. The conversion infrastructure doesn’t really care if the source/target ops all belong to the same dialect or to different dialects. Declaring a dialect as (il)legal is equivalent to declaring all its ops (il)legal.

Standard->LLVM is currently an all-or-nothing conversion because of the hard type barrier (standard ops don’t understand LLVM types and vice versa). What folks do today is define their additional conversion patterns, and then have a pass that applies these additional patterns and all the existing standard-to-llvm patterns. This isn’t hard to set up, and already exists in various forms, e.g. -convert-vector-to-llvm. You can add -convert-std-mydma-to-llvm that acts similarly.
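The setup described above can be sketched with a toy model in plain C++. This is not the MLIR conversion API: `PatternSet`, `populateStdToLLVMPatterns`, `populateMyDmaToLLVMPatterns`, and `applyFullConversion` are hypothetical stand-ins, and the `mytarget_*` callees are assumed names. The sketch shows why the stock pattern set fails on a function containing DMA ops, and why one pass that applies both sets together succeeds.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy pattern set: maps a source op name to the op it is rewritten into.
using PatternSet = std::map<std::string, std::string>;

// Stand-in for the existing standard-to-LLVM patterns (only two are shown).
inline void populateStdToLLVMPatterns(PatternSet &patterns) {
  patterns["std.addf"] = "llvm.fadd";
  patterns["std.return"] = "llvm.return";
}

// Hypothetical additional patterns for the DMA ops.
inline void populateMyDmaToLLVMPatterns(PatternSet &patterns) {
  patterns["std.dma_start"] = "llvm.call @mytarget_dma_start";
  patterns["std.dma_wait"] = "llvm.call @mytarget_dma_wait";
}

// All-or-nothing conversion: only ops already in the llvm dialect are legal.
// Returns false and leaves `ops` untouched if any illegal op has no pattern.
inline bool applyFullConversion(std::vector<std::string> &ops,
                                const PatternSet &patterns) {
  std::vector<std::string> converted;
  for (const std::string &op : ops) {
    if (op.rfind("llvm.", 0) == 0) { // name starts with "llvm.": already legal
      converted.push_back(op);
      continue;
    }
    auto it = patterns.find(op);
    if (it == patterns.end())
      return false; // an illegal op would remain, so the whole conversion fails
    converted.push_back(it->second);
  }
  assert(converted.size() == ops.size());
  ops = converted;
  return true;
}
```

A pass in the style of the suggested -convert-std-mydma-to-llvm would correspond to calling both populate functions on the same set before running the conversion.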

We are not very content with this situation and are looking into a more progressive conversion, where type mismatches are fixed by llvm.mlir.cast, which converts values between std and LLVM types. For future-proofing, you may also consider writing a conversion that emits those casts directly and running it separately from std-to-llvm, but it is likely trickier and needs some extra canonicalization of that op.

I really appreciate your suggestions and comments, @bondhugula and @ftynse.

I think this side conversion (-convert-std-mydma-to-llvm) looks like a good approach for now. I will follow this direction.

Thanks again for the helpful comments.

@ftynse and @bondhugula Thank you for the suggestions and comments. To follow up on the discussion: would it be a good idea to implement a side dialect that is target-specific and contains my-dma ops, and lower from std to my-dma-dialect and then to the LLVM dialect? If we use this approach, what else do we have to do to translate the LLVM dialect (which contains my-dma-dialect intrinsics) to LLVM IR, supposing we already have the target-specific DMA calls defined in LLVM IR? Put another way: if we have the DMA intrinsics defined in LLVM IR, how do we match the intrinsics defined in my-dma-dialect to the existing intrinsics in LLVM?

Thanks for the help!

Wouldn’t this be the same as the way other LLVM dialect ops that map to LLVM intrinsics are lowered? (There are numerous such ops in the LLVM dialect.)

Like @bondhugula said, it is possible to define ops that correspond to LLVM intrinsics, and reasonably straightforward to lower standard DMA to those intrinsics together with other standard operations being lowered to the LLVM dialect. The ops for intrinsics may live in a different dialect than LLVM, but use LLVM dialect types. We have several precedents for that (NVVM, AVX512, ArmNEON).

There is also a tool that lets you generate minimalistic ODS directly from the TableGen definitions of LLVM IR intrinsics; see https://github.com/llvm/llvm-project/blob/main/mlir/test/mlir-tblgen/llvm-intrinsics.td for usage.