GSoC proposal: SPIR-V to LLVM IR dialect conversion in MLIR

Hi everyone,

Together with @antiagainst, I have created a proposal on ‘SPIR-V to LLVM IR dialect conversion in MLIR’ for GSoC. I would really appreciate any thoughts, suggestions and comments!

It is available at:

Thanks,

George

Hi George,

That looks like a great project!

I wonder what your thoughts are on the structured control flow constraint and the “convergent” operations?

Thank you!

Regarding the question, I am not sure I have an answer for it yet, but thank you for bringing it to my attention!

For now, I did some quick research on this topic and found an interesting discussion of the ‘convergent’ attribute on the LLVM mailing list. I think that in the case of SPIR-V to LLVM conversion, the ‘convergent’ attribute may need to appear on barrier and group ops. I will look at what LLVM IR offers for this now and will study the issue further :slightly_smiling_face:.

The proposal has been accepted as a GSoC project; congratulations George! @MaheshRavishankar and I are looking forward to working with you. :slight_smile:

@mehdi_amini: Great question. Lots of details to unpack. I haven’t pondered it thoroughly, but I know some folks who have a deep understanding of this issue. I’ll help George start a discussion later and pull in folks to share their thoughts along the way.

Quick question: why SPIR-V to LLVM instead of SPIR-V to std (and then reusing our existing std to llvm path)?

For example, it seems more useful to have spv.func → std.func instead of spv.func → llvm.func since std.func can be a starting point to various other dialects.

Of course, this doesn’t prohibit us from going straight spv → llvm for things that are cumbersome to handle today in std, like modeling “convergent” (yay MLIR!)

That’s a good point!

I am relatively new to MLIR but I will try to answer your question :slightly_smiling_face:.

I think that here we want to be ‘independent’ of the standard dialect since we only want to target LLVM proper, so that SPIR-V can be run on the CPU, which is the main goal.

Some things are hard to handle in std, as you said, so we would need not only spv -> std but also separate handling of the spv -> llvm special cases. In my opinion this may result in a clumsier conversion path and therefore introduce more problems in the future as the dialects change. This does not mean, however, that spv -> std should not be possible.

Types will be a hard barrier to cross if we go from SPIR-V to standard. Although the operations are somewhat similar between standard, LLVM and SPIR-V, the standard dialect works on types like tensor and memref, which are much higher level than what exists in LLVM and SPIR-V, where we frequently want pointer and struct types. For example, it’s quite common to have spv.func taking pointers to structs as arguments. That is relatively straightforward to map to LLVM, but not to the standard dialect. We might be able to handle a subset of cases for scalar/vector types, but I am not sure how useful that would be. Converting directly to LLVM is needed for the complicated cases anyway, and that solution should naturally handle the simpler cases too.
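To illustrate why pointer and struct types map fairly directly onto LLVM, here is a minimal Python sketch of a recursive type translation. The classes `Scalar`, `Vector`, `Struct` and `Pointer` and the function `to_llvm` are hypothetical stand-ins invented for this example, not MLIR's actual type converter, and the output uses the typed-pointer spelling of textual LLVM IR (newer LLVM versions use an opaque `ptr` instead).

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Toy stand-ins for a few SPIR-V dialect types (hypothetical, not MLIR's API).
@dataclass(frozen=True)
class Scalar:
    name: str  # e.g. "f32" or "i32"

@dataclass(frozen=True)
class Vector:
    count: int
    element: Scalar

@dataclass(frozen=True)
class Struct:
    members: Tuple["SpvType", ...]

@dataclass(frozen=True)
class Pointer:
    pointee: "SpvType"

SpvType = Union[Scalar, Vector, Struct, Pointer]

# Scalar spellings in textual LLVM IR.
_LLVM_SCALARS = {"f32": "float", "f64": "double", "i1": "i1", "i32": "i32", "i64": "i64"}

def to_llvm(ty: SpvType) -> str:
    """Recursively spell a toy SPIR-V type as a textual LLVM IR type."""
    if isinstance(ty, Scalar):
        return _LLVM_SCALARS[ty.name]
    if isinstance(ty, Vector):
        return f"<{ty.count} x {to_llvm(ty.element)}>"
    if isinstance(ty, Struct):
        return "{ " + ", ".join(to_llvm(m) for m in ty.members) + " }"
    if isinstance(ty, Pointer):
        return to_llvm(ty.pointee) + "*"
    raise TypeError(f"unsupported type: {ty!r}")

# A function argument of type "pointer to struct of (f32, vector of 4 x i32)".
arg_type = Pointer(Struct((Scalar("f32"), Vector(4, Scalar("i32")))))
llvm_spelling = to_llvm(arg_type)  # "{ float, <4 x i32> }*"
```

Each case has an obvious LLVM counterpart, which is the point being made above; a tensor or memref based dialect has no equally direct target for the `Pointer` and `Struct` cases.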

Thanks! That’s just the answer I was looking for!

Hi George,

This is an interesting project. I have a question regarding the SPIR-V to LLVM mapping section of the document. What do you think about mapping spv operations (e.g. spv.ControlBarrier) that can’t be represented by LLVM operations or intrinsics? Any ideas?

Hi Alexey,

I think this will be dealt with on a case-by-case basis. In the proposal I mentioned entry points and specialisation constants. Since we can have multiple entry points in SPIR-V that specify the execution mode, I think we might be able to pass those as attributes to llvm.func and handle the different execution modes from there. Another option I have in mind is to first identify the possible entry points and then work on each one separately. spv.specConstant can be modelled as a global variable.

Regarding spv.ControlBarrier, I haven’t yet considered its conversion to LLVM. This is a bit tricky in my opinion. However, we might want to use LLVM’s fence for that, with enforced ordering.

LLVM fence is a memory ordering instruction, so I think it’s a better fit for spv.MemoryBarrier. spv.ControlBarrier can degenerate into a spv.MemoryBarrier if it has no control semantics, and that case would be relatively straightforward. What’s interesting, as pointed out by @AlexeySotkin, is the normal case. In CPU land I think a similar model is fork/join. @george, you can probably survey how that is handled in LLVM; it might give some hints.

An idea I have at the moment, which needs more thinking over the details, is that we can probably split the kernel at the spv.ControlBarrier point and turn the two halves into two “entry points” when lowering to LLVM ops, so the “runtime” can “launch” them separately (forking threads on launch and joining after each launch), effectively achieving the goal of a spv.ControlBarrier. This requires us to have some convention for conveying the entry point scheduling to the runtime, but a nice thing with MLIR is that we can also model the runtime side and bring both into one system. :slight_smile:
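The kernel-splitting idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `phase_before_barrier`, `phase_after_barrier` and `launch` are hypothetical, a thread pool stands in for the runtime, and the barrier is assumed to sit at the top level of the kernel (not inside control flow).

```python
from concurrent.futures import ThreadPoolExecutor

def phase_before_barrier(item, data, shared, out):
    # Everything the kernel did before the barrier: publish a value to "workgroup" memory.
    shared[item] = data[item] * 2

def phase_after_barrier(item, data, shared, out):
    # Everything after the barrier: read a neighbour's value, which must already be written.
    out[item] = shared[(item + 1) % len(shared)]

def launch(phases, data):
    """Hypothetical runtime: fork one task per work item for each phase, join between phases."""
    n = len(data)
    shared = [None] * n
    out = [None] * n
    with ThreadPoolExecutor(max_workers=4) as pool:
        for phase in phases:
            # list(...) drains the result iterator, so every work item finishes this
            # phase (the join) before any work item starts the next one.
            list(pool.map(lambda item: phase(item, data, shared, out), range(n)))
    return out

result = launch([phase_before_barrier, phase_after_barrier], [1, 2, 3, 4])
# result == [4, 6, 8, 2]
```

The join between the two launches plays the role of the barrier: no invocation reads `shared` until every invocation has written its slot. The list of phases is the "convention" the runtime would need to be told about.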

It might be that for some cases, SPIR-V cannot be lowered into LLVM, and some enhancements might be needed to LLVM itself to handle those situations to have a more principled solution. Such aspects are probably outside the scope of the GSoC project itself.

spv.ControlBarrier probably lowers more naturally to llvm.nvvm.syncthreads. So while the project is titled SPIR-V to LLVM, for ops like this we can consider lowering to target-specific intrinsics.

I am not so sure that the LLVM compiler has native support for the fork/join model. AFAIK this is done by the “runtime”.

I am not sure this would work. I am not fully up to speed on the complete spec of spv.ControlBarrier, but I don’t think it says that the barrier cannot appear within control flow. Splitting the kernel this way, you would have to snapshot the entire state of the registers and shared memory created by the first kernel and rematerialize it in the second kernel. Plus, depending on the scope of the spv.ControlBarrier, there might be more complications.
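A small Python sketch of the state problem, assuming a kernel in which a per-invocation value computed before the barrier is still needed after it. The function names and the `saved` dictionary are hypothetical, and two sequential loops stand in for the two separate launches.

```python
def before_barrier(item, data, shared):
    private = data[item] + item   # per-invocation value, a "register" in the original kernel
    shared[item] = private
    return {"private": private}   # snapshot of what is still live after the barrier

def after_barrier(item, shared, saved):
    private = saved["private"]    # rematerialize the live value in the second kernel
    return private + shared[len(shared) - 1 - item]

def run_split(data):
    n = len(data)
    shared = [0] * n
    # First "launch": every invocation runs up to the barrier and saves its live state.
    contexts = [before_barrier(i, data, shared) for i in range(n)]
    # Second "launch": every invocation resumes from its saved state.
    return [after_barrier(i, shared, contexts[i]) for i in range(n)]

out = run_split([10, 20, 30])
# out == [42, 42, 42]
```

Even in this tiny case the split kernels need an extra per-invocation context. A barrier nested inside a loop or a branch would additionally require saving where in the control flow each invocation stopped, which this sketch does not attempt.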

+1. Agreed. Lowering the normal ops is already a sizable amount of work, which is a good fit for GSoC. These hard problems are good to think about and discuss; we don’t necessarily need to have an implementation landed for the GSoC project.

Good points, Mahesh. It does mean we cannot just treat the two kernels as two normal kernels. More state needs to be passed between them.

+1. Worth discussing.

I wonder what kind of enhancements that might be; some external calls like __spirv_<OpName>, intrinsics like llvm.spirv.*, or some other form?

Can someone elaborate on why spv.ControlBarrier can’t be represented in LLVM with a convergent intrinsic?

In CPU land there is no equivalent to the SIMT constraint around the convergence requirement. Fork-join models threads that can make independent progress, which isn’t the case for the threads in a workgroup in general.

I think we have two problems to solve here: 1) how to translate the input SPIR-V IR into output LLVM IR and run LLVM optimizations while maintaining the GPU constraints, and 2) how to run the generated LLVM IR on CPU using JIT (or AOT) with the correct semantics. I’m still trying to fully grok all the details involving LLVM convergent; my current understanding is that it’s an attribute that helps with 1), given that it prevents incorrect code motion during compilation, etc., while the above discussions are around 2), I think: we need to make sure all invocations are properly sync’ed at the point of spv.ControlBarrier when executing. But I might be missing something here. I’m not super familiar with how LLVM ORC JIT is implemented. How is convergent handled in it? If we turn spv.ControlBarrier into a convergent attribute, where should we attach the attribute?

That’s a good point. I pointed to fork/join because it’s the most “similar” model. This is indeed a very important difference. Depending on the particular op, it might (e.g., spv.IAdd) or might not (e.g., spv.GroupNonUniform*) be fine for invocations to progress independently and still observe the same final result. Actually, the preemptive scheduling style with threads also doesn’t work well here; I think a better way might be cooperative scheduling like coroutines/fibers, where we can yield. Then a spv.ControlBarrier becomes a potential yield point. This also handles state better, given that coroutines/fibers can resume where they stopped. But this means teaching the LLVM JIT to use cooperative scheduling. I don’t think this is supported out of the box at the moment?

If the above thoughts make sense, I think a somewhat generic llvm.yield intrinsic with a scope will do? spv.ControlBarrier can be decomposed into llvm.fence and llvm.yield? An llvm.spirv.controlbarrier would also be a reasonable choice here, and it can be seen as a generalization of llvm.nvvm.syncthreads.
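The cooperative scheduling idea can be mimicked with Python generators, where each `yield` stands in for a barrier. This is a sketch of the scheduling model only, not of anything the LLVM JIT provides; `kernel` and `run_workgroup` are hypothetical names, and it assumes every invocation reaches the same number of barriers.

```python
def kernel(item, data, shared, out):
    """One invocation written as a generator; each `yield` marks a barrier."""
    private = data[item] * 2   # local state survives across the yield for free
    shared[item] = private
    yield                      # barrier: suspend until the whole workgroup arrives
    out[item] = private + shared[(item + 1) % len(shared)]

def run_workgroup(data):
    """Hypothetical cooperative scheduler: resume every invocation up to its next barrier."""
    n = len(data)
    shared, out = [0] * n, [0] * n
    running = [kernel(i, data, shared, out) for i in range(n)]
    while running:
        still_running = []
        for invocation in running:
            try:
                next(invocation)   # run until the next barrier or until it returns
                still_running.append(invocation)
            except StopIteration:
                pass               # this invocation has finished
        running = still_running
    return out

result = run_workgroup([1, 2, 3])
# result == [6, 10, 8]
```

Because a suspended generator keeps its locals and its position, the scheduler does not need to snapshot state explicitly, and a barrier inside a loop or branch would suspend and resume in the same way.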

Oh, that’s more tricky… I would look into OpenCL implementations on CPU as a source of inspiration? I expect that they have to support the OpenCL barrier and this kind of thing, and it is implemented in clang/LLVM already, so we should have it all available :slight_smile:

I’m curious about this project. Is the idea to map one work-item to one CPU thread? Most of the CPU implementations of OpenCL and other GPU APIs I have seen try to map one warp to one CPU or, in the case of spv.ControlBarrier, one workgroup to one hardware thread. This allows for easy support of convergent instructions as well as workgroup synchronization. In this case, convergent instructions that implicitly access neighbor lanes can be expanded so that they no longer need special semantics.
Note that Vulkan allows workgroups of up to 1024 work items. If we use a fork-join kind of solution, it would require having 1024 threads running in parallel.

The downside of this technique is that it makes the implementation nontrivial. There is definitely a tradeoff between being able to support warp-level and workgroup-level instructions and having a straightforward translation.
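As a rough illustration of mapping a whole workgroup to one hardware thread, the Python sketch below turns per-lane values into lists and per-lane code into loops, so that group-wide operations become ordinary code. The function `workgroup_kernel` and the specific operations are invented for this example and are not taken from any particular OpenCL implementation.

```python
def workgroup_kernel(data):
    """Run a whole workgroup on one thread: per-lane values are lists, per-lane code is loops."""
    lanes = range(len(data))
    # Per-lane arithmetic: one loop iteration per work item.
    squared = [data[lane] * data[lane] for lane in lanes]
    # A group-wide add, expanded into an ordinary reduction over the lane array.
    total = sum(squared)
    # A "read from the next lane" shuffle, expanded into plain indexing.
    neighbour = [squared[(lane + 1) % len(data)] for lane in lanes]
    # Back to per-lane arithmetic using both results.
    return [neighbour[lane] - squared[lane] + total for lane in lanes]

result = workgroup_kernel([1, 2, 3])
# result == [17, 19, 6]
```

No barrier or convergent annotation is left after the expansion, because every lane's value is already available before the group operation runs. The cost is the one mentioned above: the translation has to restructure the kernel instead of lowering it op by op.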

As @antiagainst mentioned above, my understanding is that the convergent attribute tells passes that the threads are already convergent at this point and that no transformation should violate this (i.e. move code such that the control dependency at this point changes). That doesn’t seem to match the requirement of a barrier, which is a synchronization point (is there a target-agnostic barrier in LLVM IR?). I too am not fully up to speed on this aspect of LLVM.

My view on this is that for now the proposal is focusing on just translating the “scalar parts” of the SPIR-V dialect into the LLVM dialect (arithmetic ops, functions, modules, etc.). How to handle the multi-threaded aspects of the SPIR-V spec is more involved. I think a better approach is to either

  1. Lower from LLVM Dialect to NVVM (which has the advantage of having a lot of what is needed in LLVM already)
  2. Or develop a “SPIR-V Target” within LLVM core. That can be used to target SPIR-V specific intrinsics (just like NVVM-specific intrinsics for compiling CUDA code). This is more long term and needs wider community engagement from many stakeholders.

Mapping GPU IR to CPU has to address all the complications you have mentioned.