GSoC 2009: Extending LLVM IR to aid multi-core code generation

Hi,

I would like to extend LLVM IR with two new intrinsics: spawn and join. The spawn intrinsic will indicate that the call it modifies can safely run in parallel, while the join intrinsic will indicate that execution of the current call cannot continue until all previously spawned calls in the list/array passed to it have completed and returned their results. This extension would allow us to express strict multi-threaded computations and introduces the possibility of new optimisation passes for parallel computations (see below for some of them).

Once this extension is added, I plan to demonstrate its usability and expressiveness in two ways. Firstly, I will implement a randomized work-stealing algorithm in VMKit. This will effectively find an execution schedule that obeys the constraints imposed by the multi-threaded computation. Secondly, I will extend the back-end of a multi-threaded processor (such as the XCore) to create threads when a call is modified with the spawn intrinsic and synchronise them when the join intrinsic is encountered. We can run the generated code using a scheduler that implements the randomized work-stealing algorithm, or, when generating code, pass it through optimisation passes such as copy elimination (to remove unnecessary copy operations), fusion (if parallelism slackness is high, i.e. the number of spawned calls is greater than the number of available cores, we would need to group them) and fission (if parallelism slackness is low, we would need to increase throughput by multiplying work). Please note that when generating code for a single-threaded processor, a spawn intrinsic is simply ignored and code is generated for a plain call, while the join intrinsic becomes a no-op.
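To make the intended semantics concrete, here is a sketch in C of what spawn and join could lower to on a conventional OS: spawn becomes pthread_create, and join becomes pthread_join over the set of previously spawned calls. This is only an illustration of the proposed behaviour; the names square_task and spawn_and_join_sum are hypothetical, not part of the proposal.

```c
#include <assert.h>
#include <pthread.h>

/* Illustration only: on a conventional OS, a spawn-annotated call could
 * lower to pthread_create, and join to pthread_join over the list of
 * previously spawned calls. All names here are hypothetical. */

typedef struct { int in; int out; } task_arg;

static void *square_task(void *p) {
    task_arg *a = (task_arg *)p;
    a->out = a->in * a->in;        /* the "call" marked with spawn */
    return NULL;
}

/* Spawn one task per input, then join them all before using results. */
int spawn_and_join_sum(const int *xs, int n) {
    pthread_t tids[16];
    task_arg args[16];
    if (n > 16) return -1;         /* sketch: fixed-size spawn list */
    for (int i = 0; i < n; i++) {
        args[i].in = xs[i];
        pthread_create(&tids[i], NULL, square_task, &args[i]);  /* spawn */
    }
    int sum = 0;
    for (int i = 0; i < n; i++) {
        pthread_join(tids[i], NULL);                            /* join */
        sum += args[i].out;
    }
    return sum;
}
```

On a single-threaded target, as noted above, the same source would degenerate to plain calls in a loop.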

Would the above proposal be of interest to LLVM community and Google Summer of Code? If you have any questions/suggestions/ideas please let me know.

Thanks,
Milos.

Hi Milos,

Milos Puzovic wrote:

Hi,

I would like to extend LLVM IR with two new intrinsics: spawn and join. The spawn intrinsic will indicate that the call it modifies can safely run in parallel, while the join intrinsic will indicate that execution of the current call cannot continue until all previously spawned calls in the list/array passed to it have completed and returned their results. This extension would allow us to express strict multi-threaded computations and introduces the possibility of new optimisation passes for parallel computations (see below for some of them).

Once this extension is added, I plan to demonstrate its usability and expressiveness in two ways. Firstly, I will implement a randomized work-stealing algorithm in VMKit.

Can you be more verbose on this? Are you planning to implement some JVM or .Net extension for supporting your new intrinsics? Or are you just looking for a runtime with a GC?

This will effectively find an execution schedule that obeys the constraints imposed by the multi-threaded computation. Secondly, I will extend the back-end of a multi-threaded processor (such as the XCore) to create threads when a call is modified with the spawn intrinsic and synchronise them when the join intrinsic is encountered.

Why do you need to hack on a processor backend for such purpose? Aren't your OS primitives sufficient?

Nicolas

Hi Nicolas,

2009/3/30 Nicolas Geoffray <nicolas.geoffray@lip6.fr>

Can you be more verbose on this? Are you planning to implement some JVM
or .Net extension for supporting your new intrinsics? Or are you just
looking for a runtime with a GC?

At the moment I am not looking to add any new extensions to JVM or .NET. I would need a runtime with a GC to demonstrate and test building execution schedules using the randomized work-stealing algorithm, and I thought I could factor out some code from VMKit that would help me get there. Otherwise, I would look at working with the LLVM interpreter to support multi- and many-core execution with the new intrinsics. Once I am happy with the performance, I will look into adding extensions to JVM and .NET.
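For readers unfamiliar with work stealing, the heart of such a scheduler is a per-worker deque: the owning worker pushes and pops tasks at the bottom (LIFO), while idle workers steal from the top (FIFO) of a randomly chosen victim. Below is a deliberately simple, mutex-based C sketch of that deque protocol; a production scheduler would use a lock-free design such as the Chase-Lev deque, and all names here are illustrative.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define DEQUE_CAP 64

/* One deque per worker: owner operates at the bottom, thieves at the top.
 * Mutex-based for clarity only; real implementations are lock-free. */
typedef struct {
    void *items[DEQUE_CAP];
    int top, bottom;               /* steal at top, push/pop at bottom */
    pthread_mutex_t lock;
} ws_deque;

void ws_init(ws_deque *d) {
    d->top = d->bottom = 0;
    pthread_mutex_init(&d->lock, NULL);
}

void ws_push(ws_deque *d, void *task) {          /* owner only */
    pthread_mutex_lock(&d->lock);
    d->items[d->bottom++ % DEQUE_CAP] = task;
    pthread_mutex_unlock(&d->lock);
}

void *ws_pop(ws_deque *d) {                      /* owner: newest task */
    pthread_mutex_lock(&d->lock);
    void *t = (d->bottom > d->top) ? d->items[(--d->bottom) % DEQUE_CAP] : NULL;
    pthread_mutex_unlock(&d->lock);
    return t;
}

void *ws_steal(ws_deque *d) {                    /* thief: oldest task */
    pthread_mutex_lock(&d->lock);
    void *t = (d->bottom > d->top) ? d->items[d->top++ % DEQUE_CAP] : NULL;
    pthread_mutex_unlock(&d->lock);
    return t;
}
```

The owner working LIFO keeps its cache warm, while thieves stealing FIFO tend to take the largest remaining subcomputations, which is what gives randomized work stealing its good load-balancing behaviour.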

Why do you need to hack on a processor backend for such purpose? Aren’t
your OS primitives sufficient?

Yes they are, but I also want to exploit any specific instructions that multi-threaded processors provide for creating and distributing threads. For example, the XCore ISA has an instruction TSTART to start a thread and a group of instructions starting with TINIT for initialising different aspects of thread state. It would then be interesting to compare how the generated code handles load balancing compared to the work-stealing algorithm, and whether the two can work together.

Milos.

Can you not achieve the same effect without adding intrinsics? Insert function calls to __spawn and __join pseudo-functions instead?

2009/3/30 Milos Puzovic <milos.puzovic@gmail.com>

2009/3/30 someguy <just.s0m3.guy+llvmdev@gmail.com>

Can you not achieve the same effect without adding intrinsics? Insert function calls to __spawn and __join pseudo-functions instead?

It would make LLVM code generation more difficult, because instead of building a new instruction (in this case an intrinsic) you would be building complex function calls, and therefore making many more changes to the existing front-end of your language if you want it to support parallel programming. Analysis of generated LLVM code would also become more difficult, because instead of simply looking at the instructions you would need to inspect function calls to see whether they call any pseudo-functions, which might require changes to the LLVM API as well. Finally, the same goes when writing a back-end for the processor of your choice: you would really like to just add new code for building threads, not modify the existing code for function calls. In summary, adding intrinsics gives you structure and nice code :)

Ideally, I would really like to see these intrinsics as instructions. However, because adding new instructions to the LLVM language requires changing all of the LLVM transformations, and I really want to have a working version of the code and useful results by the end of GSoC, I have decided to go this way. Once it is demonstrated (if :) that this extension does aid multi-core code generation and the LLVM community is happy with it, I would be happy to add them as instructions.

Thanks,
Milos.

The reason for my suggestion is that adding intrinsics is also considered to be a breaking change.

This is the third question/suggestion in the last week or so where the intrinsics/pseudo-function idea has been raised. Perhaps what LLVM really needs is a 'generic' pseudo-intrinsic of some kind?

2009/3/30 Milos Puzovic <milos.puzovic@gmail.com>

I think your idea is very interesting. However, some of the issues that concern you might not be as bad as you think.

2009/3/30 someguy <just.s0m3.guy+llvmdev@gmail.com>

Can you not achieve the same effect without adding intrinsics? Insert function calls to __spawn and __join pseudo-functions instead?

It would make LLVM code generation more difficult, because instead of building a new instruction (in this case an intrinsic) you would be building complex function calls, and therefore making many more changes to the existing front-end of your language if you want it to support parallel programming.

Is the user expected to add the calls to spawn/join or the compiler? If it’s the compiler adding them, then you don’t need to change the front-end at all, you can do all that in an optimization pass. If it’s the user adding them, then adding calls to “__spawn()” that the compiler will intercept and change into “pthread_create” or whatever, doesn’t require changing the front-end either.
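The interception approach described here can be made concrete with a small C sketch: the user (or front-end) writes calls to __spawn/__join, and either an IR pass rewrites them or, as in this minimal runtime-library version, they are simply defined on top of pthreads. The names and signatures below are assumptions for illustration, not an existing API.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical runtime half of the pseudo-function approach: code calls
 * __spawn/__join, and a small library implements them with pthreads.
 * A compiler pass could equally intercept and rewrite these calls. */

typedef struct { pthread_t tid; } spawn_handle;

spawn_handle __spawn(void *(*fn)(void *), void *arg) {
    spawn_handle h;
    pthread_create(&h.tid, NULL, fn, arg);   /* "spawn" the call */
    return h;
}

void *__join(spawn_handle h) {
    void *result = NULL;
    pthread_join(h.tid, &result);            /* wait for its result */
    return result;
}

/* Tiny usage example: compute 3*x on another thread (LP64 assumed for
 * the pointer/long round-trip). */
static void *triple(void *p) { return (void *)(3 * (long)p); }

long spawn_triple(long x) {
    return (long)__join(__spawn(triple, (void *)x));
}
```

Note that double-underscore identifiers are reserved in C, which is in keeping with the idea that these pseudo-functions belong to the compiler/runtime rather than to user code.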

Analysis of generated LLVM code would also become more difficult, because instead of simply looking at the instructions you would need to inspect function calls to see whether they call any pseudo-functions, which might require changes to the LLVM API as well.

I don’t see why looking at function calls is hard. If the compiler knows which functions to look for, has exact knowledge of the semantics/side-effects/dependencies of the calls and assumes that there can never be a different function that happened to have the same name, then analyzing and manipulating function calls is the same as manipulating intrinsics.

Finally, the same goes when writing a back-end for the processor of your choice: you would really like to just add new code for building threads, not modify the existing code for function calls. In summary, adding intrinsics gives you structure and nice code :)

If you are planning to use something like pthreads, the compiler will need to add function calls to the code anyway.

Hi Anthony,

2009/3/30 Anthony Danalis <adanalis@eecs.utk.edu>

Is the user expected to add the calls to spawn/join or the compiler? If it’s the compiler adding them, then you don’t need to change the front-end at all, you can do all that in an optimization pass. If it’s the user adding them, then adding calls to “__spawn()” that the compiler will intercept and change into “pthread_create” or whatever, doesn’t require changing the front-end either.

At the moment I am interested in languages where parallelism is explicit, and I want to leave the way language developers express that parallelism entirely to them. I wouldn't like to restrict them with a special construct or keyword; I want to leave them as much freedom as possible. Also, I would like (and I think this is very important) to try to separate sequential calls from parallel calls.

I don’t see why looking at function calls is hard. If the compiler knows which functions to look for, has exact knowledge of the semantics/side-effects/dependencies of the calls and assumes that there can never be a different function that happened to have the same name, then analyzing and manipulating function calls is the same as manipulating intrinsics.

I agree it is not hard, but it can be tiresome, because as you have explained there is a lot of knowledge the compiler needs to acquire and a lot of assumptions it needs to make (and some of them might not be guaranteed if you get, for example, a third-party implementation). By introducing this new level of abstraction, my hope is to get a much cleaner code base.

Thanks,
Milos.

Just to add 2 comments to this discussion:

-- Hans Boehm had an excellent paper in PLDI a few years ago arguing that "threads cannot be implemented as a library." (Google for that phrase.) This is a fundamental problem with compiler optimizations for code that uses libraries like pthreads, and it argues in favor of having explicit support in the IR.

-- On the other hand, explicitly parallel languages like Java should be able to compile to the 'spawn' instruction but finding 'join' points is much more difficult. E.g., there are ways to express "ordering" synchronization that could require non-trivial interprocedural analysis to extract.

--Vikram
Associate Professor, Computer Science
University of Illinois at Urbana-Champaign
http://llvm.org/~vadve

someguy wrote:

The reason for my suggestion is that adding intrinsics is also considered to be a breaking change.

This is the third question/suggestion in the last week or so where the intrinsics/pseudo-function idea has been raised. Perhaps what LLVM really needs is a 'generic' pseudo-intrinsic of some kind?

I think part of the reason this happens is that the docs on extending LLVM suggest intrinsics as the first approach. It might make sense to clarify that intrinsics are only recommended where special code generation is required, and to suggest using a library interface for anything that can be removed before code generation through IR-to-IR lowering.

Luke

2009/3/30 Vikram S. Adve <vadve@cs.uiuc.edu>

[snip]

– On the other hand, explicitly parallel languages like Java should
be able to compile to the ‘spawn’ instruction but finding ‘join’
points is much more difficult. E.g., there are ways to express
“ordering” synchronization that could require non-trivial
interprocedural analysis to extract.

One paradigm I had in mind when proposing the join intrinsic was the join calculus, where join points are defined explicitly, thus simplifying their discovery. There has been some work to extend Java with it (Hardware Join Java), and there are also extensions to C#, VB and OCaml, while the language Comega was designed around the join calculus. Compared to these languages, I guess Java would be explicit in spawning but implicit in joining :)

Milos.