Available code-generation parallelism

I am interested in making my LLVM front-end multi-threaded in a way
similar to the GCC compiler server proposal and was wondering about the
extent that the LLVM passes support it.

Expression-at-a-time parallel construction:
If function definitions are built purely depth-first, such that the
parent pointers are not provided as they are created, what will break?
I noted that the function and module verifiers aren't complaining, at
least not yet. Is there a generic "fixup upward-pointing parent
pointers" pass that can be run afterwards? If not, do I need to
implement and perform that pass? I suspect that emitting code for
individual expressions in parallel will probably end up being too
fine-grained, which leads me to...

Function-at-a-time parallel construction:
Which (if any) LLVM objects support the object-level thread safety
guarantee? If I construct two separate function pass managers in
separate threads and use them to optimize and emit object code for
separate llvm::Function definitions in the program, will this work?
Same question for llvm::Modules.

Thanks,
-Jonathan Brandmeyer

Unfortunately, the LLVM code generator is not *yet* thread safe. My understanding is that there isn't a huge amount of work needed to make it so, but I don't have the details.

Evan

> I am interested in making my LLVM front-end multi-threaded in a way
> similar to the GCC compiler server proposal and was wondering about the
> extent that the LLVM passes support it.

Do you have a link for this? I'm not familiar with any parallelism proposed by that project. My understanding was that it was mostly about sharing across invocations of the compiler.

> Expression-at-a-time parallel construction:
> If function definitions are built purely depth-first, such that the
> parent pointers are not provided as they are created, what will break?
> I noted that the function and module verifiers aren't complaining, at
> least not yet. Is there a generic "fixup upward-pointing parent
> pointers" pass that can be run afterwards? If not, do I need to
> implement and perform that pass? I suspect that emitting code for
> individual expressions in parallel will probably end up being too
> fine-grained, which leads me to...

Are you talking about building your AST, or about building LLVM IR? The rules for constructing your AST are pretty much defined by you. The rules for constructing LLVM IR are a bit more tricky. The most significant issue right now is that certain objects in LLVM IR are uniqued (like constants), and these have use/def chains. Since use/def chain updating is not atomic or locked, this means that you can't create LLVM IR on multiple threads. This is something that I'm very much interested in solving someday, but no one is working on it at this time (that I'm aware of).
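The uniquing being described can be sketched with a toy model (the class names here are illustrative, not LLVM's real API): every request for the same constant returns the same object, and recording a use mutates that shared object's use list without any lock.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

struct ToyConstant;

// A "use": one operand slot somewhere in the IR referring to a constant.
struct ToyUse {
    ToyConstant* constant;
};

// A uniqued constant: one object per (bit-width, value) pair, with a
// shared list of all its uses -- the unprotected state in question.
struct ToyConstant {
    int width;
    int value;
    std::vector<ToyUse*> uses;
};

// The uniquing table, like LLVM's constant pool: the same key always
// yields the same pointer, so by-pointer equality works.
ToyConstant* getConstant(int width, int value) {
    static std::map<std::pair<int, int>, ToyConstant*> pool;
    ToyConstant*& c = pool[{width, value}];
    if (!c) c = new ToyConstant{width, value, {}};
    return c;
}

// Recording a use appends to the constant's shared use list. Two
// threads doing this on the same constant race on the vector -- this
// is the non-atomic use/def update described above.
ToyUse* addUse(ToyConstant* c) {
    ToyUse* u = new ToyUse{c};
    c->uses.push_back(u);  // NOT thread-safe
    return u;
}
```

Because the pool hands every thread the same object, even two functions that have nothing to do with each other collide as soon as both use, say, `i32 0`.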

> Function-at-a-time parallel construction:
> Which (if any) LLVM objects support the object-level thread safety
> guarantee? If I construct two separate function pass managers in
> separate threads and use them to optimize and emit object code for
> separate llvm::Function definitions in the program, will this work?
> Same question for llvm::Modules.

Unfortunately, for the above reason... basically none. The LLVM code generators are actually very close to being able to run in parallel. The major issue is that they run a few LLVM IR-level passes first (LSR and codegen prepare) that hack on LLVM IR before the code generators run. Because of this, they inherit the limitations of LLVM IR passes. Very long term, I'd really like to make the code generators not affect the LLVM IR fed into them, but this is not likely to happen anytime in the near future.

If you're interested in this, tackling the use/def atomicity issues would be a great place to start.

-Chris

> > I am interested in making my LLVM front-end multi-threaded in a way
> > similar to the GCC compiler server proposal and was wondering about
> > the
> > extent that the LLVM passes support it.

> Do you have a link for this? I'm not familiar with any parallelism
> proposed by that project. My understanding was that it was mostly
> about sharing across invocations of the compiler.

> > Expression-at-a-time parallel construction:
> > If function definitions are built purely depth-first, such that the
> > parent pointers are not provided as they are created, what will break?
> > I noted that the function and module verifiers aren't complaining, at
> > least not yet. Is there a generic "fixup upward-pointing parent
> > pointers" pass that can be run afterwards? If not, do I need to
> > implement and perform that pass? I suspect that emitting code for
> > individual expressions in parallel will probably end up being too
> > fine-grained, which leads me to...

> Are you talking about building your AST, or about building LLVM IR?
> The rules for constructing your AST are pretty much defined by you.
> The rules for constructing LLVM IR are a bit more tricky. The most
> significant issue right now is that certain objects in LLVM IR are
> uniqued (like constants), and these have use/def chains. Since use/def
> chain updating is not atomic or locked, this means that you can't
> create LLVM IR on multiple threads. This is something that I'm very
> much interested in solving someday, but no one is working on it at
> this time (that I'm aware of).

What about "inventing" pseudo-constants (which point to the right
thing) and building the piece of IR with them? When done, grab a mutex
and RAUW them in. Alternatively, submit to a privileged thread that
performs the RAUW.
The trick is to prepare the def/use chain(s) to a degree that the
mutex is only held for a minimal time. If only IR-builder threads are
running concurrently, there is no danger that a real constant vanishes,
leaving behind a stale reference from a pseudo-constant.
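A minimal single-threaded sketch of this scheme, using toy classes rather than LLVM's API: each thread builds with private pseudo-constants that remember the operand slots referring to them, and a commit step holds the pool mutex just long enough to unique the real constant and rewrite those slots.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <utility>
#include <vector>

struct Value {
    virtual ~Value() {}
};

struct Instruction : Value {
    std::vector<Value*> operands;
};

// Thread-private stand-in for a constant. It records every operand
// slot that refers to it, so the commit step can RAUW it cheaply.
struct PseudoConstant : Value {
    int width, value;
    std::vector<std::pair<Instruction*, std::size_t>> slots;
    PseudoConstant(int w, int v) : width(w), value(v) {}
};

// The shared, uniqued constant pool and the mutex guarding it.
struct RealConstant : Value {
    int width, value;
    RealConstant(int w, int v) : width(w), value(v) {}
};
std::mutex poolLock;
std::map<std::pair<int, int>, RealConstant*> pool;

// IR construction: touches only thread-private state, no locking.
void useConstant(PseudoConstant* pc, Instruction* inst) {
    pc->slots.push_back({inst, inst->operands.size()});
    inst->operands.push_back(pc);
}

// Commit: hold the mutex only for the uniquing lookup and the
// prepared slot rewrites (the mini-RAUW).
void commit(PseudoConstant* pc) {
    std::lock_guard<std::mutex> g(poolLock);
    RealConstant*& rc = pool[{pc->width, pc->value}];
    if (!rc) rc = new RealConstant(pc->width, pc->value);
    for (auto& slot : pc->slots)
        slot.first->operands[slot.second] = rc;
}

// Single-threaded demo of the flow; in real use, construction and
// commit would run on worker threads.
Instruction* buildAndCommit(int width, int value) {
    PseudoConstant* pc = new PseudoConstant(width, value);
    Instruction* inst = new Instruction;
    useConstant(pc, inst);
    commit(pc);
    return inst;
}
```

The key property is that the slot list is prepared outside the critical section, so the lock covers one map lookup plus a handful of pointer stores.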

Any major headaches I have ignored?

Cheers,

   Gabor

That could work. It would also have to be done for global values, inline asm objects, etc. However, I don't see any show-stoppers. The implementation could be tricky, but a nice property of your approach is that the single-threaded case could be made particularly fast (instead of always doing atomic ops or locking).

-Chris

> What about "inventing" pseudo-constants (which point to the right
> thing) and building the piece of IR with them? When done, grab a mutex
> and RAUW them in. Alternatively, submit to a privileged thread that
> performs the RAUW.
> The trick is to prepare the def/use chain(s) to a degree that the
> mutex is only held for a minimal time. If only IR-builder threads are
> running concurrently, there is no danger that a real constant vanishes,
> leaving behind a stale reference from a pseudo-constant.

> That could work. It would also have to be done for global values,
> inline asm objects, etc. However, I don't see any show-stoppers. The
> implementation could be tricky, but a nice property of your approach
> is that the single-threaded case could be made particularly fast
> (instead of always doing atomic ops or locking).

It might work for the IR construction phase, but not for optimization
and emitting object code. The locking issue is going to be severe
because it will be nearly (completely?) impossible to guarantee a
globally consistent lock order for any given def/use chain. Therefore,
such a solution would require a kind of high-level contention manager
akin to software transactional memory (STM). Even the fastest STMs in
research are much slower than locking. I think that there is a better
way.

I would like to propose a different solution: Lift all internalized
objects to be unique at the Module level instead of globally. This will
require an initial pass to be performed (called IRFinalize?), and
equality of Type objects by pointer comparison will not be valid until
after this pass is complete. The Module is already the unit of
compilation once LLVM IR has been initially emitted for most cases, and
it should be straightforward to structure the pass such that it can work
on single functions if the user is compiling at that level.

The IRFinalize pass can allocate the bookkeeping storage for identifying
duplicate Constants and Types and then release it so long as none of the
optimization and analysis passes 1) emit new Types or 2) are broken by
duplicate Constants.
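A toy sketch of what such a finalize step might look like (illustrative classes, not LLVM's real Type machinery): each module uniques privately, and the pass picks one canonical object per structural key and hands back a remapping.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Illustrative stand-in for a type or constant; `key` encodes its
// structure, e.g. "{i32, i8*}".
struct ToyType {
    std::string key;
};

// Per-module uniquing table: duplicates may exist ACROSS modules, so
// pointer equality is not valid until the finalize pass has run.
struct ModuleTable {
    std::map<std::string, ToyType*> types;
    ToyType* get(const std::string& key) {
        ToyType*& t = types[key];
        if (!t) t = new ToyType{key};
        return t;
    }
};

// The "IRFinalize" step: choose one canonical object per key and
// return the remapping callers would apply to their IR.
std::map<ToyType*, ToyType*> irFinalize(const std::vector<ModuleTable*>& mods) {
    std::map<std::string, ToyType*> canon;
    std::map<ToyType*, ToyType*> remap;
    for (ModuleTable* m : mods) {
        for (const auto& kv : m->types) {
            ToyType*& c = canon[kv.first];
            if (!c) c = kv.second;
            remap[kv.second] = c;
        }
    }
    return remap;
}
```

Until `irFinalize` runs, two modules may hold distinct objects for the same structure, which is exactly why by-pointer equality has to be deferred.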

-Jonathan

PS: What is RAUW? I'll volunteer the clerical work of adding it to the
Lexicon if you'd be kind enough to hand me a small dose of clue :-)

Jonathan Brandmeyer wrote:

PS: What is RAUW? I'll volunteer the clerical work of adding it to the
Lexicon if you'd be kind enough to hand me a small dose of clue :-)

ReplaceAllUsesWith. It can be used to substitute any one Value for another (except that on a Constant, replaceUsesOfWithOnConstant should be used instead).

For example, given:

   %x = add i32 %i, 0

an optimization would call %x->RAUW(%i).
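A toy version of the mechanism (LLVM's real Value/Use lists are intrusive and more elaborate; this only shows the behavior):

```cpp
#include <cassert>
#include <vector>

// Minimal value with explicit operand and user lists, just enough to
// demonstrate what ReplaceAllUsesWith does.
struct ToyValue {
    std::vector<ToyValue*> operands;  // values this one uses
    std::vector<ToyValue*> users;     // values that use this one

    void addOperand(ToyValue* v) {
        operands.push_back(v);
        v->users.push_back(this);
    }

    // RAUW: every operand slot anywhere that points at `this` is
    // redirected to `replacement`.
    void replaceAllUsesWith(ToyValue* replacement) {
        for (ToyValue* user : users) {
            for (ToyValue*& op : user->operands)
                if (op == this) op = replacement;
            replacement->users.push_back(user);
        }
        users.clear();
    }
};
```

So for the `%x = add i32 %i, 0` example, every instruction that used `%x` ends up using `%i` directly, and `%x` is left with no users and can be erased.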

Nick

Jonathan Brandmeyer wrote:

It might work for the IR construction phase, but not for optimization
and emitting object code. The locking issue is going to be severe
because it will be nearly (completely?) impossible to guarantee a
globally consistent lock order for any given def/use chain. Therefore,
such a solution would require a kind of high-level contention manager
akin to software transactional memory (STM). Even the fastest STMs in
research are much slower than locking. I think that there is a better
way.

We in the TM community are always looking for "real-world" applications that can benefit from TM for benchmarking purposes. In the past we have found that unstructured graph and worklist algorithms are a prime example where TM can really help.

If you can provide a reference I would be interested in looking at the code inside LLVM for transformation/codegen passes that result in the problems that you describe.

-Luke

> > What about "inventing" pseudo-constants (which point to the right
> > thing) and building the piece of IR with them? When done, grab a mutex
> > and RAUW them in. Alternatively, submit to a privileged thread that
> > performs the RAUW.
> > The trick is to prepare the def/use chain(s) to a degree that the
> > mutex is only held for a minimal time. If only IR-builder threads are
> > running concurrently, there is no danger that a real constant vanishes,
> > leaving behind a stale reference from a pseudo-constant.

> That could work. It would also have to be done for global values,
> inline asm objects, etc. However, I don't see any show-stoppers. The
> implementation could be tricky, but a nice property of your approach
> is that the single-threaded case could be made particularly fast
> (instead of always doing atomic ops or locking).

> It might work for the IR construction phase, but not for optimization
> and emitting object code. The locking issue is going to be severe

When optimization is done, you can certainly go on and emit code
in parallel for many functions. Since the IR is not mutated any
more, nothing needs to be locked. I believe the SDGraphs can be
isolated from each other. But Dan may know better...

Cheers,

    Gabor

> > I am interested in making my LLVM front-end multi-threaded in a way
> > similar to the GCC compiler server proposal and was wondering about
> > the
> > extent that the LLVM passes support it.

> Do you have a link for this? I'm not familiar with any parallelism
> proposed by that project. My understanding was that it was mostly
> about sharing across invocations of the compiler.

Nope, you're right. I'm not sure where I got that idea, but I certainly
don't see it in their whitepaper.

> Are you talking about building your AST, or about building LLVM IR?
> The rules for constructing your AST are pretty much defined by you.
> The rules for constructing LLVM IR are a bit more tricky. The most
> significant issue right now is that certain objects in LLVM IR are
> uniqued (like constants), and these have use/def chains. Since use/def
> chain updating is not atomic or locked, this means that you can't
> create LLVM IR on multiple threads. This is something that I'm very
> much interested in solving someday, but no one is working on it at
> this time (that I'm aware of).

I'm referring to implementing the construction, optimization, and object
code generation in parallel.

> > Function-at-a-time parallel construction:
> > Which (if any) LLVM objects support the object-level thread safety
> > guarantee? If I construct two separate function pass managers in
> > separate threads and use them to optimize and emit object code for
> > separate llvm::Function definitions in the program, will this work?
> > Same question for llvm::Modules.

> Unfortunately, for the above reason... basically none. The LLVM code
> generators are actually very close to being able to run in parallel.
> The major issue is that they run a few LLVM IR-level passes first (LSR
> and codegen prepare) that hack on LLVM IR before the code generators
> run. Because of this, they inherit the limitations of LLVM IR
> passes. Very long term, I'd really like to make the code generators
> not affect the LLVM IR fed into them, but this is not likely to
> happen anytime in the near future.

> If you're interested in this, tackling the use/def atomicity issues
> would be a great place to start.

What about lazy unification of uniqued values after IR construction? If
that pass is performed on a per-module basis, then all of the Modules
will be isolated in memory from each other. The front-end can partition
its source into N modules in whatever way it sees fit. Then it can
instantiate a PassManager and Module per thread and build the IR into
them. That isn't quite as nice as taking advantage of per-function
parallelism where the individual passes allow it, but it would be a step
in the right direction.
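Assuming toy stand-ins for modules, the partitioning scheme might look like this: each thread owns one module outright, so construction needs no locks at all.

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <vector>

// Illustrative module: just a list of function names.
struct ToyModule {
    std::vector<std::string> functions;
};

// Worker: builds into ITS OWN module only, touching no shared state.
void buildModule(ToyModule* m, int id, int nFuncs) {
    for (int i = 0; i < nFuncs; ++i)
        m->functions.push_back("fn_" + std::to_string(id) + "_" +
                               std::to_string(i));
}

// The front-end partitions its source into N modules, one per thread.
std::vector<ToyModule> buildAll(int nThreads, int perModule) {
    std::vector<ToyModule> mods(nThreads);  // sized up front; no realloc
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; ++t)
        workers.emplace_back(buildModule, &mods[t], t, perModule);
    for (std::thread& w : workers) w.join();
    return mods;
}
```

In the real system, each thread would also own its PassManager, and the lazy unification pass would run after the join.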

Why are Constants uniqued? Is it purely for the memory savings?

-Jonathan

Jonathan Brandmeyer wrote:

What about lazy unification of uniqued values after IR construction? If
that pass is performed on a per-module basis, then all of the Modules
will be isolated in memory from each other. The front-end can partition
its source into N modules in whatever way it sees fit. Then it can
instantiate a PassManager and Module per thread and build the IR into
them. That isn't quite as nice as taking advantage of per-function
parallelism where the individual passes allow it, but it would be a step
in the right direction.

Going further, why should it be per-module? We could let the front-end decide whether uniqued values like constants, types, etc. are per-function, per-module, or global.

When using LLVM as a JIT (e.g. vmkit), the nice thing about making it per-function is that we can delete all the constants a function uses after JIT code generation. For example, I've noticed that 5 MB are used only for constants when running vmkit with a large application (e.g. Tomcat). I would certainly like those 5 MB to be reclaimed once JIT code generation is finished.
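One way to sketch the per-function variant (toy code, not vmkit's or LLVM's actual implementation): tie the uniquing table's lifetime to the function, so everything it interned is freed when codegen for that function finishes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>

// A per-function constant pool: uniquing is scoped to one function,
// and everything it owns is freed when the pool is destroyed, i.e.
// once JIT codegen for that function is done.
struct FunctionConstantPool {
    // (width, value) -> interned storage for the constant
    std::map<std::pair<int, int>, int*> pool;

    int* get(int width, int value) {
        int*& p = pool[{width, value}];
        if (!p) p = new int(value);
        return p;
    }

    std::size_t size() const { return pool.size(); }

    ~FunctionConstantPool() {
        for (auto& kv : pool) delete kv.second;  // reclaim after codegen
    }
};
```

The trade-off is that identical constants in different functions are no longer the same object, so pointer equality would only hold within a function.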

Nicolas

SelectionDAGs have a few things that are globally uniqued as well,
such as type lists, but those could be fixed if needed. Also, the
pool-allocated memory is currently reused for each SelectionDAG,
though that could easily be reorganized in order to allow multiple
SelectionDAGs to be processed at the same time.

SelectionDAGs do contain references back to the LLVM IR for some
things, and there are some situations where new LLVM IR Constants
are created during SelectionDAG processing.

Dan

It's more than that. Constants can be arbitrarily complicated (arrays of structs of constantexprs of...), and uniquing allows by-pointer equality comparisons. Likewise for types.

-Chris