Multi-Threading Compilers

Greetings All,

For the last few months, since the GCC Cauldron, I've been researching and helping to plan
multi-threading for GCC. I've posted my GCC ideas here:
https://docs.google.com/document/d/1po_RRgSCtRyYgMHjV0itW8iOzJXpTdHYIpC9gUMjOxk/edit

I've heard there is interest on the LLVM side as well. I've talked to some of the IBM folks from
the Toronto area about it face to face, as they have more middle-end and back-end knowledge
than I do.

Not sure if anyone is interested in my ideas or wants me to extend my help to the LLVM side,

Nick

Hi Nicholas,

You might want to check out MLIR: its pass manager is already automatically and implicitly multithreaded.

-Chris


Chris,

I was aware that LLVM was moving to MLIR at some point, due to this. I'm curious as
to how MLIR deals with IPO, as that's the problem I was running into. Even if you have
pipelines, what guarantees do we have about IPO, or are there any? For example, if an
analysis pass runs after a loop pass so its heuristics get good data, does MLIR ensure
that? The problem isn't getting a pipeline so much as IPO issues in terms of rescheduling
within a pipeline or incorrectly splitting work into pipelines. I've yet to find a good solution
on the GCC side either, and it seems that there will be issues with this no matter what;
I'm not sure what trade-offs the LLVM side is willing to make.

The other issue is that graph data structures, to my knowledge, do not allow insertion
of multiple nodes at the same time, and it's unclear how to scale the graphs for callers or
for SSA. I'm not sure if you have encapsulated the operators and data structures in a series
of classes, as that would be the first part. The other part is figuring out how to link into and
interact with build systems so that threads can be launched from make -j or similar.

Thanks for the pointer to MLIR; perhaps my IPO concern doesn't fully apply there,
but from reading parts of the documentation it still seems to be an issue, though a
somewhat smaller one. Thanks as well for the prompt response,

Nick


> Chris,
>
> I was aware that LLVM was moving to MLIR at some point, due to this. I'm curious as
> to how MLIR deals with IPO, as that's the problem I was running into. Even if you have
> pipelines, what guarantees do we have about IPO, or are there any? For example, if an
> analysis pass runs after a loop pass so its heuristics get good data, does MLIR ensure
> that? The problem isn't getting a pipeline so much as IPO issues in terms of rescheduling
> within a pipeline or incorrectly splitting work into pipelines. I've yet to find a good solution
> on the GCC side either, and it seems that there will be issues with this no matter what;
> I'm not sure what trade-offs the LLVM side is willing to make.

Hi Nicholas,

It is very difficult to mix things like loop passes and IPO passes in any system, unless one is prepared to preserve the analysis results that the other depends on. One nice thing about MLIR is that it defines this away by allowing explicit representation of loops in the IR. This means that loop structure isn't an analysis that gets "invalidated" like the existing LLVM LoopInfo analysis pass; it is just structurally impossible for this to happen. I don't think all of the analysis passes in LLVM have been super successful.
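To make that concrete, here is a toy C++ sketch (an illustration only, not LLVM's or MLIR's real classes) of the difference: a flat CFG keeps loop structure in a separate analysis object that can silently go stale, while a region-based IR nests the loop body inside the loop operation itself, so there is nothing to invalidate.

```cpp
// Toy illustration (not real LLVM/MLIR classes) of flat vs. structured loops.
#include <memory>
#include <vector>

// Flat-CFG style: loop structure lives in a separate analysis object that
// can silently go stale whenever a transformation edits the CFG.
struct BasicBlock { std::vector<int> insts; };
struct LoopInfo   { std::vector<BasicBlock *> loopHeaders; };  // derived data

// Region-based style: the loop *is* an operation that owns its body, so
// there is no separate loop analysis to invalidate; rewriting the loop
// updates the structure by construction.
struct Op {
  enum Kind { Plain, ForLoop } kind = Plain;
  std::vector<std::unique_ptr<Op>> body;  // for ForLoop: the nested body
};
```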

> The other issue is that graph data structures, to my knowledge, do not allow insertion
> of multiple nodes at the same time, and it's unclear how to scale the graphs for callers or
> for SSA. I'm not sure if you have encapsulated the operators and data structures in a series
> of classes, as that would be the first part. The other part is figuring out how to link into and
> interact with build systems so that threads can be launched from make -j or similar.

Yeah, MLIR handles that by treating global use-def chains on symbols (e.g. functions) differently from local use-def chains. This makes it much more efficient. You can find more information on MLIR symbols here: https://mlir.llvm.org/docs/SymbolsAndSymbolTables/

> Thanks for the pointer to MLIR; perhaps my IPO concern doesn't fully apply there,
> but from reading parts of the documentation it still seems to be an issue, though a
> somewhat smaller one. Thanks as well for the prompt response,

Sure, happy to help. It would be great to see a GIMPLE dialect in MLIR someday :)

-Chris

Chris,

I asked the GCC side what they want to do about MLIR and am waiting for a reply. Anyhow, what is the status, and what parts are we planning to move to MLIR in LLVM/Clang? I've not seen any discussion of that other than starting to plan for it. Perhaps we should start a wiki page for it, as I don't think all parts need to be MLIR. I don't have write access to the wiki, so unfortunately I can't start writing it up unless I get access.

Regards. This is one reason you do your research before writing a multi-threading program: a lot of the research and work has been done at this point.

Nick

As far as I know, there is no (detailed/discussed/agreed-upon/...) plan
to move any existing functionality in LLVM-Core or Clang to MLIR. There
are some people who have expressed interest, and there is Chris's plan
for how the transition could look.

If we want to look into multi-threading for the existing LLVM
infrastructure, there are certainly possibilities: computing (function)
analysis results prior to CGSCC and Module passes, for example. When it
comes to transformations it is less straightforward, but there is still
interest in looking into it.
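A minimal sketch of that first possibility (illustrative only; `Function` and `DemoAnalysis` are hypothetical stand-ins, not LLVM's real types): function analyses only read the IR, so they can be computed in parallel up front and cached for the serial passes that consume them.

```cpp
// Sketch: compute read-only per-function analyses in parallel, then let
// the (still serial) CGSCC/Module transformations consume the cache.
#include <future>
#include <map>
#include <string>
#include <vector>

struct Function { std::string name; /* ...IR... */ };
struct DemoAnalysis { int instCount = 0; };

DemoAnalysis computeAnalysis(const Function &F) {
  return DemoAnalysis{42};  // pretend: walk F's body, mutate nothing
}

std::map<const Function *, DemoAnalysis>
precomputeInParallel(const std::vector<Function> &module) {
  std::vector<std::future<DemoAnalysis>> jobs;
  for (const Function &F : module)
    jobs.push_back(std::async(std::launch::async, computeAnalysis, std::cref(F)));

  std::map<const Function *, DemoAnalysis> results;
  for (size_t i = 0; i < module.size(); ++i)
    results[&module[i]] = jobs[i].get();  // join; safe because nothing wrote
  return results;
}
```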

I'm traveling, but if you are interested we can discuss this further next week.

Cheers,
  Johannes

Johannes,
That's fine. Unfortunately, as the GCC side knows, I'm a student, so I will only be
working on this in my spare time. We did get somewhere with my paper here:
https://docs.google.com/document/d/1po_RRgSCtRyYgMHjV0itW8iOzJXpTdHYIpC9gUMjOxk/edit

I'm currently looking into the SSA trees, but we can discuss what you want me
to start researching first. The research does need to happen, including profiling,
to avoid mistakes in choosing major implementation details.

Otherwise, just ping me and the list when you're ready, or other people can,

Nick

> Anyhow, what is the status, and what parts are we planning to move to
> MLIR in LLVM/Clang? I've not seen any discussion of that other than
> starting to plan for it.

> As far as I know, there is no (detailed/discussed/agreed-upon/...) plan
> to move any existing functionality in LLVM-Core or Clang to MLIR. There
> are some people who have expressed interest, and there is Chris's plan
> for how the transition could look.

Yep, agreed. I gave a talk a couple of days ago (with Tatiana) with a proposed path forward, but explained it as one possible path. We'll share the slides publicly in a few days, after a couple of things get taken care of.

-Chris

+Chandler & Alina for new pass manager context

> Hi Nicholas,
>
> You might want to check out MLIR: its pass manager is already automatically and implicitly multithreaded.
>
> -Chris
> Chris,
>
> I was aware that LLVM was moving to MLIR at some point, due to this.

FWIW: I don’t /tihnk/ that’s quite an accurate representation of the situation. MLIR is now part of the LLVM project, but so far I haven’t seen any discussions of moving the core LLVM project itself to MLIR (admittedly I’m not super clear on the details of MLIR or what that’d look like - I would’ve guessed that LLVM would itself be a lower level that a given MLIR stack might lower to, etc). There’s been some discussion of folks interested in experimenting with using MLIR in Clang, though that’s not a project goal right now.

> I'm curious as to how MLIR deals with IPO, as that's the problem I was running into.

FWIW I believe LLVM’s new pass manager (NPM) was designed with parallelism and the ability to support this situation (that MLIR doesn’t? Or doesn’t to the degree/way in which the NPM does). I’ll leave it to folks (Chandler probably has the most context here) to provide some more detail there if they can/have time.

Historically speaking, all of the LLVM pass managers have been designed to support multithreaded compilation (check out the ancient history of the WritingAnLLVMPass doc if curious).

The problem is that LLVM has global use-def chains on constants, functions and globals, etc, so it is impractical to do this. Every “inst->setOperand” would have to be able to take locks or use something like software transactional memory techniques in their implementation. This would be very complicated and very slow.
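A toy sketch of why (simplified, not LLVM's actual Value/Use classes): two threads rewriting different functions still collide on the shared use-list of any operand they have in common, such as a global.

```cpp
// Toy model of the shared use-list hazard in a use-def based IR.
#include <algorithm>
#include <vector>

struct Instruction;

struct Value {
  std::vector<Instruction *> uses;  // the use-list
};

struct Instruction {
  Value *operand = nullptr;

  void setOperand(Value *V) {
    if (operand) {  // unlink from the old value's use-list
      auto &u = operand->uses;
      u.erase(std::find(u.begin(), u.end(), this));
    }
    operand = V;
    V->uses.push_back(this);  // DATA RACE if another thread is also adding
                              // or removing a use of the same V; avoiding it
                              // would put a lock (or STM) on the hot path of
                              // every single IR mutation.
  }
};
```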

MLIR defines this away from the beginning. This is a result of the core IR design, not the pass manager design itself.

-Chris

> I'm curious as to how MLIR deals with IPO, as that's the problem I was running into.

> FWIW I believe LLVM's new pass manager (NPM) was designed with parallelism and the ability to support this situation (that MLIR doesn't? Or doesn't to the degree/way in which the NPM does). I'll leave it to folks (Chandler probably has the most context here) to provide some more detail there if they can/have time.

> Historically speaking, all of the LLVM pass managers have been designed to support multithreaded compilation (check out the ancient history of the WritingAnLLVMPass doc if curious).

I think the specific thing that might've been a bit different in the NPM was to do with analysis invalidation, handled in a way that's more parallelism-friendly than in the previous one - but I may be misrepresenting/misunderstanding some of it.

> The problem is that LLVM has global use-def chains on constants, functions and globals, etc, so it is impractical to do this. Every "inst->setOperand" would have to be able to take locks or use something like software transactional memory techniques in their implementation. This would be very complicated and very slow.

Oh, yeah - I recall that particular limitation being discussed/not addressed as yet.

> MLIR defines this away from the beginning. This is a result of the core IR design, not the pass manager design itself.

What does MLIR do differently here/how does it define that issue away? (doesn’t have use-lists built-in?)

- Dave

> What does MLIR do differently here/how does it define that issue away? (doesn't have use-lists built-in?)

The major thing is that constants and global-like objects don’t produce SSA values and thus don’t have use-lists. https://mlir.llvm.org/docs/Rationale/#multithreading-the-compiler discusses this a bit.

For constants, the data is stored as an Attribute (context-uniqued metadata that has no use-list and is not SSA). The attribute can either be placed in the attribute list (if the operand is always constant, like the value of a switch case), or it must be explicitly materialized via some operation. For example, the [std.constant](https://mlir.llvm.org/docs/Dialects/Standard/#constant-operation) operation will materialize an SSA value from some attribute data.

For references to functions and other global-like objects, we have a non-SSA mechanism built around symbols. This is essentially using a special attribute to reference the function by name, instead of by SSA value. You can find more information on MLIR symbols here: https://mlir.llvm.org/docs/SymbolsAndSymbolTables/

Along with the above, there is a trait that can be attached to operations called [IsolatedFromAbove](https://mlir.llvm.org/docs/Traits/#isolatedfromabove). This essentially means that no SSA values defined above a region can be referenced from within that region. The pass manager only allows scheduling passes on operations that have this property, meaning that all pipelines are implicitly multi-threaded.
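A toy C++ sketch of the by-name idea (an illustration, not MLIR's actual classes): a call site stores the callee's symbol name as plain data rather than holding an SSA use of the callee, so creating or erasing call sites never mutates the callee.

```cpp
// Sketch: symbol references instead of global use-def chains.
#include <map>
#include <string>
#include <vector>

struct CallOp { std::string calleeSymbol; };  // "by name", not a pointer

struct FuncOp {
  std::string symbolName;
  std::vector<CallOp> body;
  // IsolatedFromAbove-style invariant: `body` may only reference values
  // defined inside this function (plus symbols, by name).
};

struct ModuleOp {
  std::map<std::string, FuncOp> symbolTable;  // lookups happen here, under
                                              // whatever locking the symbol
                                              // table chooses to provide
};

// Two threads can rewrite the bodies of two different FuncOps concurrently;
// neither write goes anywhere near the callee's FuncOp object.
```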

The pass manager in MLIR was heavily inspired by the work on the new pass manager in LLVM, but with specific constraints/requirements that are unique to the design of MLIR. That being said, there are some usability features that were added and would also make great additions to LLVM: instance-specific pass options and statistics, pipeline crash-reproducer generation, etc.

Not sure if any of the above helps clarify, but happy to chat more if you are interested.

– River



River,

The big thing, from my reading of the pass manager in MLIR, is that it allows us to iterate through
a pass per function or module as a group, allowing it to run async. I've proposed this
on the GCC side:
https://gcc.gnu.org/ml/gcc/2020-02/msg00247.html

It's to walk through the IPA passes, which are similar to analysis passes on the LLVM side.

Nick

Hi Nicholas,

I can't say anything about the GCC side, but this isn't a particularly novel aspect of the MLIR pass manager. In many ways, the pass manager is the easiest/simplest part of the multi-threading problem. The bigger problem is making sure that the rest of the compiler infrastructure is structured in a way that is thread-safe, or can be made thread-safe. This is why most of the discussion centers on how to model things like constants, global values, etc. When I made MLIR multi-threaded a year ago, the large majority of my time was spent outside of the pass manager. For a real example, I spent much more time just on multi-threaded pass timing than on making the pass manager itself multi-threaded.
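As a rough illustration of the pass-timing point (a sketch, not MLIR's actual implementation): with a threaded pass manager you cannot keep one global stopwatch per pass; each worker accumulates locally and the results are merged at the end.

```cpp
// Sketch: per-thread pass timers merged under a lock when workers finish.
#include <chrono>
#include <map>
#include <mutex>
#include <string>

struct PassTimes {
  std::map<std::string, std::chrono::nanoseconds> perPass;
};

std::mutex mergeLock;
PassTimes mergedTimes;                 // final report lives here

thread_local PassTimes localTimes;     // no contention on the hot path

template <typename PassFn>
void runTimed(const std::string &passName, PassFn pass) {
  auto start = std::chrono::steady_clock::now();
  pass();
  localTimes.perPass[passName] += std::chrono::duration_cast<
      std::chrono::nanoseconds>(std::chrono::steady_clock::now() - start);
}

void flushThreadTimes() {              // called once per worker, at the end
  std::lock_guard<std::mutex> g(mergeLock);
  for (auto &kv : localTimes.perPass)
    mergedTimes.perPass[kv.first] += kv.second;
}
```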

– River

Actually, in my experience the biggest problem is whether we can detect IPO and provide async guarantees around it. MLIR runs operations, but only for a module or a set of functions, without this. One of my dreams would be to run passes in parallel, including IPO detection, stopping if we cannot continue past an IPO pass or set of passes due to changes. Maybe MLIR does do that, but it's the one bottleneck that is really hard to fix,

What MLIR does (and what would require quite some work in LLVM) is make sure that you can process and transform functions in isolation, allowing local optimizations to run in parallel. This does not solve the IPO problem you're after. As I understand it, that is a difficult thing to design, and it requires rethinking the passes and the pass pipeline entirely.

Running function passes and "local" optimizations in parallel in LLVM isn't possible because the structures in the LLVMContext aren't thread-safe, and because the IR itself isn't thread-safe. Even something as simple as DCE'ing or CSE'ing a function call requires modifying the callee (through its use-list).

I've been thinking about this on the GCC side for the last few months. It's non-trivial, though something similar may be possible; I will need to research both pass managers better. I'm looking at some RFCs for the pass manager and SSA iterators on the GCC side.

One easy thing, though: in a lot of core classes in both GCC and LLVM there are loops that run a lot; LRA on the GCC side, for register allocation, is one example. We may want to figure out how, and whether it is a good idea, to run these loops in small worker functions that are only used by the one function that requires them. The real question is that some loops are only performance-intensive in corner cases, so we may need heuristics to decide when to launch workers, e.g. based on the number of elements being iterated over, etc. Maybe that's a bad idea, but after looking through the GCC and LLVM sides a little, this seems to be a later, tedious but mostly mechanical fix; the hard part is figuring out which loops are expensive enough to be worth it (see the sketch below).

I believe Johannes was going to reach out next week about the module classes, and we will go from there,

Nick
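A minimal sketch of the threshold-gated loop idea above (the cutoff value and the work function are hypothetical placeholders to be tuned by profiling): stay serial below the cutoff, split the range across workers above it.

```cpp
// Sketch: only parallelize a hot loop when the trip count justifies the
// thread overhead; below the threshold the serial path stays untouched.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kParallelThreshold = 4096;  // placeholder; profile it

void processElement(int &element) { element *= 2; }  // stand-in for real work

void processAll(std::vector<int> &elements) {
  if (elements.size() < kParallelThreshold) {  // heuristic gate
    for (int &e : elements) processElement(e);
    return;
  }
  unsigned n = std::max(1u, std::thread::hardware_concurrency());
  std::size_t chunk = (elements.size() + n - 1) / n;
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < n; ++t) {
    std::size_t lo = t * chunk;
    std::size_t hi = std::min(elements.size(), lo + chunk);
    if (lo >= hi) break;
    workers.emplace_back([&elements, lo, hi] {
      for (std::size_t i = lo; i < hi; ++i) processElement(elements[i]);
    });
  }
  for (auto &w : workers) w.join();
}
```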

This is a recent desktop.
Xubuntu 19.10
Compiling clang and llvm 10.0.0. See below.

For this test, I'm running 14 processors in a GUI VM. The cores are hyperthreaded, so logical processors are twice the cores, but all the cores show negligible activity before the run.

compile_commands.json has 3022 entries.

The ninja compile run lasted 7 minutes and 43 seconds with 99% usage across all processors throughout.

We then have 7*60+43 = 463 seconds.

That is 463/3022 ≈ 0.153 wall-clock seconds per entry in compile_commands.json. With 14 processors working in parallel, the average compile time per translation unit is about 14 × 0.153 ≈ 2.1 seconds.

cmake -G Ninja -DLLVM_ENABLE_PROJECTS="clang;llvm" -DLLVM_USE_LINKER=lld -DCMAKE_BUILD_TYPE="Release" -DLLVM_TARGETS_TO_BUILD=X86 -DLLVM_ENABLE_LIBPFM=OFF -DRUN_HAVE_GNU_POSIX_REGEX=0 -DRUN_HAVE_THREAD_SAFETY_ATTRIBUTES=0 -Wno-dev ../llvm &> /home/nnelson/Documents/cmake.log

ninja &> /home/nnelson/Documents/ninja.log

Here are some useful pages from Threading Building Blocks:

Task-Based Programming
Appendix A: Costs of Time Slicing

The point: when the number of compiles exceeds the number of cores such that all the cores are utilized, nothing is gained by trying to multi-thread the individual compiles. In fact, loading the cores with more threads or tasks than there are cores will reduce compiling efficiency because of time slicing. And sequencing through more tasks rather than fewer when the cores are not overloaded will also reduce compiling efficiency, because more tasks have to be loaded onto and unloaded from the cores.

Neil Nelson

That makes a lot of sense, Neil. Two counterpoints though:

  1. In development situations, it is common to rebuild one file (the one you changed) without rebuilding everything. Speeding that up is useful.

  2. LLVM is a library that is used for a lot more than just C compilers. Many of its use cases are not modeled by bulk “compile all the code” workloads like you describe.

You’re right that multithreading isn’t a panacea, but in my opinion, it is important to be able to provide access to multicore compilation for use cases that do benefit from it.

-Chris

In addition to what Chris said, there's also the case of large TUs / Unity files. Given that the compilation of a single TU is currently not multi-threaded, you can get "spikes" where only one core (out of many) is working; a minute is wasted while I have many cores available.

For this specific case, not having Unity files makes the build uniform (no such spikes), but it takes 10x more time to compile (20k TUs and 25k .h files compiled, which reduce down to ~600 Unity files).

To fix the spike, you then have to resort to an iterative optimization algorithm, running nightly, to find the best trade-off between Unity file size and build times. This complicates things further.

The point being: +1 for multi-threading the compiler :)

Alex.