[LLVM Dev] [Discussion] Function-based parallel LLVM backend code generation

Hi, community:

For the sake of our business need, I want to enable "Function-based parallel code generation" to boost up the compilation of single module, please see the details of the design and provide your feedbacks on below aspects, thanks!
1. Is this idea the proper solution for my requirement
2. This new feature will be enabled by llc -thd=N and has no impact on original llc when -thd=1
3. Can this new feature of llc be accepted by community and merged into LLVM code tree

Patches
The patch is divided into four separated parts, the all-in-one patch could be found here:
http://llvm-reviews.chandlerc.com/D1152

Design

Background
1. Our business need to compile C/C++ source files into LLVM IR and link them into a big BC file; the big BC file is then compiled into binary code on different arch/target devices.
2. Backend code generation is a time-consuming activity happened on target device which makes it an important user experience.
3. Make -j or file based parallelism can't help here since there is only one big BC file; function-based parallel LLVM backend code generation is a good solution to improve compilation time which will fully utilize multi-cores.

Overall design strategy and goal
1. Generate totally same binary as what single thread output
2. No impacts on single thread performance & conformance
3. Little impacts on LLVM code infrastructure

Current status and test result
1. Parallel llc can generate same code as single thread by "objdump -d", it could pass 10 hours stress test for all performance benchmark
2. Parallel llc can introduce ~2.9X performance gain on XEON sever for 4 threads

Thanks
Wan Xiaofei

Parallel.CG.7z (24.1 KB)

While I think the end goal you’re describing is close to the correct one, I see the high-level strategy for getting there somewhat differently:

  1. The code generators are only one collection of function passes that might be parallelized. Many others might also be parallelized profitably. The design for parallelism within LLVM’s pass management infrastructure should be sufficiently generic to express all of these use cases.

  2. The idea of having multiple pass managers necessitates (unless I misunderstand) duplicating a fair amount of state. For example, the caches in immutable analysis passes would no longer be shared, etc. I think that is really unfortunate, and would prefer instead to use parallelizing pass managers that are in fact responsible for the scheduling of passes.

  3. It doesn’t provide a strategy for parallelizing the leaves of a CGSCC pass manager which is where a significant portion of the potential parallelism is available within the middle end.

  4. It doesn’t deal with the (numerous) parts of LLVM that are not actually thread safe today. They may happen to work with the code generators you’re happening to test, but there is no guarantee. Notable things to think about here are computing new types, the use-def lists of globals, commandline flags, and static state variables. While our intent has been to avoid problems with the last two that could preclude parallelism, it seems unlikely that we have succeeded without thorough testing to this point. Instead, I fear we have leaned heavily on the crutch of one-thread-per-LLVMContext.

  5. It adds more complexity onto the poorly designed pass manager infrastructure. Personally, I think that cleanups to the design and architecture of the pass manager should be prioritized above adding new functionality like parallelism. However, so far no one has really had time to do this (including myself). While I would like to have time in the future to do this, as with everything else in OSS, it won’t be real until the patches start flowing.

Thanks for your comments, see my reply below, thanks.

Thanks

Wan Xiaofei

Please see Shuxin's proposal on "parallelizing post-IPO stage". It seems the two projects are related.

Evan

Yes, the purpose is similar, we started this job from last year;
But it Shuxin's solution is module based (correct me if I am wrong), we tried this solution and failed for many reasons, it is described in my design document
https://docs.google.com/document/d/1QSkP6AumMCAVpgzwympD5pI3btPJt4SRgjY-vhyfySg/edit?usp=sharing

we need discuss two solution and compare them, then adopt one solution

The biggest difference of module based parallelism and function based parallelism are
1. how to partition module into different pieces which consume similar time, it is a difficult question
2. How to make sure the generated binary is same each time
3. if 2 can't be achieved, it is difficult to validate the correctness of parallelism

Thanks
Wan Xiaofei

Yes, the purpose is similar, we started this job from last year;
But it Shuxin's solution is module based (correct me if I am wrong), we tried this solution and failed for many reasons, it is described in my design document
https://docs.google.com/document/d/1QSkP6AumMCAVpgzwympD5pI3btPJt4SRgjY-vhyfySg/edit?usp=sharing

we need discuss two solution and compare them, then adopt one solution

The biggest difference of module based parallelism and function based parallelism are
1. how to partition module into different pieces which consume similar time, it is a difficult question

Why difficult?

2. How to make sure the generated binary is same each time

It depends on what is the same. In the merged version, constant may keep one copy, while
in the partition version, constant may be duplicated as the post-IPO passes may generated
some constant, and they cannot share with the same constant generated in other partitions.

All these issues don't sound to be a problem in practice.

3. if 2 can't be achieved, it is difficult to validate the correctness of parallelism

It is nothing about the correctness.

In addition to the concerns Chandler figure out,
I’m curious about :
execution time of pristine-llc vs “modified-llc with -thd=1”, and
the exec-time of pristine-clang vs clang-linked-with-the-modified-llc.

Thanks

Hi, community:

For the sake of our business need, I want to enable "Function-based parallel code generation" to boost up the compilation of single module, please see the details of the design and provide your feedbacks on below aspects, thanks!
1. Is this idea the proper solution for my requirement
2. This new feature will be enabled by llc -thd=N and has no impact on original llc when -thd=1
3. Can this new feature of llc be accepted by community and merged into LLVM code tree

Patches
The patch is divided into four separated parts, the all-in-one patch could be found here:
http://llvm-reviews.chandlerc.com/D1152

Design
https://docs.google.com/document/d/1QSkP6AumMCAVpgzwympD5pI3btPJt4SRgjY-vhyfySg/edit?usp=sharing

Background
1. Our business need to compile C/C++ source files into LLVM IR and link them into a big BC file; the big BC file is then compiled into binary code on different arch/target devices.
2. Backend code generation is a time-consuming activity happened on target device which makes it an important user experience.
3. Make -j or file based parallelism can't help here since there is only one big BC file; function-based parallel LLVM backend code generation is a good solution to improve compilation time which will fully utilize multi-cores.

Overall design strategy and goal
1. Generate totally same binary as what single thread output
2. No impacts on single thread performance & conformance
3. Little impacts on LLVM code infrastructure

Current status and test result
1. Parallel llc can generate same code as single thread by "objdump -d", it could pass 10 hours stress test for all performance benchmark
2. Parallel llc can introduce ~2.9X performance gain on XEON sever for 4 threads

Ignoring FE time which can be fully parallelized and assuming 10%
compile time is spent in serial module passes, 25% time is spent in
CGSCC pass, the maximum speed up that can be gained by using function
level parallelism is less than 3x. Even adding support for parallel
compilation for leaves of CG in CGSCC pass won't help too much -- the
percentage of leaf functions is < 30% in large apps I have seen.

Module based parallelism proposed by Shuxin has max speed up of 10x,
assuming body cloning does not add a lot overhead and build farm with
hundred/thousands of nodes is used.

David

Ignoring FE time which can be fully parallelized and assuming 10%
compile time is spent in serial module passes, 25% time is spent in
CGSCC pass, the maximum speed up that can be gained by using function
level parallelism is less than 3x. Even adding support for parallel
compilation for leaves of CG in CGSCC pass won't help too much -- the
percentage of leaf functions is < 30% in large apps I have seen.

Can you clarify what you're basing these assumption on or how you derived
your data?

Module based parallelism proposed by Shuxin has max speed up of 10x,
assuming body cloning does not add a lot overhead and build farm with
hundred/thousands of nodes is used.

Body cloning does add some overhead, so that actually needs to be measured.
Also, many don't have such a build farm.

Ignoring FE time which can be fully parallelized and assuming 10%
compile time is spent in serial module passes, 25% time is spent in
CGSCC pass, the maximum speed up that can be gained by using function
level parallelism is less than 3x. Even adding support for parallel
compilation for leaves of CG in CGSCC pass won't help too much -- the
percentage of leaf functions is < 30% in large apps I have seen.

Can you clarify what you're basing these assumption on or how you derived
your data?

Those numbers are purely speculative -- does Clang has an option to
dump the time breakout of each passes such as -ftime-report in GCC?

thanks,

David

We have the functionality... I thought we wired -ftime-report up to it? If
that doesn't work I'll have to go digging.

I just checked -- it produces a flat report -- it would be nice to
produce something similar to -debug-pass=Structure outputs.

David

Yea, improving the nested timing information and other improvements to
reflect the pass management structure are on my list for the significant
changes to the pass manager infrastructure I mentioned in my original
comments.

Hi, community:

For the sake of our business need, I want to enable "Function-based parallel code generation" to boost up the compilation of single module, please see the details of the design and provide your feedbacks on below aspects, thanks!
1. Is this idea the proper solution for my requirement 2. This new
feature will be enabled by llc -thd=N and has no impact on original
llc when -thd=1 3. Can this new feature of llc be accepted by
community and merged into LLVM code tree

Patches
The patch is divided into four separated parts, the all-in-one patch could be found here:
http://llvm-reviews.chandlerc.com/D1152

Design
https://docs.google.com/document/d/1QSkP6AumMCAVpgzwympD5pI3btPJt4SRgj
Y-vhyfySg/edit?usp=sharing

Background
1. Our business need to compile C/C++ source files into LLVM IR and link them into a big BC file; the big BC file is then compiled into binary code on different arch/target devices.
2. Backend code generation is a time-consuming activity happened on target device which makes it an important user experience.
3. Make -j or file based parallelism can't help here since there is only one big BC file; function-based parallel LLVM backend code generation is a good solution to improve compilation time which will fully utilize multi-cores.

Overall design strategy and goal
1. Generate totally same binary as what single thread output 2. No
impacts on single thread performance & conformance 3. Little impacts
on LLVM code infrastructure

Current status and test result
1. Parallel llc can generate same code as single thread by "objdump
-d", it could pass 10 hours stress test for all performance benchmark
2. Parallel llc can introduce ~2.9X performance gain on XEON sever for
4 threads

Ignoring FE time which can be fully parallelized and assuming 10% compile time is spent in serial module passes, 25% time is spent in CGSCC pass, the maximum speed up that can be gained by using function level parallelism is less than 3x. Even adding support for parallel compilation for leaves of CG in CGSCC pass won't help too much -- the percentage of leaf functions is < 30% in large apps I have seen.

Module based parallelism proposed by Shuxin has max speed up of 10x, assuming body cloning does not add a lot overhead and build farm with hundred/thousands of nodes is used.

[Xiaofei] for SpecCPU2006, I got the data function passes consume >90% of total time in llc by vtune (I don't enable LTO); here I only consider llc without LTO, the max parallelism depends how many threads are started.

David

Yes, the purpose is similar, we started this job from last year; But
it Shuxin's solution is module based (correct me if I am wrong), we
tried this solution and failed for many reasons, it is described in my
design document
https://docs.google.com/document/d/1QSkP6AumMCAVpgzwympD5pI3btPJt4SRgj
Y-vhyfySg/edit?usp=sharing

we need discuss two solution and compare them, then adopt one solution

The biggest difference of module based parallelism and function based
parallelism are 1. how to partition module into different pieces which
consume similar time, it is a difficult question

Why difficult?

2. How to make sure the generated binary is same each time

It depends on what is the same. In the merged version, constant may keep one copy, while in the partition version, constant may be duplicated as the post-IPO passes may generated some constant, and they cannot share with the same constant generated in other partitions.

All these issues don't sound to be a problem in practice.

3. if 2 can't be achieved, it is difficult to validate the correctness
of parallelism

It is nothing about the correctness.

[Xiaofei] why? I don't understand it very well here, you mean it can generate totally identical binaries as the original llc, including the function order (function order may not affect code quality, but we should make sure the output is same in each run)?

From: Shuxin Yang [mailto:shuxin.llvm@gmail.com]
Sent: Wednesday, July 17, 2013 1:50 AM
To: Wan, Xiaofei
Cc: Evan Cheng; Shuxin Yang; LLVM Developers Mailing List (llvmdev@cs.uiuc.edu)
Subject: Re: [LLVMdev] [LLVM Dev] [Discussion] Function-based parallel LLVM backend code generation

Yes, the purpose is similar, we started this job from last year; But
it Shuxin's solution is module based (correct me if I am wrong), we
tried this solution and failed for many reasons, it is described in my
design document
https://docs.google.com/document/d/1QSkP6AumMCAVpgzwympD5pI3btPJt4SRgj
Y-vhyfySg/edit?usp=sharing

we need discuss two solution and compare them, then adopt one solution

The biggest difference of module based parallelism and function based
parallelism are 1. how to partition module into different pieces which
consume similar time, it is a difficult question

Why difficult?

2. How to make sure the generated binary is same each time

It depends on what is the same. In the merged version, constant may keep one copy, while in the partition version, constant may be duplicated as the post-IPO passes may generated some constant, and they cannot share with the same constant generated in other partitions.

All these issues don't sound to be a problem in practice.

3. if 2 can't be achieved, it is difficult to validate the correctness
of parallelism

It is nothing about the correctness.

[Xiaofei] why?

I'm not sure what you are asking here. Are you asking why partition still preserve correctness?

I don't understand it very well here, you mean it can generate totally identical binaries as the original llc,

I didn't say *totally* identical. IPO with partition could be slightly different from the one without partition,
say the former may have duplicated constant in the .text section, which is totally acceptable in practice.
So long as partition remain unchanged, each time LTO+partition should yield same result.

including the function order (function order may not affect code quality, but we should make sure the output is same in each run)?

By "the same", which are involved in comparison:
    1. IPO w/ partition vs IPO wo/ partition,
    2. n-th run of IPO w/ partition n+1-th run of IPO w/partition

If 2) should show some difference, it means it has some bugs.

  If you are talking about 1), nobody can guarantee 1) generate *totally* identical objections,
and I don't understand we have such esoteric need.

Per <http://www-plan.cs.colorado.edu/diwan/asplos09.pdf>, function order can affect performance by up to 15%.

Yes, sometime it may affect code layout which may affect performance; theoretically it is better if we could generate totally identical code, including function order, meanwhile, it is easy to validate whether parallelism is correct, just compare the outputs simply