How to contribute to the LLVM project as a beginner

Hi LLVM project Leaders,
I am a software engineer working on several other open source projects. Recently I have become very interested in LLVM, especially the backend, and I have spent two months studying the documentation on llvm.org in my spare time. As a beginner, I would like to contribute some code to the LLVM project. In the “Google Summer of Code 2019” list I found one project, “Debug Info should have no effect on codegen”, that I may be able to contribute to. Has this project already been completed? If there are still open tasks, how can I join in? Or is there any other project I could work on? I can spend 10 to 20 hours per week on LLVM development, as I want to gain experience to find a job as an LLVM developer in the future. I am a quick learner, and I would very much appreciate any help and guidance that gets me up to speed in the LLVM field faster. Many thanks.

Hi Chris,

“Debug info should have no effect on codegen” would be a fine project for you; nobody is working on it that I know of. Another way to contribute would be to go to our Bugzilla (bugs.llvm.org) and search for open bugs with the “beginner” keyword.

Regarding the “debug info has no effect on codegen” project, unfortunately I am having IT issues that keep me from providing much in the way of specific suggestions, so what follows is fairly generic.

In principle, you compile some piece of code with and without -g, and see if there is any difference in the generated instructions. My experience is that you want to compile to a .o file, and then use a disassembler to dump the text sections. This will give you a cleaner diff than using -S to generate assembler files.

I also recommend compiling with -ffunction-sections and probably -fexceptions. The former will put each compiled function into its own object-file section, so that differences in one function won’t affect the disassembly of a later function. The latter option should work around one fairly intractable known difference: -g will cause the compiler to emit directives to produce call-frame information, and these tend to act as instruction-scheduling barriers. Using -fexceptions (I am 95% sure that is the correct option) should cause the non-dash-g compilation to use the same directives, and avoid that known difference.

You can repeat this experiment with different optimization levels, as differences are far more likely to show up with optimization.
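The experiment described above could be scripted roughly as follows. This is only a sketch under stated assumptions: it assumes `clang` and `llvm-objdump` are on the PATH, the object file names `plain.o` and `debug.o` are made up for illustration, and `-fexceptions` is used only because it is suggested above (with the same caveat).

```python
import difflib
import subprocess


def compile_command(source, output, opt_level, debug):
    """Build a clang argv list; -g is the only flag that differs between runs."""
    cmd = ["clang", "-c", f"-{opt_level}", "-ffunction-sections",
           "-fexceptions", source, "-o", output]
    if debug:
        cmd.append("-g")
    return cmd


def strip_file_header(disassembly):
    """Drop blank lines and the 'file format' banner, which names the object file."""
    return [line for line in disassembly.splitlines()
            if line.strip() and "file format" not in line]


def disassemble(obj):
    """Disassemble the code sections of an object file, without raw instruction bytes."""
    result = subprocess.run(["llvm-objdump", "-d", "--no-show-raw-insn", obj],
                            check=True, capture_output=True, text=True)
    return strip_file_header(result.stdout)


def codegen_diff(source, opt_level="O2"):
    """Return unified-diff lines between the plain and -g builds (empty if identical)."""
    listings = []
    # plain.o / debug.o are hypothetical output names chosen for this sketch.
    for debug, obj in ((False, "plain.o"), (True, "debug.o")):
        subprocess.run(compile_command(source, obj, opt_level, debug), check=True)
        listings.append(disassemble(obj))
    return list(difflib.unified_diff(listings[0], listings[1],
                                     "plain", "debug", lineterm=""))
```

Calling `codegen_diff("foo.c", level)` for each of `"O1"`, `"O2"` and `"O3"` repeats the comparison across optimization levels; any non-empty result is a candidate for further investigation.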

Once you find a difference, you can begin experimenting with ways to identify specific compiler passes that are contributing to the difference. A very useful tool here is the backend option -opt-bisect-limit=N where N is the number of passes to execute. Because it is a backend option, you would use it this way:

clang -c -O2 -mllvm -opt-bisect-limit=100 foo.c -o foo.o

clang -c -O2 -mllvm -opt-bisect-limit=100 foo.c -g -o foo-g.o

Then disassemble and diff as usual. After you have identified a problematic pass, you can try your hand at fixing it yourself, or you can file a bug (with a reduced reproducer if at all possible) and move on to another sample.
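Finding the right N by hand is tedious, so one possible way to automate it is a binary search over the limit. The sketch below is an illustration, not an established LLVM tool: the helper names and the `bisect.o` / `bisect-g.o` file names are invented, and it assumes that once a difference appears at some limit it stays visible at every larger limit (later passes could in principle mask it again).

```python
import subprocess


def bisect_command(source, output, limit, debug):
    """The clang invocation from the thread, with a variable bisect limit."""
    cmd = ["clang", "-c", "-O2", "-mllvm", f"-opt-bisect-limit={limit}",
           source, "-o", output]
    if debug:
        cmd.append("-g")
    return cmd


def code_listing(obj):
    """Disassembly lines, minus blank lines and the banner that names the file."""
    out = subprocess.run(["llvm-objdump", "-d", "--no-show-raw-insn", obj],
                         check=True, capture_output=True, text=True).stdout
    return [l for l in out.splitlines() if l.strip() and "file format" not in l]


def differs_at(source, limit):
    """True if the plain and -g builds disassemble differently at this limit."""
    # bisect.o / bisect-g.o are hypothetical scratch file names.
    for debug, obj in ((False, "bisect.o"), (True, "bisect-g.o")):
        subprocess.run(bisect_command(source, obj, limit, debug),
                       check=True, stderr=subprocess.DEVNULL)
    return code_listing("bisect.o") != code_listing("bisect-g.o")


def first_differing_limit(differs, upper):
    """Smallest limit in [1, upper] for which differs(limit) is True.

    Assumes differs(upper) is True, that there is no difference at limit 0,
    and that the difference persists for every larger limit.
    """
    low, high = 0, upper
    while high - low > 1:
        mid = (low + high) // 2
        if differs(mid):
            high = mid
        else:
            low = mid
    return high
```

For example, `first_differing_limit(lambda n: differs_at("foo.c", n), 100)` would return the first limit at which the two builds diverge, provided they do diverge at 100. The pass with that number, as reported in the bisect output clang prints, is then the first suspect.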

Of course you will need some sample source code to run experiments on. This can be anything convenient. You could try it on any personal projects you have, or you could find a random code generator, or whatever you like. Some people have recommended LLVM’s own ‘test-suite’ project although I have not looked at it in any detail.

Good luck, and feel free to post additional questions on llvm-dev if you run into any problems.

--paulr

Your script looks OK, though you won’t want to use the -opt-bisect-limit= option until you’ve found a case where code-generation changes. Instead, that’s a tool which you could use to narrow down the pass inside LLVM which is causing the change.

The problem is that your input code is far too simple to trigger any interesting optimisations. I’d suggest starting with either some code from the LLVM test suite (https://github.com/llvm/llvm-test-suite), or some code generated by csmith (https://embed.cs.utah.edu/csmith/). The former has the advantage of being (mostly) real code people actually write, and the latter can generate a large amount of complex code without any external dependencies (so it’s easy to build).
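As a rough illustration of the csmith route, the sketch below generates one program per seed and compiles it with and without -g. It makes several assumptions: `csmith` and `clang` are on the PATH, the `csmith_<seed>` file names are made up, and `include_dir` must point at the directory of your csmith installation that contains `csmith.h`, which generated programs include.

```python
import subprocess


def csmith_command(seed, output):
    """Ask csmith for a reproducible random program written to `output`."""
    return ["csmith", "--seed", str(seed), "--output", output]


def compile_command(source, output, include_dir, debug):
    """Compile a csmith program; -w silences the warnings random code tends to trigger."""
    cmd = ["clang", "-c", "-O2", "-w", "-I", include_dir, source, "-o", output]
    if debug:
        cmd.append("-g")
    return cmd


def generate_and_compile(seed, include_dir):
    """Generate one sample and build it both ways; returns the source file name."""
    source = f"csmith_{seed}.c"  # hypothetical naming scheme
    subprocess.run(csmith_command(seed, source), check=True)
    for debug, obj in ((False, f"csmith_{seed}.o"), (True, f"csmith_{seed}-g.o")):
        subprocess.run(compile_command(source, obj, include_dir, debug), check=True)
    return source
```

Recording the seed makes any interesting sample easy to regenerate later, for instance when attaching a reproducer to a bug report.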

I’d also suggest looking into creduce (https://embed.cs.utah.edu/creduce/), which will allow you to quickly reduce a large input file which triggers a bug down to a much smaller one.
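creduce is driven by an "interestingness test": an executable that exits with status 0 while the candidate file still shows the problem. For this project, a test could look roughly like the sketch below. The file name `foo.c`, the flags, and the helper names are assumptions for illustration; creduce runs the test in a directory that holds the current candidate under its original name.

```python
import subprocess

SOURCE = "foo.c"  # hypothetical name of the file being reduced


def normalize(disassembly):
    """Drop blank lines and the banner line that names the object file."""
    return [l for l in disassembly.splitlines()
            if l.strip() and "file format" not in l]


def build_listing(debug):
    """Compile SOURCE with or without -g and return its normalized disassembly."""
    obj = "debug.o" if debug else "plain.o"
    cmd = ["clang", "-c", "-O2", "-ffunction-sections", SOURCE, "-o", obj]
    if debug:
        cmd.append("-g")
    subprocess.run(cmd, check=True, stderr=subprocess.DEVNULL)
    out = subprocess.run(["llvm-objdump", "-d", "--no-show-raw-insn", obj],
                         check=True, capture_output=True, text=True).stdout
    return normalize(out)


def exit_status(plain, debug):
    """creduce convention: 0 keeps the candidate, non-zero discards it."""
    return 0 if plain != debug else 1


def main():
    try:
        return exit_status(build_listing(False), build_listing(True))
    except subprocess.CalledProcessError:
        return 1  # the candidate no longer compiles, so it is not interesting
```

To use it, save the code as an executable script (for instance `interesting.py`, a made-up name) with a `#!/usr/bin/env python3` first line and a final `raise SystemExit(main())`, then run `creduce ./interesting.py foo.c`. A test this loose can let creduce drift toward code with undefined behaviour, so it is worth also requiring that the candidate compiles without warnings.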

Oliver

A few other things to note:

There’s a tool in clang here ( https://github.com/llvm/llvm-project/tree/master/clang/utils/check_cfc ) called check_cfc which uses the same basic idea as the script above. It’s designed to transparently wrap clang invocations so that any differences in codegen will actually trigger a build failure. There are a few more details in these slides ( https://llvm.org/devmtg/2015-04/slides/Verifying_code_gen_dash_g_final.pdf ). Ultimately it doesn’t matter which tools you use in order to find bugs, but you may find it useful.

We’ve got a meta-bug here to which we’ve been attaching already-reported bugs in this area ( https://bugs.llvm.org/show_bug.cgi?id=37728 ) which might be a nice place to start so that you can try replicating the results. In particular https://bugs.llvm.org/show_bug.cgi?id=42138 is a bug that one of our interns found recently using the check_cfc script with llvm test-suite (and then reducing with creduce). Unfortunately it was right at the end of his internship so he didn’t get a chance to try and fix it. It might be a good starting point to have a go at replicating the failure and then trying to figure out what’s happening and fixing it (assuming that it’s still present). I’m sure that there are plenty of people in the community willing to help out with any specific issues you run into along the way.

Good luck, with whichever approach you take!

-Greg

Built LLVM 8.0.1 for debug using -DCMAKE_EXPORT_COMPILE_COMMANDS=ON.

Put together a sequence using clang/utils/check_cfc on the compile list using the same compile parameters except for the -g and -o options that check_cfc provides.
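The script actually used is not shown in the thread. As one hedged guess at the preparation step, the sketch below reads the `compile_commands.json` produced by CMake and drops `-g` and `-o <file>` from every command, leaving an argv list that a wrapper such as check_cfc could be given. The function names are invented, and how check_cfc is invoked is deliberately left out.

```python
import json
import shlex


def strip_debug_and_output(argv):
    """Remove '-g' and '-o <file>' from an argv list, keeping everything else."""
    cleaned = []
    skip_next = False
    for arg in argv:
        if skip_next:
            skip_next = False      # this is the file name that followed -o
        elif arg == "-o":
            skip_next = True
        elif arg != "-g":
            cleaned.append(arg)
    return cleaned


def load_commands(path):
    """Return (directory, argv) pairs from a compile_commands.json file."""
    with open(path) as handle:
        entries = json.load(handle)
    commands = []
    for entry in entries:
        # An entry carries either an 'arguments' list or a single 'command' string.
        argv = entry.get("arguments") or shlex.split(entry["command"])
        commands.append((entry["directory"], strip_debug_and_output(argv)))
    return commands
```

Each returned pair gives the working directory and the cleaned command for one translation unit. The sketch only handles `-o` written as a separate argument, which is the form CMake normally emits.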

2506 files were successfully processed by check_cfc out of a total of 2833.

Three was the maximum number of differences obtained with the possibly interesting types here.

4 of this type.
- push + sh %r
- mov + v %r
- sub + b $0

3 of this type.
- pq 66 + v -0
- v -0 + v %r
- v %r + llq 6d

1 each of these types.
- pq ef + pq ea
- pq 68 + a -0
- a -0 + llq 6c

- pq 3a + vb $0
- vb $0 + a -0
- a -0 + v %r

- jmpq + mov
- mov + and
- and + mov

Regards, Neil Nelson

Nice work! Are you planning to track down where these differences come from, or do you plan to file bugs for them?

--paulr