Code coverage metrics for LLVM Compiler Infrastructure

Code coverage is the percentage of code that is executed by automated tests. Code coverage tells us which lines of a project have been executed and which have not. The LLVM Compiler Infrastructure has upwards of 100,000 tests (unit, regression, and test suite) that can be executed with a single make check-all command, and a very large codebase consisting of millions of lines and multiple subprojects such as Clang, MLIR, OpenMP, Polly etc.
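To make the definition concrete, here is a minimal sketch of how a line-coverage percentage falls out of per-line execution counts. The function name and the sample counts are invented for illustration; real tools such as gcov or llvm-cov compute this from their own data files.

```python
def line_coverage_percent(line_counts):
    """Return the percentage of instrumented lines executed at least once.

    line_counts maps a line number to its execution count.
    """
    if not line_counts:
        return 0.0
    covered = sum(1 for count in line_counts.values() if count > 0)
    return 100.0 * covered / len(line_counts)


# Hypothetical counts for five instrumented lines: two of them never ran.
example_counts = {10: 4, 11: 4, 12: 0, 13: 1, 14: 0}
print(f"{line_coverage_percent(example_counts):.1f}% of lines covered")
```

The same ratio can be computed per file, per directory, or per subproject by changing which lines are grouped together.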

It is therefore natural that some parts of the compiler are better exercised by test cases than others. Code coverage cannot tell us whether a given test suite is adequate, but it can tell us whether the test suite reaches all areas of the codebase. Coverage data also includes line and branch execution counts, which in turn can be used for profile-guided optimization. It can also point to code that no test ever executes, some of which may be dead (unreachable) code that can be removed. Finally, a codebase-wide audit may throw up surprising results and lead to research papers discussing them.

In this proposal I describe how we can go about implementing code coverage for the LLVM codebase, how to interface it with tools like LCOV, and how to integrate GitHub Actions so that the coverage percentage for each subproject is displayed in the GitHub repository.

The proposal has been submitted along with my resume. You can find it here: Ashutosh_Pandey_GSoC_LLVM.pdf - Google Drive

I am very happy to see a proposal for this, although I regret that I would not be able to mentor it myself.

I am not clear that the proposal covers the merging of data across various tools (clang, opt, llc, llvm-mc, etc) which is necessary for an overall view of code-coverage, given the way lit tests are written (using the various individual tools). It may be too late to update the proposal, but if you could say a few words here, I would be interested.

Thank you for your kind words, @pogo59. Since this project idea was not on the list of proposals and the LLVM page didn’t have clear instructions on who to reach out to in that case, I just took a shot in the dark and submitted it without listing any mentors. I hope it’s picked up and read all the same, though.

If you can direct this to some mentors it would be great.

Now to answer your question:

I didn’t delve too deep into the report generation aspect as that is more about the tools than LLVM itself (and I felt the proposal was already getting too long). But there are a few ways to ‘merge’ data:

1.) Case 1 : The subprojects in question can be built using -DLLVM_ENABLE_PROJECTS and the testcases can be executed using a single make/ninja check-all

In this case there’s no need to do anything special: the subprojects are already covered if the test cases are executed. On page 14 of my proposal I have attached the output of a test run with the llvm/tools/llvm-mca subdirectory, which is one of the tools you mentioned.

2.) Case 2 : If the projects need to be built separately and/or the testcases need to be executed separately.

LCOV supports merging multiple coverage reports into one: lcov(1): graphical GCOV front-end - Linux man page
--add-tracefile tracefile
We can set the counters to zero using --zerocounters and then execute the test cases with the coverage-enabled compiler binary to know what percentage of the tool is covered by its test cases.

We can also capture a zero-coverage baseline with the --initial option before running make check-all, and then combine it with the data captured after the run.
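As a rough illustration of what merging tracefiles amounts to, the sketch below parses only the SF (source file) and DA (line, count) records of an lcov-style tracefile and sums the per-line counts. It is a simplification under stated assumptions, not a reimplementation of lcov: function and branch records are ignored, and the file names and counts in the sample data are made up.

```python
def parse_tracefile(text):
    """Parse the SF/DA records of an lcov-style tracefile.

    Returns {source_file: {line_number: execution_count}}. All other
    record types are skipped in this sketch.
    """
    report = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("SF:"):
            current = line[3:]
            report.setdefault(current, {})
        elif line.startswith("DA:") and current is not None:
            fields = line[3:].split(",")
            lineno, count = int(fields[0]), int(fields[1])
            report[current][lineno] = report[current].get(lineno, 0) + count
        elif line == "end_of_record":
            current = None
    return report


def merge_counts(*reports):
    """Sum per-line execution counts across several parsed reports."""
    merged = {}
    for report in reports:
        for path, lines in report.items():
            target = merged.setdefault(path, {})
            for lineno, count in lines.items():
                target[lineno] = target.get(lineno, 0) + count
    return merged


# Hypothetical tracefiles from two separate test runs.
OPT_RUN = "SF:lib/Foo.cpp\nDA:1,3\nDA:2,0\nend_of_record\n"
LLC_RUN = (
    "SF:lib/Foo.cpp\nDA:1,1\nDA:2,5\nend_of_record\n"
    "SF:lib/Bar.cpp\nDA:7,0\nend_of_record\n"
)

merged = merge_counts(parse_tracefile(OPT_RUN), parse_tracefile(LLC_RUN))
print(merged)
```

A line that only one run reached (line 2 of the hypothetical lib/Foo.cpp) ends up covered in the merged view, which is the property that matters when tests drive different tools.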

Relevant stackoverflow question: g++ - Is it possible to merge coverage data from two executables with gcov/gcovr? - Stack Overflow

llvm-cov (llvm-cov - emit coverage information, LLVM 16.0.0git documentation) can also report on several binaries at once and can output HTML (using -format=html). The raw profiles themselves are merged beforehand with llvm-profdata merge.
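For the llvm-cov route, a sketch of the two commands involved might look like the following. It assumes Clang source-based coverage (binaries built with -fprofile-instr-generate -fcoverage-mapping, producing .profraw files); the binary and profile paths are hypothetical. The helpers only build argument lists and do not run anything.

```python
import shlex


def profdata_merge_command(raw_profiles, merged_profile):
    """Build the llvm-profdata invocation that merges raw profiles."""
    return ["llvm-profdata", "merge", "-sparse", *raw_profiles, "-o", merged_profile]


def html_report_command(binaries, merged_profile, output_dir):
    """Build the llvm-cov invocation that renders an HTML report.

    The first binary is positional; further binaries are passed with
    -object so that one report covers several tools.
    """
    if not binaries:
        raise ValueError("at least one instrumented binary is required")
    command = ["llvm-cov", "show", binaries[0]]
    for extra in binaries[1:]:
        command += ["-object", extra]
    command += [
        f"-instr-profile={merged_profile}",
        "-format=html",
        f"-output-dir={output_dir}",
    ]
    return command


# Hypothetical paths, for illustration only.
merge_cmd = profdata_merge_command(
    ["opt.profraw", "llc.profraw"], "merged.profdata"
)
report_cmd = html_report_command(
    ["build/bin/opt", "build/bin/llc"], "merged.profdata", "coverage-html"
)
print(shlex.join(merge_cmd))
print(shlex.join(report_cmd))
```

Passing every tool binary to a single llvm-cov invocation is one way to get the combined view across clang, opt, llc, and the other tools that lit tests drive individually.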

We could also test with the test suite: GitHub - llvm/llvm-test-suite and note the delta by setting the baseline as we see fit, to know the utility of those tests (coverage wise).
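One way to express that delta, sketched here with invented data, is to list the lines that a second run executes but the baseline run does not. The report shape ({file: {line: count}}) and the helper name are assumptions made for this example.

```python
def newly_covered(baseline, candidate):
    """Return {file: sorted lines} executed in candidate but not in baseline."""
    delta = {}
    for path, lines in candidate.items():
        base_lines = baseline.get(path, {})
        gained = sorted(
            lineno
            for lineno, count in lines.items()
            if count > 0 and base_lines.get(lineno, 0) == 0
        )
        if gained:
            delta[path] = gained
    return delta


# Hypothetical per-line counts from check-all (baseline) and the test suite.
check_all = {"lib/Foo.cpp": {1: 3, 2: 0, 3: 0}}
test_suite = {"lib/Foo.cpp": {1: 9, 2: 4, 3: 0}, "lib/Bar.cpp": {7: 1}}

print(newly_covered(check_all, test_suite))
```

An empty result would suggest that, coverage-wise, the second set of tests adds nothing beyond the baseline.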

Thanks for the additional information, very nice!

I have reached out to my team to see if anyone is able to pick this up. One previous GSOC mentor mentioned that your proposal is much better than anything he had reviewed previously! But he does not feel that he has the necessary expertise to mentor your project.

@akorobeynikov can you offer any suggestions for finding a mentor?

I’m afraid there is no single “pool of mentors”. So yes, @ashpande, it looks like you’d need to check the commit history in the corresponding area and/or ask the corresponding code owners.

@akorobeynikov @pogo59 I think the closest we can get to experts on code coverage in the LLVM ecosystem are the people involved in llvm-cov (LLVM’s code coverage tool): llvm-project/llvm/tools/llvm-cov at main · llvm/llvm-project · GitHub

The commit history is here: History for llvm/tools/llvm-cov - llvm/llvm-project · GitHub

The commits show @petrhosek and @kazutakahirata have quite a few commits there but I am still not sure who the ‘code owners’ are. If someone can point me to them I’d be happy to reach out and ask if they would consider mentoring me for this project.

I am starting to get a little worried because it looks like proposals without mentors get rejected – I don’t know what the internal deadline is, but if I am not able to reach out to a mentor soon I’ll get disqualified without my proposal being evaluated :confused:

I really don’t want this one to fail for lack of a mentor, so if it’s not too late, try putting me as a mentor. I don’t really have the right expertise in all the areas you will touch, but this isn’t really a traditional coding project either.

@akorobeynikov can you let me know what I need to do to sign up?

Hello! @ashpande Thanks for bringing up this topic. I think it is quite important for the project quality.

@pogo59 @akorobeynikov if a mentor for this topic has not been found yet, I would be glad to be the one :)

@kpdev42 @pogo59 – in fact you can easily co-mentor the project. Will it work? :slight_smile:

Thank you so much @pogo59 and @kpdev42 :slight_smile: . I had reached out to a couple of code maintainers but they may be busy. I’d be glad to have you both as mentors. The project is quite big (in terms of trying to verify the output, anyway), so I think I’ll need all the advice I can get!

I’ll leave it up to @kpdev42 whether a co-mentor would be a good thing. I very much want this to go forward, and I am willing to help out.

@ashpande I read your proposal and it looks interesting and matches my interests—specifically, I brought up code coverage for the Fuchsia project where we encountered many issues around scaling, reliability and usability, and this was the motivation for many of the improvements I did to llvm-cov and other parts of LLVM coverage support.

I have some concerns with your proposal though. You put a lot of emphasis on gcov and lcov, and on comparing results collected with different tools. I’m not convinced that this would be a valuable use of your time. In my experience, Clang source-based code coverage is superior in terms of data collection. Where llvm-cov falls short, and why many projects still prefer lcov, is the usability of the output. I think this is where we should focus our efforts, and I have ideas in that area, such as Support per-directory index files for HTML coverage report · Issue #54711 · llvm/llvm-project · GitHub.

In general, I think it should be an explicit goal for the LLVM project to use LLVM tools where possible, as a forcing function to improve our tools where they fall short, and this includes code coverage. If you are amenable to changing your proposal to prioritize improvements to llvm-cov, I’d be willing to co-mentor you. I have also mentioned this project to other team members who might be interested in co-mentoring it.

There are also some inaccuracies in your proposal. I’d be happy to point those out if you can share a version in a format that I can comment on (for example Google Docs).

Hi @petrhosek , thank you for your detailed reply. I have made an editable version of the proposal available here: Ashutosh_Pandey_GSOC_LLVM.docx - Google Docs

I am open to amending the proposal to include improvements to llvm-cov. The only reason I used gcov in the proposal was that I was familiar with it, having worked with it previously to enable code coverage for glibc: glibc-cov/glibc_coverage.patch at main · ashpande/glibc-cov · GitHub. I wanted to submit a proof of concept along with the proposal, rather than just a wishlist of what I wanted to do.

A few things that come to mind are coverage for other file extensions (.ll, .def, etc.) and support for directory nesting more than one layer deep (LCOV has this problem). LLVM has many files in subdirectories of subdirectories, and this can make the report a little confusing to navigate, even if file paths are preserved.
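To illustrate the nesting point, here is a small sketch that rolls per-file line totals up into every ancestor directory, which is the data a per-directory index would need at each level. The (covered, total) numbers are invented, and this is not how llvm-cov or LCOV build their indexes.

```python
from pathlib import PurePosixPath


def rollup_by_directory(file_totals):
    """Aggregate (covered, total) line counts into every ancestor directory.

    file_totals maps a relative source path to a (covered, total) tuple.
    The key "." holds the totals for the whole report.
    """
    directories = {}
    for path, (covered, total) in file_totals.items():
        for parent in PurePosixPath(path).parents:
            key = str(parent)
            dir_covered, dir_total = directories.get(key, (0, 0))
            directories[key] = (dir_covered + covered, dir_total + total)
    return directories


# Invented line totals for three files at different depths.
file_totals = {
    "llvm/lib/Support/APInt.cpp": (80, 100),
    "llvm/lib/Support/Path.cpp": (30, 50),
    "llvm/tools/llvm-mca/llvm-mca.cpp": (10, 40),
}

rollup = rollup_by_directory(file_totals)
for directory in sorted(rollup):
    covered, total = rollup[directory]
    print(f"{directory}: {100.0 * covered / total:.1f}% ({covered}/{total})")
```

With totals available at every level, a report can show one index page per directory instead of a single flat list of files.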

I am not exactly sure how to go about implementing improvements to llvm-cov though, which is why I didn’t write a lot about it in my proposal.

I am open to making amendments and changing the scope of the proposal to have improvements to llvm-cov along with generating a code coverage report for LLVM codebase (I am sure applying it to such a large target will throw up some interesting results).

One thing I just wanted to point out was that unfortunately the GSOC proposal submission window is closed, so the current version above has been uploaded on the portal. Students and mentors can make changes to their idea afterwards, but for now I’ll just have to make any amendments here and will not be able to upload it on the GSOC website. I hope that is not a problem.

I think that @petrhosek is more suitable for mentoring this project than me. So, I renounce my candidature :slight_smile: But I will still follow this project and am happy to help with anything.

I’ve put a few comments on the Google Docs version of the proposal.

Same here :slight_smile: definitely happy to help or co-mentor, let me know.

Thank you for your comments on the proposal. I got some clarity regarding file extensions and debug mode from them.

As long as a mentor is ready to take this up, I can begin working on it right away, at least on the background reading (CMake and getting familiar with LLVM’s build process, along with llvm-cov and its internals).

Have you looked at the LLVM_BUILD_INSTRUMENTED_COVERAGE CMake option? There’s actually a lot of tooling built into LLVM’s CMake for generating code coverage reports.

The CoverageReport.cmake module generates targets that merge profile data and generate HTML reports. That module currently only handles LLVM itself, but it could be extended to handle sub-projects too without too much difficulty.
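For anyone who has not tried the option, a configure step might be assembled roughly as below. This is a sketch that only builds the argument list: the source and build directories and the project list are placeholders, and it assumes clang is used as the host compiler. It does not name the report targets that CoverageReport.cmake generates; check the module itself for those.

```python
import shlex


def coverage_configure_command(source_dir, build_dir, projects):
    """Build a CMake configure command for an instrumented LLVM build."""
    return [
        "cmake",
        "-G", "Ninja",
        "-S", source_dir,
        "-B", build_dir,
        # Assumption: clang is the host compiler for the instrumented build.
        "-DCMAKE_C_COMPILER=clang",
        "-DCMAKE_CXX_COMPILER=clang++",
        "-DLLVM_BUILD_INSTRUMENTED_COVERAGE=ON",
        "-DLLVM_ENABLE_PROJECTS=" + ";".join(projects),
    ]


# Placeholder directories and project list.
configure_cmd = coverage_configure_command(
    "llvm-project/llvm", "build-coverage", ["clang", "mlir"]
)
print(shlex.join(configure_cmd))
```

Building and running the tests in that build directory then produces the profile data that the coverage targets merge and render.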

Because I think my last comment probably didn’t convey this: code coverage is great, and we should keep improving what we have.

It is worth noting, we do actually have a bot today that runs code coverage on all of the LLVM repo and publishes results here: https://lab.llvm.org/coverage/coverage-reports/index.html

There are lots of things we can do to improve these reports and the coverage numbers (which would be great), but we should make sure that we’re not just creating more ways to do the same thing. Leveraging existing infrastructure, tools and reporting mechanisms would be a big win for the community.

+1 here. Anecdotally, I use LLVM_BUILD_INSTRUMENTED_COVERAGE to generate coverage for a downstream project that uses LLVM. We don’t use the .html generation, but upload the coverage data to CodeCov instead. Any improvements here would likely compound for users that also rely on the functionality, especially when it comes to resolving incorrect coverage or related bugs: we actually had to revert our coverage build to clang-11 because recent clangs have various coverage regressions.

– River
