Commit module to Git after each Pass

Hello,

I had a stupid idea recently that turned out not so stupid after all. I wanted to be able to “see” an entire pass pipeline in action to find unnecessary transformations and/or missed opportunities, and generally improve the debuggability of LLVM.

So as the title suggests, I implemented an equivalent of “-print-after-all”, but instead of printing to stdout I dump into a file that gets committed into a temporary git repository. There are some quirks with it, but it’s working and is actually awesome. For example, at first sight I can already see lcssa and instcombine cancelling each other’s work multiple times.

Of course, that has a big impact on compile time when enabled, but that’s still practical (git being quite good at its job) when debugging.

There are improvements I can make, but would you be interested in such a feature?

Hi Alexandre. I can definitely see how it could be useful to track
changes through git commits, and take advantage of your favourite repo
history viewer to see changes. How much of your current implementation
is handled via modifications to LLVM vs an external helper script? For
instance, I might imagine trying to achieve something similar through
a script that parses the output of -print-after-all in order to create
the desired files+commits.

Best,

Alex

Today it is entirely in LLVM. It is even more costly than -print-after-all, as it:

  • prints to a file
  • prints the entire module (for basic block passes it prints after every basic block, and still dumps the whole module even though only one basic block changed)
  • calls git twice (add, then commit) and waits for them to finish (I even keep all the empty commits)

The reason I print the entire module is so that git is able to show/compress the changes properly. Then a simple git log --patch shows the changes made by each pass.
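To give an idea, browsing the result only needs ordinary git commands (the repository path below is made up for illustration):

] cd /tmp/pass-history   # wherever the temporary repository was created
] git log --oneline      # roughly one commit per pass execution
] git log --patch        # the diff each pass introduced
] git diff HEAD~10       # net effect of the last ten passes combined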

Ideally, the -debug output of each pass would be piped in as the git commit message, and the pass’s name and options would be the commit title. But I didn’t have time to do that.

My goal was not particularly to be efficient but to be thorough, so as to preserve as much behavioral information about each pass as possible while not affecting their order or behavior.

You are right, there could be major speed improvements gained by doing this as external processing. For instance, the output could be in a format that git takes as patch input. Could this be a GSoC?

Today I use git filter-branch + opt as a post-processing tool to generate the reg.dot files of the region info. In any case, the feature is a breath of fresh air in our development here. That’s why I wanted to share. It will quickly become my reflex to debug with it… probably because I am at ease with git, though.

Could you format the output so that it is compatible with git fast-import?
https://www.git-scm.com/docs/git-fast-import

Maybe. I can’t compute the diff/patch for git though; it would have to do that itself. Would that still work?

I am not sure I will have time to work on improving that.
For those who want to try it out I put it there: https://reviews.llvm.org/D44244

And yes, this uses all possible forms of ugly. :slight_smile:

TL;DR: fast-import reads a sequence of commands from stdin to write whole files / trees / commits directly into a compressed pack archive.

Primarily designed for bulk importing history from other source control systems.

If the git commands you are using were a significant performance problem, I figured this would be quicker.
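To make the format concrete, here is a minimal hand-written stream that records one pass as one commit (the identity, file name and contents are made up; each data count is the exact byte length of the payload that follows it):

] git init demo && cd demo
] git fast-import --done --date-format=now <<'EOF'
commit refs/heads/master
committer opt <opt@llvm.invalid> now
data 11
SimplifyCFG
M 100644 inline module.ll
data 12
; module IR
done
EOF
] git log --oneline master   # one commit titled "SimplifyCFG"

Note there is no diff anywhere in the stream: each M ... inline command carries a full snapshot of the file, and git computes the deltas itself when it packs the objects.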

Have you considered writing each function or even basic block to a separate file?

This is interesting, and might be useful. I don’t know that this is broadly useful enough for upstream inclusion, but if you could post this to github somewhere, I might play with it.

There might also be room to factor out common functionality. We’ve also run into the need to print the whole module instead of just the containing construct (e.g. the current loop). If we added upstream support for something along the lines of -print-module-after-all, building the git history could easily be done as a post-processing step.

Philip

Alexandre posted the patch here https://reviews.llvm.org/D44244

I made a similar suggestion regarding handling this as a
post-processing step. I like your proposed -print-module-after-all
name.

Best,

Alex

The print-module-after-all type of option exists in upstream:

-print-module-scope - When printing IR for print-[before|after]{-all} always print a module IR

commit 7d160f714357f6784ead669ce516e94991c12e5a
Author: Fedor Sergeev <fedor.sergeev@azul.com>

For this to be really usable in this setup we additionally need to:
- extend -print-module-scope to cover basic block passes
- introduce a clear way to separate module IRs as those are being printed by -print-after-all

But yes, it should work, and a wrapper that pipes to git fast-import seems to be the best way to handle it.

regards,
Fedor.

> For this to be really usable in this setup we additionally need to:
> - extend -print-module-scope to cover basic block passes
> - introduce a clear way to separate module IRs as those are being printed by -print-after-all
>
> But yes, it should work, and a wrapper that pipes to git fast-import seems to be the best way to handle it.

A simple 20-line perl script does the trick pretty easily (a rough sketch follows the commands below): filter-LLVM-ir-print.pl - Pastebin.com

(this assumes my local modification that introduces a *** END OF IR DUMP *** marker at the end of -print-module-scope's IR module dump)

] git init
] RA/bin/opt -O3 some-ir.ll -disable-output -print-after-all -print-module-scope 2>&1 | filter-LLVM-ir-print.pl | git fast-import --done --date-format=now
....
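The script itself does nothing clever; a rough shell sketch of the same idea (the marker strings, committer identity and module.ll file name are assumptions for illustration, and all error handling is skipped):

#!/bin/sh
# Turn the -print-after-all -print-module-scope stream into
# git fast-import commands: one commit per IR dump.
tmp=$(mktemp)
msg=""
while IFS= read -r line; do
  case $line in
  "*** IR Dump After"*)      # a new dump begins; remember its header line
    msg=$line
    : > "$tmp" ;;
  "*** END OF"*)             # dump finished; emit one commit
    printf 'commit refs/heads/master\n'
    printf 'committer opt <opt@llvm.invalid> now\n'
    printf 'data %d\n%s\n' "${#msg}" "$msg"
    printf 'M 100644 inline module.ll\n'
    printf 'data %d\n' "$(wc -c < "$tmp" | tr -d ' ')"
    cat "$tmp" ;;
  *)                         # a line of the module dump itself
    printf '%s\n' "$line" >> "$tmp" ;;
  esac
done
printf 'done\n'
rm -f "$tmp"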

The majority of the time is spent actually printing the IR (~2 min for my testcase).
Fast-import takes just a second.

regards,
Fedor.

Hmm...

I tried Alexandre's fix from D44244 and surprisingly it appears that just using -print-module-scope w/o
any additional git actions is waaaay slower on my testcase than -git-commit-module-all.

Hell, even a plain -print-after-all is slower:

] time R/bin/opt -O3 some-ir.ll -disable-output -git-commit-after-all 2>/dev/null
real 0m8.041s
user 0m7.133s
sys 0m0.936s
] time R/bin/opt -O3 some-ir.ll -disable-output -print-after-all 2>/dev/null

real 0m13.575s
user 0m6.179s
sys 0m7.394s

I can't really explain that...

regards,
Fedor.

Does git-commit-after-all print correctly after all the passes? Maybe I messed it up and it skips some passes, therefore having less to do?

Either that, or piping has a higher cost than writing to a file. It looks like it surprisingly spends much less time in system mode when going through a file. Maybe that’s because the file is consistently around the same size and is mmapped into memory continuously, while piping requires regular (more than once per module) context switches between the two processes?

Honestly, I would say something is wrong (see the first paragraph). I didn’t build that with efficiency in mind in any way…

> Does git-commit-after-all print correctly after all the passes? Maybe I messed it up and it skips some passes, therefore having less to do?

I did verify that the total number of lines committed to git is reasonably high:

] git rev-list master | while read cmt; do git show $cmt:some-ir.ll; done | wc -l
1587532

The corresponding number for -print-after-all (w/o print-module-scope):
] time R/bin/opt -O3 some-ir.ll -disable-output -print-after-all 2>&1 | wc -l
219328
]

Also, the number of commits seems to be right as well.

> Either that, or piping has a higher cost than writing to a file. It looks like it surprisingly spends much less time in system mode when going through a file. Maybe that's because the file is consistently around the same size and is mmapped into memory continuously, while piping requires regular (more than once per module) context switches between the two processes?

> Honestly, I would say something is wrong (see the first paragraph). I didn't build that with efficiency in mind in any way...

Well, git by itself is so focused on performance that it's not surprising to me that even using git add/git commit does not cause
performance penalties.

regards,
Fedor.

Huh. Great! :grin:

I don’t believe my poor excuse from earlier (else we should map all pipes into files!), but I’m curious why we spend less time in system mode when going through a file than a pipe. Maybe /dev/null is not as efficient as we might think? I can’t believe I’m saying that…

> Well, git by itself is so focused on performance that it's not surprising
> to me that even using git add/git commit does not cause
> performance penalties.

Sure, but still, I write more stuff (the entire module) to a slower destination (a file). Even ignoring git execution time, it’s counterintuitive.

The only difference is that while I write more, it overwrites itself continuously instead of being one long linear stream. I was thinking of mmap’ing the file instead of going through our raw_ostream, but maybe that’s unnecessary then…

The most likely answer is that the printer used by print-after-all is slow. I know there were some changes made around passing in some form of state cache (metadata related?) and that running printers without doing so works, but is dog slow. I suspect the print-after-all support was never updated. Look at what we do for the normal IR emission “-S” and see if print-after-all is out of sync.

Philip

Does https://reviews.llvm.org/D44132 help at all?

If this is faster than -print-after-all, may we actually consider pushing it into the code base then? (after diligent code review, of course)

Note that it uses the same printing method as -print-after-all:

  • create a pass of the same pass kind as the pass we just ran
  • use Module::print(raw_ostream) to print (except that -print-after-all only prints the concerned part, and to stdout)

If there are improvements to be made to print-after-all, they might also improve git-commit-after-all (unless they only improve speed when printing constructs smaller than a module).

In any case, it is, to me, much more usable (and extensible) than -print-after-all. But it requires git to be in PATH (I’m curious whether that works on Windows).

The git-commit-after-all solution has one serious issue: it hardcodes the git handling, which
makes it look problematic from many angles (picking the proper git,
selecting the exact way of storing the information, creating the repository, replacing the file, etc.).

Just dumping the information in a way that allows easy subsequent machine processing
seems to be a more flexible, less cluttered and overall cleaner solution that avoids
making any of the “user interface” decisions mentioned above.

We need to understand why git-commit-after-all works faster than print-after-all.
I don’t believe in magic… yet :slight_smile:

And, btw, thanks for both the idea and the patch.

regards,
Fedor.

> If this is faster than -print-after-all, may we actually consider pushing
> it into the code base then? (after diligent code review, of course)
>
> Note that it uses the same printing method as -print-after-all:
> - create a pass of the same pass kind as the pass we just ran
> - use Module::print(raw_ostream) to print (except that -print-after-all
> only prints the concerned part, and to stdout)
>
> If there are improvements to be made to print-after-all, they might also
> improve git-commit-after-all (unless they only improve speed when printing
> constructs smaller than a module).
>
> In any case, it is, to me, much more usable (and extensible) than
> -print-after-all. But it requires git to be in PATH (I'm curious whether
> that works on Windows).

I don't really have a dog in this fight, but my guess is that this wouldn't
be used that often, so I'm not sure speed is that much of an issue.

I'd considered something similar before, but instead of directly invoking
another tool, e.g., git, I was considering providing a hook mechanism that
would allow users to use whatever they want.

However, post-processing the output via a script seems like the cleanest,
least invasive solution, assuming the deficiencies others have noted could
be addressed.