monorepo: bad performance when using gitk / git log

Hi!

Anyone else experiencing performance problems when using the new monorepo?

My experience is that the performance of gitk (and git log) is sometimes really bad when working in the monorepo.

I’ve mainly seen it when using gitk on specific files/directories, but since gitk seems to be using "git log --no-color -z --pretty=raw --show-notes --parents --boundary HEAD -- <path>", it is possible to observe the same thing when using git log.

The problem can be seen when creating a brand new commit (with a new file):

bash-4.1$ git clone https://github.com/llvm/llvm-project.git llvm-project

bash-4.1$ cd llvm-project

bash-4.1$ touch dummy

bash-4.1$ git add dummy

bash-4.1$ git commit -m "test"

[master 6539b74dd0e] test

1 file changed, 0 insertions(+), 0 deletions(-)

create mode 100644 llvm/dummy

bash-4.1$ /usr/bin/time git log --no-color -z --pretty=raw --show-notes --parents --boundary HEAD -- dummy > /dev/null

198.37user 0.40system 3:18.67elapsed 100%CPU (0avgtext+0avgdata 696456maxresident)k

0inputs+0outputs (0major+175765minor)pagefaults 0swaps

The problem also shows up when examining older files. Here are some tests using the monorepo:

bash-4.1$ git clone https://github.com/llvm/llvm-project.git llvm-project

bash-4.1$ cd llvm-project

bash-4.1$ /usr/bin/time git log --no-color -z --pretty=raw --show-notes --parents --boundary HEAD > /dev/null

5.15user 0.26system 0:05.42elapsed 99%CPU (0avgtext+0avgdata 220344maxresident)k

0inputs+0outputs (0major+56131minor)pagefaults 0swaps

bash-4.1$ /usr/bin/time git log --no-color -z --pretty=raw --show-notes --parents --boundary HEAD -- README.md > /dev/null

155.20user 0.34system 2:35.45elapsed 100%CPU (0avgtext+0avgdata 636744maxresident)k

0inputs+0outputs (0major+160862minor)pagefaults 0swaps

bash-4.1$ /usr/bin/time git log --no-color -z --pretty=raw --show-notes --parents --boundary HEAD -- llvm/CODE_OWNERS.TXT > /dev/null

55.48user 0.34system 0:55.80elapsed 100%CPU (0avgtext+0avgdata 690124maxresident)k

0inputs+0outputs (0major+174196minor)pagefaults 0swaps

bash-4.1$ /usr/bin/time git log --no-color -z --pretty=raw --show-notes --parents --boundary HEAD -- llvm/test/CodeGen/Generic/bswap.ll > /dev/null

192.97user 0.33system 3:13.19elapsed 100%CPU (0avgtext+0avgdata 696496maxresident)k

0inputs+0outputs (0major+176003minor)pagefaults 0swaps

Same tests when using the old llvm repo (there is no README.md so I skipped that test here):

bash-4.1$ /usr/bin/time git log --no-color -z --pretty=raw --show-notes --parents --boundary HEAD > /dev/null

2.72user 0.12system 0:02.84elapsed 99%CPU (0avgtext+0avgdata 136628maxresident)k

0inputs+0outputs (0major+36354minor)pagefaults 0swaps

bash-4.1$ /usr/bin/time git log --no-color -z --pretty=raw --show-notes --parents --boundary HEAD -- CODE_OWNERS.TXT > /dev/null

2.74user 0.19system 0:02.93elapsed 99%CPU (0avgtext+0avgdata 344756maxresident)k

0inputs+0outputs (0major+88975minor)pagefaults 0swaps

bash-4.1$ /usr/bin/time git log --no-color -z --pretty=raw --show-notes --parents --boundary HEAD -- test/CodeGen/Generic/bswap.ll > /dev/null

3.76user 0.19system 0:03.96elapsed 99%CPU (0avgtext+0avgdata 380416maxresident)k

0inputs+0outputs (0major+98218minor)pagefaults 0swaps

The test/CodeGen/Generic/bswap.ll example indicates that opening gitk (or running git log) on a file can take roughly 193/4 ≈ 48 times longer in the monorepo than in the old llvm repo (!).
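As a rough check of that ratio, the unrounded user times from the runs above can be divided directly. This is only a sketch; `slowdown` is a helper name made up for illustration and is not part of git.

```shell
# Hypothetical helper: ratio of a slow time to a fast time, both in seconds.
slowdown() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

# User times taken from the /usr/bin/time output above
# (monorepo first, old llvm repo second).
slowdown 192.97 3.76   # bswap.ll        -> prints 51.3
slowdown 55.48 2.74    # CODE_OWNERS.TXT -> prints 20.2
```

With the exact figures the bswap.ll case comes out slightly above 50x, and CODE_OWNERS.TXT at about 20x.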

I’m not so familiar with the inner details of git. Could this be a bad repack of the llvm-project repo or something?

Or is it just that we now squeeze so many commits into the same repo that I should expect the performance to be even worse in the future?

The figures above are from git 2.14.1, but I’ve also tried 2.20.0 with similar results.

Regards,

Björn

Björn Pettersson A via llvm-dev <llvm-dev@lists.llvm.org> writes:

> I’m not so familiar with the inner details of git. Could this be a bad
> repack of the llvm-project repo or something?
>
> Or is it just that we now squeeze so many commits into the same repo
> that I should expect the performance to be even worse in the future?

All of your log commands log the entire history of the repository.
Since the monorepo contains the history of all projects, it's a lot more
than the individual project repositories used to contain.

I don't know what gitk does in terms of logging. If it insists on
logging the entire history, then yes, it's going to be slower with the
monorepo.

Personally, I rarely have the need to log further back than a couple of
years of history and the monorepo has been all right for that. On the
rare occasion I need to look back much further, the extra time hasn't been
burdensome. But then I never use gitk.

                          -David

I use GitExtension and have no performance issues. However, I noticed GitExt will only visualize the last 3 years in the overview. When looking at a specific file's history or blame, it will show the entire history of it (the last 10 years).

The problem here seems to be due to the combination of specifying --parents, and specifying a pathname to filter by. I can certainly reproduce a remarkable slowness with that combination from git…

On my machine:
$ time git log --parents --oneline origin/master > /dev/null
real 0m4.001s

$ time git log origin/master -- llvm/test/CodeGen/Generic/bswap.ll > /dev/null
real 0m5.332s

$ time git log --parents --oneline origin/master -- llvm/test/CodeGen/Generic/bswap.ll > /dev/null

real 2m48.944s

That said, I use gitk frequently, and had not noticed performance issues. But, I’d never tried invoking it with a path on the command-line, only with ref names, so it’s not hitting the bad case.

Nor have I noted issues with git log, but again, I’d never have run it with --parents, so I don’t hit this bad case.

Maybe worth reporting as a possible bug to git? Surely whatever algorithm it’s using shouldn’t be this slow.
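For anyone who wants to repeat the comparison before reporting it, the three timings could be collected with a small script along these lines. It is a sketch only: `log_variants` is a name invented here, the ref and path are simply the ones used in this thread, and the timing loop is skipped unless the script runs inside a git work tree.

```shell
# Hypothetical helper: print the three `git log` argument lists to compare,
# one per line (--parents only, path only, --parents plus path).
log_variants() {
    printf '%s\n' \
        "--parents --oneline $1" \
        "$1 -- $2" \
        "--parents --oneline $1 -- $2"
}

# Time each variant, discarding the log output.
if git rev-parse --is-inside-work-tree > /dev/null 2>&1; then
    log_variants origin/master llvm/test/CodeGen/Generic/bswap.ll |
    while read -r args; do
        echo "git log $args"
        # $args is split into words on purpose; paths containing
        # spaces are not supported by this sketch.
        time git log $args > /dev/null
    done
fi
```

Only the third variant, which combines --parents with a path, should show the multi-minute run time described above.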

I asked about this on git@vger.kernel.org:

https://public-inbox.org/git/20190402132756.GB13141@sigill.intra.peff.net/T/#m1fd5da534d39f967a8ce8b3361bc2e00b9214f31

I’ve already got an answer: we seem to be unlucky with some access patterns when doing "git log --parents" in the monorepo, and we hit some quadratic analysis of the commit history. Hopefully it is something they can fix (Jeff King already had some ideas).

Ah – awesome news! Sounds like it may be fixed soon.

Indeed, the dates in the early history of the svn repository did jump around a bit, because “clang” was imported from an external repository in 2007, while it had already been under development for a year.

To be precise, the SVN revisions r38537 through r39730, except for r39142, were imported into the SVN repository, with their original commit dates. Those dates are thus out of order compared with surrounding commits. E.g. r38535 has the date 2007-07-11 08:47:55 +0000, while r38537 is from a year earlier, 2006-06-18 05:42:02 +0000.
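One way to look for such out-of-order dates is to walk the history oldest-first and flag any commit whose timestamp is earlier than the one before it. The sketch below assumes input lines of the form "id unix-timestamp"; `find_date_inversions` is a hypothetical helper, and the two timestamps are the r38535 and r38537 dates quoted above, converted to Unix time by hand.

```shell
# Hypothetical helper: read "id unix-timestamp" lines in history order and
# report every entry that is dated earlier than the entry before it.
find_date_inversions() {
    awk 'NR > 1 && ($2 + 0) < prev_ts { print $1, "is dated before", prev_id }
         { prev_id = $1; prev_ts = $2 + 0 }'
}

# r38535: 2007-07-11 08:47:55 +0000, r38537: 2006-06-18 05:42:02 +0000
printf '%s\n' 'r38535 1184143675' 'r38537 1150609322' | find_date_inversions
# prints: r38537 is dated before r38535

# Assumed usage on a real clone (committer dates, oldest commit first):
#   git log --reverse --format='%h %ct' | find_date_inversions
```

On a history with merges, the order printed by `git log --reverse` is not a strict parent chain, so hits there would need a closer look before being treated as real date jumps.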