Rewriting f18's history for inclusion in llvm monorepo

Hi List,

Following on from previous conversations about integrating f18 with the
llvm monorepo, we wanted to preserve as much history as we can, but also
to have a history without merge commits.

I've just submitted a pull request containing a "flatten.sh" which tries
to do this. Further information is in the pull request. To help with
review I've pushed the rewritten history up as well.

Pull request: https://github.com/flang-compiler/f18/pull/854
Example rewritten history: https://github.com/peterwaller-arm/f18/tree/new

It's not perfect yet, in particular for merge commits:

* The commit messages aren't great (yet).
* We could talk about exactly what metadata we want to preserve for merges.

For now I've assumed that the second-parent of the merge commit contains
the relevant authorship information for the patch, so the GIT_AUTHOR_*
is taken from this, which is the last commit before a pull request is
merged.

Once we're happy with this in flang-dev, we can present this to llvm-dev
and adapt the script for submission.

Your input is welcomed.

Regards,

- Peter

Peter, thanks for working on this!

Steve, Eric, thoughts? How does this compare to what you've been working on?

-Hal

Having done it for MLIR integration in the monorepo: if you’re already rewriting the history, you can also immediately move everything into a “flang” subdirectory.
It provides a better git log for files compared to a subtree merge.

I am using the git-filter-repo tool and invoke it with:

git-filter-repo --path-rename :mlir/ --force --refs master

An update.

I've managed to preserve much more history.

In the new scheme, I've managed to preserve 2,159 of 2,181 non-merge
commits (in my first attempt there were only 683 commits).

Additionally, I've added metadata trailers to the commit messages to
record the original commit in the f18 repo, and a link to the pull
request, where this information is available. Right now it's present on
trivial commits but missing for others even if they only required an
"easy" rebase.

Further explanation below.

In git terminology the "tree sha" determines what is present in the file
system. This can be seen with `git rev-parse SHA^{tree}` or `git
cat-file -p SHA`. The key thing I seek to preserve is the tree sha of
merge commits on the mainline branch. If these are the same, then the
contents are the same at those commits.
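
To illustrate why equal tree shas imply equal contents, here is a minimal Python sketch of how git derives object ids purely from content. It is not part of flatten.sh; the helper names are made up for this example, and it only handles a flat directory of regular files (mode 100644, no subdirectories).

```python
import hashlib
from typing import Dict


def git_object_id(kind: str, body: bytes) -> str:
    # git hashes "<kind> <size>" + NUL + body with SHA-1
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def blob_id(content: bytes) -> str:
    return git_object_id("blob", content)


def flat_tree_id(files: Dict[str, bytes]) -> str:
    # Assumption: regular files only, no subdirectories.
    body = b""
    for name in sorted(files):  # tree entries are stored sorted by name
        raw_sha = bytes.fromhex(blob_id(files[name]))
        body += b"100644 " + name.encode() + b"\0" + raw_sha
    return git_object_id("tree", body)


before = flat_tree_id({"foo.h": b"int x;\n"})
after = flat_tree_id({"foo.h": b"int x;\n"})
print(before == after)  # True: same contents, same tree id
```

Because the id depends only on names, modes and file contents, two commits with the same tree sha have byte-identical work trees, regardless of how their histories differ.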

There are three cases I consider. In each case, we want to ensure that
the contents of the filesystem at that commit are the same.

1. If no commits happened to mainline since the fork for a PR, then we
can rewrite the merge as a fast-forward. The tree shas of the individual
commits are preserved.

* This is a trivial case, and seems obviously safe with respect to
preserving semantics, except that it introduces commits on the mainline
which were previously in a branch. We lose the information about when
the commits landed on mainline.

2. If commits happened on mainline since the branch forked, we must
rebase. Now things are trickier. Intermediate commits in a PR (those
before it was merged) may now be semantically wrong, according to the
usual problems of rebasing. However, if the rebase has no conflicts, and
the tree shas are the same at the end of the rebase, then this may be
good enough. By the final commit of the rebase, the tree is the same as
the merge commit, so no difference is introduced there.

3. If commits happen on the mainline, and a rebase has conflicts, then
things get harder. It turns out that there are only 6 of those merge
commits with 66 commits on the branches. It might be possible to handle
those manually, but I'm unlikely to do it myself. For now, those are
squashed as before. Their commit hashes are written to hard.txt by the
flatten.sh shell script. If someone cares about those enough, then they
could do a rebase for each of the "hard" MERGESHA given at the end of
this email with `git checkout -b rewritten-MERGESHA MERGESHA^2; git
rebase MERGESHA^1`, and fix all the conflicts carefully. If they did
this and pushed the six rewritten-MERGESHA branches somewhere,
flatten.sh could pull in those rebased commits at the appropriate
moment. So long as `git diff MERGESHA rewritten-MERGESHA` is empty after
the rebase, I don't see why this wouldn't work. Just beware that making
each individual patch semantically correct may not be easy.
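
The three cases above can be summarised as a small decision function. This is only a sketch in Python: the `MergeInfo` record and its fields are hypothetical and do not correspond to anything in flatten.sh. It also assumes that a clean rebase which ends on a different tree is treated like a conflict, which the description above does not spell out.

```python
from dataclasses import dataclass


@dataclass
class MergeInfo:
    """Facts about one mainline merge commit (all fields hypothetical)."""
    mainline_moved: bool    # commits landed on mainline after the PR forked
    rebase_conflicts: bool  # replaying the PR onto mainline hit a conflict
    rebased_tree: str       # tree sha at the end of the rebase ("" if none)
    merge_tree: str         # tree sha of the original merge commit


def classify(merge: MergeInfo) -> str:
    if not merge.mainline_moved:
        return "fast-forward"  # case 1: commits kept as they are
    if not merge.rebase_conflicts and merge.rebased_tree == merge.merge_tree:
        return "rebase"        # case 2: clean rebase, identical final tree
    return "squash"            # case 3: goes into the "hard" bucket


print(classify(MergeInfo(False, False, "", "aaa")))    # fast-forward
print(classify(MergeInfo(True, False, "aaa", "aaa")))  # rebase
print(classify(MergeInfo(True, True, "", "aaa")))      # squash
```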

Another known deficiency is that we don't currently handle merges from
mainline back into feature branches cleanly. There are a few of those
and they look pretty weird. I haven't yet thought about this in great
detail. I'll see if I can fix this issue.

The current rewritten history is up at
https://github.com/peterwaller-arm/f18/commits/rewritten-history. I have
pushed the new script up to https://github.com/flang-compiler/f18/pull/854.

Regards,

- Peter

p.s. Here is the current output of the script:

Original history had 2181 non-merge commits.
New history has 2165 commits.

Preserved: 2159 Easy: 568 Hard merges: 6 Hard commits: 66

Merge commits which need rebasing:

b9f25364a8b201ab71f6208f1923d8ca8670595a
92a20cbdc9ec72a97ce0ea1f733b61ce1ae77de7
f11ceaa7c9df03fe5ad8cd68e5ebb9b5e1853595
d24de5513e6f746a539aaded6091759fa54998e4
2d20bc549c441c243b6085fe821d2eefd6594f39
71ae0d091585537738059637144f1985fd4b05f1

Another known deficiency is that we don't currently handle merges from
mainline back into feature branches cleanly. There are a few of those
and they look pretty weird. I haven't yet thought about this in great
detail. I'll see if I can fix this issue.

Ah. Here is an example problematic case. 690b6f0d1 moves lots of files
around the directory hierarchy. When git does a rebase of 02e8c4c86
(Initial work on the representation of types.) onto 34fc2397a (Merge
pull request #4), the rebased 02e8c4c86 ends up undoing the rename.

Currently, in flatten.sh, this results in the merge commit 2b3b07f10
existing in the new flattened history (with one parent), which has the
effect of redoing this rename.

In chronological order, here's what happened:

* Mainline does a bunch of renames {foo.h => blah/foo.h}
* Branch modifies foo.h
* Branch merges mainline, and the recursive merge strategy "does the
right thing with rename detection"

Result: modified blah/foo.h.

What do you get when you rebase the merge onto mainline? You end up with
a new `foo.h` living alongside `blah/foo.h`. I even tried `git rebase
--merge -Xfind-renames`, to no avail. What do I expect to get? Ideally,
I'd get just the modification to blah/foo.h.

I wasn't yet able to figure out how to provoke git-rebase to deal with
this case. The simplest thing might just be to leave these "single
parent merge" commits in place, since they have the effect of correcting
the work-tree contents to what they should be. Unfortunately it means
the "modified foo.h" commits are wrong, since they're modifying the
wrong files. The other obvious solution might be to detect this case and
throw it into the "hard" bucket.

Here is the git log --graph --oneline for the problematic merge:

*   7f2bbe3f4 Merge pull request #3 from tskeith/type
|\
| * 9d06c385d Adapt to new directory for idioms.cc, idioms.h.
| *   2b3b07f10 Merge remote-tracking branch 'upstream/master' into type
| |\
| |/
|/|
* |   34fc2397a Merge pull request #4 from ThePortlandGroup/directories
|\ \
| * | e10e43b69 Tweaked .clang-format, then ran it.
| * | 690b6f0d1 Impose a directory structure. Move files around. Introduce an intermediate "parser" namespace.
|/ /
| * ecdffa374 Address some of the review comments.
| * 02e8c4c86 Initial work on the representation of types.
|/
* f40b5e40d Markdown improvements.

I just had a quick go at this by detecting merges into branches and it
results in 199 squashed merge commits, which consolidate a total of 603
commits.
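
As a rough illustration of what "detecting merges into branches" could mean, the Python sketch below flags every merge commit that is not on the first-parent chain of the mainline tip. This is an assumed heuristic, not necessarily what flatten.sh does, and the function names are invented for the example. The parent lists are read off the graph above (first parent listed first) and should be treated as an assumption too.

```python
from typing import Dict, List


def first_parent_chain(parents: Dict[str, List[str]], tip: str) -> List[str]:
    # Walk from the tip, always following the first parent.
    chain = []
    commit = tip
    while commit is not None:
        chain.append(commit)
        ps = parents.get(commit, [])
        commit = ps[0] if ps else None
    return chain


def merges_into_branches(parents: Dict[str, List[str]], tip: str) -> List[str]:
    mainline = set(first_parent_chain(parents, tip))
    return sorted(
        c for c, ps in parents.items() if len(ps) > 1 and c not in mainline
    )


# Parent lists for the example graph; f40b5e40d is treated as the root here.
example = {
    "7f2bbe3f4": ["34fc2397a", "9d06c385d"],
    "9d06c385d": ["2b3b07f10"],
    "2b3b07f10": ["ecdffa374", "34fc2397a"],
    "34fc2397a": ["f40b5e40d", "e10e43b69"],
    "e10e43b69": ["690b6f0d1"],
    "690b6f0d1": ["f40b5e40d"],
    "ecdffa374": ["02e8c4c86"],
    "02e8c4c86": ["f40b5e40d"],
    "f40b5e40d": [],
}
print(merges_into_branches(example, "7f2bbe3f4"))  # ['2b3b07f10']
```

Branches containing such a merge would then be squashed rather than rebased commit by commit.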

I'm now going to stop working on this until there is more public
feedback about precisely what needs to be achieved.

Generally, I'm beginning to think trying to preserve the individual
commits in branches is fraught, especially when there is merge activity
going into those branches. It really needs someone to go and rebase all
those branches and check that the result is correct, and I think there
may be many cases where it is not.

When linearizing the history in this way, the only points about which we
have a simple set of guarantees are the first-parent commits on the mainline branch.

Perhaps there is a better approach I've not considered, but I'm thinking
it would require someone who is more adept at patch algebra than me :)

Have a good weekend,

- Peter

Hi Peter

I think this is fantastic progress - thanks so much for your hard work here.

To everyone else: with Peter's script we can get ourselves a straight-line history of F18 with a meagre 6 squash commits in it.
I think that if we can fix the issue with locating the original commit, so that we can record it in the commit message of the new commit, then we can still manually piece together a good git history for these 6 commits.

I propose this is good enough to present to the LLVM community as our patch set to put on top of the monorepo - anyone disagree with that?

Ta
Rich

Another thought:

Issue and pull-request numbers referenced in a commit message as they
currently are will create references to issues and pull-requests in the
LLVM repository, rather than the f18 repository. These will contaminate
random github issues and pull requests in llvm-project, and make these
issue numbers confusing for everybody, I think.

I propose to fix this by rewriting the messages as part of history
rewriting so that #123 becomes flang-compiler/f18#123. This is how
cross-repository references are specified on github.
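
A minimal sketch of such a rewrite, in Python. The function name and the regular expression are illustrative assumptions rather than the code used in the pull request. The pattern leaves alone any reference that is already qualified, but a real rewrite would still need checking for false positives.

```python
import re

# A bare "#123": the "#" must not follow a word character, "/", "." or "-",
# so an already-qualified "owner/repo#123" is left untouched.
_BARE_REF = re.compile(r"(?<![\w/.-])#(\d+)\b")


def qualify_refs(message: str, repo: str = "flang-compiler/f18") -> str:
    return _BARE_REF.sub(lambda m: f"{repo}#{m.group(1)}", message)


print(qualify_refs("Merge pull request #4 from ThePortlandGroup/directories"))
# Merge pull request flang-compiler/f18#4 from ThePortlandGroup/directories
```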

If we do this, the reference will be mutual - if you're looking at the
issue in the f18 repository, it will show any commit in llvm-project
which references it.

Please note that while I have a prototype rewritten history pushed,
github is showing references into my f18 fork in f18's issues, which is
also a bit of a mess. I'll make those go away by deleting the branch and
if necessary the fork, once it is not of any further use.

Regards,

- Peter

Hi All,

A third attempt, following feedback and study.

The shell script had issues that led to surprising trees and to problems
generating the Original-commit trailer. I found these easier to work
around by using the lower-level C API provided by libgit2. If you
want to see the script please take a look at the pull request:
https://github.com/flang-compiler/f18/pull/854 - I warn you, it's ugly!
The old quote, "I wanted to write a shorter program, but I didn't have
the time" comes to mind :).

Now there is a linear history, keeping the empty merge commits. The
commits rewrite the content under the flang/ directory and take the
current llvm-project master branch as the parent for (what was) the root
commit. This is something that can in principle be pushed to
llvm-project, assuming everyone (including llvm-dev) is happy.

=== Key links:

* Tree, merged with LLVM:
https://github.com/peterwaller-arm/f18/tree/rewritten-history-v2-llvm-project-merge

* Rewritten history:
https://github.com/peterwaller-arm/f18/commits/rewritten-history-v2-llvm-project-merge

* Rewritten history without llvm merge:
https://github.com/peterwaller-arm/f18/commits/rewritten-history-v2

* Link to the program pull request:
https://github.com/flang-compiler/f18/pull/854

=== Next steps:

* I understand that the flang community would like to push this into
upstream before the llvm-10 branch in mid-January.
* I'll email llvm-dev to solicit feedback with the intent that we would
like to do this in the near future.
* Modulo any feedback from this email or llvm-dev, I believe it's ready
to go. It just requires someone to follow the steps, run the script, and
push the resulting branch onto llvm-project.
* When we're ready to pull the trigger, I think we should:
  * permanently stop accepting commits on flang-compiler/f18, and
    redirect those commits to llvm-project.
  * run the rewrite script
  * verify the rewrite (which should be fairly easy)
  * push the new history into llvm-project.

I've sent a message to the llvm-dev mailing list explaining that we
intend to merge in the near future, all being well.

Link: http://lists.llvm.org/pipermail/llvm-dev/2019-December/137661.html

Hi All,

I've made an issue to track a checklist of what needs to happen to push
the new history up at https://github.com/flang-compiler/f18/issues/876.

Please comment there or edit the list if you think of further items
which need addressing.

Regards,

- Peter