git strategy for handling the llvm-project repo together with ours

Dear all,

We are integrating support into Clang + LLVM for a target architecture (an accelerator device) we are developing. Currently, we have a git repository that contains LLVM 9 together with the files we have added or changed.

Now, we want a sustainable strategy for handling the llvm-project repository together with our changes. To summarize, we want to migrate to LLVM 12, starting with 12.0.0, add our target architecture support based on this version, continue development of our target, and from time to time pull the latest changes from the official llvm-project repo so that we stay up to date over time.

Hence, we somehow have to "merge" the llvm-project repository and ours, and keep the history of both. As far as I know, there are several ways we could do this, e.g. using submodules or a subtree merge.

In the future, once support for our target architecture is mature and the hardware is publicly available, it would be interesting for us to have our target support in the official llvm-project repository, in which case our additions would eventually have to go into the official repo.

My question is: is there an intended way to handle this with git?

Any hints would be greatly appreciated.

Best regards,

Kai Plociennik

> Dear all,
>
> We are integrating support into Clang + LLVM for a target
> architecture (an accelerator device) we are developing. Currently,
> we have a git repository that contains LLVM 9 together with the
> files we have added or changed.
>
> Now, we want a sustainable strategy for handling the llvm-project
> repository together with our changes. To summarize, we want to
> migrate to LLVM 12, starting with 12.0.0, add our target
> architecture support based on this version, continue development of
> our target, and from time to time pull the latest changes from the
> official llvm-project repo so that we stay up to date over time.
>
> Hence, we somehow have to "merge" the llvm-project repository and
> ours, and keep the history of both. As far as I know, there are
> several ways we could do this, e.g. using submodules or a subtree
> merge.

I haven't personally had to deal with a new target, so someone who
has done it might have different suggestions. I've cc'd Min, who
seems to be the one driving the addition of the M68k target, the
most recent new target added to LLVM.

I do have extensive experience with managing downstream changes and
handling merges, so I have suggestions based on that experience.

In your situation, I think the simplest way to manage your changes
is to start with a clone of upstream LLVM, which has its 'main'
branch. Then create a 'target' branch off of 'main' and do all your
work on 'target'. When you decide to update to a new revision of
LLVM, you use 'git pull' to update 'main', and then *rebase* your
'target' branch. This tactic keeps all your changes at HEAD, with
history intact, which vastly simplifies figuring out where bugs
have come from. It would also be straightforward to move from a
rare pull/rebase to doing this more frequently, which will reduce
the "merge pain" of each pull/rebase.

I do *not* recommend what we (Sony) did, which is to keep our
changes intermixed with upstream changes. That decision was made
too long ago to undo at any practical cost. We have an automated
merge system to keep ourselves continually updated to upstream
HEAD, which is an ongoing maintenance cost, but much, much better
than trying to do the same thing once every six months or so.

More about how we operate can be found here:
https://llvm.org/devmtg/2015-10/#tutorial4

> In the future, once support for our target architecture is mature
> and the hardware is publicly available, it would be interesting for
> us to have our target support in the official llvm-project
> repository, in which case our additions would eventually have to go
> into the official repo.

This is another reason to try to keep your downstream changes at HEAD.
It will be much easier to post your new target's patches if you are
already based on top of 'main'.

Best Regards,
--paulr

Hi,

>> Dear all,
>>
>> We are integrating support into Clang + LLVM for a target
>> architecture (an accelerator device) we are developing. Currently,
>> we have a git repository that contains LLVM 9 together with the
>> files we have added or changed.
>>
>> Now, we want a sustainable strategy for handling the llvm-project
>> repository together with our changes. To summarize, we want to
>> migrate to LLVM 12, starting with 12.0.0, add our target
>> architecture support based on this version, continue development of
>> our target, and from time to time pull the latest changes from the
>> official llvm-project repo so that we stay up to date over time.
>>
>> Hence, we somehow have to "merge" the llvm-project repository and
>> ours, and keep the history of both. As far as I know, there are
>> several ways we could do this, e.g. using submodules or a subtree
>> merge.

> I haven't personally had to deal with a new target, so someone who
> has done it might have different suggestions. I've cc'd Min, who
> seems to be the one driving the addition of the M68k target, the
> most recent new target added to LLVM.

Yes, we actually had the exact same problem when we tried to migrate from LLVM 8 or so to the monorepo. We used a script back then: https://github.com/M680x0/M680x0-llvm/issues/58
But note that this script only does ~70% of the work. I still spent quite some time doing some of the migration manually and cleaning up.

> I do have extensive experience with managing downstream changes and
> handling merges, so I have suggestions based on that experience.
>
> In your situation, I think the simplest way to manage your changes
> is to start with a clone of upstream LLVM, which has its 'main'
> branch. Then create a 'target' branch off of 'main' and do all your
> work on 'target'. When you decide to update to a new revision of
> LLVM, you use 'git pull' to update 'main', and then *rebase* your
> 'target' branch. This tactic keeps all your changes at HEAD, with
> history intact, which vastly simplifies figuring out where bugs
> have come from. It would also be straightforward to move from a
> rare pull/rebase to doing this more frequently, which will reduce
> the "merge pain" of each pull/rebase.

I second this approach. This is also roughly what I did before M68k got upstreamed. It also makes it easier for you to collect patches if you plan to upstream your target in the future (just pick the commits at the tip of your target branch). The only (little) downside was that there were some users who wanted to try our tree, and using rebase meant that they needed to 'git pull --force' every time they updated.
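
As a side note, a common update recipe for users tracking such a
rebased branch (assuming the branch is named 'target' and they have
no local commits they want to keep) is:

  git fetch origin
  git reset --hard origin/target   # caution: discards local changes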

> I do *not* recommend what we (Sony) did, which is to keep our
> changes intermixed with upstream changes. That decision was made
> too long ago to undo at any practical cost. We have an automated
> merge system to keep ourselves continually updated to upstream
> HEAD, which is an ongoing maintenance cost, but much, much better
> than trying to do the same thing once every six months or so.
>
> More about how we operate can be found here:
> https://llvm.org/devmtg/2015-10/#tutorial4

>> In the future, once support for our target architecture is mature
>> and the hardware is publicly available, it would be interesting for
>> us to have our target support in the official llvm-project
>> repository, in which case our additions would eventually have to go
>> into the official repo.

Upstreaming the target is a whole other story. I can provide some tips and suggestions on that when you decide to do so.

Best
-Min

To avoid the issue of constantly having to rewrite history, you could merge from your main branch instead of rebasing. If I were contributing to a project and constantly had to drop my local history, I'd probably be a bit grumpy. You lose the benefit of your entire stream of patches being based on HEAD, but I'm guessing that eventual upstreaming wouldn't merge things in the historical order of the patches anyway? (Maybe I'm wrong; I've never upstreamed a target.) Merge commits get a bad rap, but I think they're actually quite a useful tool in git when used judiciously, and if you know about 'git log --first-parent', most of people's complaints about them are handled.
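
A minimal sketch of this merge-based alternative, using the same
branch names as in the rebase sketch above:

  git checkout target
  git merge main            # record the upstream sync as a merge commit
  git log --first-parent    # downstream commits plus one entry per
                            # sync, hiding merged-in upstream history

No history is rewritten, so downstream users can update with a plain
'git pull'.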

Another aspect is that rebasing a long-lived branch leads to a history that does not make sense: after rebasing, you would likely just fix the API uses at the top of the branch, which leaves most of the history unbuildable. This kills bisection, since you can't build previous revisions of your own project (unless you actually fix every individual commit during the rebase, but that's not scalable).

Thank you very much for your detailed suggestions on my questions; this helped me a lot!

Best regards,

Kai Plociennik

Geoffrey, Mehdi,

Excellent observations; however, I think it's worth remembering
the stated use case.

Geoffrey Martin-Noble wrote:

> To avoid the issue of constantly having to rewrite history,

Note that the OP said:

> from time to time pull the latest changes from the official
> llvm-project repo so that we stay up to date over time.

I think "from time to time" is far from "constantly." If they
were going to do continual updates, I wouldn't suggest rebasing
at all; but when updates are rare, I think it's an extremely
viable choice. Also they said,

> In the future, once support for our target architecture is mature
> and the hardware is publicly available,

This (especially the not-public part) implies to me that the cadre
of developers is small, and imposing a rare (2x/year?) requirement
to do a force-pull or just re-clone is not a harsh burden.

Mehdi AMINI wrote:

> Another aspect is that rebasing a long-lived branch leads to a
> history that does not make sense: after rebasing, you would likely
> just fix the API uses at the top of the branch, which leaves most
> of the history unbuildable

I find rebasing is effectively a commit-by-commit merge-to-HEAD.
Normally when I've done this, conflicts are quite likely for API
changes, which would have to be fixed up in the middle of the
rebase before you could do the --continue; not a fix-at-the-end
situation.

It's true that for a new target, most of the work would be in
target-specific files and git wouldn't notice any conflicts.
If I were doing this semi-annual rebase, I'd probably want to do
an incremental build after each commit just to catch that kind
of thing. Tedious but not super expensive compute-wise, for a
new target, and very scriptable.
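
One way to script this is git's --exec option, which runs a command
after applying each commit of the rebase and stops if it fails (here
assuming a ninja build directory named 'build'):

  git rebase -x "ninja -C build" main target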

Rebasing instead of merging would also *improve* bisection, if
you pay attention to keeping the rebased commits buildable.
I promise you, having done it, bisecting the current problem to
a 6-months-of-upstream-changes merge commit *really* isn't helpful.
Eliminating those headaches was a significant benefit of our
conversion to continual integration.

Basically, for bisection to work in a reasonable way, you have
to have either a linear history of small merges, like we get now
with our continual integration, or a linear history that is pure
upstream with local changes at the very tip.
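
As an aside: if one does end up with a merge-based history, recent
versions of git can at least restrict a bisection to the first-parent
chain, so it identifies the offending merge instead of wandering into
upstream-only commits:

  git bisect start --first-parent <bad-commit> <good-commit>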

Anyway, best of luck to the original poster, and back to doing
real work!

--paulr

> Geoffrey, Mehdi,
>
> Excellent observations; however, I think it's worth remembering
> the stated use case.
>
> Geoffrey Martin-Noble wrote:
>
>> To avoid the issue of constantly having to rewrite history,
>
> Note that the OP said:
>
>> from time to time pull the latest changes from the official
>> llvm-project repo so that we stay up to date over time.
>
> I think "from time to time" is far from "constantly." If they
> were going to do continual updates, I wouldn't suggest rebasing
> at all; but when updates are rare, I think it's an extremely
> viable choice. Also they said,
>
>> In the future, once support for our target architecture is mature
>> and the hardware is publicly available,
>
> This (especially the not-public part) implies to me that the cadre
> of developers is small, and imposing a rare (2x/year?) requirement
> to do a force-pull or just re-clone is not a harsh burden.
>
> Mehdi AMINI wrote:
>
>> Another aspect is that rebasing a long-lived branch leads to a
>> history that does not make sense: after rebasing, you would likely
>> just fix the API uses at the top of the branch, which leaves most
>> of the history unbuildable
>
> I find rebasing is effectively a commit-by-commit merge-to-HEAD.
> Normally when I've done this, conflicts are quite likely for API
> changes, which would have to be fixed up in the middle of the
> rebase before you could do the --continue; not a fix-at-the-end
> situation.
>
> It's true that for a new target, most of the work would be in
> target-specific files and git wouldn't notice any conflicts.
> If I were doing this semi-annual rebase, I'd probably want to do
> an incremental build after each commit just to catch that kind
> of thing. Tedious but not super expensive compute-wise, for a
> new target, and very scriptable.
>
> Rebasing instead of merging would also *improve* bisection, if
> you pay attention to keeping the rebased commits buildable.

Yes, that's the option I mentioned as "not scalable": if you are actively developing, you'll have a lot of code churn, and making sure all the intermediate states of your branch are buildable on top of the most recent LLVM can be a lot of "unnecessary" work (I'm using quotes because it seems subjective here).
Also, you may have O(100s) or O(1000s) of commits to build and fix individually, instead of just making sure the end state is good. Take this over multiple years of development, potentially…

> I promise you, having done it, bisecting the current problem to
> a 6-months-of-upstream-changes merge commit *really* isn't helpful.

Yes: the answer to this isn't obviously rebasing to me; I'd rather merge multiple times a day (continuously if possible).
Ideally you're only limited by the length of your testing suite: if you have a bot that continuously does "attempt-merge, test, and push-if-passing" you get very good bisection for the minimum amount of work (you only have to fix something when an upstream commit breaks you, and the fix is scoped to the actual commit that broke you). This may be what you're mentioning below with "small merges"?
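
A minimal sketch of such a bot, with hypothetical remote and branch
names and LLVM's check-all test target (a real setup would add
conflict handling, locking, and notifications):

  #!/bin/sh
  set -e                       # abort on the first failing step
  git fetch origin             # upstream llvm-project
  git checkout target
  git merge origin/main        # attempt-merge; conflicts abort here
  ninja -C build check-all     # test; assumes an existing build dir
  git push downstream target   # push-if-passing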

>> Rebasing instead of merging would also *improve* bisection, if
>> you pay attention to keeping the rebased commits buildable.

> Yes, that's the option I mentioned as "not scalable": if you are
> actively developing, you'll have a lot of code churn, and making
> sure all the intermediate states of your branch are buildable on
> top of the most recent LLVM can be a lot of "unnecessary" work
> (I'm using quotes because it seems subjective here).
> Also, you may have O(100s) or O(1000s) of commits to build and fix
> individually, instead of just making sure the end state is good.
> Take this over multiple years of development, potentially...

So, "it depends." I'd naively expect that a new target would not
have O(1000s) commits. Again, remember the context of the OP.
My recommendations for a generic downstream project (e.g., targeting
a gaming console :blush: ) would be different from the stated use-case.

>> I promise you, having done it, bisecting the current problem to
>> a 6-months-of-upstream-changes merge commit *really* isn't helpful.

> Yes: the answer to this isn't obviously rebasing to me; I'd rather
> merge multiple times a day (continuously if possible).

The OP clearly stated "time to time" updates, and my answer is
based on conforming to that schedule, rather than proposing an
entirely different schedule.

> Ideally you're only limited by the length of your testing suite:
> if you have a bot that continuously does "attempt-merge, test,
> and push-if-passing" you get very good bisection for the minimum
> amount of work (you only have to fix something when an upstream
> commit breaks you, and the fix is scoped to the actual commit
> that broke you). This may be what you're mentioning below with
> "small merges"?

Exactly, and that's what Sony has been doing for some years now.
We're averaging about one manual intervention per day, which is
reasonable for us but might be excessive for a very small project
such as the OP's, who might rather schedule a big-bang merge/rebase
once every 6-12 months. Investing time in getting individual
commits to work might well be worth the trouble in that situation.

--paulr