git strategy for handling the llvm-project repo together with ours

Dear all,

We are integrating support into Clang + LLVM for a target architecture (an accelerator device) we are developing. Currently, we have a git repository that contains LLVM 9 together with the files we have added or changed.

Now, we want a sustainable strategy for handling the llvm-project repository together with our changes. To summarize, we want to migrate to LLVM 12, starting with 12.0.0, add our target architecture support based on this version, continue development of our target, and from time to time pull the latest changes from the official llvm-project repo so that we stay up to date over time.

Hence, we somehow have to "merge" the llvm-project repository and ours, and keep the history of both. As far as I know, there are several ways we could do this, e.g. using submodules or a subtree merge.

In the future, once support for our target architecture is mature and the hardware is publicly available, it would be interesting for us to have our target support in the official llvm-project repository, in which case our additions would eventually have to go into the official repo.

My question is: is there an intended way to handle this with git?

Any hints would be greatly appreciated.

Best regards,

Kai Plociennik

> Dear all,
>
> We are integrating support into Clang + LLVM for a target
> architecture (an accelerator device) we are developing. Currently,
> we have a git repository that contains LLVM 9 together with the
> files we have added or changed.
>
> Now, we want a sustainable strategy for handling the llvm-project
> repository together with our changes. To summarize, we want to
> migrate to LLVM 12, starting with 12.0.0, add our target
> architecture support based on this version, continue development of
> our target, and from time to time pull the latest changes from the
> official llvm-project repo so that we stay up to date over time.
>
> Hence, we somehow have to "merge" the llvm-project repository and
> ours, and keep the history of both. As far as I know, there are
> several ways we could do this, e.g. using submodules or a subtree
> merge.

I haven't personally had to deal with a new target, so someone who
has done it might have different suggestions. I've cc'd Min, who
seems to be the one driving the addition of the M68k target, the
most recent new target added to LLVM.

I do have extensive experience with managing downstream changes and
handling merges, so I have suggestions based on that experience.

In your situation, I think the simplest way to manage your changes
is to start with a clone of upstream LLVM, which has its 'main'
branch. Then create a 'target' branch off of 'main' and do all your
work on 'target'. When you decide to update to a new revision of
LLVM, you use 'git pull' to update 'main', and then *rebase* your
'target' branch. This tactic keeps all your changes at HEAD, with
history intact, which vastly simplifies figuring out where bugs
have come from. It would also be straightforward to move from a
rare pull/rebase to doing this more frequently, which will reduce
the "merge pain" of each pull/rebase.

I do *not* recommend what we (Sony) did, which is to keep our
changes intermixed with upstream changes. That decision was made
too long ago to undo at any practical cost. We have an automated
merge system to keep ourselves continually updated to upstream
HEAD, which is an ongoing maintenance cost, but much, much better
than trying to do the same thing once every six months or so.

More about how we operate can be found here:
https://llvm.org/devmtg/2015-10/#tutorial4

> In the future, once support for our target architecture is mature
> and the hardware is publicly available, it would be interesting for
> us to have our target support in the official llvm-project
> repository, in which case our additions would eventually have to go
> into the official repo.

This is another reason to try to keep your downstream changes at HEAD.
It will be much easier to post your new target's patches if you are
already based on top of 'main'.

Best Regards,
--paulr

Hi,

>> Dear all,
>>
>> We are integrating support into Clang + LLVM for a target
>> architecture (an accelerator device) we are developing. Currently,
>> we have a git repository that contains LLVM 9 together with the
>> files we have added or changed.
>>
>> Now, we want a sustainable strategy for handling the llvm-project
>> repository together with our changes. To summarize, we want to
>> migrate to LLVM 12, starting with 12.0.0, add our target
>> architecture support based on this version, continue development of
>> our target, and from time to time pull the latest changes from the
>> official llvm-project repo so that we stay up to date over time.
>>
>> Hence, we somehow have to "merge" the llvm-project repository and
>> ours, and keep the history of both. As far as I know, there are
>> several ways we could do this, e.g. using submodules or a subtree
>> merge.

> I haven't personally had to deal with a new target, so someone who
> has done it might have different suggestions. I've cc'd Min, who
> seems to be the one driving the addition of the M68k target, the
> most recent new target added to LLVM.

Yes, we actually had the exact same problem when we tried to migrate from LLVM 8 or so to the monorepo. We used a script back then: https://github.com/M680x0/M680x0-llvm/issues/58
But note that this script only does ~70% of the work. I still spent quite some time doing some of the migration manually and cleaning up.

> I do have extensive experience with managing downstream changes and
> handling merges, so I have suggestions based on that experience.
>
> In your situation, I think the simplest way to manage your changes
> is to start with a clone of upstream LLVM, which has its 'main'
> branch. Then create a 'target' branch off of 'main' and do all your
> work on 'target'. When you decide to update to a new revision of
> LLVM, you use 'git pull' to update 'main', and then *rebase* your
> 'target' branch. This tactic keeps all your changes at HEAD, with
> history intact, which vastly simplifies figuring out where bugs
> have come from. It would also be straightforward to move from a
> rare pull/rebase to doing this more frequently, which will reduce
> the "merge pain" of each pull/rebase.

I second this approach. This is also roughly what I did before M68k got upstreamed. It also makes it easier for you to collect patches if you plan to upstream your target in the future (just pick the commits at the tip of your target branch). The only (little) downside was that there were some users who wanted to try our tree, and using rebase meant that they needed to 'git pull --force' every time they updated.
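
As a side note, a common update recipe for users tracking such a
rebased branch (assuming the branch is named 'target' and they have
no local commits they want to keep) is:

  git fetch origin
  git reset --hard origin/target   # caution: discards local changes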

> I do *not* recommend what we (Sony) did, which is to keep our
> changes intermixed with upstream changes. That decision was made
> too long ago to undo at any practical cost. We have an automated
> merge system to keep ourselves continually updated to upstream
> HEAD, which is an ongoing maintenance cost, but much, much better
> than trying to do the same thing once every six months or so.
>
> More about how we operate can be found here:
> https://llvm.org/devmtg/2015-10/#tutorial4

>> In the future, once support for our target architecture is mature
>> and the hardware is publicly available, it would be interesting for
>> us to have our target support in the official llvm-project
>> repository, in which case our additions would eventually have to go
>> into the official repo.

Upstreaming the target is a whole other story. I can provide some tips and suggestions on that when you decide to do so.

Best
-Min

To avoid the issue of constantly having to rewrite history, you could merge from your main branch instead of rebasing. If I were contributing to a project and constantly had to drop my local history, I'd probably be a bit grumpy. You lose the benefit of your entire stream of patches being based on HEAD, but I'm guessing that eventual upstreaming wouldn't merge things in the historical order of the patches anyway? (Maybe I'm wrong; I've never upstreamed a target.) Merge commits get a bad rap, but I think they're actually quite a useful tool in git when used judiciously, and if you know about 'git log --first-parent', most of people's complaints about them are handled.
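
A minimal sketch of this merge-based alternative, using the same
branch names as in the rebase sketch above:

  git checkout target
  git merge main            # record the upstream sync as a merge commit
  git log --first-parent    # downstream commits plus one entry per
                            # sync, hiding merged-in upstream history

No history is rewritten, so downstream users can update with a plain
'git pull'.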

Another aspect is that rebasing a long-lived branch leads to a history that does not make sense: after rebasing, you would likely just fix the API uses at the top of the branch, which leaves most of the history unbuildable. This kills bisection, since you can't build previous revisions of your own project (unless you actually fix every individual commit during the rebase, but that's not scalable).

Thank you very much for your detailed suggestions on my questions; this helped me a lot!

Best regards,

Kai Plociennik

Geoffrey, Mehdi,

Excellent observations; however, I think it's worth remembering
the stated use case.

Geoffrey Martin-Noble wrote:

> To avoid the issue of constantly having to rewrite history,

Note that the OP said:

> from time to time pull the latest changes from the official
> llvm-project repo so that we stay up to date over time.

I think "from time to time" is far from "constantly." If they
were going to do continual updates, I wouldn't suggest rebasing
at all; but when updates are rare, I think it's an extremely
viable choice. Also they said,

> In the future, once support for our target architecture is mature
> and the hardware is publicly available,

This (especially the not-public part) implies to me that the cadre
of developers is small, and imposing a rare (2x/year?) requirement
to do a force-pull or just re-clone is not a harsh burden.

Mehdi AMINI wrote:

> Another aspect is that rebasing a long-lived branch leads to a
> history that does not make sense: after rebasing, you would likely
> just fix the API uses at the top of the branch, which leaves most
> of the history unbuildable

I find rebasing is effectively a commit-by-commit merge-to-HEAD.
Normally when I've done this, conflicts are quite likely for API
changes, which would have to be fixed up in the middle of the
rebase before you could do the --continue; not a fix-at-the-end
situation.

It's true that for a new target, most of the work would be in
target-specific files and git wouldn't notice any conflicts.
If I were doing this semi-annual rebase, I'd probably want to do
an incremental build after each commit just to catch that kind
of thing. Tedious but not super expensive compute-wise, for a
new target, and very scriptable.
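
One way to script this is git's --exec option, which runs a command
after applying each commit of the rebase and stops if it fails (here
assuming a ninja build directory named 'build'):

  git rebase -x "ninja -C build" main target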

Rebasing instead of merging would also *improve* bisection, if
you pay attention to keeping the rebased commits buildable.
I promise you, having done it, bisecting the current problem to
a 6-months-of-upstream-changes merge commit *really* isn't helpful.
Eliminating those headaches was a significant benefit of our
conversion to continual integration.

Basically, for bisection to work in a reasonable way, you have
to have either a linear history of small merges, like we get now
with our continual integration, or a linear history that is pure
upstream with local changes at the very tip.
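
As an aside: if one does end up with a merge-based history, recent
versions of git can at least restrict a bisection to the first-parent
chain, so it identifies the offending merge instead of wandering into
upstream-only commits:

  git bisect start --first-parent <bad-commit> <good-commit>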

Anyway, best of luck to the original poster, and back to doing
real work!

--paulr

> Geoffrey, Mehdi,
>
> Excellent observations; however, I think it's worth remembering
> the stated use case.
>
> Geoffrey Martin-Noble wrote:
>
>> To avoid the issue of constantly having to rewrite history,
>
> Note that the OP said:
>
>> from time to time pull the latest changes from the official
>> llvm-project repo so that we stay up to date over time.
>
> I think "from time to time" is far from "constantly." If they
> were going to do continual updates, I wouldn't suggest rebasing
> at all; but when updates are rare, I think it's an extremely
> viable choice. Also they said,
>
>> In the future, once support for our target architecture is mature
>> and the hardware is publicly available,
>
> This (especially the not-public part) implies to me that the cadre
> of developers is small, and imposing a rare (2x/year?) requirement
> to do a force-pull or just re-clone is not a harsh burden.
>
> Mehdi AMINI wrote:
>
>> Another aspect is that rebasing a long-lived branch leads to a
>> history that does not make sense: after rebasing, you would likely
>> just fix the API uses at the top of the branch, which leaves most
>> of the history unbuildable
>
> I find rebasing is effectively a commit-by-commit merge-to-HEAD.
> Normally when I've done this, conflicts are quite likely for API
> changes, which would have to be fixed up in the middle of the
> rebase before you could do the --continue; not a fix-at-the-end
> situation.
>
> It's true that for a new target, most of the work would be in
> target-specific files and git wouldn't notice any conflicts.
> If I were doing this semi-annual rebase, I'd probably want to do
> an incremental build after each commit just to catch that kind
> of thing. Tedious but not super expensive compute-wise, for a
> new target, and very scriptable.
>
> Rebasing instead of merging would also *improve* bisection, if
> you pay attention to keeping the rebased commits buildable.

Yes, that's the option I mentioned as "not scalable": if you are actively developing, you'll have a lot of code churn, and making sure all the intermediate states of your branch are buildable on top of the most recent LLVM can be a lot of "unnecessary" work (I'm using quotes because it seems subjective here).
Also, you may have O(100s) or O(1000s) of commits to build and fix individually, instead of just making sure the end state is good. Take this over multiple years of development, potentially…

> I promise you, having done it, bisecting the current problem to
> a 6-months-of-upstream-changes merge commit *really* isn't helpful.

Yes: the answer to this isn't obviously rebasing to me; I'd rather merge multiple times a day (continuously if possible).
Ideally you're only limited by the length of your testing suite: if you have a bot that continuously does "attempt-merge, test, and push-if-passing" you get very good bisection for the minimum amount of work (you only have to fix something when an upstream commit breaks you, and the fix is scoped to the actual commit that broke you). This may be what you're mentioning below with "small merges"?
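
A minimal sketch of such a bot, with hypothetical remote and branch
names and LLVM's check-all test target (a real setup would add
conflict handling, locking, and notifications):

  #!/bin/sh
  set -e                       # abort on the first failing step
  git fetch origin             # upstream llvm-project
  git checkout target
  git merge origin/main        # attempt-merge; conflicts abort here
  ninja -C build check-all     # test; assumes an existing build dir
  git push downstream target   # push-if-passing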

>> Rebasing instead of merging would also *improve* bisection, if
>> you pay attention to keeping the rebased commits buildable.

> Yes, that's the option I mentioned as "not scalable": if you are
> actively developing, you'll have a lot of code churn, and making
> sure all the intermediate states of your branch are buildable on
> top of the most recent LLVM can be a lot of "unnecessary" work
> (I'm using quotes because it seems subjective here).
> Also, you may have O(100s) or O(1000s) of commits to build and fix
> individually, instead of just making sure the end state is good.
> Take this over multiple years of development, potentially...

So, "it depends." I'd naively expect that a new target would not
have O(1000s) commits. Again, remember the context of the OP.
My recommendations for a generic downstream project (e.g., targeting
a gaming console :blush: ) would be different from the stated use-case.

>> I promise you, having done it, bisecting the current problem to
>> a 6-months-of-upstream-changes merge commit *really* isn't helpful.

> Yes: the answer to this isn't obviously rebasing to me; I'd rather
> merge multiple times a day (continuously if possible).

The OP clearly stated "time to time" updates, and my answer is
based on conforming to that schedule, rather than proposing an
entirely different schedule.

> Ideally you're only limited by the length of your testing suite:
> if you have a bot that continuously does "attempt-merge, test,
> and push-if-passing" you get very good bisection for the minimum
> amount of work (you only have to fix something when an upstream
> commit breaks you, and the fix is scoped to the actual commit
> that broke you). This may be what you're mentioning below with
> "small merges"?

Exactly, and that's what Sony has been doing for some years now.
We're averaging about one manual intervention per day, which is
reasonable for us but might be excessive for a very small project
such as the OP's, who might rather schedule a big-bang merge/rebase
once every 6-12 months. Investing time in getting individual
commits to work might well be worth the trouble in that situation.

--paulr