[RFC] Splitting clang's TargetInfo.cpp

RFC ⚙ D147962 [RFC][clang] Pull experimental targets' info out of TargetInfo.cpp (NFC)

CodeGen/TargetInfo.cpp is already 12.5k lines and will continue to grow with new targets added.
The idea is to make a precedent so that new targets add implementation in separate files instead of putting everything into TargetInfo.cpp.

Obvious concern is that such splitting makes it difficult to navigate through git history. So the question is how far should we go with that? The options are, in order of the amount of changes and their impact on the git history:

  • Pull out trivial targets (M68k, TCE). Hopefully, this will be enough.
  • Pull all experimental targets (this is what the RFC does).
  • Pull ABIInfo, DefaultABIInfo, and utility function implementations into separate cpp file (RFC only pulls declarations into ABIInfoImpl.h, the definitions are kept in TargetInfo.cpp).
  • Refactor it completely. I have done it locally as a PoC, but it would be better if it is done on per-target basis by target maintainers if losing git history is acceptable to them. I also don’t want to be blamed too much.

Something like this has been done in
https://reviews.llvm.org/D30372
https://reviews.llvm.org/D35701

3 Likes

As the one who did this the last time (D35701), this was a fantastic idea for the Targets.cpp stuff, and I humbly suggest our CodeGen owners would want to do the same.

I found at the time that doing the split ‘all at once’ was of great benefit, and made the history way easier to deal with: there was a single ‘split’ commit.

I’d suggest doing the “Refactor it completely”. Doing the big “tear off the bandaid” all at once is better. It kind of stinks from a ‘tougher to track history’ perspective, but it is a single inflection point, rather than a series of them.

As far as your commit, I’d hope that @rjmccall @efriedma-quic and @asl can comment here, and do the review, I think this is very useful.

One additional thing that is really nice out of it from a downstream: it makes breaking up code ownership internally way easier when we can use file-based flags, rather than trying to figure out who owns what individually in a file.

2 Likes

One thing you can do here is to use --author='TargetInfo Refactoring <clang@llvm.org>' so that it is “obvious” in the blame to follow from a parent commit. git blame has a --ignore-rev flag; I suppose one could contribute a flag like --ignore-author for such things.

1 Like

Splitting all in one commit seems like the way to go. I know there’s quite a bit of shared code between the different targets, especially the paired targets like x86/x86_64 and ARM/AArch64, but it shouldn’t be hard to figure out ways to make that work.

I’m willing to be the committer if you’re really worried about getting blamed, but setting an author seems like a better alternative. Please just do whatever refactors you need to do separately from the splitting so that the final touch-everything commit will just be moving code around.

Thank you all. Honestly, I didn’t expect such a positive response.

Not that I’m worried of my name being written somewhere. I’m afraid that I won’t be able to instantly fix broken upstream if something goes wrong (e.g. missing include file). If it is not fixed quickly, it will be reverted, and after recommitting it the history will become rewritten three times.
llvm’s phabricator workflow is new to me, and arcanist continues to surprise me in different ways. I’m still to learn how to properly revert a commit or land stacked patches, for example. If someone could guide me through that once, I’d be grateful.

Here are the patches. Please read to the end before checking them out :slight_smile:
[1/8] ⚙ D148089 [clang][CodeGen] Break up TargetInfo.cpp [1/8]
[2/8] ⚙ D148090 [clang][CodeGen] Break up TargetInfo.cpp [2/8]
[3/8] ⚙ D148091 [clang][CodeGen] Break up TargetInfo.cpp [3/8]
[4/8] ⚙ D148092 [clang][CodeGen] Break up TargetInfo.cpp [4/8]
[5/8] ⚙ D148093 [clang][CodeGen] Break up TargetInfo.cpp [5/8]
[6/8] ⚙ D150178 [clang][CodeGen] Break up TargetInfo.cpp [6/8]
[7/8] ⚙ D150215 [clang][CodeGen] Break up TargetInfo.cpp [7/8]
[8/8] ⚙ D148094 [clang][CodeGen] Break up TargetInfo.cpp [8/8]

The first 5 of 6 patches are small refactorings necessary for the last patch.

It so happened that when I reached the last target (XCore), I realized that the changes could be more straightforward. I.e. MyTarget.h contains just:

class TargetCodeGenInfo;
std::unique_ptr<TargetCodeGenInfo> createMyTargetCodeGenInfo(<...>);

and MyTarget.cpp contains a piece of code literally cut&pasted from TargetInfo.cpp. Currently, I had to split the cut piece into h/cpp pair, which required non-trivial changes like out-lining some inline methods. (I didn’t do it manually – IDE helped with it, but still.) Some of the refactoring patches will become unnecessary, too. I can rework it if necessary.

I’m also tempted to join ABIInfo.h with ABIInfoImpl.h and the corresponding cpp files even though ABIInfo.h serves as “internal interface” header (it is included in CGCall.cpp, CGBuiltin.cpp, and, for some reason, in CodeGenModule.cpp). Please let me know if I should do that.

FWIW I would really like it if more files would get split up. For a newcomer (like myself), it’s really hard to get through these huge files (e.g. Sema.h). Splitting them up to have files that are at most ~3k lines would make it way easier to find stuff.

2 Likes

+1, we should split up TargetInfo.cpp. I previously opposed splitting this file for reasons that don’t really make sense to me anymore.

If this kind of work excites you, may I direct your attention to X86ISelLowering.cpp, one of the slowest files to compile in all of LLVM.

I’m not exactly excited, it is just my IDE is slow to respond when editing this file.

I have added two more commits and reworked the last one – there are no more *.h files.

[6/8] ⚙ D150178 [clang][CodeGen] Break up TargetInfo.cpp [6/8]
[7/8] ⚙ D150215 [clang][CodeGen] Break up TargetInfo.cpp [7/8]
[8/8] ⚙ D148094 [clang][CodeGen] Break up TargetInfo.cpp [8/8]

The 7-th commit adds factory functions for creating target implementations. This makes the last commit trivial as I it no longer splits the implementations into *.h and *.cpp file (everything now lives in *.cpp files).

I think I can’t make this any simpler, and I’m finally satisfied with the result. Please take a look.

oh lovely!

You could take this opportunity to clang-format the new files too? I don’t know that it needs it, but it can’t hurt.

I was a bit hesitant about reformatting them (one cannot just Ctrl+C(opy) a new file and Ctrl+F(ind) it in the old file), but in the end I did it. I also added them to clang-formatted-files.txt that allegedly ensures they are kept formatted.

Moving and reformatting in the same commit should ofcourse not be encouraged IMHO.
You could reformat in a pre-commit or post-commit and then do the code move in a separate commit.

But for all downstream maintainers, and also for git to sometimes understand that code has moved (?), my experience says that it is better to do not mix moving and editing code in the same commit (at least try to avoid it as much as possible).

I can leave reformatting to target maintainers, but git won’t understand that the code has moved. Just because there was one file and now there are multiple of them.
There is no code editing in the last commit, except that some formerly static functions were made global.
Should I still revert formatting? (That won’t be difficult, I just want to know how people feel about it.)

Git IME understands when a file has been moved/renamed, but not when it has been split.

Given that one file is being rearranged into many, that’s the commit that will cause downstream conflicts (speaking from experience here). If the reformat is not extensive, I’d think reformatting in the same commit would be okay. If it is extensive, then yeah separate commits.

Doing a reformat immediately after the split shouldn’t cause any additional downstream pain, really.

Given that one file is being rearranged into many, that’s the commit that will cause downstream conflicts (speaking from experience here).

I haven’t thought about it. I wrongly assumed that downstream changes can only add new code, not modify existing.
Now that you mentioned it, I think it’s best to not reformat the code in this commit.

git show --color-moved will still infer things about splits (though may infer incorrectly when lines are common).

1 Like

I’m going to land the last two patches next weekend. If anyone wants to review them please do so or ping me if I need to delay the merge.