Cross Translation Unit Support in Clang

I think this post refers to the subject of the recent “Cross Translation Unit Support in Clang” thread, but I’m not entirely sure I understand the issue, so I’ll start a new thread rather than potentially pollute that one.

So I’ve been working on a tool[1], based on libTooling, that automatically translates C code to SaferCPlusPlus (essentially, a memory-safe subset of C++). The problem I have, if I’m understanding it correctly, is that it can only convert one source file (which corresponds to a “translation unit”, right?) at a time, due to a limitation of libTooling. Right? Does that make sense?
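(For context, the driver is basically the standard libTooling skeleton, roughly the simplified sketch below rather than the tool's actual code; each invocation parses exactly one translation unit from the compilation database. Newer Clang releases construct the options parser via CommonOptionsParser::create instead of the constructor shown here.)

```cpp
// Simplified libTooling driver sketch (not the actual tool's code).
// ClangTool builds one independent AST per source file it is given, which is
// why a conversion run only ever sees a single translation unit.
#include "clang/Frontend/FrontendActions.h"
#include "clang/Tooling/CommonOptionsParser.h"
#include "clang/Tooling/Tooling.h"
#include "llvm/Support/CommandLine.h"

using namespace clang::tooling;

static llvm::cl::OptionCategory ToolCategory("auto-translate options");

int main(int argc, const char **argv) {
  // Reads the compilation database and the list of source files to process.
  CommonOptionsParser OptionsParser(argc, argv, ToolCategory);
  ClangTool Tool(OptionsParser.getCompilations(),
                 OptionsParser.getSourcePathList());
  // Each listed file is parsed into its own AST; a real tool would run its
  // rewriting FrontendAction here instead of a syntax-only check.
  return Tool.run(newFrontendActionFactory<clang::SyntaxOnlyAction>().get());
}
```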

Ok, so converting a source file (potentially) modifies the file itself and any included header files as well. The problem is when multiple source files include the same header file, because conversion of a source file requires the included header files to be in their original form. They can’t have been modified by a previous conversion of another source file. So after (individually) converting all the source files in a project, you can end up with multiple versions of a header file that were modified in different ways by the conversions of different source files. Right? (An example[2].)

So this means that to get the final converted version of a header file, you have to merge the modifications made by each conversion operation. And sometimes the modifications are made on the same line, so the merge tool can't do the merge automatically. (At least meld can't.) This is really annoying.

Now, if libTooling were able to operate on the AST of the entire project at once, this problem would go away. Or, if you think the AST of the whole project would often be too big, then being able to operate on multiple specified translation units at a time would at least help.

I’m not sure if this is what’s being offered in the “Cross Translation Unit Support in Clang” thread? I would want the Rewriter::overwriteChangedFiles() function to apply to all the source translation units of the AST as well.

Am I making sense? Feel free to set me straight.

Noah

[1] https://github.com/duneroadrunner/SaferCPlusPlus-AutoTranslation
[2] https://github.com/duneroadrunner/SaferCPlusPlus-AutoTranslation/tree/master/examples/lodepng

Hi Noah!

(Oops, I meant to start a new thread :slight_smile:) Thanks for your response.


Yes. Libtooling replacements are decoupled from the source manager for that reason and do that exact job.

The idea is to export just the information you need, keyed on the thing you want to change. That way, you can fully parallelize both parsing and postprocessing in a large code base.
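Rough sketch of what I mean (names here are illustrative; the per-TU YAML dump is the format that clang-apply-replacements consumes):

```cpp
// Sketch: instead of rewriting files directly from each TU, serialize the
// edits as Replacements keyed by file and offset. A later pass (e.g.
// clang-apply-replacements) deduplicates and applies them, so a header edited
// identically from several TUs is only changed once.
#include "clang/Tooling/Core/Replacement.h"
#include "clang/Tooling/ReplacementsYaml.h"
#include "llvm/Support/raw_ostream.h"

// Hypothetical helper: write one YAML file of edits per translation unit.
void exportEditsForTU(llvm::StringRef MainSourceFile,
                      const std::vector<clang::tooling::Replacement> &Edits,
                      llvm::raw_ostream &OS) {
  clang::tooling::TranslationUnitReplacements TUR;
  TUR.MainSourceFile = MainSourceFile.str();
  TUR.Replacements = Edits;
  llvm::yaml::Output YAMLOut(OS);
  YAMLOut << TUR; // merged and applied after all TUs have been processed
}
```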

Unfortunately that only scales to small projects in the generalized case. The static analyzer gets away with it because it only loads things close to the function it analyzes, instead of needing a global view.

What if you only modify the file that was given to the tool? I.e. don't modify the header files. Would that work?

In my tool I ignore everything that is not declared in the file given to the tool [1].

You could also copy the files and do the transformation on the copies. Might be a better idea anyway, to avoid overwriting the original files.

[1] https://github.com/jacob-carlborg/dstep/blob/master/dstep/translator/Translator.d#L315
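In libTooling terms (my tool is written in D, so this is only a rough Clang-side equivalent of the check linked above), the filter is essentially:

```cpp
// Rough equivalent of "ignore everything that is not declared in the file
// given to the tool": only declarations located in the main file of the
// current TU are translated, so #included headers are never rewritten.
#include "clang/AST/ASTContext.h"
#include "clang/AST/DeclBase.h"
#include "clang/Basic/SourceManager.h"

bool shouldTranslate(const clang::Decl *D, const clang::ASTContext &Ctx) {
  const clang::SourceManager &SM = Ctx.getSourceManager();
  // isInMainFile() is false for anything pulled in via #include.
  return SM.isInMainFile(SM.getExpansionLoc(D->getLocation()));
}
```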

...

I'm not saying it's not feasible. I'm not even saying it's not
reasonable. But you can see why it'd be nicer for me if libTooling could
just present me with a combined AST. :slight_smile:

Unfortunately that only scales to small projects in the generalized case.
The static analyzer gets away with it because it only loads things close to
the function it analyzes, instead of needing a global view.

Scales in what way? Memory usage? I just watched Gabor's talk where, if I
understood correctly, he mentions that the combined ASTs for llvm took up
40Gb when exported to disk. I'm assuming that's not compressed or anything.
There are plenty of workstations with 40Gb+ of RAM. The static analyzer may
need significant additional memory for its processing, but my tool does not.

And tools like mine aren't part of the interactive development process.
They may only ever need to be applied to a project once. So it might even
be acceptable for it to take a week or whatever to execute. (I don't see
why this wouldn't apply, to a lesser degree, to the static analyzer as
well.)

Are you sure you're not being overly conservative about the scaling issue?
Or am I somehow just being naive here? For me, I feel that it's a very
inconvenient feature omission, and I'm not sure I really understand the
reluctance.

Anyway, if I understood correctly, Gabor was implying that I could somehow
use/abuse the new feature to achieve a combined AST? By importing all the
functions in the other translation units manually? And if memory use really
is an issue, is there a way to import a function into the AST, check it
out, then lose it when you're done with it?

Noah

Both memory and compute.

There are also plenty of workstations that don’t have that much RAM. I personally like tools that I can run on a machine without killing all other jobs, too :slight_smile:

Generally, in my experience, for development speed, sanity and error resilience, faster turnaround times help.

We might disagree on the overhead of the 2-phase solution. We have built libtooling around this, and have quite a bit of infrastructure (that’s also getting expanded over time) to make this as easy as possible.

...

I'm not saying it's not feasible. I'm not even saying it's not
reasonable. But you can see why it'd be nicer for me if libTooling could
just present me with a combined AST. :slight_smile:

Unfortunately that only scales to small projects in the generalized
case. The static analyzer gets away with it because it only loads things
close to the function it analyzes, instead of needing a global view.

Scales in what way? Memory usage?

Both memory and compute.

I just watched Gabor's talk where, if I understood correctly, he mentions
that the combined ASTs for llvm took up 40Gb when exported to disk. I'm
assuming that's not compressed or anything. There are plenty of
workstations with 40Gb+ of RAM.

There are also plenty of workstations that don't have that much RAM. I
personally like tools that I can run on a machine without killing all other
jobs, too :slight_smile:

Also note that 40Gb on disk might not correspond to 40Gb+ of RAM. One of the reasons the dumps are so big is that the headers are duplicated in the AST files of the separate TUs. If we utilize lazy loading from the AST files (i.e. loading only the definitions we need) and merge those definitions into the original AST, we might save a lot of memory compared to the disk dumps. And in the case of modules, the duplication would be much smaller on disk too.
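Very roughly, that merging step is what ASTImporter does. A sketch against a recent Clang (the exact signatures, e.g. the Expected<> return of Import, have changed between versions, so treat this as illustrative only):

```cpp
// Illustrative only: import the definition of a function from one TU's AST
// into another, which is the core of "load only the definitions we need and
// merge them into the original AST".
#include "clang/AST/ASTImporter.h"
#include "clang/AST/Decl.h"
#include "clang/Frontend/ASTUnit.h"
#include "clang/Tooling/Tooling.h"
#include "llvm/Support/Error.h"

using namespace clang;

void importDefinitionExample() {
  // Two tiny "translation units" built from strings for brevity.
  std::unique_ptr<ASTUnit> From =
      tooling::buildASTFromCode("int foo() { return 42; }", "from.cc");
  std::unique_ptr<ASTUnit> To =
      tooling::buildASTFromCode("int foo();", "to.cc");

  ASTImporter Importer(To->getASTContext(), To->getFileManager(),
                       From->getASTContext(), From->getFileManager(),
                       /*MinimalImport=*/false);

  // Find the definition of foo() in the source TU and import it on demand.
  TranslationUnitDecl *FromTU = From->getASTContext().getTranslationUnitDecl();
  for (Decl *D : FromTU->decls()) {
    auto *FD = dyn_cast<FunctionDecl>(D);
    if (!FD || !FD->hasBody())
      continue;
    llvm::Expected<Decl *> Imported = Importer.Import(FD);
    if (!Imported)
      llvm::consumeError(Imported.takeError());
  }
}
```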