Proposal: libclang code generation

I started a conversation in #llvm a few days ago about adding code
generation support to libclang (I'm indygreg on IRC). People seemed
receptive to the idea, so I did some preliminary work on a proposal for
the API. I've produced a couple of new header files [1][2] with the
proposed API. Higher-level comments and questions are in a README [3].

I'm requesting review of the proposal from the larger community. I'd
like to get "sign off" of the C API/.h files before I start working on
the implementation so I'm confident I'm producing the right thing.

There likely also needs to be a larger discussion around the structure
of libclang. The gory details are in the README [3]. The tl;dr is that
the existing API pigeonholes generic functionality as specific to the
Index module, and feature expansion of libclang will likely require a)
API breakage due to required refactoring, and/or b) an ill-conceived
code layout and naming convention.

This is my first time hacking on LLVM or Clang. I fully expect that
there are many deficiencies in my proposal due to ignorance. I have
thick skin, so let me hear it!

[1]
https://github.com/indygreg/clang/blob/libclang_compiler/include/clang-c/CodeGen.h
[2]
https://github.com/indygreg/clang/blob/libclang_compiler/include/clang-c/Compiler.h
[3]
https://github.com/indygreg/clang/blob/libclang_compiler/README.libclang_changes

Gregory Szorc
gregory.szorc@gmail.com

I don't think I understand why this needs so much new API. With libclang, you create a translation unit by providing a file (on disk or unsaved) and a set of compiler flags. At this point, the API already has everything that it needs to be able to compile the input file. I would expect this to be accomplished with just a clang_compileTranslationUnit() function (which, if the translation unit has not changed, just needs to do the codegen - it can skip reparsing), possibly with the option of changing the output from the location specified with the original compile options.

David
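
The single-entry-point idea above can be sketched in plain C. This is a self-contained mock, not libclang code: SketchTU, sketch_reparse(), sketch_output_from_args() and sketch_compile() are hypothetical names invented for illustration, and the "codegen" step is only a comment. It shows just the two behaviours described: reparsing is skipped when the input has not changed, and the output location defaults to the one in the original flags unless the caller overrides it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for CXTranslationUnit; NOT the libclang type. */
typedef struct {
    const char *source_path;   /* input file given at creation */
    const char *const *args;   /* compiler flags given at creation */
    int num_args;
    int dirty;                 /* nonzero if the input changed since the last parse */
    int parse_count;           /* how many times the mock parser has run */
} SketchTU;

/* Mock parse step; a real implementation would rebuild the AST here. */
static void sketch_reparse(SketchTU *tu) {
    tu->parse_count++;
    tu->dirty = 0;
}

/* Return the path that follows "-o" in the original flags, or NULL. */
static const char *sketch_output_from_args(const SketchTU *tu) {
    for (int i = 0; i + 1 < tu->num_args; i++) {
        if (strcmp(tu->args[i], "-o") == 0)
            return tu->args[i + 1];
    }
    return NULL;
}

/* Hypothetical analogue of the suggested clang_compileTranslationUnit().
 * Reparses only when needed, then returns the path the output would be
 * written to (NULL if neither an override nor "-o" is available). */
static const char *sketch_compile(SketchTU *tu, const char *output_override) {
    if (tu->dirty)
        sketch_reparse(tu);
    /* Code generation from the already-parsed unit would happen here. */
    return output_override ? output_override : sketch_output_from_args(tu);
}
```

Passing NULL as the override keeps the location from the original compile options; any other value replaces it for that call only.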

Chalk one up for my ignorance: I now see that the translation unit API
could be the front part of compilation.

That being said, I feel that having a higher-level compiler API that
bypasses (or at least abstracts) the translation unit bit would be more
user friendly (not to mention loosely coupled). For example, consumers
would only need to inspect the result of 1 "process input" function as
opposed to 2 (the diagnostics examination is the heavy bit). There are
also some nasty bits of the translation unit implementation that could
just be ignored (like various functions writing to stderr - something a
library should never do).

I realize that incrementally going from a translation unit to generated
code would be a requirement. In a separate compiler API, that could be
facilitated by clang_codegen_compileTranslationUnit() or
clang_codegen_createInputFromTranslationUnit(), which is then fed into
clang_codegen_generateOutput(). clang_compileTranslationUnit() would
also work if we're sticking to the monolithic module approach. But, I
would still like a higher-level API that could bypass translation units,
for reasons given above. clang_compileFromSourceFile() perhaps?

I'm lacking experience with the Clang project to really do much beyond
throw ideas around and see what sticks.

If nobody else weighs in, I'll rewrite the proposal to reflect David's
suggestions.

Greg

> Chalk one up for my ignorance: I now see that the translation unit API
> could be the front part of compilation.
>
> That being said, I feel that having a higher-level compiler API that
> bypasses (or at least abstracts) the translation unit bit would be more
> user friendly (not to mention loosely coupled). For example, consumers
> would only need to inspect the result of 1 "process input" function as
> opposed to 2 (the diagnostics examination is the heavy bit).

The small number of consumers that only care about going directly from source to generated code would get to skip the translation unit, but this API doesn't allow any flexibility for people who want to inspect the translation unit and later generate code from it.

Consumers should be inspecting diagnostic output after parsing and before generating code *anyway*, since codegen will unceremoniously fail if there are any errors in the AST. This is naturally a two-step process, and you aren't saving much at all by making it into one step.
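
A minimal sketch of that two-step flow, written as a self-contained mock rather than against libclang itself. In real code the diagnostics would come from clang_getNumDiagnostics(), clang_getDiagnostic() and clang_getDiagnosticSeverity(); here the SketchSeverity enum, SketchParseResult, sketch_has_errors() and sketch_generate() are hypothetical stand-ins, and the generation step is assumed to refuse any input that carries an error.

```c
#include <stddef.h>

/* Hypothetical severity levels, loosely modelled on CXDiagnosticSeverity. */
typedef enum { SK_NOTE, SK_WARNING, SK_ERROR, SK_FATAL } SketchSeverity;

/* Hypothetical result of the parse step: just a list of severities. */
typedef struct {
    const SketchSeverity *diags;
    size_t num_diags;
} SketchParseResult;

/* Step 1: inspect diagnostics. Returns 1 if any is an error or worse. */
static int sketch_has_errors(const SketchParseResult *r) {
    for (size_t i = 0; i < r->num_diags; i++) {
        if (r->diags[i] >= SK_ERROR)
            return 1;
    }
    return 0;
}

typedef enum { SK_GEN_OK, SK_GEN_REFUSED } SketchGenStatus;

/* Step 2: generate code. Assumption: generation is refused outright
 * when the parsed input contains errors. */
static SketchGenStatus sketch_generate(const SketchParseResult *r) {
    if (sketch_has_errors(r))
        return SK_GEN_REFUSED;
    /* Real code generation would happen here. */
    return SK_GEN_OK;
}
```

A caller that checks sketch_has_errors() first can report the diagnostics to the user instead of only learning that generation was refused.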

> There are
> also some nasty bits of the translation unit implementation that could
> just be ignored (like various functions writing to stderr - something a
> library should never do).

You can't actually avoid any of the details of the translation unit implementation because, as you've noted below, we need to have a function that takes a CXTranslationUnit and generates code from it.

Writing to stderr (when not under some kind of debugging setting) would be a bug in libclang regardless. What are you referring to, specifically?

> I realize that incrementally going from a translation unit to generated
> code would be a requirement. In a separate compiler API, that could be
> facilitated by clang_codegen_compileTranslationUnit() or
> clang_codegen_createInputFromTranslationUnit(), which is then fed into
> clang_codegen_generateOutput(). clang_compileTranslationUnit() would
> also work if we're sticking to the monolithic module approach.

I don't know what you mean by 'monolithic', but having this functionality is critical.

> But, I
> would still like a higher-level API that could bypass translation units,
> for reasons given above. clang_compileFromSourceFile() perhaps?

I'd rather not have this source -> generated code function at all. It's not sufficient for a significant number of use cases (which will need TU -> generated code), will be redundant with the more general TU -> generated code version, and saves a total of 1 line of code for those use cases that don't care about the TU.

> I'm lacking experience with the Clang project to really do much beyond
> throw ideas around and see what sticks.
>
> If nobody else weighs in, I'll rewrite the proposal to reflect David's
> suggestions.

I fully agree with David.

  - Doug

>> That being said, I feel that having a higher-level compiler API
>> that bypasses (or at least abstracts) the translation unit bit
>> would be more user friendly (not to mention loosely coupled). For
>> example, consumers would only need to inspect the result of 1
>> "process input" function as opposed to 2 (the diagnostics
>> examination is the heavy bit).

> The small number of consumers that only care about going directly
> from source to generated code

How can you assert knowledge of the number of consumers of a feature
that isn't available yet? I can think of plenty of use cases for people
wanting to go directly from source to generated code, which is why I
proposed a *supplemental* API that does just that.

> but this API doesn't allow any flexibility for people who want to
> inspect the translation unit and later generate code from it.

Correct. If they need to inspect the TU, they can always use the source
-> TU + TU -> generated code APIs instead of the higher-level source ->
generated code one.

> I'd rather not have this source -> generated code function at all.
> It's not sufficient for a significant number of use cases (which will
> need TU -> generated code),

No functionality will be precluded by the existence of this API. See above.

> will be redundant with the more general TU -> generated code version

Correct.

> and saves a total of 1 line of code for those use cases that don't care about the TU.

No, it saves more than 1 line, since, as you wrote:

> Consumers should be inspecting diagnostic output after parsing and
> before generating code *anyway*, since codegen will unceremoniously
> fail if there are any errors in the AST.

I argue that any assistance you can give in the form of
easier-to-consume APIs will be appreciated (especially in the land of C).

Unless my rebuttal has convinced anyone, I'll accept that there is no
interest in this higher-level source -> generated code API and remove it
from the proposal. I do feel that consumers will end up implementing
this utility API themselves, which would justify its existence in
libclang. Time will tell.

> Writing to stderr (when not under some kind of debugging setting)
> would be a bug in libclang regardless. What are you referring to,
> specifically?

I was initially referring to
CIndex.cpp:clang_parseTranslationUnit_Impl(). However, upon further
inspection it appears this is conditional on
CIndexer->getDisplayDiagnostics(), so my claim is wrong.

>> clang_compileTranslationUnit() would also work if we're sticking to
>> the monolithic module approach.

> I don't know what you mean by 'monolithic'

I'm trying to ascertain what the vision for the composition of libclang
is in terms of naming and "modularity." Currently, everything is defined
in one "monolithic" header (Index.h) and exists in a single C naming
"namespace" (clang_doSomething). As the scope of libclang is expanded,
there will be naming issues unless a more formal "module" convention is
established. See
https://github.com/indygreg/clang/blob/libclang_compiler/README.libclang_changes#L104
for more details.

My (yet unanswered) question is whether things should continue to be
stapled onto Index.h with the existing naming conventions, or whether we
should use this new functionality as an opportunity to start factoring
things into modules that are more clearly denoted and loosely coupled,
e.g. clang_codegeneration_generateFromTranslationUnit() as opposed to
clang_generateCode() or clang_generateCodeFromTranslationUnit().

If such a module convention were adopted, I /think/ the code generation
features could be implemented without breaking ABI compatibility. But
the result would certainly look inconsistent from a style perspective,
hence my open question. I feel you'll probably say "just use the
existing convention." But I had to ask, because I feel that things will
eventually blow up, and I don't like compounding an existing problem if
it can be avoided, especially when it comes to API design.

Greg