API generation

Hi,

I work on WebAssembly, and I was hoping we would eventually use LLVM libc for an end-to-end Wasm toolchain. I have some questions about the "ground truth" approach to the libc API. I am sorry if these have already been asked; I could not find the answers in the mailing list messages and code reviews.

http://lists.llvm.org/pipermail/libc-dev/2019-October/000003.html

http://lists.llvm.org/pipermail/libc-dev/2019-October/000009.html

I was wondering what API generation buys developers and users. Maybe the question is how previous implementations of libc got away without generating headers, but also whether API generation is a reasonable and foolproof solution.

Most importantly, the motivation seems to be that there are a few potential standards a libc implementation needs to comply with. But how many substantially different APIs are there realistically? If it is in lower single digits, does this really make it worth the effort?

Secondly, the libc API is not only types and function prototypes; it typically also depends on "feature test macros". I am not sure it is possible to gracefully support those in a generated API. Encoding test macros in the API "ground truth" rules would make the rules as complex as the C macro code they are trying to replace. Leaving test macros up to the C header files would result in a mix of preprocessor and rule logic, which would probably be more confusing than going all the way in either direction (preprocessor or generation).

Finally, a somewhat rhetorical point on precedent and expertise. There is enough precedent for a portable libc API written directly; likewise, C/C++ developers can understand and modify C headers without ramp-up, and I am not sure that can be said about TableGen. Writing header files is a relatively simple part of the development process and there is a lot of it happening inside and outside of LLVM.

Best,

Petr

Hi Petr,

As I understand it, the WASI interface is now very close to CloudABI, which is one of the use cases I was interested in. There are three slightly conflated goals for the header generation that, I think, are going to need deconflating in the future:

  - Being able to support different sets of standards (e.g. pure C11, POSIX, POSIX + GNU extensions + BSD extensions) so that a compilation unit can opt into only a subset of the required things.

  - Being able to support different sets of standards so that an implementation can ship a useful standards-compliant subset (e.g. just C11 on a non-POSIX platform).

  - Being able to support different subsets for builds with different sets of available target abstractions.

For legacy compatibility, the current WASI libc supports libpreopen, but it's often easier to support a Capsicum environment with a more CloudABI-like interface that disallows all of the explicit global namespace operations. For WASI / Capsicum deployments, I would like to be able to build a version of libc that exposes only CloudABI-like symbols, so I get linker failures (that I can then fix) when I use something that relies on access to the global namespace.

The first and second of these look very similar, but the second and third are the ones that share useful tooling. Existing libc implementations support the first of these with a load of macros to conditionally expose things. These are annoying to maintain (particularly if, for example, a BSD extension is later standardised in POSIX: you then need to rework the logic in the headers for exposing them). Ideally, we'd just add POSIX20 or whatever to the list of standards and let the tool deal with it. For the first use case, I think we will still end up needing conditional exposure via macros, but that's easier to machine generate than to write by hand.
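
To make the maintenance argument concrete, here is a minimal Python sketch (not the actual header generator) that derives each declaration's guard from a list of standards. The tables `API` and `GUARDS` and the functions `guard_for` and `emit` are hypothetical names, and the assignment of functions to standards is purely illustrative. `_DEFAULT_SOURCE` and `_POSIX_C_SOURCE` are real feature test macros, but the values used here are only examples.

```python
# Hypothetical table: declaration -> standards that expose it.
API = {
    "char *strdup(const char *s);": ["BSD"],
    "size_t strlen(const char *s);": ["C11"],
}

# Hypothetical mapping from a standard to the preprocessor test that
# enables it. An empty test means "always visible".
GUARDS = {
    "C11": "",
    "BSD": "defined(_DEFAULT_SOURCE)",
    "POSIX": "_POSIX_C_SOURCE >= 200809L",
}

def guard_for(standards):
    """OR together the tests of every standard that exposes a declaration."""
    tests = [GUARDS[s] for s in standards]
    if "" in tests:
        return ""  # exposed unconditionally by at least one standard
    return " || ".join(f"({t})" for t in tests)

def emit(api):
    """Render the declarations, wrapping each one in its computed guard."""
    lines = []
    for decl, standards in api.items():
        guard = guard_for(standards)
        if guard:
            lines += [f"#if {guard}", decl, "#endif"]
        else:
            lines.append(decl)
    return "\n".join(lines)

print(emit(API))
```

In this model, an extension being adopted by another standard is a one-word data change (appending "POSIX" to the list for that declaration); the guard expression is regenerated rather than edited by hand.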

For the second and third use cases, the goal in both cases is to make subsetting easier. We could later extend this with some static analysis plugins that check for isolation (e.g. C11 can't depend on POSIX, Capsicum-safe functions can't depend on non-Capsicum-safe functions).
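
As a rough illustration of what such an isolation check could look like, the Python sketch below flags any function whose implementation calls into a layer it is not allowed to depend on. The tables `LAYER`, `CALLS` and `ALLOWED` and the function `isolation_violations` are hypothetical, and the call data is made up for the example; a real plugin would extract it from the implementation.

```python
# Hypothetical metadata: the layer each function belongs to, and the
# functions its implementation calls directly.
LAYER = {"memcpy": "C11", "strlen": "C11", "open": "POSIX", "fopen": "C11"}
CALLS = {"memcpy": [], "strlen": [], "open": [], "fopen": ["open", "strlen"]}

# Layers that each layer is allowed to depend on.
ALLOWED = {"C11": {"C11"}, "POSIX": {"C11", "POSIX"}}

def isolation_violations(layer, calls, allowed):
    """Return (caller, callee) pairs that cross a forbidden layer boundary."""
    violations = []
    for caller, callees in calls.items():
        for callee in callees:
            if layer[callee] not in allowed[layer[caller]]:
                violations.append((caller, callee))
    return violations

for caller, callee in isolation_violations(LAYER, CALLS, ALLOWED):
    print(f"{caller} ({LAYER[caller]}) must not call {callee} ({LAYER[callee]})")
```

The same check would cover the Capsicum case by treating "Capsicum-safe" as one more layer in `ALLOWED`.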

The final benefit that we haven't really explored yet for header generation is supporting different compiler annotations for API contracts that are not expressible in standard C. For example, the Windows headers use SAL annotations to define in / out parameters, the size of buffers, and so on. There are GNU extensions for some of these, but they often go in different places (e.g. as function attributes with parameters that index a specific function parameter versus parameter attributes). If we encode the high-level contracts in the TableGen, then we should be able to generate MS C and GNU C variants of the same set of interfaces.
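
A minimal Python sketch of that idea follows: one high-level contract, two renderings. The contract format (`MEMCPY`, with `dir` and `size` keys) and the functions `spell`, `emit_sal` and `emit_gnu` are assumptions made up for this example. The SAL annotations `_In_reads_bytes_` and `_Out_writes_bytes_` and GCC's `access` function attribute (GCC 10 and later) do exist, but whether they are the right choice for a real header is not something this sketch settles.

```python
# Hypothetical high-level contract: each buffer parameter records its
# direction and the name of the parameter holding its size in bytes.
MEMCPY = {
    "ret": "void *",
    "name": "memcpy",
    "params": [
        {"type": "void *", "name": "dest", "dir": "out", "size": "n"},
        {"type": "const void *", "name": "src", "dir": "in", "size": "n"},
        {"type": "size_t", "name": "n"},
    ],
}

SAL = {"in": "_In_reads_bytes_", "out": "_Out_writes_bytes_"}
GNU = {"in": "read_only", "out": "write_only"}

def spell(ctype, name):
    # Attach the name directly to a trailing '*', as C style usually does.
    return f"{ctype}{name}" if ctype.endswith("*") else f"{ctype} {name}"

def emit_sal(fn):
    """MS style: the annotation is attached to the parameter itself."""
    params = []
    for p in fn["params"]:
        decl = spell(p["type"], p["name"])
        if "dir" in p:
            decl = f"{SAL[p['dir']]}({p['size']}) {decl}"
        params.append(decl)
    return f"{spell(fn['ret'], fn['name'])}({', '.join(params)});"

def emit_gnu(fn):
    """GNU style: a function attribute refers to parameters by 1-based index."""
    index = {p["name"]: i for i, p in enumerate(fn["params"], start=1)}
    params = [spell(p["type"], p["name"]) for p in fn["params"]]
    attrs = [
        f"access({GNU[p['dir']]}, {index[p['name']]}, {index[p['size']]})"
        for p in fn["params"] if "dir" in p
    ]
    proto = f"{spell(fn['ret'], fn['name'])}({', '.join(params)})"
    return f"{proto} __attribute__(({', '.join(attrs)}));"

print(emit_sal(MEMCPY))
print(emit_gnu(MEMCPY))
```

The two emitters differ exactly in the way described above: one decorates parameters in place, the other builds an index-based attribute list, and neither form has to be written by hand.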

The TableGen format lets us put a lot more metadata on the functions and definitions than we would necessarily want to end up in any given build of the headers.

I agree that we are going to end up with TableGen files that are quite complex, but I believe that we should end up with a cleaner separation of concerns. I have worked on a libc that did this manually, and refactoring any of the macro code is very painful because it is all very order-dependent and changes have non-local effects. In the TableGen world, the back end will parse all of the definitions, build the dependency graph, and then generate the macros. A change that requires reworking macros across half a dozen files is not a problem in this context.
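
The "parse everything, build the graph, then emit" shape can be sketched in a few lines of Python. This is only an illustration under assumed names (`DEFS`, `emit_in_dependency_order`); the definitions are placeholders, not what any real libc header contains, and the real back end is a TableGen back end rather than a script like this. It uses `graphlib.TopologicalSorter` from the standard library (Python 3.9 and later).

```python
from graphlib import TopologicalSorter

# Hypothetical parsed definitions: name -> (C text, names it depends on).
# The typedef bodies are placeholders for illustration only.
DEFS = {
    "fread": ("size_t fread(void *p, size_t s, size_t n, FILE *f);",
              ["size_t", "FILE"]),
    "FILE": ("typedef struct __file FILE;", []),
    "size_t": ("typedef unsigned long size_t;", []),
}

def emit_in_dependency_order(defs):
    """Emit every definition after the definitions it depends on."""
    graph = {name: set(deps) for name, (_, deps) in defs.items()}
    order = TopologicalSorter(graph).static_order()
    return [defs[name][0] for name in order]

print("\n".join(emit_in_dependency_order(DEFS)))
```

Because the order is computed from the graph, the input can list definitions in any order, which is the property that hand-written, order-dependent macro code lacks.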

David

Hi Petr,

Thanks a lot for your questions. David has already provided very good
answers. I have added my views and answers inline.

> I work on WebAssembly, and I was hoping we would eventually use LLVM
> libc for an end-to-end Wasm toolchain. I have some questions about the
> "ground truth" approach to the libc API. I am sorry if these have
> already been asked; I could not find the answers in the mailing list
> messages and code reviews.
>
> http://lists.llvm.org/pipermail/libc-dev/2019-October/000003.html
>
> http://lists.llvm.org/pipermail/libc-dev/2019-October/000009.html

I now have a patch out for review: https://reviews.llvm.org/D70197

The patch shows the up to date header generation scheme.

> I was wondering what API generation buys developers and users.

For users, the benefit is probably minimal: the header files they
include will have much less macro and #ifdef clutter.

Developers can be of various kinds, so let me try to list the benefits
for two kinds of developers I can think of:
1. For developers working on LLVM-libc: a clear-cut separation of
standards, platform configs and implementation makes adding a new API,
implementation or platform config (like the "config/linux/api.td"
file in the above patch) a straightforward task.
2. For developers putting together a libc for their platform: Instead
of adding inclusion and exclusion macros to header files, they merely
write a config for their platform, like the "config/linux/api.td" file
in the above patch.
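
To illustrate the second point, here is a minimal Python sketch of a platform config being resolved against per-standard spec tables. It does not reflect the format of the files in the patch; `SPECS`, `SANDBOX_CONFIG` and `resolve` are hypothetical names, and the contents are illustrative.

```python
# Hypothetical spec tables: standard -> header -> function names.
SPECS = {
    "stdc": {"string.h": ["memcpy", "strlen"], "stdio.h": ["fopen", "printf"]},
    "posix": {"unistd.h": ["read", "write"], "string.h": ["strdup"]},
}

# Hypothetical platform config: the only file a platform owner edits.
SANDBOX_CONFIG = {
    "string.h": ["memcpy", "strlen", "strdup"],
    "unistd.h": ["read", "write"],
}

def resolve(config, specs):
    """Return header -> [(function, standard)], rejecting unknown names."""
    known = {}
    for standard, headers in specs.items():
        for header, functions in headers.items():
            for fn in functions:
                known[(header, fn)] = standard
    resolved = {}
    for header, functions in config.items():
        for fn in functions:
            if (header, fn) not in known:
                raise KeyError(f"{fn} in {header} is not in any spec")
        resolved[header] = [(fn, known[(header, fn)]) for fn in functions]
    return resolved

for header, entries in resolve(SANDBOX_CONFIG, SPECS).items():
    print(header, entries)
```

Here the platform simply omits stdio.h; nothing in the spec tables changes, and no exclusion macro is needed.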

> Maybe the question is how previous implementations of libc got away
> without generating headers, but also whether API generation is a
> reasonable and foolproof solution.

I have had this question myself and tried asking around to get
answers. Unfortunately, I did not get a good answer yet.

> Most importantly, the motivation seems to be that there are a few
> potential standards a libc implementation needs to comply with. But how
> many substantially different APIs are there realistically? If it is in
> lower single digits, does this really make it worth the effort?

Yes, I agree that the number of standards we have to support will be
fairly small. But, there will be a much larger number of configs that
we will have to cater to. That is, there will be a large number of
platforms which will want to pick and choose from the small number of
standards we support. Header generation makes this possible without
using hard to debug/maintain #ifdefs in the header files. Note that I
used the word "cater" and not "support" because I do not think we want
to support all of the configs upstream. A lot of these configs will be
maintained downstream and the proposed header generation scheme makes
it straightforward to maintain them downstream.

> Secondly, the libc API is not only types and function prototypes; it
> typically also depends on "feature test macros". I am not sure it is
> possible to gracefully support those in a generated API. Encoding test
> macros in the API "ground truth" rules would make the rules as complex
> as the C macro code they are trying to replace. Leaving test macros up
> to the C header files would result in a mix of preprocessor and rule
> logic, which would probably be more confusing than going all the way in
> either direction (preprocessor or generation).

If you look at the patch I have pointed you to, the ground truth files
are devoid of any test macros. They are merely a listing of what the
standards prescribe. Also, as David pointed out, there is scope to
extend them with platform independent annotations. We do not yet know
what these annotations are going to look like, but the current set up
keeps that door open if and when we are ready for them.

About feature test macros, the current setup (in the above patch) does
not include them. However, I am of the opinion that we can certainly
include them. For example, if the ground truth file for a standard (in
the "spec" directory) can specify the feature macro for that standard,
then we can enclose the generated API from that standard within the
corresponding feature macro. This would work best if we have some
baseline over which extensions are enabled optionally based on the
corresponding feature macro being defined.
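
A minimal Python sketch of that baseline-plus-extensions scheme is shown below. The `feature_macro` field, the `STRING_H` table and the `render` function are assumptions for illustration and are not part of the patch; `_GNU_SOURCE` is the real macro that enables GNU extensions such as memrchr.

```python
# Hypothetical per-standard spec entries for one header. A feature_macro
# of None marks the baseline that is always exposed.
STRING_H = [
    {"standard": "stdc", "feature_macro": None,
     "decls": ["size_t strlen(const char *s);"]},
    {"standard": "gnu", "feature_macro": "_GNU_SOURCE",
     "decls": ["void *memrchr(const void *s, int c, size_t n);"]},
]

def render(sections):
    """Emit baseline declarations bare and wrap each extension in its macro."""
    out = []
    for section in sections:
        macro = section["feature_macro"]
        if macro is None:
            out += section["decls"]
        else:
            out += [f"#ifdef {macro}", *section["decls"], f"#endif /* {macro} */"]
    return "\n".join(out)

print(render(STRING_H))
```

The ground truth entries themselves stay free of preprocessor logic; the only macro knowledge is one field per standard.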

> Finally, a somewhat rhetorical point on precedent and expertise. There
> is enough precedent for a portable libc API written directly; likewise,
> C/C++ developers can understand and modify C headers without ramp-up,
> and I am not sure that can be said about TableGen. Writing header files
> is a relatively simple part of the development process and there is a
> lot of it happening inside and outside of LLVM.

True that TableGen is an unfamiliar format. But if you consider glibc
for example, sure you can edit the header files directly to add/change
the API. But, you will have to then add a conformance test in a data
file which is not a normal C header file. On the other hand, to
add/change the API in LLVM libc, one only needs to edit the
corresponding ground truth file written in TableGen format; the header
file is tool generated. So, I think that the developer burden in
LLVM-libc is actually much less: 1. One does not have to worry about
editing header files and tripping over the macros and #ifdefs in them,
2. It eliminates the need for conformance tests.

Thanks,
Siva Chandra

Hi David,

Thanks for the answers. I am going to send a separate reply to Siva's message about API generation. There was a conversation about upstreaming WASI during a Wasm CG meeting, and I thought that with this effort there would eventually be all the pieces for an end-to-end Wasm toolchain (not only WASI-based, but JS as well) in LLVM.

+ Dan Gohman, in case he has any thoughts on this, as he maintains the current WASI implementation

Best,

Petr

Hi Siva,

Thank you for the answers. I think this is a valuable effort in and of itself, even as an attempt. My only concern is generation becoming really complicated when feature tests and targets get added to the mix. For example, intersecting a few standards with a few features and a few targets can be non-trivial (if that is the intended level of flexibility). The hope would be that the generation model will be able to resolve that; at least it does not seem that the code used to generate headers is anywhere near the amount of code inside existing libc include, which is a good thing.

I have maintained runtime libraries in the past, though not libc, and I don't have a strong preference between hand-edited headers and TableGen. If HdrGen works for libc, it could potentially be extended to other uses, for example maintaining consistency between the implementation of intrinsics and their use.

Best,

Petr

I want to highlight this point. Most libc implementations are quite closely tied to a particular platform. Those that are less OS-specific (glibc and musl come to mind) ship with a lot of their own scaffolding and are difficult to subset.

LLVM as a whole aims to provide a kit of parts for assembling a toolchain. Clang is expected to work with different C++ standard library implementations, libc++ is expected to work with different C and C++ runtime libraries and different C standard implementations. The philosophy behind the LLVM libc is similar: some people will use a fairly stock configuration, just as some people use clang + libc++ + libc++abi + compiler-rt, others will want to use just a subset.

Embedded platforms, minimal-TCB environments and sandboxed environments all want to be able to ship the smallest subset of libc that enables their specific workload. This will rarely be the same small subset.

We need to be able to both carefully layer the internal implementation to avoid cross-dependencies (try compiling libc++ without iostream to see how difficult this is if you try to add it later) and define subsets of the public ABI. The goal for the header generation is to automate the second part and help with tooling for checking the first.

David