[RFC] Handling implementation limits

This RFC asks the community for its interest in centralizing Clang's
implementation limits and how you feel about the proposed approach.

Thank you for looking into this, I think something along these lines
could be valuable both for internal development purposes as well as
for our users. I'm generally supportive of the idea.

Abstract

At the moment the implementation limits of the Clang compiler are hidden
in magic numbers in the code. These numbers sometimes differ between the
frontend and the backend. They are not always properly guarded by compiler
warnings; instead, they trigger assertion failures.

This proposal suggests a TableGen-based solution to better maintain these
limits. This solution will be able to generate a header and documentation
of these limits. The latter is required by the C++ standard [implimits].

It's also required by the C standard, which has Implementation Limits
clauses that detail situations where the implementation may place limits
on constructs.

The problem

The proposal tries to solve two issues:
* Implementation limits are not always clear and are duplicated in the source.
* Documenting C++ implementation quantities is required by the C++ standard
  [implimits].

Do you anticipate trying to cover the C standard implementation limits
as well? I think it would make sense to do so -- there should be a large
amount of overlap between the C limits and the C++ limits.

~Aaron

Thanks for the feedback.

This RFC asks the community for its interest in centralizing Clang’s
implementation limits and how you feel about the proposed approach.

Abstract

At the moment the implementation limits of the Clang compiler are hidden
in magic numbers in the code. These numbers sometimes differ between the
frontend and the backend. They are not always properly guarded by compiler
warnings; instead, they trigger assertion failures.

*Technically*, crashing is still a valid way of indicating non-acceptance,
although obviously I agree that we should diagnose these things properly.
(They can’t just be warnings, though.)

This proposal suggests a TableGen-based solution to better maintain these
limits. This solution will be able to generate a header and documentation
of these limits. The latter is required by the C++ standard [implimits].

I think this is a great approach.

The problem

The proposal tries to solve two issues:

  • Implementation limits are not always clear and are duplicated in the source.
  • Documenting C++ implementation quantities is required by the C++ standard
    [implimits].

I suspect that the biggest part of this project by far will be testing
these implementation limits and then figuring out all the places that they
fall over.

Unclear limits

The compiler has several limitations often ‘hidden’ as the width of a
bit-field in a structure. This makes these limits hard to find, not only
for our users but also for developers. While looking for a proper limit
for the maximum width of a bit-field in the frontend, I discovered that it
didn’t matter what the frontend picked; the backend already had a limit
[D71142]. To fix this issue, the frontend and backend should have the same
limit. To avoid duplicating the value, it should be in a header available
to both the frontend and backend.

FWIW, we don’t generally refer to IRGen as the “backend”.

In many cases, the right code change will probably be to introduce a
static assertion linking the implementation limit to some value in code,
rather than using the limit directly. For example, many limits will
be absolute numbers, and it is probably better to
`static_assert(IQ_MaxWidgets < (1ULL << SomeBitFieldWidth))` than to
try to make that bit-field have the exact right width. This is also
a pattern that works even if the limit is stored in a normal field
of type (say) `unsigned`; such places should also have a `static_assert`
in order to remind the reader/maintainer that there’s an implementation
limit affected.
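
For concreteness, a minimal sketch of that pattern (the limit, the width, and
the record below are hypothetical placeholders, not existing Clang
identifiers):

```cpp
#include <cstdint>

// Hypothetical limit, e.g. emitted into a generated header.
constexpr std::uint64_t IQ_MaxWidgets = 1u << 20;
// Width actually used by some record in the implementation.
constexpr unsigned SomeBitFieldWidth = 24;

struct WidgetRecord {
  unsigned Count : SomeBitFieldWidth; // must be able to hold any legal value
};

// Tie the documented limit to the storage that has to represent it.
static_assert(IQ_MaxWidgets < (1ULL << SomeBitFieldWidth),
              "IQ_MaxWidgets no longer fits in WidgetRecord::Count");
```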

John.

    This RFC asks the community for its interest in centralizing Clang's
    implementation limits and how you feel about the proposed approach.

    Abstract
    ========

    At the moment the implementation limits of the Clang compiler are hidden
    in magic numbers in the code. These numbers sometimes differ between the
    frontend and the backend. They are not always properly guarded by compiler
    warnings; instead, they trigger assertion failures.

> *Technically*, crashing is still a valid way of indicating non-acceptance,
> although obviously I agree that we should diagnose these things properly.
> (They can’t just be warnings, though.)

    This proposal suggests a TableGen-based solution to better maintain these
    limits. This solution will be able to generate a header and documentation
    of these limits. The latter is required by the C++ standard [implimits].

> I think this is a great approach.

Strong +1 here. I agree that all of these limits should be centralized somewhere
and documented for end-users. MSVC has a nice page with some of the limits
(Compiler Limits | Microsoft Learn). As a bonus
it will also help with fuzzing.

    The problem
    ===========

    The proposal tries to solve two issues:
    * Implementation limits are not always clear and are duplicated in the source.
    * Documenting C++ implementation quantities is required by the C++ standard
      [implimits].

> I suspect that the biggest part of this project by far will be testing
> these implementation limits and then figuring out all the places that they
> fall over.

I think that it would make sense to group all of these tests in a new sub-folder,
say "tests/Implimits" for example, because they are not specific to Sema or the parser
but actually test that a given limit is supported end-to-end.

Bruno

This is a very good idea as documenting these limits will be very helpful for anyone who is using Clang within a MISRA coding environment, where these limits need to be known when configuring static analysis tools.

Chris Tapp, MISRA C++ Chairman

While I definitely want us to try to document our implimits, I *will* say that we want to make sure we’re not publishing inaccurate numbers. Some of the limits are much harder than just testing. For example, template depth *can* be limited by -ftemplate-depth; however, we end up running out of stack space or memory with some fairly simple templates that way, even before 1024.

If we publish these, I’d be wary that it would become expected that we could do template depth of 1024 (rather than that being a theoretical limit).

One other concern I have is a more specific one, BitFieldBits, which limits the size by setting a bitfield size itself (in CGRecordLayout). The current size seems to have been chosen intentionally so that CGBitFieldInfo has no padding. Any time we create one of these we need to make sure that we document the reason for the size, so that we have an understanding of why it was set that size.

Finally, I believe we should diagnose all of the implimits that we come across as well. BitFieldBits is a very important one to do semantic analysis diagnosis on, particularly if we’re going to document it.

> While I definitely want us to try to document our implimits, I *will* say that we want to make sure we’re not publishing inaccurate numbers. Some of the limits are much harder than just testing. For example, template depth *can* be limited by -ftemplate-depth; however, we end up running out of stack space or memory with some fairly simple templates that way, even before 1024.
>
> If we publish these, I’d be wary that it would become expected that we could do template depth of 1024 (rather than that being a theoretical limit).

I agree with this, but I also feel like this is not a huge problem in practice. Everything that the compiler does is subject to the availability of resources (and other permissions/capabilities). The limits being discussed impose additional restrictions, and so limit the capabilities of the compiler even if resources are available; so long as we provide a general disclaimer, I think that we can proceed without considering resource constraints specifically.

As far as rationale documentation goes, however, it would certainly make sense to note which limits are "practically unlimited", in the sense that you're likely to run out of resources far before you hit that particular threshold.

-Hal

> At the moment the implementation limits of the Clang compiler are hidden
> in magic numbers in the code. These numbers sometimes differ between the
> frontend and the backend. They are not always properly guarded by compiler
> warnings; instead, they trigger assertion failures.

> *Technically*, crashing is still a valid way of indicating non-acceptance,
> although obviously I agree that we should diagnose these things properly.
> (They can’t just be warnings, though.)

Agreed, I should have used the word diagnostic instead. If the limit is
exceeded, it will give an error.

> The compiler has several limitations often 'hidden' as the width of a
> bit-field in a structure. This makes these limits hard to find, not only
> for our users but also for developers. While looking for a proper limit
> for the maximum width of a bit-field in the frontend, I discovered that it
> didn't matter what the frontend picked; the backend already had a limit
> [D71142]. To fix this issue, the frontend and backend should have the same
> limit. To avoid duplicating the value, it should be in a header available
> to both the frontend and backend.

> FWIW, we don’t generally refer to IRGen as the “backend”.
>
> In many cases, the right code change will probably be to introduce a
> static assertion linking the implementation limit to some value in code,
> rather than using the limit directly. For example, many limits will
> be absolute numbers, and it is probably better to
> `static_assert(IQ_MaxWidgets < (1ULL << SomeBitFieldWidth))` than to
> try to make that bit-field have the exact right width. This is also
> a pattern that works even if the limit is stored in a normal field
> of type (say) `unsigned`; such places should also have a `static_assert`
> in order to remind the reader/maintainer that there’s an implementation
> limit affected.

If the field SomeBitField is limited by a bit-field width, it will also
create the constant SomeBitFieldMax, which contains the maximum value.
That would make it easier to use the maximum directly.
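
As a rough illustration of what that could look like (the names and the exact
shape of the generated header are assumptions for the example, not what the
proposed generator actually emits):

```cpp
// Hypothetical generated output for a width-limited field.
constexpr unsigned SomeBitFieldWidth = 24;
constexpr unsigned long long SomeBitFieldMax = (1ULL << SomeBitFieldWidth) - 1;

// Callers can then compare against the maximum directly.
bool exceedsBitFieldLimit(unsigned long long RequestedWidth) {
  return RequestedWidth > SomeBitFieldMax;
}
```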

-Mark

> The proposal tries to solve two issues:
> * Implementation limits are not always clear and are duplicated in the source.
> * Documenting C++ implementation quantities is required by the C++ standard
> [implimits].
>
> I suspect that the biggest part of this project by far will be testing
> these implementation limits and then figuring out all the places that they
> fall over.

I also expect this will be the largest part of the project. But once the
basics are available we can do this part one step at a time.

> I think that it would make sense to group all of these tests in a new sub-folder,
> say "tests/Implimits" for example, because they are not specific to Sema or the parser
> but actually test that a given limit is supported end-to-end.

Agreed, the end-to-end testing is required. I like the idea of putting
them in their own test directory.

-Mark

> While I definitely want us to try to document our implimits, I *will*
> say that we want to make sure we’re not publishing inaccurate
> numbers. Some of the limits are much harder than just testing. For
> example, template depth *can* be limited by -ftemplate-depth; however,
> we end up running out of stack space or memory with some fairly
> simple templates that way, even before 1024.

> If we publish these, I’d be wary that it would become expected that
> we could do template depth of 1024 (rather than that being a
> theoretical limit).

> I agree with this, but I also feel like this is not a huge problem in
> practice. Everything that the compiler does is subject to the
> availability of resources (and other permissions/capabilities). The
> limits being discussed impose additional restrictions, and so limit
> the capabilities of the compiler even if resources are available; so
> long as we provide a general disclaimer, I think that we can proceed
> without considering resource constraints specifically.

I think it's a good idea to add a general disclaimer that these limits
are the compiler's own limits, separate from resource constraints.

We should indeed only document limits that are accurate. In some cases
these limits are not yet known to us. That isn't ideal, but sometimes
it will be the truth. Hopefully, over time, we can fill in these blanks.

> One other concern I have is a more specific one, BitFieldBits, which
> limits the size by setting a bitfield size itself (in
> CGRecordLayout). The current size seems to have been chosen
> intentionally so that CGBitFieldInfo has no padding. Any time we
> create one of these we need to make sure that we document the reason
> for the size, so that we have an understanding of why it was set
> that size.

> Finally, I believe we should diagnose all of the implimits that
> we come across as well. BitFieldBits is a very important one to do
> semantic analysis diagnosis on, particularly if we’re going to
> document it.

> As far as rationale documentation goes, however, it would certainly
> make sense to note which limits are "practically unlimited", in the
> sense that you're likely to run out of resources far before you hit
> that particular threshold.

I like the suggestion to add a rationale for why a limit was chosen. I'm
not sure we can recover the rationale for all of our historical choices,
but it would be good to add one for future choices. (I indeed suspect that
the number of bits available to fill a bit-field out to x bytes is often
the rationale.)

We indeed should diagnose these limits; that's one of the reasons to
look into this task. I noticed several limits being 'diagnosed' by an
assertion, which is less than ideal.

-Mark

You could do that unconditionally by just defining a width that’s wide
enough to store any value up to the limit.
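
A small sketch of that approach, assuming C++20 and purely hypothetical names:
derive the width from the limit, so the bit-field can never be too narrow.

```cpp
#include <bit>

constexpr unsigned IQ_MaxWidgets = 1'000'000;                  // hypothetical limit
constexpr int WidgetCountBits = std::bit_width(IQ_MaxWidgets); // 20 bits here

struct WidgetRecord {
  unsigned Count : WidgetCountBits; // always wide enough for the limit
};

static_assert(IQ_MaxWidgets <= (1ULL << WidgetCountBits) - 1);
```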

John.

Thanks for all the feedback!

Based on the feedback given, I've updated the proof-of-concept [D72053].

Added support for some additional features:
* Allow limits from the C standard, either sharing the C++ constants or
  as separate ones.
* Allow documenting the design choices for a limit.
* Allow tracking the status of a limit. (This may become obsolete once all
  limits are implemented, but that can take a while.)

Added some additional fields to test the C support.

Removed the .inc file from the patch; it should not be committed.
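
Purely as an illustration of how the rationale and status fields might surface
in the generated header (the name, values, and comment format here are invented
for the example, not the actual output of [D72053]):

```cpp
// Nesting levels of parenthesized expressions within a full-expression.
//   Rationale: matches the default of -fbracket-depth.
//   Status:    diagnosed; documented.
//   The C and C++ limits share this constant.
constexpr unsigned IL_BracketDepth = 256;
```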

Does Objective-C also have limits which should be added?

[D72053] https://reviews.llvm.org/D72053

-Mark

Do you plan to also support things that aren’t implementation limits but give warnings under -Wpedantic? For example, the length of a string literal in LiteralSupport.cpp: it checks for something like 509 for C, 4095 for C99, or 65536 for C++.

These are part of the limits of the standards; otherwise, they can be
added to the non-standard limits.

I added a feature to the proof-of-concept [D72053]. This makes it
possible to add remarks to a limit. For 'Characters in a string literal
(after concatenation)' I added a remark regarding -Wpedantic and allowed
three different limits, depending on whether C89, C99, or C++ is used.
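
For reference, a sketch of how the three values could be selected by language
mode (the enum and helper are hypothetical; the values are the ones quoted
above):

```cpp
enum class LangStd { C89, C99, Cxx };

// Hypothetical helper returning the documented limit for
// 'Characters in a string literal (after concatenation)'.
constexpr unsigned maxStringLiteralChars(LangStd S) {
  switch (S) {
  case LangStd::C89: return 509;
  case LangStd::C99: return 4095;
  case LangStd::Cxx: return 65536;
  }
  return 0; // unreachable
}

static_assert(maxStringLiteralChars(LangStd::C99) == 4095);
```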

[D72053] https://reviews.llvm.org/D72053

-Mark

I'm not sure why we're generating warnings under -Wpedantic in this case, actually. Neither C nor C++ actually forbids string literals longer than the limit; they just don't require implementations to accept them. -Wpedantic is not supposed to contain portability warnings.

-Eli

Maybe it's -ansi and not -Wpedantic. I admit I didn’t check exactly; I just knew there was some warning.

It *is* -Wpedantic (after talking to you directly: -ansi was forcing non-C99 mode), but it isn’t included in there explicitly. There is logic in Clang's TableGen that puts a diagnostic in -Wpedantic based on some rules (including off-by-default, isExtension, etc.).