Problem statement
In 2023, we proposed the -fbounds-safety extension to Clang, which allows the compiler to automatically introduce reliable bounds checks on pointer accesses in C [1]. The extension adds several bounds annotations (__counted_by, __sized_by, __ended_by, etc.) that programmers can add to pointer types. These annotations are mainly used on struct members, where they typically refer to another member of the same struct that semantically stores the size of the array the pointer points to, and on parameters, where they typically refer to another parameter that stores the size in a similar manner. Unfortunately, the C language doesn’t currently have any feature, such as methods for structs, that requires a struct member to be referenced within the definition of the struct, so the language simply provides no way to do it. -fbounds-safety therefore needed to invent a rule.
The target audience of this document is compiler implementers. This document provides the rationale for the current name lookup behavior of -fbounds-safety for bounds annotations in structs, in response to comments from the Clang and GCC communities. We also suggest diagnostics to mitigate potential ambiguity, and propose new builtins that can be used as a suppression and disambiguation mechanism.
Hereafter, “the current rule/behavior” refers to the rule proposed in the original RFC [1], which is currently implemented in Apple’s fork of llvm-project and is being upstreamed to mainline.
The current name lookup rule in bounds annotations
Within bounds annotations, the compiler always looks for a member of the same struct first, regardless of the declaration order. When a matching member is found, the name is treated as a reference to that member of the current struct instance. If no member has the searched name, the compiler looks in the outer scopes. Since bounds annotations in structs are required to refer to other members of the same struct, with the exception of constant variables and constant expressions, the compiler rejects the code if the name is found in an outer scope, unless the found name is a constant variable. This design choice is optimized for the common use cases of __counted_by and other bounds annotations, where the bound of a struct member pointer is almost always another member of the same struct.
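The lookup order described above can be sketched as a small standalone model. This is only an illustration written in C++, not Clang’s implementation: the names Resolution, Global, and resolve_bound_name are hypothetical, and the model deliberately ignores constant expressions and nested scopes.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model types; they do not correspond to Clang data structures.
enum class Resolution { Member, GlobalConstant, Rejected };

struct Global {
  std::string name;
  bool is_constant;
};

// Resolves `name` as used inside a bounds annotation of a struct whose
// member names are listed in `members` (in any declaration order).
Resolution resolve_bound_name(const std::string &name,
                              const std::vector<std::string> &members,
                              const std::vector<Global> &globals) {
  // 1. A member of the same struct always wins, regardless of order.
  if (std::find(members.begin(), members.end(), name) != members.end())
    return Resolution::Member;
  // 2. Otherwise look in the outer scope; only constants are accepted.
  for (const Global &g : globals)
    if (g.name == name)
      return g.is_constant ? Resolution::GlobalConstant : Resolution::Rejected;
  // 3. The name was not found anywhere.
  return Resolution::Rejected;
}

void example_usage() {
  // Models `struct bar { int *__counted_by(len) buf; int len; };`
  // in the presence of a global `const int len`.
  assert(resolve_bound_name("len", {"buf", "len"}, {{"len", true}}) ==
         Resolution::Member);
}
```

In this model, a name that matches only a non-constant global is rejected, which mirrors the restriction that bounds of struct members may not be mutable globals.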
We have millions of lines of C code, including audio codecs, the networking stack in an OS kernel, image parsing libraries, and encryption/decryption libraries, that have adopted the current model. We haven’t encountered any adopter that actually wanted to use a global constant as a count instead of a member of the same struct with the same name.
This means that, as shown in the following examples, we haven’t yet seen any adopter that actually meant to use the global constant len instead of the member len as the count of buf.
const int len = 10;
struct foo {
int len;
int *__counted_by(len) buf;
};
struct bar {
int *__counted_by(len) buf;
int len;
};
- Side note: globals are generally not allowed to be used as a count of struct members, function parameters, or local variables, because the compiler cannot keep track of the invariant that the global variable actually stores the right bound for every single instance of the struct, or for every single use of the global, potentially in different translation units. However, a constant global can still be used as a count for struct members, function parameters, or local variables.
We understand that this doesn’t mean that there will be no conflicting names between globals and struct members. Name conflicts haven’t been problematic in the current model because a member declaration naturally shadows a global variable of the same name, aligning with adopter expectations.
Also, regardless of how rare it is, we agree that the programming model should still provide a disambiguation mechanism, i.e., an option to choose a global constant instead of a struct member inside the bounds annotations. In “Proposed Builtins (disambiguation / suppression mechanisms)”, we propose new builtins to provide such disambiguation mechanisms.
This way, adopters can continue to write less code in the most common scenarios, while explicitly using a builtin only in rare circumstances. As stated in [2], “A well-chosen syntax significantly helps programmers learn new concepts and avoids silly errors by making them harder to express than their correct alternatives”. This approach helps avoid unnecessary mistakes, if we agree that choosing a global instead of a struct member for a bounds annotation is most likely a mistake. Writing as little code as possible is also important to enable large-scale adoption. The main purpose of this extension is to secure existing large code bases that cannot be rewritten in a safe language overnight.
Existing code relying on the current behavior
Clang (in mainline llvm-project) and GCC currently have limited support for the bounds annotations proposed by -fbounds-safety. Clang supports the counted_by(m), sized_by(m), counted_by_or_null(m), and sized_by_or_null(m) attributes, which are currently applicable only to struct members. GCC supports the counted_by(m) attribute, which is currently applicable only to flexible array members. The attributes supported in Clang and GCC can be used to refine the array bounds sanitizer and the __builtin_dynamic_object_size builtin. Both Clang and GCC partially implement the current name lookup behavior. The Linux kernel has adopted the counted_by annotation for flexible array members. There may be more users, as the two major compilers support it.
The guarded_by(m) attribute in Clang is used by the -Wthread-safety analysis, which warns when certain fields are used while not statically holding a particular lock. m is required to be an identifier that names another field of the same structure that stores the lock. The attribute is available in both C and C++. Its name lookup rule is similar to that of -fbounds-safety for struct members.
-fbounds-safety has been implemented in Apple’s fork of llvm-project and has been adopted in millions of lines of production code at Apple. These code bases rely on the current name lookup behavior.
Clarifications
Does -fbounds-safety introduce a new scoping rule for C that breaks existing code?
This is not the case. The current approach does not change how existing C features work. It introduces a struct scope (instantiation scope) only for names used inside bounds annotations. We have some suggestions on how to introduce a struct scope for C in general, but that is out of the scope of this document. The question then becomes how to handle the inconsistency between name lookup in bounds annotations and existing similar language features, which we discuss in detail in “Design Rationale”.
Are bounds annotations a C-only feature?
Our initial focus has been on C. However, we do see C++ support for the extension as an important next step. It is not uncommon for C code to be compiled as C++, or for C++ code to use a C library. Neither scenario should serve as a means to bypass safety checks.
At Apple, we have started using the bounds annotations in C++ to allow C and C++ to securely interoperate in scenarios where the bounds annotations are used on interfaces with pointers that are shared between the two languages. We also recently shared this vision in the LLVM Memory Safety Working Group meeting, and this is one of the projects on which we may collaborate with the group. For this reason, it’s a requirement to design a syntax that can be written compatibly in both languages.
-fbounds-safety proposes the same behavior for C and C++. Code that looks the same must express the same meaning, even though some of the behavior may seem new to both languages. The new behavior, however, should be contained within bounds annotations and should be reasonably explainable in terms of the new functionality that we are trying to introduce.
Without a new syntax, how do we support the use case where the count is in a nested struct?
A question has been raised about how, without a new syntax, to express code like the example below, where the bound of a buffer is stored in a nested struct.
struct _header {
int len;
};
struct _sized_buf {
struct _header header;
int *__counted_by(header.len) buf;
};
A member name within a bounds annotation refers to the member of the current struct instance. This is similar to a member variable definition, where struct _header header; defined in a struct means a member of the current struct instance. It’s also the same as in other C-family languages that already support methods in structs or classes: when an unqualified name resolves to a member name, it means a reference to the member of this instance of the struct. Therefore, header.len should be interpreted as an access to the member header of this instance of the struct. Conceptually, this is equivalent to __self.header.len or this->header.len.
Design Rationale
The current name lookup for bounds annotations is similar to the name lookup for methods of structs
From the functionality perspective, expressions within bounds annotations are similar to methods for structs in that, when invoked, they perform operations on the members that are referred to. While methods are invoked explicitly, the expressions in bounds annotations are invoked implicitly whenever the associated pointer is used. In the following example, __sized_by is an annotation that expresses the bounds of the pointer as a size in bytes. p->buf is used in the array subscript. At this point, the size of the buffer is evaluated as p->col * p->row, as annotated.
struct s {
int col;
int row;
uint8_t *__sized_by(col * row) buf;
};
uint8_t get_elem(struct s *p, int i, int j) {
// The size of 'p->buf' is evaluated as
// 'p->col * p->row'.
return p->buf[p->col * i + j];
}
The body of a method should be able to see the whole class/struct scope because it needs to be able to perform operations using members of the class/struct. Similarly, expressions within bounds annotations should be able to see the whole class/struct scope in order to calculate the bounds using members of the class/struct. Consequently, the name lookup rule for bounds annotations resembles the name lookup in class/struct methods.
While C doesn’t have a concept of methods for structs, many other C-family languages such as C++, C#, and Java already have the concept of methods for structs/classes. Thus, the name lookup for bounds annotations in structs is similar to the name lookup in method bodies in those C-family languages.
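To make the analogy concrete, here is a minimal C++ sketch in which the bounds expression is written out as an explicit method and the check is done by hand. It is an illustration only: buf_size, checked_at, and example_usage are hypothetical names, and throwing std::out_of_range is a stand-in chosen for this sketch, not the mechanism -fbounds-safety uses when a check fails.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

struct s {
  int col;
  int row;
  std::uint8_t *buf;

  // Plays the role of `__sized_by(col * row)`: as in any method body,
  // `col` and `row` resolve to members of this instance.
  std::size_t buf_size() const {
    return static_cast<std::size_t>(col) * static_cast<std::size_t>(row);
  }

  // Hand-written stand-in for the check that would be inserted
  // automatically wherever `buf` is subscripted.
  std::uint8_t checked_at(std::size_t idx) const {
    if (idx >= buf_size())
      throw std::out_of_range("buf index out of bounds");
    return buf[idx];
  }
};

std::uint8_t get_elem(const s *p, int i, int j) {
  return p->checked_at(static_cast<std::size_t>(p->col) * i + j);
}

void example_usage() {
  std::uint8_t data[6] = {0, 1, 2, 3, 4, 5};
  s grid{3, 2, data};
  assert(get_elem(&grid, 1, 2) == 5);  // index 3 * 1 + 2 = 5, within 6 bytes
}
```

The point of the sketch is that buf_size() needs no qualification to reach col and row, which is the same visibility the bounds annotation relies on.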
Inconsistencies are not necessarily wrong; sometimes, it’s necessary to properly support a novel feature
Because this is a new concept, the behavior may seem inconsistent with some similar looking features that exist in C and C++.
C doesn’t have the concept of methods for structs, so there is no way to refer to a member within the struct. Thus, in C, name lookup within a struct always skips the struct scopes. This leads to the inconsistency with arrays (see “Inconsistency with the name lookup in array brackets in C” for a more detailed discussion).
C++ doesn’t have a concept where a variable declaration has an attached expression that is implicitly evaluated every time the variable is used. The closest thing may be a constructor for a user-defined type, which is evaluated when the type is instantiated. This leads to the inconsistency with the name lookup for member definitions (see “Inconsistency with C++ when forward referencing a member” for more details).
However, being different isn’t necessarily problematic as long as it can be justified for the new functionality. Sometimes, it’s necessary to properly support the new use cases if the base language doesn’t already have such a behavior to best support them. The behavioral difference between the new feature and the existing ones can be explained because they are different.
There are alternative ways to introduce this new feature that keep the behavior as consistent as possible with the existing language features. However, such alternative approaches trade usability of the new feature for consistency with the existing ones. Ironically, such approaches can actually make the behavior more confusing and error-prone because the existing features were not designed to support the new use cases in the first place (see “Inconsistency with the name lookup in array brackets in C” for a detailed discussion).
Some may still prefer strict consistency over usability. However, that is a matter of preference based on different design philosophies, rather than evidence that the current behavior must be fixed.
Inconsistency with the name lookup in array brackets in C
In the following example, int arr[len] refers to the global constexpr len, not the struct member len.
constexpr int len = 10;
struct s {
int len;
int arr[len]; // this is the global constexpr `len`.
};
For bounds annotations in a similar example, __counted_by(len) means the member len, not the global len.
constexpr int len = 10;
struct s {
int len;
int *__counted_by(len) buf; // this is the member `len`.
};
This is an example where the name lookup for the bounds annotation __counted_by diverges from that of arrays. Following the same behavior as arrays would change the meaning of the above code to use the global constexpr len instead. This would be problematic for bounds annotations because it’s almost always the peer member that stores the bounds information.
In fact, even without __counted_by, the following code naturally reads as if the size of buf is stored in the other member, len. It is a very common idiom in C to track a pointer and its bound in the same struct.
constexpr int len = 10;
struct s {
int len;
int *buf;
};
struct s2 {
int *buf;
int len;
};
In addition, finding the global instead of the member for an unqualified name is counter-intuitive, given that the member is defined as int len; and the subsequent use of __counted_by(len) is in the same struct.
Some may argue that this is how C works already. However, even for arrays, the current name lookup is arguably counter-intuitive.
constexpr int len = 10;
struct s {
int len;
int arr[len]; // refers to global constexpr `len`.
};
A member is defined as int len right before arr[len], and yet this code means the global len instead. This behavior may be understandable because arrays cannot use a member as a size anyway. However, the same is not true for bounds annotations.
Alternative behavior: unqualified name resolves to global; use new syntax to access members
An alternative approach would be to follow the same behavior as arrays for unqualified names, and to use a new syntax (e.g., __self or .) to access the member (similar to Qing’s proposal). However, introducing a new syntax doesn’t change the fact that the default behavior without the syntax selects the unlikely alternative, i.e., a global. Hence, this behavior would still lead to lots of unwanted mistakes.
[2] states that “A well-chosen syntax significantly helps programmers learn new concepts and avoids silly errors by making them harder to express than their correct alternatives”. The alternative approach does exactly the opposite. Programmers would have to write more code for the correct behavior, and less code for the incorrect one.
In addition to avoiding mistakes, writing as little code as possible for most common use cases is important to allow large-scale adoption. The main purpose of this extension is to secure existing billions of lines of code bases in the wild that cannot be rewritten with a safe language in a timely fashion.
Given this functional difference between arrays and bounds annotations, the variation in the name lookup behavior in C can also be justified: int *__counted_by(len) finds the member len, which shadows the global constexpr len. On the other hand, arrays cannot have a member as a size, so len has to be the global constexpr for arrays. In fact, the difference between bounds annotations and arrays hasn’t been a source of confusion during our actual adoption experience. Programmers didn’t seem confused because they distinguish between bounds annotations and arrays.
On the other hand, if __counted_by(len) didn’t refer to the preceding member len, it would be hard to justify in C++. In most other C-family languages (C++, C#, Java, etc.) that already support methods in structs/classes, unqualified names can be used to refer to a member of the same struct/class. C might not already do that partly because C structs do not support methods, so there has been no compelling use case to refer to a member within a struct. However, programmers are generally familiar with this name lookup behavior, given that most programmers use more than one language.
Last but not least, the alternative solution will break almost all of our existing adopters.
Inconsistency with C++ when forward referencing a member
In C++, arr[len] in the following example refers to the member len. This is consistent with the current name lookup behavior in -fbounds-safety.
constexpr int len = 10;
struct s {
static constexpr int len = 20;
int arr[len]; // refers to member `len` in C++
};
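The claim that a previously declared member is found first in C++ can be checked at compile time. The following is a small self-contained sketch using only standard C++; the static_assert is added here for illustration and is not part of the original example.

```cpp
constexpr int len = 10;  // global constant

struct s {
  static constexpr int len = 20;  // member constant, declared before its use
  int arr[len];                   // lookup finds s::len, not ::len
};

// Holds only if the array bound came from the member (20), not the global (10).
static_assert(sizeof(s::arr) / sizeof(int) == 20, "arr is sized by s::len");
```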
In the following example, arr[len] refers to the global len because the member len has not been declared when the name len is used for the array member.
constexpr int len = 10;
struct s {
int arr[len]; // refers to global `len`.
static constexpr int len = 20;
};
In similar code with a bounds annotation, __counted_by(len) refers to the member len.
constexpr int len = 10;
struct s {
int *__counted_by(len) buf; // refers to member `len`.
int len;
};
This is where the name lookup behavior seems to be inconsistent with C++. However, as discussed earlier in “The current name lookup for bounds annotations is similar to the name lookup for methods of structs”, an expression within a bounds annotation is more like a method of the struct, so the name lookup for bounds annotations works as it does in a method body.
In C++, unqualified name lookup behaves differently depending on the context. When an unqualified name is used in the body of a method or constructor, it is looked up in the entire enclosing struct/class scope. This means that, inside the body of the method get_m(), the unqualified name m refers to the member m.
constexpr int m = 10;
struct foo {
int n = m; // refers to 'foo::m'
int get_m() { return m; } // refers to 'foo::m'
static constexpr int m = 20;
};
When it comes to a member or type definition within a struct/class, the unqualified name lookup can only find members that are already declared up to the point where the name is actually used.
However, int n = m still means the member m in the preceding example. This is because, behind the scenes, the compiler creates a default constructor that initializes n with m. Therefore, the name lookup for the default member initializer must see the whole struct/class scope, and it follows how lookup works for method bodies.
Similarly, the expression len in __counted_by(len) is evaluated every time the buffer is used, as if such a method were called implicitly. Therefore, name lookup within bounds annotations must access the entire struct/class scope, analogous to how it works in method bodies.
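The whole-class lookup in method bodies and default member initializers can be verified with standard C++ alone. The sketch below restates the foo example with get_m marked constexpr so that the results can be checked with static_assert; the constexpr qualifier and the assertions are additions made for this illustration.

```cpp
constexpr int m = 10;  // global constant

struct foo {
  int n = m;                                  // default member initializer: foo::m
  constexpr int get_m() const { return m; }   // method body: foo::m
  static constexpr int m = 20;                // declared after both uses
};

// Both uses see the member declared later in the class, not the global.
static_assert(foo{}.n == 20, "initializer uses foo::m");
static_assert(foo{}.get_m() == 20, "method body uses foo::m");
```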
Alternative behavior #1: follow the basic name lookup for member definitions in C++; allow forward reference only if the name doesn’t match
A potential alternative would be to follow the typical name lookup rule for member definitions in C++, and then search for a forward reference only if the name doesn’t already exist. However, such behavior would make code fragile because the meaning would change if anyone accidentally introduced a global with the same name as the one referred to by the bounds annotation. This means that, in order to prevent an accidental change of meaning, the compiler should report an error whenever a global name conflicts with the member name referred to by any bounds annotation. This may be doable, but many have shared concerns about reporting an error like this.
constexpr int len = 10; // note: `len` is declared here
struct s {
int *__counted_by(len) buf; // error: `len` is ambiguous
int len; // note: `len` is declared here
};
Alternative behavior #2: follow the basic name lookup for member definitions in C++; use new syntax only for forward referencing
Another potential alternative would be to follow the typical name lookup rule for member definitions in C++ and use some new syntax only for forward referencing a member. This can work, but it wouldn’t be ideal to always type the extra syntax unnecessarily, given that it is very rare for the member name used by a bounds annotation to conflict with a global name, and that it’s almost always the peer member that stores the bound of the pointer member.
That said, we agree that the current name lookup for bounds annotations can still be confusing when there is a name conflict with a global. We should emit a warning for the above example.
Isn’t it an undefined behavior to change the meaning of unqualified name within a class?
In C++, an unqualified name used in a class/struct must hold the same meaning throughout the complete class/struct scope. Otherwise, the behavior is undefined (strictly speaking, the C++ standard classifies such a program as ill-formed, no diagnostic required). Technically, the current rule doesn’t cause UB in this example because len used inside __counted_by consistently means the member len throughout the class scope.
constexpr int len = 10;
struct s {
int *__counted_by(len) buf; // consistently refers to the member `len`.
int len;
};
This is analogous to the case of a method body, where the name lookup sees the entire class scope. The following is not UB because the reference to the name len consistently means the member len:
constexpr int len = 10;
struct s {
int foo() { return len; } // consistently refers to the member `len`.
int len;
};
However, it will be a UB if there is another use of the name len
interpreted as the global constexpr before the member len
is declared. Because then the member declaration would change the meaning of len
in the class scope. However, such code is already a UB in the absence of bounds annotations. In fact, the bounds annotation has nothing to do with this UB. GCC already reports a warning in this case with “-Wchanges-meaning” for C++ (https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wchanges-meaning).
constexpr int len = 10;
struct s {
int bytes[len];
int len; // error: declaration of 'int s::len' changes meaning of 'len' [-Wchanges-meaning]
};
Revised Proposal
The revised proposal keeps the name lookup behavior explained in “The current name lookup rule in bounds annotations”. Additionally, the proposal suggests new diagnostics to alert programmers about ambiguous situations and proposes new builtins that can be used to suppress the diagnostics and to disambiguate member and global references.
Suggested New Diagnostics
The current approach explained in “The current name lookup rule in bounds annotations” might lead to confusion in cases where names used in bounds annotations conflict with globally defined constants. To help mitigate ambiguity and potential confusion, we propose introducing targeted compiler warnings. Again, adding a new syntax for member accesses also doesn’t solve the ambiguity issue because code can still be (mistakenly) written without the syntax, which then will silently mean most likely unintended behavior.
For example, consider this situation:
constexpr int len = 10;
struct s {
int *__counted_by(len) buf; // potential confusion: does this refer to the member or the global 'len'?
int len;
};
In this case, the compiler should emit a diagnostic warning to alert the programmer about the ambiguity:
warning: bounds annotation '__counted_by(len)' refers to struct member 'len' which shadows global constant 'len'
As discussed earlier, GCC already emits a warning in some similar situations for C++.
Similarly, confusion may arise if an unqualified name within an array declaration and the same name within a bounds annotation resolve differently, as shown below:
constexpr int len = 10;
struct s {
int len;
int *__counted_by(len) buf; // refers to struct member `len`.
int arr[len]; // refers to global constexpr `len`
};
In this scenario, the compiler could issue a specific warning highlighting this discrepancy:
warning: inconsistent use of identifier 'len':
'__counted_by(len)' refers to struct member 'len',
'arr[len]' refers to global constant 'len'
These diagnostics won’t alter the current default behavior (which resolves to the intended struct member), but they will help programmers identify and correct ambiguous or confusing cases. Users can explicitly suppress these warnings by using the proposed builtins (__builtin_member_ref or __builtin_global_ref) to clarify their intent.
Proposed Builtins (disambiguation / suppression mechanisms)
This proposal doesn’t introduce a new syntax at this time. Since this is currently being done as a vendor extension, without support from the language committees, it shouldn’t intrude too much on the normal language syntax. Introducing a syntax like .n or ::n could potentially conflict with future language features, especially when such names have to be part of an arithmetic expression (.n + 3 * .m << 4). It’s worth noting that among our internal adopters, there are quite a few cases where the count value involves arithmetic expressions. The standard vendor-extension space we use in Clang for things like this is __builtin_* and attributes.
Wrapping the builtins in macros will also make code easy to migrate once the new syntax for scope specifiers is set in stone.
We propose two builtins to disambiguate between a member name and a global name, and to suppress the new warnings caused by name conflicts:
- __builtin_member_ref(name): always looks in the current instantiation scope, regardless of the declaration order. Referring to an undeclared member is an error. This may be replaced with “__self.” or a new syntax once we get the blessing from the C/C++ communities.
- __builtin_global_ref(name): always looks in the enclosing global scope (a disambiguation mechanism for local scopes is not supported). This may be replaced with some new syntax once we get the blessing from the C committee. It can be replaced with :: in C++.
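For comparison, the existing C++ facility that __builtin_global_ref is meant to parallel is the :: scope qualifier. The sketch below uses only standard C++ and does not use the proposed builtins, which are not assumed to exist in any released compiler; member_len and global_len are hypothetical helper names.

```cpp
constexpr int len = 10;  // global constant

struct foo {
  int len;  // shadows the global `len` inside the struct's methods

  // Unqualified name: resolves to the member, as in a bounds annotation.
  constexpr int member_len() const { return len; }

  // Explicitly qualified name: always the global, regardless of shadowing.
  constexpr int global_len() const { return ::len; }
};

static_assert(foo{3}.member_len() == 3, "unqualified len is the member");
static_assert(foo{3}.global_len() == 10, "::len is the global");
```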
In the following example, the global constexpr len is shadowed by the member len.
constexpr int len = 10;
struct foo {
int len; // shadows the global constexpr `len`.
int *__counted_by(len) buf;
};
In order to still use the global constexpr len, one can add __builtin_global_ref(len) to specify the scope as shown below:
constexpr int len = 10;
struct foo {
int len;
int *__counted_by(__builtin_global_ref(len)) buf;
};
In the following example, the compiler emits a warning because the member name len conflicts with a global constexpr len.
constexpr int len = 10;
struct foo {
int *__counted_by(len) buf; // warn: `len` refers to the member, which conflicts with the global `len`.
int len;
};
In order to suppress the warning and specify the intended scope, one can add either __builtin_member_ref(len) or __builtin_global_ref(len), depending on the intended meaning of the code:
constexpr int len = 10;
struct foo {
int *__counted_by(__builtin_member_ref(len)) buf; // no warning
int len;
};
If we take the “__self” syntax then it can also be used as a suppression mechanism:
constexpr int len = 10;
struct foo {
int *__counted_by(__self.len) buf; // no warning
int len;
};
Using __builtin_global_ref to select the global name will suppress the warning too:
constexpr int len = 10;
struct foo {
int *__counted_by(__builtin_global_ref(len)) buf; // no warning
int len;
};
Relation to Qing’s “__self” Proposal
Qing’s “__self” proposal suggests aligning the name lookup rules for bounds annotations with the existing C behavior for array sizes. In Qing’s model, unqualified identifiers within bounds annotations would default to referring to global or outer-scope names, similar to how array sizes are resolved in current C semantics. To explicitly reference a struct member, Qing proposes introducing the syntax __self.member, though there still seems to be debate on which syntax to use.
As discussed previously in “Inconsistency with the name lookup in array brackets in C”, defaulting to global scope is problematic. This extension must also work for C++. When code is written as below, having __counted_by(len) choose the global constexpr len rather than the member len is not justifiable in any other C-family language that already has a concept of struct scope, including C++.
constexpr int len = 10;
struct s {
int len;
int *__counted_by(len) buf;
};
In addition, this behavior conflicts with typical user intent and introduces a behavior that is error-prone.
Another potential alternative that doesn’t have this problem would be to always mandate explicit scope specifier in either way: for member access and for global scope.
constexpr int len = 10;
struct s1 {
int len;
int *__counted_by(__builtin_global_ref(len)) buf;
};
struct s2 {
int len;
int *__counted_by(__self.len) buf;
};
struct s3 {
int len;
int *__counted_by(len) buf; // error: ill-formed
};
However, this is not ideal from the user’s perspective. Users are always required to write more code even though name conflicts are very rare. Again, writing as little code as possible is important for large-scale adoption.
In contrast, our proposal defaults unqualified identifiers to struct members, aligning closely with common usage patterns and user expectations. The behavioral difference from arrays is still explainable, as bounds annotations are functionally similar to methods in structs that are implicitly invoked when the annotated pointer is used, and that must see the whole struct scope to perform operations on the members (see “Design Rationale” for a detailed discussion). We still provide explicit builtins, such as __builtin_global_ref and __builtin_member_ref, allowing programmers to clearly specify global or struct member scope when ambiguity arises.
Notably, our approach doesn’t restrict programmers from consistently using “__self” or some scope specifier for members, even though it’s not required. Programmers who prefer explicitness (particularly in contexts without C++ compatibility concerns, or where they wish to avoid warnings due to name conflicts) can still consistently use this notation. Under our proposal, such usage closely resembles Qing’s proposal, except that programmers would need to explicitly use __builtin_global_ref to reference a global instead of a struct member with the same name.
References
[1] RFC: Enforcing Bounds Safety in C (-fbounds-safety)
[2] The Design and Evolution of C++, Bjarne Stroustrup, 1994.