RFC: Bounds Safety in C Syntax Compatibility with GCC

Introduction

The Bounds Safety feature is an exciting new feature, which many programmers are eager to integrate into their code—in particular, Linux kernel developers. To this end, the new attributes it introduces need to be supported by the GCC compiler, among others.

The attributes’ syntax allow the programmer to specify either a single identifier or a simple expression. Each identifier in the attribute must be a member of the containing struct:

 1 size_t len, size;
 2 struct X {
 3   size_t len;
 4   int *buf __counted_by(len); // refers to 'len' on line 3.
 5 };
 6
 7 struct Y {
 8   size_t len;
 9   size_t size;
10   void *buf __sized_by(len * size); // 'len' & 'size' on lines 8 & 9.
11 };

The main issue is that GCC’s C compiler isn’t able to perform general delayed parsing of expressions, which is required to support how expressions within these attributes are parsed, because C only requires a single-pass parser.

Another issue is that (by design) non-struct identifiers aren’t accessible within the attributes. This precludes the use of enums, constant variables, and non-const variables whose values never change.

Explanation

The issue at hand is that C has no scoping rule that applies to structures. And adding non-codified scoping isn’t just an overreach for a set of features, but many compilers, GCC included, simply aren’t able to add such resolution rules. Thus, other compilers won’t be able to perform the general delayed parsing of expressions required to support this feature.

To resolve this issue, we need a way to signify to the compiler where an identifier is declared. One way is to add a forward declaration of struct fields to indicate to the compiler that said identifiers are fields within the struct. Any identifiers not found within the forward declaration list are resolved using C’s normal name lookup rules.

enum { PADDING = 42 };
struct X {
  int *buf __counted_by(size_t len; len + PADDING);
  size_t len;
};

Note: The type of the forward declared identifier must be the same as the type in the struct.

Even with this syntax change, there still exists some ambiguity of which form of the attribute to use. For example, an expression (len) (parens included) would indicate to the parser that the programmer wants to use the expression version of the attribute rather than the single identifier version. Other cases may require look-ahead tokens before the parser can determine which version to use (e.g. len + 0). To remove this ambiguity, we append a _expr to the attribute name. For example:

enum { PADDING = 42 };
size_t size;

struct A {
  int *buf __counted_by_expr(size_t len; len * size + PADDING);
  size_t len;
};

const size_t len;
struct B {
  int *buf __counted_by_expr(len * size + PADDING);
};

Proposal

The “single identifier” syntax is supported by both Clang and GCC (with some minor changes to lookup rules), so this proposal focuses on expression handling.

We propose appending the suffix “_expr” for attributes that take expressions and adding forward declarations of struct identifiers used within the expression. (We’ll use the counted_by attribute as an example.)

counted_by_expr ::= 'counted_by_expr' '(' <decl_list> <expr> ')'
decl_list ::= /* empty */ | <decl> ';' <decl_list>
decl ::= <type> <identifier>

All forward declared identifiers must exist within the containing struct and must have the same type. Any identifiers that weren’t forward declared are resolved using normal C lookup rules.

An alternate syntax, that avoids adding forward declarations, is to use the “designator syntax” for struct members:

__counted_by_expr ( .len + PADDING );

This syntax is used for designated initializers in both C and C++, so it already has some support within the languages. It’s also compact and has a fairly clean syntax.

Some downsides are that the preceding dot (.) is easier to miss both reading and writing.

C++ Compatibility

The forward declaration syntax isn’t necessary for C++, because of C++'s scoping rules. It’s easier to be explicit by using this-> and ::. When parsing a struct for C, the C++ parser will of course need to verify that the correct forward declarations are made.

Example Usage

Here are some examples of how the proposed changes act.

These two examples are equivalent:

counted_by (len)
counted_by_expr (size_t len; len)

These two examples are not equivalent:

counted_by (len) // refers to 'len' in the struct
counted_by_expr (len) // refers to 'len' outside the struct

For this example, len and scale in the attribute reference the non-struct variables.

constexpr int len = 20;
constexpr int scale = 4;
struct s {
  int scale; 
  int len;
  int *buf __counted_by_expr (len * scale); 
};

The forward declaration isn’t added if no struct fields are used. The following would result in an error, because the compiler would parse a type, expecting a forward declaration, but it wouldn’t find one.

typedef unsigned short TY;
struct A {
  int TY;
  int *buf __counted_by_expr(TY);
};
1 Like

Thanks for writing this up! Just to clarify, many alternatives have been discussed with GCC developers already, and what is proposed here is the compromise that GCC folks are generally okay with. If there aren’t any strong objections on the Clang side, then we’ll have a workable path to using these features in the Linux kernel regardless of compiler. :slight_smile:

Thanks @void for writing this up!

This proposal has a benefit of not changing the behavior of already existing attributes, such as guarded_by and counted_by which currently only allow identifiers as input, at least in the mainline LLVM and GCC. This is achieved by keeping the original behavior for identifiers, while introducing a separate new attribute variant counted_by_expr to accept expressions, which follow the default name lookup rule for C.

The proposal tries to address GCC’s concerns with our original approach. In summary, our original approach was to allow counted_by and similar attributes to use unqualified names to express members of the structs. The attributes can also accept simple expressions as below:

struct s {
  int scale;
  int len;
  int *__counted_by(len * scale) buf; // refers to member `len` and member `scale` 
};

In my understanding, the main concern from GCC with this original proposal was that this would require significant changes both in the C language and the GCC’s parser (C doesn’t have a struct scope, and unlike Clang, C and C++ frontends do not share the code base in GCC). These changes include the introduction of the new scoping rule that is a great departure from the C language specification, and the implementation of late parsing. So GCC would not support our original proposal without having a full language specification and blessing from the C standard committee. GCC is okay if the proposed behavior is limited to identifiers only because it requires less changes.

However, this new proposal still doesn’t address the main issue with C++ compatibility that the same code like below will be interpreted differently in C and C++. The main reason why this is important is that there are many headers in practice (including C standard library headers) that are shared between C and C++; and that C++ needs to understand these attributes to secure the boundary between C and C++ although the initial design was focused on C. So such a silent divergence is something we must avoid.

constexpr int len = 20;
constexpr int scale = 4;
struct s {
  int scale; 
  int len;
  int *buf __counted_by_expr (len * scale); // In C, it's global `len` and `scale`
};

In order to avoid such a silent divergence,

  1. we need to disallow use of unqualified names even for globals; make names always explicit:
constexpr int len = 20;
constexpr int scale = 4;
struct s {
  int scale; 
  int len;
  int *buf __counted_by_expr (len * scale); // error: use of unqualified names
};
constexpr int len = 20;
constexpr int scale = 4;
struct s {
  int scale; 
  int len;
  int *buf __counted_by_expr (__global(len) * __global(scale)); // okay
};
  1. or, we need to disallow unqualified name matching members:
constexpr int len = 20;
constexpr int scale = 4;
struct s {
  int scale; 
  int len;
  int *buf __counted_by_expr (len * scale); // error: unqualified names shouldn't match members, use the explicit syntax (__global(N) or __self.)
};
constexpr int len = 20;
constexpr int scale = 4;
struct s {
  int *buf __counted_by_expr (len * scale); // this works because there's no matching member names.
};

Adding these diagnostics does not have the same problem that “a change in unrelated header leads to diagnostics in code using __counted_by”, because diagnostics do not depend on the existence of global names.

The previous GCC’s proposals to use __self also had this same problem with C++ compatibility and that was one of the reasons why we couldn’t take that approach.

To give you heads up, we’re planning to propose an alternative proposal that may address the concerns from both GCC and Clang.

CC @AaronBallman @erichkeane @fay59 @rjmccall @DougGregor @delcypher @hnrklssn

1 Like

If I understand correctly, C++ thinks len and scale are the struct members because it is parsing counted_by_expr in instance scope (which C lacks). Is that right?

Isn’t that simply an implementation choice? Why can’t C++ parse counted_by_expr in global scope, the same way C does?

I guess the argument is that this is then less consistent in C++. But I wonder why this is a specific problem how for this extensions, given that this inconsistency between C and C++ does already exist in the core language for decades and this did not bother anybody enough to do something about it. For example, C and C++ have silent semantic difference in the following example:

int val[5];

struct s {
  int val[4];
  char buf[sizeof(val)];
} a;

And in C++ for the second struct there is UB, and it is different to the first on clang and you get an error with GCC:

struct t {
  int val[4];
  char buf[sizeof(val)];
} b;

You can access the global variables in C++ with :: as usual. So my recommendation would simply be to produce diagnostics in C++ in the cases where there is ambiguity.

A __global(x) makes sense, but this is then just

#ifdef __cplusplus
#define __global(x) ::x
#else
#define __global(x)
#endif

I would not force its use in either C++ or C given that the common case should not be ambiguous.

Implementing it won’t be a problem, but the behavior is not really explainable for C++, which understands the struct scope. You have a spelling int len; right before and you use it in __counted_by_expr(len * ...) but the same spelling len has different meanings in the same struct (because of that, arguably, this will be confusing for some C programmers too). If we decide to have C++ follow the C name lookup rule, C++ programmers will likely make mistakes, because the code naturally reads differently for C++ programmers. To avoid mistakes, the compiler should diagnose on ambiguous cases that C++ programmers would naturally read differently.

Now this becomes even harder to explain because __counted_by(len) understands len as member. But len in __counted_by_expr(len) has a different meaning.

In summary, I think we could make so that C++ follow the C name scoping rule, but then we shouldn’t make the design error-prone. Making it always explicit, or diagnose on ambiguous cases as I suggested above would prevent mistakes.

This is unfortunate, but that doesn’t mean we want to introduce more of these problems. We really need to avoid introducing silent inconsistency especially when the feature can be used to make the interoperation between C and C++ safe.

1 Like

It will be something like this. We can introduce a builtin for C that asks as a global scope specifier.

#ifdef __cplusplus
#define __global(x) ::x
#else
#define __global(x) __builtin_global_ref(x)

I have a different opinion. Using globals itself isn’t very common either, so forcing it wouldn’t really affect the common experience of using this feature, while it will prevent unnecessary mistakes.

One (of many) differences between C and C++ are, as many people pointed out before, the different scoping rules. C has two, C++ has myriads. The discussion in the other thread was to try to force C to support these new features without violating the language, or modifying it ways that haven’t been codified yet. To that end, we’ve shifted the burden of specifying which scoping rule to apply from the language that can’t support such things onto the language which can.

C++ is able to unambiguously express which identifier it wants to use in this feature by using this-> and ::. So it feels correct to shift the burden onto C++ to use those forms. A la Martin:

#ifdef __cplusplus
#define __global(x) ::x
#else
#define __global(x)
#endif

Note that an analogous __local is only required if we choose not to have the (superfluous) forward declarations in the C+±only version of these attributes.

By shifting the burden from C to C++, we should have something that’s workable for everyone. And adding :: isn’t a huge departure for C++ programmers, who are generally used to that concept.

Yes, this is in line with what I’m going to propose as an alternative. But we still need to avoid a silent semantic divergence, or coming up with a design that’s too error-prone in either language.

:+1: Thank you!

@AaronBallman @erichkeane This RFC includes some ideas that may affect other parts in Clang too. Especially, it suggests introducing forward declarations inside the attributes that take expressions as a way to access members (otherwise an unqualified name inside the attributes find the local/global scopes). Do you have any feedback on this?