RFC: Bounds Safety in C Syntax Compatibility with GCC

Introduction

The Bounds Safety feature is an exciting new feature, which many programmers are eager to integrate into their code—in particular, Linux kernel developers. To this end, the new attributes it introduces need to be supported by the GCC compiler, among others.

The attributes’ syntax allow the programmer to specify either a single identifier or a simple expression. Each identifier in the attribute must be a member of the containing struct:

 1 size_t len, size;
 2 struct X {
 3   size_t len;
 4   int *buf __counted_by(len); // refers to 'len' on line 3.
 5 };
 6
 7 struct Y {
 8   size_t len;
 9   size_t size;
10   void *buf __sized_by(len * size); // 'len' & 'size' on lines 8 & 9.
11 };

The main issue is that GCC’s C compiler isn’t able to perform general delayed parsing of expressions, which is required to support how expressions within these attributes are parsed, because C only requires a single-pass parser.

Another issue is that (by design) non-struct identifiers aren’t accessible within the attributes. This precludes the use of enums, constant variables, and non-const variables whose values never change.

Explanation

The issue at hand is that C has no scoping rule that applies to structures. And adding non-codified scoping isn’t just an overreach for a set of features, but many compilers, GCC included, simply aren’t able to add such resolution rules. Thus, other compilers won’t be able to perform the general delayed parsing of expressions required to support this feature.

To resolve this issue, we need a way to signify to the compiler where an identifier is declared. One way is to add a forward declaration of struct fields to indicate to the compiler that said identifiers are fields within the struct. Any identifiers not found within the forward declaration list are resolved using C’s normal name lookup rules.

enum { PADDING = 42 };
struct X {
  int *buf __counted_by(size_t len; len + PADDING);
  size_t len;
};

Note: The type of the forward declared identifier must be the same as the type in the struct.

Even with this syntax change, there still exists some ambiguity of which form of the attribute to use. For example, an expression (len) (parens included) would indicate to the parser that the programmer wants to use the expression version of the attribute rather than the single identifier version. Other cases may require look-ahead tokens before the parser can determine which version to use (e.g. len + 0). To remove this ambiguity, we append a _expr to the attribute name. For example:

enum { PADDING = 42 };
size_t size;

struct A {
  int *buf __counted_by_expr(size_t len; len * size + PADDING);
  size_t len;
};

const size_t len;
struct B {
  int *buf __counted_by_expr(len * size + PADDING);
};

Proposal

The “single identifier” syntax is supported by both Clang and GCC (with some minor changes to lookup rules), so this proposal focuses on expression handling.

We propose appending the suffix “_expr” for attributes that take expressions and adding forward declarations of struct identifiers used within the expression. (We’ll use the counted_by attribute as an example.)

counted_by_expr ::= 'counted_by_expr' '(' <decl_list> <expr> ')'
decl_list ::= /* empty */ | <decl> ';' <decl_list>
decl ::= <type> <identifier>

All forward declared identifiers must exist within the containing struct and must have the same type. Any identifiers that weren’t forward declared are resolved using normal C lookup rules.

An alternate syntax, that avoids adding forward declarations, is to use the “designator syntax” for struct members:

__counted_by_expr ( .len + PADDING );

This syntax is used for designated initializers in both C and C++, so it already has some support within the languages. It’s also compact and has a fairly clean syntax.

Some downsides are that the preceding dot (.) is easier to miss both reading and writing.

C++ Compatibility

The forward declaration syntax isn’t necessary for C++, because of C++'s scoping rules. It’s easier to be explicit by using this-> and ::. When parsing a struct for C, the C++ parser will of course need to verify that the correct forward declarations are made.

Example Usage

Here are some examples of how the proposed changes act.

These two examples are equivalent:

counted_by (len)
counted_by_expr (size_t len; len)

These two examples are not equivalent:

counted_by (len) // refers to 'len' in the struct
counted_by_expr (len) // refers to 'len' outside the struct

For this example, len and scale in the attribute reference the non-struct variables.

constexpr int len = 20;
constexpr int scale = 4;
struct s {
  int scale; 
  int len;
  int *buf __counted_by_expr (len * scale); 
};

The forward declaration isn’t added if no struct fields are used. The following would result in an error, because the compiler would parse a type, expecting a forward declaration, but it wouldn’t find one.

typedef unsigned short TY;
struct A {
  int TY;
  int *buf __counted_by_expr(TY);
};
2 Likes

Thanks for writing this up! Just to clarify, many alternatives have been discussed with GCC developers already, and what is proposed here is the compromise that GCC folks are generally okay with. If there aren’t any strong objections on the Clang side, then we’ll have a workable path to using these features in the Linux kernel regardless of compiler. :slight_smile:

Thanks @void for writing this up!

This proposal has a benefit of not changing the behavior of already existing attributes, such as guarded_by and counted_by which currently only allow identifiers as input, at least in the mainline LLVM and GCC. This is achieved by keeping the original behavior for identifiers, while introducing a separate new attribute variant counted_by_expr to accept expressions, which follow the default name lookup rule for C.

The proposal tries to address GCC’s concerns with our original approach. In summary, our original approach was to allow counted_by and similar attributes to use unqualified names to express members of the structs. The attributes can also accept simple expressions as below:

struct s {
  int scale;
  int len;
  int *__counted_by(len * scale) buf; // refers to member `len` and member `scale` 
};

In my understanding, the main concern from GCC with this original proposal was that this would require significant changes both in the C language and the GCC’s parser (C doesn’t have a struct scope, and unlike Clang, C and C++ frontends do not share the code base in GCC). These changes include the introduction of the new scoping rule that is a great departure from the C language specification, and the implementation of late parsing. So GCC would not support our original proposal without having a full language specification and blessing from the C standard committee. GCC is okay if the proposed behavior is limited to identifiers only because it requires less changes.

However, this new proposal still doesn’t address the main issue with C++ compatibility that the same code like below will be interpreted differently in C and C++. The main reason why this is important is that there are many headers in practice (including C standard library headers) that are shared between C and C++; and that C++ needs to understand these attributes to secure the boundary between C and C++ although the initial design was focused on C. So such a silent divergence is something we must avoid.

constexpr int len = 20;
constexpr int scale = 4;
struct s {
  int scale; 
  int len;
  int *buf __counted_by_expr (len * scale); // In C, it's global `len` and `scale`
};

In order to avoid such a silent divergence,

  1. we need to disallow use of unqualified names even for globals; make names always explicit:
constexpr int len = 20;
constexpr int scale = 4;
struct s {
  int scale; 
  int len;
  int *buf __counted_by_expr (len * scale); // error: use of unqualified names
};
constexpr int len = 20;
constexpr int scale = 4;
struct s {
  int scale; 
  int len;
  int *buf __counted_by_expr (__global(len) * __global(scale)); // okay
};
  1. or, we need to disallow unqualified name matching members:
constexpr int len = 20;
constexpr int scale = 4;
struct s {
  int scale; 
  int len;
  int *buf __counted_by_expr (len * scale); // error: unqualified names shouldn't match members, use the explicit syntax (__global(N) or __self.)
};
constexpr int len = 20;
constexpr int scale = 4;
struct s {
  int *buf __counted_by_expr (len * scale); // this works because there's no matching member names.
};

Adding these diagnostics does not have the same problem that “a change in unrelated header leads to diagnostics in code using __counted_by”, because diagnostics do not depend on the existence of global names.

The previous GCC’s proposals to use __self also had this same problem with C++ compatibility and that was one of the reasons why we couldn’t take that approach.

To give you heads up, we’re planning to propose an alternative proposal that may address the concerns from both GCC and Clang.

CC @AaronBallman @erichkeane @fay59 @rjmccall @DougGregor @delcypher @hnrklssn

1 Like

If I understand correctly, C++ thinks len and scale are the struct members because it is parsing counted_by_expr in instance scope (which C lacks). Is that right?

Isn’t that simply an implementation choice? Why can’t C++ parse counted_by_expr in global scope, the same way C does?

I guess the argument is that this is then less consistent in C++. But I wonder why this is a specific problem how for this extensions, given that this inconsistency between C and C++ does already exist in the core language for decades and this did not bother anybody enough to do something about it. For example, C and C++ have silent semantic difference in the following example:

int val[5];

struct s {
  int val[4];
  char buf[sizeof(val)];
} a;

And in C++ for the second struct there is UB, and it is different to the first on clang and you get an error with GCC:

struct t {
  int val[4];
  char buf[sizeof(val)];
} b;

You can access the global variables in C++ with :: as usual. So my recommendation would simply be to produce diagnostics in C++ in the cases where there is ambiguity.

A __global(x) makes sense, but this is then just

#ifdef __cplusplus
#define __global(x) ::x
#else
#define __global(x)
#endif

I would not force its use in either C++ or C given that the common case should not be ambiguous.

Implementing it won’t be a problem, but the behavior is not really explainable for C++, which understands the struct scope. You have a spelling int len; right before and you use it in __counted_by_expr(len * ...) but the same spelling len has different meanings in the same struct (because of that, arguably, this will be confusing for some C programmers too). If we decide to have C++ follow the C name lookup rule, C++ programmers will likely make mistakes, because the code naturally reads differently for C++ programmers. To avoid mistakes, the compiler should diagnose on ambiguous cases that C++ programmers would naturally read differently.

Now this becomes even harder to explain because __counted_by(len) understands len as member. But len in __counted_by_expr(len) has a different meaning.

In summary, I think we could make so that C++ follow the C name scoping rule, but then we shouldn’t make the design error-prone. Making it always explicit, or diagnose on ambiguous cases as I suggested above would prevent mistakes.

This is unfortunate, but that doesn’t mean we want to introduce more of these problems. We really need to avoid introducing silent inconsistency especially when the feature can be used to make the interoperation between C and C++ safe.

1 Like

It will be something like this. We can introduce a builtin for C that asks as a global scope specifier.

#ifdef __cplusplus
#define __global(x) ::x
#else
#define __global(x) __builtin_global_ref(x)

I have a different opinion. Using globals itself isn’t very common either, so forcing it wouldn’t really affect the common experience of using this feature, while it will prevent unnecessary mistakes.

One (of many) differences between C and C++ are, as many people pointed out before, the different scoping rules. C has two, C++ has myriads. The discussion in the other thread was to try to force C to support these new features without violating the language, or modifying it ways that haven’t been codified yet. To that end, we’ve shifted the burden of specifying which scoping rule to apply from the language that can’t support such things onto the language which can.

C++ is able to unambiguously express which identifier it wants to use in this feature by using this-> and ::. So it feels correct to shift the burden onto C++ to use those forms. A la Martin:

#ifdef __cplusplus
#define __global(x) ::x
#else
#define __global(x)
#endif

Note that an analogous __local is only required if we choose not to have the (superfluous) forward declarations in the C+±only version of these attributes.

By shifting the burden from C to C++, we should have something that’s workable for everyone. And adding :: isn’t a huge departure for C++ programmers, who are generally used to that concept.

Yes, this is in line with what I’m going to propose as an alternative. But we still need to avoid a silent semantic divergence, or coming up with a design that’s too error-prone in either language.

:+1: Thank you!

@AaronBallman @erichkeane This RFC includes some ideas that may affect other parts in Clang too. Especially, it suggests introducing forward declarations inside the attributes that take expressions as a way to access members (otherwise an unqualified name inside the attributes find the local/global scopes). Do you have any feedback on this?

Thank you for the new RFC! I have some questions:

Same canonical type? e.g., is this fine?

typedef int frobble;
struct S {
  int *buf __counted_by_expr(int i; i + 12);
  frobble i;
};

Also, regarding “same type”, what about a case like this:

struct S {
  int *buf __counted_by_expr(int len; len + 12);
  int len : 4;
};

where the bit-field has type int but is a bit-field which is a bit of a distinct type in other ways.

Will there be a warning diagnostic for code like:

int len;
struct S {
  int len;
  int *buf __counted_by(len); // Or __counted_by_expr
};

due to the potential for confusion?

Honestly, all of this seems like a pretty novel way around now having a local-scope operator to me.

The forward declarations are strange to me. I come down on the favor of SOME sort of local-scope operator as an extension (the __self operator which ends up being a no-op in C++ to modify it in C).

int len;
struct S {
  int len;
  int *buf __counted_by(len); // Or __counted_by_expr
};

THAT example is absolutely flabbergasting to me FWIW. I realize that the idea of an attribute ‘looking up’ inside of the struct itself is strange in C, but its a builtin, we can design it how we want. It seems to me to make sense to have it have custom lookup rules if that makes it the most ergonomic.

ALTERNATIVELY: What about some way of declaring the elements as a ‘pair’ (that is, have the operator ‘wrap’ both declarations, and thus be unambiguous)? I presume some of this is not wanting to change the layout of the struct though? So I’m not sure that works too well.

So in the end, either having __counted_by have clever lookup rules, or some sort of scoping operator/qualifier makes the most sense to me. Spell it:
S::len or __self(len) or something like that, plus make sure you get the warning that Aaron asked for (that the one used isn’t the one from the local struct) I think makes this ergonomic. I have a VAST preference for the former (since that is just how C++ spells it today). In C this ends up being a pretty trivial qualification (and we can limit it to JUST <structname>:: if that is important to us).

I thought about it more and I wonder if you are open to hold the idea for counted_by_expr until the next C committee meeting scheduled on late August. In the mean time, we can keep using and developing on counted_by(id) as is because the GCC community is also fine with it.

I’m asking because we are planning to propose the -fbounds-safety programming model, and the name lookup rule for bounds annotations in the next C standard committee meeting. I expect the standard committee to advise to apply a consistent rule for an identifier and an expression. So this approach to separating __counted_by and __counted_by_expr will likely not pass the committee. That means if we do this now, then we would have to change the syntax yet again if the standard committee decides to do something else.

For the standard, I’m thinking about proposing two options:

  1. Consistently apply the struct scope to identifiers and expressions (IOW, apply the same counted_by(id) name look up rule to counted_by(expr) too) (as proposed here)
  2. Or alternatively, requiring _Self. (or this->) to access a member like Qing’s proposal but combining with a mechanism to avoid a silent meaning change of existing code already using __counted_by, __guarded_by, etc. as well as a silent divergence between C and C++ as suggested earlier:

The problem with “1) Consistently applying the struct scope to identifiers and expressions” was that it’s considered a significant change for GCC to support such rules for expressions. It seems they at least need some blessing from the C standard committee and a full specification for that. So I’m hoping we can get a blessing on this approach in the next standard committee meeting.

The problem with “2) Using __self. with a mechanism to avoid a silent meaning change” is that it will break existing code that already uses __counted_by(id) and __guarded_by(id) in C and C++, which take the first approach. We would need to think about a good migration plan if this alternative approach ended up getting a blessing from the C standard committee.

1 Like

I find the forward declaration idea acceptable. However, using two keywords to interpret the same input differently confuses a presumably clear intention. Even previously confident users can no longer be sure which means which.

How about this:

  1. There is only one __counted_by with two forms, with or without the forward declaration.
  2. In both forms, __counted_by does not interpret an unqualified id as a global variable at all in both C and C++.
  3. In C, the member referred to by an unqualified id in the expression has to be declared, either using the forward declaration form or simply preceding that __counted_by in member declarations.
  4. In C++, the forward declaration form is interpreted in the same way as that in C, while the expression-only form allows looking up member names that appear later in the member declarations or are declared in the base(s).
  5. How to access global variables? Not needed. We can introduce static member variables to C, allowing
    constexpr int len = 20;
    constexpr int scale = 4;
    struct s {
      static constexpr int g_len = len;
      static constexpr int g_scale = scale;
    
      int scale;
      int len;
      int *buf __counted_by(g_len * g_scale);
    };
    
    Or just adding a global namespace qualifier :: to C.

I would say yes. C’s type system isn’t strict enough to reject this.

Interesting example. The purpose of the forward declarations is to facilitate single-pass parsers, like the one GCC has. Someone on the GCC team would have to chime in here. My initial remark is that it should be a bit field as well:

__counted_by_expr(int len : 4; len + 12)

I don’t think we should warn about this for __counted_by. A warning for __counted_by_expr might be useful in this situation, but it has to be a warning that can be turned off without having to rename struct fields or globals.

The problem is that the __self (or equivalent) keyword needs to be required because C has no struct level local scope, and cobbling together such a scoping rule for this feature is a large overreach.

This example is illustrative, but probably not the best indicator of how code will be written most of the time. :slight_smile: Less glibly, when we originally designed these attributes, it seemed “natural” for us to reference struct fields without having a special prefix. This would imply that it’s the opposite (that the above would use the global) which is more counter intuitive. I realize that this counters the proposal of the __counted_by_expr attribute, which explicitly treats the above as referring to the global, but we had to make compromises for this to work with all compilers.

C doesn’t have a :: operator though. And again, using __self() would have to be required, which is a large burden on the user.