Language extension: __attribute__((bounds))/__attribute__((__bounded__))

Hi,

I started working on adding a new attribute to Clang, to annotate
pointers with bounds information,
before going further I'd like to hear your opinion whether this is
acceptable for clang (or if we can
modify this proposal for it to be acceptable).

Given a pointer field in a struct, or a pointer parameter, the proposed
__attribute__ would tell
the compiler which field/parameter holds the bounds of the buffer pointer.

This is not a completely new attribute, in fact OpenBSD has had
something similar for GCC,
but only for function parameters [1].

Some possible use cases for this attribute:
- emit a warning when a buffer pointer is passed to a function if the
length argument is larger than the
actual length of the buffer
- use this attribute in the clang static analyzer to do bounds-checks
- use this attribute to do static analysis in LLVM
- use clang to compile a subset of C, with this attribute, where the
bounds of all pointers are known,
and reject programs that are not part of this subset

OpenBSD's attribute gives warnings only when the parameters are
constant, but I'd like to take it further,
and also warn/error when the length is stored in a struct, later read
and passed as parameter to a function.

For now I implemented the sema+cg part for the bounds attribute, small
example attached at end of this mail [2]

Since my attribute will do more/other than the OpenBSD attribute, to
avoid clashes, I named my attribute 'bounds' (suggestions welcome), but
I intend to support BSD's __bounded__ attribute too!
Also the __bounded__ attribute uses indexes, rather than names, which I
consider error prone.

Proposed semantics:
struct foo {
  unsigned n;
  unsigned *x __attribute__((bounds(n)));
};

This will declare that x is an array of 'n' elements, where 'n' is the
field of the same struct.
NULL pointers will have size 0 (and so will buffers of size 0 which may
or may not be null).
It will be enforced at 2 places:
- storing values: 'n' must always be accurate and reflect the current
size of 'x' (which means store order matters), with one exception: if
the 'struct foo' variable, or pointer to it is only visible in current
function, and doesn't escape, it is allowed to store in any order.
- dereferencing x: the index must always be in the range [0, n)
- check that 'n' really is the size of the allocation
- if the size is unknown, it will warn/emit error too

Note that these checks don't have to be done in clang necesarely, they
can be done in LLVM (in fact I'd prefer, it is easier
to do dataflow analysis there).

This attribute should work for pointers to constant size arrays, VLAs
[3], pointers coming from malloc (and malloc-like wrappers).

For function parameters the semantics is as described in [1], except
checks will be done for non-constant sized buffers too.

The emitted LLVM code will use metadata to capture this information, see
[2].

[1]
http://www.openbsd.org/cgi-bin/man.cgi?query=gcc-local&apropos=0&sektion=0&manpath=OpenBSD+Current&arch=i386&format=html
[2]
--------- Sema example
int n;
int *x __attribute__((bounds(n))); // expected-error {{'bounds'
attribute applies to pointer fields only}}
int z __attribute__((bounds(n))); // expected-error {{'bounds' attribute
applies to pointer fields only}}
unsigned o;

struct foo {
    unsigned n;
    int ns;// expected-note {{declared at}}
    int *a __attribute__((bounds(1))); // expected-error {{attribute
requires unquoted parameter}}
    int *b __attribute__((bounds(n, 1))); // expected-error {{attribute
requires 0 argument(s)}}
    int *c __attribute__((bounds(m))); // expected-error {{use of
undeclared identifier 'm'}}
    int *d __attribute__((bounds(o))); // expected-error {{use of
undeclared identifier 'o'}}
    int *e __attribute__((bounds(ns))); // expected-error {{'bounds'
attribute requires parameter 1 to be the name of an unsigned field}}
};

struct foo foos;
-------------------- Codegen example
struct foo {
    unsigned n;
    int *b __attribute__((bounds(n)));
};

struct foo foos;
// CHECK: !llvm.boundsinfo.foo = !{!0}
// CHECK: !0 = metadata !{i32 1, i32 0}

[3] not sure about VLAs, a too large VLA could exceed stack size and
crash the program

Best regards,
--Edwin

Hi Edwin,

I find this proposal interesting! I have a few comments inline, but I also wanted to draw attention to Microsoft's "Standard Annotation Language" (or SAL), which were originally derived to extend buffer overflow checking by the compiler (and other tools):

http://msdn.microsoft.com/en-us/library/ms235402(VS.80).aspx

While I'm not suggesting that we definitely implement SAL in Clang, I think it is worth at least considering some of the ideas there, as the encompass not only the annotation you propose but a more general set of annotations that can be used for security checking. Moreover, there is some value having a common set of buffer overflow annotations that work across various platforms and compilers.

Comments inline.

Hi,

I started working on adding a new attribute to Clang, to annotate
pointers with bounds information,
before going further I'd like to hear your opinion whether this is
acceptable for clang (or if we can
modify this proposal for it to be acceptable).

Given a pointer field in a struct, or a pointer parameter, the proposed
__attribute__ would tell
the compiler which field/parameter holds the bounds of the buffer pointer.

Is this extent in bytes, or in the number of items?

This is not a completely new attribute, in fact OpenBSD has had
something similar for GCC,
but only for function parameters [1].

Some possible use cases for this attribute:
- emit a warning when a buffer pointer is passed to a function if the
length argument is larger than the
actual length of the buffer
- use this attribute in the clang static analyzer to do bounds-checks
- use this attribute to do static analysis in LLVM
- use clang to compile a subset of C, with this attribute, where the
bounds of all pointers are known,
and reject programs that are not part of this subset

I agree very much that the attribute would be useful in all of these contexts.

OpenBSD's attribute gives warnings only when the parameters are
constant, but I'd like to take it further,
and also warn/error when the length is stored in a struct, later read
and passed as parameter to a function.

This kind of symbolic reasoning could be done in the clang static analyzer.

For now I implemented the sema+cg part for the bounds attribute, small
example attached at end of this mail [2]

Since my attribute will do more/other than the OpenBSD attribute, to
avoid clashes, I named my attribute 'bounds' (suggestions welcome), but
I intend to support BSD's __bounded__ attribute too!
Also the __bounded__ attribute uses indexes, rather than names, which I
consider error prone.

These attribute names are so similar that it would inevitably create conflicts. SAL uses '_ecount' and '_bcount' (element and byte count respectively), so you could use something like 'buffer_length' and the like. It's only a few more characters and explicitly states what the attribute does.

Proposed semantics:
struct foo {
unsigned n;
unsigned *x __attribute__((bounds(n)));
};

This will declare that x is an array of 'n' elements, where 'n' is the
field of the same struct.
NULL pointers will have size 0 (and so will buffers of size 0 which may
or may not be null).

In terms of the parsing rules, would you allow 'n' to be declared after the variable it bounds? e.g.:

struct foo {
unsigned *x __attribute__((bounds(n)));
unsigned n;
};

Here 'n' is declared after 'x'.

The same question applies to function parameters.

It will be enforced at 2 places:
- storing values: 'n' must always be accurate and reflect the current
size of 'x' (which means store order matters), with one exception: if
the 'struct foo' variable, or pointer to it is only visible in current
function, and doesn't escape, it is allowed to store in any order.

I understand the motivation behind these restrictions, but my initial reaction is that they would be overly onerous for many contexts. For example, consider:

void bar(struct foo *p) {
  p->x = 0;
  p->n = 0;
}

Wouldn't this be deemed illegal since 'p->x' can potentially be referenced elsewhere? Is the concern mainly that 'p->x' might be referenced in another thread?

- dereferencing x: the index must always be in the range [0, n)
- check that 'n' really is the size of the allocation
- if the size is unknown, it will warn/emit error too

Can you be a little more specific by what you mean by "unknown"?

It seems to me that we need to distinguish between runtime and static checking. With static/compiler analysis, there may be many cases where we cannot prove the index is within the range [0, n), but obviously can prove that it is within this range at runtime.

Note that these checks don't have to be done in clang necesarely, they
can be done in LLVM (in fact I'd prefer, it is easier
to do dataflow analysis there).

The analysis will certainly be easier at the LLVM IR level, although reporting the diagnostics in an intelligible manner (for static checking) will probably lend itself more to doing it in Clang. This is a common tradeoff.

This attribute should work for pointers to constant size arrays, VLAs
[3], pointers coming from malloc (and malloc-like wrappers).

Could you provide an example of using malloc() with these attributes? I'm assuming something like:

   p->x = malloc(n);
   p->n = n;

For function parameters the semantics is as described in [1], except
checks will be done for non-constant sized buffers too.

The emitted LLVM code will use metadata to capture this information, see
[2].

[1]
http://www.openbsd.org/cgi-bin/man.cgi?query=gcc-local&apropos=0&sektion=0&manpath=OpenBSD+Current&arch=i386&format=html

Food for thought. Suppose I have 'struct foo' as defined above, and do the following:

void baz(unsigned **x_ref);

void test() {
   struct foo s;
   s.x = malloc(10);
   s.n = 10;
   baz(&s.x);
}

Should this be flagged with a warning by the compiler, as the invariant between 's.x' and 's.n' might be violated by the call to 'baz'? If I had:

void test_with_assign() {
   struct foo s;
   s.x = malloc(10);
   s.n = 10;
   baz(&s.x);
   s.n = 20;
}

How should I go about enforcing the invariant "storing values: 'n' must always be accurate and reflect the current size of 'x'", either statically or dynamically, as 'x' may have changed after the call to 'baz' and we have no further extent information? I guess what I'm saying is that without knowledge of the implementation of 'baz' this couldn't be enforced by static checking at all (unless we make pessimistic assumptions or require more annotations). I'm not certain what information you would track at runtime to ensure these invariants.

Cheers,
Ted

Hi Edwin,

I find this proposal interesting!

Hi Ted,

Thanks for the patience of reading through my (long) mail.

I have a few comments inline, but I also wanted to draw attention to
Microsoft's "Standard Annotation Language" (or SAL), which were
originally derived to extend buffer overflow checking by the compiler
(and other tools):

http://msdn.microsoft.com/en-us/library/ms235402(VS.80).aspx

While I'm not suggesting that we definitely implement SAL in Clang, I
think it is worth at least considering some of the ideas there, as the
encompass not only the annotation you propose but a more general set
of annotations that can be used for security checking. Moreover,
there is some value having a common set of buffer overflow annotations
that work across various platforms and compilers.

It seems the SAL language doesn't use attributes, but reserved keywords,
it would be trivial to implement a sal.h header once we have the attributes:
#define _ecount(variable) __attribute__((ebounds(variable)))
....

Adding attributes is easier than adding new keywords to clang, isn't it?

Comments inline.

Hi,

I started working on adding a new attribute to Clang, to annotate
pointers with bounds information,
before going further I'd like to hear your opinion whether this is
acceptable for clang (or if we can
modify this proposal for it to be acceptable).

Given a pointer field in a struct, or a pointer parameter, the proposed
__attribute__ would tell
the compiler which field/parameter holds the bounds of the buffer
pointer.

Is this extent in bytes, or in the number of items?

Number of items, because the number of bytes may be target dependent.
However we could have another attribute that is number of bytes.

This is not a completely new attribute, in fact OpenBSD has had
something similar for GCC,
but only for function parameters [1].

Some possible use cases for this attribute:
- emit a warning when a buffer pointer is passed to a function if the
length argument is larger than the
actual length of the buffer
- use this attribute in the clang static analyzer to do bounds-checks
- use this attribute to do static analysis in LLVM
- use clang to compile a subset of C, with this attribute, where the
bounds of all pointers are known,
and reject programs that are not part of this subset

I agree very much that the attribute would be useful in all of these
contexts.

OpenBSD's attribute gives warnings only when the parameters are
constant, but I'd like to take it further,
and also warn/error when the length is stored in a struct, later read
and passed as parameter to a function.

This kind of symbolic reasoning could be done in the clang static
analyzer.

For now I implemented the sema+cg part for the bounds attribute, small
example attached at end of this mail [2]

Since my attribute will do more/other than the OpenBSD attribute, to
avoid clashes, I named my attribute 'bounds' (suggestions welcome), but
I intend to support BSD's __bounded__ attribute too!
Also the __bounded__ attribute uses indexes, rather than names, which I
consider error prone.

These attribute names are so similar that it would inevitably create
conflicts. SAL uses '_ecount' and '_bcount' (element and byte count
respectively), so you could use something like 'buffer_length' and the
like. It's only a few more characters and explicitly states what the
attribute does.

I don't care too much about how the attribute is called, so whatever is
easier to understand works for me. buffer_length or buffer_element_count
would do just fine.

Proposed semantics:
struct foo {
unsigned n;
unsigned *x __attribute__((bounds(n)));
};

This will declare that x is an array of 'n' elements, where 'n' is the
field of the same struct.
NULL pointers will have size 0 (and so will buffers of size 0 which may
or may not be null).

In terms of the parsing rules, would you allow 'n' to be declared
after the variable it bounds? e.g.:

struct foo {
unsigned *x __attribute__((bounds(n)));
unsigned n;
};

Here 'n' is declared after 'x'.

The same question applies to function parameters.

I intend to allow it, currently it doesn't work because I haven't
figured out how to do it yet:
ProcessDeclAttribute is called before 'n' is parsed in your example, and
d->getDeclContext()->lookup() doesn't find it.
Maybe I should add the name to the AST, and look it up later?

It will be enforced at 2 places:
- storing values: 'n' must always be accurate and reflect the current
size of 'x' (which means store order matters), with one exception: if
the 'struct foo' variable, or pointer to it is only visible in current
function, and doesn't escape, it is allowed to store in any order.

I understand the motivation behind these restrictions, but my initial
reaction is that they would be overly onerous for many contexts. For
example, consider:

void bar(struct foo *p) {
p->x = 0;
p->n = 0;
}

Wouldn't this be deemed illegal since 'p->x' can potentially be
referenced elsewhere? Is the concern mainly that 'p->x' might be
referenced in another thread?

Yes, another thread is one concern, but not what I was referrring to: it
may read an out-of-date value. But then another thread should use a lock
before accessing common data, and proving that it does, or doesn't is
out of scope for my analysis.

Rather my concern was another function might be called:
p->x=0;
foobar(p);
p->n=0;

Or:
p->x=0;
foobar(z);
p->n = 0;

And mayalias(p, z), or mustalias(p, z).

For the purposes of this annotation, it is enough if no other function
is called between setting the two fields. Of course this must include
any stores that write to a pointer that mayalias those fields,
which may lead to false positives.

- dereferencing x: the index must always be in the range [0, n)
- check that 'n' really is the size of the allocation
- if the size is unknown, it will warn/emit error too

Can you be a little more specific by what you mean by "unknown"?

p->n = n;
p->x = foo(n);

Unless we see the body of foo(), or unless it has the malloc-like
attribute (which btw should be validated that it really allocs as many
elements as requested), or a bounds attribute on the return value, we
can't assume anything about the pointer it returns.
In this case clang/llvm should emit a warning/error, just as it would if
the wrong value would be stored for n: we don't know the size of the
pointer.

There is also a problem when a function's body is not known, but it has
annotations. In this case we can do two things:
- if the annotations come from a system header, assume they are
correct, and that whoever wrote them knew what they were doing
- if the annotations come from a user header, we could either trust it
(but what if the function's body was compiled without including the header
that defines the annotations?), or give a warning that we can't verify
its body. Again this could be a compiler flag.

Can the clang analyzer check properties accross translation units, and
across library boundaries?
Maybe it could compile to bitcode, and try to link the bitcodes, the
LLVM linker could verify that you don't link functions which have
contradicting metadata annotations,
or missing annotations.

Either way, trusting user annotations on a function without a body is
not good, it is very easy to mistakenly fool the analyzer into thinking
the program is safe, when it is not.

It seems to me that we need to distinguish between runtime and static
checking. With static/compiler analysis, there may be many cases
where we cannot prove the index is within the range [0, n), but
obviously can prove that it is within this range at runtime.

I agree that a compile time check can't prove everything, so I think
there should be a flag: give warning/error only when compiler can
statically prove
you violate the annotation, or give warning/error when compile-time
check can't prove everything, or insert runtime checks.
Inserting runtime checks is not trivial, and not on my plan.

As a start I'd go for implementing the first part, because that is what
I currently need (accept only programs that pass all compile-time
checks, and add
annotations to guide the compile-time checks).

Also giving warnings is enough, I can use -Werror if I want them to be
errors.

Note that these checks don't have to be done in clang necesarely, they
can be done in LLVM (in fact I'd prefer, it is easier
to do dataflow analysis there).

The analysis will certainly be easier at the LLVM IR level, although
reporting the diagnostics in an intelligible manner (for static
checking) will probably lend itself more to doing it in Clang. This
is a common tradeoff.

In either case I'll need this info at the IR level (even if checking is
done in clang). Having it at the IR level also allows other languages to
use this annotation.
Although I can't give so nicely formatted html reports from LLVM IR, I
can give a source:file:line:errormessage-style reports from there, given
that I have debug info,
and I run the analysis early enough that I still have all relevant debug
info.

This attribute should work for pointers to constant size arrays, VLAs
[3], pointers coming from malloc (and malloc-like wrappers).

Could you provide an example of using malloc() with these attributes?
I'm assuming something like:

  p->x = malloc(n);
  p->n = n;

Yes, except it should have a nullpointer check, something like:
p->x = malloc(n);
if (!p->x)
    return;
p->n = n;

Or:
p->n = n;
p->x = malloc(n)
if (!p->x) {
  p->n = 0;
...
}

For function parameters the semantics is as described in [1], except
checks will be done for non-constant sized buffers too.

The emitted LLVM code will use metadata to capture this information, see
[2].

[1]
http://www.openbsd.org/cgi-bin/man.cgi?query=gcc-local&apropos=0&sektion=0&manpath=OpenBSD+Current&arch=i386&format=html

Food for thought. Suppose I have 'struct foo' as defined above, and
do the following:

void baz(unsigned **x_ref);

void test() {
  struct foo s;
  s.x = malloc(10);
  s.n = 10;
  baz(&s.x);
}

Should this be flagged with a warning by the compiler, as the
invariant between 's.x' and 's.n' might be violated by the call to
'baz'? If I had:

void test_with_assign() {
  struct foo s;
  s.x = malloc(10);
  s.n = 10;
  baz(&s.x);
  s.n = 20;
}

How should I go about enforcing the invariant "storing values: 'n'
must always be accurate and reflect the current size of 'x'", either
statically or dynamically, as 'x' may have changed after the call to
'baz' and we have no further extent information? I guess what I'm
saying is that without knowledge of the implementation of 'baz' this
couldn't be enforced by static checking at all (unless we make
pessimistic assumptions or require more annotations). I'm not certain
what information you would track at runtime to ensure these invariants.

It should give a warning that bounds info may be inaccurate both at the
call to baz, and at s.n=20;
Enforcing it at runtime is not among my goals as stated above, but I
think someone may implement a mudflap-like approach for it.

To sum up my immediate goals with this attribute:
- allow annotating system headers (which includes clang's own
lib/Headers headers)
- allow annotating user functions/structs, but the compiler must check
that these annotations are accurate (i.e. the function really never
accesses the buffer outside its bounds,
or it really returns a buffer that has at least as many elements as it
claims)
- emit warnings if the program can't be statically proven to always obey
the attributes
- emit warnings if the program can't be statically proven to obey
bounds, given information from this attribute

Long term goals would be:
- insert runtime checks where that is easy to do (i.e. we have an
attribute on a parameter and an index, but can't statically prove
anything about the index: runtime check index against parameter),
and emit warning otherwise
- do cross-translation-unit checks

Could be done, but not on my TODOlist:
- do runtime checks for all pointers, this could use a mudflap-like or
softbound-like approach

Best regards,
--Edwin