C sequence-point analysis

Hi all,

I'm trying to write a tool for detecting undefined behavior in C
regarding sequence points and side effects.

I'm not sure whether it should be a Clang plugin or LLVM run or
something completely different (I'm really new to both Clang and LLVM)
and that's what I need advice with.

For my work, I need to use both AST and alias analysis

Clang plugin:
  +relatively easy to use
  +access to AST with all the needed info EXCEPT alias analysis (right?)
  -no alias analysis, I'd need to write one myself

LLVM run:
  +built-in alias analysis (I'd like to use it, writing my own alias
analysis is not really what my work is all about)
  -I do NOT have access to AST
  -I don't know it at all (but I'm ready to learn it if it shows up to be
the best option)

The big PROBLEM is: a behavior that is undefined in C (and which Clang
has access to) might be (and in my case WILL be) well defined in LLVM
(for example, i=i++; is undefined in C but in LLVM code it will be
already well defined and the result will depend on Clang behavior).

So I thought I could use both, in Clang create a list of rules, for
example "On line L, there is an undefined behavior if X aliases with Y"
and then SOMEHOW dig this info from LLVM run.

Is this a good idea? Is there a way (other than output to file from
Clang and then read it in LLVM) to give this set of rules to LLVM? I'd
also be glad for any other idea, not necessarily including LLVM and
Clang, to solve this problem.

Thanks in advance!

You might want to look into the implementation of Clang’s UBSan feature (-fsanitize=undefined). Like the other sanitizers (address, memory, and thread) UBSan works by adding extra checks into the LLVM IR from the Clang frontend. LLVM compiles those checks as it would any other IR and they are used to verify that the behavior is correct.

The simplest example of this might be bounds checking of a static array (and this is one of the things UBSan can check for) or overflow of a signed integer. The frontend simply adds the checks a programmer might write if they were coding defensively against such a circumstance. Then LLVM just compiles the code as normal and when you execute the program, if you trigger the check to fail, an error message is printed.

For the particular check you’re interested in implementing… I’m not sure exactly how you’ll go about implementing that check or how you’d avoid false positives, but UBSan is probably the first place to look and the ideal place for this to live, if possible.

Hi, David, thanks for your reply!

If I understood UBSan correctly, it just adds run-time checks, so
there is no error informing about possible undefined behavior before
running the program (and even then, it might never occur).

I don't really think there is a reasonable way to check this type of
errors during runtime and even if there was, my goal is to issue a
warning during compilation.

What I can do in Clang is: check for this kind of error if I
completely forget about pointers - I don't do any alias analysis and
just assume no two different variables alias. So if I could just check
in Clang whether they do alias or not, my problem would be solved.

As I don't know about any way I could achieve this, I think about
using LLVM for this (or combination of both, as i wrote: Clang to make
a set of must-not-alias rules and LLVM to check these rules) which
would still warn during compilation.

As far as I understood, sanitizers are of no use here, am I correct?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi, David, thanks for your reply!

If I understood UBSan correctly, it just adds run-time checks, so
there is no error informing about possible undefined behavior before
running the program (and even then, it might never occur).

I don't really think there is a reasonable way to check this type of
errors during runtime and even if there was, my goal is to issue a
warning during compilation.

What I can do in Clang is: check for this kind of error if I
completely forget about pointers - I don't do any alias analysis and
just assume no two different variables alias. So if I could just check
in Clang whether they do alias or not, my problem would be solved.

As I don't know about any way I could achieve this, I think about
using LLVM for this (or combination of both, as i wrote: Clang to make
a set of must-not-alias rules and LLVM to check these rules) which
would still warn during compilation.

As far as I understood, sanitizers are of no use here, am I correct?

If you specifically want static checking, then no, the sanitizers aren't
the right tool.

For static checking you have either Clang warnings and CFG, including the
existing -Wsequence-point warning which catches some cases of unsequenced
operations, or you have the Full Power of the Clang Static Analyzer which
might be where you could get some alias information to add more accuracy
(in the case of reduced false negatives that the Clang warning necessarily
has to have to be cheap enough to run at compile time).

- David

Lukas,
FYI: -fcatch-undefined-behavior and -fsanitize=underfined are synonyms for
the time being. IIRC, the -fcatch-undefined-behavior flag will be
deprecated at some point. I was really just trying to point out that
David and I are suggesting the same thing.

Chad

Hi all,

I'm trying to write a tool for detecting undefined behavior in C
regarding sequence points and side effects.

I'm not sure whether it should be a Clang plugin or LLVM run or
something completely different (I'm really new to both Clang and LLVM)
and that's what I need advice with.

For my work, I need to use both AST and alias analysis

Clang plugin:
        +relatively easy to use
        +access to AST with all the needed info EXCEPT alias analysis
(right?)
        -no alias analysis, I'd need to write one myself

LLVM run:
        +built-in alias analysis (I'd like to use it, writing my own alias
analysis is not really what my work is all about)
        -I do NOT have access to AST
        -I don't know it at all (but I'm ready to learn it if it shows up
to be
the best option)

The big PROBLEM is: a behavior that is undefined in C (and which Clang
has access to) might be (and in my case WILL be) well defined in LLVM
(for example, i=i++; is undefined in C but in LLVM code it will be
already well defined and the result will depend on Clang behavior).

So I thought I could use both, in Clang create a list of rules, for
example "On line L, there is an undefined behavior if X aliases with Y"
and then SOMEHOW dig this info from LLVM run.

Is this a good idea? Is there a way (other than output to file from
Clang and then read it in LLVM) to give this set of rules to LLVM? I'd
also be glad for any other idea, not necessarily including LLVM and
Clang, to solve this problem.

I've thought about this a bit from the point of view of adding a check to
-fsanitize=undefined. We already detect cases where it's "locally" obvious
that the expression will have undefined behavior, for instance, we'll warn
on your example of "i = i++;".

From the ubsan perspective, I think the thing to do here is to form a list

of pairs of subexpressions within the full-expression where, if the two
subexpressions refer to the same object, we will have unsequenced accesses
(in the presence of ?:, &&, and ||, we'll need to be careful to only check
these if the relevant subexpression was actually evaluated). Then, perhaps
at the end of the full-expression, we emit a check that no such pair
evaluated to the same address.

Building the set of pairs is a little tricky, but the code implementing
-Wunsequenced can probably be reused to form this set.

That said, we have a representational weakness in LLVM here: we have no way
to express that the language semantics allow us to freely reorder two
memory operations. If we could emit somehow (perhaps as metadata) the
information that we have about full-expressions, we could use that as input
to an LLVM alias analysis -- then we could use the same representation both
to drive a check and to drive optimizations. In the long term, that's
probably the superior approach.

(Not-fully-baked ideas follow...) We could imagine this working something
like the following. Given:

  (a++, b) = c ? d++ : e

We divide it into separately-sequenced portions:

full expression: seq 0
a++: seq 1 for load, seq 2 for store
b: seq 3
c: seq 4
d++: seq 5 for load, seq 6 for store
e: seq 7
assignment: seq 8

We know that:

  seq 2 and seq 3 are sequenced after seq 1
  seq 4 and seq 5 are sequenced after seq 3
  seq 6 is sequenced after seq 5
  seq 8 is sequenced after seq 3, 5, and 7
  seq 1-8 are nested within seq 0 (so are by default not sequenced with
other things inside seq 0)

We could then express this as metadata as:

  %a = load @a, !seq !seq1
  %a.1 = add %a, 1
  store @a, %a.1, !seq !seq2
  ; ...

!seq0 = !{}
!seq1 = !{seq0} ; parent is 0
!seq2 = !{seq0, !{seq1}} ; parent is 0, sequenced after 1
!seq3 = !{seq0, !{seq1}} ; parent is 0, sequenced after 1
; ...

... with the semantics that memory operations A and B are known not to
alias if at least one is a load, and they have the same parent, and neither
is (transitively) sequenced before the other.

Hi, Richard,

Thanks for confirming my thoughts! That's what I meant, make a set of
rules using Clang and then check them by LLVM.

As I said, I am relatively new to Clang and have never used LLVM
before. So, what's the best way to do this? Do I make a Clang plugin?
I can do that (I already did a part of the analyzer, however, without
any alias analysis; and I never did any change to the AST). And if
yes, how do I send some (meta)data to LLVM? The only way I can think
of is to save them to a file. And then, do I run a LLVM pass? Just
asking if I understood correctly. My problem is now more about using
Clang together with LLVM (as you confirmed my approach was a good
idea, thanks for that again!).

Now, if I understood correctly, Clang Static Analyzer is just a set of
libraries in Clang - which means, it's what I already tried while
doing the Clang plugin. Am I correct?

If yes, I don't know of any way I could do static analysis in Clang
(given I am not willing to write one myself) and would need to somehow
send the data to LLVM (Do some changes in AST? Save to external file?
I don't know, never done this before.) and do alias analysis in LLVM,
as I suggested in the original message.

Did I miss something? Particularly some alias analysis in Clang?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi, Richard,

Thanks for confirming my thoughts! That's what I meant, make a set of
rules using Clang and then check them by LLVM.

As I said, I am relatively new to Clang and have never used LLVM
before. So, what's the best way to do this? Do I make a Clang plugin?
I can do that (I already did a part of the analyzer, however, without
any alias analysis; and I never did any change to the AST). And if
yes, how do I send some (meta)data to LLVM? The only way I can think
of is to save them to a file. And then, do I run a LLVM pass? Just
asking if I understood correctly. My problem is now more about using
Clang together with LLVM (as you confirmed my approach was a good
idea, thanks for that again!).

To clarify, I'm suggesting that you teach Clang's IR generation to emit
code to catch these bugs when the compiled program is run. That shouldn't
require any alias analysis or similar at the point where you insert the
checks (if you wanted to be a bit cleverer, you could emit metadata from
Clang and use an LLVM pass to emit the checks, and avoid emitting checks
for cases where alias analysis could prove they're unnecessary, but the
optimizer should remove the redundant checks anyway, so such cleverness
should hopefully be unnecessary).

We have deliberately avoided having any checks in Clang that operate by
inspecting the IR post-optimization, since such checks tend to be very
unstable across changes to the compiler, and that generates pain for users
of the checks (a seemingly-unrelated change to the code could cause the
optimizer to make a different decision and expose a problem, leading to an
unexpected build break for a -Werror build).

Your last message confused me a lot. Are you suggesting inserting
run-time checks into the AST? What about the whole metadata-thing from
your original message? I thought those were metadata for LLVM so I
could check the must-not-alias rules defined in them.

Run-time check is not my goal, I am trying to write a static check (of
course, false positives will happen). I understand your warning about
changes in the compiler, but anyway, is it possible to check this
statically (using the set of rules I am talking about, and maybe the
same rules you wrote about, but as I say, I'm confused about it)?

A short example so I make it clear:

Let us have this input:

int i = 42;
int *j = &i;
i = (*j)++;

In Clang, from the AST, I can make the following set (with only one
member in this example) of rules:

On the row 3, there is undefined behavior IF i may alias with j.

Now, can I "somehow" send this to LLVM and in LLVM, check whether i
and j do or do not alias?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Your last message confused me a lot. Are you suggesting inserting
run-time checks into the AST? What about the whole metadata-thing from
your original message? I thought those were metadata for LLVM so I
could check the must-not-alias rules defined in them.

The idea I was expressing there was that emitting the sequencing rules as
metadata would allow LLVM to optimize better by using them as a source of
aliasing information.

Run-time check is not my goal, I am trying to write a static check (of

course, false positives will happen). I understand your warning about
changes in the compiler, but anyway, is it possible to check this
statically (using the set of rules I am talking about, and maybe the
same rules you wrote about, but as I say, I'm confused about it)?

A short example so I make it clear:

Let us have this input:

int i = 42;
int *j = &i;
i = (*j)++;

In Clang, from the AST, I can make the following set (with only one
member in this example) of rules:

On the row 3, there is undefined behavior IF i may alias with j.

Now, can I "somehow" send this to LLVM and in LLVM, check whether i
and j do or do not alias?

Yes, this is possible. If you want to write a check that can be pushed into
upstream Clang, I would not advise doing it this way, because (as
mentioned) we have gone to lengths to avoid such checks. Instead, I'd
suggest writing a static analysis checker and/or a sanitizer (insert
instrumentation to detect problems when the compiled code is run).

If you don't care about pushing this upstream, this is something you can
reasonably check in a custom LLVM pass, with some custom Clang code to emit
metadata to track the sequencing rules. You would attach the metadata to
some of the instructions emitted by Clang (attaching to loads and stores
probably makes most sense here), then write a custom LLVM pass that looks
at the metadata and determines whether unsequenced modifications have
happened through aliasing pointers.

I would expect you'd get significantly better results (fewer false
negatives) through runtime instrumentation, though.