Unions in the Clang Static Analyzer

I was having some trouble with using the static analyzer on code that involves unions. Looking through the code, it seems like this is a known problem and unions aren’t particularly well supported. Personally, I was able to make some progress by just ripping out all the checks for union types, particularly in RegionStore. However, that’s obviously a hacky solution, so to ensure that my check will keep working, I’d like to improve upstream support for unions if possible. Has anybody thought about this problem before/is there already a design for this? Alternatively, have people collected test cases that would benefit from improving union support? I’m sure I’m not the first person to hit this issue, so I’d appreciate any pointers to prior work.

Hello,

I’m not aware of any unfinished work on union support. I’d gladly give a few hints.

RegionStore itself already operates in a union-friendly manner: even though it presents itself as a MemRegion->SVal map, internally it only stores base regions and byte offsets as keys, regardless of how the bytes stored are interpreted.

However, i suspect bigger problems with actually representing symbolic values for fields that are interpreted in multiple ways during the analysis. First of all, reinterpret-casts of basic types are in general modeled poorly. For example, cast from pointer to integer is represented as a special LocAsInteger SVal sub-class, support for which in SValBuilder is worse than for symbols; at the same time cast from integer to pointer is not represented at all, as i mentioned recently in . Then, unions are even more fun because they challenge the concept of SymbolRegionValue, which is a very basic thing. Consider an example: union U { intptr_t x; void *p; } void foo(union U u) { intptr_t X = u.x; // reg_$0<u.x> void *P = u.p; // &SymRegion{reg_$1<u.p>} if (X == P) { … } } The symbolic values for X and P look completely different, the worst thing here probably being that X is NonLoc while P is Loc. Yet, they represent the same thing. How would we unify these values? If you come up with a single symbol to represent them, what would be the type of that symbol? Or would you prefer to constrain these symbols to be equal, and rely on the constraint solver to figure things out? So making a proper union support is going to be a fun research project. Smaller incremental improvements that fix particular cases you encounter are also very much welcome. There’s not necessarily much thought behind the safety-first bail-outs, so if you are seeing significant improvements by removing them, we could consider lifting them; it may also turn out that they’re necessary, and an experienced eye may sometimes understand why, but as long as there is no comprehensive explanation in comments and no testcase coverage, i’d rather encourage you to remove them and see what happens; especially considering that you’ve got access to a good codebase to test these changes upon.

My rough conception on how this would work is that the first access
would create the symbol and determine the type and then accesses
through other members would be equivalent to a cast to the other type.

At the very least, I would like to make sure that symbols are
preserved through union assignment (and passing unions to functions),
if they are only every accessed through the same union member, as well
as simple cases like:

union {
int *intptr;
float *floatptr;
}

Just having that work reliably would probably have fixed 90% of the
trouble I was having here.

Keno

Hi all,

There is one thing that can make the implementation a bit easier :slight_smile:
The reading of a union field that was not stored last may cause undefined behavior. A discussion on this issue can be found on
https://www.securecoding.cert.org/confluence/display/c/EXP39-C.+Do+not+access+a+variable+through+a+pointer+of+an+incompatible+type

I guess we need some language lawyers here. Aaron, can I disturb you?

02.03.2017 20:12, Artem Dergachev via cfe-dev пишет:

Hi all,

There is one thing that can make the implementation a bit easier :slight_smile:
The reading of a union field that was not stored last may cause undefined
behavior. A discussion on this issue can be found on
https://www.securecoding.cert.org/confluence/display/c/EXP39-C.+Do+not+access+a+variable+through+a+pointer+of+an+incompatible+type

I guess we need some language lawyers here. Aaron, can I disturb you?

Certainly -- is there advice in particular you're looking for? I'm not
overly familiar with the static analyzer design, but I'm to help as I
can.

A note below...

02.03.2017 20:12, Artem Dergachev via cfe-dev пишет:

Hello,

I'm not aware of any unfinished work on union support. I'd gladly give a few
hints.

RegionStore itself already operates in a union-friendly manner: even though
it presents itself as a MemRegion->SVal map, internally it only stores base
regions and byte offsets as keys, regardless of how the bytes stored are
interpreted.

However, i suspect bigger problems with actually representing symbolic
values for fields that are interpreted in multiple ways during the analysis.
First of all, reinterpret-casts of basic types are in general modeled
poorly. For example, cast from pointer to integer is represented as a
special LocAsInteger SVal sub-class, support for which in SValBuilder is
worse than for symbols; at the same time cast from integer to pointer is not
represented at all, as i mentioned recently in
[cfe-dev] pointer values, taint propogation .

Then, unions are even more fun because they challenge the concept of
SymbolRegionValue, which is a very basic thing. Consider an example:

  union U { intptr_t x; void *p; }

  void foo(union U u) {
    intptr_t X = u.x; // reg_$0<u.x>
    void *P = u.p; // &SymRegion{reg_$1<u.p>}
    if (X == P) { ... }
  }

The symbolic values for X and P look completely different, the worst thing
here probably being that X is NonLoc while P is Loc. Yet, they represent the
same thing.

They do not represent the same thing -- they designate the same region
of storage, but the underlying objects are not "the same" because they
have independent lifetimes. A union member within its lifetime is
"active", and the static analyzer would need to track which member of
the union is active to be able to reason about the program
appropriately.

~Aaron

Hi Aaron!

Thank you for response. I'm curious if such cases like described:

   union U { intptr_t x; void *p; }

   void foo(union U u) {
     intptr_t X = u.x; // reg_$0<u.x>
     void *P = u.p; // &SymRegion{reg_$1<u.p>}
     if (X == P) { ... }
   }

have defined behavior or not. I have seen many discussions on this but I haven't seen any strict rule on this.

03.03.2017 08:49, Aaron Ballman пишет:

Hi Aaron!

Thank you for response. I'm curious if such cases like described:

  union U { intptr_t x; void *p; }

  void foo(union U u) {
    intptr_t X = u.x; // reg_$0<u.x>

This could be undefined behavior if, prior to calling foo(), u.x was
not the active union member or if there was no active union member.

    void *P = u.p; // &SymRegion{reg_$1<u.p>}

Assuming there was not already UB, this has undefined behavior because
the lifetime of u.p has not begun (despite some compilers supporting,
and even documenting, this undefined behavior by allowing the
expression to act as a duck-type cast, which is a perfectly acceptable
form of undefined behavior).

    if (X == P) { ... }
  }

have defined behavior or not. I have seen many discussions on this but I
haven't seen any strict rule on this.

The strict rules on it are in [class.union] and [basic.life]p1 (to get started).

~Aaron

This behavior also differs between C and C++. For example, C has explicitly:

[N1570, Footnote 95]
If the member used to read the contents of a union object is not the
same as the member last used to store a value in the object, the
appropriate part of the object representation of the value is
reinterpreted as an object representation in the new type as described
in 6.2.6 (a process sometimes called ‘‘type punning’’). This might be
a trap representation.

where object representation is essentially what you'd get if you
memcpyed the value.

Yes, if everything lines up just perfectly, then in C this may be
okay. I say may because it can still be a trap representation, it
could be a misaligned access, there could be unspecified values within
the union member, etc (so it depends on the types and the specific
values), so it is *not* strictly a safe operation to do in C either.

~Aaron

Sorry for bumping an old thread, but I ended up trying to pick up the project that brought this up again after
giving up last year because of the analyzer’s lacking union support. Naturally I ran into the same problems again :/.
It seems like the first step here is to decide what model of unions the analyzer intends to support. I’d be
happy to try fix some things here, but of course those decisions need to be made first.

Ideally we should implement things according to the standard, producing an UndefinedVal() every time the value is undefined. This will also force us to teach our uninitialized value checkers to understand why the value is uninitialized.

However this approach would prevent us from making union-specific undefined values an opt-in thing. And i believe that we might eventually want that because not all developers are worried about undefined behavior; there might be codebases in which it is normal to believe that on a specific platform type-punning is always going to happen, even in C++.

So i guess we’ll have to stick to believing in type punning as much as possible and never produce UndefinedVal()s in ExprEngine but let union-specific checkers decide if we want to emit warnings on any of those.

Sorry for the slightly off topic, but I it always bothered me that we use UndefinedVal to represent uninitialized values (and any kind of undefined value regardless of the reason it is being undefined).

If somebody volunteered to clean this up, we could have the union-specific undefined values as an opt-in thing.
While this would probably be a lot of work, it could be a 0th step towards proper union support (though arguably this might not be the best return on invest).

Yup, we could use tracking a-la taint analysis in checkers for union based undefined values, or in all checkers for undefined values.

I don’t think this is blocking union support, even in the form of technical debt. We may have union support without emitting union-specific warnings at all. Modeling defined-but-weird values consistently is the difficult part here.