Even more clang ideas

Geeze, after I sent my message I came up with a few more...

Here are the -W warnings I usually try to run with:

-Wmissing-prototypes -Wreturn-type -Wformat -Wunused-parameter -Wunused-variable -Wunused-value -Wuninitialized -Wshadow -Wsign-compare -Wshorten-64-to-32 -Wextra -Winit-self -Wsequence-point -Wswitch-default -Wstrict-aliasing=2 -Wundef -Wpointer-arith -Wbad-function-cast -Wstrict-prototypes -Wmissing-declarations -Wredundant-decls -Wunreachable-code -Wcast-align -Wdiscard-qual -Wcast-qual -Wstrict-overflow=5

"32 <-> 64 bit issues / potential problems".

The warning flag '-Wshorten-64-to-32' is a good start. '-Wconversion' is also useful when switching to 64 bits. clang can leverage 'meta-information' against the problem too, like using the knowledge that NS*Integer can switch sizes between 32 and 64 bits, but a 32-bit int assigned/passed to an NSInteger won't scale the same way. clang can help catch these potential gotchas early, before they become difficult to undo/fix, since these types of things can go silently undetected at default compiler warning levels. '-Wcast-align' is also something that can be checked/validated.

64 -> 32 bit issues are definitely a good source of checks, especially for code that must run on different archs. Your example about NSInteger is an especially good one.
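As a rough illustration of the narrowing that '-Wshorten-64-to-32' is meant to flag, here is a minimal sketch in plain C. The type name wide_uint_t and the three helper functions are hypothetical; wide_uint_t only stands in for an NSUInteger-like type as it would be on a 64-bit target. Unsigned types are used so that the truncation is well-defined C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an NSUInteger-like type on a 64-bit target. */
typedef uint64_t wide_uint_t;

/* True when v survives a round trip through a 32-bit variable. */
bool fits_in_32_bits(wide_uint_t v) {
    return v <= UINT32_MAX;
}

/* The silent narrowing the warning is about: the upper 32 bits are dropped.
   Unsigned conversion is well-defined in C (reduction modulo 2^32). */
uint32_t shorten_to_32(wide_uint_t v) {
    return (uint32_t)v;
}

/* A defensive variant that refuses to lose bits in debug builds. */
uint32_t checked_shorten_to_32(wide_uint_t v) {
    assert(fits_in_32_bits(v));
    return (uint32_t)v;
}
```

A value such as 2^32 + 1 passes through shorten_to_32 without any complaint at default warning levels and comes out as 1, which is the kind of silent loss described above.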

I actually recently implemented a simple check on this topic, concerning the use of CFNumberCreate. If one isn't careful with the use of this function, on a 64-bit architecture CFNumberCreate can actually fail to initialize some of the bits of the freshly created CFNumber because the requested integer size is greater than the size of the integer provided by the programmer. I think a lot of little simple checks like these would (a) be relatively easy to implement and (b) potentially catch a lot of subtle bugs.
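The shape of such a check can be sketched in plain C. This is not the CoreFoundation API and not the analyzer's implementation: num_kind, num_kind_size and storage_matches_kind are hypothetical names that only model the idea of comparing the size implied by a type tag with the size of the storage the caller actually passed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type tags, loosely modeled on the idea of a number-type
   argument; these are NOT the CoreFoundation constants. */
typedef enum { NUM_KIND_SINT32, NUM_KIND_SINT64 } num_kind;

/* Number of bytes a constructor would read for each tag. */
size_t num_kind_size(num_kind kind) {
    switch (kind) {
    case NUM_KIND_SINT32: return sizeof(int32_t);
    case NUM_KIND_SINT64: return sizeof(int64_t);
    }
    assert(!"unknown num_kind");
    return 0;
}

/* The check itself: the storage handed in must be at least as wide as the
   tag claims, otherwise the constructor reads bytes the caller never set. */
bool storage_matches_kind(num_kind kind, size_t storage_size) {
    return storage_size >= num_kind_size(kind);
}
```

Passing a 64-bit tag together with the address of a 32-bit int is the mismatch described above; storage_matches_kind(NUM_KIND_SINT64, sizeof(int32_t)) reports it as false.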

The design of the static analysis library is to help make the implementation of these checks relatively straightforward without any deep program analysis knowledge. I myself won't be able to implement all of these checks, and hopefully as the tool evolves others will feel comfortable in implementing some of these checks as well.

"Cross architecture issues"

I can't think of any off the top of my head, but collecting patterns of possible cross-architecture issues would be helpful.

I think this basically relates to the previous issue: APIs and type definitions can have different invariants or properties on different archs. Some of these invariants could be checked readily with static analysis.

"Possible restrict and const qualification recommendations (and validation)"

This really requires deep inter-procedure analysis, but if it's available, then clang might be able to reason certain things about the inter-procedure effects and possibilities. Const can sometimes lead to better code generation, but the real wins are usually possible with restrict. If deep inspection is possible, then some degree of validation of the use of a restrict qualified pointer is probably possible as well.

I don't have a reference off the top of my head, but I do know there was some research on doing exactly what you suggest. Accurately inferring const and restrict may require a fairly precise points-to analysis, which gets tricky with all the messiness of C. That said, this is something that could potentially be done, at least in some localized cases.
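For readers less familiar with restrict, a small sketch of what the qualifier promises and what "validation" would have to establish. The function names are hypothetical, and the overlap test is a runtime stand-in for a property that a static analysis would try to prove at compile time. Comparing addresses through uintptr_t assumes a flat address space, which holds on common targets but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Runtime stand-in for what a validator would try to prove statically:
   the byte ranges [a, a+len_a) and [b, b+len_b) do not intersect.
   Assumes a flat address space for the uintptr_t comparison. */
bool ranges_overlap(const void *a, size_t len_a, const void *b, size_t len_b) {
    uintptr_t pa = (uintptr_t)a;
    uintptr_t pb = (uintptr_t)b;
    return pa < pb + len_b && pb < pa + len_a;
}

/* restrict promises the compiler that dst and src never alias, so it may
   keep loaded values in registers or vectorize the loop. */
void scale_into(size_t n, float *restrict dst, const float *restrict src, float k) {
    assert(!ranges_overlap(dst, n * sizeof *dst, src, n * sizeof *src));
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i] * k;
    }
}
```

Calling scale_into with overlapping buffers breaks the promise made by restrict and is undefined behavior; the assert only catches it in debug builds, which is why a compile-time check would be more valuable.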

Actually, since you're obviously deep down in the guts of the grammar and compiler interactions, maybe you can offer an opinion on the following: Historically Objective-C was effectively nothing more than a fancy pre-processor front end to the C compiler. In fact, there was often a trivial one-to-one mapping from an Objective-C statement to a plain C statement.

@interface MyClass : ParentClass { char *buffer; } @end
becomes something like
typedef struct { /* ParentClass definitions */ char *buffer; } MyClass;

When you get right down to it, there's nothing special about a 'class', it's literally nothing more than a struct.

Now, object oriented programming is built on polymorphic abilities: each class inherits all of its superclasses' methods/ivars/etc. So if we have the following:

@interface MyClass : ParentClass { char *buffer; } @end
@interface MyMutableClass : MyClass { int mutationCount; } @end

In code, we refer to an instantiation of one of these objects with:

MyClass *myClassObj;
MyMutableClass *myMutableClassObj;

OO programming (and objc) allows for the following to take place:

myClassObj = myMutableClassObj;

because MyMutableClass is a subclass of MyClass. No problem, right?

I'm of the opinion that this is actually a problem. The problem has nothing to do with the (correct) OO design paradigm or any particular conceptual fault, but it has to do with C.

Objective-C was designed a long time ago, in the pre-ANSI K&R days as a matter of fact. Such assignments were possible under older K&R and (I think, but may be wrong) ANSI rules. It was frowned upon, wasn't terribly good style, but you could do it and for most architectures this isn't a problem because the compiler essentially treated all pointers as equivalents. Of course, the compiler is free to perform pointer alignment due to the assignment, but this never happened in practice (at least not for any of the main architectures that are still with us today).

The @interface definitions are literally like the following statements:

typedef struct { char *buffer; } MyClass;
typedef struct { char *buffer; int mutationCount; } MyMutableClass;

Or, if we really wanted to, we could drop the typedef and declare it as any other struct. Pointers to 'instantiated objects' in code are either identical to their Objective-C counterparts if typedefs are used, or something like the following if structs are used:

struct MyClass *myClassObj;
struct MyMutableClass *myMutableClassObj;

Fast forward to C99 and consider the same statement:

myClassObj = myMutableClassObj;

In C99, this statement is expressly forbidden as 'pointers of one type may not point to a different type (except void)'. Only pointers of the same type may alias each other. This is the 'strict aliasing' rule(s).
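To see the plain C side of this in isolation: with two unrelated struct types, the assignment above needs a cast before a C compiler accepts it without a diagnostic. One representation that C itself does sanction is to embed the parent struct as the first member, because C99 guarantees that a pointer to a struct, suitably converted, points to its initial member. The sketch below assumes that layout; it is an illustration only, not a description of how GCC or clang lay out Objective-C objects, and as_my_class is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C rendering of the two classes from the example above. */
struct MyClass {
    char *buffer;
};

struct MyMutableClass {
    struct MyClass base;   /* superclass ivars come first */
    int mutationCount;
};

/* Upcast without a type-punning cast: take the address of the first
   member, which has the same address as the enclosing struct. */
struct MyClass *as_my_class(struct MyMutableClass *obj) {
    assert(obj != NULL);
    return &obj->base;
}
```

With this layout, myClassObj = as_my_class(myMutableClassObj) yields a pointer whose type matches the object it points at, so no aliasing question arises for the parent's members.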

So... there's a bit of a conflict. Such pointer aliasing is permitted under the concepts of object oriented programming, but it is expressly forbidden under C99 rules. From a purely compiler perspective, when you prototype a method as

- (NSArray *)someMethod;

you literally mean that you are returning a type of NSArray *, and not any of its subclasses.

I'm not certain that the C99 rules apply in this way to Objective-C types, since the Objective-C type system is completely outside the scope of C99. The fact that Objective-C was originally implemented as a layer above C just means the compiler had less information to go on. One can easily get around the problem you mentioned by having the C implementation of Objective-C just use void* for all Objective-C object references (or, as you point out later, simply disable strict aliasing rules for Objective-C code).

It is, in fact, an error to return a NSMutableArray in a method that's prototyped to return an NSArray due to C pointer aliasing rules. The 'id' type is the closest thing that Objective-C has to a 'generic object pointer type', so if a method wants to return a pointer to an object of more than one type, it really should declare the return type as 'id'. Again, this is due to the C pointer aliasing rules rather than any OO conceptual rules.

Again, I'm not certain how much C99's aliasing rules apply to Objective-C object references. Objective-C doesn't have a formal specification akin to C99, so the specification (if you want to call it that) is whatever the current compiler implementation allows.

There are others on this list that can comment on this particular issue with much more authority than myself.

It really starts to become a problem when you turn on the optimizer and it begins to do optimizations that are dependent on this aliasing invariant. When I realized that this could actually be a serious, very subtle problem, and started digging I found evidence to support it. For example, '-fstrict-aliasing' is disabled on Apple's GCC for ObjC code.

Interesting. I think this illustrates my point that the strict aliasing rules in C99 don't really apply to Objective-C, at least in the implementation provided by GCC. This is clearly a deliberate choice, likely to avoid the issues you mentioned.

Using '-fast' on .m files causes the compiler to emit 'cc1obj: warning: command line option "-fast" is valid for C/C++ but not for ObjC'

I'm of the opinion that Objective-C and C are so closely linked together that one cannot simply say 'Pointers cannot alias different types. Except for ObjC class type pointers, which can alias any of their subclasses.'

It gets even more interesting when one considers categories, which allow one to implement essentially "open types" in Objective-C. The highly dynamic nature of Objective-C allows one to change the methods implemented by a class at runtime, which can essentially change the subtyping relationships between objects at runtime. In that sense, the class hierarchy is only a set of guidelines for subtyping relationships between Objective-C objects. From that observation, I'm not certain that any conservative strict aliasing assumptions could be made by the compiler concerning Objective-C objects.

It's just not possible from a practical standpoint, ESPECIALLY in something like GCC where it's pragmatically impossible to separate out the two languages.

I'm not an expert on the GCC IR where the optimizer does much of its work, but the GCC frontend has a notion of the Objective-C type system, and uses that information to issue warnings in some cases. For example:

#include <Cocoa/Cocoa.h>

void foo() {
   NSString* s;
   NSObject* o;

   o = s;
   s = o;
}

gcc emits a warning for the assignment of 'o' to 's' because the object referred to by 'o' may not be an instance of NSString or one of its subclasses:

/tmp/t.m:8: warning: assignment from distinct Objective-C type

If one could use the Objective-C class hierarchy information to make conservative assumptions for use with strict aliasing optimizations, I'm not certain why you think gcc couldn't use that information. The point I made above, however, means that even having the class hierarchy information available may not be enough to make such assumptions.

- Ted

It really starts to become a problem when you turn on the optimizer
and it begins to do optimizations that are dependent on this
aliasing invariant. When I realized that this could actually be a
serious, very subtle problem, and started digging I found evidence
to support it. For example, '-fstrict-aliasing' is disabled on
Apple's GCC for ObjC code.

Interesting. I think this illustrates my point that the strict
aliasing rules in C99 don't really apply to Objective-C, at least in
the implementation provided by GCC. This is clearly a deliberate
choice, likely to avoid the issues you mentioned.

-fstrict-aliasing being off by default has more to do with its implementation in GCC than it does the objc type system. I don't think that any of the things John is talking about affect -fstrict-aliasing.

When llvm-gcc (and eventually clang) supports type based alias analysis, it will almost certainly be on by default, even on the mac.

Using '-fast' on .m files causes the compiler to emit 'cc1obj:
warning: command line option "-fast" is valid for C/C++ but not for
ObjC'

-fast is basically the "optimize for SPEC" flag. There are no objc programs in spec.

-Chris

It is, in fact, an error to return a NSMutableArray in a method
that's prototyped to return an NSArray due to C pointer aliasing
rules. The 'id' type is the closest thing that Objective-C has to a
'generic object pointer type', so if a method wants to return a
pointer to an object of more than one type, it really should declare
the return type as 'id'. Again, this is due to the C pointer
aliasing rules rather than any OO conceptual rules.

Again, I'm not certain how much C99's aliasing rules apply to
Objective-C object references. Objective-C doesn't have a formal
specification akin to C99, so the specification (if you want to call
it that) is whatever the current compiler implementation allows.

The current specification of Objective-C is "The Objective-C 2.0 Programming Language" at <http://developer.apple.com/documentation/Cocoa/Conceptual/ObjectiveC/>.

John is also incorrect about the above: It is *not* an error, in Objective-C, to return an instance of a subclass from a method prototyped as returning an instance of the superclass.

Objective-C is its own language that extends C99, not a preprocessor for C99, and this is one of the extensions that Objective-C adds. In fact, in Objective-C it is not possible to say "this method returns an instance of specifically this class and no other class" -- you can only say "this method returns an instance of this class or any subclass."

That is by design, and is not just an artifact of its original mid-1980s implementation as a preprocessor.

   -- Chris

John may be thinking of the fact that most init methods and such
return id, for example -[NSArray initWithObjects:]. This isn't
because it would be incorrect to return (NSArray *), it's because
subclassers would have to redeclare every init method, or else the
compiler would issue a warning for lines such as

NSMutableArray *mutableArray = [[NSMutableArray alloc]
initWithObjects:obj1, obj2, nil];

Actually, since the +alloc method returns id (for the same reason),
the compiler cannot even tell that -[NSMutableArray initWithObjects:]
should be preferred over -[NSArray initWithObjects:] here.

It'd be nice if Objective-C had return types that meant "instance of
receiver" and "instance of receiver's class".

-Ken

It is, in fact, an error to return a NSMutableArray in a method
that’s prototyped to return an NSArray due to C pointer aliasing
rules. The ‘id’ type is the closest thing that Objective-C has to a
‘generic object pointer type’, so if a method wants to return a
pointer to an object of more than one type, it really should declare
the return type as ‘id’. Again, this is due to the C pointer
aliasing rules rather than any OO conceptual rules.

Again, I’m not certain how much C99’s aliasing rules apply to
Objective-C object references. Objective-C doesn’t have a formal
specification akin to C99, so the specification (if you want to call
it that) is whatever the current compiler implementation allows.

The current specification of Objective-C is “The Objective-C 2.0 Programming Language” at <http://developer.apple.com/documentation/Cocoa/Conceptual/ObjectiveC/>.

Thanks Chris! What I meant by a “formal” specification is something with the detail to write a compiler and runtime with. The document you reference (while excellent) doesn’t provide that kind of detail. In our implementation of Objective-C support in Clang, we are really leaning on the knowledge of specific people who either had a hand in conceiving the language or implemented its support in gcc.

John is also incorrect about the above: It is not an error, in Objective-C, to return an instance of a subclass from a method prototyped as returning an instance of the superclass.

I don’t think I addressed that point directly, and I’m glad you did so. This idea follows from standard type theory for functions: the return type is allowed to be a covariant type.

Objective-C is its own language that extends C99, not a preprocessor for C99, and this is one of the extensions that Objective-C adds.

Exactly.

In fact, in Objective-C it is not possible to say “this method returns an instance of specifically this class and no other class” – you can only say “this method returns an instance of this class or any subclass.”

Excellent point. I don’t think most standard OO languages allow you to express the first statement either. (others please chime in on this one if you know better!)

It is, in fact, an error to return a NSMutableArray in a method
that’s prototyped to return an NSArray due to C pointer aliasing
rules. The ‘id’ type is the closest thing that Objective-C has to a
‘generic object pointer type’, so if a method wants to return a
pointer to an object of more than one type, it really should declare
the return type as ‘id’.

John is also incorrect about the above: It is not an error, in
Objective-C, to return an instance of a subclass from a method
prototyped as returning an instance of the superclass.

John may be thinking of the fact that most init methods and such
return id, for example -[NSArray initWithObjects:]. This isn’t
because it would be incorrect to return (NSArray *), it’s because
subclassers would have to redeclare every init method, or else the
compiler would issue a warning for lines such as

NSMutableArray *mutableArray = [[NSMutableArray alloc]
initWithObjects:obj1, obj2, nil];

This limitation also exists in Java with its own collection classes (collections use type erasure). Lists, Maps, etc., all contain objects that subclass Object (the root of the Object hierarchy). Clients must use downcasts when retrieving objects from collections. Generics in Java help reduce the typing for this, but it's just syntactic sugar (the compiler inserts the downcast checks):

http://java.sun.com/j2se/1.5.0/docs/guide/language/generics.html

Actually, since the +alloc method returns id (for the same reason),
the compiler cannot even tell that -[NSMutableArray initWithObjects:]
should be preferred over -[NSArray initWithObjects:] here.

It’d be nice if Objective-C had return types that meant “instance of
receiver” and “instance of receiver’s class”.

I agree that this kind of polymorphism would be nice to have in Objective-C. In many cases this kind of polymorphism can be implemented in C++ using templates.

Cool. In that case, will strict-aliasing support for Objective-C take into consideration the subtyping relationships in Objective-C? The original point with regards to strict-aliasing really concerns whether or not C99 strict-aliasing rules apply verbatim to pointers to Objective-C objects.

Almost certainly not. I'm not an objc expert, but my understanding is that people often define proxy objects that don't necessarily derive from the specified base class. The ObjC type system is mostly a "recommendation" not a "requirement" afaik.

-Chris

It really starts to become a problem when you turn on the optimizer
and it begins to do optimizations that are dependent on this
aliasing invariant. When I realized that this could actually be a
serious, very subtle problem, and started digging I found evidence
to support it. For example, '-fstrict-aliasing' is disabled on
Apple's GCC for ObjC code.

-fstrict-aliasing is disabled by default for all languages, not just ObjC, in Apple GCC.

Sorry for the response delay...

It is, in fact, an error to return a NSMutableArray in a method
that's prototyped to return an NSArray due to C pointer aliasing
rules. The 'id' type is the closest thing that Objective-C has to a
'generic object pointer type', so if a method wants to return a
pointer to an object of more than one type, it really should declare
the return type as 'id'. Again, this is due to the C pointer
aliasing rules rather than any OO conceptual rules.

Again, I'm not certain how much C99's aliasing rules apply to
Objective-C object references. Objective-C doesn't have a formal
specification akin to C99, so the specification (if you want to call
it that) is whatever the current compiler implementation allows.

The current specification of Objective-C is "The Objective-C 2.0 Programming Language" at <http://developer.apple.com/documentation/Cocoa/Conceptual/ObjectiveC/>.

John is also incorrect about the above: It is *not* an error, in Objective-C, to return an instance of a subclass from a method prototyped as returning an instance of the superclass.

Yes, this traces its roots back to Brad Cox's ideas and the original StepStone objc compiler. This was all done in K&R C days, pre-ANSI even, and pointer rules were an awful lot looser back then.

Objective-C is its own language that extends C99, not a preprocessor for C99, and this is one of the extensions that Objective-C adds.

True enough, but Objective-C isn't exactly formally defined. As someone here put it, it's pretty much "whatever the compiler happens to compile."

While it's been a long time since I've hacked on GCC internals or ported it to a new architecture, it used to be that objc was really nothing more than an 'integrated preprocessor to GCC'. Instead of pre-processing the results and rewriting objc statements into their equivalent C statements, it just directly creates the internal tree representations, essentially 'rewriting' on the fly.

An example of an older objc pre-processor: ftp://ftp.wustl.edu/pub/aminet/dev/c/OCT-1.99.lha It's interesting to note that this particular pre-processor seems to be free from influence of any other objc front end, and allegedly a fairly close translation of 'Object Oriented Programming: An Evolutionary Approach', which laid out the bulk of objc (circa '86).

From a high level view, it would seem that GCC still uses the object == struct representation internally, effectively turning all classes into a struct, with each ivar a member of that struct (and inheriting all of the parent's ivars). For example:

#import <Foundation/NSObject.h>

@interface MyObject : NSObject { @public int count; void *ptr; }
@end

@implementation MyObject
-(int)count { return(count); }
-(void)setPtr:(void *)newPtr { ptr = newPtr; count++; }
@end

int main(int argc, char *argv[]) {
   MyObject *obj = NULL;

   obj = [[MyObject alloc] init];

   int x = [obj count];
   [obj setPtr:NULL];
   x = [obj count];

   int y = obj->count;
   void *optr = obj->ptr;

   obj->ptr = NULL;

   return(0);
}

When we look at the gimple representation of this (gcc -fdump-tree-gimple-all -c FILE.m), it looks like it's still the same basic pre-processor infrastructure in place:

;; Function -[MyObject count] (-[MyObject count])

-[MyObject count] (selfD.2219, _cmdD.2220)
{
   intD.0 D.2227;

   D.2227 = selfD.2219->countD.2206;
   return D.2227;
}

;; Function main (main)

main (argcD.2245, argvD.2246)
{
   // snip
   struct MyObject * objD.2249;
   // snip
   D.2273 = OBJ_TYPE_REF(objc_msgSend_Fast;D.2270->0) (D.2270, _OBJC_SELECTOR_REFERENCES_1.5D.2272);
   objD.2249 = (struct MyObject *) D.2273;
   // snip
   yD.2256 = objD.2249->countD.2206;
   optrD.2260 = objD.2249->ptrD.2207;
   objD.2249->ptrD.2207 = 0B;
   //snip

Or, in other words, pretty much an objc -> c preprocessed representation.

C and ObjC are deeply intertwined. Because ObjC is built on top of C, the rules and quirks of C bubble up to ObjC, but the reverse is not necessarily true. In the old 'preprocessor' ObjC model, this wasn't a problem: there was always a 1:1 translation. Whatever C code emerged on the other side, you could analyze it in the context of standard C rules. This model was chosen for its obvious simplicity and it allowed for 'strict superset of C' compatibility.

In fact, in Objective-C it is not possible to say "this method returns an instance of specifically this class and no other class" -- you can only say "this method returns an instance of this class or any subclass."

I disagree. Such things were possible in older compilers because of pointer rules. Type punning was always frowned upon, and is now strictly forbidden in C99. Except for 'union's, C doesn't really provide a means for multi-type pointer representations. It's not really a question of what proper OO design paradigm is, it's a question of "where the rubber meets the road": How do you represent the concept in C.
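For reference, the kind of type punning at issue, and the alternatives that stay within the C rules, can be sketched as follows. The function names are hypothetical and the code assumes that float and uint32_t have the same size. The form that strict aliasing disallows is reading the float through a cast pointer, as in *(uint32_t *)&f; the two functions below avoid it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a float's bytes by copying them. No pointer of the wrong
   type is ever dereferenced, so the aliasing rules are not involved. */
uint32_t float_bits_memcpy(float f) {
    uint32_t bits;
    assert(sizeof bits == sizeof f);   /* assumption: 32-bit float */
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* The same reinterpretation through a union, the one multi-type
   representation that C provides directly. */
uint32_t float_bits_union(float f) {
    union { float f; uint32_t u; } pun;
    pun.f = f;
    return pun.u;
}
```

Both functions return the same bit pattern for the same input; which one a compiler optimizes better is a separate question from which one is permitted.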

Take the following Objective-C code:

-(MyObject *)who { return(self); }

The gimple representation is:

  ;; Function -[MyObject who] (-[MyObject who])

  -[MyObject who] (selfD.2246, _cmdD.2247)
  {
    struct MyObject * D.2251;

    D.2251 = selfD.2246;
    return D.2251;
  }

So, when you say you are returning - (NSArray *)arrayByDoingSomething, and we're using the 'whatever the compiler compiles' standard, you are literally saying you are returning a 'struct NSArray *'. Under C99 rules, the meaning of this is very, very clear and unambiguous: You return a pointer to a struct NSArray and a struct NSArray ONLY. The code that calls it is almost certainly going to be something like 'NSArray *array = [obj arrayByDoingSomething];', which again turns into a 'struct NSArray *'.

To return anything other than a struct NSArray * breaks C99 type aliasing rules. And because ObjC is so tightly integrated with C, and the fact that the bulk of the GCC compiler is geared towards C, one can not simply dismiss this as 'Well, that's how ObjC defines things.' Because you have a pointer to an object, you must live within C99's pointer rules. And more to the point, you need to deal with the fact that people writing code generation parts of the compiler are going to implicitly assume that C99's pointer aliasing rules are being followed and write code that depends on that invariant being true.

That is by design, and is not just an artifact of its original mid-1980s implementation as a preprocessor.

True enough. It was enabled by the fact that mid-1980's C compilers allowed for such 'free-for-all' type punning. But C moved on, and explicitly made such type punning forbidden. It is far, FAR from a trivial matter to untangle the two.

http://mail-index.netbsd.org/tech-kern/2003/08/11/0001.html
http://www.mail-archive.com/list@epicsol.org/msg00489.html

So, my question isn't really what the Objective-C 2.0 manual says (which is very little, and definitely not 'standards definition' quality), it's more about 'how do you do it?' (in the context of a modern, C99 optimizing compiler).

It doesn't matter how the concept is represented in C, because modern Objective-C compilers are *not* simple translators from Objective-C to C; they must treat the language as a language in its own right, because it does add additional semantics atop C.

First and foremost an Objective-C compiler must understand and preserve the semantics of the Objective-C language -- which includes proper handling of subtype relationships within the class hierarchy.

   -- Chris

Type punning was always frowned upon, and is now
strictly forbidden in C99. Except for 'union's, C doesn't really
provide a means for multi-type pointer representations. It's not
really a question of what proper OO design paradigm is, it's a
question of "where the rubber meets the road": How do you represent
the concept in C.

We need not represent it in C. We only need to be able to represent it in an AST for clang. For gcc, we only need to be able to represent it in trees/gimple.

To return anything other than a struct NSArray * breaks C99 type
aliasing rules. And because ObjC is so tightly integrated with C, and
the fact that the bulk of the GCC compiler is geared towards C, one
can not simply dismiss this as 'Well, that's how ObjC defines
things.'

Actually, one can. Please see the notion of alias sets in gcc for example.

So, my question isn't really what the Objective-C 2.0 manual says
(which is very little, and definitely not 'standards definition'
quality), it's more about 'how do you do it?' (in the context of a
modern, C99 optimizing compiler).

In gcc, easy, TYPE_ALIAS_SET (x) = 0;.
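The idea behind that one-liner is that alias set 0 is treated as conflicting with every other alias set, so a type placed in set 0 is simply excluded from type-based alias reasoning. A toy model of that rule, in plain C, might look like the sketch below. It is not GCC code: alias_set_id and toy_alias_sets_conflict are hypothetical names, and the model ignores the subset relationships between alias sets that a real compiler also tracks.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy alias-set id; 0 is the "may alias anything" set. */
typedef int alias_set_id;

/* Returns true when accesses in set a and set b must be assumed to
   possibly touch the same memory. */
bool toy_alias_sets_conflict(alias_set_id a, alias_set_id b) {
    assert(a >= 0 && b >= 0);
    if (a == 0 || b == 0) {
        return true;   /* set 0 conflicts with every set */
    }
    return a == b;     /* simplification: real compilers also track subsets */
}
```

Under this model, giving every Objective-C object pointer type the id 0 means the optimizer can never conclude that an NSArray * and an NSMutableArray * refer to different memory, which is the conservative behavior the thread is asking for.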