AST Representation of Conversions

Hi,

One thing I have started once and then aborted, but which Argiris
recently contacted me off-list about, is the AST representation of
conversions and casts. Currently, we have absolutely minimal
information: CastExpr (the base of all conversions) stores only the
operand. ImplicitCastExpr stores only whether the result is an lvalue.
ExplicitCastExpr (the base of all explicit casts) stores the target type
as written.
None of these store any of the information Sema has worked very hard to
acquire.
What kind of cast is it? Bitcast? Truncation? Extension? C++ has a large
variety of things that a conversion, especially a C-style cast, can do:
convert with constructor; convert with conversion operator; do a
hierarchy cast, potentially to a virtual base, which could mean adding
an offset to the pointer or dereferencing a pointer; do a raw bitcast
(reinterpret_cast is good at that); do an integer or floating point
extension/truncation; and even weirder things (member pointer casts,
explicit cast of the address of an overloaded function).

Obviously we need to save some information about the cast in the AST.
The question is what, and where.

Sema needs more detailed information about conversions than anyone else,
because it has to order them for function overloading. It needs to know
when an lvalue-to-rvalue conversion is performed, when a qualifier
conversion is performed, and precisely what conversions are done
in between. Doug Gregor has implemented this, and it works.

CodeGen needs far less information, but can still use a lot of it. In the
C case, it currently recreates the necessary information by inspecting
the types again, but this approach is not tenable in the C++ case.
(Currently, an attempt to codegen a C++-specific conversion will
probably crash.) CodeGen needs to distinguish:
- a raw bitcast (reinterpret_cast of pointers and pointer/integer pairs,
reinterpret_cast of lvalues to references)
- a floating point truncation (double -> float)
- a floating point extension (float -> double)
- an integer truncation (int -> short)
- an integer extension (short -> int)
- a static hierarchy cast without virtual bases (add an offset to the
pointer)
- a static hierarchy cast with virtual bases (fetch the pointer to the
virtual base, and then add an offset)
- a dynamic hierarchy cast (emit calls to support library)
- a user-defined conversion via constructor (call that constructor)
- a user-defined conversion via conversion operator (call that operator)
- a static hierarchy cast of a member object pointer (adjust the value
of that pointer)
- a static hierarchy cast of a member function pointer (I have no idea
how that works)
- function and array decay
- GCC aggregate casts in various forms
- vector and extvector casts
- Objective-C casts

I think that's everything. In short, CodeGen also cares about pretty
much everything.
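
To make a few of the categories above concrete, here is a minimal C++ sketch. The types (A, B, D, Meters) and helper names are invented for illustration and are not part of clang. It shows a static hierarchy cast that has to adjust the pointer, integer and floating point conversions, and the two kinds of user-defined conversion. The pointer adjustment holds on common ABIs, but the exact layout is implementation-defined.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types, used only to trigger the conversions discussed above.
struct A { int a = 1; };
struct B { int b = 2; };
struct D : A, B { int d = 3; };  // B is not the first base of D

struct Meters {
    double value;
    Meters(double v) : value(v) {}             // user-defined conversion via constructor
    operator double() const { return value; }  // user-defined conversion via conversion operator
};

// Static hierarchy cast without virtual bases: converting D* to B* must
// adjust the pointer, because the B subobject does not start where D starts
// (true on common ABIs; the exact layout is implementation-defined).
inline bool upcastMovesPointer(D *d) {
    B *b = d;  // implicit derived-to-base conversion
    return reinterpret_cast<std::uintptr_t>(b) !=
           reinterpret_cast<std::uintptr_t>(d);
}

// The explicit cast back down undoes the adjustment.
inline bool downcastRestoresPointer(D *d) {
    B *b = d;
    return static_cast<D *>(b) == d;
}

inline int extendToInt(short s) { return s; }                          // integer extension
inline short truncateToShort(int i) { return static_cast<short>(i); }  // integer truncation
inline float narrowToFloat(double x) { return static_cast<float>(x); } // floating point truncation

// Constructor conversion (double -> Meters) followed by a conversion
// operator call (Meters -> double).
inline double roundTrip(double x) {
    Meters m = x;
    return m;
}
```

In the source, each of these looks like an ordinary conversion, yet each requires different code to be emitted, which is why the kind has to be recorded somewhere.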

I don't know what other clients would need. The Index library definitely
wants to know about implicitly called functions (conversion operators
and constructors). The static analyzer would probably want the same
information as CodeGen. Other static code introspection tools probably
want all information too.

Essentially, I think, we will have to enhance or wrap
ImplicitConversionSequence from SemaOverload.h to also be able to
represent conversions that are only explicitly possible. Then we put it
into the AST library and give CastExpr one of those.

The problem with this approach is that it is heavy.
ImplicitConversionSequence is a heavy object (40 bytes on 32-bit without
considering alignment, 80 bytes on 64-bit if alignment works the way I
think it does), and every single ImplicitCastExpr (think of all the
"usual integral conversions" in C) would bear this weight, as would
casts that don't need this information, like const_cast (noop to
codegen), dynamic_cast (always runtime calls) and reinterpret_cast
(always bitcast).

An option would be to rearrange the hierarchy, but this makes it reflect
the implementation instead of the logical grouping. Currently the
hierarchy makes sense to programmers:
http://clang.llvm.org/doxygen/classclang_1_1CastExpr.html
If we were to rearrange it to fit the needs of data storage, CastExpr
would be the direct base of CXXConstCastExpr, CXXDynamicCastExpr,
CXXReinterpretCastExpr and ComplexCastExpr. ComplexCastExpr would hold
the conversion sequence and be the base of CXXStaticCastExpr,
CXXFunctionalCastExpr, CStyleCastExpr and ImplicitCastExpr. Not pretty.
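
The two arrangements can be sketched as skeleton classes. This is only an illustration: the class names are the ones used above, but the bodies are empty stand-ins, ConversionSequence is a 40-byte placeholder and not the real ImplicitConversionSequence, and the "logical" namespace is a simplification of the real hierarchy (any intermediate classes are left out).

```cpp
#include <cassert>
#include <type_traits>

struct Expr { virtual ~Expr() = default; };

// Placeholder for the heavy per-cast payload (size chosen arbitrarily).
struct ConversionSequence { unsigned char payload[40]; };

// Grouping by meaning: implicit versus explicit casts.
namespace logical {
struct CastExpr : Expr { Expr *Op = nullptr; };
struct ImplicitCastExpr : CastExpr {};
struct ExplicitCastExpr : CastExpr {};
struct CStyleCastExpr : ExplicitCastExpr {};
struct CXXFunctionalCastExpr : ExplicitCastExpr {};
struct CXXStaticCastExpr : ExplicitCastExpr {};
struct CXXDynamicCastExpr : ExplicitCastExpr {};
struct CXXReinterpretCastExpr : ExplicitCastExpr {};
struct CXXConstCastExpr : ExplicitCastExpr {};
}  // namespace logical

// Grouping by storage needs: only some casts carry the payload.
namespace storage {
struct CastExpr : Expr { Expr *Op = nullptr; };
struct CXXConstCastExpr : CastExpr {};
struct CXXDynamicCastExpr : CastExpr {};
struct CXXReinterpretCastExpr : CastExpr {};
struct ComplexCastExpr : CastExpr { ConversionSequence Seq; };
struct CXXStaticCastExpr : ComplexCastExpr {};
struct CXXFunctionalCastExpr : ComplexCastExpr {};
struct CStyleCastExpr : ComplexCastExpr {};
struct ImplicitCastExpr : ComplexCastExpr {};
}  // namespace storage

// In the logical layout, "is this an explicit cast?" is a single type test.
inline bool isExplicit(const logical::CastExpr &E) {
    return dynamic_cast<const logical::ExplicitCastExpr *>(&E) != nullptr;
}

// The storage layout keeps the light casts light...
static_assert(!std::is_base_of<storage::ComplexCastExpr,
                               storage::CXXConstCastExpr>::value,
              "const_cast carries no conversion sequence");
// ...but implicit casts and C-style casts now share a base for storage reasons only.
static_assert(std::is_base_of<storage::ComplexCastExpr,
                              storage::ImplicitCastExpr>::value,
              "implicit casts carry the conversion sequence");
```

Note that the storage layout has no common base for "all explicit casts" any more, so a query like isExplicit would have to enumerate the classes by hand.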

Does anyone else have suggestions on how to solve this problem?

Sebastian

Hi Sebastian,

ImplicitCastExpr models aspects of the underlying implementation (AST nodes inserted by Sema to simplify the code generator). IIRC, CastExpr is abstract. I think this division is nice/clean.

Would it make sense to add "heavier" sub-classes of ImplicitCastExpr to model the C++-specific conversions? Would this avoid the space overhead you are talking about?
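
A rough sketch of what such a heavier subclass might look like, under stated assumptions: Decl is an empty stand-in, UserDefinedImplicitCastExpr and getConversionFunction are hypothetical names, and the member layout is invented for illustration. The plain node used for C conversions stays the same size, and only the C++-specific node pays for the extra pointer.

```cpp
#include <cassert>

struct Decl {};  // stand-in for a declaration node
struct Expr { virtual ~Expr() = default; };

struct ImplicitCastExpr : Expr {
    Expr *Op = nullptr;
    bool LvalueCast = false;
    // Plain conversions call no function; heavier subclasses override this.
    virtual const Decl *getConversionFunction() const { return nullptr; }
};

// Hypothetical heavier subclass for conversions that call a constructor
// or a conversion operator.
struct UserDefinedImplicitCastExpr : ImplicitCastExpr {
    const Decl *ConversionFunction = nullptr;
    const Decl *getConversionFunction() const override {
        return ConversionFunction;
    }
};
```

Clients that only hold an ImplicitCastExpr can still ask for the conversion function and get a null pointer for the ordinary C cases.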

I'm just starting to get my feet wet with clang's C++ support, so I don't have any C++-specific insights.

snaroff

Hi Sebastian,

> One thing I have started once and then aborted, but which Argiris
> recently contacted me off-list about, is the AST representation of
> conversions and casts. Currently, we have absolutely minimal
> information: CastExpr (the base of all conversions) stores only the
> operand. ImplicitCastExpr stores only whether the result is an lvalue.
> ExplicitCastExpr (the base of all explicit casts) stores the target type
> as written.
> None of these store any of the information Sema has worked very hard to
> acquire.
> What kind of cast is it? Bitcast? Truncation? Extension? C++ has a large
> variety of things that a conversion, especially a C-style cast, can do:
> convert with constructor; convert with conversion operator; do a
> hierarchy cast, potentially to a virtual base, which could mean adding
> an offset to the pointer or dereferencing a pointer; do a raw bitcast
> (reinterpret_cast is good at that); do an integer or floating point
> extension/truncation; and even weirder things (member pointer casts,
> explicit cast of the address of an overloaded function).
>
> Obviously we need to save some information about the cast in the AST.
> The question is what, and where.

Right.

> CodeGen needs to distinguish:
> - a raw bitcast (reinterpret_cast of pointers and pointer/integer pairs,
> reinterpret_cast of lvalues to references)
> - a floating point truncation (double -> float)
> - a floating point extension (float -> double)
> - an integer truncation (int -> short)
> - an integer extension (short -> int)
> - a static hierarchy cast without virtual bases (add an offset to the
> pointer)
> - a static hierarchy cast with virtual bases (fetch the pointer to the
> virtual base, and then add an offset)
> - a dynamic hierarchy cast (emit calls to support library)
> - a user-defined conversion via constructor (call that constructor)
> - a user-defined conversion via conversion operator (call that operator)
> - a static hierarchy cast of a member object pointer (adjust the value
> of that pointer)
> - a static hierarchy cast of a member function pointer (I have no idea
> how that works)
> - function and array decay
> - GCC aggregate casts in various forms
> - vector and extvector casts
> - Objective-C casts
> I think that's everything. In short, CodeGen also cares about pretty
> much everything.

Quite an exhaustive list! I can't think of any you missed, except perhaps "no-op" conversions that merely adjust types (e.g., by adding qualifiers) and require no code generation.

> I don't know what other clients would need. The Index library definitely
> wants to know about implicitly called functions (conversion operators
> and constructors). The static analyzer would probably want the same
> information as CodeGen. Other static code introspection tools probably
> want all information too.

I suspect you're right. We at least need that much information in each cast (regardless of whether it is implicit or explicit).

> Essentially, I think, we will have to enhance or wrap
> ImplicitConversionSequence from SemaOverload.h to also be able to
> represent conversions that are only explicitly possible. Then we put it
> into the AST library and give CastExpr one of those.
> The problem with this approach is that it is heavy.
> ImplicitConversionSequence is a heavy object (40 bytes on 32-bit without
> considering alignment, 80 bytes on 64-bit if alignment works the way I
> think it does), and every single ImplicitCastExpr (think of all the
> "usual integral conversions" in C) would bear this weight, as would
> casts that don't need this information, like const_cast (noop to
> codegen), dynamic_cast (always runtime calls) and reinterpret_cast
> (always bitcast).

ImplicitConversionSequence is quite heavy, and I don't think that clients need that much information. It seems to me that we could get away with adding an enum (covering all the kinds of conversions you mentioned above) and a declaration (that points to a constructor or conversion function). The Expr class already has enough spare bits to store the enum, so CastExpr (and its descendants) would only have to grow by a single pointer. That, IMO, is an acceptable trade-off, since we'll be making CodeGen easier for C and possibly for C++.
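
A minimal sketch of this shape, with everything in it assumed for illustration: the CastKind enumerators are hypothetical names derived from the list above, Decl is an empty stand-in, and the bit-field widths in Expr are invented to model "spare bits" in the base class. The point is only that the kind lives in bits that already exist, so the one added member is the declaration pointer.

```cpp
#include <cassert>

// Hypothetical kinds, one per category from the list above, plus a no-op.
enum class CastKind : unsigned {
    NoOp, BitCast,
    FloatingTruncation, FloatingExtension,
    IntegralTruncation, IntegralExtension,
    DerivedToBase, DerivedToVirtualBase, Dynamic,
    ConstructorConversion, UserDefinedConversion,
    MemberObjectPointer, MemberFunctionPointer,
    FunctionOrArrayDecay, GCCAggregate, Vector, ObjC
};
static_assert(static_cast<unsigned>(CastKind::ObjC) < 32,
              "all kinds must fit in the 5 bits reserved below");

struct Decl {};  // stand-in for a constructor or conversion function

class Expr {
protected:
    unsigned NodeClass : 8;     // which kind of node this is
    unsigned CastKindBits : 5;  // assumption: spare bits reused for the cast kind
    Expr() : NodeClass(0), CastKindBits(0) {}
};

class CastExpr : public Expr {
    Expr *Op;
    const Decl *ConversionFunction;  // the single added pointer; null when no function is called
public:
    CastExpr(Expr *op, CastKind kind, const Decl *fn = nullptr)
        : Op(op), ConversionFunction(fn) {
        CastKindBits = static_cast<unsigned>(kind);
    }
    CastKind getCastKind() const { return static_cast<CastKind>(CastKindBits); }
    const Decl *getConversionFunction() const { return ConversionFunction; }
    Expr *getSubExpr() const { return Op; }
};
```

For a plain C conversion the pointer is simply null, which is the cost being weighed here.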

> An option would be to rearrange the hierarchy, but this makes it reflect
> the implementation instead of the logical grouping. Currently the
> hierarchy makes sense to programmers:
> http://clang.llvm.org/doxygen/classclang_1_1CastExpr.html
> If we were to rearrange it to fit the needs of data storage, CastExpr
> would be the direct base of CXXConstCastExpr, CXXDynamicCastExpr,
> CXXReinterpretCastExpr and ComplexCastExpr. ComplexCastExpr would hold
> the conversion sequence and be the base of CXXStaticCastExpr,
> CXXFunctionalCastExpr, CStyleCastExpr and ImplicitCastExpr. Not pretty.

No, not pretty. Our hierarchy is really nice for describing the syntactic and semantic behavior of these expressions, and it would be a shame if we had to sacrifice that clarity to save a few bytes.

  - Doug

I can't see how a pointer which would always be null in C code could
possibly make CodeGen easier for C code...

Note that we can always recalculate the appropriate declaration; it
might make sense to put the relevant code into the AST so it can be
shared between Sema and CodeGen.

-Eli

>> ImplicitConversionSequence is quite heavy, and I don't think that
>> clients need that much information. It seems to me that we could get
>> away with adding an enum (covering all the kinds of conversions you
>> mentioned above) and a declaration (that points to a constructor or
>> conversion function). The Expr class already has enough spare bits to
>> store the enum, so CastExpr (and its descendants) would only have to
>> grow by a single pointer. That, IMO, is an acceptable trade-off, since
>> we'll be making CodeGen easier for C and possibly for C++.

> I can't see how a pointer which would always be null in C code could
> possibly make CodeGen easier for C code...

The enum makes CodeGen easier for C code, since we don't need to figure out what kind of conversion to perform. The constructor/conversion function pointer, of course, won't help C at all.
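
As a sketch of what that buys for the scalar C conversions: with the kind stored on the node, the emitter can switch on it directly. The CastKind enum and emitScalarCast are hypothetical names, not clang API; the returned strings are the LLVM IR conversion opcodes. The source signedness is passed separately as an assumption, since an extension still needs it to choose between sign and zero extension.

```cpp
#include <cassert>
#include <string>

// Hypothetical subset of kinds covering the scalar C conversions.
enum class CastKind {
    NoOp, BitCast,
    FloatingTruncation, FloatingExtension,
    IntegralTruncation, IntegralExtension
};

// Returns the LLVM IR opcode to emit for a scalar cast, or an empty
// string when no instruction is needed.
inline std::string emitScalarCast(CastKind kind, bool srcIsSigned) {
    switch (kind) {
    case CastKind::NoOp:               return "";
    case CastKind::BitCast:            return "bitcast";
    case CastKind::FloatingTruncation: return "fptrunc";
    case CastKind::FloatingExtension:  return "fpext";
    case CastKind::IntegralTruncation: return "trunc";
    case CastKind::IntegralExtension:  return srcIsSigned ? "sext" : "zext";
    }
    return "";  // unreachable for valid kinds
}
```

Without the stored kind, the same decision has to be reconstructed by comparing the source and destination types again.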

> Note that we can always recalculate the appropriate declaration; it
> might make sense to put the relevant code into the AST so it can be
> shared between Sema and CodeGen.

It's expensive to recalculate, because you'll have to perform overload resolution again. That's far too much code to move into AST.

  - Doug

>>> ImplicitConversionSequence is quite heavy, and I don't think that
>>> clients need that much information. It seems to me that we could get
>>> away with adding an enum (covering all the kinds of conversions you
>>> mentioned above) and a declaration (that points to a constructor or
>>> conversion function). The Expr class already has enough spare bits to
>>> store the enum, so CastExpr (and its descendants) would only have to
>>> grow by a single pointer. That, IMO, is an acceptable trade-off, since
>>> we'll be making CodeGen easier for C and possibly for C++.

>> I can't see how a pointer which would always be null in C code could
>> possibly make CodeGen easier for C code...

> The enum makes CodeGen easier for C code, since we don't need to figure out
> what kind of conversion to perform. The constructor/conversion function
> pointer, of course, won't help C at all.

Right... but the point is, you're proposing adding a pointer to
CastExpr that's completely useless for C code. It might be that it's
not worth the effort to avoid here, but we've been trying to avoid
bloating the C AST exclusively for the benefit of C++ code.

>> Note that we can always recalculate the appropriate declaration; it
>> might make sense to put the relevant code into the AST so it can be
>> shared between Sema and CodeGen.

> It's expensive to recalculate, because you'll have to perform overload
> resolution again. That's far too much code to move into AST.

Oh, I see. Then scratch that.

-Eli

>> CodeGen needs to distinguish:
>> - a raw bitcast (reinterpret_cast of pointers and pointer/integer pairs,
>> reinterpret_cast of lvalues to references)
>> - a floating point truncation (double -> float)
>> - a floating point extension (float -> double)
>> - an integer truncation (int -> short)
>> - an integer extension (short -> int)
>> - a static hierarchy cast without virtual bases (add an offset to the
>> pointer)
>> - a static hierarchy cast with virtual bases (fetch the pointer to the
>> virtual base, and then add an offset)
>> - a dynamic hierarchy cast (emit calls to support library)
>> - a user-defined conversion via constructor (call that constructor)
>> - a user-defined conversion via conversion operator (call that operator)
>> - a static hierarchy cast of a member object pointer (adjust the value
>> of that pointer)
>> - a static hierarchy cast of a member function pointer (I have no idea
>> how that works)
>> - function and array decay
>> - GCC aggregate casts in various forms
>> - vector and extvector casts
>> - Objective-C casts
>> I think that's everything. In short, CodeGen also cares about pretty
>> much everything.

> Quite an exhaustive list! I can't think of any you missed, except perhaps "no-op" conversions that merely adjust types (e.g., by adding qualifiers) and require no code generation.

I don't see where floating point to integer conversion and integer to floating-point fit.

Cédric Venet wrote:

>>> CodeGen needs to distinguish:
>>> <snip>

>> Quite an exhaustive list! I can't think of any you missed, except
>> perhaps "no-op" conversions that merely adjust types (e.g., by
>> adding qualifiers) and require no code generation.

> I don't see where floating point to integer conversion and integer to
> floating-point fit.

You're right, I forgot those.

Sebastian