C++ type erasure in llvm-g++

I'm curious about the type erasure that goes on when llvm-g++ compiles C++ code. Is this a consequence of it just being the easiest way to do things based on the design of gcc and how LLVM is plugged into it?

Can someone more familiar with the llvm-gcc infrastructure comment on the difficulty of generating more strongly typed virtual function tables rather than just having them all be variable length arrays of pointers of unknown type and casting to the "known" type before the call?

I understand that there's some issue with structural type equivalence that can merge identical looking table types, but I feel like it would help me with alias analysis to determine possible targets for indirect calls/invokes, or for debug related purposes. What I'd really like would be a vcall/vinvoke instruction, but I get LL part of LLVM.

I guess my main complaint is that I'd like to use the LLVM infrastructure to do some higher-level (semantics-wise) manipulation of code without waiting for clang to handle C++. Any other suggestions would be appreciated.

Thanks,
Luke

I'm curious about the type erasure that goes on when llvm-g++ compiles
C++ code. Is this a consequence of it just being the easiest way to do
things based on the design of gcc and how LLVM is plugged into it?

This is just due to how the G++ front-end happens to lower the C++ types to C types internally.

Can someone more familiar with the llvm-gcc infrastructure comment on
the difficulty of generating more strongly typed virtual function tables
rather than just having them all be variable length arrays of pointers
of unknown type and casting to the "known" type before the call?

This would require changing the G++ front-end. I have no idea how difficult that would be.

-Chris

Can someone more familiar with the llvm-gcc infrastructure comment on
the difficulty of generating more strongly typed virtual function tables
rather than just having them all be variable length arrays of pointers
of unknown type and casting to the "known" type before the call?

The easiest way would be to handle this in the gcc/llvm interface layer. The type of each slot can easily be figured out and the type of the vtable can be built up as a structure instead of an array. I'd guess it shouldn't be more than 100 lines. Harder to do would be to transform the virtual dispatch code. It is exposed as just raw add/subtract, fetch and then indirect call. Seems like part of the solution may be to propagate the ALIAS_SET information from the type system down for llvm to reason with. It should be more complete and accurate than the information llvm has, though, maybe only marginally so. The saving grace would be the code is heavily stylized and you're getting it before the optimizer swizzles it on you. Since all the math is with constants usually, you just need to recognize the style and the type during the call and the type at the other end, where the pointer arithmetic starts and the transform back into the usual llvm accessors for structures. Annoying to do, but, probably under 200 lines.

Any other suggestions would be appreciated.

Sure, just add code to propagate the types around, add code to handle constant arithmetic on these things and and to handle normal virtual dispatches, after that, add support for pointer to member functions and you're done. You should be able to figure out that these things don't escape much, that for a given constant (index), the same shape (type) is always used, that for a given variable (pointer to member function), that the same shape (type) is used and all assignments of this variable come from things of the same shape or that they comes from literals that have the same shape.

Mike Stump wrote:

Can someone more familiar with the llvm-gcc infrastructure comment on
the difficulty of generating more strongly typed virtual function tables
rather than just having them all be variable length arrays of pointers
of unknown type and casting to the "known" type before the call?

The easiest way would be to handle this in the gcc/llvm interface layer. The type of each slot can easily be figured out and the type of the vtable can be built up as a structure instead of an array. I'd guess it shouldn't be more than 100 lines. Harder to do would be to transform the virtual dispatch code. It is exposed as just raw add/ subtract, fetch and then indirect call. Seems like part of the solution may be to propagate the ALIAS_SET information from the type system down for llvm to reason with. It should be more complete and accurate than the information llvm has, though, maybe only marginally so. The saving grace would be the code is heavily stylized and you're getting it before the optimizer swizzles it on you. Since all the math is with constants usually, you just need to recognize the style and the type during the call and the type at the other end, where the pointer arithmetic starts and the transform back into the usual llvm accessors for structures. Annoying to do, but, probably under 200 lines.

OK, so it's mainly a problem of becoming comfortable with the llvm-gcc internals that are affected and not a fundamental whole-compiler design problem. That sounds like a multi-month rather than multi-year thing for me, Thanks.

Any other suggestions would be appreciated.

Sure, just add code to propagate the types around, add code to handle constant arithmetic on these things and and to handle normal virtual dispatches, after that, add support for pointer to member functions and you're done. You should be able to figure out that these things don't escape much, that for a given constant (index), the same shape (type) is always used, that for a given variable (pointer to member function), that the same shape (type) is used and all assignments of this variable come from things of the same shape or that they comes from literals that have the same shape.

So I can essentially rematerialize the vtable types by pushing things back through from the indirect calls in the program. Wouldn't existing alias analysis _do_ the same thing in a less specific manner? I guess that alias analysis doesn't always "trust" casts, where if I manually pushed back I would be assuming that the casts are correct?

Luke

OK, so it's mainly a problem of becoming comfortable with the llvm-gcc
internals that are affected and not a fundamental whole-compiler design
problem.

That's correct.

So I can essentially rematerialize the vtable types by pushing things
back through from the indirect calls in the program. Wouldn't existing
alias analysis _do_ the same thing in a less specific manner?

Yes, it should, that it essentially what I was describing. I can't make any claims about existing alias analysis, I thought there were large swaths missing. The idea would be, that if you wrote C++ code in C, using an array of void* for the vtables, you could get the backend to figure it all out and raise things back up. Kinda double the work than doing it safely to begin with.

I guess that alias analysis doesn't always "trust" casts, where if I manually
pushed back I would be assuming that the casts are correct?

Once all the pushing is in, one should be able to discover that the casts all convert to the same type, and remove them as useless. :slight_smile:

Mike Stump wrote:

I guess that alias analysis doesn't always "trust" casts, where if I manually
pushed back I would be assuming that the casts are correct?

Once all the pushing is in, one should be able to discover that the casts all convert to the same type, and remove them as useless. :slight_smile:

Right. I guess what I meant is that the cast might not have been generated by llvm-g++ but may be a user "lying" about the type of a function with a C cast for whatever reason. It could alias almost any global or anything that has its address taken. With a C++ vtable I know that the user never had access to the vtable and thus couldn't have done that.

I'm statically cloning and instrumenting call targets for my application (and doing runtime indirect branch target translation lookups), so it's really bad if I miss a target that I need. On the other hand I'd like to clone as few functions as possible. The most likely reason for an indirect call in what I get is C++ virtual calls, so disambiguating just these call sites would be helpful.

One thing I don't exactly understand is how I can push back through the vtable pointer loads in order to distinguish different vtables. I still need alias analysis for this... if I don't then the virtual function at index 0 will likely have lots of different types.

I think I really need to make this happen at a higher level where I have a much better idea of the set of possible targets of a virtual call.

Luke

Right. I guess what I meant is that the cast might not have been
generated by llvm-g++ but may be a user "lying"

Right. If the type is the same, you know it isn't a lie, and you can do what you want. If you `see' someone lying, well, within the confines of the language standard, we have ways of dealing with that. Basically, it is verbotten. In practice, for a production compiler, you need to handle it reasonably (things like debuggers, JITs, libffi and the like).

One thing I don't exactly understand is how I can push back through the
vtable pointer loads in order to distinguish different vtables. I still
need alias analysis for this... if I don't then the virtual function at
index 0 will likely have lots of different types.

index 0 for a given family of vtables will always have the same signature due to the language rules. If they are different, it means the vtables is of a different family and for a particular vtable pointer, can never be pointed to as the types don't match.

I think I really need to make this happen at a higher level where I have a much better idea of the set of possible targets of a virtual call.

That's just the set of possible values that a particular vtable pointer can have. -fwhole-program or some such can explain what that set is (after one does up all the code I mean). It is true that the front end also has this information, but it doesn't have the -fwhole-program bit to ensure it has the entire set.