preserving type signatures

Hi there,

I'm currently using Clang/LLVM for static analysis of C programs, and
I want to get exactly the same type signatures as in the original
source programs. However, with Clang, if a function returns
struct { float, float }, for example, then the compiled function in
the LLVM bitcode returns a double that aggregates the two floats. I
wonder whether there is any way to prevent Clang from doing this type
transformation.
Thanks for any help.

Naoya

It's completely wrong to convert { float, float } into double. Just
because they have the same size doesn't mean they're identical.

Can you give us a simple example where this occurs?

Renato,

Here's sample code that compiles the way I described:

It has been discussed on the list recently:

http://old.nabble.com/64bit-MRV-problem:-{-float,-float,-float}–>-{-double,-float-}-td27304830.html

Best regards,
Victor

Victor, thanks for pointing out the discussion, though unfortunately
there is no way to work around the ABI issue.

Thanks,
Naoya

Victor, thanks for pointing out the discussion, though unfortunately
there is no way to work around the ABI issue.

You can probably work around this by implementing your own target, and having it lower the ABI types in a way that you would prefer. X86-64 is one of the more brutal targets for type hacking.

-Chris

Hi there,

I'm currently using Clang/LLVM for static analysis of C programs, and
I want to get exactly the same type signatures as in the original
source programs. However, with Clang, if a function returns
struct { float, float }, for example, then the compiled function in
the LLVM bitcode returns a double that aggregates the two floats. I
wonder whether there is any way to prevent Clang from doing this type
transformation.
Thanks for any help.

Yes, as Chris mentioned, you can write your own target to do this.

However, my intuition is that you should not be trying to do this in
the first place. Clang's type system is richer than LLVM's; you can
only map in one direction. It sounds like you are trying to map back
from LLVM types to Clang types, and I wouldn't recommend that.

- Daniel

However, with Clang, if a function returns struct { float, float },
for example, then the compiled function in the LLVM bitcode returns a
double that aggregates the two floats.

It's completely wrong to convert { float, float } into double.

No it isn't. Clang does this on occasion when it wants to satisfy an
ABI constraint.

Just
because they have the same size doesn't mean they're identical.

Just because it converts doesn't mean it is assuming they are identical.

- Daniel

It's completely wrong to convert { float, float } into double.

No it isn't. Clang does this on occasion when it wants to satisfy an
ABI constraint.

Hi Duncan,

It might make sense for some platforms (such as x86_64), but it
doesn't for all platforms, especially non-64-bit ones.

I read the thread with your detailed explanations and I understand now
why it was done and also agree that it's quite ugly. Though, maybe
some metadata (instead of the whole C type system) could be passed
down to the codegen to avoid this.

It's "ok" to assume most people will be using Intel with GCC 64 these
days, but IMHO forcing (all?) other target codegens to work around a
specific x86-64 ABI is just wrong.

Just
because they have the same size doesn't mean they're identical.

Just because it converts doesn't mean it is assuming they are identical.

Quite right, my mistake.

cheers,
--renato

http://systemcall.org/

Reclaim your digital rights, eliminate DRM, learn more at
http://www.defectivebydesign.org/what_is_drm

You seem to be under the impression that this happens for all targets. It does not. If you compile your example code for FreeBSD x86, it returns the values as an i64. If you compile for Linux x86, the return type of the function is void and there is a sret pointer passed as the first argument.

This is because, unfortunately, LLVM does not handle any of the ABI-specific calling convention logic for you. It maps IR types to specific calling conventions (in an undocumented way, which is fun if you're writing a compiler that requires ABI compatibility with C).

There is no assumption that 'most people will be using Intel with GCC 64'; there is an assumption that, if you are using x86-64, then you want code that uses the same ABI as your system compiler. Clang generates LLVM IR that LLVM will translate into something that has the correct ABI.

As Daniel says, you are free to define a new ABI that has your own mappings if you want more of the information preserved, but you'll still find things that LLVM IR can't represent, such as unions or the difference between complex types and structures. If C type information is important to you then you would be better off using a representation that contains this information.

David

-- Sent from my IBM 1620

Hi David,

So, that means the IR is not platform/ABI agnostic? What if I want to
mix IR generated from different platforms into one big virtual machine
(RPE-style)? I hope there's a way to disable all that ABI stuff and
generate plain simple IR, is there?

cheers,
--renato


So, that means the IR is not platform/ABI agnostic?

Correct. See: http://llvm.org/docs/FAQ.html#platformindependent

What if I want to
mix IR generated from different platforms into one big virtual machine
(RPE-style)?

Then you need to use something other than LLVM IR. You also need to start with a source language other than C, because things like sizeof(int) change on different platforms, as do a lot of things defined in headers and a number of predefined macros. Even preprocessing the same C source on FreeBSD/x86 and Solaris/SPARC64 will give very different output for all but the most trivial programs. Compiling it to IR will add even more differences.

I hope there's a way to disable all that ABI stuff and
generate plain simple IR, is there?

No. LLVM IR is less expressive than C. There is no way of generating 'plain and simple IR' that can be turned into native code trivially. Consider something simple like a function taking an argument that is a union of an int and a void*. On any sane ABI, this argument will be passed in a register, so the IR will use an i32 or i64. Linux/x86, however, will pass it via a pointer. How would you represent this in LLVM IR?

David

-- Sent from my brain

Thanks so much for all the help. It seems I should use Clang rather
than LLVM for my analysis, since I need higher-level type information
than LLVM can represent.

Thanks,
Naoya

Then you need to use something other than LLVM IR. You also need to start with a source language other than C, because things like sizeof(int) change on different platforms, as do a lot of things defined in headers and a number of predefined macros. Even preprocessing the same C source on FreeBSD/x86 and Solaris/SPARC64 will give very different output for all but the most trivial programs. Compiling it to IR will add even more differences.

I thought the sizeof problem was solved with the data layout
information... But yes, headers will mess things up.

One more thing that occurred to me is that, if you generate the same
IR for every target (see my inline answer below), some optimizations
might assume the wrong things and break the ABI, not just slow down
the program.

No. LLVM IR is less expressive than C.

Indeed, and that's when all my assumptions went away.

The whole point of converting to IR is to reduce expressiveness so
that the optimizations and codegen can work with something simple and
avoid herculean tasks. So having an IR that is as expressive as any
language (or the mix of all of them) is not just difficult, it's
wrong. Right? ;-)

There is no way of generating 'plain and simple IR' that can be turned into native code trivially. Consider something simple like a function taking an argument that is a union of an int and a void*. On any sane ABI, this argument will be passed in a register, so the IR will use an i32 or i64. Linux/x86, however, will pass it via a pointer. How would you represent this in LLVM IR?

If you leave this to the lower levels to decide, all definitions would be:

%union.foo = type { i32, i8* } ; obvious platform-dependent problems
                               ; here, solved by data layout, maybe

define void @func(%union.foo %arg)

And the codegen would change to pointers or registers.

But, as I said above, that would encourage (or discourage)
optimizations at this level that could break the logic. Also, it
would serve no purpose, since the idea of the IR is to make things
simpler, not to run opt/codegen on pure C.

Thanks for the clarifications.

cheers,
--renato
