So, consider the source-code-level case first.
You can write a source-level program that will compile unchanged
to produce a 32-bit application or a 64-bit application.
Proof of this is that almost any Linux-based distro is
available in 32-bit and 64-bit variants built from the same sources.
So, if you then ask a different question:
instead of porting a 32-bit program to 64-bit, can you port the
32-bit program to one that works equally well when compiled for a
32-bit target or a 64-bit target?
That's still impossible in general. In C++, it's trivial to write code like this:
template<size_t size>
struct AlignedStorage;

template<>
struct AlignedStorage<4> {
  union {
    uint32_t word;
    uint8_t elements[4];
  };
};

template<>
struct AlignedStorage<8> {
  union {
    uint64_t word;
    uint8_t elements[8];
  };
};

...
AlignedStorage<sizeof(void*)> storage;
You end up compiling literally different code depending on the size of a pointer, via templates. Or you could do the same thing with a macro. This isn't academic:
<http://dxr.mozilla.org/search?tree=mozilla-central&q=regexp%3A%2F%23if.*SIZEOF_%2F&redirect=true> [1]. (Note: this is in a code base that already uses the intN_t types almost everywhere instead of plain int/long/etc.).
This is a concern even before we get to optimizations that can exploit identical representations to deduplicate code on different branches, or the fact that inlining sizeof() operations as constants has profound second-order effects, such as radically altering structure layout. (Grepping a recent paper indicates that the precision of structurally typing binary programs, even when you collapse all types of the same size, is about 90%. That is an upper bound on your effectiveness.)
First steps in this might be looking at every use of "int" and "long"
and replacing them with int32_t and int64_t, i.e. replacing target-
specific types with target-agnostic types.
So, if the binary is 32-bit, int will be 32-bit: change the source
code to say "int32_t" instead of "int".
If the binary is 32-bit, and on that target long is 32-bit, change
the source code to say "int32_t" as well.
In 3 million lines of code, there are:
* >1000 uses of size_t
* 857 uses of ptrdiff_t
* >1000 uses of intptr_t and uintptr_t
* 839 uses of ssize_t
I am assuming that all of these are intended to be explicitly pointer-sized integer variables. In addition, there are over 504 distinct unions, which is only a subset of the places where types are used polymorphically--I'm not counting uses of reinterpret_cast or static_cast (or C-style type punning)--which add up to several thousand more possible combinations.
I know that there will be special cases that are difficult to handle.
I don't expect 100%. I am looking to write a tool that can do say 80%
of the work.
You are *very* optimistic to assume that you can well-type 80% of the program given only the binary code of the program. I think DSA managed, given LLVM IR mid-optimization, to determine 80% of the objects accessed by loads/stores to be a type more precise than a bag of bytes on SPEC2000--and SPEC2000 is not a particularly hard benchmark compared to real-world programs.
So, it is not black and white. I want it to work say 80% of the time,
but at least highlight where the remaining 20% is, and do manual work
on it.
I am assuming a lot about your background knowledge here, but the fact that you were not aware of QEMU as prior art, together with some of your choices of words, leads me to believe that you have not looked very hard into prior research on static analysis of either C code or binary code. That is not a recipe for success.
[1] Shameless DXR plug: we support regex searches