Signed/unsigned value type resolution

Hi all,

I am currently working on a static analysis aimed at detecting integer arithmetic overflow/underflow. We are attempting to build a sound abstract domain (based on Cousot & Cousot-style abstract interpretation), but in practice this requires the ability to determine the word size and signedness of values in the intermediate representation. I'm well aware that LLVM leverages the (usual) equivalence of certain arithmetic operations in two's complement form with respect to signedness, but from a program analysis point of view it can be very important to know whether, for example, 0xFFFF means 65535 or -1 (assuming 16 bits), particularly when values are represented by conceptually infinitary abstract domains.
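To make the ambiguity concrete, here is a minimal sketch in C of one 16-bit pattern read both ways. The helper names (as_signed16, same_bits_two_readings) are made up for illustration and are not part of LLVM or of any analysis framework.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: view a 16-bit pattern as a signed value.
   memcpy copies the bits unchanged, so no value conversion is involved. */
static int16_t as_signed16(uint16_t bits) {
    int16_t v;
    memcpy(&v, &bits, sizeof v);
    return v;
}

/* The same bits, two readings. */
static void same_bits_two_readings(void) {
    uint16_t bits = 0xFFFFu;
    assert(bits == 65535);            /* unsigned reading */
    assert(as_signed16(bits) == -1);  /* signed (two's complement) reading */
}
```

An abstract domain over the mathematical integers has to pick one of these two readings before it can place the value in an interval, which is why the signedness matters here.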

There seems to be some support in the head version in the DIType class (specifically DIType::isUnsignedDIType()) for extracting this information from debug metadata, though this member function is missing in 2.9. It is also sometimes possible to infer signedness from context, since certain instructions imply it, but I'm finding that doing that still leaves many cases unresolved.
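As a rough illustration of inferring signedness from context, the sketch below classifies LLVM instruction mnemonics by the interpretation they imply for their operands. The enum, the function name sign_hint_for, and the string-keyed lookup are assumptions made for this example; a real pass would inspect instruction opcodes and predicates rather than strings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { SIGN_AGNOSTIC, SIGN_SIGNED, SIGN_UNSIGNED } sign_hint;

/* Hypothetical classifier: which reading does this mnemonic imply? */
static sign_hint sign_hint_for(const char *op) {
    static const char *const signed_ops[] = {
        "sdiv", "srem", "ashr", "sext", "sitofp", "fptosi",
        "icmp slt", "icmp sle", "icmp sgt", "icmp sge"
    };
    static const char *const unsigned_ops[] = {
        "udiv", "urem", "lshr", "zext", "uitofp", "fptoui",
        "icmp ult", "icmp ule", "icmp ugt", "icmp uge"
    };
    size_t i;

    for (i = 0; i < sizeof signed_ops / sizeof signed_ops[0]; i++)
        if (strcmp(op, signed_ops[i]) == 0)
            return SIGN_SIGNED;
    for (i = 0; i < sizeof unsigned_ops / sizeof unsigned_ops[0]; i++)
        if (strcmp(op, unsigned_ops[i]) == 0)
            return SIGN_UNSIGNED;

    /* add, sub, mul, and, or, xor, shl, icmp eq/ne and anything else
       produce the same bits under either reading. */
    return SIGN_AGNOSTIC;
}
```

Values that only ever flow through the agnostic group are exactly the cases that stay unresolved.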

What's the best way to go with this?

Thank you in advance,
Sarah Thompson
NASA Ames (back doing LLVM stuff again after a while working on robotics)

Hi Sarah, this is a hopeless task because both signed and unsigned
variables in C can map to the *same* LLVM IR register. Consider the
following example:

void foo(int x, int y) {
   if ((x < y) || ((unsigned)x < (unsigned)y))
     abort();
}

In LLVM IR this becomes:

define void @foo(i32 %x, i32 %y) {
entry:
   %0 = icmp slt i32 %x, %y
   %1 = icmp ult i32 %x, %y
   %2 = or i1 %0, %1
   br i1 %2, label %"3", label %return

"3":
   tail call void @abort() noreturn nounwind
   unreachable

return:
   ret void
}

Note how both the signed variable x and the unsigned variable (unsigned)x
have become the same IR register %x. The underlying problem here is that
casts from signed to unsigned or unsigned to signed in C become no-ops in
LLVM IR, so signed and unsigned values that are different in C become the
same register in LLVM IR. Thus it is logically impossible to correctly
assign "signed" or "unsigned" labels to LLVM IR registers.

Instead of trying to fight LLVM's type system, I think you need to
embrace it: accept that only operations have signs, and adapt your
algorithms to work with that.
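
One way to read "only operations have signs" for overflow detection is to keep a single bit pattern per value and let each check choose its own interpretation. The sketch below is in C and works under that assumption; the function names are invented for illustration. LLVM's llvm.sadd.with.overflow and llvm.uadd.with.overflow intrinsics reflect the same split.

```c
#include <stdbool.h>
#include <stdint.h>

/* Unsigned reading: the add wraps iff the truncated sum is smaller
   than one of the operands. */
static bool uadd_overflows(uint32_t a, uint32_t b) {
    return (uint32_t)(a + b) < a;
}

/* Signed reading of the same bits: overflow iff both operands have the
   same sign bit and the sum's sign bit differs from it. */
static bool sadd_overflows(uint32_t a, uint32_t b) {
    uint32_t sum = a + b;
    return ((~(a ^ b) & (a ^ sum)) >> 31) != 0;
}
```

For the operands 0xFFFFFFFF and 1, the unsigned check reports a wrap while the signed check does not (-1 + 1 = 0); for 0x7FFFFFFF and 1 it is the other way round. The register carries no sign; the question asked of it does.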

Ciao, Duncan.