How to infer sign information from LLVM IR?

I'm implementing an integer bounds analysis (i.e., for each variable
compute its lower and upper bound) as an LLVM pass. My objective is to
identify potential overflows in both signed and unsigned C integers.
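
To make that concrete, here is a minimal sketch of the interval
bookkeeping I have in mind. The names, the 64-bit bounds, and the fixed
32-bit ranges are just choices for this sketch, not anything that
already exists in LLVM:

   // Interval domain sketch: one lower/upper bound pair per value.
   #include <cstdint>

   struct Interval {
     int64_t Lo, Hi;                  // invariant: Lo <= Hi
   };

   // Transfer function for an "add": sum the endpoints.  (The 64-bit
   // bounds keep the sketch itself from wrapping on 32-bit inputs.)
   static Interval addTransfer(Interval A, Interval B) {
     return {A.Lo + B.Lo, A.Hi + B.Hi};
   }

   // Does the computed range escape the representable range of a 32-bit
   // C integer?  Which range applies depends on the declared signedness.
   static bool mayOverflow32(Interval R, bool IsSigned) {
     int64_t Min = IsSigned ? INT32_MIN : 0;
     int64_t Max = IsSigned ? INT32_MAX : (int64_t)UINT32_MAX;
     return R.Lo < Min || R.Hi > Max;
   }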

To do that I need to know the sign of each integer. However, we know
that LLVM IR does not keep sign information explicitly.

My understanding is that, for integers, the only instructions with
explicit sign information are the ICmp instructions (e.g., icmp ugt
...). Thus, sign information may be reconstructed from icmp
instructions. On the other hand, arithmetic instructions such as Add,
Sub, and Mul do not carry this information. (I'm aware of the "nsw" and
"nuw" flags, but they are optional.) As a result, it's possible to have
a straight-line piece of code, with no icmp instructions, where no sign
information can be inferred, yet potential overflows can still happen,
and I cannot tell whether the integer involved is signed or unsigned.
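
To illustrate the sort of reconstruction I mean, here is a rough sketch
of a per-value scan over a value's users. The names are my own, and the
exact headers and iteration API vary a bit between LLVM versions:

   // Sketch: collect sign "hints" for a value from the instructions that
   // use it.  This only recovers anything when an icmp or an nsw/nuw-flagged
   // arithmetic instruction happens to be present; straight-line code
   // without them yields Unknown.
   #include "llvm/IR/Instructions.h"
   #include "llvm/IR/Operator.h"
   #include "llvm/IR/Value.h"
   using namespace llvm;

   enum class SignHint { Unknown, Signed, Unsigned, Conflicting };

   static SignHint merge(SignHint Old, SignHint New) {
     if (Old == SignHint::Unknown) return New;
     if (New == SignHint::Unknown || Old == New) return Old;
     return SignHint::Conflicting;    // e.g. both slt and ult on the value
   }

   static SignHint inferSignHint(const Value *V) {
     SignHint H = SignHint::Unknown;
     for (const User *U : V->users()) {
       if (const auto *Cmp = dyn_cast<ICmpInst>(U)) {
         if (Cmp->isSigned())
           H = merge(H, SignHint::Signed);    // slt, sle, sgt, sge
         else if (Cmp->isUnsigned())
           H = merge(H, SignHint::Unsigned);  // ult, ule, ugt, uge
         // eq/ne say nothing about the sign
       } else if (const auto *OBO = dyn_cast<OverflowingBinaryOperator>(U)) {
         if (OBO->hasNoSignedWrap())
           H = merge(H, SignHint::Signed);
         if (OBO->hasNoUnsignedWrap())
           H = merge(H, SignHint::Unsigned);
       }
     }
     return H;
   }

Even with such a scan, arithmetic-only code with no comparisons and no
nsw/nuw flags simply comes back as Unknown, which is exactly my problem.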

Recently, somebody posted:

Some time ago I posted here a couple times about integer overflow
checking. Since then we (at Utah) joined forces with Vikram Adve and
his student Will Dietz who are also looking at integer issues. Between
the four of us we've done a lot of looking at and thinking about integer
overflow in C/C++, and have written up a paper containing some data and
observations. Perhaps it'll be interesting to people here:

   http://www.cs.utah.edu/~regehr/papers/overflow12.pdf

In this work, they implement a dynamic analysis for detecting integer
overflows in the front-end (using Clang). I believe they moved to the
front-end rather than implementing an LLVM pass precisely because of,
among other things, the sign problem I mention.

Is there a quick way to infer sign information without moving to the
front-end? Maybe by encoding some metadata?
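
For instance, if the front-end (or an early pass that still has access
to the C types) were willing to tag instructions, something along these
lines could carry the declared signedness down to my pass. The "sign"
key and the "signed"/"unsigned" strings are invented for this sketch,
and the metadata construction API has changed across LLVM releases:

   // Sketch: attach and read a custom !sign metadata node on an instruction.
   // Nothing in LLVM assigns this tag any meaning; only my own pass would
   // interpret it.
   #include "llvm/IR/Instruction.h"
   #include "llvm/IR/LLVMContext.h"
   #include "llvm/IR/Metadata.h"
   using namespace llvm;

   static void tagSign(Instruction &I, bool IsSigned) {
     LLVMContext &Ctx = I.getContext();
     MDNode *N = MDNode::get(
         Ctx, MDString::get(Ctx, IsSigned ? "signed" : "unsigned"));
     I.setMetadata("sign", N);        // prints on the instruction as !sign !N
   }

   static bool readSign(const Instruction &I, bool &IsSigned) {
     if (MDNode *N = I.getMetadata("sign")) {
       IsSigned = cast<MDString>(N->getOperand(0))->getString() == "signed";
       return true;
     }
     return false;                    // no tag: sign still unknown
   }

One caveat is that this only attaches to instructions, so function
arguments and other non-instruction values would need a different
encoding, and passes that don't understand the tag are free to drop it.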

Thanks!
Jorge

Hi Jorge,

> I'm implementing an integer bounds analysis (i.e., for each variable
> compute its lower and upper bound) as an LLVM pass. My objective is to
> identify potential overflows in both signed and unsigned C integers.
>
> To do that I need to know the sign of each integer.

there is no well-defined concept of the sign of an integer type in LLVM
IR. Thus I think this approach is misguided. Consider for example the
following C:

   void foo(int x, int y) {
     if ((x < y) || ((unsigned)x < (unsigned)y))
       abort();
   }

The casts simply don't appear at all in the LLVM IR. You get something
like this:

entry:
   %cmp = icmp slt i32 %x, %y
   %cmp1 = icmp ult i32 %x, %y
   %or.cond = or i1 %cmp, %cmp1
   br i1 %or.cond, label %if.then, label %if.end

if.then: ; preds = %entry
   tail call void @abort() noreturn nounwind
   unreachable

if.end: ; preds = %entry
   ret void

The general point I'm trying to make is that in C you change something
from signed to unsigned by casting. In LLVM IR none of those casts are
visible: it is always the same register that represents both the signed
and the unsigned C value. Thus you can't say that the register is signed
or unsigned, because it represents both signed and unsigned values
simultaneously. You may well see both signed and unsigned operations on
it (like the signed and unsigned icmp instructions above).

Ciao, Duncan.

> However, we know