Use smallest types in IR

Hi all,
I’m developing a compiler for a SIMD machine, so I want the types in the IR to be as small as possible, in order to fit more elements into the vector registers.
I understand that the C standard mandates integer promotion for char/short types, but I would have expected LLVM to optimize away the zext/sext + trunc pairs and use the original source-level types in many cases.
I’ve encountered many examples where the computation in the IR stays in i32, with what look like unnecessary promotions. For example:
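(A sketch of the shape, not my exact code; in my real cases the widened value has other uses, so the promotion survives optimization.)

  %a.ext = sext i8 %a to i32         ; char operands promoted to int
  %b.ext = sext i8 %b to i32
  %sum = add nsw i32 %a.ext, %b.ext  ; the add itself happens in i32
  %res = trunc i32 %sum to i8        ; result truncated back to char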

I was hoping an existing optimization already handles this, so I don’t have to implement an LLVM pass to coalesce the unnecessary promotions.

Thanks,
Alon

They’re not unnecessary promotions. If you change ‘char’ to ‘unsigned char’ in your first example, promotions are eliminated.

Adding with signed overflow is UB in C, whereas unsigned overflow is not. The promotion is what prevents the addition from overflowing in i8.
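For instance (a sketch, not your exact code), with unsigned char the whole promote-compute-truncate sequence collapses, because unsigned wraparound is well-defined:

  %sum = add i8 %a, %b   ; trunc(zext(%a) + zext(%b)) folded to a plain i8 add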

Without looking much at all: I wondered if this was canEvaluateTruncated in InstCombineCasts.cpp not being able to remove a shl+ashr pair. With unsigned types this would be an ‘and’ instead of two shifts.
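In other words, the signed case leaves this pair behind, where the unsigned case needs only a single mask (a sketch):

  %shl  = shl i32 %x, 24
  %sext = ashr i32 %shl, 24   ; together: sign-extend the low 8 bits of %x
  %zext = and i32 %x, 255     ; unsigned equivalent: zero-extend the low 8 bits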

I think this is also related to [InstCombine] sext(trunc(x)) should not be transformed to ashr(shl(trunc(x))) when there is no chance to fold · Issue #116019 · llvm/llvm-project · GitHub – we really shouldn’t be replacing sext with shl+ashr in this case in the first place. Fixing that issue will probably also fix this case.

Thanks for the replies!
Regarding the 2nd example,
I would have expected LLVM to use the smallest type for the computation (i16). So instead of:

  %16 = load i8, ptr %15, align 1, !dbg !41, !tbaa !42
  %17 = zext i8 %16 to i32, !dbg !41
  %19 = load i8, ptr %18, align 1, !dbg !45, !tbaa !42
  %20 = zext i8 %19 to i32, !dbg !45
  %21 = mul nuw nsw i32 %20, %17, !dbg !46
  %22 = icmp samesign ugt i32 %21, %10, !dbg !48
  %23 = select i1 %22, i32 %21, i32 0, !dbg !49
  %25 = load i8, ptr %24, align 1, !dbg !50, !tbaa !42
  %26 = zext i8 %25 to i32, !dbg !50
  %27 = mul nuw nsw i32 %23, %26, !dbg !51
  %28 = trunc i32 %27 to i16, !dbg !52

I would expect it to use i16 in all of the intermediate computations like so:

  %16 = load i8, ptr %15, align 1, !dbg !41, !tbaa !42
  %17 = zext i8 %16 to i16, !dbg !41
  %19 = load i8, ptr %18, align 1, !dbg !45, !tbaa !42
  %20 = zext i8 %19 to i16, !dbg !45
  %21 = mul nuw i16 %20, %17, !dbg !46   ; no nsw: 255 * 255 = 65025 overflows signed i16
  %t10 = trunc i32 %10 to i16            ; %10 is i32 above, so it needs a trunc (assuming its value fits i16)
  %22 = icmp ugt i16 %21, %t10, !dbg !48 ; samesign dropped: %21 may have its i16 sign bit set
  %23 = select i1 %22, i16 %21, i16 0, !dbg !49
  %25 = load i8, ptr %24, align 1, !dbg !50, !tbaa !42
  %26 = zext i8 %25 to i32, !dbg !50
  %zx = zext i16 %23 to i32
  %27 = mul nuw nsw i32 %zx, %26, !dbg !51
  %28 = trunc i32 %27 to i16, !dbg !52

Perhaps there is some hidden cost model here. I prefer as much of the computation as possible to be done on small types: my machine is SIMD, so smaller types let me make the most of its vector registers.

In the code example, if you want to use i16 instead of i32 to do mul+cmp+sel, you have to prove that i16 definitely has no overflow issue; otherwise it might be a correctness issue. Am I right?

I’m multiplying 2 unsigned chars, so the result will not overflow in i16 (at most 255 * 255 = 65025, which fits in 16 bits):

  %17 = zext i8 %16 to i32, !dbg !41
  %20 = zext i8 %19 to i32, !dbg !45
  %21 = mul nuw nsw i32 %20, %17, !dbg !46

It actually multiplies twice, so e.g. if in1[x] == in2[x] == in3[x] == 0xFF, the full result (0xFF * 0xFF * 0xFF = 16581375) would overflow i16.
Still, even after removing temp * in3[x] and using just temp, the code is promoted to 32 bits. That does seem odd: both x86 and armv8 do the same, even though, when the multiply and select operations are handled separately, they do use i16.

Right, I didn’t mean the whole computation would be in i16. The first mul can be done in i16, since it won’t overflow. Then we can cmp+select in i16, and zext to i32 for the last multiplication.

If there is no overflow issue, the shorter size might be more efficient in SIMD mode. But frequent trunc or sext in SIMD is still expensive and needs to be taken into account: for example, vpmovsxbd, vpmovdb, and the pack instructions on x86 AVX-512 (Alder Lake) are all >= 3 cycles. Do we have any way to measure the total cost when narrowing only part of the computation?
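For reference, in IR terms these conversions are vector ext/trunc casts (a sketch; the instruction mapping assumes 512-bit AVX-512 vectors):

  %ext = sext <16 x i8> %v to <16 x i32>    ; lowers to vpmovsxbd
  %nar = trunc <16 x i32> %w to <16 x i8>   ; lowers to vpmovdb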

I agree we should have some cost model here; I’m not sure this is something LLVM currently models.
The overhead in my example is a single zext i16 -> i32 operation, while many instructions are done in i16 instead of i32, which is better on my HW.
So instead of:

  %17 = zext i8 %16 to i32
  %20 = zext i8 %19 to i32
  %21 = mul nuw nsw i32 %20, %17
  %22 = icmp samesign ugt i32 %21, %10
  %23 = select i1 %22, i32 %21, i32 0
  %26 = zext i8 %25 to i32
  %27 = mul nuw nsw i32 %23, %26
  %28 = trunc i32 %27 to i16

I prefer:

  %17 = zext i8 %16 to i16
  %20 = zext i8 %19 to i16
  %21 = mul nuw i16 %20, %17    ; no nsw: 255 * 255 overflows signed i16
  %t10 = trunc i32 %10 to i16   ; %10 is i32 above, so it needs a trunc
  %22 = icmp ugt i16 %21, %t10  ; samesign dropped: %21 may have its i16 sign bit set
  %23 = select i1 %22, i16 %21, i16 0
  %extra_zext = zext i16 %23 to i32
  %26 = zext i8 %25 to i32
  %27 = mul nuw nsw i32 %extra_zext, %26
  %28 = trunc i32 %27 to i16

The example is in scalar mode, right? I’m just curious why i16 is better than i32 in scalar mode; it doesn’t seem to reduce the number of operations executed. Is an i16 instruction itself cheaper than the same operation on i32 on your HW?

The example is scalar just to simplify. Indeed, I prefer i16 over i32 when the code gets vectorized, not when it remains scalar.
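E.g. with 128-bit vector registers (a sketch; the point is the lane count, not a specific target), the narrowed form processes twice as many elements per instruction:

  %wide   = mul <4 x i32> %a, %b   ; 4 elements per instruction at i32
  %narrow = mul <8 x i16> %c, %d   ; 8 elements per instruction at i16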