Use smallest types in IR

Hi all,
I’m developing a compiler for a SIMD machine, so I want the types in the IR to be as small as possible, in order to fit more elements into the vector registers.
I understand that the C standard mandates integer promotion for char/short types, but I would have expected LLVM to optimize away the zext/sext + trunc pairs and use the original source-level types in many cases.
I’ve encountered many examples where the computation in the IR stays in i32, with what look like unnecessary promotions. For example:
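(A sketch of the shape, not my exact code; in my real cases the widened value has other uses, so the promotion survives optimization.)

  %a.ext = sext i8 %a to i32         ; char operands promoted to int
  %b.ext = sext i8 %b to i32
  %sum = add nsw i32 %a.ext, %b.ext  ; the add itself happens in i32
  %res = trunc i32 %sum to i8        ; result truncated back to char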

I was hoping an existing optimization already handles this, so I don’t have to implement an LLVM pass to coalesce the unnecessary promotions.

Thanks,
Alon

They’re not unnecessary promotions. If you change ‘char’ to ‘unsigned char’ in your first example, promotions are eliminated.

Adding with signed overflow is UB in C, whereas unsigned overflow is not. The promotion is what prevents the addition from overflowing in i8.
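For instance (a sketch, not your exact code), with unsigned char the whole promote-compute-truncate sequence collapses, because unsigned wraparound is well-defined:

  %sum = add i8 %a, %b   ; trunc(zext(%a) + zext(%b)) folded to a plain i8 add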

Without looking much at all: I wondered if this was canEvaluateTruncated in InstCombineCasts.cpp not being able to remove a shl+ashr pair. With unsigned types this would be an ‘and’ instead of two shifts.
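In other words, the signed case leaves this pair behind, where the unsigned case needs only a single mask (a sketch):

  %shl  = shl i32 %x, 24
  %sext = ashr i32 %shl, 24   ; together: sign-extend the low 8 bits of %x
  %zext = and i32 %x, 255     ; unsigned equivalent: zero-extend the low 8 bits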

I think this is also related to [InstCombine] sext(trunc(x)) should not be transformed to ashr(shl(trunc(x))) when there is no chance to fold · Issue #116019 · llvm/llvm-project · GitHub – we really shouldn’t be replacing sext with shl+ashr in this case in the first place. Fixing that issue will probably also fix this case.

Thanks for the replies!
Regarding the 2nd example,
I would have expected LLVM to use the smallest type for the computation (i16). So instead of:

  %16 = load i8, ptr %15, align 1, !dbg !41, !tbaa !42
  %17 = zext i8 %16 to i32, !dbg !41
  %19 = load i8, ptr %18, align 1, !dbg !45, !tbaa !42
  %20 = zext i8 %19 to i32, !dbg !45
  %21 = mul nuw nsw i32 %20, %17, !dbg !46
  %22 = icmp samesign ugt i32 %21, %10, !dbg !48
  %23 = select i1 %22, i32 %21, i32 0, !dbg !49
  %25 = load i8, ptr %24, align 1, !dbg !50, !tbaa !42
  %26 = zext i8 %25 to i32, !dbg !50
  %27 = mul nuw nsw i32 %23, %26, !dbg !51
  %28 = trunc i32 %27 to i16, !dbg !52

I would expect it to use i16 in all of the intermediate computations like so:

  %16 = load i8, ptr %15, align 1, !dbg !41, !tbaa !42
  %17 = zext i8 %16 to i16, !dbg !41
  %19 = load i8, ptr %18, align 1, !dbg !45, !tbaa !42
  %20 = zext i8 %19 to i16, !dbg !45
  %21 = mul nuw i16 %20, %17, !dbg !46   ; no nsw: 255 * 255 = 65025 overflows signed i16
  %t10 = trunc i32 %10 to i16            ; %10 is i32 above, so it needs a trunc (assuming its value fits i16)
  %22 = icmp ugt i16 %21, %t10, !dbg !48 ; samesign dropped: %21 may have its i16 sign bit set
  %23 = select i1 %22, i16 %21, i16 0, !dbg !49
  %25 = load i8, ptr %24, align 1, !dbg !50, !tbaa !42
  %26 = zext i8 %25 to i32, !dbg !50
  %zx = zext i16 %23 to i32
  %27 = mul nuw nsw i32 %zx, %26, !dbg !51
  %28 = trunc i32 %27 to i16, !dbg !52

Perhaps there is some hidden cost model here. I prefer as much of the computation as possible to be done on small types: my machine is SIMD, so smaller types let me make the most of its vector registers.

In the code example, if you want to use i16 instead of i32 to do mul+cmp+sel, you have to prove that i16 definitely has no overflow issue; otherwise it might be a correctness issue. Am I right?

I’m multiplying 2 unsigned chars, so the result will not overflow in i16 (at most 255 * 255 = 65025, which fits in 16 bits):

  %17 = zext i8 %16 to i32, !dbg !41
  %20 = zext i8 %19 to i32, !dbg !45
  %21 = mul nuw nsw i32 %20, %17, !dbg !46

It actually multiplies twice, so e.g. if in1[x] == in2[x] == in3[x] == 0xFF, the full result (0xFF * 0xFF * 0xFF = 16581375) would overflow i16.
Still, even after removing temp * in3[x] and using just temp, the code is promoted to 32 bits. That does seem odd: both x86 and armv8 do the same, even though, when the multiply and select operations are handled separately, they do use i16.

Right, I didn’t mean the whole computation would be in i16. The first mul can be done in i16, since it won’t overflow. Then we can cmp+select in i16, and zext to i32 for the last multiplication.

If there is no overflow issue, the shorter size might be more efficient in SIMD mode. But frequent trunc or sext in SIMD is still expensive and needs to be taken into account: for example, vpmovsxbd, vpmovdb, and the pack instructions on x86 AVX-512 (Alder Lake) are all >= 3 cycles. Do we have any way to measure the total cost when narrowing only part of the computation?
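For reference, in IR terms these conversions are vector ext/trunc casts (a sketch; the instruction mapping assumes 512-bit AVX-512 vectors):

  %ext = sext <16 x i8> %v to <16 x i32>    ; lowers to vpmovsxbd
  %nar = trunc <16 x i32> %w to <16 x i8>   ; lowers to vpmovdb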

I agree we should have some cost model here; I’m not sure this is something LLVM currently models.
The overhead in my example is a single zext i16 -> i32 operation, while many instructions are done in i16 instead of i32, which is better on my HW.
So instead of:

  %17 = zext i8 %16 to i32
  %20 = zext i8 %19 to i32
  %21 = mul nuw nsw i32 %20, %17
  %22 = icmp samesign ugt i32 %21, %10
  %23 = select i1 %22, i32 %21, i32 0
  %26 = zext i8 %25 to i32
  %27 = mul nuw nsw i32 %23, %26
  %28 = trunc i32 %27 to i16

I prefer:

  %17 = zext i8 %16 to i16
  %20 = zext i8 %19 to i16
  %21 = mul nuw i16 %20, %17    ; no nsw: 255 * 255 overflows signed i16
  %t10 = trunc i32 %10 to i16   ; %10 is i32 above, so it needs a trunc
  %22 = icmp ugt i16 %21, %t10  ; samesign dropped: %21 may have its i16 sign bit set
  %23 = select i1 %22, i16 %21, i16 0
  %extra_zext = zext i16 %23 to i32
  %26 = zext i8 %25 to i32
  %27 = mul nuw nsw i32 %extra_zext, %26
  %28 = trunc i32 %27 to i16

The example is in scalar mode, right? I’m just curious why i16 is better than i32 in scalar mode; it doesn’t seem to reduce the number of operations executed. Is an i16 instruction itself cheaper than the same operation on i32 on your HW?

The example is scalar just to simplify. Indeed, I prefer i16 over i32 when the code gets vectorized, not when it remains scalar.
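E.g. with 128-bit vector registers (a sketch; the point is the lane count, not a specific target), the narrowed form processes twice as many elements per instruction:

  %wide   = mul <4 x i32> %a, %b   ; 4 elements per instruction at i32
  %narrow = mul <8 x i16> %c, %d   ; 8 elements per instruction at i16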