always allow canonicalizing to 8- and 16-bit ops?

Example:
define i8 @narrow_add(i8 %x, i8 %y) {
%x32 = zext i8 %x to i32
%y32 = zext i8 %y to i32
%add = add nsw i32 %x32, %y32
%tr = trunc i32 %add to i8
ret i8 %tr
}

With no data-layout or with an x86 target where 8-bit integer is in the data-layout, we reduce to:

$ ./opt -instcombine narrowadd.ll -S
define i8 @narrow_add(i8 %x, i8 %y) {
%add = add i8 %x, %y
ret i8 %add
}

But on a target that has 32-bit registers without explicit subregister ops, we don’t do that transform because we avoid changing operations from a legal (as specified in the data-layout) width to an illegal width - see InstCombiner::shouldChangeType().

Should we make an exception to allow narrowing for the common cases of i8 and i16?

In the motivating example from PR35875 ( https://bugs.llvm.org/show_bug.cgi?id=35875 ), an ARM target is stuck at 19 IR instructions:

declare void @use4(i8, i8, i8, i8)
define void @min_of_3_vals(i8 %x, i8 %y, i8 %z) {
%nx = xor i8 %x, -1
%ny = xor i8 %y, -1
%nz = xor i8 %z, -1
%zx = zext i8 %nx to i32
%zy = zext i8 %ny to i32
%zz = zext i8 %nz to i32

%cmpxz = icmp ult i32 %zx, %zz
%minxz = select i1 %cmpxz, i32 %zx, i32 %zz
%cmpyz = icmp ult i32 %zy, %zz
%minyz = select i1 %cmpyz, i32 %zy, i32 %zz
%cmpyx = icmp ult i8 %y, %x
%minxyz = select i1 %cmpyx, i32 %minxz, i32 %minyz
%tr_minxyz = trunc i32 %minxyz to i8

%new_zx = sub nsw i32 %zx, %minxyz
%new_zy = sub nsw i32 %zy, %minxyz
%new_zz = sub nsw i32 %zz, %minxyz
%new_x = trunc i32 %new_zx to i8
%new_y = trunc i32 %new_zy to i8
%new_z = trunc i32 %new_zz to i8

call void @use4(i8 %tr_minxyz, i8 %new_x, i8 %new_y, i8 %new_z)
ret void
}

…but x86 gets to shrink the subs which leads to a bunch of other transforms, and we grind this down to 10 instructions between instcombine and early-cse:

define void @min_of_3_vals(i8 %x, i8 %y, i8 %z) {
%nx = xor i8 %x, -1
%ny = xor i8 %y, -1
%nz = xor i8 %z, -1
%cmpxz = icmp ult i8 %nx, %nz
%minxz = select i1 %cmpxz, i8 %nx, i8 %nz
%1 = icmp ult i8 %minxz, %ny
%minxyz = select i1 %1, i8 %minxz, i8 %ny
%new_x = sub i8 %nx, %minxyz
%new_y = sub i8 %ny, %minxyz
%new_z = sub i8 %nz, %minxyz

call void @use4(i8 %minxyz, i8 %new_x, i8 %new_y, i8 %new_z)
ret void
}

Hello

Thanks for looking into this.

I can't be very confident what the knock on result of a change like that would be,
especially on architectures that are not Arm. What I can do though, is run some
benchmarks and look at that results.

Using this patch:

--- a/lib/Transforms/InstCombine/InstructionCombining.cpp
+++ b/lib/Transforms/InstCombine/InstructionCombining.cpp
@@ -150,6 +150,9 @@ bool InstCombiner::shouldChangeType(unsigned FromWidth,
   bool FromLegal = FromWidth == 1 || DL.isLegalInteger(FromWidth);
   bool ToLegal = ToWidth == 1 || DL.isLegalInteger(ToWidth);

+ if (FromLegal && ToWidth < FromWidth && (ToWidth == 8 || ToWidth == 16))
+ return true;

Thanks for the perf testing. I assume that DAG legalization is equipped to handle these cases fairly well, or someone would’ve complained by now…

FWIW (and at least some of this can be blamed on me), instcombine already does the narrowing transforms without checking shouldChangeType() for binops like and/or/xor/udiv. The justification was that narrower ops are always better for bit-tracking. If we can eliminate casts, then that improves analysis even more and allows more follow-on transforms.

If there are no objections here, I think you can post your patch for review on Phabricator. If we can squash any of the regressions before that goes in, that would be even better.

Hi Sanjay,

Thank you for your help.

I have looked at one of Cortex-M regressions David talked about. The patch improves IR:

====Before:

%conv10 = zext i8 %40 to i32

%xor = xor i32 %conv10, %conv14

%conv16 = trunc i32 %xor to i8

store i8 %conv16, i8* %storemerge185, align 1

====After:

%xor = xor i8 %40, %conv14

store i8 %xor, i8* %storemerge185, align 1

However it affects register allocation causing more spill/fills. I am trying to figure out why it happens.

Thanks,

Evgeny Astigeevich

The current implementation of 64-bit code generation for RISC-V has
i64 as the only legal type - with Krzysztof's variable-size register
class support, simply switching GPRs to i64 was the path of least
resistance for getting working codegen. In that case, it would be
interesting to explore allowing narrowing for i32 as well.

I should note: RV64I codegen isn't yet upstreamed primarily as I want
to do more investigation and thinking about the pros/cons of this
approach and discuss on llvm-dev. I don't want to derail this thread
with that discussion.

Best,

Alex

Sounds good. I've put up D42424.

In the mean time, I will keep looking at the regressions. But as Evgeny mentioned, this may be registry allocation having a worse time on these register constrained targets.

We will see.

Cheers,
Dave