Optimizations hindered by GVN widening

Currently, the GVN optimization widens loads if the alignment permits it. There are some limitations that show up, as seen in example below:

Initial IR:

declare void @consume(i8) readonly

define i8 @bar(i8* align 2 %a) {
  %1 = load i8, i8* %a
  %idx = getelementptr i8, i8* %a, i64 1
  %2 = load i8, i8* %idx, align 1
  call void @consume(i8 %1).
  ret i8 %2
}

define i1 @foo(i8* %a) {
entry:
  %0 = call i8 @bar(i8* %a)
  %1 = icmp eq i8 %0, 0
  br i1 %1, label %cont.1, label %exit

cont.1:
  store i8 0, i8* %a
  %2 = call i8 @bar(i8* %a)
  %3 = icmp eq i8 %2, 0
  ret i1 %3

exit:
  ret i1 true
}

Since %a is 2 byte aligned, GVN widens the loads in bar, then bar() gets inlined into foo. The resulting final IR at O3:

define i8 @bar(i8* align 2 %a) {
  %1 = bitcast i8* %a to i16*
  %2 = load i16, i16* %1, align 2
  %3 = trunc i16 %2 to i8
  %idx = getelementptr i8, i8* %a, i64 1
  %4 = lshr i16 %2, 8
  %5 = trunc i16 %4 to i8
  call void @consume(i8 %3)
  ret i8 %5
}
define i1 @foo(i8* %a) {
entry:
  %0 = bitcast i8* %a to i16*
  %1 = load i16, i16* %0, align 2 <— widened load from bar()
  %2 = trunc i16 %1 to i8
  call void @consume(i8 %2)
  %3 = icmp ult i16 %1, 256
  br i1 %3, label %cont.1, label %exit

cont.1: ; preds = %entry
  store i8 0, i8* %a, align 2
  %4 = load i16, i16* %0, align 2 <— widened load from bar()
  %5 = trunc i16 %4 to i8
  call void @consume(i8 %5)
  %6 = icmp ult i16 %4, 256
  ret i1 %6

exit: ; preds = %entry
  ret i1 true
}

In the absence of GVN widening (we can see this when %a is 1 byte aligned in bar), bar is inlined into foo as-is.
Final IR at O3:
define i8 @bar(i8* align 1 %a) {
  %1 = load i8, i8* %a, align 1 <— align is 1, so GVN does not widen load
  %idx = getelementptr i8, i8* %a, i64 1
  %2 = load i8, i8* %idx, align 1
  call void @consume(i8 %1)
  ret i8 %2
}
  
define i1 @foo(i8* %a) {
entry:
  %0 = load i8, i8* %a, align 1
  %idx.i = getelementptr i8, i8* %a, i64 1
  %1 = load i8, i8* %idx.i, align 1 <— both loads exist non-widened from bar(). Used in GVN for removing the loads in cont.1 BB.
  call void @consume(i8 %0)
  %2 = icmp eq i8 %1, 0
  br i1 %2, label %cont.1, label %exit
  
cont.1: ; preds = %entry
  store i8 0, i8* %a, align 1
; both the loads are removed. First load has the value fed from the store above. Second load is same as the one in entry BB (%1).
  call void @consume(i8 0)
  ret i1 true

exit: ; preds = %entry
  ret i1 true
}

Here the 2 loads in the cont.1 basic block are removed by GVN, since the loads are not widened to a single load. Also, GVN and later optimizations does value forwarding of %a (0) and also the compare folds to true.

This does not happen when GVN widens the load. Note that at the time GVN did the widening, it was a good transform since we have a single load instead of 2. However, inlining and later optimizations proved that leaving the load non-widened was more beneficial.

Any thoughts on how we could mitigate these problems introduced by GVN widening?

Thanks,
Anna