Hi all,
I’ve run into a case where the optimizer has trouble doing the “obvious” thing.
Consider this code:
define i16 @foo(<8 x i16>* dereferenceable(16) %egress, <16 x i8> %a0) {
  %a1 = icmp slt <16 x i8> %a0, zeroinitializer
  %a2 = bitcast <16 x i1> %a1 to i16
  %astore = getelementptr inbounds <8 x i16>, <8 x i16>* %egress, i64 0, i64 7
  ;store i16 %a2, i16* %astore
  ret i16 %a2
}
The optimizer recognizes this and llc nicely outputs a vpmovmskb instruction:
foo:                                    # @foo
        vpmovmskb eax, xmm0
        ret
Writing to the output vector also works well:
define void @writing(<8 x i16>* dereferenceable(16) %egress, <16 x i8> %a0) {
  %astore = getelementptr inbounds <8 x i16>, <8 x i16>* %egress, i64 0, i64 7
  store i16 123, i16* %astore
  ret void
}
outputs:
writing:                                # @writing
        mov word ptr [rdi + 14], 123
        ret
Now, combining these two by uncommenting the store in foo()
suddenly results in a very large function, instead of just:
        vpmovmskb eax, xmm0
        mov word ptr [rdi + 14], ax
        ret
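To be explicit, the combined function I mean is simply the first listing with the store line uncommented. Nothing else is changed; it keeps the name @foo and the same typed-pointer syntax as above:

```llvm
define i16 @foo(<8 x i16>* dereferenceable(16) %egress, <16 x i8> %a0) {
  ; One i1 per byte lane: set when the byte is negative (sign bit set).
  %a1 = icmp slt <16 x i8> %a0, zeroinitializer
  ; Pack the 16 lane results into a single 16-bit mask.
  %a2 = bitcast <16 x i1> %a1 to i16
  ; Address of element 7 of the output vector (byte offset 14).
  %astore = getelementptr inbounds <8 x i16>, <8 x i16>* %egress, i64 0, i64 7
  ; This store is the only difference from the first listing.
  store i16 %a2, i16* %astore
  ret i16 %a2
}
```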
Is there something wrong with my IR, or is the optimizer somehow confused? Can I rewrite the code so that the optimizer handles it?
Godbolt link: https://llvm.godbolt.org/z/OgExDk
Thanks a lot for the help.
Cheers,
Johan