Vectorizer has trouble with vpmovmskb and store

Hi all,
I’ve run into a case where the optimizer seems to be having trouble doing the “obvious” thing.

Consider this code:

define i16 @foo(<8 x i16>* dereferenceable(16) %egress, <16 x i8> %a0) {
  %a1 = icmp slt <16 x i8> %a0, zeroinitializer
  %a2 = bitcast <16 x i1> %a1 to i16
  %astore = getelementptr inbounds <8 x i16>, <8 x i16>* %egress, i64 0, i64 7
  ;store i16 %a2, i16* %astore
  ret i16 %a2
}

The optimizer recognizes this and llc nicely outputs a vpmovmskb instruction:

foo: # @foo
vpmovmskb eax, xmm0

Writing to the output vector also works well:

define void @writing(<8 x i16>* dereferenceable(16) %egress, <16 x i8> %a0) {
  %astore = getelementptr inbounds <8 x i16>, <8 x i16>* %egress, i64 0, i64 7
  store i16 123, i16* %astore
  ret void
}

writing: # @writing
mov word ptr [rdi + 14], 123

Now, combining these two by uncommenting the store in foo() suddenly results in a very large function, instead of just:

vpmovmskb eax, xmm0
mov word ptr [rdi + 14], ax
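
For reference, here is a rough scalar C++ model of what foo() computes once the store is enabled. It is only a sketch for checking results, not something from the original post: the name foo_model and the array-based signature are made up for illustration, and it assumes the little-endian layout where element 0 of the <16 x i1> vector lands in bit 0 of the i16.

```cpp
#include <cstdint>

// Hypothetical scalar model of foo() with the store uncommented.
// Bit i of the result is set when byte i of a0 is negative, which is what
// icmp slt against zeroinitializer followed by the bitcast to i16 yields
// (assuming element 0 maps to the least significant bit).
std::uint16_t foo_model(std::uint16_t egress[8], const std::int8_t a0[16]) {
    std::uint16_t mask = 0;
    for (int i = 0; i < 16; ++i) {
        if (a0[i] < 0) {
            mask |= static_cast<std::uint16_t>(1u << i);
        }
    }
    // Element 7 of the <8 x i16> vector sits at byte offset 14,
    // matching the [rdi + 14] address in the expected assembly.
    egress[7] = mask;
    return mask;
}
```

Only element 7 of the output is written; the other seven elements are left alone, which is why a single 16-bit store should be enough.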

Is there something wrong with my IR code, or is the optimizer somehow confused? Can I rewrite the code such that the optimizer does understand?

Godbolt link:

Thanks a lot for the help.

Here’s a quick patch that fixes this. I don’t know how to avoid it in IR. I haven’t run any other tests, but it does fix your case. I’ll try to put up a real Phabricator review tonight or tomorrow.

diff --git a/lib/Target/X86/X86ISelLowering.cpp b/lib/Target/X86/X86ISelLowering.cpp
index e31f2a6..d79c0be 100644
--- a/lib/Target/X86/X86ISelLowering.cpp
+++ b/lib/Target/X86/X86ISelLowering.cpp
@@ -4837,6 +4837,11 @@ bool X86TargetLowering::isCheapToSpeculateCtlz() const {
 
 bool X86TargetLowering::isLoadBitCastBeneficial(EVT LoadVT,
                                                 EVT BitcastVT) const {
+  if (!LoadVT.isVector() && BitcastVT.isVector() &&
+      BitcastVT.getVectorElementType() == MVT::i1 &&
+      !Subtarget.hasAVX512())
+    return false;

We should handle this a lot better after r34763

Hello Craig,
Thank you for the quick response and fix.
However, the improvement turns out to be quite fragile. If I run opt on the original testcase and then run its output through llc, I get the previous very long assembly output again. (Things work for a bitcast from <16 x i1> to i16, but not for a store through a <16 x i1>* pointer.)
Godbolt link:


I was afraid of that. I thought I had checked whether InstCombine would remove the bitcast here, but I guess I didn’t or didn’t do it right. I’ll see what I can do to fix this.

OK, I’ve made another fix in r348104.