Using intrinsics with memory operands

Hi all,

I was wondering how to use variations of intrinsic functions that take a memory operand.

Take for example the SSE4.1 pmovsxbd instruction. One variant takes two XMM registers, while another takes a 32-bit memory location as source operand. The latter is quite interesting if you know you’re reading from memory anyway, and especially if that memory isn’t 16-byte aligned. It looks like LLVM’s Intrinsic::x86_sse41_pmovsxbd expects a v16i8 as source operand though. So how do I get the variant that takes a memory operand?
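For context, the behavior the memory-operand form provides (sign-extending four packed bytes read from a possibly unaligned 32-bit location into four 32-bit lanes) can be modeled in portable C. This is just an illustrative sketch of the semantics, not LLVM code; the function name is made up:

```c
#include <stdint.h>
#include <string.h>

/* Portable model of the memory-operand form of pmovsxbd:
 * read four signed bytes from p (no alignment requirement) and
 * sign-extend each one into a 32-bit lane. */
static void pmovsxbd_model(const void *p, int32_t out[4]) {
    int8_t b[4];
    memcpy(b, p, 4);        /* the 32-bit memory operand */
    for (int i = 0; i < 4; i++)
        out[i] = b[i];      /* implicit sign extension to i32 */
}
```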

Thanks a bunch,

Nicolas Capens

I tried adding the following to IntrinsicsX86.td:

def int_x86_sse41_pmovsxbd_m : GCCBuiltin<"__builtin_ia32_pmovsxbd128_m">,
          Intrinsic<[llvm_v4i32_ty, llvm_ptr_ty],
                    [IntrReadMem]>;

But while I now have an Intrinsic::x86_sse41_pmovsxbd_m that I can use for ‘calling’ the intrinsic, I’m getting a “cannot yet select” assert. Any clues highly appreciated.

> I was wondering how to use variations of intrinsic functions that take a
> memory operand.

Often, for intrinsics where it matters, there's a variant of the
intrinsic that takes a pointer operand that you can use, although it
looks like there isn't one here.

> Take for example the SSE4.1 pmovsxbd instruction. One variant takes two XMM
> registers, while another has a 32-bit memory location as source operand. The
> latter is quite interesting if you know you're reading from memory anyway,
> and if it's not 16-byte aligned. It looks like LLVM's
> Intrinsic::x86_sse41_pmovsxbd expects a v16i8 as source operand though. So
> how do I achieve using the variant taking a memory operand?

A load+insertelement+pmovsx sequence should codegen into a single
instruction, but it looks like that isn't working. I guess the
pattern-matching magic should kick in and take care of this, but that
doesn't seem to be working for a simple example like the following:

target datalayout =
"e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:64:64-v128:128:128-a0:0:64-f80:32:32"
target triple = "i386-pc-linux-gnu"

define <4 x i32> @a(i32* %x) nounwind {
entry:
  load i32* %x, align 4 ; <i32>:0 [#uses=1]
  insertelement <4 x i32> undef, i32 %0, i32 0 ; <<4 x i32>>:1 [#uses=1]
  bitcast <4 x i32> %1 to <16 x i8> ; <<16 x i8>>:2 [#uses=1]
  tail call <4 x i32> @llvm.x86.sse41.pmovsxbd( <16 x i8> %2 ) nounwind
readnone ; <<4 x i32>>:3 [#uses=1]
  ret <4 x i32> %3
}

declare <4 x i32> @llvm.x86.sse41.pmovsxbd(<16 x i8>) nounwind readnone

I think the issue is that the pattern for the memory operand of
pmovsxbd isn't flexible enough to see through the scalar_to_vector
step.
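If I'm reading the .td files right, a pattern along these lines in X86InstrSSE.td might teach the selector to see through the scalar_to_vector step. This is an untested sketch; PMOVSXBDrm is assumed to be the name of the memory-operand instruction definition:

```tablegen
// Sketch: match the register-form intrinsic applied to a bitcast of a
// scalar_to_vector of a 32-bit load, and select the memory-operand
// form of the instruction instead.
def : Pat<(int_x86_sse41_pmovsxbd (bitconvert (v4i32 (scalar_to_vector
                                   (loadi32 addr:$src))))),
          (PMOVSXBDrm addr:$src)>;
```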

-Eli

Eli is correct. This is a deficiency in the matching code, and we don't want variants of intrinsics which take memory operands. We often have to add code matching scalar_to_vector and/or bit_convert explicitly. Perhaps we should have tablegen produce matching code that checks for these nodes.

Evan