X86 intrinsics: _mm_storel_epi64/_mm_loadl_epi64 with -m32


I'm using _mm_storel_epi64/_mm_loadl_epi64 in the test case below and
generating 32-bit code (using -m32 and -msse4.2). The 64-bit load and
64-bit store operations are each replaced with two 32-bit mov
instructions, presumably due to the use of the uint64_t type. If I use
__m128i instead of uint64_t everywhere, then the read and write happen
as 64-bit operations using the xmm registers, as expected.

#include <stdint.h>
#include <emmintrin.h>

void indvbl_write64(volatile void *p, uint64_t v)
{
        __m128i tmp = _mm_loadl_epi64((__m128i const *)&v);
        _mm_storel_epi64((__m128i *)p, tmp);
}

uint64_t indvbl_read64(volatile void *p)
{
        __m128i tmp = _mm_loadl_epi64((__m128i const *)p);
        return *(uint64_t *)&tmp;
}
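
For comparison, here is a minimal sketch of the __m128i-only variant
mentioned earlier. The names vec_write64 and vec_read64 are made up
for illustration and are not part of my original test file; the
intent is only to show the value travelling as __m128i from end to
end instead of passing through uint64_t.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadl_epi64, _mm_storel_epi64 */
#include <stdint.h>
#include <string.h>

/* Hypothetical names; same intrinsics as the test case, but the value
   is passed and returned as __m128i rather than uint64_t. */
void vec_write64(volatile void *p, __m128i v)
{
        /* Stores only the low 64 bits of v to *p. */
        _mm_storel_epi64((__m128i *)p, v);
}

__m128i vec_read64(volatile void *p)
{
        /* Loads 64 bits from *p into the low half; upper half is zeroed. */
        return _mm_loadl_epi64((__m128i const *)p);
}
```

Only the low 64 bits of the vector are meaningful in this variant; the
caller is responsible for packing and unpacking them.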

Options used to compile: clang -O2 -c -msse4.2 -m32 test.c

Generated code:

00000000 <indvbl_write64>:
   0: 8b 44 24 08 mov 0x8(%esp),%eax
   4: 8b 54 24 04 mov 0x4(%esp),%edx
   8: 8b 4c 24 0c mov 0xc(%esp),%ecx
   c: 89 4a 04 mov %ecx,0x4(%edx)
   f: 89 02 mov %eax,(%edx)
  11: c3 ret
  12: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%eax,%eax,1)
  19: 00 00 00
  1c: 0f 1f 40 00 nopl 0x0(%eax)

00000020 <indvbl_read64>:
  20: 8b 4c 24 04 mov 0x4(%esp),%ecx
  24: 8b 01 mov (%ecx),%eax
  26: 8b 51 04 mov 0x4(%ecx),%edx
  29: c3 ret

The front end generates insertelement <2 x i64> and extractelement <2
x i64> for the loads and stores as expected, and the optimizer reduces
these to load i64 and store i64, which are then lowered into 32-bit
move instructions during instruction selection.
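
As a rough C-level analogy (my own illustration, not compiler output,
and the helper names are made up), once the vector insert/extract
pairs are folded away, what is left behaves like a plain 64-bit
integer copy with no vector types involved. On a 32-bit x86 target,
i64 is not a legal integer type, which would be consistent with each
copy being split into two 32-bit moves:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: a plain 64-bit copy, no SSE types anywhere. */
void plain_write64(void *p, uint64_t v)
{
        memcpy(p, &v, sizeof v);   /* corresponds to a store of i64 */
}

uint64_t plain_read64(const void *p)
{
        uint64_t v;
        memcpy(&v, p, sizeof v);   /* corresponds to a load of i64 */
        return v;
}
```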

Would it be possible and safe to generate a single 64-bit load/store
in this case with -m32? If so, please may I have some pointers to
related parts of the code I should be looking at to make this