[x86] Prefetch intrinsics and prefetchw

Hi,

I am looking at how the PREFETCHW instruction is matched to the IR prefetch intrinsic (and __builtin_prefetch).

Consider this C program:
char foo[100];
int bar(void) {
    __builtin_prefetch(foo, 0, 0);
    __builtin_prefetch(foo, 0, 1);
    __builtin_prefetch(foo, 0, 2);
    __builtin_prefetch(foo, 0, 3);

    __builtin_prefetch(foo, 1, 0);
    __builtin_prefetch(foo, 1, 1);
    __builtin_prefetch(foo, 1, 2);
    __builtin_prefetch(foo, 1, 3);

    *foo = 1;

    return foo[0];
}

The generated IR for the prefetches follows this pattern:

  tail call void @llvm.prefetch(i8* %0, i32 0, i32 0, i32 1)
  tail call void @llvm.prefetch(i8* %1, i32 0, i32 1, i32 1)
  tail call void @llvm.prefetch(i8* %2, i32 0, i32 2, i32 1)
  tail call void @llvm.prefetch(i8* %3, i32 0, i32 3, i32 1)
  tail call void @llvm.prefetch(i8* %4, i32 1, i32 0, i32 1)
  tail call void @llvm.prefetch(i8* %5, i32 1, i32 1, i32 1)
  tail call void @llvm.prefetch(i8* %6, i32 1, i32 2, i32 1)
  tail call void @llvm.prefetch(i8* %7, i32 1, i32 3, i32 1)

The generated x86_64 code for the first 4 calls, where the read/write parameter
is 0 (read), is exactly as expected
(generated with clang -O2 -S -march=btver2 test.c):
  prefetchnta foo(%rip)
  prefetcht2 foo(%rip)
  prefetcht1 foo(%rip)
  prefetcht0 foo(%rip)

The question is what should be expected when the r/w parameter is 1 (write).
Currently the backend generates:
  prefetchnta foo(%rip)
  prefetcht2 foo(%rip)
  prefetcht1 foo(%rip)
  prefetchw foo(%rip)

However, a different possibility would be for the r/w parameter to take
precedence over the locality parameter to generate:
  prefetchw foo(%rip)
  prefetchw foo(%rip)
  prefetchw foo(%rip)
  prefetchw foo(%rip)

The PREFETCHW instruction prefetches the line into the L1 cache and sets the
cache-line state to modified. Since there is no PREFETCHW variant that targets
the other cache levels, it is debatable which prefetch instruction should be
generated when a write prefetch is requested with a locality < 3. One opinion
is that the rw parameter takes precedence over locality, so
prefetch(a, 1, 1, 1) should generate prefetchw and not prefetcht2. FWIW, this
is what GCC appears to do (write trumps locality).
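
To make the two options concrete, here is a small C sketch of both selection
policies. It only models the outputs listed above; the enum and function names
are made up for illustration and are not the backend's actual code.

```c
#include <assert.h>

/* Hypothetical names, for illustration only; not LLVM's actual code. */
enum prefetch_insn {
    PREFETCHNTA, PREFETCHT2, PREFETCHT1, PREFETCHT0, PREFETCHW
};

/* Read prefetches: locality 0..3 maps to nta, t2, t1, t0. */
static enum prefetch_insn select_read(int locality) {
    static const enum prefetch_insn by_locality[4] = {
        PREFETCHNTA, PREFETCHT2, PREFETCHT1, PREFETCHT0
    };
    assert(locality >= 0 && locality <= 3);
    return by_locality[locality];
}

/* Policy A (the current output): locality decides, and only a write
 * prefetch with locality 3 becomes prefetchw. */
static enum prefetch_insn select_locality_first(int rw, int locality) {
    assert(rw == 0 || rw == 1);
    if (rw == 1 && locality == 3)
        return PREFETCHW;
    return select_read(locality);
}

/* Policy B (the alternative): a write prefetch always becomes prefetchw,
 * whatever the locality. */
static enum prefetch_insn select_write_first(int rw, int locality) {
    assert(rw == 0 || rw == 1);
    assert(locality >= 0 && locality <= 3);
    return rw == 1 ? PREFETCHW : select_read(locality);
}
```

The two policies agree for every read prefetch and for a write with locality 3;
they differ only for writes with locality 0, 1 or 2, which is the case in
question.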

Not sure if there is a right/wrong here; what is the preferred behavior?

Thanks,
- Josh