RFC phantom memory intrinsic

Hi,
For PR21780 solution, I plan to add a new functionality to restore
memory operations that was once deleted, in this particular case it is
the load operations that were deleted by InstCombine, please note that
once the load was removed there is no way to restore it back and that
prevents us from vectorizing the shuffle operation. There are probably
more similar issues where this approach could be applied.
I added phatom_mem(llvm_anyptr_ty, llvm_i64_ty) intrinsic for that,
indicating that for particular pointer let's call it %ptr we observed
maximum possible offset at which there was reference by its type in a
function. After InstCombine deleted the load operation, it could be
restored in SLPVectorizer and we could restore chains of GEPs, Loads
and Inserts in case we encounter phatom_mem intrinsic.

Here is two part review:
          https://reviews.llvm.org/D37579 - InstCombine part.
          https://reviews.llvm.org/D37648 - SLP part.

Also, there might be different approaches in describing deleted memory
operations, for example, for my case: phantom_load(llvm_anyptr_ty,
llvm_i64_ty). First parameter describes pointer and second parameter
offset from pointer this loaded was deleted, for example. This two
operations:

  %arrayidx1 = getelementptr inbounds double, double* %ptr, i64 1
  %ld1 = load double, double* %arrayidx1

could be represented in the IR with this one: "void phantom_load(%ptr,
1)" after removal. But, the approach that is already implemented in
both reviews looks better to me since we don't need to add intrinsic
for every removed operation in the IR. Also, while constructing such
form in the IR we have to be careful since some pointer operations
might be in loops and as the result we might end up construction an
incorrect IR. So, I just avoid to notice any pointer operation if it
is belong to a loop, except those where the the whole chain of
operations pointer origin, GEP, Load, Shuffle operation are in the
same loop and in the same basic block.
                                         Thanks, Dinar.

Here is the thread for this issue regarding using metadata:
http://lists.llvm.org/pipermail/llvm-dev/2017-July/115730.html

Interesting approach but how do you handle more complex offsets, e.g., when the pointer is part of an aggregate? Only one offset does not seem enough to handle generic cases.

Hi Michael,

Interesting approach but how do you handle more complex offsets, e.g., when the pointer is part of an aggregate? Only one offset does not seem enough to handle generic cases.

Yes, correct, this a little bit changed example is not working.
#include <x86intrin.h>

__m256d vsht_d4_fold(const double* ptr, unsigned long long i) {
  __m256d foo = (__m256d){ ptr[i], ptr[i+1], ptr[i+2], ptr[i+3] };
  return __builtin_shufflevector( foo, foo, 3, 3, 2, 2 );
}
But with the aggregate case it is a new level of complexity, should we
we care about? There might be some logic that probably would be mark
as dead by InstCombine and we don't want to keep it.
BTW: Looks like SLP could not recognize the case either :
define <4 x double> @vsht_d4_fold(double* %ptr, i64 %i) local_unnamed_addr #0 {
entry:
  %arrayidx = getelementptr inbounds double, double* %ptr, i64 %i
  %0 = load double, double* %arrayidx, align 8
  %vecinit = insertelement <4 x double> undef, double %0, i32 0
  %add = add i64 %i, 1
  %arrayidx1 = getelementptr inbounds double, double* %ptr, i64 %add
  %1 = load double, double* %arrayidx1, align 8
  %vecinit2 = insertelement <4 x double> %vecinit, double %1, i32 1
  %add3 = add i64 %i, 2
  %arrayidx4 = getelementptr inbounds double, double* %ptr, i64 %add3
  %2 = load double, double* %arrayidx4, align 8
  %vecinit5 = insertelement <4 x double> %vecinit2, double %2, i32 2
  %add6 = add i64 %i, 3
  %arrayidx7 = getelementptr inbounds double, double* %ptr, i64 %add6
  %3 = load double, double* %arrayidx7, align 8
  %vecinit8 = insertelement <4 x double> %vecinit5, double %3, i32 3
  %shuffle = shufflevector <4 x double> %vecinit8, <4 x double>
%vecinit8, <4 x i32> <i32 3, i32 3, i32 2, i32 2>
  ret <4 x double> %shuffle
}

        Thanks, Dinar.

Hi Dinar,

I am asking because I am maintaining an out-of-tree pass which does
exactly what SLP does not. It is a pass designed for GPUs to combine
loads and stores, e.g., when consecutive fields of a structure have the
same type it merges the loads and stores to vector loads and stores. I
have a case where InstCombine removes a store and your approach would be
valuable for me if the entire access to an aggregate could be restored.
Second thing I am concerned is that this intrinsics "just" fix a
specific problem of IC where potentially a more generic solution is needed.

Cheers,
Michael

Hi Michael,

I have a case where InstCombine removes a store and your approach would be
valuable for me if the entire access to an aggregate could be restored.

Yes, no problem and we could add the aggregate pointer to this new
intrinsic and in my particular case I should ignore it, but I am
looking now at "speculation_marker" metadata and I am still not sure
how to implement it better.
                  Thanks, Dinar.

Hi Michael,

I have a case where InstCombine removes a store and your approach would be
valuable for me if the entire access to an aggregate could be restored.

Yes, no problem and we could add the aggregate pointer to this new
intrinsic and in my particular case I should ignore it, but I am
looking now at "speculation_marker" metadata and I am still not sure
how to implement it better.

Are you primarily concerned with being able to widen loads later in the pipeline? Could we attached metadata to the remaining loads indicating that it would be legal to widen them?

  -Hal

Hi Hal,

Are you primarily concerned with being able to widen loads later in the pipeline? Could we attached metadata to the remaining loads indicating that it would be legal to widen them?

no, I don't have any concerns about intrinsic way of implementation,
and intrinsic way looks safer for me since we somehow detach our
information about memory from that actual load instruction. I updated
https://reviews.llvm.org/D37579 and https://reviews.llvm.org/D37648
with adding the aggregate pointer as second parameter as Michael
asked. So now, the intrinsic look like this:

void phantom_mem(any_pointer base, any_pointer aggregate, uint64_t
maximum_offset)

For PR21780, I don't need to use aggregate so it is set to null, but
for other similar issues this aggregate parameter might be useful.
                             Thanks, Dinar.

In many cases, we can resolve such cases via a speculative load. I assume you've reviewed the case at hand to ensure we can't speculative insert a wider load in the case you need?

That doesn't work for the store case, but if you just care about loads...

Philip

Hi Hal,

Are you primarily concerned with being able to widen loads later in the pipeline? Could we attached metadata to the remaining loads indicating that it would be legal to widen them?

no, I don't have any concerns about intrinsic way of implementation,
and intrinsic way looks safer for me since we somehow detach our
information about memory from that actual load instruction.

In general, our use of intrinsics vs. (metadata or attributes), especially an intrinsic that would be automatically introduced during canonicalization, is characterized as "only if there's no other way". The reason is that intrinsics are expensive, they add uses to otherwise-single-use values (which block optimization) (*), and keep otherwise-dead code alive.

I believe that there are uses for an intrinsic like this. Such uses require that the information implied by the intrinsic be anchored to a particular place in the CFG (and, importantly, not be hoisted or removed). This may be useful to convey at what point in the CFG some memory is dereferenceable, thus allowing, for example, sunk accesses to be re-hoisted. In your case, however, you don't need that information (AFAIKT). Instead, you just need to know that you can widen certain loads. This is valuable information, but it can be conveyed with metadata on the relevant loads. That would be a cheaper solution. If this seems impractical, I'd certainly like to understand why.

(*) CodeMetrics has a way to collect "ephemeral values" to avoid extra values affecting some of the cost modeling, but they still block optimization.

Thanks again,
Hal

In general, our use of intrinsics vs. (metadata or attributes), especially an intrinsic that would be automatically introduced during canonicalization, is characterized as "only if there's no other way". The reason is that intrinsics are ?>expensive, they add uses to otherwise-single-use values (which block optimization) (*), and keep otherwise-dead code alive.

I believe that there are uses for an intrinsic like this. Such uses require that the information implied by the intrinsic be anchored to a particular place in the CFG (and, importantly, not be hoisted or removed). This may be useful to convey >at what point in the CFG some memory is dereferenceable, thus allowing, for example, sunk accesses to be re-hoisted. In your case, however, you don't need that information (AFAIKT). Instead, you just need to know that you can widen >certain loads. This is valuable information, but it can be conveyed with metadata on the relevant loads. That would be a cheaper solution. If this seems impractical, I'd certainly like to understand why.

ok, I see, yes it would be certainly cheaper by using metadata. Thank
you Philip for pointing for important use-case in D37648.
                            Thanks, Dinar.

Hi,
I updated solution for PR21780 in https://reviews.llvm.org/D37579,
https://reviews.llvm.org/D37648 and I think I fixed the issue that
Philip pointer out in the last review about accessing offsets that
might not be accessible. This time, instead of keeping maximum or
minimum offset from the base pointer we keep each offset that proven
to be dereferanceable from that pointer. This metadata should be
attached to a load and here is example snippet of bit-code:

%ld1 = load double, double* %arrayidx1, align 8, !speculation.marker !0

    ...
    !0 = !{i64 -1, i64 2}

Offsets aren't required to be sorted while placement in metadata.
Please review https://reviews.llvm.org/D37579, https://reviews.llvm.org/D37648.
                   Thanks, Dinar.