Alias analysis and instruction-level parallelism

I am pretty excited about the recent activity on dependence
analysis. The only remaining problem from our point of view
is how to get the alias information to the back end instruction
scheduler. If I understand things correctly, the alias information
basically gets lost in the process of lowering to target
instructions.

We are interested in the DSP domain, so we really need to get
SIMD-style parallelism to work, and this needs alias information.
In one of the earlier threads on alias analysis someone commented
that preserving the alias information would not really be that
difficult, but possibly tedious.

My initial reaction is that if one were to decorate MachineInstrs
with the LLVM-level pointer values that they use for reading
and writing memory, then one should be able to use those
values and the AliasAnalysis interface to query dependences
between MachineInstrs. I am not intimately familiar with
how the lowering is done, so if there are some obvious
problems with this approach, please let me know.
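
To make the kind of query concrete, here is a small hypothetical IR
fragment (not taken from our code): two stores that the scheduler
would like to issue in the same cycle. At the IR level, basicaa can
see that both addresses are GEPs off the same base with different
constant indices, so the stores cannot alias; that is exactly the
fact we would like to reach the MachineInstr scheduler.

define void @two_stores(i32* %base, i32 %v) {
entry:
  %p0 = getelementptr i32* %base, i32 0
  %p1 = getelementptr i32* %base, i32 1
  store i32 %v, i32* %p0              ; writes %base[0]
  store i32 %v, i32* %p1              ; writes %base[1]
  ret void
}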

Hi,

My initial reaction is that if one were to decorate MachineInstrs
with the LLVM-level pointer values that they use for reading
and writing memory,

this is already the case: SrcValue and SVOffset.

Ciao,

Duncan.

Duncan Sands wrote:

My initial reaction is that if one were to decorate MachineInstrs
with the LLVM-level pointer values that they use for reading
and writing memory,

this is already the case: SrcValue and SVOffset.

Ah, that's right. I went back and read the discussion from
January, and Florian Brandner explains there that the real
culprit is the lowering of GEPs in the codegenprepare pass.

Florian, if you have worked more on this, we would be very interested
in your work. Serialization caused by memory references is a big
obstacle for us at the moment.

CodeGenPrepare rewrites GEP instructions into a sequence of PtrToInt casts,
additions, multiplies, and finally an IntToPtr cast. The basic alias analysis
pass does not analyze pointers past an IntToPtr cast, and thus returns "may
alias".

I have tried to overcome this. Basically you only need to find the base object
of the original GEP, but this is possible only in simple cases. If pointer
arithmetic is involved, it is hard to distinguish between the GEP base and
other calculations. You might even end up analyzing pointer calculations that
did not originate from a GEP.
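
A contrived example of the problem: suppose the lowered code computes
the distance between two objects and adds it back. A heuristic that
walks the add chain and takes the first pointer it finds would report
%p as the base object, yet the result actually points into %q.

define i32* @not_a_gep(i32* %p, i32* %q) {
entry:
  %a = ptrtoint i32* %p to i32
  %b = ptrtoint i32* %q to i32
  %d = sub i32 %b, %a              ; distance between the two objects
  %m = add i32 %a, %d              ; equal to %b, but the chain starts at %a
  %r = inttoptr i32 %m to i32*
  ret i32* %r
}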

Another option would be to save the GEP instructions and provide a mapping
between them and the lowered arithmetic. It's not very clean, but it would be
safe and would probably work well.

florian

Florian Brandner wrote:

Another option would be to save the GEP instructions and provide a mapping between them and the lowered arithmetic. It's not very clean, but it would be safe and would probably work well.

In my opinion this seems like the obvious way to go. The information
is already there, so reverse engineering it back does not sound too
clever. Of course, if the reverse engineering provided useful
information about pointers that do not originate from GEP instructions,
then it might be worthwhile. But my feeling is that, at least for our
purposes, GEPs really are the interesting case. My understanding is
that parallelizing C compilers try to convert pointer arithmetic back
to array accesses in order to make sense of it, essentially turning
it into GEPs. Certainly the dependence literature is heavily array
based.

If you handle this at the LLVM IR level, then you have access to all the alias info, and GEP instructions are directly available to you. LLVM IR supports vector types for SIMD-style parallelism, and target-specific code generators lower them appropriately.
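
(For reference, a minimal sketch of the vector support mentioned here:
a single IR add on <4 x i32>, which the target-specific code generator
either maps to the target's SIMD instructions or scalarizes.)

define <4 x i32> @vadd(<4 x i32> %a, <4 x i32> %b) {
entry:
  %sum = add <4 x i32> %a, %b    ; element-wise 4 x i32 addition
  ret <4 x i32> %sum
}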

I think this is trickier than it sounds; the reason GEPs are lowered is to
allow strength-reduction and other things to do transformations on them.
It would require those passes to know how to update the mapping.

Dan

Devang Patel wrote:

If you handle this at the LLVM IR level, then you have access to all the alias info, and GEP instructions are directly available to you. LLVM IR supports vector types for SIMD-style parallelism, and target-specific code generators lower them appropriately.

Our target is a statically scheduled VLIW-style processor,
which may have custom FUs designed by the user. This means
that the instruction scheduler needs to have very detailed
knowledge of the available FUs, latencies, the interconnection
network, etc. The main problem is not exploiting some set of
vector-style FUs, but rather packing parallel operations
into wide instructions. We would love to do this at the level
of LLVM IR, but it does not seem to be possible.

Dan Gohman wrote:

I think this is trickier than it sounds; the reason GEPs are lowered is to
allow strength-reduction and other things to do transformations on them.
It would require those passes to know how to update the mapping.

Yes, I do appreciate the amount of work involved, and I am
very open to other suggestions.

What the backend really needs to know is what loads and
stores are independent of each other. When looking at a
scheduling region (basic block for now), we already know
at the LLVM IR level which loads and stores could potentially
be scheduled at the same cycle, so essentially we can
anticipate which alias queries the backend would make.

An alternative game plan would then be to identify the
loads and stores of interest, do the alias queries
at the LLVM IR level, and store the independence info
somewhere. The back end would then trace the target memory
references back to LLVM IR loads and stores, and consult
the stored independence information while scheduling.

It seems to me that this should be safe, at least if
one is careful about what passes are run between
the alias analysis and the back end. I cannot think
of anything offhand that could make two independent
memory references become dependent, which would need
to happen in order for things to go bad.

There is a potential combinatorial explosion built into
this, but I don't think it would turn up in practice.

Dan Gohman wrote:

I think this is trickier than it sounds; the reason GEPs are lowered is to
allow strength-reduction and other things to do transformations on them.
It would require those passes to know how to update the mapping.

Yes, I do appreciate the amount of work involved, and I am
very open to other suggestions.

To clarify, I was responding to the approach of mapping
address values back to their original GEP instructions
here.

What the backend really needs to know is what loads and
stores are independent of each other. When looking at a
scheduling region (basic block for now), we already know
at the LLVM IR level which loads and stores could potentially
be scheduled at the same cycle, so essentially we can
anticipate which alias queries the backend would make.

An alternative game plan would then be to identify the
loads and stores of interest, do the alias queries
at the LLVM IR level, and store the independence info
somewhere. The back end would then trace the target memory
references back to LLVM IR loads and stores, and consult
the stored independence information while scheduling.

It seems to me that this should be safe, at least if
one is careful about what passes are run between
the alias analysis and the back end. I cannot think
of anything offhand that could make two independent
memory references become dependent, which would need
to happen in order for things to go bad.

I agree. I think what you're describing here and what
I described in a separate email are very similar :-).

The current SelectionDAG already has a way to store the
information you want here, using its chain dependencies.
It needs improvement to do a better job identifying
interesting loads and stores and making alias queries --
the things you're describing here.

If we do that, and then preserve this dependency information
when SelectionDAG nodes are translated to MachineInstrs,
we'll have a very useful system.

Dan

How about a much simpler approach! Here's a silly but reasonable example:

int A[100];

void test(int x) {
   while (x)
     A[--x] = 0;
}

we compile this into (x86 with PIC codegen):

..
LBB1_1: ## bb.preheader
  movl L_A$non_lazy_ptr, %ecx
  leal -4(%ecx,%eax,4), %ecx
  xorl %edx, %edx
  .align 4,0x90
LBB1_2: ## bb
  movl $0, (%ecx)
  addl $4294967292, %ecx
  incl %edx
  cmpl %eax, %edx
  jne LBB1_2 ## bb
...

This is pretty reasonable code; what does the output of LSR look like, though?

$ llvm-gcc t.c -S -o - -O3 -emit-llvm | llvm-as | llc -print-isel-input -relocation-model=pic
...
  %tmp = mul i32 %x, 4 ; <i32> [#uses=1]
  %tmp2 = add i32 ptrtoint ([100 x i32]* @A to i32), %tmp ; <i32> [#uses=1]
  %tmp4 = add i32 %tmp2, -4 ; <i32> [#uses=1]
  br label %bb
bb: ; preds = %bb.preheader, %bb
  %iv.5 = phi i32 [ %tmp4, %bb.preheader ], [ %iv.5.inc, %bb ] ; <i32> [#uses=2]
...
  %iv.57 = inttoptr i32 %iv.5 to i32* ; <i32*> [#uses=1]
  store i32 0, i32* %iv.57, align 4
...
  %iv.5.inc = add i32 %iv.5, -4 ; <i32> [#uses=1]
  br i1 %exitcond, label %return, label %bb
return: ; preds = %bb, %entry
  ret void

Wow, that's horrible! No wonder basicaa gets confused; I don't blame it. However, none of this is needed. It would be much better for LSR to lower the code into:

    %tmp4 = getelementptr [100 x i32]* @A, i32 0, i32 %x ; <i32*>
    br label %bb
bb:
    %iv.5 = phi i32* [ %tmp4, %bb.preheader ], [ %iv.5.inc, %bb ]
..
    store i32 0, i32* %iv.5
..
    %iv.5.inc = getelementptr i32* %iv.5, i32 -1
    br i1 %exitcond, label %return, label %bb
return: ; preds = %bb, %entry
    ret void

With this code, basicaa will have little problem understanding what is going on, and generating this from LSR should not be very hard at all. Better yet, no new crazy infrastructure change is required.

What do you think?

-Chris