Latency of true depency of store followed by aliased load in ScheduleDAGInstrs

Hi all,

I have a question regarding the latency of the true dependency of a
store followed by an aliased load in ScheduleDAGInstrs. The latency
seems to depend on the store and load being volatile or not as can be
seen in the post-RA-sched debug output of the attached ARM example:

$ llc -O3 -debug-only=post-RA-sched store_load_latency_test.ll

...

SU(2): STRi12 %R2<kill>, %R0<kill>, 0, pred:14, pred:%noreg; mem:Volatile ST4[%p1](tbaa=!"int")
  # preds left : 1
  # succs left : 2
  # rdefs left : 0
  Latency : 1
  Depth : 2
  Height : 0
  Predecessors:
   val SU(1): Latency=1 Reg=%R2
  Successors:
   antiSU(3): Latency=0
   ch SU(3): Latency=0

SU(3): %R0<def> = LDRi12 %R1<kill>, 0, pred:14, pred:%noreg; mem:Volatile LD4[%p2](tbaa=!"int")
  # preds left : 2
  # succs left : 1
  # rdefs left : 0
  Latency : 1
  Depth : 2
  Height : 0
  Predecessors:
   antiSU(2): Latency=0
   ch SU(2): Latency=0
  Successors:
   val SU(4294967295): Latency=2

...

SU(2): STRi12 %R2<kill>, %R0<kill>, 0, pred:14, pred:%noreg; mem:ST4[%p1](tbaa=!"int")
  # preds left : 1
  # succs left : 2
  # rdefs left : 0
  Latency : 1
  Depth : 2
  Height : 0
  Predecessors:
   val SU(1): Latency=1 Reg=%R2
  Successors:
   antiSU(3): Latency=0
   ch SU(3): Latency=1

SU(3): %R0<def> = LDRi12 %R1<kill>, 0, pred:14, pred:%noreg; mem:LD4[%p2](tbaa=!"int")
  # preds left : 2
  # succs left : 1
  # rdefs left : 0
  Latency : 1
  Depth : 3
  Height : 0
  Predecessors:
   antiSU(2): Latency=0
   ch SU(2): Latency=1
  Successors:
   val SU(4294967295): Latency=2

...

So in the volatile case the latency of the chain dependency is 0, while
in the non volatile case it is 1.

I am using ScheduleDAGInstrs in a scheduler for a VLIW target and in the
volatile case the load gets incorrectly scheduled in the same cycle as
the store. Is ScheduleDAGInstrs incorrect in the volatile case or
shouldn't I rely on the latency being non zero for getting a correct
schedule?

Best regards,

Jordy Potman

store_load_latency_test.ll (727 Bytes)

I don't like the inconsistency. There's no reason for it other than sloppy implementation. I tried to clean this up in r158380. Memory dependence latency should now be conservative in this respect.

That said, this was really meant to be a heuristic, and a fairly unimportant one at that, so I never paid much attention to the inconsistency. Whether memory dependencies can be bundled in the same group is cpu specific. Register anti-dependencies are also zero cycle. Do you bundle those in the same group? What about register output dependencies? They can result from dead register defs. You may need to add your own checks to determine valid instruction groups. For example, you may not want to allow any DAG dependencies at all within the group. If the basic ScheduleHazardRecognizer is insufficient, you could try the approach that the Hexagon target uses. One of the engineers who works on that backend should be able to explain it better.

-Andy

Thanks for cleaning up the inconsistency. Register anti-dependencies can
be bundled in the same group on our target and we treat dead register
defs specially in our scheduler so I guess that is why we did not run
into issues before. I now understand that checking explicitly for
allowed/not-allowed DAG dependencies within a group is a more robust
approach, so we'll start doing that.

Thanks,

Jordy