Scoreboard hazard detection and instruction groupings

I'm considering writing more-detailed itineraries for some PowerPC CPUs
that use the 'traditional' instruction grouping scheme. In essence,
this means that multiple instructions will stall in some pipeline stage
until a complete group is formed, then all will continue.

I expect to provide CPU-specific code to help determine when the
currently-waiting instructions would form a group. Is there a
straightforward way that I could override the scoreboard hazard
detector (or layer on top of it) to insert this kind of logic?

Should I instead be looking to do something like what Hexagon does
for VLIW cores? I think the main difference between the PowerPC scheme
and a VLIW scheme, is that the CPU itself can form groups internally,
it is just more efficient if the groups are provided pre-made. Maybe
this difference, if it is one, is insignificant.

Thanks again,
Hal

Hal, I think you're asking whether to use the ScheduleHazardRecognizer or DFAPacketizer. I suggest sticking with the hazard recognizer unless you have an important problem that can't be solved that way. It's the API used by most targets and doesn't require a custom scheduler. Although I don't want to stop you from generalizing the DFA work either if you feel compelled to do that.

Ignoring compile time for a moment, I think an advantage of a DFA is modeling a situation where the hardware can assign resources to best fit the entire group rather than one instruction at a time. For example, if InstA requires either Unit0 or Unit1, and InstB requires Unit0, is {InstA, InstB} a valid group? Depending on your CPU, a DFA could either recognize that it's valid, or give you a chance to reorder the instructions within a group once they've been selected.
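To make the {InstA, InstB} example concrete, here is a small standalone sketch (not LLVM code) of the question a DFA answers: does any assignment of group members to units satisfy every member's constraints? A greedy one-at-a-time assignment might hand Unit0 to InstA and then fail on InstB, while the whole-group check succeeds.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns true if every instruction in the group can be given its own
// functional unit. unitChoices[i] lists the units instruction i may use.
// This brute-forces all unit orderings, which is what a DFA encodes
// compactly in its transition table.
bool canFormGroup(const std::vector<std::vector<int>> &unitChoices,
                  int numUnits) {
  if (unitChoices.size() > static_cast<std::size_t>(numUnits))
    return false;
  std::vector<int> units;
  for (int u = 0; u < numUnits; ++u)
    units.push_back(u);
  // Try every ordering of units; instruction i is assigned units[i].
  do {
    bool ok = true;
    for (std::size_t i = 0; i < unitChoices.size() && ok; ++i) {
      const std::vector<int> &c = unitChoices[i];
      ok = std::find(c.begin(), c.end(), units[i]) != c.end();
    }
    if (ok)
      return true;
  } while (std::next_permutation(units.begin(), units.end()));
  return false;
}
```

With InstA allowed on {Unit0, Unit1} and InstB only on {Unit0}, canFormGroup({{0, 1}, {0}}, 2) is true because the assignment InstA to Unit1 and InstB to Unit0 works, even though an in-order greedy pick of Unit0 for InstA would fail.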

Ideally, you can express your constraints using InstrStage itinerary entries. If not, then you need to do your own bookkeeping by saving extra state during EmitInstruction and checking for hazards in getHazardType. At this point, you need to decide whether your custom logic can be easily generalized to either top-down or bottom-up scheduling. If not, you can force MISched to schedule in one direction only. SD scheduling is stuck with bottom-up for the remainder of its days, and postRA scheduling is top-down.
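A standalone sketch of that bookkeeping pattern (this only mimics the shape of the real ScheduleHazardRecognizer interface, which takes SUnit* and is implemented per target; the "one instruction per class per group" rule here is invented purely for illustration):

```cpp
#include <vector>

enum HazardType { NoHazard, Hazard };

// Simplified stand-in for a hazard recognizer: remember what has been
// emitted into the current group, and report a hazard for any
// instruction that cannot join it.
class GroupingRecognizer {
  std::vector<int> pending; // opcode classes emitted into the group
  static const unsigned GroupSize = 4;

public:
  // Illustrative rule: at most one instruction of each class per group.
  HazardType getHazardType(int opClass) const {
    for (int c : pending)
      if (c == opClass)
        return Hazard;
    return NoHazard;
  }
  void emitInstruction(int opClass) {
    pending.push_back(opClass);
    if (pending.size() == GroupSize)
      pending.clear(); // group complete; hardware dispatches it
  }
  void advanceCycle() { pending.clear(); }
};
```

The real recognizer would be consulted by the scheduler each time it considers a candidate, exactly as the scoreboard recognizer is today.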

-Andy

Hal,

Ignoring compile time for a moment, I think an advantage of a DFA is modeling a situation where the hardware can assign resources to best fit the entire group rather than one instruction at a time. For example, if InstA requires either Unit0 or Unit1, and InstB requires Unit0, is {InstA, InstB} a valid group? Depending on your CPU, a DFA could either recognize that it's valid, or give you a chance to reorder the instructions within a group once they've been selected.

I would recommend the DFA mechanism as well from what you've described. It considers all permutations of mapping instructions to functional units. To add to what Andrew said, note that the DFA answers the question of whether there exists a legal mapping of a group of instructions to functional units. It does not, however, return a legal mapping. Will that be sufficient for what you want?

-Anshu

> I'm considering writing more-detailed itineraries for some PowerPC
> CPUs that use the 'traditional' instruction grouping scheme. In
> essence, this means that multiple instructions will stall in some
> pipeline stage until a complete group is formed, then all will
> continue.
>
> I expect to provide CPU-specific code to help determine when the
> currently-waiting instructions would form a group. Is there a
> straightforward way that I could override the scoreboard hazard
> detector (or layer on top of it) to insert this kind of logic?
>
> Should I instead be looking to do something like what Hexagon does
> for VLIW cores? I think the main difference between the PowerPC
> scheme and a VLIW scheme, is that the CPU itself can form groups
> internally, it is just more efficient if the groups are provided
> pre-made. Maybe this difference, if it is one, is insignificant.

Hal, I think you're asking whether to use the
ScheduleHazardRecognizer or DFAPacketizer. I suggest sticking with
the hazard recognizer unless you have an important problem that can't
be solved that way. It's the API used by most targets and doesn't
require a custom scheduler. Although I don't want to stop you from
generalizing the DFA work either if you feel compelled to do that.

I don't yet feel compelled, and I don't know much about the
DFAPacketizer. I just want something that will work cleanly ;)

Looking at VLIWPacketizerList::PacketizeMIs, it seems like the
instructions are first scheduled (via some external scheme?), and then
packetized 'in order'. Is that correct?

Ignoring compile time for a moment, I think an advantage of a DFA is
modeling a situation where the hardware can assign resources to best
fit the entire group rather than one instruction at a time. For
example, if InstA requires either Unit0 or Unit1, and InstB requires
Unit0, is {InstA, InstB} a valid group? Depending on your CPU, a DFA
could either recognize that it's valid, or give you a chance to
reorder the instructions within a group once they've been selected.

In the PowerPC grouping scheme, resources are assigned on a group
basis (by the instruction dispatching stages). However, once the group
is dispatched to the appropriate functional units, 'bypass' is still
available on an instruction-by-instruction basis to instructions in
later groups. Final writeback waits until all members of the group
complete.

Ideally, you can express your constraints using InstrStage itinerary
entries.

I don't see how, in the current scheme, to express that an instruction
must wait in FU0 until there are also waiting instructions in FU1, FU2
and FU3. Furthermore, there are certain constraints on what those
instructions can be, and which ones will move forward as the next
dispatched group, and I think we need to fall back to C++ to deal with
them.
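For illustration, the "wait until all slots are occupied" condition could be modeled like this (a standalone sketch assuming four hypothetical functional units; as noted, this is exactly the kind of dynamic condition the itinerary tables cannot express):

```cpp
#include <array>

// Sketch: the dispatch stage holds one waiting instruction per
// functional unit FU0..FU3, and the group can only advance when every
// slot is occupied. Real grouping rules are more involved.
bool groupReady(const std::array<bool, 4> &waiting) {
  for (bool w : waiting)
    if (!w)
      return false;
  return true;
}
```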

If not, then you need to do your own bookkeeping by saving
extra state during EmitInstruction and checking for hazards in
getHazardType. At this point, you need to decide whether your custom
logic can be easily generalized to either top-down or bottom-up
scheduling.

I think that it can be either. Within the current system, however, it
might need to be top down. To do bottom up, you'd need to have
look-ahead of the depths of the pipelines to emit grouping hazards
when considering the ends of the pipelines (although this may just be
for corner cases, I think that normal dependency analysis should catch
most of these).

If not, you can force MISched to schedule in one direction only. SD
scheduling is stuck with bottom-up for the remainder of its days, and
postRA scheduling is top-down.

I would rather do something that will be easy to maintain going
forward, and so if that can be accomplished within the normal
framework, then that would be great.

I think that I'll try ignoring the issue for now, just use a normal
itinerary with bottom-up scheduling, and then the existing top-down
pass (which attempts to enforce some of the ordering constraints;
these are most severe on the G5s). If that gives unsatisfactory
results, then we can think about something more involved.

Thanks again,
Hal

Hal,

> Ignoring compile time for a moment, I think an advantage of a DFA
> is modeling a situation where the hardware can assign resources to
> best fit the entire group rather than one instruction at a time.
> For example, if InstA requires either Unit0 or Unit1, and InstB
> requires Unit0, is {InstA, InstB} a valid group? Depending on your
> CPU, a DFA could either recognize that it's valid, or give you a
> chance to reorder the instructions within a group once they've been
> selected.
>
>

I would recommend the DFA mechanism as well from what you've
described. It considers all permutations of mapping instructions to
functional units. To add to what Andrew said, note that the DFA
answers the question of whether there exists a legal mapping of a
group of instructions to functional units. It does not, however,
return a legal mapping. Will that be sufficient for what you want?

Yes, I generally don't need the mapping, the core will do that
in hardware. But I do need to know which instructions can be grouped
together, and to get the groupings.

On some targets, the order of the instructions matters (for example, on
the G5, the instructions are in groups of 5, and a branch must always
be in the last slot). But that is somewhat higher level than the
functional units themselves, and should be pretty easy to figure out.
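That kind of slot-ordering rule is indeed easy to check outside the DFA. A toy sketch (the opcode names and the single-branch rule are illustrative, not the actual G5 dispatch rules):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical opcode classifier; real PowerPC mnemonics and the
// exact grouping rules differ.
bool isBranch(const std::string &op) {
  return op == "b" || op == "bc";
}

// A dispatch group holds at most five instructions, and a branch may
// only occupy the last slot (the constraint described above).
bool isLegalGroupOrder(const std::vector<std::string> &group) {
  if (group.size() > 5)
    return false;
  for (std::size_t i = 0; i < group.size(); ++i)
    if (isBranch(group[i]) && i + 1 != group.size())
      return false;
  return true;
}
```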

So do you use the regular bottom-up scheduling first and then use the
DFA after that?

Thanks again,
Hal

Looking at VLIWPacketizerList::PacketizeMIs, it seems like the

instructions are first scheduled (via some external scheme?), and then
packetized 'in order'. Is that correct?

Anshu?

In the PowerPC grouping scheme, resources are assigned on a group
basis (by the instruction dispatching stages). However, once the group
is dispatched to the appropriate functional units, 'bypass' is still
available on an instruction-by-instruction basis to instructions in
later groups. Final writeback waits until all members of the group
complete.

Ideally, you can express your constraints using InstrStage itinerary
entries.

I don't see how, in the current scheme, to express that an instruction
must wait in FU0 until there are also waiting instructions in FU1, FU2
and FU3. Furthermore, there are certain constraints on what those
instructions can be, and which ones will move forward as the next
dispatched group, and I think we need to fall back to C++ to deal with
them.

Right. I should have mentioned that the static itinerary really can't express dynamic constraints. You can play games by inventing new types of FuncUnits. But there's no way to say an instruction hogs a pipeline until some future event at unknown interval.

-Andy

Yes, we schedule first and then packetize in order.

-Anshu

Hello everyone,

  I am working on a release based on the branch 3.1 version of code.
Unfortunately it has enough differences that exact rev does not apply.
I am hitting an assert in liveness update with seemingly trivial code
(attached).

/local/mnt/workspace/slarin/tools/llvm-mainline-merged/lib/CodeGen/LiveInter
valAnalysis.cpp:1078: void
llvm::LiveIntervals::HMEditor::moveAllRangesFrom(llvm::MachineInstr*,
llvm::SlotIndex): Assertion `validator.rangesOk() && "moveAllOperandsFrom
broke liveness."' failed.

The code being scheduled (function "push") is trivial:

# Machine code for function push: Post SSA
Function Live Outs: %R0

0B BB#0: derived from LLVM BB %entry
16B %vreg9<def> = TFRI_V4 <ga:@xx_stack>; IntRegs:%vreg9
        Successors according to CFG: BB#1

48B BB#1: derived from LLVM BB %for.cond
        Predecessors according to CFG: BB#0 BB#1
80B %vreg1<def> = COPY %vreg10<kill>; IntRegs:%vreg1,%vreg10
96B %vreg10<def> = LDriw %vreg9<kill>, 0; mem:LD4[%stack.0.in]
IntRegs:%vreg10,%vreg9
112B %vreg9<def> = ADD_ri %vreg10, 8; IntRegs:%vreg9,%vreg10
128B %vreg6<def> = CMPEQri %vreg10, 0; PredRegs:%vreg6 IntRegs:%vreg10
176B JMP_cNot %vreg6<kill>, <BB#1>, %PC<imp-def>; PredRegs:%vreg6
192B JMP <BB#2>
        Successors according to CFG: BB#2 BB#1

208B BB#2: derived from LLVM BB %for.end
        Predecessors according to CFG: BB#1
224B %vreg7<def> = LDriw %vreg1<kill>, 0; mem:LD4[%first1](tbaa=!"any
pointer") IntRegs:%vreg7,%vreg1
240B STriw_GP <ga:@yy_instr>, 0, %vreg7<kill>;
mem:ST4[@yy_instr](tbaa=!"any pointer") IntRegs:%vreg7
256B JMPR %PC<imp-def>, %R31<imp-use>, %R0<imp-use,undef>

Hexagon MI scheduler is working with BB1 and picks this load:

  %vreg10<def> = LDriw %vreg9<kill>, 0; mem:LD4[%stack.0.in]
IntRegs:%vreg10,%vreg9

to be scheduled first (!). Right after that:

7 clang 0x000000000226aece
llvm::LiveIntervals::handleMove(llvm::MachineInstr*) + 378
8 clang 0x0000000001c2574f
llvm::VLIWMachineScheduler::listScheduleTopDown() + 595
9 clang 0x0000000001c24cd5 llvm::VLIWMachineScheduler::schedule()
+ 505

It does not seem to happen on the trunk.

My question is: does anyone recognize the issue, and what patch(es) do I
need to address it? Since my release is based on 3.1, I have to
cherry-pick them...
Any lead is highly appreciated.

Thanks.

Sergei

test_live.o.ll (2.64 KB)

test_live.i (639 Bytes)


Quickly scanning the commit log, I only see this one since 3.1:

commit f905f69668e5dd184c0a2b5fae38d9f3721c0d3b
Author: Lang Hames <lhames@gmail.com>

    Clear the entering, exiting and internal ranges of a bundle before collecting
    ranges for the instruction about to be bundled. This fixes a bug in an external
    project where an assertion was triggered due to spurious 'multiple defs' within
    the bundle.
    
    Patch by Ivan Llopard. Thanks Ivan!

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@157632

Maybe Lang can remember something relevant though.

I still see a number of these asserts on trunk. I just haven't gotten around to making a push to fix them yet, because I've been focusing on other areas. However, if you reproduce your problem on trunk, file a bug right away. If you aren't able to work around this, I'll take some time to file bugs for asserts that I can reproduce in case they're related to your problem.

-Andy

Andy,

  Thanks for the reply. I was able to trace the problem to the MI DAG
dep constructor. See this:

SU(0): %vreg1<def> = COPY %vreg10<kill>; IntRegs:%vreg1,%vreg10
  # preds left : 0
  # succs left : 0
  # rdefs left : 1
  Latency : 1
  Depth : 0
  Height : 0

SU(1): %vreg10<def> = LDriw %vreg9<kill>, 0; mem:LD4[%stack.0.in]
IntRegs:%vreg10,%vreg9
  # preds left : 0
  # succs left : 3
  # rdefs left : 1
  Latency : 1
  Depth : 0
  Height : 0
  Successors:
   val SU(3): Latency=1
   val SU(2): Latency=1
   antiSU(2): Latency=0

SU(2): %vreg9<def> = ADD_ri %vreg10, 8; IntRegs:%vreg9,%vreg10
  # preds left : 2
  # succs left : 0
  # rdefs left : 1
  Latency : 1
  Depth : 0
  Height : 0
  Predecessors:
   val SU(1): Latency=1 Reg=%vreg10
   antiSU(1): Latency=0

The anti dep between SU(0) and SU(1) is missing, so when scheduler decides
to reorder them, LI rightfully complains.
Again, this is branch code, but if I can see something wrong in platform
independent portion, I'll track it on the trunk.

Thanks.

Sergei

Andy,

  I traced my problem to this point:
In ScheduleDAGInstrs.cpp we have the following function:

/// addVRegDefDeps - Add register output and data dependencies from this SUnit
/// to instructions that occur later in the same scheduling region if they read
/// from or write to the virtual register defined at OperIdx.
///
/// TODO: Hoist loop induction variable increments. This has to be
/// reevaluated. Generally, IV scheduling should be done before coalescing.
void ScheduleDAGInstrs::addVRegDefDeps(SUnit *SU, unsigned OperIdx) {
  const MachineInstr *MI = SU->getInstr();
  unsigned Reg = MI->getOperand(OperIdx).getReg();

  // SSA defs do not have output/anti dependencies.
  // The current operand is a def, so we have at least one.
  if (llvm::next(MRI.def_begin(Reg)) == MRI.def_end()) <<<<<<<<<<<<<<<<<< This is what I am missing. See below.
    return;

  // Add output dependence to the next nearest def of this vreg.
  //
  // Unless this definition is dead, the output dependence should be
  // transitively redundant with antidependencies from this definition's
  // uses. We're conservative for now until we have a way to guarantee the uses
  // are not eliminated sometime during scheduling. The output dependence edge
  // is also useful if output latency exceeds def-use latency.
  VReg2SUnitMap::iterator DefI = VRegDefs.find(Reg);
  if (DefI == VRegDefs.end())
    VRegDefs.insert(VReg2SUnit(Reg, SU));
  else {
    SUnit *DefSU = DefI->SU;
    if (DefSU != SU && DefSU != &ExitSU) {
      unsigned OutLatency = TII->getOutputLatency(InstrItins, MI, OperIdx,
                                                  DefSU->getInstr());
      DefSU->addPred(SDep(SU, SDep::Output, OutLatency, Reg));
    }
    DefI->SU = SU;
  }
}

So if this early exit is taken:

  // SSA defs do not have output/anti dependencies.
  // The current operand is a def, so we have at least one.
  if (llvm::next(MRI.def_begin(Reg)) == MRI.def_end())
    return;

we do not ever get to this point:

   VRegDefs.insert(VReg2SUnit(Reg, SU));

But later, when checking for anti dependency for another MI here:

void ScheduleDAGInstrs::addVRegUseDeps(SUnit *SU, unsigned OperIdx) {
...
  // Add antidependence to the following def of the vreg it uses.
  VReg2SUnitMap::iterator DefI = VRegDefs.find(Reg);
  if (DefI != VRegDefs.end() && DefI->SU != SU)
    DefI->SU->addPred(SDep(SU, SDep::Anti, 0, Reg));

We will never find that def in VRegDefs.find(Reg) even though it exists.

I know this has been working for a while, but I am still missing something
here.
What is this statement

if (llvm::next(MRI.def_begin(Reg)) == MRI.def_end())

supposed to guarantee? From it, there must be more than one definition
of that reg in MRI.def for the insertion to happen...

To connect it to the original example... When processing (in BU order)
this instruction:

SU(1): %vreg10<def> = LDriw %vreg9<kill>, 0; mem:LD4[%stack.0.in]

The %vreg10<def> is never inserted into VRegDefs, so with the next instruction:

SU(0): %vreg1<def> = COPY %vreg10<kill>; IntRegs:%vreg1,%vreg10

Anti dep on %vreg10 is never created.
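The effect of that early exit can be reproduced in a standalone model. This is not the LLVM code, just a sketch of the bottom-up bookkeeping described above: record each def in a map, and when a use is seen above a recorded def, add an anti edge; skipping single-def vregs, as the SSA early exit does, loses the edge for this region.

```cpp
#include <map>
#include <set>
#include <utility>
#include <vector>

// One instruction of the region: its id, the vreg it defines (-1 if
// none), and the vregs it uses.
struct Inst {
  int id;
  int defReg;
  std::vector<int> useRegs;
};

// Walk the region bottom-up. vregDefs maps a vreg to the nearest def
// below the current instruction; a use above such a def yields an anti
// edge (useInstId, defInstId). If skipSingleDefRegs is true, defs of
// vregs with exactly one def are never recorded, mirroring the SSA
// early exit, so the anti edge is lost.
std::set<std::pair<int, int>>
antiEdges(const std::vector<Inst> &region, bool skipSingleDefRegs) {
  std::map<int, int> defCount;
  for (const Inst &I : region)
    if (I.defReg >= 0)
      ++defCount[I.defReg];

  std::map<int, int> vregDefs;
  std::set<std::pair<int, int>> edges;
  for (auto it = region.rbegin(); it != region.rend(); ++it) {
    if (it->defReg >= 0 &&
        !(skipSingleDefRegs && defCount[it->defReg] == 1))
      vregDefs[it->defReg] = it->id;
    for (int r : it->useRegs) {
      std::map<int, int>::iterator d = vregDefs.find(r);
      if (d != vregDefs.end() && d->second != it->id)
        edges.insert(std::make_pair(it->id, d->second));
    }
  }
  return edges;
}
```

Modeling SU(0) as `%vreg1 = COPY %vreg10` and SU(1) as `%vreg10 = LDriw %vreg9`, the single-def skip yields no edges, while recording every def yields the missing anti edge from SU(0) to SU(1).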

Thanks.

Sergei

Thanks for the detailed explanation! My understanding is that COPY %vreg10 is illegal because it has no reaching def on all paths (LDriw is the only def).

Now, the (llvm::next(MRI.def_begin(Reg)) == MRI.def_end()) check is not really sufficient to test SSA in the presence of undefined operands. However, I thought we would have an IMPLICIT_DEF of the vreg in that case, even after the phi is removed. Is that right, Jakob? Otherwise, I think we should somehow mark the vreg as being undefined.

Anyway, I added an assert to catch this problem (see below), and it never triggered on X86. So my guess is that you have incorrect IR coming in. Can you check where %vreg10 is defined? Before coalescing, was it a phi with an operand?

It is safe to workaround the problem by removing the early exit following the SSA check.

-Andy

--- a/lib/CodeGen/ScheduleDAGInstrs.cpp
+++ b/lib/CodeGen/ScheduleDAGInstrs.cpp
@@ -413,9 +413,11 @@ void ScheduleDAGInstrs::addVRegDefDeps(SUnit *SU, unsigned OperIdx) {

   // SSA defs do not have output/anti dependencies.
   // The current operand is a def, so we have at least one.
-  if (llvm::next(MRI.def_begin(Reg)) == MRI.def_end())
+  if (llvm::next(MRI.def_begin(Reg)) == MRI.def_end()) {
+    //!!!
+    VRegDefs.insert(VReg2SUnit(Reg, SU));
     return;

I hate to be the one to start this thread, but I have to say it at least once…

General advice. Working off trunk will be less painful than cherry-picking. When it comes to merging, one revision at a time is the only sane approach. If you run sanity tests after merging a set of new trunk revisions, before checking them into your development branch, you can avoid dealing with the rare, temporary breakage, or “instability” as people like to say. If the breakage lasts more than a day, then it’s specific to your branch and you want to find out about it ASAP. It will be much harder to fix close to the next release.

For example, I just checked in changes to AA scheduling. If you run into a problem with it, I want to know about it now, as in this week. It will be a much bigger burden for me to change it again next week, or any time after that.

I know for some reason people like their own releases to be “based on llvm release X”. That could work if your release cycle is longer than llvm’s (after branching your own development you could continue pulling llvm changes until next llvm release). For frequent releases, it’s nonsense. Either way, your development branch should still follow trunk.

-Andy

TL;DR: Hear, hear. Yes, more, please!

I'll go further: don't use development branches. It's just not worth it.

I say this as someone who created a development branch of Clang. It was and is a pile of pain. We want to kill it with fire.

A lot of people think that a development branch is a useful way to solve problems they run into while developing on trunk:

  • Commit velocity
  • Code review latency
  • API instability
  • Rough or WIP code being seen by others

I assert that these are either not actual problems when working on trunk, or are problems that will actually be worse with a development branch. They are very bad reasons to form one.

  • Commit velocity may go up, but testing, correctness, and quality of those commits will go down. You also have the added burden of having to commit merges.
  • Code review latency doesn’t go away, it gets deferred until the time at which you want to re-integrate your changes. Deferring code review is like deferring payments on a loan. The more you do it, and the longer you do it, the more interest you build up. It’s especially like a loan in that it is a compounding phenomenon. However, it’s worse than most loans, because the interest rate is somewhere between 20% and 2000%. In such scenarios, micro-payments start to make sense, in the same way that tiny incremental patches make sense on trunk.
  • API instability is the exact same problem as code review latency – the API is no more stable, you’re just deferring work. The interest rate here is much lower at least, but there is a secondary gotcha – if your code is on trunk, then the API changer will do much of the work for you.
  • Rough / WIP code should be seen by others, in order to get early feedback, avoid bad paths of implementation etc. We should indoctrinate ourselves as developers with a delightful glee in showing code before it’s 100% ready for prime time.

There are actual cases where you must use a development branch, such as controversial changes that need to be demonstrated as viable before being accepted, or drastic internal changes that require a lot of work to get into a working state (the type-system rewrite of LLVM is a good example here).

If you find yourself in such a scenario, it is essential to make the development branch look as much like trunk as possible. Integrate daily at least, and multiple times a day if you can. Otherwise, all of the problems above will eat away at your productivity.

</soap box>

-Chandler

Andy,

You are probably right here – look at this – before phi elimination this code looks much more sane:

*** IR Dump After Live Variable Analysis ***:
# Machine code for function push: SSA
Function Live Outs: %R0

BB#0: derived from LLVM BB %entry
%vreg5 = IMPLICIT_DEF; IntRegs:%vreg5
%vreg4 = TFRI_V4 ga:xx_stack; IntRegs:%vreg4
        Successors according to CFG: BB#1

BB#1: derived from LLVM BB %for.cond
        Predecessors according to CFG: BB#0 BB#1
%vreg0 = PHI %vreg4, <BB#0>, %vreg3, <BB#1>; IntRegs:%vreg0,%vreg4,%vreg3
%vreg1 = PHI %vreg5, <BB#0>, %vreg2, <BB#1>; IntRegs:%vreg1,%vreg5,%vreg2
%vreg2 = LDriw %vreg0, 0; mem:LD4[%stack.0.in] IntRegs:%vreg2,%vreg0
%vreg3 = ADD_ri %vreg2, 8; IntRegs:%vreg3,%vreg2
%vreg6 = CMPEQri %vreg2, 0; PredRegs:%vreg6 IntRegs:%vreg2
JMP_cNot %vreg6, <BB#1>, %PC; PredRegs:%vreg6
JMP <BB#2>
        Successors according to CFG: BB#2 BB#1

BB#2: derived from LLVM BB %for.end
        Predecessors according to CFG: BB#1
%vreg7 = LDriw %vreg1, 0; mem:LD4[%first1](tbaa=!"any pointer") IntRegs:%vreg7,%vreg1
STriw_GP ga:yy_instr, 0, %vreg7; mem:ST4[@yy_instr](tbaa=!"any pointer") IntRegs:%vreg7
%vreg8 = IMPLICIT_DEF; IntRegs:%vreg8
%R0 = COPY %vreg8; IntRegs:%vreg8
JMPR %PC, %R31, %R0<imp-use,kill>

Right after the dead vreg is introduced:

*** IR Dump After Eliminate PHI nodes for register allocation ***:
# Machine code for function push: Post SSA
Function Live Outs: %R0

BB#0: derived from LLVM BB %entry
%vreg4 = TFRI_V4 ga:xx_stack; IntRegs:%vreg4
%vreg9 = COPY %vreg4; IntRegs:%vreg9,%vreg4
        Successors according to CFG: BB#1

BB#1: derived from LLVM BB %for.cond
        Predecessors according to CFG: BB#0 BB#1
%vreg0 = COPY %vreg9; IntRegs:%vreg0,%vreg9
%vreg1 = COPY %vreg10; IntRegs:%vreg1,%vreg10 <<<<<<<<<<<<<<<<<<<<<<<<<<< Not defined on first iteration...
%vreg2 = LDriw %vreg0, 0; mem:LD4[%stack.0.in] IntRegs:%vreg2,%vreg0
%vreg3 = ADD_ri %vreg2, 8; IntRegs:%vreg3,%vreg2
%vreg6 = CMPEQri %vreg2, 0; PredRegs:%vreg6 IntRegs:%vreg2
%vreg9 = COPY %vreg3; IntRegs:%vreg9,%vreg3
%vreg10 = COPY %vreg2; IntRegs:%vreg10,%vreg2
JMP_cNot %vreg6, <BB#1>, %PC; PredRegs:%vreg6
JMP <BB#2>
        Successors according to CFG: BB#2 BB#1

BB#2: derived from LLVM BB %for.end
        Predecessors according to CFG: BB#1
%vreg7 = LDriw %vreg1, 0; mem:LD4[%first1](tbaa=!"any pointer") IntRegs:%vreg7,%vreg1
STriw_GP ga:yy_instr, 0, %vreg7; mem:ST4[@yy_instr](tbaa=!"any pointer") IntRegs:%vreg7
%vreg8 = IMPLICIT_DEF; IntRegs:%vreg8
%R0 = COPY %vreg8; IntRegs:%vreg8
JMPR %PC, %R31, %R0<imp-use,kill>

# End machine code for function push.

So the problem is elsewhere after all…

I’ll keep on digging. Thanks.

Sergei

Sergei,

If you don't want an undefined variable here, that's something you should look into. Otherwise, the machine code before phi elimination looks ok. Phi elimination is not doing everything I would like it to do in this case. I'll have to try constructing a test case by hand. Until I figure out the right fix, you may want to work around by disabling the SSA check in the scheduler.

Thanks
-Andy

Ok, after a long detour I am back to where I have started. I think there is a problem at dep DAG construction. Let me try to convince you…

Here is the C code we are dealing with:

push ()
{
    struct xx_stack *stack, *top;

    for (stack = xx_stack; stack; stack = stack->next)
        top = stack;

    yy_instr = top->first;
}

If the loop never iterates, "top" will have garbage in it. If it iterates even once, it will presumably hold a valid pointer. Bad, but perfectly valid code.

In SSA it looked like this:

BB#0: derived from LLVM BB %entry
%vreg5 = IMPLICIT_DEF; IntRegs:%vreg5 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Dummy def.
%vreg4 = TFRI_V4 ga:xx_stack; IntRegs:%vreg4
        Successors according to CFG: BB#1

BB#1: derived from LLVM BB %for.cond
        Predecessors according to CFG: BB#0 BB#1
%vreg0 = PHI %vreg4, <BB#0>, %vreg3, <BB#1>; IntRegs:%vreg0,%vreg4,%vreg3
%vreg1 = PHI %vreg5, <BB#0>, %vreg2, <BB#1>; IntRegs:%vreg1,%vreg5,%vreg2 <<<<<<<<<<< Use of that dummy value.
%vreg2 = LDriw %vreg0, 0; mem:LD4[%stack.0.in] IntRegs:%vreg2,%vreg0

By the time it has gotten to the scheduler, it became this:

BB#0: derived from LLVM BB %entry
%vreg9 = TFRI_V4 ga:xx_stack; IntRegs:%vreg9
        Successors according to CFG: BB#1

BB#1: derived from LLVM BB %for.cond
        Predecessors according to CFG: BB#0 BB#1
%vreg1 = COPY %vreg10; IntRegs:%vreg1,%vreg10 <<<<<<<<<<<<< First use of uninitialized vreg10
%vreg10 = LDriw %vreg9, 0; mem:LD4[%stack.0.in] IntRegs:%vreg10,%vreg9
%vreg9 = ADD_ri %vreg10, 8; IntRegs:%vreg9,%vreg10
%vreg6 = CMPEQri %vreg10, 0; PredRegs:%vreg6 IntRegs:%vreg10
JMP_cNot %vreg6, <BB#1>, %PC; PredRegs:%vreg6
JMP <BB#2>
        Successors according to CFG: BB#2 BB#1

BB#2: derived from LLVM BB %for.end
        Predecessors according to CFG: BB#1
%vreg7 = LDriw %vreg1, 0; mem:LD4[%first1](tbaa=!"any pointer") IntRegs:%vreg7,%vreg1
STriw_GP ga:yy_instr, 0, %vreg7; mem:ST4[@yy_instr](tbaa=!"any pointer") IntRegs:%vreg7
JMPR %PC, %R31, %R0<imp-use,undef>

If the loop never iterates, vreg10 and vreg1 will hold garbage (as they should!). If it iterates even once, the use of vreg1 in BB#2 will be fine.

This exactly matches the original C code behavior.

If it is a legitimate code sequence, we need to handle it in DAG dep construction, and introduce an anti dependency between

%vreg1 = COPY %vreg10; IntRegs:%vreg1,%vreg10
%vreg10 = LDriw %vreg9, 0; mem:LD4[%stack.0.in] IntRegs:%vreg10,%vreg9

...because reordering them is wrong. The rest of this thread discusses my logic about how to do it.

By the way, it is exactly the same code on trunk...

I am testing your suggested workaround, and will update with results.

Thanks for your patience.

Sergei

Guys, that was very interesting and useful stuff you preached there. Mind if I make a patch to the FAQ and include it somewhere?

"Q: When should I develop against a branch? A: Only if you are making really big or controversial changes. … blah blah"

I am thinking that it would be highly useful for the project if somebody (yours sincerely) took it upon himself to hunt and gather the very interesting bits that often pop up in these mailing lists. As a newbie, I think the lists are very valuable, but you cannot expect people to always read all of the old stuff on the lists, and as much as we all love Google’s search engine (bing! bing!), it does not always find the stuff you want to find.

Cheers,
Mikael
– Love Thy Frog!

2012/6/13 Chandler Carruth <chandlerc@google.com>

Yes, patches to the FAQ are very welcome.

We should honestly organize the FAQ a bit differently if you want to start really working to flesh it out with more useful information. I think there should be three FAQs at least:

  1. A FAQ for the LLVM Project, which is applicable to LLVM, Clang, and often other subprojects. This would include the license section in the current LLVM faq, and maybe some other points.

  2. A FAQ for the LLVM codebase – specific to the code, libraries, and infrastructure in the primary project.

  3. A Clang FAQ for the FE-specific stuff.

The current FAQ is mostly #2, with a bit of #1, and a lot of out-of-date or flat-out wrong information. The current FAQ could probably stay as one page, or be two pages, but it should have a clear and well delineated split between #1 and #2. We should discuss creating #3 on cfe-dev in a separate thread if you’re interested.

If you’re proposing patches, here are some suggestions as the standard process may not work as well… which others may contradict if they disagree… ;]

  • Patches to fix the formatting / structure should go directly to llvm-commits

  • Patches to add a FAQ entry to section #1 above should at least go to llvm-commits, llvmdev, and possibly a few of the mailing lists for subprojects to get wider feedback.

  • Patches to add a FAQ entry to section #2 above should go to llvm-commits and llvmdev.

I think the best way that I can currently contribute to the project is through technical writing. So I see myself as doing a serious, long-term project of extending the FAQ. So, I can affirm that I want to really start working :) This at the same time as I plan to regularly release an unofficial Windows distro of Clang with headers and libraries so people don’t need to install MinGW to use Clang.

I was myself thinking of a number of FAQs:

  1. A General FAQ with high level questions like “What is the license?”, “What is LLVM?”, and “What is Clang?” Something that lifts the newcomer up to a fairly knowledgeable state in a short time.
  2. An LLVM User FAQ that only deals with LLVM usage questions.
  3. An LLVM Dev FAQ that only deals with LLVM development questions.
  4. A Clang User FAQ that only deals with Clang usage questions.
  5. A Clang Dev FAQ that only deals with Clang development questions.
  6. A you-name-it User FAQ that (lld comes to mind)…
  7. A you-name-it Dev FAQ that…
  8. A Technical Writer FAQ (should include the Sphinx documentation from the lld docs).

I know this probably seems overwhelming at first, but LLVM + Clang are two gigantic projects that really need thousands of questions and answers, not only 10-15. Obviously, this won’t materialize overnight, but I was thinking along that path. I have plenty of time and my commitment to the project is 100 percent as I badly need it for my own compiler project. I cannot stand coding on my compiler 24/7, so I figure I can use the breaks (days and weeks) to work on LLVM/Clang documentation. That way I won’t have to swim in deep water (code), but can stay safely near the beach for the time being. Once I know enough about the project, I can perhaps jump in and begin coding myself. But meanwhile, I can spend my energy on making the transition from noob to guru that much easier for others who want to join it. I even see myself skimming through old posts on the mailing lists to gather useful stuff. And, perhaps, we can over time work out a system where you guys voluntarily cc: me whenever you give a strategic or important explanation you’d like to see included in the FAQ.

There are many, many things that I think could benefit from being documented in a FAQ manner. For instance, how do you make an object file from a program that uses LLVM as its backend? (You don’t need to answer, it is only an example, even though I am very curious about finding out about this sometime). Basically, I want to document what I learn while I transition from LLVM noob to advanced LLVM user (I doubt I’ll make it to guru level, but advanced user is also fine for me).

I’m printing your commit guidelines right after I send this reply. They are highly useful and even the mailing lists ought to have some FAQ entries - probably in the General FAQ.

2012/6/14 Chandler Carruth <chandlerc@google.com>