Mismatch between DFA and Scoreboard

Hi,

I see a discrepancy between the ScoreBoard and the DFA when deciding on available resources. I think I can trace that down to DFAPacketizerEmitter.cpp not heeding the NextCycles field of the raw itinerary data. In essence I have two itineraries that can be scheduled in the same cycle. In scoreboard form:

InsA:
001
000
000
100

InsB
010
100

The two empty slots of InsA are coded by NextCycles=2 in the first stage (actually 7, but this captures the essence). I run the PostRASched with ScoreboardHazardRecognizer on an example with just these two instructions and it schedules them in the same cycle, with the scoreboard in the debug output looking entirely sane. DFAPacketizer, run immediately after, says the reources are not available.

My questions

  • Am I right in assuming that the scoreboard and the DFA should always agree about resource conflicts?
  • Should I rewrite my itineraries to fully expanded form, i.e. without using NextCycles?
  • Am I in DFA state explosion territory with eight basic functional units, seven pipeline stages and variable timing as shown by the instructions above?

A less urgent, more philosophical question:

  • Is there a difference between NextCycles = -1 and NextCycles = 1?

Kind regards,
Martien

NextCycles is also known as TimeInc I guess. And answring my last question: -1 will take the default from cycles, 1 is just 1.

I am not an expert on the DFA, but I have used it. You’re correct that there are differences between the DFA and scoreboard. The DFA, as used by Hexagon, encodes information about issue slots, but doesn’t account for latency or instruction cycles. That is, it simply determines if a set of n instructions be legally issued together. When the DFA was added to LLVM, Hexagon instructions executed in a single cycle. That is no longer true and instructions can take multiple cycles to complete, but the resources used after the issue slot aren’t modelled by the DFA.

For example, say an instruction is issued at cycle 0 on functional unit A, but takes m cycles to complete. Only the issue slot is modeled by the DFA, at cycle 0. The resources used for the m-1 cycles is not modeled (for Hexagon). That means it’s possible to issue another instruction at cycle 1 on functional unit A because the prior instruction is at a different pipeline stage.

If you’re modelling pipeline stages with the DFA, then I wouldn’t be surprised if you see DFA state explosion, or even other issues as that type of functionality probably isn’t tested well.

That’s my recollection of the DFA, so I may be missing some details…

Thanks,
Brendon

Thanks @bcahoon, that seems to confirm my findings. My problem is to code a register write conflict which has instruction dependent timing. So a load writes in later stage than a mul, which writes in a later stage than an add. My scoreboard has 8 independent FuncUnits and is seven deep. It will need all the time-shifted states to keep track of the register write conflict.

For now, I have eliminated the need for the DFA by doing packetization based on the scoreboard. I would like to have the DFA available under the ScheduleHazardRecognizer interface, so that I can plug it into the MultiHazardRecognizer as a fast shortcut. As it is now, I would need two scheduler models to have that conjunction. ComboFunctions aren’t interpreted by the scoreboard hazard recognizer, whereas cycles and time steps aren’t recognized by the DFA. I may be flattered into contributing a DFA implementation of the Scoreboard that gives a quick ‘hasHazard’ check. before entering the full scoreboard query.

Regards,
Martien

My reading of the code suggests it clears the DFA for each cycle and recomputes it. Thus, it doesn’t appear to span multiple cycles. I hacked the pipeliner to use the hazard recognizer for ARC. That seems to allow me to track conflicts accurately. It is likely more expensive though.

I think your reading is correct. I’m not too worried about speed, and I think the state machine for the entire scoreboard would explode anyway. I could see a use of a per-cycle state machine as a building block for the full scoreboard. I think the ScoreboardHazardRecognizer has some room for performance improvements by prepocessing the resource tables some more in tablegen.
How much of a hack is your pipeliner tweak? I’m banking on reuse of the hazard recognizer in the pipeliner. I’ll look at ARC anyway, thanks for the referral.

We have unfortunately not upstreamed this code yet. You will have to reimplement it using a scoreboard yourself. I don’t feel it is a hack at all. Pipelining with the hazard recognizer doesn’t generally dominate the codegen compile time. ISEL and regalloc usually dwarf it.

We’re on the same page. We have a complicated resource model with many different VLIW formats optimized for size. We factored that model out into a ‘module’ of the hazard recognizer. I’m also factoring the core of the scoreboard interpreter to feed in an itinerary explicitly, rather than reading it from the instruction descriptor. That will allow me to dynamically switch the resources, which we will use for instructions that use resources based on which register they write.
Our format module will preclude many searches of the scoreboard, but is itself not the fastest implementation possible yet.
I’m assuming that the pipeliner needs a modulo scoreboard that is persistent and compatible with changing the scheduler direction. Did you extend/change the API? I have a feeling that ‘advanceCycle’ as used from the list scheduler is not the most intuitive way for other scheduling algorithms.

Yes, we did modify the interface, but unfortunately we did not document our changes well.