Question about per-operand machine model

Hi Andy and all,

I have a question about per-operand machine model. I am finding some relations between 'MCWriteLatencyEntry' and 'MCWriteProcResEntry'.

For example,

class InstTEST<..., InstrItinClass itin> : Instruction {
   let Itinerary = Itin;
}
// I assume this MI writes 2 registers.
def TESTINST : InstTEST<..., II_TEST>

// schedule info
II_TEST: InstrItinClass;

def ALU1: ProcResource<1>;
def ALU2: ProcResource<1>;

def WriteALU1: SchedWriteRes<[ALU1]> { let Latency = 1; }
def WriteALU2: SchedWriteRes<[ALU2]> { let Latency = 2; }

def : ItinRW<[WriteALU1, WriteALU2], [II_TEST]>

From this example, we can access the latency information of MI with 'getWriteLatencyEntry()' and the resource information of MI with 'getWriteProcResBegin()'. At this point, I would like to find the related resource information with each latency information. But TableGen generates the 'WriteResourceID' of 'MCWriteLatencyEntry' when the 'Write' is referenced by a 'ReadAdvance'. And the order of each information, which are resource and latency, is not same. Could you let me know whether it is possible to find the related resource information with each latency information or not?

Thanks,
JinGu Kang

Resources and latency are not tied. An instruction is mapped to a scheduling class. A scheduling class is mapped to a set of resources and a per-operand list of latencies.

For resources that are not fully pipelined, you indicate the number of cycles during which the resource is used:

This uses two ALUs in the same cycle:

def WriteALU2: SchedWriteRes<[ALU, ALU]> {}

This uses an ALU for two cycles:

def WriteALUReplay: SchedWriteRes<[ALU]> { let ResourceCycles = [2]; }

This uses a MUL for 4 cycles and an ALU for two cycles:

def WriteMULALU: SchedWriteRes<[MUL, ALU]> { let ResourceCycles = [4, 2]; }

The weakness of the per-operand model (as opposed to the per-pipeline-stage itinerary) is that we cannot indicate, for example, that the ALU resources are used after the MUL resources. We could extend the model fairly easily to accomodate that.

-Andy

Resources and latency are not tied. An instruction is mapped to a scheduling class. A scheduling class is mapped to a set of resources and a per-operand list of latencies.

Thanks for your kind explanation.

Our heuristic algorithm have needed the latency and the resource per operand to check resource conflicts per cycle. In order to support this with LLVM, I expected a per-operand list of resources like latencies with a scheduling class.

Can I ask you something to modify on tablegen? I think that the 'WriteResourceID' field of 'MCWriteLatencyEntry' is for identifying the WriteResources of each defintion as commented on code. As you know, tablegen sets the 'WriteResourceID' field of 'MCWriteLatencyEntry' with 'WriteID' when the 'Write' of defition is referenced by a 'ReadAdvance'. If we always set this field with 'WriteID', it causes problem? I can see that 'ReadAdvance' only uses the 'WriteResourceID' field of 'MCWriteLatencyEntry' in 'computeOperandLatency' function. I think the pair of latency and write resource for defintion will be useful to check conflicts of resources. As reference, I have attached simple patch.

Thanks,
JinGu Kang

WriteResourceID_on_tablegen.patch (683 Bytes)

Hi Andy,

I am sorry to misunderstand 'ReadAdvance' code. In order to support
resource per operand, I feel we need more table and function. If
possbile, I would like to listen to your opinion whether this feature is
useful or not. As I mentioned on previous e-mail, it will be useful to
access the latency and the resource per operand while checking resource
conflict per cycle.

Thanks,
JinGu Kang

Hi JinGu,

We currently have the ResourceCycles list to indicate the number of cpu cycles during which a resource is reserved. We could simply add a ResourceDelay with similar grammar. The MachineScheduler could be taught to keep track of the first and last time that a resource is reserved.

Note that the MachineScheduler will work with the instruction itineraries if you choose to implement them. That¡¯s the only way to get a full reservation table without customizing the scheduler. You can plugin your own state machine or other scheduling constraint logic. You may want to do this if you have very complicated constraints.

Can you provide an example of the most complicated instruction resources that you need to model?

-Andy

Hi Andy,

I am trying to schedule and packetize instructions for VLIW at post-RA
stage or final codegen stage, where code transformations are not allowed
any more, because hardware can not resolve resource conflict. There is a
simple example as following:

ADD dest_reg1, src_reg1, src_reg2 (functional unit : ALU)
STORE dest_reg2, mem (functional unit: LOAD_STORE)

These instructions can be genally packetized together because there is
no dependency among operands and they use different functional unit. But
we have one more restricton. The restriction is that some of
instructions can not access to same register file at the same cycle. In
other words, if 'src_reg1' of ADD instruction uses register file 'A' and
'dest_reg2' of STORE instruction uses same register file at the same
cycle, it causes resource conflict and can not be executed on same
cycle. This restriction depends on instruction type. I tried to consider
each register file as a resource unit which is consumed by each operand.
While scheduling instructions per cycle, used register file is recorded
on state per cycle to check the conflict. In our heristic, it depends on
operand's latency to record this resource on specific cycle's state. so
I have tried to find a way to get latency and resource with each
operand. If it is not possible to support this feature with per-operand
resource model, as you suggested, I will try to make our own state
machine or other scheduling constraint logic. I am newbee with
scheduler. If you have any kinds of comment or feel something worng,
please let me know. It will be really helpful.

Thanks for your kind response,
JinGu Kang

Hi Andy,

I am trying to schedule and packetize instructions for VLIW at post-RA
stage or final codegen stage, where code transformations are not allowed
any more, because hardware can not resolve resource conflict. There is a
simple example as following:

ADD dest_reg1, src_reg1, src_reg2 (functional unit : ALU)
STORE dest_reg2, mem (functional unit: LOAD_STORE)

These instructions can be genally packetized together because there is
no dependency among operands and they use different functional unit. But
we have one more restricton. The restriction is that some of
instructions can not access to same register file at the same cycle. In
other words, if 'src_reg1' of ADD instruction uses register file 'A' and
'dest_reg2' of STORE instruction uses same register file at the same
cycle, it causes resource conflict and can not be executed on same
cycle. This restriction depends on instruction type. I tried to consider
each register file as a resource unit which is consumed by each operand.
While scheduling instructions per cycle, used register file is recorded
on state per cycle to check the conflict. In our heristic, it depends on
operand's latency to record this resource on specific cycle's state. so
I have tried to find a way to get latency and resource with each
operand. If it is not possible to support this feature with per-operand
resource model, as you suggested, I will try to make our own state
machine or other scheduling constraint logic. I am newbee with
scheduler. If you have any kinds of comment or feel something worng,
please let me know. It will be really helpful.

It sounds like the register file is static and does not depend on register allocation. In this case, what you tried makes sense but is really not supported. The machine model tables are designed to be efficient for the common case, and per-operand resources don¡¯t really make sense most of the time.

It sounds like you want to model the pipeline stage at which a resource is used. To do that with the per-operand machine model (misnomer), I think we need a ResourceDelay vector in addition to ResourceCycles, which we could easily add.

However, overall, I think you¡¯re target is interesting enough that you may be better off augmenting the standard machine model with your own model. Your scheduler plugin could keep your own tables or state machine to model the constraints.

If you want to be clever, you could write tablegen code to build your model up from the SchedRead/Write definitions that are part of the standard model. You could add extra fields specific to your model.

Were you previously using the old instruction itineraries, and now moving to the new model?

-Andy

Hi Andrew,

We are currently using a custom model where scheduling information is attached to each MCInstrDesc through tablegen, and we're trying to move to one of LLVM's models.

To expand on what JinGu mentioned, our target has explicit ports that are used to read and write values from and to the register file. The read port is usually accessed on cycle 0 while the write port is accessed when the result is written back to the destination register. Let's assume ADD has a latency of 1, MUL has a latency of 2 and both use port P0 to write back their result. The two instructions below would conflict on P0:

MUL r3, r4, r5
ADD r0, r1, r2
NOP ; Both r0 and r4 are written back using P0 - conflict.

On our target there is no interlock which means any conflict results in the wrong value being written back to one of the register. That's why we want to model these ports as resources in the new model. That's also why we map these port resources to each operand as each operand accesses a different port.

After reading your replies, we have realized that the scheduler does not need to know which operand corresponds to each port. It simply needs to know the set of ports used by each instruction and after how many cycles these ports are used/reserved to avoid any conflict. That's why I believe the new process resource model closely fits what we need, except for the per-resource delay you mentioned.

This is how our model currently looks like:

def :ItinRW<[1_LATENCY_WITH_P0, 0_LATENCY_WITH_P1, 0_LATENCY_WITH_P2], [II_ADD]>;
def :ItinRW<[2_LATENCY_WITH_P0, 0_LATENCY_WITH_P1, 0_LATENCY_WITH_P2], [II_MUL]>;

where n_LATENCY_WITH_p is defined roughly as:

class n_LATENCY_WITH_p<int latency, ProcResourceKind port> : SchedWriteRes<[PR_Pp]> {
     let Latency = latency;
     let ResourceDelays = [latency];
}

class PR_Pp<int portIdx> : ProcResource<1>;

The latency for register write-back/port access is static and without interlock, which I think means the port resources should have 'Buffered = 0' in the definition. Is that correct?

I have attached a patch that adds the 'ResourceDelays' field in tablegen. Could you have a look at it? A couple possible issues are:
- 'Delay' is signed, since 'Cycles' in MCWriteLatencyEntry is also signed.
- When an instruction accesses the same resource multiple times, the uses are aggregated in SubtargetEmitter::GenSchedClassTables. I'm not sure how that would work if we add a 'Delay' field to MCWriteProcResEntry.

Thanks,
Pierre

add_resource_delays.patch (4.93 KB)

Hi Andrew,

We are currently using a custom model where scheduling information is attached to each MCInstrDesc through tablegen, and we’re trying to move to one of LLVM’s models.

To expand on what JinGu mentioned, our target has explicit ports that are used to read and write values from and to the register file. The read port is usually accessed on cycle 0 while the write port is accessed when the result is written back to the destination register. Let’s assume ADD has a latency of 1, MUL has a latency of 2 and both use port P0 to write back their result. The two instructions below would conflict on P0:

MUL r3, r4, r5
ADD r0, r1, r2
NOP ; Both r0 and r4 are written back using P0 - conflict.

On our target there is no interlock which means any conflict results in the wrong value being written back to one of the register. That’s why we want to model these ports as resources in the new model. That’s also why we map these port resources to each operand as each operand accesses a different port.

After reading your replies, we have realized that the scheduler does not need to know which operand corresponds to each port. It simply needs to know the set of ports used by each instruction and after how many cycles these ports are used/reserved to avoid any conflict. That’s why I believe the new process resource model closely fits what we need, except for the per-resource delay you mentioned.

This is how our model currently looks like:

def :ItinRW<[1_LATENCY_WITH_P0, 0_LATENCY_WITH_P1, 0_LATENCY_WITH_P2], [II_ADD]>;
def :ItinRW<[2_LATENCY_WITH_P0, 0_LATENCY_WITH_P1, 0_LATENCY_WITH_P2], [II_MUL]>;

where n_LATENCY_WITH_p is defined roughly as:

class n_LATENCY_WITH_p<int latency, ProcResourceKind port> : SchedWriteRes<[PR_Pp]> {
let Latency = latency;
let ResourceDelays = [latency];
}

class PR_Pp : ProcResource<1>;

The latency for register write-back/port access is static and without interlock, which I think means the port resources should have ‘Buffered = 0’ in the definition. Is that correct?

Yes, but it isn’t sufficient. The scheduler makes no attempt to insert nops currently. However, at the very least, you will want to implement your own MachineSchedStrategy. It would be natural to handle nop insertion within your implementation.

In fact, the interpretation of most machine model properties (MircoOpBufferSize, resource BufferSize, ResourceCycles, ResourceDelay) is handled within the MachineSchedStrategy. In past emails I have been explaining how the GenericScheduler interprets the model, but it is really up to your custom strategy to implement the model.

I have attached a patch that adds the ‘ResourceDelays’ field in tablegen. Could you have a look at it? A couple possible issues are:

  • ‘Delay’ is signed, since ‘Cycles’ in MCWriteLatencyEntry is also signed.

Sure.

  • When an instruction accesses the same resource multiple times, the uses are aggregated in SubtargetEmitter::GenSchedClassTables. I’m not sure how that would work if we add a ‘Delay’ field to MCWriteProcResEntry.

Me neither. I suggest adding an assert to make sure no one accidentally uses two resources with non-zero delay. Otherwise, your patch looks fine to me. It’s totally up to you to test it though. I really want to take this patch, but we have no mechanism for testing out-of-tree target features.

-Andy

Hi Andrew,

We are currently using a custom model where scheduling information is attached to each MCInstrDesc through tablegen, and we’re trying to move to one of LLVM’s models.

To expand on what JinGu mentioned, our target has explicit ports that are used to read and write values from and to the register file. The read port is usually accessed on cycle 0 while the write port is accessed when the result is written back to the destination register. Let’s assume ADD has a latency of 1, MUL has a latency of 2 and both use port P0 to write back their result. The two instructions below would conflict on P0:

MUL r3, r4, r5
ADD r0, r1, r2
NOP ; Both r0 and r4 are written back using P0 - conflict.

On our target there is no interlock which means any conflict results in the wrong value being written back to one of the register. That’s why we want to model these ports as resources in the new model. That’s also why we map these port resources to each operand as each operand accesses a different port.

After reading your replies, we have realized that the scheduler does not need to know which operand corresponds to each port. It simply needs to know the set of ports used by each instruction and after how many cycles these ports are used/reserved to avoid any conflict. That’s why I believe the new process resource model closely fits what we need, except for the per-resource delay you mentioned.

This is how our model currently looks like:

def :ItinRW<[1_LATENCY_WITH_P0, 0_LATENCY_WITH_P1, 0_LATENCY_WITH_P2], [II_ADD]>;
def :ItinRW<[2_LATENCY_WITH_P0, 0_LATENCY_WITH_P1, 0_LATENCY_WITH_P2], [II_MUL]>;

where n_LATENCY_WITH_p is defined roughly as:

class n_LATENCY_WITH_p<int latency, ProcResourceKind port> : SchedWriteRes<[PR_Pp]> {
let Latency = latency;
let ResourceDelays = [latency];
}

class PR_Pp : ProcResource<1>;

The latency for register write-back/port access is static and without interlock, which I think means the port resources should have ‘Buffered = 0’ in the definition. Is that correct?

Yes, but it isn’t sufficient. The scheduler makes no attempt to insert nops currently. However, at the very least, you will want to implement your own MachineSchedStrategy. It would be natural to handle nop insertion within your implementation.

Nop insertion during scheduling sounds good to me, but nop insertion after regalloc has the advantage of being able to insert nops for spill/reload. Unless you don’t have spills?

Pete

Hi Andrew,

We are currently using a custom model where scheduling information is attached to each MCInstrDesc through tablegen, and we’re trying to move to one of LLVM’s models.

To expand on what JinGu mentioned, our target has explicit ports that are used to read and write values from and to the register file. The read port is usually accessed on cycle 0 while the write port is accessed when the result is written back to the destination register. Let’s assume ADD has a latency of 1, MUL has a latency of 2 and both use port P0 to write back their result. The two instructions below would conflict on P0:

MUL r3, r4, r5
ADD r0, r1, r2
NOP ; Both r0 and r4 are written back using P0 - conflict.

On our target there is no interlock which means any conflict results in the wrong value being written back to one of the register. That’s why we want to model these ports as resources in the new model. That’s also why we map these port resources to each operand as each operand accesses a different port.

After reading your replies, we have realized that the scheduler does not need to know which operand corresponds to each port. It simply needs to know the set of ports used by each instruction and after how many cycles these ports are used/reserved to avoid any conflict. That’s why I believe the new process resource model closely fits what we need, except for the per-resource delay you mentioned.

This is how our model currently looks like:

def :ItinRW<[1_LATENCY_WITH_P0, 0_LATENCY_WITH_P1, 0_LATENCY_WITH_P2], [II_ADD]>;
def :ItinRW<[2_LATENCY_WITH_P0, 0_LATENCY_WITH_P1, 0_LATENCY_WITH_P2], [II_MUL]>;

where n_LATENCY_WITH_p is defined roughly as:

class n_LATENCY_WITH_p<int latency, ProcResourceKind port> : SchedWriteRes<[PR_Pp]> {
let Latency = latency;
let ResourceDelays = [latency];
}

class PR_Pp : ProcResource<1>;

The latency for register write-back/port access is static and without interlock, which I think means the port resources should have ‘Buffered = 0’ in the definition. Is that correct?

Yes, but it isn’t sufficient. The scheduler makes no attempt to insert nops currently. However, at the very least, you will want to implement your own MachineSchedStrategy. It would be natural to handle nop insertion within your implementation.

Nop insertion during scheduling sounds good to me, but nop insertion after regalloc has the advantage of being able to insert nops for spill/reload. Unless you don’t have spills?

To elaborate a bit more, MachineScheduler can run both preRA and postRA. So, if you want to do nop insertion within MachineScheduler (as opposed to a separate pass) you could enable it only during postRA scheduling.

-Andy

Thanks, I’ll have a look at MachineSchedStrategy and see how we can implement it for our target. We are currently doing scheduling with a custom scheduler/packetization pass at the very end of the machine compilation process, before object generation. We aren’t using the MachineScheduler or any other scheduling before that pass, neither preRA or postRA. I think we could benefit from at least preRA scheduling, so that the register live ranges seen by the RA better match the final ranges (after our scheduling pass). We do have spills, but I’m not sure if there is a benefit to inserting nops before our final pass though. Would that improve register allocation, if it was possible to do so preRA? Adding an assert when someone uses two resources with non-zero delay, or maybe two different delays, sounds good to me. I’m glad to hear that you’d want to take patches even for out-of-tree features, it’s much appreciated. I’m not very familiar with the itinerary model, but aren’t these two ways of expressing schedules equivalent? def :ItinRW<[1_LATENCY_WITH_P0, 0_LATENCY_WITH_P1, 0_LATENCY_WITH_P2], [II_ADD]>; InstrItinData<II_ADD, [InstrStage<1, [P1], 0>, InstrStage<1, [P2]>, InstrStage<1, [P0]>], [1, 0, 0]> If that’s the case, then does the new machine model express itineraries with more than one stage, without adding this ‘ResourceDelays’ field? Thanks, Pierre

Thanks, I’ll have a look at MachineSchedStrategy and see how we can implement it for our target. We are currently doing scheduling with a custom scheduler/packetization pass at the very end of the machine compilation process, before object generation. We aren’t using the MachineScheduler or any other scheduling before that pass, neither preRA or postRA. I think we could benefit from at least preRA scheduling, so that the register live ranges seen by the RA better match the final ranges (after our scheduling pass).

I see, then you’re really can do whatever you want with the machine model. You’re just limited by the tables produced by the current tablegen backend and whatever features you add to it. You just need to understand the format of the tables in GenSubtargetInfo.inc and try to fit your model into that format. Hopefully adding ResourceDelays will give you enough flexibility.

We do have spills, but I’m not sure if there is a benefit to inserting nops before our final pass though. Would that improve register allocation, if it was possible to do so preRA?

Nope. The main reason to bundle pre-RA is to expose more opportunity for code motion (without physical register dependencies), thus generate tighter bundles. In general, there’s no reason to introduce nops other than to meet your encoding constraints, so it’s just a matter of picking the most convenient place to do that.

-Andy