Incorrect Cortex-R4/R4F/R5 ProcessorModel in ARM.td

In ARM.td, I see that the ProcessorModel for cortex-r4, cortex-r4f, and cortex-r5 (as well as r7 and r8) is based on “CortexA8Model”, which seems incorrect. When this was added in 2015, there were also comments associated with this configuration, such as “// FIXME: R5 has currently the same ProcessorModel as A8” (later removed). The processor model for Cortex-r52 appears to be correct and corresponds to an associated “CortexR52Model”.
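For reference, the entries in ARM.td look roughly like this (feature lists trimmed here, since they aren't relevant to the question):

def : ProcessorModel<"cortex-r4",  CortexA8Model,  [ARMv7r, ...]>;
def : ProcessorModel<"cortex-r4f", CortexA8Model,  [ARMv7r, ...]>;
def : ProcessorModel<"cortex-r5",  CortexA8Model,  [ARMv7r, ...]>;
def : ProcessorModel<"cortex-r52", CortexR52Model, [ARMv8r, ...]>;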

Does anyone know why r4/r4f/r5 were set up based on “CortexA8Model”?

Is there a plan to upstream a fix to correct this?

Thanks!

Alan Phipps

Hello Alan,

Using a cortex-a8 scheduling model for v7-r CPUs may not be optimal, but I wouldn't go as far as to call it incorrect. The cortex-r4, cortex-r4f and cortex-r5 are in-order cores, and cortex-a8 (another in-order core) was the closest match available. We don't have any current plans to develop a custom scheduling model for r4, r4f or r5.

Peter

Thanks, Peter, for your response. Right -- certainly not incorrect in the sense of generating an incorrect schedule, but definitely seems suboptimal.

I've also noticed that if I experimentally base the v7-r model on the Cortex-R52 ProcessorModel (or even build for Cortex-R52), I get a better schedule than when it is based on cortex-a8, and I see a 2%-3% performance improvement on benchmarks like CoreMark running on cortex-r5 hardware. Do you know why that might be the case? Can you suggest other, more straightforward ways one might improve scheduling (and thus performance) for cortex-r5 if there aren't any plans to develop a custom model for v7-r?
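(A quick way to see the scheduling difference, independent of full benchmark runs, is to diff the assembly llc emits for the two CPUs on the same input; kernel.ll here is just a placeholder name:)

llc -mtriple=arm-none-eabi -mcpu=cortex-r5  -O2 kernel.ll -o r5.s
llc -mtriple=arm-none-eabi -mcpu=cortex-r52 -O2 kernel.ll -o r52.s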

Thanks for your help,

-Alan

Hello Alan,

Looking at the public information for Cortex-R5 (https://developer.arm.com/ip-products/processors/cortex-r/cortex-r5) and Cortex-R52 (https://developer.arm.com/ip-products/processors/cortex-r/cortex-r52) shows that both are in-order with similar length pipelines. It is possible that the Cortex-R52 scheduling model may match the Cortex-R5 more closely than the choices available at the time that Cortex-R5 was upstreamed.

I haven't written a scheduling model myself. My understanding of the process is that the technical reference manual, or any other publicly available information about the micro-architecture, is used to provide initial values for the model. Then it is a matter of refinement against as many benchmarks as you can run.

I think if, empirically, the Cortex-R52 model is producing better results than the Cortex-A8 one, then it could be possible to adapt the model for the Cortex-R5 by removing the parts specific to V8-R and tweaking parameters based on cycle times from the technical reference manual (TRM). I'm sure we could find someone to review a patch if there is a good enough set of benchmarks showing that the new model is better than the Cortex-A8 one.

The technical reference manual for the Cortex-R5: https://developer.arm.com/documentation/ddi0460/c/

Peter

Hey Peter,

I've begun looking into adapting the model for the R52 into a model for the R5.

Tweaking the instruction timings and removing the V8-R-specific parts has been mostly straightforward, and I'm seeing about a 3% improvement on benchmarks like CoreMark.
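Most of the timing changes have the same shape: adjust the Latency on a SchedWriteRes to match the R5 TRM timing tables. A made-up example just to show the form (the resource name and the number are placeholders, not real TRM values):

def R5UnitFPALU : ProcResource<1>;                // placeholder FP pipeline resource
def R5WriteFPALU : SchedWriteRes<[R5UnitFPALU]> {
  let Latency = 4;                                // placeholder; the real value comes from the TRM
}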

However, the R5 rules on which instructions can be dual issued are different from the R52, and I don't see how the superscalar behavior is modeled in the existing R52 schedule.

Would you happen to know what part of the R52 tablegen file is for modeling the superscalar behavior?

Thanks,
Benson

Hello,

I know just about enough to find the right file that describes the scheduling model, but I don't know much about the details myself. I'm hoping that one of my colleagues, or someone who knows about scheduling in general, can help with or correct what I'm writing below.

From what I glean from:

https://llvm.org/devmtg/2016-09/slides/Absar-SchedulingInOrder.pdf
https://llvm.org/devmtg/2014-10/Slides/Estes-MISchedulerTutorial.pdf

The basics of superscalar modelling are captured by the IssueWidth on the SchedMachineModel:

def CortexR52Model : SchedMachineModel {
  let MicroOpBufferSize = 0; // R52 is in-order processor
  let IssueWidth = 2; // 2 micro-ops dispatched per cycle
  let LoadLatency = 1; // Optimistic, assuming no misses
  let MispredictPenalty = 8; // A branch direction mispredict, including PFU
  let CompleteModel = 0; // Covers instructions applicable to cortex-r52.
}

I would also expect the forwarding information to be relevant, since to dual issue certain pairs the operand dependencies would need to be available in time.

// Forwarding information - based on when an operand is read
def : ReadAdvance<R52Read_ISS, 0>;
def : ReadAdvance<R52Read_EX1, 1>;
def : ReadAdvance<R52Read_EX2, 2>;
def : ReadAdvance<R52Read_F0, 0>;
def : ReadAdvance<R52Read_F1, 1>;
def : ReadAdvance<R52Read_F2, 2>;
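As I understand it, the stall a consumer sees is the producer's write latency minus the ReadAdvance of the operand class it reads with. A made-up pair just to illustrate (the names and values below are placeholders, not taken from the R52 file):

def ExampleUnit  : ProcResource<1>;                                   // placeholder pipeline
def ExampleWrite : SchedWriteRes<[ExampleUnit]> { let Latency = 3; }  // hypothetical producer
// A consumer whose operand read is mapped to R52Read_EX2 (ReadAdvance 2)
// only needs the value at its EX2 stage, so it stalls 3 - 2 = 1 cycle
// behind ExampleWrite rather than the full 3 cycles.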

From https://llvm.org/devmtg/2016-09/slides/Absar-SchedulingInOrder.pdf, assuming it still holds (the slides are about 5 years old), these were the limitations listed:

LLVM Scheduler - What's missing?

* Instructions with slot constraints
  * Cannot issue in second slot - specification and pickNode changes
  * Cannot issue with any other - micro-ops
  * Cannot issue with specific another - reliance on resource constraint (not adequate)
* Inter-lock constraint modelling
  * Cannot slow down previous instruction
* First-half, second-half and in-stage forwarding
  * Further divide pipeline stages
* Variadic instructions
  * SchedPredicate, SchedVariant - an alternate compact representation necessary

It may be that more complex superscalar constraints cannot be modelled.

Hope that helps

Peter

Hey Peter,

Thanks for the reply, I was able to flesh out most of the R5 model with the information you had provided.

However, I had a question about the R5 TRM regarding the meaning of "Issue Cycles". The description of Issue Cycles says "the minimum number of cycles required to issue an instruction". Do issue cycles indicate that no other instructions will be issued for that amount of time?

For example, here's an entry from the timings chapter:

Instruction               | Cycles | Early Regs | Result Latency
--------------------------|--------|------------|---------------
VDIV.F64 <Dd>, <Dn>, <Dm> | 3      | <Dn>, <Dm> | 63

And let's say I have the following sequence:

VDIV r1, r2, r3
ADD r4, r5, r6

Since there's no data dependence, these instructions should be issued right after one another. However, since VDIV has 3 "issue cycles", if VDIV is issued on cycle 0, does that mean ADD is issued on cycle 3? Or are they issued on consecutive cycles, with "issue cycles" indicating something else?

(I am assuming that some aspects of the superscalar behavior come into play here, but I'm not sure how)
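(To make the question concrete, one possible mapping of those two columns onto the model would be the following; whether ResourceCycles is the right home for the "Cycles" column is exactly what I'm asking about, and R5UnitFPDIV is just a made-up name:)

def R5UnitFPDIV : ProcResource<1>;                 // made-up name for the FP divider
def R5WriteVDIV64 : SchedWriteRes<[R5UnitFPDIV]> {
  let Latency = 63;                                // "Result Latency" column from the TRM
  let ResourceCycles = [3];                        // tentative reading of the "Cycles" column
}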

Thanks again!
Benson

Hello Benson,

I have to give a quick reply as I'm about to go on vacation for a couple of weeks. Again, I'm hoping my Arm colleagues can correct me if I'm wrong, as I'm interpreting what is in the TRM, which usually abstracts somewhat from the hardware.

If we take out dual issue, then my understanding is that the sequence below would take 4 cycles: 3 for the VDIV and 1 for the ADD. If there were a data dependency, for example if the second instruction used d1, then the second instruction stalls for 63 (result latency) - 3 (issue cycles) = 60 cycles.

VDIV.F64 d1, d2, d3 // assuming double precision registers here for VDIV.F64, as in the table
ADD r4, r5, r6

With dual issuing, we can look at the permitted combinations in the TRM; I don't think there is a match for 64-bit CDP instructions like VDIV.F64. With the 32-bit equivalent there is one, so

VDIV.F32 s1, s2, s3
ADD r4, r5, r6

would dual issue under case F1 b,m:

Any single precision CDP (exceptions...) | As for Case C (any data processing instruction) |

In this case the instruction with the fewest cycles is considered to take 0 cycles, so I'd expect the total cycle count for

VDIV.F32 s1, s2, s3 // 2 cycles for VDIV.F32
ADD r4, r5, r6      // 1 cycle for ADD

to be 2 cycles.

Hope that helps, and apologies for a hasty response

Peter