[RFC] Iterative compilation framework for Clang/LLVM

This is a first step towards a full iterative compilation framework for
Clang/LLVM. The attached patch is a work in progress, and the idea behind
sending it to the list earlier rather than later is to open up a discussion
on iterative compilation in Clang/LLVM. All comments are welcome, especially
those related to integrating the framework into Clang/LLVM.

Current compilers have a pipeline structure in which data from one phase is
passed to the following one. These compilation phases can interfere in
unpredictable ways, so compilers rely on conservative heuristics when making
decisions during the compilation process. Because of the unpredictability of
these interferences, compilers sometimes make decisions that degrade code
quality. Iterative compilation can partially resolve these problems by
allowing the compiler to explore alternative decisions, generate code for
these new decision paths, and eventually keep the generated code that is
best by a particular criterion.

This approach is not new in the literature; we can mention Milepost/CTuning
[http://ctuning.org] among other projects. In comparison to Milepost/CTuning,
the emphasis here is on much finer granularity: arbitrary compiler decisions
can be changed at any point in the translation process, not only those
driven by the values of command line options.

This framework implements an iterative approach in which every iteration
generates slightly different code. At the end of the compilation process the
best emitted code is chosen.
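To make the overall flow concrete, here is a minimal sketch of such an
iterate-and-keep-best driver loop. This is purely illustrative and not taken
from the patch: the names (iterativeCompile, Compile, Result) and the
size-only criterion are invented for this sketch.

```cpp
#include <cassert>
#include <climits>
#include <vector>

// Purely illustrative: run one "compilation" per decision path and keep
// the result that minimizes the criterion (code size here). Compile is a
// stand-in callback that maps a decision path to a code-size estimate.
struct Result {
  std::vector<bool> Path; // decisions that produced the best code
  unsigned Size;          // code size in bytes (selection criterion)
};

template <typename CompileFn>
Result iterativeCompile(CompileFn Compile,
                        const std::vector<std::vector<bool>> &Paths) {
  Result Best{{}, UINT_MAX};
  for (const std::vector<bool> &P : Paths) {
    unsigned Size = Compile(P);
    if (Size < Best.Size)
      Best = Result{P, Size};
  }
  return Best;
}
```

In the real framework each "Compile" call is a full compiler invocation, so
the number of explored paths directly determines the compile-time cost.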

All points at which relevant decisions are made should be identified, and
access to the framework's data structure (DecisionTree) should be provided
at these points. Based on the data from the DecisionTree, obtained by
calling the getDecision member function, a decision is made at each such
point whether to execute the default action (as the compiler would do
without this framework) or to follow an alternative path.
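As a sketch of what the getDecision contract could look like, here is a
hypothetical, much simplified DecisionTree that merely replays a precomputed
path of decisions. The real class in the patch also deals with priorities
and with recording newly reached decision points, which are omitted here.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical sketch of the getDecision() contract: false = take the
// default path (what the compiler does today), true = take the
// alternative. This toy version just replays a fixed path chosen for the
// current iteration.
class DecisionTree {
  std::vector<bool> Path; // decisions planned for this iteration
  unsigned Next = 0;      // index of the next decision point queried
public:
  explicit DecisionTree(std::vector<bool> P) : Path(std::move(P)) {}

  // Decision points reached beyond the planned path fall back to the
  // default behaviour (false).
  bool getDecision() {
    if (Next >= Path.size())
      return false;
    return Path[Next++];
  }
};
```

A pass would then flip its default choice only when getDecision() returns
true, as in the instrumented Inliner example below.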

Let us consider an example from the Inliner:

Original code:

if (!IC) {
  DEBUG(dbgs() << "    NOT Inlining: cost=" << IC.getCost()
               << ", thres=" << (IC.getCostDelta() + IC.getCost())
               << ", Call: " << *CS.getInstruction() << "\n");
  return false;
}

Code after instrumentation:

bool doInlining = (bool)IC;
DecisionTreeProxy *decisionTreeProxy =
    moduleDecisionTreeProxies->getCurrFnDecisionTreeProxy();
int t = MaxDtPriority - abs(IC.getCostDelta());
bool decision = false;
if (t >= 0)
  decision = decisionTreeProxy->getDecision(call);
if (decision)
  doInlining = !doInlining;

if (!doInlining) {
  DEBUG(dbgs() << "    NOT Inlining: cost=" << IC.getCost()
               << ", thres=" << (IC.getCostDelta() + IC.getCost())
               << ", Call: " << *CS.getInstruction() << "\n");
  return false;
}

The getDecision function returns true or false: false means take the default
path, and true means take the alternative one.

The compilation process can be represented by a sequence of binary values,
where every value determines the decision made at some point. The value
FALSE means to follow the normal compilation process, as the original
compiler would do without the iterative compilation framework; TRUE means
to take the alternative path, i.e. to make the opposite decision at that
point. This naturally forms a tree-like structure, which we named the
Decision Tree. Traversing the tree from the root to a leaf gives all the
decisions made during one compilation iteration.
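One simple way to derive the path for a following iteration from the current
one is a backtracking walk over such sequences. The rule below (drop
trailing TRUEs, then flip the last FALSE) is invented for illustration; the
framework's actual search strategy, which also grows new subtrees after a
flipped decision, may differ.

```cpp
#include <cassert>
#include <vector>

// Illustrative only: one very simple backtracking rule for choosing the
// next decision path to explore. Starting from the all-FALSE path (the
// unmodified compilation), drop trailing TRUE decisions, then flip the
// last FALSE to TRUE. An empty result means this pruned search space is
// exhausted.
std::vector<bool> nextPath(std::vector<bool> P) {
  while (!P.empty() && P.back())
    P.pop_back();    // backtrack over already-explored TRUE branches
  if (P.empty())
    return P;        // search exhausted
  P.back() = true;   // take the alternative at the last FALSE point
  return P;
}
```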

At the end of each iteration, the quality of the generated code is estimated
by executing a newly introduced target-dependent pass, and based on the
results the path for the following iteration is calculated. At the moment
this has been implemented for MIPS only, and the estimate is based on code
size.

The iterative compilation framework was initially implemented in the CodeGen
phase, but a large number of code transformations had already been performed
by the earlier phases, which reduces the opportunities to improve the
quality of the generated code. The framework is therefore driven from the
Clang driver, where it can influence every single compiler phase. This
relocation caused some implementation issues, mostly related to the fact
that the driver and the compiler itself (-cc1) run in different processes.
This is resolved by introducing the DecisionTreeProxy class, which
implements communication between the two processes through temporary files.
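The inter-process hand-off can be as simple as the driver writing the
decision path to a file and -cc1 reading it back. The textual encoding
below is invented for illustration; the actual file format used by
DecisionTreeProxy in the patch may differ.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustrative only: encode a decision path as a line of '0'/'1'
// characters, the kind of payload a DecisionTreeProxy could pass through
// a temporary file between the driver and the -cc1 process.
std::string encodePath(const std::vector<bool> &Path) {
  std::string S;
  for (bool D : Path)
    S += D ? '1' : '0';
  return S;
}

std::vector<bool> decodePath(const std::string &S) {
  std::vector<bool> Path;
  for (char C : S)
    Path.push_back(C == '1');
  return Path;
}
```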

At this point the mechanism is applied to two optimizations: the Inliner and
Loop Invariant Code Motion. The supported platform is MIPS32r2.

Despite the small number of instrumented points, we got code size
improvements on some test files from the CSiBE benchmark. Here are the most
promising results:

jpeg-6b/jidctflt.c           -10.2%
OpenTCP-1.0.4/system.c        -6.7%
libmspack/test/chmd_md5.c     -5.8%
flex-2.5.31/regex.c           -4.8%

We plan to apply this framework to more compiler phases and to other target
architectures. It is expected to give better results on non-orthogonal
architectures such as microMIPS. Another direction is to apply machine
learning based on input program features (number of basic blocks, register
pressure at some point, etc.); this can be used either for selecting the
compilation path for the next iteration, or for estimating the best path up
front so that the program can be compiled without the iterative approach
(in a single iteration).

          o
         / \
        /   \
       o     \
      /       \
     /         \
    o           o
   / \         /
  /   \       /
 L1    o     o
      /     /
     /     /
    L2    L3

Figure 1: Decision Tree Example. The symbol 'o' represents a decision
point. If the getDecision function returns FALSE the left branch is taken
(the default path); otherwise the right branch is taken. The 'L1'
compilation iteration can be represented as the string (FALSE, FALSE,
FALSE), 'L2' as (FALSE, FALSE, TRUE, FALSE), and 'L3' as (TRUE, FALSE,
FALSE).

Clearly, we see huge potential in this direction. At the same time, we are
aware that this would make the compiler slower, which nobody wants, but
this penalty can be acceptable when the most optimized code is needed.

What does everyone think?
Could this approach be considered as a valid mode for Clang/LLVM?

Please use this post and the attached patch (it can be applied to the 3.3
release; clang should be inside the llvm/tools folder) as a reference and
not as a patch for submission.

Example command line:

clang -target mipsel-unknown-linux -Oz -S -fiterative-comp=3 example.c

Thank you. Looking forward to hearing your comments on this.

ic.patch (61 KB)

Radovan,

Thanks for posting this! I would really like to have this kind of functionality available. There are a lot of things to think through here; comments inline...

From: "Radovan Obradovic" <Radovan.Obradovic@imgtec.com>
To: llvmdev@cs.uiuc.edu
Sent: Monday, December 16, 2013 11:31:21 AM
Subject: [LLVMdev] [RFC] Iterative compilation framework for Clang/LLVM

> This is a first step towards a full iterative compilation framework for
> Clang/LLVM. [...]

I'm glad you're doing this, and I hope that you'll be willing to stick through what might be a long design discussion :)

> Current compilers have pipeline structure in which data from one phase is
> passed to the following one. [...]
>
> Code after instrumentation:
>
> [... see the instrumented Inliner example above ...]

Taking a quick look over the patch, this seems to introduce a lot of boilerplate code. Can you think of a design that simplifies this?

> [...]
>
> At the end of each iteration, quality of generated code is estimated by
> executing newly introduced target dependent pass. Based on results path
> for the following iteration is calculated. At the moment, this has been
> implemented for MIPS only and it is based on code size.

I think that, in general, we ought to be able to get an overall latency estimate from the instruction scheduler.

> [...] This is resolved by introducing DecisionTreeProxy class that
> implements communication between these two processes through temporary
> files.

I think that this can be solved by introducing some API for the necessary data flow. There is obviously already a binding here.

> At this point mechanism is applied to two optimizations: Inliner and
> Loop Invariant Code Motion. Supported platform is MIPS32r2.
>
> Despite small number of instrumented points we got improvements on some
> test files from CSiBE benchmark. [...]

Very interesting. Out of curiosity, any slowdowns?

> We plan to apply this framework to more compiler phases and to some
> other target architectures. [...]
>
> Clearly, we see huge potential in this direction. At the same time, we
> are aware this would cause compiler to be slower which everyone is
> against, but this penalty can be taken when we are in need for most
> optimized code.

What is the overhead of the current implementation? I suspect that, in order to control the large number of decision points from large regions, we will want some way to limit fine-grained decisions in the tree in large regions (and only including coarse ones, like the inlining of relatively large functions). In small blocks, on the other hand, we might even include instruction scheduling decisions (for in-order cores).

I'd been thinking of experimenting with some iterative scheduling for my in-order PPC cores (for small blocks in loops), and integrating it into something like this might be even better.

> What does everyone think?
> Could this approach be considered as a valid mode for Clang/LLVM?
>
> Please use this post and the attached patch (it can be applied to the
> 3.3 release; clang should be inside the llvm/tools folder) as a
> reference and not as a patch for submission.

I see in the patch some (seemingly small) changes to the pass manager setup. Can you describe exactly what you've done? Chandler is rewriting the pass infrastructure, and I'd like to understand how what you've done will or won't fit with the new setup.

Thanks again,
Hal

Hal,

Thank you for finding interest in our work!
Please find my answers inlined below.

________________________________________
From: Hal Finkel [hfinkel@anl.gov]
Sent: Monday, December 16, 2013 4:26 PM
To: Radovan Obradovic
Cc: llvmdev@cs.uiuc.edu; chandlerc; Andrew Trick
Subject: Re: [LLVMdev] [RFC] Iterative compilation framework for Clang/LLVM

> I'm glad you're doing this, and I hope that you'll be willing to stick
> through what might be a long design discussion :)

Our belief is that this approach can be beneficial, especially
for non-orthogonal architectures (e.g. microMIPS), thus we plan
to continue our work on this.

> > Code after instrumentation: [...]
>
> Taking a quick look over the patch, this seems to introduce a lot of
> boilerplate code. Can you think of a design that simplifies this?

It is necessary to access the decision tree at every place where you want
to call the getDecision function, and a priority parameter, specific to
each such place, must be calculated. It is only a few lines of code, but
we will consider how to improve this. All suggestions are welcome.

> > At the end of each iteration, quality of generated code is estimated
> > by executing newly introduced target dependent pass. [...]
>
> I think that, in general, we ought to be able to get an overall latency
> estimate from the instruction scheduler.

Criteria to optimize for speed should be added as well. Do you think that
latency info from the scheduler can be used to estimate the execution
speed of the generated code? Can we somehow guess the execution frequency
of basic blocks?

> > [...] This is resolved by introducing DecisionTreeProxy class that
> > implements communication between these two processes through
> > temporary files.
>
> I think that this can be solved by introducing some API for the
> necessary data flow. There is obviously already a binding here.

This probably needs to be fixed. All suggestions are welcome.

> > Despite small number of instrumented points we got improvements on
> > some test files from CSiBE benchmark. [...]
>
> Very interesting. Out of curiosity, any slowdowns?

jidctflt.c improves because LICM moves some constant definitions out of
the loop and increases the register pressure, so some spills occur; the
optimized version, in this case, should therefore be even faster. But
slowdowns are possible, and we use the -Oz option, so we are telling the
compiler that code size is the priority.

> What is the overhead of the current implementation? I suspect that, in
> order to control the large number of decision points from large regions,
> we will want some way to limit fine-grained decisions in the tree in
> large regions (and only including coarse ones, like the inlining of
> relatively large functions). In small blocks, on the other hand, we
> might even include instruction scheduling decisions (for in-order
> cores).

In the current implementation one full compilation per decision is needed.
The plan is, for the CodeGen passes, to rerun only the CodeGen passes once
per decision (reusing the bitcode). Machine learning should considerably
reduce the need for a large number of iterations.

> I'd been thinking of experimenting with some iterative scheduling for
> my in-order PPC cores (for small blocks in loops), and integrating it
> into something like this might be even better.
>
> I see in the patch some (seemingly small) changes to the pass manager
> setup. Can you describe exactly what you've done? Chandler is rewriting
> the pass infrastructure, and I'd like to understand how what you've
> done will or won't fit with the new setup.

A pointer to the decision tree proxy has to be passed to every place where
getDecision is called, and to the pass that calculates the function
fitness, so a pointer to it was added to the Pass and PassManagerBase
classes. We will provide a more detailed explanation later.

Hi Radovan,

I am also interested in iterative compilation in LLVM, and I have implemented a simple iterative compilation driver.

I guess you do not need to embed the pointer to ModuleDecisionTreeProxies into the core classes of LLVM, i.e. the PassManager class and the Function class.

Instead, you could:

  1. Implement a special pass manager that runs a set of passes iteratively. Implementing the iterative compilation framework based on the pass manager allows the framework to live outside of clang, and we can use it in opt, llc, etc.

  2. Implement an immutable pass to store the global information (i.e. ModuleDecisionTreeProxies, if I do not misunderstand your patch) across iterations. In your example, once you have implemented ModuleDecisionTreeProxies as an immutable pass, you can get the ModuleDecisionTreeProxies in every pass that needs it by simply calling getAnalysis().

Thanks
Hongbin

PS: I attached our iterative driver, you could have a look if you are interested.

IterativeSchedulingDriver.cpp (10 KB)

Hal,

Here is a slightly modified iterative compilation patch that reduces the
amount of boilerplate code at the points where the decisions are made.

Radovan

ic.patch (57.3 KB)

Hello Hongbin,

Thank you for your suggestions. This sounds like something we were looking
for in order to better integrate the iterative compilation framework into
LLVM. We plan to port this implementation to the LLVM 3.4 release, and then
we will try to follow your recommendations.

Radovan

This is somewhat tangential to the thread… I think it would be great to have a static performance estimator. I’m thinking of an MI-level library that simulates a loop using the machine model until it settles on an average number of cycles per iteration. A simple tool could handle the most probable path through the loop. It could be extended to handle averaging across a small number of paths if needed. Estimating in-order cycles is easy enough. I also developed a simplified model for estimating OOO core cycles; I don’t have working code any more, as it doesn’t serve any purpose in the code base.

-Andy

From: "Andrew Trick" <atrick@apple.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Radovan Obradovic" <Radovan.Obradovic@imgtec.com>, "llvmdev@cs.uiuc.edu Dev" <llvmdev@cs.uiuc.edu>
Sent: Monday, December 23, 2013 4:41:00 PM
Subject: Re: [LLVMdev] [RFC] Iterative compilation framework for Clang/LLVM

> [...] I also developed a simplified model for estimating OOO core
> cycles.

Can you describe the model? I'm curious how you did this.

Thanks again,
Hal

Here is a rough sketch of the simulation, if you can call it that. This is not a pipeline simulator. Instead, we’re estimating the impact of saturating OOO resources on the schedule. We can get some idea of how many loop iterations run before pipeline stalls occur. If OOO resources are not saturated, then we’re left with two cycle counts: first, the time to decode and dispatch the micro-ops; second, the total latency of the scheduled instructions.

(I’m writing this up from my notes, so I won’t claim correctness. This is just to give you the idea.)

** Keep track of three points on a timeline, in increasing order:

CurrCycle = (DispatchedMOps / Width) + Stalls
    "Cycles to decode and dispatch scheduled instructions
     into reservation stations, plus guaranteed pipeline stalls."

ExecutionCycles = max ResourceCycles of ready instructions over all resources.
    "Minimum cycles to execute ready instructions."

CriticalCycles = max ResourceCycles of scheduled instructions over all resources.
    "Minimum cycles to execute all scheduled instructions."

** Some state for book-keeping:

BufferedMOps          "queue of buffered micro-ops waiting to retire"

BufferedResOps[NRes]  "queue of buffered micro-ops waiting to execute"

Executed              "count of executed but not retired micro-ops"

Retired               "count of retired micro-ops"

** Very simple modeling of the reorder buffer/register rename pool:

*** After scheduling an instruction, I:
    if ReadyCycle(I) <= CurrCycle:
      ++Retired
    else:
      BufferedMOps.push(ReadyCycle(I))

*** After bumping CurrCycle:
    for T in BufferedMOps if T <= CurrCycle:
      ++Retired
      BufferedMOps.erase(T)

** Model the reservation stations:

*** After scheduling an instruction, I:
    if ReadyCycle(I) <= ExecutionCycles:
      ++Executed
      for R in Resources(I):
        ResourceCycles[R] += CyclesPerResource(R)
      ExecutionCycles = max(ResourceCycles[R], ExecutionCycles)
    else:
      for R in Resources(I):
        BufferedResOps[R].push(ReadyCycle(I))

*** After bumping ExecutionCycles:
    for T in BufferedResOps[R] if T <= ExecutionCycles:
      ++Executed
      ResourceCycles[R] += CyclesPerResource(R)
      ExecutionCycles = max(ResourceCycles[R], ExecutionCycles)

** Check for stalls

CurrCycle is incremented whenever we detect stalls that result from reaching the limit of the MicroOpBufferSize or the BufferSize of one of the processor resources, as described in the machine model.

ExecutionCycles is updated in a feedback loop. Each time an instruction becomes ready, it adds to the cycles needed to execute based on the instruction’s resources. ExecutionCycles is the maximum cycles to execute instructions based on some saturated resource. When ExecutionCycles increases, more instructions become ready, and so on.

CriticalCycles is just an estimate of the total latency to execute. It’s a useful metric but doesn’t factor into the simulation.

Lastly, we can factor in the overall impact of stalls by maintaining the invariant at all times:

CurrCycle <= ExecutionCycles <= CriticalCycles
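The two easy pieces of this estimate (dispatch cycles and the per-resource
critical path) can be sketched as below, ignoring the buffering feedback
loop entirely. All names here are invented for illustration and are not
from any LLVM code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Illustrative only: for a straight-line block with no stalls,
// DispatchCycles follows CurrCycle = DispatchedMOps / Width (rounded up,
// Stalls = 0 here), and CriticalCycles is the maximum per-resource cycle
// count over all resources. The reservation-station feedback loop that
// computes ExecutionCycles is omitted.
struct Estimate {
  unsigned DispatchCycles;
  unsigned CriticalCycles;
};

Estimate estimate(unsigned MicroOps, unsigned Width,
                  const std::vector<unsigned> &ResourceCycles) {
  Estimate E;
  E.DispatchCycles = (MicroOps + Width - 1) / Width; // round up
  E.CriticalCycles = 0;
  for (unsigned RC : ResourceCycles)
    E.CriticalCycles = std::max(E.CriticalCycles, RC);
  return E;
}
```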

-Andy

From: "Radovan Obradovic" <Radovan.Obradovic@imgtec.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: llvmdev@cs.uiuc.edu, "chandlerc" <chandlerc@google.com>, "Andrew Trick" <atrick@apple.com>, "Zoran Jovanovic"
<Zoran.Jovanovic@imgtec.com>, "Petar Jovanovic" <Petar.Jovanovic@imgtec.com>, "Rich Fuhler"
<Rich.Fuhler@imgtec.com>, clattner@apple.com
Sent: Thursday, December 19, 2013 6:21:16 AM
Subject: RE: [LLVMdev] [RFC] Iterative compilation framework for Clang/LLVM

> Hal,
>
> Here is a slightly modified iterative compilation patch that reduces
> the amount of boilerplate code at the points where the decisions are
> made.

Thanks! That looks more manageable. There is going to be *a lot* of work involved in getting something like this integrated; but if you're up for that, as a next step, please rebase the patch and upload it to phabricator (see http://llvm.org/docs/Phabricator.html). That will make it easier to review in detail.

-Hal

Thank you for the reply.
At the moment we are working on rebasing our implementation (meanwhile, there have been changes in the pass manager implementation).
Also, we are implementing the DecisionTree as an immutable pass (as suggested by Hongbin Zheng).
As soon as we are finished, the new patch will be posted on Phabricator.

Regards,
Radovan & Zoran