[GSoC 2016] Interprocedural Register Allocation - Introduction and Feedback

Hello LLVM Community,

Sorry for the delay; I was busy with final exams.

I am Vivek from India. Thanks for choosing my proposal for Interprocedural Register Allocation (IPRA) in LLVM. Mehdi Amini and Hal Finkel will be mentoring me for this project.

IPRA can reduce the code size and runtime of programs by allocating registers across module and procedure boundaries.

I have identified some old but effective research work in this area.

I would like the community's feedback on the feasibility of these approaches; I am aiming to implement two of them during this project.

Here is the list of papers. I have read the first two and would like to discuss their approaches first; I will read the other two papers and then initiate a discussion about them as well. My goal is to settle on a concrete implementation plan before 23 May 2016, and for that I need the community's help.

  1. Compile time ----- Minimizing register usage penalty at procedure calls - http://dl.acm.org/citation.cfm?id=53999

In this approach, intra-procedural register allocation is used as the base, but machine code is generated in bottom-up call graph order, and the inter-procedural effect is achieved by propagating the register usage information of a callee to its caller (i.e., child to parent in the CallGraph). The caller can then use different registers than the callee, saving load/store cost at procedure calls. This is not as trivial as it seems, due to recursive calls, library function usage, etc. Also, in the upper region of the graph the number of available registers might drop to zero; in that case the technique should fall back to normal loads/stores at procedure calls. Other difficulties have been identified as well; please follow this mail thread: https://groups.google.com/d/topic/llvm-dev/HOYAXv3m1LY/discussion

My mentor has already provided a patch that alters the code generation order to follow a bottom-up call graph traversal; I am working from that point now. Any other help/suggestions are always welcome.

  2. Link time ----- Global register allocation at link time - http://dl.acm.org/citation.cfm?id=989415

In this particular approach (a sort of true IPRA), registers are reallocated at link time (the optimization is optional; if turned off, code is still compiled with intra-procedural allocation). Modules are first compiled normally, but the object code is annotated with enough detail that the linker can build the call graph and compute usage information at link time. The compiler also writes hints into the object code describing how the code should be changed if a particular variable ends up in a different register after reallocation. The linker can use this information to decide which (global) variables need to stay in the same register throughout program execution, and, from the register usage information in the call graph, which procedures will never be active simultaneously, so that the locals of those procedures can share registers without loads/stores at procedure calls.

For this particular method, please help me analyze feasibility:

  1. Can LLVM collect the following information at the module level in Machine IR: the list of procedures in a module, the list of locals in each procedure, the list of procedures a particular procedure can call, and the list of variables each procedure references? Each entry in the last two lists includes an estimate of the number of times the procedure is called, or the variable is referenced, per execution of the enclosing procedure.

  2. Can LLVM write such informative annotations to object files?

  3. Is LTO capable of leveraging that information?

For the first part, a mechanism similar to a MachineModulePass would be desirable. That may not be possible during this project, but perhaps we can build a smaller version of it that suits our purpose.

  3. Compile time ----- Minimum cost interprocedural register allocation - http://dl.acm.org/citation.cfm?id=237780

  4. Compile time ----- Register allocation across procedure and module boundaries - http://dl.acm.org/citation.cfm?id=93551

I am aspiring to true IPRA, and for that I will require a lot of help, but I think with proper guidance it is achievable.

Any help/hints would be appreciated!

Sincerely,

Vivek

From: "vivek pandya" <vivekvpandya@gmail.com>
To: "llvm-dev" <llvm-dev@lists.llvm.org>, "Tim Amini Golling"
<mehdi.amini@apple.com>, "Hal Finkel" <hfinkel@anl.gov>
Cc: "Quentin Colombet" <qcolombet@apple.com>
Sent: Tuesday, May 10, 2016 2:59:16 PM
Subject: [GSoC 2016] Interprocedural Register Allocation -
Introduction and Feedback

In terms of scoping the project for the summer, I definitely recommend that you focus on (1) first. If you finish that, we can certainly move on to other things. Regarding link time, note that any such design would likely look much different from the one in David Wall's paper, however, because our LTO re-codegens everything anyway. The paper says, "Finally, it keeps us honest as designers of the system; once we postpone anything until link time, the temptation is great to postpone everything, ..." - Well, we've long since succumbed to that temptation when we LTO. C'est la vie.


I don't think we need to make any kind of MachineModulePass to make this work. Once we alter the visitation order based on the CGSCC iteration scheme, we can keep state in-between functions in the pre-existing hacky way (using static members of the relevant function passes).

-Hal


I’ll add +1 here, but I already wrote the same thing on IRC when discussing with Vivek. True IPRA without a proper MachineModule infrastructure won’t be doable in my opinion (even with such infrastructure, it may not be trivial in LLVM in general).

Regarding link time, note that any such a design would likely look much different than in David Wall’s paper however, because our LTO re-codegens everything anyway. The paper says, “Finally, it keeps us honest as designers of the system; once we postpone anything until link time, the temptation is great to postpone everything, …” - Well, we’ve long-since succumb to that temptation when we LTO. C’est la vie.

+1 as well, our LTO will benefit naturally from the leaf-to-root information propagation. ThinLTO will be more challenging/interesting though!


I also don’t see where/why we need a MachineModule(Pass) for the CGSCC scheme. That said, I’d rather avoid using a function pass with static members if we can instead have a module-level analysis that does the bookkeeping of per-function results and is queried by the register allocator somehow.
Matthias/Quentin may have other input on this aspect.

*Vivek Pandya*


For the first part, a mechanism similar to a MachineModulePass would be desirable. That may not be possible during this project, but perhaps we can build a smaller version of it that suits our purpose.

Sorry, my mistake here: by "the first part" I meant requirement 1) in the link-time approach.


Yes, for the register-usage propagation approach we don't need a MachineModulePass.

Sincerely,
Vivek

*Vivek Pandya*

------------------------------

*From: *"vivek pandya" <vivekvpandya@gmail.com>
*To: *"llvm-dev" <llvm-dev@lists.llvm.org>, "Tim Amini Golling" <
mehdi.amini@apple.com>, "Hal Finkel" <hfinkel@anl.gov>
*Cc: *"Quentin Colombet" <qcolombet@apple.com>
*Sent: *Tuesday, May 10, 2016 2:59:16 PM
*Subject: *[GSoC 2016] Interprocedural Register Allocation -
Introduction and Feedback

Hello LLVM Community,

Sorry for delay as I was busy in final exams.

I am Vivek from India. Thanks for choosing my proposal for
Interprocedural Register Allocation (IPRA) in LLVM. Mehdi Amini and Hal
Finkel will be mentoring me for this project.

IPRA can reduce code size and runtime of programs by allocating register
across the module and procedure boundaries.

I have identified some old but effective research work on this area.
I want community's feedback for feasibility of these approach and I am
targeting to implement two of them during this project.

Here is list of the papers, I have read first two papers and I would like
to discuss those approach first, I will read other two paper then initiate
discussion for them as well. All I want is to find out a concrete
implementation plan before 23 May, 2016 and for that I need community's
help.

1) Compile time ----- Minimizing register usage penalty at procedure
calls - http://dl.acm.org/citation.cfm?id=53999
====================================================================In
this approach intra-procedural register allocation is used as base but
machine code generation order is bottom up traversal of call graph and
inter-procedural effect is achieved by propagating register usage
information of callee function to caller (i.e child to parent in CallGraph)
so that caller can use different registers than callee and can save load
store cost at procedure call, this is not trivial as it seems due to
recursive calls, library function usage etc. Also for upper region of the
graph in this technique available number of registers might become zero in
that case it should fall back to normal load store at procedure call. Apart
from these difficulties other difficulties have been identified please
follow this mail-chain
https://groups.google.com/d/topic/llvm-dev/HOYAXv3m1LY/discussion
My mentor has already provided me a patch that alters code generation
order as per bottom up call graph traversal, I am working from that point
now. Any other help/suggestion is always welcomed.

2) Link time ----- Global register allocation at link time -
http://dl.acm.org/citation.cfm?id=989415
====================================================================In
this particular approach (sort of true IPRA) registers will be reallocated
(this optimization will be optional if turned off still code will be
compiled as per intra-procedural allocation) at link time. Here modules are
first complied as per normal compilation but the object code is annotated
with details so that linker can build call graph and also calculate usage
information at link time. Compiler also write hints in object code that if
particular variable is allocated in some other register ( due to new
allocation) then how the code should be changed? Thus linker can use these
information to decide which variables (global) need to be in same register
through out the program execution and also according to register usage
information in call graph which procedure will not be active simultaneously
so that locals for that procedures can be in same registers with out load
store at procedure calls.
For these particular method help me to analyze feasibility:
1) Can llvm collects following information at module level in MachineIR?
list of procedures in module, list of locals in procedures, list of
procedures that a particular procedure can call, and a list of the
variables this procedure references. Each entry in the last two lists
includes an estimate of the number of times the procedure is called or the
variable is referenced in each execution of this procedure
2) Can llvm write informative commands to object files?
3) Can LTO is capable of leveraging those commands?

In terms of scoping the project for the summer, I definitely recommend
that you focus on (1) first. If you finish that, we can certainly move on
to other things.

I'll add +1 here, but I already wrote the same thing on IRC when
discussing with Vivek. True IPRA without a proper MachineModule
infrastructure won't be doable in my opinion (even with such
infrastructure, it may not be trivial in LLVM in general).

Regarding link time, note that any such a design would likely look much
different than in David Wall's paper however, because our LTO re-codegens
everything anyway. The paper says, "Finally, it keeps us honest as
designers of the system; once we postpone anything until link time, the
temptation is great to postpone everything, ..." - Well, we've long-since
succumb to that temptation when we LTO. C'est la vie.

+1 as well, our LTO will benefit naturally from the leaf-to-root
information propagation. ThinLTO will be more challenging/interesting
though!

For the first part, a mechanism similar to a MachineModulePass would be
desirable, but that may not be possible during this project; perhaps we
can build some smaller version of it that suits our purpose.

I don't think we need to make any kind of MachineModulePass to make this
work. Once we alter the visitation order based on the CGSCC iteration
scheme, we can keep state in-between functions in the pre-existing hacky
way (using static members of the relevant function passes).

Sorry, my mistake here: by "first part" I meant requirement 1) in the
link-time approach.

I also don't see where/why we need a MachineModule(Pass) for the CGSCC
scheme. That said, I'd rather avoid using a function pass with static
members if we can have a ModuleAnalysis that does the bookkeeping of
per-function results and is queried by the register allocator somehow.
Matthias/Quentin may have other input on this aspect.

@Hal, do you mean adding a simple MachineFunction pass that just operates
on a register-allocated function and prepares a BitVector indicating which
registers that MachineFunction uses, and then using this pass as an
analysis pass (i.e., simply returning the static BitVector of clobbered
registers when register allocation for the next function begins; this
part is not very clear to me)? Such a pass could be scheduled post
register allocation in lib/CodeGen/Passes.cpp:

void TargetPassConfig::addMachinePasses() {
  ...
  // Run pre-ra passes.
  addPreRegAlloc();

  // Run register allocation and passes that are tightly coupled with it,
  // including phi elimination and scheduling.
  if (getOptimizeRegAlloc())
    addOptimizedRegAlloc(createRegAllocPass(true));
  else
    addFastRegAlloc(createRegAllocPass(false));

  // Run post-ra passes.
  addPostRegAlloc();
  // Add a new pass here which keeps register mask information across
  // function calls.
  ...
}

But this also requires the current register allocators to use this
information in some way, because RegMaskBits in LiveIntervalAnalysis.cpp
is not static across calls. I mean, I am not clear on how to propagate the
static info to the intra-procedural register allocators (if that is
possible without disturbing their code).

I think this also applies in some way to Mehdi Amini's idea of keeping a
ModulePass for bookkeeping, but then the existing register allocators
would be required to change so that the code can query the ModulePass for
the RegMaskBits of a particular function.

Vivek

From: "vivek pandya" <vivekvpandya@gmail.com>
To: "Mehdi Amini" <mehdi.amini@apple.com>
Cc: "Hal Finkel" <hfinkel@anl.gov>, "Quentin Colombet"
<qcolombet@apple.com>, "llvm-dev" <llvm-dev@lists.llvm.org>,
"Matthias Braun" <matze@braunis.de>
Sent: Wednesday, May 11, 2016 3:15:03 AM
Subject: Re: [GSoC 2016] Interprocedural Register Allocation -
Introduction and Feedback

First, my hope is that we won't need to change the register allocators, as such, in order to make use of this information. Instead, we'll simply be able to alter the register masks generated for the call instructions. These masks will indicate fewer clobbers than might otherwise be present based on the ABI, because of information gathered during codegen of the callee. These masks are generally constructed by the target based on the calling convention. The PowerPC backend, for example, looks like this:

  // Add a register mask operand representing the call-preserved registers.
  const TargetRegisterInfo *TRI = Subtarget.getRegisterInfo();
  const uint32_t *Mask =
      TRI->getCallPreservedMask(DAG.getMachineFunction(), CallConv);
  assert(Mask && "Missing call preserved mask for calling convention");
  Ops.push_back(DAG.getRegisterMask(Mask));

but it can be more complicated. If you look for uses of 'getRegisterMask' in Target/*/*ISelLowering.cpp, you'll see what I mean. Regardless, the code ends up calling some method in the target's TargetRegisterInfo subclass. These methods generally look something like this:

const uint32_t *
PPCRegisterInfo::getCallPreservedMask(const MachineFunction &MF,
                                      CallingConv::ID CC) const {
  const PPCSubtarget &Subtarget = MF.getSubtarget<PPCSubtarget>();
  ...
  return TM.isPPC64() ? (Subtarget.hasAltivec() ? CSR_SVR464_Altivec_RegMask
                                                : CSR_SVR464_RegMask)
                      : (Subtarget.hasAltivec() ? CSR_SVR432_Altivec_RegMask
                                                : CSR_SVR432_RegMask);
}

In any case, the fundamental idea here is that, when someone calls getCallPreservedMask in order to set the regmask on a call, we might not have to use the CC at all. Instead, if we've already codegened the function, we might use a cache of 'exact' register masks computed during codegen of the potential callees instead.

In order to do this, I think we'll need to provide a function callable from the target's getCallPreservedMask implementation, which can return such an 'exact' regmask when available. I think we need to do it this way for two reasons:

1. Not all of the target code calls getCallPreservedMask, but sometimes calls other similar target-specific functions (e.g. getTLSCallPreservedMask).
2. The targets need to opt-in to this behavior because only the target can know that all register uses are really tagged correctly post "pre-emit".

Because the target is free to introduce uses of registers at essentially any time, we need to do the scanning for used registers after the "pre-emit" passes run. This can be done by scheduling some simple register-use scanning pass after the call to addPreEmitPass in lib/CodeGen/Passes.cpp.

> I think this also applies in some way to Mehdi Amini's idea to keep a
> ModulePass for bookkeeping, but then existing register allocators will
> be required to change so that the code can query the ModulePass for
> RegMaskBits for a particular function.

I think that the simplest way to do this is to create an immutable analysis pass (e.g. BasicAA) that keeps the cache of the computed register masks. This is somewhat similar in spirit to how the 'AssumptionCache' analysis works at the IR level. This analysis can then be created by lib/CodeGen/Passes.cpp early, and then queried and passed around later by the CodeGen/Target code. Because it is an immutable analysis, it won't get destroyed until the very end, which is also important because, I imagine, it will need to own the memory associated with the generated register masks.

-Hal

MachineRegisterInfo maintains linked lists of defs/uses for each register,
so you can determine whether a specific register is used or not without
scanning.

- Matthias

From: "Matthias Braun" <matze@braunis.de>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "vivek pandya" <vivekvpandya@gmail.com>, "llvm-dev"
<llvm-dev@lists.llvm.org>
Sent: Wednesday, May 11, 2016 12:46:25 PM
Subject: Re: [llvm-dev] [GSoC 2016] Interprocedural Register
Allocation - Introduction and Feedback

> MachineRegisterInfo maintains linked lists of defs/uses for each
> register, so you can determine whether a specific register is used or
> not without scanning.

Does this include regmask-clobbered registers?

-Hal

Yes there is also MachineRegisterInfo::UsedPhysRegMask which should be the union of all regmasks in the function.

Does this mean there is no need for a separate pass to propagate register
usage information, and that only the codegen order needs to be changed to
a DFS over the call graph? Also, currently no intra-procedural register
allocator uses UsedPhysRegMask to avoid loads/stores; does that mean we
need to change that? Or am I getting it wrong?

- Vivek

From: "vivek pandya" <vivekvpandya@gmail.com>
To: "Matthias Braun" <matze@braunis.de>
Cc: "Hal Finkel" <hfinkel@anl.gov>, "llvm-dev"
<llvm-dev@lists.llvm.org>
Sent: Wednesday, May 11, 2016 1:14:07 PM
Subject: Re: [llvm-dev] [GSoC 2016] Interprocedural Register
Allocation - Introduction and Feedback

No, it just means that the "scanning" procedure is simple; it does not actually need to visit each instruction.

-Hal

OK, so we still need a ModulePass that will be executed as a pre-emit
pass, but it will do the propagation in O(1) time, right?

- Vivek


No, we need a function pass that runs after the pre-emit passes. This pass will update some module-level analysis (which will need to be an immutable pass, likely, to fit within the current infrastructure, even though it is not technically immutable - like our AssumptionCache). But, yes, this post-pre-emit function pass will be O(1).

-Hal

*Vivek Pandya*

------------------------------

*From: *"vivek pandya" <vivekvpandya@gmail.com>
*To: *"llvm-dev" <llvm-dev@lists.llvm.org>, "Tim Amini Golling" <
mehdi.amini@apple.com>, "Hal Finkel" <hfinkel@anl.gov>
*Cc: *"Quentin Colombet" <qcolombet@apple.com>
*Sent: *Tuesday, May 10, 2016 2:59:16 PM
*Subject: *[GSoC 2016] Interprocedural Register Allocation - Introduction
and Feedback


2) Link time ----- Global register allocation at link time -
http://dl.acm.org/citation.cfm?id=989415
====================================================================
In this approach (a sort of true IPRA), registers are reallocated at link
time (the optimization is optional; if it is turned off, the code is still
compiled with intra-procedural allocation). Modules are first compiled
normally, but the object code is annotated with enough detail that the
linker can build a call graph and compute register usage information at
link time. The compiler also writes hints into the object code describing
how the code should be rewritten if a particular variable is moved to a
different register by the new allocation. The linker can then use this
information to decide which (global) variables need to stay in the same
register throughout program execution and, from the register usage
information in the call graph, which procedures can never be active
simultaneously, so that the locals of those procedures can share registers
without loads/stores at procedure calls.
For this particular method, please help me analyze feasibility:
1) Can LLVM collect the following information at module level in MachineIR:
the list of procedures in the module, the list of locals in each procedure,
the list of procedures a given procedure can call, and the list of
variables it references? Each entry in the last two lists includes an
estimate of the number of times the procedure is called or the variable is
referenced in each execution of this procedure.
2) Can LLVM write such informative commands into object files?
3) Is LTO capable of leveraging those commands?

In terms of scoping the project for the summer, I definitely recommend
that you focus on (1) first. If you finish that, we can certainly move on
to other things.

I'll add +1 here, but I already wrote the same thing on IRC when
discussing with Vivek. True IPRA without a proper MachineModule
infrastructure won't be doable in my opinion (even with such
infrastructure, it may not be trivial in LLVM in general).

Regarding link time, note that any such design would likely look much
different from the one in David Wall's paper, however, because our LTO
re-codegens everything anyway. The paper says, "Finally, it keeps us honest
as designers of the system; once we postpone anything until link time, the
temptation is great to postpone everything, ..." - Well, we've long since
succumbed to that temptation when we LTO. C'est la vie.

+1 as well, our LTO will benefit naturally from the leaf-to-root
information propagation. ThinLTO will be more challenging/interesting
though!

@Mehdi I don't understand this comment; it seems to me that two different
approaches have been merged here. Information propagation in the call graph
will be done without LTO.

You are propagating information on the call graph, but this is efficient
only when the callee's definition is codegen'd in the same module as the
call site.
So why LTO?
Because you see "all" the definitions for "all" the call sites in your
program.


@Hal, @Mehdi: yes, LTO gives more information compared to other modes of
compilation, but as per our current plan, a CallGraphSCC pass and a
MachineFunctionPass for register usage scanning will be sufficient, right?

-Vivek

Yes, this was exactly my point when I wrote “our LTO will benefit naturally from the leaf-to-root information propagation”, i.e. the paper you cite on doing something specific for LTO is irrelevant. You don’t need to do anything with LTO.


Oh, sorry, my bad: by "our LTO will benefit naturally from the
leaf-to-root information propagation" I understood that the information
propagation should be done using LTO.

*Vivek Pandya*

Yes there is also MachineRegisterInfo::UsedPhysRegMask which should be the
union of all regmasks in the function.

./lib/CodeGen/MIRParser/MIRParser.cpp:
RegInfo.setUsedPhysRegMask(CalleeSavedRegisterMask.flip());

Is this line responsible for setting the UsedPhysRegMask after codegen for
a function? And this will be changed for each function call, right?

-Vivek