[GSoC 2016] Code Generation Improvements task

Hello LLVM Community,

I am interested doing following project with LLVM for GSoC 2016.

Code Generation Improvements:

Particularly Generalize target-specific backend passes that could be target-independent

I have done some initial study and try to understand the task to be done. Please help me to develop the proposal.

Following are my initial findings :

  1. lib/Target/Hexagon/RDF* :
    Code for these pass is mostly target independent so to generalize them code needs to be wrap in MachineFunction pass and then use it as required.
    And if already not done , Merge Set of SSA based CFG can be computed at time of SSA generation. This can improve performance of Ramakrishna’s algorithm.

  2. lib/Target/AArch64/AArch64AddressTypePromotion.cpp
    As far as I understand this pass promotes sign exertion for 32 bit integer ( address) and performs calculation on 64 bit number thus processes need not switch execution mode to 32 bit.
    Some other platforms such as MIPS, NVPTX, Sparc can be benefited by such optimization because MIPS64 supports MIPS32 bit instruction and it requires mode switch indicated by control register. For PTX size of pointers depends on the host machine so there might be similar situation.
    But architectures like Power PC need not such optimization as 64 bit instruction operating in 32 bit mode passes only lower 32 bits.

  3. lib/Target/AArch64/AArch64PromoteConstant.cpp
    This pass tries to simplify aggregate data like struct of const with special SIMD instructions available on the system. For example on ARM its NEON similarly other architectures have SIMD support specifically MIPS, IBM System Z, Power PC with MMX/AltiVee and x86 with Intel’s AVX.

Apart from these , The proposal can include task for merging the delay slot filling logic ( from Sparc and Mips ) into single target independent pass.

These is just a primary investigation. I am not expert with all architectures supposed by LLVM but MIPS, x86 and to some extent ARM.

I have question regarding Target hooks. Does it means using TargetInfo an SubTargetInfo class and at runtime decide architecture type and based on that perform optimization ( i.e use target specific instructions ) ?

Please help me ! Am I going in right direction ? Suggest some code , document to look for further ideas. Also if any one like to mentor me for this project.

Sincerely,
Vivek Pandya

Hi Vivek,

(Mostly responding with AArch64 hints, though anything I happen to
know from elsewhere too).

2. lib/Target/AArch64/AArch64AddressTypePromotion.cpp
As far as I understand this pass promotes sign exertion for 32 bit integer (
address) and performs calculation on 64 bit number thus processes need not
switch execution mode to 32 bit.

Switching execution mode isn't an option on AArch64 (it can only
happen with OS support and never happens within a single process on a
sane OS).

This pass is more a matter of putting the IR in a form that precisely
matches the addressing modes that are actually available. AArch64 can
encode addresses like "base64 + sext(offset32)" into the actual
load/store instruction so it's advantageous to put the sext as close
as possible to the pointer dereference.

I'm afraid I don't really know enough about other architectures to say
which could benefit. It's obviously only beneficial if they have the
addressing modes to support it.

3. lib/Target/AArch64/AArch64PromoteConstant.cpp
This pass tries to simplify aggregate data like struct of const with special
SIMD instructions available on the system. For example on ARM its NEON
similarly other architectures have SIMD support specifically MIPS, IBM
System Z, Power PC with MMX/AltiVee and x86 with Intel’s AVX.

Possibly. It seems to rely pretty strongly on ARM's "load more than
you can actually use" instructions: vldN instructions can load up to 4
128-bit vectors, but they can still only be used as 128-bit vectors.
If other targets possess similar, then they could well benefit; if
not, then it's probably pointless.

I have question regarding Target hooks. Does it means using TargetInfo an
SubTargetInfo class and at runtime decide architecture type and based on
that perform optimization ( i.e use target specific instructions ) ?

I think they more normally live in TargetTransformInfo.

Please help me ! Am I going in right direction ? Suggest some code ,
document to look for further ideas. Also if any one like to mentor me for
this project.

It sounds like a plausible direction, but documentation is always
lacking in these kinds of things.

As a complete outsider to targets with delay slots, merging their
logic sounds like a nice improvement to me (especially as Lanai is
probably incoming as another ISA that has decided delay slots are a
good idea). But (also as an outsider) I have no idea how practical
that really is.

Cheers.

Tim.

*Vivek Pandya*

Hi Vivek,

(Mostly responding with AArch64 hints, though anything I happen to
know from elsewhere too).

> 2. lib/Target/AArch64/AArch64AddressTypePromotion.cpp
> As far as I understand this pass promotes sign exertion for 32 bit
integer (
> address) and performs calculation on 64 bit number thus processes need
not
> switch execution mode to 32 bit.

Switching execution mode isn't an option on AArch64 (it can only
happen with OS support and never happens within a single process on a
sane OS).

This pass is more a matter of putting the IR in a form that precisely
matches the addressing modes that are actually available. AArch64 can
encode addresses like "base64 + sext(offset32)" into the actual
load/store instruction so it's advantageous to put the sext as close
as possible to the pointer dereference.

I'm afraid I don't really know enough about other architectures to say
which could benefit. It's obviously only beneficial if they have the
addressing modes to support it.

> 3. lib/Target/AArch64/AArch64PromoteConstant.cpp
> This pass tries to simplify aggregate data like struct of const with
special
> SIMD instructions available on the system. For example on ARM its NEON
> similarly other architectures have SIMD support specifically MIPS, IBM
> System Z, Power PC with MMX/AltiVee and x86 with Intel’s AVX.

Possibly. It seems to rely pretty strongly on ARM's "load more than
you can actually use" instructions: vldN instructions can load up to 4
128-bit vectors, but they can still only be used as 128-bit vectors.
If other targets possess similar, then they could well benefit; if
not, then it's probably pointless.

> I have question regarding Target hooks. Does it means using TargetInfo an
> SubTargetInfo class and at runtime decide architecture type and based on
> that perform optimization ( i.e use target specific instructions ) ?

I think they more normally live in TargetTransformInfo.

> Please help me ! Am I going in right direction ? Suggest some code ,
> document to look for further ideas. Also if any one like to mentor me for
> this project.

It sounds like a plausible direction, but documentation is always
lacking in these kinds of things.

As a complete outsider to targets with delay slots, merging their
logic sounds like a nice improvement to me (especially as Lanai is
probably incoming as another ISA that has decided delay slots are a
good idea). But (also as an outsider) I have no idea how practical
that really is.

Thanks Tim for providing more insights, I would gather more information in
given direction. Further more here mentioned 3 tasks may be not a much work
for some one who has a good grasp on llvm but for me it may be sufficient
for GSoC duration. It may not be possible for Google to provide fundings
for limited number of improvements. So I am thinking to include some TODOs
in StackColoring.cpp and StackSlotColoring.cpp in proposal too. Will it be
enough to demonstrate in proposal ?

Still I am looking for feedback on RDF part and also if some one is willing
to mentor me.

Sincerely,
Vivek

*Vivek Pandya*

*Vivek Pandya*

Hi Vivek,

(Mostly responding with AArch64 hints, though anything I happen to
know from elsewhere too).

> 2. lib/Target/AArch64/AArch64AddressTypePromotion.cpp
> As far as I understand this pass promotes sign exertion for 32 bit
integer (
> address) and performs calculation on 64 bit number thus processes need
not
> switch execution mode to 32 bit.

Switching execution mode isn't an option on AArch64 (it can only
happen with OS support and never happens within a single process on a
sane OS).

This pass is more a matter of putting the IR in a form that precisely
matches the addressing modes that are actually available. AArch64 can
encode addresses like "base64 + sext(offset32)" into the actual
load/store instruction so it's advantageous to put the sext as close
as possible to the pointer dereference.

I'm afraid I don't really know enough about other architectures to say
which could benefit. It's obviously only beneficial if they have the
addressing modes to support it.

I did some research for other architectures which has similar addressing

mode:
MIPS :
Base Addressing : sext(Immediate 16 bit value) + 32 / 64 bit register value
PC Relative : PC value + sext(Immediate 16 bit value << 2 bit)

ARM : ARM has immediate addressing but offset is used as singned or zero
extedned is to be determined.

Power PC: Register Indirect with Immediate Index Addressing for Integer
Loads and Store

It seems that most of the architecture which supports immediate offset are
required to use sext to preserve sign before adding them to register value.

So this pass seems to be useful for other architecture.

- Vivek

Hi Vivek,
Sorry, I missed this email. I wrote the RDF stuff and I'd be happy to help you out with it if you are interested.

The idea was to have a utility class that would represent the data flow between registers. The registers could be a mixture of virtual and physical, although the main application would be to use it on a post-RA code. I decided against having it as a part of the pass manager, because the user does not have any direct control over the creation and invalidation of analyses, at least in the current version of the pass manager. This does not mean that it cannot (or shouldn't) be used in an analysis, just that it should also be available as a standalone utility.

The missing bits are:

1. Handling of regmasks
This shouldn't be too hard. All reference nodes (except those in phi nodes) have a pointer to the machine operand, from which the actual register is obtained. Regmasks are different, since a single operand references multiple registers at once. The way to handle them would be to treat a regmask as a register of its own that is aliased with the registers, whose clobbering it represents.

2. Recomputing liveness information on instruction level.
The MI-level IR uses implicit operands to keep track of the liveness of aliased registers. These implicit operands serve no other purpose, but they may introduce apparent dependencies (that do not, in fact exist). RDF will ignore these implicit operands when constructing the DFG, and optimizations using RDF could produce code where the liveness information carried by these operands is no longer valid (the same goes for <kill> flags). This information would need to be recomputed. There is some code in there that does that for the <kill> flags, but it does not deal with the implicit operands at all.

3. Making it work with ther targets.
RDF is intended to handle code that contains both physical and virtual registers on any target, but it has only been tested (in some capacity) on post-RA code and only on Hexagon. Making it fully target-independent would involve testing it with other targets.
- There are "copy propagation" and "dead code elimination" passes that use RDF. Both are also meant to be target-independent and could serve as a testing tool.
- RDF liveness would need to be verified to work on other targets. It is meant to recalculate block live-ins.

4. It is unknown what RDF will do with bundles.
In theory, it should use the summary information from each bundle (without looking inside of bundles), but I have no idea whether there are any cases that would break it. There is nothing to represent the data flow within a bundle: besides not having any representation for it now, the actual data flow there may be highly target-dependent. This is more of a hypothetical question, at least for now, since it may be fairly complex to design and implement.

-Krzysztof

*Vivek Pandya*

Still I am looking for feedback on RDF part and also if some one is
willing to mentor me.

Hello Krzysztof Parrzyszek,

I switched to other topic as I felt I don't have enough experience in
various backends. I also not tried to ask for guidance after I observed
long gap. So as far as GSoC is concerned it is not possible to make a solid
proposal with in a day. I need to read the papers related to RDF stuffs in
LLVM and also write up the whole initial search I did. But still I will
keep this mail in my mind and I will notify you if I work some thing on
this topic during my free time.

Hi Vivek,

Sorry, I missed this email. I wrote the RDF stuff and I'd be happy to
help you out with it if you are interested.

The idea was to have a utility class that would represent the data flow
between registers. The registers could be a mixture of virtual and
physical, although the main application would be to use it on a post-RA
code. I decided against having it as a part of the pass manager, because
the user does not have any direct control over the creation and
invalidation of analyses, at least in the current version of the pass
manager. This does not mean that it cannot (or shouldn't) be used in an
analysis, just that it should also be available as a standalone utility.

The missing bits are:

1. Handling of regmasks
This shouldn't be too hard. All reference nodes (except those in phi
nodes) have a pointer to the machine operand, from which the actual
register is obtained. Regmasks are different, since a single operand
references multiple registers at once. The way to handle them would be to
treat a regmask as a register of its own that is aliased with the
registers, whose clobbering it represents.

2. Recomputing liveness information on instruction level.
The MI-level IR uses implicit operands to keep track of the liveness of
aliased registers. These implicit operands serve no other purpose, but they
may introduce apparent dependencies (that do not, in fact exist). RDF will
ignore these implicit operands when constructing the DFG, and optimizations
using RDF could produce code where the liveness information carried by
these operands is no longer valid (the same goes for <kill> flags). This
information would need to be recomputed. There is some code in there that
does that for the <kill> flags, but it does not deal with the implicit
operands at all.

The only thing I understand here is 3 point because some thing similar I

have thought.

3. Making it work with ther targets.
RDF is intended to handle code that contains both physical and virtual
registers on any target, but it has only been tested (in some capacity) on
post-RA code and only on Hexagon. Making it fully target-independent would
involve testing it with other targets.
- There are "copy propagation" and "dead code elimination" passes that use
RDF. Both are also meant to be target-independent and could serve as a
testing tool.
- RDF liveness would need to be verified to work on other targets. It is
meant to recalculate block live-ins.

4. It is unknown what RDF will do with bundles.
In theory, it should use the summary information from each bundle (without
looking inside of bundles), but I have no idea whether there are any cases
that would break it. There is nothing to represent the data flow within a
bundle: besides not having any representation for it now, the actual data
flow there may be highly target-dependent. This is more of a hypothetical
question, at least for now, since it may be fairly complex to design and
implement.

Thanks for reply !

Sincerely,
Vivek

That's understandable. Thank you for your interest!

-Krzysztof