Module-level assembly optimization

I would like to do constant pools over an entire module instead of just on a per-function basis, as the constant islands pass does now.

It seems there are two options for this:

1) Collect the machine functions, with their machine instructions, for the whole module, edit them as needed, and then
call AsmPrinter on them one at a time.

2) Collect all the instruction streams, store them in lists (one per section), and emit them all at the end.
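Option 2 could look roughly like the following minimal C++ sketch. It is not LLVM code: `ModuleOutputBuffer` is a hypothetical name, and output is modeled as text lines per section rather than real streamer calls. That is enough to show the buffer, edit, then emit structure.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical buffer for option 2: instead of printing each function as
// soon as it is compiled, append its output to a per-section list and
// emit everything once the whole module has been seen.
class ModuleOutputBuffer {
public:
  // Record one line of output destined for `section`.
  void append(const std::string &section, const std::string &line) {
    assert(!section.empty() && "every item must belong to a section");
    auto it = lines_.find(section);
    if (it == lines_.end()) {
      order_.push_back(section); // remember first-appearance order
      it = lines_.emplace(section, std::vector<std::string>()).first;
    }
    it->second.push_back(line);
  }

  // Mutable access so a module-level pass can edit a section before it is
  // emitted (for example, to merge constant pool entries).
  std::vector<std::string> &section(const std::string &name) {
    return lines_.at(name);
  }

  // Emit every section, in the order in which sections were first used.
  std::string flush() const {
    std::string out;
    for (const std::string &name : order_) {
      out += "\t.section " + name + "\n";
      for (const std::string &line : lines_.at(name))
        out += line + "\n";
    }
    return out;
  }

private:
  std::vector<std::string> order_;
  std::unordered_map<std::string, std::vector<std::string>> lines_;
};
```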

There are possibly other kinds of module level optimizations that could benefit from this.

Some of this might be better done by moving this functionality into target-independent code, but I think I could do it all
without such changes.

Any thoughts?

Reed

Two questions:

a) what do you think you'll gain over per-function,
b) how are you going to handle constants out of range on a module level?

-eric

You make some good points.

We have been discussing this further inside Mips/Imagination. One of the guys here has done a module-level
constant island scheme for another architecture.

We've discussed some further improvements to his scheme and have come up with something that can work now and be extended later, using
the model for constant islands that LLVM currently uses for ARM.

I will collect some module-level statistics on missed opportunities so that we can evaluate how much better things could have
been with module-level optimizations. One big win is that you often need to load the 32-bit address of something,
e.g. printf, sin, cos, and these addresses can be placed in a constant island. A module can have many references to
such symbols, and recreating them inline on a true RISC machine can be expensive. How much would we save? I have no statistics on this
for Mips, since we have never had a compiler with constant islands (gcc does something simple, with one pool at the end of the function).
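To make the size question concrete, here is a small hypothetical cost model comparing inline materialization, per-function pools, and a single module-level pool for one constant. The byte counts are parameters, not measured Mips or ARM figures, and the module-level case assumes every use is in range of one shared entry.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical size model. Every byte count is a parameter; none of them is
// a measured value for a particular Mips or ARM encoding.
struct CostModel {
  uint32_t inlineBytesPerUse; // rebuild the 32-bit value inline at each use
  uint32_t loadBytesPerUse;   // PC-relative load from a constant island
  uint32_t entryBytes;        // one island entry holding the value
};

// Code size for `uses` references to a single 32-bit constant, built inline.
uint32_t inlineCost(const CostModel &m, uint32_t uses) {
  return uses * m.inlineBytesPerUse;
}

// Per-function pools: one copy of the entry in each of the `functions`
// functions that reference the constant.
uint32_t perFunctionPoolCost(const CostModel &m, uint32_t uses,
                             uint32_t functions) {
  assert(functions <= uses && "each referencing function has a use");
  return uses * m.loadBytesPerUse + functions * m.entryBytes;
}

// Module-level pool, best case: every use can reach one shared entry.
uint32_t moduleLevelPoolCost(const CostModel &m, uint32_t uses) {
  return uses * m.loadBytesPerUse + m.entryBytes;
}
```

For instance, with an assumed 8-byte inline sequence and a 4-byte load and 4-byte entry, ten uses spread over five functions cost 80, 60 and 44 bytes under the three strategies.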

If we decide that there is something there, then the following steps would be something like:

1) When a function is compiled, remember where its constant islands were. Later functions can reference those places.
2) Delay emitting the constant island at the end of the function. When the next function appears, check whether it is small enough that
this earlier constant island could instead be placed at the end of the new function. This makes things work when you have lots of small
functions: you keep moving the pool further down the stream.
3) It's also possible to create a module pass that collects module-level statistics
telling you whether something in your constant pool will be used again later. You can then decide whether it would be beneficial to move a constant from one pool
to another.

Well... many ideas. The point is that we have a lot of room to grow into in order to do better pooling, while starting with the
basic per-function scheme.
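Step 2 amounts to a range check repeated as each new function arrives. The sketch below is one possible model under stated assumptions: loads reach forward only, up to a fixed `maxForwardReach`; any user may reference any entry; and entries that the following functions would add are ignored. `PendingIsland`, `fitsAt` and `functionsToSkip` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of step 2: an island whose emission is deferred so it
// can be pushed past the functions that follow it.
struct PendingIsland {
  std::vector<uint64_t> userOffsets; // module offsets of the loads using it
  uint64_t numEntries = 0;
  uint64_t entryBytes = 4;
};

// Assumes each load reaches forward at most `maxForwardReach` bytes and
// that any user may reference any entry (a conservative simplification).
bool fitsAt(const PendingIsland &island, uint64_t islandStart,
            uint64_t maxForwardReach) {
  assert(island.numEntries > 0 && !island.userOffsets.empty());
  uint64_t earliestUser = *std::min_element(island.userOffsets.begin(),
                                            island.userOffsets.end());
  assert(earliestUser < islandStart && "users precede the island");
  uint64_t lastEntry =
      islandStart + (island.numEntries - 1) * island.entryBytes;
  return lastEntry - earliestUser <= maxForwardReach;
}

// How many of the following functions can the island be moved past, if it
// would otherwise be emitted at `islandStart`? Ignores entries that those
// functions might add to the island themselves.
std::size_t functionsToSkip(const PendingIsland &island, uint64_t islandStart,
                            const std::vector<uint64_t> &nextFunctionSizes,
                            uint64_t maxForwardReach) {
  std::size_t skipped = 0;
  uint64_t start = islandStart;
  for (uint64_t size : nextFunctionSizes) {
    if (!fitsAt(island, start + size, maxForwardReach))
      break;
    start += size;
    ++skipped;
  }
  return skipped;
}
```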

So there is a lot of room to improve things, starting from the basic model that already exists in LLVM, and that is how
I intend to proceed.

Reed

My experience here consists of debugging an issue in the
ARM constant island pass once, so take my opinion with a grain of salt.

When looking at the pass, it is easy to notice how much it looks like
an assembler doing relaxation. It needs to know exact offsets, adding
data to an island changes offsets, etc. It is also easy to notice how
hard it is to test a pass like this, nested deep inside
codegen.

My thought at the time was that it would be nice to implement the pass
in the assembler itself by having codegen use pseudo instructions.
Something like

.const_island_set_reg r4, 0x12345678

The assembler would then be responsible for creating the islands and
producing a load (or using something like movw + movt if that was better
in a given case).

The advantages would be:
* The assembler already naturally handles exact addresses.
* It would be much easier to test.
* It would work across functions.
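One way an assembler might expand a directive like `.const_island_set_reg` is sketched below. The directive itself is only the hypothetical syntax proposed above, and the selection policy is an assumption for illustration. The single-instruction check follows the A32 "modified immediate" rule (an 8-bit value rotated right by an even amount); MVN and Thumb encodings are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Possible expansions of the hypothetical directive
//   .const_island_set_reg rN, value
enum class Expansion { MovImm, Movw, MovwMovt, IslandLoad };

static uint32_t rotl32(uint32_t v, unsigned r) {
  assert(r < 32);
  return r == 0 ? v : (v << r) | (v >> (32 - r));
}

// True if `v` is an A32 data-processing "modified immediate": an 8-bit
// value rotated right by an even amount, so one MOV can build it.
bool isArmModifiedImm(uint32_t v) {
  for (unsigned rot = 0; rot < 32; rot += 2)
    if (rotl32(v, rot) <= 0xFFu)
      return true;
  return false;
}

// Hypothetical selection policy. `hasMovwMovt` says whether the core has
// MOVW/MOVT; `entryInRange` says whether an island entry holding `value` is
// already reachable from this point.
Expansion chooseExpansion(uint32_t value, bool hasMovwMovt,
                          bool entryInRange) {
  if (isArmModifiedImm(value))
    return Expansion::MovImm;
  if (hasMovwMovt && value <= 0xFFFFu)
    return Expansion::Movw; // MOVW zeroes the top half
  // Reusing an existing entry costs one load and no new data; MOVW+MOVT
  // costs two instructions but needs no island at all.
  if (entryInRange || !hasMovwMovt)
    return Expansion::IslandLoad;
  return Expansion::MovwMovt;
}
```

Because the assembler knows exact addresses, `entryInRange` is something it can answer directly, which is part of the appeal of doing this there.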

Then again, it is entirely possible I missed something that mandates
this pass being in codegen.

Cheers,
Rafael

I agree with you 100%.

I think it should be an assembler function too, as should long branch optimization.
In fact, the earliest PDP-11 Unix compilers did long branch optimization in the assembler.

Moving this to the assembler would require some community buy-in.

To do this in the assembler without creating other issues, I think you would need to serialize one of the AsmStreamer interfaces and have a way
to edit the instruction stream parts.

Then you could play it back through AsmPrinter.
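The serialize, edit, and replay idea could be modeled as below. `Sink`, `RecordingSink`, `TextSink` and `replay` are hypothetical and far simpler than LLVM's actual streamer classes. The point is only that recorded calls can be edited as a plain list before being played back to the real output.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, much simplified stand-in for a streamer interface.
// This is NOT LLVM's MCStreamer API.
struct Sink {
  virtual ~Sink() = default;
  virtual void label(const std::string &name) = 0;
  virtual void instruction(const std::string &text) = 0;
  virtual void data32(uint32_t value) = 0;
};

// One serialized call.
struct Event {
  enum Kind { Label, Instruction, Data32 } kind;
  std::string text;
  uint32_t value;
};

// Records calls instead of emitting them, so a module-level pass can edit
// `events` before anything is written out.
struct RecordingSink : Sink {
  std::vector<Event> events;
  void label(const std::string &n) override {
    events.push_back({Event::Label, n, 0});
  }
  void instruction(const std::string &t) override {
    events.push_back({Event::Instruction, t, 0});
  }
  void data32(uint32_t v) override { events.push_back({Event::Data32, "", v}); }
};

// Plays a (possibly edited) event list back through any sink.
void replay(const std::vector<Event> &events, Sink &out) {
  for (const Event &e : events) {
    switch (e.kind) {
    case Event::Label: out.label(e.text); break;
    case Event::Instruction: out.instruction(e.text); break;
    case Event::Data32: out.data32(e.value); break;
    }
  }
}

// Text printer used as the final sink.
struct TextSink : Sink {
  std::string out;
  void label(const std::string &n) override { out += n + ":\n"; }
  void instruction(const std::string &t) override { out += "\t" + t + "\n"; }
  void data32(uint32_t v) override {
    out += "\t.word " + std::to_string(v) + "\n";
  }
};
```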

Reed

I thought about serializing AsmPrinter for this purpose, but that code would really belong in target-independent code, which would require a proposal to the community, and I don't know what kind of response there would be.

Certainly, it would be generally useful.

Probably for now, I will start with the LLVM ARM model for constant islands.

Another possibility would be to process all the machine code for the module in one pass. There are some advantages to having all the basic block information still intact. I think that Jim Grosbach said that this was needed for ARM.

Reed

This would be really great, and a powerful way to make it so that each RISC target didn’t have to have their own implementation of the same thing. Pulling this off well would require teaching the assembler about branches, including how to shorten them (like x86 does) which can be more complex for RISC targets. However, I think this would be really great infrastructure to have at the MC level in any case.
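The branch-shortening part is the classic relaxation loop. This hypothetical sketch starts every branch in its short form, lays the code out, lengthens any branch whose displacement is out of range, and repeats; because branches only grow, the loop terminates. Sizes and ranges are caller-supplied assumptions, and displacement is measured from the branch's own start for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fragment model: straight-line code chunks and branches.
struct Fragment {
  bool isBranch;
  uint32_t size;      // bytes of plain code (unused for branches)
  std::size_t target; // index of the target fragment (branches only)
  bool isLong;        // has this branch been relaxed to the long form?
};

Fragment code(uint32_t size) { return Fragment{false, size, 0, false}; }
Fragment branch(std::size_t target) { return Fragment{true, 0, target, false}; }

// Sizes and short-form reach; the numbers are supplied by the caller.
struct BranchForms {
  uint32_t shortBytes, longBytes;
  int64_t shortMin, shortMax; // displacement range of the short form
};

uint32_t sizeOf(const Fragment &f, const BranchForms &b) {
  if (!f.isBranch)
    return f.size;
  return f.isLong ? b.longBytes : b.shortBytes;
}

// Lay out, grow out-of-range short branches, repeat until nothing changes.
// Returns the number of layout passes performed.
int relax(std::vector<Fragment> &frags, const BranchForms &b) {
  int passes = 0;
  bool changed = true;
  while (changed) {
    changed = false;
    ++passes;
    std::vector<int64_t> addr(frags.size());
    int64_t pc = 0;
    for (std::size_t i = 0; i < frags.size(); ++i) {
      addr[i] = pc;
      pc += sizeOf(frags[i], b);
    }
    for (std::size_t i = 0; i < frags.size(); ++i) {
      Fragment &f = frags[i];
      if (!f.isBranch || f.isLong)
        continue;
      assert(f.target < frags.size());
      int64_t disp = addr[f.target] - addr[i];
      if (disp < b.shortMin || disp > b.shortMax) {
        f.isLong = true; // growing this branch may push others out of range
        changed = true;
      }
    }
  }
  return passes;
}
```

Lengthening one branch can push another out of range, which is why a single pass is not enough; constant islands have the same property, since adding an entry shifts every later offset.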

-Chris

I've been studying the ARM constant island code more closely because I'm getting ready to do the Mips version.

In fact, it is already fairly well parameterized. It is almost a base class and doesn't need to be ARM-specific.

I don't think it will be difficult at all to merge my version and the ARM one into a single class. If we do that, we could create the general class and make the Mips and ARM versions subclasses of this new base class.
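A shared base class could keep the range and layout logic common and leave only the encoding limits to each target. The sketch below is hypothetical, not the interface of the existing ARM pass, and the reach values in the subclasses are placeholders rather than figures taken from either ISA manual.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical shape of a shared constant-island base class.
class ConstantIslandsBase {
public:
  virtual ~ConstantIslandsBase() = default;

  // Shared logic: can a load at `userOffset` reach an entry at `entryOffset`?
  bool inRange(uint64_t userOffset, uint64_t entryOffset) const {
    if (entryOffset >= userOffset)
      return entryOffset - userOffset <= maxForwardReach();
    return allowsBackward() && userOffset - entryOffset <= maxBackwardReach();
  }

  // Shared logic: round an island start up to the target's alignment.
  uint64_t alignIsland(uint64_t offset) const {
    uint64_t a = islandAlignment();
    assert(a != 0 && (a & (a - 1)) == 0 && "alignment must be a power of two");
    return (offset + a - 1) & ~(a - 1);
  }

protected:
  // Target hooks.
  virtual uint64_t maxForwardReach() const = 0;
  virtual uint64_t maxBackwardReach() const { return 0; }
  virtual bool allowsBackward() const { return false; }
  virtual uint64_t islandAlignment() const { return 4; }
};

// Placeholder numbers, for illustration only.
class MipsIslands : public ConstantIslandsBase {
protected:
  uint64_t maxForwardReach() const override { return 1020; }
};

class ArmIslands : public ConstantIslandsBase {
protected:
  uint64_t maxForwardReach() const override { return 4092; }
  uint64_t maxBackwardReach() const override { return 4092; }
  bool allowsBackward() const override { return true; }
};
```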

The real issue is the fact that "sometimes" you really want to be able to process a linked list of machine functions.

I'm not really sure there is any benefit to doing this kind of assembly work at the MC layer. You just have less information at that point. In a machine function you already have the machine instructions, so you essentially know what is there.

This higher layer has more information, which could be necessary or at least helpful. For example, sometimes you still need to use the register scavenger to do this kind of work, as we have to for mips32 long branches.

So I think it would be useful to have a new module pass that is given a list of machine functions and perhaps even calls AsmPrinter on them.
Function passes could still be used, but we would need to guarantee the order of the functions, because we are calculating module-level address offsets.
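The ordering requirement shows up in even a tiny sketch: module-level offsets are a running sum over functions in emission order (with alignment), so reordering functions changes every later offset. `FunctionLayout` and `assignStartOffsets` are hypothetical names, and function sizes are assumed to already include their islands.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-function layout record.
struct FunctionLayout {
  std::string name;
  uint64_t size;      // bytes, including the function's own islands
  uint64_t alignment; // power of two
};

// Assign each function a module-level start offset, in emission order.
std::vector<uint64_t> assignStartOffsets(const std::vector<FunctionLayout> &fns) {
  std::vector<uint64_t> starts;
  starts.reserve(fns.size());
  uint64_t pc = 0;
  for (const FunctionLayout &f : fns) {
    assert(f.alignment != 0 && (f.alignment & (f.alignment - 1)) == 0);
    pc = (pc + f.alignment - 1) & ~(f.alignment - 1); // align the start
    starts.push_back(pc);
    pc += f.size;
  }
  return starts;
}
```

If function passes ran in an unspecified order, offsets computed this way would not match the final layout, which is why the order has to be guaranteed.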

Reed