allocating registers less "sparingly"

Hello LLVM people,

Our customizable TTA target [1] is capable of having plenty of registers
and register file ports to improve instruction level parallelism and
reduce spills. It's totally up to the designer of the particular TTA
processor how much the processor has registers and register file resources
along with other TTA components.

We have ported LLVM 2.1 to produce an intermediate TTA program format with
registers allocated (it adapts to the architecture's register files) but without
operations bound to target's function units, nor scheduled to instructions.
The operation binding and the final scheduling is done as a post pass in
our TTA-framework called TCE [2] (not yet released to public).

The problem is that the register allocators in LLVM seem to reuse registers
too much, that is, allocate them "too sparingly". This leads to false
dependencies which unnecessarily restrict scheduling freedom in our post pass
instruction scheduler. That is, even if the target has 64 registers, LLVM
might only use, say 13, and have multiple writes to the same register in
the same basic block. Is there some easy way to tune this behavior so it
reuses registers only if it really has to (when it runs out of new ones)?

I can understand that for many architectures without static scheduling etc.
the "allocate registers sparingly"-approach is the best one as it might lead
to less context data to be saved in calls, etc., but in our case the calls are
not usually the bottleneck as our target programs are at the moment mainly DSP
kernels and multimedia codecs which are usually loop-oriented. In addition,
LLVM seems to do an excellent job in inlining function calls.

I look forward to your replies.

The links:

[1] http://en.wikipedia.org/wiki/Transport_triggered_architecture
[2] http://tce.cs.tut.fi/papers/TCE_Overview.pdf

Best regards,

Hello LLVM people,

Our customizable TTA target [1] is capable of having plenty of registers
and register file ports to improve instruction level parallelism and
reduce spills. It's totally up to the designer of the particular TTA
processor how much the processor has registers and register file resources
along with other TTA components.

We have ported LLVM 2.1 to produce an intermediate TTA program format with
registers allocated (it adapts to the architecture's register files) but without
operations bound to target's function units, nor scheduled to instructions.
The operation binding and the final scheduling is done as a post pass in
our TTA-framework called TCE [2] (not yet released to public).

The problem is that the register allocators in LLVM seem to reuse registers
too much, that is, allocate them "too sparingly". This leads to false
dependencies which unnecessarily restrict scheduling freedom in our post pass
instruction scheduler. That is, even if the target has 64 registers, LLVM
might only use, say 13, and have multiple writes to the same register in
the same basic block. Is there some easy way to tune this behavior so it
reuses registers only if it really has to (when it runs out of new ones)?

What you want is to remove anti-dependency before post-allocation scheduling. Alas, there is nothing in current LLVM implementation that deals with this. There are quite a few papers on this topic, e.g. http://citeseer.ist.psu.edu/calland97removal.html

I am also very interested in this. Do you plan to contribute your work back to the community? :slight_smile:

Evan

Evan Cheng wrote:

What you want is to remove anti-dependency before post-allocation scheduling. Alas, there is nothing in current LLVM implementation that deals with this. There are quite a few papers on this topic, e.g. http://citeseer.ist.psu.edu/calland97removal.html

Thank you for the pointer, I'll take a look when I have the time again.

I was kind of hoping we could have avoided the unnecessary anti-deps
the first place, for example, by means of a more sane distribution of
registers in the register allocator, but I guess this kind of pass
that removes them after allocation would be fine too.

I am also very interested in this. Do you plan to contribute your work back to the community? :slight_smile:

I certainly hope we could produce something we can contribute, I
cannot give any promises at this point though :slight_smile:

Evan Cheng wrote
> What you want is to remove anti-dependency before post-allocation
> scheduling. Alas, there is nothing in current LLVM implementation
> that deals with this. There are quite a few papers on this topic,
> e.g. http://citeseer.ist.psu.edu/calland97removal.html

Much of the work (like the one you cited) deals with arrays,
and the dependences are "real" in the sense that they exist
in the original program. Dealing with antidependences introduced
by register allocation is different in that the variables involved
are scalars. The dependences may also be an artefact of the choices
made by the register allocator, that is, the dependences may not
be present in the original program.

What this means is that the problem should hopefully be simpler
as only scalars are involved. But I guess it might also mean
reverse-engineering some information that was lost when the
register allocator decided to place distinct variables in the same
physical register.

I’m only passingly familiar with the register allocator in LLVM, but I’d think that you could tweak it’s allocation policy to allocate out of, say, sequential registers and only reuse registers with dead values once it runs out of completely empty ones. Is this what you want?

Exactly. I was hoping there was some generic framework to tweak the
"allocation policy" so it would apply for all register allocators, but
I suppose there is none at the moment.

In addition, for our case, it could help to "flush" all live values to memory
before entering a long-running loop (which does not use the flushed
values) to free more registers for it. This would be useful also for
software pipelining which increases register pressure.

I'll see what I can do. I hopefully get more time to work on this after
the new year.

Greets,

I’m only passingly familiar with the register allocator in LLVM, but I’d think that you could tweak it’s allocation policy to allocate out of, say, sequential registers and only reuse registers with dead values once it runs out of completely empty ones. Is this what you want?

That will help. But it’s not really a clean solution. A heuristic based solution is never going to achieve the kind of balanced usage that’s desired.

On the other hand, repairing anti-dependencies is a well understood problem and if it’s implemented will bring a lot more benefits (e.g. post allocation scheduling).

Evan