[RFC] Interprocedural MIR-level outlining pass

Hi everyone,

Since I haven't said anything on the mailing list before, a quick
introduction. I'm an intern at Apple, and over the summer I implemented a
prototype of an outlining pass for code size in LLVM, which I'm now seeking
to eventually upstream. The preliminary code is on GitHub right now, but
it's still very much a prototype (see the code section).

Hi Jessica,

This is quite interesting.

Can you comment on why you started by doing this at the MI level, as opposed to the IR level? And at the MI level, why after RA instead of before RA?

Perhaps the first question is another way of asking about how accurately we can model call-site code size at the IR level?

Thanks in advance,
Hal

I think the “Motivation” section explained that. I too first wondered “why not at IR?”, but the reason seems to be that post-RA MIR has the most accurate heuristics (the best way to know the cost is to actually get there).

Do you know of any experimental pass that derives its heuristics via a feedback loop, i.e. by letting a duplicate module/function/block continue further through the pipeline?

Regards,

Kevin

From: "Kevin Choi" <code.kchoi@gmail.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Friday, August 26, 2016 4:55:29 PM
Subject: Re: [llvm-dev] [RFC] Interprocedural MIR-level outlining
pass

I think the "Motivation" section explained that.

I don't think it explained it.

I too first wondered "why not at IR?", but the reason seems to be that
post-RA MIR has the most accurate heuristics (the best way to know the
cost is to actually get there).

But also, potentially, the fewest opportunities. That's why I'm curious about the motivation - the trade-offs are not obvious to me.

-Hal

Hi,

I let Jessica give more details but here are some insights.

MIR offers a straightforward way to model the benefit, because we know exactly which instructions we remove and which ones we add, and there is no overhead for setting up parameters. Indeed, since the coloring will be the same between the different outlining candidates, the call is just a jump somewhere else. We do not have to worry about the impact of parameter passing and the ABI.

So basically, better cost model. That’s one part of the story.
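
To make that cost model concrete with purely made-up numbers: a 10-instruction sequence occurring at 5 sites costs 50 instructions if left inline. Outlined, it costs roughly 10 instructions for the body, 1 for the return, and 5 one-instruction calls at the sites, about 16 instructions in total, a saving of 34. Post-RA, that arithmetic is essentially exact; at the IR level, the parameter setup and ABI overhead at each site would have to be guessed.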

The other part is that at the LLVM IR level, or before register allocation, identifying similar code sequences is much harder, at least with a suffix-tree-like algorithm. Basically, the problem is how we name our instructions such that we can match them.
Let me take an example.
foo() {
  /* bunch of code */
  a = add b, c;
  d = add e, f;
}

bar() {
  d = add e, g;
  f = add c, w;
}

With proper renaming, we can outline both adds into one function. The difficulty is recognizing that they are semantically equivalent so that we can give them the same identifier in the suffix tree. I won’t get into the details, but it gets tricky quickly. We were thinking of reusing GVN to obtain such identifiers if we wanted to do the outlining at the IR level, but solving this problem is hard.

By running after regalloc, we basically have a heuristic that does this naming for us.
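
To sketch what that buys us (illustrative code only, not the actual prototype): after RA an instruction is essentially an opcode plus physical registers, so a trivial map assigns every instruction a symbol, and repeated code sequences become repeated substrings that a suffix tree can find.

#include <map>
#include <tuple>
#include <vector>

struct Instr {
  unsigned Opcode;
  std::vector<unsigned> Regs; // physical registers, fixed after RA
  bool operator<(const Instr &O) const {
    return std::tie(Opcode, Regs) < std::tie(O.Opcode, O.Regs);
  }
};

// Equal (opcode, registers) pairs get equal symbols, so identical
// post-RA instruction sequences become identical substrings.
std::vector<unsigned> nameInstructions(const std::vector<Instr> &Prog) {
  std::map<Instr, unsigned> Names;
  std::vector<unsigned> Str;
  for (const Instr &I : Prog)
    Str.push_back(Names.emplace(I, Names.size()).first->second);
  return Str; // feed this "string" to a (generalized) suffix tree
}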

Cheers,
-Quentin

FWIW: I’m with Quentin. Without a good value numbering analysis, this is a hard problem.

GVN as it exists now doesn’t really provide what you want, and in fact doesn’t value-number a lot of instruction types (including, for example, loads, stores, calls, phis, etc.). It also relies on ordering and, more importantly, on not being a strong standalone analysis: if you were to stop it from eliminating as it went, it would catch maybe 10% of the redundancies it does now. So for the question you actually want answered, “what globally is the same?”, it doesn’t really give a good answer.

Doing the analysis Quentin wants is pretty trivial in NewGVN (analysis and elimination are split, everything is broken into congruence classes explicitly, and each congruence class has an id; you could sub in that id as the symbol in the string fed to the suffix tree), but I’d agree that GVN as it exists now will not do what they want, and would be pretty hard to make work well.

(FWIW: Davide Italiano has started breaking NewGVN up into submittable pieces, and is pretty far along towards having a first patch set, I believe.)

From: "Daniel Berlin" <dberlin@dberlin.org>
To: "Quentin Colombet" <qcolombet@apple.com>
Cc: "Hal Finkel" <hfinkel@anl.gov>, "llvm-dev"
<llvm-dev@lists.llvm.org>
Sent: Friday, August 26, 2016 11:06:56 PM
Subject: Re: [llvm-dev] [RFC] Interprocedural MIR-level outlining
pass

FWIW: I'm with Quentin. Without a good value numbering analysis, this
is a hard problem.

How exactly does value numbering help here? This problem seems to be about finding structurally-similar parts of the computations of different values, not identical values. It seems much closer to the analysis necessary for SLP vectorization than value numbering, as such.

I think we might be able to use value numbering to help with a subset of SLP vectorization, but only because we can define some kind of equivalence relation on consecutive memory accesses (and similar). What you need for this outlining seems more akin to the general SLP-vectorization case. This might just be the maximal common subgraph problem.

-Hal

------------------------------

From: "Daniel Berlin" <dberlin@dberlin.org>
To: "Quentin Colombet" <qcolombet@apple.com>
Cc: "Hal Finkel" <hfinkel@anl.gov>, "llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Friday, August 26, 2016 11:06:56 PM
Subject: Re: [llvm-dev] [RFC] Interprocedural MIR-level outlining pass

FWIW: I'm with Quentin. Without a good value numbering analysis, this is a
hard problem.

How exactly does value numbering help here? This problem seems to be about
finding structurally-similar parts of the computations of different values,
not identical values.

Note, first, that the optimization we are talking about is not outlining, at
least as far as I've ever heard the term used.

function outlining/etc is about removing cold regions of functions to
separate functions to make the hot portions smaller, more inlinable, etc.

What is being talked about here is a variant of identical code folding,
e.g., http://research.google.com/pubs/pub36912.html and its variants.

Where "identical" varies (and is based on semantic equivalence). Variants
where you partially extract portions to merge functions are pretty much the
same optimization :)

Here, value numbering helps because it gives you value expressions.

It tells you:

This operation is "add V1, V2".
It does not matter what V1 and V2 are; they are just abstract things, and
we know whether these abstract things are equivalent or not.

More relevantly, these expressions form a value number dag.

That is, they represent the computations being computed only in terms of
operators, abstract variables, and other parts of the VNDAG.

The *actual values* of those computations, numbers, etc., do not matter
(i.e., each abstract value number has some set of things *in the program*
that are equivalent to it).

It is "are these semantically the same operations in the same order",
regardless of the value of those operations.
See, for example, the DAGS generated by:

https://www.researchgate.net/publication/221323142_An_Efficient_SSA-Based_Algorithm_for_Complete_Global_Value_Numbering

Again, this is not something the current GVN does, but it is something
NewGVN (and other GVNs) can give you.

The question of what portions of a function you can merge is a variant of
the common subgraph and graph isomorphism problems on a VNDAG.

(because in the end, it doesn't matter where the computations are, it
matters what they abstractly compute, and what they depend on)
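
To illustrate with a sketch (mine, not NewGVN's actual data structures): if you hash-cons expressions made of an operator plus operand ids, structurally identical computations in different functions intern to the same id, which is exactly the symbol you would feed to the matching step.

#include <map>
#include <tuple>
#include <vector>

// A value expression: an operator applied to the ids of other
// expressions (or of abstract leaves such as arguments and loads).
struct ValueExpr {
  unsigned Opcode;
  std::vector<unsigned> Operands; // ids of other ValueExprs
  bool operator<(const ValueExpr &O) const {
    return std::tie(Opcode, Operands) < std::tie(O.Opcode, O.Operands);
  }
};

// Structurally identical expressions receive the same id regardless of
// which function they come from: "add V1, V2" in foo and the same
// abstract expression in bar intern to one id.
unsigned internExpr(std::map<ValueExpr, unsigned> &Table,
                    const ValueExpr &E) {
  return Table.emplace(E, Table.size()).first->second;
}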

It seems much closer to the analysis necessary for SLP vectorization than
value numbering, as such.

I think we might be able to use value numbering to help with a subset of
SLP vectorization, but only because we can define some kind of equivalence
relation on consecutive memory accesses (and similar). What you need for
this outlining seems more akin to the general SLP-vectorization case. This
might just be the maximal common subgraph problem.

If I have an add in a loop, and it's VN-equivalent to an add outside a loop
in some other function, I can still merge the functions :)

It's how the value is generated and used that matters, not the structure of
the program.

So you can't just do it on a regular CFG or anything like that if you want
good results.

Thanks for a well-written and informative writeup. This was a pleasure to read and laid out the issues well for consideration.

On a technical level, I could see this having a secondary use for developers of the compiler itself. Seeing the same pattern of assembly many times across a large binary is often a hint of a missed optimization. Using this type of technique to generate candidates for further investigation seems interesting.

Philip

Minor terminology point… The mechanism used here is definitely function outlining. The policy applied to select outlining candidates is a variant of code folding techniques. Many times when we refer to “an optimization” we tend to mix policy and mechanism into the same thing. I’m highlighting the difference only because I found Danny’s response slightly confusing on first read, and thought others might as well. Back to your regularly scheduled technical discussion…

The technique does remind me of block tail merging, which makes me want to say:

You need to erase the debug-location of any instructions outlined in this way. Because they are chosen for outlining by virtue of being duplicated in different parts of the code, their source origin is ambiguous and therefore the debug info cannot represent it correctly.

These code fragments are not “functions” in the normal sense (they have totally idiosyncratic calling conventions, for one thing) so it’s probably best to pretend they don’t exist, from a debug-info perspective.

I know debug info is usually the last thing on the mind of somebody implementing a new optimization, but that’s one reason why the optimized-code debugging experience is generally terrible.
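
For illustration, erasing the locations could look something like this (a minimal sketch against the MachineInstr API; stripDebugLocations and OutlinedMBB are hypothetical names):

#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/IR/DebugLoc.h"

// Drop the source location of every instruction placed into the
// outlined function; by construction, its origin is ambiguous.
static void stripDebugLocations(llvm::MachineBasicBlock &OutlinedMBB) {
  for (llvm::MachineInstr &MI : OutlinedMBB)
    MI.setDebugLoc(llvm::DebugLoc()); // empty DebugLoc = "no location"
}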

Thanks for listening,

–paulr

That’s actually something that’s been brought up a few times! I’m thinking of ways to use the suffix-tree-based approach as a tool for that, and maybe for finding areas where code could be refactored. For example, people could find sequences of instructions which appear suspiciously often and deduce that they are copied-and-pasted pieces of code.

It may also be interesting to investigate other string algorithms to see if they have similar applications. Techniques used in bioinformatics are interesting to me in particular, since they work on finding structural similarities across gene sequences. It’s not exactly the same thing, but it might be possible to repurpose their techniques to find certain similarities across codebases.

- Jessica

Hi Jessica,

Thanks for this - echoing Philip, it was an excellent writeup. I’m currently looking at code size optimizations for microcontroller code myself and so am very interested!

You mentioned several links for different parts of the algorithm - do you have a link to a patch I could apply and test out?

I’m also working on basic block tail merging / instruction sinking in SimplifyCFG, and have been thinking over the last few days about using a similar data structure to provide longest-common-substring queries for sequences of instructions - having something similar in ADT would be very useful.
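
To pin down the query I mean, here is a minimal sketch (mine, not a proposed ADT interface): the naive O(n*m) DP over pre-hashed instruction sequences; a generalized suffix tree would answer the same query in linear time.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Length of the longest common substring of two instruction
// sequences, with each instruction pre-hashed to an integer.
static size_t longestCommonRun(const std::vector<uint64_t> &A,
                               const std::vector<uint64_t> &B) {
  std::vector<size_t> Prev(B.size() + 1, 0), Cur(B.size() + 1, 0);
  size_t Best = 0;
  for (size_t i = 1; i <= A.size(); ++i) {
    for (size_t j = 1; j <= B.size(); ++j) {
      // Cur[j] = length of the common run ending at A[i-1], B[j-1].
      Cur[j] = (A[i - 1] == B[j - 1]) ? Prev[j - 1] + 1 : 0;
      Best = std::max(Best, Cur[j]);
    }
    std::swap(Prev, Cur);
  }
  return Best;
}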

Cheers,

James

Hi Jessica,

Let me echo others’ comments – this is a very interesting research project and a nice write-up, indeed!

In order to turn it into a practically useful optimization pass, though, the “results” section should be strengthened a bit:

  • You simply say “tests >= 4 Kb”, without specifying the nature of these tests at all. Please say exactly what they are, along with the exact compiler options you used. If the tests are artificial, they might be good for debugging, but worthless as a measure of the usefulness of your pass.

  • You use “a fairly recent 64-bit Intel processor” to test an embedded-targeting optimization, which is odd. If you want to target x86 (good choice! :-)), perhaps a Quark chip might be a better choice? You can measure the code size impact without using real hardware – just add the “-triple i586-intel-elfiamcu” option.

Yours,

Andrey

And one more thing: have you measured the impact of your pass with the MergeFunctions pass enabled? AFAIK, it’s disabled by default (though I might be mistaken).

Yours,

Andrey

Daniel,

I wonder what NewGVN would generate for the following C code:

int a, b;

int foo() {
  return a + b;
}

int bar() {
  return a + b;
}

?

Obviously, the expressions would be the same (“value1 + value2”), but a single operator is not worth outlining. What congruence classes would be assigned to the operands? The same for both reads of “a” and “b”? If yes, that would be incorrect: these two additions are obviously not guaranteed to be semantically equivalent (a’s and b’s values can be changed in between).

Also note that one should be able to compare VN results from different functions. It looks like an interprocedural VN is required.

Yours,

Andrey

Daniel,

I wonder what NewGVN would generate for the following C code:

int a, b;

int foo() {
  return a + b;
}

int bar() {
  return a + b;
}

?

Obviously, the expressions would be the same ("value1 + value2"), but a
single operator is not worth outlining.

Sure.

What congruence classes would be assigned to the operands?

GVN is not interprocedural, so comparing GVN congruence results (as
opposed to functions) between functions is undefined.

The same for both reads of "a" and "b"? If yes, that would be incorrect:
these two additions are obviously not guaranteed to be semantically
equivalent (a's and b's values can be changed in between).

Also note that one should be able to compare VN results from different
functions.

I'm not sure why you think you need to do this. You need to be able to
compare the abstract VNDAGs, and you can.
The values of those abstract variables are just parameter bindings you pass
to the outlined function.

You need to be able to compare congruency if you are trying to *eliminate*
usages because they already exist.

It is perfectly legal to outline both examples into a single function and
call that. It will give correct answers, i.e.:

int outlined() {
  return a + b;
}

int foo() {
  return outlined();
}

int bar() {
  return outlined();
}

Thanks a bunch! I’ll try to clarify a bit about why I did what I did.

In order to turn it into a practically useful optimization pass, though, the “results” section should be strengthened a bit:

  • You simply say “tests >= 4 Kb”, without specifying the nature of these tests at all. Please say exactly what they are, along with the exact compiler options you used. If the tests are artificial, they might be good for debugging, but worthless as a measure of the usefulness of your pass.

Definitely agree!

The tests that I used were the C/C++ programs in the LLVM test suite, with externals enabled. What I did was run clang -mno-red-zone against clang -mno-red-zone with outlining, and clang -mno-red-zone -Oz against clang -mno-red-zone with outlining. What would be better would be to run it on some real embedded cases rather than the programs in the test suite. What I found in the test suite is that there are a few programs which do really well (like, say, an 80% reduction!), but it turns out that those programs are typically entirely macros.

(Some other interesting test cases, less relevant to embedded, might be programs with heavy template usage.)

  • You use “a fairly recent 64-bit Intel processor” to test an embedded-targeting optimization, which is odd. If you want to target x86 (good choice! :-)), perhaps a Quark chip might be a better choice? You can measure the code size impact without using real hardware – just add the “-triple i586-intel-elfiamcu” option.

The reason I started out using an Intel processor was just to verify that I could even run outlined programs in the first place. I have been working on extending it to ARM, which would be far more interesting for embedded than x86-64. That being said, this can potentially be useful on non-embedded computers, but the benefits would be much greater in an embedded space.

(I also didn’t know about that flag, so thanks!)

- Jessica

Hi Jessica, as I said on Twitter many thanks for sharing such a
detailed write-up. One thing I was wondering is whether you'd looked
at your outlining scheme with reference to function prologues and
epilogues. There has been a patch proposed to use shared epilogues in
compiler-rt for AArch64 in order to reduce code size
(https://reviews.llvm.org/D15600) and work on the RISC-V compressed
instruction set has shown that outlining for prologues/epilogues can
reduce code size by ~5% and weaken the argument for supporting load
multiple and store multiple instructions
(https://riscv.org/wp-content/uploads/2015/06/riscv-compressed-workshop-june2015.pdf,
slide 15).

Having the compiler automatically detect such opportunities seems
attractive. Do you think that this subset of outlining would be
better addressed by a specifically targeted pass, or within your
MachineOutliner?

Best,

Alex

Daniel,

OK, I see your point.

My understanding was that the VNDAG is an internal representation, not meant to be used by VN’s clients (which should really only care about congruence classes, nothing more). Also, I thought that VN’s results are available for a single function at a time. It seems that NewGVN provides access to VNDAGs computed for the whole program – which enables its use in Jessica’s pass.

Still, some issues remain. I wonder what the leaf nodes for the reads of “a” and “b” contain. If they carry VN numbers (as the “Efficient SSA” paper says) and the VNDAG comparator compares them (for leaves), then the best one can do is this:

int outlined(int arg1, int arg2) {
  return arg1 + arg2;
}

int foo() {
  int arg1 = a;
  int arg2 = b;
  return outlined(arg1, arg2);
}

int bar() {
  int arg1 = a;
  int arg2 = b;
  return outlined(arg1, arg2);
}

that hardly helps.

If, on the other hand, the VNDAG comparator compares variable names and doesn’t consult congruence classes at all, how does this relate to value numbering? SSA could be used for that purpose just as well.

I tend to agree with Hal – value numbering computes equivalent values, while this pass needs equivalent sequences of instructions.

Yours,
Andrey