[RFC] Interprocedural MIR-level outlining pass

Daniel,

OK, I see your point.

My understanding was that VNDAG is an internal representation, not meant
to be used by VN's customers (who should really only care about congruence
classes, nothing more).

Depends on the customers?
You provide what your customers want, not decide what they want :)

Also, I thought that VN's results are available for a single function at a
time. It seems that NewGVN provides access to VNDAGs computed for the whole
program -- which enables its usage in Jessica's pass.

It does not currently, but this would be trivial to do.

You are going very very far into the implementation details which very much
do not matter from an algorithm design perspective.

Still, some issues remain. I wonder what the leaf nodes for reads of the "a"
and "b" values contain. If they have VN numbers (as the "Efficient SSA" paper
says) and the VNDAG comparator compares them (for leaves), then the best one
can do is this:

int a, b;

int outlined(int arg1, int arg2)
{
  return arg1 + arg2;
}

int foo() {
  int arg1 = a;
  int arg2 = b;
  return outlined(arg1, arg2);
}

int bar() {
  int arg1 = a;
  int arg2 = b;
  return outlined(arg1, arg2);
}

That hardly helps.

See above. Again, you are going really far into implementation details for
no reason I can see. We can do pretty much whatever we want.
Remember also, you outlined a single-statement function; that is the best
you can do in pretty much any legal case.

If, on the other hand, VNDAG's comparator compares variable names and
doesn't use congruence classes at all, how does this relate to Value
Numbering? SSA can be used for this purpose just as well.

I tend to agree with Hal -- value numbering computes equivalent *values*,

Sorry, but this is just flat out wrong

"A Global Value Numbering(GVN) algorithm is considered to be complete (or
precise), if it can detect all Herbrand equivalences among expressions in a
program.
Two expressions are said to be Herbrand equivalent (or transparent
equivalent ), if they are computed by the same operator applied to
equivalent operands "

This is, AFAIK, precisely what you want.

I'm not entirely happy with this definition (IMHO, it's overly
restrictive), but this is irrelevant.

What is relevant is what one considers to be "equivalent operands". Take my
example again -- for the outlining (Jessica's name) / code folding (your name)
optimization, reads of the "a" and "b" globals are equivalent; for VN and its
users they are not.

Yours,
Andrey

From: "Andrey Bokhanko" <andreybokhanko@gmail.com>
To: "Daniel Berlin" <dberlin@dberlin.org>
Cc: "Hal Finkel" <hfinkel@anl.gov>, "llvm-dev"
<llvm-dev@lists.llvm.org>
Sent: Wednesday, August 31, 2016 7:17:08 AM
Subject: Re: [llvm-dev] [RFC] Interprocedural MIR-level outlining
pass

> > I tend to agree with Hal -- value numbering computes equivalent
> > *values*,
>

> Sorry, but this is just flat out wrong

> "A Global Value Numbering(GVN) algorithm is considered to be
> complete
> (or precise), if it can detect all Herbrand equivalences among
> expressions in a program.

> Two expressions are said to be Herbrand equivalent (or transparent
> equivalent ), if they are computed by the same operator applied to
> equivalent operands "

> This is, AFAIK, precisely what you want.

I'm not entirely happy with this definition (IMHO, it's overly
restrictive), but this is irrelevant.

What is relevant is what one considers to be "equivalent operands". Take
my example again -- for the outlining (Jessica's name) / code folding
(your name) optimization, reads of the "a" and "b" globals are
equivalent; for VN and its users they are not.

Yes, this was exactly my point. We want to recognize structurally-equivalent sequences of instructions on inequivalent operands.

-Hal

I tend to agree with Hal -- value numbering computes equivalent *values*,

Sorry, but this is just flat out wrong

"A Global Value Numbering(GVN) algorithm is considered to be complete (or
precise), if it can detect all Herbrand equivalences among expressions in a
program.
Two expressions are said to be Herbrand equivalent (or transparent
equivalent ), if they are computed by the same operator applied to
equivalent operands "

This is, AFAIK, precisely what you want.

I'm not entirely happy with this definition (IMHO, it's overly
restrictive),

Err, this is the definition going all the way back to Kildall, so ...

but this is irrelevant.

What is relevant is what one considers to be "equivalent operands". Take my
example again -- for the outlining (Jessica's name) / code folding (your name)
optimization, reads of the "a" and "b" globals are equivalent; for VN and its
users they are not.

You realize how you define "equivalent" is completely up to you, right?
The algorithm does not care.

It does not care if you say "anything that I can prove equal is equivalent"
or "anything that is a variable" is equivalent.

Yes, this was exactly my point. We want to recognize
structurally-equivalent sequences of instructions on inequivalent operands.

Yes, and my point is "none of the VN and VN-DAG generating algorithms care".

You can define "equivalent" to be structural, you can define it to be
"these two variables are equivalent if they both start with 'a'", you can
define it however you want.
They will still give you the dags you want.

This is as simple as substituting a hash and equality function.
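
To make "substituting a hash and equality function" concrete, here is a
minimal sketch in plain C++ (the Expr / StrictEq / StructuralEq names are
invented for illustration; this is not NewGVN's actual interface). The same
value-numbering table gives you either notion of equivalence depending on
which comparator pair you plug in.

#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Toy VN expression: an opcode plus the value numbers of its operands.
struct Expr {
  std::string Opcode;
  std::vector<unsigned> Operands;
};

// "Strict" equivalence: same opcode and the same operand value numbers.
// This is the notion you want for redundancy elimination.
struct StrictEq {
  bool operator()(const Expr &A, const Expr &B) const {
    return A.Opcode == B.Opcode && A.Operands == B.Operands;
  }
};
struct StrictHash {
  std::size_t operator()(const Expr &E) const {
    std::size_t H = std::hash<std::string>()(E.Opcode);
    for (unsigned VN : E.Operands)
      H = H * 31 + VN;
    return H;
  }
};

// "Structural" equivalence: same opcode and the same number of operands;
// the operands themselves are ignored. Something like this is what a
// code-folding / outlining client would plug in instead.
struct StructuralEq {
  bool operator()(const Expr &A, const Expr &B) const {
    return A.Opcode == B.Opcode && A.Operands.size() == B.Operands.size();
  }
};
struct StructuralHash {
  std::size_t operator()(const Expr &E) const {
    return std::hash<std::string>()(E.Opcode) * 31 + E.Operands.size();
  }
};

// The numbering table itself does not care which pair you hand it.
template <typename Hash, typename Eq>
unsigned valueNumber(std::unordered_map<Expr, unsigned, Hash, Eq> &Table,
                     const Expr &E) {
  return Table.emplace(E, Table.size()).first->second;
}

int main() {
  std::unordered_map<Expr, unsigned, StrictHash, StrictEq> Strict;
  std::unordered_map<Expr, unsigned, StructuralHash, StructuralEq> Structural;

  Expr AB{"add", {1, 2}}; // say, "a + b"
  Expr AC{"add", {1, 3}}; // say, "a + c"

  unsigned S1 = valueNumber(Strict, AB);
  unsigned S2 = valueNumber(Strict, AC);
  unsigned N1 = valueNumber(Structural, AB);
  unsigned N2 = valueNumber(Structural, AC);
  // Strict: different numbers (0 1). Structural: one class (0 0).
  std::cout << S1 << " " << S2 << " | " << N1 << " " << N2 << "\n";
  return 0;
}

Under the strict pair, a + b and a + c get different numbers; under the
structural pair they collapse into one class, which is exactly the knob a
folding/outlining client would want to turn.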

To wit: actual compilers do it this way.

So I'm entirely unsure why there is such an argument that it is hard or
impossible, or even strange.

It is in fact quite easy; GCC has a VN pass that produces expression DAG
output, and has used it to do code hoisting, to do PRE, to do folding, to do
whatever. It has been used for all of these things.

Some of these use a more standard VN definition of equivalence that is
useful for redundancy elimination.
Some of them use one that is meant for folding (and would be illegal to use
for straight redundancy elimination).

If you want to build a pass that basically does the same thing, it seems
silly, but feel free!

(and in particular, the definition of equivalence used by code folding to
make the dags is something like "two VNDAG expressions are equivalent if
their operands come from VNDAG expressions with the same opcode")

Thus,

VN2 = VN0 + VN1
VN3 = VN1 + VN2

is considered equivalent to

VN2 = VN0 + VN5
VN3 = VN1 + VN2

Despite the fact that this is completely illegal for straight redundancy elimination.

But again, as I said, if y'all want to make a pass that basically generates a new type of expression DAG, have fun :)

(I’m going to just leave this thread be now)

From: "Daniel Berlin" <dberlin@dberlin.org>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Andrey Bokhanko" <andreybokhanko@gmail.com>, "llvm-dev"
<llvm-dev@lists.llvm.org>
Sent: Wednesday, August 31, 2016 7:02:57 PM
Subject: Re: [llvm-dev] [RFC] Interprocedural MIR-level outlining
pass

(and in particular, the definition of equivalence used by code
folding to make the dags is something like "two VNDAG expressions are
equivalent if their operands come from VNDAG expressions with the
same opcode")

Thus,

VN2 = VN0 + VN1
VN3 = VN1 + VN2

is considered equivalent to

VN2 = VN0 + VN5
VN3 = VN1 + VN2

Despite the fact that this is completely illegal for straight
redundancy elimination.

But again, as I said, if y'all want to make a pass that basically
generates a new type of expression DAG, have fun :)

I don't think that anyone wants to do anything unnecessary, or re-invent any wheels. I'm just trying to understand what you're saying.

Regarding your example above, I certainly see how you might do this simply by defining an equivalence class over all "external" inputs. I don't think this is sufficient, however, for what is needed here. The problem is that we need to recognize maximal structurally-similar subgraphs, and I don't see how what you're proposing does that (even if you have some scheme where you don't hash every part of each instruction). Maybe if you had some stream-based similarity-preserving hash function, but that's another matter.

Let's say we have this:

q = a + b
r = c + d
s = q + r
t = q - r

and later we have:

u = e + f
v = g - h
w = u + v
x = u - v

Now, what I want to recognize is that the latter two instructions in each group are structurally similar. Even if I define an equivalence class over external inputs, in this case, E = { a, b, c, d, e, f, g, h}, then I have:

q = E + E
r = E + E
s = q + r
t = q - r

and:

u = E + E
v = E - E
w = u + v
x = u - v

but then r and v are still different, so how do we find that {s, t} can be abstracted into a function, because we've found the isomorphism { s -> w, t -> x }, and thus reuse that function to evaluate {w, x}?
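
(For concreteness, a toy version of that "stream-based hash" aside might look like the sketch below: map each instruction to a structural token, hash fixed-length windows, and treat windows with equal hashes as outlining candidates. All names here are invented; this is only an illustration, not the proposed pass, and equal hashes would still need a real verification and legality step.)

#include <cstddef>
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// A toy "structural token": just the operation name, ignoring operands.
using Token = std::string;

// Hash a window of W consecutive tokens starting at position I.
static std::size_t hashWindow(const std::vector<Token> &Seq, std::size_t I,
                              std::size_t W) {
  std::size_t H = 0;
  for (std::size_t K = 0; K < W; ++K)
    H = H * 131 + std::hash<Token>()(Seq[I + K]);
  return H;
}

int main() {
  // Two instruction streams reduced to opcodes only.
  std::vector<Token> F1 = {"add", "add", "add", "sub"}; // q, r, s, t
  std::vector<Token> F2 = {"add", "sub", "add", "sub"}; // u, v, w, x

  const std::size_t W = 2; // window length
  // window hash -> list of (function id, start index)
  std::map<std::size_t, std::vector<std::pair<int, std::size_t>>> Windows;
  for (std::size_t I = 0; I + W <= F1.size(); ++I)
    Windows[hashWindow(F1, I, W)].push_back({1, I});
  for (std::size_t I = 0; I + W <= F2.size(); ++I)
    Windows[hashWindow(F2, I, W)].push_back({2, I});

  // Any hash that occurs more than once is an outlining candidate; a real
  // pass would verify the match (hashes can collide) and check legality.
  for (const auto &Entry : Windows)
    if (Entry.second.size() > 1)
      for (const auto &Occ : Entry.second)
        std::cout << "candidate window at F" << Occ.first << "["
                  << Occ.second << ".." << Occ.second + W - 1 << "]\n";
  return 0;
}

Note that opcode-only tokens also match {u, v} here, which is part of why the question of picking the right notion of "equivalent" matters in the first place.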

Thanks again,
Hal

------------------------------

*From: *"Daniel Berlin" <dberlin@dberlin.org>
*To: *"Hal Finkel" <hfinkel@anl.gov>
*Cc: *"Andrey Bokhanko" <andreybokhanko@gmail.com>, "llvm-dev" <
llvm-dev@lists.llvm.org>
*Sent: *Wednesday, August 31, 2016 7:02:57 PM
*Subject: *Re: [llvm-dev] [RFC] Interprocedural MIR-level outlining pass

(and in particular, the definition of equivalence used by code folding to
make the dags is something like "two VNDAG expressions are equivalent if
their operands come from VNDAG expressions with the same opcode")

Thus,

VN2 = VN0 + VN1
VN3 = VN1 + VN2

is considered equivalent to

VN2 = VN0 + VN5
VN3 = VN1 + VN2

Despite the fact that this is completely illegal for straight redundancy
elimination.

But again, as I said, if y'all want to make a pass that basically generates
a new type of expression DAG, have fun :)

I don't think that anyone wants to do anything unnecessary, or re-invent
any wheels. I'm just trying to understand what you're saying.

Regarding your example above, I certainly see how you might do this simply
by defining an equivalence class over all "external" inputs. I don't think
this is sufficient, however, for what is needed here. The problem is that
we need to recognize maximal structurally-similar subgraphs, and I don't
see how what you're proposing does that (even if you have some scheme where
you don't hash every part of each instruction). Maybe if you had some
stream-based similarity-preserving hash function, but that's another matter.

Now, what I want to recognize is that the latter two instructions in each
group are structurally similar. Even if I define an equivalence class over
external inputs, in this case, E = { a, b, c, d, e, f, g, h}, then I have:

q = E + E
r = E + E
s = q + r
t = q - r

Doing literally nothing but building a standard VN dag (i.e., with normal
equivalence), the dag here will contain:

V1 = VN expression of E + E
V2 = V1
s = VN expression of V1 + V2 (you can substitute V1 if you like or not)
t = VN expression of V1 - V2 (ditto)

and:

u = E + E
v = E - E
w = u + v
x = u - v

The dag here would contain
V1 = E + E
V2 = E - E
w = VN expression of V1 + V2
x = VN expression of V1 - V2

So what's the issue you are worried about here, precisely?

If you extend this example to be longer, the answer does not change.

If you make it so one example is longer than the other, you can still
discover this by normal graph isomorphism algorithms.
If you literally only care about the operation being performed, you can
make that work too.
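
To make that concrete on the example above (a throwaway sketch with invented names, not NewGVN's real data structures): hash each candidate region bottom-up, collapsing anything outside the region, including all real leaves, into a single "external input" class. Then {s, t} and {w, x} hash identically, pair by pair, even though r and v differ.

#include <cstddef>
#include <functional>
#include <iostream>
#include <set>
#include <string>
#include <utility>

// Tiny expression DAG node: leaves are external inputs, interior nodes are
// an opcode applied to two children.
struct Node {
  std::string Opcode; // "leaf", "+", "-"
  const Node *L, *R;
  Node(std::string Op, const Node *Lhs = nullptr, const Node *Rhs = nullptr)
      : Opcode(std::move(Op)), L(Lhs), R(Rhs) {}
};

// Hash a node relative to a candidate region: anything outside the region
// (including every real leaf) collapses into one "external input" class,
// so only the shape of the region itself matters.
static std::size_t regionHash(const Node *N,
                              const std::set<const Node *> &Region) {
  if (!Region.count(N))
    return 0x9e3779b9; // one class for all external inputs
  std::size_t H = std::hash<std::string>()(N->Opcode);
  H = H * 131 + regionHash(N->L, Region);
  H = H * 131 + regionHash(N->R, Region);
  return H;
}

int main() {
  Node A("leaf"), B("leaf"), C("leaf"), D("leaf");
  Node E("leaf"), F("leaf"), G("leaf"), H("leaf");

  // q = a + b, r = c + d, s = q + r, t = q - r
  Node Q("+", &A, &B), R("+", &C, &D);
  Node S("+", &Q, &R), T("-", &Q, &R);

  // u = e + f, v = g - h, w = u + v, x = u - v
  Node U("+", &E, &F), V("-", &G, &H);
  Node W("+", &U, &V), X("-", &U, &V);

  std::set<const Node *> R1 = {&S, &T}; // candidate region {s, t}
  std::set<const Node *> R2 = {&W, &X}; // candidate region {w, x}

  // Both regions hash identically, pair by pair: s <-> w and t <-> x.
  std::cout << (regionHash(&S, R1) == regionHash(&W, R2)) << " "
            << (regionHash(&T, R1) == regionHash(&X, R2)) << "\n"; // "1 1"
  return 0;
}

Equal region hashes are of course only candidates; you would confirm them with an actual isomorphism/legality check before outlining, which is where the graph isomorphism algorithms mentioned above come in.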

but then r and v are still different, so how do we find that {s, t} can be
abstracted into a function, because we've found the isomorphism { s -> w, t
-> x }, and thus reuse that function to evaluate {w, x}?

So let's back up. In your scheme, how do you think you do that now,
precisely?
What is the algorithm you use to discover the property you are trying to
find?

Hi Daniel,

Consider me convinced (not sure you care, but still… :-))

What confused me is that this is not VN in a traditional sense – it’s more like using VN’s infrastructure to compute something different. But one can use VN’s code for this, indeed.

Thank you for the explanation!

Yours,
Andrey

:)

I was actually about to just post the following -

Any complete code folding algorithm should be able to determine that the following functions can all be folded together - I'm just using full functions because it's easier, as they only have one output; feel free to make them partial.
All of these provably compute the same thing (assume proper data types, blah blah blah).

a = b + c
d = e + f
return a + d

a = b + c
d = e + f
g = a + 0
h = d + 0
return g + h

a = b + c
d = e + f
g = a + 1
h = d - 1
return g + h

if (arg1)
{
a = b + c
d = e + f
}
else
{
a = e + f
d = b + c
}
return a + d

data[] = {e, f, b, c};
for (i = 0; i < 4; ++i)
  sum += data[i];
return sum;

etc

You can go on forever.

I cannot see how it is possible to make a complete code folding algorithm that catches even the above cases and is not based on a VN algorithm, either to generate dags that look the same for each function / part of the function, or to determine that two dags that don't look like they compute the same operation do, in fact, compute the same operation.
(or to canonicalize the function itself)

Maybe it’s possible. I’m definitely not smart enough to see it :slight_smile:

Can you make code folding algorithms that get less? Sure.
Might you want to do that anyway because it’s easier/faster/whatever? Again, sure.

But at some point, as your thing is extended to catch more and more cases, I do not see how it does not either rely on a complete VN analysis, or become a complete VN analysis itself.

If NewGVN is smart enough to handle case #5, this is really impressive!

From: "Daniel Berlin" <dberlin@dberlin.org>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Andrey Bokhanko" <andreybokhanko@gmail.com>, "llvm-dev"
<llvm-dev@lists.llvm.org>
Sent: Wednesday, August 31, 2016 8:24:11 PM
Subject: Re: [llvm-dev] [RFC] Interprocedural MIR-level outlining
pass

> > From: "Daniel Berlin" < dberlin@dberlin.org >
>

> > To: "Hal Finkel" < hfinkel@anl.gov >
>

> > Cc: "Andrey Bokhanko" < andreybokhanko@gmail.com >, "llvm-dev" <
> > llvm-dev@lists.llvm.org >
>

> > Sent: Wednesday, August 31, 2016 7:02:57 PM
>

> > Subject: Re: [llvm-dev] [RFC] Interprocedural MIR-level outlining
> > pass
>

> > (and in particular, the definition of equivalence used by code
> > folding to make the dags is STH like "two VNDAG expressions are
> > equivalent if their operands come from VNDAG expressions with the
> > same opcode")
>

> > Thus,
>

> > VN2 = VN0 + VN1
>

> > VN3 = VN1 + VN2
>

> > is considered equivalent to
>

> > VN2 = VN0 + VN5
>

> > VN3 = VN1 + VN2
>

> > Despite the fact that this is completely illegal for straight
> > redundancy elimination.
>

> > But again, as I said if y'all want to make a pass that basically
> > generates a new type of expression DAG, have fun :slight_smile:
>

> I don't think that anyone wants to do anything unnecessary, or
> re-invent any wheels. I'm just trying to understand what you're
> saying.

> Regarding you example above, I certainly see how you might do this
> simply by defining an equivalence class over all "external" inputs.
> I don't think this is sufficient, however, for what is needed here.
> The problem is that we need to recognize maximal
> structurally-similar subgraphs, and I don't see how what you're
> proposing does that (even if you have some scheme where you don't
> hash every part of each instruction). Maybe if you had some
> stream-based similarity-preserving hash function, but that's
> another
> matter.

> Now, what I want to recognize is that the latter two instructions in
> each group are structurally similar. Even if I define an equivalence
> class over external inputs, in this case, E = { a, b, c, d, e, f, g,
> h}, then I have:
>
> q = E + E
> r = E + E
> s = q + r
> t = q - r

Doing literally nothing but building a standard VN dag (i.e., with
normal equivalence), the dag here will contain:

V1 = VN expression of E + E
V2 = V1
s = VN expression of V1 + V2 (you can substitute V1 if you like or not)
t = VN expression of V1 - V2 (ditto)

> and:
>
> u = E + E
> v = E - E
> w = u + v
> x = u - v

The dag here would contain
V1 = E + E
V2 = E - E
w = VN expression of V1 + V2
x = VN expression of V1 - V2

Okay, I understand this. V2 here, of course, is different from V2 in the instruction sequence above.

So what's the issue you are worried about here, precisely?

If you extend this example to be longer, the answer does not change.

If you make it so one example is longer than the other, you can still
discover this by normal graph isomorphism algorithms.

Alright, I think this is where we're talking past each other. I thought you were saying that using the VN DAG somehow makes this part easier (compared, say, to running on the SSA use/def graph). If you're not claiming that, then I think we're on the same page.

-Hal

Hi Jessica,

I am looking into trying out your patch and reproducing the results that you got for code size reduction. Could you tell me what compiler options you used? Also, which clang commit did you work with?

Violeta

Hi Violeta,

I compiled with

  clang -Oz
  clang -Oz -mno-red-zone

for comparisons against -Oz, and

  clang -O0
  clang -O0 -mno-red-zone

for comparisons against a default clang.

I unfortunately don’t have the clang commit I worked with on my home laptop, and don’t have access to the computer I was using at Apple, so I can’t help you there.

Jessica

Hi Jessica,

Thank you for your quick reply.

I have a few more questions.

1. In the results section of your original post, I suppose that by 'llvm test suite' you meant the SingleSource and MultiSource tests. Is that correct?

2. I am trying to run these tests with the -mno-red-zone option; however, I am running into issues, and I hope that you can help me.

2.1. The first problem is that I get a segfault when trying to e.g. create an .o file. Example of building a test from the test-suite:

clang++ -I<path-to-results-dir>/MultiSource/Benchmarks/Prolangs-C++/simul -I<path-to-test-suite>/MultiSource/Benchmarks/Prolangs-C++/simul -I<path-to-test-suite>/include -I../../../../include -D_GNU_SOURCE -D__STDC_LIMIT_MACROS -DNDEBUG -DSMALL_PROBLEM_SIZE -O3 -mno-red-zone -mllvm -enable-machine-outliner -m64 -fomit-frame-pointer -c <path-to-test-suite>/MultiSource/Benchmarks/Prolangs-C++/simul/simulate.cpp -o simulate.o

The problem occurs in AsmPrinter::EmitFunctionBody(), while running the 'X86 Assembly / Object Emitter' pass on an outlined function.
I haven't found what causes it, but the problematic bit seems to be the value of MI in EmitFunctionBody() in lib/CodeGen/AsmPrinter/AsmPrinter.cpp around line 854:
  ....
  // Print a label for the basic block.
  EmitBasicBlockStart(MBB);
  for (auto &MI : MBB) {            // <---- line 854
    // Print the assembly for the instruction.
    if (!MI.isPosition() && !MI.isImplicitDef() && !MI.isKill() &&
        !MI.isDebugValue()) {
  ....

I haven't debugged this any further, so I don't really know what the cause of the problem is here.
What makes it strange is that this segfault doesn't always occur (but it does the majority of the time), and also, it doesn't always happen for the same outlined function.
I was wondering if you have come across a similar issue?

2.2. When I try to build a simple example with the '-mno-red-zone' and '-enable-machine-outliner' options, I get an 'undefined reference' to an outlined function, e.g. "undefined reference to `l_OUTLINED_FUNCTION0'". The .s file shows that the call that gets inserted is a call to `l_OUTLINED_FUNCTION0', while the outlined function has a label with a different prefix, `.LOUTLINED_FUNCTION0'.
Did you have the same problem?

Note: I have cloned your repository from https://github.com/ornata/llvm.git and for clang, I used commit aa5fc8f0161578a00cb5e61a19cb6c6429ff85e3.

I hope that you can help me resolve these issues, or tell me if I am doing something wrong here. :)

Violeta

Hi Jessica,

Did you have time to take a look at these questions maybe?

Thank you,

Violeta

Hi Violeta,

Sorry, I haven’t had time to look since I’m in the process of an international move right now.

  1. Yes; we also used the external tests though.

  2. This is a much larger question; I’ll look into it and get back to you ASAP.

Jessica

Hi Violeta,

I think I've solved the problem you were having. I pushed the changes
to the git repo located at https://github.com/ornata/llvm. Sorry for
the delay!

If you're curious, the issue was in cloning instructions. The cloned
instructions created by CloneMachineInstr actually maintain references
back to the function you're cloning *from*. Since each outlined
function is cloned from some existing MachineFunction, when that
function is emitted and thrown out, certain outlined instructions are
deallocated. Thus, sometimes there's nothing there to print and so it
crashes and burns. The reason this never came up before was that in
the initial outliner implementation, LLVM didn't have
MachineModulePasses yet. As a preliminary solution, we just decided to
temporarily never throw out MachineFunctions. As a result, we never
tried to deallocate anything at all, so this problem never showed up!
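
For anyone following along, the lifetime issue described above is the classic
"clone keeps a pointer into its parent" pattern. A deliberately simplified
stand-in in plain C++ (invented types, not the real MachineFunction /
MachineInstr API):

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Stand-ins for MachineInstr / MachineFunction: a cloned "instruction"
// that still points into storage owned by the function it was cloned from.
struct Instr {
  const std::string *Text; // borrowed pointer into the parent's storage
};

struct Function {
  std::vector<std::string> Lines;
  Instr cloneShallow(std::size_t I) const { return Instr{&Lines[I]}; }
};

int main() {
  std::vector<Instr> Outlined;
  {
    Function Src;
    Src.Lines = {"add r0, r1, r2", "sub r3, r0, r1"};
    Outlined.push_back(Src.cloneShallow(0));
    Outlined.push_back(Src.cloneShallow(1));
    // While Src is alive, printing the clones is fine.
    for (const Instr &MI : Outlined)
      std::cout << *MI.Text << "\n";
  } // Src is "emitted and thrown out" here; the clones in Outlined now dangle.

  // Touching Outlined[i].Text past this point would be a use-after-free --
  // the same kind of intermittent crash seen in EmitFunctionBody() above.
  return 0;
}

Keeping every MachineFunction alive, as the initial implementation did, hides
the problem; the real fix has to ensure outlined instructions don't reference
storage that gets freed when their source function is thrown away.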

Hopefully this resolves the issue on your end.

- Jessica