PATCH: Use size reduction -- wave2

Hi All,

here comes the patch for the second wave of Use class size reduction.

I have included all the machinery that is needed, and it is
*active*. The User* inside Use is sometimes even NULL,
but the algorithm is able to recover it.
Whenever a non-null User* is present, I assert
that it equals the computed value.
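For readers who have not followed the algorithm thread, the principle can be sketched in a few lines. This is a toy model with made-up names: the operand Uses are co-allocated directly in front of their User, and a couple of spare bits per Use mark the layout. The sketch uses a linear scan where the branch actually encodes binary waymarks in those bits to keep the walk short.

```cpp
#include <cassert>
#include <cstdlib>
#include <new>

// Toy model of the layout: N Use objects immediately followed in memory
// by their User.  Two spare bits per Use (modelled here as an explicit
// enum) are enough to mark the end of the array, so a Use can find its
// User without storing a User* back-pointer.
enum Tag { DigitTag, FullStopTag };

struct ToyUse {
  void *Val;   // the operand Value (payload, irrelevant for the demo)
  Tag T;       // stands in for the 2 bits stolen from a pointer member
};

struct ToyUser {
  int Payload;
};

// Co-allocate N Uses directly in front of the User object.
ToyUser *createWithOperands(unsigned N) {
  void *Mem = std::malloc(N * sizeof(ToyUse) + sizeof(ToyUser));
  ToyUse *Ops = static_cast<ToyUse *>(Mem);
  for (unsigned I = 0; I != N; ++I)
    Ops[I] = ToyUse{nullptr, I + 1 == N ? FullStopTag : DigitTag};
  return new (Ops + N) ToyUser{42};
}

// Recover the User from a bare Use*: walk forward to the full stop; the
// User begins right past it.  (The real scheme stores binary digits in
// the tags so the walk is logarithmic rather than linear.)
ToyUser *userOf(ToyUse *U) {
  while (U->T != FullStopTag)
    ++U;
  return reinterpret_cast<ToyUser *>(U + 1);
}
```

Again: this is only the shape of the idea, not the branch's actual code.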

I have not received feedback on the algorithmic part yet,
so I assume you are comfortable with it.

Unfortunately I had to introduce a new GlobalVariable::Create
mechanism (I had hoped to nail everything in wave 1, but life is cruel).
As last time, I will submit scripts for the easy conversion of
external projects.

I have split this review material into 3 files, corresponding to
- essential changes,
- related changes in .cpp files,
- collateral nitty-gritty, mostly mechanical stuff.

Outlook: The actual removal of the Use::U member
will happen in wave 3 after this stuff is merged to
trunk. I do not expect any problems.
Btw., I have already performed a test merge,
and it too passes all tests (DejaGNU, clang, test-suite).

Cheers,

  Gabor

--------------------- STATS ------------------
ggreif$ ls -l wave2*
-rw-r--r-- 1 ggreif ggreif 79367 Apr 16 00:20 wave2-essentials.diff
-rw-r--r-- 1 ggreif ggreif 51795 Apr 16 00:24 wave2-impl.diff
-rw-r--r-- 1 ggreif ggreif 25300 Apr 16 00:25 wave2-nittygritty.diff

ggreif$ wc wave2*
     2189 9005 79367 wave2-essentials.diff
     1408 4793 51795 wave2-impl.diff
      521 1995 25300 wave2-nittygritty.diff
     4118 15793 156462 total

Hi Gabor,

Can you provide performance data for this? I'd
like to know what effect these changes have on
compile time.

Thanks,

Dan

> Unfortunately I had to introduce a new GlobalVariable::Create
> mechanism (I had hoped to nail everything in wave 1, but life is cruel).
> As last time, I will submit scripts for the easy conversion of
> external projects.

One request is to explicitly explain the new mechanism so people don't have to read the diffs or extrapolate from the conversion scripts.

Please send a separate email with what API changes occurred, and include "API Change" in the subject so it's pretty clear what's happening. Advance notice as to when it will be merged in is also a plus.

Thanks,
Tanya

> Hi Gabor,

> Can you provide performance data for this? I'd
> like to know what effect these changes have on
> compile time.

Hi Dan,

Unfortunately, no. I can feed you some speculation, though;
see below.

The reason why I cannot do measurements (at the moment) is that
- I have no experience with performance measurements of LLVM compile
time,
- I have no good test cases that would provide the data interesting
to you,
- I have no machine that can run Release-mode tests:
  my PPC Mac mini at home runs Tiger, and Xcode 2.4.1 is broken for
Release mode [1];
  I have access to a beefy Mac Pro, but it miscompiles Release for
the same reason. It is also an image-processing machine in
production, so I can only run "nice make ...";
  I have access to a Solaris 10 workstation, but I really do not want
to build llvm-gcc on that one. It also keeps my /home on NFS, which
has speed and quota problems.

But: it is very simple to do the measurements yourself (and I'd love
to hear the results!):

   cd llvm
   svn switch http://llvm.org/svn/llvm-project/llvm/branches/ggreif/use-diet .
   make

then rebuild llvm-gcc, etc., and retest.

And now here is my educated speculation.
There are 2 things that became slower:
1) Use::getUser()
2) Use::get/set due to tagging.

The former is seldom called:

$ find lib -name "*.cpp" | xargs grep "getUser(" | wc -l
      41

We could audit those to make sure that no unnecessary calls are made.
But the getUser() algorithm is not so inefficient anyway.

The second is counterbalanced by faster access to the Use object
in most cases:
  With the exception of PHINode and SwitchInst, the getOperand()
function (if called on a specialized "this" pointer) does a
"this"-relative access instead of first fetching the OperandList
pointer and going through that. This was already the case for
fixed-arity instructions, but now it also applies to variadic ones
that cannot resize their operand list.
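In code, the two access patterns look roughly like this (purely schematic, with hypothetical type names; the real classes carry much more state):

```cpp
#include <cassert>

struct Use { void *Val; };

// Before: getOperand() first loads the OperandList pointer, then
// indexes through it -- one extra dependent load.
struct OldStyleUser {
  Use *OperandList;
  void *getOperand(unsigned I) const { return OperandList[I].Val; }
};

// After: for an instruction whose arity is fixed at construction time
// (here a binary operator), the Uses live at a constant offset from
// "this", so getOperand() is a single this-relative access.
struct NewStyleBinaryOp {
  Use Ops[2];
  void *getOperand(unsigned I) const { return Ops[I].Val; }
};
```

PHINode and SwitchInst keep the pointer-based form because they resize their operand lists.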

Some things got faster, like
1) getOperand() access on (say) CallInst now fetches operands
relative to "this". Also, there are no "new Use[n]" allocations (and
delete []) for these any more.
2) no need to maintain Use::U
3) data-cache related gains on smaller Use objects (not yet activated)

So, my idea is that these changes are performance neutral.

There are potentially more speedups to be realized:
- indexing operands in the decreasing direction eliminates the
getNumOperands() call in getOperand() for variadic-arity instructions;
- shaving off (unneeded) operand-related administrative members from
User.
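The first bullet can be made concrete with a tiny pointer-arithmetic sketch (hypothetical, not the branch's actual code). With the operand array laid out directly below the object, increasing-direction indexing needs the operand count to find slot 0, while decreasing-direction indexing does not:

```cpp
#include <cassert>

// `End` plays the role of "this": the N operands occupy the slots just
// below it.  Decreasing indexing enumerates the same slots in reverse
// order, so decreasing(End, I) == increasing(End, N, N - 1 - I).
int increasing(const int *End, unsigned N, unsigned I) {
  return *(End - N + I);   // must know N, i.e. call getNumOperands()
}
int decreasing(const int *End, unsigned I) {
  return *(End - 1 - I);   // needs no N at all
}
```

Whether logical operand 0 sits nearest to or farthest from the object is then just a storage convention.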

I hope that this is interesting, but I'd like to ask anybody who is
comfortable with performance testing to help provide some hard
data :-)

Cheers,

    Gabor

[1] <http://lists.cs.uiuc.edu/pipermail/llvmdev/2008-March/013436.html>

> Unfortunately I had to introduce a new GlobalVariable::Create
> mechanism (I had hoped to nail everything in wave 1, but life is cruel).
> As last time, I will submit scripts for the easy conversion of
> external projects.

> One request is to explicitly explain the new mechanism so people don't
> have to read the diffs or extrapolate from the conversion scripts.

Sure. The mechanism has been up for review here:

<http://lists.cs.uiuc.edu/pipermail/llvmdev/2008-April/013729.html>

The header file should provide a nice overview:
<http://llvm.org/viewvc/llvm-project/llvm/branches/ggreif/use-diet/include/llvm/User.h?view=markup&pathrev=49380>

There is only one API change this time:
"new GlobalVariable" becomes "GlobalVariable::Create".
Currently there is another annoyance: you cannot pass a dereferenced
use_iterator to dyn_cast; you have to write dyn_cast<Foo>(I->get())
instead of the former dyn_cast<Foo>(*I). This turns out to be an ugly
performance bug in the current codebase (construction and destruction
of Use objects without need). I am trying to come up with a
syntactically neutral solution to this bug.
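The shape of the constructor change, in a self-contained mock (this is not the real GlobalVariable signature): the constructor goes private and a static Create factory takes its place, so the class can control its own allocation, e.g. to co-allocate the operand Uses in front of the object.

```cpp
#include <cassert>
#include <string>

class MockGlobalVariable {
  std::string Name;
  explicit MockGlobalVariable(std::string N) : Name(std::move(N)) {}

public:
  // The factory owns the allocation decision; a co-allocating class can
  // reserve room for its Uses here, which a plain `new` expression cannot.
  static MockGlobalVariable *Create(const std::string &N) {
    return new MockGlobalVariable(N);
  }
  const std::string &getName() const { return Name; }
};

// before:  ... = new MockGlobalVariable("g");        // no longer compiles
// after:   ... = MockGlobalVariable::Create("g");
```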

> Please send a separate email with what API changes occurred, and include
> "API Change" in the subject so it's pretty clear what's happening. Advance
> notice as to when it will be merged in is also a plus.

Sure, will do so.

Cheers,

    Gabor

Yeah, I think
<http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20080414/061158.html>
fixes this.

You can disregard related changes from the patch.

After a good deal of testing I'll merge r49785 to trunk.

Cheers,

   Gabor

> But: it is very simple to do the measurements yourself (and I'd love
> to hear the results!):

>    cd llvm
>    svn switch http://llvm.org/svn/llvm-project/llvm/branches/ggreif/use-diet .
>    make

> then rebuild llvm-gcc, etc., and retest.

Hmm, do not forget to revert the switch after you are done:

   svn switch http://llvm.org/svn/llvm-project/llvm/trunk .

Also, you will need to apply these conversions to llvm-gcc:

##### tcsh script follows ######
foreach CLASS (GlobalVariable)
  setenv SUBST1 "s/new $CLASS/${CLASS}::Create/g"
  setenv SUBST2 "s/new llvm::$CLASS/llvm::${CLASS}::Create/g"
  # rewrite every .cpp and .h file in place
  foreach i (`find . -name "*.cpp" -o -name "*.h"`)
    sed -e "$SUBST1" -e "$SUBST2" < $i > $i.2
    mv $i.2 $i
  end
end

In one place you will have to undo the change (the compiler will
tell you where).

You can also experiment with removing Use::U (and the related
asserts).

Cheers,

    Gabor

> And now here is my educated speculation:
> There are 2 things that became slower
> 1) Use::getUser()
> 2) Use::get/set due to tagging.

> The former is seldom called:

> $ find lib -name "*.cpp" | xargs grep "getUser(" | wc -l
>      41

The majority of those aren't actually Use::getUser, but on the
other hand this grep misses all the users of
value_use_iterator::operator*, which is much more popular.
Unfortunately, overloaded operators are not the easiest to
grep for ;-).

> The second is counterbalanced with a faster access to the Use object
> in most cases:
> With exception of PHINode and SwitchInst, the getOperand() function
> (if called on a specialized "this" pointer) does a "this"-relative
> access instead of getting OperandList pointer first and going thru
> that. This was the case before for fixed-arity instructions, but now
> it applies to variadic ones too that cannot perform operand list
> resizing.

> Some things got faster, like
> 1) getOperand access on (say) CallInst is now fetching operands
> relative to "this". Also, there are no "new Use[n]" allocations (and
> delete []) for these any more.

This is fine, but technically it's an independent change that could
be done without the getUser algorithm changes and tagging.

> 2) no need to maintain Use::U

> 3) data-cache related gains on smaller Use objects (not yet activated)

> So, my idea is that these changes are performance neutral.

> There are potentially more speedups to be realized:
> - indexing operands in the decreasing direction eliminates the
> getNumOperands() call in getOperand() for variadic arity instructions.

At a glance, the only getOperand() that I saw that uses getNumOperands()
is ReturnInst's, and that actually looks like a bug.

> - shaving off (unneeded) operand related administrative members from
> User.

Which members?

> I hope that this is interesting, but I'd like to ask anybody who is
> comfortable with performance testing to help provide some hard
> data :-)

I agree that performance testing is not easy and requires resources,
but I'm curious what's motivating this project here. My assumption
has been that no one would likely go to the extreme of inventing a
mini-language encoded in the least significant bits of pointer
members in successive structs unless they were pretty desperate for
performance or scalability.

Dan

Perhaps someone should write some sort of C++ refactoring tool for that!

— Gordon

> So, my idea is that these changes are performance neutral.

I strongly agree with Dan that we need to measure performance to ensure there is no significant performance regression.

> I hope that this is interesting, but I'd like to ask anybody who is
> comfortable with performance testing to help provide some hard
> data :-)

> I agree that performance testing is not easy and requires resources,
> but I'm curious what's motivating this project here. My assumption
> has been that no one would likely go to the extreme of inventing a
> mini-language encoded in the least significant bits of pointer
> members in successive structs unless they were pretty desperate for
> performance or scalability.

Ah, this question is easy: it shrinks the Use class by a word, which reduces the most popular class in the LLVM IR by 25%. This directly makes all binary operators 8 bytes smaller, for example, which is a pretty big memory-footprint win, and memory footprint translates to dcache efficiency as well.
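Spelled out (the member names below are illustrative, not the exact LLVM declarations; all members are pointer-sized, so the ratio is the same on 32- and 64-bit targets):

```cpp
#include <cassert>

struct Value;
struct User;

struct UseBefore {   // four pointer-sized members
  Value *Val;
  UseBefore *Next;
  UseBefore **Prev;
  User *U;           // the member that wave 3 removes
};

struct UseAfter {    // three: the User* is recomputed via the tag bits
  Value *Val;
  UseAfter *Next;
  UseAfter **Prev;
};

// 4 words -> 3 words is the 25% reduction; a binary operator carries
// two Uses, so it shrinks by two words (8 bytes on a 32-bit target).
```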

-Chris

Gabor,

Have you updated llvm2cpp to generate calls to the appropriate new constructors? Also, could you check the code in the tutorials to make sure it matches the new API?

--Owen

>> And now here is my educated speculation:
>> There are 2 things that became slower
>> 1) Use::getUser()
>> 2) Use::get/set due to tagging.

>> The former is seldom called:

>> $ find lib -name "*.cpp" | xargs grep "getUser(" | wc -l
>> 41

> The majority of those aren't actually Use::getUser, but on the
> other hand this grep misses all the users of
> value_use_iterator::operator*, which is much more popular.
> Unfortunately, overloaded operators are not the easiest to
> grep for ;-).

Hi Dan,

thanks for this hint; it makes me look harder at reducing the cost of
getUser. My current idea is to do a very cheap heuristic inline, which
catches 100% of the cases for UnaryInstruction and 50% of the cases
for Binary/CmpInstruction, without doing a function call at all.

>> The second is counterbalanced with a faster access to the Use object
>> in most cases:
>> With exception of PHINode and SwitchInst, the getOperand() function
>> (if called on a specialized "this" pointer) does a "this"-relative
>> access instead of getting OperandList pointer first and going thru
>> that. This was the case before for fixed-arity instructions, but now
>> it applies to variadic ones too that cannot perform operand list
>> resizing.

>> Some things got faster, like
>> 1) getOperand access on (say) CallInst is now fetching operands
>> relative to "this". Also, there are no "new Use[n]" allocations (and
>> delete []) for these any more.

> This is fine, but technically it's an independent change that could
> be done without the getUser algorithm changes and tagging.

>> 2) no need to maintain Use::U
>> 3) data-cache related gains on smaller Use objects (not yet activated)

>> So, my idea is that these changes are performance neutral.

>> There are potentially more speedups to be realized:
>> - indexing operands in the decreasing direction eliminates the
>> getNumOperands() call in getOperand() for variadic arity instructions.

> At a glance, the only getOperand() that I saw that uses getNumOperands()
> is ReturnInst's, and that actually looks like a bug.

>> - shaving off (unneeded) operand related administrative members from
>> User.

> Which members?

I am thinking of OperandList and NumOperands. In most User subclasses
these are trivially recomputable/constant. The only catch is that
getting them via a User* is a bit more costly (a virtual call or some
other trick). The question is how often the code calls getOperand()
through an unspecialized User*.

>> I hope that this is interesting, but I'd like to ask anybody who is
>> comfortable with performance testing to help provide some hard
>> data :-)

> I agree that performance testing is not easy and requires resources,

I am on my way into this area. I now have a working Release setup
and am already in possession of hard numbers, which are pretty
encouraging.
I see about 1% speed degradation on a cruel verifier test
(2004-09-29-VerifierIsReallySlow.llx), but even there the user time
is well within the cloud of measurements of the regular (non-tagged)
build.
On all other tests I have seen no degradation so far. And this
was even before I moved the tag from the Val to the Prev member of
Use.

> but I'm curious what's motivating this project here. My assumption

LLVM is intended to be a pretty much target-independent technology
with a compact delivery format and potentially good startup behaviour.
To actually be a viable system for a wide range of machines it
must be as memory efficient as possible. The in-memory data structures
are needed for the JIT and optimizer, and thus they must coexist with
the generated machine code for the platform. Use seems to be the
structure with the currently greatest bang-for-buck ratio, so I am
thinking about a solution for that :-)

> has been that no one would likely go to the extreme of inventing a
> mini-language encoded in the least significant bits of pointer
> members in successive structs unless they were pretty desperate for
> performance or scalability.

Yep. There are desperate needs. I am in the embedded world. When
iPhone/iPod-like devices are someday supposed to run many
multi-megainstruction programs routinely, everybody will see the
difference.

Regarding the mini-language, I decided that the alternatives, like
linear search (see the Kernighan-Ritchie mini-language of
null-terminated strings and strlen :-)) and map lookup, are far more
costly, especially in a multi-threaded environment. For completeness,
see my musings at
<http://heisenbug.blogspot.com/2008/04/llvm-data-structures-and-putting-use-on.html>.

> Dan

Thanks for your feedback!

Cheers,

    Gabor

> Gabor,

> Have you updated llvm2cpp to generate calls to the appropriate new

Yes. These are caught by my conversion scripts.

> constructors? Also, could you check the code in the tutorials to make
> sure it matches the new API?

Good point, will do.

Thanks,

   Gabor

> Gabor,

> Have you updated llvm2cpp to generate calls to the appropriate new

Yes. These are caught by my conversion scripts.

> constructors? Also, could you check the code in the tutorials to make
> sure it matches the new API?

Good point, will do.

All API changes so far (which landed on trunk) should be covered by
<http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20080414/061283.html>

Please alert me if you find some old-API (uncompilable) examples.

Cheers,

   Gabor

Please send a separate mail with the subject "API change" and provide details in the email.

Not everyone reads these threads in depth, nor does everyone read the llvm-commits list.

-Tanya

>> So, my idea is that these changes are performance neutral.

> I strongly agree with Dan that we need to measure performance to
> ensure there is no significant performance regression.

Dan, Chris,

finally I am in possession of hard performance data on
a realistic test case (kimwitu++).

I have 20 measurements regarding trunk LLVM and 5 with my changes
merged in:

###### TRUNK + DIET
pixxi:~ ggreif$ cat kimwituFastDiet50147.scatter|grep user|sort
user 1m25.404s
user 1m25.453s
user 1m25.454s
user 1m25.526s
user 1m25.973s

###### TRUNK
pixxi:~ ggreif$ cat kimwituRegular.scatter.backup|grep user|sort
user 1m25.127s
user 1m25.132s
user 1m25.147s
user 1m25.160s
user 1m25.169s
user 1m25.179s
user 1m25.179s
user 1m25.184s
user 1m25.189s
user 1m25.199s
user 1m25.204s
user 1m25.207s
user 1m25.212s
user 1m25.217s
user 1m25.219s
user 1m25.233s
user 1m25.243s
user 1m25.245s
user 1m25.259s
user 1m25.560s

ratio of the two best CPU times:
pixxi:~ ggreif$ expr 854040000 / 85127
10032

ratio of the two second best CPU times:
pixxi:~ ggreif$ expr 854530000 / 85132
10037

It looks like we have a degradation of 0.3%.
The <system> and <real> times show no surprises at all.

There is one important change still missing from the
use-diet branch, viz. the capacity of the BitcodeReaderValueList
is computed very naively with the new algorithm at each push_back.
I left this in to see whether the algorithm scales.
Kimwitu++ bitcode reading puts more than 250,000 Use
objects into a contiguous array. To get its capacity
my algorithm has to visit more than 18 pointers each time.

Tonight I will store the capacity in a member variable
and run comprehensive tests. I expect further speedups,
possibly even parity.
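That fix can be sketched as follows (hypothetical names; the real BitcodeReaderValueList and its growth policy differ). The waymark walk costs roughly one pointer visit per binary digit of the capacity (ceil(log2(250,000)) = 18, which matches the figure above), so caching the capacity pays that cost only on growth instead of on every push_back:

```cpp
#include <cassert>

class ValueListSketch {
  unsigned Capacity = 0;   // cached instead of rediscovered per push_back
  unsigned Size = 0;

  // Stand-in for the real waymark walk: its cost is about one step per
  // binary digit of the capacity; only the cost matters for this demo.
  static unsigned waymarkWalkLength(unsigned Cap) {
    unsigned Steps = 0;
    while (Cap > 1) {
      Cap >>= 1;
      ++Steps;
    }
    return Steps;
  }

public:
  unsigned WalkSteps = 0;  // instrumentation: total walk cost paid

  void push_back() {
    if (Size == Capacity) {                   // grow: walk once, not always
      Capacity = Capacity ? Capacity * 2 : 1; // assumed doubling policy
      WalkSteps += waymarkWalkLength(Capacity);
    }
    ++Size;
  }
  unsigned size() const { return Size; }
};
```

With the naive scheme every one of the 250,000+ push_backs would pay the full walk; with the cache only the logarithmically many growth steps do.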

Barring any surprises, I plan to merge the use-diet branch to
trunk this weekend. Owen promised to help me do more
performance evaluations until then, so we get a clearer
picture.

I have also downloaded CHUD, so maybe even looking at
(and fixing) bottlenecks is feasible in the next days.

What do you think?

Cheers,

    Gabor

PS: Yes, I will send out several mails to llvm-dev before
and after merging.

>>> So, my idea is that these changes are performance neutral.

>> I strongly agree with Dan that we need to measure performance to
>> ensure there is no significant performance regression.

> Dan, Chris,

> finally I am in possession of hard performance data on
> a realistic test case (kimwitu++).

> I have 20 measurements regarding trunk LLVM and 5 with my changes
> merged in:

> <snip>

> It looks like we have a degradation of 0.3%.
> The <system> and <real> times show no surprises at all.

> There is one important change still missing from the
> use-diet branch, viz. the capacity of the BitcodeReaderValueList
> is computed very naively with the new algorithm at each push_back.
> I left this in to see whether the algorithm scales.
> Kimwitu++ bitcode reading puts more than 250,000 Use
> objects into a contiguous array. To get its capacity
> my algorithm has to visit more than 18 pointers each time.

> Tonight I will store the capacity in a member variable
> and run comprehensive tests. I expect further speedups,
> possibly even parity.

> Barring any surprises, I plan to merge the use-diet branch to
> trunk this weekend. Owen promised to help me do more
> performance evaluations until then, so we get a clearer
> picture.

> I have also downloaded CHUD, so maybe even looking at
> (and fixing) bottlenecks is feasible in the next days.

> What do you think?

Could you also report on memory usage?

-bw