Firefox-46.0.1 build with interprocedural register allocation enabled

Hello,

I built the Firefox-46.0.1 source with llvm to test interprocedural register allocation.
The build was successful without any runtime failures; here are a few stats:

Measure                                            W/O IPRA     WITH IPRA    Change
=================================================  =========    =========    =================
Total Build Time                                   76 mins      82.3 mins    8% increase
Octane v2.0 JS Benchmark Score (higher is better)  18675.69     19665.16     5% improvement
Kraken JS Benchmark time (lower is better)         1416.2 ms    1421.3 ms    0.35% regression
JetStream JS Benchmark Score (higher is better)    110.10       112.88       2.52% improvement

Any suggestions on how to effectively measure the performance improvement are welcome!

Sincerely,
Vivek

Hi Vivek,

[Dropping firefox-dev, since I don't want to spam them]

> Measure                                            W/O IPRA     WITH IPRA    Change
> =================================================  =========    =========    =================
> Total Build Time                                   76 mins      82.3 mins    8% increase
> Octane v2.0 JS Benchmark Score (higher is better)  18675.69     19665.16     5% improvement
> Kraken JS Benchmark time (lower is better)         1416.2 ms    1421.3 ms    0.35% regression
> JetStream JS Benchmark Score (higher is better)    110.10       112.88       2.52% improvement

This is great!

Do you have a sense of how much noise these benchmarks have? For
instance, if you run the benchmark 10 times, what is the standard
deviation in the numbers?

It'd also be great to have a precise, non-measurement oriented view of
the benchmarks. Is it possible to collect some statistics on how much
you've improved the register allocation in these workloads? Perhaps
you could count the instances where you preserved a register over a
call site that wouldn't have been possible without IPRA?

Thanks!
-- Sanjoy

> Hi Vivek,
>
> [Dropping firefox-dev, since I don't want to spam them]
>
> > Measure                                            W/O IPRA     WITH IPRA    Change
> > =================================================  =========    =========    =================
> > Total Build Time                                   76 mins      82.3 mins    8% increase
> > Octane v2.0 JS Benchmark Score (higher is better)  18675.69     19665.16     5% improvement
> > Kraken JS Benchmark time (lower is better)         1416.2 ms    1421.3 ms    0.35% regression
> > JetStream JS Benchmark Score (higher is better)    110.10       112.88       2.52% improvement
>
> This is great!
>
> Do you have a sense of how much noise these benchmarks have? For
> instance, if you run the benchmark 10 times, what is the standard
> deviation in the numbers?

Hello Sanjoy,

For Octane and Kraken I ran them 4 times and the above result is the
geometric mean. For Octane the standard deviation (SD) is 918.54 (NO_IPRA)
and 597.82 (With_IPRA). For Kraken unfortunately I don't have the readings
any more, but there was only very minor variation in each run. JetStream
itself runs the tests 3 times and reports the result. From next time
onwards I will run the tests at least 10 times.
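
(For reference, a minimal C++ sketch of the aggregation described above, i.e. the geometric mean plus the sample standard deviation over a handful of runs; the scores in the vector are placeholders, not the actual measurements.)

// Geometric mean and sample standard deviation of a few benchmark runs.
// The scores below are placeholders, not real measurements.
#include <cmath>
#include <iostream>
#include <vector>

int main() {
  std::vector<double> Scores = {18500.0, 18700.0, 18650.0, 18800.0};

  double LogSum = 0.0, Sum = 0.0;
  for (double S : Scores) {
    LogSum += std::log(S);
    Sum += S;
  }
  double GeoMean = std::exp(LogSum / Scores.size());
  double Mean = Sum / Scores.size();

  double SqDiff = 0.0;
  for (double S : Scores)
    SqDiff += (S - Mean) * (S - Mean);
  double SD = std::sqrt(SqDiff / (Scores.size() - 1)); // sample SD (n - 1)

  std::cout << "geomean = " << GeoMean << ", stddev = " << SD << "\n";
}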

> It'd also be great to have a precise, non-measurement oriented view of
> the benchmarks.

I don't understand this point, actually; all these benchmarks were
suggested by the firefox-devs and they wanted the results back.

> Is it possible to collect some statistics on how much
> you've improved the register allocation in these workloads?

Actually, while testing single-source test cases with a debug build I used
to compare the results of -stats with the regalloc keyword, but while
building such a huge piece of software I prefer a release build of llvm.
Also, I don't know if there is any way in llvm to generate stats for the
whole build.
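
(For context: the numbers that -stats prints come from LLVM's STATISTIC counters, which are normally only active in builds with assertions or with statistics explicitly enabled, which is why a plain release build does not report them. A minimal sketch of such a counter follows; the counter name is purely illustrative, and for a whole Firefox build the per-compiler-invocation output would still have to be aggregated externally, since each clang invocation prints its own totals.)

// Sketch of an LLVM statistic counter; the counter name is illustrative.
// With statistics active, `llc -stats` or `clang -mllvm -stats` prints the
// accumulated totals for the invocation at exit.
#include "llvm/ADT/Statistic.h"

#define DEBUG_TYPE "regalloc"

STATISTIC(NumRegsPreservedByIPRA,
          "Number of extra registers preserved across calls (illustrative)");

// ...then, wherever the interesting event happens inside the pass:
//   ++NumRegsPreservedByIPRA;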

> Perhaps
> you could count the instances where you preserved a register over a
> call site that wouldn't have been possible without IPRA?

Yes, this seems a good idea and I think it is implementable: after
calculating the new regmask, when inserting the data into the immutable
pass, the two regmasks can be compared to calculate the improvement. I will
work on this and let you know the progress.

Hi Vivek,

vivek pandya wrote:

> For Octane and Kraken I ran them 4 times and the above result is the
> geometric mean. For Octane the standard deviation (SD) is
> 918.54 (NO_IPRA) and 597.82 (With_IPRA). For Kraken unfortunately I
> don't have the readings any more, but there was only very
> minor variation in each run. JetStream itself runs the tests 3 times
> and reports the result. From next time onwards I will
> run the tests at least 10 times.

Oh, no, I used "10 times" as an anecdotal number. Geomean of 4 times
is enough. Of course, if it does not take extra effort, running them
for more iterations is better, but don't bother if it will e.g. take
more than a trivial amount of manual work.

> It'd also be great to have a precise, non-measurement oriented view of
> the benchmarks.
>
> I don't understand this point, actually; all these benchmarks were
> suggested by the firefox-devs and they wanted the results back.

I was talking about the `-stats` bit there ^.

> Is it possible to collect some statistics on how much
> you've improved the register allocation in these workloads?
>
> Actually, while testing single-source test cases with a debug build I
> used to compare the results of -stats with the regalloc
> keyword, but while building such a huge piece of software I prefer a
> release build of llvm. Also, I don't know if there is any
> way in llvm to generate stats for the whole build.

Can you do something quick and dirty, like just have some local
changes that dump out some information on outs()?
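
(For what it's worth, such a dump can be as simple as one greppable outs() line per function, which can then be summed over the whole build log. A minimal sketch, with the helper name and message format made up; only raw_ostream/outs() are real LLVM APIs.)

// Minimal "quick and dirty" dump via llvm::outs().
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/raw_ostream.h"

static void reportIPRAWin(llvm::StringRef FnName, unsigned ExtraPreserved) {
  // One line per function; grep for "IPRA:" in the build log and sum the
  // last column afterwards.
  llvm::outs() << "IPRA: " << FnName << " preserves " << ExtraPreserved
               << " extra register(s)\n";
}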

> Perhaps
> you could count the instances where you preserved a register over a
> call site that wouldn't have been possible without IPRA?
>
> Yes, this seems a good idea and I think it is implementable: after
> calculating the new regmask, when inserting the data into the
> immutable pass, the two regmasks can be compared to calculate the
> improvement. I will work on this and let you know the progress.

Just to be clear: you don't *have* to do this (specifically, check
with your mentor before sinking too much time into this). But if we
can come up with an easy way to "confirm your kill" (i.e. ensure what
you think should happen is what is actually happening) then I think
we should do it.

Thanks, and best of luck!
-- Sanjoy

> Hello,
>
> I built the Firefox-46.0.1 source with llvm to test interprocedural
> register allocation.
> The build was successful without any runtime failures; here are a few
> stats:

This is very good, thanks for working on this.

> Measure                                            W/O IPRA     WITH IPRA    Change
> =================================================  =========    =========    =================
> Total Build Time                                   76 mins      82.3 mins    8% increase
> Octane v2.0 JS Benchmark Score (higher is better)  18675.69     19665.16     5% improvement

This speedup is kind of amazing, enough to make me a little bit suspicious.

From what I can see, Octane is not exactly a microbenchmark but tries
to model complex/real-world web applications, so I think you might
want to analyze where this speedup is coming from? Also, "score" might
be a misleading metric; can you elaborate on what that means?
How does that relate to, let's say, runtime performance improvement?

Thanks!

> Hi Vivek,
>
> vivek pandya wrote:
>
> > For Octane and Kraken I ran them 4 times and the above result is the
> > geometric mean. For Octane the standard deviation (SD) is
> > 918.54 (NO_IPRA) and 597.82 (With_IPRA). For Kraken unfortunately I
> > don't have the readings any more, but there was only very
> > minor variation in each run. JetStream itself runs the tests 3 times
> > and reports the result. From next time onwards I will
> > run the tests at least 10 times.
>
> Oh, no, I used "10 times" as an anecdotal number. Geomean of 4 times
> is enough. Of course, if it does not take extra effort, running them
> for more iterations is better, but don't bother if it will e.g. take
> more than a trivial amount of manual work.
>
> > It'd also be great to have a precise, non-measurement oriented view
> > of the benchmarks.
> >
> > I don't understand this point, actually; all these benchmarks were
> > suggested by the firefox-devs and they wanted the results back.
>
> I was talking about the `-stats` bit there ^.
>
> > Is it possible to collect some statistics on how much
> > you've improved the register allocation in these workloads?
> >
> > Actually, while testing single-source test cases with a debug build I
> > used to compare the results of -stats with the regalloc
> > keyword, but while building such a huge piece of software I prefer a
> > release build of llvm. Also, I don't know if there is any
> > way in llvm to generate stats for the whole build.
>
> Can you do something quick and dirty, like just have some local
> changes that dump out some information on outs()?

Yes, that should not take too much time.

> Just to be clear: you don't *have* to do this (specifically, check
> with your mentor before sinking too much time into this). But if we
> can come up with an easy way to "confirm your kill" (i.e. ensure what
> you think should happen is what is actually happening) then I think
> we should do it.

I will see if this can be done in a simple way. I am sure the mentors will
welcome it.
-Vivek

> > Hello,
> >
> > I built the Firefox-46.0.1 source with llvm to test interprocedural
> > register allocation.
> > The build was successful without any runtime failures; here are a few
> > stats:
>
> This is very good, thanks for working on this.
>
> > Measure                                            W/O IPRA     WITH IPRA    Change
> > =================================================  =========    =========    =================
> > Total Build Time                                   76 mins      82.3 mins    8% increase
> > Octane v2.0 JS Benchmark Score (higher is better)  18675.69     19665.16     5% improvement
>
> This speedup is kind of amazing, enough to make me a little bit suspicious.
> From what I can see, Octane is not exactly a microbenchmark but tries
> to model complex/real-world web applications, so I think you might
> want to analyze where this speedup is coming from?

Hi Davide,

I don't understand much about browser benchmarks, but what IPRA is trying
to do is reduce spill code and keep values in registers wherever it can, so
the speedup is coming from improved code quality. But with the current
infrastructure it is hard to tell which particular functions in the browser
code are benefiting.

> Also, "score" might
> be a misleading metric; can you elaborate on what that means?
> How does that relate to, let's say, runtime performance improvement?

The Octane score considers two things: execution speed and latency (pauses
during execution). You can find more information here

-Vivek

What was the breakdown across the sub-benchmarks, if I might ask?

-Boris

It sounds a little bit weird that you see such a big improvement for a
benchmark that's supposed to exercise JS (which is very likely handled
by a JIT inside FF), that's why I asked. My point (and worry) is that
benchmarks are very hard to get right, and from time to time you might
end up getting better numbers because of noise and not because of
'improved code quality'.
In other words, as you're presenting numbers, you should be able to
defend those numbers with an analysis which explains why your pass
makes the code better. Hope this makes sense.

> > Octane v2.0 JS Benchmark Score (higher is better)  18675.69  19665.16  5% improvement
>
> What was the breakdown across the sub-benchmarks, if I might ask?

Hey Boris,
Sorry, but I haven't kept the details; I just noted the final score. I will
keep them from next time.
-Vivek


> It sounds a little bit weird that you see such a big improvement for a
> benchmark that's supposed to exercise JS (which is very likely handled
> by a JIT inside FF), that's why I asked. My point (and worry) is that
> benchmarks are very hard to get right, and from time to time you might
> end up getting better numbers because of noise and not because of
> 'improved code quality'.

Yes Davide, I also got the same concerns from llvm-devs, that benchmarks
are very hard to get right.

Does FF ship JIT-related code within FF, or does it use some library for
that?

Also, the IPRA work in llvm is still in progress and this was my very first
experiment building a large piece of software with it. My focus was to
check that there were no compile-time or runtime failures while building
such a big piece of software, but I asked on the #firefox IRC channel about
how to measure browser performance and got these suggestions; during that
chat someone from the FF community asked me to report back the results,
which is why I mailed this to ff-dev.

> In other words, as you're presenting numbers, you should be able to
> defend those numbers with an analysis which explains why your pass
> makes the code better. Hope this makes sense.

I have noted the points you mentioned, and as the work progresses I will
try to improve the benchmarking too; this is not the final conclusion.

-Vivek

You need to confirm that the speedups are not “luck” (for instance, removing a spill may have changed the code alignment, or just having the code laid out in CGSCC order may make a difference on its own).
Here is a possible way:

  1. Find one or two benchmarks that show the most improvements.
  2. Run with and without IPRA in a profiler (Instruments or other).
  3. Disassemble the hot path and try to figure out why. To confirm your findings (i.e. if you find a set of functions / call-sites that you think are responsible for the speedup), you can try to bisect by forcing IPRA to run only on selected functions (one possible way to do that is sketched below).
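
(One hypothetical way to do that bisection: a small local patch that adds a flag so the IPRA machinery only publishes precise register-usage information for named functions, with everything else falling back to the default call-preserved mask. The flag name and helper below are made up for illustration; only the cl::opt / StringRef / STLExtras APIs are real LLVM, and the real IPRA passes may want this check in a different place.)

// Hypothetical local patch for bisecting IPRA: only trust precise
// register-usage info for functions named in -ipra-only-funcs=foo,bar.
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/CommandLine.h"

static llvm::cl::opt<std::string> IPRAOnlyFuncs(
    "ipra-only-funcs", llvm::cl::Hidden,
    llvm::cl::desc("Comma-separated list of functions IPRA may refine"));

// Call this before publishing a function's clobbered-register mask; if it
// returns false, keep using the default call-preserved mask instead.
static bool shouldApplyIPRA(llvm::StringRef FnName) {
  if (IPRAOnlyFuncs.empty())
    return true; // no filter given: behave as today
  llvm::SmallVector<llvm::StringRef, 8> Names;
  llvm::StringRef(IPRAOnlyFuncs).split(Names, ',');
  return llvm::is_contained(Names, FnName);
}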

Yes having more stats seems a nice thing to have.


Hello,

I did a quick and dirty hack for calculating the number of registers
preserved due to IPRA:

// Quick hack: count how many registers the IPRA-computed RegMask preserves
// over and above the target's default call-preserved mask (a set bit in a
// regmask means the register is preserved across the call).
std::vector<uint32_t> RegMaskCopy = RegMask;
const uint32_t *CallPreservedMask =
    TRI->getCallPreservedMask(MF, MF.getFunction()->getCallingConv());

unsigned Count = 0;
for (unsigned PReg = 1, PRegE = TRI->getNumRegs(); PReg < PRegE; ++PReg) {
  bool PreservedByDefault = CallPreservedMask[PReg / 32] & (1u << (PReg % 32));
  bool PreservedByIPRA = RegMaskCopy[PReg / 32] & (1u << (PReg % 32));
  if (!PreservedByDefault && PreservedByIPRA) {
    // Clear this register and its aliases in the copy so that aliases are
    // not counted twice.
    markRegClobbered(TRI, &RegMaskCopy[0], PReg);
    ++Count;
  }
}
outs() << "Improvement Due to IPRA : " << Count << "\n";

Does this seem good?
-Vivek

Hi,

the best way to move forward, and I guess the answer to your question, is to get your work to build and test on try.

And now I’m going to flood you with jargon, sorry. Instead of trying to document that all here, I suggest you ask people for help on irc. Notably, #taskcluster, and #build. Folks there are your target audience for compiler changes, too.

First read is ReleaseEngineering/TryServer - MozillaWiki, just to get the first piece of jargon out.

I suspect that the next steps are:

I speculate that you can use your version of llvm if you can create a docker image that is able to compile Firefox. Please confirm that with #taskcluster, and get some help there on how to do that.

Then you’d want to make sure that you can actually compile/run mozilla-central. In particular, wrt taskcluster, there have been various file moves and changes, so working off of a release is going to make your path rockier. I understand that mozilla-central is more of a moving target, but that shouldn’t be that much of a problem for you in practice, I hope.

I suspect that just focusing on linux x64 is good for you for now?

So, jargon-thought-train:

  1. Validate my assumptions about all of this :wink:.
  2. Make your setup work with mozilla-central.
  3. Work in docker.
  4. Make your mozilla-central’s taskcluster tasks for linux x64 pick your docker image to build on try.
  5. Push to try for linux x64, with all tests and perf tests.
  6. Use treeherder and perfherder (web uis) to see tests and performance.

If you want to focus on your toolchain impact over time instead of mozilla-central, I think you should be able to keep the same base version of your patch and just update the docker image, and then use perfherder to compare the results.

HTH

Axel


Hello Axel,

This seems a good plan, but I think I will need time to set all this up, so
I prefer to do it next weekend.

Thanks,
Vivek

I too agree with the general consensus that benchmarks are hard to get
right and hard to trust. Moreover, profiling at the function level is also
cumbersome.

However, what we want to know (directly or indirectly) is:

  1. How many registers did we save on the hot path?
  2. How much time did we save in spill/refill?
  3. What is causing the regression?

and I think as long as we focus on this data, we can conclude with
confidence whether the change has a real impact.


> You need to confirm that the speedups are not “luck” (for instance,
> removing a spill may have changed the code alignment, or just having the
> code laid out in CGSCC order may make a difference on its own).
> Here is a possible way:
>
> 1) Find one or two benchmarks that show the most improvements.
> 2) Run with and without IPRA in a profiler (Instruments or other).
> 3) Disassemble the hot path and try to figure out why. To confirm your
> findings (i.e. if you find a set of functions / call-sites that you think
> are responsible for the speedup), you can try to bisect by forcing IPRA to
> run only on selected functions.

I have tried some test cases which show runtime improvements, and the
topmost such cases were not benefited by IPRA itself but by the change of
codegen order to follow the call graph. I was able to verify this because
most of those test cases have very few functions, and inside those
functions there are library calls like printf, so IPRA cannot help much
there. But for one test case with around a 4% performance improvement,
test-suite/MultiSource/Benchmarks/FreeBench/pifft, I generated assembly
files for the IPRA and NO_IPRA runs, and comparing those files shows that
IPRA improves code quality mainly by avoiding a number of spills/restores.
Please check the generated assembly here:
https://gist.github.com/vivekvpandya/081baba01196c705f8b9baf420d960a1/revisions

However, these functions are not really on the hot path (according to the
Instruments app), but functions like mp_add and mp_sub still have about 9
call sites, so I think this justifies the improvements due to IPRA. I have
also compared the assembly for some large functions (~2000 lines of
assembly code) from the sqlite3 source code and observed that in such large
functions IPRA is able to save a good number of spills.
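
(As an aside, a rough way to put a number on that when diffing two .s files is to count mov instructions that touch an RSP/RBP-relative stack slot. The file names and the regex below are illustrative, and this is only a crude heuristic for x86-64 AT&T-syntax assembly, not a precise spill counter.)

// Rough spill/reload counter for comparing two x86-64 .s files.
#include <fstream>
#include <iostream>
#include <regex>
#include <string>

static unsigned countStackMovs(const std::string &Path) {
  std::ifstream In(Path);
  std::string Line;
  unsigned N = 0;
  // Matches e.g. "movq %rbx, -24(%rsp)" or "movq 16(%rbp), %rax".
  std::regex StackMov(R"(\bmov[a-z]*\s+.*\((%rsp|%rbp)\))");
  while (std::getline(In, Line))
    if (std::regex_search(Line, StackMov))
      ++N;
  return N;
}

int main() {
  std::cout << "no-ipra spill-ish movs:   " << countStackMovs("pifft_no_ipra.s") << "\n"
            << "with-ipra spill-ish movs: " << countStackMovs("pifft_ipra.s") << "\n";
}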

Sincerely,
Vivek