Dev Meeting BOF: Performance Tracking

All,
I'm curious to know if anyone is interested in tracking performance
(compile-time and/or execution-time) from a community perspective? This
is a much loftier goal than just supporting build bots. If so, I'd be
happy to propose a BOF at the upcoming Dev Meeting.

Chad

That is a great idea! :slight_smile:

Hi Chad,

I'm not sure I'll be at the US dev meeting this year, but we had a
performance BoF last year and I think we should have another, at least
to check on the progress that has been made and to plan ahead. I'm sure
Kristof, Tobias and others will be very glad to see it, too.

If memory serves me well (it doesn't), this is the list of things we
agreed to work on, and their progress:

1. Performance-specific test-suite: a group of specific benchmarks
that should be tracked with the LNT infrastructure. Hal proposed to
look at this, but other people helped implement it. Last I heard there
was some way of running it but I'm not sure how to do it. I'd love to
have this as a buildbot, though, so we can track its progress.

2. Statistical analysis of the LNT data. A lot of work has been put
into this and I believe it's a lot better. Anton, Yi and others have
been discussing and submitting many patches to make the LNT reporting
infrastructure more stable, less prone to noise and more useful all
round. It's not perfect yet, but a lot better than last year's.

Some other things happened since then that are also worth mentioning...

3. The LNT website got really unstable (Internal Server Error every other
day). This is why I stopped submitting results to it: the errors would
make my bot fail. And because I still don't have a performance test-suite
bot, I don't care much for the results yet. With the noise reduction it
would be really interesting to monitor progress, even of the full
test-suite, but right now I can't afford random failures. This seriously
needs looking into and would be a good topic for this BoF.

4. Big-endian results got in, and the infrastructure is now able to hold
"golden standard" results for both endiannesses. That's done and working (AFAIK).

5. Renovation of the tests/benchmarks. The tests and benchmarks in the
test-suite are getting really old. One good example is the ClamAV
anti-virus, which is not just old: its results are bogus and cooked, so
it's hard to tell signal from noise. Other benchmarks have such short
run times that they're almost pointless. Someone needs to go through
everything we test/benchmark and make sure it's valid and meaningful.
This is probably similar to, but more extensive than, item 1.

About non-test-suite benchmarking...

I have been running some closed-source benchmarks, but since we can't
share any data on them, getting historical relative results is almost
pointless. I don't think we as a community should worry about keeping
open scores for them. Also, since almost everyone is running them
behind closed doors and fixing the bugs with reduced test cases, I
think that's the best deal we can get.

I've also tried a few other benchmarks, like running the ImageMagick
libraries or Phoronix, and I have to say they're not really that
great at spotting regressions. ImageMagick will take a lot of work to
turn into a meaningful benchmark, and Phoronix is not really ready
to be a compiler benchmark (it only compiles once, with the system
compiler, so you have to heavily hack the scripts). If you're up to
it, maybe you could hack those into a nice package, but it won't be
easy.

I know people have done it internally, like I did, but none of these
scripts are ready to be left out in the open, since they're either
very ugly (like mine) or contain private information...

Hope that helps...

cheers,
--renato

If memory serves me well (it doesn't), this is the list of things we
agreed to work on, and their progress:

  1. Performance-specific test-suite: a group of specific benchmarks
    that should be tracked with the LNT infrastructure. Hal proposed to
    look at this, but other people helped implement it. Last I heard there
    was some way of running it but I’m not sure how to do it. I’d love to
    have this as a buildbot, though, so we can track its progress.

We have this in LNT now, which can be activated using --benchmarking-only. It's about 50% faster than a full run and massively reduces the number of false positives.
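
For reference, a benchmarking-only bot step might look roughly like this (just a sketch; the sandbox, compiler and test-suite paths are placeholders):

```python
# Hypothetical nightly bot step: drive LNT's nt runner with
# --benchmarking-only so the run focuses on the tests that are useful as
# benchmarks. All paths below are placeholders.
import subprocess

subprocess.check_call([
    "lnt", "runtest", "nt",
    "--sandbox", "/home/buildbot/lnt-sandbox",
    "--cc", "/home/buildbot/llvm-install/bin/clang",
    "--test-suite", "/home/buildbot/llvm-test-suite",
    "--benchmarking-only",
])
```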

Chris has also posted a patch to rerun the tests that the server says have changed. I haven't tried it yet, but it looks really promising.

  2. Statistical analysis of the LNT data. A lot of work has been put
    into this and I believe it’s a lot better. Anton, Yi and others have
    been discussing and submitting many patches to make the LNT reporting
    infrastructure more stable, less prone to noise and more useful all
    round. It’s not perfect yet, but a lot better than last year’s.

There’s definitely lots of room for improvement. I’m going to propose some more once we’ve solved the LNT stability issues.

  3. The LNT website got really unstable (Internal Server Error every other
    day). This is why I stopped submitting results to it: the errors would
    make my bot fail. And because I still don't have a performance test-suite
    bot, I don't care much for the results yet. With the noise reduction it
    would be really interesting to monitor progress, even of the full
    test-suite, but right now I can't afford random failures. This seriously
    needs looking into and would be a good topic for this BoF.

We are now testing PostgreSQL as the database backend on the public perf server, replacing the SQLite DB. Hopefully this will improve stability and system performance.

Also being discussed is moving the LNT server to a PaaS service, as it has higher availability and saves a lot of maintenance work. However, this will need the community to provide or fund the hosting service.

-Yi

We have this in LNT now, which can be activated using `--benchmarking-only`.
It's about 50% faster than a full run and massively reduces the number of
false positives.

Excellent! Does that mean I can create another LNT buildbot with an
extra flag and I'll get my benchmarking bot?

We are now testing PostgreSQL as the database backend on the public perf
server, replacing the SQLite DB. Hopefully this will improve stability and
system performance.

I believe a real database would make things more stable, but I don't
think the problem is in the DB; a better DB will only hide the real
problem even deeper.

But I'm also not in a position to offer help debugging the thing, so
I'll go with whatever you guys think is best.

Also being discussed is moving the LNT server to a PaaS service, as it has
higher availability and saves a lot of maintenance work. However, this will
need the community to provide or fund the hosting service.

I've been thinking about this for a while and I think the LLVM
Foundation could make this happen across the globe.

If I got it right, part of the foundation's purpose is to make sure we
have a stable, modern and effective infrastructure for building,
testing and benchmarking LLVM, as well as other critical development
tools like bugzilla, phabricator and svn/git repositories. My proposal
is to use the foundation's charitable status to get help from around
the world for redundant services.

I don't know about PostgreSQL, but MySQL has a master/slave setup
that works fantastically well across continents, so that there is only
one master (with switch-over facility in case of failure) and multiple
remote slaves. I believe that services like LNT, bugzilla and phabricator
could highly benefit from this. Other services like git could also
have back-up copies (maybe making an official GitHub repo a
redundant copy of our official repo), and for buildbots it would be easy
to have at least one slave with the same bot config on each continent, so
that if one fails, another is still up and running. The hardest one to
distribute like that is SVN.

If we could get universities in Europe and Asia, in addition to the
ones in the US that host and maintain LLVM tools, we could have a really
reliable infrastructure without raising costs that much. I'd be happy
to pursue this in England (maybe even Brazil) and I believe there are
developers that would be glad to do this across Europe and Asia.

cheers,
--renato

Yes. But you should also let performance tracking buildbots take
multiple samples.
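
If I remember right, LNT already has a knob for this; the same hypothetical bot step as above with multiple samples per test would be something like (paths are still placeholders):

```python
# Sketch: identical to the benchmarking-only step, but asking for several
# samples of each test so the server has more data points to separate
# noise from real changes. Paths are placeholders.
import subprocess

subprocess.check_call([
    "lnt", "runtest", "nt",
    "--sandbox", "/home/buildbot/lnt-sandbox",
    "--cc", "/home/buildbot/llvm-install/bin/clang",
    "--test-suite", "/home/buildbot/llvm-test-suite",
    "--benchmarking-only",
    "--multisample", "3",
])
```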

-Yi

Right! With a reduced set, that sure makes sense.

Thanks,
--renato

Hi Chad,

I'm definitely interested and would have proposed such a BOF myself if
you hadn't beaten me to it :slight_smile:

I think the BOF on the same topic last year was very productive in
identifying the most needed changes to enable tracking performance
from a community perspective. I think that by now most of the
suggestions made at that BOF have been implemented, and as the rest
of the thread shows, we'll hopefully soon have a few more performance
tracking bots that produce useful (i.e. low-noise) data.

I think it'll definitely be worthwhile to have a similar BOF this year.

Thanks,

Kristof

There is little for me to add, except that I would also be interested in such a BoF.

Cheers,
Tobias

Kristof,

Hi Chad,

I'm definitely interested and would have proposed such a BOF myself if
you hadn't beaten me to it :slight_smile:

Given you have much more context than I, I would be very happy to work
together on this BOF. :slight_smile:

I think the BOF on the same topic last year was very productive in
identifying the most needed changes to enable tracking performance
from a community perspective. I think that by now most of the
suggestions made at that BOF have been implemented, and as the rest
of the thread shows, we'll hopefully soon have a few more performance
tracking bots that produce useful (i.e. low-noise) data.

I'll grep through the dev/commits list to get up to speed.

For everyone's reference here are the notes from last year:
http://llvm.org/devmtg/2013-11/slides/BenchmarkBOFNotes.html

Kristof, feel free to comment further on these, if you feel so inclined.

I think it'll definitely be worthwhile to have a similar BOF this year.

I'll start working on some notes.

Chad

Kristof,
Unfortunately, our merge process is less than ideal. It has vastly improved
over the past few months (years, I hear), but we still have times where we
bring in days' or weeks' worth of commits en masse. To that end, I've set up
a nightly performance run against the community branch, but it's still an
overwhelming amount of work to track/report/bisect regressions. As you
guessed, this is what motivated my initial email.

The biggest problem that we were trying to solve this year was to produce
data without too much noise. I think with Renato hopefully setting up
a chromebook (Cortex-A15) soon there will finally be an ARM architecture
board producing useful data and pushing it into the central database.

I haven't got around to finishing that work (at least not the reporting to
Perf) because of the instability issues.

I think getting Perf stable is priority 0 right now in the LLVM
benchmarking field.

I agree 110%; we don't want the bots crying wolf. Otherwise, real issues
will fall on deaf ears.

I think this should be the main topic of the BoF this year: now that we
can produce useful data, what do we do with it to actually improve
LLVM?

With the benchmark LNT reporting meaningful results and warning users
of spikes, I think we have at least the base covered.

I haven't used LNT in well over a year, but I recall Daniel Dunbar and I
having many discussions on how LNT could be improved. (Forgive me if any of
my suggestions have already been addressed. I'm playing catch-up at the
moment.)

Further improvements I can think of would be to:

* Allow Perf/LNT to fix a set of "golden standards" based on past releases
* Mark the levels of those standards on every graph as coloured horizontal
lines
* Add warning systems when the current values deviate from any past
golden standard

I agree. IIRC, there's functionality to set a baseline run to compare
against.
Unfortunately, I think this is too coarse. It would be great if the golden
standard could be set on a per benchmark basis. Thus, upward trending
benchmarks can have their standard updated while other benchmarks remain
static.
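
To make that concrete, a rough sketch of the per-benchmark standard idea (all names are made up; none of this is LNT's actual data model):

```python
# Rough sketch of per-benchmark "golden standards": each benchmark keeps
# its own baseline, which only moves when that benchmark improves, so
# upward-trending benchmarks raise their own bar without touching others.
golden = {}  # benchmark name -> best (lowest) execution time seen so far


def update_golden(name, exec_time):
    best = golden.get(name)
    if best is None or exec_time < best:
        golden[name] = exec_time  # record the new best as the standard


def regressed(name, exec_time, tolerance=0.05):
    # Flag a run that is more than `tolerance` worse than that benchmark's
    # own standard.
    best = golden.get(name)
    return best is not None and exec_time > best * (1.0 + tolerance)
```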

* Allow Perf/LNT to report on differences between two distinct bots
* Create GCC buildbots with the same configurations/architectures and
compare them to LLVM's
* Mark golden standards for GCC releases, too, as a visual aid (no
warnings)

* Implement trend detection (gradual decrease of performance) and
historical comparisons (against older releases)
* Implement warning systems to the admin (not users) for such trends

Would it be useful to detect upwards trends as well? Per my comment above,
it would be great to update the golden standard so we're always moving in the
right direction.

* Improve spike detection to wait one or two more builds to make sure
the spike was an actual regression, but then email the original blame
list, not the current builds' one.

I recall Daniel and I discussing this issue. IIRC, we considered an
eager approach where the current build would rerun the benchmark to
verify the spikes. However, I like the lazy detection approach you're
suggesting. This avoids long running builds when there are real
regressions.
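
Something along these lines is what I have in mind for the lazy check (a sketch only; it does not reflect LNT's real internals):

```python
# Sketch of the "lazy" confirmation idea: when a run looks like a spike,
# don't rerun anything immediately; wait for the next couple of builds and
# only warn (using the original blame list) if the regression persists.

def confirmed_regression(samples, spike_index, confirm_builds=2, threshold=1.05):
    """samples: execution times ordered by build; spike_index: suspect build."""
    baseline = samples[spike_index - 1]
    later = samples[spike_index + 1:spike_index + 1 + confirm_builds]
    if len(later) < confirm_builds:
        return False  # not enough follow-up builds yet; keep waiting
    # Only a spike that persists across the follow-up builds counts.
    return all(value > baseline * threshold for value in later)
```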

* Implement this feature on all warnings (previous runs, golden
standards, GCC comparisons)

* Renovate the list of tests and benchmarks, extending their run times
dynamically instead of running them multiple times, getting the times
for the core functionality instead of whole-program timing, etc.

Could we create a minimal test-suite that includes only benchmarks that
are known to have little variance and run times greater than some decided
upon threshold? With that in place we could begin the performance
tracking (and hopefully adoption) sooner.
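
For example, such a subset could be picked mechanically from historical data (a sketch; the thresholds are made-up examples):

```python
# Sketch of selecting a low-noise benchmark subset: keep only benchmarks
# whose run time is long enough to measure reliably and whose relative
# variation across past runs is small. Thresholds are illustrative only.
from statistics import mean, stdev


def select_stable_benchmarks(history, min_seconds=1.0, max_cv=0.02):
    """history: dict of benchmark name -> list of past execution times."""
    selected = []
    for name, times in history.items():
        if len(times) < 2:
            continue  # not enough data to judge the variance
        avg = mean(times)
        cv = stdev(times) / avg  # coefficient of variation (noise level)
        if avg >= min_seconds and cv <= max_cv:
            selected.append(name)
    return selected
```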

I agree with Kristof that, with the world of benchmarks being what it
is, focusing on test-suite buildbots will probably give the best
return on investment for the community.

cheers,
--renato

Kristof/All,
I would be more than happy to contribute to this BOF in any way I can.

Chad

Hi Chad,

I recall Daniel and I discussing this issue. IIRC, we considered an eager
approach where the current build would rerun the benchmark to verify the
spikes. However, I like the lazy detection approach you're suggesting.
This avoids long running builds when there are real regressions.

I think the real issue behind this one is that it would change LNT from
being a passive system to an active system. Currently the LNT tests can be
run in any way one wishes, so long as a report is produced. Similarly, we
can add other benchmarks to the report, which we currently do internally to
avoid putting things like EEMBC into LNT's build system.

With an "eager" approach as you mention, LNT would have to know how to ssh
onto certain boxen, run the command and get the result back. Which would be
a ton of work to do well!

Cheers,

James

Hi Chad,

I recall Daniel and I discussing this issue. IIRC, we considered an eager
approach where the current build would rerun the benchmark to verify the
spikes. However, I like the lazy detection approach you're suggesting.
This avoids long running builds when there are real regressions.

I think the real issue behind this one is that it would change LNT from
being a passive system to an active system. Currently the LNT tests can be
run in any way one wishes, so long as a report is produced. Similarly, we
can add other benchmarks to the report, which we currently do internally
to avoid putting things like EEMBC into LNT's build system.

With an "eager" approach as you mention, LNT would have to know how to ssh
onto certain boxen, run the command and get the result back. Which would
be a ton of work to do well!!

Ah, yes. That makes a great deal of sense. Thanks, James.

I agree. IIRC, there's functionality to set a baseline run to compare against.
Unfortunately, I think this is too coarse. It would be great if the golden
standard could be set on a per benchmark basis. Thus, upward trending
benchmarks can have their standard updated while other benchmarks remain
static.

Having multiple "golden standards" showing as coloured lines would mostly
give the visual impression of the highest score, no matter which release
that was. Programmatically, it'd also allow us to enquire about the "best
golden standard" and always compare against it. I think the historical
values are important to show a graph of the progress of releases, as well
as the current revision, so you know how performance fluctuated in the
past few years as well as in the past few weeks.

Would it be useful to detect upwards trends as well? Per my comment above,
it would be great to update the golden standard so we're always moving in the
right direction.

Upwards trends are nice to know, but the "current standard" can be the
highest average of a set of N points since the last golden standard, and
for that we don't explicitly need to track upwards trends. If the last
moving average is NOT the current standard, then we must have detected a
downwards slope since then.
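
In code, the rule I have in mind is roughly this (a sketch; the window size is arbitrary, and it assumes higher scores are better, so invert the comparisons for run times):

```python
# Sketch of the "current standard" rule: the best moving average of N points
# since the last golden standard is the current standard; if the most recent
# moving average is below it, we have been sliding downwards.

def moving_averages(scores, window=5):
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def slipping(scores, window=5):
    avgs = moving_averages(scores, window)
    if not avgs:
        return False  # not enough data points yet
    current_standard = max(avgs)        # highest average since the golden standard
    return avgs[-1] < current_standard  # latest average is not the best
```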

Could we create a minimal test-suite that includes only benchmarks that
are known to have little variance and run times greater than some decided
upon threshold? With that in place we could begin the performance
tracking (and hopefully adoption) sooner.

That's done. I haven't tested it yet because of the failures in Perf.

In the beginning, we could start with the set of
golden/current/previous standards for the benchmark-specific results,
not the whole test-suite. As we progress towards more stability, we
can implement that for everything, but still allow configurations to
only warn (user/admin) about the restricted set, to avoid extra noise
on noisy targets (like ARM).

cheers,
--renato

My experience from leading BOFs at other conferences is more talk than action. So I suggest a different setup for this topic: how about having a working-group meeting with participants who can commit time to work on it? The group meets for some time (TBD, during the conference of course), discusses and brainstorms the options, and, as a first immediate outcome, proposes a road forward in a 5-10 minute report-out talk.

There might be other topics that could benefit from the working-group format as well, so we could have a separate report-out session at the conference.

Cheers
Gerolf

Mine too, but in this case I have to say it wasn't at all what
happened. It started with a 10-minute description of what we had and
why it was bad, followed by a 40-minute discussion on what to do and
how.

There were about 80 people in the room, all actively involved in
defining actions and actors. In the end we had clear goals, with clear
owners and we have implemented every single one of them to date. I
have to say, I've never seen that happen before!

Furthermore, the "working group" was essentially the 80 people in the
room anyway, and they all helped in one way or another. So, for any
other discussion, I'd agree with you. For this one, I think we should
stick to what's working. :slight_smile:

cheers,
--renato

Hi Gerolf,

I also like actionable items coming out of a BoF more than "just talk".
That's why we tried identifying actionable items at the similar BoF last
year, and why I tried summarizing them clearly, see
http://llvm.org/devmtg/2013-11/slides/BenchmarkBOFNotes.html.

Last year, it proved difficult for most attendees to commit on the spot
during the BoF to actually work on any of the actions.

I think that the summary referred to above has helped to have most of
the actions implemented over the course of the year, even though most
actions at the BoF didn't have anyone owning them.

I agree it would be great if we'd have participants who can commit time
to work on action items on the spot. If that would prove difficult again
this time around, the next best thing I think is to at least have
actionable items identified and documented like last year.

Do I understand your proposal correctly that you propose to use a BoF slot
to produce actionable items, similar to last year; and then use e.g. a
lightning talk slot later during the conference to present to a wider
audience what the action items are?

Thanks,

Kristof

Hi Kristof,

thanks for the link and background info! It looks like this topic has a lot
of traction and momentum and already resembles a working-group setting. A
joint list of action items and owners would be a wonderful outcome.

Cheers
Gerolf

Do I understand your proposal correctly that you propose to use a BoF slot
to produce actionable items, similar to last year; and then use e.g. a
lightning talk slot later during the conference to present to a wider
audience what the action items are?

Yes, pretty much so.