RFC: LNT/Test-suite support for custom metrics and test parameterization

Greetings everyone,

We would like to improve LNT.

The following RFC describes two LNT enhancements:

  • Custom (extensible) metrics
  • Test parameterization

The main idea is in document https://docs.google.com/document/d/1zWWfu_iBQhFaHo73mhqqcL6Z82thHNAoCxaY7BveSf4/edit?usp=sharing.

Thanks,

Elena.

Hi Elena,

Many thanks for working on this!

May I first suggest converting the Google document to email content? That may make it a bit easier for more people to review, and it also makes sure the content is archived on LLVM's mail servers.
I'll refrain from making detailed comments until the text is in the email, so that comments remain close to the text they refer to.

From a high-level point-of-view, a few thoughts I had on the custom metrics proposal:

* My understanding is that you suggest, to be able to add custom metrics, changing the database schema to something that more closely resembles key-value pair storage. Often, storing data as key-value pairs in a relational database can slow down queries a lot, depending on how the data typically gets queried. I think that for LNT usage, the schema you suggest may work well in practice. But I think you'll need to do query-time and web-page-load-time measurements to compare the speed before and after your suggested schema change, on a database with as much real-world data as you can lay your hands on. Ideally, for both the sqlite and the postgres database engines.
* Quite a few users of LNT only use the server and webui, and use a different system to produce the test data in the json file format that can be submitted to the LNT server. Therefore, I think it's useful if you'd also describe how the JSON file structure would change for this proposal.
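For concreteness, here is roughly what a submission in the current JSON format looks like, with a sketch of how a custom metric might ride along. This is from memory and purely illustrative; the "code_size" entry and the exact key names are my assumptions, not a proposal:

```python
import json

# Rough sketch of an LNT submission file (key names from memory, so
# treat this as illustrative, not normative). The second test entry
# shows one conceivable way a custom "code_size" metric could appear,
# encoded as a metric suffix on the test name.
report = {
    "Machine": {"Name": "my-machine", "Info": {}},
    "Run": {"Start Time": "2016-05-01 12:00:00", "Info": {"tag": "nts"}},
    "Tests": [
        {"Name": "nts.Shootout/objinst.exec", "Info": {}, "Data": [1.23]},
        # Hypothetical custom metric:
        {"Name": "nts.Shootout/objinst.code_size", "Info": {}, "Data": [40960]},
    ],
}
print(json.dumps(report, indent=2))
```

The open question for the proposal is whether custom metrics would be extra entries like the one above, or a new key in the schema.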

For the proposal to add test-suite parameters: I'm not sure I've understood the problem you're trying to solve well. Maybe an example of a more concrete use case could help demonstrate what the value is of having multiple sets of CFLAGS per test program in a single run?
It seems that you're working on a patch that adapts the Makefile structures in test-suite. Maybe it would be better to switch to using the new cmake+lit system to build and run the programs in the test-suite and fix the problem there?

Thanks,

Kristof

Hi Elena,

This is great, I would love to see extensible support for arbitrary metrics.

Hi Kristof and Daniel,

Thanks for your answers.

Unfortunately, I hadn't tried scaling up to a large data set before. Today I tried, and the results are quite bad.

So the database schema should be rebuilt. I am now thinking about creating one sample table per test-suite, rather than cloning the whole set of tables. As far as I can see, the other tables can be shared by all test-suites. In other words, if a user runs tests from a new test-suite, a new sample table would be created while importing the data from JSON, if it doesn't exist yet. Are there any problems with this solution? Maybe I am missing some details.
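To make the idea concrete, here is a small sketch in plain sqlite3. It is illustrative only: the table and column names are made up, and real code would have to validate the suite name before interpolating it into SQL:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Shared tables (Run, Test, ...) stay as they are; only the sample
# table is created per test-suite, lazily, when data for that suite
# is first imported.
def sample_table_for(con, suite, metric_columns):
    # NOTE: 'suite' must be sanitized in real code, since it is
    # interpolated into the SQL statement.
    cols = ", ".join(f"{name} {sqltype}" for name, sqltype in metric_columns)
    table = f"{suite}_sample"
    con.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        f"(run_id INTEGER, test_id INTEGER, {cols})")
    return table

t = sample_table_for(con, "nts", [("compile_time", "REAL"),
                                  ("execution_time", "REAL")])
con.execute(f"INSERT INTO {t} VALUES (1, 1, 0.5, 1.2)")
```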

Moreover, I have a question about compile tests. Are they still runnable? On http://llvm.org/perf there are no compile tests. Does that mean they are deprecated for now?

Regarding test parameters: for example, we would like to have the opportunity to compare benchmark results of a test compiled with -O3 and with -Os in the context of a single run.

Elena.

Hi Kristof and Daniel,

Thanks for your answers.

Unfortunately, I hadn't tried scaling up to a large data set before. Today I tried, and the results are quite bad.
So the database schema should be rebuilt. I am now thinking about creating one sample table per test-suite, rather than cloning the whole set of tables. As far as I can see, the other tables can be shared by all test-suites. In other words, if a user runs tests from a new test-suite, a new sample table would be created while importing the data from JSON, if it doesn't exist yet. Are there any problems with this solution? Maybe I am missing some details.

It's unfortunate to see performance doesn't scale with the proposed initial schema, but not entirely surprising. I don't really have much feedback on how the schema could be adapted otherwise as I haven't worked much on that. I hope Daniel will have more insights to share here.

Moreover, I have a question about compile tests. Are they still runnable? On http://llvm.org/perf there are no compile tests. Does that mean they are deprecated for now?

Regarding test parameters: for example, we would like to have the opportunity to compare benchmark results of a test compiled with -O3 and with -Os in the context of a single run.

The way we use LNT, we would run different configuration (e.g. -O3 vs -Os) as different "machines" in LNT's model. This is also explained in LNT's documentation, see
https://github.com/llvm-mirror/lnt/blob/master/docs/concepts.rst. Unfortunately, this version of the documentation hasn't found its way yet to http://llvm.org/docs/lnt/contents.html.
Is there a reason why storing different configurations as different "machines" in the LNT model doesn't work for you?
I assume that there are a number of places in LNT's analyses that assume that different runs coming from the same "machine" are always produced by the same configuration. But I'm not entirely sure about that.

Thanks,

Kristof

Hi Kristof,

The way we use LNT, we would run different configuration (e.g. -O3 vs -Os) as different “machines” in LNT’s model.

O2/O3 is indeed a bad example. We are also using different machines for Os/O3: such parameters apply to all tests, and we don't propose major changes there.

Elena was only extending the LNT interface a bit to ease LLVM test-suite execution with different compiler or hardware flags.

Maybe some changes are required to analyze and compare metrics between “machines”: e.g. code size/performance between Os/O2/O3.
Do you perform such comparisons?

"Test parameters" are different: they allow exploring multiple variants of the same test case. For example, they can be:

  • an index of input data sets, the length of an input vector, the size of a matrix, etc.;

  • macros that affect the source code, such as changing 1) static data allocation to dynamic, or 2) constants to variables (compile-time unknowns);

  • extra sets of internal compilation options that are relevant only for a particular test case.

The same parameter can apply to multiple tests with different value sets:

test1: param1={v1,v2,v3}
test2: param1={v2,v4}
test3: (no parameters)

Of course, the original test cases can be duplicated (copied under different names); that is enough to execute the tests.
Explicit "test parameters" additionally allow exploring dependencies between test parameters and metrics.
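As a purely illustrative sketch (Python, not actual LNT code), the parameter sets above could be expanded into distinct test names, which is what LNT's current one-name-per-test model would require:

```python
from itertools import product

# Parameter sets from the example above; test3 has no parameters.
param_sets = {
    "test1": {"param1": ["v1", "v2", "v3"]},
    "test2": {"param1": ["v2", "v4"]},
    "test3": {},
}

def expand(param_sets):
    """Turn each (test, parameter assignment) pair into its own test name."""
    names = []
    for test, params in sorted(param_sets.items()):
        if not params:
            names.append(test)  # no parameters: keep the plain name
            continue
        keys = sorted(params)
        for values in product(*(params[k] for k in keys)):
            suffix = ",".join(f"{k}={v}" for k, v in zip(keys, values))
            names.append(f"{test}({suffix})")
    return names

print(expand(param_sets))
# ['test1(param1=v1)', 'test1(param1=v2)', 'test1(param1=v3)',
#  'test2(param1=v2)', 'test2(param1=v4)', 'test3']
```

Keeping the parameters explicit (rather than baked into copied test names) is what lets us later correlate parameter values with metrics.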

Hi Kristof,

The way we use LNT, we would run different configuration (e.g. -O3 vs -Os) as different “machines” in LNT’s model.

O2/O3 is indeed a bad example. We are also using different machines for Os/O3: such parameters apply to all tests, and we don't propose major changes there.

Elena was only extending the LNT interface a bit to ease LLVM test-suite execution with different compiler or hardware flags.

Oh I see, this boils down to extending the lnt runtest interface to be able to specify a set of configurations, rather than a single configuration, and making
sure the configurations get submitted under different machine names? We kick off the different configuration runs through a script invoking lnt runtest multiple
times. I don't see a big problem with extending the lnt runtest interface to do this, assuming it doesn't break the underlying concepts assumed throughout
LNT. Maybe the only downside is that this will add even more command line options to lnt runtest, which already has a lot of (too many?) command line
options.

Maybe some changes are required to analyze and compare metrics between “machines”: e.g. code size/performance between Os/O2/O3.
Do you perform such comparisons?

We typically do these kinds of comparisons when we test our patches pre-commit, i.e. comparing for example '-O3' with '-O3 -mllvm -enable-my-new-pass'.
To stick with the LNT concepts, tests enabling new passes are stored as a different "machine".
The only way I know to be able to do a comparison between runs on 2 different "machine"s is to manually edit the URL for run vs run comparison
and fill in the runids of the 2 runs you want to compare.
For example, the following URL is a comparison of green-dragon-07-x86_64-O3-flto vs green-dragon-06-x86_64-O0-g on the public llvm.org/perf server:
http://llvm.org/perf/db_default/v4/nts/70644?compare_to=70634
I had to manually look up and fill in the run ids 70644 and 70634.
It would be great if there was a better way to be able to do these kind of comparisons - i.e. not having to manually fill in run ids, but having a webui to easily find and pick the runs you want to compare.
(As an aside: I find it intriguing that the URL above suggests that there are quite a few cases where “-O0 -g” produces faster code than “-O3 -flto”).

"Test parameters" are different: they allow exploring multiple variants of the same test case. For example, they can be:

  • an index of input data sets, the length of an input vector, the size of a matrix, etc.;

  • macros that affect the source code, such as changing 1) static data allocation to dynamic, or 2) constants to variables (compile-time unknowns);

  • extra sets of internal compilation options that are relevant only for a particular test case.

The same parameter can apply to multiple tests with different value sets:

test1: param1={v1,v2,v3}
test2: param1={v2,v4}
test3: (no parameters)

Of course, the original test cases can be duplicated (copied under different names); that is enough to execute the tests.
Explicit "test parameters" additionally allow exploring dependencies between test parameters and metrics.

Right. In the new cmake+lit way of driving the test-suite, some of these test parameters are input to cmake (like macros) and others will be input to lit (like changing inputs), I think.
We see this also in e.g. running SPEC with ref vs train vs test data sets. TBH, I'm not quite sure how to best drive this. I guess Matthias may have better ideas than me here.
I do think that to comply with LNT’s current conceptual model, tests being run with different parameters will have to have different test names in the LNT view.

Thanks,

Kristof

Can you be more explicit about which ones? I don't see any regressions (other than compared to the baseline, or in compile time).

D’Oh! I was misinterpreting the compile time differences as execution time differences. Indeed, there is no unexpected result in there.
Sorry for the noise!

Kristof

Hi Sergey, Elena,

Firstly, thanks for this RFC. It's great to see more people actively using and modifying LNT; the test metrics support in general is rather weak currently.

Metrics

Questions from this thread that I can help address:

The LNT compile suite: we use it a lot here at Apple. It has a metric set that is customized for the analysis of compile-time regressions. Given the recent interest in compile time, I hope to be able to set up a public bot to collect data in this suite sometime soon.

On the topic of encoding configurations in or across machines: both work and are used. The LNT compile suite stores all the optimization levels in the same run, and uses part of the benchmark name to encode that: "compile.benchmark.name (opt and flags).metric". This kind of flexibility is nice. The tradeoff is that it is harder to compare results from different machines (the web UI makes it almost impossible, but you can do it by editing the URLs by hand; I think James made this a bit better recently?).

The ill-fated FieldChange table. This was added to LNT ages ago, and there were no consumers of the data. When I went to do the regression tracking feature, I realized there was an error in how the data was being calculated (a missing join was mixing in data from other machines). Since sometimes people disagree with me, and people blindly update their LNT instances, I decided the best thing to do was not to DROP the table, but just leave it in case a rollback was needed. That is how FieldChangeV2 came about, and that is why FieldChange still exists. Everyone can feel free to DROP the old table any time, and now that we have not been using it for a while and no one has complained, it is probably safe to remove it with a migration.

The intent of the test-suite as the primary database entity in LNT is to manage the schema of the metrics. The test-suite abstraction adds gobs of complexity to LNT in the backend. I'd be happy to drop that idea in favor of a more flexible scheme. IMO, the complexity comes from how we dynamically create the database schema from the test-suite definition. I've tried to add things to LNT in the past (JSON API, admin interface, and better migration, to name three), and most of the third-party Flask modules require the DB schema to be defined up front. That said, the test-suite system might help out here: we could implement your proposed changes as a third test-suite kind.

I am really torn about this.

When I implemented the regression tracking stuff recently, it really showed me how badly we are scaling. On our production server, run ingestion can take well over 100s. Time is mostly spent in FieldChange generation and regression grouping; both have to access a lot of recent samples. This is not the end of the world, because it runs in a background process. Where this really sucks is when a regression has a lot of indicators. The web interface renders these in a graph, and just trying to pull down 100 graphs' worth of data kills the server. I ended up limiting those to a max of 10 datasets, and even that takes 30s to load.

So I do think we need some improvements to the scalability.

LNT usage is spread between two groups. One is users who set up big servers, with Postgres and Apache/Gunicorn; for those users I think NoSQL is the way to go. However, our second (and probably more common) user is the person running a little instance on their own machine to do some local compiler benchmarking. Their setup process needs to be dead simple, and I think requiring a NoSQL database to be set up on their machine first is a non-starter. Like we do with sqlite, I think we need a transparent fallback for people who don't have a NoSQL database.

Would it be helpful to anyone if I got a dump of the llvm.org LNT Postgres database? It is a good, big dataset to test with, and I assume everyone is okay with it being public, since the LNT server already is.

Hi, Chris.

Thank you for your answer about compile tests. As I understood from looking through the code of the compile tests, they don't use a test-suite at all. Am I right? There is a lack of information and examples on running compile tests in the LNT documentation.

We understand that there are two groups of users: those who set up big servers and collect a lot of data, and SQLite users; but the latter, I think, wouldn't have millions of sample records.

I think it's obvious that there is no universal solution that offers both a simple installation process and a flexible, high-load system.

I will update the proposal and take into consideration your suggestion about a third test-suite kind.

Thanks

Elena.

Hi everyone.

Thanks to everyone who took part in the discussion of this proposal.

After the discussion, we understood how other users use LNT and how large the datasets may be.

So here is the new, updated proposal.

(A Google Docs version with some images is available: "LNT RFC: LNT/Test-suite support for custom metrics".)

The goal is the same.

Enable LNT support for custom metrics, such as user-defined run-time and static metrics (power, etc.) and LLVM pass statistic counters. Provide integration with the LLVM test-suite to automatically collect LLVM statistic counters or custom metrics.

Analysis of current Database

Limitations

  1. This structure isn't flexible.

There is no way to run any test-suite other than the simple one.

  2. Performance is quite bad when the database has a lot of records.

For example, rendering a graph is too slow: for green-dragon-07-x86_64-O3-flto:42, SingleSource/Benchmarks/Shootout/objinst compile_time, rendering takes 191.8 seconds.

  3. It's difficult to add new features that need queries against the sample table in the database (if we use a BLOB field for custom metrics).

Queries will be needed for more complex analysis. For example, if we would like to add an additional check for tests whose compile time is too long, we need the result of a query where this metric is greater than some constant.

Or we would like to compare tests with different run options, so we need to fetch only some tests, not all of them.

A BLOB field would help preserve the current structure and make the system a bit more flexible, but in the near future it would not be enough.

Fetching all metrics of all tests would be slow on large datasets, and this approach is far from optimal.

So we would rather not use a BLOB field, since it wouldn't help us add new features and have a flexible system in the future.
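A toy sqlite3 experiment (not LNT's real schema; the table names are made up) illustrates the difference between filtering on a real column inside the database and decoding a JSON BLOB per row in the application:

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sample_blob (id INTEGER, data TEXT)")
con.execute("CREATE TABLE sample_col (id INTEGER, compile_time REAL)")
for i in range(1000):
    ct = i / 1000.0
    con.execute("INSERT INTO sample_blob VALUES (?, ?)",
                (i, json.dumps({"compile_time": ct})))
    con.execute("INSERT INTO sample_col VALUES (?, ?)", (i, ct))

# BLOB variant: every row must be fetched and decoded in the application.
slow = [i for (i, data) in con.execute("SELECT id, data FROM sample_blob")
        if json.loads(data)["compile_time"] > 0.9]

# Column variant: the filter runs inside the database and can use an index.
fast = [i for (i,) in con.execute(
    "SELECT id FROM sample_col WHERE compile_time > 0.9")]

assert slow == fast  # same answer; very different scaling behavior
```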

Proposal

We suggest adding a third test-suite kind to LNT (as Chris Matthews suggested). This part would be used for collecting custom metrics and running any test-suite.

We suggest using a NoSQL database for this part (for example, MongoDB, or the JSON/JSONB support in PostgreSQL, which lets PostgreSQL be used as a NoSQL database). This part would be enabled if there is a path to a NoSQL database in the config file.

This makes it possible to have a single Sample table (a collection, in NoSQL terms). If we use the schemaless feature of MongoDB, for example, then it's possible to add new fields when a new test-suite is run. There would then be one table with many fields, some of which are empty. At any moment it would be possible to change the schema of the table (document).

A small prototype was made with MongoDB and the MongoEngine ORM. This ORM was chosen because MongoAlchemy doesn't support schemaless features, and the latest MongoKit version has an error with the latest pymongo release.

I tried it on a virtual machine and got the following results on 5,000,000 records:

Current schema: 13.72 seconds
MongoDB: 1.35 seconds

The results will of course be better on a real server machine.

To use a test-suite, the user should describe its fields in a .fields file like this:

{
  "Fields": [
    {
      "TestSuiteName": "Bytecode",
      "Type": "integer",
      "BiggerIsBetter": 0,
      "Show": true
    },
    {
      "TestSuiteName": "GCC",
      "Type": "real",
      "BiggerIsBetter": 0,
      "Name": "GCC time"
    },
    {
      "TestSuiteName": "JIT",
      "Type": "real",
      "BiggerIsBetter": 0,
      "Name": "JIT Compile time",
      "Show": true
    },
    {
      "TestSuiteName": "GCC/LLC",
      "Type": "string",
      "BiggerIsBetter": 0
    }
  ]
}

One field, "Show", was added to describe whether the metric should be shown by default on the web page (as James Molloy suggested). Other metrics would be added to the page if the user chooses them in the view options.
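For example, loading such a .fields description and picking the default-shown metrics could look like this (illustrative Python, not actual LNT code; the fallback from "Name" to "TestSuiteName" is my assumption):

```python
import json

# The .fields file contents (condensed from the example above),
# embedded here so the sketch is self-contained.
spec_text = """
{
  "Fields": [
    {"TestSuiteName": "Bytecode", "Type": "integer", "BiggerIsBetter": 0, "Show": true},
    {"TestSuiteName": "GCC", "Type": "real", "BiggerIsBetter": 0, "Name": "GCC time"},
    {"TestSuiteName": "JIT", "Type": "real", "BiggerIsBetter": 0, "Name": "JIT Compile time", "Show": true},
    {"TestSuiteName": "GCC/LLC", "Type": "string", "BiggerIsBetter": 0}
  ]
}
"""

spec = json.loads(spec_text)
# Metrics shown by default are those with "Show": true; the display
# label falls back to TestSuiteName when no "Name" is given.
default_shown = [f.get("Name", f["TestSuiteName"])
                 for f in spec["Fields"] if f.get("Show")]
print(default_shown)  # ['Bytecode', 'JIT Compile time']
```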

Conclusion

This change will let users choose whether they want a flexible, powerful system or a limited version with an SQLite database.

If a user chooses the NoSQL version, their data can be copied from the old database to the new one. This allows using the new features without losing old data.

The open question is which NoSQL database would be best for LNT. We are interested in the opinions of people who know LNT's features better.

Thanks,

Elena.

Hi Elena,

Thanks for pushing forward with this. I like the idea of using a NoSQL solution.

My primary reservation is about adding the new NoSQL stuff as an extra backend. I think the current database backend and its use of SQLAlchemy is extremely complex and is in fact the most complex part of LNT. Adding something more (as opposed to replacing it) would just make this worse and make it more likely that contributors wouldn’t be able to test LNT very well (having three engines to test: SQLite, PostgreSQL and MongoDB).

I think it’d be far better all around, if we decide to go with the NoSQL solution, to just bite the bullet and force users who want to run a server to install MongoDB.

In my experience most of the teams I’ve seen using LNT have a single LNT server instance and submit results to that, rather than launching small instances to run “lnt viewcomparison”.

Cheers,

James

I think it's important that it remains simple to get a simple local instance up and running.
That will make it easier for people to give LNT a try, and also makes it easier for LNT developers to have everything they need for developing locally.
I have no experience with NoSQL database engines. Would it be possible, assuming you have the MongoDB/other packages installed for your system, to just run

$ ~/mysandbox/bin/python ~/lnt/setup.py develop
$ lnt create ~/myperfdb
$ lnt runserver ~/myperfdb

and be up and running (which is roughly what is required at the moment)?
Of course good documentation could go a long way if a few extra steps would be needed.

I do agree with James that if there are no major concerns for using a NoSQL database, it would be easiest if we only supported one database engine.
For example, I had to do quite a bit of LNT regression test mangling to make them work on both sqlite and postgres, and it remains harder than it should be to write regression tests that test the database interface.

Thanks,

Kristof

Of course, it will run as it does now, but the user will need to have MongoDB installed.

Installation on Linux with .deb package support is quite easy:
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv EA312927
echo "deb http://repo.mongodb.org/apt/debian wheezy/mongodb-org/3.2 main" | sudo tee /etc/apt/sources.list.d/mongodb-org-3.2.list
sudo apt-get update
sudo apt-get install -y mongodb-org
sudo service mongod start

After that, mongod will be running as a service on localhost:27017. Then the user should create a database:
mongo
use <db name>

And set the database name in the config file. There will be additional host and port fields for users who configure their own server; these will have default values (localhost:27017).

After that, the old steps stay the same:
~/mysandbox/bin/python ~/lnt/setup.py develop
lnt create ~/myperfdb
lnt runserver ~/myperfdb

MongoDB has detailed installation instructions for all operating systems.

An extra 6-7 commands to install MongoDB, which take about 2 minutes and only need to be executed once, shouldn't be a big problem for new users who would like to try LNT.

Thanks,

Elena.

Hi all,

First off, let me ask one question about the use case you intend to support:

Is your expectation for LLVM statistics style data that this would be present for almost all runs in a database, or that it would only be present for a small subset of runs?

Second, here is my general perspective on the LNT system architecture:

  1. It is very important to me that LNT continue to support a "turn-key" experience. In fact, I wish that experience would get easier, not harder, for example by moving all interactions with the server to a true admin interface. I would prefer not to introduce additional dependencies in that case.

I will add that I am not completely opposed to moving to an architecture which requires substantially more components (e.g., PostgreSQL, memcached, etc.) as long as it was packaged in such a way that it could still offer a nice turn-key experience. For example, if someone was willing to implement & document a Vagrant or Docker based development model, that would be ok with me, as long as there was still the option to do fully native deployments on OS X with real system administrator support.

  2. Our internal application architecture is severely lacking. I believe the standard architecture these days would be to have (load-balancer + front-end + [load-balancer] + back-end + database), and I think partly we are suffering from missing most of that infrastructure. In particular, we are missing two important components:
  • There should be a separate backend, which would allow us to implement improved caching, and a clear model for long-lived or persistent state. I filed this as: https://llvm.org/bugs/show_bug.cgi?id=27534
  • This would give us a place to manage background processing tasks, for example jobs that reprocess the raw sample data for efficient queries.
    If we had such a thing, we could consider using something like memcached for improving caching (particularly in a larger deployment).
  3. My intention was always to use JSON blobs for situations where a small % of samples want to include arbitrary blobs of extra data. I think standardizing on PostgreSQL/JSON is the right approach here, given the standard PaaS support for PostgreSQL and its existing use within our deployments. SQLAlchemy has native support for this.

  4. I don't actually think that a NoSQL database buys us much, if anything (particularly when we have PostgreSQL/JSON available). We are not in a situation where we would need something like eventual consistency around writes, which leaves us wanting raw query performance over large sets of relatively well-structured data. I believe that is a situation which is best served by SQL with properly designed tables. This is also an area where having an infrastructure that could handle background processing to load query-optimized tables & indices would be valuable.

I think it is a mistake to believe that using NoSQL and a schema-less model without also introducing substantial caching will somehow give better performance than a real SQL database, and I would be very interested to see performance data showing otherwise (e.g., comparing your MongoDB implementation to one in which all data is in a fully schematized table with proper indices).

  5. The database schema used by LNT (which is dynamically instantiated in response to test-suite configurations) is admittedly unorthodox. However, I still believe this is the right answer to support a turn-key performance-tracking solution that can be customized by the user based on the fields they wish to track (in fact, if Elena's data is present for almost all samples, then it is exactly the designed use case). I'm not yet convinced that the right answer here isn't to improve the actual implementation of this approach; for example, we could require the definition to be created when the LNT instance is set up, which might solve some of the issues Chris mentioned. Or we could expose improved tools or a UI for interacting with the schema via an admin interface, for example to promote individual fields from a JSON blob to being first class for improved performance/reporting.

  6. To echo one of James's points, I also think that ultimately the most complicated part of adding support for arbitrary metrics is not the database implementation, but managing the complexity in the reporting and graphing logic (which is already cumbersome). I am fine seeing a layered implementation approach where we first focus on just how to get the data in, but I do think it is worth spending some time looking at how to manage the visualizations of that data.
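To illustrate the JSON-blob idea concretely, here is a small sketch using sqlite's JSON1 json_extract() as a stand-in for PostgreSQL's JSON operators (the table and field names are made up, and I am assuming the bundled sqlite has the JSON1 extension, which modern builds do):

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sample (id INTEGER, extra TEXT)")
# Only some samples carry the optional JSON blob of extra metrics.
con.execute("INSERT INTO sample VALUES (1, ?)",
            (json.dumps({"power": 3.5}),))
con.execute("INSERT INTO sample VALUES (2, NULL)")

# json_extract() lets the filter run inside the database, with no
# schema change needed for the occasional extra metric; PostgreSQL's
# ->> operator plays the same role.
rows = con.execute(
    "SELECT id FROM sample WHERE json_extract(extra, '$.power') > 3"
).fetchall()
print(rows)  # [(1,)]
```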

- Daniel

Hi Daniel,

Regarding your suggestion with BLOBs: as I understood it, to support custom metrics we need to change the Sample table as follows: remove the compile_time, execution_time, etc. columns, and add one BLOB column with JSON {"compile_time": , "exec_time": }. We can only do it this way, because other test-suites don't have these fields, but have others.

For example, suppose a user has 1 million samples in the table and 4 test-suites: simple, nightly, and 2 custom ones, in equal proportion (250,000 samples each). Now I would like to get, for all runs of the simple test-suite, the tests that have compile_time > some constant, do some analysis, and check whether such a long compile_time is reasonable or is an error. I would have to fetch all 250,000 records and check the field value myself. I think that would be quite slow.

Even just for displaying data, decoding the BLOB for a large number of records will affect performance badly.

Thanks,

Elena.