Proposal: change LNT’s regression detection algorithm and how it is used to reduce false positives

I agree this is a great idea. I think it needs to be fleshed out a little
though.

It would still be wise to run the regression detection algorithm, because
the test suite changes and the machines change, and the algorithm is not
perfect yet. It would be a valuable source of information though.

How would running it as part of regular testing change anything? Presumably
the only purpose it would serve is retrospectively going back and seeing
false-positives in the aggregate. But if we are already doing offline
analysis, we can run the regression detection algorithm (or any prospective
ones) offline on the raw data; it doesn't take that long.

This is not a small change to how LNT works, so I think some due diligence
is necessary. Is clang *really* that deterministic, especially over
successive revs?

Yes. Actually, Google's build system depends on this for its caching
strategy to work, and so the Google folks are usually on top of any issues
in this respect (thanks, Google folks!).

I know it is supposed to be. Does anyone have any data to show this is
going to be an effective approach? It seems like there are benchmarks in
the test-suite which use __DATE__ and __TIME__ in them. I assume that will
be a problem?

__DATE__ and __TIME__ should be easy to solve by modifying the benchmarks,
or by teaching clang to always return a fixed value for them (maybe we
already have this? IIRC Google's build system does something like this, or
maybe they do it at the OS level).

-- Sean Silva

Intel has a binary comparator tool that we have been using for several years for comparing output binaries to see if the code within them is considered identical. We use it to eliminate runs (and therefore some performance noise) from our own performance tracking tools.

We are willing to contribute the source code for this to the LLVM community if there is interest.

There are two programs involved: getdep, which displays the list of DLL/.so dependencies of the image in question, and cmpimage itself, which does the comparison ignoring the parts not contributed by the compiler. The cmpimage program is also almost completely derived from the published object format descriptions.
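
For anyone who wants to experiment before the sources are available, a very rough Python stand-in for the .so-listing half on Linux is to parse ldd output (the real getdep reads the object formats directly and also handles PE/COFF and Mach-O):

import subprocess

def shared_library_deps(binary):
    # Crude stand-in for getdep on Linux: collect the shared-object names
    # that ldd reports for a binary. Illustrative only, not the Intel tool.
    proc = subprocess.run(["ldd", binary], capture_output=True, text=True)
    deps = []
    for line in proc.stdout.splitlines():
        token = line.strip().split()[0] if line.strip() else ""
        if token.endswith(".so") or ".so." in token:
            deps.append(token)
    return deps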

Let me know if there is interest in these pieces of tooling, and if so, what you think next steps should be.

Kevin B. Smith

I should have said up-front more about what object formats are supported:

The program operates on object files, images, and archives that are in ELF, Windows PECOFF, or Apple Mach-O format. The program doesn’t care which compiler, assembler, linker, or archiver generated the file(s).
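
As a side note, telling these formats apart only requires looking at the leading magic bytes; a minimal sketch (illustrative only, not how cmpimage is structured):

import struct

MACHO_MAGICS = {0xfeedface, 0xfeedfacf, 0xcefaedfe, 0xcffaedfe, 0xcafebabe}

def object_format(path):
    # Identify ELF, PE/COFF and Mach-O files by their well-known magic numbers.
    with open(path, "rb") as f:
        head = f.read(4)
    if head == b"\x7fELF":
        return "ELF"
    if head[:2] == b"MZ":                      # DOS stub in front of PE/COFF
        return "PE/COFF"
    if len(head) == 4 and struct.unpack(">I", head)[0] in MACHO_MAGICS:
        return "Mach-O"                        # thin or fat Mach-O
    return "unknown"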

Kevin

I’d love to experiment with this approach, and any tool I don’t have to
write myself is a bonus!

For starters, you can just use a SHA1 sum, which is what I have been doing
and which appears to reduce the search space by a factor of about 1000. I'm
not sure we need anything more sophisticated than that.

Also, the hash is amenable to being stored in a database which avoids the
need to do numerous pairwise comparisons of actual binaries.
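
As an entirely illustrative sketch of that (a standalone SQLite table is assumed here, not LNT's actual schema):

import hashlib
import sqlite3

def sha1_of(path):
    # Hash the binary in chunks so large executables don't need to fit in memory.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

db = sqlite3.connect("binary_hashes.db")
db.execute("CREATE TABLE IF NOT EXISTS binary_hash "
           "(revision TEXT, benchmark TEXT, sha1 TEXT)")

def record(revision, benchmark, binary_path):
    # Store (revision, benchmark, hash) so later analyses can compare
    # hashes instead of whole binaries.
    db.execute("INSERT INTO binary_hash VALUES (?, ?, ?)",
               (revision, benchmark, sha1_of(binary_path)))
    db.commit()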

Maybe we can fall back to a more detailed pairwise comparison if the hashes
differ, but I'm not sure how much more that will buy us (a spot-check of
the commits that I detected as changing the binary suggests that they did
actually change it in a substantial way).

-- Sean Silva

Update: in that same block of 10,000 LLVM/Clang revisions, these are the numbers of distinct SHA1 hashes for the binaries of the following benchmarks (a small counting sketch follows the list):

7 MultiSource/Applications/aha/aha
2 MultiSource/Benchmarks/BitBench/drop3/drop3
10 MultiSource/Benchmarks/BitBench/five11/five11
7 MultiSource/Benchmarks/BitBench/uudecode/uudecode
3 MultiSource/Benchmarks/BitBench/uuencode/uuencode
5 MultiSource/Benchmarks/Trimaran/enc-rc4/rc4
11 SingleSource/Benchmarks/BenchmarkGame/n-body
2 SingleSource/Benchmarks/Shootout/ackermann
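
(For reference, counts like these can be produced from recorded (revision, benchmark, hash) tuples with a few lines of Python; the record layout below is an assumption.)

from collections import defaultdict

def distinct_hashes(records):
    # Count how many distinct binaries each benchmark produced across a
    # revision range, given (revision, benchmark, sha1) records.
    seen = defaultdict(set)
    for revision, benchmark, sha1 in records:
        seen[benchmark].add(sha1)
    return {bench: len(hashes) for bench, hashes in seen.items()}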

Let me know if there are any specific benchmarks you would like me to test.

-- Sean Silva

Let's try this on the whole test suite?

I started this as a drill-down on a single benchmark, so I've just written
a little bit of Python for the build logic and grown it into a little list
of hardcoded benchmarks.

Is there a way to programmatically build and access all the binaries in the
test suite?

-- Sean Silva

I’d love to see this tool contributed, even if it isn’t used for the regression detection work. I’ve got a couple of hacked-up scripts which do similar things, and having a robust tool available for this would be very useful.

Philip

I agree. I think there are a lot of exciting uses for this tool. A stage 3 build bot would be another one.

OK, there is interest from at least a couple of people. What should next steps be?

Kevin

The code for cmpimage and getdep consists of five source files, with the following sizes:

$ wc *
  5912  20353  191869  cmpimage.cpp
   290   1328   10668  elf.h
  1496   5006   41691  getdep.cpp
   233    959    7692  macho.h
   403   1831   18394  pecoff.h
  8334  29477  270314  total

Building each of them is just a simple compilation with whatever C++ compiler you happen to be using (clang, icc, cl, g++):

$(CXX) -o cmpimage -O2 cmpimage.cpp
$(CXX) -o getdep -O2 getdep.cpp

This seems like it would fit rather easily into test-suite/tools, which already exists and has a Makefile into which the commands to build these could be integrated.

This is my best guess/opinion based on a cursory look over the test-suite directory structure.

Kevin

Personally, I would prefer this live either in its own repository or in llvm/tools/. None of my use cases will likely involve the test-suite.

p.s. If this is going to end up an llvm tool, it will need to follow LLVM style.

p.p.s. We should probably start a new thread with the proposed addition since I imagine many folks are ignoring this one by now given how deep it’s gotten.

Philip

Maybe just putting it on GitHub for now is the easiest way to at least make
it generally available for review. If we later want to officially pull it in
or integrate it with our build system, we can do that.

-- Sean Silva

On the original thread topic: in r238965 I committed a much better detection algorithm, which uses a min-of-diffs approach, and in r238968 I updated the daily report to pass more data to make use of it. The change brings the false-positive rate down to roughly 15% on our internal reports. For the last two weeks I have actually been able to detect real regressions in the 1% range!

This approach only works when the set of previous samples is large enough to carry some meaningful information, so call sites of the regression detection have to be changed to pass more data. For the daily report, I changed it from comparing the last run on the current and previous days to comparing the last run on the first day against all runs on the previous day.
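
As a rough illustration of the min-of-diffs idea (a sketch only; the threshold and the way samples are passed in are assumptions, not the code committed in r238965):

def min_of_diffs_regression(previous, current, threshold=0.01):
    """previous, current: lists of execution times (lower is better)."""
    # Compare every current sample against every previous sample and take
    # the smallest relative delta. Only if even the best-case pairing shows
    # a slowdown do we report a regression, so one noisy sample can't
    # trigger a false positive by itself.
    if not previous or not current:
        return False
    min_delta = min((c - p) / p for p in previous for c in current)
    return min_delta > threshold

# e.g. min_of_diffs_regression([1.00, 1.02, 0.99], [1.05, 1.06]) -> True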

If this works out well, let's consider changing Run comparisons and Field comparisons to work in a similar way.

FWIW, the patch to record the hash of binaries from the test-suite into the
LNT database finally landed yesterday; see r249026, r249034, and r249035.

So far, LNT only records the hash data into its database; it doesn't use it
in any analysis or chart yet. If you upgrade your instance of LNT now,
hashes will start being recorded, and future uses of these hashes in LNT
analyses will be able to draw on historical data from the point at which
you started running the now top-of-trunk LNT.

One idea on how to use the data, besides the automatic noise-analysis
algorithm, is to color the background of charts based on the hash value,
so that it is immediately visible for which time periods the binary remained
the same. At least for the sparklines on the daily report page, this
shouldn't be too hard to do.
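
A minimal sketch of that idea, using matplotlib and made-up data (not how LNT actually renders its sparklines):

import itertools
import matplotlib.pyplot as plt

# (hash, exec time) pairs in revision order; values are made up.
runs = [("a1b2", 1.01), ("a1b2", 1.02), ("9f0e", 1.10), ("9f0e", 1.09)]

fig, ax = plt.subplots()
ax.plot([t for _, t in runs], marker="o")

# Shade the background per contiguous run of identical binary hashes, so
# periods where the binary did not change are visually obvious.
start = 0
for h, group in itertools.groupby(runs, key=lambda r: r[0]):
    n = len(list(group))
    ax.axvspan(start - 0.5, start + n - 0.5,
               color="#" + h.ljust(6, "0")[:6], alpha=0.15)  # color derived from hash
    start += n

plt.show()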

We also ought to upgrade the instance of LNT running at llvm.org/perf, but
I'm still a bit unsure about who knows how to do that. Tanya or Daniel,
could you take care of it?

Thanks,

Kristof