Floating point variance in the test suite

Hi everyone,

I’d like to restart an old discussion about how to handle floating point variance in the LLVM test suite. This was brought up years ago by Sebastian Pop (https://lists.llvm.org/pipermail/llvm-dev/2016-October/105730.html) and Sebastian did some work to fix things at that time, but I’ve discovered recently that a lot of things aren’t working the way they are supposed to, and some fundamental problems were never addressed.

The basic issue is that at least some tests fail if the floating point calculations don’t exactly match the reference results. In 2016, Sebastian attempted to modify the Polybench tests to allow some tolerance, but that attempt was not entirely successful. Other tests don’t seem to have a way to handle this. I don’t know how many.

Melanie Blower has been trying for some time now to commit a change that would make fp-contract=on the default setting for clang (as the documentation says it is), but this change caused failures in the test suite. Melanie recently committed a change (https://reviews.llvm.org/rT24550c3385e8e3703ed364e1ce20b06de97bbeee) which overrides the default and sets fp-contract=off for the failing tests, but this is not a good long-term solution, as it leaves fp contraction untested (though apparently it has been untested for quite some time).

The test suite attempts to handle floating point tests in several different ways (depending on the test configuration):

  1. Tests write floating point results to a text file and the fpcmp utility is used to compare them against reference output. This method allows an absolute and a relative tolerance to be specified, I think (see the sketch after this list).

  2. Tests write floating point results to a text file which is then hashed and compared against a reference hash value.

  3. (Sebastian’s 2016 change) Tests are run with two kernels, one which is compiled with FMA explicitly disabled and one with the test suite configured options. The results of these kernels are compared with a specified tolerance, then the FMA-disabled results are written to a text file, hashed and compared to a reference output.
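
For reference, here’s a minimal sketch of the kind of check method 1 implies. This is an illustration only, not the actual fpcmp code; the function and parameter names are made up, and only the -a/-r idea mirrors fpcmp’s options:

  #include <math.h>
  #include <stdbool.h>

  /* Sketch of a combined absolute/relative tolerance check, in the
     spirit of fpcmp's -a (absolute) and -r (relative) options. */
  static bool values_match(double computed, double reference,
                           double abs_tol, double rel_tol) {
    double diff = fabs(computed - reference);
    if (diff <= abs_tol)
      return true;                              /* small absolute difference */
    return diff <= rel_tol * fabs(reference);   /* or small relative difference */
  }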

I’ve discovered a few problems with this.

First, many of the tests are producing hashed results but using fpcmp to compare the hash values. I created https://bugs.llvm.org/show_bug.cgi?id=50818 to track this problem. I don’t know the configuration well enough to fix it, but it seems like it should be a simple problem. If the hash values match, it works (for the wrong reasons). It doesn’t allow any FP tolerance, but that seems to be expected (hashing and FP tolerance cannot be combined in the configuration files). Personally, I don’t like the use of hashing with floating point results, so I’d like to get rid of method 2.

Second, when the third method is used FMA is being disabled using the STDC FP_CONTRACT pragma. Currently, LLVM does not respect this pragma when fp-contract=fast is used. This seems to be accepted behavior, but in my opinion it’s obviously wrong. A new setting was added recently for fp-contract=fast-honor-pragmas. I would very much like to see this work the other way – by default fp-contract=fast should honor the pragmas, and if someone needs a setting that doesn’t that can be added. In any event, the relevant information here is that Sebastian’s FMA disabling solution doesn’t work for fp-contract=fast. Both kernels are compiled with FMA enabled, so their results match, but the test fails the hash comparison because the “FMA disabled” kernel really had FMA enabled.
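
To make the intended shape of method 3 concrete, here is a rough sketch; this is not the actual Polybench harness, and the kernel, names, and sizes are placeholders. The point is that the pragma around the strict kernel is exactly what fp-contract=fast currently ignores:

  #include <math.h>

  #define N 1024
  #define TOLERANCE 1e-5   /* placeholder, analogous to FP_ABSTOLERANCE */

  /* Reference kernel: contraction explicitly disabled with the standard
     pragma.  Under fp-contract=fast the pragma is currently ignored, so
     this kernel silently uses FMA as well. */
  static void kernel_strict(const double *a, const double *b, double *out) {
  #pragma STDC FP_CONTRACT OFF
    for (int i = 0; i < N; ++i)
      out[i] = a[i] * b[i] + a[i];
  }

  /* Test kernel: built with whatever fp-contract setting the suite uses. */
  static void kernel_default(const double *a, const double *b, double *out) {
    for (int i = 0; i < N; ++i)
      out[i] = a[i] * b[i] + a[i];
  }

  /* The two results are compared within a tolerance; the strict results
     are then printed, hashed and compared against the reference hash. */
  static int results_match(const double *strict, const double *test) {
    for (int i = 0; i < N; ++i)
      if (fabs(strict[i] - test[i]) > TOLERANCE)
        return 0;
    return 1;
  }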

Third, when the third method is used, it’s checking the intermediate results using “FP_ABSTOLERANCE”, and in some cases the difference introduced by FMA exceeds the tolerance currently configured. For example, in the Polybench symm test I got this output:

A[5][911] = 85644607039.746628 and B[5][911] = 85644607039.746643 differ more than FP_ABSTOLERANCE = 0.000010

The difference there looks pretty reasonable to me, but because the check uses a small absolute tolerance, the test failed. Incidentally, the test failed with a message that said “fpcmp-target: FP Comparison failed, not a numeric difference between ‘0’ and ‘b’” because the above output got hashed and compared to a reference hash value using a tool that expected both values to be floating point numbers. The LNT bots don’t even give you this much information though; as far as I can tell, they just tell you the test failed. But I digress.
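
To put numbers on how small that difference actually is (my own quick check, not test-suite code): the absolute difference is about 1.5e-5, which trips FP_ABSTOLERANCE = 1e-5, but relative to values around 8.6e10 it is roughly 2e-16, which is on the order of one ulp of a double at that magnitude:

  #include <math.h>
  #include <stdio.h>

  int main(void) {
    /* The two values from the symm failure above. */
    double a = 85644607039.746628;
    double b = 85644607039.746643;
    double abs_diff = fabs(b - a);                 /* ~1.5e-5: exceeds 1e-5  */
    double rel_diff = abs_diff / fabs(a);          /* ~1.8e-16               */
    double ulp      = nextafter(a, INFINITY) - a;  /* ~1.5e-5 at this scale  */
    printf("abs=%g rel=%g ulp=%g\n", abs_diff, rel_diff, ulp);
    return 0;
  }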

Finally, a few of the Polybench tests are intending to use the third method above but aren’t actually calling the “strict” kernel, so they fail with fp-contract=on.

So, that’s what I know. There is an additional problem that has never been addressed regarding running the test suite with fast-math enabled.

Now I suppose I should say what kind of feedback I’m looking for.

I guess the first thing I want is to find out who is interested in the results and behavior of these tests. I assume there are people who care, but the tests get touched so infrequently and are in such a bad state that I don’t know if anyone is even paying attention beyond trying to keep the bots green. I don’t want to spend a lot of time cleaning up tests that aren’t useful to anyone just because they happen to be in the test suite.

Beyond that, I’d like to get general feedback on the strategy that we should be adopting in the test suite. In the earlier discussion of this issue, the consensus seemed to be that the problems should be addressed on a case-by-case basis, doing the “right thing” for each test. This kind of implies a level of test ownership. I would like input for each test from someone who cares about that test.

Finally, I’d like to hear some general opinions on whether the tests we have now are useful and sufficient for floating point behavior.

Thanks,

Andy

Hi,

I am looking forward to improving the tests. I had to fix the
fpcmp program twice already because of how it backtracked to find the
beginning of a number. If no tolerance is specified, it should be
nearly equal to a binary comparison. fpcmp is always used to compare
program output, even with no tolerance specified:

  if(REFERENCE_OUTPUT)
    set(DIFFPROG ${FPCMP})
    if(FP_TOLERANCE)
      set(DIFFPROG "${DIFFPROG} -r ${FP_TOLERANCE}")
    endif()
    if(FP_ABSTOLERANCE)
      set(DIFFPROG "${DIFFPROG} -a ${FP_ABSTOLERANCE}")
    endif()
    llvm_test_verify(WORKDIR ${CMAKE_CURRENT_BINARY_DIR}
      ${DIFFPROG} %o ${REFERENCE_OUTPUT}
    )
  endif()

That is, the only issue should be the misleading message returned by
fpcmp. I was thinking of adding a `-b` switch to fpcmp for strict
binary comparison, but I don't think the false positives from `0` and
`0.0` being considered equal are that much of a problem.

Add me as a reviewer if you are going to fix these.

Michael

First, many of the tests are producing hashed results but using fpcmp to compare the hash values. I created https://bugs.llvm.org/show_bug.cgi?id=50818 to track this problem. I don’t know the configuration well enough to fix it, but it seems like it should be a simple problem. If the hash values match, it works (for the wrong reasons). It doesn’t allow any FP tolerance, but that seems to be expected (hashing and FP tolerance cannot be combined in the configuration files). Personally, I don’t like the use of hashing with floating point results, so I’d like to get rid of method 2.

We’ve hit at least one failure when porting because the hashing means that miniscule differences in standard library implementations are not tolerated.

We’ve hit at least one failure when porting because the hashing means that miniscule differences in standard library implementations are not tolerated.

How did you address that? I would expect there to be many small differences between standard library implementations and across different platforms.

Wherever the language standard doesn’t require correctly rounded results there is a potential for small variations. I think this problem becomes more prominent with tests that execute on GPU and similar hardware.
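
As a trivial illustration of why hashed output is so fragile here (my own example, not a test from the suite): any test that prints transcendental results with full precision can legitimately differ in the last digits between libm implementations, and a hash turns that last-ulp variation into a hard failure:

  #include <math.h>
  #include <stdio.h>

  int main(void) {
    /* Most libm functions are not required to be correctly rounded, so
       the last bits of these results can differ between library
       implementations.  Printed with 17 significant digits and then
       hashed, that difference becomes a hard test failure. */
    double x = 0.1234567890123456;
    printf("%.17g %.17g %.17g\n", sin(x), exp(x), pow(x, 3.7));
    return 0;
  }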

fpcmp is always used to compare program output, even with no tolerance specified

Are you saying fpcmp is used even when the test produces integer or non-numeric output that has to be compared?

Renato Golin made some comments in Bugzilla that I’d like to copy here because I think they further the discussion:

As you see from the snippet from SingleMultiSource.cmake, it is. I
don't see it as a big issue though; tolerances are not set and there
is only a low probability of a false negative.

Michael

Hi Andrew,

Sorry I didn’t see this before. My reply on Bugzilla didn’t take into account the contents here, so it’s probably moot.

I don’t agree that the result doesn’t matter for benchmarks. It seems that the benchmarks are some of the best tests we have for exercising optimizations like this, and if the result is wrong by a wide enough margin, that could indicate a problem. But I understand Renato’s point that the performance measurement is the primary purpose of the benchmarks, and some numeric differences should be acceptable.

Yes, that’s the point I was trying to make. You can’t run a benchmark without understanding what it does and what the results mean. Small variations can be fine in one benchmark and totally unacceptable in others. However, what we have in the test-suite are benchmarks-turned-tests and tests-turned-benchmarks, in which the precise output matters a lot less than whether it’s totally different (ex. error messages, NaNs). My comment applies only to the subset we have in the test-suite, not to benchmarks in general.

If you truly want to benchmark LLVM, you should really be running specific benchmarks in specific ways and looking very carefully at the results, not relying on the test-suite.

In the previous discussion of this issue, Sebastian Pop proposed having the program run twice – once with “precise” FP results, and once with the optimizations being tested. For the Blur test, the floating point results are only intermediate and the final (printed) results are a matrix of 8-bit integers. I’m not sure what would constitute an acceptable result for this program. For any given value, an off-by-one result seems acceptable, but if there are too many off-by-one values that would probably indicate a problem. In the Polybench tests, Sebastian modified the tests to do a comparison within the test itself. I don’t know if that’s practical for Blur or if it would be better to have two runs and use a custom comparison tool.

Given the point above about the difference between benchmarks and test-suite-benchmarks, I think having comparisons inside the program itself is probably the best way forward. I should have mentioned that in my list, as I have done that too in the test-suite.

The main problem with that, for benchmarks, is that they can add substantial runtime and change the profile of the test. But that can be easily fixed by iterating a few more times on the kernel (from the ground state).

What we want is to make sure the program doesn’t generate garbage, but garbage means different things for different tests, and having an external tool that knows what each of the tests thinks is garbage is not practical.

The way I see it, there are only three types of comparison:

  • Text comparison, for tests that must be identical on every platform.
  • Hash comparison, for those above where the output is too big.
  • FP-comparison, for those where the text and integers must be identical but the FP numbers can vary a bit.

The weird behaviour of fpcmp looking at hashes and comparing the numbers in them is a bug, IMO. As is comparing integers and allowing wiggle room.

Using fpcmp for comparing text is fine, because what it does with text and integers should be exactly the same thing as diff, and if the text has FP output, then it also can change depending on precision and it’s mostly fine if it does.

To me, the path forward is to fix the tests that break with one of the alternatives above, and make sure fpcmp doesn’t identify hex, octal, binary or integers as floating-point, and treat them all as text.

For the Blur test, a quick comparison between the two matrices inside the program (with appropriate wiggle room) would suffice.
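
Something along these lines would do it; this is a sketch only, and the size and names are placeholders rather than the actual Blur code:

  #include <stdlib.h>

  #define SIZE 1024   /* placeholder, not the real Blur dimensions */

  /* In-program check for Blur-style output: both result matrices hold
     8-bit pixel values, and a per-pixel difference of at most one is
     treated as acceptable rounding noise.  A cap on how many pixels may
     differ could be added on top of this. */
  static int outputs_close_enough(unsigned char strict[SIZE][SIZE],
                                  unsigned char relaxed[SIZE][SIZE]) {
    for (int i = 0; i < SIZE; ++i)
      for (int j = 0; j < SIZE; ++j)
        if (abs((int)strict[i][j] - (int)relaxed[i][j]) > 1)
          return 0;   /* more than off-by-one: treat it as garbage */
    return 1;
  }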

If you truly want to benchmark LLVM, you should really be running specific benchmarks in specific ways and looking very carefully at the results, not relying on the test-suite.

This gets at my questions about which benchmarks are important and who considers them to be important. I expect a lot of us have non-public testing going on for the benchmarks that we consider to be critical. I see the test suite benchmarks as more of a guard rail to catch changes that degrade performance early and in a way that is convenient for other community members to address. So, to me, the benchmarks don’t have to be perfect measures. On the other hand, if we just disable things like fast-math and FMA, the benchmarks won’t tell us anything at all about the impact of changes touching those optimizations.

What we want is to make sure the program doesn’t generate garbage, but garbage means different things for different tests, and having an external tool that knows what each of the tests thinks is garbage is not practical.

Yes, I agree. Your example in Bugzilla of NEON versus VFP instructions brings up another issue. If I run a test with value-changing optimizations enabled, small variations are acceptable, but if I run the same test with “precise” floating point options, I shouldn’t see any differences from the expected results (depending, of course, on library implementations).

So, I think we need a way for each test to indicate whether it can be run in value-unsafe modes, to set different tolerances for different modes, and to be built to run differently in different modes. For example, if I’m running the Blur test in a value safe mode, there’s no need to perform an internal comparison and a hashed output comparison can be used. If I’m running with fp-contract=on or fast-math, I’d want an internal value check but those modes might have different tolerances. Finally, I might want a way to run the test as a benchmark with either fp-contract=on or fast-math without any check of the results in order to get better performance data.
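
Roughly what I have in mind, sketched with hypothetical macro names (nothing like this exists in the test suite today; the build system would have to define one of these per run, and the tolerance values are placeholders):

  /* Hypothetical per-mode switches, set by the build system per run. */
  #if defined(TEST_FP_MODE_PRECISE)
    /* Value-safe run: output is deterministic, hashed comparison is fine. */
    #define CHECK_INTERNALLY 0
  #elif defined(TEST_FP_MODE_CONTRACT)
    /* fp-contract=on: allow small FMA-induced differences internally. */
    #define CHECK_INTERNALLY 1
    #define REL_TOLERANCE 1e-14   /* placeholder value */
  #elif defined(TEST_FP_MODE_FAST)
    /* fast-math: loosen the internal check further. */
    #define CHECK_INTERNALLY 1
    #define REL_TOLERANCE 1e-6    /* placeholder value */
  #else
    /* Pure benchmark run: skip result checking to get cleaner timings. */
    #define CHECK_INTERNALLY 0
  #endif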

As for updating the tests, I’m going to bring up test ownership again because I don’t know what constitutes acceptable variation for any given test. I could take a guess at it, but if I get it wrong, my wrong guess becomes semi-enshrined in the test suite and may not be noticed by people who would know better.

For the blur example, the FMA is happening on this line:

  sum_in_current_frame += (inputImage[i + k][j + l] *
                           gaussianFilter[k + offset][l + offset]);

That’s an accumulated result inside four nested loops. It looks like in practice the differently rounded results with FMA must be getting averaged out most of the time, which makes sense assuming a relatively consistent magnitude of values, but I’d have to study the algorithm to understand exactly what’s happening and how to check the results reliably for a range of inputs. I think that’s too much to expect from someone who is just making some optimization change that triggers a failure in the test.
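
To see the per-operation effect in isolation (a standalone illustration, not the Blur code): the separately rounded multiply-add and the fused version of a single expression can already differ, and it’s those per-iteration differences that accumulate in the loop above:

  #include <math.h>
  #include <stdio.h>

  int main(void) {
  #pragma STDC FP_CONTRACT OFF   /* keep the compiler from fusing the first form */
    /* Arbitrary inputs; the two forms often differ slightly. */
    double a = 1.0 / 3.0, b = 3.0, c = -1.0;
    double separate = a * b + c;    /* rounds after the multiply, then after the add */
    double fused    = fma(a, b, c); /* rounds only once, at the end */
    printf("separate = %.17g\nfused    = %.17g\n", separate, fused);
    return 0;
  }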

In the case that led me to start the discussion this week, Melanie was just making the behavior of clang match its documentation. She didn’t even change any optimizations. The failures that were exposed would always have happened if certain compilation options were used. Naturally, she just wanted to not turn any buildbots red. Then I started looking at the failing tests and ended up opening this can of worms.

-Andy

We’ve hit at least one failure when porting because the hashing means that miniscule differences in standard library implementations are not tolerated.

How did you address that?

We’re still in the process of trying to address it. First by turning off the hashing. I am not sure the “implement portable replacement for standard library function” strategy is appropriate in the case we’re dealing with. We’re working to get better traces.

It currently looks like small differences keep growing, so this may be a “reduce number of passes” situation.

This gets at my questions about which benchmarks are important and who considers them to be important. I expect a lot of us have non-public testing going on for the benchmarks that we consider to be critical. I see the test suite benchmarks as more of a guard rail to catch changes that degrade performance early and in a way that is convenient for other community members to address. So, to me, the benchmarks don’t have to be perfect measures. On the other hand, if we just disable things like fast-math and FMA, the benchmarks won’t tell us anything at all about the impact of changes touching those optimizations.

Right, that’s why we ended up with the complicated fp-contract situation.

Moreover, the benchmarks we have in the test-suite weren’t super tuned like other commercial ones. Worse still, they have to “compare similar” on a large number of platforms, with some potentially unstable output.

To make those benchmarks stable we’d need to:

  • Understand what the program is trying to generate
  • Only output relevant information (not like a dump of a huge intermediary matrix)
  • Make sure the output is stable under varying conditions on different architectures

Only then will comparing outputs, even with fpcmp, be meaningful. Right now we’re at the stage of “let’s just hope the output is the same”, which makes discussions like this one a recurring theme.

So, I think we need a way for each test to indicate whether it can be run in value-unsafe modes, to set different tolerances for different modes, and to be built to run differently in different modes. For example, if I’m running the Blur test in a value safe mode, there’s no need to perform an internal comparison and a hashed output comparison can be used. If I’m running with fp-contract=on or fast-math, I’d want an internal value check but those modes might have different tolerances. Finally, I might want a way to run the test as a benchmark with either fp-contract=on or fast-math without any check of the results in order to get better performance data.

Yes, I think we may need different comparisons for different runs. For example, on a test run, the FP delta must be really small, but on a benchmark run, it can be larger or even ignore some numbers we know don’t change the overall result.

I believe this has to be in each benchmark’s code, not in the comparison tool, which has to be as dumb as possible.

As for updating the tests, I’m going to bring up test ownership again because I don’t know what constitutes acceptable variation for any given test. I could take a guess at it, but if I get it wrong, my wrong guess becomes semi-enshrined in the test suite and may not be noticed by people who would know better.

Unfortunately, the tests don’t have owners in that sense.

There are people who know more about certain tests than others, but I have added tests and benchmarks to the test-suite without really knowing a lot about them in the past, and I believe many other people have, too.

There’s no way to know who knows more about a particular test other than by asking, so I think the easiest way forward is to send an RFC to the list with the proposal for each benchmark we want to change. If no one thinks that’s a bad idea, we go with it. If someone downstream raises issues, reverting a commit that touches one single test/benchmark is easier than reverting one that touches a lot of different tests.

That’s an accumulated result inside four nested loops. It looks like in practice the differently rounded results with FMA must be getting averaged out most of the time, which makes sense assuming a relatively consistent magnitude of values, but I’d have to study the algorithm to understand exactly what’s happening and how to check the results reliably for a range of inputs. I think that’s too much to expect from someone who is just making some optimization change that triggers a failure in the test.

Absolutely agreed.

The work that Melanie, Sebastian and many others have done on improving the test-suite is an important but thankless job.

Unfortunately, no one has time to spend more than a month (at a stretch) on cleaning up the test-suite, so it gets some attention and then fades.

Many years ago I spent a good few months on it because running on Arm would yield the wildest differences in output, and my goal was to make Arm a first-class citizen in LLVM, so the test-suite had to run on Arm buildbots without noise.

As you’d expect, I only touched the tests that broke on Arm (at the time, too many!), but cleaning the test-suite wasn’t my primary goal. Since then, many people have done the same, for new targets, new optimisations, etc. But all with the test-suite as a secondary goal, which brings us here.

There were notable exceptions, for example when the Benchmark mode was added, when LNT had its interface revamped with good statistics, Sebastian’s fp-contract change, now Melanie’s work, etc. But they were few and far between.

We had a GSoC project to make the test-suite robust, but honestly, that’s not the sexiest GSoC project ever, so we’re still waiting for some kind soul to go through the painful task of understanding everything and validating the output stability.

In the case that led me to start the discussion this week, Melanie was just making the behavior of clang match its documentation. She didn’t even change any optimizations. The failures that were exposed would always have happened if certain compilation options were used. Naturally, she just wanted to not turn any buildbots red. Then I started looking at the failing tests and ended up opening this can of worms.

I sincerely apologise. :smiley:

I know we all have more important things to do (our jobs, for starters) than to fix some spaghetti monster that should have been good enough from the beginning. But the truth is, testing and benchmarking is a really hard job.

I think this is not just an important part of the project; I also think less experienced developers should all go through an experience like this at some point in their early careers.

The work is painful, but it is also interesting. Understanding the benchmarks teaches you a lot about their fields (ray tracing, physics simulation, Fourier transforms, etc) as well as the numeric techniques used.

You also learn about output stability, good software engineering, integer and floating point arithmetic and all the pitfalls around them. It may not make for the best CV item, but you do become a better programmer after working on those hairy issues.

So, while we do what we can when we must, it really needs someone new, with fresh eyes, looking at it as if everything is wrong, to come up with a much better solution than what we have today.

cheers,
–renato

So, while we do what we can when we must, it really needs someone new, with fresh eyes, looking at it as if everything is wrong, to come up with a much better solution than what we have today.

Well, I’m old and my eyes aren’t so fresh, but this is my first time seriously stepping into the floating point handling in the test suite and I am looking at it as if everything were wrong. I’ll see what I can do. :slight_smile:

-Andy

Well, I’m old and my eyes aren’t so fresh, but this is my first time seriously stepping into the floating point handling in the test suite and I am looking at it as if everything were wrong. I’ll see what I can do. :slight_smile:

Awesome! Thanks! :slight_smile:

Here’s a first attempt, starting with one of the Polybench tests.

https://reviews.llvm.org/D104935

This passes for me on a Broadwell system with fp-contract=on.

-Andy