LLVM LNT floating point performance tests on X86 - using the llvm-test-suite benchmarks

Hello.
I have a patch to commit to the community, D74436 (Change clang option -ffp-model=precise to select ffp-contract=on), that changes command line settings for floating point. When I committed it previously, it was ultimately rolled back due to bot failures with LNT.

Looking for suggestions on how to use the llvm-test-suite benchmarks to analyze this issue so I can commit this change.

We think the key difference in the tests that regressed when I tried to commit the change was caused by differences in loop-unrolling decisions when the llvm.fmuladd intrinsic was present.
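As a reduced local check (a hypothetical sketch; mac.c is a made-up file name, not one of the failing tests), clang should emit the llvm.fmuladd intrinsic for a contractable expression under -ffp-contract=on but not under -ffp-contract=off:

cat > mac.c <<'EOF'
double mac(double a, double b, double c) { return a * b + c; }
EOF
# With contraction on, the multiply-add should become a single llvm.fmuladd call.
clang -O2 -ffp-contract=on -S -emit-llvm -o - mac.c | grep fmuladd
# With contraction off, no fmuladd should appear.
clang -O2 -ffp-contract=off -S -emit-llvm -o - mac.c | grep fmuladd || echo "no fmuladd"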

As far as I can tell, the LNT bots aren't currently running on any x86 systems, so I have no idea what settings the bots used when they were running. I'm really not sure how to proceed.

It seems to me that FMA should give better performance on systems that support it on any non-trivial benchmark.

Thanks!

You can run the LNT tests locally, and I would assume the tests to be impacted there as well (on X86).

The Polybench benchmarks, and probably some others, have hashed result files. Thus, any change
to the output is flagged, regardless of how minor. I'd run them with and without this patch and
compare the results. If they are within the expected tolerance, I'd recreate the hash files for
them and create a dependent commit for the LLVM test suite.
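Something like the following might work for recreating a hash (a rough sketch from memory; the helper name and the reference-output paths may differ in your checkout, so double-check them):

# Run the freshly built benchmark and capture its output.
cd build/SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm
./symm > symm.output 2>&1
# The Polybench tests compare a hash of the output, so hash it the same way
# the harness does (HashProgramOutput.sh rewrites the file with its hash).
/path/to/test-suite/HashProgramOutput.sh symm.output
# If the hash is the same with both compilers, it can replace the checked-in reference.
diff symm.output /path/to/test-suite/SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm.reference_output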

Does that make sense?

~ Johannes

Thank you, I have more questions.

I am using a shared Linux system (Intel(R) Xeon(R) Platinum 8260M CPU @ 2.40GHz) to build and run the llvm-test-suite. Do I need to execute the tests on a quiescent system? As a "null check", I executed and collected results from two llvm-lit runs using the same set of test executables; the differences between the two runs (which ideally would be zero, since the executables are identical) ranged from +14% to -18%.

What is the acceptable tolerance?
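To work around the noise, I also tried taking multiple samples per side; as I understand it, compare.py accepts several result files separated by "vs" and compares the best sample from each set (a sketch, assuming the build-directory setup from my commands below):

# In the build tree made with the unmodified compiler:
llvm-lit -v -j 1 -o results-default-1.json .
llvm-lit -v -j 1 -o results-default-2.json .
# In the build tree made with the patched compiler:
llvm-lit -v -j 1 -o results-contract-1.json .
llvm-lit -v -j 1 -o results-contract-2.json .
# 'vs' separates the baseline samples from the experiment samples;
# compare.py then uses the best (minimum) of each set per test.
python3 test-suite/utils/compare.py \
    results-default-1.json results-default-2.json \
    vs results-contract-1.json results-contract-2.json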

I work in clang, not backend optimization, so I am not familiar with the analysis techniques for understanding which optimization transformations occurred due to my patch. Do you have any tips about that?

Using the real test (unpatched compiler versus patched compiler), I compared the assembly for symm.test, since it's SingleSource, compiling it with the two different compilers:

test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm.test 10.88 9.93 -8.7%

Here's the only difference in the .s file; it seems unlikely that this would account for an 8% difference in time.

.LBB6_21: # in Loop: Header=BB6_18 Depth=2
        leaq (%r12,%r10), %rdx
        movsd (%rdx,%rax,8), %xmm3 # xmm3 = mem[0],zero
        mulsd %xmm1, %xmm3
        movsd (%r9), %xmm4 # xmm4 = mem[0],zero
        mulsd %xmm0, %xmm4
        mulsd (%rsi), %xmm4
        addsd %xmm3, %xmm4

With patch for contract changes:
.LBB6_21: # in Loop: Header=BB6_18 Depth=2
        movsd (%r9), %xmm3 # xmm3 = mem[0],zero
        mulsd %xmm0, %xmm3
        mulsd (%rsi), %xmm3
        leaq (%r12,%r10), %rdx
        movsd (%rdx,%rax,8), %xmm4 # xmm4 = mem[0],zero
        mulsd %xmm1, %xmm4
        addsd %xmm3, %xmm4

The difference for the flops-5 test was 25%, but the code differences are bigger. I can try -mllvm -print-after-all as a first step.

I'm nowhere near "generating new hash values", but won't the hash value be relative to the target microarchitecture? If my system is a different architecture than the bot's, the hash value I compute here wouldn't compare equal to the bot's hash.

This is how I tested. Is the build line correct for this purpose (caches/O3.cmake), or should I use different options when creating the test executables?

git clone https://github.com/llvm/llvm-test-suite.git test-suite
cmake -DCMAKE_C_COMPILER=/iusers/sandbox/llorg-ContractOn/deploy/linux_prod/bin/clang \
   -DTEST_SUITE_BENCHMARKING_ONLY=true -DTEST_SUITE_RUN_BENCHMARKS=true \
   -C/iusers/test-suite/cmake/caches/O3.cmake \
     /iusers/test-suite
make
llvm-lit -v -j 1 -o results.json .
(Repeat in a different build directory using the unmodified compiler)
python3 test-suite/utils/compare.py -f --filter-short build-llorg-default/results.json build-llorg-Contract/results.json >& my-result.txt

> Thank you, I have more questions.
>
> I am using a shared Linux system (Intel(R) Xeon(R) Platinum 8260M CPU @ 2.40GHz) to build and run the llvm-test-suite. Do I need to execute the tests on a quiescent system? As a "null check", I executed and collected results from two llvm-lit runs using the same set of test executables; the differences between the two runs (which ideally would be zero, since the executables are identical) ranged from +14% to -18%.
>
> What is the acceptable tolerance?

I'm not following what the "results" is here.

> I work in clang, not backend optimization, so I am not familiar with the analysis techniques for understanding which optimization transformations occurred due to my patch. Do you have any tips about that?

It doesn't necessarily matter. If you want to know without any other information, you could compare the output of -mllvm -print-after-all with and without your patch. I don't think it is strictly necessary if the tests are not impacted too much.
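For example (a sketch; clang-before and clang-after are placeholders for your two builds), the per-pass IR goes to stderr and the two dumps can be diffed directly:

# Capture the IR printed after every pass from both compilers and diff.
clang-before -O3 -mllvm -print-after-all -c symm.c 2> before.log
clang-after  -O3 -mllvm -print-after-all -c symm.c 2> after.log
diff -u before.log after.log | head -100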

> Using the real test (unpatched compiler versus patched compiler), I compared the assembly for symm.test, since it's SingleSource, compiling it with the two different compilers:
>
> test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm.test 10.88 9.93 -8.7%
>
> [assembly comparison snipped; see above]
>
> The difference for the flops-5 test was 25%, but the code differences are bigger. I can try -mllvm -print-after-all as a first step.

I'm not sure we need to look at this right now.

> I'm nowhere near "generating new hash values", but won't the hash value be relative to the target microarchitecture? If my system is a different architecture than the bot's, the hash value I compute here wouldn't compare equal to the bot's hash.

> This is how I tested. Is the build line correct for this purpose (caches/O3.cmake), or should I use different options when creating the test executables?
>
> git clone https://github.com/llvm/llvm-test-suite.git test-suite
> cmake -DCMAKE_C_COMPILER=/iusers/sandbox/llorg-ContractOn/deploy/linux_prod/bin/clang \
>    -DTEST_SUITE_BENCHMARKING_ONLY=true -DTEST_SUITE_RUN_BENCHMARKS=true \
>    -C/iusers/test-suite/cmake/caches/O3.cmake \
>      /iusers/test-suite
> make
> llvm-lit -v -j 1 -o results.json .
> (Repeat in a different build directory using the unmodified compiler)
> python3 test-suite/utils/compare.py -f --filter-short build-llorg-default/results.json build-llorg-Contract/results.json >& my-result.txt

I'm a little confused about what you are doing, or trying to do. I was expecting you
to run the symm executable compiled with and without your patch, then look at the
numbers that are printed at the end. So compare the program results, but not
the compile or execution time. If the results are pretty much equivalent,
we can use the results with the patch to create a new hash file. If not, we need
to investigate why. Does that make sense?
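Concretely, something like this (a sketch; the build-directory names come from your earlier commands, and I'm assuming the benchmark writes its results to stdout/stderr):

# Run the same benchmark binary from both build trees and compare what
# the programs print, ignoring timing entirely.
build-llorg-default/SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm > out.default 2>&1
build-llorg-Contract/SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm > out.contract 2>&1
diff out.default out.contract && echo "results match"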

~ Johannes

What I'm trying to do is to determine whether the patch I'm submitting is going to cause benchmarking problems that force the patch to be reverted, since that happened the last time I committed the patch (several months ago).

Since my patch did cause problems last time, I want to run the tests now and develop an explanation for any regressions. I thought that "compare.py" would tell me what I need to know. I assumed the summary line for symm was telling me that symm had improved by 8.7% (where 8.7% describes the difference between the 10.88 and 9.93 execution times).

But I looked at the lines in results.json for the two different executions (unpatched and patched), and the "hash" is the same for symm.

Can I find what I need to know from results.json? For both runs there are 736 lines in results.json showing "code" : "PASS". Does that mean it's all OK and I just need to see if the hash value is the same?
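Here is how I pulled the hashes out (a sketch; I'm assuming the lit JSON layout with a per-test "metrics" dictionary, which is what my results.json appears to contain):

python3 - build-llorg-default/results.json build-llorg-Contract/results.json <<'EOF'
import json, sys
# Print the recorded output hash for every test whose name mentions symm.
for path in sys.argv[1:]:
    data = json.load(open(path))
    for test in data["tests"]:
        if "symm" in test["name"]:
            print(path, test["name"], test.get("metrics", {}).get("hash"))
EOF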

I also put one reply below. Thanks a lot.

From: Johannes Doerfert <johannesdoerfert@gmail.com>
Sent: Wednesday, May 19, 2021 11:31 AM
To: Blower, Melanie I <melanie.blower@intel.com>; spatel+llvm@rotateright.com; llvm-dev <llvm-dev@lists.llvm.org>; florian_hahn@apple.com; hal.finkel.llvm@gmail.com
Subject: Re: LLVM LNT floating point performance tests on X86 - using the llvm-test-suite benchmarks

> Thank you, I have more questions.
>
> I am using a shared Linux system (Intel(R) Xeon(R) Platinum 8260M CPU @ 2.40GHz) to build and run the llvm-test-suite. Do I need to execute the tests on a quiescent system? As a "null check", I executed and collected results from two llvm-lit runs using the same set of test executables; the differences between the two runs (which ideally would be zero, since the executables are identical) ranged from +14% to -18%.
>
> What is the acceptable tolerance?

I'm not following what the "results" is here.

[Blower, Melanie] I mean the results.json file which is created by the llvm-lit run. One results.json file held the execution results from the unpatched compiler, and the other held the execution results from the patched compiler. (I also did the "null check", but let's ignore that.)

> What I'm trying to do is to determine whether the patch I'm submitting is going to cause benchmarking problems that force the patch to be reverted, since that happened the last time I committed the patch (several months ago).

IIUC the problem with the patch was not runtime performance, but unexpected results causing some tests/benchmarks to fail.

I think you might need to select a CPU on X86 to cause the mis-compares. Below are the commands I used to reproduce the failure of MultiSource/Applications/oggenc/oggenc. Note the -march=native. Without that, the test passes. There are a couple of other failures as well.

cmake -G Ninja \
  -DCMAKE_C_COMPILER=/path/to/bin/clang \
  -DCMAKE_C_FLAGS="-O3 -march=native" -DCMAKE_CXX_FLAGS="-O3 -march=native" \
  /path/to/llvm-test-suite

ninja MultiSource/Applications/oggenc/oggenc

llvm-lit MultiSource/Applications/oggenc/

-- Testing: 1 tests, 1 workers --
FAIL: test-suite :: MultiSource/Applications/oggenc/oggenc.test (1 of 1)

> IIUC the problem with the patch was not runtime performance, but unexpected results causing some tests/benchmarks to fail.

I believe we saw several problems. I may be merging issues with a couple of different variations of this change in my mind, but we definitely saw performance regressions on an LNT bot running on a Broadwell-based system at some point due to different loop unrolling behavior when the llvm.fmuladd intrinsic was used.

There were, as you note, test failures reported because of the change Melanie is working with. I think those were caused by the tests being intolerant of variations in floating point results (such as would be expected when FMA is used) and not being very easy to update. Any suggestions on how to handle that would also be helpful.
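To illustrate why exact-output checks are fragile here (a small standalone demo, not taken from the failing tests), fused and unfused evaluation of a*b+c can legitimately differ in the last bits, because an FMA rounds only once:

cat > fma_demo.c <<'EOF'
#include <math.h>
#include <stdio.h>
int main(void) {
  /* Values chosen so the exact product needs more bits than a double holds. */
  double a = 1.0 + 0x1p-27, b = 1.0 + 0x1p-27, c = -1.0;
  printf("separate rounding: %a\n", a * b + c);   /* round the mul, then add */
  printf("single rounding:   %a\n", fma(a, b, c)); /* fused, rounded once */
  return 0;
}
EOF
# -ffp-contract=off keeps the compiler from fusing the first expression itself.
clang -O0 -ffp-contract=off fma_demo.c -lm -o fma_demo && ./fma_demo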

Finally, there was a build failure with aarch64 when this change was made. That seems to have been caused by a problem in the aarch64 backend that was exposed by this change. The associated bug (https://bugs.llvm.org/show_bug.cgi?id=44892) has since been marked as fixed.

-Andy

Using -march=native as suggested by @Florian Hahn, I can see 20 failing tests, thanks!

I looked at the source for SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trmm, which is one of the failing tests, and noticed that there is #pragma STDC FP_CONTRACT OFF in many places, but not in the function kernel_trmm.

Perhaps this function was overlooked for that pragma because the FMA is syntactically not obvious due to the += operation? So I added the pragma and reran that test, and now that test is passing.
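The pattern looks roughly like this (a reduced, hypothetical sketch rather than the actual trmm source); with the file-scope pragma in place, clang should no longer form fmuladd for the += expression:

cat > trmm_like.c <<'EOF'
/* Reduced, hypothetical version of the kernel_trmm pattern. */
#pragma STDC FP_CONTRACT OFF
void kernel(int n, double alpha, double B[n][n], double A[n][n]) {
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      for (int k = 0; k < i; k++)
        B[i][j] += alpha * A[i][k] * B[k][j]; /* hidden mul+add via += */
}
EOF
# Should print 0 occurrences with the pragma; removing it brings fmuladd back.
clang -O2 -ffp-contract=on -S -emit-llvm -o - trmm_like.c | grep -c fmuladd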

Would it be acceptable to modify trmm with this change?

> Using -march=native as suggested by @Florian Hahn, I can see 20 failing tests, thanks!
> [...]
> Would it be acceptable to modify trmm with this change?

Yes.

Thanks. I created D102861 (Suppress FP_CONTRACT due to planned command line changes), which modifies the test-suite to suppress FP_CONTRACT in certain source files. This test-suite patch allows testing to pass with the patch for D74436 (Change clang option -ffp-model=precise to select ffp-contract=on) in place.