ARM LNT test-suite Buildbot

Hi Folks,

Looks like our LNT ARM buildbot with the vectorizer is running and producing good results. There are only 11 failures:

FAIL: MultiSource/Applications/Burg/burg.execution_time (1 of 1104)
FAIL: MultiSource/Applications/ClamAV/clamscan.execution_time (2 of 1104)
FAIL: MultiSource/Applications/lemon/lemon.execution_time (3 of 1104)
FAIL: MultiSource/Applications/sqlite3/sqlite3.execution_time (4 of 1104)
FAIL: MultiSource/Benchmarks/McCat/12-IOtest/iotest.execution_time (5 of 1104)
FAIL: MultiSource/Benchmarks/MiBench/automotive-bitcount/automotive-bitcount.execution_time (6 of 1104)
FAIL: MultiSource/Benchmarks/MiBench/telecomm-FFT/telecomm-fft.execution_time (7 of 1104)
FAIL: MultiSource/Benchmarks/Ptrdist/anagram/anagram.execution_time (8 of 1104)
FAIL: MultiSource/Benchmarks/TSVC/Reductions-flt/Reductions-flt.execution_time (9 of 1104)
FAIL: SingleSource/Benchmarks/BenchmarkGame/puzzle.execution_time (10 of 1104)
FAIL: SingleSource/Benchmarks/Shootout-C++/except.execution_time (11 of 1104)

Plus 8 Exception Handling “expected” failures, since EHABI is not working yet:

FAIL: SingleSource/Regression/C++/EH/ctor_dtor_count.execution_time (12 of 1104)
FAIL: SingleSource/Regression/C++/EH/ctor_dtor_count-2.execution_time (13 of 1104)
FAIL: SingleSource/Regression/C++/EH/exception_spec_test.execution_time (14 of 1104)
FAIL: SingleSource/Regression/C++/EH/function_try_block.execution_time (15 of 1104)
FAIL: SingleSource/Regression/C++/EH/inlined_cleanup.execution_time (16 of 1104)
FAIL: SingleSource/Regression/C++/EH/simple_rethrow.execution_time (17 of 1104)
FAIL: SingleSource/Regression/C++/EH/simple_throw.execution_time (18 of 1104)
FAIL: SingleSource/Regression/C++/EH/throw_rethrow_test.execution_time (19 of 1104)

I’ll be investigating each of the 11 failures over the next few months alongside other projects, so I’m not sure I can get them green by the next release, 3.3. But with the check-all buildbots green and fast, at least we can make sure we won’t regress from now on.

About exception handling, we should at least have a roadmap. I believe we should tackle those after the 11 generic bugs, and do it in one big sprint. Anton, I’ll need your help here.

I also want to increase the coverage (more bots, different types of bots, etc) in the near future. I hope to make ARM a certifiable first-class target, especially during releases.

I’d like to thank Galina and David for their support in setting the bots up and testing all possibilities.

Now, on with the hard work… :wink:

cheers,
–renato

Hi Renato,

I've investigated a few of these for AArch64 recently, and some of the
results will be applicable in the 32-bit world too.

MultiSource/Benchmarks/McCat/12-IOtest/iotest.execution_time

This is because of disagreement between ABIs over whether "char" is
signed. ARM says no, x86 says yes.

MultiSource/Benchmarks/MiBench/automotive-bitcount/automotive-bitcount.execution_time

This is also failing on x86, I think
(http://lab.llvm.org:8011/builders/clang-x86_64-debian-fnt/builds/14868/steps/make.test-suite/logs/fail.LLC)

MultiSource/Benchmarks/MiBench/telecomm-FFT/telecomm-fft.execution_time
MultiSource/Benchmarks/Ptrdist/anagram/anagram.execution_time
SingleSource/Benchmarks/BenchmarkGame/puzzle.execution_time

I've also seen failures on these with x86, though they seem to be
passing on the bot I linked to. I think I tracked the last one down to
a dependence on the exact output of the C library's random number
generator, which is obviously non-portable.

Tim.

Thanks a lot, Tim! This will save me a huge amount of time! :wink:

Maybe I should start with those, then, and change the tests or the expected results to account for architecture differences.

I’ll also need your input on the AArch64 side later on. We should aim to keep both the 32- and 64-bit architectures clean with each change, if possible.

cheers,
–renato

Hi Renato,

I noticed the bot yesterday. Thanks for working on this!

Hi Folks,

Looks like our LNT ARM buildbot with the vectorizer is running and producing good results.

Do you have a base run with vectorization turned off? So we could see where we are degrading things?

When you say good results, I take it you mean successfully completing the test, not execution time of the resulting binary? Or did you do an analysis of performance, too?

Now, on with the hard work... :wink:

To me setting up build infrastructure *is* the hard work :wink:

Thanks,
Arnold

Do you have a base run with vectorization turned off? So we could see
where we are degrading things?

I wanted to, but after a few failed attempts, I couldn't get the option
to disable vectorization passed through to clang. I don't want to make
Galina reconfigure the master every time, so I've set up a master on my
own laptop and will fiddle with it there. The fastest way I can test,
for now, is to run the LNT tests manually with and without vectorization
and compare.
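
For reference, a manual comparison could look roughly like this; the exact spellings of the LNT option (`--cflags`) and the clang flags (`-fno-vectorize`, `-fno-slp-vectorize`) are assumptions that may vary between versions, and the paths are placeholders:

```shell
# Baseline: vectorizers off (flag names are assumptions; check
# `clang --help` on your installed version).
lnt runtest nt --sandbox /tmp/novec \
    --cc /usr/local/bin/clang \
    --cflags "-O3 -fno-vectorize -fno-slp-vectorize"

# Second run: vectorization on.
lnt runtest nt --sandbox /tmp/vec \
    --cc /usr/local/bin/clang \
    --cflags "-O3"

# Each run leaves a report.json in its sandbox; diff the two,
# or submit both to a local LNT instance and compare there.
```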

I'm not expecting many issues with vectorization, to be honest, but you
never know... :wink:

When you say good results, I take it you mean successfully completing
the test, not execution time of the resulting binary? Or did you do an
analysis of performance, too?

Good results because this is the first public test-suite run for ARM
and we only had 19 errors out of 1104! And 8 of them are "expected",
so it's about 1% of failures.

The non-EH problems should be either mechanical changes to tests or
simple fixes in LLVM, so I'm not expecting a lot of work to get LNT
into the same state on ARM as on x86.

I'm not checking performance yet, but the data is being collected at
http://llvm.org/perf/db_default/v4/nts/machine/10 and should give us
some idea of how to proceed with performance measurements.

For now, I'm interested in correctness, so I won't worry too much about
those numbers. (I've heard I should disable some Turbo mode to get more
predictable results, though I only saw one test running at a time, so
maybe it was off.)

Once we have an acceptable state (mostly green, except EH), I'll start
worrying about performance.

cheers,
--renato

For now, I'm interested in correctness, so I won't worry too much about
those numbers. (I've heard I should disable some Turbo mode to get more
predictable results, though I only saw one test running at a time, so
maybe it was off.)

Turbo is a CPU option, not a test-suite execution option. Turning it
off stops the system from varying the CPU clock based on load (when
enabled it can save power, but it results in varying performance,
which is bad for perf analysis).

Another thing to consider disabling is Address Space Layout
Randomization (ASLR), so that you get consistent hashing and other
behaviour run-over-run.
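
On Linux, ASLR can be turned off system-wide via sysctl or per-process with setarch; both are standard tools, though the benchmark path below is just a placeholder:

```shell
# Disable ASLR system-wide (persists until reboot).
sudo sysctl -w kernel.randomize_va_space=0

# Or disable it for a single run only:
setarch "$(uname -m)" -R ./my_benchmark   # ./my_benchmark is a placeholder
```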

Once we have an acceptable state (mostly green, except EH), I'll start
worrying about performance.

Sounds reasonable.

- David

Hi Renato,

I’m playing with A15 bots too (running Ubuntu). This is probably what you want for predictable performance:

Disable auto-resetting the CPU scaling governor to ondemand:
sudo update-rc.d -f ondemand remove

Then add this to /etc/rc.local:

# Disable power management.
for cpu in $(find /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor); do
    echo performance > $cpu
done

Done, Thanks!

--renato

Good idea, done, thanks!

--renato