YA Vectorization Benchmark

Folks,

Has anyone tried this benchmark before?

http://www.netlib.org/benchmark/livermorec

Looks interesting, maybe should be added to test-suite?

It's also a good way to learn Fortran... :wink:

From: "Renato Golin" <rengolin@systemcall.org>
To: "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Monday, November 5, 2012 2:57:35 AM
Subject: [LLVMdev] YA Vectorization Benchmark

Folks,

Has anyone tried this benchmark before?

http://www.netlib.org/benchmark/livermorec

Looks interesting, maybe should be added to test-suite?

I agree. FWIW, there is an updated version in this archive:
http://www.roylongbottom.org.uk/classic_benchmarks.tar.gz

-Hal

Ok, I'll see what I can do... I'll try to transform it into a single file.

Seems a bit messy at the moment with the 32-bit/64-bit includes/objects.

Renato,

Thanks for the link. At the moment we are unable to vectorize any of the loops in this benchmark. I found two main problems:

1. We do not allow reductions on floating-point types. We should allow them when unsafe-math is used.
2. All of the arrays are located in a struct. At the moment we don't detect that these arrays are disjoint, and this prevents vectorization.

We should be able to vectorize many loops in this benchmark with relatively minor changes.
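
For illustration, a minimal sketch of the two patterns described above
(the struct, field, and function names here are invented and do not
match the actual benchmark source):

    #include <stddef.h>

    /* Blocker 2: all arrays live in one global struct, so the
     * vectorizer has to prove the accesses are disjoint before it can
     * vectorize loops over them. */
    struct kernel_data {
        double x[1001];
        double y[1001];
        double z[1001];
    };

    struct kernel_data g;

    void saxpy_like(size_t n, double q)
    {
        for (size_t i = 0; i < n; i++)
            g.x[i] = q + g.y[i] * g.z[i];
    }

    /* Blocker 1: a floating-point reduction; reassociating the sum is
     * only legal under unsafe/fast math, so the loop stays scalar
     * otherwise. */
    double inner_product(size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += g.z[i] * g.x[i];
        return sum;
    }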

Thanks,
Nadav

Indeed, they look like simple changes. If no one is dying to get them
working, I suggest I try these first.

I'll first get the tests running in the test-suite, then I'll try to
vectorize them.

That would be great!

Ok, I got the benchmark to work in the test-suite, but it's not printing
details for each run (execution wouldn't work otherwise). I had to
comment out the printf lines, but nothing more than that.

I'm not sure how individual timings would have to be extracted, but
the program produces output via a text file, which can be used for
comparison. It also checks the results and reports whether they were
as expected (I'm not sure yet how that's calculated in detail).
Nevertheless, it should be good to have this test, at least to make
sure we're not breaking floating-point loops with vectorization in the
future.

Attached is a tar ball with the contents of LivermoreLoops to be
included inside test-suite/SingleSource/Benchmarks. Daniel, can I just
add this to the SVN repository, or are there other things that need to
be done as well? It might need some care to fully use the testing
infrastructure, though.

cheers,
--renato

LivermoreLoops.tar.gz (17.1 KB)

Hey Renato,

Cool, glad you got it working.

There is very primitive support for tests that generate multiple output results, but I would rather not use those facilities.

Is it possible instead to refactor the tests so that each binary corresponds to one test? For example, look at how Hal went about integrating TSVC:
http://llvm.org/viewvc/llvm-project/test-suite/trunk/MultiSource/Benchmarks/TSVC/
It isn’t particularly pretty, but it fits well with the other parts of the test suite infrastructure, and probably works out nicer in practice when tests fail (i.e., you don’t want to be staring at a broken bitcode with 24 kernels in one function).

Other things that I would like before integrating it:

  • Rip out the CPU ID stuff; this isn’t useful and adds messiness.
  • Have the test just produce output that can be compared, instead of including its own check routines.
  • Have the tests run for fixed iterations, instead of doing their own adaptive run.
  • Produce reference output files, so it works with USE_REFERENCE_OUTPUT=1.

The kernels themselves are really trivial, so it would be ideal if it were split up into one test per file, with minimal content in each test other than setup and output.
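
As a rough idea of the shape this could take, here is a hypothetical
single-kernel file with a fixed iteration count, no CPU detection, no
adaptive timing, and plain printed output for the reference comparison
(all names and constants are invented, not taken from the benchmark):

    #include <stdio.h>

    #define N      1001
    #define ITERS  100000   /* fixed count instead of an adaptive run */

    static double x[N], y[N], z[N];

    static void init(void)
    {
        for (int i = 0; i < N; i++) {
            x[i] = 0.0;
            y[i] = 0.001 * (double)(i + 1);
            z[i] = 0.002 * (double)(i + 1);
        }
    }

    int main(void)
    {
        init();

        /* A toy loop standing in for one of the kernels. */
        for (int it = 0; it < ITERS; it++)
            for (int i = 0; i < N; i++)
                x[i] = 0.0253 + y[i] * (z[i] + 1.5);

        /* Plain output the harness can diff against a reference file. */
        for (int i = 0; i < N; i += 100)
            printf("x[%d] = %.10f\n", i, x[i]);

        return 0;
    }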

- Daniel


>> Is it possible instead to refactor the tests so that each binary corresponds
>> to one test? For example, look at how Hal went about integrating TSVC:
>
> It should be possible. I'll have to understand better what the
> preamble does to make sure I'm not stripping out important stuff, but
> also what to copy to each kernel's initialization.
>
> Also, I don't know how the timing functions perform across platforms.
> I'd have to implement a decent enough timing system, platform
> independent, to factor out the initialization step.

The way we handle timing of all the other tests is just by timing the
executable. This isn't perfect, but it's what we use everywhere else, so we
should stick with keeping it outside the tests.

>> Other things that I would *like* before integrating it:
>> - Rip out the CPU ID stuff; this isn't useful and adds messiness.
>
> Absolutely, that is meaningless.

>> - Have the test just produce output that can be compared, instead of
>> including its own check routines.
>
> I can make it print the numbers in order; is that good enough for the
> comparison routines?

Yup.

> If I got it right, the tests self-validate the results, so at least
> we know they executed correctly in the end. I can make it produce "OK"
> and "FAIL" either way, with some numbers stating the timing.

Yeah, we can get rid of that; the way we check all the other tests is by
comparing to reference outputs or output from a reference binary (compiled
by the system compiler).

>> - Have the tests run for fixed iterations, instead of doing their own
>> adaptive run.
>
> Yes, that's rubbish. That was needed to compare the results based on
> CPU-specific features, but we don't need that.

>> - Produce reference output files, so it works with USE_REFERENCE_OUTPUT=1.
>
> Is this a simple diff, or do you compare the numerical results by
> value ± stdev?

We have support for value + tolerance. You can set FP_TOLERANCE=.0001 or so
in the Makefile to tune the limit. For example,
see SingleSource/Benchmarks/Misc/Makefile.
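
For illustration, the check this enables is roughly of the following
shape; this is a standalone sketch, not the test-suite's actual
comparison code:

    #include <math.h>
    #include <stdbool.h>

    /* Returns true when `actual` matches `expected` within an absolute
     * or relative tolerance (e.g. tol = 0.0001). */
    static bool within_tolerance(double expected, double actual, double tol)
    {
        double diff = fabs(expected - actual);
        return diff <= tol || diff <= tol * fabs(expected);
    }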

- Daniel


Daniel, Nadav, Hal,

So, after some painstaking and boring re-formatting, I've split the 24
kernels into 24 files (and left a horrible header file with code in
it, which I'll clean up later).

Since the timings are taken by the benchmarking harness, and we're
trying to assess the quality of the FP approximation under
vectorization, I'll try to come up with a reasonable watermark for each
test: maybe summing the values of all elements in the result vector and
printing that value, using FP_TOLERANCE to detect errors in the
vectorized code.
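
A minimal sketch of that kind of watermark, with illustrative names
rather than the benchmark's own:

    #include <stdio.h>

    /* Sum every element of the result array into one watermark value. */
    static double checksum(const double *a, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }

    /* After running a kernel that writes into `x` (length `n`), the
     * test would print the single number that the harness compares
     * against the reference within FP_TOLERANCE:
     *
     *     printf("Kernel checksum = %.10f\n", checksum(x, n));
     */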

Does this seem sensible?

cheers,
--renato

Seems fairly reasonable to me.

I don’t know what size of arrays you are dealing with; if they are reasonably small, it is probably also fine to just output each element in the result.

It’s fine to start by just setting FP_TOLERANCE to a small value and if it breaks in the future because of an actual precision change we can tweak it.

Thanks for the reformatting; it’s great to see new benchmarks getting added!

- Daniel

The arrays range from 64 to more than 1000 elements. I'll sum the
elements up, and people can later switch to better heuristics on a
case-by-case basis.

Hi Renato!

Thanks for working on this! It's really important to have more array-ish benchmarks.

That's great news!

Attached is the whole benchmark, divided into 24 kernels and running
on LNT with FP comparison and timings.

Unpack the file into SingleSource/Benchmarks and change the Makefile
to add LivermoreLoops to the tests. Run the LNT tests with
--only-test SingleSource/Benchmarks/LivermoreLoops to see it pass.

The heuristics are dumb accumulations of the array values, just to get
a value that will change considerably with small FP errors, so we can
easily test and adjust how much FP error is being incurred by
fast-math.

If no one objects, I'll commit to the test-suite.

LivermoreLoops.tar.gz (11.9 KB)

LGTM!