[PATCH 0/9] R600 load/store improvements

This series attempts to clean up the r600 int32 vload/vstore logic.

After this series is done, we use assembly versions of vload/vstore
for all int32 loads/stores for all address spaces on r600. The only downside
is that there's an RFC patch at the beginning of the series which is
necessary at the moment to prevent the nextafter/sign builtins from
regressing for float16 data types (see the associated piglit tests).

The RFC patch shouldn't have any side-effects, but it seems to paper
over a bug in clang or llvm somewhere. It does seem to result in slight
savings due to some simpler shufflevector operations.

With this series applied, clover for the OSS radeon driver generates
much cleaner llvm bitcode from CL kernels on my machine when
using 32-bit int/float types. In the future, I'll give a shot at
expanding the assembly optimizations to other types such as char,
short, long, double, and half.

There are odd things happening with 16x vectors in nextafter() and sign().

When I changed float16 load/store to use the assembly path in later
patches instead of the macro-vectorized version, nextafter and sign
stopped working (next 2 commits/patches).

As near as I can tell, we're getting correct results for
elements 0-7, but then elements 8+ are wrong. The final result
seems to be composed of the first 8 elements of the computed
result, and then elements 16-23, which are likely uninitialized.

I'd like to say the issue is in clang, but I have nothing to back
that up at the moment and give that this patch fixes the issue, I’m
not sure how much time I want to spend investigating right now.

Explicitly splitting all of the vectorize macros the way that this patch
does gets everything working again, but I have a feeling that we're
papering over a bug somewhere, hence the RFC subject.

I’m not sure what’s going on here, and it’s only the nextafter/sign
functions that regressed. This patch fixes the test results in piglit.

No significant change on number of instructions in bitcode
for nextafter float16 (2-3 instructions savings over ~350 lines).

Signed-off-by: Aaron Watry <awatry@gmail.com>

These were not actually enabled before... not sure why (other than
a think-o when I wrote this originally).

Noticed while inspecting bitcode while working on the next patch.

Signed-off-by: Aaron Watry <awatry@gmail.com>

float values can be loaded/stored via casting through int first
which prevents us having to write float load/store assembly paths (which
can be done if deemed desirable).

This cast is the same method we use for unsigned int types. This lets
us write one assembly path for all 32-bit value types (and eventually
i8/i16/i64 paths too).

This results in a fabs(float16) unit test kernel going from 101 lines to
8 lines of llvm bit code and decompiled shader size of 296dw and 32gprs
to 94dw and 8gprs on evergreen (CEDAR).

Signed-off-by: Aaron Watry <awatry@gmail.com>

This works now, at least on the cedar (EG) and pitcairn (SI) cards I tested.

Signed-off-by: Aaron Watry <awatry@gmail.com>

Tested on evergreen.

Signed-off-by: Aaron Watry <awatry@gmail.com>

Not used yet... but very soon.

Signed-off-by: Aaron Watry <awatry@gmail.com>

Now all address spaces have their loads treated identically and we
get cleaner code in r600 as a result.

Signed-off-by: Aaron Watry <awatry@gmail.com>

Not used until next commit.

Signed-off-by: Aaron Watry <awatry@gmail.com>

Tested on evergreen.

Signed-off-by: Aaron Watry <awatry@gmail.com>

Why include the addrspace(0)s? I'm also wondering why it's necessary to write these in IR? Does casting the pointer type in C not do the right thing?

I'm thinking that the answer is inertia. The previous int load/store
was all done that way, so I kept going that way... Never mind that I
believe I was the one who wrote the original version. I believe that
I had issues at the time doing what I've done in the attached patch
(pointer casting as you suggested).

I've abandoned patches 2-9 of this series in my local checkout and
would like to replace it with the patch I've attached.

I've run a full piglit CL test run with 0 changes in pass/fail rate,
and I'm getting sensible bitcode from the compiled CL test kernels.

Does this look better to you?

--Aaron

0001-vload-vstore-Improve-clc-implementation-to-make-asse.patch (12.9 KB)

This series attempts to clean up the r600 int32 vload/vstore logic.

After this series is done, we use assembly versions of vload/vstore
for all int32 loads/stores for all address spaces on r600. The only downside
is that there's an RFC patch at the beginning of the series which is
necessary at the moment to prevent the nextafter/sign builtins from
regressing for float16 data types (see the associated piglit tests).

The RFC patch shouldn't have any side-effects, but it seems to paper
over a bug in clang or llvm somewhere. It does seem to result in slight
savings due to some simpler shufflevector operations.

With this series applied, clover for the OSS radeon driver generates
much cleaner llvm bitcode from CL kernels on my machine when
using 32-bit int/float types. In the future, I'll give a shot at
expanding the assembly optimizations to other types such as char,
short, long, double, and half.

I tried this patches and both piglit and gegl run OK on my TURSK GPU.
you can add my tested-by.

jan

There are odd things happening with 16x vectors in nextafter() and sign().

When I changed float16 load/store to use the assembly path in later
patches instead of the macro-vectorized version, nextafter and sign
stopped working (next 2 commits/patches).

As near as I can tell, we're getting correct results for
elements 0-7, but then elements 8+ are wrong. The final result
seems to be composed of the first 8 elements of the computed
result, and then elements 16-23, which are likely uninitialized.

I'd like to say the issue is in clang, but I have nothing to back
that up at the moment and give that this patch fixes the issue, I’m
not sure how much time I want to spend investigating right now.

Explicitly splitting all of the vectorize macros the way that this patch
does gets everything working again, but I have a feeling that we're
papering over a bug somewhere, hence the RFC subject.

I’m not sure what’s going on here, and it’s only the nextafter/sign
functions that regressed. This patch fixes the test results in piglit.

No significant change on number of instructions in bitcode
for nextafter float16 (2-3 instructions savings over ~350 lines).

Were you testing this on Evergreen? The sign() test passes for me on
SI without this patch.

-Tom

There are odd things happening with 16x vectors in nextafter() and sign().

When I changed float16 load/store to use the assembly path in later
patches instead of the macro-vectorized version, nextafter and sign
stopped working (next 2 commits/patches).

As near as I can tell, we're getting correct results for
elements 0-7, but then elements 8+ are wrong. The final result
seems to be composed of the first 8 elements of the computed
result, and then elements 16-23, which are likely uninitialized.

I'd like to say the issue is in clang, but I have nothing to back
that up at the moment and give that this patch fixes the issue, I’m
not sure how much time I want to spend investigating right now.

Explicitly splitting all of the vectorize macros the way that this patch
does gets everything working again, but I have a feeling that we're
papering over a bug somewhere, hence the RFC subject.

I’m not sure what’s going on here, and it’s only the nextafter/sign
functions that regressed. This patch fixes the test results in piglit.

No significant change on number of instructions in bitcode
for nextafter float16 (2-3 instructions savings over ~350 lines).

Were you testing this on Evergreen? The sign() test passes for me on
SI without this patch.

I don't have my EG card available at the moment, but I *think* that
passed. All I've got available at the moment is my pitcairn (si). For
me on pitcairn, I get failures in
piglit/generated_tests/cl/builtin/math/builtin-float-nextafter-1.0.generated.cl
for only the float16 type (about half of them pass, half don't).

What happens if you test with [1] and [2] applied (enable vstore
optimizations and then add float to those optimizations), but without
the clcmacro refactoring patch [3]?

[1] http://www.pcc.me.uk/pipermail/libclc-dev/2014-July/000461.html
[2] http://www.pcc.me.uk/pipermail/libclc-dev/2014-July/000462.html
[3] http://www.pcc.me.uk/pipermail/libclc-dev/2014-July/000460.html

--Aaron

>> There are odd things happening with 16x vectors in nextafter() and sign().
>>
>> When I changed float16 load/store to use the assembly path in later
>> patches instead of the macro-vectorized version, nextafter and sign
>> stopped working (next 2 commits/patches).
>>
>> As near as I can tell, we're getting correct results for
>> elements 0-7, but then elements 8+ are wrong. The final result
>> seems to be composed of the first 8 elements of the computed
>> result, and then elements 16-23, which are likely uninitialized.
>>
>> I'd like to say the issue is in clang, but I have nothing to back
>> that up at the moment and give that this patch fixes the issue, I’m
>> not sure how much time I want to spend investigating right now.
>>
>> Explicitly splitting all of the vectorize macros the way that this patch
>> does gets everything working again, but I have a feeling that we're
>> papering over a bug somewhere, hence the RFC subject.
>>
>> I’m not sure what’s going on here, and it’s only the nextafter/sign
>> functions that regressed. This patch fixes the test results in piglit.
>>
>> No significant change on number of instructions in bitcode
>> for nextafter float16 (2-3 instructions savings over ~350 lines).
>>
>
> Were you testing this on Evergreen? The sign() test passes for me on
> SI without this patch.
>

I don't have my EG card available at the moment, but I *think* that
passed. All I've got available at the moment is my pitcairn (si). For
me on pitcairn, I get failures in
piglit/generated_tests/cl/builtin/math/builtin-float-nextafter-1.0.generated.cl
for only the float16 type (about half of them pass, half don't).

This test passes for me too. How current is your install of LLVM, I think there
were some shufflevector fixes/improvements recently that may have fixed this test.

-Tom

LLVM: git 6800393083a4030 / svn r213860 (From 7/24)
CLANG: git 5ce5cbd836b3 / svn r213853
libclc: git a63df067faf8a / svn r213762 with the following two patches applied:

r600: Actually use vstore assembly optimizations
r600: improve float vload/vstore path

With that setup, and mesa 0ddc28b026688d, I get failures in the
float16 nextafter tests.

I'll give a shot at updating llvm/clang when I get home and see if
that helps things out. I had noticed earlier today that Matt had
recently committed a bunch of vector load/store and alignment changes
(r214055 looks especially applicable), so maybe something in there did
the trick.

--Aaron

LLVM: git 6800393083a4030 / svn r213860 (From 7/24)
CLANG: git 5ce5cbd836b3 / svn r213853
libclc: git a63df067faf8a / svn r213762 with the following two patches applied:

r600: Actually use vstore assembly optimizations
r600: improve float vload/vstore path

With that setup, and mesa 0ddc28b026688d, I get failures in the
float16 nextafter tests.

OK, I will try this. I think I may not have had those two libclc
patches applied when I tested.

-Tom

LLVM: git 6800393083a4030 / svn r213860 (From 7/24)
CLANG: git 5ce5cbd836b3 / svn r213853
libclc: git a63df067faf8a / svn r213762 with the following two patches applied:

r600: Actually use vstore assembly optimizations
r600: improve float vload/vstore path

With that setup, and mesa 0ddc28b026688d, I get failures in the
float16 nextafter tests.

OK, I will try this. I think I may not have had those two libclc
patches applied when I tested.

That would be helpful. I'm trying to recover at the moment from
something that got upgraded and broke all of my systems which are
built against upgraded mesa/llvm versions. Until I can resolve that
issue, I'm unable to do any CL dev work on Evergreen and CL or GL on
SI.

If we can confirm that the llvm patches for alignment issues fixed the
use of those 2 patches on their own, then I'm at least happy to drop
the clcmacro.h refactoring. I just can't actually test that myself at
the moment.

--Aaron

>> LLVM: git 6800393083a4030 / svn r213860 (From 7/24)
>> CLANG: git 5ce5cbd836b3 / svn r213853
>> libclc: git a63df067faf8a / svn r213762 with the following two patches applied:
>>
>> r600: Actually use vstore assembly optimizations
>> r600: improve float vload/vstore path
>>
>> With that setup, and mesa 0ddc28b026688d, I get failures in the
>> float16 nextafter tests.
>>
>
> OK, I will try this. I think I may not have had those two libclc
> patches applied when I tested.
>

That would be helpful. I'm trying to recover at the moment from
something that got upgraded and broke all of my systems which are
built against upgraded mesa/llvm versions. Until I can resolve that
issue, I'm unable to do any CL dev work on Evergreen and CL or GL on
SI.

Could this be caused by the recent version bump of LLVM from 3.5 to 3.6?

-Tom