[PATCH 1/2] R600: Actually use vstore assembly optimizations

These were not actually enabled before... not sure why (other than
a think-o when I wrote this originally).

Noticed while inspecting bitcode while working on the next patch.

Signed-off-by: Aaron Watry <awatry@gmail.com>

Float values can be loaded/stored by casting through int first, which
saves us from having to write float load/store assembly paths (those
can still be added if deemed desirable).

This cast is the same method we use for unsigned int types. This lets
us write one assembly path for all 32-bit value types (and eventually
i8/i16/i64 paths too).
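
The idea above can be sketched in plain C. This is only an illustration
under stated assumptions, not the libclc code: store_i32/load_i32 stand
in for the single hand-written 32-bit path, and every name here is
hypothetical. Float and unsigned int values are reinterpreted bit-for-bit
(no numeric conversion) and routed through that one path.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the one optimized 32-bit store path. */
static void store_i32(int32_t value, size_t offset, int32_t *words) {
    words[offset] = value;
}

/* Hypothetical stand-in for the matching 32-bit load path. */
static int32_t load_i32(size_t offset, const int32_t *words) {
    return words[offset];
}

/* float: copy the raw bits into an int32_t, then reuse the int path. */
static void store_f32(float value, size_t offset, int32_t *words) {
    int32_t bits;
    memcpy(&bits, &value, sizeof bits);
    store_i32(bits, offset, words);
}

static float load_f32(size_t offset, const int32_t *words) {
    int32_t bits = load_i32(offset, words);
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* unsigned int: same trick, same underlying path. */
static void store_u32(uint32_t value, size_t offset, int32_t *words) {
    int32_t bits;
    memcpy(&bits, &value, sizeof bits);
    store_i32(bits, offset, words);
}

static uint32_t load_u32(size_t offset, const int32_t *words) {
    int32_t bits = load_i32(offset, words);
    uint32_t value;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```

memcpy is used for the reinterpretation so the sketch stays within
standard C aliasing rules; the point is only that the bits pass through
unchanged, so one 32-bit path can serve all three types.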

This results in a fabs(float16) unit test kernel going from 101 lines to
8 lines of LLVM bitcode, and the decompiled shader size going from
296 dw and 32 GPRs to 94 dw and 8 GPRs on Evergreen (CEDAR).

Signed-off-by: Aaron Watry <awatry@gmail.com>

What’s wrong with 3 x vectors? They don’t work very well and currently
get split into multiple loads, but they should work correctly for now.

That's good to hear. When I originally wrote the vload/vstore code,
all I ever got was instruction selection errors for 3-element vectors,
so I left it out of the original version (over a year ago, I think).

For now, I had noticed that there was a word selection error (I had
copy/pasted vload to vstore and not changed the wording here), so I
fixed that while doing the float additions. I haven't changed any
actual code here with regard to 3-element vectors.

If we want, I can re-visit the feasibility of <3 x i32> load/stores in
both vload/vstore, as I'd love to not have to special case it anymore.
Follow-up patch material?

--Aaron
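
For reference, a minimal C sketch of what a 3-element load/store has to
do, assuming the OpenCL C convention that vload3/vstore3 access three
packed consecutive elements starting at p + offset * 3. The struct and
function names are hypothetical, and this says nothing about how the
R600 backend actually lowers <3 x i32>.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 3-component vector; real OpenCL int3 has its own type. */
typedef struct { int32_t x, y, z; } int3_sketch;

/* Store three packed elements at p + offset * 3 (no padding element). */
static void vstore3_sketch(int3_sketch v, size_t offset, int32_t *p) {
    p[offset * 3 + 0] = v.x;
    p[offset * 3 + 1] = v.y;
    p[offset * 3 + 2] = v.z;
}

/* Load three packed elements from p + offset * 3. */
static int3_sketch vload3_sketch(size_t offset, const int32_t *p) {
    int3_sketch v;
    v.x = p[offset * 3 + 0];
    v.y = p[offset * 3 + 1];
    v.z = p[offset * 3 + 2];
    return v;
}
```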

Sure. Is there actually any reason not to include it even if it does
work for R600? It shouldn't break the library build, and would only
break on actual uses of it.

s/does/doesn't/

I've got a large queue of patches I'm working on, including adding the
3x int32 logic. 3x int/uint/float load/store works on R600 now, so we
might as well enable it.

--Aaron

LGTM.