[PATCH 1/2] Make image builtins r600/llvm-3.9 only

The implementation uses r600 specific intrinsics
LLVM-4 generates calls to functions using _ro_t and _rw_t image types
Portions of the code can be moved back as more targets/llvm versions add image support

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>

The implementation uses r600 specific intrinsics
LLVM-4 generates calls to functions using _ro_t and _rw_t image types
Portions of the code can be moved back as more targets/llvm versions add image support

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
---
AMDGPUOpenCLImageTypeLoweringPass relies on old style of kernel metadata
passing, so I'm pretty sure this does not work in 3.9 either, but it at
least does not generate external calls.
I'd like to keep the code around as a reference when someone starts
resurrecting image builtins.

ping. this is just a code cleanup. No change in functionality.

Jan

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>

LGTM.

-- Jeroen

The implementation uses r600 specific intrinsics
LLVM-4 generates calls to functions using _ro_t and _rw_t image types
Portions of the code can be moved back as more targets/llvm versions add image support

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
---
AMDGPUOpenCLImageTypeLoweringPass relies on old style of kernel metadata
passing, so I'm pretty sure this does not work in 3.9 either, but it at
least does not generate external calls.
I'd like to keep the code around as a reference when someone starts
resurrecting image builtins.

ping. this is just a code cleanup. No change in functionality.

Looks ok to me. I agree it would be nice to resurrect image support
in newer llvm, so lets keep this around for now.

If nothing else, it might give me some ideas on how to make an
implicit global buffer available for the printf implementation.

--Aaron

> > The implementation uses r600 specific intrinsics
> > LLVM-4 generates calls to functions using _ro_t and _rw_t image types
> > Portions of the code can be moved back as more targets/llvm versions add image support
> >
> > Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
> > ---
> > AMDGPUOpenCLImageTypeLoweringPass relies on old style of kernel metadata
> > passing, so I'm pretty sure this does not work in 3.9 either, but it at
> > least does not generate external calls.
> > I'd like to keep the code around as a reference when someone starts
> > resurrecting image builtins.
>
> ping. this is just a code cleanup. No change in functionality.

Looks ok to me. I agree it would be nice to resurrect image support
in newer llvm, so lets keep this around for now.

thanks. Does it apply to 2/2 as well?

If nothing else, it might give me some ideas on how to make an
implicit global buffer available for the printf implementation.

I don't think images are a good model for that. It'd be nicer to use
standard implicit parameter passing on clover side and
__buitltin_implicitarg_ptr on libclc side, without any llvm changes.
(It'd be even nicer if NDRange arguments were converted first, so they
are all together in one place)

Jan

> > The implementation uses r600 specific intrinsics
> > LLVM-4 generates calls to functions using _ro_t and _rw_t image types
> > Portions of the code can be moved back as more targets/llvm versions add image support
> >
> > Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
> > ---
> > AMDGPUOpenCLImageTypeLoweringPass relies on old style of kernel metadata
> > passing, so I'm pretty sure this does not work in 3.9 either, but it at
> > least does not generate external calls.
> > I'd like to keep the code around as a reference when someone starts
> > resurrecting image builtins.
>
> ping. this is just a code cleanup. No change in functionality.

Looks ok to me. I agree it would be nice to resurrect image support
in newer llvm, so lets keep this around for now.

thanks. Does it apply to 2/2 as well?

Yes, it does. I just manually tested the check_external_calls.sh
script against both r600 and pitcairn's built libs/aliases on LLVM
6.0, and the image calls are now gone. I don't have a LLVM 3.9 install
around at the moment to test my BARTS with, unfortunately. Maybe I'll
build that release at some point and install it in its own prefix.

If nothing else, it might give me some ideas on how to make an
implicit global buffer available for the printf implementation.

I don't think images are a good model for that. It'd be nicer to use
standard implicit parameter passing on clover side and
__buitltin_implicitarg_ptr on libclc side, without any llvm changes.
(It'd be even nicer if NDRange arguments were converted first, so they
are all together in one place)

Yeah, I have been considering the implicit parameter passing that
clover does, and had been leaning towards attempting to do that with a

=1MB buffer for printf output. The other thing that will probably be

needed is an atomic counter to keep track of our current location in
that buffer when printing.

I'm a fair bit of the way done with having a clean-slate pure-C
implementation of printf done that matches the CLC requirements that
I've just been testing for identical output against my system printf.
I'll probably post that as an RFC in its pure C form before converting
it to CLC code just to save myself a bit of refactoring effort later
if needed. Once that's done, I'll take a stab at converting it to
CLC, probably have to bug-fix clang to enable varargs function calls
specifically for printf (CL 2.0 spec, section 6.9 part e specifies
that variadic functions/macros are only allowed for
printf/enqueue_kernel), which I don't believe it allows yet, and then
I'll get the implicit parameter passing set up for clover (hopefully).

--Aaron

> > > > The implementation uses r600 specific intrinsics
> > > > LLVM-4 generates calls to functions using _ro_t and _rw_t image types
> > > > Portions of the code can be moved back as more targets/llvm versions add image support
> > > >
> > > > Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
> > > > ---
> > > > AMDGPUOpenCLImageTypeLoweringPass relies on old style of kernel metadata
> > > > passing, so I'm pretty sure this does not work in 3.9 either, but it at
> > > > least does not generate external calls.
> > > > I'd like to keep the code around as a reference when someone starts
> > > > resurrecting image builtins.
> > >
> > > ping. this is just a code cleanup. No change in functionality.
> >
> > Looks ok to me. I agree it would be nice to resurrect image support
> > in newer llvm, so lets keep this around for now.
>
> thanks. Does it apply to 2/2 as well?

Yes, it does. I just manually tested the check_external_calls.sh
script against both r600 and pitcairn's built libs/aliases on LLVM
6.0, and the image calls are now gone. I don't have a LLVM 3.9 install
around at the moment to test my BARTS with, unfortunately. Maybe I'll
build that release at some point and install it in its own prefix.

oh, I should have mentioned that it's in the images branch here:

I was considering to ask the github llvm-mirror person to enable travis
an forward email to the list, or maybe setup our own mirror with CI
enabled. or maybe add libclc to official llvm build farm.

>
> >
> > If nothing else, it might give me some ideas on how to make an
> > implicit global buffer available for the printf implementation.
>
> I don't think images are a good model for that. It'd be nicer to use
> standard implicit parameter passing on clover side and
> __buitltin_implicitarg_ptr on libclc side, without any llvm changes.
> (It'd be even nicer if NDRange arguments were converted first, so they
> are all together in one place)
>

Yeah, I have been considering the implicit parameter passing that
clover does, and had been leaning towards attempting to do that with a
> =1MB buffer for printf output. The other thing that will probably be

needed is an atomic counter to keep track of our current location in
that buffer when printing.

I'm a fair bit of the way done with having a clean-slate pure-C
implementation of printf done that matches the CLC requirements that
I've just been testing for identical output against my system printf.
I'll probably post that as an RFC in its pure C form before converting
it to CLC code just to save myself a bit of refactoring effort later
if needed. Once that's done, I'll take a stab at converting it to
CLC, probably have to bug-fix clang to enable varargs function calls
specifically for printf (CL 2.0 spec, section 6.9 part e specifies
that variadic functions/macros are only allowed for
printf/enqueue_kernel), which I don't believe it allows yet, and then
I'll get the implicit parameter passing set up for clover (hopefully).

I don't think you need to implement full printf in libclc. It should be
enough to dump format string and arguments into the buffer and let
clover to the ugly part. printf code doesn't sound like something that
would run efficiently on GPUs anyway.

Jan

> > > > The implementation uses r600 specific intrinsics
> > > > LLVM-4 generates calls to functions using _ro_t and _rw_t image types
> > > > Portions of the code can be moved back as more targets/llvm versions add image support
> > > >
> > > > Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
> > > > ---
> > > > AMDGPUOpenCLImageTypeLoweringPass relies on old style of kernel metadata
> > > > passing, so I'm pretty sure this does not work in 3.9 either, but it at
> > > > least does not generate external calls.
> > > > I'd like to keep the code around as a reference when someone starts
> > > > resurrecting image builtins.
> > >
> > > ping. this is just a code cleanup. No change in functionality.
> >
> > Looks ok to me. I agree it would be nice to resurrect image support
> > in newer llvm, so lets keep this around for now.
>
> thanks. Does it apply to 2/2 as well?

Yes, it does. I just manually tested the check_external_calls.sh
script against both r600 and pitcairn's built libs/aliases on LLVM
6.0, and the image calls are now gone. I don't have a LLVM 3.9 install
around at the moment to test my BARTS with, unfortunately. Maybe I'll
build that release at some point and install it in its own prefix.

oh, I should have mentioned that it's in the images branch here:
Travis CI - Test and Deploy Your Code with Confidence

I was considering to ask the github llvm-mirror person to enable travis
an forward email to the list, or maybe setup our own mirror with CI
enabled. or maybe add libclc to official llvm build farm.

>
> >
> > If nothing else, it might give me some ideas on how to make an
> > implicit global buffer available for the printf implementation.
>
> I don't think images are a good model for that. It'd be nicer to use
> standard implicit parameter passing on clover side and
> __buitltin_implicitarg_ptr on libclc side, without any llvm changes.
> (It'd be even nicer if NDRange arguments were converted first, so they
> are all together in one place)
>

Yeah, I have been considering the implicit parameter passing that
clover does, and had been leaning towards attempting to do that with a
> =1MB buffer for printf output. The other thing that will probably be

needed is an atomic counter to keep track of our current location in
that buffer when printing.

I'm a fair bit of the way done with having a clean-slate pure-C
implementation of printf done that matches the CLC requirements that
I've just been testing for identical output against my system printf.
I'll probably post that as an RFC in its pure C form before converting
it to CLC code just to save myself a bit of refactoring effort later
if needed. Once that's done, I'll take a stab at converting it to
CLC, probably have to bug-fix clang to enable varargs function calls
specifically for printf (CL 2.0 spec, section 6.9 part e specifies
that variadic functions/macros are only allowed for
printf/enqueue_kernel), which I don't believe it allows yet, and then
I'll get the implicit parameter passing set up for clover (hopefully).

I don't think you need to implement full printf in libclc. It should be
enough to dump format string and arguments into the buffer and let
clover to the ugly part. printf code doesn't sound like something that
would run efficiently on GPUs anyway.

I considered forwarding the format spec and arguments to clover, but I
figured that we would need a custom printf implementation given that
CLC's printf supports the vector length specifier (and hl length) that
standard printf doesn't anyway, and that CLC doesn't support the full
set of printf specifiers that normal C allows.

The same comment from my previous mail about enabling varargs calls in
CLC for printf still would apply, at which point the real question is
performance, and how we'd deal with sending variable length arguments
back to clover in a buffer (probably just a giant byte array?). I'm
not really convinced that printf is a performance critical function.
There may also be complexities involved in forwarding the printf
argument list back to clover in the cases where there are more
arguments supplied to the printf call than the format calls for.

In this case, I figured that we at least needed a global buffer (and
output size variable) that was passed implicitly by clover to libclc
to dump the print output in, at which point I think we can just get
away with an atomically incremented/forwarded position variable for
where in the buffer the cursor currently is so that multiple threads
can output to the buffer in parallel.

It honestly seems easier to do this in raw CLC than to pass the
contents back to clover to run through a custom printf function
anyway. But what I have is currently in C, so I guess we could
consider it...

--Aaron

The implementation uses r600 specific intrinsics
LLVM-4 generates calls to functions using _ro_t and _rw_t image types
Portions of the code can be moved back as more targets/llvm versions add image support

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
---
AMDGPUOpenCLImageTypeLoweringPass relies on old style of kernel metadata
passing, so I'm pretty sure this does not work in 3.9 either, but it at
least does not generate external calls.
I'd like to keep the code around as a reference when someone starts
resurrecting image builtins.

ping. this is just a code cleanup. No change in functionality.

Looks ok to me. I agree it would be nice to resurrect image support
in newer llvm, so lets keep this around for now.

thanks. Does it apply to 2/2 as well?

Yes, it does. I just manually tested the check_external_calls.sh
script against both r600 and pitcairn's built libs/aliases on LLVM
6.0, and the image calls are now gone. I don't have a LLVM 3.9 install
around at the moment to test my BARTS with, unfortunately. Maybe I'll
build that release at some point and install it in its own prefix.

oh, I should have mentioned that it's in the images branch here:
Travis CI - Test and Deploy Your Code with Confidence

I was considering to ask the github llvm-mirror person to enable travis
an forward email to the list, or maybe setup our own mirror with CI
enabled. or maybe add libclc to official llvm build farm.

If nothing else, it might give me some ideas on how to make an
implicit global buffer available for the printf implementation.

I don't think images are a good model for that. It'd be nicer to use
standard implicit parameter passing on clover side and
__buitltin_implicitarg_ptr on libclc side, without any llvm changes.
(It'd be even nicer if NDRange arguments were converted first, so they
are all together in one place)

Yeah, I have been considering the implicit parameter passing that
clover does, and had been leaning towards attempting to do that with a

=1MB buffer for printf output. The other thing that will probably be

needed is an atomic counter to keep track of our current location in
that buffer when printing.

I'm a fair bit of the way done with having a clean-slate pure-C
implementation of printf done that matches the CLC requirements that
I've just been testing for identical output against my system printf.
I'll probably post that as an RFC in its pure C form before converting
it to CLC code just to save myself a bit of refactoring effort later
if needed. Once that's done, I'll take a stab at converting it to
CLC, probably have to bug-fix clang to enable varargs function calls
specifically for printf (CL 2.0 spec, section 6.9 part e specifies
that variadic functions/macros are only allowed for
printf/enqueue_kernel), which I don't believe it allows yet, and then
I'll get the implicit parameter passing set up for clover (hopefully).

I don't think you need to implement full printf in libclc. It should be
enough to dump format string and arguments

There’s not even a reason to dump the format string in the buffer, as per
the OpenCL specification it’s known at compile time what the format
string is going to be.

into the buffer and let
clover to the ugly part. printf code doesn't sound like something that
would run efficiently on GPUs anyway.

I considered forwarding the format spec and arguments to clover, but I
figured that we would need a custom printf implementation given that
CLC's printf supports the vector length specifier (and hl length) that
standard printf doesn't anyway, and that CLC doesn't support the full
set of printf specifiers that normal C allows.

The same comment from my previous mail about enabling varargs calls in
CLC for printf still would apply, at which point the real question is

I don’t get the argument about varargs. Contrary to the situation on a host
system both the compiler and printf library function are under your control
here. Hence, you can use the compiler to get rid of any vararg function
calls before they ever reach the hardware.

performance, and how we'd deal with sending variable length arguments
back to clover in a buffer (probably just a giant byte array?). I'm
not really convinced that printf is a performance critical function.
There may also be complexities involved in forwarding the printf
argument list back to clover in the cases where there are more
arguments supplied to the printf call than the format calls for.

In this case, I figured that we at least needed a global buffer (and
output size variable) that was passed implicitly by clover to libclc
to dump the print output in, at which point I think we can just get
away with an atomically incremented/forwarded position variable for
where in the buffer the cursor currently is so that multiple threads
can output to the buffer in parallel.

Be careful here: OpenCL 1.2 does not guarantee that atomic
operations work across work-groups. Hence, you’re going
to need a buffer per work-group.

Jeroen

>
> > > > > > > The implementation uses r600 specific intrinsics
> > > > > > > LLVM-4 generates calls to functions using _ro_t and _rw_t image types
> > > > > > > Portions of the code can be moved back as more targets/llvm versions add image support
> > > > > > >
> > > > > > > Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
> > > > > > > ---
> > > > > > > AMDGPUOpenCLImageTypeLoweringPass relies on old style of kernel metadata
> > > > > > > passing, so I'm pretty sure this does not work in 3.9 either, but it at
> > > > > > > least does not generate external calls.
> > > > > > > I'd like to keep the code around as a reference when someone starts
> > > > > > > resurrecting image builtins.
> > > > > >
> > > > > > ping. this is just a code cleanup. No change in functionality.
> > > > >
> > > > > Looks ok to me. I agree it would be nice to resurrect image support
> > > > > in newer llvm, so lets keep this around for now.
> > > >
> > > > thanks. Does it apply to 2/2 as well?
> > >
> > > Yes, it does. I just manually tested the check_external_calls.sh
> > > script against both r600 and pitcairn's built libs/aliases on LLVM
> > > 6.0, and the image calls are now gone. I don't have a LLVM 3.9 install
> > > around at the moment to test my BARTS with, unfortunately. Maybe I'll
> > > build that release at some point and install it in its own prefix.
> >
> > oh, I should have mentioned that it's in the images branch here:
> > Travis CI - Test and Deploy Your Code with Confidence
> >
> > I was considering to ask the github llvm-mirror person to enable travis
> > an forward email to the list, or maybe setup our own mirror with CI
> > enabled. or maybe add libclc to official llvm build farm.
> >
> > >
> > > >
> > > > >
> > > > > If nothing else, it might give me some ideas on how to make an
> > > > > implicit global buffer available for the printf implementation.
> > > >
> > > > I don't think images are a good model for that. It'd be nicer to use
> > > > standard implicit parameter passing on clover side and
> > > > __buitltin_implicitarg_ptr on libclc side, without any llvm changes.
> > > > (It'd be even nicer if NDRange arguments were converted first, so they
> > > > are all together in one place)
> > > >
> > >
> > > Yeah, I have been considering the implicit parameter passing that
> > > clover does, and had been leaning towards attempting to do that with a
> > > > =1MB buffer for printf output. The other thing that will probably be
> > >
> > > needed is an atomic counter to keep track of our current location in
> > > that buffer when printing.
> > >
> > > I'm a fair bit of the way done with having a clean-slate pure-C
> > > implementation of printf done that matches the CLC requirements that
> > > I've just been testing for identical output against my system printf.
> > > I'll probably post that as an RFC in its pure C form before converting
> > > it to CLC code just to save myself a bit of refactoring effort later
> > > if needed. Once that's done, I'll take a stab at converting it to
> > > CLC, probably have to bug-fix clang to enable varargs function calls
> > > specifically for printf (CL 2.0 spec, section 6.9 part e specifies
> > > that variadic functions/macros are only allowed for
> > > printf/enqueue_kernel), which I don't believe it allows yet, and then
> > > I'll get the implicit parameter passing set up for clover (hopefully).
> >
> > I don't think you need to implement full printf in libclc. It should be
> > enough to dump format string and arguments

There’s not even a reason to dump the format string in the buffer, as per
the OpenCL specification it’s known at compile time what the format
string is going to be.

we could get away with a pointer to the format string. but it might be
more difficult for clover the figure out the pointers if we want to use
device pointers.

> > into the buffer and let
> > clover to the ugly part. printf code doesn't sound like something that
> > would run efficiently on GPUs anyway.
>
> I considered forwarding the format spec and arguments to clover, but I
> figured that we would need a custom printf implementation given that
> CLC's printf supports the vector length specifier (and hl length) that
> standard printf doesn't anyway, and that CLC doesn't support the full
> set of printf specifiers that normal C allows.

agreed. it'd just be more efficient to implement custom printf in
clover than libclc (+you can use all the C++ help)

>
> The same comment from my previous mail about enabling varargs calls in
> CLC for printf still would apply, at which point the real question is

I don’t get the argument about varargs. Contrary to the situation on a host
system both the compiler and printf library function are under your control
here. Hence, you can use the compiler to get rid of any vararg function
calls before they ever reach the hardware.

varargs should be handled by clang. based on
test/SemaOpenCL/builtin.cl, I'd expect that setting cl-1.2 should be
enough to get it working to get rid of clang complaints (we currently
build libclc using cl-1.0)

> performance, and how we'd deal with sending variable length arguments
> back to clover in a buffer (probably just a giant byte array?). I'm
> not really convinced that printf is a performance critical function.
> There may also be complexities involved in forwarding the printf
> argument list back to clover in the cases where there are more
> arguments supplied to the printf call than the format calls for.
>
> In this case, I figured that we at least needed a global buffer (and
> output size variable) that was passed implicitly by clover to libclc
> to dump the print output in, at which point I think we can just get
> away with an atomically incremented/forwarded position variable for
> where in the buffer the cursor currently is so that multiple threads
> can output to the buffer in parallel.

Be careful here: OpenCL 1.2 does not guarantee that atomic
operations work across work-groups. Hence, you’re going
to need a buffer per work-group.

Do you have a reference for this? CLC2 introduced scoped atomic
operations, but cl-1.0 says that "These transactions are atomic for the
device executing these atomic functions." which, I think can be
interpreted as device scope.

GCN atomics are definitely global (performed in L2 cache). afaik the
same is true for EG/CM, but I'd need to double check that color buffer
works as expected. I don't know about nvidia.

Jeroen

>
> It honestly seems easier to do this in raw CLC than to pass the
> contents back to clover to run through a custom printf function
> anyway. But what I have is currently in C, so I guess we could
> consider it...

you can even implement both a keep the one that performs better :slight_smile:

Jan

The implementation uses r600 specific intrinsics
LLVM-4 generates calls to functions using _ro_t and _rw_t image types
Portions of the code can be moved back as more targets/llvm versions add image support

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
---
AMDGPUOpenCLImageTypeLoweringPass relies on old style of kernel metadata
passing, so I'm pretty sure this does not work in 3.9 either, but it at
least does not generate external calls.
I'd like to keep the code around as a reference when someone starts
resurrecting image builtins.

ping. this is just a code cleanup. No change in functionality.

Looks ok to me. I agree it would be nice to resurrect image support
in newer llvm, so lets keep this around for now.

thanks. Does it apply to 2/2 as well?

Yes, it does. I just manually tested the check_external_calls.sh
script against both r600 and pitcairn's built libs/aliases on LLVM
6.0, and the image calls are now gone. I don't have a LLVM 3.9 install
around at the moment to test my BARTS with, unfortunately. Maybe I'll
build that release at some point and install it in its own prefix.

oh, I should have mentioned that it's in the images branch here:
Travis CI - Test and Deploy Your Code with Confidence

I was considering to ask the github llvm-mirror person to enable travis
an forward email to the list, or maybe setup our own mirror with CI
enabled. or maybe add libclc to official llvm build farm.

If nothing else, it might give me some ideas on how to make an
implicit global buffer available for the printf implementation.

I don't think images are a good model for that. It'd be nicer to use
standard implicit parameter passing on clover side and
__buitltin_implicitarg_ptr on libclc side, without any llvm changes.
(It'd be even nicer if NDRange arguments were converted first, so they
are all together in one place)

Yeah, I have been considering the implicit parameter passing that
clover does, and had been leaning towards attempting to do that with a

=1MB buffer for printf output. The other thing that will probably be

needed is an atomic counter to keep track of our current location in
that buffer when printing.

I'm a fair bit of the way done with having a clean-slate pure-C
implementation of printf done that matches the CLC requirements that
I've just been testing for identical output against my system printf.
I'll probably post that as an RFC in its pure C form before converting
it to CLC code just to save myself a bit of refactoring effort later
if needed. Once that's done, I'll take a stab at converting it to
CLC, probably have to bug-fix clang to enable varargs function calls
specifically for printf (CL 2.0 spec, section 6.9 part e specifies
that variadic functions/macros are only allowed for
printf/enqueue_kernel), which I don't believe it allows yet, and then
I'll get the implicit parameter passing set up for clover (hopefully).

I don't think you need to implement full printf in libclc. It should be
enough to dump format string and arguments

There’s not even a reason to dump the format string in the buffer, as per
the OpenCL specification it’s known at compile time what the format
string is going to be.

we could get away with a pointer to the format string. but it might be
more difficult for clover the figure out the pointers if we want to use
device pointers.

into the buffer and let
clover to the ugly part. printf code doesn't sound like something that
would run efficiently on GPUs anyway.

I considered forwarding the format spec and arguments to clover, but I
figured that we would need a custom printf implementation given that
CLC's printf supports the vector length specifier (and hl length) that
standard printf doesn't anyway, and that CLC doesn't support the full
set of printf specifiers that normal C allows.

agreed. it'd just be more efficient to implement custom printf in
clover than libclc (+you can use all the C++ help)

The same comment from my previous mail about enabling varargs calls in
CLC for printf still would apply, at which point the real question is

I don’t get the argument about varargs. Contrary to the situation on a host
system both the compiler and printf library function are under your control
here. Hence, you can use the compiler to get rid of any vararg function
calls before they ever reach the hardware.

varargs should be handled by clang. based on
test/SemaOpenCL/builtin.cl, I'd expect that setting cl-1.2 should be
enough to get it working to get rid of clang complaints (we currently
build libclc using cl-1.0)

performance, and how we'd deal with sending variable length arguments
back to clover in a buffer (probably just a giant byte array?). I'm
not really convinced that printf is a performance critical function.
There may also be complexities involved in forwarding the printf
argument list back to clover in the cases where there are more
arguments supplied to the printf call than the format calls for.

In this case, I figured that we at least needed a global buffer (and
output size variable) that was passed implicitly by clover to libclc
to dump the print output in, at which point I think we can just get
away with an atomically incremented/forwarded position variable for
where in the buffer the cursor currently is so that multiple threads
can output to the buffer in parallel.

Be careful here: OpenCL 1.2 does not guarantee that atomic
operations work across work-groups. Hence, you’re going
to need a buffer per work-group.

Do you have a reference for this? CLC2 introduced scoped atomic
operations, but cl-1.0 says that "These transactions are atomic for the
device executing these atomic functions." which, I think can be
interpreted as device scope.

Yes, Section 3.3.1 of the OpenCL 1.2 specification (OpenCL 2 is a totally
different beast): "here are no guarantees of memory consistency between
different work-groups executing a kernel”. If you talk to OpenCL contributors,
they’ll interpret this as also applying to atomics.

Jeroen

The implementation uses r600 specific intrinsics
LLVM-4 generates calls to functions using _ro_t and _rw_t image types
Portions of the code can be moved back as more targets/llvm versions add image support

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
---
AMDGPUOpenCLImageTypeLoweringPass relies on old style of kernel metadata
passing, so I'm pretty sure this does not work in 3.9 either, but it at
least does not generate external calls.
I'd like to keep the code around as a reference when someone starts
resurrecting image builtins.

ping. this is just a code cleanup. No change in functionality.

Looks ok to me. I agree it would be nice to resurrect image support
in newer llvm, so lets keep this around for now.

thanks. Does it apply to 2/2 as well?

Yes, it does. I just manually tested the check_external_calls.sh
script against both r600 and pitcairn's built libs/aliases on LLVM
6.0, and the image calls are now gone. I don't have a LLVM 3.9 install
around at the moment to test my BARTS with, unfortunately. Maybe I'll
build that release at some point and install it in its own prefix.

oh, I should have mentioned that it's in the images branch here:
Travis CI - Test and Deploy Your Code with Confidence

I was considering to ask the github llvm-mirror person to enable travis
an forward email to the list, or maybe setup our own mirror with CI
enabled. or maybe add libclc to official llvm build farm.

If nothing else, it might give me some ideas on how to make an
implicit global buffer available for the printf implementation.

I don't think images are a good model for that. It'd be nicer to use
standard implicit parameter passing on clover side and
__buitltin_implicitarg_ptr on libclc side, without any llvm changes.
(It'd be even nicer if NDRange arguments were converted first, so they
are all together in one place)

Yeah, I have been considering the implicit parameter passing that
clover does, and had been leaning towards attempting to do that with a

=1MB buffer for printf output. The other thing that will probably be

needed is an atomic counter to keep track of our current location in
that buffer when printing.

I'm a fair bit of the way done with having a clean-slate pure-C
implementation of printf done that matches the CLC requirements that
I've just been testing for identical output against my system printf.
I'll probably post that as an RFC in its pure C form before converting
it to CLC code just to save myself a bit of refactoring effort later
if needed. Once that's done, I'll take a stab at converting it to
CLC, probably have to bug-fix clang to enable varargs function calls
specifically for printf (CL 2.0 spec, section 6.9 part e specifies
that variadic functions/macros are only allowed for
printf/enqueue_kernel), which I don't believe it allows yet, and then
I'll get the implicit parameter passing set up for clover (hopefully).

I don't think you need to implement full printf in libclc. It should be
enough to dump format string and arguments

There’s not even a reason to dump the format string in the buffer, as per
the OpenCL specification it’s known at compile time what the format
string is going to be.

we could get away with a pointer to the format string. but it might be
more difficult for clover the figure out the pointers if we want to use
device pointers.

into the buffer and let
clover to the ugly part. printf code doesn't sound like something that
would run efficiently on GPUs anyway.

I considered forwarding the format spec and arguments to clover, but I
figured that we would need a custom printf implementation given that
CLC's printf supports the vector length specifier (and hl length) that
standard printf doesn't anyway, and that CLC doesn't support the full
set of printf specifiers that normal C allows.

agreed. it'd just be more efficient to implement custom printf in
clover than libclc (+you can use all the C++ help)

The same comment from my previous mail about enabling varargs calls in
CLC for printf still would apply, at which point the real question is

I don’t get the argument about varargs. Contrary to the situation on a host
system both the compiler and printf library function are under your control
here. Hence, you can use the compiler to get rid of any vararg function
calls before they ever reach the hardware.

varargs should be handled by clang. based on
test/SemaOpenCL/builtin.cl, I'd expect that setting cl-1.2 should be
enough to get it working to get rid of clang complaints (we currently
build libclc using cl-1.0)

Oh, right. That’s what you mean. I haven’t tried compiling with -std=CL1.0,
but if you compile with -std=CL1.2, clang already generates sensible
bitcode for vararg functions.

Jeroen

The implementation uses r600 specific intrinsics
LLVM-4 generates calls to functions using _ro_t and _rw_t image types
Portions of the code can be moved back as more targets/llvm versions add image support

Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
---
AMDGPUOpenCLImageTypeLoweringPass relies on old style of kernel metadata
passing, so I'm pretty sure this does not work in 3.9 either, but it at
least does not generate external calls.
I'd like to keep the code around as a reference when someone starts
resurrecting image builtins.

ping. this is just a code cleanup. No change in functionality.

Looks ok to me. I agree it would be nice to resurrect image support
in newer llvm, so lets keep this around for now.

thanks. Does it apply to 2/2 as well?

Yes, it does. I just manually tested the check_external_calls.sh
script against both r600 and pitcairn's built libs/aliases on LLVM
6.0, and the image calls are now gone. I don't have a LLVM 3.9 install
around at the moment to test my BARTS with, unfortunately. Maybe I'll
build that release at some point and install it in its own prefix.

oh, I should have mentioned that it's in the images branch here:
Travis CI - Test and Deploy Your Code with Confidence

I was considering to ask the github llvm-mirror person to enable travis
an forward email to the list, or maybe setup our own mirror with CI
enabled. or maybe add libclc to official llvm build farm.

If nothing else, it might give me some ideas on how to make an
implicit global buffer available for the printf implementation.

I don't think images are a good model for that. It'd be nicer to use
standard implicit parameter passing on clover side and
__buitltin_implicitarg_ptr on libclc side, without any llvm changes.
(It'd be even nicer if NDRange arguments were converted first, so they
are all together in one place)

Yeah, I have been considering the implicit parameter passing that
clover does, and had been leaning towards attempting to do that with a

=1MB buffer for printf output. The other thing that will probably be

needed is an atomic counter to keep track of our current location in
that buffer when printing.

I'm a fair bit of the way done with having a clean-slate pure-C
implementation of printf done that matches the CLC requirements that
I've just been testing for identical output against my system printf.
I'll probably post that as an RFC in its pure C form before converting
it to CLC code just to save myself a bit of refactoring effort later
if needed. Once that's done, I'll take a stab at converting it to
CLC, probably have to bug-fix clang to enable varargs function calls
specifically for printf (CL 2.0 spec, section 6.9 part e specifies
that variadic functions/macros are only allowed for
printf/enqueue_kernel), which I don't believe it allows yet, and then
I'll get the implicit parameter passing set up for clover (hopefully).

I don't think you need to implement full printf in libclc. It should be
enough to dump format string and arguments

There’s not even a reason to dump the format string in the buffer, as per
the OpenCL specification it’s known at compile time what the format
string is going to be.

into the buffer and let
clover to the ugly part. printf code doesn't sound like something that
would run efficiently on GPUs anyway.

I considered forwarding the format spec and arguments to clover, but I
figured that we would need a custom printf implementation given that
CLC's printf supports the vector length specifier (and hl length) that
standard printf doesn't anyway, and that CLC doesn't support the full
set of printf specifiers that normal C allows.

The same comment from my previous mail about enabling varargs calls in
CLC for printf still would apply, at which point the real question is

I don’t get the argument about varargs. Contrary to the situation on a host
system both the compiler and printf library function are under your control
here. Hence, you can use the compiler to get rid of any vararg function
calls before they ever reach the hardware.

performance, and how we'd deal with sending variable length arguments
back to clover in a buffer (probably just a giant byte array?). I'm
not really convinced that printf is a performance critical function.
There may also be complexities involved in forwarding the printf
argument list back to clover in the cases where there are more
arguments supplied to the printf call than the format calls for.

In this case, I figured that we at least needed a global buffer (and
output size variable) that was passed implicitly by clover to libclc
to dump the print output in, at which point I think we can just get
away with an atomically incremented/forwarded position variable for
where in the buffer the cursor currently is so that multiple threads
can output to the buffer in parallel.

Be careful here: OpenCL 1.2 does not guarantee that atomic
operations work across work-groups. Hence, you’re going
to need a buffer per work-group.

Well, that's unfortunate (to put it mildly). Global atomics end up
being pretty pointless if they're not actually guaranteed to be
atomic.

This may be do-able if we end up using atomic counters to guarantee
that global atomicity, but that would add one more thing that has to
be supported before we can expose printf:
https://www.khronos.org/registry/OpenCL/extensions/ext/cl_ext_atomic_counters_32.txt

But yeah, it's possible to create either a buffer per group, or just
one larger buffer that's sub-divided between the number of workgroups
being executed. That does eventually get us into buffer-sizing
requirements (CL spec requires a minimum of 1MB, but whether that'd be
per workgroup or total is possibly up to interpretation).

Either way, thanks for the warning. It would've been unfortunate to
find this out the hard way.

--Aaron