Any plan for OpenCL 1.2?

Hi,

It seems libclc currently implements the library requirements of the
OpenCL C programming language, as specified by the OpenCL 1.1
Specification.

I am wondering if there is any active development or plan to upgrade
it to OpenCL 1.2? If not, what are the biggest challenges?

Thanks,
Yang

I haven’t checked in a while, but I think the biggest blocker at this point is that we still don’t have a printf implementation in libclc. Most/all of the rest of the required functions are already implemented to expose 1.2.

I had started on a pure-C printf implementation a while back that would in theory be portable to devices printing to a local/global buffer, but stalled out on it when I got to printing vector arguments and hex-float formats. Another obstacle was that global atomics in CL aren’t guaranteed to be synchronized across all work groups executing a kernel (just within a given workgroup for a given global buffer).

If someone wants to take a peek or keep going with it, I’ve uploaded my WIP code for the printf implementation here: https://github.com/awatry/printf

It’s probably horrible, and may have to be rewritten from scratch to actually work on a GPU, but it may be a start :)

Thanks,
Aaron

> I haven't checked in a while, but I think the biggest blocker at this point
> is that we still don't have a printf implementation in libclc. Most/all of
> the rest of the required functions are already implemented to expose 1.2.
>
> I had started on a pure-C printf implementation a while back that would in
> theory be portable to devices printing to a local/global buffer, but
> stalled out on it when I got to printing vector arguments and hex-float
> formats. Also, the fact that global atomics in CL aren't guaranteed to be
> synchronized across all work groups executing a kernel (just within a given
> workgroup for a given global buffer).

I don't think we need to worry about that. Since both the AMD and
nvptx atomics are atomic for all work groups, we can just use that
behaviour. The actual atomic op would be target specific, and if anyone
wants to add an additional target they can add their own implementation
(SPIR-V can just use an atomic with the right scope).
AMD targets can be switched to use GDS as an optimization later.

At least CL 1.2 printf only prints to stdout, so we only need to
consider global memory.

> If someone wants to take a peek or keep going with it, I've uploaded my WIP
> code for the printf implementation here: https://github.com/awatry/printf

I'm not sure parsing the format string on the device is the best
approach, as it will introduce quite a lot of divergence. It might be
easier/faster to just copy the format string and input data to the
buffer and let the host parse/print everything.
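
A minimal sketch of the "copy to the buffer and let the host print" idea, with plain C standing in for both the device and the host side. The record layout (a length-prefixed format string followed by 32-bit integer arguments) and the names `pack_record` and `host_print_record` are assumptions made for this illustration only; they are not taken from libclc, clover, or the repository linked above, and only `%d` is handled.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* "Device" side: append one record to buf at off, return the new offset.
 * Hypothetical layout: [u32 fmt_len][fmt bytes incl. NUL][u32 nargs][i32 args...]
 * The caller must make sure the record fits in buf. */
static size_t pack_record(unsigned char *buf, size_t off,
                          const char *fmt, const int32_t *args, uint32_t nargs)
{
    uint32_t fmt_len = (uint32_t)strlen(fmt) + 1;
    memcpy(buf + off, &fmt_len, sizeof fmt_len);   off += sizeof fmt_len;
    memcpy(buf + off, fmt, fmt_len);               off += fmt_len;
    memcpy(buf + off, &nargs, sizeof nargs);       off += sizeof nargs;
    memcpy(buf + off, args, nargs * sizeof *args); off += nargs * sizeof *args;
    return off;
}

/* Host side: decode the record at off, render it into out (out_size > 0),
 * and return the offset of the next record. All parsing happens here. */
static size_t host_print_record(const unsigned char *buf, size_t off,
                                char *out, size_t out_size)
{
    uint32_t fmt_len, nargs, next_arg = 0;
    memcpy(&fmt_len, buf + off, sizeof fmt_len);   off += sizeof fmt_len;
    const char *fmt = (const char *)(buf + off);   off += fmt_len;
    memcpy(&nargs, buf + off, sizeof nargs);       off += sizeof nargs;
    const unsigned char *argp = buf + off;
    size_t used = 0;

    for (const char *p = fmt; *p && used + 1 < out_size; ++p) {
        if (p[0] == '%' && p[1] == 'd' && next_arg < nargs) {
            int32_t v;
            memcpy(&v, argp + next_arg * sizeof v, sizeof v);
            ++next_arg;
            size_t room = out_size - used;
            int n = snprintf(out + used, room, "%d", (int)v);
            used += ((size_t)n < room) ? (size_t)n : room - 1;
            ++p;                      /* skip the 'd' */
        } else {
            out[used++] = *p;         /* ordinary character */
        }
    }
    out[used] = '\0';
    return off + nargs * sizeof(int32_t);
}
```

With this split the device-side work is a handful of copies with no data-dependent branching on the format string; the divergent parsing runs once on the host.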

Was the plan to:
1.) parse the input once to get the number of bytes
2.) atomically advance the write pointer
3.) parse the input a second time and print characters to the buffer

or did you have anything more specialized in mind?
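
The three steps above can be sketched in host-side C11, using `vsnprintf` for both passes and an atomic counter as the write pointer. This is only an illustration of the scheme: `buffered_printf`, `out_buf`, `write_ptr`, and `OUT_CAPACITY` are hypothetical names, and a device implementation would not have `vsnprintf` available.

```c
#include <stdarg.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

#define OUT_CAPACITY 256              /* hypothetical buffer size */
static char out_buf[OUT_CAPACITY];    /* stands in for the global buffer */
static atomic_size_t write_ptr;       /* shared write pointer, starts at 0 */

/* Returns the number of characters stored, or -1 on error or overflow.
 * Note: on overflow the write pointer has already been advanced. */
static int buffered_printf(const char *fmt, ...)
{
    va_list ap, ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);

    /* 1.) First pass: measure the output without writing anything. */
    int len = vsnprintf(NULL, 0, fmt, ap);
    va_end(ap);
    if (len < 0) {
        va_end(ap2);
        return -1;
    }

    /* 2.) Atomically advance the write pointer by len + 1 bytes
     *     (the extra byte holds the NUL that vsnprintf appends). */
    size_t need = (size_t)len + 1;
    size_t start = atomic_fetch_add(&write_ptr, need);
    if (start + need > OUT_CAPACITY) {
        va_end(ap2);
        return -1;
    }

    /* 3.) Second pass: format into the region reserved in step 2. */
    vsnprintf(out_buf + start, need, fmt, ap2);
    va_end(ap2);
    return len;
}
```

Because each caller reserves its whole region in one atomic step, output from concurrent callers cannot interleave within a single message.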

thanks,
Jan

I'm currently working on an implementation for the Mesa AMD target, using what
is already implemented in LLVM.

It means having a `__global char * __printf_alloc(uint bytes) {}` that returns
an address inside a global buffer.
The address is calculated from the global buffer's base address plus an offset
covering what has already been stored.

Mine is not using atomics yet, since I'm working on the buffer runtime
management in clover for the moment and will finish the libclc part later.
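
For illustration, an atomic variant of such an allocator might look like the following host-side C11 sketch. It only models the "base address + offset" calculation described above; `printf_alloc`, `printf_buf`, `printf_offset`, and `PRINTF_BUF_SIZE` are hypothetical stand-ins and do not reflect the actual Mesa or LLVM code.

```c
#include <stdatomic.h>
#include <stddef.h>

#define PRINTF_BUF_SIZE 64u                 /* hypothetical capacity */
static char printf_buf[PRINTF_BUF_SIZE];    /* stands in for the global buffer */
static atomic_uint printf_offset;           /* bytes already handed out */

/* Models `__global char *__printf_alloc(uint bytes)`: returns
 * base + offset and bumps the offset atomically, or NULL when the
 * request does not fit. A failed request leaves the offset unchanged. */
static char *printf_alloc(unsigned bytes)
{
    unsigned old = atomic_load(&printf_offset);
    do {
        if (bytes > PRINTF_BUF_SIZE - old)
            return NULL;                    /* not enough room left */
        /* On CAS failure, old is reloaded and the check is repeated. */
    } while (!atomic_compare_exchange_weak(&printf_offset, &old, old + bytes));
    return printf_buf + old;
}
```

The compare-and-swap loop is one possible design; a plain fetch-and-add is simpler but lets the offset run past the end of the buffer when a request fails.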

Serge

> I don't think we need to worry about that. Since both the AMD and
> nvptx atomics are atomic for all work groups, we can just use that
> behaviour. The actual atomic op would be target specific, and if anyone
> wants to add an additional target they can add their own implementation
> (SPIR-V can just use an atomic with the right scope).
> AMD targets can be switched to use GDS as an optimization later.

Yeah, if we go the route of what I had started (not saying we should),
then making it a target-specific implementation with no generic one is
probably the easiest route.

> At least CL 1.2 printf only prints to stdout, so we only need to
> consider global memory.
>
> > If someone wants to take a peek or keep going with it, I've uploaded my WIP
> > code for the printf implementation here: https://github.com/awatry/printf
>
> I'm not sure parsing the format string on the device is the best
> approach, as it will introduce quite a lot of divergence. It might be
> easier/faster to just copy the format string and input data to the
> buffer and let the host parse/print everything.

Yeah, I don't remember if some of my notes from when I was working on
this were along that line, but I know the thought crossed my mind a
few times (and I hadn't given up on the idea at all, given the
performance cost, branchiness, and the sheer amount of code and
stack/register pressure that the implementation I was working on would
introduce). If it weren't for the special vector output formats, we
could pretty much forward the print format and arguments back to the
host and just use the standard system printf. It might still be easier
to only do special handling of that format (and there might've been
one or two other differences from standard C printf; it's been a while
since I started this).

> Was the plan to:
> 1.) parse the input once to get the number of bytes
> 2.) atomically advance the write pointer
> 3.) parse the input a second time and print characters to the buffer
>
> or did you have anything more specialized in mind?

The one that I was working on actually walked the print format input
character by character until it hit a '%' (or anything else that was
special) and when it came time to output anything, the idea would be
that we'd use an atomic increment to allocate a character in the
output buffer and write it. Racy, to be sure, and you'd end up with
output interleaved from all threads attempting to write output
simultaneously. A previous conversation I had indicated that the CL
spec doesn't guarantee that atomic operations/buffers are synchronized
across work groups, so that got me started down the mental path of
partitioning the output buffer into N segments (where N is the number
of work groups launched), so you could at least synchronize the output
amongst work groups.
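
A rough sketch of the partitioning idea, in host-side C11: the output buffer is split into one fixed-size segment per work group, and each segment has its own atomic counter, so a counter is only ever touched by work items of a single group. The names `segment_putc`, `output`, and `used`, and the sizes, are hypothetical; this is not code from the WIP repository.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NUM_GROUPS   4     /* hypothetical: number of work groups launched */
#define SEGMENT_SIZE 32    /* hypothetical: bytes available per work group */

static char output[NUM_GROUPS * SEGMENT_SIZE];
static atomic_size_t used[NUM_GROUPS];   /* one counter per segment */

/* Allocate one character slot in the segment owned by `group` with an
 * atomic increment and write c there.
 * Returns 0 on success, -1 if the group is invalid or its segment is full. */
static int segment_putc(size_t group, char c)
{
    if (group >= NUM_GROUPS)
        return -1;
    size_t slot = atomic_fetch_add(&used[group], 1);
    if (slot >= SEGMENT_SIZE)
        return -1;                       /* segment exhausted, drop output */
    output[group * SEGMENT_SIZE + slot] = c;
    return 0;
}
```

Characters from different work items of the same group can still interleave inside a segment, which matches the raciness described above.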

I will fully admit that the implementation has its issues, but from my
reading of the spec I think it would've at least been compliant.

That being said, I got a good start on a set of unit tests while
working on it, so it wasn't a complete waste. If Serge is working on
an implementation that copies the format specs and arguments from the
device to mesa in order to print them on the host, I'm more than
willing to go with that, and I can probably port my tests over to
piglit at some point just for a sanity check if the CTS isn't thorough
enough.

--Aaron

Ahh, yeah, and now the rust is slowly getting polished off. I think I
had planned on creating the printf output in a private buffer/array
and then at the end of the printf operation (or whenever the private
buffer was full), flushing the built string to the global buffer
instead of writing 1 character at a time directly to the global
buffer.
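
The private-buffer approach could be sketched as follows in host-side C11. Characters are staged in a small per-work-item array and copied to the shared buffer with one atomic reservation per flush, instead of one per character. The type `private_buf`, the functions `private_putc` and `private_flush`, and the sizes are all hypothetical names chosen for this illustration.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define GLOBAL_SIZE  128   /* hypothetical size of the shared buffer */
#define PRIVATE_SIZE 16    /* hypothetical size of the private staging array */

static char global_buf[GLOBAL_SIZE];
static atomic_size_t global_used;

typedef struct {
    char data[PRIVATE_SIZE];
    size_t len;
} private_buf;

/* Copy the staged bytes to the global buffer using a single atomic
 * reservation. Output that does not fit is dropped in this sketch. */
static void private_flush(private_buf *pb)
{
    if (pb->len == 0)
        return;
    size_t start = atomic_fetch_add(&global_used, pb->len);
    if (start + pb->len <= GLOBAL_SIZE)
        memcpy(global_buf + start, pb->data, pb->len);
    pb->len = 0;
}

/* Stage one character, flushing first if the private array is full. */
static void private_putc(private_buf *pb, char c)
{
    if (pb->len == PRIVATE_SIZE)
        private_flush(pb);
    pb->data[pb->len++] = c;
}
```

A caller would invoke `private_flush` once more at the end of each printf call so that the tail of the message reaches the global buffer.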

Sorry for the rambling. I started this almost 3 years ago now, and
haven't touched it since Oct 2017, so the memory has faded a bit in
the interim.

--Aaron

Hello

I've created https://reviews.llvm.org/D84392 to add printf to the AMD target.

Serge