> > Hi,
> > It seems libclc currently implements the library requirements of the
> > OpenCL C programming language, as specified by the OpenCL 1.1
> > Specification.
> > I am wondering if there is any active development or plan to upgrade
> > it to OpenCL 1.2? If not, what are the biggest challenges?
> I haven't checked in a while, but I think the biggest blocker at this point
> is that we still don't have a printf implementation in libclc. Most/all of
> the rest of the required functions are already implemented to expose 1.2.
> I had started on a pure-C printf implementation a while back that would in
> theory be portable to devices printing to a local/global buffer, but
> stalled out on it when I got to printing vector arguments and hex-float
> formats. Another complication is that global atomics in CL aren't
> guaranteed to be synchronized across all work groups executing a kernel
> (just within a given workgroup for a given global buffer).
I don't think we need to worry about that. Since both the AMD and
NVPTX atomics are atomic across all work groups, we can just rely on
that behaviour. The actual atomic op would be target-specific, and
anyone who wants to add another target can add their own implementation
(SPIR-V can just use an atomic with the right scope).
AMD targets can be switched to use GDS as an optimization later.
Yeah, if we go the route of what I had started (not saying we should),
then making it a target-specific implementation with no generic one is
probably the easiest route.
At least CL 1.2 printf only prints to stdout, so we only need to
consider global memory.
> If someone wants to take a peek or keep going with it, I've uploaded my WIP
> code for the printf implementation here: https://github.com/awatry/printf
I'm not sure parsing the format string on the device is the best
approach, as it will introduce quite a lot of divergence. It might be
easier/faster to just copy the format string and input data to the
buffer and let the host parse/print everything.
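
To make the "copy and let the host parse" idea concrete, here is a rough
host-only C sketch. The record layout, the function names (pack_record,
host_format_record) and the restriction to %d and %% are all assumptions
made for illustration; nothing here is taken from libclc or mesa.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record layout: [uint32 fmt_len][fmt bytes + NUL][int32 args] */

/* "Device" side: copy the format string and raw arguments, no parsing.
 * Returns the record size in bytes, or 0 if it does not fit in cap. */
size_t pack_record(unsigned char *buf, size_t cap, const char *fmt,
                   const int32_t *args, uint32_t nargs)
{
    uint32_t fmt_len = (uint32_t)strlen(fmt) + 1;   /* include the NUL */
    size_t arg_bytes = (size_t)nargs * sizeof *args;
    size_t total = sizeof fmt_len + fmt_len + arg_bytes;
    if (total > cap)
        return 0;
    memcpy(buf, &fmt_len, sizeof fmt_len);
    memcpy(buf + sizeof fmt_len, fmt, fmt_len);
    memcpy(buf + sizeof fmt_len + fmt_len, args, arg_bytes);
    return total;
}

/* Host side: walk the format and hand each %d to the system snprintf.
 * Only %d and %% are understood in this sketch. Returns chars written. */
size_t host_format_record(const unsigned char *buf, char *out, size_t out_size)
{
    uint32_t fmt_len;
    size_t n = 0;
    if (out_size == 0)
        return 0;
    memcpy(&fmt_len, buf, sizeof fmt_len);
    const char *fmt = (const char *)buf + sizeof fmt_len;
    const unsigned char *argp = buf + sizeof fmt_len + fmt_len;
    for (const char *p = fmt; *p != '\0' && n + 1 < out_size; ++p) {
        if (p[0] == '%' && p[1] == 'd') {
            int32_t v;
            memcpy(&v, argp, sizeof v);
            argp += sizeof v;
            int w = snprintf(out + n, out_size - n, "%d", (int)v);
            if (w < 0)
                break;
            /* On truncation snprintf wrote out_size - n - 1 characters. */
            n += ((size_t)w < out_size - n) ? (size_t)w : out_size - n - 1;
            ++p;                                    /* skip the 'd' */
        } else if (p[0] == '%' && p[1] == '%') {
            out[n++] = '%';
            ++p;
        } else {
            out[n++] = *p;
        }
    }
    out[n] = '\0';
    return n;
}
```

The device-side half is just three copies with no branching on the format
contents, which is the point of the suggestion; all the divergent parsing
stays on the host.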
Yeah, I don't remember if some of my notes from when I was working on
this were along that line, but I know the thought crossed my mind a
few times (and I hadn't ruled the idea out at all, given the
performance cost, branchiness, and the sheer amount of code and
stack/register pressure that the implementation I was working on would
introduce). If it weren't for the special vector output formats, we
could pretty much forward the print format and arguments back to the
host and just use the standard system printf. It might still be easier
to only do special handling of that format (and there might've been
one or two other differences from standard C printf, it's been a while
since I started this).
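
As a sketch of what special handling of the vector format could look
like on the host: the OpenCL 1.2 spec prints vector arguments as
comma-separated components (its own example shows "%2.2v4f" producing
1.00,2.00,3.00,4.00). The helper below, format_float_vector, is a
hypothetical name and only covers float vectors with a precision.

```c
#include <stdio.h>

/* Format n float components with the given precision, joined by commas.
 * Returns the number of characters written, or -1 if out is too small. */
int format_float_vector(char *out, size_t out_size, int precision,
                        const float *v, int n)
{
    size_t used = 0;
    if (out_size == 0)
        return -1;
    out[0] = '\0';
    for (int i = 0; i < n; ++i) {
        /* Each component goes through the ordinary scalar conversion. */
        int w = snprintf(out + used, out_size - used, "%s%.*f",
                         i ? "," : "", precision, (double)v[i]);
        if (w < 0 || (size_t)w >= out_size - used)
            return -1;
        used += (size_t)w;
    }
    return (int)used;
}
```

Everything other than the vn specifier could then still be passed
through to the system printf unchanged, assuming no other differences
turn up.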
Was the plan to:
1.) parse the input once to get the number of bytes
2.) atomically advance the write pointer
3.) parse the input a second time and print characters to the buffer
Or did you have anything more specialized in mind?
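
The three steps above can be sketched in host C with C11 atomics
standing in for the target-specific device atomic. The print_buffer
type and print_buffer_emit are hypothetical names, and the sketch
assumes the caller has already measured the output length (step 1).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical shared output buffer; on a device this would live in
 * global memory and the atomic add would be the target's own. */
typedef struct {
    atomic_size_t write_pos;   /* next free byte */
    size_t        capacity;    /* size of data in bytes */
    char         *data;
} print_buffer;

/* Step 2: reserve len bytes with a single atomic add.
 * Step 3: copy the already-formatted text into the reserved range.
 * Returns the start offset, or (size_t)-1 if the buffer is full. */
size_t print_buffer_emit(print_buffer *buf, const char *text, size_t len)
{
    assert(buf != NULL && text != NULL);
    size_t start = atomic_fetch_add(&buf->write_pos, len);
    if (start > buf->capacity || len > buf->capacity - start)
        return (size_t)-1;     /* overflow: this output is dropped */
    memcpy(buf->data + start, text, len);
    return start;
}
```

Because each call reserves its whole range in one atomic operation, the
output of a single printf stays contiguous even when many threads write
at once.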
The one that I was working on actually walked the print format input
character by character until it hit a '%' (or anything else that was
special) and when it came time to output anything, the idea would be
that we'd use an atomic increment to allocate a character in the
output buffer and write it. Racy, to be sure, and you'd end up with
output interleaved from all threads attempting to write output
simultaneously. A previous conversation I had indicated that the CL
spec doesn't guarantee that atomic operations/buffers are synchronized
across work groups, so that got me started down the mental path of
partitioning the output buffer into N segments (where N is the number
of work groups launched), so you could at least synchronize the output
amongst work groups.
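
For illustration, the partitioning could be as simple as the following
C sketch. The segment type and group_segment are hypothetical names,
and an even split with any remainder left unused is an assumption, not
something from the WIP code.

```c
#include <stddef.h>

/* One work group's slice of the shared output buffer. */
typedef struct {
    size_t offset;   /* first byte of the slice */
    size_t size;     /* slice length in bytes */
} segment;

/* Split total bytes evenly across num_groups and return the slice for
 * group_id. An invalid request yields an empty slice. */
segment group_segment(size_t total, size_t num_groups, size_t group_id)
{
    segment s = {0, 0};
    if (num_groups == 0 || group_id >= num_groups)
        return s;
    s.size = total / num_groups;
    s.offset = group_id * s.size;
    return s;
}
```

Each work group would then only ever increment a counter inside its own
slice, so no atomic has to be visible across work groups.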
I will fully admit that the implementation has its issues, but from my
reading of the spec I think it would've at least been compliant.
That being said, I got a good start on a set of unit tests while
working on it, so it wasn't a complete waste. If Serge is working on
an implementation that copies the format specs and arguments from the
device to mesa in order to print them on the host, I'm more than
willing to go with that, and I can probably port my tests over to
piglit at some point just for a sanity check if the CTS isn't thorough