Sorry for replying late; it seems the mail server of my (French) school has been down since yesterday, so I will use this address (Swiss university) during the outage.
(I have read the replies to my mails from the libclc archives.)
I will try to address all the things you pointed out:
- I agree with you, my mails are not really prettily formatted (I used git format-patch and then sent the mails manually instead of using git send-email directly. Stupid mistake).
- @Tom: I don't understand what you mean when you wrote:
"I would really like to see a generic implementation of this which use barrier(). "
in your reply to patch 2.
Do you mean that the async copy should use a barrier? I think so too. I think a barrier at the beginning of the async copy would be enough to start the copy in good conditions (see the first sketch after this list).
- I have read the spec, and my conclusion is that a barrier is a work-group sync point, whatever the flags are. So I think we must have a barrier_nofence() call.
- The localglobal() stuff used everywhere mimics what the closed driver seems to do. In its IR output we can see that it has chosen different pseudo-instructions for all the possibilities: barriers and memory fences seem to have a different intrinsic for each combination of flags. So I thought it might be interesting to do the same.
Thanks to that, the intrinsics are really easy to lower correctly, and nothing needs to change if some hardware someday has a dedicated instruction for every combination (very unrealistic, however); see the second sketch after this list.
But I can change that if you want.
- I have considered making a very simple implementation of barriers with a call to mem_fence and the actual barrier intrinsic. But the closed driver has special intrinsics, so... ^^
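
To make the async copy point concrete, here is a rough sketch of what I understand a generic barrier()-based implementation could look like, specialized to int; the strided-copy scheme and the passed-through event are my assumptions, not what libclc actually does:

event_t async_work_group_copy(local int *dst, const global int *src,
                              size_t num_gentypes, event_t event) {
  /* Sync the whole group so dst and src are in a consistent state
   * before any work-item starts copying. */
  barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);

  /* Cooperative strided copy: work-item i handles elements
   * i, i + size, i + 2 * size, ... (1D work-group assumed). */
  for (size_t i = get_local_id(0); i < num_gentypes;
       i += get_local_size(0))
    dst[i] = src[i];

  /* The copy is actually done synchronously here, so the event is just
   * passed through and wait_group_events() reduces to a barrier(). */
  return event;
}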
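And for the localglobal() point, a rough sketch of the per-combination lowering; the __clc_barrier_* names are placeholders I made up, not the closed driver's real intrinsics:

void __clc_barrier_local(void);       /* barrier(CLK_LOCAL_MEM_FENCE)  */
void __clc_barrier_global(void);      /* barrier(CLK_GLOBAL_MEM_FENCE) */
void __clc_barrier_localglobal(void); /* barrier(LOCAL | GLOBAL)       */

void barrier(cl_mem_fence_flags flags) {
  /* One dedicated intrinsic per flag combination, so each one can be
   * lowered independently if some hardware ever has a matching
   * instruction. */
  if (flags == CLK_LOCAL_MEM_FENCE)
    __clc_barrier_local();
  else if (flags == CLK_GLOBAL_MEM_FENCE)
    __clc_barrier_global();
  else
    __clc_barrier_localglobal();
}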
I will look at the replies to my LLVM patches now.
> I have read the spec, and my conclusion is that a barrier is a work-group sync point, whatever the flags are. So I think we must have a barrier_nofence() call.
I would agree, though the spec is ambiguous. I would make it fence all address spaces as the fallback else case for non-compile-time-constant flags. (I remember finding that non-constant flags were not allowed, though I have never re-found where in the spec that is specified; it should be a frontend warning anyway.)
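
As a sketch (the __clc_fence_* names are made up, not real intrinsics):

void __clc_fence_local(void);
void __clc_fence_global(void);
void __clc_fence_all(void);

void mem_fence(cl_mem_fence_flags flags) {
  if (flags == CLK_LOCAL_MEM_FENCE)
    __clc_fence_local();
  else if (flags == CLK_GLOBAL_MEM_FENCE)
    __clc_fence_global();
  else
    /* Combined or non-compile-time-constant flags: conservatively
     * fence every address space. */
    __clc_fence_all();
}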
> The localglobal() stuff used everywhere mimics what the closed driver seems to do. In its IR output we can see that it has chosen different pseudo-instructions for all the possibilities: barriers and memory fences seem to have a different intrinsic for each combination of flags.
This is because in AMDIL the same fence instruction with different modifiers implements all of the variations of barrier and mem_fence. LLVM is not aware of the hardware details of how it works and does not do any real scheduling.
> So I thought it might be interesting to do the same.
> Thanks to that, the intrinsics are really easy to lower correctly, and nothing needs to change if some hardware someday has a dedicated instruction for every combination (very unrealistic, however).
> But I can change that if you want.
> I have considered making a very simple implementation of barriers with a call to mem_fence and the actual barrier intrinsic. But the closed driver has special intrinsics, so… ^^
As mentioned in the LLVM thread, barrier can't be used to implement a mem_fence.
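
For example, a sketch of why: mem_fence must be legal in divergent control flow, where a barrier would not be.

kernel void divergent_fence(global int *out) {
  if (get_local_id(0) == 0) {
    out[0] = 1;
    /* Valid: mem_fence only orders this work-item's memory operations.
     * If it were lowered to a barrier, only work-item 0 would reach the
     * sync point and the kernel would deadlock / be undefined. */
    mem_fence(CLK_GLOBAL_MEM_FENCE);
    out[1] = 2;
  }
}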
I have seen that when flags is 0, the closed driver queues a memory fence for both local and global, so I think we should do as you say; I will change that. Ok, it is certainly a better way of doing this. Yeah, I know, it was a wrong first implementation for sure. But barriers queue memory fences, right? So it should be possible to implement a barrier as a memory fence followed by a sync point, shouldn't it?
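
Something like this, roughly (__clc_syncpoint() is just a placeholder name for the fence-free sync, i.e. the barrier_nofence() from before):

void __clc_syncpoint(void); /* hypothetical fence-free work-group sync */

void barrier(cl_mem_fence_flags flags) {
  mem_fence(flags);  /* make this work-item's prior accesses visible */
  __clc_syncpoint(); /* pure execution sync point, no fencing */
  /* Depending on the memory model, a second mem_fence(flags) may be
   * needed here so work-items also observe each other's writes after
   * the sync. */
}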