I misunderstood the use-case I was designing for, and this is not what I currently want. What I actually need is a memory scope matching the level of synchronization performed by a gpu.barrier. I've created a proof of concept here: Gpu barrier memfence by FMarno · Pull Request #3 · FMarno/llvm-project · GitHub
Right now there are a couple of issues:
- GPU_StorageClass seems to overlap significantly with address space
- "local memory" is overloaded as a term and means different things in CUDA and OpenCL