I have been thinking that we should perform loop transformations as separate steps and keep the mapping itself relatively simple. These transformations can still be driven by the same set of annotations, but they become easier to test and reuse. For example, if you want to map multiple loops to the same block/thread id, you can coalesce those loops into a single loop and map only that loop. The same applies to tiling with dynamic sizes: we can do the tiling as a transformation, map the outer loop, and keep the inner loop in the kernel, potentially canonicalizing it away if it is statically known to have a single iteration. We already have a mapLoopToProcessorIDs for non-parallel loops that does exactly that.
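To make the two rewrites concrete, here is a minimal sketch of the index arithmetic in plain Python. It is only an illustration, not the actual pass code: the function names `coalesced_indices` and `tiled_indices` are hypothetical, and the real transformations operate on IR rather than on Python ranges.

```python
def coalesced_indices(n_outer, n_inner):
    """Run a 2-D loop nest as one linear loop and recover (i, j).

    Hypothetical helper: the single `linear` loop is the one that would be
    mapped to a block/thread id.
    """
    visited = []
    for linear in range(n_outer * n_inner):
        # Delinearize the coalesced induction variable.
        i, j = divmod(linear, n_inner)
        visited.append((i, j))
    return visited


def tiled_indices(n, tile_size):
    """Tile range(n) by a tile size that is only known at run time.

    Hypothetical helper: the outer `tile` loop is the one that would be
    mapped; the inner loop stays inside the kernel.
    """
    visited = []
    num_tiles = (n + tile_size - 1) // tile_size  # ceiling division
    for tile in range(num_tiles):
        start = tile * tile_size
        # The upper bound is clamped so the last, partial tile stays in range.
        for k in range(start, min(start + tile_size, n)):
            visited.append((tile, k))
    return visited
```

With `tile_size == 1` the inner loop always runs exactly once, which is the case where it could be canonicalized away once the size is known statically.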