[RFC] Device runtime library (re)design

Thanks for the RFC. I think I’ve caught up on the thread and associated patches. I’m late to the party so this email is a bit long.

Today we have a working nvptx implementation in tree. There’s also a working implementation for amdgcn on GitHub. @Ravi, if yours is open source I’d love to compare the implementations; if not, I’d invite you to look at

Largely shared code would make bringing up new devices easier if the abstractions are in the right place, and more difficult otherwise. I’m wary of significantly rewriting the code while only a single target is supported, in case we draw the lines in the wrong place. To that end, we (AMD) would like to add our target. That’ll remove some hard-coded nvptx behaviour and hopefully reduce the cost of adding a third accelerator.

Right now, our amdgcn implementation shares zero code with nvptx. We copied the CUDA sources and #ifdef’ed the parts that differ. That’s too expensive to maintain while the ‘common’ code churns, and not something I’d like to commit as is. Instead, I’d like to incrementally refactor the deviceRTL to extract the nvptx-specific parts, with close reference to our GitHub repo, until the difference is so slight that we can upstream a couple of files and some cmake to bring amdgcn online.

Reducing the use of CUDA, bringing the code more in line with LLVM’s coding style, and documenting target bring-up are all great goals. I think those can also be done incrementally and are largely orthogonal to getting some target diversity in the codebase. I’d like to be involved in both tracks.

As a concrete, minimal proposal, I’d like to put the functions that use nvptx asm behind an interface with zero runtime overhead and get that up for review.
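
Roughly the shape this could take, as a sketch only. The namespace and function names below (target_sketch, lanemask_lt, rank_below_in_mask, popcount32) are made up for illustration and are not taken from the existing deviceRTL. The idea is that each target provides the same inline functions, selected at compile time, so the shared code calls them with no indirection. The non-nvptx branch here is just a single-lane stand-in so the example compiles anywhere; a real amdgcn version would go in its place.

```cpp
#include <cstdint>

// Hypothetical target layer: one set of inline functions per target,
// chosen by the preprocessor, so callers pay no dispatch cost.
namespace target_sketch {

#if defined(__NVPTX__)
// nvptx: read the PTX special register holding the mask of lanes
// with a lower id than the calling lane.
inline uint32_t lanemask_lt() {
  uint32_t mask;
  asm("mov.u32 %0, %%lanemask_lt;" : "=r"(mask));
  return mask;
}
#else
// Placeholder for any other target (assumption: a single-lane build,
// used only to keep this sketch compilable off-device). Lane 0 has no
// lower lanes, so the mask is empty.
inline uint32_t lanemask_lt() { return 0u; }
#endif

} // namespace target_sketch

// Portable bit count, written out to avoid relying on compiler builtins.
inline uint32_t popcount32(uint32_t v) {
  uint32_t n = 0;
  for (; v != 0; v &= v - 1) // clears the lowest set bit each iteration
    ++n;
  return n;
}

// Target-independent code: how many active lanes sit below the caller.
// It only sees the interface, never the asm.
inline uint32_t rank_below_in_mask(uint32_t active_mask) {
  return popcount32(active_mask & target_sketch::lanemask_lt());
}
```

Because everything is an inline function resolved at compile time, the nvptx build should end up with the same instruction it has today, and the shared code no longer mentions nvptx at all.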

Thanks all,