PTX backend support for atomics

I notice that there is not currently any intrinsic support for atomics in the PTX backend. Is this on the roadmap? Should it be as easy to add as it seems (plumbing through just like the thread ID instructions, &c.)? The obvious difference is that these ops have side effects.

It should be just a matter of defining these as back-end intrinsics. Patches are always welcome. :)

Looking further during down time at the dev meeting today, it actually
seems that PTX atom.* and red.* intrinsics map extremely naturally
onto the LLVM atomicrmw and cmpxchg instructions. The biggest issue is
that some things expressible with these LLVM instructions do
not map trivially to PTX, and the range of things naturally supported
depends on the features of a given target. With sufficient effort, all
possible uses of these instructions could be emulated on all targets,
at the cost of efficiency, but this would significantly complicate the
codegen and probably produce steep performance cliffs.

The basic model:

  %d = cmpxchg {T}* %a, {T} %b, {T} %c
  --> atom.{space of %a}.cas.{T} d, [a], b, c

  %d = atomicrmw {OP} {T}* %a, {T} %b
  --> atom.{space of %a}.{OP}.{T} d, [a], b
  for OP in { add, and, or, xor, min, max, xchg } (xchg is spelled exch in PTX)

with the special cases:

  %d is unused --> red.{space of %a}.{OP}.{T} [a], b
  # i.e. use the reduce instr instead of the atom instr (red takes no
  # destination operand)

  {OP} == {add, sub} && b == 1 --> use PTX inc/dec op
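
To make the basic model concrete, here is a toy sketch of the selection rule in Python. It only builds PTX text and is not LLVM backend code; the function names `select_atomicrmw` and `select_cmpxchg` are made up for illustration, and the caller is assumed to pass an already-valid PTX type suffix (e.g. `u32`, `b32`) and state space. The inc/dec special case is left out.

```python
# Illustrative only: a toy version of the mapping above, not LLVM backend code.
# LLVM atomicrmw op name -> PTX op suffix (only xchg is spelled differently).
PTX_OP = {
    "add": "add", "and": "and", "or": "or", "xor": "xor",
    "min": "min", "max": "max", "xchg": "exch",
}


def select_atomicrmw(op, ty, space, result_used=True):
    """Return PTX text for `%d = atomicrmw op T* %a, T %b`.

    `ty` and `space` are assumed to be valid PTX suffixes chosen by the caller.
    """
    if op not in PTX_OP:
        # Anything outside the trivial set is simply rejected in this sketch.
        raise NotImplementedError(f"no trivial PTX mapping for atomicrmw {op}")
    ptx_op = PTX_OP[op]
    # red has no destination operand and no exch form, so exch stays on atom.
    if not result_used and op != "xchg":
        return f"red.{space}.{ptx_op}.{ty} [a], b;"
    return f"atom.{space}.{ptx_op}.{ty} d, [a], b;"


def select_cmpxchg(ty, space):
    """Return PTX text for `%d = cmpxchg T* %a, T %b, T %c`."""
    return f"atom.{space}.cas.{ty} d, [a], b, c;"
```

For example, `select_atomicrmw("add", "u32", "global", result_used=False)` picks the `red` form, while the same call with the result used picks `atom`.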

I think the right answer for the indefinite future is to map exactly
those operations and types which trivially map to the given PTX and
processor versions, leaving other cases as unsupported. (Even on the
SSE and NEON backends, after all, select with a vector condition has
barfed for years.) In the longer run, it could be quite useful for
portability to map the full range of atomicrmw behaviors to all PTX
targets using emulation when necessary, but relative to the current
state of the art (manually writing different CUDA code paths with
different sync strategies for different machine generations), only
supporting what maps trivially is not a regression.
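
As a rough sketch of the "map what is trivial, reject the rest" policy: the backend would consult a per-target table of natively supported (op, type) pairs and report everything else as unsupported. The table below is a hypothetical placeholder, not the real capability set of any PTX version or processor, and `check_trivially_supported` is an invented name.

```python
# HYPOTHETICAL capability table: these entries are placeholders for
# illustration and do not describe any real PTX version or GPU.
EXAMPLE_TARGET_CAPS = {
    ("add", "u32"),
    ("min", "s32"),
    ("cas", "b32"),
}


def check_trivially_supported(op, ty, target_caps):
    """Accept only (op, type) pairs the target handles natively.

    Anything else is reported as unsupported instead of being emulated.
    """
    if (op, ty) not in target_caps:
        raise NotImplementedError(
            f"atomic {op}.{ty} does not map trivially to this target"
        )
    return True
```

The point of the design is that an unsupported combination fails loudly at compile time, so a front-end can choose a different strategy rather than silently getting a slow emulated path.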

Thoughts?

For the short term, I definitely agree that implementing the trivial maps is most important. I’m not too concerned about the corner cases at the moment.

As for emulating atomics when they are not available, this is probably just something we have to live with. To be complete, we should support all intrinsics on all targets and leave it up to the front-end to determine how best to implement source-level functionality. I’m not particularly troubled by steep performance curves for emulated atomics. It’s ultimately the job of the LLVM IR generator to decide how best to map to target intrinsics given a target.
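
The thread does not say how emulation would be done. One common approach, assumed here purely for illustration, is a compare-and-swap retry loop: any read-modify-write can be built from cmpxchg on targets that have it (it does not help on a target with no CAS at all). The Python sketch below models a memory word with a lock-backed `Cell` class; both `Cell` and `emulated_atomicrmw` are invented names.

```python
import threading


class Cell:
    """Toy memory word offering only load and compare-and-swap."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def cmpxchg(self, expected, new):
        """Store `new` if the cell holds `expected`; always return the old value."""
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new
            return old


def emulated_atomicrmw(cell, fn, operand):
    """Emulate an atomicrmw by retrying cmpxchg until no other writer interferes.

    Returns the value the cell held just before the successful update.
    """
    while True:
        old = cell.load()
        if cell.cmpxchg(old, fn(old, operand)) == old:
            return old
```

Under contention the loop may retry many times, which is one source of the performance drop discussed above.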