[RFC] Add inc and dec operations to atomicrmw

Hi, I’ve posted ⚙ D137361 IR: Add atomicrmw inc and dec to add inc and dec operations to the IR. This provides consistency to some atomic operations we’re currently handling through target intrinsics, which are more limited compared to real atomicrmw support

1 Like

At the very least, this needs to be named better, to indicate its saturating semantics (e.g. like llvm.uadd.sat.* intrinsics).

But, even given a better name, it kinda seems excessively-special-case-ish to add only saturating add-{1, -1} operations. Hrm.

For the record, they’re wrapping semantics, not saturating. But sure, a more explicit name doesn’t hurt.

It seems at least PTX and AMDGPU both would support this, so it’s not overly special-case-ish. And there’s another reason for these operations: they follow the pattern of remote RMW atomic operations with a single payload value. (Remote here means that the atomic isn’t executed locally in the core but is sent as a memory transaction to some larger, distant cache or even the memory controller.) With an architecture that typically supports a single payload value, inc-wrap and dec-wrap suddenly make a lot of sense, which is why you see the convergent evolution here in GPU targets.

How about wrap_uinc/wrap_udec?

uinc_wrap,udec_wrap would be more consistent with how other operation variants are named

I assume the add-with-arbitrary-wrapping operation should be uniquely useful for GPU programming in some way, right, since it’s only implemented there?

However, I tried looking for uses of this functionality, and it seems to be extremely rarely used. Even in the very few calls I did see, they appeared to not actually require the wrapping semantic – either because they pass something like maxint or -1 as the wrap value, or else they are done with the computation upon reaching the topmost value, anyways (such that the wrap-around to 0 is irrelevant).

(Also found atomicInc doesn't work on host accessible mapped memory (hipHostMalloc) · Issue #815 · ROCm-Developer-Tools/HIP · GitHub yikes!)

FWIW, we use it in the OpenMP GPU runtime (for both NVIDIA and AMD GPUs).

Things like this is part of why it would be good to have this pass through the normal legalization process. We can fallback to cmpxchg loop based on the syncscope like for other atomics rather than blindly assuming it’s OK because an intrinsic was used