Hi everyone,
I’ve recently added llvm.atomicrmw and llvm.cmpxchg, and this proposal aims to add std.atomic_rmw. It’s mostly mechanical, but there’s an open question, and adding an op to the standard dialect seems to warrant a review.
Thanks!
[RFC] Add std.atomic_rmw op
Background
Atomic read-modify-write blocks have useful semantics beyond the LLVM dialect. In conjunction with affine.parallel_for, for example, they can be used to represent common machine learning reduction operations such as convolution or pooling. We believe including atomic_rmw in the standard dialect will make these operations available at a more appropriate level of abstraction than llvm.atomicrmw. This also lets us represent atomic RMWs that use operations not available in LLVM’s atomic RMW by providing a lowering into llvm.cmpxchg.
Goals
- Add an op to represent atomic read-modify-write sequences to the Standard dialect.
- Add lowering from std.atomic_rmw into the appropriate op in the LLVM dialect:
  - llvm.atomicrmw for trivial cases.
  - llvm.cmpxchg for complex cases.
Proposal
IR Representation
def AtomicRMWOp : Std_Op<"atomic_rmw"> {
  let arguments = (ins AnyMemRef:$memref, Variadic<Index>:$indices);
  let regions = (region SizedRegion<1>:$body);
}
def AtomicRMWYieldOp :
    Std_Op<"atomic_rmw.yield", [HasParent<"AtomicRMWOp">, Terminator]> {
  let summary = "terminator for atomic_rmw operations";
  let arguments = (ins AnyType:$result);
}
Lowering into llvm.atomicrmw
Pattern matching can be used to determine how to lower a particular std.atomic_rmw op. Any trivial body consisting of a single op that matches one of the AtomicBinaryOp enum values will be lowered directly into llvm.atomicrmw.
For example:
func @sum(%memref : memref<10xf32>, %i : index, %val : f32) {
  atomic_rmw %iv = %memref[%i] : memref<10xf32> {
    %local = addf %iv, %val : f32
    atomic_rmw.yield %local : f32
  }
  return
}
Lowers into:
!memref_ptr = type !llvm<"{ float*, float*, i64, [1 x i64], [1 x i64] }*">
!memref_val = type !llvm<"{ float*, float*, i64, [1 x i64], [1 x i64] }">
llvm.func @sum(%memref: !memref_ptr, %i : !llvm.i64, %val: !llvm.float) {
  %load = llvm.load %memref : !memref_ptr
  %buf = llvm.extractvalue %load[1] : !memref_val
  %ptr = llvm.getelementptr %buf[%i] : (!llvm<"float*">, !llvm.i64) -> !llvm<"float*">
  llvm.atomicrmw fadd %ptr, %val acq_rel : !llvm.float
  llvm.return
}
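As a rough illustration of the trivial-body match described above, here is a minimal C++ sketch over a toy IR. ToyOp, ToyBinOp, and matchTrivialBody are hypothetical names invented for this example and are not part of MLIR. Only addi and addf are mapped here, on the assumption that they correspond to the add and fadd kinds of llvm.atomicrmw; the actual AtomicBinaryOp enum has more values. Anything that does not match would fall through to the llvm.cmpxchg lowering.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical toy IR for this sketch only; these are not MLIR classes.
struct ToyOp {
  std::string name;                   // e.g. "addf"
  std::vector<std::string> operands;  // SSA names, e.g. {"%iv", "%val"}
  std::string result;                 // SSA result name; empty for the terminator
};

// Assumed subset of binary ops that have a direct llvm.atomicrmw form.
enum class ToyBinOp { add, fadd };

// Returns the op kind if `body` is exactly "one binary op + yield of its
// result", and std::nullopt otherwise (meaning: use the cmpxchg loop).
inline std::optional<ToyBinOp> matchTrivialBody(const std::string& iv,
                                                const std::vector<ToyOp>& body) {
  if (body.size() != 2) return std::nullopt;
  const ToyOp& op = body[0];
  const ToyOp& yield = body[1];
  if (yield.name != "atomic_rmw.yield" || yield.operands.size() != 1 ||
      yield.operands[0] != op.result)
    return std::nullopt;
  if (op.operands.size() != 2) return std::nullopt;
  // Exactly one operand must be the loaded value; the other is the RMW operand.
  const bool lhsIsIv = op.operands[0] == iv;
  const bool rhsIsIv = op.operands[1] == iv;
  if (lhsIsIv == rhsIsIv) return std::nullopt;
  if (op.name == "addi") return ToyBinOp::add;
  if (op.name == "addf") return ToyBinOp::fadd;
  return std::nullopt;
}

// Usage check mirroring the two bodies in this proposal.
inline void matchTrivialBodyExample() {
  std::vector<ToyOp> sumBody = {{"addf", {"%iv", "%val"}, "%local"},
                                {"atomic_rmw.yield", {"%local"}, ""}};
  assert(matchTrivialBody("%iv", sumBody) == ToyBinOp::fadd);

  std::vector<ToyOp> maxBody = {{"cmpf", {"%iv", "%val"}, "%cmp"},
                                {"select", {"%cmp", "%iv", "%val"}, "%max"},
                                {"atomic_rmw.yield", {"%max"}, ""}};
  assert(!matchTrivialBody("%iv", maxBody).has_value());
}
```

The @sum body matches and would become a single llvm.atomicrmw fadd, while a cmpf/select body like the one in @max does not match.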
Lowering into llvm.cmpxchg
All other lowerings make use of llvm.cmpxchg. For example, to lower a floating point max reduction:
func @max(%memref : memref<10xf32>, %i : index, %val : f32) {
  atomic_rmw %iv = %memref[%i] : memref<10xf32> {
    %cmp = cmpf "ogt", %iv, %val : f32
    %max = select %cmp, %iv, %val : f32
    atomic_rmw.yield %max : f32
  }
  return
}
Lowers into:
!memref_ptr = type !llvm<"{ float*, float*, i64, [1 x i64], [1 x i64] }*">
!memref_val = type !llvm<"{ float*, float*, i64, [1 x i64], [1 x i64] }">
llvm.func @max(%memref: !memref_ptr, %i : !llvm.i64, %val: !llvm.float) {
  %load = llvm.load %memref : !memref_ptr
  %buf = llvm.extractvalue %load[1] : !memref_val
  %ptr = llvm.getelementptr %buf[%i] : (!llvm<"float*">, !llvm.i64) -> !llvm<"float*">
  %init_loaded = llvm.load %ptr : !llvm<"float*">
  llvm.br ^loop(%init_loaded : !llvm.float)
^loop(%loaded: !llvm.float):
  %cmp = llvm.fcmp "ogt" %loaded, %val : !llvm.float
  %max = llvm.select %cmp, %loaded, %val : !llvm.i1, !llvm.float
  %pair = llvm.cmpxchg %ptr, %loaded, %max acq_rel monotonic : !llvm.float
  %new_loaded = llvm.extractvalue %pair[0] : !llvm<"{ float, i1 }">
  %success = llvm.extractvalue %pair[1] : !llvm<"{ float, i1 }">
  llvm.cond_br %success, ^end, ^loop(%new_loaded : !llvm.float)
^end:
  llvm.return
}
Using the following logic:
    +---------------------------------+
    |   <code before the AtomicRMWOp> |
    |   <compute initial %iv value>   |
    |   br loop(%iv)                  |
    +---------------------------------+
           |
-------|   |
|      v   v
|   +--------------------------------+
|   | loop(%iv):                     |
|   |   <body contents>              |
|   |   %pair = cmpxchg              |
|   |   %new = %pair[0]              |
|   |   %ok = %pair[1]               |
|   |   cond_br %ok, end, loop(%new) |
|   +--------------------------------+
|          |        |
|-----------        |
                    v
    +--------------------------------+
    | end:                           |
    |   <code after the AtomicRMWOp> |
    +--------------------------------+
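The loop can also be read as an ordinary compare-exchange retry loop. The C++ sketch below is an analogy, not the proposed lowering: atomic_rmw here is a hypothetical helper, the relaxed initial load is an assumption (the IR above uses a plain llvm.load), and std::memory_order_relaxed stands in for LLVM's monotonic ordering.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper for illustration: applies `body` to the current value
// of `cell` atomically and returns the value seen just before the update.
template <typename T, typename Body>
T atomic_rmw(std::atomic<T>& cell, Body body) {
  // <compute initial %iv value>
  T loaded = cell.load(std::memory_order_relaxed);
  for (;;) {
    // <body contents>: compute the value to store from the loaded one.
    T desired = body(loaded);
    // cmpxchg: on failure, compare_exchange_strong writes the value currently
    // in `cell` back into `loaded`, which is then fed to the next iteration.
    if (cell.compare_exchange_strong(loaded, desired,
                                     std::memory_order_acq_rel,
                                     std::memory_order_relaxed))
      return loaded;  // success: branch to "end"
  }
}

// Usage check: single-threaded, so the outcome is deterministic.
inline void atomic_rmw_example() {
  std::atomic<float> cell{1.0f};
  auto max_with_3 = [](float iv) { return iv > 3.0f ? iv : 3.0f; };
  float before = atomic_rmw(cell, max_with_3);
  assert(before == 1.0f);
  assert(cell.load() == 3.0f);
}
```

Passing the body as a callable plays the same role as the region of std.atomic_rmw: the retry structure stays fixed and only the computation inside the loop changes.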
Open Questions
AtomicOrdering
This proposal uses the AtomicOrdering::acq_rel value for both the trivial llvm.atomicrmw lowering and the success ordering for llvm.cmpxchg. It also uses AtomicOrdering::monotonic for the failure ordering of llvm.cmpxchg.
Are these the proper orderings for this operation? Should an AtomicOrdering be exposed via the std.atomic_rmw op? If so, which orderings should be exposed?
Future Work
- Lowering of std.atomic_rmw into the upcoming OpenMP dialect.
- Lowering of std.atomic_rmw into the GPU dialect.
- Determine if it makes sense for std.atomic_rmw to be used within the body of a loop.parallel.