Atomic operations: minimal or maximal?

Looking through the various architectures, it seems that the minimal
approach to atomic intrinsics isn't necessarily the best.

If we assume CAS and atomic add, then we can implement atomic N, where
N is some other operation, with a loop. However, for the ll/sc
architectures, this will lower into a double loop (the outer
load-op-CAS loop and the inner CAS loop). On such archs, the atomic op
can be done as one loop. To generate the best code, we would have to
recognize loops that equate to atomic N, and raise them to a more
efficient implementation. The alternative is to implement atomic N
for all the Ns in gcc's atomic ops, and let all the ll/sc archs
generate efficient code easily, and just lower to a loop for x86,
sparc, and ia64.
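
For concreteness, here is a rough sketch of what "atomic N built from CAS" looks like, using fetch-and-or as the N. The helper name fetch_and_or_via_cas is made up for illustration; __sync_val_compare_and_swap is the existing gcc builtin. On an ll/sc target the CAS itself expands to an ll/sc retry loop, so the do/while below ends up wrapped around a second loop.

```c
#include <stdint.h>

/* Hypothetical helper (not an existing API): atomic fetch-and-or
   built only from CAS. Returns the value held in *ptr before the OR. */
static uint32_t fetch_and_or_via_cas(volatile uint32_t *ptr, uint32_t mask)
{
    uint32_t old, seen;
    do {
        old = *ptr;                                   /* load */
        /* op + CAS: store old|mask only if *ptr still equals old */
        seen = __sync_val_compare_and_swap(ptr, old, old | mask);
    } while (seen != old);                            /* lost a race: retry */
    return old;
}
```

With a native atomic-or, an ll/sc backend could instead emit a single ll / or / sc / branch sequence.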

Which is a long way of asking: what do people think, design-wise? Should
we have a large set of atomic ops that most platforms support natively
and that the couple that don't can easily lower, or have a minimal set
and try to raise the lowered gcc atomic ops to efficient code on archs
that support ll/sc (essentially trying to recognize the ld, op, CAS
loops during codegen)?

Andrew

Andrew Lenharth wrote:

Looking through the various architectures, it seems that the minimal
approach to atomic intrinsics isn't necessarily the best.

If we assume CAS and atomic add, then we can implement atomic N, where
N is some other operation, with a loop. However, for the ll/sc
architectures, this will lower into a double loop (the outer
load-op-CAS loop and the inner CAS loop). On such archs, the atomic op
can be done as one loop. To generate the best code, we would have to
recognize loops that equate to atomic N, and raise them to a more
efficient implementation. The alternative is to implement atomic N
for all the Ns in gcc's atomic ops, and let all the ll/sc archs
generate efficient code easily, and just lower to a loop for x86,
sparc, and ia64.
  

Most of the atomic operations in the GCC builtins seem to be variations of fetch_and_phi, where phi is some integer or bitwise operation (and, or, add, sub, inc, dec, etc.). There are relatively few of them, and they'd all be nearly identical to the fetch_and_add implementation on LL/SC architectures, so the development effort is minimal.
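
To illustrate how little varies between the fetch_and_phi flavors, here is a sketch that stamps out several of them from one template. C has no portable way to write ll/sc directly, so this uses the gcc CAS builtin as a stand-in for the retry loop; the macro and the my_fetch_and_* names are invented for the example. Only the expression passed as the second macro argument changes.

```c
#include <stdint.h>

/* Hypothetical template: one retry loop, parameterized by the "phi"
   expression. Inside expr, 'old' is the loaded value, 'v' the operand. */
#define DEFINE_FETCH_AND_PHI(name, expr)                                  \
    static uint32_t name(volatile uint32_t *ptr, uint32_t v)              \
    {                                                                     \
        uint32_t old, seen;                                               \
        do {                                                              \
            old = *ptr;               /* load current value */            \
            seen = __sync_val_compare_and_swap(ptr, old,                  \
                                               (uint32_t)(expr));         \
        } while (seen != old);        /* retry on interference */         \
        return old;                   /* value before the update */       \
    }

DEFINE_FETCH_AND_PHI(my_fetch_and_add, old + v)
DEFINE_FETCH_AND_PHI(my_fetch_and_sub, old - v)
DEFINE_FETCH_AND_PHI(my_fetch_and_and, old & v)
DEFINE_FETCH_AND_PHI(my_fetch_and_or,  old | v)
DEFINE_FETCH_AND_PHI(my_fetch_and_xor, old ^ v)
```

A backend for an LL/SC machine would presumably follow the same pattern, with the one differing instruction sitting between the load-linked and the store-conditional.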

I'd say implement them all (or exclude only those that get very, very little usage). There aren't that many different kinds of atomic builtins, and the analysis to raise atomic op loops looks like a lot of effort.

-- John T.

I'd suggest starting with a minimal set. It's easier to add things lazily as needed than it is to take things out that end up not being needed.

-Chris

Right, that is what is done. And they are sufficient, just not easy
to make efficient. But I mostly agree and we can wait until the PPC
people complain that their locks are too slow.

Andrew