Are x86/ARM likely to support atomics larger than 2 pointers?

There's a discussion over on cfe-commits about how future-proof to
make the C1x/C++11 atomic ABI.
(http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20111010/047647.html)

One argument is that, because C ABI changes are painful, and
processors may introduce larger atomic operations in the future, we
should try to design the atomics implementation in such a way that it
can take advantage of future instruction sets without needing an ABI
change.

The other argument (apologies if I misstate this) is that atomics
larger than 2 pointers aren't useful, so we shouldn't make anything
more expensive than today's implementation needs, just to support
hypothetical instructions that processors may never implement.

If any of the processor designers on this list want to chime in, this
would be a good time to do so, so the wrong decision doesn't get
written in stone until the next ABI change.

Thanks,
Jeffrey

This is more-or-less my argument, but allow me the privilege of restating it.

First, the basic rules of the language. For essentially any type T,
there is a type _Atomic(T). Objects of this type guarantee atomic
access to the value, using one of four operations: load, store,
exchange, and compare-and-exchange. Since T is arbitrary, at
least some types will require the use of locks "under the hood";
consider a 2 KB struct. Obviously, we don't want to use locks if we
don't have to.
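
For concreteness, here is a minimal sketch of the four operations
using the generic functions from C11's <stdatomic.h>. The names
counter and demo_four_ops are made up for illustration, and the
payload is deliberately a plain int so that no lock is needed on
common targets.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Static atomics are zero-initialized to a valid state. */
static _Atomic(int) counter;

/* Exercise load, store, exchange, and compare-and-exchange once each,
   all with the default sequentially consistent ordering. */
int demo_four_ops(void)
{
    atomic_store(&counter, 1);                  /* store */
    int seen = atomic_load(&counter);           /* load: reads 1 */
    int old = atomic_exchange(&counter, 5);     /* exchange: returns 1 */

    int expected = 5;
    /* compare-and-exchange: succeeds because counter == expected */
    bool ok = atomic_compare_exchange_strong(&counter, &expected, 7);
    assert(ok);

    return seen + old + atomic_load(&counter);  /* 1 + 1 + 7 */
}
```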

_Atomic(T) is allowed to be larger and/or more aligned than T, which
gives us some flexibility. It's also illegal to access an _Atomic(T)
object except through specific functions, which gives us some more
flexibility. So that's good.
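
A small sketch of what that flexibility looks like from the
program's side. struct rgb and the helper names are hypothetical;
the actual numbers depend on the compiler and ABI, so the code only
reports them and checks that the atomic type is at least as large
as the plain one.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

struct rgb { unsigned char r, g, b; };   /* 3 bytes, 1-byte aligned */

/* The atomic type may be padded, but it is not expected to shrink. */
static_assert(sizeof(_Atomic(struct rgb)) >= sizeof(struct rgb),
              "atomic type is at least as large as the plain type");

/* Extra bytes the implementation added to the atomic version. */
size_t rgb_atomic_padding(void)
{
    return sizeof(_Atomic(struct rgb)) - sizeof(struct rgb);
}

size_t rgb_plain_align(void)  { return alignof(struct rgb); }
size_t rgb_atomic_align(void) { return alignof(_Atomic(struct rgb)); }
```

Whether the padding is zero or one byte here is exactly the kind of
choice the ABI has to pin down.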

In an unchanging world, the implementation choices here would be
completely obvious:
  1. The target processor can natively support the four atomic
      operations on certain operand sizes, given adequate alignment.
  2. If T isn't too large for all of those, we give _Atomic(T) the size
      and alignment of the best match, and we directly emit the
      appropriate instructions for the operations.
  3a. Otherwise, we're going to need a lock, so we give _Atomic(T)
      the normal size and alignment of T, and we issue calls to the
      C runtime library to do all the operations for us under a global
      lock (probably a striped spin lock).
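
To make (3a) concrete, here is a rough sketch of what such runtime
functions could look like. Everything in it is an assumption for
illustration: the function names, the table size, and the address
hash are invented and are not taken from any existing runtime
library.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STRIPES 64u   /* hypothetical number of locks in the table */

/* Zero-initialized static atomics start out valid, i.e. unlocked. */
static atomic_bool stripe_locks[STRIPES];

/* Choose a lock by hashing the object's address. */
static atomic_bool *stripe_for(const void *obj)
{
    return &stripe_locks[((uintptr_t)obj >> 4) % STRIPES];
}

static void stripe_lock(atomic_bool *l)
{
    while (atomic_exchange_explicit(l, true, memory_order_acquire))
        ;   /* spin until the current holder releases the lock */
}

static void stripe_unlock(atomic_bool *l)
{
    atomic_store_explicit(l, false, memory_order_release);
}

void locked_load(size_t size, const void *obj, void *ret)
{
    assert(obj && ret);
    atomic_bool *l = stripe_for(obj);
    stripe_lock(l);
    memcpy(ret, obj, size);
    stripe_unlock(l);
}

void locked_store(size_t size, void *obj, const void *val)
{
    assert(obj && val);
    atomic_bool *l = stripe_for(obj);
    stripe_lock(l);
    memcpy(obj, val, size);
    stripe_unlock(l);
}

/* On failure, *expected is overwritten with the current value. */
bool locked_compare_exchange(size_t size, void *obj, void *expected,
                             const void *desired)
{
    assert(obj && expected && desired);
    atomic_bool *l = stripe_for(obj);
    stripe_lock(l);
    bool equal = memcmp(obj, expected, size) == 0;
    if (equal)
        memcpy(obj, desired, size);
    else
        memcpy(expected, obj, size);
    stripe_unlock(l);
    return equal;
}
```

Exchange would follow the same pattern as store, copying the old
value out before overwriting it.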

But the world isn't unchanging; several major processor ISAs,
including x86-32, x86-64, and ARM, are all regularly extended
with new instructions. So now (3a) isn't necessarily the right thing
to do: suppose that ARM develops a new "Wide Atomics
Extension" (WAE), and now armv13 chips can do lock-free
operations on 32-byte operands if they're 32-byte-aligned. We
can make the C runtime functions on WAE-compliant systems
check for 32-byte operands that are also 32-byte-aligned and
just use the new instructions, but if the compiler isn't making
objects large enough or aligned enough, that might not kick in,
and then we'll be stuck using locks in situations where it's
ideally unnecessary.
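
The runtime check described above might look roughly like this.
WAE does not exist, so WAE_WIDTH and wae_eligible are purely
hypothetical names; the point is only that the fast path needs both
the right size and the right alignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WAE_WIDTH 32u   /* hypothetical operand size and alignment, bytes */

static_assert((WAE_WIDTH & (WAE_WIDTH - 1)) == 0,
              "width must be a power of two");

/* True if a hypothetical WAE instruction could operate on this object;
   otherwise the runtime would have to fall back to its lock. */
bool wae_eligible(size_t size, const void *obj)
{
    return size == WAE_WIDTH && ((uintptr_t)obj % WAE_WIDTH) == 0;
}
```

A 32-byte object that the compiler only aligned to 8 bytes fails the
second half of the test, which is the case the paragraph above is
worried about.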

So there's an alternative proposal:
  3b. If T is small enough that it's plausible that the ISA might grow
      new atomic operations for it, then we should make _Atomic(T)
      an adequate size and alignment for those operations.
      Specifically, we should do this for sizes of 16, 32, and 64
      bytes, as it's plausible that atomics might grow to a full
      cache line, but no larger.
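
The difference between (3a) and (3b) can be sketched as a single
layout rule with a different size limit. This is only an
illustration: MAX_NATIVE assumes a 64-bit target whose widest native
atomic is two pointers (16 bytes), and the names atomic_layout and
struct layout are invented.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NATIVE 16u   /* assumption: widest native atomic today */
#define MAX_FUTURE 64u   /* full-cache-line ceiling proposed in (3b) */

struct layout { size_t size, align; };

static size_t round_up_pow2(size_t n)
{
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Layout of _Atomic(T) given T's size and alignment. Up to `limit`
   bytes the type is padded and aligned to a power of two; beyond
   that it keeps T's own layout and relies on a lock.
   Pass MAX_NATIVE for (3a) and MAX_FUTURE for (3b). */
struct layout atomic_layout(size_t size, size_t align, size_t limit)
{
    assert(size > 0);
    struct layout l = { size, align };
    if (size <= limit)
        l.size = l.align = round_up_pow2(size);
    return l;
}
```

For a 24-byte, 8-byte-aligned T, the (3a) limit leaves the layout at
24/8, while the (3b) limit gives 32/32.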

Okay, now the arguments. I see them like this:

A. We have to make a decision. We can't make _Atomic(T) larger
or more aligned for an existing type T without changing the ABI.

A1. However, if a language extension adds a new type, that type
can be given new rules.

B. If sizeof(T) isn't a power of 2, (3a) makes _Atomic(T) smaller
than it would be under (3b).

C. (3a) makes _Atomic(T) less aligned than it would be under (3b),
reducing the amount of wasted space when it's embedded in a
struct.

D. Space usage is also important for performance, so (B) and (C)
are bad. They're particularly unconscionable if they don't gain us
anything.
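
To put a rough number on (B) and (C), here is a sketch that emulates
the two layouts with ordinary structs. It assumes an LP64 target
(8-byte pointers), and it uses alignas to imitate the padding that
(3b) would apply; the struct names are made up.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

struct triple { void *p[3]; };   /* 24 bytes, 8-byte aligned on LP64 */

/* (3a): the lock-based atomic keeps T's own size and alignment. */
struct node_3a { char tag; struct triple value; };

/* (3b), emulated: payload padded and aligned to 32 bytes. */
struct wide_triple { alignas(32) struct triple value; };
struct node_3b { char tag; struct wide_triple value; };

static_assert(sizeof(struct node_3b) > sizeof(struct node_3a),
              "(3b) costs space when the atomic is embedded");

/* Bytes per node that (3b) spends relative to (3a). */
size_t extra_bytes_3b(void)
{
    return sizeof(struct node_3b) - sizeof(struct node_3a);
}
```

Under the LP64 assumption the node doubles from 32 to 64 bytes, most
of it padding after the one-byte tag.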

E. Under low contention, a good spin-locking implementation is
probably slower than native atomic operations by ~3-4x. That's
bad, but it's still quite cheap in the large scale of things.

F. Making the C runtime functions check for WAE-compatible
operands is not free.

G. (3b) has no advantage at all if the ISA never grows something
like WAE.

H. I don't see it as likely that anybody would implement something
like WAE.

H1. For one, I don't know of any current chips at all that support
atomics on operands larger than two pointers, except
transactional-memory chips that obviously don't have this problem
in the first place.

H2. Nor do I know of any plans for such an extension, or even
serious proposals for one.

H3. I also don't know of anyone who would find this particularly
useful; having (say) a locklessly atomic quartet of pointers is
not obviously more powerful than having a locklessly atomic
pair of pointers. So the risk that such types end up less efficient
does not worry me much.

I. It is easy enough to add a language extension, say a new
attribute, which makes a specific _Atomic(T) lock-free even
if the ABI says that type generally is not.

So in the final analysis, I think future-proofing for the possibility
of WAE would waste memory for no reasonable prospect of gain.

John.