Inserting a synchronisation before volatile and atomic loads

z/Architecture is defined so that the equivalent of:

    while (*x == 0);

is allowed to reuse the same value of *x indefinitely, even if another
CPU writes to *x and synchronises the result to make it globally visible.
There's no guarantee of forward progress without an explicit synchronisation
on the read side. (Well, forward progress is guaranteed after an interrupt,
and in practice there will always be a hypervisor time slice interrupt
at some point, but that kind of response time wouldn't be good enough
for spin loops.)
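
For concreteness, here is a minimal C sketch of the two source-level forms of that loop, one using a volatile load and one using a C11 atomic load. The helper names spin_volatile and spin_atomic are made up for illustration, and the acquire ordering is just an assumption for the example. The loads inside these loops are the ones that would need a serialisation in front of them:

```c
#include <stdatomic.h>

/* Hypothetical helper: spin until a volatile location becomes nonzero.
   Every iteration performs a separate volatile load in the source. */
static int spin_volatile(const volatile int *x)
{
    int v;
    while ((v = *x) == 0)
        ;   /* busy-wait on the volatile load */
    return v;
}

/* Hypothetical helper: the same loop written with a C11 atomic load.
   The acquire ordering is an assumption made for this sketch. */
static int spin_atomic(atomic_int *x)
{
    int v;
    while ((v = atomic_load_explicit(x, memory_order_acquire)) == 0)
        ;   /* busy-wait on the atomic load */
    return v;
}
```

Both functions return as soon as they observe a nonzero value, so the question is only how quickly a store from another CPU becomes observable to the loop.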

FWIW, the exact quote from the architecture manual is:

  Following is an example showing the effects of serialization. Location
  A initially contains FF hex.

  CPU 1            CPU 2
  MVI A,X'00'      G CLI A,X'00'
  BCR 15,0           BNE G

  The BCR 15,0 instruction executed by CPU 1 is a serializing
  instruction that ensures that the store by CPU 1 at location A is
  completed. However, CPU 2 may loop indefinitely, or until the next
  interruption on CPU 2, because CPU 2 may already have fetched from
  location A for every execution of the CLI instruction. A serializing
  instruction must be in the CPU-2 loop to ensure that CPU 2 will again
  fetch from location A.

(I.e. this is an "unbounded reuse" rather than a memory ordering issue.
The architecture has strong memory ordering and "... *x; ... *y;" will
always behave as if the loads completed in order.)

This means that we need to insert a synchronisation/serialisation
instruction before both volatile and atomic loads. What would be the
best way of doing that?

Following the example of atomic fences, I wondered about adding a new
ISD opcode and inserting it in visitLoad() and co in SelectionDAGBuilder.
Would that be OK? Or is there a better way?