RFC: non-temporal fencing in LLVM IR

ordering guarantees relied upon by the synchronization primitives, so that
non-temporal accesses don't need to be considered when implementing

Then I think an SFENCE following x86 non-temporal stores would be correct.
And empirically we don't need anything before a non-temporal store to
order it with respect to earlier normal stores. But I don't think the latter
conclusion follows from the spec.
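
To make the SFENCE-after-store case concrete, here is a minimal sketch of
what I have in mind, written with the x86 intrinsics rather than IR. The
names payload, ready, publish and consume are made up for illustration, and
the non-x86 fallback just uses plain stores. It assumes the compiler does
not move the streaming stores across _mm_sfence(), which I believe holds
for GCC and Clang but have not checked against their documentation.

```cpp
#include <atomic>
#include <cassert>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  // _mm_stream_si32 (MOVNTI); also provides _mm_sfence
#define HAVE_NT_STORE 1
#else
#define HAVE_NT_STORE 0
#endif

// Hypothetical shared state: a small payload plus a "ready" flag.
static int payload[4];
static std::atomic<bool> ready{false};

// Producer: write the payload with non-temporal stores, then publish it.
void publish(int v) {
  assert(!ready.load(std::memory_order_relaxed) && "publish only once");
  for (int i = 0; i < 4; ++i) {
#if HAVE_NT_STORE
    _mm_stream_si32(&payload[i], v + i);  // weakly ordered NT store
#else
    payload[i] = v + i;                   // plain store on other targets
#endif
  }
#if HAVE_NT_STORE
  // Fence *after* the NT stores, so they become visible before the flag.
  // A release store on x86 is an ordinary MOV and does not order them.
  _mm_sfence();
#endif
  ready.store(true, std::memory_order_release);
}

// Consumer: ordinary acquire load of the flag, then ordinary loads.
int consume() {
  while (!ready.load(std::memory_order_acquire)) {
    // spin until the producer has published
  }
  return payload[0] + payload[1] + payload[2] + payload[3];
}
```

The intent is that publish() runs on one thread and consume() on another,
and that the consumer needs nothing beyond its usual acquire load. Nothing
is placed before the non-temporal stores, which is the part I can only
justify empirically.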

I looked at the MOVNTDQA non-temporal load documentation again, and I'm
confused. It sounds like so long as the memory is WB-cacheable, we may be
OK without any fences. But I can't tell that for sure. In the WC case, a
LOCKed instruction seems to be documented to work as a fence.

In the ARM LDNP case, things seem to be messy. I don't think we currently
need fences for C++, since we don't normally use the dependency-based
ordering guarantees. (Except to prevent out-of-thin-air results, which
don't seem to be precluded by the ARM spec. Intentional or bug?) But the
difference does matter when implementing Java final fields or

I'm actually getting a little worried that these things are just too
idiosyncratic to reflect in portable intrinsics.