Element atomic vector stores: just do it?

Hi everyone,

Recently I started working on optimizing the @llvm.memcpy.element.unordered.atomic intrinsics to make them available and widely usable. Currently, LLVM provides a LangRef description for them, but no built-in lowering. To change this situation, some steps need to be taken.

The obvious way to lower @llvm.memcpy.element.unordered.atomic is something like

for (i = 0; i < len; i++) {
  v = load atomic src[i]
  store atomic v, dest[i]
}
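For illustration, the per-element lowering above can be sketched with C11 atomics. This is a hypothetical helper (the name and fixed i32 element type are mine), using memory_order_relaxed as the closest C11 analogue of LLVM's unordered ordering (relaxed is in fact slightly stronger):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the naive lowering: one atomic load and one
 * atomic store per element, so no element is ever torn, but the copy
 * proceeds one element at a time. */
static void element_atomic_memcpy_i32(_Atomic uint32_t *dest,
                                      _Atomic uint32_t *src,
                                      size_t len) {
    for (size_t i = 0; i < len; i++) {
        uint32_t v = atomic_load_explicit(&src[i], memory_order_relaxed);
        atomic_store_explicit(&dest[i], v, memory_order_relaxed);
    }
}
```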

But this code is painfully slow compared to efficient implementations of regular memcpy. What we would really want is

for (i = 0; i < len; i += stride) {
  vector_v = load element atomic <stride x i32> src[i]
  store element atomic vector_v, dest[i]
}

However, currently there is no way to express this concept in LLVM. What we can do is

for (i = 0; i < len; i += stride) {
  vector_v = load <stride x i32> src[i]
  store vector_v, dest[i]
}
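In C terms, that plain (non-atomic) strided copy could look like the following sketch. The helper name and the 4 x i32 stride are illustrative; the two memcpy calls play the role of the vector load and store, and nothing here promises that the compiler will keep each 16-byte chunk as a single xmm access, which is exactly the problem discussed below. It assumes len is a multiple of the stride:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sketch of the plain strided copy: move 16 bytes (4 x i32, one xmm
 * register) per iteration. There is NO atomicity guarantee: any pass
 * or the backend is free to split each chunk into smaller accesses. */
static void chunked_memcpy_i32(uint32_t *dest, const uint32_t *src,
                               size_t len) {
    const size_t stride = 4; /* elements per chunk */
    for (size_t i = 0; i < len; i += stride) {
        unsigned char chunk[16];
        memcpy(chunk, &src[i], sizeof chunk);  /* vector_v = load <4 x i32> */
        memcpy(&dest[i], chunk, sizeof chunk); /* store vector_v, dest[i]   */
    }
}
```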

When the maximum vector size is supported on the platform, we can hope (but only hope!) that it will be lowered into the corresponding vector loads/stores, such as ymm/xmm loads/stores on X86. But:

  • There is no guarantee of that. Some IR pass may theoretically break up the vectors as it pleases, because there is no atomicity requirement.
      • I don’t think any pass does this in practice, but they have the right to.
  • Even if it is lowered into ymm/xmm or something similar, there is no guarantee of atomicity for xmm or ymm accesses. So even if this code is lowered into the corresponding ymm stores, the whole store might not be atomic (especially if it crosses a cache-line boundary).
  • At the codegen level, a ymm load/store may be torn. For example, I found this pass, which says:

https://github.com/llvm-mirror/llvm/blob/master/lib/Target/X86/X86AvoidStoreForwardingBlocks.cpp

// The pass currently only handles cases where memcpy is lowered to
// XMM/YMM registers, it tries to break the memcpy into smaller copies.
// breaking the memcpy should be possible since there is no atomicity
// guarantee for loads and stores to XMM/YMM.

So tearing of non-atomic xmm/ymm loads is a real thing that must be accounted for.

However, there is a silver lining. The specification of @llvm.memcpy.element.unordered.atomic says:

For each of the input pointers align parameter attribute must be specified. It must be a power of two no less than the element_size. Caller guarantees that both the source and destination pointers are aligned to that boundary.

So if we somehow enforce lowering of these vector stores into hardware-supported operations and prohibit other passes from tearing them apart, we’ll have ymm loads aligned to some basic type (say i32). It is widely believed that on X86, although xmm/ymm stores are not atomic, they do not tear individual words as long as the access is aligned to the word width (please correct me if that’s not true!). So by enforcing straightforward lowering of such loads into ymm/xmm loads, and prohibiting all passes (including codegen) from touching them, we should get the desired atomic behavior on X86.

This might not be true for other platforms, but we should start somewhere.

Proposal

We could add another flag to load/store instructions, similar to atomic, called element-atomic, which only matters for vector loads and stores. We can guarantee that these loads survive until codegen only on platforms that support them. We can also go through all passes that potentially tear vectors (I don’t think there are many) and prohibit them from touching these loads and stores. We’ll also need an assert (on X86) that the pointers are properly aligned.
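The alignment precondition that this assert would check can be sketched as a simple predicate. The helper name is illustrative; it relies on the spec's guarantee that the alignment is a power of two, which is what ensures a wider xmm/ymm access can never split an individual element:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the alignment check: true iff p is aligned to `align`
 * bytes. `align` is assumed to be a power of two, as the
 * @llvm.memcpy.element.unordered.atomic spec requires, so the
 * bitmask test below is valid. */
static bool is_aligned(const void *p, size_t align) {
    return ((uintptr_t)p & (align - 1)) == 0;
}
```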

It doesn’t look like a very hard task (the only hard part is finding all such places), but maybe there are pitfalls I don’t know about.

Please chime in if you have an opinion on how best to do this, or on what problems you anticipate.

Thanks,

Max

As a practical matter of implementation, it is hard to imagine that an implementation wouldn’t use at least word-size aligned accesses if it split an SSE/AVX op, and as far as I know every actual implementation has done so, but this is not actually guaranteed by the x86 ISA, nor by Intel’s or AMD’s documentation.

– Steve

I guess there are two separate questions here.

1) Currently element atomic memcpys are lowered to calls to predefined symbols. If we want to provide an efficient lowering instead, how do we do this?

We can follow the same scheme as the regular memcpy. There is a target-specific hook in SelectionDAGTargetInfo::EmitTargetCodeForMemcpy. We can introduce a similar hook for element-atomic memcpy. I’m not sure though how hard it would be to emit a vectorized loop in SelectionDAG representation.

We can have a middle-end pass which does the lowering instead. We can either emit a hand-rolled vectorized loop or we can emit a naive loop and rely on the vectorizer to optimize it.

If we do the middle-end lowering, we would need to think about interactions with the backend optimizations for element-atomic memcpys. E.g. an element-atomic memcpy with a small constant length is currently converted into a series of loads and stores in the backend. If you rewrite all element-atomic memcpys into vectorized loops in the middle-end, you might obscure this optimization.
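To illustrate the backend optimization being obscured: for a small constant length, the intrinsic can be expanded into a straight-line sequence of per-element atomic loads and stores, with no loop or call at all. A hedged C sketch (the function name and the fixed length of 4 are illustrative, and memory_order_relaxed again stands in for unordered):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sketch of the small-constant-length expansion: 4 elements become
 * 4 unrolled atomic load/store pairs, which a vectorized-loop
 * lowering in the middle-end would hide from the backend. */
static void element_atomic_copy4_i32(_Atomic uint32_t *dest,
                                     _Atomic uint32_t *src) {
    atomic_store_explicit(&dest[0],
        atomic_load_explicit(&src[0], memory_order_relaxed),
        memory_order_relaxed);
    atomic_store_explicit(&dest[1],
        atomic_load_explicit(&src[1], memory_order_relaxed),
        memory_order_relaxed);
    atomic_store_explicit(&dest[2],
        atomic_load_explicit(&src[2], memory_order_relaxed),
        memory_order_relaxed);
    atomic_store_explicit(&dest[3],
        atomic_load_explicit(&src[3], memory_order_relaxed),
        memory_order_relaxed);
}
```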

2) If we decide to do the middle-end lowering we would need to solve the element atomic vector problem. If we emit a naive loop today, it won’t be vectorized because an element-wise atomic vector can’t be represented in the IR. If we decide to write a hand-rolled vectorized loop we will have the same issue.

Here is some previous discussion on the topic of introducing element-wise atomic vectors.
https://lists.llvm.org/pipermail/llvm-dev/2015-December/093237.html

Philip might have better insight on what it will take to introduce this construct, since he attempted it back then.

Artur
