Question on Fence Instruction

Hi,

I have a question about the latest LLVM release, which supports the fence instruction in the IR. If I intentionally place a sequentially consistent (SC) fence instruction somewhere in the code, will the transformation passes applied later respect the fence and refrain from reordering memory accesses across it?

Going further, if I place an SC fence immediately after every load/store instruction before any optimization is done, and then apply the standard compiler transformations, will the resulting code be SC-preserving? In other words, does the compiler avoid reordering potentially shared accesses that are "protected" by these fences?
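
For concreteness, here is a rough sketch of the kind of IR I have in mind (the function and global are made-up, and the exact load/store syntax differs a bit between LLVM releases), with a seq_cst fence inserted after each memory access:

  ; Hypothetical example: a counter that other threads may also touch.
  ; Each load/store is followed by an SC fence, in the hope that the
  ; optimizer will not move memory accesses across the fences.
  @counter = global i32 0

  define void @bump() {
  entry:
    %old = load i32, i32* @counter
    fence seq_cst
    %new = add i32 %old, 1
    store i32 %new, i32* @counter
    fence seq_cst
    ret void
  }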

Thank you.

Yuelu

In theory, all optimization passes should respect SC fences. If you find a
counterexample, I think it's a bug.

HTH,
chenwj

Thank you very much for the quick reply. I was trying to confirm that what I did was correct. I ran a test of a simple way to get SC-preserving compilation: insert fences after every load/store instruction before any optimizations, apply the standard optimizations, and then remove the fences after assembly code generation. It turned out that such SC-preserving compilation only caused a ~4% slowdown on average over 18 benchmarks on an Intel Xeon machine. The result surprised me a lot, because a recent PLDI paper (which also uses LLVM) reports that such a naive compilation scheme can cause a 20% slowdown, so I posted this question. I will try to examine whether the generated binary code really respects the SC fences.

Yuelu

Perhaps I'm misunderstanding something, but why are you removing the
fences before code generation? I would think that removing the fences
would permit the hardware to re-order loads and stores in a way that
violates sequential consistency. In other words, while you've ensured
that the compiler doesn't do anything to violate sc, you're letting the
hardware violate sc.

Are you compiling for a machine that is sequentially consistent by default?

Also, to what PLDI paper are you referring?

-- John T.

My 2 cents is that maybe x86 already has a strong enough memory model that
there isn't much performance loss once you remove those SC fences.

HTH,
chenwj

Hi,

The paper is "A Case for an SC-Preserving Compiler" from PLDI 2011. What I did follows their "naive SC-preserving compilation", which prevents the compiler from doing any reordering of potentially shared load/store instructions. The paper reports that the resulting code running on an x86 machine (an SC-preserving binary on non-SC hardware) shows a 22% slowdown compared with normally optimized code running on the same machine (a non-SC binary on non-SC hardware). The experiment measures how much performance is lost by restricting the reordering of shared load/store instructions, i.e., by the disabled compiler transformations. The fences are removed from the assembly code because they are too costly; otherwise the performance loss from the compilation restriction could not be measured independently.

The result I get shows that such a reordering restriction in compilation only leads to a ~4% slowdown, way less than the paper reports. The reason could be that the compiler does not respect the SC fences, so unexpected reordering still happens and leads to better performance. It could also be that their implementation is different from mine. I am not sure.
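
As a first check, I plan to run opt on tiny kernels like the hypothetical one below and see whether passes such as GVN merge the two loads across the fence; my understanding is that passes are supposed to treat a seq_cst fence conservatively, as if it might read and write any memory:

  @x = global i32 0

  define i32 @read_twice() {
  entry:
    %a = load i32, i32* @x
    fence seq_cst
    %b = load i32, i32* @x      ; should remain a separate load
    %r = add i32 %a, %b
    ret i32 %r
  }

If opt -gvn (or -O2) rewrites %b as a reuse of %a here, then the fences are not restricting the compiler for non-atomic accesses, which could explain the smaller slowdown I see.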

-Yuelu

  Perhaps the best way is to ask the authors? ;-) Anyway, I skimmed through
the abstract and it says:

   An SC-preserving compiler, obtained by restricting the optimiza-
   tion phases in LLVM, a state-of-the-art C/C++ compiler, incurs an
   average slowdown of 3.8% and a maximum slowdown of 34% on
   a set of 30 programs from the SPLASH-2, PARSEC, and SPEC
   CINT2006 benchmark suites.

Note that the average slowdown is 3.8%, which is pretty close to your
result. The parallel benchmarks (SPLASH-2 and PARSEC) show the 22%
slowdown described in section 3.4; I guess that's because they are
parallel programs, so the reordering restrictions have a much bigger
impact on their performance.

HTH,
chenwj