Since PR32143 (https://github.com/llvm/llvm-project/issues/31491) we’ve disabled using stack coloring for the unsafe stack by default, due to reported miscompilations. Is the stack-coloring optimization safe to use, given that report?
Fuchsia would like to be able to use this optimization, since it compiles with SafeStack by default. But we are not sure if it is safe to do so, given that the optimization is disabled by default in SafeStack.cpp due to this bug .
If the optimization is safe to use, then how do we vet that the original miscompilations are no longer an issue? Do we have guidelines for re-enabling optimizations after they cause miscompilation?
For context, the original report was from 2017. The compiler has had a significant amount of development in the intervening 5 years. I am unable to produce the miscompilations using the reproducer provided in the original report with clang releases going back until at least clang-9 (i.e., compiling with safe stack and safe stack coloring). I only stopped at clang-9 because I would have to change my development environment, so it may fail to reproduce with even older versions of clang. I also looked through the SafeStack tests, and didn’t find anything related to this issue either. I don’t feel confident saying that proves the optimization is now safe, but I’m unsure how to confirm it would still be a problem for the compiler today.
If the test cases work now, it seems like it should be safe to turn back on. It would be nice to know which change caused them to pass, though.
I can try out some older versions. Is this how you were trying to reproduce the failures: clang -fsanitize=safe-stack -mllvm --safe-stack-coloring -O2 test.c
Thank you. It bothered me that I had incomplete info, so I went and repeated my testing going back to clang-4 on an Intel Xeon, and AMD Ryzen 7 with identical results. The bug appears in clang-4 to clang-7, but stops manifesting in clang-8. I still think it would be good to get confirmation on another system.
Bisecting may be our only recourse here, though that would take a significant amount of time. I have a suspicion that its a combination of transforms interacting rather than something specific in the coloring algorithm. I didn’t see anything that stood out in the blame list for the SafeStack passes, but my quick look was hardly comprehensive. If identifying the change is a blocker I can start bisecting today.
Yes that is how I tested the reproducer. I also tried out -O3 but the behavior seems to be the same.
Thanks for pointing this out. I realized we probably have some old artifacts in Fuchsia’s CI infrastructure I could use in a similar (if cumbersome) fashion.
There are a few interesting commits in that range. The changes to DAGCombiner seem likely, but one of the changes to GVN also seems like it could be related.
Based on what this commit does, it makes me think that the bug is still there but just hidden. I think we need more analysis of what was actually causing the crash in the first place.
I spent some time investigating this issue, and I believe the root cause has been fixed now in the main branch. Maybe someone with more knowledge of this pass can double check my analysis:
I’ve uploaded an LLVM IR tests case (with .txt extension to get around Discourse file-type rules) which can be used to reproduce the issue. I believe the problem is that the SafeStackColoring pass was miscalculating the live range of this alloca %temp.i = alloca [32 x i8], align 16 The pass was not tracking live ranges through PHI instructions and was missing the uses of this alloca in the if.end block
The best way to observe this is to build commit 64ad0ad5ed92fc9881075cffebce418d05f61288 and run: opt --safe-stack-coloring -debug-only=safestackcoloring -safe-stack safe-stack-coloring-bug.txt -S -o /dev/null
When I run this test case with an opt build from the latest main branch, I can see that the live range looks to be correctly calculated: opt -debug-only=stack-lifetime --safe-stack-coloring -debug-only=safestackcoloring -safe-stack safe-stack-coloring-bug.txt -S -o /dev/null
From my perspective this bug is fixed, and I think we can enable the pass in main now.