Summary
We discovered a hard-to-reproduce race condition when building the Linalg dialect. It turns out to happen only on our highest-core-count (256) build machines, with make -j$(nproc). Personally, I have never been able to reproduce it on my dev machine. It fails in roughly 1 out of ~50 builds of the same code base, and it has persisted for roughly the past half year.
Environment
Build tool: Make (instead of Ninja)
OS: Multiple OSes, including Ubuntu and CentOS
Machine nproc: 256
Failure Signature
…
2022-04-19T04:42:06.794272790Z [ 47%] Built target MLIRLinalgOpsIncGen
2022-04-19T04:42:06.794202669Z Included from /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/deps/cget/build/tmp-fa6ce9b8cc6a4fda9ed7d678dcc4b43a/llvm-project-mlir-release-rocm-5.2/external/llvm-project/mlir/include/mlir/Dialect/Linalg/IR/LinalgStructuredOps.td:274:
2022-04-19T04:42:06.794290490Z /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/deps/cget/build/tmp-fa6ce9b8cc6a4fda9ed7d678dcc4b43a/build/external/llvm-project/llvm/tools/mlir/include/mlir/Dialect/Linalg/IR/LinalgNamedStructuredOps.yamlgen.td:3735:20: error: Couldn’t find class ‘LinalgStructur’
2022-04-19T04:42:06.794297140Z def SoftPlus2DOp : LinalgStructur
2022-04-19T04:42:06.794302770Z ^
2022-04-19T04:42:06.794308300Z gmake[2]: *** [external/llvm-project/llvm/tools/mlir/include/mlir/Dialect/Linalg/IR/CMakeFiles/MLIRLinalgStructuredOpsIncGen.dir/build.make:123: external/llvm-project/llvm/tools/mlir/include/mlir/Dialect/Linalg/IR/LinalgStructuredOps.cpp.inc] Error 1
2022-04-19T04:42:06.794314851Z gmake[2]: *** Waiting for unfinished jobs…
2022-04-19T04:42:06.794320581Z Included from /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/deps/cget/build/tmp-fa6ce9b8cc6a4fda9ed7d678dcc4b43a/llvm-project-mlir-release-rocm-5.2/external/llvm-project/mlir/include/mlir/Dialect/Linalg/IR/LinalgStructuredOps.td:274:
2022-04-19T04:42:06.794327081Z /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/deps/cget/build/tmp-fa6ce9b8cc6a4fda9ed7d678dcc4b43a/build/external/llvm-project/llvm/tools/mlir/include/mlir/Dialect/Linalg/IR/LinalgNamedStructuredOps.yamlgen.td:3735:20: error: Couldn’t find class ‘LinalgStructur’
2022-04-19T04:42:06.794346731Z def SoftPlus2DOp : LinalgStructur
2022-04-19T04:42:06.794352501Z ^
2022-04-19T04:42:06.794358121Z gmake[2]: *** [external/llvm-project/llvm/tools/mlir/include/mlir/Dialect/Linalg/IR/CMakeFiles/MLIRLinalgStructuredOpsIncGen.dir/build.make:174: external/llvm-project/llvm/tools/mlir/include/mlir/Dialect/Linalg/IR/LinalgStructuredOps.h.inc] Error 1
2022-04-19T04:42:06.794364351Z gmake[1]: *** [CMakeFiles/Makefile2:26904: external/llvm-project/llvm/tools/mlir/include/mlir/Dialect/Linalg/IR/CMakeFiles/MLIRLinalgStructuredOpsIncGen.dir/all] Error 2
2022-04-19T04:42:06.794370471Z gmake[1]: *** Waiting for unfinished jobs…
Analysis
There are a couple of targets involved:
- LinalgOdsGen: generates the tablegen file from the YAML file (LinalgNamedStructuredOps.yaml)
- MLIRLinalgStructuredOpsIncGen: generates a header from the tablegen file produced by LinalgOdsGen
What likely happens is:
- On a machine that cannot run a highly parallel build, the two targets happen to run in the right order:
  1. LinalgOdsGen finishes, and the tablegen file is fully written to disk.
  2. The generated tablegen file is picked up by MLIRLinalgStructuredOpsIncGen.
- On a machine that can run a highly parallel build, there is a race condition:
  1. LinalgOdsGen finishes compiling, the output file is created, and the tablegen is only partially written to disk.
  2. Before step 1 is completely done, MLIRLinalgStructuredOpsIncGen picks up the partially written tablegen, finds that its content does not make sense, and reports a failure.
  3. The build carries on with its unfinished jobs and actually gives up only after step 1 is completely done. Therefore, in an environment where the build failed, I see a fully constructed tablegen file and have no direct evidence of the partially written one.
I think the underlying problem is that the LinalgOdsGen target (here) uses the paradigm below to try to enforce an ordering:
add_custom_target(
  myCustomTarget
  COMMAND foo outfile
)
add_dependencies(myTarget myCustomTarget)
The ordering dependency does get set up correctly. However, later build stages can still use the incompletely written outfile. I cannot figure out a way to force later build stages to wait until the custom target has run to full completion.
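For comparison, here is a minimal sketch of another commonly used CMake arrangement. It is untested against this failure and is not a confirmed fix. It declares the generated file as the OUTPUT of add_custom_command, has the generator write to a temporary name that is renamed only after the write finishes, and makes consumers depend on a single driver target. All names here (foo, outfile, infile.yaml, myCustomTarget, myTarget) are placeholders, and it is an assumption that the generator can be pointed at a temporary output path.

```cmake
# Hypothetical sketch: foo, outfile, infile.yaml, myCustomTarget and myTarget
# are placeholders, not the real MLIR names.
set(GEN_FILE ${CMAKE_CURRENT_BINARY_DIR}/outfile)

# Declare the generated file as an OUTPUT so the build system tracks it and
# only reruns the generator when the input changes.
add_custom_command(
  OUTPUT ${GEN_FILE}
  # Write to a temporary name first (assumes foo accepts an output path) ...
  COMMAND foo ${GEN_FILE}.tmp
  # ... then rename, so a reader never sees a half-written file at GEN_FILE.
  COMMAND ${CMAKE_COMMAND} -E rename ${GEN_FILE}.tmp ${GEN_FILE}
  DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/infile.yaml
  COMMENT "Generating outfile"
  VERBATIM
)

# A single driver target owns the custom command; consumers depend on the
# target instead of each listing the output file themselves.
add_custom_target(myCustomTarget DEPENDS ${GEN_FILE})
add_dependencies(myTarget myCustomTarget)
```

The single driver target matters because the CMake documentation for add_custom_command warns against listing the same output in more than one independent target that may build in parallel, since the instances of the rule can conflict. Whether that is what happens here is a guess.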
Suggestion
Commit yamlgen.td and yamlgen.cpp.inc instead of generating them dynamically, and make their generation a manual target, so that the developer who updates the YAML file has to run the codegen and commit the result together with it. This way we would not need the custom command to run before the tablegen step, which avoids the race condition.
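As a rough illustration of what such a manual target could look like (a sketch with placeholder names, not the actual MLIR build rules): a custom target declared without ALL is left out of the default build, so it runs only when a developer asks for it. The names foo, outfile, infile.yaml and regenerate-yamlgen are hypothetical.

```cmake
# Hypothetical sketch: foo, outfile, infile.yaml and the target name are
# placeholders. Without the ALL keyword this target is not part of the default
# build, and nothing should add_dependencies() on it.
add_custom_target(regenerate-yamlgen
  # Write the generated file into the source tree so it can be committed.
  COMMAND foo ${CMAKE_CURRENT_SOURCE_DIR}/outfile
  DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/infile.yaml
  COMMENT "Regenerating the checked-in outfile; commit the result"
  VERBATIM
)
```

With a Makefile generator this would be invoked explicitly, for example as `make regenerate-yamlgen`. The regular build then reads only the committed file.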