LinalgOdsGen target causing build-time race condition that fails the build

Summary

We discovered a hard-to-reproduce race condition when building the Linalg dialect. It only happens on our highest core count (256) build machines with make -j$(nproc); I have never been able to reproduce it on my dev machine. The failure rate is roughly 1 out of ~50 builds of the same code base, and it has persisted for roughly the past half year.

Environment

Build tool: Make (instead of ninja)
OS: Multiple OSes, including Ubuntu and CentOS
Machine nproc: 256

Failure Signature


2022-04-19T04:42:06.794272790Z [ 47%] Built target MLIRLinalgOpsIncGen
2022-04-19T04:42:06.794202669Z Included from /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/deps/cget/build/tmp-fa6ce9b8cc6a4fda9ed7d678dcc4b43a/llvm-project-mlir-release-rocm-5.2/external/llvm-project/mlir/include/mlir/Dialect/Linalg/IR/LinalgStructuredOps.td:274:
2022-04-19T04:42:06.794290490Z /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/deps/cget/build/tmp-fa6ce9b8cc6a4fda9ed7d678dcc4b43a/build/external/llvm-project/llvm/tools/mlir/include/mlir/Dialect/Linalg/IR/LinalgNamedStructuredOps.yamlgen.td:3735:20: error: Couldn’t find class ‘LinalgStructur’
2022-04-19T04:42:06.794297140Z def SoftPlus2DOp : LinalgStructur
2022-04-19T04:42:06.794302770Z ^
2022-04-19T04:42:06.794308300Z gmake[2]: *** [external/llvm-project/llvm/tools/mlir/include/mlir/Dialect/Linalg/IR/CMakeFiles/MLIRLinalgStructuredOpsIncGen.dir/build.make:123: external/llvm-project/llvm/tools/mlir/include/mlir/Dialect/Linalg/IR/LinalgStructuredOps.cpp.inc] Error 1
2022-04-19T04:42:06.794314851Z gmake[2]: *** Waiting for unfinished jobs…
2022-04-19T04:42:06.794320581Z Included from /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/deps/cget/build/tmp-fa6ce9b8cc6a4fda9ed7d678dcc4b43a/llvm-project-mlir-release-rocm-5.2/external/llvm-project/mlir/include/mlir/Dialect/Linalg/IR/LinalgStructuredOps.td:274:
2022-04-19T04:42:06.794327081Z /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/MLOpen/deps/cget/build/tmp-fa6ce9b8cc6a4fda9ed7d678dcc4b43a/build/external/llvm-project/llvm/tools/mlir/include/mlir/Dialect/Linalg/IR/LinalgNamedStructuredOps.yamlgen.td:3735:20: error: Couldn’t find class ‘LinalgStructur’
2022-04-19T04:42:06.794346731Z def SoftPlus2DOp : LinalgStructur
2022-04-19T04:42:06.794352501Z ^
2022-04-19T04:42:06.794358121Z gmake[2]: *** [external/llvm-project/llvm/tools/mlir/include/mlir/Dialect/Linalg/IR/CMakeFiles/MLIRLinalgStructuredOpsIncGen.dir/build.make:174: external/llvm-project/llvm/tools/mlir/include/mlir/Dialect/Linalg/IR/LinalgStructuredOps.h.inc] Error 1
2022-04-19T04:42:06.794364351Z gmake[1]: *** [CMakeFiles/Makefile2:26904: external/llvm-project/llvm/tools/mlir/include/mlir/Dialect/Linalg/IR/CMakeFiles/MLIRLinalgStructuredOpsIncGen.dir/all] Error 2
2022-04-19T04:42:06.794370471Z gmake[1]: *** Waiting for unfinished jobs…

Analysis

There are a couple of targets involved:

  • LinalgOdsGen: generates the TableGen file from the YAML file (LinalgNamedStructuredOps.yaml)
  • MLIRLinalgStructuredOpsIncGen: generates a header from the TableGen file produced by LinalgOdsGen

What likely happened is:

  • On a machine that cannot do a highly parallelized build, the two targets run in the right order:
    1. LinalgOdsGen finishes, and the TableGen file is fully written out to disk.
    2. The generated TableGen file is picked up by MLIRLinalgStructuredOpsIncGen.
  • On a machine that can do a highly parallelized build, there is a race condition:
    1. LinalgOdsGen is running: the output file has been created, but the TableGen content is only partially written to disk.
    2. Before step 1 is completely done, MLIRLinalgStructuredOpsIncGen picks up the partially written TableGen file, finds that its content doesn’t make sense, and decides this is a failure.
    3. CMake proceeds with the unfinished tasks and only gives up after step 1 is completely done. Therefore, in an environment where the build failed, I can see a fully constructed TableGen file and have no direct evidence of the partially written one.

I think the underlying problem is that the LinalgOdsGen target (here) uses the paradigm below to try to enforce an ordering:

add_custom_target(
    myCustomTarget
    COMMAND foo outfile
)
add_dependencies(myTarget myCustomTarget)

The ordering does get propagated correctly. However, later build stages will use the incompletely written outfile. I can’t seem to find a way to force later build stages to wait until the custom target has run to full completion.
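For comparison, here is a minimal sketch of the file-level alternative (not the actual MLIR CMake; foo, bar, outfile, and consumer_out are placeholder names). When the generated file is declared as the OUTPUT of an add_custom_command, consumers can list the file itself in DEPENDS, and the file is only considered available once the producing command has exited:

add_custom_command(
    OUTPUT outfile
    COMMAND foo outfile
)
add_custom_target(myCustomTarget DEPENDS outfile)

# A consumer depends on the generated file, not just on the target:
add_custom_command(
    OUTPUT consumer_out
    COMMAND bar outfile consumer_out
    DEPENDS outfile
)

Whether this maps cleanly onto the Linalg setup is unclear to me, since the tablegen macros and cross-directory file dependencies are exactly where this tends to get tricky.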

Suggestion

Commit the yamlgen.td and yamlgen.cpp.inc files instead of having them dynamically generated, and make their generation a manual target. The developer who updates the YAML file would then have to run the codegen and commit the result together with the change. This way we wouldn’t need the custom command to run before TableGen, thereby avoiding the race condition.

Thank you for the report. We had problems with this but thought they had been fixed. I will take some time to look at this in the next few days.

Thank you. I have followed your past couple of fixes, but unfortunately they didn’t fix the failure I observed. (So this is likely an orthogonal problem to the ones you have addressed.)

I could submit a CL to implement the “Suggestion” section, but I actually prefer the existing approach of having the CMake logic take care of the generation for us. I’m not a CMake expert, so I’m hoping we can find a better approach to address this annoying race condition.

Why would this happen only with the output of the YAML generator and not the output of any other tool? Understanding this should hint towards the right fix, I think.

Are there any other instances where a codegen step needs to happen before a TableGen step? I suspect not, but please point them out if there are; I’d be interested in finding out how the ordering is controlled there. Also, just because other instances don’t show this problem doesn’t eliminate the possibility of a race condition (the custom command may write a short enough output that it always manages to finish before the next custom command).

My theory is that these nested dependencies across different custom commands (using a file as the dependency) create a dependency only on the existence of the file, not on the completion of writing to it. With that in mind, we may be able to launch another custom command inside add_linalg_ods_yaml_gen that touches/creates a dummy file after the existing custom command. Then, instead of

add_custom_target(LinalgOdsGen
  DEPENDS
  MLIRLinalgNamedStructuredOpsYamlIncGen
)

we would amend it to

add_custom_target(LinalgOdsGen
  DEPENDS
  dummy_file
)
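where dummy_file would be produced by an extra custom command along these lines (a hypothetical sketch; ${YAMLGEN_OUTPUT_FILE} stands in for whatever file the existing custom command in add_linalg_ods_yaml_gen actually writes):

add_custom_command(
  OUTPUT dummy_file
  # Touch the stamp only after the yamlgen output file is up to date.
  COMMAND ${CMAKE_COMMAND} -E touch dummy_file
  DEPENDS ${YAMLGEN_OUTPUT_FILE}
)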

CMake handles target-level and file-level dependencies slightly differently, which is notoriously difficult to get right with second-level generated files. It doesn’t surprise me that something is wonky here, but I would need to look at it more closely. IIRC, there was some interaction with the tablegen macro that made something awkward.

I don’t quite get how the build system would rely on the file for dependency tracking during execution of the commands: if the build system knows about the dependency between two commands (and it likely does here, otherwise you would be able to reproduce without parallelism by building the failing target directly from a clean state), then it won’t start executing a command before the previous one has finished (that is, the process has terminated).

So either there is a problem with flushing data to the filesystem or something, or maybe there are two commands that can write to the same output file, and tablegen is executed while the second one clobbers the output of the first.

What I’ve seen before is that it will set up the wrong dependency such that nothing stops the producer and consumer from running concurrently, but the target-level deps make this manifest only when “far away” things can execute concurrently.

Question to the OP: is this on clean builds or incremental?

Right, what you’re describing is a case of a missing dependency. But that would always reproduce, irrespective of parallelism, by invoking the build system with just the failing action (the consumer), because in that case the other one (the producer) would not be triggered at all.
(for example ninja tools/mlir/include/mlir/Dialect/Linalg/IR/LinalgStructuredOps.h.inc)

Generally I agree with your simplification, but I have also seen this specific case be subtle, even with generator-specific behavior. I’m not saying you are wrong, but I’ve learned to be cautious about predicting the issue without getting into the details of this kind of dep.

These are clean builds starting from newly created docker containers.

FWIW, I just ran an experiment on my previous assumption by inserting the following chunk into the compilation chain (and having the later stages reference ${TEMP_FILE} as a dependency):

add_custom_command(
  OUTPUT ${TEMP_FILE}
  COMMAND touch ${TEMP_FILE} && sleep 300
  DEPENDS
  ${MLIR_LINALG_ODS_YAML_GEN_EXE}
)

According to my assumption, the build should proceed and finish the rest of the compilation process, since ${TEMP_FILE} is created much earlier than the entire shell command completes. However, from what I observed, the build correctly waits until the sleep 300 has finished before proceeding with further execution. This seems to confirm that my theory is wrong, and now I can’t think of how the original sporadic failure can happen.

(If it helps with context, this is all within a -j [very very many] build for release QA that is using Unix Makefiles to build [our fork of] MLIR for uniformity’s sake.)

Can we get this case filed as an issue on the llvm-project GitHub? I’m going to have a hard time tracking progress with loose Discourse threads.

For sure, I’ll post a link to the issue once it’s filed.