Can't use cancelRootUpdate, keeps crashing

Hi,

I’m currently being haunted by an error which I can’t explain. I’m basically rewriting the body of a function and checking whether the rewritten parts are more ‘efficient’ according to some cost model. If the cost model however deems the new body ‘not profitable’, I need to revert back to the old version of the body.

As far as I know, this is a situation where the startRootUpdate(), cancelRootUpdate() and finalizeRootUpdate() methods of pattern rewriters are very useful, since they allow reverting back to old operation states. I am trying to use them in a similar fashion to other places of the MLIR code base (e.g. here), i.e.

LogicalResult BodyPattern::matchAndRewrite(FuncOp op, ArrayRef<Value> operands, ConversionPatternRewriter& rewriter) const {
    rewriter.startRootUpdate(op);
    // rewrite body with using other patterns
    // ...
    if (!profitable(op)) {
        rewriter.cancelRootUpdate(op);
        return failure();
    }
    rewriter.finalizeRootUpdate(op);
    return success();
}

Unfortunately, I always encounter a crash and the following error message during cancelRootUpdate() when executing my pipeline:

llvm/llvm-src/llvm/include/llvm/ADT/SmallVector.h:709: llvm::SmallVectorImpl::iterator llvm::SmallVectorImpl<(anonymous namespace)::OperationTransactionState>::erase(llvm::SmallVectorImpl::const_iterator) [T = (anonymous namespace)::OperationTransactionState]: Assertion `this->isReferenceToStorage(CI) && "Iterator to erase is out of bounds."' failed.

Even if I don’t modify the body at all, e.g.

LogicalResult BodyPattern::matchAndRewrite(FuncOp op, ArrayRef<Value> operands, ConversionPatternRewriter& rewriter) const {
    rewriter.startRootUpdate(op);
    rewriter.cancelRootUpdate(op);
    return failure();
}

it crashes at cancelRootUpdate(). If I replace the cancel with a finalize, it doesn’t crash, so it seems to occur here (?)

Is there some pipeline configuration parameter I have missed that needs to be enabled before I can use these rootUpdate methods? I couldn’t find anything like that in the documentation or the codebase, so any help is greatly appreciated!

Can you please post the complete stack trace of the assert?

Hi @bondhugula,

the stack trace looks like this:

#0  0x00007ffff76db9e5 in raise () from /lib64/libc.so.6
#1  0x00007ffff76c48a4 in abort () from /lib64/libc.so.6
#2  0x00007ffff76c4789 in __assert_fail_base.cold () from /lib64/libc.so.6
#3  0x00007ffff76d4026 in __assert_fail () from /lib64/libc.so.6
#4  0x00007ffff4fdde62 in mlir::ConversionPatternRewriter::cancelRootUpdate(mlir::Operation*) () from /home/csvtuda/deps/llvm/llvm-bin/lib/libMLIRTransformUtils.so.13git
#5  0x00007ffff6d60397 in mlir::spn::VectorizeSingleTask::matchAndRewrite (this=0x5a1d80, task=..., operands=..., rewriter=...)
    at /tmp/tmp.UMQjVO7w69/mlir/lib/Conversion/LoSPNtoCPU/Vectorization/VectorizeStructurePatterns.cpp:69
#6  0x00007ffff6d3c247 in mlir::detail::OpOrInterfaceConversionPatternBase<mlir::spn::low::SPNTask>::matchAndRewrite (this=0x5a1d80, op=0x59c6b0, operands=..., rewriter=...)
    at /home/csvtuda/deps/llvm/llvm-src/mlir/include/mlir/Transforms/DialectConversion.h:386
#7  0x00007ffff4fddf84 in mlir::ConversionPattern::matchAndRewrite(mlir::Operation*, mlir::PatternRewriter&) const () from /home/csvtuda/deps/llvm/llvm-bin/lib/libMLIRTransformUtils.so.13git
#8  0x00007ffff4f3bea9 in mlir::PatternApplicator::matchAndRewrite(mlir::Operation*, mlir::PatternRewriter&, llvm::function_ref<bool (mlir::Pattern const&)>, llvm::function_ref<void (mlir::Pattern const&)>, llvm::function_ref<mlir::LogicalResult (mlir::Pattern const&)>) () from /home/csvtuda/deps/llvm/llvm-bin/lib/libMLIRRewrite.so.13git
#9  0x00007ffff4fe90aa in (anonymous namespace)::OperationLegalizer::legalize(mlir::Operation*, mlir::ConversionPatternRewriter&) ()
   from /home/csvtuda/deps/llvm/llvm-bin/lib/libMLIRTransformUtils.so.13git
#10 0x00007ffff4fe1398 in (anonymous namespace)::OperationConverter::convertOperations(llvm::ArrayRef<mlir::Operation*>) () from /home/csvtuda/deps/llvm/llvm-bin/lib/libMLIRTransformUtils.so.13git
#11 0x00007ffff4fe30ae in mlir::applyPartialConversion(mlir::Operation*, mlir::ConversionTarget&, mlir::FrozenRewritePatternSet const&, llvm::DenseSet<mlir::Operation*, llvm::DenseMapInfo<mlir::Operation*> >*) ()
   from /home/csvtuda/deps/llvm/llvm-bin/lib/libMLIRTransformUtils.so.13git
#12 0x00007ffff6defdd8 in mlir::spn::LoSPNtoCPUStructureConversionPass::runOnOperation (this=0x56db20) at /tmp/tmp.UMQjVO7w69/mlir/lib/Conversion/LoSPNtoCPU/LoSPNtoCPUConversionPasses.cpp:58
#13 0x00007ffff4c6a8ec in mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) ()
   from /home/csvtuda/deps/llvm/llvm-bin/lib/libMLIRPass.so.13git
#14 0x00007ffff4c6da8b in mlir::PassManager::run(mlir::Operation*) () from /home/csvtuda/deps/llvm/llvm-bin/lib/libMLIRPass.so.13git
#15 0x00007ffff7edeca7 in spnc::MLIRPipelineBase<spnc::LoSPNtoCPUConversion>::execute (this=0x52a670) at /tmp/tmp.UMQjVO7w69/compiler/src/codegen/mlir/transformation/../MLIRPassPipeline.h:50
#16 0x00007ffff7ee0429 in spnc::MLIRPipelineBase<spnc::CPUtoLLVMConversion>::execute (this=0x52a780) at /tmp/tmp.UMQjVO7w69/compiler/src/codegen/mlir/transformation/../MLIRPassPipeline.h:44
#17 0x00007ffff7ee4d18 in spnc::MLIRtoLLVMIRConversion::execute (this=0x52a8e0) at /tmp/tmp.UMQjVO7w69/compiler/src/codegen/mlir/conversion/MLIRtoLLVMIRConversion.cpp:28
#18 0x00007ffff7ea9129 in spnc::EmitObjectCode::execute (this=0x5008d0) at /tmp/tmp.UMQjVO7w69/compiler/src/driver/action/EmitObjectCode.cpp:53
#19 0x00007ffff7eab6ce in spnc::ClangKernelLinking::execute (this=0x4e4f80) at /tmp/tmp.UMQjVO7w69/compiler/src/driver/action/ClangKernelLinking.cpp:32
#20 0x00007ffff7e9748d in spnc::Job<spnc::Kernel>::execute (this=0x4ec630) at /tmp/tmp.UMQjVO7w69/compiler/include/driver/Job.h:89
#21 0x00007ffff7e966f3 in spnc::spn_compiler::compileQuery (inputFile=Python Exception <class 'gdb.error'> There is no member named _M_dataplus.: 
, options=std::map with 4 elements = {...}) at /tmp/tmp.UMQjVO7w69/compiler/src/Driver.cpp:36
#22 0x0000000000442e5b in main (argc=10, argv=0x7fffffffd888) at /tmp/tmp.UMQjVO7w69/execute/src/main.cpp:25

The error occurs in VectorizeSingleTask. A task is an operation that contains a single region with operations, which I want to rewrite as described above. I used FuncOp above because it’s the op that resembles a task the most.

The code basically looks like this for the moment:

LogicalResult VectorizeSingleTask::matchAndRewrite(TaskOp task, ArrayRef<Value> operands,
                                                   ConversionPatternRewriter& rewriter) const {
  if (!applicable()) {
    return rewriter.notifyMatchFailure("not applicable");
  }
  rewriter.startRootUpdate(task);
  rewriter.cancelRootUpdate(task); // <-- error here
  // ...
}

ConversionPatternRewriter::cancelRootUpdate appears to be missing at least an assert. In this method (pasted below), right before the erase call, can you add an assert to ensure op is indeed in rootUpdates and try dumping the op there as well (op->dump()).

  1477 /// PatternRewriter hook for updating the root operation in-place.
  1478 void ConversionPatternRewriter::cancelRootUpdate(Operation *op) {
  1479 #ifndef NDEBUG
  1480   assert(impl->pendingRootUpdates.erase(op) &&
  1481          "operation did not have a pending in-place update");
  1482 #endif
  1483   // Erase the last update for this operation.
  1484   auto stateHasOp = [op](const auto &it) { return it.getOperation() == op; };
  1485   auto &rootUpdates = impl->rootUpdates;
  1486   auto it = llvm::find_if(llvm::reverse(rootUpdates), stateHasOp);
  1487   rootUpdates.erase(rootUpdates.begin() + (rootUpdates.rend() - it));
  1488 }

I think you may have corrupted IR and op might just be garbage (for eg. a freed op). Running with asan or with valgrind will reveal that immediately.

Thank you for the debugging tips, I immediately tried them out like this:

/// PatternRewriter hook for updating the root operation in-place.
void ConversionPatternRewriter::cancelRootUpdate(Operation *op) {
#ifndef NDEBUG
  assert(impl->pendingRootUpdates.erase(op) &&
         "operation did not have a pending in-place update");
#endif
  // Erase the last update for this operation.
  auto stateHasOp = [op](const auto &it) { return it.getOperation() == op; };
  auto &rootUpdates = impl->rootUpdates;
  auto it = llvm::find_if(llvm::reverse(rootUpdates), stateHasOp);
  assert(rootUpdates.rend() != it);
  op->dump();
  llvm::dbgs() << "op: " << op << "\n";
  llvm::dbgs() << "stored updates: " << rootUpdates.size() << "\n";
  for(auto& update : rootUpdates) {
    llvm::dbgs() << "\tupdate stored for op: " << update.getOperation() << "\n";
  }
  llvm::dbgs() << "it: " << it->getOperation() << "\n";
  llvm::dbgs() << "rend - it: " << (rootUpdates.rend() - it) << "\n";
  llvm::dbgs() << "begin + (rend - it) op: " << (rootUpdates.begin() + (rootUpdates.rend() - it))->getOperation() << "\n";
  rootUpdates.erase(rootUpdates.begin() + (rootUpdates.rend() - it));
}

The assert does not fail and the debug statements produce the following output:

"task"(<<UNKNOWN SSA VALUE>>, <<UNKNOWN SSA VALUE>>) ( {
  // body ops ..
}) : (memref<?x4xf64>, memref<?xf64>) -> ()
op: 0x1a76fb0
stored updates: 1
  update stored for op: 0x1a76fb0
it op: 0x1a76fb0
rend - it: 1
begin + (rend - it) op: 0xfffffffffffff000

So obviously cancelRootUpdate() tries to erase a second element from rootUpdates although it only contains a single one.

Otherwise, as far as I could see, valgrind did not come up with any memory access problems in my code.

The code above is actually wrong (it is off by one) - I should’ve mentioned this in the previous post. I’m surprised it has worked so far. Please change this to:

rootUpdates.erase(rootUpdates.begin() + (std::prev(rootUpdates.rend()) - it));

or

rootUpdates.erase(rootUpdates.begin() + (rootUpdates.rend()) - it - 1));

or

rootUpdates.erase(it.base());

and it should be fine. I can post a patch to fix it upstream - thanks for reporting this.

Fixed. https://reviews.llvm.org/D105397

Thank you for your quick answer! The first two versions work perfectly fine, but the it.base() one is out of bounds again. Shouldn’t that be it.base() - 1 as well?

That’s right - should be it.base() - 1. The second form might be more readable and I’ve used that in the revision now.

1 Like