Newbie question here: I want to remain as close to the affine dialect as possible while encoding accumulators. I am thinking of the “fold” construct of OCaml or the Reduce construct of MapReduce, or, at the least, something that can compute the sum of a vector, as in TVM.

For now, the only solution I could find involves creating a memref and storing the accumulator value to memory between iterations. Is this how it’s supposed to be done?

Best regards,
dpotop

PS: if someone has a matrix multiplication or convolution example, possibly with mlir-opt options for optimizing it well, I’d be grateful.

That’s right - you’ll have to store the reduction variable in a single-element memref (say, either memref&lt;f32&gt; or memref&lt;1xf32&gt;). For a matrix multiplication optimization example, here’s a reference.
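For concreteness, here is a minimal sketch of a vector sum written this way (function and value names are illustrative, not from the thread): the accumulator lives in a zero-dimensional memref&lt;f32&gt; that is loaded and stored on every iteration of the affine.for loop.

```mlir
// Sum-reduce %A into a single-element accumulator (illustrative sketch).
func @sum(%A: memref<128xf32>) -> f32 {
  %acc = alloc() : memref<f32>
  %zero = constant 0.0 : f32
  affine.store %zero, %acc[] : memref<f32>
  affine.for %i = 0 to 128 {
    %a = affine.load %A[%i] : memref<128xf32>
    %s = affine.load %acc[] : memref<f32>
    %sum = addf %a, %s : f32
    affine.store %sum, %acc[] : memref<f32>
  }
  %res = affine.load %acc[] : memref<f32>
  return %res : f32
}
```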
(Some of the passes used in that article aren’t upstream, but are available in the branch referred to there.)
Using a single-element memref for this purpose incurs no performance penalty through the pass pipeline (all the way through LLVM): the reduction variables do end up in registers.

On this note, there’s a recent proposal to add loop-carried / live-out SSA values for loop.for (which affine.for is lowered to).
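Under that proposal, the same reduction could be expressed without a memref, carrying the accumulator as an SSA value across iterations. A hedged sketch of what that might look like (the exact syntax is from the proposal discussion and may differ from what eventually lands):

```mlir
// Hypothetical loop-carried SSA form of the reduction above:
// %acc enters each iteration, loop.yield passes the updated value on,
// and the final value becomes the loop's result.
%sum = loop.for %i = %lb to %ub step %c1
    iter_args(%acc = %init) -> (f32) {
  %a = load %A[%i] : memref<128xf32>
  %next = addf %acc, %a : f32
  loop.yield %next : f32
}
```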