Tiling on gml_st or linalg

Hi Community, I’ve noticed the removal of linalg.tiled_loop. See the discussion on this removal.

I am wondering what the best practice for tiling ops is nowadays. Are we moving to gml_st now?

You can use scf.for and scf.foreach_thread for tiling. I was planning to upstream gml_st.parallel; it would become very similar to scf.foreach_thread, but with upper/lower bounds and steps. Potentially, it could replace scf.parallel, which does not work on tensors.
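For illustration, here is roughly what tiling on tensors with scf.for looks like (a schematic sketch only; the op syntax is approximate and version-dependent, and the shapes, tile size, and value names are made up):

```mlir
// Tile the rows of a 64x64 op by 8, carrying the result tensor through
// iter_args. scf.foreach_thread expresses a similar iteration space with
// parallel semantics instead of a sequential loop-carried tensor.
%res = scf.for %i = %c0 to %c64 step %c8
    iter_args(%acc = %init) -> (tensor<64x64xf32>) {
  %in_tile = tensor.extract_slice %input[%i, 0] [8, 64] [1, 1]
      : tensor<64x64xf32> to tensor<8x64xf32>
  // ... compute on the 8x64 tile, e.g. a tiled linalg op ...
  %upd = tensor.insert_slice %out_tile into %acc[%i, 0] [8, 64] [1, 1]
      : tensor<8x64xf32> into tensor<64x64xf32>
  scf.yield %upd : tensor<64x64xf32>
}
```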


Nice that would be great, do you have any timeline? When do you think this will happen?

@chelini If you want to use it, I can surely prioritize it. I will try to write a proper RFC this week. I think there are 3 ways how to upstream it.

  1. gml_st.parallel becomes yet another loop-like op in the SCF dialect. It replaces neither scf.parallel nor scf.foreach_thread. That’s the worst variant.

  2. gml_st.parallel becomes yet another loop-like op in the SCF dialect, but then replaces scf.parallel once it is good enough to cover the current use cases for scf.parallel within OpenMP, TF KernelGen, and Affine.

  3. scf.foreach_thread gets lbs, ubs, and steps, and gml_st.parallel is removed. We can still try to replace scf.parallel afterwards.
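To make option 3 concrete, here is a hypothetical sketch of scf.foreach_thread extended with explicit bounds and steps (this form does not exist upstream; today the op is only parameterized by thread counts, and all names and syntax below are illustrative):

```mlir
// Hypothetical: explicit lbs/ubs/steps on scf.foreach_thread (option 3),
// instead of the current thread-count-only form.
%res = scf.foreach_thread (%i) = (%lb) to (%ub) step (%step)
    -> (tensor<?xf32>) {
  %tile = tensor.extract_slice %arg[%i] [%ts] [1]
      : tensor<?xf32> to tensor<?xf32>
  // ... compute on the tile ...
  scf.foreach_thread.perform_concurrently {
    tensor.parallel_insert_slice %out into %shared[%i] [%ts] [1]
        : tensor<?xf32> into tensor<?xf32>
  }
}
```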

My quick 2 cents: scf.foreach_thread also aims at being usable in more general cases (e.g. with lists of ranges / variadic step sizes for load balancing, since tail effects on GPU can quickly reach 20-30%).

scf.foreach_thread could gain an lbs / ubs / steps form (i.e. a single range) as a step forward.

So option 3, with more hands on deck participating in the design and implementation, is quite appealing to me.
Updating the tensor semantics part of scf.parallel also makes sense as a follow-up cleanup.

We should be careful to avoid a situation where we have multiple duplicated transformations on loops over tensors.


Thanks for clearing things up. I vote for option 3, as @nicolasvasilache did.


Thanks, and sorry for the delay. I agree with Nicolas; there are already some duplicated transformations on loops over tensors (e.g., tile and fuse), and we should avoid adding more. My use case is about preserving parallel semantics at the tensor level and having a lowering for CPUs. So if scf.foreach_thread could gain lbs/ubs and steps, and be lowered to scf.parallel (perhaps after bufferization), it would be perfect for me. Happy to contribute or review patches in this direction if you already have some.
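A sketch of the lowering path described above (assumed, not an existing pass): after bufferization, a parallel loop on tensors can become a plain scf.parallel on memrefs, which then feeds the existing CPU paths. Shapes and tile size are made up:

```mlir
// Post-bufferization: the parallel-on-tensors loop becomes scf.parallel
// on memrefs, which can then lower to OpenMP or async for CPU execution.
scf.parallel (%i) = (%c0) to (%c64) step (%c8) {
  %view = memref.subview %buf[%i, 0] [8, 64] [1, 1]
      : memref<64x64xf32> to memref<8x64xf32, strided<[64, 1], offset: ?>>
  // ... compute in-place on %view ...
  scf.yield
}
```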


One of the nice byproducts, in addition to all of us continuously improving the same transformations, is that @pifon2a also has a proposal to support parallel reductions at the subtensor level.

I’d also love to see that tackled as a followup.

@chelini @GuoliangZhu [RFC] Parallel loops on tensors in MLIR