Hi Community, I’ve noticed the removal of linalg.tiled_loop (see the discussion on this removal).
I am wondering what the best practice for tiling ops is nowadays. Are we moving on to gml_st now?
You can use scf.for and scf.foreach_thread for tiling. I was planning to upstream gml_st.parallel, and it would become very similar to scf.foreach_thread but with upper/lower bounds and steps. Potentially, it could replace scf.parallel, which does not work on tensors.
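
For reference, here is a minimal sketch of what tiling over tensors with scf.for typically looks like today. The function name, shapes, and the tile size of 16 are made up for illustration; the destination tensor is threaded through iter_args:

```mlir
// Tile a 1-D elementwise add over tensor<128xf32> with tile size 16.
// The result tensor is carried through iter_args and updated per tile.
func.func @tiled_add(%a: tensor<128xf32>, %b: tensor<128xf32>,
                     %init: tensor<128xf32>) -> tensor<128xf32> {
  %c0 = arith.constant 0 : index
  %c16 = arith.constant 16 : index
  %c128 = arith.constant 128 : index
  %res = scf.for %iv = %c0 to %c128 step %c16
      iter_args(%acc = %init) -> (tensor<128xf32>) {
    %a_tile = tensor.extract_slice %a[%iv] [16] [1]
        : tensor<128xf32> to tensor<16xf32>
    %b_tile = tensor.extract_slice %b[%iv] [16] [1]
        : tensor<128xf32> to tensor<16xf32>
    // Compute on the tile (an elementwise add as a stand-in for a real payload).
    %sum = arith.addf %a_tile, %b_tile : tensor<16xf32>
    // Insert the computed tile back into the loop-carried destination.
    %updated = tensor.insert_slice %sum into %acc[%iv] [16] [1]
        : tensor<16xf32> into tensor<128xf32>
    scf.yield %updated : tensor<128xf32>
  }
  return %res : tensor<128xf32>
}
```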
Nice, that would be great! Do you have a timeline? When do you think this will happen?
@chelini If you want to use it, I can surely prioritize it. I will try to write a proper RFC this week. I think there are three ways to upstream it:

1. gml_st.parallel becomes yet another loop-like op in the SCF dialect. It replaces neither scf.parallel nor scf.foreach_thread. That’s the worst variant.
2. gml_st.parallel becomes yet another loop-like op in the SCF dialect, but then it replaces scf.parallel once it is good enough to cover the current use cases for scf.parallel within OpenMP, TF KernelGen and Affine.
3. scf.foreach_thread gets ubs, lbs and steps, and gml_st.parallel is removed. We can still try to replace scf.parallel afterwards.
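
For intuition on how these variants relate: a loop over (lbs, ubs, steps) can always be normalized to ids in [0, ceildiv(ub - lb, step)), recovering the original induction variable with a bit of index arithmetic. A sketch of that arithmetic using plain arith ops on index values (the SSA names are illustrative):

```mlir
// Normalized trip count: %num = ceildiv(%ub - %lb, %step).
%diff   = arith.subi %ub, %lb : index
%num    = arith.ceildivsi %diff, %step : index
// Recover the original induction variable from a normalized id:
//   %iv = %lb + %id * %step
%scaled = arith.muli %id, %step : index
%iv     = arith.addi %lb, %scaled : index
```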
My quick 2 cents: scf.foreach_thread also aims at being usable in more general cases (e.g. with lists of ranges / variadic step sizes for load balancing, since tail effects on GPU can quickly reach 20-30%). scf.foreach_thread could gain a ubs/lbs/steps form (i.e. a single range) as a step forward. So option 3, with more hands on deck participating in the design and implementation, is quite appealing to me. Updating the tensor semantics part of scf.parallel also makes sense as a followup cleanup.
We should be sure to avoid a situation where we end up with multiple duplicated transformations on loops over tensors.
Thanks, and sorry for the latency. I agree with Nicolas; there are already some duplicated transformations on loops over tensors (i.e., tile and fuse), and we should avoid adding more. My use case is about preserving parallel semantics at the tensor level while having a lowering path for CPUs. So if scf.foreach_thread could gain ubs/lbs and steps and be lowered to scf.parallel, perhaps after bufferization, it would be perfect for me. Happy to contribute or review patches in this direction if you already have some.
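
For concreteness, the buffer-level form such a lowering could target after bufferization would be a plain scf.parallel over memrefs, roughly like this (names and shapes are made up for illustration):

```mlir
// After bufferization: a parallel elementwise add over memrefs.
// scf.parallel has no tensor-carrying results, which is why a
// tensor-level parallel loop is needed before bufferization.
func.func @parallel_add(%a: memref<128xf32>, %b: memref<128xf32>,
                        %out: memref<128xf32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c128 = arith.constant 128 : index
  scf.parallel (%i) = (%c0) to (%c128) step (%c1) {
    %x = memref.load %a[%i] : memref<128xf32>
    %y = memref.load %b[%i] : memref<128xf32>
    %sum = arith.addf %x, %y : f32
    memref.store %sum, %out[%i] : memref<128xf32>
  }
  return
}
```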