There’s a thought about FIR that it should keep things at a high level for as long as possible. I don’t know how well that principle is followed today, but it’s a worthy goal.
For lowering sum, are you thinking of calling specialized sum1d and sum2d routines, or of changing the existing sum (et al.) to special-case the 1d and 2d cases?
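To make the two alternatives concrete, here is a rough sketch of how I picture them. Everything in it is hypothetical: the ArrayView struct, the function signatures, and the assumption of contiguous column-major double data are mine for illustration, not the actual runtime API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an array descriptor: a contiguous,
// column-major buffer plus one extent per dimension.
struct ArrayView {
  const double *data;
  std::vector<std::size_t> extents;
};

// Option A: specialized entry points that lowering could call directly
// when the rank is known at compile time.
inline double sum1d(const double *x, std::size_t n) {
  double acc = 0.0;
  for (std::size_t i = 0; i < n; ++i)
    acc += x[i];
  return acc;
}

inline double sum2d(const double *x, std::size_t rows, std::size_t cols) {
  double acc = 0.0;
  for (std::size_t j = 0; j < cols; ++j)   // columns outermost: column-major
    for (std::size_t i = 0; i < rows; ++i)
      acc += x[i + j * rows];
  return acc;
}

// Option B: keep a single generic entry point and special-case
// ranks 1 and 2 inside it.
inline double sum(const ArrayView &a) {
  switch (a.extents.size()) {
  case 1:
    return sum1d(a.data, a.extents[0]);
  case 2:
    return sum2d(a.data, a.extents[0], a.extents[1]);
  default: {
    // Generic path: the data is assumed contiguous, so flatten it.
    std::size_t n = 1;
    for (std::size_t e : a.extents)
      n *= e;
    return sum1d(a.data, n);
  }
  }
}
```

With option A the compiler picks the routine; with option B the choice is a runtime branch on the rank, which is only cheap if the call itself gets inlined.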
Perhaps the runtime can be compiled to bitcode and supplied to LLVM for LLVM-driven inlining?
So many options, which to choose! Will you publish a short design spec before you get too far along with the coding?