Thanks for sharing! I haven’t looked at the code yet, just read the README, which has already answered a lot of the questions I initially had. Some general comments…
I’m very happy to see that Simon’s predication changes were useful to your work. It’s a nice validation of their work and hopefully will help SVE, too.
Your main approach of strip-mining plus fusing the tail loop is what I was going to propose for now. It matches well with VPlan’s bite-sized approach and could build on existing vector formats. For example, you always try to strip-mine (for scalable and non-scalable vectors), and then only for scalable vectors do you try to fuse the scalar tail loop, which would improve the solution and give RVV/SVE an edge over the other extensions on the same hardware.
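To make sure we mean the same thing, here is a minimal sketch of the strip-mine-plus-scalar-tail shape (the function name, widths, and saxpy body are purely illustrative, not VPlan’s actual representation):

```python
def saxpy_stripmined(y, x, a, vf=8):
    """Strip-mined y += a * x with a scalar tail loop.

    Each main-body iteration stands in for one VF-wide vector operation.
    """
    n = len(y)
    i = 0
    # Main body: processes n rounded down to a multiple of vf.
    while i + vf <= n:
        for j in range(vf):  # stands in for a single vf-wide vector op
            y[i + j] += a * x[i + j]
        i += vf
    # Scalar tail loop for the remaining n % vf elements. With
    # predication/scalable vectors this tail can be folded back into
    # the main loop instead of being emitted separately.
    while i < n:
        y[i] += a * x[i]
        i += 1
    return y
```

The fusion step for scalable vectors would essentially make the second loop disappear by predicating the last main-body iteration.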
There were also proposals in the past to vectorise the tail loop, which could be a similar step. For example, if the main vector body is 8-way or 16-way, the tail loop has up to 7 or 15 iterations, which is horribly inefficient if left scalar. The idea was to further vectorise the 7 remaining iterations as 4+2+1-way steps, and likewise for 15. If those loops are then unrolled, you end up with a nice scaling-down pattern. On scalable vectors, this becomes a no-op.
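A quick sketch of that scaling-down tail, assuming an 8-wide main body has already handled elements up to index `i` (names and widths are illustrative):

```python
def vectorised_tail(y, x, a, i, widths=(4, 2, 1)):
    """Handle the remaining len(y) - i elements of y += a * x using
    progressively narrower "vector" steps (4-way, then 2-way, then 1-way).

    A 7-element remainder after an 8-wide body is covered exactly as
    4 + 2 + 1; each inner loop stands in for one w-wide vector op.
    """
    n = len(y)
    for w in widths:
        if i + w <= n:
            for j in range(w):  # one w-wide vector operation
                y[i + j] += a * x[i + j]
            i += w
    return i
```

Unrolled, this is the scaling-down pattern; with scalable vectors the main predicated body already covers the remainder, so none of these steps are emitted.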
There is a separate thread about the vectorisation cost model which talks about some of the challenges there; I think we need to take scalable vectors into consideration when thinking about it.
The NEON vs RISC-V register shadowing is interesting. It is true we mostly ignored 64-bit vectors in the loop vectoriser, but LLVM can still generate them with the (SLP) region vectoriser. IIRC, support for that kind of register aliasing is not trivial (and why GCC’s description of NEON registers sucked for so long), but the motivation of reducing register pressure inside hot loops is indeed important. I’m adding Arai Masaki in CC, as this is something he was working on.
Otherwise, I think working with the current folks on VPlan and the scalable-vector extensions will be a good way to upstream all the ideas you had in your work.