gpu.all_reduce among a subgroup?

Hi folks,

I was wondering whether there is a way to perform an all_reduce among the threads of a subgroup (corresponding to a CUDA warp) rather than those of a workgroup (corresponding to a CUDA block) using the existing dialects. If I have to define new operations, could you please advise which parts of the code I should read and follow?
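
To make the semantics I have in mind concrete, here is a minimal CUDA sketch of a warp-level all-reduce. It is only an illustration, not existing MLIR code: the names warp_all_reduce_sum, demo_kernel and run_demo are hypothetical, it assumes a warp size of 32 and a 1-D block whose size is a multiple of 32, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: sums `val` across the 32 lanes of a warp.
// Unlike a plain reduce, every lane ends up holding the total.
__device__ float warp_all_reduce_sum(float val) {
  // XOR butterfly exchange: after log2(32) = 5 steps each lane has the full sum.
  for (int offset = 16; offset > 0; offset >>= 1)
    val += __shfl_xor_sync(0xffffffffu, val, offset);
  return val;
}

// Each thread writes the sum over its own warp, not over the whole block.
__global__ void demo_kernel(const float *in, float *out) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  out[tid] = warp_all_reduce_sum(in[tid]);
}

// Launches one block of two warps and checks that each lane received
// the sum of its own warp only. CUDA error checking is left out.
bool run_demo() {
  const int n = 64;
  float h_in[n], h_out[n];
  for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

  float *d_in = nullptr, *d_out = nullptr;
  cudaMalloc(&d_in, n * sizeof(float));
  cudaMalloc(&d_out, n * sizeof(float));
  cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

  demo_kernel<<<1, n>>>(d_in, d_out);

  // This blocking copy also waits for the kernel to finish.
  cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);

  // Warp 0 holds 0..31 (sum 496); warp 1 holds 32..63 (sum 1520).
  bool ok = true;
  for (int i = 0; i < n; ++i) {
    float expected = (i < 32) ? 496.0f : 1520.0f;
    ok = ok && (h_out[i] == expected);
  }
  return ok;
}
```

So within one workgroup of 64 threads, the two subgroups would produce different results, whereas an all_reduce over the workgroup would give every thread the same value.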

Thanks!