Tile + fuse for a DSA (such as an NPU): any suggestions?

I am designing tile + fuse for a DSA (such as an NPU). The host CPU launches a subgraph on the NPU, and the host CPU and the NPU share DRAM. The NPU has atomic instructions such as conv and matmul, and it has its own internal SRAM. I have tiled and fused the subgraph, and I would like the fused operators to execute one after another out of the internal SRAM. Can you give me some suggestions about the tiling and fusion? Thanks.

PS: The NPU has atomic instructions such as conv and matmul, and their operands must be in the internal SRAM.
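
To make the question more concrete, here is a minimal sketch of the kind of tile sizing I mean. It is only an illustration under my own assumptions: the `Conv` record, the function names, tiling along rows only, a constant tensor width, no padding, and the conservative rule that every activation tile and all weights of the fused chain stay in SRAM at the same time are all hypothetical and do not describe any real NPU.

```python
from dataclasses import dataclass


@dataclass
class Conv:
    """Hypothetical description of one conv layer in a fused chain."""
    kernel: int   # square kernel size
    stride: int
    in_ch: int
    out_ch: int


def input_rows(out_rows: int, conv: Conv) -> int:
    """Input rows needed to produce `out_rows` output rows (no padding)."""
    return (out_rows - 1) * conv.stride + conv.kernel


def fused_tile_bytes(out_rows: int, width: int, chain: list,
                     elem_bytes: int = 1) -> int:
    """SRAM bytes needed if all tiles and weights of the chain are resident.

    Walks the chain backwards from the final output tile, growing the
    required row count at each layer (the halo added by the kernel).
    """
    rows = out_rows
    total = rows * width * chain[-1].out_ch * elem_bytes  # final output tile
    for conv in reversed(chain):
        rows = input_rows(rows, conv)
        total += rows * width * conv.in_ch * elem_bytes   # this layer's input tile
        total += conv.kernel * conv.kernel * conv.in_ch * conv.out_ch * elem_bytes  # weights
    return total


def largest_tile_rows(total_rows: int, width: int, chain: list,
                      sram_bytes: int, elem_bytes: int = 1) -> int:
    """Largest output tile height that fits in SRAM, or 0 if none fits."""
    best = 0
    for rows in range(1, total_rows + 1):
        if fused_tile_bytes(rows, width, chain, elem_bytes) > sram_bytes:
            break  # the footprint only grows with more rows
        best = rows
    return best


if __name__ == "__main__":
    chain = [Conv(3, 1, 8, 8), Conv(3, 1, 8, 8)]
    print(largest_tile_rows(32, 16, chain, sram_bytes=4096))
```

If the result is 0, the chain as a whole does not fit and the fusion group would have to be split, or the tile would also have to be cut along width or channels. Is this roughly the right way to think about it?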