In the affine output, %arg3 is 4, but in my understanding it should be 2.
Am I missing anything in the linalg::GenericOp?
Or does linalg::GenericOp only support inputs and outputs with the same shape?

A linalg.generic iterates over all elements of the input and output tensors you provide, and it assumes that the sizes of all input and output tensor dimensions that map to the same iteration dimension match. In your example, the iteration dimension d2 maps to tensor dimensions of sizes 2 and 4, which is invalid. When lowering to loops, the lowering takes the first shape associated with d2, which is 4. As a result, the smaller tensor is accessed out of bounds.
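The bound selection described above can be illustrated with a toy model in Python. This is only a sketch of the idea, not MLIR's actual implementation: the function name `infer_loop_bounds`, the operand order, and the identity indexing maps are assumptions made for illustration, since the original linalg.generic is not shown here.

```python
def infer_loop_bounds(num_dims, operands):
    """Toy model: pick a loop bound for each iteration dimension.

    operands is a list of (shape, index_map) pairs, where index_map[i] is the
    iteration dimension that indexes tensor dimension i.
    """
    bounds = [None] * num_dims
    conflicts = []
    for shape, index_map in operands:
        for size, dim in zip(shape, index_map):
            if bounds[dim] is None:
                # The first size seen for an iteration dimension wins.
                bounds[dim] = size
            elif bounds[dim] != size:
                # A later operand disagrees: (dimension, chosen bound, other size).
                conflicts.append((dim, bounds[dim], size))
    return bounds, conflicts


# Hypothetical operands: identity maps (d0, d1, d2) -> (d0, d1, d2),
# with the 8x1x4 tensor listed before the 8x1x2 tensor.
identity = (0, 1, 2)
operands = [((8, 1, 4), identity), ((8, 1, 2), identity)]

bounds, conflicts = infer_loop_bounds(3, operands)
print(bounds)     # [8, 1, 4]
print(conflicts)  # [(2, 4, 2)]
```

Under these assumptions, d2 gets the bound 4 from the first operand, so a loop over d2 would index the size-2 tensor at positions 2 and 3, which is the out-of-bounds access.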

You are thus right that inputs and outputs that map to the same iteration dimensions need to have the same size along those dimensions!

Your problem can be solved with a tensor.extract_slice operation. The following code should work:

%a = tensor.extract_slice %b[0, 0, 0] [8, 1, 2] [1, 1, 1] : tensor<8x1x4xf32> to tensor<8x1x2xf32>
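To make the offsets, sizes, and strides in the line above concrete, here is a small pure-Python model of what the slice selects. It is only a sketch on nested lists: the helper `extract_slice_3d` and the sample data in `b` are hypothetical and are not part of MLIR.

```python
def extract_slice_3d(t, offsets, sizes, strides):
    """Toy model of a rank-3 slice: element (i, j, k) of the result is
    t[offset + index * stride] taken along each dimension."""
    (o0, o1, o2) = offsets
    (s0, s1, s2) = sizes
    (t0, t1, t2) = strides
    return [
        [
            [t[o0 + i * t0][o1 + j * t1][o2 + k * t2] for k in range(s2)]
            for j in range(s1)
        ]
        for i in range(s0)
    ]


# Hypothetical 8x1x4 input where element (i, 0, k) holds i * 4 + k.
b = [[[float(i * 4 + k) for k in range(4)]] for i in range(8)]

# Same offsets [0, 0, 0], sizes [8, 1, 2], and strides [1, 1, 1] as above.
a = extract_slice_3d(b, (0, 0, 0), (8, 1, 2), (1, 1, 1))

print(len(a), len(a[0]), len(a[0][0]))  # 8 1 2
print(a[3][0])                          # [12.0, 13.0]
```

The result keeps only the first two elements along the last dimension, so it has the 8x1x2 shape that the other operand expects.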

I think it is not possible to fuse GenericOp → ExtractSliceOp → GenericOp on the Linalg level. It may be possible to fuse on the Affine level though (not sure about that).

Element-wise fusion can only fuse GenericOps that share the same iteration domain, since the result of fusion is again a GenericOp, which can only represent a perfectly nested loop nest.