Redundant byval in C codegen?

For this C code,

struct s {
  int a[40];
};

void g(struct s a) {
  a.a[0] = 4;
}

void f() {
  struct s a;
  g(a);
}

clang generates this LLVM IR:

define void @g(%struct.s* byval %a) nounwind {
entry:
%tmp = getelementptr inbounds %struct.s* %a, i32 0, i32 0 ; <[40 x i32]*> [#uses=1]
%arraydecay = getelementptr inbounds [40 x i32]* %tmp, i32 0, i32 0 ; <i32*> [#uses=1]
%arrayidx = getelementptr inbounds i32* %arraydecay, i64 0 ; <i32*> [#uses=1]
store i32 4, i32* %arrayidx
ret void
}

define void @f() nounwind {
entry:
%a = alloca %struct.s, align 4 ; <%struct.s*> [#uses=1]
%agg.tmp = alloca %struct.s ; <%struct.s*> [#uses=2]
%tmp = bitcast %struct.s* %agg.tmp to i8* ; <i8*> [#uses=1]
%tmp1 = bitcast %struct.s* %a to i8* ; <i8*> [#uses=1]
call void @llvm.memcpy.i64(i8* %tmp, i8* %tmp1, i64 160, i32 4)
call void @g(%struct.s* byval %agg.tmp)
ret void
}

Since we have already alloca’ed a temporary struct.s, %agg.tmp, why is there still a ‘byval’ on g’s parameter? As a consequence, the generated assembly ends up allocating three structs on the stack:

f:
.Leh_func_begin2:
pushq %rbp
.Llabel3:
movq %rsp, %rbp
.Llabel4:
subq $496, %rsp
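
For what it's worth, the 496 in the subq is consistent with three 160-byte copies of struct s (160 being the size passed to the memcpy above) plus 16 bytes that this arithmetic does not account for. A minimal C sketch of that check, assuming a 4-byte int; the helper names bytes_for_copies and frame_slack are made up for illustration:

```c
#include <stddef.h>

/* Same layout as the struct in the question: 40 ints. */
struct s {
    int a[40];
};

/* Bytes needed for `copies` instances of struct s, ignoring any
   padding between them and any other contents of the frame. */
size_t bytes_for_copies(size_t copies) {
    return copies * sizeof(struct s);
}

/* Bytes left over in a frame of `frame_size` bytes once `copies`
   instances of struct s have been accounted for. Assumes the frame
   is at least that large. */
size_t frame_slack(size_t frame_size, size_t copies) {
    return frame_size - bytes_for_copies(copies);
}
```

With a 4-byte int, bytes_for_copies(3) is 480 and frame_slack(496, 3) is 16. This only checks the arithmetic; it says nothing about which stack slot holds which copy.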

Could somebody explain the rationale behind this behavior? Thanks.

I have no idea why we are unable to remove one of the allocas in the
caller, but I think I know why we always keep a copy in the caller and
one in the callee. Take a modified testcase:

Does that mean that if we have already created a temporary for a struct
argument, the 'byval' attribute should not be used, so as to avoid an alloca?

In some cases we do that (small structs). In some cases you don't have
a choice (ABI). In others, not using byval will bloat the IR with
lots of scalar arguments. Not using byval might also worsen the
callee code, since the arguments will still be on the stack but nothing
before codegen will know that.
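
Whichever lowering is used, the caller's struct must not be modified by the callee. A rough source-level analogy in C, not what clang actually emits: g_byval is the plain by-value call from the question, while g_ptr and call_with_temporary are hypothetical names that mimic the "memcpy into %agg.tmp, then pass a pointer" shape of the IR.

```c
#include <string.h>

struct s {
    int a[40];
};

/* Plain C by-value parameter: the callee writes to its own copy. */
void g_byval(struct s a) {
    a.a[0] = 4;
}

/* Hypothetical pointer variant: writes through to whatever object
   the caller handed in. */
void g_ptr(struct s *a) {
    a->a[0] = 4;
}

/* By-value call: returns the caller's a[0] after the call. */
int call_by_value(void) {
    struct s original;
    memset(&original, 0, sizeof original);
    g_byval(original);
    return original.a[0];   /* the caller's struct is untouched */
}

/* Caller-made temporary, loosely mirroring the memcpy into %agg.tmp:
   the callee only ever sees the temporary. */
int call_with_temporary(void) {
    struct s original;
    struct s tmp;
    memset(&original, 0, sizeof original);
    memcpy(&tmp, &original, sizeof tmp);
    g_ptr(&tmp);
    return original.a[0];   /* the caller's struct is untouched */
}
```

Both callers leave the original unchanged, which is the property any lowering has to preserve; the open question in this thread is how many copies are needed to get there.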

My idea is that if

struct s {int a; int b; int c; int d; int e;};
void g(struct s a);
void f() {
  struct s a = {1, 2, 3, 4, 5};
  g(a);
}

Could be compiled to something like