The front end I’m building for an existing interpreted language is unfortunately producing output similar to this far too often:
define void @foo(i8* nocapture %dest, i8* nocapture %src, i32 %len) nounwind {
  %1 = tail call noalias i8* @malloc(i32 %len) nounwind
  tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %1, i8* %src, i32 %len, i32 1, i1 false)
  tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %dest, i8* %1, i32 %len, i32 1, i1 false)
  tail call void @free(i8* %1) nounwind
  ret void
}
I’d like to be able to reduce this pattern to:
define void @foo(i8* nocapture %dest, i8* nocapture %src, i32 %len) nounwind {
  tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %dest, i8* %src, i32 %len, i32 1, i1 false)
  ret void
}
Optimising all cases of this pattern from within my front end’s AST would be difficult. I’d rather implement this as an LLVM pass or two that run after the other function passes have already cleaned up the mess I’ve made.
Has anyone written passes to detect and combine multiple memory copies that originate from the same data?
And then to eliminate the stores, and the malloc/free pairs, for local pointers that are never read from or captured?
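The first of those steps, forwarding a copy of a copy, can be sketched on a toy straight-line instruction list. This is a minimal illustration under stated assumptions, not LLVM code: `Inst`, `clobbers` and `forward_memcpys` are hypothetical names, only a single basic block is modelled, and distinct value names are assumed to be distinct, non-overlapping buffers in place of real alias analysis.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical toy instruction model (not LLVM's API). Values are named by
# strings such as "%1"; distinct names are ASSUMED to be distinct,
# non-overlapping buffers, standing in for real alias analysis.
@dataclass
class Inst:
    op: str                       # "malloc", "memcpy", "free" or "ret"
    result: Optional[str] = None  # value defined by malloc
    args: List[str] = field(default_factory=list)  # memcpy: [dest, src, len]


def clobbers(inst: Inst, ptr: str) -> bool:
    """True if inst allocates, writes to, or frees the buffer named ptr."""
    if inst.op == "malloc":
        return inst.result == ptr
    if inst.op in ("memcpy", "free"):
        return inst.args[0] == ptr
    return False


def forward_memcpys(body: List[Inst]) -> int:
    """Within one straight-line block, rewrite memcpy(dest <- tmp, n) to
    memcpy(dest <- src, n) when tmp was last written by memcpy(tmp <- src, n)
    with the same length and src is not clobbered in between.
    Returns the number of rewritten copies."""
    rewrites = 0
    for j, second in enumerate(body):
        if second.op != "memcpy":
            continue
        dest, tmp, length = second.args
        # Walk back to the most recent instruction that clobbers tmp.
        k = j - 1
        while k >= 0 and not clobbers(body[k], tmp):
            k -= 1
        if k < 0 or body[k].op != "memcpy":
            continue  # tmp was not filled by a copy in this block
        src = body[k].args[1]
        if body[k].args[2] != length or src == dest:
            continue  # different size, or the result would be a self-copy
        if any(clobbers(body[m], src) for m in range(k + 1, j)):
            continue  # src changed after tmp was filled
        second.args[1] = src
        rewrites += 1
    return rewrites


# The pattern from the message above, as a toy instruction list.
body = [
    Inst("malloc", "%1", ["%len"]),
    Inst("memcpy", args=["%1", "%src", "%len"]),
    Inst("memcpy", args=["%dest", "%1", "%len"]),
    Inst("free", args=["%1"]),
    Inst("ret"),
]
print(forward_memcpys(body))  # 1
print(body[2].args)           # ['%dest', '%src', '%len']
```

After the rewrite `%1` is still written and freed but never read, which is what would make the malloc/free pair removable.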
Hi Jeremy,
> The front end I'm building for an existing interpreted language is unfortunately
> producing output similar to this far too often:
> define void @foo(i8* nocapture %dest, i8* nocapture %src, i32 %len) nounwind {
>   %1 = tail call noalias i8* @malloc(i32 %len) nounwind
>   tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %1, i8* %src, i32 %len, i32 1, i1 false)
>   tail call void @llvm.memcpy.p0i8.p0i8.i32(i8* %dest, i8* %1, i32 %len, i32 1, i1 false)
>   tail call void @free(i8* %1) nounwind
>   ret void
> }
could you allocate the memory on the stack instead (alloca instruction)?
Ciao, Duncan.
> could you allocate the memory on the stack instead (alloca instruction)?
This is mainly for string or binary blob handling, so using the stack isn’t a great idea for size reasons.
While I’m only experimenting with simple code examples now (and picked a simple one for this email), I’m certain things will get much more complicated once I implement more features of the language.
I have been playing with some ideas in this space. I haven't gotten beyond toy implementations yet, but would be happy to brainstorm if nothing else.
I'm traveling at the moment, but should have some time next week if you want to discuss.
Philip Reames
Hi Jeremy,
> could you allocate the memory on the stack instead (alloca instruction)?
> This is mainly for string or binary blob handling, so using the stack isn't a great
> idea for size reasons.
> While I'm only experimenting with simple code examples now (and picked a simple one
> for this email), I'm certain things will get much more complicated once I
> implement more features of the language.
The optimizer that does memcpy forwarding is in
lib/Transforms/Scalar/MemCpyOptimizer.cpp
You might want to look into teaching it how to handle malloc'd memory and not
just alloca instructions. I think the logic is in
MemCpyOpt::performCallSlotOptzn.
Ciao, Duncan.
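Duncan's suggestion above, teaching the optimizer to treat malloc'd memory like an alloca, comes down to proving that the heap temporary is never read or captured. A minimal sketch of that step (the second one asked about in the first message) on a toy, single-block instruction list follows; it is a hypothetical illustration, not the MemCpyOpt implementation. `Inst` and `remove_dead_mallocs` are made-up names, and any use of the pointer other than as a memcpy destination or the argument of free is conservatively treated as a read or capture. The input is the example function after its second memcpy has been rewritten to read from `%src`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical toy instruction model (not LLVM's API).
@dataclass
class Inst:
    op: str                       # "malloc", "memcpy", "free" or "ret"
    result: Optional[str] = None  # value defined by malloc
    args: List[str] = field(default_factory=list)  # memcpy: [dest, src, len]


def remove_dead_mallocs(body: List[Inst]) -> int:
    """Delete each malloc whose result is only used as a memcpy destination or
    as the argument of free, along with those memcpys and frees. Any other use
    (memcpy source, return value, ...) is treated as a read or capture and
    keeps the allocation. Returns the number of allocations removed."""
    removed = 0
    for alloc in [i for i in body if i.op == "malloc"]:
        ptr = alloc.result
        users = [i for i in body if i is not alloc and ptr in i.args]
        only_written_or_freed = all(
            (u.op == "memcpy" and u.args[0] == ptr and u.args[1] != ptr)
            or (u.op == "free" and u.args == [ptr])
            for u in users
        )
        if only_written_or_freed:
            # Compare by identity so equal-looking instructions elsewhere survive.
            dead = {id(alloc)} | {id(u) for u in users}
            body[:] = [i for i in body if id(i) not in dead]
            removed += 1
    return removed


# The example function after forwarding: %1 is written and freed, never read.
body = [
    Inst("malloc", "%1", ["%len"]),
    Inst("memcpy", args=["%1", "%src", "%len"]),
    Inst("memcpy", args=["%dest", "%src", "%len"]),
    Inst("free", args=["%1"]),
    Inst("ret"),
]
print(remove_dead_mallocs(body))  # 1
print([i.op for i in body])       # ['memcpy', 'ret']
```

The check is deliberately conservative: in the unoptimised function from the first message, `%1` is still the source of the second memcpy, so the allocation is kept until the copies have been forwarded.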