Thanks for the details. Note that -supports-hot-cold-new is used during the LTO link - so please add -Wl,-mllvm,-supports-hot-cold-new to the thin link (it should not be needed by the initial or backend compiles). If it works then when you disassemble a.o.thinlto.bc you should also no longer see “none” for the allocs versions.
Let me know if that fixes it or if you see more issues.
Note that -supports-hot-cold-new is used during the LTO link - so please add -Wl,-mllvm,-supports-hot-cold-new to the thin link (it should not be needed by the initial or backend compiles). If it works then when you disassemble a.o.thinlto.bc you should also no longer see “none” for the allocs versions.
Thanks for reply. It works when adding -Wl,-mllvm,-enable-memprof-context-disambiguation -Wl,-mllvm,-supports-hot-cold-new to thin link. I also can see allocs: ((versions: (notcold, none, cold, notcold) in a.o.thinlto.bc
Hi @teresajohnson@snehasish
I and @Enna1 found now MemProf handle the callchain which may have circles in thin link. It means the final callchain alloc type may be incorrect in our opinion.
For example, there are two callchain:
A->B->A->B->new
A->A->B->A->B->new
These two callchain will be considered as one callchain in thinlink:
A->B->new
And the alloc type is the result of the OR of two callchains.
I think it’s may better if we handle callchain in llvm-profdata, truncating the callchain and then merging MIB info. The final alloc type is calculated by merged MIB info.
Hi @lifengxiang1025 , thanks for the note (and the patch, will respond on that shortly)!
Recursion is a little tricky, and I tried to have it be conservatively correct. In the example below, where there is some mutual recursion, we should be detecting this during the thin link and preventing cloning.
There’s presumably some more sophisticated ways of handling these cases, but more difficult to get the necessary cloning correct. E.g. if we collapsed A->B->A->B->new to A->B->new we might miss renaming the call from B->A with the appropriate callee clone of A. So I chose to leave the recursion visible in the metadata and handle it conservatively. Definitely open to suggestions and/or patches!
Thanks for reply. I can’t understand why we might miss renaming the call from B->A with the appropriate callee clone of A. Could you please explain that in more detail?
It’s probably easier to discuss in terms of callsites, not functions, since that’s what is captured in the call contexts and what the cloning algorithm primarily operates on. So for example, if we consider the following code:
The call chain A->B->A->B->new is actually the callsite context a1->b1->a1->b2.
If we collapse that to A->B->new then we are collapsing the context to a1->b2. If we clone for that we completely miss cloning callsite b1, and anything that goes through b1 will call the uncloned version of A(). I.e. we would get the following cloned code for the cold memory allocations:
If we didn’t collapse and we kept the context as-is, we have a cycle in the CCG (CallsiteContextGraph) that the cloning operates on. Right now the cloning code doesn’t handle this case, so we handle it conservatively and don’t clone. This certainly can be improved. I.e. if all contexts with different levels of recursion through these callsites all lead to cold memory allocations, ideally we would want the final cloned code like:
With the direct recursion flattening, contexts with a regex like a1->[a1->]*->a2 all become a1->a2. There is no cycle in the CCG graph. Assuming that all the profiled
recursive contexts that collapse to the same thing have the same memory coldness, we should be able to create the cloned code:
Thanks!! I’m now aware that truncating callsite context is not a good idea.
Then I think we may filter circle callsite context to chain which is a directed chain from main to new.
For example:
And corresponds to four MIB info:MIB1,MIB2,MIB3,MIB4.
The directed chain from main to now in callsite context 1 2 3 is a1->b2 which is same as callsite context 4.
So we merge MIB1234 to MIB5 in llvm-profdata and the compiler will see only one callsite context and MIB info:
Yep, I think we should probably take the approach of doing some flattening and MIB merging in llvm-profdata. As mentioned earlier it would result in somewhat less accurate cloning though if we flattened down to just a1->b2, as any call path through b1 will not reach the cloned and hinted allocation.
To address this I think we would actually only want to merge 1, 2, and 3, into:
b1->a1->b2
MIB123 (merged from MIB1, MIB2, and MIB3)
I’m not sure what to do about context 4, although in real cases presumably a1 is called from somewhere else, e.g. c1->a1->b2, so we would be able to distinguish it and handle it appropriately.