I’m working on a tool to reason about C++ programs. In the tool, I use clang to generate ASTs for C++ programs. I want to do this in a modular way so that I can reason about a header file and then use the results of that to reason about other files that include the header. One issue that I’ve run into along these lines is that the names of anonymous structs, unions, and enums are not stable. Consider:
enum {
X = 0,
Y = 1
};
struct C {
struct { int x; };
struct { int y; };
};
I’ve been looking at the ItaniumMangle.cpp file and see that it generates names for anonymous types using a fresh name generator (i.e. it increments a counter). I wrote some code to replace this with something more deterministic and have a prototype / algorithm that seems to work on the examples that I’ve come across so far. I’m wondering if this would be a useful contribution to clang.
Things without linkage (lambdas included!) aren’t part of the ABI, so the name we choose for them is not really constrained. One of the bigger issues is making sure the name is encoded in a way that cannot also be produced by something with linkage.
I would be interested in some level of replacement for what we have IF it was ‘negative cost’ in some way (that is, lower memory use, shorter compile-time, shorter names, etc.).
So I guess my answer is: Without further details on your proposal, I’m not sure whether I’d accept it, but I’d be open to something ‘better’.
Thanks @erichkeane. The compile-time cost is not negative, and the current implementation is not particularly pretty but could probably be made much more efficient than it is. I would hazard a guess that a truly negative cost solution does not exist since a map lookup is so cheap.
The partial algorithm is based on the common uses that I’ve seen so far and I vaguely recall seeing it somewhere else, but I could be wrong. Here are the basic rules that I’ve worked out so far; the actual prefix symbol is arbitrary.
Global Unnamed Structs
struct { int x; } y; // _Z2%y
Prefix the name of the first declaration, %y.
Anonymous enums
enum { X = 0 }; // _Z2~X
Use the name of the first enum element, ~X.
Unnamed Structs inside aggregates
struct C {
struct { int x; }; // _ZN1C2.xE
struct { int y; }; // _ZN1C2.yE
};
Prefix the name of the first field with a ., e.g. .x. I think it is reasonable to repeat the . for each nested aggregate.
struct C {
struct {
struct { int x; }; // _ZN1C3..xE
}; // _ZN1C2.xE
};
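The rules above can be sketched as a small standalone program. This is only a toy model, not the prototype discussed here and not Clang's mangler: `UnnamedKind`, `stableId`, `sourceName`, `mangleGlobal`, and `mangleInClass` are hypothetical names introduced for illustration, and the sketch assumes a single enclosing named class, as in the examples.

```cpp
#include <cassert>
#include <string>

// Hypothetical classification of the unnamed entities covered by the rules.
enum class UnnamedKind { GlobalStruct, Enum, NestedStruct };

// Itanium-style source name: decimal length followed by the identifier.
inline std::string sourceName(const std::string &id) {
  return std::to_string(id.size()) + id;
}

// Builds the stand-in identifier for an unnamed entity.
// `anchor` is the name the rule keys on: the first declarator for a global
// struct, the first enumerator for an enum, the first field for a nested
// struct. `depth` counts the unnamed aggregates the entity is nested in
// (1 when it sits directly inside a named class); only NestedStruct uses it.
inline std::string stableId(UnnamedKind kind, const std::string &anchor,
                            unsigned depth = 1) {
  assert(!anchor.empty() && "the rules need a named first member");
  switch (kind) {
  case UnnamedKind::GlobalStruct:
    return "%" + anchor;
  case UnnamedKind::Enum:
    return "~" + anchor;
  case UnnamedKind::NestedStruct:
    return std::string(depth, '.') + anchor; // one '.' per nesting level
  }
  return anchor; // not reached; keeps compilers quiet
}

// Mangled name of an entity at global scope.
inline std::string mangleGlobal(const std::string &id) {
  return "_Z" + sourceName(id);
}

// Mangled name of an entity nested in one named class.
inline std::string mangleInClass(const std::string &cls,
                                 const std::string &id) {
  return "_ZN" + sourceName(cls) + sourceName(id) + "E";
}
```

With these helpers, `mangleInClass("C", stableId(UnnamedKind::NestedStruct, "x", 2))` reproduces the `_ZN1C3..xE` name from the nested example.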
I am worried about what happens if you change the order of two unnamed objects. What does stable name mean?
If you change the order of two named objects, then maybe the compiler complains. However, the names are stable.
I’ve likewise encountered a need for stable mangled names for symbols (with or without linkage) when working on a product for a former employer. There are limits to how stable the names can be for code undergoing change as mentioned in prior comments, but changes like the ones indicated work well in practice.
@gmalecha, note that the Itanium ABI reserves $ and . for use in extensions like this (see Itanium C++ ABI). I recommend avoiding use of characters like % and ~.
@tschuett the algorithm above derives the stable name from the contents of the object (and its DeclContext), not from the order in which it appears in the file (or even the order in which it appears in its parent DeclContext), so I think that the cases that I’m proposing would result in stable names under permutation. For example,
struct C {
struct { int y; }; // _ZN1C2.yE
struct { int x; }; // _ZN1C2.xE
};
would produce the same name manglings according to the algorithm.
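To make the permutation argument concrete, here is a minimal sketch that compares a counter-based scheme with a content-based one. It is a toy model under stated assumptions: each unnamed struct is represented only by the name of its first field, and the `$_N` counter format and the function names are made up for illustration rather than taken from Clang.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy model: each unnamed struct is identified by its first field's name,
// listed in the order the structs appear in the enclosing class.
using UnnamedStructs = std::vector<std::string>;

// Counter-based naming: the assigned name depends on the position.
inline std::map<std::string, std::string>
counterNames(const UnnamedStructs &firstFields) {
  std::map<std::string, std::string> names;
  unsigned next = 0;
  for (const std::string &field : firstFields)
    names[field] = "$_" + std::to_string(next++);
  return names;
}

// Content-based naming: the assigned name depends only on the first field.
inline std::map<std::string, std::string>
contentNames(const UnnamedStructs &firstFields) {
  std::map<std::string, std::string> names;
  for (const std::string &field : firstFields)
    names[field] = "." + field;
  return names;
}
```

Swapping the two structs changes the result of `counterNames` but leaves the result of `contentNames` unchanged, which is the property claimed above.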
@tahonermann thanks for the pointer on the Itanium ABI. It should be fine to replace % and ~ with $ and . without any issue. In the mangling scheme, these characters become part of string names, i.e. a demangler would never actually “see” the . or $ as a potential control character.
We could possibly make this cheap by putting stable name generation behind a flag and default to the unstable case. Use cases like mine and @tahonermann’s could enable stable name generation explicitly.
There are cases where ordering concerns can’t be avoided and discriminators are required. An example follows. It would be possible to allocate discriminators based on look-alike cases, but at that point the names would be more stable than what the Itanium ABI requires for symbols with external linkage, so is unlikely to be worth the effort.
void f() {
{ enum { e }; }
{ enum { e }; }
}
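A rough sketch of what allocating discriminators among look-alike cases could look like is shown below. The function name and the `#N` suffix are hypothetical and are not the Itanium discriminator encoding; the point is only that identical candidate names have to be told apart by their order of appearance.

```cpp
#include <map>
#include <string>
#include <vector>

// Takes candidate names in order of appearance within one function body and
// appends a discriminator to every repeat of a name already seen.
inline std::vector<std::string>
withDiscriminators(const std::vector<std::string> &candidates) {
  std::map<std::string, unsigned> seen; // how often each name has occurred
  std::vector<std::string> result;
  for (const std::string &name : candidates) {
    unsigned n = seen[name]++;
    // The first occurrence keeps the plain name; later ones get "#1", "#2", ...
    result.push_back(n == 0 ? name : name + "#" + std::to_string(n));
  }
  return result;
}
```

For the function above, both enums yield the candidate `~e`, so the second one receives a discriminator. Reordering the two blocks changes which enum gets it, which is exactly the ordering dependence that cannot be avoided here.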
I would rather have just one naming scheme. Changing the names of symbols with no linkage or with internal linkage should be low risk. I think. @rjmccall would know better than I.
That’s a good point @tahonermann. Personally, I think that depending on order within a function is not a big problem since such declarations are relatively rare and their scope is so limited. If the body of the function changes, then you effectively need to re-analyze the function.
I agree. As I mentioned, creating more stable names than what the Itanium ABI requires for symbols with external linkage is unlikely to be worth the effort (particularly since there is presumably no intent to improve stability for symbols with external linkage).
Are you only interested in improving stability for the Itanium ABI mangling? Or are you also interested in changes for the Microsoft mangling?
Ideally, I’d love to see the Itanium specification extended to include something like a recommendation (not a requirement) for mangling of names with no linkage or internal linkage, though that is arguably out of scope for the ABI spec. @rjmccall, would something like that be considered acceptable for inclusion? Regardless, I think the best starting point would be drafting a specification for the names and posting that for feedback (probably as a GitHub issue for itanium-cxx-abi/cxx-abi). Once there appears to be consensus for a specification, then a pull request that implements it would probably be more welcome since it would be backed by something peer reviewed and concrete.
Are you also interested in stable names for templates (not template specializations, but template definitions)? I also needed such names in the past.
At the moment, our tools do not use the Microsoft mangling. In fact, we can get away in our tool by writing a completely custom mangler, but having pre-existing tools like c++filt work is very useful which is why our mangling scheme is based so heavily on the Itanium mangler.
It seems likely that we will need it in the future, but since our tool currently only processes templates after they have been instantiated, I don’t have enough insight at the moment to know exactly what to do here.
I don’t know how complex or time consuming it would be to get the mangling specification extended. Do you think something like that would be necessary for a Clang PR to be accepted? Or do you think it would just be a nice-to-have?
If we want a formal mangling scheme that’s actually supported by demanglers, then having a specification for that scheme seems necessary. If the goal is just to informally make the mangling change less without formally specifying it, we don’t necessarily need a spec. (We don’t have a spec for what we currently do, anyway.)
I’m concerned that names generated by concatenating all the member names could get excessively long, particularly for enums.
How often does clang IR generation need to mangle unnamed types in IR, though? I guess they can make it into template arguments in some cases? Or get directly used with decltype? Or are you using the mangler directly in some sort of plugin?
Stability for function names is relevant for PGO and other forms of profiling.
Reach out to me when/if you decide to take on templates. There are many interesting cases such as generating unique names for each class template partial specialization and ensuring unique names for function templates that can be distinguished via SFINAE. For example:
template<typename T, typename T::X>
int ft();
template<typename T, typename T::Y>
int ft();
struct S {
using X = int;
};
auto g = ft<S, 1>();
I think it would help to catch design issues. If the goal is to produce stable names, then fixing design bugs later will cause instability. That can make fixes prohibitive in some cases. Name mangling is pretty tricky.
In most cases, it is only necessary to use the first member since its name must be unique in the surrounding scope.
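The uniqueness claim follows from the way members of an anonymous aggregate are injected into the enclosing scope. The sketch below illustrates this with anonymous unions, since standard C++ only has anonymous unions (anonymous structs are an extension accepted by Clang and GCC); `C` and `sumMembers` are illustrative names only.

```cpp
struct C {
  union { int x; };
  union { int y; };
  // union { int x; };  // ill-formed: 'x' would be declared twice in C
};

// The members of each anonymous union are named directly through C, which is
// why a first member's name cannot clash with any other name in that scope.
inline int sumMembers() {
  C c{};
  c.x = 1;
  c.y = 2;
  return c.x + c.y;
}
```

Because `x` and `y` live in `C`'s scope, the first member alone is enough to tell the two unnamed types apart there.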
Yes, they can. For example:
template<typename>
int ft() { return 0; }
inline auto f() {
return []{};
}
using T = decltype(f());
auto v = ft<T>();
Note that the template specialization has external linkage despite the unnamed argument because the lambda type is declared in an inline function.