RFC: changing variable naming rules in LLVM codebase

TL;DR: change the rule for variable names from UpperCamelCase to lowerCamelCase.

Just to get wider visibility on this, I’m raising this again as an RFC, as suggested by Roman Lebedev.

My original post from last week is here and gives a rationale: http://lists.llvm.org/pipermail/llvm-dev/2019-February/129854.html. There seemed to be general agreement that the status quo is not ideal.

Chris Lattner pointed out that this came up in 2014 as well: http://lists.llvm.org/pipermail/llvm-dev/2014-October/077685.html

I’ve created a patch to implement the change. Review and comments welcome: https://reviews.llvm.org/D57896

I’m concerned about the internal inconsistency we would live with for (likely) many years if there is not a migration plan to converge the code base toward the new naming convention.
(other than that, changing the convention is fine with me, I don’t have much preference here)

I think consistency within a function (for locals) and within a class (for class members) is enough. I don’t think that is any worse than the consistency over
int p;
versus
int
p;

I have the same concern. The whole advantage of a common coding
convention is consistency. There are exceptions, but the vast majority
of LLVM and Clang code I've read does indeed stick to the current
CamelCase convention. Unless there's a plan for conversion then the
practical impact of the naming convention change is that the codebase
will be a muddle of mixed conventions for years. That seems like a
regression, even if camelBack is a better convention.

On a more practical note, if the intent is to move to camelBack I
think it would be worth adding an example to the coding standard for
handling an acronym. e.g. is it the intent that TTI would become tti.

Best,

Alex

Regarding a plan for conversion, I'm keen to avoid perfect being the enemy of better.

Privately, people I've spoken with have told me that they're opposed to a large scale conversion. Reasons given include breaking git blame, and creating needless merge conflicts. I might be wrong, but the evidence I've seen suggests that it's going to be very hard to get consensus on a conversion.

So what's worse: inconsistent capitalization or keeping a convention that discourages good naming?

Taking my previous example [1]:

InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC,
  &LVL, &CM);

If we imagine that over time it evolves such that 50% of the variables have been renamed to camelBack versions of the type names, then it will look like this:

InnerLoopVectorizer LB(loop, PSE, loopInfo, DT, targetLibraryInfo, TTI,
  assumptionCache, ORE, vectorizationFactor.Width, IC,
  &loopVectorizationLegality, &CM);

Yes it's inconsistent, but IMHO it still conveys so much more information than the original that the benefit greatly outweighs the cost.

So is the disruption implied by changing to the new convention justified? I think so.

[1] https://github.com/llvm/llvm-project/blob/ec0263027811f48b907224ede0dd24c33c1c7507/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L7433

+1

Regarding a plan for conversion, I'm keen to avoid perfect being the enemy of better.

Privately, people I've spoken with have told me that they're opposed to a large scale conversion. Reasons given include breaking git blame, and creating needless merge conflicts. I might be wrong, but the evidence I've seen suggests that it's going to be very hard to get consensus on a conversion.

So what's worse: inconsistent capitalization or keeping a convention that discourages good naming?

Taking my previous example [1]:

InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC,
  &LVL, &CM);

If we imagine that over time it evolves such that 50% of the variables have been renamed to camelBack versions of the type names, then it will look like this:

InnerLoopVectorizer LB(loop, PSE, loopInfo, DT, targetLibraryInfo, TTI,
  assumptionCache, ORE, vectorizationFactor.Width, IC,
  &loopVectorizationLegality, &CM);

I find myself less productive in a codebase with inconsistent styling
like you show above because it is more difficult to "guess" the name
of a variable. E.g. is the LoopInfo parameter named LI or loopInfo?
I'll have to double check to be sure, which adds an extra step.

So maybe a gradual transition plan could be to allow these upper case
acronyms for specific classes? For instance we could start by
designating a set of "common" classes like Function, BasicBlock
DominatorTree, LoopInfo, ScalarEvolution whose instances would
instances still be called F, BB, DT, LI and SE, but mandate all other
classes should use the new camelCase convention to name their
instances? I think this helps readability since the reader will know
that "BB" is always the basic block, "F" is always the function etc.
And I don't have the "guessing" problem I mentioned above since
VectorationFactor instances are always "vectorizationFactor" and
LoopInfo instances are always "LI". We could then try to shrink this
set down over time (or not).

-- Sanjoy

Hold on...

The change from UpperCamel to lowerCamel should be separate from going from X to somethingOtherName.

It seems like in this example, TLI is changed to targetLibraryInfo for the purpose of having a name that lowerCamel can be applied to, which is, at best, backwards.

When a new person sees "TLI", they won't know what it is. When an LLVM developer sees "TLI" they know exactly what it is, even without any context. At the same time, a person is new to LLVM for only a certain period of time, much shorter than the lifetime of the code.

The key to readability is to make the important things easy to see, and get the unimportant things out of the way. By using completely expanded names we run the risk of making everything equally "easy to see"...

-Krzysztof

That's interesting because I have always thought it strange to name members differently
from other variables. I guess in my mind if a local variable isn't easily identified as such,
it's either declared much too far away from its use (the function is too large, is lacking
proper scoping, whatever) or it is not well-named such as to denote its use. Note that
I specifically write, "denote its use" and not, "denote its scope." Of course the poor
naming could go the other way; naming a member "i," for example.

I don't think I've ever come across a naming convention that treats function parameters
specially. Why? Arguably they are as different from locals as members are, particularly
when it comes to reference parameters.

Slapping an "m_" in front of poorly-named members isn't really going to help much, any
more than slapping an "l_" in front of local variables would.

That said, I am certainly open to being convinced otherwise.

                                               -David

Chandler wrote:

FWIW, I’m pretty strongly opposed to humbleCamelCase. We already use that
style so something else.

Presumably you are equally opposed to RegularCamelCase, because we already
use that style for something else.

I would be opposed to going through the very significant cost of changing the naming convention, merely to end up there.

The convention we already use has a huge advantage of already being relatively consistent.

But really, objecting on the grounds that a given style is already used for
function names is really a very weak argument. IME function names are
incredibly hard to confuse with anything else, because they always have
surrounding syntactic context. Given TheStuff->fooBar().getThingy() is it
even conceivable that you might not instantly get that fooBar and getThingy
are methods? Therefore, using the same convention for some other kind of
name is Not Confusing.

I disagree FWIW… Lambdas (and callables generally) at the least make this ambiguous.

I think the fact that named things are called does not fully disambiguate what they are.

I’m not trying to say that the collision with functions is as confusing as that of colliding with types. Merely that both seem confusing. And I find foo_bar_baz and fooBarBaz basically equivalent[1]. So between those equivalents, I would choose the one with fewer collisions.

[1]: Ok, not quite, but I find this to be a more personal preference and am trying to weight it lower as a consequence. I find functions much more similar to types – they are manifest properties of the program. I find FooBarBaz and fooBarBaz to be very similar looking. There is a distinction, but it is a minor one. I would prefer a greater visual difference for variables, which foo_bar_baz provides.

OTOH, TheStuff comes out of nowhere with no clues to its origin, and that
is a barrier to code-reading IME. Even renaming it to stuff would help
approximately zero percent. Parameter? Local? Class member? Global? LLVM has
incredibly few globals for other reasons, but using the same convention for
locals and class members is a real problem for code-reading, especially code
operating in methods for classes you’re not super familiar with.

I acknowledge that the current RFC doesn’t propose a member naming convention
different from other variables, but IMO it really ought to. That is the
distinction that would really help in reading unfamiliar code.

I could see a lot of utility of this, but I do find it to be orthogonal.

Also FWIW, I also like m_ prefix for member variables when combined with lowercase_underscore naming.

IF it is worth going through the significant cost of switching, and we have a plan to minimize the cost to developers reading inconsistent code.

There is of course some amount of llvm and clang code which already uses initialLowerCaseNames for variable names too, contrary to the style guide. I don’t know how to easily quantify how much.

There is also a decent amount of code in Clang using foo_bar_baz. ::shrug::

I think the amount of all of these pales in comparison to LLDB, and I think generally all of these are not going to significantly change the total cost of transition.

-Chandler

FWIW, I agree with the idea of this (the inconsistency is very expensive for me as well).

However, I would find tying it to the type still terribly difficult. I would have a very hard time remembering which classes were part of which set.

Personally, I’d much rather the granularity of either an “interface” (type + methods, or overload set of namespace functions) being consistent, or a “file” being consistent. Both of those would be much cheaper for me to remember and reliably follow.

It would also largely match our existing points of divergence.

Taking my previous example [1]:

InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC,
&LVL, &CM);

If we imagine that over time it evolves such that 50% of the variables have been renamed to camelBack versions of the type names, then it will look like this:

InnerLoopVectorizer LB(loop, PSE, loopInfo, DT, targetLibraryInfo, TTI,
assumptionCache, ORE, vectorizationFactor.Width, IC,
&loopVectorizationLegality, &CM);

Hold on…

The change from UpperCamel to lowerCamel should be separate from going
from X to somethingOtherName.

FWIW, I suspect separating the transition of our acronyms from the transition of identifiers with non-acronym words may be an effective way to chip away at the transition cost… Definitely an area that people who really care about this should look at carefully.

It seems like in this example, TLI is changed to targetLibraryInfo for
the purpose of having a name that lowerCamel can be applied to, which
is, at best, backwards.

When a new person sees “TLI”, they won’t know what it is. When an LLVM
developer sees “TLI” they know exactly what it is, even without any
context. At the same time, a person is new to LLVM for only a certain
period of time, much shorter than the lifetime of the code.

The key to readability is to make the important things easy to see, and
get the unimportant things out of the way. By using completely expanded
names we run the risk of making everything equally “easy to see”…

I think this bias towards acronyms (which I used to share) due to keeping things short but still recognizable once people become deeply familiar with LLVM is the wrong prioritization. It does work well for experienced LLVM developers, but I think we should do much more to facilitate and encourage people who are not in this set. While this does come at some cost to highly experienced LLVM developers (reading library_info instead of TLI), but it seems easily worth it to make the codebase more accessible to new contributors.

> Taking my previous example [1]:
>
> InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC,
> &LVL, &CM);
>
> If we imagine that over time it evolves such that 50% of the variables have been renamed to camelBack versions of the type names, then it will look like this:
>
> InnerLoopVectorizer LB(loop, PSE, loopInfo, DT, targetLibraryInfo, TTI,
> assumptionCache, ORE, vectorizationFactor.Width, IC,
> &loopVectorizationLegality, &CM);

Hold on...

The change from UpperCamel to lowerCamel should be separate from going
from X to somethingOtherName.

FWIW, I suspect separating the transition of our acronyms from the transition of identifiers with non-acronym words may be an effective way to chip away at the transition cost... Definitely an area that people who really care about this should look at carefully.

It seems like in this example, TLI is changed to targetLibraryInfo for
the purpose of having a name that lowerCamel can be applied to, which
is, at best, backwards.

When a new person sees "TLI", they won't know what it is. When an LLVM
developer sees "TLI" they know exactly what it is, even without any
context.

I wouldn't downplay context, my first thought when I see TLI is
TargetLoweringInfo (of course, then I see Vectorizer around and get
back on track, but when it's spelled out that reflex just doesn't kick
in to begin with).

At the same time, a person is new to LLVM for only a certain
period of time, much shorter than the lifetime of the code.

True, but their impression of LLVM when they are new to it may
influence their decision to stick with it or not.

The key to readability is to make the important things easy to see, and
get the unimportant things out of the way. By using completely expanded
names we run the risk of making everything equally "easy to see"...

I think this bias towards acronyms (which I used to share) due to keeping things short but still recognizable once people become deeply familiar with LLVM is the wrong prioritization. It does work well for experienced LLVM developers, but I think we should do much more to facilitate and encourage people who are not in this set. While this does come at some cost to highly experienced LLVM developers (reading `library_info` instead of `TLI`), but it seems easily worth it to make the codebase more accessible to new contributors.

+1 for making things accessible to new contributors. Note that this
also works for people that aren't new to LLVM per se, but may be new
to a given part of the codebase.

I'm also not sure it comes at such a high cost to experienced
developers, since you'd get used to seeing 'library_info' after a
while. It's not like your brain processes every single letter to
recognize a word. Personally, when I look through code I decode the
acronyms to the full name in my head anyway (it's not something I do
on purpose, it's just how it works for me).

Just my 2c.
Diana

+1 to David’s statement that naming members and locals differently seems strange to me. I can’t think of a case where it’s been important for me to distinguish between a local and class member and it hasn’t already been clear at a glance/click etc. Frankly, I just find turning things into non-English words (e.g. due to a prefix/suffix) strange and makes it harder for me to actually read/talk about code with people in person etc (e.g. try talking about the theoretical member variable mThing/m_thing with someone).

For what it’s worth, I’ve definitely been in situation where I’ve wondered whether some variable is a member or not (i.e. local or function argument). Especially if I wasn’t using IDE and was looking at an unfamiliar code.
It usually wasn’t hard to resolve but it incurred a “context switch” penalty.

Never personally had a problem with short one letter_ prefixes. You don’t need to pronounce them or treat them like part of the name. Similarly to how you don’t need to underscores.

That’s obviously just my opinion (and I’m just starting to get familiar with LLVM code base). Also, pervasiveness of acronyms was bigger obstacle in reading new code, in my experience.

FWIW, I suspect separating the transition of our acronyms from the transition
of identifiers with non-acronym words may be an effective way to chip away at
the transition cost... Definitely an area that people who really care about
this should look at carefully.

If it makes for an easier transition then I'd be happy to go with upper case acronyms and camelBack for non-acronyms. I've updated https://reviews.llvm.org/D57896 accordingly.

-Michael

The issue here is that neither of us is a new contributor, but we're trying to guess what it would be like for someone new. It may seem that long names make it easier, but when I started with LLVM I actually found the naming convention (and the use of abbreviations) very appealing.

I was recently involved in another project, and the most important thing for me was to identify the logical structure of it: what components it's made of, what's the role of each component, etc. Reading the lines of code in detail was secondary. In my experience, accessibility of a project is directly related to how easy it is to understand what you need to do (conceptually) when you want to implement some changes. In that sense LLVM does really well, but it's more of a function of its design.

I'd be careful not to overestimate the impact of the naming convention on project's accessibility. Barring extreme cases, I'd even suggest that it's one of the easier things to get used to.

-Krzysztof

Regarding a plan for conversion, I’m keen to avoid perfect being the enemy of better.

It seems a bit early to discuss conversion given there’s not consensus on a style. For example:

If we imagine that over time it evolves such that 50% of the variables have been renamed to camelBack versions of the type names, then it will look like this:

This comment seems to assume that camelBack is the agreed upon style and all we need to do now is move forward, and I do not believe that to be the case

I see it a bit differently. The first question is "should we change
the LLVM naming conventions". I view the plan for conversion as
essential to answering this question - IMHO if we're going to live
with mixed styles for years (which would be the case if there were no
concerted conversion effort) then the advantages of changing naming
convention are outweighed by the disadvantages by quite some way. So
while I appreciate the desire to separate concerns, I find it
difficult to do so in this case.

Best,

Alex

It seems a bit early to discuss conversion given there’s not consensus on a style.

My reason for pushing the conversion conversation is that if it's decided that we want to clang-tidy everything then we have much more leeway in which style we use. If we allow conversion to happen organically, then for that to work we need to take into account the current style coexisting with the new style, whatever that may be.

My fear is that if we insist on a big-bang transition then there will be sufficient opposition that we end up making no change at all. For that reason I favour allowing conversion to happen organically.

This comment seems to assume that camelBack is the agreed upon style and all we need to do now is move forward

No, I don't believe it's agreed upon, sorry to give that impression. I see strong views on both sides. I just picked one for arguments sake.