RFC: changing variable naming rules in LLVM codebase

TL;DR: change the rule for variable names from UpperCamelCase to lowerCamelCase.

Just to get wider visibility on this, I’m raising this again as an RFC, as suggested by Roman Lebedev.

My original post from last week is here and gives a rationale: http://lists.llvm.org/pipermail/llvm-dev/2019-February/129854.html. There seemed to be general agreement that the status quo is not ideal.

Chris Lattner pointed out that this came up in 2014 as well: http://lists.llvm.org/pipermail/llvm-dev/2014-October/077685.html

I’ve created a patch to implement the change. Review and comments welcome: https://reviews.llvm.org/D57896

An excellent first step toward improving our naming style, I’m all for it. Very much tired of circumlocutions like “Triple TheTriple;”.

–paulr

Just to add another data point: I’m pretty much in love with the lowerCamelCase.
To be more objective: I’ve such code across LLVM’s code base:

    class Dragon {};
    /// ...
    Dragon Dragon;

Which compiles correctly, but debuggers get confused when you ask them to do any actions on Dragon: they could not differentiate between variable and class*.

* not sure if that’s still the case with the recent versions of gdb/lldb, but it was caused some problems while debugging in the past.

I don’t care about the convention, but I’m really not sure it’s worth the churn which would result in the code base. The hurtle which needs cleared here is not “is it a better naming style”, but “is the disruption implied by changing to the new convention justified”. To be clear, not opposed, just hesitant.Â

Philip

(Sorry if this subject already has been discussed, but I could not find any clear rules/recommendations.)

What would the recommendation be for acronyms (I’ve seen the rule about avoiding them unless they are “well known”,

but sometimes an acronym is useful, and we at least need to have some recommendation for the “well known” ones).

Example:

if (ConstantExpr *CE = dyn_cast(V))

if (CE->getOpcode() == Instruction::GetElementPtr &&

CE->getOperand(0)->isNullValue()) {

In the above example, is the recommendation to use “ce” instead of “CE” now? Or should it be “cE”?

With lowerCamelCase one might think that “cE” is the correct one (although I personally think that one looks quite ugly).

Maybe there should be an exception that variable names that start with an acronym still should start with an upper case letter?

/Björn

I would assume that the proper name in this case is constantExpr, and not CE.
This is not really an acronym, but rather a shortcut taken for some unclear reason.

The reason is clear: the variable name in such a context doesn't add anything, since it's obvious what it is. Long names should be used where meaning needs to be conveyed, otherwise they just obfuscate the code needlessly.

-Krzysztof

It very much depends on what is following the code snippit. If the
second "if" is guarding a substantial block of code, "constantExpr"
might very well be a good name. Otherwise something like "cExpr" or
"constExpr" might be fine.

In the past when I have seen things like "CE" in the code, it's not
always immediately clear to me what it is. I have to go find the
declaration.

                             -David

Krzysztof Parzyszek via llvm-dev <llvm-dev@lists.llvm.org> writes:

One additional consideration is that LLDB currently uses underscore_names. It might be worth considering that style on these grounds alone, as that would bring a very large existing part of LLVM into conformance

FWIW, I’m pretty strongly opposed to humbleCamelCase. We already use that style so something else. One of the biggest advantages of changing the variable naming convention would be to differentiate variables from other constructs IMO, and that’s the nature of many examples here.

Using underscore_names for variables as Zach suggests would, onhe other hand, be a significant improvement IMO.

That said, all of these changes are pretty dramatic and expensive. I’m personally in favor but you’d want a lot of buy in from the community as well as some really good tooling I think to help people update code they’re about to make substantial changes to prior to those changes, much like we often do with clamg-format.

(Sorry if this subject already has been discussed, but I could not find any clear rules/recommendations.)

What would the recommendation be for acronyms (I’ve seen the rule about avoiding them unless they are “well known”,
but sometimes an acronym is useful, and we at least need to have some recommendation for the “well known” ones).

Example:

    if (ConstantExpr *CE = dyn_cast<ConstantExpr>(V))
      if (CE->getOpcode() == Instruction::GetElementPtr &&
          CE->getOperand(0)->isNullValue()) {

In the above example, is the recommendation to use “ce” instead of “CE” now? Or should it be “cE”?
With lowerCamelCase one might think that “cE” is the correct one (although I personally think that one looks quite ugly).

In most examples, you’d use something other than an initialism. I was the one who started the whole CE thing as a contraction of the type into something short, but there are almost always things that work otherwise. For example, I’d now write that example as:

    if (ConstantExpr *expr = dyn_cast<ConstantExpr>(value))
      if (expr->getOpcode() == Instruction::GetElementPtr &&
          expr->getOperand(0)->isNullValue()) {

or perhaps ‘constant' instead of ‘expr’.

Maybe there should be an exception that variable names that start with an acronym still should start with an upper case letter?

That would also be fine with me - it could push such a debate down the road, and is definitely important for a transition period anyway.

-Chris

“CE” was just an example of a less common acronym. I think such an acronym

can be used for various purposes, so it is probably quite local in the code what it

stands for.

We also have other (I think more common) acronyms like TLI for

TargetLoweringInfo, or MF for MachineFunction.

As Krzystzof were saying “Long names should be used where

meaning needs to be conveyed, otherwise they just obfuscate the code

needlessly.”

Having to write out targetLoweringInfo instead of the short form TLI

everywhere would probably obfuscate things (lots of long lines, line splits)

rather than making things easier to read/understand.

FWIW, there is a proposal in the RFC, and the coding standard still say that

acronyms may be used. I still haven’t seen any answers about what the

“correct way” would be when using an acronym if the RFC is accepted.

I’ve only seen answers about not using acronyms, but that has not part

of the RFC afaict.

My point was that the RFC need to be clear about what to do with the

acronyms if we go down that path.

/Björn

Chandler Carruth <chandlerc@gmail.com> writes:

FWIW, I'm pretty strongly opposed to humbleCamelCase. We already use
that style so something else. One of the biggest advantages of
changing the variable naming convention would be to differentiate
variables from other constructs IMO, and that's the nature of many
examples here.

I guess I don't see a lot of point to being different for difference's
sake. Functions use humbleCamelCase but it's fairly easy to
differentiate a function from a variable at the point of use, unless one
is referencing a function's address in which case it kinda is like a
variable in that context.

humbleCamelCase also has the advantage of removing the weird special
case for lambdas, where in one sense it's a function and in another
sense it's a variable.

What are the current uses of humbleCamelCase that concern you?

In the end I don't really care what the convention is. I'm not sure a
mechanical updating of the source is worth it, as that can make using
git blame slightly inconvenient.

                               -David

Chandler wrote:

FWIW, I'm pretty strongly opposed to humbleCamelCase. We already use that
style so something else.

Presumably you are equally opposed to RegularCamelCase, because we already
use *that* style for something else.

But really, objecting on the grounds that a given style is already used for
function names is really a very weak argument. IME function names are
*incredibly* *hard* to confuse with anything else, because they *always* have
surrounding syntactic context. Given `TheStuff->fooBar().getThingy()` is it
even conceivable that you might not instantly get that fooBar and getThingy
are methods? Therefore, using the same convention for some other kind of
name is Not Confusing.

OTOH, `TheStuff` comes out of nowhere with no clues to its origin, and *that*
is a barrier to code-reading IME. Even renaming it to `stuff` would help
approximately zero percent. Parameter? Local? Class member? Global? LLVM has
incredibly few globals for other reasons, but using the same convention for
locals and class members is a real problem for code-reading, especially code
operating in methods for classes you're not super familiar with.

I acknowledge that the current RFC doesn't propose a member naming convention
different from other variables, but IMO it really ought to. *That* is the
distinction that would really help in reading unfamiliar code.
--paulr

I want to reiterate the benefit that underscore_names would bring. To be clear it’s not my favorite style, but it does have a very concrete advantage which is that we have a very large subproject already using it. it doesn’t make sense to do a purely aesthetic move that not everyone is going to agree on anyway, when we could do one with actual tangible value.

There is of course some amount of llvm and clang code which already uses initialLowerCaseNames for variable names too, contrary to the style guide. I don’t know how to easily quantify how much.

E.g. ParseGNUAttributes in clang/include/clang/Parse/Parser.h is one I noticed.

I have to agree with Paul that I think it is rather useful to have a naming convention that distinguishes class members from locals, etc. I’m not sure what that would look like, whether an m prefix for data members would be something others would entertain, but something that makes it clear would probably be useful. To use Paul’s example, I think that mTheStuff vs. TheStuff makes it super easy to visually identify what this is. I imagine this wasn’t mentioned in this thread or previously adopted because of some good reason I am not aware of.

A more minor point about underscores vs camel case - what I like about camel case is that it generally keeps my fingers on the 3 rows of the keyboard I use the most. From an ergonomics perspective, I find typing a whole lot of underscores a bit unnatural. So since I find camel case easier to type and equally as readable, I would favour it over underscores.

fwiw, LLDB also uses m_ for member variables, so if we were to adopt an m prefix, then in conjunction with lowercase_underscore the entire codebase would be conforming.

“CE” was just an example of a less common acronym. I think such an acronym

can be used for various purposes, so it is probably quite local in the code what it

stands for.

We also have other (I think more common) acronyms like TLI for

TargetLoweringInfo, or MF for MachineFunction.

As Krzystzof were saying “Long names should be used where

meaning needs to be conveyed, otherwise they just obfuscate the code

needlessly.”

Having to write out targetLoweringInfo instead of the short form TLI

everywhere would probably obfuscate things (lots of long lines, line splits)

rather than making things easier to read/understand.

FWIW, there is a proposal in the RFC, and the coding standard still say that

acronyms may be used. I still haven’t seen any answers about what the

“correct way” would be when using an acronym if the RFC is accepted.

I’ve only seen answers about _not_ using acronyms, but that has not part

of the RFC afaict.

My point was that the RFC need to be clear about what to do with the

acronyms if we go down that path.

I vote for treating acronyms like any other individual word written in
camel case or snake case. That way, no one has to waste cycles trying
to dodge initial or consecutive acronyms when they make sense.

For example, given acronyms FOO and BAR, the phrase "FOO BAR" becomes
"fooBar", "FooBar", or "foo_bar".

Another occasional bonus is that a dictionary of acronyms isn't
required to automatically convert arbitrary names among the forms.

Some people think UpperCamelCase is ugly for acronyms, such as "Gdb"
or "Llvm". That's the only objection I'm aware of for this approach,
and it annoys me far less than the above issues. Maybe I'm in the
minority.

Joel

I’m concerned about the internal inconsistency we would live with for (likely) many years if there is not a migration plan to converge the code base toward the new naming convention.
(other than that, changing the convention is fine with me, I don’t have much preference here)