[RFC] Coding Standards: "prefer `int` for regular arithmetic, use `unsigned` only for bitmask and when you intend to rely on wrapping behavior."

Hi,

The LLVM coding style does not specify anything about the use of signed/unsigned integers, and the codebase is inconsistent (though a majority of code today uses unsigned loop indices).

I’d like to suggest that we specify to prefer int when possible, using unsigned only for bitmasks and when you intend to rely on wrapping behavior; see: https://reviews.llvm.org/D63049

A lot has been written about this; good references are [unsigned: A Guideline for Better Code](https://www.youtube.com/watch?v=wvtFGa6XJDU) and “Garbage In, Garbage Out: Arguing about Undefined Behavior…”, as well as a panel discussion on the topic.

Other coding guidelines already embrace this.

It is rare that overflowing (and wrapping) an unsigned integer does not indicate a bug, unless the wrap was intentionally handled. Using signed arithmetic means that you can actually trap on over/underflow and catch these bugs (when fuzzing, for instance).
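For illustration (this snippet is mine, not from the RFC): UBSan can trap the signed overflow at runtime, while the unsigned wrap is well-defined and therefore silent by construction:

    // Compile with: clang++ -fsanitize=signed-integer-overflow overflow.cpp
    #include <cstdio>
    #include <limits>

    int main() {
      int SignedMax = std::numeric_limits<int>::max();
      unsigned UnsignedMax = std::numeric_limits<unsigned>::max();

      unsigned Wrapped = UnsignedMax + 1u; // Well-defined wrap to 0; no diagnostic possible.
      int Overflowed = SignedMax + 1;      // UB: UBSan reports "signed integer overflow" here.

      std::printf("%u %d\n", Wrapped, Overflowed);
      return 0;
    }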

Chandler explained this nicely in his CppCon 2016 talk “Garbage In, Garbage Out: Arguing about Undefined Behavior…”. I encourage you to watch the full talk, but here is one relevant example: https://youtu.be/yG1OZ69H_-o?t=2006 , and here he mentions how Google experimented with this internally: https://youtu.be/yG1OZ69H_-o?t=2249

Unsigned integers also have a discontinuity just to the left of zero. Suppose A, B, and C are small positive integers close to zero, say all less than a hundred or so. Then given:

A + B > C
and knowing elementary school algebra, one can rewrite that as:
A > C - B
But B might be greater than C, and in unsigned arithmetic the subtraction would then wrap around to some huge number. This happens even when working with seemingly harmless numbers like A=2, B=3, and C=2.
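A minimal standalone sketch of that discontinuity (my example, using the values above):

    #include <cstdio>

    int main() {
      unsigned A = 2, B = 3, C = 2;
      std::printf("%d\n", A + B > C); // prints 1: 5 > 2 is true
      std::printf("%d\n", A > C - B); // prints 0: C - B wraps to a huge value (4294967295 for 32-bit unsigned)
      return 0;
    }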

I'd prefer us to have something neater than static_cast<int> for the
loop problem before we make that change. Perhaps add an ssize (or
equivalent) method to all of our internal data structures? They're a
lot more common than std::* containers.

But it's not something that would keep me up at night.

Cheers.

Tim.

I’d prefer us to have something neater than static_cast<int> for the
loop problem before we make that change.

Not sure this is much of a problem in LLVM-land… We already encourage a loop style that avoids the classical problem:

for (int Index = 0, Size = MyVector.size(); Index < Size; ++Index) {
// ...
}

Alternatively, LLVM supports a form I like even more:

for (int Index : llvm::seq<int>(0, MyVector.size())) {
// ...
}

If passing the explicit type is an issue, we could fix that?
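For instance (purely hypothetical, not an existing LLVM API), a small wrapper over llvm::seq could deduce the bound so the index type is spelled only once:

    // Hypothetical helper, sketched on top of llvm/ADT/Sequence.h: iterate
    // the index range [0, Container.size()) with a signed index.
    #include "llvm/ADT/Sequence.h"

    template <typename ContainerT>
    auto indexes(const ContainerT &Container) {
      return llvm::seq<int>(0, static_cast<int>(Container.size()));
    }

    // Usage: for (int Index : indexes(MyVector)) { ... }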

FWIW, I’d also be fine changing the return value of size() for LLVM containers, and given the variability of standard library container performance, I’m generally a fan of LLVM code consistently using its own containers.

-Chandler

+1

Since C++20 is also introducing ssize() [1] members, this makes a lot of
sense to me. Using it would keep an unsigned comparison such as

    if (IndexOfInterestingElement >= Container.size())
      ...

from sneaking in in the first place.
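For illustration (assuming C++20's std::ssize from <iterator>), the comparison then stays entirely in signed arithmetic:

    #include <iterator>
    #include <vector>

    bool isPastTheEnd(std::ptrdiff_t IndexOfInterestingElement,
                      const std::vector<int> &Container) {
      // Both operands are signed; no implicit conversion to unsigned occurs.
      return IndexOfInterestingElement >= std::ssize(Container);
    }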

Michael

[1] P1227R1: Signed ssize() functions, unsigned size() functions

Maybe it’s just because I work in code around the binary file formats almost exclusively, but unsigned (or more often uint64_t) is FAR more common than int everywhere I go. I don’t have time right now to read up on the different links you provided, and I expect this is covered in them, but it also seems odd to me to use int in a loop when indexing in a container (something that can’t always be avoided), given the types of size() etc.

I’m in the same situation James is in and thus have the same bias but I’ll +1 that comment nevertheless. I think I prefer using size_t or the uintX_t types where applicable. Only when I need a signed value do I use one.

+1 / me too

I don’t think there’s a problem here that needs to be fixed with a coding standard.

+1 to preferring unsigned types.

~Aaron

I’d appreciate it if you guys could provide rationale that addresses the extensive arguments and opinions from the C++ community that I tried to summarize in the link above.
Otherwise I don’t know what to take away from an unmotivated “+1”.

Yeah, I want to make it clear that I am in favour of "int" as a
default, even if preparatory changes are needed to get there. My main
reservation was over the loops

    for (int i = 0; i != static_cast<int>(X.size()); ++i)

required to satisfy compilers when *::size returns unsigned. I dislike the

    for (int i = 0, Size = X.size(); i != Size; ++i)

form that has traditionally been suggested because it's still horribly verbose.

I hadn't even heard of the llvm::seq construct Chandler mentioned
though, and I think it addresses all my worries. It looks really neat
to me, and I'm not bothered by the duplicated "int".

So my current position is that I'm all in favour of switching to int,
with llvm::seq as the suggested alternative in for loops.

Cheers.

Tim.

Sorry for the brevity, I am currently travelling and responding on a cell phone. I won’t be able to give you a full accounting until later, but 1) I don’t see a motivating problem this churn solves, 2) signed int does not represent the full size of an object like size_t does and is inappropriate to use for addressing into objects or arrays, which means we won’t use this convention consistently anyway. I have yet to be convinced by the C++ community’s very recent desire to switch everything to signed integers and would be very unhappy to see us switch without considerably more motivating rationale.

~Aaron

Sorry for the brevity, I am currently travelling and responding on a cell phone. I won’t be able to give you a full accounting until later,

There is no hurry right now!

but 1) I don’t see a motivating problem this churn solves, 2) signed int does not represent the full size of an object like size_t does and is inappropriate to use for addressing into objects or arrays, which means we won’t use this convention consistently anyway.

As far as I can tell, the same argument applies to “unsigned”: it does not represent the full extent of size_t on every platform either! Yet we’re still using it as an index, including to index into objects and arrays.

I have yet to be convinced by the C++ community’s very recent desire to switch everything to signed integers and would be very unhappy to see us switch without considerably more motivating rationale.

Note that this is not about “switching” anything: there is no standard in LLVM right now (as far as I know) and the codebase is inconsistent (I have been using int in general for a while, I believe).

Glad to hear there’s no plan to change the codebase, but then I fail to see the purpose of the coding rule. Won’t it boil down to “use the correct datatype for what you’re doing?”

What I mostly want to avoid is code that uses signed values for inherently unsigned operations like comparing against the size of an object. Then we start wanting ssize() functions or using casts for confused reasons.

~Aaron

Glad to hear there’s no plan to change the codebase

Just like any other guideline, new code should follow it unless there is a good reason not to (the wording is “prefer”).

, but then I fail to see the purpose of the coding rule. Won’t it boil down to “use the correct datatype for what you’re doing?”

How do you define what is the correct datatype?

What I mostly want to avoid is code that uses signed values for inherently unsigned operations like comparing against the size of an object.

I believe `if (index < v.size() - 1)` can be very surprising when `v.empty()`.
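A standalone sketch of that surprise (my example):

    #include <cstdio>
    #include <vector>

    int main() {
      std::vector<int> v; // v.empty() is true
      unsigned index = 0;
      // v.size() - 1 is size_t(0) - 1, which wraps to SIZE_MAX, so the
      // comparison is true for every possible index.
      if (index < v.size() - 1)
        std::puts("taken, even though v is empty");
      return 0;
    }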

I’m personally against changing everything to signed integers. To me, this is an example of making code strictly less readable and more confusing in order to fight deficiencies in the language standard. I get the problem that it’s solving, but I view this as mostly a theoretical problem, whereas being able to read the code and have it make sense is a practical problem that we must face on a daily basis. If you change everything to signed integers, you may catch a real problem with it a couple of times a year. And by “real problem” here, I’m talking about a miscompile or an actual bug that surfaces in production somewhere, rather than a “yes, it seems theoretically possible for this to overflow”.

On the other hand, a large number of people need to work in this codebase every day, and multiplied over the same time period, my belief is that having the code make sense and be simple has a higher net value.

It simply doesn’t make sense (conceptually) to use a signed type for domains that are inherently unsigned, like the size of an object. IMO, we should revisit this if and when the deficiencies in the C++ Standard are addressed.

James Henderson via llvm-dev <llvm-dev@lists.llvm.org> writes:

Maybe it's just because I work in code around the binary file formats
almost exclusively, but unsigned (or more often uint64_t) is FAR more
common than int everywhere I go. I don't have time right now to read
up on the different links you provided, and I expect this is covered
in them, but it also seems odd to me to use int in a loop when
indexing in a container (something that can't always be avoided),
given the types of size() etc.

The reason to use int as an index in loops is for its algebraic
properties, which allow some optimizations (vectorization, for
example) where unsigned would cause problems.

Basically, the optimizer can assume overflow is undefined behavior and
optimize accordingly.
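For illustration (my sketch of the canonical example, not from the original email): with a 32-bit index on a 64-bit target, the multiply in the addressing expression behaves differently depending on signedness:

    // If Index were unsigned, Index * 4 may legally wrap at 32 bits, and the
    // compiler must preserve that wrap when forming each 64-bit address.
    // With int, overflow is UB, so the compiler may widen Index to 64 bits
    // once and use straight-line pointer arithmetic, which in turn makes the
    // loop much easier to vectorize.
    void zeroEveryFourth(float *Data, int Count) {
      for (int Index = 0; Index < Count; ++Index)
        Data[Index * 4] = 0.0f;
    }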

+10000 for preferring signed unless there's a good reason for unsigned.

                    -David

I would argue that unless you’ve benchmarked a piece of code and found this to be a problem, writing code with the express purpose of having it be more optimizable, at the expense of readability and ease of understanding, is a classic example of premature optimization. And if you have, then you can use signed integers in that specific piece of code and leave a comment about why it should not be changed.

I agree that readability, maintainability, and ability to debug/find issues are key.

I haven’t found myself in a situation where unsigned was helping my readability; on the contrary, I am always wondering where the expected wrap-around behavior is, and that is one more thing I have to keep in mind when reading code that manipulates unsigned values. So YMMV, but using unsigned increases my mental load when reading code.

I'm personally against changing everything to signed integers. To me, this is an example of making code strictly less readable and more confusing in order to fight deficiencies in the language standard. I get the problem that it's solving, but I view this as mostly a theoretical problem, whereas being able to read the code and have it make sense is a practical problem that we must face on a daily basis. If you change everything to signed integers, you may catch a real problem with it a couple of times a year. And by "real problem" here, I'm talking about a miscompile or an actual bug that surfaces in production somewhere, rather than a "yes, it seems theoretically possible for this to overflow".

Doesn't that already make it worth it?

On the other hand, a large number of people need to work in this codebase every day, and multiplied over the same time period, my belief is that having the code make sense and be simple has a higher net value.

It simply doesn't make sense (conceptually) to use a signed type for domains that are inherently unsigned, like the size of an object. IMO, we should revisit this if and when the deficiencies in the C++ Standard are addressed.

The underlying problem is that the C family of languages mixes two
orthogonal properties: value range and overflow behavior. There is no
unsigned type with undefined wraparound. So the question becomes: which
property is more important to reflect? Do we want to catch unintended
wraparound using a sanitizer, and let the optimizer build on the
no-wrap assumption? Do we need the additional range provided by an
unsigned type? As Chandler says in one of his talks linked earlier: "If
you need more bits, use more bits" (such as int64_t).
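A small sketch of the "use more bits" option (my example, not from the talk): a wider signed type keeps both the range and the no-wrap semantics:

    #include <cstdint>

    // Summing 32-bit sizes can exceed UINT32_MAX; int64_t provides the range
    // without relying on wrapping behavior.
    int64_t totalSize(const uint32_t *Sizes, int Count) {
      int64_t Total = 0;
      for (int Index = 0; Index < Count; ++Index)
        Total += Sizes[Index];
      return Total;
    }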

Michael

I agree that readability, maintainability, and ability to debug/find issues are key.

I haven’t found myself in a situation where unsigned was helping my readability; on the contrary, I am always wondering where the expected wrap-around behavior is, and that is one more thing I have to keep in mind when reading code that manipulates unsigned values. So YMMV, but using unsigned increases my mental load when reading code.

I’m on the other end. I’m always reading the code wondering “is this going to warn?”, “why could a container ever have a negative number of elements?”, and “the maximum value representable by the return type (unsigned) is larger than that of the type I’m storing it into (signed), so an overflow could happen even if there were no error; what then?”
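For concreteness, the conversion being worried about here looks like this (the snippet is mine):

    #include <vector>

    void example(const std::vector<char> &Huge) {
      // size() can exceed INT_MAX; the conversion then does not preserve the
      // value (implementation-defined before C++20, modulo-wrapping since).
      // -Wconversion flags it.
      int Size = Huge.size();
      (void)Size;
    }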