[RFC] Coding Standards: "prefer `int` for regular arithmetic, use `unsigned` only for bitmask and when you intend to rely on wrapping behavior."

andrewrk · June 11, 2019, 8:10pm

This is why the Zig frontend has the arithmetic operators +,/,*,- assert
that overflow does not occur for *both* signed and unsigned integers.
The wrapping operators +%,/%,*%,-% are available when one needs a
guaranteed twos complement wraparound.

This gives the best of both worlds, where you can use proper types
matching the range of values, yet have the desired performance
optimizations (and sanitation checks) that only signed integers have in
C/C++.

In fact, in Zig, unsigned integers are actually more optimizable than
signed integers! Compiler Explorer

Andrew

Jake_Ehrlich · June 11, 2019, 8:22pm

This whole debate seems kind of odd to me. I don’t know that cases where it isn’t clear what type to use come up that often. If a value can truly never be negative you should use an unsigned value. If a value can be negative, you should use a signed value. Anecdotal evidence in my case is that the vast majority of values are unsigned by this rule.

Is there a reason to use a signed value when you know a value will never be negative? Trapping on overflow doesn’t seem motivated to me to me since I’m not aware of anything that does that. UBSan also checks for overflow in unsigned types by default as well so you can still check for that issue.

I’m not going to go watch the YouTube videos but the ES.102 lacks merit. On systems I work with the bug they mention wouldn’t be caught the way they say. They also use subtraction (a rare operation IMO) as a motivating example and arbitrarily declare large values to be less obvious bugs than negative values without evidence to this.

ES.101 is valid but is not a reason to prefer signed to unsigned values in any context. I’ve also run into a number of instances of signed shifts being used and the interplay between negation and bitwise operators being used. Not that those are common but it’s just to say that exceptions exist even to that rule.

Bigcheese · June 11, 2019, 8:44pm

sequence containers (like vector) cannot hold more items than a ptrdiff_t (a signed type) can represent due to requirements based on std::distance.

I don’t see a use for unsigned types outside of bit manipulation.

Michael Spencer

qcolombet · June 11, 2019, 9:33pm

I agree that readability, maintainability, and ability to debug/find issues are key.

I haven’t found myself in a situation where unsigned was helping my readability: on the opposite actually I am always wondering where is the expecting wrap-around behavior and that is one more thing I have to keep in mind when I read code that manipulate unsigned. So YMMV but using unsigned increases my mental load when reading code.

I’m on the other end. I’m always reading the code wondering “is this going to warn?” “Why could a container ever have a negative number of elements?” "The maximum value representable by the return type (unsigned) is larger than that of the value i’m storing it in (signed), so an overflow could happen even if there were no error. What then?”

+1

Aaron_Ballman · June 12, 2019, 6:55am

I agree that readability, maintainability, and ability to debug/find issues are key.

I haven’t found myself in a situation where unsigned was helping my readability: on the opposite actually I am always wondering where is the expecting wrap-around behavior and that is one more thing I have to keep in mind when I read code that manipulate unsigned. So YMMV but using unsigned increases my mental load when reading code.

I’m on the other end. I’m always reading the code wondering “is this going to warn?” “Why could a container ever have a negative number of elements?” “The maximum value representable by the return type (unsigned) is larger than that of the value i’m storing it in (signed), so an overflow could happen even if there were no error. What then?”

Strong +1 to this.

~Aaron

Renato_Golin2 · June 12, 2019, 8:01am

+1 to both points here.

reinterpretcast · June 12, 2019, 2:17pm

+1 to both points here.

I agree that readability, maintainability, and ability to debug/find issues are key.

I haven’t found myself in a situation where unsigned was helping my readability: on the opposite actually I am always wondering where is the expecting wrap-around behavior and that is one more thing I have to keep in mind when I read code that manipulate unsigned. So YMMV but using unsigned increases my mental load when reading code.

I’m on the other end. I’m always reading the code wondering “is this going to warn?”

Indeed, using comparison against 0u from the “obvious context” has saved me from warnings.

“Why could a container ever have a negative number of elements?” “The maximum value representable by the return type (unsigned) is larger than that of the value i’m storing it in (signed), so an overflow could happen even if there were no error. What then?”

Strong +1 to this.

clang::ASTContext::toCharUnitsFromBits(int64_t) makes me cringe. It tends to be called with uint64_t values like the Width field in the instance of clang::TypeInfo.

kparzysz-quic · June 12, 2019, 3:04pm

The underlying problem is that the C family of languages mixes two orthogonal properties: value range and overflow behavior. There is no unsigned type with undefined wraparound. So the question becomes: What property is more important to reflect? Do we want catch unintended wraparound behavior using a sanitizer/make optimizations based on it?

That's a valid argument, but I suspect that the vast majority of loops using an unsigned induction variable, start at an explicit 0 and go up by 1. Such loops cannot overflow, so there is nothing to catch there.

mehdi_amini · June 12, 2019, 3:50pm

This isn’t entirely true: some comparison can exist in the code if (idx - 1 < N) { ....} which does not give the same result when idx is unsigned and zero.

Just like Michael mentioned: there are two aspects of unsigned, and the wrap-around behavior is the one causing bugs. After being bitten by unsigned wrap-around twice a year, and because debugging these isn’t fun (and correctness is important and hard), my code reading (and others, as I sourced initially) has adjusted to be suspicious and careful around unsigned: if I see unsigned I need to check more invariants when I read the code.
Obviously this is a tradeoff: losing the “this can’t be negative” property enforced in the type system is sad, but I haven’t encountered any bugs or issue caused by using int in place of unsigned (while the opposite is true).
(any bug caused “negative indexing of a container” would be as much of a bug with unsigned, please watch Chandler’s talk).

David_A_Greene · June 12, 2019, 4:44pm

Mehdi AMINI <joker.eph@gmail.com> writes:

I agree that readability, maintainability, and ability to debug/find
issues are key. I haven't found myself in a situation where unsigned
was helping my readability: on the opposite actually I am always
wondering where is the expecting wrap-around behavior and that is one
more thing I have to keep in mind when I read code that manipulate
unsigned. So YMMV but using unsigned *increases* my mental load when
reading code.

My experience is the same. "unsigned" to me gives no useful information
and actually requires more mental thought to look for pitfalls.

-David

David_A_Greene · June 12, 2019, 4:54pm

Jake Ehrlich via llvm-dev <llvm-dev@lists.llvm.org> writes:

This whole debate seems kind of odd to me. I don't know that cases
where it isn't clear what type to use come up that often. If a value
can truly never be negative you should use an unsigned value. If a
value can be negative, you should use a signed value. Anecdotal
evidence in my case is that the vast majority of values are unsigned
by this rule.

Is there a reason to use a signed value when you know a value will
never be negative?

Since this thread is really long, I want to make sure to address this
specific point even though it's been covered elsewhere.

One reason to prefer signed is optimization. The compiler simply cannot
optimize code with unsigned as well as it can with signed, because of
unsigned's breaking of standard integer algebra. This affects
everything from simple expression simplification to vectorization and
parallelization. Using unsigned can have serious performance
consequences. Because of the nature of the work I do, I see it all the
time.

Some have said this is premature optimization but to me there is no
additional mental load with signed. In fact it's less for me than
unsigned because of the mental gymnastics I have to go through to verify
code that uses unsigned.

-David

David_A_Greene · June 12, 2019, 4:56pm

Krzysztof Parzyszek via llvm-dev <llvm-dev@lists.llvm.org> writes:

The underlying problem is that the C family of languages mixes two
orthogonal properties: value range and overflow behavior. There is
no unsigned type with undefined wraparound. So the question becomes:
What property is more important to reflect? Do we want catch
unintended wraparound behavior using a sanitizer/make optimizations
based on it?

That's a valid argument, but I suspect that the vast majority of loops
using an unsigned induction variable, start at an explicit 0 and go up
by 1. Such loops cannot overflow, so there is nothing to catch there.

I have in fact seen loops that rely on overflow of induction variables,
as crazy as that sounds.

But again, for loops it's not (directly) about overflow as much as it is
about optimization. You're leaving a lot of potential performance on
the floor with unsigned, precisely because compilers have to guard
against overflowing induction variables.

-David

qcolombet · June 12, 2019, 5:11pm

Jake Ehrlich via llvm-dev <llvm-dev@lists.llvm.org> writes:

This whole debate seems kind of odd to me. I don't know that cases
where it isn't clear what type to use come up that often. If a value
can truly never be negative you should use an unsigned value. If a
value can be negative, you should use a signed value. Anecdotal
evidence in my case is that the vast majority of values are unsigned
by this rule.

Is there a reason to use a signed value when you know a value will
never be negative?

Since this thread is really long, I want to make sure to address this
specific point even though it's been covered elsewhere.

One reason to prefer signed is optimization.

FWIW. If you care about optimization, signed size_t is probably the way to go in general.

Int type will incur a sign extension for any address accesses on 64-bit platform (32-bit to 64-bit extension).
Unsigned on the other hand creates zero extension which are most of the time free.

Thus, unsigned is sometimes better for codegen than signed, in particular in a compiler code base where vectorization is not really a thing.

Anyway, it seems to me that there are enough people on both sides of the fence that this shouldn’t be in the coding standard.

My 2c.

Quentin

Stephen_Checkoway1 · June 12, 2019, 5:29pm

Hi Mehdi,

Drive by comment, please feel free to disregard.

(any bug caused "negative indexing of a container" would be as much of a bug with unsigned,

Signed comparison with a maximum size or number of things is a classic software vulnerability. E.g.,

#define MAX_WIDGETS 200
struct widget widgets[MAX_WIDGETS];
...
if (count >= MAX_WIDGETS)
return -1;
/* now do something with widgets[count] */

With unsigned integers, there would be no bug. There are simple bugs using either signed or unsigned integers so I don't find that dispositive.

please watch Chandler's talk).

It was a great talk! I'm glad you pointed it out, but I'm not convinced any of the examples he discusses impact loop counters, which seems to be one of the main motivations for this. Let's go through them.

1. Shifting by more than the bit width <https://www.youtube.com/watch?v=yG1OZ69H_-o&t=24m40s> doesn't involve signed values.

2. Signed left shift vs. multiplication <https://www.youtube.com/watch?v=yG1OZ69H_-o&t=27m48s> isn't relevant for loops.

3. The example from Dietz et al. <https://www.youtube.com/watch?v=yG1OZ69H_-o&t=33m15s> boils down to computing 16u + (n-1) * 8u. As far as I can tell, the signedness of the quantities involved does not impact the computation. See <https://godbolt.org/z/HVA6xD>\.

4. Using a 32-bit, unsigned integer to index an array <https://www.youtube.com/watch?v=yG1OZ69H_-o&t=39m22s>\. Here, using a signed 32-bit integer lets the compiler promote it to a signed 64-bit integer and the compiler can produce slightly better code. Using an appropriately sized integer for the index, namely size_t, has the same effect except with unusual ABIs like x32*. ssize_t would also be fine (and better than size_t for x32).

5. C++ doesn't have a supported arithmetic right shift <https://www.youtube.com/watch?v=yG1OZ69H_-o&t=51m49s>\. Great point, but not relevant here.

6. UB twitter's favorite example: memcpy(0, 0, 0) <https://www.youtube.com/watch?v=yG1OZ69H_-o&t=55m0s>\. Again, not relevant here.

Up thread, Chandler suggests

for (int Index = 0, Size = MyVector.size(); Index < Size; ++Index)

or

for (int Index : llvm::seq<int>(0, MyVector.size()))

I believe int can produce better code than unsigned int here, but neither seem correct. Surely both should be size_t (again, ssize_t would be fine). I'm not sure if LLVM containers ever contain more than INT_MAX elements. If not, then int is fine but since it'll produce the same code as using a correctly sized integer, I don't see the point. (As an aside, the talk by Jon Kalb you linked to <https://www.youtube.com/watch?v=wvtFGa6XJDU&t=4m45s> says that if you need one extra bit, you'll probably need two extra bits. "A sneeze puts you over." But that's another 2 billion more entries in your data structure so I'm not sure I buy that argument.)

One of the guides you linked to <Google C++ Style Guide, says "When appropriate, you are welcome to use standard types like size_t and ptrdiff_t." Looping over structures from 0 to some size expressed as a size_t seems like an appropriate time to use size_t to me.

I haven't yet heard or read a compelling argument for using an integer type smaller than the register width for arithmetic, which int usually is these days. I can't think of a good reason to use unsigned ever though.

* Example provided by Rich Felker <https://twitter.com/RichFelker/status/1138166935575826433>\.

Cheers,

Steve

Michael_Kruse2 · June 12, 2019, 10:07pm

According to your definition above, this is not a "real problem".
64-bit CPUs do not support using the full 64-bit virtual address
space. Even if one does in the future, PTRDIFF_MAX (and as a
consequence std::distance that a std::vector must support) effectively
limits the max allocation size that can be worked with without
undefined behavior.

To summarize, with signed sizes we could fix a few production-bugs a
year, while unsigned sizes only have a theoretical advantage in
production.

Michael

chandlerc · June 13, 2019, 1:20am

FWIW, the talks linked by Mehdi really do talk about these things and why I don’t think the really are the correct trade-off.

Even if you imagine an unsigned type that doesn’t allow wrapping, I think this is a really bad type. The problem is that you have made the most common value of the type (zero in every study I’m aware of) be a boundary condition. Today, it wraps to a huge value if you cross it. Afterward, it would trap. Both are super surprising.

Another way of looking at the same lens: do you subtract these values? Should a + (b - c) be the same as (a + b) - c? You either need a signed type or wrapping to have reasonable answers here. And if you solve this with wrapping, then it makes any attempt to write assertions or other checks in the same type system very difficult. The fact that you write an assert to check for “did I accidentally go past zero?” by conjuring some “it’s probably too large” value and then comparing if it is greater than that is … extraordinarily confusing.

Meanwhile, with signed types, it is quite easy to write asserts that check for non-negative values in the correct places. They are easy to read and produce easily understood errors. The boundary conditions are uncommon.

Even on the C++ standards committee, there is remarkably strong consensus that in the absence of unsigned types coming back from .size() methods and such, we should be using signed types for the reasons above.

The fact that we have unsigned size_t in a bunch of places is, IMO, a concern and it is important to have good ways of avoiding warnings. But I think we have so very many ways that don’t require us to just use unsigned types everywhere and deal with the above issues:

Change the return types of our containers size() methods.
Add a ssize() method. (This is the direction the committee is moving AFAICT, but they are constrained by a powerful desire to break zero code, where as LLVM’s containers have much more API freedom.)
Use idioms like the one I suggested with llvm::seq.

Any or all of these seem significantly preferable to the readability concerns I outline above, at least to me. This is why I am still strongly in favor of signed types and assertions around value at known points where the value should obey that assertion.

-Chandler

stefan.teleman · June 13, 2019, 4:58pm

FWIW, the talks linked by Mehdi really do talk about these things and why I don't think the really are the correct trade-off.

Even if you imagine an unsigned type that doesn't allow wrapping, I think this is a really bad type. The problem is that you have made the most common value of the type (zero in every study I'm aware of) be a boundary condition. Today, it wraps to a huge value if you cross it. Afterward, it would trap. Both are super surprising.

[ ... ]

Any or all of these seem significantly preferable to the readability concerns I outline above, at least to me. This is why I am still *strongly* in favor of signed types and assertions around value at known points where the value should obey that assertion.

Have there been any documented cases in LLVM where a for() loop with
an unsigned int induction variable has wrapped around to 0? In other
words, is there any container - either LLVM or C++ Standard Library -
that ended up storing more than UINT_MAX or ULLONG_MAX elements?

I'm looking at these values in <limits.h>:

#define UINT_MAX 4294967295U
#define ULLONG_MAX 18446744073709551615ULL

and I am having a really hard time imagining a llvm::SmallVector<Foo>
storing 18446744073709551615ULL + 1ULL Foo elements. But i'm happy to
be proven wrong.

As far as the C++ Standard Library is concerned, all the containers
implement std::<container-type>::max_size(), which is of type
std::size_t and is always - and intentonally - smaller than either
UINT_MAX or ULLONG_MAX.

So I'm not even sure how an unsigned induction variable testing for
std::vector<Foo>::size() or a std::string::size() - for example -
would ever end up wrapping around to 0. The container will blow up
when its number of elements attempts to exceed its max_size().

Plus, it's not that hard to write

std::vector<Foo> FooVector;
for (unsigned I = 0; I < ${SOMETHING} && I < FooVector.max_size(); ++I) {
}

if unsigned's wrap-around is a material concern.

Maybe the compiler should just warn when sizeof(unsigned) <
sizeof(std::container<Foo>::max_size()). I think that would be enough
of a hint, and in the vast majority of cases it will be moot anyway.

Just my 0.02.

vtjnash · June 13, 2019, 5:17pm

Should a + (b - c) be the same as (a + b) - c? You either need a signed type or wrapping to have reasonable answers here

Depending on what “reasonable” means here, only wrapping (unsigned in C) gets you this commutative property. For a signed value with C, it’s possible for one of these to be undefined behavior, while the other returns a reasonable value. For instance, a == b == c == std:: numeric_limits<typeof(a)>::min() is probably unusual as a value, but could be used as a sentinel (perhaps to represent an infinite or empty set). Of course, the unsigned result might just be nonsense.

Anyways, I don’t have a strong opinion either way, since I think they both can have surprises.

One other occasional benefit to using unsigned that can be surprising is that power-of-two division is slightly cheaper (since it doesn’t need to handle negative numbers):

(ssize_t)x / 2

shrq $63, %rax
leaq (%rax,%rdi), %rax
sarq %rax

(size_t)x / 2

shrq %rdi

is there any container

I’d posit that UINT_MAX is uncommon, but pretty easy to exceed (although it needs a 64-bit machine to represent it). For example, anything that might need to handle the return value of MemoryBuffer::getFile could come across a file that’s larger than 2GB.

Adrien_Guinet · June 18, 2019, 11:48am

My 2 cents on this.

I watched Chandler's talk, and it looks like there are two statements that are a bit
contradictory one to another:

CppCon 2016: Chandler Carruth “Garbage In, Garbage Out: Arguing about Undefined Behavior..." - YouTube "it's fairly easy to get enough bits
w/o using the sign bits"
versus
CppCon 2016: Chandler Carruth “Garbage In, Garbage Out: Arguing about Undefined Behavior..." - YouTube "using size_t takes more space in data
structures"

Why not using size_t for indexes/length, while storing uint32_t integers and cast them
back to size_t when necessary (i.e. for space reasons)?

About checking for overflows, let's take the various examples here:
Compiler Explorer, which are various ways to load a byte at index "n+10" where
n is user-controlled. In this case, we need to check that "n+10" does not overflow.
The first thing is that, IMHO, overflow checks are tricky to get right for both the signed
and unsigned case (and has led to numerous vulnerabilities). But that's a personal
opinion. If we talk about code size (and performance), the load3 function confirms what
Chandler showed in his talk (using uint32_t can be a performance issue on some 64-bit
systems, do to the enforced wrapping that has to be done on 32bits integers). The other
load* functions all emit the intended "mov eax, DWORD PTR [rdi+40+rsi*4]". Notable
differences arise in load2, which needs to sign extend esi to rsi. In load1, there is also
an extra instruction to set the value of (2**63-12).

One conclusion is that, at least on x86-64, using size_t for these kind of overflow checks
is the most efficient. But I might miss something / do something wrong in this analysis,
and I'd like another POV on this!

One argument for the usage of signed integers (actually integer types that don't allow
wrapping) is indeed the ability to automatically insert these overflow checks at compile time.

Topic		Replies	Views
[RFC] Coding Standards: "prefer `int` for, regular arithmetic, use `unsigned` only for bitmask and when you, intend to rely on wrapping behavior." LLVM Dev List Archives	7	173	June 14, 2019
libc++: max_size() of a std::vector Clang Frontend	12	331	February 18, 2015
Reference Manual Clarifications LLVM Dev List Archives	28	287	April 22, 2008
The nsw story LLVM Dev List Archives	34	403	December 12, 2011
Dependence Analysis [was: Flow-Sensitive AA] LLVM Dev List Archives	29	291	August 28, 2008

[RFC] Coding Standards: "prefer `int` for regular arithmetic, use `unsigned` only for bitmask and when you intend to rely on wrapping behavior."

Related topics