libc++: max_size() of a std::vector

Hi,

I am puzzled by the result of std::vector<char>::max_size() on the n = 32 and n = 64 bit systems I have tested: the result is 2^n − 1. Let me explain why I am puzzled.

Every implementation of std::vector<T> that I know of, libc++ included, has three members of type T*: begin_, end_, capacity_. begin_ points to the first value of the vector and end_ points to the one after the last. The size() method computes end_ - begin_. But the result of this difference is of type std::ptrdiff_t, which is a signed integer of n bits on every implementation that I know of. Therefore, this type cannot store the integer 2^n − 1, but only values up to 2^(n − 1) − 1.

I would expect this last number for max_size(). Is it a bug or something that I overlooked?

Hi Francois,
The return type of size() is size_t, which is unsigned, so it can
represent values up to 2^n-1.

std::vector<T,Allocator>::max_size - cppreference.com says:

"This value is typically equal to
std::numeric_limits<size_type>::max(), and reflects the theoretical
limit on the size of the container. "

Csaba

Hi Csaba,

I know that the type of the object returned by the .size() method is std::size_t, which is an unsigned integer of n bits (where n = 32 on x86 and n = 64 on x86-64). My claim is that libstdc++ (and every standard library that I know of) runs into undefined behaviour for vectors of size >= 2^(n-1) when the size() method is called.

To make things concrete, let’s suppose that:

  • You are on a 32-bit system (n = 32)
  • You construct a std::vector of size 2^31, which represents 2 GB of memory. We assume that this memory is available, so the constructor does not throw any exception
  • You call the .size() method on that object

Here is the code from the libstdc++ that comes with gcc 4.9.2:

size_type
size() const _GLIBCXX_NOEXCEPT
{ return size_type(this->_M_impl._M_finish - this->_M_impl._M_start); }

As you can see, it takes the difference of the pointers this->_M_impl._M_finish and this->_M_impl._M_start. The second one points to the first element of the array and the first one points to the one after the last element of the array. Now, if you look at the C++11 standard, 5.7 (Additive operators), point 6, about the difference of two pointers:

"When two pointers to elements of the same array object are subtracted, the result is the difference of the subscripts of the two array elements. The type of the result is an implementation-defined signed integral type; this type shall be the same type that is defined as std::ptrdiff_t in the header (18.2). As with any other arithmetic overflow, if the result does not fit in the space provided, the behavior is undefined…"

As you can see, the difference of two pointers is of type std::ptrdiff_t, which is a signed integer of n bits. Therefore, the maximum value it can store is 2^(n-1) - 1. In our example, 2^31 is too big for std::ptrdiff_t, and the standard says that this is undefined behaviour. The fact that this number is cast to std::size_t after that operation is irrelevant.
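To be precise about where the undefined behaviour sits: the signed-to-unsigned conversion itself is well-defined (a negative value is reduced modulo 2^n); it is the pointer subtraction that overflows. A minimal sketch of the well-defined half:

```cpp
#include <cstddef>
#include <cstdint>

// The conversion is defined by the standard: a negative ptrdiff_t is
// reduced modulo 2^n, so ptrdiff_t(-1) maps to SIZE_MAX.
inline std::size_t to_unsigned(std::ptrdiff_t d) {
    return static_cast<std::size_t>(d);
}
// The undefined behaviour is upstream of any cast: it is the subtraction
// end_ - begin_ itself that overflows when the element count exceeds
// PTRDIFF_MAX.
```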

François

Hi François,

I think you are absolutely correct. And this should definitely be addressed. I think the best would be for you to file a bug report on both libc++ and libstdc++.

Here are a few additional things I would point out:

Note, however, that this only affects vectors with elements of size 1 or less (like vector of bool), because any other type, of size 2 or above, can never have a pointer difference that would overflow ptrdiff_t. So, we are really only talking about char, unsigned char, bool, and empty classes.
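That observation can be made concrete with a small helper (safe_max_count is my own name, not part of any library):

```cpp
#include <cstddef>
#include <cstdint>

// Largest element count whose begin/end pointer difference still fits in
// ptrdiff_t, as a function of the element size.
constexpr std::size_t safe_max_count(std::size_t element_size) {
    return static_cast<std::size_t>(PTRDIFF_MAX) / element_size;
}
// For element_size == 1 this is PTRDIFF_MAX, only about half of SIZE_MAX,
// so an allocation can outgrow it. For element_size >= 2 the element count
// can never exceed PTRDIFF_MAX in the first place.
```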

I see only one sensible option to address this, which is to restrict max_size() to output the max positive value of ptrdiff_t. Something like this:

size_type max_size() const noexcept {
    return ( PTRDIFF_MAX < /* alloc max-size */ ? PTRDIFF_MAX : /* alloc max-size */ );
}

It could also be conditional upon sizeof(T) such that this extra step is only done for elements of size 1.

The alternative of accounting for overflow on the calculation of size() would not only have undue overhead on size(), but it would also imply the same precaution be taken everywhere else (in iterators, in most mutating member functions, etc…). And I’m also not sure that this would be standard conforming since the difference_type would no longer be able to represent all iterator differences.

Cheers,
Mikael.

However, the end pointer can never be less than the begin pointer, so you can safely cast the ptrdiff_t result to size_t. Restricting it to PTRDIFF_MAX seems unnecessarily restrictive.

Hi Dick,

I disagree. It should be restricted to PTRDIFF_MAX if you want to be compliant with any C++11 compiler.

If the difference of the 2 pointers does not fit into std::ptrdiff_t, you run into undefined behaviour. Nevertheless, I am quite sure most compilers will give the “expected result” for static_cast<std::size_t>(end_ - begin_). But it should be considered luck, not something certified by the standard.

François

Hi,

I think this problem is even worse than you suspect François. Even if the static-cast to size_t has the “expected result” (that the unsigned value turns out correct even if it went through an overflowing operation on the ptrdiff_t), which I agree will probably happen on nearly all platforms… this is not the worst problem.

The worst problem is that the max-size function is used internally by the vector to determine whether an increase in capacity is possible. This means that the vector would accept a resize operation that makes the size exceed PTRDIFF_MAX. At that point, if you use the iterators in any algorithm or code that takes a difference between them (e.g., std::sort), you will get a negative difference_type value (overflowed and still interpreted as a signed value) between those iterators, which will lead to complete disaster on all but the most pedantic / defensive code out there (how many algorithms do you think check whether the (last - first) difference is negative or overflows? probably very few).

The real issue is that if max-size is supposed to regulate how large a vector can be in practice (and that’s exactly what the standard requires this value to represent), then resizing a vector to that size should produce a “perfectly good” vector in the sense that all operations with it (incl. algorithms on random-access iterators obtained from that vector) should be well-behaved. And if max-size is allowed to exceed PTRDIFF_MAX, this is simply not the case, because nearly everything you could do with a vector larger than PTRDIFF_MAX is undefined behaviour.

Hi Mikael,

Thanks for the information. I did not realise that as I am quite new to C++. I am coming from a Fortran background which is why indices mean so much to me :slight_smile:

For the bug report, it turns out that I’ve done more tests. For std::vector<char>
- libc++ version 3.5 has been fixed and max_size() returns PTRDIFF_MAX
- libstdc++ version 4.9.2 is buggy and returns SIZE_MAX
- Visual Studio 2013 is buggy and returns SIZE_MAX
So libc++ is safe and I’ll file bug reports against the other standard libraries.

From the very beginning, my feeling has been that this choice of using std::size_t for indices is a *very* bad decision, for various reasons: comparing signed and unsigned integers is like shooting yourself in the foot, compilers can’t optimise anything with unsigned integers because they must assume modulo 2^n arithmetic, etc. The only explanation I had for this choice was that in the old days of 16-bit computers, one extra bit made a big difference between std::size_t and std::ptrdiff_t. Now, even this historical reason is falling apart.

Best regards,
François

I take a somewhat opposite position -- the problem isn't that size is
unsigned. The size being unsigned properly matches the set of possible
sizes (a size cannot be negative). The problem, IMO, is that our only way
of calculating the difference between pointer values assumes that we may be
doing something like A - B in a situation where A < B, and so that
operation yields an instance of a signed type. In generic algorithms, I do
not know of situations where you are doing subtraction of pointer values
where you don't already know that one iterator is greater than or equal to
another. Calculating the size of a vector is a good example of this, as
when doing end() - begin(), you know that end() >= begin(). Rather, if we
simply had a way in the language (or more likely, via a standard library
function with an implementation that's not fully specified in the standard)
to calculate the difference between pointers in a way that yielded a size_t
rather than a ptrdiff_t, with the precondition of the relative order of the
operands, then we would arrive at a solution that doesn't require having a
size() function that returns a type that can represent negative values.
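Such a function might be sketched as follows (the name unsigned_distance and the uintptr_t detour are my own illustration; as noted above, a fully portable implementation cannot be written in standard C++, which is exactly why it would have to be a library function whose implementation is not fully specified):

```cpp
#include <cstddef>
#include <cstdint>

// Precondition: first and last point into the same array and first <= last.
// Going through uintptr_t avoids the signed ptrdiff_t intermediate,
// relying on the flat address model that mainstream platforms provide.
template <class T>
std::size_t unsigned_distance(const T* first, const T* last) {
    return (reinterpret_cast<std::uintptr_t>(last) -
            reinterpret_cast<std::uintptr_t>(first)) / sizeof(T);
}
```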

Hi,

I’ll try to give some arguments on why I think using an unsigned integer for indices is the worst idea one could have:

  1. for(std::size_t k = v.size() - 1; k >= 0; --k) is a nice infinite loop
  2. for(std::size_t k = 0; k < v.size() - 1; ++k) might take some time if v.size() == 0
  3. It pollutes your types and when you compare signed and unsigned integers, you are very close to a bug
  4. Who needs modulo 2^n behaviour but people working with “bits”?
  5. Compilers can optimise a + 1 < b + 1 into a < b if a and b are signed. They can’t if they are unsigned.
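For reference, the loops in points 1 and 2 are usually repaired like this while keeping unsigned indices (my own sketch, wrapped in helper functions for illustration):

```cpp
#include <cstddef>
#include <vector>

// Reverse loop: test-then-decrement, so k never wraps below zero,
// even on an empty vector.
inline std::size_t visit_all_reversed(const std::vector<char>& v) {
    std::size_t visited = 0;
    for (std::size_t k = v.size(); k-- > 0; )
        ++visited;
    return visited;
}

// Forward loop to size()-1: move the -1 to the other side of the
// comparison, so k + 1 < size() cannot underflow when the vector is empty.
inline std::size_t visit_all_but_last(const std::vector<char>& v) {
    std::size_t visited = 0;
    for (std::size_t k = 0; k + 1 < v.size(); ++k)
        ++visited;
    return visited;
}
```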

Now, let’s get a little less technical. Watch this video http://www.youtube.com/watch?v=Puio5dly9N8 at times 42:38 and 1:02:50. You’ll find out that:

  1. Bjarne Stroustrup thinks that it is “one of the sad things” of the standard library
  2. Herb Sutter says “Sorry, we were young”
  3. Chandler Carruth says that using unsigned integers makes the compiler think that you really need modulo 2^n behaviour and that it prevents warnings and optimisations

Now, for the people defending the use of std::size_t which is unsigned, their arguments are often

  1. Indices are nonnegative, therefore they should be unsigned. To those I would say:
  • sqrt(x) does not mean anything for x < 0. Nobody invented an unsigned floating point number for that!
  • v[k] still does not mean anything for k >= v.size()
  • An unsigned integer is not just an integer that is nonnegative. It is a member of the mathematical ring Z/(2^n)Z, which is another way of saying that it has modulo 2^n behaviour. That is not what people want when they use unsigned int. And having taught arithmetic and Z/pZ for many years to students, I can tell you, nobody wants to mess with that ring.
  2. You get another bit and therefore you can address more memory.
  • Maybe, for the people who need to work with arrays of chars of more than 2 GB on 32-bit systems…
  • But as you can see with this bug, it does fall apart.
  3. It is faster to check if you are in bounds with unsigned integers.
  • No it is not. If k and n are std::size_t, the check is k < n. If k and n are std::ptrdiff_t, the check is static_cast<std::size_t>(k) < static_cast<std::size_t>(n), which gives the same assembly code.
  4. Indices are useless anyway since we have iterators.
  • I still use indices because:
  • they don’t get invalidated when we copy arrays
  • you need them when you want to use struct of arrays instead of array of structs
  • indices are much richer than iterators: multiplication and division have no meaning for iterators.
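The bounds-check argument above can be spelled out as a small helper (my own sketch):

```cpp
#include <cstddef>

// With a signed index, the two-sided check 0 <= k && k < n collapses to a
// single unsigned comparison after casting: a negative k becomes a huge
// unsigned value, which fails k < n. This is the same code a compiler
// emits for an unsigned index's k < n.
inline bool in_bounds(std::ptrdiff_t k, std::ptrdiff_t n) {
    return static_cast<std::size_t>(k) < static_cast<std::size_t>(n);
}
```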

I care about getting indices right because arrays are used so much in my field. It is too late for C++, but many people have been raised to think that it is a wise thing to use unsigned int for array indices. Unfortunately, Rust seems to be making the same error :frowning: So it would be useful to say these things loudly: *it was a huge mistake* and there is not a single benefit to it.

Cheers,
François

Hi,

I’ll try to give some arguments on why I think using an unsigned integer
for indices is the worst idea one could have:

1) for(std::size_t k = v.size() - 1; k >= 0; --k) is a nice infinite loop

All you've stated is that you can accidentally create an infinite loop if
you incorrectly write a loop. I'd rather have a data type that properly
fits my value's constraints and doesn't artificially limit the size of a
range.

2) for(std::size_t k = 0; k < v.size() - 1; ++k) might take some time if
v.size() == 0

Again, all you're doing is writing an invalid loop. It doesn't make sense
to use a signed type just to support a poorly written loop.

In reference to both of your above examples, I could produce similar
mistakes with respect to signed types, only when size() is very close to
the largest ptrdiff_t value. They'd be more rare, but so what -- in either
case, choosing your datatype based on such an erroneous loop is choosing
the wrong solution to the wrong problem. In both cases, the *user* made a
mistake in the way that they wrote their loop, and it would be horrible to
use a datatype that doesn't actually match your range of values just
because of this. Rather than try to "do the right thing" when the user does
the wrong thing, the effort should instead be focused on making it so that
the user isn't doing the wrong thing to begin with. In other words, instead
of using a datatype that doesn't match the range, why not just encourage
proper usage of high-order functions so that users aren't frequently
writing silly loops like this. That's not to say that you won't still
encounter more manual loops, but I'd say that changing the datatype is a
poor solution to making incorrectly written manual loops "just work."

3) It pollutes your types and when you compare signed and unsigned

integers, you are very close to a bug

Yes, this is important, but I argue that consistently using signed types is
not the ideal solution. Consistently using unsigned types in this case is.
I am very concerned about this, much more so than most people, and I cringe
whenever I even see casting between signed and unsigned types. In that
respect, I feel that we are very much in agreement. In my personal
experience, though, there is nothing at all difficult about consistently
using unsigned types and I genuinely feel that we would be better off in
the long run to embrace them more so than is currently the case, and not
less. The only time I encounter problems is when *other* APIs use signed
types for everything.

4) Who needs modulo 2^n behaviour but people working with “bits”?

I agree that modulo 2^n behavior being required is an unfortunate
limitation regarding optimization, though it is also notably just a
language-imposed requirement -- we could just as easily have additional
unsigned types with UB on overflow in the language and have no potential
difference regarding efficiency when compared to signed types. If we had
that, there would be no difference in terms of optimizations, no artificial
limitation on sizes, and we'd be using a data type whose set of values more
accurately represents the set of possible sizes of our containers. IMO,
using a signed type here would just be a hack.

No, let’s get a little less technical. Watch this video

http://www.youtube.com/watch?v=Puio5dly9N8 at time 42:38 and 1:02:50.
You’ll find out that:

Yes, I'm aware, and I feel that they are misguided.

Now, for the people defending the use of std::size_t which is unsigned,
their arguments are often

1) Indices are nonnegative, therefore they should be unsigned. To those I
would say :
    - sqrt(x) does not mean anything for x < 0. Nobody invented an
unsigned floating point number for that!

A couple of differences, here. For one, operations with operands generally
do not map from one datatype to another, apart from promotion (or
operations on things like quantities with units attached) because they are
usually a part of larger expressions where you need to be consistent. With
respect to something like "size()" returning an unsigned type, we have no
operand to be consistent with. It is effectively a terminal. Implying that
it simply *may* be used in signed integer expressions is not a compelling
argument, and in practice, I find it much more common to be used in
unsigned expressions. If one were to make size() yield a signed value, then
all of a sudden any time you use it in an expression with unsigned data
you'd be back to having similar mismatching concerns. I don't see what
advocates really believe they're gaining.

As well, a simple unsigned type that has UB on overflow would take no
effort to standardize, and it would get rid of the modular arithmetic and
optimization concerns -- this is not quite the case for a floating point
type. All you'd have to do for such an unsigned integral type is specify
that overflow is UB or implementation specified in the standard for such a
type. The only difficult part is actually convincing people that it's
worthwhile.

Finally, when writing code that deals with ranges over data structures,
which is what our examples have been alluding to here, the need for
negative values in full expressions and sets of statements that deal with
these range sizes is entirely unconvincing to me -- the examples you've
shown certainly do not imply any such need, they just show that some people
accidentally write incorrect loops. As said earlier, I've written a lot of
generic code and the only time I ever find myself having to deal with
signed/unsigned mismatches are when other APIs use signed datatypes when
they could/should have used an unsigned type.

    - v[k] still does not mean anything for k >= v.size()

I'm not sure I see your point here. You can still represent a very large
size when compared to your word size, which is not possible with a signed
type. With a signed type, on the other hand, half of all of your values are
always completely worthless. I realize that people frequently say "a large
size is uncommon," but to me that is an extremely weak argument and does
not warrant using a type that doesn't properly match the valid set of
values that you need to represent.

    - An unsigned integer is not just an integer that is nonnegative. It is a

member of the mathematical ring Z/(2^n)Z which is another way of saying
that it does have modulo 2^n behaviour. It is not what people want when
they use unsigned int. And having taught arithmetic and Z/pZ for many years
to student, I can tell you, nobody wants to mess with that ring.

Agreed, but IMO *this* is what needs to be fixed. Making size() return a
signed type for this reason would, imo, not solve the *actual* problem.

2) You get another bit and therefore you can address more memory.

    - Maybe, for the people who need to work with arrays of chars of more
than 2 GB on 32 bit systems...

This genuinely can be a problem and I think it's reasonable on its own as a
rationale for unsigned, though really it's the artificial change in the
range of values if you were to adopt a signed type that is much more
upsetting to me, personally, even though it's much more ideological.

    - But as you can see with this bug, it does fall apart.

Again, if users are incorrectly using the datatype, then *that* is what
needs to be addressed. By artificially changing the datatype to make a
common mistake "just work," you're creating other subtle problems,
minimizing the possible sizes that are able to be represented, and (in
practice) making function preconditions frequently more complicated because
certain constraints are no longer implied by the type's invariants.

3) It is faster to check if you are in bounds with unsigned integers

    - No it is not. If k and n are std::size_t, the check is k < n. If k
and n are std::ptrdiff_t, the check is static_cast<std::size_t>(k) <
static_cast<std::size_t>(n) which gives the same assembly code.

4) Indices are useless anyway since we have iterators.

    - I still use indices because:
        - they don’t get invalidated when we copy arrays
        - you need to use them when you want to use struct of arrays
instead of array of structs.
        - indices are much more rich than iterators: multiplication,
division have no meaning for iterators.

Not much to say here other than that I personally wouldn't argue these.

I care about getting indices right because arrays are used so much in my

field. It is too late for C++ but many people have been raised to think that
it is a wise thing to use unsigned int for array indices. Unfortunately
Rust seems to make the same error :frowning: So it would be useful to say these
things loudly *it was a huge mistake* and there is not a single benefit to
it.

I realize that you feel it's an "error" but I strongly urge you to
reconsider. I agree with you 100% regarding modular arithmetic, and that to
me is the real problem. It is the only legitimately compelling reason to
avoid unsigned types that I ever hear. However, I'd much rather simply see
additional unsigned types that do not require this behavior on overflow as
a solution as opposed to, imo, using signed types instead of unsigned types
because you don't need modular arithmetic.

Hi Matt,

You seem to favour the usage of std::size_t because it does not “limit the range of values”. My original post shows that the usage of std::size_t does not even allow std::vector to have a size() >= PTRDIFF_MAX in any of the 3 major standard library implementations.

  • Just run the following code compiled on a 32 bit system (or with -m32) with libc++
    auto v = std::vector<char>();
    std::cout << PTRDIFF_MAX << " " << SIZE_MAX << " " << v.max_size() << std::endl;
    and you’ll find out that v.max_size() is equal to PTRDIFF_MAX and not SIZE_MAX.
  • libstdc++ and the Windows standard library implementation return SIZE_MAX, and if you look at my original post you’ll understand that this is a bug (size() has undefined behaviour if the size is > PTRDIFF_MAX). This bug has been in their standard library implementations for 20 years!

François

Right, I already addressed that in a previous reply.