MemoryBuffer

I was writing something using MemoryBuffer, and while looking through
its code I came across line 59:
  assert(BufEnd[0] == 0 && "Buffer is not null terminated!");
I am curious whether MemoryBuffer supports only text files without
embedded nulls, or whether it supports binary files as well. I do not
see anything else that assumes non-binary input except for that one
line, so what is the purpose of that assert?

Also, is the BufferEnd parameter supposed to be a pointer to the last
byte of the input, or to one past the end as in the STL? It seems it
must point to the last byte, since that assert could access invalid
memory if it were one past the end; but pointing to the last byte
rather than one past it would be inconsistent with the STL, so which
is it? I know the comments in the header indicate that the end
pointer needs to be dereferenceable for getMemBuffer, but not for
getMemBufferCopy and the other factory functions, so why the
inconsistency? I do not see any real reason why getMemBuffer requires
a dereferenceable EndPtr except for that assert, and I do not see an
obvious reason for the assert itself. Basically, is there a rationale
written down somewhere for the design of MemoryBuffer?
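
To make the concern concrete, here is a standalone illustration of the
two conventions (not LLVM code, just my reading of what the assert
does):

  #include <cassert>

  int main() {
    const char Data[] = "hello";       // 5 content bytes + implicit '\0'
    const char *BufEnd = Data + 5;     // one past the last content byte
    // With the STL one-past-the-end convention, BufEnd[0] reads the byte
    // just past the content; here it is only valid because the string
    // literal happens to carry a trailing '\0'.
    assert(BufEnd[0] == 0 && "Buffer is not null terminated!");
  }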

Although the public member function getBufferEnd has no comments, from
the looks of it, and from how getBufferSize is implemented, I think it
returns one past the end. Out of curiosity, since const char*
satisfies the random access iterator concept, would there be any
objection to adding STL-style begin/end members that return the same
pointers? I have some template code I use (and do not control) that
can take any STL-style container (in that it only calls begin/end on
it), and I see no real reason why MemoryBuffer could not implement an
STL-style begin/end iterator interface by just returning const char*.
This would allow it to satisfy a few other concepts as well. A more
complete interface along those lines would not change functionality,
but would add capability. Would anyone be opposed to a patch?
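
The patch I have in mind would be roughly this (a standalone sketch
using a stand-in class, not the actual LLVM header):

  // Sketch of the proposed addition; only begin/end/const_iterator
  // are new, the rest mimics the existing accessors.
  class MemoryBuffer {
    const char *BufferStart, *BufferEnd;
  public:
    MemoryBuffer(const char *S, const char *E)
        : BufferStart(S), BufferEnd(E) {}
    const char *getBufferStart() const { return BufferStart; }
    const char *getBufferEnd() const   { return BufferEnd; }

    // Proposed STL-style interface: const char* already satisfies the
    // random access iterator requirements, so begin/end can simply
    // return the existing pointers.
    typedef const char *const_iterator;
    const_iterator begin() const { return getBufferStart(); }
    const_iterator end()   const { return getBufferEnd(); }
  };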

Your question is difficult to understand. MemoryBuffer doesn't care what is in the file. It works fine if you have embedded NULs in the file. It just requires the mapped image to have a NUL at the end.

-Chris

I am mostly just curious as to why it requires a NUL at the end; that
seems odd when you are already specifying two pointers for a begin/end
style interface. My worry is practical: a library I use returns
iterator pairs that would otherwise work with MemoryBuffer (they are
const char*'s as well), but the end pointer is one past the end and
may point into invalid memory (say, if the data ends exactly at the
end of a virtual page), so dereferencing it could cause an access
violation. To prevent that I would need to copy the data, and with
the sizable chunk of memory it may be, that copy can freeze the
program for a second. Dropping the requirement would also let you
create a MemoryBuffer that references some *part* of an existing
MemoryBuffer, where the dereferenced end pointer is not a NUL. Mostly
I just do not understand why this restriction exists, so I am just
looking for comprehension.
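
The copy workaround I mean looks roughly like this (a sketch only; I
am assuming the StringRef-based getMemBufferCopy factory, and
copyToBuffer is a hypothetical helper of my own):

  #include "llvm/ADT/StringRef.h"
  #include "llvm/Support/MemoryBuffer.h"
  #include <memory>

  // It/End are the const char* iterator pair from the external
  // library; End is one past the last byte and may not be
  // dereferenceable.
  std::unique_ptr<llvm::MemoryBuffer>
  copyToBuffer(const char *It, const char *End) {
    // getMemBufferCopy copies the range into fresh, NUL-terminated
    // storage, so the end-of-page case cannot fault.
    return llvm::MemoryBuffer::getMemBufferCopy(
        llvm::StringRef(It, End - It), "external data");
  }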

NUL-termination is an optimization, in particular for clang's parser, where it's cheaper to check for '\0' to signal EOF than it is to compare pointers. Its presence doesn't move the end iterator (it just makes it dereferenceable) and in no way impacts your ability to use iterator pairs to delimit the binary content of the file. This feature does require the file contents to be copied into local memory (rather than mapped in) if the file size is exactly a multiple of the system page size.

Out of curiosity, what code in Clang is optimized by doing a pointer
dereference and then a compare against 0, rather than just comparing
two pointers directly? That does not seem more efficient when laid
out like that, which is why I am curious what code is actually helped
by that pattern.

Consider parsing an integer:

// With NUL termination.

while ('0' <= *p && *p <= '9')
  n = n * 10 + (*p++ - '0');

// Without.

while (p != e && '0' <= *p && *p <= '9')
  n = n * 10 + (*p++ - '0');

— Gordon

The code is out there, go read it :)

-Chris

That does not help with binary files though (as in my case). Why not
use a templated range interface (I could write one if you wish)? You
could specify the termination style as a policy on the template and
let the range test for the end internally: comparing pointers when
the buffer is not null terminated, and comparing against NUL when it
is, as an optimization. With a policy-based design you could even
terminate on any given terminator value, fully optimized for every
situation. Policy-based trumps hand-rolled trumps generic in terms of
speed. I write code like that all the time and it would be quite
simple; would you accept such code? From a quick sketch, the
interface would change from using
getBufferStart()/getBufferEnd()/getBufferSize() to something like
getRange(), which returns a random_access_range_type<no_policy> and
would be used roughly like this (using a monospace font to line
things up):
typedef const char cchar;
MemoryBuffer mb;
// getRange() would hand back the [begin, end) pair; the policy
// parameter selects how the end-of-buffer test is performed.
MemoryBuffer::range_type<policy> rt(mb.getRange());

Current MemoryBuffer            Range-based MemoryBuffer
--------------------            ------------------------
mb.getBufferStart()             rt.begin()
mb.getBufferEnd()               rt.end()
mb.getBufferSize()              rt.size()
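
To flesh that out, a minimal sketch of the policy mechanism might look
like this (every name here is invented for illustration; none of it
exists in LLVM):

  #include <cstddef>

  // The end-of-range test is supplied by the policy, so each variant
  // compiles down to exactly the comparison it needs.
  struct nul_terminated_policy {
    static bool at_end(const char *p, const char *) { return *p == '\0'; }
  };
  struct pointer_pair_policy {
    static bool at_end(const char *p, const char *e) { return p == e; }
  };

  template <typename Policy>
  class range_type {
    const char *Begin, *End;
  public:
    range_type(const char *b, const char *e) : Begin(b), End(e) {}
    const char *begin() const { return Begin; }
    const char *end() const   { return End; }
    std::size_t size() const  { return End - Begin; }
    // Callers iterate with "while (!r.at_end(p)) ..."; the policy
    // decides whether that is a pointer compare or a NUL check.
    bool at_end(const char *p) const { return Policy::at_end(p, End); }
  };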