UB in MemoryBufferMMapFile

In MemoryBuffer::init, we have an assert that reads the memory at
`BufEnd`, which is one past the end of some memory region:

from lib/Support/MemoryBuffer.cpp:45:

void MemoryBuffer::init(const char *BufStart, const char *BufEnd,
                        bool RequiresNullTerminator) {
  assert((!RequiresNullTerminator || BufEnd[0] == 0) &&
         "Buffer is not null terminated!");
  BufferStart = BufStart;
  BufferEnd = BufEnd;
}

However, this can be, and often is, one past an allocated region, since
in MemoryBufferMMapFile we mmap `Len` bytes and then call this with
`BufEnd = Start + Len`:

from lib/Support/MemoryBuffer.cpp:210:

  MemoryBufferMMapFile(bool RequiresNullTerminator, int FD, uint64_t Len,
                       uint64_t Offset, std::error_code &EC)
      : MFR(FD, sys::fs::mapped_file_region::readonly,
            getLegalMapSize(Len, Offset), getLegalMapOffset(Offset), EC) {
    if (!EC) {
      const char *Start = getStart(Len, Offset);
      init(Start, Start + Len, RequiresNullTerminator);
    }
  }

This is UB, since we're reading one past an allocated region, and I have
an internal test case where I'm seeing it crash because that memory
happens to be unreadable.

There are three options here:

1. Remove the assert - it isn't possible to write correctly.
2. Remove the RequiresNullTerminator argument from MemoryBufferMMapFile
   and hard code it to false.
3. Allocate an extra byte here and zero it.

I'm not sure which is best. Thoughts?

Why isn’t it BufEnd[-1] == 0 ?

What is the use-case for this RequiresNullTerminator after the end of the buffer?

I feel I’m missing some piece here!

In MemoryBuffer::init, we have an assert that reads the memory at
`BufEnd`, which is one past the end of some memory region:

from lib/Support/MemoryBuffer.cpp:45:

void MemoryBuffer::init(const char *BufStart, const char *BufEnd,
                       bool RequiresNullTerminator) {
assert((!RequiresNullTerminator || BufEnd[0] == 0) &&
        "Buffer is not null terminated!");
BufferStart = BufStart;
BufferEnd = BufEnd;
}

Commenting from a historical perspective, this is as intended. In the case of an mmapped buffer, you usually mmap some file that is not a multiple of a page size, and the OS zero fills the rest of the page. This makes this behavior free. In the case where you’re not mmaping a file (e.g. for clang, if the file is small) it is easy enough to malloc one byte more than you need to ensure the zero byte is present.

However, this can be, and often is, one past an allocated region, since
in MemoryBufferMMapFile we mmap `Len` bytes and then call this with
`BufEnd = Start + Len`:

from lib/Support/MemoryBuffer.cpp:210:

MemoryBufferMMapFile(bool RequiresNullTerminator, int FD, uint64_t Len,
                      uint64_t Offset, std::error_code &EC)
     : MFR(FD, sys::fs::mapped_file_region::readonly,
           getLegalMapSize(Len, Offset), getLegalMapOffset(Offset), EC) {
   if (!EC) {
     const char *Start = getStart(Len, Offset);
     init(Start, Start + Len, RequiresNullTerminator);
   }
}

This is UB, since we're reading one past an allocated region, and I have
an internal test case where I'm seeing it crash because that memory
happens to be unreadable.

Well that is wrong. It should mmap the file, and if it knows that the file is an exact multiple of a page size, it should copy it into a malloc’d buffer that is one byte larger, and put a zero there.

There are three options here:

1. Remove the assert - it isn't possible to write correctly.

This will break clang. Clang “knows” that files end with a nul byte.

2. Remove the RequiresNullTerminator argument from MemoryBufferMMapFile
  and hard code it to false.
3. Allocate an extra byte here and zero it.

#3 seems right to me assuming the problematic client is clang. If the problematic client isn’t clang, then it should mirror whatever clang is doing with this API. If Clang isn’t using this (which seems to be the case on a cursory look?), then we should change this implementation to respect memorybuffer’s invariants IMO.

-Chris

Chris Lattner <clattner@apple.com> writes:

In MemoryBuffer::init, we have an assert that reads the memory at
`BufEnd`, which is one past the end of some memory region:

from lib/Support/MemoryBuffer.cpp:45:

void MemoryBuffer::init(const char *BufStart, const char *BufEnd,
                       bool RequiresNullTerminator) {
assert((!RequiresNullTerminator || BufEnd[0] == 0) &&
        "Buffer is not null terminated!");
BufferStart = BufStart;
BufferEnd = BufEnd;
}

Commenting from a historical perspective, this is as intended. In the
case of an mmapped buffer, you usually mmap some file that is not a
multiple of a page size, and the OS zero fills the rest of the page.
This makes this behavior free. In the case where you’re not mmaping a
file (e.g. for clang, if the file is small) it is easy enough to
malloc one byte more than you need to ensure the zero byte is present.

However, this can be, and often is, one past an allocated region, since
in MemoryBufferMMapFile we mmap `Len` bytes and then call this with
`BufEnd = Start + Len`:

from lib/Support/MemoryBuffer.cpp:210:

MemoryBufferMMapFile(bool RequiresNullTerminator, int FD, uint64_t Len,
                      uint64_t Offset, std::error_code &EC)
     : MFR(FD, sys::fs::mapped_file_region::readonly,
           getLegalMapSize(Len, Offset), getLegalMapOffset(Offset), EC) {
   if (!EC) {
     const char *Start = getStart(Len, Offset);
     init(Start, Start + Len, RequiresNullTerminator);
   }
}

This is UB, since we're reading one past an allocated region, and I have
an internal test case where I'm seeing it crash because that memory
happens to be unreadable.

Well that is wrong. It should mmap the file, and if it knows that the
file is an exact multiple of a page size, it should copy it into a
malloc’d buffer that is one byte larger, and put a zero there.

This doesn't seem to be working correctly, unless
sys::Process::getPageSize() is returning the wrong thing for me.

I'm getting here from getOpenFileImpl() in MemoryBuffer.cpp, which is
deciding it's safe to mmap (by calling shouldUseMMap, where the page
size check you mention is). The file is 561152 bytes, and the page
size reports 16384, so this is not a multiple of page size (0x89000 &
0x3fff == 0x1000). Yet, the memory one byte past the mmap is not
readable.

There are three options here:

1. Remove the assert - it isn't possible to write correctly.

This will break clang. Clang “knows” that files end with a nul byte.

Strictly speaking, it won't "break" clang, it will just make a certain
class of bugs harder to track down.

2. Remove the RequiresNullTerminator argument from MemoryBufferMMapFile
  and hard code it to false.
3. Allocate an extra byte here and zero it.

#3 seems right to me assuming the problematic client is clang. If the
problematic client isn’t clang, then it should mirror whatever clang
is doing with this API. If Clang isn’t using this (which seems to be
the case on a cursory look?), then we should change this
implementation to respect memorybuffer’s invariants IMO.

This isn't clang, but the problem is happening entirely within LLVM -
I'm getting here from a call to MemoryBuffer::getFileOrSTDIN(SomeFile).
I can work around the problem by updating my client to pass
RequiresNullTerminator=false, but that doesn't actually fix the problem
in LLVM itself.

This is UB, since we're reading one past an allocated region, and I have
an internal test case where I'm seeing it crash because that memory
happens to be unreadable.

Well that is wrong. It should mmap the file, and if it knows that the
file is an exact multiple of a page size, it should copy it into a
malloc’d buffer that is one byte larger, and put a zero there.

This doesn't seem to be working correctly, unless
sys::Process::getPageSize() is returning the wrong thing for me.

I'm getting here from getOpenFileImpl() in MemoryBuffer.cpp, which is
deciding it's safe to mmap (by calling shouldUseMMap, where the page
size check you mention is). The file is 561152 bytes, and the page
size reports 16384, so this is not a multiple of page size (0x89000 &
0x3fff == 0x1000). Yet, the memory one byte past the mmap is not
readable.

That sounds like sys::Process::getPageSize() is broken or that there is a kernel bug. mmap is defined in terms of page granularity.

There are three options here:

1. Remove the assert - it isn't possible to write correctly.

This will break clang. Clang “knows” that files end with a nul byte.

Strictly speaking, it won't "break" clang, it will just make a certain
class of bugs harder to track down.

Clang’s lexer performance depends on this. I agree it won’t break the functionality of clang, but regressing compile time perf isn’t acceptable either.

-Chris

This is UB, since we're reading one past an allocated region, and I have
an internal test case where I'm seeing it crash because that memory
happens to be unreadable.

Well that is wrong. It should mmap the file, and if it knows that the
file is an exact multiple of a page size, it should copy it into a
malloc’d buffer that is one byte larger, and put a zero there.

This doesn't seem to be working correctly, unless
sys::Process::getPageSize() is returning the wrong thing for me.

I'm getting here from getOpenFileImpl() in MemoryBuffer.cpp, which is
deciding it's safe to mmap (by calling shouldUseMMap, where the page
size check you mention is). The file is 561152 bytes, and the page
size reports 16384, so this is not a multiple of page size (0x89000 &
0x3fff == 0x1000). Yet, the memory one byte past the mmap is not
readable.

That sounds like sys::Process::getPageSize() is broken or that there is a kernel bug. mmap is defined in terms of page granularity.

It depends “what kind of pages”?

From About the Virtual Memory System :
"In OS X and in earlier versions of iOS, the size of a page is 4 kilobytes. In later versions of iOS, A7- and A8-based systems expose 16-kilobyte pages to the 64-bit userspace backed by 4-kilobyte physical pages, while A9 systems expose 16-kilobyte pages backed by 16-kilobyte physical pages."

I wonder if this difference between 16-kilobyte userspace pages vs 4KB physical pages couldn’t explain the behavior here: mmap will be handled by the kernel per 4kB pages, while sys::Process::getPageSize() returns the user space pages.

This is assuming Justin was testing on iOS with an A7/A8-based system.

There are three options here:

1. Remove the assert - it isn't possible to write correctly.

This will break clang. Clang “knows” that files end with a nul byte.

Strictly speaking, it won't "break" clang, it will just make a certain
class of bugs harder to track down.

Clang’s lexer performance depends on this. I agree it won’t break the functionality of clang, but regressing compile time perf isn’t acceptable either.

I think Justin was saying to just remove the *assert*, nothing else. Which wouldn’t break clang but may lead to other bugs down the line if the issue isn’t caught by the assert. This shouldn’t regress clang :slight_smile: