RFC: Adding llvm::ThinStream

Some background:

A while back, while working on code to read and write PDB files, I came up with Yet Another Stream Abstraction. Note that LLVM already has a few. Off the top of my head, there's:

  1. MemoryBuffer and its associated class hierarchy
  2. raw_ostream and its associated classes.
  3. DataExtractor, which is used for reading from a StringRef.

There's probably more, and individual subprojects might have even created their own.

The reason I couldn't use any of these, and needed to invent another, is that PDB files are not laid out contiguously in memory. You can think of a PDB as a file system: there is an MFT that defines the blocks that individual files live on, and you have to re-sequence the blocks in order to read out a "file".

To add to the complexity, each "file" inside is actually a list of variable-length records, where records, or even individual fields of a record, might cross a block boundary and require multiple reads to piece together. I needed a way to view these streams as contiguous, and to read data out of them while all the complexity of broken fields, discontiguous records, etc. just disappears behind the interface. I also needed the ability to read and write through a single interface without worrying about the details of the block structure. None of LLVM's existing abstractions provided convenient mechanisms for writing this kind of data.

So I came up with the following set of abstractions:

ThinStream - An abstract base class that provides read-only access to data. The interface is:

  virtual Error readBytes(uint32_t Offset, uint32_t Size,
                          ArrayRef<uint8_t> &Buffer) const = 0;
  virtual Error readLongestContiguousChunk(uint32_t Offset,
                                           ArrayRef<uint8_t> &Buffer) const = 0;
  virtual uint32_t getLength() const = 0;

An important distinction between ThinStream and existing stream implementations is that the API encourages implementations to be zero-copy. Instead of giving the stream a buffer to write into, it returns you a slice of the existing buffer. This makes it very efficient. Similar to ArrayRef / MutableArrayRef, I also provide WritableThinStream for cases where your data is not read-only. This is another area where functionality is provided that was not present in existing abstractions (namely, writability).
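To make the zero-copy idea concrete, here is a minimal sketch of what a flat-buffer implementation of readBytes might look like (illustrative only, not the actual ByteStream source; it assumes a backing ArrayRef<uint8_t> member named Data):

Error ByteStream::readBytes(uint32_t Offset, uint32_t Size,
                            ArrayRef<uint8_t> &Buffer) const {
  // Bounds check, then hand back a view into the backing store. No copy.
  if (Offset > Data.size() || Size > Data.size() - Offset)
    return errorCodeToError(make_error_code(std::errc::invalid_argument));
  Buffer = Data.slice(Offset, Size);
  return Error::success();
}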

I have several implementations of this class and some abstractions for working with them. ByteStream and MutableByteStream provide a concrete implementation where the backing store is an ArrayRef or MutableArrayRef. MappedBlockStream (which is PDB specific) provides an implementation that seeks around a file, piecing together blocks from various locations in a PDB file. When a call to readBytes spans a block boundary, it does multiple reads, allocates a contiguous buffer from a BumpPtrAllocator, and returns a reference to that (subsequent requests for the same offset return from the cached allocation). There is also FileBufferThinStream, which adapts an llvm::FileOutputBuffer so you can write to a file system object. One could easily imagine an implementation that adapts llvm::MemoryBuffer so that you could read and write from mmap'ed files.

But all of these just allow reading and writing raw bytes. To handle reading and writing semantic data, there are two additional classes: ThinStreamReader and ThinStreamWriter.

These accept any subclass of ThinStream and maintain an offset and allow you to read integers in any endianness, strings, objects, arrays of objects (both fixed length records and variable length records), and various other things.

Finally, there are ThinStreamRef and WritableThinStreamRef, which you can think of as ArrayRef and MutableArrayRef for ThinStreams. They allow slicing, dropping, etc., and have copy semantics, so you can easily pass streams around without worrying about reference lifetime issues.
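For illustration, assuming ArrayRef-style helpers such as slice and drop_front on the ref types:

ThinStreamRef Whole(Stream);
// Refs are cheap to copy; no bytes are touched until a read happens.
ThinStreamRef First = Whole.slice(0, 16);   // a view of the first 16 bytes
ThinStreamRef Rest = Whole.drop_front(16);  // a view of everything after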

To re-iterate: when using a ThinStreamReader, copies are the exception, not the rule, and for most implementations of ThinStream they will never occur. Suppose you have a struct:

struct Header {
  char Magic[48];
  ulittle16_t A;
  ulittle16_t B;
  ulittle64_t C;
};

To read this using a ThinStreamReader, you would write this:

ThinStreamReader Reader(Stream);
const Header *H;
if (auto EC = Reader.readObject(H))
  return EC;

and ThinStreamReader just reinterpret_casts the underlying bytes to your structure. The same is true for null-terminated strings, arrays of objects, and everything else. It is up to the user to ensure that reads and writes happen at proper alignments (LLVM can still assert if you do a misaligned read/write), but when reading and writing records in binary file formats, this is not a new issue.

The proposal:
This code has been used in LLVM’s PDB library for some time, and I want to move it up to Support.

I've discussed this with some people offline. beanz@ has expressed interest in it for some work he wants to do on libObject. It could also replace a few thousand lines of (untested) code in LLDB that does essentially the same thing. I suspect it can be used any time anyone is reading from or writing to a binary file format, and it may even be faster than existing implementations due to the zero-copy design.

I have a (somewhat large) patch locally that gets LLVM working with the code raised up to Support. If you want to look at the existing implementation, the following files are relevant:

// Files that would move up to Support
include/DebugInfo/MSF/ByteStream.h
include/DebugInfo/MSF/StreamInterface.h
include/DebugInfo/MSF/StreamReader.h
include/DebugInfo/MSF/StreamWriter.h
include/DebugInfo/MSF/StreamArray.h
include/DebugInfo/MSF/StreamRef.h
lib/DebugInfo/MSF/StreamReader.cpp
lib/DebugInfo/MSF/StreamWriter.cpp
lib/DebugInfo/MSF/StreamRef.cpp

// Files that would remain PDB specific
include/DebugInfo/MSF/MappedBlockStream.h
lib/DebugInfo/MSF/MappedBlockStream.cpp

(In the existing implementation, "Thin" is not used in the class names.)

The code is lacking Doxygen-style comments, but I would add complete documentation as part of any move.

Questions / comments welcome.

I take this as no objections. I've changed the name from ThinStream to BinaryStream, as it more accurately conveys what it is used for, and I've got the tests and comments mostly ready to go, so I'll commit this later today if there are no objections.

Haven’t got to this but would like to take a look/review it before it goes in.

skimming over some of the description

Sounds like ‘stream’ might not be the right terminology - since they return pointers into data that (I think) remains valid for the life of the stream? (this also makes me wonder a bit about memory usage if the cross-block operation is used a lot (causing allocations) but the values are then discarded by the user - the memory can’t be reused, it’s effectively leaked)

Also - the wrappers ThinStreamReader/Writer - do they benefit substantially from being thin-stream aware, rather than abstractions around any stream of bytes? (since they transform the data anyway)

Oh, reinterpret casting… hrm. That kind of file reading/writing scheme usually makes me a bit uncomfortable due to portability concerns (having to align, byte swap, etc, structs to match the on-disk format can make those structures problematic to work with - if you have to byte swap anyway, you'd need to copy the data out of the underlying buffer anyway, right?)

> Haven't got to this but would like to take a look/review it before it goes in.

> skimming over some of the description

> Sounds like 'stream' might not be the right terminology - since they return pointers into data that (I think) remains valid for the life of the stream? (this also makes me wonder a bit about memory usage if the cross-block operation is used a lot (causing allocations) but the values are then discarded by the user - the memory can't be reused, it's effectively leaked)

Cross-block allocation is the exception, though, because PDB is weird; I don't expect it to be heavily used in other places. That said, the PDB implementation of BinaryStream takes this into account and caches allocation requests at a given offset. The data isn't just random bytes, it's structured into records, so we expect reads of the same data to always happen at the same offsets (if you've got a list of records, you're going to read records 1, 2, and 3, not some random bytes in the middle). Because of that, we can cache the allocation made for a particular offset and size, and if another request comes in that was previously cached, we just return the cached buffer. The truly random case is still handled for the sake of completeness, but I don't think it will actually be needed.

> Also - the wrappers ThinStreamReader/Writer - do they benefit substantially from being thin-stream aware, rather than abstractions around any stream of bytes? (since they transform the data anyway)

This is actually a surprisingly handy abstraction. Since a BinaryStream (I've abandoned the term Thin, btw, but we can discuss the name orthogonally if you like) most likely represents some kind of structured format with records, you often have quite a bit of data to read out. And since a BinaryStream can be any arbitrary implementation, you don't always have a single run of bytes representing everything. I'll use the canonical PDB case where you've got all these blocks spread around. I might know that I have 1000 records, but not where they come from. Any given record might cross a block boundary, in which case I get back a pointer allocated from the stream's internal BumpPtrAllocator. But if I just want to read 1000 records, it's convenient to treat the entire thing as one contiguous sequence. So I just make a BinaryStreamReader and iterate over it until it becomes empty, reading out the records. If I had to hand it a flat sequence of bytes instead, this would become very cumbersome.
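As a rough sketch of that loop (RecordPrefix is a made-up fixed-size record header, and an empty()-style query on the reader is assumed):

BinaryStreamReader Reader(Stream);
while (!Reader.empty()) {
  const RecordPrefix *P;  // fixed-size prefix of a variable-length record
  if (auto EC = Reader.readObject(P))
    return EC;
  // ... consume the record's variable-length payload; any stitching of
  // blocks happens behind the BinaryStream interface ...
}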

That said, there is one implementation of BinaryStream that is just a sequence of bytes. So this is general enough to support that.

> Oh, reinterpret casting… hrm. That kind of file reading/writing scheme usually makes me a bit uncomfortable due to portability concerns (having to align, byte swap, etc, structs to match the on-disk format can make those structures problematic to work with - if you have to byte swap anyway, you'd need to copy the data out of the underlying buffer anyway, right?)

For byte swapping, this is usually done by using LLVM's endian-aware types. If you have a struct with a bunch of llvm::support::ulittle16_t's, then it doesn't matter what platform you're on: you can reinterpret_cast the pointer to that struct, and when you read the values they will be correct.
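For example (OnDiskHeader is a made-up record; the endian-aware types live in llvm/Support/Endian.h):

struct OnDiskHeader {
  llvm::support::ulittle16_t Version;  // stored little endian on disk
  llvm::support::ulittle64_t FileSize;
};

const OnDiskHeader *H;
if (auto EC = Reader.readObject(H))
  return EC;
uint64_t Size = H->FileSize;  // swapped on access if the host is big endian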

On the other hand, sometimes it's convenient to just say "give me an integer"; in that case the read function lets you specify what endianness you want (it defaults to host) and copies the value into an output parameter rather than returning a pointer. In either case, the user of the API never has to deal with byte swapping.

I’ll upload a patch shortly.

Out of curiosity, how do we expect this to be useful in libObject?

Peter

There are two applications for this in libObject that I think are interesting.

(1) Enabling reading binaries that are incomplete or not in contiguous memory. This is the biggest reason why LLDB has its own object parsers instead of using libObject, so we need to have a solution to this problem in order to get rid of LLDB’s object file readers.

(2) Enabling lightweight rewriting of parts of object files. We want to implement strip in LLVM on top of libObject, and today we can’t because libObject has no mechanism for modifying object files. One of the things that has come up during discussions of how to handle writing was the idea of breaking object files up into a list of buffers, where each buffer corresponds to a structural part of the file. Then if you rewrite the data in one part of the file, you can just replace the buffer in the list.

This is particularly interesting for tools like strip, where you could parse the file into a list of buffers, then just remove a few of the buffers, and re-emit (hand wave hand wave, I know there’s a lot more to it than that, I like oversimplifying things).

-Chris

I'll let Chris answer this one, since he was the one who brought it up with me, but there's one more thing I forgot to mention.

It still lives up in the PDB library, but I also have a class called CodeViewRecordIO, which is similar in spirit to YamlIO. You initialize it with either a BinaryStreamReader or a BinaryStreamWriter, and then instead of calling functions like readInteger, writeInteger, readObject, or readZeroString, you call mapInteger, mapObject, or mapZeroString. If it's initialized with a reader, it reads; otherwise it writes.

This turned out to be extremely useful, as it allowed me to merge the reading and writing codepaths into one. We had bugs before where we couldn't round-trip a PDB because we would write one thing and read another (maybe we forgot to skip some padding on either the read or the write). When there's one codepath, it almost becomes declarative: you just write down the order the fields come in, and both reading and writing work automatically.
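To make that concrete, the single codepath looks roughly like this (MyRecord and its fields are hypothetical; mapInteger / mapZeroString are the functions mentioned above):

Error mapMyRecord(CodeViewRecordIO &IO, MyRecord &R) {
  // Declare the field order once; IO reads or writes depending on
  // whether it was initialized with a reader or a writer.
  if (auto EC = IO.mapInteger(R.Kind))
    return EC;
  if (auto EC = IO.mapInteger(R.Length))
    return EC;
  if (auto EC = IO.mapZeroString(R.Name))
    return EC;
  return Error::success();
}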

It’s not ready to be generalized alongside this quite yet, but that would be the next logical step.

This second one might just work out of the box, or at the very least should be really easy. I already have one implementation of BinaryStream called BinaryItemStream, which is backed by an ArrayRef of individually allocated records. It's useful when you want to dump out a bunch of records but don't know how many you're going to have, or how many bytes you're going to need, until you've generated them all, so it requires a separate allocation for each record.

So you build up a vector<T*>, and you can add / remove from the middle or whatever, and when you're done you just create a BinaryItemStream from it, wrap it in a BinaryStreamReader, and you can slam it out to disk in like one line.
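Roughly like so (the setItems call and the exact template shape are assumptions for illustration, not the committed API):

std::vector<Record *> Records;
// ... build the records, inserting and erasing freely ...
BinaryItemStream<Record *> ItemStream(llvm::support::little);
ItemStream.setItems(Records);
BinaryStreamReader Reader(ItemStream);
// Reader now presents all the records as one contiguous stream.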

How would this work if the endianness and byte size of the struct (e.g. ELF header) are only decided at runtime? For example, if I need a single piece of code to handle both x86_64 little-endian and arm32 big-endian versions of a struct. Am I supposed to template everything based on byte size and endianness (with some runtime dispatch at the top level)? Or is this not a use case that you have in mind here?

pl

There are two ways that integers can get read/written:

  1. By calling the function

Error readInteger(T& Dest)

In this case the endianness of the integer is assumed to be the one you constructed the Reader/Writer with. This is the only way to read something where the endianness is only known at runtime.

  2. By calling

Error readObject(const T*& Dest)

In this case the underlying bytes are reinterpret_cast'ed to a const T*, so if endianness matters to you when using this function, it has to be part of the type T.

So it's not ideal, but you could read each field of the struct independently using #1. This kills the performance gain you get from eliminating copies, but I suspect this is already how the code you have in mind works.
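A sketch of approach #1, assuming (per the above) a reader constructed with the runtime-chosen endianness, and with HeaderFields as a made-up host-side struct:

BinaryStreamReader Reader(Stream, IsBigEndian ? llvm::support::big
                                              : llvm::support::little);
HeaderFields H;  // a plain host struct, not the on-disk layout
if (auto EC = Reader.readInteger(H.Type))
  return EC;
if (auto EC = Reader.readInteger(H.Machine))
  return EC;
// ... one readInteger call per field: copies, but endian-correct at runtime.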

Or you can do what you said: multiple types and a runtime dispatch.

Also, you mention byte size. There is currently no mechanism for reading/writing pointer-sized values with a runtime-specified pointer size, but that would be only a couple of lines to add.

Also, you can always inherit from the readers/writers and add new methods if the core methods are insufficient.