SmallString + raw_svector_ostream combination should be more efficient

A very common code pattern in LLVM is

SmallString<128> S;
raw_svector_ostream OS(S);

OS<< …
Use OS.str()

While raw_svector_ostream is smart to share the text buffer itself, it’s inefficient keeping two sets of pointers to the same buffer:

In SmallString: void *BeginX, *EndX, *CapacityX

In raw_ostream: char *OutBufStart, *OutBufEnd, *OutBufCur

Moreover, at runtime the two sets of pointers need to be coordinated between the SmallString and raw_svector_ostream using raw_svector_ostream::init, raw_svector_ostream::pwrite, raw_svector_ostream::resync and raw_svector_ostream::write_impl.
All these functions have non-inlined implementations in raw_ostream.cpp.

Finally, this may cause subtle bugs if S is modified without calling OS::resync(). This is too easy to do by mistake.

In this frequent case usage the client does not really care about S being a SmallString with its many useful string helper function. It’s just boilerplate code for raw_svector_ostream. But it does cost three extra pointers, some runtime performance and possible bugs.

To solve all three issues, would it make sense to have raw_ostream-derived container with a its own SmallString like templated-size built-in buffer?

To solve all three issues, would it make sense to have raw_ostream-derived
container with a its own SmallString like templated-size built-in buffer?

It can be improved. I am not sure templating the stream itself is the
best option.

I think it is possible to implement a derived class that duplicates
nothing and is not templated.

* The constructor gets passed a buffer (pointer, size) and can handle (null, 0).
* It sets up the base class to write past the existing data, like
raw_svector_ostream does.
* When more data is needed, allocate a new buffer, copy data from the
old one, and update the base class pointers.

A user can then pass a stack allocated buffer to get something like a
raw_svector_ostream. It could pass a null to get a convenient class
that handles all allocations.

Cheers,
Rafael

A very common code pattern in LLVM is

SmallString<128> S;
raw_svector_ostream OS(S);
OS<< ...
Use OS.str()

While raw_svector_ostream is smart to share the text buffer itself, it's
inefficient keeping two sets of pointers to the same buffer:

In SmallString: void *BeginX, *EndX, *CapacityX
In raw_ostream: char *OutBufStart, *OutBufEnd, *OutBufCur

Any reason to believe this inefficiency is significant/important?
Given that these are never in long-lived containers, but generally
just on the stack, it doesn't seem like the extra 3 pointers would be
very costly in terms of overall performance.

Even if the memory overhead is small, they do introduce code complexity in coordinating the SmallString and raw_svector_ostream and much runtime cost:

raw_ostream::write() calls raw_svector_ostream::write_impl() and raw_ostream::copy_to_buffer().

raw_svector_ostream::write_impl() calls OS.reserve and SetBuffer.
SetBuffer calls SetBufferAndMode(). testing the BufferMode for every time and writing these three pointers.

Every function has several code paths, adding to complexity and runtime cost. Due to this complexity, there are additional tests in raw_ostream::write() callers trying to take shortcuts to avoid calling write_impl()… these tests also take time.

Essentially, what we are trying to achieve with SmallString SmallString + raw_svector_ostream is be much, much simpler than what we have now, along the lines Rafael suggested:

if there is enough space write the string
else reallocate, copy and write the string.

That’s simpler and shorter code by order of magnitude than the combination have now, and a bonus - with a smaller memory footprint. What more can we ask for?

A very common code pattern in LLVM is

SmallString<128> S;
raw_svector_ostream OS(S);

OS<< …
Use OS.str()

While raw_svector_ostream is smart to share the text buffer itself, it’s inefficient keeping two sets of pointers to the same buffer:

In SmallString: void *BeginX, *EndX, *CapacityX

In raw_ostream: char *OutBufStart, *OutBufEnd, *OutBufCur

Moreover, at runtime the two sets of pointers need to be coordinated between the SmallString and raw_svector_ostream using raw_svector_ostream::init, raw_svector_ostream::pwrite, raw_svector_ostream::resync and raw_svector_ostream::write_impl.
All these functions have non-inlined implementations in raw_ostream.cpp.

Finally, this may cause subtle bugs if S is modified without calling OS::resync(). This is too easy to do by mistake.

FWIW I’ve been bitten exactly by this last week while updating an out-of-tree compiler to adapt API changes around streams (raw_pwrite_stream). On the positive aspect, I had to learn about the internal of the various streams :wink:

The API for raw_svector_ostream is not nice on this aspect, but what is really missing is that there is no assertion to catch it.
I feel it is a bit off from the principle “easy to use, hard to misuse”.

In this frequent case usage the client does not really care about S being a SmallString with its many useful string helper function. It’s just boilerplate code for raw_svector_ostream. But it does cost three extra pointers, some runtime performance and possible bugs.

To solve all three issues, would it make sense to have raw_ostream-derived container with a its own SmallString like templated-size built-in buffer?

Not sure about the exact solution, but “something” could certainly be done to be more “user-friendly”.

A very common code pattern in LLVM is

SmallString<128> S;
raw_svector_ostream OS(S);
OS<< ...
Use OS.str()

While raw_svector_ostream is smart to share the text buffer itself, it's
inefficient keeping two sets of pointers to the same buffer:

In SmallString: void *BeginX, *EndX, *CapacityX
In raw_ostream: char *OutBufStart, *OutBufEnd, *OutBufCur

Moreover, at runtime the two sets of pointers need to be coordinated
between the SmallString and raw_svector_ostream using
raw_svector_ostream::init, raw_svector_ostream::pwrite, raw_svector_ostream::resync
and raw_svector_ostream::write_impl.
All these functions have non-inlined implementations in raw_ostream.cpp.

Finally, this may cause subtle bugs if S is modified without calling
OS::resync(). This is too easy to do by mistake.

In this frequent case usage the client does not really care about S being
a SmallString with its many useful string helper function. It's just
boilerplate code for raw_svector_ostream. But it does cost three extra
pointers, some runtime performance and possible bugs.

I agree the bugs are real (Alp proposed something a while back regarding
this?), but you will need to provide measurements to justify the cost in
runtime performance. One technique I have used in the past to measure these
sorts of things I call "stuffing": take the operation that you want to
measure, then essentially change the logic so that you pay the cost 2
times, 3 times, etc. You can then look at the trend in performance as N
varies and extrapolate back to the case where N = 0 (i.e. you don't pay the
cost).

For example, in one situation where I used this method it was to measure
the cost of stat'ing files (sys::fs::status) across a holistic build, using
only "time" on the command line (it was on Windows and I didn't have any
tools like DTrace available that can directly measure this). In order to do
this, I changed sys::fs::status to call stat N times instead of 1, and
measured with N=1 N=2 N=3 etc. The result was that the difference between
the N and N+1 versions was about 1-2% across N=1..10 (or whatever I
measured). In order to negate caching and other confounding effects, it is
important to try different distributions of stats; e.g. the extra stats are
on the same file as the "real" stat vs. the extra stats are on nonexistent
files in the same directory as the "real" file vs. parent directories of
the "real" file; if these match up fairly well (they did), then you have
some indication that the "stuffing" is measuring what you want to measure.

So e.g. if you think the cost of 3 extra pointers is significant, then
"stuff" the struct with 3, 6, 9, ... extra pointers and measure the
difference in performance (e.g. measure the time building a real project).

-- Sean Silva

Sean, thanks for reminding this, Alp did commit a class derived from raw_svector_ostream conatining an internal SmallString he called small_string_ostream. The commit was reverted after a day due to a disagreement about the commit approval and apparently abandoned.

http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140623/223393.html

http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140623/223557.html

http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140630/223986.html

The class Alp comitted did solve the possible mismatch between the SmallString and the stream by making the SmallString private to the class. This however did not treat the root problem, the duplication of the information about the buffer between SmallString and the stream.

I can make performance measurements but even if the performance difference is neglible (and looking at all the code paths and conditionals of non-inlined functions at raw_ostream.cpp that’s hard to believe), the current SmallString-raw_svector_ostream is simply wrong.

Personally, after the previous “Alp” discussion I found and fixed several such issues in my out-of-tree code only to make new similar new error earlier this year, which I caught only months after, when Rafael committed the pwrite which reminded me the issue. Due ot the information duplication Rafael commit also had a bug and Mehdi reports similar experience. Back then Alp reported similar problems he found in LLVM code which were hopefully fixed otherwise.

In addition to the information duplication, the design is quite confusing to use

  • Should one use the stream.str() function or the SmallString itself?
  • flush or str?
  • How do you properly clear the combination for reuse (that was my mistake, I forgot to resync after OS.clear())?

It’s safe to say this design pattern is very easy to get wrong one way or another, we got burned by it multiple times and it should be replaced.

Alp suggested a derived class containing its own SmallString. That’s a simple and effective approach to avoid the bugs mentioned but does not fix the memory or runtime issues. Small as they may be, why have them at a fundemental data structure?

I was thinking about going further avoiding all duplication with a templated class derived with its own internal buffer storage.

Rafael suggested a more modular approach, a derived adpater class adapter to a simple buffer (or nullptr for fully-dynamic operation) which also won’t duplicate any information the buffer is dumb and has no state.

That solution sounds simple, efficient and safe to use. The implementation would be actually simpler then raw_svector_ostream since all the coordination logic is not required anymore.

Hi,

Is this what you’re thinking about?
The code is not tested yet, I’d like to know if the overall direction and design are correct.

Yaron

rco.diff (2.88 KB)

I don't think we should make flush virtual. Why do you need to do it?
Can't you set up the base class to write to use the tail of the memory
region as the buffer?

Yes, we should do without virtual flush. The problem was raw_char_ostream should not be flushed - it’s always current - so I overloaded flush to a no-op. A cleaner solution is attached, adding a DirectBuffer mode to the raw_ostream.

Also added a simple performance comparison project between raw_char_ostream and raw_svector_ostream. On Window 7 x64 machine, raw_char_ostream was three times faster than raw_svector_ostream when the provided buffer size is large enough and two times as fast with one dynamic allocation resize.

raw_char_ostream.diff (6.63 KB)

Could you dig into why:

  • raw_ostream &write(unsigned char C) override {
  • grow(1);
  • *OutBufCur++ = C;
  • return *this;
  • }

Is 3 times as fast as raw_svector_ostream? I don’t see a good reason why that should be any faster than:

raw_ostream &operator<<(char C) {
if (OutBufCur >= OutBufEnd)
return write(C);
*OutBufCur++ = C;
return *this;
}

You might just be seeing the difference between having write inlined vs. non-inlined, in which case your patch might be a complicated way to pull the likely cases of some raw_ostream methods into the header.

– Sean Silva

I outlined (is that the right word?) raw_char_ostream::grow, raw_char_ostream::write (both) into raw_ostream.cpp with less than 10% difference in performance.

Profiling reveals that the real culprit is the code line

OS.reserve(OS.size() + 64);

called from raw_svector_ostream::write_impl called from raw_svector_ostream::~raw_svector_ostream.
The issue is already documented:

raw_svector_ostream::~raw_svector_ostream() {
// FIXME: Prevent resizing during this flush().
flush();
}

And a solution is provided in raw_svector_ostream::init():

// Set up the initial external buffer. We make sure that the buffer has at
// least 128 bytes free; raw_ostream itself only requires 64, but we want to
// make sure that we don’t grow the buffer unnecessarily on destruction (when
// the data is flushed). See the FIXME below.
OS.reserve(OS.size() + 128);

This solution may be worse than the problem. In total:

  • If the SmallString was less than 128 bytes init() will always(!) heap allocate.
  • If it’s more or equal to 128 bytes but upon destruction less than 64 bytes are left unused it will heap allocate for no reason.

A careful programmer who did size the SmallString to match its use + small reserve will find either of these heap allocations surprising, negating the SmallString advantage and then paying memcpy cost.

I filed http://llvm.org/pr23395 about this. To temporarly avoid this issue, in addition to the outline methods change I commented the OS.reserve line in flush(). (This change of course requires further review.)

With reserve being out of the way, the gap now is smaller but still significant: raw_char_ostream is about 20% faster than raw_svector_ostream in Release=optimized configuration.

+update diff

raw_char_ostream.diff (7.27 KB)

Following a hint from Duncan in http://llvm.org/pr23395, here is a revised patch. Rather then introduce the raw_char_ostream adapter, this version improves raw_svector_ostream.

raw_svector_ostream is now a very lightweight adapter, running without a buffer and fowarding almost anything to the SmallString. This solves both the performance issue and the information duplication issue. The SmallString is always up-to-date and can be used at any time. flush() does nothing and is not required. resync() was kept in this patch as a no-op and will be removed with its users in a follow-up patch. I’ll also try to remove the superfluous flush calls() along the way.

raw_svector_ostream.diff (6.29 KB)

Here’s a performance testcase for the raw_svector_ostream patch. On my WIndows x64 machine it runs in 1010ms with the current code and in 440ms with the patch applies. Is this OK to commit?

CMakeLists.txt (112 Bytes)

raw_svector_ostream.diff (6.29 KB)

raw_svector_ostream_speed.cpp (768 Bytes)

This makes write virtual. What is the impact of that on the other streamers?

+llvm-dev@lists.llvm.org

The impact should be small as all the other streamers usually write directly to the memory buffer and only when out of buffer they call write().

OTOH, raw_svector_ostream (without a buffer) goes though write for every character or block it writes. It can work without virtual write() by overriding the existing virtual write_impl() but this is a slower code path for raw_svector_ostream.

Even though the attached raw_svector_ostream patch without virtual write() is still significantly faster then the current raw_svector_ostream since it avoids the double buffering. With this patch, on my system the example runs at 750ms compared with about 1000ms for the current raw_svector_ostream.

Would you prefer to continue review on Phabricator?

Given that it is faster, I am OK with it.

I get 0.774371631s with master and 0.649787169s with your patch.

Why move all virtual functions to the header? LGTM with them still in
the .cpp file.

r244870 (with the change you requested), thanks!

I initially placed the virtual function in header since they were one-liners.
The coding standards say that an anchor() function should be supplied, which was indeed missing. Other than required anchor function, why should the virtual functions go in the .cpp?
Should I move the empty raw_svector_ostream destructor to the .cpp file too as well?

r244870 (with the change you requested), thanks!

I initially placed the virtual function in header since they were
one-liners.
The coding standards say that an anchor() function should be supplied, which
was indeed missing. Other than required anchor function, why should the
virtual functions go in the .cpp?

Just because there is no advantage for most virtual methods to be in the .h.

Should I move the empty raw_svector_ostream destructor to the .cpp file too
as well?

Maybe. Destructors are the one exception since base class destructors
call them non-virtually. If the class is final we can move them.

Cheers,
Rafael