[RFC] Moving OnDiskHashTable from clang to LLVM

I would like to use the on disk hash table that's currently in clang for
instrumentation based profiling. This is a single header file that's
currently found in include/clang/Basic/OnDiskHashTable.h, and I
propose to move it to include/llvm/Support/OnDiskHashTable.h.

Any strong objections to moving this?

Also, the header includes a "clang::io" namespace with some operations
for reading and writing little endian files. Should these be directly
renamed to "llvm::io", or would something like "llvm::endian::little" or
"llvm::le" be more reasonable?

I would like to use the on disk hash table that's currently in clang for
instrumentation based profiling. This is a single header file that's
currently found in include/clang/Basic/OnDiskHashTable.h, and I
propose to move it to include/llvm/Support/OnDiskHashTable.h.

Any strong objections to moving this?

Also, the header includes a "clang::io" namespace with some operations
for reading and writing little endian files. Should these be directly
renamed to "llvm::io", or would something like "llvm::endian::little" or
"llvm::le" be more reasonable?

Llvm has lib/support/endian.h. Wouldn't the new API be redundant? If not, would it fit in that header?

Rafael Avila de Espindola <rafael.espindola@gmail.com> writes:

Also, the header includes a "clang::io" namespace with some operations
for reading and writing little endian files. Should these be directly
renamed to "llvm::io", or would something like "llvm::endian::little" or
"llvm::le" be more reasonable?

Llvm has lib/support/endian.h. Wouldn't the new API be redundant? If
not, would it fit in that header?

They're obviously related. Endian.h defines types like ulittle32, which
DTRT when read and written from memory, whereas clang::io has functions
to read and write memory like so:

    void Emit32(raw_ostream& Out, uint32_t V);
    uint32_t ReadLE32(const unsigned char *&Data);
    uint32_t ReadUnalignedLE32(const unsigned char *&Data);

The (aligned) ReadLE32 is easily represented by the stuff in Endian.h,
but the write to an ostream and reading unaligned aren't addressed very
well at present.

We could still add this stuff to Endian.h, assuming it isn't an issue to
depend on raw_ostream.h from Endian.h. I'm not sure if that'd be a
problem or not.

They're obviously related. Endian.h defines types like ulittle32, which
DTRT when read and written from memory, whereas clang::io has functions
to read and write memory like so:

    void Emit32(raw_ostream& Out, uint32_t V);
    uint32_t ReadLE32(const unsigned char *&Data);
    uint32_t ReadUnalignedLE32(const unsigned char *&Data);

The (aligned) ReadLE32 is easily represented by the stuff in Endian.h,
but the write to an ostream and reading unaligned aren't addressed very
well at present.

ulittle32 is actually unaligned:

typedef detail::packed_endian_specific_integral
                  <uint32_t, little, unaligned> ulittle32_t

But yes, I don't think we have anything for writing and I it is
probably not trivial to extend it. We can probably add the existing
write functions to Endian.h itself.

What is lld using? If we are moving these functions from clang to llvm
it would be really nice to use them in lld too.

We could still add this stuff to Endian.h, assuming it isn't an issue to
depend on raw_ostream.h from Endian.h. I'm not sure if that'd be a
problem or not.

Cheers,
Rafael

packed_endian_specific_integral supports writing.

- Michael Spencer

Michael Spencer <bigcheesegs@gmail.com> writes:

packed_endian_specific_integral supports writing.

This is true, but as far as I can tell it's pretty awkward to use it
with a raw_ostream:

  raw_ostream OS = ...;
  uint32_t Val = ...;
  ...
  ulittle32_t x = *reinterpret_cast<ulittle32_t *>&Val;
  OS.write((char *)&x, sizeof(x));

Maybe I'm missing something?

Rafael Espíndola <rafael.espindola@gmail.com> writes:

They're obviously related. Endian.h defines types like ulittle32, which
DTRT when read and written from memory, whereas clang::io has functions
to read and write memory like so:

    void Emit32(raw_ostream& Out, uint32_t V);
    uint32_t ReadLE32(const unsigned char *&Data);
    uint32_t ReadUnalignedLE32(const unsigned char *&Data);

The (aligned) ReadLE32 is easily represented by the stuff in Endian.h,
but the write to an ostream and reading unaligned aren't addressed very
well at present.

ulittle32 is actually unaligned:

typedef detail::packed_endian_specific_integral
                  <uint32_t, little, unaligned> ulittle32_t

But yes, I don't think we have anything for writing and I it is
probably not trivial to extend it. We can probably add the existing
write functions to Endian.h itself.

A bit more detail on this.

For reading, simply changing the existing clients to use Endian.h is a
bit of work, as the read functions in clang::io actually move a cursor
over the read data, so simply switching would actually be something like
this:

  - unsigned len = ReadUnalignedLE16(Items);
  + unsigned len = *reinterpret_cast<llvm::support::ulittle16_t *>(Items);
  + Items += 2;

Presumably, hiding the addition helps maintainability. I suppose we
could add functions functions that increment the buffer to Endian.h, and
maybe expose an API like so:

  + unsigned len = support::readFromBuffer<support::ulittle16_t>(Items)

For writing, what about adding this to raw_ostream directly? Something
like:

  template <typename T> raw_ostream &write_le(T V) {
    for (size_t I = 0, E = sizeof(T); I < E; ++I)
      *this << (unsigned char)(0xFF & (V >> (I * 8)));
    return *this;
  }

  template <typename T> raw_ostream &write_be(T V) {
    for (size_t I = sizeof(T); I; --I)
      *this << (unsigned char)(0xFF & (V >> (I * 8)));
    return *this;
  }

Thoughts?

A bit more detail on this.

For reading, simply changing the existing clients to use Endian.h is a
bit of work, as the read functions in clang::io actually move a cursor
over the read data, so simply switching would actually be something like
this:

  - unsigned len = ReadUnalignedLE16(Items);
  + unsigned len = *reinterpret_cast<llvm::support::ulittle16_t *>(Items);
  + Items += 2;

Presumably, hiding the addition helps maintainability. I suppose we
could add functions functions that increment the buffer to Endian.h, and
maybe expose an API like so:

  + unsigned len = support::readFromBuffer<support::ulittle16_t>(Items)

For writing, what about adding this to raw_ostream directly? Something
like:

  template <typename T> raw_ostream &write_le(T V) {
    for (size_t I = 0, E = sizeof(T); I < E; ++I)
      *this << (unsigned char)(0xFF & (V >> (I * 8)));
    return *this;
  }

  template <typename T> raw_ostream &write_be(T V) {
    for (size_t I = sizeof(T); I; --I)
      *this << (unsigned char)(0xFF & (V >> (I * 8)));
    return *this;
  }

Thoughts?

I think I see your point. The API seem sufficiently different that
there might be use cases where one or the other is better. Using an
explicit increment would make the code less readable. The API in
Endian.h is nice for cases where we know the entire layout of large
data section, since they are composable:

struct dos_header {
  support::ulittle16_t Magic;
  support::ulittle16_t UsedBytesInTheLastPage;
...
};

If the use cases we have in clang or the ones you want to add don't
match this, then regular stream api is probably better. I would keep
it out of raw_ostream, at least for now. How a about a
include/llvm/Support/EndianStream.h?

Cheers,
Rafael

I like this idea. I would think that most of the endian-reading functions can be replaced by what's currently in llvm/Support/Endian.h. The writing functions...well, I'm not sure. They're not necessarily going to be useful anywhere else.

Jordan

Rafael Espíndola <rafael.espindola@gmail.com> writes:

I think I see your point. The API seem sufficiently different that
there might be use cases where one or the other is better. Using an
explicit increment would make the code less readable. The API in
Endian.h is nice for cases where we know the entire layout of large
data section, since they are composable:

struct dos_header {
  support::ulittle16_t Magic;
  support::ulittle16_t UsedBytesInTheLastPage;
...
};

If the use cases we have in clang or the ones you want to add don't
match this, then regular stream api is probably better. I would keep
it out of raw_ostream, at least for now. How a about a
include/llvm/Support/EndianStream.h?

I've sent two patches to the list, one which introduces a stream writer
in EndianStream.h, and one that adds a readNext function to Endian.h:

http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140324/210057.html
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140324/210061.html

How do these sound?