[PATCH]: Add SparseBitmap implementation

The attached patch adds a SparseBitmap implementation, which more or
less works the same way as GCC's sparse bitmap.

That is, we only store non-zero bits (broken up into elements of some
bit size), and store the elements in a linked list.

We keep track of the last accessed part of the linked list, so
in-order tests/sets/resets are all constant time, rather than linear
time.

Set operations are all O(set bits).

In the GCC world, we've tried a vast number of alternatives to this
"linked list + current pointer", but we've not come up with a
datastructure that performs faster in practice for random bit testing
*and* doesn't have bad space behavior.

For bitmap operations like &, |, etc, the fact that they are stored in
a linked list makes no difference.

The initial use of this structure will be for the new Andersens's
points-to implementation.

This is not the prettiest datastructure in the world. Sadly, there
are a lot of annoying little corner cases needed to make it fast and
small, so the code looks a bit ugly.

Suggestions, criticisms, etc, are welcome.

sparsebitmap.diff (18.7 KB)

Suggestions, criticisms, etc, are welcome.

I haven't studied the implementation, but I have a few comments on
the interface, and some style comments, below.

Index: include/llvm/ADT/SparseBitmap.h

--- include/llvm/ADT/SparseBitmap.h (revision 0)
+++ include/llvm/ADT/SparseBitmap.h (revision 0)
@@ -0,0 +1,571 @@
+//===- llvm/ADT/SparseBitmap.h - Efficient Sparse Bitmap ----*- C++ -*-===//

Add '-'s to make the first line line exactly 80 cols.

+/// SparseBitmap is an implementation of a bitvector that is sparse by only
+/// storing the elements that have non-zero bits set. In order to make this

I'd like to suggest the name SparseBitVector for this class, to
correspond with the name of the existing BitVector class, which has
a similar interface, but isn't sparse.

+ SparseBitmapElement(unsigned Idx) {

This misses an explicit keyword.

+ if (Bitmap.Elements.empty())
+ {
+ AtEnd = true;
+ return;
+ }

Here and a few other places miss reformatting.

+ template <int ElementSize>
+ class SparseBitmap {

Do you expect clients will often want custom ElementSize values? Otherwise,
it seems like this template parameter should be given a default value, or
even just removed from the API.

+ bool AtEnd;
+
+ SparseBitmap<ElementSize> &Bitmap;
+
+ // Current element inside of bitmap
+ ElementListConstIter Iter;
+
+ // Current bit number inside of our bitmap
+ unsigned BitNumber;
+
+ // Current word number inside of our element
+ unsigned WordNumber;
+
+ // Current bits from the element.
+ typename SparseBitmapElement<ElementSize>::BitWord Bits;

Can these SparseBitmapIterator members, and a few more that follow,
be made private?

+ SparseBitmapIterator(SparseBitmap<ElementSize> &RHS,
+ bool end = false):Bitmap(RHS) {

This appears to miss an explicit keyword.

Dan

> Suggestions, criticisms, etc, are welcome.

I haven't studied the implementation, but I have a few comments on
the interface, and some style comments, below.

> Index: include/llvm/ADT/SparseBitmap.h
> ===================================================================
> --- include/llvm/ADT/SparseBitmap.h (revision 0)
> +++ include/llvm/ADT/SparseBitmap.h (revision 0)
> @@ -0,0 +1,571 @@
> +//===- llvm/ADT/SparseBitmap.h - Efficient Sparse Bitmap ----*- C++ -*-===//

Add '-'s to make the first line line exactly 80 cols.

Will do

> +/// SparseBitmap is an implementation of a bitvector that is sparse by only
> +/// storing the elements that have non-zero bits set. In order to make this

I'd like to suggest the name SparseBitVector for this class, to
correspond with the name of the existing BitVector class, which has
a similar interface, but isn't sparse.

Sure, i have no problem with that.

> + SparseBitmapElement(unsigned Idx) {

This misses an explicit keyword.

Fixed!

> + if (Bitmap.Elements.empty())
> + {
> + AtEnd = true;
> + return;
> + }

Here and a few other places miss reformatting.

Fixed. Anyone have a better .emacs for the indentation than the one in tools?

> + template <int ElementSize>
> + class SparseBitmap {

Do you expect clients will often want custom ElementSize values? Otherwise,
it seems like this template parameter should be given a default value, or
even just removed from the API.

So, actually, at least in GCC, we discovered that we could speed up
some dataflow problems a lot with varying the element size to 256, and
on the other size, interference graphs could be reduced in space and
time a lot by varying the element size down to 64
I'm happy to give it a default value of 128.

> + bool AtEnd;
> +
> + SparseBitmap<ElementSize> &Bitmap;
> +
> + // Current element inside of bitmap
> + ElementListConstIter Iter;
> +
> + // Current bit number inside of our bitmap
> + unsigned BitNumber;
> +
> + // Current word number inside of our element
> + unsigned WordNumber;
> +
> + // Current bits from the element.
> + typename SparseBitmapElement<ElementSize>::BitWord Bits;

Can these SparseBitmapIterator members, and a few more that follow,
be made private?

They were intended to be, sorry.

> + SparseBitmapIterator(SparseBitmap<ElementSize> &RHS,
> + bool end = false):Bitmap(RHS) {

This appears to miss an explicit keyword.

Fixed.

Interesting. Out of curiosity, do you happen to know roughly the sizes
and sparsities where using 256-bit elements was faster?

Looking at the implementation a little more, I see ElementList is a list
of pointers to elements:

+ typedef std::list<SparseBitmapElement<ElementSize> *> ElementList;

The extra pointer indirection is unfortunate; It'd be nice to have just
a list of elements. I guess std::list's push_back always makes an
element copy, which might not be ideal, though that could be weighed
against the extra operator new call that the current code does. Another
option would be to use <llvm/ADT/ilist>, which looks like it can do
insert/push_back without making copies. Either of these approaches would
also fix what looks like a leak in operator&=, where elements are erased
from the list without being deleted.

Dan

> > > + template <int ElementSize>
> > > + class SparseBitmap {
> >
> > Do you expect clients will often want custom ElementSize values? Otherwise,
> > it seems like this template parameter should be given a default value, or
> > even just removed from the API.
>
> So, actually, at least in GCC, we discovered that we could speed up
> some dataflow problems a lot with varying the element size to 256, and
> on the other size, interference graphs could be reduced in space and
> time a lot by varying the element size down to 64
> I'm happy to give it a default value of 128.

Interesting. Out of curiosity, do you happen to know roughly the sizes
and sparsities where using 256-bit elements was faster?

They were pretty sparse, but did enough random bit testing that the
grouping of bits helped.
At some point later, I introduced a sparse bitmap that is more memory
intensive but much faster at random bit testing, and it was a win here
as well, but not worth the memory cost :slight_smile:

Looking at the implementation a little more, I see ElementList is a list
of pointers to elements:

+ typedef std::list<SparseBitmapElement<ElementSize> *> ElementList;

The extra pointer indirection is unfortunate; It'd be nice to have just
a list of elements. I guess std::list's push_back always makes an
element copy, which might not be ideal, though that could be weighed
against the extra operator new call that the current code does. Another
option would be to use <llvm/ADT/ilist>, which looks like it can do
insert/push_back without making copies. Either of these approaches would
also fix what looks like a leak in operator&=, where elements are erased
from the list without being deleted.

I seriously considered not using indirection, but it made a lot of the
functions more annoying, and it didn't seem worth the benefits, given
that if the compiler goes and makes element copies behind your back,
it becomes really really really bad :slight_smile:
In GCC, you can use a custom allocator with the bitmaps that will
reuse freed elements and keep a free list so you could blow them all
away fast (by linking them directly into the free list). This is
helpful for per-iteration pools of bitmaps, which is useful to
points-to analysis.
I was too lazy to make an operator new that emulated this functionality.

Also, someone on IRC asked about the weird function naming
conventions. I know most of llvm uses mixedCase, but i copied the
style of BitVector, which has a weird mix. I'm happy to change it :slight_smile:

In any case, updated patch attached.

sparsebitmap.diff (18.4 KB)

Yes please. By default, it is a good idea to follow llvm coding conventions instead of copying violation of coding style from existing code. Otherwise it pollutes code and, in future, gives someone one additional reason to not follow preferred coding style :slight_smile: While you’re here please fix BitVector also.

Thanks!

[snip]

Don't forget to update ProgrammersManual.html "Picking the Right Data Structure for a Task" section. :slight_smile:

Also, someone on IRC asked about the weird function naming

conventions. I know most of llvm uses mixedCase, but i copied the

style of BitVector, which has a weird mix. I'm happy to change it :slight_smile:
Yes please.

Fixed.

By default, it is a good idea to follow llvm coding conventions
instead of copying violation of coding style from existing code. Otherwise
it pollutes code and, in future, gives someone one additional reason to not
follow preferred coding style :slight_smile: While you're here please fix BitVector
also.

I'll do that in a followup patch, since BitVector has a bunch of users.

[snip]

Don't forget to update ProgrammersManual.html "Picking the Right
Data Structure for a Task" section. :slight_smile:

It doesn't talk about bitvector at all. I'm not sure whether i should
add it to set like containers, or add a section about bitsets that
talks about bitvector and sparsebitvector.

In either case, I think this should be a followup patch, but if you
guys really want it done as part of this one, I will.

One check-in is ideal in svn world but two check-ins are better then no documentation in Programmer's Manual :slight_smile:

Doing this as a followup patch makes sense, thanks Dan.

-Chris

I agree that avoiding the indirection would be good, but, in practice, getting the data structure in and refining it later also is fine.

Please valgrind it when it has a client to make sure it isn't leaking memory.

Some minor comments:

+/// SparseBitVector is an implementation of a bitvector that is sparse by only
+/// storing the elements that have non-zero bits set. In order to make this
+/// fast for the most common cases, SparseBitVector is implemented as a linked list

Please wrap lines to fit in 80 columns.

+ template <int ElementSize = 128>
+ struct SparseBitVectorElement {
+ public:

I'd suggest outdenting this. Indenting due to the namespace doesn't add any value. Also, I'd suggest making ElementSize 'unsigned' instead of int.

+ unsigned ElementIndex; // Index of Element in terms of where first bit
+ // starts

Please use:
   /// ElementIndex - Index of Element in terms of where first bit starts.
   unsigned ElementIndex;

Please end sentences in comments with periods :slight_smile:

+ // This implementation is used for strict weak ordering guarantees. The
+ // only thing it does is compare the ElementIndex's
+ bool operator<(const SparseBitVectorElement &RHS) const {
+ return ElementIndex < RHS.ElementIndex;
+ }
+ bool operator<(unsigned Idx) const {
+ return ElementIndex < Idx;
+ }

Are you sure that this is ok? For two Element's with equal indexes but different contents, this will return "equal" based on their element count. Generally !(x < y) && !(y < x) implies x == y.

+ typedef SparseBitVectorIterator iterator;

It looks like the iterator doesn't allow modification of the iterator, as such you could also typedef "const_iterator".

+ while (ElementIter != RHS.Elements.end())
+ Elements.push_back(new SparseBitVectorElement<ElementSize>(*(*ElementIter)));

line too long.

+ if (BecameZero) {
+ ElementListIter IterTmp = Iter1;
+ delete *IterTmp;

A few tabs are in here before delete, please convert to spaces.

+ iterator begin() {
+ return iterator(*this);
+ }

>> insert/push_back without making copies. Either of these approaches would
>> also fix what looks like a leak in operator&=, where elements are erased
>> from the list without being deleted.
>
> I seriously considered not using indirection, but it made a lot of the
> functions more annoying, and it didn't seem worth the benefits, given
> that if the compiler goes and makes element copies behind your back,
> it becomes really really really bad :slight_smile:

I agree that avoiding the indirection would be good, but, in practice,
getting the data structure in and refining it later also is fine.

Please valgrind it when it has a client to make sure it isn't leaking
memory.

Some minor comments:

+/// SparseBitVector is an implementation of a bitvector that is sparse by only
+/// storing the elements that have non-zero bits set. In order to make this
+/// fast for the most common cases, SparseBitVector is implemented as a linked list

Please wrap lines to fit in 80 columns.

Done

+ template <int ElementSize = 128>
+ struct SparseBitVectorElement {
+ public:

I'd suggest outdenting this. Indenting due to the namespace doesn't add
any value. Also, I'd suggest making ElementSize 'unsigned' instead of
int.

Done

+ unsigned ElementIndex; // Index of Element in terms of where first bit
+ // starts

Please use:
   /// ElementIndex - Index of Element in terms of where first bit starts.
   unsigned ElementIndex;

Please end sentences in comments with periods :slight_smile:

Fixed

+ // This implementation is used for strict weak ordering guarantees. The
+ // only thing it does is compare the ElementIndex's
+ bool operator<(const SparseBitVectorElement &RHS) const {
+ return ElementIndex < RHS.ElementIndex;
+ }
+ bool operator<(unsigned Idx) const {
+ return ElementIndex < Idx;
+ }

Are you sure that this is ok? For two Element's with equal indexes but
different contents, this will return "equal" based on their element count.
Generally !(x < y) && !(y < x) implies x == y.

I removed both operator's since they weren't being used anymore. It
was when i was doing work on testing whether we could remove the
indirection/other containers.

+ typedef SparseBitVectorIterator iterator;

It looks like the iterator doesn't allow modification of the iterator, as
such you could also typedef "const_iterator".

Done

+ while (ElementIter != RHS.Elements.end())
+ Elements.push_back(new SparseBitVectorElement<ElementSize>(*(*ElementIter)));

line too long.

Split.

+ if (BecameZero) {
+ ElementListIter IterTmp = Iter1;
+ delete *IterTmp;

A few tabs are in here before delete, please convert to spaces.

M-x whitespace-cleanup done :slight_smile:

+ iterator begin() {
+ return iterator(*this);
+ }
+
+ iterator end() {
+ return iterator(*this, ~0);
+ }

Can these both be const?

Done

Attached updated patch.

sparsebitmap.diff (17.4 KB)

Looks great, please commit!

-Chris