[RFC] zip_vector: in-memory block compression of integer arrays

Hi Folks,

I want to share what I have been working on, as I feel it may be of interest to LLVM compiler developers, specifically concerning alias analysis and the optimization of iteration over sparse block-based multi-arrays. I also have questions about optimization of this implementation, specifically the observability of alias-analysis pessimizations and memory-to-register optimizations.

I have been working on zip_vector, a compressed variable-length array that uses vectorized block codecs to compress and decompress integers as dense variable-bit-width deltas, with special cases for constant values and sequences. The integer block codecs are optimized for vector instruction sets and use the Google Highway C++ library for portable SIMD/vector intrinsics.

The high-level class supports 32-bit and 64-bit compressed integer arrays (a usage sketch follows the list):

  • zip_vector<i32>
    • { 8, 16, 24 } bit signed and unsigned fixed-width values.
    • { 8, 16, 24 } bit signed deltas with per-block initial value.
    • constants and sequences using per-block initial value and delta.
  • zip_vector<i64>
    • { 8, 16, 24, 32, 48 } bit signed and unsigned fixed-width values.
    • { 8, 16, 24, 32, 48 } bit signed deltas with per-block initial value.
    • constants and sequences using per-block initial value and delta.
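
As a usage sketch (resize is an assumed std::vector-like method; operator[] and range-for iteration appear in the examples later in this post):

#include "zip_vector.h"

int main() {
    zip_vector<i64> vec;
    vec.resize(1024);             // assumed std::vector-like method
    for (size_t i = 0; i < 1024; i++)
        vec[i] = i * 7;           // constant stride can encode as a sequence
    i64 sum = 0;
    for (auto x : vec) sum += x;  // blocks unpack per page during traversal
    return 0;
}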

Here is a link to the implementation:

The README has background on the delta-encoding scheme. If you read the source, “zvec_codecs.h” contains the low-level vectorized block codecs, while “zvec_block.h” contains a high-level interface to the block codecs using cpuid-based dynamic dispatch. The high-level sparse integer vector class that leverages the block codecs is in “zip_vector.h”. The code has been tested with GCC and LLVM on x86-64 using SSE3, AVX, and AVX-512.

The principle of operation is to employ simplified block codecs dedicated solely to compressing fixed-width integers, which makes them extremely fast compared with typical compression algorithms: on the order of 30-150 GiB/sec on a single core when running within the L1 cache on Skylake AVX-512. zip_vector achieves its performance by reducing global memory bandwidth: it fetches and stores compressed data to and from RAM, then uses these extremely fast vector codecs to pack and unpack compressed blocks within the L1 cache. In that sense it is similar to texture-compression codecs, but the use case is closer to storage for index arrays, because the block codecs are lossless integer codecs. The performance is striking: in-order read-only traversal can be faster than for a regular array, even though the primary goal is footprint reduction.

The design use case is an offsets array that might contain 64-bit values but usually contains smaller values. We wanted the convenience of simply writing zip_vector<i64> or zip_vector<i32> while benefiting from the space savings of storing data as 8, 16, 24, 32, and 48-bit deltas.
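
As a sketch of the selection logic (not the actual codec; it scans one block's deltas and picks the narrowest signed width from the set above, using the post's i64 shorthand):

// pick the narrowest signed delta width that covers one block's deltas;
// all-zero deltas encode as a constant, all-equal deltas as a sequence
static int delta_bits(const i64 *v, size_t n) {
    i64 lo = 0, hi = 0;
    for (size_t i = 1; i < n; i++) {
        i64 d = v[i] - v[i - 1];
        if (d < lo) lo = d;
        if (d > hi) hi = d;
    }
    for (int bits : {8, 16, 24, 32, 48}) {
        i64 max = (i64(1) << (bits - 1)) - 1;
        if (lo >= -max - 1 && hi <= max) return bits;
    }
    return 64; // incompressible: fall back to fixed-width storage
}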

Q. Why is it specifically of interest to LLVM developers?

I think the best way to answer this is with another question: how can we model a block-based iterator for a sparse array so that it is amenable to vectorization?

There are aspects of the zip_vector iterator design that are not finished in the current implementation. Iteration has two phases. There is an inter-block phase at the boundary of each block (the logic inside switch_page) that scans and compresses the previously active block, updates the page index, and decompresses the next block. Then there is a broad phase for intra-block accesses, which is amenable to vectorization due to the fixed-size blocks.
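
In sketch form, the two phases look roughly like this (active_base is a hypothetical accessor; only switch_page is named in the real code):

zip_vector<i64> *vec;
i64 *cur, *end;   // intra-block cursor and limit, ideally in registers
size_t page;

i64 next() {
    if (cur == end) {               // inter-block phase (cold, per block)
        vec->switch_page(++page);   // compress old block, decompress next
        cur = vec->active_base();   // hypothetical: slab base + area offset
        end = cur + zip_vector<i64>::page_interval;
    }
    return *cur++;                  // intra-block broad phase (hot)
}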

Making 1D iteration as fast as 2D iteration

Firstly, there is a lot of optimization analysis for the iterator that I would like to discuss. One issue is hoisting the inter-block boundary test out of the fast path: at each block crossing, the next block's end address is calculated in advance, so the broad phase requires only a pointer increment and a comparison against addresses held in registers.

The challenge is getting past compiler alias analysis. Alias analysis seems to prevent the sum of the slab base address and the active-area offset from being cached in a register; instead, accesses are demoted to memory. These member variables hold the location of the slab and the offset of the uncompressed page, and both are on the critical path. When these values sit in memory, every access pays 4 or more cycles of latency for the base-address calculation. There is also the possibility of hoisting and folding the active-page check, as we can make constructive proofs about when that value changes.
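
In other words, the codegen I am after is equivalent to this hand-hoisted form (a sketch):

// desired: base and limit live in registers across the whole block
i64 *base = slab_data + active_area;  // computed once per block crossing
i64 *end  = base + page_interval;
for (i64 *p = base; p != end; p++)    // broad phase: increment + compare
    sum += *p;
// actual: each access re-reads slab_data and active_area from memory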

The benchmarks compare the performance of 1D- and 2D-style iterators. At times the compiler would hoist the base and offset pointers from member-variable accesses into registers in the 1D version, making a noticeable difference in performance. From the perspective of single-threaded code, the only place the pointer to the active region can change is inside switch_page(size_t y).

The potential payoff is huge: combining the reduction in global memory bandwidth with LLVM auto-vectorization, one may be able to access data at roughly 0.9x-3.5x the speed of simply accessing integers in RAM. The challenge is generating safe code for the simpler 1D iteration case that is as efficient as the explicit 2D iteration.

1D iteration:

for (auto x : vec) x2 += x;

2D iteration:

for (size_t i = 0; i < n; i += decltype(vec)::page_interval) {
    i64 *cur = &vec[i], *end = cur + decltype(vec)::page_interval;
    while (cur != end) x2 += *cur++;
}

Note: in this example I avoid handling a loop tail of a different size, but that is also a consideration (see the sketch below).
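
Handling the tail might look like this (a sketch):

size_t i = 0;
for (; i + decltype(vec)::page_interval <= n; i += decltype(vec)::page_interval) {
    i64 *cur = &vec[i], *end = cur + decltype(vec)::page_interval;
    while (cur != end) x2 += *cur++;
}
for (; i < n; i++) x2 += vec[i];   // remainder shorter than one page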

I trialled several techniques using a simplified version of the zip_vector class, where switch_page was substituted with simple logic. With that version it was possible to get the compiler to coalesce the slab base pointer and active-area offset into a single calculation performed on page crossings, and to hoist the active_page check (the y parameter) so that it only occurs on block crossings. I found that when the switch_page implementation became more complex, probably from adding an extern call to malloc, the compiler would fall back to conservatively fetching the base pointer and offset through a pointer to member variables. See here:
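
The simplified experiment had roughly this shape (a sketch with an arbitrary 512-entry page, not the linked code):

struct V {
    i64 *slab_data;      // slab base address
    size_t active_area;  // offset of the uncompressed page in the slab
    size_t active_page;

    void switch_page(size_t y);  // opaque to the optimizer once it calls malloc

    i64 sum(size_t n) {
        i64 s = 0;
        for (size_t i = 0; i < n; i++) {
            size_t y = i >> 9, x = i & 511;         // page number, offset
            if (y != active_page) switch_page(y);   // inter-block phase
            // with switch_page opaque, slab_data and active_area are
            // reloaded from memory here rather than staying in registers
            s += *(slab_data + active_area + x);
        }
        return s;
    }
};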

So I got to the point where I thought it would help to get input from compiler developers on how to observe which internal constraints are violated by “switch_page”, preventing the base pointer and offset address calculation from being cached in registers. “slab_data” and “active_area” are neither volatile nor atomic, so threads should not expect their updates to be atomic or to go through memory.

I tried a large number of small experiments, e.g. collapsing “slab_data” and “active_area” into one pointer at the end of “switch_page” so that there is only one pointer to access (sketch below). Also, the “active_page” test does not necessarily need to be in the broad phase. I attempted to manually hoist these variables by modifying the iterators, but found it was necessary to keep them where they were to avoid introducing stateful invariants in the iterators that could be invalidated by read accesses.
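
The collapsed experiment looked roughly like this (a sketch):

void switch_page(size_t y) {
    // ... compress the old page, decompress page y ...
    active_base = slab_data + active_area;  // fold both members into one
}
// hot path becomes: s += active_base[x];
// one load instead of two, but still reloaded per access while the
// compiler must assume the call may have modified active_base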

Stack/register-based coroutines could help due to the two distinct states in the iterator.

I want to stress that it is not as simple as it might seem on the surface. I tried several approaches to coalescing address calculations and moving them into the page-switch logic, all leading to a performance fall-off, almost as if the compiler was carrying some pessimization that forced any touched member variable to be accessed via memory instead of a register. At one stage the 1D form was fast with GCC, but after adding support for zip_vector<i32> and zip_vector<i64>, performance fell off. So I would like to observe exactly which code pessimizes accesses to member variables, preventing them from being held in registers and causing accesses to go to memory instead. It seems it should be possible to make 1D iteration faster than std::vector, as I did witness this with GCC, but the optimization does not seem to be stable.

So that’s what I would like help with…

Regarding the license, zip_vector and its block codecs are released under “PLEASE LICENSE”, a permissive ISC-derived license with a notice about implied copyright. The license removes the ISC requirement that all copies include the copyright message, so while it is still copyrighted material, i.e. it is not public domain, it is compatible with the Apache Software License.

Please have a look at the benchmarks. The 2D iteration numbers with LLVM are insanely fast!

Regards,

Michael J. Clark
Twitter: @larkmjc

If you want a good response to your question, shorter is better, both in terms of prose and a testcase. The background of your overall algorithm is interesting, but mostly not relevant to your question.

As a general matter, proving two accesses don’t alias when you have multiple accesses with the same base pointer and variable indexes is hard, and LLVM currently isn’t very good at it. If you can explicitly hoist invariant accesses, I’d suggest doing that.

Fair comment. The message is about a 10-minute read. The first 4 minutes are a brief introduction to the compressed-array component; the remaining 6 are a painful and frustrating loop of trying to coax the compilers into optimizing 3 variables into registers, representing about 1 week of work out of the ~3 months it took to create the components (low-level codecs, high-level non-linear multi-array, internal slab allocator, iterators, test cases, benchmarks, write-up, etc.).

I tried to sum up what would be useful in the first paragraph: the observability of alias-analysis pessimization and memory-to-register optimizations.

I don’t know how to do the equivalent of GCC’s -fdump-rtl-all in LLVM and could not find this information.
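
For concreteness, the nearest things I am aware of are below, though as far as I can tell none of them explain why a particular mem2reg or alias-analysis decision went the way it did:

clang++ -O2 -mllvm -print-after-all a.cc       # dump LLVM IR after every pass
clang++ -O2 -Rpass-missed='.*' a.cc            # remarks for missed optimizations
clang++ -O2 -fsave-optimization-record a.cc    # YAML remarks for opt-viewer
opt -passes=mem2reg -S a.ll                    # run a single pass in isolation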

The problem is just this one function.

I know the entire implementation will optimize into registers if I substitute switch_page with an empty function, but bisecting the implementation is a laborious process when someone could just say: “hey, you are calling realloc; that tells us to drop the world, all memory might have changed, and all member variables need to be reloaded from memory” (a hypothetical statement, because I don’t know the root cause). With the switch_page implementation in place it was too hard to inspect with godbolt due to the dependencies, which meant a loop of trial compile, benchmark, and searching for the spot in objdump -d output.

I am satisfied with the compiler performance for my use case with the somewhat janky 2D iteration style, so I see it as finished from my perspective, and I am moving on to use the component in an app. I just thought it would be useful to summarise my work in case someone else is interested.

Stripped-down test case versus observability, so the user is empowered

A stripped-down test case is a lot of work. There is some push-pull here on compiler observability. Is there a resource on pass observability? Something like: mem2reg is going through candidates and lifting them to registers, and some invariant is causing these candidates to fail, but I can’t see which one. I have seen this kind of information from GCC but don’t have the same level of experience with LLVM.

I am of course grateful for any help and may circle back to this at some point in the future, as I have exhausted my budget of fruitless loop cycles trying to optimize code with limited observability into why it won’t optimize.

Oh no, coroutines are heap-based

I started looking into C++ coroutines to split the control flow for the block-crossing case versus the intra-block case, but to my dismay they are specified as heap-based and might only optimize if all conditions are just right: “the coroutine state, which is an internal, heap-allocated (unless the allocation is optimized out)” object. If I knew they were efficient and would optimize to registers then I might try them, but the specification wording is off-putting: observability and control again. A coroutine is potentially ideal if the frame variables can be lifted to registers and the state changes become direct branches, since the non-local yield/return could act like an inner and outer loop, hopefully without requiring any indirect branches, if the yield label addresses are, say, in a jump table.
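
For reference, the minimal generator shape I had in mind (a C++20 sketch; whether the frame allocation is elided into the caller, so-called HALO, is exactly the “unless the allocation is optimized out” caveat):

#include <coroutine>
#include <exception>
#include <cstddef>
#include <cstdint>

using i64 = std::int64_t;  // as in the post

struct gen {
    struct promise_type {
        i64 current;
        gen get_return_object() {
            return { std::coroutine_handle<promise_type>::from_promise(*this) };
        }
        std::suspend_always initial_suspend() noexcept { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        std::suspend_always yield_value(i64 v) noexcept { current = v; return {}; }
        void return_void() noexcept {}
        void unhandled_exception() { std::terminate(); }
    };
    std::coroutine_handle<promise_type> h;
    ~gen() { if (h) h.destroy(); }
};

// outer loop = block crossings, inner loop = intra-block accesses
gen page_walk() {
    for (size_t page = 0; ; page++) {
        // inter-block phase would run here (switch_page)
        for (size_t x = 0; x < 512; x++)
            co_yield (i64)(page * 512 + x);  // intra-block phase
    }
}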

Looking into how coroutines lower will take a lot of time…