Apache Arrow Compute Intermediate Representation - RFC


I’m a developer on the Apache Arrow project: we provide a
language-independent columnar memory format.
Recently we’ve started discussing and working on a
consistent format for representing relational algebra operations
against that data [1].

My own intuition is that we could learn a great deal from the MLIR
project since the space targeted by this project has comparable
complexity and heterogeneity, and specifically that we should
probably emulate the less constrained, lowerable/optimizable dialect
structure of MLIR. Most of the other voices in the conversation feel
this is counter to the problem statement and would prefer to declare
representation of lowered operations as out-of-scope.

I am far from an expert on either MLIR or relational algebra, so I’m
writing in hopes that someone here is interested enough to join the
conversation. Regardless of the analogy I’m interested in promoting,
I’m sure you could provide useful notes on our design.

The summary of my proposal (which is out of sync with the rest of the code in
that PR) is here:

This looks really cool. This is exactly the sort of thing that MLIR is great at: providing infra and helping to build domain-specific compilers like this. I don’t personally know much at all about Arrow (or large-scale data storage in general), but I think it is likely that there is something here.

I definitely don’t know much about either of these, but I would be interested to see how it evolves! Thanks for promoting this :)