Hi,
Sorry for the last mail with the wrong subject.
I am an LLVM newbie; I have read some LLVM tutorials and played with some demos.
I am developing a Hadoop native runtime with C++ APIs and libraries. What I want to do is compile Hive's logical query plan directly to LLVM IR, or translate Hive's physical query plan to LLVM IR, and then run it on the Hadoop native runtime. As far as I know, Google's Tenzing does something similar, and a few research papers mention this technique, but they don't give details.
Is translating the physical query plan directly to LLVM IR reasonable, or would it be better to use some part of the Clang libraries?
I need some advice on how to proceed, e.g. where I can find similar projects or examples, or which part of the code to start reading.
Thanks,
Binglin Chang
Hi Chang,
> I am developing a Hadoop native runtime with C++ APIs and libraries. What
> I want to do is compile Hive's logical query plan directly to LLVM IR, or
> translate Hive's physical query plan to LLVM IR, and then run it on the
> Hadoop native runtime. As far as I know, Google's Tenzing does something
> similar, and a few research papers mention this technique, but they don't
> give details.
>
> Is translating the physical query plan directly to LLVM IR reasonable, or
> would it be better to use some part of the Clang libraries?
>
> I need some advice on how to proceed, e.g. where I can find similar
> projects or examples, or which part of the code to start reading.
I don't know what that query language looks like. If the query language
is turned into some kind of intermediate representation during execution
(the way a compiler works), then you might want to find out which
representation is easier to transform into LLVM IR. Clang is for C-like
languages; I am not sure whether Clang's libraries can help you.
HTH,
chenwj
Suppose there is a table "invites" with columns

    foo int
    bar string

A Hive SQL query

    SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

will be compiled to the physical query plan below. Each operator is actually
a Java class, and the operators are chained together, so the whole plan can
be executed in an "interpreted" way (a rough sketch of this chaining follows
the plan).
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a
          TableScan
            alias: a
            Filter Operator
              predicate:
                  expr: (foo > 0)
                  type: boolean
              Select Operator
                expressions:
                      expr: bar
                      type: string
                outputColumnNames: bar
                Group By Operator
                  aggregations:
                        expr: count()
                  bucketGroup: false
                  keys:
                        expr: bar
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                    sort order: +
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col1
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
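
To make the current execution model concrete, here is a rough C++ sketch of
this kind of Volcano-style operator chaining (Hive's real operators are Java
classes; all the names below are made up just for illustration):

#include <cstdint>
#include <iostream>
#include <memory>
#include <string>

// Hypothetical row type; Hive actually passes generic Object values
// described by ObjectInspectors, but the shape of the chain is the same.
struct Row {
  int32_t foo;
  std::string bar;
};

// Each operator handles one row, then forwards it to its child, the way
// Hive's Operator.process()/forward() chain works.
struct Operator {
  std::unique_ptr<Operator> child;
  virtual ~Operator() = default;
  virtual void process(const Row &row) = 0;
protected:
  void forward(const Row &row) {
    if (child) child->process(row);
  }
};

// Filter Operator from the plan: predicate (foo > 0), re-evaluated
// through virtual dispatch for every single row.
struct FilterOp : Operator {
  void process(const Row &row) override {
    if (row.foo > 0) forward(row);
  }
};

// Stand-in for the rest of the chain (Select/GroupBy/ReduceSink).
struct PrintOp : Operator {
  void process(const Row &row) override {
    std::cout << row.bar << "\n";
  }
};

int main() {
  FilterOp root;
  root.child = std::make_unique<PrintOp>();
  root.process({1, "hello"});   // passes the filter
  root.process({-1, "world"});  // dropped by the filter
  return 0;
}

Every row pays for the virtual calls and the generic expression evaluation;
inlining the whole chain into one generated function is exactly what would
remove that overhead.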
What I am thinking is to translate this physical query plan to LLVM IR.
The IR should inline all the operators, because they are all static, and
the input data record type is known and static too. The LLVM IR should
then be compiled to native code as functions (one for the mapper and one
for the reducer, perhaps); finally I can integrate them with the native
MapReduce runtime and run them on Hadoop.
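
To make this concrete, here is a minimal sketch of what the code generator
could do, using LLVM's C++ IRBuilder API to emit just the Filter Operator's
predicate (foo > 0) as a standalone function; the function name and overall
structure here are my own assumptions, not anything from Hive:

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include "llvm/IR/Verifier.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

int main() {
  LLVMContext ctx;
  Module mod("hive_stage1", ctx);
  IRBuilder<> b(ctx);

  // Emit: define i1 @filter_foo(i32 %foo), returning (%foo > 0).
  // This corresponds to the plan's "predicate: expr: (foo > 0)".
  FunctionType *fty =
      FunctionType::get(b.getInt1Ty(), {b.getInt32Ty()}, /*isVarArg=*/false);
  Function *f =
      Function::Create(fty, Function::ExternalLinkage, "filter_foo", &mod);
  b.SetInsertPoint(BasicBlock::Create(ctx, "entry", f));

  Value *foo = &*f->arg_begin();
  Value *pred = b.CreateICmpSGT(foo, b.getInt32(0), "pred");
  b.CreateRet(pred);

  verifyFunction(*f);
  mod.print(outs(), nullptr);  // dump the textual IR
  return 0;
}

A real translator would walk the whole operator tree and fuse everything in
a stage into one such function, so LLVM's optimizer can specialize the whole
pipeline before it is handed to the mapper/reducer driver.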
The input data types are probably described by some sort of schema, or
are just a memory buffer with a layout like a C struct.
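
For example (purely hypothetical, just to illustrate the kind of layout I
mean for the "invites" schema above):

#include <cstdint>

// Hypothetical layout for one "invites" row (foo int, bar string).
// Variable-length "bar" is carried as pointer + length so the struct
// itself stays fixed-size; the real layout would come from the schema.
struct InvitesRow {
  int32_t     foo;      // column "foo"
  uint32_t    bar_len;  // byte length of column "bar"
  const char *bar;      // column "bar" bytes, not NUL-terminated
};

Since such a layout is fixed when the plan is compiled, the generated IR
could use hard-coded field offsets instead of going through a generic SerDe.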
I don't know if I have described this clearly; here are some papers that
mention this technique:

[Google Tenzing] http://research.google.com/pubs/pub37200.html
[Efficiently Compiling Efficient Query Plans for Modern Hardware] www.vldb.org/pvldb/vol4/p539-neumann.pdf
Thanks,
Binglin Chang