RFC: general machine-parsable backend for TableGen (e.g. JSON)

Hello llvm-dev,

Would there be any interest in adding a back end to TableGen to produce output in a general-purpose but machine-parsable format?

At the moment, TableGen has two kinds of output option. The -gen-foo options are each tailored to a particular use case. The -print-records option is fully general, but it's difficult to machine-parse: its output is in the same syntax as the TableGen input language, so any tool that wants to analyse it and pick out some particular class of fact has to start by doing half of TableGen's work all over again.

I've often thought it would be useful to have an output mode which produces all the same information as -print-records, but in a format that's easily parsed by existing standard library facilities in typical scripting languages such as Python. (My opening bid would be JSON.) This would make it convenient to take a large TableGen input such as an entire target description, and run automated processing over it.

Here are a few examples of things I've wanted to do in the past, and would rather have done by this method instead of resorting to fragile regex-based matching on the -print-records output:

* Iterate over all instances of the Instruction class, and output the fixed bits of each one's bits<> vector. (Useful to collect a set of starting points for disassembly testing.)

* List all subtarget features on which at least one instance of the Instruction class is conditional. (Useful to collect a set of modes to run testing in.)

* Extract the number, names and types of the oops and iops for each Instruction, in a form that's easy to use to annotate post-isel LLVM diagnostic output. (Useful if you can never remember which way round all the operands go!)
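To give a feel for the first query in the list above, here is a minimal Python sketch. The JSON layout is purely an assumption for illustration, since the output representation is not documented yet: the "!instanceof" index, the "Inst" field name, the use of null for bits that are not fixed, and the least-significant-bit-first ordering are all hypothetical, as are the two sample records.

```python
import json

# Hypothetical dump. The key names ("!instanceof", "Inst"), null for
# non-fixed bits, and LSB-first bit order are assumptions for illustration.
SAMPLE = """
{
  "!instanceof": {"Instruction": ["ADDrr", "MOVri"]},
  "ADDrr": {"Inst": [0, 0, 1, null, null, 1, 0, 1]},
  "MOVri": {"Inst": [1, 1, null, null, null, null, 0, 0]}
}
"""

def fixed_bits(bits):
    """Render a bit list MSB first, writing '?' for bits that are not fixed."""
    return "".join("?" if b is None else str(b) for b in reversed(bits))

def instruction_patterns(records):
    """Map each instance of the Instruction class to its fixed-bit pattern."""
    names = records["!instanceof"]["Instruction"]
    return {name: fixed_bits(records[name]["Inst"]) for name in names}

records = json.loads(SAMPLE)
patterns = instruction_patterns(records)
for name, pattern in patterns.items():
    print(name, pattern)
```

With a real dump, the only change would be to load the records from the tool's output instead of the inline sample, and to adjust the key names to whatever the back end ends up emitting.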

I've written a proof-of-concept back end that outputs JSON, and produces complete enough data to let me implement any of the above examples in a few lines of Python, and (I hope) also the next few queries along these lines that I might happen to think of.

I chose JSON because I wanted it to be supported by the Python core standard library without needing to install any third-party modules. (XML would have been OK as well from that perspective, but JSON is considerably simpler: the Python JSON reader can be called in one line of code without having to set up lots of machinery like a custom parser subclass, and it delivers output in a data structure better suited to the kinds of query I list above.)
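As a rough illustration of that point, the sketch below parses a small hand-written sample with a single standard-library call and then answers the subtarget-feature query using nothing but ordinary dicts, lists and sets. The record layout is an assumption, not the proof-of-concept's actual output: the "!instanceof" index and a "Predicates" field holding a list of names are hypothetical.

```python
import json

# Hypothetical layout: "!instanceof" and a "Predicates" list of names
# are assumptions made for this example only.
SAMPLE = """
{
  "!instanceof": {"Instruction": ["VADD", "VMUL", "NOP"]},
  "VADD": {"Predicates": ["HasVector"]},
  "VMUL": {"Predicates": ["HasVector", "HasMul"]},
  "NOP": {"Predicates": []}
}
"""

# The one-line parse; json.load(f) does the same for an open file.
records = json.loads(SAMPLE)

# Collect every predicate that guards at least one Instruction instance.
features = sorted({
    predicate
    for name in records["!instanceof"]["Instruction"]
    for predicate in records[name]["Predicates"]
})
print(features)
```

The result of the parse is already a plain dictionary, so no parser subclass or event handlers are needed before the query can be written.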

Would there be any interest in me finishing this up (polishing the code, documenting the output data representation, etc) and sharing it?

(Of course, another class of use case that this would make easier is the use of TableGen for things that have nothing to do with LLVM, like that blog post a few years back from someone who was using it to manage a set of related SSH configuration files. But I have no idea whether that counts as a pro or a con :-)

Cheers,
Simon

This makes sense to me, it seems like general goodness and fits with the spirit of tblgen.

-Chris

Hi Simon,

that makes sense to me. Please add me on any reviews when you're done.

Cheers,
Nicolai

From: Nicolai Hähnle [mailto:nhaehnle@gmail.com]
Sent: 24 April 2018 08:29

> that makes sense to me. Please add me on any reviews when you're done.

Thanks! https://reviews.llvm.org/D46054 is a first draft, with a big list in the log message of all the things I know I haven't done yet.

Cheers,
Simon