[RFC] New command line parsing/generating framework for clang and lld.

LLVM Command Line Library

I'm proposing a heavyweight command line parsing and generating library for
LLVM to replace Clang's parser and provide one for lld and any future tools
that may need it.

The scope of this library is slightly larger than what Clang has now, but not
much.

It is centered around the concept of a Tool. A Tool has a set of Options which
can be parsed to Arguments or rendered from Arguments. It also has a set of
Transformations that convert Arguments from another Tool to Arguments for
itself.

An Argument is an Option with bound values.
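
To make the terminology concrete, here is a rough C++ sketch of how these
concepts might relate. It is not taken from the attached patch: the type names
follow the text above, but every field and the bindValues helper are
assumptions made purely for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical data model. Names follow the RFC text; fields are illustrative.
struct Tool;

struct Option {
  const Tool *Owner = nullptr; // The Tool this Option belongs to.
  std::string Name;            // Name without any prefix, e.g. "o".
};

// An Argument is an Option with bound values.
struct Argument {
  const Option *Opt = nullptr;
  std::vector<std::string> Values;
};

typedef std::vector<Argument> ArgumentList;

// A Tool owns the set of Options it understands.
struct Tool {
  std::string Name;
  std::vector<Option> Options;
};

// Binds values to an Option, producing an Argument (hypothetical helper).
inline Argument bindValues(const Option &O, std::vector<std::string> Values) {
  assert(O.Owner != nullptr && "an Option always belongs to a Tool");
  Argument A;
  A.Opt = &O;
  A.Values = std::move(Values);
  return A;
}
```

Parsing would produce an ArgumentList of such Arguments, and rendering or
transforming would consume one.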

Scope:

* Parse argv/argc into an ArgumentList according to a TableGen file describing
  the Options for a given Tool.

* Provide typo correction for Options.

* Provide a way to print help text.

* Render an ArgumentList to a string suitable for invoking another Tool.

* Transform an ArgumentList from one Tool to another.
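
For the typo correction item, one plausible approach is to suggest the known
option name with the smallest edit distance to what the user typed. The sketch
below assumes Levenshtein distance and a caller-chosen cutoff; the proposal
does not fix a metric, and editDistance and nearestOption are hypothetical
names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Classic dynamic-programming Levenshtein distance, keeping a single row.
inline std::size_t editDistance(const std::string &A, const std::string &B) {
  std::vector<std::size_t> Row(B.size() + 1);
  for (std::size_t J = 0; J <= B.size(); ++J)
    Row[J] = J;
  for (std::size_t I = 1; I <= A.size(); ++I) {
    std::size_t Diag = Row[0]; // Previous row, column J - 1.
    Row[0] = I;
    for (std::size_t J = 1; J <= B.size(); ++J) {
      std::size_t Up = Row[J]; // Previous row, column J.
      std::size_t Subst = Diag + (A[I - 1] == B[J - 1] ? 0 : 1);
      Row[J] = std::min(Subst, std::min(Up, Row[J - 1]) + 1);
      Diag = Up;
    }
  }
  return Row[B.size()];
}

// Returns the closest known name within MaxDistance, or "" if none qualifies.
// Typed and Known are option names without their prefixes.
inline std::string nearestOption(const std::string &Typed,
                                 const std::vector<std::string> &Known,
                                 std::size_t MaxDistance) {
  assert(!Typed.empty() && "expected the option name without its prefix");
  std::string Best;
  std::size_t BestDist = MaxDistance + 1;
  for (std::size_t I = 0; I < Known.size(); ++I) {
    std::size_t D = editDistance(Typed, Known[I]);
    if (D < BestDist) {
      BestDist = D;
      Best = Known[I];
    }
  }
  return Best;
}
```

With the option names from the Clang mockup, a mistyped "fstrict-enum" would
resolve to "fstrict-enums", while an unrelated string yields no suggestion.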

The major addition this has over Clang is Transformations. A Transformation is a
mapping from one pattern in an ArgumentList to another. These replace the
handwritten code in a driver that reads arguments and generates a command line
to call another tool.

An example for this from Clang would be going from Clang to Clang -cc1 options.
Quite a few of these are trivial forwards, while others are more complicated
and may depend on the values of other arguments.

Transformations not only make this simpler, they also allow other drivers to
more easily target Clang -cc1. A cl.exe style driver would get its own Tool,
Options, and Transformation set.

This also makes calling out to a single type of tool, such as the linker, with
various tools that implement it (gnu-ld, ld64, link.exe) easier. You simply
select which Tool to use for transformation, and render the resulting
ArgumentList to a string to pass to the program.

The TableGen Option definitions provide enough information to both parse and
render command lines. This allows us to have a single definition of as and ld
options and be able to reuse them in both Clang to call the tool, and in the
llvm implementation of the tool itself to parse the command line.

Here's a mockup of a TableGen file for part of Clang:

Option.td:
  class Tool {
    // The list of all possible prefixes. Not every option in the tool has all
    // prefixes. Any string that does not begin with one of these prefixes and
    // is not an argument to a previous option is considered an input Argument.
    // A string that does begin with a prefix but is not a known option is
    // eligible for typo-correction.
  }

  def joined;
  def separate;
  def or;
  def str;

  class Option<list<string> prefixes, string name, Tool tool, dag strparse,
               string render, dag rendermatch> {
    // The tool this Option belongs to.
    Tool Tool_ = tool;

    // How to parse the Option from argc+argv.
    dag StringParse = strparse;

    // How to render the Option to a string. RenderMatch is used to capture
    // values and assign them identifiers. When Render is printed, these values
    // are inserted into it in the marked locations.
    string Render = render;
    dag RenderMatch = rendermatch;

    // The meta-variable name of each value.
    list<string> ValueMetavars;

    // The list of valid prefixes for this Option. The parser will check if
    // Prefixes[i] + Name is a prefix of a potential Option for each prefix in
    // Prefixes.
    list<string> Prefixes = prefixes;

    // The name of this Option without any prefixes or postfixes. This is what
    // typo correction is checked against.
    string Name = name;

    // Is Name case insensitive.
    bit IsCaseInsensitive = 0;

    // Should this Option be hidden from the default help.
    bit IsHidden = 0;

    // Used as a tiebreaker when multiple Options share the same prefix. Higher
    // values are picked first.
    int Priority = 0;

    // The single Option that this Option is an alias of.
    Option Alias = ?;

    // The help text for this Option.
    string HelpText = ?;
  }

  class Alias<Option alias> {
    Option Alias = alias;
  }

  class MetaVars<list<string> mv> {
    list<string> ValueMetavars = mv;
  }

  class CaseInsensitive {
    bit IsCaseInsensitive = 1;
  }

Clang.td:
  include "Option.td"

  def clang : Tool;

  class ClangOption< list<string> prefixes
                   , string name
                   , dag strparse
                   , string render
                   , dag rendermatch>
    : Option<prefixes, name, clang, strparse, render, rendermatch>;

  class ClangFlag<string name>
    : ClangOption<["-"], name, ?, "-"#name, ?>;

  class ClangSingleLetterOption<string name>
    : ClangOption< ["-"], name, (or (joined (str ""), (str:$v0)),
                                    (separate (str:$v0)))
                 , "-"#name#"$v0", (str:$v0)> {
    int Priority = 1;
  }

  def clang_f_strict_enums : ClangFlag<"fstrict-enums">;
  def clang_f_no_strict_enums : ClangFlag<"fno-strict-enums">;
  def clang_f_fast_math : ClangFlag<"ffast-math">;
  def clang_o : ClangSingleLetterOption<"o">, MetaVars<["<file>"]>;

  // And now for a semi-strange one. -ftemplate-depth.
  def clang_f_template_depth
    : ClangOption< ["-"], "ftemplate-depth"
                 , (or (joined (str "="), (str:$v0)),
                       (joined (str "-"), (str:$v0)))
                 , "-ftemplate-depth=$v0", (str:$v0)>;
  // Note that we don't need to also have a clang_f_template_depth_EQ.

  // One with a limited set of values.
  class ClangSeparateValues<string name, list<string> values>
    : ClangOption< ["-"], name
                 , (joined (str "="), (str:$v0 values))
                 , "-"#name#"=$v0", (str:$v0)>;

  // This won't match unless the value is one of the ones in the list. We can
  // generate a very good error message with the information we have that
  // includes the list of valid values.
  def clang_f_fp_contract
    : ClangSeparateValues<"ffp-contract", ["fast", "on", "off"]>;

ClangCC1.td:
  include "Option.td"

  def clang_cc1 : Tool;

  class ClangCC1Option< list<string> prefixes
                      , string name
                      , dag strparse
                      , string render
                      , dag rendermatch>
    : Option<prefixes, name, clang_cc1, strparse, render, rendermatch>;

  class ClangCC1Flag<string name>
    : ClangCC1Option<["-"], name, ?, "-"#name, ?>;
  class ClangCC1Separate<string name>
    : ClangCC1Option< ["-"], name, (separate (str:$v0))
                    , "-"#name#" $v0", (str:$v0)>;
  class ClangCC1SeparateValues<string name, list<string> values>
    : ClangCC1Option< ["-"], name
                    , (joined (str "="), (str:$v0 values))
                    , "-"#name#"=$v0", (str:$v0)>;

  def clang_cc1_f_strict_enums : ClangCC1Flag<"fstrict-enums">;
  def clang_cc1_f_template_depth : ClangCC1Separate<"ftemplate-depth">;
  def clang_cc1_f_fp_contract
    : ClangCC1SeparateValues<"ffp-contract", ["fast", "on", "off"]>;
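
To illustrate how the Render string and its $v0 markers could be used, here is
a small C++ sketch that expands a render template with an Argument's bound
values. renderArgument is a hypothetical helper written for this example; the
code TableGen would actually generate may look quite different.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Expands "$vN" placeholders in a Render template with the bound values.
// Assumption: slot N in the template refers to Values[N].
inline std::string renderArgument(const std::string &Template,
                                  const std::vector<std::string> &Values) {
  std::string Out;
  for (std::size_t I = 0; I < Template.size();) {
    bool IsSlot = Template[I] == '$' && I + 2 < Template.size() &&
                  Template[I + 1] == 'v' &&
                  std::isdigit(static_cast<unsigned char>(Template[I + 2]));
    if (!IsSlot) {
      Out += Template[I++]; // Literal character, copied through unchanged.
      continue;
    }
    // Read the decimal slot number that follows "$v".
    std::size_t Slot = 0;
    I += 2;
    while (I < Template.size() &&
           std::isdigit(static_cast<unsigned char>(Template[I])))
      Slot = Slot * 10 + static_cast<std::size_t>(Template[I++] - '0');
    assert(Slot < Values.size() && "template names an unbound value");
    Out += Values[Slot];
  }
  return Out;
}
```

For example, the template "-ftemplate-depth=$v0" with the single value "128"
renders as "-ftemplate-depth=128", and a flag template without markers is
returned as is.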

You may wonder why the parsing info is a dag instead of just being essentially
an enum value as it is in Clang's current implementation. The main reason for
this is that there exist tools with option formats that do not fit nicely into
that model, and that in fact have many different ways of representing arguments.

These are actually very simple to convert to C++ code from TableGen. It is also
trivial to merge identical parsers before generating them, which means there's
no code size explosion. Here's an example of what this would generate.

  ArgParseResult parseJoinedOrSeparate(const ArgParseState APS) {
    return parseOr(parseJoined("", parseStr(0)),
                   parseSeparate(parseStr(0)))(APS);
  }

Each parse* function is a template function which creates a function object that
implements that parser with the given arguments. The integer argument to
parseStr tells it which Argument value slot to store the parsed string in. The 0
here corresponds to $v0 from above.
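
As a rough, self-contained sketch of how such combinators could be written,
the following uses std::function rather than templates to keep it short. The
layout of ArgParseState and ArgParseResult is an assumption made for this
example (the state is taken to start just after the option's prefix and name
have been matched); it is not the code from the attached patch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical parser state, positioned just after the matched option name.
struct ArgParseState {
  std::string Rest;                // Text left in the current argv string.
  std::vector<std::string> Next;   // The argv strings that follow it.
  std::vector<std::string> Values; // Value slots bound so far.
};

struct ArgParseResult {
  bool Matched;
  ArgParseState State;
};

typedef std::function<ArgParseResult(const ArgParseState &)> Parser;

// Stores all remaining text of the current string in value slot Slot.
inline Parser parseStr(std::size_t Slot) {
  return [Slot](const ArgParseState &In) -> ArgParseResult {
    ArgParseResult R = {false, In};
    if (In.Rest.empty())
      return R;
    if (R.State.Values.size() <= Slot)
      R.State.Values.resize(Slot + 1);
    R.State.Values[Slot] = In.Rest;
    R.State.Rest.clear();
    R.Matched = true;
    return R;
  };
}

// Requires the remaining text to start with Sep, then runs Inner on the rest.
inline Parser parseJoined(const std::string &Sep, Parser Inner) {
  return [Sep, Inner](const ArgParseState &In) -> ArgParseResult {
    if (In.Rest.compare(0, Sep.size(), Sep) != 0)
      return ArgParseResult{false, In};
    ArgParseState S = In;
    S.Rest.erase(0, Sep.size());
    return Inner(S);
  };
}

// Requires the current string to be exhausted, then runs Inner on the next one.
inline Parser parseSeparate(Parser Inner) {
  return [Inner](const ArgParseState &In) -> ArgParseResult {
    if (!In.Rest.empty() || In.Next.empty())
      return ArgParseResult{false, In};
    ArgParseState S = In;
    S.Rest = S.Next.front();
    S.Next.erase(S.Next.begin());
    return Inner(S);
  };
}

// Tries A first and falls back to B on the original state if A fails.
inline Parser parseOr(Parser A, Parser B) {
  assert(A && B && "both alternatives must be callable");
  return [A, B](const ArgParseState &In) -> ArgParseResult {
    ArgParseResult R = A(In);
    return R.Matched ? R : B(In);
  };
}

inline ArgParseResult parseJoinedOrSeparate(const ArgParseState &APS) {
  return parseOr(parseJoined("", parseStr(0)),
                 parseSeparate(parseStr(0)))(APS);
}
```

Under these assumptions, both "-ofoo" (Rest is "foo") and "-o foo" (Rest is
empty, Next holds "foo") bind "foo" to slot 0. The -ftemplate-depth dag would
map to parseOr(parseJoined("=", parseStr(0)), parseJoined("-", parseStr(0))).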

This is an idea of what transforms would look like:

  def not;

  class Transform<list<dag> match, list<dag> produce> {
    list<dag> M = match;
    list<dag> P = produce;
  }

  include "Clang.td"
  include "ClangCC1.td"

  def : Transform< [(clang_f_strict_enums), (not clang_f_no_strict_enums)]
                 , [(clang_cc1_f_strict_enums)]>;

  def : Transform< [(clang_f_template_depth (str:$v0))]
                 , [(clang_cc1_f_template_depth (str:$v0))]>;
  // Since this case is common, there would probably be a:
  def : Forward<clang_f_template_depth, clang_cc1_f_template_depth>;
  // This would simply copy the Argument values.

  def : Forward<clang_f_fp_contract, clang_cc1_f_fp_contract>;

  def : Transform< [(clang_f_fast_math), (not clang_f_fp_contract)]
                 , [(clang_cc1_f_fp_contract (str "fast"))]>;

For each Transform, each dag in M is matched against the ArgumentList in order.
Once a dag matches an Argument the process continues with the next Argument in
the list. Values are extracted using :$<name>. If all dags in M are satisfied,
the dag in P has its :$<name> values substituted, converted to an Argument, then
added to the output ArgumentList.
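
A simplified C++ sketch of this matching step follows. It deliberately ignores
the ordering of Arguments within the list, supports a single produced Argument
per Transform, and forwards all captured values at once; all of the types and
names here are assumptions for illustration rather than generated code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, string-keyed stand-ins for the real Option/Argument types.
struct Argument {
  std::string Option;              // e.g. "clang_f_fast_math".
  std::vector<std::string> Values; // Bound values; slot 0 is $v0.
};
typedef std::vector<Argument> ArgumentList;

struct MatchPattern {
  std::string Option;
  bool Negated; // True for (not <option>).
  bool Capture; // True for (<option> (str:$v0)): remember the values.
};

struct Transform {
  std::vector<MatchPattern> M;
  Argument P;         // The Argument to produce.
  bool ForwardValues; // If true, P's values come from the captured ones.
};

inline const Argument *findArgument(const ArgumentList &Args,
                                    const std::string &Option) {
  for (std::size_t I = 0; I < Args.size(); ++I)
    if (Args[I].Option == Option)
      return &Args[I];
  return nullptr;
}

// Appends the produced Argument to Out and returns true if every pattern in
// T.M is satisfied by In; otherwise leaves Out untouched and returns false.
inline bool applyTransform(const Transform &T, const ArgumentList &In,
                           ArgumentList &Out) {
  assert(!T.P.Option.empty() && "a Transform must produce something");
  std::vector<std::string> Captured;
  for (std::size_t I = 0; I < T.M.size(); ++I) {
    const Argument *A = findArgument(In, T.M[I].Option);
    // A negated pattern fails if the option is present; a plain one if absent.
    if (T.M[I].Negated ? A != nullptr : A == nullptr)
      return false;
    if (A && T.M[I].Capture)
      Captured = A->Values;
  }
  Argument Produced = T.P;
  if (T.ForwardValues)
    Produced.Values = Captured;
  Out.push_back(Produced);
  return true;
}
```

In this sketch the -ffast-math Transform above fires only when no
clang_f_fp_contract Argument is present, and a Forward is a Transform with a
single capturing pattern and ForwardValues set.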

Not all transforms can be represented in this manner, but you can still
hand-write the code for these cases.

Attached is a patch that adds tools/llvm-cltest. This currently contains code
that should be in a library and will not exist in the final version. This is a
proof of concept for what TableGen would actually generate. It does not contain
the actual TableGen implementation.

- Michael Spencer

OptionParsing.patch (42.7 KB)

ping.

- Michael Spencer

Are you proposing that clang switch over to this, or that this be used solely by lld?

-Chris

Both clang and lld.

- Michael Spencer

Ok, please chat with Daniel Dunbar and Chad Rosier about this. They'll definitely have opinions :)

-Chris

Hi Michael,

Sorry for the late response. I don't feel I'm the best person to comment
on this because of my aborted attempts and broken promises at improving
the driver, but you did add me on cc... :)

I like the general idea. But why is this idea so removed from the
current (rather heavyweight) Clang argument parsing library that it
needs a full rewrite?

I get that moving the argparse stuff out of the driver is a good idea so
other tools can use it, but does it need to be reengineered so heavily?

For example, 4/5 of your "scope" points can be handled by the current
library architecture with perhaps a few implementation additions (typo
correction, for example).

It's just transforming arguments from one tool to another that isn't
implemented.

How do transformations fit in with CompilerInvocation?

As a sidenote, I've never understood why we distinguish between joined
and separate arguments. Are there situations where it is useful to
disambiguate between them? I'm always annoyed that clang will accept
"-mcpu=X" but not "-mcpu X". It seems a pointless feature to me.

Many tools take very similar arguments. For example, GCC, Clang, As and
Ld all take the "-mcpu" option. Does this need to be specified again for
every tool? There doesn't seem to be a method of inheritance of options.

I like your transformation idea though - I had a similar idea back last
January but it was more along the lines of a compiler "pass" like
architecture than fully tablegenned. I like it.

These are my initial comments from a quick scan-read - apologies if I've
misunderstood anything significant :)

Cheers,

James

Salvatore is also trying to do this; maybe he has some input.

> As a sidenote, I've never understood why we distinguish between joined
> and separate arguments. Are there situations where it is useful to
> disambiguate between them? I'm always annoyed that clang will accept
> "-mcpu=X" but not "-mcpu X". It seems a pointless feature to me.

It is, but it's also a compatibility point with GCC. That at least is
the reason why it works the way it does. For this specific case, one
might imagine it could be cleaned up without breaking anything
serious, but it's a lot of work for little gain IMHO given that in
general there are many other areas where the GCC style option syntax
doesn't make sense, and we can't fix without breaking compatibility.

- Daniel