[RFC] YAMLGenerateSchema support for producing YAML schemas

Objective

It would be great to have YAML Schema support based on the existing YAML IO parser.

Motivation

While working on an LLVM-based tool, I had to deal with input YAML files that have a complex structure. Because all the parsing code is written with YAMLTraits.h and is spread across different files, it is quite difficult to see the full structure of such a file. It would also be convenient for users who write these input files to get hints about the format from their IDE. And if the schema is generated from the YAML parser itself, it does not have to be updated by hand whenever the file structure changes. This could be useful for seeing the full structure of the input YAML files of various LLVM-based tools that use the existing YAML parser, for example clang-format, clang-tidy, etc.

How It Works

Reading the current code in YAMLTraits.h, I found two classes: one for YAML input and another for output. I thought I could create another class derived from yaml::IO, with the same interface, that simply dumps the schema to a raw_ostream. Such a derived class has access to all keys and types of the current mapping.
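To make the idea concrete, here is a minimal, self-contained sketch of a third implementation of an IO-like interface that records keys instead of reading or writing values. Everything in it (ToyIO, ToySchemaIO, describeAnimal) is a hypothetical stand-in invented for illustration; it is not LLVM's yaml::IO and not the code from the patch.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Toy stand-in for the interface that traits call into (not yaml::IO).
struct ToyIO {
  virtual ~ToyIO() = default;
  virtual void mapRequired(const char *Key, std::string &Val) = 0;
  virtual void mapOptional(const char *Key, int &Val) = 0;
};

// A third implementation next to "input" and "output": it ignores the
// values and only records which keys were requested and how.
struct ToySchemaIO : ToyIO {
  std::vector<std::string> Required, Optional;
  void mapRequired(const char *Key, std::string &) override {
    Required.push_back(Key);
  }
  void mapOptional(const char *Key, int &) override {
    Optional.push_back(Key);
  }
  // Render the recorded keys as a small text listing.
  std::string dump() const {
    std::ostringstream OS;
    OS << "required:";
    for (const auto &K : Required)
      OS << " " << K;
    OS << "\noptional:";
    for (const auto &K : Optional)
      OS << " " << K;
    OS << "\n";
    return OS.str();
  }
};

struct Animal {
  std::string Name;
  int Age = 0;
};

// Written once, in the style of a MappingTraits<T>::mapping method.
void mapping(ToyIO &IO, Animal &A) {
  IO.mapRequired("name", A.Name);
  IO.mapOptional("age", A.Age);
}

// Runs the mapping against the schema collector and returns the listing.
std::string describeAnimal() {
  Animal A;
  ToySchemaIO IO;
  mapping(IO, A);
  assert(A.Name.empty() && A.Age == 0); // the object itself is never touched
  return IO.dump();
}
```

The point of the sketch is that the mapping function is written once and does not need to know which kind of IO object it is talking to.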

How It’s Structured

I prepared a patch that adds this new derived class (I named it GenerateSchema and put it in a separate file so as not to increase compilation time where this functionality is not needed). Internally, the class does two things: first, it builds a tree of schemas for the YAML chunks of the input file; then it builds a tree of nodes to output the schema to a file in YAML format, using the existing YAMLParser.

Testing

The unit test for this functionality is designed in such a way that it dumps the schema of some simple YAML and then compares it with a previously known

Dear llvm community, please give me your feedback.

For the record, a corresponding PR is here: [llvm][Support] Add YAMLSchemeGen for producing YAML Schemes from YAMLTraits by tgs-sc · Pull Request #133284 · llvm/llvm-project · GitHub.

Just some comments after glancing at this.

YAMLGenerateSchema support for producing YAML schemas

This has the same energy as “let’s add an MLIR dialect to describe the description of MLIR dialects” and it was probably not immediately clear to readers what you were proposing.

(though it is an accurate description from what I can tell)

This sounds like a great idea that people would find useful.

So could you reply with a short example showing how the parser code gets translated into a schema? You can gloss over the details, as long as we can see how much of a difference there is between reading C++ parser code vs. reading a schema.

Bonus points if you have a screenshot from one of the IDEs showing how the schema gets used

I think you can also give the schema and a file to a verifier tool? Maybe we could root out malformed files in tests that way.

For testing my first thought is: is it possible to discover every “schema” currently in use in llvm and see whether this generator works with them?

For instance, we have a lot of files for use with obj2yaml and yaml2obj. You could generate a schema for those, and then use it to verify existing files that use that schema.

This could generate you a lot of corner cases that you can cover in unit tests using smaller examples.

First, I apologize for the long delay in my response. I was busy with exams and then went on vacation.

This sounds like a great idea that people would find useful.

It might be. Especially because there are many places in llvm where the yaml format is used to specify input configurations.

So could you reply with a short example showing how the parser code gets translated into a schema?

Sure, let's look at an example:

using namespace llvm;

enum class ColorTy {
  White,
  Black,
  Blue,
};

struct Baby {
  std::string Name;
  ColorTy Color;
};

LLVM_YAML_IS_SEQUENCE_VECTOR(Baby)

struct Animal {
  std::string Name;
  std::optional<int> Age;
  std::vector<Baby> Babies;
};

LLVM_YAML_IS_SEQUENCE_VECTOR(Animal)

namespace llvm {
namespace yaml {

template <> struct ScalarEnumerationTraits<ColorTy> {
  static void enumeration(IO &io, ColorTy &value) {
    io.enumCase(value, "white", ColorTy::White);
    io.enumCase(value, "black", ColorTy::Black);
    io.enumCase(value, "blue", ColorTy::Blue);
  }
};

template <> struct MappingTraits<Baby> {
  static void mapping(IO &io, Baby &info) {
    io.mapRequired("name", info.Name);
    io.mapRequired("color", info.Color);
  }
};

template <> struct MappingTraits<Animal> {
  static void mapping(IO &io, Animal &info) {
    io.mapRequired("name", info.Name);
    io.mapOptional("age", info.Age);
    io.mapOptional("babies", info.Babies);
  }
};

} // namespace yaml
} // namespace llvm

int main() {
  std::vector<Animal> Animals;
  raw_ostream &OS = outs(); // Write the schema to standard output.
  yaml::GenerateSchema Gen(OS);
  Gen << Animals;
}

This example first defines the types that hold the YAML data and then defines the traits needed to work with YAML. With these traits, creating a yaml::Input lets us read YAML into the objects, and creating a yaml::Output lets us dump an object, possibly modified in the meantime, to a raw_ostream. I propose adding another class derived from yaml::IO that makes it possible to obtain the general structure of a specific YAML file.

parser code gets translated into a schema

You don’t have to write any additional code to get the schema. Most of the information about keys, types and default values (unfortunately not all of it, because the original implementation has no callback mechanism that carries type names) can be obtained from the callbacks made to the class derived from yaml::IO. For the example above, we get this schema in JSON format (initially it was in YAML format too, but it seems that my IDE expects schemas in JSON):

{
  "$schema": "http://json-schema.org/draft-04/schema",
  "title": "YAML Schema",
  "items": {
    "properties": {
      "age": {
        "type": "string"
      },
      "babies": {
        "items": {
          "properties": {
            "color": {
              "enum": [
                "white",
                "black",
                "blue"
              ],
              "type": "string"
            },
            "name": {
              "type": "string"
            }
          },
          "required": [
            "name",
            "color"
          ],
          "type": "object"
        },
        "type": "array"
      },
      "name": {
        "type": "string"
      }
    },
    "required": [
      "name"
    ],
    "type": "object"
  },
  "type": "array"
}

I manually added the "$schema" and "title" options for better IDE integration. The conveniences below are demonstrated using VSCode with the redhat.vscode-yaml extension, which supports displaying YAML schemas.

I think you can also give the schema and a file to a verifier tool?

Sure. For that, the generated schema has to be associated with the files that match a pattern:

{
  "yaml.schemas": {
    "/home/timur/timur/os-llvm/schema-test/schema.yaml": "*.my.yaml"
  }
}

After that, we can start working on the mytest.yaml file and see recommendations like this from our IDE:

is it possible to discover every “schema” currently in use in llvm and see whether this generator works with them?

I think so. I can try dumping the input file schema of clang-tidy or clang-format.

This could generate you a lot of corner cases that you can cover in unit tests using smaller examples.

Yes, I think that a unit test could dump the schema from some LLVM tool and compare it with the expected one.

Thank you for your reply, I apologize again for such a long delay, I am ready to answer your questions.

Just from your responses, it sounds like it’s worth pursuing.

I see 2 main use cases / justifications:

  • Users getting better IDE support. Which is always good but doubly so with YAML which is more fiddly than most formats IMO.
  • The project overall being able to confirm that YAML files claiming to be in a schema, actually are.

So if you have trouble convincing reviewers of the value of this, you could start checking schemas in use in llvm-project and see if you find anything weird. Accidentally passing tests for example.

Returning to this topic.

schemas in use in llvm-project and see if you find anything weird.

So, I got the schema for clang-tidy config file:

{
  "flowStyle": "block",
  "optional": [
    "Checks",
    "WarningsAsErrors",
    "HeaderFileExtensions",
    "ImplementationFileExtensions",
    "HeaderFilterRegex",
    "ExcludeHeaderFilterRegex",
    "FormatStyle",
    "User",
    "CheckOptions",
    "ExtraArgs",
    "ExtraArgsBefore",
    "InheritParentConfig",
    "UseColor",
    "SystemHeaders"
  ],
  "properties": {
    "CheckOptions": {},
    "Checks": {},
    "ExcludeHeaderFilterRegex": {
      "type": "string"
    },
    "ExtraArgs": {
      "flowStyle": "block",
      "items": {
        "type": "string"
      },
      "type": "array"
    },
    "ExtraArgsBefore": {
      "flowStyle": "block",
      "items": {
        "type": "string"
      },
      "type": "array"
    },
    "FormatStyle": {
      "type": "string"
    },
    "HeaderFileExtensions": {
      "flowStyle": "block",
      "items": {
        "type": "string"
      },
      "type": "array"
    },
    "HeaderFilterRegex": {
      "type": "string"
    },
    "ImplementationFileExtensions": {
      "flowStyle": "block",
      "items": {
        "type": "string"
      },
      "type": "array"
    },
    "InheritParentConfig": {
      "type": "string"
    },
    "SystemHeaders": {
      "type": "string"
    },
    "UseColor": {
      "type": "string"
    },
    "User": {
      "type": "string"
    },
    "WarningsAsErrors": {
      "type": "string"
    }
  },
  "type": "object"
}

And line-filter config for clang-tidy:

{
  "flowStyle": "flow",
  "items": {
    "flowStyle": "block",
    "optional": [
      "lines"
    ],
    "properties": {
      "lines": {
        "flowStyle": "flow",
        "items": {
          "flowStyle": "block",
          "items": {
            "type": "string"
          },
          "type": "array"
        },
        "type": "array"
      },
      "name": {
        "type": "string"
      }
    },
    "required": [
      "name"
    ],
    "type": "object"
  },
  "type": "array"
}

At the moment I see 3 main problems:

  • The type of elements with ScalarTraits is always string.

However, for int and size_t we would like to see integer, and in some cases number. This cannot be done in the current form, but adding a new optional static method getName to ScalarTraits would eliminate the problem.
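One way an optional static method could be detected is sketched below, using the C++17 detection idiom. ToyScalarTraits, Hex32, HasGetName and schemaTypeName are names made up for this example; the real ScalarTraits template and the exact shape of getName in the patch may differ.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Toy stand-in for yaml::ScalarTraits<T>; not the real LLVM template.
template <typename T> struct ToyScalarTraits {};

// Specializations that opt in by providing the hypothetical getName().
template <> struct ToyScalarTraits<int> {
  static const char *getName() { return "integer"; }
};
template <> struct ToyScalarTraits<double> {
  static const char *getName() { return "number"; }
};

// A custom scalar type whose traits do not provide getName().
struct Hex32 {
  unsigned Value;
};
template <> struct ToyScalarTraits<Hex32> {};

// Detection idiom: true only when Traits::getName() is a valid expression.
template <typename Traits, typename = void>
struct HasGetName : std::false_type {};
template <typename Traits>
struct HasGetName<Traits, std::void_t<decltype(Traits::getName())>>
    : std::true_type {};

// Schema type for a scalar: use getName() if present, else "string".
template <typename T> std::string schemaTypeName() {
  using Traits = ToyScalarTraits<T>;
  if constexpr (HasGetName<Traits>::value)
    return Traits::getName();
  else
    return "string";
}

// Usage: the int traits opt in, the Hex32 traits fall back to "string".
inline void checkSchemaTypeNames() {
  assert(schemaTypeName<int>() == "integer");
  assert(schemaTypeName<Hex32>() == "string");
}
```

Because the method is optional, existing custom scalar traits would keep compiling unchanged and simply keep the "string" type.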

  • It is impossible to create recursive objects (however, such objects are not currently used anywhere in llvm).

The implementations of the processKey, processKeyWithDefault and yamlize methods take a void *SaveInfo parameter. It can be used to store a flag unique to each type. If the flag is already set when the method is called again for that type, a definitions field has to be created in the schema and the object referenced through it.
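The recursion guard can be illustrated with a small toy model. Note the substitution: instead of a per-type flag stored through SaveInfo, this sketch keeps a set of type names that are currently being expanded. ToyType, TypeTable and emitSchema are hypothetical names, and the sketch only emits the "$ref"; it does not build the definitions section itself.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy type description: a mapping whose fields name other types.
struct ToyType {
  std::vector<std::pair<std::string, std::string>> Fields; // key, type name
};

using TypeTable = std::map<std::string, ToyType>;

// Emits a compact JSON schema for Name. A type that is already being
// expanded is not expanded again; a "$ref" is written instead.
std::string emitSchema(const TypeTable &Types, const std::string &Name,
                       std::set<std::string> &InProgress) {
  auto It = Types.find(Name);
  if (It == Types.end())
    return "{\"type\":\"string\"}"; // unknown names are treated as scalars
  if (!InProgress.insert(Name).second)
    return "{\"$ref\":\"#/definitions/" + Name + "\"}";

  std::string Out = "{\"type\":\"object\",\"properties\":{";
  bool First = true;
  for (const auto &Field : It->second.Fields) {
    if (!First)
      Out += ",";
    First = false;
    Out += "\"" + Field.first + "\":" +
           emitSchema(Types, Field.second, InProgress);
  }
  InProgress.erase(Name);
  return Out + "}}";
}

// Example: a linked-list node that refers to itself through "next".
std::string listNodeSchema() {
  ToyType Node;
  Node.Fields.emplace_back("label", "string");
  Node.Fields.emplace_back("next", "Node");
  TypeTable Types;
  Types["Node"] = Node;

  std::set<std::string> InProgress;
  std::string Schema = emitSchema(Types, "Node", InProgress);
  assert(InProgress.empty()); // every expansion has been unwound
  return Schema;
}
```

Without the guard, the self-referencing "next" field would make the expansion recurse forever.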

  • It would be nice to receive more extensive support for CustomMappingTraits and PolymorphicTraits.

For PolymorphicTraits, the yamlize method needs to be rewritten so that, when the IO instance is a YAMLGenerateSchema, it goes through all of the getAsScalar, getAsMap and getAsSequence variants and creates a oneOf field in the schema.
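As a rough picture of the resulting schema shape, the sketch below collects one alternative per variant and joins them under "oneOf". ToyPolymorphicKinds and makeOneOf are hypothetical helpers for illustration only, and each variant is reduced to a bare "type" entry rather than a full sub-schema.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy description of what a polymorphic node may be. The three flags
// mirror the getAsScalar / getAsMap / getAsSequence variants.
struct ToyPolymorphicKinds {
  bool Scalar = false;
  bool Map = false;
  bool Sequence = false;
};

// Builds a compact JSON "oneOf" schema listing every enabled variant.
std::string makeOneOf(const ToyPolymorphicKinds &Kinds) {
  std::vector<std::string> Alternatives;
  if (Kinds.Scalar)
    Alternatives.push_back("{\"type\":\"string\"}");
  if (Kinds.Map)
    Alternatives.push_back("{\"type\":\"object\"}");
  if (Kinds.Sequence)
    Alternatives.push_back("{\"type\":\"array\"}");
  assert(!Alternatives.empty() && "expected at least one variant");

  std::string Out = "{\"oneOf\":[";
  for (std::size_t I = 0; I < Alternatives.size(); ++I) {
    if (I != 0)
      Out += ",";
    Out += Alternatives[I];
  }
  return Out + "]}";
}
```

A validator that reads such a schema accepts a node if it matches exactly one of the listed alternatives.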

The situation with CustomMappingTraits is trickier, because these traits only have the inputOne and output methods. At the schema generation stage we can’t use inputOne because it takes the key name, so I use output. But output is almost always a range-based iteration over the object, and the object is empty at this stage, so this doesn’t make much sense. I don’t have a good solution for this case, so I would like to consult with someone about it.
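The problem with empty objects is easy to reproduce in a toy model. In the sketch below, ToyKeyRecorder, ToyOptions and keysSeen are hypothetical names; the output function only imitates the usual shape of a CustomMappingTraits output method and is not taken from LLVM.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for the schema-generating IO: it only records key names.
struct ToyKeyRecorder {
  std::vector<std::string> Keys;
  void mapRequired(const std::string &Key, int &) { Keys.push_back(Key); }
};

using ToyOptions = std::map<std::string, int>;

// Written in the style of a CustomMappingTraits<T>::output method:
// the keys come from the object's contents, not from the code.
void output(ToyKeyRecorder &IO, ToyOptions &Opts) {
  for (auto &[Key, Value] : Opts)
    IO.mapRequired(Key, Value);
}

// Number of keys a schema generator would learn from the given object.
std::size_t keysSeen(ToyOptions Opts) {
  ToyKeyRecorder IO;
  output(IO, Opts);
  assert(IO.Keys.size() == Opts.size()); // one recorded key per entry
  return IO.Keys.size();
}
```

With a default-constructed object the loop body never runs, so the generator learns no keys at all, which is exactly the situation at schema generation time.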

Also, in some places, an exact specialization is used for the yamlize method with an explicit cast of IO to Input or Output. For such cases, YamlSchema generation is obviously impossible.

To sum up, I would like someone to look at my patch now, because the feedback will determine which solution I prefer for the problems above. Note that this patch does not break backward compatibility, since the definition of GenerateSchema lives in a separate header file, so there is no effect where the YAML schema is not needed.

I would prefer to solve all the problems in separate pull requests since it will be necessary to make some changes to YAMLTraits.

Does this break the IDE integration or not?

It sounds like you’re saying that the schema actually wants numbers for example, but you currently detect it as strings. If this causes a false negative in an IDE context, is there a way you could ignore that field or emit some sort of “any” type for it?

I guess that if you didn’t put it in the schema at all, the IDE will think that it’s an error to use it.

Here do you mean code backward compatibility (which is not so much of a concern in C++ code) or compatibility with the “implicit” schema of the tools’ YAML files?

I guess you mean the latter? Which is tricky given that we did not have a schema to compare against in the first place.

…which makes me wonder if llvm tools having an option to dump their input schema would be useful. Would save me looking at our tests every time I need to find something.

Does this break the IDE integration or not?

Actually, this does not break the IDE integration, but the IDE will complain if you type numbers in these cases.

is there a way you could ignore that field or emit some sort of “any” type for it?

Sure, we can either omit the type in this case or set it to any type. But it would be useful to see a type there, because this is a leaf node in the resulting tree. Since most ScalarTraits are defined in YAMLTraits.h, it is enough to add one method to them. For custom scalar types that do not provide a type name, “string” would be used by default.

Here do you mean code backward compatibility (which is not so much of a concern in C++ code) or compatibility with the “implicit” schema of the tools’ YAML files?
I guess you mean the latter?

Yes, I mean the latter. I just wanted to say that obtaining a schema may require some preparation, as I mentioned before. For example, cases like this must be excluded:

Also, in some places, an exact specialization is used for the yamlize method with an explicit cast of IO to Input or Output. For such cases, YamlSchema generation is obviously impossible.

I wanted to say that if we, for example, create the -omit-schema option in some llvm tool, we should first check that the problems I mentioned will not be critical. This is due to legacy features of writing YAMLTraits.h.