[RFC] Tablegen Clang Static Analyzer engine options for better documentation

Summary

We should move the analyzer config option descriptions into a higher-level abstraction which would make the analyzer behavior more in line with the documentation but potentially encourage better documentation as well on the developers’ side.

This proposal is heavily inspired by @Szelethus’ work in D104439 Demonstrate a move from the analyzer-configs .def file to a TableGen file and the tablegen DSL of how attributes are defined and documented.

Related tablegen resources

Primary differences compared to D104439

I would put much more emphasis on compile-time checks of semantic expectations of the input .td file, producing errors and warnings for common mistakes, especially focusing on preventing copy-paste mistakes.
Along with these semantic checks, we would have the ability to cross-reference other analyzer configs and even related checkers, making it possible for users to discover and reason about the relationships and effects of these options in the broader sense.
In the future, we could put more and more quality gates, enforcing that no analyzer option gets introduced without proper documentation, e.g. missing an example explaining the effect of the option in an end-user-friendly way.

Motivation
We write the documentation by hand, and they are mostly very vague. This is a serious problem, but these analyzer config options are not intended to be user-facing in most cases. In contrast to analyzer configs, checkers are categorized into the following three categories, which can be listed by the respective flag:

  • released-analyzer-checker-help
  • alpha-analyzer-checker-help-alpha
  • debug-analyzer-checker-help-developer

The lack of this resulted in the situation, that we (analyzer developers) have no official way of presenting these config options since the end-user should either use the -cc1 or prepend any -analyzer-config with an -Xclang or -Xanalyzer; and none of these approaches are considered supported. So, at the end of the day, the devs don’t have a way of presenting useful user-facing configs such as:

  • elide-constructors
  • expand-macros
  • apply-fixits
  • crosscheck-with-z3
  • consider-single-element-arrays-as-flexible-array-members !
  • assume-controlled-environment !
  • ignore-bison-generated-files !
  • etc.

It would be similarly important to refer to assume-controlled-environment in the documentation from any taint-related checkers, and the opposite way around - since these share some property, thus related in that sense.

Harmony with checkers.td
In this proposal, I’m focusing on reforming the analyzer config options, but to bring out the most of it - e.g. have cross-references from checkers to analyzer configs and around, we should have a common syntax for both. Parse both checker and config options, and perform semantic checks to ensure everything is in line. We could probably keep the current syntax of the checkers.td, but I believe that would be less than ideal.

Proposed format
Please note that I chose this format only for experimentation, and I’m ready for comments to improve it.
Also, note that every field is strongly typed!

class ConfigValue {
  string FlagName;
  string ShortDescription;
  string LongDescription; // optional
  list<string> RelatedConfigs; // optional, should name config definitions.
  // list<string> RelatedCheckers; // optional, should name checker definitions.
  // ConfigCategory Category;
}

class BooleanConfigValue : ConfigValue {
  bit DefaultValue;
}

class EnumConfigValue : ConfigValue {
  string EnumName = NAME;
  list<string> Options;
  string DefaultValue;
}

class IntConfigValue : ConfigValue {
  int Min = 0;
  int Max = ?; // optional
  int DefaultValue;
}

class StringConfigValue : ConfigValue {
  string DefaultValue;
}

class UserModeDependentEnumConfigValue : ConfigValue {
  string EnumName = NAME;
  list<string> Options;
  string ShallowDefaultValue;
  string DeepDefaultValue;
}

class UserModeDependentIntConfigValue : ConfigValue {
  int ShallowMin = 0;
  int DeepMin = 0;
  int ShallowDefaultValue;
  int DeepDefaultValue;
}

/// Analyzer config options:

def UserMode : EnumConfigValue {
  let FlagName = "mode";
  let ShortDescription =
    "Controls the high-level analyzer mode, which influences the "
    "default settings for some of the lower-level config options (such as "
    "IPAMode).";
  let RelatedConfigs = ["MaxInlinableSize", "MaxNodesPerTopLevelFunction", "IPAMode"];
  let Options = ["deep", "shallow"];
  let DefaultValue = "deep";
   // let Category = ReleasedConfig;
}

def ShouldAssumeControlledEnvironment : BooleanConfigValue {
  let FlagName = "assume-controlled-environment";
  let ShortDescription =
    "Whether the analyzed application runs in a controlled environment. "
    "We will assume that environment variables exist in queries and they hold "
    "no malicious data. For instance, if this option is enabled, 'getenv()' "
    "might be modeled by the analyzer to never return NULL.";
  let DefaultValue = False;
  // let Category = ReleasedConfig;
  // let RelatedCheckers = ["TaintPropagation"]
}

def MinCFGSizeTreatFunctionsAsLarge : IntConfigValue {
  let FlagName = "min-cfg-size-treat-functions-as-large";
  let ShortDescription =
    "The number of basic blocks a function needs to have to be considered "
    "large for the 'max-times-inline-large' config option.";
  let RelatedConfigs = ["MaxTimesInlineLarge"];
  let DefaultValue = 14;
  // let Category = DeveloperConfig;
}

def MaxTimesInlineLarge : IntConfigValue {
  let FlagName = "max-times-inline-large";
  let ShortDescription =
    "The maximum times a large function could be inlined.";
  let RelatedConfigs = ["MinCFGSizeTreatFunctionsAsLarge"];
  let DefaultValue = 32;
  // let Category = DeveloperConfig;
}

def MaxInlinableSize : UserModeDependentIntConfigValue {
  let FlagName = "max-inlinable-size";
  let ShortDescription =
    "The bound on the number of basic blocks in an inlined function.";
  let ShallowDefaultValue = 4;
  let DeepDefaultValue = 100;
  // let Category = DeveloperConfig;
}

def CTUDir : StringConfigValue {
  let FlagName = "ctu-dir";
  let ShortDescription = "The directory containing the CTU related files.";
  let DefaultValue = "";
  // let Category = DeveloperConfig;
}
// ...

For brevity, I omitted the ConfigCategory definitions. RelatedCheckers are also omitted, since that is not implemented yet.

I deliberately chose to refer to entities by “name” in the RelatedConfigs instead of referring to the tablegen class directly.
By doing it this way, we can introduce circular references, which couldn’t be done by the direct approach AFAIK.
This makes the implementation slightly more awkward, but the td file still feels natural.

Some of the implemented semantic checks

  • The FlagName, ShortDescription, LongDescription, should not be empty and it should be unique.
  • The LongDescription should be longer than the ShortDescription.
  • The elements of RelatedConfigs should be unique and should refer to an existing config.
  • The RelatedConfigs should not refer to the enclosing Config.
  • The DefaultValue of an EnumConfigValue should refer to an element in the Options list, which must have at least two elements.
  • The DefaultValue should be within Min and Max of an IntConfigValue.
  • many more …

These checks are proved to be useful when I ported the existing def file into the proposed format.

Config categories
I think we should follow what is done with the checkers.
Configs could be categorized also into three distinct categories:

  • Released: These options are intended for a regular end-user. Options within this category should be the first to read after reading the checker’s documentation of course. These options are considered supported. All of these options should have an example clearly showing the difference and the effect of the given option. For example: assume-controlled-environment, consider-single-element-arrays-as-flexible-array-members.
  • Alpha (I would rather call it experimental): This should house the nobs, that are not meant for a regular end-user. These are opt-in experimental features, which might cause unexpected results and we offer only limited support. We discourage using these. For example: crosscheck-with-z3, support-symbolic-integer-casts.
  • Debug (I would rather call it developer): This is generally for tool-vendors (frontend developers). These are primarily intended to be enabled/set by scan-build or CodeChecker or other tools for fine-tuning the analyzers. These are subject to change and are heavily discouraged to use directly. For example: track-conditions-debug, display-ctu-progress, ctu-dir etc.

By having this, we could completely generate a documentation page similar to the checkers.rst file.
Clearly state which config category houses what, and describe an always up-to-date list of the configs within each category. We could also limit the number of configs we dump on the users when it uses the -analyzer-config-help flag. We can even make a driver flag for letting the users directly pass these options to the analyzer engine.

Upstreaming
I can submit patches for providing an alternative tablegen backend parsing the new format.
Then we can let the new tablegen backend generate the current .def file. This would demonstrate that the approach is solid and we can move further for generating some docs.
The format of the docs is still not decided, for that I need to delve more into the details of sphinx and rst.

Conclusion
I believe that shifting to a more capable representation would bring significant value.
By implementing semantic checks, we could shift the tablegen-specific knowledge to the tablegen-backend, where it is supposed to reside, and have a lean and clean self-explanatory description language for the options and even for checkers.

Outlook
I’ve seen that clang-tidy suffers from the exact same problem. If we succeed, they might also consider shifting to tablegen for similar reasons. Although, I’m not convinced yet that the description DSL should be able to handle that out-of-the-box. Currently, I’m not planning to make it reusable for other tools like clang-tidy.

Mentioning some folks to get notified for sure.
@NoQ @martong @Szelethus @vbridgers @ASDenysPetrov @Xazax-hun @AaronBallman

We have talked about this off-list a lot, so you know my opinion :slight_smile: This is a no brainer, my patch was a decent initiative, but this looks like the way to go. Can’t wait to see your patch!

I am only missing why you want to create a .def file instead of a pair of .cpp/.h files?

for (auto checker: all_experimental_checkers) {}

My primary focus was generating documentation.
I’m not particularly supporting the current .def file approach, but I’m not against that either.
One advantage of .def files though how flexible they are. There are several places using the checkers.inc (which is just a generated .def file), demonstrating its versatility.
So I would say, the elimination of .def and .inc files of the analyzer is beyond the scope of this RFC.

Small steps are way better than large steps. I just wanted to say that your tablegen emitter can generate anything you need: enums, lookup tables, …

1 Like

The driver already has its configs generated using tablegen: clang/Options.td at master · llvm-mirror/clang (github.com)

Do you plan to surface user-facing/released configs in the driver? How should this interact with the other driver flags and its tablegen files?

The proposal mentions:

We can even make a driver flag for letting the users directly pass these options to the analyzer engine.

But I’d like this part to be expanded.

I don’t mind TableGen in general. It is a more flexible format.

I think the discussion about alpha options and driver flags is mostly orthogonal to this. Do you intend to decide this right here right now? In any case I’d like to go over the list of flags which are intended to be made user-facing.

And as usual, I’d like to advocate for maximizing backwards compatibility, so that IDEs didn’t suddenly start passing non-existent flags to the updated compiler.

Do you plan to surface user-facing/released configs in the driver? How should this interact with the other driver flags and its tablegen files?

IMO we should have a supported way of setting released configs, thus we need a driver flag for that.
If we go there, we should make no difference between released, alpha, or developer config options.
We should treat them differently only in the corresponding help flags, to list only the interesting set of configs.

However, in this proposal, I’m intended to fix a subset of these while making sure the directions match with some longer-term objectives, such as doing this with the driver flags.

I don’t mind TableGen in general. It is a more flexible format.

Awesome!

I think the discussion about alpha options and driver flags is mostly orthogonal to this. Do you intend to decide this right here right now?

I agree. I just want to make sure everything will fit nicely.

In any case I’d like to go over the list of flags which are intended to be made user-facing.

That’s a tough nut. I already categorized all of these, and I can tell you only a handful of them are user-facing. Some of them are experimental, and the wast majority is developer. I plan this in a separate patch though.

And as usual, I’d like to advocate for maximizing backwards compatibility, so that IDEs didn’t suddenly start passing non-existent flags to the updated compiler.

I’m on board with that. We could still preserve the raw way of defining these.
However, we will likely change the description of the flags, to make released configs actually understandable.