Summary
We should move the analyzer config option descriptions into a higher-level abstraction which would make the analyzer behavior more in line with the documentation but potentially encourage better documentation as well on the developers’ side.
This proposal is heavily inspired by @Szelethus’ work in D104439 Demonstrate a move from the analyzer-configs .def
file to a TableGen file and the tablegen
DSL of how attributes are defined and documented.
Related tablegen
resources
- LLVM TableGen Programmer’s Reference
- YouTube 2021 LLVM Dev Mtg “How to write a TableGen backend”
- YouTube FOSDEM Lessons in TableGen
Primary differences compared to D104439
I would put much more emphasis on compile-time checks of semantic expectations of the input .td
file, producing errors and warnings for common mistakes, especially focusing on preventing copy-paste mistakes.
Along with these semantic checks, we would have the ability to cross-reference other analyzer configs and even related checkers, making it possible for users to discover and reason about the relationships and effects of these options in the broader sense.
In the future, we could put more and more quality gates, enforcing that no analyzer option gets introduced without proper documentation, e.g. missing an example explaining the effect of the option in an end-user-friendly way.
Motivation
We write the documentation by hand, and they are mostly very vague. This is a serious problem, but these analyzer config options are not intended to be user-facing in most cases. In contrast to analyzer configs, checkers are categorized into the following three categories, which can be listed by the respective flag:
-
released
→-analyzer-checker-help
-
alpha
→-analyzer-checker-help-alpha
-
debug
→-analyzer-checker-help-developer
The lack of this resulted in the situation, that we (analyzer developers) have no official way of presenting these config options since the end-user should either use the -cc1
or prepend any -analyzer-config
with an -Xclang
or -Xanalyzer
; and none of these approaches are considered supported. So, at the end of the day, the devs don’t have a way of presenting useful user-facing configs such as:
elide-constructors
expand-macros
apply-fixits
crosscheck-with-z3
-
consider-single-element-arrays-as-flexible-array-members
! -
assume-controlled-environment
! -
ignore-bison-generated-files
! - etc.
It would be similarly important to refer to assume-controlled-environment
in the documentation from any taint-related checkers, and the opposite way around - since these share some property, thus related in that sense.
Harmony with checkers.td
In this proposal, I’m focusing on reforming the analyzer config options, but to bring out the most of it - e.g. have cross-references from checkers to analyzer configs and around, we should have a common syntax for both. Parse both checker and config options, and perform semantic checks to ensure everything is in line. We could probably keep the current syntax of the checkers.td
, but I believe that would be less than ideal.
Proposed format
Please note that I chose this format only for experimentation, and I’m ready for comments to improve it.
Also, note that every field is strongly typed!
class ConfigValue {
string FlagName;
string ShortDescription;
string LongDescription; // optional
list<string> RelatedConfigs; // optional, should name config definitions.
// list<string> RelatedCheckers; // optional, should name checker definitions.
// ConfigCategory Category;
}
class BooleanConfigValue : ConfigValue {
bit DefaultValue;
}
class EnumConfigValue : ConfigValue {
string EnumName = NAME;
list<string> Options;
string DefaultValue;
}
class IntConfigValue : ConfigValue {
int Min = 0;
int Max = ?; // optional
int DefaultValue;
}
class StringConfigValue : ConfigValue {
string DefaultValue;
}
class UserModeDependentEnumConfigValue : ConfigValue {
string EnumName = NAME;
list<string> Options;
string ShallowDefaultValue;
string DeepDefaultValue;
}
class UserModeDependentIntConfigValue : ConfigValue {
int ShallowMin = 0;
int DeepMin = 0;
int ShallowDefaultValue;
int DeepDefaultValue;
}
/// Analyzer config options:
def UserMode : EnumConfigValue {
let FlagName = "mode";
let ShortDescription =
"Controls the high-level analyzer mode, which influences the "
"default settings for some of the lower-level config options (such as "
"IPAMode).";
let RelatedConfigs = ["MaxInlinableSize", "MaxNodesPerTopLevelFunction", "IPAMode"];
let Options = ["deep", "shallow"];
let DefaultValue = "deep";
// let Category = ReleasedConfig;
}
def ShouldAssumeControlledEnvironment : BooleanConfigValue {
let FlagName = "assume-controlled-environment";
let ShortDescription =
"Whether the analyzed application runs in a controlled environment. "
"We will assume that environment variables exist in queries and they hold "
"no malicious data. For instance, if this option is enabled, 'getenv()' "
"might be modeled by the analyzer to never return NULL.";
let DefaultValue = False;
// let Category = ReleasedConfig;
// let RelatedCheckers = ["TaintPropagation"]
}
def MinCFGSizeTreatFunctionsAsLarge : IntConfigValue {
let FlagName = "min-cfg-size-treat-functions-as-large";
let ShortDescription =
"The number of basic blocks a function needs to have to be considered "
"large for the 'max-times-inline-large' config option.";
let RelatedConfigs = ["MaxTimesInlineLarge"];
let DefaultValue = 14;
// let Category = DeveloperConfig;
}
def MaxTimesInlineLarge : IntConfigValue {
let FlagName = "max-times-inline-large";
let ShortDescription =
"The maximum times a large function could be inlined.";
let RelatedConfigs = ["MinCFGSizeTreatFunctionsAsLarge"];
let DefaultValue = 32;
// let Category = DeveloperConfig;
}
def MaxInlinableSize : UserModeDependentIntConfigValue {
let FlagName = "max-inlinable-size";
let ShortDescription =
"The bound on the number of basic blocks in an inlined function.";
let ShallowDefaultValue = 4;
let DeepDefaultValue = 100;
// let Category = DeveloperConfig;
}
def CTUDir : StringConfigValue {
let FlagName = "ctu-dir";
let ShortDescription = "The directory containing the CTU related files.";
let DefaultValue = "";
// let Category = DeveloperConfig;
}
// ...
For brevity, I omitted the ConfigCategory
definitions. RelatedCheckers
are also omitted, since that is not implemented yet.
I deliberately chose to refer to entities by “name” in the RelatedConfigs
instead of referring to the tablegen
class directly.
By doing it this way, we can introduce circular references, which couldn’t be done by the direct approach AFAIK.
This makes the implementation slightly more awkward, but the td
file still feels natural.
Some of the implemented semantic checks
- The
FlagName
,ShortDescription
,LongDescription
, should not be empty and it should be unique. - The
LongDescription
should be longer than theShortDescription
. - The elements of
RelatedConfigs
should be unique and should refer to an existing config. - The
RelatedConfigs
should not refer to the enclosing Config. - The
DefaultValue
of anEnumConfigValue
should refer to an element in theOptions
list, which must have at least two elements. - The
DefaultValue
should be withinMin
andMax
of anIntConfigValue
. - many more …
These checks are proved to be useful when I ported the existing def
file into the proposed format.
Config categories
I think we should follow what is done with the checkers.
Configs could be categorized also into three distinct categories:
-
Released
: These options are intended for a regular end-user. Options within this category should be the first to read after reading the checker’s documentation of course. These options are considered supported. All of these options should have an example clearly showing the difference and the effect of the given option. For example:assume-controlled-environment
,consider-single-element-arrays-as-flexible-array-members
. -
Alpha
(I would rather call it experimental): This should house the nobs, that are not meant for a regular end-user. These are opt-in experimental features, which might cause unexpected results and we offer only limited support. We discourage using these. For example:crosscheck-with-z3
,support-symbolic-integer-casts
. -
Debug
(I would rather call it developer): This is generally for tool-vendors (frontend developers). These are primarily intended to be enabled/set byscan-build
orCodeChecker
or other tools for fine-tuning the analyzers. These are subject to change and are heavily discouraged to use directly. For example:track-conditions-debug
,display-ctu-progress
,ctu-dir
etc.
By having this, we could completely generate a documentation page similar to the checkers.rst
file.
Clearly state which config category houses what, and describe an always up-to-date list of the configs within each category. We could also limit the number of configs we dump on the users when it uses the -analyzer-config-help
flag. We can even make a driver flag for letting the users directly pass these options to the analyzer engine.
Upstreaming
I can submit patches for providing an alternative tablegen
backend parsing the new format.
Then we can let the new tablegen
backend generate the current .def
file. This would demonstrate that the approach is solid and we can move further for generating some docs.
The format of the docs is still not decided, for that I need to delve more into the details of sphinx and rst
.
Conclusion
I believe that shifting to a more capable representation would bring significant value.
By implementing semantic checks, we could shift the tablegen
-specific knowledge to the tablegen
-backend, where it is supposed to reside, and have a lean and clean self-explanatory description language for the options and even for checkers.
Outlook
I’ve seen that clang-tidy
suffers from the exact same problem. If we succeed, they might also consider shifting to tablegen
for similar reasons. Although, I’m not convinced yet that the description DSL should be able to handle that out-of-the-box. Currently, I’m not planning to make it reusable for other tools like clang-tidy.