[RFC] Prefer unit tests over matching dumped AST

Hi Clang Developers,

This post was motivated by the review process in [clang] fix merging of UsingShadowDecl by mizvekov · Pull Request #80245 · llvm/llvm-project · GitHub, though reading it is not required to follow the idea.

The main point of this post is that tests which match dumped AST output are hard to maintain and unpleasant to read. We can improve on them by using unit tests.

Generally we write lit tests, which check the overall behavior of a compilation invocation; this is effective and efficient. However, some of the things we develop are not easy to test with lit, because we need to observe intermediate state. In such cases, a common practice is to dump the AST and match the dumped output.
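A typical instance of the pattern (a hypothetical sketch, not taken from any existing test) looks like this:

```cpp
// RUN: %clang_cc1 -ast-dump %s | FileCheck %s

// The CHECK line below is coupled to the dumper's textual layout:
// any change to how TypeAliasDecl is printed breaks the test, even
// if the AST itself is unchanged.
using T = int;
// CHECK: TypeAliasDecl {{.*}} T 'int'
```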

But this is not good: it is fragile under any change to the AST dump format. To be honest, as a developer, every time I modify something and find a dumped-AST matching test failing, I have to take a deep breath and steel myself before looking into it.

From a testing perspective it is also problematic, because it adds a new dependency: the printing component. It would become a disaster once we want to change the printer.

So I want to propose preferring unit tests over matching dumped AST whenever we only want to observe intermediate state. An example is ⚙ D120397 [C++20] [Modules] Make the linkage consistent for template and its specialization. I think it is better because it tests exactly what we want to test, which makes things much clearer; it also helped me understand the code structure better. One downside is some boilerplate in the unit tests, but I think that is completely acceptable. On the one hand, it is worth writing a few additional lines in the unit tests to get things tested properly. On the other hand, we can reduce the boilerplate by refactoring, which also gives us a good opportunity to discover which pieces get reused most.
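For a sense of the shape such a unit test takes (a rough sketch with illustrative names; this is not the actual test from D120397):

```cpp
#include "clang/Tooling/Tooling.h"
#include "gtest/gtest.h"

using namespace clang;

TEST(LinkageTest, TemplateMatchesSpecialization) {
  // Build an AST in memory and query the property directly,
  // instead of matching text produced by the dumper.
  auto AST = tooling::buildASTFromCode(
      "template <class T> struct S {};\n"
      "template struct S<int>;\n");
  ASSERT_TRUE(AST);
  // ...locate the primary template and the specialization in the
  // translation unit, then compare the property under test, e.g.:
  //   EXPECT_EQ(Primary->getLinkageInternal(),
  //             Spec->getLinkageInternal());
}
```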

Concretely, I propose that in newly added tests we should forbid matching dumped AST unless the dumper itself is what we are implementing (e.g., changing the printer).

CC: @mizvekov @AaronBallman @erichkeane @cor3ntin @Endill @shafik @ilya-biryukov

5 Likes

I’m on the fence about this. I’m aware of the downsides of FileCheck, but I can’t say I see ASTMatchers (which I presume are the cornerstone of the unit tests you propose) as a better all-around solution. Maybe I’m just not familiar enough with it, though. As of now, I’d allow both, but wouldn’t force authors to convert one into the other if they don’t feel like it.

Could this actually come from AST dump tests being overspecified? For instance, matching the full dump instead of doing something more localized, like #pragma clang __debug dump <expr>.
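If I understand the pragma correctly, a localized test might look something like this (a hypothetical sketch; the pragma dumps only the declarations that name lookup finds, so the CHECK lines cover just that subtree rather than the whole translation unit):

```cpp
// RUN: %clang_cc1 %s 2>&1 | FileCheck %s
template <class T> struct S {};

// Dump only what lookup finds for 'S', not the full AST.
#pragma clang __debug dump S
// CHECK: ClassTemplateDecl {{.*}} S
```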

While I agree in principle, I don’t find this argument particularly compelling in practice. We have a whole corpus of tests where we observe behavior via side-effects of diagnostics (and sometimes very creative at making sure such side-effects occur).

Looking at this from another angle, the printer is an additional abstraction layer that can make things smoother for tests when something about the C++ API changes. I don’t have any compelling example to bring up, but maybe others can think of some.

Are you aware of this happening in the past, or of worthwhile changes to the printer that are pending because of potential breakage? I perceive the printer as a quite stable component, but maybe that’s because of heavy reliance on it, or I’m just plain wrong to see it this way.

If this test is only concerned with the linkage of the primary template matching the linkage of the template specialization, and not with them having any particular linkage (external, I guess), then I agree that unit tests are a good fit, and expressing this in FileCheck would be clunky.

While this test is definitely easy to follow, it’s imperative, and the reader has to follow the code to understand it. I find the declarative nature of FileCheck to be a winner in the general case.

Yeah, I definitely see signal-to-noise ratio going down in unit tests. It’s not a blocker in principle, but I’d like proponents to discover good practices on improving this before we ask everyone to prefer this way over AST dumping. Maybe ASTMatchers DSL can help us?

I find this wording too strong, given that unit tests do not look like a better all-around solution compared to AST dumping. I’d expect AST dump tests to remain a good fit for some cases even if we go ahead with this RFC.

One minor, but very unfortunate downside of unit tests that you haven’t touched upon, is loss of syntax highlighting for test code inside string literals. This definitely doesn’t help readability, especially in larger tests.

In conclusion, I’d like to thank you for bringing this up. There are definitely things to consider to improve status quo.

3 Likes

but I can’t say I see ASTMatchers (which I presume are the cornerstone of the unit tests you propose) as a better all-around solution.

Not strictly true. For example, llvm-project/clang/unittests/Sema/SemaNoloadLookupTest.cpp at main · llvm/llvm-project · GitHub and [Serialization] Load Specializations Lazily by ChuanqiXu9 · Pull Request #76774 · llvm/llvm-project · GitHub don’t use ASTMatchers.

Looking at this from another angle, the printer is an additional abstraction layer that can make things smoother for tests when something about the C++ API changes. I don’t have any compelling example to bring up, but maybe others can think of some.

But the problem is that the printer is not designed to be stable either; it is mainly meant to help debugging.

Are you aware of this happening in the past, or of worthwhile changes to the printer that are pending because of potential breakage? I perceive the printer as a quite stable component, but maybe that’s because of heavy reliance on it, or I’m just plain wrong to see it this way.

An example may be the comment of @iains from ⚙ D120397 [C++20] [Modules] Make the linkage consistent for template and its specialization.

Also, in my own downstream development experience this has happened many times; I just didn’t record the instances formally. (It would be a bit odd to record that, anyway.)

BTW, even if the printer itself is relatively stable, its output will change whenever the AST changes. It is really terrifying to see failures from checking dumped AST text.

While this test is definitely easy to follow, it’s imperative, and the reader has to follow the code to understand it.

I feel it is actually a good thing to give readers a way to follow the workflow.

Maybe ASTMatchers DSL can help us?

It is not only about ASTMatchers. It is more complex than that; I would summarize it as all the common utilities shared across the unit tests. It is hard to draw the line in advance.

One minor, but very unfortunate downside of unit tests that you haven’t touched upon, is loss of syntax highlighting for test code inside string literals. This definitely doesn’t help readability, especially in larger tests.

Yeah, but I don’t feel it hurts much, especially since the tests are intended to be kept small.


My personal experience: before I wrote the unit tests, I was afraid of their complexity too, since they are white-box tests. But after writing them, I found they are not so hard after all, and they helped me understand things much better.

A supporting example may be the clangd subproject: they rarely use lit tests and rely heavily on unit tests. While there are definitely differences, I feel their experience may be helpful.

CC: @HighCommander4 @sam-mccall

2 Likes

For others following, the comment is ⚙ D120397 [C++20] [Modules] Make the linkage consistent for template and its specialization, and it reads

for the record, I experimented with adding the linkage as an output for AST dumps (adding to TextNodeDumper / DeclPrinter), since it seems that this could be generally useful.
this works fine, but, it would create a lot of testsuite churn - around 350 tests would need amendment, so I’ve punted on it for now (perhaps unit tests can always accomplish the same).

Thank you for raising this! As someone who has done extensive work on AST printing both in text and in JSON, I understand some of the pain points you bring up. Especially:

I’ve definitely been at the “take a deep breath” stages before. :smiley:

That said, a lot of the time that pain comes from the test being over-specified in its CHECK lines. I’ve found that tests generated by a script tend to be the ones that break most readily (and are also sometimes the easiest to repair, because you can re-run the script). So in some cases, I think using a better testing methodology makes the test far less brittle and more readable.
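As a hypothetical illustration of the difference between an overspecified and a trimmed check:

```cpp
// RUN: %clang_cc1 -ast-dump %s | FileCheck --check-prefix=LOOSE %s

void f();

// An overspecified line also pins down source locations and other
// incidental detail, breaking whenever any of them changes:
//   FunctionDecl {{.*}} <line:3:1, col:8> col:6 f 'void ()'
//
// Checking only the property under test survives incidental churn:
// LOOSE: FunctionDecl {{.*}} f 'void ()'
```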

I think this goes a bit too far. I don’t think we should have a policy which forbids AST dump tests unless they’re testing changes to the dumper itself. They are sometimes the most direct way in which to test functionality, such as: llvm-project/clang/test/C/drs/dr206.c at 92bbf615f50c67030ed536f08cc5bb266e0718db · llvm/llvm-project · GitHub or llvm-project/clang/test/Sema/attr-nomerge-ast.cpp at 92bbf615f50c67030ed536f08cc5bb266e0718db · llvm/llvm-project · GitHub (and many others). One thing to note, things like DR testing sometimes drive automated scripts and I don’t think we’d want to switch those over to unit tests regardless.

I think we should definitely prefer any other form of test over an -ast-dump test when possible, but I think that should be left up to the patch author and reviewer to determine when it’s appropriate.

Unit tests do not always provide a good testing experience. They’ll tell you “expected Blah, got Yada”, but sometimes that’s too terse (or worse yet, sometimes it’s totally unintelligible output), and the extra output you get from lit showing you surrounding context saves significant time when debugging. I’ve hit some clangd failures from changes in Clang where I had absolutely no idea how to understand the test failure output, and the tests had enough layers of abstraction that they were not necessarily clearer than a lit test. One benefit of lit tests is that they translate quickly into a command line for debugging the failing test, whereas with unit tests you have to run all of the other tests in the suite before you get to the failure. It’s also worth noting that unit tests often do not test the compiler as the user would use it; they test a tooling interface that can be subtly different in how it sets up the compiler.

4 Likes

@Endill just pointed out to me offlist that the scripts can already work on unit tests if we wanted them to. Whether that’s a good idea or not is worth more thought, but this may not be a reason to avoid unit tests like I originally thought.

1 Like

FWIW, having lived in a downstream with a ton of customization for years: the vast majority of problems with AST-dump tests tend to be the result of overspecification. Too often folks just copy/paste in the whole line and make the test really fragile, rather than testing what they care about.

I also find the ‘unit tests’ really hard to read/debug, though. While I don’t mind such guidance (perhaps not as strong as ‘prefer’; more like ‘consider strongly’?) since I don’t write many myself, I would expect more unit tests to be harder for me to code review.

5 Likes

I don’t have much experience with Clang’s AST output matching test, but I have a lot of experience with similar testing methodology used in Coverity development. My experience closely matches that reported by @AaronBallman and @erichkeane. The fragility I saw was usually due to a failure to trim the AST output to what was actually relevant for the test. The features provided by lit are more extensive than what we had in Coverity testing and the AST output format has also been more stable, so problems of fragility have been less frequent than what I’m used to.

I tend to appreciate AST output matching tests for how quick they are to write and how easy they are to read. That enables adding more extensive testing that would be costly if done as unit tests.

There is a tendency for test authors to be lazy when writing tests, particularly if they haven’t had the pleasure of repeatedly fixing tests that validate too much. I don’t have any solutions to offer for that, but prominent example tests, documentation, code review, and experience all help.

3 Likes

Is it worth implementing update_test_checks.py for the clang AST?

1 Like

I sincerely hope we can manage without scripts generating our tests.

1 Like

I don’t think unit tests are the right complexity tradeoff for the majority of cases, especially when it comes to small bug fixes. For one, as previously noted, they take more boilerplate and are harder to inspect when they fail.

When I am reviewing, I certainly make a point that any functional change should touch tests somehow, but that would give me pause if that were to imply a newcomer would have to create a new unit test from scratch.

An AST test is very visual and easy to learn and modify in place, even on the first eyes-on experience. The FileCheck documentation is also simple and concise.

Unit tests, on the other hand, require using the Clang API directly, which is more complex and has sparser documentation. I think an MR author would be exposed to a much larger API surface area in that case, compared to what might have been necessary for just the fix.

Even if it is completely alright for me personally to be writing unittests, I would certainly prefer to be using the same tools a newcomer would, when that is appropriate, as that would make the entry bar even lower, allowing for cross-contribution.

I worry creating such a hard requirement will drive away casual contributors.

Regarding the concern about test churn: that is valid, and I have experienced it multiple times before, but we have usually dealt with it just fine, by underspecifying the tests further or by improving the dump machinery to avoid unnecessary changes in the output.

2 Likes

I think there is an art to designing a FileCheck-friendly output and testing methodology, and the AST dumper was not designed for this purpose. To date Clang testing has been dominated by IR tests in test/CodeGen* and parser and semantic checks using diagnostic verification. I don’t ever recall writing tests to validate AST structure, but if that is being done more often, I think the LLVM-y thing to do would be to add a new dumper mode that works better for testing purposes.

3 Likes

Perhaps we don’t need to start with a whole new dumper mode with a different format, but just build upon the already existing filtering capabilities.

The motivating example for adding linkage, a few posts above, could be hidden by default behind one such filter.

My concern, about a new output format, is that there are too many existing tests, and we won’t be able to convert them, and then will have to maintain two things.

Yeah, the key difference between the dumped AST and the textual LLVM IR form is that the latter is a well-designed format whose semantics and syntax are well defined. That is not the case for the dumped AST, which to me is basically a debugging aid.

1 Like

I feel it is concerning to add more facilities to the AST dumper just for testing…

Thanks to everyone involved here.

My summary of the discussion: no consensus. But I still want to advocate for unit tests; they make things much clearer and more explicit to me. I suggest that people who haven’t tried them give them a try.

2 Likes

I don’t like such scripts. They generate too much output, and it is hard for us to actually read it.

And it would be bad if we introduced a regression but papered over the failures with such scripts.

2 Likes

FWIW, we do have an existing script for this, at least for JSON: llvm-project/clang/test/AST/gen_ast_dump_json_test.py at main · llvm/llvm-project · GitHub

However, the complaint about test fragility is still very valid; the AST tests do get to be annoying when changing underlying details about the AST from time to time. But that’s not a reason to avoid using AST dumping for testing purposes, just a call to be mindful of testing only what you need to test.

2 Likes

AST dumps are intended to help reason about the AST, which is sometimes a debugging aid, sometimes a testing aid, and sometimes a tooling aid. We added JSON dumping for folks who need more regular output for post-processing and the text dumper is used by folks trying to reason about writing AST matchers for tidy checks, etc. But at the end of the day, dumping the AST is sometimes the easiest way to determine if you’ve properly modeled the underlying semantics in the AST. For example, not everything we track on the AST level is testable via C or C++ code; AST dumping tests are helpful for those kinds of situations because it’s a very direct mapping between source code and expected output without the abstraction layers imposed by writing readable unit tests.

IMO, we should be reaching for AST dumping tests as more of a last resort than anything, but I think it’s reasonable to prefer them over unit tests depending on how many tests you need. e.g., If you need to test dozens of variations, perhaps a unit test suite makes more sense but if you need to test “does this AST node track this one piece of information”, perhaps an AST dumping test makes more sense.

What hasn’t already been said :slight_smile:

AST dump tests provide value by describing the intended AST and ensuring it matches reality. The AST is a critical internal abstraction layer, and also a product in itself (“The LLVM Project is a collection of modular and reusable compiler and toolchain technologies”). It’s slightly silly how much we test by checking whether the FE accepts code, without checking what it does with it.

AST dump tests could be written in text files and run by a driver, or gtests written in C++.

  • C++ tests look different everywhere, it’s easy to add too many/bad abstractions, and getting good error messages is really hard (and hard to incentivise).
  • lit is not a good driver: it’s textual so you can’t express “child of” directly, you have to use regexes to skip the things you don’t care about, it expects you to write the |`- nonsense, you have to copy/paste tons of clang flags everywhere or cram lots of tests into one file. (These things incentivise huge, overspecified, copy-paste tests). Its diff output is unreliable and cryptic again because it’s textual.
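For example (a hypothetical sketch), asserting that a VarDecl sits inside a particular function’s body forces the test to reproduce the tree-drawing connectors and the exact nesting of the textual dump:

```cpp
// RUN: %clang_cc1 -ast-dump %s | FileCheck %s

void f() { int x; }

// "x is a child of f" has to be spelled out line by line, pinned
// to the dumper's indentation and `- connectors:
// CHECK:      FunctionDecl {{.*}} f 'void ()'
// CHECK-NEXT: `-CompoundStmt
// CHECK-NEXT:   `-DeclStmt
// CHECK-NEXT:     `-VarDecl {{.*}} x 'int'
```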

I think lit is the lesser of the evils we have today, but we would be better off with a driver for AST tests. This could be built from reusable components (a tree parser/printer/differ, input of marked code like llvm::Annotations, etc).
These components could be used to build test drivers for other kinds of tests (lexer, completion, template instantiation trees?). This is probably a pipe dream.

AST tests should absolutely not use AST matchers, matchers are full of irregularities, missing nodes/properties, and do-what-I-mean features that inhibit understanding AST structure. (I’ve started anti-recommending them to people writing tools).

It would be nice if the AST was more regular and dump() was just a dump rather than an opinion about its shape. (We joke that clang’s abstract syntax tree is really a concrete semantic digraph). That the text dumper and AST tests are so closely related is silly.

Experience with gtest in clangd: mixed, better than the alternatives.

  • We needed to factor out common testing patterns: there are at least 10 “shapes” of feature that all need many tests. A dumper + lit for each is harder to iterate on than writing the tests in C++ code.
  • Code examples inside string literals is bad, but probably not as bad as interleaving code and CHECK lines.
  • error messages are sometimes much better and sometimes much worse, but very irregular. The key info is missing entirely more often than with lit.
  • changing a test and waiting for compile+link sucks. A lot.
  • we often add bad/confusing abstractions, or too many knobs on good ones. We also copy boilerplate without adding needed abstractions.
  • other abstractions are great: TestTU and clangd::Annotations are way easier to understand than their equivalents in clang (modulo familiarity, but that cuts both ways). These are easier to develop in C++
  • test quality varies wildly based on whether we have good abstractions
  • overspecification is less common than lit
  • incrementally adding tests to an existing feature (e.g. making hover support concepts) is way easier than with lit
  • coverage and maintenance cost is IMO pretty good vs investment, I’d expect lit to be much worse unless we worked much more on infra
  • developers coming from outside LLVM understand gtest but don’t understand lit

That’s a lot of text, I should stop writing now.

[1] conflating (written as text vs C++) and (unit test vs integration test) is a weird LLVM-ism that makes these discussions harder!

3 Likes