Adding options for Clang to write diagnostics to file

Clang has had support for SARIF for a bit, but there are two problems:

  1. Piping stderr to a requires manual editing before feeding to a SARIF consumer.
  2. There isn’t support for batching builds (e.g. using a build system).

I propose adding two flags to resolve this problem:

  • -fdiagnostics-output=(stderr|file-overwrite|file-append), which tells Clang how the user expects diagnostics to be written
    • stderr is the default option, and will tell the compiler to write to standard error, as it does today
    • file-overwrite will write to a specified file, and overwrite the file if it exists
    • file-append will write to a specified file, and append the file if it exists
  • -fdiagnostics-output-path=, which tells Clang where to output diagnostics if the above option is not stderr. It’s intended to be an error to set -fdiagnostics-output-path in the event of -fdiagnostics-output=stderr.

It will then be up to build systems to provide support for using these flags with append mode. I suspect that handwritten Makefiles would need to have a clobber step for each invocation. Build system generators (CMake) and more intelligent build systems can probably do something a bit better.

1 Like

I just want to point you at this diff: ⚙ D78903 [Driver] Add option -fproc-stat-report
This diff is a child dependency :⚙ D79066 [Support] Class to facilitate file locking

He faced the same challenge with concurrent appends to a file.

1 Like

@denik pointed out that appending to a single file will be problematic because it’s possible to compile different TUs with different options. As such, I think it will be better to do something similar to GCC’s -fdiagnostics-format, which has sarif-stderr and sarif-file.

Unlike GCC, which writes to $(pwd)/$(filename(source)).sarif, I think we should instead write it to target.sarif. This will make it possible for build system scripts to generate diagnostic files for different targets when they use the same source. For example:

$ pwd
$ clang++ -fdiagnostics-format=sarif-stderr hello.cpp
1 error generated.
$ ls
a.out.sarif hello.cpp
$ rm a.out.sarif
$ clang++ -fdiagnostics-format=sarif-stderr hello.cpp -o /tmp/build/src/hello
1 error generated.
$ ls
$ ls /tmp/build/src/hello

Do you know if the GCC folks have any interest in changing their approach? Alternatively, do you envision adding a way to let users specify the sarif output file so they have a way to generate the same output as GCC? (I’m thinking about users who want their build system to use GCC or Clang but still get the same output artifacts – it might not be an important use case though.)

1 Like

E.g. Cmake hides the differences of gcc vs. clang from you. If your Linux distribution ships both, Cmake will take care of the differences. I prefer one Sarif file per source file. You can easily search for Sarif files and merge them into something useful and get some insights out of them.

If you use ISO C++ modules, there is a chance that your source files will be compiled twice with different flags.

Sounds like a job for Monday Christopher!

Generating one SARIF file per source file is dangerous. Consider the following.

// hello.cpp
int main()
#if __cplusplus < 202303L
  std::cout << "Hello, world!\n"; // error: forgot to include <iostream>
  int x = 0.0; // will warn
  std::print("Hello, world!\n"); // error: forgot to include <format>
  int x = 0.0; // won't warn
} // -W
add_target(cxx20_example hello.cpp)
target_compile_options(-std=c++20 -Wconversion -fdiagnostics-format=sarif-file)

add_target(cxx23_example hello.cpp)
target_compile_options(-std=c++23 -Wno-conversion -fdiagnostics-format=sarif-file)

By writing an output file per source file, we will not see the error for cxx20_example until after we fix cxx23_example. We also cannot combine the diagnostics from each due to their different build options.

All the more reason to have output per target and not output per source.

1 Like

Maybe instead $(pwd)/$(filename(source))-$(hash).sarif and add a hash of the flags.

Sure, CMake does that, but not everyone uses CMake and we should keep those folks in mind.

Sorry, I’m not seeing the motivation for using an unreadable hash, when the target name very much maps to a compiler invocation on inspection.

I missed something. I have a huge ISO C++ modules project and every source is compiled twice (BMI and object).

> make

The target is what?

A target is simply something that the compiler produces. Looking at the Clang C++20 modules documentation, that appears to be specified using -o <name>.pcm, making it no different to producing an object file.

Another problem with hashes is that they will lead to old diagnostic files hanging around when new flags are added to builds. Suppose upgrading from Clang 18 introduces a warning that’s particularly noisy for a single test file, so I add -Wno-noisy-warning to my build file. That means that the previous invocation won’t be overwritten with the new diagnostics, and now the SARIF consumer will be loading diagnostics for the same invocation twice, and one of them will be a set of outdated false positives.

I see. That makes more sense my_awesome_module.pcm.sarif than the GCC approach.

-fdiagnostics-output=(stderr|as-target) should be sufficient now.

D145284, plus one of D145438 and D145439 will be the implementation. Before reviewing the implementation, we should sort out the how we want Clang to behave when given -fdiagnostics-format=sarif-file. In both cases, clang++ -fdiagnostics-format=sarif-file file0.cpp file1.cpp ... fileN.cpp -o target will generate a temporary SARIF file per file, similar to how .o files are generated. The difference is in what they choose to do with these files.

D145438 will take all of the temporary SARIF files’ (single) run objects and join them together into a single SARIF file called target.sarif, not unlike what will happen with the temporary .o files. By factoring this work out into a separate tool, it will also make it possible for larger projects to combine multiple SARIF files that appear from multiple build steps to be reduced into a single file (e.g. perhaps there’s a failure step that does something like sarif-ld $(find . -name '*.sarif') -o my_project.sarif). The advantage of this approach is that there is a single file per invocation, and so tooling that expects there to be exactly one file will still be able to work such as Visual Studio Code’s SARIF tool. It may also be beneficial for the tool to deduplicate identical diagnostics.

After having nearly completed D145438, I was asked to consider D145439. The approach this one takes is to grab all of the temporary SARIF files, and place them in a directory called target_sarif as 0.sarif, 1.sarif, etc. This is similar to producing split DWARF files with ThinLTO. No separate tool is required, and the primary advantage of this approach lies in its simplicity.

A third approach could be to blend the two. For example, we could have D145439 be the default process, and then users that wish to reduce the diagnostics can opt into that via -fdiagnostics-format=sarif-file -fdiagnostics-link (second flag to be bikeshed).

I have some random questions, none of them are blockers, just curious.

  • Which of these approaches would work when the user is using -fsyntax-only? In those cases, we might not have a target name at all, so we cannot not have a target_sarif directory.
  • Would all of these solutions free of races? E.g., a project having multiple targets with the same name (but in different directory).
  • Which of these approaches would work when other parts of the toolchain would also start emitting sarif files? E.g., imagine LLVM emitting optimization remarks as SARIFs, the linker emitting diagnostics as SARIFs and so on. I do not anticipate that happening any time soon, but we do not want to make a design decision that might make the experience worse for those scenarios in the future.
  • Do we want to build this feature in a way that build systems can be aware of these files? E.g., users could potentially expose target_sarif as a target, and make target_sarif would regenerate the compiler diagnostic files. Would all of the approaches work in a scenario like that?

These are excellent questions, thank you.

-fsyntax-only has been the biggest point of frustration for this effort. Right now, clang++ -fsyntax-only -fdiagnostics-format=sarif-file file.cpp will write to ./file.cpp.sarif, and I’m not sure if we can do better (or if we should even try to do better).

This should be fine since both the final SARIF file and the <target>_sarif directory will appear alongside the target (so they’ll also be in different directories).

I think both can work with this model, although neither of them take the necessary steps to harden against this right now. A simple hack to avoid this would be to have clang suffix .clang to the filename (e.g. target.clang.sarif and 0.clang.sarif).

I’m not sure I follow this one. It sounds like a user would need to attempt to build their project, and then rebuild to get the diagnostic files?

This is potentially racy when we process the same file multiple times (e.g., for multiple targets, or with multiple preprocessor macros). One option to avoid races here to include a hash of the command line in the output file name. I have the following use-case in mind. In Rust, one could do cargo check which would only do the type check and emit warnings/errors without the codegen. If -fsyntax-only works well, a build system could do something akin to cargo check using sarif files.

Indeed, feel free to disregard this question.

Agreed on potentially racy. This is similar to the discussion above, and I think it may still have the same problem w.r.t. stale files. I don’t have a good general solution, and it’s entirely possible that sarif-file might need to be incompatible with -fsyntax-only as a result?

As a general rule, I think Clang should never produce output to a filename which cannot be set explicitly on the command-line. (-fdiagnostics-output-path achieves that, for this case).

And, I’d also suggest that, in general, anyone doing anything complex, e.g. with a buildsystem generator, should always explicitly provide such output-filename, arguments rather than depending on builtin default filename rules.

And, since the SARIF output is only intended for complex cases, I’m not even sure it even needs to have default filename rules in the first place.

But if it does have a default, following the -o output name does seem most reasonable. And I’m not sure it matters much about corner cases e.g. with -fsyntax-only. Such users who care should provide the explicit -fdiagnostics-output-path argument.

How did you reach this conclusion? I don’t expect it to be exclusively used for complex build cases at all.

I mean that I would expect that a user typing command-lines manually, like “clang foo.c” at a shell prompt would not want SARIF output, so making a shorter command-line for ease-of-typing doesn’t seem important.