[RFC] Emit SARIF Diagnostics via -fdiagnostics-format=sarif

Hello Everyone,

Below is an RFC on extending clang's -fdiagnostics-format option to
let clang emit machine-readable JSON diagnostics. Feedback is highly appreciated!

Why

Machine-consumable diagnostics are important for writing generic static
analysis wrappers and harnesses that want to interact with code bases through
clang. There are two options to consider for the diagnostic format to use in
clang:

  1. Mimic gcc-9 -fdiagnostics-format=json, covered in the previous work section
  2. Emit SARIF diagnostic information, a cross-language standardized format
    that is already supported in clang/lib/StaticAnalyzer (through --analyzer-output=sarif)

We propose (2) as it is a standardized format, which should make it easier for tools to
implement support for it.

Previous Work

gcc-9 -fdiagnostics-format=json

GCC recently implemented serializing diagnostics to JSON. This option
could be implemented as -fdiagnostics-format=json-gcc in clang to signal
to users its intended interoperability with the corresponding gcc option.
The schema for this format may be inferred from current gcc code.

While not a community standard, it can be expected to be reasonably stable, as the
original patch states the flag emits machine-readable diagnostics.

SARIF diagnostics in LLVM

SARIF (Static Analysis Results Interchange Format) is a standard format
for the output of static analysis tools.

Clang StaticAnalyzer already implements a SARIF diagnostic consumer in
D53814. This should allow us to add any extra fields that turn out to be
necessary to the diagnostics output.

Mapping clang diagnostics to SARIF

This section assumes the typical compiler diagnostic, which looks like what is
provided in the expressive diagnostics page.

In SARIF, the attributes can be mapped to the results property as follows:

  1. The file name where the diagnostic occurs maps to the physicalLocation
    property.
  2. The line/column of the caret marking the error can be stored in the region
    property, which can also encode the source range to which an error corresponds.
  3. The error message can be stored in the message property.
  4. Each of the locations can store the rendered caret & snippet from clang using the
    snippet property for that region.
  5. Nested diagnostics (typically note level items) can be represented using the
    locationRelationship object.
  6. Fixit hints can be communicated through the fixes property.
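
To make the mapping concrete, here is a minimal sketch in Python (chosen only for brevity; it is not how clang would implement this). The input dictionary shape, the helper names (to_region, to_location, to_result) and the sample diagnostic are all hypothetical and exist only for illustration. The output property names follow my reading of SARIF 2.1.0, and representing notes as relatedLocations linked through relationships entries is one possible choice, not something this RFC has settled.

```python
import json


def to_region(d):
    """Items 2 and 4: caret position, optional range end, optional snippet."""
    region = {"startLine": d["line"], "startColumn": d["column"]}
    if "end_column" in d:
        region["endColumn"] = d["end_column"]
    if "snippet" in d:
        region["snippet"] = {"text": d["snippet"]}
    return region


def to_location(d):
    """Item 1: the file name goes under physicalLocation."""
    return {
        "physicalLocation": {
            "artifactLocation": {"uri": d["file"]},
            "region": to_region(d),
        }
    }


def to_result(diag):
    """Map one diagnostic dict (hypothetical input shape) to a SARIF result."""
    primary = to_location(diag)
    result = {
        "level": diag["level"],                # "error", "warning" or "note"
        "message": {"text": diag["message"]},  # item 3
        "locations": [primary],
    }
    # Item 5: each note becomes a related location, and the primary
    # location points at it through a relationship entry.
    notes = diag.get("notes", [])
    if notes:
        primary["id"] = 0
        primary["relationships"] = []
        result["relatedLocations"] = []
        for i, note in enumerate(notes, start=1):
            related = to_location(note)
            related["id"] = i
            related["message"] = {"text": note["message"]}
            result["relatedLocations"].append(related)
            primary["relationships"].append({"target": i, "kinds": ["relevant"]})
    # Item 6: fix-it hints become replacements inside a single fix.
    fixits = diag.get("fixits", [])
    if fixits:
        result["fixes"] = [{
            "artifactChanges": [{
                "artifactLocation": {"uri": diag["file"]},
                "replacements": [
                    {
                        "deletedRegion": to_region(f),
                        "insertedContent": {"text": f["text"]},
                    }
                    for f in fixits
                ],
            }]
        }]
    return result


# Made-up diagnostic, used only to exercise the mapping.
diag = {
    "file": "t.c", "line": 3, "column": 10, "end_column": 13,
    "level": "error",
    "message": "use of undeclared identifier 'foo'",
    "snippet": "  return foo;",
    "notes": [{"file": "t.c", "line": 1, "column": 5,
               "message": "'bar' declared here"}],
    "fixits": [{"line": 3, "column": 10, "end_column": 13, "text": "bar"}],
}

sarif_log = {
    "version": "2.1.0",
    "runs": [{"tool": {"driver": {"name": "clang"}},
              "results": [to_result(diag)]}],
}
print(json.dumps(sarif_log, indent=2))
```

Keeping the region builder separate lets the same code serve primary locations, notes and fix-it ranges, which is roughly the reuse one would hope to get from the existing StaticAnalyzer consumer.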

Interface Changes

We propose the following interface changes:

  • Input: Extend the -fdiagnostics-format flag to recognize: -fdiagnostics-format=sarif
  • Output: Clang will emit SARIF formatted diagnostics when -fdiagnostics-format=sarif is provided.
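
From the consumer side, a wrapper or harness could read the resulting log with nothing more than a JSON parser. The sketch below assumes the SARIF text has already been captured into a string; how clang would deliver it (stderr, a file, one log per invocation) is not specified in this RFC. The function name summarize and the sample log are hypothetical.

```python
import json


def summarize(sarif_text):
    """Return (file, line, column, level, message) tuples from a SARIF log."""
    log = json.loads(sarif_text)
    rows = []
    for run in log.get("runs", []):
        for result in run.get("results", []):
            message = result.get("message", {}).get("text", "")
            # Assumption: a result without an explicit level is treated
            # as a warning here.
            level = result.get("level", "warning")
            for loc in result.get("locations", []):
                phys = loc.get("physicalLocation", {})
                uri = phys.get("artifactLocation", {}).get("uri", "<unknown>")
                region = phys.get("region", {})
                rows.append((uri, region.get("startLine"),
                             region.get("startColumn"), level, message))
    return rows


# Hand-written sample log, standing in for captured compiler output.
sample = """
{
  "version": "2.1.0",
  "runs": [{
    "tool": {"driver": {"name": "clang"}},
    "results": [{
      "level": "error",
      "message": {"text": "use of undeclared identifier 'foo'"},
      "locations": [{
        "physicalLocation": {
          "artifactLocation": {"uri": "t.c"},
          "region": {"startLine": 3, "startColumn": 10}
        }
      }]
    }]
  }]
}
"""

for uri, line, column, level, message in summarize(sample):
    print(f"{uri}:{line}:{column}: {level}: {message}")
```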

Diagnostic Examples

Various examples are available in this GitHub gist (which also renders this message in Markdown): https://gist.github.com/envp/3a5fdd33115b91c391c22e5e8a5210f4#diagnostic-examples

+1, I support adopting SARIF.
clang-tidy should also follow suit.

Roman

Hello Everyone,

Below is an RFC on extending clang's `-fdiagnostics-format` option to
let clang emit machine-readable JSON diagnostics. Feedback is highly appreciated!

# Why
Machine-consumable diagnostics are important for writing generic static
analysis wrappers and harnesses that want to interact with code bases through
clang. There are two options to consider for the diagnostic format to use in
clang:

1. Mimic `gcc-9 -fdiagnostics-format=json`, covered in the previous work section
2. Emit [SARIF][0] diagnostic information, a cross-language standardized format
that is already supported in `clang/lib/StaticAnalyzer` (through `--analyzer-output=sarif`)

We propose (2) as it is a standardized format, which should make it easier for tools to
implement support for it.

I'd support option #2 -- SARIF has a lot of nice tooling support
that's forming in the industry (such as
https://docs.github.com/en/github/finding-security-vulnerabilities-and-errors-in-your-code/uploading-a-sarif-file-to-github).
I'm not super excited about #1 given the existence of #2.

## Previous Work

### `gcc-9 -fdiagnostics-format=json`
GCC [recently][1] [implemented][2] serializing diagnostics to JSON. This option
could be implemented as `-fdiagnostics-format=json-gcc` in clang to signal
to users its intended interoperability with the corresponding gcc option.
The schema for this format may be inferred from [current gcc code][3].

While not a community standard, it can be expected to be reasonably stable, as the
[original patch][2] states the flag emits machine-readable diagnostics.

## SARIF diagnostics in LLVM

[SARIF][0] (Static Analysis Results Interchange Format) is a standard format
for the output of static analysis tools.

Clang StaticAnalyzer already implements a SARIF diagnostic consumer in
[D53814][4]. This should allow us to add any extra fields that turn out to be
necessary to the diagnostics output.

### Mapping clang diagnostics to SARIF

This section assumes the typical compiler diagnostic, which looks like what is
provided in the [expressive diagnostics page][5].

In SARIF, the attributes can be mapped to the [`results`][7] property as follows:
1. The file name where the diagnostic occurs maps to the [`physicalLocation`][8]
property.
2. The line/column of the caret marking the error can be stored in the [`region`][9]
property, which can also encode the source range to which an error corresponds.
3. The error message can be stored in the [`message`][10] property.
4. Each of the locations can store the rendered caret & snippet from clang using the
[`snippet`][12] property for that region.
5. Nested diagnostics (typically `note` level items) can be represented using the
[`locationRelationship`][14] object.
6. Fixit hints can be communicated through the [`fixes`][13] property.

This looks sensible to me.

~Aaron