Let's get Clang's diagnostics translatable!

cjdb · December 8, 2022, 9:44pm

One of the goals stated for improving Clang’s diagnostics is to have a set of diagnostics that are more conversational. After discussing with David Blaikie and Richard Smith, we’ve categorised this as a matter of translation from the incumbent diagnostics. Clang’s internals manual suggests that translations are a long-term goal. Two things we need to consider are where we store the translations, and how we intend to migrate them.

To ensure that we have a robust fallback when translations are incomplete or unavailable, the current diagnostics should remain as they are (built into the Clang binary). All other translations will have their own file, with just an identifier and translation string. This is apparently consistent with how other programs employ internationalisation.

We can either store translations with the Clang source (and have TableGen pack everything into the Clang binary), or we can ship translation files with Clang and have the compiler open the appropriate translation at runtime. Because embedding all translation strings into Clang would cause the Clang binary to grow to frightening levels, I propose that we put translations into JSON files that accompany Clang. To limit repetition, I envision these files as an array of JSON objects with a 1:1 mapping with the fallback diagnostics, and we’d only end up storing an identifier (for readability) and the translated diagnostic in every other translation file.
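To make the file layout a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the diagnostic identifiers, the `id`/`text` field names, and the messages are made up and are not an agreed format. It shows a translation file as a JSON array that lines up 1:1 with the built-in table, with the identifier used only as a sanity check.

```python
import json

# Hypothetical built-in (fallback) table; its order defines the diagnostic index.
BUILTIN = [
    ("err_expected_semi", "expected ';' after expression"),
    ("warn_unused_variable", "unused variable %0"),
]

# Hypothetical contents of a translation file shipped next to the compiler.
TRANSLATION_JSON = """
[
  {"id": "err_expected_semi", "text": "you need a ';' after this expression"},
  {"id": "warn_unused_variable", "text": "the variable %0 is never used"}
]
"""


def load_translation(raw):
    """Parse a translation file and check that it lines up 1:1 with BUILTIN."""
    entries = json.loads(raw)
    if len(entries) != len(BUILTIN):
        raise ValueError("translation file does not match the built-in table")
    for index, (entry, (builtin_id, _)) in enumerate(zip(entries, BUILTIN)):
        if entry["id"] != builtin_id:
            raise ValueError(
                f"entry {index} is {entry['id']!r}, expected {builtin_id!r}"
            )
    # "text" may be absent for diagnostics that are not translated yet.
    return [entry.get("text") for entry in entries]


texts = load_translation(TRANSLATION_JSON)
print(texts[1])
```

Because the array is positional, a lookup at runtime is just an index into `texts`; the identifier costs some file size but keeps the file readable for translators.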

Our migration story should also be considered: incompleteness will be allowed, and anything that doesn’t have a value will default to the fallback diagnostics that Clang ships with. Future work should be done to ensure that we can have a chained fallback strategy (e.g. conversational Japanese, to formal Japanese, to built-in English).
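A chained fallback could look roughly like the following Python sketch. The catalogue names, identifiers, and placeholder strings are hypothetical (the bracketed prefixes stand in for real translations), and the built-in English table is assumed to be complete.

```python
# Hypothetical catalogues mapping diagnostic id -> text.
# A missing id means "not translated at this level".
CONVERSATIONAL_JA = {
    "err_expected_semi": "[ja-conversational] expected ';' after expression",
}
FORMAL_JA = {
    "err_expected_semi": "[ja-formal] expected ';' after expression",
    "warn_unused_variable": "[ja-formal] unused variable %0",
}
BUILTIN_EN = {
    "err_expected_semi": "expected ';' after expression",
    "warn_unused_variable": "unused variable %0",
    "err_undeclared_var": "use of undeclared identifier %0",
}

# Most specific catalogue first, built-in English last.
CHAIN = [
    ("ja-conversational", CONVERSATIONAL_JA),
    ("ja-formal", FORMAL_JA),
    ("builtin-en", BUILTIN_EN),
]


def resolve(diag_id, chain=CHAIN):
    """Return (catalogue name, text) from the first catalogue that has diag_id."""
    for name, catalogue in chain:
        text = catalogue.get(diag_id)
        if text is not None:
            return name, text
    raise KeyError(diag_id)


for diag_id in ("err_expected_semi", "warn_unused_variable", "err_undeclared_var"):
    print(diag_id, "->", resolve(diag_id)[0])
```

Since the last link in the chain is the table compiled into the binary, an incomplete or missing translation file can never leave a diagnostic without text.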

cjdb · December 8, 2022, 9:54pm

What do you anticipate the performance impact of parsing this file to be?

I wonder if it’d be better to have a binary format here (e.g., a header table giving the offset for each diagnostic, looked up by index) that we can quickly load and use, and to teach, say, diagtool to build those binary files from JSON or YAML translation files that have the contents you describe here.

Richard just noted the above in an internal channel, and I’m inclined to agree.
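For illustration only, here is one way such a header-plus-offsets layout could work, sketched in Python rather than in diagtool. The exact layout (a little-endian count, then `count + 1` offsets, then the UTF-8 text) is an assumption of this sketch, not something the thread has settled on.

```python
import struct


def pack_catalogue(messages):
    """Pack messages as: count, then (count + 1) offsets, then UTF-8 text."""
    blobs = [message.encode("utf-8") for message in messages]
    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))
    header = struct.pack(f"<I{len(offsets)}I", len(blobs), *offsets)
    return header + b"".join(blobs)


def lookup(data, index):
    """Fetch message `index` by reading two offsets; other entries are untouched."""
    (count,) = struct.unpack_from("<I", data, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    # Offsets start right after the 4-byte count; entry i spans offsets[i]..offsets[i+1].
    start, end = struct.unpack_from("<2I", data, 4 + 4 * index)
    base = 4 + 4 * (count + 1)  # first byte of the text area
    return data[base + start : base + end].decode("utf-8")


packed = pack_catalogue(["expected ';' after expression", "unused variable %0"])
print(lookup(packed, 1))
```

With this shape, the compiler could map the file into memory and pay only for the diagnostics it actually emits, while translators keep editing a text format.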

dblaikie · December 8, 2022, 10:14pm

Given that this will only be used when a diagnostic is actually going to be emitted, I’m not sure perf is a huge deal. (Some diagnosing code checks pre-emptively whether a diagnostic is disabled, to avoid doing expensive work for something that won’t be shown; not all of it does, because sometimes the work is cheap enough that it isn’t worth bothering. Either way, the underlying layers check before they form or print the diagnostic text, so disabled diagnostics never get this far.) Still, given that the table could/will get rather long, it wouldn’t hurt to be able to at least binary search through it or the like. And if the plan is not to use the same format for developers editing the file (td) as for Clang reading it (originally proposed as JSON, but maybe some binary format as you’re mentioning), then it doesn’t seem like a huge step to have tblgen produce something more efficient/binary encoded.

pogo59 · December 8, 2022, 10:17pm

What I’ve seen elsewhere is a dictionary keyed by the unique code for each diagnostic. This is typically some file format convenient to the platform; the ones I’ve worked on have had that convenient format baked into the native filesystem, so no need for custom code to manage the file.

Clang is not good at keeping diagnostic handles unique for all time, so the dictionary(ies) would have to be regenerated for every version. This is probably not a huge onus as there will be new/removed diagnostics in every release needing translation. It does mean the dictionary needs to know which version it was built for, which shouldn’t be a big deal.
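A version stamp could be as simple as the following Python sketch. The `built_for` and `messages` field names and the version strings are hypothetical; the point is only that a catalogue built for another release is rejected so the compiler can fall back to its built-in text.

```python
import json


def load_catalogue(raw, compiler_version):
    """Return the id -> text mapping, or None if built for another version."""
    document = json.loads(raw)
    if document.get("built_for") != compiler_version:
        return None  # caller falls back to the built-in diagnostics
    return document["messages"]


# Hypothetical catalogue regenerated for one specific release.
CATALOGUE = json.dumps(
    {
        "built_for": "16.0.0",
        "messages": {"err_expected_semi": "expected ';' after expression"},
    }
)

print(load_catalogue(CATALOGUE, "16.0.0") is not None)
print(load_catalogue(CATALOGUE, "17.0.0") is not None)
```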

pogo59 · December 8, 2022, 10:28pm

Expanding on that, it would make me sad to see Clang gain a home-grown keyed-file management library when such things already exist and work well. Code reuse is a good thing.

tahonermann · December 8, 2022, 11:22pm

Doing translations well will be difficult. The current diagnostic approach of substituting fragments into a message doesn’t suffice to produce great results; there are too many language subtleties regarding arity, gender, and more that I’m not familiar with.
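As a small illustration of the arity problem, the Python sketch below picks a plural form per language. English needs two forms, while Polish needs three for whole numbers (1, then numbers ending in 2 to 4 except 12 to 14, then everything else). The message strings and the catalogue structure are made up for this example, and it only handles non-negative integers.

```python
def english_category(n):
    """English plural category for a non-negative integer."""
    return "one" if n == 1 else "other"


def polish_category(n):
    """Polish plural category for a non-negative integer."""
    if n == 1:
        return "one"
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return "few"
    return "many"


# Hypothetical catalogues: each language has its own rule and its own forms.
CATALOGUES = {
    "en": (
        english_category,
        {"one": "{n} error generated", "other": "{n} errors generated"},
    ),
    "pl": (
        polish_category,
        {"one": "{n} błąd", "few": "{n} błędy", "many": "{n} błędów"},
    ),
}


def render(language, n):
    """Format the error count using the plural form that language requires."""
    category_of, forms = CATALOGUES[language]
    return forms[category_of(n)].format(n=n)


for count in (1, 3, 12, 22):
    print(render("en", count), "|", render("pl", count))
```

A format that bakes the English singular/plural split into the message cannot express the Polish case, which is why the selection rule has to live with the translation rather than with the call site.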

The Unicode Consortium created the Message Format Working Group (MFWG) a few years ago to work on improvements for translatable messages. I recommend becoming familiar with their work before embarking on a design for this.

cjdb · December 9, 2022, 1:49am

What I’ve seen elsewhere is a dictionary keyed by the unique code for each diagnostic. This is typically some file format convenient to the platform; the ones I’ve worked on have had that convenient format baked into the native filesystem, so no need for custom code to manage the file. …
Expanding on that, it would make me sad to see Clang gain a home-grown keyed-file management library when such things already exist and work well. Code reuse is a good thing.

My goal was for there to be a 1:1 mapping from the fallback diagnostics to the translation files for indexing reasons, but I’ll need to see how that plays with the concerns Tom outlined below.

Doing translations well will be difficult. The current diagnostic approach of substituting fragments into a message doesn’t suffice to produce great results; there are too many language subtleties regarding arity, gender, and more that I’m not familiar with.

Yes, this will be an important thing to consider, thank you for raising this early on.

The Unicode Consortium created the Message Format Working Group (MFWG) a few years ago to work on improvements for translatable messages. I recommend becoming familiar with their work before embarking on a design for this.

Thanks, I’ll be sure to get up to speed on this, especially if it helps solve the above problem.

tschuett · December 9, 2022, 7:26am

The Rust community is in the same boat:

github.com/rust-lang/rust

Diagnostic Translation

opened 01:46PM - 18 Aug 22 UTC

davidtwco

A-diagnostics C-tracking-issue S-tracking-impl-incomplete A-translation

The Rust Diagnostics working group is leading an effort to add support for internationalization of error messages in the compiler, allowing the compiler to produce output in languages other than English. This issue tracks the current status of the effort, which was announced in the [*"Contribute to the diagnostic translation effort!"* post on *Inside Rust*](https://blog.rust-lang.org/inside-rust/2022/08/16/diagnostic-effort.html).

## What's the current status?

Diagnostic translation will take a long time to be finished. At a high-level, there are four primary steps:

- [x] Implement initial translation infrastructure
- [ ] Make diagnostics translatable through migration to new infrastructure (**we are here**)
- [ ] Set up Pontoon for translators to use
- [ ] Establish translation teams for different languages
- [ ] Implement infrastructure for distributing language packs in collaboration with infrastructure/release teams (as appropriate)

Implementing the initial translation infrastructure provides the groundwork that enables diagnostic messages to be made translatable at all. That initial infrastructure is largely completed - there might be some gaps that we'll discover and patch up as we continue - but it's almost all there.

Next, all of the diagnostics in rustc need to be modified so that they can be translatable. There's some bad news - that's a _lot_ of work. But there's also some good news - that's a lot of _highly parallelizable_ work that you can help with! It doesn't require any familiarity with the Rust compiler, just an eagerness to get involved.

## How to get started?

It's very easy to get started, the process looks like the following:

1. Join [our Zulip chat](https://rust-lang.zulipchat.com/#narrow/stream/336883-i18n/topic/.23100717.20diag.20translation) and say hello! Everyone is very friendly and eager to help if you have any trouble.
1. [Set up a development environment](https://rustc-dev-guide.rust-lang.org/building/how-to-build-and-run.html) for the compiler.
1. Identify a module to migrate (see _"Identifying diagnostics to migrate"_ below).
1. Migrate diagnostics (see _"Migrate diagnostics"_ below).
1. Open a pull request with your changes.
1. Repeat and profit!

### Identifying diagnostics to migrate

Our goal is to migrate every diagnostic in the compiler to be translatable and to switch from using a "diagnostic builder" to using "diagnostic structs". That's a lot of diagnostics, so we're splitting the work up by module in the compiler so that nobody steps on anyone else's toes.

- [x] `rustc_apfloat` - Completed by @5225225 - #100723
- [x] `rustc_arena` - Completed by @5225225 - #100723
- [x] `rustc_ast` - Completed by @5225225 - #100723
- [x] `rustc_ast_lowering` - Completed by @JeanCASPAR - #100724 - #101049
- [ ] `rustc_ast_passes` - Currently being worked on by @finalchild - #100694 - #101657
- [x] `rustc_ast_pretty` - Completed by @5225225 - #100723
- [x] `rustc_attr` - Completed by @hampuslidin - #100836
- [ ] `rustc_borrowck` - Currently being worked on by @AndyJado - #100798 - #100864 - #100871 - #100900 - #101042 - #101305 - #101301 - #101276 - #103469 - #101686 - #101275 - #103559 - #103960 - #104055
- [ ] `rustc_builtin_macros` - Available to be worked on! - #101408 - #101935
- [ ] `rustc_codegen_cranelift` - Blocked - hard to make the errors use the rustc infrastructure.
- [x] `rustc_codegen_gcc` - Completed by @ellishg - #101075 - #102509
- [ ] `rustc_codegen_llvm` - Currently being worked on by @SLASHLogin - #101005
- [ ] `rustc_codegen_ssa` - Currently being worked on by @JhonnyBillM - #102612 - #103792
- [ ] `rustc_const_eval` - Available to be worked on! - #100738
- [x] `rustc_data_structures` - Completed by @5225225 - #100723
- [x] `rustc_driver` - Completed by @adriantombu - #100890
- [x] `rustc_error_codes` - Completed by @5225225 - #100723
- [x] `rustc_error_messages` - Completed by @5225225 - #100723
- [ ] `rustc_errors` - Available to be worked on! - #102684
- [ ] `rustc_expand` - Currently being worked on by @nils - #100651
- [x] `rustc_feature` - Completed by @5225225 - #100723
- [x] `rustc_fs_util` - Completed by @5225225 - #100723
- [x] `rustc_graphviz` - Completed by @5225225 - #100723
- [x] `rustc_hir` - Completed by @5225225 - #100723
- [ ] `rustc_hir_analysis` - Available to work on!
- [x] `rustc_hir_pretty` - Completed by @5225225 - #100723
- [ ] `rustc_hir_typeck` - Available to work on! - #100722 - #101007
- [x] `rustc_incremental` - Completed by @davidtwco - #100754
- [x] `rustc_index` - Completed by @5225225 - #100723
- [ ] `rustc_infer` - Currently being worked on by @IntQuant - #100843 - #101153 - #101936
- [x] `rustc_interface` - Completed by @SkiFire13 - #100808
- [x] `rustc_lexer` - Completed by @5225225 - #100723
- [ ] `rustc_lint` - Currently being worked on by @Rejyr - #100776 - #101138
- [x] `rustc_lint_defs` - Completed by @5225225 - #100723
- [x] `rustc_llvm` - Completed by @5225225 - #100723
- [x] `rustc_log` - Completed by @5225225 in #100723
- [x] `rustc_macros` - Completed by @5225225 in #100723
- [x] `rustc_metadata` - Completed by @CleanCut in #100928
- [ ] `rustc_middle` - Available to work on! - #101021
- [ ] `rustc_mir_build` - Available to work on! - #100854 (continue from this partially completed work!)
- [x] `rustc_mir_dataflow` - Completed by @5225225 - #100744
- [x] `rustc_monomorphize` - Completed by @CleanCut in #100730
- [ ] `rustc_parse` - Currently being worked on by @Xiretza - #100667 - #100713 - #101619
- [x] `rustc_parse_format` - Completed by @5225225 - #100723
- [ ] `rustc_passes` - Completed by @CleanCut, @rdvdev2 and @diegooliveira - #100870 - #102110 - #101815 - #102110 - #103397
- [x] `rustc_plugin_impl` - Completed by @Facel3ss1 - #100768
- [x] `rustc_privacy` - Completed by @davidtwco - #98420
- [x] `rustc_query_impl` - Completed by @5225225 - #100723
- [x] `rustc_query_system` - Completed by @evopen - #100844 - #102623
- [ ] `rustc_resolve` - Currently being worked on by @rajputrajat - #101162
- [x] `rustc_save_analysis` - Completed by @wonchulee in #100780
- [x] `rustc_serialize` - Completed by @5225225 - #100723
- [x] `rustc_session` - Completed by @LuisCardosoOliveira - #100753 101466 - #101041 - #101266
- [x] `rustc_smir` - Completed by @5225225 - #100723
- [x] `rustc_span` - Completed by @5225225 - #100723
- [x] `rustc_symbol_mangling` - Completed by @JhonnyBillM - #100831
- [x] `rustc_target` - Completed by @5225225 - #100723
- [ ] `rustc_trait_selection::traits::error_reporting::suggestions` - Available to work on! - #101466
- [ ] `rustc_trait_selection` (everything else) - Available to work on! - #100814
- [x] `rustc_traits` - Completed by @5225225 - #100723
- [x] `rustc_transmute` - Completed by @JhonnyBillM - #100842
- [ ] `rustc_ty_utils` - Available to work on! - #100735
- [x] `rustc_type_ir` - Completed by @JhonnyBillM - #100721
- [ ] `rustfmt` - Available to be worked on!
- [ ] `clippy` - Available to be worked on!
- [ ] `rustdoc` - Available to be worked on!

**Note:** Some of these crates might not have diagnostics in them, in which case we'll just enable our internal lints on them. Some might have lots and lots of work that we can split up further, let us know! If there aren't many crates left, then feel free to leave a comment asking if someone is still working on their crate (check if they commented or have put a PR up recently).

Once you've picked a module (**leave a comment letting us know!**), how do you find the diagnostics to migrate? We've created rustc-internal lints that you can apply to a module which will produce an error for every diagnostic that hasn't been migrated.

```rust
#![deny(rustc::untranslatable_diagnostic)]
#![deny(rustc::diagnostic_outside_of_impl)]
```

(*an example of using these would just be adding them to the top of a file*)

After adding these attributes, you can run `./x.py check` to build the compiler in check mode (just like `cargo check` in another project). You'll notice a bunch of errors that will look something like these:

```text
error: diagnostics should only be created in `SessionDiagnostic`/`AddSubdiagnostic` impls
    --> compiler/rustc_parse/src/parser/mod.rs:1443:40
     |
1443 |     let mut err = sess.span_diagnostic.struct_span_err(
     |                                        ^^^^^^^^^^^^^^^

error: diagnostics should be created using translatable messages
    --> compiler/rustc_parse/src/parser/mod.rs:1443:40
     |
1443 |     let mut err = sess.span_diagnostic.struct_span_err(
     |                                        ^^^^^^^^^^^^^^^
```

There will be two errors for each diagnostic that isn't migrated:

1. *"diagnostics should be created using translatable messages"*
   - This error occurs when a diagnostic function is being invoked with something that isn't a translatable message (like a string literal or a formatted string).
     - e.g. `err.label("an example label")` instead of `err.label(fluent::example_label)`
     - `fluent::example_label` corresponds to a message in a "Fluent resource" which we can provide different versions of for each language.
2. *"diagnostics should only be created in `Diagnostic`/`Subdiagnostic` impls"*
   - This error occurs when a diagnostic function is being called outside of an impl of `Diagnostic` or `Subdiagnostic`. One of our goals with this migration is to move all diagnostic emission logic into impls on structs, as it helps keep the compiler tidy and works towards other goals of the diagnostics working group.
   - There are two ways to resolve this:
     1. Using a diagnostic derive to implement them automatically (preferred!)
     1. Implementing one of these traits (`Diagnostic` for errors and warnings, `LintDiagnostic` for lints, or `Subdiagnostic` for parts of an error/warning/lint) manually.
   - See _"Migrate diagnostics"_ for more on these.

We'll know we're finished when we can leave those attributes on every module in the compiler.

### Migrate diagnostics

Okay, so you've got a diagnostic in front of you that you need to migrate.. now what?

- There's an introduction to performing a migration in the [*"Contribute to the diagnostic translation effort!"* post on *Inside Rust*](https://blog.rust-lang.org/inside-rust/2022/08/16/diagnostic-effort.html), this should serve as a decent introduction to the process.
- There's detailed documentation on [diagnostic structs](https://rustc-dev-guide.rust-lang.org/diagnostics/diagnostic-structs.html) and on [diagnostic translation](https://rustc-dev-guide.rust-lang.org/diagnostics/translation.html) in the developer guide that should be useful reference material.
- There are a lot of pull requests that perform migrations that you can dig through for examples, just [look for pull requests labelled `A-translation`](https://github.com/rust-lang/rust/pulls?q=is%3Apr+is%3Aopen+sort%3Aupdated-desc+label%3AA-translation).

While migrating diagnostics, there might be cases you run into that we've not run across yet. Let us know in Zulip, you might be able to experiment and teach us how to translate some diagnostics, or there may be an opportunity to extend our core infrastructure (e.g. the derives). Don't worry though, you can always skip a diagnostic and leave it for someone else too.

## Where to get help?

Discussion is primarily happening in the [`#i18n` stream on Zulip](https://rust-lang.zulipchat.com/#narrow/stream/336883-i18n/topic/.23100717.20diag.20translation). Ask any questions you have in that chat and someone will try to help. If you don't get a response, feel free to ping `@davidtwco` or `@t-compiler/wg-diagnostics`.

## One-off tasks

Sometimes there are one-off tasks which improve compiler infrastructure around translation or just make things easier to use, these are listed below, feel free to comment to take them:

#### Completed

- [x] ~~Extend `SessionDiagnostic` derive to work on enums ([context](https://github.com/rust-lang/rust/pull/100831#discussion_r951403293))~~ #102189
- [x] ~~In [#100753](https://github.com/rust-lang/rust/pull/100753/files#diff-75e151b9de9e9418f72fab3e76335c676cbca4e50db984c2ddb63bdf1b7db3e3R375-R377), there are some functions annotated with `#[rustc_lint_diagnostics]` which is used to know when to trigger our internal lints, that are themselves triggering the internal lint - it would be good to change the internal lint so that it skips functions annotated with `#[rustc_lint_diagnostics]`.~~ #101230
- [x] ~~Support `span_suggestions`-equivalent in diagnostic derives.~~ #103209

#### In-progress

- [ ] #103042
- [ ] #104047

#### To-do

- [ ] Diagnostic migration lints won't fire on diagnostics emitted in macros, this might be something we can improve ([context](https://github.com/rust-lang/rust/pull/101075#issuecomment-1231791643))
- [ ] Adding support for `DefId` to `Span` conversions in the derive macros, e.g. `#[primary_span(def_span)]` or something like that ([context](https://github.com/rust-lang/rust/pull/100814#discussion_r958612056)).
- [ ] Better support for `Option<impl IntoDiagnosticArg>` ([context](https://github.com/rust-lang/rust/pull/101153#discussion_r959259504))
- [ ] Better support for `MultiSpan` ([context](https://github.com/rust-lang/rust/pull/101153#discussion_r959262000)).
- [ ] Compile-time checks for unused Fluent messages.
- [ ] Add a `IntoDiagnosticSpan` trait that can be implemented for anything usable with `#[primary_span]` (i.e. so we can extend `#[primary_span]` support to `Ident` easily, for example) (partially complete at https://github.com/davidtwco/rust/tree/translation-into-diagnostic-span)
- [ ] #101109
- [x] #103539

tschuett · December 9, 2022, 9:07am

How about one td file per language? Then I can build my Clang with English, because it is complete, plus two other languages.

Maybe TableGen can compress the non-standard languages to save binary size. Once you get an error, lookup time is a minor issue.

One minute later: I want my clang with e.g. Chinese and English as backup.

pogo59 · December 9, 2022, 1:29pm

One td file per language, yes. I think we don’t want to overburden the existing (often quite large) Diagnostic*.td files with all languages.

Typically each language’s catalog/dictionary gets a filename that includes the locale code, making it trivial for the compiler to determine whether the language file exists. (This is based on memory; I haven’t looked up what people do these days.)

We would also want tooling to avoid the headaches of manually keeping the per-language files in sync with the defined diagnostics.
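As a rough idea of what that could look like, here is a Python sketch of the two pieces: a locale-coded filename and a report of which diagnostics a catalogue is missing or no longer needs. The `diagnostics.<locale>.json` naming scheme, the directory, and the identifiers are all hypothetical.

```python
from pathlib import Path


def catalogue_path(directory, locale):
    """Build the per-language filename from the locale code."""
    return Path(directory) / f"diagnostics.{locale}.json"


def sync_report(defined_ids, translated_ids):
    """List diagnostics lacking a translation and translations with no diagnostic."""
    defined = set(defined_ids)
    translated = set(translated_ids)
    return {
        "missing": sorted(defined - translated),
        "stale": sorted(translated - defined),
    }


# Hypothetical identifiers: what the compiler defines vs. what one catalogue contains.
defined = ["err_expected_semi", "warn_unused_variable", "err_undeclared_var"]
translated = ["err_expected_semi", "err_removed_last_release"]

print(catalogue_path("share/clang", "zh-CN"))
print(sync_report(defined, translated))
```

The compiler only needs `catalogue_path(...).exists()` to know whether a language is installed, while something like `sync_report` could run in CI so that drift is noticed when a diagnostic is added or removed.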

tschuett · December 9, 2022, 1:35pm

I envision something like this for CMake:

-DLLVM_CLANG_DIAGS="Chinese,English"

Such that my clang has two languages.

rengolin · December 9, 2022, 2:01pm

<sarcasm>
How about using machine learning to translate the strings on the fly?
</sarcasm>

tschuett · December 9, 2022, 2:07pm

This thread is just about implementation details. The real issue is getting people to contribute translations of diagnostics.

pogo59 · December 9, 2022, 2:37pm

Contributing a translation is a very large project.
Maintaining the translation over time is a significant commitment.

But this thread is about getting Clang to support translations at all, and starting out with Pig Latin or esrever hsilgnE is enough to show it actually works.

tahonermann · December 9, 2022, 5:34pm

I recommend using IETF BCP 47 (see RFC 5646, which obsoleted RFC 4646, and RFC 4647) and the IANA Language Subtag Registry for language identification.

In some regions, it is important to support language fallback mechanisms. For example, a Chinese user might request zh-SG. If that language variant is not available, but zh-CN and zh-TW are, then it would be preferable to fall back to zh-CN (another variant written in the simplified Chinese script) rather than to zh-TW (which uses the traditional Chinese script) or to an English variant.
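A script-aware fallback along those lines might be sketched as follows in Python. The `LIKELY_SCRIPT` table is a tiny hand-written subset for illustration (a real implementation would take this data from CLDR or the language subtag registry), and the order of preference after an exact match is an assumption of this sketch.

```python
# Hand-written subset of "which script does this tag normally use" data.
LIKELY_SCRIPT = {
    "zh-CN": "Hans",
    "zh-SG": "Hans",
    "zh-TW": "Hant",
    "zh-HK": "Hant",
}


def pick_language(requested, available, default="en"):
    """Prefer an exact match, then the same language in the same script."""
    if requested in available:
        return requested
    language = requested.split("-")[0]
    script = LIKELY_SCRIPT.get(requested)
    if script is not None:
        for tag in available:
            same_language = tag.split("-")[0] == language
            if same_language and LIKELY_SCRIPT.get(tag) == script:
                return tag
    # Nothing suitable installed: use the built-in default.
    return default


installed = ["zh-TW", "zh-CN", "en"]
print(pick_language("zh-SG", installed))
print(pick_language("zh-HK", installed))
print(pick_language("tr", installed))
```

Note that plain prefix truncation (zh-SG to zh) would not find zh-CN here, which is why the script has to be part of the decision.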

cjdb · December 9, 2022, 5:57pm

One minute later: I want my clang with e.g. Chinese and English as backup.

This is another reason why I’m against building the translations directly into Clang.

tschuett · December 9, 2022, 6:03pm

I am in favour of using td files for diagnostics in different languages. What I wanted to say is: do not link Turkish into my binary. I only want the content of the Chinese and English td files in my Clang.

cjdb · December 9, 2022, 6:10pm

Sure, you can install only the Chinese language pack. It is not reasonable to bake the translations into the binary because non-contributors won’t have the luxury of doing what you want. You wouldn’t get Turkish if you don’t want it.

tschuett · December 9, 2022, 6:42pm

I see what you mean. There is a distinction between my Clang and the Clang that is shipped with Ubuntu. In the latter case, you want the language files adjacent to Clang and not baked into it.

cjdb · December 9, 2022, 7:31pm

I don’t see the benefit in supporting both approaches. This will add a serious maintenance burden. Could you please expand on why your Clang should be built differently to everyone else’s?
