Let's get Clang's diagnostics translatable!

One of the goals stated for improving Clang’s diagnostics is to have a set of diagnostics that are more conversational. After discussing with David Blaikie and Richard Smith, we’ve categorised this as a matter of translation from the incumbent diagnostics. Clang’s internals manual suggests that translations are a long-term goal. Two things we need to consider are where we store the translations, and how we intend to migrate them.

To ensure that we have a robust fallback when translations are incomplete or unavailable, the current diagnostics should remain as they are (built into the Clang binary). All other translations will have their own file, with just an identifier and translation string. This is apparently consistent with how other programs employ internationalisation.

We can either store translations with the Clang source (and have TableGen pack everything into the Clang binary), or we can ship translation files with Clang and have the compiler open the appropriate translation at runtime. Because embedding all translation strings into Clang would cause the Clang binary to grow to frightening levels, I propose that we put translations into JSON files that accompany Clang. To limit repetition, I envision these files as an array of JSON objects with a 1:1 mapping with the fallback diagnostics, and we’d only end up storing an identifier (for readability) and the translated diagnostic in every other translation file.
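As a rough illustration of the file layout proposed above, here is a minimal sketch. The field names `id` and `text`, the two diagnostic identifiers, and the conversational wording are assumptions made for the example, not an agreed format.

```python
import json

# Order of the built-in (fallback) diagnostics. Two illustrative entries only.
BUILTIN_IDS = ["err_undeclared_var_use", "warn_unused_variable"]

# Hypothetical translation file: one object per built-in diagnostic, in the
# same order, carrying only an identifier and the translated text.
CATALOG_JSON = """
[
  {"id": "err_undeclared_var_use", "text": "I couldn't find anything named %0"},
  {"id": "warn_unused_variable",   "text": "You never use the variable %0"}
]
"""

def load_catalog(raw, builtin_ids):
    """Parse a catalog and return translations indexed like the built-in table."""
    entries = json.loads(raw)
    ids = [entry["id"] for entry in entries]
    if ids != builtin_ids:
        raise ValueError("catalog does not line up 1:1 with the built-in table")
    return [entry["text"] for entry in entries]

translations = load_catalog(CATALOG_JSON, BUILTIN_IDS)
print(translations[1])
```

Because the file lines up 1:1 with the built-in table, the identifier is only there for human readers and for the sanity check; lookup itself is by position.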

Our migration story should also be considered: incompleteness will be allowed, and anything that doesn’t have a value will default to the fallback diagnostics that Clang ships with. Future work should be done to ensure that we can have a chained fallback strategy (e.g. conversational Japanese, to formal Japanese, to built-in English).
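The chained fallback could look something like the sketch below. The catalogs are plain dictionaries with English stand-in text, and the identifiers and the `lookup` helper are hypothetical names chosen for the example.

```python
# Built-in English diagnostics: always complete, compiled into the binary.
BUILTIN = {
    "err_undeclared_var_use": "use of undeclared identifier %0",
    "warn_unused_variable": "unused variable %0",
    "err_expected": "expected %0",
}

# Deliberately incomplete catalogs. The text is an English stand-in for what
# would be, say, conversational and formal Japanese.
CONVERSATIONAL = {"warn_unused_variable": "[conversational] unused variable %0"}
FORMAL = {
    "warn_unused_variable": "[formal] unused variable %0",
    "err_undeclared_var_use": "[formal] use of undeclared identifier %0",
}

def lookup(diag_id, chain, builtin):
    """Return the first available translation, else the built-in text."""
    for catalog in chain:
        text = catalog.get(diag_id)
        if text:  # a missing or empty entry falls through to the next catalog
            return text
    return builtin[diag_id]

chain = [CONVERSATIONAL, FORMAL]
for diag_id in BUILTIN:
    print(diag_id, "->", lookup(diag_id, chain, BUILTIN))
```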



What do you anticipate the performance impact of parsing this file to be?

I wonder if it’d be better to have a binary format here (e.g., a header table giving the offset for each diagnostic, looked up by index) that we can quickly load and use, and to teach, say, diagtool to build those binary files from JSON or YAML translation files that have the contents you describe here.
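To make the offset-table idea concrete, here is one possible layout, sketched under assumptions: a little-endian `uint32` count, then `count + 1` `uint32` offsets, then a UTF-8 string blob. Neither the layout nor the `build`/`lookup` names come from Clang or diagtool.

```python
import struct

def build(messages):
    """Pack messages as: count, (count + 1) offsets, then the UTF-8 blob."""
    encoded = [m.encode("utf-8") for m in messages]
    offsets = [0]
    for chunk in encoded:
        offsets.append(offsets[-1] + len(chunk))
    header = struct.pack(f"<I{len(offsets)}I", len(encoded), *offsets)
    return header + b"".join(encoded)

def lookup(data, index):
    """Fetch one message by index without decoding any of the others."""
    (count,) = struct.unpack_from("<I", data, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    # offsets[index] and offsets[index + 1] sit next to each other in the header.
    start, end = struct.unpack_from("<2I", data, 4 + 4 * index)
    blob_base = 4 + 4 * (count + 1)
    return data[blob_base + start : blob_base + end].decode("utf-8")

data = build(["unused variable %0", "expected %0", ""])
print(lookup(data, 1))
```

An empty string (index 2 here) could serve as the "no translation, use the fallback" marker, which keeps the table aligned with the built-in diagnostics.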

Richard just noted the above in an internal channel, and I’m inclined to agree.

Given this will only be used when the diagnostic is going to be emitted anyway, I’m not sure perf is a huge deal. (Some diagnosing code checks preemptively whether the diagnostic is disabled, to avoid doing expensive work to diagnose something that is disabled anyway. Not all diagnosing code does this, since sometimes the work is cheap enough not to bother, but even then the underlying layers check before they form the diagnostic text or print it out.) That said, the table could get rather long, so it wouldn’t hurt to be able to at least binary search through it or the like. And since the plan is not to use the same format for developers editing the file (td) as for Clang reading it (originally proposed as JSON, but maybe some binary format as you’re mentioning), it doesn’t seem like a huge step to have tblgen produce something more efficient or binary encoded.

What I’ve seen elsewhere is a dictionary keyed by the unique code for each diagnostic. This is typically some file format convenient to the platform; the ones I’ve worked on have had that convenient format baked into the native filesystem, so no need for custom code to manage the file.

Clang is not good at keeping diagnostic handles unique for all time, so the dictionary(ies) would have to be regenerated for every version. This is probably not a huge onus as there will be new/removed diagnostics in every release needing translation. It does mean the dictionary needs to know which version it was built for, which shouldn’t be a big deal.
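A small sketch of what "the dictionary knows which version it was built for" might mean in practice. The `clang_version` and `messages` keys and the version strings are made up for the example.

```python
import json

def load_versioned_catalog(raw, compiler_version):
    """Return the catalog's messages, or None if it targets another version."""
    doc = json.loads(raw)
    if doc.get("clang_version") != compiler_version:
        return None  # caller falls back to the built-in diagnostics
    return doc["messages"]

RAW = """
{
  "clang_version": "X.Y",
  "messages": {"warn_unused_variable": "unused variable %0"}
}
"""

print(load_versioned_catalog(RAW, "X.Y"))
print(load_versioned_catalog(RAW, "X.Z"))
```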

Expanding on that, it would make me sad to see Clang gain a home-grown keyed-file management library when such things already exist and work well. Code reuse is a good thing.

Doing translations well will be difficult. The current diagnostic approach of substituting fragments into a message doesn’t suffice to produce great results; there are too many language subtleties regarding arity, gender, and more that I’m not familiar with.

The Unicode Consortium created the Message Format Working Group (MFWG) a few years ago to work on improvements for translatable messages. I recommend becoming familiar with their work before embarking on a design for this.
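To illustrate the arity problem: English needs two plural forms, while Polish needs three for whole numbers, so a message cannot simply append an "s". The sketch below selects a whole message by plural category. The category rules are hand-written simplifications for non-negative integers, and the message table and helper names are invented for the example; this is not an implementation of the MFWG's format.

```python
def plural_category_en(n):
    return "one" if n == 1 else "other"

def plural_category_pl(n):
    if n == 1:
        return "one"
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return "few"
    return "many"

RULES = {"en": plural_category_en, "pl": plural_category_pl}

# One complete message per category, rather than fragments glued together.
MESSAGES = {
    "en": {"one": "{n} argument", "other": "{n} arguments"},
    "pl": {"one": "{n} argument", "few": "{n} argumenty", "many": "{n} argumentów"},
}

def format_count(locale, n):
    category = RULES[locale](n)
    return MESSAGES[locale][category].format(n=n)

for n in (1, 2, 5, 12, 22):
    print(format_count("en", n), "|", format_count("pl", n))
```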


What I’ve seen elsewhere is a dictionary keyed by the unique code for each diagnostic. This is typically some file format convenient to the platform; the ones I’ve worked on have had that convenient format baked into the native filesystem, so no need for custom code to manage the file. …
Expanding on that, it would make me sad to see Clang gain a home-grown keyed-file management library when such things already exist and work well. Code reuse is a good thing.

My goal was for there to be a 1:1 mapping from the fallback diagnostics to the translation files for indexing reasons, but I’ll need to see how that plays with the concerns Tom outlined below.

Doing translations well will be difficult. The current diagnostic approach of substituting fragments into a message doesn’t suffice to produce great results; there are too many language subtleties regarding arity, gender, and more that I’m not familiar with.

Yes, this will be an important thing to consider, thank you for raising this early on.

The Unicode Consortium created the Message Format Working Group (MFWG) a few years ago to work on improvements for translatable messages. I recommend becoming familiar with their work before embarking on a design for this.

Thanks, I’ll be sure to get up to speed on this, especially if it helps solve the above problem.

The Rust community is in the same boat:

How about one td file per language? Then I can build my clang with English because it is complete and two other languages.

Maybe TableGen can compress the non-standard languages to save binary size. Once you get an error, lookup time is a minor issue.

One minute later: I want my clang with e.g. Chinese and English as backup.

One td file per language, yes. I think we don’t want to overburden the existing (often quite large) Diagnostic*.td files with all languages.

Typically each language’s catalog/dictionary gets a filename that includes the locale code, making it trivial for the compiler to determine whether the language file exists. (This is based on memory; I haven’t looked up what people do these days.)
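A sketch of that naming convention, assuming a hypothetical `diagnostics/<locale>.json` layout under some resource directory. The directory name, file extension, and helper names are all invented here.

```python
import tempfile
from pathlib import Path

def catalog_path(resource_dir, locale):
    """Build the expected file name for a locale's catalog."""
    return Path(resource_dir) / "diagnostics" / f"{locale}.json"

def find_catalog(resource_dir, locale):
    """Return the catalog path if that language is installed, else None."""
    path = catalog_path(resource_dir, locale)
    return path if path.is_file() else None

# Demo: install only a Japanese catalog in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "diagnostics").mkdir()
    (root / "diagnostics" / "ja.json").write_text("[]", encoding="utf-8")
    found = find_catalog(root, "ja")
    found_name = found.name if found else None
    missing = find_catalog(root, "tr")

print(found_name, missing)
```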

We’d also want tooling to avoid the headaches of manually keeping the per-language files in sync with the defined diagnostics.
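The core of such a sync check is small. The sketch below compares the set of defined diagnostic identifiers with the keys of one language's catalog; the function name and the sample identifiers are placeholders.

```python
def check_sync(defined_ids, catalog):
    """Report diagnostics lacking a translation and translations with no diagnostic."""
    translated = set(catalog)
    missing = sorted(set(defined_ids) - translated)  # need translating
    stale = sorted(translated - set(defined_ids))    # diagnostic was removed or renamed
    return missing, stale

defined = {"err_expected", "warn_unused_variable", "err_undeclared_var_use"}
catalog = {
    "warn_unused_variable": "unused variable %0",
    "warn_removed_long_ago": "this diagnostic no longer exists",
}

missing, stale = check_sync(defined, catalog)
print("missing:", missing)
print("stale:", stale)
```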

I envision something like this for CMake:

-DLLVM_CLANG_DIAGS="Chinese,English"

Such that my clang has two languages.

<sarcasm>
How about using machine learning to translate the strings on the fly?
</sarcasm>


This thread is just about implementation details. The real issue is getting people to contribute translations of diagnostics.


Contributing a translation is a very large project.
Maintaining the translation over time is a significant commitment.

But this thread is about getting Clang to support translations at all, and starting out with Pig Latin or esrever hsilgnE is enough to show it actually works.
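A reversed-English test catalog is easy to generate mechanically. The sketch below reverses each run of letters while leaving `%0`-style arguments and the keyword after a `%` (as in `%select{...}`) untouched. It is a toy that assumes ASCII messages and does not attempt to cover every format specifier Clang supports.

```python
import re

# A run of letters that starts a word and is not directly preceded by '%',
# so modifier names such as the "select" in "%select{...}0" are left alone.
_WORD = re.compile(r"(?<!%)\b[A-Za-z]+")

def reverse_words(message):
    """Reverse every word in place, keeping format arguments intact."""
    return _WORD.sub(lambda m: m.group(0)[::-1], message)

print(reverse_words("unused variable %0"))
print(reverse_words("%select{struct|class}0 %1 is empty"))
```

Running every built-in message through something like this yields a complete "translation" that makes untranslated strings easy to spot in compiler output.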

I recommend using IETF BCP 47 (see RFC 5646, which obsoletes RFC 4646, and RFC 4647) and the IANA language-subtag-registry for language identification.

In some regions, it is important to support language fallback mechanisms. For example, a Chinese user might request zh-SG. If that language variant is not available, but zh-CN and zh-TW are, then it would be preferable to fall back to zh-CN (another simplified Chinese script) rather than to zh-TW (a traditional Chinese script) or an English variant.
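The zh-SG case can be sketched as follows. The `LIKELY_SCRIPT` table is a tiny hand-written stand-in for real likely-subtag data, and the ordering of the fallback steps (exact match, same-script sibling, bare language, default) is one possible policy rather than anything mandated by the RFCs.

```python
# Hand-written stand-in for likely-script data (an assumption of this sketch).
LIKELY_SCRIPT = {
    "zh-CN": "Hans", "zh-SG": "Hans",
    "zh-TW": "Hant", "zh-HK": "Hant",
}

def pick_catalog(requested, available, default="en"):
    """Choose which installed catalog to use for a requested language tag."""
    if requested in available:
        return requested
    language = requested.split("-")[0]
    script = LIKELY_SCRIPT.get(requested)
    # Prefer a sibling variant of the same language written in the same script.
    if script is not None:
        for tag in available:
            if tag.split("-")[0] == language and LIKELY_SCRIPT.get(tag) == script:
                return tag
    # Otherwise try the bare language tag, then give up and use the default.
    if language in available:
        return language
    return default

print(pick_catalog("zh-SG", ["zh-TW", "zh-CN", "en"]))
print(pick_catalog("zh-SG", ["zh-TW", "en"]))
```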


One minute later: I want my clang with e.g. Chinese and English as backup.

This is another reason why I’m against building the translations directly into Clang.

I am in favour of using td files for diagnostics in different languages. What I wanted to say is: do not link Turkish into my binary. I only want the content of the Chinese and English td files in my clang.

Sure, you can install only the Chinese language pack. It is not reasonable to bake the translations into the binary because non-contributors won’t have the luxury of doing what you want. You wouldn’t get Turkish if you don’t want it.

I see what you mean. There is a distinction between my Clang and the Clang that is shipped with Ubuntu. In the latter case, you want the language files adjacent to Clang and not baked into it.

I don’t see the benefit in supporting both approaches. This will add a serious maintenance burden. Could you please expand on why your Clang should be built differently to everyone else’s?