Non-deterministic diagnostics due to parallelism

See Parallel input file parsing for some parallelism work on lld.
Parallelism can easily make diagnostic order (warnings and errors) non-deterministic[1] and @dblaikie raised a question/concern on rGe45a5696bb2a . I am therefore creating this post to make users aware and discuss.

Note: linker output needs to be predicatable. There is no excuse. (When the user specified an invalid linker script to overlap output sections, I think not detecting the case is acceptable, at least temporarily.)

[1]:

  • The user will see every line exactly once, but the order is not guaranteed. The next invocation with the same input may get a different permutation.
  • Due to --error-limit and our default value 20, lld will quit early after reaching 20 errors. This means in the presence of so many errors, the next invocation may give a different set of errors.

Some non-determinism diagnostics have been a long time.

–gdb-index: parsing DWARF may lead to diagnostics. We don’t do anything to ensure the diagnostics are emitted in order.

Some non-determinism diagnostics are new. (There is technically a trend of introducing more, but right now I do not find anything easily parallelizable. The next thing, if possible, is diagnostics from relocations.)

The ELF port now initializes local symbols and sections in parallel. Invalid input like too large section indexes, non-local symbols found at index < .symtab’s sh_info, invalid symbol name offset, etc will lead to (usually) errors or (sometimes) warnings.
We simply call warn/error/fatal. If multiple input (very rare in practice) has invalid input, the diagnostics do not have a guaranteed order.

I think it’s a fairly important issue & worth broader consensus (discourse thread, likely) before this moves further forward, potentially reverting or disabling the nondeterministic parallelism (my understanding is that lld had parallelism previous to this work that didn’t cause nondeterministic output - that doesn’t need to be disabled) until this is addressed.

I strongly object to this.

Linker diagnostics which may be parallel are rare. Linker diagnostics are different from compiler diagnostics in that they are genrally not ignorable. Some common linker diagnostics are non-parallel (e.g. non-existent entry point, incompatible options).
For the diagnostics which have values to be parallel, they tend to be fatal and suggest broken input. I don’t find any ignorable and parallel diagnostics.
For invalid input, we sometimes choose warn just so that a link can give the user more information or there was a widely misuse we want to work around a bit.

It is rare to have multiple invalid input. Say an input file has a fatal error like invalid symbol table, broken relocations.
Typically it is one input file/archive which the problem. When this happens, giving information for this file suffices.
There is no requirement that the error need to interleave with other types of errors in a determinic way.
Somtimes a large number of input has the same types of errors (e.g. invalid symbol table). When this happens, this suggests that the tool producing these input files have a problem.
The user is typically not interested in knowning every input file. One example suffices. This point is strengthed by the fact that we decided to default to --error-limit=20.
Fixing one input will likely fix all other input in the next invocation of the linker.

The parallelism patches have improved ld.lld performance greatly. It’s not fair to revert them to have the low-value (as explained previously) deterministic and uncommon diagnostics.

Last, mold does not do anything making diagnostics deterministic.

It is possible to make diagnostics deterministic. We need to invent new sets of functions for warn/error/fatal which record messages in a vector, assign them an ID (sometimes not easily derivable), and sort the messages when we decide to emit them. This is implementable but may come with a large overhead. We have a question whether we should implement it.

1 Like

@bd1976bris @DimitryAndric @igorkudrin @petrhosek @smeenai @smithp35

I’m really fine with diagnostic lines emitted in non-deterministic order, although of course the risk is that multiple runs of the same link job will cause actually different errors, due to the non-determinism.

That said, as long as there is some command line option to (at least temporarily) turn off the parallelism, and revert to the deterministic diagnostics, I think this should be fine. E.g. similar to running “make -j1”. :slight_smile:

Yeah I don’t feel like it’s a big issue for me or my team at least. Faster linker is way more important than the order of the output. A way to make it single threaded might be nice - but not important to us.

We’re not concerned about diagnostics being deterministic on our end. CC @oontvoo and @thakis though, because I recall them discussing this for LLD for Mach-O.

1 Like

I agree that it’s unnecessary revert parallelism in order to ensure that stderr output is deterministically ordered – so long we guarantee that the output files and the exit code remain deterministic.

It is possible to make diagnostics deterministic. We need to invent new sets of functions for warn/error/fatal which record messages in a vector, assign them an ID (sometimes not easily derivable), and sort the messages when we decide to emit them. This is implementable but may come with a large overhead. We have a question whether we should implement it.

This sounds like it might be nice, at least for some kinds of errors, but probably not worth the time to implement and the additional implementation complexity, unless someone actually needs it. Such a person can likely make do with -threads=1 instead.

Runtime overhead of the diagnostic determinism doesn’t seem like it should really be an issue here, however, since it’s a cost that will be paid only upon link failure.

Runtime overhead of the diagnostic determinism doesn’t seem like it should really be an issue here, however, since it’s a cost that will be paid only upon link failure.

Yeah, that was part of my point - runtime cost seems of low importance, code complexity is a real thing - but I wouldn’t’ve thought it’d be terribly complicated.

But admittedly, sounds like other folks don’t see this as as much of an issue as I do… I’m surprised, but fair enough.

I sometimes depend on deterministic ordering of errors & warnings during my development workflow: If I see multiple errors, I try address them one by one. I star with the first one, and when I think I solved it, I rebuild. If the previous 1st problem no longer shows up as 1st problem, I assume it is fixed and move on to the next one - and so on, until I addressed all issues in my code

This workflow no longer works/is more complicated with an indeterministic order among the reported errors/warnings.

That being said, as long as there is some command line parameter through which I can get diagnostics in a deterministic order, that works for me

2 Likes

Intuitively I’d expect the majority of linker errors users are likely to see are:

  • Undefined symbol error
  • Multiple symbol definition error
  • Linker script parsing error
  • Command line parsing error

These are the ones that are most likely to be “user error” in part of the program. I think the majority, if not all of these would be in the serial part of LLD so will remain deterministic. The errors likely to be given by the parallel parts are for problems in the object files, these do happen, but are much rarer. They also tend to be more localised to a small number of objects. In practice I don’t think the current amount of things that happen in parallel will make a lot of difference.

It should be possible to get a deterministic diagnostic order with --threads=1, essentially serialising the link. Worse for performance yes, but I think that is acceptable if the link has multiple errors that need addressing. At a push for errors and warnings emitted during the parallel sections we could add a for deterministic diagnostic order use –threads=1` as a hint.

I think it is worth monitoring the situation rather than reverting now. If there is a significant minority of people with complaints about the error messages, then as MaskRay suggests some extra work can be done to output them in a deterministic order. I’m not too concerned about performance here, at least for error messages, once the link is going to error then I don’t think performance matters as much. How much complexity depends on how much effort and state tracking is used to emulate a no-threads ordering of the diagnostics. I would imagine that giving a determinstic, but not identical to serial order, wouldn’t be too difficult.

1 Like

Chiming in from Sony - we have merged the change now and we don’t think that the non-deterministic ordering of diagnostics message will cause a problem for our tooling or users. Thanks for the work done to make LLD faster :slight_smile:

1 Like

The lld/MachO diag determinism discussions were on ⚙ D126800 Write output sections in parallel

1 Like

Oh, thanks for the context! Glad to see there was a conversation & some similar concerns/tradeoffs were discussed there. I’m still a bit surprised by the tradeoff, but good to know it’s been discussed.

I can’t speak for the linker or Apple as a whole, but when we had a similar discussion in the context of dsymutil, introducing non-determinism is absolutely a dealbreaker for us. So I’m equally surprised reading this thread, I expected it to go in a very different direction.

when we had a similar discussion in the context of dsymutil, introducing non-determinism is absolutely a dealbreaker for us

Perhaps it’d be useful to describe why? I would not expect dsymutil and linker to have different requirements in this regard.

Or, do you just mean from a sort of “purity of design” point of view, you don’t want that behavior in a tool you’re responsible for, rather than having a concrete problem in mind that will be caused by the behavior?

2 Likes

IIRC my mental argument was that if your build fails, compile errors will be in nondeterministic order too, because the build system will start each step in nondeterministic order (if only because each step takes a nondeterministic amount of time, due to OS scheduler, system load, etc).

So “if the step fails, the diag order isn’t deterministic” seems like a somewhat tenable position to me, especially if it significantly impacts performance in the happy path.

If the step succeeds, output (including warnings) should imho be deterministic.

I do think “diags should be deterministic even in the error case” is a very reasonable point of view as well, and as long as it doesn’t conflict with happy-path performance I’m in favor of accepting patches to move us in that direction.

3 Likes

Though even failing builds tend towards determinism as all the non-failing actions finish - especially if I’m building only one target and its failing to link, it’s still quite deterministic - the only action that’s running is the link step, and it’s failing.

The use case of “I got a bunch of linker errors and I’m trying to address them a few at a time to verify that I’m making progress/actually addressing them” seems relevant - if I run the build again, and don’t see the error I was just fixing in the initial/early output from the linker I might think I’d addressed the issue when I hadn’t.

3 Likes

Sure, happy to elaborate. The answer is a bit of both. dsymutil is more than just a debug info linker, it’s also a debug info “optimizer” that leverages the one definition rule to reduce the debug info size. To pick one example: the order in which you visit DIEs hugely impacts the generated output and it’s far from trivial to guarantee that those outputs are equivalent. We build a lot of software and it’s hard enough to reproduce bugs as it is. Adding non-determinism to the mix would make investigating these issues exponentially harder.

EDIT: To be clear, the dsymutil scenario I described here is very different from what’s being discussed here. I do think it makes sense to assess the trade-offs on an individual basis.

Certainly, non-determinism in the output files is a total nonstarter – reproducible builds are critical for a lot of reasons.

The debate here is only about the diagnostic messages. And upon looking at dsymutil code, I think it already has the proposed behavior: it uses multithreading and also reports warning/error diagnostics as they are encountered – which can cause the diagnostics to be non-deterministically ordered depending on the order of processing across the threads.

So it seems like we may actually be in agreement that non-determinism in that limited case is acceptable.

3 Likes

Commenting on this thread as I recently found some non-deterministic diagnostics from clang itself (not because of parallelism, but as a result of iterating DenseSet/Map).

The nondeterministic diagnostics matter in our context because we are working on compiler caching. In order to replay a cached compilation, we need to treat diagnostics (and of course serialized diagnostics) as part of the output, and it is not good when two identical compilation can produce different outputs.

We can argue that nondeterministic diagnostics for clang is a different discussion but it would be interesting to see where we should draw the line. We also don’t have good test infrastructure to detect non-deterministic diagnostics.

there’s some config for reversing the order of iteration of various hash based data structures in LLVM - does that demonstrate the diagnostic order instability? (eg: forward/reverse iteration orders produce different diagnostic order)