Bug in discourse email migration: discarded some parts of messages

It appears that Discourse dropped parts of some imported emails during the import process. I’ve only noticed one example so far, and only by accident (I happened to be reading this thread, and the issue here is blatant), so I assume there are more similar problems throughout the import.

Check out the difference between this one-paragraph response, which makes it look like the poster failed to ever substantively reply to the questions:

Versus the full message.

(Note, there I’ve linked to the google group mirror instead of the message in the official llvm-dev archives because the original message used colors/indents to denote quoting, which the mailman archive does not render, making it rather unintelligible.)

I also wonder if this is the same (or a similar) problem to the message on the Discourse Retrospective thread in which jcranmer noted that some message text was lost on a (new) email reply sent to discourse.


Hmm, there is a Google Groups mirror… that seems like it would have been a relatively small step away from just replacing Mailman with Google Groups, which seems like it would be much simpler than this. Was that considered?

I don’t know how the process of import for Discourse works or how they validated it, but this is puzzling overall.
My approach would have been to implement this with a round-trip check and ensure (perhaps after some normalization of the input) that no information is lost in the process, i.e. that the input can be recovered from the output.
I guess the trade-off is between importing the mailing-list archives “as-is” and importing something that “looks like it was written in Discourse originally”; the latter requires (imperfect) heuristics to interpret the messages instead of just importing them.
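
A minimal sketch of what such a round-trip check could look like, assuming both the original email body and the imported post are available as plain text. The names `normalize` and `lost_lines` are hypothetical and are not part of the Discourse importer; the normalization here (collapsing whitespace, dropping blank lines) is only one possible choice.

```python
import re


def normalize(text: str) -> list[str]:
    """Collapse runs of whitespace and drop blank lines, so purely cosmetic changes are ignored."""
    lines = (re.sub(r"\s+", " ", line).strip() for line in text.splitlines())
    return [line for line in lines if line]


def lost_lines(original: str, imported: str) -> list[str]:
    """Return the normalized lines of the original that do not appear in the imported text."""
    kept = set(normalize(imported))
    return [line for line in normalize(original) if line not in kept]


# Hypothetical example: the import kept only the text before the separator.
original = "Hi,\n\nHere are my answers.\n========\nQ1: yes\nQ2:   no\n"
imported = "Hi,\nHere are my answers.\n"
print(lost_lines(original, imported))  # ['========', 'Q1: yes', 'Q2: no']
```

An empty result would mean nothing was lost under this normalization; anything else is a candidate for manual review.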


We are aware of a few situations that caused emails to not be imported correctly. We are working with the Discourse support team to identify more of the cases and to fix them automatically. However, no data has been lost and we can always fix these up manually in the meantime. You can email discourse-admin@llvm.org with any of these import issues (or tag the admins on the post) so we can pass them along to the Discourse support team and fix them.

I’ll update here once we have more to share, but I wanted to share that we have been working on this in the background.


@tonic
Here’s another case:

https://discourse.llvm.org/t/rfc-adding-support-for-marking-allocator-functions-in-llvm-ir/59528

vs
https://groups.google.com/g/llvm-dev/c/I3gcB4lKm04

It seems to have cut off everything after the line of “========”.
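
For anyone who wants to look for more cases of this pattern, here is a rough heuristic sketch. It assumes the original message and the imported post are both available as plain text, and that the separator is a line made up only of four or more “=” characters; the pattern and the function name `cut_at_separator` are assumptions for illustration, not how Discourse detects these.

```python
import re

# Assumed separator pattern: a line consisting only of four or more "=" characters.
SEPARATOR = re.compile(r"^\s*={4,}\s*$")


def cut_at_separator(original: str, imported: str) -> bool:
    """True if the original has text after its first separator line and none of it is in the import."""
    lines = original.splitlines()
    for i, line in enumerate(lines):
        if SEPARATOR.match(line):
            tail = [l.strip() for l in lines[i + 1:] if l.strip()]
            return bool(tail) and not any(l in imported for l in tail)
    return False  # no separator line found in the original


# Hypothetical example messages.
original = "Summary of the RFC\n========\nDetailed proposal follows here.\n"
print(cut_at_separator(original, "Summary of the RFC"))  # True
print(cut_at_separator(original, original))              # False
```

This only flags candidates; a post flagged this way would still need to be compared against the archive by hand.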

Thanks. We are working with the Discourse team to automatically find these and update them. So I haven’t fixed up the list just yet, but I am checking with them if I should do that or wait.

Any update on this? (I just hit such a case today while searching the archives)

FTR, another one cut off after the first line of “=========” here:

original:
https://lists.llvm.org/pipermail/llvm-dev/2020-August/144174.html

I’ve pinged Discourse support for an update. The first priority is to learn how prevalent the problem is; the second is to work on the fix.

It’s clear from the next response there were questions, but they’re not shown in this post.

I’ll pass on the information I was provided.

  • Their email parsing works better when HTML versions of the email are available, because they take cues added by email clients in the HTML tags.
  • They are looking at improvements to their email parsing with only unstructured text.
  • A bug was found which affects parsing of certain emails when HTML is available, causing them to be parsed by text instead. They are working on a fix for this.

Some stats about our import:

237k total emails

text available?   HTML available?   ultimately parsed from   count    notes
yes               no                text                     134519
yes               yes               HTML                     89869
yes               yes               text                     11897    ← BUG
no                yes               HTML                     653

Please note that this data says nothing about how prevalent truncated emails are.
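
To put the numbers above in proportion, the short calculation below sums the four rows and works out what share of emails fell into the row marked BUG. It uses only the counts from the table; the variable names are illustrative, and the result says nothing about how many of those emails were actually truncated.

```python
# Counts from the import stats, keyed by (text available?, HTML available?, parsed from).
counts = {
    ("yes", "no", "text"): 134519,
    ("yes", "yes", "HTML"): 89869,
    ("yes", "yes", "text"): 11897,  # the row marked BUG
    ("no", "yes", "HTML"): 653,
}

total = sum(counts.values())
html_available = sum(n for (_, html, _), n in counts.items() if html == "yes")
bug = counts[("yes", "yes", "text")]

print(total)                                 # 236938, i.e. the "237k total emails"
print(round(100 * bug / total, 1))           # 5.0 (% of all emails)
print(round(100 * bug / html_available, 1))  # 11.6 (% of emails that had HTML available)
```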

This work is ongoing, and I will update as we have more information.

Another example:

vs
https://lists.llvm.org/pipermail/cfe-dev/2021-November/069423.html

Thank you. I am still working with Discourse to resolve this.