It appears that Discourse has dropped some parts of imported emails during the import process. I’ve only noticed one example of this so far, and only by accident (I happened to be reading this thread, and the issue here is blatant). So I assume there are more similar problems throughout the import.
Check out the difference between this one-paragraph response, which makes it look like the poster failed to ever substantively reply to the questions:
(Note, there I’ve linked to the Google Groups mirror instead of the message in the official llvm-dev archives because the original message used colors/indents to denote quoting, which the Mailman archive does not render, making it rather unintelligible.)
I also wonder if this is the same (or a similar) problem to the message on the Discourse Retrospective thread in which jcranmer noted that some message text was lost on a (new) email reply sent to discourse.
Hmm, there is a Google Groups mirror… that seems like it would have been a relatively small step away from just replacing Mailman with Google Groups, which seems like it would be much simpler than this. Was that considered?
I don’t know how the process of import for Discourse works or how they validated it, but this is puzzling overall.
My approach would have been to implement this with a round-trip check and ensure (maybe with some normalization of the input) that we don’t lose information in the process, i.e. that we can recover the input from the output.
I guess the trade-off is in importing the mailing-list archives “as-is” vs importing something that “looks like it was written in Discourse originally”, the latter requiring (imperfect) heuristics to interpret the messages instead of just importing them.
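To make the round-trip idea concrete, here is a minimal sketch in Python. It is not how the Discourse importer works; `normalize`, `round_trips`, and the two toy importers are hypothetical names made up for illustration, and the normalization policy (line endings, trailing whitespace, runs of blank lines) is just one assumption about which differences are acceptable.

```python
import re


def normalize(text):
    """Erase differences we agree not to care about (an assumed policy)."""
    text = text.replace("\r\n", "\n")                      # unify line endings
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    text = re.sub(r"\n{3,}", "\n\n", text)                 # collapse blank runs
    return text.strip("\n")


def round_trips(original, import_fn, export_fn):
    """True if exporting the imported post recovers the (normalized) input."""
    return normalize(export_fn(import_fn(original))) == normalize(original)


# Hypothetical importers, only here to exercise the check.
def lossless_import(body):
    return {"raw": body}


def lossy_import(body):
    # Simulates the reported symptom: only the first paragraph survives.
    return {"raw": normalize(body).split("\n\n")[0]}


def export(post):
    return post["raw"]


email_body = "Hi\r\n\r\n> quoted question\r\nMy actual answer.  \r\n"
print(round_trips(email_body, lossless_import, export))
print(round_trips(email_body, lossy_import, export))
```

Run over the whole archive, a check like this would flag every message where the import is not recoverable, rather than relying on someone stumbling over a truncated post.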
We are aware of a few situations that caused emails to not be imported correctly. We are working with the Discourse support team to identify more of the cases and to fix them automatically. However, no data has been lost and we can always fix these up manually in the meantime. You can email discourse-admin@llvm.org with any of these import issues (or tag the admins on the post) so we can pass them along to the Discourse support team and fix them.
I’ll update here once we have more to share, but I wanted to share that we have been working on this in the background.
Thanks. We are working with the Discourse team to automatically find these and update them. So I haven’t fixed up the list just yet, but I am checking with them if I should do that or wait.
Their email parsing works better when HTML versions of the email are available, because they take cues added by email clients in the HTML tags.
They are looking at improvements to their email parsing with only unstructured text.
A bug was found which affects parsing of certain emails when HTML is available, causing them to be parsed by text instead. They are working on a fix for this.
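For anyone who wants to check which bucket a given message falls into, here is a rough sketch using Python's standard `email` package. It only reports whether a `text/plain` and/or `text/html` part is present; it says nothing about how Discourse actually chooses between them, and `available_parts` is a name made up for this example.

```python
from email import message_from_string
from email.policy import default
from typing import Tuple


def available_parts(raw: str) -> Tuple[bool, bool]:
    """Return (text_available, html_available) for a raw RFC 5322 message."""
    msg = message_from_string(raw, policy=default)
    # Collect the content types of all leaf parts. Note that this also counts
    # text attachments, which a real importer would probably want to skip.
    types = {
        part.get_content_type()
        for part in msg.walk()
        if not part.is_multipart()
    }
    return ("text/plain" in types, "text/html" in types)


plain_only = "Subject: hi\n\njust text\n"

both = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="b"\n'
    "\n"
    "--b\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--b\n"
    "Content-Type: text/html\n"
    "\n"
    "<p>hello</p>\n"
    "--b--\n"
)

print(available_parts(plain_only))
print(available_parts(both))
```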
Some stats about our import:
237k total emails
| text available? | HTML available? | ultimately parsed from | count | notes |
|---|---|---|---|---|
| yes | no | text | 134519 | |
| yes | yes | HTML | 89869 | |
| yes | yes | text | 11897 | ← BUG |
| no | yes | HTML | 653 | |
Please note that this data says nothing about how prevalent truncated emails are.
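As a quick sanity check on the numbers above, the four counts can be tallied as below. The figures are copied from the stats; the variable names are just for this sketch. The shares describe which parser path each email took, not how many posts ended up truncated.

```python
# (text available?, HTML available?, parsed from, count), copied from the stats
rows = [
    ("yes", "no", "text", 134519),
    ("yes", "yes", "HTML", 89869),
    ("yes", "yes", "text", 11897),  # the BUG row: HTML was available but unused
    ("no", "yes", "HTML", 653),
]

total = sum(count for *_, count in rows)
html_available = sum(count for _, html, _, count in rows if html == "yes")
bug = sum(
    count
    for _, html, parsed, count in rows
    if html == "yes" and parsed == "text"
)

print(total)  # 236938, i.e. the "237k total emails"
print(f"bug share of all emails: {bug / total:.1%}")
print(f"bug share of emails with HTML: {bug / html_available:.1%}")
```

So the bug path covers roughly 5% of all imported emails, or a bit under 12% of the emails that had an HTML version at all.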
This work is ongoing, and I will update as we have more information.