I believe I’m not the first to notice this issue, but I still want to bring it up.
We have a significant number of crash-related issues (not limited to the Clang frontend, but also including LLVM IR and the backend), with new ones appearing almost every day. While this is not surprising for a large project like LLVM, I’ve realized that while some of these issues come from real projects, others do not: they appear to be generated by fuzzers for academic research purposes.
A friend of mine privately shared some data with me today, which he collected partly out of curiosity and partly through some social engineering techniques. We were surprised to find that a considerable number of issues were filed for the purpose of publishing papers; some of these papers have already been published, and others might be under submission.
Although we are not opposed to using fuzzers to find bugs, and these issues are not necessarily harmful to us, they do waste the bandwidth and effort of us maintainers to some extent: they get triaged as crash-on-invalid, crash-on-valid, or even just crash, mixing with and sometimes overwhelming issues from real users. AFAIK, GitHub currently gives us no handy way to tell apart issues that were generated by fuzzers from those that were not.
Since this data involves GitHub accounts, real identities, and academic paper information, I’m only sharing some statistics here about issues generated using fuzzers.
Meanwhile, we propose adding a new issue label, e.g. “generated-by-fuzzers”, to distinguish fuzzer-generated issues from those filed by real users. Of course, this requires asking the OP about the source of the reproducer; while not 100% reliable, it at least gives maintainers a way to filter out some low-priority issues.
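As a sketch of how triagers could use such a label (the label name here is only the suggested one above, not an existing label in llvm/llvm-project), GitHub’s issue search filters would let anyone include or exclude these issues:

```text
# Hide fuzzer-generated issues when triaging:
is:issue is:open -label:generated-by-fuzzers

# Or list only the fuzzer-generated ones:
is:issue is:open label:generated-by-fuzzers
```

This works with GitHub’s standard `label:` / `-label:` search qualifiers, so no extra tooling would be needed beyond creating the label itself.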
Any feedback is appreciated.