Performance of std::remove_if in F18 file I/O

Hi flang-dev

During code review of the patch that switches F18's file handling code from C library to C++ library routines [1], we hit a case that seemed to show the C++ std::remove_if routine being much slower than the C routine on some platforms. The routine in question is RemoveCarriageReturns in lib/Parser/source.cpp [2], and the findings so far are at [3][4][5]. The performance numbers differ considerably from platform to platform.

Peter Klausler made a small performance harness around the two proposed implementations plus a third, naïve implementation that we would expect to be the slowest (included for reference rather than proposed for use in F18). I have attached this harness, with a few NFC (no functional change) modifications, to this email.
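For readers without the attachment, the three approaches can be sketched roughly as below. This is not the attached harness or the F18 source: the function names are hypothetical, and the sketch assumes the routine drops every '\r' byte in place and returns the new length.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

// Hypothetical names for illustration only. Each function compacts the
// buffer in place and returns the number of bytes kept.
// Assumption: every '\r' byte is removed.

// C library style: locate each '\r' with memchr and slide the run of
// bytes before it down with memmove.
std::size_t RemoveCrMemchr(char *buffer, std::size_t bytes) {
  std::size_t wrote = 0;
  char *p = buffer;
  while (bytes > 0) {
    char *cr = static_cast<char *>(std::memchr(p, '\r', bytes));
    if (!cr) {
      // No more carriage returns: move the tail and stop.
      std::memmove(buffer + wrote, p, bytes);
      wrote += bytes;
      break;
    }
    std::size_t chunk = static_cast<std::size_t>(cr - p);
    std::memmove(buffer + wrote, p, chunk);
    wrote += chunk;
    p = cr + 1;          // skip the '\r' itself
    bytes -= chunk + 1;
  }
  return wrote;
}

// C++ library style: std::remove_if with a predicate.
std::size_t RemoveCrRemoveIf(char *buffer, std::size_t bytes) {
  char *end = std::remove_if(buffer, buffer + bytes,
                             [](char c) { return c == '\r'; });
  return static_cast<std::size_t>(end - buffer);
}

// Naive byte-by-byte loop, used only as a reference point.
std::size_t RemoveCrLoop(char *buffer, std::size_t bytes) {
  std::size_t wrote = 0;
  for (std::size_t i = 0; i < bytes; ++i) {
    if (buffer[i] != '\r') {
      buffer[wrote++] = buffer[i];
    }
  }
  return wrote;
}
```

All three should yield the same bytes and the same length for any input; only the running time is expected to differ.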

Is anyone able to give this a run on their system and report their results? Please enable optimisation whilst compiling as this will match the F18 configuration we most care about. Please include the compiler and C & C++ standard library versions used as well as the architecture and platform. We've not tried this at all on Windows, so we'd be interested if anyone can provide that info.

The implementation using std::remove_if is preferred on LLVM-like style grounds, so we would like to make this change, but only if the new code is in the same performance ballpark as the existing code. We don't want to significantly regress the performance of F18 file reading in the process.

Thanks
Rich

[1] https://github.com/flang-compiler/f18/pull/1032
[2] f18/source.cpp at 96c6be633ff65ec6d84c18e0d14393137d8097dd · flang-compiler/f18 · GitHub
[3] Replace manual mmap with llvm::MemoryBuffer by DavidTruby · Pull Request #1032 · flang-compiler/f18 · GitHub
[4] Replace manual mmap with llvm::MemoryBuffer by DavidTruby · Pull Request #1032 · flang-compiler/f18 · GitHub
[5] Replace manual mmap with llvm::MemoryBuffer by DavidTruby · Pull Request #1032 · flang-compiler/f18 · GitHub

klauslerbench.cpp (2.45 KB)

The simple loop is in the benchmark not because it was expected to be slow, but as a sanity check on a compiler's ability to optimize remove_if: with good inlining, the remove_if instantiation should reduce to something like the simple loop. If remove_if is somehow *faster* than the simple loop, then I would question the validity of the experiment.
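One way to run that sanity check is sketched below, under stated assumptions: the helper names (StripLoop, StripRemoveIf, TimeOnce, MakeInput) and the synthetic input are made up for illustration and are not taken from klauslerbench.cpp. A single timed call is noisy, so a real run would repeat each measurement and build with optimisation enabled.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration only.

// Simple loop: copy every byte that is not '\r'.
std::size_t StripLoop(char *buffer, std::size_t bytes) {
  std::size_t wrote = 0;
  for (std::size_t i = 0; i < bytes; ++i) {
    if (buffer[i] != '\r') {
      buffer[wrote++] = buffer[i];
    }
  }
  return wrote;
}

// The same operation expressed with std::remove_if.
std::size_t StripRemoveIf(char *buffer, std::size_t bytes) {
  char *end = std::remove_if(buffer, buffer + bytes,
                             [](char c) { return c == '\r'; });
  return static_cast<std::size_t>(end - buffer);
}

// Time one call of fn on a private copy of the input (taken by value);
// returns elapsed seconds and stores the resulting length in outLen.
template <typename F>
double TimeOnce(F fn, std::string input, std::size_t &outLen) {
  auto start = std::chrono::steady_clock::now();
  outLen = fn(&input[0], input.size());
  auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}

// Synthetic input: the same CRLF-terminated line repeated.
std::string MakeInput(std::size_t lines) {
  std::string text;
  for (std::size_t i = 0; i < lines; ++i) {
    text += "      x = x + 1\r\n";
  }
  return text;
}
```

Comparing TimeOnce(StripLoop, ...) against TimeOnce(StripRemoveIf, ...) on the same input gives the two numbers to compare; if the remove_if time comes out well below the loop time, the measurement setup deserves a second look before drawing conclusions.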