Llvm::format is dangerous

Recently, while implementing a new optimization, we ran into undefined behavior in LLVM code: some code was passing a 64-bit integer to a 32-bit format specifier in llvm::format. (See ⚙ D155093 [DWARF] Fix undefined behaviour in dwarf type printer .) printf-style format strings are generally sort of dangerous, but the danger is normally mitigated by -Wformat. However, llvm::format is not checked by -Wformat, so it’s actually more dangerous than a normal snprintf.

If you do actually enable format warnings for llvm::format, the compiler produces a bunch of warnings. Most of them looked like they weren’t actually undefined behavior, but still, a lot of mistakes.

I see the following paths forward to mitigate the issues here:

  • We use the functionality in new versions of clang to allow marking llvm::format with a “format” attribute even though it’s not varargs (⚙ D112579 Allow non-variadic functions to be attributed with `__attribute__((format))`). Only people using clang 16 or newer as a host compiler would get the checks, but that would at least give us some coverage on buildbots.
  • Make llvm::format varargs, at the cost of some efficiency, which would allow using a “format” attribute on all compilers that support such attributes. (Essentially making it equivalent to asprintf.)
  • Discourage use of llvm::format, in favor of other formatting functions like llvm::formatv and llvm::format_hex.

Suggestions welcome.

I am not going to implement it, but Clang depends on LLVM. Why cannot Clang Sema learn what llvm::format is and check it accordingly. Without these strange workarounds using attributes.

Huh. llvm::format seems like a weird combination of C and C++ idioms. In C++, we can actually propagate the type information in a way such that a proper implementation of llvm::format wouldn’t require users to worry about specifying size-modifiers – or even types – in the format string. (See also, absl::StrFormat which is designed to be basically a drop-in replacement for printf, but using the C++ type-system to avoid having any format-string-mismatch issues.)

I imagine we’re still a while away from being able to upgrade LLVM to C++20, but at that point, could we just switch to std::format? It also gets us compile-time format string checking while having a similar-ish API to llvm:format (at first glance, at least).

This sounds like a good starting point at least. It looks like @fay59’s ⚙ D132413 [NFC] Make format() more amenable to format attributes set us up for this already.

Another possible solution is adopting GitHub - fmtlib/fmt: A modern formatting library as a backbone for the llvm::format. But I guess it’s rather impossible.

My original plan was to solve all format issues in LLVM, but buildbots ran into issues that I couldn’t reproduce myself and I put the work aside. I do suspect that my change, if it landed, would have solved @efriedma-quic’s problem, as the list of affected modules does include llvm/DebugInfo.

Is this roughly what llvm:formatv does? Are there some implementation details / properties of fmt that we would want to potentially incorporate into formatv?

To be honest, I’m not sure
This library is resulted in std::format and std::print capabilities in C++20 and C++23 respectively. By default it has a python-like syntax for the formatting. But in the library there are printf-like functions also.
Additionally, there is a compile-time checking for the arguments to match the format string and the author claims it is the fastest formatting library out there, even faster than the printf.

llvm::formatv format strings are similar to std::format. I think the std::format APIs are less efficient writing to a raw_ostream, though. (Switching from OS << llvm::formatv("{}", x); to OS << std::format("{}", x); means allocating memory for an std::string.)

Either way, though, the format strings supported by those APIs are very different from printf format strings, so porting from llvm::format to llvm::formatv/std::fomat is non-trivial. And there are currently around 600 uses of llvm::format in tree.

Actually, there is a clang-tidy check that coverts printf-like functions to std::print: modernize-use-std-print. So I guess it can be applied to llvm::formatstd::format with some tweaks.

Only if the result doesn’t fit in the string’s internal buffer, which is 22 chars for libc++ and 15 chars for libstdc++. From a quick grep, it looks like many uses of llvm::format are for short strings that would not allocate.

What’s the difference between std::format and llvm::formatv (other than the std::string conversion)?

It’s not clear to me how switching from llvm::formatv("{}", x); to OS << std::format("{}", x); brings any safety?

I am happy to read your RFC to update the toolchain requirements to C++20. std::format is C++20 :frowning:

I don’t think anyone is proposing to move to std::format (and std::print when we allow C++23) immediately. Jus that it should be easier to replace all llvm::formatv uses with std ones once we do, so it should be favoured (in addition to be less dangerous).