[PATCH] Let __attribute__((format(…))) accept OFStrings

[cc: cfe-dev, which I don't subscribe to]

Background: Jonathan Schleifer proposed a patch to add new format
specifiers %k/%K or %C/%S to print respectively single characters or
null-terminated strings of type "of_char32_t", which is a typedef for
char32_t (which in C is a typedef for uint_least32_t and in C++ is a
separate built-in character type).

Discussing offline, Jonathan seemed to agree (reluctantly ;)) with my
reaction that this patch was really needed only to paper over a
problem with the latest C and C++ standards — namely, that they
provide these new character types "char16_t" and "char32_t" but don't
provide any printf or scanf specifiers for them. This is inconsistent
with the previous standard's "wchar_t", which got its own "%lc" and
"%ls" format specifiers.
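To make the inconsistency concrete, here is a minimal sketch in C11. The helper names `format_wide` and `format_c32` are made up for illustration and are not part of any patch discussed here. It shows that a `wchar_t` string can go straight through "%ls", while a `char32_t` string has to be converted by hand, here with the standard `c32rtomb()` from `<uchar.h>`, which converts to the multibyte encoding of the current locale.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <stdio.h>
#include <string.h>
#include <uchar.h>    /* char32_t, c32rtomb (C11) */
#include <wchar.h>

/* wchar_t: the library converts the string for us via "%ls". */
static int format_wide(char *buf, size_t size, const wchar_t *ws)
{
    assert(buf != NULL && ws != NULL);
    return snprintf(buf, size, "%ls", ws);
}

/* char32_t: there is no conversion specifier, so each character is
 * converted with c32rtomb() and the resulting bytes are appended.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 on a conversion error or if the buffer is too small. */
static int format_c32(char *buf, size_t size, const char32_t *s)
{
    mbstate_t state;
    size_t used = 0;

    assert(buf != NULL && s != NULL);
    memset(&state, 0, sizeof state);

    for (; *s != 0; s++) {
        char tmp[MB_LEN_MAX];
        size_t n = c32rtomb(tmp, *s, &state);

        /* Keep one byte in reserve for the terminating NUL. */
        if (n == (size_t)-1 || used + n >= size)
            return -1;
        memcpy(buf + used, tmp, n);
        used += n;
    }
    if (used >= size)
        return -1;
    buf[used] = '\0';
    return (int)used;
}
```

Calling `format_c32(buf, sizeof buf, U"hi")` fills `buf` with the converted bytes. What comes out for non-ASCII input depends on the current locale, which is exactly the kind of detail a standard specifier would have to pin down.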

I propose that it would be a very good idea for the next standard to
provide format specifiers for char16_t and char32_t. I would nominate
"%hc" (char16_t) and "%Lc" (char32_t), and the matching "%hs" (array
of char16_t) and "%Ls" (array of char32_t).

Questions:

(A) Does this proposal step on the toes of any existing proposal —
e.g., is "printing char32_t" already in the pipeline to be fixed in
C1y? I'm only vaguely aware of the stuff going into C++1y and I don't
follow C1y at all.

(B) Would Jonathan's -Wformat patch find greater acceptance if ObjFW's
OFString formatting functions adopted the %Lc/%Ls syntax instead of
the previously proposed %C/%S? (However, I believe Jonathan is right
about still needing a new __attribute__((format(__OFString__,1,2))) to
deal with some other things.)

(C) I'm sure I'll be told anyway ;) but where would be a proper forum
to bring this up with an eye to standardization? We've got a
complicated dance here among Clang, Apple's libc, Apple's NSLog, GCC,
glibc, the C committee, and probably some others I've forgotten. My
primary goal here is really to convince someone one step closer to
the C committee that this would be a good idea and they should
champion it in the committee. :)

Feel free to contact me offline if you don't want to spam the list.

–Arthur

Arthur, thanks for providing this very clear analysis of the situation, which has been missing so far.

I don't know enough about the format specifiers to comment on this, but it's definitely on topic and something we can work through together, especially if we're moving C1y forward on the side.

Let's keep it on list and avoid spawning off-list discussions until there's a course of action, but try to keep it contained to this thread so people can tune out if they want. It is after all directly related to one of the big "selling points" of clang, namely fantastic format specifier checking.

One of the problems with the original patch was that it didn't have much context, and people were trying to review it even after a newer version was posted to a separate thread. That, along with having non-standard extensions proposed by someone not subscribed to the list, tends to send warning signals during patch review. On the other hand, having a proper discussion like this has been reassuring. Even if we don't ultimately get answers to all your questions, I'd be more comfortable with the proposed changes because the background and intent are now out in the open.

Alp.

This alone would not solve the problem. C11 has the issue that using Unicode for char16_t and char32_t is recommended but not required, so implementors are free to use another encoding.

So, to really fix this, C1y would need to require Unicode, like C++11 did (no idea why C++11 got it right and they screwed it up in C11 after copying char{16,32}_t over).

The idea is that in the meantime, I do the same as Apple does. To have a format string as an object, it needs special handling anyway. So I want to introduce the new format string type __OFString__, which takes an OFString object as the format string. I need that anyway, no matter what the outcome of this is.

Now that I need my own format string type anyway, I don't see a reason not to do the same as Apple: Interpret %C and %S differently if the format string is an OFString. Apple does *exactly* the same. They special case it to unichar / const unichar*, I special case it to of_unichar_t / const of_unichar_t*.

This does not hurt anybody, as it does not modify any existing behaviour, but instead introduces a new format string type with new behaviour. This is completely independent from the shortcomings of the standard and I'd *really* like to get this in. I need __OFString__ as a format string type anyway, so while I'm at it, I don't see any problem with doing the same special casing Apple does.
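For readers who have not used the existing attribute, here is a minimal sketch of what a format string type buys you on a plain C function. `log_format` is a made-up name for illustration. The attribute says that argument 3 is a printf-style format string and that the arguments to check start at argument 4. The proposed __OFString__ type would play the same role for functions and methods whose format argument is an OFString object.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Argument 3 is the format string; the variadic arguments that the
 * compiler should check against it start at argument 4. */
__attribute__((format(printf, 3, 4)))
static int log_format(char *buf, size_t size, const char *fmt, ...)
{
    va_list args;
    int written;

    assert(buf != NULL && fmt != NULL);

    va_start(args, fmt);
    written = vsnprintf(buf, size, fmt, args);
    va_end(args);
    return written;
}
```

With this in place, a call such as `log_format(buf, sizeof buf, "%d", "x")` is diagnosed by -Wformat at compile time, exactly as it would be for printf itself.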

While I do map of_unichar_t to C(++)'s char32_t, that does not mean it is the same as char32_t. char32_t is not required to be Unicode - of_unichar_t is. So if C1y introduces a length modifier for char32_t, it would still not be the same: If the system does not use Unicode for char32_t, printf would convert this non-Unicode encoding to whatever multibyte encoding is used for the current locale. So if you put a Unicode character in a char32_t on these systems, it will go wrong.

With of_unichar_t OTOH, I *require* it to be Unicode. Thus I can always assume it is Unicode and convert it to the right multibyte encoding.
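As a rough illustration of why a guaranteed-Unicode type makes this easy, here is a sketch of converting one code point to UTF-8. This is not ObjFW's actual code: `utf8_encode` is a hypothetical helper, and `uint_least32_t` stands in for of_unichar_t. It also only covers the case where the target multibyte encoding is UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out[0..3].
 * Returns the number of bytes written, or 0 if cp is not a valid
 * Unicode scalar value (a surrogate, or above U+10FFFF). */
static size_t utf8_encode(uint_least32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not characters */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* beyond the Unicode range */
}
```

No locale lookup is needed to interpret the input, because the meaning of the value is fixed. With a char32_t whose encoding is implementation-defined, the function could not be written this way.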

So, IMHO, if you really want to fix the standard and do it without any extensions (this could take years, so please, if you are for a standard fix, consider my patch nonetheless), the following would be needed:

* Require char16_t and char32_t to be Unicode (like C++11 does)
** Not required by me, but required to do it right: Require that an array of char16_t may contain UTF-16, so that it is correctly converted to the required multibyte encoding
* Add a length modifier for char16_t / char16_t array / char32_t / char32_t array
** The length modifier for char16_t array should accept UTF-16
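To show what "accept UTF-16" means for the char16_t array case, here is a sketch of decoding one code point, including surrogate pairs. `utf16_decode` is a hypothetical helper, not anything from the standard or from the patch, and `uint_least16_t` / `uint_least32_t` stand in for char16_t / char32_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from a UTF-16 sequence of `len` code units.
 * Stores the code point in *cp and returns the number of units
 * consumed (1 or 2), or 0 if the input is empty or ill-formed. */
static size_t utf16_decode(const uint_least16_t *s, size_t len,
                           uint_least32_t *cp)
{
    assert(s != NULL && cp != NULL);

    if (len == 0)
        return 0;

    /* Anything outside the surrogate range is a code point by itself. */
    if (s[0] < 0xD800 || s[0] > 0xDFFF) {
        *cp = s[0];
        return 1;
    }

    /* A high surrogate must be followed by a low surrogate. */
    if (s[0] <= 0xDBFF && len >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
        *cp = 0x10000
            + (((uint_least32_t)(s[0] - 0xD800) << 10)
               | (uint_least32_t)(s[1] - 0xDC00));
        return 2;
    }

    return 0;   /* lone or misordered surrogate */
}
```

A specifier that treated each char16_t independently would get every character outside the Basic Multilingual Plane wrong, which is why the array case needs this extra requirement.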

And ideally, it should also add the other wchar_t functions for char{16,32}_t - I never got why they were omitted.

But, again, all this will take years. So please, let me just do the same thing for my framework that Apple does for theirs. This worked well for them for years, and it does work well for me too. It will not hurt anybody, will not interfere with anything else and will make me and the users of my framework happy ;).

Thanks.

I would love to see the printf attribute code generalised so that we could have a pragma to declare a formatting character and the type it expected. For the FreeBSD kernel, we have a set of printf extensions for printing complex data structures, and a lot of modern libc implementations provide a mechanism for registering handlers for format strings. It would be great if all of these could be supported, without hard-coding them.

David

I have to agree, a #pragma would be even better. I have to admit that I only thought about a compiler option and not a pragma, and with options this would have become very ugly. But pragmas actually are a nice solution.

But there are a few questions left:
Will the pragma add a specifier to all format string types, or only to a specific format string type?

I would suggest having several pragmas, e.g.:
#pragma clang format type(formatStringType, type)
#pragma clang format add(formatStringType, formatSpecifier, type)

This would solve both problems for me: Adding OFString as a valid type for a format string and adding the specifiers. So I could use this:

#pragma clang format type(__OFString__, OFString)
#pragma clang format add(__OFString__, "C", of_unichar_t)
#pragma clang format add(__OFString__, "S", const of_unichar_t *)

The hard part would be that the format specifier actually needs to be parsed so that something like this would also be possible:
#pragma clang format add(printf, "llc", char32_t)
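To give an idea of the parsing involved, here is a small sketch in plain C, not Clang code. The table mirrors the pragmas proposed above, using the pointer type for "S" that was described earlier in the thread. The lookup skips flags, field width and precision, then matches the longest registered specifier (length modifier plus conversion character). `struct format_rule`, `rules` and `expected_type` are all made up for illustration, and types are just strings here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One registered rule: in format flavour `flavour`, the specifier `spec`
 * (length modifier plus conversion character) expects an argument of
 * type `type`. */
struct format_rule {
    const char *flavour;
    const char *spec;
    const char *type;
};

/* Hypothetical rules mirroring the pragmas proposed in this thread. */
static const struct format_rule rules[] = {
    { "__OFString__", "C",   "of_unichar_t" },
    { "__OFString__", "S",   "const of_unichar_t *" },
    { "printf",       "llc", "char32_t" },
};

/* `after_percent` points just past a '%'. Skip flags, field width and
 * precision, then return the expected type of the longest registered
 * specifier that matches, or NULL if none is registered. */
static const char *expected_type(const char *flavour,
                                 const char *after_percent)
{
    const char *p = after_percent;
    const char *best = NULL;
    size_t best_len = 0;
    size_t i;

    assert(flavour != NULL && after_percent != NULL);

    p += strspn(p, "-+ #0");                      /* flags */
    p += strspn(p, "0123456789*");                /* field width */
    if (*p == '.')
        p += 1 + strspn(p + 1, "0123456789*");    /* precision */

    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = strlen(rules[i].spec);

        if (strcmp(rules[i].flavour, flavour) == 0
            && strncmp(p, rules[i].spec, n) == 0
            && n > best_len) {
            best = rules[i].type;
            best_len = n;
        }
    }
    return best;
}
```

A real implementation would also have to merge these rules with the built-in specifiers and decide what happens when a pragma redefines an existing one. This sketch ignores both questions.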

Any thoughts about this? I think this would be great and better than hardcoding it. I think this would be a solution to make everybody happy. I'd start writing a patch myself right away, but I am not familiar enough with the internals of Clang yet ;). I'll see what I can do, but it might take a while due to limited time. If someone else also likes this, feel free to go ahead and do it, you'll most likely have something useful before I do ;).

Another benefit of this approach: Getting rid of all non-standard specifiers in Clang. glibc and Apple could add their extensions in their system headers using those pragmas then.

Having a custom type is meaningless if clang can't interpret it as a literal string at compile time, as it would not be able to check the format arguments.
How are you planning to tell clang how to interpret a custom type as a string literal?

-- Jean-Daniel

You are right. You would need to tell it that it should have e.g. printf syntax, but accept a constant objc string whose type you give (Clang can always interpret it if it generates that constant string anyway). So maybe more like this:

#pragma clang format type(__OFString__, printf, OFString)

Then Clang would accept an OFString* and its subclasses (like OFConstantString*), and it would be able to interpret the string: it can generate it after all, and with my earlier patch it could also handle it. The only difference is that it wouldn't be hardcoded. After that, new format specifiers / length modifiers could be added to that new format string type without modifying what printf accepts.

This seems the wrong way around. You'd want to define a per-method thing, not a per-receiver thing, so we'd just want to define a new printf flavour and add things to it.

David

Uhm, this is exactly what I meant: Add a new printf flavour, but let it accept an OFString* (or subclass) instead of a const char*. Or how exactly are we misunderstanding each other here?

You can only have one format of constant string per compilation unit, so this doesn't seem important. The type of the constant string argument is something that is defined by the parameter type, it doesn't need to be part of the printf extension format.

Instead, you want to be able to define printf-like method sets, and then attach these to methods as you currently do with the standard printf-like things.

David

That's how I would like it to work, but currently, the printf format attribute enforces the string type.
I'd like to see this restriction relaxed, as it prevents fun things such as supporting the ObjC format using a printf implementation that supports customization (like xprintf) and letting the compiler check the string.

If we relax this restriction, it would then be easier to support the OFString case, as it would eliminate the argument type issue.

-- Jean-Daniel

You are right, it would be possible to just allow the printf format string type for all constant strings, either C strings or ObjC strings or C(++)11 char{16,32}_t literals, or even wchar_t. Or C++11 custom string literals. That would make __NSString__ and __OFString__ unnecessary and just allow format(printf, …) for all of them. I think this is a cleaner solution and the way to go!

I would love to see the printf attribute code generalised so that we could have a pragma
to declare a formatting character and the type it expected. ...

I have to agree, a #pragma would be even better.

[...]

#pragma clang format type(__OFString__, OFString)
#pragma clang format add(__OFString__, "C", of_unichar_t)
#pragma clang format add(__OFString__, "S", const of_unichar_t *)
#pragma clang format add(printf, "llc", char32_t)

Sounds like a great idea to me!

(And formats for standard functions like strfmon() and strftime()
could be implemented in the same way, with header-only patches instead
of special cases in the compiler. In practice I'm sure the compiler
would want to act as if a bunch of these pragmas were hard-coded at
the beginning of each translation unit... but just to unify all those
codepaths will still be a HUGE step forward from the status quo.)
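For context, this is roughly what the hard-coded strftime flavour looks like from the user's side today. `format_time` is a made-up wrapper for illustration. The `format(strftime, ...)` attribute is an existing GCC/Clang feature, and the 0 in the last position says there are no variadic arguments to check, only the format string itself.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Argument 3 is a strftime-style format string; the trailing 0 means
 * there are no variadic arguments for the compiler to check. */
__attribute__((format(strftime, 3, 0)))
static size_t format_time(char *buf, size_t size, const char *fmt,
                          const struct tm *tm)
{
    assert(buf != NULL && fmt != NULL && tm != NULL);
    return strftime(buf, size, fmt, tm);
}
```

Under the pragma approach, the set of conversions accepted for this flavour could in principle come from declarations in `<time.h>` rather than from a table inside the compiler.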

–Arthur