Support for UTF encodings in std::format() and std::print() #68
Comments
On 18/03/2021 21.53, Tom Honermann wrote:
|std::format()| (in C++20) and |std::print()| (proposed in P2093 <https://wg21.link/p2093>) do not allow |char8_t|, |char16_t|, and |char32_t| based strings to be used for either the format string or for field arguments.
There are two distinct concerns:
1. If UTF strings are allowed as formatter strings, what conversions are performed on |char| and |wchar_t| based field arguments?
|std::string s = ...; std::format(u"{}", s); |
2. If UTF strings are allowed as field arguments, what conversions are performed when the format string is |char| or |wchar_t| based?
|std::u16string s = ...; std::format("{}", s); |
The answers to those questions may be dependent on one or both of:
* The literal encoding (execution character set) selected at compile-time (as is proposed in P2093 <https://wg21.link/p2093>).
* The locale dependent system encoding selected at run-time.
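To make concern 2 concrete, here is a minimal sketch of the conversion a caller has to perform by hand today before a `std::u16string` can be passed to a `char` based format string. It assumes the narrow result is meant to be UTF-8 (that is exactly the open question above, not a given), and `utf16_to_utf8` is a hypothetical helper name, not a standard library function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not part of the standard library).
// Transcodes UTF-16 to UTF-8; unpaired surrogates become U+FFFD.
// Assumption: the char based result is wanted in UTF-8.
inline std::string utf16_to_utf8(std::u16string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: needs a following low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // Low surrogate without a preceding high surrogate.
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With such a helper the caller would write something like `std::format("{}", utf16_to_utf8(s))`, which only gives the intended result if everything else in the program also treats `char` strings as UTF-8.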
printf can take a wchar_t string, and wprintf can take a char string.
In the first case, wcrtomb is used to convert,
in the second case, mbrtowc is used to convert.
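As a rough sketch of the second conversion (a char string widened through mbrtowc), the loop below converts under whatever locale is active at run time. The function name `widen_with_mbrtowc` is made up for illustration; the result for non-ASCII input depends entirely on the current locale.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: widen a multibyte string using the run-time locale,
// in the same way wprintf is described as handling a char string argument.
inline std::wstring widen_with_mbrtowc(const std::string& in) {
    std::wstring out;
    std::mbstate_t state{};  // Initial conversion state.
    const char* p = in.data();
    const char* end = p + in.size();
    while (p < end) {
        wchar_t wc;
        std::size_t n = std::mbrtowc(&wc, p, static_cast<std::size_t>(end - p), &state);
        if (n == static_cast<std::size_t>(-1) || n == static_cast<std::size_t>(-2)) {
            // -1: invalid sequence, -2: incomplete sequence at end of input.
            throw std::runtime_error("invalid or incomplete multibyte sequence");
        }
        if (n == 0) {
            n = 1;  // An embedded null character consumed one byte.
        }
        out += wc;
        p += n;
    }
    return out;
}
```

In the default "C" locale, `widen_with_mbrtowc("abc")` yields `L"abc"`; anything beyond the basic character set is locale dependent.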
For case 1 above, this seems to suggest that "s" is assumed
to be in the encoding that mbrtowc or wcrtomb would expect as
input (presumably, the locale-dependent (multibyte) encoding for w/char
strings), and the formatting produces a UTF-8 string as output.
For case 2 above, the result should be a w/char string, so "s"
needs to be converted to the respective (runtime) encoding.
It seems that the literal encoding is not relevant at all,
and one can just hope that the encoding of the literal
format string happens to just work when interpreted under
the runtime encoding. (Limiting the format string to the basic
character set will probably help, in practice. For example,
ASCII -> UTF-8 works, as does ASCII -> ISO 8859-x, but ASCII
-> EBCDIC doesn't.)
Jens
It may suggest that, and assumptions would have to be made. The conclusion that UTF-8 output is produced does not resonate with me. That certainly would not be the case on an EBCDIC based system.
That hope is certainly required in many cases, but it may be reasonable to make decisions based on literal encoding. For example, if the literal encoding is UTF-8, it may be reasonable to only claim conformance if the application is run in a UTF-8 environment. In that case, locale dependencies can be avoided.
Here is a possible model that takes into account both the literal encoding and the locale dependent run-time encoding.
I am no longer intending to pursue this direction.
As a C++ programmer, I would like to say that the lack of support for char8_t, char16_t, and char32_t in std::format is really bad, and honestly I have a hard time seeing why it is missing. If these are the only two concerns, then the solution is simple: don't allow a conversion at all where one is not possible. If the format text is char8_t, then all field arguments need to be char8_t too. I think that is how it already works for wide characters at the moment, as I was playing with the std::format code and it clearly says so (line 3594 of format.h):
To create a custom std::format function that supports all character types, only a little code duplication is needed at the moment (at least on Windows), and I made std::format work for all character types. (I didn't test chrono formatting, as the define used for the chrono formatting function declaration is the only one I left unchanged, since changing it would require changes to the code where it is used too.)
In my opinion it is simple to add support for these new character types, as all the support is already there. char8_t can use 100% of the code that exists for char, since for formatting char we already assume char is UTF-8 encoded and not ASCII, as is clearly seen in this function:
Support for char16_t is also there, as it is the same as wchar_t on Windows. They are 100% the same: when you create a new string with the char16_t type, VS reports it as text created with the L"" prefix even if it was created with the u"" prefix.
Support for char32_t is already in the format code, as there is a _Decode_utf function for char32_t already, and on Mac the wide character is 32 bits, so that also has to work. (I assume the std::format code is the same for all platforms, with just #defines selecting different code paths depending on OS differences.)
In my opinion, 99.99% of uses of the format code involve characters of the same type, and those are the most efficient ones. So please just enable support for them in std::format. All the difficulty of conversion is better left to a new class like std::convert, letting the user convert all text if needed before passing it to the formatter.
I mean, at the moment std::format does full UTF-8 decoding/encoding even for simple plain ASCII text, which is just silly I would say.
Thanks in advance, and if I can help in any way I will. It is really bad for C++ not to support this, as the main power of C++ is its ability to work on lots of platforms, and the main obstacle to making portable code safer and more compact is C++ text support, which unfortunately should be much better than it is now.
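The "no conversion at all" rule suggested above could be sketched as a compile-time check along the following lines. This is only an illustration under stated assumptions: `string_char`, `arg_ok`, and `same_char_type_v` are hypothetical names, nothing like them exists in std::format, and char8_t is left out to keep the sketch valid as C++17.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <type_traits>

// Hypothetical trait: character type of a string-like argument,
// or void when the argument is not treated as a string.
template <class T> struct string_char { using type = void; };
template <class C, class Tr, class A>
struct string_char<std::basic_string<C, Tr, A>> { using type = C; };
template <class C, class Tr>
struct string_char<std::basic_string_view<C, Tr>> { using type = C; };
template <> struct string_char<const char*> { using type = char; };
template <> struct string_char<const wchar_t*> { using type = wchar_t; };
template <> struct string_char<const char16_t*> { using type = char16_t; };
template <> struct string_char<const char32_t*> { using type = char32_t; };

// An argument is acceptable if it is not a string, or if its character
// type matches the format string's character type exactly.
// Single character arguments are not checked in this sketch.
template <class FmtChar, class Arg>
inline constexpr bool arg_ok =
    std::is_void_v<typename string_char<std::decay_t<Arg>>::type> ||
    std::is_same_v<typename string_char<std::decay_t<Arg>>::type, FmtChar>;

// True when every string argument uses the format string's character type.
template <class FmtChar, class... Args>
inline constexpr bool same_char_type_v = (arg_ok<FmtChar, Args> && ...);

static_assert(same_char_type_v<char16_t, std::u16string, int>);
static_assert(!same_char_type_v<char16_t, std::string>);
```

A formatting wrapper could `static_assert(same_char_type_v<CharT, Args...>)` to reject mixed character types at compile time and leave any transcoding to the caller.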
Thank you for your comments, @zoran12. I also agree that this is important. The simple answer for why there has been no progress adding support for the
Thanks for the response. I would like to help. I don't think I will be able to write a proposal, as these papers really need to be technical, but I hope I can do work in code that provides a solution or at least helps find one.
As for FMT, it already supports all character types: with fmt::format you can now format char8_t, char16_t, and char32_t text. I used it first and tried to turn it into a header-only version, as I am moving my engine code to be header-only (so you just include one header and #define one cpp to be ENGINE_CPP and everything should work). I didn't notice anything missing that I could help with there; all seems to work. I still have it included in my project as header-only and it works fine with all char types.
With the experience from converting FMT to header-only, and having seen that the same person created both FMT and the std::format code, I tried to see whether std::format could be made to work with all character types. I had almost done it, just by adding the missing template versions for the new char types, when I hit a wall with this error: C2491: 'std::numpunct<_Elem>::id' : definition of dllimport static data member not allowed
Realizing that I can't just add code like this, I created a new file that I am going to attach here now:
This file is a full copy/paste of the std format.h file, so I could edit it directly. It also contains a copy of the numpunct class so that I could avoid that error (the original class is in the xlocnum.h file). The new class is a copy of the old one with only a few changes; for example, the original class has this line that was creating the issue:
All other changes were simple ones, and nearly all just add specializations of classes and functions for char8_t, char16_t, and char32_t that are copies of ones already found in the code, with the types changed, e.g. from char to char8_t.
I honestly didn't expect this to work, but it did, and it seems to work perfectly:
I uploaded the stdFormat.hpp file (in a zip, as I couldn't upload the hpp file type). Just including it and using the fmt::std__vFormat or fmt::std__cFormat function lets you format any character type. All other functions are created the same way and now work for all character types (char, wchar_t, char8_t, char16_t, char32_t).
The only limit is that you can use only one character type for the format text and all field texts, as no formatters that would do conversion are added, since I wasn't sure how they should work. (I think ASCII char to all other types is created by default, as soon as I changed the supported character types to include all types.)
The only thing I didn't change was the chrono part: // _STATICALLY_WIDEN is used by since C++20 and by since C++23. That was using this function: template
As it could require changes in the chrono files to use the new version, these changes would need to go into the original format.h file.
So the FMT library already supports formatting all character types, and even the std version in format.h has about 90% of the support done, so it is kind of a shame not to finish it and finally have at least std::format working for all character types :)
@zoran12, I'll reply to a few points below, but this github issue is intended more for administrative purposes than for discussion. For further discussion, please post to the SG16 mailing list.
I understand that you were able to modify fmtlib to work for your purposes, but support for these types is not present in fmtlib at present as exhibited by https://godbolt.org/z/TbKdGb1hW. One way that you could help is to submit code changes and tests to fmtlib to add such support for everyone. That would be a great first step towards standardization.
Please note that there isn't just one
The prototyping you've done demonstrates that some locale enhancements are needed in the standard library to make this work. Per [locale.numpunct.general]p1, implementations are only required to provide the
The standard will need to be updated to add (at least) the missing locale facet specializations before support for other character types can be specified for
Oh, sorry for posting here. This is going to be my last message, as I am not sure how to post to the SG16 mailing list. I am subscribed and receiving SG16 mails, so do I only need to send mail to [email protected], making sure I use a proper subject line for better mail handling?
Mail communication like this isn't the best and makes people miss things. For example, you probably missed that fmt added support for all char types (the support was added in a new header file called xchar.h). So to make your example work on Compiler Explorer, simply adding #include <fmt/xchar.h> is enough. I can't generate a link, so I will post a screenshot:
It would be best, once you confirm that fmt works, to maybe start a new thread on the SG16 mailing list about bringing these changes to std::format, or at least to get information about its state from the person who wrote it (Victor Zverovich), as I think he is also on this list. I will do my best to join the SG16 mailing list conversation if I see I can help in any way :)
I only hope that at least making you aware of the state of fmt was worth the mess I made here (if you can remove/delete all my posts, please do so, so the mess goes away :) )
Exactly. I understand that mailing lists are unfamiliar to some people, but there is no substitute that can scale to the volume of information some of us process daily.
Thank you for the information about
I'm not sure why you are unable to generate compiler explorer links. Just click the 'Share' drop down menu and then click 'Short link' and copy the URL. I know Victor well; he has been a contributor to SG16 for many years now :)