WG21 P1949: Improve support for Unicode characters in identifiers #48
cppreference.com has a more informative list of the ranges of allowed characters in identifiers. |
It's unlikely to happen now, but if at all possible it'd be really good to normalize identifiers to NFC. |
@ubsan agreed, it is strongly encouraged by UAX#31. However, it is putting the cart before the horse. |
That is not strictly correct as identifiers can contain
I don't agree with this conclusion. The standard is clear regarding how physical source file characters are mapped to the compiler's internal encoding. Source files are portable so long as the compilers used with them 1) support the actual source file encoding, and 2) are correctly informed about the source file encoding. In my opinion, it is that latter case that we need to improve. |
That's the definition of not portable. Interestingly, Microsoft solves that particular problem by always parsing identifiers as UTF-8 regardless of the actual encoding of the file. The standard is not clear; it is completely implementation-defined, a.k.a. not portable. As a data point, I learned today that vcpkg builds every package on Windows with /utf-8. |
You’ll have to walk me through to that conclusion.
I’m not sure what you mean by that. Perhaps you mean that identifiers are transcoded from the source file encoding to UTF-8 and then used in that form? Microsoft uses UTF-8 as the internal encoding, so that doesn’t seem surprising.
What isn’t clear?
I don’t disagree, but that doesn’t make it the wrong choice from a backward compatibility and migration perspective.
Yes, I’ve discussed this with Robert previously. If I recall, he had done some scans and found little use of non-ASCII characters. I don’t find that at all surprising within the Windows ecosystem though since the default source file encoding for the Microsoft compiler is locale sensitive. Programmers on Windows that distribute source files have never been able to assume an encoding other than ASCII (and even that breaks with Shift-JIS). I don’t think the vcpkg experience generalizes particularly well. |
No, they are NOT transcoded. The sequence of bytes making up the identifier seems not to be parsed using the same encoding as the rest of the file. Example provided by @ubsan: https://gcc.godbolt.org/z/O0309o |
That works in a pre-internet, mono-platform environment. I cannot trust that. Compilers do not interpret source in a consistent fashion. [And as you mentioned, that forces people to live in an ASCII-only world; the current solution is to compile everything with /utf-8.] |
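If we want Unicode identifiers, notwithstanding escape sequences, we need to ensure that:
struct é;
static_assert(is_same_v<unqualid(u8"é"), é>);
should be a valid program (unqualid is a utility that transforms a string into an identifier, part of the ongoing metaclasses work). [é is an example; I'm not suggesting that it should be a valid variable name, I haven't studied UAX#31 enough yet.] Note that this presents an interesting issue: name_of is specified in the TS to return an NTBS in the execution encoding. |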
Interesting paper but this bit
is a deal breaker for me: this makes Unicode identifiers unusable with reflection, ABI, etc. I'm not saying people should start putting non-ASCII identifiers in their interfaces, but if we want to give that ability, it needs to be reliable. |
I think this conclusion is incorrect. I think what you are seeing is typical encoding confusion. In the example you provided, UTF-8 source code is being provided to the compiler, but the compiler is being told to interpret it as Windows 1252. The character in question, 🚙 (U+1F699 RECREATIONAL VEHICLE) has a UTF-8 representation of F0 9F 9A 99. In Windows 1252, this corresponds to "ðŸš™" (U+00F0, U+0178, U+0161, U+2122). Microsoft's documentation for allowed identifiers (https://docs.microsoft.com/en-us/cpp/cpp/identifiers-cpp?view=vs-2019) lists which Unicode code points are allowed. If you cross check that list with the Unicode code points for those characters, you'll see that each one is allowed in identifiers. As for Godbolt then displaying the original Unicode character in the disassembly window, I believe that is technically a bug in Godbolt. The disassembly output is very likely Windows 1252, but is being interpreted as UTF-8. |
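To make the encoding confusion described in the comment above concrete, here is a minimal sketch that decodes the same four bytes twice, once as UTF-8 and once as Windows 1252. The function names `decode_utf8` and `decode_cp1252` are illustrative only and are not taken from any compiler. The UTF-8 decoder assumes well-formed input and does no validation, and the five unassigned Windows 1252 bytes are mapped to U+FFFD as an assumption of this sketch.

```cpp
#include <array>
#include <cstddef>
#include <string_view>
#include <vector>

// Code points for Windows-1252 bytes 0x80..0x9F.
// 0xFFFD marks the five bytes that Windows-1252 leaves unassigned (sketch assumption).
constexpr std::array<char32_t, 32> kCp1252High = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178};

// Interpret every byte as one Windows-1252 character.
std::vector<char32_t> decode_cp1252(std::string_view bytes) {
    std::vector<char32_t> out;
    for (unsigned char b : bytes) {
        if (b >= 0x80 && b <= 0x9F) {
            out.push_back(kCp1252High[b - 0x80]);
        } else {
            out.push_back(b);  // identical to Latin-1 outside 0x80..0x9F
        }
    }
    return out;
}

// Interpret the bytes as UTF-8. Assumes well-formed input; no validation.
std::vector<char32_t> decode_utf8(std::string_view bytes) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < bytes.size();) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        // Number of continuation bytes implied by the lead byte.
        int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
        if (i + extra >= bytes.size()) break;  // truncated sequence: stop
        char32_t cp = extra == 0 ? lead : (lead & (0x3F >> extra));
        for (int k = 1; k <= extra; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Calling `decode_utf8("\xF0\x9F\x9A\x99")` yields the single code point U+1F699, while `decode_cp1252` on the same bytes yields U+00F0, U+0178, U+0161, U+2122, which is why one intended character can end up as a four-character identifier.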
I don't see how that is relevant. The claim I made is that "Source files are portable so long as the compilers used with them 1) support the actual source file encoding, and 2) are correctly informed about the source file encoding". All compilers don't have to have the same default behavior for source files to be portable.
Please don't tell people to use |
These are ABI issues and outside our purview.
It isn't at all clear to me that unqualid should accept a u8 string.
I think that is probably what is desired almost all of the time. |
That is the paper that brought UAX#31 into the standard wording. See [lex.name]p1. Note that the paper is simultaneously WG21 N3146.
The cited text pretty much matches what we just decided for file names for P1689.
I don't agree with that conclusion.
I do want to give programmers that ability and I agree it needs to be reliable. But I think there are multiple approaches to the problem with various pros and cons and it isn't evident to me that all implementors need to solve problems the same way. |
On Sat, Aug 3, 2019, 11:36 PM Tom Honermann ***@***.***> wrote:
no, they are NOT transcoded, the sequence of bytes making the identifier
seems not to be parsed using the same encoding as the rest of the file
I think this conclusion is incorrect. I think what you are seeing is
typical encoding confusion. In the example you provided, UTF-8 source code
is being provided to the compiler, but the compiler is being told to
interpret it as Windows 1252. The character in question, 🚙 (U+1F699
RECREATIONAL VEHICLE) has a UTF-8 representation of F0 9F 9A 99. In Windows
1252, this corresponds to "ðŸš™" (U+00F0, U+0178, U+0161, U+2122).
Microsoft's documentation for allowed identifiers (
https://docs.microsoft.com/en-us/cpp/cpp/identifiers-cpp?view=vs-2019)
lists which Unicode code points are allowed. If you cross check that list
with the Unicode code points for those characters, you'll see that each one
is allowed in identifiers. As for Godbolt then displaying the original
Unicode character in the disassembly window, I believe that is technically
a bug in Godbolt. The disassembly output is very likely Windows 1252, but
is being interpreted as UTF-8.
You might be right, and it would make more sense. We would need to run more
tests, because it seemed that the Microsoft implementation correctly filters
out some Unicode whitespace characters (but not all).
|
On Sat, Aug 3, 2019, 11:53 PM Tom Honermann ***@***.***> wrote:
If we want Unicode identifiers, notwithstanding escape sequences, we need
to ensure that:
These are ABI issues and outside our purview.
struct é;
static_assert(is_same_v<unqualid(u8"é"), é>);
should be a valid program (unqualid is a utility that transforms a string
into an identifier, part of the ongoing metaclasses work)
It isn't at all clear to me that unqualid should accept a u8 string.
Sure, it's not necessary, given that identifiers and string literals are
interpreted similarly.
Note that this presents an interesting issue: name_of is specified in the TS
to return an NTBS in the execution encoding
I think that is probably what is desired almost all of the time.
Not in the presence of Unicode identifiers, because name_of would give you
gibberish.
|
I do want to give programmers that ability and I agree it needs to be
reliable. But I think there are multiple approaches to the problem with
various pros and cons and it isn't evident to me that all implementors need
to solve problems the same way.
Implementation-specific behaviors should be a last resort. I routinely work
with 3 compilers on many platforms and I need to trust my tools. Failing to
provide portable solutions leads to people restricting themselves to the
portable subset, which is one of the reasons why nobody currently uses
Unicode identifiers. A lot of people support a lot more platforms than I do.
One problem is that input methods vary greatly between platforms and
text editors, so people have very little way to ensure that they write code
in a consistent normalization form.
Which calls for normalization.
I am also a bit uncomfortable (well, a lot) going directly against the
Unicode recommendations.
For info, at a glance:
Perl 6 and Python 3 normalize
Perl 5 has a guideline asking people to normalize
Go does not support combining characters, but they are considering
normalizing for Go 2: golang/go#27896
Rust does not support Unicode identifiers but is considering NFC for when it
does.
Swift does not normalize, which causes some issues; there is a desire to fix
this, but for them it is technically an API break
https://forums.swift.org/t/pitch-unicode-equivalence-for-swift-source/21576
Julia normalizes: JuliaLang/julia#5434
D does not normalize, but Walter Bright notes that people avoid using Unicode
identifiers
https://news.ycombinator.com/item?id=20320151
C# kind of requires normalization:
An identifier in a conforming program must be in the canonical format defined by Unicode Normalization Form C, as defined by Unicode Standard Annex 15. The behavior when encountering an identifier not in Normalization Form C is implementation-defined; however, a diagnostic is not required.
I don't know if I missed anything relevant.
|
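The normalization concern raised in the comment above can be illustrated with a small sketch. The identifier é can be entered either precomposed (U+00E9) or as e followed by U+0301 COMBINING ACUTE ACCENT; the two look the same but compare unequal code point by code point. `toy_compose_acute` is a hypothetical helper covering five letters only. It is not real NFC, which needs the full Unicode data, for example through a library such as ICU.

```cpp
#include <string>
#include <unordered_map>

constexpr char32_t kCombiningAcute = 0x0301;  // U+0301 COMBINING ACUTE ACCENT

// Toy composition: fold "base letter + U+0301" into the precomposed letter.
// Covers a, e, i, o, u only; everything else is passed through unchanged.
// This is NOT a conforming NFC implementation.
std::u32string toy_compose_acute(const std::u32string& in) {
    static const std::unordered_map<char32_t, char32_t> table = {
        {U'a', char32_t{0x00E1}}, {U'e', char32_t{0x00E9}},
        {U'i', char32_t{0x00ED}}, {U'o', char32_t{0x00F3}},
        {U'u', char32_t{0x00FA}}};
    std::u32string out;
    for (char32_t c : in) {
        if (c == kCombiningAcute && !out.empty()) {
            auto it = table.find(out.back());
            if (it != table.end()) {
                out.back() = it->second;  // replace base with precomposed form
                continue;
            }
        }
        out.push_back(c);
    }
    return out;
}
```

Without a step like this, a compiler that compares identifiers code point by code point would treat the decomposed and precomposed spellings of é as two different names.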
I think we're on the same page here. Implementation defined behavior doesn't preclude portability; sometimes it just affects the level of abstraction required. Thanks for doing that research; that is good information. There appears to be a clear trend towards normalization, particularly in languages that didn't start off with a normalizing implementation. |
P1949 now tracks a solution for this issue. |
This issue is now tracked by cplusplus/papers#688. |
This is done! |
I've prepared a report and library for "C/C++ Identifier Security using Unicode Standard Annex 39", a massive improvement over TR31 alone. How do I file this officially for WG21/WG14? How do I get a P number? |
Hi @rurban. Please send a link to your proposal to the SG16 mailing list. I or someone else will reply with instructions for how to request a P-number and submit your proposal. |
JF raised this issue on the SG16 mailing list.
Briefly, the standard allows Unicode characters outside the basic source character set to be used in identifiers, as specified by [lex.name]p1. The standard does not provide a rationale for the ranges of allowed characters that it specifies. It is likely that the specified ranges are not being maintained as new characters are added in new Unicode releases.
The Unicode Consortium has published UAX#31, a Unicode Standard Annex covering identifier and pattern syntax. This document may provide a better basis for the C++ standard's allowances for Unicode characters outside the basic source character set in identifier names.
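As a rough sketch of the shape of the UAX#31 default identifier syntax (a start character followed by zero or more continue characters), the check below takes the two character classes as caller-supplied predicates. In a real implementation these would be backed by the Unicode XID_Start and XID_Continue properties; the `ascii_start` and `ascii_continue` functions here are ASCII-only stand-ins assumed for illustration, and all names are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <string>

using CodePointPredicate = std::function<bool(char32_t)>;

// True if s matches <Start> <Continue>* for the given character classes.
bool is_identifier(const std::u32string& s,
                   const CodePointPredicate& is_start,
                   const CodePointPredicate& is_continue) {
    if (s.empty() || !is_start(s.front())) return false;
    for (std::size_t i = 1; i < s.size(); ++i) {
        if (!is_continue(s[i])) return false;
    }
    return true;
}

// ASCII-only stand-in for XID_Start, plus '_' as C++ allows.
bool ascii_start(char32_t c) {
    return c == U'_' || (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z');
}

// ASCII-only stand-in for XID_Continue.
bool ascii_continue(char32_t c) {
    return ascii_start(c) || (c >= U'0' && c <= U'9');
}
```

Keeping the character classes as parameters means the table of allowed code points can be regenerated from each new Unicode release without touching the syntax check itself.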