Skip to content

Commit

Permalink
[lex.charset] Fix various issues with the description of UCNs.
Browse files Browse the repository at this point in the history
Clarify that \U sequences not beginning 00 are ill-formed. Clarify
handling of code points naming reserved or noncharacter code points.
Remove unnecessary circumlocution through "short identifiers" by
directly talking about code points. Use code point values directly
rather than using C++ 0x notation.

[lex.string] Fix description of what UCNs mean, and convert it to a
note.
  • Loading branch information
zygoloid committed Mar 11, 2019
1 parent c45d32d commit 5acbaf2
Showing 1 changed file with 23 additions and 20 deletions.
43 changes: 23 additions & 20 deletions source/lex.tex
Original file line number Diff line number Diff line change
Expand Up @@ -207,18 +207,17 @@
\terminal{\textbackslash U} hex-quad hex-quad
\end{bnf}

The character designated by the \grammarterm{universal-character-name} \tcode{\textbackslash
U00NNNNNN} is that character
that has \tcode{U+NNNNNN} as a code point short identifier;
the character designated by the \grammarterm{universal-character-name}
\tcode{\textbackslash uNNNN} is that character
that has \tcode{U+NNNN} as a code point short identifier.
If a \grammarterm{universal-character-name} does not correspond to
a code point in ISO/IEC 10646 or
if a \grammarterm{universal-character-name} corresponds to
a surrogate code point,
the program is ill-formed. Additionally, if
a \grammarterm{universal-character-name} outside
A \grammarterm{universal-character-name}
designates the character in ISO/IEC 10646 (if any)
whose code point is the hexadecimal number represented by
the sequence of \grammarterm{hexadecimal-digit}s
in the \grammarterm{universal-character-name}.
The program is ill-formed if that number is not a code point
or if it is a surrogate code point.
Noncharacter code points and reserved code points
are considered to designate separate characters distinct from
any ISO/IEC 10646 character.
If a \grammarterm{universal-character-name} outside
the \grammarterm{c-char-sequence}, \grammarterm{s-char-sequence}, or
\grammarterm{r-char-sequence} of
a character or
Expand All @@ -228,10 +227,10 @@
\grammarterm{r-char-sequence}\iref{lex.string} does not form a
\grammarterm{universal-character-name}.}
\begin{note}
ISO/IEC 10646 code points are within the range 0x0-0x10FFFF (inclusive).
A surrogate code point is a value in the range 0xD800-0xDFFF (inclusive).
ISO/IEC 10646 code points are integers in the range $[0, \mathrm{10FFFF}]$ (hexadecimal).
A surrogate code point is a value in the range $[\mathrm{D800}, \mathrm{DFFF}]$ (hexadecimal).
A control character is a character whose code point is
in either of the ranges 0x0-0x1F or 0x7F-0x9F (both inclusive).
in either of the ranges $[0, \mathrm{1F}]$ or $[\mathrm{7F}, \mathrm{9F}]$ (hexadecimal).
\end{note}

\pnum
Expand Down Expand Up @@ -1144,7 +1143,7 @@
provided that the code point value
can be encoded as a single UTF-8 code unit.
\begin{note}
That is, provided the code point value is in the range 0x0-0x7F (inclusive).
That is, provided the code point value is in the range $[0, \mathrm{7F}]$ (hexadecimal).
\end{note}
If the value is not representable with a single UTF-8 code unit,
the program is ill-formed.
Expand All @@ -1163,7 +1162,7 @@
provided that the code point value is
representable with a single 16-bit code unit.
\begin{note}
That is, provided the code point value is in the range 0x0-0xFFFF (inclusive).
That is, provided the code point value is in the range $[0, \mathrm{FFFF}]$ (hexadecimal).
\end{note}
If the value is not representable
with a single 16-bit code unit, the program is ill-formed.
Expand Down Expand Up @@ -1685,9 +1684,13 @@
character requiring a surrogate pair, plus one for the terminating
\tcode{u'\textbackslash 0'}. \begin{note} The size of a \tcode{char16_t}
string literal is the number of code units, not the number of
characters. \end{note} Within \tcode{char32_t} and \tcode{char16_t}
string literals, any \grammarterm{universal-character-name}{s} shall be within the range
\tcode{0x0} to \tcode{0x10FFFF}. The size of a narrow string literal is
characters. \end{note}
\begin{note}
Any \grammarterm{universal-character-name}{s} are required to
correspond to a code point in the range
$[0, \mathrm{D800})$ or $[\mathrm{E000}, \mathrm{10FFFF}]$ (hexadecimal)\iref{lex.charset}.
\end{note}
The size of a narrow string literal is
the total number of escape sequences and other characters, plus at least
one for the multibyte encoding of each \grammarterm{universal-character-name}, plus
one for the terminating \tcode{'\textbackslash 0'}.
Expand Down

0 comments on commit 5acbaf2

Please sign in to comment.