-
Notifications
You must be signed in to change notification settings - Fork 167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tokenize Unicode identifiers #2284
Conversation
I found this code and am not sure how we handle width of characters |
libcpp/charset.cc
Outdated
/* Returns 1 if C has the XID_Start property, 2 if C has the XID_Continue properties, 3 if C has the both properties, 0 otherwise. */ | ||
int check_xid_property (cppchar_t c) { | ||
// fast path for ASCII | ||
if (c < 0x80) { | ||
if (('A' <= c && c <= 'Z') || ('a' <= c && c <= 'z')) | ||
return 3; | ||
if (('0' <= c && c <= '9') || c == '_') | ||
return 2; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this function should return bit flags but how can I do it?
(i.e. None: 0, XID_Start: 1, XID_Continue: 1 << 1)
Is just using int
(or unsigned int
) and exporting constants such as XID_START
and XID_CONTINUE
good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed to use unsigned int
and export consants via enum
1a03be4
to
41e4ea7
Compare
bool | ||
is_whitespace (char character) | ||
is_whitespace (int character) | ||
{ | ||
return ISSPACE (character); | ||
// https://doc.rust-lang.org/reference/whitespace.html | ||
return character == '\t' || character == '\n' || character == '\v' | ||
|| character == '\f' || character == '\r' || character == ' ' | ||
|| character == 0x0085 // next line | ||
|| character == 0x200e // left-to-right mark | ||
|| character == 0x200f // right-to-left mark | ||
|| character == 0x2028 // line separator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some codepoints of whitespaces.
But non-ascii whitespaces are not actually checked during tokenization because this func is called with argument whose type is char
(1 byte)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's great work, thank you!
|| character == 0x0085 // next line | ||
|| character == 0x200e // left-to-right mark | ||
|| character == 0x200f // right-to-left mark | ||
|| character == 0x2028 // line separator | ||
|| character == 0x2029; // pragraph separator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all of those characters accepted by rustc
as whitespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. All of these values are defined in the Rust ref.
You can find URL to this just before the selected lines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I missed this! Sorry! Thanks for pointing it out haha
} | ||
|
||
current_char = peek_input (); | ||
current_char32 = peek_codepoint_input (); | ||
skip_codepoint_input (); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we skipping the codepoint input here but not the char?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we skip one byte here, only the first byte of current utf-8 character can be skipped by the lexer, which we do not expect.
For example, if the lexer tokenizes identifier あああ , it should skip the first utf8 character あ, not its first byte.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thank you!
gcc/rust/lex/rust-lex.cc
Outdated
// TODO some keywords cannot be used for a lifetime label | ||
// https://doc.rust-lang.org/reference/tokens.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's good - please open an issue for this and put the issue number in the comment :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opend #2306
44ce808
to
b77a471
Compare
Oops, I mistakenly made the last commit. I will undo it later. |
Done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very minor nit :) This looks great to me, thank you @tamaroning!
gcc/rust/lex/rust-lex.cc
Outdated
length++; | ||
} | ||
|
||
current_column += length; | ||
|
||
loc += length - 1; | ||
|
||
// TODO some keywords cannot be used for a lifetime label |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// TODO some keywords cannot be used for a lifetime label | |
// TODO some keywords cannot be used for a lifetime label #2306 |
@tamaroning I noticed that your commit does not include a commit title - please add a short description of the commit as title. It should look something like this:
You can see the commit has a title, which can be anything as long as it is a short description of the commit. Then, you have the commit description - but that's not always necessary. Then the Changelog, and finally the |
@CohenArthur Thanks for your review! I fixed comments and commit messages. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This patch looks great though i think what will really help is to split this into two commits. So you put all the libcpp changes into one commit and the gccrs changes in another. The only reason for this is that the libcpp changes will need to be reviewed by other people in the GCC community and it will be easier to do this by separating them out.
Great work!
libcpp/ChangeLog: * charset.cc (check_xid_property):new function to check XID_Start and XID_Continue * include/cpplib.h (check_xid_property):add enum representing XID properties Signed-off-by: Raiki Tamura <[email protected]>
gcc/rust/ChangeLog: * lex/rust-lex.cc (is_whitespace):add all lacked codepoints valid as whitespaces (is_identifier_start):new function to check XID_Start and underscore (is_identifier_continue):new function to check XID_Continue (Lexer::build_token):tokenize Unicode identifiers (Lexer::parse_partial_string_continue):add comments (Lexer::parse_partial_unicode_escape):add comments (Lexer::parse_raw_identifier):change to use `is_identifier_scontinue` (Lexer::parse_identifier_or_keyword):change to use `is_identifier_continue` (Lexer::parse_char_or_lifetime):change to use `is_identifier_start/continue` (Lexer::skip_codepoint_input):do not attempt to skip input when bumping EOF * lex/rust-lex.h:add `current_char32` field Signed-off-by: Raiki Tamura <[email protected]>
Sounds good :) Done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not much to say, that's good work.
@@ -116,6 +123,18 @@ is_non_decimal_int_literal_separator (char character) | |||
return character == 'x' || character == 'o' || character == 'b'; | |||
} | |||
|
|||
bool | |||
is_identifier_start (int codepoint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to unify with the Codepoint
alias in rust-codepoint.h
? Also what about specifying an explicit size (eg. std::uint32_t
, maybe even wchar_t
?) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think uint32_t
is better. If we unify types for paramters of such functions, other several functions should also use the same type.
e.g. is_x_digit, is_octal_digit, etc.
gccrs/gcc/rust/lex/rust-lex.cc
Lines 83 to 87 in d535c82
bool | |
is_x_digit (char number) | |
{ | |
return ISXDIGIT (number); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Addresses #2287 #418
gcc/rust/ChangeLog:
libcpp/ChangeLog: