Add utf-8 validation for input source #2374
 {
-  uint8_t input = next_byte ();
+  uint32_t input = next_byte ();

-  if ((int8_t) input == EOF)
+  if ((int32_t) input == EOF)
     return Codepoint::eof ();
Changed `input` from `uint8_t` to `uint32_t` so as to differentiate `0xff` and EOF.
just a bugfix
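To make the collision concrete, here is a minimal sketch of the old and new comparisons. The helper names `is_eof_narrow` and `is_eof_wide` are hypothetical and not part of the patch, and the sketch assumes `EOF` is -1 and `int` is 32 bits wide, as on common platforms.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio> // EOF

// Hypothetical helper mirroring the old code: the value returned by
// next_byte () is stored in 8 bits before being compared with EOF.
bool is_eof_narrow (int raw)
{
  uint8_t input = static_cast<uint8_t> (raw); // EOF (-1) truncates to 0xff
  return (int8_t) input == EOF; // 0xff reads back as -1, so a real 0xff byte matches too
}

// Hypothetical helper mirroring the new code: 32 bits keep the two apart.
bool is_eof_wide (int raw)
{
  uint32_t input = static_cast<uint32_t> (raw); // EOF is 0xffffffff, a 0xff byte is 0x000000ff
  return (int32_t) input == EOF;
}
```

With the 8-bit variable, a genuine `0xff` byte and the end-of-file marker are indistinguishable; with the 32-bit variable only the real `EOF` compares equal.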
 {
   if (offs >= buffer.size ())
     return EOF;

-  return buffer.at (offs++);
+  return (uint8_t) buffer.at (offs++);
 }
Added a cast to prevent bytes whose MSB is 1 from being sign-extended. Without the cast, for example, `0xfe` becomes `0xfffffffe`.
bugfix too
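The sign extension can be reproduced in isolation. This is only a sketch: the two helpers are hypothetical, `signed char` stands in for a buffer element on a platform where plain `char` is signed, and `int` is assumed to be 32 bits wide.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: widening a signed byte directly copies the sign
// bit into the upper bits of the result.
int widen_without_cast (signed char byte)
{
  return byte;
}

// Hypothetical helper: going through uint8_t first zero-extends instead,
// which is what the cast in the patch achieves.
int widen_with_cast (signed char byte)
{
  return (uint8_t) byte;
}
```

For the byte `0xfe`, the first helper yields -2 (bit pattern `0xfffffffe`), while the second yields `0xfe`.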
+// { dg-excess-errors "stream did not contain valid UTF-8" }
+�
Contains a `0xff` in line 2.
Not ÿ (`U+FF`) as we see.
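The distinction matters because the code point U+FF is not stored as the single byte `0xff` in UTF-8. A small sketch, using a hypothetical encoder `encode_utf8_small` that only handles code points below U+0800, shows the bytes a valid file would contain for ÿ:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode a code point below U+0800 as UTF-8.
// Larger code points need three or four bytes and are out of scope here.
std::string encode_utf8_small (uint32_t cp)
{
  std::string out;
  if (cp < 0x80)
    out.push_back (static_cast<char> (cp)); // ASCII: one byte
  else
    {
      out.push_back (static_cast<char> (0xC0 | (cp >> 6)));   // lead byte 110xxxxx
      out.push_back (static_cast<char> (0x80 | (cp & 0x3F))); // continuation 10xxxxxx
    }
  return out;
}
```

For U+FF this produces the two bytes `0xc3 0xbf`, so a lone `0xff` byte in the test file cannot be a correctly encoded ÿ.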
gcc/rust/ChangeLog:

	* lex/rust-lex.cc (Lexer::input_source_is_valid_utf8): New method of `Lexer`.
	* lex/rust-lex.h: Likewise.
	* rust-session-manager.cc (Session::compile_crate): Add error.

gcc/testsuite/ChangeLog:

	* rust/compile/broken_utf8.rs: New test.

Signed-off-by: Raiki Tamura <[email protected]>
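The body of `Lexer::input_source_is_valid_utf8` is not shown on this page. As a rough illustration of what such a check involves, here is a standalone sketch. The free function `is_valid_utf8` and its `std::string` parameter are assumptions made for the example and may differ from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical standalone validator, not the method added by the patch.
bool is_valid_utf8 (const std::string &bytes)
{
  // Smallest code point that legitimately needs a sequence of each length.
  static const uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

  size_t i = 0;
  while (i < bytes.size ())
    {
      uint8_t b0 = static_cast<uint8_t> (bytes[i]);
      size_t len;
      uint32_t cp;
      if (b0 < 0x80)
        { len = 1; cp = b0; }
      else if ((b0 & 0xE0) == 0xC0)
        { len = 2; cp = b0 & 0x1F; }
      else if ((b0 & 0xF0) == 0xE0)
        { len = 3; cp = b0 & 0x0F; }
      else if ((b0 & 0xF8) == 0xF0)
        { len = 4; cp = b0 & 0x07; }
      else
        return false; // stray continuation byte, or 0xf8..0xff

      if (i + len > bytes.size ())
        return false; // sequence is cut off by the end of the input

      for (size_t k = 1; k < len; k++)
        {
          uint8_t b = static_cast<uint8_t> (bytes[i + k]);
          if ((b & 0xC0) != 0x80)
            return false; // not a 10xxxxxx continuation byte
          cp = (cp << 6) | (b & 0x3F);
        }

      if (cp < min_cp[len])
        return false; // overlong encoding
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return false; // surrogate code points are not valid in UTF-8
      if (cp > 0x10FFFF)
        return false; // beyond the Unicode range

      i += len;
    }
  return true;
}
```

Under this sketch a lone `0xff` byte is rejected, because no UTF-8 sequence may start with it, while the pair `0xc3 0xbf` is accepted.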
LGTM
The only thing I might say is that it would be nice to have these constants named in some way, but UTF-8 handling is so specific that I don't think that would really help.
Addresses #2287