-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use TokenSource
to find new location for re-lexing
#12060
Conversation
c0a9a2b
to
7705ef6
Compare
|
f7fd083
to
26791e1
Compare
26791e1
to
e3020ef
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
@@ -1388,84 +1392,35 @@ impl<'src> Lexer<'src> { | |||
return false; | |||
} | |||
|
|||
let mut current_position = self.current_range().start(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wow, that's a lot of code that is gone now :)
// Earlier we reduced the nesting level unconditionally. Now that we know the lexer's | ||
// position is going to be moved back, the lexer needs to be put back into a | ||
// parenthesized context if the current token is a closing parenthesis. | ||
// | ||
// ```py | ||
// (a, [b, | ||
// c | ||
// ) | ||
// ``` | ||
// | ||
// Here, the parser would request to re-lex the token when it's at `)` and can recover | ||
// from an unclosed `[`. This method will move the lexer back to the newline character | ||
// after `c` which means it goes back into parenthesized context. | ||
if matches!( | ||
self.current_kind, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it still make sense to reduce the nesting level above unconditionally or could we invert the condition here and only then reduce the nesting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good point, we can invert the condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, actually, it might still not be possible. Let me confirm
} | ||
} | ||
|
||
if self.lexer.re_lex_logical_token(non_logical_newline_start) { | ||
let current_start = self.current_range().start(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not important and I'm fine to keep it this way. I was just wondering if we could store the offset of the non_logical_newline
and truncate the self.tokens
to that position if re_lex_logical_token
returns true
.
## Summary This PR fixes the lexer logic to **not** consume the newline character for an unterminated string literal. Currently, the lexer would consume it to be part of the string itself but that would be bad for recovery because then the lexer wouldn't emit the newline token ever. This PR fixes that to avoid consuming the newline character in that case. This was discovered during #12060. ## Test Plan Update the snapshots and validate them.
Summary
This PR splits the re-lexing logic into two parts:
TokenSource
: The token source will be responsible to find the position the lexer needs to be moved toLexer
: The lexer will be responsible to reduce the nesting level and move itself to the new position if recovered from a parenthesized contextThis split makes it easy to find the new lexer position without needing to implement the backwards lexing logic again which would need to handle cases involving:
F-strings
This change did reveal one thing about re-lexing f-strings. Consider the following example:
Here, the quote as highlighted by the caret (
^
) is the start of a string inside an f-string expression. This is unterminated string which means the token emitted is actuallyUnknown
. The parser tries to recover from it but there's no newline token in the vector so the new logic doesn't recover from it. The previous logic does recover because it's looking at the raw characters instead.The parser would be at
FStringStart
(the one for the second line) when it calls into the re-lexing logic to recover from an unterminated f-string on the first line. So, moving backwards the first character encountered is a newline character but the first token encountered is anUnknown
token.This is improved with #12067
fixes: #12046
fixes: #12036
Test Plan
Update the snapshot and validate the changes.