Clarify that lexing is greedy #599
Conversation
Force-pushed from ead49a0 to 919f3f2
Force-pushed from fac60df to 92c1602
@leebyron Looks good 👍
P.S. In retrospect, I think making commas optional was a mistake, since it causes not only this issue but also complicates writing an error-recovering parser.
This issue is unrelated to optional commas, since the lexical grammar applies before the syntactical grammar. Other languages face the same thing, and I found that a clause like this is pretty common. JavaScript's is actually even more complex, but has a similar clause about the lexer being greedy: https://www.ecma-international.org/ecma-262/10.0/index.html#sec-ecmascript-language-lexical-grammar
I'd love to learn more about your exploration into error-recovering parsing. I know that some strategies involve looking for a comma, semicolon, or line break to attempt recovery. The parser used in GraphiQL has some error recovery properties as well, though I'm not sure if there's any academic literature to back up the method it uses.
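To make the "greedy lexer" idea above concrete, here is a minimal sketch (my own illustration with simplified, invented token rules, not graphql-js code) of why longest-match lexing matters:

```ts
// Minimal sketch with simplified token rules (not GraphQL's actual grammar).
// Given Int (digits) and Float (digits "." digits), the source "1.23" could in
// principle be read as Int("1"), Punctuator("."), Int("23"). A greedy lexer
// always prefers the longest match at the current position instead.
const FLOAT = /^\d+\.\d+/;
const INT = /^\d+/;

function longestMatch(source: string): string {
  const candidates = [FLOAT.exec(source)?.[0], INT.exec(source)?.[0]].filter(
    (m): m is string => m !== undefined,
  );
  // Greedy choice: keep the longest candidate.
  return candidates.sort((a, b) => b.length - a.length)[0] ?? "";
}

longestMatch("1.23"); // "1.23", a single Float token rather than Int "." Int
```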
Force-pushed from 92c1602 to 7f62455
@leebyron Thanks for the link; now I see how it happens with other languages.
But since in GraphQL whitespace and commas are both insignificant, … So if we now adopt the same definition of a "greedy" lexer as JS and other languages, how should … ?
Great question. I'm actually surprised to see the JS error point to that specific location. I'll need to read more about how JS and other languages are handling this, and whether this is implementation-specific or specified behavior. Given the ECMAScript spec, I would have assumed that …
As our spec is currently written, I would expect …
Perhaps JS and other languages have a more complex number lexer?
Indeed, the ECMAScript spec has specific language for this (though, surprisingly, it's not in the formal grammar). See the prose and note at the end of this section: https://www.ecma-international.org/ecma-262/10.0/index.html#sec-literals-numeric-literals
I would have expected a negative lookahead in the formal grammar. Something like this adds value since it protects the space in case we ever wanted to use trailing letters for different kinds of numbers (seems unlikely), but it certainly helps combat the potential confusion or poor readability. If we were to pursue adding that, we should do it as a separate RFC instead of an editorial clarification, since it's technically a language (breaking) change and we should go through the stages to get implementations on board.
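For concreteness, here is one way such a follow restriction could be written as a regular-expression negative lookahead. This is only a sketch with simplified character classes, not the spec's grammar notation:

```ts
// Sketch only: an IntValue-like rule that must not be immediately followed by
// a dot, digit, or name character. Simplified classes; not the spec grammar.
const INT_VALUE = /^-?(?:0|[1-9]\d*)(?![.\dA-Za-z_])/;

INT_VALUE.test("123");    // true: a plain integer
INT_VALUE.test("123abc"); // false: immediately followed by a name character
INT_VALUE.test("1.0");    // false: this is the start of a float, not an int
```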
Doing a little more research about how other popular languages specify this behavior, and looking for specific quirks around numbers...

Maximal Munch

First, learning about the academic background. A term of art for this is "maximal munch." There's some interesting research about avoiding implementation rules and relying only on unambiguous grammars by using follow restrictions (essentially what I was expecting JS to be doing). For example, we might want to describe the … Today we describe some of our lexical grammars with regular expressions which use greedy * and + operators. https://en.wikipedia.org/wiki/Maximal_munch

C

Uses a maximal munch lexer with extra rules (header preprocessors). http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf (page 49)

JavaScript

Uses a maximal munch lexer with extra rules (for regex, template literals). Numbers are not allowed to be followed by a letter (and also not by a decimal digit, which is an attempt at a disambiguating follow restriction). After reading grammars for a bunch of other languages, I think this limitation is really smart. Most programming languages offer many numeric literals, including … I don't know the historical context for this follow restriction in JavaScript, but I assume it is to produce clear errors if someone attempts to parse JS-ish C99 code, and perhaps there was foresight that they might want to support something like this in the future and wanted to protect the lexical space. http://www.ecma-international.org/ecma-262/10.0/index.html#sec-ecmascript-language-lexical-grammar

Others

Also investigated the lexers for Swift, Kotlin, Python, and a couple other languages which have specifications for their language parsing. They nearly all include some language about a maximal munch style algorithm. None of them have follow limiters for numeric literals. https://docs.swift.org/swift-book/ReferenceManual/LexicalStructure.html

Unrelated notes & findings

After spending a night reading modern language specifications, I'm seeing some clear trends that, while unrelated to this issue, I think are interesting, and I wanted to document my thoughts.

Underscores in numeric literals

There seems to be a trend in newer languages (including Python 3) to allow arbitrary underscores in a numeric literal, which are ignored during interpretation. This is to aid readability, e.g. …

Unicode astral plane support

I made a mistake basing GraphQL's initial source and lexical parsing logic on ECMAScript's, in that it now unnecessarily suffers from many of the same Unicode-related issues. We've known this for a while (there was an ill-fated attempt at fixing this a couple years back). Nearly all modern languages don't make the same mistake. More specifically, GraphQL's spec borrowed JS's assumptions of a UTF-16 (previously UCS-2) encoded world, and mixes the concepts of a character (which is ambiguous), a code point, and a code unit. Because of UTF-16, the language specifies source only allowing up to U+FFFF, but ideally we should support up to U+10FFFF. Notably, we should support emoji in string literals without requiring surrogate pair code units. Similarly, we missed an opportunity for correct string literal unicode escape sequences. I'll probably propose an RFC fixing most of these unicode issues in the future.
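As a generic illustration of the "maximal munch" rule discussed above (invented token rules, not GraphQL's actual lexical grammar), the algorithm simply keeps the longest match among all token rules at each position:

```ts
// A minimal, generic sketch of "maximal munch". The token rules here are
// invented for illustration and are not GraphQL's lexical grammar.
type Rule = { kind: string; pattern: RegExp };

const rules: Rule[] = [
  { kind: "Float", pattern: /^-?\d+\.\d+/ },
  { kind: "Int", pattern: /^-?\d+/ },
  { kind: "Name", pattern: /^[A-Za-z_][A-Za-z0-9_]*/ },
];

function maximalMunch(source: string): { kind: string; text: string }[] {
  const tokens: { kind: string; text: string }[] = [];
  let pos = 0;
  while (pos < source.length) {
    // Whitespace and commas are insignificant in GraphQL, so skip them here.
    if (/[\s,]/.test(source[pos])) { pos++; continue; }
    let best: { kind: string; text: string } | null = null;
    for (const { kind, pattern } of rules) {
      const m = pattern.exec(source.slice(pos));
      if (m && (!best || m[0].length > best.text.length)) {
        best = { kind, text: m[0] };
      }
    }
    if (!best) throw new SyntaxError(`Unexpected character at ${pos}`);
    tokens.push(best);
    pos += best.text.length;
  }
  return tokens;
}

// maximalMunch("1.23 4 foo") -> Float("1.23"), Int("4"), Name("foo")
// Note: without a follow restriction this happily yields Int("123"), Name("abc")
// for "123abc", which is exactly the ambiguity discussed above.
```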
Force-pushed from 7f62455 to 00f9ecf
Force-pushed from 00f9ecf to 27c2602
Significant update, so I've put this back up for review. @andimarek I'd love your eyes on this as well. This latest update is informed by my research and makes a few significant changes: …
This question came up in the GraphQL working group meeting today, so I wanted to follow up briefly:
The original intent was actually for this to be a parse error! In fact, there is a test in GraphQL.js for exactly this case. However, the spec makes no mention of this case or of the expected "greedy" behavior of a lexer, so it would be entirely reasonable, if you only read this spec, to expect … This change makes it clear that …
Great work, @leebyron
Force-pushed from 100d8fe to 1ebfc40
Updated to mirror the lookahead restrictions between IntValue and FloatValue and include the exponent in the restriction. Also includes more prose explaining the effect of the restrictions. Suggested test cases based on this change (thanks @robzhu for the suggestion to write these): … These test cases will most likely already be passing in GraphQL implementations; however, they were ambiguous or under-defined in the spec before this change.
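The concrete test cases listed in that comment did not survive extraction; purely as an illustration, the kinds of inputs implied by the IntValue/FloatValue lookahead restrictions might look like the following (hypothetical examples, not the actual list):

```ts
// Hypothetical examples only; the actual suggested cases are in the comment
// and linked PRs. Sources a lexer with the lookahead restrictions would reject:
const shouldFailToLex = ["0xF1", "123abc", "01", "1.23f", "1.2e3.4"];
// Sources that should still lex successfully:
const shouldLex = ["0", "-42", "1.23", "1.23e-4"];
```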
Adds the test cases described in graphql/graphql-spec#599. Makes the lookahead restriction change necessary for the new tests to pass for numbers; all other tests are already passing.
Adds the test cases described in graphql/graphql-spec#599. Replicates graphql/graphql-js@c68acd8
Force-pushed from 1ebfc40 to 61050cb
GraphQL syntactical grammars intend to be unambiguous. While lexical grammars should also be, there has historically been an assumption that lexical parsing is greedy. This is obvious for numbers and words, but less obvious for empty block strings. This also removes the regular expression representation from the lexical grammar notation, since it wasn't always clear. Either way, the additional clarity removes ambiguity from the spec.

Partial fix for #564
Specifically addresses #564 (comment)
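As a hedged illustration of the empty block string case mentioned in that commit message (the regular expressions below are my own simplification, not the spec's rules): a run of six quote characters can be read either as three empty strings or as one empty block string, and only the greedy reading gives the latter.

```ts
// Simplified, illustrative token rules; not the spec's lexical grammar.
const BLOCK_STRING = /^"""(?:\\"""|[^"]|"(?!""))*"""/;
const STRING = /^"(?:[^"\\]|\\.)*"/;

const source = '""""""'; // six quote characters

BLOCK_STRING.exec(source)?.[0]; // '""""""' (length 6): one empty block string
STRING.exec(source)?.[0];       // '""'     (length 2): an empty string
// A greedy lexer keeps the longest match, so this source is a single empty
// BlockString token rather than three empty String tokens.
```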
Force-pushed from 61050cb to 8248e62
Some edge cases around numbers were not handled as expected. This commit adds test cases from the 2 RFCs clarifying the expected behaviour (graphql/graphql-spec#601, graphql/graphql-spec#599) and updates the Lexer to match. This is technically a breaking change, but most cases were likely to lead to validation errors (e.g. "0xF1" being parsed as [0, xF1] when expecting a list of integers).
@leebyron Some feedback, even if it comes very late: I think negative lookaheads are a good way of making some details clear, but I think we could improve some details and give some context for people implementing/maintaining a GraphQL parser to make it easier. Specifically:

Some negative lookaheads are used to make a rule greedy (e.g. Comment). This is the most intuitive way and doesn't need a lot of explanation beside what we have right now.

Other lookaheads are more complicated and are more about removing ambiguity (for example for StringValue) and not about "greedy". One of the more challenging rules is …

In ANTLR (which is the parser generator we use in GraphQL Java) we use the following rule: …

Hand-written parsers like graphql.js don't really have this challenge, as they can easily look ahead for the escaped triple quote.
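As a hedged sketch of what that hand-written lookahead can look like (not graphql-js's actual implementation), a scanner can walk a block string body and treat \""" as an escape rather than a terminator:

```ts
// Hedged sketch, not graphql-js's actual code: scan a block string body by
// hand, walking forward until an unescaped """ is found and treating \""" as
// an escape for a literal triple quote.
function readBlockStringBody(source: string, start: number): { raw: string; end: number } {
  let pos = start; // position just after the opening """
  while (pos < source.length) {
    if (source.startsWith('\\"""', pos)) {
      pos += 4; // escaped triple quote, part of the body
    } else if (source.startsWith('"""', pos)) {
      return { raw: source.slice(start, pos), end: pos + 3 };
    } else {
      pos++;
    }
  }
  throw new SyntaxError("Unterminated block string");
}

// readBlockStringBody('abc \\""" def"""', 0)
//   -> { raw: 'abc \\""" def', end: 15 }
```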
GraphQL syntactical grammars intend to be unambiguous. While lexical grammars should also be, there has historically been an assumption that lexical parsing is greedy. This is obvious for numbers and words, but less obvious for empty block strings.
Either way, the additional clarity removes ambiguity from the spec.
Partial fix for #564
Fixes #572
Specifically addresses #564 (comment)