Clarify that lexing is greedy #599
Conversation
Force-pushed from ead49a0 to 919f3f2
Force-pushed from fac60df to 92c1602
@leebyron Looks good 👍
P.S. In retrospect, I think making commas optional was a mistake, since it causes not only this issue but also complicates writing an error-recovering parser.
This issue is unrelated to optional commas, since the lexical grammar applies before the syntactical grammar. Other languages face the same thing, and I found that a clause like this is pretty common. JavaScript's is actually even more complex, but has a similar clause about the lexer being greedy: https://www.ecma-international.org/ecma-262/10.0/index.html#sec-ecmascript-language-lexical-grammar
I'd love to learn more about your exploration into error-recovering parsing. I know that some strategies involve looking for a comma, semicolon, or line break to attempt recovery. The parser used in GraphiQL has some error recovery properties as well, though I'm not sure if there's any academic literature to back up the method it uses.
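To make the "greedy lexer" idea above concrete, here is a minimal sketch (my own illustration with simplified, invented token rules, not graphql-js code) of why longest-match lexing matters:

```ts
// Minimal sketch with simplified token rules (not GraphQL's actual grammar).
// Given Int (digits) and Float (digits "." digits), the source "1.23" could in
// principle be read as Int("1"), Punctuator("."), Int("23"). A greedy lexer
// always prefers the longest match at the current position instead.
const FLOAT = /^\d+\.\d+/;
const INT = /^\d+/;

function longestMatch(source: string): string {
  const candidates = [FLOAT.exec(source)?.[0], INT.exec(source)?.[0]].filter(
    (m): m is string => m !== undefined,
  );
  // Greedy choice: keep the longest candidate.
  return candidates.sort((a, b) => b.length - a.length)[0] ?? "";
}

longestMatch("1.23"); // "1.23", a single Float token rather than Int "." Int
```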
Force-pushed from 92c1602 to 7f62455
@leebyron Thanks for the link; now I see how it happens with other languages.
But since in GraphQL whitespace and commas are both insignificant, … So if we now adopt the same definition of a "greedy" lexer as JS and other languages, how should … ?
Great question. I'm actually surprised to see the JS error point to that specific location. I'll need to read more about how JS and other languages are handling this, and whether this is implementation-specific or specified behavior. Given the ECMAScript spec, I would have assumed that …
As our spec is currently written, I would expect …
Perhaps JS and other languages have a more complex number lexer?
Indeed, the ECMAScript spec has specific language for this (though, surprisingly, it's not in the formal grammar). See the prose and note at the end of this section: https://www.ecma-international.org/ecma-262/10.0/index.html#sec-literals-numeric-literals
I would have expected a negative lookahead in the formal grammar. Something like this adds value since it protects the space in case we ever wanted to use trailing letters for different kinds of numbers (seems unlikely), but it certainly helps combat the potential confusion or poor readability. If we were to pursue adding that, we should do it as a separate RFC instead of an editorial clarification, since it's technically a language (breaking) change and we should go through the stages to get implementations on board.
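For concreteness, here is one way such a follow restriction could be written as a regular-expression negative lookahead. This is only a sketch with simplified character classes, not the spec's grammar notation:

```ts
// Sketch only: an IntValue-like rule that must not be immediately followed by
// a dot, digit, or name character. Simplified classes; not the spec grammar.
const INT_VALUE = /^-?(?:0|[1-9]\d*)(?![.\dA-Za-z_])/;

INT_VALUE.test("123");    // true: a plain integer
INT_VALUE.test("123abc"); // false: immediately followed by a name character
INT_VALUE.test("1.0");    // false: this is the start of a float, not an int
```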
Doing a little more research about how other popular languages specify this behavior, and looking for specific quirks around numbers...

Maximal Munch

First, learning about the academic background. A term of art for this is "maximal munch." There's some interesting research about avoiding implementation rules and relying only on unambiguous grammars by using follow restrictions (essentially what I was expecting JS to be doing). For example, we might want to describe the … Today we describe some of our lexical grammars with regular expressions which use greedy * and + operators. https://en.wikipedia.org/wiki/Maximal_munch

C

Uses a maximal munch lexer with extra rules (header preprocessors). http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf (page 49)

JavaScript

Uses a maximal munch lexer with extra rules (for regex, template literals). Numbers are not allowed to be followed by a letter (and also not by a decimal digit, which is an attempt at a disambiguating follow restriction). After reading grammars for a bunch of other languages, I think this limitation is really smart. Most programming languages offer many numeric literals, including … I don't know the historical context for this follow restriction in JavaScript, but I assume it is to produce clear errors if someone attempts to parse JS-ish C99 code, and perhaps there was foresight that they might want to support something like this in the future and wanted to protect the lexical space. http://www.ecma-international.org/ecma-262/10.0/index.html#sec-ecmascript-language-lexical-grammar

Others

Also investigated the lexers for Swift, Kotlin, Python, and a couple other languages which have specifications for their language parsing. They nearly all include some language about a maximal munch style algorithm. None of them have follow limiters for numeric literals. https://docs.swift.org/swift-book/ReferenceManual/LexicalStructure.html

Unrelated notes & findings

After spending a night reading modern language specifications, I'm seeing some clear trends that, while unrelated to this issue, I think are interesting, and I wanted to document my thoughts.

Underscores in numeric literals

There seems to be a trend in newer languages (including Python 3) to allow arbitrary underscores in a numeric literal, which are ignored during interpretation. This is to aid readability, e.g. …

Unicode astral plane support

I made a mistake basing GraphQL's initial source and lexical parsing logic on ECMAScript's, in that it now unnecessarily suffers from many of the same Unicode-related issues. We've known this for a while (there was an ill-fated attempt at fixing this a couple years back). Nearly all modern languages don't make the same mistake. More specifically, GraphQL's spec borrowed JS's assumptions of a UTF-16 (previously UCS-2) encoded world, and mixes the concepts of a character (which is ambiguous), a code point, and a code unit. Because of UTF-16, the language specifies source only allowing up to U+FFFF, but ideally we should support up to U+10FFFF. Notably, we should support emoji in string literals without requiring surrogate pair code units. Similarly, we missed an opportunity for correct string literal unicode escape sequences. I'll probably propose an RFC fixing most of these unicode issues in the future.
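As a generic illustration of the "maximal munch" rule discussed above (invented token rules, not GraphQL's actual lexical grammar), the algorithm simply keeps the longest match among all token rules at each position:

```ts
// A minimal, generic sketch of "maximal munch". The token rules here are
// invented for illustration and are not GraphQL's lexical grammar.
type Rule = { kind: string; pattern: RegExp };

const rules: Rule[] = [
  { kind: "Float", pattern: /^-?\d+\.\d+/ },
  { kind: "Int", pattern: /^-?\d+/ },
  { kind: "Name", pattern: /^[A-Za-z_][A-Za-z0-9_]*/ },
];

function maximalMunch(source: string): { kind: string; text: string }[] {
  const tokens: { kind: string; text: string }[] = [];
  let pos = 0;
  while (pos < source.length) {
    // Whitespace and commas are insignificant in GraphQL, so skip them here.
    if (/[\s,]/.test(source[pos])) { pos++; continue; }
    let best: { kind: string; text: string } | null = null;
    for (const { kind, pattern } of rules) {
      const m = pattern.exec(source.slice(pos));
      if (m && (!best || m[0].length > best.text.length)) {
        best = { kind, text: m[0] };
      }
    }
    if (!best) throw new SyntaxError(`Unexpected character at ${pos}`);
    tokens.push(best);
    pos += best.text.length;
  }
  return tokens;
}

// maximalMunch("1.23 4 foo") -> Float("1.23"), Int("4"), Name("foo")
// Note: without a follow restriction this happily yields Int("123"), Name("abc")
// for "123abc", which is exactly the ambiguity discussed above.
```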
Force-pushed from 7f62455 to 00f9ecf
Force-pushed from 00f9ecf to 27c2602
Significant update, so I've put this back up for review. @andimarek I'd love your eyes on this as well. This latest update is informed by my research and makes a few significant changes: …
This question came up in the GraphQL working group meeting today, so I wanted to follow up briefly:
The original intent was actually for this to be a parse error! In fact, there is a test in GraphQL.js for exactly this case. However, the spec makes no mention of this case or of the expected "greedy" behavior of a lexer, so it would be entirely reasonable, if you only read this spec, to expect … This change makes it clear that …
Great work, @leebyron
Force-pushed from 100d8fe to 1ebfc40
Updated to mirror the lookahead restrictions between IntValue and FloatValue and include the exponent in the restriction. Also includes more prose explaining the effect of the restrictions. Suggested test cases based on this change (thanks @robzhu for the suggestion to write these): … These test cases will most likely already be passing in GraphQL implementations; however, they were ambiguous or under-defined in the spec before this change.
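The concrete test cases listed in that comment did not survive extraction; purely as an illustration, the kinds of inputs implied by the IntValue/FloatValue lookahead restrictions might look like the following (hypothetical examples, not the actual list):

```ts
// Hypothetical examples only; the actual suggested cases are in the comment
// and linked PRs. Sources a lexer with the lookahead restrictions would reject:
const shouldFailToLex = ["0xF1", "123abc", "01", "1.23f", "1.2e3.4"];
// Sources that should still lex successfully:
const shouldLex = ["0", "-42", "1.23", "1.23e-4"];
```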
Adds the test cases described in graphql/graphql-spec#599. Makes the lookahead restriction change necessary for the new tests to pass for numbers; all other tests are already passing.
Adds the test cases described in graphql/graphql-spec#599. Replicates graphql/graphql-js@c68acd8
Force-pushed from 1ebfc40 to 61050cb
GraphQL syntactical grammars intend to be unambiguous. While lexical grammars should also be, there has historically been an assumption that lexical parsing is greedy. This is obvious for numbers and words, but less obvious for empty block strings. This also removes the regular expression representation from the lexical grammar notation, since it wasn't always clear. Either way, the additional clarity removes ambiguity from the spec.

Partial fix for #564
Specifically addresses #564 (comment)
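As a hedged illustration of the empty block string case mentioned in that commit message (the regular expressions below are my own simplification, not the spec's rules): a run of six quote characters can be read either as three empty strings or as one empty block string, and only the greedy reading gives the latter.

```ts
// Simplified, illustrative token rules; not the spec's lexical grammar.
const BLOCK_STRING = /^"""(?:\\"""|[^"]|"(?!""))*"""/;
const STRING = /^"(?:[^"\\]|\\.)*"/;

const source = '""""""'; // six quote characters

BLOCK_STRING.exec(source)?.[0]; // '""""""' (length 6): one empty block string
STRING.exec(source)?.[0];       // '""'     (length 2): an empty string
// A greedy lexer keeps the longest match, so this source is a single empty
// BlockString token rather than three empty String tokens.
```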
Force-pushed from 61050cb to 8248e62
Some edge cases around numbers were not handled as expected. This commit adds test cases from the 2 RFCs clarifying the expected behaviour (graphql/graphql-spec#601, graphql/graphql-spec#599) and updates the Lexer to match. This is technically a breaking change, but most cases were likely to lead to validation errors (e.g. "0xF1" being parsed as [0, xF1] when expecting a list of integers).
@leebyron Some feedback, even if it comes very late: I think negative lookaheads are a good way of making some details clear, but I think we could improve some details and give some context for people implementing/maintaining a GraphQL parser to make it easier. Specifically:

Some negative lookaheads are used to make a rule greedy (e.g. Comment). This is the most intuitive way and doesn't need a lot of explanation beside what we have right now.

Other lookaheads are more complicated and are more about removing ambiguity (for example for StringValue) and not about "greedy". One of the more challenging rules is …

In ANTLR (which is the parser generator we use in GraphQL Java) we use the following rule: …

Hand-written parsers like graphql.js don't really have this challenge, as they can easily look ahead for the escaped triple quote.
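As a hedged sketch of what that hand-written lookahead can look like (not graphql-js's actual implementation), a scanner can walk a block string body and treat \""" as an escape rather than a terminator:

```ts
// Hedged sketch, not graphql-js's actual code: scan a block string body by
// hand, walking forward until an unescaped """ is found and treating \""" as
// an escape for a literal triple quote.
function readBlockStringBody(source: string, start: number): { raw: string; end: number } {
  let pos = start; // position just after the opening """
  while (pos < source.length) {
    if (source.startsWith('\\"""', pos)) {
      pos += 4; // escaped triple quote, part of the body
    } else if (source.startsWith('"""', pos)) {
      return { raw: source.slice(start, pos), end: pos + 3 };
    } else {
      pos++;
    }
  }
  throw new SyntaxError("Unterminated block string");
}

// readBlockStringBody('abc \\""" def"""', 0)
//   -> { raw: 'abc \\""" def', end: 15 }
```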
GraphQL syntactical grammars intend to be unambiguous. While lexical grammars should also be, there has historically been an assumption that lexical parsing is greedy. This is obvious for numbers and words, but less obvious for empty block strings.
Either way, the additional clarity removes ambiguity from the spec.
Partial fix for #564
Fixes #572
Specifically addresses #564 (comment)