Clarify that lexing is greedy #599

Merged
merged 1 commit into from
Jan 10, 2020
46 changes: 30 additions & 16 deletions spec/Appendix A -- Notation Conventions.md
@@ -22,8 +22,10 @@ of the sequences it is defined by, until all non-terminal symbols have been
replaced by terminal characters.

Terminals are represented in this document in a monospace font in two forms: a
specific Unicode character or sequence of Unicode characters (ex. {`=`} or {`terminal`}), and a pattern of Unicode characters defined by a regular expression
(ex {/[0-9]+/}).
specific Unicode character or sequence of Unicode characters (ie. {`=`} or
{`terminal`}), and prose typically describing a specific Unicode code-point
{"Space (U+0020)"}. Sequences of Unicode characters only appear in syntactic
grammars and represent a {Name} token of that specific sequence.

Non-terminal production rules are represented in this document using the
following notation for a non-terminal with a single definition:
@@ -48,23 +50,25 @@ ListOfLetterA :

The GraphQL language is defined in a syntactic grammar where terminal symbols
are tokens. Tokens are defined in a lexical grammar which matches patterns of
source characters. The result of parsing a sequence of source Unicode characters
produces a GraphQL AST.
source characters. The result of parsing a source text sequence of Unicode
characters first produces a sequence of lexical tokens according to the lexical
grammar which then produces abstract syntax tree (AST) according to the
syntactical grammar.

A Lexical grammar production describes non-terminal "tokens" by
A lexical grammar production describes non-terminal "tokens" by
patterns of terminal Unicode characters. No "whitespace" or other ignored
characters may appear between any terminal Unicode characters in the lexical
grammar production. A lexical grammar production is distinguished by a two colon
`::` definition.

Word :: /[A-Za-z]+/
Word :: Letter+

A Syntactical grammar production describes non-terminal "rules" by patterns of
terminal Tokens. Whitespace and other ignored characters may appear before or
after any terminal Token. A syntactical grammar production is distinguished by a
one colon `:` definition.
terminal Tokens. {WhiteSpace} and other {Ignored} sequences may appear before or
after any terminal {Token}. A syntactical grammar production is distinguished by
a one colon `:` definition.

Sentence : Noun Verb
Sentence : Word+ `.`
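
The two-stage reading described in this hunk (characters to tokens via the lexical grammar, then tokens to a tree via the syntactic grammar) can be sketched in Python for the toy `Word` / `Sentence` productions. This is only an illustration, not the reference implementation: the names `tokenize` and `parse_sentence`, the token tuples, and the dictionary used as a stand-in AST are all assumptions made for the example.

```python
import re

# Lexical stage: skip ignored whitespace, then match a Word (Letter+) or ".".
# The + quantifier is greedy, so a Word always takes every adjacent letter.
TOKEN_RE = re.compile(r"\s*(?:([A-Za-z]+)|(\.))")


def tokenize(source):
    """Turn source text into a list of (kind, value) tokens."""
    source = source.rstrip()  # trailing whitespace is ignored
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        if match.group(1) is not None:
            tokens.append(("Word", match.group(1)))
        else:
            tokens.append(("Punctuator", "."))
        pos = match.end()
    return tokens


def parse_sentence(tokens):
    """Syntactic stage for the toy rule: Sentence : Word+ `.`"""
    if len(tokens) < 2 or tokens[-1] != ("Punctuator", "."):
        raise SyntaxError("a Sentence is one or more Words followed by '.'")
    words = tokens[:-1]
    if any(kind != "Word" for kind, _ in words):
        raise SyntaxError("only Word tokens may precede the final '.'")
    return {"kind": "Sentence", "words": [value for _, value in words]}
```

Whitespace is consumed only between tokens, never inside a `Word`, which matches the distinction the text draws between `::` and `:` productions.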


## Grammar Notation
@@ -80,13 +84,11 @@ and their expanded definitions in the context-free grammar.
A grammar production may specify that certain expansions are not permitted by
using the phrase "but not" and then indicating the expansions to be excluded.

For example, the production:
For example, the following production means that the nonterminal {SafeWord} may
be replaced by any sequence of characters that could replace {Word} provided
that the same sequence of characters could not replace {SevenCarlinWords}.

SafeName : Name but not SevenCarlinWords

means that the nonterminal {SafeName} may be replaced by any sequence of
characters that could replace {Name} provided that the same sequence of
characters could not replace {SevenCarlinWords}.
SafeWord : Word but not SevenCarlinWords

A grammar may also list a number of restrictions after "but not" separated
by "or".
@@ -96,6 +98,18 @@ For example:
NonBooleanName : Name but not `true` or `false`
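
One way to read a "but not" production is as a match followed by an exclusion check. The Python sketch below applies that reading to `NonBooleanName`; the helper name `is_non_boolean_name` is hypothetical, and the regular expression is simply the pre-change `Name` pattern from Appendix B.

```python
import re

# Name, written as the regular expression used before this change.
NAME_RE = re.compile(r"[_A-Za-z][_0-9A-Za-z]*")

# The expansions listed after "but not".
EXCLUDED = {"true", "false"}


def is_non_boolean_name(text):
    """True if text could replace Name and is not one of the exclusions."""
    return NAME_RE.fullmatch(text) is not None and text not in EXCLUDED
```

The exclusion applies to the whole matched sequence, so `truthy` is still accepted even though it begins with `true`.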


**Lookahead Restrictions**

A grammar production may specify that certain characters or tokens are not
permitted to follow it by using the pattern {[lookahead != NotAllowed]}.
Lookahead restrictions are often used to remove ambiguity from the grammar.

The following example makes it clear that {Letter+} must be greedy, since {Word}
cannot be followed by yet another {Letter}.

Word :: Letter+ [lookahead != Letter]
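
As a rough illustration of what the lookahead restriction implies for an implementation, the following Python sketch scans a `Word` greedily. The function name `lex_word` and its return shape are assumptions made for this example only.

```python
import string

# Stands in for the Letter production (A-Z and a-z).
LETTERS = set(string.ascii_letters)


def lex_word(source, start=0):
    """Return (word, end_offset) for the longest letter run at start, or None."""
    end = start
    while end < len(source) and source[end] in LETTERS:
        end += 1
    if end == start:
        return None  # Letter+ requires at least one Letter
    # The loop stops only at end of input or at a non-letter, so the
    # character after the match is never a Letter: the lookahead
    # restriction holds by construction.
    return source[start:end], end
```

A scanner that stopped early, for example returning `he` from `hello`, would leave a `Letter` immediately after the match and so would violate `[lookahead != Letter]`.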


**Optionality and Lists**

A subscript suffix "{Symbol?}" is shorthand for two possible sequences, one
48 changes: 36 additions & 12 deletions spec/Appendix B -- Grammar Summary.md
@@ -1,6 +1,12 @@
# B. Appendix: Grammar Summary

SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/
## Source Text

SourceCharacter ::
- "U+0009"
- "U+000A"
- "U+000D"
- "U+0020–U+FFFF"


## Ignored Tokens
@@ -20,10 +26,10 @@ WhiteSpace ::

LineTerminator ::
- "New Line (U+000A)"
- "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
- "Carriage Return (U+000D)" [lookahead != "New Line (U+000A)"]
- "Carriage Return (U+000D)" "New Line (U+000A)"
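
The effect of the lookahead on the lone carriage return alternative is that a CR immediately followed by LF is always one `LineTerminator`, never two. A minimal Python sketch of that counting rule, with the hypothetical helper name `count_line_terminators`:

```python
def count_line_terminators(source):
    """Count LineTerminator matches, treating CR LF as a single terminator."""
    count = 0
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "\n":
            count += 1
        elif ch == "\r":
            count += 1
            # A lone CR only matches when the next character is not LF;
            # otherwise the two-character CR LF alternative applies.
            if i + 1 < len(source) and source[i + 1] == "\n":
                i += 1
        i += 1
    return count
```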

Comment :: `#` CommentChar*
Comment :: `#` CommentChar* [lookahead != CommentChar]

CommentChar :: SourceCharacter but not LineTerminator

@@ -41,24 +47,41 @@ Token ::

Punctuator :: one of ! $ & ( ) ... : = @ [ ] { | }

Name :: /[_A-Za-z][_0-9A-Za-z]*/
Name ::
- NameStart NameContinue* [lookahead != NameContinue]

NameStart ::
- Letter
- `_`

NameContinue ::
- Letter
- Digit
- `_`
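
The new `Name` production can be read as a greedy scan over two character classes. The Python sketch below shows one possible reading; `lex_name`, `NAME_START`, and `NAME_CONTINUE` are names chosen for this example and do not come from any implementation.

```python
import string

NAME_START = set(string.ascii_letters) | {"_"}   # Letter or `_`
NAME_CONTINUE = NAME_START | set(string.digits)  # Letter, Digit or `_`


def lex_name(source, start=0):
    """Return (name, end_offset) for the Name at start, or None."""
    if start >= len(source) or source[start] not in NAME_START:
        return None
    end = start + 1
    # Consume every NameContinue character, so the next character (if any)
    # is not a NameContinue, as the lookahead restriction requires.
    while end < len(source) and source[end] in NAME_CONTINUE:
        end += 1
    return source[start:end], end
```

For any input this should select the same characters as the removed regular expression `/[_A-Za-z][_0-9A-Za-z]*/` under greedy matching, which is the behavior the lookahead makes explicit.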

IntValue :: IntegerPart
Letter :: one of
`A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M`
`N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z`
`a` `b` `c` `d` `e` `f` `g` `h` `i` `j` `k` `l` `m`
`n` `o` `p` `q` `r` `s` `t` `u` `v` `w` `x` `y` `z`

Digit :: one of
`0` `1` `2` `3` `4` `5` `6` `7` `8` `9`

IntValue :: IntegerPart [lookahead != {Digit, `.`, ExponentPart}]

IntegerPart ::
- NegativeSign? 0
- NegativeSign? NonZeroDigit Digit*

NegativeSign :: -

Digit :: one of 0 1 2 3 4 5 6 7 8 9

NonZeroDigit :: Digit but not `0`

FloatValue ::
- IntegerPart FractionalPart
- IntegerPart ExponentPart
- IntegerPart FractionalPart ExponentPart
- IntegerPart FractionalPart ExponentPart [lookahead != {Digit, `.`, ExponentIndicator}]
- IntegerPart FractionalPart [lookahead != {Digit, `.`, ExponentIndicator}]
- IntegerPart ExponentPart [lookahead != {Digit, `.`, ExponentIndicator}]
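
The lookahead sets on `IntValue` and `FloatValue` map fairly directly onto negative lookahead in a regular expression. The Python sketch below transcribes the restrictions as they appear in this diff; the pattern names and the `lex_number` helper are assumptions for illustration and are not taken from a real lexer.

```python
import re

# IntegerPart: NegativeSign? 0  |  NegativeSign? NonZeroDigit Digit*
INTEGER_PART = r"-?(?:0|[1-9][0-9]*)"

# FloatValue: the three alternatives, each followed by
# [lookahead != {Digit, `.`, ExponentIndicator}]
FLOAT_RE = re.compile(
    INTEGER_PART
    + r"(?:\.[0-9]+[eE][+-]?[0-9]+|\.[0-9]+|[eE][+-]?[0-9]+)"
    + r"(?![0-9.eE])"
)

# IntValue: IntegerPart [lookahead != {Digit, `.`, ExponentPart}]
INT_RE = re.compile(INTEGER_PART + r"(?![0-9.]|[eE][+-]?[0-9])")


def lex_number(source, start=0):
    """Return (kind, text) for the number at start, or None if none matches."""
    match = FLOAT_RE.match(source, start)
    if match:
        return "FloatValue", match.group()
    match = INT_RE.match(source, start)
    if match:
        return "IntValue", match.group()
    return None
```

With these patterns, inputs such as `0123` and `1.2.3` produce no number token at all instead of being split into several shorter tokens.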

FractionalPart :: . Digit+

@@ -69,7 +92,8 @@ ExponentIndicator :: one of `e` `E`
Sign :: one of + -

StringValue ::
- `"` StringCharacter* `"`
- `""` [lookahead != `"`]
- `"` StringCharacter+ `"`
- `"""` BlockStringCharacter* `"""`
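
Splitting out the empty string alternative with a lookahead removes the ambiguity between an empty string followed by a quote and the opening of a block string. A small Python sketch of how a lexer might pick the alternative, using the hypothetical helper `classify_string_start`:

```python
def classify_string_start(source, start=0):
    """Decide which StringValue alternative begins at start."""
    # Check the longest delimiter first: three quotes always open a block
    # string, which is why `""` may not be followed by another quote.
    if source.startswith('"""', start):
        return "block string"
    if source.startswith('""', start):
        return "empty string"
    if source.startswith('"', start):
        return "string"
    return "not a string"
```

Checking the three-quote case first is the implementation counterpart of the lookahead restriction on the `""` alternative.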

StringCharacter ::
@@ -89,7 +113,7 @@ Note: Block string values are interpreted to exclude blank initial and trailing
lines and uniform indentation with {BlockStringValue()}.


## Document
## Document Syntax

Document : Definition+
