RFC: JS Grammar #1719

MichaReiser · 2021-10-27T07:37:28Z

Summary

RFC for a re-structured AST Facade (see /grammar.ungram file). The main motivations for restructuring the grammar are (#1718 goes into more details of why):

Structure the AST so that every token is exposed on the AST Facade. This may require introducing new intermediate nodes
Be explicit about the optionality of an element so that the AST Facade can return an Err if a mandatory element is missing.
Group children that depend on each other's existence in a sub-node. For example, the else token and the alternative branch of an if statement must both either be present or absent. Therefore, add an optional ElseClause to the if statement that has the mandatory else token and alternative branch children` (if one is present, the other must as well).
Ensure that children appear in fixed ordering. For example, the UpdatExpression (x++ or ++x) must be modelled as two distinct nodes to guarantee that the operator and identifier are always in the same order for each node type (PreUpdateExpression, PostUpdateExpression)
Encode possible position for error nodes (I'll look at specifying error recovery separately).
Align the naming and AST structure with Rome classic (and adapt it to align with the other goals). I tried to call out the main differences to Rome in comments above the relevant nodes.

The grammar does not yet cover typescript. This is my first time trying to design an AST. I'm certain there are some missing children, cases I didn't think of and some confusing names.

Scope

crates/rome-js-grammar/grammar.ungram defines the JS grammar
crates/rome-js-grammar/rfc.md lists open questions, proposed decisions, code gen extensions etc.
The other files are generated or based on @ematipico's work on migrating to ungrammar.

Test Plan

My imagination... There's no parser generating the defined AST today. I sketched some trees by hand but that's all. Changing the parser will bring up any cases that I missed.

cloudflare-workers-and-pages · 2021-10-27T09:08:28Z

Deploying with Cloudflare Pages

Latest commit:	`9a96362`
Status:	✅ Deploy successful!
Preview URL:	https://6b426c46.tools-8rn.pages.dev

View logs

crates/rome-js-grammar/src/grammar.ungram

crates/rome-js-grammar/src/rfc.md

ematipico

Left some comments. Overall it looks good! Great work! :)

crates/rome-js-grammar/Cargo.toml

crates/rome-js-grammar/src/grammar.ungram

crates/rome-js-grammar/src/rfc.md

ematipico · 2021-11-01T13:10:29Z

Re: the + operator, I think we should evaluate creating a PR inside ungrammar repository. At least, propose it and see it our feature makes sense.

crates/rome-js-grammar/Cargo.toml

crates/rome-js-grammar/src/rfc.md

crates/rome-js-grammar/src/grammar.ungram

MichaReiser · 2021-11-01T14:03:08Z

crates/rome-js-grammar/src/grammar.ungram

+    declarators: JsVariableDeclarator*
+
+JsVariableDeclarator =
+    id: JsBinding '=' init: JsExpression?


Right, we should change declarators to (JsVariableDeclarator (',' JsVariableDeclarator)*) (maybe with trailing)?

crates/rome-js-grammar/src/grammar.ungram

MichaReiser · 2021-11-01T14:05:18Z

crates/rome-js-grammar/src/grammar.ungram

+
+JsTemplateElement = JsStringTemplateElement | JsExpression
+
+JsStringTemplateElement = value: 'js_string_template_element'


I created JsStringTemplateElement because unions should always be over nodes and not tokens.

I created a js_string_template_element because it isn't a quoted string as it is the case for js_string

crates/rome-js-grammar/src/grammar.ungram

crates/rome-js-grammar/src/rfc.md

yassere · 2021-11-02T15:22:48Z

crates/rome-js-grammar/src/rfc.md

+Having fine granular nodes has the advantage that the API allows querying for a very specific node when, for example, using `node.descendants::<PreIncrementExpression>()`.
+
+Having more coarse-grained nodes on the other hand has the advantage that it's easier to implement common behaviour that, for example, applies to all Binary expressions.


Could we get the best of both worlds by having additional specific AST nodes that allow you to refine a coarse node? The specific nodes would need additional checks on the can_cast method (that would probably be hand-written in many cases). And we'd implement From<SpecificNode> for CoarseNode to make them easier to use.

Alternately, we could add granularity through utility functions: node.descendants().filter_map(UnaryExpression::is_pre_increment)

Yes, and I think this is overlapping with what @ematipico proposed here. For example, we could define a BinaryExpression union that contains the types for each specific operator. You can then call BinaryExpression::cast(specific_expr.syntax()) to get a BinaryExpression out of it or just do specific_expr.into() in a place where a BinaryExpression is expected.

The only downside is that these fine-grained unions will expand, for example, into the Expr union. But that might be less of a problem because these matches will not be exhaustive for most use cases?

crates/rome-js-grammar/src/grammar.ungram

jamiebuilds · 2021-11-02T19:07:44Z

crates/rome-js-grammar/src/grammar.ungram

+
+
+JsIdentifier =
+    name: 'js_identifier'


Should this be named like JsStatic*Name or just get folded into JsStaticMemberName?

Could make sense to unify them (I also see it as an option to use JsIdentifier in place of JsStatic*Name).

I kinda worry about using the more generic sounding JsIdentifier because the other types have been split out:

JsBindingIdentifier JsIdentifierAssignmentTarget JsIdentifier JsReferenceIdentifierExpression

A user might match on JsIdentifier thinking it will return all kinds of identifiers, especially since that is the case in Babel/etc.

crates/rome-js-grammar/src/grammar.ungram

jamiebuilds · 2021-11-02T19:46:59Z

I think we should also take a moment to consider how some of these nodes relate to the TypeScript and JSX grammars. For example there are a number of different types of identifiers that we have already:

JsBindingIdentifier
JsIdentifierAssignmentTarget
JsIdentifier
JsReferenceIdentifierExpression

TypeScript has parallels for almost all of these (except the JsIdentifierAssignmentTarget):

type TsBindingIdentifier<
  TsBindingIdentifier extends TsIdentifierReferenceExpression = TsIdentifierReferenceExpression
> = {
  TsStaticMemberName: TsIdentifierReferenceExpression.TsStaticMemberName
}

JSX also has a few:

<>
  {/* member expression */}
  <jsxIdentifier.jsxIdentifier jsxIdentifier/>
  <JsxReferenceIdentifier.jsxIdentifier/>
  {/* namespaced name */}
  <JsxReferenceIdentifier:jsxIdentifier jsxIdentifier:jsxIdentifier/>
  {/* tsx */}
  <JsxReferenceIdentifier<TsReferenceIdentifier> />
</>

Initially I was leaning towards saying that we should reuse grammar names when possible (i.e. No separate JsBindingIdentifier and TsBindingIdentifier) thinking that would be better for letting people operate on these nodes abstractly. However, that also makes it worse for people that are trying to work with a specific grammar and now need their own logic to figure out where in the grammar they are. It would also imply that we should try aligning in more places like using JsMemberExpression instead of JsxMemberExpression, but these have important differences in grammar (<foo[bar]/> is not valid JSX syntax for example).

For the people who do want to work with identifiers in the abstract, we could have separate tools for doing so:

// just as an example, each identifier-like node could implement a shared trait
trait Identifier { fn name() -> String }
impl Identifier for JsBindingIdentifier { .. }
impl Identifier for JsIdentifierAssignmentTarget { .. }
// or belong to some enum
enum AnyIdentifier { .. }
// etc...
if node.is_identifier() { .. } // check if a node is some identifier-like node
any_identifier.name(); // get the name of some identifier-like node

We should come up with general categories for these in that case:

Reference Identifiers: JsIdentifierReferenceExpression, TsIdentifierReferenceExpression, JsxIdentifierReferenceExpression
Binding Identifiers: JsBindingIdentifier, TsBindingIdentifier

Maybe even further into the grammar:

Static Member Access: JsMemberExpression, TsQualifiedName, JsxMemberExpression, JsxNamespacedName
Computed Member Access: JsMemberExpression, TsIndexedAccessType

If we do want to go further into the grammar, we might want to consider splitting some nodes like JsMemberExpression up into things like JsStaticMemberExpression and JsComputedMemberExpression (This is what Shift AST does).

crates/rome-js-grammar/src/rfc.md

jamiebuilds · 2021-11-02T20:14:05Z

crates/rome-js-grammar/src/rfc.md

+
+## Ungrammar extension proposals
+
+Add support for `///` comments that we can use to document nodes and tokens.


JSDoc and fam use /** ... */ fairly universally. Did you want to parse inside these comments?

Sorry, I should have explained a bit more.

My understanding is that ungrammar doesn't provide the means for us today to generate documentation for the AST nodes and fields. Meaning, that our AST facade is undocumented.

What I was imagining is that we could write that documentation in the grammar. Having docs right above the grammar definition is not only useful for users of our facade but also if you try to figure out what code a node represents.

/// Some documentation that the code gen will add to the generated `AnyExpression` type type AnyExpression = ... type JsArrayExpression = '[' /// The elements of the array. The elements are separated by a comma. (or something more useful) elements: (AnyJsArrayElemet (',' AnyJsArrayElement)* ','?)

generates

/// Some documentation that the code gen will add to the generated `AnyExpression` type pub enum AnyExpression {... } impl JsArrayExpression { /// The elements of the array. The elements are separated by a comma. (or something more useful) pub fn elements(&self) -> AstSeparatedList<AnyJsArrayElement> {} }

MichaReiser · 2021-11-03T08:17:08Z

I think we should also take a moment to consider how some of these nodes relate to the TypeScript and JSX grammars. For example there are a number of different types of identifiers that we have already:
* `JsBindingIdentifier`

* `JsIdentifierAssignmentTarget`

* `JsIdentifier`

* `JsReferenceIdentifierExpression`
TypeScript has parallels for almost all of these (except the JsIdentifierAssignmentTarget):
type TsBindingIdentifier<
  TsBindingIdentifier extends TsIdentifierReferenceExpression = TsIdentifierReferenceExpression
> = {
  TsStaticMemberName: TsIdentifierReferenceExpression.TsStaticMemberName
}
JSX also has a few:
<>
  {/* member expression */}
  <jsxIdentifier.jsxIdentifier jsxIdentifier/>
  <JsxReferenceIdentifier.jsxIdentifier/>
  {/* namespaced name */}
  <JsxReferenceIdentifier:jsxIdentifier jsxIdentifier:jsxIdentifier/>
  {/* tsx */}
  <JsxReferenceIdentifier<TsReferenceIdentifier> />
</>
Initially I was leaning towards saying that we should reuse grammar names when possible (i.e. No separate JsBindingIdentifier and TsBindingIdentifier) thinking that would be better for letting people operate on these nodes abstractly. However, that also makes it worse for people that are trying to work with a specific grammar and now need their own logic to figure out where in the grammar they are. It would also imply that we should try aligning in more places like using JsMemberExpression instead of JsxMemberExpression, but these have important differences in grammar (<foo[bar]/> is not valid JSX syntax for example).

I haven't spent much time thinking about that yet (maybe I should have). My first question would be if we feel comfortable moving forward with a JS only grammar for now and figure out some of JS's dialects after (or while implementing the JS grammar). My motivation for delaying the decision is that implementing the JS grammar will give us a better understanding of what worked well and what doesn't.

I'm very much undecided right now on which approach to take and would need to look closer into how the grammars differ but I'm probably more leaning toward different nodes if they differ significantly. My main thoughts from a grammar perspective are:

Whatever a node's child is required should never depend on the presence (or absence) of a sibling. This is mainly a concern if certain tokens/nodes are required in TypeScript but not in JavaScript.
We should define a new node whenever the supported syntax for a child differs (JSX member expression example)

For the people who do want to work with identifiers in the abstract, we could have separate tools for doing so:

// just as an example, each identifier-like node could implement a shared trait
trait Identifier { fn name() -> String }
impl Identifier for JsBindingIdentifier { .. }
impl Identifier for JsIdentifierAssignmentTarget { .. }
// or belong to some enum
enum AnyIdentifier { .. }
// etc...
if node.is_identifier() { .. } // check if a node is some identifier-like node
any_identifier.name(); // get the name of some identifier-like node

I'm leaning towards an enum here because we already have the infrastructure in place to define it inside of our grammar and testing if something is an instance of it ``AnyJsIdentifier::cast(node.syntax())`

For the people who do want to work with identifiers in the abstract, we could have separate tools for doing so:

// just as an example, each identifier-like node could implement a shared trait
trait Identifier { fn name() -> String }
impl Identifier for JsBindingIdentifier { .. }
impl Identifier for JsIdentifierAssignmentTarget { .. }
// or belong to some enum
enum AnyIdentifier { .. }
// etc...
if node.is_identifier() { .. } // check if a node is some identifier-like node
any_identifier.name(); // get the name of some identifier-like node

We should come up with general categories for these in that case:

* Reference Identifiers: `JsIdentifierReferenceExpression`, `TsIdentifierReferenceExpression`, `JsxIdentifierReferenceExpression`

* Binding Identifiers: `JsBindingIdentifier`, `TsBindingIdentifier`

Maybe even further into the grammar:

* Static Member Access: `JsStaticMemberExpression`, `TsQualifiedName`, `JsxMemberExpression`, `JsxNamespacedName`

* Computed Member Access: `JsComputedMemberExpression`, `TsIndexedAccessType`

I like this categorisation and having these "abstract" types will simplify writing analysis a lot. Roslyn has the concept of IOperation that abstract a common operation across languages (for example, creating an array).

If we do want to go further into the grammar, we might want to consider splitting some nodes like JsMemberExpression up into things like JsStaticMemberExpression and JsComputedMemberExpression (This is what Shift AST does).

I actually liked Rome's approach of computed/static a bit better because it gets away with fewer node types but it certainly makes sense to split it if it helps to categorise the node types.

MichaReiser · 2021-11-09T07:55:05Z

Moved the files to the RFC folder. Updated the JS grammar with the proposed changes.

Renames and restructures the following statements as we agreed to in our new JS grammar (#1719): * BlockStatement: Rename node and rename `stmts` child to `statements` * DebuggerStatement: Rename node * EmptyStatement: Rename node * ExpressionStatement: Rename node, `expr` child to `expression`, and add optional trailing semicolon * ReturnStatement: Rename node, and rename 'value' child to 'argument' * LabeledStatement: Rename node, change type of `label` child to a token, and rename `stmt` child to `body` * WithStatement: * Rename node * inline "condition" (it's not a condition) * rename `condition` to `object` * rename `cons` to `body` * Add parser test

…1774) Changes the js grammar to match our new AST facade as defined in #1725 (and proposed in #1719) * `Continue` Statement * Rename from `ContinueStmt` to `JsContinueStatement` * Inline the label identifier (It's not a normal "variable" reference but a label reference) * `Break` Statement * Rename from `BreakStmt` to `JsBreakStatement` * Inline the label identifier (It's not a normal "variable" reference but a label reference) * `While` Statement * Rename from `WhileStmt` to `JsWhileStatement` * Rename `cons` to `body` * Inline the condition and rename `condition` to `test` * `DoWhileStatement` * Rename to `JsDoWhileStatement` * Inline condition * rename `cons` to `body` * rename `condition` to `test`

Refines the AST Facade for literals to match the grammar proposed in #1719. This change is part of #1725. * Introduce new `JsAnyLiteral` * Split `Literal` into `JsStringLiteral`, `JsBooleanLiteral`, `JsNullLiteral`, `JsRegexLiteral`, and `JsBigIntLiteral`. This allows to implement custom methods on the corresponding literal nodes. * Renames the `number` and kinds to `JS_NUMBER_LITERAL_TOKEN` `JS_STRING_LITERAL_TOKEN` to avoid conflicts with the TS `number` keyword (and keep symmetry). * Removed some unused keywords and special handling inside of the code gen.

Changes the js grammar of the variable declaration to match our new AST facade as defined in #1725 (and proposed in #1719) * Split `VarDecl` into `JsVariableDeclarationStatement` and `JsVariableDeclaration` to correctly account for variable declarations inside for loops (that aren't statements). * Change the parser to emit a `let` token instead of an ident if let is used inside a variable declaration (let is a valid identifier which is why the lexer returns an `ident` token) * Rename `Declarator` to `JsVariableDeclarator` * Split out the `init` into a `JsEqualValueClause` that holds the `=` token and the initializer expression.

Refines the AST Facade for literals to match the grammar proposed in #1719. This change is part of #1725. * Introduce new `JsAnyLiteral` * Split `Literal` into `JsStringLiteral`, `JsBooleanLiteral`, `JsNullLiteral`, `JsRegexLiteral`, and `JsBigIntLiteral`. This allows to implement custom methods on the corresponding literal nodes. * Renames the `number` and kinds to `JS_NUMBER_LITERAL_TOKEN` `JS_STRING_LITERAL_TOKEN` to avoid conflicts with the TS `number` keyword (and keep symmetry). * Removed some unused keywords and special handling inside of the code gen.

…rrow function expression AST tree structures (#1786) Updates the facade for function declarations and function expression to match the grammar defined in #1719. * Function declaration * Rename from `FnDecl` to `JsFunctionDeclaration` * Rename `ParameterList` to `JsParameterList` * Introduce new `JsIdentifierBinding` type for `id` * Parses `...rest` as `JS_REST_PARAMETER` instead of `JS_REST_PATTERN`. * Split out `TsReturnType` for the `:` and the return type * Function expression * Rename from `FnExpr` to `JsFunctionExpression` * Otherwise the same as for declaration * Arrow function expression * Rename from `ArrowExpr` to `JsArrowFunctionExpression` * Rename `ExprOrBlock` to `JsAnyArrowFunctionBody` * Otherwise the same as for declaration. This PR cleans up some of the function parsing (extracts helpers rather than using boolean params). This improves error reporting for the `func_decl` use case if the name is missing. This PR also removes the support for decorators. The parser had some support for decorators but they haven't been part of the grammar. We should re-add them when we decide to properly support them (and add tests)

MichaReiser commented Oct 27, 2021

View reviewed changes

crates/rome-js-grammar/src/grammar.ungram Outdated Show resolved Hide resolved

crates/rome-js-grammar/src/rfc.md Outdated Show resolved Hide resolved

MichaReiser requested review from jamiebuilds, ematipico, yassere, sebmck and xunilrj October 27, 2021 09:15

MichaReiser self-assigned this Oct 27, 2021

MichaReiser marked this pull request as ready for review October 27, 2021 13:16

MichaReiser linked an issue Oct 29, 2021 that may be closed by this pull request

☂️ AST Façade Improvements #1725

Closed

47 tasks

MichaReiser mentioned this pull request Oct 29, 2021

📎 Fix missing JS ungrammar #1729

Closed

ematipico reviewed Nov 1, 2021

View reviewed changes

MichaReiser commented Nov 1, 2021

View reviewed changes

yassere reviewed Nov 2, 2021

View reviewed changes

crates/rome-js-grammar/src/rfc.md Outdated Show resolved Hide resolved

yassere reviewed Nov 2, 2021

View reviewed changes

jamiebuilds reviewed Nov 2, 2021

View reviewed changes

crates/rome-js-grammar/src/grammar.ungram Outdated Show resolved Hide resolved

jamiebuilds reviewed Nov 2, 2021

View reviewed changes

crates/rome-js-grammar/src/grammar.ungram Outdated Show resolved Hide resolved

jamiebuilds reviewed Nov 2, 2021

View reviewed changes

crates/rome-js-grammar/src/rfc.md Outdated Show resolved Hide resolved

jamiebuilds reviewed Nov 2, 2021

View reviewed changes

MichaReiser force-pushed the feature/js-grammar branch 4 times, most recently from 23e0a27 to 7f3ccfb Compare November 8, 2021 15:56

ematipico mentioned this pull request Nov 8, 2021

☂️ Weekly goals - week 45 #1754

Closed

RFC JS Grammar

a1a8f04

MichaReiser force-pushed the feature/js-grammar branch from 9c84e7d to a1a8f04 Compare November 9, 2021 07:54

MichaReiser mentioned this pull request Nov 9, 2021

feat(codegen): Respect label for tokens & support optional token unions #1755

Merged

Revert codegen changes

9a96362

MichaReiser mentioned this pull request Nov 9, 2021

☂️ AST Façade Improvements #1725

Closed

47 tasks

MichaReiser merged commit 95066f6 into main Nov 10, 2021

MichaReiser deleted the feature/js-grammar branch November 10, 2021 08:36

This was referenced Nov 10, 2021

refactor (rslint_parser): Rename Script & Module, Stmt, and Expr #1767

Merged

refactor(rslint_parser): Restructure simple statements #1769

Merged

This was referenced Nov 12, 2021

refactor(rslint_parser): Update while loops to match new AST Facade #1774

Merged

Update Variable Declaration to match new AST Facade #1775

Merged

MichaReiser mentioned this pull request Nov 12, 2021

Update literals to match new AST Facade #1776

Merged

MichaReiser mentioned this pull request Nov 15, 2021

Function Expression/Declaration AST Facade #1786

Merged

MichaReiser mentioned this pull request Dec 1, 2021

Rename Pattern to Bindings and share parsing logic with Assignment Target #1839

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: JS Grammar #1719

RFC: JS Grammar #1719

MichaReiser commented Oct 27, 2021 •

edited

Loading

cloudflare-workers-and-pages bot commented Oct 27, 2021 •

edited

Loading

ematipico left a comment •

edited

Loading

ematipico commented Nov 1, 2021

MichaReiser Nov 1, 2021

MichaReiser Nov 1, 2021

yassere Nov 2, 2021

MichaReiser Nov 2, 2021

jamiebuilds Nov 2, 2021

MichaReiser Nov 3, 2021

jamiebuilds Nov 5, 2021

jamiebuilds commented Nov 2, 2021

jamiebuilds Nov 2, 2021

MichaReiser Nov 3, 2021

MichaReiser commented Nov 3, 2021

MichaReiser commented Nov 9, 2021


		JsTemplateElement = JsStringTemplateElement \| JsExpression

		JsStringTemplateElement = value: 'js_string_template_element'

		Having fine granular nodes has the advantage that the API allows querying for a very specific node when, for example, using `node.descendants::<PreIncrementExpression>()`.

		Having more coarse-grained nodes on the other hand has the advantage that it's easier to implement common behaviour that, for example, applies to all Binary expressions.


		## Ungrammar extension proposals

		Add support for `///` comments that we can use to document nodes and tokens.

RFC: JS Grammar #1719

RFC: JS Grammar #1719

Conversation

MichaReiser commented Oct 27, 2021 • edited Loading

Summary

Scope

Test Plan

cloudflare-workers-and-pages bot commented Oct 27, 2021 • edited Loading

Deploying with Cloudflare Pages

ematipico left a comment • edited Loading

Choose a reason for hiding this comment

ematipico commented Nov 1, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jamiebuilds commented Nov 2, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MichaReiser commented Nov 3, 2021

MichaReiser commented Nov 9, 2021

MichaReiser commented Oct 27, 2021 •

edited

Loading

cloudflare-workers-and-pages bot commented Oct 27, 2021 •

edited

Loading

ematipico left a comment •

edited

Loading