Skip to content
This repository has been archived by the owner on Aug 31, 2023. It is now read-only.

RFC: JS Grammar #1719

Merged
merged 2 commits into from
Nov 10, 2021
Merged

RFC: JS Grammar #1719

merged 2 commits into from
Nov 10, 2021

Conversation

MichaReiser
Copy link
Contributor

@MichaReiser MichaReiser commented Oct 27, 2021

Summary

RFC for a re-structured AST Facade (see /grammar.ungram file). The main motivations for restructuring the grammar are (#1718 goes into more details of why):

  • Structure the AST so that every token is exposed on the AST Facade. This may require introducing new intermediate nodes
  • Be explicit about the optionality of an element so that the AST Facade can return an Err if a mandatory element is missing.
  • Group children that depend on each other's existence in a sub-node. For example, the else token and the alternative branch of an if statement must both either be present or absent. Therefore, add an optional ElseClause to the if statement that has the mandatory else token and alternative branch children` (if one is present, the other must as well).
  • Ensure that children appear in fixed ordering. For example, the UpdatExpression (x++ or ++x) must be modelled as two distinct nodes to guarantee that the operator and identifier are always in the same order for each node type (PreUpdateExpression, PostUpdateExpression)
  • Encode possible position for error nodes (I'll look at specifying error recovery separately).
  • Align the naming and AST structure with Rome classic (and adapt it to align with the other goals). I tried to call out the main differences to Rome in comments above the relevant nodes.

The grammar does not yet cover typescript. This is my first time trying to design an AST. I'm certain there are some missing children, cases I didn't think of and some confusing names.

Scope

  • crates/rome-js-grammar/grammar.ungram defines the JS grammar
  • crates/rome-js-grammar/rfc.md lists open questions, proposed decisions, code gen extensions etc.
  • The other files are generated or based on @ematipico's work on migrating to ungrammar.

Test Plan

My imagination... There's no parser generating the defined AST today. I sketched some trees by hand but that's all. Changing the parser will bring up any cases that I missed.

@cloudflare-workers-and-pages
Copy link

cloudflare-workers-and-pages bot commented Oct 27, 2021

Deploying with  Cloudflare Pages  Cloudflare Pages

Latest commit: 9a96362
Status: ✅  Deploy successful!
Preview URL: https://6b426c46.tools-8rn.pages.dev

View logs

@MichaReiser MichaReiser self-assigned this Oct 27, 2021
@MichaReiser MichaReiser marked this pull request as ready for review October 27, 2021 13:16
@MichaReiser MichaReiser linked an issue Oct 29, 2021 that may be closed by this pull request
47 tasks
Copy link
Contributor

@ematipico ematipico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments. Overall it looks good! Great work! :)

@ematipico
Copy link
Contributor

Re: the + operator, I think we should evaluate creating a PR inside ungrammar repository. At least, propose it and see it our feature makes sense.

declarators: JsVariableDeclarator*

JsVariableDeclarator =
id: JsBinding '=' init: JsExpression?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we should change declarators to (JsVariableDeclarator (',' JsVariableDeclarator)*) (maybe with trailing)?


JsTemplateElement = JsStringTemplateElement | JsExpression

JsStringTemplateElement = value: 'js_string_template_element'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created JsStringTemplateElement because unions should always be over nodes and not tokens.

I created a js_string_template_element because it isn't a quoted string as it is the case for js_string

Comment on lines 138 to 140
Having fine granular nodes has the advantage that the API allows querying for a very specific node when, for example, using `node.descendants::<PreIncrementExpression>()`.

Having more coarse-grained nodes on the other hand has the advantage that it's easier to implement common behaviour that, for example, applies to all Binary expressions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we get the best of both worlds by having additional specific AST nodes that allow you to refine a coarse node? The specific nodes would need additional checks on the can_cast method (that would probably be hand-written in many cases). And we'd implement From<SpecificNode> for CoarseNode to make them easier to use.

Alternately, we could add granularity through utility functions: node.descendants().filter_map(UnaryExpression::is_pre_increment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and I think this is overlapping with what @ematipico proposed here. For example, we could define a BinaryExpression union that contains the types for each specific operator. You can then call BinaryExpression::cast(specific_expr.syntax()) to get a BinaryExpression out of it or just do specific_expr.into() in a place where a BinaryExpression is expected.

The only downside is that these fine-grained unions will expand, for example, into the Expr union. But that might be less of a problem because these matches will not be exhaustive for most use cases?



JsIdentifier =
name: 'js_identifier'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be named like JsStatic*Name or just get folded into JsStaticMemberName?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could make sense to unify them (I also see it as an option to use JsIdentifier in place of JsStatic*Name).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kinda worry about using the more generic sounding JsIdentifier because the other types have been split out:

JsBindingIdentifier
JsIdentifierAssignmentTarget
JsIdentifier
JsReferenceIdentifierExpression

A user might match on JsIdentifier thinking it will return all kinds of identifiers, especially since that is the case in Babel/etc.

@jamiebuilds
Copy link
Contributor

I think we should also take a moment to consider how some of these nodes relate to the TypeScript and JSX grammars. For example there are a number of different types of identifiers that we have already:

  • JsBindingIdentifier
  • JsIdentifierAssignmentTarget
  • JsIdentifier
  • JsReferenceIdentifierExpression

TypeScript has parallels for almost all of these (except the JsIdentifierAssignmentTarget):

type TsBindingIdentifier<
  TsBindingIdentifier extends TsIdentifierReferenceExpression = TsIdentifierReferenceExpression
> = {
  TsStaticMemberName: TsIdentifierReferenceExpression.TsStaticMemberName
}

JSX also has a few:

<>
  {/* member expression */}
  <jsxIdentifier.jsxIdentifier jsxIdentifier/>
  <JsxReferenceIdentifier.jsxIdentifier/>
  {/* namespaced name */}
  <JsxReferenceIdentifier:jsxIdentifier jsxIdentifier:jsxIdentifier/>
  {/* tsx */}
  <JsxReferenceIdentifier<TsReferenceIdentifier> />
</>

Initially I was leaning towards saying that we should reuse grammar names when possible (i.e. No separate JsBindingIdentifier and TsBindingIdentifier) thinking that would be better for letting people operate on these nodes abstractly. However, that also makes it worse for people that are trying to work with a specific grammar and now need their own logic to figure out where in the grammar they are. It would also imply that we should try aligning in more places like using JsMemberExpression instead of JsxMemberExpression, but these have important differences in grammar (<foo[bar]/> is not valid JSX syntax for example).

For the people who do want to work with identifiers in the abstract, we could have separate tools for doing so:

// just as an example, each identifier-like node could implement a shared trait
trait Identifier { fn name() -> String }
impl Identifier for JsBindingIdentifier { .. }
impl Identifier for JsIdentifierAssignmentTarget { .. }
// or belong to some enum
enum AnyIdentifier { .. }
// etc...
if node.is_identifier() { .. } // check if a node is some identifier-like node
any_identifier.name(); // get the name of some identifier-like node

We should come up with general categories for these in that case:

  • Reference Identifiers: JsIdentifierReferenceExpression, TsIdentifierReferenceExpression, JsxIdentifierReferenceExpression
  • Binding Identifiers: JsBindingIdentifier, TsBindingIdentifier

Maybe even further into the grammar:

  • Static Member Access: JsMemberExpression, TsQualifiedName, JsxMemberExpression, JsxNamespacedName
  • Computed Member Access: JsMemberExpression, TsIndexedAccessType

If we do want to go further into the grammar, we might want to consider splitting some nodes like JsMemberExpression up into things like JsStaticMemberExpression and JsComputedMemberExpression (This is what Shift AST does).


## Ungrammar extension proposals

Add support for `///` comments that we can use to document nodes and tokens.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSDoc and fam use /** ... */ fairly universally. Did you want to parse inside these comments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I should have explained a bit more.

My understanding is that ungrammar doesn't provide the means for us today to generate documentation for the AST nodes and fields. Meaning, that our AST facade is undocumented.

What I was imagining is that we could write that documentation in the grammar. Having docs right above the grammar definition is not only useful for users of our facade but also if you try to figure out what code a node represents.

/// Some documentation that the code gen will add to the generated `AnyExpression` type
type AnyExpression = ...

type JsArrayExpression = 
  '['
  /// The elements of the array. The elements are separated by a comma. (or something more useful)
  elements: (AnyJsArrayElemet (',' AnyJsArrayElement)* ','?)
  

generates

/// Some documentation that the code gen will add to the generated `AnyExpression` type
pub enum AnyExpression {... }

impl JsArrayExpression {
  /// The elements of the array. The elements are separated by a comma. (or something more useful)
  pub fn elements(&self) -> AstSeparatedList<AnyJsArrayElement> {}
}

@MichaReiser
Copy link
Contributor Author

I think we should also take a moment to consider how some of these nodes relate to the TypeScript and JSX grammars. For example there are a number of different types of identifiers that we have already:

* `JsBindingIdentifier`

* `JsIdentifierAssignmentTarget`

* `JsIdentifier`

* `JsReferenceIdentifierExpression`

TypeScript has parallels for almost all of these (except the JsIdentifierAssignmentTarget):

type TsBindingIdentifier<
  TsBindingIdentifier extends TsIdentifierReferenceExpression = TsIdentifierReferenceExpression
> = {
  TsStaticMemberName: TsIdentifierReferenceExpression.TsStaticMemberName
}

JSX also has a few:

<>
  {/* member expression */}
  <jsxIdentifier.jsxIdentifier jsxIdentifier/>
  <JsxReferenceIdentifier.jsxIdentifier/>
  {/* namespaced name */}
  <JsxReferenceIdentifier:jsxIdentifier jsxIdentifier:jsxIdentifier/>
  {/* tsx */}
  <JsxReferenceIdentifier<TsReferenceIdentifier> />
</>

Initially I was leaning towards saying that we should reuse grammar names when possible (i.e. No separate JsBindingIdentifier and TsBindingIdentifier) thinking that would be better for letting people operate on these nodes abstractly. However, that also makes it worse for people that are trying to work with a specific grammar and now need their own logic to figure out where in the grammar they are. It would also imply that we should try aligning in more places like using JsMemberExpression instead of JsxMemberExpression, but these have important differences in grammar (<foo[bar]/> is not valid JSX syntax for example).

I haven't spent much time thinking about that yet (maybe I should have). My first question would be if we feel comfortable moving forward with a JS only grammar for now and figure out some of JS's dialects after (or while implementing the JS grammar). My motivation for delaying the decision is that implementing the JS grammar will give us a better understanding of what worked well and what doesn't.

I'm very much undecided right now on which approach to take and would need to look closer into how the grammars differ but I'm probably more leaning toward different nodes if they differ significantly. My main thoughts from a grammar perspective are:

  • Whatever a node's child is required should never depend on the presence (or absence) of a sibling. This is mainly a concern if certain tokens/nodes are required in TypeScript but not in JavaScript.
  • We should define a new node whenever the supported syntax for a child differs (JSX member expression example)

For the people who do want to work with identifiers in the abstract, we could have separate tools for doing so:

// just as an example, each identifier-like node could implement a shared trait
trait Identifier { fn name() -> String }
impl Identifier for JsBindingIdentifier { .. }
impl Identifier for JsIdentifierAssignmentTarget { .. }
// or belong to some enum
enum AnyIdentifier { .. }
// etc...
if node.is_identifier() { .. } // check if a node is some identifier-like node
any_identifier.name(); // get the name of some identifier-like node

I'm leaning towards an enum here because we already have the infrastructure in place to define it inside of our grammar and testing if something is an instance of it ``AnyJsIdentifier::cast(node.syntax())`

For the people who do want to work with identifiers in the abstract, we could have separate tools for doing so:

// just as an example, each identifier-like node could implement a shared trait
trait Identifier { fn name() -> String }
impl Identifier for JsBindingIdentifier { .. }
impl Identifier for JsIdentifierAssignmentTarget { .. }
// or belong to some enum
enum AnyIdentifier { .. }
// etc...
if node.is_identifier() { .. } // check if a node is some identifier-like node
any_identifier.name(); // get the name of some identifier-like node

We should come up with general categories for these in that case:

* Reference Identifiers: `JsIdentifierReferenceExpression`, `TsIdentifierReferenceExpression`, `JsxIdentifierReferenceExpression`

* Binding Identifiers: `JsBindingIdentifier`, `TsBindingIdentifier`

Maybe even further into the grammar:

* Static Member Access: `JsStaticMemberExpression`, `TsQualifiedName`, `JsxMemberExpression`, `JsxNamespacedName`

* Computed Member Access: `JsComputedMemberExpression`, `TsIndexedAccessType`

I like this categorisation and having these "abstract" types will simplify writing analysis a lot. Roslyn has the concept of IOperation that abstract a common operation across languages (for example, creating an array).

If we do want to go further into the grammar, we might want to consider splitting some nodes like JsMemberExpression up into things like JsStaticMemberExpression and JsComputedMemberExpression (This is what Shift AST does).

I actually liked Rome's approach of computed/static a bit better because it gets away with fewer node types but it certainly makes sense to split it if it helps to categorise the node types.

@MichaReiser MichaReiser force-pushed the feature/js-grammar branch 4 times, most recently from 23e0a27 to 7f3ccfb Compare November 8, 2021 15:56
@MichaReiser
Copy link
Contributor Author

Moved the files to the RFC folder. Updated the JS grammar with the proposed changes.

@MichaReiser MichaReiser mentioned this pull request Nov 9, 2021
47 tasks
@MichaReiser MichaReiser merged commit 95066f6 into main Nov 10, 2021
@MichaReiser MichaReiser deleted the feature/js-grammar branch November 10, 2021 08:36
MichaReiser added a commit that referenced this pull request Nov 11, 2021
Renames and restructures the following statements as we agreed to in our new JS grammar (#1719):

* BlockStatement: Rename node and rename `stmts` child to `statements`
* DebuggerStatement: Rename node
* EmptyStatement: Rename node
* ExpressionStatement: Rename node, `expr` child to `expression`, and add optional trailing semicolon
* ReturnStatement: Rename node, and rename 'value' child to 'argument' 
* LabeledStatement: Rename node, change type of `label` child to a token, and rename `stmt` child to `body`
* WithStatement: 
  * Rename node
  * inline "condition" (it's not a condition)
  * rename `condition` to `object`
  * rename `cons` to `body`
  * Add parser test
MichaReiser added a commit that referenced this pull request Nov 12, 2021
…1774)

Changes the js grammar to match our new AST facade as defined in #1725 (and proposed in #1719)

* `Continue` Statement
  * Rename from `ContinueStmt` to `JsContinueStatement`
  * Inline the label identifier (It's not a normal "variable" reference but a label reference) 
* `Break` Statement
  * Rename from `BreakStmt` to `JsBreakStatement`
  * Inline the label identifier (It's not a normal "variable" reference but a label reference) 
* `While` Statement
  * Rename from `WhileStmt` to `JsWhileStatement`
  * Rename `cons` to `body`
  * Inline the condition and rename `condition` to `test`
* `DoWhileStatement`
  * Rename to `JsDoWhileStatement`
  * Inline condition
  * rename `cons` to `body`
  * rename `condition` to `test`
MichaReiser added a commit that referenced this pull request Nov 12, 2021
Refines the AST Facade for literals to match the grammar proposed in #1719. This change is part of #1725.

* Introduce new `JsAnyLiteral`
* Split `Literal` into `JsStringLiteral`, `JsBooleanLiteral`, `JsNullLiteral`, `JsRegexLiteral`, and `JsBigIntLiteral`. This allows to implement custom methods on the corresponding literal nodes.
* Renames the `number` and kinds to `JS_NUMBER_LITERAL_TOKEN` `JS_STRING_LITERAL_TOKEN` to avoid conflicts with the TS `number` keyword (and keep symmetry).
* Removed some unused keywords and special handling inside of the code gen.
MichaReiser added a commit that referenced this pull request Nov 12, 2021
Refines the AST Facade for literals to match the grammar proposed in #1719. This change is part of #1725.

* Introduce new `JsAnyLiteral`
* Split `Literal` into `JsStringLiteral`, `JsBooleanLiteral`, `JsNullLiteral`, `JsRegexLiteral`, and `JsBigIntLiteral`. This allows to implement custom methods on the corresponding literal nodes.
* Renames the `number` and kinds to `JS_NUMBER_LITERAL_TOKEN` `JS_STRING_LITERAL_TOKEN` to avoid conflicts with the TS `number` keyword (and keep symmetry).
* Removed some unused keywords and special handling inside of the code gen.
MichaReiser added a commit that referenced this pull request Nov 12, 2021
Refines the AST Facade for literals to match the grammar proposed in #1719. This change is part of #1725.

* Introduce new `JsAnyLiteral`
* Split `Literal` into `JsStringLiteral`, `JsBooleanLiteral`, `JsNullLiteral`, `JsRegexLiteral`, and `JsBigIntLiteral`. This allows to implement custom methods on the corresponding literal nodes.
* Renames the `number` and kinds to `JS_NUMBER_LITERAL_TOKEN` `JS_STRING_LITERAL_TOKEN` to avoid conflicts with the TS `number` keyword (and keep symmetry).
* Removed some unused keywords and special handling inside of the code gen.
MichaReiser added a commit that referenced this pull request Nov 12, 2021
Refines the AST Facade for literals to match the grammar proposed in #1719. This change is part of #1725.

* Introduce new `JsAnyLiteral`
* Split `Literal` into `JsStringLiteral`, `JsBooleanLiteral`, `JsNullLiteral`, `JsRegexLiteral`, and `JsBigIntLiteral`. This allows to implement custom methods on the corresponding literal nodes.
* Renames the `number` and kinds to `JS_NUMBER_LITERAL_TOKEN` `JS_STRING_LITERAL_TOKEN` to avoid conflicts with the TS `number` keyword (and keep symmetry).
* Removed some unused keywords and special handling inside of the code gen.
MichaReiser added a commit that referenced this pull request Nov 12, 2021
Changes the js grammar of the variable declaration to match our new AST facade as defined in #1725 (and proposed in #1719)

* Split `VarDecl` into `JsVariableDeclarationStatement` and `JsVariableDeclaration` to correctly account for variable declarations inside for loops (that aren't statements).
* Change the parser to emit a `let` token instead of an ident if let is used inside a variable declaration (let is a valid identifier which is why the lexer returns an `ident` token)
* Rename `Declarator` to `JsVariableDeclarator`
* Split out the `init` into a `JsEqualValueClause` that holds the `=` token and the initializer expression.
MichaReiser added a commit that referenced this pull request Nov 12, 2021
Refines the AST Facade for literals to match the grammar proposed in #1719. This change is part of #1725.

* Introduce new `JsAnyLiteral`
* Split `Literal` into `JsStringLiteral`, `JsBooleanLiteral`, `JsNullLiteral`, `JsRegexLiteral`, and `JsBigIntLiteral`. This allows to implement custom methods on the corresponding literal nodes.
* Renames the `number` and kinds to `JS_NUMBER_LITERAL_TOKEN` `JS_STRING_LITERAL_TOKEN` to avoid conflicts with the TS `number` keyword (and keep symmetry).
* Removed some unused keywords and special handling inside of the code gen.
MichaReiser added a commit that referenced this pull request Nov 12, 2021
Refines the AST Facade for literals to match the grammar proposed in #1719. This change is part of #1725.

* Introduce new `JsAnyLiteral`
* Split `Literal` into `JsStringLiteral`, `JsBooleanLiteral`, `JsNullLiteral`, `JsRegexLiteral`, and `JsBigIntLiteral`. This allows to implement custom methods on the corresponding literal nodes.
* Renames the `number` and kinds to `JS_NUMBER_LITERAL_TOKEN` `JS_STRING_LITERAL_TOKEN` to avoid conflicts with the TS `number` keyword (and keep symmetry).
* Removed some unused keywords and special handling inside of the code gen.
MichaReiser added a commit that referenced this pull request Nov 16, 2021
…rrow function expression AST tree structures (#1786)

Updates the facade for function declarations and function expression to match the grammar defined in #1719. 

* Function declaration
  * Rename from `FnDecl` to `JsFunctionDeclaration`
  * Rename `ParameterList` to `JsParameterList`
  * Introduce new `JsIdentifierBinding` type for `id`
  * Parses `...rest` as `JS_REST_PARAMETER` instead of `JS_REST_PATTERN`.
  * Split out `TsReturnType` for the `:` and the return type
* Function expression
  * Rename from `FnExpr` to `JsFunctionExpression`
  * Otherwise the same as for declaration
* Arrow function expression
  * Rename from `ArrowExpr` to `JsArrowFunctionExpression`
  * Rename `ExprOrBlock` to `JsAnyArrowFunctionBody`
  * Otherwise the same as for declaration.

This PR cleans up some of the function parsing (extracts helpers rather than using boolean params). This improves error reporting for the `func_decl` use case if the name is missing. 

This PR also removes the support for decorators. The parser had some support for decorators but they haven't been part of the grammar. We should re-add them when we decide to properly support them (and add tests)
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

☂️ AST Façade Improvements
5 participants