☂️ Weekly goals - week 41 #1700

ematipico · 2021-10-11T10:38:20Z

High Level Objectives / Questions we're trying to answer

try to format more JavaScript code, possibly using real files instead of small snippets;
format broken JavaScript code using rslint_parser and see what we would be able to achieve;
start thinking more about the APIs of the formatter and how different agents will use them (CLI, LSP server, etc.);
how rslint_parser would fit Rome's architecture;
study and implement the compiler API and evaluate how it should behave; for example, should we prefer "completeness" or "soundness" inside our linter? Should we show a diagnostic error that might be invalid?
evaluate tools/crates that could help us to create a query-based compiler, especially in a multi-thread environment;

Projects / Tasks

widen the coverage of the format logic on more AST nodes;
explore how rslint_parser determines when a JavaScript code is broken and how its green trees are structured. The idea is that nodes that contains EROOR@ nodes should not be formatted and left as is. We will explore some heuristic to see if we're able to do that;
create some intenal POC where we try out salsa
separate rowan from rslint_parser;
attach trivia to tokens; more information can be found in the rust-analyzer issue
defined the Rome's compiler APIs and how they would fit inside our idea of linter;
explore salsa and it's current state;
explore rustc caching model;
create different POCs of a caching system inside a bundler:
- invalidate cache
- invalidate a branch of a dependency tree

Risks

formatting broken ASTs might not work out using the current parser architecture;
rslint_parser might not fit our needs as it is now, and we would need to consider forking it or some parts of it;
salsa or other existing tools might not fit, and we would need to write a in-house caching system

The text was updated successfully, but these errors were encountered:

MichaReiser · 2021-10-12T12:11:13Z

I want to share an update on my current around formatting source files containing parsing errors (see draft PR #1707)

The main change of the PR is to change ToFormatElement to return an Option<FormatElement> rather than a FormatElement. Allowing formatters to return None enables them to return early if the AST node misses mandatory fields, and formatters can implement custom recovery logic dealing with children that couldn't be formatted, for example, skipping that node and continue with the next statement.

This API change is working pretty nicely, and the PR shows that the formatter can format source files containing parsing errors where nodes or tokens are missing. However, the rslint-CST doesn't work well if there are additional tokens, nodes or errors in the tree that wouldn't be present in an otherwise valid tree.

For example the following script:

--;

let a = 10;

with the corresponding CST:

[email protected]
  [email protected] "--"
  [email protected]
    [email protected] ";"
  [email protected] "\n\n"
  [email protected]
    [email protected] "let"
    [email protected] " "
    [email protected]
      [email protected]
        [email protected]
          [email protected] "a"
      [email protected] " "
      [email protected] "="
      [email protected] " "
      [email protected]
        [email protected] "10"
    [email protected] ";"
  [email protected] "\n"

The main issue is that field accessors on the Scripts AstNode skip the MINUS2 node/token because it is no Stmt. This has been an intentional decision and is working well for lining but becomes problematic for formatting because it isn't transparent that the node contains a parsing error and, therefore, shouldn't be formatted.

The same problem exists if the parser inserts an Error node or other sub-nodes, as we can see in this example (notice that the for statement uses three ; rather than two):

for (; --; ++;;) {

}

which rslint parses to

[email protected]
  [email protected]
    [email protected] "for"
    [email protected] " "
    [email protected] "("
    [email protected] ";"
    [email protected] " "
    [email protected]
      [email protected] "--"
    [email protected] ";"
    [email protected] " "
    [email protected]
      [email protected] "++"
    [email protected]
      [email protected] ";"
  [email protected]
    [email protected] ";"
  [email protected]
    [email protected] ")"
  [email protected] " "
  [email protected]
    [email protected] "{"
    [email protected] "\n\n"
    [email protected] "}"
  [email protected] "\n"

rslint inserts the additional [email protected] and [email protected] nodes, both aren't accessible by any AstNode field accessor besides iterating over the SyntaxNodes children.

The only way to work around this without changing the CST structure is to directly work on the node's children but that comes with the downside that the resulting code will be very hard to maintain.

That's why I suggest changing the parser to match Roslyn's behaviour:

Insert missing nodes and tokens when parsing to guarantee that the parsed CST is valid according to our grammar. Roslyn, for example, wraps the -- in expression_statement(pre_increment_expression(minus_minus_token, identifier_name { IsMissing=True})) which now forms a valid Stmt. Nodes that weren't present in the source code are marked as IsMissing: true and their text representation (more specifically, the text representation of the containing tokens) is an empty string.
Error nodes should be part of the union types where Errors can occur, for example, Stmt, Expr. For example, Roslyn has the IncompleteMember node for incompleteclass members and the IncompleteMember is part of the MemberDeclaration union type. In other words, Errors are part of our grammar and should, therefore, be explicitly modelled and listed. I think this has the side-benefit that we're forced to explicitly think about what an error means e.g. when we do semantic analysis rather than just accidentally skipping an error node.

I haven't thought about what that means for our AstNode API. Should field accessors still return Option<TChild> for mandatory fields or should we change it to TChild and instead explicitly check for IsMissing in the few cases where it's needed?

However, enforcing a specific structure for every node combined with explicitly modelling trivia would allow us to use static-offsets in our field accessors, e.g. the if token must always be the first child, whereas the condition is the second child...

Error recovery

I'm still undecided how the error recovery should work. The current implementation skips statements containing syntax errors but then continues with the next statement. I don't even know if this is too conservative or too lax? The main issue I see of doing supposedly clever error-recovering in the formatter is that the parser already performed some error-recovery, thus, what the formatter sees may already be different than what's written in the source file. I currently favour trusting whatever the parser came up with as being correct and improve in the parser's error recovery if this isn't good enough rather than trying to infer something from the CST structure.

Further Improvements

@ematipico and I discovered two additional properties of our AST that make it inconvenient to work with:

Testing if an IfStmt has an alternative branch requires testing if either the else_token or alt fields are present (not None). Ideally, only testing one would be sufficient (possible if we introduce an ElseClause that wraps the else_token and the body).
An explicit node for array holes would simplify iterating over the array children: let a = [,,3,,4,]; -> [JSArrayHole, JSArrayHole, , JSNumericLiteral {value: 3}, JSArrayHole {}, JSNumericLiteral {value: 3}]

ematipico · 2021-10-18T14:56:46Z

The team is going to have a series of meetings to discuss a series of arguments regarding the architecture of the compiler, parser and other stuff.

This means that the next weekly goal will be delayed based on the outcome of these meetings.

ematipico added umbrella Issue to track a collection of other issues weekly goals labels Oct 11, 2021

ematipico pinned this issue Oct 11, 2021

ematipico changed the title ~~☂️ Weekly goeals - week 41~~ ☂️ Weekly goals - week 41 Oct 11, 2021

MichaReiser closed this as completed Oct 12, 2021

MichaReiser reopened this Oct 12, 2021

MichaReiser mentioned this issue Oct 13, 2021

feature(formatter): format ASTs with missing tokens or nodes #1707

Merged

ematipico closed this as completed Oct 18, 2021

ematipico unpinned this issue Oct 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

☂️ Weekly goals - week 41 #1700

☂️ Weekly goals - week 41 #1700

ematipico commented Oct 11, 2021 •

edited

Loading

MichaReiser commented Oct 12, 2021 •

edited

Loading

ematipico commented Oct 18, 2021

☂️ Weekly goals - week 41 #1700

☂️ Weekly goals - week 41 #1700

Comments

ematipico commented Oct 11, 2021 • edited Loading

High Level Objectives / Questions we're trying to answer

Projects / Tasks

Risks

MichaReiser commented Oct 12, 2021 • edited Loading

Error recovery

Further Improvements

ematipico commented Oct 18, 2021

ematipico commented Oct 11, 2021 •

edited

Loading

MichaReiser commented Oct 12, 2021 •

edited

Loading