Skip to content
This repository has been archived by the owner on Aug 31, 2023. It is now read-only.

☂️ Weekly goals - week 41 #1700

Closed
ematipico opened this issue Oct 11, 2021 · 2 comments
Closed

☂️ Weekly goals - week 41 #1700

ematipico opened this issue Oct 11, 2021 · 2 comments
Labels
umbrella Issue to track a collection of other issues

Comments

@ematipico
Copy link
Contributor

ematipico commented Oct 11, 2021

High Level Objectives / Questions we're trying to answer

  • try to format more JavaScript code, possibly using real files instead of small snippets;
  • format broken JavaScript code using rslint_parser and see what we would be able to achieve;
  • start thinking more about the APIs of the formatter and how different agents will use them (CLI, LSP server, etc.);
  • how rslint_parser would fit Rome's architecture;
  • study and implement the compiler API and evaluate how it should behave; for example, should we prefer "completeness" or "soundness" inside our linter? Should we show a diagnostic error that might be invalid?
  • evaluate tools/crates that could help us to create a query-based compiler, especially in a multi-thread environment;

Projects / Tasks

  • widen the coverage of the format logic on more AST nodes;
  • explore how rslint_parser determines when a JavaScript code is broken and how its green trees are structured. The idea is that nodes that contains EROOR@ nodes should not be formatted and left as is. We will explore some heuristic to see if we're able to do that;
  • create some intenal POC where we try out salsa
  • separate rowan from rslint_parser;
  • attach trivia to tokens; more information can be found in the rust-analyzer issue
  • defined the Rome's compiler APIs and how they would fit inside our idea of linter;
  • explore salsa and it's current state;
  • explore rustc caching model;
  • create different POCs of a caching system inside a bundler:
    • invalidate cache
    • invalidate a branch of a dependency tree

Risks

  • formatting broken ASTs might not work out using the current parser architecture;
  • rslint_parser might not fit our needs as it is now, and we would need to consider forking it or some parts of it;
  • salsa or other existing tools might not fit, and we would need to write a in-house caching system
@ematipico ematipico added umbrella Issue to track a collection of other issues weekly goals labels Oct 11, 2021
@ematipico ematipico pinned this issue Oct 11, 2021
@ematipico ematipico changed the title ☂️ Weekly goeals - week 41 ☂️ Weekly goals - week 41 Oct 11, 2021
@MichaReiser
Copy link
Contributor

MichaReiser commented Oct 12, 2021

I want to share an update on my current around formatting source files containing parsing errors (see draft PR #1707)

The main change of the PR is to change ToFormatElement to return an Option<FormatElement> rather than a FormatElement. Allowing formatters to return None enables them to return early if the AST node misses mandatory fields, and formatters can implement custom recovery logic dealing with children that couldn't be formatted, for example, skipping that node and continue with the next statement.

This API change is working pretty nicely, and the PR shows that the formatter can format source files containing parsing errors where nodes or tokens are missing. However, the rslint-CST doesn't work well if there are additional tokens, nodes or errors in the tree that wouldn't be present in an otherwise valid tree.

For example the following script:

--;

let a = 10;

with the corresponding CST:

The main issue is that field accessors on the Scripts AstNode skip the MINUS2 node/token because it is no Stmt. This has been an intentional decision and is working well for lining but becomes problematic for formatting because it isn't transparent that the node contains a parsing error and, therefore, shouldn't be formatted.

The same problem exists if the parser inserts an Error node or other sub-nodes, as we can see in this example (notice that the for statement uses three ; rather than two):

for (; --; ++;;) {

}

which rslint parses to

rslint inserts the additional [email protected] and [email protected] nodes, both aren't accessible by any AstNode field accessor besides iterating over the SyntaxNodes children.

The only way to work around this without changing the CST structure is to directly work on the node's children but that comes with the downside that the resulting code will be very hard to maintain.

That's why I suggest changing the parser to match Roslyn's behaviour:

  • Insert missing nodes and tokens when parsing to guarantee that the parsed CST is valid according to our grammar. Roslyn, for example, wraps the -- in expression_statement(pre_increment_expression(minus_minus_token, identifier_name { IsMissing=True})) which now forms a valid Stmt. Nodes that weren't present in the source code are marked as IsMissing: true and their text representation (more specifically, the text representation of the containing tokens) is an empty string.
  • Error nodes should be part of the union types where Errors can occur, for example, Stmt, Expr. For example, Roslyn has the IncompleteMember node for incompleteclass members and the IncompleteMember is part of the MemberDeclaration union type. In other words, Errors are part of our grammar and should, therefore, be explicitly modelled and listed. I think this has the side-benefit that we're forced to explicitly think about what an error means e.g. when we do semantic analysis rather than just accidentally skipping an error node.

I haven't thought about what that means for our AstNode API. Should field accessors still return Option<TChild> for mandatory fields or should we change it to TChild and instead explicitly check for IsMissing in the few cases where it's needed?

However, enforcing a specific structure for every node combined with explicitly modelling trivia would allow us to use static-offsets in our field accessors, e.g. the if token must always be the first child, whereas the condition is the second child...

Error recovery

I'm still undecided how the error recovery should work. The current implementation skips statements containing syntax errors but then continues with the next statement. I don't even know if this is too conservative or too lax? The main issue I see of doing supposedly clever error-recovering in the formatter is that the parser already performed some error-recovery, thus, what the formatter sees may already be different than what's written in the source file. I currently favour trusting whatever the parser came up with as being correct and improve in the parser's error recovery if this isn't good enough rather than trying to infer something from the CST structure.

Further Improvements

@ematipico and I discovered two additional properties of our AST that make it inconvenient to work with:

  • Testing if an IfStmt has an alternative branch requires testing if either the else_token or alt fields are present (not None). Ideally, only testing one would be sufficient (possible if we introduce an ElseClause that wraps the else_token and the body).
  • An explicit node for array holes would simplify iterating over the array children: let a = [,,3,,4,]; -> [JSArrayHole, JSArrayHole, , JSNumericLiteral {value: 3}, JSArrayHole {}, JSNumericLiteral {value: 3}]

@ematipico
Copy link
Contributor Author

The team is going to have a series of meetings to discuss a series of arguments regarding the architecture of the compiler, parser and other stuff.

This means that the next weekly goal will be delayed based on the outcome of these meetings.

@ematipico ematipico unpinned this issue Oct 18, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
umbrella Issue to track a collection of other issues
Projects
None yet
Development

No branches or pull requests

2 participants