☂️ Rome Parser Architecture Changes #1718
Comments
I agree with most of these changes. However, I don't think formatters should even attempt to format invalid syntax: it is probably unhelpful in most cases, it complicates the formatter, and it is likely to yield horrible results for somewhat broken code unless the error recovery is basically perfect, which is hard to achieve, especially in JS/TS.
@RDambrosio016 I share your concerns. Formatting invalid code is challenging and there's a high risk that we mess the code up rather than improve the formatting. Having something to play around with will show us how well full formatting, partial formatting (before any error, only format whitespace, ...), etc. work. Going from full formatting to not formatting files with syntax errors shouldn't be a significant change (it mainly means opting out earlier). Regarding "Use fixed slots to store the children": something we may have overlooked is how to handle lists inside of nodes. The difficulty with lists is that they may contain an arbitrary number of elements, making it impossible to use fixed offsets if these children are stored inline. One option I see is that we create artificial nodes for lists, e.g. Another alternative might be to build on top of We may be able to model slots more cleverly to avoid that additional heap allocation but I haven't found a way to do that yet (Ideally, we have two lists, a list with slots and a list with children: Downside, getting the static offset of a child requires calculating the offset from all slots before the slot you're looking for).
That's what I thought too. My understanding was to store a fixed number of children only for nodes that have a predetermined number of sub-nodes. For example, a Although we can't do this with, for example, for
@ematipico exactly. What I started outlining in the previous comment with slots could be implemented with a node format as follows, so that we can store the children inline (downside: paying an additional 8 bytes per slot)
A slot is a number specifying how many children it contains. This count could also be used to model missing ( That's why I took another look at Roslyn and the way they implement it is that
About trivia: #1720
I believe the approach taken by Roslyn is beneficial besides just getting absolute offsets. Grouping fields with many children in a
A |
I think I finally got my head around how this could work:
One thing left to verify is if |
I created a "discussion" to discuss the changes around storing children in fixed slots: #1721
The parser architecture changes have been completed. We can track the formatter in its own task (#1726).
Description
Last week, we spent our time discussing a number of different architectural changes that we would like to make to the parser and CST/AST. We also talked to some of the maintainers of RSLint about how we would like to work together in the future.
We've already started working on these changes in our locally checked-in fork of the RSLint parser. In the coming weeks and months we're going to collaborate to ensure that the needs of both projects are met, and in the future Rome will export its own crate to be used in RSLint, with collaborators from both projects.
cc @RDambrosio016 @Stupremee
Store leading/trailing trivia for each token
We explored two options for storing trivia such as spaces, tabs, newlines, and single-line and multi-line comments. Let's use the following example to illustrate how RSLint works today and how we want to store trivia going forward.
RSLint stores the trivia as a token of the kind `Whitespace` or `Comment` and attaches it to its enclosing node. The CST for the array example would look like this (whitespace and comments are highlighted):
There are a number of concerns that we have about this design that we'd like to resolve:
- `O(1)` lookups for child nodes. The problem is that an unspecified number of trivia may exist between any two non-trivia elements. Normally, the `condition` of an if statement is the 3rd element: 1st: `if` keyword, 2nd: whitespace, 3rd: `condition`. However, this isn't guaranteed, as `if \n // some comment\ncondition\n...` illustrates, where the `condition` is the 5th element because of the added comment that is followed by a `\n`.
- `comments` for a node or token can't be accessed using the AST facade. Resolving the comments would require additional helpers that look up the comments in the parent's children.
That's why we believe that storing leading/trailing trivia on the token is better for our use case. We would use the same rule as Roslyn and Swift to decide to which token a trivia belongs:
Applying these rules to our initial array example creates the following CST:
We should be mindful of:
- `AstNode` API. For example, the array expression in RSLint contains the expressions for the elements as well as the commas separating them. The problem is that it isn't possible to access a specific comma. The CST structure should, therefore, be changed so that elements are wrapped by an `ArrayElement` node that stores the expression together with an optional trailing comma (`array[element[literal, comma], element[literal, comma]]` instead of `array[literal, comma, literal, comma]`).
Allow missing children
RSLint already supports missing children, but the decision is listed here to be explicit that this is the behaviour we intend. Let's use the following example where the body of the `else` branch is missing.
There are two options to parse the if statement:
- Parse it as an `IfStmt` and leave the `alt` branch missing.
- Parse the `IfStmt` into an `UnknownStmt` (or `Error`) because it misses some mandatory children. The `UnknownStmt` is a simple wrapper node that gives access to its children but otherwise gives no more guarantees about its structure. Parsing this as an `UnknownStmt` has the benefit that it's known that all children are present if you get any other `Stmt`, but comes at the cost that the parser discards the information that this almost certainly is an `if` statement. Handling the `UnknownStmt` in a meaningful way inside of a CST pass would require re-inferring that the `UnknownStmt` is very likely an `IfStmt`.
We believe that parsing the example as an `IfStmt` and leaving the `alt` branch missing is better for our use case because it conveys the most information about the source code, allowing us to provide better guidance. Analyses that depend on the presence of the `alt` branch can manually test for its existence.
Use fixed slots to store the children
RSLint and rust-analyzer only store nodes or tokens if they are present in the source file. Let's take the following example of a `for` statement that only has a condition but misses the `init` and `update` nodes.
RSLint generates the following CST (ignoring trivia):
The downside of not storing missing (even optional) nodes is that accessing a particular field requires iterating over the children of a node to find its position. For example, the `for-test` is the 4th element if the for-loop has no `for-init` but is the 5th element otherwise.
The CST can guarantee fixed positions of its children if missing optional or mandatory children are marked as missing inside its children collection. There are multiple ways to do so; some options are:
- `None` for missing children
Storing missing elements increases memory consumption because the data structure must encode whether an element is present or missing. We explored different solutions, including bitflags, to mark present and missing elements to keep the overhead to a minimum.
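A fixed-slot layout with `None` for missing children could look roughly like the following Rust sketch. The `Kind`, `Node`, and `ForStmt` types and the slot constants are hypothetical; they only illustrate the idea and do not reflect RSLint's or Rome's actual data structures, and the sketch makes no attempt to reduce the memory overhead.

```rust
/// Hypothetical syntax kinds, just enough for a `for` statement.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    ForKw,
    LParen,
    RParen,
    Semicolon,
    Expr,
    Block,
    ForStmt,
}

#[derive(Debug, Clone, PartialEq)]
struct Node {
    kind: Kind,
    /// One entry per grammar slot; `None` marks a missing child.
    slots: Vec<Option<Node>>,
}

impl Node {
    fn leaf(kind: Kind) -> Node {
        Node { kind, slots: Vec::new() }
    }
}

// Slot layout assumed for `for ( init ; test ; update ) body`.
const FOR_INIT: usize = 2;
const FOR_TEST: usize = 4;
const FOR_UPDATE: usize = 6;

/// Typed wrapper whose accessors index directly into the slots.
struct ForStmt(Node);

impl ForStmt {
    fn child(&self, slot: usize) -> Option<&Node> {
        // O(1): no scan over the preceding children is needed.
        self.0.slots.get(slot)?.as_ref()
    }
    fn init(&self) -> Option<&Node> {
        self.child(FOR_INIT)
    }
    fn test(&self) -> Option<&Node> {
        self.child(FOR_TEST)
    }
    fn update(&self) -> Option<&Node> {
        self.child(FOR_UPDATE)
    }
}

/// Builds the CST for a loop that only has a test, e.g. `for (; i < 10;) {}`.
fn for_with_test_only() -> ForStmt {
    ForStmt(Node {
        kind: Kind::ForStmt,
        slots: vec![
            Some(Node::leaf(Kind::ForKw)),
            Some(Node::leaf(Kind::LParen)),
            None, // init is missing but keeps its slot
            Some(Node::leaf(Kind::Semicolon)),
            Some(Node::leaf(Kind::Expr)),
            Some(Node::leaf(Kind::Semicolon)),
            None, // update is missing but keeps its slot
            Some(Node::leaf(Kind::RParen)),
            Some(Node::leaf(Kind::Block)),
        ],
    })
}
```

Because the missing `init` still occupies its slot, the test expression sits at the same index whether or not an initializer was written.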
AST Façade
RSLint's and rust-analyzer's child accessors defined on the AST nodes return `Option<TNode>`. For example, the `IfStmt` child accessors are defined as follows (implementations omitted for brevity):
We identified that such an API has two major shortcomings for our use case. Let's explore these using the following example:
RSLint creates the following CST for this script:
The `IfStmt` has one right parenthesis too many. RSLint excellently recovers from this error by inserting an `Error` in the place of the `if`'s consequence. The problem with the API is that accessing `if_stmt.cons()` returns `None` because `Error` isn't a `Stmt`. This has the advantage that a linter won't bother trying to dive into the body, but a formatter needs the `Error` node to format its content (even if that just means printing it as it was in the original source).
Because of this, we plan to make the following two changes to the facade:
- Return `Result<TNode, MissingElementError>` from the accessors of mandatory children (the name and shape of the error are still up for discussion). Changing the return type to `Result` encodes that this is a mandatory child and guides the user to handle a missing child appropriately.
- The parser inserts `Error` elements to recover from syntax errors. For example, the parser may create an `Error` element for an unknown statement. That's why the `Stmt` union should be extended to include a `Stmt::Unknown` case. Adding the `Unknown` case allows the facade to return an `Error` element in places of `Stmt`s because it's now part of the `Stmt` union.
Designing the grammar requires finding the right balance between recovery granularity and API ergonomics. For instance, the parser must generate an `Error` statement for `--;` if the grammar only allows errors on the `Stmt` level, but it can parse `expression_statement(error)` if errors are also allowed in places of expressions. However, allowing errors in too many places increases the complexity of writing CST passes because they must be handled by the authors.
The places where a parser wants to insert errors to recover from syntax errors are language-specific.
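To make the two facade changes concrete, here is a minimal Rust sketch. It is not the definition used by RSLint or Rome: the field layout, the `String` payloads, and the `print_cons` helper are assumptions made for illustration, the condition is left out for brevity, and `MissingElementError` only borrows the provisional name mentioned above.

```rust
/// Returned when a mandatory child is absent (provisional name).
#[derive(Debug, Clone, PartialEq)]
struct MissingElementError;

#[derive(Debug, Clone, PartialEq)]
enum Stmt {
    /// An expression statement, reduced to its source text for brevity.
    Expr(String),
    If(Box<IfStmt>),
    /// Error-recovery node; keeps the unparsed source text.
    Unknown(String),
}

#[derive(Debug, Clone, PartialEq)]
struct IfStmt {
    cons: Option<Stmt>,
    alt: Option<Stmt>,
}

impl IfStmt {
    /// Mandatory child: absence is an error the caller has to handle.
    fn cons(&self) -> Result<&Stmt, MissingElementError> {
        self.cons.as_ref().ok_or(MissingElementError)
    }

    /// Optional child: absence is a normal outcome.
    fn alt(&self) -> Option<&Stmt> {
        self.alt.as_ref()
    }
}

/// A formatter-like pass: `?` bails out only if the mandatory child is
/// missing, while an unknown statement is printed as it was written.
fn print_cons(if_stmt: &IfStmt) -> Result<String, MissingElementError> {
    match if_stmt.cons()? {
        Stmt::Unknown(source) => Ok(source.clone()),
        Stmt::Expr(expr) => Ok(format!("{};", expr)),
        Stmt::If(_) => Ok("if (...) ...".to_string()),
    }
}
```

A linter can skip the `Stmt::Unknown` arm, whereas a formatter can still reach the recovered text, which is the case the `Option`-based accessor hides.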
- Split the `Error` syntax kind into different `Error` kinds, for example, `UnknownStmt`, `UnknownExpr`, and so on. Splitting the `Error`s into different kinds makes it possible to query for a specific `Unknown` kind and to implement custom behaviour per kind, either with methods or by implementing the same trait differently for each kind.
The definition of the `IfStmt` and `Stmt` changes as follows:
Considered Alternatives
Own try type
We decided against rolling our own `Try` type because we don't want to depend on nightly language features for such a central piece of our architecture.
Using `Result<Option<T>>` for mandatory children
We decided to not use this approach because:
- Some passes want to process `Error` elements and only want to break out of the operation if a mandatory node is missing. These passes can't use the `try` operator on `Result<T, MandatoryElementError>` because it would break if the element is missing or if it's an error element. They instead must match on the `Result` and handle all three cases, leading to more verbose code.
  You could argue that the return type for mandatory fields (and optional fields?) should be `Option<Result<T, ErrorNode>>` instead. Doing so may mitigate this problem, but the `Option<Result>` nesting makes it harder to reason about what a given use of the `try` operator tests: is it testing for a missing node or for whether the node is correct?
  That's why we want to use different `Unknown*` elements. It avoids the nesting and has the additional benefit that the parent node is "correct", as far as it's concerned, even if, for example, the `cons` contains an `UnknownStmt`. The code working with the `cons` statement can handle the `UnknownStmt` case.
- Users must handle `Error`s even in places where the parser won't ever insert an error. For example, the parser will never insert an `Error` in the `ElseClause` because it's either absent (in which case the `Error` is attached to the next statement) or present if there's an `else` token (in which case the `Error` is attached to the `alt`). It would be possible to change our syntax generator to take this into consideration, but that would increase the complexity of the source code generator, and adding `UnknownStmt` to the `Stmt` union encodes the same information.
- It's harder to see what the `try` operator tests for when working on optional nodes because they use a nested `Result<Option>`. For example, `if_stmt.else_clause()?` means that the `else_clause` contains no error node, but another `?` is required to get the actual node.
- Care is needed to ensure that `Error`s are not accidentally filtered out or implicitly skipped. For example, would `Error`s be returned when iterating over all `Stmt`s in a program if that specific `Error` appears in the position of a `Stmt`?
The main advantage of this approach is that the parser has more flexibility when it comes to error recovery because it is free to insert `Error` elements in any place. With the `Unknown*` design, this gap can be narrowed somewhat by introducing new `Unknown` nodes where better error recovery matters.
A second advantage is that there are two different types:
- `Stmt` represents a valid statement
- `Result<Stmt, ErrorNode>` represents a potentially invalid statement
Having two different types can help to do the error handling in a central place and then assume valid statements in the rest of the code.
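The objection to the nested return type is easier to see in code. The following Rust sketch models the rejected alternative under assumed shapes: `ErrorNode`, `ElseClause`, the field layout, and `alt_text` are hypothetical names chosen for illustration only.

```rust
/// Placeholder for a syntax error node (hypothetical).
#[derive(Debug, PartialEq)]
struct ErrorNode;

#[derive(Debug, PartialEq)]
struct ElseClause {
    alt: String,
}

struct IfStmt {
    /// Three states: an error node, an absent clause, or a present clause.
    else_clause: Result<Option<ElseClause>, ErrorNode>,
}

impl IfStmt {
    fn else_clause(&self) -> Result<Option<&ElseClause>, &ErrorNode> {
        self.else_clause.as_ref().map(|clause| clause.as_ref())
    }
}

/// Returns the text of the `else` branch, if there is one.
fn alt_text(if_stmt: &IfStmt) -> Result<Option<&str>, &ErrorNode> {
    // The first `?` only proves that the slot holds no error node...
    let clause: Option<&ElseClause> = if_stmt.else_clause()?;
    // ...so a second step is still needed to deal with absence.
    Ok(clause.map(|c| c.alt.as_str()))
}
```

Every caller has to keep both layers in mind, which is the extra reasoning the `Unknown*` variants are meant to avoid.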
Insert missing mandatory children
This is the approach taken by Roslyn. It inserts any missing child into the AST but marks it as missing and uses explicit `Unknown*` nodes.
The main advantage of this approach is that the accessors for mandatory children don't have to return a `Result` or an `Option`.
We decided against this approach because users must know that they may have to explicitly check whether the node was missing. The API doesn't guide them to handle the missing case and decide what the appropriate behaviour is.
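For comparison, a Roslyn-style facade might look like the following Rust sketch. Roslyn exposes an `IsMissing` property on its nodes and tokens; everything else here (`Stmt`, `IfStmt`, `parse_cons`, and the `String` payload) is a hypothetical simplification.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Stmt {
    text: String,
    /// Set by the parser when it had to invent this node.
    is_missing: bool,
}

struct IfStmt {
    cons: Stmt,
}

impl IfStmt {
    /// Always returns a node; nothing in the signature hints that it may
    /// be a placeholder.
    fn cons(&self) -> &Stmt {
        &self.cons
    }
}

/// Stand-in for the parser: inserts a zero-width placeholder when the
/// consequence is absent from the source.
fn parse_cons(source: Option<&str>) -> Stmt {
    match source {
        Some(text) => Stmt { text: text.to_string(), is_missing: false },
        None => Stmt { text: String::new(), is_missing: true },
    }
}
```

A caller that forgets to check `is_missing` compiles and runs without complaint, which is the lack of guidance described above.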
Resources
Action Items
PS -- @Stupremee and @RDambrosio016 we're happy to talk more about these changes, these are just where we landed in our own discussions, but we'd really value any feedback you could provide.