RFC: transition to Roslyn's model for trivia #6584

matklad · 2020-11-17T17:41:28Z

Lossless syntax trees (rowan), by definition, need to represent whitespace and comments (trivai).

There are two approaches how to represent them in the syntax tree (1 + 1 in the example):

Attach to nodes:

[email protected]
  [email protected]
    [email protected] "1"
  [email protected] " "
  [email protected] "+"
  [email protected] " "
  [email protected]
    [email protected] "1"

Attach to tokens:

[email protected]
  [email protected]
    [email protected] "1"
  [email protected] "+", leading trivia: " "
  [email protected]
    [email protected] "1", leading trivia: " "

The first approach is what's used by rust-analyzer today by IntelliJ. Here, we attach trivia which sits between two nodes to the parent node.

The second approach is what's used by Roslyn & Swift' libsyntax. It's a bit more hacky -- a trivia is attached to a following non-trivia token. That is, each token conceptually stores a leading_trivia: Vec<Trivia> (but of course the encoding is optimized for common cases like single whitespace between tokens). See this doc for a more thorough description.

Why go with this strange hack with "fat" tokens? There are several benefits to it:

fixed structure of the syntax tree. With floating tokens, any node can have /any/ /number/ /of/ /children/. If we attached trivial to tokens, we can classify nodes into two buckets:
- those that have a fixed number of (potentially missing) children (like struct decl)
- those that have any number of children of the same type (like struct's list of fields)
  This in turn gives us O(1) access to a specific child
better programming model. I hypothesize that having trivia attached to nodes makes certain refactors to "just work". Specifically, type-safe modifications are naturally trivia-preserving. For example, if you have two blocks, and you want to append the content of the first block to the second one, you can do roughly:
```
for stmt in b1.stmts() { b2 = b2.append_stmt(stmt); }
```
this works with attached trivia, but, with floating trivia, you'd need to transfer trivia nodes explicitly, or resort to an untyped API. Note that this is a hypothesis: I haven't worked with Roslyn-style API closely, so I don't know how important is it in practice
better performance. This also is hypothetical, but, with token interning, storing trivia inside tokens probably won't increase the overall storage for tokens that much. However, we'd spend 2x less memory on storing pointers to tokens, because roughly half of the tokens are trivia.

I think I lean towards trying the Roslyn trivia model -- it seems like it can be better long term. I wish we can experiment with this in a simple way though :-(

The text was updated successfully, but these errors were encountered:

Diggsey · 2020-11-17T18:12:36Z

"fun"

🤨

Presumably the downside is that it might be easier to lose the "trivia" with modifications that replace tokens? Or there might be cases (like comments at the end of a line) where it would be more natural for the trivia to attach to the prior token.

matklad · 2020-11-17T22:16:23Z

Presumably the downside is that it might be easier to lose the "trivia" with modifications that replace tokens?

Yeah, the might be! For example, I expect that sometimes one wants to lose the spacing. For example, when doing "inline variable", the space around expression should be removed. Here's an example of similar wraningling from Roslyn: https://github.com/dotnet/roslyn/blob/4d8cada2584dd2e58b391c32c84f10f29442fdde/src/Analyzers/CSharp/CodeFixes/InlineDeclaration/CSharpInlineDeclarationCodeFixProvider.cs#L165-L174

(aside: if you think that IDEs are like math, where you juggle recursive ADTs with finesse and precision, you are wrong: every time you need to touch a CST, it's just a giant pile of special cases within special cases)

Or there might be cases (like comments at the end of a line) where it would be more natural for the trivia to attach to the prior token.

The swift documents goes into a bit more of details how this is represented. Tokens actually have a leading and traling trivial, and a run of whitespace can actually be split into two trivias. Generally, tokens include indentation before and newline after. The trivia between tokens on the same line is split more or less arbitrary (it is attached to the following token, but I think attaching to the previous would work as well).

For trivia just at the end of file, there's a clever trick: each file ends with a (zero-width) EOF token, so it's a natural anchor for everyting that's left.

kjeremy · 2020-11-18T14:35:03Z

I really like the shallower trees where the meat of a refactor is easier to get at. I find it a little awkward that trivia can be attached to both ends (I think that might make certain operations more awkward) but maybe that's not a big deal in practice if you stick to a convention like Swift's. Though now that I'm rereading some of the swift docs I suspect a lot of refactoring may either be ignoring trivia or zeroing it out at both ends and reconstructing.

vemoo · 2020-12-20T15:43:04Z

Would attaching it to tokens mean that rowan has to be changed to be trivia aware?

ShuiRuTian · 2021-02-03T09:34:06Z

If I want to do this, very basically I could image these things need to be done:

in rowan
MUST:

add greenTrivia
add redTrivia(SyntaxTrivia)
add trivia structure to node and token and elment
add greenChild to token, which could have trivia as children now.
- so we also need add trivia to greenChild and related work.
add a way for node and token and elment to get leading trivia and trailing trivia (a function or a trait?)
- In Roslyn, it has GetLeadingTriviaCore and GetLeadingTrivia. I am not sure why it does so.

NOT SURE:

change text and text_len, and give variant functions to give "Full Text"(including trivias), "Full text" length.

Not related, but cool:

Roslyn add DiagnosticInfo on green node. This is impressive. I am not sure, but I think if a green node adds any diagnostic info, it should become a new one. This would incluence a lot, How does Roslyn handle this?

in RA
Well, this is part the most I want to ask.

PR itself
- The modification need to have two PR? one in rowan and one in RA, and be merged at the same time?
parser,
- How do we cache the green node today? Do we need to cache trivia differently?
- there must be other questions I miss now....
Red nodes,
- how does the concentrate red node generated?
- Rust does not have reflection, then how to get correct type when in compiling? A little amazing!
Else
- Where the syntax tree is consumed?

Sorry for asking a lot questions, have a overview would really be helpful. And maybe I fotgot some key point?
And do not need to answer all things them in detail(of course it would be sweet to give detailed answer), pointing out where the code places would also be appreciated a lot.

Thanks!

matklad · 2021-04-06T11:20:51Z

@ShuiRuTian sorry, I've missed your comment :( I don't think we are quite ready to do this, as I am still on the fence. I don't have a very detailed plan for how we'd do this if we want, I'd probably just start hacking rowan.

matklad · 2021-04-06T11:23:38Z

another case from today:

        let workspace_build_data = match self.fetch_build_data_queue.last_op_result() {
            Some(Ok(it)) => Some(it.clone()),

            None => None,
  $0
            Some(Err(err)) => None,
        };

In Roslyn model of trivia / swifts model of punctuation, the match arms completely cover the match arm list, so the cursor is considered to be on a match arm.

In our current model, the cursor is considered to be between them, so we don't trigger "merge match arms" assist.

oxalica · 2022-12-04T18:23:16Z

Is there anything going on with this RFC in rowan? I'd like to have this, which could simplify trivia handling in many cases.

MichaReiser · 2022-12-08T11:07:02Z

Is there anything going on with this RFC in rowan? I'd like to have this, which could simplify trivia handling in many cases.

You can try Rome's fork of rowan. It implements Roslyn's trivia handling.

matklad added C-Architecture Big architectural things which we need to figure up-front (or suggestions for rewrites :0) ) E-hard fun A technically challenging issue with high impact labels Nov 17, 2020

lnicola added the S-actionable Someone could pick this issue up and work on it right now label Dec 21, 2020

ematipico mentioned this issue Oct 11, 2021

☂️ Weekly goals - week 41 rome/tools#1700

Closed

c42f mentioned this issue Dec 16, 2021

Trivia interface JuliaLang/JuliaSyntax.jl#2

Open

ChayimFriedman2 mentioned this issue Feb 26, 2022

Wrong inferred type in while and if with block conditions #11561

Closed

oxalica mentioned this issue Dec 4, 2022

Add add_to_top_level_lambda_param assist oxalica/nil#41

Merged

matklad mentioned this issue Dec 21, 2022

extend selection should include leading indentation helix-editor/helix#5231

Open

Veykril mentioned this issue Oct 4, 2023

Get rid of mutable syntax node usage #15710

Open

Veykril mentioned this issue Jan 10, 2024

fix: Preserve comments for extracted block expression in 'extract_function' #16333

Merged

Veykril mentioned this issue Jun 25, 2024

A Plan for Making Rust Analyzer Faster #17491

Open

victor-pogor mentioned this issue Jun 26, 2024

Implement lexer sunt-programator/off-dotnet#339

Closed

Veykril mentioned this issue Sep 30, 2024

Support macro None groupings #18211

Open

dlight mentioned this issue Jan 9, 2025

Empty lines and whitespace are not preserved Byron/pulldown-cmark-to-cmark#94

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: transition to Roslyn's model for trivia #6584

RFC: transition to Roslyn's model for trivia #6584

matklad commented Nov 17, 2020

Diggsey commented Nov 17, 2020

matklad commented Nov 17, 2020

kjeremy commented Nov 18, 2020

vemoo commented Dec 20, 2020

ShuiRuTian commented Feb 3, 2021 •

edited

Loading

matklad commented Apr 6, 2021

matklad commented Apr 6, 2021

oxalica commented Dec 4, 2022

MichaReiser commented Dec 8, 2022

RFC: transition to Roslyn's model for trivia #6584

RFC: transition to Roslyn's model for trivia #6584

Comments

matklad commented Nov 17, 2020

Diggsey commented Nov 17, 2020

matklad commented Nov 17, 2020

kjeremy commented Nov 18, 2020

vemoo commented Dec 20, 2020

ShuiRuTian commented Feb 3, 2021 • edited Loading

matklad commented Apr 6, 2021

matklad commented Apr 6, 2021

oxalica commented Dec 4, 2022

MichaReiser commented Dec 8, 2022

ShuiRuTian commented Feb 3, 2021 •

edited

Loading