Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: transition to Roslyn's model for trivia #6584

Open
matklad opened this issue Nov 17, 2020 · 9 comments
Open

RFC: transition to Roslyn's model for trivia #6584

matklad opened this issue Nov 17, 2020 · 9 comments
Labels
C-Architecture Big architectural things which we need to figure up-front (or suggestions for rewrites :0) ) E-hard fun A technically challenging issue with high impact S-actionable Someone could pick this issue up and work on it right now

Comments

@matklad
Copy link
Member

matklad commented Nov 17, 2020

Lossless syntax trees (rowan), by definition, need to represent whitespace and comments (trivai).

There are two approaches how to represent them in the syntax tree (1 + 1 in the example):

Attach to nodes:

Attach to tokens:

The first approach is what's used by rust-analyzer today by IntelliJ. Here, we attach trivia which sits between two nodes to the parent node.

The second approach is what's used by Roslyn & Swift' libsyntax. It's a bit more hacky -- a trivia is attached to a following non-trivia token. That is, each token conceptually stores a leading_trivia: Vec<Trivia> (but of course the encoding is optimized for common cases like single whitespace between tokens). See this doc for a more thorough description.

Why go with this strange hack with "fat" tokens? There are several benefits to it:

  • fixed structure of the syntax tree. With floating tokens, any node can have /any/ /number/ /of/ /children/. If we attached trivial to tokens, we can classify nodes into two buckets:

    • those that have a fixed number of (potentially missing) children (like struct decl)
    • those that have any number of children of the same type (like struct's list of fields)
      This in turn gives us O(1) access to a specific child
  • better programming model. I hypothesize that having trivia attached to nodes makes certain refactors to "just work". Specifically, type-safe modifications are naturally trivia-preserving. For example, if you have two blocks, and you want to append the content of the first block to the second one, you can do roughly:

    for stmt in b1.stmts() { b2 = b2.append_stmt(stmt); }

    this works with attached trivia, but, with floating trivia, you'd need to transfer trivia nodes explicitly, or resort to an untyped API. Note that this is a hypothesis: I haven't worked with Roslyn-style API closely, so I don't know how important is it in practice

  • better performance. This also is hypothetical, but, with token interning, storing trivia inside tokens probably won't increase the overall storage for tokens that much. However, we'd spend 2x less memory on storing pointers to tokens, because roughly half of the tokens are trivia.

I think I lean towards trying the Roslyn trivia model -- it seems like it can be better long term. I wish we can experiment with this in a simple way though :-(

@matklad matklad added C-Architecture Big architectural things which we need to figure up-front (or suggestions for rewrites :0) ) E-hard fun A technically challenging issue with high impact labels Nov 17, 2020
@Diggsey
Copy link

Diggsey commented Nov 17, 2020

"fun"

🤨

Presumably the downside is that it might be easier to lose the "trivia" with modifications that replace tokens? Or there might be cases (like comments at the end of a line) where it would be more natural for the trivia to attach to the prior token.

@matklad
Copy link
Member Author

matklad commented Nov 17, 2020

Presumably the downside is that it might be easier to lose the "trivia" with modifications that replace tokens?

Yeah, the might be! For example, I expect that sometimes one wants to lose the spacing. For example, when doing "inline variable", the space around expression should be removed. Here's an example of similar wraningling from Roslyn: https://github.com/dotnet/roslyn/blob/4d8cada2584dd2e58b391c32c84f10f29442fdde/src/Analyzers/CSharp/CodeFixes/InlineDeclaration/CSharpInlineDeclarationCodeFixProvider.cs#L165-L174

(aside: if you think that IDEs are like math, where you juggle recursive ADTs with finesse and precision, you are wrong: every time you need to touch a CST, it's just a giant pile of special cases within special cases)

Or there might be cases (like comments at the end of a line) where it would be more natural for the trivia to attach to the prior token.

The swift documents goes into a bit more of details how this is represented. Tokens actually have a leading and traling trivial, and a run of whitespace can actually be split into two trivias. Generally, tokens include indentation before and newline after. The trivia between tokens on the same line is split more or less arbitrary (it is attached to the following token, but I think attaching to the previous would work as well).

For trivia just at the end of file, there's a clever trick: each file ends with a (zero-width) EOF token, so it's a natural anchor for everyting that's left.

@kjeremy
Copy link
Contributor

kjeremy commented Nov 18, 2020

I really like the shallower trees where the meat of a refactor is easier to get at. I find it a little awkward that trivia can be attached to both ends (I think that might make certain operations more awkward) but maybe that's not a big deal in practice if you stick to a convention like Swift's. Though now that I'm rereading some of the swift docs I suspect a lot of refactoring may either be ignoring trivia or zeroing it out at both ends and reconstructing.

@vemoo
Copy link
Contributor

vemoo commented Dec 20, 2020

Would attaching it to tokens mean that rowan has to be changed to be trivia aware?

@lnicola lnicola added the S-actionable Someone could pick this issue up and work on it right now label Dec 21, 2020
@ShuiRuTian
Copy link
Contributor

ShuiRuTian commented Feb 3, 2021

If I want to do this, very basically I could image these things need to be done:

in rowan
MUST:

  • add greenTrivia
  • add redTrivia(SyntaxTrivia)
  • add trivia structure to node and token and elment
  • add greenChild to token, which could have trivia as children now.
    • so we also need add trivia to greenChild and related work.
  • add a way for node and token and elment to get leading trivia and trailing trivia (a function or a trait?)
    • In Roslyn, it has GetLeadingTriviaCore and GetLeadingTrivia. I am not sure why it does so.

NOT SURE:

  • change text and text_len, and give variant functions to give "Full Text"(including trivias), "Full text" length.

Not related, but cool:

  • Roslyn add DiagnosticInfo on green node. This is impressive. I am not sure, but I think if a green node adds any diagnostic info, it should become a new one. This would incluence a lot, How does Roslyn handle this?

in RA
Well, this is part the most I want to ask.

  1. PR itself

    • The modification need to have two PR? one in rowan and one in RA, and be merged at the same time?
  2. parser,

    • How do we cache the green node today? Do we need to cache trivia differently?
    • there must be other questions I miss now....
  3. Red nodes,

    • how does the concentrate red node generated?
    • Rust does not have reflection, then how to get correct type when in compiling? A little amazing!
  4. Else

    • Where the syntax tree is consumed?

Sorry for asking a lot questions, have a overview would really be helpful. And maybe I fotgot some key point?
And do not need to answer all things them in detail(of course it would be sweet to give detailed answer), pointing out where the code places would also be appreciated a lot.

Thanks!

@matklad
Copy link
Member Author

matklad commented Apr 6, 2021

@ShuiRuTian sorry, I've missed your comment :( I don't think we are quite ready to do this, as I am still on the fence. I don't have a very detailed plan for how we'd do this if we want, I'd probably just start hacking rowan.

@matklad
Copy link
Member Author

matklad commented Apr 6, 2021

another case from today:

        let workspace_build_data = match self.fetch_build_data_queue.last_op_result() {
            Some(Ok(it)) => Some(it.clone()),

            None => None,
  $0
            Some(Err(err)) => None,
        };

In Roslyn model of trivia / swifts model of punctuation, the match arms completely cover the match arm list, so the cursor is considered to be on a match arm.

In our current model, the cursor is considered to be between them, so we don't trigger "merge match arms" assist.

@oxalica
Copy link
Contributor

oxalica commented Dec 4, 2022

Is there anything going on with this RFC in rowan? I'd like to have this, which could simplify trivia handling in many cases.

@MichaReiser
Copy link

Is there anything going on with this RFC in rowan? I'd like to have this, which could simplify trivia handling in many cases.

You can try Rome's fork of rowan. It implements Roslyn's trivia handling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-Architecture Big architectural things which we need to figure up-front (or suggestions for rewrites :0) ) E-hard fun A technically challenging issue with high impact S-actionable Someone could pick this issue up and work on it right now
Projects
None yet
Development

No branches or pull requests

8 participants