Attaching trivia to GreenToken #1738

xunilrj · 2021-11-01T16:16:00Z

part of: #1720

Tasks

Debug

  [email protected]
    " " [email protected] "let" " "         <- Range is bigger than "let" because we need the trivia now
    [email protected]
      [email protected]
        [email protected]
           [email protected] "a" " "
       [email protected] "=" " "
      [email protected]
         [email protected] "1"
     [email protected] ";"

Decisions

Now the token "slice" contains all of its trivia.
To trim this trivia we have to ask for leading().text_len() and trailing().text_len() and compute where the token text is.
The formatter is not working because we must manually insert the trailing space between tokens. Maybe we need a couple of helper methods for this.
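
As an illustration of the trimming step described above, here is a minimal sketch. The helper name trim_trivia is hypothetical and not part of this PR; it assumes the leading and trailing trivia lengths are byte lengths into the full token slice.

```rust
/// Hypothetical helper (not from the PR): returns the token text without its
/// attached trivia, given the full slice and the byte lengths of the leading
/// and trailing trivia. Returns `None` if the lengths do not fit the slice.
fn trim_trivia(full: &str, leading_len: usize, trailing_len: usize) -> Option<&str> {
    // The token text ends where the trailing trivia starts.
    let end = full.len().checked_sub(trailing_len)?;
    // `str::get` rejects inverted ranges and non-char-boundary offsets.
    full.get(leading_len..end)
}
```

For the "let" token in the Debug output, whose slice is " let " with one leading and one trailing space, trim_trivia(" let ", 1, 1) yields the bare keyword.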

How to test

> cargo test -p rslint_parser
> cargo xtask coverage
...
│ Tests ran │ Passed │ Failed │ Panics │ Coverage │
├───────────┼────────┼────────┼────────┼──────────┤
│   17604   │ 16767  │  836   │   1    │  95.25   │
...

Coverage is the same as before.

crates/rome_rowan/src/green/token.rs

crates/rome_rowan/src/green/builder.rs

crates/rome_rowan/src/green/token.rs

crates/rslint_lexer/src/lib.rs

ematipico · 2021-11-01T17:39:19Z

crates/rslint_parser/src/lossless_tree_sink.rs

+
+		let trailing = self.get_trivia(true);
+		let leading = self.get_trivia(false);
+		let leading = std::mem::replace(&mut self.next_token_leading_trivia, leading);


What exactly are we doing here?

xunilrj · 2021-11-01

At line 148, self.next_token_leading_trivia holds the leading trivia for the current token. From there I start collecting the trailing trivia for the current token and the leading trivia for the next token.
The replace swaps the "current leading trivia" with the "new leading trivia".

Maybe it will be clearer if I process the current leading trivia first and update the vec after that.
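
The swap discussed here can be sketched in isolation. This is only an illustration: the Sink struct and the swap_leading method are hypothetical, and Vec<String> is an assumed stand-in for whatever trivia type the sink really stores. Only the field name next_token_leading_trivia comes from the diff.

```rust
use std::mem;

/// Hypothetical stand-in for the tree sink; only the field name
/// `next_token_leading_trivia` is taken from the diff.
struct Sink {
    next_token_leading_trivia: Vec<String>,
}

impl Sink {
    /// Stores the leading trivia collected for the *next* token and returns
    /// the leading trivia that was pending for the *current* token.
    fn swap_leading(&mut self, leading_for_next: Vec<String>) -> Vec<String> {
        // `mem::replace` writes the new value and returns the old one,
        // so nothing is cloned and the field is never left empty.
        mem::replace(&mut self.next_token_leading_trivia, leading_for_next)
    }
}
```

After the call, the returned value is what the current token should use as its leading trivia, and the field already holds the leading trivia for the token that follows.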

yassere

Sorry I didn't get to this sooner. I left some comments, but I don't think the memory-related concerns need to be addressed before this PR merges.

yassere · 2021-11-07T21:16:18Z

crates/rome_rowan/src/green/token.rs

+pub enum GreenTokenTrivia {
+	None,
+	One(ThinTrivia),
+	Many(Vec<ThinTrivia>),
+}


It does make sense to try to optimize for the common case of 0 or 1 trivia (e.g. a token with a single leading space and no trailing trivia), but I don't think this approach accomplishes that. The size of this enum is 32 bytes, which is even more than a vector alone would be.

A vector is probably not the right approach either since it uses 8 bytes to store its capacity, which we don't need in an immutable data structure.

Maybe GreenTokenTrivia could be something like ThinArc<TriviaHead, TriviaToken> that's used as an Option<GreenTokenTrivia>?
That would put it at 8 bytes and still allow checking for is_none() with no indirection.
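
To illustrate the size argument, here is a small sketch. ThinPtr is a hypothetical one-word pointer standing in for the suggested ThinArc-based type; it is not the rome_rowan type. It relies on Rust's null-pointer optimization for Option around a non-null pointer.

```rust
use std::mem::size_of;
use std::ptr::NonNull;

/// Hypothetical stand-in for a thin (one-word) pointer to heap-allocated trivia.
#[allow(dead_code)]
#[repr(transparent)]
struct ThinPtr(NonNull<u8>);

/// Size of the optional thin pointer. `None` reuses the null bit pattern,
/// so no separate discriminant is stored.
fn optional_thin_ptr_size() -> usize {
    size_of::<Option<ThinPtr>>()
}

/// Size of a vector of thin pointers, which also carries length and capacity.
fn vec_size() -> usize {
    size_of::<Vec<ThinPtr>>()
}
```

On a 64-bit target the optional pointer is one word (8 bytes), and checking is_none() only inspects that word without following the pointer.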

MichaReiser · 2021-11-08

I agree with @yassere that we probably haven't found the right balance yet. I also spent some more time thinking about a possible representation for trivia, and I am starting to like Roslyn's approach of heap-allocating trivia and caching them more and more:

TriviaCollection: ThinArc<TriviaCollectionHead, GreenTrivia>. Heap allocated collection storing the leading or trailing trivia of a token. We cache the collections to reduce heap-allocations and memory usage.

GreenTrivia: ThinArc<TriviaKind, u8> (very similar to a token): Stores a single trivia with its string

I do like this approach because it strikes a good balance between total memory consumption and the number of heap allocations. My assumption is that the same leading/trailing trivia can be shared across many nodes. Caching the collections therefore significantly reduces the number of heap allocations and the total memory consumption, because a collection can be shared across a single file or even across projects.

Having two pointers increases the token size by only 2 words (16 bytes), which is less than the cost of a vector.

I'm not sure it's worth wrapping the pointer in an Option (although it probably doesn't matter, because we don't pay any size overhead), because every token at least has a leading trivia and storing a pointer to a shared empty collection is little overhead.

I also agree with @yassere that we don't need to make these changes in this PR but we should follow up on the data structure.
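
The caching idea can be sketched as a simple interner. This is an assumption-laden illustration, not Roslyn's or rome_rowan's design: TriviaCache and intern are hypothetical names, and Arc<[String]> stands in for the proposed ThinArc-based TriviaCollection.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical cache that interns trivia collections so that identical
/// leading/trailing trivia share a single heap allocation.
#[derive(Default)]
struct TriviaCache {
    collections: HashMap<Vec<String>, Arc<[String]>>,
}

impl TriviaCache {
    /// Returns a shared collection for `pieces`, allocating only the first
    /// time a particular sequence of trivia is seen.
    fn intern(&mut self, pieces: Vec<String>) -> Arc<[String]> {
        if let Some(existing) = self.collections.get(&pieces) {
            // Cache hit: hand out another pointer to the same allocation.
            return Arc::clone(existing);
        }
        let shared: Arc<[String]> = Arc::from(pieces.clone());
        self.collections.insert(pieces, Arc::clone(&shared));
        shared
    }
}
```

With such a cache, every token whose trailing trivia is a single space would hold a pointer to the same collection, so each token only pays for the pointer.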

yassere · 2021-11-08

> every token at least has a leading trivia and storing a pointer to a shared empty collection is little overhead.

Storing a thin pointer to a shared empty collection means that you have to dereference the pointer to know that the collection is empty. You mentioned avoiding Option<Vec<T>>, but I think that's different. An owned vector can tell it's empty without any indirection because it stores its length independently from its slice pointer.

This might not actually matter in practice, but it could be worth taking advantage of the null-pointer optimization to identify the lack of leading/trailing trivia without following a pointer and without increasing the memory size.

It's true that most tokens have leading trivia, but not all of them do. The file console.log("Hello World!"); contains 7 tokens, all without any trivia.

yassere · 2021-11-07T21:16:28Z

crates/rome_rowan/src/green/token.rs

+	leading_trivia: GreenTokenTrivia,
+	trailing_trivia: GreenTokenTrivia,


This increases the size of GreenTokenHead from 2 bytes to 72 bytes, which I think is too much to add to every single green token. Reducing the size of GreenTokenTrivia would help here, but we may need to consider further optimizations in the future.
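
The 72-byte figure is consistent with alignment padding: 2 + 32 + 32 = 66 bytes, rounded up to a multiple of 8. A minimal sketch, assuming each trivia field occupies 32 bytes with 8-byte alignment and the kind takes 2 bytes; Trivia32 and Head are hypothetical stand-ins, not the PR's types.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for a 32-byte, 8-byte-aligned trivia value.
#[allow(dead_code)]
#[repr(align(8))]
struct Trivia32([u8; 32]);

/// Hypothetical stand-in for the token head after the change:
/// a 2-byte kind plus two trivia fields.
#[allow(dead_code)]
struct Head {
    kind: u16,
    leading_trivia: Trivia32,
    trailing_trivia: Trivia32,
}

/// Size of the head including the padding required by its 8-byte alignment.
fn head_size() -> usize {
    size_of::<Head>()
}
```

Shrinking each trivia field to a single pointer would bring the same layout down to 2 + 8 + 8 = 18 bytes, padded to 24.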

yassere · 2021-11-07T21:16:35Z

crates/rome_rowan/src/green/token.rs

@@ -101,24 +193,76 @@ impl GreenTokenData {
 		unsafe { std::str::from_utf8_unchecked(self.data.slice()) }
 	}

+	/// Text of this Token with trivia.
+	/// TODO do we need String here?


We should definitely try to avoid creating a String here. Ideally, we could return something like SyntaxText which I think allows for text comparisons without allocating anything (and has to_string() for when it's necessary), but I haven't looked very closely at how it's implemented.
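
To show what an allocation-free comparison could look like, here is a sketch. It is not how SyntaxText is implemented; chunks_eq is a hypothetical function that assumes the token text is available as separate leading, token, and trailing chunks.

```rust
/// Hypothetical helper: checks whether the concatenation of `chunks` equals
/// `expected` without building an intermediate `String`.
fn chunks_eq(chunks: &[&str], mut expected: &str) -> bool {
    for &chunk in chunks {
        // Consume the chunk from the front of `expected`, or fail fast.
        match expected.strip_prefix(chunk) {
            Some(rest) => expected = rest,
            None => return false,
        }
    }
    // Equal only if nothing of `expected` is left over.
    expected.is_empty()
}
```

A caller that really needs an owned string can still concatenate the chunks explicitly, which keeps the allocation visible at the call site.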

yassere · 2021-11-07T21:52:52Z

crates/rome_rowan/src/green/token.rs

+pub enum Trivia {
+	Whitespace(usize),
+	Comment(usize),
+}
+
+impl Trivia {
+	pub fn as_thin(self, text: &str) -> ThinTrivia {
+		let ptr = ThinArc::from_header_and_iter(self, text.bytes());
+		ThinTrivia(ptr)
+	}
+
+	fn text_len(&self) -> TextSize {
+		match self {
+			Trivia::Whitespace(n) => (*n as u32).into(),
+			Trivia::Comment(n) => (*n as u32).into(),
+		}
+	}
+}


Can you clarify how this Trivia enum is intended to be used? It feels odd that you can construct a Comment(42) with no text data, and that you can then call as_thin on it with text of any length. Also, the HeaderSlice inside a ThinArc already has a length field, so what's the advantage of storing a separate length inside the header field?

I also think that the type used internally as a header probably shouldn't be directly exposed.
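
One way to rule out the mismatch described above is to derive the length from the text instead of storing it separately. The following is only a sketch of that idea: TriviaKind and TriviaPiece are hypothetical names, and Arc<str> is an assumed stand-in for the ThinArc used in the PR.

```rust
use std::sync::Arc;

/// Hypothetical kind tag with no embedded length.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TriviaKind {
    Whitespace,
    Comment,
}

/// Hypothetical trivia value that always owns its text.
#[derive(Debug, Clone)]
struct TriviaPiece {
    kind: TriviaKind,
    text: Arc<str>,
}

impl TriviaPiece {
    /// The only constructor takes the text, so kind and text cannot disagree.
    fn new(kind: TriviaKind, text: &str) -> Self {
        TriviaPiece { kind, text: Arc::from(text) }
    }

    /// The length is read from the stored text rather than kept as a second copy.
    fn text_len(&self) -> usize {
        self.text.len()
    }
}
```

Keeping the fields private and exposing only the constructor and accessors would also address the point about not exposing the internal header type.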

xunilrj · 2021-11-18T14:27:11Z

I created three other PRs that are easier to understand. See #1720.

xunilrj requested review from ematipico and MichaReiser November 1, 2021 16:16

MichaReiser reviewed Nov 1, 2021

crates/rome_rowan/src/green/token.rs (outdated, resolved)

MichaReiser reviewed Nov 1, 2021

crates/rome_rowan/src/green/builder.rs (outdated, resolved)

MichaReiser reviewed Nov 1, 2021

crates/rome_rowan/src/green/token.rs (outdated, resolved)

ematipico reviewed Nov 1, 2021

xunilrj mentioned this pull request Nov 2, 2021

☂️ Attach trivia to tokens #1720

Closed

4 tasks

xunilrj changed the title ~~Trivia in tokens~~ Attaching trivia to GreenNodes Nov 2, 2021

xunilrj force-pushed the trivia_in_tokens branch from 104c031 to ca7633c Compare November 3, 2021 10:19

xunilrj changed the title ~~Attaching trivia to GreenNodes~~ Attaching trivia to GreenToken Nov 3, 2021

xunilrj force-pushed the trivia_in_tokens branch from ca7633c to c853d95 Compare November 4, 2021 20:45

yassere reviewed Nov 7, 2021

xunilrj force-pushed the trivia_in_tokens branch 2 times, most recently from 7bbe40e to 28a4e22 Compare November 9, 2021 08:32

MichaReiser mentioned this pull request Nov 10, 2021

Use fixed offsets for Script, Module, ... #1762

Closed

xunilrj added 15 commits November 11, 2021 22:36

starting trivia in tokens

7a77093

rslint_lexer breaking newline and space in two tokens

b1d80a7

renaming option to break trivia

3339c52

better treatment of CR

0d3d10b

removing printlns

b289967

draft verion of attached trivia

f386212

better SyntaxNode debug fmt

95997dc

rebasing and fixing trivia for private green node visibility

ae561ee

parser and lossless_tree_sink working

a396a5e

fixing token text and ranges methods

d711562

trim format_raw when formatter finds an error

50de7bf

rebase

5266c7e

fixing clippy issue

9cf919e

greentoken slice contains everything from first leading to last trail…

a34cd90

…ing trivia

fixing text and text_with_trivia and ranges

1977fbf

rebasing and adapting code to slots

8e328e6

xunilrj force-pushed the trivia_in_tokens branch from c967564 to 8e328e6 Compare November 11, 2021 22:53

fixing clippy issues

567a319

This was referenced Nov 12, 2021

trivia attached to GreenTokens #1783

Merged

typed api access to each piece of the trivia #1798

Merged

LosslessTreeSinker attaching trivia to tokens #1801

Merged

xunilrj closed this Nov 18, 2021
