Skip to content
This repository has been archived by the owner on Aug 31, 2023. It is now read-only.

Attaching trivia to GreenToken #1738

Closed
wants to merge 17 commits into from
Closed

Conversation

xunilrj
Copy link
Contributor

@xunilrj xunilrj commented Nov 1, 2021

part of: #1720

Tasks

  • GreenToken will have an enum with the trivial case and a fallback using Vec.
  • SyntaxTrivia will wrap GreenTokenTrivia
  • Change PartialEq and Hash implementations for the *Token structs to consider the trivia.
  • Change Display and Debug implementations to print trivia.
  • Improve Display and Debug
  • Change the NodeCache to respect trivia (not fixed by changing Hash implementation)
  • Change LossyTreeSink
  • Change LosslessTreeSink
  • Fix GreenToken slice to contains leading, token and trailing trivia. In one slice. Otherwise the Syntax layer end up with all offsets wrong.
  • Fix doctests
  • Improve test_trivia_attached_to_tokens

Debug

Decisions

  • Now the token "slice" contains all its trivia.
  • To trim this trivia we have to ask leading().text_len() and trailing().text_len() and computer where the token text is.
  • Formatter is not working because we must manually insert trailing space between tokens. Maybe we need a couple of helper methods for this.

How to test

> cargo test -p rslint_parser
> cargo xtask coverage
...
│ Tests ran │ PassedFailedPanicsCoverage │
├───────────┼────────┼────────┼────────┼──────────┤
│   1760416767836195.25   │
...

Coverage is the same as before.


let trailing = self.get_trivia(true);
let leading = self.get_trivia(false);
let leading = std::mem::replace(&mut self.next_token_leading_trivia, leading);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly are we doing here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at line 148, "self.next_token_leading_trivia" has the leading trivia for the current token. I will start collecting the trailing trivia, and the leading for the next token.
The replace swaps the "current leading trivia" with the "new leading trivia".

Maybe it will be more clear if I process the current leading, and update the vec after this.

@xunilrj xunilrj mentioned this pull request Nov 2, 2021
4 tasks
@xunilrj xunilrj changed the title Trivia in tokens Attaching trivia to GreenNodes Nov 2, 2021
@xunilrj xunilrj changed the title Attaching trivia to GreenNodes Attaching trivia to GreenToken Nov 3, 2021
Copy link
Contributor

@yassere yassere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't get to this sooner. I left some comments, but I don't think the memory-related concerns need to be addressed before this PR merges.

Comment on lines 63 to 38
pub enum GreenTokenTrivia {
None,
One(ThinTrivia),
Many(Vec<ThinTrivia>),
}
Copy link
Contributor

@yassere yassere Nov 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does make sense to try to optimize for the common case of 0 or 1 trivia (e.g. a token with a single leading space and no trailing trivia), but I don't think this approach accomplishes that. The size of this enum is 32 bytes, which is even more than a vector alone would be.

A vector is probably not the right approach either since it uses 8 bytes to store its capacity, which we don't need in an immutable data structure.

Maybe GreenTokenTrivia could be something like ThinArc<TriviaHead, TriviaToken> that's used as an Option<GreenTokenTrivia>?
That would put it at 8 bytes and still allow checking for is_none() with no indirection.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @yassere that we probably haven't found the right balance yet. I also spent some more time thinking about a possible representation for trivia and start to like Roslyn's approach of heal allocating trivia and caching them more and more:

  • TriviaCollection: ThinArc<TriviaCollectionHead, GreenTrivia>. Heap allocated collection storing the leading or trailing trivia of a token. We cache the collections to reduce heap-allocations and memory usage.
  • GreenTrivia: ThinArc<TriviaKind, u8> (very similar to a token): Stores a single trivia with its string

I do like this approach because it finds a good balance between total memory consumption and the number of heap allocations. My assumption is that the leading/trailing trivia can be shared across many nodes. Therefore, caching the collections significantly reduces the number of heap allocations and total memory consumption, because the collection can be shared across a single file, even projects.

Having two pointers increases the token size only by 2 words (16) bytes which is less than the cost of a vector.

I'm not sure if it's worth wrapping the pointer in an option (but it probably doesn't matter because we don't pay any size overhead) because every token at least has a leading trivia and storing a pointer to a shared empty collection is little overhead.

I also agree with @yassere that we don't need to make these changes in this PR but we should follow up on the data structure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

every token at least has a leading trivia and storing a pointer to a shared empty collection is little overhead.

Storing a thin pointer to a shared empty collection means that you have to dereference the pointer to know that the collection is empty. You mentioned avoiding Option<Vec>>, but I think that's different. An owned vector can tell it's empty without any indirection because it stores its length independently from its slice pointer.

This might not actually matter in practice, but it could be worth taking advantage of null-pointer optimization to be able to identify the lack of leading/trailing trivia without following a pointer and without increasing mem size.

It's true that most tokens have leading trivia, but not always. The file console.log("Hello World!"); contains 7 tokens all without any trivia.

Comment on lines +96 to +55
leading_trivia: GreenTokenTrivia,
trailing_trivia: GreenTokenTrivia,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This increases the size of GreenTokenHead from 2 bytes to 72 bytes, which I think is too much to add to every single green token. Reducing the size of GreenTokenTrivia would help here, but we may need to consider further optimizations in the future.

@@ -101,24 +193,76 @@ impl GreenTokenData {
unsafe { std::str::from_utf8_unchecked(self.data.slice()) }
}

/// Text of this Token with trivia.
/// TODO do we need String here?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should definitely try to avoid creating a String here. Ideally, we could return something like SyntaxText which I think allows for text comparisons without allocating anything (and has to_string() for when it's necessary), but I haven't looked very closely at how it's implemented.

Comment on lines 17 to 29
pub enum Trivia {
Whitespace(usize),
Comment(usize),
}

impl Trivia {
pub fn as_thin(self, text: &str) -> ThinTrivia {
let ptr = ThinArc::from_header_and_iter(self, text.bytes());
ThinTrivia(ptr)
}

fn text_len(&self) -> TextSize {
match self {
Trivia::Whitespace(n) => (*n as u32).into(),
Trivia::Comment(n) => (*n as u32).into(),
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify how this Trivia enum is intended to be used? It feels odd that you can construct a Comment(42) with no text data. And that you can then call as_thin on it with text of any length. Also, the HeaderSlice inside a ThinArc already has a length field, so what's the advantage of storing a separate length inside the header field?

I also think that the type used internally as a header probably shouldn't be directly exposed.

@xunilrj xunilrj force-pushed the trivia_in_tokens branch 2 times, most recently from 7bbe40e to 28a4e22 Compare November 9, 2021 08:32
@xunilrj
Copy link
Contributor Author

xunilrj commented Nov 18, 2021

I created three other PR's that are more easy to understand. See #1720.

@xunilrj xunilrj closed this Nov 18, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants