-
Notifications
You must be signed in to change notification settings - Fork 656
Conversation
|
||
let trailing = self.get_trivia(true); | ||
let leading = self.get_trivia(false); | ||
let leading = std::mem::replace(&mut self.next_token_leading_trivia, leading); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly are we doing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at line 148, "self.next_token_leading_trivia" has the leading trivia for the current token. I will start collecting the trailing trivia, and the leading for the next token.
The replace swaps the "current leading trivia" with the "new leading trivia".
Maybe it will be more clear if I process the current leading, and update the vec after this.
104c031
to
ca7633c
Compare
ca7633c
to
c853d95
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I didn't get to this sooner. I left some comments, but I don't think the memory-related concerns need to be addressed before this PR merges.
crates/rome_rowan/src/green/token.rs
Outdated
pub enum GreenTokenTrivia { | ||
None, | ||
One(ThinTrivia), | ||
Many(Vec<ThinTrivia>), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does make sense to try to optimize for the common case of 0 or 1 trivia (e.g. a token with a single leading space and no trailing trivia), but I don't think this approach accomplishes that. The size of this enum is 32 bytes, which is even more than a vector alone would be.
A vector is probably not the right approach either since it uses 8 bytes to store its capacity, which we don't need in an immutable data structure.
Maybe GreenTokenTrivia
could be something like ThinArc<TriviaHead, TriviaToken>
that's used as an Option<GreenTokenTrivia>
?
That would put it at 8 bytes and still allow checking for is_none()
with no indirection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @yassere that we probably haven't found the right balance yet. I also spent some more time thinking about a possible representation for trivia and start to like Roslyn's approach of heal allocating trivia and caching them more and more:
TriviaCollection
:ThinArc<TriviaCollectionHead, GreenTrivia>
. Heap allocated collection storing the leading or trailing trivia of a token. We cache the collections to reduce heap-allocations and memory usage.GreenTrivia
:ThinArc<TriviaKind, u8>
(very similar to a token): Stores a single trivia with its string
I do like this approach because it finds a good balance between total memory consumption and the number of heap allocations. My assumption is that the leading/trailing trivia can be shared across many nodes. Therefore, caching the collections significantly reduces the number of heap allocations and total memory consumption, because the collection can be shared across a single file, even projects.
Having two pointers increases the token size only by 2 words (16) bytes which is less than the cost of a vector.
I'm not sure if it's worth wrapping the pointer in an option (but it probably doesn't matter because we don't pay any size overhead) because every token at least has a leading trivia and storing a pointer to a shared empty collection is little overhead.
I also agree with @yassere that we don't need to make these changes in this PR but we should follow up on the data structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
every token at least has a leading trivia and storing a pointer to a shared empty collection is little overhead.
Storing a thin pointer to a shared empty collection means that you have to dereference the pointer to know that the collection is empty. You mentioned avoiding Option<Vec>>
, but I think that's different. An owned vector can tell it's empty without any indirection because it stores its length independently from its slice pointer.
This might not actually matter in practice, but it could be worth taking advantage of null-pointer optimization to be able to identify the lack of leading/trailing trivia without following a pointer and without increasing mem size.
It's true that most tokens have leading trivia, but not always. The file console.log("Hello World!");
contains 7 tokens all without any trivia.
leading_trivia: GreenTokenTrivia, | ||
trailing_trivia: GreenTokenTrivia, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This increases the size of GreenTokenHead
from 2 bytes to 72 bytes, which I think is too much to add to every single green token. Reducing the size of GreenTokenTrivia
would help here, but we may need to consider further optimizations in the future.
crates/rome_rowan/src/green/token.rs
Outdated
@@ -101,24 +193,76 @@ impl GreenTokenData { | |||
unsafe { std::str::from_utf8_unchecked(self.data.slice()) } | |||
} | |||
|
|||
/// Text of this Token with trivia. | |||
/// TODO do we need String here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should definitely try to avoid creating a String
here. Ideally, we could return something like SyntaxText which I think allows for text comparisons without allocating anything (and has to_string()
for when it's necessary), but I haven't looked very closely at how it's implemented.
crates/rome_rowan/src/green/token.rs
Outdated
pub enum Trivia { | ||
Whitespace(usize), | ||
Comment(usize), | ||
} | ||
|
||
impl Trivia { | ||
pub fn as_thin(self, text: &str) -> ThinTrivia { | ||
let ptr = ThinArc::from_header_and_iter(self, text.bytes()); | ||
ThinTrivia(ptr) | ||
} | ||
|
||
fn text_len(&self) -> TextSize { | ||
match self { | ||
Trivia::Whitespace(n) => (*n as u32).into(), | ||
Trivia::Comment(n) => (*n as u32).into(), | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify how this Trivia
enum is intended to be used? It feels odd that you can construct a Comment(42)
with no text data. And that you can then call as_thin
on it with text of any length. Also, the HeaderSlice
inside a ThinArc
already has a length
field, so what's the advantage of storing a separate length inside the header
field?
I also think that the type used internally as a header probably shouldn't be directly exposed.
7bbe40e
to
28a4e22
Compare
c967564
to
8e328e6
Compare
I created three other PR's that are more easy to understand. See #1720. |
part of: #1720
Tasks
GreenToken
will have an enum with the trivial case and a fallback using Vec.SyntaxTrivia
will wrapGreenTokenTrivia
PartialEq
andHash
implementations for the *Token structs to consider the trivia.Display
andDebug
implementations to print trivia.Display
andDebug
NodeCache
to respect trivia (not fixed by changingHash
implementation)LossyTreeSink
LosslessTreeSink
Syntax
layer end up with all offsets wrong.test_trivia_attached_to_tokens
Debug
Decisions
How to test
Coverage is the same as before.