Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change the way trivia extraction works #125

Merged
merged 16 commits into from
Mar 7, 2021

Conversation

JohnnyMorganz
Copy link
Collaborator

@JohnnyMorganz JohnnyMorganz commented Feb 26, 2021

Previously, trivia was bound to tokens using the rules defined for swift-syntax

This PR changes the first rule to follow how the Rosyln Compiler works.
To quote their page:

In general, a token owns any trivia after it on the same line up to the next token. Any trivia after that line is associated with the following token. The first token in the source file gets all the initial trivia, and the last sequence of trivia in the file is tacked onto the end-of-file token, which otherwise has zero width.

Examples of change

I will explicitly show the newline character to give a sense of how it was parsed.
Previously, the old rules would parse something like this:

local foo = bar
[\n] local bar = baz

i.e., the newline character is added to the leading trivia of the local bar = baz line.
Our new rules change this around

local foo = bar [\n]
local bar = baz

The newline character is now added as trailing trivia for local foo = bar
Similarly, if there is a line break in between the two statements, such as:

local foo = bar [\n]
[\n]
local bar = baz

The first newline will be parsed as trailing trivia for local foo = bar, whilst the second is left as leading trivia for local bar = baz (previously, they were both leading trivia for local bar = baz)

I will also show another example with comments.

local foo = bar -- comment 1[\n]
-- comment 2[\n]
local bar = baz

In this example, -- comment 1 will still be parsed as trailing trivia to local foo = bar. We then have a newline character. Previously, this was parsed as leading trivia of local bar = baz, but now it is parsed as trailing trivia for local foo = bar.
We then see -- comment 2. As we have already seen a newline character, we stop taking in any more trailing trivia, so -- comment 2 is extracted as leading trivia for local bar = baz.

Old Suggestion Instead of stopping at the first newline trivia found when parsing trailing trivia, we now continue taking it in, except toggle a state indicating that we have seen a newline. We then use this state if we then reach a comment. If we have already seen a newline, we will not parse in the comment as trailing trivia - it will be left as leading trivia for the next token. If we haven't seen a newline, the comment will be parsed as trailing trivia as normal.

This is best explained through some examples. I will explicitly show the newline character to give a sense of how it was parsed.
Previously, the old rules would parse something like this:

local foo = bar
[\n] local bar = baz

i.e., the newline character is added to the leading trivia of the local bar = baz line.
Our new rules change this around

local foo = bar [\n]
local bar = baz

The newline character is now added as trailing trivia for local foo = bar
Similarly, if there is a line break in between the two statements, such as:

local foo = bar [\n]
[\n]
local bar = baz

Both of these newlines will now be parsed as trailing trivia for local foo = bar (previously, they were leading trivia for local bar = baz)

I will also show another example with comments. This shows the reasoning behind our "state".

local foo = bar -- comment 1[\n]
-- comment 2[\n]
local bar = baz

In this example, -- comment 1 will still be parsed as trailing trivia to local foo = bar. We then have a newline character. Previously, this was parsed as leading trivia of local bar = baz, but now it is parsed as trailing trivia for local foo = bar.
We then see -- comment 2. As we have already seen a newline character, we stop taking in any more trailing trivia, so -- comment 2 is extracted as leading trivia for local bar = baz.

This behaviour regarding comments remains the same as before. This PR should not change how comments are extracted - it only changes newlines.

This PR is a breaking change for any users of the library who relied on newline trivia. Comment trivia logic should remain the same.

@Kampfkarren
Copy link
Owner

I need to look into why the tests didn't fail, I'll try today or tomorrow.

@tomprince
Copy link
Contributor

I'm curious what the motivation for this, and wonder if splitting up the whitespace trivia into multiple pieces might be a way to address the underlying issue.

@JohnnyMorganz
Copy link
Collaborator Author

I'm curious what the motivation for this, and wonder if splitting up the whitespace trivia into multiple pieces might be a way to address the underlying issue.

The reasoning behind this was because currently with one of my projects (https://github.com/JohnnyMorganz/StyLua), I deal with each statement individually, and in this, new lines are placed at the end of statements. The problem here was that full moon would parse new line as leading trivia for the next token. I discussed this with Kampfkarren, and he was open to the PR. I also believe it makes more logical sense for the new line to trail the end of the line it was related to.

I'm not sure if splitting whitespace further would help, as the issue here isn't the parsing of the whitespace tokens itself, but how the trivia gets bound to the closest token.

@tomprince
Copy link
Contributor

Hmm. Looking closer at the definition of the whitespace token, it appears that it is an optional newline, followed by non-newline whitespace (note that this does not match the regex in the comment in tokenizer::parse_whitespace). Without having used this in practice, I think if you are making the change describe above, the newline should be terminating, rather than starting a whitespace token, so that the indent on the following line is still trivia of the first token on it.

@JohnnyMorganz
Copy link
Collaborator Author

Hmm. Looking closer at the definition of the whitespace token, it appears that it is an optional newline, followed by non-newline whitespace (note that this does not match the regex in the comment in tokenizer::parse_whitespace). Without having used this in practice, I think if you are making the change describe above, the newline should be terminating, rather than starting a whitespace token, so that the indent on the following line is still trivia of the first token on it.

This is a very good point which I seemed to have missed, thanks! I'll take a look tomorrow

@tomprince
Copy link
Contributor

This is a very good point which I seemed to have missed, thanks! I'll take a look tomorrow

The second point in #128 might be relevant to you, particularly if you are changing how whitespace trivia works.

@Kampfkarren
Copy link
Owner

Does that answer your question as to why the tests aren't failing then?

@JohnnyMorganz
Copy link
Collaborator Author

Does that answer your question as to why the tests aren't failing then?

I'm not entirely sure what you are referring to. If you are talking about the comment above about whitespace parsing, I think that is a different issue (although changing this would also cause tests to fail!)

You can see what I mean by "tests should have failed" if you delete the ast.json in one of the test folders where the source contained a new line (try anonymous-function-1 for example). Running the tests again, you will see when ast.json is recreated, there is a diff in the file - the tests aren't picking this up. I think this is an issue in the pass_cases.rs test

@Kampfkarren
Copy link
Owner

Ah, I see. I will look into it today, promise.

@JohnnyMorganz
Copy link
Collaborator Author

I've been rethinking over this PR, and I wanted to get a second opinion. I've been thinking of the new line character here as a line terminator, in contrast to the original rule. I believe it is a valid interpretation, but is it the interpretation we want to go for in the library?

Before I proceed any further, I just wanted to ask you both what you think makes more sense? (I can most likely accommodate my code for either scenario.)

@Kampfkarren
Copy link
Owner

Personally, any way works for me, as long as:

  1. It's useful for immutable consumers, such as Selene, without much boilerplate. I believe the only place this is done in that codebase in particular is https://github.com/Kampfkarren/selene/blob/9a522dca0ad0c06a6df2e9a26350f5594d9d4b32/selene-lib/src/lint_filtering.rs#L96.
  2. It's useful for mutable consumers, such as your own StyLua. I don't have any projects that are like this, so I trust your judgment on that whatever it ends up being.

I think StyLua is a good test case for this--choose whichever one makes your life easier.

@Kampfkarren
Copy link
Owner

Kampfkarren commented Mar 4, 2021

Ah, I know why tests didn't fail--leading/trailing trivia wasn't in the tests to begin with.

there is a diff in the file - the tests aren't picking this up

This is confusing, but I'll figure it out.

@Kampfkarren
Copy link
Owner

Okay, it's because the tests are using Eq, but Eq passes even if the trivia is different! I'll file a fix for that, and you can throw it into your branch.

@Kampfkarren
Copy link
Owner

#134

@JohnnyMorganz
Copy link
Collaborator Author

JohnnyMorganz commented Mar 4, 2021

the newline should be terminating, rather than starting a whitespace token, so that the indent on the following line is still trivia of the first token on it.

I was looking into this, and there seems to be a flaw in my rules to determine when to stop extracting trivia. Currently, we only stop at comments after newlines, so that the comment can be parsed as leading trivia for the next token. But this means, for something like:

do [\n]
[\t] call() [\n]
end

the indent trivia (in this case, a tab character preceding call(), is still parsed as trailing trivia for do.
To solve this, I think we should follow what the Rosyln Compiler (compiler for C#/.NET) does instead for trivia.
To quote their page:

In general, a token owns any trivia after it on the same line up to the next token. Any trivia after that line is associated with the following token. The first token in the source file gets all the initial trivia, and the last sequence of trivia in the file is tacked onto the end-of-file token, which otherwise has zero width.

This in turn should also make the rule simpler, and it seems to be well established.

EDIT: You may find https://sharplab.io/ useful, if you want to visualise what C# does

@Kampfkarren Kampfkarren added this to the 1.0.0 milestone Mar 5, 2021
@Kampfkarren
Copy link
Owner

#134 is merged. Update your CHANGELOG and lets see if the CI passes.

@Kampfkarren
Copy link
Owner

Re: Your last comment, is this still the PR you want to go with?

@JohnnyMorganz
Copy link
Collaborator Author

Re: Your last comment, is this still the PR you want to go with?

Yeah, I researched into this further and found that this is what the Roslyn compiler does. I simplified my logic (see the edit for the original PR message) greatly to follow what they do aswell - no need for "state" anymore.

Since it's used in practice by quite a large parser, I think it is an OK change to make

@Kampfkarren Kampfkarren merged commit 713a293 into Kampfkarren:master Mar 7, 2021
kennethloeffler added a commit to kennethloeffler/moonwave that referenced this pull request Oct 2, 2021
triple_dash and free_function tests currently fail, likely due to
Kampfkarren/full-moon#125
kennethloeffler added a commit to kennethloeffler/moonwave that referenced this pull request Oct 2, 2021
triple_dash and free_function tests currently fail, likely due to
Kampfkarren/full-moon#125
evaera pushed a commit to evaera/moonwave that referenced this pull request Oct 2, 2021
* Upgrade full_moon to 0.13.1 and add default roblox feature

triple_dash and free_function tests currently fail, likely due to
Kampfkarren/full-moon#125

* Fix newline handling
@JohnnyMorganz JohnnyMorganz deleted the extract-trivia branch October 19, 2021 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants