Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for the prism compiler #44

Merged
merged 2 commits into from
Jun 7, 2024
Merged

Support for the prism compiler #44

merged 2 commits into from
Jun 7, 2024

Conversation

kddnewton
Copy link
Contributor

No description provided.

@kddnewton kddnewton requested a review from mame May 20, 2024 17:50
@mame
Copy link
Member

mame commented May 20, 2024

Nice! That's an elegant implementation using Prism API.

We need a CI with prism. Could you add it please?

@kddnewton
Copy link
Contributor Author

Yes! Done!

return unless path

lineno = loc.lineno
column = RubyVM::AbstractSyntaxTree.node_id_for_backtrace_location(loc)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kddnewton Hey, does the prism-derived iseq have columns in the node_id field? It is a bit hacky. What column is in the field?

I'm a little concerned the information loss of projecting node to lineno and column. If the column of the call foo.bar.baz for example is the first column of the receiver (i.e., f), then foo.bar and foo.bar.baz would be indistinguishable.
Do you always use the column of a surface position where the node is bottom-most in the tunnel? (somewhere in .baz in this example)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @mame — I have been nervous about this as well for a while, but I think it might be okay. When we're creating the instructions in the compiler, the current compile.c compiles with (lineno, node_id) for most instructions. We compile with (lineno, column). So in the case of the call nodes that you specified (foo.bar.baz), the columns are 0, 4, and 8 (they match up to the location of the name of the method).

The tunnel method always goes as far is it can down into the nodes, so it's effectively a function of tunnel(line, column) => {node}. It's stable across parses, so it will always be consistent. Because we walk backward through the list to find the bottom-most node, it should always give us the correct call.

I'm open to changing it, as I have definitely been considering the things you're asking for here as well. However, I think it is working as it is and will continue to work as it's currently set up. (The reason I'm reluctant to add a node id is just because I don't want to add another 4 bytes to every node.) What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. In a trivial example, Prism.parse("1") results in PrismNode -> StatementsNode -> IntegerNode, but they all have the completely same location and cannot be distinguished by (lineno, column).

Indeed, I don't think this will be a problem in most cases, including error_highlight. However, why do you avoid node_id?

The reason I'm reluctant to add a node id is just because I don't want to add another 4 bytes to every node.

You can save node_id instead of column. There should be no difference in performance, since the entire file will still be reparsed.

I know node_id seems not very cool (it was suggested by @ko1 and I didn't like it much at first myself). However, I am now happy with it, as it has worked well so far. If there is no clear problem, I vote not to change it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't store column on the node at all, we only have offsets from the beginning of the file. line/column information is calculated lazily for the users that need it. Our nodes are basically:

  • 2 bytes type
  • 2 bytes flags
  • 8 bytes start pointer
  • 8 bytes end pointer

I've also got a branch where we're experimenting with tagged pointers so that it's:

  • 2 bytes type
  • 2 bytes flags
  • 2 bytes start offset
  • 2 bytes length

which is really nice because we can pack the nodes into a single pointer without having to allocate anything. So this is why I'm pushing back on this because if we have a 32-bit node ID, there's no way to do tagged pointers, we'll always have to allocate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, can you elaborate on that idea a bit more? Or do you have documentation or suggestions? I am wondering how attributes and child elements would be represented.

If we want to reduce the size that much, is there any way to use a reproducible node identification method, such as post order traversal? There is no need to save the node_id.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Record: I talked with @kddnewton and @ko1, and we decided to go with this once.

@mame mame merged commit 24f4b44 into master Jun 7, 2024
6 checks passed
@mame mame deleted the prism branch June 7, 2024 13:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants