Explore alternatives to RustPython for Python AST parsing #286
Comments
Alternatively, could we use the CPython AST parser directly? It's written in C, so AFAIK it should be very fast (perhaps even faster than the RustPython parser).
Alternatively, we could try to continue merging improvements back to RustPython.
Alternatively, we could revisit the use of LibCST.
LibCST will also likely be required for auto-formatting.
tree-sitter is an interesting option (demo in #295). It produces a CST, so we could use it to power auto-formatting. It's slower than RustPython but still quite fast (~350ms to iterate over the CPython codebase vs. ~230ms with RustPython). Because it's based on S-expressions, it has a whole pattern-matching syntax built-in, which we could use to power plugins... I think the ergonomics on our end won't be great, because it just exposes a single Node type with a Going to explore a few other options, but I'm intrigued... |
There's also rust-python-parser, which is based on nom.
There's also rust-sitter, which sits on top of tree-sitter and lets you define the grammar on the Rust side via macros. In return, you get semantically meaningful structs when you generate the parse tree. The downside is that we'd have to rewrite the Python grammar ourselves.
If we were to use tree-sitter, we may want to write code to transform the generated tree into AST types on the Rust side, similar to the |
rust-python-parser takes ~480ms (vs. 280ms for RustPython's parser) and fails on these files:
Ok, time for me to revive this thread and start thinking about the right long-term parser strategy. We need to support the 3.10 constructs!
@Seamooo - Do you have any thoughts on or interest in working on this too?
Definitely have some interest in this area. A couple of comments as I've been going down various rabbit holes of python parser implementations:
Ideally, when #742 becomes mergeable, implementing those traits for the IR output by the parser would make it immediately usable.
Yeah rust-peg does look good, and I believe that's what LibCST uses. My only hesitation there is that the LibCST parser is really slow compared to the RustPython parser? But I don't know how much, if any, of that is due to rust-peg. This is just based on the fixtures in the LibCST codebase:
( |
(And agreed on #742.)
Perhaps we could extend |
CC: @davidhalter (I think you were working on a Rust-based Fast Python Parser [?])
I'm going to start playing around with a straight port of pegen to Rust. (Or, more specifically, modifying pegen to output a Rust target, which IIUC will also require rewriting the Python code from |
Thanks for bringing this up @isidentical. I have indeed written a Python parser in Rust, but I do not feel comfortable sharing it yet. While it's faster than any other Python parser I have seen, it's a bit rough around error recovery and a few other things, but it generally parses all valid 3.10 Python programs (AFAIK). So if I ever release it, I'm happy to let you guys know.
Thanks @davidhalter! Super cool. Would love to see it if you ever release it. If you don't mind sharing: did you write it from scratch? Or is it built atop a parser generator?
(I started on this in earnest tonight, I don't know if it will prove to be the right approach but the lower bound is that I learn a lot.)
@charliermarsh Yes I wrote it from scratch, it is a kind of weird mixture of LL and PEG, it's also a parser generator. I essentially pass it a slightly modified version of the official EBNF grammar. I personally love writing parser generators, so when the opportunity presented itself, I took it :)
@Seamooo - Do you have any interest in collaborating w/ me on the Rust port of pegen? It's a private repo right now but would be happy to add you and talk about how I'm iterating on it.
Definitely would be interested
@Seamooo - Added you, repo is rough but your help is very welcome!
I'd love (read) access as well, if possible. I don't think I'm much help in contributing, but I've been looking at |
@ljodal - Added! Would love to have any help / feedback. (Anyone is welcome to be added, I'm just avoiding making it totally public while it's still in such an incomplete state.)
I'd like to take you up on that invitation if the attempt is still ongoing :-). We've recently been thinking of doing the same with RustPython's parser, porting |
@DimitrisJim - Added! You'll be able to tell from the contribution dates, but I haven't been able to push on it as much as I'd like lately (just other Ruff work taking priority). Any / all help or feedback welcome. While you're here: to be clear, my preference would be to continue using the RustPython parser :) The goal of this task is, implicitly, to remove the parser as the bottleneck for what Ruff can support, i.e., to get us to a point where the parser can support all current language features.
I was thinking of adding a |
@fanninpm - I added you to my |
There's also a PEG grammar (https://github.com/charliermarsh/rust-pegen/blob/main/pegen/data/rust.gram) (has to differ from the Python grammar because the PEG grammars include actual code). The generated code is here.
@qingshi163 what do you think?
As an update: my current hope here is to continue to use RustPython, especially since there's been a lot of good progress in the parser recently, although structural pattern matching in some form is still the biggest hurdle.
What would be the downside to using the CPython AST parser directly, since that is guaranteed to always be up-to-date?
The big one, I think, is that CPython’s AST doesn’t seem to store all the formatting information (whitespace, comments). That’s why CST parsers exist, like RustPython’s and LibCST (which is written in Rust, but as said above, seems to be slow). I’m pretty sure Ruff needs this information to operate on comments and to do autofixes without mangling comments and whitespace. (Actually, I think the old parser created an intermediate CST, but the new one doesn’t anymore.) If a well-maintained, fast CST library written in C existed, and we rephrased the question to “why not use that C library”, I think there would be two ways to do it, each with its downsides:
Interestingly, RustPython's AST doesn't include any CST-like information -- it's very similar to the CPython AST. We get comments and other non-AST tokens from the token stream directly (comments are included in the token stream, but not in the AST).
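To make that split concrete, here is a minimal sketch using CPython's own `ast` and `tokenize` modules as a stand-in. This is an assumption for illustration: RustPython's parser is a separate Rust implementation, and the `SOURCE` string is made up, but the same division between AST and token stream applies.

```python
import ast
import io
import tokenize

# Hypothetical one-line input with a trailing comment.
SOURCE = "x = 1  # the answer\n"

# The AST keeps the assignment but has no node for the comment.
tree = ast.parse(SOURCE)
dumped = ast.dump(tree)

# The token stream, by contrast, yields the comment as a COMMENT token.
comments = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(SOURCE).readline)
    if tok.type == tokenize.COMMENT
]

print(dumped)
print(comments)
```

The comment text never shows up in the dumped tree, so anything that needs it (e.g. `noqa` handling or comment-preserving autofixes) has to read it from the tokens.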
To me the downsides to using the CPython AST parser are as follows:
So, as far as I understand, there are several things here. Ruff needs a CST representation of Python. The existing solution is LibCST, which is slow. Could it be slow because it does more work? So, I see a new question:
We've continued to improve the RustPython parser and it's working well for us! Some of these questions are coming up again with the autoformatter, where I have to map the AST to a CST by enriching it with information derived from the token stream. In that context, I'm mostly focused right now on the CST representation itself and not the process by which it's parsed / extracted, which I'm punting until later.
I'm going to close this for now, as we're continuing to use and contribute to the RustPython parser. Nothing actionable here.
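A rough sketch of what "enriching the AST with information derived from the token stream" could look like, using CPython's `ast` and `tokenize` modules as stand-ins rather than Ruff's actual Rust code. The helper names `comments_by_line` and `attach_trailing_comments` are hypothetical, and only trailing comments on top-level statements are handled.

```python
import ast
import io
import tokenize


def comments_by_line(source):
    """Map each line number to the comment token found on that line."""
    readline = io.StringIO(source).readline
    return {
        tok.start[0]: tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type == tokenize.COMMENT
    }


def attach_trailing_comments(source):
    """Pair each top-level statement with the comment on its last line, if any."""
    tree = ast.parse(source)
    comments = comments_by_line(source)
    return [
        (type(stmt).__name__, comments.get(stmt.end_lineno))
        for stmt in tree.body
    ]


# Hypothetical input: two commented statements and one bare one.
SOURCE = "x = 1  # set x\ny = 2\nprint(x + y)  # show sum\n"
enriched = attach_trailing_comments(SOURCE)
print(enriched)
```

A real CST would also need leading comments, blank lines, and exact whitespace; this only shows the join between AST node positions and token positions.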
RustPython has been a good foundation, but Ruff is currently limited by what the RustPython parser does and does not support. See: #282, #54, #245.
Since we only need the parser (and RustPython is more ambitious -- they're trying to build an entire runtime / interpreter), there may be better options. RustPython is also going to be limited by their use of an LL(1) parser.
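One way to see the lookahead problem, taking the 3.10 `match` soft keyword as an example (an assumption on my part; the paragraph above doesn't name a specific construct). The sketch uses CPython's `tokenize` module, and the helpers `token_pairs` and `common_prefix_len` are hypothetical names. It counts how many leading tokens a plain call `match(x)` shares with a `match (x):` statement.

```python
import io
import tokenize


def token_pairs(source):
    """Return (type, text) pairs for every token in `source`."""
    readline = io.StringIO(source).readline
    return [(tok.type, tok.string) for tok in tokenize.generate_tokens(readline)]


def common_prefix_len(a, b):
    """Count how many leading tokens two token lists share."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count


# `match` used as an ordinary function name vs. as the soft keyword.
CALL = "match(x)\n"
STATEMENT = "match (x):\n    case _:\n        pass\n"

call_tokens = token_pairs(CALL)
stmt_tokens = token_pairs(STATEMENT)
shared = common_prefix_len(call_tokens, stmt_tokens)
print(shared)
```

The two inputs share their first four tokens and only diverge at the `:`, so a parser that commits after one token of lookahead can't tell a call from a `match` statement. With a longer subject expression the shared prefix grows accordingly.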
We could consider using the following: