-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Unicode Annex 31 methods to char
#2693
Conversation
char
I think this could probably be done as a PR to Rust directly rather than an RFC. |
Co-Authored-By: notriddle <[email protected]>
Co-Authored-By: notriddle <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm very iffy on this, I don't see a strong enough motivation to include it in the stdlib, while I see very clear reasons for avoiding things that change every unicode version to be in the stdlib. We moved unicode_segmentation out of tree as well for similar reasons, despite it being very useful in unicode-aware string handling.
a standardized set of code point categories for defining computer language syntax. | ||
|
||
This is being used in production Rust code already. | ||
Rust's own compiler already has functions to check against Annex 31 code point categories in the lexer, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for the unstable, old, non_ascii_idents feature, which we have already RFCd to change. The change will still need some function like this, but it may need tailoring for bidi characters. This is true for other implementors too, the XID functions often need tailoring.
They're not super hard to tailor, though, so we could still expose these functions to match spec and let those tailoring just suffix the calls with || ch == ... && ch != ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The XID part is currently unstable. But stable Rust does respect Pattern_White_Space
, so there's already committed-to Annex 31-based syntax. https://internals.rust-lang.org/t/do-we-need-unicode-whitespace/9876
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a concrete case of tailoring, Rust actually uses XID_Start | '_'
and not a plain XID_Start
. I've actually got bitten by that when implementing lexer for IntelliJ Rust a while ago :)
|
||
# Drawbacks | ||
[drawbacks]: #drawbacks | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Matching unicode versions is also a big issue here. This function will be inaccurate half the time as we don't always update our data files immediately (it's not always straightforward). Even if we do, the behavior of this function will change every year, and while we don't have a guarantee on stdlib behavior stability, this does mean that older compilers will lead to different results on code that compiles. This further makes me feel like this should be a versioned crate.
is_whitespace already opened the doors to this issue, but I don't want to make it worse. is_whitespace is a small relatively stable list whereas XID expands all the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. Added it.
# Motivation | ||
[motivation]: #motivation | ||
|
||
As a systems language, Rust is heavily used for parsing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me this reads as motivation for why Rust needs such functions somewhere, but inclusion in the stdlib is a much higher bar, especially when this gets us tangled up with unicode versioning.
Yes, we already have is_whitespace, but that's an old API that was grandfathered in, and it has fewer unicode stability issues than XID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One argument for it being in std is discoverability, but if you're parsing some grammar that grammar will likely tell you about XID.
Co-Authored-By: notriddle <[email protected]>
cc rust-lang/rust#62848 which proposes to move in the opposite direction: remove these methods from |
rust-lang/rust#62848 appears to have reached consensus for the opposite direction; does that amount to an agreement to close this one? |
Yeah, I will definitely rescind this whole thing, if that's going to be the solution. |
@notriddle Based on the comment above I am going to optimistically close this issue and remove the nominated label, but please reopen if I misunderstood your intentions. |
Rendered