Skip to content

fix: Convert numbers to hexadecimal inside Unicode escape sequences #24

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 27, 2024

Conversation

graphemecluster
Copy link
Contributor

@graphemecluster graphemecluster commented Sep 27, 2024

Feel free to suggest a better comment that you feel comfortable with.

For simplicity, this PR doesn’t distinguish between Unicode and Unicode property escape sequences. It works by the fact that, among the currently allowed LoneUnicodePropertyNameOrValue, only gc, sc and scx start with a lowercase character and none of them are purely a~f. Don’t hesitate to close this PR if you would like to achieve this by separating ENCLOSED_TOKEN to P_TOKEN and U_TOKEN instead.

Only minimal test cases are written, I’d love it if you could take over and include more.

@slevithan
Copy link
Owner

Nice catch! And thanks for the PR!

It works by the fact that, among the currently allowed LoneUnicodePropertyNameOrValue, only gc, sc and scx start with a lowercase character and none of them are purely a~f.

That's clever, but relying on this will lead to confusing errors. Ex: If given \p{${12}} it will generate \p{c} which is correctly an error (thanks to JS's case sensitivity for Unicode properties), but it looks a lot like the valid \p{C}.

Don’t hesitate to close this PR if you would like to achieve this by separating ENCLOSED_TOKEN to P_TOKEN and U_TOKEN instead.

That's a good idea (since I don't love relying on the generated error), but if I decide to change this I can do it in a follow up diff to avoid holding back your PR.

Only minimal test cases are written, I’d love it if you could take over and include more.

No problem. Will do!

@slevithan slevithan merged commit bf956e4 into slevithan:main Sep 27, 2024
2 checks passed
slevithan added a commit that referenced this pull request Oct 1, 2024
@slevithan
Copy link
Owner

slevithan commented Oct 1, 2024

I've now added additional tests and split ENCLOSED_TOKEN into two values. Thanks again!

Also, I realized that converting to hex within enclosed \p{…} and \P{…} would have introduced a bug since you can do e.g. regex`\p{C${12}}`. That would have given /\p{Cc}/v, which is not an appropriate result but doesn't throw because it's a valid regex.


among the currently allowed LoneUnicodePropertyNameOrValue, only gc, sc and scx start with a lowercase character

There are a few others starting with a lowercase letter that JS supports:

  • \p{cntrl} - An alias for general category Cc/Control.
  • \p{digit} - An alias for general category Nd/Decimal_Number.
  • \p{punct} - An alias for general category P/Punctuation.
  • p{space} - An alias for binary property White_Space, NOT general category Z/Separator.

As far as I know, that's all of them.

slevithan added a commit that referenced this pull request Oct 1, 2024
@slevithan
Copy link
Owner

Published in v4.3.3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants