Avoiding possible incompatible regexp features in future development #23021

Open
zherczeg opened this issue Feb 22, 2025 · 7 comments

Comments

@zherczeg

I am sorry if this is not the right place for such discussion. Please let me know the right place for it.

In the PCRE2 regular expression engine we have been adding some new regexp features, and it would be good if we could avoid incompatible features in the future, i.e. Perl will not use their syntax for something else. Feature flags could still be used, but it is better if we don't need them.

  • This one is already released. I think there is a low chance of reusing it.

Syntax: (*scan_substring:(CAPTURE_LIST)PATTERN) or (*scs:(CAPTURE_LIST)PATTERN)

More about it: https://zherczeg.github.io/sljit/scan_substring.html

  • The next one has a higher chance:

The (?PARNO) recursive subpattern syntax is extended with a capture list: (?PARNO:CAPTURE_LIST). The capture list is a comma-separated list of capturing brackets. The values of these captures are not restored after the recursive matching is completed.

This is not released, so the syntax can be changed.

CC @NWilson
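
(Editorial illustration, not part of the original post.) To give a rough feel for what "scanning a captured substring" means, here is a minimal Python sketch that emulates the idea in two steps with the standard re module. It is not PCRE2 and does not use the (*scan_substring:...) syntax. The helper name match_with_rescan is hypothetical, and the sketch assumes the feature rescans only the text captured by one listed group. It does not try to reproduce the real assertion's anchoring or backtracking behaviour, which this thread does not describe.

```python
import re

def match_with_rescan(outer, group, inner, subject):
    """Hypothetical helper: find `outer` in `subject`, then require that
    `inner` matches somewhere inside the text captured by `group`.
    This is a two-step emulation, not the PCRE2 feature itself."""
    m = re.search(outer, subject)
    if m is None or m.group(group) is None:
        return None
    # Rescan only the captured substring, not the whole subject.
    if re.search(inner, m.group(group)) is None:
        return None
    return m

# Capture a word, then require that the captured word contains a digit.
with_digit = match_with_rescan(r"\b(\w+)\b", 1, r"\d", "abc1")
without_digit = match_with_rescan(r"\b(\w+)\b", 1, r"\d", "abc")
print(with_digit is not None)     # True
print(without_digit is not None)  # False
```

Unlike a real in-pattern assertion, this emulation cannot make the outer match backtrack and try a different capture when the inner scan fails.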

@jkeenan
Contributor

jkeenan commented Feb 23, 2025

I am sorry if this is not the right place for such discussion. Please let me know the right place for it.

Thank you for calling our attention to these developments. Since what you are in effect requesting is for Perl to take a certain development track going forward, at this point the best place to have this discussion is on the perl5-porters mailing list (https://www.nntp.perl.org/group/perl.perl5.porters/). That's because the initial stage of this discussion has to be seen by the widest range of people concerned with Perl's development. Once we get a consensus as to what Perl's policy should be with respect to keeping development in sync with PCRE2, we can use some mixture of our PPC process and this issue tracker to guide development.

@demerphq
Collaborator

demerphq commented Feb 23, 2025 via email

@zherczeg
Author

Thank you for the feedback! I remember we had discussions about the syntax several years ago, but I could not find where. It would be great to continue those plans. Perhaps setting up a low-traffic mailing list for it?

It looks like I totally misunderstood the naming of (*id: constructs. I thought capital letters are reserved for verbs exclusively, and lowercase letters for generic constructs. Perl has some: (*script_run: or (*pla:. In PCRE2, we have non-atomic versions, such as (*napla:.

The (*scan_substring:(CAPTURE_LIST)PATTERN) syntax tries to be similar to conditional blocks, (?(condition)yes-pattern|no-pattern): the ? is replaced by *scan_substring:, which represents the "command", and the condition is extended to a list. I suspect this feature is less interesting for Perl, since captures are available as variables, and code blocks can be nested into patterns.

(?PARNO:(CAPTURE_LIST)) was one of the variants we discussed using, so we will change the syntax. Honestly, any syntax is good for me as long as it is not overly complex.
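
(Editorial illustration, not part of the original comment.) For readers who have not used the conditional-block syntax (?(condition)yes-pattern|no-pattern) that the comment above compares against, here is a small sketch using Python's re module, which supports the same construct with a group number as the condition. The pattern and the sample strings are made up for illustration.

```python
import re

# Group 1 optionally captures "<". The conditional (?(1)>) then requires
# a closing ">" only if group 1 took part in the match; the no-pattern
# branch is omitted, so it is empty.
pattern = re.compile(r"^(<)?\w+(?(1)>)$")

results = {s: bool(pattern.match(s)) for s in ["<abc>", "abc", "<abc", "abc>"]}
for subject, matched in results.items():
    print(subject, matched)
# <abc> True
# abc True
# <abc False
# abc> False
```

The parenthesised condition directly after the opening marker is the shape that the (CAPTURE_LIST) part of the new syntax mirrors.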

@demerphq
Collaborator

demerphq commented Feb 23, 2025 via email

@zherczeg
Author

We have just released the code, so we have at least six months before the next release. Plenty of time to make any decisions.

@NWilson

NWilson commented Feb 27, 2025

Hi, I've been following this discussion. I've recently taken over project maintenance of PCRE2, from Philip Hazel.

I don't have much to add here, but I'm glad we're discussing and coordinating the syntax we offer. Every difference between the various regex engines is a nasty papercut for users (like the different meaning of "\Z" in Python...).

At the moment, it seems that only control-flow-verbs (ACCEPT, THEN, ...) are uppercase, and matching constructs are lowercase. This seems consistent and good.

Keeping the surface syntax in sync between engines is a smaller worry, to me. I am not terribly upset if there is some string which is treated as valid in one regex dialect, but not in another. That is annoying for users, if they have to use different syntax for different engines (e.g. the total mess-up over named capture syntax...). Far worse, however, is the case where two engines both accept the same string as a valid regex, but interpret it as matching a different set of subject strings!

That is something I would aim to avoid at all costs. If we ever do adopt identical (or overlapping) syntax for some new feature, we'd better make sure that it behaves the same way.
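
(Editorial illustration, not part of the original comment.) The "\Z" difference mentioned above is a concrete case of one string being a valid regex in several engines with different meanings. In Perl and PCRE2, \Z matches at the very end of the subject or just before a final newline. In Python's re module, \Z matches only at the very end. A minimal sketch of the Python side, with a made-up subject string:

```python
import re

subject = "foo\n"

# In Python, \Z matches only at the absolute end of the string,
# so the trailing newline prevents a match here.
strict_end = re.search(r"foo\Z", subject)

# Python's $ (without re.MULTILINE) also matches just before a final
# newline, which is what \Z does in Perl and PCRE2.
before_newline = re.search(r"foo$", subject)

print(strict_end is None)          # True
print(before_newline is not None)  # True
```

The same pattern foo\Z would therefore match "foo\n" in Perl or PCRE2 but not in Python.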

@zherczeg
Author

To move things forward, I have created a mailing list for syntax discussions: https://groups.google.com/g/regexp-syntax
Anybody can join. I expect the traffic will be very low. The list will keep all past discussions, and it will be easier to search them there.

4 participants