Avoiding possible incompatible regexp features in future development #23021

zherczeg · 2025-02-22T08:14:05Z

I am sorry if this is not the right place for such discussion. Please let me know the right place for it.

In the PCRE2 regular expression engine we have been adding some new regexp features, and it would be good if we could avoid incompatible features in the future, i.e. Perl will not use their syntax for something else. Feature flags could still be used, but it is better if we don't need to.

This one is already released. I think there is a low chance of reusing it.

Syntax: (*scan_substring:(CAPTURE_LIST)PATTERN) or (*scs:(CAPTURE_LIST)PATTERN)

More about it: https://zherczeg.github.io/sljit/scan_substring.html

The next one has a higher chance:

The (?PARNO) recursive subpattern syntax is extended with a capture list: (?PARNO:CAPTURE_LIST). The capture list is a comma-separated list of capturing brackets. The values of these captures are not restored after the recursive matching is completed.

This is not released, so the syntax can be changed.

CC @NWilson

jkeenan · 2025-02-23T11:38:21Z

> I am sorry if this is not the right place for such discussion. Please let me know the right place for it.

Thank you for calling our attention to these developments. Since what you are in effect requesting is for Perl to take a certain development track going forward, at this point the best place to have this discussion is on the perl5-porters mailing list (https://www.nntp.perl.org/group/perl.perl5.porters/). That's because the initial stage of this discussion has to be seen by the widest range of people concerned with Perl's development. Once we reach a consensus on Perl's policy with respect to keeping development in sync with PCRE2, we can use some mixture of our PPC process and this issue tracker to guide development.

demerphq · 2025-02-23T14:12:01Z

On Sat, 22 Feb 2025 at 09:14, Zoltan Herczeg wrote:

> I am sorry if this is not the right place for such discussion. Please let me know the right place for it.

This is more or less the right place for this. Long ago I reached out to Philip Hazel to try to establish some kind of regex syntax oversight process. Neither project has added that many new constructs, and I was too busy to pursue it formally, so the process never "firmed up". It is very good of you to reach out. Let's figure out a good way we can cooperate. Please note that below, when I say "our" or "we" or "us" I mean the Perl project, and when I say "you" or "your" I mean the PCRE project.

> In the PCRE2 regular expression engine we have been adding some new regexp features, and it would be good if we could avoid incompatible features in the future, i.e. Perl will not use their syntax for something else.

Generally speaking we would do our best to avoid using something you add in a way that is totally different from what you do. There are some minor differences in how the two engines approach certain matters, so there may be minor discrepancies, but on our side we would do our best to avoid such problems. However I do think it is good to involve us when you choose to add a new construct. It may be that we have strong feelings about how things are spelled out, and getting us involved early will prevent any differences of opinion festering and causing bad blood between the projects.

> Feature flags could still be used, but it is better if we don't need to.

Totally agree.

> This one is already released. I think there is a low chance of reusing it.
> Syntax: (*scan_substring:(CAPTURE_LIST)PATTERN) or (*scs:(CAPTURE_LIST)PATTERN)
> More about it: https://zherczeg.github.io/sljit/scan_substring.html

I have no real objection to this, especially as it is already released. I am moderately disappointed that we did not come up with a convention for when to use uppercase and when to use lowercase in these "verb-like" constructs [that is, (*IDENTIFIER) style meta-patterns and directives], but that may have been us, and not you. Perhaps it would be a good idea if we took a bit of time to think of some conventions so we don't end up with a mess in the long run. I am not super keen on short forms like 'scs', but I can live with it.

So, for instance, when I originally added verbs they were all uppercase; later on Karl added some that did not have to do with controlling match behavior, and they were made lowercase (IIRC). IMO it would be nice if we could have some bright-line guidance like that, which made it more obvious why a given construct was uppercase or lowercase, or used a particular convention in how it expressed things.

I say this because one of the reasons Perl syntax caught on and became the dominant syntax was that Larry cleaned up the earlier conventions so that it was much easier to remember what happened when something was backslashed. E.g., in Perl syntax backslash-non-alphanumeric like \[ is a literal and NOT a meta-character, and backslash-alphanumeric is a meta-character and NOT a literal. Other regex engines have a weird mix of both cases, which makes it hard to remember how to write a pattern (e.g., in vim \< is a left-sided word break). Given this precedent, it would be good if we had a convention which was easy to remember and understand.
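The backslash convention described above can be checked mechanically. The following is a minimal sketch that uses Python's `re` module as a stand-in, on the assumption that it follows the same Perl-style rule for ASCII characters; it is not Perl itself, and the variable names are only illustrative.

```python
import re
import string

# Backslash + ASCII punctuation: always the literal character.
punctuation_is_literal = all(
    re.fullmatch("\\" + ch, ch) is not None for ch in string.punctuation
)

# Backslash + letter: a meta-sequence, not the literal letter.
digit_class_matches_digit = re.fullmatch(r"\d", "7") is not None
digit_class_matches_letter = re.fullmatch(r"\d", "d") is not None

print(punctuation_is_literal)       # True
print(digit_class_matches_digit)    # True
print(digit_class_matches_letter)   # False
```

Because the rule has no exceptions for punctuation, a pattern author can always backslash a punctuation character to get a literal without consulting a table.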

> The next one has a higher chance:
> The (?PARNO) recursive subpattern syntax is extended with a capture list: (?PARNO:CAPTURE_LIST). The capture list is a comma-separated list of capturing brackets. The values of these captures are not restored after the recursive matching is completed.
> This is not released, so the syntax can be changed.

I have no strong objection to this. The syntax sounds reasonable. The intent seems reasonable. Whether or not it is doable in the current Perl engine is another question. But I definitely think it would be a nice feature and I see no reason we would not follow your precedent.

I do /kinda/ wonder if we are setting ourselves up for problems in terms of establishing some conventions for this. In most of the verbs a colon suffix indicates a mark name (something I thought would be used much more than has proved to be the case!), and in this case we are adding a colon suffix meaning capture lists. This kinda bothers me. As does the fact that in your example above, (*scan_substring:(CAPTURE_LIST)PATTERN), the capture list is in parens, but in the PARNO proposal the parens are not required. I am inclined to say one of the two is wrong. From a language design and language learning perspective, using a common form for similar things makes the language easier to learn. So perhaps, given "scan_substring" is already released, it would be better to make the PARNO case also be parenthesized. On the other hand, it seems to me a better approach would be if 'scan_substring' was of the form (*scan_substring:CAPTURE_LIST:pattern), but maybe it is too late to change it.

There is something to be said for a rule like "when a capture list is specified in a meta-pattern it MUST be parenthesized, and be comma separated". So then the PARNO case would be (?PARNO:(CAPTURE_LIST)). The mnemonic being that parens inside of a verb-like meta-pattern should always contain a list of capture names or indexes.

Anyway, thanks for reaching out to us. We really should formalize our relationship so that we don't add things that mess up your plans, and vice versa. At a certain level it would be nice if we both used the same code, but I doubt that will ever happen.

cheers,
Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

zherczeg · 2025-02-23T16:47:29Z

Thank you for the feedback! I remember we had discussions about the syntax several years ago, but I could not find where. It would be great to continue those plans. Perhaps setting up a low-traffic mailing list for it?

It looks like I totally misunderstood the naming of (*id: constructs. I thought capital letters are reserved for verbs exclusively, and lowercase letters for generic constructs. Perl has some: (*script_run: or (*pla:. In PCRE2, we have non-atomic versions, such as (*napla:.

The (*scan_substring:(CAPTURE_LIST)PATTERN) tried to be similar to conditional blocks: (?(condition)yes-pattern|no-pattern), the ? is replaced by *scan_substring:, which represents the "command", and the condition is extended to a list. I suspect this feature is less interesting for perl, since captures are available as variables, and code blocks can be nested into patterns.

(?PARNO:(CAPTURE_LIST)) was one of the variants we were discussing to use, so we will change the syntax. Honestly, any syntax is good for me as long as it is not overly complex.
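To make the difference between the two spellings concrete, here is a small Python sketch that recognizes only the parenthesized form, (?PARNO:(CAPTURE_LIST)). The names `RECURSE_WITH_LIST` and `parse_recursion` are hypothetical and belong to neither Perl nor PCRE2. The sketch also assumes that PARNO and every list item are plain decimal group numbers, which may be narrower than what a real engine accepts. It checks the spelling only and implements none of the matching semantics.

```python
import re

# Hypothetical recognizer for the spelling (?PARNO:(CAPTURE_LIST)).
# Assumption: PARNO and every list item are plain decimal group numbers.
RECURSE_WITH_LIST = re.compile(r"\(\?(\d+):\((\d+(?:,\d+)*)\)\)")


def parse_recursion(construct):
    """Return (parno, [captures]), or None if the spelling does not match."""
    m = RECURSE_WITH_LIST.fullmatch(construct)
    if m is None:
        return None
    return int(m.group(1)), [int(n) for n in m.group(2).split(",")]


print(parse_recursion("(?1:(2,3))"))  # (1, [2, 3])
print(parse_recursion("(?1:2,3)"))    # None: the list is not parenthesized
```

With the "capture lists are always parenthesized" rule, a single sub-pattern for the list could in principle be shared by every construct that takes one.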

demerphq · 2025-02-23T17:11:21Z

On Sun, 23 Feb 2025 at 17:47, Zoltan Herczeg wrote:

> Thank you for the feedback! I remember we had discussions about the syntax several years ago, but I could not find where. It would be great to continue those plans. Perhaps setting up a low-traffic mailing list for it?

I think that would be a good idea.

> It looks like I totally misunderstood the naming of (*id: constructs. I thought capital letters are reserved for verbs exclusively, and lowercase letters for generic constructs. Perl has some: (*script_run: or (*pla:. In PCRE2, we have non-atomic versions, such as (*napla:.

It may be that I am the one in the wrong here. Perhaps we just need to clarify where we are now so we don't shoot ourselves in the foot in the future.

> The (*scan_substring:(CAPTURE_LIST)PATTERN) tried to be similar to conditional blocks: (?(condition)yes-pattern|no-pattern), the ? is replaced by *scan_substring:, which represents the "command", and the condition is extended to a list. I suspect this feature is less interesting for perl, since captures are available as variables, and code blocks can be nested into patterns.

Ah, that is also something we must consider: your use case is more generic than ours, so we must be flexible to your needs.

> (?PARNO:(CAPTURE_LIST)) was one of the variants we were discussing to use, so we will change the syntax. Honestly, any syntax is good for me as long as it is not overly complex.

I agree, more or less, with the caveat that whatever we do should be easy to remember and not contain contradictions. I am not the most inspired in regard to language design, which is why I cc'ed the people I did on this. They have long played a role at some level in these discussions, and broader feedback, I think, can only help.

How long would you be comfortable waiting for us to make a decision? Are you bursting to release this ASAP, or can we wait a few weeks for people to mull it over?

Yves

-- perl -Mre=debug -e "/just|another|perl|hacker/"

zherczeg · 2025-02-23T19:53:26Z

We have just released the code so we have at least six months before the next one. Plenty of time to make any decisions.

NWilson · 2025-02-27T07:09:37Z

Hi, I've been following this discussion. I've recently taken over project maintenance of PCRE2, from Philip Hazel.

I don't have much to add here, but I'm glad we're discussing and coordinating the syntax we offer. Every difference between the various regex engines is a nasty papercut for users (like the different meaning of "\Z" in Python...).

At the moment, it seems that only control-flow verbs (ACCEPT, THEN, ...) are uppercase, and matching constructs are lowercase. This seems consistent and good.

Keeping the surface syntax in sync between engines is a smaller worry, to me. I am not terribly upset if there is some string which is treated as valid in one regex dialect, but not in another. That is annoying for users, if they have to use different syntax for different engines (e.g. the total mess-up over named capture syntax...). Far worse, however, is the case where two engines both accept the same string as a valid regex, but interpret it as matching a different set of subject strings!

That is something I would aim to avoid at all costs. If we ever do adopt identical (or overlapping) syntax for some new feature, we'd better make sure that it behaves the same way.
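The "\Z" papercut mentioned above is a small instance of exactly this problem: the same pattern is valid in both dialects but matches different subjects. A minimal sketch of the Python side follows; the Perl behavior is stated in a comment only and is not executed here, and the variable names are illustrative.

```python
import re

subject = "foo\n"

# Python: "$" matches at the very end, or just before a trailing newline.
dollar_matches = re.search(r"foo$", subject) is not None

# Python: "\Z" matches only at the very end of the string.
# In Perl, \Z also matches before a final newline, so /foo\Z/ would
# match this same subject.
upper_z_matches = re.search(r"foo\Z", subject) is not None

print(dollar_matches)   # True
print(upper_z_matches)  # False
```

A pattern copied between the two engines compiles without complaint in both, which is why this kind of divergence is harder for users to notice than a syntax error.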

zherczeg · 2025-02-27T07:24:59Z

To move things forward, I have created a mailing list for syntax discussions: https://groups.google.com/g/regexp-syntax
Anybody can join. I expect the traffic will be very low. The list will keep all past discussions, and it will be easier to search them there.

zherczeg added the Needs Triage label Feb 22, 2025

jkeenan removed the Needs Triage label Feb 23, 2025
