Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consider implementing the means of abstraction used by Perl5/PCRE regexes #962

Closed
db48x opened this issue Feb 27, 2023 · 6 comments
Closed
Labels

Comments

@db48x
Copy link

db48x commented Feb 27, 2023

Perl and PCRE define a means of defining names for parts of a regex and then reusing to those parts by name much as one will often define functions in Rust and then call them by name.

An example:

(?(DEFINE)
  (?P<quant>many|some|five)
  (?P<adj>blue|large|interesting)                   
  (?P<object>cars|elephants|problems)                
  (?P<noun_phrase>(?&quant)\ (?&adj)\ (?&object))   
  (?P<verb>borrow|solve|resemble)
)
(?&noun_phrase)\ (?&verb)\ (?&noun_phrase)

I’m not very familiar with the implementation details, but I believe that most of the changes needed to implement this would be in regex-syntax, plus the part of the compiler that builds the Hir. The compiler would replace each call by the definition and then compile that. The resulting Hir would then be identical to one where the named expressions were written out by hand.

Given the design goals of this crate, I would recommend keeping the language regular by returning an error when a recursive call is detected. This can be done during compilation by simply keeping a stack of calls and returning an error if a call is made to a group that is already in the stack.

However, it would still be desirable to keep the list of definitions around even after the regex is fully compiled. This would allow them to be reused while building new regexes. Suppose one had a regex containing definitions, called a grammar:

let grammar = Regex::new(r"(?x)
                           (?(DEFINE)
                             (?P<quant>many|some|five)
                             (?P<adj>blue|large|interesting)
                             (?P<object>cars|elephants|problems)
                             (?P<noun_phrase>(?&quant)\ (?&adj)\ (?&object))
                             (?P<verb>borrow|solve|resemble)
                           )").unwrap();

This grammar could be reused to create multiple new regexes by substitution:

let sentence = Regex::new(format!(r"(?x)
                                    (?&noun_phrase)\ (?&verb_phrase)\ (?&noun_phrase)
                                    (?(DEFINE)
                                      (?P<adverb>quickly|throughly|confidently)
                                      (?P<verb_phrase>(?&adverb)\ (?&verb))
                                    )
                                    {grammar}")).unwrap();

Obviously, one could imagine additional APIs beyond simple textual substitution, but that is a separate topic. Similarly, the exact syntax for definitions and calls could be different. PCRE supports three variations originating from Perl5, Python, and Ruby; we could choose any of them or invent our own.

@BurntSushi
Copy link
Member

I think this pushes regexes, at least how they are currently conceived, too far IMO. I rarely if ever see this sort of advanced syntax used in the wild, and in my experience it isn't well supported outside of Perl and PCRE. If you get to the point where your regex is so complicated as to benefit from this, then I'd rather see you use your programming language features to assemble the regex. (In Rust, that might be format! or a templating library.)

@BurntSushi BurntSushi closed this as not planned Won't fix, can't repro, duplicate, stale Feb 27, 2023
@db48x
Copy link
Author

db48x commented Feb 27, 2023

That’s a shame. The whole point of adding advanced features is to promote them to a new audience outside of Perl and Raku (and Python and Ruby and…) who might not otherwise have access to them.

I'd rather see you use your programming language features to assemble the regex. (In Rust, that might be format! or a templating library.)

True, as presented here it could be done purely with format!, but that’s beside the point. I would like to see multiple regex crates with different goals but the same syntax and tools. For example, I might want to use this crate for some things, and another for those cases where I do want to match a recursive language. It helps to have as much commonality between them as possible. And of course string substitution cannot handle the recursive cases at all.

Of course, I would also like to see some regex crates that don’t embed the regexes in strings at all. Something akin to Emacs’ rx syntax would be really nice to have as well. Of course that is completely orthogonal to the question of which features the crate has.

@BurntSushi
Copy link
Member

I don't think using format! is beside the point of the feature request as it stand. It might be beside the point for the goal of increasing interoperability. While of course the regex crate syntax wants to be similar to other regex engines for obvious reasons, there are still many differences. And there is a ticket for documenting some of the more common ones, but it's actually not feasible to do so exhaustively because just about every regex engine has its own dialect. Even regex engines written by the same author with the same design goals have differences.

Basically what this means is that there has to be some "reasonableness" threshold at which point we say "no there are differences here and we aren't going to resolve them." I think this sort of complex syntax falls well beyond that reasonableness threshold for me personally.

@cosmicexplorer
Copy link
Contributor

cosmicexplorer commented Feb 25, 2024

I was looking through this repo to see whether anyone had translated from emacs regexp syntax using regex-syntax yet (my eventual goal is to produce an emacs native module). I'm pretty sure I could do that by transpiling the emacs regexp pattern string to a regex_syntax::ast::Ast, since the AST looks pretty simple, and emacs regexps don't have any special features: it pushes that to the language instead.

As mentioned by @db48x in #962 (comment), emacs provides a high-level rx sexp notation, but I think that actually aligns with @BurntSushi's framing of "using your programming language features to assemble the regex" from #962 (comment)! So I was hoping to demonstrate that regex-syntax makes that sort of thing possible without implementing that sort of higher-level abstraction itself directly (this is actually what makes it more generic/reusable)!

At first glance I think e.g. emacs-regexp-syntax can just be its own crate that depends on regex-syntax and generates a regex_syntax::ast::Ast, and I can propose upstreaming any extensions I need from the generic regex-syntax AST to this repo (although this seems unlikely to be necessary). I can then have another crate emacs-regexp which depends on regex-automata and exposes the emacs native module interface.

@BurntSushi is that how you would suggest architecting this if I wanted to interface other regexp interfaces to regex-syntax and co?

EDIT: started this in https://github.com/cosmicexplorer/emacs-regexp!

@BurntSushi
Copy link
Member

Thanks for the question! Although I'd love for these kinds of things to be new Discussion questions so that they can be curated and made discoverable for others.

@BurntSushi is that how you would suggest architecting this if I wanted to interface other regexp interfaces to regex-syntax and co?

Maybe? I've never done the exercise before, so I'm not sure. It's plausible the right target is regex-syntax::hir::Hir. Namely, an Hir is perhaps closer to what you might think of as a traditional AST where as an Ast is perhaps closer to a Concrete Syntax Tree. A concrete syntax tree essentially tries to describe the concrete syntax precisely so that it can be round-tripped. This makes it very detailed and possibly contains a lot of stuff you don't need to care about. (Regrettably, it isn't a true concrete syntax tree because of comments and whitespace in x mode. But it's close.)

Either target is reasonable IMO. With that said, if you need to deal with Unicode, then targeting the Ast potentially gives you some very big benefits. The Ast -> Hir conversion is what resolves things like [\p{Letter}&&\p{Greek}] into a single sequence of non-overlapping non-adjacent ranges of codepoints. That is a surprisingly non-trivial thing to do. If you only care about ASCII though, then you could feasibly target Hir without too much work.

@cosmicexplorer
Copy link
Contributor

Thanks for the question! Although I'd love for these kinds of things to be new Discussion questions so that they can be curated and made discoverable for others.

Done! #1167

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants