feat: add regexp_match_substring_all function to yaml #469

Merged
merged 2 commits into from
Mar 22, 2023

Conversation

richtia
Member

@richtia richtia commented Mar 9, 2023

BREAKING CHANGE: group argument added to regexp_match_substring function

Add regexp_match_substring_all function

Resolves #466

@richtia richtia requested review from vibhatha and westonpace March 9, 2023 18:40
@richtia
Member Author

richtia commented Mar 9, 2023

@ianmcook

@ianmcook
Contributor

ianmcook commented Mar 9, 2023

The existing implementation of regexp_match_substring takes the arguments input, pattern, position, occurrence.

The implementation of regexp_match_substring_all proposed here takes the arguments input, pattern, group.

I understand why there is no occurrence argument here. (Because it returns a list of all occurrences.)

But I do not understand:

  • Why is there no group argument in regexp_match_substring?
  • Why is there no position argument in regexp_match_substring_all?

@richtia
Member Author

richtia commented Mar 9, 2023

> The existing implementation of regexp_match_substring takes the arguments input, pattern, position, occurrence.
>
> The implementation of regexp_match_substring_all proposed here takes the arguments input, pattern, group.
>
> I understand why there is no occurrence argument here. (Because it returns a list of all occurrences.)
>
> But I do not understand:
>
>   • Why is there no group argument in regexp_match_substring?

I was wondering the same thing when I submitted this PR. I think this was the case because I was looking at other functions named REGEXP_SUBSTR and those didn't have the group argument (MySQL, Oracle). I can add the group parameter to the function in this PR though.

>   • Why is there no position argument in regexp_match_substring_all?

I wasn't sure whether or not to put the position argument for this function, since I felt like it defeated the purpose of the all part. I don't have a strong opinion either way though. I'm fine with adding it back into this function as well.

@ianmcook
Contributor

ianmcook commented Mar 9, 2023

> I wasn't sure whether or not to put the position argument for this function, since I felt like it defeated the purpose of the all part. I don't have a strong opinion either way though. I'm fine with adding it back into this function as well.

IIUC, the position argument is still relevant here, even with all. The occurrence argument is the one that is not needed with all.

I think we should aim for these two functions regexp_match_substring and regexp_match_substring_all and their documentation to be identical in all respects except that the former returns one occurrence and the latter returns all occurrences.

@richtia
Member Author

richtia commented Mar 9, 2023

> IIUC, the position argument is still relevant here, even with all. The occurrence argument is the one that is not needed with all.
>
> I think we should aim for these two functions regexp_match_substring and regexp_match_substring_all and their documentation to be identical in all respects except that the former returns one occurrence and the latter returns all occurrences.

Sounds good, I've updated both.
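
For readers following the thread, the relationship agreed on above (same arguments, except that the `_all` variant drops `occurrence` and returns every match) can be sketched in Python. This is only an illustration built on Python's `re` module, not the Substrait definition or any engine's implementation. The function bodies, the defaults for `position` and `occurrence`, and the choice to return `None` when the requested occurrence does not exist are assumptions of the sketch.

```python
import re


def regexp_match_substring(text, pattern, position=1, occurrence=1, group=1):
    """Return one occurrence: the `occurrence`-th match at or after `position`.

    `position` and `occurrence` are 1-based; `group` 0 means the whole match.
    """
    matches = list(re.compile(pattern).finditer(text, position - 1))
    if occurrence > len(matches):
        # Assumption of this sketch: a missing occurrence yields None.
        return None
    return matches[occurrence - 1].group(group)


def regexp_match_substring_all(text, pattern, position=1, group=1):
    """Return all occurrences at or after `position`, so no `occurrence` argument."""
    return [m.group(group) for m in re.compile(pattern).finditer(text, position - 1)]


text = "a1 b22 c333"
pattern = r"([a-z])(\d+)"

first_letter = regexp_match_substring(text, pattern)
second_digits = regexp_match_substring(text, pattern, occurrence=2, group=2)
all_whole = regexp_match_substring_all(text, pattern, group=0)
digits_from_3 = regexp_match_substring_all(text, pattern, position=3, group=2)
```

Under these assumptions, `position` still matters for the `_all` variant because it controls where the search starts, while `occurrence` has nothing left to select.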

on. The `position` argument should be a positive non-zero integer. The regular
expression capture group can be specified using the `group` argument. Specifying `0`
will match the entire regular expression. The default group value is `1`, which means the
first capture group will be matched. The `group` argument should be a non-negative integer.
Contributor

@ianmcook ianmcook Mar 11, 2023

@richtia thank you for adding this group argument to the existing regexp_match_substring function. This creates better correspondence between this existing function and the new regexp_match_substring_all function added here. But to be clear, it's not only for the sake of completeness. Multiple real-world functions that extract a single regex match, including R's stringr::str_extract() and Snowflake's REGEXP_SUBSTR, accept an optional group number argument which allows the user to specify which capture group to return. So this is a practical addition.

Is it considered a breaking change when we add a new argument like this? I forget whether or not Substrait technically allows optional arguments or whether any new argument counts as a breaking change.
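
As a small illustration of the `group` semantics quoted from the YAML above (`0` selects the entire match, `1` is the default and selects the first capture group), here is a sketch using Python's `re` module. The helper name `first_match_group` and the sample string are hypothetical and exist only for this example.

```python
import re


def first_match_group(text, pattern, group=1):
    """Return the requested capture group of the first match, or None if no match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group(group)


sample = "released 2023-03-22"
date_pattern = r"(\d{4})-(\d{2})-(\d{2})"

whole = first_match_group(sample, date_pattern, group=0)  # entire regular expression
year = first_match_group(sample, date_pattern)            # default group 1
day = first_match_group(sample, date_pattern, group=3)    # third capture group
```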

Member Author

I think it would be considered a breaking change. I'm not 100% sure though. @westonpace

Member

Yes, adding a new positional argument is a breaking change (adding a new option is not).
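
One way to see why this is the case is a rough sketch that assumes a consumer resolves functions by name plus the types of the positional arguments, while options are passed by name and do not take part in that lookup. The registry, the `resolve` helper, and the type short names below are illustrative assumptions, not the actual Substrait or consumer code.

```python
def compound_name(name, arg_types):
    """Build a lookup key from the function name and its positional argument types."""
    return name + ":" + "_".join(arg_types)


# Illustrative type short names; the real YAML may declare different types.
OLD_ARGS = ["str", "str", "i64", "i64"]         # input, pattern, position, occurrence
NEW_ARGS = ["str", "str", "i64", "i64", "i64"]  # same, plus group

old_signature = compound_name("regexp_match_substring", OLD_ARGS)
new_signature = compound_name("regexp_match_substring", NEW_ARGS)

# A hypothetical consumer that only knows the updated definition.
registry = {new_signature: "five-argument implementation"}


def resolve(signature, options=None):
    """Look up an implementation; named options never enter the lookup key."""
    return registry.get(signature)


old_plan_result = resolve(old_signature)
new_plan_result = resolve(new_signature, options={"case_sensitivity": "CASE_SENSITIVE"})
```

In this sketch a plan written against the four-argument signature no longer resolves, whereas adding or changing an option leaves the lookup key untouched.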

Member

@westonpace westonpace left a comment

I'm not sure I understand yet but I haven't looked at the regex functions before so it's likely just ignorance on my part.

Comment on lines 165 to 167
Behavior is undefined if the regex fails to compile, the occurrence value is out of range, or
the position value is out of range.
Member

Why are these things undefined behaviors and not errors?

Contributor

FWIW the existing regexp functions in functions_string.yaml use this same kind of boilerplate "Behavior is undefined" language.

Member Author

Yeah, I think that's just the decision we made during the original PR. But if erroring out is preferred, we can go that route too.
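
For comparison, a minimal sketch of what the "erroring out" alternative might look like in an implementation, written in Python. The helper name `validate_regexp_args` is hypothetical, and the interpretation of "out of range" (position between 1 and the input length, group between 0 and the number of capture groups in the pattern) is an assumption of this sketch; the YAML text does not define it.

```python
import re


def validate_regexp_args(text, pattern, position=1, group=1):
    """Raise ValueError instead of leaving invalid arguments undefined.

    Returns the compiled pattern when all checks pass.
    """
    try:
        compiled = re.compile(pattern)
    except re.error as exc:
        raise ValueError(f"regex failed to compile: {exc}") from exc

    # Assumed range: 1-based position must point inside the input.
    if not 1 <= position <= len(text):
        raise ValueError(f"position {position} is out of range")

    # Assumed range: group 0 is the whole match, up to the pattern's group count.
    if not 0 <= group <= compiled.groups:
        raise ValueError(f"group {group} is out of range")

    return compiled
```

With the "Behavior is undefined" wording, an engine is free to do this, return null, or do something else entirely.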

@richtia richtia force-pushed the regexp_match_substring_all branch 2 times, most recently from f64048e to f5973e8 Compare March 14, 2023 19:32
@richtia richtia requested a review from westonpace March 15, 2023 18:30
Member

@westonpace westonpace left a comment

One new question and there is one outstanding (related to undefined behaviors). I've resolved the rest of my comments for clarity.

Comment on lines 166 to 167
Behavior is undefined if the regex fails to compile, the occurrence value is out of range, or
the position value is out of range.
Contributor

Suggested change
- Behavior is undefined if the regex fails to compile, the occurrence value is out of range, or
- the position value is out of range.
+ Behavior is undefined if the regex fails to compile, the occurrence value is out of range,
+ the position value is out of range, or the group value is out of range.

Contributor

@richtia could you please also change the "Behavior is undefined" sentence for the regexp_match_substring function to mention the group value?

Member Author

Done!

@richtia richtia force-pushed the regexp_match_substring_all branch from f5973e8 to 10f9d30 Compare March 21, 2023 19:21
fix: add group to match_substring and position to match_substring_all

docs: update descriptions

docs: update description

docs: update undefined behavior in match substring
@richtia richtia force-pushed the regexp_match_substring_all branch from 10f9d30 to 6440272 Compare March 21, 2023 19:21
@ianmcook
Contributor

BREAKING CHANGE: adding a new positional argument

@richtia can you please change the PR description to say:

BREAKING CHANGE: group argument added to regexp_match_substring function

Comment on lines 166 to 167
Behavior is undefined if the regex fails to compile, the occurrence value is out of range,
the position value is out of range, or the group value is out of range.
Contributor

Oops, there is no occurrence value for regexp_match_substring_all

Suggested change
- Behavior is undefined if the regex fails to compile, the occurrence value is out of range,
- the position value is out of range, or the group value is out of range.
+ Behavior is undefined if the regex fails to compile, the position value is out of range,
+ or the group value is out of range.

Member Author

Oops. Fixed!

Member

@westonpace westonpace left a comment

I'm good. @ianmcook let me know when you're content.

@ianmcook
Contributor

It looks good to me

@westonpace
Member

+1

Since this is now a breaking extension function change (adding new positional arguments is always a breaking change) it requires two SMC votes.

@cpcloud / @jacques-n ?

@ianmcook
Contributor

FYI: DuckDB just added a regexp_extract_all function in duckdb/duckdb#6685

Contributor

@cpcloud cpcloud left a comment

LGTM!

@richtia richtia merged commit b4d81fb into substrait-io:main Mar 22, 2023